Closed as not planned
Hi! I have a spider that uses playwright with a proxy.
NOTE: the spider works as it should when the proxy is not needed, and the proxy itself works, since the first page is scraped correctly.
This is what happens:
- the first page is scraped: I see the `************* RESPONSE *************` log, so `parse_item` is hit once
- links are extracted and `set_playwright_true` is called (the list of links is logged)
- errors are raised: `'NoneType' object has no attribute 'all_headers'`
It seems similar to #10 and #102, and I saw that a fix has been merged with #113.
When will the fix be released in the next version? Will it resolve this, or will it just prevent the error from being raised?
Any idea why using the proxy causes this exception?
```python
import logging
from typing import Any, Dict, Iterator, List

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []
    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://192.0.0.1:12345",
                "username": "username",
                "password": "password",
            },
        },
    }

    def __init__(self, **kwargs: Any):
        # ...
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback=self.parse_item,
                process_request=self.set_playwright_true,
                follow=True,
            ),
        )
        # ...
        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        yield Request(self.start_urls[0], meta=self.playwright_meta)

    def set_playwright_true(self, request: Request, response: Response):
        self.log("%s => %s" % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        self.log("************* RESPONSE *************", logging.INFO)
        return {
            # ...
        }
```
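In case it helps narrow things down: to check whether the launch-level proxy is the trigger, the same proxy can also be supplied at the browser-context level through the `PLAYWRIGHT_CONTEXTS` setting (whose values are passed to `browser.new_context()`). This is an untested sketch with placeholder credentials, not something I have confirmed fixes the error:

```python
# Sketch: move the proxy from PLAYWRIGHT_LAUNCH_OPTIONS to a browser context.
# The "default" context is the one scrapy-playwright uses when a request
# does not name a context explicitly; server/username/password are placeholders.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "proxy": {
            "server": "http://192.0.0.1:12345",
            "username": "username",
            "password": "password",
        },
    },
}
```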