GoTo returns None for certain sites (never the first page) #115

@AlvinSartorTrityum

Hi! I have a spider that uses Playwright with a proxy.
NOTE: the spider works as expected when no proxy is needed, and the proxy itself works, since the first page is scraped correctly.

This is what happens:

  • the first page is scraped: I see the ************* RESPONSE ************* log, so parse_item is hit once
  • links are extracted and set_playwright_true is called (the list of links is logged)
  • errors are raised for the followed links: 'NoneType' object has no attribute 'all_headers'
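
For reference, the error in the last bullet is what Python raises when `all_headers` is accessed on `None`. Playwright's `Page.goto` is documented as returning an optional response, so a `None` result is possible in some cases. Whether that is what happens behind the proxy here is an assumption, not something confirmed in this report. The sketch below reproduces the failure mode without a browser: `FakeResponse`, `safe_all_headers` and `unguarded_all_headers` are hypothetical names used only for illustration, not part of scrapy-playwright or Playwright.

```python
import asyncio
from typing import Dict, Optional


class FakeResponse:
    """Hypothetical stand-in for a Playwright response (not the real class)."""

    def __init__(self, headers: Dict[str, str]) -> None:
        self._headers = headers

    async def all_headers(self) -> Dict[str, str]:
        # Return a copy so callers cannot mutate the stored headers.
        return dict(self._headers)


async def unguarded_all_headers(response: Optional[FakeResponse]) -> Dict[str, str]:
    # Mirrors the failing pattern: no None check before the attribute access,
    # so a None response raises AttributeError.
    return await response.all_headers()


async def safe_all_headers(response: Optional[FakeResponse]) -> Dict[str, str]:
    # Guarded variant: treat a missing response as "no headers available".
    if response is None:
        return {}
    return await response.all_headers()


if __name__ == "__main__":
    print(asyncio.run(safe_all_headers(FakeResponse({"content-type": "text/html"}))))
    print(asyncio.run(safe_all_headers(None)))
    try:
        asyncio.run(unguarded_all_headers(None))
    except AttributeError as exc:
        print(exc)
```

A guard like this only avoids the exception. It does not explain why navigation produced no response in the first place, which is the question asked below.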

It seems similar to #10 and #102, and I saw that a fix was merged in #113.

When will the fix be released? Will it actually resolve this problem, or will it only prevent the error from being raised?
Any idea why using the proxy causes this exception?

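import logging
from typing import Any, Dict, Iterator, List

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):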
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []

    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://192.0.0.1:12345",
                "username": "username",
                "password": "password",
            },
        },
    }

    def __init__(self, **kwargs: Any):
        # ...
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback=self.parse_item,
                process_request=self.set_playwright_true,
                follow=True,
            ),
        )
        # ...
        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        yield Request(self.start_urls[0], meta=self.playwright_meta)

    def set_playwright_true(self, request: Request, response: Response):
        self.log("%s => %s " % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        self.log("************* RESPONSE *************", logging.INFO)
        return {
            # ...
        }
