
Conversation

@elacuesta
Member

Closes #10, closes #102.

I'm still not entirely sure why the issues mentioned above occur; however, this should provide a reasonable way for the crawl to proceed without errors in such cases.

From the upstream docs on Page.goto:

The method either throws an error or returns a main resource response. The only exceptions are navigation to about:blank or navigation to the same URL with a different hash, which would succeed and return null.
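
For illustration, a standalone sketch (not this PR's handler code; the helper name is hypothetical) of the case the handler now has to tolerate:

```python
import asyncio
import logging

from playwright.async_api import async_playwright

logger = logging.getLogger(__name__)


async def navigate(url: str) -> str:
    # Hypothetical helper, not scrapy-playwright code: shows that
    # Page.goto can legitimately return None.
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        response = await page.goto(url)
        if response is None:
            # about:blank and same-URL-with-different-hash navigations
            # succeed but yield no Response object.
            logger.warning("Navigating to %s returned no response", url)
            body = await page.content()
        else:
            body = await response.text()
        await browser.close()
        return body


print(asyncio.run(navigate("about:blank")))
```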

@codecov

codecov bot commented Jul 31, 2022

Codecov Report

Merging #113 (b4bd0e8) into master (253d7bd) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #113   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          337       340    +3     
=========================================
+ Hits           337       340    +3     
Impacted Files                 Coverage Δ
scrapy_playwright/handler.py   100.00% <100.00%> (ø)


@elacuesta force-pushed the handle-goto-returning-none branch from b4bd0e8 to 6103f0f on July 31, 2022 18:11
@elacuesta force-pushed the handle-goto-returning-none branch from 6103f0f to 4f622b6 on July 31, 2022 18:40
@elacuesta merged commit 5f54ec4 into master on Aug 1, 2022
@elacuesta deleted the handle-goto-returning-none branch on August 1, 2022 14:37
@auxsvr

auxsvr commented Aug 2, 2022

The documentation seems to be problematic in this respect: page.goto can return None on network errors. Also, if page.goto fails, _download_request_with_page should close the page.

@elacuesta
Member Author

elacuesta commented Aug 2, 2022

I'm not sure I get what you mean. The upstream docs clearly say that goto does NOT return None on network errors; it raises the error instead. Are you saying that, in practice, goto can return None when there are network errors? If so, can you provide an example to reproduce the situation?

The page is still being closed here, and the exception re-raised to be picked up by Scrapy's exception handling mechanism.
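
In outline (a sketch with a hypothetical helper name, not the handler's literal code):

```python
import scrapy
from playwright.async_api import Page


async def goto_and_close_on_error(page: Page, request: scrapy.Request):
    """Hypothetical helper sketching the pattern: navigate, and on
    failure close the page before re-raising, so Scrapy's exception
    handling (e.g. a request errback) picks up the error."""
    try:
        return await page.goto(request.url)
    except Exception:
        if not page.is_closed():
            await page.close()
        raise
```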

@auxsvr

auxsvr commented Aug 4, 2022

Perhaps this was due to scrapy_playwright rather than playwright; it's been a few months and I don't recall the details. Some errors would open the browser's empty page, for which goto returns None. I can't reproduce this at the moment.

And when request.meta.get("playwright_include_page") is True, shouldn't the page be closed then? If I remember correctly, there are cases where errors halt the crawl because pages are never closed. In my code, I use a callback on the response event to close every page whose response has not response.ok.
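
Roughly like this (a sketch, not my exact code; limiting the check to the main navigation request is one way to avoid closing on failed subresources):

```python
from playwright.async_api import Page, Response


def close_page_on_failed_response(page: Page) -> None:
    # Sketch: close the page when its main document response fails.
    async def on_response(response: Response) -> None:
        if response.request.is_navigation_request() and not response.ok:
            await page.close()

    page.on("response", on_response)
```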

@elacuesta
Member Author

elacuesta commented Aug 4, 2022

No, when passing playwright_include_page=True the page is added to the request.meta dictionary and it becomes the spider's responsibility to close it. This could be done in a callback or in an errback if there are errors. From the readme:

When passing playwright_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if playwright_include_page=False or unset, pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and crawls could freeze if the limit is reached and pages remain open indefinitely.
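
Concretely, an errback that closes the page looks something like this (hypothetical spider name and URL):

```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
            errback=self.errback_close_page,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... interact with the page, then close it ...
        await page.close()

    async def errback_close_page(self, failure):
        # Close the page even if the request failed.
        page = failure.request.meta["playwright_page"]
        await page.close()
```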
