
Conversation

@elacuesta
Member

Closes #10, closes #102.

I'm still not entirely sure why the issues mentioned above occur; however, this should provide a reasonable way for the crawl to proceed without errors in such cases.

From the upstream docs on Page.goto:

The method either throws an error or returns a main resource response. The only exceptions are navigation to about:blank or navigation to the same URL with a different hash, which would succeed and return null.
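
For illustration, a standalone sketch (not this PR's handler code; the helper name is hypothetical) of the case the handler now has to tolerate:

```python
import asyncio
import logging

from playwright.async_api import async_playwright

logger = logging.getLogger(__name__)


async def navigate(url: str) -> str:
    # Hypothetical helper, not scrapy-playwright code: shows that
    # Page.goto can legitimately return None.
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        response = await page.goto(url)
        if response is None:
            # about:blank and same-URL-with-different-hash navigations
            # succeed but yield no Response object.
            logger.warning("Navigating to %s returned no response", url)
            body = await page.content()
        else:
            body = await response.text()
        await browser.close()
        return body


print(asyncio.run(navigate("about:blank")))
```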

@codecov

codecov bot commented Jul 31, 2022

Codecov Report

Merging #113 (b4bd0e8) into master (253d7bd) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #113   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          337       340    +3     
=========================================
+ Hits           337       340    +3     
Impacted Files                 Coverage Δ
scrapy_playwright/handler.py   100.00% <100.00%> (ø)


@elacuesta force-pushed the handle-goto-returning-none branch from b4bd0e8 to 6103f0f on July 31, 2022 18:11
@elacuesta force-pushed the handle-goto-returning-none branch from 6103f0f to 4f622b6 on July 31, 2022 18:40
@elacuesta merged commit 5f54ec4 into master on Aug 1, 2022
@elacuesta deleted the handle-goto-returning-none branch on August 1, 2022 14:37
@auxsvr

auxsvr commented Aug 2, 2022

The documentation seems to be problematic in this respect: page.goto can return None on network errors. Also, if page.goto fails, _download_request_with_page should close the page.

@elacuesta
Member Author

elacuesta commented Aug 2, 2022

I'm not sure I get what you mean. The upstream docs clearly say that goto does NOT return None on network errors; it raises the error instead. Are you saying that, in practice, goto can return None when there are network errors? If so, can you provide an example to reproduce the situation?

The page is still being closed here, and the exception re-raised to be picked up by Scrapy's exception handling mechanism.
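
In outline (a sketch with a hypothetical helper name, not the handler's literal code):

```python
import scrapy
from playwright.async_api import Page


async def goto_and_close_on_error(page: Page, request: scrapy.Request):
    """Hypothetical helper sketching the pattern: navigate, and on
    failure close the page before re-raising, so Scrapy's exception
    handling (e.g. a request errback) picks up the error."""
    try:
        return await page.goto(request.url)
    except Exception:
        if not page.is_closed():
            await page.close()
        raise
```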

@auxsvr

auxsvr commented Aug 4, 2022

Perhaps this was due to scrapy_playwright rather than playwright; it's been a few months and I don't recall the details. Some errors would open the browser's empty page, for which goto returns None. I can't reproduce this at the moment.

And when request.meta.get("playwright_include_page") is True, shouldn't the page be closed then? If I remember correctly, there are cases where errors halt the crawl because pages are never closed. In my code, I use a callback on the response event to close every page whose response has not response.ok.
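
Roughly like this (a sketch, not my exact code; limiting the check to the main navigation request is one way to avoid closing on failed subresources):

```python
from playwright.async_api import Page, Response


def close_page_on_failed_response(page: Page) -> None:
    # Sketch: close the page when its main document response fails.
    async def on_response(response: Response) -> None:
        if response.request.is_navigation_request() and not response.ok:
            await page.close()

    page.on("response", on_response)
```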

@elacuesta
Member Author

elacuesta commented Aug 4, 2022

No, when passing playwright_include_page=True the page is added to the request.meta dictionary and it becomes the spider's responsibility to close it. This could be done in a callback or in an errback if there are errors. From the readme:

When passing playwright_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if playwright_include_page=False or unset, pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and crawls could freeze if the limit is reached and pages remain open indefinitely.
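
Concretely, an errback that closes the page looks something like this (hypothetical spider name and URL):

```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
            errback=self.errback_close_page,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... interact with the page, then close it ...
        await page.close()

    async def errback_close_page(self, failure):
        # Close the page even if the request failed.
        page = failure.request.meta["playwright_page"]
        await page.close()
```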
