
Support for async callbacks #4978

Merged: 59 commits into scrapy:master from asyncio-parse-asyncgen-proper-rebased (Jul 29, 2022)

Conversation

@wRAR (Member) commented Feb 5, 2021

This PR adds support for defining async def parse_* callbacks and using both yield and await inside them, with or without the asyncio reactor. There is one caveat: if the asyncio reactor is enabled, you cannot await on Deferreds directly and need to wrap them. I've added this to the docs.
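
For readers skimming the thread, here is a minimal sketch of what such a callback can look like. It assumes the maybe_deferred_to_future() wrapper from scrapy.utils.defer as the way to wrap a Deferred under the asyncio reactor (the exact helper recommended by the docs added in this PR may differ); the spider name and URL are illustrative.

```python
import scrapy
from scrapy.utils.defer import maybe_deferred_to_future


class AsyncCallbackSpider(scrapy.Spider):
    # Illustrative spider, not part of the PR itself.
    name = "async_callback_example"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # Import the reactor lazily so this module does not install the
        # default reactor before Scrapy installs the configured one.
        from twisted.internet import reactor
        from twisted.internet.task import deferLater

        # deferLater() returns a Deferred; with the asyncio reactor enabled
        # it has to be wrapped before it can be awaited.
        await maybe_deferred_to_future(deferLater(reactor, 1.0, lambda: None))
        yield {"url": response.url}
```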

This change was tested on two real-world projects; for one of them I also replaced the use of inline_requests for getting HTTP responses inside a callback with either treq or aiohttp. I didn't test the performance, but as far as I can see the existing non-async callbacks shouldn't have any significant code path changes, so there should be no regressions; if there are any performance problems with async callbacks, we can solve them later.
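
As an illustration of that replacement, a hedged sketch of fetching an auxiliary HTTP response with aiohttp inside an async callback (this assumes the asyncio reactor is enabled; the URL and item fields are made up and are not taken from the real-world projects mentioned above):

```python
import aiohttp
import scrapy


class ApiEnrichSpider(scrapy.Spider):
    # Illustrative spider; the API URL and item fields are invented.
    name = "aiohttp_enrich_example"
    start_urls = ["https://example.com"]
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    async def parse(self, response):
        # Fetch an auxiliary HTTP response with aiohttp instead of
        # inline_requests; this runs on the asyncio event loop.
        async with aiohttp.ClientSession() as session:
            async with session.get("https://example.com/api/info") as api_response:
                payload = await api_response.json()
        yield {"url": response.url, "extra": payload}
```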

This doesn't require or support async def start_requests.

My main concern with this is the large and non-intuitive scrapy.utils.defer._AsyncCooperatorAdapter. I would like to replace it with something better if that's possible, but for now it seems to work correctly. I've tried to document it as comprehensively as possible.

I've split the work into several commits with different features in case this helps someone with the review.

@codecov bot commented Feb 5, 2021

Codecov Report

Merging #4978 (8659bf4) into master (737f396) will increase coverage by 0.14%.
The diff coverage is 98.44%.

❗ Current head 8659bf4 differs from pull request most recent head c7b90c6. Consider uploading reports for the commit c7b90c6 to get more accurate results

@@            Coverage Diff             @@
##           master    #4978      +/-   ##
==========================================
+ Coverage   88.70%   88.85%   +0.14%     
==========================================
  Files         162      162              
  Lines       10787    10962     +175     
  Branches     1851     1894      +43     
==========================================
+ Hits         9569     9740     +171     
  Misses        942      942              
- Partials      276      280       +4     
Impacted Files Coverage Δ
scrapy/spidermiddlewares/urllength.py 88.00% <90.90%> (-2.48%) ⬇️
scrapy/spidermiddlewares/offsite.py 98.41% <94.11%> (-1.59%) ⬇️
scrapy/spidermiddlewares/depth.py 97.67% <95.45%> (-2.33%) ⬇️
scrapy/core/spidermw.py 99.44% <99.08%> (-0.56%) ⬇️
scrapy/core/scraper.py 85.79% <100.00%> (+0.24%) ⬆️
scrapy/middleware.py 92.85% <100.00%> (ø)
scrapy/spidermiddlewares/referer.py 93.24% <100.00%> (+0.13%) ⬆️
scrapy/utils/asyncgen.py 100.00% <100.00%> (ø)
scrapy/utils/defer.py 97.29% <100.00%> (+1.33%) ⬆️
scrapy/utils/python.py 88.58% <100.00%> (+0.93%) ⬆️
... and 2 more

@wRAR (Member, Author) commented Feb 11, 2021

Not sure why some related tests fail on the 3.6 asyncio-pinned environment; I cannot reproduce this locally at the moment.

@wRAR (Member, Author) commented Feb 11, 2021

Now I can reproduce them on a freshly built 3.6.12, whereas it worked on my old 3.6.9, both from pyenv.

@wRAR (Member, Author) commented Feb 12, 2021

It's Twisted 17.9.0-specific again.

@wRAR (Member, Author) commented Feb 15, 2021

pytest-twisted installs the asyncio reactor with eventloop=None, so on Twisted 17.9.0 it uses new_event_loop(). We added a workaround for this in Scrapy itself, in #4841 / #4872, but it's much harder to do for pytest-twisted unless they add some support for it (or we drop 17.9.0). It's probably possible to install the reactor in conftest.py, though.
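
A hedged sketch of what that conftest.py workaround could look like: installing the asyncio reactor ourselves with an explicit event loop before pytest-twisted gets a chance to, so that Twisted 17.9.0 does not fall back to new_event_loop(). This is an assumption about the fix, not something verified in this PR.

```python
# conftest.py (sketch): install the asyncio reactor with an explicit
# event loop, before pytest-twisted installs it with eventloop=None.
import asyncio

from twisted.internet import asyncioreactor

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
asyncioreactor.install(eventloop=loop)
```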

@Gallaecio (Member) left a comment

Thanks for the separate commits; they definitely made the review much more approachable.

I think it may be best to add tests for parallel_async that ensure complete code coverage for _AsyncCooperatorAdapter, especially given its complexity (I won't claim to fully understand it).

I love the documentation entry on how to handle waiting on Deferreds now! When releasing 2.5, we should probably remember to add a brief mention of this to the release notes as a backward-incompatible change (on an experimental feature, but still), linking to the section for further details.

@Gallaecio Gallaecio modified the milestones: 2.6, 2.7 Mar 4, 2022
@Gallaecio Gallaecio self-requested a review March 15, 2022 14:04
@Gallaecio Gallaecio self-requested a review March 16, 2022 20:49
@Gallaecio (Member) left a comment

I am proposing some changes documentation-wise, but otherwise this looks great to me.

@kmike (Member) left a comment

Looks great @wRAR! +1 to merge, after merging in the cleanup by @Gallaecio.

I only have a superficial understanding of how the code works, but what I understood makes total sense to me :) The tests also make sense, and the overall approach looks right.

@auxsvr (Contributor) commented May 26, 2022

Shouldn't we also update CrawlSpider._parse_response, CrawlSpider._requests_to_follow, LxmlParserLinkExtractor._extract_links, etc. to support coroutines and async generators?

@wRAR (Member, Author) commented May 26, 2022

@auxsvr On the one hand, we haven't thought about this, and doing it shouldn't be necessary to merge this feature, as existing CrawlSpider-based spiders should continue to work and this support can be added separately (if you suspect this is not true, please voice your suspicions). On the other hand, it should be possible to convert any CrawlSpider-based spider to a plain one if you want to use async callbacks.
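
For illustration, a hedged sketch of such a conversion: a plain Spider whose async callback does the link following that a CrawlSpider Rule would otherwise handle (the spider name, URL and allow pattern are made up):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class ConvertedSpider(scrapy.Spider):
    # Illustrative conversion of a CrawlSpider to a plain Spider.
    name = "converted_example"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # Follow the links that a CrawlSpider Rule used to handle.
        for link in LinkExtractor(allow=r"/item/").extract_links(response):
            yield response.follow(link, callback=self.parse_item)

    async def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```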

@kmike (Member) commented Jul 27, 2022

A fix for the typing issues: wRAR#2 (and wRAR@56e2eea specifically)

@Gallaecio (Member) left a comment

Overwhelmingly amazing!

Outdated review threads (resolved): docs/topics/components.rst, docs/index.rst, scrapy/spidermiddlewares/depth.py
wRAR and others added 3 commits on July 27, 2022, co-authored by Adrián Chaves <adrian@chaves.io>.
@kmike (Member) commented Jul 29, 2022

Tremendous work, thanks @wRAR!

@kmike kmike merged commit 52d9349 into scrapy:master Jul 29, 2022
@wRAR wRAR deleted the asyncio-parse-asyncgen-proper-rebased branch May 8, 2023 10:28