CrawlSpider: add support for async callbacks #5657
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##           master    #5657     +/-  ##
========================================
  Coverage   88.83%   88.84%
========================================
  Files         162      162
  Lines       10964    10969      +5
  Branches     1894     1646    -248
========================================
+ Hits         9740     9745      +5
  Misses        943      943
  Partials      281      281
```
```python
async def parse_async(self, response, foo=None):
    self.logger.info('[parse_async] status %i (foo: %s)', response.status, foo)
    return Request(self.mockserver.url("/status?n=202"), self.parse_async, cb_kwargs={"foo": "bar"})
```
Could you please also add a test for an async generator callback?
Great point, I missed that. Updated, thanks for the suggestion!
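For illustration, an async generator callback would have roughly the following shape: it yields results one at a time instead of returning a single value. This is a hypothetical standalone sketch, not the test added in the PR; `Response` here is a stand-in stub, and `parse_async_gen` is an invented name mirroring the `parse_async` example above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Response:
    """Stand-in stub for a response object; not Scrapy's Response."""
    status: int


async def parse_async_gen(response, foo=None):
    # An async generator callback yields items one by one
    yield {"status": response.status, "foo": foo}
    yield {"status": response.status + 2, "foo": foo}


async def main():
    # Consume the async generator the way a caller would
    return [item async for item in parse_async_gen(Response(200), foo="bar")]


print(asyncio.run(main()))
```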
CI errors on Windows are unrelated, see #5633
Nice!
```python
if callback:
    cb_res = callback(response, **cb_kwargs) or ()
    if isinstance(cb_res, AsyncIterable):
        cb_res = await collect_asyncgen(cb_res)
```
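For context, a `collect_asyncgen` helper like the one awaited above simply drains an async iterable into a plain list. Scrapy ships a similar utility; this standalone version is an illustrative sketch, not the exact implementation.

```python
import asyncio


async def collect_asyncgen(result):
    # Drain an async iterable into an ordinary list
    results = []
    async for item in result:
        results.append(item)
    return results


async def demo():
    async def gen():
        for i in range(3):
            yield i

    return await collect_asyncgen(gen())


print(asyncio.run(demo()))  # [0, 1, 2]
```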
Should we document that the behavior of async callbacks in CrawlSpider differs from that of regular spiders?
If I'm not mistaken, the behavior here is similar to Scrapy < 2.7, where the complete output of a callback is awaited before processing starts, while after #4978 the output (items, requests) is processed as soon as it is sent.
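The difference described here can be shown with a minimal asyncio example, no Scrapy involved: "buffered" drains the whole async generator before processing anything (the CrawlSpider path above), while "streaming" processes each item as soon as it is yielded (regular spiders since #4978). The function names and the event log are illustrative only.

```python
import asyncio


async def callback(log):
    # A toy async generator callback that records when it yields
    for i in range(2):
        log.append(("yield", i))
        yield i


async def buffered():
    # CrawlSpider-style: collect the complete output first, then process
    log = []
    items = [i async for i in callback(log)]
    for i in items:
        log.append(("process", i))
    return log


async def streaming():
    # Regular-spider-style: process each item as soon as it is yielded
    log = []
    async for i in callback(log):
        log.append(("process", i))
    return log


print(asyncio.run(buffered()))   # all yields happen before any processing
print(asyncio.run(streaming()))  # yield and process events interleave
```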
Makes sense, thanks for pointing it out. Note that this is consistent with CrawlSpider.process_results, which is documented to receive a list (although those docs only mention it in the context of XMLFeedSpider).
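Since `process_results` receives the complete list of callback results, an override can filter or post-process them in one pass. This is a hypothetical sketch using a plain stand-in class rather than a real CrawlSpider subclass; only the `process_results` hook shape is being illustrated.

```python
class CrawlSpiderLike:
    """Stand-in for CrawlSpider; only process_results matters here."""

    def process_results(self, response, results):
        # results arrives as a plain list, even when the callback was async;
        # drop empty results before they are scheduled/yielded
        return [r for r in results if r is not None]


spider = CrawlSpiderLike()
print(spider.process_results(None, [1, None, 2]))  # [1, 2]
```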
I stumbled upon this after seeing reports of people getting errors when trying to use CrawlSpider in combination with scrapy-playwright (scrapy-plugins/scrapy-playwright#110 (comment))