
CrawlSpider: add support for async callbacks #5657

Merged

Conversation

elacuesta
Member

@elacuesta elacuesta commented Oct 2, 2022

I stumbled upon this after seeing reports of people getting errors when trying to use the CrawlSpider in combination with scrapy-playwright (scrapy-plugins/scrapy-playwright#110 (comment))
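For context, the failure mode behind those reports can be sketched in plain Python, no Scrapy install required: roughly, CrawlSpider called the callback and iterated over its return value directly, but calling an `async def` callback produces a coroutine, which is not iterable. The names here (`async_callback`, the fake response string) are illustrative only:

```python
import asyncio
from collections.abc import AsyncIterable, Iterable

# Hypothetical async callback; the string stands in for a real Response.
async def async_callback(response):
    return [{"url": response}]

result = async_callback("https://example.com/")
# Calling an async def function returns a coroutine, which is neither a
# regular nor an async iterable, so code that iterates the callback's
# output directly breaks:
assert not isinstance(result, Iterable)
assert not isinstance(result, AsyncIterable)
# The fix is to await the coroutine first, then process its result:
items = asyncio.run(result)
print(items)  # [{'url': 'https://example.com/'}]
```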

@codecov

codecov bot commented Oct 2, 2022

Codecov Report

Merging #5657 (da8f915) into master (1445ebd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master    #5657    +/-   ##
========================================
  Coverage   88.83%   88.84%            
========================================
  Files         162      162            
  Lines       10964    10969     +5     
  Branches     1894     1646   -248     
========================================
+ Hits         9740     9745     +5     
  Misses        943      943            
  Partials      281      281            
Impacted Files Coverage Δ
scrapy/spiders/crawl.py 94.31% <100.00%> (+0.34%) ⬆️
scrapy/utils/asyncgen.py 100.00% <100.00%> (ø)


async def parse_async(self, response, foo=None):
    self.logger.info('[parse_async] status %i (foo: %s)', response.status, foo)
    return Request(self.mockserver.url("/status?n=202"), self.parse_async, cb_kwargs={"foo": "bar"})
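To keep the distinction concrete, here is a minimal, Scrapy-free sketch of the two async callback flavors involved in this PR: a coroutine callback (as in the test above) and an async-generator callback. The `response` dict is a hypothetical stand-in for a real Response object:

```python
import asyncio
from collections.abc import AsyncIterable

async def parse_async(response):
    # Coroutine callback: produces its whole result once awaited.
    return {"status": response["status"]}

async def parse_async_gen(response):
    # Async-generator callback: yields results one at a time.
    yield {"status": response["status"]}
    yield {"foo": "bar"}

async def main():
    response = {"status": 202}
    single = await parse_async(response)
    gen = parse_async_gen(response)
    assert isinstance(gen, AsyncIterable)  # true only for the generator form
    collected = [item async for item in gen]
    return single, collected

single, collected = asyncio.run(main())
print(single)     # {'status': 202}
print(collected)  # [{'status': 202}, {'foo': 'bar'}]
```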
Member

Could you please also add a test for async generator callback?

Member Author

Great point, I missed that. Updated, thanks for the suggestion!

@elacuesta elacuesta changed the title CrawlSpider: add support for async def callbacks CrawlSpider: add support for async callbacks Oct 2, 2022
@elacuesta
Member Author

CI errors on Windows are unrelated, see #5633

@wRAR
Member

wRAR commented Oct 3, 2022

Nice!

@wRAR wRAR merged commit a193d14 into scrapy:master Oct 3, 2022
@elacuesta elacuesta deleted the crawlspider-support-async-def-callback branch October 3, 2022 12:30
if callback:
    cb_res = callback(response, **cb_kwargs) or ()
    if isinstance(cb_res, AsyncIterable):
        cb_res = await collect_asyncgen(cb_res)
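`collect_asyncgen` (from scrapy/utils/asyncgen.py) drains an async iterable into a plain list so the rest of the synchronous machinery can process it as before. A rough, self-contained sketch of such a helper, with a toy async generator standing in for a callback:

```python
import asyncio

async def collect_asyncgen(result):
    # Drain an async iterable into a plain list; downstream code can
    # then iterate it synchronously as before.
    results = []
    async for item in result:
        results.append(item)
    return results

async def numbers():
    # Toy async generator standing in for a spider callback's output.
    for i in range(3):
        yield i

collected = asyncio.run(collect_asyncgen(numbers()))
print(collected)  # [0, 1, 2]
```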
Member

@kmike kmike Oct 3, 2022

Should we document that for CrawlSpider behavior of async callbacks is different from regular spiders?

If I'm not mistaken, here the behavior is similar to Scrapy < 2.7, where the complete output of a callback is awaited before processing starts, while after #4978 the output (items, requests) is processed as soon as it is sent.

Member Author

Makes sense, thanks for pointing it out. Please note that this is consistent with CrawlSpider.process_results, which is documented to receive a list (although those docs only mention it in the context of the XMLFeedSpider)
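The behavioral difference discussed above can be demonstrated without Scrapy: buffering an async generator's output before processing (as collecting into a list does) defers all handling until the generator is exhausted, while iterating it directly handles each item as soon as it is yielded. A sketch, with an event log to make the ordering visible:

```python
import asyncio

events = []

async def callback():
    # Toy async-generator callback that records when each item is yielded.
    for i in range(2):
        events.append(f"yield {i}")
        yield i

async def streaming():
    # Handle each item as soon as it arrives.
    async for item in callback():
        events.append(f"process {item}")

async def buffered():
    # Collect the complete output first, then process it.
    items = [i async for i in callback()]
    for item in items:
        events.append(f"process {item}")

asyncio.run(streaming())
streaming_order = list(events)
events.clear()
asyncio.run(buffered())
buffered_order = list(events)
print(streaming_order)  # ['yield 0', 'process 0', 'yield 1', 'process 1']
print(buffered_order)   # ['yield 0', 'yield 1', 'process 0', 'process 1']
```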
