Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CloseSpider can be raised on spider_idle signal handler to set the closing reason #5191

Merged
merged 8 commits into from
Jul 12, 2021

Conversation

ivanprado
Copy link
Contributor

The execution of a spider might have been successful or not depending on many conditions. For example, you might have expected many items to be extracted but none were extracted. It might be useful to be able to evaluate at the end of the crawling if the results are successful, and if not, close the spider using the appropriate message (e.g. "too_few_results") instead of the regular one "finished".

The spider_idle signal handler looks like the perfect place to put such logic. At this stage, all the work has been done, so all the stats are ready to be queried to decide the final status. The problem I found is that is not possible to set the closing reason in this handler.

In this PR I propose to reuse the exception CloseSpider to provide the closing reason. The required changes are not so big, and it seems to fit.

Let me know what do you think, and if you agree on the functionality, what else would be required (tests, etc).

@codecov
Copy link

codecov bot commented Jun 23, 2021

Codecov Report

Merging #5191 (cb08e36) into master (016c7e9) will decrease coverage by 4.07%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master    #5191      +/-   ##
==========================================
- Coverage   88.19%   84.11%   -4.08%     
==========================================
  Files         162      162              
  Lines       10497    10502       +5     
  Branches     1517     1518       +1     
==========================================
- Hits         9258     8834     -424     
- Misses        965     1407     +442     
+ Partials      274      261      -13     
Impacted Files Coverage Δ
scrapy/core/engine.py 84.21% <87.50%> (+0.24%) ⬆️
scrapy/utils/signal.py 97.91% <100.00%> (+0.04%) ⬆️
scrapy/core/http2/stream.py 27.01% <0.00%> (-64.37%) ⬇️
scrapy/pipelines/images.py 28.07% <0.00%> (-62.29%) ⬇️
scrapy/core/http2/agent.py 36.14% <0.00%> (-60.25%) ⬇️
scrapy/core/downloader/handlers/http2.py 43.42% <0.00%> (-56.58%) ⬇️
scrapy/core/http2/protocol.py 34.17% <0.00%> (-49.25%) ⬇️
scrapy/utils/ssl.py 53.65% <0.00%> (-17.08%) ⬇️
scrapy/utils/asyncgen.py 83.33% <0.00%> (-16.67%) ⬇️
scrapy/core/downloader/contextfactory.py 75.92% <0.00%> (-11.12%) ⬇️
... and 9 more

@ivanprado
Copy link
Contributor Author

@kmike
Copy link
Member

kmike commented Jun 24, 2021

Hey! The feature makes total sense to me.

  1. Would it be possible to add a test to it?
  2. Do you think it worth mentioning this use case somewhere in the docs?

@ivanprado ivanprado changed the title CloseSpider can be raised on spider_idle signal handler to the the closing reason CloseSpider can be raised on spider_idle signal handler to set the closing reason Jun 24, 2021
@ivanprado
Copy link
Contributor Author

Hi @kmike . I've added a test case and also documented a possible use case in the signals documentation page.

I have a problem with the test Ubuntu / tests (3.6.12, asyncio-pinned). They are failing. I don't know why. Any clue?

@pablohoffman
Copy link
Member

This functionality makes a lot of sense to me too, nice contribution @ivanprado 👍

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

scrapy/core/engine.py Outdated Show resolved Hide resolved
scrapy/core/engine.py Outdated Show resolved Hide resolved
tests/test_engine.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

I have a problem with the test Ubuntu / tests (3.6.12, asyncio-pinned). They are failing. I don't know why. Any clue?

That’s an issue in master unrelated to this pull request, do not worry about it.

ivanprado and others added 3 commits July 8, 2021 12:39
Co-authored-by: Adrián Chaves <adrian@chaves.io>
Co-authored-by: Adrián Chaves <adrian@chaves.io>
Co-authored-by: Adrián Chaves <adrian@chaves.io>
@ivanprado
Copy link
Contributor Author

Thank you @Gallaecio for your review 🍻. I've accepted all your suggestions.

scrapy/core/engine.py Outdated Show resolved Hide resolved
@wRAR wRAR merged commit 1c46d5a into scrapy:master Jul 12, 2021
@ivanprado
Copy link
Contributor Author

Thank you everybody for the reviews, commend and for merging it :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants