CONCURRENT_REQUESTS_PER_DOMAIN ignored for start_urls #5083
Comments
Am I missing some pieces in the settings, or is this expected behavior that is not well clarified in the docs? Or is this a bug?
It sounds like a bug to me. I plan to try and reproduce it when I get some spare time.
I want to make my first contribution, so I'm looking into this!
After looking into it, I’ve identified that the problem is in `scrapy/core/downloader/__init__.py`, lines 135 to 147 at commit 0e57918.
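For context, that span is Scrapy's per-slot queue processing in the downloader. A lightly paraphrased sketch of the logic (the exact code at commit 0e57918 may differ in details) is:

```python
# Paraphrased from scrapy/core/downloader/__init__.py
# (Downloader._process_queue); not a verbatim copy of commit 0e57918.
from time import time
from twisted.internet import reactor

def _process_queue(self, spider, slot):
    if slot.latercall and slot.latercall.active():
        return

    # Delay queue processing if a download_delay is configured.
    now = time()
    delay = slot.download_delay()
    if delay:
        penalty = delay - now + slot.lastseen
        if penalty > 0:
            slot.latercall = reactor.callLater(
                penalty, self._process_queue, spider, slot
            )
            return

    # Process enqueued requests while there are free transfer slots.
    while slot.queue and slot.free_transfer_slots() > 0:
        slot.lastseen = now
        request, deferred = slot.queue.popleft()
        dfd = self._download(slot, request, spider)
        dfd.chainDeferred(deferred)
        # If a delay is set, dispatch only one request per pass to avoid bursts.
        if delay:
            self._process_queue(spider, slot)
            break
```

Note the final `if delay: ... break`: whenever a delay is configured, at most one request leaves the queue per pass, and `slot.lastseen` gates the next pass, so the concurrency limit never comes into play.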
`slot.lastseen` is a single value, whereas maybe it should be a deque that rotates with the enqueued requests, or even be part of the queue.
However, as I was trying to think of a way to make such a change in a backward-compatible way, which I don’t think is straightforward, I started to think that maybe there is no bug here, and the current implementation is how it should be.

The current implementation sends requests with 10 seconds of delay between them, but it also stops sending requests if the server responses are so slow that, by the time the 21st request is due, none of the responses have arrived. This may sound crazy with such high numbers, but with lower values for both settings it makes a lot of sense.

What you are suggesting would result in batches of requests being sent at the same time, which is not good for servers. Your requests should be distributed in time, not sent as N simultaneous requests followed by an M-second wait before the next batch of N.

In summary, I don’t think this is a bug. If you can justify the need for the behavior you expect, maybe we can turn this into an enhancement request to make such a behavior possible. But I honestly cannot think of a reason why you would want that when scraping someone else’s server.
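To make the arithmetic concrete, here is a back-of-the-envelope illustration (plain Python, not Scrapy code; the 5-second response time is an arbitrary assumption):

```python
DOWNLOAD_DELAY = 10                 # seconds between consecutive sends
CONCURRENT_REQUESTS_PER_DOMAIN = 20
RESPONSE_TIME = 5                   # assumed server response time, seconds

# With a delay, requests go out one at a time: the i-th at t = i * delay.
send_times = [i * DOWNLOAD_DELAY for i in range(21)]

# How many of the first 20 requests are still unanswered when the 21st is due?
in_flight = sum(t + RESPONSE_TIME > send_times[20] for t in send_times[:20])
print(in_flight)  # 0 -- the concurrency limit never binds
```

The limit would only kick in if a single response took longer than `DOWNLOAD_DELAY * CONCURRENT_REQUESTS_PER_DOMAIN` (200 seconds here); with lower values for both settings, that safeguard is much more plausible.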
We should make sure #5083 (comment) is clear in the settings documentation.
Description
When DOWNLOAD_DELAY is set to a value > 0, the value of CONCURRENT_REQUESTS_PER_DOMAIN is ignored when processing start_urls.
Steps to Reproduce
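The spider used in the original report is not preserved here; a minimal sketch consistent with the described behavior (assuming 20 start URLs on a single domain, CONCURRENT_REQUESTS_PER_DOMAIN = 20, and DOWNLOAD_DELAY = 10; URLs and names are hypothetical) might look like:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # 20 URLs on the same domain, so CONCURRENT_REQUESTS_PER_DOMAIN applies.
    start_urls = [f"https://example.com/page/{i}" for i in range(20)]
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 20,
        "DOWNLOAD_DELAY": 10,
    }

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)
```

Then run: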
`scrapy crawl example`
Expected behavior: all 20 requests should be sent concurrently, without delay between them
Actual behavior: the spider crawls one page every 10 seconds
Reproduces how often: 100%
Versions
2.4.1, 2.5.0