
CONCURRENT_REQUESTS_PER_DOMAIN ignored for start_urls #5083

Closed
Leogio360 opened this issue Apr 6, 2021 · 5 comments · Fixed by #5540

@Leogio360

Leogio360 commented Apr 6, 2021

Description

When DOWNLOAD_DELAY is set to a value > 0, the value of CONCURRENT_REQUESTS_PER_DOMAIN is ignored when processing start_urls.

Steps to Reproduce

  1. Create an example spider:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['google.com']
    start_urls = [f'https://www.google.com/search?q=scrapy{x}' for x in range(0, 20, 1)]

    def parse(self, response):
        self.log("*" * 100)
        pass
  2. In settings.py, set:
CONCURRENT_REQUESTS = 100
DOWNLOAD_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 20
  3. Execute scrapy crawl example

Expected behavior: all 20 requests should be crawled without delay.
Actual behavior: the spider crawls a page every 10 seconds.

Reproduces how often: 100%

Versions

2.4.1, 2.5.0

@Leogio360
Author

Leogio360 commented Apr 8, 2021

Am I missing something in the settings, or is this expected behavior that is not well clarified in the docs? Or is this a bug?

@Gallaecio
Member

It sounds like a bug to me. I plan to try and reproduce it when I get some spare time.

@FilipePintoReis

FilipePintoReis commented Apr 16, 2021

I want to make my first contribution, so I'm looking into this!
I can also reproduce it.

@Gallaecio
Member

After looking into it, I’ve identified that the problem is in _process_queue:

def _process_queue(self, spider, slot):
    from twisted.internet import reactor
    if slot.latercall and slot.latercall.active():
        return

    # Delay queue processing if a download_delay is configured
    now = time()
    delay = slot.download_delay()
    if delay:
        penalty = delay - now + slot.lastseen
        if penalty > 0:
            slot.latercall = reactor.callLater(penalty, self._process_queue, spider, slot)
            return
Here, slot.lastseen is a single value, where maybe it should be a deque that rotates with the enqueued requests, or even be part of the queue.
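
Just to illustrate the idea, here is a rough sketch of that deque-based approach (this is not Scrapy code; the Slot class and method names below are made up for the example):

from collections import deque
from time import time


class Slot:
    """Hypothetical downloader slot that remembers the last N dispatch times."""

    def __init__(self, concurrency, delay):
        self.concurrency = concurrency   # CONCURRENT_REQUESTS_PER_DOMAIN
        self.delay = delay               # DOWNLOAD_DELAY
        # One timestamp per recent request instead of a single `lastseen` value.
        self.lastseen = deque(maxlen=concurrency)

    def penalty(self):
        """Seconds to wait before the next request may be dispatched (0 = now)."""
        if not self.delay or len(self.lastseen) < self.concurrency:
            return 0
        # The oldest of the last `concurrency` dispatch times drives the wait.
        return max(0.0, self.delay - (time() - self.lastseen[0]))

    def record_dispatch(self):
        self.lastseen.append(time())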

However, as I was trying to think of a way to make such a change in a backward-compatible way, which I don’t think is straightforward, I started to think that maybe there is no bug here, and the current implementation is how it should be. DOWNLOAD_DELAY is the delay between requests to the same domain, while your expectation is for DOWNLOAD_DELAY to be the delay between each batch of CONCURRENT_REQUESTS_PER_DOMAIN requests, and I think the former interpretation is much more useful.

The current implementation will make requests with 10 seconds of delay between them, but it will also stop sending requests if the server’s responses are so slow that, even with that delay, none of them have arrived by the time the 21st request is due to be sent. This may sound crazy with such high numbers, but with lower values for both settings it makes a lot of sense.

What you are suggesting would result in batches of requests being sent at the same time, which is not good for servers. Your requests should be distributed in time, not sent as a batch of N simultaneous requests followed by an M-second wait before the next batch of N.
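
To make the timing difference concrete, here is a tiny standalone illustration (not Scrapy code; the numbers are taken from the report above) of when requests would be dispatched under each policy:

# Standalone timing illustration, not Scrapy code.
DOWNLOAD_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 20
N_REQUESTS = 60

# Current behavior: requests to the same domain are spread out evenly,
# one every DOWNLOAD_DELAY seconds.
current = [i * DOWNLOAD_DELAY for i in range(N_REQUESTS)]

# Behavior the report expects: a burst of CONCURRENT_REQUESTS_PER_DOMAIN
# requests, then a DOWNLOAD_DELAY pause, then another burst, and so on.
bursts = [(i // CONCURRENT_REQUESTS_PER_DOMAIN) * DOWNLOAD_DELAY
          for i in range(N_REQUESTS)]

print("current:", current[:22])   # 0, 10, 20, 30, ...
print("bursts: ", bursts[:22])    # 0 (x20), 10 (x20), ...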

In summary, I don’t think this is a bug, and CONCURRENT_REQUESTS_PER_DOMAIN is not ignored; it just does not work as you expected, and its function will likely not come into play in your example because Google sends responses very quickly.

If you can justify the need for the behavior you expect, maybe we can turn this into an enhancement request to make such behavior possible. But I honestly cannot think of a reason why you would want to do that when scraping someone else’s server.

@Gallaecio
Member

We should make sure #5083 (comment) is made clear in the documentation of DOWNLOAD_DELAY; this is bound to be a common misunderstanding.
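
For example, the two intents could be contrasted in settings.py terms roughly like this (an illustration of the semantics described above, not the actual documentation wording):

# Intent A: up to 20 requests to the same domain in flight at once, no forced spacing.
CONCURRENT_REQUESTS_PER_DOMAIN = 20
DOWNLOAD_DELAY = 0

# Intent B: be gentle with the server; requests to a domain are spaced ~10 s apart,
# so a per-domain concurrency above 1 rarely has a visible effect.
# CONCURRENT_REQUESTS_PER_DOMAIN = 20
# DOWNLOAD_DELAY = 10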
