Support limiting the number of requests per interval #125

Closed
honglei opened this Issue Apr 27, 2012 · 6 comments


4 participants

@honglei
honglei commented Apr 27, 2012

Many web sites' open APIs limit the maximum number of requests allowed from one IP address in a certain interval, e.g. 40 requests per minute.
However, the current settings are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN,
CONCURRENT_REQUESTS_PER_IP, and DOWNLOAD_DELAY, which depend on how long requests take to complete, so I find it difficult to tune them against an API's threshold.
To achieve high performance without exceeding the API's threshold, I suggest adding a setting like MAX_REQUESTS_PER_MINUTE.

Thanks!

@honglei
honglei commented May 1, 2012

To stay within a maximum of 40 requests/min, should I set the following?

RANDOMIZE_DOWNLOAD_DELAY=False
DOWNLOAD_DELAY=60/40.
CONCURRENT_REQUESTS_PER_IP=40

@dangra
Member
dangra commented Jan 29, 2013

right, except DOWNLOAD_DELAY should be a float: 60 / 40.0
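For clarity, the corrected setting works out as follows (a minimal settings.py fragment):

```python
# With two integer operands, 60/40 is integer division in Python 2,
# which would yield 0 and thus no delay at all; writing one operand
# as a float gives the intended 1.5-second delay.
DOWNLOAD_DELAY = 60 / 40.0  # 1.5 s between requests ~= 40 requests/minute
```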

@dangra dangra closed this Jan 29, 2013
@eLRuLL
Contributor
eLRuLL commented May 22, 2016

I would like to reopen this issue for discussion of the case where we want to use multiple API credentials.

This could be a new kind of Spider, or just some settings to determine the minimum time interval per request thread, similar to how cookiejars work. Any ideas?

@jmaynier

@eLRuLL I have the same requirement: being able to enforce delay/concurrency settings per cookiejar.

@eLRuLL
Contributor
eLRuLL commented Sep 22, 2016

hi @jmaynier, I was able to solve this using a combination of CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY and download_slot.

To summarise, Scrapy uses the domain of a URL as the "key" to create a download_slot (slots are in charge of download concurrency); that is why CONCURRENT_REQUESTS_PER_DOMAIN works: Scrapy only controls the requests per slot.

Now, you can create your own slots to set up custom concurrency, and assign Request objects to a slot for the requests you want controlled by that custom concurrency. The way to do it is to pass the slot name in the meta parameter, like this: Request(url, meta={'download_slot': 'mycustomslot'}) (if you want more requests to be controlled by the same slot, just keep passing that meta parameter).
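The meta-based slot assignment described above can be sketched with a small helper (the helper name and slot name are just illustrative, not part of Scrapy's API):

```python
# Hypothetical helper: build the meta dict that pins a request to a named
# download slot; every request built with the same slot name shares one
# slot, so Scrapy applies its concurrency/delay limits to them as a group.
def slot_meta(slot_name='mycustomslot'):
    return {'download_slot': slot_name}

# Usage in a spider would look roughly like:
#   yield Request(url, meta=slot_meta('mycustomslot'))
```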

Requests carrying your custom slot will still be controlled by the CONCURRENT_REQUESTS_PER_DOMAIN setting, even though the slot key isn't "actually" a domain.

Now, what I did to control concurrency per "credential" was to emulate the "domain"/"slot" behaviour per credential, which resulted in the following settings:

settings.py

CONCURRENT_REQUESTS=200 # a high number, just so it won't conflict with per-domain concurrency
CONCURRENT_REQUESTS_PER_DOMAIN=1 # do 1 request at a time per domain (credentials will be specified as domains)
RANDOMIZE_DOWNLOAD_DELAY=False # deactivate the random offset that Scrapy adds
DOWNLOAD_DELAY=1.0 # the delay you want per credential (one request per second here); decimals also work

Now, when making requests with your credentials, specify a unique identifier per credential (you could keep the credentials in a list and use the list index) in the download_slot meta parameter, and keep passing it on all the requests you want to make with each credential; Scrapy will take care of the per-credential concurrency.
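The per-credential slot assignment above can be sketched like this (the credential list, slot naming scheme, and helper name are illustrative assumptions):

```python
import itertools

# Hypothetical credential pool; each index becomes its own download_slot
# key, so Scrapy throttles each credential independently.
CREDENTIALS = ['key-aaa', 'key-bbb', 'key-ccc']

def metas_for(urls, n_credentials=len(CREDENTIALS)):
    """Cycle over credential indices, pairing each url with the meta dict
    that pins its request to that credential's slot."""
    idx_cycle = itertools.cycle(range(n_credentials))
    return [(url, {'download_slot': 'credential-%d' % next(idx_cycle)})
            for url in urls]
```

In a spider you would then yield Request(url, meta=meta) for each pair, and Scrapy enforces DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN separately per slot.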

NOTE: If you still need to change something in the request before Scrapy actually executes it (downloads it from the site), use a downloader middleware, specifically its process_request method, and modify the request there.
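A minimal sketch of such a downloader middleware (the class name, credential mapping, and header choice are assumptions; only the process_request signature and the download_slot meta key come from Scrapy):

```python
# Hypothetical mapping from slot name to API key.
SLOT_CREDENTIALS = {
    'credential-0': 'key-aaa',
    'credential-1': 'key-bbb',
}

class CredentialDownloaderMiddleware:
    """Inject the credential matching the request's download_slot right
    before Scrapy downloads the request."""

    def process_request(self, request, spider):
        token = SLOT_CREDENTIALS.get(request.meta.get('download_slot'))
        if token is not None:
            request.headers['Authorization'] = b'Bearer ' + token.encode()
        return None  # returning None lets Scrapy continue with this request
```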

@jmaynier

@eLRuLL thanks for the detailed explanation!
