Rate-limiting (Bandwidth limiting) for downloads #157

achimnol · 2012-07-15T18:34:41Z

I have to crawl a website that enforces a download rate limit on all of its URLs, for example 800 KBytes/sec.
Since my internet connection is faster than that, accessing the website with my plain web browser and with scrapy triggers a severe delay penalty from the server software. (At the 800 KB/s limit, the download should finish within 30 secs, but at an unlimited download rate the server sometimes returns nothing at all, or responds only after a significant artificial delay of roughly 150 secs.)

Before using scrapy, I was using my home-made crawler with rate limiting.

After migrating to scrapy, I realized that I would have to re-implement parts of scrapy.core.downloader.webclient and learn the twisted APIs.
(My first lookup was downloader middlewares, but they seem to be executed after receiving the whole body, unfortunately.)
There is an undocumented setting called DOWNLOADER_HTTPCLIENTFACTORY, so I could copy & paste the existing classes, extend them, and override this setting.
I think we can override rawDataReceived() in ScrapyHTTPPageGetter by inserting calculated delays to limit the receiving rate of the response body traffic, but I think I need more inspection.
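
The "calculated delays" idea can be sketched independently of scrapy and twisted. The helper below is a hypothetical illustration (`ReceiveRateLimiter` and `delay_after` are made-up names, not part of either library): after each received chunk it returns how long the receiver should pause so that the average rate since the first chunk stays at or below the limit. In a twisted protocol the pause would presumably be applied by pausing the transport and resuming it later rather than by sleeping, but that wiring is an assumption and is not shown here.

```python
import time


class ReceiveRateLimiter:
    """Hypothetical helper: paces a download to an average byte rate."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic):
        self.max_bytes_per_sec = float(max_bytes_per_sec)
        self.clock = clock          # injectable so the logic can be tested
        self.started_at = None      # time the first chunk arrived
        self.total_bytes = 0        # bytes received so far

    def delay_after(self, chunk_len):
        """Record a chunk and return the pause (in seconds) before reading more."""
        now = self.clock()
        if self.started_at is None:
            self.started_at = now
        self.total_bytes += chunk_len
        # Earliest moment at which this many bytes is allowed under the limit.
        earliest_allowed = self.started_at + self.total_bytes / self.max_bytes_per_sec
        return max(0.0, earliest_allowed - now)


# Usage with the 800 KB/s limit from this report (a fake clock keeps it deterministic).
fake_now = [0.0]
limiter = ReceiveRateLimiter(800 * 1024, clock=lambda: fake_now[0])
first_delay = limiter.delay_after(800 * 1024)   # a full second's worth arrives at t=0
fake_now[0] = 1.0
second_delay = limiter.delay_after(400 * 1024)  # half a second's worth arrives at t=1
```

Because the delay is derived from the running total rather than from each chunk alone, a slow stretch automatically earns back headroom and no pause is inserted until the average catches up with the limit.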

Could scrapy be shipped with its native rate limiting feature?


pablohoffman · 2012-07-17T16:01:12Z

It could be possible, but I don't know of anyone working on this at the moment.

Also, there are a few options for how to express rate limits, such as requests/min and KB/min.

This limitation should probably go in scrapy.core.downloader.__init__ rather than webclient (which only deals with implementing the HTTP protocol), similar to the other limitations that are already there (delay, max.concurrency).
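
For context, the existing limitations mentioned above are driven by project settings. The fragment below is only an illustrative `settings.py` excerpt with made-up values. It shows that these knobs act on how often and how many requests are issued, and none of them caps the bytes per second of a single response.

```python
# settings.py (excerpt) -- example values, not recommendations

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2.0

# Upper bound on requests in flight across the whole crawl.
CONCURRENT_REQUESTS = 8

# Upper bound on requests in flight to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```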

Patches are always welcome :). I hope this small introduction encourages someone to work on it.

achimnol · 2012-07-28T16:44:03Z

According to http://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python, I think we need to wrap HTTPClientFactory with ThrottlingFactory.
I think this must be done at either scrapy.utils.misc.load_object() or scrapy.core.downloader.handlers.http.
Is there any possible "injection" point for generic protocol implementations?

achimnol · 2012-08-06T16:11:13Z

I have implemented a monkey-patch at https://github.com/achimnol/scrapy/commits/bandwidth-throttling .
I'm new to scrapy and twisted, so I'm not sure whether this could be generalized to other protocol implementations.
I'll test this more and clean up the code.

pablohoffman · 2012-09-03T17:02:16Z

@dangra what do you think?

cerisara · 2017-12-19T12:13:17Z

https://doc.scrapy.org/en/latest/topics/autothrottle.html

achimnol · 2017-12-19T15:32:30Z

@cerisara I think it's about request throttling, not bandwidth throttling per request.
