Rate-limiting (Bandwidth limiting) for downloads #157

Open
achimnol opened this Issue Jul 15, 2012 · 6 comments

I have to crawl a website that enforces a download rate limit on all its URLs, for example 800 KBytes/sec.
Since my internet connection is faster than that, accessing the website with either a plain web browser or scrapy triggers a severe delay penalty from the server software. (At the 800 KB/s limit the download should finish within 30 secs, but with unlimited download rates the server sometimes returns nothing at all, or responds only after a significant artificial delay of roughly 150 secs.)

Before using scrapy, I was using my home-made crawler with rate limiting.

After migrating to scrapy, I realized that I would have to re-implement parts of scrapy.core.downloader.webclient and learn the twisted APIs.
(I first looked at downloader middlewares, but unfortunately they seem to be executed only after the whole body has been received.)
There is an undocumented setting called DOWNLOADER_HTTPCLIENTFACTORY, so I could copy & paste the existing classes, extend them, and override this setting.
I think we can override rawDataReceived() in ScrapyHTTPPageGetter and insert calculated delays to limit the rate at which the response body is received, but I need to look into it further.
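
To make the "calculated delays" idea concrete, here is a minimal sketch in plain Python. It is not scrapy or twisted code: the function name `pacing_delay` and its parameters are hypothetical and only illustrate the arithmetic a rawDataReceived() override might use.

```python
def pacing_delay(bytes_received: int, elapsed: float, limit_bps: float) -> float:
    """Return how many seconds to pause so the average rate stays at or below limit_bps.

    bytes_received: total body bytes received so far
    elapsed:        seconds since the body started arriving
    limit_bps:      target rate in bytes per second (hypothetical setting)
    """
    if limit_bps <= 0:
        raise ValueError("limit_bps must be positive")
    # Time the transfer *should* have taken at the target rate.
    target_elapsed = bytes_received / limit_bps
    # If we are ahead of schedule, wait for the difference; otherwise do not wait.
    return max(0.0, target_elapsed - elapsed)


LIMIT = 800 * 1024  # 800 KB/s, the example limit from this issue

# 800 KB arrived in half a second: we are twice as fast as allowed.
print(pacing_delay(800 * 1024, 0.5, LIMIT))
# 1 KB arrived in a full second: well under the limit, no pause needed.
print(pacing_delay(1024, 1.0, LIMIT))
```

In a twisted protocol one would presumably not sleep for this delay but pause the transport and schedule a resume (for example with pauseProducing(), resumeProducing() and reactor.callLater()); that wiring is an assumption and is not shown here.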

Could scrapy ship with a native rate limiting feature?

pablohoffman Jul 17, 2012

Member

It could be possible, but I don't know of anyone working on this at the moment.

Also, there are a few options for implementing rate limits, such as requests/min and KB/min.
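
One way to cover both kinds of limit with a single mechanism is a token bucket. The sketch below is plain Python under stated assumptions, not an existing scrapy API: the class name `TokenBucket`, its methods, and the injectable `clock` argument are all hypothetical. For a requests/min limit each request would cost 1 token; for a KB/min limit each received chunk would cost its size in KB.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: `rate` tokens are added per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self._clock = clock          # injectable so the bucket can be tested without sleeping
        self._tokens = self.capacity  # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, never exceeding capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def wait_time(self, cost):
        """Seconds until `cost` tokens are available (0.0 if they are available now)."""
        self._refill()
        if cost <= self._tokens:
            return 0.0
        return (cost - self._tokens) / self.rate

    def consume(self, cost):
        """Take `cost` tokens if available; return True on success, False otherwise.

        Note: a cost larger than `capacity` can never succeed in this sketch.
        """
        self._refill()
        if cost > self._tokens:
            return False
        self._tokens -= cost
        return True


# 60 requests/min expressed as 1 token per second, with bursts of at most 5 requests.
requests_limit = TokenBucket(rate=60 / 60, capacity=5)
print(requests_limit.consume(1))  # a single request fits in the initial burst allowance
```

A downloader could ask `wait_time()` before issuing a request or reading more data and schedule the work after that many seconds, which fits alongside the existing delay and concurrency limits rather than inside the HTTP protocol code.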

This limitation should probably go in scrapy.core.downloader.__init__ rather than webclient (which only deals with implementing the HTTP protocol), similar to the other limitations that are already there (delay, max. concurrency).

Patches are always welcome :). I hope this small introduction encourages someone to work on it.

achimnol Jul 28, 2012

According to http://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python, I think we need to wrap HTTPClientFactory with twisted's ThrottlingFactory.
I think this would have to be done in either scrapy.utils.misc.load_object() or scrapy.core.downloader.handlers.http.
Is there an "injection" point for generic protocol implementations?

achimnol Aug 6, 2012

I have implemented a monkey-patch at https://github.com/achimnol/scrapy/commits/bandwidth-throttling .
I'm new to scrapy and twisted, so I'm not sure whether this could be generalized to other protocol implementations.
I'll test this more and clean up the code.

pablohoffman Sep 3, 2012

Member

@dangra what do you think?

achimnol Dec 19, 2017

@cerisara I think it's about request throttling, not bandwidth throttling per request.
