Rate-limiting (Bandwidth limiting) for downloads #157

Open
achimnol opened this issue Jul 15, 2012 · 6 comments

Comments

@achimnol

I have to crawl a website that enforces a download rate limit on all its URLs, for example 800 KB/s.
Since my internet connection is faster than that, accessing the site at full speed, whether with a plain web browser or with scrapy, triggers a severe delay penalty from the server software. (At the 800 KB/s limit the download should finish within 30 seconds, but with an unlimited download rate the server sometimes returns nothing at all, or responds only after an artificial delay of roughly 150 seconds.)

Before using scrapy, I was using my home-made crawler with rate limiting.

After migrating to scrapy, I realized that I would have to re-implement parts of scrapy.core.downloader.webclient and learn the twisted APIs.
(I first looked at downloader middlewares, but unfortunately they seem to run only after the whole body has been received.)
There is an undocumented setting called DOWNLOADER_HTTPCLIENTFACTORY, so I could copy the existing classes, extend them, and override this setting.
I think we could override rawDataReceived() in ScrapyHTTPPageGetter and insert calculated delays to limit the rate at which the response body is received, but I need to investigate further.

Could scrapy ship with a native rate-limiting feature?
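
The "calculated delays" idea above can be sketched independently of scrapy. This is a minimal sketch, not scrapy's or twisted's API: `ReceiveRateLimiter` and `delay_after` are hypothetical names, and the class only computes how long to wait after each chunk so that the average rate stays at or below the limit. In a twisted protocol the wait would presumably be applied by pausing the transport and resuming it later rather than by sleeping; that part is an assumption and is not shown here.

```python
import time


class ReceiveRateLimiter:
    """Hypothetical helper: tracks received bytes and reports the pause
    needed to keep the average receive rate under a limit."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic):
        self.max_bytes_per_sec = max_bytes_per_sec
        self.clock = clock
        self.started = clock()
        self.total_bytes = 0

    def delay_after(self, chunk_size):
        """Record a received chunk and return the seconds to wait
        before accepting more data."""
        self.total_bytes += chunk_size
        elapsed = self.clock() - self.started
        # Minimum time the bytes seen so far should have taken at the limit.
        required = self.total_bytes / self.max_bytes_per_sec
        return max(0.0, required - elapsed)
```

With an 800 KB/s limit, a caller would pass `800 * 1024` and call `delay_after(len(data))` for every chunk it receives.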

@pablohoffman
Member

It could be possible, but I don't know of anyone working on this at the moment.

Also, there are several ways to define a rate limit, such as requests/min or KB/min.

This limitation should probably go in scrapy.core.downloader.__init__ rather than webclient (which only deals with implementing the HTTP protocol), similar to the other limitations that are already there (delay, max. concurrency).

Patches are always welcome :). I hope this small introduction encourages someone to work on it.
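
For reference, the existing limits mentioned above (delay and maximum concurrency) are controlled through settings. A minimal sketch of a project `settings.py` follows; the values are illustrative assumptions, not recommendations. Note that these settings throttle requests, not the bytes per second of a single download.

```python
# settings.py (illustrative values, not recommendations)

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2.0

# Maximum number of simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```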

@achimnol
Author

According to http://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python, I think we need to wrap HTTPClientFactory with ThrottlingFactory.
I think this must be done in either scrapy.utils.misc.load_object() or scrapy.core.downloader.handlers.http.
Is there any possible "injection" point for generic protocol implementations?

@achimnol
Author

achimnol commented Aug 6, 2012

I have implemented a monkey-patch at https://github.com/achimnol/scrapy/commits/bandwidth-throttling .
I'm new to scrapy and twisted, so I'm not sure whether this could be generalized to other protocol implementations.
I'll test this more and clean up the code.

@pablohoffman
Member

@dangra what do you think?

@cerisara

@achimnol
Author

@cerisara I think it's about request throttling, not bandwidth throttling per request.
