Rate-limiting (Bandwidth limiting) for downloads #157

Open
achimnol opened this issue Jul 15, 2012 · 6 comments

@achimnol achimnol commented Jul 15, 2012

I have to crawl a website that enforces a download rate limit on all of its URLs, for example 800 KB/s.
Since my internet connection is faster than that, accessing the website with either a plain web browser or scrapy triggers a severe delay penalty from the server software. (At the 800 KB/s limit the download should finish within 30 seconds, but with an unlimited download rate the server sometimes returns nothing at all, or responds only after an artificial delay of roughly 150 seconds.)

Before using scrapy, I was using my home-made crawler with rate limiting.

After migrating to scrapy, I realized that I would have to re-implement parts of scrapy.core.downloader.webclient and learn the Twisted APIs.
(I first looked at downloader middlewares, but unfortunately they seem to run only after the whole body has been received.)
There is an undocumented setting called DOWNLOADER_HTTPCLIENTFACTORY, so I could copy the existing classes, extend them, and override this setting.
I think we could override rawDataReceived() in ScrapyHTTPPageGetter to insert calculated delays that limit the rate at which the response body is received, but I need to investigate further.
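
The "calculated delays" idea can be sketched independently of scrapy. The following is a minimal illustration only, not scrapy or Twisted code: the class name `ReceiveRateLimiter` and its `delay_after()` method are hypothetical, and the sketch assumes the caller reports the size of each received chunk and then waits for the returned number of seconds before reading more.

```python
import time


class ReceiveRateLimiter:
    """Hypothetical helper: tracks received bytes and reports how long
    to pause so the average receive rate stays under a limit."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic):
        self.max_bytes_per_sec = float(max_bytes_per_sec)
        self.clock = clock          # injectable so the logic can be tested
        self.started_at = None      # time the first chunk was seen
        self.total_bytes = 0        # bytes received so far

    def delay_after(self, chunk_size):
        """Record a received chunk; return seconds to wait before reading more."""
        now = self.clock()
        if self.started_at is None:
            self.started_at = now
        self.total_bytes += chunk_size
        # Earliest moment at which this many bytes may have arrived
        # without exceeding the configured rate.
        earliest_allowed = self.started_at + self.total_bytes / self.max_bytes_per_sec
        return max(0.0, earliest_allowed - now)
```

Inside a Twisted protocol the returned delay should not be applied with `time.sleep()`, which would block the reactor. One plausible approach, untested here, is to call `pauseProducing()` on the transport and schedule `resumeProducing()` with `reactor.callLater(delay, ...)`.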

Could scrapy ship with a native rate limiting feature?


@pablohoffman pablohoffman commented Jul 17, 2012

It could be possible, but I don't know of anyone working on this at the moment.

Also, there are a few options for how to express rate limits, such as requests/min and KB/min.

This limitation should probably go in scrapy.core.downloader.__init__ rather than webclient (which only deals with implementing the HTTP protocol), similar to the other limitations that are already there (delay, max.concurrency).
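
For reference, the existing limitations mentioned above (delay and maximum concurrency) are configured through project settings. A minimal sketch, with illustrative values that are assumptions rather than recommendations:

```python
# settings.py (sketch; the values below are illustrative assumptions)

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2.0

# Maximum number of requests in flight to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Both settings limit how often requests are issued. Neither caps the transfer rate of an individual response body, which is what this issue asks for.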

Patches are always welcome :). I hope this small introduction encourages someone to work on it.


@achimnol achimnol commented Jul 28, 2012

According to http://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python, I think we need to wrap HTTPClientFactory with ThrottlingFactory.
I think this must be done in either scrapy.utils.misc.load_object() or scrapy.core.downloader.handlers.http.
Is there any possible "injection" point for generic protocol implementations?
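
A minimal sketch of the wrapping idea, assuming Twisted's `twisted.protocols.policies.ThrottlingFactory` and its `readLimit` argument (bytes per second). The helper names `kbytes_to_bytes` and `throttle_downloads` are hypothetical, and where scrapy would call such a helper is exactly the open question above.

```python
def kbytes_to_bytes(kbytes_per_sec):
    """Convert a limit in KBytes/sec to bytes/sec (1 KByte = 1024 bytes)."""
    return int(kbytes_per_sec * 1024)


def throttle_downloads(client_factory, kbytes_per_sec):
    """Wrap a Twisted client factory so its reads are rate limited.

    Hypothetical helper; `client_factory` would be something like the
    HTTP client factory that scrapy creates for a request.
    """
    # Imported lazily so this module can be imported without Twisted installed.
    from twisted.protocols.policies import ThrottlingFactory

    return ThrottlingFactory(
        client_factory,
        readLimit=kbytes_to_bytes(kbytes_per_sec),
    )
```

Note that `ThrottlingFactory` applies its limit across all connections made through the same wrapper, so whether this yields a per-request or a shared limit depends on how many connections each wrapped factory serves.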


@achimnol achimnol commented Aug 6, 2012

I have implemented a monkey-patch at https://github.com/achimnol/scrapy/commits/bandwidth-throttling .
I'm new to scrapy and twisted, so I'm not sure whether this can be generalized to other protocol implementations.
I'll test this more and clean up the code.


@pablohoffman pablohoffman commented Sep 3, 2012

@dangra what do you think?


@achimnol achimnol commented Dec 19, 2017

@cerisara I think it's about request throttling, not bandwidth throttling per request.
