Rate-limiting (Bandwidth limiting) for downloads #157
Comments
It could be possible, but I don't know of anyone working on this at the moment. Also, there are a few options for implementing rate limits, like requests/min and KBs/min. This limitation should probably go in [...] Patches are always welcome :). I hope this small introduction encourages someone to work on it.
According to http://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python, I think we need to wrap [...]
I have implemented a monkey-patch at https://github.com/achimnol/scrapy/commits/bandwidth-throttling.
@dangra what do you think?
@cerisara I think it's about request throttling, not bandwidth throttling per request.
I have to crawl a website that enforces a certain download rate limit for all its URLs, for example, 800 KBytes/sec.
Since my internet connection is faster than that, accessing the website using my plain web browser and scrapy incurs a severe delay penalty from the server software. (With an 800 KB/s limit, the download should finish within 30 secs, but at unlimited download rates the server sometimes returns nothing at all, or responds only after a significant artificial delay of roughly 150 secs.)
Before using scrapy, I was using my home-made crawler with rate limiting.
After migrating to scrapy, I have realized that I have to re-implement parts of `scrapy.core.downloader.webclient` and that I need to learn the twisted APIs. (My first lookup was downloader middlewares, but unfortunately they seem to be executed only after the whole body has been received.)

There is an undocumented setting called `DOWNLOADER_HTTPCLIENTFACTORY`, so I could copy & paste the existing classes, extend them, and override this setting.

I think we can override `rawDataReceived()` in `ScrapyHTTPPageGetter` and insert calculated delays to limit the receiving rate of the response body traffic, but I need to inspect this further.

Could scrapy be shipped with a native rate limiting feature?
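
To make the "calculated delays" idea concrete, here is a minimal sketch of just the delay arithmetic, independent of scrapy and twisted. The names `ReceiveRateLimiter` and `delay_for` are hypothetical and not part of either library. The assumption is that whatever hook receives body chunks (for example an overridden `rawDataReceived()`) would report the chunk size and then hold off reading for the returned number of seconds.

```python
import time


class ReceiveRateLimiter:
    """Hypothetical helper: computes pauses that keep the average
    receive rate of one download at or below a byte-rate cap."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic):
        if max_bytes_per_sec <= 0:
            raise ValueError("max_bytes_per_sec must be positive")
        self.max_bytes_per_sec = float(max_bytes_per_sec)
        self._clock = clock      # injectable so the logic can be tested
        self._started = None     # time the first chunk arrived
        self._received = 0       # total bytes seen so far

    def delay_for(self, nbytes):
        """Record nbytes just received and return how many seconds
        the caller should wait before reading more data."""
        now = self._clock()
        if self._started is None:
            self._started = now
        self._received += nbytes
        # Earliest moment at which this many bytes is allowed under the cap.
        earliest = self._started + self._received / self.max_bytes_per_sec
        return max(0.0, earliest - now)
```

In a twisted protocol, the returned delay could presumably be applied by calling `pauseProducing()` on the transport and scheduling `resumeProducing()` with `reactor.callLater()`, rather than sleeping, since sleeping would block the reactor. I have not verified how that interacts with scrapy's download timeout handling.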