Slow crawling speeds with allowed_domains #19
Comments
How large is the allowed_domains list? Could you add the allowed_domains and disable the offsite middleware? Could you run the spider with the allowed_domains list but disabling scrapy-redis?
It's about 250 URLs. For some reason I cannot disable the offsite middleware; I have been setting 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None. I disabled scrapy-redis and got what looks like the same result, so this must be a Scrapy bug in allowed_domains?
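For reference, a minimal sketch of how that setting is usually written, assuming Scrapy 0.24 and that the dictionary lives in the project's `settings.py` (the thread does not show where the value was actually set). Mapping a middleware path to `None` in `SPIDER_MIDDLEWARES` disables it; in Scrapy 1.0 and later the path moved to `scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.

```python
# settings.py (sketch; assumes Scrapy 0.24 module paths)
# Mapping a spider middleware's import path to None disables it.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}
```

If the entry is placed in a different setting (for example `DOWNLOADER_MIDDLEWARES`), it would have no effect on the offsite filter.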
You could time how long the offsite middleware operations take. If that's the problem, it might be better to perform the offsite check
On Wed, Aug 6, 2014 at 7:27 PM, Doginal notifications@github.com wrote:
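One way to do the timing suggested above, as a rough standalone sketch rather than the middleware's actual code: it builds a single host regex from the domain list, similar in spirit to what the offsite middleware does, and times the checks with `timeit`. The helper names (`build_host_regex`, `is_allowed`) and the 250 `siteN.example.com` domains are hypothetical stand-ins for the real list.

```python
import re
import timeit


def build_host_regex(allowed_domains):
    """Approximate the host pattern an offsite filter builds from allowed_domains."""
    if not allowed_domains:
        return re.compile('')  # an empty pattern matches every host
    joined = '|'.join(re.escape(d) for d in allowed_domains)
    # Match the domain itself or any subdomain of it.
    return re.compile(r'^(.*\.)?(%s)$' % joined)


def is_allowed(host_regex, host):
    """Return True if the hostname passes the offsite check."""
    return bool(host_regex.search(host))


# Hypothetical data: 250 placeholder domains, roughly the size mentioned in the thread.
allowed_domains = ['site%d.example.com' % i for i in range(250)]
host_regex = build_host_regex(allowed_domains)

hosts = ['www.site42.example.com', 'site249.example.com', 'other.example.org']
results = [is_allowed(host_regex, h) for h in hosts]

# Time 1000 rounds of three checks each.
seconds = timeit.timeit(lambda: [is_allowed(host_regex, h) for h in hosts], number=1000)
print('3000 offsite checks took %.4f s' % seconds)
```

If the measured time is small compared with the per-page crawl time, the regex check itself is unlikely to be the bottleneck.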
Hi Doginal,
I have a crawler set up with Scrapy version 0.24.2 and the latest version of scrapy-redis. I have seen a drastic drop in performance when I add a list of allowed_domains. If I delete the allowed_domains list, my crawler goes from 1-3 pages/min up to 200-300 pages/min. I suspect scrapy-redis is causing this performance issue. Have you ever encountered it? Is there anything that could be causing it?
Settings: no auto throttle, no download limit, SCHEDULER_PERSIST = False. I also tried not passing a list of start_urls.
I am also getting very low CPU usage, about 0.5%. If I remove the allowed_domains it goes back up to about 10% (single thread).
Also I am seeing at the beginning of every crawl:
/usr/local/lib/python2.7/dist-packages/scrapy_redis/spiders.py:40: ScrapyDeprecationWarning: scrapy_redis.spiders.RedisSpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class RedisSpider(RedisMixin, BaseSpider):
Thanks,
Doginal