
Slow crawling speeds with allowed_domains #19

Closed
Doginal opened this issue Aug 6, 2014 · 4 comments

Comments

@Doginal

Doginal commented Aug 6, 2014

I have a crawler set up with Scrapy 0.24.2 and the latest version of scrapy-redis. I see a drastic drop in performance when I add a list of allowed_domains: if I delete the allowed_domains list, my crawler goes from 1-3 pages/min up to 200-300 pages/min. I suspect scrapy-redis is causing this performance issue. Have you ever encountered this? Is there anything that could be causing it?

Settings: no AutoThrottle, no download limit, SCHEDULER_PERSIST = False; I also tried not passing a list of start_urls.
CPU usage is also very low, about 0.5%; if I remove allowed_domains it goes back up to about 10% (single thread).
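For context, a minimal sketch of what such a settings.py typically looks like (the SCHEDULER/DUPEFILTER_CLASS lines are the standard scrapy-redis settings; the concrete values here are assumptions, not the actual project file):

```python
# settings.py -- sketch of the setup described above (assumed values)

# Standard scrapy-redis scheduler and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Do not keep the request queue between runs
SCHEDULER_PERSIST = False

# No throttling
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
```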

I am also seeing this at the beginning of every crawl:
/usr/local/lib/python2.7/dist-packages/scrapy_redis/spiders.py:40: ScrapyDeprecationWarning: scrapy_redis.spiders.RedisSpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class RedisSpider(RedisMixin, BaseSpider):

Thanks,
Doginal

@Doginal changed the title from "Slow speeds with allowed_domains" to "Slow crawling speeds with allowed_domains" on Aug 6, 2014
@rmax
Owner

rmax commented Aug 6, 2014

How large is the allowed_domains list? Could you keep the allowed_domains but disable the offsite middleware? Could you run the spider with the allowed_domains list but with scrapy-redis disabled?
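For the last experiment, a minimal sketch of what disabling scrapy-redis means in settings terms (assuming it is enabled through the usual settings): comment those lines out so Scrapy falls back to its default scheduler and dupefilter.

```python
# settings.py -- sketch: run the same spider without scrapy-redis
# (assumes scrapy-redis was enabled through these settings)

# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# SCHEDULER_PERSIST = False
```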

@Doginal
Author

Doginal commented Aug 6, 2014

The allowed_domains list has about 250 entries. For some reason I cannot disable the offsite middleware, even though I set 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None. I disabled scrapy-redis and got the same result, so could this be a Scrapy bug in the allowed_domains handling?
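One thing worth double-checking here (an assumption about the cause, not confirmed in this thread): OffsiteMiddleware is a spider middleware, so a None entry only disables it when placed under SPIDER_MIDDLEWARES; in DOWNLOADER_MIDDLEWARES it has no effect. A sketch for Scrapy 0.24:

```python
# settings.py -- sketch: disable the offsite middleware (Scrapy 0.24 module path)
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}
```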

@rmax
Owner

rmax commented Aug 6, 2014

You could time how long the offsite middleware operations take:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
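One rough way to do that (a sketch, not something from this thread: it subclasses the stock middleware and accumulates the time spent deciding requests):

```python
# timed_offsite.py -- sketch: measure the time spent in OffsiteMiddleware.should_follow
import logging
import time

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware

logger = logging.getLogger(__name__)


class TimedOffsiteMiddleware(OffsiteMiddleware):
    """OffsiteMiddleware that accumulates the time spent on the offsite check."""

    total_time = 0.0
    calls = 0

    def should_follow(self, request, spider):
        start = time.time()
        try:
            return super(TimedOffsiteMiddleware, self).should_follow(request, spider)
        finally:
            TimedOffsiteMiddleware.total_time += time.time() - start
            TimedOffsiteMiddleware.calls += 1
            if TimedOffsiteMiddleware.calls % 1000 == 0:
                logger.info("offsite checks: %d calls, %.2fs total",
                            TimedOffsiteMiddleware.calls,
                            TimedOffsiteMiddleware.total_time)
```

Enable it in SPIDER_MIDDLEWARES in place of the stock one ('scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None, 'yourproject.timed_offsite.TimedOffsiteMiddleware': 500).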

If that's the problem, it might be better to perform the offsite check before pushing the URLs to redis.
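A sketch of that idea for the script that feeds start URLs into redis (the redis key name and the domain list are assumptions; the check mirrors what the offsite middleware does, matching a host against the allowed domains and their subdomains):

```python
# feed_urls.py -- sketch: do the offsite check before pushing URLs to redis
from urlparse import urlparse  # Python 2.7, as in the warning above

import redis

# Same entries as the spider's allowed_domains (assumed example values)
ALLOWED_DOMAINS = ['example.com', 'example.org']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


def push_urls(urls, redis_url='redis://localhost:6379', key='myspider:start_urls'):
    """Push only on-domain URLs into the list the spider reads from."""
    r = redis.from_url(redis_url)
    for url in urls:
        if is_allowed(url):
            r.lpush(key, url)
```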


@younghz
Contributor

younghz commented Aug 16, 2014

Hi Doginal,
The warning at the beginning of your crawl appears because scrapy-redis is based on an older version of Scrapy. Scrapy renamed scrapy.spider.BaseSpider to scrapy.spider.Spider in version 0.22.0. You can change it in your scrapy-redis code and the warning will disappear.
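For reference, the change in scrapy_redis/spiders.py would look roughly like this (only the two affected lines are shown; the rest of the file is unchanged):

```python
# scrapy_redis/spiders.py -- sketch of the rename (Scrapy >= 0.22)
from scrapy.spider import Spider  # was: from scrapy.spider import BaseSpider


class RedisSpider(RedisMixin, Spider):  # was: class RedisSpider(RedisMixin, BaseSpider)
    """Spider that reads urls from a redis queue when idle."""
```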

@rmax closed this as completed on Apr 13, 2015