
Slow crawling speeds with allowed_domains #19

Closed
Doginal opened this issue Aug 6, 2014 · 4 comments

Comments

@Doginal

Doginal commented Aug 6, 2014

I have a crawler set up with Scrapy 0.24.2 and the latest version of scrapy-redis. I see a drastic drop in performance when I add a list of allowed_domains: if I delete the allowed_domains list, my crawler goes from 1-3 pages/min up to 200-300 pages/min. I suspect scrapy-redis is causing this performance issue. Have you ever encountered this? Is there anything that could be causing it?

Settings: no AutoThrottle, no download limit, SCHEDULER_PERSIST = False; I also tried not passing a list of start_urls.
CPU usage is also very low, about 0.5%; if I remove allowed_domains it goes back up to about 10% (single thread).
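For context, a minimal sketch of what such a settings.py typically looks like (the SCHEDULER/DUPEFILTER_CLASS lines are the standard scrapy-redis settings; the concrete values here are assumptions, not the actual project file):

```python
# settings.py -- sketch of the setup described above (assumed values)

# Standard scrapy-redis scheduler and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Do not keep the request queue between runs
SCHEDULER_PERSIST = False

# No throttling
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
```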

I am also seeing this at the beginning of every crawl:
/usr/local/lib/python2.7/dist-packages/scrapy_redis/spiders.py:40: ScrapyDeprecationWarning: scrapy_redis.spiders.RedisSpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class RedisSpider(RedisMixin, BaseSpider):

Thanks,
Doginal

@Doginal changed the title from "Slow speeds with allowed_domains" to "Slow crawling speeds with allowed_domains" on Aug 6, 2014
@rmax
Owner

rmax commented Aug 6, 2014

How large is the allowed_domains list? Could you keep the allowed_domains but disable the offsite middleware? Could you run the spider with the allowed_domains list but with scrapy-redis disabled?
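For the last experiment, a minimal sketch of what disabling scrapy-redis means in settings terms (assuming it is enabled through the usual settings): comment those lines out so Scrapy falls back to its default scheduler and dupefilter.

```python
# settings.py -- sketch: run the same spider without scrapy-redis
# (assumes scrapy-redis was enabled through these settings)

# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# SCHEDULER_PERSIST = False
```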

@Doginal
Author

Doginal commented Aug 6, 2014

The allowed_domains list has about 250 entries. For some reason I cannot disable the offsite middleware, even though I set 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None. I disabled scrapy-redis and got the same result, so could this be a Scrapy bug in the allowed_domains handling?
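One thing worth double-checking here (an assumption about the cause, not confirmed in this thread): OffsiteMiddleware is a spider middleware, so a None entry only disables it when placed under SPIDER_MIDDLEWARES; in DOWNLOADER_MIDDLEWARES it has no effect. A sketch for Scrapy 0.24:

```python
# settings.py -- sketch: disable the offsite middleware (Scrapy 0.24 module path)
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}
```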

@rmax
Owner

rmax commented Aug 6, 2014

You could time how long the offsite middleware operations take:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
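One rough way to do that (a sketch, not something from this thread: it subclasses the stock middleware and accumulates the time spent deciding requests):

```python
# timed_offsite.py -- sketch: measure the time spent in OffsiteMiddleware.should_follow
import logging
import time

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware

logger = logging.getLogger(__name__)


class TimedOffsiteMiddleware(OffsiteMiddleware):
    """OffsiteMiddleware that accumulates the time spent on the offsite check."""

    total_time = 0.0
    calls = 0

    def should_follow(self, request, spider):
        start = time.time()
        try:
            return super(TimedOffsiteMiddleware, self).should_follow(request, spider)
        finally:
            TimedOffsiteMiddleware.total_time += time.time() - start
            TimedOffsiteMiddleware.calls += 1
            if TimedOffsiteMiddleware.calls % 1000 == 0:
                logger.info("offsite checks: %d calls, %.2fs total",
                            TimedOffsiteMiddleware.calls,
                            TimedOffsiteMiddleware.total_time)
```

Enable it in SPIDER_MIDDLEWARES in place of the stock one ('scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None, 'yourproject.timed_offsite.TimedOffsiteMiddleware': 500).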

If that's the problem, it might be better to perform the offsite check before pushing the URLs to redis.
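A sketch of that idea for the script that feeds start URLs into redis (the redis key name and the domain list are assumptions; the check mirrors what the offsite middleware does, matching a host against the allowed domains and their subdomains):

```python
# feed_urls.py -- sketch: do the offsite check before pushing URLs to redis
from urlparse import urlparse  # Python 2.7, as in the warning above

import redis

# Same entries as the spider's allowed_domains (assumed example values)
ALLOWED_DOMAINS = ['example.com', 'example.org']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


def push_urls(urls, redis_url='redis://localhost:6379', key='myspider:start_urls'):
    """Push only on-domain URLs into the list the spider reads from."""
    r = redis.from_url(redis_url)
    for url in urls:
        if is_allowed(url):
            r.lpush(key, url)
```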


@younghz
Contributor

younghz commented Aug 16, 2014

Hi Doginal,
The warning at the beginning of your crawl appears because scrapy-redis is based on an older version of Scrapy. Scrapy renamed scrapy.spider.BaseSpider to scrapy.spider.Spider in version 0.22.0. You can change it in your scrapy-redis code and the warning will disappear.
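For reference, the change in scrapy_redis/spiders.py would look roughly like this (only the two affected lines are shown; the rest of the file is unchanged):

```python
# scrapy_redis/spiders.py -- sketch of the rename (Scrapy >= 0.22)
from scrapy.spider import Spider  # was: from scrapy.spider import BaseSpider


class RedisSpider(RedisMixin, Spider):  # was: class RedisSpider(RedisMixin, BaseSpider)
    """Spider that reads urls from a redis queue when idle."""
```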

@rmax closed this as completed on Apr 13, 2015