
Cookie handling performance improvement #77

Merged
merged 1 commit into scrapy:master on Apr 24, 2013

3 participants

@shaneaevans
Scrapy project member

This still needs testing and review. It worked on the crawl I tested it with, but quite possibly has broken or changed something with cookie handling.

@artemdevel

Hi Shane,

I ran several tests on this patch, and as far as I can see it works fine. At first I tested it with several of my spiders that use cookies, but they only scrape data from single domains. After that I created two simple spiders: the first collects some companies and their blog URLs from CrunchBase (see https://gist.github.com/artem-dev/5037285) and the second scrapes some data from the collected URLs (see https://gist.github.com/artem-dev/5037292). Companies usually host their blogs on subdomains, so there are many URL pairs like http://company.com and http://blog.company.com, which as I understand is exactly the case this optimisation targets. About 1500 URLs were scraped. I measured scraping time and memory consumption from the scraping logs with and without the patch applied, but the results were practically identical. Perhaps bigger URL lists are needed for more precise tests. I also did some refactoring of your patch (https://gist.github.com/artem-dev/5037311); tested under the same conditions, it gave the same results as your patch and as the unpatched code.
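
For reference, a minimal sketch of the kind of second-stage spider described above (this is not the linked gists or the patch itself; the class name, URLs and settings are illustrative assumptions, using the current Scrapy API):

```python
import scrapy


class BlogSpider(scrapy.Spider):
    """Crawl many distinct hosts with cookies enabled, so the
    CookiesMiddleware has to keep a separate jar per site."""
    name = "blog_bench"
    custom_settings = {"COOKIES_ENABLED": True}

    # In a real run these would be the ~1500 blog URLs collected by the
    # first-stage spider; hard-coded here only to keep the sketch
    # self-contained.
    start_urls = [
        "http://blog.example-one.com",
        "http://blog.example-two.com",
    ]

    def parse(self, response):
        # Yield a trivial item so throughput and memory can be compared
        # from the crawl stats with and without the patch applied.
        yield {"url": response.url, "title": response.css("title::text").get()}
```
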

@shaneaevans
Scrapy project member

This patch was developed when we noticed a slow-down running a broad crawl. I think the crawl became slow once we got into the thousands of domains, and we wanted to crawl millions. If you go a few pages deep on many domains with cookies enabled you should notice a slow-down (unless something has been improved in the meantime).
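
A hedged sketch of settings for reproducing that scenario (the values are illustrative assumptions, not the settings used for the original crawl; all keys are standard Scrapy settings):

```python
# settings.py sketch: broad crawl over many domains, a few pages deep,
# with cookies left on so the cookie middleware is exercised per site.
BOT_NAME = "broad_crawl_bench"

COOKIES_ENABLED = True      # the code path this patch optimises
DEPTH_LIMIT = 3             # "a few pages deep on many domains"
CONCURRENT_REQUESTS = 100   # typical broad-crawl concurrency
REACTOR_THREADPOOL_MAXSIZE = 20
```
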

@dangra dangra merged commit 9739061 into scrapy:master Apr 24, 2013