SitemapSpider memory issues #3529
See also: #605
Hi @altunyurt, have you made any progress? I would like to work on this too 😄
@altunyurt @joaquingx I would be interested in this too; I'm in the same boat: I need to process a lot of (nested) sitemaps where I am only interested in extracting every URL and sending it straight to a queue, without any further processing or Scrapy requests. I would like to reduce Scrapy's RAM usage while it processes the sitemaps. I tried to override the …
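A minimal sketch of that idea, assuming the sitemaps are fetched outside Scrapy entirely: walk the (possibly nested) sitemaps with `scrapy.utils.sitemap.Sitemap` and push each URL onto a queue. The `requests` client, `collect_urls`, and `url_queue` are illustrative names, not from this thread.

```python
# Hypothetical sketch: walk nested sitemaps and push every URL onto a
# queue, with no Scrapy requests involved. Gzipped sitemaps would need
# decompressing first; that is omitted here.
from queue import Queue

import requests  # any HTTP client would do
from scrapy.utils.sitemap import Sitemap

def collect_urls(sitemap_url, url_queue):
    body = requests.get(sitemap_url).content
    sitemap = Sitemap(body)
    for entry in sitemap:
        if sitemap.type == 'sitemapindex':
            collect_urls(entry['loc'], url_queue)  # recurse into nested sitemaps
        else:
            url_queue.put(entry['loc'])

url_queue = Queue()
collect_urls('https://example.com/sitemap_index.xml', url_queue)
```

Note this still parses each sitemap with the stock `Sitemap` class, so it avoids the Scrapy request/scheduler overhead but not the parser's own memory footprint.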
In applications based on recent versions of Scrapy, a single sitemap with 50k URLs cannot, in most cases, consume 6 GB of RAM as described in this issue. However, it is easily possible on older versions (1.5.0 and earlier), which were probably affected by #2658 (caused by issues in gunzip). That was fixed as a result of #3281. If the user simply hasn't updated Scrapy to the latest available version (1.5.1 at the moment), they are most likely affected by the gunzip issue mentioned above; otherwise the issue is not reproducible at all.
I'm using SitemapSpider on a sitemap index consisting of 20-30 sitemaps, each having 50k URLs.
Even processing each sitemap alone ends up eating all the memory on a 6 GB machine, let alone the millions of URLs across all the sitemaps in the index combined.
IIRC, the parser keeps the whole document in memory during the operation, so I've monkey-patched scrapy.utils.sitemap.Sitemap with the following snippet. The original code is at https://stackoverflow.com/a/12161078.
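A sketch of that kind of patch, following the iterparse-and-clear pattern from the linked StackOverflow answer: stream the document with `lxml.etree.iterparse` instead of building the whole tree, and free each `<url>`/`<sitemap>` element after yielding it. The class name `StreamingSitemap` and the details are a reconstruction, not necessarily the exact snippet.

```python
# Reconstruction sketch, not the exact snippet: a drop-in replacement
# for scrapy.utils.sitemap.Sitemap that streams the document and frees
# elements as it goes, per the linked StackOverflow recipe.
from io import BytesIO

import lxml.etree

class StreamingSitemap:
    """Iterate sitemap entries without keeping the whole tree in memory."""

    def __init__(self, xmltext):
        self._xmltext = xmltext
        # Read only the root tag to expose the same `type` attribute as
        # the original class ('urlset' or 'sitemapindex').
        _, root = next(lxml.etree.iterparse(
            BytesIO(xmltext), events=('start',), recover=True))
        self.type = root.tag.split('}', 1)[1] if '}' in root.tag else root.tag

    def __iter__(self):
        context = lxml.etree.iterparse(
            BytesIO(self._xmltext), events=('end',), recover=True)
        for _, elem in context:
            tag = elem.tag.split('}', 1)[1] if '}' in elem.tag else elem.tag
            if tag not in ('url', 'sitemap'):
                continue  # <loc>, <lastmod>, ... fire their own 'end' events
            d = {}
            for child in elem:
                name = (child.tag.split('}', 1)[1]
                        if '}' in child.tag else child.tag)
                d[name] = child.text.strip() if child.text else ''
            if 'loc' in d:
                yield d
            # Free the element and its already-processed siblings so the
            # partial tree stays small (the core of the linked recipe).
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context
```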
Using it as follows allowed me to run the task much more smoothly, utilizing at most 3 GB at any time.
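One way the usage could look (again a sketch, with placeholder names): SitemapSpider binds `Sitemap` at import time in `scrapy.spiders.sitemap`, so that reference has to be replaced in addition to the one in `scrapy.utils.sitemap`.

```python
# Sketch of applying the monkey patch before the spider is defined.
# `StreamingSitemap` is the class sketched above; the spider name and
# URL are placeholders.
import scrapy.spiders.sitemap
import scrapy.utils.sitemap

scrapy.utils.sitemap.Sitemap = StreamingSitemap
scrapy.spiders.sitemap.Sitemap = StreamingSitemap  # rebinds the imported name

from scrapy.spiders import SitemapSpider

class HugeSitemapSpider(SitemapSpider):
    name = 'huge_sitemap'
    sitemap_urls = ['https://example.com/sitemap_index.xml']

    def parse(self, response):
        yield {'url': response.url}
```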
I can create a pull request if this is ok.