SitemapSpider memory issues #3529

Open
altunyurt opened this issue Dec 11, 2018 · 4 comments

Comments

@altunyurt

I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps, each containing 50k URLs.
Even processing a single sitemap on its own ends up eating all the memory on a 6 GB machine, let alone the millions of URLs across all the sitemaps in the index.

IIRC, the parser keeps the whole document in memory during the operation, so I've monkey-patched scrapy.utils.sitemap.Sitemap with the following snippet. The original code is at https://stackoverflow.com/a/12161078

# patching scrapy/utils/sitemap.py
import io

import lxml.etree


class Sitemap(object):
    """Class to parse Sitemap (type=urlset) and Sitemap Index
    (type=sitemapindex) files"""

    def __init__(self, xmltext):
        # xmltext is the raw sitemap body (bytes) as passed in by SitemapSpider
        self.type = "urlset" if b"urlset" in xmltext[:100] else "sitemapindex"

        tag = {"urlset": "url", "sitemapindex": "sitemap"}[self.type]
        # iterparse expects a file-like object; sitemap elements are
        # namespaced, so match the tag in any namespace
        self._root = lxml.etree.iterparse(
            io.BytesIO(xmltext),
            tag="{*}" + tag,
            events=("end",),
            recover=True,
            remove_comments=True,
            resolve_entities=False,
        )

    def __iter__(self):
        for event, elem in self._root:
            loc = elem.find("{*}loc")
            if loc is None:
                continue
            # note: unlike the stock parser, only "loc" is exposed here
            # (lastmod, priority and alternate links are dropped)
            yield {"loc": loc.text}
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del self._root

Using it as follows allowed me to run the task much more smoothly, using at most 3 GB at any time.

worker.py
-------------------------
import scrapy
import scrapy.spiders.sitemap

from core.patches.sitemap import Sitemap

__all__ = ["Spider"]

# patch the module where SitemapSpider actually looks Sitemap up
scrapy.spiders.sitemap.Sitemap = Sitemap


class Spider(scrapy.spiders.SitemapSpider):
    sitemap_urls = [sitemap1, sitemap2, ...]

    def parse(self, response):
        ...

I can create a pull request if this is ok.

@kmike
Member

kmike commented Dec 28, 2018

See also: #605

@joaquingx
Contributor

joaquingx commented Jan 8, 2019

Hi @altunyurt, have you made any progress on this? I would like to work on it too 😄

@kinoute
Contributor

kinoute commented Jul 11, 2022

@altunyurt @joaquingx I would be interested in using iterparse for a sitemap spider too, but I wasn't able to make it work with your snippet.

I'm in the same boat: I need to process a lot of (nested) sitemaps where I'm only interested in extracting every URL and sending it straight to a queue, without any further processing or Scrapy request. I'd like to reduce Scrapy's RAM usage while it processes the sitemaps.

I tried overriding the _parse_sitemap method to remove the Scrapy Request for every URL, which works fine for my case, but I think the memory usage could still be much better.
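
For reference, here is a minimal sketch of that kind of override (the spider name and sitemap URL are placeholders; it assumes the stock SitemapSpider internals such as _get_sitemap_body, and it skips the robots.txt handling that the original method performs):

import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.utils.sitemap import Sitemap


class UrlCollectorSpider(SitemapSpider):
    name = "url_collector"  # placeholder
    sitemap_urls = ["https://example.com/sitemap_index.xml"]  # placeholder

    def _parse_sitemap(self, response):
        body = self._get_sitemap_body(response)
        if body is None:
            self.logger.warning("Ignoring invalid sitemap: %s", response.url)
            return
        s = Sitemap(body)
        if s.type == "sitemapindex":
            # still follow nested sitemaps
            for entry in s:
                yield scrapy.Request(entry["loc"], callback=self._parse_sitemap)
        elif s.type == "urlset":
            # yield the URL itself instead of scheduling a Request for it,
            # so an item pipeline can push it straight to a queue
            for entry in s:
                yield {"url": entry["loc"]}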

@GeorgeA92
Contributor

I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps each having 50k urls.
Even trying each sitemap alone ends up eating all the memory on a 6gb machine

With recent versions of Scrapy, a single sitemap with 50k URLs cannot, in most cases, consume 6 GB of RAM as described in this issue.

However, that was easily possible on older versions (1.5.0 and earlier), which were probably affected by #2658 (caused by issues in gunzip). That gunzip issue was extremely painful, in terms of RAM usage, for large 50k-URL sitemaps.

It was fixed as a result of #3281. The next Scrapy version containing that fix, 1.5.1, was released on 2018-07-12, five months before this issue was raised.

If the user simply hadn't updated Scrapy to the latest version available at the time (1.5.1), they were most likely hit by the gunzip issue mentioned above, and this issue is not reproducible at all on current versions. If it does reproduce on Scrapy 1.5.1 or newer, then we need to look at the specific, strange sitemap that caused the RAM overuse (the URL of that particular "bad sitemap" is required to reproduce the issue).
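
If someone can share such a URL, a quick way to check whether the parse itself is to blame is to measure it in isolation. A small sketch (the file path is a placeholder; the body should already be un-gzipped) that parses one sitemap with Scrapy's stock parser and reports the peak memory used by the parse:

import tracemalloc

from scrapy.utils.sitemap import Sitemap

# placeholder path: the suspected "bad sitemap", saved locally
with open("sitemap.xml", "rb") as f:
    body = f.read()

tracemalloc.start()
entries = sum(1 for _ in Sitemap(body))  # iterate every <url>/<sitemap> entry
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{entries} entries, peak memory during parse: {peak / 1024 ** 2:.1f} MiB")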
