SitemapSpider memory issues #3529

Open
altunyurt opened this issue Dec 11, 2018 · 4 comments

Comments

@altunyurt

I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps, each containing 50k URLs.
Even processing a single sitemap on its own ends up eating all the memory on a 6 GB machine, let alone the millions of URLs across all the sitemaps in the index.

IIRC, the parser keeps the whole document in memory during the operation, so I've monkey-patched scrapy.utils.sitemap.Sitemap with the following snippet. The original code is at https://stackoverflow.com/a/12161078

# patching scrapy/utils/sitemap.py
import io

import lxml.etree


class Sitemap(object):
    """Class to parse Sitemap (type=urlset) and Sitemap Index
    (type=sitemapindex) files"""

    def __init__(self, xmltext):
        # xmltext is the raw sitemap body (bytes) as passed in by SitemapSpider
        self.type = "urlset" if b"urlset" in xmltext[:100] else "sitemapindex"

        tag = {"urlset": "url", "sitemapindex": "sitemap"}[self.type]
        # iterparse expects a file-like object; sitemap elements are
        # namespaced, so match the tag in any namespace
        self._root = lxml.etree.iterparse(
            io.BytesIO(xmltext),
            tag="{*}" + tag,
            events=("end",),
            recover=True,
            remove_comments=True,
            resolve_entities=False,
        )

    def __iter__(self):
        for event, elem in self._root:
            loc = elem.find("{*}loc")
            if loc is None:
                continue
            # note: unlike the stock parser, only "loc" is exposed here
            # (lastmod, priority and alternate links are dropped)
            yield {"loc": loc.text}
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del self._root

Using it as follows allowed me to run the task much more smoothly, using at most 3 GB at any time.

worker.py
-------------------------
import scrapy
import scrapy.spiders.sitemap

from core.patches.sitemap import Sitemap

__all__ = ["Spider"]

# patch the module where SitemapSpider actually looks Sitemap up
scrapy.spiders.sitemap.Sitemap = Sitemap


class Spider(scrapy.spiders.SitemapSpider):
    sitemap_urls = [sitemap1, sitemap2, ...]

    def parse(self, response):
        ...

I can create a pull request if this is ok.

@kmike
Member

kmike commented Dec 28, 2018

See also: #605

@joaquingx
Contributor

joaquingx commented Jan 8, 2019

Hi @altunyurt, have you made any progress on this? I would like to work on it too 😄

@kinoute
Contributor

kinoute commented Jul 11, 2022

@altunyurt @joaquingx I would be interested in using iterparse for a sitemap spider too, but I wasn't able to make it work with your snippet.

I'm in the same boat: I need to process a lot of (nested) sitemaps where I'm only interested in extracting every URL and sending it straight to a queue, without any further processing or Scrapy request. I'd like to reduce Scrapy's RAM usage while it processes the sitemaps.

I tried overriding the _parse_sitemap method to remove the Scrapy Request for every URL, which works fine for my case, but I think the memory usage could still be much better.
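
For reference, here is a minimal sketch of that kind of override (the spider name and sitemap URL are placeholders; it assumes the stock SitemapSpider internals such as _get_sitemap_body, and it skips the robots.txt handling that the original method performs):

import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.utils.sitemap import Sitemap


class UrlCollectorSpider(SitemapSpider):
    name = "url_collector"  # placeholder
    sitemap_urls = ["https://example.com/sitemap_index.xml"]  # placeholder

    def _parse_sitemap(self, response):
        body = self._get_sitemap_body(response)
        if body is None:
            self.logger.warning("Ignoring invalid sitemap: %s", response.url)
            return
        s = Sitemap(body)
        if s.type == "sitemapindex":
            # still follow nested sitemaps
            for entry in s:
                yield scrapy.Request(entry["loc"], callback=self._parse_sitemap)
        elif s.type == "urlset":
            # yield the URL itself instead of scheduling a Request for it,
            # so an item pipeline can push it straight to a queue
            for entry in s:
                yield {"url": entry["loc"]}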

@GeorgeA92
Contributor

I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps each having 50k urls.
Even trying each sitemap alone ends up eating all the memory on a 6gb machine

With recent versions of Scrapy, a single sitemap with 50k URLs cannot, in most cases, consume 6 GB of RAM as described in this issue.

However, that was easily possible on older versions (1.5.0 and earlier), which were probably affected by #2658 (caused by issues in gunzip). That gunzip issue was extremely painful, in terms of RAM usage, for large 50k-URL sitemaps.

It was fixed as a result of #3281. The next Scrapy version containing that fix, 1.5.1, was released on 2018-07-12, five months before this issue was raised.

If the user simply hadn't updated Scrapy to the latest version available at the time (1.5.1), they were most likely hit by the gunzip issue mentioned above, and this issue is not reproducible at all on current versions. If it does reproduce on Scrapy 1.5.1 or newer, then we need to look at the specific, strange sitemap that caused the RAM overuse (the URL of that particular "bad sitemap" is required to reproduce the issue).
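
If someone can share such a URL, a quick way to check whether the parse itself is to blame is to measure it in isolation. A small sketch (the file path is a placeholder; the body should already be un-gzipped) that parses one sitemap with Scrapy's stock parser and reports the peak memory used by the parse:

import tracemalloc

from scrapy.utils.sitemap import Sitemap

# placeholder path: the suspected "bad sitemap", saved locally
with open("sitemap.xml", "rb") as f:
    body = f.read()

tracemalloc.start()
entries = sum(1 for _ in Sitemap(body))  # iterate every <url>/<sitemap> entry
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{entries} entries, peak memory during parse: {peak / 1024 ** 2:.1f} MiB")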
