
SitemapSpider will ignore sitemap with URLs like https://website.com/filename.xml?from=7155352010944&to=7482320519360 #6293

Open
seagatesoft opened this issue Mar 15, 2024 · 3 comments
@seagatesoft

Description

Some sitemaps have URLs with query parameters, for example:

  1. https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360
  2. https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203
  3. https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ

The current implementation of _get_sitemap_body fails to detect those URLs as sitemaps because it performs the following check:

if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
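For illustration, the suffix check cannot match once a query string follows the extension:

>>> url = "https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360"
>>> url.endswith(".xml")
False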

So far, I have worked around the issue by overriding _get_sitemap_body as follows:

def _get_sitemap_body(self, response):
    # Strip the query string before checking the file extension
    if response.url.split("?")[0].endswith(".xml"):
        return response.body
    return super()._get_sitemap_body(response)
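A slightly more general variant of the same workaround (a sketch, not part of the original report) parses the URL so that the query string and fragment are ignored and gzipped sitemaps are also covered:

from urllib.parse import urlsplit

def _get_sitemap_body(self, response):
    # Test only the path component, ignoring ?query and #fragment
    path = urlsplit(response.url).path
    if path.endswith((".xml", ".xml.gz")):
        return response.body
    return super()._get_sitemap_body(response)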
@wRAR wRAR added the bug label Mar 16, 2024
@Gallaecio
Member

It might be worth finding out why the earlier isinstance(response, XmlResponse) check did not work for those, though. I suspect #5204 might help here.
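One quick way to see which response class Scrapy assigns to a given URL is the scrapy shell; for these URLs it reports XmlResponse, which the reproduction below also confirms:

$ scrapy shell "https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360"
>>> type(response)
<class 'scrapy.http.response.xml.XmlResponse'>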

@GeorgeA92
Contributor

GeorgeA92 commented Mar 19, 2024

I am not able to reproduce this locally on plain Scrapy v2.11.0.

script.py
import scrapy
from scrapy.crawler import CrawlerProcess as Cp

class SitemapTestSpider(scrapy.spiders.sitemap.SitemapSpider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360',
        'https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203',
        'https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ'
        ]

    def _get_sitemap_body(self, response):
        # Only log the response class (returning None on purpose), so every
        # sitemap ends up reported as "invalid" but we can see how each
        # response was classified; "!!!" marks an XmlResponse.
        # self.logger.info(f"data for {response.url}")
        # headers = '\n\t\t'.join([f"{k}:{v}" for k,v in response.headers.items()])
        # self.logger.info(f"{headers}")
        self.logger.info(
            f"{'!!!' if isinstance(response, scrapy.http.XmlResponse) else ''}"
            f"{response.url} \n identified as {response.__class__} ")

if __name__ == "__main__":
    proc = Cp()
    proc.crawl(SitemapTestSpider)
    proc.start()
log output
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22631-SP0
2024-03-19 14:07:28 [scrapy.addons] INFO: Enabled addons:
[]
2024-03-19 14:07:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet Password: e7ff9d2a81697957
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-03-19 14:07:28 [scrapy.crawler] INFO: Overridden settings:
{'DOWNLOAD_DELAY': 1}
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-19 14:07:28 [scrapy.core.engine] INFO: Spider opened
2024-03-19 14:07:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203> (referer: None)
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360> (referer: None)
2024-03-19 14:07:29 [quotes] INFO: !!!https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203>
2024-03-19 14:07:29 [quotes] INFO: !!!https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360>
2024-03-19 14:07:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ> (referer: None)
2024-03-19 14:07:30 [quotes] INFO: !!!https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:30 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ>
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-19 14:07:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1082,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 456034,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 1.430959,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 3, 19, 13, 7, 30, 256888, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 7681561,
 'httpcompression/response_count': 3,
 'log_count/DEBUG': 4,
 'log_count/INFO': 13,
 'log_count/WARNING': 3,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 3, 19, 13, 7, 28, 825929, tzinfo=datetime.timezone.utc)}
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0


In this case, the response objects from all the mentioned URLs that reached the _get_sitemap_body method were identified as scrapy.http.response.xml.XmlResponse, which means the original _get_sitemap_body method of the sitemap spider should identify these responses as valid sitemaps via the condition if isinstance(response, XmlResponse):

def _get_sitemap_body(self, response):
    """Return the sitemap body contained in the given response,
    or None if the response is not a sitemap.
    """
    if isinstance(response, XmlResponse):
        return response.body

This isinstance check is evaluated before the URL suffix check:

if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
    return response.body
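As a minimal sanity check (a sketch, not from the thread; the spider name is arbitrary), an XmlResponse passes that first condition regardless of what its URL looks like:

from scrapy.http import XmlResponse
from scrapy.spiders import SitemapSpider

spider = SitemapSpider(name="test")
response = XmlResponse(
    url="https://hwpartstore.com/sitemap_products_8.xml?from=1&to=2",
    body=b"<urlset></urlset>",
)
# The isinstance branch returns the body without ever inspecting the URL
assert spider._get_sitemap_body(response) == b"<urlset></urlset>"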

> It might be worth finding out why the earlier isinstance(response, XmlResponse) check did not work for those, though. I suspect #5204 might help here.

Originally, Scrapy creates a plain Response object, since the body contains binary compressed data. Later, in HttpCompressionMiddleware.process_response, after decompression, the response object is recreated as an XmlResponse instance compatible with SitemapSpider:

respcls = responsetypes.from_args(
    headers=response.headers, url=response.url, body=decoded_body
)
kwargs = {"cls": respcls, "body": decoded_body}
if issubclass(respcls, TextResponse):
    # force recalculating the encoding until we make sure the
    # responsetypes guessing is reliable
    kwargs["encoding"] = None
response = response.replace(**kwargs)
if not content_encoding:
    del response.headers["Content-Encoding"]
return response
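This class selection can be reproduced in isolation; a minimal sketch (the header value and URL are illustrative) using scrapy.responsetypes:

from scrapy.http import Headers
from scrapy.responsetypes import responsetypes

# An XML Content-Type (and/or an .xml path) maps to XmlResponse
cls = responsetypes.from_args(
    headers=Headers({"Content-Type": "text/xml"}),
    url="https://hwpartstore.com/sitemap_products_8.xml?from=1&to=2",
)
print(cls)  # <class 'scrapy.http.response.xml.XmlResponse'>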

@wRAR
Member

wRAR commented Mar 20, 2024

Is it possible that the original problem happens on an older Scrapy version or with some SitemapSpider methods overridden? @seagatesoft
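(For anyone debugging this, a quick way to rule out both possibilities; a sketch, with the override check written against the affected spider class:)

import scrapy
from scrapy.spiders import SitemapSpider

print(scrapy.__version__)  # confirm the Scrapy version actually in use

def overrides_get_sitemap_body(spider_cls):
    # True if the spider (or an intermediate base class) replaced the method
    return spider_cls._get_sitemap_body is not SitemapSpider._get_sitemap_body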
