
SitemapSpider will ignore sitemap with URLs like https://website.com/filename.xml?from=7155352010944&to=7482320519360 #6293

Open
seagatesoft opened this issue Mar 15, 2024 · 3 comments
@seagatesoft

Description

Some sitemaps have URLs with query parameters, for example:

  1. https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360
  2. https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203
  3. https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ

The current implementation of _get_sitemap_body fails to detect those URLs as sitemaps because it performs the following check:

if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
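For illustration, the suffix check cannot match once a query string follows the extension:

>>> url = "https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360"
>>> url.endswith(".xml")
False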

So far, I have worked around the issue by overriding _get_sitemap_body as follows:

def _get_sitemap_body(self, response):
    # Strip the query string before checking the file extension
    if response.url.split("?")[0].endswith(".xml"):
        return response.body
    return super()._get_sitemap_body(response)
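A slightly more general variant of the same workaround (a sketch, not part of the original report) parses the URL so that the query string and fragment are ignored and gzipped sitemaps are also covered:

from urllib.parse import urlsplit

def _get_sitemap_body(self, response):
    # Test only the path component, ignoring ?query and #fragment
    path = urlsplit(response.url).path
    if path.endswith((".xml", ".xml.gz")):
        return response.body
    return super()._get_sitemap_body(response)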
@wRAR wRAR added the bug label Mar 16, 2024
@Gallaecio
Member

It might be worth finding out why the earlier isinstance(response, XmlResponse) check did not work for those, though. I suspect #5204 might help here.
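One quick way to see which response class Scrapy assigns to a given URL is the scrapy shell; for these URLs it reports XmlResponse, which the reproduction below also confirms:

$ scrapy shell "https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360"
>>> type(response)
<class 'scrapy.http.response.xml.XmlResponse'>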

@GeorgeA92
Contributor

GeorgeA92 commented Mar 19, 2024

I am not able to reproduce this locally on plain Scrapy v2.11.0.

script.py
import scrapy
from scrapy.crawler import CrawlerProcess as Cp

class SitemapTestSpider(scrapy.spiders.sitemap.SitemapSpider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360',
        'https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203',
        'https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ'
        ]

    def _get_sitemap_body(self, response):
        # Only log the response class (returning None on purpose), so every
        # sitemap ends up reported as "invalid" but we can see how each
        # response was classified; "!!!" marks an XmlResponse.
        # self.logger.info(f"data for {response.url}")
        # headers = '\n\t\t'.join([f"{k}:{v}" for k,v in response.headers.items()])
        # self.logger.info(f"{headers}")
        self.logger.info(
            f"{'!!!' if isinstance(response, scrapy.http.XmlResponse) else ''}"
            f"{response.url} \n identified as {response.__class__} ")

if __name__ == "__main__":
    proc = Cp()
    proc.crawl(SitemapTestSpider)
    proc.start()
log output
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22631-SP0
2024-03-19 14:07:28 [scrapy.addons] INFO: Enabled addons:
[]
2024-03-19 14:07:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet Password: e7ff9d2a81697957
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-03-19 14:07:28 [scrapy.crawler] INFO: Overridden settings:
{'DOWNLOAD_DELAY': 1}
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-19 14:07:28 [scrapy.core.engine] INFO: Spider opened
2024-03-19 14:07:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203> (referer: None)
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360> (referer: None)
2024-03-19 14:07:29 [quotes] INFO: !!!https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203>
2024-03-19 14:07:29 [quotes] INFO: !!!https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360>
2024-03-19 14:07:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ> (referer: None)
2024-03-19 14:07:30 [quotes] INFO: !!!https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ 
 identified as <class 'scrapy.http.response.xml.XmlResponse'> 
2024-03-19 14:07:30 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ>
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-19 14:07:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1082,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 456034,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 1.430959,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 3, 19, 13, 7, 30, 256888, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 7681561,
 'httpcompression/response_count': 3,
 'log_count/DEBUG': 4,
 'log_count/INFO': 13,
 'log_count/WARNING': 3,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 3, 19, 13, 7, 28, 825929, tzinfo=datetime.timezone.utc)}
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0


In this case, the response objects from all the mentioned URLs that reached the _get_sitemap_body method were identified as scrapy.http.response.xml.XmlResponse, which means the original _get_sitemap_body method of the sitemap spider should identify these responses as valid sitemaps via the condition if isinstance(response, XmlResponse):

def _get_sitemap_body(self, response):
    """Return the sitemap body contained in the given response,
    or None if the response is not a sitemap.
    """
    if isinstance(response, XmlResponse):
        return response.body

This isinstance check is evaluated before the URL suffix check:

if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
    return response.body
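As a minimal sanity check (a sketch, not from the thread; the spider name is arbitrary), an XmlResponse passes that first condition regardless of what its URL looks like:

from scrapy.http import XmlResponse
from scrapy.spiders import SitemapSpider

spider = SitemapSpider(name="test")
response = XmlResponse(
    url="https://hwpartstore.com/sitemap_products_8.xml?from=1&to=2",
    body=b"<urlset></urlset>",
)
# The isinstance branch returns the body without ever inspecting the URL
assert spider._get_sitemap_body(response) == b"<urlset></urlset>"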

> It might be worth finding out why the earlier isinstance(response, XmlResponse) check did not work for those, though. I suspect #5204 might help here.

Originally, Scrapy creates a plain Response object, since the body contains binary compressed data. Later, in HttpCompressionMiddleware.process_response, after decompression, the response object is recreated as an XmlResponse instance compatible with SitemapSpider:

respcls = responsetypes.from_args(
    headers=response.headers, url=response.url, body=decoded_body
)
kwargs = {"cls": respcls, "body": decoded_body}
if issubclass(respcls, TextResponse):
    # force recalculating the encoding until we make sure the
    # responsetypes guessing is reliable
    kwargs["encoding"] = None
response = response.replace(**kwargs)
if not content_encoding:
    del response.headers["Content-Encoding"]
return response
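This class selection can be reproduced in isolation; a minimal sketch (the header value and URL are illustrative) using scrapy.responsetypes:

from scrapy.http import Headers
from scrapy.responsetypes import responsetypes

# An XML Content-Type (and/or an .xml path) maps to XmlResponse
cls = responsetypes.from_args(
    headers=Headers({"Content-Type": "text/xml"}),
    url="https://hwpartstore.com/sitemap_products_8.xml?from=1&to=2",
)
print(cls)  # <class 'scrapy.http.response.xml.XmlResponse'>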

@wRAR
Member

wRAR commented Mar 20, 2024

Is it possible that the original problem happens on an older Scrapy version or with some SitemapSpider methods overridden? @seagatesoft
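(For anyone debugging this, a quick way to rule out both possibilities; a sketch, with the override check written against the affected spider class:)

import scrapy
from scrapy.spiders import SitemapSpider

print(scrapy.__version__)  # confirm the Scrapy version actually in use

def overrides_get_sitemap_body(spider_cls):
    # True if the spider (or an intermediate base class) replaced the method
    return spider_cls._get_sitemap_body is not SitemapSpider._get_sitemap_body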
