Sitemap spider not robust against wrong sitemap URLs in robots.txt #2390

Closed
redapple opened this issue Nov 9, 2016 · 3 comments

redapple (Contributor) commented Nov 9, 2016

The "specs" do say that the URL should be a "full URL":

You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:
Sitemap: http://www.example.com/sitemap.xml

But some robots.txt files use relative URLs.

Example: http://www.asos.com/robots.txt

User-agent: *
Sitemap: /sitemap.ashx
Sitemap: http://www.asos.com/sitemap.xml
Disallow: /basket/
(...)

Spider:

from scrapy.spiders import SitemapSpider


class TestSpider(SitemapSpider):
    name = "test"
    sitemap_urls = [
        'http://www.asos.com/robots.txt',
    ]

    def parse(self, response):
        self.logger.info('parsing %r' % response.url)

Logs:

$ scrapy runspider spider.py
2016-11-09 17:46:19 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 17:46:19 [scrapy] DEBUG: Crawled (200) <GET http://www.asos.com/robots.txt> (referer: None)
2016-11-09 17:46:19 [scrapy] ERROR: Spider error processing <GET http://www.asos.com/robots.txt> (referer: None)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 36, in _parse_sitemap
    yield Request(url, callback=self._parse_sitemap)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /sitemap.ashx
2016-11-09 17:46:19 [scrapy] INFO: Closing spider (finished)
2016-11-09 17:46:19 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 291,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1857,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 11, 9, 16, 46, 19, 332383),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2016, 11, 9, 16, 46, 19, 71714)}
2016-11-09 17:46:19 [scrapy] INFO: Spider closed (finished)
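
For reference, the error boils down to the relative path being passed straight to Request; resolving it against the robots.txt URL first yields a valid absolute URL. A minimal illustration with the standard library's urljoin (urlparse.urljoin on Python 2), using the asos.com URLs above:

from urllib.parse import urljoin

# A relative sitemap location is resolved against the robots.txt URL:
print(urljoin('http://www.asos.com/robots.txt', '/sitemap.ashx'))
# -> http://www.asos.com/sitemap.ashx

# An absolute sitemap URL passes through unchanged:
print(urljoin('http://www.asos.com/robots.txt', 'http://www.asos.com/sitemap.xml'))
# -> http://www.asos.com/sitemap.xml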
@redapple redapple added the bug label Nov 9, 2016
@redapple redapple changed the title from "Sitemap spider not robust against wrong sitemap URLs" to "Sitemap spider not robust against wrong sitemap URLs in robots.txt" Nov 9, 2016
elacuesta (Member) commented Nov 15, 2016

@redapple: maybe I'm overlooking something, but do we need something more complex than just adding

if not re.match('https?://', url):
    url = response.urljoin(url)

or even

if not url.startswith('http'):
    url = response.urljoin(url)

right before https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/sitemap.py#L36?
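
For context, a rough sketch of how that check could sit inside SitemapSpider._parse_sitemap (method body abbreviated; sitemap_urls_from_robots is the helper from scrapy.utils.sitemap used in the linked source; this is only an illustration of the idea, not a final patch):

import re

from scrapy.http import Request
from scrapy.utils.sitemap import sitemap_urls_from_robots


def _parse_sitemap(self, response):
    if response.url.endswith('/robots.txt'):
        for url in sitemap_urls_from_robots(response.text):
            # Resolve relative sitemap locations such as '/sitemap.ashx'
            # against the robots.txt URL before building the Request.
            if not re.match(r'https?://', url):
                url = response.urljoin(url)
            yield Request(url, callback=self._parse_sitemap)
    # ... handling of <sitemapindex> and <urlset> responses omitted ...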

redapple (Contributor, Author) commented Nov 15, 2016

@elacuesta, it could be that simple, I haven't had a look :)
Patch welcome!

elacuesta (Member) commented Nov 15, 2016

Patch submitted! :-) ☝️

@redapple redapple added this to the v1.3 milestone Nov 16, 2016
@redapple redapple added this to the v1.2.2 milestone Nov 30, 2016
@redapple redapple removed this from the v1.3 milestone Nov 30, 2016