Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SitemapSpider support rel="alternate" works not as expected #2853

Closed
jenya opened this issue Jul 26, 2017 · 4 comments
Closed

SitemapSpider support rel="alternate" works not as expected #2853

jenya opened this issue Jul 26, 2017 · 4 comments
Labels
Milestone

Comments

@jenya
Copy link
Contributor

jenya commented Jul 26, 2017

This functionality was added here: #360
According to this https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/sitemap.py#L46 sitemap_alternate_links used for sitemap type sitemapindex, but not urlset, which is common use case for rel="alternate".
I don't think this is as expected, because that issue, docs and tests (which in fact tests xml parser, but not sitemap parser) in PR talks about urlset.
Please correct me if I'm wrong. Otherwise - I will prepare PR.

@redapple
Copy link
Contributor

redapple commented Jul 26, 2017

Hi @jenya , do you have an example website for which SitemapSpider does not collect alternate URLs as you would expect?
As far as I can see, https://github.com/scrapy/scrapy/blob/master/tests/test_utils_sitemap.py does test the sitemap body parser, and collects alternates when supplied the example body from #360

@jenya
Copy link
Contributor Author

jenya commented Jul 26, 2017

@redapple even that xml from #360 is an example
My initial message about test was incorrect - as it tests Sitemap, but not SitemapSpider. There is no issue in Sitemap, it's an issue in SitemapSpider in _parse_sitemap method.

Here is sample spider: https://gist.github.com/jenya/4e5277d74f435699d3ce2aea0b40f3ce
It should follow english version of pages, which is not main in sitemap.

@redapple
Copy link
Contributor

You're right @jenya . I hadn't noticed the local filtering with iterloc.
Feel free to open an PR. Appreciated!

@redapple redapple added the bug label Jul 26, 2017
@redapple redapple added this to the v1.4.1 milestone Jul 27, 2017
kmike added a commit that referenced this issue Aug 21, 2017
[MRG+1] Follow alternate link for all types of sitemaps #2853
@redapple
Copy link
Contributor

Fixed via #2854

@kmike kmike modified the milestones: v1.4.1, v1.5 Dec 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants