Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SitemapSpider support rel="alternate" works not as expected #2853

Closed
jenya opened this issue Jul 26, 2017 · 4 comments
Closed

SitemapSpider support rel="alternate" works not as expected #2853

jenya opened this issue Jul 26, 2017 · 4 comments
Labels
bug
Milestone

Comments

@jenya
Copy link
Contributor

@jenya jenya commented Jul 26, 2017

This functionality was added here: #360
According to this https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/sitemap.py#L46 sitemap_alternate_links used for sitemap type sitemapindex, but not urlset, which is common use case for rel="alternate".
I don't think this is as expected, because that issue, docs and tests (which in fact tests xml parser, but not sitemap parser) in PR talks about urlset.
Please correct me if I'm wrong. Otherwise - I will prepare PR.

@redapple
Copy link
Contributor

@redapple redapple commented Jul 26, 2017

Hi @jenya , do you have an example website for which SitemapSpider does not collect alternate URLs as you would expect?
As far as I can see, https://github.com/scrapy/scrapy/blob/master/tests/test_utils_sitemap.py does test the sitemap body parser, and collects alternates when supplied the example body from #360

@jenya
Copy link
Contributor Author

@jenya jenya commented Jul 26, 2017

@redapple even that xml from #360 is an example
My initial message about test was incorrect - as it tests Sitemap, but not SitemapSpider. There is no issue in Sitemap, it's an issue in SitemapSpider in _parse_sitemap method.

Here is sample spider: https://gist.github.com/jenya/4e5277d74f435699d3ce2aea0b40f3ce
It should follow english version of pages, which is not main in sitemap.

@redapple
Copy link
Contributor

@redapple redapple commented Jul 26, 2017

You're right @jenya . I hadn't noticed the local filtering with iterloc.
Feel free to open an PR. Appreciated!

@redapple redapple added the bug label Jul 26, 2017
@redapple redapple added this to the v1.4.1 milestone Jul 27, 2017
kmike added a commit that referenced this issue Aug 21, 2017
[MRG+1] Follow alternate link for all types of sitemaps #2853
@redapple
Copy link
Contributor

@redapple redapple commented Aug 22, 2017

Fixed via #2854

@redapple redapple closed this Aug 22, 2017
@kmike kmike modified the milestones: v1.4.1, v1.5 Dec 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants