Generate general sitemap.xml for projects #5122

Merged
merged 22 commits into from Feb 19, 2019

Conversation

@humitos
Member

@humitos humitos commented Jan 16, 2019

This PR makes Read the Docs generate a general (not project-specific) sitemap.xml, served at the root of the project at /sitemap.xml, based on the discussion in #557.

I think it would be good to split this in two phases:

  1. Generate a general sitemap.xml for all projects, with no option to customize it (this PR as is)
  2. Check whether the project already generates its own sitemap.xml and, if so, generate a project-specific one using sitemapindex instead of the general one

Example with a toy project running locally:

$ http http://fastcgi-for-net.dev.readthedocs.io:8000/sitemap.xml
HTTP/1.0 200 OK
Content-Language: en
Content-Length: 1713
Content-Type: application/xml
Date: Wed, 16 Jan 2019 21:15:44 GMT
Server: WSGIServer/0.2 CPython/3.6.6
Vary: Accept-Language, Cookie

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/latest/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/latest/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/latest/"/>
    
    <lastmod>2019-01-16T21:13:53.325602+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/stable/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/stable/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/stable/"/>
    
    <lastmod>2019-01-16T10:54:16.792524+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
  
  <url>
    <loc>http://fastcgi-for-net.dev.readthedocs.io:8000/bn/make-docs-dir-a-list/</loc>
    
    <xhtml:link
        rel="alternate"
        hreflang="en"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/en/make-docs-dir-a-list/"/>
    
    <xhtml:link
        rel="alternate"
        hreflang="bn"
        href="http://fastcgi-for-net.dev.readthedocs.io:8000/bn/make-docs-dir-a-list/"/>
    
    <lastmod>2019-01-16T11:03:10.324661+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  
</urlset>
$ 
    yield c

while True:
    yield 'monthly'
Member Author

@humitos humitos Jan 16, 2019

I'm using monthly because I think never is too aggressive. If the tag is removed and a branch is created with the same name, we will want bots to revisit this.

NOTE: maybe this should be a comment in the code itself.
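
For reference, the changefreq sequence under discussion can be sketched as a small generator. This is only a sketch: the function name matches the one used in the diff, but the body and the 'daily' then 'weekly' ordering are assumptions taken from the example output in the PR description.

```python
import itertools


def changefreqs_generator():
    """Yield 'daily', then 'weekly', then 'monthly' forever.

    'monthly' (rather than 'never') is the final value so that crawlers
    still revisit a version whose tag was replaced by a branch with the
    same name.
    """
    for changefreq in ('daily', 'weekly'):
        yield changefreq

    # Every version after the first two gets the same value.
    while True:
        yield 'monthly'


# islice() takes a finite prefix of the endless generator.
first_five = list(itertools.islice(changefreqs_generator(), 5))
print(first_five)
```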

for version, priority, changefreq in zip(
        sorted_versions, priorities_generator(), changefreqs_generator()):
    element = {
        'loc': version.get_subdomain_url(),
Member Author

@humitos humitos Jan 16, 2019

The URL should be properly escaped: https://www.sitemaps.org/protocol.html#escaping

Member Author

@humitos humitos Feb 4, 2019

I'm not sure if there is something we need to do here, actually.
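
As an illustration of what the linked escaping rules ask for, here is a minimal sketch using only the standard library. The helper name `sitemap_loc` is hypothetical and is not part of this PR; the sample URL is the one used in the sitemaps.org protocol page.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def sitemap_loc(url):
    """Hypothetical helper: URL-escape, then entity-escape, a URL for <loc>."""
    # Percent-encode non-ASCII and unsafe characters as UTF-8 bytes, leaving
    # URL delimiters and existing %-escapes untouched.
    url_escaped = quote(url, safe=":/?&=#%")
    # Entity-escape what is left (in practice, '&' becomes '&amp;').
    return escape(url_escaped)


escaped = sitemap_loc("http://www.example.com/ümlat.php&q=name")
print(escaped)
```

Whether this is needed in practice depends on how the template renders the value, so treat it as a reference for the rules rather than a required change.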

@humitos humitos force-pushed the humitos/sitemap-xml branch from 88beea8 to 9c05020 Jan 17, 2019
    iteration. After 0.1 is reached, it will keep returning 0.1.
    """
    priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    yield from itertools.chain(priorities, itertools.repeat(0.1))
Member Author

@humitos humitos Jan 17, 2019

yield from is not valid Python 2 syntax :(
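
If Python 2 compatibility were needed, one option (a sketch, not what this PR does) would be to return the chained iterator directly instead of delegating to it. Callers that only iterate over the result should not notice the difference.

```python
import itertools


def priorities_generator():
    """Return an iterator over 1, 0.9, ..., 0.2 and then 0.1 forever."""
    priorities = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    # Returning the chain avoids the Python 3-only delegation syntax.
    return itertools.chain(priorities, itertools.repeat(0.1))


values = list(itertools.islice(priorities_generator(), 11))
print(values[:3], values[-2:])
```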

Contributor

@jdillard jdillard Jan 18, 2019

Typo on line 288: change change
and line 293: this one i not

Member Author

@humitos humitos Jan 20, 2019

Fixed!

@humitos
Member Author

@humitos humitos commented Jan 17, 2019

This is ready to be merged. However, it uses yield from, which is not valid Python 2 syntax. I don't think we want to rewrite it to be Python 2 compatible, since we are deprecating Python 2: #4543

(we should probably already remove it from travis and tox)



@map_project_slug
# TODO: make this cache dependent on the project's slug
Contributor

@davidfischer davidfischer Jan 31, 2019

The docs for cache_page say it "is keyed off of the URL". Looking at the code, it looks to me like they mean the fully qualified URL, including the host, so I think we're OK.

Member Author

@humitos humitos Feb 4, 2019

When I read that I wasn't sure if by URL it meant the path or the full URL.

Digging a little into the code I found that this line is the one that generates the cache key:

https://github.com/django/django/blob/b9cf764be62e77b4777b3a75ec256f6209a57671/django/utils/cache.py#L314

Which returns the absolute URL as you said. Thanks. We are good!

https://docs.djangoproject.com/en/1.11/ref/request-response/#django.http.HttpRequest.build_absolute_uri
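
To illustrate why keying on the absolute URL is enough here, below is a simplified stand-in for the key derivation: it hashes the full URL, host included. It is not Django's actual implementation; `make_cache_key`, the key prefix, and the project hostnames are hypothetical and used for this sketch only.

```python
import hashlib


def make_cache_key(absolute_url, key_prefix="sitemap"):
    """Hypothetical helper: derive a cache key from the full URL."""
    digest = hashlib.sha256(absolute_url.encode("utf-8")).hexdigest()
    return "{}.{}".format(key_prefix, digest)


# Same path, different subdomain: the keys must not collide.
key_a = make_cache_key("https://project-a.readthedocs.io/sitemap.xml")
key_b = make_cache_key("https://project-b.readthedocs.io/sitemap.xml")
print(key_a != key_b)
```

Because each project is served from its own host, two projects requesting /sitemap.xml get separate cache entries without any extra per-slug logic.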

context = {
    'versions': versions,
}
return render(request, 'sitemap.xml', context, content_type='application/xml')
Contributor

@davidfischer davidfischer Jan 31, 2019

Any reason you went with a template instead of Django's builtin sitemap framework? https://docs.djangoproject.com/en/1.11/ref/contrib/sitemaps/

The framework might have some advantages in case we ever get so large we need to break up sitemaps into multiple files.

Member Author

@humitos humitos Feb 4, 2019

Ignorance.

That said and after reading the documentation, I'm not sure how to:

  • get the project from inside the Sitemap class (I think it doesn't have access to the request object --where we could check for .slug property, for example)
  • generate different locations for translations that are not based on LANGUAGE variable but on project.translations

Sitemap objects look pretty clean and clear but I'd need some help on those two things to be able to make it work using the framework.


versions = []
for version, priority, changefreq in zip(
        sorted_versions, priorities_generator(), changefreqs_generator()):
Contributor

@davidfischer davidfischer Jan 31, 2019

If I'm reading this right:

  • latest will always have priority=1 and changefreq=daily
  • stable will always have priority=0.9 and changefreq=weekly
  • other versions will have decreasing priorities and changefreq=monthly

Is that right?

Wouldn't it be better to just guess at a priority and changefrequency based on the last build date? If there is no last build date (version was never built), we don't include the version.

Member Author

@humitos humitos Feb 4, 2019

You are reading it right, yes.

> Wouldn't it be better to just guess at a priority and changefrequency based on the last build date?

Our idea behind this was that you (as author of the project) want to point your readers to the latest version published first. That's why the priority works like that: latest, stable, v1.5, v1.4, etc in that order... I think that build date is not associated with priority.

Regarding changefreq, I have a similar opinion: I expect the latest versions to change more frequently than v0.1.

I wouldn't complicate the logic for this.

> If there is no last build date (version was never built), we don't include the version.

I'm including only active versions, as we also do on the flyout. This could change once more states are implemented: #4001 (comment)
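
The position-based assignment described in this thread can be sketched as follows. `Version` here is a hypothetical stand-in for the real model, the URLs are made up, and the two iterator helpers are simplified assumptions, not the PR's code.

```python
import itertools
from collections import namedtuple

# Hypothetical stand-in for the Version model, for illustration only.
Version = namedtuple("Version", ["slug", "url"])


def priorities():
    return itertools.chain(
        [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2], itertools.repeat(0.1))


def changefreqs():
    return itertools.chain(["daily", "weekly"], itertools.repeat("monthly"))


def build_entries(sorted_versions):
    """Pair each version with a priority and changefreq by position."""
    entries = []
    # zip() stops at the shortest input, so the endless iterators are safe.
    for version, priority, changefreq in zip(
            sorted_versions, priorities(), changefreqs()):
        entries.append({
            "loc": version.url,
            "priority": priority,
            "changefreq": changefreq,
        })
    return entries


versions = [
    Version("latest", "https://example.readthedocs.io/en/latest/"),
    Version("stable", "https://example.readthedocs.io/en/stable/"),
    Version("v1.5", "https://example.readthedocs.io/en/v1.5/"),
    Version("v1.4", "https://example.readthedocs.io/en/v1.4/"),
]
entries = build_entries(versions)
for entry in entries:
    print(entry)
```

The ordering of `sorted_versions` is what drives the result: whichever version comes first gets priority 1 and 'daily', regardless of when it was last built.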

@humitos
Member Author

@humitos humitos commented Feb 4, 2019

I added some brief docs for the sitemap to at least communicate that we are doing this automatically, without going too deep into its implementation. Once the sitemap index is implemented, we can extend this page to cover that feature.

@humitos humitos force-pushed the humitos/sitemap-xml branch from 03043d6 to 3de35d0 Feb 4, 2019
@humitos humitos requested a review from Feb 4, 2019
Contributor

@davidfischer davidfischer left a comment

This change looks good to me.

I will monitor our organic traffic over the next month and make sure there isn't a dip from this change. Based on my understanding of sitemaps, I think this should be positive, but SEO is a bit of a dark art.

I know this is a first step but here are some improvements we can do in future iterations:

  • Have a sub-sitemap per version. Our root sitemap (slug.readthedocs.io/sitemap.xml) could point to version specific sitemaps (slug.readthedocs.io/en/1.x/sitemap.xml) that have entries for each HTML file.
  • We can support user submitted sitemaps. If the user has a sitemap.xml file in their build output, our root sitemap could point to it or maybe it replaces the root sitemap.
  • Use the Django sitemap features. I think this is slightly better than using an XML template if possible because there are a few intricacies for sitemaps (max 50k entries per file, etc.) that we won't hit with this implementation but we might if we expand it.
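
A root sitemap that points at per-version sitemaps, as suggested in the first bullet, would be a sitemap index in the sitemaps.org protocol. Here is a minimal sketch of generating one with the standard library; `build_sitemap_index` is a hypothetical helper, and the URLs follow the pattern mentioned above with an assumed https scheme.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document (as a string) listing sitemap_urls."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        # ElementTree entity-escapes the text when serializing.
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(root, encoding="unicode")


index_xml = build_sitemap_index([
    "https://slug.readthedocs.io/en/latest/sitemap.xml",
    "https://slug.readthedocs.io/en/1.x/sitemap.xml",
])
print(index_xml)
```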

@jdillard
Contributor

@jdillard jdillard commented Feb 9, 2019

I was looking through the changes and it doesn't seem the sitemapindex is added to the robots.txt file, which would likely make it easier for the search engines to find.

If you think that is worth implementing, here is an example of what I am talking about: https://github.com/jdillard/sphinx-sitemap#getting-the-most-out-of-the-sitemap

@humitos
Member Author

@humitos humitos commented Feb 9, 2019

> doesn't seem the sitemapindex is added to the robots.txt file, which would likely make it easier for the search engines to find.

I think this is a good addition, and it should be easy to add to the default robots.txt returned by Read the Docs at https://github.com/rtfd/readthedocs.org/blob/799480827f20f04ded7239a6307853c721de39fa/readthedocs/core/views/serve.py#L312
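
As a rough sketch of what that could look like, the function below builds a robots.txt body that advertises the sitemap through the `Sitemap:` directive. The function name and the permissive `User-agent`/`Allow` lines are assumptions for illustration, not the actual default served by Read the Docs.

```python
def default_robots_txt(sitemap_url):
    """Hypothetical helper: permissive robots.txt that lists the sitemap."""
    lines = [
        "User-agent: *",
        "Allow: /",
        # The Sitemap directive takes a fully qualified URL.
        "Sitemap: {}".format(sitemap_url),
    ]
    return "\n".join(lines) + "\n"


robots = default_robots_txt("https://slug.readthedocs.io/sitemap.xml")
print(robots)
```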

@humitos humitos force-pushed the humitos/sitemap-xml branch from ab64878 to 6b3cf9f Feb 12, 2019
@humitos humitos requested a review from Feb 12, 2019
@humitos humitos force-pushed the humitos/sitemap-xml branch from 1bb65e3 to 21b0015 Feb 12, 2019
Member

@ericholscher ericholscher left a comment

Looks simple enough. I could see needing to play with the priorities or something over time, but this is definitely better than no sitemap (hopefully :)

@@ -0,0 +1,19 @@
Sitemaps
Member

@ericholscher ericholscher Feb 14, 2019

Is this linked from anywhere? It should be in a toctree somewhere.

Member Author

@humitos humitos Feb 18, 2019

Yes. I wanted to link it under Feature Documentation but I forgot to do it.

I will move this file to docs/features/sitemaps.rst, and it will be linked automatically in that section.

    raise Http404

sorted_versions = sort_version_aware(
    project.versions.filter(
Member

@ericholscher ericholscher Feb 14, 2019

Any reason not to use public() on the queryset here?

Member

@ericholscher ericholscher Feb 14, 2019

I believe it also defaults to only active projects, but you can pass only_active=True as well.

Member Author

@humitos humitos Feb 18, 2019

No specific reason. I just changed it to Version.objects.public(project=project, only_active=True)

@humitos humitos dismissed stale reviews from davidfischer and ericholscher via 6482bac Feb 18, 2019
@humitos
Member Author

@humitos humitos commented Feb 18, 2019

I just pushed the changes suggested in the feedback. I will merge this PR once the tests pass.

@humitos
Member Author

@humitos humitos commented Feb 18, 2019

Mmm... It seems that I can't merge: since I pushed new changes, I need a new approval now:

At least 1 approving review is required by reviewers with write access.

@humitos humitos requested a review from ericholscher Feb 18, 2019
@humitos humitos merged commit f1c15d4 into master Feb 19, 2019
1 check passed
@delete-merged-branch delete-merged-branch bot deleted the humitos/sitemap-xml branch Feb 19, 2019

5 participants