Generate a sitemap.xml #557

Closed
2xyo opened this issue Dec 1, 2013 · 24 comments
Labels: Accepted (Accepted issue on our roadmap), Feature (New feature)

Comments

@2xyo

2xyo commented Dec 1, 2013

Note: This is about enhancing SEO.

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling.

The detailed technical specifications are available here.

Why?

  • It gives search engines more granular information about the relative importance of pages in the documentation:
    • The latest or stable version of the documentation is more important and should appear first in search results.
  • It tells search engines how often the crawler should come back:
    • This would decrease the load on the server, because old versions of the documentation would be crawled less often.

Proposals

  • Set priority as:
    • 1 for the pages of the latest or stable version. This option could be set in conf.py.
    • for each following version, decrease the priority by 0.1
    • 0.1 for the pages of all remaining versions if there are more than 9 versions.
  • Use the timestamp of the source file for lastmod
  • Set changefreq as:
    • daily for the pages of the latest version
    • weekly for the pages of the latest tagged version
    • never for the pages of other versions
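
The rules above can be sketched in a few lines of Python. This is only an illustration: the function name `sitemap_hints` is hypothetical, and it assumes the version slugs arrive ordered with the latest version first, the most recent tagged version second, and older versions after that.

```python
def sitemap_hints(versions):
    """Map each version slug to a (priority, changefreq) pair.

    Assumption: `versions` is ordered newest first, i.e. latest,
    then the most recent tagged version, then older versions.
    """
    hints = {}
    for index, slug in enumerate(versions):
        # 1.0 for the first version, minus 0.1 per following version,
        # never going below 0.1. Integer arithmetic avoids float drift.
        priority = max(10 - index, 1) / 10
        if index == 0:
            changefreq = "daily"
        elif index == 1:
            changefreq = "weekly"
        else:
            changefreq = "never"
        hints[slug] = (priority, changefreq)
    return hints


hints = sitemap_hints(["latest", "1.6.x", "1.5.x"])
```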

Example

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://django.readthedocs.org/en/latest/</loc>
      <lastmod>2013-12-01T19:20:30.45+01:00</lastmod>
      <changefreq>daily</changefreq>
      <priority>1</priority>
   </url>
   <url>
      <loc>http://django.readthedocs.org/en/1.6.x/</loc>
      <lastmod>2013-11-30T19:20:30.45+01:00</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.9</priority>
   </url>
   <url>
      <loc>http://django.readthedocs.org/en/1.5.x/</loc>
      <lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
      <changefreq>never</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://django.readthedocs.org/en/0.1.x/</loc>
      <lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
      <changefreq>never</changefreq>
      <priority>0.1</priority>
   </url>
</urlset>

Implementation

We currently have logic in the code base for determining version order. We could just subtract 0.1 for each successive supported version until we hit 0.1. We could also treat tags and branches differently: since tags should never change, their pages can be marked as changing much less frequently.
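
A minimal sketch of how such a file could be produced with the Python standard library, assuming the per-version values (URL, last modification date, change frequency, priority) have already been computed elsewhere. The function name `build_urlset` is hypothetical and is not taken from the Read the Docs code base.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Serialize (loc, lastmod, changefreq, priority) tuples as a sitemap."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        # The protocol expects a value between 0.0 and 1.0.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_urlset([
    ("http://django.readthedocs.org/en/latest/", "2013-12-01", "daily", 1.0),
    ("http://django.readthedocs.org/en/1.6.x/", "2013-11-30", "weekly", 0.9),
])
```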

Bonus Points

@cydrobolt

Is this feature being worked on? I can work on this if no one has worked on it yet.

@ericholscher
Member

A theme that is doing this with an extension: https://github.com/guzzle/guzzle_sphinx_theme/blob/master/guzzle_sphinx_theme/__init__.py#L30

@ericholscher
Member

Another interesting approach: https://github.com/openstack/openstack-doc-tools/tree/master/sitemap

@jdillard
Contributor

jdillard commented Mar 30, 2018

I took the sitemap logic out of guzzle_sphinx_theme and made it an extension/package here: https://github.com/jdillard/sphinx-sitemap

It's my first time making a package, but several people are using it successfully and I have it running in a few production environments myself.

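Some are even using it on RTD, for example:

  • build: sitemap.xml (http://docs.bonobo-project.org/en/master/sitemap.xml)
  • source: conf.py (https://github.com/python-bonobo/bonobo/blob/71039ddcb125a6bf6681cb590dd775d3d8e30dea/docs/conf.py#L25)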

@ericholscher
Member

Neat, this is definitely something we could incorporate into the standard build process.

@jdillard
Contributor

jdillard commented Apr 2, 2018

Great! This has been a project to help me learn new things and I'm very much still learning, so let me know if you need anything from me.

@akien-mga
Contributor

akien-mga commented Jun 12, 2018

I'd also support this feature being available in the standard build process, as it might be especially relevant for multilingual RTD projects, see https://en.wikipedia.org/wiki/Sitemaps#Multilingual_and_multinational_Sitemaps

Context: For Godot Engine, we recently put up localized RTD instances, most of which are still over 80% English text while translators work on things string by string. Search engines seem to have taken a particular liking to the Ukrainian instance for English queries, which puzzles many users. I hope the sitemap trick mentioned in the above link could fix that.

(I'll try jdillard's extension in the meantime)
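
For readers unfamiliar with the multilingual sitemap technique mentioned in this comment: each `<url>` entry lists every language variant of the page, itself included, as an `xhtml:link` alternate with an `hreflang` attribute. A rough sketch using the Python standard library follows; the function name and the example.org URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize the sitemap namespace as the default one and XHTML as "xhtml:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_multilingual_urlset(translations):
    """`translations` maps an hreflang code to the URL of one page."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in translations.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # Every entry lists all language variants, including itself.
        for alt_lang, alt_loc in translations.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                rel="alternate",
                hreflang=alt_lang,
                href=alt_loc,
            )
    return ET.tostring(urlset, encoding="unicode")


multilingual_xml = build_multilingual_urlset({
    "en": "https://docs.example.org/en/latest/",
    "uk": "https://docs.example.org/uk/latest/",
})
```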

@jdillard
Contributor

@akien-mga I created a PR, jdillard/sphinx-sitemap#15, on my extension that adds support for multilingual sitemaps if you want to test it out and leave feedback there. I don't have much first hand experience with multi-lingual sphinx/RTD setups, so I might have missed some nuances.

@humitos
Member

humitos commented Oct 17, 2018

> Neat, this is definitely something we could incorporate into the standard build process.

@ericholscher What's your idea to accomplish this?

I'm thinking of installing the sphinx-sitemap extension by default (together with the other default packages that we install) and enabling it when the append_conf method is called on the user's conf.py, along with setting site_url to the project's canonical_url.

What do you think? Is this the path to follow?

humitos added this to the "Low priority features we want" milestone Oct 17, 2018

@davidfischer
Contributor

There are a few challenges with sitemaps. One challenge is that a sitemap is normally at /sitemap.xml by default. You can also specify where it is in robots.txt. So the sitemap should not be language specific (although you can have sub-sitemaps). It won't be at /$lang/latest/sitemap.xml.

One possibility is to make https://project.readthedocs.io/sitemap.xml a dynamic page which scans the different versions and translations under that domain for sitemap.xml files and links to them as sub-sitemaps. Another possibility is to let people upload a sitemap file that applies to all their versions and translations. Perhaps the simplest (but not a totally complete solution) is to just have the sitemap link to the root of different versions and translations.
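
As a small illustration of the robots.txt mechanism mentioned above: a `Sitemap:` line in robots.txt tells crawlers where the sitemap lives, and Python's standard `urllib.robotparser` (3.8 or later) can read it back. The robots.txt content below is a made-up example, not what Read the Docs actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that advertises the sitemap location.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://project.readthedocs.io/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
# site_maps() returns the list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```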

@jdillard
Contributor

jdillard commented Oct 17, 2018

@davidfischer You can also use a sitemapindex to manage multiple sitemaps. I'm not sure if the RTD build process could create that file (containing links to the sub-sitemaps) and place it in the root directory.
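
For reference, a sitemapindex is a small XML file whose `<sitemap>` entries each point at one sub-sitemap. A minimal sketch of generating one in Python; the function name is hypothetical and the sub-sitemap URLs are only examples.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Serialize a list of sub-sitemap URLs as a sitemapindex document."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


index_xml = build_sitemap_index([
    "https://project.readthedocs.io/en/latest/sitemap.xml",
    "https://project.readthedocs.io/en/stable/sitemap.xml",
])
```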

@jdillard
Contributor

@humitos I didn't realize there was already a html_baseurl config value that could have been used instead of site_url, and was thinking about switching to using that instead (with a backwards compatibility check for site_url). I'm not sure if that would make things easier.
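
The backwards-compatibility check described here could look roughly like the following. This is a sketch, not the extension's actual code: `resolve_base_url` is a hypothetical helper, and `SimpleNamespace` stands in for the Sphinx config object.

```python
from types import SimpleNamespace


def resolve_base_url(config):
    """Prefer html_baseurl, fall back to the older site_url setting."""
    return (
        getattr(config, "html_baseurl", "")
        or getattr(config, "site_url", "")
        or None
    )


# A project that only sets the older option still resolves to a URL.
legacy_config = SimpleNamespace(html_baseurl="", site_url="https://docs.example.org/")
base_url = resolve_base_url(legacy_config)
```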

@davidfischer
Contributor

> @davidfischer You can also use a sitemapindex to manage multiple sitemaps. I'm not sure if the RTD build process could create that file (containing links to the sub-sitemaps) and place it in the root directory.

This is exactly what I'm thinking! It is possible that RTD could dynamically generate the root sitemap rather than creating/updating it when builds happen.

@humitos
Member

humitos commented Oct 18, 2018

To pull everything together and move on to the next step, we need to decide:

  1. whether we want to generate the sitemap for all projects automatically (by following what I said in this comment) or rely on the user to install and set all of this up.
  2. whether /sitemap.xml will be a Django view that searches for all the sitemap.xml files in the project's output generated by Sphinx at build time and builds the sitemap index from those files.
    • how much impact would this have on the servers?
    • can we use a long cache time for this? 1 week? 1 month?

I personally like the idea of doing all of this automatically, but in that case we need to consider whether some users might not want this for some particular reason (it could also be an option in the admin).

humitos added the "Needed: design decision" label (A core team decision is required) Oct 18, 2018

@davidfischer
Contributor

How about we do a combination of both! Here's my proposal:

  • We make /sitemap.xml a Django view. When requested, it looks for $lang/$version/sitemap.xml files and includes them in a sitemapindex as @jdillard proposes.
  • If no sub-sitemaps are found, instead it just has entries in the sitemap for all the different language/version combinations (as in the "example" section in the original proposal for this issue).
  • I think this can be safely cached for at least a couple days and maybe a week. I wouldn't go for a month.
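
The decision logic in this proposal can be sketched independently of Django. The example assumes, purely for illustration, that the built docs are laid out on a local filesystem as `$lang/$version/`; the function name `collect_sitemap_targets` is hypothetical, and caching is left out.

```python
from pathlib import Path


def collect_sitemap_targets(docs_root, base_url):
    """Return ("sitemapindex", urls) if sub-sitemaps exist, else ("urlset", urls)."""
    root = Path(docs_root)
    base = base_url.rstrip("/")
    # Sub-sitemaps are expected at $lang/$version/sitemap.xml.
    sub_sitemaps = sorted(root.glob("*/*/sitemap.xml"))
    if sub_sitemaps:
        urls = [f"{base}/{p.relative_to(root).as_posix()}" for p in sub_sitemaps]
        return "sitemapindex", urls
    # Fallback: one entry per language/version root.
    version_dirs = sorted(p for p in root.glob("*/*") if p.is_dir())
    urls = [f"{base}/{p.relative_to(root).as_posix()}/" for p in version_dirs]
    return "urlset", urls
```

The caller would then render the returned URLs either as a sitemapindex or as a plain urlset, and cache the response for a few days as suggested above.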

I don't think any users will actively not want this so I don't know if being able to disable it is critical in the first implementation.

@humitos
Member

humitos commented Oct 22, 2018

I like your proposal, @davidfischer

I think that we have something actionable now, and we can implement it. I'd love to see/get/receive a PR for this.

> I don't think any users will actively not want this so I don't know if being able to disable it is critical in the first implementation.

We will need to install a new dependency that could affect the build time (not too much, though), and that could bring new issues.

That was my only concern, but I think we are fine installing and running this by default. It's a new feature that will benefit all projects and may have a minimal impact on some particular projects (we could add a feature flag if we find problems with it).

@davidfischer
Contributor

> We will need to install a new dependency that could affect the build time (not too much, though), and that could bring new issues.

I proposed that we do not add the extra sphinx extension for generating sitemaps by default. I think users should opt-in to it.

If users choose not to opt-in, the sitemap we display would just point to the active versions:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
    <loc>http://django.readthedocs.org/en/latest/</loc>
    <lastmod>2013-12-01T19:20:30.45+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
 </url>
 <url>
    <loc>http://django.readthedocs.org/en/1.6.x/</loc>
    <lastmod>2013-11-30T19:20:30.45+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
 </url>
 <url>
    <loc>http://django.readthedocs.org/en/1.5.x/</loc>
    <lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
 </url>
 <url>
    <loc>http://django.readthedocs.org/en/0.1.x/</loc>
    <lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.1</priority>
 </url>
</urlset>

@ericholscher
Member

ericholscher commented Oct 22, 2018

Opt in sounds good (at least for now) -- we should however write a guide about how to enable it, documenting our integration and how users can enable it (once we build the integration :D)

@agjohnson
Contributor

I am on the fence (only slightly!) on making this a Django view. We've been talking more about pushing docs off our servers and to Azure storage and historically served docs entirely from nginx on the community side.

However, we could maybe redirect to a .org endpoint in Azure storage (similar to S3 redirects), or could reverse proxy the request to an API endpoint through Nginx. Worst case for an Azure implementation is we could just plop the sitemap index on the storage on any project save.

I would however be into making this a more integrated feature, like a sitemap: true in our YAML. I'd be 👍 on just making this the default too (perhaps eventually?)

@humitos
Member

humitos commented Jan 16, 2019

I created a proposal for this at #5122. This initial version does not allow users to serve their own generated sitemap.xml at the root (/sitemap.xml), but hopefully we will get there soon using the sitemapindex tag.

humitos added the "Accepted" label (Accepted issue on our roadmap) and removed the "Needed: design decision" label (A core team decision is required) Jan 16, 2019
davidfischer modified the milestones: 3.3, Search improvements Jan 31, 2019

@omidraha

omidraha commented Feb 5, 2019

> I took the sitemap logic out of guzzle_sphinx_theme and made it an extension/package here: https://github.com/jdillard/sphinx-sitemap
>
> It's my first time making a package, but several people are using it successfully and I have it running in a few production environments myself.
>
> Some are even using it on RTD, for example:
>
> * build: [sitemap.xml](http://docs.bonobo-project.org/en/master/sitemap.xml)
> * source: [conf.py](https://github.com/python-bonobo/bonobo/blob/71039ddcb125a6bf6681cb590dd775d3d8e30dea/docs/conf.py#L25)

If I add sphinx_sitemap to the extensions variable in the conf.py file, does it get installed automatically in the RTD build process?

@humitos
Member

humitos commented Feb 5, 2019

@omidraha not yet.

In #5122 we will be generating a general sitemap that will live at /sitemap.xml. In the near future, we will be generating sitemap indexes, which will allow you to generate your own sitemap via sphinx_sitemap; Read the Docs will recognize it and serve it.

@humitos
Member

humitos commented Feb 12, 2019

The PR with the general sitemap.xml generation is about to get merged. However, I want to link this comment from David here, since it's an important one to consider when working on the next phase (sitemap indexes and more).

@stsewd
Member

stsewd commented Feb 28, 2019

This is already implemented in #5122.

10 participants