Extremely poor docs.python.org SEO performance. #1691

nelhage · 2020-11-27T19:36:50Z

Describe the bug
docs.python.org has atrocious search performance on Google. It's so bad that I suspect Google is actively downranking it for some reason.

To Reproduce
Search Google for virtually any Python documentation topic. The ones that drove me here were searches for [python set] and [python shuffle list].

Expected behavior
docs.python.org is the authoritative source for Python documentation on the web. I expect to find relevant results on docs.python.org somewhere on the first page of Google search results.

Instead, I find the opposite. For [python set] I expect to find https://docs.python.org/3/library/stdtypes.html#set somewhere on the first page of results, but instead, the only python.org result I see is for the long-deprecated Python 2 sets module -- https://docs.python.org/2/library/sets.html

For [python shuffle list], neither https://docs.python.org/3/library/random.html nor any other python.org result shows up anywhere on the first page.

Screenshots

Additional context
These results are egregious enough to make me suspect you're being actively downranked for some reason. This isn't a request for general SEO optimization -- although that'd be a great project if someone has the interest -- but for a domain admin to try to use Google's search console (https://developers.google.com/search) to investigate if there's something egregiously wrong with an easy fix.

ewdurbin · 2020-12-07T18:50:19Z

@JulienPalard Any ideas here? Do you have/need access to the google search console for docs?

JulienPalard · 2020-12-13T23:02:46Z

I do have access to the search console, but I don't think I'm of any help from an SEO point of view.

The search console is telling us our "mobile ergonomy is bad", it's probably just that? I heard Google is ranking mobile pages first.

We have had a PR open since May to make the docs responsive: python/python-docs-theme#46. I don't know if it could help.

JulienPalard · 2021-02-08T14:43:11Z

Today, Google, for the python set search, displays a big box linking to w3schools first ☹ and https://docs.python.org/fr/3/tutorial/datastructures.html 2nd, and no links to our doc for python shuffle list.

Unlike many spammy websites, docs.python.org does not have a page dedicated to those topics, so they probably win because it's in the title of their page, and the more specific we go, the more they'll win.

For example, for python control flow we land 1st because our tutorial has a page with control flow in the title and URL (https://docs.python.org/3/tutorial/controlflow.html).

On the other hand, for python remove all duplicate items from a list, stackoverflow is obviously first, and spammy w3schools comes 2nd with an exact title match too, and bad examples, followed by many spammers (dumps of StackOverflow DB I bet for most of them).

An obvious way would be to write a page for all those topics, but user-generated content is equally good at this job (stack overflow typically), with less contributor bottleneck. I'd leave this question to the to-be-created new doc sig.

Sadly, if we don't do it, spammers will always rank first, with bad quality or outdated content, just so they can display their ads.

di · 2022-04-21T18:46:17Z

Another issue is that the search engines often seem to prefer docs for older releases than for newer releases, e.g.:

The missing description is probably also hurting us. The 'learn why' link goes to https://support.google.com/webmasters/answer/7489871?hl=en

jacobian · 2022-04-21T20:09:59Z

That other issue you mentioned, @di, is something I've noticed about the Django docs as well -- google often seems to rank older versions much higher. I wonder if that problem could be solved with a rel=canonical?

alex · 2022-04-21T20:12:39Z

FWIW the Rust packages docs also had this problem, and seemed to have solved it, but I can't remember how (and ironically, googling it is useless) -- I don't see rel=canonical on docs.rs pages so there may be another tactic in addition.

ericholscher · 2022-04-21T20:24:03Z

It looks like Python is indexing old versions, and they are disallowed in the robots.txt: https://docs.python.org/robots.txt -- at least the link that @di is pointing at. That is probably the first thing I'd try to fix, but agreed that canonicalization can sometimes help. Google is quite fickle though, and it's hard to understand how to fix this. We've tried a number of different things.

We have lots more tips here: https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html.

I'm guessing the Python docs history of non-mobile friendly design has probably hurt it a lot over time. I believe that's fixed now though.

The first step I'd do is probably add canonical links to /3/, since I believe that is the "canonical" version. (It looks like y'all are already doing that though, so yea, the unknowable Google indexing conundrum lives on).
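To make "canonical links to /3/" concrete, here is a minimal sketch of how a versioned docs URL could be mapped to its /3/ counterpart. The function names and the assumption that the version is always the first path segment (as in /3.4/ or /2.7/) are mine, not something taken from the docs build scripts, and the mapped page may not exist under /3/ if it was renamed or removed.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the version is the first path segment, e.g. /3.4/ or /2.7/.
_VERSION_PREFIX = re.compile(r"^/\d+(\.\d+)?/")


def canonical_url(url: str, stable: str = "3") -> str:
    """Rewrite the leading version segment of a docs URL to the stable alias."""
    parts = urlsplit(url)
    path = _VERSION_PREFIX.sub(f"/{stable}/", parts.path, count=1)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def canonical_link_tag(url: str) -> str:
    """Build the <link> element that would go in the page's <head>."""
    return f'<link rel="canonical" href="{canonical_url(url)}" />'


print(canonical_link_tag("https://docs.python.org/3.4/library/stdtypes.html"))
```

This only rewrites the URL; it does not check that the target page actually exists.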

The next step is definitely diving into Google Search Console for what it says there.

davidfischer · 2022-04-21T21:40:46Z

> The next step is definitely diving into Google Search Console for what it says there.

Just to echo Eric, this should definitely be the next step. Whoever has access to y'alls search console will get a lot of details about what Google is doing. For example, there may be something Google sees as spam or duplicated and they've taken some downranking action against the domain.

I looked at the str.strip example from @di's screenshot. It is hard for the Python documentation to compete with a site that has a whole page about the str.strip method with examples, especially when you consider that the Python docs have a single mention of the method in the middle of a 2,000-word page. However, there are a few things that would help. I noticed that while some new versions do set the canonical URL, the 3.4 docs do not have rel=canonical on them (look here). This is probably why they're continuing to show up in results. You might also need to let the robots through temporarily so they can index that change once you change it.

A larger (and harder to fix) issue is that a lot of the Python documentation isn't written with search engines in mind. I would tackle issues with robots, sitemaps, and search console first, but this might be worth a look afterwards. Just to give a couple concrete examples:

The term regex barely appears on the regular expressions module page (not in great context and not in the first paragraph) but does on the howto page. This is probably why the latter ranks better when searching for something like "python regex". It's a bunch of work, but going through your core docs pages and making sure the title and first 2-3 sentences of the main content describe the page pretty well is probably worth it. Y'all don't use meta descriptions, so those 2-3 sentences are what will show up on the search engine results page.
Some pages like the built-in types page (where str.strip appears) are going to really struggle with ranking because they cover a lot of ground on a lot of different topics. Instead, I'd consider having a page on iterators, a page on sets, a page on boolean types, etc.
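A rough way to audit the first point is to check whether a search term shows up in a page's title and first paragraph. The sketch below uses only the standard library; the class and function names are hypothetical, the sample HTML is made up for illustration, and treating the first <p> element as "the first 2-3 sentences" is a simplifying assumption.

```python
from html.parser import HTMLParser


class _TitleAndLead(HTMLParser):
    """Collect the <title> text and the text of the first <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.lead = ""
        self._in_title = False
        self._in_p = False
        self._lead_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p" and not self._lead_done:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            self._lead_done = True  # ignore later paragraphs

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.lead += data


def term_coverage(html: str, term: str) -> dict:
    """Report whether `term` appears (case-insensitively) in the title and lead."""
    parser = _TitleAndLead()
    parser.feed(html)
    needle = term.lower()
    return {
        "title": needle in parser.title.lower(),
        "lead": needle in parser.lead.lower(),
    }


# Made-up sample page, loosely modelled on a module reference page.
SAMPLE = (
    "<html><head><title>re - Regular expression operations</title></head>"
    "<body><p>This module provides regular expression matching operations.</p>"
    "<p>A regex is compiled with compile().</p></body></html>"
)
print(term_coverage(SAMPLE, "regex"))
```

In this made-up sample the term "regex" only appears in the second paragraph, so both checks come back false, which is the situation described above.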

davidfischer · 2022-04-21T22:27:34Z

> FWIW the Rust packages docs also had this problem, and seemed to have solved it, but I can't remember how (and ironically, googling it is useless) -- I don't see rel=canonical on docs.rs pages so there may be another tactic in addition.

Probably this ticket: rust-lang/rust#12466

JulienPalard · 2022-04-22T12:24:33Z

Looks like we have canonical links to /3/ since 3.5:

$ for v in $(seq 10); do echo $v $(curl  https://docs.python.org/3.$v/library/stdtypes.html | grep canonical); done
1
2
3
4
5 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
6 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
7 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
8 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
9 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
10 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
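The same check can be done without grep by parsing the HTML. This is a small sketch using the standard library's html.parser; the helper names are hypothetical and the sample markup is trimmed down from the output above.

```python
from html.parser import HTMLParser


class _CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also end up here by default.
        if tag == "link" and self.href is None:
            attributes = dict(attrs)
            if (attributes.get("rel") or "").lower() == "canonical":
                self.href = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in `html`, or None if there is none."""
    finder = _CanonicalFinder()
    finder.feed(html)
    return finder.href


with_tag = '<head><link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" /></head>'
without_tag = '<head><link rel="stylesheet" href="_static/pydoctheme.css" /></head>'
print(find_canonical(with_tag))
print(find_canonical(without_tag))
```

To run it against the live site, the page body could be fetched with urllib.request.urlopen and decoded before being passed to find_canonical.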

It could probably be fixed with some proper sed-fu, but is it worth it, as previous versions are denied by robots.txt:

$ curl https://docs.python.org/robots.txt
Sitemap: https://docs.python.org/sitemap.xml

# Prevent development and old documentation from showing up in search results.
User-agent: *
Disallow: /dev
Disallow: /release

# Disallow EOL versions
Disallow: /2/
Disallow: /2.0/
Disallow: /2.1/
Disallow: /2.2/
Disallow: /2.3/
Disallow: /2.4/
Disallow: /2.5/
Disallow: /2.6/
Disallow: /2.7/
Disallow: /3.0/
Disallow: /3.1/
Disallow: /3.2/
Disallow: /3.3/
Disallow: /3.4/
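To see what these rules mean for a crawler, here is a small sketch using the standard library's urllib.robotparser. The robots.txt text is a condensed sample of the file above, written without a blank line inside the rule group because this particular parser treats a blank line as the end of a group; Google's own parser may behave differently. The is_crawlable helper and the "Googlebot" agent string are illustrative choices.

```python
from urllib.robotparser import RobotFileParser

# Condensed sample of the rules shown above (no blank line inside the group).
ROBOTS_TXT = """\
Sitemap: https://docs.python.org/sitemap.xml

User-agent: *
Disallow: /dev
Disallow: /release
Disallow: /2/
Disallow: /3.4/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "https://docs.python.org/3.4/library/stdtypes.html",
    "https://docs.python.org/3/library/stdtypes.html",
):
    print(url, is_crawlable(ROBOTS_TXT, url))
```

A page that is disallowed here is never fetched by a well-behaved crawler, so a canonical tag added to it cannot be read while the rule is in place.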

JulienPalard · 2022-04-22T12:30:29Z

> It is hard for the Python documentation to compete with a site that has a whole page about the str.strip

I totally agree, I don't think we can beat those, to the point I wonder if we should do the same: for the most searched functions, to build a dedicated "howto" or "tutorial", with up-to-date good practices, examples, and so on.

But I don't feel my English level is good enough to start this kind of project ☹

davidfischer · 2022-04-22T16:23:12Z

> It could probably be fixed with some proper sed-fu, but is it worth it, as previous versions are denied by robots.txt:

I would definitely fix it. This will stop version 3.4 from showing up in Google's results. You may have to open up the robots.txt for a while after making the change but I'm not sure there.

As to whether to take on a huge docs reformatting/rework project, it's a terrifying never-ending project of incremental improvement. I'd fix all the concrete easy things (like 3.4 docs showing up in search engines) first.

gpshead · 2023-02-08T00:35:09Z

Any update on stuffing the older docs that lack rel=canonical information with canonical tags? 3.3 and such are still showing up on top in many searches such as Googling for zip site:docs.python.org

hugovk · 2023-02-09T06:31:06Z

@JulienPalard Please could you use your sed-fu?

JulienPalard · 2023-02-09T21:12:49Z

> @JulienPalard Please could you use your sed-fu?

I can propose:

docsbuild@docs:/srv/docs.python.org/release/3.4.10$ find -name '*.html' | while read -r file; do sed -i '/link rel="shortcut icon/{s|$|\n    <link rel="canonical" href="https://docs.python.org/3/'"$file"'" />|;s|/\./|/|g}' "$file"; done

Followed by:

curl -XPURGE https://docs.python.org/3.4/{$(find -name '*.html' | sed 's|^./||g' | tr '\n' ,)}
curl -XPURGE https://docs.python.org/3.4/{$(find -name '*.html' | sed 's|^./||g' | grep index.html | sed 's/index.html//g' | tr '\n' ,)}

to clean the cache.
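For readers less fluent in sed, the insertion step can be sketched in Python as follows. This is not the command that was run: the add_canonical helper is hypothetical, it strips a leading "./" from the path rather than rewriting "/./" the way the sed expression does, and unlike the sed one-liner it skips files that already carry a canonical tag.

```python
def add_canonical(html: str, rel_path: str, base: str = "https://docs.python.org/3/") -> str:
    """Insert a canonical <link> right after the shortcut-icon line of a page."""
    if 'rel="canonical"' in html:
        return html  # already tagged, so running twice changes nothing
    if rel_path.startswith("./"):
        rel_path = rel_path[2:]  # find prints paths as ./library/foo.html
    tag = f'    <link rel="canonical" href="{base}{rel_path}" />'
    out = []
    inserted = False
    for line in html.split("\n"):
        out.append(line)
        if not inserted and 'link rel="shortcut icon' in line:
            out.append(tag)
            inserted = True
    return "\n".join(out)


# Made-up, trimmed-down page head for illustration.
page = '<head>\n    <link rel="shortcut icon" type="image/png" href="../_static/py.png" />\n</head>'
print(add_canonical(page, "./library/stdtypes.html"))
```

Applying it to a whole tree would mean walking the *.html files, for example with pathlib.Path.rglob, and writing each result back; the cache purge would still be a separate step.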

I just ran it for 3.4; tell me if I should go ahead with 3.0, 3.1, 3.2, and 3.3, or if you see an issue.

$ curl  https://docs.python.org/3.4/library/ | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/index.html" />
$ curl  https://docs.python.org/3.4/library/stdtypes.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
$ curl  https://docs.python.org/3.4/library/functions.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/functions.html" />

hugovk · 2023-02-11T21:21:09Z

Canonical looks good at https://docs.python.org/3.4/library/, as does https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Fdocs.python.org%2F3.4%2Flibrary%2F

What do others say? Good to do 3.0 - 3.3?

gpshead · 2023-02-11T23:59:31Z

makes sense, go ahead for the earlier 3s as well.

JulienPalard · 2023-02-12T21:49:08Z

$ for v in $(seq 10); do echo $v $(curl  https://docs.python.org/3.$v/library/stdtypes.html | grep canonical); done
1 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
2 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
3 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
4 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
5 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
6 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
7 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
8 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
9 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
10 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />

davidfischer · 2023-03-07T16:36:28Z

This is definitely a step in the right direction, but Google hasn't indexed it yet. I'm not sure whether you need to open up the robots.txt temporarily or you might see if you could submit the docs through Google Search Console for reindexing first.

I did verify that the canonical tag is on that page so it should be picked up eventually.

di · 2023-03-07T16:54:48Z

I agree they should be temporarily removed from robots.txt -- perhaps even permanently? I think that blocking them was likely a misguided attempt to remove these from the search results. The crawler should deprioritize them based on the canonical link.

That said, I'm not seeing how the current robots.txt is being generated, does anyone know?

di · 2023-03-07T16:57:31Z

Never mind, I see that it's here: https://github.com/python/docsbuild-scripts/blob/3a75c4dcac91e25d6188b750b7beb0546d40eb90/templates/robots.txt#L8-L22

JulienPalard · 2023-03-07T17:22:56Z

@di I just removed them: python/docsbuild-scripts@c49181f

di · 2023-03-07T17:41:00Z

@JulienPalard Thanks! Let me know when that's deployed and I can submit them for reindexing!

di · 2023-03-08T15:24:57Z

I see that the robots.txt is updated and submitted these for indexing. However, when looking at the search console for the sitemap, I found that Google is not indexing any of the URLs included:

This is because the sitemap provides a URL like https://docs.python.org/3/ but this has a canonical link to https://docs.python.org/3/index.html which isn't in the sitemap:

I think we probably need to a) make sure the canonical URLs are in the sitemap and b) put many more URLs into the sitemap (possibly, every URL we have). Right now, the sitemap only includes:

https://docs.python.org/3.12/
https://docs.python.org/3.11/
https://docs.python.org/3.10/
https://docs.python.org/3.9/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3/

And doesn't include any older Python versions or any sub-pages.
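The mismatch between sitemap entries and canonical URLs could be checked with something like the sketch below. The sitemap snippet is a shortened sample, the canonical_of mapping is supplied by hand here (in practice it would come from reading each page's canonical tag), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened sample; the real sitemap lists more versions.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.python.org/3.11/</loc></url>
  <url><loc>https://docs.python.org/3/</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> value of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def non_canonical_entries(xml_text: str, canonical_of: dict) -> list:
    """Return (sitemap_url, canonical_url) pairs where the two differ."""
    return [
        (url, canonical_of[url])
        for url in sitemap_urls(xml_text)
        if url in canonical_of and canonical_of[url] != url
    ]


# The one mapping reported in this comment; other pages are left out.
canonical_of = {"https://docs.python.org/3/": "https://docs.python.org/3/index.html"}
print(non_canonical_entries(SITEMAP_XML, canonical_of))
```

Any pair it returns is a sitemap entry that points at a URL the page itself declares as non-canonical.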

di · 2023-03-08T15:25:53Z

Additionally, here's a breakdown with the top reasons why pages aren't being indexed:

di · 2023-03-08T15:29:39Z

Also, to resolve the issue in OP, where https://docs.python.org/2/library/sets.html is the 2nd result for https://www.google.com/search?q=python+set, I think we probably need to update canonical tags and remove robots.txt blocking for all EOL versions.

di · 2023-03-19T18:19:03Z

Another thing: it seems like for translations, our canonical URLs should be pointing to the translated versions of these pages with:

<link rel="alternate" hreflang="a-different-language" ...>

https://developers.google.com/search/blog/2010/09/unifying-content-under-multilingual
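As an illustration, the alternate links for one page could be generated like this. The URL layout follows the translated URL quoted earlier in the thread (https://docs.python.org/fr/3/tutorial/datastructures.html); the language list, the treatment of English as the unprefixed version, and the hreflang_links name are assumptions made for the sketch.

```python
def hreflang_links(page_path, languages, version="3", base="https://docs.python.org"):
    """Build one <link rel="alternate"> element per language for a page."""
    links = []
    for lang in languages:
        # Assumption: English lives at /3/..., translations at /<lang>/3/...
        prefix = "" if lang == "en" else f"/{lang}"
        href = f"{base}{prefix}/{version}/{page_path}"
        links.append(f'<link rel="alternate" hreflang="{lang}" href="{href}" />')
    return links


for link in hreflang_links("tutorial/datastructures.html", ["en", "fr"]):
    print(link)
```

Each language version of the page would carry the full set of links, including one pointing at itself.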

di · 2023-03-19T18:33:11Z

It seems like many 3.x pages are still missing canonical tags as well:

$ curl -s https://docs.python.org/3/library/email.examples.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/email.examples.html" />
    
$ curl -s https://docs.python.org/3.4/library/email-examples.html | grep canonical

$

JulienPalard · 2023-03-22T06:58:56Z

> It seems like many 3.x pages are still missing canonical tags as well

Ohhh interesting!

Those lost their canonical tag automatically because it pointed to a 404: it was generated by dumb sed rather than by a human who knows that email-examples.html was renamed to email.examples.html.

So we have to find all pages like this...

$ cd 3.4/
$ grep -L 'rel="canonical"' **/*.html
howto/webservers.html
library/_dummy_thread.html
library/asyncio-eventloops.html
library/binhex.html
library/dummy_threading.html
library/email-examples.html
library/email.util.html
library/formatter.html
library/fpectl.html
library/macpath.html
library/misc.html
library/othergui.html
library/parser.html
library/symbol.html
library/undoc.html
using/scripts.html

and fix them manually... in the end some will still not have a 'canonical', or at least not one to /3/, if they don't appear on /3/.
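The manual pass could be supported by a small helper that separates pages with a valid /3/ target from pages that have none. This is only a sketch: the function name is hypothetical, the page sets are tiny made-up samples, and the rename table contains just the one rename mentioned above.

```python
def canonical_targets(old_pages, current_pages, renames=None):
    """Split old pages into those with a /3/ target and those without one.

    Returns (resolved, missing): `resolved` maps each old page to its path
    under /3/, and `missing` lists pages that have no counterpart there.
    """
    renames = renames or {}
    resolved = {}
    missing = []
    for page in sorted(old_pages):
        target = renames.get(page, page)
        if target in current_pages:
            resolved[page] = target
        else:
            missing.append(page)
    return resolved, missing


# Tiny made-up samples of what the 3.4 and /3/ trees might contain.
old_pages = {"library/stdtypes.html", "library/email-examples.html", "library/fpectl.html"}
current_pages = {"library/stdtypes.html", "library/email.examples.html"}
renames = {"library/email-examples.html": "library/email.examples.html"}

resolved, missing = canonical_targets(old_pages, current_pages, renames)
print(resolved)
print(missing)
```

Pages that end up in the missing list are the ones that would be left without a canonical link to /3/.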

hugovk mentioned this issue Oct 4, 2022

Get URL's version number inside conf.py? python/docsbuild-scripts#137

Closed
