Proposal: Help search engines find latest version of a crate #1438
Comments
The only reservation I have is the URL scheme: I'd prefer to have
Ah, yep, that URL scheme does make more sense for those reasons. Updated the proposal.
@rust-lang/docs-rs Does this approach sound good? From the above, @pietroalbini is for it, but I'd like to get a little more buy-in from the team before I start work on it.
Hi @jsha, thanks for doing and pushing this! In general I only have superficial knowledge of SEO / search engines, so I can't really judge the best approach to optimizing search results. Still, some thoughts:
A lot of the most important links, in terms of ranking, come from outside docs.rs: for instance crates.io, GitHub, and pages discussing a crate. We don't control those directly, so we can't stick rel=nofollow on links that go to older versions. Even if we could control them, we'd have to update all of them whenever a new version is released, to attach rel=nofollow to any links to the old version's URL. Also, search engines no longer rank solely or even primarily based on links - real user navigations (measured various ways) count for a lot too. If users consistently navigate to the "latest" URL, that helps indicate that URL is more important.

Think of it this way: imagine each inbound link counts as 1 point and each navigation counts as 1 point. We currently have something like:

- /foo/0.1.0: 5 points

We'd really rather have:

- /foo/latest: 17 points
I'm not totally sure I understand the question, but I think it's: how will search engines find older versions of crates? They'll be linked from pages like https://docs.rs/crate/ureq/2.2.0.
The change to /latest/ is in place. I now suspect there may be issues with crawling, where crates are often not crawled deeply enough because there are so many crates, and so many pages per crate. I suspect the Google Search Console would let me dig deeper and find that out. Would the docs.rs team be willing to authorize me to look at Search Console data for docs.rs? It would involve publishing a DNS TXT record containing google-site-verification=XXXXXX, or serving a specific HTML file at the root of the domain.
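For illustration, the DNS verification variant is a single TXT record at the apex of the zone; the token below is a placeholder (Google issues the real value per property):

```
docs.rs.  3600  IN  TXT  "google-site-verification=XXXXXX"
```

The HTML-file variant is equivalent: Google checks for a specific file name at the domain root instead of querying DNS.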
from @pietroalbini:
Done.
Thanks! Here's our coverage report: 1.26M valid URLs, 3.28M excluded. The top reasons for exclusion are "Crawled - currently not indexed" and "Duplicate without user-selected canonical." Details on those reasons here: https://support.google.com/webmasters/answer/7440203#duplicate_page_without_canonical_tag

Drilling down, I can see samples of the "Crawled - currently not indexed" and "Duplicate without user-selected canonical" URLs, and can drill into individual URLs. What I conclude from this first look is that we probably do need to implement user-selected canonicals.
A couple of things that might be worth trying prior to the sledgehammer options:
Thank you for the ideas! I don't know enough about Google indexing to judge them, but I have one remark about the sitemap idea: it would increase the sitemap's size quite a lot. We have crates with more than 1 million files, which would be many pages to add to the sitemap.

I remember talking to @jyn514 about this. The only concern I would see is someone searching for an element which only exists in an old version of the library.
> This would definitely increase the sitemap's size quite a lot; we have crates with more than 1 million files, which would be many pages to add to the sitemap.

We could limit the links per crate to reduce this problem. Another problem is that sitemaps are limited to 50k URLs. We have a sitemap index linking to various sitemaps, but that can only be nested one level deep. Still, I like this idea. A related one: link to the crate root _and_ all.html from the sitemap.

> - Marking all links to specific versions (e.g. the dependencies and versions in the crate overview dropdown, links in doc comments or due to re-exporting) as rel=nofollow. This is just a hint that might minimise "passing along ranking credit to another page" (https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify), but perhaps worth a shot.

> I remember talking to @jyn514 about this. Only concern I would see is someone searching for an element which only exists in an old version of the library.

Fwiw, our current canonicalization plan has this problem too. I suspect we can solve it with some semver awareness, but first I'd like to see it working to improve search.

I like the idea of nofollow in the dependency links. That could reduce unnecessary crawling too.

By the way, pkg.go.dev and docs.python.org use canonicalization to the latest version.
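For context, the sitemap index mentioned above looks roughly like the following. Per the sitemap protocol, an index may only reference plain sitemaps (no nested indexes), and each child sitemap is capped at 50,000 URLs; the child sitemap URLs here are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each child sitemap may list at most 50,000 URLs. -->
  <sitemap>
    <loc>https://docs.rs/sitemaps/crates-a.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://docs.rs/sitemaps/crates-b.xml</loc>
  </sitemap>
</sitemapindex>
```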
I frankly don't remember anything about this problem 😅 happy to go with whatever you decide on.
@jsha @alecmocatta was also talking about setting old versions to noindex. With a hint which way helps, I'm happy to help too. Which links shouldn't be followed? If we fully exclude old versions, that's probably a different discussion, since we would completely exclude them from the index.

Thinking longer about this, it's a tough nut to crack. We don't have the generated docs in the database, only on S3, so generating the page list for a crate would involve an additional request to S3 per crate. I have some ideas for how to solve this, but it's only worth the effort as a last option, if I'm not missing something.
Thanks for clarifying. I think it makes sense to
Yep, I agree it's not that necessary. Particularly given we have all.html available if we want to help search engines consistently discover all items in each crate. Presumably for an all.html with 1M links, a search engine would disregard links beyond some cutoff.
Maybe just add all.html and the modules, structs and re-exports in the crate root (like bevy::prelude, bevy::app, ...)
Good idea in general, but in that regard docs.rs is "only" serving static files from rustdoc, so it has no detailed knowledge about the documentation files, apart from some exceptions (#1781). We have the list of source files for our source browser, with which we could generate the module list, but this would tightly bind docs.rs to rustdoc implementation details (its file structure). So it's similar effort, and only worth doing if it proves necessary :)
Update: checking the latest data from the Google Search Console, it is still finding many pages that are "duplicate without user-selected canonical", but spot-checking them, they are all crates that have a separate documentation URL and so are not getting the canonical link.

The Search Console allows exporting a report of click-through data, which turns out to be an interesting way to find examples of URLs with this problem: the pages with the highest click-through rates tend to be ones that have the "versioned URL" problem. For instance, https://docs.rs/rand/0.6.5/rand/fn.thread_rng.html is the page with the single highest click-through rate on docs.rs, presumably because people search for thread_rng.

I followed the thread_rng example further and "inspected" the URL in the Search Console. It turns out https://docs.rs/rand/0.6.5/rand/fn.thread_rng.html is considered canonical - it doesn't have a canonical link at all. Version 0.6.5 had a different documentation URL, https://rust-random.github.io/rand, and since we don't render the canonical link for crates with their own documentation URL, Google chooses its own canonical. I think we need to provide the canonical link regardless.

Another interesting result: https://docs.rs/futures/0.1.11/futures/future/type.BoxFuture.html also has a high click-through rate, and that URL does have a canonical link pointing at the latest version. In other words, Google sees our canonical link, parses it, and chooses to ignore it in favor of considering 0.1.11 to be the canonical URL. It's not clear why that is; perhaps version 0.1.11 has more inbound links, or has a long history of being canonical. (0.3.1 is the latest version for that crate.)
That makes sense to me; we can treat the self-hosted docs as canonical for the latest version only.
@jsha coming from your last comments regarding the Google check: what do you think about closing this issue, and possibly #74?
I have a more recent comment; we still haven't finished a full recrawl. I'm happy to close this issue since the basic proposal is done, but if we want to treat it as more of a tracking issue, we would keep it open, since there's still additional work to do.
On Feb 3 we deployed a change adding noindex to versioned rustdoc URLs. As of today, only 7 of the top 1000 pages visited from Google Search have a version in the URL; presumably those just haven't been recrawled yet to see the noindex directive. By contrast, as of July 2022, 305 of the top 1000 pages had a version in the URL. Of those 305 pages, 147 of them have their

I did stumble across one anomaly: [parse_macro_input], for which we used to rank #1, now points to

I'm satisfied that this issue is solved. 🎉
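The noindex directive described here can be delivered in the page head; a sketch of what a versioned page would carry (illustrative, not necessarily the exact markup docs.rs ships):

```html
<!-- On versioned pages only, e.g. /rand/0.6.5/rand/fn.thread_rng.html -->
<meta name="robots" content="noindex">
```

The same effect is also available via an X-Robots-Tag: noindex HTTP response header, which additionally works for non-HTML resources.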
Motivation and Summary
When someone uses a web search engine to find a crate’s documentation, they are likely to wind up on the documentation for a random older version of that crate. This can be confusing and frustrating if they are working on a project that depends on a more recent version of that crate. As an example: in April 2021, a Google search for [rustls serversession] links to version 0.5.5 of that crate, released Feb 2017. A Google search for [rustls clientsession] links to version 0.11.0, released Jan 2019. The latest version is 0.19.0, released Nov 2020.
To fix this, I propose that docs.rs's URL structure should be more like crates.io's: each crate should have an unversioned URL (docs.rs/rustls/latest) that always shows the docs for the latest version of that crate. There would continue to be versioned URLs like today (https://docs.rs/rustls/0.19.0/rustls/), accessible as defined below. I believe this will, over time, lead search engines to find the unversioned URL more often.
This is a popular request:
https://github.com/rust-lang/docs.rs/issues/1006
https://github.com/rust-lang/docs.rs/issues/854
https://github.com/rust-lang/docs.rs/issues/74
https://github.com/rust-lang/docs.rs/issues/1411
It's also a problem that disproportionately affects new users of the language who haven't gotten in the habit of looking for the "go to latest version" link. I know when I was first learning Rust, this problem was a particular headache for me.
Non-working solutions
<link rel=canonical>
is a commonly proposed solution, but it's not the right fit: canonical links are meant to consolidate duplicate content, and documentation for different versions is not duplicative, so this won't work. In fact, search engines verify that property, and will disregard canonical links on a site if it does not hold.
Here are some links about Google’s handling of canonical:
https://developers.google.com/search/docs/advanced/guidelines/duplicate-content
https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
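For reference, the mechanism under discussion is a single tag in a page's head, by which an old version's page nominates another URL as the preferred one (URL here is illustrative):

```html
<link rel="canonical" href="https://docs.rs/rustls/latest/rustls/" />
```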
Proposed Solution
For any given crate,
https://docs.rs/<crate>/latest
should exist and not be a redirect. It should serve the latest version of that crate. Crates.io should be updated so that the unversioned URL for a crate (e.g. https://crates.io/crates/ureq) links to the unversioned URL on docs.rs.

Sometimes people will want to link to a specific version of the documentation rather than the generic “latest” URL. There will be two ways to do that:

Caching issues
Currently, only static files are cached. The things that change between versions of a crate are its HTML and some JS (containing JSON data used in displaying the pages). The HTML is currently not cached at all, so invalidating its cache is not a current concern. The JS is also not cached, but it has a unique URL per crate version, so it could easily be cached.
In case we later decide to start caching the HTML: the CloudFront rates for cache invalidation are reasonable - $0.005 per invalidation request, and purging a whole subdirectory (like /crates/<crate>/*) is considered a single invalidation request.

Will it work?
I’m pretty sure it will work. Search engines these days take navigation events heavily into account, so if most navigation events go to an unversioned URL, that will help a lot. Also, once we make this change, the unversioned URLs will start accumulating “link juice,” which will also help a lot.
One good demonstration that it will work is that crates.io already follows a scheme like this, and does not have the "links go to old versions" problem at all.
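The heart of the proposed scheme - resolving "latest" to a concrete release internally and serving it at the unversioned URL, rather than redirecting - can be sketched as follows. This is a simplified illustration, not the actual docs.rs implementation; real version ordering would use full semver, including prereleases:

```rust
// Sketch: map an incoming version segment ("latest" or an explicit version)
// to the release that should be served, without issuing a redirect.
// Assumes `releases` is sorted oldest -> newest, as a registry index would provide.
fn resolve_version<'a>(segment: &str, releases: &'a [&'a str]) -> Option<&'a str> {
    if segment == "latest" {
        // Serve the newest release while keeping the stable, unversioned URL.
        releases.last().copied()
    } else {
        // Versioned URLs keep working exactly as before.
        releases.iter().copied().find(|v| *v == segment)
    }
}

fn main() {
    let releases = ["0.5.5", "0.11.0", "0.19.0"];
    assert_eq!(resolve_version("latest", &releases), Some("0.19.0"));
    assert_eq!(resolve_version("0.11.0", &releases), Some("0.11.0"));
    assert_eq!(resolve_version("9.9.9", &releases), None);
    println!("ok");
}
```

Because the response is served directly at /latest rather than via a redirect, search engines see the unversioned URL itself as the destination, which is what lets it accumulate links and navigations.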