Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc: adjust static file names for better cache configuration #98413

Closed
jsha opened this issue Jun 23, 2022 · 18 comments · Fixed by #101702
Closed

rustdoc: adjust static file names for better cache configuration #98413

jsha opened this issue Jun 23, 2022 · 18 comments · Fixed by #101702
Labels
T-docs-rs Relevant to the docs-rs subteam, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@jsha
Copy link
Contributor

jsha commented Jun 23, 2022

rustdoc has a number of static files that should really be long-cached by the browser for best loading performance: the fonts, the CSS, storage.js, main.js, and so on. We should add a hash to their names so that services can be more confident in setting long cache headers for them.

Right now we categorize things into Unversioned, ToolchainSpecific, and InvocationSpecific:

enum SharedResource<'a> {
/// This file will never change, no matter what toolchain is used to build it.
///
/// It does not have a resource suffix.
Unversioned { name: &'static str },
/// This file may change depending on the toolchain.
///
/// It has a resource suffix.
ToolchainSpecific { basename: &'static str },
/// This file may change for any crate within a build, or based on the CLI arguments.
///
/// This differs from normal invocation-specific files because it has a resource suffix.
InvocationSpecific { basename: &'a str },
}

Unversioned is used just for the font files. ToolchainSpecific is used for the CSS, the images, and most of the JS. InvocationSpecific is used for search-indexN.NN.N.js, source-filesN.NN.N.js, cratesN.NN.N.js, the JS that contains the list of implementors on trait pages, and the JS that contains the list of additional sidebar items (siblings in a module).

Unversioned gets no infix. ToolchainSpecific gets a version suffix, like main1.63.0.js (from main.js). InvocationSpecific gets the same version suffix.

Unversioned and ToolchainSpecific files should be infinitely cacheable. Right now, that's not the case for ToolchainSpecific, because multiple toolchains have the same version infix. For instance, every nightly build right now creates a main1.63.0.js, but it's potentially different each night. That means https://doc.rust-lang.org/nightly/main1.63.0.js potentially changes every night, and can't be long-cached. Since docs.rs uses the nightly toolchain, the main1.63.0.js it produces for a crate today may be different than the one it produces for a crate it builds tomorrow.

docs.rs has special code to recognize the ToolchainSpecific files and rename them to contain a date and a hash, like https://docs.rs/main-20220517-1.63.0-nightly-4c5f6e627.js. But doc.rust-lang.org doesn't have that code, and as a result is less able to cache things that should be cached. And anyone who self-hosts docs is on their own.

I propose that we change our file naming scheme. All Unversioned and ToolchainSpecific files should be emitted to a subdirectory s/<hash>/, where <hash> is calculated over the contents of all of those files together. This makes it easy to configure a web server to set Cache-Control headers for everything under that subdirectory.

Advantage: this makes calculating URLs for such resources easy, especially when the calculation is done in JS. Disadvantage: if one file changes, the whole hash changes, potentially requiring the user to load more files when navigating between crates generated with different rustdoc versions.

Alternately, we could add a hash of each individual file to that file's name. That makes calculating URLs harder, but means better reuse of cached data across different nightly versions.

/cc @rust-lang/rustdoc @rust-lang/docs-rs

@jsha jsha added T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-docs-rs Relevant to the docs-rs subteam, which will review and decide on the PR/issue. labels Jun 23, 2022
@jyn514
Copy link
Member

jyn514 commented Jun 23, 2022

docs.rs has special code to recognize the ToolchainSpecific files and rename them to contain a date and a hash, like https://docs.rs/main-20220517-1.63.0-nightly-4c5f6e627.js

That's not quite right - docs.rs passes --resource-suffix to rustdoc to make it automatically include the suffix; the alternative would be parsing and rewriting thousands of html files at build time. It should be possible for doc.rust-lang.org to do the same pretty easily.

I'm not opposed to adding a default for --resource-suffix on nightly, but I also don't think it's a particularly pressing problem.

@GuillaumeGomez
Copy link
Member

I guess adding -{VERSION}-{DATE} as --resource-suffix would fix the issue.

@jsha
Copy link
Contributor Author

jsha commented Jun 26, 2022

That's not quite right - docs.rs passes --resource-suffix to rustdoc to make it automatically include the suffix; the alternative would be parsing and rewriting thousands of html files at build time. It should be possible for doc.rust-lang.org to do the same pretty easily.

Ah, thanks for the correction! I had taken the hex bytes in those filenames to be a hash, but I see now they are a commit id.

Also, for some reason I had assumed that self-built docs had the version suffix by default (e.g. 1.61.0), but I see now they don't. That means people self-hosting their docs can't set caching headers for static files.

I'm not opposed to adding a default for --resource-suffix on nightly, but I also don't think it's a particularly pressing problem.

My main priority here is that http://doc.rust-lang.org/std (i.e. stable) doesn't have caching headers for the JS and CSS, which adds some delay to page loads. To add those I'd like to make it things so that when content changes, filenames change (as suggested in rust-lang/simpleinfra#108 (comment)).

I think the straightforward fix is to add a default value for --resource-suffix on all builds - nightly and stable. If we choose a default that meets the needs of docs.rs, would that allow us to deprecate the flag itself? I note that docs.rs uses <date>-<version>-<channel>-<commit hash>, which seems like more than needed. We could get by with just the commit hash, right? The original commit in docs.rs doesn't mention the reason for the current scheme.

@jyn514
Copy link
Member

jyn514 commented Jun 26, 2022

I note that docs.rs uses ---, which seems like more than needed. We could get by with just the commit hash, right?

Just the hash would make it a little harder to tell when the docs were published, but I guess we can find that out from our database records anyway. I don't think there's any special logic behind the current format other than being unambiguous.

Do note though that the hash will have to be longer if you omit the channel and date - docs.rs stores enough years of docs that hash collisions are a real risk (especially since git is still using SHA1).

@jyn514
Copy link
Member

jyn514 commented Jun 26, 2022

That means people self-hosting their docs can't set caching headers for static files.

Well, they can, they just have to know about resource-suffix.

@Mark-Simulacrum
Copy link
Member

FWIW, it's not clear to me why we should be versioning these files along rustc's commit hash or similar: a hash of the file contents is relatively easy to compute and avoids any questions around needing git (e.g., when building from tarballs with patches applied) to prevent problems. It also gives us longer-lived caches on a more granular basis, and should work for all files that aren't user visible in links. (Even without resource suffix passed).

Is there a reason we're not doing that by default?

@jsha
Copy link
Contributor Author

jsha commented Jun 27, 2022

Is there a reason we're not doing that by default?

We do some loading of resource files from JS:

// Given a basename (e.g. "storage") and an extension (e.g. ".js"), return a URL
// for a resource under the root-path, with the resource-suffix.
function resourcePath(basename, extension) {
return getVar("root-path") + basename + getVar("resource-suffix") + extension;
}
. Right now the JS knows enough to do that by just knowing the resource suffix. To do hashes instead, this would need to receive a manifest of all files it might want to load, with their hashes. Not insurmountable, but a bit more work than choosing a default value for the resource suffix. Currently this is used to load the search JS, search index, settings JS, and themes.

Still, the benefits of using hashes might be worth it.

@jsha
Copy link
Contributor Author

jsha commented Aug 30, 2022

Thinking about it some more, I'm convinced hashes are the right approach. My goal is to enable faster page loads in general. One was is by turning on caching of static files on doc.rust-lang.org. But another nice improvement would be to improve cache hit rates on docs.rs. Right now it's unlikely that any two crates have the same URL for, e.g. normalize.css, because it will always be specific to the rustdoc version, e.g. https://docs.rs/normalize-20220709-1.64.0-nightly-6dba4ed21.css. But that file basically never changes.

If we used a hash as part of the URL, in theory docs.rs could stop setting --resource-suffix, and most page navigations would load a cached normalize.css. Even static files like main.js that change more frequently change less often than nightly and would have a better cache hit rate.

@jyn514
Copy link
Member

jyn514 commented Aug 30, 2022

👍 for not having to keep track of this in docs.rs - ideally we could get rid of --print=invocation-specific at the same time

@jsha
Copy link
Contributor Author

jsha commented Sep 2, 2022

I have a branch in progress and would love some eyes on it just to check I'm going in the right direction:

https://github.com/rust-lang/rust/compare/master...jsha:rust:static-files?expand=1

https://rustdoc.crud.net/jsha/static-files/std/io/trait.Read.html

Basically everything static (Unversioned + InvocationSpecific) now gets written to static.files/, which should make it simpler for docs.rs to figure out what should be copied out. The InvocationSpecific files get a hash in the filename based on their contents. I can also collapse the two categories so the fonts (Unversioned) get the hash treatment as well.

Still to be done:

  • Get themes working.
  • Make minification work.
  • Respect --disable-minification.

@GuillaumeGomez
Copy link
Member

Your approach seems a bit complex. I would have done it as follows: generate static files and their hash first, then save this value into a struct stored somewhere and then use it when generating the template files. Like that it wouldn't require too much changes and would still work out easily.

The downside with the hash approach is that it's tricky to know what the new files are since you need to look at the file date and it makes impossible to use the resource suffix. I'd say we lose in predictability what we gain in simplicity for the source code. Not sure if it's important or not though. Also, changing/removing cli options is a breaking change I think. So maybe we need a transition period in-between.

@notriddle
Copy link
Contributor

@GuillaumeGomez

Your approach seems a bit complex.

The complexity comes from the dependency graph between them. The main.js script is supposed to be able to lazily load search.js, which means it needs to know somehow what the filename is. If that's embedded in the file itself, then the hash of main.js depends on the hash of search.js.

You can work around this by generating a separate "file manifest", either inlined into all of the HTML files or stored as another JS file, but that causes its own complexity (at least, it adds more boilerplate to rustdoc's already heavy HTML output).

Also, changing/removing cli options is a breaking change I think. So maybe we need a transition period in-between.

This is an unstable option anyway. A deprecation period might be a good idea, but it doesn't have to be very long.

$ rustdoc --emit invocation-specific
error: the `-Z unstable-options` flag must also be passed to enable the flag `emit`

@GuillaumeGomez
Copy link
Member

The order of files is known ahead of time so there shouldn't be too much complexity normally? But maybe I'm missing something obvious.

Also, do we know if there are users relying on the replaced cli options apart from docs.rs?

This is an unstable option anyway. A deprecation period might be a good idea, but it doesn't have to be very long.

I guess it can be short if no ones uses this option indeed.

@jyn514
Copy link
Member

jyn514 commented Sep 2, 2022

Docs.rs uses this option. It's ok if rustdoc removes it, but we need advance notice (cc @rust-lang/docs-rs) and to know what we should replace it with.

@jsha
Copy link
Contributor Author

jsha commented Sep 2, 2022

Docs.rs uses this option.

The --emit option, right? That should continue to work with my branch.

To make sure I understand how it works, is this summary accurate?

The first time a new toolchain is installed, rustdoc is called on a dummy crate with --emit=unversioned-shared-resources,toolchain-shared-resources, which gets the fonts and the static JS/CSS. These are stored somewhere such that they can be served under / and shared by all crates.

For regular crate builds, rustdoc is called with --emit=invocation-specific, to avoid emitting the fonts and static JS/CSS, which saves storage space. The resulting directory can be directly packaged up and won't contain unnecessary files.

With the proposal in my branch, everything emitted by --emit=unversioned-shared-resources,toolchain-shared-resources would be guaranteed to go under static.files/. So we could either keep using the flags as they are, or deprecate them. If we deprecate them, docs.rs would rely on rm -r'ing static.files/ in regular crate builds, and copying static.files/ on new toolchain installs. I don't have a strong opinion on which is better. Getting rid of flags is nice; at the same time, if the flag approach is working why mess with it. :-)

generate static files and their hash first, then save this value into a struct stored somewhere and then use it when generating the template files.

This is effectively what we do here, though instead of "a struct", I use a series of statics. However, having a single top-level static with fields would indeed make it easier to make these values available to templates. I could give that a shot.

@GuillaumeGomez
Copy link
Member

GuillaumeGomez commented Sep 2, 2022

The --emit option, right? That should continue to work with my branch.

I was talking about the --resource-suffix. If you were not talking about this option, I completely misunderstood you. ^^'

This is effectively what we do here, though instead of "a struct", I use a series of statics. However, having a single top-level static with fields would indeed make it easier to make these values available to templates. I could give that a shot.

👍

@jsha
Copy link
Contributor Author

jsha commented Sep 2, 2022

@GuillaumeGomez said:

I was talking about the --resource-suffix

Indeed, earlier in this thread I talked about potentially deprecating --resource-suffix. @notriddle also mentioned --emit. Both are relevant, though after poking at it some more I have concluded my branch cannot completely deprecate --resource-suffix. The "dynamic" JS still (probably) needs it: sidebar-items, implementors, crates, and search-index.

We could eventually talk about changing the naming scheme for our dynamic JS: instead of being named based on --resource-suffix, the format of object provided by those files could be versioned independently, like we version the rustdoc JSON output today. But that's outside the scope of this issue IMO.

@jsha
Copy link
Contributor Author

jsha commented Sep 11, 2022

PR is up: #101702

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Nov 5, 2022
…illaumeGomez

rustdoc: add hash to filename of toolchain files

All static files used by rustdoc are now stored in static.files/ and their filenames include a hash of their contents. Their filenames no longer include the contents of the --resource-suffix flag. This clarifies caching semantics. Anything in static.files can use Cache-Control: immutable because any updates will show up as a new URL.

Invocation-specific files like crates-NN.js, search-index-NN.js, and sidebar-items-NN.js still get the resource suffix.

This has a useful side effect: once toolchain files aren't affected by resource suffix, it will become possible for docs.rs to include crate version in the resource suffix. That should fix a caching issue with `/latest/` URLs: rust-lang/docs.rs#1593. My goal is that it should be safe to serve all rustdoc JS, CSS, and fonts with infinite caching headers, even when new versions of a crate are uploaded in the same place as old versions.

The --disable-minification flag is removed because it would vary the output of static files based on invocation flags. Instead, for rustdoc development purposes it's preferable to symlink static files to a non-minified copy for quick iteration.

Example listing:

```
$ cd build/x86_64-unknown-linux-gnu/doc/ && find . | egrep 'js$|css$' | egrep -v 'sidebar-items|implementors' | sort
./crates1.65.0.js
./rust.css
./search-index1.65.0.js
./source-files1.65.0.js
./static.files/ayu-2bfd0af01c176fd5.css
./static.files/dark-95d11b5416841799.css
./static.files/light-c83a97e93a11f15a.css
./static.files/main-efc63f77fb116394.js
./static.files/normalize-76eba96aa4d2e634.css
./static.files/noscript-5bf457055038775c.css
./static.files/rustdoc-7a422337900fa894.css
./static.files/scrape-examples-3dd10048bcead3a4.js
./static.files/search-47f3c289722672cf.js
./static.files/settings-17b08337296ac774.js
./static.files/settings-3f95eacb845293c0.css
./static.files/source-script-215e9db86679192e.js
./static.files/storage-26d846fcae82ff09.js
```

Fixes rust-lang#98413
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-docs-rs Relevant to the docs-rs subteam, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants