
RFC: Serve crates-io registry over HTTP as static files #2789

Open · wants to merge 5 commits into base: master

Conversation

kornelski (Contributor) commented Oct 18, 2019

The existing crate index format is good enough to be served over HTTP as-is, and still be fetched and updated incrementally.

This can be faster than a git clone of the whole index ahead of time, and will spare clients the unbounded growth of the index.

Originating forum thread.

Rendered RFC
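
For orientation, here's a rough sketch (not normative) of the path scheme the existing index already uses, and which this RFC would expose directly as static HTTP resources. It assumes the index's lowercase-ASCII naming rules:

```rust
// Path of a crate's index file under the existing layout, e.g.
// "a" -> "1/a", "io" -> "2/io", "syn" -> "3/s/syn", "serde" -> "se/rd/serde".
fn index_path(name: &str) -> String {
    let lower = name.to_lowercase();
    match lower.len() {
        1 => format!("1/{lower}"),
        2 => format!("2/{lower}"),
        3 => format!("3/{}/{lower}", &lower[..1]),
        _ => format!("{}/{}/{lower}", &lower[..2], &lower[2..4]),
    }
}
```

A client needs nothing beyond the crate name to compute the URL, which is what makes plain static-file hosting sufficient.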

djc (Contributor) commented Oct 18, 2019

It would be good to touch on offline use scenarios and slow links, and how both would be affected by these changes.

kornelski (Contributor, Author) commented Oct 19, 2019

I've added more info about network speed. To sum it up:

  • It's a significant saving of bandwidth compared to a full clone, even if it ends up downloading all crates anyway. For CI that runs once and doesn't keep the index, it'll be a huge reduction in bandwidth.

  • For a first run with no cache and no lockfile, the worst case needs one round of requests per level of the dependency tree (in practice fewer, since crates appear in multiple places in the tree). In my projects I've seen deps 12 levels deep, so on a high-latency connection an update may take several seconds, but that is still likely much faster than a full index clone.

  • For subsequent runs, with some cache or a lockfile, the update can be speculatively parallelized to reduce latency to one or two roundtrips, because Cargo will already know about all or almost all dependencies-of-dependencies that it needs to check.

  • In terms of the number of update requests, the worst case without the incremental changelog is a few hundred requests. That may seem like a lot, but thanks to web bloat, web servers regularly handle that many requests for a typical page load without breaking a sweat :) The requests can be made conditional (the server answers HTTP 304 for unchanged files, as sketched at the end of this comment), so updates should use very little bandwidth. The incremental changelog can efficiently mark crates as fresh and not needing updates, so in the best case an update is just one HTTP request for the log itself.

  • Cargo has a fast path for GitHub that checks whether the index is fresh in a single HTTP request. The HTTP index can support this too, so the fast path stays just as fast.

Note that the cost of the git index is proportional to the total size of crates-io, while the cost of this solution is proportional to the project's size. At some point a git clone will become too large to be usable, but this solution will remain functional (even large Rust projects don't use all of crates-io at once, so they won't grow as big as fast).
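
As a concrete illustration of the conditional requests mentioned above, here is a minimal sketch. It assumes the server emits ETag headers and uses the reqwest crate purely for illustration; this is not Cargo's actual code:

```rust
// Sketch only: re-validate one cached index file with a conditional GET.
// Assumes the server supports ETag validators.
use reqwest::blocking::Client;
use reqwest::StatusCode;

fn refresh(
    client: &Client,
    url: &str,
    cached_etag: Option<&str>,
) -> reqwest::Result<Option<(String, String)>> {
    let mut request = client.get(url);
    if let Some(etag) = cached_etag {
        // Server replies 304 Not Modified (no body) if our copy is current.
        request = request.header("If-None-Match", etag);
    }
    let response = request.send()?;
    if response.status() == StatusCode::NOT_MODIFIED {
        return Ok(None); // cached file is fresh; almost no bandwidth used
    }
    let etag = response
        .headers()
        .get("etag")
        .and_then(|v| v.to_str().ok())
        .unwrap_or("")
        .to_string();
    let body = response.text()?;
    Ok(Some((etag, body))) // store both for the next update
}
```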

Nemo157 (Contributor) commented Oct 19, 2019

One third-party feature that could not be supported with this design is cargo add's automatic `-` vs `_` correction:

> cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

Well, it could still be implemented by fetching every permutation. That's O(2^n) requests for n separators, but since n is generally going to be 1-2, that could be OK (see the sketch below).
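
A sketch of what probing the permutations could look like; `separator_permutations` is a hypothetical helper, not part of any existing tool:

```rust
// Enumerate every `-`/`_` spelling of a crate name (2^n candidates for
// n separator positions) so a tool could probe the index for each one.
fn separator_permutations(name: &str) -> Vec<String> {
    let mut variants = vec![String::new()];
    for ch in name.chars() {
        if ch == '-' || ch == '_' {
            // Each existing prefix forks into a `-` and a `_` spelling.
            variants = variants
                .iter()
                .flat_map(|prefix| [format!("{prefix}-"), format!("{prefix}_")])
                .collect();
        } else {
            for v in &mut variants {
                v.push(ch);
            }
        }
    }
    variants // e.g. "serde-json" -> ["serde-json", "serde_json"]
}
```

With the usual one or two separators that's at most four speculative requests, which can go out in parallel.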

matklad (Member) commented Oct 21, 2019

I think it might make sense to spell out in some more detail how Cargo would use the new index. Currently, it works as follows.

For every command that can change the lockfile (most commonly, the first cargo build after a Cargo.toml change, or cargo update), the first thing Cargo does is update the registry (that is, pull new changes from the git repo). After that, Cargo has a fresh offline copy of the registry in a bare git checkout, and uses it for the actual dependency resolution.

The two interesting properties of status quo are:

  • if you change the lockfile, you always use the most recent version of the index, so security patches and other minor updates propagate as fast as possible
  • index changes are globally atomic: Cargo works with a set of packages identified by a commit hash

What would we do in the new implementation? I think there's no explicit registry update anymore. Instead, we'd update the registry lazily, on a per-crate basis, as we do version resolution?

kornelski (Contributor, Author) commented Oct 21, 2019

@matklad This is covered in the "greedy fetch" section. It works as follows:

  1. Cargo still "updates the registry" as a separate step, but instead of fetching everything, it fetches the subset of the registry relevant to a specific workspace. At this stage it's not a precise dependency resolution, but a "greedy" one whose only purpose is to discover and update every dependency that may be needed (see the sketch at the end of this comment). The greedy fetch guarantees that all of the crates the actual dependency resolution algorithm might look at will be up to date.

  2. After the incremental update, Cargo has a fresh offline copy of the registry, except crates that were not relevant for the update. It can run the regular dependency resolution as usual, and it won't know the difference.

The registry storage format would change from a bare git repo to a directory structure similar to a git checkout. Theoretically it'd be possible to mix the two and add "fake" commits to a local git repo to keep using it as the storage format, but I don't think that's worth the complexity.
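
A sketch of the greedy-fetch step, with hypothetical helpers (`fetch_index_file` and `IndexEntry` are stand-ins, not Cargo internals): walk the dependency names breadth-first so that every index file the precise resolver might consult is refreshed before resolution starts.

```rust
use std::collections::{HashSet, VecDeque};

struct IndexEntry {
    dependency_names: Vec<String>,
}

// Hypothetical: one HTTP GET of the crate's index file (all versions),
// refreshing the local cache and returning the parsed entry.
fn fetch_index_file(_name: &str) -> IndexEntry {
    IndexEntry { dependency_names: Vec::new() } // stubbed out here
}

fn greedy_fetch(roots: &[String]) -> HashSet<String> {
    let mut seen: HashSet<String> = roots.iter().cloned().collect();
    let mut queue: VecDeque<String> = roots.iter().cloned().collect();
    while let Some(name) = queue.pop_front() {
        // Greedy, not precise: take the union of dependencies across all
        // versions, so the set over-approximates what the resolver needs.
        for dep in fetch_index_file(&name).dependency_names {
            if seen.insert(dep.clone()) {
                queue.push_back(dep);
            }
        }
    }
    seen // every name in here now has a fresh local index file
}
```

Requests at the same depth of the walk are independent, which is where the speculative parallelism mentioned earlier comes in.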

the8472 commented Oct 23, 2019

A possible alternative that wouldn't require HTTP/2 support: download the index as a ZIP, which could use GitHub's download-as-zip functionality.

kornelski (Contributor, Author) commented Oct 24, 2019

The ZIP would be good only for the initial download (you wouldn't want to download 20 MB+ on every cargo update), so it sounds like a variation of the "Initial index from rustup" alternative in the RFC.

the8472 commented Oct 24, 2019

ZIPs don't have to be compressed (the "store" compression method), and if GitHub supports range requests, one could fetch only the headers and the parts that changed, i.e. do an incremental download.

kornelski (Contributor, Author) commented Oct 24, 2019

Oh, that's clever! So yes, a ZIP could be used to combine multiple requests into two or three range requests (one to find the offset of the ZIP central directory, one to get the directory, and one to get the ranges of the desired files), as sketched at the end of this comment.

  • GitHub unfortunately doesn't support Range on their ZIPs, so crates-io would have to store the ZIP elsewhere.
  • Care would have to be taken to recover from race conditions (in case the ZIP changed after a client read the central directory).
  • If crates-io ever moved to storing crate metadata in a database, it'd be difficult to maintain a ZIP-like "API" on top of it. Plain HTTP/2 requests don't dictate how the index is stored server-side.

Currently the index requires registries to list all of their crates ahead of time. In a model where individual crates are requested directly by their name, a registry could even create crates on demand (lazily). That may be interesting for setups where Cargo registry is a facade for other code repositories.
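
A sketch of the range-request trick (not a full ZIP reader), assuming the server honors Range headers and again using reqwest just for illustration:

```rust
// Fetch only the tail of a remote ZIP, where the End Of Central Directory
// record lives.
use reqwest::blocking::Client;

fn fetch_zip_tail(client: &Client, url: &str, tail_len: u64) -> reqwest::Result<Vec<u8>> {
    let response = client
        .get(url)
        // Suffix range: the last `tail_len` bytes. The EOCD record sits at
        // most ~64 KiB + 22 bytes from the end of the file, so one request
        // normally suffices to locate the central directory.
        .header("Range", format!("bytes=-{tail_len}"))
        .send()?;
    Ok(response.bytes()?.to_vec())
}
```

The follow-up requests would use explicit `bytes=start-end` ranges for the central directory and the individual file entries; pairing them with an ETag or If-Match check is one way to detect the race condition mentioned above.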

remram44 commented Oct 24, 2019

You are describing something akin to zsync (or librsync), which interestingly I'm in the process of re-implementing in Rust 🤔

gilescope commented Oct 24, 2019

If Cargo hits HTTP static files rather than a git repo, wouldn't that make life a lot easier for private registries (Artifactory, for example) to support Rust? Seems like this would lower the barrier to supporting Rust in an enterprise setting, so a thumbs up from me.

@@ -106,6 +106,10 @@ Rust/Cargo installation could come bundled with an initial version of the index.

The proposed solution scales much better, because Cargo needs to download and cache only a "working set" of the index, and unused/abandoned/spam crates won't cost anything.

## Rsync

The rsync protocol requires scanning and checksumming of source and destination files, which creates a lot of unnecessary I/O, and it requires SSH or a custom daemon running on the server, which limits hosting options for the index.

joshtriplett (Member) commented Nov 4, 2019

Indeed. A quote from many years ago:

We learned that the Linux load average rolls over at 1024. And we actually found this out empirically. -- H. Peter Anvin, about rsync on kernel.org
