
RFC: Serve crates-io registry over HTTP as static files #2789

Merged
merged 23 commits into rust-lang:master from kornelski:http-index on Jan 12, 2021

Conversation

@kornelski (Contributor) commented Oct 18, 2019

The existing crate index format is good enough to be served over HTTP as-is, while still being fetched and updated incrementally.

This can be faster than a git clone of the whole index ahead of time, and will spare clients from the unbounded growth of the index.
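
For reference, the existing index layout (which this proposal would serve unchanged as static files) shards crates into directories by name. A minimal sketch of that mapping, where the resulting paths would become the static-file URLs:

```rust
// Map a crate name to its path in the existing crates.io index layout.
fn index_path(name: &str) -> String {
    let name = name.to_lowercase();
    match name.len() {
        0 => panic!("empty crate name"),
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}

fn main() {
    assert_eq!(index_path("serde"), "se/rd/serde");
    assert_eq!(index_path("rand"), "ra/nd/rand");
}
```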

Originating forum thread.

Rendered RFC

@djc (Contributor) commented Oct 18, 2019

It would be good to touch on offline use scenarios and how these (as well as slow links) would be affected by these changes.

@kornelski (Contributor, Author) commented Oct 19, 2019

I've added more info about network speed. To sum it up:

  • It's a significant bandwidth saving compared to a full clone, even if a build ends up downloading the index files for all of its crates anyway. For CI that runs once and doesn't keep the index, it's a huge reduction in bandwidth.

  • For a first run with no cache and no lockfile, the worst case is as many sequential round trips as the maximum depth of the dependency tree (in practice fewer, since crates appear in multiple places in the tree). I've seen dependencies 12 levels deep in my projects, so on a high-latency connection this may take several seconds, but that is still likely far faster than a full index clone.

  • For subsequent runs, with some cache or a lockfile, the update can be speculatively parallelized to reduce latency to one or two round trips, because Cargo will already know about all, or almost all, of the dependencies-of-dependencies it needs to check.

  • In terms of the number of update requests, the worst-case scenario without the incremental changelog is a few hundred requests. That may seem like a lot, but thanks to web bloat, web servers routinely handle that many requests for a typical page load without breaking a sweat :) The requests can be made conditional (returning HTTP 304 when unchanged, as sketched after this list), so updates should use very little bandwidth. The incremental changelog can efficiently mark crates as fresh and not needing updates, so in the best case an update is just one HTTP request for the log itself.

  • Cargo has a fast path for GitHub that checks whether the index is fresh with one HTTP request. This can be supported for the HTTP index too, so the fast path stays just as fast.

Note that the cost of the git index is proportional to the total size of crates-io, while the cost of this solution is proportional to the size of the project being built. At some point a git clone will become too large to be usable, but this solution will remain functional (even large Rust projects don't use all of crates-io at once, so they won't grow as quickly).
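
A minimal sketch of the conditional-request scheme from the list above, assuming reqwest with its blocking feature; the (etag, body) cache shape is illustrative, not part of the RFC:

```rust
use reqwest::{blocking::Client, header, StatusCode};

// Revalidate one cached index file with If-None-Match: an unchanged
// crate costs only a 304 response with no body.
fn revalidate(
    client: &Client,
    url: &str,
    cached: Option<(String, String)>, // (etag, body) from a previous fetch
) -> reqwest::Result<(String, String)> {
    let mut req = client.get(url);
    if let Some((etag, _)) = &cached {
        req = req.header(header::IF_NONE_MATCH, etag.as_str());
    }
    let resp = req.send()?;
    if resp.status() == StatusCode::NOT_MODIFIED {
        // The server confirmed our copy is fresh; reuse it unchanged.
        return Ok(cached.expect("304 implies we sent an ETag"));
    }
    let etag = resp
        .headers()
        .get(header::ETAG)
        .and_then(|v| v.to_str().ok())
        .unwrap_or_default()
        .to_owned();
    Ok((etag, resp.text()?))
}
```

Checking a few hundred crates this way parallelizes well, and each 304 costs essentially nothing beyond the response headers.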

@Nemo157 (Member) commented Oct 19, 2019

One third-party tool feature that could not be supported with this design is cargo add's automatic - vs _ distinction:

> cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

Well, it could still be implemented by fetching every permutation. That's O(2^n) requests for a name with n separators, but since n is generally going to be 1-2, that could be OK. (See the sketch below.)
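
For illustration, a sketch of that permutation fallback; `separator_permutations` is a hypothetical helper, not part of cargo add:

```rust
// Enumerate every `-`/`_` spelling of a crate name: 2^n candidates
// for n separator characters.
fn separator_permutations(name: &str) -> Vec<String> {
    let positions: Vec<usize> = name
        .char_indices()
        .filter(|&(_, c)| c == '-' || c == '_')
        .map(|(i, _)| i)
        .collect();
    let mut out = Vec::with_capacity(1 << positions.len());
    for mask in 0u32..(1 << positions.len()) {
        let mut bytes = name.as_bytes().to_vec();
        for (bit, &pos) in positions.iter().enumerate() {
            // Each bit of the mask picks `-` or `_` for one position.
            bytes[pos] = if mask & (1 << bit) != 0 { b'_' } else { b'-' };
        }
        out.push(String::from_utf8(bytes).unwrap());
    }
    out
}

fn main() {
    // One separator, so only two index files to probe.
    assert_eq!(separator_permutations("serde-json"), ["serde-json", "serde_json"]);
}
```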

@matklad (Member) commented Oct 21, 2019

I think it might make sense to spell out in some more detail how Cargo would use the new index. Currently, it works as follows.

For every command that can change the lockfile (most commonly, the first cargo build after a Cargo.toml change, or cargo update), the first thing Cargo does is update the registry (that is, pull new changes from the git repo). After that, Cargo has a fresh offline copy of the registry in a bare git checkout, and uses it for the actual dependency resolution.

The two interesting properties of the status quo are:

  • if you change the lockfile, you always use the most recent version of the index, so security patches and other minor updates propagate as fast as possible
  • index changes are globally atomic: Cargo works with a set of packages identified by a commit hash

What would we do in the new implementation? I think there's no explicit registry update anymore; instead, we update the registry lazily, on a per-crate basis, when we do version resolution?

@kornelski (Contributor, Author) commented Oct 21, 2019

@matklad This is covered in the "greedy fetch" section. It works as follows:

  1. Cargo still "updates the registry" as a separate step, but instead of fetching everything, it fetches the subset of the registry relevant to the specific workspace. At this stage it performs not a precise dependency resolution, but a "greedy" one, in order to discover and update every dependency that may be needed. The greedy fetch guarantees that all crates the actual dependency resolution algorithm might look at will be up to date.

  2. After the incremental update, Cargo has a fresh offline copy of the registry, except for crates that were not relevant to the update. It can then run the regular dependency resolution as usual and won't know the difference. (A sketch of the greedy phase follows this list.)
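
A sketch of that greedy phase; `fetch_index_file` and `dependency_names` are illustrative stubs standing in for the HTTP fetch and index-file parsing, not real Cargo APIs:

```rust
use std::collections::{HashSet, VecDeque};

// Breadth-first over-approximation: follow every dependency edge without
// resolving versions or features, so the precise resolver that runs next
// only ever reads index files that are already fresh.
fn greedy_fetch(roots: &[String]) -> HashSet<String> {
    let mut seen: HashSet<String> = roots.iter().cloned().collect();
    let mut queue: VecDeque<String> = roots.iter().cloned().collect();
    while let Some(name) = queue.pop_front() {
        // In a real client, each level's fetches would run in parallel.
        let index_file = fetch_index_file(&name);
        for dep in dependency_names(&index_file) {
            if seen.insert(dep.clone()) {
                queue.push_back(dep);
            }
        }
    }
    seen // every index file the resolver may consult is now up to date
}

// Illustrative stubs only.
fn fetch_index_file(name: &str) -> String {
    format!("{{\"name\":\"{name}\"}}")
}
fn dependency_names(_index_file: &str) -> Vec<String> {
    Vec::new()
}
```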

The registry storage format would change from a bare git repo to a directory structure similar to a git checkout. In theory the two could be mixed by adding "fake" commits to a local git repo to keep using it as the storage format, but I don't think that's worth the complexity.

@the8472 commented Oct 23, 2019

A possible alternative that wouldn't require HTTP/2 support: download the index as a zip, which could use GitHub's download-as-zip functionality.

@kornelski (Contributor, Author) commented Oct 24, 2019

The zip would only help the initial download (you wouldn't want to download 20MB+ on every cargo update), so it sounds like a variation of the "Initial index from rustup" alternative in the RFC.

@the8472 commented Oct 24, 2019

Zips don't have to be compressed ("store" compression), and if GitHub supports range requests, one could fetch only the headers and the parts that changed, i.e. do an incremental download.

@kornelski (Contributor, Author) commented Oct 24, 2019

Oh, that's clever! So yes, a ZIP could combine multiple requests into two or three range requests (find the offset of the ZIP central directory, fetch the directory, then fetch the ranges of the desired files); see the sketch after this list.

  • GitHub unfortunately doesn't support Range on its ZIPs, so crates-io would have to store the ZIP elsewhere.
  • Care would have to be taken to recover from race conditions (in case the ZIP changed after the client read the central directory).
  • If crates-io ever moved to storing crate metadata in a database, it'd be difficult to maintain a ZIP-like "API" on top of that. Plain HTTP/2 requests don't dictate how the index is stored server-side.
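
A sketch of those range requests, assuming a server that honours HTTP Range and a ZIP with no trailing comment (so the end-of-central-directory record is exactly the file's last 22 bytes); the URL handling is illustrative:

```rust
use reqwest::{blocking::Client, header};

// Fetch the ZIP central directory in two range requests; individual
// index files can then be fetched by the offsets it lists.
fn central_directory(client: &Client, url: &str) -> reqwest::Result<Vec<u8>> {
    // Request 1: suffix range for the fixed-size end-of-central-directory record.
    let eocd = client
        .get(url)
        .header(header::RANGE, "bytes=-22")
        .send()?
        .bytes()?;
    assert_eq!(&eocd[0..4], b"PK\x05\x06", "EOCD signature");
    let size = u32::from_le_bytes(eocd[12..16].try_into().unwrap()) as u64;
    let offset = u32::from_le_bytes(eocd[16..20].try_into().unwrap()) as u64;
    // Request 2: the central directory itself, which lists every
    // contained file's name, size, and offset.
    let dir = client
        .get(url)
        .header(header::RANGE, format!("bytes={}-{}", offset, offset + size - 1))
        .send()?
        .bytes()?;
    Ok(dir.to_vec())
}
```

The race-condition caveat above applies here: if the ZIP is replaced between the two requests, the offsets go stale, so a client would need to detect the mismatch and retry.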

Currently the index requires registries to list all of their crates ahead of time. In a model where individual crates are requested directly by name, a registry could even create crates on demand (lazily). That may be interesting for setups where a Cargo registry is a facade for other code repositories.

@remram44 commented Oct 24, 2019

You are describing something akin to zsync (or librsync), which interestingly I'm in the process of re-implementing in Rust 🤔

@gilescope commented Oct 24, 2019

If cargo hit static HTTP files rather than a git repo, wouldn't that make life a lot easier for private registries (Artifactory, for example) to support Rust? This seems like it would lower the barrier to supporting Rust in an enterprise setting, so a thumbs up from me.

@Eh2406 commented Nov 20, 2019

Personal opinions, not speaking for the team.

This feels like a direction that could actually work! I have participated in many "just use HTTP" conversations in the past; this is the first time it has sounded plausible. (Sorry if you explained it to me before and I was too dense to get it.)

At some point between our current size and NPM's scale, this will be faster. That being said, my sense is that the tipping point is far off. Additionally, it is very important to read the line "Since alternative registries are stable, the git-based protocol is stable, and can't be removed." As such, I will be spending my time on ways to improve the git-based protocol. This RFC is convincing enough that I would not object to someone spending their time on developing it.

On Performance:
This RFC assumes that a robust parallel HTTP/2 client will be maintainable and faster than a git fetch. I am skeptical by default, and would love to see examples showing that to be true. Async/await makes things much easier than they were last time I tried, and things continue to improve.

For Crates.io:
On CDN:
The index is generously hosted by GitHub with a good CDN. We avoid doing things that would get it kicked off (such as shallow clones). Which side of that line does using raw.githubusercontent.com fall on? I would like proactive assurance that switching will be OK. (I miss lolbench.rs and don't want that to happen to the index.) Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

For Alternative Registries:
I really like that this reduces the infrastructure required to set up an alternative registry to just a static file server!

@the8472 commented Nov 20, 2019

Should compression support be mandatory for clients? I.e., must they send an Accept-Encoding: ... header? gzip, br, zstd, ...?

@alexcrichton (Member) commented Dec 5, 2019

The Cargo team discussed this RFC over the past few weeks, and we wanted to leave some thoughts of what we discussed and what we're thinking about this RFC.

Overall none of us are opposed to the principles of this RFC or the idea it's chasing after. If there's a better solution for hosting and managing the index, that'd be great to implement! If it's better after all then it's better :)

Our main discussion point, though, was that implementing this style of index fetching is likely to be a very significant technical undertaking in Cargo. There are a ton of concerns to take care of, and it's not quite as simple as "just throw HTTP/2 at the problem and let it solve everything". We'd ideally like to see a prototype implementation to prove out the gains an architecture like this would bring, but a prototype is itself likely such a significant undertaking that there needs to be more buy-in before even that happens.

The conclusion we reached is that we'd be willing to accept this under the idea that it's an "eRFC" or "experimental RFC". It would basically be a blessing that the Cargo team is interested in seeing research and development around this area of hosting the index and fetching it on the client. We aren't wed to any particular details yet, and the actual details of this RFC as-written are highly likely to change while a solution is being developed.

@kornelski (Contributor, Author) commented Dec 8, 2019

I've implemented a proof-of-concept with greedy resolution that takes into account features and version ranges.

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept/blob/master/src/main.rs

Tested with the Rocket framework, which required 122 requests:

  • With a full 1 second of added latency per request, over HTTP/1.1, it fetched all dependencies in 11 seconds.
  • With 200ms of added latency per request, over HTTP/1.1: 3.5 seconds.
  • Without extra latency, over HTTP/2 on a fast connection: 1.6-3 seconds.

Tested on a fast connection with the 144 crates.io dependencies used by Servo, which required 344 requests:

  • over HTTP/1.1: 1.2-2s.
  • over HTTP/2: 0.7-1.3s.

As predicted, more dependencies make the lookup relatively faster, because it starts with more parallelism.

Just rand@0.7 - 8 requests:

  • over HTTP/1.1: 0.3s
  • over HTTP/2: 0.2s

All of these tests simulated the worst case: a blank state with no cache.

@kornelski (Contributor, Author) commented Dec 8, 2019

I've added cache and revalidation support. The cache flattens the check's critical path to a depth of 1, and revalidation keeps bandwidth use minimal, so all checks (regardless of which crates are checked) become almost equally fast.

With disk cache and a fast connection:

  • HTTP/1: 0.8s
  • HTTP/2: 0.4s

With disk cache and a 3G connection:

  • HTTP/1: 1.6-2.4s
  • HTTP/2: 0.7s-2s

@pietroalbini (Member) commented Dec 9, 2019

A problem this RFC doesn't seem to address with its local cache is deleting entries from the index: it's something we do very rarely, but sometimes we're legally required to do so (DMCA, GDPR...), and in those cases we can't leave the deleted entries in users' local caches. The git index solves this cleanly (it's just yet another commit).

> The index is generously hosted by GitHub with a good CDN. [...] Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

If we do this, we should just host the content on Rust infrastructure. We should still be on the $0.04/GB CloudFront pricing tier, which is sustainable for such small files.

@kornelski (Contributor, Author) commented Dec 9, 2019

Note that Cargo currently does not delete crate tarballs or checked-out copies of crates even after they have been removed from the index, and deleted metadata files continue to be distributed as part of git history until the index branch is squashed.

With a static file solution, when a client checks the freshness of a deleted crate, it will make a request to the server and see a 404/410/451 HTTP status. It can then be made to act accordingly and clean up the local data (even the tarball and source checkout); a sketch follows below.

If the client is not interested in a deleted crate, it won't check it, but chances are it never did and never downloaded it. If the ability to immediately erase deleted data is important, the "incremental changelog" feature can be extended to proactively announce deletions.
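
A sketch of that clean-up path; the `purge_local_data` helper is hypothetical, not a real Cargo function:

```rust
use reqwest::blocking::Client;

// Freshness check doubling as deletion detection: a 404/410/451 from
// the index means the crate was removed (451 covering legally mandated
// takedowns), so all locally cached data for it can be dropped.
fn check_crate(client: &Client, index_url: &str, name: &str) -> reqwest::Result<()> {
    let status = client.get(index_url).send()?.status();
    if matches!(status.as_u16(), 404 | 410 | 451) {
        purge_local_data(name);
    }
    Ok(())
}

// Hypothetical: remove the cached index file, tarball, and source checkout.
fn purge_local_data(name: &str) {
    println!("purging cached data for {name}");
}
```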

Review thread on text/0000-sparse-index.md:

> https://andre.arko.net/2014/03/28/the-new-rubygems-index-format/
>
> Bundler used to have a full index fetched ahead of time, similar to Cargo's, until it grew too large. Then it used a centralized query API, until that became too problematic to support. Then it switched to an incrementally downloaded flat-file index format similar to the solution proposed here.
@joshtriplett (Member) commented Dec 15, 2020

More prior art: Debian has "pdiffs", where the archive provides incremental deltas against the complete "Packages" file for a certain period of time, so that apt update can download those deltas if it isn't too far out of date. There have been several revisions of that approach, to minimize the number of diffs an update must download.

@kornelski (Contributor, Author) commented Dec 16, 2020

Is pdiff similar to zsync?

@alexcrichton (Member) commented Dec 16, 2020

@rfcbot fcp merge

I think this is in a good state to merge now personally, thanks again @jonhoo and @kornelski. To recap for the cc'd members: this RFC declares that the git index has a finite amount of time left before its scalability becomes a large problem, and proposes fetching each crate's index file over HTTP instead; the exact details of how this works are deliberately left unspecified (though examples are given). @jonhoo's prototype work shows that this should be a very viable solution for the registry index on crates.io, and while more implementation work in Cargo remains, we're already in a pretty good state.

I have also removed T-crates-io from this RFC since it currently relates only to Cargo. We've discussed with the crates-io team that work may come their way in the future, but while this is implemented in Cargo and tested in the ecosystem, the goal is to avoid any changes to crates.io itself.

@rfcbot commented Dec 16, 2020

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot commented Dec 19, 2020

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot commented Dec 29, 2020

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@ehuss approved these changes Jan 12, 2021
@ehuss merged commit 17b10d7 into rust-lang:master on Jan 12, 2021
@ehuss (Contributor) commented Jan 12, 2021

Huzzah! The @rust-lang/cargo team has decided to accept this RFC!

To track further discussion, subscribe to the tracking issue here: rust-lang/cargo#9069

@kornelski deleted the http-index branch Jan 13, 2021
@glandium commented Jan 13, 2021

I don't see `git clone --depth=1` being discussed in the RFC. It reports transferring 29.97 MiB, and allows for incremental updates after that.

@ssokolow commented Jan 13, 2021

@glandium It was mentioned in the discussion. Expand all the collapsed bits and Ctrl+F for "shallow clone".

@glandium commented Jan 14, 2021

Oh, the RFC does have something about shallow clones, but I was interpreting it as a narrow/partial clone for some reason. Sorry for the noise.
