
RFC: Serve crates-io registry over HTTP as static files #2789

Merged
merged 23 commits into rust-lang:master from kornelski:http-index on Jan 12, 2021

Conversation

@kornelski (Contributor) commented Oct 18, 2019

The existing crate index format is good enough to be served over HTTP as-is, while still being fetched and updated incrementally.

This can be faster than a git clone of the whole index ahead of time, and will spare clients from the unbounded growth of the index.
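
For reference, the existing index layout (which this proposal would serve unchanged as static files) shards crates into directories by name. A minimal sketch of that mapping, where the resulting paths would become the static-file URLs:

```rust
// Map a crate name to its path in the existing crates.io index layout.
fn index_path(name: &str) -> String {
    let name = name.to_lowercase();
    match name.len() {
        0 => panic!("empty crate name"),
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}

fn main() {
    assert_eq!(index_path("serde"), "se/rd/serde");
    assert_eq!(index_path("rand"), "ra/nd/rand");
}
```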

Originating forum thread.

Rendered RFC

@djc (Contributor) commented Oct 18, 2019

It would be good to touch on offline use scenarios and how these (as well as slow links) would be affected by these changes.

@kornelski (Contributor, Author) commented Oct 19, 2019

I've added more info about network speed. To sum it up:

  • It's a significant bandwidth saving compared to a full clone, even if a build ends up downloading the index files for all of its crates anyway. For CI that runs once and doesn't keep the index, it's a huge reduction in bandwidth.

  • For a first run with no cache and no lockfile, the worst case is as many sequential round trips as the maximum depth of the dependency tree (in practice fewer, since crates appear in multiple places in the tree). I've seen dependencies 12 levels deep in my projects, so on a high-latency connection this may take several seconds, but that is still likely far faster than a full index clone.

  • For subsequent runs, with some cache or a lockfile, the update can be speculatively parallelized to reduce latency to one or two round trips, because Cargo will already know about all, or almost all, of the dependencies-of-dependencies it needs to check.

  • In terms of the number of update requests, the worst-case scenario without the incremental changelog is a few hundred requests. That may seem like a lot, but thanks to web bloat, web servers routinely handle that many requests for a typical page load without breaking a sweat :) The requests can be made conditional (returning HTTP 304 when unchanged, as sketched after this list), so updates should use very little bandwidth. The incremental changelog can efficiently mark crates as fresh and not needing updates, so in the best case an update is just one HTTP request for the log itself.

  • Cargo has a fast path for GitHub that checks whether the index is fresh with one HTTP request. This can be supported for the HTTP index too, so the fast path stays just as fast.

Note that the cost of the git index is proportional to the total size of crates-io, while the cost of this solution is proportional to the size of the project being built. At some point a git clone will become too large to be usable, but this solution will remain functional (even large Rust projects don't use all of crates-io at once, so they won't grow as quickly).
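
A minimal sketch of the conditional-request scheme from the list above, assuming reqwest with its blocking feature; the (etag, body) cache shape is illustrative, not part of the RFC:

```rust
use reqwest::{blocking::Client, header, StatusCode};

// Revalidate one cached index file with If-None-Match: an unchanged
// crate costs only a 304 response with no body.
fn revalidate(
    client: &Client,
    url: &str,
    cached: Option<(String, String)>, // (etag, body) from a previous fetch
) -> reqwest::Result<(String, String)> {
    let mut req = client.get(url);
    if let Some((etag, _)) = &cached {
        req = req.header(header::IF_NONE_MATCH, etag.as_str());
    }
    let resp = req.send()?;
    if resp.status() == StatusCode::NOT_MODIFIED {
        // The server confirmed our copy is fresh; reuse it unchanged.
        return Ok(cached.expect("304 implies we sent an ETag"));
    }
    let etag = resp
        .headers()
        .get(header::ETAG)
        .and_then(|v| v.to_str().ok())
        .unwrap_or_default()
        .to_owned();
    Ok((etag, resp.text()?))
}
```

Checking a few hundred crates this way parallelizes well, and each 304 costs essentially nothing beyond the response headers.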

@Nemo157 (Member) commented Oct 19, 2019

One third-party tool feature that could not be supported with this design is cargo add's automatic - vs _ distinction:

> cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

Well, it could still be implemented by fetching every permutation. That's O(2^n) requests for a name with n separators, but since n is generally going to be 1-2, that could be OK. (See the sketch below.)
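
For illustration, a sketch of that permutation fallback; `separator_permutations` is a hypothetical helper, not part of cargo add:

```rust
// Enumerate every `-`/`_` spelling of a crate name: 2^n candidates
// for n separator characters.
fn separator_permutations(name: &str) -> Vec<String> {
    let positions: Vec<usize> = name
        .char_indices()
        .filter(|&(_, c)| c == '-' || c == '_')
        .map(|(i, _)| i)
        .collect();
    let mut out = Vec::with_capacity(1 << positions.len());
    for mask in 0u32..(1 << positions.len()) {
        let mut bytes = name.as_bytes().to_vec();
        for (bit, &pos) in positions.iter().enumerate() {
            // Each bit of the mask picks `-` or `_` for one position.
            bytes[pos] = if mask & (1 << bit) != 0 { b'_' } else { b'-' };
        }
        out.push(String::from_utf8(bytes).unwrap());
    }
    out
}

fn main() {
    // One separator, so only two index files to probe.
    assert_eq!(separator_permutations("serde-json"), ["serde-json", "serde_json"]);
}
```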

@matklad (Member) commented Oct 21, 2019

I think it might make sense to spell out in some more detail how Cargo would use the new index. Currently, it works as follows.

For every command that can change the lockfile (most commonly, the first cargo build after a Cargo.toml change, or cargo update), the first thing Cargo does is update the registry (that is, pull new changes from the git repo). After that, Cargo has a fresh offline copy of the registry in a bare git checkout, and uses it for the actual dependency resolution.

The two interesting properties of the status quo are:

  • if you change the lockfile, you always use the most recent version of the index, so security patches and other minor updates propagate as fast as possible
  • index changes are globally atomic: Cargo works with a set of packages identified by a commit hash

What would we do in the new implementation? I think there's no explicit registry update anymore; instead, we update the registry lazily, on a per-crate basis, when we do version resolution?

@kornelski (Contributor, Author) commented Oct 21, 2019

@matklad This is covered in the "greedy fetch" section. It works as follows:

  1. Cargo still "updates the registry" as a separate step, but instead of fetching everything, it fetches the subset of the registry relevant to the specific workspace. At this stage it performs not a precise dependency resolution, but a "greedy" one, in order to discover and update every dependency that may be needed. The greedy fetch guarantees that all crates the actual dependency resolution algorithm might look at will be up to date.

  2. After the incremental update, Cargo has a fresh offline copy of the registry, except for crates that were not relevant to the update. It can then run the regular dependency resolution as usual and won't know the difference. (A sketch of the greedy phase follows this list.)
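
A sketch of that greedy phase; `fetch_index_file` and `dependency_names` are illustrative stubs standing in for the HTTP fetch and index-file parsing, not real Cargo APIs:

```rust
use std::collections::{HashSet, VecDeque};

// Breadth-first over-approximation: follow every dependency edge without
// resolving versions or features, so the precise resolver that runs next
// only ever reads index files that are already fresh.
fn greedy_fetch(roots: &[String]) -> HashSet<String> {
    let mut seen: HashSet<String> = roots.iter().cloned().collect();
    let mut queue: VecDeque<String> = roots.iter().cloned().collect();
    while let Some(name) = queue.pop_front() {
        // In a real client, each level's fetches would run in parallel.
        let index_file = fetch_index_file(&name);
        for dep in dependency_names(&index_file) {
            if seen.insert(dep.clone()) {
                queue.push_back(dep);
            }
        }
    }
    seen // every index file the resolver may consult is now up to date
}

// Illustrative stubs only.
fn fetch_index_file(name: &str) -> String {
    format!("{{\"name\":\"{name}\"}}")
}
fn dependency_names(_index_file: &str) -> Vec<String> {
    Vec::new()
}
```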

The registry storage format would change from a bare git repo to a directory structure similar to a git checkout. In theory the two could be mixed by adding "fake" commits to a local git repo to keep using it as the storage format, but I don't think that's worth the complexity.

@the8472 commented Oct 23, 2019

A possible alternative that wouldn't require HTTP/2 support: download the index as a zip, which could use GitHub's download-as-zip functionality.

@kornelski (Contributor, Author) commented Oct 24, 2019

The zip would only help the initial download (you wouldn't want to download 20MB+ on every cargo update), so it sounds like a variation of the "Initial index from rustup" alternative in the RFC.

@the8472 commented Oct 24, 2019

Zips don't have to be compressed ("store" compression), and if GitHub supports range requests, one could fetch only the headers and the parts that changed, i.e. do an incremental download.

@kornelski (Contributor, Author) commented Oct 24, 2019

Oh, that's clever! So yes, a ZIP could combine multiple requests into two or three range requests (find the offset of the ZIP central directory, fetch the directory, then fetch the ranges of the desired files); see the sketch after this list.

  • GitHub unfortunately doesn't support Range on its ZIPs, so crates-io would have to store the ZIP elsewhere.
  • Care would have to be taken to recover from race conditions (in case the ZIP changed after the client read the central directory).
  • If crates-io ever moved to storing crate metadata in a database, it'd be difficult to maintain a ZIP-like "API" on top of that. Plain HTTP/2 requests don't dictate how the index is stored server-side.
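
A sketch of those range requests, assuming a server that honours HTTP Range and a ZIP with no trailing comment (so the end-of-central-directory record is exactly the file's last 22 bytes); the URL handling is illustrative:

```rust
use reqwest::{blocking::Client, header};

// Fetch the ZIP central directory in two range requests; individual
// index files can then be fetched by the offsets it lists.
fn central_directory(client: &Client, url: &str) -> reqwest::Result<Vec<u8>> {
    // Request 1: suffix range for the fixed-size end-of-central-directory record.
    let eocd = client
        .get(url)
        .header(header::RANGE, "bytes=-22")
        .send()?
        .bytes()?;
    assert_eq!(&eocd[0..4], b"PK\x05\x06", "EOCD signature");
    let size = u32::from_le_bytes(eocd[12..16].try_into().unwrap()) as u64;
    let offset = u32::from_le_bytes(eocd[16..20].try_into().unwrap()) as u64;
    // Request 2: the central directory itself, which lists every
    // contained file's name, size, and offset.
    let dir = client
        .get(url)
        .header(header::RANGE, format!("bytes={}-{}", offset, offset + size - 1))
        .send()?
        .bytes()?;
    Ok(dir.to_vec())
}
```

The race-condition caveat above applies here: if the ZIP is replaced between the two requests, the offsets go stale, so a client would need to detect the mismatch and retry.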

Currently the index requires registries to list all of their crates ahead of time. In a model where individual crates are requested directly by name, a registry could even create crates on demand (lazily). That may be interesting for setups where a Cargo registry is a facade for other code repositories.

@remram44 commented Oct 24, 2019

You are describing something akin to zsync (or librsync), which interestingly I'm in the process of re-implementing in Rust 🤔

@gilescope commented Oct 24, 2019

If cargo hit static HTTP files rather than a git repo, wouldn't that make life a lot easier for private registries (Artifactory, for example) to support Rust? This seems like it would lower the barrier to supporting Rust in an enterprise setting, so a thumbs up from me.

@Eh2406 commented Nov 20, 2019

Personal opinions, not speaking for the team.

This feels like a direction that could actually work! I have participated in many "just use HTTP" conversations in the past; this is the first time it has sounded plausible. (Sorry if you explained it to me before and I was too dense to get it.)

At some point between our current size and NPM's scale, this will be faster. That being said, my sense is that the tipping point is far off. Additionally, it is very important to read the line "Since alternative registries are stable, the git-based protocol is stable, and can't be removed." As such, I will be spending my time on ways to improve the git-based protocol. This RFC is convincing enough that I would not object to someone spending their time on developing it.

On Performance:
This RFC assumes that a robust parallel HTTP/2 client will be maintainable and faster than a git fetch. I am skeptical by default, and would love to see examples showing that to be true. Async/await makes things much easier than they were last time I tried, and things continue to improve.

For Crates.io:
On CDN:
The index is generously hosted by GitHub with a good CDN. We avoid doing things that would get it kicked off (such as shallow clones). Which side of that line does using raw.githubusercontent.com fall on? I would like proactive assurance that switching will be OK. (I miss lolbench.rs and don't want that to happen to the index.) Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

For Alternative Registries:
I really like that this reduces the infrastructure required to set up an alternative registry to just a static file server!

@the8472 commented Nov 20, 2019

Should compression support be mandatory for clients? I.e., must they send an Accept-Encoding: ... header? gzip, br, zstd, ...?

@alexcrichton (Member) commented Dec 5, 2019

The Cargo team discussed this RFC over the past few weeks, and we wanted to leave some thoughts of what we discussed and what we're thinking about this RFC.

Overall none of us are opposed to the principles of this RFC or the idea it's chasing after. If there's a better solution for hosting and managing the index, that'd be great to implement! If it's better after all then it's better :)

Our main discussion point, though, was that implementing this style of index fetching is likely to be a very significant technical undertaking in Cargo. There are a ton of concerns to take care of, and it's not quite as simple as "just throw HTTP/2 at the problem and let it solve everything". We'd ideally like to see a prototype implementation to prove out the gains an architecture like this would bring, but a prototype is itself likely such a significant undertaking that there needs to be more buy-in before even that happens.

The conclusion we reached is that we'd be willing to accept this under the idea that it's an "eRFC" or "experimental RFC". It would basically be a blessing that the Cargo team is interested in seeing research and development around this area of hosting the index and fetching it on the client. We aren't wed to any particular details yet, and the actual details of this RFC as-written are highly likely to change while a solution is being developed.

@kornelski (Contributor, Author) commented Dec 8, 2019

I've implemented a proof-of-concept with greedy resolution that takes into account features and version ranges.

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept/blob/master/src/main.rs

Tested with the Rocket framework, which required 122 requests:

  • With a full 1 second of added latency per request, over HTTP/1.1, it fetched all dependencies in 11 seconds.
  • With 200ms of added latency per request, over HTTP/1.1: 3.5 seconds.
  • Without extra latency, over HTTP/2 on a fast connection: 1.6-3 seconds.

Tested on a fast connection with the 144 crates.io dependencies used by Servo, which required 344 requests:

  • over HTTP/1.1: 1.2-2s.
  • over HTTP/2: 0.7-1.3s.

As predicted, more dependencies make the lookup relatively faster, because it starts with more parallelism.

Just rand@0.7 - 8 requests:

  • over HTTP/1.1: 0.3s
  • over HTTP/2: 0.2s

All of these tests simulated the worst case: a blank state with no cache.

@kornelski (Contributor, Author) commented Dec 8, 2019

I've added cache and revalidation support. The cache flattens the check's critical path to a depth of 1, and revalidation keeps bandwidth use minimal, so all checks (regardless of which crates are checked) become almost equally fast.

With disk cache and a fast connection:

  • HTTP/1: 0.8s
  • HTTP/2: 0.4s

With disk cache and a 3G connection:

  • HTTP/1: 1.6-2.4s
  • HTTP/2: 0.7s-2s

@pietroalbini (Member) commented Dec 9, 2019

A problem this RFC doesn't seem to address with its local cache is deleting entries from the index: it's something we do very rarely, but sometimes we're legally required to do so (DMCA, GDPR...), and in those cases we can't leave the deleted entries in users' local caches. The git index solves this cleanly (it's just yet another commit).

> The index is generously hosted by GitHub with a good CDN. [...] Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

If we do this, we should just host the content on Rust infrastructure. We should still be on the $0.04/GB CloudFront pricing tier, which is sustainable for such small files.

@kornelski (Contributor, Author) commented Dec 9, 2019

Note that Cargo currently does not delete crate tarballs or checked-out copies of crates even after they have been removed from the index, and deleted metadata files continue to be distributed as part of git history until the index branch is squashed.

With a static file solution, when a client checks the freshness of a deleted crate, it will make a request to the server and see a 404/410/451 HTTP status. It can then be made to act accordingly and clean up the local data (even the tarball and source checkout); a sketch follows below.

If the client is not interested in a deleted crate, it won't check it, but chances are it never did and never downloaded it. If the ability to immediately erase deleted data is important, the "incremental changelog" feature can be extended to proactively announce deletions.
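
A sketch of that clean-up path; the `purge_local_data` helper is hypothetical, not a real Cargo function:

```rust
use reqwest::blocking::Client;

// Freshness check doubling as deletion detection: a 404/410/451 from
// the index means the crate was removed (451 covering legally mandated
// takedowns), so all locally cached data for it can be dropped.
fn check_crate(client: &Client, index_url: &str, name: &str) -> reqwest::Result<()> {
    let status = client.get(index_url).send()?.status();
    if matches!(status.as_u16(), 404 | 410 | 451) {
        purge_local_data(name);
    }
    Ok(())
}

// Hypothetical: remove the cached index file, tarball, and source checkout.
fn purge_local_data(name: &str) {
    println!("purging cached data for {name}");
}
```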

Review thread on text/0000-sparse-index.md:

> https://andre.arko.net/2014/03/28/the-new-rubygems-index-format/
>
> Bundler used to have a full index fetched ahead of time, similar to Cargo's, until it grew too large. Then it used a centralized query API, until that became too problematic to support. Then it switched to an incrementally downloaded flat-file index format similar to the solution proposed here.
@joshtriplett (Member) commented Dec 15, 2020

More prior art: Debian has "pdiffs", where the archive provides incremental deltas against the complete "Packages" file for a certain period of time, so that apt update can download those deltas if it isn't too far out of date. There have been several revisions of that approach, to minimize the number of diffs an update must download.

@kornelski (Contributor, Author) commented Dec 16, 2020

Is pdiff similar to zsync?

@alexcrichton (Member) commented Dec 16, 2020

@rfcbot fcp merge

I think this is in a good state to merge now personally, thanks again @jonhoo and @kornelski. To recap for the cc'd members: this RFC declares that the git index has a finite amount of time left before its scalability becomes a large problem, and proposes fetching each crate's index file over HTTP instead; the exact details of how this works are deliberately left unspecified (though examples are given). @jonhoo's prototype work shows that this should be a very viable solution for the registry index on crates.io, and while more implementation work in Cargo remains, we're already in a pretty good state.

I have also removed T-crates-io from this RFC since it currently relates only to Cargo. We've discussed with the crates-io team that work may come their way in the future, but while this is implemented in Cargo and tested in the ecosystem, the goal is to avoid any changes to crates.io itself.

@rfcbot commented Dec 16, 2020

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot commented Dec 19, 2020

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot commented Dec 29, 2020

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@ehuss approved these changes Jan 12, 2021
@ehuss merged commit 17b10d7 into rust-lang:master on Jan 12, 2021
@ehuss (Contributor) commented Jan 12, 2021

Huzzah! The @rust-lang/cargo team has decided to accept this RFC!

To track further discussion, subscribe to the tracking issue here: rust-lang/cargo#9069

@kornelski deleted the http-index branch Jan 13, 2021
@glandium commented Jan 13, 2021

I don't see `git clone --depth=1` being discussed in the RFC. It reports transferring 29.97 MiB, and allows for incremental updates after that.

@ssokolow commented Jan 13, 2021

@glandium It was mentioned in the discussion. Expand all the collapsed bits and Ctrl+F for "shallow clone".

@glandium commented Jan 14, 2021

Oh, the RFC does have something about shallow clones, but I was interpreting it as a narrow/partial clone for some reason. Sorry for the noise.
