From 35e00907942c133ee71e1c217726e7979a23e61f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kornel=20Lesin=CC=81ski?= Date: Fri, 18 Oct 2019 20:05:22 +0100 Subject: [PATCH 01/22] RFC: Serve crates-io registry over HTTP as static files --- text/0000-http-index.md | 124 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 124 insertions(+) create mode 100644 text/0000-http-index.md diff --git a/text/0000-http-index.md b/text/0000-http-index.md new file mode 100644 index 00000000000..581823ce59a --- /dev/null +++ b/text/0000-http-index.md @@ -0,0 +1,124 @@ +- Feature Name: http_index +- Start Date: 2019-10-18 +- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000) +- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) + +# Summary +[summary]: #summary + +Selective download of the crates-io index over HTTP, similar to a solution used by Ruby's Bundler. Changes transport from an ahead-of-time Git clone to HTTP fetch as-needed, while keeping existing content and structure of the index. Most importantly, the proposed solution works with static files and doesn't require custom server-side APIs. + +# Motivation +[motivation]: #motivation + +The full crate index is relatively big and slow to download. It will keep growing as crates.io grows, making the problem worse. The need to download the full index slows down the first use of Cargo. It's especially slow and wasteful in stateless CI environments, which download the full index, use only a tiny fraction of it, and throw it away. Caching of the index in hosted CI environments is difficult (`.cargo` dir is large) and often not effective (e.g. upload and download of large caches in Travis CI is almost as slow as a fresh index download). + +The kind of data stored in the index is not a good fit for the git protocol. The index content (as of eb037b4863) takes 176MiB as an uncompressed tarball, 16MiB with `gz -1`, and 10MiB compressed with `xz -6`. Git clone reports downloading 215MiB. That's more than just the uncompressed latest index content, and over **20 times more** than a compressed tarball. + +A while ago, GitHub indicated they [don't want to support shallow clones of large repositories](http://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/). libgit2 doesn't support shallow clones yet. Squashing of the index history adds complexity to management and consumption of the index (which is also used by tools other than Cargo), and still doesn't solve problems of the git protocol inefficiency and overall growth. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged, e.g.: + +``` +/config.json +/ac/ti +/ac/ti/action +/ac/ti/actiondb +/ac/ti/actions +/ac/ti/actions-toolkit-sys +/ac/ti/activation +/ac/ti/activeds-sys +… +``` + +To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde`. For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally. + +It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. 
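As a rough illustration of this layered, parallel traversal (a sketch only, not part of the proposed format; `fetch_index_file` is a hypothetical helper, and a real client would issue each layer's requests concurrently):

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical helper: download one index file over HTTP and return the
// names of all dependencies mentioned by any version of the crate.
fn fetch_index_file(name: &str) -> Vec<String> {
    unimplemented!("fetch and parse the crate's index file")
}

// Collect index data for the root crates and, transitively, for every crate
// they may depend on. Each "layer" of the queue could be fetched in parallel,
// so the number of sequential round-trips is bounded by the tree depth.
fn fetch_closure(roots: &[&str]) -> HashSet<String> {
    let mut seen: HashSet<String> = roots.iter().map(|s| s.to_string()).collect();
    let mut queue: VecDeque<String> = seen.iter().cloned().collect();
    while let Some(name) = queue.pop_front() {
        for dep in fetch_index_file(&name) {
            if seen.insert(dep.clone()) {
                queue.push_back(dep);
            }
        }
    }
    seen
}
```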
In practice it may be less, because dependencies may occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel. Similarly, cached dependency files can be used to speculatively check known sub-dependencies sooner. + +## Greedy fetch + +To simplify the implementation, and parallelize fetches effectively, Cargo will have to fetch all dependency information before performing the actual dependency resolution algorithm. This means it'll have to pessimistically fetch information about all sub dependencies of all dependency versions that *may* match known version requrements. This won't add much overhead, because requests are per create, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example: + +* foo v1.0.1 depends on old-dep v1.0.0 +* foo v1.0.2 depends on maybe-dep v1.0.2 +* foo v1.0.3 depends on maybe-dep v1.0.3 +* foo v1.0.4 has no dependencies + +If a dependency requires `foo >=1.0.2`, then Cargo would need to fetch information about `maybe-dep` (once), even if `foo v1.0.4` ends up being selected later. However, it would not need to fetch `old-dep`. If the version requirement was upgraded to `foo >=v1.0.4` then there wouldn't be any extra fetches. + +## Bandwidth reduction + +Cargo supports HTTP/2, which handles many similar requests efficiently. + +All fetched dependency files can be cached, and refreshed using conditional HTTP requests (with `Etag` or `If-Modified-Since` headers), to avoid redownloading of files that haven't changed. + +Dependency files compress well. Currently the largest file of `rustc-ap-rustc_data_structures` compresses from 1MiB to 26KiB with Brotli. Many servers support transparently serving pre-compressed files (i.e. request for `/rustc-ap-rustc_data_structures` can be served from `rustc-ap-rustc_data_structures.gz` with an appropriate content encoding header), so the index can use high compression levels without increasing CPU cost of serving the files. + +### Optionally, a rotated incremental changelog + +To further reduce number requests needed to update the index, the index may maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a line with: epoch number (explained below), last-modified timestamp, and the name of the changed crate, e.g. + +``` +1 2019-10-18 23:51:23 oxigen +1 2019-10-18 23:51:25 linda +1 2019-10-18 23:51:29 rv +1 2019-10-18 23:52:00 anyhow +1 2019-10-18 23:53:03 build_id +1 2019-10-18 23:56:16 canonical-form +1 2019-10-18 23:59:01 cotton +1 2019-10-19 00:01:44 kg-utils +1 2019-10-19 00:08:45 serde_traitobject +``` + +Because the log is append-only, the client can incrementally update it using a `Range` HTTP request. The client doesn't have to download the full log in order to start using it; it can download only an arbitrary fraction of it, up to the end of the file, which is straightforward with a `Range` request. When a crate is found in the log (searching from the end), and modification date is the same as modification date of crate's cached locally, the client won't have to make an HTTP request for the file. + +When the log grows too big, the epoch number can be incremented, and the log reset back to empty. 
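For illustration, a client's incremental update of this log could look like the following sketch (assuming the `reqwest` crate with its `blocking` feature enabled; error recovery is elided):

```rust
use reqwest::blocking::Client;
use reqwest::{header, StatusCode};

/// Sketch: append newly published changelog lines to a locally cached copy.
/// `local` holds the bytes fetched so far; the caller must still compare the
/// epoch field of each new line and discard the cache if the log was reset.
fn update_changelog(client: &Client, url: &str, local: &mut Vec<u8>) -> reqwest::Result<()> {
    let resp = client
        .get(url)
        .header(header::RANGE, format!("bytes={}-", local.len()))
        .send()?;
    match resp.status() {
        // 206: the server sent only the tail we asked for.
        StatusCode::PARTIAL_CONTENT => local.extend_from_slice(&resp.bytes()?),
        // 416: our offset is already at the end; nothing new was appended.
        StatusCode::RANGE_NOT_SATISFIABLE => {}
        // 200 or anything else: the server ignored the Range; take the full body.
        _ => *local = resp.bytes()?.to_vec(),
    }
    Ok(())
}
```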
The epoch number allows clients to detect that the log has been reset, even if the `Range` they requested happened to be valid for the new log file.

# Drawbacks
[drawbacks]: #drawbacks

* A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated.
* Performant implementation of this solution depends on making many small requests in parallel. This in practice requires HTTP/2 support on the server.
* It's uncertain if GitHub pages can handle this many files and the amount of traffic they generate, so the index may need to be hosted elsewhere.
* Since alternative registries are stable, the git-based protocol is stable, and can't be removed.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

## Query API

An obvious alternative would be to create a web API that can be asked to perform dependency resolution server-side (i.e. take a list of dependencies and return a lockfile or similar). However, this would require running the dependency resolution algorithm server-side. Maintenance of a dynamic API, critical for daily use for nearly all Rust users, is much harder and more expensive than serving of static files.

The proposed solution doesn't require any custom server-side logic. The index can be hosted on a static-file CDN, and can be easily cached and mirrored by users. It's not necessary to change how the index is populated, and the canonical version of the index can be kept as a git repository with the full history. This makes it easy to keep backwards compatibility with older versions of Cargo, as well as 3rd party tools that use the index in its current format.

The proposed solution fully preserves Cargo's ability to work offline (for every crate tarball available to use, there will be an index file cached).

## Initial index from rustup

Rust/Cargo installation could come bundled with an initial version of the index. This way when Cargo is run, it wouldn't have to download the full index over git, only a delta update from the seed version. The index would need to be packaged separately and intelligently handled by rustup to avoid downloading the index multiple times when upgrading or installing multiple versions of Cargo. This would make download and compression of the index much better, making the current implementation usable for longer, but it wouldn't prevent the index from growing indefinitely.

The proposed solution scales much better, because Cargo needs to download and cache only a "working set" of the index, and unused/abandoned/spam crates won't cost anything.

# Prior art
[prior-art]: #prior-art

https://andre.arko.net/2014/03/28/the-new-rubygems-index-format/

Bundler used to have a full index fetched ahead of time, similar to Cargo's, until it grew too large. Then it used a centralized query API, until that became too problematic to support. Then it switched to an incrementally downloaded flat file index format similar to the solution proposed here.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

* Should the changelog use a more extensible format?
* Instead of one file that gets reset, maybe it could be split into series of files (e.g. one per day or month, or a previous file ending with a filename of the next one).
* Can the changelog be compressed on the HTTP level?
There are subtle differences between content encoding and transfer encoding, important for `Range` requests. +* Should freshness of files be checked with an `Etag` or `Last-Modified`? Should these be "statelessly" derived from the hash of the file or modification date in the filesystem, or explicitly stored somewhere? +* How to configure whether an index (including alternative registries) should be fetched over git or the new HTTP? The current syntax uses `https://` URLs for git-over-HTTP. + +# Future possibilities +[future-possibilities]: #future-possibilities + +Bundler also uses an append-only format for individual dependency files to incrementally download only new versions' information where possible. Cargo's format is almost append-only (except yanking), so if growth of individual dependency files becomes a problem, it should be possible to fix that. However, currently the largest crate `rustc-ap-rustc_data_structures` that publishes versions daily grows by about 44 bytes per version (compressed), so even after 10 years it'll take only 190KB (compressed), which doesn't seem to be terrible enough to require a solution yet. From 8b572f6763c0dddc6c3559ef0676b8d808f512eb Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 19 Oct 2019 02:07:46 +0100 Subject: [PATCH 02/22] More about network speed of HTTP-index --- text/0000-http-index.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index 581823ce59a..8f9f1654883 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -49,6 +49,10 @@ To simplify the implementation, and parallelize fetches effectively, Cargo will If a dependency requires `foo >=1.0.2`, then Cargo would need to fetch information about `maybe-dep` (once), even if `foo v1.0.4` ends up being selected later. However, it would not need to fetch `old-dep`. If the version requirement was upgraded to `foo >=v1.0.4` then there wouldn't be any extra fetches. +## Offline support + +The proposed solution fully preserves Cargo's ability to work offline. Fetching of crates while online by necessity downloads enough of the index to use them, and all this data remains cached for use offline. + ## Bandwidth reduction Cargo supports HTTP/2, which handles many similar requests efficiently. @@ -57,7 +61,9 @@ All fetched dependency files can be cached, and refreshed using conditional HTTP Dependency files compress well. Currently the largest file of `rustc-ap-rustc_data_structures` compresses from 1MiB to 26KiB with Brotli. Many servers support transparently serving pre-compressed files (i.e. request for `/rustc-ap-rustc_data_structures` can be served from `rustc-ap-rustc_data_structures.gz` with an appropriate content encoding header), so the index can use high compression levels without increasing CPU cost of serving the files. -### Optionally, a rotated incremental changelog +Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files add up to about 39MiB). + +## Optionally, a rotated incremental changelog To further reduce number requests needed to update the index, the index may maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a line with: epoch number (explained below), last-modified timestamp, and the name of the changed crate, e.g. 
@@ -80,7 +86,7 @@ When the log grows too big, the epoch number can be incremented, and the log res # Drawbacks [drawbacks]: #drawbacks -* A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated. +* A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated. For GitHub-hosted indexes Cargo has a fast path that checks in GitHub API whether the master branch has changed. With the changelog file, the same fast path can be implemented by making a conditional HTTP request for the changelog file (i.e. checking `ETag` or `Last-Modified`). * Performant implementation of this solution depends on making many small requests in parallel. This in practice requires HTTP/2 support on the server. * It's uncertain if GitHub pages can handle this many files and the amount of traffic they generate, so the index may need to be hosted elsewhere. * Since alternative registries are stable, the git-based protocol is stable, and can't be removed. @@ -94,8 +100,6 @@ An obvious alternative would be to create a web API that can be asked to perform The proposed solution doesn't require any custom server-side logic. The index can be hosted on a static-file CDN, and can be easily cached and mirrored by users. It's not necessary to change how the index is populated, and the canonical version of the index can be kept as a git repository with the full history. This makes it easy to keep backwards compatibility with older versions of Cargo, as well as 3rd party tools that use the index in its current format. -The proposed solution fully preserves Cargo's ability to work offline (for every crate tarball available to use, there will be an index file cached). - ## Initial index from rustup Rust/Cargo installation could come bundled with an initial version of the index. This way when Cargo is run, it wouldn't have to download the full index over git, only a delta update from the seed version. The index would need to be packaged separately and intelligently handled by rustup to avoid downloading the index multiple times when upgrading or installing multiple versions of Cargo. This would make download and compression of the index much better, making current implementation usable for longer, but it wouldn't prevent the index from growing indefinitely. From edb8b5005272d7c5cb325cfb72452eb317b4ca0d Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 19 Oct 2019 02:07:56 +0100 Subject: [PATCH 03/22] Rsync is not a viable alternative to HTTP-index --- text/0000-http-index.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index 8f9f1654883..7ea0591ca02 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -106,6 +106,10 @@ Rust/Cargo installation could come bundled with an initial version of the index. The proposed solution scales much better, because Cargo needs to download and cache only a "working set" of the index, and unused/abandoned/spam crates won't cost anything. +## Rsync + +The rsync protocol requires scanning and checksumming of source and destination files, which creates a lot of unnecessary I/O, and it requires SSH or a custom daemon running on the server, which limits hosting options for the index. 
+ # Prior art [prior-art]: #prior-art From 456dd21bc3a8829bb5a642d5c131dbdce60f08f4 Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 19 Oct 2019 03:02:37 +0100 Subject: [PATCH 04/22] Mention raw.githubusercontent as a possible host for HTTP-index Thanks to @carols10cents for the idea --- text/0000-http-index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index 7ea0591ca02..42da94d6298 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -20,7 +20,7 @@ A while ago, GitHub indicated they [don't want to support shallow clones of larg # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged, e.g.: +Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged (the existing raw.githubusercontent.com view may even be enough for this). The current format is structured like this: ``` /config.json @@ -88,7 +88,7 @@ When the log grows too big, the epoch number can be incremented, and the log res * A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated. For GitHub-hosted indexes Cargo has a fast path that checks in GitHub API whether the master branch has changed. With the changelog file, the same fast path can be implemented by making a conditional HTTP request for the changelog file (i.e. checking `ETag` or `Last-Modified`). * Performant implementation of this solution depends on making many small requests in parallel. This in practice requires HTTP/2 support on the server. -* It's uncertain if GitHub pages can handle this many files and the amount of traffic they generate, so the index may need to be hosted elsewhere. +* If GitHub won't like high-traffic usage of the index via raw.githubusercontent.com, the index may need to be hosted elsewhere. * Since alternative registries are stable, the git-based protocol is stable, and can't be removed. # Rationale and alternatives From 43b7b58728fc1a1bf1e98bd307dd41abe7564487 Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 19 Oct 2019 16:32:19 +0100 Subject: [PATCH 05/22] HTTP-index drawback --- text/0000-http-index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index 42da94d6298..4b7e8e09628 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -90,6 +90,7 @@ When the log grows too big, the epoch number can be incremented, and the log res * Performant implementation of this solution depends on making many small requests in parallel. This in practice requires HTTP/2 support on the server. * If GitHub won't like high-traffic usage of the index via raw.githubusercontent.com, the index may need to be hosted elsewhere. * Since alternative registries are stable, the git-based protocol is stable, and can't be removed. +* Tools that perform fuzzy search of the index (e.g. `cargo add`) may need to make multiple requests or use some other method. 
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

From b1eac3e18d6ea7c2548f90ee1edd69e485a48d37 Mon Sep 17 00:00:00 2001
From: Ivan Tham
Date: Fri, 1 Nov 2019 19:57:22 +0800
Subject: [PATCH 06/22] Fix typo

---
 text/0000-http-index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 4b7e8e09628..86c4fe5af85 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -40,7 +40,7 @@ It's possible to request dependency files in parallel, so the worst-case latency

 ## Greedy fetch

-To simplify the implementation, and parallelize fetches effectively, Cargo will have to fetch all dependency information before performing the actual dependency resolution algorithm. This means it'll have to pessimistically fetch information about all sub dependencies of all dependency versions that *may* match known version requrements. This won't add much overhead, because requests are per create, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example:
+To simplify the implementation, and parallelize fetches effectively, Cargo will have to fetch all dependency information before performing the actual dependency resolution algorithm. This means it'll have to pessimistically fetch information about all sub dependencies of all dependency versions that *may* match known version requirements. This won't add much overhead, because requests are per crate, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example:

 * foo v1.0.1 depends on old-dep v1.0.0
 * foo v1.0.2 depends on maybe-dep v1.0.2

From a0505cc848c99ecfee2e852d92242b8ca9b4f56a Mon Sep 17 00:00:00 2001
From: Kornel
Date: Mon, 9 Dec 2019 11:22:46 +0000
Subject: [PATCH 07/22] Add conclusions from the proof of concept

---
 text/0000-http-index.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 86c4fe5af85..6a8e231bbf0 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -11,7 +11,7 @@ Selective download of the crates-io index over HTTP, similar to a solution used
 # Motivation
 [motivation]: #motivation

-The full crate index is relatively big and slow to download. It will keep growing as crates.io grows, making the problem worse. The need to download the full index slows down the first use of Cargo. It's especially slow and wasteful in stateless CI environments, which download the full index, use only a tiny fraction of it, and throw it away.
Caching of the index in hosted CI environments is difficult (`.cargo` dir is large) and often not effective (e.g. upload and download of large caches in Travis CI is almost as slow as a fresh index download). The kind of data stored in the index is not a good fit for the git protocol. The index content (as of eb037b4863) takes 176MiB as an uncompressed tarball, 16MiB with `gz -1`, and 10MiB compressed with `xz -6`. Git clone reports downloading 215MiB. That's more than just the uncompressed latest index content, and over **20 times more** than a compressed tarball. @@ -36,7 +36,7 @@ Expose the index over HTTP as simple files, keeping the existing content and dir To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde`. For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally. -It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it may be less, because dependencies may occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel. Similarly, cached dependency files can be used to speculatively check known sub-dependencies sooner. +It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it's less, because dependencies occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel. Similarly, cached dependency files can be used to speculatively check known sub-dependencies sooner. ## Greedy fetch @@ -87,10 +87,10 @@ When the log grows too big, the epoch number can be incremented, and the log res [drawbacks]: #drawbacks * A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated. For GitHub-hosted indexes Cargo has a fast path that checks in GitHub API whether the master branch has changed. With the changelog file, the same fast path can be implemented by making a conditional HTTP request for the changelog file (i.e. checking `ETag` or `Last-Modified`). -* Performant implementation of this solution depends on making many small requests in parallel. This in practice requires HTTP/2 support on the server. -* If GitHub won't like high-traffic usage of the index via raw.githubusercontent.com, the index may need to be hosted elsewhere. +* Performant implementation of this solution depends on making many small requests in parallel. HTTP/2 support on the server makes checking twice as fast compared to HTTP/1.1, but speed over HTTP/1.1 is still reasonable. +* If GitHub won't like high-traffic usage of the index via raw.githubusercontent.com, the index may need to be cached/hosted elsewhere. * Since alternative registries are stable, the git-based protocol is stable, and can't be removed. -* Tools that perform fuzzy search of the index (e.g. 
`cargo add`) may need to make multiple requests or use some other method.
+* Tools that perform fuzzy search of the index (e.g. `cargo add`) may need to make multiple requests or use some other method. URLs are already normalized to lowercase, so case-insensitivity doesn't require extra requests.

 # Rationale and alternatives
 [rationale-and-alternatives]: #rationale-and-alternatives
@@ -99,7 +99,7 @@

 An obvious alternative would be to create a web API that can be asked to perform dependency resolution server-side (i.e. take a list of dependencies and return a lockfile or similar). However, this would require running the dependency resolution algorithm server-side. Maintenance of a dynamic API, critical for daily use for nearly all Rust users, is much harder and more expensive than serving of static files.

-The proposed solution doesn't require any custom server-side logic. The index can be hosted on a static-file CDN, and can be easily cached and mirrored by users. It's not necessary to change how the index is populated, and the canonical version of the index can be kept as a git repository with the full history. This makes it easy to keep backwards compatibility with older versions of Cargo, as well as 3rd party tools that use the index in its current format.
+The proposed solution doesn't require any custom server-side logic. The index can be hosted on a static-file CDN, and can be easily cached and mirrored by users. It's not necessary to change how the index is populated. The canonical version of the index can be kept as a git repository with the full history. This makes it easy to keep backwards compatibility with older versions of Cargo, as well as 3rd party tools that use the index in its current format.

 ## Initial index from rustup

@@ -122,7 +122,7 @@

 * Should the changelog use a more extensible format?
-* Instead of one file that gets reset, maybe it could be split into series of files (e.g. one per day or month, or a previous file ending with a filename of the next one).
+* Instead of one file that gets reset, maybe the changelog could be split into a series of files (e.g. one per day or month, or a previous file ending with a filename of the next one).
 * Can the changelog be compressed on the HTTP level? There are subtle differences between content encoding and transfer encoding, important for `Range` requests.
 * Should freshness of files be checked with an `Etag` or `Last-Modified`? Should these be "statelessly" derived from the hash of the file or modification date in the filesystem, or explicitly stored somewhere?
 * How to configure whether an index (including alternative registries) should be fetched over git or the new HTTP? The current syntax uses `https://` URLs for git-over-HTTP.
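The conditional-request fast path mentioned in the patch above could be sketched as follows; this is an illustration only, assuming the `reqwest` crate (with its `blocking` feature), and whether `ETag` or `Last-Modified` is used remains one of the unresolved questions:

```rust
use reqwest::blocking::Client;
use reqwest::{header, StatusCode};

/// Sketch: revalidate one cached index file with a conditional GET.
/// Returns true if the cached copy is still fresh (HTTP 304 Not Modified).
fn is_fresh(client: &Client, url: &str, etag: &str) -> reqwest::Result<bool> {
    let resp = client
        .get(url)
        .header(header::IF_NONE_MATCH, etag)
        .send()?;
    Ok(resp.status() == StatusCode::NOT_MODIFIED)
}
```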
From 427c5e4c228b6693894cfd6d04e505c42e59e1ce Mon Sep 17 00:00:00 2001
From: Kornel
Date: Mon, 9 Dec 2019 11:34:44 +0000
Subject: [PATCH 08/22] Handling crates deleted from the index

---
 text/0000-http-index.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 6a8e231bbf0..4702e7c22ca 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -83,6 +83,12 @@ Because the log is append-only, the client can incrementally update it using a `

 When the log grows too big, the epoch number can be incremented, and the log reset back to empty. The epoch number allows clients to detect that the log has been reset, even if the `Range` they requested happened to be valid for the new log file.

+## Handling deleted crates
+
+When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout).
+
+If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature can be extended to notify about deletions.
+
 # Drawbacks
 [drawbacks]: #drawbacks

From 350330cb84366b681d6b0d2163f2c7261c023f52 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Fri, 27 Nov 2020 17:15:02 +0000
Subject: [PATCH 09/22] Update text/0000-http-index.md

Co-authored-by: Tobias Bieniek

---
 text/0000-http-index.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 4702e7c22ca..d4e71e6af27 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -68,15 +68,15 @@ To further reduce number requests needed to update the index, the index may main

 ```
-1 2019-10-18 23:51:23 oxigen
-1 2019-10-18 23:51:25 linda
-1 2019-10-18 23:51:29 rv
-1 2019-10-18 23:52:00 anyhow
-1 2019-10-18 23:53:03 build_id
-1 2019-10-18 23:56:16 canonical-form
-1 2019-10-18 23:59:01 cotton
-1 2019-10-19 00:01:44 kg-utils
-1 2019-10-19 00:08:45 serde_traitobject
+1 2019-10-18T23:51:23Z oxigen
+1 2019-10-18T23:51:25Z linda
+1 2019-10-18T23:51:29Z rv
+1 2019-10-18T23:52:00Z anyhow
+1 2019-10-18T23:53:03Z build_id
+1 2019-10-18T23:56:16Z canonical-form
+1 2019-10-18T23:59:01Z cotton
+1 2019-10-19T00:01:44Z kg-utils
+1 2019-10-19T00:08:45Z serde_traitobject
 ```

 Because the log is append-only, the client can incrementally update it using a `Range` HTTP request. The client doesn't have to download the full log in order to start using it; it can download only an arbitrary fraction of it, up to the end of the file, which is straightforward with a `Range` request. When a crate is found in the log (searching from the end), and modification date is the same as modification date of crate's cached locally, the client won't have to make an HTTP request for the file.
From e1c0c1eb1e6af7cfb966210a2c8ec871b769f9e9 Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 5 Dec 2020 13:42:40 +0000 Subject: [PATCH 10/22] Update text/0000-http-index.md Co-authored-by: Jon Gjengset --- text/0000-http-index.md | 20 -------------------- 1 file changed, 20 deletions(-) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index d4e71e6af27..c303694e04c 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -63,26 +63,6 @@ Dependency files compress well. Currently the largest file of `rustc-ap-rustc_da Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files add up to about 39MiB). -## Optionally, a rotated incremental changelog - -To further reduce number requests needed to update the index, the index may maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a line with: epoch number (explained below), last-modified timestamp, and the name of the changed crate, e.g. - -``` -1 2019-10-18T23:51:23Z oxigen -1 2019-10-18T23:51:25Z linda -1 2019-10-18T23:51:29Z rv -1 2019-10-18T23:52:00Z anyhow -1 2019-10-18T23:53:03Z build_id -1 2019-10-18T23:56:16Z canonical-form -1 2019-10-18T23:59:01Z cotton -1 2019-10-19T00:01:44Z kg-utils -1 2019-10-19T00:08:45Z serde_traitobject -``` - -Because the log is append-only, the client can incrementally update it using a `Range` HTTP request. The client doesn't have to download the full log in order to start using it; it can download only an arbitrary fraction of it, up to the end of the file, which is straightforward with a `Range` request. When a crate is found in the log (searching from the end), and modification date is the same as modification date of crate's cached locally, the client won't have to make an HTTP request for the file. - -When the log grows too big, the epoch number can be incremented, and the log reset back to empty. The epoch number allows clients to detect that the log has been reset, even if the `Range` they requested happened to be valid for the new log file. - ## Handling deleted crates When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout). From 1ecf39feb10d69761f7f601debadc9087b3edc99 Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 5 Dec 2020 14:06:18 +0000 Subject: [PATCH 11/22] Update text/0000-http-index.md Co-authored-by: Jon Gjengset --- text/0000-http-index.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index c303694e04c..250bb04b0fe 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -117,3 +117,25 @@ Bundler used to have a full index fetched ahead of time, similar to Cargo's, unt [future-possibilities]: #future-possibilities Bundler also uses an append-only format for individual dependency files to incrementally download only new versions' information where possible. Cargo's format is almost append-only (except yanking), so if growth of individual dependency files becomes a problem, it should be possible to fix that. 
However, currently the largest crate `rustc-ap-rustc_data_structures` that publishes versions daily grows by about 44 bytes per version (compressed), so even after 10 years it'll take only 190KB (compressed), which doesn't seem to be terrible enough to require a solution yet.
+
+## Provide an index summary
+
+The scheme as described so far must double-check the contents of every index file with the server to update the index, even if many of the files have not changed. An index update happens on a `cargo update`, but can also happen for other reasons, such as when a project has no lockfile yet, or when a new dependency is added. While HTTP/2 pipelining and conditional GET requests make requesting many unchanged files [fairly efficient](https://github.com/rust-lang/cargo/pull/8890#issuecomment-737472043), it would still be better if we could avoid those extraneous requests, and instead only request index files that have truly changed.
+
+One way to achieve this is for the index to provide a summary that lets the client quickly determine whether a given local index file is out of date. This can either come in the form of a complete "index-of-indexes" file (essentially a snapshot of the index tree), or in the form of a changelog. The former is a "large" item to fetch, since it is proportional in size to the size of the index (barring other optimizations), but may be necessary for other reasons such as whole-registry signing. Alternatively, the index could maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a line with: epoch number (explained below), last-modified timestamp, and the name of the changed crate, e.g.
+
+    1 2019-10-18T23:51:23Z oxigen
+    1 2019-10-18T23:51:25Z linda
+    1 2019-10-18T23:51:29Z rv
+    1 2019-10-18T23:52:00Z anyhow
+    1 2019-10-18T23:53:03Z build_id
+    1 2019-10-18T23:56:16Z canonical-form
+    1 2019-10-18T23:59:01Z cotton
+    1 2019-10-19T00:01:44Z kg-utils
+    1 2019-10-19T00:08:45Z serde_traitobject
+
+Because the log is append-only, the client can incrementally update it using a `Range` HTTP request. The client doesn't have to download the full log in order to start using it; it can download only an arbitrary fraction of it, up to the end of the file, which is straightforward with a `Range` request. When a crate is found in the log (searching from the end), and its modification date is the same as the modification date of the locally cached crate file, the client won't have to make an HTTP request for the file.
+
+When the log grows too big, the epoch number can be incremented, and the log reset back to empty. The epoch number allows clients to detect that the log has been reset, even if the `Range` they requested happened to be valid for the new log file.
+
+Ultimately, this RFC does not recommend such a scheme, as the changelog itself introduces [significant complexity](https://github.com/rust-lang/cargo/commit/bda120ad837e6e71edb334a44e64533119402dee) for relatively [rare gains](https://github.com/rust-lang/rfcs/pull/2789#issuecomment-738194824) that are also [fairly small in absolute value relative to a "naive" fetch](https://github.com/rust-lang/cargo/pull/8890#issuecomment-738316828). If support for index snapshots landed later for something like registry signing, the implementation of this RFC could take advantage of such a snapshot just as it could take advantage of a changelog.
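For illustration, one line of the log format sketched in the patch above is trivial to parse; this is a hypothetical sketch, not code proposed by the RFC:

```rust
/// One changelog entry: `<epoch> <RFC 3339 timestamp> <crate name>`.
struct LogEntry<'a> {
    epoch: u64,
    timestamp: &'a str, // left as text here; a real client might parse it
    name: &'a str,
}

fn parse_line(line: &str) -> Option<LogEntry<'_>> {
    let mut parts = line.splitn(3, ' ');
    Some(LogEntry {
        epoch: parts.next()?.parse().ok()?,
        timestamp: parts.next()?,
        name: parts.next()?,
    })
}
```

A client scanning the log backwards would compare each entry's `epoch` against the last epoch it saw, discarding its cached copy of the log when the value changes, i.e. when the log has been rotated.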
From 3ec0d7ac117c60beb50f9cbda00b2167ac264e3c Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 5 Dec 2020 14:58:54 +0000 Subject: [PATCH 12/22] Rename to Sparse index --- text/0000-http-index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index 250bb04b0fe..eacfd69ce4f 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -1,4 +1,4 @@ -- Feature Name: http_index +- Feature Name: sparse_index - Start Date: 2019-10-18 - RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000) - Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) From 5aea40459201ce01d490570699cc8482f4519764 Mon Sep 17 00:00:00 2001 From: Kornel Date: Sat, 5 Dec 2020 14:59:42 +0000 Subject: [PATCH 13/22] Small text changes --- text/0000-http-index.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/text/0000-http-index.md b/text/0000-http-index.md index eacfd69ce4f..65cd6da23d3 100644 --- a/text/0000-http-index.md +++ b/text/0000-http-index.md @@ -6,7 +6,7 @@ # Summary [summary]: #summary -Selective download of the crates-io index over HTTP, similar to a solution used by Ruby's Bundler. Changes transport from an ahead-of-time Git clone to HTTP fetch as-needed, while keeping existing content and structure of the index. Most importantly, the proposed solution works with static files and doesn't require custom server-side APIs. +Selective download of the crates-io index over HTTP, similar to a solution used by Ruby's Bundler. Changes transport from an ahead-of-time Git clone to HTTP fetch as-needed. The existing structure and content of the index can remain unchanged. Most importantly, the proposed solution works with static files and doesn't require custom server-side APIs. # Motivation [motivation]: #motivation @@ -15,12 +15,12 @@ The full crate index is relatively big and slow to download. It will keep growin The kind of data stored in the index is not a good fit for the git protocol. The index content (as of eb037b4863) takes 176MiB as an uncompressed tarball, 16MiB with `gz -1`, and 10MiB compressed with `xz -6`. Git clone reports downloading 215MiB. That's more than just the uncompressed latest index content, and over **20 times more** than a compressed tarball. -A while ago, GitHub indicated they [don't want to support shallow clones of large repositories](http://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/). libgit2 doesn't support shallow clones yet. Squashing of the index history adds complexity to management and consumption of the index (which is also used by tools other than Cargo), and still doesn't solve problems of the git protocol inefficiency and overall growth. +Shallow clones or squashing of git history are only temporary solutions. Besides the fact that GitHub indicated they [don't want to support shallow clones of large repositories](http://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/), and libgit2 doesn't support shallow clones yet, it still doesn't solve the problem that clients have to download index data for *all* crates. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged (the existing raw.githubusercontent.com view may even be enough for this). 
The current format is structured like this:
+Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged (similar to the existing raw.githubusercontent.com view). The current format is structured like this:

 ```
 /config.json
@@ -34,9 +34,9 @@
 …
 ```

-To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde`. For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally.
+To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde` (the paths are constructed and normalized the same way as for the git index). For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally.

-It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it's less, because dependencies occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel. Similarly, cached dependency files can be used to speculatively check known sub-dependencies sooner.
+It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it's less, because dependencies occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel.

 ## Greedy fetch

@@ -51,7 +51,7 @@

 ## Offline support

-The proposed solution fully preserves Cargo's ability to work offline. Fetching of crates while online by necessity downloads enough of the index to use them, and all this data remains cached for use offline.
+The proposed solution fully preserves Cargo's ability to work offline. Fetching of crates (while online) by necessity downloads enough of the index to use them, and all this data remains cached for use offline.

 ## Bandwidth reduction

@@ -61,7 +61,7 @@

 Dependency files compress well. Currently the largest file of `rustc-ap-rustc_data_structures` compresses from 1MiB to 26KiB with Brotli. Many servers support transparently serving pre-compressed files (i.e. request for `/rustc-ap-rustc_data_structures` can be served from `rustc-ap-rustc_data_structures.gz` with an appropriate content encoding header), so the index can use high compression levels without increasing CPU cost of serving the files.

-Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files add up to about 39MiB).
+Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files currently add up to about 39MiB).

From 566500941bcdb6d3090f9045d78d0b5354aed835 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Sat, 5 Dec 2020 14:59:54 +0000
Subject: [PATCH 14/22] Update drawbacks

---
 text/0000-http-index.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 65cd6da23d3..43963c648f9 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -72,10 +72,11 @@ If the client is not interested in the deleted crate, it won't check it, but chances
 # Drawbacks
 [drawbacks]: #drawbacks

-* A basic solution, without the incremental changelog, needs more requests and has higher latency to update the index. With the help of the incremental changelog, this is largely mitigated. For GitHub-hosted indexes Cargo has a fast path that checks in GitHub API whether the master branch has changed. With the changelog file, the same fast path can be implemented by making a conditional HTTP request for the changelog file (i.e. checking `ETag` or `Last-Modified`).
+* crates-io plans to add cryptographic signatures to the index as an extra layer of protection on top of HTTPS. Cryptographic verification of a git index is straightforward, but signing of a sparse HTTP index may be challenging.
* A basic solution, without the incremental changelog, needs many requests to update the index. This could have higher latency than a git fetch. However, in preliminary benchmarks it appears to be faster than a git fetch if the CDN supports enough (>60) requests in parallel. For GitHub-hosted indexes Cargo has a fast path that checks via the GitHub API whether the master branch has changed. With the incremental changelog file, the same fast path can be implemented by making a conditional HTTP request for the changelog file (i.e. checking `ETag` or `Last-Modified`).
 * Performant implementation of this solution depends on making many small requests in parallel. HTTP/2 support on the server makes checking twice as fast compared to HTTP/1.1, but speed over HTTP/1.1 is still reasonable.
-* If GitHub won't like high-traffic usage of the index via raw.githubusercontent.com, the index may need to be cached/hosted elsewhere.
-* Since alternative registries are stable, the git-based protocol is stable, and can't be removed.
+* `raw.githubusercontent.com` is not suitable as a CDN. The sparse index will have to be cached/hosted elsewhere.
+* Since the alternative registries feature is stable, the git-based index protocol is stable, and can't be removed.
 * Tools that perform fuzzy search of the index (e.g. `cargo add`) may need to make multiple requests or use some other method. URLs are already normalized to lowercase, so case-insensitivity doesn't require extra requests.

From e10071fcf70b76635489c40234d252ea99c81902 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Sat, 5 Dec 2020 15:01:04 +0000
Subject: [PATCH 15/22] Handle inconsistent caches

---
 text/0000-http-index.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/text/0000-http-index.md b/text/0000-http-index.md
index 43963c648f9..c2f5a18d65b 100644
--- a/text/0000-http-index.md
+++ b/text/0000-http-index.md
@@ -63,11 +63,25 @@ Dependency files compress well. Currently the largest file of `rustc-ap-rustc_da

 Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files currently add up to about 39MiB).

+An "incremental changelog" file (described in a later section) can be used to avoid many conditional requests.
+
 ## Handling deleted crates

 When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout).

-If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature can be extended to notify about deletions.
+If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature could be extended to notify about deletions.
+
+## Dealing with inconsistent HTTP caches
+
+The index does not require all files to form one cohesive snapshot. The index is updated one file at a time. Every file is updated in a separate commit, so for every file change there exists an index state that is valid with or without it. The index only needs to preserve a partial order of updates.
+
+From Cargo's perspective dependencies are always allowed to update independently. If a crate's dependencies' files are refreshed before the crate itself, it won't be different than if someone had used an older version of the crate.
+
+The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo will prevent that from happening, at least for the datacenter it can see: it requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.
+
+Ideally, the server should ensure that a previous file change is visible everywhere before making the next change, i.e. make the CDN purge the changed file, and wait for the purge to be executed before updating files that may depend on it. This may be difficult to guarantee in a global CDN, so Cargo needs a recovery mechanism:
+
+If crate A is found to depend on a crate B with a version that doesn't appear to exist in the index, Cargo should fetch crate B again with a cache buster. The cache buster can be a query string appended to the URL with either the current timestamp, or a timestamp parsed from the `last-modified` header of crate A's response: `?cachebust=12345678`.

 # Drawbacks
 [drawbacks]: #drawbacks

From 85ff5175cf2f36e1f65e282bc20106ba9d028683 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Mon, 7 Dec 2020 01:21:27 +0000
Subject: [PATCH 16/22] Rename sparse index rfc file

---
 text/{0000-http-index.md => 0000-sparse-index.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename text/{0000-http-index.md => 0000-sparse-index.md} (100%)

diff --git a/text/0000-http-index.md b/text/0000-sparse-index.md
similarity index 100%
rename from text/0000-http-index.md
rename to text/0000-sparse-index.md

From 745c99c098ac92eaecbb1da0cd20a33830daa233 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Mon, 7 Dec 2020 01:22:11 +0000
Subject: [PATCH 17/22] Cache bust as future possibility

---
 text/0000-sparse-index.md | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/text/0000-sparse-index.md b/text/0000-sparse-index.md
index c2f5a18d65b..ec877a572a1 100644
--- a/text/0000-sparse-index.md
+++ b/text/0000-sparse-index.md
@@ -63,25 +63,13 @@ Dependency files compress well. Currently the largest file of `rustc-ap-rustc_da

 Even in the worst case of downloading the entire index file by file, it should still use significantly less bandwidth than git clone (individually compressed files currently add up to about 39MiB).

-An "incremental changelog" file (described in a later section) can be used to avoid many conditional requests.
+An "incremental changelog" file (described in "Future possibilities") could be used to avoid many conditional requests.

 ## Handling deleted crates

 When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout).

-If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If the ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature can be extended to notify about deletions.
-
-## Dealing with inconsistent HTTP caches
-
-The index does not require all files to form one cohesive snapshot. The index is updated one file at a time. Every file is updated in a separate commit, so for every file change there exists an index state that is valid with or without it. The index only needs to preserve a partial order of updates.
-
-From Cargo's perspective dependencies are always allowed to update independently. If a crate's dependencies' files are refreshed before the crate itself, it won't be different than if someone had used an older version of the crate.
-
-The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo will prevent that from happening, at least for the datacenter it can see: it requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.
-
-Ideally, the server should ensure that a previous file change is visible everywhere before making the next change, i.e. make the CDN purge the changed file, and wait for the purge to be executed before updating files that may depend on it. This may be difficult to guarantee in a global CDN, so Cargo needs a recovery mechanism:
-
-If crate A is found to depend on a crate B with a version that doesn't appear to exist in the index, Cargo should fetch crate B again with a cache buster. The cache buster can be a query string appended to the URL with either the current timestamp, or a timestamp parsed from the `last-modified` header of crate A's response: `?cachebust=12345678`.
+If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If the ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature could be extended to notify about deletions.

 # Drawbacks
 [drawbacks]: #drawbacks
@@ -156,3 +144,17 @@ Because the log is append-only, the client can incrementally update it using a `
 When the log grows too big, the epoch number can be incremented, and the log reset back to empty. The epoch number allows clients to detect that the log has been reset, even if the `Range` they requested happened to be valid for the new log file.

 Ultimately, this RFC does not recommend such a scheme, as the changelog itself introduces [significant complexity](https://github.com/rust-lang/cargo/commit/bda120ad837e6e71edb334a44e64533119402dee) for relatively [rare gains](https://github.com/rust-lang/rfcs/pull/2789#issuecomment-738194824) that are also [fairly small in absolute value relative to a "naive" fetch](https://github.com/rust-lang/cargo/pull/8890#issuecomment-738316828).

 If support for index snapshots landed later for something like registry signing, the implementation of this RFC could take advantage of such a snapshot just as it could take advantage of a changelog.
+
+## Dealing with inconsistent HTTP caches
+
+The index does not require all files to form one cohesive snapshot. The index is updated one file at a time. Every file is updated in a separate commit, so for every file change there exists an index state that is valid with or without it. The index only needs to preserve a partial order of updates.
+
+From Cargo's perspective dependencies are always allowed to update independently. If a crate's dependencies' files are refreshed before the crate itself, it won't be different than if someone had used an older version of the crate.
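As an editorial illustration of the per-file refresh just described: because each index file carries its own validators, a client could revalidate any one cached file independently of all others. The sketch below is hypothetical and not part of this RFC or of Cargo's code; the `CachedFile` type and its fields are assumed.

```rust
/// Hypothetical local cache entry for one index file; the validators are
/// whatever `ETag`/`Last-Modified` values the server previously returned.
struct CachedFile {
    etag: Option<String>,
    last_modified: Option<String>,
}

/// Builds the conditional request headers for revalidating one file.
/// A `304 Not Modified` reply means the cached copy is still fresh and
/// no body is transferred; a `200` reply carries the updated file.
fn conditional_headers(cached: &CachedFile) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &cached.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(date) = &cached.last_modified {
        headers.push(("If-Modified-Since", date.clone()));
    }
    headers
}
```

Since every file is validated on its own, any subset of files can be refreshed in any order, which is exactly the property the surrounding paragraphs rely on.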
+
+The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo will prevent that from happening, at least for the datacenter it can see: it requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.
+
+Ideally, the server should ensure that a previous file change is visible everywhere before making the next change, i.e. make the CDN purge the changed file, and wait for the purge to be executed before updating files that may depend on it. This may be difficult to guarantee in a global CDN, so Cargo needs a recovery mechanism:
+
+If crate A is found to depend on a crate B with a version that doesn't appear to exist in the index, Cargo should fetch crate B again with a cache buster. The cache buster can be a query string appended to the URL with either the current timestamp, or a timestamp parsed from the `last-modified` header of crate A's response: `?cachebust=12345678`.
+
+A cache buster has an advantage over requests with `cache-control: no-cache`: it's more widely supported by CDNs, and it allows the "busted" URLs to still be cached by the CDN. Because a timestamp-based query string changes at most once per second, excess traffic to the origin is limited to about 1 request per second per file on average.

From 44031781050e604ca8ebf526966ac7f93e56138c Mon Sep 17 00:00:00 2001
From: Kornel
Date: Wed, 9 Dec 2020 23:48:37 +0000
Subject: [PATCH 18/22] Avoid dictating specifics of the implementation

---
 text/0000-sparse-index.md | 47 +++++++++------------------------------
 1 file changed, 10 insertions(+), 37 deletions(-)

diff --git a/text/0000-sparse-index.md b/text/0000-sparse-index.md
index ec877a572a1..0bd5518f961 100644
--- a/text/0000-sparse-index.md
+++ b/text/0000-sparse-index.md
@@ -20,27 +20,15 @@ Shallow clones or squashing of git history are only temporary solutions. Besides

 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation

-Expose the index over HTTP as simple files, keeping the existing content and directory layout unchanged (similar to the existing raw.githubusercontent.com view). The current format is structured like this:
-
-```
-/config.json
-/ac/ti
-/ac/ti/action
-/ac/ti/actiondb
-/ac/ti/actions
-/ac/ti/actions-toolkit-sys
-/ac/ti/activation
-/ac/ti/activeds-sys
-…
-```
-
-To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde` (the paths are constructed and normalized the same way as for the git index). For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally.
+Expose the index over HTTP as plain files. It would be enough to expose the existing index layout (like the raw.githubusercontent.com view), but the URL scheme may also be simplified for the HTTP case.
+
+To learn about crates and resolve dependencies, Cargo (or any other client) would make requests to known URLs for each dependency it needs to learn about, e.g. `https://index.example.com/se/rd/serde`. For each dependency the client would also have to request information about its dependencies, recursively, until all dependencies are fetched (and cached) locally.
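To illustrate the URL scheme above, together with the cache-busting fallback discussed in the earlier patches, a client could derive fetch URLs along the lines of the following sketch. This is not normative: it assumes ASCII crate names and mirrors the existing git index layout (which special-cases 1-3 character names); `index.example.com` is the placeholder host used elsewhere in this RFC.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Maps a crate name to its index file path, mirroring the git index
/// layout: short names are special-cased, longer names are sharded by
/// their first four characters. Names are normalized to lowercase, so
/// lookups are case-insensitive without extra requests.
fn index_path(name: &str) -> String {
    let name = name.to_ascii_lowercase(); // crate names are ASCII
    match name.len() {
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}

/// Appends a timestamp-based cache buster, so a stale CDN copy is
/// bypassed while the response stays cacheable under the new URL.
fn cache_busted_url(host: &str, name: &str) -> String {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before 1970")
        .as_secs();
    format!("https://{host}/{}?cachebust={ts}", index_path(name))
}

fn main() {
    assert_eq!(index_path("serde"), "se/rd/serde");
    println!("{}", cache_busted_url("index.example.com", "serde"));
}
```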
 It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it's less, because dependencies occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel.

 ## Greedy fetch

-To simplify the implementation, and parallelize fetches effectively, Cargo will have to fetch all dependency information before performing the actual dependency resolution algorithm. This means it'll have to pessimistically fetch information about all sub dependencies of all dependency versions that *may* match known version requirements. This won't add much overhead, because requests are per create, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example:
+To simplify the implementation, and parallelize fetches effectively, Cargo may fetch all possibly relevant dependency information before performing the actual precise dependency resolution algorithm. This would mean pessimistically fetching information about all sub dependencies of all dependency versions that *may* match known version requirements. This won't add much overhead, because requests are per crate, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example:

 * foo v1.0.1 depends on old-dep v1.0.0
 * foo v1.0.2 depends on maybe-dep v1.0.2
 * foo v1.0.3 depends on maybe-dep v1.0.3
 * foo v1.0.4 has no dependencies

@@ -67,7 +55,7 @@ An "incremental changelog" file (described in "Future possibilities") could be u

 ## Handling deleted crates

-When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout).
+The proposed scheme may support deletion of crates, if necessary. When a client checks freshness of a crate that has been deleted, it will make a request to the server and notice a 404/410/451 HTTP status. The client can then act accordingly, and clean up local data (even tarball and source checkout).

 If the client is not interested in the deleted crate, it won't check it, but chances are it never did, and didn't download it. If the ability to proactively erase caches of deleted crates is important, then the "incremental changelog" feature could be extended to notify about deletions.

@@ -125,19 +113,9 @@ Bundler uses an append-only format for individual dependency files to incrementa

 ## Incremental changelog

-The scheme as described so far must double-check the contents of every index file with the server to update the index, even if many of the files have not changed. An index update happens on a `cargo update`, but can also happen for other reasons, such as when a project has no lockfile yet, or when a new dependency is added. While HTTP/2 pipelining and conditional GET requests make requesting many unchanged files [fairly efficient](https://github.com/rust-lang/cargo/pull/8890#issuecomment-737472043), it would still be better if we could avoid those extraneous requests, and instead only request index files that have truly changed.
-
-One way to achieve this is for the index to provide a summary that lets the client quickly determine whether a given local index file is out of date. This can either come in the form of a complete "index-of-indexes" file (essentially a snapshot of the index tree), or in the form of a changelog. The former is a "large" item to fetch, since it is proportional in size to the size of the index (barring other optimizations), but may be necessary for other reasons such as whole-registry signing. Alternatively, the index could maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a line with: epoch number (explained below), last-modified timestamp, and the name of the changed crate, e.g.
+The scheme as described so far must revalidate freshness of every index file with the server to update the index, even if many of the files have not changed. An index update happens on a `cargo update`, but can also happen for other reasons, such as when a project has no lockfile yet, or when a new dependency is added. While HTTP/2 pipelining and conditional GET requests make requesting many unchanged files [fairly efficient](https://github.com/rust-lang/cargo/pull/8890#issuecomment-737472043), it would still be better if we could avoid those extraneous requests, and instead only request index files that have truly changed.

- 1 2019-10-18T23:51:23Z oxigen
- 1 2019-10-18T23:51:25Z linda
- 1 2019-10-18T23:51:29Z rv
- 1 2019-10-18T23:52:00Z anyhow
- 1 2019-10-18T23:53:03Z build_id
- 1 2019-10-18T23:56:16Z canonical-form
- 1 2019-10-18T23:59:01Z cotton
- 1 2019-10-19T00:01:44Z kg-utils
- 1 2019-10-19T00:08:45Z serde_traitobject
+One way to achieve this is for the index to provide a summary that lets the client quickly determine whether a given local index file is out of date. To spare clients from fetching a snapshot of the entire index tree, the index could maintain an append-only log of changes. For each change (crate version published or yanked), the log would append a record (a line) with: epoch number (explained below), last-modified timestamp, the name of the changed crate, and possibly other metadata if needed in the future.

 Because the log is append-only, the client can incrementally update it using a `Range` HTTP request. The client doesn't have to download the full log in order to start using it; it can download only an arbitrary fraction of it, up to the end of the file, which is straightforward with a `Range` request. When a crate is found in the log (searching from the end), and its modification date matches the modification date of the crate's locally cached file, the client won't have to make an HTTP request for the file.

@@ -147,14 +125,9 @@ Ultimately, this RFC does not recommend such a scheme, as the changelog itself i

 ## Dealing with inconsistent HTTP caches

-The index does not require all files to form one cohesive snapshot. The index is updated one file at a time. Every file is updated in a separate commit, so for every file change there exists an index state that is valid with or without it. The index only needs to preserve a partial order of updates.
-
-From Cargo's perspective dependencies are always allowed to update independently. If a crate's dependencies' files are refreshed before the crate itself, it won't be different than if someone had used an older version of the crate.
-
-The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo will prevent that from happening, at least for the datacenter it can see: it requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.
+The index does not require all files to form one cohesive snapshot. The index is updated one file at a time, and only needs to preserve a partial order of updates. From Cargo's perspective dependencies are always allowed to update independently.
-
-Ideally, the server should ensure that a previous file change is visible everywhere before making the next change, i.e. make the CDN purge the changed file, and wait for the purge to be executed before updating files that may depend on it. This may be difficult to guarantee in a global CDN, so Cargo needs a recovery mechanism:
+
+The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.
-
-If crate A is found to depend on a crate B with a version that doesn't appear to exist in the index, Cargo should fetch crate B again with a cache buster. The cache buster can be a query string appended to the URL with either the current timestamp, or a timestamp parsed from the `last-modified` header of crate A's response: `?cachebust=12345678`.
+
+However, there's always a possibility that CDN caches will be stale or expire in a "wrong" order. If Cargo detects that its cached copy of the index is stale (i.e. it finds a crate that depends on a dependency that doesn't appear to be in the index yet), it may recover from such a situation by re-requesting files from the index with a "cache buster" (e.g. the current timestamp) appended to their URL. This has the effect of reliably bypassing stale caches, even when CDNs don't honor `cache-control: no-cache` in requests.
-
-A cache buster has an advantage over requests with `cache-control: no-cache`: it's more widely supported by CDNs, and it allows the "busted" URLs to still be cached by the CDN. Because a timestamp-based query string changes at most once per second, excess traffic to the origin is limited to about 1 request per second per file on average.

From 32d2c2e9df5dd380c6f7da72a8cd31f7d7e84d22 Mon Sep 17 00:00:00 2001
From: Jon Gjengset
Date: Tue, 15 Dec 2020 09:30:16 -0800
Subject: [PATCH 19/22] Remove greedy fetch, add feasibility note

The greedy fetching algorithm should not be normative in the RFC since
it is likely to be entirely replaced in the ultimate implementation.
Instead, it serves (along with the experimental implementation) as a
feasibility proof for the sparse index strategy.
---
 text/0000-sparse-index.md | 21 ++++++----------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/text/0000-sparse-index.md b/text/0000-sparse-index.md
index 0bd5518f961..15539c8f007 100644
--- a/text/0000-sparse-index.md
+++ b/text/0000-sparse-index.md
@@ -26,17 +26,6 @@ To learn about crates and resolve dependencies, Cargo (or any other client) woul

 It's possible to request dependency files in parallel, so the worst-case latency of such dependency resolution is limited to the maximum depth of the dependency tree. In practice it's less, because dependencies occur in multiple places in the tree, allowing earlier discovery and increasing parallelization. Additionally, if there's a lock file, all dependencies listed in it can be speculatively checked in parallel.

-## Greedy fetch
-
-To simplify the implementation, and parallelize fetches effectively, Cargo may fetch all possibly relevant dependency information before performing the actual precise dependency resolution algorithm. This would mean pessimistically fetching information about all sub dependencies of all dependency versions that *may* match known version requirements. This won't add much overhead, because requests are per crate, not per crate version. It causes additional fetches only for dependencies that were used before, but were later dropped. Fetching is still narrowed by required version ranges, so even worst cases can be avoided by bumping version requirements. For example:
-
-* foo v1.0.1 depends on old-dep v1.0.0
-* foo v1.0.2 depends on maybe-dep v1.0.2
-* foo v1.0.3 depends on maybe-dep v1.0.3
-* foo v1.0.4 has no dependencies
-
-If a dependency requires `foo >=1.0.2`, then Cargo would need to fetch information about `maybe-dep` (once), even if `foo v1.0.4` ends up being selected later. However, it would not need to fetch `old-dep`. If the version requirement was upgraded to `foo >=v1.0.4` then there wouldn't be any extra fetches.
-
 ## Offline support

 The proposed solution fully preserves Cargo's ability to work offline. Fetching of crates (while online) by necessity downloads enough of the index to use them, and all this data remains cached for use offline.

@@ -98,11 +87,13 @@ Bundler used to have a full index fetched ahead of time, similar to Cargo's, unt

 # Unresolved questions
 [unresolved-questions]: #unresolved-questions

-* Should the changelog use a more extensible format?
-* Instead of one file that gets reset, maybe the changelog could be split into series of files (e.g. one per day or month, or a previous file ending with a filename of the next one).
-* Can the changelog be compressed on the HTTP level? There are subtle differences between content encoding and transfer encoding, important for `Range` requests.
-* Should freshness of files be checked with an `Etag` or `Last-Modified`? Should these be "statelessly" derived from the hash of the file or modification date in the filesystem, or explicitly stored somewhere?
 * How to configure whether an index (including alternative registries) should be fetched over git or the new HTTP? The current syntax uses `https://` URLs for git-over-HTTP.
+* How do we ensure that the switch to an HTTP registry does not cause a huge diff to all lock files?
+* How can the current resolver be adapted to enable parallel fetching of index files? It currently requires that each index file is available synchronously, which precludes parallelism (see the illustrative sketch below).
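One possible shape for such a prefetch pass, kept from the earlier "greedy fetch" description purely as an illustration and not as a prescribed design, is sketched below. `fetch_index_file` and `VersionEntry` are hypothetical stand-ins, not Cargo's real API or data model.

```rust
use std::collections::{HashSet, VecDeque};

/// One version's metadata from an index file; only the dependency names
/// matter for prefetching. A simplified stand-in for the real format.
struct VersionEntry {
    dep_names: Vec<String>,
}

/// Placeholder for `GET https://index.example.com/<prefix>/<name>`;
/// a real client would issue the HTTP request and parse the file.
fn fetch_index_file(_name: &str) -> Vec<VersionEntry> {
    Vec::new()
}

/// Breadth-first prefetch of every index file reachable from the root
/// dependencies. Each layer of the walk can be fetched concurrently
/// (e.g. over HTTP/2), so total latency is bounded by the depth of the
/// dependency tree; requests are per crate, not per crate version.
fn prefetch(roots: &[&str]) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue: VecDeque<String> = roots.iter().map(|s| s.to_string()).collect();
    while let Some(name) = queue.pop_front() {
        if !seen.insert(name.clone()) {
            continue; // this crate's file was already fetched
        }
        for version in fetch_index_file(&name) {
            queue.extend(version.dep_names);
        }
    }
    seen
}
```

After such a pass the resolver can run against fully local data; alternatively, the resolver itself could be made asynchronous, which is the open question raised in the last bullet above.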
+
+# Implementation feasibility
+
+An implementation of this RFC that uses a simple "greedy" algorithm for fetching index files has been tested in https://github.com/rust-lang/cargo/pull/8890, and demonstrates good performance, especially for fresh builds. The PR for that experimental implementation also suggests a strategy for modifying the resolver to obviate the need for the greedy fetching phase.

 # Future possibilities
 [future-possibilities]: #future-possibilities

From afa6078ee879c590f651aca69156104df10dfba4 Mon Sep 17 00:00:00 2001
From: Kornel
Date: Wed, 16 Dec 2020 00:16:31 +0000
Subject: [PATCH 20/22] Typo

Co-authored-by: Josh Triplett
---
 text/0000-sparse-index.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/text/0000-sparse-index.md b/text/0000-sparse-index.md
index 15539c8f007..2703b8a6cbc 100644
--- a/text/0000-sparse-index.md
+++ b/text/0000-sparse-index.md
@@ -13,7 +13,7 @@ Selective download of the crates-io index over HTTP, similar to a solution used

 The full crate index is relatively big and slow to download. It will keep growing as crates.io grows, making the problem worse. The requirement to download the full index slows down the first use of Cargo. It's especially slow and wasteful in stateless CI environments, which download the full index, use only a tiny fraction of it, and throw it away. Caching of the index in hosted CI environments is difficult (`.cargo` dir is large) and often not effective (e.g. upload and download of large caches in Travis CI is almost as slow as a fresh index download).

-The kind of data stored in the index is not a good fit for the git protocol. The index content (as of eb037b4863) takes 176MiB as an uncompressed tarball, 16MiB with `gz -1`, and 10MiB compressed with `xz -6`. Git clone reports downloading 215MiB. That's more than just the uncompressed latest index content, and over **20 times more** than a compressed tarball.
+The kind of data stored in the index is not a good fit for the git protocol. The index content (as of eb037b4863) takes 176MiB as an uncompressed tarball, 16MiB with `gzip -1`, and 10MiB compressed with `xz -6`. Git clone reports downloading 215MiB. That's more than just the uncompressed latest index content, and over **20 times more** than a compressed tarball.

 Shallow clones or squashing of git history are only temporary solutions. Besides the fact that GitHub indicated they [don't want to support shallow clones of large repositories](http://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/), and libgit2 doesn't support shallow clones yet, it still doesn't solve the problem that clients have to download index data for *all* crates.

@@ -121,4 +121,3 @@ The index does not require all files to form one cohesive snapshot. The index is

 The only case where stale caches can cause a problem is when a new version of a crate depends on the latest version of a newly-published dependency, and caches expired for the parent crate before expiring for the dependency. Cargo requires dependencies with sufficient versions to be already visible in the index, and won't publish a "broken" crate.

 However, there's always a possibility that CDN caches will be stale or expire in a "wrong" order. If Cargo detects that its cached copy of the index is stale (i.e. it finds a crate that depends on a dependency that doesn't appear to be in the index yet), it may recover from such a situation by re-requesting files from the index with a "cache buster" (e.g. the current timestamp) appended to their URL. This has the effect of reliably bypassing stale caches, even when CDNs don't honor `cache-control: no-cache` in requests.
-

From 1c2d7044ca175c8ba55c682b646a3c954a1dba32 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Tue, 12 Jan 2021 13:05:14 -0800
Subject: [PATCH 21/22] Update 2789 links.

---
 text/0000-sparse-index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0000-sparse-index.md b/text/0000-sparse-index.md
index 2703b8a6cbc..d143d23e5b4 100644
--- a/text/0000-sparse-index.md
+++ b/text/0000-sparse-index.md
@@ -1,7 +1,7 @@
 - Feature Name: sparse_index
 - Start Date: 2019-10-18
-- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
-- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
+- RFC PR: [rust-lang/rfcs#2789](https://github.com/rust-lang/rfcs/pull/2789)
+- Tracking Issue: [rust-lang/cargo#9069](https://github.com/rust-lang/cargo/issues/9069)

 # Summary
 [summary]: #summary

From 3fec7e2e44884cff1a8f6c4a278bd6896ce04227 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Tue, 12 Jan 2021 13:06:10 -0800
Subject: [PATCH 22/22] Merging RFC 2789

---
 text/{0000-sparse-index.md => 2789-sparse-index.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename text/{0000-sparse-index.md => 2789-sparse-index.md} (100%)

diff --git a/text/0000-sparse-index.md b/text/2789-sparse-index.md
similarity index 100%
rename from text/0000-sparse-index.md
rename to text/2789-sparse-index.md
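As a closing illustration of the "incremental changelog" idea that the final text demotes to a future possibility: a client consuming a `Range`-fetched tail of the log could process records along these lines. The record format (`epoch timestamp crate_name`) follows the example log shown in the RFC's earlier revisions; the function and its error handling are hypothetical, not part of any implementation.

```rust
/// Parses a tail of the changelog fetched with a `Range` request and
/// returns the names of crates that changed, oldest first. A `Range`
/// fetch may start mid-record, so the first (possibly partial) line is
/// skipped. If the epoch differs from the one seen previously, the log
/// was reset and the caller must fall back to per-file revalidation.
fn changed_crates(tail: &str, known_epoch: u64) -> Result<Vec<&str>, &'static str> {
    let mut changed = Vec::new();
    // Skip a possibly truncated first line.
    let start = tail.find('\n').map_or(tail.len(), |i| i + 1);
    for line in tail[start..].lines() {
        let mut fields = line.split_whitespace();
        let epoch: u64 = fields
            .next()
            .and_then(|f| f.parse().ok())
            .ok_or("malformed record")?;
        if epoch != known_epoch {
            return Err("epoch changed: changelog was reset");
        }
        // The timestamp fields are skipped here; a real client would
        // compare them against the cached files' modification dates.
        let name = fields.last().ok_or("malformed record")?;
        changed.push(name);
    }
    Ok(changed)
}
```

Scanning the returned records from the end lets the client stop at the first entry older than its last sync, which is why downloading only an arbitrary suffix of the log is sufficient.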