Implement experimental registry HTTP API from RFC #8890
Conversation
This commit implements the HTTP registry API from rust-lang/rfcs#2789. A proof-of-concept server implementation exists at https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept The client implementation supports the changelog extension in the RFC, but the server does not at the time of writing. This first draft does not take advantage of `Etag` to avoid re-downloads if the server does not support changelog. It also does not download multiple index files concurrently, since the Registry trait doesn't lend itself well to that at the moment.
With this change, a registry that does not support the optional changelog feature will no longer purge the index on every build. Instead, it will double-check the current file contents with the server for each load, which is fairly efficient if a given index file has _not_ changed. The changelog is (unsurprisingly) still much more efficient.
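Since the fallback path revalidates each cached index file with the server, the decision logic can be sketched roughly as follows (hypothetical types, not Cargo's actual code; the real client would send the cached `ETag` in an `If-None-Match` header and branch on the response status):

```rust
// Hypothetical sketch of per-file cache revalidation for a registry without
// changelog support. Types and names are illustrative, not Cargo's.
#[derive(Debug, PartialEq)]
enum CacheAction {
    UseCached, // server answered 304 Not Modified for our ETag
    Refresh,   // server answered with a new body; replace the cache
    Download,  // nothing cached yet; fetch unconditionally
}

fn revalidate(cached_etag: Option<&str>, response_status: Option<u16>) -> CacheAction {
    match (cached_etag, response_status) {
        // No local copy: an unconditional GET is required.
        (None, _) => CacheAction::Download,
        // We sent `If-None-Match` and the file is unchanged on the server.
        (Some(_), Some(304)) => CacheAction::UseCached,
        // The file changed (or the server ignored the ETag): take the new body.
        (Some(_), _) => CacheAction::Refresh,
    }
}
```

The cheap case is the common one: an unchanged file costs one conditional request and no body transfer.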
They're handled the same anyway.
(rust_highfive has picked a reviewer for you, use r? to override)
The biggest open question at the moment is how to make this effectively utilize HTTP/2 the way the RFC suggests. I'm personally leaning towards landing the experimental registry support in its current state, and then doing that re-architecting in a dedicated follow-up PR, but I'm happy to do it as part of this PR instead if that's preferable.
I have only had time to do a cursory skim of the code. But it looks like a lot of very good work. Thank you!
I think this is a bit of a non-starter. The resolver is already handling an NP-hard problem, and any loss of control increases the chance of exponential blow-up. Not to mention it is not the easiest thing to test, so we don't want to make that harder, for example by making it non-deterministic. The RFC describes a greedy-fetch solution to this quandary, which gets (or at least starts to get) the relevant ranges before the resolver gets started. Of course the optimum strategy is somewhere in between "prefetch everything you could possibly need" and "load only when needed", and we can work on designing that balanced strategy as this gets more established. Of course, it is also possible that you have something perfectly reasonable in mind and I am misunderstanding.
Given what I said above, I am definitely for waiting on bigger re-architecting. Let's get this set up enough so users can test whether there are bugs, then figure out how to make it efficient. One thought I have had for improving performance while we are testing both approaches is that Cargo stores a copy on disk after parsing from the git index. It may be cool to find some way for both ways of retrieving the registry to have a shared representation in that dir. I have not yet figured out how to make that technically possible, but it is something to think about.
IIRC one of the biggest open questions on the RFC was whether this was feasible in terms of how long it might take. Do you have some numbers on how long this implementation takes locally on some crate graphs? I think when this lands will be guided by experimental feedback from the implementation. If this is almost done except for some cleanup and performs well, it seems fine to land. If this is orders of magnitude slower than what we have today and has no viable path to improvement without major refactorings, I think it's best to hold off on landing, because it's otherwise largely dead code.
What I was specifically thinking was to change the loading interface into something like:

```rust
fn enqueue_load(&mut self, root: &Path, path: &Path) -> CargoResult<()>;
fn load_next(&mut self) -> CargoResult<Option<LoadedData<'_>>>;
```

And then restructure the resolver so that instead of walking the dependency tree directly, it continuously calls `load_next`.
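A toy sketch of that two-phase interface (names and the `LoadedData` shape are illustrative, not Cargo's real `RegistryData` trait; a real implementation would return borrowed data and overlap fetches in the background rather than resolving everything immediately):

```rust
use std::collections::VecDeque;
use std::path::{Path, PathBuf};

// Stand-in for Cargo's error type; illustrative only.
type CargoResult<T> = Result<T, String>;

// Simplified: the real proposal uses `LoadedData<'_>` with borrowed contents.
struct LoadedData {
    path: PathBuf,
    raw: Vec<u8>,
}

trait PipelinedRegistry {
    /// Queue an index file for loading; may kick off a background fetch.
    fn enqueue_load(&mut self, root: &Path, path: &Path) -> CargoResult<()>;
    /// Return the next file whose contents have arrived, or `None` once the
    /// queue is drained. The resolver would call this in a loop, enqueueing
    /// newly discovered dependencies as it goes.
    fn load_next(&mut self) -> CargoResult<Option<LoadedData>>;
}

// Toy in-memory implementation: everything "arrives" immediately.
struct InMemory {
    queue: VecDeque<PathBuf>,
}

impl PipelinedRegistry for InMemory {
    fn enqueue_load(&mut self, root: &Path, path: &Path) -> CargoResult<()> {
        self.queue.push_back(root.join(path));
        Ok(())
    }
    fn load_next(&mut self) -> CargoResult<Option<LoadedData>> {
        Ok(self.queue.pop_front().map(|path| LoadedData {
            raw: path.to_string_lossy().into_owned().into_bytes(),
            path,
        }))
    }
}
```

The point of splitting enqueue from load is that an HTTP backend can have many requests in flight between the two calls, which is where HTTP/2 multiplexing would pay off.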
So, the reason I didn't implement this immediately was that it feels unfortunate to essentially have two resolvers in the codebase, even though the greedy one is much simpler. The greedy fetch now also has to parse the index files (ideally re-using the cache in the process) and understand basic semver requirement specs. I'm happy to tackle that next though, as I agree that one of the two approaches above likely needs to be implemented for this approach to be competitive.
I haven't had a chance to test that yet, but will do so shortly. Without the greedy fetch above it's likely to do a bit worse, though I don't know by how much. The lack of support for a changelog file in the crates.io registry is also likely to make the evaluation yield incomplete and suboptimal results. I wonder how tricky it'd be to update crates.io to start keeping a changelog already. The changelog is super simple on the server side, so I feel like it shouldn't be too much work to get one included.
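Ignoring version requirements entirely, the greedy fetch amounts to a breadth-first walk over dependency names. A toy sketch (with a `HashMap` standing in for the remote index; a real implementation would narrow the walk using the semver ranges, so this over-fetches):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Toy model of the RFC's "greedy fetch": starting from the root crate's
// dependency names, fetch every index file that could possibly be needed.
// `index` maps a crate name to the names of its dependencies across all
// published versions.
fn greedy_fetch(root_deps: &[&str], index: &HashMap<&str, Vec<&str>>) -> HashSet<String> {
    let mut fetched = HashSet::new();
    let mut queue: VecDeque<&str> = root_deps.iter().copied().collect();
    while let Some(name) = queue.pop_front() {
        if !fetched.insert(name.to_string()) {
            continue; // this index file was already fetched
        }
        // "Fetching" the file reveals more dependency names to fetch.
        if let Some(deps) = index.get(name) {
            queue.extend(deps.iter().copied());
        }
    }
    fetched
}
```

Because each BFS level only depends on files from earlier levels, every file in a level can be requested concurrently, which is the main win over fetching lazily from inside the resolver.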
Seems reasonable -- I'll go ahead and start benchmarking. I think I'll probably then also implement the greedy fetch from the RFC, as I think without that it's unlikely to be competitive, especially in incremental cases.
As one of the people who know the resolver codebase best, one of the big things in favor of this RFC was not having to make massive changes to the resolver until this has proved itself effective and is in widespread testing. That being said, some integration may be better than a pure "greedy fetch" approach. An approach toward a next step that may be a plausible diff is:
But I am open to other reasonably small diffs. If we can't find one, we may have to do a full "greedy fetch" until it is clear that this is a good RFC for a good number of people. If crates.io wants to start publishing a changelog, we may want to have it hosted on a different branch/fork so that if things don't work out or it needs changes, it is not in the mainline history.
I started implementing the greedy fetch as a prefetching step in jonhoo@c5b2468, but I have to say I don't like it much. It causes a lot of what feels like duplication of concern with the resolver, and also introduces a decent amount of complexity in keeping track of which dependencies we have to fetch at which versions. Not to mention that it duplicates much of the logic around accessing the registry cache. Some of that can probably be fixed by tidying up the way the prefetching is done, but it feels like the wrong approach.

Perhaps you had a better idea for how this greedy fetching should be implemented that I can try @Eh2406? Otherwise, I think trying to re-use the existing resolver logic along the path you suggested above might be a more promising avenue. How exactly to structure this as a promise without cargo actually having async is going to be a bit of a puzzle, but I'll do some thinking and tinkering.

Is it a problem if the changelog appears in the mainline history for a short amount of time? Regardless, doing it on a branch should be easy enough (though it means crates.io would have to publish to two branches, which may be a pain) -- you already opt into the HTTP-based access mechanism explicitly anyway.
I'm not so certain the final implementation is going to really use all that much from this PR itself. This already mostly copies the existing parallel download code for packages, and it's not like we need to erase the git history, so it'll still be here. I agree that from a performance standpoint it is not too useful to land this without prefetching, but I think it's more important to land maintainable code than an implementation in an interim state which we know we want to change and/or remove.

I don't expect we'll ask users to test this out immediately after landing; what I'd prefer is that implementation work continued in-tree, but with things broken up into PRs for separate review. I don't personally mind reviewing larger PRs, but a longer discussion can get unwieldy (as is the case with this PR at this point). I'm ok if you'd prefer to land the resolver changes first and then land this afterwards; I am currently of the opinion that we just don't want to land the temporary implementation of prefetching here (since it infects a good number of places rather than being an isolated change).
In that case I'm in favor of holding off on this until the resolver change lands. I suppose another piece of work that could then be parallelized and happen before this PR is to generalize the existing parallel download code.
☔ The latest upstream changes (presumably #8994) made this pull request unmergeable. Please resolve the merge conflicts.
I'm going to hold off on dealing with the conflicts until #8985 lands.
I could not find a tracking issue for this, so I opened #9069. |
Hi, I'm looking for an HTTP-like API for fetching crates from a registry, like it's done in NuGet for .NET. This seems like the RFC for exactly that feature. I created issue #9364 for that, which I closed as it duplicates this RFC. Can I help somehow to implement or test the feature? I could implement a prototype in our private registry for testing, if that's easier than adding the feature to crates.io.

Have you thought about using the same authorization token that is used for pushing a crate for the pull mechanism? This would be helpful for private registries, where the pull should be authorized, too.
I am glad to hear that you are excited to help with this experiment! The work in this PR was incredibly helpful in figuring out that we want the RFC, but it suggested some big restructurings to Cargo that would make it easier to maintain and faster to run. I got started on the refactor but got distracted (#8985). I am very excited by it and expect to get back to it when I am settled into my new job. So there is some progress on having a draft of it available on nightly for use with private registries.

Ideas for how to make authorization work better for private registries would need to be a different RFC. There is a previous attempt at rust-lang/rfcs#2719, but I don't yet know enough to evaluate whether that is the design we would want. I was hoping to do a deep dive into that design space, so as to be able to talk competently about it, when this RFC is in a good place.
Hi, cargo sets the … Why is the request path for a crate index split, e.g. …? How do I get cargo to set the …?

Not sure who can help, @jonhoo or @Eh2406.

Best regards,
That, I believe, is to improve the indexing and retrieving of stored data, most likely for performance reasons, or maybe something to do with sharding? Or maybe because it wants to be similar to a git clone, as mentioned in the RFC?
Hi!
I'm not sure I follow — the idea here is that
As @pickfire alluded to, the idea here was primarily to be able to serve the index simply by fronting a git checkout of the index with an HTTP server. Of course, you don't need to do that (e.g., you can just use the last path parameter), but if we didn't include them with the request, then it'd be a pain.
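The path-splitting question comes from the index's directory layout, which buckets crate files by name length and prefix so no single directory grows too large. That layout can be sketched as a small helper (the function name is ours; crate names are ASCII, so byte slicing is safe here):

```rust
// Compute the index file path for a crate name, matching the git index
// layout that the HTTP requests mirror 1:1.
fn index_path(name: &str) -> String {
    let lower = name.to_lowercase(); // index file names are lowercased
    match lower.len() {
        0 => panic!("crate names are non-empty"),
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        3 => format!("3/{}/{}", &lower[..1], lower),
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    }
}
```

So `serde` lives at `se/rd/serde`, `log` at `3/l/log`, and `ab` at `2/ab`, which is why the request path appears "split".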
When you say "cargo" here, do you mean cargo as it is in this PR? With this PR, cargo should always send requests with …
Wonderful questions! And great examples of how we are still figuring out what design we will want. Please help hold us to account and make sure we adequately think about them before we stabilize.
When this PR was opened, the RFC wanted to be able to serve an HTTP index by checking out a git index and pointing a static file server at it. Maybe even just using …
Thank you all for your explanations. The more I poke around in the code, the more I get the feeling that trying to be as close as possible to the git approach does not work well with an HTTP API.

Current workflow

Please correct me if I got something wrong here. Add a source replacement to the cargo config:

```toml
[source.crates-io]
replace-with = 'alternative-registry'

[source.alternative-registry]
registry = 'sparse+http://my.registry.com'
```

Cargo pulls the `config.json` from the registry:

```json
{
    "dl": "http://my.registry.com/api/v1/crates",
    "api": "http://my.registry.com"
}
```

If a crate dependency … That's a lot to configure and do for basically two simple GET requests.

Proposal

Add a source replacement to the cargo config:

```toml
[source.crates-io]
replace-with = 'alternative-registry'

[source.alternative-registry]
http-registry = 'http://my.registry.com' # Replace `sparse` with a new key. I just think that is cleaner and less error prone
```

Skip the step where cargo pulls the `config.json`. Get the index with an … Download the chosen crate version from …

The approach with the … I find this approach a lot cleaner and easier to understand. What do you think?
There were two main motivations for the existing URL scheme:
So the RFC intentionally uses a 1:1 layout of the existing registry, so that it can be done simply by putting a CDN in front of the git repo. The config needs to remain supported for the git registry, and the ability to pass it through as-is may be useful for mirroring git registries via the sparse HTTP protocol.
As for … The keys aren't very consistent though. There's registry config:

```toml
[registries]
my-registry = { index = "https://my-intranet:8080/git/index" }
```

but source replacement uses:

```toml
[source.the-source-name]
registry = "https://example.com/path/to/index"
```

so kinda maybe we could bikeshed an alternative property that works everywhere.
@kornelski that makes sense, thanks for the detailed explanation. I've tried to add optional authorization to the HTTP registry branch, as such a feature would be useful for private registries where not only the push of crates needs to be restricted, but the pull, too. As it's my first time contributing to Cargo, I'm pretty sure the code is not the best it can be, so please give me feedback on whether such a feature is wanted at all and where I can improve the code, if you think optional authorization is a good idea. PR: jonhoo#1
Now that #10064 has landed, I think this can be picked up again and turned into a more real implementation with fewer hacks! |
@jonhoo I have an implementation (based on this PR) that I can send out soon. Is that alright? Or would you like to own getting this in? |
@arlosi Oh, no, go right ahead if you have something ready! |
I'm going to go ahead and close this in favor of #10470. |
HTTP registry implementation

Implement HTTP registry support described in [RFC 2789](rust-lang/rfcs#2789). Adds a new unstable flag `-Z http-registry` which allows cargo to interact with remote registries served over HTTP rather than git. These registries can be identified by URLs starting with `sparse+http://` or `sparse+https://`.

When fetching index metadata over HTTP, cargo only downloads the metadata for needed crates, which can save significant time and bandwidth over git. The format of the HTTP index is identical to a checkout of a git-based index.

This change is based on @jonhoo's PR #8890. cc @Eh2406

Remaining items:
- [x] Performance measurements
- [x] Make unstable only
- [x] Investigate unification of download system. Probably best done in separate change.
- [x] Unify registry tests (code duplication in `http_registry.rs`)
- [x] Use existing on-disk cache, rather than adding a new one.
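Detecting the `sparse+` prefix described above might look roughly like this (the helper name is made up for illustration; Cargo's real URL handling lives in its source-ID code):

```rust
// Return the inner HTTP(S) URL if this registry URL uses the sparse
// protocol marker, or None if it should be treated as a git index.
fn sparse_url(url: &str) -> Option<&str> {
    url.strip_prefix("sparse+")
        .filter(|rest| rest.starts_with("http://") || rest.starts_with("https://"))
}
```

Keeping the marker in the URL itself (rather than a separate config key) lets one string unambiguously select the protocol and still carry the endpoint.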