Add support for local mirrors of registries, take 2 #2857

Merged
merged 12 commits on Aug 2, 2016

Conversation

@alexcrichton
Member

alexcrichton commented Jul 12, 2016

This series of commits culminates in first class support in Cargo for local mirrors of registries. This is implemented through a number of other more generic mechanisms, and extra support was added along the way. The highlights of this PR, however, are:

Source redirection

New .cargo/config keys have been added to enable replacing one source with another. This functionality is intended to be used for mirrors of the main registry or other one-to-one source correspondences. The support looks like:

# in .cargo/config
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
registry = 'https://github.com/my-awesome/registry-index'

This configuration means that instead of using crates-io (e.g. https://github.com/rust-lang/crates.io-index), Cargo will query the my-awesome-registry source instead (configured to a different index here). This alternate source must be exactly the same as the crates.io index. Cargo assumes that replacement sources are exact 1:1 mirrors in this respect, and the following support is designed around that assumption.

When generating a lock file for a crate using a replacement registry, the original registry will be encoded into the lock file. For example, in the configuration above, all lock files will still mention crates.io as the registry that packages originated from. This semantically represents how crates.io is the source of truth for all crates, and this is upheld because all replacements have a 1:1 correspondence.

Overall, this means that no matter what replacement source you're working with, you can ship your lock file to anyone else and you'll all still have verifiably reproducible builds!

Adding sha256 checksums to the lock file

With the above support for custom registries, it's now possible for a project to be downloading crates from any number of sources. One of Cargo's core goals is reproducible builds, and with all these new sources of information it may be easy for a few situations to arise:

  1. A local replacement of crates.io could be corrupt
  2. A local replacement of crates.io could have made subtle changes to crates

In both of these cases, Cargo would today simply give non-reproducible builds. To help assuage these concerns, Cargo will now track the sha256 checksum of all crates from registries in the lock file. Whenever a Cargo.lock is generated from now on it will contain a [metadata] section which lists the sha256 checksum of all crates in the lock file (or <none> if the sha256 checksum isn't known).

Cargo already checks registry checksums against what's actually downloaded, and Cargo will now verify between iterations of the lock file that checksums remain the same as well. This means that if a local replacement registry is not in a 1:1 correspondence with crates.io, the lock file will prevent the build from progressing until the discrepancy is resolved.
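
For illustration, an entry in the new [metadata] section looks roughly like the following (the crate name and hash below are placeholders, not real values):

# in Cargo.lock
[metadata]
"checksum foo 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)" = "<64 hex characters: sha256 of foo-0.1.0.crate>"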

Local Registries

In addition to the support above, there is now a new kind of source in Cargo, a "local registry", which is intended to hold a subset of the crates.io ecosystem for building a particular project locally. The way to enable this looks like:

# in .cargo/config
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
local-registry = 'path/to/my/local/registry'

This local registry is expected to have two components:

  1. A directory called index which matches the same structure as the crates.io index. The config.json file is not required here.
  2. Inside the registry directory are any number of .crate files (downloaded from crates.io). Each crate file has the name <package>-<version>.crate.
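
For illustration, a minimal local registry meeting those two requirements might look like this (the crate names here are just examples):

local-registry
├── index
│   └── ... (crates.io-style index entries)
├── memchr-0.1.11.crate
└── semver-0.2.3.crate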

This local registry must currently be managed manually, but I plan on publishing and maintaining a Cargo subcommand to manage a local registry. It will have options to do things like:

  1. Sync a local registry with a Cargo.lock
  2. Add a registry package to a local registry
  3. Remove a package from a local registry

Directory registries

In addition to local registries, Cargo also supports a "directory source", like so:

# in .cargo/config
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
directory = 'path/to/some/sources'

A directory source is similar to a local registry above, except that all the crates are unpacked and visible as vendored source. This format is suitable for checking into source trees, like Gecko's.

Unlike local registries above, we don't have a tarball to verify the crates.io checksum with, but each vendored dependency has metadata containing what that checksum would have been. To further prevent modifications by accident, the metadata also contains the checksum of each file, which should catch accidental local modifications and steer users towards [replace] as the mechanism to edit dependencies if necessary.
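
As a rough sketch of the idea (the file name and exact shape are assumptions for illustration, not verbatim from this PR), each vendored package would carry a small JSON metadata file along these lines:

{
  "package": "<sha256 of the upstream .crate file, if known>",
  "files": {
    "Cargo.toml": "<sha256 of this vendored file>",
    "src/lib.rs": "<sha256 of this vendored file>"
  }
}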

What's all this for?

That's quite a few new features! What's all this meant to do? Some example scenarios that this is envisioned to solve are:

  1. Supporting mirrors for crates.io in a first class fashion. Once we have the ability to spin up your own local registry, it should be easy to locally select a new mirror.
  2. Supporting round-robin mirrors: this provides an easy vector for configuring "instead of crates.io, hit the first source in this list that works" (a hypothetical sketch follows this list)
  3. Build environments where network access is not an option. Preparing a local registry ahead-of-time (from a known good lock file) will be a vector to ensure that all Rust dependencies are locally available.
    • Note this is intended to include use cases like Debian and Gecko
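
As a purely hypothetical sketch of the round-robin idea in item 2 (this syntax is not implemented by this PR), the configuration might eventually look something like:

# in .cargo/config -- hypothetical, not yet supported
[source.crates-io]
replace-with = ['mirror-a', 'mirror-b']   # try each in turn

[source.mirror-a]
registry = 'https://mirror-a.example.com/index'

[source.mirror-b]
registry = 'https://mirror-b.example.com/index'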

What's next?

Even with the new goodies here, there's some more vectors through which this can be expanded:

  • Support for running your own mirror of crates.io needs to be implemented so that it's "easy to do". There should, for example, be a cargo install foo available to have everything "Just Work".
  • Replacing a source with a list of sources (attempted in round robin fashion) needs to be implemented
  • Eventually this support will be extended to the Cargo.toml file itself. For example:
    • packages should be downloadable from multiple registries
    • replacement sources should be encodable into Cargo.toml (note that these replacements, unlike the ones above, would be encoded into Cargo.lock)
    • adding multiple mirrors to a Cargo.toml should be supported
  • Implementing the subcommand above to manage local registries needs to happen (I will attend to this shortly)
@rust-highfive

rust-highfive Jul 12, 2016

@alexcrichton: no appropriate reviewer found, use r? to override

@alexcrichton

Member

alexcrichton commented Jul 12, 2016

This is a rebasing of #2361 with directory support added in, I'll continue to follow up with more comments in a bit (hit the "make a PR" button a bit soon...)

@alexcrichton

Member

alexcrichton commented Jul 12, 2016

Ok,

cc @rust-lang/tools, a significant addition to Cargo with lots of room to grow as well. This is likely the start of Cargo supporting multiple registries. A next major work item here is to actually make it possible to bring your own mirror registry of crates.io online. That is, write the server which mirrors crates.io and otherwise ships crates to you.

Most of that shouldn't be too hard, but it'll be hard to get it nice and ergonomic the first time around I believe.

cc @luser, this is the support in Cargo necessary for vendoring dependencies in Gecko. The workflow we imagine is:

  • When Gecko is checked out, it has a bunch of vendored Rust dependencies from crates.io
  • When Gecko is configured, you have an option of developing Rust in "online mode" or "offline mode", the former hitting crates.io and the latter using the vendored sources. The switch is communicated to Cargo via a bit of .cargo/config configuration which is not checked into the source tree.
  • Cargo will strive very hard to ensure that online mode and offline mode all compile the exact same thing so long as you have the same Cargo.lock. To that end Cargo will reject modifications to the vendored source code, instead suggesting the use of [replace] which is intended exactly for making modifications to a dependency quickly. I can go into more detail if you'd like, but it suffices to say that the workflow for modifying and testing out changes to dependencies should be quite smooth.
  • Whenever a new dependency is added, someone with the cargo vendor subcommand will have to re-run it to ensure the "offline mode" build will continue to work. This has yet to be written, but will likely have syncing options and other various configuration (also shouldn't be too hard to write).
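
For reference, the [replace] mechanism mentioned above is driven from the top-level Cargo.toml; a minimal sketch (the crate name and path are just examples) looks like:

# in Cargo.toml of the top-level project
[replace]
"libc:0.2.14" = { path = "path/to/patched/libc" }
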
@cardoe

Contributor

cardoe commented Jul 12, 2016

So just so I'm clear, the local registry directory layout should look like:

.
└── local-registry
    ├── index
    │   └── git-clone-or-snapshot-crates.io-index
    ├── memchr-0.1.11.crate
    ├── rand-0.3.14.crate
    ├── rustfmt-0.5.0.crate
    └── semver-0.2.3.crate

And the plan is to nail this down as ABI so people can depend on this layout to remain constant?

@alexcrichton

Member

alexcrichton commented Jul 12, 2016

@cardoe indeed! Actually that reminds me that I forgot to write documentation on this in src/doc. I plan on documenting the format of local registries as well as directory sources so external tools can create and manage those workspaces.

I've previously written cargo local-registry for creating local registries, and I plan on updating cargo vendor to emit a directory source (the format this PR implements) once this is merged.

@brson brson self-assigned this Jul 12, 2016

@luser

Contributor

luser commented Jul 13, 2016

This all sounds great. The directory source should meet our needs exactly.

Cargo will strive very hard to ensure that online mode and offline mode all compile the exact same thing so long as you have the same Cargo.lock. To that end Cargo will reject modifications to the vendored source code, instead suggesting the use of [replace] which is intended exactly for making modifications to a dependency quickly. I can go into more detail if you'd like, but it suffices to say that the workflow for modifying and testing out changes to dependencies should be quite smooth.

This seems pretty reasonable. We'll have to nail down (and document) the process for making changes to vendored libraries. I know you or @wycats mentioned a potential cargo fork workflow which would be great. I think this is one of those things that we're just going to have to learn as we go along, although I'm sure Servo has some great experience we can build off of.

@Ericson2314

Contributor

Ericson2314 commented Jul 13, 2016

@alexcrichton I think I'm starting to understand this better :). If the section is called [source], perhaps any source including our existing git and path sources could be used as a mirror? Might that mean there would be no need for directory registries?

@aneeshusa

aneeshusa Jul 14, 2016

It wasn't clear to me, but looking at some of the test cases it doesn't look like the [metadata] format records the hash type, e.g. sha256. Can we please record the hash algorithm with the hashes themselves? This allows for cryptographic agility in the future in order to upgrade to stronger hashes.

I'd also like to see sha512 used as the default algorithm since this is a new feature, and I don't think the extra length will cause any noticeable performance impacts.

@alexcrichton

Member

alexcrichton commented Jul 14, 2016

@luser

We'll have to nail down (and document) the process for making changes to vendored libraries

Yeah definitely, I'll try to write up documentation for all this, especially on the expected workflows. We can also chat more directly to make sure we're comfortable with it all.

@Ericson2314

If the section is called [source], perhaps any source including our existing git and path sources could be used as a mirror? Might that mean there would be no need for directory registries?

Unfortunately, no. To replace a source with another the checksum property (whether or not the source has checksums) must be the same. Registries have checksums, but git repositories and path sources do not. Vendored sources do, however.

@aneeshusa

Can we please record the hash algorithm with the hashes themselves? This allows for cryptographic agility in the future in order to upgrade to stronger hashes.

Yeah I was thinking we may want to do something like that. Of course we could also just say that if it starts with a bunch of unknown hex characters it's a sha256 hash, so it's not necessarily super pressing.

@bors

Contributor

bors commented Jul 14, 2016

☔️ The latest upstream changes (presumably #2858) made this pull request unmergeable. Please resolve the merge conflicts.

@Ericson2314

Contributor

Ericson2314 commented Jul 14, 2016

@alexcrichton Could those checksums be generated on the fly? If the checksum is generated by cargo publish (as opposed to the crates.io server) Cargo already contains the functionality!

@aneeshusa

aneeshusa Jul 14, 2016

@alexcrichton Saying "if it starts with a bunch of unknown hex characters it's a sha256 hash" worries me.

  • guessing formats has always been a source of vulnerabilities (see the libmagic CVEs for example), but this is especially bad for crypto; here is a blog post with another example.
  • other hashes also use hex characters; how would you distinguish between them in the future? just using the length is not safe! SHA3 also has -256, -512, etc. variants, so these would be indistinguishable from the SHA2 versions without an explicit identifier.

Please be explicit here in the metadata format to allow for future changes without any guessing. The hash types used here will need to change in the future, so it would be better to avoid having to keep around a separate legacy "if there's no identifier yolo sha256" code path.

See pip for a good example of package verification.
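
For example, an algorithm-prefixed lock file entry along those lines might look like this (purely hypothetical; not what this PR emits):

[metadata]
"checksum foo 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)" = "sha512:<hex digest>"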


@joshtriplett

Member

joshtriplett commented Jul 14, 2016

This looks excellent, and should work well for Debian packaging.

Regarding the local-registry index, could Cargo support reading the index from multiple files, in an index.d directory, if one exists? That would allow each crate to install its associated index entry. Without that, Debian would have to have a trigger that updates the index for every installed crate package.

Looking at the index format, it comes close to supporting this, but still requires a list of versions of each crate within a single file, which would make it harder to package each crate version separately. If each crate version could install a separate file, then packages of crate versions wouldn't need to run any code at installation time.
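
A hypothetical sketch of that index.d idea (not something Cargo supports), where each crate version's package ships its own index fragment:

index.d
├── memchr-0.1.11      # JSON index entry for just this version
├── rand-0.3.14
└── semver-0.2.3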

@Ericson2314

Contributor

Ericson2314 commented Jul 14, 2016

@alexcrichton Moreover, for the NixOS case, the idea is that every source, not just crates.io, would need to be mirrored (nix would fetch all git repos, etc., so that Cargo would work without network access). So being able to mirror any source (not just registries) is crucial for our use cases.

This is technically unrelated to being able to mirror a source with any sort of source, but as a practical matter, it would be much easier to replace a git source with its local clone than have to create a single-repo directory registry.

CC NixOS/nixpkgs#11144 @wycats ^ basically what we were talking about re nix.

@cuviper

Member

cuviper commented Jul 14, 2016

@joshtriplett Wouldn't the directory-registry work better for distro use? That's what I was leaning toward, then no index management is needed. The -devel packages would just drop their source into that shared root.

@bryteise

bryteise Jul 14, 2016

@cuviper That makes more sense to me for a distro use case. The only thing this runs into trouble with is if people expect to carry different patches for the same build dependency across different packages which seems to be wrong in my mind.


@cuviper

Member

cuviper commented Jul 14, 2016

@bryteise I would expect those people to hash out their differences, just like they must for any other common dependency. IOW multiple sources shouldn't be dropping the same crate in that path, just one shared source that has to work for everyone.

@alexcrichton

Member

alexcrichton commented Jul 14, 2016

@Ericson2314

Could those checksums be generated on the fly?

No, I described this in "Directory registries" in the PR description.

So being able to mirror any source (not just registries), is crucial from our cases.

Yes, this is intended to be extensible for things like git repos as well, although it'd have to be specified once per git repo as well.
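
A hypothetical sketch of what such per-repository replacement could look like, in the same .cargo/config style as above (none of this is implemented by this PR):

# in .cargo/config -- hypothetical, one entry per git repository
[source.upstream-foo]
git = 'https://github.com/example/foo'
replace-with = 'local-foo'

[source.local-foo]
git = 'file:///path/to/local/mirror/of/foo'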

@joshtriplett

in an index.d directory, if one exists

Could you elaborate what you mean by this? I'm unfortunately not sure what index.d is :(. Right now it's kinda nice sharing all the index handling logic between the remote and local registries, and it'd be a bit of a shame to lose but perhaps not the worst!

@cuviper

Member

cuviper commented Jul 14, 2016

Usually "foo.d" is a directory of individual files to augment the main one. Like how there's /etc/profile and then everything in /etc/profile.d/ extends it. Packages can drop their own files in without having to merge in one common file.

@Ericson2314

Contributor

Ericson2314 commented Jul 14, 2016

@alexcrichton you mean this paragraph?

Unlike local registries above we don't have a tarball to verify the crates.io checksum with, but each vendored dependency has metadata containing what it would have been. To further prevent modifications by accident, the metadata contains the checksum of each file which should prevent accidental local modifications and steer towards [replace] as the mechanism to edit dependencies if necessary.

I don't see why those files can't just be hashed on the fly. If the lockfile doesn't already exist, we are already assuming that the mirror reflects the original---precomputed hashes do not get around the fact that this is a matter of trust. If the lockfile does exist, the on-the-fly hashing is verified against that (Cargo computes the whole Merkle DAG). I'm not saying we shouldn't have a directory repository---all that recursive hashing is slower---but I don't see a security problem.

@joshtriplett

Member

joshtriplett commented Jul 14, 2016

@cuviper @alexcrichton The compressed .crate files seemed nice to save space, but directory registries would work fine in distro packages too. The one downside would be the inability to match hashes directly with upstream, relying on the per-file hashes that (as I understand it) upstream doesn't directly provide. However, that property applies to every other package in the distribution, too, so it doesn't really matter.

The index.d concept was that each crate package could drop its portion of the index into a directory, rather than needing to have a single file that multiple crate packages would need to update. But if directory registries don't need that index at all, then that seems like the preferred alternative.

@joshtriplett

Member

joshtriplett commented Jul 14, 2016

@Ericson2314 It isn't a security problem; it's more that it stops casual "oh, I'll just patch the package" local hacks, which break the whole assumption of reproducible builds where version X of crate C always refers to the same crate contents.

@Ericson2314

Contributor

Ericson2314 commented Jul 14, 2016

@joshtriplett Sure, it's not a real security problem. I'm just trying to get at what's worse about using the existing sources for mirrors. The only thing I could come up with is the slowness of hashing more things.

@cuviper

Member

cuviper commented Jul 14, 2016

Distros should be allowed to patch these things though!

@joshtriplett

Member

joshtriplett commented Jul 14, 2016

@cuviper Absolutely. But they shouldn't patch version X of crate C and still call it version X of crate C; at that point, it's, for instance, Debian package version X-2 (X-3, X-4, ...) of crate C.

@alexcrichton

Member

alexcrichton commented Jul 15, 2016

@cuviper

Usually "foo.d" is a directory of individual files to augment the main one.

Ah thanks! @joshtriplett then to answer your question, no, Cargo doesn't support that yet.

@Ericson2314

I don't see why those files can't just be hashed on a fly.

The hash in the lock file is of the .crate file from crates.io, and that's not available with directories.

@joshtriplett

The compressed .crate files seemed nice to save space, but directory registries would work fine in distro packages too. The one downside would be the inability to match hashes directly with upstream, relying on the per-file hashes that (as I understand it) upstream doesn't directly provide.

Note that .crate files won't work for distributions which want to patch dependencies and use source replacement. If using source replacement then it must be replaced with the exact same copy. If distros, however, are comfortable patching Cargo.toml to depend on a different source of crates, then .crate files would work just fine.

Also note that it's not the intention of directory registries for them to "lie" about file hashes. They don't have an actual file to cryptographically hash and verify is the same as the upstream *.crate file, so they just list it in a JSON file. This can be forged, but it's intended to prevent mistakes, not provide security.

Although none of this is too relevant if Cargo.toml is patched, the requirement of "same exact crate" is only in place for replaced sources (via .cargo/config)

@bryteise

bryteise Jul 15, 2016

@alexcrichton I'm not sure yet how, as a distro, I'm going to handle patching and versioning Rust dependencies. My normal way to manage this is to keep the same upstream version and bump my release number when I'm adding, say, a bug fix. If I need to stop just mirroring and actually fork a vendored package, update the Cargo.toml with version twizzling, and then have a way to revert and go back to mirroring, it seems to create a larger amount of work.

Perhaps I just need to see this in action first though to better understand what is being suggested.

@cardoe cardoe referenced this pull request in gentoo/gentoo-rust Jul 15, 2016

Closed

Create a proper eclass for crates #195

@cuviper

Member

cuviper commented Jul 18, 2016

One thing that has frustrated me with .cargo/config and its paths field is that it doesn't seem to handle crates well when they have relative paths for their dependencies.

For instance, the num crate looks for num-complex in path = "./complex". Pointing .cargo/config paths to a local num works fine when that's a full repo checkout, but not if I only extracted a num-VERSION.crate. In the latter case, I get something like:

error: failed to update path override `[...]/num-0.1.34` (defined in `[...]/.cargo/config`)

Caused by:
  Could not find `Cargo.toml` in `[...]/num-0.1.34/complex`

It doesn't help if I also add num-complex and all the rest separately to .cargo/config paths.

Obviously the normal registries take care of this, ignoring the relative path and fetching num-complex et al. on their own. And cargo package also does the right thing when verifying the build. Will directory registries also work like this?
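
For reference, the override being described is the top-level paths key in .cargo/config; the path below is just an example:

# in .cargo/config
paths = ["/path/to/extracted/num-0.1.34"]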

@cuviper

Member

cuviper commented Jul 18, 2016

I've been thinking about the hashing thing more, and I think there may be some different goals in how the distro's directory registry might be used. Namely, it's not generally my intent to provide a distro mirror of everything that end-users might want from crates.io. Users can keep pulling from crates.io, and that's fine with me.

My goal is to provide a distro source of crates mostly for use by Rust programs that are also packaged in the distro. For instance, if we were to package rustfmt, then as a prerequisite we'll want crate rpms for toml, rustc-serialize, and everything else it depends on. For this purpose, I think all the hashing doesn't really matter -- we should be perfectly happy to regenerate Cargo.lock at rpmbuild time to represent the distro's version of all those crates.

Is that reasonable?

@joshtriplett

Member

joshtriplett commented Jul 18, 2016

@cuviper Ideally, it should be possible to build num without a path override at all, and with the source for num and num-complex extracted from separate num-$version.crate and num-complex-$version.crate files, as long as the versions match. That makes it easier to not care about the path relationships between crates, and just package crates directly from .crate files based on dependencies.

@bryteise

bryteise Jul 18, 2016

@cuviper

My goal is to provide a distro source of crates mostly for use by Rust programs that are also packaged in the distro.

That's exactly the use case I'm looking at as well. The idea is to be able to package whatever dependencies the Rust programs in my distro need, so that when I add those dependencies their sources/crates are extracted in place and the local mirror mode will just use them. From there I'd like to also be able to patch those dependencies and (assuming I'm not breaking API) not be required to update the Cargo.toml of the program I'm trying to build.

@alexcrichton

Member

alexcrichton commented Jul 19, 2016

@bryteise yeah it's true that vendoring crates in distros may require some more work, but the major effort of this replace-source and mirror functionality in this PR is to pave the way and solve the vendoring problem with Cargo today for large projects. That is, it's not directly targeted at distros, so that will likely grow some extra support later.

@cuviper

One thing that has frustrated me with .cargo/config and its paths field is that it doesn't seem to handle crates well when they have relative paths for their dependencies.

Indeed! This is one of the many things that paths doesn't work for, and Cargo's just very bad about warning you in situations where this is not expected to work. And yeah directory registries will solve this the same way the normal registries do.

For this purpose, I think all the hashing doesn't really matter -- we should be perfectly happy to regenerate Cargo.lock at rpmbuild time to represent the distro's version of all those crates.

Is that reasonable?

Indeed! I think that distros will probably either patch Cargo.toml or somehow modify it to pull in custom sources, or Cargo will eventually provide a first-class method for doing this. But in general some discussions awhile ago led to the conclusion that using the exact same Cargo.lock probably won't work for distros, because the whole point is that they're consolidating versions and patching things from time to time.


By the way an example of how distros might work is:

  • First, we add the ability to specify custom sources in Cargo.toml, this means you can pull in crates from multiple registries
  • Next, we allow you to configure the "default source" for a particular Cargo.toml, with the default default today being crates.io
  • Finally, distros will patch Cargo.toml of applications to say the default source is their store of crates.

That way distros completely sidestep lock files, hashes, etc, and they can manage their own registry and have the same dependency graph, just a small patch is needed to tell the crate to compile that.

Note though that this is just an idea; I'd of course want to flesh it out with others. The main purpose of this PR is to solve the vendoring problem.
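
To make that idea concrete, a distro patch under this hypothetical scheme might look roughly like the following; none of these keys exist today:

# hypothetical patch to an application's Cargo.toml
[package]
name = "some-app"
version = "0.1.0"
default-source = "distro-crates"   # hypothetical key selecting the distro's registry

[source.distro-crates]             # hypothetical in-manifest source definition
local-registry = "/usr/share/cargo/registry"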

@joshtriplett

Member

joshtriplett commented Jul 19, 2016

@alexcrichton It'd be really painful to have to patch every Cargo.toml, especially for crates that should otherwise work completely unmodified. I think it's much more likely that distributions like Debian will provide a script for use in packages of Rust bin crates (like a dh_cargo_bin for debhelper) that invokes cargo, points it at the directory registry, and uses the "don't touch the network" option.

Would you consider providing a way to point at a .cargo/config file from the cargo command line? That would allow pointing at, for instance, /usr/share/rust/cargo.config, which would contain the configuration to use the directory registry in place of crates.io.

I agree that distributions will probably ignore Cargo.lock files. On that note, would you consider providing a variant of the "don't touch the network" --frozen option from #2811 that doesn't imply "don't ever generate Cargo.lock"?

@alexcrichton

Member

alexcrichton commented Jul 19, 2016

@joshtriplett yeah, note that I'm just spitballing; the focus of this PR is vendoring, not distros. In what I was thinking, though, only applications using Rust crates would need a patched Cargo.toml in theory; no dependency would require a patch.

@joshtriplett

Member

joshtriplett commented Jul 19, 2016

@alexcrichton I understand that you're primarily trying to support vendoring. However, based on some discussions with the Debian Rust team, I think this PR provides everything needed to support distros, too, other than having a convenient way to specify the directory registry without having to write out a .cargo/config file.

@brson

Contributor

brson commented Jul 29, 2016

@bors r- I should look at this closer.

+ if srcs.next().is_some() {
+ return Err(human(format!("more than one source URL specified for \
+ `source.{}`", name)))
+ }

@brson

brson Jul 31, 2016

Contributor

Why is srcs a Vec here if exactly one is required? Seems like a lot of extra error handling.

@alexcrichton

alexcrichton Aug 1, 2016

Member

I wanted to make sure there was an error emitted if more than one was specified, also if zero were specified, and use most of the same code for everything. That is, I didn't want to duplicate all the keys in the "error checking" phase from the parsing phase.


+
+ for k in to_remove {
+ metadata.remove(&k);
+ }

@brson

brson Jul 31, 2016

Contributor

Why do the checksums get removed from the metadata table? Is there code in cargo that would be confused if it sees it?

@alexcrichton

alexcrichton Aug 1, 2016

Member

The checksum metadata is stored in a separate side table (checksums) and otherwise the metadata here is just preserved for future Cargos.

Right now nothing in Cargo reads metadata, it's just opaquely driven forward, so it shouldn't be confused at all.


src/cargo/core/resolver/mod.rs
+ * a replacement source in use (e.g. a mirror) returned a different checksum
+ * the source itself may be corrupt in one way or another
+
+unable to verify that `{0}` was the same as before in any situation

@brson

brson Jul 31, 2016

Contributor

This last sentence 'unable to verify' isn't very clear to me. 'was the same as before in any situation' is quite a mouthful and I can't tell what it means. Ditto for the similar sentence in the second error. I think the word 'was' is especially confusing to me - this seems to be talking about something happening in the present tense (the verification). Maybe 'unable to verify that {} is the same as when the lockfile was generated'.


src/cargo/sources/registry/index.rs
+        drop(OpenOptions::new()
+            .write(true)
+            .create(true)
+            .open(self.path.join(INDEX_LOCK).into_path_unlocked()));

@brson

brson Jul 31, 2016

Contributor

I don't follow what this is doing. It looks like it is ensuring that an empty index exists in a local registry immediately before opening it. Why would an empty index be useful?

@brson

brson Jul 31, 2016

Contributor

Hm, this is an index lock. Is this lock happening in place in the local index? That is, is it writing to the local registry's directory tree? Seems like cargo should not assume that the local registry is writable.

@alexcrichton

alexcrichton Aug 1, 2016

Member

Yes, this was to get tests working, it probably shouldn't write into the tree for directory registries. I'll see if I can tweak.

src/cargo/sources/registry/local.rs
+        let mut data = Vec::new();
+        try!(crate_file.read_to_end(&mut data).chain_error(|| {
+            human(format!("failed to read `{}`", crate_file.path().display()))
+        }));

@brson

brson Jul 31, 2016

Contributor

Would be better not to load all this into memory. Crates could have arbitrary size.

src/cargo/sources/registry/remote.rs
@@ -53,6 +53,10 @@ impl<'cfg> RegistryData for RemoteRegistry<'cfg> {
    }
    fn update_index(&mut self) -> CargoResult<()> {
+        // Ensure that this'll actually succeed later on.
+        try!(ops::http_handle(self.config));

@brson

brson Jul 31, 2016

Contributor

This comment and code are pretty opaque. What's 'this' and how does this code ensure it actually succeeds?

+ internal("missing package source")
+ }));
+ try!(source.verify(unit.pkg.package_id()));
+ }

@brson

brson Jul 31, 2016

Contributor

Can you add a comment explaining what this new block is doing? I don't follow. It looks like it's recovering from an error by doing extra validation. In what situation does comparing the fingerprint produce an error, and in what situation does doing the extra verify call help and how?

@alexcrichton

alexcrichton Aug 1, 2016

Member

Yeah sure. If compare.is_err() is true, it means we're about to rebuild the crate. This typically happens due to modifications to the source or just because it wasn't built previously. In this case we verify the crate source itself, and here is where the directory source will verify all sha256 checksums.

+ self.root.display()))
+ }));
+
+ for entry in entries {

@brson

brson Jul 31, 2016

Contributor

Shouldn't this loop be skipping over the index directory, which is not a package?

@alexcrichton

alexcrichton Aug 1, 2016

Member

Oh the index directory is only present for local registries, not for directory sources.

@brson

brson commented Jul 31, 2016

Contributor

Alright, I read through everything, and it looks pretty awesome, but as usual I only have a basic understanding of what I'm looking at.

I'm confused especially about how the directory registry updating and fingerprinting works.

It looks to me like there's a massive amount of data being moved around. When a directory registry is updated, all of the checksums of all of the source files in the registry are loaded into memory, then presumably re-serialized to some local cache. If that's the case I wouldn't expect it to scale very far (e.g. a distro couldn't stand to unpack every crate in crates.io into a directory registry).

I don't understand how the fingerprint validation is done, per this comment. Under what situation are all the source files of a package re-checksummed?

It also seems to me that this scheme does not technically prevent modification of the sources in the directory registry: the .cargo-checksum.json file is trusted to be checksums of the files in crates.io. So if somebody really wanted to, they could change the local source, update the checksum for that source, and cargo wouldn't be able to tell the difference. And furthermore, unlike with local registries, nobody would be able to detect that somebody had screwed up after switching back to the crates.io registry, since the whole time the cratefile checksum appeared to be the same as the canonical checksum.

@bors

bors commented Aug 1, 2016

Contributor

☔️ The latest upstream changes (presumably #2940) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton added some commits Feb 1, 2016

Refactor URL parsing, be more robust for decoding errors
URL parsing now returns a `CargoResult`, and a few `unwrap()` calls were changed
to return a `CargoResult` when decoding various bits and pieces of information.
Add convenience helpers to map source ids
Should make it easy to map packages from one source to another.
Implement source redirection
This commit implements a scheme for .cargo/config files where sources can be
redirected to other sources. The purpose of this will be to override crates.io
for a few use cases:

  * Replace it with a mirror site that is sync'd to crates.io
  * Replace it with a "directory source" or some other local source

The major feature of this redirection, however, is that none of it is encoded
into the lock file. If one source is redirected to another then it is assumed
that packages from both are exactly the same (e.g. `foo v0.0.1` is the same in
both locations). The lock file simply encodes the canonical source (e.g.
crates.io) rather than the replacement source. In the end this means that
Cargo.lock files can be generated from any replacement source and shipped to
other locations without the lockfile oscillating about where packages came from.

Eventually this support will be extended to `Cargo.toml` itself (which will be
encoded into the lock file), but that support is not implemented today. The
syntax for what was implemented today looks like:

    # .cargo/config
    [source.my-awesome-registry]
    registry = 'https://example.com/path/to/index'

    [source.crates-io]
    replace-with = 'my-awesome-registry'

Each source will have a canonical name and will be configured with the various
keys underneath it (today just 'registry' and 'directory' will be accepted). The
global `crates-io` source represents crates from the standard registry, and this
can be replaced with other mirror sources.

All tests have been modified to use this new infrastructure instead of the old
`registry.index` configuration. This configuration is now also deprecated and
will emit an unconditional warning about how it will no longer be used in the
future.
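For reference, the deprecated `registry.index` configuration referred to here looked roughly like the following sketch (the exact layout is illustrative rather than taken from this PR):

```toml
# old-style .cargo/config, now deprecated in favor of [source.*] replacement
[registry]
index = 'https://example.com/path/to/index'
```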

Finally, all subcommands now use this "source map" except for `cargo publish`,
which will always publish to the default registry (in this case crates.io).
Add sha256 checksums to the lockfile
This commit changes how lock files are encoded by adding checksums for each
package in the lockfile to the `[metadata]` section. The previous commit implemented the
ability to redirect sources, but the core assumption there was that a package
coming from two different locations was always the same. An inevitable case,
however, is that a source gets corrupted or, worse, ships a modified version of
a crate to introduce instability between two "mirrors".

The purpose of adding checksums will be to resolve this discrepancy. Each crate
coming from crates.io will now record its sha256 checksum in the lock file. When
a lock file already exists, the new checksum for a crate will be checked against
it, and if they differ compilation will be aborted. Currently only registry
crates will have sha256 checksums listed, all other sources do not have
checksums at this time.

The astute may notice that if the lock file format is changing, then a lock file
generated by a newer Cargo might be mangled by an older Cargo. In anticipation
of this, however, all Cargo versions published support a `[metadata]` section of
the lock file which is transparently carried forward if encountered. This means
that older Cargos compiling with a newer lock file will not verify checksums in
the lock file, but they will carry forward the checksum information and prevent
it from being removed.

There are, however, a few situations where problems may still arise:

1. If an older Cargo takes a newer lockfile (with checksums) and updates it with
   a modified `Cargo.toml` (e.g. a package was added, removed, or updated), then
   the `[metadata]` section will not be updated appropriately. This modification
   would require a newer Cargo to come in and update the checksums for such a
   modification.

2. Today Cargo can only calculate checksums for registry sources, but we may
   eventually want to support other sources like git (or just straight-up path
   sources). If a future Cargo implements support for this sort of checksum, then
   it's the same problem as above where older Cargos will not know how to keep
   the checksum in sync.
Refactor the RegistrySource implementation
Add an abstraction over which the index can be updated and downloads can be
made. This is currently implemented for "remote" registries (e.g. crates.io),
but soon there will be one for "local" registries as well.
Implement a local registry type
This flavor of registry is intended to behave very similarly to the standard
remote registry, except everything is contained locally on the filesystem
instead. There are a few components to this new flavor of registry:

1. The registry itself is rooted at a particular directory, owning all structure
   beneath it.
2. There is an `index` folder with the same structure as the crates.io index
   describing the local registry (e.g. contents, versions, checksums, etc).
3. Inside the root will also be a list of `.crate` files which correspond to
   those described in the index. All crates must be of the form
   `name-version.crate` and be the same `.crate` files from crates.io itself.

This support can currently be used via the previous implementation of source
overrides with the new type:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
local-registry = 'path/to/registry'
```

I will soon follow up with a tool which can be used to manage these local
registries externally.
Implement a directory source
This flavor of source is intended to behave like a local registry except that
its contents are unpacked rather than zipped up in `.crate` form. Like with
local registries, the only way to use this currently is via the
`.cargo/config`-based source replacement, and primarily only to
replace crates.io or other registries at the moment.

A directory source is simply a directory which has many `.crate` files unpacked
inside of it. The directory is not recursively traversed for changes, but rather
it is just required that all elements in the directory are themselves
directories of packages.

This format is more suitable for checking into source trees, and it still
provides guarantees around preventing modification of the original source from
the upstream copy. Each directory in the directory source is required to have a
`.cargo-checksum.json` file indicating the checksum it *would* have had if the
crate had come from the original source as well as all of the sha256 checksums
of all the files in the repo. It is intended that directory sources are
assembled from a separately shipped subcommand (e.g. `cargo vendor` or `cargo
local-registry`), so these checksum files don't have to be managed manually.

Modification of a directory source is not the intended purpose, and if a
modification is detected then the user is nudged towards solutions like
`[replace]` which are intended for overriding other sources and processing local
modifications.
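As with local registries, a directory source is wired up through source replacement in `.cargo/config`. A minimal sketch, where the source name and path are placeholders:

```toml
[source.crates-io]
replace-with = 'vendored-sources'

[source.vendored-sources]
# A directory of unpacked crates, each carrying a `.cargo-checksum.json` file
directory = 'path/to/some/sources'
```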
@alexcrichton

alexcrichton commented Aug 1, 2016

Member

Yeah you're right in that we load up checksums for all the crates in a directory registry. Not doing so would require an index similar to the local registry index. The intent of directory sources is for vendoring, not distros, and in the case of vendoring you're going to load all the crates anyway to build them, so the overhead shouldn't be much.

Also yeah, this isn't preventing any sort of "malicious activity" wrt directories. If you change a file you can change the checksum in the json metadata. The intent is to prevent accidental updates, not prevent malicious modifications. The core of the source replacement mechanism is that you replace a crate with the exact same code, just from a different location. If directory sources had no checksums at all then they couldn't provide that guarantee, but this allows them to at least provide a good enough guarantee along those lines for our purposes.

@alexcrichton

alexcrichton commented Aug 1, 2016

Member

Ok, @brson I believe I should have addressed all your comments and I've also pushed a commit containing documentation for this to go on doc.crates.io

@brson

brson commented Aug 1, 2016

Contributor

@bors r+

@bors

bors commented Aug 1, 2016

Contributor

📌 Commit 63ac9e1 has been approved by brson

@bors

bors commented Aug 1, 2016

Contributor

⌛️ Testing commit 63ac9e1 with merge f09ef68...

bors added a commit that referenced this pull request Aug 1, 2016

Auto merge of #2857 - alexcrichton:redirect-sources, r=brson
Add support local mirrors of registries, take 2

This series of commits culminates in first class support in Cargo for local mirrors of registries. This is implemented through a number of other more generic mechanisms, and extra support was added along the way. The highlights of this PR, however, are:

New `.cargo/config` keys have been added to enable *replacing one source with another*. This functionality is intended to be used for mirrors of the main registry or otherwise one to one source correspondences. The support looks like:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
registry = 'https://github.com/my-awesome/registry-index'
```

This configuration means that instead of using `crates-io` (e.g. `https://github.com/rust-lang/crates.io-index`), Cargo will query the `my-awesome-registry` source instead (configured to a different index here). This alternate source **must be the exact same as the crates.io index**. Cargo assumes that replacement sources are exact 1:1 mirrors in this respect, and the following support is designed around that assumption.

When generating a lock file for crate using a replacement registry, the *original registry* will be encoded into the lock file. For example in the configuration above, all lock files will still mention crates.io as the registry that packages originated from. This semantically represents how crates.io is the source of truth for all crates, and this is upheld because all replacements have a 1:1 correspondance.

Overall, this means that no matter what replacement source you're working with, you can ship your lock file to anyone else and you'll all still have verifiably reproducible builds!

With the above support for custom registries, it's now possible for a project to be downloading crates from any number of sources. One of Cargo's core goals is reproducible builds, and with all these new sources of information it may be easy for a few situations to arise:

1. A local replacement of crates.io could be corrupt
2. A local replacement of crates.io could have made subtle changes to crates

In both of these cases, Cargo would today simply give non-reproducible builds. To help assuage these concerns, Cargo will now track the sha256 checksum of all crates from registries in the lock file. Whenever a `Cargo.lock` is generated from now on it will contain a `[metadata]` section which lists the sha256 checksum of all crates in the lock file (or `<none>` if the sha256 checksum isn't known).

Cargo already checks registry checksums against what's actually downloaded, and Cargo will now verify between iterations of the lock file that checksums remain the same as well. This means that if a local replacement registry is **not** in a 1:1 correspondance with crates.io, the lock file will prevent the build from progressing until the discrepancy is resolved.

In addition to the support above, there is now a new kind of source in Cargo, a "local registry", which is intended to be a subset of the crates.io ecosystem purposed for a local build for any particular project here or there. The way to enable this looks like:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
local-registry = 'path/to/my/local/registry'
```

This local registry is expected to have two components:

1. A directory called `index` which matches the same structure as the crates.io index. The `config.json` file is not required here.
2. Inside the registry directory are any number of `.crate` files (downloaded from crates.io). Each crate file has the name `<package>-<version>.crate`.

This local registry must currently be managed manually, but I plan on publishing and maintaining a Cargo subcommand to manage a local registry. It will have options to do things like:

1. Sync a local registry with a `Cargo.lock`
2. Add a registry package to a local registry
3. Remove a package from a local registry

In addition to local registries, Cargo also supports a "directory source" like so

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
directory = 'path/to/some/sources'
```

A directory source is similar to a local registry above, except that all the crates are unpacked and visible as vendored source. This format is suitable for checking into source trees, like Gecko's.

Unlike local registries above we don't have a tarball to verify the crates.io checksum with, but each vendored dependency has metadata containing what it *would* have been. To further prevent modifications by accident, the metadata contains the checksum of each file which should prevent accidental local modifications and steer towards `[replace]` as the mechanism to edit dependencies if necessary.

This is quite a bit of new features! What's all this meant to do? Some example scenarios that this is envisioned to solve are:

1. Supporting mirrors for crates.io in a first class fashion. Once we have the ability to spin up your own local registry, it should be easy to locally select a new mirror.
2. Supporting round-robin mirrors, this provides an easy vector for configuration of "instead of crates.io hit the first source in this list that works"
3. Build environments where network access is not an option. Preparing a local registry ahead-of-time (from a known good lock file) will be a vector to ensure that all Rust dependencies are locally available.
   * Note this is intended to include use cases like Debian and Gecko

Even with the new goodies here, there's some more vectors through which this can be expanded:

* Support for running your own mirror of crates.io needs to be implemented to be "easy to do". There should for example be a `cargo install foo` available to have everything "Just Work".
* Replacing a source with a list of sources (attempted in round robin fashion) needs to be implemented
* Eventually this support will be extended to the `Cargo.toml` file itself. For example:
  * packages should be downloadable from multiple registries
  * replacement sources should be encodable into `Cargo.toml` (note that these replacements, unlike the ones above, would be encoded into `Cargo.lock`)
  * adding multiple mirrors to a `Cargo.toml` should be supported
* Implementing the subcommand above to manage local registries needs to happen (I will attend to this shortly)
@bors bors merged commit 63ac9e1 into rust-lang:master Aug 2, 2016

3 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
homu Test successful

@alexcrichton alexcrichton deleted the alexcrichton:redirect-sources branch Aug 9, 2016

@sunng87 sunng87 referenced this pull request in ustclug/mirrorrequest Aug 9, 2016

Closed

Hope to add a mirror of https://crates.io #16

@gentoo90 gentoo90 referenced this pull request in gentoo/gentoo-rust Aug 25, 2016

Merged

Add ebuild for dev-util/geckodriver-0.10.0 #198

@killercup killercup referenced this pull request in Rustaceans/rust-cologne Sep 5, 2016

Closed

Meetup September 5th #6

10 of 10 tasks complete