Create checksum from more reproducible artifact #7212

Open
est31 opened this issue Aug 5, 2019 · 6 comments
Labels
A-caching Area: caching of dependencies, repositories, and build artifacts A-interacts-with-crates.io Area: interaction with registries A-lockfile Area: Cargo.lock issues S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.

Comments

@est31
Member

est31 commented Aug 5, 2019

PR #6317 has been quite disruptive (you can call it detrimental) to my attempts at building a reversible storage of cargo crates.

The goal of the project is to deduplicate files present in multiple versions of a crate and do the compression on the fly. This gives sweet improvements. The reason #6317 was so disruptive is that I want to replicate the compression 100% so that there are no changes in Cargo.lock hashes. But for this I need to be able to run the exact compression algorithm used. Now there are countless versions of the zlib library used in various OSes, and I'm not sure whether all of them are even open source. That's why I said detrimental: if proprietary versions of zlib are used, the project would be impossible.

A short-term fix for the issue would be to revert the PR. A more long-term fix would be to use a different format for the hashes used in the registry and Cargo.lock. Instead of hashing the entire tar.gz files, you could hash only the .tar files. There is an upcoming new version of Cargo.lock in which this could be adopted. Do you think that this is a good idea?
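To illustrate the difference between the two schemes, here is a minimal Python sketch (Python is used only for illustration; this is not cargo code). The byte string stands in for a real .tar file, and `checksum_of_archive` and `checksum_of_tar` are hypothetical names, not cargo APIs. It assumes the whole archive fits in memory.

```python
import gzip
import hashlib


def checksum_of_archive(crate_bytes: bytes) -> str:
    """Hash the compressed .tar.gz bytes as-is (the current scheme)."""
    return hashlib.sha256(crate_bytes).hexdigest()


def checksum_of_tar(crate_bytes: bytes) -> str:
    """Hash the decompressed .tar stream (the proposed scheme)."""
    return hashlib.sha256(gzip.decompress(crate_bytes)).hexdigest()


# Stand-in for a real .tar payload; any byte string works for the demonstration.
tar_payload = b"".join(
    f"fn item_{i}() -> u32 {{ {i * 7919 % 1000} }}\n".encode() for i in range(2000)
)

# Two encoders that differ only in compression level.
fast = gzip.compress(tar_payload, compresslevel=1, mtime=0)
best = gzip.compress(tar_payload, compresslevel=9, mtime=0)

archives_match = checksum_of_archive(fast) == checksum_of_archive(best)
tars_match = checksum_of_tar(fast) == checksum_of_tar(best)
print("archive checksums match:", archives_match)
print("tar checksums match:", tars_match)
```

Only the second comparison is independent of which compressor, or which compressor settings, produced the file.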

est31 added a commit to est31/cargo-local-serve that referenced this issue Aug 5, 2019
There's been an event I call "zlib explosion" after which
cargo adopted tons of variants of the zlib algorithm.

With this algorithm the attempts to re-create hash-matching
.crate files from the deduped storage turn pointless.

See also rust-lang/cargo#7212
@ehuss ehuss added A-lockfile Area: Cargo.lock issues and removed A-lockfile Area: Cargo.lock issues labels Sep 21, 2019
@link2xt

link2xt commented Sep 8, 2023

> Instead of hashing the entire tar.gz files, you could hash the .tar files only.

Another option is to hash the files, write the hashes into the manifest and sign the manifest itself. This is how JAR file signatures work. This way even the .tar format can be replaced with a different format later.
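A rough Python sketch of the per-file hashing part of that idea, with signing left out entirely. The manifest format (one `digest  name` line per file, sorted by name) and the helper names `build_tar` and `file_hash_manifest` are assumptions made for illustration, not an existing cargo or JAR format.

```python
import hashlib
import io
import tarfile


def build_tar(files: dict) -> bytes:
    """Build an uncompressed tar in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def file_hash_manifest(tar_bytes: bytes) -> str:
    """Return one 'sha256  name' line per regular file, sorted by name."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                entries.append((member.name, hashlib.sha256(data).hexdigest()))
    entries.sort()
    return "".join(f"{digest}  {name}\n" for name, digest in entries)


# Same files, different member order: the tar bytes differ, the manifest does not.
first = build_tar({"Cargo.toml": b"[package]\n", "src/lib.rs": b"pub fn f() {}\n"})
second = build_tar({"src/lib.rs": b"pub fn f() {}\n", "Cargo.toml": b"[package]\n"})
print(file_hash_manifest(first) == file_hash_manifest(second))
```

Note that this sketch lists duplicate member names twice and ignores file modes and symlinks, so the manifest alone does not pin down what an extractor would write to disk.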

@est31
Member Author

est31 commented Sep 8, 2023

There is a large number of security vulnerabilities around that, and the jar model is generally seen as bad now in the security community.

@Eh2406
Contributor

Eh2406 commented Sep 12, 2023

For my own edification, where can I learn more about these security problems?

@est31
Member Author

est31 commented Sep 13, 2023

Having done a small search I found:

  • "MasterKey" vulnerability, Android bug 8219321 (CVE-2013-4787), discovered by Bluebox Security in 2013 (article 1, article 2, Black Hat talk): APKs were allowed to contain multiple entries with the same name, and the signature verifier and the extraction code each picked a different one.
  • Various unnamed vulnerabilities (one, two).
  • Janus vulnerability (CVE-2017-13156), discovered by GuardSquare. Version 2 signatures and above are not vulnerable to it, but version 1 is (disallowed completely since Android 11). Versions 3 and 4, which came later, add some refinements like key rotation or better streaming support.
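The duplicate-entry problem behind the first item can be sketched in a few lines of Python. This is not the Android code, just an illustration using the standard `zipfile` module; `make_zip` and `duplicate_entry_names` are hypothetical helpers.

```python
import io
import warnings
import zipfile
from collections import Counter


def make_zip(entries: list) -> bytes:
    """Build a zip in memory from (name, data) pairs, duplicates allowed."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # zipfile warns about duplicate names
        with zipfile.ZipFile(buf, "w") as zf:
            for name, data in entries:
                zf.writestr(name, data)
    return buf.getvalue()


def duplicate_entry_names(zip_bytes: bytes) -> list:
    """Return every entry name that occurs more than once in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        counts = Counter(zf.namelist())
    return sorted(name for name, n in counts.items() if n > 1)


# Two entries share a name; a verifier and an extractor that resolve the name
# differently would each see different content.
suspicious = make_zip([("classes.dex", b"benign"), ("classes.dex", b"malicious")])
print(duplicate_entry_names(suspicious))
```

A verifier that hashes per-entry content has to reject such archives outright, or be certain it resolves names exactly the way the extractor does.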

Right now cargo/crates.io does it really well by hashing the entire tar.gz file's contents. My suggestion is to use only the tar file's contents for the hashing, allowing better compression and deduplication to be done on the backend side. If we ever want to do signatures embedded in the tar.gz, there are some alternatives:

  • Via duplicate tarring: a .tar.gz file consisting of two files, signature.toml and files.tar, where signature.toml contains a textual representation of the raw bytes of the files.tar file. One needs to ensure that the outer .tar.gz file does not contain any files beyond those two, say files/foo/bar, which would then be added upon extraction.
  • One could also add signature.toml to the tar file itself as the very last file, with the signature computed from the tar file's contents up to the signature.toml entry. This is a bit more dangerous, as it is more complex and one needs to forbid any content after the signature.toml entry.
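The first alternative, including the "nothing beyond those two files" check, might look roughly like this in Python. It is a sketch under loud assumptions: a plain SHA-256 digest stands in for a real cryptographic signature, the `sha256 = "..."` line in signature.toml is an invented format, and `add_bytes`, `build_outer` and `verify_outer` are hypothetical helpers.

```python
import hashlib
import io
import tarfile

EXPECTED_MEMBERS = ["signature.toml", "files.tar"]


def add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Append a regular file with the given name and content."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_outer(inner_tar: bytes) -> bytes:
    """Wrap inner_tar and a digest of it (placeholder signature) in a .tar.gz."""
    digest = hashlib.sha256(inner_tar).hexdigest()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        add_bytes(tar, "signature.toml", f'sha256 = "{digest}"\n'.encode())
        add_bytes(tar, "files.tar", inner_tar)
    return buf.getvalue()


def verify_outer(outer: bytes) -> bytes:
    """Return the inner tar bytes, or raise ValueError if anything is off."""
    with tarfile.open(fileobj=io.BytesIO(outer), mode="r:gz") as tar:
        members = tar.getmembers()
        names = [m.name for m in members]
        # Exactly the two expected regular files, in order, and nothing else.
        if names != EXPECTED_MEMBERS or not all(m.isfile() for m in members):
            raise ValueError(f"unexpected members in outer archive: {names}")
        signature = tar.extractfile(members[0]).read().decode()
        inner = tar.extractfile(members[1]).read()
    expected = f'sha256 = "{hashlib.sha256(inner).hexdigest()}"\n'
    if signature != expected:
        raise ValueError("files.tar does not match signature.toml")
    return inner


inner = b"stand-in for the bytes of files.tar"
print(verify_outer(build_outer(inner)) == inner)
```

The member-list check is deliberately strict: an archive carrying an extra entry such as files/foo/bar is rejected before anything is extracted.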

@epage epage added A-interacts-with-crates.io Area: interaction with registries A-lockfile Area: Cargo.lock issues A-caching Area: caching of dependencies, repositories, and build artifacts labels Nov 1, 2023
@epage
Contributor

epage commented Nov 1, 2023

This might benefit #2526, as it would let us migrate to additional compression algorithms while the checksum stays stable across all of them.
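A small Python sketch of that property, assuming the checksum is taken over the decompressed tar bytes. The three algorithms are simply the ones available in the Python standard library, not a statement about what #2526 would adopt, and the payload is a stand-in for a real .tar file.

```python
import bz2
import gzip
import hashlib
import lzma

# Stand-in for the bytes of a .tar file.
tar_payload = b"stand-in for the bytes of a .tar file\n" * 256

# (compress, decompress) pairs for three unrelated compression formats.
encodings = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

tar_checksums = {}
for name, (compress, decompress) in encodings.items():
    blob = compress(tar_payload)
    # The checksum is computed after decompression, so the format is irrelevant.
    tar_checksums[name] = hashlib.sha256(decompress(blob)).hexdigest()

stable = len(set(tar_checksums.values())) == 1
print("checksum stable across formats:", stable)
```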

@epage epage changed the title zlib explosion Create checksum from more reproducible artifact Nov 1, 2023
@epage epage added the S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. label Nov 1, 2023
@Eh2406
Contributor

Eh2406 commented Nov 2, 2023

I was reminded of this at PackagingCon, and wanted to write it down before it slipped my mind. There is an existing battle-hardened (file system|operating system|compression)-agnostic file structure hash: the hashing algorithm used for git trees. It should be possible to compute this hash over any compressed version of the file, or over a checkout of the tree on the file system, and get the same result. It does treat several filesystem properties as irrelevant and hashes other properties, either of which could be the wrong decision in our use case. But it has the advantage of being battle-hardened. Just a thought.
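For reference, a minimal Python sketch of git's SHA-1 blob and tree hashing over an in-memory file tree. The nested-dict input format and the names `git_object_id` and `tree_id` are assumptions for illustration; the sketch only handles regular non-executable files (mode 100644) and directories (mode 40000), and ignores symlinks, executables and submodules.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over '<kind> <length>\\0<body>', as git does for loose objects."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(tree: dict) -> bytes:
    """Hash a {name: bytes | dict} tree the way git hashes tree objects."""
    entries = []
    for name, node in tree.items():
        raw = name.encode()
        if isinstance(node, dict):
            # Git sorts directory names as if they ended with '/'.
            entries.append((raw + b"/", b"40000", raw, tree_id(node)))
        else:
            entries.append((raw, b"100644", raw, git_object_id(b"blob", node)))
    entries.sort(key=lambda entry: entry[0])
    body = b"".join(
        mode + b" " + raw + b"\0" + oid for _, mode, raw, oid in entries
    )
    return git_object_id(b"tree", body)


crate = {"Cargo.toml": b"[package]\n", "src": {"lib.rs": b"pub fn f() {}\n"}}
print(tree_id(crate).hex())
```

Because the hash depends only on names, modes and contents, it is the same whether the tree came from a .tar.gz, some other archive format, or a directory on disk. Timestamps and ownership are among the properties it ignores.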
