feat!: use stable hash from rustc-stable-hash #14116

weihanglo · 2024-06-20T17:40:30Z

What does this PR try to resolve?

This helps -Ztrim-paths build a stable cross-platform path for the
registry and git sources. Source files can then be found at the same
path when debugging.

See #13171 (comment)

How should we test and review this PR?

There are a few caveats, and we should do an FCP before merge:

- This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
- As a consequence of changing how SourceId is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  index.crates.io-6f17d22bba15001f) to SQLite.
- The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short inputs like SourceId, but better on long inputs
  like fingerprints. See Additional information.
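
For readers unfamiliar with identifiers such as index.crates.io-6f17d22bba15001f, here is a minimal sketch of how a host name plus a 64-bit hash can form a directory name that is identical on every platform. It is not Cargo's implementation: `fnv1a64`, `host_of`, and `short_name` are hypothetical helpers, and FNV-1a is only a stand-in for the hasher this PR switches to. The point is that a byte-oriented hash with a fixed hex rendering does not depend on endianness or pointer width.

```rust
/// FNV-1a (64-bit). A stand-in hash chosen only because it is short and
/// purely byte-oriented; Cargo does NOT use FNV for source identifiers.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: everything between "://" and the next '/'.
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    rest.split('/').next().unwrap_or(rest)
}

/// Hypothetical helper: "<host>-<16 hex digits>", mirroring the shape of
/// the identifiers mentioned above.
fn short_name(url: &str) -> String {
    format!("{}-{:016x}", host_of(url), fnv1a64(url.as_bytes()))
}
```

Calling `short_name("https://index.crates.io/")` yields a name starting with `index.crates.io-` followed by 16 hex digits. Changing the hash function changes those digits, which is why existing caches and the SQLite entries are invalidated.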

StableHasher is used in several places. We should consider whether any of them needs a cryptographic hash (see #13171 (comment)).

- Rebuild detection (fingerprints)
  - Rustc version, including all the CLI args running rustc -vV.
  - Build caches
- Compute rustc -C metadata
  - stable hash for SourceId
  - Also read and hash contents from custom target JSON file.
- UnitInner::dep_hash
  - This is to distinguish same units having different features set between normal and build dependencies.
- Hash file contents for cargo package to verify if files were modified before and after the build.
- Rustc diagnostics deduplication
- Places using the SourceId identifier, like the registry/src path
  and -Zscript target directories.
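
To illustrate the "hash file contents before and after the build" use case from the list, here is a rough sketch under stated assumptions. `snapshot` and `changed_paths` are hypothetical names, the files are passed in as in-memory byte slices rather than read from disk, and std's `DefaultHasher` stands in for Cargo's StableHasher. That is adequate only because both snapshots are taken within the same process.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

/// Hypothetical helper: map each path to a 64-bit digest of its contents.
fn snapshot<'a>(files: &[(&'a str, &'a [u8])]) -> BTreeMap<&'a str, u64> {
    files
        .iter()
        .map(|(path, contents)| {
            let mut hasher = DefaultHasher::new();
            hasher.write(contents);
            (*path, hasher.finish())
        })
        .collect()
}

/// Paths whose digest differs between the two snapshots, plus paths that
/// exist in only one of them, in sorted order.
fn changed_paths<'a>(
    before: &BTreeMap<&'a str, u64>,
    after: &BTreeMap<&'a str, u64>,
) -> Vec<&'a str> {
    let mut changed: Vec<&str> = before
        .iter()
        .filter(|(path, digest)| after.get(*path) != Some(*digest))
        .map(|(path, _)| *path)
        .collect();
    changed.extend(after.keys().filter(|p| !before.contains_key(*p)).copied());
    changed.sort_unstable();
    changed
}
```

Here a non-cryptographic hash only detects accidental modification, such as a build script rewriting a source file. It gives no protection against deliberately crafted collisions.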

Additional information

Benchmark on x86_64-unknown-linux-gnu

bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmark on aarch64-apple-darwin

bench_hasher/RustcStableHasher/URL
                        time:   [19.619 ns 19.645 ns 19.670 ns]
Found 156 outliers among 1000 measurements (15.60%)
  10 (1.00%) low severe
  59 (5.90%) low mild
  43 (4.30%) high mild
  44 (4.40%) high severe
bench_hasher/SipHasher/URL
                        time:   [17.809 ns 17.826 ns 17.843 ns]
Found 34 outliers among 1000 measurements (3.40%)
  28 (2.80%) high mild
  6 (0.60%) high severe
bench_hasher/RustcStableHasher/300 chars
                        time:   [95.535 ns 95.679 ns 95.824 ns]
Found 48 outliers among 1000 measurements (4.80%)
  39 (3.90%) high mild
  9 (0.90%) high severe
bench_hasher/SipHasher/300 chars
                        time:   [151.18 ns 151.37 ns 151.58 ns]
Found 16 outliers among 1000 measurements (1.60%)
  13 (1.30%) high mild
  3 (0.30%) high severe
bench_hasher/RustcStableHasher/lorem ipsum (3222 chars)
                        time:   [975.85 ns 976.65 ns 977.50 ns]
Found 92 outliers among 1000 measurements (9.20%)
  48 (4.80%) high mild
  44 (4.40%) high severe
bench_hasher/SipHasher/lorem ipsum (3222 chars)
                        time:   [1.7856 µs 1.7872 µs 1.7888 µs]
Found 66 outliers among 1000 measurements (6.60%)
  47 (4.70%) high mild
  19 (1.90%) high severe
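
To make the short-versus-long trade-off concrete, the sketch below divides the SipHasher time by the RustcStableHasher time for each aarch64-apple-darwin input. The numbers are the middle values of the Criterion `time:` intervals above; `AARCH64_ESTIMATES_NS` and `sip_over_stable` are names made up for this illustration.

```rust
/// (input name, RustcStableHasher ns, SipHasher ns), copied from the
/// middle value of each `time:` interval in the aarch64-apple-darwin run.
const AARCH64_ESTIMATES_NS: &[(&str, f64, f64)] = &[
    ("URL", 19.645, 17.826),
    ("300 chars", 95.679, 151.37),
    ("lorem ipsum (3222 chars)", 976.65, 1787.2),
];

/// SipHasher time divided by RustcStableHasher time for each input.
/// A ratio above 1.0 means RustcStableHasher was the faster of the two.
fn sip_over_stable() -> Vec<(&'static str, f64)> {
    AARCH64_ESTIMATES_NS
        .iter()
        .map(|&(name, stable_ns, sip_ns)| (name, sip_ns / stable_ns))
        .collect()
}
```

The ratios come out at roughly 0.91 for the URL, 1.58 for 300 chars, and 1.83 for the 3222-character input. RustcStableHasher is about 10% slower on the short input and considerably faster on the long ones.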

Criterion benchmark script

#![allow(deprecated)]

use std::hash::Hash as _;
use std::hash::Hasher as _;

use criterion::criterion_group;
use criterion::criterion_main;
use criterion::BenchmarkId;
use criterion::Criterion;

struct SipHasher(std::hash::SipHasher);

impl SipHasher {
    fn new() -> SipHasher {
        SipHasher(std::hash::SipHasher::new())
    }
}

impl std::hash::Hasher for SipHasher {
    fn finish(&self) -> u64 {
        self.0.finish()
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }
}

struct RustcStableHasher(rustc_stable_hash::StableHasher);

impl RustcStableHasher {
    fn new() -> RustcStableHasher {
        RustcStableHasher(rustc_stable_hash::StableHasher::new())
    }

    fn finish(self) -> u64 {
        self.0.finalize().0
    }
}

impl std::hash::Hasher for RustcStableHasher {
    fn finish(&self) -> u64 {
        panic!("call StableHasher::finish instead");
    }

    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }
}

const INPUTS: &[(&'static str, &'static str)] = &[
    ("URL", "registry+https://github.com/rust-lang/crates.io-index"),
    ("300 chars", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sem odio, consectetur ac velit ac, hendrerit pulvinar nisl. Aenean auctor felis non accumsan porta. Nullam purus diam, aliquam nec dui vitae, iaculis fermentum eros. Nunc laoreet lectus nec malesuada tristique. Quisque venenatis vehicula"),
    ("lorem ipsum (3222 chars)", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sem odio, consectetur ac velit ac, hendrerit pulvinar nisl. Aenean auctor felis non accumsan porta. Nullam purus diam, aliquam nec dui vitae, iaculis fermentum eros. Nunc laoreet lectus nec malesuada tristique. Quisque venenatis vehicula lacus sed auctor. In libero sapien, auctor vulputate tellus ut, scelerisque feugiat neque. Sed feugiat nulla vel lorem tincidunt viverra. Proin blandit pretium sapien id imperdiet. Sed elementum, ligula quis porttitor consectetur, augue ligula consectetur erat, at congue massa tortor in odio. Morbi sit amet tincidunt libero, eu rutrum felis. Integer rhoncus tortor et erat congue venenatis. Proin ac ante sit amet urna tincidunt ullamcorper. Vestibulum nec tincidunt neque. Vestibulum venenatis, libero et blandit pretium, risus nibh efficitur libero, vel condimentum tortor nulla non sapien. Morbi ac dapibus est. Duis justo arcu, laoreet lacinia luctus mollis, placerat non augue. Interdum et malesuada fames ac ante ipsum primis in faucibus. Fusce vestibulum eu tellus in pellentesque. Nam efficitur mattis turpis. Vestibulum a condimentum purus. Suspendisse eget augue scelerisque sem dignissim ornare vitae in augue. Vestibulum porta rhoncus sapien, non luctus nisi vehicula in. Etiam cursus tortor turpis, eu imperdiet purus facilisis ut. Nullam vestibulum erat ex, sit amet commodo est fermentum eleifend. Donec pulvinar imperdiet urna, egestas ultricies mi pulvinar at. Maecenas velit dui, iaculis at egestas eu, consequat sit amet nisl. Ut eu leo ultricies, porttitor ante eu, ultrices massa. Nam commodo, nunc ut mollis egestas, lectus ex eleifend nisl, vitae mollis metus quam vitae sapien. Curabitur eu nulla massa. Vivamus sodales turpis et lorem placerat, ac dignissim nulla luctus. In placerat eleifend orci, dapibus varius felis tincidunt sed. Nulla suscipit mauris condimentum ipsum finibus, ac mattis sapien aliquet. 
Cras feugiat elementum augue, viverra lacinia ante congue et. Sed et bibendum sem. Aenean pretium tellus eget velit commodo pretium a sit amet velit. Curabitur vitae est vitae nulla venenatis tristique in a eros. In scelerisque lectus et luctus mattis. Cras ac purus ac purus tempor molestie vitae vitae felis. Quisque volutpat elementum felis vitae mollis. Pellentesque finibus quam eget vestibulum tempus. Praesent quis massa eget ligula ultrices lobortis. Ut pellentesque, mi ac finibus sagittis, dui felis tempor dui, ac commodo mauris massa nec dolor. Cras congue, lectus vitae luctus faucibus, massa mauris malesuada elit, et facilisis turpis odio non justo. Proin volutpat turpis quis ante interdum pellentesque. Morbi faucibus, erat vel elementum aliquet, odio leo eleifend magna, sagittis semper lorem mauris nec arcu. Curabitur lacinia sagittis ante mollis facilisis. Fusce ultrices tellus sed justo rhoncus varius ut eu justo. Sed a est purus. Sed nec mi laoreet, consequat justo nec, sodales augue. Nullam posuere ipsum et velit aliquam blandit a quis metus. Aliquam id eros non magna suscipit bibendum. Curabitur porta auctor sapien, a molestie nisl. Donec neque leo, consequat vitae velit sit amet, aliquam elementum purus. Donec sit amet congue mi. Etiam at magna nunc."),
];

fn bench_hasher(c: &mut Criterion) {
    let mut group = c.benchmark_group("bench_hasher");
    group.sample_size(1000);
    for (name, input) in INPUTS {
        let id = BenchmarkId::new("RustcStableHasher", name);
        group.bench_with_input(id, input, |b, input| {
            b.iter(|| {
                let mut hasher = RustcStableHasher::new();
                input.hash(&mut hasher);
                hasher.finish();
            })
        });
        let id = BenchmarkId::new("SipHasher", name);
        group.bench_with_input(id, input, |b, input| {
            b.iter(|| {
                let mut hasher = SipHasher::new();
                input.hash(&mut hasher);
                hasher.finish();
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_hasher);
criterion_main!(benches);

rustbot · 2024-06-20T17:40:36Z

r? @epage

rustbot has assigned @epage.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

weihanglo · 2024-06-20T17:42:12Z

This is blocked on releasing rustc-stable-hash to crates.io

briansmith · 2024-06-20T17:51:00Z

From a <1 minute reading of "-Ztrim-paths build a stable cross-platform path for the registry and git sources."

My understanding is that the intent here is to use a hash function to create a stable path to a particular set of source files or artifacts. If there is a hash collision, then potentially hash(malicious-sources) == hash(trusted-sources), so malicious-sources could silently be used instead of trusted-sources.

Urgau · 2024-06-20T18:05:46Z

src/cargo/util/hasher.rs

    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }


I'm not sure only forwarding Hasher::write is enough, since the endian-ness handling is done on the individual write_{u,i}{8,16,32,64,128} methods and not forwarding those will bypass that endian-ness handling ¹.

I think it's also going to bypass the {u,i}size handling.

Footnotes

1. The default implementation uses native endianness instead of a fixed one.
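
The concern above can be demonstrated with std alone. In the sketch below, both recorder types are hypothetical and simply collect the bytes a hasher would receive. `NativeRecorder` forwards only `write`, so integers arrive via the default `write_*` methods in native byte order and `usize` arrives at native width. `FixedRecorder` overrides a subset of the integer methods to emit little-endian bytes and widens `usize` to 64 bits. This shows the general technique, not the code in src/cargo/util/hasher.rs or in rustc-stable-hash.

```rust
use std::hash::{Hash, Hasher};

/// Forwards only `write`; integer writes fall back to the trait defaults,
/// which pass the value's native-endian bytes to `write`.
#[derive(Default)]
struct NativeRecorder(Vec<u8>);

impl Hasher for NativeRecorder {
    fn finish(&self) -> u64 {
        0 // not a real hash; only the recorded bytes matter here
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
}

/// Overrides unsigned integer writes so the byte stream is the same on
/// every platform. The default signed methods delegate to these.
#[derive(Default)]
struct FixedRecorder(Vec<u8>);

impl Hasher for FixedRecorder {
    fn finish(&self) -> u64 {
        0
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn write_u16(&mut self, i: u16) {
        self.write(&i.to_le_bytes());
    }
    fn write_u32(&mut self, i: u32) {
        self.write(&i.to_le_bytes());
    }
    fn write_u64(&mut self, i: u64) {
        self.write(&i.to_le_bytes());
    }
    fn write_usize(&mut self, i: usize) {
        // Always 8 bytes, even on 32-bit targets.
        self.write(&(i as u64).to_le_bytes());
    }
}

/// Bytes seen by a hasher that forwards only `write`.
fn native_bytes<T: Hash>(value: &T) -> Vec<u8> {
    let mut h = NativeRecorder::default();
    value.hash(&mut h);
    h.0
}

/// Bytes seen by a hasher that fixes endianness and integer width.
fn fixed_bytes<T: Hash>(value: &T) -> Vec<u8> {
    let mut h = FixedRecorder::default();
    value.hash(&mut h);
    h.0
}
```

On a little-endian 64-bit machine the two recorders happen to produce the same bytes, so the difference only shows up on big-endian or 32-bit targets. A test pinned to a "well-known value" on x86-64 and ARM64 would not catch it.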

After rust-lang/rustc-stable-hash#6 and rust-lang/rustc-stable-hash#8, I think StableSipHasher128 should be a drop-in replacement that we can use directly.

I am also thinking of blake3 as an alternative, but haven't figured out how to make it play nice with ExtendedHasher.

This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Source files can then be found at the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequence of changing how `SourceId` is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short inputs like `SourceId`, but better on long inputs
  like fingerprints. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This is to distinguish same units having different features set between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rustc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

weihanglo · 2024-07-09T20:03:16Z

src/cargo/core/source_id.rs

-    // The hash value depends on endianness and bit-width, so we only run this test on
-    // little-endian 64-bit CPUs (such as x86-64 and ARM64) where it matches the
-    // well-known value.
+    // The hash value should be stable across platforms, and doesn't depend on


This is something we need to fix if the goal is to be fully cross-platform.

Urgau · 2024-07-11T15:06:17Z

For information, rustc-stable-hash v0.1.0 has now been released on crates.io!

rustbot assigned epage Jun 20, 2024

rustbot added A-cache-messages Area: caching of compiler messages A-layout Area: target output directory layout, naming, and organization A-registries Area: registries S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 20, 2024

weihanglo changed the title ~~Stable hash~~ feat!: use stable hash from rustc-stable-hash Jun 20, 2024

weihanglo added the Z-trim-paths Nightly: path sanitization label Jun 20, 2024

Urgau reviewed Jun 20, 2024

weihanglo mentioned this pull request Jun 20, 2024

trim-paths: Remap rules for different dependency kinds #13171

Open

weihanglo added 2 commits July 9, 2024 15:49

refactor(source_id): merge stable hash tests into one

5acc7d0

weihanglo force-pushed the stable-hash branch from 0c54ce2 to 5acc7d0 Compare July 9, 2024 19:51

rustbot added the A-rebuild-detection Area: rebuild detection and fingerprinting label Jul 9, 2024
