Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Kobzol · 2022-01-17T20:27:38Z

Reverts #92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see this comment).

I'll try to send the PR again with a fix for this issue.

r? @the8472

the8472 · 2022-01-17T20:29:42Z

Expecting negative perf impact since it reverts a perf optimization to restore correctness.

@bors r+ rollup=never

bors · 2022-01-17T20:29:43Z

📌 Commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 has been approved by the8472

the8472 · 2022-01-17T20:35:49Z

For the beta backport consideration: Impact is probably low since we have had no issue reports so far that could be traced back to the correcness issue. On the other hand it's a revert so the risk should also be low.

cuviper · 2022-01-19T23:02:15Z

Why would it be a correctness issue? Predictable hash collisions are a performance problem, possible DoS in user-facing contexts, but should otherwise still behave correctly.

Kobzol · 2022-01-19T23:07:58Z

AFAIK, stable hashing has more strict requirements than normal Hash. Conflicts should happen only with the smallest possible probability. With this implementation, a source code change might produce the same hash and thus break incr. compilation. The chance is probably quite small (maybe even this specific situation can't happen in the current version of rustc, hard to tell), but it should be fixed I think.

the8472 · 2022-01-19T23:12:15Z

AFAIK, stable hashing has more strict requirements than normal Hash.

Indeed, that's part of the StableHasher contract.

rust/compiler/rustc_data_structures/src/stable_hasher.rs

Lines 150 to 154 in 5e57faa

    
           /// - Stable hashes are sometimes used as identifiers. Therefore they must 
        
           ///   conform to the corresponding `PartialEq` implementations: 
        
           /// 
        
           ///     - `x == y` implies `hash_stable(x) == hash_stable(y)`, and 
        
           ///     - `x != y` implies `hash_stable(x) != hash_stable(y)`.

cuviper · 2022-01-19T23:19:34Z

The negative case must still be subject to the pigeonhole principle though. You can try to improve that probability, but it can't be guaranteed. i.e. "x != y strongly implies hash_stable(x) != hash_stable(y)."

the8472 · 2022-01-19T23:33:40Z

The hashes are u128, we don't want to do significantly worse than random collisions. Different structs mapping to the same input to the hash function obviously are collisions, and an entire class thereof.

cuviper · 2022-01-19T23:37:19Z

Yeah, agreed. I'm not arguing against the revert or the beta-backport, just noting that we can't be perfect.

apiraino · 2022-01-20T15:16:54Z

Beta backport approved as per compiler team on Zulip

@rustbot label +beta-accepted

pnkfelix · 2022-01-20T15:18:47Z

Yeah, agreed. I'm not arguing against the revert or the beta-backport, just noting that we can't be perfect.

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

the8472 · 2022-01-22T08:59:01Z

@pnkfelix

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

There shouldn't be any since - assuming a perfectly uniform hash function - it'd take 2⁶⁴ inputs to get a collision. other bugs and hardware glitches will get you long before that. For all practical purposes unique inputs -> unique outputs.

Or do you mean it should be documented what happens when the requirement about the inputs isn't upheld?

bors · 2022-01-23T06:07:49Z

⌛ Testing commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 with merge d9c1bf5eea3c3d6330796bd1f7bdc08347eb7760...

bors · 2022-01-23T06:25:03Z

💔 Test failed - checks-actions

the8472 · 2022-01-23T09:11:35Z

Tests need to be reblessed.

Kobzol · 2022-01-24T08:08:28Z

Done.

the8472 · 2022-01-24T09:06:13Z

@bors r+

bors · 2022-01-24T09:06:14Z

📌 Commit 50f8062 has been approved by the8472

bors · 2022-01-24T11:20:04Z

⌛ Testing commit 50f8062 with merge d2dc425...

bors · 2022-01-24T14:29:57Z

☀️ Test successful - checks-actions
Approved by: the8472
Pushing d2dc425 to master...

rust-timer · 2022-01-24T16:04:57Z

Finished benchmarking commit (d2dc425): comparison url.

Summary: This change led to very large relevant regressions 😿 in compiler performance.

Very large regression in instruction counts (up to 7.9% on incr-full builds of clap-rs check)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

cuviper · 2022-01-24T16:52:15Z

@rustbot label: +perf-regression-triaged

#93014 (comment)

Expecting negative perf impact since it reverts a perf optimization to restore correctness.

…zero-bytes, r=the8472 Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" Reverts rust-lang#92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see [this comment](rust-lang#92103 (comment))). I'll try to send the PR again with a fix for this issue. r? `@the8472`

pnkfelix · 2022-01-26T03:17:15Z

@pnkfelix

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

There shouldn't be any since - assuming a perfectly uniform hash function - it'd take 264 inputs to get a collision. other bugs and hardware glitches will get you long before that. For all practical purposes unique inputs -> unique outputs.

Or do you mean it should be documented what happens when the requirement about the inputs isn't upheld?

For anyone interested in where this went, this conversation largely moved over to Zulip (though I don't think it necessarily reached a conclusion): https://zulip-archive.rust-lang.org/stream/131828-t-compiler/topic/Ident.20and.20.60PartialEq.60.20.2F.20.60HashStable.60.html#269112656

[beta] Backports * Update cargo * rust-lang/cargo#10328 — remove --host flag * rust-lang/cargo#10325 — Fix documenting with undocumented dependencies * rust-lang#93128 — Add script to prevent point releases with same number as existing ones * rust-lang#93014 — Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" * rust-lang#93012 — Update pulldown-cmark version to fix markdown list issue

…sion, r=the8472 Compress amount of hashed bytes for `isize` values in StableHasher This is another attempt to land rust-lang#92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was [reverted](rust-lang#93014) because it could produce the [same hash](rust-lang#92103 (comment)) for different values even in quite simple situations. I have since added a basic [test](rust-lang#93193) that should guard against that situation, I also added a new test in this PR, specialised for this optimization. ## Why this optimization helps Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashing `i64` actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where is this datatype used directly in the compiler, and this place has also been showing up in traces that I used to find out when is `isize` being hashed. This place is `rustc_span::FileName::DocTest`, however, I suppose that isizes also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it [quite a lot](rust-lang#92103 (comment)). Now, I'm not sure if special casing `isize` is the correct solution here, maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and other few benchmarks that I was testing). ## New approach Since the most common case is a single byte, I added a fast path for hashing `isize` values which positive value fits within a single byte, and a cold path for the rest of the values. To avoid the previous correctness problem, we need to make sure that each unique `isize` value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed). We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we just simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen. To avoid this situation, values `[0, 0xFE]` are hashed as a single byte. When we hash a larger (treating `isize` as `u64`) value, we first hash an additional byte `0xFF`. Since `0xFF` cannot occur when we apply the single byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`: 1) When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 2) When neither `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 3) When `a` is within `[0, 0xFE]` and `b` isn't, when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`. When we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different. r? `@the8472`

rust-highfive assigned the8472 Jan 17, 2022

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 17, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 17, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 17, 2022

the8472 added perf-regression-triaged The performance regression has been triaged. beta-nominated Nominated for backporting to the compiler in the beta channel. labels Jan 17, 2022

pnkfelix added the beta-accepted Accepted for backporting to the compiler in the beta channel. label Jan 20, 2022

Kobzol mentioned this pull request Jan 22, 2022

Add test for stable hash uniqueness of adjacent field values #93193

Merged

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jan 23, 2022

This comment has been minimized.

Sign in to view

the8472 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 23, 2022

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher"

50f8062

Kobzol force-pushed the revert-92103-stable-hash-skip-zero-bytes branch from cb225f2 to 50f8062 Compare January 24, 2022 08:07

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 24, 2022

bors added the merged-by-bors This PR was explicitly merged by bors. label Jan 24, 2022

bors merged commit d2dc425 into rust-lang:master Jan 24, 2022

rustbot added this to the 1.60.0 milestone Jan 24, 2022

rustbot added the perf-regression Performance regression. label Jan 24, 2022

Kobzol deleted the revert-92103-stable-hash-skip-zero-bytes branch January 24, 2022 16:06

ehuss mentioned this pull request Jan 25, 2022

[beta] Backports #93306

Merged

ehuss removed the beta-nominated Nominated for backporting to the compiler in the beta channel. label Jan 25, 2022

ehuss modified the milestones: 1.60.0, 1.59.0 Jan 25, 2022

Kobzol mentioned this pull request Jan 28, 2022

Compress amount of hashed bytes for isize values in StableHasher #93432

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Kobzol commented Jan 17, 2022

the8472 commented Jan 17, 2022

bors commented Jan 17, 2022

the8472 commented Jan 17, 2022

cuviper commented Jan 19, 2022

Kobzol commented Jan 19, 2022

the8472 commented Jan 19, 2022

cuviper commented Jan 19, 2022

the8472 commented Jan 19, 2022

cuviper commented Jan 19, 2022

apiraino commented Jan 20, 2022

pnkfelix commented Jan 20, 2022

the8472 commented Jan 22, 2022

bors commented Jan 23, 2022

bors commented Jan 23, 2022

This comment has been minimized.

the8472 commented Jan 23, 2022

Kobzol commented Jan 24, 2022

the8472 commented Jan 24, 2022

bors commented Jan 24, 2022

bors commented Jan 24, 2022

bors commented Jan 24, 2022

rust-timer commented Jan 24, 2022

cuviper commented Jan 24, 2022

pnkfelix commented Jan 26, 2022

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Conversation

Kobzol commented Jan 17, 2022

the8472 commented Jan 17, 2022

bors commented Jan 17, 2022

the8472 commented Jan 17, 2022

cuviper commented Jan 19, 2022

Kobzol commented Jan 19, 2022

the8472 commented Jan 19, 2022

cuviper commented Jan 19, 2022

the8472 commented Jan 19, 2022

cuviper commented Jan 19, 2022

apiraino commented Jan 20, 2022

pnkfelix commented Jan 20, 2022

the8472 commented Jan 22, 2022

bors commented Jan 23, 2022

bors commented Jan 23, 2022

This comment has been minimized.

the8472 commented Jan 23, 2022

Kobzol commented Jan 24, 2022

the8472 commented Jan 24, 2022

bors commented Jan 24, 2022

bors commented Jan 24, 2022

bors commented Jan 24, 2022

rust-timer commented Jan 24, 2022

cuviper commented Jan 24, 2022

pnkfelix commented Jan 26, 2022