Review which hash algorithm to use for incremental compilation hashes. #41215

michaelwoerister · 2017-04-11T12:38:34Z

For incremental compilation we are hashing lots of things, HIR, crate metadata and soon also type checking tables. These hashes are used purely as fingerprints, that is, we compare these hashes in order to find out if the things having been hashed are equal or not. Consequently the primary qualities we require from the hash algorithm are:

That it acts as a pseudo-random-function, i.e. whatever we feed in, the resulting hashes should have a collision probability that is roughly the same as that of random numbers.
That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).
That it is fast.

It remains to be clarified whether it makes sense to use a cryptographic hash function, i.e. that it is hard to find/construct collisions. My guess is no:

Incremental compilation caches are not meant to be distributed. They are platform dependent files created locally by your compiler and there is little use in distributing them.
A 128 bit hash is probably susceptible to a brute force attack anyway.
I cannot think of an attack that exploits a hash collision here. If one wants to tamper with an incremental compilation cache, one could just replace the object file in the cache without a need to ever touch item hashes.

I'd be very interested though if someone could come up with an attack scenario. But even then the solution would probably be to implement some general code signing scheme for these caches.

At the moment we are using BLAKE2 as our hashing algorithm which is a cryptographic hash function and thus fulfills requirements (1) and (2) very well. However, compared to non-cryptographic hash functions it is rather slow. For raw throughput SipHash would be about three times as fast and fast hash functions like Metrohash can be 20 times as fast. Before you get too excited though: Most of the time computing hashes is spent gathering the data that is fed into the hash function, not actually running the function on it. But there are still some gains to be had, here are some preliminary tests of using BLAKE2 -- which is our "worst case" -- and MetroHash -- which is among the fastest hash functions providing 128 bit hashes:

	BLAKE2	MetroHash	SipHash 1-3	SipHash 2-4
libcore	0.258s	0.193s (-25%)	0.189s (-26%)	0.201s (-22%)
librustc	0.435s	0.345s (-20%)	0.320s (-26%)	0.327s (-25%)
syntex_syntax (HIR)	0.221s	0.166s (-25%)	0.168s (-24%)	0.176s (-20%)
syntex_syntax (Metadata)	0.456s	0.326s (-28%)	0.312s (-31%)	0.347s (-24%)

So it seems that a fast hash functions allows to save 20-30% spent during ICH computation. If we could get some confidence that:

a function like MetroHash provides low enough collision probability and
we really don't need a cryptographic hash function

then we could have some free speedups here.

UPDATE:
I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

est31 · 2017-04-11T13:53:20Z

As another idea, if non cryptographic hashes are allowed, we could use SHA-1 on platforms where its hardware accelerated (apparently Ryzen and newer and Goldmont and newer). SHA-1 is slightly slower than BLAKE2, but I guess with hardware acceleration it will be much faster, maybe faster than a MetroHash implemented with generic instructions. Would be interesting to have benchmarks on this.

michaelwoerister · 2017-04-11T14:11:19Z

That's an interesting idea. However, I'd need to see actual numbers for that and they would have to be very compelling in order to justify adding platform-specific code.
Also, once we go down the platform-specific road: MetroHash can also make use of some special instructions, almost doubling its speed.

michaelwoerister · 2017-04-11T14:14:08Z

A 128 bit SipHash13 might also be a good contender. It's pretty fast and it's designed to be a good PRF.

arthurprs · 2017-04-12T22:33:26Z

Siphash is probably a sensible choice here if the numbers look good. Metrohash is great but I think we should try to be conservative to some extent.

michaelwoerister · 2017-04-18T09:26:43Z

@rust-lang/libs: There are no plans of making the 128-bit version of SipHash available in the standard library, right?

sfackler · 2017-04-18T09:32:01Z

Not in a stable manner, no, but that shouldn't matter for compiler internals.

est31 · 2017-04-18T15:54:09Z

Also, afaik its possible to simply use crates from crates.io now. Just add them to the Cargo.toml and commit both that change and the change to Cargo.lock into git.

michaelwoerister · 2017-04-21T12:18:08Z

I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

arthurprs · 2017-04-21T12:21:55Z

Sounds like a clear winner to me. Is that 1-3 or 2-4 variant?

michaelwoerister · 2017-04-21T12:25:07Z

It's 1-3 ...

michaelwoerister · 2017-04-21T12:56:46Z

I added the SipHash 2-4 timings too. As expected they are slightly worse but still pretty good.

briansmith · 2017-04-27T22:24:19Z

That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).

NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x)||"123456789012" is a 256-bit hash but it isn't nearly as strong as SHA-256.

Instead, it would be good to mention the desired collision probability. Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disasterous.

Also, I don't think the use of non-Rust implementations should be rejected since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language then a weaker algorithm that's implemented in Rust.

michaelwoerister · 2017-04-28T10:31:38Z

NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x)||"123456789012" is a 256-bit hash but it isn't nearly as strong as SHA-256.

I'm working under the assumption that a high-quality hash function will use all bits available.

Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disasterous.

A collision would result in some piece data not being detected as having changed. I would not talk about "threat model" though. Cache files are not meant to be distributed. An attacker would be better off just manipulating rlibs directly.

Also, I don't think the use of non-Rust implementations should be rejected since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language then a weaker algorithm that's implemented in Rust.

Nobody said it has to be written in Rust. But maintenance cost is a factor too.

nagisa · 2017-05-18T12:51:17Z

SHA-256 based on the dedicated SHA instructions can do 1.7GB/s on Ryzen. Probably more for SHA-1, but I didn’t benchmark it yet. AVX2 implementation of Fletcher4 that ZFS uses for checksumming does 15GB/s on the same machine. That same Fletcher4 is approximately 2GB/s for plain-old C version.

(EDIT: of course Fletcher4 is good for checksumming and does not really work well for avoiding collisions)

arthurprs · 2017-05-18T14:31:16Z

Any reason NOT to use siphash2-4 on this?

It's gonna give rustc 90% of the speed benefit and it'll leave little to no questions regarding maintaining platform specific code, hashing quality or security (the last is a weak argument, but it exists anyway).

michaelwoerister · 2017-05-18T21:02:30Z

Any reason NOT to use siphash2-4 on this?

It seems like an improvement over what we have now (which does not preclude that there are still better options).
Leaving security aside, is there any reason NOT to use siphash1-3? Does hashing quality degrade with fewer rounds?

arthurprs · 2017-05-18T22:04:21Z

Apparently there's a slight loss of quality on smhasher (avalanche test worse bias from 0.72% (2-4) to 0.9%(1-3)), not sure if it's significant.

source: https://github.com/rurban/smhasher

michaelwoerister · 2017-05-24T09:25:00Z

@alexcrichton What's the state of using crates.io crates in the compiler. Is that something we can do easily now? I guess that we have to at least vendor each crate we are using, right?

Mark-Simulacrum · 2017-05-24T12:41:07Z

As far as I know, crates within the compiler itself can just be depended upon and everything should just work. I think crates that are depended upon by libstd or it's dependencies can't do that yet, but I'm not sure.

alexcrichton · 2017-05-24T14:03:52Z

@michaelwoerister as of #41847 we should be able to just depend on crates.io crates and have things like stability, vendoring, license checking, etc, all just fall out. Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag.

michaelwoerister · 2017-05-24T14:08:23Z

Thanks for the info @Mark-Simulacrum and @alexcrichton!

Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag.

So this will be available when the next beta comes out in a couple of weeks? That's soon enough, there's no rush with this issue.

alexcrichton · 2017-05-24T14:13:39Z

June 8!

joshtriplett · 2017-06-08T16:02:55Z

Has anyone tested seahash? It's several times faster than SipHash and FNV, and has a native Rust implementation. Trying it should be easy given crates.io crate support.

This doesn't need a cryptographically strong hash, and seahash should beat even a hardware-accelerated SHA-256. (Seahash itself appears designed to use hardware-accelerated vectorization where available, while remaining portable.)

michaelwoerister · 2017-06-08T16:23:51Z

Looks promising. @ticki, is there also a version of seahash that provides 128 bit outputs (or more)?

est31 · 2017-06-08T19:46:22Z

One way to get 128 bits of output would be to do this in the finalize step instead of the "xor them all together":

a^=c;
c^=d;
b=a;
d=c;
a^=written;
helper::diffuse(a);
helper::diffuse(c);
a^=c;
c^=b;
helper::diffuse(a);
helper::diffuse(c);
return (a,c);

But maybe @ticki has a better idea.

arielb1 · 2017-10-05T12:45:33Z

If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

michaelwoerister · 2017-10-05T13:03:43Z

@arielb1 Would you be comfortable with using SipHash? It is performing very well too and since it is supposed to combat exploits that are based on hash collisions, the bar for it's output quality is pretty high. Also, I assume that it has had lots of eyes on it since it is used so widely.

arthurprs · 2017-10-05T13:27:03Z

I'm sure this was answered somewhere already, but what happens if hashes collide?

michaelwoerister · 2017-10-05T13:42:55Z

I'm sure this was answered somewhere already, but what happens if hashes collide?

Then the compiler would potentially not re-run some computation although it should. The consequences of that depends on the computation/hash in question. The absolute worst case would be that an object file gets re-used although it shouldn't and you silently end up with the previous machine code version of the functions in there instead of the current one.

est31 · 2017-10-05T14:28:35Z

If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

There are two cases here where collisions could occur theoretically:

One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash, and this happens on purpose. It seems like this is very hard to pull this off. I guess many other ways of attacks are way easier (like creating a compiler plugin and then using unsafe to get access to compiler managed memory).

The second is as a general source of bugs. These collisions might in theory also happen if you are very unlucky. However, I doubt that you'd get "by chance" collisions given the 128 bits.

So generally I don't think the hash is correctness-critical.

michaelwoerister · 2017-10-05T14:47:26Z

One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash.

It's hard to imagine how that would work in practice. A potential attacker could try to inject some code that has the same hash as some other piece of code that is already in your codebase. But the effect of the hash collision would be that the injected piece of code would not get used since the difference between the two pieces of code is not detected :)

nagisa · 2017-10-05T15:01:30Z

Could a SHA256 be considered if the host machine can hardware accelerate it? Or is it necessary to only ever use one hash algorithm for some interop reasons?

…

On Oct 5, 2017 17:48, "Michael Woerister" ***@***.***> wrote: One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash. It's hard to imagine how that would work in practice. A potential attacker could try to inject some code that has the same hash as some other piece of code that is already in your codebase. But the effect of the hash collision would be that the injected piece of code would *not* get used since the difference between the two pieces of code is not detected :) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#41215 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AApc0koWnNd_xpDJPLwnbYMK_HHyFsOYks5spOwpgaJpZM4M6CaR> .

michaelwoerister · 2017-10-05T15:11:01Z

@nagisa I don't know of a use case where it makes sense to move an incr. comp. cache from one machine to another. I personally would prefer to have just one implementation to maintain but if a hardware accelerated SHA256 was significantly faster than SipHash, then in theory it could be done. Do you have numbers on how fast SHA256 can get?

nagisa · 2017-10-05T17:01:58Z

@michaelwoerister I posted #41215 (comment) earlier in the thread. Comparison is against Fletcher4, which is not very useful here, so I ran a similar test using std::hash::SipHasher24 in an environment as close to what I remember I used for the initial test. SipHasher manages aproximately a 2.33GB/s, so SHA doesn’t seem to be faster.

arthurprs · 2017-10-05T17:46:31Z

I think 128bit SipHash is the way to go here, there are surely faster hashers but the returns are very small for this workload (MetroHash results sort of prove that). No amount of bits can guarantee no collisions 🤷‍♀️ and as far as I understand there's no need for crypto hash security anyway.

Also, I don't fell comfortable using MetroHash or (insert your fav. fast hash here) in places other than hashtables.

arielb1 · 2017-10-05T19:07:53Z

@michaelwoerister

My worry was more that a specially-crafted piece of "code" (e.g. string data included from some outside source) could cause MetroHash to get into a bad state in which "particular" collisions are "easy". I can't think of a simple way to extend this into an exploit, but stranger things have happened.

I think that SipHash with a per-incremental-directory random key should be, if not something I'm 100% comfortable with, harder to exploit - exploiting it would require finding the SipHash key (which is not that easy) and crafting data based on that. While I can imagine some online situations in which that could be exploitable, this decreases the attack surface.

arielb1 · 2017-10-05T20:00:41Z

From an academic cryptographer POV, I can hardly call this scheme cryptographically secure. I would consider it a nice opportunity to try and find a contrived way to "break this" if you control all the neighboring degrees of freedom.

Actually, all 128-bit schemes expose an attack with complexity about 2^64 - if there's a "data blob" an attacker can manipulate, he can "birthday attack" his way to a collision and cause some specific target to not be updated. I weakly suspect that sip2-4 has known-key collision attacks with better complexity (it wasn't that well analyzed for collision resistance). 2^64 is already "way too close to be comfortable" (e.g. the bitcoin network computes about that amount of hashes every second), especially 10 years from now.

From a practical standpoint, I think keeping the SIP key reasonably secret would make "generic" attacks more complicated while I still can't see a good way to get a "theoretical" level of security.

joshtriplett · 2017-10-05T21:56:30Z

Would seahash make sense here? It's quite fast, and collision-resistant.

I'm curious about the microbenchmarks used to evaluate it relative to metrohash, given that seahash's own benchmarks seem to disagree with that result. Might be worth doublechecking that with the seahash and metrohash developers, both.

michaelwoerister · 2017-10-06T11:17:38Z

From an academic cryptographer POV, I can hardly call this scheme cryptographically secure.

That's what I'm saying all along: None of this is supposed to be cryptographically secure. Using SipHash or even BLAKE2 would not change that.

michaelwoerister · 2017-10-06T11:30:23Z

@joshtriplett This is the benchmark I used:

#![allow(non_snake_case)]
#![feature(test)]

extern crate test;
extern crate rand;

extern crate metrohash;
extern crate fasthash;

use test;
use rand::{self, Rng};
use metrohash::MetroHash128;
use fasthash::sea::SeaHasher64 as fasthash_SeaHasher64;

fn create_unique_values(count: usize) -> Vec<u8> {
    let mut rng = rand::thread_rng();
    let mut values = Vec::with_capacity(count);
    for _ in 0 .. count {
        values.push(rng.gen())
    }

    values
}

fn run_bench<H: Hasher+Default>(bytes: usize, bh: &mut test::Bencher) {
    let values = create_unique_values(bytes);
    bh.iter(|| {
        let mut hasher = H::default();
        values.hash(&mut hasher);
        test::black_box(hasher.finish());
    })
}

// This is roughly the average amount of data that we hash at a time in the Rust compiler  
const BYTE_COUNT: usize = 400;

#[bench]
fn metrohash128(bh: &mut test::Bencher) {
    run_bench::<MetroHash128>(BYTE_COUNT, bh);
}

#[bench]
fn fasthash_SeaHasher64(bh: &mut test::Bencher) {
    run_bench::<fasthash_SeaHasher64>(BYTE_COUNT, bh);
}

I get these results:

test bench::fasthash_SeaHasher64        ... bench:          67 ns/iter (+/- 3)
test bench::metrohash128                ... bench:          58 ns/iter (+/- 2)

TBH, I would be less comfortable with using SeaHash than with using MetroHash since SeaHash is even newer. There's no doubt that SeaHash is a great hash function but for this use case I prefer something more boring :)

michaelwoerister · 2017-10-06T11:44:42Z

@nagisa Thanks for those numbers! Looks like SipHash is the better choice then.

…sakis incr.comp.: Use 128bit SipHash for fingerprinting This PR switches incr. comp. result fingerprinting from 128 bit BLAKE2 to 128 bit SipHash. When we started using BLAKE2 for fingerprinting, the 128 bit version of SipHash was still experimental. Now that it isn't anymore we should be able to get a nice performance boost without significantly increasing collision probability. ~~I'm going to start a try-build for this, so we can gauge the performance impact before merging (hence the `WIP` in the title).~~ EDIT: Performance improvements look as expected. Tests seem to be passing. Fixes #41215.

michaelwoerister added the A-incr-comp Area: Incremental compilation label Apr 11, 2017

est31 mentioned this issue Apr 27, 2017

Faster hash algorithm? mozilla/sccache#108

Closed

Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 27, 2017

michaelwoerister mentioned this issue Aug 1, 2017

Unnaturally slow expansion on generated source #43573

Closed

michaelwoerister mentioned this issue Sep 19, 2017

incr.comp.: Compute fingerprint for all query results. #44364

Merged

This was referenced Oct 5, 2017

[DO NOT MERGE]: Try out 128bit Metrohash for incr. comp. hashes. #45026

Closed

[DO NOT MERGE]: Try out 128bit SipHash for incr. comp. hashes. #45028

Closed

This was referenced Oct 12, 2017

Add unstable 128 bit SipHash version for internal use in the compiler. #45237

Closed

incr.comp.: Use 128bit SipHash for fingerprinting #45319

Merged

bors closed this as completed in #45319 Oct 20, 2017

michaelwoerister mentioned this issue May 25, 2018

SipHasher takes up lots of time in incremental builds #51054

Closed

michaelwoerister mentioned this issue Apr 1, 2019

[WIP] Use buffered aHash as StableHasher #59598

Closed

michaelwoerister mentioned this issue Mar 20, 2020

Add hash of source files in debug info #69718

Merged

ehuss mentioned this issue Jul 9, 2021

Deduplicate compiler diagnostics. rust-lang/cargo#9675

Merged

michaelwoerister mentioned this issue Feb 13, 2023

Use SipHash-1-3 instead of SipHash-2-4 for StableHasher #107925

Merged

qedawkins mentioned this issue Jan 9, 2024

Add vm.buffer.hash and util.buffer.hash ops iree-org/iree#16003

Merged

briansmith mentioned this issue Jun 20, 2024

Add support for SipHash128 with all-zero key cr-marcstevens/hashclash#40

Open

Review which hash algorithm to use for incremental compilation hashes. #41215

Review which hash algorithm to use for incremental compilation hashes. #41215

Comments

michaelwoerister commented Apr 11, 2017 • edited Loading

est31 commented Apr 11, 2017 • edited Loading

michaelwoerister commented Apr 11, 2017

michaelwoerister commented Apr 11, 2017

arthurprs commented Apr 12, 2017 • edited Loading

michaelwoerister commented Apr 18, 2017

sfackler commented Apr 18, 2017

est31 commented Apr 18, 2017

michaelwoerister commented Apr 21, 2017

arthurprs commented Apr 21, 2017

michaelwoerister commented Apr 21, 2017

michaelwoerister commented Apr 21, 2017

briansmith commented Apr 27, 2017

michaelwoerister commented Apr 28, 2017 • edited Loading

nagisa commented May 18, 2017 • edited Loading

arthurprs commented May 18, 2017

michaelwoerister commented May 18, 2017

arthurprs commented May 18, 2017

michaelwoerister commented May 24, 2017

Mark-Simulacrum commented May 24, 2017

alexcrichton commented May 24, 2017

michaelwoerister commented May 24, 2017

alexcrichton commented May 24, 2017

joshtriplett commented Jun 8, 2017

michaelwoerister commented Jun 8, 2017

est31 commented Jun 8, 2017

arielb1 commented Oct 5, 2017

michaelwoerister commented Oct 5, 2017

arthurprs commented Oct 5, 2017

michaelwoerister commented Oct 5, 2017

est31 commented Oct 5, 2017

michaelwoerister commented Oct 5, 2017

nagisa commented Oct 5, 2017 via email

michaelwoerister commented Oct 5, 2017

nagisa commented Oct 5, 2017 • edited Loading

arthurprs commented Oct 5, 2017 • edited Loading

arielb1 commented Oct 5, 2017

arielb1 commented Oct 5, 2017

joshtriplett commented Oct 5, 2017 • edited Loading

michaelwoerister commented Oct 6, 2017

michaelwoerister commented Oct 6, 2017

michaelwoerister commented Oct 6, 2017

michaelwoerister commented Apr 11, 2017 •

edited

Loading

est31 commented Apr 11, 2017 •

edited

Loading

arthurprs commented Apr 12, 2017 •

edited

Loading

michaelwoerister commented Apr 28, 2017 •

edited

Loading

nagisa commented May 18, 2017 •

edited

Loading

nagisa commented Oct 5, 2017 •

edited

Loading

arthurprs commented Oct 5, 2017 •

edited

Loading

joshtriplett commented Oct 5, 2017 •

edited

Loading