-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review which hash algorithm to use for incremental compilation hashes. #41215
Comments
As another idea, if non cryptographic hashes are allowed, we could use SHA-1 on platforms where its hardware accelerated (apparently Ryzen and newer and Goldmont and newer). SHA-1 is slightly slower than BLAKE2, but I guess with hardware acceleration it will be much faster, maybe faster than a MetroHash implemented with generic instructions. Would be interesting to have benchmarks on this. |
That's an interesting idea. However, I'd need to see actual numbers for that and they would have to be very compelling in order to justify adding platform-specific code. |
A 128 bit SipHash13 might also be a good contender. It's pretty fast and it's designed to be a good PRF. |
Siphash is probably a sensible choice here if the numbers look good. Metrohash is great but I think we should try to be conservative to some extent. |
@rust-lang/libs: There are no plans of making the 128-bit version of SipHash available in the standard library, right? |
Not in a stable manner, no, but that shouldn't matter for compiler internals. |
Also, afaik its possible to simply use crates from crates.io now. Just add them to the Cargo.toml and commit both that change and the change to Cargo.lock into git. |
I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash. |
Sounds like a clear winner to me. Is that 1-3 or 2-4 variant? |
It's 1-3 ... |
I added the SipHash 2-4 timings too. As expected they are slightly worse but still pretty good. |
NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x)||"123456789012" is a 256-bit hash but it isn't nearly as strong as SHA-256. Instead, it would be good to mention the desired collision probability. Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disasterous. Also, I don't think the use of non-Rust implementations should be rejected since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language then a weaker algorithm that's implemented in Rust. |
I'm working under the assumption that a high-quality hash function will use all bits available.
A collision would result in some piece data not being detected as having changed. I would not talk about "threat model" though. Cache files are not meant to be distributed. An attacker would be better off just manipulating rlibs directly.
Nobody said it has to be written in Rust. But maintenance cost is a factor too. |
SHA-256 based on the dedicated SHA instructions can do 1.7GB/s on Ryzen. Probably more for SHA-1, but I didn’t benchmark it yet. AVX2 implementation of Fletcher4 that ZFS uses for checksumming does 15GB/s on the same machine. That same Fletcher4 is approximately 2GB/s for plain-old C version. (EDIT: of course Fletcher4 is good for checksumming and does not really work well for avoiding collisions) |
Any reason NOT to use siphash2-4 on this? It's gonna give rustc 90% of the speed benefit and it'll leave little to no questions regarding maintaining platform specific code, hashing quality or security (the last is a weak argument, but it exists anyway). |
It seems like an improvement over what we have now (which does not preclude that there are still better options). |
Apparently there's a slight loss of quality on smhasher (avalanche test worse bias from 0.72% (2-4) to 0.9%(1-3)), not sure if it's significant. |
@alexcrichton What's the state of using crates.io crates in the compiler. Is that something we can do easily now? I guess that we have to at least vendor each crate we are using, right? |
As far as I know, crates within the compiler itself can just be depended upon and everything should just work. I think crates that are depended upon by libstd or it's dependencies can't do that yet, but I'm not sure. |
@michaelwoerister as of #41847 we should be able to just depend on crates.io crates and have things like stability, vendoring, license checking, etc, all just fall out. Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag. |
Thanks for the info @Mark-Simulacrum and @alexcrichton!
So this will be available when the next beta comes out in a couple of weeks? That's soon enough, there's no rush with this issue. |
June 8! |
Has anyone tested seahash? It's several times faster than SipHash and FNV, and has a native Rust implementation. Trying it should be easy given crates.io crate support. This doesn't need a cryptographically strong hash, and seahash should beat even a hardware-accelerated SHA-256. (Seahash itself appears designed to use hardware-accelerated vectorization where available, while remaining portable.) |
Looks promising. @ticki, is there also a version of seahash that provides 128 bit outputs (or more)? |
One way to get 128 bits of output would be to do this in the finalize step instead of the "xor them all together": a^=c;
c^=d;
b=a;
d=c;
a^=written;
helper::diffuse(a);
helper::diffuse(c);
a^=c;
c^=b;
helper::diffuse(a);
helper::diffuse(c);
return (a,c); But maybe @ticki has a better idea. |
If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior. |
@arielb1 Would you be comfortable with using SipHash? It is performing very well too and since it is supposed to combat exploits that are based on hash collisions, the bar for it's output quality is pretty high. Also, I assume that it has had lots of eyes on it since it is used so widely. |
I'm sure this was answered somewhere already, but what happens if hashes collide? |
Then the compiler would potentially not re-run some computation although it should. The consequences of that depends on the computation/hash in question. The absolute worst case would be that an object file gets re-used although it shouldn't and you silently end up with the previous machine code version of the functions in there instead of the current one. |
There are two cases here where collisions could occur theoretically: One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash, and this happens on purpose. It seems like this is very hard to pull this off. I guess many other ways of attacks are way easier (like creating a compiler plugin and then using unsafe to get access to compiler managed memory). The second is as a general source of bugs. These collisions might in theory also happen if you are very unlucky. However, I doubt that you'd get "by chance" collisions given the 128 bits. So generally I don't think the hash is correctness-critical. |
It's hard to imagine how that would work in practice. A potential attacker could try to inject some code that has the same hash as some other piece of code that is already in your codebase. But the effect of the hash collision would be that the injected piece of code would not get used since the difference between the two pieces of code is not detected |
Could a SHA256 be considered if the host machine can hardware accelerate
it? Or is it necessary to only ever use one hash algorithm for some interop
reasons?
…On Oct 5, 2017 17:48, "Michael Woerister" ***@***.***> wrote:
One is security related where e.g. someone creates some code with a given
hash and it now gets put into a different function as well that has the
same hash.
It's hard to imagine how that would work in practice. A potential attacker
could try to inject some code that has the same hash as some other piece of
code that is already in your codebase. But the effect of the hash collision
would be that the injected piece of code would *not* get used since the
difference between the two pieces of code is not detected :)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#41215 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AApc0koWnNd_xpDJPLwnbYMK_HHyFsOYks5spOwpgaJpZM4M6CaR>
.
|
@nagisa I don't know of a use case where it makes sense to move an incr. comp. cache from one machine to another. I personally would prefer to have just one implementation to maintain but if a hardware accelerated SHA256 was significantly faster than SipHash, then in theory it could be done. Do you have numbers on how fast SHA256 can get? |
@michaelwoerister I posted #41215 (comment) earlier in the thread. Comparison is against Fletcher4, which is not very useful here, so I ran a similar test using |
I think 128bit SipHash is the way to go here, there are surely faster hashers but the returns are very small for this workload (MetroHash results sort of prove that). No amount of bits can guarantee no collisions 🤷♀️ and as far as I understand there's no need for crypto hash security anyway. Also, I don't fell comfortable using MetroHash or (insert your fav. fast hash here) in places other than hashtables. |
My worry was more that a specially-crafted piece of "code" (e.g. string data included from some outside source) could cause MetroHash to get into a bad state in which "particular" collisions are "easy". I can't think of a simple way to extend this into an exploit, but stranger things have happened. I think that SipHash with a per-incremental-directory random key should be, if not something I'm 100% comfortable with, harder to exploit - exploiting it would require finding the SipHash key (which is not that easy) and crafting data based on that. While I can imagine some online situations in which that could be exploitable, this decreases the attack surface. |
From an academic cryptographer POV, I can hardly call this scheme cryptographically secure. I would consider it a nice opportunity to try and find a contrived way to "break this" if you control all the neighboring degrees of freedom. Actually, all 128-bit schemes expose an attack with complexity about 2^64 - if there's a "data blob" an attacker can manipulate, he can "birthday attack" his way to a collision and cause some specific target to not be updated. I weakly suspect that sip2-4 has known-key collision attacks with better complexity (it wasn't that well analyzed for collision resistance). 2^64 is already "way too close to be comfortable" (e.g. the bitcoin network computes about that amount of hashes every second), especially 10 years from now. From a practical standpoint, I think keeping the SIP key reasonably secret would make "generic" attacks more complicated while I still can't see a good way to get a "theoretical" level of security. |
Would seahash make sense here? It's quite fast, and collision-resistant. I'm curious about the microbenchmarks used to evaluate it relative to metrohash, given that seahash's own benchmarks seem to disagree with that result. Might be worth doublechecking that with the seahash and metrohash developers, both. |
That's what I'm saying all along: None of this is supposed to be cryptographically secure. Using SipHash or even BLAKE2 would not change that. |
@joshtriplett This is the benchmark I used: #![allow(non_snake_case)]
#![feature(test)]
extern crate test;
extern crate rand;
extern crate metrohash;
extern crate fasthash;
use test;
use rand::{self, Rng};
use metrohash::MetroHash128;
use fasthash::sea::SeaHasher64 as fasthash_SeaHasher64;
fn create_unique_values(count: usize) -> Vec<u8> {
let mut rng = rand::thread_rng();
let mut values = Vec::with_capacity(count);
for _ in 0 .. count {
values.push(rng.gen())
}
values
}
fn run_bench<H: Hasher+Default>(bytes: usize, bh: &mut test::Bencher) {
let values = create_unique_values(bytes);
bh.iter(|| {
let mut hasher = H::default();
values.hash(&mut hasher);
test::black_box(hasher.finish());
})
}
// This is roughly the average amount of data that we hash at a time in the Rust compiler
const BYTE_COUNT: usize = 400;
#[bench]
fn metrohash128(bh: &mut test::Bencher) {
run_bench::<MetroHash128>(BYTE_COUNT, bh);
}
#[bench]
fn fasthash_SeaHasher64(bh: &mut test::Bencher) {
run_bench::<fasthash_SeaHasher64>(BYTE_COUNT, bh);
} I get these results:
TBH, I would be less comfortable with using SeaHash than with using MetroHash since SeaHash is even newer. There's no doubt that SeaHash is a great hash function but for this use case I prefer something more boring |
@nagisa Thanks for those numbers! Looks like SipHash is the better choice then. |
…sakis incr.comp.: Use 128bit SipHash for fingerprinting This PR switches incr. comp. result fingerprinting from 128 bit BLAKE2 to 128 bit SipHash. When we started using BLAKE2 for fingerprinting, the 128 bit version of SipHash was still experimental. Now that it isn't anymore we should be able to get a nice performance boost without significantly increasing collision probability. ~~I'm going to start a try-build for this, so we can gauge the performance impact before merging (hence the `WIP` in the title).~~ EDIT: Performance improvements look as expected. Tests seem to be passing. Fixes #41215.
…sakis incr.comp.: Use 128bit SipHash for fingerprinting This PR switches incr. comp. result fingerprinting from 128 bit BLAKE2 to 128 bit SipHash. When we started using BLAKE2 for fingerprinting, the 128 bit version of SipHash was still experimental. Now that it isn't anymore we should be able to get a nice performance boost without significantly increasing collision probability. ~~I'm going to start a try-build for this, so we can gauge the performance impact before merging (hence the `WIP` in the title).~~ EDIT: Performance improvements look as expected. Tests seem to be passing. Fixes #41215.
For incremental compilation we are hashing lots of things, HIR, crate metadata and soon also type checking tables. These hashes are used purely as fingerprints, that is, we compare these hashes in order to find out if the things having been hashed are equal or not. Consequently the primary qualities we require from the hash algorithm are:
It remains to be clarified whether it makes sense to use a cryptographic hash function, i.e. that it is hard to find/construct collisions. My guess is no:
I'd be very interested though if someone could come up with an attack scenario. But even then the solution would probably be to implement some general code signing scheme for these caches.
At the moment we are using BLAKE2 as our hashing algorithm which is a cryptographic hash function and thus fulfills requirements (1) and (2) very well. However, compared to non-cryptographic hash functions it is rather slow. For raw throughput SipHash would be about three times as fast and fast hash functions like Metrohash can be 20 times as fast. Before you get too excited though: Most of the time computing hashes is spent gathering the data that is fed into the hash function, not actually running the function on it. But there are still some gains to be had, here are some preliminary tests of using BLAKE2 -- which is our "worst case" -- and MetroHash -- which is among the fastest hash functions providing 128 bit hashes:
So it seems that a fast hash functions allows to save 20-30% spent during ICH computation. If we could get some confidence that:
then we could have some free speedups here.
UPDATE:
I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.
The text was updated successfully, but these errors were encountered: