Collisions in type_id #10389
I'm not entirely sure how feasible it is for a program to have […]. We could in theory have very cheap inequality among types, and then have an expensive equality check: something which may walk the respective […]. Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently […].
Ah, it was already nominated!
Why not compare an interned version of the type data string (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first)? The linker can be used for interning by emitting a common symbol with the type data string as its name and taking its address; otherwise the same thing can be done manually in a global constructor. This way it's always a pointer comparison, and there are no collisions.
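A minimal sketch of the interning idea (hypothetical code, not the compiler's implementation): a runtime registry stands in for what the linker would do with common symbols named by the type string, so that type identity comparison collapses to a pointer comparison.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Global registry of interned type-description strings.
static INTERNED: Mutex<Option<HashSet<&'static str>>> = Mutex::new(None);

// Return a canonical address for a type's description string. Equal strings
// always yield the same pointer; distinct strings always yield different ones.
fn intern(type_desc: &str) -> *const u8 {
    let mut guard = INTERNED.lock().unwrap();
    let set = guard.get_or_insert_with(HashSet::new);
    if let Some(existing) = set.get(type_desc) {
        // Already interned: reuse the canonical address.
        existing.as_ptr()
    } else {
        // First sighting: leak one copy per distinct type description;
        // its address is now the identity of that type.
        let canonical: &'static str = Box::leak(type_desc.to_owned().into_boxed_str());
        set.insert(canonical);
        canonical.as_ptr()
    }
}

fn main() {
    let a = intern("alloc::vec::Vec<u8>");
    let b = intern("alloc::vec::Vec<u8>");
    let c = intern("u8");
    assert_eq!(a, b); // same type description: same pointer
    assert_ne!(a, c); // different type descriptions: different pointers
    println!("pointer comparison suffices");
}
```

In the actual linker-based scheme the "registry" is free: the linker merges common symbols with the same name across compilation units, so the address is canonical process-wide without any runtime table.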
I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, it's not hard to find collisions for more realistic node id values by picking particular values for the crate hashes:

```rust
assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804));
```

The key thing to consider isn't the number of node id values, though: it's the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, though. So it's unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is.
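The 0.01% figure can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−n²/2d), for n ids drawn uniformly from a space of size d = 2^64 (a sketch of the estimate, not necessarily the calculation the commenter performed):

```rust
fn main() {
    // Birthday approximation: p ≈ 1 - exp(-n^2 / (2 * d)),
    // with n type ids hashed into a space of d = 2^64 values.
    let n = 60_000_000_f64;
    let d = 2_f64.powi(64);
    let p = 1.0 - (-n * n / (2.0 * d)).exp();
    // Comes out to roughly 0.01%, consistent with the estimate above.
    println!("collision probability: {:.4}%", p * 100.0);
}
```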
When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!", but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction, but if we can leverage the linker or some other scheme to avoid this danger, we should. Perhaps we should just go ahead and mark the current scheme as deprecated and plan on finding a replacement scheme.
A cryptographic hash designed for this purpose (larger output) would be enough. Although, a larger output would be more expensive to compare (four 64-bit comparisons for a 256-bit output).
We don't need to deal with this right now. P-low.
How relevant is this issue today? I think that it's all the same, but am not sure.
It's 64-bit, so collisions are likely with enough types (consider recursive type metaprogramming), and there is no check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation.
Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations.
```rust
use std::any::Any;

fn main() {
    let weird: [([u8; 188250], [u8; 1381155], [u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233], [u8; 798602], [u8; 2074279]); 1]>(&weird);
    println!("{}", whoops.unwrap()[0].0[333333]);
}
```

Actually a soundness issue. Playground: http://is.gd/TwBayX
I'd like the lang team to devote a little time to this now that we are post 1.0. Nominating.
OK, lang team discussed it, and our conclusion was that: […]
I was wondering about a design where we do something like:

1. Compare the string pointers for equality (to give a fast equality check).
2. If that fails, compare the hashes for inequality (to give a fast inequality check).
3. If THAT fails, compare the strings for content (to handle dynamic linking).

Although, re-reading the thread, I see @bill-myers may have had an even more clever solution.
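The three stages above can be sketched as follows (hypothetical types; this is not rustc's actual `TypeId`, and the field names are invented for illustration):

```rust
use std::ptr;

// Hypothetical type id: a hash plus a pointer to an interned description string.
#[derive(Clone, Copy)]
struct TypeId {
    hash: u64,
    desc: &'static str,
}

impl PartialEq for TypeId {
    fn eq(&self, other: &Self) -> bool {
        // 1. Fast equality: the same interned string means the same type.
        if ptr::eq(self.desc, other.desc) {
            return true;
        }
        // 2. Fast inequality: different hashes mean different types.
        if self.hash != other.hash {
            return false;
        }
        // 3. Slow path: full string comparison. Handles dynamic linking
        //    (the same type interned at two addresses) and hash collisions
        //    between distinct types.
        self.desc == other.desc
    }
}

fn main() {
    let a = TypeId { hash: 42, desc: "Vec<u8>" };
    let b = TypeId { hash: 42, desc: "Vec<u8>" };
    let c = TypeId { hash: 42, desc: "Box<u8>" }; // colliding hash, different type
    assert!(a == b);
    assert!(a != c);
}
```

Note that correctness here rests entirely on stage 3; the pointer and hash checks are pure fast paths, so a hash collision degrades performance but not soundness.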
@nikomatsakis Putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy.
I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals. We should document the types of undefined behavior that can be hit in safe code, and fix the ones that are possible to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory.
Thanks to: https://www.reddit.com/r/rust/comments/5pfwjr/mitigating_underhandedness_clippy/dcrew0k/ This example works on Beta and Nightly.
@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the conclusion reached a couple of times by different people, but please unmark if I'm wrong.
Yes, speed is the concern. See #107925 for an example of the impact of the chosen hash function.
For […]
The question as I understood it is whether even a cryptographic hash is sufficient. After all, the probability that there exists a SHA-3 collision is 1. So there is a fundamental decision to be made about whether we are okay with that: assuming a perfect hash function (random-oracle style), is that good enough? And if yes, how close to perfect does the actual hash function have to be?
I was not referring to the current hash function specifically when I talked about computational hardness. I trust your judgment about how hard it is to find a collision in hash functions.
Also, cool work on finding those collisions :)
Ralf, thanks for clarifying that. I had misread what you wrote.
I cannot find it right now, but if it is useful I can dig up a discussion that made a pretty convincing argument that no practical hash function with 128-bit output can ever be collision-free enough to approximate the kind of "perfect" hash for which we could make a "good enough" argument. And my understanding from the discussion above (or elsewhere?) is that increasing the hash size to 256 bits (which would probably be the minimum to be considered "good enough" for theoretical soundness arguments) is considered prohibitive, such that relying solely on hash comparison would be disqualified as a solution.
It is one thing to know that collisions exist in theory. But that view discounts a conscious effort to cause bad stuff.
Yes, threat models have been discussed upthread.
I read some parts; it looks like such threats are not really taken into account in the design.
Not "immediately", it took quite a while.^^
Indeed, a lot of stuff was discussed there.^^ I think we need a decent summary collecting all major positions before we can ask t-lang to take a look.
10 years from the report to the fix. And 32-bit resistance is not enough even for non-malicious uses; even if it's very rare, it'd still happen eventually to someone over the course of normal Rust use.
The bigger problem here is that other parts of the Rust project, and other projects, are now using rustc's use of SipHash-1-3 with an all-zero key and 128-bit output as an indication that it is good enough for their uses. See: […]
I should think that simply directing people to read the discussion here would be sufficient to dissuade them. Or, even better, directing them to a summary.
One important comment in this thread that I hope doesn't get lost: per SipHash's author (and as others have noted), it's designed to be a keyed PRF, not an unkeyed hash function, and […]. From what I can tell, the selection of SipHash for […]
There's no intrinsic reason why […]
Incremental compilation hash collisions can also lead to unsoundness though, can't they? Though there, at least, users have a workaround: build the final artifact for distribution without incremental.
Yes, incremental compilation is a best-effort developer-quality-of-life feature. No incrementally built code should ever be shipped.
I've opened an issue in the rustc-stable-hasher repo about supporting different hash algorithms. It's clear that SipHash13 is not a good choice for most use cases.
Just to confirm, the SipHash128 that rustc is using is identical to this code: […]
I don't think there is any deliberate deviation from the reference implementation, so probably yes. Here are the compiler's test vectors for the algorithm: rust/compiler/rustc_data_structures/src/sip128/tests.rs, lines 26 to 124 in 4e6de37.
IMO, developers shouldn't be put more at risk using incremental compilation than with a full rebuild. We should be aiming for trustworthy incremental builds.
rust-lang/compiler-team#765 proposes another use of hashing in the build process. EDIT: Ah, this was already brought up.
If that is the position of the team, it seems like it should be communicated more clearly. I wasn't aware of this, and I think it is safe to assume that the vast majority of our users are not aware of it, either.
Yes, that's definitely something that should be done. It's not like incrementally compiled code is likely to be wrong (especially with an empty cache, there should be no difference other than CGU partitioning). But even without hash collisions taken into account, doing things incrementally is intrinsically more difficult and much harder to test. The likelihood of additional compiler bugs is just greater.
Anecdotally, the small number of ICEs I hit in recent years all went away with […].
For further context, I recall seeing someone knowledgeable (sorry, don't recall who) say much the same: that incremental is likely to have an unknown number of issues, solely due to the massively expanded surface area. (IIRC, this was fairly close to when incremental was made the default for the dev profile.) However, this is mitigated by the fact that they also expected these issues to manifest as ICEs rather than incorrect compilation. Anecdotally, my experience has also been that every issue I've hit since then (without unstable features) has been an incremental ICE, never a successful compilation generating incoherent behavior, even when I've done UB that would justify two compilation modes having divergent behavior. Additionally, AIUI, the compiler has only gotten better at spotting issues with incremental compilation over time. I'd actually concur that incrementally built code shouldn't be shipped, but not due to any risk of miscompilation; just because that's needlessly leaving performance on the table compared to a non-incremental optimized build. Not every piece of shipped software is distributed enough to justify full fat LTO and PGO, but a clean build is generally worth it. It was also my impression that this was the compiler team's position, and the best-effort stance falls out of that, in the same way […]. An available-by-default "[…]"
I think it would be best to have this discussion elsewhere; this seems like a tangent. Maybe important, but still a tangent.
Yes, sorry for derailing the discussion here. "No incrementally built code should ever be shipped" does make it sound too extreme. Let's put it this way: there is no upside to building code incrementally unless your rebuilds need to be quick. The initial build will be slower, code quality might be lower due to more object files being generated, the resulting binary will be larger, and there is a chance of running into incr.-comp.-only compiler bugs which otherwise are just not an issue. But: any incr. comp. miscompilation bug will certainly be treated as critical, and we have only had one such bug (in 2021), as far as I know. I'll take an action item to add information about incremental builds vs. release builds to the relevant docs for rustc and cargo.
To get back to the request for some lang team input on […]
The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64 bits, however, making it feasible to find collisions via a birthday attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs:
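To illustrate the birthday effect at a smaller scale (a generic demonstration, not the attack used against type_id): truncating any hash to b bits lets you find a collision after roughly 2^(b/2) attempts. The sketch below finds a 32-bit collision in well under a second using std's default hasher; the same argument against a 64-bit type_id needs on the order of 2^32 hash evaluations, which is entirely practical.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash an integer and truncate to 32 bits, standing in for a short type id.
fn h32(x: u64) -> u32 {
    let mut h = DefaultHasher::new();
    x.hash(&mut h);
    h.finish() as u32
}

// Birthday search: remember every digest seen; the first repeat yields two
// distinct inputs with the same truncated hash, after ~2^16 attempts on average.
fn find_collision() -> (u64, u64) {
    let mut seen: HashMap<u32, u64> = HashMap::new();
    for i in 0_u64.. {
        let d = h32(i);
        if let Some(&j) = seen.get(&d) {
            return (j, i);
        }
        seen.insert(d, i);
    }
    unreachable!()
}

fn main() {
    let (a, b) = find_collision();
    assert_ne!(a, b);
    assert_eq!(h32(a), h32(b));
    println!("collision found: h32({a}) == h32({b})");
}
```

The quadratic speedup over brute force (2^(b/2) rather than 2^b) is exactly why a 64-bit type_id is attackable while, say, a 256-bit one is not.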