
Collisions in type_id #10389

Open
DaGenix opened this issue Nov 9, 2013 · 92 comments
Labels
A-typesystem (Area: The type system)
C-bug (Category: This is a bug.)
I-unsound (Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness)
P-low (Low priority)
T-lang (Relevant to the language team, which will review and decide on the PR/issue.)

Comments

@DaGenix

DaGenix commented Nov 9, 2013

The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64 bits, however, which makes it feasible to find collisions via a birthday attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs:

use std::hash;
use std::hash::Streaming;

// I believe that this is pretty much the same thing as hash_crate_independent()
// in ty.rs for a ty_struct on a 64-bit platform (2013-era Rust; the `do` closure
// syntax and std::hash::Streaming no longer exist)
fn hash_struct(local_hash: &str, node: u64) -> u64 {
    let mut state = hash::default_state();
    state.input([18]);
    state.input(local_hash.as_bytes());
    do node.iter_bytes(true) |bytes| { state.input(bytes); true };
    state.result_u64()
}

fn main() {
    // This represents two structures with different node values from a crate with a 
    // local_hash value of "local" that end up getting the same hash and thus, 
    // I think, the same type_id
    assert!(hash_struct("local", 0x9e02c8943c336302) == hash_struct("local", 0x366a806b1d5f1b2));
}
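To make the birthday-attack feasibility concrete, here is a self-contained sketch (function names are mine) that finds a collision in a 32-bit truncation of the standard library's DefaultHasher. It needs only about 2^16 samples; against a full 64-bit hash the same search takes roughly 2^32 work, which is well within reach of a determined attacker:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Truncate DefaultHasher's 64-bit output to 32 bits so the demo runs
// in well under a second; the attack scales as 2^(bits/2).
fn hash32(n: u64) -> u32 {
    let mut h = DefaultHasher::new();
    n.hash(&mut h);
    h.finish() as u32
}

// Classic birthday search: remember every hash seen and stop at the
// first repeat. Expected ~2^16 iterations for a 32-bit output.
fn find_collision() -> (u64, u64) {
    let mut seen: HashMap<u32, u64> = HashMap::new();
    for n in 0u64.. {
        let h = hash32(n);
        if let Some(&m) = seen.get(&h) {
            return (m, n);
        }
        seen.insert(h, n);
    }
    unreachable!()
}

fn main() {
    let (a, b) = find_collision();
    assert!(a != b && hash32(a) == hash32(b));
    println!("distinct inputs {} and {} share a 32-bit hash", a, b);
}
```

The same structure, applied to the 64-bit hash of the type description, is what makes collisions like the one above findable.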
@alexcrichton
Member

I'm not entirely sure how feasible it is for a program to have 0x366a806b1d5f1b2 node ids (roughly 2.4 × 10^17), but this is still concerning.

We could in theory have very cheap inequality among types, and then have an expensive equality check. Something which may walk the respective TyDesc structures in parallel to make sure that they're the same. We could also bump up the hash size to using something like sha1/sha2 and have the type_id() intrinsic return [u8, ..N] to reduce the possibility of a collision.

Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently Any is used.

@alexcrichton
Member

Ah, it was already nominated!

@bill-myers
Contributor

Why not compare an interned version of the type data string? (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first)

The linker can be used for interning by emitting a common symbol with the type data string as name and taking its address, and otherwise the same thing can be done manually in a global constructor.

This way it's always a pointer comparison, and there are no collisions.

@DaGenix
Author

DaGenix commented Nov 11, 2013

I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, it's not hard to find collisions for more realistic node id values by picking particular values for the crate hashes:

assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804));

The key thing to consider isn't the number of node id values, though: it's the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, though. So it's unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is.
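The 60-million figure can be checked with the standard birthday approximation P ≈ n²/2^(b+1) for an ideal b-bit hash (function name mine):

```rust
/// Approximate probability of at least one collision among `n` values
/// drawn uniformly from a `bits`-bit space (birthday approximation
/// P ≈ n^2 / 2^(bits + 1), accurate while P is small).
fn collision_probability(n: f64, bits: f64) -> f64 {
    n * n / 2f64.powf(bits + 1.0)
}

fn main() {
    // ~60 million 64-bit type ids: roughly a 1-in-10,000 chance of a
    // collision, matching the 0.01% estimate above.
    let p = collision_probability(60e6, 64.0);
    println!("P(collision) ≈ {:.6}", p); // ≈ 0.000098
}
```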

@nikomatsakis
Contributor

When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!" but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction but if we can leverage the linker or some other scheme to avoid this danger, we should -- and perhaps we should just go ahead and mark the current scheme as deprecated and just plan on finding a replacement scheme.

@thestinger
Contributor

A cryptographic hash designed for this purpose (larger output) would be enough, although a larger output would be more expensive to compare (four u64 comparisons for SHA2).

@pnkfelix
Member

We don't need to deal with this right now. P-low.

@steveklabnik
Member

How relevant is this issue today? I think that it's all the same, but am not sure.

@thestinger
Contributor

It's 64-bit, so collisions are likely with enough types (consider recursive type metaprogramming), and it doesn't have any check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation.

@vks
Contributor

vks commented Jan 21, 2015

Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations.

@sorear
Contributor

sorear commented Oct 14, 2015

use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. Playground: http://is.gd/TwBayX

@pnkfelix
Member

I'd like the lang team to devote a little time to this now that we are post-1.0. Nominating.

@pnkfelix pnkfelix added I-nominated T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Oct 18, 2015
@nikomatsakis
Contributor

OK, lang team discussed it, and our conclusion was that:

  1. This issue ought to be fixed, it's silly not to.
  2. This is an implementation detail that we could change whenever we want (right?)
  3. Nonetheless, we probably ought to open an RFC or at least a discuss thread, with a proposal to do better, since probably people will have some clever ideas.
  4. Probably the runtime overhead of the virtual call in the Any trait is way more than a strcmp anyhow for all realistic types.

@nikomatsakis
Contributor

I was wondering about a design where we do something like:

  • generate a static string representing the full type; in static builds, at least, this will be interned by the linker;
  • generate a hash

Compare the string pointers for equality (to give a fast equality check). If that fails, compare the hashes for inequality (to give a fast inequality check). If THAT fails, compare the strings for content (to handle dynamic linking).

Although re-reading the thread I see @bill-myers may have had an even more clever solution.
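A minimal sketch of that three-tier scheme (all type and function names are mine; the interning itself is assumed rather than implemented):

```rust
/// Hypothetical TypeId representation: a precomputed hash plus an
/// interned string describing the full type.
struct TypeIdRepr {
    hash: u64,          // hash of the type description
    name: &'static str, // interned full type description
}

fn type_ids_equal(a: &TypeIdRepr, b: &TypeIdRepr) -> bool {
    // Fast equality: interned strings share an address in static builds.
    if std::ptr::eq(a.name, b.name) {
        return true;
    }
    // Fast inequality: different hashes mean different types.
    if a.hash != b.hash {
        return false;
    }
    // Slow path: full content comparison handles dynamic linking.
    a.name == b.name
}

fn main() {
    let x = TypeIdRepr { hash: 1, name: "Vec<u8>" };
    let y = TypeIdRepr { hash: 2, name: "Vec<u16>" };
    assert!(type_ids_equal(&x, &x));
    assert!(!type_ids_equal(&x, &y));
    println!("ok");
}
```

The interesting property is that the slow string comparison only runs when the pointers differ but the hashes agree, i.e. either a genuine cross-library duplicate or a hash collision.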

@pnkfelix
Member

@nikomatsakis putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy.

@sorear
Contributor

sorear commented Oct 30, 2015

I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals (we should document the types of undefined behavior that can be hit in safe code, and fix the ones that are possible to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory).

@hmvp

hmvp commented Jan 24, 2017

@Mark-Simulacrum
Member

@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the general conclusion a couple of times by different people, but please unmark if I'm wrong.

@Mark-Simulacrum Mark-Simulacrum added I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness C-bug Category: This is a bug. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 19, 2017
@briansmith

briansmith commented Dec 22, 2023

My understanding is that all unkeyed hash functions, including state-of-the-art cryptographic ones, lack mathematical proof of their collision resistance; only heuristics and approximations suggest, without proving, that these functions have the desired statistical and cryptographic properties.

Usually we don't use hash equality as a substitute for value equality unless the hash function is believed to have a collision birthday bound around 1/2^128. SHA-256/SHA512-256/SHA3-256/etc. would be the minimum. Again, the SipHash authors in their paper specifically disclaim collision resistance for such purposes; the original paper says "We comment that SipHash is not meant to be, and (obviously) is not, collision resistant." So besides the output being shorter than we'd use for this purpose, the whole usage is specifically disclaimed by the inventors. The paper is at https://cr.yp.to/siphash/siphash-20120918.pdf.

Yes, this is all extremely problematic.
Could you elaborate? What problems is this causing?

Any part of the compiler that assumes x == y when hash(x) == hash(y) for some types x and y is going to allow an attacker to publish a crate on crates.io that looks totally benign but causes the compiler to miscompile crates involved in a compilation, with probability P, where P is determined by the collision resistance of the hash function. Somebody could put up a totally valid PR for my own crates that implements useful and correct functionality which I would be happy to accept, which triggers this, and I would have no way of knowing.

So then, we have to ask ourselves:

  1. Why are we even using cryptography for this in the first place, when it is completely unneeded? Notice how clang implements std::type_index which is pointer sized, i.e. smaller than Rust's TypeId now.
  2. If we must insist on using cryptography, what is the acceptable probability P that we're willing to live with? Hopefully higher than 1/2^64.
  3. If we must insist on continuing to use cryptography for this, what is the collision resistance of the hashing function we're currently doing, the way we're currently doing it? People have posted claims regarding this, but I don't see anybody accepting those claims as valid, and I don't see us finding a consensus about it. How are we going to come to a consensus on the answer?

Now we have to ask ourselves, is it prudent to wait for somebody to publish an attack and put it on crates.io, e.g. one that causes a TypeId collision with std::io::Error or something else commonly used? It looks like it would be hugely disruptive, and potentially within script kiddie territory. Even worse, if/when it would happen, we would probably waste a lot of time debating how bad it is.

@the8472
Member

the8472 commented Dec 22, 2023

is going to allow an attacker to publish a crate on crates.io that looks totally benign but causes the compiler to miscompile crates involved in a compilation, with probability P,

e.g. one that causes a TypeId collision with std::io::Error or something else commonly used?

Those would require preimage attacks, not collisions. To create a collision an attacker must control both types, which means they're screwing up just their own crate. Which, yes, could still be problematic, just not the scenario you're describing here.

Additionally this might be easy to mitigate, if it isn't already. siphash is designed as a keyed hash function. I don't know what the constraints wrt. reproducible builds are but perhaps a random seed could be included in the crate metadata that would change the hash output whenever a clean build is made of a crate.
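The keyed-hash point can be illustrated with the standard library's RandomState, which seeds each hasher with a random key (the exact algorithm is unspecified, currently a SipHash variant; the input string below is a made-up stand-in for type metadata):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hash `data` under the random key held by `state`.
fn hash_with(state: &RandomState, data: &str) -> u64 {
    let mut h = state.build_hasher();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    let seed_a = RandomState::new(); // fresh random key
    let seed_b = RandomState::new(); // another fresh random key
    let input = "ty_struct[local]#4080"; // hypothetical type metadata
    // Same key: deterministic. Different keys: unrelated outputs, so a
    // collision crafted for one seed almost surely breaks under another.
    assert_eq!(hash_with(&seed_a, input), hash_with(&seed_a, input));
    println!("{:016x} vs {:016x}",
             hash_with(&seed_a, input),
             hash_with(&seed_b, input));
}
```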

Somebody could put up a totally valid PR for my own crates that implements useful and correct functionality which I would be happy to accept, which triggers this, and I would have no way of knowing.

Theoretically yes, practically an attack does not necessarily give an attacker sufficiently fine control to disguise this in human-readable text, especially if they only control part of the input.

Anyway, this is putting the cart before the horse. As I wrote earlier, afaik such things are outside rustc's threat model because this would be a mouse hole compared to the open barn doors that are build.rs, proc macros, LLVM miscompilations and other randomly-occurring or exploitable sources of soundness issues. If this were the last thing before rustc could be considered a provably-secure compiler then it could be reevaluated. Until then it's likely Good Enough™.
And to reiterate, just because something is not a security issue doesn't mean it won't be fixed. If siphash is found to be weaker than expected it can be swapped out for something better. It's not set in stone.

Why are we even using cryptography for this in the first place, when it is completely unneeded? Notice how clang implements std::type_index which is pointer sized, i.e. smaller than Rust's TypeId now.

Rust has no RTTI, so this would incur additional footprint costs.

If we must insist on using cryptography, what is the acceptable probability P that we're willing to live with? Hopefully higher than 1/2^64.

That number may not be astronomically small, but it's still GPU-years small. I.e. it's unlikely to arise by chance. If we were trying to defend against well-funded adversaries intentionally searching for collisions then yes, it should be increased.

@bjorn3
Member

bjorn3 commented Dec 22, 2023

I don't know what the constraints wrt. reproducible builds are but perhaps a random seed could be included in the crate metadata that would change the hash output whenever a clean build is made of a crate.

Many people need reproducibility. Note, though, that we do already include a hash of the full rustc version string through StableCrateId. This includes the commit hash for the official builds and thus isn't predictable for future versions. And when distros rebuild rustc the version string is almost certainly different and thus so is the TypeId.

@RalfJung
Member

RalfJung commented Dec 22, 2023

I agree with @the8472 . It is also my understanding that rustc's threat model does not, in general, include maliciously crafted input (as in, source code). Where practical the toolchain will be made resistant to such inputs, but for TypeId collisions so far this has not been considered practical.

This was discussed before, I think it was here.

Assuming a non-malicious, i.e. randomly picked, input program, the probability analysis seems correct to me.

@briansmith

is going to allow an attacker to publish a crate on crates.io that looks totally benign but causes the compiler to miscompile crates involved in a compilation, with probability P,

e.g. one that causes a TypeId collision with std::io::Error or something else commonly used?

Those would require preimage attacks, not collisions.

We don't have an argument in favor of preimage resistance of SipHash. Again, SipHash is a PRF; all the security assumptions we have for SipHash are premised on having a random secret key. So we don't have any argument in favor of significant collision resistance or (second) preimage resistance for SipHash with a fixed key being used as a hash function.

Somebody could put up a totally valid PR for my own crates that implements useful and correct functionality which I would be happy to accept, which triggers this, and I would have no way of knowing.

Theoretically yes, practically an attack does not necessarily give an attacker sufficiently fine control to disguise this in human-readable text, especially if they only control part of the input.

We don't know how easy or hard it would be. Finding out would likely be a huge amount of work.

Anyway, this is putting the cart before the horse. As I wrote earlier, afaik such things are outside rustc's threat model because this would be a mouse hole compared to the open barn doors that are build.rs, proc macros,

It's true that users are responsible for ensuring security in the face of build.rs and proc macros. However, users cannot be responsible for preventing TypeId collisions because it's not practical for them to detect them.

And to reiterate, just because something is not a security issue doesn't mean it won't be fixed.

OTOH, if it wasn't a security issue, I wouldn't bother putting any effort into it. It is a security issue. It is the very annoying kind where probably nobody can justify spending the effort to analyze how practical an attack would be, or to create a PoC.

If siphash is found to be weaker than expected it can be swapped out for something better. It's not set in stone.

What is the "expected" strength of SipHash for collision resistance and preimage resistance when used with a fixed key as a hash function? AFAICT there has been no attempt to establish an expected strength. The paper that introduced SipHash specifically warned against having any such expectation.

Why are we even using cryptography for this in the first place, when it is completely unneeded? Notice how clang implements std::type_index which is pointer sized, i.e. smaller than Rust's TypeId now.

Rust has no RTTI, so this would incur additional footprint costs.

To do it like clang, in the most naive way, it would cost at most 1 byte of "RTTI info" per type where the type ID is actually used. Let's say we moved to word-size type IDs like clang. Then on 32-bit targets a type ID would cost 4 + 1 = 5 bytes, and on 64-bit targets a type ID would cost 8 + 1 = 9 bytes. Currently, with 128-bit TypeIds in Rust, each type ID costs 16 bytes. So it seems like the hash-based approach actually costs more space than doing it like clang's std::type_index does. (More code is required to move 128-bit values around than word-size values, so this probably understates the cost of the hash-based approach.)

(IDK if it is possible to instead use the address of the Any vtable for each type to avoid that extra byte. Otherwise I know of a theoretical way of avoiding the extra byte with the clang-like approach but IDK how practical it is.)
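The size comparison above can be checked on a current toolchain; this sketch just prints the numbers (the exact values depend on rustc version and target):

```rust
use std::any::TypeId;
use std::mem::size_of;

fn main() {
    // On recent rustc, TypeId wraps a 128-bit hash: 16 bytes per id.
    println!("size_of::<TypeId>() = {} bytes", size_of::<TypeId>());
    // The clang-like scheme discussed above: one word-size id plus the
    // single byte of per-type data it points at.
    println!(
        "clang-like cost on this target: {} + 1 bytes",
        size_of::<usize>()
    );
}
```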

If we must insist on using cryptography, what is the acceptable probability P that we're willing to live with? Hopefully higher than 1/2^64.

That number may not be astronomically small, but it's still GPU-years small. I.e. it's unlikely to arise by chance. If we were trying to defend against well-funded adversaries intentionally searching for collisions then yes, it should be increased.

It is an unfortunate situation that it is extremely expensive to evaluate the security and correctness of the hash-based approach, such that no "good people" can afford to take the time to do it, as we have little to gain. Only adversaries could justify the effort. Already, I think we're at the breaking point of the amount of effort we want to spend on this.

@bjorn3
Member

bjorn3 commented Dec 22, 2023

To do it like clang, in the most naive way, it would cost at most 1 byte of "RTTI info" per type where the type ID is actually used.

How does clang ensure that this value is globally unique for a given type even across dylibs? AFAIU on Windows there is no way to ensure that two DLLs defining the same global variable will get a single instance. This is also why statics can't be generic. And how would a Hash implementation for TypeId that is consistent across multiple executions of the same program be possible with this scheme? The only thing you can hash would be the pointer itself, which due to ASLR is non-deterministic between executions.
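The aside that "statics can't be generic" is exactly what blocks the naive per-type-static approach in Rust today; this sketch (names mine) shows that a static inside a generic function is shared across all instantiations rather than duplicated per type:

```rust
// NOTE: this is the approach that does NOT work. A `static` inside a
// generic function is instantiated once, not once per monomorphization,
// so its address cannot distinguish types.
fn naive_type_addr<T>() -> usize {
    static MARKER: u8 = 0; // one shared instance for every T
    &MARKER as *const u8 as usize
}

fn main() {
    // Different types, same address: no per-type identity here.
    assert_eq!(naive_type_addr::<u8>(), naive_type_addr::<String>());
    println!("shared marker address: {:#x}", naive_type_addr::<u8>());
}
```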

@the8472
Member

the8472 commented Dec 23, 2023

I think the issue hinges on the current threat model. Changing it would have significant consequences, e.g. LLVM miscompilations could become CVEs, might have to be fixed under secrecy and require point releases; LLVM would have to play along with that. Various I-unsound issues would become P-critical. Perhaps more resources would have to be spent on fuzzing rustc at a higher level (ICEs wouldn't be the goal here). Perhaps a hardened rustc mode/build would be necessary. Etc. etc.

Or you could plead that this is a special case, different from all other ways to produce unsoundness. But imo that's not the case, because finding novel weaknesses in LLVM or Rust's type system may be easier than finding a weakness in siphash.

There are other options. Even if we don't consider this a priority that doesn't prevent you or others from working on it (read: patches welcome). Perhaps a more detailed security analysis could be done, digging into what exactly gets hashed, what defenses exist (asserts in the compiler?) and what that implies about the bounds on reliability of any hypothetical exploit. Perhaps there are some cheap mitigations that wouldn't affect performance or reproducibility. Or perhaps someone could run rustc with debug asserts enabled over all crates and that'd do the job. Idk.
And as I have said previously, the TypeId implementation is not set in stone. If someone writes an alternative that achieves the same goals but is more robust that could get accepted too. But that's probably insufficient to allay your more general concerns since siphash is relied on in the compiler for other things.

@briansmith

briansmith commented Jan 2, 2024

I think the issue hinges on the current threat model. Changing it would have significant consequences, e.g. LLVM miscompilations could become CVEs, might have to be fixed under secrecy and require point releases; [...] Various I-unsound issues would become P-critical.

It seems like there's a broader discussion to be had about differing expectations about how security-relevant problems in the compiler/toolchain/libraries are handled. I don't think this issue is the place for that discussion. I think mixing that meta-issue into this specific issue has created most of the conflict here.

  • I think we should be able to agree that the way TypeId is implemented and used (both in the standard library, and the analogous use in the compiler) is based on a line of reasoning that starts with a false (or, at least, unproven and very likely to be proven false) premise.
  • I think we should be able to agree that the API that promises that TypeId is "globally unique" is both vague almost to the point of being meaningless, and also promising something that isn't being (and likely can't be) implemented.
  • I think we should be able to agree that we don't want our claims (and future proofs) about memory safety of Rust's standard library to be qualified on the collision-resistance of (mis)using SipHash with a fixed key. I don't think any of us want to make Rust's standard library to be "memory-safe with some probability that we can't specify."
  • I think we should be able to get to agreement that we don't want to qualify the maximum possible correctness of the Rust compiler similarly.

I think it is harder to get to agreement on priority, responsibility, security advisory stuff, etc. I suggest we all just defer all thought of those aspects.

Regarding intentionally-created collisions: A recent blog post was published at https://www.da.vidbuchanan.co.uk/blog/colliding-secure-hashes.html where the author shows that it is not difficult to (intentionally) force collisions even using an actually-believed-secure hash function (SHA-256) truncated to 128 bits. SHA-256 is a much stronger hash function than SipHash-128 misused with a fixed key. I am far from the first one to point this out, you can see others make it at https://news.ycombinator.com/item?id=19405756 and many other places. Similarly, if you Google for preimage resistance of SipHash then you'll find most of the discussions quickly end by noting that SipHash doesn't even try to be a secure hash function, because it is a PRF that requires a secret key.

@briansmith

To do it like clang, in the most naive way, it would cost at most 1 byte of "RTTI info" per type where the type ID is actually used.

How does clang ensure that this value is globally unique for a given type even across dylibs? AFAIU on Windows there is no way to ensure two DLL's defining the same global variable will get a single instance. This is also why statics can't be generic.

It is hard to answer that question when posed so abstractly. A lot of times such issues arise when people are violating the One Definition Rule (ODR) in C++, and I'm not sure what such a situation would look like when they aren't violating the ODR. It would be good to see a description of a Rust program composed of multiple shared libraries in a way that is supported by the Rust toolchain and where the "two DLLs defining the same global variable" issue would arise w.r.t. a Clang-like TypeId implementation strategy.

And how would a Hash implementation for TypeId that is consistent across multiple executions of the same program be possible with this scheme? The only thing you can hash would be the pointer itself, which due to ASLR is non-deterministic between executions.

  1. Luckily the Rust standard library doesn't guarantee that a TypeId is consistent for multiple executions of a program.
  2. The TypeId doesn't necessarily have to be the absolute address; it could be the offset from the beginning of the rodata segment, or similar. This would avoid it being affected by ASLR and, in theory, would allow us to throw away the never-used extra byte that would be allocated for each TypeId in the Clang-like scheme.

@the8472
Member

the8472 commented Jan 2, 2024

I think we should be able to agree that the way TypeId is implemented and used (both in the standard library, and the analogous use in the compiler) is based on a line of reasoning that starts with a false (or, at least, unproven and very likely to be proven false) premise.

[...] where the author shows that it is not difficult to (intentionally) force collisions even using an actually-believed-secure hash function (SHA-256) truncated to 128 bits.

No, this loops back to the threat model. "collision resistance" always refers to the cryptographic property, not the statistical one. Which means it's only relevant under malicious inputs, which makes the threat model relevant.
If we only assume natural inputs that do not take advantage of the structure of the hash function or of rustc implementation details then only simpler properties like the avalanche effect should be relevant.

By saying "proven to be false" you're pulling targeted attacks into your argument. You can't do that and pretend that this isn't about extending the threat model.

I think we should be able to agree that the API that promises that TypeId is "globally unique" is both vague almost to the point of being meaningless, and also promising something that isn't being (and likely can't be) implemented.

Hrm yeah, perhaps that is a slight overpromise since we currently have no mechanism that can ensure that, especially when dynamic linking is involved. Though for most practical purposes it does fulfill that role.

I think we should be able to agree that we don't want our claims (and future proofs) about memory safety of Rust's standard library to be qualified on the collision-resistance of (mis)using SipHash with a fixed key. I don't think any of us want to make Rust's standard library to be "memory-safe with some probability that we can't specify."

TypeId's contents are not generated by the standard library. It's provided by the compiler and it's an implementation detail. As far as the user is concerned it's a random value assigned once at compile time to each type so they'd get a collision with a specifiable expected probability.
The real implementation doesn't do that because assigning a value exactly once across multiple rustc invocations independently instantiating a type is difficult. But the goal is to observationally behave as-if it had been generated that way.

I think we should be able to get to agreement that we don't want to qualify the maximum possible correctness of the Rust compiler similarly.

I think that should also be a separate discussion and could maybe be solved by a hardened rustc for those who need it.

@bjorn3
Member

bjorn3 commented Jan 2, 2024

It would be good to see a description of a Rust program composed of multiple shared libraries in a way that is supported by the Rust toolchain and where the "two DLLs defining the same global variable" issue would arise w.r.t. a Clang-like TypeId implementation strategy.

Crate A is a dylib and defines a type Foo<T>. Crates B and C depend on this dylib and each try to get the TypeId for Foo<u8>. Crate D is an executable and links against both crate B and crate C. With the clang-like implementation strategy as I understand it both crate B and C would define a global variable and thus produce a different TypeId, which is invalid.

@briansmith

Crate A is a dylib and defines a type Foo<T>. Crates B and C depend on this dylib and each try to get the TypeId for Foo<u8>. Crate D is an executable and links against both crate B and crate C. With the clang-like implementation strategy as I understand it both crate B and C would define a global variable and thus produce a different TypeId, which is invalid.

In the Clang-like strategy, Crate A would define the TypeId symbol for type A::Foo, and crates B and C would reference (and not themselves define) the symbol for A::Foo's TypeId, right?

@bjorn3
Member

bjorn3 commented Jan 2, 2024

Foo<T> is generic here, so Foo<u8> and Foo<MyLocalType> would result in different TypeId. It is impossible to know all generic instantiations that will be used in advance when compiling crate A (for example Foo<Foo<Foo<...1_000_000 levels of recursion...>>> is a valid type), so we can't define the TypeId symbol when compiling crate A and as such have to define it when compiling crates B and C.
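A small sketch of this point (Foo and the instantiations are illustrative): downstream code can mint arbitrarily many instantiations that the defining crate could never have enumerated, each with its own TypeId:

```rust
use std::any::TypeId;

// Stand-ins for the generic type exported by crate A and a type local
// to a downstream crate.
struct Foo<T>(T);
struct MyLocalType;

fn main() {
    // Each instantiation is a distinct type with a distinct TypeId,
    // so the defining crate cannot pre-emit one id symbol per type.
    assert_ne!(TypeId::of::<Foo<u8>>(), TypeId::of::<Foo<MyLocalType>>());
    assert_ne!(TypeId::of::<Foo<u8>>(), TypeId::of::<Foo<Foo<Foo<u8>>>>());
    println!("three instantiations, three distinct TypeIds");
}
```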

@tarcieri
Contributor

tarcieri commented Jan 2, 2024

IIRC there were plans to use 64 bits of that for a hash and the other 64 bits for a pointer to where the full type ID is stored. Then we'd have a fast test for != (different hashes), and with sufficient linker magic a fast test for == (compare the pointer; rely on linker deduplication to make the pointers the same cross-crate). That would be fully sound without relying on a hash and hopefully also not be slower than today for the vast majority of cases.

@RalfJung this seems like a really good solution, but AFAICT it was never clarified if that was implemented in #95845 or what happened with that whole approach.

As a bit of a meta comment the current approach seems to be misapplying cryptographic primitives in ways that create problems where they otherwise wouldn't exist, and I think that's the core concern here.

@RalfJung
Member

RalfJung commented Jan 3, 2024

The linker magic part was never implemented I think. I am not even sure if we ever concluded that all our platforms have sufficiently powerful linkers to provide such magic. Personally I'd be happy to see further experimentation in that area. If the linker magic part works out then as far as I can see this has the same performance characteristics as the current implementation, but I can't speak for the teams so I can't say which other concerns there might be. If the linker magic part does not work out then there is the problem that a successful equality test slows down by a lot since it has to compare the strings for equality.

As a bit of a meta comment the current approach seems to be misapplying cryptographic primitives in ways that create problems where they otherwise wouldn't exist, and I think that's the core concern here.

Using SipHash with a fixed key is certainly odd. (I haven't checked if that's what we do, but it's possible.)

But I don't see how it creates new problems compared to the baseline of using something like fx-hash. Either way, I think the claim that we have a collision probability of around 2^-64 for regular (non-malicious) programs is correct. Of course this is not a precise claim, since there is no precise probability distribution of "regular (non-malicious) programs". What's relevant is whether there is a realistic chance of soundness issues in well-intentioned code here, and I am not aware of evidence for such a soundness issue. And if you worry about malicious code, and rustc (presumably with forbid(unsafe_code)) is your only line of defense, then we are talking about a completely different world. That's why people keep bringing up the attack model.

The theoretician in me would absolutely prefer a TypeId scheme that does not depend on any probabilistic arguments. And I understand that what rustc does makes no sense at all when viewed through the lens of a cyptographer. But pragmatically speaking I don't see what an attack would look like here, and as far as I was able to see nobody has described such an attack above. My understanding is that it would require (1) an attacker finding two Rust types that hash to the same TypeId, and then (2) the attacker must convince me to add their code as a static or dynamic dependency to my program, with me blindly trusting its soundness. Most of the discussion above has been about (1), arguing that this is easier than people thought. (The fact that collisions in 128bit-truncated SHA256 are feasible is new to me, thanks for sharing that!) But that totally misses the point that the soundness argument doesn't just rely on (1) not happening, it relies on (1)+(2) not happening.

For the people arguing that this is worse than "regular" soundness issues (at least I think that is what @briansmith is arguing), I think it'd help to spell out the attack you are imagining. Or maybe we're just discussing how severe of a soundness issue this is? It seems way harder to exploit than some of our others, so I don't see a good reason to give this one particularly high priority.

@briansmith
Copy link

briansmith commented Jan 4, 2024

The linker magic part was never implemented I think. I am not even sure if we ever concluded that all our platforms have sufficiently powerful linkers to provide such magic. Personally I'd be happy to see further experimentation in that area.

Rather than relying on linker "magic," it may make sense to have an optimized case for types known statically at the time the executable is about to be linked, vs. types that aren't known at that time. (Basically, a pre-link step.) It would maybe not be cheap.

If the linker magic part does not work out then there is the problem that a successful equality test slows down by a lot since it has to compare the strings for equality.

By "the strings" I guess you mean the Any::type_name. Any::type_name is optimized for human readability, not for size/performance. Much more compact encodings of the string are possible. Or there are other implementation strategies: for example, one could imagine maintaining a global counter at runtime, with each type_id implementation lazily allocating its TypeId(u64) value on first use. Or see what GCC does to handle the trickier cases in its std::type_info implementation. This is not a new problem, since C++ has basically the same construct and has dealt with it.
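The lazily-allocated-counter idea could be sketched as follows. This is an illustration only (`small_id` is a hypothetical name), and for brevity it keys the table by `std::any::TypeId`; a real implementation would key it by the canonical type encoding instead, since the point is to avoid trusting the hash:

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::sync::Mutex;

// Process-wide table assigning each type a small id on first use.
static TABLE: Mutex<Option<HashMap<TypeId, u64>>> = Mutex::new(None);

fn small_id<T: 'static>() -> u64 {
    let mut guard = TABLE.lock().unwrap();
    let table = guard.get_or_insert_with(HashMap::new);
    let next = table.len() as u64;
    // First use of T allocates the next counter value; later uses
    // return the value already stored for T.
    *table.entry(TypeId::of::<T>()).or_insert(next)
}

fn main() {
    let a = small_id::<u32>();
    let b = small_id::<String>();
    assert_ne!(a, b);                 // distinct types get distinct ids
    assert_eq!(a, small_id::<u32>()); // repeated queries are stable
}
```

One trade-off of this approach is that ids allocated this way are only meaningful within a single process run.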

As a bit of a meta comment the current approach seems to be misapplying cryptographic primitives in ways that create problems where they otherwise wouldn't exist, and I think that's the core concern here.

Either way, I think the claim that we have a collision probability of around 2^-64 for regular (non-malicious) programs is correct.

What is the reasoning that was used to reach that belief? I think it is good to write down the reasoning we are using so we can all agree on it, just as we would review the reasoning behind a change to the borrow checker to verify that it is sound.

And I understand that what rustc does makes no sense at all when viewed through the lens of a cyptographer.

I am not saying this "makes no sense at all."

Again, the beginning of my argument was, "Hey, it looks like you're assuming a certain collision rate; let's see how that collision rate is being calculated, because based on the comment I read, the reasoning isn't sound." And then the other part of my argument is that it looks like people have some threshold for what is an acceptable collision rate, but I don't think we have a sound argument in favor of a particular collision rate as being acceptable.

Keep in mind that these hash comparisons are critical for the correctness and memory-safety of Any::{is, downcast_*} and many other memory-safety-critical primitives that are not marked unsafe, so the collisions directly affect the memory safety of programs. (Similarly for the way similar comparisons are reportedly done within the compiler.)

People keep using this 1/2^64 number; let's see an argument that starts with "Here is how TypeId is calculated" and ends with "1/2^64."

Here is an example: x ↦ u128::MAX is a hash function with a 128-bit output. The birthday bound for a 128-bit hash function gives a collision rate of at least 1/2^64. Therefore this hash function has a collision rate of 1/2^64. (This is not a valid argument, because the birthday bound is only a lower bound; it says nothing about the upper bound.)

For the people arguing that this is worse than "regular" soundness issues (at least I think that is what @briansmith is arguing),

I think we agreed that the meta-discussion about what responsibility if any the compiler has w.r.t. security should be somewhere else, so I'm not going to debate that further here.

To clarify my position: TypeId should do what it is documented to do for all inputs that the toolchain translates into executable programs (roughly, all valid programs), so that all TypeId-based constructs, including Any::{downcast_ref, is}, are memory-safe. No exceptions. We shouldn't accept a 1/2^64 memory-safety failure rate by design. (If we're operating on the assumption that an intentional 1/2^64 memory-safety failure rate is OK, then let's put that in writing in the language reference; I believe such a change won't be accepted.)

@digama0
Copy link
Contributor

digama0 commented Jan 4, 2024

People keep using this 1/2^64 number; let's see an argument that starts with "Here is how TypeId is calculated" and ends with "1/2^64."

You do realize that any such argument is going to have "and therefore P != NP" in the middle, right? Rigorous proofs of pretty much anything in the cryptography world (other than relative hardness proofs) are a luxury we cannot afford.

@briansmith
Copy link

You do realize that any such argument is going to have "and therefore P != NP" in the middle, right? Rigorous proofs of pretty much anything in the cryptography world (other than relative hardness proofs) are a luxury we cannot afford.

I'm just asking people to show how they got the 1/2^64 number. People "believe" the collision rate to be that, but why?

@digama0
Copy link
Contributor

digama0 commented Jan 4, 2024

For no formally justifiable reason. Asking them to produce one is not reasonable under the conditions because we can all see that it is impossible to produce one without solving some really hard mathematical problems first.

Of course I think this sucks. I literally make my trade in proving systems like this correct, and it is disheartening to see cryptography anywhere near this, because of the known limitations of that style of proof (if it can even be called that). But unless you have a better proposal, I don't think it is useful to point this out at this stage; these are the known facts of the situation, and I suspect Any is making impossible promises, at least as long as TypeId has a fixed size.

@the8472
Copy link
Member

the8472 commented Jan 4, 2024

I already linked #107925 (comment) in a previous comment.

@briansmith
Copy link

I already linked #107925 (comment) in a previous comment.

I quickly skimmed the second linked paper. I do not pretend to know anything about differential cryptanalysis. The second paper analyzes SipHash with an unknown key. The authors note that attacks are much easier when the key is known than their characteristic for internal collisions indicates, because other techniques based on input modification can be used against unkeyed functions. Further, whoever did the analysis on our end seems to have extrapolated from the published characteristic for internal collisions to the difficulty of a collision in the final output; I don't see the logic used to make that leap.

For context, the complexity of intentionally creating a collision for SHA-1 is currently estimated at about 2^52 to 2^63, according to results published several years ago. Compared to SipHash-128 with an all-zero key, SHA-1 has more complex rounds, an order of magnitude more rounds, double the internal state size (512-bit vs 256-bit), and a larger output (160 bits vs 128 bits). Accordingly, it is hard to believe a conclusion in which SipHash-128 is more collision-resistant than SHA-1 and equally collision-resistant to SHA-256 (a much stronger hash function) truncated to 128 bits.

@digama0
Copy link
Contributor

digama0 commented Jan 4, 2024

the complexity of intentionally creating a collision

As Ralf said above, the goal is not to be robust against intentional collisions; it is robustness against unintentional collisions. Here it is much more reasonable to assume an even spread of values, because they are not adversarially chosen, so as long as the hash has enough mixing behavior the values are spread throughout the whole range. This is where the "attacker model" makes the difference, because intentionally causing a collision requires a nefarious piece of code in your code or in a dependency, and such code can already do much worse than attempt to break rustc.

@the8472
Copy link
Member

the8472 commented Jan 4, 2024

To phrase it differently: in our model we pretend that the inputs are generated by an actor that neither knows about nor bothers to optimize its input-generation process against the hash function or its key.

Which means we can treat the situation as if a random key had been chosen. It also excludes adaptive black-box attacks.

@RalfJung
Copy link
Member

RalfJung commented Jan 4, 2024

And from that it then follows that the properties of the hash function don't really matter, and we can use the birthday bound with good approximation. So the reasoning for the 2^64 bound is: if we feed random inputs to a ~decent hash function with 128bit output, we'll get collisions at a rate of approximately 2^-64. Real inputs aren't random but the programs people write are extremely unlikely to align with the structure of the hash function in any meaningful way, except if programs are specifically designed to exploit the hash function, which we assume does not happen. (Pragmatically speaking: there are much easier-to-exploit soundness bugs, why would an attacker bother finding a hash collision.)

If I had to model this for a formal soundness argument I'd use some sort of random oracle model.


Rather than relying on linker "magic," it may make sense to have an optimized case for types known statically at the time the executable is about to be linked, vs. types that aren't known at that time. (Basically, a pre-link step.) It would maybe not be cheap.

By "the strings" I guess you mean the Any::type_name. Any::type_name is optimized for human readability, not for size/performance. Much more compact encodings of the string are possible. Or there are other implementation strategies: for example, one could imagine maintaining a global counter at runtime, with each type_id implementation lazily allocating its TypeId(u64) value on first use. Or see what GCC does to handle the trickier cases in its std::type_info implementation. This is not a new problem, since C++ has basically the same construct and has dealt with it.

I don't think anyone is objecting to experimenting with approaches to avoid relying on a hash here.

@briansmith
Copy link

@RalfJung wrote:

Real inputs aren't random but the programs people write are extremely unlikely to align with the structure of the hash function in any meaningful way

I am not sure why we should think so. I do not think we have strong evidence for it.

@RalfJung wrote:

except if programs are specifically designed to exploit the hash function, which we assume does not happen.
(Pragmatically speaking: there are much easier-to-exploit soundness bugs, why would an attacker bother finding a hash collision.)

Where is the evidence for this claim? There are 83 open I-unsound issues total. Only 14 open I-unsound issues are in the T-lang component. I couldn't find 7 that look easier to exploit than this; almost all of them would be easily mitigated by code review. So I don't think we have evidence that this is hard to exploit compared to other I-unsound issues.

If we think like an attacker, one of the reasons this would be appealing to exploit is that it is unlikely to be fixed (this issue is a decade old), and a PoC convincing enough to change that would be extremely expensive to produce. This is exactly what happened with MD5 and FLAME: MD5 collisions were considered too expensive to be a worthwhile attack vector in real life, yet MD5 was already theoretically broken, so nobody could get funding to produce a PoC until it was too late.

If I had to model this for a formal soundness argument I'd use some sort of random oracle model.

You could prove that the current scheme is correct when instantiated with a random oracle model, predicating the proof on "assume we use a hash function that operates like a random oracle," i.e. predicating the proof on the collision-resistance and preimage resistance of whatever hash function is used. In the case of instantiating the scheme with SipHash128-with-an-all-zeros key, you wouldn't be able to use that proof since the hash function doesn't meet the proof's precondition of acting like a random oracle.

In one of the other discussions, it was suggested that somebody trying to force a collision would need to use large inputs that would not be practical as type names, since collision attacks on secure hash functions are typically easier when a certain multi-block structure is used. That is true for secure hash functions, but it is not true for non-secure hash functions. We should be able to find single-block collisions in the SipHash128-with-an-all-zeros-key hash function easily. Consider that single-block MD5 collisions were found; see https://marc-stevens.nl/research/md5-1block-collision/md5-1block-collision.pdf. Note that the two colliding 64-byte messages differed by only two bits total. It should be even easier to force collisions in SipHash128-with-an-all-zeros-key.

I just stumbled across #95845 (comment), where Ralf wrote:

A small TypeId is simply unsound, as we do have real-world examples of collisions. So that is not an option.

So I think we are in agreement overall. He wrote that when TypeId was 64 bits, but the situation isn't too different now that it is 128 bits. Before, we had evidence of a real problem. Now we have an absence of evidence, but not evidence of absence. The evidence that we'd likely need, for or against, is unreasonably expensive to acquire. We could pay a student $100,000 to acquire it if somebody were willing to fund it. But implementing a non-hash-based solution would likely cost much less than that, and we'd have a certain fix.

I think the approach in #95845 looks very good. If performance is an issue, it could be improved by using a more compact encoding of the name instead of the normal v0 mangled form, since the name-mangling scheme is still optimized too much for human readability. The performance measurements of that approach also looked OK.

@bjorn3
Copy link
Member

bjorn3 commented Jan 8, 2024

The v0 symbol mangling is not optimized for human readability at all. It is a compact encoding of the full function/type name. For example, it has backreferences and encodes hashes as base-62 (the stable crate id/disambiguator; this is checked by rustc for collisions and will result in an error if a collision occurs). It also encodes primitive types using single letters. Frankly, I find v0 symbols to be pretty much unreadable, while I can read legacy symbols with very little effort. That may just reflect the fact that I have had to read legacy symbols a lot while working on cg_clif, though.

@RalfJung
Copy link
Member

RalfJung commented Jan 8, 2024

@briansmith From what you say, we do not even have an example of this unsoundness being demonstrated on a recent version of rustc (any version since the switch to 128-bit type IDs). That makes this the only I-unsound issue without a reproducing example. I would say that is fairly clear evidence that it is harder to exploit (and by a significant margin) than all the other I-unsound issues. The others did not require $100k or more in compute cost; this one does.

Absent such an example, what you are saying about detecting this in reviews is pure speculation. I find it hard to believe that the absurd types one would have to construct to create a hash collision would not be caught in review. In contrast, some of the other I-unsound issues can be hidden behind fairly benign-looking function signatures or trait constructions.

This is exactly what happened with MD5 to enable FLAME

The analogy with MD5 misses the fact that we are already assuming the authors of all Rust code that you include in your binary to not be adversarial.

You could prove that the current scheme is correct when instantiated with a random oracle model, predicating the proof on "assume we use a hash function that operates like a random oracle," i.e. predicating the proof on the collision-resistance and preimage resistance of whatever hash function is used. In the case of instantiating the scheme with SipHash128-with-an-all-zeros key, you wouldn't be able to use that proof since the hash function doesn't meet the proof's precondition of acting like a random oracle.

No real hash function acts like a random oracle. So any proof based on the random oracle model fails when instantiated with an actual hash function. At least that is my understanding of the random oracle model. (Though of course SipHash is way further away from that than a cryptographic hash.)

But, implementing a non-hash-based solution would likely cost much less than that, and we'd have a certain fix.

We agree that this is the most desirable outcome. We just disagree on the prioritization. But this is open-source, prioritization means fairly little -- nothing we discuss here is likely to change whether some contributor will make this their next passion project, so the entire discussion is probably rather futile.

@veorq
Copy link

veorq commented Jan 11, 2024

(Thanks @tarcieri for contacting me about this.)

One of my life's regrets is naming SipHash "SipHash" when it's a PRF/MAC rather than a general-purpose, unkeyed hash function :)

When the key is fixed/known, the original 64-bit SipHash is not collision-resistant in an adversarial setting, as the cost of finding a collision is only that of about 2^32 function evaluations (birthday attack etc.). This is a few minutes on a laptop.

It looks like here you'd use SipHash128, the version producing 128-bit output, with 2+4 rounds. The cost of finding collisions would then be about 2^64. This would be days with a bunch of GPUs, so not secure either, cryptographically speaking.

Are there faster ways to find collisions than the birthday attack? I'm not aware of any such cryptanalysis results. There's a "distinguisher" for 3 rounds, but that's a long way from a collision attack faster than the generic one.

That said, quoting @digama0:

the goal is not to be robust against intentional collisions, it is robustness against unintentional collisions

In such a case it would be fine. (Theoretically speaking, what you'd then need is called a universal hash function, which in principle requires a key.) But I don't know whether that "unintentional" assumption is, and will remain, reasonable, as you've discussed.
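The generic birthday attack referred to above can be demonstrated in miniature. The sketch below truncates Rust's DefaultHasher (currently SipHash with a fixed key, though the exact algorithm is unspecified) to 32 bits, so a collision is expected after roughly 2^16 inputs and is found in milliseconds; the same search against a 64-bit output would take about 2^32 evaluations, and about 2^64 against a 128-bit output:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// A 32-bit hash: DefaultHasher truncated to its low 32 bits.
fn hash32(input: u64) -> u32 {
    let mut h = DefaultHasher::new();
    input.hash(&mut h);
    h.finish() as u32
}

// Birthday search: remember every output until one repeats.
fn find_collision() -> (u64, u64) {
    let mut seen: HashMap<u32, u64> = HashMap::new();
    for i in 0u64.. {
        let h = hash32(i);
        if let Some(&j) = seen.get(&h) {
            return (j, i); // two distinct inputs, same 32-bit hash
        }
        seen.insert(h, i);
    }
    unreachable!()
}

fn main() {
    let (a, b) = find_collision();
    assert_ne!(a, b);
    assert_eq!(hash32(a), hash32(b));
    println!("hash32({a}) == hash32({b})");
}
```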

@shelerr
Copy link

shelerr commented Feb 11, 2024

I would like to note that there seems to be a misunderstanding in this discussion about what exactly the birthday paradox implies. Having 128-bit hashes doesn't mean that there is a probability of 1/2^64 of a collision (what would that even mean?). It actually means that if you had 2^64 different types, the probability of a collision between at least two of them would be roughly 50% (~39.34%).

In general, the probability of a collision can be estimated as d^2/(2n), where d is the number of types and n the number of possible hashes (valid when d is much smaller than n). For example, with 2^32 different types the probability would be 1/2^65, and with 2^24 types it would be 1/2^81. Even that seems like more types than any piece of code would ever handle.
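The estimate above can be checked numerically. Working in log2 space keeps the numbers representable for n = 2^128 (the function name is mine, not from the thread):

```rust
// log2 of the approximate collision probability d^2 / (2n),
// given log2(d) types and log2(n) possible hash values:
// log2(d^2 / (2n)) = 2*log2(d) - 1 - log2(n).
fn log2_collision_probability(log2_d: f64, log2_n: f64) -> f64 {
    2.0 * log2_d - 1.0 - log2_n
}

fn main() {
    // 2^32 types with 128-bit hashes: probability about 2^-65.
    assert_eq!(log2_collision_probability(32.0, 128.0), -65.0);
    // 2^24 types: about 2^-81.
    assert_eq!(log2_collision_probability(24.0, 128.0), -81.0);
    // 2^64 types: log2(p) ≈ -1, i.e. around 50%, consistent with the
    // birthday-bound statement (the approximation degrades at this scale).
    assert_eq!(log2_collision_probability(64.0, 128.0), -1.0);
}
```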
