Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collisions in type_id #10389

Open
DaGenix opened this Issue Nov 9, 2013 · 38 comments

Comments

Projects
None yet
@DaGenix
Copy link

DaGenix commented Nov 9, 2013

The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64-bits, however, making it feasible to find collisions via a Birthday Attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs:

use std::hash;
use std::hash::Streaming;

// I believe that this pretty much the same thing as hash_crate_independent() in ty.rs
// for a ty_struct on a 64-bit platform
fn hash_struct(local_hash: &str, node: u64) -> u64 {
    let mut state = hash::default_state();
    state.input([18]);
    state.input(local_hash.as_bytes());
    do node.iter_bytes(true) |bytes| { state.input(bytes); true };
    state.result_u64()
}

fn main() {
    // This represents two structures with different node values from a crate with a 
    // local_hash value of "local" that end up getting the same hash and thus, 
    // I think, the same type_id
    assert!(hash_struct("local", 0x9e02c8943c336302) == hash_struct("local", 0x366a806b1d5f1b2));
}
@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Nov 10, 2013

I'm not entirely sure how feasible it is for a program to have 0x366a806b1d5f1b2 node ids (2.5 trillion), but this is still concerning.

We could in theory have very cheap inequality among types, and then have an expensive equality check. Something which may walk the respective TyDesc structures in parallel to make sure that they're the same. We could also bump up the hash size to using something like sha1/sha2 and have the type_id() intrinsic return [u8, ..N] to reduce the possibility of a collision.

Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently Any is used.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Nov 10, 2013

Ah, it was already nominated!

@bill-myers

This comment has been minimized.

Copy link
Contributor

bill-myers commented Nov 10, 2013

Why not compare an interned version of the type data string? (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first)

The linker can be used for interning by emitting a common symbol with the type data string as name and taking its address, and otherwise the same thing can be done manually in a global constructor.

This way it's always a pointer comparison, and there are no collisions.

@DaGenix

This comment has been minimized.

Copy link
Author

DaGenix commented Nov 11, 2013

I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, its not hard to find collisions for more realistic node id values by picking particular values for the crate hashes:

assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804));

The key thing to consider isn't the number of node id values, though: its the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, thought. So, its unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 11, 2013

When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!" but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction but if we can leverage the linker or some other scheme to avoid this danger, we should -- and perhaps we should just go ahead and mark the current scheme as deprecated and just plan on finding a replacement scheme.

@thestinger

This comment has been minimized.

Copy link
Contributor

thestinger commented Nov 11, 2013

A cryptographic hash designed for this purpose (larger output) would be enough. Although, a larger output would be more expensive to compare (four u64 comparisons for SHA2).

@pnkfelix

This comment has been minimized.

Copy link
Member

pnkfelix commented Nov 21, 2013

We don't need to deal with this right now. P-low.

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Jan 20, 2015

How relevant is this issue today? I think that it's all the same, but am not sure.

@thestinger

This comment has been minimized.

Copy link
Contributor

thestinger commented Jan 20, 2015

It's 64-bit so collisions are likely with enough types (consider recursive type metaprogramming) and it doesn't have any check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation.

@vks

This comment has been minimized.

Copy link
Contributor

vks commented Jan 21, 2015

Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations.

@sorear

This comment has been minimized.

Copy link
Contributor

sorear commented Oct 14, 2015

use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. playground: http://is.gd/TwBayX

@pnkfelix

This comment has been minimized.

Copy link
Member

pnkfelix commented Oct 18, 2015

I'd like the lang team to devote a little time to this now that we are post 1.0. Nominating

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Oct 29, 2015

OK, lang team discussed it, and our conclusion was that:

  1. This issue ought to be fixed, it's silly not to.
  2. This is an implementation detail that we could change whenever we want (right?)
  3. Nonetheless, we probably ought to open an RFC or at least a discuss thread, with a proposal to do better, since probably people will have some clever ideas.
  4. Probably the runtime overhead of the virtual call in the Any trait is way more than a strcmp anyhow for all realistic types.
@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Oct 29, 2015

I was wondering about a design where we do something like:

  • generate a static string representing the full type; in static builds, at least, this will be interned by the linker;
  • generate a hash

compare the string pointers for equality (to give a fast equality check). If that fails, compare the hashes for inequality (to give a fast inequality check). If THAT fails, compare the strings for content (to handle dynamic linking).

Although re-reading the thread I see @bill-myers may have had an even more clever solution.

@pnkfelix

This comment has been minimized.

Copy link
Member

pnkfelix commented Oct 29, 2015

@nikomatsakis putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy.

@sorear

This comment has been minimized.

Copy link
Contributor

sorear commented Oct 30, 2015

I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals (we should document the types of undefined behavior that can be hit in safe code, and fix the ones that are possible to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory).

@hmvp

This comment has been minimized.

Copy link

hmvp commented Jan 24, 2017

@Mark-Simulacrum

This comment has been minimized.

Copy link
Member

Mark-Simulacrum commented Jul 19, 2017

@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the general conclusion a couple of times by different people, but please unmark if I'm wrong.

@withoutboats

This comment has been minimized.

Copy link
Contributor

withoutboats commented Jan 26, 2018

You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things.

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

EDIT: The comment I'm replying to is over 2 years old.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Jan 26, 2018

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

It could be unsound, if it fooled Any into an incorrect downcast.

@withoutboats

This comment has been minimized.

Copy link
Contributor

withoutboats commented Jan 26, 2018

@nikomatsakis Right, I meant sorear's other examples (name mangling and such). The fact that this can cause a soundness bug separates it from other mishashes which would just result in a compiler error.

@SoniEx2

This comment has been minimized.

Copy link

SoniEx2 commented Nov 8, 2018

eventbus crate uses a lot of Any, as it encourages you to use a lot of different event types and emulated inheritance. this could become an issue...

@SoniEx2

This comment has been minimized.

Copy link

SoniEx2 commented Nov 8, 2018

What if, instead of making TypeId bigger, we add another implementation detail that works like eventbus?

Calls to Any, TypeId, etc are converted into a set of static usize that get initialized at runtime.

https://internals.rust-lang.org/t/static-generics/8734?u=soni

we can then use an expensive check (string lookup/comparison) once and use a cheap check every other time.

@ubsan

This comment has been minimized.

Copy link
Contributor

ubsan commented Nov 8, 2018

Why not use a pointer comparison of an inline symbol? VTables should work well, if we remove the thing where vtables can be distinct across TUs.

(in C++)

template <typename T>
struct VtableForAny {
  std::size_t size;
  std::size_t align;

  void (*destroy)(void *);
};
template <typename T>
inline VtableForAny<T> VTABLE_FOR_ANY = {
  sizeof(T),
  alignof(T),
  T::~T(),
};

struct Foo {
};

template <typename T>
T *downcast(AnyRef any) {
  if (any.vtable == &VTABLE_FOR_ANY<T>) {
    return static_cast<T*>(any.ptr);
  } else {
    return nullptr;
  }
}

We could do a similar translation for Rust, and this is guaranteed to work by LLVM.

@RalfJung

This comment has been minimized.

Copy link
Member

RalfJung commented Nov 8, 2018

if we remove the thing where vtables can be distinct across TUs.

I think that would not be easy. It's not just TUs; if you create a trait object in a const then it will have a vtable generated by miri, and I am not sure if that will get deduplicated with the one generated for run-time use.

@rkruppe

This comment has been minimized.

Copy link
Member

rkruppe commented Nov 8, 2018

@ubsan

if we remove the thing where vtables can be distinct across TUs.

We can't, for the same reason generic statics couldn't be unique across TUs if we added them (dynamic linking on Windows, possibly other platforms).

@ubsan

This comment has been minimized.

Copy link
Contributor

ubsan commented Nov 8, 2018

@rkruppe ah, yeah, forgot about dynamic linking. It's likely we'd have to do a strcmp then, in the case where addresses are distinct, so you'd likely get a fast path and a slow path.

@RalfJung it's absolutely possible and, in fact, quite easy. It just requires a simple LLVM attribute.

@rkruppe

This comment has been minimized.

Copy link
Member

rkruppe commented Nov 8, 2018

What strings do you want to compare? Lacking a global namespace, the only way I know of to get a "type name" that won't be the same across different crates is using the crate disambiguator, which is a hash, which is therefore susceptible to the same sort of collision. Or can we arrange for such collisions always resulting in a linker error, even when the crates are only linked together dynamically and indirectly?

@ubsan

This comment has been minimized.

Copy link
Contributor

ubsan commented Nov 8, 2018

@rkruppe the name of the type, I think? That should be globally unique.

crate_name::foo::bar::Baz

maybe crate_name:v0.1.0::foo::bar::Baz

@rkruppe

This comment has been minimized.

Copy link
Member

rkruppe commented Nov 8, 2018

Crate names are not globally unique, and neither are (crate name, version) tuples.

@ubsan

This comment has been minimized.

Copy link
Contributor

ubsan commented Nov 8, 2018

@rkruppe huh! then I guess whatever information we pass to the hash function :P

@rkruppe

This comment has been minimized.

Copy link
Member

rkruppe commented Nov 8, 2018

The disambiguator is passed in using the -C metadata argument, rustc doesn't know where it comes from. Arguably that places it outside of the domain of rust and thus collisions of it aren't our soundness problem, but let's consider Rust workflows more broadly. Typically it's computed as a hash by Cargo, and AFAIK that hashing includes a whole ton of things that are impractical to put into the binary.

@SoniEx2

This comment has been minimized.

Copy link

SoniEx2 commented Nov 8, 2018

You can work around the generic statics issue the same way eventbus does it currently.

Language support would be an improvement over using a macro, tho.

@SoniEx2

This comment has been minimized.

Copy link

SoniEx2 commented Nov 22, 2018

So, first, for fixing the collisions:

struct TypeId(&'static [&'static [&'static str]]);

Why this specific layout? Three reasons:

  1. We want an identifier that's consistent across builds, so a string.
  2. We want an identifier that's relatively lightweight, so we want to reuse substrings, or a slice of strings.
  3. We want to handle generics, which contain identifiers, so a slice of slices of strings.

So, for example, we can end up with some of the following TypeIds:

TypeId(&[&["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcLineEvent"]])
TypeId(&[&["rustc-1.30", "cargo-1.30", "std-1.30", "result", "Result<T, E>"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcMessageEvent"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "MyError"]])

Note that these are only examples and the real thing probably wouldn't look like this, but this shows the basic idea.

It's still possible to intentionally create collisions, ofc, but that's different from accidental collisions (what we're trying to prevent here).

Also note that these are still opaque, so the contents need not be unambiguous to humans, only to the compiler (or the debugger, if we want that...).

@thepowersgang

This comment has been minimized.

Copy link
Contributor

thepowersgang commented Nov 24, 2018

@rkruppe With your comment about the metadata string, did you mean that it's "safe" to use it as a disambiguator between different versions of the same crate (be it versions or feature-sets), or that we shouldn't use it (because with non-cargo workflows it may not be passed).

Storing the type "name" and comparing that (potentially with a short hash for a fail-fast option) and trusting the -C metadata argument to be sufficient for disambiguation sounds like it could be the best option for reducing the chance of collisions.

@rkruppe

This comment has been minimized.

Copy link
Member

rkruppe commented Nov 24, 2018

I didn't really mean to recommend any course of action, just list constraints. One of them is that the metadata string is itself a hash in the typical workflow and thus could have collision.

But thinking about the options you gave, it seems to me that: if we are OK with a neglegible chance for collisions, we can achieve that (in principle, I don't know how large the collision risk is today) with a single hash. Well, assuming collisions in the crate disambiguator can't be detected -- I am not sure whether there is perhaps some clever use of the linker to detect them.

@SoniEx2

This comment has been minimized.

Copy link

SoniEx2 commented Nov 24, 2018

I have tricks to improve performance, but they need to come after this

(am I even being understood? I have no idea if ppl can understand me but they often seem to ignore me)

@eduardosm

This comment has been minimized.

Copy link

eduardosm commented Jan 28, 2019

What about keeping somewhere a "cache" of calculated TypeIds? That cache would store a TypeId-type pair for each type for which TypeId::of::<T> has been called. If a collision is detected, another TypeId is calculated using a hash table collision resolution algorithm. So, TypeIds would be the index in a hash table with 2^64 buckets. With this approach, we can continue to use a 64-bit integer as TypeId (or maybe it could be reduced to 32-bit). However, this means that, if the cache is deleted, all crates should be compiled again.

We could also forbid TypeId zero, so size_of::<Option<TypeId>>() == size_of::<TypeId>().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.