Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Widen TypeId from 64 bits to 128. #75923

Closed
wants to merge 1 commit into from
Closed

Conversation

eddyb
Copy link
Member

@eddyb eddyb commented Aug 25, 2020

Doubling the bit-width of the type hash inside TypeId would serve two purposes:

  • with a crater run we could find out if anyone was transmute-ing TypeIds to u64 (although not if they read an u64 from a &TypeId in a less checked way)
  • it should be more collision-resistant (similar to how we use 128-bit hashes in the incremental dep-graph)

Maybe we shouldn't use SipHash for TypeId, as unlike incremental dep-graph hashes, we don't need computing TypeIds to be incredibly fast, but also we can change that independently of the bit-width we keep in TypeId.

@rust-highfive
Copy link
Collaborator

r? @LukasKalbertodt

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 25, 2020
@LukasKalbertodt
Copy link
Member

I don't think I'm the right person to review this. Randomly assigning someone from the compiler team:

r? @estebank

From libs perspective: I think we do not promise anything about the type (we even explicitly state that it's "opaque"), so this change seems fine.

@jyn514 jyn514 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Aug 28, 2020
@eddyb
Copy link
Member Author

eddyb commented Aug 28, 2020

Probably should've picked a reviewer myself.

r? @oli-obk cc @nikomatsakis

Also, I want a crater run but the queue seems pretty bad right now and I wouldn't want to add onto it (especially if this might not be able to be check-only).

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@rust-highfive rust-highfive assigned oli-obk and unassigned estebank Aug 28, 2020
@bors
Copy link
Contributor

bors commented Aug 28, 2020

⌛ Trying commit 3990fad with merge 7d696513621aa8d27282101be594aedb03593636...

@bors
Copy link
Contributor

bors commented Aug 28, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 7d696513621aa8d27282101be594aedb03593636 (7d696513621aa8d27282101be594aedb03593636)

@rust-timer
Copy link
Collaborator

Queued 7d696513621aa8d27282101be594aedb03593636 with parent 41aaa90, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (7d696513621aa8d27282101be594aedb03593636): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@oli-obk
Copy link
Contributor

oli-obk commented Aug 28, 2020

diff lgtm, I think we should just merge it this early in the cycle and see what happens. Feel free to r=me or T-compiler FCP on it (with my check box pre-checked). I'm considering this a pure T-compiler thing as libstd impl details are now up to us.

@Mark-Simulacrum
Copy link
Member

I agree that this is purely T-compiler; and I don't think it needs an FCP. It should be an entirely private implementation detail. Waiting for a non-check crater run to come back also seems hard, and we're at the very start of a cycle right now so I think we should just land it and keep an eye out for regressions.

I was going to just r=oli, but it looks like this breaks one of the perf benchmarks (in turn breaking another one, which depends on webrender crate as well). I think we should fix that before landing this. @eddyb, would you be up for submitting a PR to rustc-perf which changes the transmute to a transmute_copy with u128 or similar? I've not investigated at all why this is being done either.

        |      | error[E0512]: cannot transmute between types of different sizes, or dependently-sized types                          >
        |      |   --> webrender/src/record.rs:34:13                                                                                  >
        |      |    |                                                                                                                 >
        |      | 34 |             mem::transmute::<TypeId, u64>(TypeId::of::<ApiMsg>())                                               >
        |      |    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                       >
        |      |    |                                                                                                                 >
        |      |    = note: source type: `std::any::TypeId` (128 bits)                                                                >
        |      |    = note: target type: `u64` (64 bits)                                                                              >

@eddyb
Copy link
Member Author

eddyb commented Aug 28, 2020

See, this is why I wanted a crater run: I've either seen, or had strong suspicions of, code like that being in the wild.

It doesn't really matter that it's internal, it's always been u64 and people have been abusing that.

Maybe it's fine to do a check-only crater, but transmute_copy or equivalent will easily be UB (if it reads more than the size of TypeId, at most it could transmute to [u8; size_of::<TypeId>()]), so I was hoping unit tests would notice that (but maybe not, I doubt we'll catch everything with unit tests). Sadly I don't have the time to keep track of this, I only did it because it seemed relatively easy, the subject arose during a conversation with @mystor and @yaahc.

Note that there is an even greater danger, of people being able to assume the value of TypeId is stable between executions of a program: we can't make it rely on memory addresses (deduplicated across sibling crates by the linker), which was one of @mystor's suggestions for making it collision-proof. That's because ASLR would cause different values to be observed.

This could also be observed at compile-time, if someone were to try and transmute a TypeId into an integer in a constant expression (which I suspect is possible with unions? I forget what the limits are).

(we could maybe turn a TypeId collision into a linker error, while keeping them plain integers, at which point the double width would just make the likelihood significantly smaller, without affecting soundness either way)

@RalfJung
Copy link
Member

Ouch.

Maybe it's fine to do a check-only crater, but transmute_copy or equivalent will easily be UB

Well, this increases the size, so previous transmute_copy should not be UB (but truncate the ID). Right?

@eddyb
Copy link
Member Author

eddyb commented Aug 30, 2020

@RalfJung Right, that part was more of a reply to @Mark-Simulacrum's suggestion for fixing webrender.

If we change it to transmute_copy::<_, u128> but later decide that e.g. we've found a way to ensure collisions are detected, and maybe we want to switch back to u64, that code is now UB. Or if we try to do something clever with addresses, which would result in a platform-dependent size for TypeId, etc.


So I guess there's no easy way to detect non-transmute misuse of TypeId, and there's no good reason to run tests, except for maybe some assert_eq!(size_of::<TypeId>(), 8) that would fail at runtime?

@RalfJung
Copy link
Member

For webrender, they could do something like

match size_of::<TypeId>() {
  8 => /* transmute_copy to u64 */,
  16 => /* transmute_copy to u128 */,
  _ => panic!(),
}

@nagisa
Copy link
Member

nagisa commented Aug 31, 2020

@eddyb I’m having a hard time coming up with a mechanism for linkers to detect a collision. We would invariably need to emit symbols where the name of the symbol is the hash value, but then in a cross-crate scenario you would never be able to know if any sibling crate has already emitted such a symbol itself, causing a false positive.

It would be somewhat more plausible in a situation where the intermediate crates put the information about generated TypeId values and only the last step that generates a binary artifact (bin/staticlib/cdylib) could check for collisions and maybe emit these symbols.

Finally, anything linker-based with cdylib output is going to be non-operational. If something dlopens two distinct rust-produced cdylibs with a collision between them, they could still observe the collision without any kind of loader acting on it.

So the odds at making the collision a guaranteed compile-time/link-time failure are against us. We can maybe do something to catch some cases, but definitely not all of them.


IMHO we should both use a better quality hasher and increase the bit width here. TypeId having no collisions is something a ton of unsafe code relies onto to be sound, so we better do our best to make sure these collisions can only happen if there's somewhere a couple dozen suns powering the laptop that's compiling Rust code.

(Silly ideas start here) Even better if we can figure out how to make weirdly sized like e.g. size_of::<TypeId> = 15 so that people stop transmuting it into integers. (Silly ideas end here)

@eddyb
Copy link
Member Author

eddyb commented Aug 31, 2020

but then in a cross-crate scenario you would never be able to know if any sibling crate has already emitted such a symbol itself, causing a false positive.

I was imagining using the v0 mangling to compare the actual types, but maybe there's no way to tell the linker "make sure multiple definitions of this symbol have the same contents".

Worst case we could use C++ global constructors support to register all the generated TypeIds in each crate, into a single HashMap<TypeId, &'static str>, at runtime.

@mystor
Copy link
Contributor

mystor commented Aug 31, 2020

Coming back to my suggestion which @eddyb mentioned in #75923 (comment), one possibility is that TypeId is made to contain a pointer. In that world, the v0 mangling is used to define a weak global symbol pointing to something (e.g. a byte in rodata). The linker would then fuse these weak symbols together across compilation units to have a single unique address for each TypeId in the program.

This would break compatibility with the current situation, as was noted in that earlier comment, in a few ways:

  1. size_of::<TypeId>() would vary depending on platform to be the size of a pointer instead of u64.
  2. TypeId would contain relocations, meaning that some optimizations may not be possible with it's value as it's not known by the compiler.
  3. TypeId would not be stable across executions, as it could be impacted by ASLR
  4. If multiple copies of std are linked into the same binary, they will likely not share TypeIds (as these symbols probably won't be exported from the cdylib/staticlib, and so won't be shared across different std instances). (This may actually fix a potential issue where a type from one version of a library could be interpreted by another version, when they don't actually match?)

If the conclusion is that we need to break some of the properties which code is currently relying on (such as the size of TypeId), then I think this may be an option if we're willing to change all of these things at once.

(It could also be impossible for reasons I'm not seeing off the top of my head, but y'know)

Worst case we could use C++ global constructors support to register all the generated TypeIds in each crate, into a single HashMap<TypeId, &'static str>, at runtime.

I don't think global constructors are supported on every platform, and could have a lot of overhead. Especially because TypeId is in core, so we couldn't use a hashmap or any dynamic heap allocation as part of it's implementation.

@eddyb
Copy link
Member Author

eddyb commented Aug 31, 2020

One thing about @mystor's suggestion is that IMO we should undo #72488 (including on beta, making this a ticking clock!) until we have an implementation of it, to avoid anyone relying on the compile-time value being less than fully opaque, if we ever want to do it.


so we couldn't use a hashmap or any dynamic heap allocation as part of it's implementation

My idea was a static hash table (i.e. an array) with a fixed number of buckets (e.g. 256) using a subset of the bits of TypeId, and each bucket would be a linked list of per-crate statics. The locking/atomics would IMO be a worse problem than the memory representation.

We could even precompute this hash table for a Rust binary (and maybe pick a per-binary bucket count, to make it flatter/more like a PHF), so only staticlib, cdylib (and dylib?) have any runtime costs.

At least we could start by (statically) tracking what TypeIds were created in the crate graph, to at least make the common configuration of "just building a regular Rust binary, and doing no dynamic loading" sound.

@Mark-Simulacrum
Copy link
Member

Going to nominate for compiler team to talk about @eddyb's concern wrt to typeid stabilization; it might be a T-lang question too. It would perhaps be good to file a dedicated issue for that concern, though, I am not sure I follow it completely myself. It seems like const-time TypeId is the "easy case" as we can definitely check for hash collisions in that context, potentially with an entirely different scheme than used at runtime.

@crlf0710 crlf0710 added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Dec 11, 2020
@crlf0710 crlf0710 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 15, 2021
@JohnCSimon JohnCSimon added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 8, 2021
@JohnCSimon JohnCSimon added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 1, 2021
@Dylan-DPC-zz
Copy link

closing this as inactive after a discussion with the author

@Dylan-DPC-zz Dylan-DPC-zz added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 10, 2021
@the8472
Copy link
Member

the8472 commented Jul 27, 2022

Late comment, but

Well, as I said I am not convinced by probability-based arguments for soundness -- in particular probability-based arguments for deterministic functions.

This is a ludicrous argument. Cryptographic hash functions having collisions coming up in routine use with a handful (a few thousands) of inputs would be a stop-the-world, patch-everything and incinerate unpatchable hardware event. One of the defining features of cryptographic hashes is collision resistance. And no, structured but distinct inputs will not do better than random inputs in this area, because if structured inputs did better then attacks would be using them!

So, assuming the hash function is secure (upholding its contract) and that your inputs are distinct then it would take more than the age of the universe, a significant fraction of all matter in the universe used as storage and enormous amounts of energy even given optimal computers operating at the temperature of the microwave background to find one collision.

If a rustc-invocation happens to do that on the side by generating so many type IDs also doing otherwork then I'll say you either have a serious problem or you're Clippy and perhaps really do need to fix rust to take over the remainder of the universe. And in that case you you simply switch to a 512bit hash function.

@eddyb
Copy link
Member Author

eddyb commented Jul 28, 2022

@the8472 For what it's worth, very similar arguments have been brought up a lot more on #95845, at some point in that discussion I even believe @RalfJung started agreeing as well, and in the end #95845 was abandoned because of consensus along the very same lines (at least to the extent that I understand them).

My main regret nowadays is that we didn't design the Rust (v0) mangling 2-3 years earlier than we did, because maybe if we had, it might've been easier to transition TypeId to a model closer to C++ RTTI (maybe like #95845, or perhaps focusing more on the linker deduplication?), but that's just random hypotheticals.
(the mangling's relevance here being that the legacy mangling crunches all of its type information into a hash almost exactly like TypeId's, so there's no interesting "structure" it could add to TypeId, whereas the v0 mangling is much closer to the Itanium C++ mangling, in being able to represent types etc. - large parts of it are even isomorphic to rustc's own understanding of structural equality and nominal identity for types and other semantic entities, making it much more connected to the "Rust abstract machine" and other formal notions, than concrete "low-level"/"platform-specific"/ABI)

@RalfJung
Copy link
Member

RalfJung commented Jul 28, 2022

(btw calling the second mangling scheme Rust uses "v0" was IMO not a good choice -- I always think v0 is the legacy one and v1 the more structured one we are transitioning to)

So, assuming the hash function is secure (upholding its contract) and that your inputs are distinct then it would take more than the age of the universe,

Well if you apply the same argument to md5 you will get "similar" numbers and we know collisions there because of cryptographic weaknesses. So just looking at the number of bits is a rather naive way of analyzing this.

IOW, "assuming the hash function is secure" is doing a lot of work here. We know the hash function is not secure, in the sense that we know that there are collisions (by the pigeon hole principle). There isn't even an agreed-upon standard for what makes a deterministic hash function secure. So this is all not quite as unambiguous as you make it sound.

But, yeah, I would still be impressed if Rust were able to find a collision in sha256. Other assumptions the language makes about how hardware and operating systems work are IMO much bigger risks to validity of the Rust claims than those hash collisions.

@the8472
Copy link
Member

the8472 commented Jul 28, 2022

Well if you apply the same argument to md5 you will get "similar" numbers and we know collisions there because of cryptographic weaknesses. So just looking at the number of bits is a rather naive way of analyzing this.

Given non-malicious inputs md5 would in fact still do the job, albeit at only 64bits worth of collision-resistance which would actually get you much smaller numbers, more supercomputer-scale instead of universe-scale.

We know the hash function is not secure, in the sense that we know that there are collisions (by the pigeon hole principle).

A hash function is considered secure when you can't do better than brute-forcing it.

There isn't even an agreed-upon standard for what makes a deterministic hash function secure. So this is all not quite as unambiguous as you make it sound.

If you mean that most widely used cryptographic hash functions aren't provably secure because the provably-secure ones are way too slow, then that is correct. Which is why we have to replace them every few decades because someone does find a better-than-brute-force attack. But that only means we may have to replace one 256bit hash algorithm with another one at some distant point in the future.

If the concern is that someone might rely on the specific bits of the hash - which is a completely different argument - then we can add the rustc version to the hash inputs which is somewhat similar to randomizing hashmaps which not only helps against HashDoS but also against people relying on iteration order.

@bjorn3
Copy link
Member

bjorn3 commented Jul 28, 2022

If the concern is that someone might rely on the specific bits of the hash - which is a completely different argument - then we can add the rustc version to the hash inputs which is somewhat similar to randomizing hashmaps which not only helps against HashDoS but also against people relying on iteration order.

We already hash the rustc version. The stable crate id, which is included in type id hashes, is a hash of among other things the rustc version.

@the8472
Copy link
Member

the8472 commented Jul 28, 2022

Then I don't see a need to do string comparisons. Against accidental collisions a 128bit hash would do in all but extreme scenarios. Against malicious collisions a 256bit hash + occasional algorithm changes when new attacks are discovered should do. And if we're ever concerned about an attacker with with access to quantum computers then we might have to bump it to 384 bits.

@RalfJung
Copy link
Member

RalfJung commented Jul 28, 2022

A hash function is considered secure when you can't do better than brute-forcing it.

That's an intuition, not a definition.

If you mean that most widely used cryptographic hash functions aren't provably secure because the provably-secure ones are way too slow, then that is correct.

No, what I mean is that there isn't even a definition to prove them secure against. (Which also means that provably-secure ones cannot exist.) There are security definitions for keyed hash functions, but AFAIK there is nothing for unkeyed deterministic hash functions.

For encryption, we have well-established game-based security definitions, encoding the inability of an attacker to learn even a single bit of information about the clear text. For signatures, we have something similar. But for hash functions, no such definition exists: the property we want is "has no collisions that can be found in polynomial time", but that definition is meaningless -- there exist two inputs to sha256 that produce the same output, and so there exists a constant-time attacker that just produces those two inputs. We can't actually compute such an attacker, but to my knowledge nobody figured out a way to incorporate that into a security definition.

Then I don't see a need to do string comparisons. Against accidental collisions a 128bit hash would do in all but extreme scenarios. Against malicious collisions a 256bit hash + occasional algorithm changes when new attacks are discovered should do.

I agree. I am just applying some factual corrections to your statements, but I agree with your conclusions.

@bjorn3
Copy link
Member

bjorn3 commented Jul 28, 2022

We can't actually compute such an attacker, but to my knowledge nobody figured out a way to incorporate that into a security definition.

Would it be possible to do something like finding n unique collisions each in polynomial time where n tends towards infinity. Or find a collision between hash(a || b) and hash(a || c) in polynomial time for any given a. In both cases I think a constant time algorithm would require being able to derive one collision from another collision in constant time, which if the case would almost definitively make the hash algorithm be considered not cryptographically secure.

@RalfJung
Copy link
Member

RalfJung commented Jul 28, 2022

I wouldn't know that any of them is an accepted notion of security.

The one with the a is basically a form of keyed hash function, if you keep a secret. But if there is no secret, then this doesn't really work, and also doesn't reflect how we use hash functions -- we don't hash all Rust types with a common prefix a, and if we did, a would be publicly written in the rustc sources, and so the attacker could attack that particular a, rendering it moot.

@eddyb
Copy link
Member Author

eddyb commented Jul 29, 2022

(btw calling the second mangling scheme Rust uses "v0" was IMO not a good choice -- I always think v0 is the legacy one and v1 the more structured one we are transitioning to)

(entirely the wrong venue for this but...)

I'm not 100% happy w/ it but IIRC choosing 0 has to do with the _R prefix and its encoding.

But, it's "Rust mangling v0", an officially specified format, whereas the other thing is "whatever rustc used to do to cram barely-enough-info into C++ Itanium mangling", and "legacy" is a colloquial name we use for it, not an official identifier.

"Rust didn't have a mangling scheme, one was RFC'd in 2019, and rustc is finally getting close to deploying it in 2022" is the short story.

The fact that we're aware of "what rustc used to do" is not something that's part of Rust's future any more than @T/~T types are in Rust's present. Note that this is different from things like "macros 2.0", where we declare what got stabilized w/ Rust 1.0, as "macros 1.0" - there, both of those have to remain supported ~indefinitely (which also applies to some edition migration stuff).

Speaking of long-term support... you don't need any dedicated tools to get something (just demangle it as a C++ symbol) - in fact, outside of the $XY$ escaping, you can pretty much just squint and read it without demangling demangling anyway, 3foo3bar and foo::bar aren't that different (whereas v0 and C++ manglings aren't limited to a flat structure).
The main reason to keep legacy demangling implementations around is just because it's too tiny of a hack to care about either way, not any practical applicability.

(Oh and in the C port of the demangler, the legacy support uses demangler->version = -1, heh)

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jun 8, 2023
…apkin

Use 128 bits for TypeId hash

Preliminary/Draft impl of rust-lang/compiler-team#608

Prior art (probably incomplete list)
- rust-lang#75923
- rust-lang#95845
thomcc pushed a commit to tcdi/postgrestd that referenced this pull request Aug 24, 2023
Use 128 bits for TypeId hash

Preliminary/Draft impl of rust-lang/compiler-team#608

Prior art (probably incomplete list)
- rust-lang/rust#75923
- rust-lang/rust#95845
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet