# Validity of integers and floating point #71

Open
opened this Issue Jan 10, 2019 · 27 comments

7 participants
Member

### RalfJung commented Jan 10, 2019

Discussing the validity invariant of integer and floating point types. Clearly, every possible bit pattern is allowed: for integers, each bit pattern has a distinct and meaningful interpretation, and for floating-point types we have safe no-op conversions between f32 and u32, and between f64 and u64, via to_bits and from_bits. The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value? We could reasonably decide either way.
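The safe no-op conversions mentioned above can be exercised directly with the standard library; a small sketch:

```rust
fn main() {
    // to_bits/from_bits are safe, lossless reinterpretations: every u32 bit
    // pattern is a valid f32 and vice versa.
    let x: f32 = 1.5;
    let bits: u32 = x.to_bits();
    assert_eq!(bits, 0x3FC0_0000);
    assert_eq!(f32::from_bits(bits), x);

    // The same holds for f64 <-> u64, including (quiet) NaN payload bits.
    let nan_bits: u64 = 0x7FF8_0000_DEAD_BEEF;
    assert_eq!(f64::from_bits(nan_bits).to_bits(), nan_bits);
}
```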

Member Author

### RalfJung commented Jan 10, 2019

Arguments for **allowing** uninitialized bits:

- There does not seem to be a reasonable optimization that would be broken by this.
- It does not contradict anything we currently tell LLVM.
- It simplifies writing code with mem::uninitialized()/MaybeUninit::uninitialized() when the type is known to consist of integers only.

Arguments for **disallowing** uninitialized bits:

- If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we avoid the question entirely.
- "No uninitialized data outside of unions" is a nice and easy-to-teach principle.
- Ruling out uninitialized bits in integers and floating-point values makes it easier to find bugs where people accidentally leave such variables uninitialized: a bug-finding tool can flag any occurrence of uninitialized memory in integers/floats as a bug immediately. Also see https://www.youtube.com/watch?v=yG1OZ69H_-o: the fact that C defines overflow on unsigned integers to wrap around has led to real-world bugs (caused by unintended overflow) going undetected by bug-finding tools, because such a tool cannot simply mark every overflow as a bug. If there is a rare exception to a common principle ("integer overflow is a bug", "uninitialized data is a bug"), it is preferable to have the programmer state their intent to violate that principle explicitly (use explicitly wrapping arithmetic operations like wrapping_add; use types like MaybeUninit for explicitly handling uninitialized data).
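The "state intent explicitly" point is already visible in Rust's integer API; a small illustration using only stable standard-library methods:

```rust
fn main() {
    // Wrap-around is requested explicitly, so a bug-finding tool
    // need not flag this as a suspected bug.
    assert_eq!(255u8.wrapping_add(1), 0);

    // Plain `+` panics on overflow in debug builds; checked_add turns
    // the overflow case into a value the programmer (or a tool) can inspect.
    assert_eq!(255u8.checked_add(1), None);
    assert_eq!(254u8.checked_add(1), Some(255));
}
```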


Member Author

### RalfJung commented Jan 10, 2019

 Here's an example of a piece of code that relies on uninitialized integers (and raw pointers, and AtomicPtr) being valid: https://github.com/carllerche/bytes/blob/456221d16521cf54cea0e6569669e47120a1b738/src/bytes.rs#L1810
Member Author

### RalfJung commented Jan 30, 2019 • edited

 An interesting use of uninitialized bits in integers is in crossbeam's AtomicCell: for types of certain sizes, it will use the AtomicU* types. But this means that with a type like (u8, u16), it will use AtomicU32 and thus carry uninitialized bits in a u32. Also see crossbeam-rs/crossbeam#315.
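The layout arithmetic behind that example can be checked with std::mem alone, relying only on guaranteed size and alignment rules:

```rust
use std::mem::{align_of, size_of};

fn main() {
    // (u8, u16) needs 2-byte alignment and 4 bytes of size, so exactly
    // one of its bytes is padding.
    assert_eq!(align_of::<(u8, u16)>(), 2);
    assert_eq!(size_of::<(u8, u16)>(), 4);

    // It is therefore the same size as u32, which is why AtomicCell can
    // back it with AtomicU32 -- at the cost of carrying one uninitialized
    // (padding) byte inside that u32.
    assert_eq!(size_of::<(u8, u16)>(), size_of::<u32>());
}
```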


Collaborator

### gnzlbg commented Jan 30, 2019

> If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.

If we were to allow uninitialized bits, it might be reasonable to say, at least initially, that any operation that's not a read or a write is UB. That would allow us to define those operations later on. For example, adding an initialized integer to an uninitialized one would be UB for now, but later we could define it to be something else, like the result being uninitialized. That is, we wouldn't need to answer these hard questions right now.

I wonder whether it is possible to allow uninitialized bits later on in a backwards-compatible way, or whether we have to make this decision right now? I think we have to make it now, because, e.g., if we forbid uninitialized bits, all unsafe code will assume that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption. In the same way, if we allow uninitialized bits, we can't later disallow them without breaking code that uses them.

> "No uninitialized data outside of unions" is a nice and easy to teach principle.

Agreed.

> A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately.

I don't think it can flag every occurrence -- consider, for example, extern fn foo() -> MaybeUninit -- but it would at least be able to do so for all code that the tool is able to instrument.
Member

### rkruppe commented Jan 30, 2019

> I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

I don't see how. If a type's validity invariant rules out certain *initialized* bit patterns, unsafe code can use those bit patterns for its own purposes, which would clash with later allowing those bit patterns in the validity invariant. However, uninitialized bits -- even if they are valid -- cannot be detected by the program: reading them is either UB or produces some string of zeros and ones (possibly a non-deterministic one). So that kind of counter-example is right out. Moreover, because uninitialized bits will never be *safe*, unsafe code can't be fed them from unknown/external sources. IOW, the only way a library/module/etc. will ever see uninitialized bits is if it produces them itself or obtains them from its "trusted base" (e.g., a function whose functional correctness is relevant to the soundness of the library/module/etc.), in which case it's UB today too, and the problem is an internal bug, not the interface with the rest of the world.
Collaborator

### nikomatsakis commented Jan 31, 2019

I would be inclined to permit uninitialized data in integers. My reasoning is as follows:

First, I think that backwards compatibility around mem::uninitialized is a real concern. Allocating an uninitialized [u8] array is a very common pattern in practice, and I am reluctant to declare all of that pre-existing code UB (even if we deprecate uninitialized).

Second, I think the crossbeam example which @RalfJung raised earlier feels like the kind of tricky thing people shouldn't have to fret about. In particular, if you have uninitialized padding bits in structs that are known to be less than word size, I think you should be able to treat them as integers for convenience, and it seems like that would be insta-UB under this proposal. Basically, I think people are likely to do things (like crossbeam does) that wind up putting uninitialized bits in integers but which don't necessarily "feel" like cases where MaybeUninit should be required. I also think that people will commonly reason about uninitialized data informally and hence write code that is UB -- of course, a tool that detects such usage can help, but it's also nice if "common things" people write are not UB even when they neglect to run the tool.

That said, I find the most compelling argument against permitting uninitialized bits to be that it would allow us to declare such uses an error. But it seems like we could still have a sort of "lint", where we say "ah, you are using uninitialized data outside of a MaybeUninit struct -- while not technically UB, it is recommended to use MaybeUninit".

One meta-question: suppose that I do have a clever algorithm that makes use of uninitialized integers. For example, a trick I have used in a past life was an integer set with O(1) initialization cost regardless of its capacity. It worked by keeping two vectors of integers: one started out with size N but uninitialized data, the other started out at size 0 but held only initialized data. To add to the set, you checked one against the other (I can give details if desired). The key point is that you had to read and use an uninitialized integer in the process and compare it against initialized data -- is reading such an integer UB?
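The set described here is the classic two-array "sparse set" trick (often attributed to Briggs and Torczon). A hedged Rust sketch follows; `sparse` is zero-filled here to keep the sketch free of UB, whereas the trick as described leaves it uninitialized, which is exactly where the validity question bites:

```rust
/// Integer set with O(1) logical initialization. A slot of `sparse` is
/// trusted only if `dense[sparse[x]] == x`, so garbage values in `sparse`
/// are harmless -- they just fail the double check.
struct SparseSet {
    sparse: Vec<usize>, // conceptually uninitialized; indexed by element value
    dense: Vec<usize>,  // the elements actually inserted, in insertion order
}

impl SparseSet {
    fn with_capacity(n: usize) -> Self {
        // The real trick allocates this without writing to it at all.
        SparseSet { sparse: vec![0; n], dense: Vec::new() }
    }

    fn contains(&self, x: usize) -> bool {
        // Reads a possibly-garbage sparse[x]; the cross-check against
        // `dense` makes a stale or arbitrary value merely return false.
        match self.sparse.get(x) {
            Some(&i) => i < self.dense.len() && self.dense[i] == x,
            None => false,
        }
    }

    fn insert(&mut self, x: usize) {
        if !self.contains(x) {
            self.sparse[x] = self.dense.len();
            self.dense.push(x);
        }
    }
}

fn main() {
    let mut s = SparseSet::with_capacity(1000);
    s.insert(3);
    s.insert(999);
    assert!(s.contains(3) && s.contains(999));
    assert!(!s.contains(4));
}
```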
Member

### rkruppe commented Jan 31, 2019 • edited

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitialized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it), require operating on uninitialized bits. So if we want to allow them, we need to not only allow reading uninitialized memory but also make arithmetic and comparisons on it defined (and reasonably deterministic! undef results are worse than useless). That is not possible at all with current LLVM (except by initializing all memory, or pretending to do so in ways that will block lots of unrelated optimizations), and even if/when freeze and poison are adopted, it'll have some negative codegen impact, and I see no way to restrict that impact only to modules that really need it. So even though I agree that it would be best to support these things, I do not see a reasonable way to achieve that.

Member

### rkruppe commented Jan 31, 2019

> But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

Hm, if getting people to use such an intrinsic is acceptable (I think the biggest source of worry is people reasoning naively about uninitialized memory as "initialized to an arbitrary bit string" and not even checking), then we can already build such an intrinsic today: it just has to do something that all LLVM optimizations have to assume could initialize the memory (e.g., some inline asm). This will have some unfortunate impact on optimizations unrelated to uninitialized memory, but it will still be localized to uses of that intrinsic.
Member Author

### RalfJung commented Jan 31, 2019 • edited

> I think the biggest source of worry is people reasoning naively about uninitalized as "initialized to an arbitrary bit string" and not even checking

Fair, but I see no model that makes this work. This is actually also a reason why I'd prefer not to allow uninitialized data in integers -- that may be more surprising, but it is easier to explain and very concise: "no uninitialized data outside MaybeUninit". If we allow uninitialized bits but then almost all operations on them are UB, the rule becomes something more like "no uninitialized data outside MaybeUninit, except if it is an integer or a raw pointer and you are not actually looking at the data, just loading and storing it" -- and then we still have to explain that x * 0 is also UB even though it doesn't really have to "look at" x. At that point people will say "screw this, that's too complicated, I will just do whatever and write some tests". Basically, we are breaking expectations anyway, and maybe it is better to break them more but in simpler ways than to figure out how to break them as little as possible while breaking them in ways that are much more complicated to explain. That seems plausible to me. Not sure if it makes any sense.^^

### CAD97 commented Feb 1, 2019 • edited

 If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation? It seems like it should be possible to use an atomic (u8, u16) on a platform supporting 32 bit atomics. The only way I can see without requiring valid(T) \in initialized(u32) (and thus a very unsafe AtomicCell) is to provide some sort of mem::initialize_padding(&T). I'm personally on the side of making uninitialized integers invalid so long as we don't lose anything (but mem::uninitialized) for it.

### Amanieu commented Feb 1, 2019 • edited by RalfJung

The case of AtomicCell<(u8, u16)> does indeed depend on UB at the moment. C++ solved this by essentially making it the responsibility of std::atomic to magically ignore padding bytes (see this page, at the bottom). Note that this magic only works for structs: compare_exchange_strong is not guaranteed to ever converge with unions.

I've been toying with the idea of a clear_padding_bytes intrinsic that would support this use case, as well as things like writing a struct to a file without leaking information through the padding bytes:

```rust
/// Writes 0 to all padding bytes in `val` and returns a mutable slice of the bytes in `val`.
fn clear_padding_bytes<T>(val: &mut T) -> &mut [u8];
```
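A hedged sketch of what such an intrinsic could do for one concrete #[repr(C)] layout. The `Pair` type and the hard-coded padding offset are illustrative only; a real intrinsic would derive the padding map from the compiler's layout knowledge:

```rust
use std::{mem, slice};

#[repr(C)]
struct Pair {
    a: u8,  // byte 0
    // byte 1 is padding (u16 needs 2-byte alignment)
    b: u16, // bytes 2..4
}

fn clear_padding_bytes(val: &mut Pair) -> &mut [u8] {
    let bytes = unsafe {
        slice::from_raw_parts_mut(val as *mut Pair as *mut u8, mem::size_of::<Pair>())
    };
    // Zero the single padding byte so all 4 bytes are initialized.
    bytes[1] = 0;
    bytes
}

fn main() {
    let mut p = Pair { a: 7, b: 0x0102 };
    let bytes = clear_padding_bytes(&mut p);
    assert_eq!(bytes.len(), 4);
    assert_eq!(bytes[0], 7); // field `a`
    assert_eq!(bytes[1], 0); // cleared padding
}
```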
Member Author

### RalfJung commented Feb 1, 2019 • edited

> If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

No. Notice though that the status of this trick (even the load/store part) is dubious in LLVM as well -- poison infects the entire value when just one of the bytes loaded from memory is poisoned, and there are proposals to replace undef by poison.

> C++ solved this by essentially making it the responsibility of std::atomic to magically ignore padding bytes

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution. One possible semantics is to say that values get frozen before being compared; that at least removes the UB -- but it doesn't guarantee that a compare-exchange loop will ever terminate, as the values could get frozen to a different value each time. Or maybe they actually get frozen in memory, but that conflicts with tons of real optimizations.

> Writes 0 to all padding bytes in val and returns a mutable slice of the bytes in val.

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes. But a version of this which just picks arbitrary bit patterns for all uninitialized and padding bytes would be implementable; that's exactly what freeze does.

### Amanieu commented Feb 1, 2019

> I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes.

I don't see what the problem is? The compiler knows the layout of type T and therefore knows where all the padding holes are.

> Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation. This would cause padding bytes to be "ignored" by the compare_exchange hardware instruction, since they would always be 0.
Member Author

### RalfJung commented Feb 1, 2019 • edited

> The compiler knows the layout of type T and therefore knows where all the padding holes are.

Oh, you mean statically -- well, for enums that will require dynamic checks as well. And of course this is hopeless for unions.

> In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation.

That would also guarantee that the value you read has 0 in all padding bytes, which I am fairly sure they do not want to guarantee. And anyway, I see no way to implement this behavior even remotely efficiently. I think a more reasonable operational version amounts to saying that you freeze both values before comparing -- that at least avoids comparing uninitialized bytes, and it makes compiling to a simple CAS correct. But it allows spurious comparison failures and makes no guarantee that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop. Also, this assumes the atomic operation knows the correct type, whereas AFAIK LLVM atomic operations only work on integer types -- at which point you cannot know which bytes are padding.

### stjepang commented Feb 3, 2019 • edited

> But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

I believe spurious comparison failures can be fixed like so:

```rust
// Assume T can be transmuted into usize.
fn compare_and_swap(&self, current: T, new: T) -> T {
    // Freeze `current` and `new` and transmute them into usizes.
    let mut current: usize = freeze_and_transmute(current);
    let new: usize = freeze_and_transmute(new);

    loop {
        unsafe {
            // `previous` is already frozen because we only store frozen values into `inner`.
            let previous = self.inner.compare_and_swap(current, new, SeqCst);

            // If `previous` and `current` are byte-equal, then the CAS succeeded.
            if previous == current {
                return transmute(previous);
            }

            // If `previous` and `current` are semantically equal,
            // but differ in uninitialized bits...
            let previous_t: T = transmute(previous);
            let current_t: T = transmute(current);
            if previous_t == current_t {
                // ...then try again, but substitute `current` for `previous`.
                current = previous;
                continue;
            }

            // Otherwise, the CAS failed and we return `previous`.
            return transmute(previous);
        }
    }
}
```

Now it's still possible to have a spurious failure in the first iteration of the loop, but the second one will definitely succeed (unless the atomic was concurrently modified). In fact, this is exactly how CAS in AtomicCell already works, except we don't have a way to freeze values.
Member Author

### RalfJung commented Feb 4, 2019

> `previous` is already frozen because we only store frozen values into `inner`.

I think this is the key invariant here -- if you can make that work, then yes, I think there is a consistent semantics here. This requires making sure the contents are initialized, and also freezing before writing in store (and any other modifying operation). However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.
Collaborator

### gnzlbg commented Feb 4, 2019

We seem to be trying to answer two different questions here that are entangled. One is whether integers and such can be uninitialized or not. The other, more fundamental one is whether we want to support using uninitialized memory outside unions at all. I think we should answer that question first and use the answer to drive the rest of the design. We can't consider allowing uninitialized memory in integers without also considering that the validity of integers and raw pointers is going to be alike, so we would also be allowing uninitialized memory in raw pointers. Raw pointers can point to DSTs, so we also need to think about whether we want to allow uninitialized memory in pointers to DSTs (for the whole pointer, some part of it, etc.).

I agree with @nikomatsakis that a lot of code is using mem::uninitialized today and we should weigh the impact on it. I don't share their reluctance (yet), because I think deprecating mem::uninitialized doesn't break the world (the compiler and tools are going to be the same as the day before the deprecation). People's code will continue working "as is", and they will have time to upgrade. I don't know what the cost of the upgrade will be, but reasoning about mem::uninitialized is hard, and the rule "no uninitialized memory outside unions" turns something that is hard into something that's relatively easy. It also potentially simplifies our "rules" for all other types (e.g. raw pointers, integers, etc.), and there is value in that.

The issue that some algorithms require uninitialized memory has come up. The question that hasn't been answered yet, AFAICT, is whether MaybeUninit can be used to implement those algorithms or not. If it can't, then IMO that might hint at a design flaw in MaybeUninit (or maybe we need more helper types, e.g. AtomicMaybeUninit or similar), but it doesn't necessarily imply that banning uninitialized memory outside unions is a bad choice.

### stjepang commented Feb 4, 2019

> However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

Just to clarify for everyone else following this thread: @RalfJung and I discussed this, and the conclusion was that we'll simply remove get_mut() if that is necessary to make AtomicCell sound.
Member Author

### RalfJung commented Feb 5, 2019 • edited

About whether uninitialized data is okay in integers at all: we discussed this at the all-hands. The general consensus seems to be that we should permit uninitialized data in integers and raw pointers. There is just too much existing code doing stuff like

```rust
let x: [u8; 256] = mem::uninitialized(); // go on
```

or

```rust
let x: SomeFfiStruct = mem::uninitialized(); // go on
```

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri. Incidentally, this matches what Miri already currently implements, mostly for pragmatic reasons (libstd already violated the rules about uninitialized integers when I wrote the checks -- I think I have since moved it to MaybeUninit).
Collaborator

### gnzlbg commented Feb 5, 2019 • edited

> Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Was it also discussed whether this is worth the benefit of a simpler and more teachable correctness model for unsafe code? (Independently of whether this model can be better checked with Miri or not.)

> I think I have since moved it to MaybeUninit.

I think so too.
Member Author

### RalfJung commented Feb 5, 2019

 Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?
Collaborator

### gnzlbg commented Feb 5, 2019 • edited

 Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"? No, but I do think that "uninitialized memory is only allowed inside unions" is much much simpler and teachable than all other alternatives that are currently being discussed.

### CAD97 commented Feb 7, 2019 • edited

Another (drastic?) option to consider that allows let x: u32 = mem::uninitialized();: disallow uninitialized (poison) bits in integers (and pointers?), and make mem::uninitialized return freeze(poison).

This makes mem::uninitialized behave as most people initially intuit (a constant, nondeterministic bit string) and gains us the "(true) uninitialized memory only inside unions (or padding)" property. This may degrade some performance around uses of mem::uninitialized, but the intent is to move those over to mem::MaybeUninit anyway, right? And as previously mentioned, mem::uninitialized is a very dangerous tool to start with.

It could also muddy talk of "uninitialized memory", because it introduces both the "frozen uninitialized" behind mem::uninitialized and the "poison uninitialized" behind mem::MaybeUninit and padding, with both being called "uninitialized" in the language. This could be just documented on the deprecated, big-fat-warning-label mem::uninitialized to get most of the points back, though. Example:

> **mem::uninitialized**
>
> Deprecated, do not use. Use mem::MaybeUninit for its clearer semantics instead. For safety reasons, this function now returns "frozen" memory -- memory set to nondeterministic but consistent bits -- rather than truly uninitialized memory. Otherwise, merely storing the returned value in a variable would immediately break the validity invariants of any type except mem::MaybeUninit.
Member Author

### RalfJung commented Feb 12, 2019 • edited

@CAD97 Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::uninitialized and then call ptr::freeze. That would make all the existing code that creates uninitialized integer arrays or FFI structs valid even if we decide that integers must be properly initialized. (I guess we could say we require them to be "frozen".)

@Amanieu suggested we should allow uninitialized values in integers only for "backwards compatibility reasons". This would have a similar effect. It might, however, incur a performance cost on such existing code. Porting code to MaybeUninit enables it to use "unfrozen memory" and gain back the performance. However, I suspect people will still want to keep memory unfrozen when e.g. calling a known Read::read implementation. That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)
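A hedged sketch of the reimplementation described above. Note that ptr::freeze does not exist; the `freeze` helper here stands in for it, modeled by zero-filling, which is one legal outcome of freezing:

```rust
use std::mem::MaybeUninit;

// Stand-in for the proposed (non-existent) ptr::freeze: give every byte
// some fixed, arbitrary value. Zero is one such value.
unsafe fn freeze(ptr: *mut u8, len: usize) {
    std::ptr::write_bytes(ptr, 0, len);
}

/// Sketch of mem::uninitialized reimplemented via MaybeUninit + freeze.
/// Only sound for types where every frozen bit pattern is valid
/// (e.g. integers, integer arrays, simple FFI structs).
unsafe fn uninitialized_frozen<T>() -> T {
    let mut slot = MaybeUninit::<T>::uninit();
    freeze(slot.as_mut_ptr() as *mut u8, std::mem::size_of::<T>());
    slot.assume_init()
}

fn main() {
    let x: [u8; 256] = unsafe { uninitialized_frozen() };
    // With the zero-filling stand-in, the frozen array is all zeros.
    assert!(x.iter().all(|&b| b == 0));
}
```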
Collaborator

### gnzlbg commented Feb 12, 2019

> Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit and then call ptr::freeze.

We should definitely do this.

> That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

+1.

> There is just too much existing code doing stuff like let x: [u8; 256] = mem::uninitialized();

The freeze fix makes this "work" for Rust 2015 and Rust 2018, but unless I missed something, the plan is still to deprecate mem::uninitialized. If that's the case, we could prevent Rust 2021 crates from accessing it (even though the standard library still needs to ship it for inter-edition compatibility). So I find the argument that "integers and pointers shall support uninitialized bit patterns because there is too much code using mem::uninitialized()" weak. In Rust < 2021, uninitialized would be changed to freeze, and in Rust >= 2021 there would be no mem::uninitialized.

### CAD97 commented Feb 12, 2019

What about let x: [u8; mem::size_of::<T>()] = mem::transmute(t);? That's the other easy way to accidentally put padding bytes in an integer. Do we care about that use pattern? (What's the sound equivalent if we don't allow this? [MaybeUninit<u8>; mem::size_of::<T>()]? The sooner MaybeUninit stabilizes, the better.)
