Area proposal: Representation and validity invariants #5
nikomatsakis
added some commits
Aug 24, 2018
gnzlbg
reviewed
Aug 24, 2018
| a value of this type is considered to be initialized. The compiler expects | ||
| the validity invariant to hold **at all times** and is thus allowed to use | ||
| these invariants to (e.g.) affect the layout of data structures or do other | ||
| optimizations. |
gnzlbg
Aug 24, 2018
Collaborator
I find this paragraph too confusing. It states that the validity invariant must hold at all times, but that it only defines what must hold for initialized values, so I am left wondering what happens with uninitialized values. Do they exist? Is there a distinction between the storage of a value, and the value itself (e.g. uninitialized memory of type T contains no value of type T).
It might help to add one sentence stating something about uninitialized values / memory that clarifies things. But I don't know what that might look like.
nikomatsakis
Aug 24, 2018
Author
Collaborator
Perhaps, "at all times when the compiler considers a value to be initialized"? This is probably the most precise statement, though perhaps not the most intuitive. It is also the phrasing that @RalfJung used, if I recall.
I do want to improve the language, but at the same time, I am not sure how much detail I want to go into in this document. I guess it's good to invest some effort in defining our terms carefully, however.
One of the subtle bits -- and I'm not sure how best to phrase this -- is that one of the questions we want to discuss is how to think about loads from uninitialized memory. (e.g., accesses to union fields etc). Most of the time, this is UB, but there are definitely use cases for being able to load from uninitialized memory if you treat the result as an integer (or perhaps any other scalar type where all bit patterns are valid). I think the key point there is that the point where the compiler considers the memory initialized is exactly the point of load.
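A minimal sketch of the "storage first, value later" distinction discussed above. It assumes `std::mem::MaybeUninit`, which was stabilized after this thread and is used here only as an illustration; the function name `init_then_load` is hypothetical.

```rust
use std::mem::MaybeUninit;

/// Storage for a `bool` can exist before any valid `bool` value does.
/// The validity invariant only has to hold once the value is treated as initialized.
fn init_then_load() -> bool {
    // Reserving storage does not yet create a `bool` value.
    let mut slot: MaybeUninit<bool> = MaybeUninit::uninit();
    // After this write the storage holds a valid `bool` bit pattern.
    slot.write(true);
    // `assume_init` is the point where the compiler may rely on validity.
    // Calling it before the write above would be undefined behavior.
    unsafe { slot.assume_init() }
}
```

The sketch deliberately avoids loading uninitialized bytes at an integer type, since whether that should be allowed is exactly the open question raised in the comment.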
gnzlbg
Aug 24, 2018
Collaborator
This is probably the most precise statement,
That sounds fine.
I think one thing I've been missing from the discussion is the difference between typed memory, and whether objects of the type actually live in that memory. Validity could then be about the layout of the objects in memory. This layout depends on which values these objects can represent.
Whether we require all typed memory accessible from safe Rust to always contain objects of its type, and whether we allow unsafe Rust to access memory that contains no objects (e.g. uninitialized), would relate to safety, but validity wouldn't talk about that.
gereeter
Aug 24, 2018
One of the harder questions that I don't think has been sufficiently addressed is what form invalidity should take. Almost all of the discussion has been talking about UB, but there are lots of other notions of "incorrect": undef/varying value, poison, frozen poison/arbitrary value, some sort of sticky poison that on creation also marks some memory locations as permanently inaccessible, etc.
The wording of saying that the validity invariant should hold at all times or even at all times when the compiler considers a value to be initialized implies UB, but I generally think this is the wrong choice and we should avoid it if we can get away with it. In my (admittedly limited) experience, actual UB makes things harder to optimize, not easier, since it makes intuitively pure operations have side effects, which makes operations very hard to reorder.
gereeter
Aug 24, 2018
Oh, and it also occurs to me that there are multiple notions of UB, corresponding to the question of whether the println! can be optimized away in
println!("Before UB!");
invoke_ub();
That is, does UB invalidate the entire program since the compiler is allowed to assume it never happens, or does it just invalidate the program from the point it happens onward?
gereeter
Aug 24, 2018
That's certainly true of C or LLVM's undefined behaviour, but it isn't the only way incorrect behaviour could be specified. I suppose I shouldn't call it UB, though, since that is confusing (although I think this new notion actually fits the description of undefined behaviour better). Since we are in the process of specifying what Rust is, we get to choose whatever semantics we want (within the bounds of what we can reasonably compile to LLVM). Concretely, the two notions are:
- (C's UB) The compiler is allowed to assume that the incorrect action never happens.
- (I don't know what to call this) Upon the incorrect action happening, the program is allowed to perform any series of actions (i.e. syscalls, not time travel). If we model the side effects of a program as a sequence of system calls (and some other things like volatile memory accesses), then this just says that an undefined (in this sense) operation never returns and adds on an arbitrary and possibly infinitely long series of system calls, but the system calls that led to the operation can't be deleted.
In the blog post you linked, if C used this semantics (which it doesn't), the invalid piece of reasoning would be the rewrite of unwitting: the "UB" is not just saying that the compiler is allowed to assume something. Although the calls to value_or_fallback and wait_for_door_to_open would be removable, the call to ring_bell would not be, since it happened before the "UB".
gnzlbg
Aug 24, 2018
Collaborator
What would be the value of doing that and what would that cost?
Thinking of what would that cost, you showed this snippet:
println!("Before UB!");
invoke_ub();
Imagine invoke_ub would do a null ptr dereference of a pointer completely unrelated with anything happening on the print. If that pointer isn't null, the compiler could actually reorder the dereference before the print, so that the memory load, if it has a cache miss, can happen out-of-order while the print executes.
However, if the compiler cannot assume that UB does not happen, then suddenly it cannot do that optimization, and potentially it cannot really reorder any code, since that could alter when undefined behavior happens.
That's a pretty big cost of optimization potential.
Thinking of what does this buy us, I still don't see that it buys us much. If we don't reorder the load, undefined behavior still happens after the print. For all we know, that could clear the stdout buffer before it's flushed (if the print doesn't flush it). That is, even if we guarantee the print to execute correctly before UB, the UB afterwards could still prevent the print from showing.
As I understand, the point of the guidelines or a spec is to define the behavior of Rust programs, and to state which programs are Rust and which programs aren't Rust. Programs with undefined behavior aren't Rust (they are illegal Rust programs). So trying to define the behavior of programs with undefined behavior doesn't make much sense to me, since it appears equivalent to trying to define the behavior of all programs that aren't really Rust. Even if that was interesting, that wouldn't belong in the Rust spec - but in the spec for non-Rust programs.
Does that make sense?
gereeter
Aug 24, 2018
What would be the value of doing that and what would that cost?
Sorry if I wasn't clear: I didn't mean to imply that I think that we should make this specification the only notion of UB in all of Rust, nor even that we necessarily should use this at all. As you mention, there are very real costs, and in many cases (including dereferences of invalid pointers) I don't even see a way to correctly lower to LLVM, since LLVM uses C's stricter notion of UB.
I simply meant to say that
- For every single thing we say is invalid, we have an independent choice of what semantics to give it.
- All the discussions about unsafe code I've seen have just said UB since that is in some sense a default, but we never made an explicit choice.
- We should discuss tradeoffs and make explicit choices.
- One option of very many that we could pick is this "constrained UB" that I was describing.
Thinking of what does this buy us, I still don't see that it buys us much.
That said, obviously part of the reason I bring it up as an option is because I think it does have merit in some cases (though out of the many options, my general favourite is some form of poison). For example:
- Security. If secrets like passwords and keys are stored in some volatile memory and properly cleared after use, it would be useful to know that some bug in unrelated code long after the sensitive computation is complete can't expose the secrets. Or, for another example, if a web server does (completely bug-free) authentication with users and after that is complete gives access to an API that is insecure and buggy, it would be good to know that until someone finds and triggers the bug (which may be much more difficult if only authenticated users have access to it), the server will only authenticate valid users.
- Debugging. If using eprintln! or explicitly flushing streams, then it can be much easier to track down where the invalid code is, since the logs will be correct up to the point of the bug, and it may be possible to notice the anomalies. Note that this is relevant even with optimizations off, since Rust still does things like layout and uninhabitedness optimizations while debugging.
- Explainability. This is naturally a "softer" argument, but I think it holds. The "full UB" statement of "the compiler is allowed to produce literally anything if you do UB since it can assume it doesn't happen" is really quite simple, but articles like the one you linked show that the way people think about it is more complex. Instead of using a denotational "this program has no meaning" explanation, people gravitate towards an operational description where the undefined behaviour is an action that happens - it just goes back in time and does very strange things. From this operational viewpoint, the "constrained UB" makes more sense. It just means that when you execute it, something (anything) happens. Just not unreal things like time travel because those aren't things that can happen. System calls are.
Programs with undefined behavior aren't Rust (they are illegal Rust programs). So trying to define the behavior of programs with undefined behavior doesn't make much sense
I disagree. Anything written in Rust is a Rust program, regardless of how legal it is. Now, if a Rust program unconditionally executes undefined behaviour, then it is a completely and utterly meaningless Rust program, but it is still Rust. Throwing up our hands and saying "doing this destroys your program" is a completely reasonable response to certain things, but there is nothing fundamentally different about that choice from saying "doing this crashes your program". They are both specifications, just one very very loose and the other not.
In particular, I don't understand your point of view when it comes to programs that are conditionally UB. For example:
if user_input() == 5 {
invoke_ub();
}
This is certainly a broken program, but that doesn't mean it isn't Rust, and it doesn't mean we don't give it semantics. In particular, if no user ever inputs 5, this seems perfectly legal. We should guarantee that the program works correctly and does something predictable as long as no one ever enters 5. (At least I was pretty sure this was the consensus. Machine checkable UB is pretty weak if even never-reached UB can totally break your program.)
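A compilable variant of the conditional-UB snippet above, as a sketch only. It assumes `std::hint::unreachable_unchecked` as a stand-in for the thread's `invoke_ub()`, and the function name `respond` and its parameter are hypothetical.

```rust
use std::hint::unreachable_unchecked;

/// Mirrors the `user_input() == 5` example: only the input 5 reaches UB.
fn respond(input: u32) -> u32 {
    if input == 5 {
        // Reaching this call is undefined behavior; it plays the role of `invoke_ub()`.
        unsafe { unreachable_unchecked() }
    }
    // Executions that never take the branch above stay well-defined.
    input + 1
}
```

Calling `respond` with any value other than 5 has a defined result, which is the guarantee the comment argues for. Calling it with 5 is the undefined execution and is not exercised here.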
rkruppe
Aug 24, 2018
Member
For every single thing we say is invalid, we have an independent choice of what semantics to give it.
This is true, but UB is in many ways a "safe" choice because it's the strongest prohibition and can be weakened later. So if we declare everything either "UB" or "totally allowed" in the UC we can revisit it later and tweak that decision, and until we do that we can be sure we haven't painted ourselves into a corner that's not implementable or has negative repercussions for optimizations we want to enable. Also consider that weaker prohibitions are
- rarely applicable: as you say, many kinds of UB are hard to implement in LLVM -- or any other optimizer, for that matter
- niche compared to the immense value of finally having settled what's legal or not in unsafe Rust
- rather difficult and subtle to consciously make use of (e.g. the security example you give is more defense-in-depth than something programmers would reason about and rely on)
In particular, I don't understand your point of view when it comes to programs that are conditionally UB. For example:
We can usefully talk about the many possible well-defined executions of this program, but there are possible executions which are undefined and so, without wading into the philosophical discussions, it at least isn't a correct program.
arielb1
Aug 25, 2018
Oh, and it also occurs to me that there are multiple notions of UB, corresponding to the question of whether the println! can be optimized away
Optimizing away the println! in that situation requires knowing that it will return.
For example, if your program is run with a seccomp filter on stdout that makes write issue a SIGKILL, then I'm quite sure the program is very well-defined - it will kill itself.
gnzlbg
reviewed
Aug 24, 2018
| - `Option<extern "C" fn()>` | ||
| - `usize` | ||
| - Platform dependent size, but guaranteed to be able to store a pointer? | ||
| - Also an array length? |
gnzlbg
Aug 24, 2018
Collaborator
C++ says that usize is an unsigned integer type that can store the maximum size (as returned by mem::size_of::<T>, size_of_val, etc.) of a theoretically possible object of any type (including arrays). A type whose size cannot be represented by usize is ill-formed. On many platforms (an exception is systems with segmented addressing) usize can safely store the value of any non-member pointer. On those platforms, usize is a type capable of holding a pointer.
gnzlbg
Aug 24, 2018
Collaborator
Also, it might be worth it to talk about isize here. C++ says: If an array is so large (greater than isize::max_value() elements, but less than usize::max_value() bytes), that the difference between two pointers may not be representable as isize, the result of subtracting two such pointers is undefined. On many platforms isize can safely store the value of any non-member pointer. On those platforms, isize is a type capable of holding a pointer.
I don't know what difference pointer to member functions make in C++ though.
nikomatsakis
Aug 24, 2018
Author
Collaborator
This reminds me that Rust currently has a rule that the maximum size of any value is representable with isize, I believe. I'm not sure how "baked in" that rule is by now -- probably "fairly" -- but it'd be good to document (and, if we don't like it, think seriously about what it takes to change it). I'll add a note to that effect.
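The size rules mentioned in this thread can be probed with a small sketch. It assumes a recent toolchain in which `std::alloc::Layout` rejects sizes above `isize::MAX`, and the helper names are hypothetical.

```rust
use std::alloc::Layout;
use std::mem::size_of;

/// True when `usize` and a thin raw pointer have the same size on this target.
fn usize_is_pointer_sized() -> bool {
    size_of::<usize>() == size_of::<*const u8>()
}

/// True when a byte array of `len` elements has a representable layout.
fn byte_array_layout_fits(len: usize) -> bool {
    Layout::array::<u8>(len).is_ok()
}
```

On such a toolchain, `byte_array_layout_fits(isize::MAX as usize + 1)` is expected to return `false`, which is one observable trace of the "maximum size fits in isize" rule. This is an observation about current behavior, not a statement of what the guidelines will guarantee.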
gnzlbg
Aug 24, 2018
Collaborator
That sounds reasonable and should simplify the definitions of both usize and isize.
gnzlbg
Aug 27, 2018
Collaborator
@nikomatsakis I think it is also worth discussing whether we want to make 0_usize/0_isize as *_ T be equal to ptr::null/null_mut or not.
AFAIK in C and C++ at least NULL isn't necessarily equal to the address at 0, but rather a platform-specific address (https://stackoverflow.com/a/2597232/1422197 contains some examples of old architectures where this wasn't the case). Currently, we do not promise this in Rust, but we do many optimizations like Option<&T> which are all hardcoded to 0 being a special bitpattern. The question is, should people only be able to use .is_null and ptr::null() to construct and test for null pointers, or should they also be able to cast a specific integer into a pointer that creates a null pointer, and should this integer always be 0, or be isize/usize::consts::NULL or something like that?
This might affect some of the wording we use (e.g. non-zero optimization, ...) everywhere. Although we already pretty much consistently use the term null everywhere, e.g. references can't be null, non-null optimization, etc.
EDIT: The C standard says:
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant.
so I was wrong. In C NULL always has to have the "integer" value of zero, independently of what the hardware uses.
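A sketch of how the "0 is the null bit pattern" assumption shows up in today's Rust. It reflects behavior on mainstream targets rather than a settled guarantee from this discussion, and the helper names are hypothetical.

```rust
use std::mem::{size_of, transmute};
use std::ptr;

/// True when `Option<&u8>` needs no extra space beyond a pointer.
fn option_ref_is_pointer_sized() -> bool {
    size_of::<Option<&u8>>() == size_of::<*const u8>()
}

/// True when `None::<&u8>` is stored as the null pointer.
fn none_ref_is_null() -> bool {
    let none: Option<&u8> = None;
    // Both types have the same size, and any `Option<&u8>` bit pattern
    // is acceptable as a raw pointer, so this transmute is sound.
    let raw: *const u8 = unsafe { transmute(none) };
    raw.is_null()
}

/// True when the integer 0 and `ptr::null()` convert into each other.
fn zero_cast_is_null() -> bool {
    (0usize as *const u8).is_null() && ptr::null::<u8>() as usize == 0
}
```

Whether the last property should be promised on every target is the question the comment raises.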
rkruppe
Aug 28, 2018
Member
@gnzlbg AFAIK while the constant expression 0 turns into the NULL pointer when cast to a pointer, a runtime integer value equal to 0 does not necessarily have to turn into the NULL pointer when cast at run time (int<->ptr casts are implementation-defined anyway IIRC).
gnzlbg
Aug 28, 2018
Collaborator
@rkruppe indeed, zero being null holds only for constant expressions.
RalfJung
Aug 30, 2018
Member
That's C++, though, right? For Rust we can just say that a pointer is NULL iff its bits are all 0...
rkruppe
Aug 30, 2018
Member
Yes, we can say that, and we would exclude some (possibly hypothetical) targets/implementations by doing that. That's what needs to be discussed.
nikomatsakis
added some commits
Aug 24, 2018
rkruppe
reviewed
Aug 24, 2018
| (and/or treated by the ABI)? | ||
| - e.g., what about different structs with same definition | ||
| - across executions of the same program? | ||
| - Tuples |
rkruppe
Aug 24, 2018
Member
A somewhat common request is layout compatibility between homogeneous tuples (i.e., (T, T, ..., T)) and the equivalent array.
gnzlbg
Aug 25, 2018
Collaborator
And vector types. A couple of them are already stable (e.g., __m128 and friends).
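To illustrate the tuple/array layout request, here is a sketch that compares a `#[repr(C)]` struct with the equivalent array. The type `Triple` and the helper functions are hypothetical, and the plain tuple is only measured, since no guarantee about its layout is assumed.

```rust
use std::mem::{align_of, size_of};

/// A `#[repr(C)]` stand-in for the homogeneous tuple `(u32, u32, u32)`.
#[repr(C)]
struct Triple(u32, u32, u32);

/// True when `Triple` and `[u32; 3]` agree on size and alignment.
fn triple_matches_array() -> bool {
    size_of::<Triple>() == size_of::<[u32; 3]>()
        && align_of::<Triple>() == align_of::<[u32; 3]>()
}

/// Safe field-by-field conversion that does not rely on any layout guarantee.
fn triple_as_array(t: Triple) -> [u32; 3] {
    [t.0, t.1, t.2]
}

/// Size of the plain tuple, reported without assuming it matches the array.
fn tuple_size() -> usize {
    size_of::<(u32, u32, u32)>()
}
```

With `#[repr(C)]` the fields are laid out in declaration order, so the comparison holds. For the unannotated tuple, equality with the array is what the request asks the guidelines to decide.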
rkruppe
reviewed
Aug 24, 2018
| To start, we will create threads for each major categories of types | ||
| (with a few suggested focus points): | ||
|
|
||
| - Integers and floating points |
rkruppe
Aug 24, 2018
Member
This has been thoroughly investigated and found to be a non-issue (see e.g. rust-lang/rust#40470 (comment)), but the spectre of signaling NaNs still haunts many developers' brains so it might be good to mention them.
rkruppe
reviewed
Aug 24, 2018
| goal of the `#[repr(transparent)]` annotation introduced in [RFC | ||
| 1758]. For built-in types, such as `&T` and so forth, it is important | ||
| for us to specify how they are treated at the point of a function | ||
| call. |
rkruppe
Aug 24, 2018
Member
We might also want to specify it for some user-defined aggregates. For example, do we want to guarantee (some subset of) newtype unpacking and relegate repr(transparent) to being the way to guarantee to other crates that a type with private fields is and will remain a newtype?
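For reference, a minimal sketch of what `#[repr(transparent)]` already promises for a newtype. The type `Meters` and the helpers are hypothetical names chosen for illustration.

```rust
use std::mem::{align_of, size_of};

/// A newtype that explicitly promises the same layout and ABI as its single field.
#[repr(transparent)]
struct Meters(f64);

/// True when the wrapper has the size and alignment of the wrapped `f64`.
fn meters_matches_f64() -> bool {
    size_of::<Meters>() == size_of::<f64>() && align_of::<Meters>() == align_of::<f64>()
}

/// Unwraps the newtype without relying on any layout assumption.
fn unwrap_meters(m: Meters) -> f64 {
    m.0
}
```

The open question in the comment is whether a newtype without the attribute should get some subset of the same treatment.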
matthewjasper
reviewed
Aug 27, 2018
| - Unions | ||
| - Can we ever say anything about the initialized contents of a union? | ||
| - Is `#[repr(C)]` meaningful on a union? | ||
| - Fn pointers (`fn()`, `extern "C" fn()`) |
matthewjasper
Aug 27, 2018
When is transmuting from one fn type to another allowed, such as in core::fmt? Perhaps this isn't relevant if any non-null, suitably aligned function pointer is valid.
| - Are these effectively anonymous structs? | ||
| - Unions | ||
| - Can we ever say anything about the initialized contents of a union? | ||
| - Is `#[repr(C)]` meaningful on a union? |
matthewjasper
Aug 27, 2018
It should guarantee that all fields have the same address. This might be the case without #[repr(C)].
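A small sketch of the "all fields have the same address" property for a `#[repr(C)]` union. The union `IntOrFloat` and the helper are hypothetical names.

```rust
/// Both fields start at offset 0 under `#[repr(C)]`.
#[repr(C)]
union IntOrFloat {
    int: u32,
    float: f32,
}

/// Writes the `float` field and reads the same four bytes back through `int`.
fn float_bits_via_union(value: f32) -> u32 {
    let u = IntOrFloat { float: value };
    // Reading a union field is unsafe. It is fine here because the storage
    // is fully initialized and every bit pattern is a valid `u32`.
    unsafe { u.int }
}
```

Because the two fields overlap exactly, the result matches `f32::to_bits`. Whether the same holds without `#[repr(C)]` is the point left open in the comment.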
| `#[repr]` annotations, and they have the same field types, can we | ||
| say that they will have the same layout? | ||
| - or do we have the freedom to rearrange the types of `A` but not | ||
| `B`, e.g. based on PGO results |
matthewjasper
Aug 27, 2018
And if we do have that freedom, do we also have it for Vec<T> and Vec<U> where T and U have the same layout?
|
|
||
| TODO: | ||
| This discussion is meant to focus on two things: |
Meta comment: I'd like to focus more on the set of things to discuss -- and whether I've broken them up in a sensible way -- versus the discussion itself. Sometimes it's hard to draw the line though, I suppose. =) (To be clear, I've not caught up on all the latest comments, but the number suggests to me that we may sometimes be getting into details here.)
So, I think the discussion about UB raised by @gereeter is quite interesting, but perhaps orthogonal? That is, it seems like we can focus on what sorts of conditions we would like to hold without fully specifying what will happen if they don't? I think this is another variant of @rkruppe's point that UB represents a kind of "maximal stance" which gives us room to make changes in the future. But put another way: I think it would make sense to hold a focused discussion at some point on when/where we can contain the effects of UB. Consider e.g. the recent paper on "Bounding data races in space and time" -- I could imagine attempting to give some kind of guarantees of this kind, that let you reason about code in isolation, without necessarily knowing or caring about what other code has run in the past or may run in the future.
nikomatsakis
referenced this pull request
Aug 28, 2018
Open
can we in some cases have more limited forms of "undefined behavior"? #6
I opened up #6 as a forum for discussing whether there are weaker forms of UB (and it could eventually be promoted to an Area of Discussion in its own right).
nikomatsakis
added some commits
Aug 28, 2018
avadacatavra
added the topic-repr label
Aug 30, 2018
Meta question: You have a two-element bullet point list as the set of topics. Are these close enough to be one area, or distinct enough to be two? In principle these are separate: First the compiler computes a As far as I am concerned, I have very little to say for bullet 1 (data structure layout guarantees), but a lot of thoughts on 2 (once a layout is fixed, what are the validity requirements). So I feel like these are fairly separate. But that may just be my particular point of view.
RalfJung
reviewed
Aug 30, 2018
| ### ABI compatibilty | ||
|
|
||
| When one either calls a foreign function or is called by one, extra | ||
| care is needed to ensure that all the ABI details line up. ABI compatibility |
RalfJung
Aug 30, 2018
Member
Clarification question: "ABI" is always about function calls? The term appears in TyLayout, which is used for laying out types in general, so I am a bit confused about that. And "Application Binary Interface" seems to be much larger in scope than just function calls. Should this be "function call ABI" in the text, or am I just missing some context?
rkruppe
Aug 30, 2018
Member
ABI includes data structure layout and other things beyond function calling conventions, yeah. One needs to take care of all these aspects in FFI, but it seems clear that this section is about calling conventions specifically.
nikomatsakis
Aug 30, 2018
Author
Collaborator
I specifically meant the details of function calling here, but I guess I would presume that "ABI" in general refers to how structures in the language are mapped to the underlying architecture.
(Replying here because that thread went into a different direction.) @gnzlbg wrote
As far as I am concerned, memory is entirely untyped. Typed memory is extremely hard to define in an unsafe language, and -- I think -- provides no tangible benefits. (Note that I still think that integers and pointers are distinct classes of values, so we know which bytes are integers and which are pointers. But this has nothing to do with types. Thanks to ptr-int-casts, you can have pointer values at integer type, and thanks to int-ptr-casts you can have integer values at pointer type. This works fine and it is what the miri engine implements. But that is very different from remembering if something was an However, I think this is part of the discussion on validity invariants, not part of the discussion on what to discuss. :)
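The ptr-int and int-ptr casts mentioned above look like this in practice. This is a sketch only, the function name is hypothetical, and it says nothing about how miri models the values internally.

```rust
/// Sends a pointer through an integer and back, then loads through it.
fn round_trip_through_integer(value: &i32) -> i32 {
    let ptr: *const i32 = value;
    // ptr-int cast: the address is now held at integer type.
    let addr = ptr as usize;
    // int-ptr cast: the same address is held at pointer type again.
    let back = addr as *const i32;
    // The referent is still live and the address is unchanged, so the load is valid.
    unsafe { *back }
}
```

The load goes through a pointer that was an integer for part of its life, which is the situation the comment describes.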
In the meeting today we decided that it would make sense to split out the discussion of invariants here. In general, we should be talking about and documenting the layout that the compiler currently uses (and to what extent people can rely on that not changing); this then informs the invariant discussion, since any layout optimizations we do have to be justified by the invariants we design (e.g.,
I'm going to try to factor out "invariant" things, therefore, into an issue for future discussion.
Filed #8
I will merge this PR then and try to open up various issues. They will be tagged with topic-repr.
nikomatsakis commented Aug 24, 2018
This is a proposal for our first area of discussion. This discussion is meant to focus on two things:
NB. The discussion is not meant to discuss the "safety invariant"
from Ralf's blog post, as that can be handled later.