Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What are the values of a union type? (in particular, what is the validity invariant of a union) #438

Open
RalfJung opened this issue Aug 2, 2023 · 9 comments
Labels
A-unions Topic: Related to unions A-validity Topic: Related to validity invariants S-pending-design Status: Resolving this issue requires addressing some open design questions

Comments

@RalfJung
Copy link
Member

RalfJung commented Aug 2, 2023

When loading data at union type from one place and storing it in another -- what exactly is being required about the bytes, and what is preserved? In other words, what is the representation relation of a union type, and what is even the set of values that can be generated by a union-typed load?

Originally I intended the answer to be "the values are just lists of raw bytes as large as the union". IOW, a union has no validity requirements whatsoever, and all data inside it is perfectly preserved. However, reality disagrees. With the C ABI, some of the bytes in a union value do not always get preserved -- there can be padding. Also there is some desire to make unions have validity invariants, so that e.g. a union { usize, *const T } is noundef.

On the other hand, it is certainly not the case that the value of a union is "a value of one of its fields". Apart from the problem about writing decode for such a specification, it can be violated in safe code.

Note that for the purpose of this, we do assume that the LLVM type iN will perfectly preserve N bytes of memory. LLVM code generated from Rust can never put poison in memory (as that would not be preserved on a per-byte level, the entire iN gets "infected" if any byte is poison), which is currently accurate (Rust code has UB for any situation where the generated LLVM code might produce poison). LLVM iN values carrying provenance is incoherent but mostly works in practice and LLVM does not offer us a better option. We need a "byte" type to properly express our intent to LLVM; until that exist, we make do with what we have. If you want to discuss LLVM issues, please open a new thread -- this one here is solely about the Rust AM semantics.

Possible designs

MiniRust decides that a value of union type is a list of "chunks" where the bytes are perfectly preserved, but there can be gaps between those chunks where data is lost of a copy. The extent of these chunks can be computed from a Rust union type layout. That is sufficient to model the reality of union padding. The effective validity invariant here is still "any list of bytes allowed".

The noundef property could be achieved if we refine this "chunk" idea a bit further: every byte in a union is either "discarded", "preserved", or "initialized". Bytes which are padding in all variants get "discarded"; bytes which are padding only in some variants get "preserved"; bytes which are "in the data part" for all variants get "initialized". A value of union type is a list of bytes the size of the union with the constraint that "discarded" bytes must be Uninit and "initialized" bytes must not be Uninit. The validity invariant requires that all "initialized" bytes must be initialized -- and that's it; "discarded" bytes get flushed to Uninit on a decode. Then the Rust type union { usize, *const T } would translate to a union type where all bytes are "initialized", and we could emit noundef attributes for such a type.
(I don't want multiple distinct AM values to be equivalent, hence the 'value of union type' predicate demands that all "uninit" bytes are truly Uninit, but of course the validity invariant allows those bytes to be anything. Note that the validity invariant is a predicate over "list of bytes", whereas "what are the values of this type" is a predicate over Value. The chunks idea achieves this by just not having the padding bytes in the Value; we could do the same here but that would be formally more awkward to express I think. Maybe I should just link to a MiniRust patch here, that might be more clear than trying to say this in English.^^)

We could in principle go further and tro to achieve something along the lines of "if a byte is non-null in each variant, it is also non-null in the union", but that becomes increasingly complicated -- in particular it requires the "generate union type description from Rust type" logic to encode more and more knowledge about validity invariants, when the entire intent of this "value representation" business was that the validity invariant is implicitly encoded by the representation relation.

What other designs should we consider? And are they worth the complexity?

@RalfJung RalfJung added A-unions Topic: Related to unions A-validity Topic: Related to validity invariants S-pending-design Status: Resolving this issue requires addressing some open design questions labels Aug 2, 2023
This was referenced Aug 2, 2023
@scottmcm
Copy link
Member

scottmcm commented Aug 2, 2023

As I was trying to figure out how to evaluate the possibilities, I started thinking that unions might be serving too many masters. I wonder if they want to be multiple different things? It might be that the "sugar for transmuting between values" use and the "enum that's always unsafe because there's no stored discriminant" intents are just so different that they should be different things...

@RalfJung
Copy link
Member Author

RalfJung commented Aug 2, 2023

So... basically, let's imagine what could be done if we didn't have to support code like this?

I first thought "not much". While validity might be defined as "any of the variants is valid", it becomes harder to define what happens on a typed copy. Consider a union of (u8, u16) and (u16, u8) (or rather, their repr(C) equivalents): if we do a copy at any variant type, at least one byte will be lost to padding. But if the original value is [0, 0, 0, 0], then we must copy and preserve all bytes, since we can't know which is the "active variant".

But there's actually a way to achieve that: we could say that a value of union type is a non-empty set of values of possible variants. So in decode, we decode the data at all variant types, and we put all the values that were successfully decoded into the union value. If none of the variant decodes successfully, we trigger UB. In encode, we encode all of them, and then we take the bytewise "maximum" of these N encodings along the "definedess" order defined in this section to compute the resulting output. Since all these values were created from the same input bytes, such a maximum must exist. (So really a value of union type is a non-empty set of values that are all mutually coherent -- in the sense that when they are encoded, the byte lists are related by "definedness".)

Another option would be to angelically pick which variant to do the copy at, but I'd rather avoid angelic non-determinism if possible.

That definition should give us maximal niche power. However, we also probably don't want it for MaybeUninit: under that definition, MaybeUninit<u8> would not preserve provenance! It would strictly behave as "either () or u8". Was it a mistake to say that MaybeUninit<u8> preserves provenance? Or do we truly need both kinds of unions (even disregarding backwards compatibility)?

Also, implementing this in Miri will be quite terrible...

@Jules-Bertholet
Copy link

Could we have different reprs to make the distinction? For example, #[repr(Rust)] unions are "every byte, considered independently, must either be valid at that position for one of the union's fields, or be undef", but #[repr(init)] doesn't allow undef, and #[repr(strict)] must always be fully valid for at least one of the fields.

@Lokathor
Copy link
Contributor

Lokathor commented Aug 2, 2023

I think one goal is for union { f32, u32 } to be "always valid for either field", even without a declared repr. Which doesn't mean we can't support other reprs, just that the default Rust repr should be the one which leads to safe code being able to use unions (when the fields line up appropriately).

@saethlin
Copy link
Member

saethlin commented Aug 2, 2023

What optimizations can we get from noundef and from passing through niches?

For noundef I think we already apply the attribute to scalar loads where the scalar type does not permit being uninit, so I'm not really sure what additional optimizations open up here. Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

For niches, doing layout optimizations on unions does sound like it would have some plausible value in general, but I expect a significant fraction of union use either doesn't have niches or the union is implementing tag packing itself in which case the user doesn't mind if the union never exposes a niche. Is there more value to exposing niches than permitting enum layout optimizations?

@RalfJung
Copy link
Member Author

RalfJung commented Aug 2, 2023

I think one goal is for union { f32, u32 } to be "always valid for either field", even without a declared repr.

Is it? I don't think so. It's not my goal, anyway. ;)

the default Rust repr should be the one which leads to safe code being able to use unions (when the fields line up appropriately).

When and how do you want safe code to be able to read union fields?

For noundef I think we already apply the attribute to scalar loads where the scalar type does not permit being uninit, so I'm not really sure what additional optimizations open up here. Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

Hm, good question. That would depend on how smart LLVM is about these attributes, but of course we can always look to make it smarter. The problem @scottmcm ran into that made him ask for union nonnull is blocked on rust-lang/rust#114383, which doesn't need any changes to the representation relation.

@Lokathor
Copy link
Contributor

Lokathor commented Aug 2, 2023

XD I've mentioned this to you before Ralf. I thought that's what you were already talking about when you said "there is some desire to make unions have validity invariants".

Many people want a future rust to support safe union field access when the fields of the union satisfy some sort of specific case that the compiler can automatically check by itself. Whatever is decided for unions now should hopefully allow for that to happen some day.

Being intentionally vague, and specifically expecting the reader to interpret the following words in the most positive light, the assumption is usually something like: "if the safe-transmute project completes, and safe-transmute could convert all field types to each other, then that union should obviously be fine to use from safe code."; So something like union {f32, u32} or union {u64, [u32; 2]} or similar things like that.

@scottmcm
Copy link
Member

scottmcm commented Aug 2, 2023

Are there optimizations permitted by noundef unions that are not also justified by validity on typed copy?

That would depend on how smart LLVM is about these attributes, but of course we can always look to make it smarter.

I tried to explore this a bit, but I don't know how successful I was. https://llvm.godbolt.org/z/4K9q9d3Yq

If I put a typed load in a function,

define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 %0) {
start:
  %x = alloca i32, align 4
  store i32 %0, ptr %x, align 4
  %_2 = load i32, ptr %x, align 4, !noundef !1
  ret i32 %_2
}

That just gets completely optimized out today, as far as I can tell

define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 returned %0) {
  ret i32 %0
}

Though Alive says it would be legal for it to mark the return type as noundef: https://alive2.llvm.org/ce/z/s5ttA7

So I don't know if the typed load would be sufficient in practice to get noundef-related optimizations. Probably depends on evil phase ordering questions :/

@RalfJung
Copy link
Member Author

RalfJung commented Aug 2, 2023

@Lokathor IIRC my stance in that discussion was that that isn't even a validity invariant question, it is primarily a safety invariant question. Or at least, it is certainly the case that safety invariants are sufficient to achieve what you are asking for, since it's all about "how can we soundly expose operation X to safe code".

So to make this a validity invariants requires stronger motivation.

I thought that's what you were already talking about when you said "there is some desire to make unions have validity invariants".

I was mostly thinking of @scottmcm, and of people asking for niches in unions.


@scottmcm that's not quite how the code would look, would it? If this is a getter fn get_int(Union) -> usize then rustc will already add noundef to the return value. I would expect the !noundef on the load to be sufficient for optimizations that come later and that work on %_2?

Was there a specific optimization you were looking for in the iterator code that went missing, or was it more a general concern without concrete examples?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-unions Topic: Related to unions A-validity Topic: Related to validity invariants S-pending-design Status: Resolving this issue requires addressing some open design questions
Projects
None yet
Development

No branches or pull requests

5 participants