Untagged unions (tracking issue for RFC 1444) #32836

Open
nikomatsakis opened this Issue Apr 8, 2016 · 205 comments

@nikomatsakis
Contributor

nikomatsakis commented Apr 8, 2016

Tracking issue for rust-lang/rfcs#1444.

Unresolved questions:

  • Does assigning directly to a union field trigger a drop of the previous contents?
  • When moving out of one field of a union, are the others considered invalidated? (1, 2, 3, 4)
  • Under what conditions can you implement Copy for a union? For example, what if some variants are of non-Copy type? All variants?
  • What interaction is there between unions and enum layout optimizations? (#36394)

Open issues of high import:

  • #47412 -- MIR-based unsafety checker sometimes accepts unsafe accesses to union fields in presence of uninhabited fields
@sfackler

Member

sfackler commented Apr 8, 2016

I may have missed it in the discussion on the RFC, but am I correct in thinking that destructors of union variants are never run? Would the destructor for the Box::new(1) run in this example?

union Foo {
    f: i32,
    g: Box<i32>,
}

let mut f = Foo { g: Box::new(1) };
f.g = Box::new(2);
@solson

Member

solson commented Apr 8, 2016

@sfackler My current understanding is that f.g = Box::new(2) will run the destructor but f = Foo { g: Box::new(2) } would not. That is, assigning to a Box<i32> lvalue will cause a drop like always, but assigning to a Foo lvalue will not.

@sfackler

Member

sfackler commented Apr 8, 2016

So an assignment to a variant is like an assertion that the field was previously "valid"?

@solson

Member

solson commented Apr 8, 2016

@sfackler For Drop types, yeah, that's my understanding. If they weren't previously valid you need to use the Foo constructor form or ptr::write. From a quick grep, it doesn't seem like the RFC is explicit about this detail, though. I see it as an instantiation of the general rule that writing to a Drop lvalue causes a destructor call.

@ohAitch

ohAitch commented Apr 8, 2016

Should a &mut union with Drop variants be a lint?

On Friday, 8 April 2016, Scott Olson notifications@github.com wrote:

@sfackler For Drop types, yeah, that's my understanding. If they weren't previously valid you need to use the Foo constructor form or ptr::write. From a quick grep, it doesn't seem like the RFC is explicit about this detail, though.



@joshtriplett

Member

joshtriplett commented Apr 8, 2016

On April 8, 2016 3:36:22 PM PDT, Scott Olson notifications@github.com wrote:

@sfackler For Drop types, yeah, that's my understanding. If they
weren't previously valid you need to use the Foo constructor form or
ptr::write. From a quick grep, it doesn't seem like the RFC is
explicit about this detail, though.

I should have covered that case explicitly. I think both behaviors are defensible, but I think it'd be far less surprising to never implicitly drop a field. The RFC already recommends a lint for union fields with types that implement Drop. I don't think assigning to a field implies that field was previously valid.

@sfackler

Member

sfackler commented Apr 8, 2016

Yeah, that approach seems a bit less dangerous to me as well.

@solson

Member

solson commented Apr 8, 2016

Not dropping when assigning to a union field would make f.g = Box::new(2) act differently from let p = &mut f.g; *p = Box::new(2), because you can't make the latter case not drop. I think my approach is less surprising.

It's not a new problem, either; unsafe programmers already have to deal with other situations where foo = bar is UB if foo is uninitialized and Drop.
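The rule being debated can be observed on an ordinary struct, without any union involved. The following is only a minimal sketch: `Noisy` and `count_drops` are made-up names, and the drop counter exists purely to make destructor calls visible. It compares a plain assignment, an assignment through a `&mut` reference, and `ptr::write`.

```rust
use std::cell::Cell;
use std::ptr;
use std::rc::Rc;

/// Hypothetical helper type: bumps a shared counter whenever it is dropped.
struct Noisy {
    drops: Rc<Cell<u32>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Returns the drop count observed after each of three kinds of write.
fn count_drops() -> (u32, u32, u32) {
    let drops = Rc::new(Cell::new(0));
    let new = || Noisy { drops: Rc::clone(&drops) };

    let mut slot = new();

    // Plain assignment to an initialized place drops the old value first.
    slot = new();
    let after_assign = drops.get();

    // Assignment through a reference runs the destructor as well.
    let p = &mut slot;
    *p = new();
    let after_deref = drops.get();

    // ptr::write overwrites without dropping; the old value is leaked.
    unsafe { ptr::write(&mut slot, new()) };
    let after_write = drops.get();

    (after_assign, after_deref, after_write)
}

fn main() {
    println!("{:?}", count_drops());
}
```

The counter goes up for the first two writes and stays put for `ptr::write`, which is the asymmetry a "never drop on union field assignment" rule would have to coexist with.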

@joshtriplett

Member

joshtriplett commented Apr 8, 2016

I personally don't plan to use Drop types with unions at all. So I'll defer entirely to people who have worked with analogous unsafe code on the semantics of doing so.

@retep998

Member

retep998 commented Apr 9, 2016

I also don't intend to use Drop types in unions so either way doesn't matter to me as long as it is consistent.

@ohAitch

ohAitch commented Apr 9, 2016

I don't intend to use mutable references to unions, and probably
just "weirdly-tagged" ones with Into

On Friday, 8 April 2016, Peter Atashian notifications@github.com wrote:

I also don't intend to use Drop types in unions so either way doesn't
matter to me as long as it is consistent.



@nikomatsakis

Contributor Author

nikomatsakis commented Apr 12, 2016

Seems like this is a good issue to raise up as an unresolved question. I'm not sure yet which approach I prefer.

@joshtriplett

Member

joshtriplett commented Apr 12, 2016

@nikomatsakis As much as I find it awkward for assigning to a union field of a type with Drop to require previous validity of that field, the reference case @tsion mentioned seems almost unavoidable. I think this might just be a gotcha associated with code that intentionally disables the lint for putting a type with Drop in a union. (And a short explanation of it should be in the explanatory text for that lint.)

@solson

Member

solson commented Apr 12, 2016

And I'd like to reiterate that unsafe programmers must already generally know that a = b means drop_in_place(&mut a); ptr::write(&mut a, b) to write safe code. Not dropping union fields would be one more exception to learn, not one less.

(NB: the drop doesn't happen when a is statically known to already be uninitialized, like let a; a = b;.)

But I support having a default warning against Drop variants in unions that people have to #[allow(..)] since this is a fairly non-obvious detail.
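A small sketch of the desugaring described above, again on a plain local rather than a union field. `Noisy` and `write_sequence` are illustrative names only. The first write targets a binding that is statically known to be uninitialized, the second is an ordinary assignment, and the third spells out `drop_in_place` followed by `ptr::write` by hand.

```rust
use std::cell::Cell;
use std::ptr;
use std::rc::Rc;

/// Hypothetical helper type: bumps a shared counter whenever it is dropped.
struct Noisy {
    drops: Rc<Cell<u32>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Returns the drop count after the first, second and third write to `a`.
fn write_sequence() -> (u32, u32, u32) {
    let drops = Rc::new(Cell::new(0));

    let mut a; // declared, not yet initialized
    a = Noisy { drops: Rc::clone(&drops) }; // first write: nothing to drop
    let after_first = drops.get();

    a = Noisy { drops: Rc::clone(&drops) }; // second write: old value dropped
    let after_second = drops.get();

    // The same effect written out manually.
    unsafe {
        ptr::drop_in_place(&mut a);
        ptr::write(&mut a, Noisy { drops: Rc::clone(&drops) });
    }
    let after_manual = drops.get();

    (after_first, after_second, after_manual)
}

fn main() {
    println!("{:?}", write_sequence());
}
```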

@nikomatsakis

Contributor Author

nikomatsakis commented Apr 12, 2016

@tsion this is not true for a = b and maybe only sometimes true for a.x = b but it is certainly true for *a = b. This uncertainty is what made me hesitant about it. For example, this compiles:

fn main() {
  let mut x: (i32, i32);
  x.0 = 2;
  x.1 = 3;
}

(though trying to print x later fails, but I consider that a bug)

@solson

Member

solson commented Apr 12, 2016

@nikomatsakis That example is new to me. I guess I would have considered it a bug that that example compiles, given my previous experience.

But I'm not sure I see the relevance of that example. Why is what I said not true for a = b and only sometimes for a.x = b?

Say, if x.0 had a type with a destructor, surely that destructor is called:

fn main() {
    let mut x: (Box<i32>, i32);
    x.0 = Box::new(2); // x.0 statically known to be uninit, destructor not called
    x.0 = Box::new(3); // x.0 destructor is called before writing new value
}
@arielb1

Contributor

arielb1 commented Apr 14, 2016

Maybe just lint against that kind of write?

@nikomatsakis

Contributor Author

nikomatsakis commented Apr 16, 2016

My point is only that = does not always run the destructor; it
uses some knowledge about whether the target is known to be
initialized.

On Tue, Apr 12, 2016 at 04:10:39PM -0700, Scott Olson wrote:

@nikomatsakis That example is new to me. I guess I would have considered it a bug that that example compiles, given my previous experience.

But I'm not sure I see the relevance of that example. Why is what I said not true for a = b and only sometimes for 'a.x = b'?

Say, if x.0 had a type with a destructor, surely that destructor is called:

fn main() {
    let mut x: (Box<i32>, i32);
    x.0 = Box::new(2); // x.0 statically known to be uninit, destructor not called
    x.0 = Box::new(3); // x.0 destructor is called
}
@arielb1

Contributor

arielb1 commented Apr 16, 2016

@nikomatsakis

It runs the destructor if the drop flag is set.

But I think that kind of write is confusing anyway, so why not just forbid it? You can always do *(&mut u.var) = val.

@solson

Member

solson commented Apr 16, 2016

My point is only that = does not always run the destructor; it uses some knowledge about whether the target is known to be initialized.

@nikomatsakis I already mentioned that:

(NB: the drop doesn't happen when a is statically known to already be uninitialized, like let a; a = b;.)

But I didn't account for dynamic checking of drop flags, so this is definitely more complicated than I considered.
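The drop-flag case can be sketched as follows. When a local is initialized on only some paths, the compiler cannot decide statically whether to run its destructor at scope end, so the decision is made at run time (how exactly that is tracked is an implementation detail). `Noisy` and `drops_at_scope_end` are hypothetical names for illustration.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical helper type: bumps a shared counter whenever it is dropped.
struct Noisy {
    drops: Rc<Cell<u32>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Counts how many destructors ran for a conditionally initialized local.
fn drops_at_scope_end(initialize: bool) -> u32 {
    let drops = Rc::new(Cell::new(0));
    {
        let _slot;
        if initialize {
            _slot = Noisy { drops: Rc::clone(&drops) };
        }
        // At the end of this block `_slot` is dropped only if it was
        // actually initialized on the path taken.
    }
    drops.get()
}

fn main() {
    println!("{} {}", drops_at_scope_end(true), drops_at_scope_end(false));
}
```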

@arielb1

Contributor

arielb1 commented Apr 17, 2016

@tsion

Drop flags are only semi-dynamic - after zeroing drop is gone, they are a part of codegen. I say we forbid that kind of write because it causes more confusion than good.

@Daggerbot

Daggerbot commented Apr 27, 2016

Should Drop types even be allowed in unions? If I'm understanding things correctly, the main reason to have unions in Rust is to interface with C code that has unions, and C doesn't even have destructors. For all other purposes, it seems that it's better to just use an enum in Rust code.

@Amanieu

Contributor

Amanieu commented Apr 27, 2016

There is a valid use case for using a union to implement a NoDrop type which inhibits drop.

@joshtriplett

Member

joshtriplett commented Apr 27, 2016

As well as invoking such code manually via drop_in_place or similar.
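A rough sketch of both ideas, written against present-day stable Rust rather than the design under discussion here. One assumption to flag loudly: current compilers require a union field of a type with a destructor to be wrapped in `ManuallyDrop`, which postdates these comments. `NoDrop`, `Noisy` and the function name are illustrative only.

```rust
use std::cell::Cell;
use std::mem::ManuallyDrop;
use std::rc::Rc;

/// Hypothetical helper type: bumps a shared counter whenever it is dropped.
struct Noisy {
    drops: Rc<Cell<u32>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Hypothetical wrapper: a union never drops its fields on its own.
union NoDrop<T> {
    value: ManuallyDrop<T>,
}

/// Returns the drop count after letting one union fall out of scope
/// and after explicitly dropping the field of a second one.
fn leaked_then_manually_dropped() -> (u32, u32) {
    let drops = Rc::new(Cell::new(0));

    {
        let _held = NoDrop {
            value: ManuallyDrop::new(Noisy { drops: Rc::clone(&drops) }),
        };
        // Scope end: no destructor runs for the contained value.
    }
    let after_scope_end = drops.get();

    {
        let mut held = NoDrop {
            value: ManuallyDrop::new(Noisy { drops: Rc::clone(&drops) }),
        };
        // SAFETY: `value` was just initialized and is not used again.
        unsafe { ManuallyDrop::drop(&mut held.value) };
    }

    (after_scope_end, drops.get())
}

fn main() {
    println!("{:?}", leaked_then_manually_dropped());
}
```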

@RumataEstor

RumataEstor commented Jun 21, 2016

To me dropping a field value while writing to it is definitely wrong because the previous option type is undefined.

Would it be possible to prohibit field setters but require full union replacement? In this case if the union implements Drop full union drop would be called for the value replaced as expected.

@joshtriplett

Member

joshtriplett commented Jun 22, 2016

I don't think it makes sense to prohibit field setters; most uses of unions should have no problem using those, and fields without a Drop implementation will likely remain the common case. Unions with fields that implement Drop will produce a warning by default, making it even less likely to hit this case accidentally.

@petrochenkov

Contributor

petrochenkov commented Jul 22, 2018

We don't think we have to pay that cost for unions since the bag-of-bits model doesn't give any new opportunities compared to enum-with-unknown-variant model.

@rkruppe

Member

rkruppe commented Jul 22, 2018

The property in question here is at least as much of a burden for unsafe code to uphold as it is a safeguard. There's no static analysis which can prevent all mistakes that could break this property, since we do want to use unions for unsafe type punning [1]. So "enum with unknown variant" really means code handling unions has to be super careful with how it writes to the union or risk instant UB, without really reducing the unsafety involved in reading from the union, since reading already requires knowing (through channels the compiler doesn't understand) that the bits are valid for the variant you're reading. The only time we can actually warn users about a union that isn't valid for any of its variants is when running under miri, not in the vast majority of cases where it happens at runtime.

[1] For example, assuming tuples are repr(C) for simplicity, union Foo { a: (bool, u8), b: (u8, bool) } allows you to construct something that's invalid just by field assignments.

@petrochenkov

Contributor

petrochenkov commented Jul 22, 2018

@rkruppe

union Foo { a: (bool, u8), b: (u8, bool) }

Hey, that's my example :)
And it's valid under RFC 1897's model (at least one of the "leaf" fragments bool-1, u8-1, u8-2, bool-2 is valid after any partial assignments).

code handling unions has to be super careful with how it writes to the union or risk instant UB

That's the point of RFC 1897's model, static checking ensures that no safe operation (like assignment or partial assignment) can turn the union into invalid state, so you don't need to be super careful all the time and don't get instant UB.
Only union-unrelated unsafe operations like writes through wild pointers can make a union invalid.

On the other hand, without move checking, union can be put into invalid state very easily.

let u: Union;
let x = u.field; // UB
@rkruppe

Member

rkruppe commented Jul 22, 2018

That's the point of RFC 1897's model, static checking ensures that no safe operation (like assignment or partial assignment) can turn the union into invalid state, so you don't need to be super careful all the time and don't get instant UB.
Only union-unrelated unsafe operations like writes through wild pointers can make a union invalid.

You can automatically recognize some kinds of writes as not violating the extra invariants imposed on unions, but it's still extra invariants that need to be upheld by writers. Since reading is still unsafe and requires manually ensuring that the bits will be valid for the variant that's read, this doesn't actually help readers, it just makes writers' lives harder. Neither "bag of bits" nor "enum with unknown variant" helps solve the hard problem of unions: how to ensure it actually stores the kind of data you want to read.

@derekdreery

Contributor

derekdreery commented Jul 22, 2018

How would the fancier type-checking affect Dropping? If you create a union then pass it to C, which takes ownership, will rust try to free the data, perhaps causing a double-free? Or would you always implement Drop yourself?

Edit: it would be way cool if unions were like "enums where the variant is checked statically at compile time", if I've understood the suggestion.

Edit 2: could unions start off as a bag of bits and then later allow safe access whilst being backwards-compatible?

@RalfJung

Member

RalfJung commented Jul 22, 2018

And it's valid under RFC 1897's model (at least one of the "leaf" fragments bool-1, u8-1, u8-2, bool-2 is valid after any partial assignments).

If we decide we want this to be valid, I think @oli-obk should update miri's checks to reflect that -- with #51361 merged, it would be rejected by miri.

@petrochenkov The part I do not understand is what this buys us. We get extra complexity, in terms of implementation (static analysis) and usage (user still needs to be aware of the exact rules). This extra complexity adds to the fact that when unions are used, we are already in an unsafe context, so things are naturally more complex. I think we should have a clear motivation for why this extra complexity is worth it. I do not consider "it violates the spirit of the language somewhat" to be a clear motivation.

The one thing I can think of is layout optimizations. In a "bag of bits" model, a union has no niche, ever. However, I feel that is better addressed by giving the programmer more manual control over the niche, which would also be useful in other cases.
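To make the niche point concrete, here is a small sketch that just prints sizes. It relies on observed behaviour of current rustc, not on anything the language guarantees: `Option<bool>` is expected to reuse the invalid bit patterns of `bool`, while wrapping a union in `Option` is expected to need extra space for the discriminant. `MaybeBool` is a made-up type.

```rust
use std::mem::size_of;

/// Hypothetical union with one field that, taken alone, would have a niche.
#[allow(dead_code)]
union MaybeBool {
    flag: bool,
    raw: u8,
}

/// Sizes of `bool`, `Option<bool>`, `MaybeBool` and `Option<MaybeBool>`.
fn sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<bool>(),
        size_of::<Option<bool>>(),
        size_of::<MaybeBool>(),
        size_of::<Option<MaybeBool>>(),
    )
}

fn main() {
    let (b, opt_b, u, opt_u) = sizes();
    println!("bool: {b}, Option<bool>: {opt_b}");
    println!("MaybeBool: {u}, Option<MaybeBool>: {opt_u}");
}
```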

@gnzlbg

Contributor

gnzlbg commented Jul 22, 2018

@RalfJung

Member

RalfJung commented Jul 22, 2018

@gnzlbg I think the only guarantee we'd get is what @petrochenkov wrote above

static checking ensures that no safe operation (like assignment or partial assignment) can turn the union into invalid state

On the other hand, without move checking, union can be put into invalid state very easily.

Your proposal does not protect against bad reads either, I don't think that's possible.

Also, I imagined some very basic "initialized" tracking along the lines of "writing to any field initializes the union". We'd need something anyway when impl Drop for MyUnion is allowed. For better or worse, we have to decide when and where to insert automatic drop calls for a union. Those rules should be as simple as at all possible because this is extra code that we are inserting into existing subtle unsafe code. For unions that do implement Drop, I also imagined a restriction similar to struct that does not allow writing to a field unless the data structure is already initialized.

@derekdreery

could unions start off as a bag of bits and then later allow safe access whilst being backwards-compatible?

No. Once we say it's a bag of bits, there could be unsafe code assuming that's allowed.

@joshtriplett

Member

joshtriplett commented Jul 22, 2018

I think there's value in the bare-minimum move checking to see if a union is initialized. The original RFC explicitly specified that initializing or assigning to any union field makes the whole union initialized. Beyond that, though, rustc should not try to infer anything about the value in a union that the user doesn't explicitly specify; a union may contain any value at all, including a value that isn't valid for any of its fields.

One use case for that, for instance: consider a C-style tagged union that's explicitly extensible with more tags in the future. C and Rust code reading that union must not assume it knows every possible field type.
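The "initializing any field initializes the whole union" rule can be sketched like this. `IntOrBytes` and `bytes_of` are hypothetical names: the union is constructed through one field and then read through a sibling field, which is the usual type-punning pattern.

```rust
/// Hypothetical union for type punning between a u32 and its bytes.
#[repr(C)]
union IntOrBytes {
    int: u32,
    bytes: [u8; 4],
}

/// Initializes the union through `int` and reads it back through `bytes`.
fn bytes_of(value: u32) -> [u8; 4] {
    let u = IntOrBytes { int: value };
    // SAFETY: all four bytes were initialized by writing `int`, and every
    // bit pattern is a valid [u8; 4].
    unsafe { u.bytes }
}

fn main() {
    println!("{:?}", bytes_of(0x0102_0304));
}
```

The result is in native byte order, so it matches `u32::to_ne_bytes` rather than any fixed endianness.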

@petrochenkov

Contributor

petrochenkov commented Jul 22, 2018

@RalfJung

Perhaps I should start from the other direction.

Should this code work 1) for unions 2) for non-unions?

let x: T;
let y = x.field;

For me the answer is an obvious "no" in both cases, because this is a whole class of errors that Rust can and wants to prevent, regardless of the "union"-ness of T.

This means move checker should have some kind of scheme in accordance to which it implements that support. Given that move checker (and borrow checker) generally work in per-field fashion, the simplest scheme for unions would be "same rules as for structs + (de)initialization/borrow of a field also (de)initializes/borrows its sibling fields".
This simple rule covers all the static checking.

Then, the enum model is simply a consequence of the static checking described above + one more condition.
If 1) initialization checking is enabled and 2) unsafe code doesn't write arbitrary invalid bytes into the area belonging to the union, then one of the union's "leaf" fields is automatically valid. This is a dynamically uncheckable (at least for unions with >1 fields and outside of the const-evaluator) guarantee, but it's targeted first of all at people reading code.

This case from @joshtriplett , for example

One use case for that, for instance: consider a C-style tagged union that's explicitly extensible with more tags in the future. C and Rust code reading that union must not assume it knows every possible field type.

would be much clearer for people reading code if the union explicitly had an extra field for "possible future extensions".

Of course, we can keep the basic static initialization checking, but reject the second condition and allow writing arbitrary possibly invalid data to the union through some unsafe "third party" means without it being instant UB. Then we wouldn't have that dynamic people-targeted guarantee anymore, I just think that would be a net loss.

@joshtriplett

Member

joshtriplett commented Jul 22, 2018

@petrochenkov

Should this code work 1) for unions 2) for non-unions?

let x: T;
let y = x.field;

For me the answer is an obvious "no" in both cases, because this is a whole class of errors that Rust can and wants to prevent, regardless of the "union"-ness of T.

Agreed, this level of checking for uninitialized values seems reasonable, and quite feasible.

This means move checker should have some kind of scheme in accordance to which it implements that support. Given that move checker (and borrow checker) generally work in per-field fashion, the simplest scheme for unions would be "same rules as for structs + (de)initialization/borrow of a field also (de)initializes/borrows its sibling fields".
This simple rule covers all the static checking.

Agreed so far, assuming I understand the rules for structs.

Then, the enum model is simply a consequence of the static checking described above + one more condition.
If 1) initialization checking is enabled and 2) unsafe code doesn't write arbitrary invalid bytes into the area belonging to the union, then one of the union's "leaf" fields is automatically valid. This is a dynamically uncheckable (at least for unions with >1 fields and outside of the const-evaluator) guarantee, but it's targeted first of all at people reading code.

That additional condition isn't valid for unions.

This case from @joshtriplett , for example

One use case for that, for instance: consider a C-style tagged union that's explicitly extensible with more tags in the future. C and Rust code reading that union must not assume it knows every possible field type.

would be much clearer for people reading code if the union explicitly had an extra field for "possible future extensions".

That's not how C unions work, nor how Rust unions were specified to work. (And I'd question whether it'd be clearer, or simply whether it matches a different set of expectations.) Changing this would make Rust unions no longer fit for some of the purposes for which they were designed and proposed.

Of course, we can keep the basic static initialization checking, but reject the second condition and allow writing arbitrary possibly invalid data to the union through some unsafe "third party" means without it being instant UB. Then we wouldn't have that dynamic people-targeted guarantee anymore, I just think that would be a net loss.

Those 'unsafe "third party" means' include "getting a union from FFI", which is a completely valid use case.

Here's a concrete example:

union Event {
    event_id: u32,
    event1: Event1,
    event2: Event2,
    event3: Event3,
}

struct Event1 {
    event_id: u32, // always EVENT1
    // ... more fields ...
}
// ... more event structs ...

match u.event_id {
    EVENT1 => { /* ... */ }
    EVENT2 => { /* ... */ }
    EVENT3 => { /* ... */ }
    _ => { /* unknown event */ }
}

That's completely valid code that people can and will write using unions.
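A compilable variant of the sketch above, under some assumptions that are not in the original: the concrete tag values, the payload fields `x` and `code`, and the `describe` helper are all invented for illustration, and only two event kinds are shown. Every struct is `repr(C)` and starts with the `u32` tag, so reading `event_id` is meaningful whichever field was written.

```rust
// Hypothetical tag values.
const EVENT1: u32 = 1;
const EVENT2: u32 = 2;

#[repr(C)]
#[derive(Clone, Copy)]
struct Event1 {
    event_id: u32, // always EVENT1
    x: i32,
}

#[repr(C)]
#[derive(Clone, Copy)]
struct Event2 {
    event_id: u32, // always EVENT2
    code: u8,
}

#[repr(C)]
union Event {
    event_id: u32,
    event1: Event1,
    event2: Event2,
}

/// Dispatches on the shared tag and tolerates tags it does not know.
fn describe(e: &Event) -> String {
    // SAFETY: every field starts with a u32 tag at offset 0.
    match unsafe { e.event_id } {
        // SAFETY (both arms): the tag says which struct was written.
        EVENT1 => format!("event1 x={}", unsafe { e.event1.x }),
        EVENT2 => format!("event2 code={}", unsafe { e.event2.code }),
        other => format!("unknown event {}", other),
    }
}

fn main() {
    let known = Event { event1: Event1 { event_id: EVENT1, x: 7 } };
    let future = Event { event_id: 99 };
    println!("{}", describe(&known));
    println!("{}", describe(&future));
}
```

The fallback arm is the point of the example: a value produced by newer C code may match none of the fields the Rust side knows about.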

@RalfJung

Member

RalfJung commented Jul 22, 2018

@petrochenkov

Should this code work 1) for unions 2) for non-unions?
For me the answer is an obvious "no" in both cases, because this is a whole class of errors that Rust can and wants to prevent, regardless of the "union"-ness of T.

Fine for me.

the simplest scheme for unions would be "same rules as for structs + (de)initialization/borrow of a field also (de)initializes/borrows its sibling fields".

Woah. The struct rules make sense because they are all based on the fact that different fields are disjoint. You can't just invalidate that basic assumption and still use the same rules. The fact that you need an addendum to the rules shows that. I would never expect unions to be checked similar to structs. If anything, one might expect them to be checked similar to enums -- but of course that cannot work, because enums can only be accessed via match.

If 1) initialization checking is enabled and 2) unsafe code doesn't write arbitrary invalid bytes into the area belonging to the union, then one of the union's "leaf" fields is automatically valid. This is a dynamically uncheckable (at least for unions with >1 fields and outside of the const-evaluator) guarantee, but it's targeted first of all at people reading code.

I think it is extremely desirable for the basic validity assumptions to be dynamically checkable (given type information). Then we can check them during CTFE in miri, we can even check them during "full" miri runs (e.g. of a test suite), we can eventually have some kind of sanitizer or maybe a mode where Rust emits debug_assert! in critical places to check the validity invariants.
I think the experience with C's uncheckable rules gives ample evidence that these are problematic. Usually, the first step to actually understand and clarify what the rules are is to find a dynamically checkable way to express them. Even for concurrency memory models, "dynamically checkable" variants (operational semantics explaining everything in terms of step-by-step execution of a virtual machine) are showing up and seem to be the only way to solve long-standing open problems of the axiomatic models that were previously used ("out of thin air problem" is a keyword here).

I can hardly overstate how important I think it is to have dynamically checkable rules. I think we should aim to have 0 uncheckable cases of UB. (We're not there yet, but it's the goal we should have.) That is the only responsible way to have UB in your language, everything else is a case of compiler/language authors making their life easier at the expense of everyone who has to live with the consequences. (I am currently working on dynamically checkable rules for aliasing and raw pointer accesses.)
Even if that would be the only problem, as far as I am concerned "not dynamically checkable" is sufficient grounds to not use this approach.

That said, I see no fundamental reason why this should not be checkable: For every byte in the union, go over all variants to see which values are allowed for that byte in this variant, and take the union (heh ;) ) of all of those sets. A sequence of bytes is valid for a union if every byte is valid according to this definition.
This is, however, quite hard to actually implement a check for -- by far the most complex basic type validity invariant we would have in Rust. That is a direct consequence of the fact that this validity rule is somewhat tricky to describe, which is why I don't like it.
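The bytewise rule just described can be modelled in a few lines. This is a toy sketch, not how miri or rustc represent validity: `ByteRule` only distinguishes "any byte" from "bool byte", each variant is given as a per-offset list of rules, and padding and uninitialized bytes are ignored. A stricter whole-variant check is included for comparison.

```rust
/// Toy model of which values one byte may take within one variant.
#[derive(Clone, Copy)]
enum ByteRule {
    Any,  // e.g. a u8
    Bool, // only 0 or 1
}

impl ByteRule {
    fn allows(self, byte: u8) -> bool {
        match self {
            ByteRule::Any => true,
            ByteRule::Bool => byte <= 1,
        }
    }
}

/// Bytewise rule: each byte must be allowed by at least one variant
/// at that offset. A variant that does not cover an offset contributes
/// nothing (a simplification of this sketch).
fn valid_bytewise(variants: &[&[ByteRule]], bytes: &[u8]) -> bool {
    bytes.iter().enumerate().all(|(i, &b)| {
        variants
            .iter()
            .any(|v| v.get(i).map_or(false, |rule| rule.allows(b)))
    })
}

/// Stricter rule: one single variant must accept all the bytes.
fn valid_for_some_variant(variants: &[&[ByteRule]], bytes: &[u8]) -> bool {
    variants.iter().any(|v| {
        v.len() == bytes.len() && v.iter().zip(bytes).all(|(rule, &b)| rule.allows(b))
    })
}

fn main() {
    // Models union { x: (u8, bool), y: (bool, u8) } with a fixed layout.
    let x: &[ByteRule] = &[ByteRule::Any, ByteRule::Bool];
    let y: &[ByteRule] = &[ByteRule::Bool, ByteRule::Any];
    let variants = [x, y];

    for bytes in [[1u8, 0u8], [5, 17]] {
        println!(
            "{:?}: bytewise={} whole-variant={}",
            bytes,
            valid_bytewise(&variants, &bytes),
            valid_for_some_variant(&variants, &bytes)
        );
    }
}
```

For this particular pair of variants every byte pattern passes the bytewise check, while a pattern such as `[5, 17]` fails the whole-variant one.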

Of course, we can keep the basic static initialization checking, but reject the second condition and allow writing arbitrary possibly invalid data to the union through some unsafe "third party" means without it being instant UB. Then we wouldn't have that dynamic people-targeted guarantee anymore, I just think that would be a net loss.

What does that guarantee buy us? Where does it actually help? Right now, all I see is that everyone has to work hard and be careful to uphold it. I don't see the benefit we, the people, get out of that.

@joshtriplett

consider a C-style tagged union that's explicitly extensible with more tags in the future. C and Rust code reading that union must not assume it knows every possible field type.

The model proposed by @petrochenkov allows those usecases, by adding a __non_exhaustive: () field to the union. However, I don't think that should be necessary. Conceivably, binding generators could add such a field.

@petrochenkov

Contributor

petrochenkov commented Jul 22, 2018

@RalfJung

This is a dynamically uncheckable (at least for unions with >1 fields and outside of the const-evaluator) guarantee

I think it is extremely desirable for the basic validity assumptions to be dynamically checkable

A clarification: I meant uncheckable in "by default"/"in release mode", of course it can be checkable in "slow mode" with some extra instrumentation, but you already wrote about this better than I could.

@joshtriplett

Member

joshtriplett commented Jul 22, 2018

@RalfJung

The model proposed by @petrochenkov allows those usecases, by adding a __non_exhaustive: () field to the union.

Yes, I understood that that was the proposal.

However, I don't think that should be necessary. Conceivably, binding generators could add such a field.

They could, but they'd have to systematically add it to every single union.

I have yet to see an argument for why it makes sense to break primary use cases of unions in favor of some unspecified use case that depends on limiting what bit patterns they can contain.

@petrochenkov

Contributor

petrochenkov commented Jul 29, 2018

@joshtriplett

primary use cases of unions

It's not obvious to me at all why this is the primary use case.
It may be true for repr(C) unions if you assume that all uses of unions for tagged unions / "Rust enum emulation" in FFI assume extensibility (which is not true), but from what I've seen, uses of repr(Rust) unions (drop control, initialization control, transmutes) do not expect "unexpected variants" suddenly appearing in them.

@joshtriplett

Member

joshtriplett commented Jul 29, 2018

@petrochenkov I didn't say "break the primary use case", I said "break primary use cases". FFI is one of the primary use cases of unions.

@scottmcm

Member

scottmcm commented Jul 30, 2018

and take the union (heh ;) ) of all of those sets

There's certainly an attractive obviousness to a statement that "the possible values of a union are the union of the possible values of all its possible variants"...

@RalfJung

Member

RalfJung commented Jul 30, 2018

True. However, that's not the proposal -- we all agree that the following should be legal:

union F {
  x: (u8, bool),
  y: (bool, u8),
}
fn foo() -> F {
  let mut f = F { x: (5, false) };
  unsafe { f.y.1 = 17; }
  f
}

Actually I think it is a bug that this even requires unsafe.

So, the union has to be taken bytewise, at least.
Also, I don't think "attractive obviousness" on its own is a sufficiently good reason. Any invariant we decide on is a significant burden for unsafe code authors, we should have concrete advantages that we get in turn.
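A runnable variant of the example, with a few departures that are assumptions of this sketch rather than part of the original: tuples are replaced by `repr(C)` structs so the layout is pinned down, and a `raw` byte-array field is added so the result can be inspected without reading a possibly invalid `bool`. The type and function names are invented.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct ByteThenFlag {
    byte: u8,
    flag: bool,
}

#[repr(C)]
#[derive(Clone, Copy)]
struct FlagThenByte {
    flag: bool,
    byte: u8,
}

#[repr(C)]
union F {
    x: ByteThenFlag,
    y: FlagThenByte,
    raw: [u8; 2], // added here only to observe the bytes
}

/// Initializes via `x`, overwrites one byte via `y`, returns the raw bytes.
fn mixed_bytes() -> [u8; 2] {
    let mut f = F { x: ByteThenFlag { byte: 5, flag: false } };
    // Written inside `unsafe` as in the original; a compiler that
    // considers this write safe will at most warn about the block.
    unsafe {
        f.y.byte = 17;
    }
    // SAFETY: both bytes are initialized and any bit pattern is a valid [u8; 2].
    unsafe { f.raw }
}

fn main() {
    println!("{:?}", mixed_bytes());
}
```

The resulting bytes are valid for neither `x` nor `y` as a whole, since each would need a `bool` where a 5 or a 17 now sits, which is why a validity rule for this union could only be stated per byte.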

@petrochenkov

Contributor

petrochenkov commented Jul 30, 2018

@RalfJung

Actually I think it is a bug that this even requires unsafe.

I don't know about the new MIR-based unsafety-checker implementation, but in the old HIR-based one it was certainly a checker limitation/simplification - only expressions of the form expr1.field = expr2 were analyzed for possible "field assignment" unsafety opt-out, everything else was conservatively treated as generic "field access" that's unsafe for unions.

@petrochenkov

Contributor

petrochenkov commented Aug 5, 2018

Answering the comment in #52786 (comment):

So the idea is that compiler still doesn't know anything about the Wrap<T>'s contract and can't e.g. do layout optimizations. Ok, this position is understood.
This means that internally, inside of Wrap's module, the implementation of Wrap<T> can, for example, temporarily write "unexpected values" into it, as long as it doesn't leak them to users, and the compiler will be okay with them.

I'm not sure, though, how exactly the part of Wrap's contract about the absence of unexpected values is related to field privacy.

First of all, regardless of fields being private or public, unexpected values cannot be written directly through those fields. You need something like a raw pointer, or code on the other side of FFI to do it, and it can be done without any field access, just by having a pointer to the whole union. So we need to approach this from some other direction than access to a field being restricted.

As I interpret your comment, the approach is to say that a private field (in a union or a struct, doesn't matter) implies an arbitrary invariant unknown to the user, so any operations changing that field (directly or through wild pointers, doesn't matter) result in UB because they can potentially break that unspecified invariant.

This means that if a union has a single private field, then its implementer (but not the compiler) can assume that no third party will write an unexpected value into that union.
That's a "default union documentation clause" for the user in some sense:
- (Default) If a union has a private field you can't write garbage into it.
- Otherwise, you can write garbage into a union unless its docs explicitly prohibit it.

If some union wants to prohibit unexpected values while still providing pub access to its expected fields (e.g. when those fields have no invariants of their own), then it can still do so through documentation; that's why the "unless" in the second clause is necessary.

@RalfJung
Does this describe your position accurately?

How are scenarios like this treated?

mod m {
    union MyPrivateUnion { /* private fields */ }
    extern {
        fn my_private_ffi_function() -> MyPrivateUnion; // Can return garbage (?)
    }
}

Member

RalfJung commented Aug 6, 2018

As I interpret your comment, the approach is to say that a private field (in a union or a struct, doesn't matter) implies an arbitrary invariant unknown to the user, so any operations changing that field (directly or through wild pointers, doesn't matter) result in UB because they can potentially break that unspecified invariant.

No, that is not what I meant.

There are multiple invariants. I do not know how many we will need, but there will be at least two (and I don't have great names for them):

  • The "Layout-level invariant" (or "syntactic invariant") of a type is completely defined by the syntactic shape of the type. These are things like "&mut T is non-NULL and aligned", "bool is 0 or 1", "! cannot exist". On this level, *mut T is the same as usize -- both allow any value (or maybe any initialized value, but that distinction is for another discussion). We are, eventually, going to have a document spelling out these invariants for all types, by structural recursion: The layout-level invariant of a struct is that all its fields have their invariant maintained, etc. Visibility does not play a role here.

    Violating the layout-level invariant is instantaneous UB. This is a statement we can make because we have defined this invariant in very simple terms, and we make it part of the definition of the language itself. We can then exploit this UB (and we already do), e.g. to perform enum layout optimizations.

  • The "Custom type-level invariant" (or "semantic invariant") of a type is picked by whoever implements the type. The compiler cannot know this invariant as we do not have a language to express it, and the same goes for the language definition. We cannot make violating this invariant UB, as we cannot even say what that invariant is! The fact that it is even possible to have custom invariants is a feature of any useful type system: Abstraction. I wrote more about this in a past blog post.

    The connection between the custom, semantic invariant and UB is that we declare that unsafe code may rely on its semantic invariants being preserved by foreign code. That makes it incorrect to just go ahead and put random stuff into a Vec's size field. Note that I said incorrect (I sometimes use the term unsound) -- but not undefined behavior! Another example to demonstrate this difference (really, the same example) is the discussion about aliasing rules for &mut ZST. Creating a dangling well-aligned non-null &mut ZST is never immediate UB, but it is still incorrect/unsound because one may write unsafe code which relies on this not to happen.
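To make the layout-level side concrete, here is a minimal sketch of the enum layout optimization mentioned in the first bullet. The helper name `option_is_free` is made up for illustration. It only compares sizes: types whose layout-level invariant rules out some bit pattern (a null reference, a zero `NonZeroU32`) leave room for `None`, while types that allow every bit pattern do not.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

/// Hypothetical helper: true if wrapping `T` in `Option` costs no extra space,
/// i.e. the compiler found an invalid bit pattern of `T` to represent `None`.
fn option_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

fn main() {
    // `&u8` is never null, so null can encode `None`.
    assert!(option_is_free::<&u8>());
    // `NonZeroU32` is never 0, so 0 can encode `None`.
    assert!(option_is_free::<NonZeroU32>());
    // `*const u8` and `u32` allow every bit pattern, so `Option` needs extra space.
    assert!(!option_is_free::<*const u8>());
    assert!(!option_is_free::<u32>());
}
```

This is also why violating a layout-level invariant has to be immediate UB: a null `&u8` would be indistinguishable from `None`.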

It would be nice to align these two concepts, but I do not think it is practical. First of all, for some types (function pointers, dyn traits), the definition of the custom, semantic invariant actually uses the definition of UB in the language. This definition would be circular if we wanted to say that it is UB to ever violate the custom, semantic invariant. Secondly, I'd prefer if the definition of our language, and whether a certain execution trace exhibits UB, was a decidable property. Semantic, custom invariants are frequently not decidable.


I'm not sure, though, how exactly the part of Wrap's contract about the absence of unexpected values is related to field privacy.

Essentially, when a type chooses its custom invariant, it has to make sure that anything that safe code can do preserves the invariant. After all, the promise is that just using this type's safe API can never lead to UB. This applies to both structs and unions. One of the things safe code can do is access public fields, which is where this connection comes from.

For example, a public field of a struct cannot have a custom invariant that is different from the custom invariant of the field type: After all, any safe user could write arbitrary data into that field, or read from the field and expect "good" data. A struct where all fields are public can be safely constructed, placing further restrictions on the field.

A union with a public field... well that's somewhat interesting. Reading union fields is unsafe anyway, so nothing changes there. Writing union fields is safe, so a union with a public field has to be able to handle arbitrary data which satisfies that field's type's custom invariant being put into the field. I doubt this will be very useful...
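A small sketch of that asymmetry, using a made-up union `IntOrFloat` whose fields accept every bit pattern (so no invariant of either field type can be broken): the write needs no `unsafe`, the read does.

```rust
/// Hypothetical union with public fields; both `u32` and `f32` accept
/// every bit pattern, so any write through one field is fine for the other.
pub union IntOrFloat {
    pub i: u32,
    pub f: f32,
}

/// Writes `bits` through the `i` field and reads them back through `f`.
fn reinterpret(bits: u32) -> f32 {
    let mut u = IntOrFloat { f: 0.0 };
    // Safe: assigning to a `Copy` union field does not require `unsafe`.
    u.i = bits;
    // Unsafe: reading a union field always requires `unsafe`.
    unsafe { u.f }
}

fn main() {
    // 0x3f80_0000 is the IEEE 754 bit pattern of 1.0f32.
    assert_eq!(reinterpret(0x3f80_0000), 1.0);
}
```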

So, to recap, when you choose a custom invariant, it is your responsibility to make sure that foreign safe code cannot break this invariant (and you have tools like private fields to help you achieve this). It is the responsibility of foreign unsafe code to not violate your invariant when that code does something safe code could not do.
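As an illustration of this division of responsibility, here is a sketch with a hypothetical `NonEmpty` type (not something from this thread). Its custom invariant, "items is never empty", is invisible to the compiler; the private field is what stops foreign safe code from breaking it, and the unsafe code inside the module relies on it.

```rust
mod non_empty {
    /// Custom (semantic) invariant: `items` is never empty.
    /// The field is private, so safe code outside this module cannot break it.
    pub struct NonEmpty {
        items: Vec<i32>,
    }

    impl NonEmpty {
        /// The only way to construct a value; it enforces the invariant.
        pub fn new(items: Vec<i32>) -> Option<NonEmpty> {
            if items.is_empty() {
                None
            } else {
                Some(NonEmpty { items })
            }
        }

        /// Relies on the invariant to skip the bounds check.
        pub fn first(&self) -> i32 {
            // SAFETY: `items` is non-empty by the custom invariant of `NonEmpty`.
            unsafe { *self.items.get_unchecked(0) }
        }
    }
}

fn main() {
    let v = non_empty::NonEmpty::new(vec![7, 8]).unwrap();
    assert_eq!(v.first(), 7);
    assert!(non_empty::NonEmpty::new(Vec::new()).is_none());
}
```

If `items` were public, safe code could empty the vector and `first` would become unsound, even though nothing at the layout level had been violated up to that point.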


This means that internally, inside Wrap's module, the implementation of Wrap<T> can, for example, temporarily write "unexpected values" into it, as long as it doesn't leak them to users, and the compiler will be okay with that.

Correct. (panic-safety is a concern here but you are probably aware). This is just like, in Vec, I can safely do

let sz = self.size;
self.size = 1337;
self.size = sz;

and there is no UB.


mod m {
    union MyPrivateUnion { /* private fields */ }
    extern {
        fn my_private_ffi_function() -> MyPrivateUnion; // Can return garbage (?)
    }
}

In terms of the syntactic layout invariant, my_private_ffi_function can do anything (assuming the function call ABI and signature matches). In terms of the semantic custom invariant, that's not visible in the code -- whoever wrote this module had an invariant in mind, they should document it next to their union definition and then make sure that the FFI function returns a value which satisfies the invariant.

Member

RalfJung commented Aug 22, 2018

I finally wrote that blog post about whether and when &mut T must be initialized, and the two kinds of invariants I mentioned above.

Contributor

SimonSapin commented Mar 10, 2019

Is there anything left to track here that’s not already covered by #55149, or should we close?
