Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure our handling of partially initialized values is compatible with LLVM #94428

Open
RalfJung opened this issue Feb 27, 2022 · 2 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team

Comments

@RalfJung
Copy link
Member

The intent is for types like MaybeUninit<u64> to support dealing with partially initialized data: e.g., if we have a (u32, u16) (and assuming for a second we could rely on its layout), it should be sound to transmute that to MaybeUninit<u64> and back even though the padding between the two tuple fields might be uninitialized. Code like #94212 relies on this.

The thing is, we are compiling MaybeUninit<u64> to i64 for LLVM -- MaybeUninit is repr(transparent). This was required to avoid codegen regressions when MaybeUninit started to be used in some hot data copying loops inside libcore. So, for this all to work out, we better be sure that i64 correctly preserves partially initialized data.

LLVM has two kinds of "uninit" data, undef and poison.

  • undef is per-bit and precisely preserved in all iN types, so we should be fine here.
  • poison, however, is per-value: when loading an i64 and any of its bytes is poison, the entire result is poison. That is exactly not what we want for MaybeUninit<u64>. However, at least in current LLVM, poison is only created in very few situations (such as "nowrap" arithmetic that overflows), and AFAIK none of them can happen in a UB-free Rust program -- so, basically "uninit" in Rust only ever corresponds to undef in LLVM, never to poison. (But I might have missed places where LLVM generates posion.)

So I think right now we are good. However, LLVM is slowly moving away from undef and towards posion, since undef is seriously ill-behaved in many ways. And if that ever means that "uninit" in Rust could correspond to LLVM poison, then we have a problem here -- we have to keep monitoring this situation, and it might be good for us to be involved in the relevant LLVM discussions here as well to make sure they are aware of this problem.

Similarly, as we evolve the MIR semantics we have to make sure that no UB-free program can generate poison after compilation to LLVM.

A very elegant solution to this issue would be for LLVM to adopt the "byte type" proposal, however, so far my impression is the LLVM community is not convinced they need such a type. With a byte type, MaybeUninit<u64> could be easily compiled to b64 in LLVM, and a byte type would preserve poison precisely, so we'd be all good.

I am mostly opening this so we have some place to track the current situation, and to make sure everyone agrees on what the main concerns are here -- and to get input from folks with more LLVM experience in case I got some of this wrong.
Cc @rust-lang/wg-unsafe-code-guidelines @rust-lang/wg-llvm

@RalfJung
Copy link
Member Author

RalfJung commented Jul 14, 2022

Cc llvm/llvm-project#52930

Also see this LLVM thread.

@jyn514 jyn514 added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-lang Relevant to the language team, which will review and decide on the PR/issue. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-codegen Area: Code generation A-mir Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html T-opsem Relevant to the opsem team and removed T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Apr 9, 2023
@RalfJung
Copy link
Member Author

There was another LLVM discussion early last year, re-affirming this point:

Similarly, as we evolve the MIR semantics we have to make sure that no UB-free program can generate poison after compilation to LLVM.

That is not great since it means we can't have "delayed UB (via poison)" in Rust at all even when LLVM has it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team
Projects
None yet
Development

No branches or pull requests

2 participants