
What about: mixed-size atomic accesses #345

Open
RalfJung opened this issue Jul 2, 2022 · 16 comments
Labels
C-open-question Category: An open question that we should revisit

Comments

@RalfJung
Member

RalfJung commented Jul 2, 2022

So it looks like mixed-size accesses made the rounds on Twitter again recently, which got me thinking about them. rust-lang/rust#97516 carefully chose not to talk about them (and anyway it's not clear that is the right place to go into more detail about them). So what do we want to do with them in Rust?

Some facts:

  • In C++, you cannot convert a &uint16_t into a reference to an array "because no such array exists at that location in memory"; they insist that memory is strongly typed. This means they don't even have to talk about mixed-size accesses. It also means they are ignoring a large fraction of the programs out there but I guess they are fine with that. We are not. ;)

  • Apparently the x86 manual says you "should" not do this: "Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths." It is unclear what "should" means (or what anything else here really means, operationally speaking...)

  • In Rust, it is pretty much established that you can safely turn a &mut u16 into a &mut [u8; 2]. So you can do something where you start with a &mut AtomicU16, do some atomic things to it, then use the above conversion and get a &mut AtomicU8 and do atomic things with that -- a bona fide mixed-size atomic access.

    However, this means that there is a happens-before edge between all the 16-bit accesses and the 8-bit accesses. Having the &mut means that there is synchronization. I hope this means not even Intel disagrees with this, but since literally none of the words in that sentence is defined, who knows.

So... it seems like the most restrictive thing we can say, without disallowing code that you can already write entirely safely using bytemuck, is that

  • it is allowed to do differently-sized atomic accesses to the same location over time,
  • but only if any two not-perfectly-overlapping accesses are completely synchronized through other means (i.e., it is not these accesses themselves that add a happens-before edge, there already exists a happens-before edge through other accesses).
  • Any other kind of mixed-size access is UB.

Cc @workingjubilee @m-ou-se @Amanieu @cbeuw @thomcc

@m-ou-se
Member

m-ou-se commented Jul 2, 2022

That all matches my understanding as well.

(It's part of the reason why I added Atomic*::from_mut.)

However, this means that there is a happens-before edge between all the 16-bit accesses and the 8-bit accesses. Having the &mut means that there is synchronization. I hope this means not even Intel disagrees with this, but since literally none of the words in that sentence is defined, who knows.

If they would disagree with that, that'd basically imply that after using some memory for an atomic operation, you can never re-use that memory again. E.g. deallocating a Box would be unsafe, and so would be a stack-allocated AtomicU16 that goes out of scope.

They don't say it very clearly, but I don't see how their no-mixed-sizes rule can apply to anything other than atomic operations on the same memory that race with each other.

It is unclear what "should" means (or what anything else here really means, operationally speaking...)

Yeah, it could very well turn out that "should" just means "for performance", and that it has nothing to do with correctness. They're not very clear.

So... it seems like the most restrictive thing we can say [..]

That seems like exactly the right thing to say, and matches what you can do in safe Rust (if we include the unstable Atomic*::from_mut).

I don't think it's impossible that this might be less restrictive in the future, if we find more reasons to believe that racing mixed-size atomic operations will work on all platforms.

In C++, you cannot convert a &uint16_t into a reference to an array

Converting uint16_t* to a char* however is fine, e.g. to memset or memcpy into a uint16_t or struct, etc.

In C++20, you can also have a struct X { int a; int b; } and create an std::atomic_ref<X> first, and an std::atomic_ref<int> to one of the fields later.

In atomics.ref.generic#general-3, they clearly specify mixed-size accesses in the same way as us:

The lifetime ([basic.life]) of an object referenced by *ptr shall exceed the lifetime of all atomic_refs that reference the object. While any atomic_ref instances exist that reference the *ptr object, all accesses to that object shall exclusively occur through those atomic_ref instances. No subobject of the object referenced by atomic_ref shall be concurrently referenced by any other atomic_ref object.

(Emphasis mine.)

@Amanieu
Member

Amanieu commented Jul 2, 2022

ARM's memory model (in the section: The AArch64 Application Level Memory Model) seems to fully support mixed-sized atomic accesses.

@RalfJung
Member Author

RalfJung commented Jul 2, 2022

If they would disagree with that, that'd basically imply that after using some memory for an atomic operation, you can never re-use that memory again. E.g. deallocating a Box would be unsafe, and so would be a stack-allocated AtomicU16 that goes out of scope.

Ah, good point. I had forgotten that hardware memory models do not have provenance. 😂

That seems like exactly the right thing to say, and matches what you can do in safe Rust (if we include the unstable Atomic*::from_mut).

.. and include bytemuck

Converting uint16_t* to a char* however is fine, e.g. to memset or memcpy into a uint16_t or struct, etc.

Isn't that a C thing? Though C++ might have something similar with std::byte.
But anyway that's a non-atomic type.

In C++20, you can also have a struct X { int a; int b; } and create an std::atomic_ref first, and an std::atomic_ref to one of the fields later.

Oh, good point. So in some sense this is actually already all covered by rust-lang/rust#97516.

@bjorn3
Member

bjorn3 commented Jul 2, 2022

If they would disagree with that, that'd basically imply that after using some memory for an atomic operation, you can never re-use that memory again. E.g. deallocating a Box would be unsafe, and so would be a stack-allocated AtomicU16 that goes out of scope.

Ah, good point. I had forgotten that hardware memory models do not have provenance. 😂

When reusing memory it is undef, right? Furthermore deallocation requires some kind of synchronization with every thread that has ever accessed it using atomic operations. Together I would assume this is enough to provide consistency by "resetting" the state witnessing that it was accessed using atomic operations of a different size.

@m-ou-se
Member

m-ou-se commented Jul 2, 2022

Though C++ might have something similar with std::byte.

C++ allows aliasing through char, unsigned char and std::byte: https://wg21.link/basic.lval#11.3

std::atomic_ref

To add to my previous comment: one of the reasons why C++'s atomic_ref doesn't allow mixed size / overlapping operations, is that it supports objects of any size. If it gets too big for native atomic instructions, it uses a mutex instead, which is probably stored in some kind of global table indexed by the address of the object. It's not completely clear whether it's necessary to be as restrictive when limited to only natively supported atomic operations, like in Rust.

@chorman0773
Contributor

chorman0773 commented Apr 3, 2023

Apparently the x86 manual says you "should" not do this: "Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths." It is unclear what "should" means (or what anything else here really means, operationally speaking...)

I think we can interpret "should" as "It's undefined, by spec, though it works in practice b/c some important people rely on it, but please don't, we want to do fast things".

Therefore, the mixed size access should be considered undefined by Rust, as we expect to be able to compile to x86 where it is undefined.

@cbeuw

cbeuw commented Apr 6, 2023

Regarding x86, I got the following from @thiagomacieira which is very helpful

Even on Intel processors (note: I work for Intel and this is an area I am very familiar
with) you can mix different-sized operations and still retain atomicity,
provided you obey some rules:

  • never cross a cacheline boundary (preferably, align naturally)
  • use operations of 16 bytes or less (ideally: use only 1-uop operations)
    • for all current in-market processors and I believe this applies to AMD
    • for all current P-core processors: 32- and 64-byte accesses are also fine

Note: I don't know how valid this is for the new RAO instructions, but they
should be ok for CMPccXADD.

In fact, I've seen a few codebases that do use the fact that they can mix two atomic_uint32_t with an overlapping atomic_uint64_t, at least when it comes to interfacing with the Linux kernel 32-bit futex support, see
https://codebrowser.dev/glibc/glibc/nptl/sem_waitcommon.c.html#do_futex_wait

@RalfJung
Member Author

use operations of 16 bytes or less (ideally: use only 1-uop operations)

So this means there are atomic 256-bit accesses but doing size mixing with those is a problem? (Not sure what "1-uop operations" are.)

@chorman0773
Contributor

chorman0773 commented Apr 10, 2023 via email

@thiagomacieira

SSE loads and stores are atomic. AVX 256- and 512-bit loads and stores are atomic on P-core processors, but not E-core (the 256-bit operation is cracked into two 128-bit operations and therefore not atomic).

There's no RMW SIMD. The best you get is a merge-store and I'm confident that's atomic on P-core, but I doubt so on E-core. Therefore, SIMD atomics are very limited if you can only use them for loads and stores. The most useful thing that this could be done for is to load 16 bytes atomically, and use CMPXCHG16B to store, but it's still limited and somewhat slow due to the transfer between register files. seqlocks are more flexible.

@tmandry
Member

tmandry commented May 21, 2024

My reading of the RISC-V spec says that mixed-size atomics are well-defined if they are aligned, both in terms of their atomicity and in terms of their globally observable ordering with respect to other load/store operations in the same thread.

It took quite a lot of context to convince myself that this is true, so I'm leaving notes below, but the tl;dr is "mixed size atomics are fine, as long as they are aligned".


Atomicity:

All memory operations are single-copy atomic: they can never be observed in a partially complete state.

Ordering (emphasis mine):

The global memory order for any given execution of a program respects some but not all of each hart’s program order. The subset of program order that must be respected by the global memory order is known as preserved program order.

The complete definition of preserved program order is as follows (and note that AMOs are simultaneously both loads and stores): memory operation a precedes memory operation b in preserved program order (and hence also in the global memory order) if a precedes b in program order, a and b both access regular main memory (rather than I/O regions), and any of the following hold:

Overlapping-Address Orderings:

  1. b is a store, and a and b access overlapping memory addresses
  2. a and b are loads, x is a byte read by both a and b, there is no store to x between a and b in program order, and a and b return values for x written by different memory operations
  3. a is generated by an AMO or SC instruction, b is a load, and b returns a value written by a

Context: Here, AMO is an atomic modify operation such as fetch_add or fetch_xor, and "hart" is a hardware thread. RISC-V does not implement CAS directly but uses paired LR (load-reserved) and SC (store-conditional) instructions, which can be used to implement CAS.

Note that there is a discrepancy described in section A.7.1 between their axiomatic (normative) and operational memory models, regarding mixed-size load and store operations. The examples provided all look like edge cases to me and I do not believe they have any bearing on guarantees provided by the C++ atomic model (even if the model were extended to cover mixed-size atomics). As long as you rely only on atomics being atomic and on acquire/release (or stronger) semantics providing happens-before relationships, you could not write code that depends on this discrepancy being resolved one way or another.

@RalfJung
Member Author

RalfJung commented May 22, 2024

There are plenty of hardware models that allow mixed-size atomics. That's entirely useless for our purpose, though: Rust needs a language-level memory model, and so far I don't know of any proposal for one that would permit mixed-size atomics. Hardware-level memory models disallow many optimizations we'd want to do (such as GVN, hoisting code, or algebraic identities that remove syntactic dependencies like y-y → 0).

Trying to use a hardware memory model in Rust doesn't work for the usual reason: what the hardware does is not what your program does. Rust programs run on an Abstract Machine that has its own rules, and that is intended to support way more optimizations than what you can do on assembly, and the cost for that is that the AM is a lot harder to define.

@zachs18

zachs18 commented May 22, 2024

Yes, Rust cannot just directly use a hardware memory model, and I do not think tmandry was suggesting it should, but it is still necessary to discuss hardware memory models to determine whether or not a possible future Rust memory model allowing mixed-size atomics is even implementable on those platforms.

@tmandry
Member

tmandry commented May 23, 2024

Don't worry! I promise to forever believe in the Rust abstract machine.

I also think we have a precedent for "poking holes" in Rust's abstract machine by saying the behavior of certain operations is "whatever the underlying hardware does – within (important) constraints" – correct me if I'm wrong, but floating point comes to mind.

Currently, I don't see any insurmountable obstacles to doing that for mixed-size atomic ops. It's very unlikely that LLVM takes advantage of the UB given that actual C/C++ code in the wild makes use of them, and so far I can't imagine a use case that allows us to do an important optimization (at least, the importance of such an optimization would likely be smaller than the importance of using mixed-size atomics).

Examples of usage in the wild:

  1. glibc semaphores, as shared by @thiagomacieira above
  2. They seem to be the basis of the refusal to support 64-bit futexes in Linux
  3. A blocking rw-lock in the Fuchsia kernel

While we could tell everyone they should go down to assembly for this, it would be much nicer – and probably more optimization-friendly – if we could let them use Rust atomics.

@RalfJung
Member Author

Floating-point operations are trivial pure operations, so poking a hole is trivial in principle -- and even there it is already going horribly wrong, see the x86-32 issues.

The memory model is probably the most complicated part of the Rust specification. I don't know of any way that is even remotely principled that would let us "poke a hole". So no, I don't think we can do that. If you want to use target semantics, you have to use inline asm (and even then you have to ensure that the end-to-end behavior of the inline asm can be expressed in Rust).

@tmandry
Member

tmandry commented May 24, 2024

Yeah, I didn't mean to suggest this was trivial, and your comment makes me realize that "poking a hole" is definitely the wrong framing.

I would very much like it if we could extend the memory model (possibly in target-specific ways) to support these operations, but I don't know how to do that and I'm sure some smart people have tried. On the bright side, we may have ourselves a great idea for a PhD thesis...
