Fenced load in bounded queue / v1.2.3 #17

Closed · sbarral opened this issue Jul 19, 2022 · 10 comments
sbarral commented Jul 19, 2022

I commented on the relevant commit (#16), but too late, as the commit had already been merged, so I thought I would open an issue so that this can be tracked.

Admittedly, I don't know exactly what the role of the fence is here. This fence does not exist in Dmitry Vyukov's original implementation of the queue, so I guess it was added as part of the modifications that ensure that this queue is linearisable (unlike the original queue).

That being said, if the cross-platform solution is indeed to place the load before the fence (this, I do not know), then I am pretty sure that the Intel specialization that uses a lock operation instead of an mfence should also keep the load before the fence.

I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

sbarral commented Jul 19, 2022

Also, the compare_exchange(...) line should IMO be fenced by two compiler_fence(Ordering::SeqCst), because compare_exchange does not prevent all compiler reordering like an atomic SeqCst fence does.
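For illustration, a minimal sketch of what that bracketing could look like (hypothetical helper, not the actual crossbeam code; `dummy` stands in for whatever atomic the full fence operates on):

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

// Sketch only: bracket the dummy RMW with compiler fences so the compiler
// cannot move unrelated memory accesses across it; the lock-prefixed RMW
// itself provides the hardware barrier on x86.
fn full_fence_via_rmw(dummy: &AtomicUsize) {
    compiler_fence(Ordering::SeqCst);
    let _ = dummy.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst);
    compiler_fence(Ordering::SeqCst);
}
```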

Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.
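A hypothetical variant of the same helper using an idempotent fetch-style RMW instead of compare_exchange (again just a sketch; with SeqCst this can compile to a single lock-prefixed instruction on x86):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch only: fetch_add(0) is an idempotent RMW, so it has no visible
// effect on the value but still acts as a full barrier with SeqCst.
fn full_fence_via_fetch(dummy: &AtomicUsize) {
    let _ = dummy.fetch_add(0, Ordering::SeqCst);
}
```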

taiki-e commented Jul 19, 2022

I thought the purpose of that fence was for subsequent relaxed loads (because there were cases where there was no preceding relaxed operation, e.g., pop in the unbounded queue). However, #16 may be wrong for the bounded queue, because the bounded queue does have a preceding relaxed operation. So, I've yanked 1.2.3 for now.

(The relevant commits are crossbeam-rs/crossbeam@7d3f7f0 (bounded) and crossbeam-rs/crossbeam@05554e6 (unbounded).)


> I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> alternative: MFENCE,MOV (from memory)


> Also, the compare_exchange(...) line should IMO be fenced by two compiler_fence(Ordering::SeqCst), because compare_exchange does not prevent all compiler reordering like an atomic SeqCst fence does.

You are right.


> Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.

LLVM also uses lock or instead of lock cmpxchg. However, LLVM can optimize fetch_* to load (llvm/llvm-project#56450), so inline asm is needed.
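As a rough idea of what an inline-asm based fence could look like (purely an illustrative sketch, not necessarily what any crate actually does):

```rust
use core::arch::asm;

// Sketch only: a lock-prefixed idempotent RMW on a local dummy. The LOCK
// prefix gives a full barrier on x86, and because the asm! block may access
// memory by default, the compiler also treats it as a compiler barrier.
#[cfg(target_arch = "x86_64")]
fn full_fence_asm() {
    let mut dummy: u64 = 0;
    unsafe {
        asm!(
            "lock or qword ptr [{0}], 0",
            in(reg) &mut dummy as *mut u64,
            options(nostack),
        );
    }
}
```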

sbarral commented Jul 19, 2022

> I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> alternative: MFENCE,MOV (from memory)

I see, but the fact that a SeqCst load can be replaced by mfence + mov does not imply the reverse, because an atomic fence provides stricter synchronization than a SeqCst load. My guess is that the original author did not only want to perform a SeqCst load but also wanted a global barrier; otherwise I expect he would have used a load(Ordering::SeqCst) on non-Intel platforms. Though again, I admit I do not fully understand the intent of this fence, so I may be wrong.
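To make that distinction concrete, here is a hedged sketch contrasting the two shapes (head/tail are just placeholder names):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// A SeqCst load by itself does not order against an earlier Relaxed load...
fn seqcst_load_only(head: &AtomicUsize, tail: &AtomicUsize) -> (usize, usize) {
    let h = head.load(Ordering::Relaxed);
    // The Relaxed load of `head` may still be reordered after this load.
    let t = tail.load(Ordering::SeqCst);
    (h, t)
}

// ...whereas a SeqCst fence between two Relaxed loads keeps them ordered.
fn seqcst_fence(head: &AtomicUsize, tail: &AtomicUsize) -> (usize, usize) {
    let h = head.load(Ordering::Relaxed);
    fence(Ordering::SeqCst); // full barrier, not just a load
    let t = tail.load(Ordering::Relaxed);
    (h, t)
}
```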

> Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.

> LLVM also uses lock or instead of lock cmpxchg. However, LLVM can optimize fetch_* to load (llvm/llvm-project#56450), so inline asm is needed.

But it looks like this was a miscompilation? Or did this bug already make it into rustc?

[Edit] OK, I tested it on godbolt and it looks like the optimization (or miscompilation if it is one) is not an issue for fetch with SeqCst ordering: either the proper lock instruction is used or it is changed to mfence + mov.

sbarral commented Jul 19, 2022

Regarding inline assembly: my own experience was that in practice it actually produced slower code than simply using a fence, because the ASM was treated as a black box by the compiler and too many optimizations were prevented.

RalfJung commented

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

Is this compositional though? As in, when mixing both mappings, does the result still behave as intended?

With these mappings it is sometimes the case that the entire program needs to use the same mapping, or else things go wrong.

sbarral commented Jul 26, 2022

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> Is this compositional though? As in, when mixing both mappings, does the result still behave as intended?

I wonder whether we are barking up the wrong tree: I don't think the original code was intended to work as a SeqCst load, but rather as a full SeqCst barrier, which does much more (for an example see #18 (comment)).
When it comes to SeqCst fences, my understanding is that using a lock instruction as an alternative to an mfence is well established; see e.g. microsoft/STL#739 as well as many answers by Peter Cordes on SO.

Unfortunately the above thread does not directly answer the concern about mixing mfence and lock, but I would be a bit surprised if they couldn't be mixed.

taiki-e commented Jul 27, 2022

> With these mappings it is sometimes the case that the entire program needs to use the same mapping, or else things go wrong.

I believe this is only true in the case where a SeqCst store is mapped to a plain mov without a fence; otherwise it should not be a problem.
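As a concrete illustration of the hazard (a sketch of the classic store-buffering pattern, not code from this crate):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// SeqCst forbids both threads reading 0. With either mapping used
// consistently, each thread keeps a fence between its store and its load.
// If a store compiled as a plain mov (fence-before-load mapping) is mixed
// with a load compiled as a plain mov (fence-after-store mapping), no fence
// separates them and the forbidden outcome becomes observable on x86.
static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    assert!(!(r1 == 0 && r2 == 0), "forbidden under SeqCst");
}
```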

fence+mov is also proposed for optimization of the idempotent RMW on x86 in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7, and LLVM actually implements it.

sbarral commented Jul 27, 2022

After looking a bit more at the bounded queue, I have the feeling that the fences are meant to prevent spurious reports of empty (pop) or full (push) queues when a concurrent push (resp. pop) operation is ongoing.

Taking the example of pop, the fence ensures that the relaxed load of the head remains before the relaxed load of the tail, which in turn ensures that if a push is ongoing (i.e. the tail has moved but the stamp hasn't been updated yet and still matches the head), this will be observed. This would explain why it is a full fence and not simply a SeqCst load, as the latter does not prevent reordering with relaxed loads.

That's just a theory though; these things are awfully tricky, so I may very well be wrong...
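For what it's worth, here is a rough sketch of that pop-side check, just to pin down the intuition (illustrative names and structure, not the actual crossbeam code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Bounded {
    head: AtomicUsize,
    tail: AtomicUsize,
}

impl Bounded {
    // Sketch only: returns true when the queue appears empty.
    fn looks_empty(&self) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        // Full fence: the Relaxed load of `head` stays before the Relaxed
        // load of `tail`, so a concurrent push that has already bumped the
        // tail (but not yet written the stamp) is still observed.
        fence(Ordering::SeqCst);
        let tail = self.tail.load(Ordering::Relaxed);
        head == tail
    }
}
```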

sbarral commented Jul 27, 2022

> fence+mov is also proposed for optimization of the idempotent RMW on x86 in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7, and LLVM actually implements it.

Note that even though this may have been an optimization back then (those links date from 2012 and 2014), the situation now seems to have reversed: in most cases an idempotent RMW will be faster than an mfence; some supporting elements:

taiki-e commented Aug 15, 2023

Closing -- the 1.2.3 change has been reverted in 1.2.4 (#18), and the x86 full_fence implementation has been rewritten with inline assembly in #47.

taiki-e closed this as completed Aug 15, 2023