Fenced load in bounded queue / v1.2.3 #17

Closed · sbarral opened this issue Jul 19, 2022 · 10 comments
sbarral commented Jul 19, 2022

I commented on the relevant commit (#16), but too late, as the commit had already been merged, so I thought I would open an issue so that this can be tracked.

Admittedly, I don't know exactly what the role of the fence is here. This fence does not exist in Dmitry Vyukov's original implementation of the queue, so I guess it was added as part of the modifications that ensure that this queue is linearisable (unlike the original queue).

That being said, if the cross-platform solution is indeed to place the load before the fence (this, I do not know), then I am pretty sure that the Intel specialization that uses a lock operation instead of an mfence should also keep the load before the fence.

I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

sbarral commented Jul 19, 2022

Also, the compare_exchange(...) line should IMO be fenced by two compiler_fence(Ordering::SeqCst), because compare_exchange does not prevent all compiler reordering like an atomic SeqCst fence does.
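For illustration, a minimal sketch of what that bracketing could look like (hypothetical helper, not the actual crossbeam code; `dummy` stands in for whatever atomic the full fence operates on):

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

// Sketch only: bracket the dummy RMW with compiler fences so the compiler
// cannot move unrelated memory accesses across it; the lock-prefixed RMW
// itself provides the hardware barrier on x86.
fn full_fence_via_rmw(dummy: &AtomicUsize) {
    compiler_fence(Ordering::SeqCst);
    let _ = dummy.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst);
    compiler_fence(Ordering::SeqCst);
}
```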

Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.
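A hypothetical variant of the same helper using an idempotent fetch-style RMW instead of compare_exchange (again just a sketch; with SeqCst this can compile to a single lock-prefixed instruction on x86):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch only: fetch_add(0) is an idempotent RMW, so it has no visible
// effect on the value but still acts as a full barrier with SeqCst.
fn full_fence_via_fetch(dummy: &AtomicUsize) {
    let _ = dummy.fetch_add(0, Ordering::SeqCst);
}
```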

taiki-e commented Jul 19, 2022

I thought the purpose of that fence was for subsequent relaxed loads (because there were cases where there was no preceding relaxed operation, e.g., pop in the unbounded queue). However, #16 may be wrong for the bounded queue, because the bounded queue does have a preceding relaxed operation. So, I've yanked 1.2.3 for now.

(The relevant commits are crossbeam-rs/crossbeam@7d3f7f0 (bounded) and crossbeam-rs/crossbeam@05554e6 (unbounded).)


> I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> alternative: MFENCE,MOV (from memory)


> Also, the compare_exchange(...) line should IMO be fenced by two compiler_fence(Ordering::SeqCst), because compare_exchange does not prevent all compiler reordering like an atomic SeqCst fence does.

You are right.


> Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.

LLVM also uses lock or instead of lock cmpxchg. However, LLVM can optimize fetch_* to load (llvm/llvm-project#56450), so inline asm is needed.
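As a rough idea of what an inline-asm based fence could look like (purely an illustrative sketch, not necessarily what any crate actually does):

```rust
use core::arch::asm;

// Sketch only: a lock-prefixed idempotent RMW on a local dummy. The LOCK
// prefix gives a full barrier on x86, and because the asm! block may access
// memory by default, the compiler also treats it as a compiler barrier.
#[cfg(target_arch = "x86_64")]
fn full_fence_asm() {
    let mut dummy: u64 = 0;
    unsafe {
        asm!(
            "lock or qword ptr [{0}], 0",
            in(reg) &mut dummy as *mut u64,
            options(nostack),
        );
    }
}
```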

sbarral commented Jul 19, 2022

> I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> alternative: MFENCE,MOV (from memory)

I see, but the fact that a SeqCst load can be replaced by mfence + mov does not imply the reverse, because an atomic fence provides stricter synchronization than a SeqCst load. My guess is that the original author did not only want to perform a SeqCst load but also wanted a global barrier; otherwise I expect he would have used a load(Ordering::SeqCst) on non-Intel platforms. Though again, I admit I do not fully understand the intent of this fence, so I may be wrong.
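To make that distinction concrete, here is a hedged sketch contrasting the two shapes (head/tail are just placeholder names):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// A SeqCst load by itself does not order against an earlier Relaxed load...
fn seqcst_load_only(head: &AtomicUsize, tail: &AtomicUsize) -> (usize, usize) {
    let h = head.load(Ordering::Relaxed);
    // The Relaxed load of `head` may still be reordered after this load.
    let t = tail.load(Ordering::SeqCst);
    (h, t)
}

// ...whereas a SeqCst fence between two Relaxed loads keeps them ordered.
fn seqcst_fence(head: &AtomicUsize, tail: &AtomicUsize) -> (usize, usize) {
    let h = head.load(Ordering::Relaxed);
    fence(Ordering::SeqCst); // full barrier, not just a load
    let t = tail.load(Ordering::Relaxed);
    (h, t)
}
```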

> Another thing to consider is whether a fetch-family instruction, as used by GCC, is not preferable to compare_exchange. At least on my machine a fetch is faster.

> LLVM also uses lock or instead of lock cmpxchg. However, LLVM can optimize fetch_* to load (llvm/llvm-project#56450), so inline asm is needed.

But it looks like this was a miscompilation? Or did this bug already make it into rustc?

[Edit] OK, I tested it on godbolt and it looks like the optimization (or miscompilation if it is one) is not an issue for fetch with SeqCst ordering: either the proper lock instruction is used or it is changed to mfence + mov.

sbarral commented Jul 19, 2022

Regarding inline assembly: my own experience was that in practice it actually produced slower code than simply using a fence, because the ASM was treated as a black box by the compiler and too many optimizations were prevented.

RalfJung commented

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

Is this compositional though? As in, when mixing both mappings, does the result still behave as intended?

With these mappings it is sometimes the case that the entire program needs to use the same mapping, or else things go wrong.

sbarral commented Jul 26, 2022

> The table under "Note: there is an alternative mapping of C/C++11 to x86..." says that mfence+mov works as a SeqCst load.

> Is this compositional though? As in, when mixing both mappings, does the result still behave as intended?

I wonder whether we are barking up the wrong tree: I don't think the original code was intended to work as a SeqCst load, but rather as a full SeqCst barrier, which does much more (for an example see #18 (comment)).
When it comes to SeqCst fences, my understanding is that using a lock instruction as an alternative to an mfence is well established; see e.g. microsoft/STL#739 as well as many answers by Peter Cordes on SO.

Unfortunately the above thread does not directly answer the concern about mixing mfence and lock, but I would be a bit surprised if they couldn't be mixed.

taiki-e commented Jul 27, 2022

> With these mappings it is sometimes the case that the entire program needs to use the same mapping, or else things go wrong.

I believe this is only true in the case where a SeqCst store is mapped to a plain mov without a fence; otherwise it should not be a problem.
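As a concrete illustration of the hazard (a sketch of the classic store-buffering pattern, not code from this crate):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// SeqCst forbids both threads reading 0. With either mapping used
// consistently, each thread keeps a fence between its store and its load.
// If a store compiled as a plain mov (fence-before-load mapping) is mixed
// with a load compiled as a plain mov (fence-after-store mapping), no fence
// separates them and the forbidden outcome becomes observable on x86.
static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    assert!(!(r1 == 0 && r2 == 0), "forbidden under SeqCst");
}
```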

fence+mov is also proposed for optimization of the idempotent RMW on x86 in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7, and LLVM actually implements it.

sbarral commented Jul 27, 2022

After looking a bit more at the bounded queue, I have the feeling that the fences are meant to prevent spurious reports of empty (pop) or full (push) queues when a concurrent push (resp. pop) operation is ongoing.

Taking the example of pop, the fence ensures that the relaxed load of the head remains before the relaxed load of the tail, which in turn ensures that if a push is ongoing (i.e. the tail has moved but the stamp hasn't been updated yet and still matches the head), this will be observed. This would explain why it is a full fence and not simply a SeqCst load, as the latter does not prevent reordering with relaxed loads.

That's just a theory though; these things are awfully tricky, so I may very well be wrong...
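For what it's worth, here is a rough sketch of that pop-side check, just to pin down the intuition (illustrative names and structure, not the actual crossbeam code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Bounded {
    head: AtomicUsize,
    tail: AtomicUsize,
}

impl Bounded {
    // Sketch only: returns true when the queue appears empty.
    fn looks_empty(&self) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        // Full fence: the Relaxed load of `head` stays before the Relaxed
        // load of `tail`, so a concurrent push that has already bumped the
        // tail (but not yet written the stamp) is still observed.
        fence(Ordering::SeqCst);
        let tail = self.tail.load(Ordering::Relaxed);
        head == tail
    }
}
```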

sbarral commented Jul 27, 2022

> fence+mov is also proposed for optimization of the idempotent RMW on x86 in https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf, Section 7, and LLVM actually implements it.

Note that even though this may have been an optimization back then (those links date from 2012 and 2014), the situation now seems to have reversed: in most cases an idempotent RMW will be faster than an mfence; some supporting elements:

taiki-e commented Aug 15, 2023

Closing -- the 1.2.3 change has been reverted in 1.2.4 (#18), and the x86 full_fence implementation has been rewritten with inline assembly in #47.

taiki-e closed this as completed Aug 15, 2023