Alter Once to remove unnecessary SeqCst usage #31650
Conversation
rust-highfive assigned nikomatsakis on Feb 14, 2016
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
gereeter reviewed on Feb 14, 2016
    @@ -71,7 +71,7 @@ impl Once {
         #[stable(feature = "rust1", since = "1.0.0")]
         pub fn call_once<F>(&'static self, f: F) where F: FnOnce() {
             // Optimize common path: load is much cheaper than fetch_add.
    -        if self.cnt.load(Ordering::SeqCst) < 0 {
    +        if self.cnt.load(Ordering::Acquire) < 0 {
gereeter (Contributor) commented on Feb 14, 2016
Slightly more efficient would be to use a Relaxed load and only fence(Acquire) if the count is in fact negative.
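For readers skimming the thread, here is a minimal sketch of the pattern being suggested (the free function and the `cnt` parameter are mine, for illustration; this is not the PR's code):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Sketch of the suggested fast path: do the load with Relaxed, and only pay
// for an acquire barrier on the branch where initialization is already known
// to have finished.
fn already_initialized(cnt: &AtomicIsize) -> bool {
    if cnt.load(Ordering::Relaxed) < 0 {
        // Synchronize with the Release operation that published the result.
        fence(Ordering::Acquire);
        true
    } else {
        false
    }
}
```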
Amanieu (Contributor) commented on Feb 14, 2016
I disagree, in this case you would want to acquire to be on the load itself. This is because some architectures (ARM, AArch64) have a load-acquire instruction that is much faster than the fence instruction. Since this branch is going to be taken in the majority of cases, you would want to avoid the fence overhead here.
gereeter reviewed on Feb 14, 2016
        if prev < 0 {
            // Make sure we never overflow, we'll never have isize::MIN
            // simultaneous calls to `call_once` to make this value go back to 0
    -       self.cnt.store(isize::MIN, Ordering::SeqCst);
    +       self.cnt.store(isize::MIN, Ordering::Release);
gereeter reviewed on Feb 14, 2016
    @@ -114,15 +114,15 @@ impl Once {
         // otherwise we run the job and record how many people will try to grab
         // this lock
         let guard = self.mutex.lock();
    -    if self.cnt.load(Ordering::SeqCst) > 0 {
    +    if self.cnt.load(Ordering::Acquire) > 0 {
gereeter (Contributor) commented on Feb 14, 2016
This can be Relaxed: The acquisition of the result of the function is already done by the mutex.
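A rough sketch of that argument, using std's `Mutex` purely for illustration (the real code uses the internal static mutex, and the function name is hypothetical):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

// Sketch: while the mutex is held, its lock/unlock already synchronizes this
// thread with whichever thread ran the initializer under the same mutex, so
// the load of the count itself does not need to carry Acquire.
fn already_done(cnt: &AtomicIsize, mutex: &Mutex<()>) -> bool {
    let _guard = mutex.lock().unwrap();
    cnt.load(Ordering::Relaxed) > 0
}
```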
gereeter reviewed on Feb 14, 2016
        }
        drop(guard);

        // Last one out cleans up after everyone else, no leaks!
    -   if self.lock_cnt.fetch_add(-1, Ordering::SeqCst) == 1 {
    +   if self.lock_cnt.fetch_add(-1, Ordering::AcqRel) == 1 {
gereeter (Contributor) commented on Feb 14, 2016
This can be Relaxed: the necessary synchronization to know that the lock is ready to be cleaned up is provided by the lock itself.
Memory orderings are tricky enough that I think having comments explaining why the more relaxed bounds are correct would be very useful.
gereeter reviewed on Feb 14, 2016
    @@ -102,11 +102,11 @@ impl Once {
         // calling `call_once` will return immediately before the initialization
         // has completed.

    -    let prev = self.cnt.fetch_add(1, Ordering::SeqCst);
    +    let prev = self.cnt.fetch_add(1, Ordering::AcqRel);
gereeter (Contributor) commented on Feb 14, 2016
This can be Relaxed with a fence(Acquire) if the count is negative: since swap and fetch_add are atomic, no extra synchronization is necessary to ensure that the number of threads that pass through the fetch_add seeing a positive value is exactly the number that will be read as the result of the swap.
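Sketched out (the function names and free-function form are mine, for illustration only), the claim is that the hand-off of the waiter count needs no ordering at all; only the visibility of the closure's results does:

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

// Waiter side: the increment can be Relaxed. Because fetch_add and swap are
// both atomic RMWs on the same variable, every waiter that sees a
// non-negative `prev` here is necessarily included in the value the
// initializer later reads back with `swap`.
fn register_waiter(cnt: &AtomicIsize) -> isize {
    let prev = cnt.fetch_add(1, Ordering::Relaxed);
    if prev < 0 {
        // Already initialized: synchronize with the initializer's Release swap.
        fence(Ordering::Acquire);
    }
    prev
}

// Initializer side: publish completion and collect the exact waiter count.
fn finish_init(cnt: &AtomicIsize) -> isize {
    cnt.swap(isize::MIN, Ordering::Release)
}
```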
gereeter reviewed on Feb 14, 2016
        f();
    -   let prev = self.cnt.swap(isize::MIN, Ordering::SeqCst);
    +   let prev = self.cnt.swap(isize::MIN, Ordering::AcqRel);
        self.lock_cnt.store(prev, Ordering::SeqCst);
gereeter (Contributor) commented on Feb 14, 2016
This can just be Release: there are no stores that need to be acquired.
@gereeter Thanks a lot for the thorough review - really made me think about and learn more about the use of relaxed ordering, where I'd just been concerned with the sequential consistency aspect. Made changes which hopefully address your thoughts.
Ah, looks like there's an issue filed for this: #27610
@Amanieu (continuing out of a now hidden comment)

Neither of the two options would even allow LLVM to produce the optimal code on PowerPC and ARMv7, which would be an instruction sync only in the branch for the early exit. The fence would prevent LLVM from using an instruction sync at all. Given all that, I'm inclined to support switching back to a plain `Acquire` load.
I'm personally very wary of relaxing any orderings in the standard library, especially if we're coming up with the design ourselves. These are notoriously hard to get right, unfortunately.

Along those lines, do you have any benchmarks to show the impact of these relaxed orderings? It would be good to get a handle on what sort of runtimes we're talking about. I suspect x86 will benefit, but the greatest benefit is likely from ARM. I also suspect that we only really need to relax one ordering in this function to get any real benefit, which is the first one. None of the orderings really have any contention which needs to be super fast, so I would personally prefer to see what our best perf increase would be and then move as much as possible back to `SeqCst`.
@alexcrichton The benefit of relaxing SeqCst for the first instruction will likely be zero on x86 - the extra barrier for SeqCst comes on the write. Acquire/Release comes for free on x86, while the extra fence for sequential consistency adds (IIRC) about 100 cycles to each write.

I can understand a fear of using Ordering::Relaxed - I do tend to think it can leave code harder to change safely. On the other hand, I'd argue against insisting on full sequential consistency everywhere in the standard library that isn't absolutely performance critical. The difference between sequential consistency and acquire/release is not complicated for simple code like this, and given the broad scope of usage of standard library constructs, it seems a shame to spend cycles on obviously pointless work - even if the overall benefit is pretty minor.
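To make the x86 point concrete, here is a small snippet of my own (not from the PR): on x86-64 a `Release` store usually compiles to a plain `mov`, while a `SeqCst` store is usually compiled as an `xchg` (or a `mov` followed by `mfence`), which is where the extra write-side cost comes from.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static FLAG: AtomicUsize = AtomicUsize::new(0);

// Typically a plain `mov` on x86-64: release semantics come for free.
pub fn store_release() {
    FLAG.store(1, Ordering::Release);
}

// Typically an `xchg` (or `mov` + `mfence`) on x86-64: the full barrier is
// the extra cost of sequential consistency on the write side.
pub fn store_seqcst() {
    FLAG.store(1, Ordering::SeqCst);
}
```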
To me at least these orderings seem pretty non-trivial. I've found that to reason through these you basically have to reason about all orderings with respect to other orderings, and there's quite a few going on here.

In terms of performance I think we'll only really get wins by fixing the first few (the ones that check to see if the initialization has already run and finished). Only modifying those would be much easier to reason about (to me at least).
Yeah, I don't by any means intend to say that Acquire/Release vs Sequential Consistency is always trivial, just that it is in this particular case - if we were to ignore Relaxed and use Acquire/Release for everything, all the synchronization necessary to guarantee visibility of the results of the closure call occurs on the `self.cnt` variable. That being the case, sequential consistency is irrelevant, because to be useful over AcqRel it requires a minimum of two different atomic variables being altered in different threads.

All that said, any performance gain from this is obviously going to be too minor to quibble extensively about, so I'm of course perfectly happy to change this so that it just uses Acquire on the early exit checks. I'm busy tonight but I'll sort it out as soon as I can :-).
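For context, the standard example behind that claim is the store-buffering litmus test, sketched below (my own snippet, not part of the PR): it inherently needs two atomic variables, and with only `Release` stores and `Acquire` loads both threads may read `false`, whereas `SeqCst` on all four accesses rules that outcome out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    // Under SeqCst at least one thread must observe the other's store;
    // with Release stores and Acquire loads, (false, false) is allowed.
    assert!(r1 || r2);
}
```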
Ok, I think that making the first two operations unconditionally `Acquire` is fine. I'd want to confirm with others still, though.
Since we're talking about optimizing the hot path, I think it would be nice to split the cold path of `call_once` out into a separate function.
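For illustration, that split might look roughly like the following (a sketch with hypothetical names and an elided slow path, not the actual std implementation):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

pub struct Once {
    cnt: AtomicIsize,
    // remaining fields elided for the sketch
}

impl Once {
    #[inline]
    pub fn call_once<F: FnOnce()>(&'static self, f: F) {
        // Hot path: a single load, small enough to inline at every call site.
        if self.cnt.load(Ordering::Acquire) < 0 {
            return;
        }
        self.call_once_slow(f);
    }

    // Cold path: locking and bookkeeping live out of line.
    #[cold]
    fn call_once_slow<F: FnOnce()>(&'static self, f: F) {
        // slow-path logic (locking, running `f`, waking waiters) elided here
        let _ = f;
    }
}
```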
I would be ok doing so assuming benchmarks show that it's a sizable improvement.
I am not familiar enough with this code to have a strong opinion, and don't care to become so. ;) cc @aturon
rust-highfive assigned alexcrichton and unassigned nikomatsakis on Feb 17, 2016
+1 for cold path split
huonw added the T-libs label on Apr 26, 2016
This has been stuck in limbo for a while, so @rust-lang/libs should discuss it. I'm in favour of optimising it, but we should definitely be sure it is correct. I imagine there's other implementations of `Once` we can compare against.

As @alexcrichton says, it seems like we could at the very least just weaken the first ordering (together with its associated store).
huonw added the I-nominated label on Apr 27, 2016
The big issue here is that we need to construct a `Mutex`.
@Amanieu note that after #32325 we no longer have a mutex at all, so that concern is essentially moot. In my opinion the only ordering which needs to change is this one, which can likely be relaxed to `Acquire`.
My apologies, this completely fell off my radar. I shall try to take a look at it this weekend.
AlisdairO force-pushed the AlisdairO:once branch from 327289f to 15213ac on May 2, 2016
OK, looks like it's a pretty simple change at this point. @alexcrichton is correct on the potential performance impact, although alas I have no multi-core ARM machine to test on.
I ran some tests on ARM & AArch64 machines and the performance of `load(Acquire)` and `load(SeqCst)` is identical. However, I still think this change is the right thing to do.
The libs team discussed this during triage yesterday and the conclusion was that we don't want to merge this at this time. This unfortunately reduces readability, as the non-`SeqCst` orderings take extra care to reason about. Along those lines, without evidence in favor of this, we'd like to keep the code as it is today, but thanks regardless for the PR @AlisdairO!
alexcrichton closed this on May 4, 2016
Additionally, @alexcrichton looked at the ARM assembly, and apparently it was identical for both `Acquire` and `SeqCst`.
It seems that while `load(Acquire)` and `load(SeqCst)` do generate the exact same assembly on ARM, there is a difference on PowerPC:

`load(Acquire)`:

    ld 3, 0(3)
    lwsync

`load(SeqCst)`:

    sync
    ld 3, 0(3)
    lwsync

Since `sync` is more expensive than `lwsync`, this could make a difference on PowerPC.
Perhaps, but if we're just fishing for platforms where this makes a difference then we'll of course find such a platform. It still stands that no performance measurements show an improvement and it decreases code readability, so until that state changes we'll likely leave it as-is.
alexcrichton referenced this pull request on Sep 5, 2017: Relax orderings in std::sync::once #44331 (Closed)
I have a different opinion on readability. According to this paper,
@jeehoonkang heh I think you're far more well versed on this topic than I, so I'm definitely willing to defer to you! I was mostly just probing for rationale on #44331, as the rule of thumb seems to be "use `SeqCst` unless there's a strong reason not to".
AlisdairO commented on Feb 14, 2016
Just had a look at Once, and it looks like it's unnecessarily using SeqCst where acquire/release semantics would be sufficient. Would appreciate a review to be sure, but I can't see why it would require SeqCst.