
Add a Mutex, AtomicU64, and deal with lots of related fallout #11610

Closed
wants to merge 13 commits

Conversation

alexcrichton
Member

This is the follow-up PR to #11520, #11462, and #11509. This takes the original approach from #11462 for the mutex (due to my comment).

I had originally added an AtomicU64 for usage in #11520's implementation, and I figured I may as well leave it implemented because it may turn out to be useful someday.

@brson
Contributor

brson commented Jan 17, 2014

@sanxiyn @yichoi Is v7 the ARM we should be using by default?

@yichoi
Contributor

yichoi commented Jan 17, 2014

@brson yep

@brson
Contributor

brson commented Jan 18, 2014

I didn't see where extra::sync got converted to the new mutex from send_deferred.

@bill-myers
Contributor

I see that you went with this approach to maximize native performance (which might be possible to improve in #11520, haven't really tried), but is an implementation that is completely unfair in the combination of native and green threads really acceptable to ship as a core synchronization primitive?

As written, a native task will always be blocked forever on lock() if there are enough contending green tasks, which seems to make a Mutex unreliable in such a configuration.

In fact, as far as I can tell, a set of N + 1 green tasks, where N is the number of concurrent task schedulers (i.e. typically the number of CPU cores), that take the mutex in a loop without sleeping/scheduling while the mutex is unlocked (perhaps because they do all their sleeping while holding the mutex) seems to be enough to ensure that native tasks never get the mutex.

And if such a configuration is not really usable and native performance must be optimal, then maybe the best option is to ban mixing native and green tasks for either a given mutex (create one or the other depending on the type of the task that calls Mutex::new()) or the whole process, which means that one can have two separate implementations of all primitives.

That way, one can directly use the OS primitives in the native case, and use the algorithms that the Linux kernel uses for its own sleepable primitives for the green tasks, providing optimal performance and eliminating the need to engineer primitives from scratch for the unique Rust use case of mixed native/green tasks.

[There's also the issue that using yields is bad (e.g. if one sends SIGSTOP or breaks a thread in a debugger at the wrong time, the others will be spinning forever) and neither glibc nor Linux use them in library code, but at least that doesn't impact the correctness of programs.]

@brson
Contributor

brson commented Jan 19, 2014

Is there a modification to this strategy that can prevent native thread starvation in the presence of green thread contention?

@bill-myers
Contributor

It might be possible to always unlock and relock the lock on unlock (at some performance cost, and perhaps losing the guarantee) and then have native tasks check that the queue is empty, to prevent the opposite problem (green tasks being starved).

@bill-myers
Contributor

I think there is at least one race condition in the code: as the comment in the code says, one can ignore the case when the MPSC queue is inconsistent IF the pusher is guaranteed to see that held has been set to false.

However, there is no fence that guarantees that the self.held.store(false, atomics::Relaxed) won't be moved after the q.pop().

It looks like there needs to be a SeqCst fence there and a SeqCst fence after the push in lock(), or alternatively either the MPSC queue or the held variable needs to be manipulated with SeqCst (acquire and release don't seem to be enough, since they aren't StoreLoad fences).

Also, there should be at least a compiler optimization barrier (an empty volatile asm!() with a memory clobber should work, not sure if LLVM has something better) after the write to held in lock(), since otherwise the compiler could in principle move that store arbitrarily down after inlining lock(), resulting in green tasks spinning for a long time; likewise, there should be at least a compiler barrier before the held store in unlock() for the same reason.

#[inline(never)] on both lock() and unlock() would also work, but with the negative side effect of always preventing inlining.

A fence might also be needed to make the store visible to other threads promptly, but I think all real CPUs guarantee that stores are made visible as soon as possible, so it should not be needed.

I'm not sure whether that makes it free from race conditions or if there are others.
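
For illustration, here is a minimal sketch of the unlock-side ordering concern described above, written with today's std::sync::atomic names rather than the 2014 atomics module; the GreenMutex type and the commented-out queue call are hypothetical stand-ins, not the actual code in this pull request.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-in for the mutex discussed above; the real type also
// owns an MPSC queue of parked green tasks, elided here.
struct GreenMutex {
    held: AtomicBool,
}

impl GreenMutex {
    fn unlock(&self) {
        // With a Relaxed store, the loads inside the subsequent q.pop()
        // could be reordered before this store (x86 permits StoreLoad
        // reordering), so a pusher might never observe `held == false` and
        // could spin or sleep forever. A SeqCst store (an xchg on x86), or a
        // Relaxed store followed by a SeqCst fence, provides the StoreLoad
        // barrier the comment above calls for.
        self.held.store(false, Ordering::SeqCst);
        // if let Some(task) = self.q.pop() { task.reawaken(); }
    }
}
```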

@bill-myers
Contributor

I changed the code in #10520 (see https://github.com/bill-myers/rust/tree/mutex-lockless) to start cmpxchg loops with a relaxed load, which seems to be a bit faster on Haswell compared to starting with an optimistic cmpxchg.

After that change to #10520, and adding SeqCst to the held store in unlock in this pull (which means it's going to use xchg instead of mov), they both take around 2.5 seconds real time to unlock a mutex a million times from each of 8 native tasks on an HT 4-core mobile 2.2GHz Haswell (with Linux on bare hardware), while using unstable::Mutex directly takes 1.7 seconds.

The original racy code in this pull after removing the assert!() on held results in 2 seconds, and completely removing both held load and store results in 1.85 seconds (but both are incorrect), while #10520 without the optimization took around 2.85 seconds.

So it seems that this closes the gap on my machine, but Alex Crichton's results seemed to show a larger gap in the first place, so not sure what it would look like there with those changes (was the benchmark on #10520 perhaps done with 64-bit atomics on x86-32?)
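
As an aside, the "start cmpxchg loops with a relaxed load" change mentioned above looks roughly like the following sketch (modern std::sync::atomic API rather than the 2014 one): read the current value cheaply first, so the expensive locked compare-and-swap is only attempted against a value that has a chance of matching.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Begin the CAS loop with a plain (Relaxed) load instead of an optimistic
// cmpxchg, then retry with the value observed on failure.
fn locked_increment(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::SeqCst,  // ordering on success
            Ordering::Relaxed, // ordering on failure: we just re-read the value
        ) {
            Ok(_) => return,
            Err(observed) => current = observed, // retry with the fresh value
        }
    }
}
```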

@alexcrichton
Member Author

I've updated with what @brson and I talked about today:

  • Every atomic has a SeqCst ordering. This does have a slowdown in the benches that I had. If I put Relaxed in a few locations (so the stores aren't xchgs) it's about 5-10% slower than a pthread mutex. Right now it's slower than that, but I believe that using SeqCst until we understand the orderings better is the right way forward.
  • The mutex is a little more fair (still not completely fair) by alternating who is woken up (native threads favor green and green threads favor native).
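
A rough sketch of what that alternation could look like (entirely hypothetical names and data layout, not the code in this pull request): whoever unlocks prefers to hand the mutex to a waiter of the other kind, so neither kind can indefinitely starve the other.

```rust
// Hypothetical handle types for parked waiters.
struct GreenTask;
struct NativeThread;

enum Waiter {
    Green(GreenTask),
    Native(NativeThread),
}

// On unlock, prefer to wake a waiter of the *other* kind than the unlocker.
fn pick_next(unlocker_is_green: bool,
             green_waiters: &mut Vec<GreenTask>,
             native_waiters: &mut Vec<NativeThread>) -> Option<Waiter> {
    if unlocker_is_green {
        native_waiters.pop().map(Waiter::Native)
            .or_else(|| green_waiters.pop().map(Waiter::Green))
    } else {
        green_waiters.pop().map(Waiter::Green)
            .or_else(|| native_waiters.pop().map(Waiter::Native))
    }
}
```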

@bill-myers
Contributor

Is this still faster than https://github.com/bill-myers/rust/tree/mutex-lockless on 64-bit Linux on x86 when using atomics::SeqCst?

Comparing performance with atomics relaxed makes little sense, since that means benchmarking an incorrect implementation, and the atomics are in fact the main source of slowdown.

Anyway, I think both https://github.com/bill-myers/rust/tree/mutex-lockless and this should be acceptable from a "doesn't break programs" perspective.

With the key insight of alternating between green and native tasks, I think there is actually a third option, which would involve taking this pull request, and removing the need to yield, in the following way:

  1. Replace the MPSC queue with the one from Alternative lockless Mutex implementation (2nd try) #11520, which doesn't need yields and makes it possible to check, without holding the lock, that the queue is either empty or that someone else has popped, by testing q.producer == 0
  2. Remove the held variable, assume that green tasks can sleep if trylock fails
  3. In unlock after doing the q.pop(), getting None and unlocking the mutex, do an extra check for q.producer != 0, and if it succeeds, then try_lock, and if that succeeds too, retry the unlocking procedure from the start

Compared to this pull, it would remove yields and reduce the size of the mutex structure (yes, the mutex would be touched after unlocking, but that's not a problem in Rust as argued in #11520).

I think this might perhaps be the best option.
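
A purely illustrative, self-contained sketch of the unlock path described in steps 1-3 above; the `WaitQueue`, `OsMutex`, and `GreenTask` types and their stubbed methods are invented stand-ins, not the real internals of either pull request.

```rust
struct WaitQueue;
struct OsMutex;
struct GreenTask;

impl WaitQueue {
    fn pop(&self) -> Option<GreenTask> { None }   // stub
    fn producer_count(&self) -> usize { 0 }       // stub for `q.producer`
}
impl OsMutex {
    fn unlock(&self) {}                           // stub
    fn try_lock(&self) -> bool { false }          // stub
}
impl GreenTask {
    fn wake_and_hand_off_lock(self) {}            // stub
}

struct HybridMutex { q: WaitQueue, os: OsMutex }

impl HybridMutex {
    fn unlock(&self) {
        loop {
            if let Some(task) = self.q.pop() {
                // A green task is queued: transfer ownership to it directly
                // instead of releasing the OS lock.
                task.wake_and_hand_off_lock();
                return;
            }
            // Nobody appeared to be queued: release the OS mutex.
            self.os.unlock();
            // Step 3: a pusher may have raced with the pop above; a zero
            // producer count means the queue really is empty and we're done.
            if self.q.producer_count() == 0 {
                return;
            }
            // Someone pushed concurrently. Retry the whole procedure only if
            // we can cheaply re-acquire the lock; otherwise the racer (or
            // another thread) now owns it and will run unlock itself later.
            if !self.os.try_lock() {
                return;
            }
        }
    }
}
```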

BTW, regarding the TLS hit, having an "is_nonblocking_task" thread-local variable (which needs to be written only when a green task scheduler starts or stops) rather than having to fetch the task would speed it up, and perhaps make it possible to remove the is_blocking_locker variable (and eventually, one could ask the glibc maintainers to reserve a bit in the glibc thread structure for this purpose).
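
The thread-local flag idea could look something like this (hypothetical names, using today's thread_local! macro rather than the 2014 TLS machinery):

```rust
use std::cell::Cell;

thread_local! {
    // True while a green-task scheduler is running on this OS thread; only
    // written when a scheduler starts or stops on it.
    static IS_GREEN_SCHEDULER: Cell<bool> = Cell::new(false);
}

// Called by the scheduler when it takes over or releases this thread.
fn set_green_scheduler(active: bool) {
    IS_GREEN_SCHEDULER.with(|flag| flag.set(active));
}

// lock() can branch on this instead of fetching the whole task from TLS.
fn current_task_may_block() -> bool {
    IS_GREEN_SCHEDULER.with(|flag| !flag.get())
}
```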

@alexcrichton
Member Author

I do believe that the performance comparison with Relaxed is a fair one. The only reason these use SeqCst is that I don't personally feel I have a good enough understanding of atomics to relax the constraints. This causes a store to become an xchg on x86, but that is completely unnecessary and it can most certainly just be a mov (whereas a cmpxchg must always be a compare-and-swap).

As I've discussed before, relaxing the constraint of re-using the mutex in unlock is something I believe to be unacceptable. There is code which does this today in Rust (seen here), and "just use an Arc" cannot be the solution.

@bill-myers
Contributor

The store to held in unlock must be an xchg, since otherwise the loads in q.pop() can be moved before it.

The fetch_and_sub and fetch_and_add must be lock xadd since just xadd is not even atomic::Relaxed.

Loads are implemented as mov regardless of atomic memory model.

The stores to held in lock() can indeed be relaxed.

So I think code generation with all SeqCst is the optimal one on x86 (although SeqCst might not be the optimal choice otherwise), except for storing true in held, which can and should be relaxed.

@alexcrichton
Member Author

The store to held in unlock must be an xchg, since otherwise the loads in q.pop() can be moved before it.

The x86 architecture does not do this, but the compiler may. The SeqCst ordering tells the compiler not to reorder it, and I don't know the best way to tell the compiler "don't reorder this, and don't make it an xchg".

The fetch_and_sub and fetch_and_add must be lock xadd since just xadd is not even atomic::Relaxed.

This is only in the contended case, so speed is much less of a worry once you've reached this situation

@bill-myers
Contributor

No, x86 CPUs with out-of-order execution (i.e. all the desktop/mobile ones) can move loads before stores (to a different memory location, of course) unless the store is an xchg or there is a locked op, xchg, or fence in between.

So assuming q.pop() checks that the queue is empty with only reads, then storing false to held must be done with an xchg (or mov + mfence or similar).

See https://en.wikipedia.org/wiki/Memory_ordering, "Stores reordered after loads", for instance.

@thestinger
Contributor

@alexcrichton: There are still multiple levels of atomic ordering guarantees on x86 itself. You'll get different instructions from relaxed and sequential consistency. The compiler is still free to reorder the atomic instructions within a thread if it can figure out what's going on. IIRC, we mark all atomic operations as volatile unnecessarily so we're probably not getting code generation as good as C++ atomics. They should be split out into volatile_ intrinsics or we should approach volatile in another way.

@brson
Contributor

brson commented Jan 23, 2014

I'm satisfied with this approach and want to merge it. Further improvements can be made in future pull requests.

This also generalizes all atomic intrinsics over T so we'll be able to add u8 atomics if we really feel the need to (do we really want to?)

With the recently added double-word CAS functionality on 32-bit ARM (enabled via a 64-bit atomic instruction in LLVM IR), LLVM lowers the code to function calls which emulate atomic instructions unless some extra features are enabled. With the v7 feature enabled, proper 64-bit CAS instructions are used on 32-bit ARM.

I've been told that v7 for ARM is what we should have been doing anyway. This is overridable by providing some other non-empty feature string.

This allows for easier static initialization of a pthread mutex, although the Windows mutexes still sadly suffer.

Note that this commit removes the clone() method from a mutex because it no longer makes sense for pthreads mutexes. This also removes the Once type for now, but it'll get added back shortly.

There are no current consumers of this mutex, but more will be added soon.

This originally lived in std::unstable::mutex, but now it has a new home (and a more proper one).

Now that extra::sync primitives are built on a proper mutex instead of a pthreads one, there's no longer any use for this function.

The comments are in the commit itself, but the general idea is to have green threads relinquish control to native blockers and to have native blockers relinquish control to green threads. This should help keep things a little more fair instead of just always favoring green threads.

This is becoming our general policy until we have a better understanding of these orderings.

It turns out that Windows critical sections will succeed in EnterCriticalSection or TryEnterCriticalSection if the calling thread already owns the mutex, but this is obviously bad for us because there could be two contending green threads on the same OS thread.

With some extra checks of `self.held` in try_lock(), I believe that we can circumvent this limitation with minimal impact to the mutex.
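
As a purely illustrative sketch of that extra `self.held` check (an invented `OsCriticalSection` type with stubbed methods, and modern std atomics rather than the 2014 API; not the actual commit):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Invented stand-in for a Windows critical section.
struct OsCriticalSection;
impl OsCriticalSection {
    fn try_enter(&self) -> bool { true }   // stub for TryEnterCriticalSection
    fn leave(&self) {}                     // stub for LeaveCriticalSection
}

struct WinAwareMutex {
    cs: OsCriticalSection,
    held: AtomicBool,
}

impl WinAwareMutex {
    fn try_lock(&self) -> bool {
        if self.held.load(Ordering::SeqCst) {
            // Some task already owns the mutex, possibly a green task on this
            // very OS thread, in which case TryEnterCriticalSection would
            // wrongly succeed recursively.
            return false;
        }
        if !self.cs.try_enter() {
            return false; // genuinely contended by another OS thread
        }
        if self.held.swap(true, Ordering::SeqCst) {
            // We raced with (or recursively re-entered on behalf of) the real
            // owner: back out and report failure.
            self.cs.leave();
            return false;
        }
        true
    }
}
```
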
@bill-myers
Contributor

I think the only simple way of solving this is to switch to locks that don't have special behavior when recursively locked.

On Windows Vista, I think you can use SRW locks for that.

On Windows XP I think it should work to implement mutexes as an auto-reset event that starts as signalled, using WaitForSingleObject to lock it and SetEvent to unlock it.

Otherwise, you need to, at least:

  1. Forbid running native task code on a thread after it scheduled any green tasks (otherwise that native task code might not be able to block on a mutex taken by that same native thread when it was executing a green task)
  2. Turn self.held into a tri-state value, and add a new state for when a thread is between the "self.held = true" and the lock.unlock() in unlock so that you don't race on q.pop() due to try_lock succeeding early in the green task lock case

BTW, I just noticed that locking a mutex on a native thread and unlocking it on another is undefined behavior according to POSIX, which means that Rust should refuse to compile for any unknown POSIX platform, and it should be checked that this does indeed work on all supported POSIX platforms (I think Linux should be fine as long as one uses PTHREAD_MUTEX_NORMAL, but haven't checked in detail).

@thestinger
Contributor

@bill-myers: It's not a good idea to depend on undefined behavior based on glibc implementation details. They're willing to break any such incorrect code at any time, and lock elision may do exactly that as it matures and becomes the default.

@bill-myers
Contributor

@thestinger If your argument is accepted, then this Mutex implementation must be completely redesigned (the other approaches I proposed share this issue, although they may or may not more easily accommodate the changes).

Lock elision is actually a really important point; we can of course request to disable it, but it would be nice to take advantage of it both for native and for green tasks.

Maybe we really need to go back to the drawing board, make all tasks always drop the OS mutex on unlock, and redesign everything as needed to accommodate that?

@thestinger
Contributor

Here's an example of their willingness to break anything relying on undefined behavior: https://bugzilla.redhat.com/show_bug.cgi?id=638477

@bill-myers
Contributor

glibc doesn't seem to have any documentation of pthread functions other than the POSIX specification.

So, the problem with always unlocking is that it would starve green tasks, at least if the native OS mutex implements FIFO wakeups, since green tasks cannot block and thus cannot reserve a spot in the OS mutex FIFO queue.

The only solution is to make new native tasks queue as well in the MPSC queue once any green tasks are queued.

This is what #11520 did, but I think the code in this pull request can also be changed to do that.

@alexcrichton
Member Author

In something as critical as mutex, I don't want to rely on undefined behavior in POSIX.

I didn't realize how much of a problem it is to unlock a mutex on a different thread than the one that originally locked it. At this point I'm thinking that holding an OS mutex across the lock/unlock call is unacceptable for green threads. It's possible for green threads to migrate between OS threads, so you have no guarantee that the green thread will unlock on the same OS thread.

For now, I'm going to close this because this still has work to hash out.

@alexcrichton
Member Author

I have an idea of how to get around this.

@bill-myers mentioned this pull request Jan 28, 2014
flip1995 pushed a commit to flip1995/rust that referenced this pull request Oct 21, 2023
…nt in async function for the `needless_pass_by_ref_mut` lint
flip1995 pushed a commit to flip1995/rust that referenced this pull request Oct 21, 2023
…ut-regression-test-11610, r=blyxyas

Add regression test for rust-lang#11610 about mutable usage of argument in async function for the `needless_pass_by_ref_mut` lint

Fixes rust-lang/rust-clippy#11610.

This was already fixed. I simply added a regression test.

changelog: Add regression test for rust-lang#11610 about mutable usage of argument in async function for the `needless_pass_by_ref_mut` lint