Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutexes, take 5 #11866

Closed
wants to merge 9 commits into from
Closed

Conversation

alexcrichton
Copy link
Member

Let's try this again.

This is an implementation of mutexes which I believe is free from undefined behavior of OS mutexes (the pitfall of the previous implementation).

This implementation is not ideal. There's a yield-loop spot, and it's not particularly fair with respect to lockers who steal without going through the normal code paths. That being said, I believe that this is a correct implementation which is a stepping stone to move from.

I haven't done rigorous benchmarking of this mutex, but preliminary results show that it's about 25% slower in the uncontended case on linux (same runtime on OSX), and it's actually faster than a pthreads mutex on high contention (again, not rigorous benchmarking, I just saw these numbers come up).

@bill-myers
Copy link
Contributor

It seems it's missing logic to alternate between waking up native_blocker and green_blocker.

Also, while it's kind of nice to have the symmetry between native_blocker and green_blocker, this design seems suboptimal compared to replacing code that accesses green_blocker with code that pops from the queue, which would avoid making green tasks deschedule and get rescheduled twice in the native+green task case (first on the queue, then on green_blocker).

This probably requires to merge green_cnt and state, by reserving the two least significant bits for LOCKED and NATIVE_BLOCKED. This would also reduce the size of struct Mutex by two words.

Overall, this implementation is quite nice and might be the optimal approach for performance in both the green-only and native-only cases.

It has, however, the significant caveat that it makes taking advantage of lock elision that should eventually be provided by all OS mutexes completely impossible due to always writing to state (maybe it can still be used by implementing it "by hand").

[I think the only workable alternative so far for automatic lock elision is to use https://github.com//pull/11610#issuecomment-33075571 after adapting to always lock/unlock on each call.]

Also, it doesn't reduce to just using pthread mutexes in the native-only case, which would be a problem for taking advantage of priority inheritance for native tasks where the OS mutex supports it and in general anything that requires the OS to know that a thread owns a mutex; this is fixable if desired by removing TryLockAcquisition by removing the initial try lock code in lock, and implementing try_lock in a similar way to lock (first try_locking the green or native parts, then the whole).

@alexcrichton
Copy link
Member Author

It seems it's missing logic to alternate between waking up native_blocker and green_blocker.

We actually get this for free because if a green thread holds the lock, you're guaranteed that the green blocker is empty (unless you've got a steal from a try_lock, as noted by the FIXME in the code).

which would avoid making green tasks deschedule and get rescheduled twice in the native+green task case

It's true, and this is even a problem with native tasks. This unlocks the OS mutex and then flags the outer mutex as available, so it's possible for the child thread to run first, go to sleep on a cvar, and then have to get woken up. I agree this is a little odd, but the focus now is to get something working rather than get the fastest thing working up front.

This probably requires to merge green_cnt and state

I can see how that'd work, nice idea! For now I'd rather focus on clarity, however. I'd like to get this merged in one form or another so we can start using it. I'll add a FIXME though so we don't forget that we can shrink down the size if we need to.

Overall, this implementation is quite nice

Thank you!

It has, however, the significant caveat that it makes taking advantage of lock elision that should eventually be provided by all OS mutexes completely impossible due to always writing to state

While regrettable, I'm feeling at this point we should get something merged before getting the perfect thing merged. I personally don't know enough about lock elision to write a mutex for it, but regardless that probably shouldn't be the focus this time around.

Also, it doesn't reduce to just using pthread mutexes in the native-only case

Hmm, that's not a bad idea, I think I'll play around with that and see how it works out. I was initially worried about the TLS hit, but my profiling found this to not be that expensive.

@bill-myers
Copy link
Contributor

We actually get this for free because if a green thread holds the lock, you're guaranteed that the green blocker is empty.

Indeed, I was wrong.

Yes, getting something merged is definitely a priority, since the implementation can be changed in a backwards compatible way, and this one seems to work.

The only significant issue regarding API and backwards compatibility is whether to commit to providing the "immediate destruction after unlock" guarantee to unsafe code, or to explicitly reserve the right to reimplement Mutex in a way that no longer provides it.

Also, what behavior is guaranteed when attempting to recursively lock the Mutex (always fails, always deadlocks, undefined).

I'm thinking lock elision will probably not require a redesign anyway, since it's likely going to be faster to do it ourselves, by simply starting an HTM transaction and checking that state == 0 inside; the runtime probably needs to be aware of it anyway since it seems it should abort any active transaction on green context switches.

@thestinger
Copy link
Contributor

@bill-myers: Keep in mind that the real lock elision API has better performance characteristics than RTM, and with RTM you still need a fallback abort path (in all likelihood, to a lock).

This also generalizes all atomic intrinsics over T so we'll be able to add u8
atomics if we really feel the need to (do we really want to?)
With the recently added double word CAS functionality on 32-bit ARM (enabled via
a 64-bit atomic instruction in LLVM IR), without some extra features enabled
LLVM lowers code to function calls which emulate atomic instructions. With the
v7 feature enabled, proper 64-bit CAS instructions are used on 32-bit arm.

I've been told that v7 for arm is what we should have been doing anyway. This is
overridable by providing some other non-empty feature string.
This allows for easier static initialization of a pthread mutex, although the
windows mutexes still sadly suffer.

Note that this commit removes the clone() method from a mutex because it no
longer makes sense for pthreads mutexes. This also removes the Once type for
now, but it'll get added back shortly.
This originally lived in std::unstable::mutex, but now it has a new home (and a
more proper one).
Now that extra::sync primitives are built on a proper mutex instead of a
pthreads one, there's no longer any use for this function.
bors added a commit that referenced this pull request Feb 3, 2014
Let's try this again.

This is an implementation of mutexes which I believe is free from undefined behavior of OS mutexes (the pitfall of the previous implementation).

This implementation is not ideal. There's a yield-loop spot, and it's not particularly fair with respect to lockers who steal without going through the normal code paths. That being said, I believe that this is a correct implementation which is a stepping stone to move from.

I haven't done rigorous benchmarking of this mutex, but preliminary results show that it's about 25% slower in the uncontended case on linux (same runtime on OSX), and it's actually faster than a pthreads mutex on high contention (again, not rigorous benchmarking, I just saw these numbers come up).
@bors bors closed this Feb 3, 2014
@alexcrichton alexcrichton deleted the atomic-u64 branch February 5, 2014 00:08
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 1, 2023
…ap_or_else_none, r=flip1995

Simplify code for `result_map_or_else_none`

As mentioned in rust-lang/rust-clippy#11864.

r? `@flip1995`

changelog: Simplify code for `result_map_or_else_none`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants