Replace pthread `RwLock` with custom implementation #110211

joboet · 2023-04-11T22:29:45Z

This is one of the last items in #93740. I'm doing `RwLock` first because it is more self-contained and has fewer tradeoffs to make. The motivation is explained in the documentation, but in short: the pthread rwlock is slow and buggy and `std` can do much better. I considered implementing a parking lot, as was discussed in the tracking issue, but settled on the queue-based version because writing self-balancing binary trees is not fun in Rust...

This is a rather complex change, so I have added quite a bit of documentation to help explain it. Please point out any part that could be explained better.

~~The read performance is really good, I'm getting 4x the throughput of the pthread version and about the same performance as usync/parking_lot on an Apple M1 Max in the usync benchmark suite, but the write performance still falls way behind what usync and parking_lot achieve. I tried using a separate queue lock like what usync uses, but that didn't help. I'll try to investigate further in the future, but I wanted to get some eyes on this first.~~ Resolved

r? @m-ou-se
CC @kprotty

rustbot · 2023-04-11T22:29:53Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

- Stabilizing library features
- Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
- Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
- Changing public documentation in ways that create new stability guarantees
- Changing observable runtime behavior of library APIs


bors · 2023-04-14T18:24:58Z

☔ The latest upstream changes (presumably #110324) made this pull request unmergeable. Please resolve the merge conflicts.

joboet · 2023-04-18T19:23:31Z

I've found the bottleneck! Because the `QUEUED` bit was not checked in `lock_contended`, writers would spin unnecessarily while other threads were queued.

The implementation now uses a separate `QUEUE_LOCKED` bit and exponential backoff like usync does. This makes the performance very competitive with usync and parking_lot, outperforming both in some conditions.
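To illustrate the fix described above, here is a minimal sketch of a spin loop that backs off exponentially and stops spinning as soon as it sees queued threads. It is not the code from this PR: the bit layout (`LOCKED`, `QUEUED`), the `SPIN_ROUNDS` cap, and the function `try_spin_write` are all hypothetical names chosen for the example.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical state bits; the real layout in std may differ.
const LOCKED: usize = 1;
const QUEUED: usize = 2;
// Hypothetical cap on the number of backoff rounds.
const SPIN_ROUNDS: u32 = 7;

/// Tries to take a write lock by spinning with exponential backoff.
/// Returns `true` if the lock was acquired, and `false` if the caller
/// should stop spinning and enqueue itself instead.
fn try_spin_write(state: &AtomicUsize) -> bool {
    let mut round = 0;
    loop {
        let current = state.load(Ordering::Relaxed);
        if current & LOCKED == 0 {
            // The lock looks free: try to set LOCKED, preserving other bits.
            if state
                .compare_exchange_weak(
                    current,
                    current | LOCKED,
                    Ordering::Acquire,
                    Ordering::Relaxed,
                )
                .is_ok()
            {
                return true;
            }
            continue;
        }
        // Stop spinning if other threads are already queued,
        // or if the spin budget is used up.
        if current & QUEUED != 0 || round >= SPIN_ROUNDS {
            return false;
        }
        // Exponential backoff: 1, 2, 4, ... spin hints per round.
        for _ in 0..(1u32 << round) {
            spin_loop();
        }
        round += 1;
    }
}
```

The point of the `QUEUED` check is that a spinning writer cannot usefully overtake threads that are already parked in the queue, so spinning in that situation only burns CPU time.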

klensy · 2023-04-27T13:24:00Z

A few functions here take boolean arguments whose values appear to be known at compile time (i.e. true/false), so the optimizer can probably see that and inline them. Maybe try making these arguments const generics instead and see whether that works better?

joboet · 2023-05-02T15:00:30Z

> A few functions here take boolean arguments whose values appear to be known at compile time (i.e. true/false), so the optimizer can probably see that and inline them. Maybe try making these arguments const generics instead and see whether that works better?

At least locally, the performance doesn't change at all (that makes sense, as the condition in `lock_contended` is extremely predictable). Therefore, I would rather not do this, as it's harder to read and unnecessarily adds to binary size.
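For readers unfamiliar with the tradeoff being discussed, the following sketch contrasts a runtime boolean argument with a const generic one. The functions `describe_runtime` and `describe_const` are hypothetical stand-ins and do not appear in the PR.

```rust
/// Runtime flag: a single copy of the function that branches on `write`.
fn describe_runtime(write: bool) -> &'static str {
    if write { "write" } else { "read" }
}

/// Const generic flag: the compiler instantiates one copy per value of
/// `WRITE` that is used, and the branch is resolved at compile time
/// within each copy.
fn describe_const<const WRITE: bool>() -> &'static str {
    if WRITE { "write" } else { "read" }
}
```

A caller writes `describe_runtime(true)` in the first case and `describe_const::<true>()` in the second. The separate instantiations are where the extra binary size mentioned above would come from.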

bors · 2023-12-15T21:44:20Z

☔ The latest upstream changes (presumably #118996) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2024-01-13T16:22:57Z

☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2024-01-30T20:04:44Z

☔ The latest upstream changes (presumably #120496) made this pull request unmergeable. Please resolve the merge conflicts.


m-ou-se

I finally had the time to review this. It looks great! Simpler and a lot more readable than I was expecting. ^^

Just a few small comments:

library/std/src/sys/pal/unix/locks/queue_rwlock.rs

m-ou-se

Looks great to me!

Since this is a bunch of new unsafe+atomics code, I'd feel slightly more comfortable if another reviewer took a look at it too: @Amanieu, would you have time to also take a look at the soundness of this lock implementation?

Amanieu

r=me with the small nit addressed

library/std/src/sys/pal/unix/locks/queue_rwlock.rs

Amanieu · 2024-02-11T20:07:36Z

@bors r+

bors · 2024-02-11T20:07:39Z

📌 Commit 04282db has been approved by Amanieu

It is now in the queue for this repository.

bors · 2024-02-12T09:45:25Z

⌛ Testing commit 04282db with merge b17491c...

RalfJung · 2024-02-12T11:21:26Z

I recall for mutex we stuck to pthreads on some platforms because that gives the scheduler more information and it can apply priority boosting when a high-priority thread wants to take a lock from a low-priority thread (specifically, on macOS).

Do similar concerns not apply here?

joboet · 2024-02-12T11:38:27Z

I was just thinking about the same thing, but I think priority inheritance (PI) is not an issue here. For higher-priority readers, it can't practically be supported, as you'd have to keep track of all current readers. For higher-priority writers, while it's possible to support PI in principle, it appears that no platform actually does so (macOS doesn't (I think), NetBSD doesn't, QNX doesn't). At least I can't find any, and the POSIX standard does not mandate or facilitate PI for rwlocks; on the contrary, it warns against priority inversions.

bors · 2024-02-12T12:08:56Z

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing b17491c to master...


rust-timer · 2024-02-12T13:23:13Z

Finished benchmarking commit (b17491c): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.5% | [-0.5%, -0.5%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.5% | [-0.5%, -0.5%] | 1 |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 664.937s -> 663.718s (-0.18%)
Artifact size: 308.38 MiB -> 308.26 MiB (-0.04%)

RalfJung · 2024-02-26T07:13:05Z

So now that we have this, is it even worth still also having the Futex-based rwlock for Linux?

joboet · 2024-02-26T15:56:46Z

Probably yes? I don't have a Linux machine to run benchmarks on, but my guess would be that there is little difference in performance between the implementations. The Linux version is a lot simpler, so I don't think there is any harm in keeping it. The true benefit of this version is that we can finally get rid of the unsound lock code on SGX and provide a fast and well-tested (heh!) lock for all the weird platforms out there (e.g. xous just uses spinning right now).


Use queue-based `RwLock` on more platforms This switches over Windows 7, SGX and Xous to the queue-based `RwLock` implementation added in rust-lang#110211, thereby fixing rust-lang#121949 for Windows 7 and partially resolving rust-lang#114581 on SGX. TEEOS can't currently be switched because it doesn't have a good thread parking implementation. CC `@roblabla` `@raoulstrackx` `@xobs` Could you help me test this, please? r? `@ChrisDenton` the Windows stuff should be familiar to you

Rollup merge of rust-lang#123811 - joboet:queue_em_up, r=ChrisDenton


rustbot assigned m-ou-se Apr 11, 2023

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 11, 2023

kprotty reviewed Apr 11, 2023

kadiwa4 reviewed Apr 12, 2023

klensy reviewed Apr 12, 2023

joboet force-pushed the queue_lock branch from 97d8941 to e277f01 Compare April 15, 2023 13:02

joboet force-pushed the queue_lock branch from 5afc855 to 7e386df Compare November 30, 2023 13:43

rustbot added the O-unix Operating system: Unix-like label Nov 30, 2023

joboet force-pushed the queue_lock branch from 7e386df to d569775 Compare December 19, 2023 15:33

joboet force-pushed the queue_lock branch from d569775 to ebeae31 Compare January 23, 2024 09:48

joboet added 9 commits February 9, 2024 14:58

std: replace pthread RwLock with custom implementation inspired by usync

934eb8b

adjust code documentation

2e652e5

use braces to make operator precedence less ambiguous

280cbc5

avoid unnecessary Thread handle allocation

709ccf9

immediately register writer node if threads are queued

61ce691

use exponential backoff in lock_contended

8db64b5

queue_rwlock: use a separate QUEUE_LOCKED bit to synchronize waiter queue updates

16aae04

inline some single-use functions, add documentation

1fd9f78

format using latest rustfmt

69f55de

joboet force-pushed the queue_lock branch from ebeae31 to 69f55de Compare February 9, 2024 13:59

m-ou-se reviewed Feb 9, 2024

address review comments

ff44ae7

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 9, 2024

m-ou-se approved these changes Feb 9, 2024

Amanieu reviewed Feb 11, 2024

add doc-comment to unlock_queue

04282db

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 11, 2024

bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 12, 2024

bors merged commit b17491c into rust-lang:master Feb 12, 2024
12 checks passed

rustbot added this to the 1.78.0 milestone Feb 12, 2024

m-ou-se mentioned this pull request Feb 20, 2024

Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740

Closed

63 tasks

This was referenced Feb 26, 2024

Stacked Borrows violation in macOS RwLock #121626

Closed

rwlock: avoid Stacked Borrows violation #121630

Closed

joboet mentioned this pull request Apr 11, 2024

Use queue-based RwLock on more platforms #123811

Merged
