
Reduce contention in broadcast channel #6284

Closed
vnetserg wants to merge 19 commits

Conversation

vnetserg
Contributor

Motivation

Broadcast channel performance is suboptimal when there are many threads that frequently subscribe to new values, as they all have to contend for the same mutex. See #5465

Solution

The proposed change is to use an atomic linked list to store waiters. This way the tail mutex can be replaced with an RwLock, and adding new subscribers only requires a read lock. Justifying safety becomes trickier, though; I did my best in the comments.
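
To make the idea concrete, here is a minimal, hedged sketch of a list with CAS-based pushes. It is non-intrusive and simplified, the names are made up, and it is not the PR's actual code (the PR uses tokio's intrusive list and loom-aware primitives); it only illustrates why pushes can happen under a read lock while removal needs the write lock:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch only: a push is a single CAS on the head pointer and may run
// concurrently (e.g. under a read lock), while anything that walks or
// unlinks nodes needs exclusive access (the write lock in the PR).
struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

struct ConcurrentPushList<T> {
    head: AtomicPtr<Node<T>>,
}

unsafe impl<T: Send> Send for ConcurrentPushList<T> {}
unsafe impl<T: Send> Sync for ConcurrentPushList<T> {}

impl<T> ConcurrentPushList<T> {
    fn new() -> Self {
        Self { head: AtomicPtr::new(ptr::null_mut()) }
    }

    // Callable from many threads at once: the only shared mutation is the CAS.
    fn push_front(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut head = self.head.load(Ordering::Acquire);
        loop {
            unsafe { (*node).next = head };
            match self
                .head
                .compare_exchange_weak(head, node, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }

    // Takes `&mut self`, i.e. exclusive access, mirroring the rule that removal
    // only happens under the write lock. (Drop impl omitted, so a non-empty
    // list leaks in this sketch.)
    fn pop_front(&mut self) -> Option<T> {
        let head = *self.head.get_mut();
        if head.is_null() {
            return None;
        }
        let node = unsafe { Box::from_raw(head) };
        *self.head.get_mut() = node.next;
        Some(node.value)
    }
}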

The PR also includes a benchmark that on my machine shows a dramatic improvement in high contention scenarios:

contention/10           time:   [3.5144 ms 3.5248 ms 3.5358 ms]
                        change: [-2.6135% -1.9438% -1.2888%] (p = 0.00 < 0.05)

contention/100          time:   [13.183 ms 13.209 ms 13.237 ms]
                        change: [-36.660% -36.443% -36.215%] (p = 0.00 < 0.05)

contention/500          time:   [48.847 ms 49.682 ms 50.539 ms]
                        change: [-53.426% -52.584% -51.764%] (p = 0.00 < 0.05)

contention/1000         time:   [117.23 ms 118.28 ms 119.34 ms]
                        change: [-37.337% -36.527% -35.757%] (p = 0.00 < 0.05)

Implement atomic linked list that allows pushing
waiters concurrently, which reduces contention.

Fixes: tokio-rs#5465
github-actions bot added the R-loom-sync (Run loom sync tests on this PR) label Jan 14, 2024
@Darksonn
Contributor

What implementation are you using for making your linked list atomic? Is it a well-known implementation strategy, or something you came up with on your own?

@Darksonn added the A-tokio (Area: The main tokio crate) and M-sync (Module: tokio/sync) labels Jan 14, 2024
@Darksonn
Contributor

It seems like the loom tests are not running on this PR. I guess github must have changed something that broke our CI config:

loom-sync:
  name: loom tokio::sync
  # base_ref is null when it's not a pull request
  if: github.repository_owner == 'tokio-rs' && (contains(github.event.pull_request.labels.*.name, 'R-loom-sync') || (github.base_ref == null))
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install Rust ${{ env.rust_stable }}
      uses: dtolnay/rust-toolchain@master
      with:
        toolchain: ${{ env.rust_stable }}
    - uses: Swatinem/rust-cache@v2
    - name: run tests
      run: cargo test --lib --release --features full -- --nocapture sync::tests
      working-directory: tokio

Successfully passing the tokio::sync loom tests will be a prerequisite for merging something like this.

@Darksonn
Contributor

Hmm, looks like it started running after I merged master into your branch.

@vnetserg
Contributor Author

What implementation are you using for making your linked list atomic? Is it a well-known implementation strategy, or something you came up with on your own?

I came up with it on my own, although the idea behind it is quite trivial, so I'm sure it has been implemented by someone already. On the other hand, it is quite a special case of an atomic list: it allows adding elements atomically, but removing them still has to be done with some outside synchronization (in the case of this PR, a write lock). I can do some research and see if there are popular implementations of this kind of atomic list, if that would help.

tokio/src/util/linked_list.rs (outdated, resolved)
Comment on lines 424 to 432
/// Atomically adds an element first in the list.
/// This method can be called concurrently from multiple threads.
///
/// # Safety
///
/// The caller must ensure that:
/// - `val` is not pushed concurrently by multiple threads,
/// - `val` is not already part of some list.
pub(crate) unsafe fn push_front(&self, val: L::Handle) {
Contributor

This is an interesting implementation. It looks like it could be correct. I'll have to think more about it.

tokio/src/sync/broadcast.rs (outdated, resolved)
tokio/src/sync/broadcast.rs (outdated, resolved)
@carllerche
Member

carllerche commented Jan 16, 2024

What implementation are you using for making your linked list atomic? Is it a well-known implementation strategy, or something you came up with on your own?

I came up with it on my own, although the idea behind it is quite trivial, so I'm sure it has been implemented by someone already. On the other hand, it is quite a special case of an atomic list - it allows adding elements atomically, but removing them still has to be done with some outside synchronization (in case of this PR - a write lock). I can research and see if there are some popular implementations of this kind of atomic list, if it would help.

Here are two basic MPSC atomic linked list implementations:

They are different because intrusive lists have additional logic to handle ABA issues, as nodes can be removed and re-inserted concurrently with a separate insert. The provided atomic LL implementation is generic over the pointer type, which allows it to be used in an intrusive way. I haven't gotten to the broadcast channel changes yet, so it is possible that the broadcast channel uses it correctly, but I see the atomic LL implementation with generic handles as a hazard.

@@ -310,7 +311,7 @@ struct Shared<T> {
     mask: usize,

     /// Tail of the queue. Includes the rx wait list.
-    tail: Mutex<Tail>,
+    tail: RwLock<Tail>,
Member

RwLock implementations tend to be heavier than Mutex. In this case, it looks like all reads are of numbers. Another option is to make these cells AtomicUsize (or AtomicU64) and require writers to these cells to hold the tail mutex. Reads can do the atomic read directly.
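
A rough sketch of that alternative, with hypothetical field names (`pos`, `rx_cnt`): the numeric cells are hoisted out of the lock as atomics, writers still take the tail mutex before storing, and readers just load:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct Tail {
    // wait list and anything that genuinely needs the lock stays here
}

struct Shared {
    tail: Mutex<Tail>,
    pos: AtomicU64,    // next write position, only stored while holding `tail`
    rx_cnt: AtomicU64, // receiver count, only stored while holding `tail`
}

impl Shared {
    // Read path: no lock needed.
    fn current_pos(&self) -> u64 {
        self.pos.load(Ordering::Acquire)
    }

    // Write path: take the lock first, then update the atomic, so the value
    // stays consistent with the state guarded by the lock.
    fn advance_pos(&self) {
        let _tail = self.tail.lock().unwrap();
        self.pos.fetch_add(1, Ordering::Release);
    }
}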

Contributor Author

Are you saying that we could make Tail fields atomic? That would spare us the need to take a lock in some situations, but the main contention source is adding waiters to the list, which would still have to be done with a lock.

@carllerche
Member

So, it seems like the main thing here is to take a "regular" DLL and make the push operation concurrent with other push operations, but not with a pop operation. This is done by guarding the entire list with a single RwLock.


/// An atomic intrusive linked list. It allows pushing new nodes concurrently.
/// Removing nodes still requires an exclusive reference.
pub(crate) struct AtomicLinkedList<L, T> {
Member

If you stick with this strategy, I would rename this ConcurrentPushLinkedList or something explicit. Also, update the docs to be very clear that synchronization is required for all operations except concurrent push.

Member

Yeah, I'm inclined to agree with Carl --- the name "AtomicLinkedList" suggests that all links are atomic...

Member

also, it might be worth investigating whether there are performance benefits to using this list type in other places in tokio::sync. i think there are other synchronization primitives that currently force all tasks pushing to their wait lists to serialize themselves with a Mutex, which could potentially benefit from this type. but, i would save that for a separate branch.

Contributor Author

also, it might be worth investigating whether there are performance benefits to using this list type in other places in tokio::sync.

I was eyeing sync::Notify: it has a mutex that guards a list of waiters. #5464 reduced contention on the watch channel by utilizing 8 instances of Notify instead of one. Looks like the perfect candidate.

Contributor

Yes, there are definitely other places that could benefit from something similar. I'm also wondering if the timer could have a similar optimization. Some people experience significant contention in the timer.

However, timers have more contention on cancellation due to timeouts usually being cancelled...

@carllerche
Member

I think I am following. I think the LL implementation works, but the docs need to be updated to very clearly state the requirements for using it.

Also, if you want to explore more tweaks to the code, another option might be a two-lock strategy (similar to the Michael & Scott queue). In this strategy, the head and tail of the linked list are guarded by two separate locks. For this to work, you need a sentinel node. This might be interesting because RwLocks need to worry about fairness. If you decouple the head & tail locks, then push & pop no longer contend.

To remove a node, you would then have to acquire a write lock on both locks. If we assume cancellation is a lower-priority operation, this might work out to be better.
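
For reference, a hedged sketch of the two-lock (Michael & Scott style) queue being discussed, using boxed, non-intrusive nodes rather than the PR's intrusive list: push only takes the tail lock, pop only takes the head lock, and a sentinel node keeps them apart; removing an arbitrary node (cancellation) would need both locks, as noted above.

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Mutex;

// Drop impl omitted for brevity, so remaining nodes leak in this sketch.
struct Node<T> {
    value: Option<T>, // None only for the sentinel
    // Atomic because a push (writing sentinel.next) can race with a pop
    // (reading sentinel.next) when the queue is empty.
    next: AtomicPtr<Node<T>>,
}

struct TwoLockQueue<T> {
    head: Mutex<*mut Node<T>>, // points at the sentinel
    tail: Mutex<*mut Node<T>>, // points at the last node
}

unsafe impl<T: Send> Send for TwoLockQueue<T> {}
unsafe impl<T: Send> Sync for TwoLockQueue<T> {}

impl<T> TwoLockQueue<T> {
    fn new() -> Self {
        let sentinel = Box::into_raw(Box::new(Node {
            value: None,
            next: AtomicPtr::new(ptr::null_mut()),
        }));
        Self { head: Mutex::new(sentinel), tail: Mutex::new(sentinel) }
    }

    // Only contends with other pushes.
    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            value: Some(value),
            next: AtomicPtr::new(ptr::null_mut()),
        }));
        let mut tail = self.tail.lock().unwrap();
        unsafe { (**tail).next.store(node, Ordering::Release) };
        *tail = node;
    }

    // Only contends with other pops.
    fn pop(&self) -> Option<T> {
        let mut head = self.head.lock().unwrap();
        let next = unsafe { (**head).next.load(Ordering::Acquire) };
        if next.is_null() {
            return None;
        }
        // The first real node becomes the new sentinel; the old sentinel is freed.
        let old_sentinel = *head;
        *head = next;
        let value = unsafe { (*next).value.take() };
        unsafe { drop(Box::from_raw(old_sentinel)) };
        value
    }
}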


}

#[cfg(test)]
#[cfg(not(loom))]
Member

it would be nice to have loom tests for this, potentially...


@vnetserg
Contributor Author

Also, if you want to explore more tweaks to the code, another option might be a two-lock strategy (similar to the Michael & Scott queue). In this strategy, the head and tail of the linked list are guarded by two separate locks. For this to work, you need a sentinel node. This might be interesting because RwLocks need to worry about fairness. If you decouple the head & tail locks, then push & pop no longer contend.

That's an interesting strategy, I will read up on it. However, according to my experiments, the main contention comes from two places: pushing into the waiters list in Receiver::recv_ref and removing from the waiters list in Recv::drop. With the two-lock strategy, recv_ref will contend for the head lock, and Recv::drop may contend for both locks (although on the happy path it doesn't need locks). In the current implementation, the happy path is that recv_ref only takes a read lock and Recv::drop takes no lock at all.

The tradeoff between the current approach and the two-lock approach, as I see it, is that in the current implementation writers contend with each other and with readers, but readers don't contend with each other. In the two-lock approach, writers contend among themselves and readers among themselves. It looks like if we get into a situation where a lot of readers have to insert themselves into the waiters list, then writers are apparently slower than readers, so having readers not contend with each other is more beneficial than having them not contend with writers. What do you think?

@Darksonn
Contributor

If we're comparing to alternate approaches, then the existing sharding solution used by the watch channel shouldn't be forgotten. Just because the CAS operation is atomic doesn't mean that it doesn't cause contention.

@vnetserg
Contributor Author

If we're comparing to alternate approaches, then the existing sharding solution used by the watch channel shouldn't be forgotten. Just because the CAS operation is atomic doesn't mean that it doesn't cause contention.

This is true, sharding is a very valid option. Initially I wanted to implement it here, but I gave it up because of the complexity of the broadcast channel's interactions with the tail lock.

On the other hand, even with atomic pushes we can always fall back to sharding. If one list with atomic pushes performs better than one mutex-protected list, then a sharded comparison will probably also favor lists with atomic pushes (under high enough contention, that is; a mutex is probably faster when contention is low enough).

@carllerche
Member

That's an interesting strategy, I will read on that. However, according to my experiments, the main contention source is two places: pushing into the waiters list in Receiver::recv_ref and removing from the waiters list in Recv::drop.

Thanks for clarifying. I'm going to dig into the details a bit because the pattern of using a linked list for waiters is used everywhere in Tokio, so a change here could be generally applicable. For example, Readiness in scheduled_io.rs, used for all async fn I/O ops, follows the same pattern.

I went back to the original implementation to refresh my memory and Recv::drop does acquire the lock to read queued. It looks like the way you solved that was by making queued atomic so you can read it in drop without the lock. If queued is false, then don't acquire the lock. This makes a lot of sense. It also seems to be unrelated to making the push operation concurrent.

How much of the gain you observe is due to making queued an atomic vs the other changes? I don't think many other places in Tokio would benefit from making push concurrent, but I am not 100% sure about that.

Also, thinking more about the two-lock queue, I don't think it would support a doubly linked list.
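
To illustrate the `queued` fast path described above, here is a hedged sketch with made-up names; the real change needs careful reasoning about memory ordering and about when skipping the lock is actually safe:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct Waiter {
    queued: AtomicBool,
    // intrusive list pointers live here in the real code
}

struct Tail {
    // wait list guarded by the lock
}

fn drop_recv(waiter: &Waiter, tail: &Mutex<Tail>) {
    // Fast path: the sender already dequeued us (and cleared the flag),
    // so there is nothing to unlink and no lock to take.
    if !waiter.queued.load(Ordering::Acquire) {
        return;
    }
    // Slow path: we may still be linked, so take the lock and re-check
    // before unlinking.
    let _tail = tail.lock().unwrap();
    if waiter.queued.swap(false, Ordering::Relaxed) {
        // unlink `waiter` from the wait list here
    }
}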

@vnetserg
Contributor Author

How much of the gain you observe is due to making queued an atomic vs the other changes?

That's a good question, I made a separate branch to test it. The benchmark shows the following on my machine:

  • All the current changes vs master:
contention/10           time:   [3.4879 ms 3.4964 ms 3.5052 ms]
                        change: [-5.2124% -4.8615% -4.5131%] (p = 0.00 < 0.05)

contention/100          time:   [11.896 ms 11.938 ms 11.983 ms]
                        change: [-36.770% -36.518% -36.276%] (p = 0.00 < 0.05)

contention/500          time:   [39.120 ms 39.326 ms 39.565 ms]
                        change: [-63.954% -63.624% -63.294%] (p = 0.00 < 0.05)

contention/1000         time:   [76.746 ms 77.944 ms 78.971 ms]
                        change: [-59.626% -58.912% -58.183%] (p = 0.00 < 0.05)
  • Atomic waiter.queued vs master:
contention/10           time:   [3.6555 ms 3.6723 ms 3.6895 ms]
                        change: [+1.6649% +2.3708% +3.0364%] (p = 0.00 < 0.05)

contention/100          time:   [18.266 ms 18.300 ms 18.336 ms]
                        change: [-13.471% -13.245% -13.032%] (p = 0.00 < 0.05)

contention/500          time:   [75.161 ms 75.410 ms 75.631 ms]
                        change: [-22.168% -21.858% -21.588%] (p = 0.00 < 0.05)

contention/1000         time:   [126.48 ms 130.42 ms 134.38 ms]
                        change: [-43.144% -41.486% -39.650%] (p = 0.00 < 0.05)

The smallest benchmark was probably subject to noise, but all in all, it looks like this change alone explains somewhere between a third and a half of the performance improvement.

@vnetserg
Contributor Author

vnetserg commented Jan 18, 2024

I don't think many other places in Tokio would benefit from making push concurrent, but I am not 100% sure about that.

At the very least, I think sync::Notify is worth a shot. It has a mutex-guarded LL and I think it is safe to assume that, in a typical use-case, there are more subscriptions than notifications.

@Darksonn
Contributor

I discussed the linked list you proposed with Paul E. McKenney. He agrees that the concurrent insertion is correct, but he recommended that we go for sharding instead: "In my experience, sharding is almost always way better than tweaking locking and atomic operations. Shard first, tweak later, and even then only if necessary."

Using multiple Notify instances is one way, but there's also the linked list from #6001. We could use a hash of the address of the node as an index into the sharded list.
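
A small, hedged sketch of that sharding idea (using a plain Vec per shard for brevity; a real version would shard the intrusive wait list): the waiter node's address picks the shard, so concurrent subscribers mostly take different locks.

use std::sync::Mutex;

const SHARDS: usize = 8;

struct ShardedWaiters<T> {
    shards: [Mutex<Vec<T>>; SHARDS],
}

impl<T> ShardedWaiters<T> {
    fn new() -> Self {
        Self { shards: std::array::from_fn(|_| Mutex::new(Vec::new())) }
    }

    fn shard_index(node_addr: usize) -> usize {
        // Shift away alignment bits so nearby allocations spread out,
        // then reduce modulo the shard count.
        (node_addr >> 6) % SHARDS
    }

    fn push(&self, node_addr: usize, waiter: T) {
        let idx = Self::shard_index(node_addr);
        self.shards[idx].lock().unwrap().push(waiter);
    }
}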

@vnetserg
Contributor Author

vnetserg commented Jan 19, 2024

Ok, I will look into sharding, probably better as a separate PR.

Do you think it is worth creating a PR to make queued an AtomicBool, as a low-hanging 15-40% boost?

@Darksonn
Contributor

Yes, that sounds simple. I think that would be a good PR.
