time: lazy init TimerShared in TimerEntry #6512
Conversation
See tokio-rs#6504. Below are relevant benchmark results of this PR on an M1 Mac:

```
single_thread_timeout   time:   [21.869 ns 21.987 ns 22.135 ns]
                        change: [-3.4429% -2.0709% -0.8759%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

multi_thread_timeout-8  time:   [4.4835 ns 4.6138 ns 4.7614 ns]
                        change: [-4.3554% +0.1643% +4.5114%] (p = 0.95 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
```

Below are relevant benchmark results of the current version on an M1 Mac:

```
single_thread_timeout   time:   [40.227 ns 40.416 ns 40.691 ns]
                        change: [+81.321% +82.817% +84.121%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

multi_thread_timeout-8  time:   [183.16 ns 186.02 ns 188.21 ns]
                        change: [+3765.0% +3880.4% +3987.4%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  6 (6.00%) low mild
```
Force-pushed from 7f0c3f3 to bde0742.
Below are relevant benchmark results of this PR on Linux AMD64:
Below are relevant benchmark results of the current version on Linux AMD64:
Should we lazily initialize the `TimerEntry` as well?

Lazily initializing the `TimerEntry` …

I think so. After some time, I find there are many custom …
tokio/src/time/sleep.rs
Outdated
```diff
 pub struct Sleep {
     inner: Inner,

+    deadline: Instant,
+    handle: scheduler::Handle,
     // The link between the `Sleep` instance and the timer that drives it.
     #[pin]
-    entry: TimerEntry,
+    entry: Option<TimerEntry>,
 }
```
How does this change the size of `Sleep`? Could `TimerEntry` be changed to reduce the change?
I think this should be possible. Your idea looks better, I'll give it a try.
I put `deadline` and `handle` in `Sleep` here so that the `TimerEntry` is not initialized, but I have to put them somewhere to save them temporarily. Saving `handle` here ensures that `timeout` will still panic if there is no runtime handle in TLS, or if we did not call `enable_time` or `enable_all`.
The size of `Sleep` is an issue worth considering. Is there any way to solve this problem?
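On the size question: whether wrapping a field in `Option` grows a struct depends on whether the payload has a niche the compiler can reuse for the discriminant. A quick illustration with stand-in types (`Big` below is a hypothetical placeholder, not tokio's `TimerEntry`):

```rust
use std::mem::size_of;

fn main() {
    // A type with a niche (a non-null pointer): `Option` is free.
    assert_eq!(size_of::<Option<Box<u8>>>(), size_of::<Box<u8>>());

    // A type with no niche: `Option` adds a tag plus alignment padding.
    assert_eq!(size_of::<u64>(), 8);
    assert_eq!(size_of::<Option<u64>>(), 16);

    // For a large payload the relative growth is small (64 -> 72 bytes).
    struct Big([u64; 8]); // hypothetical stand-in for a timer entry
    assert_eq!(size_of::<Big>(), 64);
    assert_eq!(size_of::<Option<Big>>(), 72);
    println!("ok");
}
```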
What happens if you just don't call `clear_entry` in this method when it hasn't yet been registered? Based on @conradludgate's comment here, it sounds like this triggers a loom failure, but that's most likely due to some path where the timer is dropped concurrently with firing, or something like that. If we've never been registered with the driver, then not calling `clear_entry` should be okay.
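The "skip deregistration when never registered" idea can be sketched with a flag; this is a simplified model with hypothetical names and a counter standing in for the lock-taking driver call, not tokio's actual loom-checked drop path:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Counts how often the (expensive, lock-taking) deregistration runs.
static CLEAR_CALLS: AtomicUsize = AtomicUsize::new(0);

struct TimerEntry {
    registered: AtomicBool,
}

impl TimerEntry {
    fn new() -> Self {
        TimerEntry { registered: AtomicBool::new(false) }
    }

    // Called when the entry is first linked into the driver's wheel.
    fn register(&self) {
        self.registered.store(true, Ordering::Release);
    }

    // Stand-in for taking the driver lock and unlinking the entry.
    fn clear_entry(&self) {
        CLEAR_CALLS.fetch_add(1, Ordering::Relaxed);
    }
}

impl Drop for TimerEntry {
    fn drop(&mut self) {
        // Only touch the driver (and its global lock) if we were ever
        // registered; an entry that never linked in has nothing to unlink.
        if self.registered.load(Ordering::Acquire) {
            self.clear_entry();
        }
    }
}

fn main() {
    drop(TimerEntry::new()); // never registered: no driver access
    let e = TimerEntry::new();
    e.register();
    drop(e); // registered: must deregister
    assert_eq!(CLEAR_CALLS.load(Ordering::Relaxed), 1);
}
```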
In my test, it improves performance by reducing lock contention when the `timeout` always registers with the driver. On the other hand, if the `timeout` never registers with the driver, it does not bring a significant performance improvement.
That said, maybe we can use `inner: StdUnsafeCell<Option<TimerShared>>` for this purpose; the size of `Sleep` should barely change. Avoiding `clear_entry` for unregistered `timeout`s is not necessary. In this case, we should avoid constructing the intrusive linked-list item.
Thank you for doing that benchmark. It sounds like this particular change is still a good idea, but we should reconsider the other changes mentioned on the original issue. That said, I wonder what we would get on a benchmark where the timer always gets registered with the driver.
Forgive me. I saw it wrong :)
Of course, if all timers are registered with the driver and the concurrency is very high, then lock contention will become the most important concern. The flame graph showed how much time the global lock operations take when running the following workload:

```rust
use std::future::pending;
use std::time::Duration;
use tokio::time::timeout;

fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .worker_threads(8)
        .build()
        .unwrap();
    let _r = runtime.block_on(async {
        let mut handles = Vec::with_capacity(1024);
        for _ in 0..1024 {
            handles.push(tokio::spawn(async move {
                loop {
                    for i in 1..10 {
                        for _ in 0..100 {
                            let h = timeout(Duration::from_millis(i), never_ready_job2(i));
                            let _r = h.await;
                        }
                    }
                }
            }));
        }
        for handle in handles {
            handle.await.unwrap();
        }
    });
}

pub fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

// A job that is never ready, so every `timeout` must register with the driver.
async fn never_ready_job2(n: u64) -> u64 {
    let res = fibonacci(n + 5); // do some logic
    pending::<()>().await;
    res
}
```

A global lock under many threads usually does not perform well. So, going back to the issue itself, both methods may need to be applied to reduce the performance overhead of `timeout`.
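One common way to attack that kind of global-lock contention is to shard the lock so that concurrent registrations usually hit different mutexes. This is a sketch of the general technique with invented names, not what tokio's driver does in this PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Each timer hashes to one of N independent shards, so concurrent
// registrations rarely contend on the same mutex.
struct ShardedTimers<const N: usize> {
    shards: [Mutex<Vec<u64>>; N], // Vec<u64> stands in for a wheel level
}

impl<const N: usize> ShardedTimers<N> {
    fn new() -> Self {
        ShardedTimers { shards: std::array::from_fn(|_| Mutex::new(Vec::new())) }
    }

    fn shard_for(&self, timer_id: u64) -> usize {
        let mut h = DefaultHasher::new();
        timer_id.hash(&mut h);
        (h.finish() as usize) % N
    }

    fn register(&self, timer_id: u64) {
        self.shards[self.shard_for(timer_id)].lock().unwrap().push(timer_id);
    }

    fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}

fn main() {
    let timers = ShardedTimers::<8>::new();
    for id in 0..100u64 {
        timers.register(id);
    }
    assert_eq!(timers.len(), 100);
}
```

The trade-off is that any operation needing a global view (e.g. finding the next expiring timer across all shards) must now visit every shard.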
By lazily initializing the `TimerShared`, we get the results below.

The results of this PR:

The results of master:

To achieve all of this, the price is that the byte size of `Sleep` grows.
Apparently I didn't actually submit this comment. It's been "pending" for 5 days. Sorry about that.
tokio/src/runtime/time/entry.rs
Outdated
```diff
 fn inner(&self) -> &TimerShared {
-    unsafe { &*self.inner.get() }
+    unsafe { &mut *self.inner.get() }.get_or_insert(TimerShared::new())
 }
```
Can you change this to only create a mutable reference if it is `None`?
I have used an additional `is_done` check to achieve this goal.
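The pattern the thread settles on, only forming a mutable reference on the uninitialized path, might look roughly like this (a stand-in `TimerShared` and a simplified check instead of the actual `is_done` flag; the real code must also justify the `unsafe` against concurrent access, which this toy does not):

```rust
use std::cell::UnsafeCell;

struct TimerShared {
    state: u64, // stand-in field; the real type holds the driver linkage
}

struct TimerEntry {
    inner: UnsafeCell<Option<TimerShared>>,
}

impl TimerEntry {
    fn new() -> Self {
        TimerEntry { inner: UnsafeCell::new(None) }
    }

    fn inner(&self) -> &TimerShared {
        let ptr = self.inner.get();
        // Check through the raw pointer first: the common (already
        // initialized) path never creates an `&mut` that could alias
        // shared references handed out earlier.
        if unsafe { (*ptr).is_none() } {
            // SAFETY (sketch): sound only because nothing else can be
            // reading the cell before its first initialization.
            unsafe { *ptr = Some(TimerShared { state: 0 }) };
        }
        unsafe { (*ptr).as_ref().unwrap() }
    }
}

fn main() {
    let entry = TimerEntry::new();
    let a = entry.inner(); // first call initializes lazily
    assert_eq!(a.state, 0);
    let b = entry.inner(); // second call reuses the initialized value
    assert_eq!(b.state, 0);
}
```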
Looks good to me.
This commit is part of the effort to reduce `timeout` performance overhead.
See #6504
Below are relevant benchmark results of this PR on an M1 Mac:
Below are relevant benchmark results of the current version on an M1 Mac: