add new benchmark for FutureGroup #179

Merged (3 commits) on Jun 9, 2024

Conversation

@soooch (Contributor) commented Apr 16, 2024:

This benchmark aims to measure latency between futures becoming ready and the stream producing them.
The bench currently shows FutureGroup trailing significantly behind FuturesUnordered for larger test sizes.

```
~/c/futures-concurrency (new-future-group-bench)> cargo bench future_group_poll_latency
    Finished bench [optimized + debuginfo] target(s) in 0.05s
     Running benches/bench.rs (target/release/deps/bench-26d01e1345f7786c)
Gnuplot not found, using plotters backend
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 0.001 }                                                                             
                        time:   [630.81 ns 631.56 ns 632.24 ns]
                        change: [+7.1846% +7.2671% +7.3558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 0.001 }                                                                             
                        time:   [466.72 ns 469.05 ns 471.58 ns]
                        change: [-1.7131% -1.2735% -0.8808%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  14 (14.00%) low severe
  3 (3.00%) low mild
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 0.2 }                                                                             
                        time:   [537.25 ns 539.50 ns 541.62 ns]
                        change: [+2.5562% +2.9383% +3.2572%] (p = 0.00 < 0.05)
                        Performance has regressed.
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 0.2 }                                                                             
                        time:   [366.65 ns 368.35 ns 369.97 ns]
                        change: [-2.7245% -2.4190% -2.0874%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 30 outliers among 100 measurements (30.00%)
  16 (16.00%) low severe
  4 (4.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 1.0 }                                                                             
                        time:   [404.96 ns 410.50 ns 416.19 ns]
                        change: [+13.031% +13.890% +14.650%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  16 (16.00%) low severe
  1 (1.00%) low mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 1.0 }                                                                             
                        time:   [276.28 ns 276.34 ns 276.39 ns]
                        change: [-0.7471% -0.7059% -0.6578%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 0.001 }                                                                             
                        time:   [14.159 µs 14.207 µs 14.255 µs]
                        change: [+4.3452% +4.4973% +4.6408%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low severe
  7 (7.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 0.001 }                                                                             
                        time:   [4.9547 µs 4.9589 µs 4.9646 µs]
                        change: [-1.6303% -1.2868% -0.9175%] (p = 0.00 < 0.05)
                        Change within noise threshold.
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 0.2 }                                                                             
                        time:   [11.181 µs 11.182 µs 11.184 µs]
                        change: [+2.6277% +2.6520% +2.6760%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 0.2 }                                                                             
                        time:   [2.9196 µs 2.9202 µs 2.9207 µs]
                        change: [-1.6521% -1.6239% -1.5951%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 1.0 }                                                                             
                        time:   [4.0768 µs 4.0777 µs 4.0787 µs]
                        change: [-0.3427% +0.0016% +0.2321%] (p = 1.00 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 1.0 }                                                                             
                        time:   [2.7793 µs 2.7796 µs 2.7801 µs]
                        change: [-6.4224% -6.2342% -6.0593%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  8 (8.00%) high severe
Benchmarking future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.001 }: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.001 }                                                                             
                        time:   [1.0431 ms 1.0460 ms 1.0488 ms]
                        change: [-0.7482% -0.5783% -0.3925%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) high mild
  19 (19.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 0.001 }                                                                            
                        time:   [52.312 µs 52.334 µs 52.362 µs]
                        change: [-0.0070% +0.0275% +0.0639%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.2 }                                                                            
                        time:   [826.35 µs 828.36 µs 830.13 µs]
                        change: [+1.3397% +1.7624% +2.2118%] (p = 0.00 < 0.05)
                        Performance has regressed.
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 0.2 }                                                                            
                        time:   [30.896 µs 31.006 µs 31.109 µs]
                        change: [-1.8062% -1.4034% -1.0374%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 1.0 }                                                                            
                        time:   [44.534 µs 44.610 µs 44.675 µs]
                        change: [+2.2008% +2.3854% +2.5792%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 25 outliers among 100 measurements (25.00%)
  4 (4.00%) low severe
  21 (21.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 1.0 }                                                                            
                        time:   [31.036 µs 31.122 µs 31.203 µs]
                        change: [-0.0946% +0.2297% +0.6345%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

     Running benches/compare.rs (target/release/deps/compare-430c491dda6e2002)
Gnuplot not found, using plotters backend
```

@yoshuawuyts (Owner) left a comment:

Thanks so much for submitting this! - Despite not doing the best in these benchmarks yet, I think they're valuable because they allow us to learn where we have an opportunity to improve.

In principle I'm keen to land these; I mainly just have some concerns about the relatively high amount of outliers at the higher counts. If we can resolve that, I think we should be good!

Thank you!

benches/bench.rs (outdated), comment on lines 151 to 205:
```rust
    while !senders.is_empty() {
        let completion_limit = max_ready_num_per_poll.min(senders.len());
        let num_completing = rng.gen_range(0..=completion_limit);

        assert_eq!(senders.drain(..num_completing).count(), num_completing);

        let recv_start = Instant::now();
        assert_eq!(
            (&mut group).take(num_completing).count().await,
            num_completing
        );
        total_runtime += recv_start.elapsed();
    }
}

total_runtime
```
@yoshuawuyts (Owner) commented:

It took me a second to read how this works; it would be really helpful if you can annotate how the latency is being counted here.

From what I can tell, what we're doing is taking N rounds to complete all futures in the group. We count how long it took to complete each round, and add that to a total duration. The way we mark futures as complete is by inserting one-shot channels into the futures group.

Did I get that right?

@soooch (Contributor, Author) replied Apr 16, 2024:

We initialize the FutureGroup with size futures. Our futures are the Receiver ends of oneshots. When the Sender ends are dropped, the corresponding Receiver futures become ready.

I think that last part may be what you're missing? When we drain the senders vec at line 155 we are waking up Receivers in the FutureGroup.

Insertion to the FutureGroup happens once up front and is not used for signaling.
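In code, the setup described here looks roughly like the sketch below. This is a minimal, hypothetical reconstruction rather than the benchmark itself: it assumes `futures_channel::oneshot` and `futures_lite::StreamExt`, and runs a single round instead of the bench's parameterized loop over rounds.

```rust
use futures_channel::oneshot;
use futures_concurrency::future::FutureGroup;
use futures_lite::StreamExt;
use std::time::{Duration, Instant};

async fn measure_one_round(size: usize) -> Duration {
    // Fill the group once, up front, with the Receiver halves of oneshot channels.
    let mut group = FutureGroup::new();
    let mut senders = Vec::with_capacity(size);
    for _ in 0..size {
        let (tx, rx) = oneshot::channel::<()>();
        senders.push(tx);
        group.insert(rx);
    }
    let mut group = std::pin::pin!(group);

    // Dropping Senders is the wake-up signal: each corresponding Receiver
    // future inside the group becomes ready.
    let num_ready = size / 2;
    senders.drain(..num_ready).for_each(drop);

    // Measure only the time between the futures becoming ready and the
    // stream yielding them.
    let start = Instant::now();
    let yielded = (&mut group).take(num_ready).count().await;
    assert_eq!(yielded, num_ready);
    start.elapsed()
}

fn main() {
    let elapsed = futures_lite::future::block_on(measure_one_round(100));
    println!("drained 50 ready futures in {elapsed:?}");
}
```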

@soooch (Contributor, Author) replied:

will get some more comments in there.

@soooch (Contributor, Author) replied:

@yoshuawuyts let me know if the new docs make this more clear

@yoshuawuyts (Owner) commented:

@soooch By the way, I'd also be interested if you have any theories as to what might be causing this difference in latency? I know for example that in #137 we started dropping futures before we return their values, as a matter of correctness. I don't know whether FuturesUnordered does the same thing; but if they don't I could see that having a measurable impact on latency between futures completing and values being returned.

@soooch Also, another question: it seems the test is currently hard-coded to take 10 rounds. Have you noticed any differences if you increase or decrease the number of rounds?

@soooch (Contributor, Author) commented Apr 16, 2024:

My primary theory is that on each poll, all futures are iterated over regardless of whether they have been woken or not. IIRC FuturesUnordered does some more complicated stuff there so it only iterates over woken futures. I think I saw an old PR on this repo that was trying to do something similar.

I did try out a few other numbers. FutureGroup was always significantly slower than FuturesUnordered, but I think the difference may have been less for higher ratios.

(I thought about parameterizing the benches over both size and "wake_limit", but the benches probably already take too long to run.)

@yoshuawuyts (Owner) commented:

> My primary theory is that on each poll, all futures are iterated over regardless of whether they have been woken or not. IIRC FuturesUnordered does some more complicated stuff there so it only iterates over woken futures. I think I saw an old PR on this repo that was trying to do something similar.

Hmmm, thank you for your answer - though that shouldn't be the case. We're being quite careful to only iterate over futures which are currently ready to be woken. We do iterate over all active keys, but that just walks a BTreeSet, which should be fairly cheap?:

```rust
for index in this.keys.iter().cloned() {
    if states[index].is_pending() && readiness.clear_ready(index) {
```

I'm quite surprised about the differences measured here, especially since on the existing benchmark we also measure time elapsed and the differences are effectively negligible. I'm hoping that by better understanding the way this benchmark is structured we can drill down further into why it performs differently for both APIs.

@soooch (Contributor, Author) commented Apr 17, 2024:

The is_pending and clear_ready calls are probably about as expensive as the btree iteration. They're not expensive in an absolute sense, but if they're done for 1000 items when only a single item has woken up, that should start to add up, right?

Take a look at the ready_to_run_queue in FuturesUnordered.
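The general idea behind such a queue, sketched below with made-up names, is that each stored future's waker records its own index, so a poll pass only visits entries that were actually woken instead of scanning every key. This is only an illustration of the concept; FuturesUnordered's real implementation uses an intrusive, lock-free queue rather than a `Mutex<VecDeque>`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::Wake;

/// Shared queue of indices whose futures have been woken since the last poll.
struct ReadyQueue {
    woken: Mutex<VecDeque<usize>>,
}

/// Waker handed to the future stored at `index`: waking it records the index.
struct IndexWaker {
    index: usize,
    queue: Arc<ReadyQueue>,
}

impl Wake for IndexWaker {
    fn wake(self: Arc<Self>) {
        self.queue.woken.lock().unwrap().push_back(self.index);
    }
}

fn main() {
    let queue = Arc::new(ReadyQueue { woken: Mutex::new(VecDeque::new()) });

    // Imagine 1000 stored futures; only two of them are woken.
    let wakers: Vec<Arc<IndexWaker>> = (0..1000)
        .map(|index| Arc::new(IndexWaker { index, queue: Arc::clone(&queue) }))
        .collect();
    Wake::wake(wakers[3].clone());
    Wake::wake(wakers[712].clone());

    // The poll pass touches only the woken entries, not all 1000 keys.
    while let Some(index) = queue.woken.lock().unwrap().pop_front() {
        println!("would poll the future at index {index}");
    }
}
```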

@soooch requested a review from yoshuawuyts on April 17, 2024 23:21
@soooch (Contributor, Author) commented Apr 18, 2024:

I think I've confirmed this theory to myself. This change brings FutureGroup to parity with FuturesUnordered for this benchmark (and also improves numbers in the throughput bench by 10-20%).

This should still be more algorithmically complex than FuturesUnordered, but iter_ones on BitSlice is sufficiently well optimized that we're a bit faster than FuturesUnordered until the test size gets to around 10000 futures. I'm not sure how this holds up as the BitSlice gets more "fragmented" (if that happens at all; I haven't looked into how BitVec does things).

This performance could definitely be achieved without depending on bitvec. This is more of a proof of concept.
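Roughly, the shape of a bitset-backed readiness pass might look like the sketch below (hypothetical names, assuming the `bitvec` crate; this is not the linked change itself): one bit per slot is set by that slot's waker, and the poll pass walks only the set bits via `iter_ones`.

```rust
use bitvec::prelude::*;

/// One readiness bit per stored future.
struct Readiness {
    woken: BitVec,
}

impl Readiness {
    fn new(capacity: usize) -> Self {
        Self { woken: BitVec::repeat(false, capacity) }
    }

    /// Called from the waker for slot `index`.
    fn set_ready(&mut self, index: usize) {
        self.woken.set(index, true);
    }

    /// Visit only the slots that were actually woken, clearing their bits as we go.
    fn drain_ready(&mut self, mut poll_slot: impl FnMut(usize)) {
        // Collect the indices first so we can clear bits while visiting them.
        let ready: Vec<usize> = self.woken.iter_ones().collect();
        for index in ready {
            self.woken.set(index, false);
            poll_slot(index);
        }
    }
}

fn main() {
    let mut readiness = Readiness::new(1000);
    readiness.set_ready(3);
    readiness.set_ready(712);

    // Only two slots are visited, instead of scanning all 1000 keys.
    readiness.drain_ready(|index| println!("would poll the future at index {index}"));
}
```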

@soooch force-pushed the new-future-group-bench branch 3 times, most recently from ec9e058 to 501689c on April 18, 2024 02:15
reimplement `FutureGroup::from_iter` in terms of `extend`
Removes restriction that `F: Default`
this benchmark aims to measure latency between futures becoming ready and the stream producing them.
@soooch (Contributor, Author) commented Apr 23, 2024:

> Thanks so much for submitting this! - Despite not doing the best in these benchmarks yet, I think they're valuable because they allow us to learn where we have an opportunity to improve.
>
> In principle I'm keen to land these; I mainly just have some concerns about the relatively high amount of outliers at the higher counts. If we can resolve that, I think we should be good!
>
> Thank you!

So it appears that on my machine, I have a significant number of outliers for nearly every bench included in this crate. This new benchmark is about on par with the rest. If you have a more steady bench machine, would you be able to check that this bench is reasonably stable there?

@soooch (Contributor, Author) commented May 15, 2024:

@yoshuawuyts this may have fallen through the cracks. I believe I've addressed your initial concerns.

@yoshuawuyts (Owner) left a comment:

Oops - it indeed did. Thanks again!

@yoshuawuyts merged commit a0b4003 into yoshuawuyts:main on Jun 9, 2024
7 checks passed