add new benchmark for FutureGroup #179

Merged (3 commits) on Jun 9, 2024

Conversation

@soooch (Contributor) commented Apr 16, 2024:

This benchmark aims to measure latency between futures becoming ready and the stream producing them.
The bench currently shows FutureGroup trailing significantly behind FuturesUnordered for larger test sizes.

```
~/c/futures-concurrency (new-future-group-bench)> cargo bench future_group_poll_latency
    Finished bench [optimized + debuginfo] target(s) in 0.05s
     Running benches/bench.rs (target/release/deps/bench-26d01e1345f7786c)
Gnuplot not found, using plotters backend
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 0.001 }                                                                             
                        time:   [630.81 ns 631.56 ns 632.24 ns]
                        change: [+7.1846% +7.2671% +7.3558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 0.001 }                                                                             
                        time:   [466.72 ns 469.05 ns 471.58 ns]
                        change: [-1.7131% -1.2735% -0.8808%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  14 (14.00%) low severe
  3 (3.00%) low mild
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 0.2 }                                                                             
                        time:   [537.25 ns 539.50 ns 541.62 ns]
                        change: [+2.5562% +2.9383% +3.2572%] (p = 0.00 < 0.05)
                        Performance has regressed.
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 0.2 }                                                                             
                        time:   [366.65 ns 368.35 ns 369.97 ns]
                        change: [-2.7245% -2.4190% -2.0874%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 30 outliers among 100 measurements (30.00%)
  16 (16.00%) low severe
  4 (4.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 10, pct_ready_per_round: 1.0 }                                                                             
                        time:   [404.96 ns 410.50 ns 416.19 ns]
                        change: [+13.031% +13.890% +14.650%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  16 (16.00%) low severe
  1 (1.00%) low mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 10, pct_ready_per_round: 1.0 }                                                                             
                        time:   [276.28 ns 276.34 ns 276.39 ns]
                        change: [-0.7471% -0.7059% -0.6578%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 0.001 }                                                                             
                        time:   [14.159 µs 14.207 µs 14.255 µs]
                        change: [+4.3452% +4.4973% +4.6408%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low severe
  7 (7.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 0.001 }                                                                             
                        time:   [4.9547 µs 4.9589 µs 4.9646 µs]
                        change: [-1.6303% -1.2868% -0.9175%] (p = 0.00 < 0.05)
                        Change within noise threshold.
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 0.2 }                                                                             
                        time:   [11.181 µs 11.182 µs 11.184 µs]
                        change: [+2.6277% +2.6520% +2.6760%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 0.2 }                                                                             
                        time:   [2.9196 µs 2.9202 µs 2.9207 µs]
                        change: [-1.6521% -1.6239% -1.5951%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
future_group_poll_latency/FutureGroup/Params { init_size: 100, pct_ready_per_round: 1.0 }                                                                             
                        time:   [4.0768 µs 4.0777 µs 4.0787 µs]
                        change: [-0.3427% +0.0016% +0.2321%] (p = 1.00 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
future_group_poll_latency/FuturesUnordered/Params { init_size: 100, pct_ready_per_round: 1.0 }                                                                             
                        time:   [2.7793 µs 2.7796 µs 2.7801 µs]
                        change: [-6.4224% -6.2342% -6.0593%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  8 (8.00%) high severe
Benchmarking future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.001 }: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.001 }                                                                             
                        time:   [1.0431 ms 1.0460 ms 1.0488 ms]
                        change: [-0.7482% -0.5783% -0.3925%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) high mild
  19 (19.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 0.001 }                                                                            
                        time:   [52.312 µs 52.334 µs 52.362 µs]
                        change: [-0.0070% +0.0275% +0.0639%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 0.2 }                                                                            
                        time:   [826.35 µs 828.36 µs 830.13 µs]
                        change: [+1.3397% +1.7624% +2.2118%] (p = 0.00 < 0.05)
                        Performance has regressed.
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 0.2 }                                                                            
                        time:   [30.896 µs 31.006 µs 31.109 µs]
                        change: [-1.8062% -1.4034% -1.0374%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
future_group_poll_latency/FutureGroup/Params { init_size: 1000, pct_ready_per_round: 1.0 }                                                                            
                        time:   [44.534 µs 44.610 µs 44.675 µs]
                        change: [+2.2008% +2.3854% +2.5792%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 25 outliers among 100 measurements (25.00%)
  4 (4.00%) low severe
  21 (21.00%) high severe
future_group_poll_latency/FuturesUnordered/Params { init_size: 1000, pct_ready_per_round: 1.0 }                                                                            
                        time:   [31.036 µs 31.122 µs 31.203 µs]
                        change: [-0.0946% +0.2297% +0.6345%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

     Running benches/compare.rs (target/release/deps/compare-430c491dda6e2002)
Gnuplot not found, using plotters backend
```

@yoshuawuyts (Owner) left a comment:

Thanks so much for submitting this! - Despite not doing the best in these benchmarks yet, I think they're valuable because they allow us to learn where we have an opportunity to improve.

In principle I'm keen to land these; I mainly just have some concerns about the relatively high amount of outliers at the higher counts. If we can resolve that, I think we should be good!

Thank you!

benches/bench.rs (outdated), comment on lines 151 to 205:
```rust
    while !senders.is_empty() {
        let completion_limit = max_ready_num_per_poll.min(senders.len());
        let num_completing = rng.gen_range(0..=completion_limit);

        assert_eq!(senders.drain(..num_completing).count(), num_completing);

        let recv_start = Instant::now();
        assert_eq!(
            (&mut group).take(num_completing).count().await,
            num_completing
        );
        total_runtime += recv_start.elapsed();
    }
}

total_runtime
```
@yoshuawuyts (Owner) commented:

It took me a second to read how this works; it would be really helpful if you can annotate how the latency is being counted here.

From what I can tell, what we're doing is taking N rounds to complete all futures in the group. We count how long it took to complete each round, and add that to a total duration. The way we mark futures as complete is by inserting one-shot channels into the futures group.

Did I get that right?

@soooch (Contributor, Author) replied Apr 16, 2024:

We initialize the FutureGroup with size futures. Our futures are the Receiver ends of oneshots. When the Sender ends are dropped, the corresponding Receiver futures become ready.

I think that last part may be what you're missing? When we drain the senders vec at line 155 we are waking up Receivers in the FutureGroup.

Insertion to the FutureGroup happens once up front and is not used for signaling.
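In code, the setup described here looks roughly like the sketch below. This is a minimal, hypothetical reconstruction rather than the benchmark itself: it assumes `futures_channel::oneshot` and `futures_lite::StreamExt`, and runs a single round instead of the bench's parameterized loop over rounds.

```rust
use futures_channel::oneshot;
use futures_concurrency::future::FutureGroup;
use futures_lite::StreamExt;
use std::time::{Duration, Instant};

async fn measure_one_round(size: usize) -> Duration {
    // Fill the group once, up front, with the Receiver halves of oneshot channels.
    let mut group = FutureGroup::new();
    let mut senders = Vec::with_capacity(size);
    for _ in 0..size {
        let (tx, rx) = oneshot::channel::<()>();
        senders.push(tx);
        group.insert(rx);
    }
    let mut group = std::pin::pin!(group);

    // Dropping Senders is the wake-up signal: each corresponding Receiver
    // future inside the group becomes ready.
    let num_ready = size / 2;
    senders.drain(..num_ready).for_each(drop);

    // Measure only the time between the futures becoming ready and the
    // stream yielding them.
    let start = Instant::now();
    let yielded = (&mut group).take(num_ready).count().await;
    assert_eq!(yielded, num_ready);
    start.elapsed()
}

fn main() {
    let elapsed = futures_lite::future::block_on(measure_one_round(100));
    println!("drained 50 ready futures in {elapsed:?}");
}
```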

@soooch (Contributor, Author) replied:

will get some more comments in there.

@soooch (Contributor, Author) replied:

@yoshuawuyts let me know if the new docs make this more clear

@yoshuawuyts (Owner) commented:

@soooch By the way, I'd also be interested if you have any theories as to what might be causing this difference in latency? I know for example that in #137 we started dropping futures before we return their values, as a matter of correctness. I don't know whether FuturesUnordered does the same thing; but if they don't I could see that having a measurable impact on latency between futures completing and values being returned.

@soooch Also, another question: it seems the test is currently hard-coded to take 10 rounds. Have you noticed any differences if you increase or decrease the number of rounds?

@soooch (Contributor, Author) commented Apr 16, 2024:

My primary theory is that on each poll, all futures are iterated over regardless of whether they have been woken or not. IIRC FuturesUnordered does some more complicated stuff there so it only iterates over woken futures. I think I saw an old PR on this repo that was trying to do something similar.

I did try out a few other numbers. FutureGroup was always significantly slower than FuturesUnordered, but I think the difference may have been less for higher ratios.

(I thought about parameterizing the benches over both size and "wake_limit", but the benches probably already take too long to run.)

@yoshuawuyts (Owner) commented:

> My primary theory is that on each poll, all futures are iterated over regardless of whether they have been woken or not. IIRC FuturesUnordered does some more complicated stuff there so it only iterates over woken futures. I think I saw an old PR on this repo that was trying to do something similar.

Hmmm, thank you for your answer - though that shouldn't be the case. We're being quite careful to only iterate over futures which are currently ready to be woken. We do iterate over all active keys, but that just walks a BTreeSet, which should be fairly cheap?:

```rust
for index in this.keys.iter().cloned() {
    if states[index].is_pending() && readiness.clear_ready(index) {
```

I'm quite surprised about the differences measured here, especially since on the existing benchmark we also measure time elapsed and the differences are effectively negligible. I'm hoping that by better understanding the way this benchmark is structured we can drill down further into why it performs differently for both APIs.

@soooch (Contributor, Author) commented Apr 17, 2024:

The is_pending and clear_ready calls are probably about as expensive as the btree iteration. They're not expensive in an absolute sense, but if they're done for 1000 items when only a single item has woken up, that should start to add up, right?

Take a look at the ready_to_run_queue in FuturesUnordered.
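The general idea behind such a queue, sketched below with made-up names, is that each stored future's waker records its own index, so a poll pass only visits entries that were actually woken instead of scanning every key. This is only an illustration of the concept; FuturesUnordered's real implementation uses an intrusive, lock-free queue rather than a `Mutex<VecDeque>`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::Wake;

/// Shared queue of indices whose futures have been woken since the last poll.
struct ReadyQueue {
    woken: Mutex<VecDeque<usize>>,
}

/// Waker handed to the future stored at `index`: waking it records the index.
struct IndexWaker {
    index: usize,
    queue: Arc<ReadyQueue>,
}

impl Wake for IndexWaker {
    fn wake(self: Arc<Self>) {
        self.queue.woken.lock().unwrap().push_back(self.index);
    }
}

fn main() {
    let queue = Arc::new(ReadyQueue { woken: Mutex::new(VecDeque::new()) });

    // Imagine 1000 stored futures; only two of them are woken.
    let wakers: Vec<Arc<IndexWaker>> = (0..1000)
        .map(|index| Arc::new(IndexWaker { index, queue: Arc::clone(&queue) }))
        .collect();
    Wake::wake(wakers[3].clone());
    Wake::wake(wakers[712].clone());

    // The poll pass touches only the woken entries, not all 1000 keys.
    while let Some(index) = queue.woken.lock().unwrap().pop_front() {
        println!("would poll the future at index {index}");
    }
}
```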

@soooch requested a review from yoshuawuyts on April 17, 2024 23:21
@soooch (Contributor, Author) commented Apr 18, 2024:

I think I've confirmed this theory to myself. This change brings FutureGroup to parity with FuturesUnordered for this benchmark (and also improves numbers in the throughput bench by 10-20%).

This should still be more algorithmically complex than FuturesUnordered, but iter_ones on BitSlice is sufficiently well optimized that we're a bit faster than FuturesUnordered until the test size gets to around 10000 futures. I'm not sure how this holds up as the BitSlice gets more "fragmented" (if that happens at all; I haven't looked into how BitVec does things).

This performance could definitely be achieved without depending on bitvec. This is more of a proof of concept.
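Roughly, the shape of a bitset-backed readiness pass might look like the sketch below (hypothetical names, assuming the `bitvec` crate; this is not the linked change itself): one bit per slot is set by that slot's waker, and the poll pass walks only the set bits via `iter_ones`.

```rust
use bitvec::prelude::*;

/// One readiness bit per stored future.
struct Readiness {
    woken: BitVec,
}

impl Readiness {
    fn new(capacity: usize) -> Self {
        Self { woken: BitVec::repeat(false, capacity) }
    }

    /// Called from the waker for slot `index`.
    fn set_ready(&mut self, index: usize) {
        self.woken.set(index, true);
    }

    /// Visit only the slots that were actually woken, clearing their bits as we go.
    fn drain_ready(&mut self, mut poll_slot: impl FnMut(usize)) {
        // Collect the indices first so we can clear bits while visiting them.
        let ready: Vec<usize> = self.woken.iter_ones().collect();
        for index in ready {
            self.woken.set(index, false);
            poll_slot(index);
        }
    }
}

fn main() {
    let mut readiness = Readiness::new(1000);
    readiness.set_ready(3);
    readiness.set_ready(712);

    // Only two slots are visited, instead of scanning all 1000 keys.
    readiness.drain_ready(|index| println!("would poll the future at index {index}"));
}
```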

@soooch force-pushed the new-future-group-bench branch 3 times, most recently from ec9e058 to 501689c on April 18, 2024 02:15
reimplement `FutureGroup::from_iter` in terms of `extend`
Removes restriction that `F: Default`
this benchmark aims to measure latency between futures becoming ready and the stream producing them.
@soooch (Contributor, Author) commented Apr 23, 2024:

> Thanks so much for submitting this! - Despite not doing the best in these benchmarks yet, I think they're valuable because they allow us to learn where we have an opportunity to improve.
>
> In principle I'm keen to land these; I mainly just have some concerns about the relatively high amount of outliers at the higher counts. If we can resolve that, I think we should be good!
>
> Thank you!

So it appears that on my machine, I have a significant number of outliers for nearly every bench included in this crate. This new benchmark is about on par with the rest. If you have a more steady bench machine, would you be able to check that this bench is reasonably stable there?

@soooch (Contributor, Author) commented May 15, 2024:

@yoshuawuyts this may have fallen through the cracks. I believe I've addressed your initial concerns.

@yoshuawuyts (Owner) left a comment:

Oops - it indeed did. Thanks again!

@yoshuawuyts merged commit a0b4003 into yoshuawuyts:main on Jun 9, 2024
7 checks passed