
Optimize unnecessary check in Vec::retain #88060

Merged (2 commits) on Oct 3, 2021

Conversation

@TennyZhuang (Contributor) commented on Aug 15, 2021

The function `vec::Vec::retain` has only two stages:

  1. Nothing was deleted.
  2. Some elements were deleted.

There is an unnecessary check, `if g.deleted_cnt > 0`, in the loop, and it's difficult for the compiler to optimize it away. I split the loop into the two stages manually and kept the code clean using const generics.

I wrote a special but common bench case for this optimization: calling `retain` on a vec while keeping all elements.

Before and after this optimization:

test vec::bench_retain_whole_100000                      ... bench:      84,803 ns/iter (+/- 17,314)
test vec::bench_retain_whole_100000                      ... bench:      42,638 ns/iter (+/- 16,910)

The result is as expected: there are two branches per element before the optimization and only one after.

@rust-highfive (Collaborator)

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Aug 15, 2021
@TennyZhuang (Contributor, Author)

r? @oxalica

A trivial but interesting optimization, PTAL.

@oxalica (Contributor) left a comment


The generated code seems kind of weird: the second-stage loop is compiled into many basic blocks, with many jumps between them.

Anyway, the benchmark is quite convincing.

BTW: I actually found this PR via your Twitter post. 😄

Review threads: library/alloc/benches/vec.rs, library/alloc/src/vec/mod.rs (resolved)
@TennyZhuang (Contributor, Author)

I added a normal benchmark:

Before

test vec::bench_retain_100000                            ... bench:     460,816 ns/iter (+/- 24,389)
test vec::bench_retain_whole_100000                      ... bench:      88,300 ns/iter (+/- 4,635)

After

test vec::bench_retain_100000                            ... bench:      62,627 ns/iter (+/- 6,692)
test vec::bench_retain_whole_100000                      ... bench:      46,326 ns/iter (+/- 4,133)

It's dramatically faster, to a degree that I can't explain.


@oxalica (Contributor) commented on Aug 16, 2021

I ran the benchmark on my R5 2400G PC and the results are quite different, though quite stable across multiple runs. The with-many-moves case actually becomes slower for me.

Before

test vec::bench_retain_100000                            ... bench:      58,436 ns/iter (+/- 2,482)
test vec::bench_retain_whole_100000                      ... bench:      54,747 ns/iter (+/- 1,646)

After

test vec::bench_retain_100000                            ... bench:      53,703 ns/iter (+/- 2,905)
test vec::bench_retain_whole_100000                      ... bench:      61,588 ns/iter (+/- 689)

It might indicate that the generated code relies heavily on branch-predictor behavior. Are you on an Intel CPU? Intel's branch predictor may work differently from AMD's.

@TennyZhuang (Contributor, Author)

> I ran the benchmark on my R5 2400G PC and it is quite different … Are you on an Intel CPU? Intel's branch predictor may work differently from AMD's.

I tested it on my MacBook Pro 13 with a 2.0GHz quad-core 10th-generation Intel Core i5 processor (Turbo Boost up to 3.8GHz).

I can run the bench on another device.

@Xuanwo could you also run a benchmark for my optimization?

@oxalica I will also try manually expanding the two loops without const generics.

@Xuanwo (Contributor) commented on Aug 16, 2021

I ran the same benchmark on my laptop:

[neofetch output] xuanwo@thinkpad-x1-carbon: Arch Linux x86_64, ThinkPad X1 Carbon 6th (20KHCTO1WW), kernel 5.13.10-zen1-1-zen, CPU: Intel i7-8650U (8) @ 4.200GHz, GPU: Intel UHD Graphics 620, Memory: 10319MiB / 15754MiB

Before

test vec::bench_retain_100000                            ... bench:      74,452 ns/iter (+/- 2,484)
test vec::bench_retain_whole_100000                      ... bench:      59,383 ns/iter (+/- 2,675)

After

test vec::bench_retain_100000                            ... bench:      69,770 ns/iter (+/- 1,764)
test vec::bench_retain_whole_100000                      ... bench:      53,829 ns/iter (+/- 1,374)

@TennyZhuang (Contributor, Author)

I ran it on another Mac:

sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

The command:

./x.py bench -i library/alloc --test-args bench_retain

The result is almost the same:

Before

test vec::bench_retain_100000                            ... bench:     401,699 ns/iter (+/- 11,941)
test vec::bench_retain_whole_100000                      ... bench:      75,848 ns/iter (+/- 2,291)

After

test vec::bench_retain_100000                            ... bench:      52,742 ns/iter (+/- 1,948)
test vec::bench_retain_whole_100000                      ... bench:      38,554 ns/iter (+/- 3,745)

@TennyZhuang (Contributor, Author) commented on Aug 16, 2021

It seems that the original `retain` has very poor performance on macOS, for reasons unknown.

@TennyZhuang (Contributor, Author)

@Xuanwo @oxalica Can you run a version without const generics?

For your convenience:

git remote add zty git@github.com:TennyZhuang/rust.git
git fetch zty
git checkout optimize-vec-retain-expand-manully

@TennyZhuang (Contributor, Author)

On my Mac, performance is the same between the const-generics version and the manual expansion:

test vec::bench_retain_100000                            ... bench:      52,471 ns/iter (+/- 3,921)
test vec::bench_retain_whole_100000                      ... bench:      38,560 ns/iter (+/- 2,666)

@Xuanwo (Contributor) commented on Aug 16, 2021

The two versions perform similarly on my laptop:

const generics

test vec::bench_retain_100000                            ... bench:      76,786 ns/iter (+/- 7,531)
test vec::bench_retain_whole_100000                      ... bench:      62,311 ns/iter (+/- 19,721)

manual expansion

test vec::bench_retain_100000                            ... bench:      72,178 ns/iter (+/- 3,247)
test vec::bench_retain_whole_100000                      ... bench:      56,534 ns/iter (+/- 1,479)

@TennyZhuang (Contributor, Author) commented on Aug 16, 2021

OK, the conclusions are:

  1. const generics are fine to use here.
  2. It optimizes a little (~10%) in most cases and on most architectures. @Xuanwo
  3. Performance degrades a little (~10%) in some cases on specific architectures, for unknown reasons. (Resolved)
  4. It optimizes a lot (~8x) on some specific architectures, for unknown reasons. @TennyZhuang
  5. In theory, it removes a genuinely unnecessary check.
  6. It introduces a little code complexity.

I tend to think this PR deserves to be merged.

@oxalica what's your opinion?

@oxalica (Contributor) commented on Aug 17, 2021

I figured out why the code generation is weird after your change: LLVM fails to prove `g.processed_len == original_len` after the second loop, so `drop(g)` cannot be optimized out. That's why I see a never-called `memmove` in the generated code. The issue also exists in your manually expanded code without const generics. Not sure if it's an LLVM bug.

I'd suggest changing the `while` condition to `g.processed_len != original_len`; with that change it is successfully optimized.
Generated code after

Well, this should not affect the efficiency of the main loop.

@TennyZhuang (Contributor, Author) commented on Aug 17, 2021

> I figured out why the code generation is weird after your change … I'd suggest changing the `while` condition to `g.processed_len != original_len`.

Good catch, I have updated my code.

Current performance on an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz:

test vec::bench_retain_100000                            ... bench:      56,911 ns/iter (+/- 5,288)
test vec::bench_retain_whole_100000                      ... bench:      56,371 ns/iter (+/- 5,982)

bors added a commit to rust-lang-ci/rust that referenced this pull request Aug 21, 2021
Optimize unnecessary check in VecDeque::retain

This PR is highly inspired by rust-lang#88060, which shares the same idea: we can split the `for` loop into stages so that we can remove unnecessary checks like `del > 0`.

## Benchmarks

Before

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     290,125 ns/iter (+/- 8,717)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     291,588 ns/iter (+/- 9,621)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     287,426 ns/iter (+/- 9,009)
```

After

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     243,940 ns/iter (+/- 8,563)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     242,768 ns/iter (+/- 3,903)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     202,926 ns/iter (+/- 6,332)
```

Based on the current benchmarks, this PR improves the performance of `VecDeque::retain` by around 16%. For special cases, the improvement is up to 30%.

Signed-off-by: Xuanwo <github@xuanwo.io>
@TennyZhuang (Contributor, Author)

@joshtriplett PTAL, thanks!

@TennyZhuang (Contributor, Author)

r? @dtolnay

It seems that @joshtriplett is busy. Could you review my PR instead? It's similar to #88075.

fn bench_retain_100000(b: &mut Bencher) {
    let v = (1..=100000).collect::<Vec<u32>>();
    b.iter(|| {
        let mut v = v.clone();
Member
Calling `.clone()` inside the benchmark loop should be avoided, since it can introduce allocator noise. It's better to just refill the vec.

@TennyZhuang (Contributor, Author)

Thanks, but the PR has already been merged. Should I submit another PR to fix it?

Member
I'll take care of it. I just wanted to leave a note.

Labels
merged-by-bors (This PR was explicitly merged by bors), S-waiting-on-bors (Status: Waiting on bors to run and complete tests; bors will change the label on completion), T-libs (Relevant to the library team, which will review and decide on the PR/issue)