Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split Vec::dedup_by into 2 cycles #118273

Merged

Conversation

AngelicosPhosphoros
Copy link
Contributor

First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove.

This leads to significant performance gains if all Vec items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):

  • vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
  • vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
  • vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
  • vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns
  • vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns
  • vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs
  • vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs
  • vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
  • vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
  • vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
  • vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
  • vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
  • vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
  • vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
  • vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:

  • vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
  • vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
  • vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
  • vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns
  • vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns
  • vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns
  • vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs
  • vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
  • vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
  • vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
  • vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
  • vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
  • vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
  • vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
  • vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves #77772

P.S. Note that this is same PR as #92104 I just missed review then forgot about it.
Also, I cannot reopen that pull request so I am creating a new one.
I responded to remaining questions directly by adding commentaries to my code.

They are for more specific cases than old benches.
Also, better usage of blackbox
@rustbot
Copy link
Collaborator

rustbot commented Nov 25, 2023

r? @joshtriplett

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 25, 2023
@the8472
Copy link
Member

the8472 commented Nov 25, 2023

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 25, 2023
@bors
Copy link
Contributor

bors commented Nov 25, 2023

⌛ Trying commit f0c9274 with merge d340b8b...

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 25, 2023
…rsion_77772_2, r=<try>

Split `Vec::dedup_by` into 2 cycles

First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):
 *   vec::bench_dedup_all_100                 74.00ns/iter  +/- 13.00ns
 *   vec::bench_dedup_all_1000               572.00ns/iter +/- 272.00ns
 *   vec::bench_dedup_all_100000              64.42µs/iter  +/- 19.47µs
 *   __vec::bench_dedup_none_100                67.00ns/iter  +/- 17.00ns__
 *   __vec::bench_dedup_none_1000              662.00ns/iter  +/- 86.00ns__
 *   __vec::bench_dedup_none_10000               9.16µs/iter   +/- 2.71µs__
 *   __vec::bench_dedup_none_100000             91.25µs/iter   +/- 1.82µs__
 *   vec::bench_dedup_random_100             105.00ns/iter  +/- 11.00ns
 *   vec::bench_dedup_random_1000            781.00ns/iter  +/- 10.00ns
 *   vec::bench_dedup_random_10000             9.00µs/iter   +/- 5.62µs
 *   vec::bench_dedup_random_100000          449.81µs/iter  +/- 74.99µs
 *   vec::bench_dedup_slice_truncate_100     105.00ns/iter  +/- 16.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.65µs/iter +/- 481.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.33µs/iter   +/- 5.23µs
 *   vec::bench_dedup_slice_truncate_100000  501.12µs/iter  +/- 46.97µs

Results after implementation:
 *   vec::bench_dedup_all_100                 75.00ns/iter   +/- 9.00ns
 *   vec::bench_dedup_all_1000               494.00ns/iter +/- 117.00ns
 *   vec::bench_dedup_all_100000              58.13µs/iter   +/- 8.78µs
 *   __vec::bench_dedup_none_100                52.00ns/iter  +/- 22.00ns__
 *   __vec::bench_dedup_none_1000              417.00ns/iter +/- 116.00ns__
 *   __vec::bench_dedup_none_10000               4.11µs/iter +/- 546.00ns__
 *   __vec::bench_dedup_none_100000             40.47µs/iter   +/- 5.36µs__
 *   vec::bench_dedup_random_100              77.00ns/iter  +/- 15.00ns
 *   vec::bench_dedup_random_1000            681.00ns/iter  +/- 86.00ns
 *   vec::bench_dedup_random_10000            11.66µs/iter   +/- 2.22µs
 *   vec::bench_dedup_random_100000          469.35µs/iter  +/- 20.53µs
 *   vec::bench_dedup_slice_truncate_100     100.00ns/iter   +/- 5.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.55µs/iter +/- 224.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.95µs/iter   +/- 2.59µs
 *   vec::bench_dedup_slice_truncate_100000  492.85µs/iter  +/- 72.84µs

Resolves rust-lang#77772

P.S. Note that this is same PR as rust-lang#92104 I just missed review then forgot about it.
Also, I cannot reopen that pull request so I am creating a new one.
I responded to remaining questions directly by adding commentaries to my code.
@bors
Copy link
Contributor

bors commented Nov 25, 2023

☀️ Try build successful - checks-actions
Build commit: d340b8b (d340b8b09f67d765e5ba32ab9c2702994a63b8c3)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (d340b8b): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-1.1% [-1.1%, -1.1%] 1
All ❌✅ (primary) - - 0

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.0% [0.9%, 1.0%] 2
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-1.3% [-1.3%, -1.3%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.2% [-1.3%, 1.0%] 3

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 674.339s -> 676.429s (0.31%)
Artifact size: 313.28 MiB -> 313.27 MiB (-0.00%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 25, 2023
@AngelicosPhosphoros
Copy link
Contributor Author

@joshtriplett Hi, I just want to notify that there is an unreviewed PR.

@the8472
Copy link
Member

the8472 commented Dec 4, 2023

r? the8472

@rustbot rustbot assigned the8472 and unassigned joshtriplett Dec 4, 2023
Comment on lines 661 to 662
// Measures performance of slice dedup impl.
// "Old" implementation of Vec::dedup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit misleading in the context of this PR. It's "old-old" now. Can we put a version on it "from Rust 1.xx" or something like that?

// Check if we ever want to remove anything.
// This allows to use copy_non_overlapping in next cycle.
// And avoids any memory writes if we don't need to remove anything.
let mut possible_remove_idx = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is a bit weird. Maybe first_duplicate?

library/alloc/src/vec/mod.rs Outdated Show resolved Hide resolved
First cycle runs until we found 2 same elements, second runs after if there any found in the first one. This allows to avoid any memory writes until we found an item which we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):
 *   vec::bench_dedup_all_100                 74.00ns/iter  +/- 13.00ns
 *   vec::bench_dedup_all_1000               572.00ns/iter +/- 272.00ns
 *   vec::bench_dedup_all_100000              64.42µs/iter  +/- 19.47µs
 *   __vec::bench_dedup_none_100                67.00ns/iter  +/- 17.00ns__
 *   __vec::bench_dedup_none_1000              662.00ns/iter  +/- 86.00ns__
 *   __vec::bench_dedup_none_10000               9.16µs/iter   +/- 2.71µs__
 *   __vec::bench_dedup_none_100000             91.25µs/iter   +/- 1.82µs__
 *   vec::bench_dedup_random_100             105.00ns/iter  +/- 11.00ns
 *   vec::bench_dedup_random_1000            781.00ns/iter  +/- 10.00ns
 *   vec::bench_dedup_random_10000             9.00µs/iter   +/- 5.62µs
 *   vec::bench_dedup_random_100000          449.81µs/iter  +/- 74.99µs
 *   vec::bench_dedup_slice_truncate_100     105.00ns/iter  +/- 16.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.65µs/iter +/- 481.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.33µs/iter   +/- 5.23µs
 *   vec::bench_dedup_slice_truncate_100000  501.12µs/iter  +/- 46.97µs

Results after implementation:
 *   vec::bench_dedup_all_100                 75.00ns/iter   +/- 9.00ns
 *   vec::bench_dedup_all_1000               494.00ns/iter +/- 117.00ns
 *   vec::bench_dedup_all_100000              58.13µs/iter   +/- 8.78µs
 *   __vec::bench_dedup_none_100                52.00ns/iter  +/- 22.00ns__
 *   __vec::bench_dedup_none_1000              417.00ns/iter +/- 116.00ns__
 *   __vec::bench_dedup_none_10000               4.11µs/iter +/- 546.00ns__
 *   __vec::bench_dedup_none_100000             40.47µs/iter   +/- 5.36µs__
 *   vec::bench_dedup_random_100              77.00ns/iter  +/- 15.00ns
 *   vec::bench_dedup_random_1000            681.00ns/iter  +/- 86.00ns
 *   vec::bench_dedup_random_10000            11.66µs/iter   +/- 2.22µs
 *   vec::bench_dedup_random_100000          469.35µs/iter  +/- 20.53µs
 *   vec::bench_dedup_slice_truncate_100     100.00ns/iter   +/- 5.00ns
 *   vec::bench_dedup_slice_truncate_1000      2.55µs/iter +/- 224.00ns
 *   vec::bench_dedup_slice_truncate_10000    18.95µs/iter   +/- 2.59µs
 *   vec::bench_dedup_slice_truncate_100000  492.85µs/iter  +/- 72.84µs

Resolves rust-lang#77772
@AngelicosPhosphoros
Copy link
Contributor Author

@the8472 I made changes you requested.

@the8472
Copy link
Member

the8472 commented Dec 5, 2023

@bors r+

@bors
Copy link
Contributor

bors commented Dec 5, 2023

📌 Commit 964df01 has been approved by the8472

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 5, 2023
@bors
Copy link
Contributor

bors commented Dec 5, 2023

⌛ Testing commit 964df01 with merge e9013ac...

@bors
Copy link
Contributor

bors commented Dec 5, 2023

☀️ Test successful - checks-actions
Approved by: the8472
Pushing e9013ac to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 5, 2023
@bors bors merged commit e9013ac into rust-lang:master Dec 5, 2023
12 checks passed
@rustbot rustbot added this to the 1.76.0 milestone Dec 5, 2023
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (e9013ac): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-0.1% [-0.1%, -0.1%] 3
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.8% [-3.8%, -3.8%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -3.8% [-3.8%, -3.8%] 1

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 675.495s -> 673.642s (-0.27%)
Artifact size: 314.16 MiB -> 314.17 MiB (0.00%)

@AngelicosPhosphoros AngelicosPhosphoros deleted the dedup_2_loops_version_77772_2 branch December 6, 2023 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potentially faster dedup_by implementation
6 participants