Potentially faster dedup_by implementation #77772

Plecra · 2020-10-09T21:53:31Z

It seems like the dedup_by implementation makes two passes over the data, making many more copies than are strictly necessary. Doing this allows the same_bucket implementation to panic without causing a leak. I wouldn't call this bad, but it was a surprise to see that it was doing the extra work.

https://godbolt.org/z/4To1a1

Mimicking the way Drain handles this and temporarily shortening the vector allows us to cut down on the copies (and also conveniently removes a panicking branch).
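To make the trade-off concrete, here is a minimal sketch of the "temporarily shorten the vector" idea. It is not the code from the godbolt link; the free function `dedup_by_shortened` is a hypothetical name, and setting the length to 0 up front is an assumption about how such an approach might look. If `same_bucket` panics, the remaining elements are leaked rather than dropped twice.

```rust
use std::ptr;

/// Hypothetical single-pass dedup that shortens the vector first.
/// On panic the elements are leaked, which is safe but wasteful.
fn dedup_by_shortened<T, F>(vec: &mut Vec<T>, mut same_bucket: F)
where
    F: FnMut(&mut T, &mut T) -> bool,
{
    let len = vec.len();
    if len <= 1 {
        return;
    }
    let ptr = vec.as_mut_ptr();
    unsafe {
        // Pretend the vector is empty while elements are moved around.
        // A panic from here on leaks elements but never double-drops them.
        vec.set_len(0);
        let mut write = 1;
        for read in 1..len {
            let read_ptr = ptr.add(read);
            let prev_ptr = ptr.add(write - 1);
            if same_bucket(&mut *read_ptr, &mut *prev_ptr) {
                // Duplicate: drop it in place and leave a hole.
                ptr::drop_in_place(read_ptr);
            } else {
                // Kept element: move it down only if a hole exists.
                if read != write {
                    ptr::copy_nonoverlapping(read_ptr, ptr.add(write), 1);
                }
                write += 1;
            }
        }
        // Everything in [0, write) is initialized again.
        vec.set_len(write);
    }
}
```

Each kept element is copied at most once, at the cost of leaking on panic.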

I wonder if it makes sense to document the trade-off that std makes here? Imo, it'd make it easier to use the library correctly in the way that https://doc.rust-lang.org/std/vec/struct.Vec.html#current-implementation-7 does.

pickfire · 2020-10-10T15:09:17Z

I wonder if it would be even faster to split it into two loops, since once read != write it will always hit that branch in the loop and do unsafe { ptr::copy_nonoverlapping(read, write, 1) };? Maybe it could help reduce branch misprediction? I feel like the same thing could be applied to the current implementation. https://godbolt.org/z/szon9b

I wonder why we are using ptr::copy_nonoverlapping instead of mem::swap? And do we really need drop_in_place? From what I see, the current implementation does not do that.

cc @lzutao @the8472 who may be interested in this.

the8472 · 2020-10-10T16:32:43Z

The leaking behavior is unnecessary. A drop guard can handle the tail.
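As an illustration of what such a guard could look like, here is a hedged sketch. The names `dedup_by_guarded` and `GapGuard` are hypothetical, and this is not claimed to be the code anyone in the thread wrote. The guard's `Drop` moves the unprocessed tail down over the hole and fixes the length, so a panic in `same_bucket` neither leaks nor double-drops.

```rust
use std::{mem, ptr};

/// Hypothetical single-pass dedup with a drop guard for panic safety.
fn dedup_by_guarded<T, F>(vec: &mut Vec<T>, mut same_bucket: F)
where
    F: FnMut(&mut T, &mut T) -> bool,
{
    let len = vec.len();
    if len <= 1 {
        return;
    }

    /// Tracks the hole `[write, read)` of already dropped or moved-out slots.
    struct GapGuard<'a, T> {
        read: usize,
        write: usize,
        vec: &'a mut Vec<T>,
    }

    impl<'a, T> Drop for GapGuard<'a, T> {
        fn drop(&mut self) {
            // Only runs on panic (the happy path forgets the guard).
            unsafe {
                let base = self.vec.as_mut_ptr();
                let len = self.vec.len();
                // Shift the not-yet-visited tail down over the hole.
                ptr::copy(base.add(self.read), base.add(self.write), len - self.read);
                self.vec.set_len(len - (self.read - self.write));
            }
        }
    }

    let mut gap = GapGuard { read: 1, write: 1, vec };
    let base = gap.vec.as_mut_ptr();
    unsafe {
        while gap.read < len {
            let read_ptr = base.add(gap.read);
            let prev_ptr = base.add(gap.write - 1);
            if same_bucket(&mut *read_ptr, &mut *prev_ptr) {
                // Advance first: if the destructor panics, the guard must
                // not treat this slot as still alive.
                gap.read += 1;
                ptr::drop_in_place(read_ptr);
            } else {
                // `ptr::copy` because read and write may still coincide.
                ptr::copy(read_ptr, base.add(gap.write), 1);
                gap.write += 1;
                gap.read += 1;
            }
        }
        gap.vec.set_len(gap.write);
        mem::forget(gap);
    }
}
```

After a panic in the predicate, the vector contains the elements kept so far followed by the untouched tail.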

Plecra · 2020-10-10T22:46:15Z

@pickfire drop_in_place is being called in truncate in the current implementation. If you're asking why I'm using ptr::copy_nonoverlapping, it's because it's faster than mem::swap, and we don't need to use mem::swap in this implementation.

I'm pretty sure you're right @the8472! My mistake, I wound myself up trying to justify the truncate call. I think that means this would be a direct upgrade to the current implementation?

https://godbolt.org/z/d5ahhM (Note: My implementation seems to be broken in a couple ways, it's just for reference)

Plecra · 2020-10-11T03:00:33Z

https://github.com/Plecra/alt-dedup-rs

I tried to flesh out the implementation a little, and make sure it's working properly. The behaviour of the DropGuard still needs to be tested, and it'd be good to do a more thorough benchmark if this was being considered for addition.

I also implemented the nested loop version. I can't tell if it helped with performance yet, since my computer gives the benchmarks too much noise.

pickfire · 2020-10-11T14:00:23Z

@Plecra Do you think separating the branch into a different loop is a good idea in case they are not in the same bucket?

Plecra · 2020-10-11T14:15:43Z

@pickfire I've used a second (and third) loop in the implementation to avoid the read != write branch, did I misunderstand your recommendation?

pickfire · 2020-10-12T02:18:48Z

It still seems to be in the loop: https://github.com/Plecra/alt-dedup-rs/blob/main/src/lib.rs#L69-L79. I think keeping it in the inner loop is the same as the current implementation, since that branch will always be there. I didn't benchmark it and I am not sure if it will be faster; it's just a guess that avoiding that branch may end up reducing branch mispredictions.

Plecra · 2020-10-12T07:56:09Z

🤔 I honestly don't know how I'd implement that - it seems like we need to branch somewhere to verify that the copies are legal. I'd welcome a PR showing what you mean.

pickfire · 2020-10-12T10:32:32Z

I mean we move that branch into another loop: once the dedup read and write are different, it will always copy stuff when it hits the other branch. So instead of having 2 branches when the bucket is different, we could handle that case in the other loop and remove the read != write branch. Not sure if it's faster though.
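One way to read this suggestion is sketched below in safe Rust. It is only an illustration: `dedup_two_loops` is a hypothetical name, and it uses `slice::swap` and `PartialEq` for simplicity instead of the `ptr::copy_nonoverlapping` and `same_bucket` closure discussed above. The first loop only compares; the second loop starts after the first duplicate, where read is always ahead of write, so no `read != write` check is needed.

```rust
/// Hypothetical two-loop dedup of consecutive equal elements.
fn dedup_two_loops<T: PartialEq>(v: &mut Vec<T>) {
    let len = v.len();
    if len <= 1 {
        return;
    }

    // Loop 1: scan until the first duplicate. Nothing is moved here.
    let mut read = 1;
    while read < len && v[read] != v[read - 1] {
        read += 1;
    }
    if read == len {
        // No duplicates at all: the vector is left untouched.
        return;
    }

    // `v[read]` duplicates `v[read - 1]`, so it becomes the first hole.
    let mut write = read;
    read += 1;

    // Loop 2: `read > write` holds from here on, so every kept element
    // is moved without checking whether the two positions coincide.
    while read < len {
        if v[read] != v[write - 1] {
            v.swap(read, write);
            write += 1;
        }
        read += 1;
    }

    // The duplicates have been swapped into the tail; drop them.
    v.truncate(write);
}
```

Whether this actually reduces branch mispredictions would still need to be measured.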

AngelicosPhosphoros · 2022-01-17T15:45:51Z

@Mark-Simulacrum proposed additional optimizations for this issue.
Details here:
#92104 (comment)

The first cycle runs until we find 2 identical elements; the second runs afterwards if any were found in the first one. This avoids any memory writes until we find an item that we want to remove.

This leads to significant performance gains if all `Vec` items are kept: -40% on my benchmark with unique integers.

Results of benchmarks before implementation (including new benchmark where nothing needs to be removed):

* vec::bench_dedup_all_100 74.00ns/iter +/- 13.00ns
* vec::bench_dedup_all_1000 572.00ns/iter +/- 272.00ns
* vec::bench_dedup_all_100000 64.42µs/iter +/- 19.47µs
* __vec::bench_dedup_none_100 67.00ns/iter +/- 17.00ns__
* __vec::bench_dedup_none_1000 662.00ns/iter +/- 86.00ns__
* __vec::bench_dedup_none_10000 9.16µs/iter +/- 2.71µs__
* __vec::bench_dedup_none_100000 91.25µs/iter +/- 1.82µs__
* vec::bench_dedup_random_100 105.00ns/iter +/- 11.00ns
* vec::bench_dedup_random_1000 781.00ns/iter +/- 10.00ns
* vec::bench_dedup_random_10000 9.00µs/iter +/- 5.62µs
* vec::bench_dedup_random_100000 449.81µs/iter +/- 74.99µs
* vec::bench_dedup_slice_truncate_100 105.00ns/iter +/- 16.00ns
* vec::bench_dedup_slice_truncate_1000 2.65µs/iter +/- 481.00ns
* vec::bench_dedup_slice_truncate_10000 18.33µs/iter +/- 5.23µs
* vec::bench_dedup_slice_truncate_100000 501.12µs/iter +/- 46.97µs

Results after implementation:

* vec::bench_dedup_all_100 75.00ns/iter +/- 9.00ns
* vec::bench_dedup_all_1000 494.00ns/iter +/- 117.00ns
* vec::bench_dedup_all_100000 58.13µs/iter +/- 8.78µs
* __vec::bench_dedup_none_100 52.00ns/iter +/- 22.00ns__
* __vec::bench_dedup_none_1000 417.00ns/iter +/- 116.00ns__
* __vec::bench_dedup_none_10000 4.11µs/iter +/- 546.00ns__
* __vec::bench_dedup_none_100000 40.47µs/iter +/- 5.36µs__
* vec::bench_dedup_random_100 77.00ns/iter +/- 15.00ns
* vec::bench_dedup_random_1000 681.00ns/iter +/- 86.00ns
* vec::bench_dedup_random_10000 11.66µs/iter +/- 2.22µs
* vec::bench_dedup_random_100000 469.35µs/iter +/- 20.53µs
* vec::bench_dedup_slice_truncate_100 100.00ns/iter +/- 5.00ns
* vec::bench_dedup_slice_truncate_1000 2.55µs/iter +/- 224.00ns
* vec::bench_dedup_slice_truncate_10000 18.95µs/iter +/- 2.59µs
* vec::bench_dedup_slice_truncate_100000 492.85µs/iter +/- 72.84µs

Resolves rust-lang#77772
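To put the highlighted `bench_dedup_none_*` rows in perspective, the relative change can be computed directly from the before and after timings quoted above. The helper names below are hypothetical, and the result is only a rough reading because the reported noise (+/-) is large.

```rust
/// Relative change from `before` to `after` in percent.
/// Negative values mean the run got faster.
fn relative_change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// (benchmark, before, after), copied from the results above.
/// Units match within each row (ns for the first two, µs for the last two).
const DEDUP_NONE: [(&str, f64, f64); 4] = [
    ("bench_dedup_none_100", 67.00, 52.00),
    ("bench_dedup_none_1000", 662.00, 417.00),
    ("bench_dedup_none_10000", 9.16, 4.11),
    ("bench_dedup_none_100000", 91.25, 40.47),
];

/// Relative change for every `bench_dedup_none_*` row.
fn dedup_none_changes() -> Vec<(&'static str, f64)> {
    DEDUP_NONE
        .iter()
        .map(|&(name, before, after)| (name, relative_change_percent(before, after)))
        .collect()
}
```

With these numbers the change ranges from roughly -22% at 100 elements to roughly -56% at 100000 elements, which brackets the quoted -40%.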

…rsion_77772_2, r=<try>

Split `Vec::dedup_by` into 2 cycles

P.S. Note that this is same PR as rust-lang#92104 I just missed review then forgot about it. Also, I cannot reopen that pull request so I am creating a new one. I responded to remaining questions directly by adding commentaries to my code.


…rsion_77772_2, r=the8472

Split `Vec::dedup_by` into 2 cycles

jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 9, 2020

pickfire mentioned this issue Oct 23, 2020

Slice partition_dedup_by avoid inner branch #78291

Closed

AngelicosPhosphoros mentioned this issue Dec 19, 2021

Split Vec::dedup_by into 2 cycles #92104

Closed

the8472 added the A-collections Area: std::collections. label Feb 8, 2022

AngelicosPhosphoros mentioned this issue Nov 25, 2023

Split Vec::dedup_by into 2 cycles #118273

Merged

bors closed this as completed in #118273 Dec 5, 2023
