Replace sort implementations #124032

Voultapher · 2024-04-16T18:28:13Z

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for slice::sort while improving it for slice::sort_unstable. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

slice::sort -> driftsort design document
slice::sort_unstable -> ipnsort design document

Why should we change the sort implementations?

In the 2023 Rust survey, one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

Why is this one big PR and not multiple?

The current documentation gives performance recommendations for slice::sort and slice::sort_unstable. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
driftsort and ipnsort share a substantial part of their implementations.
The implementation of select_nth_unstable uses internals of slice::sort_unstable, which makes it impractical to split changes.

This PR is a collaboration with @orlp.

- `slice::sort` -> driftsort https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md - `slice::sort_unstable` -> ipnsort https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md Replaces the sort implementations with tailor made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic. In addition the implementation of `select_nth_unstable` is also adapted as it uses `slice::sort_unstable` internals.

rustbot · 2024-04-16T18:28:22Z

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Voultapher · 2024-04-16T18:28:27Z

r? thomcc

matthiaskrgr · 2024-04-16T19:31:32Z

@bors try @rust-timer queue

library/alloc/src/slice/tests.rs

Kobzol · 2024-04-17T12:31:40Z

@bors try

(looks like the try build didn't get through)

bors · 2024-04-17T12:32:51Z

⌛ Trying commit eeee5de with merge a4132fa...

Replace sort implementations This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic. * `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md) * `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md) #### Why should we change the sort implementations? In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both. #### Why is this one big PR and not multiple? * The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code. * driftsort and ipnsort share a substantial part of their implementations. * The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes. --- This PR is a collaboration with `@orlp.`

bors · 2024-04-17T14:06:17Z

☀️ Try build successful - checks-actions
Build commit: a4132fa (a4132fa857b71c2cec6719753cc4f31d6f7ea3c4)

rust-timer · 2024-04-17T15:23:06Z

Finished benchmarking commit (a4132fa): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.2%	[0.2%, 18.4%]	44
Regressions ❌ (secondary)	1.1%	[0.2%, 4.3%]	9
Improvements ✅ (primary)	-0.4%	[-0.5%, -0.3%]	7
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.9%	[-0.5%, 18.4%]	51

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	3.6%	[1.0%, 7.3%]	10
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.2%	[-3.8%, -2.0%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	2.1%	[-3.8%, 7.3%]	13

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	3.6%	[0.9%, 19.0%]	24
Regressions ❌ (secondary)	3.2%	[2.2%, 4.2%]	2
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	3.6%	[0.9%, 19.0%]	24

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.5%	[0.1%, 4.7%]	75
Regressions ❌ (secondary)	2.5%	[0.8%, 4.1%]	76
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.5%	[0.1%, 4.7%]	75

Bootstrap: 679.119s -> 683.166s (0.60%)
Artifact size: 316.14 MiB -> 317.49 MiB (0.43%)

Also includes small doc fixes.

Voultapher · 2024-04-17T15:58:56Z

The binary-size results are inline with what we expected, given that slice::sort is more prevalent than slice::sort_unstable it makes sense that the regressions outweigh the improvements.

Regarding compile-times, we looked at those in our analysis here and here. The instruction count metric does not account for gains via parallelizability, right?

orlp · 2024-04-17T22:07:31Z

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

Urgau · 2024-04-17T22:19:45Z

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

@orlp Nearly the all the benchmarks represent the measurement on rustc it-self, not the final binary/library. So for the helloworld-tiny benchmark the regression indicates that rustc it-self needing 4.31% more instructions than before this PR.

Kobzol · 2024-04-17T22:31:12Z

~~(I think that orlp was asking about the binary size). It's quite possible that it's because of backtrace, the binary size is quite sensitive to changes in stdlib deps.~~ Nope, sorry, 4.31 != 4.13 😆 Nevermind.

orlp · 2024-04-17T22:35:18Z

@Urgau Yes I'm aware, nevertheless a 4.3% regression in compile-time for helloworld-tiny, which is almost entirely found in the backend code generation:

This indicates that it's likely a hidden a sort call embedded in this program that needs extra time to be compiled. I conjecture due to the backtrace routine containing a sort call.

Also, when I say 'time' I must concur with @Voultapher in that we found that our new sort implementation does use more resources to compile, but may not actually be slower in real-time as more can be compiled in parallel.

Voultapher · 2024-04-18T06:10:07Z

In the past I looked at how much run-time rustc spends inside sort #108005 (comment) and it's less than 0.1% of the total runtime. The regressions are caused by more complex implementations that may take longer to compile, and for example due to backtrace-rs every program contains at least one call to slice::sort_unstable. Another data-point that supports this theory is this experiment #108662, the run-time code-path stayed nearly identical, yet we see large regressions in compile-time.

Kobzol · 2024-04-18T06:25:44Z

It would be nice to add some (even very trivial) runtime benchmark that uses sorting. I know that our runtime benchmark measurement process is far from ideal, but it would be good to see some green result on this PR to confirm that it is an improvement in some basic real world usage. Also, could you please post results of benchmarks from stdlib (IIRC there are some microbenchmarks for sorting?).

(Great job with the new sort. implementation, btw!)

orlp · 2024-04-18T06:29:15Z

@Kobzol Do you mean adding sorting runtime benchmarks to the CI? Because we have extensive benchmarks in the design documents: stable, unstable.

Kobzol · 2024-04-18T06:30:31Z

Yeah, I meant adding it to rustc-perf. But maybe it's not needed, given that you already have such an extensive suite.

Voultapher · 2024-04-18T06:36:23Z

@Kobzol as discussed on Zulip, this stand-alone benchmark runner and analysis would be a good fit for the rustc-perf suite https://github.com/Voultapher/sort-research-rs/tree/main/util/rustc-sort-bench. I think we should add it to notice future regressions and improvements.

Kobzol · 2024-04-18T09:06:17Z

Yeah, that would be nice, although I still haven't found time to look at it, sorry (was quite busy lately, and will still be in the upcoming weeks). This is not something that has to block progress on this PR though!

Voultapher · 2024-04-18T17:24:04Z

I concur, let's not block this PR on extending the rustc-perf suite.

Voultapher · 2024-04-18T20:06:14Z

@Kobzol here are the microbenchmark results you asked for on my Zen 3 machine:

 name                                          baseline ns/iter      branch ns/iter        diff ns/iter   diff %  speedup 
 slice::sort_by_cached_key_lexicographic       1,285,959 (62 MB/s)   1,106,405 (72 MB/s)       -179,554  -13.96%   x 1.16 
 slice::sort_by_key_lexicographic              7,918,161 (10 MB/s)   8,450,892 (9 MB/s)         532,731    6.73%   x 0.94 
 slice::sort_large_ascending                   3,580 (22346 MB/s)    3,367 (23760 MB/s)            -213   -5.95%   x 1.06 
 slice::sort_large_big                         716,164 (1787 MB/s)   1,439,015 (889 MB/s)       722,851  100.93%   x 0.50 
 slice::sort_large_descending                  4,754 (16827 MB/s)    6,628 (12070 MB/s)           1,874   39.42%   x 0.72 
 slice::sort_large_expensive                   13,105,320 (6 MB/s)   12,933,701 (6 MB/s)       -171,619   -1.31%   x 1.01 
 slice::sort_large_mostly_ascending            144,789 (552 MB/s)    128,831 (620 MB/s)         -15,958  -11.02%   x 1.12 
 slice::sort_large_mostly_descending           150,506 (531 MB/s)    140,820 (568 MB/s)          -9,686   -6.44%   x 1.07 
 slice::sort_large_random                      257,440 (310 MB/s)    88,850 (900 MB/s)         -168,590  -65.49%   x 2.90 
 slice::sort_large_strings                     1,031,233 (155 MB/s)  641,908 (249 MB/s)        -389,325  -37.75%   x 1.61 
 slice::sort_medium_random                     923 (866 MB/s)        481 (1663 MB/s)               -442  -47.89%   x 1.92 
 slice::sort_small_ascending                   22 (3636 MB/s)        20 (4000 MB/s)                  -2   -9.09%   x 1.10 
 slice::sort_small_big                         125 (10240 MB/s)      111 (11531 MB/s)               -14  -11.20%   x 1.13 
 slice::sort_small_descending                  32 (2500 MB/s)        34 (2352 MB/s)                   2    6.25%   x 0.94 
 slice::sort_small_random                      27 (2962 MB/s)        30 (2666 MB/s)                   3   11.11%   x 0.90 
 slice::sort_unstable_by_key_lexicographic     9,018,384 (8 MB/s)    8,171,665 (9 MB/s)        -846,719   -9.39%   x 1.10 
 slice::sort_unstable_large_ascending          4,714 (16970 MB/s)    3,329 (24031 MB/s)          -1,385  -29.38%   x 1.42 
 slice::sort_unstable_large_big                520,268 (2460 MB/s)   597,851 (2141 MB/s)         77,583   14.91%   x 0.87 
 slice::sort_unstable_large_descending         5,934 (13481 MB/s)    4,455 (17957 MB/s)          -1,479  -24.92%   x 1.33 
 slice::sort_unstable_large_expensive          8,998,540 (8 MB/s)    9,063,229 (8 MB/s)          64,689    0.72%   x 0.99 
 slice::sort_unstable_large_mostly_ascending   36,093 (2216 MB/s)    72,351 (1105 MB/s)          36,258  100.46%   x 0.50 
 slice::sort_unstable_large_mostly_descending  37,750 (2119 MB/s)    75,697 (1056 MB/s)          37,947  100.52%   x 0.50 
 slice::sort_unstable_large_random             132,289 (604 MB/s)    69,359 (1153 MB/s)         -62,930  -47.57%   x 1.91 
 slice::sort_unstable_large_strings            718,218 (222 MB/s)    537,293 (297 MB/s)        -180,925  -25.19%   x 1.34 
 slice::sort_unstable_medium_random            447 (1789 MB/s)       348 (2298 MB/s)                -99  -22.15%   x 1.28 
 slice::sort_unstable_small_ascending          24 (3333 MB/s)        22 (3636 MB/s)                  -2   -8.33%   x 1.09 
 slice::sort_unstable_small_big                131 (9770 MB/s)       115 (11130 MB/s)               -16  -12.21%   x 1.14 
 slice::sort_unstable_small_descending         38 (2105 MB/s)        36 (2222 MB/s)                  -2   -5.26%   x 1.06 
 slice::sort_unstable_small_random             38 (2105 MB/s)        32 (2500 MB/s)                  -6  -15.79%   x 1.19

sort_unstable_large_mostly_ascending and sort_unstable_large_mostly_descending are expected to be slower, as described in the design document. If the input is partially sorted using slice::sort is probably faster, as it has access to efficient merging. sort_large_big aligns with the 1k type as tested, seeing a mixed bag of results in our testing as well. Overall I'd say there are no real surprises here.

Voultapher · 2024-04-18T20:07:55Z

I suggest we also run the rustc-perf suite, I would only expect large changes if any of the simulated workloads spend a substantial amount of time in slice::sort or slice::sort_unstable.

Kobzol · 2024-04-19T18:25:27Z

We already did.

Voultapher · 2024-04-19T20:35:47Z

We already did.

I though those were the compile-time benchmarks, and not the run-time suite. Or are they together in the same suite?

Voultapher · 2024-04-19T20:36:59Z

Wait, never mind I just found the tab in the results page.

Kobzol · 2024-04-19T20:37:18Z

Yeah, they are together. You can see the runtime results in the Runtime tab. But there's nothing much to see here.. :)

rustbot assigned jhpratt Apr 16, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 16, 2024

rustbot assigned thomcc and unassigned jhpratt Apr 16, 2024