Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace sort implementations #124032

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Voultapher
Copy link
Contributor

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for slice::sort while improving it for slice::sort_unstable. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

Why should we change the sort implementations?

In the 2023 Rust survey, one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

Why is this one big PR and not multiple?

  • The current documentation gives performance recommendations for slice::sort and slice::sort_unstable. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
  • driftsort and ipnsort share a substantial part of their implementations.
  • The implementation of select_nth_unstable uses internals of slice::sort_unstable, which makes it impractical to split changes.

This PR is a collaboration with @orlp.

- `slice::sort` -> driftsort
  https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md

- `slice::sort_unstable` -> ipnsort
  https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md

Replaces the sort implementations with tailor made ones that strike a
balance of run-time, compile-time and binary-size, yielding run-time and
compile-time improvements. Regressing binary-size for `slice::sort`
while improving it for `slice::sort_unstable`. All while upholding the
existing soft and hard safety guarantees, and even extending the soft
guarantees, detecting strict weak ordering violations with a high chance
and reporting it to users via a panic.

In addition the implementation of `select_nth_unstable` is also adapted
as it uses `slice::sort_unstable` internals.
@rustbot
Copy link
Collaborator

rustbot commented Apr 16, 2024

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 16, 2024
@Voultapher
Copy link
Contributor Author

r? thomcc

@rustbot rustbot assigned thomcc and unassigned jhpratt Apr 16, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@matthiaskrgr
Copy link
Member

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 16, 2024
@rust-log-analyzer

This comment has been minimized.

@Voultapher

This comment was marked as outdated.

@rustbot rustbot assigned joboet and thomcc and unassigned thomcc and joboet Apr 17, 2024
@Kobzol
Copy link
Contributor

Kobzol commented Apr 17, 2024

@bors try

(looks like the try build didn't get through)

@bors
Copy link
Contributor

bors commented Apr 17, 2024

⌛ Trying commit eeee5de with merge a4132fa...

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 17, 2024
Replace sort implementations

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

* `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md)

* `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md)

#### Why should we change the sort implementations?

In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

#### Why is this one big PR and not multiple?

* The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
* driftsort and ipnsort share a substantial part of their implementations.
* The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes.

---

This PR is a collaboration with `@orlp.`
@bors
Copy link
Contributor

bors commented Apr 17, 2024

☀️ Try build successful - checks-actions
Build commit: a4132fa (a4132fa857b71c2cec6719753cc4f31d6f7ea3c4)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (a4132fa): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.2% [0.2%, 18.4%] 44
Regressions ❌
(secondary)
1.1% [0.2%, 4.3%] 9
Improvements ✅
(primary)
-0.4% [-0.5%, -0.3%] 7
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.9% [-0.5%, 18.4%] 51

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.6% [1.0%, 7.3%] 10
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.2% [-3.8%, -2.0%] 3
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 2.1% [-3.8%, 7.3%] 13

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.6% [0.9%, 19.0%] 24
Regressions ❌
(secondary)
3.2% [2.2%, 4.2%] 2
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 3.6% [0.9%, 19.0%] 24

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.5% [0.1%, 4.7%] 75
Regressions ❌
(secondary)
2.5% [0.8%, 4.1%] 76
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.5% [0.1%, 4.7%] 75

Bootstrap: 679.119s -> 683.166s (0.60%)
Artifact size: 316.14 MiB -> 317.49 MiB (0.43%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Apr 17, 2024
Also includes small doc fixes.
@Voultapher
Copy link
Contributor Author

Voultapher commented Apr 17, 2024

The binary-size results are inline with what we expected, given that slice::sort is more prevalent than slice::sort_unstable it makes sense that the regressions outweigh the improvements.

Regarding compile-times, we looked at those in our analysis here and here. The instruction count metric does not account for gains via parallelizability, right?

@orlp
Copy link
Contributor

orlp commented Apr 17, 2024

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

@Urgau
Copy link
Contributor

Urgau commented Apr 17, 2024

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

@orlp Nearly the all the benchmarks represent the measurement on rustc it-self, not the final binary/library. So for the helloworld-tiny benchmark the regression indicates that rustc it-self needing 4.31% more instructions than before this PR.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 17, 2024

(I think that orlp was asking about the binary size). It's quite possible that it's because of backtrace, the binary size is quite sensitive to changes in stdlib deps. Nope, sorry, 4.31 != 4.13 😆 Nevermind.

@orlp
Copy link
Contributor

orlp commented Apr 17, 2024

@Urgau Yes I'm aware, nevertheless a 4.3% regression in compile-time for helloworld-tiny, which is almost entirely found in the backend code generation:

image

This indicates that it's likely a hidden a sort call embedded in this program that needs extra time to be compiled. I conjecture due to the backtrace routine containing a sort call.

Also, when I say 'time' I must concur with @Voultapher in that we found that our new sort implementation does use more resources to compile, but may not actually be slower in real-time as more can be compiled in parallel.

@Voultapher
Copy link
Contributor Author

Voultapher commented Apr 18, 2024

In the past I looked at how much run-time rustc spends inside sort #108005 (comment) and it's less than 0.1% of the total runtime. The regressions are caused by more complex implementations that may take longer to compile, and for example due to backtrace-rs every program contains at least one call to slice::sort_unstable. Another data-point that supports this theory is this experiment #108662, the run-time code-path stayed nearly identical, yet we see large regressions in compile-time.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 18, 2024

It would be nice to add some (even very trivial) runtime benchmark that uses sorting. I know that our runtime benchmark measurement process is far from ideal, but it would be good to see some green result on this PR to confirm that it is an improvement in some basic real world usage. Also, could you please post results of benchmarks from stdlib (IIRC there are some microbenchmarks for sorting?).

(Great job with the new sort. implementation, btw!)

@orlp
Copy link
Contributor

orlp commented Apr 18, 2024

@Kobzol Do you mean adding sorting runtime benchmarks to the CI? Because we have extensive benchmarks in the design documents: stable, unstable.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 18, 2024

Yeah, I meant adding it to rustc-perf. But maybe it's not needed, given that you already have such an extensive suite.

@Voultapher
Copy link
Contributor Author

@Kobzol as discussed on Zulip, this stand-alone benchmark runner and analysis would be a good fit for the rustc-perf suite https://github.com/Voultapher/sort-research-rs/tree/main/util/rustc-sort-bench. I think we should add it to notice future regressions and improvements.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 18, 2024

Yeah, that would be nice, although I still haven't found time to look at it, sorry (was quite busy lately, and will still be in the upcoming weeks). This is not something that has to block progress on this PR though!

@Voultapher
Copy link
Contributor Author

I concur, let's not block this PR on extending the rustc-perf suite.

@Voultapher
Copy link
Contributor Author

@Kobzol here are the microbenchmark results you asked for on my Zen 3 machine:

 name                                          baseline ns/iter      branch ns/iter        diff ns/iter   diff %  speedup 
 slice::sort_by_cached_key_lexicographic       1,285,959 (62 MB/s)   1,106,405 (72 MB/s)       -179,554  -13.96%   x 1.16 
 slice::sort_by_key_lexicographic              7,918,161 (10 MB/s)   8,450,892 (9 MB/s)         532,731    6.73%   x 0.94 
 slice::sort_large_ascending                   3,580 (22346 MB/s)    3,367 (23760 MB/s)            -213   -5.95%   x 1.06 
 slice::sort_large_big                         716,164 (1787 MB/s)   1,439,015 (889 MB/s)       722,851  100.93%   x 0.50 
 slice::sort_large_descending                  4,754 (16827 MB/s)    6,628 (12070 MB/s)           1,874   39.42%   x 0.72 
 slice::sort_large_expensive                   13,105,320 (6 MB/s)   12,933,701 (6 MB/s)       -171,619   -1.31%   x 1.01 
 slice::sort_large_mostly_ascending            144,789 (552 MB/s)    128,831 (620 MB/s)         -15,958  -11.02%   x 1.12 
 slice::sort_large_mostly_descending           150,506 (531 MB/s)    140,820 (568 MB/s)          -9,686   -6.44%   x 1.07 
 slice::sort_large_random                      257,440 (310 MB/s)    88,850 (900 MB/s)         -168,590  -65.49%   x 2.90 
 slice::sort_large_strings                     1,031,233 (155 MB/s)  641,908 (249 MB/s)        -389,325  -37.75%   x 1.61 
 slice::sort_medium_random                     923 (866 MB/s)        481 (1663 MB/s)               -442  -47.89%   x 1.92 
 slice::sort_small_ascending                   22 (3636 MB/s)        20 (4000 MB/s)                  -2   -9.09%   x 1.10 
 slice::sort_small_big                         125 (10240 MB/s)      111 (11531 MB/s)               -14  -11.20%   x 1.13 
 slice::sort_small_descending                  32 (2500 MB/s)        34 (2352 MB/s)                   2    6.25%   x 0.94 
 slice::sort_small_random                      27 (2962 MB/s)        30 (2666 MB/s)                   3   11.11%   x 0.90 
 slice::sort_unstable_by_key_lexicographic     9,018,384 (8 MB/s)    8,171,665 (9 MB/s)        -846,719   -9.39%   x 1.10 
 slice::sort_unstable_large_ascending          4,714 (16970 MB/s)    3,329 (24031 MB/s)          -1,385  -29.38%   x 1.42 
 slice::sort_unstable_large_big                520,268 (2460 MB/s)   597,851 (2141 MB/s)         77,583   14.91%   x 0.87 
 slice::sort_unstable_large_descending         5,934 (13481 MB/s)    4,455 (17957 MB/s)          -1,479  -24.92%   x 1.33 
 slice::sort_unstable_large_expensive          8,998,540 (8 MB/s)    9,063,229 (8 MB/s)          64,689    0.72%   x 0.99 
 slice::sort_unstable_large_mostly_ascending   36,093 (2216 MB/s)    72,351 (1105 MB/s)          36,258  100.46%   x 0.50 
 slice::sort_unstable_large_mostly_descending  37,750 (2119 MB/s)    75,697 (1056 MB/s)          37,947  100.52%   x 0.50 
 slice::sort_unstable_large_random             132,289 (604 MB/s)    69,359 (1153 MB/s)         -62,930  -47.57%   x 1.91 
 slice::sort_unstable_large_strings            718,218 (222 MB/s)    537,293 (297 MB/s)        -180,925  -25.19%   x 1.34 
 slice::sort_unstable_medium_random            447 (1789 MB/s)       348 (2298 MB/s)                -99  -22.15%   x 1.28 
 slice::sort_unstable_small_ascending          24 (3333 MB/s)        22 (3636 MB/s)                  -2   -8.33%   x 1.09 
 slice::sort_unstable_small_big                131 (9770 MB/s)       115 (11130 MB/s)               -16  -12.21%   x 1.14 
 slice::sort_unstable_small_descending         38 (2105 MB/s)        36 (2222 MB/s)                  -2   -5.26%   x 1.06 
 slice::sort_unstable_small_random             38 (2105 MB/s)        32 (2500 MB/s)                  -6  -15.79%   x 1.19 

sort_unstable_large_mostly_ascending and sort_unstable_large_mostly_descending are expected to be slower, as described in the design document. If the input is partially sorted using slice::sort is probably faster, as it has access to efficient merging. sort_large_big aligns with the 1k type as tested, seeing a mixed bag of results in our testing as well. Overall I'd say there are no real surprises here.

@Voultapher
Copy link
Contributor Author

I suggest we also run the rustc-perf suite, I would only expect large changes if any of the simulated workloads spend a substantial amount of time in slice::sort or slice::sort_unstable.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 19, 2024

We already did.

@Voultapher
Copy link
Contributor Author

We already did.

I though those were the compile-time benchmarks, and not the run-time suite. Or are they together in the same suite?

@Voultapher
Copy link
Contributor Author

Wait, never mind I just found the tab in the results page.

@Kobzol
Copy link
Contributor

Kobzol commented Apr 19, 2024

Yeah, they are together. You can see the runtime results in the Runtime tab. But there's nothing much to see here.. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet