Implement a faster sort algorithm #38192
Conversation
Since it is quite a major rewrite of the sorting routine, and I see that we have no test guaranteeing stability, I would very much like a test that ensures this invariant. Something along the lines of
Actually, we do have a sort guaranteeing stability. It's called …
bluss
requested changes
Dec 6, 2016
I loved reading this. The requested changes are mostly about comments and the guard struct names, plus a look at the zero-sized types case, to check that it doesn't panic.
    // that case, `i == j` so we don't copy. The
    // `.offset(j)` is always in bounds.
    // When dropped, copies from `src` into `dst`.
    struct Guard<T> {
bluss
Dec 6, 2016
Contributor
Since there are two struct Guard with slightly different jobs, they should have different names. (Sidenote: What it does with a hole here is exactly like Hole in the implementation of BinaryHeap).
stjepang
Dec 6, 2016
Contributor
How about InsertionHole and MergeHole? Or InsertionGuard and MergeGuard?
    let len = v.len();
    // Increments the pointer and returns the old value.
    unsafe fn forward<T>(ptr: &mut *mut T) -> *mut T {
bluss
Dec 6, 2016
Contributor
forward is also known as ptr++ in C, that is, "post-increment".
In my opinion this reads more easily as an extension method on raw pointers, so that call sites read ptr.post_increment(). Not sure what others think?
gereeter
Dec 6, 2016
Contributor
Relatedly, the fact that forward does a post-increment and that backwards does a pre-increment is somewhat confusing. It'd be nice to have a comment explaining that when iterating backwards, the pointer points past the end of the array, while when iterating forwards, the pointer points directly to the beginning of the array.
stjepang
Dec 6, 2016
Contributor
Another option is to name the two functions get_incr and decr_get instead, but I'm not too happy with that solution either.
bluss
Dec 6, 2016
Contributor
Using the word order for ordering of operations is too subtle for me. Depends on what one is used to, maybe.
    insertion_sort(v, compare);
    return;
    // Decrements the pointer and returns the new value.
    unsafe fn backward<T>(ptr: &mut *mut T) -> *mut T {
    // rather than <=, to maintain stability.
    impl<T> Drop for Guard<T> {
        fn drop(&mut self) {
            let len = (self.end as usize - self.start as usize) / mem::size_of::<T>();
bluss
Dec 6, 2016
Contributor
If T is a zero sized type (ZST), this is a division by zero.
Now I know: the old sort code also does this! The division by size_of turns out to be unreachable for ZSTs in that implementation. We can't require much of anything in this case, but maybe add a test that ensures sorting a slice of ZSTs finishes without a panic.
stjepang
Dec 6, 2016
Contributor
Would it make sense to make the entire sort algorithm on ZSTs simply a no-op (i.e. return immediately from merge_sort)? Can one meaningfully sort an array of ZSTs at all?
bluss
Dec 6, 2016
Contributor
I think it's fine to make it a noop. We could discuss the finer points of that another time. I don't think the current sort has any meaningful behaviour for sorting ZST.
    if len <= insertion {
        if len >= 2 {
            for i in (0..len-1).rev() {
                insert_head(&mut v[i..len], &mut compare);
bluss
Dec 6, 2016
Contributor
The len on the end of the range here seems redundant. Is it needed for some reason?
            len: left.len + right.len,
        };
        runs.remove(r + 1);
    }
}
bluss
Dec 6, 2016
Contributor
There should be exactly one Run in the runs vector here, is that right? Maybe a debug assertion for that (so that tests exercise it)?
bluss
reviewed
Dec 6, 2016
            break;
        }
        ptr::copy_nonoverlapping(&v[i], &mut v[i - 1], 1);
        guard.dst = &mut v[i];
    }
bluss
Dec 6, 2016
Contributor
Can you put a comment after the } here indicating that the guard fills the hole as part of normal execution?
    };
    ptr::copy_nonoverlapping(to_copy, backward(&mut out), 1);
    }
}
bluss
Dec 6, 2016
Contributor
The same here: a comment reminding that guard is dropped and it copies the range into place.
bluss
reviewed
Dec 6, 2016
    // http://envisage-project.eu/timsort-specification-and-verification/
    //
    // The gist of the story is: we must enforce the invariants on the top four runs on the stack.
    // Enforcing them on just top three is not enough.
bluss
Dec 6, 2016
Contributor
The invariants must be enforced, else .. what? Just that it will not be O(n log n) otherwise, if I understand correctly?
apasel422
reviewed
Dec 6, 2016
    while j > 0 && compare(&*read_ptr, &*buf_v.offset(j - 1)) == Less {
        j -= 1;
    // A common method of doing insertion sort is swapping adjacent elements until the
    // first one gets to its destination. Swapping is slow whenever elements are big
gereeter
reviewed
Dec 6, 2016
        src: &mut tmp.value,
        dst: &mut v[1],
    };
    ptr::copy_nonoverlapping(&v[1], &mut v[0], 1);
gereeter
Dec 6, 2016
Contributor
What is the advantage of copying individual elements, overlapped with the comparisons? Is it slower to just search for the correct insertion location and do a single bulk ptr::copy? If so, could you add a comment explaining that?
stjepang
Dec 6, 2016
Contributor
Yes, it's slightly slower to do a single bulk ptr::copy after search. I'll explain that in a comment.
    struct Guard<T> {
        start: *mut T,
        end: *mut T,
        dst: *mut T,
gereeter
Dec 6, 2016
Contributor
Nit: could this be called dest instead of dst? At least for me, the first thing I think of with dst is dynamically sized types.
stjepang
Dec 6, 2016
Contributor
ptr::copy has arguments src and dst (link).
Good point about dynamically sized types. I agree that dest is clearer, so I'll rename this field.
    mem::swap(&mut buf_dat, &mut buf_tmp);
    // Insert some more elements into the run if it's too short. Insertion sort is faster than
    // merge sort on short sequences, so this significantly improves performance.
    while start > 0 && end - start < insertion {
gereeter
Dec 6, 2016
Contributor
Possible micro-optimization (I don't know if LLVM does this already or how much it matters): instead of using two conditionals here, the final start position could be calculated once (if end > insertion { end - insertion } else { 0 }), and then start could be compared against that final start position.
stjepang
Dec 7, 2016
Contributor
I did a bit of benchmarking, and it seems this micro-optimization is totally irrelevant. I'll keep the code as is, for clarity.
    let mut runs = vec![];
    let mut end = len;
    while end > 0 {
gereeter
Dec 6, 2016
Contributor
Is there any particular reason that this implementation goes back-to-front through the array? This isn't a problem, just a bit surprising - almost every piece of array-handling code I've seen, when given the choice, goes front-to-back.
stjepang
Dec 6, 2016
Contributor
Yeah, the fact that you're asking this tells me I should explain this decision in comments. Will do so. :)
    ///
    /// The invariants ensure that the total running time is `O(n log n)` worst-case.
    fn merge_sort<T, F>(v: &mut [T], mut compare: F)
        where F: FnMut(&T, &T) -> Ordering
gereeter
Dec 6, 2016
Contributor
As far as I can tell, this implementation of merge sort only ever compares the result of compare to Greater. It might be more efficient to just pass around a function greater: FnMut(&T, &T) -> bool, since I think that some types can implement binary comparisons slightly more efficiently than trinary comparisons.
stjepang
Dec 6, 2016
Contributor
How would we expose the ability to pass a function greater instead of compare in the public API, if at all?
sfackler
Dec 7, 2016
Member
It would probably only matter for sort, where we can call a.gt(b) rather than a.cmp(b) == Greater.
EpicatSupercell
Dec 14, 2016
How about adding greater (and the others: less, less_eq, ...) to the std::cmp::Ord trait with a default implementation using cmp? That would allow anyone to write specializations for any type. It's the same thing we do for std::iter::Iterator.
    ptr::copy(&*buf_dat.offset(j), buf_dat.offset(j + 1), i - j as usize);
    ptr::copy_nonoverlapping(read_ptr, buf_dat.offset(j), 1);
    // Short arrays get sorted in-place via insertion sort to avoid allocations.
    if len <= insertion {
gereeter
Dec 6, 2016
Contributor
It occurs to me that this threshold might want to be higher than the threshold for sorting runs, since it might be worth some time penalty to avoid the allocation.
stjepang
Dec 7, 2016
Contributor
You're absolutely right. I did some benchmarking and will split the threshold into two separate variables.
Stebalien
reviewed
Dec 6, 2016
    #![feature(trusted_len)]
    #![feature(unicode)]
    #![feature(unique)]
    #![feature(slice_get_slice)]
    #![feature(untagged_unions)]
Stebalien
Dec 6, 2016
Contributor
Maybe I'm missing something obvious but I don't see where you use this feature (or why you changed these).
stjepang
Dec 6, 2016
Contributor
Three changes to the feature list are in this file:
- Removed step_by. The old sort used it, but the new one doesn't.
- Moved slice_get_slice a few lines up to keep the feature list sorted alphabetically.
- Added untagged_unions. This feature is used in function insert_head above (struct NoDrop<T>).
Stebalien
Dec 7, 2016
Contributor
GitHub wasn't loading all the diffs... sigh (sorry). I was excited to see a non-FFI use case for unions and got a bit confused when searching didn't turn any up.
I addressed all requests for changes and improved the comments in the code.
bluss
approved these changes
Dec 7, 2016
Re: #38192 (comment), sorting a collection of ZSTs: for what it's worth, I think the only correct answer is to make it a no-op. ZSTs are types with exactly one value. Thus, given a collection …
mark-i-m
reviewed
Dec 7, 2016
    @@ -1087,7 +1086,7 @@ impl<T> [T] {
        /// elements.
        ///
        /// This sort is stable and `O(n log n)` worst-case but allocates
    -   /// approximately `2 * n`, where `n` is the length of `self`.
    +   /// approximately `n / 2`, where `n` is the length of `self`.
mark-i-m
Dec 7, 2016
Contributor
I haven't gotten a chance to read all of the code yet, but I was wondering why big-O notation is used for the time complexity, but not for space? Is "approximately n / 2" not asymptotic behavior (e.g. is it an empirical number)? If not, then, I think it would be good to explicitly mention this.
stjepang
Dec 7, 2016
Contributor
Saying "allocates O(n / 2)" is equivalent to "allocates less than n / 2 * K, for some constant K".
Saying "allocates approximately n / 2" is equivalent to "allocates between n / 2 - K and n / 2 + K, for some constant K".
The latter is more precise and therefore a better description of memory usage.
rkruppe
Dec 7, 2016
Member
There's still a constant factor hidden in this wording, because it doesn't contain any units. n / 2 what? Bytes? Words? size_of::<T>()?
The old version also has this problem, but while you're updating it and have the details fresh in mind, could you add this detail?
Ah, I see. Thanks!
Is there a case for which this implementation performs worse than the old one?
@gnzlbg I don't think so. Let's discuss a few cases:
Sorting depends on structure size, type of input, comparison function, your platform, etc. I can't guarantee the new sort always wins, but I haven't found a case where it loses. If it ever loses, I'm sure it would be by a very small margin.
bluss
added
the
relnotes
label
Dec 7, 2016
stjepang
force-pushed the
stjepang:faster-sort-algorithm
branch
2 times, most recently
from
d0529a5
to
32ee2fb
Dec 7, 2016
@stjepang In practice, most data to be sorted has some structure and is rarely truly random. I think it would be worth it to check the benchmarking patterns of Boost.Sort, the libc++ algorithm benchmarks, Morwenn/cpp-sort, ... (the licenses are permissive). For example, libc++'s std::sort is benchmarked using the following patterns: …
This makes libc++'s … See also this Boost.Sort issue.
IMO the new implementation doesn't need to be better than the old one in all of these cases, but I still would like to be sure that it doesn't blow up in any of them.
Pinging @Morwenn since he might be interested in this and might know other people who can give constructive feedback.
@gnzlbg That was already done, comparing this PR's modified TimSort with the current MergeSort and an Introsort implementation: https://gist.github.com/stjepang/b9f0c3eaa0e1f1280b61b963dae19a30
OK, can you explain what the two performance listings mean and which numbers are for which implementation? Important information is missing, for example: is larger better?
Here's the program that generated those patterns: https://github.com/notriddle/quickersort/blob/master/examples/perf_txt.rs#L63
Also, here's how the old …
Morwenn
commented
Dec 7, 2016
Another trick to speed up sorting algorithms is to switch to an unstable sorting algorithm when you can statically infer that the stability (or lack thereof) of the sorting algorithm isn't an observable effect. I don't know about Rust, but if you've got a plain array of simple integers and you sort it with a trivial comparator, if I'm not mistaken there's no way to tell the difference between a stable and an unstable sort.
@Morwenn Rust could do this through specialization and an unstable-sort-safe marker trait.
A brief description of what perf.txt is showing you: …
@notriddle Where are the benchmarks for the organ pipe and sawtooth patterns? I cannot find them.
EDIT: the benchmarks are there (at least for sawtooth; I still can't find organ pipe), but the results for those don't seem to have been posted in the OP?
Sawtooth is literally the first one shown. Organ pipe is actually not there. I'd be happy to add it, though.
Thanks for the explanation.
@gnzlbg Some of the patterns displayed in the picture you posted are already benchmarked as part of this PR. I get what you're saying about sorting structured data, and agree with you. The new implementation is better than the old one on all of those benchmarks. That is indisputable even without any further benchmarking, really. And it won't blow up by design either. :)
It is certainly possible to harness more speed on structured data at the expense of speed on random data (and code complexity, too!). However, note that speed on structured data is already very good (look at the benchmarks we have). Also, keep in mind that we really don't have much choice with stable sorts. TimSort is basically the only reasonably fast and versatile stable sort, and it was designed to be fast on structured data in the first place. Our choices boil down to: how far do we go with it, and how do we adapt TimSort (designed for Python) to Rust (a vastly different language)?
The PR is a great improvement on several fronts (…). There's a lot of interesting things that can follow here. We can create a meta issue to track the different ideas for those who are interested.
stjepang
force-pushed the
stjepang:faster-sort-algorithm
branch
from
32ee2fb
to
c8d73ea
Dec 7, 2016
The fixup commits are squashed now.
@bors r+
added a commit
that referenced
this pull request
Dec 8, 2016
Why did the Travis build fail?
@notriddle Due to an unrelated problem. Looks like all builds have been failing for the past day or two.
stjepang
referenced this pull request
Dec 8, 2016
Closed
Buildbot failing due to a link error on arm-linux-androideabi #38248
@bors r+
added a commit
that referenced
this pull request
Dec 9, 2016
bors
merged commit c0e150a
into
rust-lang:master
Dec 9, 2016
stjepang
deleted the
stjepang:faster-sort-algorithm
branch
Dec 9, 2016
tbelaire
reviewed
Dec 14, 2016
    /// satisfied, for every `i` in `0 .. runs.len() - 2`:
    ///
    /// 1. `runs[i].len > runs[i + 1].len`
    /// 2. `runs[i].len > runs[i + 1].len + runs[i + 2].len`
tbelaire
Dec 14, 2016
Contributor
Uh, I think there's a typo in your invariant?
From the timsort post:

    What turned out to be a good compromise maintains two invariants on the
    stack entries, where A, B and C are the lengths of the three rightmost
    not-yet-merged slices:
    - A > B+C
    - B > C

But here you have:

    - A > B
    - A > B + C

which seems incorrect.
stjepang
Dec 15, 2016
Contributor
Good catch, there's a minor off-by-one error in the comment. However, probably not for the reason you have in mind :)
Take a look at fn collapse - it's equivalent to the corrected and formally verified mergeCollapse presented in the paper (link). Let's trust that one instead of the original TimSort post from 2002.
I suggest rewriting the comment like this:

    /// satisfied:
    ///
    /// 1. for every `i` in `1..runs.len()`: `runs[i - 1].len > runs[i].len`
    /// 2. for every `i` in `2..runs.len()`: `runs[i - 2].len > runs[i - 1].len + runs[i].len`

These invariants are exactly what fn collapse is enforcing.
What do you think, does that look good to you?
TomOnTime
commented
Dec 15, 2016
I'm not usually this particular about English grammar, but the sentence "Performs less comparisons on both random and partially sorted inputs." should be changed s/less/fewer/g. Less refers to non-countable items ("I'm less happy today than yesterday"). Fewer refers to countable items ("Mary has fewer apples than John.").
Lol, grammar makes me fewer happy :P
Heh, fixed :) Feel free to point out any grammar mistakes in comments within the code.
stjepang
referenced this pull request
Dec 21, 2016
Closed
Reconsidering the recent changes in sort guarantees #38524
alexcrichton
referenced this pull request
Dec 29, 2016
Closed
iwls-0.0.2 fails its test suite in Rust 1.15 beta #38685
For those interested, the pdqsort implementation is here: #40601

stjepang commented Dec 6, 2016 (edited)
Hi everyone, this is my first PR.
I've made some changes to the standard sort algorithm, starting out with a few tweaks here and there, but in the end this endeavour became a complete rewrite of it.
Summary
Changes:
Benchmark:
Background
First, let me just do a quick brain dump to discuss what I learned along the way.
The official documentation says that the standard sort in Rust is a stable sort. This constraint is thus set in stone and immediately rules out many popular sorting algorithms. Essentially, the only algorithms we might even take into consideration are:
Actually, all of those are just merge sort flavors. :) The current standard sort in Rust is a simple iterative merge sort. It has three problems. First, it's slow on partially sorted inputs (even though #29675 helped quite a bit). Second, it always makes around `log(n)` iterations copying the entire array between buffers, no matter what. Third, it allocates huge amounts of temporary memory (a buffer of size `2*n`, where `n` is the size of the input).
The problem of auxiliary memory allocation is a tough one. Ideally, it would be best for our sort to allocate `O(1)` additional memory. This is what block sort (and its variants) does. However, it's often very complicated (look at this) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison to `std::stable_sort` in C++. It can even beat `std::sort` on partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two.
O(1)additional memory. This is what block sort (and it's variants) does. However, it's often very complicated (look at this) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison tostd::stable_sortin C++. It can even beatstd::sorton partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two.TimSort is another option. It allocates a buffer of size
n/2, which is not great, but acceptable. Performs extremelly well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in Rust, C++, and D. The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones.The new algorithm
Choosing the best algorithm is not easy. Plain merge sort is bad on partially sorted inputs. TimSort is bad on random inputs and block sort is even worse. However, if we take the main ideas from TimSort (intelligent merging strategy of sorted runs) and drop galloping, then we'll have great performance on random inputs and it won't be bad on partially sorted inputs either.
That is exactly what this new algorithm does. I can't call it TimSort, since it steals just a few of its ideas. A complete TimSort would be a much more complex and elaborate implementation. If we figure out in the future how to incorporate more of its ideas into this implementation without crippling performance on random inputs, it will be very easy to extend. I also made several other minor improvements, like reworking insertion sort to make it faster.
There are also new, more thorough benchmarks and panic safety tests.
The final code is not terribly complex and has less unsafe code than I anticipated, but there's still plenty of it that should be carefully reviewed. I did my best at documenting non-obvious code.
I'd like to notify several people of this PR, since they might be interested and have useful insights:
P.S. quickersort describes itself as being universally faster than the standard sort, which is true. However, if this PR gets merged, things might change a bit. ;)