-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace list sorting merge_collapse()? #78742
Comments
The invariants on the run-length stack are uncomfortably subtle. There was a flap a while back when an attempt at a formal correctness proof uncovered that the _intended_ invariants weren't always maintained. That was easily repaired (as the researchers suggested), but it remains hard to grasp why it's always correct now. The Java variant didn't take the suggested fix, instead boosting its stack size. But it turns out that's still not enough, because the original researchers made a different mistake in their own paper ;-) http://drops.dagstuhl.de/opus/volltexte/2018/9467/ Buss & Knop recently published results on different ways of managing the stack, and I think they're worth investigating for "real life" use: https://arxiv.org/abs/1801.04641 Offhand, their "2-merge" looks quite appealing to me. The only invariant it maintains is that each run on the stack is at least twice the length of the run one above it on the stack, which can be maintained just by checking the topmost two stack entries in the loop. So, e.g., where merge_collapse() is currently happy with runs of lengths 150 and 100 ending the stack, 2-merge would force a merge (it's not the case that 150 >= 2*100). Both are happy if the stack ends with runs of lengths 200 and 100. This is much easier to reason about, simpler to code, and would allow reducing the size of the stack (worst-case size is proportional to the log (of the largest possible array) to the base 2 instead of to the current base phi (~1.62)). They developed some formal measures showing that its "merge cost" (an overall measure abstracting some from real-life behavior) is significantly better than timsort's current strategy. And a little thought convinced me it preserves that, for random data, it would continue to strongly tend towared triggering perfectly balanced merges (merging runs of the same size). When I was developing this code, virtually all the time was spent making merging itself as fast as possible; the stack-management code was just the first thing I tried that "worked well". But turns out that's the only part researchers care about ;-) I'd be pleased if it could be replaced with something both better and simpler, or even just simpler but no worse. But the devil's in the details ... |
The attached runstack.py models the relevant parts of timsort's current merge_collapse and the proposed 2-merge. Barring conceptual or coding errors, they appear to behave much the same with respect to "total cost", with no clear overall winner. Of course cases can be constructed to make either look better. As expected, twomerge requires fewer stack levels. And they behave identically when all runs have the same length. Note that the notion of "cost" here is simply that merging runs of lengths A and B always has cost A+B. That should be viewed as worst case, where the rest of timsort finds nothing to exploit. |
Looks like all sorts of academics are exercised over the run-merging order now. Here's a paper that's unhappy because timsort's strategy, and 2-merge too, aren't always near-optimal with respect to the entropy of the distribution of natural run lengths: "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs" Their alternatives are worth looking at too. Offhand, their "powersort" ordering looks more promising to me than their "peeksort". It was very deliberate that timsort looks at whether to merge a run immediately after identifying one, for the "freshness in memory hierarchy" reason. That peeksort does not is, I expect, more significant in real life than the "one little blemish" they call it. In any case, don't flip out over this. On randomly ordered input, timsort already strongly tends toward perfectly balanced merges, and there's no significant improvement to be had over that. On the other hand, on inputs with significant order that _can_ be exploited by this approach, the gains are far more due to "galloping" than to the order in which runs are merged. All these papers ignore galloping entirely. Example: take range(1_000_000), cut it in half, and riffle shuffle it. So you get
Either way, there are half a million natural runs, each of length 2. Any balanced way of merging them is as good as it gets. It's galloping alone that buys huge improvements in cases like this. Especially in the context of Python, where object comparisons are (in general) far more expensive than moving pointers around, some excess in worst-case memory copying is, well, "one little blemish" ;-) But still worth addressing if feasible. Now that sorting often also adapts to avoid the relatively massive expense of PyObject_RichCompareBool, memory-movement costs become more important. Approaches like 2-merge also give simpler merge_collapse() code that's easier to reason about, and reduces the worst-case stack size (which becomes very easy to compute in advance instead of the current puzzle). |
It doesn’t seem like there’s a real problem here, but you seem to suggest there would be some useful improvements possible here. I don’t know much about sorting algorithms per se. What do you mean by galloping? |
"Galloping" is the heart & soul of Python's sorting algorithm. It's explained in detail here: https://github.com/python/cpython/blob/master/Objects/listsort.txt The Java fork of the sorting code has had repeated bugs due to reducing the size of "the stack" used to hold merge states. Read the papers already referenced for details about that. There is no "bug" here in the Python version. It's that the complex code Java keeps tripping over ;-) , _could_ (possibly) be replaced by simpler code where the max stack size needed is obvious; that (e.g.) 2-merge moves around less memory overall in some cases where the current scheme is particularly wasteful in this respect; and that Munro & Wild present more ambitious merge-ordering schemes that are said to be provably near-optimal in a broader sense than 2-merge achieves. |
A new version of the file models a version of the Against it, its code is much more complex, and the algorithm is very far from obvious. The paper "motivates" it but doesn't really explain it in enough detail to make most of it clear (but refers to other papers where pieces of it are developed). Curious: unless a run is the last in an array, powersort never merges it immediately upon finding it. Instead its start and length are used to compute a "power", in turn used to decide whether to merge some previous number of runs. A dollar to anyone who can figure out what the heck the "power" computation is really doing in less than a full day without reading the paper ;-) Then two dollars if you can prove that the "never equal!" assert holds. It would require timing lots of C code on various cases to see whether the possible delay in merging new runs really matters. It's unclear to me a priori, because it buys something too (less memory copying overall). In any case, just switching to 2-merge would be easy, and that alone is enough to squash the contrived "bad cases" for the current approach. |
Haha ok. Cache optimization makes it pretty complicated to reason about true costs. Anyway, it’s not obvious to me even that the run lengths would need to be descending for best performace. I’m not even starting to think about how the merging order might affect galloping boosts on realistic data ;-). I didn’t get to that power thing at this point, but it looks like it’s unrelated to this consideration, except perhaps by chance. I might have time tonight to have a look at that. Surely we don’t want an algorithm that’s optimized for what they call ”Timsort drag” sequences in the literature ;-). |
So it looks like we're working with a logarithmic measure of the "cost". I'm operating largely based on your description of Timsort in the link in msg324597, which the paper also refers to. But since the paper is sorting an array of Java ints (not Integers), I assume the performance comparisons of the code they timed is not really representative of Python equivalents. Probably galloping boosts are much larger in the Python case. I haven't tried running the attached code yet, because I mostly carry a tablet with me now. The never-equal assert probably doesn't get any more obvious by running the code, though, anyway?-). |
The notion of cost is that merging runs of lengths A and B has "cost" A+B, period. Nothing to do with logarithms. Merge runs of lengths 1 and 1000, and it has cost 1001. They don't care about galloping, only about how the order in which merges are performed affect that notion of cost. Example: suppose we have runs of lengths 1, 2, and 3. Merge 1 and 2 first for a cost of 1+2 = 3, and we're left with runs of length 3 and 3. Merge those for a cost of 3+3 = 6, and add to the earlier cost of 3 for a total cost of 9. But if 2 and 3 are merged first that has cost 5, then merging runs of length 1 and 5 has a cost of 6, added to the earlier 5 gives a total cost of 11. So it's cheaper to merge 1 and 2 first. Galloping has nothing to do with this measure, nor does the binary insertion sort portion of timsort. And Java itself doesn't use timsort for sorting integers anyway (the overheads are too high when comparisons are dirt cheap). They're trying to quantify the effects of their merge-ordering strategies, relative to timsort's merge-ordering strategy. Which could have been done more clearly by ignoring Java's timsort implementation entirely and just changing the parts of their code that implement _their_ merge-ordering strategy. timsort's strategy is hard to analyze, but is trivially easy to _implement_. As is, they're inadvertently timing all sorts of stuff that's orthogonal to the merge-ordering strategy. |
Sorry, I meant that the funny code in the "power" computation in powersort is a logarithmic measure. But I don't know where that came from. I just looked at Timsort and now figuring out how powersort is different. |
And by "looking at" Timsort, I mean reading your explanation. The motivation for merge ordering and so on was already quite clear from there. But that motivation does not imply that the stack has to be monotonous in run length, although memory considerations might. |
No, there's no requirement that run lengths on the stack be ordered in any way by magnitude. That's simply one rule timsort uses, as well as 2-merge and various other schemes discussed in papers. powersort has no such rule, and that's fine. Regardless, rules must be such that the max stack size is at worst logarithmic in the number of array elements. It's also nice if it's obvious that a sequence of equal-length runs will lead to perfectly balanced merges, which is "obvious enough" with the timsort and 2-merge rules. It also happens to be true of the powersort rules, but harder to see from staring at code because it's hard to compute "node powers" in one's head. |
New version of runstack.py.
|
Another runstack.py adds a bad case for 2-merge, and an even worse (percentage-wise) bad case for timsort. powersort happens to be optimal for both. So they all have contrived bad cases now. powersort's bad cases are the least bad. So far ;-) But I expect that to remain so. |
Dear all, After me and my colleagues worked on the first paper you mention, I recently created another merge-based sorting algorithm, which I called "Adaptive Shivers Sort". This is a close variant of the Augmented Shivers Sort presented by Buss & Knop in their article Like Munro & Wild's Powersort and Peeksort, this algorithm has an optimal worst-case running time (up to a linear additive constant) when measured with the merge cost model that is used in all the articles you cite in this discussion. It also maintains a stack of size log_2(n). Custom tests that I had performed suggest that this algorithm is, in practice, as efficient as Munro & Wild's algorithms, and also does not suffer from critically bad cases. Moreover, like Buss & Knop's 2-merge, and unlike Munro & Wild's Powersort and Peeksort, this algorithm has the advantage of having a structure that is very similar to that of Timsort, which may be an advantage in your situation. That is why, upon reading your discussion, I refurbished my notes about Adaptive Shivers Sort, which you may find here: These notes include a description of the algorithm and a worst-time complexity analysis. If extending my analysis of this algorithm or investigating tuning details were of interest for you, I would be happy to help. Best regards, Vincent Jugé |
Thank you, Vincent! I very much enjoyed - and appreciated - your paper I referenced at the start. Way back when, I thought I had a proof of O(N log N), but never wrote it up because some details weren't convincing - even to me ;-) . Then I had to move on to other things, and never got back to it. I was probably mistaken. So it was delightful to see your elegant proof of that, and even more. I'm attaching a new runstack.py that includes code modeling your new adaptive Shivers strategy. Like all these approximations, it has its own "bad cases" on smallish inputs. Based on the specific cases this code runs, powersort remains in a category of its own, with timsort, twomerge, and shivers roughly tied on _average_ merge cost. Part of the reason, I suspect: powersort is the only strategy here that's always optimal for 3-run cases. Which it buys at the cost of never merging the run just discovered (unless it's the last run in the array). For example, in
twomerge and shivers merge 31 and 16 first, before seeing 1, which is "far from" optimal. timsort and powersort merge 16 and 1 first, although that's by design in powersort and by luck in timsort. In
only powersort does the best thing (note: runstack.py doesn't model peeksort; I'm sure it would merge 31 and 1 first on that too). I care about small-number-of-runs because real-life Python lists aren't asymptotic in nature ;-) People deliberately construct lists with a small number of runs now before sorting, because they know Python's sort can exploit that. For many other cases, all runs are artificially forced to length What I can't know before we time things is whether order-of-merging makes a measurable difference in real life, and whether powersort's possible delay in merging the most recent run has bad cache effects that overwhelm its smaller "merge cost". In any case, I'm keen to replace timsort's merge-order strategy with _something_ that's much easier to analyze. The makes your Shivers variant and powersort the only two real contenders now. It seems quite remarkable that the Shivers variant has such good asymptotics! It's certainly the easiest to modify Python's C code to implement (straightforward edits in a single function). |
I see... Indeed, my only goal when adapting Shivers Sort was to maintain some invariant that would make the analysis easy, while mimicking the arguments developed by Buss & Knop for their analysis of (plain) Shivers Sort. It is, however, sure that the n(H+4) upper bound I get should be refined when H is small, which more or less amounts to tackling the 3-run case you mention. I am reasonably convinced that the current version of Adaptive Shivers Sort could be adapted to take this problem into account, while keeping its overall simple stricture and complexity analysis. If this first step turns out to be feasible, I will also look at the latest version of runstack.py to investigate other possible improvements on Adaptive Shivers Sort. |
After having worked a little bit on improving AdaptiveShiversSort on few-run cases, I designed a new version of the algorithm, called shivers2 in the file runstack.py joined to this message It looks more complicated than the original AdaptiveShiversSort but the spirit, inspired by your comments, is as follows:
This variant's performance seems comparable with Powersort's.
Given its excellent overall performance, I think that trying to slightly tweak Powersort in cases it underperforms other sorts might be worth some effort. Best, Vincent |
The merge order was mentioned on python-dev today, and a quick web searched turned up a revision of Vincent Jugé's "Adaptive Shivers Sort: An Alternative Sorting Algorithm" paper I hadn't seen before: https://arxiv.org/pdf/1809.08411.pdf Its "length-adaptive ShiversSort" variation sure _sounds_ like it was intended to address the issues we discussed here (some "bad" cases when very few runs are in play). The analyses in that paper show that length-adaptive ShiversSort, and powersort, have the same asymptotic best and worst-case behaviors. Average case? Who knows? Hard to believe it could really be an issue, because even the worst cases are pretty darn good. So length-adaptive ShiversSort is worth looking into. But, if someone has the energy to pursue it, I'd be happy, at this point, just to give plain old "adaptive ShiversSort" a try. The version of the paper linked to above even lists the precise changes needed to CPython's code to implement it (although a code review would ask for changes, most obviously that its "int x" is too narrow an integer type). |
My benchmarks could be improved but however I found that Shivers' sort it can be done easily by switching the
main natural merge strategy like I did here :
https://github.com/LLyaudet/TSODLULS/commit/2968c4b4ca58ae794157dc9636ed87017e5f7a17
The relevant code is not very long :
/**
* This strategy is from ShiversSort:
* see the research report by Olin Shivers, Georgia Institute of
Technology, 2002.
* It uses bitwise tricks to check condition on floor of logarithms in
base 2 of runs lengths.
* When I benchmark it, it is slightly faster than Tim's sort strategy.
*/
#define TSODLULS_natural_merge_main_strategy__Shivers_sort \
while(merge_state.i_run_instances_count > 1){\
size_t i_merge_at = merge_state.i_run_instances_count - 2;\
size_t i_order_of_magnitude =
merge_state.arr_run_instances[i_merge_at + 1].i_length;\
if(i_order_of_magnitude < ((~i_order_of_magnitude) &
merge_state.arr_run_instances[i_merge_at].i_length) ){\
break;\
}\
i_compare_result = TSODLULS_MERGE_TWO_RUNS(&merge_state, i_merge_at);\
if(i_compare_result != 0){\
goto clean_and_return_error;\
}\
}\
/**
* This strategy is from AdaptiveShiversSort:
* see the articles by Vincent Jugé, for example 1024 Bulletin de la
SIF, avril 2020, in French,
* or the preprint on arXiv or SODA 2020 proceedings.
* It uses bitwise tricks to check condition on floor of logarithms in
base 2 of runs lengths.
* When I benchmark it, it is slightly faster than Tim's sort strategy.
* Despite its ressemblance with Shivers's sort,
* the distinct properties of this strategy make that JugéSort would
probably be a better name than AdaptiveShiversSort,
* or an even better name for the whole algorithm should be
TimShiversJugéSort and I must have missed many names ;)
* With AdaptiveShiversSort we avoid 'é' and diacritics in function names ;P
*/
#define TSODLULS_natural_merge_main_strategy__adaptive_Shivers_sort \
while(merge_state.i_run_instances_count > 2){\
size_t i_merge_at = merge_state.i_run_instances_count - 3;\
size_t i_order_of_magnitude =
merge_state.arr_run_instances[i_merge_at + 1].i_length |
merge_state.arr_run_instances[i_merge_at + 2].i_length;\
if(i_order_of_magnitude < ((~i_order_of_magnitude) &
merge_state.arr_run_instances[i_merge_at].i_length) ){\
break;\
}\
i_compare_result = TSODLULS_MERGE_TWO_RUNS(&merge_state, i_merge_at);\
if(i_compare_result != 0){\
goto clean_and_return_error;\
}\
}\ I will try to provide a patch before the end of the week. |
Added new runstack.py. New New Worth some thought, though. From the results here, length-adaptive ShiversSort is a bigger improvement over (plain-old) adaptive ShiversSort than the latter is over timsort. |
And another runstack.py adds 2**k * max(run_length_1, run_length_2, run_length3) >= |
New runstack.py mostly adds comments about a surprise: the idea that length-adaptive ShiversSort eeks out better results than powersort appears nearly unique to the specific "0.80" cutoff used in the random-case generation code to pick between two uniform distributions. Change that cutoff, and powersort almost always does better. So powersort remains the best known overall, although shivers4 remains competitive with it. |
I created a PR that implements the powersort merge strategy: Across all the time this issue report has been open, that strategy continues to be the top contender. Enough already ;-) It's indeed a more difficult change to make to the code, but that's in relative terms. In absolute terms, it's not at all a hard change. Laurent, if you find that some variant of ShiversSort actually runs faster than that, let us know here! I'm a big fan of Vincent's innovations too, but powersort seems to do somewhat better "on average" than even his length-adaptive ShiversSort (and implementing that too would require changing code outside of merge_collapse()). |
Thanks for the patch Tim. |
Thanks for your work Tim, just adapted the changes to PyPy's Timsort, using bits of runstack.py! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: