Optimizing list.sort() by performing safety checks in advance #72871
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
title: Optimizing list.sort() by performing safety checks in advance
issue: 28685 (https://bugs.python.org/issue28685)
creator: elliot.gorokhovsky (https://github.com/embg)
assignee: rhettinger
created: 2016-11-13 21:19:40
closed: 2018-01-29 03:04:11 (by rhettinger)
last activity: 2018-01-29 12:47:08 (vstinner)
components: Interpreter Core | labels: interpreter-core, 3.7, performance
type: performance | priority: high | resolution: fixed | stage: resolved | status: closed
versions: Python 3.7 | keywords: patch | pr_nums: 582, 5423 | message_count: 43
nosy: tim.peters, rhettinger, vstinner, serhiy.storchaka, eryksun, ppperry, mdk, elliot.gorokhovsky, godaygo
When Python compares two objects, many safety checks have to be performed before the actual comparison can take place. These include checking the types of the two objects, as well as checking various type-specific properties, like character width for strings or the number of machine words that represent a long.
Obviously, the vast majority of the work done during a list.sort() is spent in comparisons, so even small optimizations can be important. What I noticed is that since we have n objects but we're doing O(n log n) comparisons, it pays off to do all the safety checks in a single pass and then select an optimized comparison function that leverages the information gained to avoid as many sort-time checks as possible. I made the following assumptions:
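A rough Python-level sketch of the idea (the actual patch implements this in C inside listsort, with dedicated compare functions for floats, latin strings, and machine-word ints; the function name here is purely illustrative):

```python
import operator

def pick_compare(keys):
    # One O(n) pass over the keys: if every element has exactly the same
    # type, return that type's __lt__ so the O(n log n) comparisons can
    # skip per-comparison type checks; otherwise fall back to the fully
    # generic rich comparison.
    if not keys:
        return operator.lt
    t = type(keys[0])
    if all(type(k) is t for k in keys[1:]):
        return t.__lt__       # safety checks already done; dispatch once
    return operator.lt        # heterogeneous list: generic path
```

In the C patch the selected function is then used for every comparison of the merge sort, which is safe because type(k) was verified for all keys up front.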
My patch adds the following routine to list.sort():
There are then two questions: when none of the assumptions hold, how expensive is (1)? And when we do get a "hit", how much time do we save by applying (2)?
Those questions, of course, can only be answered by benchmarks. So here they are, computed on an isolated CPU core, on two Python interpreters, both compiled with ./configure --with-optimizations:
# This is a runnable script. Just build the reference interpreter in python-ref and the patched interpreter in python-dev.
# Pathological cases (where we pay for (1) but don't get any savings from (2)): what kind of losses do we suffer?
# How expensive is (1) for empty lists?
# How expensive is (1) for singleton lists?
# How expensive is (1) for non-type-homogeneous lists?
# OK, now for the cases that actually occur in practice:
# What kind of gains do we get for floats?
# What kind of gains do we get for latin strings?
# What kind of gains do we get for non-latin strings (which I imagine aren't that common to sort in practice anyway)?
# What kind of gains do we get for ints that fit in a machine word? (I'll keep them in (-2^15,2^15) to be safe)
# What kind of gains do we get for pathologically large ints?
# What kind of gains do we get for tuples whose first elements are floats?
# What kind of gains do we get for tuples whose first elements are latin strings?
# What kind of gains do we get for tuples of other stuff?
# What kind of gains do we get for arbitrary lists of objects of the same type?
# End of script #
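As a stdlib-only stand-in for one of the (elided) cases above, a float-sort timing might look like this (the original script drove perf against two separately built interpreters; numbers are machine-dependent):

```python
import random
import timeit

random.seed(0)
data = [random.uniform(-1, 1) for _ in range(1000)]

# sorted() works on a fresh copy each run, so every run does real work;
# take the best of 5 repetitions to reduce noise.
t = min(timeit.repeat('sorted(data)', globals={'data': data},
                      number=1000, repeat=5))
print(f'1000 sorts of 1000 random floats: {t:.4f}s')
```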
TL;DR: This patch makes common list.sort() cases 40-75% faster, and makes very uncommon pathological cases at worst 15% slower.
Hi Elliot, nice spot!
Why are you redefining Py_ABS, which looks to be already defined in
I tried to compile without your definition of Py_ABS, just in case I missed something in the includes, and it works.
Sure, if it compiles without that def, I'll remove it from the patch. I
On Mon, Nov 14, 2016 at 6:09 AM Julien Palard <email@example.com>
Maybe we should investigate more optimizations on specialized lists.
PyPy uses a more compact structure for lists of integers, for example. Something like the compact strings of PEP 393 (Python 3.3), but for lists.
But we are limited by the C API, so we cannot deeply change the C structure without breaking backward compatibility.
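The stdlib already hints at what such compact homogeneous storage buys: array stores unboxed C doubles, while a list stores pointers to full float objects (the sizes below are CPython-specific):

```python
import sys
from array import array

boxed = [float(i) for i in range(1000)]
compact = array('d', boxed)           # 1000 unboxed C doubles

# Total bytes for the list of boxed floats vs. the flat double array.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
compact_bytes = sys.getsizeof(compact)
print(boxed_bytes, compact_bytes)     # the array is several times smaller
```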
You can use perf timeit --compare-to to check whether the result is significant or not, and it displays "N.NNx faster" or "N.NNx slower" if it is.
About benchmarks, I would also like to see a benchmark on the bad case, when specialization is not used. And not only on an empty list :-) For example, sort 1000 objects which implement comparison operators and/or a sort function.
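A sketch of the "bad case" benchmark Victor asks for, sorting objects whose comparison is a pure-Python __lt__ (the class and sizes here are illustrative, not from the thread):

```python
import random
import timeit

class Obj:
    __slots__ = ('key',)
    def __init__(self, key):
        self.key = key
    def __lt__(self, other):
        # Pure-Python comparison: every compare goes through the slow path.
        return self.key < other.key

random.seed(0)
data = [Obj(random.random()) for _ in range(1000)]
t = min(timeit.repeat('sorted(data)', globals={'data': data},
                      number=100, repeat=5))
print(f'100 sorts of 1000 Obj instances: {t:.4f}s')
```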
On Mon, Nov 14, 2016 at 10:32 AM STINNER Victor <firstname.lastname@example.org>
Will do -- I'm writing this up as a paper since this is my science fair
The worst case is the third benchmark from the top -- a list of floats with
So, again, the absolute worst possible case is the third benchmark, which
So thanks for pointing out that perf has a --compare_to option: it turns out I had calculated the times wrong! Specifically, I had used
while I should have used
which is what perf --compare_to uses. Anyway, the actual times are even more incredible than I could have imagined! First, here's my benchmark script:
And here are the results:
./bench.sh "import random; l=[random.uniform(-1,1) for _ in range(0,100)]"
So it's 150% faster (i.e. 150% + 100% = 250% of the reference speed)! 150% faster sorting for floats!!! If we make them tuples, it's even more incredible:
319% faster!!! And earlier, I had thought 75% was impressive... I mean, 319%!!! And again, this is an application that is directly useful: DEAP spends a great deal of time sorting tuples of floats, this will make their EAs run a lot faster.
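The arithmetic behind "N% faster" as perf reports it (the timings below are hypothetical, chosen only to make the conversion concrete):

```python
t_ref, t_patched = 25.0, 10.0          # hypothetical microseconds per sort

ratio = t_ref / t_patched              # perf's "2.50x faster"
percent_faster = (ratio - 1) * 100     # 150% faster, i.e. 250% of baseline speed
print(f'{ratio:.2f}x faster = {percent_faster:.0f}% faster')
```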
"import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]"
"import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]"
On Fri, Mar 10, 2017 at 12:26 AM Serhiy Storchaka <email@example.com>
Ya, sorry about that. This is my first time contributing.
I like the idea, and benchmarking results for randomized lists look nice.
David Mertz asked for the same thing on python-ideas. Here's what I replied
You are entirely correct, as the benchmarks below demonstrate. I used the
My results are below (the script can be found at
Heterogeneous ([int]*n + [0.0]):
As you can see, because there's a lot less non-comparison overhead in the
Thanks for the feedback!
Does this work with wacky code like this?
Your code changes __class__, not type, which would remain equal to
So I think it's safe to assume that type doesn't change; if you change
On Fri, Mar 10, 2017 at 5:08 PM ppperry <firstname.lastname@example.org> wrote:
Yup, I was completely wrong.
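A quick check confirms the retraction: assigning to __class__ really does change what type() reports (CPython allows this when the instance layouts are compatible):

```python
class A:
    pass

class B:
    pass

x = A()
assert type(x) is A
x.__class__ = B          # the instance is relabeled in place
assert type(x) is B      # type() follows __class__
```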
If your classes were defined in pure-Python, this would raise an exception
Overall, I feel like if you're mutating the objects while they're being
Anyway, here's what Tim Peters said on the Github PR comments, where I
Either way, great catch! Thanks for the feedback.
On Fri, Mar 10, 2017 at 6:15 PM ppperry <email@example.com> wrote:
And what about even wackier code like this?
[A(i) for i in range(20, 5, -1)].sort()
This alternates printing "zebra" and "gizmo" for every comparison, and there is no way to add some sort of caching without changing this behavior.
Actually, I just ran this in the patched interpreter, and it worked!
Inspired by the above result, I ran your counterexample (below) to see if
class OrdinaryOldInteger:
    def __init__(self, i):
        self._i = i
    def __lt__(self, other):
        print('rocket')
        return self._i < (other._i if hasattr(other, '_i') else other)

lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
shuffle(lst)
last = lst[-1]
lst.sort()
And it did! It printed:
Note the "rocket" prints at the end; those could not have printed if the
Do I have any idea *why* these tests work? No. But I swear, I *just*
Wacky! (seriously though I have no idea *why* this works, it just...
What about if one of the relevant comparison functions is implemented in C?
class WackyComparator(int):
    def __lt__(self, other):
        elem.__class__ = WackyList2
        return int.__lt__(self, other)

class WackyList1(list): pass

class WackyList2(list):
    def __lt__(self, other):
        raise ValueError

lst = list(map(WackyList1, [[WackyComparator(3), 5],
                            [WackyComparator(4), 6],
                            [WackyComparator(7), 7]]))
random.shuffle(lst)
elem = lst[-1]
lst.sort()
This code raises ValueError, and caching seems like it would cache the
Python is very very dynamic ...
I haven't tried the example, but at this point I'd be surprised if it failed. The caching here isn't at level of
>>> class F(float):
...     pass
...
>>> a = F(2)
>>> b = F(3)
>>> a < b
True
Is F.tp_richcompare the same as float.tp_richcompare? We can't tell from Python code, because tp_richcompare isn't exposed. But, _whatever_ F.tp_richcompare is, it notices when relevant new methods are defined (which float.tp_richcompare emphatically does not); for example, continuing the above:
>>> F.__lt__ = lambda a, b: 0
>>> a < b
0
>>> del F.__lt__
>>> a < b
True
That said, I know nothing about how comparison internals changed for Python 3, so I may just be hallucinating :-)
Elliot, I don't care if the example behaves differently. Although someone else may ;-)
The only things
If crazy mutation examples can provoke a segfault, that's possibly "a problem" - but different results really aren't (at least not to me).
On Sat, Mar 11, 2017 at 9:01 PM Tim Peters <firstname.lastname@example.org> wrote:
That's great to hear. (Of course, one could always remove
Elliot, did you run the example in a release build or a debug build? I'm wondering why this:
assert(v->ob_type == w->ob_type &&
didn't blow up (in
If that does blow up in a debug build, it suggests "a fix": unconditionally check whether the tp_richcompare slot is the expected value. If not, use
Yes. CPython doesn't implement individual dispatching of the rich-comparison functions. There's a single tp_richcompare slot, so overriding one rich comparison forces the use of slot_tp_richcompare. For built-in types this incurs the performance penalty of using a wrapper_descriptor for the other rich comparisons. For example, overriding F.__lt__ forces calling float.__gt__ for the greater-than comparison.
The __gt__ wrapper_descriptor gets bound as a method-wrapper, and the method-wrapper tp_call is wrapper_call, which calls the wrapper function (e.g. richcmp_gt) with the wrapped function (e.g. float_richcompare). The object ID in CPython is the object address, so we can easily get the address of the __gt__ wrapper_descriptor to confirm how these C function pointers are stored in it:
>>> id(vars(float)['__gt__'])
2154486684248
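The machinery eryksun describes is visible from Python: the slot wrapper lives in the type dict as a wrapper_descriptor, and binding it to an instance yields a method-wrapper:

```python
class F(float):
    pass

# The unbound slot wrapper stored in float's type dict:
print(type(vars(float)['__gt__']).__name__)   # wrapper_descriptor
# The same slot bound to an instance:
print(type(F(2).__gt__).__name__)             # method-wrapper
# Subclasses inherit the slot; F defines no __gt__ of its own:
print('__gt__' in vars(F))                    # False
```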
It was a release build -- it would blow up in a debug build.
Now, regarding the fix you propose: I'll benchmark it tomorrow. If the
On Sat, Mar 11, 2017 at 9:12 PM Tim Peters <email@example.com> wrote:
Ya, that makes sense... I just don't get why it's faster at all, then!
On Sat, Mar 11, 2017 at 9:32 PM Tim Peters <firstname.lastname@example.org> wrote:
Elliot, PyObject_RichCompareBool calls PyObject_RichCompare. That in turn does some checks, hides a small mountain of tests in the expansions of the recursion-checking macros, and calls do_richcompare. That in turn does some useless (in the cases you're aiming at) tests and finally gets around to invoking tp_richcompare. Your patch gets to that final step at once.
I'm surprised you didn't know that ;-)
I am embarrassed! That's why I said IIRC... I remembered that either
On Sat, Mar 11, 2017 at 10:07 PM Tim Peters <email@example.com> wrote:
Doesn't your skipping PyObject_RichCompareBool and directly getting
class PointlessComparator:
    def __lt__(self, other):
        return NotImplemented

[PointlessComparator(), PointlessComparator()].sort()
... ... ...
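For reference, the behavior the fast path must preserve: when __lt__ returns NotImplemented on both sides, Python 3 exhausts the reflected fallback and raises TypeError (a runnable sketch of ppperry's example):

```python
class PointlessComparator:
    def __lt__(self, other):
        return NotImplemented

try:
    [PointlessComparator(), PointlessComparator()].sort()
except TypeError as e:
    # '<' not supported between instances of 'PointlessComparator' ...
    print('raised:', e)
```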
I suggest postponing this optimization to Python 3.8. Faster list.sort() is nice to have, but I'm not sure that it's a killer feature that will make everybody move to Python 3.7. It can wait for 3.8.
No core dev took the lead on this non-trivial issue, and IMHO it's getting too late for 3.7.
I see that Serhiy started to review the change and asked for more benchmarks. Serhiy would be a good candidate to drive such work, but sadly he seems to be busy these days...
While such an optimization is nice to have, we should be careful not to introduce performance regressions in some cases. I read the issue quickly, and I'm not sure that it was fully and carefully reviewed and tested yet. Sorry, I only read it quickly, ignore me if I'm wrong.
Well, if someone wants to take the responsibility of pushing this right now, it's up to you :-)
Thank you for giving this worthy orphan a home, Raymond!
Victor, don't fret too much. The code is really quite simple, and at worst affects only
I confess I didn't press for more benchmarks, because I don't care about more here: the code is so obviously a major speed win when it applies, it so obviously applies often, and the new worst-case overhead when it doesn't apply is so obviously minor compared to the cost of a sort (
Nevertheless ... if this brings a major player's server to its knees, blame Raymond ;-)