Optimizing list.sort() by performing safety checks in advance #72871
When Python compares two objects, many safety checks have to be performed before the actual comparison can take place. These include checking the types of the two objects, as well as checking various type-specific properties, like character width for strings or the number of machine words that represent a long. The vast majority of the work done during a list.sort() is spent in comparisons, so even small optimizations can be important. What I noticed is that since we have n objects but are doing O(n log n) comparisons, it pays off to do all the safety checks in a single pass and then select an optimized comparison function that leverages the information gained to avoid as many sort-time checks as possible (a rough Python sketch of the idea follows the benchmark summary below). I made the following assumptions:
My patch adds the following routine to list.sort():
There are then two questions: when none of the assumptions hold, how expensive is (1)? And when we do get a "hit", how much time do we save by applying (2)? Those questions, of course, can only be answered by benchmarks. So here they are, computed on an isolated CPU core, on two Python interpreters, both compiled with ./configure --with-optimizations:

# This is a runnable script. Just build the reference interpreter in python-ref and the patched interpreter in python-dev.
# Pathological cases (where we pay for (1) but don't get any savings from (2)): what kind of losses do we suffer?
# How expensive is (1) for empty lists?
# How expensive is (1) for singleton lists?
# How expensive is (1) for non-type-homogeneous lists?
# OK, now for the cases that actually occur in practice:
# What kind of gains do we get for floats?
# What kind of gains do we get for latin strings?
# What kind of gains do we get for non-latin strings (which I imagine aren't that common to sort in practice anyway)?
# What kind of gains do we get for ints that fit in a machine word? (I'll keep them in (-2^15, 2^15) to be safe)
# What kind of gains do we get for pathologically large ints?
# What kind of gains do we get for tuples whose first elements are floats?
# What kind of gains do we get for tuples whose first elements are latin strings?
# What kind of gains do we get for tuples of other stuff?
# What kind of gains do we get for arbitrary lists of objects of the same type?
# End of script

TL;DR: This patch makes common list.sort() cases 40-75% faster, and makes very uncommon pathological cases at worst 15% slower.
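To make the idea concrete, here is a rough Python sketch of the pre-scan logic described above. This is illustrative only: the real patch implements it in C inside list.sort(), and none of these names come from the patch.

def presort_check(keys):
    # Single O(n) pass: can a specialized compare be used for this sort?
    if len(keys) < 2:
        return None
    t = type(keys[0])
    if any(type(k) is not t for k in keys):
        return None              # mixed types: keep the fully general compare path
    # All keys share one exact type, so per-comparison type checks can be skipped.
    # The C patch would pick a type-specific compare here (float, int, str, tuple)
    # or a generic "same type" compare; this sketch just caches the type's __lt__.
    return t.__lt__

# Usage sketch:
# lt = presort_check(data)
# if lt is not None: sort comparing with lt(a, b) directly
# else:              sort with the general (fully checked) comparison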
Hi Elliot, nice spot! Why are you redefining Py_ABS, which looks like it's already defined in the headers? I tried to compile without your definition of Py_ABS, just in case I missed something in the includes, and it works.
Sure, if it compiles without that def, I'll remove it from the patch.
Maybe we should investigate more optimizations on specialized lists. PyPy uses a more compact structure for lists of integers, for example -- something like the compact strings of PEP 393 (Python 3.3), but for lists.
http://doc.pypy.org/en/latest/interpreter-optimizations.html#list-optimizations
But we are limited by the C API, so we cannot deeply change the C structure without breaking backward compatibility.
You can use perf timeit --compare-to to check whether the result is significant, and it displays "N.NNx faster" or "N.NNx slower" if it is. About benchmarks, I would also like to see a benchmark on the bad case, when specialization is not used -- and not only on an empty list :-) For example, sort 1000 objects which implement comparison operators and/or a sort function.
Will do -- I'm writing this up as a paper since this is my science fair
The worst case is the third benchmark from the top -- a list of floats with
So, again, the absolute worst possible case is the third benchmark, which
So thanks for pointing out that perf has a --compare-to option: it turns out I had calculated the times wrong! Specifically, I had used (ref-dev)/ref while I should have used ref/dev, which is what perf --compare-to uses. Anyway, the actual times are even more incredible than I could have imagined!

First, here's my benchmark script:

#!/bin/bash

And here are the results:

./bench.sh "import random; l=[random.uniform(-1,1) for _ in range(0,100)]"

So it's 150% faster! (i.e. 150% + 100% = 250%). 150% faster sorting for floats!!! If we make them tuples, it's even more incredible: 319% faster!!! And earlier, I had thought 75% was impressive... I mean, 319%!!! And again, this is an application that is directly useful: DEAP spends a great deal of time sorting tuples of floats, so this will make their EAs run a lot faster.

"import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]"
"import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]"
Oh wait... uh... never mind... we want "faster" to refer to total time taken, so 1-dev/ref is indeed the correct formula. I just got confused because perf outputs ref/dev, but that doesn't make sense for percents.
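For anyone tripped up by the same thing, here is the arithmetic with made-up numbers (the timings below are hypothetical, purely to show the formulas):

ref, dev = 250e-9, 100e-9    # hypothetical per-element times: reference vs. patched
print(ref / dev)             # 2.5  -> what perf reports as "2.50x faster"
print(1 - dev / ref)         # 0.6  -> "takes 60% less time"
print(ref / dev - 1)         # 1.5  -> "150% faster" in the ratio sense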
list.sort() is a very sensitive function. Maybe (dumb idea?) you could start with a project on PyPI, play with it and try it on large applications (Django?), and come back later once it's battle-tested?
Will post the final version of this patch as a pull request on GitHub.
The issue shouldn't be closed until it is resolved or rejected. I like the idea, and the benchmarking results for randomized lists look nice. But could you please run benchmarks for already-sorted lists?
Ya, sorry about that. This is my first time contributing.

> I like the idea, and benchmarking results for randomized lists look nice.

David Mertz asked for the same thing on python-ideas. Here's what I replied: You are entirely correct, as the benchmarks below demonstrate. I used the
values My results are below (the script can be found at
Heterogeneous ([int]*n + [0.0]):
As you can see, because there's a lot less non-comparison overhead in the
Thanks for the feedback!
Does this work with wacky code like this?
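Judging from the replies below, the snippet reassigns another element's __class__ from inside __lt__ while the sort is running. A rough guess at its shape (the class name matches the quote in the later reply; the body is purely illustrative, not the original code):

import random

class OtherClass:                        # hypothetical target class
    def __init__(self, i):
        self._i = i

class ClassAssignmentCanBreakChecks:
    def __init__(self, i):
        self._i = i
    def __lt__(self, other):
        last.__class__ = OtherClass      # mutate another element's type mid-sort
        return self._i < (other._i if hasattr(other, '_i') else other)

lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
random.shuffle(lst)
last = lst[-1]
lst.sort()   # without the patch this typically ends in TypeError once `last` flips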
Your code changes __class__, not type, which would remain equal to
https://docs.python.org/3.7/reference/datamodel.html
So I think it's safe to assume that type doesn't change; if you change
Nope:
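A minimal illustration of the point (any two ordinary classes will do; the names here are invented):

class Old: pass
class New: pass

obj = Old()
obj.__class__ = New
print(type(obj) is New)      # True: type() follows the __class__ reassignment
print(isinstance(obj, Old))  # False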
Yup, I was completely wrong.
If your classes were defined in pure-Python, this would raise an exception
Overall, I feel like if you're mutating the objects while they're being
Anyway, here's what Tim Peters said on the GitHub PR comments, where I
Either way, great catch! Thanks for the feedback.
And what about even wackier code like this?

[A(i) for i in range(20, 5, -1)].sort()

This alternates printing "zebra" and "gizmo" for every comparison, and there is no way to add some sort of caching without changing this behavior.
Actually, I just ran this in the patched interpreter, and it worked! Inspired by the above result, I ran your counterexample (below) to see if

from random import shuffle   # import needed to make the snippet below runnable

class OrdinaryOldInteger:
    def __init__(self, i):
        self._i = i
    def __lt__(self, other):
        print('rocket')
        return self._i < (other._i if hasattr(other, '_i') else other)

lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
shuffle(lst)
last = lst[-1]
lst.sort()

And it did! It printed:
Note the "rocket" prints at the end; those could not have printed if the
Do I have any idea *why* these tests work? No. But I swear, I *just*
Wacky! (seriously though I have no idea *why* this works, it just...
What about if one of the relevant comparison functions is implemented in C?

import random   # import needed to make the snippet below runnable

class WackyComparator(int):
    def __lt__(self, other):
        elem.__class__ = WackyList2
        return int.__lt__(self, other)

class WackyList1(list): pass

class WackyList2(list):
    def __lt__(self, other):
        raise ValueError

lst = list(map(WackyList1, [[WackyComparator(3), 5], [WackyComparator(4), 6], [WackyComparator(7), 7]]))
random.shuffle(lst)
elem = lst[-1]
lst.sort()

This code raises ValueError, and caching seems like it would cache the
Python is very very dynamic ...
I haven't tried the example, but at this point I'd be surprised if it failed. The caching here isn't at the level of

>>> class F(float):
...     pass
>>> a = F(2)
>>> b = F(3)
>>> a < b
True

Is F.tp_richcompare the same as float.tp_richcompare? We can't tell from Python code, because tp_richcompare isn't exposed. But, _whatever_ F.tp_richcompare is, it notices when relevant new methods are defined (which float.tp_richcompare emphatically does not); for example, continuing the above:

>>> F.__lt__ = lambda a, b: 0
>>> a < b
0
>>> del F.__lt__
>>> a < b
True

That said, I know nothing about how comparison internals changed for Python 3, so I may just be hallucinating :-)
Wouldn't the assignment of "__lt__" change the value of the tp_richcompare slot? That seems to be what the code in Objects/typeobject.c is doing with the update_slot method and the related helper functions.
I just ran it. With the patched interpreter, I get no error. With the
@PPPerry, I have no idea what the bulk of the code in typeobject.c is trying to do.
Elliot, I don't care if the example behaves differently. Although someone else may ;-)
The only things
If crazy mutation examples can provoke a segfault, that's possibly "a problem" - but different results really aren't (at least not to me).
That's great to hear. (Of course, one could always remove |
Elliot, did you run the example in a release build or a debug build? I'm wondering why this:

assert(v->ob_type == w->ob_type &&

didn't blow up (in

If that does blow up in a debug build, it suggests "a fix": unconditionally check whether the tp_richcompare slot is the expected value. If not, use
Yes. CPython doesn't implement individual dispatching of the rich-comparison functions. There's a single tp_richcompare slot, so overriding one rich comparison forces the use of slot_tp_richcompare. For built-in types this incurs the performance penalty of using a wrapper_descriptor for the other rich comparisons. For example, overriding F.__lt__ forces calling float.__gt__ for the greater-than comparison.

Before:

After:

The __gt__ wrapper_descriptor gets bound as a method-wrapper, and the method-wrapper tp_call is wrapper_call, which calls the wrapper function (e.g. richcmp_gt) with the wrapped function (e.g. float_richcompare). The object ID in CPython is the object address, so we can easily get the address of the __gt__ wrapper_descriptor to confirm how these C function pointers are stored in it:

>>> id(vars(float)['__gt__'])
2154486684248
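For reference, the same machinery seen from Python (a sketch of an interactive session; the exact reprs are CPython's and may vary slightly by version):

>>> class F(float):
...     pass
>>> F.__gt__ is vars(float)['__gt__']    # F inherits float's wrapper_descriptor
True
>>> vars(float)['__gt__']
<slot wrapper '__gt__' of 'float' objects>
>>> type(F(2).__gt__).__name__           # bound form is a method-wrapper
'method-wrapper'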
It was a release build -- it would blow up in a debug build.
Now, regarding the fix you propose: I'll benchmark it tomorrow. If the
The impact would be small: it would add one (or so) pointer-equality compare that, in practice, will always say "yup, they're equal". Dirt cheap, and the branch is 100% predictable.
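A Python-level sketch of what that check amounts to (illustrative only; the real check is a C pointer comparison on the tp_richcompare slot inside list.sort(), and these names are invented):

def make_specialized_compare(sample):
    expected_type = type(sample)
    expected_lt = expected_type.__lt__          # cached during the pre-scan

    def compare(v, w):
        # One cheap, highly predictable check per comparison: if the type or its
        # __lt__ changed mid-sort, fall back to the fully general path.
        if type(v) is not expected_type or type(v).__lt__ is not expected_lt:
            return v < w                        # general, fully checked comparison
        return expected_lt(v, w)                # fast path: no per-compare dispatch
    return compare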
Ya, that makes sense... I just don't get why it's faster at all, then!
Eryk Sun: Thanks for your detailed response. I'm curious, though: can you
Elliot, PyObject_RichCompareBool calls PyObject_RichCompare. That in turn does some checks, hides a small mountain of tests in the expansions of the recursion-checking macros, and calls do_richcompare. That in turn does some useless (in the cases you're aiming at) tests and finally gets around to invoking tp_richcompare. Your patch gets to that final step at once. I'm surprised you didn't know that ;-)
I am embarrassed! That's why I said IIRC... I remembered that either
OK, I added the safety check to unsafe_object_compare. I verified that it
Doesn't your skipping PyObject_RichCompareBool and directly getting

class PointlessComparator:
    def __lt__(self, other):
        return NotImplemented

[PointlessComparator(), PointlessComparator()].sort()

... ... ...
@PPPerry, I believe you're right - good catch! I expect the current patch would treat the NotImplemented return as meaning "the first argument is less than the second argument". I added a comment to the code (on GitHub) suggesting an obvious fix for that.
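In the same illustrative Python terms as the earlier sketch, the fix is to treat a NotImplemented result as "cannot answer" rather than as an ordinary truthy value (a sketch, not the actual C change):

def careful_unsafe_lt(v, w, cached_lt):
    # cached_lt is the __lt__ captured during the pre-scan (see the sketch above).
    result = cached_lt(v, w)
    if result is NotImplemented:
        return v < w     # fall back to the full protocol: reflected __gt__, then TypeError
    return bool(result)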
What is the current status of this issue, and will it go into Python 3.7?
It would be really nice to have this in 3.7.
I agree it would be nice (very!) to get this in. Indeed, I'm surprised to see that this is still open :-( But who can drive it? I cannot.
I suggest postponing this optimization to Python 3.8. A faster list.sort() is nice to have, but I'm not sure it's a killer feature that will make everybody move to Python 3.7. It can wait for 3.8. No core dev took the lead on this non-trivial issue, and IMHO it's getting too late for 3.7. I see that Serhiy started to review the change and asked for more benchmarks. Serhiy would be a good candidate to drive such work, but sadly he seems to be busy these days... While such an optimization is nice to have, we should be careful not to introduce performance regressions in some cases. I read the issue quickly, and I'm not sure it has been fully and carefully reviewed and tested yet. Sorry, I only read it quickly, ignore me if I'm wrong. Well, if someone wants to take the responsibility of pushing this right now, it's up to you :-)
Thank you for giving this worthy orphan a home, Raymond!
Victor, don't fret too much. The code is really quite simple, and at worst affects only
I confess I didn't press for more benchmarks, because I don't care about more here: the code is so obviously a major speed win when it applies, it so obviously applies often, and the new worst-case overhead when it doesn't apply is so obviously minor compared to the cost of a sort (
Nevertheless ... if this brings a major player's server to its knees, blame Raymond ;-)
... and I'm still trying to come up with even more pathological mutating cases