Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing list.sort() by performing safety checks in advance #72871

Closed
embg mannequin opened this issue Nov 13, 2016 · 43 comments
Closed

Optimizing list.sort() by performing safety checks in advance #72871

embg mannequin opened this issue Nov 13, 2016 · 43 comments
Assignees
Labels
3.7 interpreter-core performance

Comments

@embg
Copy link
Mannequin

@embg embg mannequin commented Nov 13, 2016

BPO 28685
Nosy @tim-one, @rhettinger, @vstinner, @serhiy-storchaka, @eryksun, @pppery, @JulienPalard, @embg, @godaygo
PRs
  • #582
  • #5423
  • Files
  • fastsort.patch
  • fastsort.patch: Slightly simplified the patch code and removed the #define Py_ABS.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2018-01-29.03:04:11.973>
    created_at = <Date 2016-11-13.21:19:40.081>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Optimizing list.sort() by performing safety checks in advance'
    updated_at = <Date 2018-01-29.12:47:08.795>
    user = 'https://github.com/embg'

    bugs.python.org fields:

    activity = <Date 2018-01-29.12:47:08.795>
    actor = 'vstinner'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2018-01-29.03:04:11.973>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2016-11-13.21:19:40.081>
    creator = 'elliot.gorokhovsky'
    dependencies = []
    files = ['45477', '45508']
    hgrepos = []
    issue_num = 28685
    keywords = ['patch']
    message_count = 43.0
    messages = ['280718', '280772', '280799', '280801', '280815', '280978', '280987', '281668', '289192', '289339', '289394', '289422', '289426', '289429', '289432', '289434', '289435', '289447', '289462', '289464', '289465', '289466', '289467', '289468', '289469', '289470', '289471', '289472', '289474', '289475', '289476', '289477', '289508', '289518', '289523', '310979', '310989', '310997', '311024', '311042', '311046', '311049', '311118']
    nosy_count = 9.0
    nosy_names = ['tim.peters', 'rhettinger', 'vstinner', 'serhiy.storchaka', 'eryksun', 'ppperry', 'mdk', 'elliot.gorokhovsky', 'godaygo']
    pr_nums = ['582', '5423']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue28685'
    versions = ['Python 3.7']

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Nov 13, 2016

    When python compares two objects, there are many safety checks that have to be performed before the actual comparison can take place. These include checking the types of the two objects, as well as checking various type-specific properties, like character width for strings or the number of machine-words that represent a long.

    Obviously, the vast majority of the work done during a list.sort() is spent in comparisons, so even small optimizations can be important. What I noticed is that since we have n objects, but we're doing O(nlogn) comparisons, it pays off to do all the safety checks in a single pass and then select an optimized comparison function that leverages the information gained to avoid as many sort-time checks as possible. I made the following assumptions:

    1. In practice, lists to be sorted are almost always type-homogeneous.
    2. Most lists to be sorted consist of homogeneous elements of the following types:
      a. Latin strings (i.e. strings whose characters are one machine-word wide)
      b. Longs that fit in a single machine-word
      c. Floats
      d. Tuples of the above
    3. The above assumptions also hold for lists constructed by taking a list of tuples to be sorted and extracting the first elements.
    4. The vast majority of tuple comparisons never get past the first element (i.e. the first elements of tuples are rarely equal).

    My patch adds the following routine to list.sort():

    1. Go through the list and see which of the assumptions, if any, hold (except (4), which can't be checked in advance).
    2. Replace PyObject_RichCompareBool() with an optimized compare function that is able to ignore some safety checks by applying the assumption we've already verified to be true.

    There are then two questions: when none of the assumptions hold, how expensive is (1)? And when we do get a "hit", how much time do we save by applying (2)?

    Those questions, of course, can only be answered by benchmarks. So here they are, computed on an isolated CPU core, on two python interpreters, both compiled with ./configure --with-optimizations:

    # This is a runnable script. Just build the reference interpreter in python-ref and the patched interpreter in python-dev.

    # Pathological cases (where we pay for (1) but don't get any savings from (2)): what kind of losses do we suffer?

    # How expensive is (1) for empty lists?
    python-dev/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous
    # Median +- std dev: 212 ns +- 9 ns
    # Median +- std dev: 216 ns +- 10 ns
    # (difference is within std dev)

    # How expensive is (1) for singleton lists?
    python-dev/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous
    # Median +- std dev: 235 ns +- 16 ns
    # Median +- std dev: 238 ns +- 15 ns
    # (difference is within std dev)

    # How expensive is (1) for non-type-homogeneous lists?
    # (we use small lists because as n gets large, the fraction time spent in (1) gets smaller)
    python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous
    # Median +- std dev: 784 ns +- 35 ns
    # Median +- std dev: 707 ns +- 51 ns
    # 10% slower. While this is unfortunate, this case almost never occurs in practice, and the losses are certainly offset by the gains we'll see below.
    # Also, note that for large lists, the difference would be *much* smaller, since the time spent in (1) gets smaller like 1/log(n).
    # So basically, we only pay a penalty for non-type-homogeneous small lists.

    # OK, now for the cases that actually occur in practice:

    # What kind of gains do we get for floats?
    python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 3.63 us +- 0.20 us
    # Median +- std dev: 8.81 us +- 0.37 us
    # Wow! 59% faster! And if we used a large list, we would've gotten even better numbers.

    # What kind of gains do we get for latin strings?
    python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 9.51 us +- 0.28 us
    # Median +- std dev: 15.9 us +- 0.7 us
    # 40% faster!

    # What kind of gains do we get for non-latin strings (which I imagine aren't that common to sort in practice anyway)
    python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 12.2 us +- 0.4 us
    # Median +- std dev: 14.2 us +- 0.6 us
    # 14%. Not as impressive, but again, this is a bit of a pathological case.

    # What kind of gains do we get for ints that fit in a machine word? (I'll keep them in (-2^15,2^15) to be safe)
    python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 4.92 us +- 0.35 us
    # Median +- std dev: 9.59 us +- 0.75 us
    # 49% faster. Pretty cool stuff!

    # What kind of gains do we get for pathologically large ints?
    python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 6.93 us +- 0.26 us
    # Median +- std dev: 8.93 us +- 0.23 us
    # 22% faster. The same comparison function is used as in the pathological string case, but the numbers are different because comparing strings
    # is more expensive than comparing ints, so we save less time from skipping the type checks. Regardless, 22% is still pretty cool. And I can't
    # imagine it's that common to sort huge ints anyway, unless you're doing some sort of math conjecture testing or something.

    # What kind of gains do we get for tuples whose first elements are floats?
    python-dev/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 5.83 us +- 0.50 us
    # Median +- std dev: 21.5 us +- 0.5 us
    # WOW! 73% faster!
    # A note: I know of at least one application where this is relevant. The DEAP evolutionary algorithms library, which is very popular, has to sort
    # tuples of floats when it's selecting individuals for the next generation. It spends a non-trivial amount of time sorting those tuples of floats,
    # and this patch cuts that non-trivial time by 73%! Pretty cool! Now we can evolve more individuals more better!

    # What kind of gains do we get for tuples whose first elements are latin strings?
    python-dev/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
    python-ref/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
    # Median +- std dev: 14.9 us +- 0.8 us
    # Median +- std dev: 31.2 us +- 1.3 us
    # 52%. Not too shabby!

    # What kind of gains do we get for tuples of other stuff?
    # Obviously there are lots of combinations possible, I won't waste your time documenting all of them.
    # Suffice it to say that the gains for lists tuples of x is always greater than the gains for lists of x, since we bypass
    # two layers of safety checks: checks on the tuples, and then checks on the x.

    # What kind of gains do we for arbitrary lists of objects of the same type?
    # See the benchmarks for large ints and non-latin strings. It will be different for each object, since it depends on the ratio between
    # the cost of type-check and the cost of comparison. Regardless, it is always non-trivial, unless the cost of comparison is huge.

    # End of script #

    TL;DR: This patch makes common list.sort() cases 40-75% faster, and makes very uncommon pathological cases at worst 15% slower.
    It accomplishes this by performing the endless, but necessary, safety checks involved in comparison up front, and then using
    optimized compares that ignore the already-performed checks during the actual sort.

    @embg embg mannequin added 3.7 interpreter-core performance labels Nov 13, 2016
    @JulienPalard
    Copy link
    Member

    @JulienPalard JulienPalard commented Nov 14, 2016

    Hi Elliot, nice spot!

    Why are you redefining Py_ABS, which looks already defined in pymacro.h included itself by Python.h? I'm not fan of undefining it later, it may surprise someone later expecting it to be there.

    I tried to compile without your definition of Py_ABS, just in case I missed something in the includes, and it works.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Nov 14, 2016

    Sure, if it compiles without that def, I'll remove it from the patch. I
    added it because I did all the development in an extension module, which
    also included Python.h, but for some reason it gave me a "function
    implicitly defined" error for using Py_ABS. Even though I had included
    Python.h. Weird, right? So I assumed I had to leave it in the patch as
    well. Thanks for pointing this out!

    On Mon, Nov 14, 2016 at 6:09 AM Julien Palard <report@bugs.python.org>
    wrote:

    Julien Palard added the comment:

    Hi Elliot, nice spot!

    Why are you redefining Py_ABS, which looks already defined in pymacro.h
    included itself by Python.h? I'm not fan of undefining it later, it may
    surprise someone later expecting it to be there.

    I tried to compile without your definition of Py_ABS, just in case I
    missed something in the includes, and it works.

    ----------
    nosy: +mdk


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Nov 14, 2016

    Maybe we should investigate more optimizations on specialized lists.

    PyPy uses a more compact structure for lists of integers for example. Something like compact strings, PEP-393, of Python 3.3, but for lists.

    http://doc.pypy.org/en/latest/interpreter-optimizations.html#list-optimizations

    But we are limited by the C API, so we cannot change deeply the C structure without breaking backward compatibility.

    (difference is within std dev)

    You can use perf timeit --compare-to to check if the result is significant or not, and it displays you the "N.NNx faster" or "N.NNx slower" if it's significant.

    About benchmarks, I also would like to see a benchmark on the bad case, when specialization is not used. And not only on an empty list :-) For example, sort 1000 objects which implement compare operators and/or a sort function.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Nov 14, 2016

    On Mon, Nov 14, 2016 at 10:32 AM STINNER Victor <report@bugs.python.org>
    wrote:

    You can use perf timeit --compare-to to check if the result is significant
    or not, and it displays you the "N.NNx faster" or "N.NNx slower" if it's
    significant.

    Will do -- I'm writing this up as a paper since this is my science fair
    project, so I'll redo the measurements that way and upload a pdf here.

    About benchmarks, I also would like to see a benchmark on the bad case,
    when specialization is not used. And not only on an empty list :-) For
    example, sort 1000 objects which implement compare operators and/or a sort
    function.

    The worst case is the third benchmark from the top -- a list of floats with
    a single, sad, solitary long at the end. That disables optimization because
    keys_are_all_same_type gets set to 0 when assign_compare_function finds the
    long (since key_type == &PyFloat_Type). That benchmark is the *absolute*
    worst case for two reasons:

    1. Float compares are really cheap, so the ratio doesn't get messed up by
      the common term "time spent actually comparing floats" (since the total
      time is "time spent on overhead + time spend actually comparing floats").
    2. The list is of size 10. I guess I could've used size 3 or 4, but it
      shouldn't be too far off... smaller lists give worse ratios because the
      overhead is O(n) while sorting is O(nlogn).

    So, again, the absolute worst possible case is the third benchmark, which
    suffers a 10% slowdown. Certainly a reasonable price to pay considering how
    rare that case is in practice, and considering the 40-75% gains we get on
    the common cases.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Nov 16, 2016

    So thanks for pointing out that perf has a --compare_to option: it turns out I had calculated the times wrong! Specifically, I had used

    (ref-dev)/ref

    while I should have used

    ref/dev

    which is what perf --compare_to uses. Anyway, the actual times are even more incredible than I could have imagined! First, here's my benchmark script:

    #!/bin/bash
    rm ref.json dev.json 2> /dev/null
    python-dev/python -m perf timeit -s "$1" "sorted(l)" --rigorous -o dev.json > /dev/null
    python-ref/python -m perf timeit -s "$1" "sorted(l)" --rigorous -o ref.json > /dev/null
    python-ref/python -m perf compare_to ref.json dev.json

    And here are the results:

    ./bench.sh "import random; l=[random.uniform(-1,1) for _ in range(0,100)]"
    Median +- std dev: [ref] 8.34 us +- 0.18 us -> [dev] 3.33 us +- 0.13 us: 2.50x faster

    So it's 150% faster! (i.e. 150% + 100% = 250%). 150% faster sorting for floats!!! If we make them tuples, it's even more incredible:
    Median +- std dev: [ref] 20.9 us +- 1.0 us -> [dev] 4.99 us +- 0.27 us: 4.19x faster

    319% faster!!! And earlier, I had thought 75% was impressive... I mean, 319%!!! And again, this is an application that is directly useful: DEAP spends a great deal of time sorting tuples of floats, this will make their EAs run a lot faster.

    "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]"
    Median +- std dev: [ref] 15.7 us +- 0.9 us -> [dev] 9.24 us +- 0.52 us: 1.70x faster

    "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]"
    Median +- std dev: [ref] 8.59 us +- 0.19 us -> [dev] 4.35 us +- 0.13 us: 1.98x faster

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Nov 16, 2016

    Oh wait... uh... never mind... we want "faster" to refer to total time taken, so 1-def/ref is indeed the correct formula. I just got confused because perf outputs ref/dev, but that doesn't make sense for percents.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Nov 24, 2016

    list.sort() is a very sensitive function. Maybe (dummy idea?), you may start with a project on PyPI, play with it and try it on large applications (Django?). And come back later once it's battle tested?

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 8, 2017

    Will post the final version of this patch as a pull request on Github.

    @embg embg mannequin closed this as completed Mar 8, 2017
    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Mar 10, 2017

    The issue shouldn't be closed until it resolved or rejected.

    I like the idea, and benchmarking results for randomized lists look nice. But could you please run benchmarks for already sorted lists?

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 10, 2017

    On Fri, Mar 10, 2017 at 12:26 AM Serhiy Storchaka <report@bugs.python.org>
    wrote:

    Serhiy Storchaka added the comment:

    The issue shouldn't be closed until it resolved or rejected.

    Ya, sorry about that. This is my first time contributing.

    I like the idea, and benchmarking results for randomized lists look nice.

    But could you please run benchmarks for already sorted lists?

    David Mertz asked for the same thing on python-ideas. Here's what I replied
    (you can also find these numbers in my pull request description):


    You are entirely correct, as the benchmarks below demonstrate. I used the
    benchmark lists from Objects/listsort.txt, which are:

      \\sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random
    

    values
    ~sort: many duplicates
    =sort: all equal

    My results are below (the script can be found at
    https://github.com/embg/python-fastsort-benchmark/blob/master/bench-structured.py
    ):
    Homogeneous ([int]):
    \sort: 54.6%
    /sort: 56.5%
    3sort: 53.5%
    +sort: 55.3%
    %sort: 52.4%
    ~sort: 48.0%
    =sort: 45.2%

    Heterogeneous ([int]*n + [0.0]):
    \sort: -17.2%
    /sort: -19.8%
    3sort: -18.0%
    +sort: -18.8%
    %sort: -10.0%
    ~sort: -2.1%
    =sort: -21.3%

    As you can see, because there's a lot less non-comparison overhead in the
    structured lists, the impact of the optimization is much greater, both in
    performance gain and in worst-case cost. However, I would argue that these
    data do not invalidate the utility of my patch: the probability of
    encountering a type-heterogeneous list is certainly less than 5% in
    practice. So the expected savings, even for structured lists, is something
    like (5%)(-20%) + (95%)(50%) = 46.5%. And, of course, not *all* the lists
    one encounters in practice are structured; certainly not *this* structured.
    So, overall, I would say the numbers above are extremely encouraging.
    Thanks for pointing out the need for this benchmark, though!
    ***

    Thanks for the feedback!

    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 11, 2017

    Does this work with wacky code like this?
    @functools.total_ordering
    class ClassAssignmentCanBreakChecks():
    def __init__(self, i):
    self._i = i
    def __lt__ (self, other):
    last.__class__ = OrdinaryOldInteger
    return self._i < (other._i if hasattr(other, '_i') else other)
    @functools.total_ordering
    class OrdinaryOldInteger:
    def __init__(self, i):
    self._i = i
    def __lt__(self, other):
    return self._i < (other._i if hasattr(other, '_i') else other)
    lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
    random.shuffle(lst)
    last = lst[-1]
    lst.sort()
    It looks like it will initially say that all are the same type, and attempt that optimization, which will probably lead to unexpected results as that condition is no longer true after the first compare.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 11, 2017

    Your code changes __class__, not type, which would remain equal to
    "instance". (my understanding, could be wrong). The docs say the following:

    https://docs.python.org/3.7/reference/datamodel.html

    Like its identity, an object’s type is also unchangeable. [1]

    [1] It is possible in some cases to change an object’s type, under
    certain controlled conditions.
    It generally isn’t a good idea though, since it can lead to some very
    strange behavior if it is
    handled incorrectly.

    So I think it's safe to assume that type doesn't change; if you change
    type, you get undefined ("very strange") behavior. Based on the comment in
    the footnote, other code clearly assumes that type doesn't change, so I
    don't see why list.sort() should be any different.

    On Fri, Mar 10, 2017 at 5:08 PM ppperry <report@bugs.python.org> wrote:

    ppperry added the comment:

    Does this work with wacky code like this?
    @functools.total_ordering
    class ClassAssignmentCanBreakChecks():
    def __init__(self, i):
    self._i = i
    def __lt__ (self, other):
    last.__class__ = OrdinaryOldInteger
    return self._i < (other._i if hasattr(other, '_i')
    else other)
    @functools.total_ordering
    class OrdinaryOldInteger:
    def __init__(self, i):
    self._i = i
    def __lt__(self, other):
    return self._i < (other._i if hasattr(other, '_i')
    else other)
    lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
    random.shuffle(lst)
    last = lst[-1]
    lst.sort()
    It looks like it will initially say that all are the same type, and
    attempt that optimization, which will probably lead to unexpected results
    as that condition is no longer true after the first compare.

    ----------
    nosy: +ppperry


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 11, 2017

    Elliot Gorokhovsky added the comment:

    Your code changes __class__, not type, which would remain equal to
    "instance". (my understanding, could be wrong). The docs say the
    following:

    Nope:
    class A:pass
    class B:pass
    a = A()
    a.__class__ = B
    type(a)
    returns "<class '__main__.B'>"

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 11, 2017

    Yup, I was completely wrong.

    If your classes were defined in pure-Python, this would raise an exception
    (since the pure-Python operators/functions check for bad types, correct me
    if I'm wrong). However, if you defined your compares at the C API level,
    you could get a segfault. There are much easier ways to get segfaults using
    the C API, however :)

    Overall, I feel like if you're mutating the objects while they're being
    sorted, you should be prepared for undefined behavior. Specifically, in the
    current implementation, the sort result can be incorrect if you mutate as
    you sort; no sort algorithm will recheck comparisons periodically to see if
    you've tried to fool it.

    Anyway, here's what Tim Peters said on the Github PR comments, where I
    asked about the issue you raised:

    About the __class__-changing example, I don't care provided it doesn't >
    blow up. But someone else should weigh in on that. I hope we can get, >
    e.g., Raymond Hettinger to stare at this issue too.

    Either way, great catch! Thanks for the feedback.

    On Fri, Mar 10, 2017 at 6:15 PM ppperry <report@bugs.python.org> wrote:

    ppperry added the comment:

    >
    > Elliot Gorokhovsky added the comment:
    >
    > Your code changes __class__, not type, which would remain equal to
    > "instance". (my understanding, could be wrong). The docs say the
    > following:
    >
    Nope:
    class A:pass
    class B:pass
    a = A()
    a.__class__ = B
    type(a)
    returns "<class '__main__.B'>"

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 11, 2017

    And what about even wackier code like this?
    class A(int):
    def __lt__(self, other):
    print("zebra")
    A.__lt__ = A.__false_lt__
    return int.__lt__(self, other)
    __true_lt__ = __lt__
    def __false_lt__(self, other):
    print("gizmo")
    A.__lt__ = A.__true_lt__
    return int.__lt__(self, other)

    [A(i) for i in range(20, 5, -1)].sort()

    This alternates printing "zebra" and "gizmo" for every comparison, and there is no way to add some sort of caching without changing this behavior.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 11, 2017

    Actually, I just ran this in the patched interpreter, and it worked!
    It printed:
    zebra
    gizmo
    zebra
    gizmo
    zebra
    gizmo
    zebra
    gizmo
    zebra
    gizmo
    zebra
    gizmo
    zebra
    gizmo

    Inspired by the above result, I ran your counterexample (below) to see if
    it would work as well:
    from random import *
    class ClassAssignmentCanBreakChecks():
    def __init__(self, i):
    self._i = i
    def __lt__ (self, other):
    print('gizmo')
    last.__class__ = OrdinaryOldInteger
    return self._i < (other._i if hasattr(other, '_i') else other)

    class OrdinaryOldInteger:
        def __init__(self, i):
            self._i = i
        def __lt__(self, other):
            print('rocket')
            return self._i < (other._i if hasattr(other, '_i') else other)
    lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
    shuffle(lst)
    last = lst[-1]
    lst.sort()

    And it did! It printed:
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    gizmo
    rocket
    rocket
    rocket

    Note the "rocket" prints at the end; those could not have printed if the
    compare method didn't change!

    Do I have any idea *why* these tests work? No. But I swear, I *just*
    re-made the patched interpeter in the directory where I ran the tests. You
    will definitely be able to reproduce my results on your system.

    Wacky! (seriously though I have no idea *why* this works, it just...
    does... I'm scared...)

    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 11, 2017

    What about if one of the relevant comparison functions is implemented in C?

            class WackyComparator(int):
    		def __lt__(self, other):
    			elem.__class__ = WackyList2
    			return int.__lt__(self, other)
    
    	class WackyList1(list):pass
    	class WackyList2(list):
    		def __lt__(self, other):
    			raise ValueError
    	lst =
    list(map(WackyList1,[[WackyComparator(3),5],[WackyComparator(4),6],[WackyComparator(7),7]]))
    	random.shuffle(lst)
    	elem = lst[-1]
    	lst.sort()

    This code raises ValueError, and caching seems like it would cache the
    comparator for WackyList1 objects, which is the same as the comparator for
    'list' objects -- and midway through comparison, one of them changes type
    to WackyList2, which has its own (broken) comparison function.

    Python is very very dynamic ...

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    I haven't tried the example, but at this point I'd be surprised if it failed. The caching here isn't at level of __lt__ but at the higher level of (invisible from Python code) a type's tp_richcompare slot. A heap type - regardless of whether it derives from a C-level or Python-level type - has to be prepared for methods to pop into existence, change, or vanish at (almost) any time. So its tp_richcompare has to be defensive itself. For example:

    >>> class F(float):
    ...     pass
    >>> a = F(2)
    >>> b = F(3)
    >>> a < b
    True

    Is F.tp_richcompare the same as float.tp_richcompare? We can't tell from Python code, because tp_richcompare isn't exposed. But, _whatever_ F.tp_richcompare is, it notices when relevant new methods are defined (which float.tp_richcompare emphatically does not); for example, continuing the above:

    >>> F.__lt__ = lambda a, b: 0
    >>> a < b
    0
    >>> del F.__lt__
    >>> a < b
    True

    That said, I know nothing about how comparison internals changed for Python 3, so I may just be hallucinating :-)

    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 12, 2017

    Wouldn't the assignment of "__lt__" change the value of the tp_richcompare slot? That seems to be what the code in Objects/typeobject.c is doing with the update_slot method and the related helper functions.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    I just ran it. With the patched interpreter, I get no error. With the
    unpatched interpreter, I get ValueError. :(.

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    @PPPerry, I have no idea what the bulk of the code in typeobect.c is trying to do.

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    Elliot, I don't care if the example behaves differently. Although someone else may ;-)

    The only things .sort() has ever tried to guarantee in the presence of mutations (of either the list or the elements) during sorting are that (a) the implementation won't segfault; and, (b) the list at the end is some permutation of the input list (no elements are lost or duplicated).

    If crazy mutation examples can provoke a segfault, that's possibly "a problem" - but different results really aren't (at least not to me).

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    On Sat, Mar 11, 2017 at 9:01 PM Tim Peters <report@bugs.python.org> wrote:

    Elliot, I don't care if the example behaves differently. Although someone
    else may ;-)

    The only things .sort() has ever tried to guarantee in the presence of
    mutations (of either the list or the elements) during sorting are that (a)
    the implementation won't segfault; and, (b) the list at the end is some
    permutation of the input list (no elements are lost or duplicated).

    If crazy mutation examples can provoke a segfault, that's possibly "a
    problem" - but different results really aren't (at least not to me).

    That's great to hear. (Of course, one could always remove
    unsafe_object_compare from the patch and keep the rest, but that would be a
    real shame). I don't think segfaults are possible if the code is
    pure-Python, because all the builtin/stdlib functions type-check anyway, so
    you would just get an exception. Right? Of course, using the C API you
    could probably provoke segfaults, but there are much easier ways to
    segfault using the C API :).

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    Elliot, did you run the example in a release build or a debug build? I'm wondering why this:

    assert(v->ob_type == w->ob_type &&
    v->ob_type->tp_richcompare != NULL &&
    v->ob_type->tp_richcompare == compare_funcs.key_richcompare);

    didn't blow up (in unsafe_object_compare).

    If that does blow up in a debug build, it suggests "a fix": unconditionally check whether the tp_richcompare slot is the expected value. If not, use PyObject_RichCompareBool(v, w, Py_LT) instead.

    @eryksun
    Copy link
    Contributor

    @eryksun eryksun commented Mar 12, 2017

    assignment of "__lt__" change the value of the tp_richcompare slot?

    Yes. CPython doesn't implement individual dispatching of the rich-comparison functions. There's a single tp_richcompare slot, so overriding one rich comparison forces the use of slot_tp_richcompare. For built-in types this incurs the performance penalty of using a wrapper_descriptor for the other rich comparisons. For example, overriding F.__lt__ forces calling float.__gt__ for the greater-than comparison.

    Before:

    >>> F() > 1
    
    Breakpoint 0 hit
    python37_d!float_richcompare:
    00000000`6099d930 4489442418      mov     dword ptr [rsp+18h],r8d
                                              ss:00000056`cefef280=a1670058
    0:000> kc 3
    Call Site
    python37_d!float_richcompare
    python37_d!do_richcompare
    python37_d!PyObject_RichCompare
    0:000> g
    False
    

    After:

    >>> F.__lt__ = lambda a, b: 0
    >>> F() > 1
    
    Breakpoint 0 hit
    python37_d!float_richcompare:
    00000000`6099d930 4489442418      mov     dword ptr [rsp+18h],r8d
                                              ss:00000056`cefef0a0=a39d7c70
    0:000> kc 9
    Call Site
    python37_d!float_richcompare
    python37_d!wrap_richcmpfunc
    python37_d!richcmp_gt
    python37_d!wrapper_call
    python37_d!_PyObject_FastCallDict
    python37_d!call_unbound
    python37_d!slot_tp_richcompare
    python37_d!do_richcompare
    python37_d!PyObject_RichCompare
    

    The __gt__ wrapper_descriptor gets bound as a method-wrapper, and the method-wrapper tp_call is wrapper_call, which calls the wrapper function (e.g. richcmp_gt) with the wrapped function (e.g. float_richcompare). The object ID in CPython is the object address, so we can easily get the address of the __gt__ wrapper_descriptor to confirm how these C function pointers are stored in it:

        >>> id(vars(float)['__gt__'])
        2154486684248
    0:001> ln @@(((PyWrapperDescrObject *)2154486684248)->d_base->wrapper)
    (00000000`60a20580)   python37_d!richcmp_gt   |
    (00000000`60a205c0)   python37_d!slot_tp_finalize
    Exact matches:
        python37_d!richcmp_gt (struct _object *, struct _object *, void *)
    
    0:001> ln @@(((PyWrapperDescrObject *)2154486684248)->d_wrapped)
    (00000000`6099d930)   python37_d!float_richcompare   |
    (00000000`6099e6f0)   python37_d!float_getzero
    Exact matches:
        python37_d!float_richcompare (struct _object *, struct _object *, int)
    

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    It was a release build -- it would blow up in a debug build.

    Now, regarding the fix you propose: I'll benchmark it tomorrow. If the
    impact is small, I agree that it would be an elegant fix. If not, however,
    perhaps we just have to get rid of unsafe_object_compare entirely? The
    common use cases would still work just fine, and maybe we could add a case
    for bytes or something to compensate for the loss.

    On Sat, Mar 11, 2017 at 9:12 PM Tim Peters <report@bugs.python.org> wrote:

    Tim Peters added the comment:

    Elliot, did you run the example in a release build or a debug build? I'm
    wondering why this:

    assert(v->ob_type == w->ob_type &&
    v->ob_type->tp_richcompare != NULL &&
    v->ob_type->tp_richcompare == compare_funcs.key_richcompare);

    didn't blow up (in unsafe_object_compare).

    If that does blow up in a debug build, it suggests "a fix":
    unconditionally check whether the tp_richcompare slot is the expected
    value. If not, use PyObject_RichCompareBool(v, w, Py_LT) instead.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    The impact would be small: it would add one (or so) pointer-equality compare that, in practice, will always say "yup, they're equal". Dirt cheap, and the branch is 100% predictable.

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    Ya, that makes sense... I just don't get why it's faster at all, then!
    Because if we add the (v==w) check, and the tp_richcompare check, how is
    unsafe_object_compare any different from PyObject_RichComareBool? Is it
    that we're saving function calls? (PyObject_RichCompareBool calls
    do_richcompare, so it's one extra call IIRC).

    On Sat, Mar 11, 2017 at 9:32 PM Tim Peters <report@bugs.python.org> wrote:

    Tim Peters added the comment:

    The impact would be small: it would add one (or so) pointer-equality
    compare that, in practice, will always say "yup, they're equal". Dirt
    cheap, and the branch is 100% predictable.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    Eryk Sun: Thanks for your detailed response. I'm curious, though: can you
    figure out why those other two examples *didn't* fail? It's driving me
    crazy! For reference: https://bugs.python.org/msg289435

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 12, 2017

    Elliot, PyObject_RichCompareBool calls PyObject_RichCompare. That in turn does some checks, hides a small mountain of tests in the expansions of the recursion-checking macros, and calls do_richcompare. That in turn does some useless (in the cases you're aiming at) tests and finally gets around to invoking tp_richcompare. Your patch gets to that final step at once.

    I'm surprised you didn't know that ;-)

    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    I am embarrassed! That's why I said IIRC... I remembered that either
    RichCompare calls RichCompareBool, or the other way around, and I was too
    lazy to check :) I did remember about do_richcompare, though!

    On Sat, Mar 11, 2017 at 10:07 PM Tim Peters <report@bugs.python.org> wrote:

    Tim Peters added the comment:

    Elliot, PyObject_RichCompareBool calls PyObject_RichCompare. That in turn
    does some checks, hides a small mountain of tests in the expansions of the
    recursion-checking macros, and calls do_richcompare. That in turn does
    some useless (in the cases you're aiming at) tests and finally gets around
    to invoking tp_richcompare. Your patch gets to that final step at once.

    I'm surprised you didn't know that ;-)

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue28685\>


    @embg
    Copy link
    Mannequin Author

    @embg embg mannequin commented Mar 12, 2017

    OK, I added the safety check to unsafe_object_compare. I verified that it
    matches the current implementation on all the tests proposed in this thread.

    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Mar 13, 2017

    -----

    Doesn't youre skipping PyObject_RichCompareBool and directly getting
    tp_richcompare also mean that you bypass the NotImplemented checking?
    Thus, wouldn't this code, which raises a TypeError currently, silently
    work?

        class PointlessComparator:
            def __lt__(self, other):
                return NotImplemented
        [PointlessComparator(),PointlessComparator()].sort()

    ... ... ...

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Mar 13, 2017

    @PPPerry, I believe you're right - good catch! I expect the current patch would treat the NotImplemented return as meaning "the first argument is less than the second argument".

    I added a comment to the code (on github) suggesting an obvious fix for that.

    @godaygo
    Copy link
    Mannequin

    @godaygo godaygo mannequin commented Jan 28, 2018

    What is the current status of this issue and will it go into Python 3.7?

    @rhettinger
    Copy link
    Contributor

    @rhettinger rhettinger commented Jan 28, 2018

    It would be really nice to have this into 3.7

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Jan 28, 2018

    I agree it would be nice (very!) to get this in. Indeed, I'm surprised to see that this is still open :-(

    But who can drive it? I cannot.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 28, 2018

    I suggest to post-pone this optimization to Python 3.8. Faster list.sort() is nice to have, but I'm not sure that it's a killer feature that will make everybody move to Python 3.7. It can wait for 3.8.

    No core dev took the lead on this non-trivial issue, and IMHO it's getting too late for 3.7.

    It see that Serhiy started to review the change and asked for more benchmarks. Serhiy would be a good candidate to drive such work, but sadly he seems to be busy these days...

    While such optimization is nice to have, we should be careful to not introduce a performance regressions on some cases. I read quickly the issue, and I'm not sure that it was fully carefully reviewed and tested yet. Sorry, I only read it quickly, ignore me if I'm wrong.

    Well, if someone wants to take the responsability of pushing this right now, it's up to you :-)

    @rhettinger rhettinger self-assigned this Jan 29, 2018
    @rhettinger
    Copy link
    Contributor

    @rhettinger rhettinger commented Jan 29, 2018

    New changeset 1e34da4 by Raymond Hettinger (embg) in branch 'master':
    bpo-28685: Optimize sorted() list.sort() with type-specialized comparisons (#582)
    1e34da4

    @tim-one
    Copy link
    Member

    @tim-one tim-one commented Jan 29, 2018

    Thank you for giving this worthy orphan a home, Raymond!

    Victor, don't fret too much. The code is really quite simple, and at worst affects only sorted() and list.sort(). The vast bulk of review effort (of which it got enough) went into dreaming up pathological cases (like mutating list elements, and even list element comparison methods, while a sort was in progress).

    I confess I didn't press for more benchmarks, because I don't care about more here: the code is so obviously a major speed win when it applies, it so obviously applies often, and the new worst-case overhead when it doesn't apply is so obviously minor compared to the cost of a sort (len(list)-1 wasted C-level pointer equality compares). The only real value of timing benchmarks to me here was as a gross sanity check, and enough of those were run to confirm that all the preceding were qualitatively on target. It really doesn't matter at all, e.g., whether the best cases are 2 times faster or 10 times faster, or the truly pathologal worst cases 10% slower or 20% slower. Regardless, it's as close to a pure win as just about anything can be.

    Nevertheless ... if this brings a major player's server to its knees, blame Raymond ;-)

    @pppery
    Copy link
    Mannequin

    @pppery pppery mannequin commented Jan 29, 2018

    ... and I'm still trying to come up with even more pathological mutating cases

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 29, 2018

    New changeset 8017b80 by Victor Stinner in branch 'master':
    bpo-28685: Fix compiler warning (GH-5423)
    8017b80

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 interpreter-core performance
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants