
Optimize list comprehensions with preallocate size and protect against overflow #80732

Closed
tonybaloney mannequin opened this issue Apr 8, 2019 · 26 comments
Assignees
Labels
3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments


tonybaloney mannequin commented Apr 8, 2019

BPO 36551
Nosy @ronaldoussoren, @ncoghlan, @methane, @serhiy-storchaka, @aaronchall, @tonybaloney, @pablogsal, @tonybaloney
PRs
  • bpo-36551: Optimize list comprehensions with preallocate size and protect against overflow #12718
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tonybaloney'
    closed_at = <Date 2019-05-09.06:56:01.430>
    created_at = <Date 2019-04-08.01:28:21.834>
    labels = ['interpreter-core', 'type-feature', '3.8']
    title = 'Optimize list comprehensions with preallocate size and protect against overflow'
    updated_at = <Date 2019-05-09.09:06:57.743>
    user = 'https://github.com/tonybaloney'

    bugs.python.org fields:

    activity = <Date 2019-05-09.09:06:57.743>
    actor = 'serhiy.storchaka'
    assignee = 'anthonypjshaw'
    closed = True
    closed_date = <Date 2019-05-09.06:56:01.430>
    closer = 'anthonypjshaw'
    components = ['Interpreter Core']
    creation = <Date 2019-04-08.01:28:21.834>
    creator = 'anthony shaw'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36551
    keywords = ['patch']
    message_count = 26.0
    messages = ['339586', '339594', '339595', '339612', '339613', '339614', '339617', '339618', '339619', '339620', '339621', '339622', '339623', '339625', '339626', '339627', '339628', '339630', '339631', '339636', '339640', '339684', '339699', '339702', '339709', '341975']
    nosy_count = 8.0
    nosy_names = ['ronaldoussoren', 'ncoghlan', 'methane', 'serhiy.storchaka', 'Aaron Hall', 'anthonypjshaw', 'pablogsal', 'anthony shaw']
    pr_nums = ['12718']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36551'
    versions = ['Python 3.8']


    tonybaloney mannequin commented Apr 8, 2019

    List comprehensions currently create a series of opcodes inside a code object, the first of which is BUILD_LIST with an oparg of 0, effectively creating a zero-length list with a preallocated size of 0.

    If you're doing a simple list comprehension on an iterator, e.g.

    def foo():
      a = iterable
      return [x for x in a]

    Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
    3 0 BUILD_LIST 0
    2 LOAD_FAST 0 (.0)
    >> 4 FOR_ITER 8 (to 14)
    6 STORE_FAST 1 (x)
    8 LOAD_FAST 1 (x)
    10 LIST_APPEND 2
    12 JUMP_ABSOLUTE 4
    >> 14 RETURN_VALUE

The list comprehension will call list_resize each time the list outgrows its capacity, which steps through 4, 8, 16, 25, 35, 46, 58, 72, 88 elements, and so on.
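Those capacity steps follow from the over-allocation logic in list_resize (Objects/listobject.c). A minimal sketch, assuming the pre-3.9 growth formula newsize + (newsize >> 3) + (3 if newsize < 9 else 6), reproduces the sequence:

```python
def overallocate(newsize):
    # Pre-3.9 CPython list over-allocation (Objects/listobject.c, list_resize)
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)

def capacities(n_appends):
    """Return the capacities a list grows through while appending n_appends items."""
    size, allocated, caps = 0, 0, []
    for _ in range(n_appends):
        size += 1
        if size > allocated:            # out of room: a resize is triggered
            allocated = overallocate(size)
            caps.append(allocated)
    return caps

print(capacities(88))  # [4, 8, 16, 25, 35, 46, 58, 72, 88]
```

Each step is a realloc (and potentially a copy), which is the repeated cost that presizing from a length hint would avoid.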

    This PR preallocates the list created in a list comprehension to the length of the iterator using PyObject_LengthHint().

    It uses a new BUILD_LIST_PREALLOC opcode which builds a list with the allocated size of PyObject_LengthHint(co_varnames[oparg]).

    [x for x in iterable] compiles to:

    Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
    3 0 BUILD_LIST_PREALLOC 0
    2 LOAD_FAST 0 (.0)
    >> 4 FOR_ITER 8 (to 14)
    6 STORE_FAST 1 (x)
    8 LOAD_FAST 1 (x)
    10 LIST_APPEND 2
    12 JUMP_ABSOLUTE 4
    >> 14 RETURN_VALUE

    If the comprehension has ifs, then it will use the existing BUILD_LIST opcode

    Testing using a range length of 10000

    ./python.exe -m timeit "x=list(range(10000)); [y for y in x]"

    Gives 392us on the current 3.8 branch
    and 372us with this change (about 8-10% faster)

The longer the iterable, the bigger the impact.

This change also catches the case where a very large iterable, such as a range object:
[a for a in range(2**256)]

would cause the current (3.8 and earlier) interpreter to consume all memory and crash, because there is currently no check against PY_SSIZE_T_MAX.

With this change (assuming there is no if inside the comprehension), this is now caught and raised as an OverflowError:

    >>> [a for a in range(2**256)]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <listcomp>
    OverflowError: Python int too large to convert to C ssize_t

    @tonybaloney tonybaloney mannequin added 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Apr 8, 2019
    @serhiy-storchaka
    Member

    The benefit is too small to add a new opcode.


    tonybaloney mannequin commented Apr 8, 2019

    The opcode would not solely apply to this specific use case.

    I could seek another way of implementing the same behaviour without an additional opcode?

    @ronaldoussoren
    Contributor

    This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

     [x**2 for x in range(LARGE_VALUE) if is_prime(x)]
    

    See also bpo-28940


    methane commented Apr 8, 2019

    I agree with Serhiy. Benefit seems too small to add new opcode.

    I could seek another way of implementing the same behaviour without an additional opcode?

    How about converting [x for x in it] to [*it] in AST?


    methane commented Apr 8, 2019

    $ python3 -m timeit -s 'r=range(1000)' -- '[x for x in r]'
    5000 loops, best of 5: 40 usec per loop
    
    $ python3 -m timeit -s 'r=range(1000)' -- '[*r]'
    20000 loops, best of 5: 17.3 usec per loop
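For the bare identity comprehension the two spellings build the same list, which is what makes such an AST rewrite conceivable; a quick sanity check (the function names here are purely illustrative):

```python
def as_unpack(it):
    return [*it]             # BUILD_LIST_UNPACK: extends from the iterable

def as_comprehension(it):
    return [x for x in it]   # LIST_APPEND loop inside a nested code object

# Same elements, same order, including for one-shot iterators.
assert as_unpack(iter(range(5))) == as_comprehension(iter(range(5))) == [0, 1, 2, 3, 4]
```

Note the comprehension also runs in its own scope (visible in tracebacks and in sys.settrace events), so the rewrite would not be perfectly transparent.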

    @pablogsal
    Member

    This patch makes it slow for small iterators:

    Perf program:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x for x in range(10)]",
                   setup="")

    Current master:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 3.97 us +- 0.15 us

    PR 12718:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 4.57 us +- 0.17 us

    The overhead is very likely due to calling __length_hint__


    tonybaloney mannequin commented Apr 8, 2019

    How about converting [x for x in it] to [*it] in AST?

I should have been more explicit: this patch improves the performance of all list comprehensions that don't have an if clause.

    Not just
    [x for x in y]

    but:

    d = {} # some sort of dictionary
[f"{k} — {v}" for k, v in d.items()]
    
    a = iterable
    [val**2 for val in a]

    Would all use BUILD_LIST_PREALLOC and use a LengthHint.

    I can do another speed test for those other scenarios.

    Most of the stdlib packages have these sorts of list comps, including those in the default site.py.


    methane commented Apr 8, 2019

I should have been more explicit: this patch improves the performance of all list comprehensions that don't have an if clause.

But in these cases, the overhead of reallocation is smaller relative to the per-item work than in the simple case.


    tonybaloney mannequin commented Apr 8, 2019

    This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

The current implementation of list comprehensions raises neither a MemoryError nor an OverflowError; it will consume all available memory and crash the interpreter.

This patch raises an OverflowError up front instead of looping until the heap is exhausted.

    @pablogsal
    Member

    More benchmarks for slow iterators:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x**2 for x in k]",
                   setup="k=iter(list(range(10)))")

    Current master:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 924 ns +- 35 ns

    PR 12718:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 1.17 us +- 0.06 us


    methane commented Apr 8, 2019

    > This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

The current implementation of list comprehensions raises neither a MemoryError nor an OverflowError; it will consume all available memory and crash the interpreter.

This patch raises an OverflowError up front instead of looping until the heap is exhausted.

    Note PEP-424.

    """
    __length_hint__ must return an integer (else a TypeError is raised) or
    NotImplemented, and is not required to be accurate. It may return a
    value that is either larger or smaller than the actual size of the
    container.
    """

    it.__length_hint__ can return 2**1000 even if len(list(it))==0.

    In such case, current behavior works. And your patch will raise OverflowError.
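As an illustration (a sketch, with the hint kept below PY_SSIZE_T_MAX so it converts cleanly to a C ssize_t): a wildly overestimating hint is legal under PEP 424, and list() handles it today because the hint only guides preallocation:

```python
import operator

class LyingIterator:
    """Produces no items but hints at a million of them (allowed by PEP 424)."""
    def __iter__(self):
        return self
    def __next__(self):
        raise StopIteration
    def __length_hint__(self):
        return 10**6  # wildly inaccurate, but legal

print(operator.length_hint(LyingIterator()))  # 1000000
print(list(LyingIterator()))                  # [] -- the hint only presizes
```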


    tonybaloney mannequin commented Apr 8, 2019

    This patch makes it slow for small iterators

    That is a one-off cost for the __length_hint__ of the range object specifically.
    Objects with a known length (lists, sets, tuples) would not have that overhead.

    I can run a more useful set of benchmarks against this.

So the +0.6 us overhead would be the same for ranges of length 8-16, smaller for 16-25, and smaller again for 25-35, since avoiding the reallocations matters more for longer ranges.


    tonybaloney mannequin commented Apr 8, 2019

    In such case, current behavior works. And your patch will raise OverflowError.

    Try

    [x for x in range(2**1000)]

in a REPL. It doesn't raise anything; it keeps growing a list that will eventually exceed PY_SSIZE_T_MAX, and only crashes once it reaches that point.

    This raises an OverflowError instead, the same way:
    len(range(2**1000))
    raises an OverflowError
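The len() behaviour is easy to reproduce on a current CPython (this sketch assumes a 64-bit build): range stores its length as a Python int, len() must fit it into a C Py_ssize_t, but iteration never needs the full length up front:

```python
huge = range(2**1000)

try:
    len(huge)
except OverflowError as exc:
    print("len() raised:", exc)  # Python int too large to convert to C ssize_t

# Iteration is lazy and works regardless of the total length.
it = iter(huge)
print(next(it), next(it))  # 0 1
```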


    methane commented Apr 8, 2019

    Try

    [x for x in range(2**1000)]

in a REPL. It doesn't raise anything; it keeps growing a list that will eventually exceed PY_SSIZE_T_MAX, and only crashes once it reaches that point.

    It is expected behavior.

    This raises an OverflowError instead, the same way:
    len(range(2**1000))
    raises an OverflowError

If your patch uses __length_hint__ for this, it is a bug: an iterator can return 2**1000 from __length_hint__ but produce no items on iteration.


    tonybaloney mannequin commented Apr 8, 2019

If your patch uses __length_hint__ for this, it is a bug: an iterator can return 2**1000 from __length_hint__ but produce no items on iteration.

It raises an OverflowError because of the goto at
https://github.com/python/cpython/pull/12718/files#diff-7f17c8d8448b7b6f90549035d2147a9fR2493; it could just as easily set size to 0.

I used goto error because it gave the opportunity to handle an expected fault gracefully.


    tonybaloney mannequin commented Apr 8, 2019

If your patch uses __length_hint__ for this, it is a bug.

    I’m not sure I understand this comment,

    PEP-424 says “This is useful for presizing containers when building from an iterable.“

    This patch uses __length_hint__ to presize the list container for a list comprehension.


    methane commented Apr 8, 2019

    "useful" doesn't mean "use it as-is".
    It is just a hint. It will be wrong.

See here for the list example:

    /* Guess a result list size. */
    n = PyObject_LengthHint(iterable, 8);
    if (n < 0) {
        Py_DECREF(it);
        return NULL;
    }
    m = Py_SIZE(self);
    if (m > PY_SSIZE_T_MAX - n) {
        /* m + n overflowed; on the chance that n lied, and there really
         * is enough room, ignore it.  If n was telling the truth, we'll
         * eventually run out of memory during the loop.
         */


    methane commented Apr 8, 2019

    I'm sorry. list_extend raises OverflowError too.

    @pablogsal
    Member

    That is a one-off cost for the __length_hint__ of the range object specifically.
    Objects with a known length (lists, sets, tuples) would not have that overhead.

That seems incorrect. This is not unique to range objects; it also affects objects with known lengths (like a list):

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x*2 for x in k]",
                   setup="k=list(range(10))")

    Current master:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 3.82 us +- 0.13 us

    PR 12718:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 4.38 us +- 0.16 us

See also my other benchmark with a list iterator (iter(list(range(10)))) or this one with a generator comprehension:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x*2 for x in it]",
                   setup="k=list(range(10));it=(x for x in k)")

    Current master:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 945 ns +- 27 ns

    PR 12718:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 1.33 us +- 0.05 us


    ncoghlan commented Apr 8, 2019

    I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

    While length_hint is allowed to be somewhat inaccurate, we do expect it to be at least *vaguely* accurate (otherwise it isn't very useful, and if it can be inaccurate enough to trigger OverflowError or MemoryError in cases that would otherwise work reasonably well, it would be better for a type not to implement it at all).

    While it would be nice to be able to avoid adding a new opcode, the problem is that the existing candidate opcodes (BUILD_LIST, BUILD_LIST_UNPACK) are both inflexible in what they do:

    • BUILD_LIST requires that the final list length be known at compile time
    • BUILD_LIST_UNPACK infers the final length from an object reference, but calls _PyList_Extend directly, so it combines the preallocation and the iterator consumption into a single step, and hence can't be used outside of iterable unpacking into a list

    At the same time, attempting to generalise either of them isn't desirable, since it would slow them down for their existing use cases, and be slower than a new opcode for this use case.

    The proposed BUILD_LIST_PREALLOC opcode splits the difference: it lets the compiler provide the interpreter with a *hint* as to how big the resulting list is expected to be.

    That said, you'd want to run the result through the benchmark suite rather than relying solely on microbenchmarks, as even though unfiltered "[some_operation_on_x for x in y]" comprehensions without nested loops or filter clauses are pretty common (more common than the relatively new "[*itr]" syntax), it's far less clear what the typical distribution in input lengths actually is, and how many memory allocations need to be avoided in order to offset the cost of the initial _PyObject_LengthHint call (as Pablo's small scale results show).

    (Note that in the _PyList_Extend code, there are preceding special cases for builtin lists and tuples that take those down a much faster path that avoids the _PyObject_LengthHint call entirely)

    @pablogsal
    Member

    Here are the updated results for the benchmark suite. The previous results (unlinked from the issue to reduce noise)
    were against an old version of the master branch.

    2019-04-08_13-08-master-58721a903074.json.gz
    ============================================

    Performance version: 0.7.0
    Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
    Number of logical CPUs: 12
    Start date: 2019-04-09 00:32:40.656576
    End date: 2019-04-09 00:54:03.721798

    12718-5f06333a4e49.json.gz
    ==========================

    Performance version: 0.7.0
    Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
    Number of logical CPUs: 12
    Start date: 2019-04-09 00:07:53.433511
    End date: 2019-04-09 00:29:32.603022

                                        -- CURRENT MASTER --                    -- [PR12718](https://github.com/python/cpython/pull/12718) --
    

    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | Benchmark | 2019-04-08_13-08-master-58721a903074.json.gz | 12718-5f06333a4e49.json.gz | Change | Significance |
    +=========================+==============================================+============================+==============+========================+
    | 2to3 | 455 ms | 465 ms | 1.02x slower | Significant (t=-2.27) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | chaos | 171 ms | 180 ms | 1.05x slower | Significant (t=-5.92) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | crypto_pyaes | 169 ms | 177 ms | 1.05x slower | Significant (t=-6.06) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | deltablue | 11.5 ms | 11.9 ms | 1.03x slower | Significant (t=-3.28) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | django_template | 175 ms | 191 ms | 1.09x slower | Significant (t=-13.00) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | dulwich_log | 127 ms | 131 ms | 1.03x slower | Significant (t=-8.97) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | fannkuch | 646 ms | 665 ms | 1.03x slower | Significant (t=-12.37) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | float | 165 ms | 165 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | go | 386 ms | 381 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | hexiom | 15.4 ms | 15.5 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | html5lib | 130 ms | 132 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | json_dumps | 18.4 ms | 18.3 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | json_loads | 38.6 us | 38.6 us | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_format | 15.0 us | 15.6 us | 1.04x slower | Significant (t=-10.98) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_silent | 284 ns | 287 ns | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_simple | 13.5 us | 13.9 us | 1.03x slower | Significant (t=-8.03) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | mako | 24.6 ms | 24.5 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | meteor_contest | 157 ms | 155 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | nbody | 196 ms | 198 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | nqueens | 144 ms | 147 ms | 1.02x slower | Significant (t=-9.68) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pathlib | 27.6 ms | 28.4 ms | 1.03x slower | Significant (t=-5.24) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle | 14.1 us | 14.2 us | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_dict | 33.7 us | 35.2 us | 1.04x slower | Significant (t=-13.28) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_list | 5.22 us | 5.38 us | 1.03x slower | Significant (t=-6.65) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_pure_python | 681 us | 719 us | 1.05x slower | Significant (t=-5.64) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pidigits | 217 ms | 219 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | python_startup | 12.2 ms | 12.5 ms | 1.02x slower | Significant (t=-9.34) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | python_startup_no_site | 8.90 ms | 8.92 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | raytrace | 789 ms | 793 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_compile | 267 ms | 270 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_dna | 221 ms | 221 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_effbot | 3.94 ms | 3.98 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_v8 | 30.4 ms | 30.4 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | richards | 108 ms | 108 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_fft | 525 ms | 525 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_lu | 270 ms | 271 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_monte_carlo | 160 ms | 160 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_sor | 295 ms | 305 ms | 1.03x slower | Significant (t=-9.82) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_sparse_mat_mult | 6.63 ms | 6.57 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | spectral_norm | 217 ms | 219 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlalchemy_declarative | 239 ms | 244 ms | 1.02x slower | Significant (t=-4.34) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlalchemy_imperative | 50.7 ms | 52.0 ms | 1.03x slower | Significant (t=-3.84) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlite_synth | 4.20 us | 4.14 us | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_expand | 612 ms | 637 ms | 1.04x slower | Significant (t=-8.36) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_integrate | 27.5 ms | 28.9 ms | 1.05x slower | Significant (t=-4.16) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_str | 284 ms | 299 ms | 1.05x slower | Significant (t=-6.97) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_sum | 151 ms | 156 ms | 1.03x slower | Significant (t=-4.37) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | telco | 9.70 ms | 9.64 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | tornado_http | 276 ms | 287 ms | 1.04x slower | Significant (t=-5.75) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpack_sequence | 64.4 ns | 67.3 ns | 1.05x slower | Significant (t=-6.16) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle | 22.8 us | 22.7 us | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle_list | 5.72 us | 5.97 us | 1.04x slower | Significant (t=-6.05) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle_pure_python | 510 us | 527 us | 1.03x slower | Significant (t=-3.89) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_generate | 150 ms | 154 ms | 1.02x slower | Significant (t=-3.72) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_iterparse | 145 ms | 148 ms | 1.02x slower | Significant (t=-3.60) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_parse | 208 ms | 207 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_process | 121 ms | 125 ms | 1.04x slower | Significant (t=-5.43) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+


    tonybaloney mannequin commented Apr 9, 2019

I've just optimized some of the code and pushed another change as 69dce1c552.

I ran both master and 69dce1c552 using pyperformance with PGO:

    ➜ ~ python3.8 -m perf compare_to master.json 69dce1c552.json --table
    +-------------------------+---------+-----------------------------+
    | Benchmark | master | 69dce1c552 |
    +=========================+=========+=============================+
    | 2to3 | 432 ms | 426 ms: 1.02x faster (-2%) |
    +-------------------------+---------+-----------------------------+
    | chaos | 157 ms | 155 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | crypto_pyaes | 154 ms | 153 ms: 1.00x faster (-0%) |
    +-------------------------+---------+-----------------------------+
    | dulwich_log | 123 ms | 124 ms: 1.00x slower (+0%) |
    +-------------------------+---------+-----------------------------+
    | fannkuch | 603 ms | 600 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | float | 153 ms | 154 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | go | 323 ms | 326 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | hexiom | 13.6 ms | 13.5 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | json_dumps | 18.1 ms | 17.9 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | logging_format | 13.2 us | 13.8 us: 1.05x slower (+5%) |
    +-------------------------+---------+-----------------------------+
    | logging_silent | 266 ns | 280 ns: 1.05x slower (+5%) |
    +-------------------------+---------+-----------------------------+
    | logging_simple | 12.4 us | 13.1 us: 1.06x slower (+6%) |
    +-------------------------+---------+-----------------------------+
    | meteor_contest | 145 ms | 132 ms: 1.10x faster (-9%) |
    +-------------------------+---------+-----------------------------+
    | nbody | 179 ms | 172 ms: 1.04x faster (-4%) |
    +-------------------------+---------+-----------------------------+
    | nqueens | 138 ms | 134 ms: 1.03x faster (-3%) |
    +-------------------------+---------+-----------------------------+
    | pathlib | 56.4 ms | 55.6 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | pickle | 15.0 us | 15.4 us: 1.03x slower (+3%) |
    +-------------------------+---------+-----------------------------+
    | pickle_pure_python | 620 us | 617 us: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | raytrace | 696 ms | 691 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | regex_compile | 242 ms | 243 ms: 1.00x slower (+0%) |
    +-------------------------+---------+-----------------------------+
    | scimark_monte_carlo | 140 ms | 143 ms: 1.02x slower (+2%) |
    +-------------------------+---------+-----------------------------+
    | scimark_sparse_mat_mult | 5.90 ms | 5.94 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | spectral_norm | 194 ms | 196 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | sympy_str | 246 ms | 245 ms: 1.00x faster (-0%) |
    +-------------------------+---------+-----------------------------+
    | telco | 8.42 ms | 8.31 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | unpack_sequence | 59.2 ns | 59.7 ns: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle | 21.2 us | 21.4 us: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle_list | 5.73 us | 5.81 us: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle_pure_python | 471 us | 467 us: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_iterparse | 142 ms | 143 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_generate | 139 ms | 137 ms: 1.02x faster (-2%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_process | 109 ms | 108 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+

    Not significant (21): deltablue; django_template; html5lib; json_loads; mako; pickle_dict; pickle_list; pidigits; python_startup; python_startup_no_site; regex_dna; regex_effbot; regex_v8; richards; scimark_fft; scimark_lu; scimark_sor; sympy_expand; sympy_integrate; sympy_sum; xml_etree_parse

    I'd like to look at how `__length_hint__` works for range iterators; that path looks suboptimal and could use some optimization. Also, BUILD_LIST_PREALLOC receives the iterator, not the underlying object, so it can't use the much faster _PyObject_HasLen and PyObject_Length().

    I'm going to look at how __length_hint__ could be optimized for iterators, which would make the smaller-range cases more efficient.

    meteor_contest uses a lot of list comprehensions, so it should show the impact of the patch.
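    The distinction between the underlying object and its iterator can be seen from Python with the stdlib `operator` module (a quick sketch; `BUILD_LIST_PREALLOC` is the opcode proposed in this PR and only exists on the branch):

    ```python
    import operator

    r = range(1000)
    print(len(r))                        # 1000: range itself has __len__
    it = iter(r)
    print(hasattr(type(it), "__len__"))  # False: the range iterator does not
    print(operator.length_hint(it))      # 1000: falls back to __length_hint__
    ```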

    @serhiy-storchaka
    Member

    I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

    And that optimization looks questionable to me. I tried to reduce the overhead for small lists, but this requires much more complex code and gives mixed results.

    I am -1 on this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions), and even in this case the gain is small. It would be possible to add many other optimizations for other cases which would speed them up by 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code that has to be maintained and covered by tests, it adds a small overhead in common cases to speed up an uncommon case, and growing the code base can negatively affect surrounding code (simply because the CPU cache and registers are used inappropriately and the compiler optimizes less important paths).

    In addition, while this change speeds up list comprehensions for long lists, it slows them down for short lists. Short lists are more common.
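    For reference, the preallocation logic used by the list constructor and list.extend() is driven by the length-hint protocol (PEP 424), which is exposed at the Python level as operator.length_hint: len() is tried first, then __length_hint__, then a default. A minimal sketch (the `Hinted` class here is made up for illustration):

    ```python
    import operator

    class Hinted:
        """Iterable with no __len__ but an (approximate) size hint."""
        def __iter__(self):
            return iter(())
        def __length_hint__(self):
            return 8

    print(operator.length_hint([1, 2], 99))   # 2: len() wins when available
    print(operator.length_hint(Hinted()))     # 8: falls back to __length_hint__
    print(operator.length_hint(object(), 4))  # 4: the default when neither exists
    ```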

    @tonybaloney
    Mannequin Author

    tonybaloney mannequin commented Apr 9, 2019

    I am -1 on this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with
    conditions), and even in this case the gain is small. It would be possible to add many other optimizations for other cases which would speed them up by 50% or 100%, but we do not do
    this, because every such optimization has a cost. It increases the amount of code that has to be maintained and covered by tests, it adds a small overhead in common cases to
    speed up an uncommon case, and growing the code base can negatively affect surrounding code (simply because the CPU cache and registers are used inappropriately and the
    compiler optimizes less important paths).

    Understood, I had hoped this change would have a broader impact. The additional opcode is not ideal either.

    In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common.

    I've been profiling this today; basically, this implementation receives the list_iter, range_iter, etc.

    There is no standard object model for an iterator's length: _PyObject_HasLen returns false because an iterator implements neither tp_as_sequence nor tp_as_mapping (rightly so).

    What this has uncovered (so hopefully there's some value from this whole experience!) is that __length_hint__ for iterators is _really_ inefficient. Take a list_iterator for example:

    1. PyObject_LengthHint calls _PyObject_HasLen, which returns false, so it falls back to
    2. _PyObject_LookupSpecial, then
    3. _PyObject_CallNoArg, which calls
    4. listiter_len, which calls
    5. PyList_GET_SIZE, which returns a Py_ssize_t; that is converted to a PyLong via
    6. PyLong_FromSsize_t and returned to PyObject_LengthHint, where
    7. PyLong_AsSsize_t converts the PyLong back into a Py_ssize_t.

    The Py_ssize_t is then finally returned to the caller!
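    The chain can be rendered roughly in pure Python (a sketch for illustration only, not the actual code in Objects/abstract.c; the comments map each step onto the C calls involved):

    ```python
    import operator

    # Rough pure-Python rendering of PyObject_LengthHint's fallback chain.
    def length_hint(obj, default=0):
        try:
            return len(obj)              # _PyObject_HasLen / PyObject_Length fast path
        except TypeError:
            pass                         # iterators land here: no tp_as_sequence
        hint = getattr(type(obj), "__length_hint__", None)  # _PyObject_LookupSpecial
        if hint is None:
            return default
        result = hint(obj)               # e.g. listiter_len boxes the size as a PyLong
        if result is NotImplemented:
            return default
        return operator.index(result)    # PyLong_AsSsize_t unboxes it again
    ```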

    My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple and set, but false for range objects (which is what you want).

    The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator.
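    This can be seen in the disassembly (a sketch; on 3.8, which this issue targets, the comprehension compiles to a separate code object, and GET_ITER runs in the enclosing frame before that code object is called — note that CPython 3.12+ inlines comprehensions and changes this picture):

    ```python
    import dis

    # GET_ITER appears in the enclosing code, so the comprehension body
    # only ever sees the iterator, never the original iterable.
    code = compile("[x * 2 for x in data]", "<demo>", "eval")
    dis.dis(code)
    ```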

    This is way outside my expertise, but it is the only way I can see to get a sizeable benefit with minimal code and no edge cases that are slower.

    Thanks for your time

    @tonybaloney tonybaloney mannequin self-assigned this May 9, 2019
    @tonybaloney tonybaloney mannequin closed this as completed May 9, 2019
    @serhiy-storchaka
    Member

    We can return to this issue if we make the invocation of __length_hint__ much, much faster, for example by adding a tp_length_hint slot. But currently it is too large a change and it has negative effects.
