
Optimize list comprehensions with preallocate size and protect against overflow #80732

Closed
tonybaloney mannequin opened this issue Apr 8, 2019 · 26 comments
Assignees
Labels
3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments


tonybaloney mannequin commented Apr 8, 2019

BPO 36551
Nosy @ronaldoussoren, @ncoghlan, @methane, @serhiy-storchaka, @aaronchall, @tonybaloney, @pablogsal, @tonybaloney
PRs
  • bpo-36551: Optimize list comprehensions with preallocate size and protect against overflow #12718
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tonybaloney'
    closed_at = <Date 2019-05-09.06:56:01.430>
    created_at = <Date 2019-04-08.01:28:21.834>
    labels = ['interpreter-core', 'type-feature', '3.8']
    title = 'Optimize list comprehensions with preallocate size and protect against overflow'
    updated_at = <Date 2019-05-09.09:06:57.743>
    user = 'https://github.com/tonybaloney'

    bugs.python.org fields:

    activity = <Date 2019-05-09.09:06:57.743>
    actor = 'serhiy.storchaka'
    assignee = 'anthonypjshaw'
    closed = True
    closed_date = <Date 2019-05-09.06:56:01.430>
    closer = 'anthonypjshaw'
    components = ['Interpreter Core']
    creation = <Date 2019-04-08.01:28:21.834>
    creator = 'anthony shaw'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36551
    keywords = ['patch']
    message_count = 26.0
    messages = ['339586', '339594', '339595', '339612', '339613', '339614', '339617', '339618', '339619', '339620', '339621', '339622', '339623', '339625', '339626', '339627', '339628', '339630', '339631', '339636', '339640', '339684', '339699', '339702', '339709', '341975']
    nosy_count = 8.0
    nosy_names = ['ronaldoussoren', 'ncoghlan', 'methane', 'serhiy.storchaka', 'Aaron Hall', 'anthonypjshaw', 'pablogsal', 'anthony shaw']
    pr_nums = ['12718']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36551'
    versions = ['Python 3.8']


    tonybaloney mannequin commented Apr 8, 2019

    List comprehensions currently create a series of opcodes inside a code object, the first of which is BUILD_LIST with an oparg of 0, effectively creating a zero-length list with a preallocated size of 0.

    If you're doing a simple list comprehension on an iterator, e.g.

    def foo():
      a = iterable
      return [x for x in a]

    Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
    3 0 BUILD_LIST 0
    2 LOAD_FAST 0 (.0)
    >> 4 FOR_ITER 8 (to 14)
    6 STORE_FAST 1 (x)
    8 LOAD_FAST 1 (x)
    10 LIST_APPEND 2
    12 JUMP_ABSOLUTE 4
    >> 14 RETURN_VALUE

The list comprehension will call list_resize each time the list outgrows its capacity, which steps through 4, 8, 16, 25, 35, 46, 58, 72, 88 elements, and so on.
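Those capacity steps follow from the over-allocation logic in list_resize (Objects/listobject.c). A minimal sketch, assuming the pre-3.9 growth formula newsize + (newsize >> 3) + (3 if newsize < 9 else 6), reproduces the sequence:

```python
def overallocate(newsize):
    # Pre-3.9 CPython list over-allocation (Objects/listobject.c, list_resize)
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)

def capacities(n_appends):
    """Return the capacities a list grows through while appending n_appends items."""
    size, allocated, caps = 0, 0, []
    for _ in range(n_appends):
        size += 1
        if size > allocated:            # out of room: a resize is triggered
            allocated = overallocate(size)
            caps.append(allocated)
    return caps

print(capacities(88))  # [4, 8, 16, 25, 35, 46, 58, 72, 88]
```

Each step is a realloc (and potentially a copy), which is the repeated cost that presizing from a length hint would avoid.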

    This PR preallocates the list created in a list comprehension to the length of the iterator using PyObject_LengthHint().

    It uses a new BUILD_LIST_PREALLOC opcode which builds a list with the allocated size of PyObject_LengthHint(co_varnames[oparg]).

    [x for x in iterable] compiles to:

    Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
    3 0 BUILD_LIST_PREALLOC 0
    2 LOAD_FAST 0 (.0)
    >> 4 FOR_ITER 8 (to 14)
    6 STORE_FAST 1 (x)
    8 LOAD_FAST 1 (x)
    10 LIST_APPEND 2
    12 JUMP_ABSOLUTE 4
    >> 14 RETURN_VALUE

    If the comprehension has ifs, then it will use the existing BUILD_LIST opcode

    Testing using a range length of 10000

    ./python.exe -m timeit "x=list(range(10000)); [y for y in x]"

    Gives 392us on the current 3.8 branch
    and 372us with this change (about 8-10% faster)

The longer the iterable, the bigger the impact.

This change also catches the case where a very large iterable, such as a range object:
[a for a in range(2**256)]

would cause the current (3.8 and earlier) interpreter to consume all memory and crash, because there is currently no check against PY_SSIZE_T_MAX.

With this change (assuming there is no if inside the comprehension), this is now caught and raised as an OverflowError:

    >>> [a for a in range(2**256)]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <listcomp>
    OverflowError: Python int too large to convert to C ssize_t

    @tonybaloney tonybaloney mannequin added 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Apr 8, 2019
    @serhiy-storchaka
    Member

    The benefit is too small to add a new opcode.


    tonybaloney mannequin commented Apr 8, 2019

    The opcode would not solely apply to this specific use case.

    I could seek another way of implementing the same behaviour without an additional opcode?

    @ronaldoussoren
    Contributor

    This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

     [x**2 for x in range(LARGE_VALUE) if is_prime(x)]
    

    See also bpo-28940


    methane commented Apr 8, 2019

    I agree with Serhiy. Benefit seems too small to add new opcode.

    I could seek another way of implementing the same behaviour without an additional opcode?

    How about converting [x for x in it] to [*it] in AST?


    methane commented Apr 8, 2019

    $ python3 -m timeit -s 'r=range(1000)' -- '[x for x in r]'
    5000 loops, best of 5: 40 usec per loop
    
    $ python3 -m timeit -s 'r=range(1000)' -- '[*r]'
    20000 loops, best of 5: 17.3 usec per loop
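For the bare identity comprehension the two spellings build the same list, which is what makes such an AST rewrite conceivable; a quick sanity check (the function names here are purely illustrative):

```python
def as_unpack(it):
    return [*it]             # BUILD_LIST_UNPACK: extends from the iterable

def as_comprehension(it):
    return [x for x in it]   # LIST_APPEND loop inside a nested code object

# Same elements, same order, including for one-shot iterators.
assert as_unpack(iter(range(5))) == as_comprehension(iter(range(5))) == [0, 1, 2, 3, 4]
```

Note the comprehension also runs in its own scope (visible in tracebacks and in sys.settrace events), so the rewrite would not be perfectly transparent.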

    @pablogsal
    Member

    This patch makes it slow for small iterators:

    Perf program:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x for x in range(10)]",
                   setup="")

    Current master:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 3.97 us +- 0.15 us

    PR 12718:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 4.57 us +- 0.17 us

    The overhead is very likely due to calling __length_hint__


    tonybaloney mannequin commented Apr 8, 2019

    How about converting [x for x in it] to [*it] in AST?

I should have been more explicit: this patch improves the performance of all list comprehensions that don't have an if clause.

    Not just
    [x for x in y]

    but:

    d = {} # some sort of dictionary
[f"{k} — {v}" for k, v in d.items()]
    
    a = iterable
    [val**2 for val in a]

    Would all use BUILD_LIST_PREALLOC and use a LengthHint.

    I can do another speed test for those other scenarios.

    Most of the stdlib packages have these sorts of list comps, including those in the default site.py.


    methane commented Apr 8, 2019

I should have been more explicit: this patch improves the performance of all list comprehensions that don't have an if clause.

But in these cases, the overhead of reallocation is smaller relative to the per-item work than in the simple case.


    tonybaloney mannequin commented Apr 8, 2019

    This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

The current implementation of list comprehensions raises neither a MemoryError nor an OverflowError; it will consume all available memory and crash the interpreter.

This patch raises an OverflowError up front instead of looping until the heap is exhausted.

    @pablogsal
    Member

    More benchmarks for slow iterators:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x**2 for x in k]",
                   setup="k=iter(list(range(10)))")

    Current master:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 924 ns +- 35 ns

    PR 12718:
    ❯ ./python.exe ../check.py
    .....................
    list_comp: Mean +- std dev: 1.17 us +- 0.06 us


    methane commented Apr 8, 2019

    > This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

The current implementation of list comprehensions raises neither a MemoryError nor an OverflowError; it will consume all available memory and crash the interpreter.

This patch raises an OverflowError up front instead of looping until the heap is exhausted.

    Note PEP-424.

    """
    __length_hint__ must return an integer (else a TypeError is raised) or
    NotImplemented, and is not required to be accurate. It may return a
    value that is either larger or smaller than the actual size of the
    container.
    """

    it.__length_hint__ can return 2**1000 even if len(list(it))==0.

    In such case, current behavior works. And your patch will raise OverflowError.
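As an illustration (a sketch, with the hint kept below PY_SSIZE_T_MAX so it converts cleanly to a C ssize_t): a wildly overestimating hint is legal under PEP 424, and list() handles it today because the hint only guides preallocation:

```python
import operator

class LyingIterator:
    """Produces no items but hints at a million of them (allowed by PEP 424)."""
    def __iter__(self):
        return self
    def __next__(self):
        raise StopIteration
    def __length_hint__(self):
        return 10**6  # wildly inaccurate, but legal

print(operator.length_hint(LyingIterator()))  # 1000000
print(list(LyingIterator()))                  # [] -- the hint only presizes
```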


    tonybaloney mannequin commented Apr 8, 2019

    This patch makes it slow for small iterators

    That is a one-off cost for the __length_hint__ of the range object specifically.
    Objects with a known length (lists, sets, tuples) would not have that overhead.

    I can run a more useful set of benchmarks against this.

So the +0.6 us overhead would be the same for ranges of length 8-16, smaller for 16-25, and smaller again for 25-35, since avoiding the reallocations matters more for longer ranges.


    tonybaloney mannequin commented Apr 8, 2019

    In such case, current behavior works. And your patch will raise OverflowError.

    Try

    [x for x in range(2**1000)]

in a REPL. It doesn't raise anything; it keeps growing a list that will eventually exceed PY_SSIZE_T_MAX, and only crashes once it reaches that point.

    This raises an OverflowError instead, the same way:
    len(range(2**1000))
    raises an OverflowError
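The len() behaviour is easy to reproduce on a current CPython (this sketch assumes a 64-bit build): range stores its length as a Python int, len() must fit it into a C Py_ssize_t, but iteration never needs the full length up front:

```python
huge = range(2**1000)

try:
    len(huge)
except OverflowError as exc:
    print("len() raised:", exc)  # Python int too large to convert to C ssize_t

# Iteration is lazy and works regardless of the total length.
it = iter(huge)
print(next(it), next(it))  # 0 1
```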


    methane commented Apr 8, 2019

    Try

    [x for x in range(2**1000)]

in a REPL. It doesn't raise anything; it keeps growing a list that will eventually exceed PY_SSIZE_T_MAX, and only crashes once it reaches that point.

    It is expected behavior.

    This raises an OverflowError instead, the same way:
    len(range(2**1000))
    raises an OverflowError

If your patch uses __length_hint__ for this, it is a bug: an iterator can return 2**1000 from __length_hint__ but produce no items on iteration.


    tonybaloney mannequin commented Apr 8, 2019

If your patch uses __length_hint__ for this, it is a bug: an iterator can return 2**1000 from __length_hint__ but produce no items on iteration.

It raises an OverflowError because of the goto at
https://github.com/python/cpython/pull/12718/files#diff-7f17c8d8448b7b6f90549035d2147a9fR2493; it could just as easily set size to 0.

I used goto error because it gave the opportunity to handle an expected fault gracefully.


    tonybaloney mannequin commented Apr 8, 2019

If your patch uses __length_hint__ for this, it is a bug.

    I’m not sure I understand this comment,

    PEP-424 says “This is useful for presizing containers when building from an iterable.“

    This patch uses __length_hint__ to presize the list container for a list comprehension.


    methane commented Apr 8, 2019

    "useful" doesn't mean "use it as-is".
    It is just a hint. It will be wrong.

See here for the list example:

    /* Guess a result list size. */
    n = PyObject_LengthHint(iterable, 8);
    if (n < 0) {
        Py_DECREF(it);
        return NULL;
    }
    m = Py_SIZE(self);
    if (m > PY_SSIZE_T_MAX - n) {
        /* m + n overflowed; on the chance that n lied, and there really
         * is enough room, ignore it.  If n was telling the truth, we'll
         * eventually run out of memory during the loop.
         */


    methane commented Apr 8, 2019

    I'm sorry. list_extend raises OverflowError too.

    @pablogsal
    Member

    That is a one-off cost for the __length_hint__ of the range object specifically.
    Objects with a known length (lists, sets, tuples) would not have that overhead.

That seems incorrect. This is not unique to range objects; it also affects objects with known lengths (like a list):

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x*2 for x in k]",
                   setup="k=list(range(10))")

    Current master:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 3.82 us +- 0.13 us

    PR 12718:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 4.38 us +- 0.16 us

See also my other benchmark with a list iterator (iter(list(range(10)))) or this one with a generator comprehension:

    import perf
    
    runner = perf.Runner()
    runner.timeit("list_comp",
                   stmt="[x*2 for x in it]",
                   setup="k=list(range(10));it=(x for x in k)")

    Current master:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 945 ns +- 27 ns

    PR 12718:
    ❯ ./python.exe ../check.py -n 10
    .....................
    list_comp: Mean +- std dev: 1.33 us +- 0.05 us


    ncoghlan commented Apr 8, 2019

    I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

    While length_hint is allowed to be somewhat inaccurate, we do expect it to be at least *vaguely* accurate (otherwise it isn't very useful, and if it can be inaccurate enough to trigger OverflowError or MemoryError in cases that would otherwise work reasonably well, it would be better for a type not to implement it at all).

    While it would be nice to be able to avoid adding a new opcode, the problem is that the existing candidate opcodes (BUILD_LIST, BUILD_LIST_UNPACK) are both inflexible in what they do:

    • BUILD_LIST requires that the final list length be known at compile time
    • BUILD_LIST_UNPACK infers the final length from an object reference, but calls _PyList_Extend directly, so it combines the preallocation and the iterator consumption into a single step, and hence can't be used outside of iterable unpacking into a list

    At the same time, attempting to generalise either of them isn't desirable, since it would slow them down for their existing use cases, and be slower than a new opcode for this use case.

    The proposed BUILD_LIST_PREALLOC opcode splits the difference: it lets the compiler provide the interpreter with a *hint* as to how big the resulting list is expected to be.

    That said, you'd want to run the result through the benchmark suite rather than relying solely on microbenchmarks, as even though unfiltered "[some_operation_on_x for x in y]" comprehensions without nested loops or filter clauses are pretty common (more common than the relatively new "[*itr]" syntax), it's far less clear what the typical distribution in input lengths actually is, and how many memory allocations need to be avoided in order to offset the cost of the initial _PyObject_LengthHint call (as Pablo's small scale results show).

    (Note that in the _PyList_Extend code, there are preceding special cases for builtin lists and tuples that take those down a much faster path that avoids the _PyObject_LengthHint call entirely)

    @pablogsal
    Member

    Here are the updated results for the benchmark suite. The previous results (unlinked from the issue to reduce noise)
    were against an old version of the master branch.

    2019-04-08_13-08-master-58721a903074.json.gz
    ============================================

    Performance version: 0.7.0
    Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
    Number of logical CPUs: 12
    Start date: 2019-04-09 00:32:40.656576
    End date: 2019-04-09 00:54:03.721798

    12718-5f06333a4e49.json.gz
    ==========================

    Performance version: 0.7.0
    Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
    Number of logical CPUs: 12
    Start date: 2019-04-09 00:07:53.433511
    End date: 2019-04-09 00:29:32.603022

                                        -- CURRENT MASTER --                    -- [PR12718](https://github.com/python/cpython/pull/12718) --
    

    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | Benchmark | 2019-04-08_13-08-master-58721a903074.json.gz | 12718-5f06333a4e49.json.gz | Change | Significance |
    +=========================+==============================================+============================+==============+========================+
    | 2to3 | 455 ms | 465 ms | 1.02x slower | Significant (t=-2.27) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | chaos | 171 ms | 180 ms | 1.05x slower | Significant (t=-5.92) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | crypto_pyaes | 169 ms | 177 ms | 1.05x slower | Significant (t=-6.06) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | deltablue | 11.5 ms | 11.9 ms | 1.03x slower | Significant (t=-3.28) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | django_template | 175 ms | 191 ms | 1.09x slower | Significant (t=-13.00) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | dulwich_log | 127 ms | 131 ms | 1.03x slower | Significant (t=-8.97) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | fannkuch | 646 ms | 665 ms | 1.03x slower | Significant (t=-12.37) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | float | 165 ms | 165 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | go | 386 ms | 381 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | hexiom | 15.4 ms | 15.5 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | html5lib | 130 ms | 132 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | json_dumps | 18.4 ms | 18.3 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | json_loads | 38.6 us | 38.6 us | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_format | 15.0 us | 15.6 us | 1.04x slower | Significant (t=-10.98) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_silent | 284 ns | 287 ns | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | logging_simple | 13.5 us | 13.9 us | 1.03x slower | Significant (t=-8.03) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | mako | 24.6 ms | 24.5 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | meteor_contest | 157 ms | 155 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | nbody | 196 ms | 198 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | nqueens | 144 ms | 147 ms | 1.02x slower | Significant (t=-9.68) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pathlib | 27.6 ms | 28.4 ms | 1.03x slower | Significant (t=-5.24) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle | 14.1 us | 14.2 us | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_dict | 33.7 us | 35.2 us | 1.04x slower | Significant (t=-13.28) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_list | 5.22 us | 5.38 us | 1.03x slower | Significant (t=-6.65) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pickle_pure_python | 681 us | 719 us | 1.05x slower | Significant (t=-5.64) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | pidigits | 217 ms | 219 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | python_startup | 12.2 ms | 12.5 ms | 1.02x slower | Significant (t=-9.34) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | python_startup_no_site | 8.90 ms | 8.92 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | raytrace | 789 ms | 793 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_compile | 267 ms | 270 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_dna | 221 ms | 221 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_effbot | 3.94 ms | 3.98 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | regex_v8 | 30.4 ms | 30.4 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | richards | 108 ms | 108 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_fft | 525 ms | 525 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_lu | 270 ms | 271 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_monte_carlo | 160 ms | 160 ms | 1.00x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_sor | 295 ms | 305 ms | 1.03x slower | Significant (t=-9.82) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | scimark_sparse_mat_mult | 6.63 ms | 6.57 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | spectral_norm | 217 ms | 219 ms | 1.01x slower | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlalchemy_declarative | 239 ms | 244 ms | 1.02x slower | Significant (t=-4.34) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlalchemy_imperative | 50.7 ms | 52.0 ms | 1.03x slower | Significant (t=-3.84) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sqlite_synth | 4.20 us | 4.14 us | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_expand | 612 ms | 637 ms | 1.04x slower | Significant (t=-8.36) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_integrate | 27.5 ms | 28.9 ms | 1.05x slower | Significant (t=-4.16) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_str | 284 ms | 299 ms | 1.05x slower | Significant (t=-6.97) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | sympy_sum | 151 ms | 156 ms | 1.03x slower | Significant (t=-4.37) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | telco | 9.70 ms | 9.64 ms | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | tornado_http | 276 ms | 287 ms | 1.04x slower | Significant (t=-5.75) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpack_sequence | 64.4 ns | 67.3 ns | 1.05x slower | Significant (t=-6.16) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle | 22.8 us | 22.7 us | 1.01x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle_list | 5.72 us | 5.97 us | 1.04x slower | Significant (t=-6.05) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | unpickle_pure_python | 510 us | 527 us | 1.03x slower | Significant (t=-3.89) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_generate | 150 ms | 154 ms | 1.02x slower | Significant (t=-3.72) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_iterparse | 145 ms | 148 ms | 1.02x slower | Significant (t=-3.60) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_parse | 208 ms | 207 ms | 1.00x faster | Not significant |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
    | xml_etree_process | 121 ms | 125 ms | 1.04x slower | Significant (t=-5.43) |
    +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+


    tonybaloney mannequin commented Apr 9, 2019

I've just optimized some of the code and pushed another change as 69dce1c552.

I ran both master and 69dce1c552 using pyperformance with PGO:

    ➜ ~ python3.8 -m perf compare_to master.json 69dce1c552.json --table
    +-------------------------+---------+-----------------------------+
    | Benchmark | master | 69dce1c552 |
    +=========================+=========+=============================+
    | 2to3 | 432 ms | 426 ms: 1.02x faster (-2%) |
    +-------------------------+---------+-----------------------------+
    | chaos | 157 ms | 155 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | crypto_pyaes | 154 ms | 153 ms: 1.00x faster (-0%) |
    +-------------------------+---------+-----------------------------+
    | dulwich_log | 123 ms | 124 ms: 1.00x slower (+0%) |
    +-------------------------+---------+-----------------------------+
    | fannkuch | 603 ms | 600 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | float | 153 ms | 154 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | go | 323 ms | 326 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | hexiom | 13.6 ms | 13.5 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | json_dumps | 18.1 ms | 17.9 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | logging_format | 13.2 us | 13.8 us: 1.05x slower (+5%) |
    +-------------------------+---------+-----------------------------+
    | logging_silent | 266 ns | 280 ns: 1.05x slower (+5%) |
    +-------------------------+---------+-----------------------------+
    | logging_simple | 12.4 us | 13.1 us: 1.06x slower (+6%) |
    +-------------------------+---------+-----------------------------+
    | meteor_contest | 145 ms | 132 ms: 1.10x faster (-9%) |
    +-------------------------+---------+-----------------------------+
    | nbody | 179 ms | 172 ms: 1.04x faster (-4%) |
    +-------------------------+---------+-----------------------------+
    | nqueens | 138 ms | 134 ms: 1.03x faster (-3%) |
    +-------------------------+---------+-----------------------------+
    | pathlib | 56.4 ms | 55.6 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | pickle | 15.0 us | 15.4 us: 1.03x slower (+3%) |
    +-------------------------+---------+-----------------------------+
    | pickle_pure_python | 620 us | 617 us: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | raytrace | 696 ms | 691 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | regex_compile | 242 ms | 243 ms: 1.00x slower (+0%) |
    +-------------------------+---------+-----------------------------+
    | scimark_monte_carlo | 140 ms | 143 ms: 1.02x slower (+2%) |
    +-------------------------+---------+-----------------------------+
    | scimark_sparse_mat_mult | 5.90 ms | 5.94 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | spectral_norm | 194 ms | 196 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | sympy_str | 246 ms | 245 ms: 1.00x faster (-0%) |
    +-------------------------+---------+-----------------------------+
    | telco | 8.42 ms | 8.31 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | unpack_sequence | 59.2 ns | 59.7 ns: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle | 21.2 us | 21.4 us: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle_list | 5.73 us | 5.81 us: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | unpickle_pure_python | 471 us | 467 us: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_iterparse | 142 ms | 143 ms: 1.01x slower (+1%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_generate | 139 ms | 137 ms: 1.02x faster (-2%) |
    +-------------------------+---------+-----------------------------+
    | xml_etree_process | 109 ms | 108 ms: 1.01x faster (-1%) |
    +-------------------------+---------+-----------------------------+

    Not significant (21): deltablue; django_template; html5lib; json_loads; mako; pickle_dict; pickle_list; pidigits; python_startup; python_startup_no_site; regex_dna; regex_effbot; regex_v8; richards; scimark_fft; scimark_lu; scimark_sor; sympy_expand; sympy_integrate; sympy_sum; xml_etree_parse

    I'd like to look at how `__length_hint__` works for range iterators; that path looks suboptimal and could use some optimization. Also, BUILD_LIST_PREALLOC receives the iterator, not the underlying object, so it can't use the much faster _PyObject_HasLen and PyObject_Length().

    I'm going to look at how __length_hint__ could be optimized for iterators, which would make the smaller-range cases more efficient.

    meteor_contest uses a lot of list comprehensions, so it should show the impact of the patch.
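    The distinction between the underlying object and its iterator can be seen from Python with the stdlib `operator` module (a quick sketch; `BUILD_LIST_PREALLOC` is the opcode proposed in this PR and only exists on the branch):

    ```python
    import operator

    r = range(1000)
    print(len(r))                        # 1000: range itself has __len__
    it = iter(r)
    print(hasattr(type(it), "__len__"))  # False: the range iterator does not
    print(operator.length_hint(it))      # 1000: falls back to __length_hint__
    ```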

    @serhiy-storchaka
    Member

    I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

    And that optimization looks questionable to me. I tried to reduce the overhead for small lists, but this requires much more complex code and gives mixed results.

    I am -1 on this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions), and even in this case the gain is small. It would be possible to add many other optimizations for other cases which would speed them up by 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code that has to be maintained and covered by tests, it adds a small overhead in common cases to speed up an uncommon case, and growing the code base can negatively affect surrounding code (simply because the CPU cache and registers are used inappropriately and the compiler optimizes less important paths).

    In addition, while this change speeds up list comprehensions for long lists, it slows them down for short lists. Short lists are more common.
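    For reference, the preallocation logic used by the list constructor and list.extend() is driven by the length-hint protocol (PEP 424), which is exposed at the Python level as operator.length_hint: len() is tried first, then __length_hint__, then a default. A minimal sketch (the `Hinted` class here is made up for illustration):

    ```python
    import operator

    class Hinted:
        """Iterable with no __len__ but an (approximate) size hint."""
        def __iter__(self):
            return iter(())
        def __length_hint__(self):
            return 8

    print(operator.length_hint([1, 2], 99))   # 2: len() wins when available
    print(operator.length_hint(Hinted()))     # 8: falls back to __length_hint__
    print(operator.length_hint(object(), 4))  # 4: the default when neither exists
    ```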

    @tonybaloney
    Mannequin Author

    tonybaloney mannequin commented Apr 9, 2019

    I am -1 on this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with
    conditions), and even in this case the gain is small. It would be possible to add many other optimizations for other cases which would speed them up by 50% or 100%, but we do not do
    this, because every such optimization has a cost. It increases the amount of code that has to be maintained and covered by tests, it adds a small overhead in common cases to
    speed up an uncommon case, and growing the code base can negatively affect surrounding code (simply because the CPU cache and registers are used inappropriately and the
    compiler optimizes less important paths).

    Understood, I had hoped this change would have a broader impact. The additional opcode is not ideal either.

    In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common.

    I've been profiling this today; basically, this implementation receives the list_iter, range_iter, etc.

    There is no standard object model for an iterator's length: _PyObject_HasLen returns false because an iterator implements neither tp_as_sequence nor tp_as_mapping (rightly so).

    What this has uncovered (so hopefully there's some value from this whole experience!) is that __length_hint__ for iterators is _really_ inefficient. Take a list_iterator for example:

    1. PyObject_LengthHint calls _PyObject_HasLen, which returns false, so it falls back to
    2. _PyObject_LookupSpecial, then
    3. _PyObject_CallNoArg, which calls
    4. listiter_len, which calls
    5. PyList_GET_SIZE, which returns a Py_ssize_t; that is converted to a PyLong via
    6. PyLong_FromSsize_t and returned to PyObject_LengthHint, where
    7. PyLong_AsSsize_t converts the PyLong back into a Py_ssize_t.

    The Py_ssize_t is then finally returned to the caller!
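    The chain can be rendered roughly in pure Python (a sketch for illustration only, not the actual code in Objects/abstract.c; the comments map each step onto the C calls involved):

    ```python
    import operator

    # Rough pure-Python rendering of PyObject_LengthHint's fallback chain.
    def length_hint(obj, default=0):
        try:
            return len(obj)              # _PyObject_HasLen / PyObject_Length fast path
        except TypeError:
            pass                         # iterators land here: no tp_as_sequence
        hint = getattr(type(obj), "__length_hint__", None)  # _PyObject_LookupSpecial
        if hint is None:
            return default
        result = hint(obj)               # e.g. listiter_len boxes the size as a PyLong
        if result is NotImplemented:
            return default
        return operator.index(result)    # PyLong_AsSsize_t unboxes it again
    ```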

    My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple and set, but false for range objects (which is what you want).

    The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator.
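    This can be seen in the disassembly (a sketch; on 3.8, which this issue targets, the comprehension compiles to a separate code object, and GET_ITER runs in the enclosing frame before that code object is called — note that CPython 3.12+ inlines comprehensions and changes this picture):

    ```python
    import dis

    # GET_ITER appears in the enclosing code, so the comprehension body
    # only ever sees the iterator, never the original iterable.
    code = compile("[x * 2 for x in data]", "<demo>", "eval")
    dis.dis(code)
    ```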

    This is way outside my expertise, but it is the only way I can see to get a sizeable benefit with minimal code and no edge cases that are slower.

    Thanks for your time

    @tonybaloney tonybaloney mannequin self-assigned this May 9, 2019
    @tonybaloney tonybaloney mannequin closed this as completed May 9, 2019
    @serhiy-storchaka
    Member

    We can return to this issue if we make the invocation of __length_hint__ much, much faster, for example by adding a tp_length_hint slot. But currently it is too large a change and it has negative effects.
