Optimize list comprehensions with preallocated size and protect against overflow #80732
List comprehensions currently create a series of opcodes inside a code object, the first of which is BUILD_LIST with an oparg of 0, effectively creating a zero-length list with a preallocated size of 0. Take a simple list comprehension over an iterator:

```python
def foo():
    a = iterable
    return [x for x in a]
```

Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:

The list comprehension will do a list_resize on the 4th, 8th, 16th, 25th, 35th, 46th, 58th, 72nd, 88th iterations, and so on.

This PR preallocates the list created in a list comprehension to the length of the iterator, using PyObject_LengthHint(). It adds a new BUILD_LIST_PREALLOC opcode, which builds a list with an allocated size of PyObject_LengthHint(co_varnames[oparg]). [x for x in iterable] compiles to:

Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:

If the comprehension has if clauses, the existing BUILD_LIST opcode is used.

Testing with a range of length 10000:

```shell
./python.exe -m timeit "x=list(range(10000)); [y for y in x]"
```

gives 392 usec on the current 3.8 branch; the longer the iterable, the bigger the impact.

This change also catches the issue that a very large iterator, like a range object, would cause the 3.8 interpreter to consume all memory and crash, because there is currently no check against PY_SSIZE_T_MAX. With this change (assuming there is no if inside the comprehension), this is now caught and raised as an OverflowError:

```python
>>> [a for a in range(2**256)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
OverflowError: Python int too large to convert to C ssize_t
```
|
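The growth sequence 4, 8, 16, 25, 35, 46, 58, 72, 88 quoted above corresponds to the capacities chosen by list_resize. It can be reproduced with a short pure-Python simulation of the over-allocation formula from Objects/listobject.c as of this era of CPython (the helper name `resize_points` is mine):

```python
def resize_points(n_appends):
    """Simulate CPython's list_resize over-allocation for a list built
    by repeated appends, returning each new capacity it allocates."""
    allocated = 0
    capacities = []
    for newsize in range(1, n_appends + 1):
        if newsize > allocated:
            # new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6)
            allocated = newsize + (newsize >> 3) + (3 if newsize < 9 else 6)
            capacities.append(allocated)
    return capacities

print(resize_points(88))  # [4, 8, 16, 25, 35, 46, 58, 72, 88]
```

Every entry in that list is one realloc that a correctly presized comprehension would skip.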
The benefit is too small to add a new opcode. |
The opcode would not apply solely to this specific use case. I could look for another way of implementing the same behaviour without an additional opcode? |
This might cause a MemoryError when the __length_hint__ of the source returns too large a value, even when the actual size of the comprehension is smaller, e.g.:
See also bpo-28940 |
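Serhiy's concern can be illustrated with a hypothetical iterator whose hint wildly over-estimates its real length (legal under PEP 424, which allows hints to be inaccurate); the class below is an illustration, not from the patch:

```python
import operator

class OverEstimating:
    """Hypothetical iterator: yields 3 items but hints at a billion."""
    def __init__(self):
        self._it = iter(range(3))
    def __iter__(self):
        return self
    def __next__(self):
        return next(self._it)
    def __length_hint__(self):
        return 10**9  # inaccurate, but permitted by PEP 424

print(operator.length_hint(OverEstimating()))   # 1000000000
print([x for x in OverEstimating()])            # [0, 1, 2] today
# Preallocating 10**9 pointer slots (~8 GB on 64-bit) for a 3-element
# result is where the MemoryError risk comes from.
```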
I agree with Serhiy. Benefit seems too small to add new opcode.
How about converting |
```shell
$ python3 -m timeit -s 'r=range(1000)' -- '[x for x in r]'
5000 loops, best of 5: 40 usec per loop
$ python3 -m timeit -s 'r=range(1000)' -- '[*r]'
20000 loops, best of 5: 17.3 usec per loop
```
|
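The gap can be checked with a self-contained timeit snippet (absolute numbers will vary by machine): `[*r]` and `list(r)` go through list_extend, which presizes the result from the length hint, while the comprehension appends one element at a time.

```python
import timeit

r = range(1000)
# all three spellings build identical lists...
assert [x for x in r] == [*r] == list(r)

# ...but only the comprehension grows the list append by append
for stmt in ('[x for x in r]', '[*r]', 'list(r)'):
    t = timeit.timeit(stmt, globals={'r': r}, number=5000)
    print(f'{stmt:20} {t:.3f}s')
```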
This patch makes it slow for small iterators. Perf program:

```python
import perf

runner = perf.Runner()
runner.timeit("list_comp",
              stmt="[x for x in range(10)]",
              setup="")
```

Current master:
PR 12718:

The overhead is very likely due to calling __length_hint__. |
I should have been more explicit: this patch improves the performance of all list comprehensions that don't have an if clause. Not just the simple case, but:

```python
d = {}  # some sort of dictionary
[f"{k} — {v}" for k, v in d.items()]

a = iterable
[val**2 for val in a]
```

would all use BUILD_LIST_PREALLOC and a length hint. I can do another speed test for those other scenarios. Most of the stdlib packages have these sorts of list comps, including those in the default site.py. |
But in these cases, the overhead of reallocation will be smaller than in the simple case. |
The current implementation of list comprehensions raises neither a MemoryError nor an OverflowError. It will consume all available memory and crash the interpreter. This patch raises an OverflowError before execution, instead of looping until the memory heap is exhausted. |
More benchmarks for slow iterators:

```python
import perf

runner = perf.Runner()
runner.timeit("list_comp",
              stmt="[x**2 for x in k]",
              setup="k=iter(list(range(10)))")
```

Current master:
PR 12718:
|
Note PEP-424: `it.__length_hint__` can return 2**1000 even if `len(list(it)) == 0`. In such a case, the current behavior works, and your patch will raise OverflowError. |
That is a one-off cost for the __length_hint__ of the range object specifically. I can run a more useful set of benchmarks against this. The +0.6 usec would be the same for ranges of 8-16 elements, then less for 16-25, then less again for 25-35, as removing the reallocations becomes a more significant factor for larger ranges. |
Try [x for x in range(2**1000)] in a REPL. It doesn't raise anything; it tries to create a list that will eventually exceed PY_SSIZE_T_MAX, but it only crashes once it reaches that iteration. This patch raises an OverflowError up front instead, the same way: |
It is expected behavior.
If your patch uses __length_hint__, it is a bug. |
It raises an OverflowError because of the goto I added. |
I'm not sure I understand this comment. PEP-424 says "This is useful for presizing containers when building from an iterable." This patch uses __length_hint__ to presize the list container for a list comprehension. |
"useful" doesn't mean "use it as-is". See here for the list example: Lines 929 to 940 in 7a0630c |
I'm sorry. list_extend raises OverflowError too. |
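The guard Inada-san points at in listobject.c boils down to rejecting any growth whose total would exceed PY_SSIZE_T_MAX. A pure-Python sketch of that check (not the exact C code; the helper name is mine, and the message echoes the "cannot add more objects to list" error CPython uses):

```python
import sys

def checked_total(current_size, n_to_add):
    """Sketch of the overflow guard when growing a list: refuse any
    growth whose total size would exceed PY_SSIZE_T_MAX (sys.maxsize)."""
    if n_to_add > sys.maxsize - current_size:
        raise OverflowError("cannot add more objects to list")
    return current_size + n_to_add

print(checked_total(10, 5))  # 15
```

The point being: the OverflowError already exists in the C list machinery; the question is only at which moment it fires.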
That seems incorrect. This is not unique to range objects, as it also affects objects with known lengths (like a list):

```python
import perf

runner = perf.Runner()
runner.timeit("list_comp",
              stmt="[x*2 for x in k]",
              setup="k=list(range(10))")
```

Current master:
PR 12718:

Check also my other benchmark with a list iterator (iter(list(range(10)))), or this one with a generator:

```python
import perf

runner = perf.Runner()
runner.timeit("list_comp",
              stmt="[x*2 for x in it]",
              setup="k=list(range(10));it=(x for x in k)")
```

Current master:
PR 12718:
|
I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

While length_hint is allowed to be somewhat inaccurate, we do expect it to be at least *vaguely* accurate (otherwise it isn't very useful, and if it can be inaccurate enough to trigger OverflowError or MemoryError in cases that would otherwise work reasonably well, it would be better for a type not to implement it at all).

While it would be nice to be able to avoid adding a new opcode, the problem is that the existing candidate opcodes (BUILD_LIST, BUILD_LIST_UNPACK) are both inflexible in what they do:

At the same time, attempting to generalise either of them isn't desirable, since it would slow them down for their existing use cases, and be slower than a new opcode for this use case. The proposed BUILD_LIST_PREALLOC opcode splits the difference: it lets the compiler provide the interpreter with a *hint* as to how big the resulting list is expected to be.

That said, you'd want to run the result through the benchmark suite rather than relying solely on microbenchmarks: even though unfiltered "[some_operation_on_x for x in y]" comprehensions without nested loops or filter clauses are pretty common (more common than the relatively new "[*itr]" syntax), it's far less clear what the typical distribution of input lengths actually is, and how many memory allocations need to be avoided in order to offset the cost of the initial _PyObject_LengthHint call (as Pablo's small-scale results show).

(Note that in the _PyList_Extend code, there are preceding special cases for builtin lists and tuples that take those down a much faster path that avoids the _PyObject_LengthHint call entirely.) |
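The fast-path distinction Nick describes is visible from Python via operator.length_hint, the Python-level wrapper around the same machinery:

```python
import operator

# objects with a real __len__ are sized directly (the _PyObject_HasLen path)
print(operator.length_hint([1, 2, 3]))        # 3
print(operator.length_hint(range(10)))        # 10

# builtin iterators implement __length_hint__, which costs a method call
print(operator.length_hint(iter([1, 2, 3])))  # 3
print(operator.length_hint(iter(range(10))))  # 10

# generators provide neither, so the supplied default is returned
print(operator.length_hint((x for x in range(5)), 0))  # 0
```

Whether that method-call cost is amortized depends entirely on the input length distribution, which is the point of running the full benchmark suite.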
Here are the updated results for the benchmark suite. The previous results (unlinked from the issue to reduce noise): 2019-04-08_13-08-master-58721a903074.json.gz (performance version 0.7.0) and 12718-5f06333a4e49.json.gz (performance version 0.7.0).
|
Have just optimized some of the code and pushed another change as 69dce1c552. Ran both master and 69dce1c552 using pyperformance with PGO:

```shell
➜ ~ python3.8 -m perf compare_to master.json 69dce1c552.json --table
```

Not significant (21): deltablue; django_template; html5lib; json_loads; mako; pickle_dict; pickle_list; pidigits; python_startup; python_startup_no_site; regex_dna; regex_effbot; regex_v8; richards; scimark_fft; scimark_lu; scimark_sor; sympy_expand; sympy_integrate; sympy_sum; xml_etree_parse

I'd like to look at the way the range object's length hint works; the path for it looks suboptimal. Also, BUILD_LIST_PREALLOC uses the iterator, not the actual object, so you can't use the much faster _PyObject_HasLen and PyObject_Length(). I'm going to look at how __length_hint__ could be optimized for iterators, which would make the smaller range cases more efficient. meteor_contest uses a lot of list comprehensions, so it should show the impact of the patch. |
And that optimization looks questionable to me. I tried to reduce the overhead for small lists, but this requires much more complex code and gives mixed results. I am -1 on this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions), and even in this case the benefit is small. It is possible to add a lot of other optimizations for other cases which would speed them up by 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code which should be maintained and covered by tests, it adds a small overhead in common cases to speed up an uncommon case, and increasing the code base can negatively affect surrounding code (just because the CPU cache and registers are used inappropriately and the compiler optimizes less important paths). In addition, while this change speeds up list comprehensions for long lists, it slows them down for short lists. Short lists are more common. |
Understood, I had hoped this change would have a broader impact. The additional opcode is not ideal either.
I've been profiling this today. Basically, this implementation receives the iterator, not the underlying object. There is no standard object model for an iterator's length; _PyObject_HasLen returns false because an iterator implements neither tp_as_sequence nor tp_as_mapping (rightly so). What this has uncovered (so hopefully there's some value from this whole experience!) is that __length_hint__ for iterators is _really_ inefficient. Take a list_iterator for example: PyObject_LengthHint will call _PyObject_HasLen, which returns false, then go on to look up and call the iterator's __length_hint__ method, which boxes the length as a Python int only to convert it back. The Py_ssize_t is then finally returned to the caller!

My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple and set, but false for range objects (which is what you want). The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator. This is way outside of my expertise, but it's the only way I can see to get a sizeable benefit with minimal code and no edge cases which are slower. Thanks for your time. |
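The "presize from the target, before GET_ITER" idea can be emulated in pure Python (a sketch only; `presized_map` is a made-up helper, not part of the patch, and real bytecode would do this inside the interpreter loop):

```python
def presized_map(func, target):
    """Emulate presizing from the target's len() before iterating:
    one allocation up front instead of repeated list_resize calls."""
    try:
        n = len(target)          # analogue of the _PyObject_HasLen fast path
    except TypeError:
        # no cheap length (e.g. a generator): fall back to appending
        return [func(x) for x in target]
    out = [None] * n             # single preallocation
    for i, x in enumerate(target):
        out[i] = func(x)
    return out

print(presized_map(lambda x: x * 2, [1, 2, 3]))  # [2, 4, 6]
print(presized_map(str, (c for c in 'ab')))      # ['a', 'b']
```

The try/except mirrors the design point above: sized containers get the single allocation, while length-less iterables keep today's append-based behavior.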
We can return to this issue if we make the invocation of __length_hint__ much, much faster, for example by adding a tp_length_hint slot. But currently it is too large a change, and it has negative effects. |