Bytecode compile times are O(nlocals**2) #97912

Closed

brandtbucher opened this issue Oct 5, 2022 · 7 comments
@brandtbucher (Member)

@JelleZijlstra discovered something interesting yesterday: bytecode compile times on main are currently quadratic in the number of local variables (3.11 is still linear). For example, a function with 100,000 assignments to unique local variables takes well over a minute to compile, but takes less than 2 seconds on 3.11.
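
A minimal reproducer along these lines shows the blowup (timings are machine-dependent):

import time

N = 100_000
src = "def f():\n" + "".join(f"    x{i} = {i}\n" for i in range(N))

start = time.perf_counter()
compile(src, "<generated>", "exec")
print(f"compiled a function with {N} locals in {time.perf_counter() - start:.1f}s")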

I believe the culprit is the new LOAD_FAST/LOAD_FAST_CHECK stuff. The current implementation appears to loop over every local variable, and for each one, loop over all bytecode instructions. This can probably be refactored to run in one pass for all locals.
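
In rough pseudocode, the quadratic shape looks like this (a simplified sketch that ignores control flow, not the actual compiler code):

def analyze_per_local(instructions, nlocals):
    # One full scan of the bytecode per local variable:
    # O(nlocals * ninstructions) work in total.
    needs_check = set()
    for local in range(nlocals):
        maybe_undef = True                    # undefined at function entry
        for op, arg in instructions:
            if arg != local:
                continue
            if op == "STORE_FAST":
                maybe_undef = False
            elif op == "DELETE_FAST":
                maybe_undef = True
            elif op == "LOAD_FAST" and maybe_undef:
                needs_check.add(local)        # must become LOAD_FAST_CHECK
    return needs_check

Since the number of instructions grows with the number of locals in the reproducer above, the total work is quadratic in nlocals.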

CC: @markshannon @sweeneyde

@brandtbucher added the type-bug, performance, sprint, interpreter-core, and 3.12 labels on Oct 5, 2022
sweeneyde (Member) commented Oct 5, 2022

If it proves too complex to track the states of multiple local variables at the same time, another solution might be to scan only the first hundred or so local variables.

It's not clear to me how easy tracking everything would be. Do we need potentially unbounded stack space, since blocks might need to be processed more than once? Would the approach still be linear in the face of many different del statements in unpredictable places? Maybe we don't care.

But if you manage to do it all linearly, that sounds great.

sweeneyde (Member) commented Oct 5, 2022

Here's an example of a "hard" case:

def f():
    a1 = a2 = a3 = a4 = a5 = 1
    for i in range(2):
        if cond1(): use(a1)
        else: del a1
        
        if cond2(): use(a2)
        else: del a2
        
        if cond3(): use(a3)
        else: del a3
        
        if cond4(): use(a4)
        else: del a4
        
        if cond5(): use(a5)
        else: del a5

# (generalize to more than 5 locals...)

For example, the fact that del a3 may have happened has to propagate through if cond4() and if cond5(), around the loop to for i, then through if cond1(), if cond2(), and if cond3(), before finally reaching use(a3). Combining that with the effects of all the other del statements, there are n deletions, each of which has to propagate across at least n basicblock linkages.

That's not a proof that linear time is impossible, just that some cleverness may be required (moving info across multiple basicblock linkages at once?) if we want linear time in all cases. However, some sort of heuristic may be possible that takes care of the typical case, with few or no del statements.

By the way, the reason to propagate "possible undefinedness" rather than "certain definedness" is that a single path on which the variable may be undefined leading into a LOAD_FAST is enough to require LOAD_FAST_CHECK.
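
For concreteness, here is a sketch of that kind of "maybe undefined" propagation as a worklist over basic blocks (hypothetical structures, not the actual compiler's analysis code):

from collections import deque

class Block:
    def __init__(self, stores=(), deletes=()):
        self.stores = set(stores)      # locals definitely assigned in this block
        self.deletes = set(deletes)    # locals deleted in this block
        self.succs = []                # successor basic blocks
        self.maybe_undef = set()       # locals possibly undefined on entry

def propagate(entry, nlocals):
    # At function entry, every local is possibly undefined.
    entry.maybe_undef = set(range(nlocals))
    work = deque([entry])
    while work:
        b = work.popleft()
        out = (b.maybe_undef - b.stores) | b.deletes
        for s in b.succs:
            if not out <= s.maybe_undef:   # requeue only when new facts arrive
                s.maybe_undef |= out
                work.append(s)

In the hard case above, each block can be requeued up to n times (once per newly-arriving fact), which is where the "n deletions times n linkages" cost comes from.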

@sweeneyde (Member)

Here's a script that gathers co_nlocals data from Sympy, which I figured might have some large code objects:

The script
from collections import Counter
from pathlib import Path
from types import CodeType

import sympy

def all_code_recursive(code):
    # based on https://github.com/faster-cpython/tools/blob/main/scripts/count_opcodes.py
    yield code
    for x in code.co_consts:
        if isinstance(x, CodeType):
            yield from all_code_recursive(x)

def all_code(root):
    for path in root.glob("**/*.py"):
        text = path.read_text("utf-8")
        code = compile(text, path.name, "exec")
        yield from all_code_recursive(code)

var_counts = Counter()
init = Path(sympy.__file__)
for code in all_code(init.parent):
    n = code.co_nlocals
    var_counts[n] += 1
    if n > 50:
        print(code.co_qualname, n)

print(dict(sorted(var_counts.items())))

total = var_counts.total()

for bound in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]:
    bigger = sum(count for n, count in var_counts.items() if n > bound)
    print(f"{bigger/total:.2%} of code objects have >{bound} locals")
The output
test_DiagramGrid 53
_FixSimplify 144
_SimplifyAntiderivative 57
_TrigSimplifyAux 61
_ExpandIntegrand 200
binomial_products 1044
exponential 330
hyperbolic 1069
integrand_simplification 143
inverse_hyperbolic 1546
inverse_trig 1442
linear_products 485
logarithms 466
miscellaneous_algebraic 1204
miscellaneous_integration 221
miscellaneous_trig 877
piecewise_linear 64
quadratic_products 1291
secant 1727
sine 2836
special_functions 477
tangent 1208
trinomial_products 1022
test_binary_operators 60
test_var_decl 64
test_TransferFunction_functions 58
test_aux_dep 58
test_non_central_inertia 65
test_sub_qdot 59
test_bicycle 109
test_linearize_rolling_disc_kane 59
modgcd_bivariate 52
roots_quintic 67
_vertical_bisection 68
_horizontal_bisection 68
dup_isolate_complex_roots_sqf 71
powsimp 53
_solve 58
_solve_system 55
unrad 63
BinaryQuadratic.solve 57
classify_ode 51
test__classify_linear_system 56
test_sysode_linear_neq_order1_type1 53
test_sysode_linear_neq_order1_type1_slow 103
test_sysode_linear_neq_order1_type2 75
test_DiscreteMarkovChain 52
test_M38 56
{0: 10914, 1: 8656, 2: 13341, 3: 5899, 4: 3445, 5: 2496, 6: 2478, 7: 2358, 8: 2162, 9: 1656, 10: 1221, 11: 904, 12: 515, 13: 329, 
14: 215, 15: 177, 16: 133, 17: 117, 18: 95, 19: 66, 20: 64, 21: 55, 22: 33, 23: 35, 24: 40, 25: 33, 26: 27, 27: 22, 28: 22, 29: 19, 30: 17, 31: 12, 32: 16, 33: 10, 34: 8, 35: 5, 36: 7, 37: 8, 38: 8, 39: 7, 40: 4, 41: 3, 42: 4, 43: 2, 44: 2, 45: 3, 46: 1, 47: 2, 49: 3, 50: 1, 51: 1, 52: 2, 53: 3, 55: 1, 56: 2, 57: 2, 58: 3, 59: 2, 60: 1, 61: 1, 63: 1, 64: 2, 65: 1, 67: 1, 68: 2, 71: 1, 75: 1, 103: 1, 109: 1, 143: 1, 144: 1, 200: 1, 221: 1, 330: 1, 466: 1, 477: 1, 485: 1, 877: 1, 1022: 1, 1044: 1, 1069: 1, 1204: 1, 1208: 1, 1291: 1, 1442: 1, 1546: 1, 1727: 1, 2836: 1}
66.08% of code objects have >1 locals
42.96% of code objects have >2 locals
22.44% of code objects have >5 locals
5.32% of code objects have >10 locals
0.79% of code objects have >20 locals
0.08% of code objects have >50 locals
0.04% of code objects have >100 locals
0.03% of code objects have >200 locals
0.02% of code objects have >500 locals
0.02% of code objects have >1000 locals
0.00% of code objects have >2000 locals
0.00% of code objects have >5000 locals

Summarized output:

For .py files in Sympy distribution:
------------------------------------
66.08% of code objects have >1 locals
42.96% of code objects have >2 locals
22.44% of code objects have >5 locals
5.32% of code objects have >10 locals
0.79% of code objects have >20 locals
0.08% of code objects have >50 locals
0.04% of code objects have >100 locals
0.03% of code objects have >200 locals
0.02% of code objects have >500 locals
0.02% of code objects have >1000 locals
0.00% of code objects have >2000 locals

Data from other places:

For .py files in Numpy distribution:
------------------------------------
66.77% of code objects have >1 locals
44.70% of code objects have >2 locals
15.82% of code objects have >5 locals
3.37% of code objects have >10 locals
0.38% of code objects have >20 locals
0.03% of code objects have >50 locals
0.00% of code objects have >100 locals

For .py files in Lib/test/ (ignoring SyntaxError):
-------------------------------------------------
55.63% of code objects have >1 locals  
33.39% of code objects have >2 locals  
8.09% of code objects have >5 locals   
1.10% of code objects have >10 locals  
0.06% of code objects have >20 locals  
0.00% of code objects have >50 locals

For .py files in Django distribution:
-------------------------------------
61.79% of code objects have >1 locals
38.18% of code objects have >2 locals
11.70% of code objects have >5 locals
2.68% of code objects have >10 locals
0.32% of code objects have >20 locals
0.00% of code objects have >50 locals

@JelleZijlstra (Member)

Thanks! The biggest one is https://github.com/sympy/sympy/blob/0e4bab73562d6f90fcdf5fa079731f26a455347c/sympy/integrals/rubi/rules/sine.py#L148, which looks terrifying. It is noticeably slower to compile on 3.12 than on 3.11:

% time ./python.exe -m compileall sine.py
Compiling 'sine.py'...
./python.exe -m compileall sine.py  2.66s user 0.09s system 99% cpu 2.763 total
% time ~/py/venvs/311-jun14/bin/python3 -m compileall sine.py
Compiling 'sine.py'...
~/py/venvs/311-jun14/bin/python3 -m compileall sine.py  0.98s user 0.09s system 98% cpu 1.082 total

@sweeneyde (Member)

One speedup might come from doing some precomputation for the "first pass": start by building a mapping from each local index i to the array of basicblocks in which DELETE_FAST(i) occurs (often empty). This wouldn't guarantee linear time overall, but it would greatly speed up almost every first pass. A parameter variable with no del statements could be skipped entirely, for free.
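
That precomputation might look something like this (a sketch with hypothetical names, assuming each block records its (op, arg) instruction pairs):

from collections import defaultdict

def delete_sites(blocks):
    # Map each local index to the basic blocks containing a DELETE_FAST of it.
    # Locals that never appear here (e.g. most parameters) can skip the
    # per-local fixpoint pass entirely.
    sites = defaultdict(list)
    for block in blocks:
        for op, arg in block.instructions:
            if op == "DELETE_FAST":
                sites[arg].append(block)
    return sites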

The b->b_visited = 0; reset loop is also quadratic, though with a smaller constant. We could instead widen b_visited to more than one bit (or add another field) and do something like:

#define MAYBE_PUSH(B) do {                          \
        /* push only if not yet visited this pass */\
        if ((B)->b_visited <= target) {             \
            *(*stack_top)++ = (B);                  \
            (B)->b_visited = target + 1;            \
        }                                           \
    } while (0)

Though again, I'm not sure whether it would be better to just limit the number of locals to analyze and set the rest to use LOAD_FAST_CHECK. Or maybe do both.

If someone else hasn't already started a patch, I can work on this.

@markshannon (Member)

It won't reduce the big-O complexity, but handling 64 locals at once (using an int64_t as a vector of 64 booleans) would reduce the constant factor by roughly 60×.
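
Sketched with Python ints standing in for the int64_t masks (hypothetical helper names), the per-instruction updates and the merge at a control-flow join each become a single integer operation covering all 64 locals:

def apply_store(maybe_undef, oparg):
    # STORE_FAST: local `oparg` is definitely bound afterwards.
    return maybe_undef & ~(1 << oparg)

def apply_delete(maybe_undef, oparg):
    # DELETE_FAST: local `oparg` may be unbound afterwards.
    return maybe_undef | (1 << oparg)

def merge(pred_masks):
    # A local is maybe-undefined at a join if it is maybe-undefined on
    # any incoming edge: one bitwise OR handles all 64 locals at once.
    out = 0
    for mask in pred_masks:
        out |= mask
    return out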

sweeneyde added a commit that referenced this issue Oct 20, 2022

* The compiler analyzes the usage of the first 64 local variables all at once using bit masks.

* Local variables beyond the first 64 are only partially analyzed, achieving linear time.
@sweeneyde (Member)

This should be fixed now. Thanks for finding this!

(feel free to re-open if I missed something)
