Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-46528: Attempt SWAPs at compile-time #30970

Merged
merged 11 commits into from Feb 9, 2022
Merged

Conversation

brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented Jan 27, 2022

@brandtbucher
Copy link
Member Author

1% faster:

Slower (10):
- nbody: 93.2 ms +- 1.6 ms -> 98.3 ms +- 1.5 ms: 1.05x slower
- regex_dna: 206 ms +- 1 ms -> 217 ms +- 1 ms: 1.05x slower
- pickle_dict: 27.0 us +- 0.4 us -> 28.2 us +- 0.1 us: 1.05x slower
- nqueens: 84.2 ms +- 1.1 ms -> 87.9 ms +- 1.1 ms: 1.04x slower
- regex_effbot: 3.18 ms +- 0.09 ms -> 3.26 ms +- 0.02 ms: 1.02x slower
- chameleon: 7.43 ms +- 0.12 ms -> 7.52 ms +- 0.06 ms: 1.01x slower
- chaos: 71.4 ms +- 0.6 ms -> 72.2 ms +- 0.7 ms: 1.01x slower
- django_template: 34.6 ms +- 0.4 ms -> 34.9 ms +- 0.5 ms: 1.01x slower
- go: 143 ms +- 1 ms -> 144 ms +- 1 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.01 ms -> 5.87 ms +- 0.00 ms: 1.00x slower

Faster (32):
- logging_silent: 115 ns +- 3 ns -> 107 ns +- 1 ns: 1.07x faster
- unpack_sequence: 52.0 ns +- 0.4 ns -> 49.0 ns +- 0.8 ns: 1.06x faster
- telco: 6.64 ms +- 0.09 ms -> 6.42 ms +- 0.10 ms: 1.03x faster
- unpickle_list: 5.18 us +- 0.17 us -> 5.02 us +- 0.09 us: 1.03x faster
- spectral_norm: 103 ms +- 0 ms -> 99.6 ms +- 0.8 ms: 1.03x faster
- float: 78.2 ms +- 0.9 ms -> 75.8 ms +- 0.9 ms: 1.03x faster
- pidigits: 194 ms +- 0 ms -> 188 ms +- 0 ms: 1.03x faster
- scimark_lu: 116 ms +- 3 ms -> 112 ms +- 2 ms: 1.03x faster
- scimark_monte_carlo: 72.4 ms +- 1.0 ms -> 70.6 ms +- 0.8 ms: 1.03x faster
- scimark_sor: 128 ms +- 2 ms -> 125 ms +- 1 ms: 1.02x faster
- logging_format: 6.48 us +- 0.12 us -> 6.35 us +- 0.07 us: 1.02x faster
- hexiom: 6.69 ms +- 0.04 ms -> 6.56 ms +- 0.05 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.1 ms -> 12.4 ms +- 0.1 ms: 1.02x faster
- scimark_fft: 336 ms +- 3 ms -> 331 ms +- 2 ms: 1.02x faster
- logging_simple: 5.96 us +- 0.08 us -> 5.88 us +- 0.08 us: 1.01x faster
- fannkuch: 396 ms +- 5 ms -> 391 ms +- 4 ms: 1.01x faster
- sympy_str: 302 ms +- 5 ms -> 298 ms +- 3 ms: 1.01x faster
- regex_compile: 137 ms +- 0 ms -> 136 ms +- 1 ms: 1.01x faster
- mako: 11.6 ms +- 0.1 ms -> 11.5 ms +- 0.1 ms: 1.01x faster
- pyflate: 451 ms +- 3 ms -> 447 ms +- 3 ms: 1.01x faster
- richards: 51.8 ms +- 0.7 ms -> 51.3 ms +- 0.8 ms: 1.01x faster
- pickle_pure_python: 332 us +- 3 us -> 329 us +- 3 us: 1.01x faster
- unpickle_pure_python: 250 us +- 2 us -> 248 us +- 2 us: 1.01x faster
- sympy_sum: 168 ms +- 2 ms -> 167 ms +- 2 ms: 1.01x faster
- xml_etree_iterparse: 106 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- meteor_contest: 105 ms +- 2 ms -> 104 ms +- 1 ms: 1.01x faster
- dulwich_log: 64.8 ms +- 0.4 ms -> 64.4 ms +- 0.4 ms: 1.01x faster
- json_loads: 25.3 us +- 0.2 us -> 25.2 us +- 0.2 us: 1.01x faster
- sympy_integrate: 21.3 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.00x faster
- 2to3: 263 ms +- 1 ms -> 262 ms +- 1 ms: 1.00x faster
- raytrace: 312 ms +- 2 ms -> 312 ms +- 2 ms: 1.00x faster
- python_startup: 8.24 ms +- 0.01 ms -> 8.23 ms +- 0.01 ms: 1.00x faster

Benchmark hidden because not significant (16): crypto_pyaes, deltablue, pathlib, pickle, pickle_list, regex_v8, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, tornado_http, unpickle, xml_etree_parse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

@sweeneyde
Copy link
Member

sweeneyde commented Jan 28, 2022

Could we maybe factor out a helper function like

static int
next_safe_store(basicblock *block, int i, int expected_lineno)
{
    for (; i <= block->b_iused; i++) {
        struct instr *instruction = &block->b_instr[i];
        if (expected_lineno != -1 && instruction->i_lineno != expected_lineno) {
            return -1;
        }
        switch (block->b_instr[i].i_opcode) {
            case NOP:
                continue;
            case STORE_FAST:
            case POP_TOP:
                return i;
            default:
                return -1;
        }
    }
    return -1;
}

And then call it once to get the first stack-consumer, and then oparg-1 more times to get the opargth consumer.

@sweeneyde
Copy link
Member

I'm assuming we don't need to worry that this is quadratic if there are a bunch of non-simplifiable swaps followed by a bunch of stores?

Also, are cases like the following possible?

# won't get static-folded with this PR (7 is too big):
SWAP 2; SWAP 3; SWAP 2; SWAP 7; STORE_FAST STORE_FAST STORE_FAST STORE_ATTR

# same permutation, but can be static-folded with this PR:
SWAP 7; SWAP 2; SWAP 3; SWAP 2; STORE_FAST STORE_FAST STORE_FAST STORE_ATTR

@brandtbucher
Copy link
Member Author

brandtbucher commented Jan 28, 2022

Could we maybe factor out a helper function like

Yeah, I like that idea.

I'm assuming we don't need to worry that this is quadratic if there are a bunch of non-simplifiable swaps followed by a bunch of stores?

I might be missing something... isn't this only quadratic if we successfully fold lots of swaps? We move right-to-left, and bail on the first one that doesn't work.

I'm okay with quadratic behavior here, since N is generally going to be very small (1 or 2 swaps of depth 2 or 3). Pattern matching code can be worse, but even then it's still rare to see anything worse than a handful of shallow-ish swaps.

Also, are cases like the following possible?

I had given that some thought, but I wasn't able to create any code that compiled that way by hand. The long runs of SWAPs typically come from pattern matching code right now, and those tend to look like SWAP(n), SWAP(n - 1), ..., SWAP(2) before hitting this point. So I'm not sure we have to worry about it happening in practice.

Maybe there is a more general solution than this that can handle those cases, though. I just wouldn't want to add much more complexity to support it if we're not sure of the payoff.

@sweeneyde
Copy link
Member

isn't this only quadratic if we successfully fold lots of swaps

I think that's right -- I guess that would only not pay off if we have a very strange million-line source file that only gets run once or something. Probably not worth worrying about too much.

@brandtbucher
Copy link
Member Author

brandtbucher commented Jan 28, 2022

I guess that would only not pay off if we have a very strange million-line source file

Nah, it would need to be one really long line. 🙂

Something like this:

def f(seq):
    match seq:
        case a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, _:
            pass

@sweeneyde
Copy link
Member

sweeneyde commented Jan 28, 2022

Can we do similar things for

>>> def f(a, b):
...     a.a, b.b = b, a
...
>>> dis(f)
  1           0 RESUME                   0

  2           2 LOAD_FAST                1 (b)
              4 LOAD_FAST                0 (a)
              6 SWAP                     2
              8 LOAD_FAST                0 (a)
             10 STORE_ATTR               0 (a)
             12 LOAD_FAST                1 (b)
             14 STORE_ATTR               1 (b)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

or is that for a separate PR?

@brandtbucher
Copy link
Member Author

No, we need to be pretty conservative with this.

The order of attribute stores and fast local loads are user-visible. They can both raise exceptions, and attribute access can be customized with arbitrary Python code.

@brandtbucher
Copy link
Member Author

(Maybe you could do something clever like only loading a and b once each and duplicating them on the stack, but that sounds slower, more complicated, and uncommon.)

@sweeneyde
Copy link
Member

sweeneyde commented Jan 28, 2022

I think STORE_DEREF might be user-visible, even without frame-related hacks:

def f(x, y):
    class A:
        def __init__(self, value):
            self.value = value
        def __del__(self):
            print(a, b)
        def __repr__(self):
            return f"<value={self.value}>"

    a = A(333)
    b = A(444)
    a, b = x, y

f(1, 2)

Before:

1 <value=444>
1 2

After:

<value=333> 2
1 2

I know the exact timing of when __del__ methods are called is unspecified, but I presume they should theoretically still have access to the world in a consistent state: 1 2\n1 2 and 1 <value=444>\n1 2 are both consistent, while <value=33> 2\n1 2 is inconsistent with the assignment order. I don't know how realistic that is or how much we want to cater to that edge-case.

@brandtbucher
Copy link
Member Author

I think that @markshannon believes our ordering guarantees don't apply to finalizers.

I'm not entirely convinced of that argument myself, though. I'll just leave STORE_DEREF out, since it's an edge-case with negligible benefit.

@brandtbucher
Copy link
Member Author

@sweeneyde if this looks good to you, I'll go ahead and merge it today.

(Congrats on your promotion, by the way!)

@sweeneyde
Copy link
Member

@sweeneyde if this looks good to you, I'll go ahead and merge it today.

(Congrats on your promotion, by the way!)

I just looked over it again and it looks good to me (modulo merge conflicts).

Thanks - I appreciate it!

@brandtbucher brandtbucher merged commit 78ae4cc into python:main Feb 9, 2022
@brandtbucher brandtbucher deleted the unswap branch July 21, 2022 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants