bpo-46528: Attempt `SWAP`s at compile-time #30970

brandtbucher · 2022-01-27T22:27:24Z

https://bugs.python.org/issue46528

brandtbucher · 2022-01-28T00:25:16Z

1% faster:

Slower (10):
- nbody: 93.2 ms +- 1.6 ms -> 98.3 ms +- 1.5 ms: 1.05x slower
- regex_dna: 206 ms +- 1 ms -> 217 ms +- 1 ms: 1.05x slower
- pickle_dict: 27.0 us +- 0.4 us -> 28.2 us +- 0.1 us: 1.05x slower
- nqueens: 84.2 ms +- 1.1 ms -> 87.9 ms +- 1.1 ms: 1.04x slower
- regex_effbot: 3.18 ms +- 0.09 ms -> 3.26 ms +- 0.02 ms: 1.02x slower
- chameleon: 7.43 ms +- 0.12 ms -> 7.52 ms +- 0.06 ms: 1.01x slower
- chaos: 71.4 ms +- 0.6 ms -> 72.2 ms +- 0.7 ms: 1.01x slower
- django_template: 34.6 ms +- 0.4 ms -> 34.9 ms +- 0.5 ms: 1.01x slower
- go: 143 ms +- 1 ms -> 144 ms +- 1 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.01 ms -> 5.87 ms +- 0.00 ms: 1.00x slower

Faster (32):
- logging_silent: 115 ns +- 3 ns -> 107 ns +- 1 ns: 1.07x faster
- unpack_sequence: 52.0 ns +- 0.4 ns -> 49.0 ns +- 0.8 ns: 1.06x faster
- telco: 6.64 ms +- 0.09 ms -> 6.42 ms +- 0.10 ms: 1.03x faster
- unpickle_list: 5.18 us +- 0.17 us -> 5.02 us +- 0.09 us: 1.03x faster
- spectral_norm: 103 ms +- 0 ms -> 99.6 ms +- 0.8 ms: 1.03x faster
- float: 78.2 ms +- 0.9 ms -> 75.8 ms +- 0.9 ms: 1.03x faster
- pidigits: 194 ms +- 0 ms -> 188 ms +- 0 ms: 1.03x faster
- scimark_lu: 116 ms +- 3 ms -> 112 ms +- 2 ms: 1.03x faster
- scimark_monte_carlo: 72.4 ms +- 1.0 ms -> 70.6 ms +- 0.8 ms: 1.03x faster
- scimark_sor: 128 ms +- 2 ms -> 125 ms +- 1 ms: 1.02x faster
- logging_format: 6.48 us +- 0.12 us -> 6.35 us +- 0.07 us: 1.02x faster
- hexiom: 6.69 ms +- 0.04 ms -> 6.56 ms +- 0.05 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.1 ms -> 12.4 ms +- 0.1 ms: 1.02x faster
- scimark_fft: 336 ms +- 3 ms -> 331 ms +- 2 ms: 1.02x faster
- logging_simple: 5.96 us +- 0.08 us -> 5.88 us +- 0.08 us: 1.01x faster
- fannkuch: 396 ms +- 5 ms -> 391 ms +- 4 ms: 1.01x faster
- sympy_str: 302 ms +- 5 ms -> 298 ms +- 3 ms: 1.01x faster
- regex_compile: 137 ms +- 0 ms -> 136 ms +- 1 ms: 1.01x faster
- mako: 11.6 ms +- 0.1 ms -> 11.5 ms +- 0.1 ms: 1.01x faster
- pyflate: 451 ms +- 3 ms -> 447 ms +- 3 ms: 1.01x faster
- richards: 51.8 ms +- 0.7 ms -> 51.3 ms +- 0.8 ms: 1.01x faster
- pickle_pure_python: 332 us +- 3 us -> 329 us +- 3 us: 1.01x faster
- unpickle_pure_python: 250 us +- 2 us -> 248 us +- 2 us: 1.01x faster
- sympy_sum: 168 ms +- 2 ms -> 167 ms +- 2 ms: 1.01x faster
- xml_etree_iterparse: 106 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- meteor_contest: 105 ms +- 2 ms -> 104 ms +- 1 ms: 1.01x faster
- dulwich_log: 64.8 ms +- 0.4 ms -> 64.4 ms +- 0.4 ms: 1.01x faster
- json_loads: 25.3 us +- 0.2 us -> 25.2 us +- 0.2 us: 1.01x faster
- sympy_integrate: 21.3 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.00x faster
- 2to3: 263 ms +- 1 ms -> 262 ms +- 1 ms: 1.00x faster
- raytrace: 312 ms +- 2 ms -> 312 ms +- 2 ms: 1.00x faster
- python_startup: 8.24 ms +- 0.01 ms -> 8.23 ms +- 0.01 ms: 1.00x faster

Benchmark hidden because not significant (16): crypto_pyaes, deltablue, pathlib, pickle, pickle_list, regex_v8, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, tornado_http, unpickle, xml_etree_parse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

sweeneyde · 2022-01-28T00:28:23Z

Could we maybe factor out a helper function like

static int
next_safe_store(basicblock *block, int i, int expected_lineno)
{
    for (; i <= block->b_iused; i++) {
        struct instr *instruction = &block->b_instr[i];
        if (expected_lineno != -1 && instruction->i_lineno != expected_lineno) {
            return -1;
        }
        switch (block->b_instr[i].i_opcode) {
            case NOP:
                continue;
            case STORE_FAST:
            case POP_TOP:
                return i;
            default:
                return -1;
        }
    }
    return -1;
}

And then call it once to get the first stack-consumer, and then oparg-1 more times to get the opargth consumer.

sweeneyde · 2022-01-28T00:46:37Z

I'm assuming we don't need to worry that this is quadratic if there are a bunch of non-simplifiable swaps followed by a bunch of stores?

Also, are cases like the following possible?

# won't get static-folded with this PR (7 is too big):
SWAP 2; SWAP 3; SWAP 2; SWAP 7; STORE_FAST STORE_FAST STORE_FAST STORE_ATTR

# same permutation, but can be static-folded with this PR:
SWAP 7; SWAP 2; SWAP 3; SWAP 2; STORE_FAST STORE_FAST STORE_FAST STORE_ATTR

brandtbucher · 2022-01-28T01:14:19Z

Could we maybe factor out a helper function like

Yeah, I like that idea.

I'm assuming we don't need to worry that this is quadratic if there are a bunch of non-simplifiable swaps followed by a bunch of stores?

I might be missing something... isn't this only quadratic if we successfully fold lots of swaps? We move right-to-left, and bail on the first one that doesn't work.

I'm okay with quadratic behavior here, since N is generally going to be very small (1 or 2 swaps of depth 2 or 3). Pattern matching code can be worse, but even then it's still rare to see anything worse than a handful of shallow-ish swaps.

Also, are cases like the following possible?

I had given that some thought, but I wasn't able to create any code that compiled that way by hand. The long runs of SWAPs typically come from pattern matching code right now, and those tend to look like SWAP(n), SWAP(n - 1), ..., SWAP(2) before hitting this point. So I'm not sure we have to worry about it happening in practice.

Maybe there is a more general solution than this that can handle those cases, though. I just wouldn't want to add much more complexity to support it if we're not sure of the payoff.

sweeneyde · 2022-01-28T01:19:24Z

isn't this only quadratic if we successfully fold lots of swaps

I think that's right -- I guess that would only not pay off if we have a very strange million-line source file that only gets run once or something. Probably not worth worrying about too much.

brandtbucher · 2022-01-28T01:23:25Z

I guess that would only not pay off if we have a very strange million-line source file

Nah, it would need to be one really long line. 🙂

Something like this:

def f(seq):
    match seq:
        case a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, _:
            pass

sweeneyde · 2022-01-28T02:26:51Z

Can we do similar things for

>>> def f(a, b):
...     a.a, b.b = b, a
...
>>> dis(f)
  1           0 RESUME                   0

  2           2 LOAD_FAST                1 (b)
              4 LOAD_FAST                0 (a)
              6 SWAP                     2
              8 LOAD_FAST                0 (a)
             10 STORE_ATTR               0 (a)
             12 LOAD_FAST                1 (b)
             14 STORE_ATTR               1 (b)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

or is that for a separate PR?

brandtbucher · 2022-01-28T03:11:05Z

No, we need to be pretty conservative with this.

The order of attribute stores and fast local loads are user-visible. They can both raise exceptions, and attribute access can be customized with arbitrary Python code.

brandtbucher · 2022-01-28T03:12:39Z

(Maybe you could do something clever like only loading a and b once each and duplicating them on the stack, but that sounds slower, more complicated, and uncommon.)

sweeneyde · 2022-01-28T19:49:39Z

I think STORE_DEREF might be user-visible, even without frame-related hacks:

def f(x, y):
    class A:
        def __init__(self, value):
            self.value = value
        def __del__(self):
            print(a, b)
        def __repr__(self):
            return f"<value={self.value}>"

    a = A(333)
    b = A(444)
    a, b = x, y

f(1, 2)

Before:

1 <value=444>
1 2

After:

<value=333> 2
1 2

I know the exact timing of when __del__ methods are called is unspecified, but I presume they should theoretically still have access to the world in a consistent state: 1 2\n1 2 and 1 <value=444>\n1 2 are both consistent, while <value=33> 2\n1 2 is inconsistent with the assignment order. I don't know how realistic that is or how much we want to cater to that edge-case.

brandtbucher · 2022-01-28T20:07:32Z

I think that @markshannon believes our ordering guarantees don't apply to finalizers.

I'm not entirely convinced of that argument myself, though. I'll just leave STORE_DEREF out, since it's an edge-case with negligible benefit.

brandtbucher · 2022-02-09T19:27:08Z

@sweeneyde if this looks good to you, I'll go ahead and merge it today.

(Congrats on your promotion, by the way!)

sweeneyde · 2022-02-09T20:18:46Z

@sweeneyde if this looks good to you, I'll go ahead and merge it today.

(Congrats on your promotion, by the way!)

I just looked over it again and it looks good to me (modulo merge conflicts).

Thanks - I appreciate it!

brandtbucher added 7 commits January 26, 2022 18:00

Add a second pass to perform static swaps

e2884f5

Add new tests

b73d5d1

Clean things up

f4c5cff

Add more tests and comments

a1367c0

Add missing test case

d26e2dd

More cleanup

958c1b7

blurb add

1a63c80

brandtbucher added the performance Performance or resource usage label Jan 27, 2022

brandtbucher requested review from markshannon and sweeneyde January 27, 2022 22:27

brandtbucher self-assigned this Jan 27, 2022

the-knights-who-say-ni added the CLA signed label Jan 27, 2022

bedevere-bot added the awaiting core review label Jan 27, 2022

Handle inefficient edge-case

bdf21e6

Refactor (and add STORE_DEREF)

89f0b91

Don't swap STORE_DEREF

341f4ff

brandtbucher mentioned this pull request Jan 31, 2022

Attempt SWAPs at compile-time faster-cpython/ideas#244

Closed

Merge branch 'main' into unswap

5af92b8

brandtbucher merged commit 78ae4cc into python:main Feb 9, 2022

bedevere-bot removed the awaiting core review label Feb 9, 2022

brandtbucher deleted the unswap branch July 21, 2022 20:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpo-46528: Attempt `SWAP`s at compile-time #30970

bpo-46528: Attempt `SWAP`s at compile-time #30970

brandtbucher commented Jan 27, 2022 •

edited by bedevere-bot

brandtbucher commented Jan 28, 2022

sweeneyde commented Jan 28, 2022 •

edited

sweeneyde commented Jan 28, 2022

brandtbucher commented Jan 28, 2022 •

edited

sweeneyde commented Jan 28, 2022

brandtbucher commented Jan 28, 2022 •

edited

sweeneyde commented Jan 28, 2022 •

edited

brandtbucher commented Jan 28, 2022

brandtbucher commented Jan 28, 2022

sweeneyde commented Jan 28, 2022 •

edited

brandtbucher commented Jan 28, 2022

brandtbucher commented Feb 9, 2022

sweeneyde commented Feb 9, 2022

bpo-46528: Attempt SWAPs at compile-time #30970

bpo-46528: Attempt SWAPs at compile-time #30970

Conversation

brandtbucher commented Jan 27, 2022 • edited by bedevere-bot

brandtbucher commented Jan 28, 2022

sweeneyde commented Jan 28, 2022 • edited

sweeneyde commented Jan 28, 2022

brandtbucher commented Jan 28, 2022 • edited

sweeneyde commented Jan 28, 2022

brandtbucher commented Jan 28, 2022 • edited

sweeneyde commented Jan 28, 2022 • edited

brandtbucher commented Jan 28, 2022

brandtbucher commented Jan 28, 2022

sweeneyde commented Jan 28, 2022 • edited

brandtbucher commented Jan 28, 2022

brandtbucher commented Feb 9, 2022

sweeneyde commented Feb 9, 2022

bpo-46528: Attempt `SWAP`s at compile-time #30970

bpo-46528: Attempt `SWAP`s at compile-time #30970

brandtbucher commented Jan 27, 2022 •

edited by bedevere-bot

sweeneyde commented Jan 28, 2022 •

edited

brandtbucher commented Jan 28, 2022 •

edited

brandtbucher commented Jan 28, 2022 •

edited

sweeneyde commented Jan 28, 2022 •

edited

sweeneyde commented Jan 28, 2022 •

edited