
Superoptimize sequences of instructions without observable side-effects in the bytecode compiler. #102869

Open
markshannon opened this issue Mar 21, 2023 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

@markshannon
Member

markshannon commented Mar 21, 2023

Sequences of LOAD_FAST, COPY, LOAD_CONST, SWAP, POP_TOP, NOP (and others?) instructions have no observable side-effects, as the evaluation stack cannot be observed.
Therefore we can replace any sequence of these instructions with a shorter sequence that has the same effect.
By using a superoptimizer, we can replace each sequence with the optimal equivalent sequence.

In practice we will want to limit the size of the lookup tables we generate, but we can generate complete tables for sequences of up to 4 or 5 instructions.

There is a slight complication with line numbers, but as long as the number of distinct line numbers does not exceed the length of the optimal sequence, we can simply allocate line numbers to the new instructions without regard to the original mapping of instructions to line numbers.

Generating and using tables.

During build (or offline), we can generate a table mapping each possible resulting stack to the optimal sequence of instructions that produces it.
E.g. the sequences LOAD_FAST x; LOAD_FAST y; SWAP 2 and LOAD_FAST y; LOAD_FAST x both generate the stack [ y, x ].
So the lookup table maps [ y, x ] to LOAD_FAST y; LOAD_FAST x.

Whenever we see a sequence of instructions without observable side-effects, we compute the resulting stack and look it up. If we find a match, we emit the optimal sequence instead.
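A minimal sketch of how the lookup might work, in Python, assuming a simplified (opcode, arg) representation of instructions; the names symbolic_stack and OPTIMAL are illustrative, not CPython APIs:

def symbolic_stack(instructions):
    """Abstractly evaluate a side-effect-free sequence; return the resulting stack."""
    stack = []
    for op, arg in instructions:
        if op in ("LOAD_FAST", "LOAD_CONST"):
            stack.append((op, arg))
        elif op == "COPY":
            stack.append(stack[-arg])   # COPY n pushes a copy of the n-th item from the top
        elif op == "SWAP":
            stack[-1], stack[-arg] = stack[-arg], stack[-1]
        elif op == "POP_TOP":
            stack.pop()
        elif op == "NOP":
            pass
        else:
            raise ValueError(f"{op} has observable side-effects")
    return tuple(stack)

# Table generated offline: resulting stack -> shortest sequence that produces it.
OPTIMAL = {
    (("LOAD_FAST", "y"), ("LOAD_FAST", "x")):
        [("LOAD_FAST", "y"), ("LOAD_FAST", "x")],
}

seq = [("LOAD_FAST", "x"), ("LOAD_FAST", "y"), ("SWAP", 2)]
better = OPTIMAL.get(symbolic_stack(seq))
if better is not None and len(better) < len(seq):
    seq = better   # emit the shorter, equivalent sequence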

Minimizing the size of the tables

We will need to use some sort of numbering scheme for the indexes of LOAD_FAST and LOAD_CONST, and treat LOAD_CONST the same as LOAD_FAST.

So LOAD_FAST a; LOAD_CONST 1.0 becomes LOAD 1; LOAD 2.

In order to keep the table size down, we should limit the number of distinct locals and constants to 4, and the depth of the stack to something like 7. That way there are only a few thousand possible stacks, and the optimal sequence for each cannot be more than 4 instructions. The entire table should then fit into tens of kbytes of const data.
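As a rough illustration (not CPython code), renumbering load arguments in order of first use keeps equivalent sequences mapping to the same table key:

def canonicalize(instructions, max_loads=4):
    """Renumber LOAD_FAST/LOAD_CONST arguments 1, 2, ... in order of first use."""
    numbering = {}
    out = []
    for op, arg in instructions:
        if op in ("LOAD_FAST", "LOAD_CONST"):
            key = (op, arg)
            if key not in numbering:
                if len(numbering) == max_loads:
                    return None   # too many distinct values; skip optimization
                numbering[key] = len(numbering) + 1
            out.append(("LOAD", numbering[key]))
        else:
            out.append((op, arg))
    return out

# LOAD_FAST a; LOAD_CONST 1.0  ->  LOAD 1; LOAD 2
print(canonicalize([("LOAD_FAST", "a"), ("LOAD_CONST", 1.0)]))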

Handling line numbers.

Since the instructions have no visible side-effects, we can place the line events anywhere in the instruction sequence.
Taking the above sequence with line numbers:

4 LOAD_FAST x
  LOAD_FAST y
5 SWAP 2

We can generate:

4 LOAD_FAST y
5 LOAD_FAST x

which is valid, as it produces the same sequence of line events and leaves the VM in the same observable state in both cases.
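A small sketch of this line-number reassignment, assuming instructions are (line_or_None, opcode, arg) triples; illustrative only:

def assign_lines(original, optimal):
    """Reassign the original line events, in order, to the (shorter) optimal sequence."""
    lines = [ln for ln, _, _ in original if ln is not None]
    if len(lines) > len(optimal):
        return None   # not enough instructions to carry every line event
    return [(lines[i] if i < len(lines) else None, op, arg)
            for i, (op, arg) in enumerate(optimal)]

original = [(4, "LOAD_FAST", "x"), (None, "LOAD_FAST", "y"), (5, "SWAP", 2)]
optimal = [("LOAD_FAST", "y"), ("LOAD_FAST", "x")]
print(assign_lines(original, optimal))
# [(4, 'LOAD_FAST', 'y'), (5, 'LOAD_FAST', 'x')]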

@markshannon markshannon added the performance Performance or resource usage label Mar 21, 2023
@markshannon
Member Author

See faster-cpython/ideas#567 for the original motivation for this.

@CCLDArjun
Contributor

CCLDArjun commented Apr 6, 2023

@markshannon why not hardcode these sequences into the peepholer?

something like this: CCLDArjun@22b6aff

Are there enough cases of these that calculating the stack effect and building a lookup table would be worthwhile?

I would imagine finding the stack effect would involve calculating what the stack looks like before and after all of these operations, right?

@brandtbucher
Member

brandtbucher commented May 2, 2023

Just curious... how is this different from the existing swaptimize and apply_static_swaps functions in flowgraph.c? If it's similar to what we have there, perhaps those functions could just be taught how to handle more instructions (like COPY)?

The main idea there is that optimizing runs of SWAPs is a solved problem (adding COPY instructions on top of this would be very easy). Then, instead of copying and swapping things on the stack at runtime, you just copy and swap instructions in the stream at compile time.
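For illustration, the general idea sketched in Python under the same simplified (opcode, arg) representation as above; this is a rough sketch, not the flowgraph.c implementation:

def apply_static_swap(instrs):
    """If the sequence ends in SWAP n fed by n simple pushes, reorder the pushes
    at compile time and drop the SWAP."""
    out = list(instrs)
    if out and out[-1][0] == "SWAP":
        n = out[-1][1]
        pushes = out[-1 - n:-1]
        if len(pushes) == n and all(op in ("LOAD_FAST", "LOAD_CONST") for op, _ in pushes):
            pushes[0], pushes[-1] = pushes[-1], pushes[0]   # swap TOS with the n-th item
            out[-1 - n:] = pushes                            # the SWAP is no longer needed
    return out

print(apply_static_swap([("LOAD_FAST", "x"), ("LOAD_FAST", "y"), ("SWAP", 2)]))
# [('LOAD_FAST', 'y'), ('LOAD_FAST', 'x')]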

@brandtbucher
Member

3.10:

>>> def f():
...    a, b, c = b, c, a
... 
>>> dis.dis(f)
  2           0 LOAD_FAST                0 (b)
              2 LOAD_FAST                1 (c)
              4 LOAD_FAST                2 (a)
              6 ROT_THREE
              8 ROT_TWO
             10 STORE_FAST               2 (a)
             12 STORE_FAST               0 (b)
             14 STORE_FAST               1 (c)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

3.11:

>>> def f():
...    a, b, c = b, c, a
... 
>>> dis.dis(f)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (b)
              4 LOAD_FAST                1 (c)
              6 LOAD_FAST                2 (a)
              8 STORE_FAST               1 (c)
             10 STORE_FAST               0 (b)
             12 STORE_FAST               2 (a)
             14 LOAD_CONST               0 (None)
             16 RETURN_VALUE

@carljm
Member

carljm commented May 2, 2023

apply_static_swaps only reorders SWAP and opcodes that pop the TOS (STORE_FAST and POP_TOP). This issue proposes to also reorder COPY, LOAD_FAST, and LOAD_CONST (opcodes that push to the stack), but notably doesn't propose to touch STORE_FAST, since it does have an observable side effect (it changes the value of a variable in locals).

I think if apply_static_swaps were to gain support for opcodes that push to stack (and arbitrary sequences of pushes/pops/swaps), its implementation would have to change quite a bit and it would end up looking pretty similar to what is proposed here.

Because apply_static_swaps reorders STORE_FAST, it has to be conservative and avoid optimizing any time it sees a change in lineno. The proposal here (by avoiding STORE_FAST) can freely optimize as long as there are enough instructions in the optimal sequence to account for all distinct line numbers seen in the original sequence, and it can pad with NOP if not. Likely the actual implementation would always pad with NOP to avoid changing block size, and would allow later NOP-removal to remove NOPs that aren't needed for their lineno.
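A hypothetical sketch of that NOP-padding idea, again using (line_or_None, opcode, arg) triples; the replacement stays the same length as the original so block size doesn't change, and the original line events are attached in order:

def pad_and_assign_lines(original, optimal):
    """Pad the optimal sequence with NOPs to the original length, then attach
    the original line events in order; illustrative only."""
    padded = list(optimal) + [("NOP", 0)] * (len(original) - len(optimal))
    lines = [ln for ln, _, _ in original if ln is not None]
    return [(lines[i] if i < len(lines) else None, op, arg)
            for i, (op, arg) in enumerate(padded)]

original = [(4, "LOAD_FAST", "x"), (None, "LOAD_FAST", "y"), (5, "SWAP", 2)]
optimal = [("LOAD_FAST", "y"), ("LOAD_FAST", "x")]
print(pad_and_assign_lines(original, optimal))
# [(4, 'LOAD_FAST', 'y'), (5, 'LOAD_FAST', 'x'), (None, 'NOP', 0)]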

@carljm
Member

carljm commented Jun 2, 2023

Since the instructions have no visible side-effects

Discussion in faster-cpython/ideas#585 suggests that we can't actually move LOAD_FAST to a different line, because of frame_setlineno and debugger jump instructions. That makes the line number handling part of this a bit more complex than suggested above.

@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023