
Superoptimize sequences of instructions without observable side-effects in the bytecode compiler. #102869

Open
markshannon opened this issue Mar 21, 2023 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

@markshannon
Member

markshannon commented Mar 21, 2023

Sequences of LOAD_FAST, COPY, LOAD_CONST, SWAP, POP_TOP, NOP (and others?) instructions have no observable side-effects, as the evaluation stack cannot be observed.
Therefore we can replace any sequence of these instructions with a shorter sequence that has the same effect.
By using a superoptimizer, we can replace each sequence with the optimal equivalent sequence.

In practice we will want to limit the size of the lookup tables we generate, but we can generate complete tables for sequences of up to 4 or 5 instructions.

There is a slight complication with line numbers, but as long as the number of distinct line numbers does not exceed the length of the optimal sequence, we can simply allocate line numbers to the new instructions without regard to the original mapping of instructions to line numbers.

Generating and using tables.

During build (or offline), we can generate a table mapping each possible resulting stack to the optimal sequence of instructions that produces it.
E.g. the sequences LOAD_FAST x; LOAD_FAST y; SWAP 2 and LOAD_FAST y; LOAD_FAST x both generate the stack [ y, x ].
So the lookup table maps [ y, x ] to LOAD_FAST y; LOAD_FAST x.

Whenever we see a sequence of instructions without observable side-effects, we compute the resulting stack and look it up. If we find a match, we emit the optimal sequence instead.
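A minimal sketch of how the lookup might work, in Python, assuming a simplified (opcode, arg) representation of instructions; the names symbolic_stack and OPTIMAL are illustrative, not CPython APIs:

def symbolic_stack(instructions):
    """Abstractly evaluate a side-effect-free sequence; return the resulting stack."""
    stack = []
    for op, arg in instructions:
        if op in ("LOAD_FAST", "LOAD_CONST"):
            stack.append((op, arg))
        elif op == "COPY":
            stack.append(stack[-arg])   # COPY n pushes a copy of the n-th item from the top
        elif op == "SWAP":
            stack[-1], stack[-arg] = stack[-arg], stack[-1]
        elif op == "POP_TOP":
            stack.pop()
        elif op == "NOP":
            pass
        else:
            raise ValueError(f"{op} has observable side-effects")
    return tuple(stack)

# Table generated offline: resulting stack -> shortest sequence that produces it.
OPTIMAL = {
    (("LOAD_FAST", "y"), ("LOAD_FAST", "x")):
        [("LOAD_FAST", "y"), ("LOAD_FAST", "x")],
}

seq = [("LOAD_FAST", "x"), ("LOAD_FAST", "y"), ("SWAP", 2)]
better = OPTIMAL.get(symbolic_stack(seq))
if better is not None and len(better) < len(seq):
    seq = better   # emit the shorter, equivalent sequence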

Minimizing the size of the tables

We will need to use some sort of numbering scheme for the indexes of LOAD_FAST and LOAD_CONST, and treat LOAD_CONST the same as LOAD_FAST.

So LOAD_FAST a; LOAD_CONST 1.0 becomes LOAD 1; LOAD 2.

In order to keep the table size down, we should limit the number of distinct locals and constants to 4, and the depth of the stack to something like 7. That way there are only a few thousand possible stacks, and the optimal sequence for each cannot be more than 4 instructions. The entire table should then fit into tens of kbytes of const data.
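As a rough illustration (not CPython code), renumbering load arguments in order of first use keeps equivalent sequences mapping to the same table key:

def canonicalize(instructions, max_loads=4):
    """Renumber LOAD_FAST/LOAD_CONST arguments 1, 2, ... in order of first use."""
    numbering = {}
    out = []
    for op, arg in instructions:
        if op in ("LOAD_FAST", "LOAD_CONST"):
            key = (op, arg)
            if key not in numbering:
                if len(numbering) == max_loads:
                    return None   # too many distinct values; skip optimization
                numbering[key] = len(numbering) + 1
            out.append(("LOAD", numbering[key]))
        else:
            out.append((op, arg))
    return out

# LOAD_FAST a; LOAD_CONST 1.0  ->  LOAD 1; LOAD 2
print(canonicalize([("LOAD_FAST", "a"), ("LOAD_CONST", 1.0)]))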

Handling line numbers.

Since the instructions have no visible side-effects, we can place the line events anywhere in the instruction sequence.
Taking the above sequence with line numbers:

4 LOAD_FAST x
  LOAD_FAST y
5 SWAP 2

We can generate:

4 LOAD_FAST y
5 LOAD_FAST x

which is valid, as it produces the same sequence of line events and leaves the VM in the same observable state in both cases.
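A small sketch of this line-number reassignment, assuming instructions are (line_or_None, opcode, arg) triples; illustrative only:

def assign_lines(original, optimal):
    """Reassign the original line events, in order, to the (shorter) optimal sequence."""
    lines = [ln for ln, _, _ in original if ln is not None]
    if len(lines) > len(optimal):
        return None   # not enough instructions to carry every line event
    return [(lines[i] if i < len(lines) else None, op, arg)
            for i, (op, arg) in enumerate(optimal)]

original = [(4, "LOAD_FAST", "x"), (None, "LOAD_FAST", "y"), (5, "SWAP", 2)]
optimal = [("LOAD_FAST", "y"), ("LOAD_FAST", "x")]
print(assign_lines(original, optimal))
# [(4, 'LOAD_FAST', 'y'), (5, 'LOAD_FAST', 'x')]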

@markshannon markshannon added the performance Performance or resource usage label Mar 21, 2023
@markshannon
Member Author

See faster-cpython/ideas#567 for the original motivation for this.

@CCLDArjun
Contributor

CCLDArjun commented Apr 6, 2023

@markshannon why not hardcode these sequences into the peepholer?

something like this: CCLDArjun@22b6aff

Are there enough cases of these that calculating the stack effect and building a lookup table would be worthwhile?

I would imagine finding the stack effect would involve calculating what the stack looks like before and after all of these operations, right?

@brandtbucher
Member

brandtbucher commented May 2, 2023

Just curious... how is this different from the existing swaptimize and apply_static_swaps functions in flowgraph.c? If it's similar to what we have there, perhaps those functions could just be taught how to handle more instructions (like COPY)?

The main idea there is that optimizing runs of SWAPs is a solved problem (adding COPY instructions on top of this would be very easy). Then, instead of copying and swapping things on the stack at runtime, you just copy and swap instructions in the stream at compile time.
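For illustration, the general idea sketched in Python under the same simplified (opcode, arg) representation as above; this is a rough sketch, not the flowgraph.c implementation:

def apply_static_swap(instrs):
    """If the sequence ends in SWAP n fed by n simple pushes, reorder the pushes
    at compile time and drop the SWAP."""
    out = list(instrs)
    if out and out[-1][0] == "SWAP":
        n = out[-1][1]
        pushes = out[-1 - n:-1]
        if len(pushes) == n and all(op in ("LOAD_FAST", "LOAD_CONST") for op, _ in pushes):
            pushes[0], pushes[-1] = pushes[-1], pushes[0]   # swap TOS with the n-th item
            out[-1 - n:] = pushes                            # the SWAP is no longer needed
    return out

print(apply_static_swap([("LOAD_FAST", "x"), ("LOAD_FAST", "y"), ("SWAP", 2)]))
# [('LOAD_FAST', 'y'), ('LOAD_FAST', 'x')]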

@brandtbucher
Member

3.10:

>>> def f():
...    a, b, c = b, c, a
... 
>>> dis.dis(f)
  2           0 LOAD_FAST                0 (b)
              2 LOAD_FAST                1 (c)
              4 LOAD_FAST                2 (a)
              6 ROT_THREE
              8 ROT_TWO
             10 STORE_FAST               2 (a)
             12 STORE_FAST               0 (b)
             14 STORE_FAST               1 (c)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

3.11:

>>> def f():
...    a, b, c = b, c, a
... 
>>> dis.dis(f)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (b)
              4 LOAD_FAST                1 (c)
              6 LOAD_FAST                2 (a)
              8 STORE_FAST               1 (c)
             10 STORE_FAST               0 (b)
             12 STORE_FAST               2 (a)
             14 LOAD_CONST               0 (None)
             16 RETURN_VALUE

@carljm
Member

carljm commented May 2, 2023

apply_static_swaps only reorders SWAP and opcodes that pop the TOS (STORE_FAST and POP_TOP). This issue proposes to also reorder COPY, LOAD_FAST, and LOAD_CONST (opcodes that push to the stack), but notably doesn't propose to touch STORE_FAST, since it does have an observable side effect (it changes the value of a variable in locals).

I think if apply_static_swaps were to gain support for opcodes that push to stack (and arbitrary sequences of pushes/pops/swaps), its implementation would have to change quite a bit and it would end up looking pretty similar to what is proposed here.

Because apply_static_swaps reorders STORE_FAST, it has to be conservative and avoid optimizing any time it sees a change in lineno. The proposal here (by avoiding STORE_FAST) can freely optimize as long as there are enough instructions in the optimal sequence to account for all distinct line numbers seen in the original sequence, and it can pad with NOP if not. Likely the actual implementation would always pad with NOP to avoid changing block size, and would allow later NOP-removal to remove NOPs that aren't needed for their lineno.
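A hypothetical sketch of that NOP-padding idea, again using (line_or_None, opcode, arg) triples; the replacement stays the same length as the original so block size doesn't change, and the original line events are attached in order:

def pad_and_assign_lines(original, optimal):
    """Pad the optimal sequence with NOPs to the original length, then attach
    the original line events in order; illustrative only."""
    padded = list(optimal) + [("NOP", 0)] * (len(original) - len(optimal))
    lines = [ln for ln, _, _ in original if ln is not None]
    return [(lines[i] if i < len(lines) else None, op, arg)
            for i, (op, arg) in enumerate(padded)]

original = [(4, "LOAD_FAST", "x"), (None, "LOAD_FAST", "y"), (5, "SWAP", 2)]
optimal = [("LOAD_FAST", "y"), ("LOAD_FAST", "x")]
print(pad_and_assign_lines(original, optimal))
# [(4, 'LOAD_FAST', 'y'), (5, 'LOAD_FAST', 'x'), (None, 'NOP', 0)]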

@carljm
Member

carljm commented Jun 2, 2023

Since the instructions have no visible side-effects

Discussion in faster-cpython/ideas#585 suggests that we can't actually move LOAD_FAST to a different line, because of frame_setlineno and debugger jump instructions. That makes the line number handling part of this a bit more complex than suggested above.

@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023