Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dead store elimination in the compiler #104635

Open
carljm opened this issue May 18, 2023 · 12 comments
Open

dead store elimination in the compiler #104635

carljm opened this issue May 18, 2023 · 12 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@carljm
Copy link
Member

carljm commented May 18, 2023

There are several code patterns that can cause multiple stores to the same locals index in near succession, without any intervening instructions that can possibly use the stored value. One such code pattern is e.g. a, a, b = x, y, z. Another (since PEP 709) is a = [_ for a in []]. There are likely others.

In such cases, all but the final store can be replaced with POP_TOP. This alone isn't a huge gain, but this can also allow apply_static_swaps to take effect in these cases (because it can't safely reorder writes to the same location, but it can safely reorder POP_TOPs at will), removing SWAPs, which shrinks code size and reduces runtime overhead.

Not sure how much of a gain this will be in realistic cases, but it's also not hard to implement, so it seems worth doing as a marginal improvement to compiler output.

Note that in order to maintain tracing / debugger compatibility, we can only do this if line number is the same for all the store instructions. And we definitely can't eliminate the final store, even if it also appears to be unused, since it can always be visible to tracers/debuggers/locals()/etc.

Linked PRs

@corona10
Copy link
Member

corona10 commented May 28, 2023

I just experimented with my curiosity. (please understand that I'm not an optimization expert)
With my super naive implementation, it shows 14% slower than baseline.

I think that this is because of the non-existence of super instruction(POP_TOP__POP_TOP / POP_TOP__STORE_FAST)
while baseline has super instruction(STORE_FAST__STORE_FAST).

If we try to implement such optimization, do you think we must also implement the super instruction?
Or do you think that that scenario is not realistic? (I meant the 14% slower scenario only happens at microbenchmark?)

Please let me know if I understood your suggestion wrongly :)

baseline

  1           0 RESUME                   0

  2           2 LOAD_CONST               1 ((1, 2, 3, 4))
              4 UNPACK_SEQUENCE          4
              8 STORE_FAST               0 (a)
             10 STORE_FAST               0 (a)
             12 STORE_FAST               1 (b)
             14 STORE_FAST               1 (b)
             16 RETURN_CONST             0 (None)

suggestion

  1           0 RESUME                   0

  2           2 LOAD_CONST               1 ((1, 2, 3, 4))
              4 UNPACK_SEQUENCE          4
              8 POP_TOP
             10 STORE_FAST               0 (a)
             12 POP_TOP
             14 STORE_FAST               1 (b)
             16 RETURN_CONST             0 (None)

benchmark

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
a, a, b, b = 1, 2, 3, 4
""")

@corona10
Copy link
Member

corona10 commented May 28, 2023

With super-instruction: 6-7% faster
: corona10@a96ed91


Mean +- std dev: [base] 10.7 ns +- 0.2 ns -> [opt] 12.2 ns +- 0.1 ns: 1.14x slower # Only removing dead store
Mean +- std dev: [base] 10.7 ns +- 0.2 ns -> [super] 10.1 ns +- 0.1 ns: 1.06x faster # Removing dead store + super_instruction

For more realistic example

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
_, _, _, a = 1, 2, 3, 4
""")


Mean +- std dev: [base] 10.8 ns +- 0.2 ns -> [super] 10.1 ns +- 0.1 ns: 1.07x faster

@corona10
Copy link
Member

corona10 commented May 29, 2023

pyperformance benchmark: https://gist.github.com/corona10/2cfd404ac0cc9d9a61c71d4365e7c5bb
1.01x faster (might be noise)

@carljm
Copy link
Member Author

carljm commented May 30, 2023

Ah, I hadn't considered the possibility that we'd need new super-instructions to make this a win instead of a regression. Thanks for trying it out!

Not sure if there are reasons to avoid adding new super-instructions, but apart from that issue, this still seems worth it to me. Made some comments on the draft PR.

@markshannon
Copy link
Member

OOI what are the advantages of this over the superoptimizer approach?
#102869

The superoptimizer would need to understand stores, but that doesn't seem too complex.

@corona10
Copy link
Member

OOI what are the advantages of this over the superoptimizer approach?

In my opinion, the two optimizations are responsible for optimization in different layers. I believe that flowgraph-based optimization can optimize with lower cost compared to simulating stack side effects, and the superoptimizer can optimize a wider window that a flowgraph cannot handle. Therefore, I think these two optimizations complement each other, and if flowgraph-based optimization does not interfere with the optimization of the superoptimizer, it is acceptable to add additional optimizations.

Maybe @carljm has a different opinion from me.

@carljm
Copy link
Member Author

carljm commented May 31, 2023

@markshannon This issue is just tracking the fact that we can (and should) replace dead STORE_FAST with POP_TOP, and if we do so, it will allow better static elimination of SWAP instructions (and maybe in some cases allow eliminating a corresponding load and the POP_TOP as well). AFAIK that isn't currently explicitly tracked or mentioned anywhere; the currently proposed superoptimizer design explicitly leaves stores out of scope.

I don't have strong opinions about whether we do this via the existing optimize_basic_block infra, or via an expanded superoptimizer design. But I think it's quite simple to do via the existing infra, so I don't see any reason to block doing it now on the basis that someday we might do it in a superoptimizer instead.

The main wrinkle is that it seems we need to add some POP_TOP based super-instructions in order to keep micro-benchmarks positive with this change (otherwise they lose the benefit of the existing STORE_FAST super-instructions), but that will be true whether we do this in super-optimizer or standalone.

@iritkatriel
Copy link
Member

... multiple stores to the same locals index in near succession, without any intervening instructions that can possibly use the stored value.

Also need to make sure there isn't something between them that could raise an exception. We don't currently have this, we probably need to add this in the opcode_metadata.

@markshannon
Copy link
Member

Are there many dead stores in Python, given that two writes must occur on the same line for the first to count as dead?

I suspect that they are quite rare. Has anyone tried to measure how many?

@corona10
Copy link
Member

corona10 commented Jun 5, 2023

Are there many dead stores in Python, given that two writes must occur on the same line for the first to count as dead?
I suspect that they are quite rare. Has anyone tried to measure how many?

Assuming that super instruction is applied when the corresponding code pattern is used, it is true that it is a small proportion when observed indirectly.
#105040 (comment)

However, this only means that super instruction does not need to be applied, and whether reducing the opcodes for (_, _, a = some_tuple) is a separate issue.

@corona10
Copy link
Member

corona10 commented Jun 5, 2023

However, this only means that super instruction does not need to be applied, and whether reducing the opcodes for (_, _, a = some_tuple) is a separate issue.

Okay we don't need to add extra super instructions for this optimization after Mark's PR is merged.
It becomes 3-4% faster.
cc @carljm

Benchmarks

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
_, _, _, a = 1, 2, 3, 4
""")

Mean +- std dev: [base] 11.6 ns +- 0.1 ns -> [opt] 11.2 ns +- 0.1 ns: 1.03x faster
import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench static swaps",
              stmt="""
_ = f(True, False)
""",
setup="""
def f(x, y):
    a, a = x, y
    return a
""")


Mean +- std dev: [static_base] 27.1 ns +- 0.3 ns -> [static_opt] 26.0 ns +- 0.2 ns: 1.04x faster

@carljm
Copy link
Member Author

carljm commented Jun 6, 2023

Are there many dead stores in Python

I also suspect it's not common. But given that the latest PR shows a clear improvement for this case without the need to add superinstructions, and with just a few lines in the compiler, it seems almost pure win to do it, even if the case is not common.

corona10 added a commit to corona10/cpython that referenced this issue Jun 7, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jul 9, 2023
@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
Projects
None yet
Development

No branches or pull requests

4 participants