Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-87729: add LOAD_SUPER_ATTR instruction for faster super() #103497

Merged
merged 23 commits into from Apr 24, 2023

Conversation

carljm
Copy link
Member

@carljm carljm commented Apr 13, 2023

This PR speeds up super() (by around 85%, for a simple one-level super().meth() microbenchmark) by avoiding allocation of a new single-use super() object on each use.

Microbenchmark results

With this PR:

➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 70.4 ns +- 1.4 ns

Without this PR:

➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 130 ns +- 1 ns

Microbenchmark code

➜ cat superbench.py
class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

b = B()

Microbenchmark numbers are the same (both pre and post) if the microbenchmark is switched to use return super(B, self).meth() instead.

super() is already special-cased in the compiler to ensure the presence of the __class__ cell needed by zero-argument super(). This extends that special-casing a bit in order to compile super().meth() as

              4 LOAD_GLOBAL              0 (super)
             14 LOAD_DEREF               1 (__class__)
             16 LOAD_FAST                0 (self)
             18 LOAD_SUPER_ATTR          5 (NULL|self + meth)
             20 CALL                     0

instead of the current:

              4 LOAD_GLOBAL              1 (NULL + super)
             14 CALL                     0
             22 LOAD_ATTR                3 (NULL|self + meth)
             42 CALL                     0
Bytecode comparison for simple attribute

And compile super().attr as

              4 LOAD_GLOBAL              0 (super)
             14 LOAD_DEREF               1 (__class__)
             16 LOAD_FAST                0 (self)
             18 LOAD_SUPER_ATTR     4 (attr)

instead of the current:

              4 LOAD_GLOBAL              1 (NULL + super)
             14 CALL                     0
             22 LOAD_ATTR                2 (attr)

The new bytecode has one more instruction, but still ends up executing much faster, because it eliminates the cost of allocating a new single-use super object each time. For zero-arg super, it also eliminates dynamically figuring out each time via frame introspection where to find the self argument and __class__ cell, even though the location of both is already known at compile time.

The LOAD_GLOBAL of super remains only in order to support existing semantics in case the name super is re-bound to some other callable besides the built-in super type.

Besides being faster, the new bytecode is preferable because it regularizes the loading of self and __class__ to use the normal LOAD_FAST and LOAD_DEREF opcodes, instead of custom code in the super object (not part of the interpreter) relying on private details of interpreter frames to load these in a bespoke way. This helps optimizers like the Cinder JIT that fully support LOAD_FAST and LOAD_DEREF but may not maintain frame locals in the same way. It also makes the bytecode more easily amenable to future optimization by a type-specializing tier 2 interpreter, because __class__ and self will now be surfaced and visible to the optimizer in the usual way, rather than hidden inside the super object.

I'll follow up with a specialization of LOAD_SUPER_ATTR for the case where we are looking up a method and a method is found (because this is a common case, and a case where the output of LOAD_SUPER_ATTR depends only on the type of self and not on the actual instance). But to simplify review, I'll do this in a separate PR. I think the benefits of this PR stand alone, even without further benefits of specialization. (ETA: the specialization is now also ready at https://github.com/carljm/cpython/compare/superopt...carljm:cpython:superopt_spec?expand=1 and increases the microbenchmark win from 85% to 2.3x.)

The frame introspection code for runtime/dynamic zero-arg super() still remains, but after this PR it would only ever be used in an odd edge case like super(*args) (if args turns out to be empty at runtime), where we can't detect at compile time whether we will have zero-arg or two-arg super().

"Odd" uses of super() (like one-argument super, use of a super object as a descriptor etc) are still supported and experience no change; the compiler will not emit the new LOAD_SUPER_ATTR opcode.

I chose to make the new opcode more general by using it for both (statically detectable) zero- and two-arg super. Optimizing zero-arg super is more important because it is more common in modern Python code, and because it also eliminates the frame introspection. But supporting two-arg super costs only one extra bit smuggled via the oparg; this seems worth it.

Real-world results and macrobenchmarks

This approach provides a speed-up of about 0.5% globally on the Instagram server real-world workload (measured recently on Python 3.10.) I can work on a macrobenchmark for the pyperformance suite that exercises super() (currently it isn't significantly exercised by any benchmark.) (ETA: benchmark is now ready at python/pyperformance#271 -- this diff improves its performance by 10%, the specialization follow-up by another 10%.)

Prior art

This PR is essentially an updated version of #24936 -- thanks to @vladima for the original inspiration for this approach. Notable differences from that PR:

  • I avoid turning the oparg for the new opcode into a const load, preferring to pass the needed bits of information by bit-shifting the oparg instead (following the precedent of LOAD_ATTR).
  • I prioritize code simplicity over performance in edge cases like when a super() attribute access raises AttributeError, which also reduces the footprint of the PR.

#30992 was an attempt to optimize super() solely using the specializing interpreter, but it was never merged because there are too many problems caused by adaptive super-instructions in the tier 1 specializing interpreter.

@carljm carljm changed the title gh-87729: add instruction for faster zero-arg super() gh-87729: add LOAD_SUPER_ATTR instruction for faster super() Apr 13, 2023
* main:
  pythongh-103479: [Enum] require __new__ to be considered a data type (pythonGH-103495)
  pythongh-103365: [Enum] STRICT boundary corrections (pythonGH-103494)
  pythonGH-103488: Use return-offset, not yield-offset. (pythonGH-103502)
  pythongh-103088: Fix test_venv error message to avoid bytes/str warning (pythonGH-103500)
  pythonGH-103082: Turn on branch events for FOR_ITER instructions. (python#103507)
  pythongh-102978: Fix mock.patch function signatures for class and staticmethod decorators (python#103228)
  pythongh-103462: Ensure SelectorSocketTransport.writelines registers a writer when data is still pending (python#103463)
  pythongh-95299: Rework test_cppext.py to not invoke setup.py directly (python#103316)
@corona10
Copy link
Member

cc @Fidget-Spinner

@carljm
Copy link
Member Author

carljm commented Apr 14, 2023

https://github.com/carljm/cpython/compare/superopt...carljm:cpython:superopt_spec?expand=1 has a draft of the first specialization of LOAD_SUPER_ATTR built on top of this, specializing for the method case.

With that specialization, the ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()' microbenchmark shown above now runs in 56ns, down from 130ns originally and 70ns without the specialization. That's 2.3x better than the current main-branch speed. For reference, a version of the same benchmark that uses return A.meth(self) in place of return super().meth() runs in 48ns. So we are getting pretty close to zero-cost super method calls.

(If reviewers would prefer to just have the specialization(s) included directly in this PR and all reviewed together, let me know and I can push everything here.)

@corona10 corona10 added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 14, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @corona10 for commit 92c943b 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 14, 2023
Objects/typeobject.c Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
Objects/typeobject.c Outdated Show resolved Hide resolved
carljm and others added 2 commits April 14, 2023 08:35
@markshannon
Copy link
Member

Is the microbenchmark code correct? It doesn't look like you call meth()

@carljm
Copy link
Member Author

carljm commented Apr 18, 2023

Is the microbenchmark code correct? It doesn't look like you call meth()

The call to b.meth() happens in the actual invocation of pyperf timeit: ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a fair bit of branching LOAD_SUPER_ATTR which suggest that either it needs reworking or splitting up.

I've made a few suggestions as to how it can be made less branchy.
We'll see if that is sufficient.

The compiler code looks OK to me, but I'll leave it @iritkatriel to review it properly.

Lib/test/test_super.py Show resolved Hide resolved
Lib/test/test_super.py Show resolved Hide resolved
Objects/typeobject.c Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
Objects/typeobject.c Show resolved Hide resolved
@carljm
Copy link
Member Author

carljm commented Apr 19, 2023

@markshannon

Thanks for the review!

There is a fair bit of branching LOAD_SUPER_ATTR which suggest that either it needs reworking or splitting up.

The causes of branching are these:

  1. Has built-in super been replaced or shadowed? This branching is unavoidable, since it can happen dynamically; we have to check at runtime.
  2. Are we loading a method that will be called?
  3. Is this zero-arg or two-arg super? (We only need to know this if super has been shadowed, so we can reconstruct the right call to whatever it is now.)

(2) and (3) are both known at compile time, so we could split the opcode in two along either axis (i.e. LOAD_SUPER_ATTR vs LOAD_SUPER_METHOD, or LOAD_ZERO_SUPER_ATTR vs LOAD_TWO_SUPER_ATTR). I considered both splits, and decided neither made sense: the second split would result in two separate opcodes that we'll later want to specialize to the same opcode, which is awkward, and the first split loses the parallel to how LOAD_ATTR works. (Both splits would result in code duplication.)

Your suggestion above about how to handle oparg & 2 eliminates the branching for zero-arg vs two-arg super in the shadowing case; hopefully that's enough.

* main: (24 commits)
  pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561)
  pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545)
  pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634)
  pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633)
  pythongh-102856: Initial implementation of PEP 701 (python#102855)
  pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589)
  pythongh-83004: Harden msvcrt further (python#103420)
  pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626)
  pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314)
  pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586)
  pythongh-103583: Always pass multibyte codec structs as const (python#103588)
  pythongh-103617: Fix compiler warning in _iomodule.c (python#103618)
  pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600)
  pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576)
  pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613)
  [Doc] Fix a typo in optparse.rst (python#103504)
  pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531)
  pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039)
  pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569)
  pythongh-67230: update whatsnew note for csv changes (python#103598)
  ...
@carljm
Copy link
Member Author

carljm commented Apr 20, 2023

@markshannon I've now addressed or replied to all comments, if you want to take another look.

* main: (53 commits)
  pythongh-102498 Clean up unused variables and imports in the email module  (python#102482)
  pythongh-99184: Bypass instance attribute access in `repr` of `weakref.ref` (python#99244)
  pythongh-99032: datetime docs: Encoding is no longer relevant (python#93365)
  pythongh-94300: Update datetime.strptime documentation (python#95318)
  pythongh-103776: Remove explicit uses of $(SHELL) from Makefile (pythonGH-103778)
  pythongh-87092: fix a few cases of incorrect error handling in compiler (python#103456)
  pythonGH-103727: Avoid advancing tokenizer too far in f-string mode (pythonGH-103775)
  Revert "Add tests for empty range equality (python#103751)" (python#103770)
  pythongh-94518: Port 23-argument `_posixsubprocess.fork_exec` to Argument Clinic (python#94519)
  pythonGH-65022: Fix description of copyreg.pickle function (python#102656)
  pythongh-103323: Get the "Current" Thread State from a Thread-Local Variable (pythongh-103324)
  pythongh-91687: modernize dataclass example typing (python#103773)
  pythongh-103746: Test `types.UnionType` and `Literal` types together (python#103747)
  pythongh-103765: Fix 'Warning: py:class reference target not found: ModuleSpec' (pythonGH-103769)
  pythongh-87452: Improve the Popen.returncode docs
  Removed unnecessary escaping of asterisks (python#103714)
  pythonGH-102973: Slim down Fedora packages in the dev container (python#103283)
  pythongh-103091: Add PyUnstable_Type_AssignVersionTag (python#103095)
  Add tests for empty range equality (python#103751)
  pythongh-103712: Increase the length of the type name in AttributeError messages (python#103713)
  ...
@markshannon markshannon self-requested a review April 24, 2023 21:12
* main:
  pythongh-101517: fix line number propagation in code generated for except* (python#103550)
  pythongh-103780: Use patch instead of mock in asyncio unix events test (python#103782)
@carljm carljm enabled auto-merge (squash) April 24, 2023 21:27
@carljm carljm added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 24, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @carljm for commit 0de5bc6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 24, 2023
* main:
  pythongh-100227: Only Use deepfreeze for the Main Interpreter (pythongh-103794)
  pythongh-103492: Clarify SyntaxWarning with literal comparison (python#103493)
  pythongh-101100: Fix Sphinx warnings in `argparse` module (python#103289)
@carljm carljm added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 24, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @carljm for commit dbe1665 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 24, 2023
@carljm carljm merged commit 0dc8b50 into python:main Apr 24, 2023
20 checks passed
carljm added a commit to carljm/cpython that referenced this pull request Apr 24, 2023
* main:
  pythongh-87729: add LOAD_SUPER_ATTR instruction for faster super() (python#103497)
  pythongh-103791: Make contextlib.suppress also act on exceptions within an ExceptionGroup (python#103792)
@carljm carljm deleted the superopt branch April 28, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants