
bpo-46564: Optimize super().meth() calls via adaptive superinstructions #30992

Closed · wants to merge 11 commits

Conversation

@Fidget-Spinner (Member) commented on Jan 28, 2022:

super().meth() calls should now have almost no overhead compared to a corresponding self.meth() call.

Summary of changes:

  • typeobject.c -- refactoring to reuse code during specialization; also use InterpreterFrame over PyFrameObject for lazy-frame benefits. Some changes here are partially taken from bpo-43563: Introduce dedicated opcodes for super calls #24936. All credit to @vladima (I've tried to properly credit them in the news item too).
  • specialize.c -- specialize for the 0-argument and 2-argument forms of super() (see the sketch after this list).
  • ceval.c -- does both a CALL and a LOAD_METHOD without intermediates (and both are specialized forms too).
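
For reference, a tiny pure-Python illustration (mine, not from the PR; class names are made up) of the two super() forms the specializer targets:

class A:
    def greet(self):
        return "hello"

class B(A):
    def zero_arg(self):
        # 0-argument form: the compiler supplies __class__ and self implicitly.
        return super().greet()

    def two_arg(self):
        # 2-argument form: the type and the instance are passed explicitly.
        return super(B, self).greet()

b = B()
assert b.zero_arg() == b.two_arg() == "hello"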

TODO:
benchmarks!

https://bugs.python.org/issue46564

@Fidget-Spinner Fidget-Spinner changed the title bpo-46564: Optimize super().meth() calls bpo-46564: Optimize super().meth() calls via adaptive superinstructions Jan 28, 2022
@markshannon markshannon self-assigned this Jan 28, 2022
@Fidget-Spinner Fidget-Spinner marked this pull request as draft January 28, 2022 18:14
@Fidget-Spinner (Member Author):

Marking as draft, as I need to make this work with the new CALL convention.


DEOPT_IF(_PyType_CAST(super_callable) != &PySuper_Type, CALL);
/* super() - zero argument form */
if (_PySuper_GetTypeArgs(frame, frame->f_code, &su_type, &su_obj) < 0) {
Member:
Can't we do this at specialization time? The number of locals, the index of "self", and whether it is a cell are all known then. Likewise the nature of __class__ is also known.
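
A small illustration (mine, not from the PR; class names are made up) of the information that is indeed available from the code object at specialization time:

class A:
    def f(self):
        pass

class B(A):
    def g(self):
        super().f()

code = B.g.__code__
# The index of "self" among the locals is fixed by the code object.
print(code.co_varnames)   # ('self',)
# If an inner function captured self, it would also show up in co_cellvars.
print(code.co_cellvars)   # ()
# Zero-argument super() makes the compiler pass the enclosing class in via a
# __class__ cell, visible as a free variable of the method.
print(code.co_freevars)   # ('__class__',)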

}
assert(su_obj != NULL);
DEOPT_IF(lm_adaptive->version != Py_TYPE(su_obj)->tp_version_tag, CALL);
DEOPT_IF(cache0->version != su_type->tp_version_tag, CALL);
Member:

When can this fail?
Isn't the next item in the MRO determined solely by type(self) and __class__, both of which are known at this point?

@Fidget-Spinner (Member Author):

I wanted assurance that __class__ didn't change. Then again, I'm not sure if it can?
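
For context (not part of the PR; class names are made up), a pure-Python sketch of why the resolved method can change even when type(self) and the __class__ cell stay the same, which is the kind of change the tp_version_tag checks guard against:

class A:
    def f(self):
        return "A.f"

class Other:
    def f(self):
        return "Other.f"

class B(A):
    def g(self):
        # Zero-argument form; equivalent to super(B, self).f().
        return super().f()

b = B()
print(b.g())          # A.f

# Re-pointing B's bases changes the MRO, so the same super().f() now
# resolves to a different method, even though type(b) and the __class__
# cell are untouched. Mutations like this also invalidate the type version tags.
B.__bases__ = (Other,)
print(b.g())          # Other.f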

@markshannon (Member):

Maybe we should merge #31002 first, as that PR is simpler.
It would also allow us to compare the performance of just the specialization, without the removal of frame object allocation.

@Fidget-Spinner (Member Author):

> Maybe we should merge #31002 first, as that PR is simpler. It would also allow us to compare the performance of just the specialization, without the removal of frame object allocation.

👍

@Fidget-Spinner (Member Author):

Mark, I'm going to run benchmarks on deltablue first, since it uses the 2-argument form of super(). I'll address your optimization ideas for the 0-arg form once those results come back. Fingers crossed.

@arhadthedev (Member) left a comment:

A couple of indentation-related inconsistencies:

static int
super_init_without_args(InterpreterFrame *cframe, PyCodeObject *co,
int
_PySuper_GetTypeArgs(InterpreterFrame *cframe, PyCodeObject *co,
PyTypeObject **type_p, PyObject **obj_p)

Suggested change:
-                        PyTypeObject **type_p, PyObject **obj_p)
+                     PyTypeObject **type_p, PyObject **obj_p)

The line was aligned with the opening parenthesis of the parameter list.

Comment on lines +278 to +279
PyObject *kwnames, SpecializedCacheEntry *cache, PyObject *builtins,
PyObject **stack_pointer, InterpreterFrame *frame, PyObject *names);

Suggested change: re-indent the two continuation lines

PyObject *kwnames, SpecializedCacheEntry *cache, PyObject *builtins,
PyObject **stack_pointer, InterpreterFrame *frame, PyObject *names);

as in a removed line, or even as in _Py_Specialize_BinaryOp right below (the two suggested variants differ only in indentation).

@Fidget-Spinner (Member Author):

> Mark, I'm going to run benchmarks on deltablue first, since it uses the 2-argument form of super(). I'll address your optimization ideas for the 0-arg form once those results come back. Fingers crossed.

Well, that was depressing. deltablue only shows a 1.03x speedup. Looking closer at the code, super isn't called in any tight loops, so that might be why. Maybe I need to pull out microbenchmarks now.

@Fidget-Spinner (Member Author) commented on Feb 1, 2022:

Microbenchmarks show that super() calls have sped up by more than 2.2x. This is faster than the other attempt because there are also speedups from LOAD_METHOD_CACHED. (Extremely unscientific; I'm short on time to set up pyperf right now.)

import timeit

setup = """
class A:
    def f(self): pass
class B(A):
    def g(self): super().f()
    def h(self): self.f()

b = B()
"""

# super() call
print(timeit.timeit("b.g()", setup=setup, number=20_000_000))
# reference
print(timeit.timeit("b.h()", setup=setup, number=20_000_000))

Results:

# Main
5.796037399995839
2.4094066999969073

# This branch
2.4578273000006448
2.3718886000060593

So super().meth() is now only ~10% slower than the corresponding self.meth() call, whereas it was nearly 2x as slow previously. If I manage to incorporate your suggestions correctly, this will effectively just be a competition between LOAD_GLOBAL_BUILTIN (super) and LOAD_FAST (self).
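
An untested sketch of the equivalent pyperf setup (benchmark names are arbitrary), for when there's time to do this properly:

import pyperf

# Same classes as in the timeit snippet above.
setup = """
class A:
    def f(self): pass
class B(A):
    def g(self): super().f()
    def h(self): self.f()

b = B()
"""

runner = pyperf.Runner()
runner.timeit("super().meth()", stmt="b.g()", setup=setup)
runner.timeit("self.meth()", stmt="b.h()", setup=setup)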

Labels: awaiting core review, performance
7 participants