gh-87613: Argument Clinic vectorcall decorator #145381

Open
cmaloney wants to merge 4 commits into python:main from cmaloney:ac_add_vectorcall

Conversation

@cmaloney
Contributor

@cmaloney cmaloney commented Mar 1, 2026

Add @vectorcall as a decorator to Argument Clinic (AC) which emits a Vectorcall Protocol argument parsing C function named {type}_vectorcall. This is only supported for __new__ and __init__ currently to simplify implementation.

The generated code performs as well as or better than the existing hand-written implementations for list, float, str, tuple, enumerate, reversed, and int. Using the decorator on bytearray, which has no hand-written implementation, made construction 1.09x faster. For more benchmark details see #87613 (comment).

The @vectorcall decorator has two options:

  • zero_arg={C_FUNC}: Some types, like int, can be called with zero arguments and return an immortal object in that case. A shortcut is needed to match the existing hand-written performance; it improves performance by over 10% for those cases.
  • exact_only: If the type is not an exact match, delegate to the existing non-vectorcall implementation. Needed for str to get matching performance while ensuring correct behavior.
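
The control flow these two options describe can be modeled in Python. This is only a sketch of the behavior described in the list above, not the C that Argument Clinic generates. `make_vectorcall`, `fast_path`, `slow_path`, and `MyInt` are hypothetical names used for illustration, and checking `exact_only` before `zero_arg` is an assumption about ordering.

```python
from typing import Any, Callable, Optional


def make_vectorcall(
    exact_type: type,
    fast_path: Callable[..., Any],
    slow_path: Callable[..., Any],
    zero_arg: Optional[Callable[[], Any]] = None,
    exact_only: bool = False,
) -> Callable[..., Any]:
    """Build a toy dispatcher mirroring the two decorator options.

    All names are hypothetical: fast_path stands in for the generated
    vectorcall parsing, slow_path for the existing non-vectorcall
    implementation, and zero_arg for the C function named in zero_arg=.
    """

    def vectorcall(cls: type, *args: Any, **kwargs: Any) -> Any:
        # exact_only: anything but the exact type uses the existing path.
        # (Assumed to be checked first; the real ordering may differ.)
        if exact_only and cls is not exact_type:
            return slow_path(cls, *args, **kwargs)
        # zero_arg: skip argument parsing when nothing was passed.
        if zero_arg is not None and not args and not kwargs:
            return zero_arg()
        return fast_path(cls, *args, **kwargs)

    return vectorcall


class MyInt(int):
    """Subclass used to exercise the exact_only fallback."""


call = make_vectorcall(
    int,
    fast_path=lambda cls, *a, **k: ("fast", cls, a),
    slow_path=lambda cls, *a, **k: ("slow", cls, a),
    zero_arg=lambda: 0,
    exact_only=True,
)
```

With this toy setup, `call(int)` takes the zero-argument shortcut, `call(int, "5")` reaches the fast path, and `call(MyInt)` is delegated to the slow path.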

Implementation details:

  • Adds support for the new decorator with arguments in the AC DSL Parser
  • Moves keyword argument parsing generation from inline to a function so that both the vectorcall (vc_) and existing paths can share code generation.
  • Adds an emit helper to simplify code a bit from existing AC cases

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

@cmaloney cmaloney changed the title gh-87613: Argument Cliic @vectorcall decorator gh-87613: Argument Clinic @vectorcall decorator Mar 1, 2026
@cmaloney cmaloney added performance Performance or resource usage and removed performance Performance or resource usage labels Mar 1, 2026
@cmaloney cmaloney changed the title gh-87613: Argument Clinic @vectorcall decorator gh-87613: Argument Clinic vectorcall decorator Mar 1, 2026
Member

@corona10 corona10 left a comment

Could you replace the current hand-written implementations with your new DSL?

Let's see how it handles them.

@cmaloney
Contributor Author

cmaloney commented Mar 1, 2026

I have commits to do that in my draft branch (https://github.com/python/cpython/compare/main...cmaloney:cpython:ac_vectorcall_v1?expand=0); can pull them into this branch if that would be easier / better to review. This generally produces code that is as fast or faster than the hand-written ones currently (full benchmarking in: #87613 (comment))

@cmaloney
Contributor Author

cmaloney commented Mar 1, 2026

Added commits moving enum.c (reversed, enumerate) and tuple to the new decorator. enum.c had comments pointing to this issue and covers positional + keyword arguments. tuple uses the "zero arg" optimization and has no keyword args. None of those cases use the __init__ code path; the only cases that do are the new bytearray and list, which is otherwise very similar to tuple. Hoping those serve as a good sample of what the generated code looks like relative to the hand-written code while iterating; happy to include more in this PR if desired.

#undef KWTUPLE
PyObject *argsbuf[2];
Py_ssize_t noptargs = nargs + (kwnames ? PyTuple_GET_SIZE(kwnames) : 0) - 1;
args = _PyArg_UnpackKeywords(args, nargs, NULL, kwnames,
Contributor Author

Evaluating direct keyword argument parsing for these / what the code change is relative to the performance change.

For most of the hand-written vectorcall implementations there aren't a lot of keyword arguments, which I think is part of why the performance is equal to the existing code, making this a simplifying refactor. Wondering whether explicit keyword argument parsing could be quite a bit faster.
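
For context on what keyword parsing has to do here: under the Vectorcall Protocol, the argument array holds the positional values first, followed by one value per entry in `kwnames`. The sketch below models that layout in Python; `split_vectorcall_args` is a hypothetical helper written for illustration and is not part of CPython or Argument Clinic.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def split_vectorcall_args(
    args: Sequence[Any],
    nargs: int,
    kwnames: Optional[Tuple[str, ...]],
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    """Split a vectorcall-style argument array into positional and keyword parts.

    `args` holds `nargs` positional values followed by one value per name
    in `kwnames`; `kwnames` is None when no keywords were passed.
    """
    names = kwnames or ()
    if len(args) != nargs + len(names):
        raise ValueError("args must hold nargs positionals plus one value per keyword")
    positional = tuple(args[:nargs])
    # Keyword values sit after the positionals, in the same order as kwnames.
    keywords = dict(zip(names, args[nargs:]))
    return positional, keywords


# Models a call shaped like enumerate(["a", "b"], start=1).
positional, keywords = split_vectorcall_args((["a", "b"], 1), 1, ("start",))
```

A generic parser has to match each name in `kwnames` against the parameter names, which is the work that direct keyword parsing would try to shortcut.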

Contributor Author

@cmaloney cmaloney Mar 2, 2026

The generated code is a little cleaner to read, but the performance change is negative if anything in my first attempt here; can pull it in if needed, but my leaning is to focus on iterative improvement.

Also found another optimization in str: doing a one-arg override, much like zero_arg, rather than its generic dispatch does make a bit of a perf difference, but I think that is a good additional step to add later when expanding to the str type.

@corona10
Member

corona10 commented Mar 2, 2026

I will take a look at this PR by the end of this week.

@skirpichev
Member

happy to include more in this PR if desired.

I would suggest at least thinking about complexobject.c. Based on benchmarks for the float PR (#22432) I would expect a good performance boost (maybe not 1.5x, but more than from the freelist addition).

Yes, this case seems to be already covered by the enum.c example (kwargs).

On the other hand, the complex class has special hacks to support multiple signatures (complex('123') is allowed, while complex(real='123') is not). Maybe it's not the only case, but I can't quickly find others across the CPython codebase. I suspect that AC magic will not work in this case and we will need some workarounds somewhere (well, maybe just one hand-written case). Though, it would be great if you disproved this hypothesis.

This generally produces code that is as fast or faster than the hand-written ones currently (full benchmarking in: #87613 (comment))

Still, there are some regressions, e.g. int(str). Could you explain this difference?

I also suggest trying pyperformance on this.

@cmaloney
Contributor Author

cmaloney commented Mar 2, 2026

I would suggest at least thinking about complexobject.c. Based on benchmarks for the float PR (#22432) I would expect a good performance boost (maybe not 1.5x, but more than from the freelist addition).

Will implement it in my draft branch this week. As part of developing this PR I added Vectorcall Protocol support to bytes (cmaloney@f5c7b7c) and bytearray (cmaloney@7de2ab7). With just two small changes (1. add @vectorcall, 2. set .tp_vectorcall), construction is 1.09x to 1.23x faster. Multiply that speedup across the many AC-implemented types without vectorcall construction and I definitely get excited.

Still, there are some regressions, e.g. int(str). Could you explain this difference?

The hand-written int vectorcall implementation, long_vectorcall, is a particularly elegant switch:

cpython/Objects/longobject.c

Lines 6539 to 6559 in c9a5d9a

long_vectorcall(PyObject *type, PyObject * const*args,
                size_t nargsf, PyObject *kwnames)
{
    Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
    if (kwnames != NULL) {
        PyThreadState *tstate = PyThreadState_GET();
        return _PyObject_MakeTpCall(tstate, type, args, nargs, kwnames);
    }
    switch (nargs) {
        case 0:
            return _PyLong_GetZero();
        case 1:
            return PyNumber_Long(args[0]);
        case 2:
            return long_new_impl(_PyType_CAST(type), args[0], args[1]);
        default:
            return PyErr_Format(PyExc_TypeError,
                                "int expected at most 2 arguments, got %zd",
                                nargs);
    }
}

That switch specializes the 1-argument case to call PyNumber_Long instead of long_new_impl, which matches a very similar performance delta I investigated yesterday in the hand-written vectorcall for str. I'm worried the hand-written version is faster because the compiler's optimizer is doing clever things around the switch form. Adding support for a one_arg special case will need more code in the AC implementation. Overall, I'm not sure it's actually worth replacing the hand-written int vectorcall with an AC-generated version.
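
One rough way to see how the three arities of that switch compare is to time each call shape separately. The snippet below is a minimal sketch using the standard library `timeit` module; `time_int_paths` is a hypothetical helper, and `timeit` is noisier than a dedicated benchmarking harness, so treat the numbers as indicative only.

```python
import timeit
from typing import Dict


def time_int_paths(number: int = 200_000, repeat: int = 5) -> Dict[str, float]:
    """Return the best-of-`repeat` seconds per call for each int() arity."""
    statements = {
        "int()": "int()",                # case 0 in the switch
        "int('5')": "int('5')",          # case 1
        "int('5', 10)": "int('5', 10)",  # case 2
    }
    results: Dict[str, float] = {}
    for label, stmt in statements.items():
        # Take the minimum over several runs to reduce scheduling noise.
        best = min(timeit.repeat(stmt, number=number, repeat=repeat))
        results[label] = best / number
    return results


if __name__ == "__main__":
    for label, seconds in time_int_paths().items():
        print(f"{label:>14}: {seconds * 1e9:.1f} ns per call")
```

Running it on builds with and without a given change gives a quick per-arity comparison before reaching for a full benchmark suite.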

I'm comparing to the hand-written versions because I want @vectorcall to be as good as I can get it when people try it out on a type they care about. Added to two types without vectorcall construction, bytes and bytearray, it provides a measurable speedup for a two-line code change. I think correctness of generated code, maintainability of the new decorator, and providing a speedup for types with no vectorcall today is a lot of benefit even if it's not quite as fast as hand-written expert code. If adopting the new decorator on an AC type is really low-cost for a significant performance gain, that will lead to speedy adoption and a speedier CPython.

I also suggest trying pyperformance on this.

Will run on this PR as it exists currently.

I can also run it on my draft branch, but I'm not sure that will give a clear signal, as it migrates every hand-written vectorcall even if that makes them slower. Ideally I would be able to figure out which types are commonly constructed in pyperformance benchmarks so I can make a draft branch adding vectorcall support to those. Not sure what the most important set of types to migrate would be before running pyperformance.

@erlend-aasland
Contributor

Did you consider adding this implicitly if supported, instead of making it opt-in? Disclaimer: I didn't take a look at the implementation yet.

Comment on lines +305 to +307
self.vectorcall = False
self.vectorcall_exact_only = False
self.vectorcall_zero_arg = ''
Contributor

I wonder if we should collect these in a "vectorcall config dataclass". The stuff in this file is already so cluttered with tons of class members and local variables.
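
A minimal sketch of what such a grouping might look like, assuming the three attributes quoted above are the only vectorcall state. `VectorcallConfig` and its field names are hypothetical and do not come from the PR.

```python
from dataclasses import dataclass


@dataclass
class VectorcallConfig:
    """Hypothetical container for the three attributes quoted above."""

    enabled: bool = False      # replaces self.vectorcall
    exact_only: bool = False   # replaces self.vectorcall_exact_only
    zero_arg: str = ""         # replaces self.vectorcall_zero_arg (a C function name)


# Defaults match the initial values in the quoted lines.
default_config = VectorcallConfig()
```

The parser state would then carry a single `VectorcallConfig` member instead of three separate ones.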


4 participants