gh-87613: Argument Clinic vectorcall decorator #145381
cmaloney wants to merge 4 commits into python:main
Add `@vectorcall` as a decorator to Argument Clinic (AC) which generates a new [Vectorcall Protocol](https://docs.python.org/3/c-api/call.html#the-vectorcall-protocol) argument parsing C function named `{}_vectorcall`. This is currently only supported for `__new__` and `__init__` to simplify the implementation.

The generated code performs similarly to or better than the existing hand-written cases for `list`, `float`, `str`, `tuple`, `enumerate`, `reversed`, and `int`. Using the decorator added vectorcall to `bytearray`, which has no hand-written case, and construction got 1.09x faster. For more details see the comments in gh-87613.

The `@vectorcall` decorator has two options:

- **zero_arg={C_FUNC}**: Some types, like `int`, can be called with zero arguments and return an immortal object in that case. Adding a shortcut is needed to match existing hand-written performance; it provides a performance change of over 10% for those cases.
- **exact_only**: If the type is not an exact match, delegate to the existing non-vectorcall implementation. Needed for `str` to get matching performance while ensuring correct behavior.

Implementation details:

- Adds support for the new decorator with arguments in the AC DSL parser.
- Moves keyword argument parsing generation from inline to a function so both vectorcall (`vc_`) and the existing code can share code generation.
- Adds an `emit` helper to simplify the code a bit compared with existing AC cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
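To make the two options concrete, here is a rough Python model of the dispatch order a generated `{}_vectorcall` entry point might follow. It is only a sketch: the real output is C, the helper names (`make_vectorcall`, `parse_and_construct`, `fallback`) are hypothetical, and the order in which the `exact_only` and `zero_arg` checks run is an assumption rather than something taken from the PR.

```python
from typing import Any, Callable, Optional


def make_vectorcall(
    exact_type: type,
    parse_and_construct: Callable[..., Any],
    fallback: Callable[..., Any],
    zero_arg: Optional[Callable[[], Any]] = None,
    exact_only: bool = False,
) -> Callable[..., Any]:
    """Build a Python model of a vectorcall entry point (hypothetical helper)."""

    def vectorcall(cls: type, args: tuple, kwnames: tuple = ()) -> Any:
        # Vectorcall layout: positional values first, then keyword values,
        # with the keyword names carried in a separate tuple.
        nargs = len(args) - len(kwnames)
        positional = args[:nargs]
        keywords = dict(zip(kwnames, args[nargs:]))

        # exact_only: subclasses go through the ordinary non-vectorcall path.
        # (Assumed to be checked first; the PR does not state the order.)
        if exact_only and cls is not exact_type:
            return fallback(cls, *positional, **keywords)

        # zero_arg: shortcut for a call with no arguments at all.
        if zero_arg is not None and nargs == 0 and not kwnames:
            return zero_arg()

        # General case: parse the arguments and construct the object.
        return parse_and_construct(cls, *positional, **keywords)

    return vectorcall
```

In this model, a call such as `vc(int, ())` takes the `zero_arg` shortcut, while a call on a subclass of `int` is routed to `fallback` when `exact_only` is set.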
corona10 left a comment
Could you replace the current hand-written implementations with your new DSL?
Let's see how it handles them.
I have commits to do that in my draft branch (https://github.com/python/cpython/compare/main...cmaloney:cpython:ac_vectorcall_v1?expand=0); I can pull them into this branch if that would be easier or better to review. This generally produces code that is as fast as or faster than the current hand-written implementations (full benchmarking in: #87613 (comment)).
Added commits moving |
    #undef KWTUPLE
    PyObject *argsbuf[2];
    Py_ssize_t noptargs = nargs + (kwnames ? PyTuple_GET_SIZE(kwnames) : 0) - 1;
    args = _PyArg_UnpackKeywords(args, nargs, NULL, kwnames,
I'm evaluating direct keyword argument parsing for these, and how the size of the code change compares with the performance change.
Most of the hand-written vectorcall implementations don't take many keyword arguments, which I think is part of why performance is equal to the existing code, making this a simplifying refactor. I'm wondering whether explicit keyword argument parsing could make this quite a bit faster.
The generated code is a little cleaner to read, but in my first attempt here the performance change is negative if anything. I can pull it in if needed, but I'm leaning toward focusing on iterative improvement.
I also found another optimization in `str`: adding a one-argument override, much like `zero_arg`, rather than going through its generic dispatch does change performance a bit. I think that is a good additional step to add later when expanding to the `str` type.
I will take a look at this PR by the end of this week.
I would like to suggest that you at least think about complexobject.c. Based on the benchmarks for the float PR (#22432), I would expect a good performance boost (maybe not 1.5x, but more than from the freelist addition). Yes, this case seems to be already covered by the enum.c example (kwargs). On the other hand, the complex class has special hacks to support multiple signatures (
Still, there are some regressions, e.g. int(str). Could you explain this difference? I also suggest that you try pyperformance on this.
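As a quick local check before a full pyperformance run, a constructor micro-benchmark along these lines could be used to compare two builds. This is only a rough sketch built on the standard-library `timeit` module; the helper name `bench_constructors` and the list of statements are illustrative choices, and it is not a substitute for pyperformance or for the benchmarks linked in the issue.

```python
import timeit


def bench_constructors(statements, number=100_000, repeat=5):
    """Return the best observed time per call, in nanoseconds, per statement."""
    results = {}
    for stmt in statements:
        # Take the minimum over several repeats to reduce scheduling noise.
        timings = timeit.repeat(stmt, number=number, repeat=repeat)
        results[stmt] = min(timings) / number * 1e9
    return results


if __name__ == "__main__":
    # Illustrative constructor calls, including the int(str) case.
    statements = ["int()", "int('42')", "str(1)", "bytearray()"]
    for stmt, ns in bench_constructors(statements).items():
        print(f"{stmt:<14} {ns:8.1f} ns per call")
```

Running the same script under the base build and the PR build gives a first impression of which constructor calls moved, though the numbers are noisy and depend on the machine.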
Will implement it in my draft branch this week. As part of developing this PR I added Vectorcall Protocol support to
The Lines 6539 to 6559 in c9a5d9a That switch specializes 1-argument to call I'm comparing to the handwritten because I want
I will run it on this PR as it currently exists. I can also run it on my draft branch, but I'm not sure that will give a clear signal, as it migrates every hand-written vectorcall even if that makes them slower. Ideally, I would be able to figure out which types are commonly constructed in the pyperformance benchmarks so I can make a draft branch adding
Did you consider adding this implicitly where supported, instead of making it opt-in? Disclaimer: I haven't looked at the implementation yet.
    self.vectorcall = False
    self.vectorcall_exact_only = False
    self.vectorcall_zero_arg = ''
I wonder if we should collect these in a "vectorcall config dataclass". The stuff in this file is already so cluttered with tons of class members and local variables.
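One possible shape for such a grouping is sketched below. This is a hypothetical illustration of the suggestion, not code from the PR: the class name `VectorcallConfig` and its field names are assumptions that mirror the three attributes in the diff above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorcallConfig:
    """Hypothetical container for the @vectorcall decorator settings."""

    enabled: bool = False     # mirrors self.vectorcall
    exact_only: bool = False  # mirrors self.vectorcall_exact_only
    zero_arg: str = ""        # mirrors self.vectorcall_zero_arg (a C function name)

    @property
    def has_zero_arg(self) -> bool:
        # True when a zero-argument shortcut function was named.
        return bool(self.zero_arg)
```

The parser state would then hold a single attribute, for example `self.vectorcall = VectorcallConfig()`, in place of three separate members.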