Argument Clinic: inline parsing code for functions with keyword parameters #80308
This is a follow-up of bpo-23867 and bpo-35582. The proposed PR makes Argument Clinic inline the parsing code for functions with keyword parameters, i.e. functions that currently use _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). This saves the time spent parsing format strings and calling through a few levels of functions.
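To make the proposed inlining concrete, here is a minimal sketch for a hypothetical method foo(x, *, flag=0); both wrappers and all names are illustrative, not the actual Argument Clinic output. The old path hands a format string to a generic parser at run time; the new path performs the argument checks and conversions directly in the generated wrapper.

    #include <Python.h>

    /* Hypothetical implementation function shared by both wrappers
       (placeholder body, for the sketch only). */
    static PyObject *
    foo_impl(PyObject *self, PyObject *x, int flag)
    {
        (void)self; (void)flag;
        Py_INCREF(x);
        return x;
    }

    /* Old style: a generic parser decodes the format string "O|$i:foo"
       at run time and goes through several helper calls. */
    static PyObject *
    foo_old(PyObject *self, PyObject *args, PyObject *kwargs)
    {
        PyObject *x;
        int flag = 0;
        static char *kwlist[] = {"x", "flag", NULL};
        if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O|$i:foo",
                                         kwlist, &x, &flag)) {
            return NULL;
        }
        return foo_impl(self, x, flag);
    }

    /* New style (conceptually): the checks and conversions are emitted
       inline, so no format string is decoded and fewer functions are
       called on the fast path.  Keyword handling is elided for brevity. */
    static PyObject *
    foo_new(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
    {
        PyObject *x;
        int flag = 0;
        if (nargs != 1) {
            PyErr_SetString(PyExc_TypeError,
                            "foo() takes exactly one positional argument");
            return NULL;
        }
        x = args[0];
        return foo_impl(self, x, flag);
    }

The point of the sketch is only the shape of the control flow: the format-string decoding and the intermediate helper calls disappear from the hot path.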
Some examples:

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "round_ = round" "round_(4.2)"
Mean +- std dev: [...] 110 ns +- 3 ns -> [...] 81.4 ns +- 2.2 ns: 1.35x faster (-26%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum" "sum_(())"
Mean +- std dev: [...] 88.0 ns +- 6.5 ns -> [...] 57.6 ns +- 1.1 ns: 1.53x faster (-35%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum; a = [1, 2]" "sum_(a)"
Mean +- std dev: [...] 95.9 ns +- 2.1 ns -> [...] 70.6 ns +- 2.0 ns: 1.36x faster (-26%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "'abc'.split()"
Mean +- std dev: [...] 102 ns +- 3 ns -> [...] 80.5 ns +- 2.1 ns: 1.26x faster (-21%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "b'abc'.split()"
Mean +- std dev: [...] 91.8 ns +- 2.3 ns -> [...] 75.1 ns +- 1.4 ns: 1.22x faster (-18%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "'abc'.split('-')"
Mean +- std dev: [...] 118 ns +- 2 ns -> [...] 89.2 ns +- 1.8 ns: 1.32x faster (-24%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "b'abc'.decode()"
Mean +- std dev: [...] 96.1 ns +- 3.6 ns -> [...] 78.9 ns +- 2.2 ns: 1.22x faster (-18%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "'abc'.encode()"
Mean +- std dev: [...] 72.4 ns +- 1.9 ns -> [...] 55.1 ns +- 1.8 ns: 1.31x faster (-24%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_= int" "int(4.2)"
Mean +- std dev: [...] 105 ns +- 4 ns -> [...] 78.8 ns +- 1.9 ns: 1.33x faster (-25%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_= int" "int('5')"
Mean +- std dev: [...] 154 ns +- 5 ns -> [...] 122 ns +- 4 ns: 1.26x faster (-21%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "42 .to_bytes(2, 'little')
Mean +- std dev: [...] 109 ns +- 3 ns -> [...] 72.4 ns +- 1.9 ns: 1.51x faster (-34%)
$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "from_bytes = int.from_bytes" "from_bytes(b'ab', 'little')"
Mean +- std dev: [...] 138 ns +- 3 ns -> [...] 96.3 ns +- 3.0 ns: 1.43x faster (-30%) |
That's a lot faster, and it will be great if these changes make it into the next alpha. :) Adding Ammar Askar since they reviewed the positional-arguments work.
How much bigger do the core interpreter and the built-in extension modules get when you make this change? How much more memory is used by real-world programs? I'm a little concerned when I see individual functions growing by 140 lines in the first file of the diff.

Clearly the increase in memory usage wasn't enough to slow down the microbenchmarks (which can probably fit the entire hot code path in the CPU's on-die cache), but I'd be worried about the aggregate effect on real-world programs if we go from a handful of argument-parsing code paths reused by every function (and therefore kept constantly hot) to hundreds (or thousands?) of argument-parsing code paths, each one unique to a single function. It could easily look great when a single function is being called repeatedly, while looking much less great (possibly worse) when many varied function calls are interleaved. It should be tested on a number of systems too; any losses to cache unfriendliness would depend heavily on the size of the CPU cache.

I'll grant that inlining positional-argument parsing doesn't seem to have caused problems, and it looks like we're still using _PyArg_UnpackKeywords, so argument parsing isn't completely devolved to each function, but I think we need something more than microbenchmarks before we jump on this. If my worries end up being unfounded, I'll be happy. It looks very cool if we can really get that sort of speedup for all function calls, not just positional-args-only functions. :-)
Well, any optimization is a trade-off between memory and CPU. In recent years CPUs have not really been getting much faster (especially when you consider that Python is usually only able to use a single CPU thread), whereas computers are getting more and more RAM.
I prefer not to pay too much attention to assumptions about the hardware; I prefer to trust only benchmarks. :-)
Good questions, Josh! The size of the python binary has increased from 17494944 bytes (17085 KiB) to 17657112 bytes (17243 KiB) -- by about 1%.

I think this change cannot increase heap memory consumption, because the new code does not use the heap (the old code can allocate additional memory dynamically). As for stack memory, it is not so clear. On one side, the new code allocates an array on the stack for references to all parameters, and that memory stays in use until the function returns. On the other side, the old code allocates a lot of variables and fixed-size buffers, and creates several function frames, but that memory is released as soon as argument parsing finishes. I think that for non-recursive functions the new code has a smaller stack footprint (as long as the function has fewer than a few tens of parameters), but for recursive functions it can increase stack memory consumption. Although I do not know whether any of the affected functions is recursive.
As for the optimization's dependence on CPU cache size, I have repeated the microbenchmarks on a computer with a 6 MiB cache and on two computers with 512 KiB caches (64- and 32-bit):

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (cache size: 6144 KB) -- not significant (1): zlib_compress(b'abc')
AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (cache size: 512 KB)
Intel(R) Atom(TM) CPU N570 @ 1.66GHz (cache size: 512 KB), 32-bit

(The full perf comparison tables for these machines did not survive migration.) The speedup is significant on all computers.
About the stack memory usage: in the past, I used a subfunction tagged with _Py_NO_INLINE so that it works on temporary stack memory that is dropped as soon as it returns:

    void function()
    {
        subfunction(); /* uses temporary stack memory */
        /* don't waste stack memory here */
        ...
    }

I'm not sure whether such a pattern could be used here for things like "PyObject *argsbuf[12];". The problem is that argument parsing uses a lot of local variables allocated on the stack. In practice, it's more like:

    void function(args)
    {
        int x;
        parse_args(args, &x); /* uses temporary stack memory */
        /* don't waste stack memory here */
        ...
    }

I expect a long list of "&arg" where each arg is a local variable of function(). Well, that's basically the design of the current PyArg_ParseTuple() function family :-) PyArg_ParseTuple() does its work in private and uses more stack memory, but once PyArg_ParseTuple() returns, that stack memory is "released" simply because we exited the function.
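For what it's worth, a minimal sketch of how that pattern might combine with the argsbuf case, assuming hypothetical names baz, baz_impl and baz_parse_args, and using the _Py_NO_INLINE marker mentioned above; the real generated code would differ:

    #include <Python.h>

    /* Hypothetical implementation function (placeholder body). */
    static PyObject *
    baz_impl(PyObject *self, PyObject *x, int flag)
    {
        (void)self; (void)flag;
        Py_INCREF(x);
        return x;
    }

    /* Not inlined, so its stack frame -- including the large argsbuf
       array -- is released as soon as parsing is done.  Results are
       handed back through out-pointers, as in the second snippet above. */
    static int _Py_NO_INLINE
    baz_parse_args(PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames,
                   PyObject **x, int *flag)
    {
        PyObject *argsbuf[12];  /* large temporary lives only in this frame */
        (void)kwnames;
        /* Real code would route positional and keyword arguments through
           argsbuf and run the converters; elided here. */
        if (nargs < 1) {
            PyErr_SetString(PyExc_TypeError,
                            "baz() missing required argument 'x'");
            return 0;
        }
        argsbuf[0] = args[0];
        *x = argsbuf[0];
        *flag = 0;              /* default for the keyword-only parameter */
        return 1;
    }

    static PyObject *
    baz(PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)
    {
        PyObject *x;
        int flag;
        if (!baz_parse_args(args, nargs, kwnames, &x, &flag)) {
            return NULL;
        }
        /* argsbuf's stack space is already released at this point */
        return baz_impl(self, x, flag);
    }

The trade-off is the long list of "&arg" out-parameters Victor mentions, which is essentially the PyArg_ParseTuple() design again, only with per-function parsing code instead of a format string.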
In addition, this change can allow us to get rid of the large and complex functions _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). The former is no longer used in CPython, and the latter is still used in a few places to support some deprecated format codes for which I intentionally did not implement inlining. After getting rid of the uses of such codes (a patch is in progress) we could remove both of these functions.