Optimize calling type slots #74694
Comments
In his excellent article "Why are slots so slow?" [1], Peter Cawley analysed the causes of … In 3.7 the difference between the two ways is smaller, but … The proposed patch tweaks the code and makes …
Despite the fact that …
I have another patch that makes …
It seems like you are a victim of the "deadcode" issue related to code locality. To run microbenchmarks on such very tiny functions, taking less than 200 ns, it's more reliable to compile Python using LTO+PGO (e.g. ./configure --with-lto --enable-optimizations).
type-slot-calls.diff: Can you please create a pull request?
Hum, can you please post microbenchmark results to see the effect of the patch?
The article has two main points:
Yeah, it seems like the FASTCALL changes I made in typeobject.c removed the overhead of the temporary tuple. Yury's and Naoki's work on CALL_METHOD also improved performance of method calls here. I don't think that we can change the semantics, only try to optimize the implementation.
I provided just a patch because I expected that you might want to play with it and propose an alternative patch. It is simpler to compare patches with Rietveld than on GitHub. But if you prefer, I'll make a PR.
$ cat x.py
class A(object):
    def __add__(self, other):
        return 42
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a.__add__(b)'
Unpatched: Mean +- std dev: 256 ns +- 9 ns
Patched: Mean +- std dev: 255 ns +- 10 ns
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a + b'
Unpatched: Mean +- std dev: 332 ns +- 10 ns
Patched: Mean +- std dev: 286 ns +- 5 ns
It also makes other optimizations, like avoiding varargs and the creation of an intermediate method object. All of this is already applied as a side effect of your changes.
Since a and b have the same type, the complex operator-dispatch semantics don't play a role here.
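Those semantics do matter when the operands differ: the operator form looks the slot up on the type (not the instance) and gives a right operand that is a subclass the first shot via the reflected method. A minimal sketch of the difference (class names are illustrative, not from the patch):

```python
class A:
    def __add__(self, other):
        return "A.__add__"

class B(A):
    def __radd__(self, other):
        return "B.__radd__"

a, b = A(), B()

# An explicit method call always uses the left operand's method:
assert a.__add__(b) == "A.__add__"

# The + operator tries the subclass's reflected method first:
assert a + b == "B.__radd__"

# The operator also looks the slot up on the type, skipping the instance dict:
a2 = A()
a2.__add__ = lambda other: "instance"
assert a2.__add__(a) == "instance"   # explicit lookup finds the instance attribute
assert a2 + A() == "A.__add__"       # the slot ignores it
```

This is why the patch can only speed up the implementation, not shortcut the dispatch rules.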
It seems like Rietveld is broken: there is no [Review] button on your patch. I wouldn't be surprised if it has been broken since CPython migrated to Git.
The PR makes different changes:
If possible, I would prefer not to have to duplicate functions for 0, 1 and 2 parameters (3 variants). I would like to know which changes are responsible for the speedup. To ease the review, would it be possible to split your change into smaller changes? At least separate commits, maybe even a first "cleanup" PR before the "optimization" PR.
PR 1883 cleans up the code related to calling type slots.
Sorry, wrong data. PR 1883 makes indexing 1.2 times faster, PR 1861 makes it 1.7 times faster.
$ ./python -m perf timeit -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' --duplicate 100 'list(a)'
Unpatched: Mean +- std dev: 498 us +- 26 us
FYI you can use "./python -m perf timeit --compare-to=./python-ref" if you keep the "reference" Python (unpatched), so perf computes the "?.??x slower/faster" factor for you ;-)
Thank you, I know about this, but it takes twice as much time, so I don't use it regularly. And it doesn't allow comparing three versions. :-(
I believe Rietveld does not work with git-format patches. I don't know if git can produce the format hg did. |
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a.__add__(b)'
Mean +- std dev: [python-orig] 229 ns +- 9 ns -> [python] 235 ns +- 13 ns: 1.02x slower (+2%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a + b'
Mean +- std dev: [python-orig] 277 ns +- 10 ns -> [python] 251 ns +- 23 ns: 1.10x faster (-9%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = [A() for i in range(1000)]' 'sum(a, A())'
Mean +- std dev: [python-orig] 259 us +- 17 us -> [python] 218 us +- 16 us: 1.19x faster (-16%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' 'list(a)'
Mean +- std dev: [python-orig] 324 us +- 14 us -> [python] 300 us +- 16 us: 1.08x faster (-8%)
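For context on why list(a) stresses the __getitem__ slot in these benchmarks: when a class defines __getitem__ but not __iter__, iteration falls back to the legacy sequence protocol, calling the slot with indices 0, 1, 2, … until it raises IndexError. A small illustration (not part of the patch):

```python
class A:
    def __init__(self, data):
        self.data = data
        self.calls = 0

    def __getitem__(self, i):
        self.calls += 1
        return self.data[i]  # raises IndexError past the end, stopping iteration

a = A((10, 20, 30))
assert list(a) == [10, 20, 30]
assert a.calls == 4  # three successful slot calls plus the one raising IndexError
```

So each list(a) in the benchmark performs one slot call per element, which is exactly the path the patch optimizes.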
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __neg__(s): return s' -s 'a = A()' --duplicate=100 '(----------a)'
Mean +- std dev: [python-orig] 2.12 us +- 0.13 us -> [python] 1.91 us +- 0.11 us: 1.11x faster (-10%)
I'm not sure about adding Py_LOCAL_INLINE() (static inline). I'm not sure that it's needed when you use PGO compilation. Would it be possible to run your benchmark again without the added Py_LOCAL_INLINE(), please? It's hard to say no to a change that makes so many core Python functions faster. I'm just surprised that "specializing" the call_unbound and call_method functions makes the code up to 1.2x faster.
Without Py_LOCAL_INLINE all microbenchmarks become about 20% slower. I'm not sure that all these changes are needed. Maybe the same effect can be achieved with smaller changes, but so far I have tried and failed to get the same performance from a smaller patch. Maybe you will be luckier. Note that even with this patch type slots are still about 5% slower than ordinary methods (despite the fact that using operators needs fewer bytecode instructions than calling a method). There is some additional overhead.
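The "fewer bytecode instructions" observation is easy to check with the dis module. The exact opcode names vary across CPython versions, but the operator form consistently compiles to fewer instructions than the explicit method call:

```python
import dis

def via_operator(a, b):
    return a + b

def via_method(a, b):
    return a.__add__(b)

op_names = [ins.opname for ins in dis.get_instructions(via_operator)]
meth_names = [ins.opname for ins in dis.get_instructions(via_method)]

# The operator compiles to a single BINARY_* instruction, while the
# method call needs an attribute lookup plus a call instruction.
assert len(op_names) < len(meth_names)
```

Despite executing fewer instructions, the operator path was slower before the patch, which is the overhead this issue is about.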
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.