Optimize calling type slots #74694
Comments
In his excellent article "Why are slots so slow?" [1], Peter Cawley analysed the causes of … In 3.7 the difference between the two ways is smaller, but … The proposed patch tweaks the code and makes …
Despite the fact that …
I have another patch that makes …
It seems like you are a victim of the "deadcode" issue related to code locality. To run microbenchmarks on such very tiny functions, taking less than 200 ns, it's more reliable to compile Python using LTO+PGO (e.g. ./configure --with-lto --enable-optimizations).
type-slot-calls.diff: Can you please create a pull request?
Hum, can you please post microbenchmark results to see the effect of the patch?
The article has two main points:
Yeah, it seems like the FASTCALL changes I made in typeobject.c removed the overhead of the temporary tuple. Yury's and Naoki's work on CALL_METHOD also improved performance of method calls here. I don't think that we can change the semantics, only try to optimize the implementation.
I provided just a patch because I expected that you might want to play with it and propose an alternative patch. It is simpler to compare patches with Rietveld than on GitHub. But if you prefer, I'll make a PR.
$ cat x.py
class A(object):
    def __add__(self, other):
        return 42
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a.__add__(b)'
Unpatched: Mean +- std dev: 256 ns +- 9 ns
Patched: Mean +- std dev: 255 ns +- 10 ns
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a + b'
Unpatched: Mean +- std dev: 332 ns +- 10 ns
Patched: Mean +- std dev: 286 ns +- 5 ns
It also makes other optimizations, like avoiding varargs and the creation of an intermediate method object. All of this is already applied as a side effect of your changes.
Since a and b have the same type, the complex operator-dispatch semantics don't play a role here.
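Those semantics do matter when the operands differ: the operator form looks the slot up on the type (not the instance) and gives a right operand that is a subclass the first shot via the reflected method. A minimal sketch of the difference (class names are illustrative, not from the patch):

```python
class A:
    def __add__(self, other):
        return "A.__add__"

class B(A):
    def __radd__(self, other):
        return "B.__radd__"

a, b = A(), B()

# An explicit method call always uses the left operand's method:
assert a.__add__(b) == "A.__add__"

# The + operator tries the subclass's reflected method first:
assert a + b == "B.__radd__"

# The operator also looks the slot up on the type, skipping the instance dict:
a2 = A()
a2.__add__ = lambda other: "instance"
assert a2.__add__(a) == "instance"   # explicit lookup finds the instance attribute
assert a2 + A() == "A.__add__"       # the slot ignores it
```

This is why the patch can only speed up the implementation, not shortcut the dispatch rules.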
It seems like Rietveld is broken: there is no [Review] button on your patch. I wouldn't be surprised if it has been broken since CPython migrated to Git.
The PR makes different changes:
If possible, I would prefer not to have to duplicate functions for 0, 1 and 2 parameters (3 variants). I would like to know which changes are responsible for the speedup. To ease the review, would it be possible to split your change into smaller changes? At least separate commits, maybe even a first "cleanup" PR before the "optimization" PR.
PR 1883 cleans up the code related to calling type slots.
Sorry, wrong data. PR 1883 makes indexing 1.2 times faster, PR 1861 makes it 1.7 times faster.
$ ./python -m perf timeit -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' --duplicate 100 'list(a)'
Unpatched: Mean +- std dev: 498 us +- 26 us
FYI you can use "./python -m perf timeit --compare-to=./python-ref" if you keep the "reference" Python (unpatched), so perf computes the "?.??x slower/faster" factor for you ;-)
Thank you, I know about this, but it takes twice as much time, so I don't use it regularly. And it doesn't allow comparing three versions. :-(
I believe Rietveld does not work with git-format patches. I don't know if git can produce the format hg did. |
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a.__add__(b)'
Mean +- std dev: [python-orig] 229 ns +- 9 ns -> [python] 235 ns +- 13 ns: 1.02x slower (+2%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a + b'
Mean +- std dev: [python-orig] 277 ns +- 10 ns -> [python] 251 ns +- 23 ns: 1.10x faster (-9%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = [A() for i in range(1000)]' 'sum(a, A())'
Mean +- std dev: [python-orig] 259 us +- 17 us -> [python] 218 us +- 16 us: 1.19x faster (-16%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' 'list(a)'
Mean +- std dev: [python-orig] 324 us +- 14 us -> [python] 300 us +- 16 us: 1.08x faster (-8%)
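For context on why list(a) stresses the __getitem__ slot in these benchmarks: when a class defines __getitem__ but not __iter__, iteration falls back to the legacy sequence protocol, calling the slot with indices 0, 1, 2, … until it raises IndexError. A small illustration (not part of the patch):

```python
class A:
    def __init__(self, data):
        self.data = data
        self.calls = 0

    def __getitem__(self, i):
        self.calls += 1
        return self.data[i]  # raises IndexError past the end, stopping iteration

a = A((10, 20, 30))
assert list(a) == [10, 20, 30]
assert a.calls == 4  # three successful slot calls plus the one raising IndexError
```

So each list(a) in the benchmark performs one slot call per element, which is exactly the path the patch optimizes.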
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __neg__(s): return s' -s 'a = A()' --duplicate=100 '(----------a)'
Mean +- std dev: [python-orig] 2.12 us +- 0.13 us -> [python] 1.91 us +- 0.11 us: 1.11x faster (-10%)
I'm not sure about adding Py_LOCAL_INLINE() (static inline). I'm not sure that it's needed when you use PGO compilation. Would it be possible to run your benchmark again without the added Py_LOCAL_INLINE(), please? It's hard to say no to a change that makes so many core Python functions faster. I'm just surprised that "specializing" the call_unbound and call_method functions makes the code up to 1.2x faster.
Without Py_LOCAL_INLINE all microbenchmarks become about 20% slower. I'm not sure that all these changes are needed. Maybe the same effect can be achieved with smaller changes, but so far I have tried and failed to get the same performance from a smaller patch. Maybe you will be luckier. Note that even with this patch type slots are still about 5% slower than ordinary methods (despite the fact that using operators needs fewer bytecode instructions than calling a method). There is some additional overhead.
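The "fewer bytecode instructions" observation is easy to check with the dis module. The exact opcode names vary across CPython versions, but the operator form consistently compiles to fewer instructions than the explicit method call:

```python
import dis

def via_operator(a, b):
    return a + b

def via_method(a, b):
    return a.__add__(b)

op_names = [ins.opname for ins in dis.get_instructions(via_operator)]
meth_names = [ins.opname for ins in dis.get_instructions(via_method)]

# The operator compiles to a single BINARY_* instruction, while the
# method call needs an attribute lookup plus a call instruction.
assert len(op_names) < len(meth_names)
```

Despite executing fewer instructions, the operator path was slower before the patch, which is the overhead this issue is about.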
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.