
Optimize calling type slots #74694

Open
serhiy-storchaka opened this issue May 30, 2017 · 17 comments
Labels
3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Member

BPO 30509
Nosy @terryjreedy, @pitrou, @vstinner, @serhiy-storchaka
PRs
  • bpo-30509: Optimize and clean up calling type slots. #1861
  • bpo-30509: Clean up calling type slots. #1883
  • Files
  • type-slot-calls.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2017-05-30.05:56:40.654>
    labels = ['interpreter-core', '3.8', 'performance']
    title = 'Optimize calling type slots'
    updated_at = <Date 2019-05-17.22:24:57.943>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2019-05-17.22:24:57.943>
    actor = 'cheryl.sabella'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2017-05-30.05:56:40.654>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['46913']
    hgrepos = []
    issue_num = 30509
    keywords = ['patch']
    message_count = 17.0
    messages = ['294739', '294740', '294742', '294749', '294750', '294757', '294762', '294763', '294817', '294839', '294843', '294848', '294905', '295062', '295953', '295963', '295984']
    nosy_count = 4.0
    nosy_names = ['terry.reedy', 'pitrou', 'vstinner', 'serhiy.storchaka']
    pr_nums = ['1861', '1883']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue30509'
    versions = ['Python 3.8']

    @serhiy-storchaka
    Member Author

    Peter Cawley's excellent article "Why are slots so slow?" [1] analyses why a + b is slower than a.__add__(b) for a custom __add__ and makes suggestions for optimizing type slot calls. Since a + b and a.__add__(b) execute the same user code, a + b should have the smaller bytecode-interpreting overhead, yet it was 2 times slower than a.__add__(b). In the article, a + b was made 16% faster than a.__add__(b).

    In 3.7 the difference between the two ways is smaller, but a + b is still 25-30% slower than a.__add__(b). After analyzing the article and comparing it with the current code, I have found that virtually all the proposed optimization steps have already been applied in 3.7 by Victor! The difference is only in details.

    The proposed patch tweaks the code and makes a + b only 12% slower than a.__add__(b). There is a similar effect for other type slot calls.

    [1] https://www.corsix.org/content/why-are-slots-so-slow
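The comparison discussed above can be reproduced with a minimal, illustrative benchmark (the class here is a stand-in, not code from the patch). Both expressions run exactly the same user-defined `__add__`; any timing difference is pure dispatch overhead:

```python
import timeit

class A:
    def __add__(self, other):
        return 42  # both call paths execute exactly this user code

a, b = A(), A()

# Same result either way; only the dispatch machinery differs.
assert a + b == a.__add__(b) == 42

# Time the slot path (BINARY_ADD) against the explicit method call.
slot_time = timeit.timeit("a + b", globals={"a": a, "b": b}, number=100_000)
method_time = timeit.timeit("a.__add__(b)", globals={"a": a, "b": b}, number=100_000)
print(f"a + b:        {slot_time:.4f}s")
print(f"a.__add__(b): {method_time:.4f}s")
```

On unpatched interpreters of that era, the `a + b` line is the slower of the two, which is the anomaly the issue targets.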

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 30, 2017
    @serhiy-storchaka
    Member Author

    Despite the fact that a + b is still slower than a.__add__(b) in 3.7, it is more than 2 times faster than in 2.7 and 3.5.

    @serhiy-storchaka
    Member Author

    I have another patch that makes a + b tens of percent faster still, faster than a.__add__(b), by adding functions and declarations that are never used. Confusing.

    @vstinner
    Member

    I have another patch that makes a + b tens of percent faster still, faster than a.__add__(b), by adding functions and declarations that are never used. Confusing.

    It seems like you are a victim of the "deadcode" issue related to code locality:
    https://haypo.github.io/journey-to-stable-benchmark-deadcode.html

    To run microbenchmarks on such very tiny functions taking less than 200 ns, it's more reliable to compile Python using LTO+PGO.

    @vstinner
    Member

    type-slot-calls.diff: Can you please create a pull request?

    a + b still is 25-30% slower than a.__add__(b)

    Hum, can you please post microbenchmark results to see the effect of the patch?

    After analyzing the article and comparing it with the current code I have found that virtually all proposed optimization steps already applied in 3.7 by Victor! The difference is only in details.

    The article has two main points:

    • the calling convention of the Python C API requires creating a tuple, and that's expensive
    • "a + b" has complex semantics, requiring checks for __radd__, issubclass(), etc.

    Yeah, it seems like the FASTCALL changes I made in typeobject.c removed the overhead of the temporary tuple. Yury's and Naoki's work on CALL_METHOD also improved the performance of method calls here.

    I don't think that we can change the semantics, only try to optimize the implementation.
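As a rough illustration of the second point, the dispatch rules for `a + b` can be sketched in pure Python. This is a simplified model, not CPython's actual C implementation (which works through the `nb_add` slots), but it shows why the operator must do more work than a plain method call:

```python
def binary_add(a, b):
    """Pure-Python sketch of the a + b dispatch rules (CPython does this in C)."""
    ta, tb = type(a), type(b)
    tried_radd = False
    # A proper subclass of the left type that overrides __radd__ gets first try.
    if tb is not ta and issubclass(tb, ta) and "__radd__" in vars(tb):
        result = tb.__radd__(b, a)
        tried_radd = True
        if result is not NotImplemented:
            return result
    # Then the left operand's __add__.
    if hasattr(ta, "__add__"):
        result = ta.__add__(a, b)
        if result is not NotImplemented:
            return result
    # Then the right operand's __radd__, unless it was already tried.
    if tb is not ta and not tried_radd and hasattr(tb, "__radd__"):
        result = tb.__radd__(b, a)
        if result is not NotImplemented:
            return result
    raise TypeError(f"unsupported operand type(s) for +: "
                    f"{ta.__name__!r} and {tb.__name__!r}")
```

By contrast, `a.__add__(b)` performs a single attribute lookup and one call, with none of the subclass or reflected-method checks.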

    @serhiy-storchaka
    Member Author

    type-slot-calls.diff: Can you please create a pull request?

    I provided just a patch because I expected that you might want to play with it and propose an alternative patch. It is simpler to compare patches with Rietveld than on GitHub. But if you prefer, I'll make a PR.

    Hum, can you please post a microbenchmark results to see the effect of the patch?

    $ cat x.py
    class A(object):
        def __add__(self, other):
            return 42
    
    $ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a.__add__(b)'
    Unpatched:  Mean +- std dev: 256 ns +- 9 ns
    Patched:    Mean +- std dev: 255 ns +- 10 ns
    
    $ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a + b'
    Unpatched:  Mean +- std dev: 332 ns +- 10 ns
    Patched:    Mean +- std dev: 286 ns +- 5 ns
    • the calling convention of the Python C API requires creating a tuple, and that's expensive

    It also makes other optimizations, like avoiding varargs and the creation of an intermediate method object. All of this is already applied as a side effect of your changes.

    • "a + b" has complex semantics, requiring checks for __radd__, issubclass(), etc.

    Since a and b have the same type, the complex semantics don't play a role here.
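This is easy to verify: when both operands have exactly the same type, Python only consults `__add__` and never falls back to `__radd__` (the class below is illustrative):

```python
class A:
    def __add__(self, other):
        return 42

    def __radd__(self, other):
        # For same-type operands, this reflected method is never reached.
        raise AssertionError("__radd__ consulted for same-type operands")

a, b = A(), A()
assert a + b == 42  # only __add__ runs; no AssertionError is raised
```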

    @vstinner
    Member

    I provided just a patch because I expected that you might want to play with it and propose an alternative patch. It is simpler to compare patches with Rietveld than on GitHub.

    It seems like Rietveld is broken: there is no [Review] button on your patch. I wouldn't be surprised if it has been broken since CPython migrated to Git.

    @vstinner
    Member

    The PR makes different changes:

    • replace lookup_method() with lookup_maybe_method()
    • specialize call_xxx() functions for a fixed number of parameters
    • rename lookup_maybe() to _PyObject_LookupSpecial()

    If possible, I would prefer not to duplicate functions for 0, 1 and 2 parameters (3 variants). I would like to know which changes are responsible for the speedup.

    To ease the review, would it be possible to split your change into smaller changes? At least separate commits, maybe even a first "cleanup" PR before the "optimization" PR.

    @serhiy-storchaka
    Member Author

    PR 1883 cleans up the code related to calling type slots.

    • Use _PyObject_LookupSpecial() instead of lookup_maybe() in set_names(). This isn't performance critical.
    • Inline lookup_maybe() in _PyObject_LookupSpecial(). This was the only remaining use of lookup_maybe().
    • Replace lookup_maybe_method() followed by raising an exception with lookup_method() in call_method().
    • Replace lookup_method() with lookup_maybe_method() where the exception is ignored.
    • Update outdated comments.
    • Use call_method() in slot_sq_item. The comment about a super-optimized version was outdated; call_method() now implements this.

    @serhiy-storchaka
    Member Author

    Sorry, wrong data. PR 1883 makes indexing 1.2 times faster; PR 1861 makes it 1.7 times faster:

    $ ./python -m perf timeit -s 'class A:' -s '  def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' --duplicate 100 'list(a)'

    Unpatched: Mean +- std dev: 498 us +- 26 us
    PR 1883: Mean +- std dev: 351 us +- 10 us
    PR 1861: Mean +- std dev: 288 us +- 7 us

    @vstinner
    Member

    FYI you can use "./python -m perf timeit --compare-to=./python-ref" if you keep the "reference" (unpatched) Python, so that perf computes the "?.??x slower/faster" factor for you ;-)

    @serhiy-storchaka
    Member Author

    Thank you, I know about this, but it takes twice as long, so I don't use it regularly. And it doesn't allow comparing three versions. :-(

    @serhiy-storchaka
    Member Author

    New changeset 4e624ca by Serhiy Storchaka in branch 'master':
    bpo-30509: Clean up calling type slots. (bpo-1883)
    4e624ca

    @terryjreedy
    Member

    I believe Rietveld does not work with git-format patches. I don't know whether git can produce the format hg did.

    @serhiy-storchaka
    Member Author

    $ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a.__add__(b)'
    Mean +- std dev: [python-orig] 229 ns +- 9 ns -> [python] 235 ns +- 13 ns: 1.02x slower (+2%)
    
    $ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a + b'
    Mean +- std dev: [python-orig] 277 ns +- 10 ns -> [python] 251 ns +- 23 ns: 1.10x faster (-9%)
    
    $ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = [A() for i in range(1000)]' 'sum(a, A())'
    Mean +- std dev: [python-orig] 259 us +- 17 us -> [python] 218 us +- 16 us: 1.19x faster (-16%)
    
    $ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' 'list(a)'
    Mean +- std dev: [python-orig] 324 us +- 14 us -> [python] 300 us +- 16 us: 1.08x faster (-8%)
    
    $ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __neg__(s): return s' -s 'a = A()' --duplicate=100 '(----------a)'
    Mean +- std dev: [python-orig] 2.12 us +- 0.13 us -> [python] 1.91 us +- 0.11 us: 1.11x faster (-10%)

    @vstinner
    Member

    I'm not sure about adding Py_LOCAL_INLINE() (static inline); it may not be needed when you use PGO compilation.

    Would it be possible to run your benchmark again without the added Py_LOCAL_INLINE(), please?

    It's hard to say no to a change that makes so many core Python functions faster. I'm just surprised that "specializing" the call_unbound and call_method functions makes the code up to 1.2x faster.

    @serhiy-storchaka
    Member Author

    Without Py_LOCAL_INLINE all microbenchmarks become about 20% slower.

    I'm not sure that all these changes are needed. Maybe the same effect can be achieved with smaller changes, but so far I have tried and failed to achieve the same performance with a smaller patch. Maybe you will have more luck.

    Note that even with this patch, type slots are still about 5% slower than ordinary methods (despite the fact that using operators needs fewer bytecode instructions than calling a method). There is some additional overhead.

    @csabella csabella added 3.8 only security fixes and removed 3.7 (EOL) end of life labels May 17, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.8 only security fixes labels Sep 12, 2022