Speedup method calls 1.2x #70298

Closed
1st1 opened this issue Jan 14, 2016 · 31 comments
Assignees
Labels
3.7 interpreter-core performance

Comments

@1st1 (Member) commented Jan 14, 2016

BPO 26110
Nosy @Yhg1s, @brettcannon, @jcea, @ncoghlan, @vstinner, @benjaminp, @DinoV, @methane, @serhiy-storchaka, @soltysh, @jdemeyer, @1st1, @jstasiak, @MojoVampire, @Vgr255, @florinpapa, @efahl, @dopplershift, @Fidget-Spinner
PRs
  • #552
  • #26014
  • #26159
Files
  • call_method_1.patch
  • call_method_2.patch
  • call_method_3.patch
  • call_method_4.patch: updated for Python 3.7
  • default.json.gz
  • callmethod4.json.gz
  • call_method_5.patch
  • default.json.gz
  • patched.json.gz
  • callmethod-doc.patch
  • callmethod-doc2.patch
  • callmethod-doc3.patch
  • call-method-doc4.patch
  • call-method-doc5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2017-01-16.08:23:59.319>
    created_at = <Date 2016-01-14.17:09:06.979>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Speedup method calls 1.2x'
    updated_at = <Date 2021-05-16.09:26:18.303>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2021-05-16.09:26:18.303>
    actor = 'kj'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2017-01-16.08:23:59.319>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2016-01-14.17:09:06.979>
    creator = 'yselivanov'
    dependencies = []
    files = ['41614', '43599', '43601', '45786', '45787', '45788', '45866', '45887', '45888', '45889', '45891', '45896', '45963', '46294']
    hgrepos = []
    issue_num = 26110
    keywords = ['patch']
    message_count = 31.0
    messages = ['258204', '258205', '258206', '258211', '258220', '258222', '259333', '259339', '264659', '269662', '269665', '282620', '282621', '282622', '282627', '283075', '283077', '283093', '283094', '283147', '283148', '283149', '283160', '283161', '283166', '283168', '283177', '285415', '285511', '285544', '345335']
    nosy_count = 22.0
    nosy_names = ['twouters', 'brett.cannon', 'jcea', 'ncoghlan', 'vstinner', 'benjamin.peterson', 'dino.viehland', 'methane', 'santagada', 'python-dev', 'serhiy.storchaka', 'maciej.szulik', 'jdemeyer', 'yselivanov', 'jstasiak', 'josh.r', 'abarry', 'alecsandru.patrascu', 'florin.papa', 'eric.fahlgren', 'Ryan May', 'kj']
    pr_nums = ['552', '26014', '26159']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26110'
    versions = ['Python 3.7']

    @1st1 (Member, Author) commented Jan 14, 2016

    This issue supersedes issue bpo-6033. I decided to open a new one, since the patch is about Python 3.6 (not 2.7) and is written from scratch.

    The idea is to add new opcodes to avoid instantiation of BoundMethods. The patch only affects method calls of Python functions with positional arguments.

    I'm working on the attached patch in this repo: https://github.com/1st1/cpython/tree/call_meth2

    If the patch gets accepted, I'll update it with the docs etc.

    Performance Improvements
    ------------------------

    Method calls in micro-benchmarks are 1.2x faster:

    ### call_method ###
    Avg: 0.330371 -> 0.281452: 1.17x faster

    ### call_method_slots ###
    Avg: 0.333489 -> 0.280612: 1.19x faster

    ### call_method_unknown ###
    Avg: 0.304949 -> 0.251623: 1.21x faster

    Improvements in mini-benchmarks such as Richards are less impressive; I'd say it's a 3-7% improvement. The full output of benchmarks/perf.py is here: https://gist.github.com/1st1/e00f11586329f68fd490

    When the full benchmark suite is run, some benchmarks report that they were slower. When I ran them separately several times, they showed no real slowdowns. It's just that some of them (such as nbody) are highly unstable.

    It's actually possible to improve the performance by another 1-3% if we fuse __PyObject_GetMethod with the ceval/LOAD_METHOD code. I've tried this here: https://github.com/1st1/cpython/tree/call_meth4; however, I don't like leaking so many details of object.c into ceval.c.

    Changes in the Core
    -------------------

    Two new opcodes are added -- LOAD_METHOD and CALL_METHOD. Whenever the compiler sees a method call "obj.method(..)" with positional arguments, it compiles it as follows:

      LOAD_FAST(obj)
      LOAD_METHOD(method)
      {call arguments}
      CALL_METHOD

    The LOAD_METHOD implementation in ceval looks up "method" on obj's type and checks that it wasn't overridden in obj.__dict__. Apparently, even with the __dict__ check this is still faster than creating a BoundMethod instance etc.

    If the method is found and not overridden, LOAD_METHOD pushes the resolved method object, and 'obj'. If the method was overridden, the resolved method object and NULL are pushed to the stack.

    CALL_METHOD then looks at the two stack values beneath the call arguments. If the first one isn't NULL, it means that we have a method call.
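The lookup-then-call dance can be sketched in pure Python. This is a rough model for illustration only; `load_method` and `call_method` are hypothetical helper names, not CPython APIs, and `None` stands in for the NULL the real opcodes push:

```python
import types

def load_method(obj, name):
    # Model of the LOAD_METHOD fast path: if `name` resolves to a plain
    # Python function on the type and is not shadowed in the instance
    # __dict__, return the unbound function together with obj -- no
    # bound-method object is created.
    meth = getattr(type(obj), name, None)
    shadowed = name in getattr(obj, "__dict__", {})
    if isinstance(meth, types.FunctionType) and not shadowed:
        return meth, obj
    # Slow path: ordinary attribute lookup, None in the "self" slot.
    return getattr(obj, name), None

def call_method(callable_, self_or_null, *args):
    # Model of CALL_METHOD: if the "self" slot is filled, the fast path
    # was taken, so prepend it to the positional arguments.
    if self_or_null is not None:
        return callable_(self_or_null, *args)
    return callable_(*args)

class Greeter:
    def hello(self, name):
        return f"hello, {name}"

g = Greeter()
fn, self_slot = load_method(g, "hello")
print(call_method(fn, self_slot, "world"))
```

Shadowing the method in `g.__dict__` makes `load_method` fall back to the slow path, mirroring the `__dict__` check described above.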

    Why CALL_METHOD?
    ----------------

    It's actually possible to hack CALL_FUNCTION to support LOAD_METHOD. I've tried this approach in https://github.com/1st1/cpython/tree/call_meth3. It turns out that adding extra checks to CALL_FUNCTION has a negative impact on many benchmarks. It's really easier to add another opcode.

    Why only pure-Python methods?
    -----------------------------

    LOAD_METHOD atm works only with methods defined in pure Python. C methods, such as list.append, are still wrapped in a descriptor that creates a PyCFunction object on each attribute access. I've tried to handle those too in https://github.com/1st1/cpython/tree/call_cmeth. It does impact C method calls in a positive way, although my implementation is very hacky. It still uses the LOAD_METHOD and CALL_METHOD opcodes, so my idea is to consider merging this patch first, and then introduce the necessary refactoring of PyCFunction and MethodDescriptors in a separate issue.

    Why only calls with positional arguments?
    -----------------------------------------

    As shown in "Why CALL_METHOD?", making CALL_FUNCTION work with LOAD_METHOD slows it down. For keyword and var-arg calls we have three more opcodes -- CALL_FUNCTION_VAR, CALL_FUNCTION_KW, and CALL_FUNCTION_VAR_KW. I suspect that making them work with LOAD_METHOD would slow them down too, which would probably require us to add three (!) more opcodes for LOAD_METHOD.

    These kinds of calls have much more overhead anyway; I don't expect them to be as optimizable as positional-arg calls.

    @1st1 1st1 self-assigned this Jan 14, 2016
    @1st1 1st1 added interpreter-core performance labels Jan 14, 2016
    @vstinner (Member) commented Jan 14, 2016

    For keyword and var-arg calls we have three more opcodes -- CALL_FUNCTION_VAR, CALL_FUNCTION_KW, and CALL_FUNCTION_VAR_KW. I suspect that making them work with LOAD_METHOD would slow them down too, which would probably require us to add three (!) more opcodes for LOAD_METHOD.

    I don't think that it's an issue to add 3 more opcodes for performance. If you prefer to limit the number of opcodes, you can pass a flag in arguments. For example, use 2 bytes for 2 arguments instead of only 1?

    See also the (now old) WPython project which proposed kind of CISC instructions:
    https://code.google.com/p/wpython/

    @1st1 (Member, Author) commented Jan 14, 2016

    If you prefer to limit the number of opcodes, you can pass a flag in arguments. For example, use 2 bytes for 2 arguments instead of only 1?

    I tried two approaches:

    1. Push one more leading NULL onto the stack in LOAD_METHOD (which makes it push 3 values). Then, in CALL_FUNCTION, detect when the callable is NULL and do the stuff that CALL_METHOD does.

    2. Use a bitflag on oparg for CALL_FUNCTION and CALL_* opcodes. This adds EXTENDED_ARG opcode overhead.

    Long story short, the only option would be to add dedicated opcodes to work with LOAD_METHOD. However, I'd prefer to first merge this patch, as it's relatively small and easy to review, and then focus on improving other things (keyword/*arg calls, C methods, etc). This is just a first step.

    @gvanrossum (Member) commented Jan 14, 2016

    I like this idea! I like the limitations to positional-only calls. I do think that it would be nice if we could speed up C calls too -- today, s.startswith('abc') is slower than s[:3] == 'abc' precisely because of the lookup. But I'm all for doing this one step at a time, so we can be sure it is solid before taking the next step(s).

    @1st1 (Member, Author) commented Jan 14, 2016

    I like this idea! I like the limitations to positional-only calls. I do think that it would be nice if we could speed up C calls too -- today, s.startswith('abc') is slower than s[:3] == 'abc' precisely because of the lookup. But I'm all for doing this one step at a time, so we can be sure it is solid before taking the next step(s).

    Yes, I think we can make <built-in type>.method(..) calls much faster with LOAD_METHOD. Since those types lack __dict__ and their __class__ is read-only, we can use a far better optimized code path without extra lookups and creation of BoundMethod/PyCFunction objects.
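The startswith-versus-slicing comparison mentioned above can be measured directly with timeit. This is an illustrative micro-benchmark sketch; exact timings vary by CPython version and build:

```python
import timeit

setup = "s = 'abcdefgh'"
# Both expressions answer the same question; historically the method call
# lost extra time to attribute lookup and per-access PyCFunction creation.
t_method = timeit.timeit("s.startswith('abc')", setup=setup, number=200_000)
t_slice = timeit.timeit("s[:3] == 'abc'", setup=setup, number=200_000)
print(f"startswith: {t_method:.4f}s   slice compare: {t_slice:.4f}s")
```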

    @vstinner (Member) commented Jan 14, 2016

    I'm happy to see people working on optimizing CPython ;-)

    @1st1 (Member, Author) commented Feb 1, 2016

    For those interested in reviewing this patch at some point: please wait until I upload a new version. The current patch is somewhat outdated.

    @alecsandrupatrascu (Mannequin) commented Feb 1, 2016

    Yury, thank you for the heads up! Here at Intel, in the Dynamic Scripting Languages Optimization Team, we can help the community with reviewing and measuring this patch in our quiet and stable environment, the same one that we use to provide public CPython daily measurements. We will wait for your update.

    @gvanrossum (Member) commented May 2, 2016

    This patch doesn't apply cleanly any more. Is it easy to update?

    @methane (Member) commented Jul 1, 2016

    Updated, based on 102241:908b801f8a62

    @methane (Member) commented Jul 1, 2016

    Oops, the previous patch doesn't update the magic number in PC/launcher.c.
    Should I update it, or leave it untouched to avoid additional conflicts?

    @methane methane added the 3.7 label Dec 7, 2016
    @serhiy-storchaka (Member) commented Dec 7, 2016

    Please increase the magic number by 10. We need to reserve a few numbers in case of bytecode bug fixes in 3.6.
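For reference, the bytecode magic number being discussed here is observable from Python via importlib.util.MAGIC_NUMBER; bumping it invalidates all cached .pyc files. A small illustrative check:

```python
import importlib.util

magic = importlib.util.MAGIC_NUMBER          # 4 bytes
word = int.from_bytes(magic[:2], "little")   # the version-specific number
# The trailing b'\r\n' guards against text-mode file corruption.
print(word, magic[2:])
```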

    @serhiy-storchaka (Member) commented Dec 7, 2016

    Added comments on Rietveld. Please document new bytecodes in the dis module documentation and What's New.

    @methane (Member) commented Dec 7, 2016

    $ ./python-default -m perf compare_to default.json callmethod4.json -G
    Slower (7):
    - pickle_dict: 66.0 us +- 4.6 us -> 77.0 us +- 5.9 us: 1.17x slower
    - json_loads: 63.7 us +- 0.7 us -> 68.4 us +- 1.4 us: 1.07x slower
    - unpack_sequence: 120 ns +- 2 ns -> 125 ns +- 3 ns: 1.04x slower
    - scimark_lu: 499 ms +- 12 ms -> 514 ms +- 24 ms: 1.03x slower
    - scimark_monte_carlo: 272 ms +- 10 ms -> 278 ms +- 9 ms: 1.02x slower
    - scimark_sor: 517 ms +- 9 ms -> 526 ms +- 10 ms: 1.02x slower
    - regex_effbot: 5.25 ms +- 0.15 ms -> 5.27 ms +- 0.17 ms: 1.00x slower

    Faster (52):

    - call_method: 18.1 ms +- 0.1 ms -> 14.8 ms +- 0.1 ms: 1.22x faster
    - call_method_slots: 17.9 ms +- 0.2 ms -> 14.7 ms +- 0.2 ms: 1.22x faster
    - call_method_unknown: 20.0 ms +- 0.9 ms -> 16.5 ms +- 0.1 ms: 1.21x faster
    - xml_etree_parse: 304 ms +- 13 ms -> 283 ms +- 6 ms: 1.07x faster
    - float: 310 ms +- 4 ms -> 292 ms +- 4 ms: 1.06x faster
    - call_simple: 14.2 ms +- 0.6 ms -> 13.4 ms +- 0.3 ms: 1.06x faster
    - pickle_pure_python: 1.34 ms +- 0.02 ms -> 1.27 ms +- 0.02 ms: 1.05x faster
    - richards: 189 ms +- 3 ms -> 180 ms +- 7 ms: 1.05x faster
    - logging_silent: 793 ns +- 14 ns -> 754 ns +- 9 ns: 1.05x faster
    - deltablue: 18.4 ms +- 0.2 ms -> 17.6 ms +- 0.1 ms: 1.05x faster
    - go: 623 ms +- 5 ms -> 594 ms +- 5 ms: 1.05x faster
    - genshi_xml: 202 ms +- 2 ms -> 193 ms +- 3 ms: 1.04x faster
    - fannkuch: 1.10 sec +- 0.02 sec -> 1.05 sec +- 0.02 sec: 1.04x faster
    - raytrace: 1.38 sec +- 0.02 sec -> 1.32 sec +- 0.00 sec: 1.04x faster
    - django_template: 419 ms +- 5 ms -> 402 ms +- 4 ms: 1.04x faster
    - hexiom: 24.2 ms +- 0.2 ms -> 23.3 ms +- 0.2 ms: 1.04x faster
    - xml_etree_iterparse: 225 ms +- 4 ms -> 216 ms +- 5 ms: 1.04x faster
    - spectral_norm: 282 ms +- 14 ms -> 271 ms +- 4 ms: 1.04x faster
    - python_startup_no_site: 9.83 ms +- 0.02 ms -> 9.46 ms +- 0.03 ms: 1.04x faster
    - python_startup: 16.4 ms +- 0.0 ms -> 15.7 ms +- 0.0 ms: 1.04x faster
    - unpickle_pure_python: 918 us +- 24 us -> 884 us +- 24 us: 1.04x faster
    - unpickle: 33.0 us +- 0.4 us -> 31.8 us +- 0.4 us: 1.04x faster
    - dulwich_log: 158 ms +- 1 ms -> 153 ms +- 1 ms: 1.04x faster
    - html5lib: 234 ms +- 6 ms -> 225 ms +- 6 ms: 1.04x faster
    - scimark_fft: 735 ms +- 19 ms -> 709 ms +- 22 ms: 1.04x faster
    - sqlalchemy_declarative: 325 ms +- 4 ms -> 314 ms +- 3 ms: 1.04x faster
    - mako: 45.4 ms +- 0.3 ms -> 43.8 ms +- 0.7 ms: 1.04x faster
    - genshi_text: 90.1 ms +- 0.9 ms -> 87.0 ms +- 0.9 ms: 1.04x faster
    - tornado_http: 388 ms +- 4 ms -> 375 ms +- 4 ms: 1.03x faster
    - 2to3: 758 ms +- 4 ms -> 734 ms +- 4 ms: 1.03x faster
    - scimark_sparse_mat_mult: 8.75 ms +- 0.25 ms -> 8.48 ms +- 0.39 ms: 1.03x faster
    - json_dumps: 29.9 ms +- 0.4 ms -> 29.0 ms +- 0.3 ms: 1.03x faster
    - sqlite_synth: 10.1 us +- 0.2 us -> 9.80 us +- 0.19 us: 1.03x faster
    - regex_dna: 286 ms +- 6 ms -> 278 ms +- 4 ms: 1.03x faster
    - crypto_pyaes: 235 ms +- 2 ms -> 228 ms +- 2 ms: 1.03x faster
    - pidigits: 320 ms +- 0 ms -> 311 ms +- 1 ms: 1.03x faster
    - logging_format: 38.1 us +- 0.5 us -> 37.1 us +- 0.4 us: 1.03x faster
    - meteor_contest: 206 ms +- 3 ms -> 201 ms +- 1 ms: 1.03x faster
    - xml_etree_process: 248 ms +- 4 ms -> 241 ms +- 4 ms: 1.03x faster
    - chameleon: 30.9 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.03x faster
    - logging_simple: 32.8 us +- 0.4 us -> 32.0 us +- 0.3 us: 1.03x faster
    - sqlalchemy_imperative: 61.3 ms +- 1.6 ms -> 59.9 ms +- 1.6 ms: 1.02x faster
    - chaos: 308 ms +- 2 ms -> 301 ms +- 2 ms: 1.02x faster
    - unpickle_list: 7.90 us +- 0.04 us -> 7.73 us +- 0.10 us: 1.02x faster
    - sympy_integrate: 47.9 ms +- 0.3 ms -> 46.9 ms +- 0.4 ms: 1.02x faster
    - sympy_sum: 216 ms +- 6 ms -> 212 ms +- 6 ms: 1.02x faster
    - sympy_str: 481 ms +- 5 ms -> 473 ms +- 2 ms: 1.02x faster
    - sympy_expand: 1.08 sec +- 0.01 sec -> 1.06 sec +- 0.01 sec: 1.02x faster
    - regex_compile: 449 ms +- 3 ms -> 442 ms +- 5 ms: 1.02x faster
    - nqueens: 275 ms +- 3 ms -> 271 ms +- 3 ms: 1.02x faster
    - pathlib: 51.6 ms +- 0.6 ms -> 50.9 ms +- 0.7 ms: 1.01x faster
    - pickle_list: 8.85 us +- 0.09 us -> 8.79 us +- 0.11 us: 1.01x faster

    Benchmark hidden because not significant (5): nbody, pickle, regex_v8, telco, xml_etree_generate

    @1st1 (Member, Author) commented Dec 7, 2016

    Please don't merge this without my review.

    @serhiy-storchaka (Member) commented Dec 13, 2016

    Technically the patch LGTM, but we should find the cause of the regression in some benchmarks.

    It would also be nice to extend the optimization to C functions.

    In any case this optimization is worth mentioning in What's New.

    @methane (Member) commented Dec 13, 2016

    Technically the patch LGTM, but we should find the cause of the regression in some benchmarks.

    The benchmark was run on Sandy Bridge (Core i5 2400) and I didn't use a PGO build.
    perf_event reported a branch-miss rate increase in cpickle's save function.

    I'll rerun the benchmark with a PGO build. I hope PGO is as friendly to CPU branch
    prediction as it is to the L1/L2 cache.

    Anyway, recent amd64 CPUs have a larger branch history.

    It would also be nice to extend the optimization to C functions.
    In any case this optimization is worth mentioning in What's New.

    I'll do them.

    @methane (Member) commented Dec 13, 2016

    It would also be nice to extend the optimization to C functions.

    I tried it, but skipping the creation of the PyCFunction seems impossible for now.

    My current idea is adding a new tp_fastcall slot which has the same signature
    as _PyFunction_FastCallDict or _PyCFunction_FastCallDict.

    If MethodDescrObject implemented it, we could skip the temporary PyCFunction object
    and argument tuple.

    But I think it should be a separate issue. The patch is large enough already.
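A rough Python-level analogy of what such a slot would avoid: calling through the method descriptor directly skips creating a bound object per call. This is illustrative only; the real win would be at the C level, where the temporary PyCFunction and the argument tuple could both be skipped:

```python
lst = []
append = list.append      # look up the method descriptor once...
for i in range(3):
    append(lst, i)        # ...and call through it directly, equivalent to
                          # lst.append(i) without a temporary bound object
print(lst)
```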

    @serhiy-storchaka (Member) commented Dec 13, 2016

    But I think it should be a separate issue.

    Agreed, if it's that hard.

    @1st1 (Member, Author) commented Dec 13, 2016

    Inada-san, when I tested the patch last time, I think there was a regression somewhere, related to the descriptor protocol. Have you fixed that one?

    @python-dev (Mannequin) commented Dec 14, 2016

    New changeset 64afd5cab40a by Yury Selivanov in branch 'default':
    Issue bpo-26110: Add LOAD_METHOD/CALL_METHOD opcodes.
    https://hg.python.org/cpython/rev/64afd5cab40a

    @1st1 (Member, Author) commented Dec 14, 2016

    Inada-san, when I tested the patch last time, I think there was a regression somewhere, related to the descriptor protocol. Have you fixed that one?

    Seems to be either fixed, or maybe those bugs were related to my opcode cache patch. Anyway, I decided to commit the patch to 3.7; otherwise it might miss the commit window as it did for 3.6. Let's fix any regressions right in the repo now.

    My current idea is adding a new tp_fastcall slot which has the same signature
    as _PyFunction_FastCallDict or _PyCFunction_FastCallDict.

    I like this idea!

    Thanks Inada-san for pushing this patch through, and thanks to Serhiy for reviewing it.

    @methane (Member) commented Dec 14, 2016

    I'm working on changing the stack layout slightly:

    current patch: callable | NULL | arg1 | ... argN
    next patch:    NULL | callable | arg1 | ... argN

    After benchmarking with a PGO build, I'll post it.

    @methane (Member) commented Dec 14, 2016

    I hadn't noticed that the patch was committed already.

    Changing the stack layout slightly is to make it easier to document, not for performance.
    Please don't close this issue until the documentation is added in Doc/library/dis.rst.

    @methane (Member) commented Dec 14, 2016

    This patch modifies the stack layout slightly and adds documentation in Doc/library/dis.rst.

    @vstinner (Member) commented Dec 14, 2016

    INADA Naoki: "My current idea is adding a new tp_fastcall slot which has the same signature as _PyFunction_FastCallDict or _PyCFunction_FastCallDict."

    FYI I'm working on a solution to avoid tuples and dicts for passing parameters to tp_new, tp_init and tp_call. I have PoC implementations (yeah, more than one). Please contact me directly if you are interested; this issue is not the right place to discuss it.

    @vstinner (Member) commented Jan 13, 2017

    I just created the issue bpo-29263: "Implement LOAD_METHOD/CALL_METHOD for C functions".

    @methane (Member) commented Jan 15, 2017

    Yury, could you review this?

    @python-dev (Mannequin) commented Jan 16, 2017

    New changeset a6241b2073c6 by INADA Naoki in branch 'default':
    Issue bpo-26110: Add document for LOAD_METHOD and CALL_METHOD opcode.
    https://hg.python.org/cpython/rev/a6241b2073c6

    @methane methane closed this as completed Jan 16, 2017
    @jdemeyer (Contributor) commented Jun 12, 2019

    Note that this idea has been generalized by PEP 590: any type can support this optimization by setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag.
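At the Python level, that flag can be inspected via type.__flags__. This is a sketch; `is_method_descriptor` is a hypothetical helper, the flag value 1 << 17 is taken from CPython's object.h, and the results assume CPython 3.8+ (where PEP 590 landed):

```python
Py_TPFLAGS_METHOD_DESCRIPTOR = 1 << 17   # value from CPython's object.h

def is_method_descriptor(obj):
    # Hypothetical helper: check the PEP 590 flag on the object's type.
    return bool(type(obj).__flags__ & Py_TPFLAGS_METHOD_DESCRIPTOR)

def f():
    pass

# Method descriptors and plain Python functions set the flag;
# builtin_function_or_method objects like len do not.
print(is_method_descriptor(list.append), is_method_descriptor(f),
      is_method_descriptor(len))
```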

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022