Speedup method calls 1.2x #70298
This issue supersedes issue bpo-6033. I decided to open a new one, since the patch is for Python 3.6 (not 2.7) and is written from scratch.

The idea is to add new opcodes to avoid instantiating BoundMethods. The patch only affects calls to methods that are Python functions and are called with positional arguments.

I'm working on the attached patch in this repo: https://github.com/1st1/cpython/tree/call_meth2

If the patch gets accepted, I'll update it with the docs etc.

Performance Improvements

Method calls in micro-benchmarks are 1.2x faster:

### call_method ###
### call_method_slots ###
### call_method_unknown ###

Improvements in mini-benchmarks such as Richards are less impressive: I'd say a 3-7% improvement. The full output of benchmarks/perf.py is here: https://gist.github.com/1st1/e00f11586329f68fd490

When the full benchmark suite is run, some benchmarks are reported as slower. When I ran them separately several times, none of them showed a real slowdown. Some of them (such as nbody) are simply highly unstable.

It's actually possible to improve performance by another 1-3% if we fuse __PyObject_GetMethod with the ceval/LOAD_METHOD code. I've tried this here: https://github.com/1st1/cpython/tree/call_meth4, however I don't like having so many details of object.c in ceval.c.

Changes in the Core

Two new opcodes are added -- LOAD_METHOD and CALL_METHOD. Whenever the compiler sees a method call "obj.method(..)" with positional arguments, it compiles it as follows:

    LOAD_FAST(obj)
    LOAD_METHOD(method)
    {call arguments}
    CALL_METHOD

The LOAD_METHOD implementation in ceval looks up "method" on obj's type, and checks that it wasn't overridden in obj.__dict__. Apparently, even with the __dict__ check this is still faster than creating a BoundMethod instance etc.

If the method is found and not overridden, LOAD_METHOD pushes the resolved method object and 'obj'. If the method was overridden, the resolved method object and NULL are pushed to the stack.

CALL_METHOD then looks at the two stack values after the call arguments. If the first one isn't NULL, it means that we have a method call.

Why CALL_METHOD?

It's actually possible to hack CALL_FUNCTION to support LOAD_METHOD. I've tried this approach in https://github.com/1st1/cpython/tree/call_meth3. It looks like adding extra checks to CALL_FUNCTION has a negative impact on many benchmarks. It's really easier to add another opcode.

Why only pure-Python methods?

LOAD_METHOD atm works only with methods defined in pure Python. C methods, such as

Why only calls with positional arguments?

As shown in "Why CALL_METHOD?", making CALL_FUNCTION work with LOAD_METHOD slows it down. For keyword and var-arg calls we have three more opcodes -- CALL_FUNCTION_VAR, CALL_FUNCTION_KW, and CALL_FUNCTION_VAR_KW. I suspect that making them work with LOAD_METHOD would slow them down too, which would probably require us to add three (!) more opcodes for LOAD_METHOD.

And these kinds of calls carry much more overhead anyway, so I don't expect them to be as optimizable as positional arg calls. |
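The stack protocol described above can be modelled in a few lines of pure Python. This is only an illustrative sketch, not the patch itself: the real opcodes are implemented in C inside ceval, and the names `load_method`, `call_method`, `NULL` and `Greeter` are hypothetical. The stack order follows the description in this post (method first, then 'obj' or NULL); later comments in the thread discuss changing that layout.

```python
import types

NULL = object()  # hypothetical sentinel standing in for a C NULL on the stack


def load_method(stack, name):
    """Model of LOAD_METHOD: pop obj, push (method, obj) or (attribute, NULL)."""
    obj = stack.pop()
    # Look the name up on the type's MRO without triggering descriptors,
    # so no bound method object is created.
    attr = None
    for klass in type(obj).__mro__:
        if name in vars(klass):
            attr = vars(klass)[name]
            break
    instance_dict = getattr(obj, "__dict__", {})
    if isinstance(attr, types.FunctionType) and name not in instance_dict:
        # Fast path: a pure-Python method that is not overridden on the instance.
        stack.append(attr)
        stack.append(obj)
    else:
        # Fallback: ordinary attribute lookup (may create a bound method).
        stack.append(getattr(obj, name))
        stack.append(NULL)


def call_method(stack, nargs):
    """Model of CALL_METHOD: pop args, then the (callable, obj-or-NULL) pair."""
    args = [stack.pop() for _ in range(nargs)][::-1]
    self_or_null = stack.pop()
    func = stack.pop()
    if self_or_null is not NULL:
        stack.append(func(self_or_null, *args))  # method call: obj becomes self
    else:
        stack.append(func(*args))  # plain call of whatever was looked up


class Greeter:
    def greet(self, name):
        return "hello " + name


if __name__ == "__main__":
    stack = [Greeter()]
    load_method(stack, "greet")
    stack.append("world")
    call_method(stack, 1)
    print(stack.pop())
```

In this model a built-in method such as `"abc".upper` takes the fallback branch, which mirrors the "pure-Python methods only" limitation discussed in the post.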
I don't think that it's an issue to add 3 more opcodes for performance. If you prefer to limit the number of opcodes, you can pass a flag in the arguments. For example, use 2 bytes for 2 arguments instead of only 1? See also the (now old) WPython project, which proposed a kind of CISC instructions: |
I tried two approaches:
Long story short, the only option would be to add dedicated opcodes to work with LOAD_METHOD. However, I'd prefer to first merge this patch, as it's relatively small and easy to review, and then focus on improving other things (keyword/*arg calls, C methods, etc). This is just a first step. |
I like this idea! I like the limitations to positional-only calls. I do think that it would be nice if we could speed up C calls too -- today, s.startswith('abc') is slower than s[:3] == 'abc' precisely because of the lookup. But I'm all for doing this one step at a time, so we can be sure it is solid before taking the next step(s). |
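The comparison mentioned above can be measured with the standard `timeit` module. This is a rough sketch: the helper name `compare_prefix_checks` and the sample string are assumptions, and the outcome depends heavily on the interpreter version and build, so no particular result is claimed here.

```python
import timeit


def compare_prefix_checks(number=200_000):
    """Time a method call against the equivalent slice comparison."""
    s = "abcdefghij"  # arbitrary sample string
    t_method = timeit.timeit("s.startswith('abc')", globals={"s": s}, number=number)
    t_slice = timeit.timeit("s[:3] == 'abc'", globals={"s": s}, number=number)
    return t_method, t_slice


if __name__ == "__main__":
    t_method, t_slice = compare_prefix_checks()
    print(f"s.startswith('abc'): {t_method:.4f} s")
    print(f"s[:3] == 'abc':      {t_slice:.4f} s")
```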
Yes, I think we can make |
I'm happy to see people working on optimizing CPython ;-) |
For those interested in reviewing this patch at some point: please wait until I upload a new version. The current patch is somewhat outdated. |
Yury, thank you for the heads up! Here at Intel, in the Dynamic Scripting Languages Optimization Team, we can help the community with reviewing and measuring this patch in our quiet and stable environment, the same one that we use to provide public CPython daily measurements. We will wait for your update. |
This patch doesn't apply cleanly any more. Is it easy to update? |
Updated, based on 102241:908b801f8a62 |
Oops, the previous patch doesn't update the magic number in PC/launcher.c |
Please increase the magic number by 10. We need to reserve a few numbers in case of bytecode bug fixes in 3.6. |
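For context, the magic number is the tag CPython writes at the start of every .pyc file; it changes whenever the bytecode format changes, so that stale caches get recompiled. A small sketch of how to inspect it on a running interpreter follows. The helper name `magic_number_value` is hypothetical, and the two-byte little-endian layout followed by b"\r\n" is assumed to be CPython's.

```python
import importlib.util


def magic_number_value():
    """Return the numeric part of the bytecode magic number and its suffix."""
    raw = importlib.util.MAGIC_NUMBER  # 4 bytes at the start of each .pyc
    return int.from_bytes(raw[:2], "little"), raw[2:]


if __name__ == "__main__":
    value, suffix = magic_number_value()
    print(value, suffix)
```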
Added comments on Rietveld. Please document new bytecodes in the dis module documentation and What's New. |
$ ./python-default -m perf compare_to default.json callmethod4.json -G
Slower (7):
- pickle_dict: 66.0 us +- 4.6 us -> 77.0 us +- 5.9 us: 1.17x slower
- json_loads: 63.7 us +- 0.7 us -> 68.4 us +- 1.4 us: 1.07x slower
- unpack_sequence: 120 ns +- 2 ns -> 125 ns +- 3 ns: 1.04x slower
- scimark_lu: 499 ms +- 12 ms -> 514 ms +- 24 ms: 1.03x slower
- scimark_monte_carlo: 272 ms +- 10 ms -> 278 ms +- 9 ms: 1.02x slower
- scimark_sor: 517 ms +- 9 ms -> 526 ms +- 10 ms: 1.02x slower
- regex_effbot: 5.25 ms +- 0.15 ms -> 5.27 ms +- 0.17 ms: 1.00x slower
Faster (52):
Benchmark hidden because not significant (5): nbody, pickle, regex_v8, telco, xml_etree_generate |
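The "Nx slower" factors in the listing above are the ratio of the new mean to the old mean. A short sketch that recomputes them from the means quoted in the output (the function name `slowdown` and the `results` table are only illustrative; the error bars are ignored here):

```python
# (mean before, mean after), in the units shown in the perf output above
results = {
    "pickle_dict": (66.0, 77.0),
    "json_loads": (63.7, 68.4),
    "unpack_sequence": (120, 125),
    "scimark_lu": (499, 514),
    "scimark_monte_carlo": (272, 278),
    "scimark_sor": (517, 526),
    "regex_effbot": (5.25, 5.27),
}


def slowdown(before, after):
    """Ratio of the new mean to the old mean; values above 1 mean slower."""
    return after / before


if __name__ == "__main__":
    for name, (before, after) in results.items():
        print(f"{name}: {slowdown(before, after):.2f}x slower")
```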
Please don't merge this without my review. |
Technically the patch LGTM. But we should find the cause of the regression in some benchmarks. And it would be nice to extend the optimization to C functions. In any case, this optimization is worth mentioning in What's New. |
The benchmark was run on Sandy Bridge (Core i5 2400) and I didn't use a PGO build. I'll rerun the benchmark with a PGO build. I hope PGO is friendly with CPU branch Anyway, recent amd64 CPUs have a larger branch history.
I'll do them. |
I tried it, but skipping the creation of PyCFunction seems impossible for now. My current idea is adding new If MethodDescrObject implement it, we can skip temporary PyCFunction object and But I think it should be a separate issue. The patch is large enough already. |
Agreed, if it is that hard. |
Inada-san, when I tested the patch last time, I think there was a regression somewhere, related to the descriptor protocol. Have you fixed that one? |
New changeset 64afd5cab40a by Yury Selivanov in branch 'default': |
Seems to be either fixed, or maybe those bugs were related to my opcode cache patch. Anyways, I decided to commit the patch to 3.7, otherwise it might miss the commit window as it did for 3.6. Let's fix any regressions right in the repo now.
I like this idea! Thanks Inada-san for pushing this patch through, and thanks to Serhiy for reviewing it. |
I'm working on changing the stack layout slightly.
current patch: callable | NULL | arg1 | ...argN
After benchmarking with a PGO build, I'll post it. |
I hadn't noticed that the patch was already committed. The slight stack layout change is to make it easier to document, not for performance. |
PGO benchmark result |
This patch modifies the stack layout slightly and adds documentation in Doc/library/dis.rst |
INADA Naoki: "My current idea is adding new FYI I'm working on a solution to avoid tuple and dict to pass parameters to tp_new, tp_init and tp_call. I have PoC implementations (yeah, more than one). Please come to me directly if you are interested, this issue is not the right place to discuss. |
I just created the issue bpo-29263: "Implement LOAD_METHOD/CALL_METHOD for C functions". |
Yury, could you review this? |
New changeset a6241b2073c6 by INADA Naoki in branch 'default': |
Note that this idea has been generalized by PEP-590: any type can support this optimization by setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag. |
Misc/NEWS so that it is managed by towncrier #552
CALL_METHOD_KW opcode to speedup method calls with keywords #26014
CALL_METHOD_KW #26159