Use FASTCALL in typeobject.c call_method() to avoid temporary tuple #73693
Comments
Subset of the (almost) rejected issue bpo-29259 (tp_fastcall), attached patch adds _PyMethod_FastCall() and uses it in call_method() of typeobject.c. The change avoids the creation of a temporary tuple for Python functions and METH_FASTCALL C functions. Currently, call_method() calls method_call() which calls _PyObject_Call_Prepend(), and calling method_call() requires a tuple for positional arguments.

Example of benchmark on __getitem__(): 1.3x faster (-22%).

    $ ./python -m perf timeit -s 'class C:' -s ' def __getitem__(self, index): return index' -s 'c=C()' 'c[0]'
    Median +- std dev: 130 ns +- 1 ns => 102 ns +- 1 ns

See also the issue bpo-29263 "Implement LOAD_METHOD/CALL_METHOD for C functions". |
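For readers without the perf module installed, the same microbenchmark can be approximated with the standard library's timeit. This is only a rough sketch: the helper name bench_getitem and its default parameters are made up for illustration, and timeit does not do the calibration and statistics that perf does, so the absolute numbers will differ from the ones quoted above.

```python
import timeit


class C:
    def __getitem__(self, index):
        return index


def bench_getitem(number=1_000_000, repeat=5):
    """Return the best observed time per `c[0]` evaluation, in nanoseconds.

    Hypothetical helper, not part of the patch or of the perf module.
    """
    c = C()
    timings = timeit.repeat("c[0]", globals={"c": c}, number=number, repeat=repeat)
    # Each timing covers `number` evaluations; keep the fastest run.
    return min(timings) / number * 1e9


if __name__ == "__main__":
    print(f"c[0]: {bench_getitem():.1f} ns per call")
```

Running the script once with the reference interpreter and once with the patched one gives a before/after pair comparable in spirit to the perf output.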
Maybe PyObject_Call(), _PyObject_FastCallDict(), etc. can also be modified to get the following fast-path: + if (Py_TYPE(func) == &PyMethod_Type) { But I don't know how common it is to get a PyMethod_Type object in these functions, nor the cost of the additional if. |
Maybe, we can skip Method object entirely using _PyObject_GetMethod(). |
CallMethod[Id]ObjArgs() can use it easily. |
callmethod.patch: + ../python.default -m perf compare_to default.json patched2.json -G --min-speed=1
Slower (5):
- logging_silent: 717 ns +- 9 ns -> 737 ns +- 8 ns: 1.03x slower (+3%)
- fannkuch: 1.04 sec +- 0.01 sec -> 1.06 sec +- 0.02 sec: 1.02x slower (+2%)
- call_method: 14.5 ms +- 0.1 ms -> 14.7 ms +- 0.1 ms: 1.02x slower (+2%)
- call_method_slots: 14.3 ms +- 0.3 ms -> 14.6 ms +- 0.1 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 8.66 ms +- 0.21 ms -> 8.76 ms +- 0.25 ms: 1.01x slower (+1%)
Faster (17):
Benchmark hidden because not significant (42) |
I'm sorry, callmethod.patch tunes a different place and causes a SEGV. method_fastcall2.patch tunes the same function (call_method() in typeobject.c) and uses a trick to bypass the temporary method object (the same trick as _PyObject_GetMethod()).

    $ ./python -m perf timeit --compare-to `pwd`/python.default -s 'class C:' -s ' def __getitem__(self, index): return index' -s 'c=C()' 'c[0]'
    python.default: ..................... 155 ns +- 4 ns
    python: ..................... 111 ns +- 1 ns

Median +- std dev: [python.default] 155 ns +- 4 ns -> [python] 111 ns +- 1 ns: 1.40x faster (-28%) |
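The two figures in a summary such as "1.40x faster (-28%)" are a ratio and a relative change of the same pair of medians. The sketch below recomputes them from the rounded values printed in this thread. The function name describe_change is hypothetical, and perf itself works from the full sample data, so its last digit can differ from what the rounded medians give.

```python
def describe_change(before, after):
    """Summarize a timing change in the 'N.NNx faster/slower (+/-P%)' style.

    `before` and `after` are timings in the same unit (for example ns).
    Illustrative helper only; this is not the perf module's own code.
    """
    percent = (after - before) / before * 100
    if after < before:
        return f"{before / after:.2f}x faster ({percent:+.0f}%)"
    return f"{after / before:.2f}x slower ({percent:+.0f}%)"


if __name__ == "__main__":
    print(describe_change(155, 111))
    print(describe_change(900, 994))
```

Note that the ratio and the percentage use different denominators, which is why a 1.40x speedup corresponds to a 28% reduction rather than 40%.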
Oh, great idea! That's why I put you in the nosy list ;-) You know better than me this area of the code.
Wow, much better than my patch. Good job! Can we implement the same optimization in callmethod() of Objects/abstract.c? Maybe add a "is_method" argument to the static function _PyObject_CallFunctionVa(), to only enable the optimization for callmethod(). |
method_fastcall3.patch implements the trick in a more general way. |
method_fastcall4.patch: Based on method_fastcall3.patch, I just added call_unbound() and call_unbound_noarg() helper functions to factorize code. I also modified mro_invoke() to be able to remove lookup_method(). I confirm the speedup with attached bench.py: Median +- std dev: [ref] 121 ns +- 5 ns -> [patch] 82.8 ns +- 1.0 ns: 1.46x faster (-31%) |
method_fastcall4.patch looks clean enough, and performance benefit seems nice. I don't know current test suite covers unusual special methods. |
method_fastcall4.patch benchmark results. It's not the first time that I notice that fannkuch and nbody benchmarks become slower. I guess that it's effect of changing code placement because of unrelated change in the C code. Results don't seem significant on such macro benchmarks (may be random performance changes due to code placement). IMHO the change is worth it! "1.46x faster (-31%)" on a microbenchmark is significant and the change is small.

$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-08_15-49-default-f507545ad22a.json method_fastcall4_ref_f507545ad22a.json -G --min-speed=5
Slower (2):
- fannkuch: 900 ms +- 20 ms -> 994 ms +- 10 ms: 1.10x slower (+10%)
- nbody: 215 ms +- 3 ms -> 228 ms +- 4 ms: 1.06x slower (+6%)
Faster (3):
Benchmark hidden because not significant (59): (...) |
+1 Though this is a rather large and impactful patch, I think it is a great idea. It will be one of the highest payoff applications of FASTCALL, broadly benefitting a lot of code. Let's be sure to be extra careful with this one because it is touching central parts of the language, so any errors or subtle behavior changes will be felt by a lot of code, some of which is sure to hit the rare corner cases and to rely on implementation details. |
New changeset 7b8df4a5d81d by Victor Stinner in branch 'default': |
Raymond Hettinger: "+1 Though this is a rather large and impactful patch, I think it is a great idea. It will be one of the highest payoff applications of FASTCALL, broadly benefitting a lot of code."

In my experience, avoiding a temporary tuple to pass positional arguments provides a speedup of up to 30% in the best case. Here it's 1.5x faster because the optimization also avoids the creation of a temporary PyMethodObject.

"Let's be sure to be extra careful with this one because it is touching central parts of the language, so any errors or subtle behavior changes will be felt by a lot of code, some of which is sure to hit the rare corner cases and to rely on implementation details."

I reviewed Naoki's patch carefully, but in fact it isn't as big as I expected. In Python 3.6, call_method() calls tp_descr_get of PyFunction_Type which creates PyMethodObject. The tp_call of PyMethodObject calls the function with self, nothing crazy. The patch removes a lot of steps and (IMHO) makes the code simpler than before (when calling Python methods).

I'm not saying that such change is bugfree-proof :-) But we are far from Python 3.7 final, so it's the right time to push such large optimization. |
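The two call paths described in this comment can be mimicked from pure Python. This is only an illustration of the object model, not of the C code in typeobject.c: the first path goes through the function's descriptor protocol and builds a temporary bound method, the second calls the plain function with self as the first argument. The variable names are arbitrary.

```python
import types


class C:
    def __getitem__(self, index):
        return index


c = C()

# The plain function object as stored in the class dictionary.
func = C.__dict__["__getitem__"]

# Old path: the descriptor protocol creates a temporary bound method,
# and calling it invokes the function with `self` prepended.
bound = func.__get__(c, C)
assert isinstance(bound, types.MethodType)
old_result = bound(0)

# New path: skip the bound method and pass `self` explicitly.
new_result = func(c, 0)
```

Both paths produce the same result. The optimization discussed here comes from not allocating the intermediate method object or the argument tuple.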
Naoki: "method_fastcall4.patch looks clean enough, and performance benefit seems nice." Ok, I pushed the patch with minor changes:
"I don't know current test suite covers unusual special methods."

What do you mean by "unusual special methods"?

"Maybe, we can extend test_class to cover !unbound (e.g. @classmethod) case."

It's hard to test all cases, since there are a lot of function types in Python, and each slot (wrapper in typeobject.c) has its own C implementation. But yeah, in general more tests don't harm :-) Since the patch here optimizes the most common case, a regular method implemented in Python, I didn't add a specific test with the change. This case is already very well tested, like everything in the stdlib, no?

--

I tried to imagine how we could avoid temporary method objects in more cases like Python class methods (using @classmethod), but I don't think that it's worth it. It would require more complex code for a less common case. Or does anyone see other common cases which would benefit from a similar optimization? |
patch looks good to me. |
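As a sketch of what a test for the "!unbound" case mentioned above might look like, the classes below define __getitem__ as a regular method, a classmethod and a staticmethod. The class names are invented for this example and nothing here is taken from test_class. Only the regular method takes the optimized path. The other two are assumed to still go through the descriptor's __get__, so the checks guard against behavior changes in that fallback.

```python
class Regular:
    def __getitem__(self, index):
        return index


class ViaClassmethod:
    @classmethod
    def __getitem__(cls, index):
        # Receives the class instead of the instance.
        return (cls.__name__, index)


class ViaStaticmethod:
    @staticmethod
    def __getitem__(index):
        # Receives neither the instance nor the class.
        return index


def check_getitem_variants():
    """Hypothetical helper: exercise c[...] for each kind of descriptor."""
    assert Regular()[3] == 3
    assert ViaClassmethod()[3] == ("ViaClassmethod", 3)
    assert ViaStaticmethod()[3] == 3
    return True
```

The same pattern could be repeated for other slots, since each slot wrapper has its own C implementation.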
New changeset be663c9a9e24 by Victor Stinner in branch 'default': |
Oh, I was too lazy to run the full test suite, I only ran a subset and I was bitten by buildbots :-) test_unraisable() of test_exceptions fails. IMHO the BrokenRepr subtest on this test function is really implementation specific.

To fix buildbots, I removed the BrokenRepr unit test, but kept the other cases on test_unraisable(): change be663c9a9e24. See my commit message for the full rationale.

In fact, the patch changed the error message logged when a destructor fails. Example:

    class Obj:
        def __del__(self):
            raise Exception("broken del")
        def __repr__(self):
            return "<useful repr>"
    obj = Obj()
    del obj

Before, contains "<useful repr>":

    Exception ignored in: <bound method Obj.__del__ of <useful repr>>
    Traceback (most recent call last):
      File "x.py", line 3, in __del__
        raise Exception("broken del")
    Exception: broken del

After, "<useful repr>" is gone:

    Exception ignored in: <function Obj.__del__ at 0x7f10294c3110>
    Traceback (most recent call last):
      File "x.py", line 3, in __del__
        raise Exception("broken del")
    Exception: broken del

There is an advantage. The error message is now better when repr(obj) fails. Example:

    class Obj:
        def __del__(self):
            raise Exception("broken del")
        def __repr__(self):
            raise Exception("broken repr")
    obj = Obj()
    del obj

Before, raw "<object repr() failed>" with no information on the type:

    Exception ignored in: <object repr() failed>
    Traceback (most recent call last):
      File "x.py", line 3, in __del__
        raise Exception("broken del")
    Exception: broken del

After, the error message includes the type:

    Exception ignored in: <function Obj.__del__ at 0x7f162f873110>
    Traceback (most recent call last):
      File "x.py", line 3, in __del__
        raise Exception("broken del")
    Exception: broken del

Technically, slot_tp_finalize() can call lookup_maybe() to get a bound method if the unbound method failed. The question is if it's worth it? In general, I dislike calling too much code to log an exception, since it's likely to raise a new exception. It's exactly the case here: logging an exception raises a new exception (in repr())!

Simpler option: revert the change in slot_tp_finalize() and document that it's deliberate to get a bound method to get a better error message.

The question is a tradeoff between performance and correctness. |
I checked typeobject.c: there is a single case where we use the result of lookup_maybe_method()/lookup_method() for something else than calling the unbound method: slot_tp_finalize() calls PyErr_WriteUnraisable(del), the case discussed in my previous message which caused test_exceptions failure (now fixed). |
Thanks for finishing my draft patch, Victor. callmethod2.patch applies the same trick to the PyObject_CallMethod* APIs in abstract.c. Grepping for "PyObject_CallMethod" shows that there are many format=NULL callers. |
New changeset e5cd74868dfc by Victor Stinner in branch 'default': |
callmethod2.patch: I like that change on object_vacall(), I'm not sure about the change on PyObject_CallMethod*() only for empty format string. I suggest to split your patch into two parts, and first focus on object_vacall(). Do you have a benchmark for this one? Note: I don't like the name I chose for object_vacall(). If we modify it, I would suggest renaming it to object_call_vargs() instead. Anyway, before pushing anything more, I would like to take a decision on the repr()/test_exceptions issue. What do you think Naoki? |
There are many places using _PyObject_CallMethodId() to call a method without args. |
performance benefit is small. |
I'm more interested by an optimization PyObject_CallMethod*() for any number of arguments, as done in typeobject.c ;-) |
Are you using PGO+LTO compilation? Without PGO, the noise of code placement can be too high. In your "perf stat" comparisons, I see that "insn per cycle" is lower with the patch, which sounds more like a code placement issue than a performance issue with the patch. |
Yes, I used --enable-optimizations this time. BTW, since the benefit of GetMethod is small, how about this?
_PyObject_FastCall* can use FASTCALL C function and method (PyCFunction), |
I also *hope* that a call.c file would *help* a little bit, but I'm not sure that it will fix *all* code placement issues. I created the issue bpo-29524 with a patch creating Objects/call.c. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.