Add _PyObject_FastCall() #71315
Since the issue bpo-26814 proved that avoiding the creation of temporary tuples to call Python and C functions makes Python faster (between 2% and 29% depending on the benchmark), I extracted a first "minimal" patch to start merging this work. The first patch adds new functions:
I hesitate between the C types "int" and "Py_ssize_t" for nargs. I read once that using "int" can cause performance issues in a loop using "i++" and "data[i]", because the compiler has to handle integer overflow of the int type. The "int" type is also annoying on Windows 64-bit: it causes compiler warnings on downcasts, like PyTuple_GET_SIZE(co->co_argcount) stored into a C int.

_PyObject_FastCall() avoids the creation of a tuple for:
The patch removes the "cache tuple" optimization from property_descr_get(); it uses PyObject_CallArg1() instead. It means that the optimization is (currently) missed in some cases compared to the current code, but the code is safer and simpler.

The patch adds Python/pystack.c, which currently only contains _PyStack_AsTuple() but will contain more code later.

I tried to write the smallest patch, but I started to use PyObject_CallNoArg() and PyObject_CallArg1() where the code already created a tuple at each call: PyObject_CallObject(), call_function_tail() and PyEval_CallObjectWithKeywords().

In the patch, keywords are not used in fast calls. But they will be used later. I prefer to start directly with keywords rather than change the calling convention once again later.

Later, I will propose other patches to:
So the fast call will be taken in more cases.

The long term plan is to slowly use the new FASTCALL calling convention "everywhere". The tricky points are the tp_new, tp_init and tp_call attributes of type objects. In the issue bpo-26814, I wrote a patch adding Py_TPFLAGS_FASTNEW, Py_TPFLAGS_FASTINIT and Py_TPFLAGS_FASTCALL flags to use the FASTCALL calling convention for tp_new, tp_init and tp_call. The problem is that calling these methods directly looks common; if we change the calling convention of these methods, it will break the C API. I propose to discuss that later ;-)

An alternative is to add a tp_fastcall method to PyTypeObject and use a wrapper for tp_call for backward compatibility. This option also has drawbacks. Again, I propose to discuss this later, and first focus on the changes that don't break anything ;-)
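As a rough illustration of the idea, here is a toy model in plain C (NOT CPython code: `toy_tuple`, `fast_call` and friends are invented for this sketch, and `int` stands in for `PyObject *`). A "fast call" passes arguments as a plain C array plus a count, while the classic calling convention forces every call site to pack them into a heap-allocated tuple first:

```c
#include <stdlib.h>

/* Toy stand-in for a Python tuple (invented for this sketch). */
typedef struct {
    int size;
    int items[8];
} toy_tuple;

/* Classic convention: the callee receives an already-built tuple. */
static int sum_tuple(const toy_tuple *t)
{
    int s = 0;
    for (int i = 0; i < t->size; i++)
        s += t->items[i];
    return s;
}

/* Every call site must allocate, fill and free a temporary tuple. */
static int call_with_tuple(int (*func)(const toy_tuple *),
                           const int *args, int nargs)
{
    toy_tuple *t = malloc(sizeof(toy_tuple));  /* temporary allocation */
    t->size = nargs;                           /* (nargs <= 8 assumed) */
    for (int i = 0; i < nargs; i++)
        t->items[i] = args[i];
    int result = func(t);
    free(t);
    return result;
}

/* FASTCALL-style convention: arguments stay in the caller's C array,
 * so no temporary tuple is ever created. */
static int sum_fast(const int *args, int nargs)
{
    int s = 0;
    for (int i = 0; i < nargs; i++)
        s += args[i];
    return s;
}

static int fast_call(int (*func)(const int *, int),
                     const int *args, int nargs)
{
    return func(args, nargs);
}
```

Both paths compute the same result; the fast path simply skips the allocate/copy/free round trip, which is where the speedups measured in bpo-26814 come from.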
Quick & dirty microbenchmark: I ran bench_fast-2.py from the issue bpo-26814. It looks like everything is slower :-p In fact, I already noticed this issue and I think that it is fixed with better compilation options: use "./configure --with-lto" and "make profile-opt". See my article.

At least, we have a starting point ;-)
default-May26-13-36-33.log: CPython benchmark suite run using stable config. Faster (15):
Slower (8):
Updated bench_fast-2.py results with Python compiled with PGO+LTO, and with benchmark.py fixed to compute the average and standard deviation. Only getattr() really seems slower:
See bpo-27213. Maybe a fast call with keyword arguments would avoid the creation of a dict.
Serhiy Storchaka added the comment:
In a first version of my implementation, I used dictionary items. But in practice, it's very rare in the C code base to have to call a function with keyword arguments.
Rebased patch.
(Oops, I removed a broken fastcall-2.patch which didn't include the new pystack.c/pystack.h files. It's now fixed in the new fastcall-2.patch.)
I spent the last 3 months making the CPython benchmark suite more stable and enhancing my procedure for running benchmarks, to ensure that results are more stable. See my articles.

I forked and enhanced the benchmark suite to use my perf module to run benchmarks in multiple processes. I ran this better benchmark suite on fastcall-2.patch on my laptop. The result is quite good:

$ python3 -m perf compare_to ref.json fastcall.json -G --min-speed=5
Slower (4):
- fastpickle/pickle_dict: 326 us +- 15 us -> 350 us +- 29 us: 1.07x slower
- regex_effbot: 49.4 ms +- 1.3 ms -> 53.0 ms +- 1.2 ms: 1.07x slower
- fastpickle/pickle: 432 us +- 8 us -> 457 us +- 10 us: 1.06x slower
- pybench.ComplexPythonFunctionCalls: 838 ns +- 11 ns -> 884 ns +- 8 ns: 1.05x slower

Faster (13):
Benchmark hidden because not significant (84): (...)

Most benchmarks are not significant, which is expected: fastcall-2.patch is really the most simple patch to start the work on "FASTCALL". It doesn't really implement any optimization, it only adds a new infrastructure to implement new optimizations.

A few benchmarks are faster (only benchmarks at least 5% faster are shown, using --min-speed=5). 4 benchmarks are slower, but the slowdown should be temporary: later optimizations should make these benchmarks faster again. See the issue bpo-26814 for a more concrete implementation and a lot of benchmark results if you don't trust me :-)

I consider that the benchmarks proved that there is no major slowdown, so fastcall-2.patch can be merged to be able to start working on real optimizations.
Benchmarking results look nice, but despite the fact that this patch is only a small part of bpo-26814, it looks to me larger than it could be.
Oh, I failed to express my intent. This initial patch is not expected to make Python faster.
See my remark above: no speedup is expected. Do you suggest to not add these 2 new functions? Since they are well defined, I prefer to add them now.
Sorry, I don't understand. This function requires a tuple; the whole point of this work is to avoid creating such temporary tuples. In my full patch, PyObject_Call() calls _PyObject_FastCall() in most cases.
Well, I can add support for keyword arguments later and start with an API limited to positional arguments. I plan to add a new METH_FASTCALL calling convention for C functions.
Keyword arguments are optional. Having support for them costs nothing when they are not used.
I really want to have a "pystack" API. In this patch, the new file looks almost empty, but more code will come later. I'm limited by Mercurial and our workflow (tools); it would be much easier with a better workflow.
Yes, I suggest to not add them. The API for calling is already too large.
Sorry, I meant PyObject_CallFunctionObjArgs() and the like.
My point is that if keyword arguments are used, this is not a fast call.
But for now there is no "pystack" API. What do you want to add? Can it be added later? Here is an alternative, simplified patch.
About: "I hesitate between the C types "int" and "Py_ssize_t" for nargs. I read once that using "int" can cause performance issues in a loop using "i++" and "data[i]" because the compiler has to handle integer overflow of the int type."

This is true because of -fwrapv, but I believe it is also true for Py_ssize_t, which is a signed type as well. However, a speed-up would be achievable by disabling -fwrapv, because only then can "i++; data[i]" be safely optimized into *(++data).
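A small self-contained sketch of the two loop shapes being discussed (`ptrdiff_t` stands in for `Py_ssize_t` here purely so the snippet compiles without Python headers): with `-fwrapv`, signed index arithmetic must preserve wraparound semantics, which can block the compiler from strength-reducing the indexed loop into the pointer form, so the second function writes out that optimized form by hand:

```c
#include <stddef.h>

/* Indexed form: the "i++; data[i]" pattern with a signed index
 * (ptrdiff_t plays the role of Py_ssize_t in this sketch). */
static long sum_indexed(const int *data, ptrdiff_t n)
{
    long s = 0;
    for (ptrdiff_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* Pointer form: the strength-reduced version the optimizer would
 * like to emit for the loop above, written explicitly. */
static long sum_pointer(const int *data, ptrdiff_t n)
{
    long s = 0;
    for (const int *p = data, *end = data + n; p < end; p++)
        s += *p;
    return s;
}
```

Both functions return the same sum; the question in the thread is only whether the compiler is allowed to turn the first shape into the second on its own.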
Serhiy Storchaka added the comment:
Well, we can start without them, and see later if it's worth it. I didn't propose to add new functions to make the code faster, but to make it simpler. I dislike PyEval_CallObjectWithKeywords(func, arg, kw) because it has a confusing API. It's common to call a function with no argument or just one argument.
Yes, my full patch does optimize these functions:
I'm not sure that I understand your point. For example, in my full patch, I have a METH_FASTCALL calling convention. How do you want to implement METH_FASTCALL if you cannot pass keyword arguments? It's ok if passing keyword arguments is not faster, but it should simply be as fast.
See my fastcall branch: https://hg.python.org/sandbox/fastcall/file/2dc558e01e66/Include/pystack.h

All these functions are private. They are used internally to implement the public calling functions.
Ah? Maybe you should open a separate issue for that. I prefer to have an API specific to building a "stack" to call functions.
My full patch does optimize "everything"; it's deliberate to start with a minimal patch here.
Ah? What is the regression?
New changeset 288ec55f1912 by Victor Stinner in branch 'default':
New changeset e615718a6455 by Victor Stinner in branch 'default':
Patch version 3: simpler and shorter patch
I also pushed some changes unrelated to fastcall in Python/ceval.c to simplify the patch. Very few functions are modified (directly or indirectly) to use _PyObject_FastCall():
Much more will come in the following patches.
Oh, I now understand. The change makes "namedtuple.attr" slower. With fastcall-3.patch attached to this issue, the fast path is not taken on this benchmark, so you lose the removed optimization (the tuple cached in the modified descriptor function). In fact, you need the "full" fastcall change to make this attribute lookup *faster*. So yeah, it's better to wait until more changes are merged.
New changeset a1a29d20f52d by Victor Stinner in branch 'default':
New changeset 89e4ad001f3d by Victor Stinner in branch 'default':
New changeset 7cd479573de9 by Victor Stinner in branch 'default':
New changeset 34af2edface9 by Victor Stinner in branch 'default':
New changeset adceb14cab96 by Victor Stinner in branch 'default':
New changeset 10f1a4910adb by Victor Stinner in branch 'default':
New changeset 5cf9524f2923 by Victor Stinner in branch 'default':
New changeset f1ad6f64a11e by Victor Stinner in branch 'default':
New changeset 2da6dc1c30d8 by Victor Stinner in branch 'default':
New changeset 2d4d40da2aba by Victor Stinner in branch '3.5':
New changeset 5b1ed48aedef by Victor Stinner in branch '2.7':
New changeset df4efc23ab18 by Victor Stinner in branch '3.5':
New changeset 7669fb39a9ce by Victor Stinner in branch '2.7':
New changeset 73b00fb1dc9d by Victor Stinner in branch 'default':
New changeset 8e085070ab28 by Victor Stinner in branch 'default':
New changeset 2d2bc1906b5b by Victor Stinner in branch 'default':
New changeset 6eb586b85fa1 by Victor Stinner in branch 'default':
New changeset 605a42a50496 by Victor Stinner in branch 'default':
New changeset 6a21b6599692 by Victor Stinner in branch 'default':
New changeset 45d2b5c12b19 by Victor Stinner in branch 'default':
New changeset 124d5d0ef81f by Victor Stinner in branch 'default':
New changeset 71c22e592a9b by Victor Stinner in branch 'default':
New changeset 3ab32f7add6e by Victor Stinner in branch 'default':
Ok, I updated the simplest forms of function calls. I will open new issues for more complex calls and for more sensitive parts of the code. Buildbots seem to be happy.
New changeset c2af917bde71 by Victor Stinner in branch 'default':
New changeset 0da1ce362d15 by Victor Stinner in branch 'default':
New changeset e5b24f595235 by Victor Stinner in branch 'default':
New changeset 154f78d387f9 by Victor Stinner in branch 'default':
New changeset 351b987d6d1c by Victor Stinner in branch 'default':
New changeset abb93035ebb7 by Victor Stinner in branch 'default':
New changeset 2954d2aa4c90 by Victor Stinner in branch 'default':
FYI: I copied your (no-kwargs) implementation over into Cython, and I get around 17% faster calls to Python functions with 2 positional arguments.
Hey, cool! It's always nice to get a performance enhancement without having to break the C API or modify source code :-) What do you mean by "I copied your (no-kwargs) implementation"? The whole giant patch? Or just a few changes? Which ones?
New changeset 7dd85b19c873 by Victor Stinner in branch 'default':
The problem is that passing keyword arguments as a dict is not the most efficient way, due to the overhead of creating a dict. For now, keyword arguments are pushed on the stack as an interlaced array of keyword names and values. It may be more efficient to push values and names as continuous arrays (bpo-27213).

PyArg_ParseTupleAndKeywords() accepts a tuple and a dict, but the private function _PyArg_ParseTupleAndKeywordsFast() (bpo-27574) can be changed to accept positional and keyword arguments as continuous arrays: (int nargs, PyObject **args, int nkwargs, PyObject **kwnames, PyObject **kwargs). Therefore we will be forced either to change the signature of _PyObject_FastCall() and the meaning of METH_FASTCALL, or to add new _PyObject_FastCallKw() and METH_FASTCALLKW to support fast passing of keyword arguments. Or maybe add yet another _PyObject_FastCallNoKw() for faster passing of only positional arguments, without the overhead of _PyObject_FastCall(), and make the older _PyObject_FastCall() and METH_FASTCALL obsolete.

There is one more possibility. Argument Clinic can generate a dict that maps keyword argument names to indices of arguments, and tie it to a function. External code would map names to indices using this dictionary and pass arguments as just a continuous array to a function with METH_FASTCALL (raising an error if some argument is passed both as positional and keyword, if a keyword-only argument is passed as positional, etc.). In that case, the kwargs parameter of _PyObject_FastCall() becomes obsolete too.
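A toy model of the "continuous arrays" layout sketched above (everything here is invented for illustration: `toy_stack`, `get_kwarg`, C strings for names and `int` values instead of `PyObject *`): positional arguments, keyword values and keyword names live in flat parallel arrays, so a callee can fetch a keyword argument without any dict ever being built:

```c
#include <string.h>

/* Toy fast-call stack (not CPython code): positional args in
 * args[0..nargs), keyword values in kwargs[0..nkwargs), and the
 * matching keyword names in the parallel kwnames array. */
typedef struct {
    const int *args;
    int nargs;
    const char *const *kwnames;
    const int *kwargs;
    int nkwargs;
} toy_stack;

/* Linear scan by name: for the handful of keywords typical of a
 * C-level call, this avoids building and probing a temporary dict. */
static int get_kwarg(const toy_stack *st, const char *name,
                     int default_value)
{
    for (int i = 0; i < st->nkwargs; i++) {
        if (strcmp(st->kwnames[i], name) == 0)
            return st->kwargs[i];
    }
    return default_value;
}
```

The interlaced layout mentioned in the comment would instead store names and values alternating in a single array; the split-array form above is the shape proposed in bpo-27213.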
I copied what you committed into CPython for _PyFunction_FastCall(), and then enabled its usage in a couple of places, especially for all function/method calls that we generate for user code. Note that PyMethod objects get unpacked into function+self right before the PyFunction_Check(), so the tuple avoidance optimisation also applies to Python method calls.
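The bound-method unpacking described above can be pictured with a toy model (the struct names and the fixed-size argument stack are invented for this sketch): a method is split into its underlying function and its self object, self is prepended to the positional arguments, and the plain fast-call path is then reused:

```c
/* Toy stand-ins (not CPython/Cython code) for a function object and a
 * bound method wrapping it.  int plays the role of PyObject *. */
typedef struct {
    int (*code)(const int *args, int nargs);  /* fast-call entry point */
} toy_function;

typedef struct {
    toy_function *func;  /* the unpacked function */
    int self;            /* the bound "self" object */
} toy_method;

static int sum_args(const int *args, int nargs)
{
    int s = 0;
    for (int i = 0; i < nargs; i++)
        s += args[i];
    return s;
}

/* Unpack method -> (func, self), prepend self to the argument array,
 * then fast-call the underlying function.  The stack is bounded to 8
 * slots to keep the sketch short. */
static int call_method_fast(const toy_method *m, const int *args, int nargs)
{
    int stack[8];
    stack[0] = m->self;
    for (int i = 0; i < nargs && i < 7; i++)
        stack[i + 1] = args[i];
    return m->func->code(stack, nargs + 1);
}
```

Because the unpacking happens before the fast-call dispatch, method calls avoid the temporary argument tuple exactly like plain function calls do.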
Ok, I see much better with concrete commits. I'm really happy that this work also benefits Cython. Note: the handling of keywords is likely to change quickly ;-)
The main feature (_PyFunction_FastCall()) has been merged. Supporting keyword arguments is now handled by another issue (see bpo-27830). I close this issue.