Faster opcode dispatch on gcc #49003
This patch implements what is usually called "threaded code" for the ceval loop, on compilers that support the required extension (currently only gcc's "labels as values"). The opcode jump table is generated by a separate script which is called by the Makefile. On compilers other than gcc, performance will of course be unchanged.

Pybench results:

    Test                    minimum run-time    average run-time
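For readers who want to see the shape of the technique, here is a minimal, self-contained sketch of gcc's "labels as values" dispatch. Everything in it (the toy opcode set, the tiny stack machine, the helper names) is invented for illustration; it is not the patch's code, which operates on ceval.c and generates its jump table from the real opcode list.

    /* Minimal sketch of computed-goto ("labels as values") dispatch as gcc
       supports it.  NOT the code from the patch; all names are invented. */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static int run(const unsigned char *code)
    {
        /* One label address per opcode, indexed by opcode number; this is
           the kind of table the patch generates with a script. */
        static void *targets[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
        int stack[16], sp = 0;

        /* Each handler ends with its own indirect jump, so the CPU can learn
           "what usually follows opcode X" separately for every opcode. */
    #define DISPATCH() goto *targets[*code++]

        DISPATCH();
    op_push1:
        stack[sp++] = 1;
        DISPATCH();
    op_add:
        sp--; stack[sp - 1] += stack[sp];
        DISPATCH();
    op_print:
        printf("%d\n", stack[sp - 1]);
        DISPATCH();
    op_halt:
        return stack[sp - 1];
    #undef DISPATCH
    }

    int main(void)
    {
        static const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run(prog) == 2 ? 0 : 1;   /* the toy program computes 1 + 1 */
    }

The switch-based loop it replaces has a single shared dispatch point instead, which is what the later discussion about branch prediction and -fno-crossjumping is about.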
Armin, by reading the pypy-dev mailing-list it looks like you were interested in this.
On 2008-12-26 22:09, Antoine Pitrou wrote:
Now I know why you want opcode stats in pybench :-) This looks like a promising approach. Is it possible to backport this to Python 2.x as well?
Certainly. |
This new patch uses a statically initialized jump table rather than initializing it at run time.
I'm having trouble understanding the technique of the jump table. Can you point me to a paper or some documentation describing it?
I haven't read any papers. Having a jump table in itself isn't special (a switch statement already compiles to one); what matters is how the dispatch is done at the end of each opcode. If you read the patch it will probably be easy to understand. I had the idea to try this after a thread on pypy-dev; there are more details in that thread.
Don't know! Your experiments are welcome. My patch is far simpler to |
You are right. It's easier to understand after I've learned how the
Yes, your patch is much smaller, less intrusive and easier to understand |
Yes, it is. |
This new patch adds some detailed comments, at Jason Orendorff's request. |
You may want to check out bpo-1408710, in which a similar patch was proposed. I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2 machine.

The patch looks pretty clean. Here are a few things that caught my attention.

First, you should rename opcode_targets.c to opcode_targets.h. This will make it clear that the file is meant to be #included, not compiled on its own.

Also, the macro USE_THREADED_CODE should be renamed to something else; the name is too easily read as having something to do with multi-threading.

Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or LLTRACE is defined?

By the way, SUNCC also supports GCC's syntax for labels as values.
Works pretty well for me on my MacBook Pro, but on my G5 it performed worse. If this is applied to the core I think it will have to select for more than just the compiler; the architecture matters as well.
You're sure you didn't compile in debug mode or something? Just |
Antoine> You're sure you didn't compile in debug mode or something?

There was a cut-n-paste error in that one which I noticed right after sending it.

Skip
Hello,
Thanks. The machine I got the 15% speedup on is in 64-bit mode with gcc.

If you want to investigate, you can output the assembler code for ceval.c:

    gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -S -dA Python/ceval.c

and then count the number of indirect jump instructions in ceval.s:

    grep -E "jmp[[:space:]]\*%" ceval.s

There should be 85 to 90 of them, roughly. If there are many fewer, then the compiler has merged several of the indirect jumps into a single one, which defeats the purpose of the patch.
Ok.
Ok.
Because otherwise the measurements these options are meant to do would be distorted.
I don't have a Sun machine to test on, so I'll leave it to someone else to try it with SUNCC.
Attached new patch with the fixes suggested by Alexandre (rename opcode_targets.c to opcode_targets.h, among others).
Mentioning other versions as well. |
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
Paolo> doesn't qualify as a bugfix)

You could backport it to 2.4 & 2.5 and just put it up on PyPI...
Some other comments.
That's true.
I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) rather than USE_THREADED_CODE. "Indirect threading" is the standard name in the CS literature for this kind of dispatch.

The best paper about this is:

The original paper about (direct) threaded code is this:
The only difference with your change is that you save the range check performed by the switch statement. Anyway, this suggests that the speedup really comes from better branch prediction of the indirect jumps. I guess (but this might be wrong) that's because the execution units sit mostly idle while a mispredicted dispatch branch is resolved.
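To make that comparison concrete, here is the same toy machine from the sketch above written with an ordinary switch (again invented code, not ceval.c). The compiler typically lowers it to one bounds check plus a single indirect jump shared by every opcode transition, which is both the range check mentioned above and the single hard-to-predict branch site.

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    /* Toy switch-based dispatch for contrast.  The switch is usually compiled
       to roughly "if (op > OP_HALT) goto default;" followed by ONE indirect
       jump through the compiler's own table, so all opcode transitions funnel
       through the same branch. */
    static int run_switch(const unsigned char *code)
    {
        int stack[16], sp = 0;
        for (;;) {
            switch (*code++) {                     /* shared dispatch point */
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return stack[sp - 1];
            default:       return -1;              /* the implicit range check */
            }
        }
    }

    int main(void)
    {
        static const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run_switch(prog) == 2 ? 0 : 1;
    }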
Topics
======= About different speedups on 32 bits vs 64 bits =======

Look at the amount of register variables in PyEval_EvalFrameEx() (btw,
In fact, adding locals this way gave huge speedups on tight loops on the
And adding a write to memory in the dispatch code (to f->last_i) gave a

======= About PPC slowdown =======

During our VM course, threading slowed down a toy interpreter with 4 toy opcodes. Right now I don't have either a Pentium 4 or a PowerPC available, so I can't investigate this myself.

======= PyPI =======

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
== On the patch itself ==

#define OPCODE_LIST(DEFINE_OPCODE) \
DEFINE_OPCODE(STOP_CODE, 0) \
DEFINE_OPCODE(POP_TOP, 1) \
DEFINE_OPCODE(ROT_TWO, 2) \
DEFINE_OPCODE(ROT_THREE, 3) \
DEFINE_OPCODE(DUP_TOP, 4) \
DEFINE_OPCODE(ROT_FOUR, 5) \
DEFINE_OPCODE(NOP, 9) \
    ...

# define DECL_OPCODE(opcode) \

void *opcodes[] = {
OPCODE_LIST(DECL_OPCODE)
};
# undef DECL_OPCODE

There are also other ways to do it, but using higher-order functions like this keeps the whole opcode list in a single place.
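For what it's worth, here is a compilable sketch of the higher-order-macro idea described above. The macro and label names are invented and do not match the patch; the point is only that the opcode list is written once and expanded twice, so the numeric constants and the jump table cannot drift apart.

    /* X-macro sketch: one OPCODE_LIST, two expansions.  Compile with gcc
       (uses the labels-as-values extension).  Names are illustrative only. */
    #define OPCODE_LIST(X) \
        X(STOP_CODE, 0)    \
        X(POP_TOP, 1)      \
        X(ROT_TWO, 2)

    /* Expansion 1: the numeric opcode constants (what opcode.h provides). */
    #define DEFINE_ENUM(name, value) name = value,
    enum opcode { OPCODE_LIST(DEFINE_ENUM) };
    #undef DEFINE_ENUM

    /* Expansion 2: the jump table, inside the interpreter function. */
    int interp_sketch(void)
    {
    #define DECL_OPCODE(name, value) [value] = &&label_##name,
        static void *opcodes[] = { OPCODE_LIST(DECL_OPCODE) };
    #undef DECL_OPCODE

        goto *opcodes[POP_TOP];   /* dispatch straight to one handler */

    label_STOP_CODE: return 0;
    label_POP_TOP:   return 1;
    label_ROT_TWO:   return 2;
    }

Generating both opcode.h and the targets table from one such list is exactly the "change opcode.h" point raised in the next comment.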
Hi,
We would have to change opcode.h for this to be truly useful (in order to keep the opcode list in a single place). Thanks for all the explanation and pointers!

About register allocation,

As for the "register" declarations, I think they're just remains of the past.

Antoine.
Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
I don't see why it wouldn't.

Skip
Yep.
Agreed, I'll maybe try to find time for it.
I didn't look at the output, but that shouldn't be needed with decent compilers. I think that change would just save some compilation time spent in dataflow analysis. I just studied liveness analysis in compilers, and it computes whether a value may still be needed at a given program point. The only important thing is that the contents of the jump table are known at compile time.
Antoine Pitrou wrote:
I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a much smaller number.
Ah, I see now. Maybe you should add a quick comment that mentions this.
I tested it and it worked, no test failures to report. Just change the

    #if defined(__GNUC__) && \
    ...

to

    #if (defined(__GNUC__) || defined(__SUNPRO_C)) && \
    ...

I attached some additional benchmarks on SunOS. So far, it seems the
You forgot to update your script to use the new name. |
Try -fno-crossjumping.
Thanks.
Ah, that's rather dumb :) |
For the record, I've compiled py3k on an embarrassingly fast Core2-based machine (with gcc 4.3.2 in 64-bit mode).
Does anyone know the equivalent ICC command line option for GCC's -fno-crossjumping? It looks like ICC hits the same combining-goto problem, as was seen with older gcc. Even if stock Python is built with MSVC, devs like myself who ship their own Python builds could still benefit from it.
The x86 gentoo buildbot is failing to compile, with error:

    /Python/makeopcodetargets.py ./Python/opcode_targets.h

I suspect that it's because the system Python on this buildbot is too old to run the makeopcodetargets.py script as written.
One other thought: it seems that as a result of this change, building the py3k branch from scratch now needs a pre-existing Python to run the generation script. Might it be worth adding the file Python/opcode_targets.h to the repository?
Sorry: ignore that last. Python/opcode_targets.h is already part of the repository.
Mark: """Are there any objections to me adding a couple of square brackets to No problems for me. You might also add to the top comments of the file |
Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be compiling again, though it now shows a test failure.
The test failure also happens on trunk, it may be related to other recent changes.
Yes; sorry---I didn't mean to suggest that the test failures were in any way related to this change.
This has been checked in, right? Might I suggest that the TARGET and TARGET_WITH_IMPL macros not include the trailing colon? The code reads more naturally with TARGET(NOP): than with TARGET(NOP).

S
Yes, please! |
Skip, removing the colon doesn't work if the macro adds code after the label.
Antoine> Skip, removing the colon doesn't work if the macro adds code
Antoine> after the label.

When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon, but apparently that's not the case.

Skip
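To make the colon discussion concrete, here is a simplified reconstruction, not the exact ceval.c macros (the real ones do more bookkeeping; see the HAS_ARG discussion below). A plain TARGET(op) does end at a label, but the WITH_IMPL variant has to emit a statement (the goto to the shared implementation) after it, which is why the colon cannot simply be left to the call site.

    /* Simplified reconstruction for illustration only. */
    #define TARGET(op) \
        TARGET_##op:   \
        case op:

    #define TARGET_WITH_IMPL(op, impl) \
        TARGET_##op:                   \
        case op:                       \
            goto impl;

    enum { NOP = 9, DUP_TOPX = 99 };   /* tiny invented subset of opcodes */

    int dispatch_sketch(int opcode)
    {
        switch (opcode) {
        TARGET(NOP)
            return 0;
        TARGET_WITH_IMPL(DUP_TOPX, shared_impl)
        shared_impl:
            return 1;
        default:
            return -1;
        }
    }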
Out of interest, the attached patch against the py3k branch at r70516
On my systems (all older AMD with old versions of gcc), this patch has |
Is a backport to 2.7 still planned? |
On 2009-03-31 03:19, A.M. Kuchling wrote:
I hope it is. |
Andrew, your patch disables the optimization that HAS_ARG(op) is a compile-time constant when the opcode is known at compile time.
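For context: HAS_ARG() is just a comparison against HAVE_ARGUMENT in Include/opcode.h, so when the opcode is a literal constant inside its own per-opcode block the test folds away at compile time. The two helper functions below are invented for illustration (they are not ceval.c code); only the two macro definitions match opcode.h of that era.

    /* From Include/opcode.h (2.x / early 3.x): */
    #define HAVE_ARGUMENT 90
    #define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

    int oparg_generic(int opcode, const unsigned char *p)
    {
        /* Shared switch dispatch: opcode is only known at run time, so
           HAS_ARG() is a real conditional executed on every instruction. */
        return HAS_ARG(opcode) ? (p[0] | (p[1] << 8)) : -1;
    }

    int oparg_load_fast(const unsigned char *p)
    {
        /* Per-opcode handler: LOAD_FAST is the literal 124, HAS_ARG(124) is
           the constant 1, and the compiler deletes the test entirely. */
        enum { LOAD_FAST = 124 };
        return HAS_ARG(LOAD_FAST) ? (p[0] | (p[1] << 8)) : -1;
    }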
Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my

On an Intel E8200 cpu running FreeBSD 7.1 amd64, with gcc 7.2.1 and the

For comparison, this machine running Windows XP (32 bit) with the
I have patched the code of Python 3.1 to use the computed goto technique as well.
This is too late for 2.x now, closing. |
Hi All,

This is Vamsi from the Server Scripting Languages Optimization team at Intel Corporation. I would like to submit a request to enable the computed goto based dispatch in Python 2.x (which happens to be enabled by default in Python 3, given its performance benefits on a wide range of workloads). We talked about this patch with Guido and he encouraged us to submit a request on python-dev (email conversation with Guido shown at the bottom of this email).

Attached is the computed goto patch (along with instructions to run) for Python 2.7.10 (based on the patch submitted by Jeffrey Yasskin at http://bugs.python.org/issue4753). We built and tested this patch for Python 2.7.10 on a Linux machine (Ubuntu 14.04 LTS server, Intel Xeon – Haswell EP CPU with 18 cores, hyper-threading off, turbo off).

Below is a summary of the performance we saw on the “grand unified python benchmarks” suite (available at https://hg.python.org/benchmarks/). We made 3 rigorous runs of the following benchmarks. In each rigorous run, a benchmark is run 100 times with and without the computed goto patch. Below we show the average performance boost for the 3 rigorous runs.
Python 2.7.10 (original) vs Computed Goto performance:

Thanks,

------------------------------------------------------------------------------------------------------------------------------------------------------------

Hi Robert and David,

You might also check with Benjamin Peterson, who is the 2.7 release manager. I think he just announced 2.7.10, so it's too late for that, but I assume we'll keep doing 2.7.x releases until 2020.

--Guido

PS. I am assuming you are contributing this under a PSF-accepted license, e.g. Apache 2.0, otherwise it's an automatic nogo.

On Tue, May 19, 2015 at 9:33 AM, Cohn, Robert S <robert.s.cohn@intel.com> wrote:

When we met for lunch at pycon, I asked if performance related patches would be ok for python 2.x. My understanding is that you thought it was possible if it did not create a maintainability problem. We have an example now, a 2.7 patch for computed goto based on the implementation in python 3 http://bugs.python.org/issue4753 It increases performance by up to 10% across a wide range of workloads. As I mentioned at lunch, we hired David Murray’s company, and he is guiding intel through the development process for cpython. David and I thought it would be good to run this by you before raising the issue on python-dev. Do you have a specific concern about this patch or a more general concern about performance patches to 2.7? Thanks.

Robert
FWIW I'm interested and willing to poke at this if more testers/reviewers are needed. |
@vamsi, could you please open a new issue and attach your patch there so it can be properly tracked for 2.7? This issue has been closed for five years and the code has been out in the field for a long time in Python 3. Thanks! |
New changeset 17d3bbde60d2 by Benjamin Peterson in branch '2.7': |
The 2.7 back-ported version of this patch appears to have broken compilation on the Windows XP buildbot, during the OpenSSL build process, when the newly built Python is used to execute the build_ssl.py script.

After this patch, when that stage executes, and prior to any output from the build script, the python_d process goes to 100% CPU and sticks there until the build process times out 1200s later and kills it. I don't think it's really ssl related though, as after doing some debugging the exact same thing happens if I simply run python_d (I never see a prompt - it just sits there burning CPU). So I think build_ssl.py is just the first use of the generated python_d during the build process.

I did try attaching to the CPU-stuck version of python_d from VS, and so far from what I can see, the process never gets past the Py_Initialize() call in Py_Main(). It's all over the place in terms of locations if I try interrupting it, but it's always stuck inside that first Py_Initialize call.

I'm not sure if it's something environmental on my slave, or a difference with a debug vs. production build (I haven't had a chance to try building a release version yet).

-- David
I ran a few more tests, and the generated executable hangs in both release and debug builds. The closest I can get at the moment is that it's stuck importing errno from the "import sys, errno" line in os.py - at least no matter how long I wait after starting a process before breaking out, output with -v looks like:

> python_d -v
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# D:\cygwin\home\db3l\test\python2.7\lib\site.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\site.py
import site # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\site.pyc
# D:\cygwin\home\db3l\test\python2.7\lib\os.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\os.py
import os # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\os.pyc
import errno # builtin
Traceback (most recent call last):
File "D:\cygwin\home\db3l\test\python2.7\lib\site.py", line 62, in <module>
import os
File "D:\cygwin\home\db3l\test\python2.7\lib\os.py", line 26, in <module>
import sys, errno
KeyboardInterrupt
# clear __builtin__._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.exitfunc
# clear sys.exc_type
# clear sys.exc_value
# clear sys.exc_traceback
# clear sys.last_type
# clear sys.last_value
# clear sys.last_traceback
# clear sys.path_hooks
# clear sys.path_importer_cache
# clear sys.meta_path
# clear sys.flags
# clear sys.float_info
# restore sys.stdin
# restore sys.stdout
# restore sys.stderr
# cleanup __main__
# cleanup[1] zipimport
# cleanup[1] errno
# cleanup[1] signal
# cleanup[1] exceptions
# cleanup[1] _warnings
# cleanup sys
# cleanup __builtin__
[8991 refs]
# cleanup ints: 6 unfreed ints
# cleanup floats

I never have a problem interrupting the process, so KeyboardInterrupt is processed normally - it just looks like it gets stuck in an infinite loop during startup.

-- David
Please open a new issue with the details about your problem. |
Oops, sorry, I had just followed the commit comment to this issue. For the record here, it looks like Benjamin has committed an update (5e8fa1b13516) that resolves the problem. |