Faster opcode dispatch on gcc #49003
This patch implements what is usually called "threaded code" for the ceval loop, on compilers that support the required extension (currently only gcc's "labels as values"). The opcode jump table is generated by a separate script which is called by the Makefile. On compilers other than gcc, performance will of course be unchanged.

Pybench results:

    Test                    minimum run-time    average run-time
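For readers who want to see the shape of the technique, here is a minimal, self-contained sketch of gcc's "labels as values" dispatch. Everything in it (the toy opcode set, the tiny stack machine, the helper names) is invented for illustration; it is not the patch's code, which operates on ceval.c and generates its jump table from the real opcode list.

    /* Minimal sketch of computed-goto ("labels as values") dispatch as gcc
       supports it.  NOT the code from the patch; all names are invented. */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static int run(const unsigned char *code)
    {
        /* One label address per opcode, indexed by opcode number; this is
           the kind of table the patch generates with a script. */
        static void *targets[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
        int stack[16], sp = 0;

        /* Each handler ends with its own indirect jump, so the CPU can learn
           "what usually follows opcode X" separately for every opcode. */
    #define DISPATCH() goto *targets[*code++]

        DISPATCH();
    op_push1:
        stack[sp++] = 1;
        DISPATCH();
    op_add:
        sp--; stack[sp - 1] += stack[sp];
        DISPATCH();
    op_print:
        printf("%d\n", stack[sp - 1]);
        DISPATCH();
    op_halt:
        return stack[sp - 1];
    #undef DISPATCH
    }

    int main(void)
    {
        static const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run(prog) == 2 ? 0 : 1;   /* the toy program computes 1 + 1 */
    }

The switch-based loop it replaces has a single shared dispatch point instead, which is what the later discussion about branch prediction and -fno-crossjumping is about.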
Armin, by reading the pypy-dev mailing-list it looks like you were interested in this.
On 2008-12-26 22:09, Antoine Pitrou wrote:
Now I know why you want opcode stats in pybench :-) This looks like a promising approach. Is it possible to backport this to Python 2.x as well?
Certainly. |
This new patch uses a statically initialized jump table rather than initializing it at run time.
I'm having trouble understanding the technique of the jump table. Can you point me to a paper or some documentation describing it?
I haven't read any papers. Having a jump table in itself isn't special (a switch statement already compiles to one); what matters is how the dispatch is done at the end of each opcode. If you read the patch it will probably be easy to understand. I had the idea to try this after a thread on pypy-dev; there are more details in that thread.
Don't know! Your experiments are welcome. My patch is far simpler to |
You are right. It's easier to understand after I've learned how the
Yes, your patch is much smaller, less intrusive and easier to understand |
Yes, it is. |
This new patch adds some detailed comments, at Jason Orendorff's request. |
You may want to check out bpo-1408710, in which a similar patch was proposed. I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2 machine.

The patch looks pretty clean. Here are a few things that caught my attention.

First, you should rename opcode_targets.c to opcode_targets.h. This will make it clear that the file is meant to be #included, not compiled on its own.

Also, the macro USE_THREADED_CODE should be renamed to something else; the name is too easily read as having something to do with multi-threading.

Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or LLTRACE is defined?

By the way, SUNCC also supports GCC's syntax for labels as values.
Works pretty well for me on my MacBook Pro, but on my G5 it performed worse. If this is applied to the core I think it will have to select for more than just the compiler; the architecture matters as well.
You're sure you didn't compile in debug mode or something? Just |
Antoine> You're sure you didn't compile in debug mode or something?

There was a cut-n-paste error in that one which I noticed right after sending it.

Skip
Hello,
Thanks. The machine I got the 15% speedup on is in 64-bit mode with gcc.

If you want to investigate, you can output the assembler code for ceval.c:

    gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -S -dA Python/ceval.c

and then count the number of indirect jump instructions in ceval.s:

    grep -E "jmp[[:space:]]\*%" ceval.s

There should be 85 to 90 of them, roughly. If there are many fewer, then the compiler has merged several of the indirect jumps into a single one, which defeats the purpose of the patch.
Ok.
Ok.
Because otherwise the measurements these options are meant to do would be distorted.
I don't have a Sun machine to test on, so I'll leave it to someone else to try it with SUNCC.
Attached new patch with the fixes suggested by Alexandre (rename opcode_targets.c to opcode_targets.h, among others).
Mentioning other versions as well. |
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
Paolo> doesn't qualify as a bugfix)

You could backport it to 2.4 & 2.5 and just put it up on PyPI...
Some other comments.
That's true.
I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) rather than USE_THREADED_CODE. "Indirect threading" is the standard name in the CS literature for this kind of dispatch.

The best paper about this is:

The original paper about (direct) threaded code is this:
The only difference with your change is that you save the range check performed by the switch statement. Anyway, this suggests that the speedup really comes from better branch prediction of the indirect jumps. I guess (but this might be wrong) that's because the execution units sit mostly idle while a mispredicted dispatch branch is resolved.
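To make that comparison concrete, here is the same toy machine from the sketch above written with an ordinary switch (again invented code, not ceval.c). The compiler typically lowers it to one bounds check plus a single indirect jump shared by every opcode transition, which is both the range check mentioned above and the single hard-to-predict branch site.

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    /* Toy switch-based dispatch for contrast.  The switch is usually compiled
       to roughly "if (op > OP_HALT) goto default;" followed by ONE indirect
       jump through the compiler's own table, so all opcode transitions funnel
       through the same branch. */
    static int run_switch(const unsigned char *code)
    {
        int stack[16], sp = 0;
        for (;;) {
            switch (*code++) {                     /* shared dispatch point */
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return stack[sp - 1];
            default:       return -1;              /* the implicit range check */
            }
        }
    }

    int main(void)
    {
        static const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run_switch(prog) == 2 ? 0 : 1;
    }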
Topics
======= About different speedups on 32 bits vs 64 bits =======

Look at the amount of register variables in PyEval_EvalFrameEx() (btw,
In fact, adding locals this way gave huge speedups on tight loops on the
And adding a write to memory in the dispatch code (to f->last_i) gave a

======= About PPC slowdown =======

During our VM course, threading slowed down a toy interpreter with 4 toy opcodes. Right now I don't have either a Pentium 4 or a PowerPC available, so I can't investigate this myself.

======= PyPI =======

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
== On the patch itself ==

#define OPCODE_LIST(DEFINE_OPCODE) \
DEFINE_OPCODE(STOP_CODE, 0) \
DEFINE_OPCODE(POP_TOP, 1) \
DEFINE_OPCODE(ROT_TWO, 2) \
DEFINE_OPCODE(ROT_THREE, 3) \
DEFINE_OPCODE(DUP_TOP, 4) \
DEFINE_OPCODE(ROT_FOUR, 5) \
DEFINE_OPCODE(NOP, 9) \
    ...

# define DECL_OPCODE(opcode) \

void *opcodes[] = {
OPCODE_LIST(DECL_OPCODE)
};
# undef DECL_OPCODE

There are also other ways to do it, but using higher-order functions like this keeps the whole opcode list in a single place.
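For what it's worth, here is a compilable sketch of the higher-order-macro idea described above. The macro and label names are invented and do not match the patch; the point is only that the opcode list is written once and expanded twice, so the numeric constants and the jump table cannot drift apart.

    /* X-macro sketch: one OPCODE_LIST, two expansions.  Compile with gcc
       (uses the labels-as-values extension).  Names are illustrative only. */
    #define OPCODE_LIST(X) \
        X(STOP_CODE, 0)    \
        X(POP_TOP, 1)      \
        X(ROT_TWO, 2)

    /* Expansion 1: the numeric opcode constants (what opcode.h provides). */
    #define DEFINE_ENUM(name, value) name = value,
    enum opcode { OPCODE_LIST(DEFINE_ENUM) };
    #undef DEFINE_ENUM

    /* Expansion 2: the jump table, inside the interpreter function. */
    int interp_sketch(void)
    {
    #define DECL_OPCODE(name, value) [value] = &&label_##name,
        static void *opcodes[] = { OPCODE_LIST(DECL_OPCODE) };
    #undef DECL_OPCODE

        goto *opcodes[POP_TOP];   /* dispatch straight to one handler */

    label_STOP_CODE: return 0;
    label_POP_TOP:   return 1;
    label_ROT_TWO:   return 2;
    }

Generating both opcode.h and the targets table from one such list is exactly the "change opcode.h" point raised in the next comment.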
Hi,
We would have to change opcode.h for this to be truly useful (in order to keep the opcode list in a single place). Thanks for all the explanation and pointers!

About register allocation,

As for the "register" declarations, I think they're just remains of the past.

Antoine.
Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...
I don't see why it wouldn't.

Skip
Yep.
Agreed, I'll maybe try to find time for it.
I didn't look at the output, but that shouldn't be needed with decent compilers. I think that change would just save some compilation time spent in dataflow analysis. I just studied liveness analysis in compilers, and it computes whether a value may still be needed at a given program point. The only important thing is that the contents of the jump table are known at compile time.
Antoine Pitrou wrote:
I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a much smaller number.
Ah, I see now. Maybe you should add a quick comment that mentions this.
I tested it and it worked, no test failures to report. Just change the

    #if defined(__GNUC__) && \
    ...

to

    #if (defined(__GNUC__) || defined(__SUNPRO_C)) && \
    ...

I attached some additional benchmarks on SunOS. So far, it seems the
You forgot to update your script to use the new name. |
Try -fno-crossjumping.
Thanks.
Ah, that's rather dumb :) |
For the record, I've compiled py3k on an embarrassingly fast Core2-based machine (with gcc 4.3.2 in 64-bit mode).
Does anyone know the equivalent ICC command line option for GCC's -fno-crossjumping? It looks like ICC hits the same combining-goto problem, as was seen with older gcc. Even if stock Python is built with MSVC, devs like myself who ship their own Python builds could still benefit from it.
The x86 gentoo buildbot is failing to compile, with error:

    /Python/makeopcodetargets.py ./Python/opcode_targets.h

I suspect that it's because the system Python on this buildbot is too old to run the makeopcodetargets.py script as written.
One other thought: it seems that as a result of this change, building the py3k branch from scratch now needs a pre-existing Python to run the generation script. Might it be worth adding the file Python/opcode_targets.h to the repository?
Sorry: ignore that last. Python/opcode_targets.h is already part of the repository.
Mark: """Are there any objections to me adding a couple of square brackets to No problems for me. You might also add to the top comments of the file |
Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be compiling again, though it now shows a test failure.
The test failure also happens on trunk, it may be related to other recent changes.
Yes; sorry---I didn't mean to suggest that the test failures were in any way related to this change.
This has been checked in, right? Might I suggest that the TARGET and TARGET_WITH_IMPL macros not include the trailing colon? The code reads more naturally with TARGET(NOP): than with TARGET(NOP).

S
Yes, please! |
Skip, removing the colon doesn't work if the macro adds code after the label.
Antoine> Skip, removing the colon doesn't work if the macro adds code
Antoine> after the label.

When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon, but apparently that's not the case.

Skip
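To make the colon discussion concrete, here is a simplified reconstruction, not the exact ceval.c macros (the real ones do more bookkeeping; see the HAS_ARG discussion below). A plain TARGET(op) does end at a label, but the WITH_IMPL variant has to emit a statement (the goto to the shared implementation) after it, which is why the colon cannot simply be left to the call site.

    /* Simplified reconstruction for illustration only. */
    #define TARGET(op) \
        TARGET_##op:   \
        case op:

    #define TARGET_WITH_IMPL(op, impl) \
        TARGET_##op:                   \
        case op:                       \
            goto impl;

    enum { NOP = 9, DUP_TOPX = 99 };   /* tiny invented subset of opcodes */

    int dispatch_sketch(int opcode)
    {
        switch (opcode) {
        TARGET(NOP)
            return 0;
        TARGET_WITH_IMPL(DUP_TOPX, shared_impl)
        shared_impl:
            return 1;
        default:
            return -1;
        }
    }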
Out of interest, the attached patch against the py3k branch at r70516
On my systems (all older AMD with old versions of gcc), this patch has |
Is a backport to 2.7 still planned? |
On 2009-03-31 03:19, A.M. Kuchling wrote:
I hope it is. |
Andrew, your patch disables the optimization that HAS_ARG(op) is a compile-time constant when the opcode is known at compile time.
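For context: HAS_ARG() is just a comparison against HAVE_ARGUMENT in Include/opcode.h, so when the opcode is a literal constant inside its own per-opcode block the test folds away at compile time. The two helper functions below are invented for illustration (they are not ceval.c code); only the two macro definitions match opcode.h of that era.

    /* From Include/opcode.h (2.x / early 3.x): */
    #define HAVE_ARGUMENT 90
    #define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

    int oparg_generic(int opcode, const unsigned char *p)
    {
        /* Shared switch dispatch: opcode is only known at run time, so
           HAS_ARG() is a real conditional executed on every instruction. */
        return HAS_ARG(opcode) ? (p[0] | (p[1] << 8)) : -1;
    }

    int oparg_load_fast(const unsigned char *p)
    {
        /* Per-opcode handler: LOAD_FAST is the literal 124, HAS_ARG(124) is
           the constant 1, and the compiler deletes the test entirely. */
        enum { LOAD_FAST = 124 };
        return HAS_ARG(LOAD_FAST) ? (p[0] | (p[1] << 8)) : -1;
    }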
Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my

On an Intel E8200 cpu running FreeBSD 7.1 amd64, with gcc 7.2.1 and the

For comparison, this machine running Windows XP (32 bit) with the
I have patched the code of Python 3.1 to use the computed goto technique as well.
This is too late for 2.x now, closing. |
Hi All,

This is Vamsi from the Server Scripting Languages Optimization team at Intel Corporation. I would like to submit a request to enable the computed goto based dispatch in Python 2.x (which happens to be enabled by default in Python 3, given its performance benefits on a wide range of workloads). We talked about this patch with Guido and he encouraged us to submit a request on python-dev (email conversation with Guido shown at the bottom of this email).

Attached is the computed goto patch (along with instructions to run) for Python 2.7.10 (based on the patch submitted by Jeffrey Yasskin at http://bugs.python.org/issue4753). We built and tested this patch for Python 2.7.10 on a Linux machine (Ubuntu 14.04 LTS server, Intel Xeon – Haswell EP CPU with 18 cores, hyper-threading off, turbo off).

Below is a summary of the performance we saw on the “grand unified python benchmarks” suite (available at https://hg.python.org/benchmarks/). We made 3 rigorous runs of the following benchmarks. In each rigorous run, a benchmark is run 100 times with and without the computed goto patch. Below we show the average performance boost for the 3 rigorous runs.
Python 2.7.10 (original) vs Computed Goto performance:

Thanks,

------------------------------------------------------------------------------------------------------------------------------------------------------------

Hi Robert and David,

You might also check with Benjamin Peterson, who is the 2.7 release manager. I think he just announced 2.7.10, so it's too late for that, but I assume we'll keep doing 2.7.x releases until 2020.

--Guido

PS. I am assuming you are contributing this under a PSF-accepted license, e.g. Apache 2.0, otherwise it's an automatic nogo.

On Tue, May 19, 2015 at 9:33 AM, Cohn, Robert S <robert.s.cohn@intel.com> wrote:

When we met for lunch at pycon, I asked if performance related patches would be ok for python 2.x. My understanding is that you thought it was possible if it did not create a maintainability problem. We have an example now, a 2.7 patch for computed goto based on the implementation in python 3 http://bugs.python.org/issue4753 It increases performance by up to 10% across a wide range of workloads. As I mentioned at lunch, we hired David Murray’s company, and he is guiding intel through the development process for cpython. David and I thought it would be good to run this by you before raising the issue on python-dev. Do you have a specific concern about this patch or a more general concern about performance patches to 2.7? Thanks.

Robert
FWIW I'm interested and willing to poke at this if more testers/reviewers are needed. |
@vamsi, could you please open a new issue and attach your patch there so it can be properly tracked for 2.7? This issue has been closed for five years and the code has been out in the field for a long time in Python 3. Thanks! |
New changeset 17d3bbde60d2 by Benjamin Peterson in branch '2.7': |
The 2.7 back-ported version of this patch appears to have broken compilation on the Windows XP buildbot, during the OpenSSL build process, when the newly built Python is used to execute the build_ssl.py script.

After this patch, when that stage executes, and prior to any output from the build script, the python_d process goes to 100% CPU and sticks there until the build process times out 1200s later and kills it. I don't think it's really ssl related though, as after doing some debugging the exact same thing happens if I simply run python_d (I never see a prompt - it just sits there burning CPU). So I think build_ssl.py is just the first use of the generated python_d during the build process.

I did try attaching to the CPU-stuck version of python_d from VS, and so far from what I can see, the process never gets past the Py_Initialize() call in Py_Main(). It's all over the place in terms of locations if I try interrupting it, but it's always stuck inside that first Py_Initialize call.

I'm not sure if it's something environmental on my slave, or a difference with a debug vs. production build (I haven't had a chance to try building a release version yet).

-- David
I ran a few more tests, and the generated executable hangs in both release and debug builds. The closest I can get at the moment is that it's stuck importing errno from the "import sys, errno" line in os.py - at least no matter how long I wait after starting a process before breaking out, output with -v looks like:

> python_d -v
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# D:\cygwin\home\db3l\test\python2.7\lib\site.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\site.py
import site # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\site.pyc
# D:\cygwin\home\db3l\test\python2.7\lib\os.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\os.py
import os # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\os.pyc
import errno # builtin
Traceback (most recent call last):
File "D:\cygwin\home\db3l\test\python2.7\lib\site.py", line 62, in <module>
import os
File "D:\cygwin\home\db3l\test\python2.7\lib\os.py", line 26, in <module>
import sys, errno
KeyboardInterrupt
# clear __builtin__._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.exitfunc
# clear sys.exc_type
# clear sys.exc_value
# clear sys.exc_traceback
# clear sys.last_type
# clear sys.last_value
# clear sys.last_traceback
# clear sys.path_hooks
# clear sys.path_importer_cache
# clear sys.meta_path
# clear sys.flags
# clear sys.float_info
# restore sys.stdin
# restore sys.stdout
# restore sys.stderr
# cleanup __main__
# cleanup[1] zipimport
# cleanup[1] errno
# cleanup[1] signal
# cleanup[1] exceptions
# cleanup[1] _warnings
# cleanup sys
# cleanup __builtin__
[8991 refs]
# cleanup ints: 6 unfreed ints
# cleanup floats

I never have a problem interrupting the process, so KeyboardInterrupt is processed normally - it just looks like it gets stuck in an infinite loop during startup.

-- David
Please open a new issue with the details about your problem. |
Oops, sorry, I had just followed the commit comment to this issue. For the record here, it looks like Benjamin has committed an update (5e8fa1b13516) that resolves the problem. |