
bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter #27043

Merged: 23 commits into python:main, Jul 15, 2021

Conversation

@iritkatriel (Member) commented Jul 5, 2021

@iritkatriel (Member, Author):

I'll try to get some performance numbers tomorrow.

@iritkatriel (Member, Author):

I have some microbenchmark numbers (similar to @serhiy-storchaka's from #9853). This is on a Mac, built with --enable-optimizations.

LIST:
cpython_baseline % ./python.exe -m timeit -s "l=[1,2,3,4,5,6]" "l[3]"
10000000 loops, best of 5: 23.7 nsec per loop
cpython % ./python.exe -m timeit -s "l=[1,2,3,4,5,6]" "l[3]"
20000000 loops, best of 5: 17.8 nsec per loop

TUPLE:
cpython_baseline % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[3]"
10000000 loops, best of 5: 23.4 nsec per loop
cpython % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[3]"
20000000 loops, best of 5: 17.4 nsec per loop

DICT:
cpython_baseline % ./python.exe -m timeit -s "d={1:'a',2:'b',3:'c',4:'d',5:'e',6:'f'}" "d[3]"
10000000 loops, best of 5: 25.7 nsec per loop
cpython % ./python.exe -m timeit -s "d={1:'a',2:'b',3:'c',4:'d',5:'e',6:'f'}" "d[3]"
10000000 loops, best of 5: 21.6 nsec per loop

STRING (NOT OPTIMIZED):
cpython_baseline % ./python.exe -m timeit -s "s='abcdef'" "s[3]"
10000000 loops, best of 5: 25.7 nsec per loop
cpython % ./python.exe -m timeit -s "s='abcdef'" "s[3]"
10000000 loops, best of 5: 25.9 nsec per loop

TUPLE WITH NEGATIVE INDEX (NOT OPTIMIZED):
cpython_baseline % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[-2]"
10000000 loops, best of 5: 24 nsec per loop
cpython % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[-2]"
10000000 loops, best of 5: 23.7 nsec per loop

In summary (times in nsec per loop):

            baseline    opt     speedup
list        23.7        17.8    24.89%
tuple       23.4        17.4    25.64%
dict        25.7        21.6    15.95%
string      25.7        25.9    -0.78%
tuple-neg   24.0        23.7     1.25%
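
For context, a rough Python-level sketch of which cases this specializes. The instruction names (BINARY_SUBSCR_LIST_INT, BINARY_SUBSCR_TUPLE_INT, BINARY_SUBSCR_DICT) match the new specializations; the exact guards, such as the small-int bound, are assumptions for illustration, not the literal ceval.c code:

def is_nonneg_smallint(sub):
    # Assumption: the fast paths only handle exact ints that are small and non-negative.
    return type(sub) is int and 0 <= sub < 2**30

def choose_specialization(container, sub):
    # Returns the specialized instruction name, or None for the generic path.
    if type(container) is list and is_nonneg_smallint(sub):
        return "BINARY_SUBSCR_LIST_INT"
    if type(container) is tuple and is_nonneg_smallint(sub):
        return "BINARY_SUBSCR_TUPLE_INT"
    if type(container) is dict:
        return "BINARY_SUBSCR_DICT"
    return None

print(choose_specialization([1, 2, 3, 4, 5, 6], 3))   # BINARY_SUBSCR_LIST_INT
print(choose_specialization("abcdef", 3))             # None (string: not optimized)
print(choose_specialization((1, 2, 3, 4, 5, 6), -2))  # None (negative index: not optimized)

This matches the table above: string subscripts and negative indices fall through to the generic path, so they show no speedup.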

@Fidget-Spinner (Member) left a comment:

I'm not really an expert in this area, so I apologise in advance if some of my review comments aren't valid.

@iritkatriel (Member, Author):

pyperformance results:

baseline.json

Performance version: 1.0.2
Report on macOS-10.15.7-x86_64-i386-64bit
Number of logical CPUs: 12
Start date: 2021-07-06 14:43:18.158523
End date: 2021-07-06 15:03:17.963231

opt.json

Performance version: 1.0.2
Report on macOS-10.15.7-x86_64-i386-64bit
Number of logical CPUs: 12
Start date: 2021-07-06 14:16:54.831239
End date: 2021-07-06 14:37:48.404535

2to3

Mean +- std dev: 361 ms +- 6 ms -> 347 ms +- 4 ms: 1.04x faster
Significant (t=13.99)

chameleon

Mean +- std dev: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster
Significant (t=9.32)

chaos

Mean +- std dev: 124 ms +- 1 ms -> 119 ms +- 1 ms: 1.04x faster
Significant (t=21.46)

crypto_pyaes

Mean +- std dev: 137 ms +- 4 ms -> 130 ms +- 2 ms: 1.05x faster
Significant (t=11.17)

deltablue

Mean +- std dev: 8.68 ms +- 0.16 ms -> 8.51 ms +- 0.21 ms: 1.02x faster
Not significant

django_template

Mean +- std dev: 59.3 ms +- 3.3 ms -> 57.6 ms +- 2.8 ms: 1.03x faster
Significant (t=3.05)

dulwich_log

Mean +- std dev: 105 ms +- 2 ms -> 104 ms +- 4 ms: 1.01x faster
Not significant

fannkuch

Mean +- std dev: 561 ms +- 10 ms -> 537 ms +- 6 ms: 1.04x faster
Significant (t=16.10)

float

Mean +- std dev: 127 ms +- 3 ms -> 125 ms +- 3 ms: 1.01x faster
Not significant

go

Mean +- std dev: 252 ms +- 4 ms -> 237 ms +- 4 ms: 1.06x faster
Significant (t=21.01)

hexiom

Mean +- std dev: 11.9 ms +- 0.1 ms -> 11.3 ms +- 0.3 ms: 1.06x faster
Significant (t=14.95)

json_dumps

Mean +- std dev: 15.1 ms +- 0.2 ms -> 15.0 ms +- 0.3 ms: 1.01x faster
Not significant

json_loads

Mean +- std dev: 29.3 us +- 0.3 us -> 28.8 us +- 0.4 us: 1.02x faster
Not significant

logging_format

Mean +- std dev: 10.7 us +- 0.4 us -> 10.3 us +- 0.2 us: 1.04x faster
Significant (t=6.48)

logging_silent

Mean +- std dev: 231 ns +- 7 ns -> 221 ns +- 7 ns: 1.04x faster
Significant (t=7.36)

logging_simple

Mean +- std dev: 9.78 us +- 0.30 us -> 9.62 us +- 0.22 us: 1.02x faster
Not significant

mako

Mean +- std dev: 17.1 ms +- 0.5 ms -> 16.5 ms +- 0.2 ms: 1.04x faster
Significant (t=9.14)

meteor_contest

Mean +- std dev: 114 ms +- 2 ms -> 111 ms +- 3 ms: 1.02x faster
Significant (t=5.65)

nbody

Mean +- std dev: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster
Significant (t=14.66)

nqueens

Mean +- std dev: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster
Significant (t=18.59)

pathlib

Mean +- std dev: 46.6 ms +- 1.3 ms -> 45.3 ms +- 0.9 ms: 1.03x faster
Significant (t=6.35)

pickle

Mean +- std dev: 11.3 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.05x faster
Significant (t=9.69)

pickle_dict

Mean +- std dev: 26.3 us +- 0.2 us -> 23.9 us +- 0.6 us: 1.10x faster
Significant (t=29.70)

pickle_list

Mean +- std dev: 4.14 us +- 0.12 us -> 3.88 us +- 0.05 us: 1.07x faster
Significant (t=15.82)

pickle_pure_python

Mean +- std dev: 538 us +- 10 us -> 518 us +- 8 us: 1.04x faster
Significant (t=11.83)

pidigits

Mean +- std dev: 190 ms +- 3 ms -> 196 ms +- 1 ms: 1.03x slower
Significant (t=-12.11)

pyflate

Mean +- std dev: 789 ms +- 16 ms -> 756 ms +- 10 ms: 1.04x faster
Significant (t=13.64)

python_startup

Mean +- std dev: 13.2 ms +- 0.3 ms -> 13.4 ms +- 0.5 ms: 1.02x slower
Not significant

python_startup_no_site

Mean +- std dev: 9.73 ms +- 0.32 ms -> 9.58 ms +- 0.27 ms: 1.02x faster
Not significant

raytrace

Mean +- std dev: 611 ms +- 9 ms -> 581 ms +- 11 ms: 1.05x faster
Significant (t=16.01)

regex_dna

Mean +- std dev: 183 ms +- 4 ms -> 194 ms +- 5 ms: 1.06x slower
Significant (t=-12.68)

regex_effbot

Mean +- std dev: 3.36 ms +- 0.05 ms -> 3.46 ms +- 0.04 ms: 1.03x slower
Significant (t=-11.84)

regex_v8

Mean +- std dev: 25.8 ms +- 1.0 ms -> 25.9 ms +- 1.0 ms: 1.00x slower
Not significant

richards

Mean +- std dev: 95.3 ms +- 1.5 ms -> 93.7 ms +- 1.4 ms: 1.02x faster
Not significant

scimark_fft

Mean +- std dev: 467 ms +- 5 ms -> 465 ms +- 6 ms: 1.01x faster
Not significant

scimark_lu

Mean +- std dev: 214 ms +- 4 ms -> 209 ms +- 6 ms: 1.02x faster
Significant (t=5.27)

scimark_monte_carlo

Mean +- std dev: 112 ms +- 2 ms -> 111 ms +- 3 ms: 1.01x faster
Not significant

scimark_sor

Mean +- std dev: 229 ms +- 3 ms -> 226 ms +- 7 ms: 1.02x faster
Not significant

scimark_sparse_mat_mult

Mean +- std dev: 6.30 ms +- 0.26 ms -> 6.54 ms +- 0.19 ms: 1.04x slower
Significant (t=-5.78)

spectral_norm

Mean +- std dev: 192 ms +- 2 ms -> 189 ms +- 2 ms: 1.02x faster
Not significant

sqlalchemy_declarative

Mean +- std dev: 161 ms +- 6 ms -> 159 ms +- 4 ms: 1.01x faster
Not significant

sqlalchemy_imperative

Mean +- std dev: 24.2 ms +- 0.5 ms -> 24.2 ms +- 0.7 ms: 1.00x slower
Not significant

sqlite_synth

Mean +- std dev: 3.06 us +- 0.06 us -> 2.97 us +- 0.06 us: 1.03x faster
Significant (t=7.81)

sympy_expand

Mean +- std dev: 639 ms +- 10 ms -> 630 ms +- 9 ms: 1.02x faster
Not significant

sympy_integrate

Mean +- std dev: 27.0 ms +- 0.5 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
Significant (t=8.65)

sympy_str

Mean +- std dev: 382 ms +- 4 ms -> 375 ms +- 5 ms: 1.02x faster
Not significant

sympy_sum

Mean +- std dev: 211 ms +- 6 ms -> 206 ms +- 2 ms: 1.02x faster
Significant (t=6.32)

telco

Mean +- std dev: 7.21 ms +- 0.13 ms -> 7.09 ms +- 0.09 ms: 1.02x faster
Not significant

tornado_http

Mean +- std dev: 174 ms +- 13 ms -> 180 ms +- 48 ms: 1.03x slower
Not significant

unpack_sequence

Mean +- std dev: 59.7 ns +- 0.6 ns -> 55.9 ns +- 1.2 ns: 1.07x faster
Significant (t=21.74)

unpickle

Mean +- std dev: 16.1 us +- 0.3 us -> 16.1 us +- 0.3 us: 1.00x faster
Not significant

unpickle_list

Mean +- std dev: 5.02 us +- 0.14 us -> 5.06 us +- 0.16 us: 1.01x slower
Not significant

unpickle_pure_python

Mean +- std dev: 382 us +- 6 us -> 364 us +- 4 us: 1.05x faster
Significant (t=18.25)

xml_etree_generate

Mean +- std dev: 108 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x faster
Not significant

xml_etree_iterparse

Mean +- std dev: 116 ms +- 3 ms -> 117 ms +- 3 ms: 1.00x slower
Not significant

xml_etree_parse

Mean +- std dev: 162 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x slower
Not significant

xml_etree_process

Mean +- std dev: 88.4 ms +- 2.2 ms -> 85.2 ms +- 0.9 ms: 1.04x faster
Significant (t=10.57)

Skipped 1 benchmarks only in baseline.json: regex_compile

@gvanrossum (Member):

Doesn't pyperf[ormance] have a flag to sort the benchmarks from "most faster" to "most slower" and to skip the "not significant" ones?

@Fidget-Spinner (Member) commented Jul 6, 2021

Doesn't pyperf[ormance] have a flag to sort the benchmarks from "most faster" to "most slower" and to skip the "not significant" ones?

Yeah, it's: pyperf compare_to first.json second.json -G --min-speed=X

Edit: Here's the docs: https://pyperf.readthedocs.io/en/latest/cli.html#pyperf-compare-to

@Fidget-Spinner (Member):

@iritkatriel hmm, I just noticed something strange. Some of the speedups make sense (like nqueens, which has lots of BINARY_SUBSCR), but I don't understand how pickle_dict and pickle_list sped up 10%. There aren't any BINARY_SUBSCR instructions inside the benchmark functions, and those are basically thin wrappers over calls to C functions, so they shouldn't be affected? Am I missing something here :(?
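
One stdlib-only way to double-check the static side (a sketch for illustration; count_opcode is a hypothetical helper, not pyperf or PR code) is to count BINARY_SUBSCR in a function's code object, recursing into nested code objects:

import dis
import types

def count_opcode(code, opname="BINARY_SUBSCR"):
    # Static count: occurrences in this code object and any nested ones.
    n = sum(1 for ins in dis.get_instructions(code) if ins.opname == opname)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            n += count_opcode(const, opname)
    return n

def lookup(l, t, d):
    return l[3], t[3], d[3]

print(count_opcode(lookup.__code__))  # 3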

@iritkatriel (Member, Author):

./python.exe -m pyperf compare_to baseline.json ../cpython/opt.json -G
Slower (5):

  • regex_dna: 183 ms +- 4 ms -> 194 ms +- 5 ms: 1.06x slower
  • scimark_sparse_mat_mult: 6.30 ms +- 0.26 ms -> 6.54 ms +- 0.19 ms: 1.04x slower
  • regex_effbot: 3.36 ms +- 0.05 ms -> 3.46 ms +- 0.04 ms: 1.03x slower
  • pidigits: 190 ms +- 3 ms -> 196 ms +- 1 ms: 1.03x slower
  • python_startup: 13.2 ms +- 0.3 ms -> 13.4 ms +- 0.5 ms: 1.02x slower

Faster (43):

  • pickle_dict: 26.3 us +- 0.2 us -> 23.9 us +- 0.6 us: 1.10x faster
  • nbody: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster
  • nqueens: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster
  • pickle_list: 4.14 us +- 0.12 us -> 3.88 us +- 0.05 us: 1.07x faster
  • unpack_sequence: 59.7 ns +- 0.6 ns -> 55.9 ns +- 1.2 ns: 1.07x faster
  • go: 252 ms +- 4 ms -> 237 ms +- 4 ms: 1.06x faster
  • hexiom: 11.9 ms +- 0.1 ms -> 11.3 ms +- 0.3 ms: 1.06x faster
  • raytrace: 611 ms +- 9 ms -> 581 ms +- 11 ms: 1.05x faster
  • pickle: 11.3 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.05x faster
  • crypto_pyaes: 137 ms +- 4 ms -> 130 ms +- 2 ms: 1.05x faster
  • unpickle_pure_python: 382 us +- 6 us -> 364 us +- 4 us: 1.05x faster
  • fannkuch: 561 ms +- 10 ms -> 537 ms +- 6 ms: 1.04x faster
  • pyflate: 789 ms +- 16 ms -> 756 ms +- 10 ms: 1.04x faster
  • logging_silent: 231 ns +- 7 ns -> 221 ns +- 7 ns: 1.04x faster
  • chaos: 124 ms +- 1 ms -> 119 ms +- 1 ms: 1.04x faster
  • pickle_pure_python: 538 us +- 10 us -> 518 us +- 8 us: 1.04x faster
  • mako: 17.1 ms +- 0.5 ms -> 16.5 ms +- 0.2 ms: 1.04x faster
  • 2to3: 361 ms +- 6 ms -> 347 ms +- 4 ms: 1.04x faster
  • xml_etree_process: 88.4 ms +- 2.2 ms -> 85.2 ms +- 0.9 ms: 1.04x faster
  • logging_format: 10.7 us +- 0.4 us -> 10.3 us +- 0.2 us: 1.04x faster
  • django_template: 59.3 ms +- 3.3 ms -> 57.6 ms +- 2.8 ms: 1.03x faster
  • sympy_integrate: 27.0 ms +- 0.5 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
  • pathlib: 46.6 ms +- 1.3 ms -> 45.3 ms +- 0.9 ms: 1.03x faster
  • sqlite_synth: 3.06 us +- 0.06 us -> 2.97 us +- 0.06 us: 1.03x faster
  • chameleon: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster
  • scimark_lu: 214 ms +- 4 ms -> 209 ms +- 6 ms: 1.02x faster
  • sympy_sum: 211 ms +- 6 ms -> 206 ms +- 2 ms: 1.02x faster
  • meteor_contest: 114 ms +- 2 ms -> 111 ms +- 3 ms: 1.02x faster
  • json_loads: 29.3 us +- 0.3 us -> 28.8 us +- 0.4 us: 1.02x faster
  • deltablue: 8.68 ms +- 0.16 ms -> 8.51 ms +- 0.21 ms: 1.02x faster
  • sympy_str: 382 ms +- 4 ms -> 375 ms +- 5 ms: 1.02x faster
  • spectral_norm: 192 ms +- 2 ms -> 189 ms +- 2 ms: 1.02x faster
  • richards: 95.3 ms +- 1.5 ms -> 93.7 ms +- 1.4 ms: 1.02x faster
  • logging_simple: 9.78 us +- 0.30 us -> 9.62 us +- 0.22 us: 1.02x faster
  • telco: 7.21 ms +- 0.13 ms -> 7.09 ms +- 0.09 ms: 1.02x faster
  • xml_etree_generate: 108 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x faster
  • scimark_sor: 229 ms +- 3 ms -> 226 ms +- 7 ms: 1.02x faster
  • sympy_expand: 639 ms +- 10 ms -> 630 ms +- 9 ms: 1.02x faster
  • python_startup_no_site: 9.73 ms +- 0.32 ms -> 9.58 ms +- 0.27 ms: 1.02x faster
  • float: 127 ms +- 3 ms -> 125 ms +- 3 ms: 1.01x faster
  • scimark_monte_carlo: 112 ms +- 2 ms -> 111 ms +- 3 ms: 1.01x faster
  • json_dumps: 15.1 ms +- 0.2 ms -> 15.0 ms +- 0.3 ms: 1.01x faster
  • scimark_fft: 467 ms +- 5 ms -> 465 ms +- 6 ms: 1.01x faster

Benchmark hidden because not significant (9): dulwich_log, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x faster
Ignored benchmarks (1) of baseline.json: regex_compile

@gvanrossum (Member) commented Jul 6, 2021

I ran the benchmarks on the "faster cpython" team's benchmark machine (which sits in a secret location :-) and got the following results, comparing this PR with the point where it branched off main (two commits behind).

This primarily goes to show that there's a lot of noise in the data...

pyperf compare_to -G --min-speed 1 req-1625592889-gvanrossum/results-data.json req-1625594534-gvanrossum/results-data.json

Slower (10):

  • scimark_sparse_mat_mult: 7.11 ms +- 0.17 ms -> 7.41 ms +- 0.18 ms: 1.04x slower
  • regex_dna: 216 ms +- 4 ms -> 222 ms +- 3 ms: 1.03x slower
  • unpack_sequence: 61.0 ns +- 0.6 ns -> 62.5 ns +- 0.8 ns: 1.02x slower
  • regex_effbot: 3.66 ms +- 0.05 ms -> 3.75 ms +- 0.01 ms: 1.02x slower
  • scimark_fft: 496 ms +- 5 ms -> 508 ms +- 4 ms: 1.02x slower
  • spectral_norm: 187 ms +- 2 ms -> 191 ms +- 4 ms: 1.02x slower
  • pidigits: 201 ms +- 0 ms -> 204 ms +- 1 ms: 1.02x slower
  • go: 265 ms +- 2 ms -> 269 ms +- 2 ms: 1.02x slower
  • regex_compile: 215 ms +- 1 ms -> 217 ms +- 1 ms: 1.01x slower
  • scimark_sor: 251 ms +- 2 ms -> 254 ms +- 3 ms: 1.01x slower

Faster (12):

  • crypto_pyaes: 144 ms +- 1 ms -> 139 ms +- 2 ms: 1.03x faster
  • unpickle_list: 5.50 us +- 0.06 us -> 5.33 us +- 0.04 us: 1.03x faster
  • nbody: 164 ms +- 2 ms -> 160 ms +- 1 ms: 1.03x faster
  • fannkuch: 584 ms +- 5 ms -> 570 ms +- 4 ms: 1.02x faster
  • nqueens: 125 ms +- 1 ms -> 122 ms +- 1 ms: 1.02x faster
  • chaos: 129 ms +- 5 ms -> 127 ms +- 1 ms: 1.02x faster
  • hexiom: 12.8 ms +- 0.1 ms -> 12.5 ms +- 0.1 ms: 1.02x faster
  • telco: 8.38 ms +- 0.37 ms -> 8.22 ms +- 0.27 ms: 1.02x faster
  • scimark_monte_carlo: 128 ms +- 2 ms -> 126 ms +- 2 ms: 1.02x faster
  • pickle_pure_python: 569 us +- 6 us -> 561 us +- 5 us: 1.01x faster
  • unpickle: 17.7 us +- 0.2 us -> 17.5 us +- 0.2 us: 1.01x faster
  • pathlib: 22.7 ms +- 0.3 ms -> 22.5 ms +- 0.4 ms: 1.01x faster

Benchmark hidden because not significant (36): 2to3, chameleon, deltablue, django_template, dulwich_log, float, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, pickle, pickle_dict, pickle_list, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@gvanrossum (Member):

PS. I did not use --pgo when building. That might affect things.

@iritkatriel (Member, Author) commented Jul 6, 2021

I repeated it and got this (configured with --enable-optimizations, on a Mac):

  • pickle_dict: 26.2 us +- 0.1 us -> 23.7 us +- 0.2 us: 1.11x faster
  • hexiom: 12.0 ms +- 0.3 ms -> 11.2 ms +- 0.2 ms: 1.07x faster
  • unpack_sequence: 59.5 ns +- 0.5 ns -> 55.9 ns +- 1.4 ns: 1.06x faster
  • nqueens: 122 ms +- 1 ms -> 115 ms +- 1 ms: 1.06x faster
  • nbody: 167 ms +- 1 ms -> 158 ms +- 2 ms: 1.06x faster
  • go: 250 ms +- 3 ms -> 238 ms +- 4 ms: 1.05x faster
  • raytrace: 607 ms +- 14 ms -> 578 ms +- 12 ms: 1.05x faster
  • unpickle_pure_python: 381 us +- 4 us -> 365 us +- 4 us: 1.04x faster
  • pickle_list: 4.07 us +- 0.04 us -> 3.90 us +- 0.10 us: 1.04x faster

@gvanrossum (Member):

Huh, I guess clang (Mac) is better than gcc (Linux), or more stable, or something. I had not used pgo/lto, and when I turned them on I got similar results, only more pronounced, but still with a geometric mean of 1.00, so I won't repeat the numbers here. (It wasn't always the same benchmarks that were slower/faster, although the top 5 or so on either side were roughly the same.)

I don't really understand what's going on with these benchmarks; maybe @vstinner has some insights. Are they just very noisy? I have a feeling that if we pay them too much attention we'll never make progress. Perhaps we should focus more on micro-benchmarks, and use PyPerformance to prove that we're not making things worse. Although I'd like to understand why unpack_sequence is 1.17x slower for me with pgo+lto.

(BTW, pgo+lto really does make a big difference. Comparing this PR without and with lto+pgo gives a geometric mean of 1.21x faster, with the extreme being richards at 1.46x. Puts our efforts in perspective. :-)

@gvanrossum (Member):

Huh, lto+pgo makes unpack_sequence 1.10x slower on your branch. And on main they make it only 1.04x faster. So there's something weird about it. After all, it's 500 lines of this:

        a, b, c, d, e, f, g, h, i, j = to_unpack

@vstinner (Member) commented Jul 7, 2021

Ah, benchmarking Python! I suggest only measuring Python performance when it's built with PGO+LTO. On Linux, pyperf is more reliable with CPU isolation (see the pyperf system tune command).

Usually, I ignore differences smaller than 5%; I treat them as noise in the benchmark itself rather than in the change. To see why, try building Python 5 times and measuring performance each time. In my experience, each PGO build performs a little differently because the PROFILE_TASK in the Makefile is not fully deterministic. In general, the Python build is not deterministic.

  • "Geometric mean: 1.02x faster"
  • "Geometric mean: 1.00x faster"

I added this feature to pyperf only recently, so I don't have much experience with it yet. But "Geometric mean: 1.00x faster" does not impress me: it suggests the change makes Python neither faster nor slower overall.

The geometric mean doesn't weight the benchmarks; each benchmark is treated equally.
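
Concretely, as far as I understand it, that summary line is the unweighted geometric mean of the per-benchmark speedup ratios; a minimal sketch (not pyperf's actual code):

import math

def geometric_mean(ratios):
    # Unweighted: every benchmark contributes equally.
    return math.exp(sum(map(math.log, ratios)) / len(ratios))

# Two benchmarks 1.08x faster, one 1.04x slower (a slowdown is a ratio < 1):
print(f"{geometric_mean([1.08, 1.08, 1 / 1.04]):.2f}x faster")  # 1.04x faster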

  • "nbody: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster": that's nice, since it's a "generic" benchmark. The advance() hot code contains multiple BINARY_SUBSCR instruction:
def advance(dt, n, bodies=SYSTEM, pairs=PAIRS):
    for i in range(n):
        for (([x1, y1, z1], v1, m1),
             ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz
  • "nqueens: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster": same, it also contains many BINARY_SUBSCR instructions.

I also try not to pay too much attention to the speedups, and mostly check whether the unavoidable slowdowns are acceptable.

I have never seen a pyperformance result with no "slower" benchmarks; it's always a balance. That's why I added the geometric mean.

More about pyperformance individual benchmarks: https://pyperformance.readthedocs.io/benchmarks.html#available-benchmarks

@vstinner (Member) commented Jul 7, 2021

Huh, lto+pgo makes unpack_sequence 1.10x slower on your branch.

This benchmark is a micro-benchmark and was never reliable. I suggest removing it.

@vstinner (Member) commented Jul 7, 2021

See unpack_sequence at speed.python.org: results over only one month. The differences are quite significant.

[chart: unpack_sequence timings at speed.python.org over one month]

Timings are between 115 ms and 140 ms. IMO it's more a benchmark of the CPU L1 instruction cache: it depends on how the compiler decides to organize the Python machine code when using PGO. It depends a lot on code placement.

See https://vstinner.github.io/analysis-python-performance-issue.html for an example of analysis on code placement.

@vstinner (Member) commented Jul 7, 2021

(BTW, pgo+lto really does make a big difference. Comparing this PR without and with lto+pgo gives a geometric mean of 1.21x faster, with the extreme being richards at 1.46x. Puts our efforts in perspective. :-)

By the way, ./configure --enable-shared can also have a significant impact on performance if you build Python with GCC without -fno-semantic-interposition. This compiler flag is now enabled by default by ./configure --enable-optimizations.

If the "python" executable is a small binary (ex: 20 KB) which only calls Py_BytesMain() which lives in libpython dynamic library, function calls from libpython to libpython (Python calling itself its own C functions) can be slower because of the semantic interposition (ability to override symbols at runtime using LD_PRELOAD or whatever else).

Using -fno-semantic-interposition, you get back to the same speed as without ./configure --enable-shared: the compiler can again inline performance-critical functions. LTO is quite impressive for that! Compilers have made great progress on LTO over the last few years.

See my analysis of -fno-semantic-interposition and inlining in libpython:
https://developers.redhat.com/blog/2020/06/25/red-hat-enterprise-linux-8-2-brings-faster-python-3-8-run-speeds/

How Python is built makes a huge difference in performance.

Note: clang uses -fno-semantic-interposition by default.
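
For concreteness, the build variants discussed in this thread correspond roughly to these configure invocations (standard autoconf flags; a sketch, adjust to your toolchain):

./configure --enable-optimizations --with-lto    # PGO + LTO
./configure --enable-shared                      # python linked against libpython
./configure --enable-shared CFLAGS="-fno-semantic-interposition"  # GCC: restore inlining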

@pablogsal (Member):

The key is also running the benchmarks with CPU isolation; otherwise the noise will be in the range of 2-6% or even more, since the OS can schedule any process on the CPU you are using for benchmarking, or even schedule system interrupts and other resource contention there.

@pablogsal (Member):

I ran the benchmarks on the "faster cpython" team's benchmark machine (which sits in a secret location :-) and got the following results, comparing this PR with the point where it branched off main (two commits behind). [...] This primarily goes to show that there's a lot of noise in the data... [full results quoted from the comment above]

I repeated these benchmarks on my machine with PGO/LTO and CPU isolation, and I got very similar results to these. I will try to post them later.

@pablogsal (Member) commented Jul 7, 2021

Perhaps we should focus more on micro-benchmarks, and use PyPerformance to prove that we're not making things worse.

I get the feeling, but macro benchmarks are the only way to really claim general speedups. They are also the benchmarks we show on speed.python.org, and the ones we have used in similar past work to decide whether a change was worth it.

It is also the metric that every other implementation uses to claim that it is "X times faster than CPython".

@markshannon (Member):

There are useful numbers that can be gathered which should be deterministic:

  • What fraction of the total executed instructions is BINARY_SUBSCR?
  • What are the cache hit, miss, and de-opt numbers for the various specializations?
  • What kinds of values are not being specialized for?
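
None of that instrumentation is part of this PR, but for illustration, the first of those numbers can be gathered from pure Python (deterministically, though far too slowly for real benchmarking) with frame.f_trace_opcodes; a sketch:

import dis
import sys
from collections import Counter

counts = Counter()

def tracer(frame, event, arg):
    if event == "call":
        frame.f_trace_opcodes = True  # request per-opcode events for this frame
    elif event == "opcode":
        # f_lasti is the offset of the instruction about to execute.
        counts[dis.opname[frame.f_code.co_code[frame.f_lasti]]] += 1
    return tracer

def workload():
    l = [1, 2, 3, 4, 5, 6]
    for _ in range(10_000):
        l[3]

sys.settrace(tracer)
workload()
sys.settrace(None)

total = sum(counts.values())
print(f"BINARY_SUBSCR: {counts['BINARY_SUBSCR']}/{total} opcodes "
      f"({100 * counts['BINARY_SUBSCR'] / total:.1f}%)")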

@markshannon (Member) left a comment:

Looks good. Just needs a few minor changes.

@markshannon (Member):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

@iritkatriel (Member, Author):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

I fixed it locally with #include "pycore_long.h", but I can do this instead.

@markshannon (Member):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

I fixed it locally with #include "pycore_long.h", but I can do this instead.

If you've already fixed it, then that's fine. No need to change it again.

@iritkatriel (Member, Author):

I have made the requested changes; please review again.

@bedevere-bot:

Thanks for making the requested changes!

@markshannon: please review the changes made to this pull request.

@iritkatriel (Member, Author):

./python.exe -m pyperf compare_to -G baseline.json opt.json
Slower (11):

  • pickle_dict: 26.2 us +- 0.7 us -> 26.8 us +- 0.3 us: 1.02x slower
  • scimark_sparse_mat_mult: 6.25 ms +- 0.13 ms -> 6.37 ms +- 0.20 ms: 1.02x slower
  • python_startup: 13.2 ms +- 0.4 ms -> 13.4 ms +- 0.5 ms: 1.02x slower
  • unpickle: 16.3 us +- 0.4 us -> 16.6 us +- 0.3 us: 1.02x slower
  • chameleon: 10.4 ms +- 0.2 ms -> 10.6 ms +- 0.1 ms: 1.01x slower
  • python_startup_no_site: 9.47 ms +- 0.26 ms -> 9.60 ms +- 0.29 ms: 1.01x slower
  • pidigits: 190 ms +- 2 ms -> 193 ms +- 4 ms: 1.01x slower
  • sqlite_synth: 3.09 us +- 0.06 us -> 3.12 us +- 0.06 us: 1.01x slower
  • pickle: 11.1 us +- 0.1 us -> 11.2 us +- 0.2 us: 1.01x slower
  • xml_etree_process: 87.2 ms +- 1.1 ms -> 87.8 ms +- 1.1 ms: 1.01x slower
  • regex_effbot: 3.45 ms +- 0.05 ms -> 3.47 ms +- 0.04 ms: 1.01x slower

Faster (32):

  • nbody: 168 ms +- 2 ms -> 147 ms +- 2 ms: 1.14x faster
  • crypto_pyaes: 138 ms +- 2 ms -> 125 ms +- 2 ms: 1.11x faster
  • hexiom: 12.1 ms +- 0.3 ms -> 11.0 ms +- 0.2 ms: 1.10x faster
  • nqueens: 122 ms +- 2 ms -> 113 ms +- 5 ms: 1.08x faster
  • fannkuch: 569 ms +- 5 ms -> 528 ms +- 7 ms: 1.08x faster
  • go: 252 ms +- 4 ms -> 234 ms +- 3 ms: 1.07x faster
  • chaos: 127 ms +- 2 ms -> 119 ms +- 3 ms: 1.07x faster
  • scimark_lu: 214 ms +- 4 ms -> 201 ms +- 4 ms: 1.06x faster
  • logging_silent: 233 ns +- 9 ns -> 220 ns +- 6 ns: 1.06x faster
  • pyflate: 796 ms +- 24 ms -> 753 ms +- 8 ms: 1.06x faster
  • scimark_monte_carlo: 115 ms +- 2 ms -> 110 ms +- 2 ms: 1.04x faster
  • meteor_contest: 114 ms +- 2 ms -> 110 ms +- 2 ms: 1.04x faster
  • unpack_sequence: 60.8 ns +- 0.8 ns -> 58.5 ns +- 1.1 ns: 1.04x faster
  • float: 125 ms +- 2 ms -> 120 ms +- 2 ms: 1.04x faster
  • richards: 96.5 ms +- 2.1 ms -> 93.0 ms +- 2.0 ms: 1.04x faster
  • telco: 7.17 ms +- 0.10 ms -> 6.91 ms +- 0.11 ms: 1.04x faster
  • scimark_fft: 471 ms +- 5 ms -> 455 ms +- 6 ms: 1.04x faster
  • scimark_sor: 232 ms +- 6 ms -> 225 ms +- 3 ms: 1.03x faster
  • django_template: 58.9 ms +- 2.2 ms -> 57.3 ms +- 2.5 ms: 1.03x faster
  • sympy_sum: 215 ms +- 9 ms -> 209 ms +- 4 ms: 1.03x faster
  • pickle_pure_python: 536 us +- 19 us -> 522 us +- 7 us: 1.03x faster
  • sympy_integrate: 26.9 ms +- 0.4 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
  • unpickle_pure_python: 375 us +- 5 us -> 366 us +- 5 us: 1.02x faster
  • logging_format: 10.8 us +- 0.3 us -> 10.6 us +- 0.3 us: 1.02x faster
  • json_loads: 30.3 us +- 1.2 us -> 29.7 us +- 0.4 us: 1.02x faster
  • logging_simple: 9.93 us +- 0.20 us -> 9.75 us +- 0.16 us: 1.02x faster
  • spectral_norm: 193 ms +- 2 ms -> 190 ms +- 3 ms: 1.02x faster
  • pathlib: 46.7 ms +- 1.1 ms -> 46.0 ms +- 1.0 ms: 1.02x faster
  • regex_dna: 186 ms +- 4 ms -> 183 ms +- 2 ms: 1.02x faster
  • regex_v8: 28.0 ms +- 1.3 ms -> 27.6 ms +- 0.8 ms: 1.02x faster
  • sympy_str: 383 ms +- 7 ms -> 378 ms +- 5 ms: 1.01x faster
  • sympy_expand: 636 ms +- 7 ms -> 631 ms +- 7 ms: 1.01x faster

Benchmark hidden because not significant (14): 2to3, deltablue, dulwich_log, json_dumps, mako, pickle_list, regex_compile, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.02x faster
Ignored benchmarks (1) of baseline.json: raytrace

@iritkatriel (Member, Author):

Microbenchmarks (times in nsec per loop):

            baseline    opt     speedup
list        22.3        14.9    33.18%
tuple       22.2        14.7    33.78%
dict        24.4        21.1    13.52%
string      24.9        25.3    -1.61%
tuple-neg   23.2        22.8     1.72%

@@ -310,17 +310,18 @@ too_many_cache_misses(_PyAdaptiveEntry *entry) {
     return entry->counter == saturating_zero();
 }

-#define BACKOFF 64
+#define ADAPTIVE_CACHE_BACKOFF 64
A reviewer (Member) commented on this change: 👍

@markshannon (Member):

Two very minor tweaks, otherwise ready to merge 🙂

@markshannon (Member):

Excellent.
Thanks Irit for seeing this through.

@markshannon merged commit 641345d into python:main on Jul 15, 2021
sthagen added a commit to sthagen/python-cpython that referenced this pull request Jul 15, 2021
bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter (pythonGH-27043)
@iritkatriel deleted the bpo-26280 branch on July 15, 2021 at 14:21