
bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter #27043

Merged: 23 commits into python:main, Jul 15, 2021

Conversation

@iritkatriel (Member) commented Jul 5, 2021

@iritkatriel (Member, Author):

I'll try to get some performance numbers tomorrow.

@iritkatriel (Member, Author):

I have some microbenchmark numbers (similar to @serhiy-storchaka's from #9853). This is on a Mac, built with --enable-optimizations.

LIST:
cpython_baseline % ./python.exe -m timeit -s "l=[1,2,3,4,5,6]" "l[3]"
10000000 loops, best of 5: 23.7 nsec per loop
cpython % ./python.exe -m timeit -s "l=[1,2,3,4,5,6]" "l[3]"
20000000 loops, best of 5: 17.8 nsec per loop

TUPLE:
cpython_baseline % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[3]"
10000000 loops, best of 5: 23.4 nsec per loop
cpython % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[3]"
20000000 loops, best of 5: 17.4 nsec per loop

DICT:
cpython_baseline % ./python.exe -m timeit -s "d={1:'a',2:'b',3:'c',4:'d',5:'e',6:'f'}" "d[3]"
10000000 loops, best of 5: 25.7 nsec per loop
cpython % ./python.exe -m timeit -s "d={1:'a',2:'b',3:'c',4:'d',5:'e',6:'f'}" "d[3]"
10000000 loops, best of 5: 21.6 nsec per loop

STRING (NOT OPTIMIZED):
cpython_baseline % ./python.exe -m timeit -s "s='abcdef'" "s[3]"
10000000 loops, best of 5: 25.7 nsec per loop
cpython % ./python.exe -m timeit -s "s='abcdef'" "s[3]"
10000000 loops, best of 5: 25.9 nsec per loop

TUPLE WITH NEGATIVE INDEX (NOT OPTIMIZED):
cpython_baseline % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[-2]"
10000000 loops, best of 5: 24 nsec per loop
cpython % ./python.exe -m timeit -s "t=(1,2,3,4,5,6)" "t[-2]"
10000000 loops, best of 5: 23.7 nsec per loop

In summary (times in nsec per loop):

            baseline    opt     speedup
list        23.7        17.8    24.89%
tuple       23.4        17.4    25.64%
dict        25.7        21.6    15.95%
string      25.7        25.9    -0.78%
tuple-neg   24.0        23.7     1.25%
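
For context, a rough Python-level sketch of which cases this specializes. The instruction names (BINARY_SUBSCR_LIST_INT, BINARY_SUBSCR_TUPLE_INT, BINARY_SUBSCR_DICT) match the new specializations; the exact guards, such as the small-int bound, are assumptions for illustration, not the literal ceval.c code:

def is_nonneg_smallint(sub):
    # Assumption: the fast paths only handle exact ints that are small and non-negative.
    return type(sub) is int and 0 <= sub < 2**30

def choose_specialization(container, sub):
    # Returns the specialized instruction name, or None for the generic path.
    if type(container) is list and is_nonneg_smallint(sub):
        return "BINARY_SUBSCR_LIST_INT"
    if type(container) is tuple and is_nonneg_smallint(sub):
        return "BINARY_SUBSCR_TUPLE_INT"
    if type(container) is dict:
        return "BINARY_SUBSCR_DICT"
    return None

print(choose_specialization([1, 2, 3, 4, 5, 6], 3))   # BINARY_SUBSCR_LIST_INT
print(choose_specialization("abcdef", 3))             # None (string: not optimized)
print(choose_specialization((1, 2, 3, 4, 5, 6), -2))  # None (negative index: not optimized)

This matches the table above: string subscripts and negative indices fall through to the generic path, so they show no speedup.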

@Fidget-Spinner (Member) left a comment:

I'm not really an expert in this area, so I apologise in advance if some of my review comments aren't valid.

@iritkatriel (Member, Author):

pyperformance results:

baseline.json

Performance version: 1.0.2
Report on macOS-10.15.7-x86_64-i386-64bit
Number of logical CPUs: 12
Start date: 2021-07-06 14:43:18.158523
End date: 2021-07-06 15:03:17.963231

opt.json

Performance version: 1.0.2
Report on macOS-10.15.7-x86_64-i386-64bit
Number of logical CPUs: 12
Start date: 2021-07-06 14:16:54.831239
End date: 2021-07-06 14:37:48.404535

2to3

Mean +- std dev: 361 ms +- 6 ms -> 347 ms +- 4 ms: 1.04x faster
Significant (t=13.99)

chameleon

Mean +- std dev: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster
Significant (t=9.32)

chaos

Mean +- std dev: 124 ms +- 1 ms -> 119 ms +- 1 ms: 1.04x faster
Significant (t=21.46)

crypto_pyaes

Mean +- std dev: 137 ms +- 4 ms -> 130 ms +- 2 ms: 1.05x faster
Significant (t=11.17)

deltablue

Mean +- std dev: 8.68 ms +- 0.16 ms -> 8.51 ms +- 0.21 ms: 1.02x faster
Not significant

django_template

Mean +- std dev: 59.3 ms +- 3.3 ms -> 57.6 ms +- 2.8 ms: 1.03x faster
Significant (t=3.05)

dulwich_log

Mean +- std dev: 105 ms +- 2 ms -> 104 ms +- 4 ms: 1.01x faster
Not significant

fannkuch

Mean +- std dev: 561 ms +- 10 ms -> 537 ms +- 6 ms: 1.04x faster
Significant (t=16.10)

float

Mean +- std dev: 127 ms +- 3 ms -> 125 ms +- 3 ms: 1.01x faster
Not significant

go

Mean +- std dev: 252 ms +- 4 ms -> 237 ms +- 4 ms: 1.06x faster
Significant (t=21.01)

hexiom

Mean +- std dev: 11.9 ms +- 0.1 ms -> 11.3 ms +- 0.3 ms: 1.06x faster
Significant (t=14.95)

json_dumps

Mean +- std dev: 15.1 ms +- 0.2 ms -> 15.0 ms +- 0.3 ms: 1.01x faster
Not significant

json_loads

Mean +- std dev: 29.3 us +- 0.3 us -> 28.8 us +- 0.4 us: 1.02x faster
Not significant

logging_format

Mean +- std dev: 10.7 us +- 0.4 us -> 10.3 us +- 0.2 us: 1.04x faster
Significant (t=6.48)

logging_silent

Mean +- std dev: 231 ns +- 7 ns -> 221 ns +- 7 ns: 1.04x faster
Significant (t=7.36)

logging_simple

Mean +- std dev: 9.78 us +- 0.30 us -> 9.62 us +- 0.22 us: 1.02x faster
Not significant

mako

Mean +- std dev: 17.1 ms +- 0.5 ms -> 16.5 ms +- 0.2 ms: 1.04x faster
Significant (t=9.14)

meteor_contest

Mean +- std dev: 114 ms +- 2 ms -> 111 ms +- 3 ms: 1.02x faster
Significant (t=5.65)

nbody

Mean +- std dev: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster
Significant (t=14.66)

nqueens

Mean +- std dev: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster
Significant (t=18.59)

pathlib

Mean +- std dev: 46.6 ms +- 1.3 ms -> 45.3 ms +- 0.9 ms: 1.03x faster
Significant (t=6.35)

pickle

Mean +- std dev: 11.3 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.05x faster
Significant (t=9.69)

pickle_dict

Mean +- std dev: 26.3 us +- 0.2 us -> 23.9 us +- 0.6 us: 1.10x faster
Significant (t=29.70)

pickle_list

Mean +- std dev: 4.14 us +- 0.12 us -> 3.88 us +- 0.05 us: 1.07x faster
Significant (t=15.82)

pickle_pure_python

Mean +- std dev: 538 us +- 10 us -> 518 us +- 8 us: 1.04x faster
Significant (t=11.83)

pidigits

Mean +- std dev: 190 ms +- 3 ms -> 196 ms +- 1 ms: 1.03x slower
Significant (t=-12.11)

pyflate

Mean +- std dev: 789 ms +- 16 ms -> 756 ms +- 10 ms: 1.04x faster
Significant (t=13.64)

python_startup

Mean +- std dev: 13.2 ms +- 0.3 ms -> 13.4 ms +- 0.5 ms: 1.02x slower
Not significant

python_startup_no_site

Mean +- std dev: 9.73 ms +- 0.32 ms -> 9.58 ms +- 0.27 ms: 1.02x faster
Not significant

raytrace

Mean +- std dev: 611 ms +- 9 ms -> 581 ms +- 11 ms: 1.05x faster
Significant (t=16.01)

regex_dna

Mean +- std dev: 183 ms +- 4 ms -> 194 ms +- 5 ms: 1.06x slower
Significant (t=-12.68)

regex_effbot

Mean +- std dev: 3.36 ms +- 0.05 ms -> 3.46 ms +- 0.04 ms: 1.03x slower
Significant (t=-11.84)

regex_v8

Mean +- std dev: 25.8 ms +- 1.0 ms -> 25.9 ms +- 1.0 ms: 1.00x slower
Not significant

richards

Mean +- std dev: 95.3 ms +- 1.5 ms -> 93.7 ms +- 1.4 ms: 1.02x faster
Not significant

scimark_fft

Mean +- std dev: 467 ms +- 5 ms -> 465 ms +- 6 ms: 1.01x faster
Not significant

scimark_lu

Mean +- std dev: 214 ms +- 4 ms -> 209 ms +- 6 ms: 1.02x faster
Significant (t=5.27)

scimark_monte_carlo

Mean +- std dev: 112 ms +- 2 ms -> 111 ms +- 3 ms: 1.01x faster
Not significant

scimark_sor

Mean +- std dev: 229 ms +- 3 ms -> 226 ms +- 7 ms: 1.02x faster
Not significant

scimark_sparse_mat_mult

Mean +- std dev: 6.30 ms +- 0.26 ms -> 6.54 ms +- 0.19 ms: 1.04x slower
Significant (t=-5.78)

spectral_norm

Mean +- std dev: 192 ms +- 2 ms -> 189 ms +- 2 ms: 1.02x faster
Not significant

sqlalchemy_declarative

Mean +- std dev: 161 ms +- 6 ms -> 159 ms +- 4 ms: 1.01x faster
Not significant

sqlalchemy_imperative

Mean +- std dev: 24.2 ms +- 0.5 ms -> 24.2 ms +- 0.7 ms: 1.00x slower
Not significant

sqlite_synth

Mean +- std dev: 3.06 us +- 0.06 us -> 2.97 us +- 0.06 us: 1.03x faster
Significant (t=7.81)

sympy_expand

Mean +- std dev: 639 ms +- 10 ms -> 630 ms +- 9 ms: 1.02x faster
Not significant

sympy_integrate

Mean +- std dev: 27.0 ms +- 0.5 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
Significant (t=8.65)

sympy_str

Mean +- std dev: 382 ms +- 4 ms -> 375 ms +- 5 ms: 1.02x faster
Not significant

sympy_sum

Mean +- std dev: 211 ms +- 6 ms -> 206 ms +- 2 ms: 1.02x faster
Significant (t=6.32)

telco

Mean +- std dev: 7.21 ms +- 0.13 ms -> 7.09 ms +- 0.09 ms: 1.02x faster
Not significant

tornado_http

Mean +- std dev: 174 ms +- 13 ms -> 180 ms +- 48 ms: 1.03x slower
Not significant

unpack_sequence

Mean +- std dev: 59.7 ns +- 0.6 ns -> 55.9 ns +- 1.2 ns: 1.07x faster
Significant (t=21.74)

unpickle

Mean +- std dev: 16.1 us +- 0.3 us -> 16.1 us +- 0.3 us: 1.00x faster
Not significant

unpickle_list

Mean +- std dev: 5.02 us +- 0.14 us -> 5.06 us +- 0.16 us: 1.01x slower
Not significant

unpickle_pure_python

Mean +- std dev: 382 us +- 6 us -> 364 us +- 4 us: 1.05x faster
Significant (t=18.25)

xml_etree_generate

Mean +- std dev: 108 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x faster
Not significant

xml_etree_iterparse

Mean +- std dev: 116 ms +- 3 ms -> 117 ms +- 3 ms: 1.00x slower
Not significant

xml_etree_parse

Mean +- std dev: 162 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x slower
Not significant

xml_etree_process

Mean +- std dev: 88.4 ms +- 2.2 ms -> 85.2 ms +- 0.9 ms: 1.04x faster
Significant (t=10.57)

Skipped 1 benchmarks only in baseline.json: regex_compile

@gvanrossum (Member):

Doesn't pyperf[ormance] have a flag to sort the benchmarks from "most faster" to "most slower" and to skip the "not significant" ones?

@Fidget-Spinner (Member) commented Jul 6, 2021

Doesn't pyperf[ormance] have a flag to sort the benchmarks from "most faster" to "most slower" and to skip the "not significant" ones?

Yeah, it's: pyperf compare_to first.json second.json -G --min-speed=X

Edit: Here's the docs: https://pyperf.readthedocs.io/en/latest/cli.html#pyperf-compare-to

@Fidget-Spinner (Member):

@iritkatriel hmm, I just noticed something strange. Some of the speedups make sense (like nqueens, which has lots of BINARY_SUBSCR), but I don't understand how pickle_dict and pickle_list sped up 10%. There aren't any BINARY_SUBSCR instructions inside the benchmark functions, and those are basically thin wrappers over calls to C functions, so they shouldn't be affected? Am I missing something here :(?
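
One stdlib-only way to double-check the static side (a sketch for illustration; count_opcode is a hypothetical helper, not pyperf or PR code) is to count BINARY_SUBSCR in a function's code object, recursing into nested code objects:

import dis
import types

def count_opcode(code, opname="BINARY_SUBSCR"):
    # Static count: occurrences in this code object and any nested ones.
    n = sum(1 for ins in dis.get_instructions(code) if ins.opname == opname)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            n += count_opcode(const, opname)
    return n

def lookup(l, t, d):
    return l[3], t[3], d[3]

print(count_opcode(lookup.__code__))  # 3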

@iritkatriel (Member, Author):

./python.exe -m pyperf compare_to baseline.json ../cpython/opt.json -G
Slower (5):

  • regex_dna: 183 ms +- 4 ms -> 194 ms +- 5 ms: 1.06x slower
  • scimark_sparse_mat_mult: 6.30 ms +- 0.26 ms -> 6.54 ms +- 0.19 ms: 1.04x slower
  • regex_effbot: 3.36 ms +- 0.05 ms -> 3.46 ms +- 0.04 ms: 1.03x slower
  • pidigits: 190 ms +- 3 ms -> 196 ms +- 1 ms: 1.03x slower
  • python_startup: 13.2 ms +- 0.3 ms -> 13.4 ms +- 0.5 ms: 1.02x slower

Faster (43):

  • pickle_dict: 26.3 us +- 0.2 us -> 23.9 us +- 0.6 us: 1.10x faster
  • nbody: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster
  • nqueens: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster
  • pickle_list: 4.14 us +- 0.12 us -> 3.88 us +- 0.05 us: 1.07x faster
  • unpack_sequence: 59.7 ns +- 0.6 ns -> 55.9 ns +- 1.2 ns: 1.07x faster
  • go: 252 ms +- 4 ms -> 237 ms +- 4 ms: 1.06x faster
  • hexiom: 11.9 ms +- 0.1 ms -> 11.3 ms +- 0.3 ms: 1.06x faster
  • raytrace: 611 ms +- 9 ms -> 581 ms +- 11 ms: 1.05x faster
  • pickle: 11.3 us +- 0.4 us -> 10.8 us +- 0.1 us: 1.05x faster
  • crypto_pyaes: 137 ms +- 4 ms -> 130 ms +- 2 ms: 1.05x faster
  • unpickle_pure_python: 382 us +- 6 us -> 364 us +- 4 us: 1.05x faster
  • fannkuch: 561 ms +- 10 ms -> 537 ms +- 6 ms: 1.04x faster
  • pyflate: 789 ms +- 16 ms -> 756 ms +- 10 ms: 1.04x faster
  • logging_silent: 231 ns +- 7 ns -> 221 ns +- 7 ns: 1.04x faster
  • chaos: 124 ms +- 1 ms -> 119 ms +- 1 ms: 1.04x faster
  • pickle_pure_python: 538 us +- 10 us -> 518 us +- 8 us: 1.04x faster
  • mako: 17.1 ms +- 0.5 ms -> 16.5 ms +- 0.2 ms: 1.04x faster
  • 2to3: 361 ms +- 6 ms -> 347 ms +- 4 ms: 1.04x faster
  • xml_etree_process: 88.4 ms +- 2.2 ms -> 85.2 ms +- 0.9 ms: 1.04x faster
  • logging_format: 10.7 us +- 0.4 us -> 10.3 us +- 0.2 us: 1.04x faster
  • django_template: 59.3 ms +- 3.3 ms -> 57.6 ms +- 2.8 ms: 1.03x faster
  • sympy_integrate: 27.0 ms +- 0.5 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
  • pathlib: 46.6 ms +- 1.3 ms -> 45.3 ms +- 0.9 ms: 1.03x faster
  • sqlite_synth: 3.06 us +- 0.06 us -> 2.97 us +- 0.06 us: 1.03x faster
  • chameleon: 10.7 ms +- 0.1 ms -> 10.4 ms +- 0.2 ms: 1.03x faster
  • scimark_lu: 214 ms +- 4 ms -> 209 ms +- 6 ms: 1.02x faster
  • sympy_sum: 211 ms +- 6 ms -> 206 ms +- 2 ms: 1.02x faster
  • meteor_contest: 114 ms +- 2 ms -> 111 ms +- 3 ms: 1.02x faster
  • json_loads: 29.3 us +- 0.3 us -> 28.8 us +- 0.4 us: 1.02x faster
  • deltablue: 8.68 ms +- 0.16 ms -> 8.51 ms +- 0.21 ms: 1.02x faster
  • sympy_str: 382 ms +- 4 ms -> 375 ms +- 5 ms: 1.02x faster
  • spectral_norm: 192 ms +- 2 ms -> 189 ms +- 2 ms: 1.02x faster
  • richards: 95.3 ms +- 1.5 ms -> 93.7 ms +- 1.4 ms: 1.02x faster
  • logging_simple: 9.78 us +- 0.30 us -> 9.62 us +- 0.22 us: 1.02x faster
  • telco: 7.21 ms +- 0.13 ms -> 7.09 ms +- 0.09 ms: 1.02x faster
  • xml_etree_generate: 108 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x faster
  • scimark_sor: 229 ms +- 3 ms -> 226 ms +- 7 ms: 1.02x faster
  • sympy_expand: 639 ms +- 10 ms -> 630 ms +- 9 ms: 1.02x faster
  • python_startup_no_site: 9.73 ms +- 0.32 ms -> 9.58 ms +- 0.27 ms: 1.02x faster
  • float: 127 ms +- 3 ms -> 125 ms +- 3 ms: 1.01x faster
  • scimark_monte_carlo: 112 ms +- 2 ms -> 111 ms +- 3 ms: 1.01x faster
  • json_dumps: 15.1 ms +- 0.2 ms -> 15.0 ms +- 0.3 ms: 1.01x faster
  • scimark_fft: 467 ms +- 5 ms -> 465 ms +- 6 ms: 1.01x faster

Benchmark hidden because not significant (9): dulwich_log, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle, unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x faster
Ignored benchmarks (1) of baseline.json: regex_compile

@gvanrossum (Member) commented Jul 6, 2021

I ran the benchmarks on the "faster cpython" team's benchmark machine (which sits in a secret location :-) and got the following results, comparing this PR with the point where it branched off main (two commits behind).

This primarily goes to show that there's a lot of noise in the data...

pyperf compare_to -G --min-speed 1 req-1625592889-gvanrossum/results-data.json req-1625594534-gvanrossum/results-data.json

Slower (10):

  • scimark_sparse_mat_mult: 7.11 ms +- 0.17 ms -> 7.41 ms +- 0.18 ms: 1.04x slower
  • regex_dna: 216 ms +- 4 ms -> 222 ms +- 3 ms: 1.03x slower
  • unpack_sequence: 61.0 ns +- 0.6 ns -> 62.5 ns +- 0.8 ns: 1.02x slower
  • regex_effbot: 3.66 ms +- 0.05 ms -> 3.75 ms +- 0.01 ms: 1.02x slower
  • scimark_fft: 496 ms +- 5 ms -> 508 ms +- 4 ms: 1.02x slower
  • spectral_norm: 187 ms +- 2 ms -> 191 ms +- 4 ms: 1.02x slower
  • pidigits: 201 ms +- 0 ms -> 204 ms +- 1 ms: 1.02x slower
  • go: 265 ms +- 2 ms -> 269 ms +- 2 ms: 1.02x slower
  • regex_compile: 215 ms +- 1 ms -> 217 ms +- 1 ms: 1.01x slower
  • scimark_sor: 251 ms +- 2 ms -> 254 ms +- 3 ms: 1.01x slower

Faster (12):

  • crypto_pyaes: 144 ms +- 1 ms -> 139 ms +- 2 ms: 1.03x faster
  • unpickle_list: 5.50 us +- 0.06 us -> 5.33 us +- 0.04 us: 1.03x faster
  • nbody: 164 ms +- 2 ms -> 160 ms +- 1 ms: 1.03x faster
  • fannkuch: 584 ms +- 5 ms -> 570 ms +- 4 ms: 1.02x faster
  • nqueens: 125 ms +- 1 ms -> 122 ms +- 1 ms: 1.02x faster
  • chaos: 129 ms +- 5 ms -> 127 ms +- 1 ms: 1.02x faster
  • hexiom: 12.8 ms +- 0.1 ms -> 12.5 ms +- 0.1 ms: 1.02x faster
  • telco: 8.38 ms +- 0.37 ms -> 8.22 ms +- 0.27 ms: 1.02x faster
  • scimark_monte_carlo: 128 ms +- 2 ms -> 126 ms +- 2 ms: 1.02x faster
  • pickle_pure_python: 569 us +- 6 us -> 561 us +- 5 us: 1.01x faster
  • unpickle: 17.7 us +- 0.2 us -> 17.5 us +- 0.2 us: 1.01x faster
  • pathlib: 22.7 ms +- 0.3 ms -> 22.5 ms +- 0.4 ms: 1.01x faster

Benchmark hidden because not significant (36): 2to3, chameleon, deltablue, django_template, dulwich_log, float, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, pickle, pickle_dict, pickle_list, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@gvanrossum (Member):

PS. I did not use --pgo when building. That might affect things.

@iritkatriel (Member, Author) commented Jul 6, 2021

I repeated it and got this (configured with --enable-optimizations, on a Mac):

  • pickle_dict: 26.2 us +- 0.1 us -> 23.7 us +- 0.2 us: 1.11x faster
  • hexiom: 12.0 ms +- 0.3 ms -> 11.2 ms +- 0.2 ms: 1.07x faster
  • unpack_sequence: 59.5 ns +- 0.5 ns -> 55.9 ns +- 1.4 ns: 1.06x faster
  • nqueens: 122 ms +- 1 ms -> 115 ms +- 1 ms: 1.06x faster
  • nbody: 167 ms +- 1 ms -> 158 ms +- 2 ms: 1.06x faster
  • go: 250 ms +- 3 ms -> 238 ms +- 4 ms: 1.05x faster
  • raytrace: 607 ms +- 14 ms -> 578 ms +- 12 ms: 1.05x faster
  • unpickle_pure_python: 381 us +- 4 us -> 365 us +- 4 us: 1.04x faster
  • pickle_list: 4.07 us +- 0.04 us -> 3.90 us +- 0.10 us: 1.04x faster

@gvanrossum (Member):

Huh, I guess clang (Mac) is better than gcc (Linux), or more stable, or something. I had not used pgo/lto, and when I turned them on I got similar results, only more pronounced, but still with a geometric mean of 1.00, so I won't repeat the numbers here. (It wasn't always the same benchmarks that were slower/faster, although the top 5 or so on either side were roughly the same.)

I don't really understand what's going on with these benchmarks; maybe @vstinner has some insights. Are they just very noisy? I have a feeling that if we pay them too much attention we'll never make progress. Perhaps we should focus more on micro-benchmarks, and use PyPerformance to prove that we're not making things worse. Although I'd like to understand why unpack_sequence is 1.17x slower for me with pgo+lto.

(BTW, pgo+lto really does make a big difference. Comparing this PR without and with lto+pgo gives a geometric mean of 1.21x faster, with the extreme being richards at 1.46x. Puts our efforts in perspective. :-)

@gvanrossum (Member):

Huh, lto+pgo makes unpack_sequence 1.10x slower on your branch. And on main they make it only 1.04x faster. So there's something weird about it. After all, it's 500 lines of this:

        a, b, c, d, e, f, g, h, i, j = to_unpack

@vstinner (Member) commented Jul 7, 2021

Ah, benchmarking Python! I suggest only measuring Python performance when it's built with PGO+LTO. On Linux, pyperf is more reliable with CPU isolation (see the pyperf system tune command).

Usually, I ignore differences smaller than 5%; I treat them as noise in the benchmark itself rather than in the change. To see why, try building Python 5 times and measuring performance each time. In my experience, each PGO build performs a little differently because the PROFILE_TASK in the Makefile is not fully deterministic. In general, the Python build is not deterministic.

  • "Geometric mean: 1.02x faster"
  • "Geometric mean: 1.00x faster"

I added this feature to pyperf only recently, so I don't have much experience with it yet. But "Geometric mean: 1.00x faster" does not impress me: it suggests the change makes Python neither faster nor slower overall.

The geometric mean doesn't weight the benchmarks; each benchmark is treated equally.
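
Concretely, as far as I understand it, that summary line is the unweighted geometric mean of the per-benchmark speedup ratios; a minimal sketch (not pyperf's actual code):

import math

def geometric_mean(ratios):
    # Unweighted: every benchmark contributes equally.
    return math.exp(sum(map(math.log, ratios)) / len(ratios))

# Two benchmarks 1.08x faster, one 1.04x slower (a slowdown is a ratio < 1):
print(f"{geometric_mean([1.08, 1.08, 1 / 1.04]):.2f}x faster")  # 1.04x faster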

  • "nbody: 171 ms +- 5 ms -> 158 ms +- 4 ms: 1.08x faster": that's nice, since it's a "generic" benchmark. The advance() hot code contains multiple BINARY_SUBSCR instruction:
def advance(dt, n, bodies=SYSTEM, pairs=PAIRS):
    for i in range(n):
        for (([x1, y1, z1], v1, m1),
             ([x2, y2, z2], v2, m2)) in pairs:
            dx = x1 - x2
            dy = y1 - y2
            dz = z1 - z2
            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))
            b1m = m1 * mag
            b2m = m2 * mag
            v1[0] -= dx * b2m
            v1[1] -= dy * b2m
            v1[2] -= dz * b2m
            v2[0] += dx * b1m
            v2[1] += dy * b1m
            v2[2] += dz * b1m
        for (r, [vx, vy, vz], m) in bodies:
            r[0] += dt * vx
            r[1] += dt * vy
            r[2] += dt * vz
  • "nqueens: 124 ms +- 3 ms -> 115 ms +- 2 ms: 1.08x faster": same, it also contains many BINARY_SUBSCR instructions.

I also try not to pay too much attention to the speedups, and mostly check whether the unavoidable slowdowns are acceptable.

I have never seen a pyperformance result with no "slower" benchmarks; it's always a balance. That's why I added the geometric mean.

More about pyperformance individual benchmarks: https://pyperformance.readthedocs.io/benchmarks.html#available-benchmarks

@vstinner (Member) commented Jul 7, 2021

Huh, lto+pgo makes unpack_sequence 1.10x slower on your branch.

This benchmark is a micro-benchmark and was never reliable. I suggest removing it.

@vstinner (Member) commented Jul 7, 2021

See unpack_sequence at speed.python.org: results over only one month. The differences are quite significant.

[chart: unpack_sequence timings at speed.python.org over one month]

Timings are between 115 ms and 140 ms. IMO it's more a benchmark of the CPU L1 instruction cache: it depends on how the compiler decides to organize the Python machine code when using PGO. It depends a lot on code placement.

See https://vstinner.github.io/analysis-python-performance-issue.html for an example of analysis on code placement.

@vstinner (Member) commented Jul 7, 2021

(BTW, pgo+lto really does make a big difference. Comparing this PR without and with lto+pgo gives a geometric mean of 1.21x faster, with the extreme being richards at 1.46x. Puts our efforts in perspective. :-)

By the way, ./configure --enable-shared can also have a significant impact on performance if you build Python with GCC without -fno-semantic-interposition. This compiler flag is now enabled by default by ./configure --enable-optimizations.

If the "python" executable is a small binary (ex: 20 KB) which only calls Py_BytesMain() which lives in libpython dynamic library, function calls from libpython to libpython (Python calling itself its own C functions) can be slower because of the semantic interposition (ability to override symbols at runtime using LD_PRELOAD or whatever else).

Using -fno-semantic-interposition, you get back to the same speed as without ./configure --enable-shared: the compiler can again inline performance-critical functions. LTO is quite impressive for that! Compilers have made great progress on LTO over the last few years.

See my analysis of -fno-semantic-interposition and inlining in libpython:
https://developers.redhat.com/blog/2020/06/25/red-hat-enterprise-linux-8-2-brings-faster-python-3-8-run-speeds/

How Python is built makes a huge difference in performance.

Note: clang uses -fno-semantic-interposition by default.
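
For concreteness, the build variants discussed in this thread correspond roughly to these configure invocations (standard autoconf flags; a sketch, adjust to your toolchain):

./configure --enable-optimizations --with-lto    # PGO + LTO
./configure --enable-shared                      # python linked against libpython
./configure --enable-shared CFLAGS="-fno-semantic-interposition"  # GCC: restore inlining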

@pablogsal (Member):

The key is also running the benchmarks with CPU isolation; otherwise the noise will be in the range of 2-6% or even more, since the OS can schedule any process on the CPU you are using for benchmarking, or even schedule system interrupts and other resource contention there.

@pablogsal (Member):

I ran the benchmarks on the "faster cpython" team's benchmark machine (which sits in a secret location :-) and got the following results, comparing this PR with the point where it branched off main (two commits behind). [...] This primarily goes to show that there's a lot of noise in the data... [full results quoted from the comment above]

I repeated these benchmarks on my machine with PGO/LTO and CPU isolation, and I got very similar results to these. I will try to post them later.

@pablogsal (Member) commented Jul 7, 2021

Perhaps we should focus more on micro-benchmarks, and use PyPerformance to prove that we're not making things worse.

I get the feeling, but macro benchmarks are the only way to really claim general speedups. They are also the benchmarks we show on speed.python.org, and the ones we have used in similar past work to decide whether a change was worth it.

It is also the metric that every other implementation uses to claim that it is "X times faster than CPython".

@markshannon (Member):

There are useful numbers that can be gathered which should be deterministic:

  • What fraction of the total executed instructions is BINARY_SUBSCR?
  • What are the cache hit, miss, and de-opt numbers for the various specializations?
  • What kinds of values are not being specialized for?
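
None of that instrumentation is part of this PR, but for illustration, the first of those numbers can be gathered from pure Python (deterministically, though far too slowly for real benchmarking) with frame.f_trace_opcodes; a sketch:

import dis
import sys
from collections import Counter

counts = Counter()

def tracer(frame, event, arg):
    if event == "call":
        frame.f_trace_opcodes = True  # request per-opcode events for this frame
    elif event == "opcode":
        # f_lasti is the offset of the instruction about to execute.
        counts[dis.opname[frame.f_code.co_code[frame.f_lasti]]] += 1
    return tracer

def workload():
    l = [1, 2, 3, 4, 5, 6]
    for _ in range(10_000):
        l[3]

sys.settrace(tracer)
workload()
sys.settrace(None)

total = sum(counts.values())
print(f"BINARY_SUBSCR: {counts['BINARY_SUBSCR']}/{total} opcodes "
      f"({100 * counts['BINARY_SUBSCR'] / total:.1f}%)")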

@markshannon (Member) left a comment:

Looks good. Just needs a few minor changes.

@markshannon (Member):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

@iritkatriel (Member, Author):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

I fixed it locally with #include "pycore_long.h", but I can do this instead.

@markshannon (Member):

The Windows compiler doesn't seem to like the reference to _PyLong_GetZero().
You could change the assert to Py_SIZE(sub) == 1 || ((PyLongObject *)sub)->ob_digit[0] == 0, or just delete it.

I fixed it locally with #include "pycore_long.h", but I can do this instead.

If you've already fixed it, then that's fine. No need to change it again.

@iritkatriel (Member, Author):

I have made the requested changes; please review again.

@bedevere-bot:

Thanks for making the requested changes!

@markshannon: please review the changes made to this pull request.

@iritkatriel (Member, Author):

./python.exe -m pyperf compare_to -G baseline.json opt.json
Slower (11):

  • pickle_dict: 26.2 us +- 0.7 us -> 26.8 us +- 0.3 us: 1.02x slower
  • scimark_sparse_mat_mult: 6.25 ms +- 0.13 ms -> 6.37 ms +- 0.20 ms: 1.02x slower
  • python_startup: 13.2 ms +- 0.4 ms -> 13.4 ms +- 0.5 ms: 1.02x slower
  • unpickle: 16.3 us +- 0.4 us -> 16.6 us +- 0.3 us: 1.02x slower
  • chameleon: 10.4 ms +- 0.2 ms -> 10.6 ms +- 0.1 ms: 1.01x slower
  • python_startup_no_site: 9.47 ms +- 0.26 ms -> 9.60 ms +- 0.29 ms: 1.01x slower
  • pidigits: 190 ms +- 2 ms -> 193 ms +- 4 ms: 1.01x slower
  • sqlite_synth: 3.09 us +- 0.06 us -> 3.12 us +- 0.06 us: 1.01x slower
  • pickle: 11.1 us +- 0.1 us -> 11.2 us +- 0.2 us: 1.01x slower
  • xml_etree_process: 87.2 ms +- 1.1 ms -> 87.8 ms +- 1.1 ms: 1.01x slower
  • regex_effbot: 3.45 ms +- 0.05 ms -> 3.47 ms +- 0.04 ms: 1.01x slower

Faster (32):

  • nbody: 168 ms +- 2 ms -> 147 ms +- 2 ms: 1.14x faster
  • crypto_pyaes: 138 ms +- 2 ms -> 125 ms +- 2 ms: 1.11x faster
  • hexiom: 12.1 ms +- 0.3 ms -> 11.0 ms +- 0.2 ms: 1.10x faster
  • nqueens: 122 ms +- 2 ms -> 113 ms +- 5 ms: 1.08x faster
  • fannkuch: 569 ms +- 5 ms -> 528 ms +- 7 ms: 1.08x faster
  • go: 252 ms +- 4 ms -> 234 ms +- 3 ms: 1.07x faster
  • chaos: 127 ms +- 2 ms -> 119 ms +- 3 ms: 1.07x faster
  • scimark_lu: 214 ms +- 4 ms -> 201 ms +- 4 ms: 1.06x faster
  • logging_silent: 233 ns +- 9 ns -> 220 ns +- 6 ns: 1.06x faster
  • pyflate: 796 ms +- 24 ms -> 753 ms +- 8 ms: 1.06x faster
  • scimark_monte_carlo: 115 ms +- 2 ms -> 110 ms +- 2 ms: 1.04x faster
  • meteor_contest: 114 ms +- 2 ms -> 110 ms +- 2 ms: 1.04x faster
  • unpack_sequence: 60.8 ns +- 0.8 ns -> 58.5 ns +- 1.1 ns: 1.04x faster
  • float: 125 ms +- 2 ms -> 120 ms +- 2 ms: 1.04x faster
  • richards: 96.5 ms +- 2.1 ms -> 93.0 ms +- 2.0 ms: 1.04x faster
  • telco: 7.17 ms +- 0.10 ms -> 6.91 ms +- 0.11 ms: 1.04x faster
  • scimark_fft: 471 ms +- 5 ms -> 455 ms +- 6 ms: 1.04x faster
  • scimark_sor: 232 ms +- 6 ms -> 225 ms +- 3 ms: 1.03x faster
  • django_template: 58.9 ms +- 2.2 ms -> 57.3 ms +- 2.5 ms: 1.03x faster
  • sympy_sum: 215 ms +- 9 ms -> 209 ms +- 4 ms: 1.03x faster
  • pickle_pure_python: 536 us +- 19 us -> 522 us +- 7 us: 1.03x faster
  • sympy_integrate: 26.9 ms +- 0.4 ms -> 26.2 ms +- 0.5 ms: 1.03x faster
  • unpickle_pure_python: 375 us +- 5 us -> 366 us +- 5 us: 1.02x faster
  • logging_format: 10.8 us +- 0.3 us -> 10.6 us +- 0.3 us: 1.02x faster
  • json_loads: 30.3 us +- 1.2 us -> 29.7 us +- 0.4 us: 1.02x faster
  • logging_simple: 9.93 us +- 0.20 us -> 9.75 us +- 0.16 us: 1.02x faster
  • spectral_norm: 193 ms +- 2 ms -> 190 ms +- 3 ms: 1.02x faster
  • pathlib: 46.7 ms +- 1.1 ms -> 46.0 ms +- 1.0 ms: 1.02x faster
  • regex_dna: 186 ms +- 4 ms -> 183 ms +- 2 ms: 1.02x faster
  • regex_v8: 28.0 ms +- 1.3 ms -> 27.6 ms +- 0.8 ms: 1.02x faster
  • sympy_str: 383 ms +- 7 ms -> 378 ms +- 5 ms: 1.01x faster
  • sympy_expand: 636 ms +- 7 ms -> 631 ms +- 7 ms: 1.01x faster

Benchmark hidden because not significant (14): 2to3, deltablue, dulwich_log, json_dumps, mako, pickle_list, regex_compile, sqlalchemy_declarative, sqlalchemy_imperative, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.02x faster
Ignored benchmarks (1) of baseline.json: raytrace

@iritkatriel (Member, Author):

Microbenchmarks (times in nsec per loop):

            baseline    opt     speedup
list        22.3        14.9    33.18%
tuple       22.2        14.7    33.78%
dict        24.4        21.1    13.52%
string      24.9        25.3    -1.61%
tuple-neg   23.2        22.8     1.72%

@@ -310,17 +310,18 @@ too_many_cache_misses(_PyAdaptiveEntry *entry) {
     return entry->counter == saturating_zero();
 }

-#define BACKOFF 64
+#define ADAPTIVE_CACHE_BACKOFF 64
A reviewer (Member) commented on this change: 👍

@markshannon (Member):

Two very minor tweaks, otherwise ready to merge 🙂

@markshannon (Member):

Excellent.
Thanks Irit for seeing this through.

@markshannon merged commit 641345d into python:main on Jul 15, 2021
sthagen added a commit to sthagen/python-cpython that referenced this pull request Jul 15, 2021
bpo-26280: Port BINARY_SUBSCR to PEP 659 adaptive interpreter (pythonGH-27043)
@iritkatriel deleted the bpo-26280 branch on July 15, 2021 at 14:21