[inductor][cpu]lennard_jones, pyhpc_isoneutral_mixing and pyhpc_equation_of_state performance regression in 2024-05-12 nightly release #126293

Closed
zxd1997066 opened this issue May 15, 2024 · 1 comment
Labels: oncall: cpu inductor (CPU Inductor issues for Intel team to triage)

Comments

@zxd1997066
Contributor

zxd1997066 commented May 15, 2024

🐛 Describe the bug

fp32 static shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | lennard_jones | single | 1 | 1.572944 | 4.5809e-05 | 7.2054991696e-05 | 3.945146 | 1.0 | 1.834938 | 3.7955999999999994e-05 | 6.964690672799999e-05 | 5.762626 | 0.86 | 0.97 | 0.83 | 1.46 |
| torchbench | pyhpc_isoneutral_mixing | single | 1 | 53.873001 | 5.0139e-05 | 0.002701138397139 | 10.347831 | 1.0 | 64.445867 | 4.2233e-05 | 0.0027217423010110005 | 12.185837 | 0.84 | 1.01 | 0.84 | 1.18 |

fp32 dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | lennard_jones | single | 1 | 1.539954 | 4.5865e-05 | 7.062999021e-05 | 3.927617 | 1.0 | 1.817732 | 3.8005e-05 | 6.908290466e-05 | 5.76899 | 0.85 | 0.98 | 0.83 | 1.47 |
| torchbench | pyhpc_equation_of_state | single | 1 | 20.214927 | 5.2469e-05 | 0.0010606570047630001 | 6.882186 | 1.0 | 23.225694 | 4.4226e-05 | 0.001027179542844 | 8.763813 | 0.87 | 0.97 | 0.84 | 1.27 |
| torchbench | pyhpc_isoneutral_mixing | single | 1 | 54.378022 | 5.0307e-05 | 0.002735595152754 | 10.30333 | 1.0 | 64.706471 | 4.1711999999999996e-05 | 0.0026990363183519994 | 12.204166 | 0.84 | 0.99 | 0.83 | 1.18 |

fp32 static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | lennard_jones | single | 1 | 1.908596 | 3.6968e-05 | 7.0556976928e-05 | 12.051354 | 1.0 | 2.093192 | 3.2567e-05 | 6.816898386400001e-05 | 13.964945 | 0.91 | 0.97 | 0.88 | 1.16 |
| torchbench | pyhpc_isoneutral_mixing | single | 1 | 49.319962 | 5.5797e-05 | 0.002751905919714 | 18.445618 | 1.0 | 55.909156 | 4.9216000000000004e-05 | 0.002751625021696 | 20.460509 | 0.88 | 1.0 | 0.88 | 1.11 |

fp32 dynamic shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | lennard_jones | single | 1 | 1.867954 | 3.7521e-05 | 7.0087502034e-05 | 11.966901 | 1.0 | 2.199674 | 3.1592000000000005e-05 | 6.949210100800001e-05 | 13.932152 | 0.85 | 0.99 | 0.84 | 1.16 |
| torchbench | pyhpc_isoneutral_mixing | single | 1 | 48.235191 | 5.6397e-05 | 0.002720320066827 | 18.342588 | 1.0 | 54.152881 | 4.9823e-05 | 0.002698058990063 | 20.393653 | 0.89 | 0.99 | 0.88 | 1.11 |

AMP static shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | pyhpc_equation_of_state | single | 1 | 18.280486 | 2.5393e-05 | 0.00046419638099799997 | 5.111035 | 1.0 | 21.430476 | 2.1813e-05 | 0.00046746297298799993 | 6.583431 | 0.85 | 1.01 | 0.86 | 1.29 |

AMP dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | pyhpc_equation_of_state | single | 1 | 18.205576 | 2.5667e-05 | 0.000467282519192 | 5.100246 | 1.0 | 21.220869 | 2.1705e-05 | 0.000460598961645 | 6.595176 | 0.86 | 0.99 | 0.85 | 1.29 |

AMP dynamic shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | pyhpc_isoneutral_mixing | single | 1 | 37.581918 | 3.8873e-05 | 0.001460921898414 | 14.335784 | 1.0 | 45.019578 | 3.2526e-05 | 0.001464306794028 | 15.774886 | 0.83 | 1.0 | 0.84 | 1.1 |
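
For readers checking the ratio columns in the tables above, the following is a minimal sketch of how they appear to be derived, using the lennard_jones row from the fp32 static shape default wrapper table. The dictionary keys and the `ratios` helper are illustrative names, not part of the benchmark tooling; the formulas are inferred from the column headers and happen to match the reported values for this row.

```python
# Values copied from the lennard_jones row of the
# "fp32 static shape default wrapper" table.
new = {
    "speedup": 1.572944,
    "inductor": 4.5809e-05,
    "eager": 7.2054991696e-05,
    "compile": 3.945146,
}
old = {
    "speedup": 1.834938,
    "inductor": 3.7955999999999994e-05,
    "eager": 6.964690672799999e-05,
    "compile": 5.762626,
}


def ratios(new, old):
    """Recompute the four ratio columns (hypothetical helper, formulas inferred)."""
    return {
        # Speedup ratio is new/old, so values below 1.0 mean a regression.
        "speedup_new_over_old": round(new["speedup"] / old["speedup"], 2),
        # The remaining ratios are old/new, as stated in the column headers.
        "eager_old_over_new": round(old["eager"] / new["eager"], 2),
        "inductor_old_over_new": round(old["inductor"] / new["inductor"], 2),
        "compile_old_over_new": round(old["compile"] / new["compile"], 2),
    }


result = ratios(new, old)
# The speedup column itself is consistent with eager time / inductor time.
derived_speedup = new["eager"] / new["inductor"]
print(result, derived_speedup)
```

Read this way, the eager ratio staying near 1.0 while the inductor ratio drops to roughly 0.83 to 0.88 suggests the slowdown is in the compiled path rather than in the eager baseline.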

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
|---|---|---|---|---|
| torchbench | main | d6015d42 | main | d6015d42 |
| torch | main | 02093b6 | main | fc183f0 |
| torchvision | main | 0.19.0a0+d23a6e1 | main | 0.19.0a0+06ad737 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.2.0a0+ea437b3 | main | 2.2.0a0+ea437b3 |
| torchdata | main | 0.7.1a0+0790338 | main | 0.7.1a0+0790338 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro:
inductor_single_run.sh
bash inductor_single_run.sh single inference performance torchbench model float32/amp first dynamic/static default/cpp
Suspected guilty commit: b23b6e7
torchbench-pyhpc_isoneutral_mixing-inference-float32-static-default-single-performance-drop_guilty_commit.log
cc @WeizhuoZhang-intel @chuanqi129
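
The repro line above uses placeholders (`model`, `float32/amp`, `dynamic/static`, `default/cpp`). As a rough sketch, the concrete invocations can be enumerated as below. The positional argument order is copied from the repro line; what each position means inside `inductor_single_run.sh` is an assumption, since the script itself is not shown here, and `repro_commands` is a hypothetical helper.

```python
from itertools import product


def repro_commands(models,
                   precisions=("float32", "amp"),
                   shapes=("static", "dynamic"),
                   wrappers=("default", "cpp")):
    """Expand the repro template into one argv list per configuration."""
    commands = []
    for model, precision, shape, wrapper in product(models, precisions, shapes, wrappers):
        commands.append([
            "bash", "inductor_single_run.sh",
            "single", "inference", "performance", "torchbench",
            model, precision, "first", shape, wrapper,
        ])
    return commands


cmds = repro_commands(["lennard_jones"])
for argv in cmds:
    # Print rather than execute; pass argv to subprocess.run to actually run it.
    print(" ".join(argv))
```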

chuanqi129 added the oncall: cpu inductor (CPU Inductor issues for Intel team to triage) label May 15, 2024
@zxd1997066
Contributor Author

196a0b1

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
Testing with dynamic shapes.
Testing with cpp wrapper.
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  pyhpc_isoneutral_mixing
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 509.93it/s]
50.073x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,pyhpc_isoneutral_mixing,1,50.072613,0.033701,7.295003,0.795865,38.535168,48.419226,746,1,0,0,0,0,0

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  pyhpc_equation_of_state
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 1344.97it/s]
24.374x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,pyhpc_equation_of_state,1,24.374139,0.022332,4.709370,0.823529,38.535168,46.792704,368,1,0,0,0,0,0

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  lennard_jones
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 4385.33it/s]
1.866x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,lennard_jones,1,1.866308,0.021217,3.615042,0.849057,38.928384,45.848986,9,1,0,0,0,0,0

b23b6e7

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
Testing with dynamic shapes.
Testing with cpp wrapper.
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  pyhpc_isoneutral_mixing
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 512.97it/s]
43.962x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,pyhpc_isoneutral_mixing,1,43.961587,0.038023,7.248890,0.800464,38.757171,48.418406,746,1,0,0,0,0,0

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  pyhpc_equation_of_state
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 1338.79it/s]
20.614x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,pyhpc_equation_of_state,1,20.614192,0.026325,4.696770,0.824128,38.613811,46.854144,368,1,0,0,0,0,0

/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
Testing with freezing on.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu  eval  lennard_jones
running benchmark: 100%|██████████| 50/50 [00:00<00:00, 4308.39it/s]
1.614x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,lennard_jones,1,1.613721,0.025152,2.632870,0.852487,39.085670,45.848986,9,1,0,0,0,0,0

Hi @aorenste, according to the bisect log and the test results above, PR #122074 appears to have introduced a performance regression on CPU. Could you please double-check it?
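
To quantify the gap between the two commits, the CSV rows printed in the logs above can be parsed and compared. This is a minimal sketch: the header and rows are copied from the logs, while `speedups` and the variable names are illustrative.

```python
import csv
import io

HEADER = ("dev,name,batch_size,speedup,abs_latency,compilation_latency,"
          "compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,"
          "unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,"
          "autograd_compiles,cudagraph_skips")

# CSV rows as printed at commit 196a0b1.
ROWS_196A0B1 = """
cpu,pyhpc_isoneutral_mixing,1,50.072613,0.033701,7.295003,0.795865,38.535168,48.419226,746,1,0,0,0,0,0
cpu,pyhpc_equation_of_state,1,24.374139,0.022332,4.709370,0.823529,38.535168,46.792704,368,1,0,0,0,0,0
cpu,lennard_jones,1,1.866308,0.021217,3.615042,0.849057,38.928384,45.848986,9,1,0,0,0,0,0
"""

# CSV rows as printed at the suspected commit b23b6e7.
ROWS_B23B6E7 = """
cpu,pyhpc_isoneutral_mixing,1,43.961587,0.038023,7.248890,0.800464,38.757171,48.418406,746,1,0,0,0,0,0
cpu,pyhpc_equation_of_state,1,20.614192,0.026325,4.696770,0.824128,38.613811,46.854144,368,1,0,0,0,0,0
cpu,lennard_jones,1,1.613721,0.025152,2.632870,0.852487,39.085670,45.848986,9,1,0,0,0,0,0
"""


def speedups(rows):
    """Map benchmark name to its reported speedup."""
    reader = csv.DictReader(io.StringIO(HEADER + "\n" + rows.strip()))
    return {row["name"]: float(row["speedup"]) for row in reader}


before = speedups(ROWS_196A0B1)
after = speedups(ROWS_B23B6E7)
# Ratio below 1.0 means the speedup dropped at b23b6e7.
relative = {name: after[name] / before[name] for name in before}
for name, ratio in relative.items():
    print(f"{name}: {ratio:.3f}")
```

Each of these is a single run per commit, so the ratios are indicative only; run-to-run noise is not accounted for.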

aorenste added a commit that referenced this issue May 23, 2024
The original change was about 9.5% slower than the backout.
This improves it to be only about 1.41% slower than the backout.

Fixes #126293

Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x

fix:
47.505x
49.987x
47.493x

backout:
48.243x
48.199x
48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x

fix:
16.402x
16.370x
16.183x

backout:
16.554x
16.675x
16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x

fix:
1.804x
1.798x
1.792x

backout:
1.864x
1.824x
1.836x
```

[ghstack-poisoned]
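
The commit message does not say how the 9.5% and 1.41% summary figures were computed. One reading that lands close to both numbers is sketched below: take the mean speedup of the three runs per variant, express it as a percentage below the backout mean, and average that over the three benchmarks. The run values are copied from the commit message; the aggregation method and the helper name are assumptions.

```python
from statistics import mean

# Speedups from the commit message, three runs per variant.
runs = {
    "pyhpc_isoneutral_mixing": {
        "stable": [43.948, 45.754, 44.906],
        "fix": [47.505, 49.987, 47.493],
        "backout": [48.243, 48.199, 48.192],
    },
    "pyhpc_equation_of_state": {
        "stable": [15.224, 13.286, 15.354],
        "fix": [16.402, 16.370, 16.183],
        "backout": [16.554, 16.675, 16.787],
    },
    "lennard_jones": {
        "stable": [1.712, 1.651, 1.640],
        "fix": [1.804, 1.798, 1.792],
        "backout": [1.864, 1.824, 1.836],
    },
}


def pct_slower_than_backout(variant):
    """Average over benchmarks of how far the mean speedup falls below backout, in percent."""
    per_benchmark = [
        (1 - mean(r[variant]) / mean(r["backout"])) * 100
        for r in runs.values()
    ]
    return mean(per_benchmark)


stable_gap = pct_slower_than_backout("stable")
fix_gap = pct_slower_than_backout("fix")
print(f"stable: {stable_gap:.2f}% slower, fix: {fix_gap:.2f}% slower")
```

Under this reading, the fix is slightly faster than the backout on pyhpc_isoneutral_mixing and roughly 2% slower on the other two benchmarks.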
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this issue May 28, 2024
The original change was about 9.5% slower than before pytorch#122074.
This improves it to be only about 1.4% slower.

Also touched up some unrelated nits that the linter complained about.

Fixes pytorch#126293

Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with pytorch#122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x

fix:
47.505x
49.987x
47.493x

backout:
48.243x
48.199x
48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x

fix:
16.402x
16.370x
16.183x

backout:
16.554x
16.675x
16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x

fix:
1.804x
1.798x
1.792x

backout:
1.864x
1.824x
1.836x
```

Pull Request resolved: pytorch#126996
Approved by: https://github.com/jansel