V2 Performance Signal Detected by TorchBench CI on '2.2.0.dev20231205+cu118' #2076
cc @atalman Looks like the oneDNN upgrade will regress this model: pytorch/pytorch@62df4f34283
Hi @xuzhao9, I have tried this model with the cpu userbenchmark on 4 cores of a C6i.16xlarge instance and can't reproduce this performance drop. Below is my test cmd. Could you please help double-check what's different between our test envs? What instance type do you use in the test? I can try it again.
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
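As an aside, the `LD_PRELOAD` line above will emit "object cannot be preloaded" warnings at program startup if either library is missing from the environment. A minimal sketch of a more defensive setup (the helper function and the `/opt/conda` fallback path are assumptions for illustration, not part of this thread):

```shell
# Sketch: assemble LD_PRELOAD only from libraries that actually exist,
# so a missing libiomp5/libjemalloc doesn't produce preload warnings.
# build_preload DIR LIB...  -> prints a colon-joined list of the libs found in DIR
build_preload() {
  dir="$1"; shift
  out=""
  for lib in "$@"; do
    if [ -f "$dir/$lib" ]; then
      out="${out:+$out:}$dir/$lib"
    fi
  done
  printf '%s' "$out"
}

# Assumed layout: libs live under $CONDA_PREFIX/lib, as in the exports above.
libdir="${CONDA_PREFIX:-/opt/conda}/lib"
preload="$(build_preload "$libdir" libiomp5.so libjemalloc.so)"
if [ -n "$preload" ]; then
  export LD_PRELOAD="$preload"
fi
```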
# Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e
root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch 2.2.0a0+git3fbfa8c
torch-fidelity 0.3.0
torch_geometric 2.4.0
torchaudio 2.1.1+db62484
torchdata 0.7.0a0+11bb5b8
torchmetrics 1.2.0
torchtext 0.16.0a0+b0ebddc
torchvision 0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher
Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
2023-12-08 04:49:07,596 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:49:07,604 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:49:07,604 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:49:07,604 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:49:07,604 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:49:07,604 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Start training... (repeated 24 times)
[Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906/pytorch_stargan-train/metrics-29691.json
{
"name": "cpu",
"environ": {
"pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
},
"metrics": {
"latency": 431.629427
}
}
# Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74
root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch 2.2.0a0+git7843df6
torch-fidelity 0.3.0
torch_geometric 2.4.0
torchaudio 2.1.1+db62484
torchdata 0.7.0a0+11bb5b8
torchmetrics 1.2.0
torchtext 0.16.0a0+b0ebddc
torchvision 0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher
Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
2023-12-08 04:47:03,967 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:47:03,976 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:47:03,976 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:47:03,976 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:47:03,976 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:47:03,976 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Start training... (repeated 24 times)
[Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703/pytorch_stargan-train/metrics-29485.json
{
"name": "cpu",
"environ": {
"pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
},
"metrics": {
"latency": 407.194572
}
}
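The pair of `metrics-*.json` files above can be compared mechanically rather than by eye. A minimal sketch (the single-metric JSON layout is taken from the logs above; the helper name is made up):

```python
import json

def latency_delta_pct(base_path: str, new_path: str) -> float:
    """Percent latency change of `new` relative to `base` (negative = faster)."""
    def read_latency(path):
        with open(path) as f:
            return json.load(f)["metrics"]["latency"]
    base, new = read_latency(base_path), read_latency(new_path)
    return (new - base) / base * 100.0

# With the two latencies reported above (431.629427 -> 407.194572),
# the affected commit comes out ~5.7% *faster* on this particular run.
```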
@chuanqi129 It could also be noise - I noticed that this model has larger variance than other models. I will double check on our CI machine tomorrow and share the results. Our CI machine is AWS g4dn.metal. We are using the following setup:
Thanks @xuzhao9 for the quick response; I have updated the results in the comment above with 1 socket (32 cores). For CPU perf tests, in our practice, besides enabling core binding, jemalloc also helps stabilize performance on the CPU device with the settings below.
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
jemalloc can be installed by
I think we should measure what users can observe on their system (minus noise; this is why frequency scaling is a valid trick).
@chuanqi129 To my understanding, the numbers you've shown actually show that commit 7843df60e41f856edb148bbcbb5b9aee8292db74 is indeed faster than 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e on
What do you think?
@chuanqi129 Can we look into the problem by comparing the performance profiles?
Thanks @xuzhao9 for helping to correct me; I had misunderstood that the base commit was oneDNN 3.1.1 and the affected commit oneDNN 3.3.2. I have double-checked in my env and it's true: there is a ~6% performance drop caused by oneDNN conv backward data on 2 specific shapes, mb16_ic1024oc2048_ih4oh2kh4sh2dh0ph1_iw4ow2kw4sw2dw0pw1 and mb16_ic512oc1024_ih8oh4kh4sh2dh0ph1_iw8ow4kw4sw2dw0pw1. We have raised the issue with the oneDNN team, and they will follow up on it. We also measured all runnable models (including this one) for fp32 eager training with torchbench; the geomean ratio is 0.988x. In fact, we also did a lot of inference tests on other paths with torchbench, plus inductor-related tests with the torchdynamo benchmark suites, for the oneDNN upgrade PR. @xuzhao9 @malfet what do you think about the priority of this issue? Can we waive it for the oneDNN 3.3.2 upgrade?
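The "geomean ratio is 0.988x" figure above is a geometric mean of per-model speedup ratios. A sketch of that aggregation (the ratios in the example are made up for illustration, not the measured set):

```python
import math

def geomean(ratios):
    """Geometric mean of per-model speedup ratios (new/old)."""
    assert ratios and all(r > 0 for r in ratios)
    # Summing logs is numerically safer than multiplying many ratios together.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative only: one ~6% regression averaged against mostly-neutral models.
print(round(geomean([0.94, 1.00, 1.01, 0.99, 1.00]), 3))  # prints 0.988
```

The geometric mean is the right aggregate here because speedups are multiplicative: a 2x gain and a 0.5x loss should cancel out to 1.0x, which an arithmetic mean would not give.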
Thanks @jgong5, yes, we did it and verified the regression.
@chuanqi129 do you see any perf gains as a result of the update? If there are no gains and only regressions, then it's a no-brainer to revert. If, say, the geomean speedup is up by 12% but one test is down by 8%, then it is probably an acceptable tradeoff.
Hi @malfet, yes, we do see performance gains for some models, for example
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues #115346, #120211 and #120406 and those listed in PR #112700. Issue #115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).

1. A performance regression with a 5.8% perf drop on `pytorch_stargan-train` (see pytorch/benchmark#2076 (comment)). Validation results with this patch: latency increased by 0.60%.

```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}

oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see #115346 (comment)). Validation results with this patch: latency reduced by 3.23%.

```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1 (inductor speedup over eager mode): 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4 (inductor speedup over eager mode): 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see #115346 (comment)). Validation results with this patch: latency reduced by 0.85%.

```
Tested on an AWS spr metal instance

oneDNN v3.1.1 (inductor speedup over eager mode): 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4 (inductor speedup over eager mode): 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.

- #120211
- #120406
- #120547

-----

Below are detailed data of the torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.

I. *torchbench CPU userbenchmark test*

Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*

i. FP32 static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii. FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv. AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
Hi @xuzhao9, the PR pytorch/pytorch#120767 has landed, which fixes this issue. Could we close this issue after verification? Thanks!
@chuanqi129 Thanks for the update. I can confirm that the issue has been fixed (pytorch_stargan CPU latency has decreased to ~0.86 ms).
TorchBench CI has detected a performance signal.
Base PyTorch version: 2.2.0.dev20231204+cu118
Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e
Affected PyTorch version: 2.2.0.dev20231205+cu118
Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74
Affected Tests:
cc @xuzhao9
Result json:
Bisection workflow link: https://github.com/pytorch/benchmark/actions/runs/7117987075