V2 Performance Signal Detected by TorchBench CI on '2.2.0.dev20231205+cu118' #2076
cc @atalman Looks like the oneDNN upgrade will regress this model: pytorch/pytorch@62df4f34283
Hi @xuzhao9, I have tried this model with the cpu userbenchmark on 4 cores of a C6i.16xlarge instance and can't reproduce this performance drop. Below is my test cmd. Could you please help double-check what's different between our test envs? What instance type do you use in the test? I can try it again.
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
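As an aside, the `LD_PRELOAD` line above will emit "object cannot be preloaded" warnings at program startup if either library is missing from the environment. A minimal sketch of a more defensive setup (the helper function and the `/opt/conda` fallback path are assumptions for illustration, not part of this thread):

```shell
# Sketch: assemble LD_PRELOAD only from libraries that actually exist,
# so a missing libiomp5/libjemalloc doesn't produce preload warnings.
# build_preload DIR LIB...  -> prints a colon-joined list of the libs found in DIR
build_preload() {
  dir="$1"; shift
  out=""
  for lib in "$@"; do
    if [ -f "$dir/$lib" ]; then
      out="${out:+$out:}$dir/$lib"
    fi
  done
  printf '%s' "$out"
}

# Assumed layout: libs live under $CONDA_PREFIX/lib, as in the exports above.
libdir="${CONDA_PREFIX:-/opt/conda}/lib"
preload="$(build_preload "$libdir" libiomp5.so libjemalloc.so)"
if [ -n "$preload" ]; then
  export LD_PRELOAD="$preload"
fi
```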
# Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e
root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch 2.2.0a0+git3fbfa8c
torch-fidelity 0.3.0
torch_geometric 2.4.0
torchaudio 2.1.1+db62484
torchdata 0.7.0a0+11bb5b8
torchmetrics 1.2.0
torchtext 0.16.0a0+b0ebddc
torchvision 0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher
Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
2023-12-08 04:49:07,596 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:49:07,604 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:49:07,604 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:49:07,604 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:49:07,604 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:49:07,604 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Start training... (repeated 24 times)
[Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906/pytorch_stargan-train/metrics-29691.json
{
"name": "cpu",
"environ": {
"pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
},
"metrics": {
"latency": 431.629427
}
}
# Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74
root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch 2.2.0a0+git7843df6
torch-fidelity 0.3.0
torch_geometric 2.4.0
torchaudio 2.1.1+db62484
torchdata 0.7.0a0+11bb5b8
torchmetrics 1.2.0
torchtext 0.16.0a0+b0ebddc
torchvision 0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher
Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
2023-12-08 04:47:03,967 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:47:03,976 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:47:03,976 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:47:03,976 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:47:03,976 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:47:03,976 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Start training... (repeated 24 times)
[Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703/pytorch_stargan-train/metrics-29485.json
{
"name": "cpu",
"environ": {
"pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
},
"metrics": {
"latency": 407.194572
}
}
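The pair of `metrics-*.json` files above can be compared mechanically rather than by eye. A minimal sketch (the single-metric JSON layout is taken from the logs above; the helper name is made up):

```python
import json

def latency_delta_pct(base_path: str, new_path: str) -> float:
    """Percent latency change of `new` relative to `base` (negative = faster)."""
    def read_latency(path):
        with open(path) as f:
            return json.load(f)["metrics"]["latency"]
    base, new = read_latency(base_path), read_latency(new_path)
    return (new - base) / base * 100.0

# With the two latencies reported above (431.629427 -> 407.194572),
# the affected commit comes out ~5.7% *faster* on this particular run.
```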
@chuanqi129 It could also be noise - I noticed that this model has larger variance than other models. I will double check on our CI machine tomorrow and share the results. Our CI machine is AWS g4dn.metal. We are using the following setup:
Thanks @xuzhao9 for the quick response; I have updated the results in the comment above with 1 socket (32 cores). For CPU perf tests, in our practice, besides enabling core binding, jemalloc also helps stabilize performance on the CPU device with the settings below.
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
jemalloc can be installed by
I think we should measure what users can observe on their system (minus noise; this is why frequency scaling is a valid trick).
@chuanqi129 To my understanding, the numbers you've shown actually show that commit 7843df60e41f856edb148bbcbb5b9aee8292db74 is indeed faster than 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e on
What do you think?
@chuanqi129 Can we look into the problem by comparing the performance profiles?
Thanks @xuzhao9 for helping to correct me; I had misunderstood that the base commit was oneDNN 3.1.1 and the affected commit oneDNN 3.3.2. I have double-checked in my env and it's true: there is a ~6% performance drop caused by oneDNN conv backward data on 2 specific shapes, mb16_ic1024oc2048_ih4oh2kh4sh2dh0ph1_iw4ow2kw4sw2dw0pw1 and mb16_ic512oc1024_ih8oh4kh4sh2dh0ph1_iw8ow4kw4sw2dw0pw1. We have raised the issue with the oneDNN team, and they will follow up on it. We also measured all runnable models (including this one) for fp32 eager training with torchbench; the geomean ratio is 0.988x. In fact, we also did a lot of inference tests on other paths with torchbench, plus inductor-related tests with the torchdynamo benchmark suites, for the oneDNN upgrade PR. @xuzhao9 @malfet what do you think about the priority of this issue? Can we waive it for the oneDNN 3.3.2 upgrade?
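The "geomean ratio is 0.988x" figure above is a geometric mean of per-model speedup ratios. A sketch of that aggregation (the ratios in the example are made up for illustration, not the measured set):

```python
import math

def geomean(ratios):
    """Geometric mean of per-model speedup ratios (new/old)."""
    assert ratios and all(r > 0 for r in ratios)
    # Summing logs is numerically safer than multiplying many ratios together.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative only: one ~6% regression averaged against mostly-neutral models.
print(round(geomean([0.94, 1.00, 1.01, 0.99, 1.00]), 3))  # prints 0.988
```

The geometric mean is the right aggregate here because speedups are multiplicative: a 2x gain and a 0.5x loss should cancel out to 1.0x, which an arithmetic mean would not give.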
Thanks @jgong5, yes, we did it and verified the regression.
@chuanqi129 do you see any perf gains as a result of the update? If there are no gains and only regressions, then it's a no-brainer to revert. If, say, the geomean speedup is up by 12% but one test is down by 8%, then it is probably an acceptable tradeoff.
Hi @malfet, yes, we do see performance gains for some models, for example
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues #115346, #120211 and #120406 and those listed in PR #112700. Issue #115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).

1. A performance regression with a 5.8% perf drop on `pytorch_stargan-train` (see pytorch/benchmark#2076 (comment)). Validation results with this patch: latency increased by 0.60%.

```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}

oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see #115346 (comment)). Validation results with this patch: latency reduced by 3.23%.

```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1 (inductor speedup over eager mode): 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4 (inductor speedup over eager mode): 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see #115346 (comment)). Validation results with this patch: latency reduced by 0.85%.

```
Tested on an AWS spr metal instance

oneDNN v3.1.1 (inductor speedup over eager mode): 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4 (inductor speedup over eager mode): 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.

- #120211
- #120406
- #120547

-----

Below are detailed data of the torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.

I. *torchbench CPU userbenchmark test*

Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*

i. FP32 static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii. FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv. AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
Hi @xuzhao9, the PR pytorch/pytorch#120767 has landed, which fixes this issue. Could we close this issue after verification? Thanks!
@chuanqi129 Thanks for the update. I can confirm that the issue has been fixed (pytorch_stargan CPU latency has decreased to ~0.86 ms).
TorchBench CI has detected a performance signal.
Base PyTorch version: 2.2.0.dev20231204+cu118
Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e
Affected PyTorch version: 2.2.0.dev20231205+cu118
Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74
Affected Tests:
cc @xuzhao9
Result json:
Bisection workflow link: https://github.com/pytorch/benchmark/actions/runs/7117987075