
V2 Performance Signal Detected by TorchBench CI on '2.2.0.dev20231205+cu118' #2076

Closed
github-actions bot opened this issue Dec 6, 2023 · 13 comments

@github-actions

github-actions bot commented Dec 6, 2023

TorchBench CI has detected a performance signal.

Base PyTorch version: 2.2.0.dev20231204+cu118

Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e

Affected PyTorch version: 2.2.0.dev20231205+cu118

Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74

Affected Tests:

  • test_train[pytorch_stargan-cpu-eager]: -8.48571%
  • test_eval[pytorch_stargan-cpu-eager]: +9.71147%
  • test_train[timm_resnest-cpu-eager]: +7.15440%

cc @xuzhao9

Result json:

{
  "start": "3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e",
  "end": "7843df60e41f856edb148bbcbb5b9aee8292db74",
  "threshold": 7,
  "timeout": 120,
  "torchbench_branch": "v2.0",
  "result": [
    {
      "commit1": "a70c85ce90c",
      "commit1_time": "2023-12-04 19:08:36 +0000",
      "commit1_digest": {
        "test_train[pytorch_stargan-cpu-eager]": 0.924902675463818
      },
      "commit2": "62df4f34283",
      "commit2_time": "2023-12-04 19:41:12 +0000",
      "commit2_digest": {
        "test_train[pytorch_stargan-cpu-eager]": 0.8612936370074749
      }
    }
  ]
}

Bisection workflow link: https://github.com/pytorch/benchmark/actions/runs/7117987075
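
As a rough illustration (not part of the TorchBench CI tooling), a result JSON like the one above can be post-processed to recover the per-test change between the two bracketing commits. The file name `result.json` is an assumption, and the digest values are treated as opaque metrics, so the sign of the change is reported without interpreting whether higher or lower is better.

```
import json

with open("result.json") as f:
    result = json.load(f)

for step in result["result"]:
    d1, d2 = step["commit1_digest"], step["commit2_digest"]
    # Only compare tests that have a digest value for both commits
    for test in sorted(d1.keys() & d2.keys()):
        change = (d2[test] - d1[test]) / d1[test] * 100.0
        print(f"{step['commit1']}..{step['commit2']} {test}: {change:+.2f}%")
```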

@xuzhao9
Contributor

xuzhao9 commented Dec 6, 2023

cc @atalman Looks like the oneDNN upgrade regresses this model: pytorch/pytorch@62df4f34283

@chuanqi129
Contributor

chuanqi129 commented Dec 8, 2023

Hi @xuzhao9, I have tried this model with the cpu userbenchmark using 4 cores on a C6i.16xlarge instance and can't reproduce this performance drop. Below is my test command. Could you please help double-check what is different between our test environments? What instance type do you use in the test? I can try it again.

# Preload Intel OpenMP and jemalloc from the active conda environment
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
# jemalloc tuning commonly used to reduce allocator-related run-to-run variance
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
# Bind OpenMP threads compactly to physical cores and let them sleep soon after parallel regions
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

# Base PyTorch commit: 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e

root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch                     2.2.0a0+git3fbfa8c
torch-fidelity            0.3.0
torch_geometric           2.4.0
torchaudio                2.1.1+db62484
torchdata                 0.7.0a0+11bb5b8
torchmetrics              1.2.0
torchtext                 0.16.0a0+b0ebddc
torchvision               0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher

Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
2023-12-08 04:49:07,596 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:49:07,604 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:49:07,604 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:49:07,604 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:49:07,604 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:49:07,604 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:49:07,604 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
 [Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044906/pytorch_stargan-train/metrics-29691.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
    },
    "metrics": {
        "latency": 431.629427
    }
}


# Affected PyTorch commit: 7843df60e41f856edb148bbcbb5b9aee8292db74

root@ip-172-31-36-253:/workspace/benchmark# pip list | grep ^torch
torch                     2.2.0a0+git7843df6
torch-fidelity            0.3.0
torch_geometric           2.4.0
torchaudio                2.1.1+db62484
torchdata                 0.7.0a0+11bb5b8
torchmetrics              1.2.0
torchtext                 0.16.0a0+b0ebddc
torchvision               0.17.0a0+c1e2095
root@ip-172-31-36-253:/workspace/benchmark# python run_benchmark.py cpu -m pytorch_stargan -t train --launcher            
Running benchmark: /opt/conda/bin/python -m torch.backends.xeon.run_cpu --throughput-mode /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
2023-12-08 04:47:03,967 - __main__ - WARNING - --throughput-mode is exclusive to --ninstances, --ncores-per-instance, --node-id and --use-logical-core. They won't take effect even they are set explicitly.
2023-12-08 04:47:03,976 - __main__ - INFO - Use JeMalloc memory allocator
2023-12-08 04:47:03,976 - __main__ - INFO - OMP_NUM_THREADS=32
2023-12-08 04:47:03,976 - __main__ - INFO - Using Intel OpenMP
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-12-08 04:47:03,976 - __main__ - INFO - KMP_BLOCKTIME=1
2023-12-08 04:47:03,976 - __main__ - INFO - LD_PRELOAD=/opt/conda/bin/..//lib/libiomp5.so:/opt/conda/bin/..//lib/libjemalloc.so
2023-12-08 04:47:03,976 - __main__ - INFO - numactl -C 0-31 -m 0 /opt/conda/bin/python -u /workspace/benchmark/userbenchmark/cpu/run_config.py -m pytorch_stargan -d cpu -t train --metrics latencies -o /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703
Running TorchBenchModelConfig(name='pytorch_stargan', test='train', device='cpu', batch_size=None, extra_args=[], extra_env=None) ...Start training...
/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
Start training...
 [Done]
root@ip-172-31-36-253:/workspace/benchmark# cat /workspace/benchmark/.userbenchmark/cpu/cpu-20231208044703/pytorch_stargan-train/metrics-29485.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "85aa3723749e0d06aa5fd34215b9b93529a60995"
    },
    "metrics": {
        "latency": 407.194572
    }
}

@xuzhao9
Contributor

xuzhao9 commented Dec 8, 2023

@chuanqi129 It could also be noise; I noticed that this model has larger variance than other models. I will double-check on our CI machine tomorrow and share the results.

Our CI machine is AWS g4dn.metal. We are using the following setup (example host commands are sketched below):

  1. Disable hyper-threading
  2. CPU core isolation on core 0-24
  3. Pin CPU frequency of all cores to 2.5 GHz (default frequency)
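
A possible way to apply similar settings on a Linux host, assuming root access, the cpupower utility, and core isolation via the isolcpus kernel parameter (illustrative only, not the exact CI scripts):

```
# 1. Disable hyper-threading (SMT) at runtime
echo off | sudo tee /sys/devices/system/cpu/smt/control

# 2. Isolate cores 0-24 from the scheduler (added to the kernel command line, then reboot)
#    GRUB_CMDLINE_LINUX="... isolcpus=0-24"

# 3. Pin all cores to 2.5 GHz
sudo cpupower frequency-set --min 2.5GHz --max 2.5GHz
```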

@chuanqi129
Contributor

Thanks @xuzhao9 for the quick response; I have updated the results in the comment above with 1 socket (32 cores). For CPU perf tests, in our practice, besides enabling core binding, jemalloc also helps stabilize performance on CPU devices with the settings below.

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

jemalloc can be installed with `conda install jemalloc`; you could try it as well.
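
For what it's worth, a small sketch of installing jemalloc into a conda environment and checking that the preload actually takes effect; the conda-forge channel and the library path are assumptions, and `stats_print` is a standard jemalloc option that dumps allocator statistics at exit.

```
conda install -y -c conda-forge jemalloc

# If the preload works, a "___ Begin jemalloc statistics ___" report is printed to stderr.
LD_PRELOAD="${CONDA_PREFIX}/lib/libjemalloc.so" \
MALLOC_CONF="stats_print:true" \
python -c "x = [0] * 1000000" 2>&1 | head -n 5
```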

@malfet
Contributor

malfet commented Dec 8, 2023

> export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
> export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

I think we should measure what users can observe on their systems (minus noise, which is why frequency scaling is a valid trick).
If we believe that the above-mentioned malloc configuration is important for all CPU workloads, we should integrate PyTorch with jemalloc and/or mention it in a getting-started guide.

@xuzhao9
Contributor

xuzhao9 commented Dec 8, 2023

@chuanqi129 To my understanding, the numbers you've shown actually confirm that commit 7843df60e41f856edb148bbcbb5b9aee8292db74 is indeed faster than 3fbfa8cd0a5cefadb3f116c5cd0d60e96ab8c99e on pytorch_stargan-train, as the latency decreases from 431.629427 to 407.194572 (~5.8%). So I still think it could be a valid signal. Our bisector tells us that this is because 62df4f34283, which is a revert of the oneDNN version upgrade, landed between 3fbfa8cd0 and 7843df6. In other words, reverting oneDNN from 3.3.2 back to 3.1.1 speeds up pytorch_stargan-train by around 5.8%.

What do you think?

@jgong5

jgong5 commented Dec 9, 2023

@chuanqi129 Can we look into the problem by comparing the performance profiles?

@chuanqi129
Contributor

chuanqi129 commented Dec 9, 2023

Thanks @xuzhao9 for correcting me; I had misunderstood the direction and assumed the base commit used oneDNN 3.1.1 and the affected commit oneDNN 3.3.2, while it is actually the reverse. I have double-checked in my environment and it is true: there is a ~6% performance drop caused by oneDNN conv backward data on 2 specific shapes, mb16_ic1024oc2048_ih4oh2kh4sh2dh0ph1_iw4ow2kw4sw2dw0pw1 and mb16_ic512oc1024_ih8oh4kh4sh2dh0ph1_iw8ow4kw4sw2dw0pw1. We have raised the issue with the oneDNN team, and they will follow up on it. We also measured FP32 eager training with TorchBench for all runnable models (including this one); the geomean ratio is 0.988x. In fact, we also ran many other inference tests with TorchBench and Inductor-related tests with the torchdynamo benchmark suites for the oneDNN upgrade PR. @xuzhao9 @malfet, how would you rate the priority of this issue? Can we waive it for the oneDNN 3.3.2 upgrade?
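
Incidentally, those two shape strings are in oneDNN's benchdnn problem-descriptor format, so the kernels can in principle be timed outside of PyTorch with something like the sketch below; flags differ across oneDNN versions, so treat this as an illustration rather than the exact command used.

```
# Hypothetical benchdnn invocation for the two conv backward-data problems above;
# benchdnn is built as part of oneDNN's test suite.
./benchdnn --conv --dir=BWD_D --mode=P \
    mb16_ic1024oc2048_ih4oh2kh4sh2dh0ph1_iw4ow2kw4sw2dw0pw1 \
    mb16_ic512oc1024_ih8oh4kh4sh2dh0ph1_iw8ow4kw4sw2dw0pw1
```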

@chuanqi129
Contributor

> @chuanqi129 Can we look into the problem by comparing the performance profiles?

Thanks @jgong5, yes, we did that and verified the regression.
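
For reference, a minimal, self-contained sketch of this kind of profile comparison with torch.profiler; the toy convolution below only mirrors one of the affected shapes and is not the actual pytorch_stargan benchmark. Run it once per PyTorch build and diff the resulting tables.

```
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Mirrors mb16_ic512oc1024_ih8oh4kh4sh2ph1 from the comment above (forward direction);
# its backward pass exercises the regressed oneDNN conv backward-data primitive.
model = nn.Conv2d(512, 1024, kernel_size=4, stride=2, padding=1)
x = torch.randn(16, 512, 8, 8, requires_grad=True)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x).sum().backward()

# aten::convolution_backward should dominate the table; compare this output across builds.
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=15))
```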

@malfet
Contributor

malfet commented Dec 12, 2023

@chuanqi129 do you see any perf gains as a result of the update? If there are no gains and only regressions, then it's a no-brainer to revert. If, say, the geomean speedup is up by 12% but one test is down by 8%, then it is probably an acceptable tradeoff.

@chuanqi129
Contributor

chuanqi129 commented Dec 13, 2023

Hi @malfet, yes, we do see performance gains for some models, for example doctr_det_predictor Inductor FP32 dynamic-shape inference with cpp wrapper; the issue is tracked in pytorch/pytorch#108324.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Mar 11, 2024
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues #115346, #120211 and #120406 and those listed in PR #112700.

Issue #115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see pytorch/benchmark#2076 (comment))
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}
oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see #115346 (comment))
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see #115346 (comment))
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- #120211
- #120406
- #120547

-----

Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I.  *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*
i.  FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii.  FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv.  AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
pianpwk pushed a commit to pytorch/pytorch that referenced this issue Mar 11, 2024
@chuanqi129
Contributor

Hi @xuzhao9, the PR pytorch/pytorch#120767 has landed, which should fix this issue. Could we close this issue after verification? Thanks!

@xuzhao9
Contributor

xuzhao9 commented Mar 15, 2024

@chuanqi129 Thanks for the update. I can confirm that the issue has been fixed (pytorch_stargan CPU latency has decreased to ~0.86 ms).

xuzhao9 closed this as completed Mar 15, 2024