
PyTorch 2.1.0 Performance regression from PyTorch 2.0.1 #117081

Closed
sirutBuasai opened this issue Jan 10, 2024 · 10 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: performance (Issues related to performance, either of kernel code or framework glue)
module: regression (It used to work, and now it doesn't)
triage review

Comments

@sirutBuasai

sirutBuasai commented Jan 10, 2024

🐛 Describe the bug

Hi, we have found a performance regression in PyTorch 2.1.0 relative to PyTorch 2.0.1 on the AWS g5.2xlarge instance type. Below are the results we observed from running an example training script. I have also run the script on PyTorch 2.1.2, which shows that the regression still exists.

PyTorch 2.0.1 + CUDA 11.8: average step time: 0.11788356158540853
PyTorch 2.1.0 + CUDA 11.8: average step time: 0.1284193184877639
PyTorch 2.1.0 + CUDA 12.1: average step time: 0.12725605948790183
PyTorch 2.1.2 + CUDA 11.8: average step time: 0.12841543558533886
PyTorch 2.1.2 + CUDA 12.1: average step time: 0.12724469848789585

We suspect that the regression may be related to the aten::fill_ kernel, which appears in the PyTorch 2.1.* trace files but does not exist in the PyTorch 2.0.1 traces. We observed a 10% performance drop for this training script, but our customer has reported a 30% performance drop.
I have attached the training script, trace files, conda environment template, as well as steps to reproduce the results. The gist is located here; the trace files can also be downloaded via pt2.1-regression.zip.

Steps to reproduce:

  1. Create the conda environment: conda env create -f pt201-cu118.yml and conda activate pt201-cu118
  2. Run the script: python train.py
  3. To inspect the trace files, locate them via the profiler_location variable declared in train.py (a rough sketch of this kind of profiling setup is shown below).
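For readers without access to the gist, below is a minimal, hypothetical sketch of what a profiled training loop like this typically looks like; the actual train.py may differ, and the toy model, sizes, and profiler_location value are illustrative only.

```python
import os
import time

import torch
from torch.profiler import ProfilerActivity, profile

profiler_location = "./traces"  # assumed name; stands in for the variable referenced in step 3
os.makedirs(profiler_location, exist_ok=True)

# Toy model and data standing in for the real workload.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

step_times = []
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # make the per-step wall-clock time meaningful
        step_times.append(time.perf_counter() - start)

prof.export_chrome_trace(os.path.join(profiler_location, "trace.json"))
print(f"average step time: {sum(step_times) / len(step_times)}")
```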

Versions

Collecting environment information...
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1051-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             8
On-line CPU(s) list:                0-7
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC 7R32
Stepping:                           0
CPU MHz:                            3269.558
BogoMIPS:                           5599.52
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          128 KiB
L1i cache:                          128 KiB
L2 cache:                           2 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.0
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] numpy                     1.26.3          py310hb13e2d6_0    conda-forge
[conda] pytorch                   2.1.0           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.0               py310_cu121    pytorch
[conda] torchtriton               2.1.0                     py310    pytorch
[conda] torchvision               0.16.0              py310_cu121    pytorch

cc @ezyang @gchanan @zou3519 @kadeng @ptrblck

@malfet malfet added module: performance, module: regression, module: cuda labels Jan 10, 2024
@bdhirsh bdhirsh added triaged, high priority and removed triaged labels Jan 11, 2024
@bdhirsh
Contributor

bdhirsh commented Jan 16, 2024

tentatively marking hi-pri for the regression

@chauhang
Contributor

@sirutBuasai Are you seeing the issue with the latest PT2.2 RC or the nightlies as well?

@bdhirsh
Contributor

bdhirsh commented Jan 17, 2024

@sirutBuasai are you using torch.use_deterministic_algorithms()? If so, you might want to set it to False if you care about the extra perf. @albanD points out this is likely due to #104995 (which is needed for determinism)
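For reference, a minimal sketch of the toggle being suggested; the flag is global, so setting it once near the top of the training script is enough.

```python
import torch

# Deterministic mode (on PyTorch 2.1+) also fills newly allocated memory,
# which is the extra aten::fill_ seen in the traces.
torch.use_deterministic_algorithms(True)

# Turning it off avoids the extra fill, at the cost of determinism guarantees.
torch.use_deterministic_algorithms(False)
```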

@huydhn huydhn added this to the 2.2.1 milestone Jan 17, 2024
@sirutBuasai
Author

@bdhirsh Can confirm that without torch.use_deterministic_algorithms(), there is no aten::fill_ kernel and no performance regression. Will relay this to the customer.

@sirutBuasai
Author

sirutBuasai commented Jan 17, 2024

@chauhang The regression still exists in the nightly build when using torch.use_deterministic_algorithms():

pytorch                                2.3.0.dev20240117  py3.10_cuda11.8_cudnn8.7.0_0  pytorch-nightly/linux-64        2GB
pytorch-cuda                                        11.8  h7e8668a_5                    pytorch-nightly/linux-64     Cached
pytorch-mutex                                        1.0  cuda                          pytorch-nightly/noarch       Cached
torchaudio                             2.2.0.dev20240117  py310_cu118                   pytorch-nightly/linux-64        6MB
torchtriton                             2.2.0+e28a256d71  py310                         pytorch-nightly/linux-64      186MB
torchvision                           0.18.0.dev20240117  py310_cu118                   pytorch-nightly/linux-64        9MB

Training output: average step time: 0.12783529300000188.
The aten::fill_ kernel is still observed in the trace.

@malfet
Contributor

malfet commented Jan 18, 2024

This does not sound like a regression but rather like feature work: torch.empty was non-deterministic even in deterministic mode, and now it is deterministic. But it would be worth re-evaluating the use of empty and empty_like in internal calls and replacing them with something lighter-weight when one knows the memory will be deterministically overwritten anyway.
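A small illustration of the behavior described above, based on the documented torch.use_deterministic_algorithms() semantics in 2.1+ (floating-point memory is filled with NaN when deterministic mode is on):

```python
import torch

# With deterministic mode on, freshly allocated memory is filled with a known
# value (NaN for floats), which shows up as aten::fill_ in the profiler trace.
torch.use_deterministic_algorithms(True)
t = torch.empty(4, device="cuda")
print(t)  # expected: all NaN

# With deterministic mode off, torch.empty leaves the memory uninitialized.
torch.use_deterministic_algorithms(False)
u = torch.empty(4, device="cuda")
print(u)  # arbitrary, uninitialized values
```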

@ezyang
Contributor

ezyang commented Jan 18, 2024

Well, the point of filling empty is that it can be directly used by a user, and in that case we don't necessarily know whether they will properly initialize it. But I think having a "trust me" mode for "I promise not to directly use uninitialized memory" is very reasonable.

@sirutBuasai
Author

Thank you for your responses. We've root-caused the issue and notified the customers. Closing the issue.

@albanD
Collaborator

albanD commented Jan 23, 2024

@ezyang FYI this mode to not fill uninitialized memory was already added in #111377
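For reference, a minimal sketch of that escape hatch, assuming the torch.utils.deterministic.fill_uninitialized_memory flag added by #111377 (available in builds that include that change):

```python
import torch

# Keep deterministic algorithms on, but opt out of filling uninitialized memory.
# This skips the aten::fill_ on torch.empty, under the promise that the code
# never reads uninitialized values.
torch.use_deterministic_algorithms(True)
torch.utils.deterministic.fill_uninitialized_memory = False
```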

@malfet
Contributor

malfet commented Jan 23, 2024

Hmm, I wonder if the next step would be to update the https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html doc to advertise this option and advise that initializing memory is expensive. (We should also probably use the faster at::empty whenever one creates a new tensor that is guaranteed to be initialized.)
