
PyTorch 2.1.0 Performance regression from PyTorch 2.0.1 #117081

Closed
sirutBuasai opened this issue Jan 10, 2024 · 10 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: performance (Issues related to performance, either of kernel code or framework glue)
module: regression (It used to work, and now it doesn't)
triage review

Comments

@sirutBuasai

sirutBuasai commented Jan 10, 2024

🐛 Describe the bug

Hi, we have found a performance regression in PyTorch 2.1.0 relative to PyTorch 2.0.1 on the AWS g5.2xlarge instance type. Below are the results we observed from running an example training script. I have also run the script on PyTorch 2.1.2, which shows that the regression still exists.

PyTorch 2.0.1 + CUDA 11.8: average step time: 0.11788356158540853
PyTorch 2.1.0 + CUDA 11.8: average step time: 0.1284193184877639
PyTorch 2.1.0 + CUDA 12.1: average step time: 0.12725605948790183
PyTorch 2.1.2 + CUDA 11.8: average step time: 0.12841543558533886
PyTorch 2.1.2 + CUDA 12.1: average step time: 0.12724469848789585

We suspect that the regression may be related to the aten::fill_ kernel, which appears in the PyTorch 2.1.* trace files but does not exist in the PyTorch 2.0.1 traces. We observed a 10% performance drop for this training script, but our customer has reported a 30% performance drop.
I have attached the training script, trace files, conda environment template, as well as steps to reproduce the results. The gist is located here; the trace files can also be downloaded via pt2.1-regression.zip.

Steps to reproduce:

  1. Create the conda environment: conda env create -f pt201-cu118.yml and conda activate pt201-cu118
  2. Run the script: python train.py
  3. To inspect the trace files, locate them via the profiler_location variable declared in train.py (a rough sketch of this kind of profiling setup is shown below).
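For readers without access to the gist, below is a minimal, hypothetical sketch of what a profiled training loop like this typically looks like; the actual train.py may differ, and the toy model, sizes, and profiler_location value are illustrative only.

```python
import os
import time

import torch
from torch.profiler import ProfilerActivity, profile

profiler_location = "./traces"  # assumed name; stands in for the variable referenced in step 3
os.makedirs(profiler_location, exist_ok=True)

# Toy model and data standing in for the real workload.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

step_times = []
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # make the per-step wall-clock time meaningful
        step_times.append(time.perf_counter() - start)

prof.export_chrome_trace(os.path.join(profiler_location, "trace.json"))
print(f"average step time: {sum(step_times) / len(step_times)}")
```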

Versions

Collecting environment information...
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1051-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             8
On-line CPU(s) list:                0-7
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC 7R32
Stepping:                           0
CPU MHz:                            3269.558
BogoMIPS:                           5599.52
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          128 KiB
L1i cache:                          128 KiB
L2 cache:                           2 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.0
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] numpy                     1.26.3          py310hb13e2d6_0    conda-forge
[conda] pytorch                   2.1.0           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.0               py310_cu121    pytorch
[conda] torchtriton               2.1.0                     py310    pytorch
[conda] torchvision               0.16.0              py310_cu121    pytorch

cc @ezyang @gchanan @zou3519 @kadeng @ptrblck

@malfet malfet added module: performance, module: regression, module: cuda labels Jan 10, 2024
@bdhirsh bdhirsh added triaged, high priority and removed triaged labels Jan 11, 2024
@bdhirsh
Contributor

bdhirsh commented Jan 16, 2024

tentatively marking hi-pri for the regression

@chauhang
Contributor

@sirutBuasai Are you seeing the issue with the latest PT2.2 RC or the nightlies as well?

@bdhirsh
Contributor

bdhirsh commented Jan 17, 2024

@sirutBuasai are you using torch.use_deterministic_algorithms()? If so, you might want to set it to False if you care about the extra perf. @albanD points out this is likely due to #104995 (which is needed for determinism)
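For reference, a minimal sketch of the toggle being suggested; the flag is global, so setting it once near the top of the training script is enough.

```python
import torch

# Deterministic mode (on PyTorch 2.1+) also fills newly allocated memory,
# which is the extra aten::fill_ seen in the traces.
torch.use_deterministic_algorithms(True)

# Turning it off avoids the extra fill, at the cost of determinism guarantees.
torch.use_deterministic_algorithms(False)
```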

@huydhn huydhn added this to the 2.2.1 milestone Jan 17, 2024
@sirutBuasai
Author

@bdhirsh Can confirm that without torch.use_deterministic_algorithms(), there is no aten::fill_ kernel and no performance regression. Will relay this to the customer.

@sirutBuasai
Author

sirutBuasai commented Jan 17, 2024

@chauhang The regression still exists in the nightly build when using torch.use_deterministic_algorithms():

pytorch                                2.3.0.dev20240117  py3.10_cuda11.8_cudnn8.7.0_0  pytorch-nightly/linux-64        2GB
pytorch-cuda                                        11.8  h7e8668a_5                    pytorch-nightly/linux-64     Cached
pytorch-mutex                                        1.0  cuda                          pytorch-nightly/noarch       Cached
torchaudio                             2.2.0.dev20240117  py310_cu118                   pytorch-nightly/linux-64        6MB
torchtriton                             2.2.0+e28a256d71  py310                         pytorch-nightly/linux-64      186MB
torchvision                           0.18.0.dev20240117  py310_cu118                   pytorch-nightly/linux-64        9MB

Training output: average step time: 0.12783529300000188.
The aten::fill_ kernel is still observed in the trace.

@malfet
Contributor

malfet commented Jan 18, 2024

This does not sound like a regression but rather like feature work: torch.empty was non-deterministic even in deterministic mode, and now it is deterministic. But it would be worth re-evaluating the use of empty and empty_like in internal calls and replacing them with something lighter-weight when one knows the memory will be deterministically overwritten anyway.
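A small illustration of the behavior described above, based on the documented torch.use_deterministic_algorithms() semantics in 2.1+ (floating-point memory is filled with NaN when deterministic mode is on):

```python
import torch

# With deterministic mode on, freshly allocated memory is filled with a known
# value (NaN for floats), which shows up as aten::fill_ in the profiler trace.
torch.use_deterministic_algorithms(True)
t = torch.empty(4, device="cuda")
print(t)  # expected: all NaN

# With deterministic mode off, torch.empty leaves the memory uninitialized.
torch.use_deterministic_algorithms(False)
u = torch.empty(4, device="cuda")
print(u)  # arbitrary, uninitialized values
```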

@ezyang
Contributor

ezyang commented Jan 18, 2024

Well, the point of filling empty is that it can be directly used by a user, and in that case we don't necessarily know whether they will properly initialize it. But I think having a "trust me" mode for "I promise not to directly use uninitialized memory" is very reasonable.

@sirutBuasai
Author

Thank you for your responses. We've root-caused the issue and notified the customers. Closing the issue.

@albanD
Collaborator

albanD commented Jan 23, 2024

@ezyang FYI this mode to not fill uninitialized memory was already added in #111377
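For reference, a minimal sketch of that escape hatch, assuming the torch.utils.deterministic.fill_uninitialized_memory flag added by #111377 (available in builds that include that change):

```python
import torch

# Keep deterministic algorithms on, but opt out of filling uninitialized memory.
# This skips the aten::fill_ on torch.empty, under the promise that the code
# never reads uninitialized values.
torch.use_deterministic_algorithms(True)
torch.utils.deterministic.fill_uninitialized_memory = False
```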

@malfet
Contributor

malfet commented Jan 23, 2024

Hmm, I wonder if the next step would be to update the https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html doc to advertise this option and advise that initializing memory is expensive. (We should also probably use the faster at::empty whenever one creates a new tensor that is guaranteed to be initialized.)
