
3x perf slow down in nightly build Torch 2.0.0.dev2023xxxx+cu118 #92288

Closed
aifartist opened this issue Jan 17, 2023 · 25 comments
Labels:
high priority
module: cuda (Related to torch.cuda, and CUDA support in general)
module: cudnn (Related to torch.backends.cudnn, and cuDNN support)
module: performance (Issues related to performance, either of kernel code or framework glue)
module: windows (Windows support for PyTorch)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Milestone: 2.0.0

aifartist commented Jan 17, 2023

🐛 Describe the bug

This GitHub repo doesn't have a Discussions tab like AUTOMATIC1111 has, so I'll use this. Forgive me if this is the wrong place.

Stable Diffusion (AUTOMATIC1111) image generation using typical defaults: 20 steps, euler_a, simple prompts, SD 2.1 512 model.
Using the Linux nightly torch 2.0 on my 4090 only gives about 11 to 13 it/s.
With the Windows nightly torch 2.0 build, a 4090 gives about 35 to 38 it/s.
I have multiple confirmations of this from other folks.

However, if you build PyTorch locally on Linux you get about a 3x perf increase, up to the same perf seen on Windows.
Today an ex-CTO of a cloud company with GPU resources contacted me to try this on one of his cloud servers, which he loaned me. It also sped up his 4090, and he will test an A4000 GPU tomorrow.

As a suggestion, you might check whether architecture sm_89 is one of the selected architectures listed in the Linux build output.
If there were a simple .py inference perf test I'd be willing to run it as a repro, but my repro is the entirety of SD AUTOMATIC1111; I have no simple standalone PyTorch perf test. Let me know how I can help. Good night.
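For illustration only, a standalone it/s measurement in the spirit of what I have in mind could look like the rough sketch below (the model and tensor shapes are placeholders, not the actual A1111/SD 2.1 workload):

```python
# Rough sketch of a standalone it/s measurement (placeholder conv stack, not the real SD UNet).
import time
import torch

device = "cuda"
model = torch.nn.Sequential(
    *[torch.nn.Conv2d(64, 64, kernel_size=3, padding=1) for _ in range(20)]
).half().to(device)
x = torch.randn(2, 64, 64, 64, dtype=torch.half, device=device)

with torch.inference_mode():
    for _ in range(5):                      # warmup
        model(x)
    torch.cuda.synchronize()

    iters = 100
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

print(f"{iters / elapsed:.2f} it/s")
```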

Versions

CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.17.0-1019-oem-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA Graphics Device
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.3
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.7.6
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230113+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==0.11.0
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.0.dev20230116+cu118
[conda] Could not collect
~

cc @ezyang @gchanan @zou3519 @ngimel @peterjc123 @mszhanyi @skyline75489 @nbcsm @csarofeen @ptrblck @xwang233 @seemethere @malfet

bdhirsh added the high priority and module: performance labels Jan 18, 2023
ezyang added the module: binaries label Jan 19, 2023
malfet (Contributor) commented Jan 19, 2023

There are no official sm_89 enabled builds of PyTorch yet...

cc: @ptrblck

malfet added the module: cuda label Jan 19, 2023
ptrblck (Collaborator) commented Jan 19, 2023

sm_89 is compatible with sm_80/sm_86 and should not give any benefits, just an increase in binary size.
A code snippet to reproduce the issue would be needed, as we would have to profile it to see where the bottleneck in the nightly release is coming from.

aifartist (Author) commented:
@ptrblck sm_86 is Ampere and sm_89 is Ada. It might be "compatible", but when using the newer nvcc to compile specifically for sm_89, I'd be surprised if it didn't leverage any Ada-specific features.
As for a repro... I know more than most the value of such, as my specialty for 40 years has been finding the root cause of bugs others couldn't in SQL databases and other software. However, I'm not yet a torch/CUDA programmer; I'm learning now that I'm retired. My current repro is using the entirety of AUTOMATIC1111 to generate an image. I'm an expert on CPU profiling on Linux, but I know nothing (yet) about profiling what is happening on a GPU. It may be months before I learn the NVIDIA profiler well enough to debug this myself.

There is definitely a problem. Today I finally figured out how to repeatedly get local builds to work and confirmed again that I get 13.9 it/s with the nightly Torch 2.0 cu118 and 39.5 it/s with my local Torch 2.0 built with CUDA 12.0. I still need to check a local Torch 2.0 build against a local CU118, but haven't gotten to it yet.

If anyone on the PyTorch team uses SD on Linux, you should see a large perf difference between Windows and Linux for the same GPU. I don't like sloppy bugs being reported either, but this isn't just a 30% hit but a 300% hit in perf. I am fully capable of instrumenting the entire flow of inference in A1111 to find someplace where some function with some data is much slower in the nightly build case, and then turning that into a short test case. But that would be tedious.

Tomorrow my priority is to document the many gotchas in trying to do your own build of PyTorch and install it. There are probably a number of cloud providers of SD image generation on Linux that don't realize they can get a big perf boost by building Torch 2 themselves. I helped one earlier this week do just that. He was quite happy.

Do you have any .py files which you use to benchmark? I'd be happy to test them in both the slow and fast environments on my machine.

ptrblck (Collaborator) commented Jan 19, 2023

sm_86 is Ampere and sm_89 is Ada. It might be "compatible" but when using the newer nvcc to compile specifically for sm_89 I'd be surprised it didn't leverage any Ada specific features.

You might be surprised, but indeed none of our CUDA math libs ship sm_89-specific kernels, and you can double-check it by extracting the kernels via cuobjdump from e.g. cuBLAS or cuDNN.
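For illustration, a check along those lines could look like the sketch below (assuming cuobjdump is on PATH and the wheel keeps its bundled libraries under torch/lib; the library picked here is just an example target):

```python
# Sketch: list the SM architectures embedded in one of the bundled libraries.
# Assumes cuobjdump is on PATH; the chosen library is just an example.
import re
import subprocess
from pathlib import Path

import torch

lib_dir = Path(torch.__file__).parent / "lib"
lib = next(lib_dir.glob("libcudnn_cnn_infer.so*"), None)
if lib is not None:
    out = subprocess.run(
        ["cuobjdump", "--list-elf", str(lib)],
        capture_output=True, text=True, check=False,
    ).stdout
    print(lib.name, sorted(set(re.findall(r"sm_\d+", out))))
```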

I know more than most the value of such as my specialty for 40 years has been finding the root cause of bugs others couldn't in SQL databases and other software.

That sounds amazing and I'm sure we can share a lot of debugging stories, you in the SQL world and I in CUDA, but this issue isn't the right place to do so ;)

I don't like sloppy bugs being reported either but this isn't just a 30% hit but a 300% hit in perf.

This is exactly why I would like to narrow down the root cause and debug it. However, the compute capability support for sm_89 should not be related at all.
If I understand your issue correctly you are seeing a difference between:

  • Linux nightly binary + cu118
  • Windows nightly binary + cu118
  • Linux source build using CUDA 12.0 + unknown cuDNN

Both nightly binaries would use the same CUDA libs (compiler, cuBLAS, cuDNN), so I'm unsure where the difference between Linux and Windows is coming from. Since I'm not deeply familiar with Windows, your CUDA 12.0 setup on Linux would be interesting to see, as well as a code snippet you are running to produce the it/s output.

Do you have any .py files which you use to benchmark? I'd be happy to test them in both the slow and fast environments on my machine.

I don't think any particular script helps here, but I would recommend profiling the workload, e.g. via Nsight Systems as described here.
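As a quicker first pass than a full Nsight Systems trace, something like the torch.profiler sketch below (wrapping whatever call is actually slow; the model here is only a placeholder) can already show which CUDA kernels dominate:

```python
# Sketch: quick kernel-level breakdown with torch.profiler (placeholder workload).
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).half().cuda()
x = torch.randn(8, 64, 128, 128, dtype=torch.half, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.inference_mode():
        for _ in range(20):
            model(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```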

aifartist (Author) commented Jan 19, 2023

sm_86 is Ampere and sm_89 is Ada. It might be "compatible" but when using the newer nvcc to compile specifically for sm_89 I'd be surprised it didn't leverage any Ada specific features.

You might be surprised, but indeed none of our CUDA Math libs ship sm_89-specific kernels and you can double check it via extracting the kernels via cuobjdump of e.g. cublas or cuDNN.

I've dumped the internals of object files/executables before, but due to my lack of familiarity with this new technology: what command-line options should I use, which .so file should I run it on, and what are we looking for? I can check both the nightly build stuff and my local build.

I don't like sloppy bugs being reported either but this isn't just a 30% hit but a 300% hit in perf.

This is exactly why I would like to narrow down the root cause and debug it. However, the compute capability support for sm_89 should not be related at all. If I understand your issue correctly you are seeing a difference between:

  • Linux nightly binary + cu118
  • Windows nightly binary + cu118
  • Linux source build using CUDA 12.0 + unknown cuDNN

Yes, this is correct. I think I have cuDNN 8.7

Both nightly binaries would use the same CUDA libs (compiler, cublas, cuDNN) so unsure where the difference would be coming from between Linux and Windows. Since I'm not deeply familiar with Windows, your CUDA 12.0 setup on Linux would be interesting to see as well as a code snippet you are running to see the it/s output.

I don't run a "code snippet"; I run an application consisting of tens of thousands of code snippets. I might try to narrow this down myself. For instance, when trying to figure out why GPU memory usage went from 5 GB at 16 images per batch to 18 GB at a batch size of 17 images, I found it deep down in the application when it called conv2d(). It is a known problem which the torch/CUDA community should have fixed by now but hasn't. I just keep my batch size under 17 images, and in another case A1111 has to work around a variation of this problem by doing a slow one-image-at-a-time decode_first_stage() pass; otherwise users with smaller GPUs OOM.
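For what it's worth, that kind of memory jump can be measured in isolation with something like the sketch below (the channel counts and spatial sizes are placeholders, not the actual A1111 tensors):

```python
# Sketch: compare peak GPU memory for a conv2d at two batch sizes (placeholder shapes).
import torch

conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).half().cuda()

for batch in (16, 17):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch, 256, 128, 128, dtype=torch.half, device="cuda")
    with torch.inference_mode():
        conv(x)
    torch.cuda.synchronize()
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"batch={batch}: peak {peak_gib:.2f} GiB")
```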

I just discovered on reddit r/StableDiffusion that a lot of people who also see slow Linux perf want to try my workaround.
Priority one is to write up some clear instructions for building PyTorch from source and to describe some things that I learned can go wrong and how to fix them. Then I'll get back to testing and perhaps debugging this myself.

malfet (Contributor) commented Jan 19, 2023

@aifartist can you please run python -c "import torch;print(torch.__config__.show(), torch.cuda.get_device_properties(0))" on a Windows and a Linux machine (using both the local and the nightly/release build) and post the results here?
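Expanded slightly, the same kind of check can also report the cuDNN version actually loaded and the architecture list baked into the wheel, e.g. (a small sketch, not an official diagnostic):

```python
# Sketch: one-shot environment report, expanding on the one-liner above.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN loaded at runtime:", torch.backends.cudnn.version())
print("arch list in this build:", torch.cuda.get_arch_list())
print(torch.cuda.get_device_properties(0))
```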

aifartist (Author) commented:
@aifartist can you please run python -c "import torch;print(torch.__config__.show(), torch.cuda.get_device_properties(0))" on Windows and Linux machine (using both local and nightly/release build) and post results here?

Windows would be difficult for me to do right now without losing a few hours, and the drivers there are old. But that isn't where the problem lies. I do have two side-by-side setups: the nightly-build torch 2, which is slow, and my locally built torch 2, which is 3x faster. I wasn't aware of the command above, but it looks helpful in figuring out the difference. I provide the slow/fast results below. If you see something obvious, let me know, because what I am going to do is change my local build to match the nightly and see if I can make my torch 2 slower; that should tell us what the problem is:
cuDNN 8.7 -> cuDNN 8.5
CUDA 12 -> CUDA 11.8
sm_89 -> sm_86
I'm putting the reddit community on hold (they're waiting on my writeup of torch 2 build instructions) to try to debug this, given the info your command provides.

Here is the nightly build venv:

PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.5  (built against CUDA 11.7)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 
 _CudaDeviceProperties(name='NVIDIA GeForce RTX 4090', major=8, minor=9, total_memory=24217MB, multi_processor_count=128)

Here is my local build env:

PyTorch built with:
  - GCC 11.3
  - C++ Version: 201703
  - Intel(R) MKL-DNN v2.7.2 (Git Hash fbec3e25a559ee252022ae066817b204e106a6ba)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.0
  - NVCC architecture flags: -gencode;arch=compute_89,code=sm_89
  - CuDNN 8.7  (built against CUDA 11.8)
    - Built with CuDNN 8.2.4
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=12.0, CUDNN_VERSION=8.2.4, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=1, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 
 _CudaDeviceProperties(name='NVIDIA GeForce RTX 4090', major=8, minor=9, total_memory=24217MB, multi_processor_count=128)

redredbeard commented:
I also have this problem, but I'm almost certain it's because of the version of cuDNN PyTorch was built with, especially if you have the newer Lovelace GPUs. From what I've read, not all the tensor cores are being used with the older version.

My Linux setup was compiled with the latest cuDNN, while the PyTorch nightly binary is on the old version.

If I can ever figure out how to build PyTorch on Windows, I would be able to test by compiling with the latest cuDNN.

malfet (Contributor) commented Jan 20, 2023

Hmm, I have a wild theory why it might behave like that on Windows but not on Linux: Windows searches for DLLs using the PATH environment variable, while Linux does not, so Linux will use the cuDNN it was compiled with.

I don't have a 4090, so I can't test it, but if this theory is correct, then copying libcudnn* into the directory printed by python -c "from pathlib import Path;import torch;print(Path(torch.__file__).parent / 'lib')" would result in the observed perf boost.

Anyone willing to try that?

I.e. pip install nvidia-cudnn-cu11==8.7.0.84 and then cp ~/miniconda3/envs/test-py310/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn* ~/miniconda3/envs/test-py310/lib/python3.10/site-packages/torch/lib resulted in a PyTorch installation using the newer cuDNN 8.7.
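The same copy step, without hard-coding a conda path, could be scripted roughly like this (a sketch that assumes the nvidia-cudnn-cu11 wheel is already installed and puts its libraries under site-packages/nvidia/cudnn/lib, as in the cp command above):

```python
# Sketch: copy the pip-installed cuDNN libraries over the ones bundled with torch.
# Run `pip install nvidia-cudnn-cu11==8.7.0.84` first; paths are derived, not hard-coded.
import shutil
import sysconfig
from pathlib import Path

import torch

site_packages = Path(sysconfig.get_paths()["purelib"])
src = site_packages / "nvidia" / "cudnn" / "lib"
dst = Path(torch.__file__).parent / "lib"

for lib in src.glob("libcudnn*"):
    shutil.copy2(lib, dst / lib.name)
    print(f"copied {lib.name} -> {dst}")
```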

aifartist (Author) commented:
@malfet @redredbeard
I just got back from dinner. It so happens I was also going to focus on cuDNN first, because when I finally got a clean build of PyTorch yesterday and installed it, my SD app was 5 times slower, at about 2.5 it/s instead of the near 14 it/s it was before. I forget why, but I tried updating to cuDNN 8.7 and then, boom, I got the 39 it/s.

I'll try this now.

aifartist (Author) commented Jan 20, 2023

BINGO! @ptrblck @malfet @redredbeard
I used the Linux pmap command to see where both the slow (nightly torch 2) and fast (local torch 2) processes were getting their libcudnn.so from.
The slow nightly build gets it from venv/lib/python3.10/site-packages/torch/lib/libcudnn.so.8, i.e. the nightly build provides its own copy.
All I did was remove that file, and then it used the one in /usr/lib/x86_64-linux-gnu, which is the 8.7.0 version I installed. The performance difference, without doing anything but deleting the nightly build's cuDNN, is:
Before: 12.59 it/s
After: 30.66 it/s
Yes, it isn't the 39.7 it/s, but there are still three differences:
my pure setup has an sm_89-built executable, CUDA 12.0, and xformers. I also build the code with -march=native to leverage my Raptor Lake and got at least 1% faster.
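The same check can also be done from inside Python, without pmap, by reading /proc/self/maps once cuDNN has been loaded (Linux only; a rough sketch):

```python
# Sketch: show which libcudnn.so the running process actually mapped (Linux only).
import torch

# Force a cuDNN-backed op so the library gets loaded.
conv = torch.nn.Conv2d(8, 8, kernel_size=3).cuda()
conv(torch.randn(1, 8, 32, 32, device="cuda"))
torch.cuda.synchronize()

print("cuDNN version reported by torch:", torch.backends.cudnn.version())
with open("/proc/self/maps") as maps:
    paths = {line.split()[-1] for line in maps if "libcudnn" in line}
for path in sorted(paths):
    print(path)
```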

aifartist (Author) commented:
IMO, you need to update the libcudnn.so bundled with PyTorch. If you are providing a cu118 version, then you should add sm_89 and sm_90 to the list of architectures you are building. Torch 2.0 should be state of the art, and at least Ada (sm_89) has been out for quite a while now.

FYI, I could file this as a separate bug, but I'll mention it here: if you build using CUDA 12.0 and can't identify the GPU, you do a generic build for multiple architectures. The problem is that nvcc 12.0 no longer supports sm_35 and will fail. You need to trim it from the list if the CUDA version is 12. Perhaps sm_50 also, but I didn't check that.
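One way to sidestep that when building from source is to set TORCH_CUDA_ARCH_LIST explicitly to architectures the local nvcc still supports, e.g. with a helper like the sketch below (the candidate list is just an example, and it assumes nvcc is on PATH):

```python
# Sketch: derive a TORCH_CUDA_ARCH_LIST that only contains architectures the local
# nvcc still supports (nvcc 12.x has dropped sm_35, for example). Export the printed
# value in your shell before running the PyTorch build. The candidate list is an example.
import subprocess

candidates = ["5.0", "6.0", "7.0", "7.5", "8.0", "8.6", "8.9", "9.0"]

out = subprocess.run(
    ["nvcc", "--list-gpu-arch"], capture_output=True, text=True, check=True
).stdout
supported = {f"{a[8]}.{a[9:]}" for a in out.split() if a.startswith("compute_")}

print("TORCH_CUDA_ARCH_LIST=" + ";".join(cc for cc in candidates if cc in supported))
```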

aifartist (Author) commented Jan 20, 2023

It only gets more bizarre.
Forget about the nightly build or Torch 2.anything.
I have an env with the default PyTorch that gets fetched, 1.12.1+cu113, and if I remove the libcudnn that comes with it, then it sees cuDNN 8.7 and gets 39+ it/s.
In other words, I see no real advantage to Torch 2.0 for inference except to the degree that it includes the best cuDNN.

aifartist (Author) commented:
On the Windows question: I am getting feedback from some people on reddit r/StableDiffusion that even on Windows some see only 13+ it/s. I don't know how the DLL search path works there or where to install the new cuDNN libraries to fix the problem.

redredbeard commented Jan 20, 2023

@malfet

I.e. pip install nvidia-cudnn-cu11==8.7.0.84 and then cp ~/miniconda3/envs/test-py310/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn* ~/miniconda3/envs/test-py310/lib/python3.10/site-packages/torch/lib resulted in PyTorch installation using newer cudnn-8.7

This command doesn't work on Windows because pip doesn't have a package with the latest cuDNN for Windows.

So what I ended up doing is just overwriting the files in "~Anaconda3\envs\pytorch-dev\Lib\site-packages\torch\lib" with the files in "cudnn-windows-x86_64-8.7.0.84_cuda11-archive\bin" from the zip I downloaded directly from Nvidia's website here: https://developer.nvidia.com/rdp/cudnn-download (8.7.0 - Local Installer (Windows))

This now gives exactly the same performance I was getting on Linux. It seems cuDNN 8.7.0 is basically a requirement for the 40-series cards to get any decent performance. Before doing this, the performance was about what I was getting with my 3080.

The nightly should really be moved up to compile against and be distributed with this version of cuDNN, considering the 3x performance gain from doing so.

For reference, this was with an NVIDIA 4090, and my workload is video enhancement, not Stable Diffusion.

aifartist (Author) commented:
I have heard it also helped some people using a 30xx, but it wasn't 3x. Also, I see no reason not to add sm_87, sm_89, and sm_90 to the architecture list for the cu118 build; CUDA 11.8 nvcc supports these.

ptrblck (Collaborator) commented Jan 20, 2023

Thanks for the quick verification using the new cuDNN version!
sm_90 support was just merged after review and testing in this PR ~3h ago.
sm_89 will not be added, as already described, since it would yield no benefit besides a binary size increase.
I have another PR open for the cuDNN update here, which currently fails to build due to unrelated changes.

aifartist (Author) commented:
I have another PR open for the cuDNN update pytorch/builder#1271, which currently fails to build due to unrelated changes.

You will have a lot of happy people now. Many people that I have been telling about this are having a hard time following the process to manually replace their libcudnn with the newer one. If you bundle it with PyTorch, they should see the benefits immediately.

I am a bit confused about the sm_89 thing.
The Ada architecture has features the older ones do not; I presume they can help performance if used.
If you gencode for arch=compute_89,code=sm_89, nvcc can generate code for things on Ada that won't even work on older models. I would assume this is similar to using -march=native with GCC to get Intel Raptor Lake optimizations.
Why have an sm_89 set of optimizations in nvcc if they don't help performance?

ezyang added the module: cudnn and module: windows labels, and removed the module: binaries and triage review labels, Jan 23, 2023
ezyang added this to the 2.0.0 milestone Jan 23, 2023
mikaylagawarecki added the triaged label Jan 25, 2023
atalman (Contributor) commented Feb 6, 2023

@aifartist Could you please update the performance figures based on the latest nightly builds, with CUDA 11.8 and cuDNN 8.7.0.84? We want to know if this issue is resolved by this PR: pytorch/builder#1271

If it's not resolved, could you please post the environment you are comparing against.

cc @ezyang @ptrblck @malfet

ptrblck (Collaborator) commented Feb 6, 2023

Yes, this issue should be resolved via: pytorch/builder#1271

aifartist (Author) commented Feb 6, 2023

@aifartist Could you please update the performance figures based on the latest nightly builds, with CUDA 11.8 and cuDNN 8.7.0.84? We want to know if this issue is resolved by this PR: pytorch/builder#1271

If it's not resolved, could you please post the environment you are comparing against.

cc @ezyang @ptrblck @malfet

I'd be happy to test this. I'll run a comparison between the current build and one from ?? days ago. I'm having a problem figuring out when the final fix for this was merged into the nightly, and finding an earlier version which doesn't segfault on me. But the current nightly fix works, I just tested it, and it is fast. I still want to report both before and after to be complete.

aifartist (Author) commented:
cc @ezyang @ptrblck @malfet

I'd be happy to test this. I'll run a comparison between the current build and one from ?? days ago. I'm having a problem figuring out when the final fix for this was merged into the nightly, and finding an earlier version which doesn't segfault on me. But the current nightly fix works, I just tested it, and it is fast. I still want to report both before and after to be complete.

BEFORE: 12.83 it/s
AFTER: 39.25 it/s (current nightly build)

I don't see a real need to test what was there before. For some reason, installing several different older 2023mmdd versions of PyTorch results in SEGVs when I run AUTOMATIC1111, so I can't run the versions before the fix. However, I can copy my cuDNN v8.5 libraries over the v8.7 ones you now provide.

Thanks for the fix. I'll tell folks they can now use the nightly build if they want the perf improvement.

redredbeard commented:
I can also confirm the latest nightly is working properly. I do want to note that there is a significant performance difference between Linux and Windows with the same PyTorch build, but I believe that falls outside the scope of this report. One thing I did notice is that CPU usage is significantly higher on Windows than on Linux. I believe this might already be tracked under a different report.

aifartist (Author) commented Feb 7, 2023

I can also confirm the latest nightly is working properly. I do want to note that there is a significant performance difference between Linux and Windows with the same PyTorch build, but I believe that falls outside the scope of this report. One thing I did notice is that CPU usage is significantly higher on Windows than on Linux. I believe this might already be tracked under a different report.

I've had a number of people tell me that on Windows they haven't quite gotten the same numbers I get on Linux. Often it is just the slower CPUs they have, which I've commented on elsewhere. A 4090 with this fix needs something like a 5.8 GHz processor to get the most from it in some cases.
On Linux I notice that this GPU-hammering app incurs ZERO system CPU usage, indicating the user app has direct access to the hardware without system-call overhead. On Windows, one person told me they saw about 13% kernel time. Either this is inherent to Windows, or perhaps a setup where the 4090 shares both AI and graphics duties runs in a different mode. My setup uses my Intel integrated GPU for graphics, leaving the 4090 dedicated to AI/SD. I'd have to boot my Windows setup to debug, but I haven't had the time.

Some GitHub repos like https://github.com/AUTOMATIC1111/stable-diffusion-webui have a Discussions area; PyTorch does not. Where can devs communicate or report an issue? For example, building PyTorch with TensorRT doesn't appear to work, although it might with a very ancient v7.x version. Yesterday I figured out how to get it built with TensorRT v8.5.3. I see some issues when using TensorRT and @torch.compile. I guess I'll report an "issue" for the time being.

atalman (Contributor) commented Feb 7, 2023

Thank you @aifartist for confirming.
