3x perf slow down in nightly build Torch 2.0.0.dev2023xxxx+cu118 #92288
Comments
There are no official sm_89 enabled builds of PyTorch yet... cc: @ptrblck
@ptrblck sm_86 is Ampere and sm_89 is Ada. It might be "compatible", but when using the newer nvcc to compile specifically for sm_89 I'd be surprised if it didn't leverage any Ada-specific features. This is definitely a problem. Today I finally figured out how to get local builds to work repeatably and confirmed again that I get 13.9 it/s with the nightly Torch 2.0 cu118 and 39.5 it/s with my local Torch 2.0 built with CUDA 12.0. I still need to check a local Torch 2.0 build against a local cu118, but haven't gotten to it yet.

If anyone on the PyTorch team uses SD on Linux, you should see a large perf difference between Windows and Linux for the same GPU. I don't like sloppy bug reports either, but this isn't just a 30% hit; it's a 300% hit in perf. I am fully capable of instrumenting the entire inference flow in A1111 to find the place where some function with some data is much slower in the nightly build case, and then turning that into a short test case. But that would be tedious. Tomorrow my priority is to document the many gotchas in doing your own build of PyTorch and installing it. There are probably a number of cloud providers of SD image generation on Linux that don't realize they can get a big perf boost by building Torch 2 themselves. I helped one do just that earlier this week. He was quite happy.

Do you have any .py files which you use to benchmark? I'd be happy to test them in both the slow and fast environments on my machine.
You might be surprised, but indeed none of our CUDA math libs ship sm_89-specific kernels, and you can double-check it by extracting the kernel list via
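One way to do that check (a sketch: `cuobjdump` ships with the CUDA toolkit, and the library path in the comment below is an assumption — point it at the cuDNN `.so` bundled with your own torch install):

```shell
# Extract the unique SM architectures named in cuobjdump's ELF listing.
list_arches() { grep -o 'sm_[0-9]*' | sort -u; }

# Hypothetical usage -- adjust the path to your own torch install:
#   cuobjdump --list-elf /path/to/site-packages/torch/lib/libcudnn_cnn_infer.so.8 | list_arches
# If sm_89 never appears in the output, no Ada-specific kernels are shipped.
```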
That sounds amazing and I'm sure we can share a lot of debugging stories, you in the SQL world and I in CUDA, but this issue isn't the right place to do so ;)
This is exactly why I would like to narrow down the root cause and debug it. However, the compute capability support for sm_89 should not be related at all.
Both nightly binaries would use the same CUDA libs (compiler, cuBLAS, cuDNN), so I'm unsure where the difference between Linux and Windows would be coming from. Since I'm not deeply familiar with Windows, your CUDA 12.0 setup on Linux would be interesting to see, as well as a code snippet you are running to see the it/s output.
I don't think any script helps, but I would recommend profiling the workload, e.g. via Nsight Systems as described here.
I've dumped the internals of obj files/executables before but due to my lack of familiarity with this new technology what command line options should I use and on which ?.so? file should I do it on AND what are we looking for? I can check both the nightly build stuff and my local build.
Yes, this is correct. I think I have cuDNN 8.7
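A quick way to confirm which cuDNN the installed torch actually loads is `torch.backends.cudnn.version()`, which returns a packed integer. A small helper (a sketch; it assumes cuDNN's `MAJOR*1000 + MINOR*100 + PATCHLEVEL` encoding) makes it readable:

```python
def cudnn_version_str(v: int) -> str:
    """Decode cuDNN's packed version int (e.g. 8700) into 'major.minor.patch'.

    Assumes cuDNN's convention: MAJOR*1000 + MINOR*100 + PATCHLEVEL.
    """
    return f"{v // 1000}.{(v % 1000) // 100}.{v % 100}"

# Usage (shown as a comment since it requires torch to be installed):
#   import torch
#   print(cudnn_version_str(torch.backends.cudnn.version()))
```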
I don't run a "code snippet", I run an application consisting of tens of thousands of code snippets. I might try to narrow this down myself. For instance, when trying to figure out why GPU memory usage went from 5 GB at 16 images per batch to 18 GB at 17 images per batch, I found the cause deep down in the application where it called conv2d(). It is a known problem which the torch/CUDA community should have fixed by now but hasn't. I just keep my batch size under 17 images, and in another case A1111 has to work around a variation of this problem by doing a slow one-image-at-a-time decode_first_stage() process; otherwise users with smaller GPUs OOM.

I just discovered on reddit r/StableDiffusion that a lot of people who also see slow Linux perf want to try my workaround.
@aifartist can you please run
Windows would be difficult for me to do right now without losing a few hours, and the drivers there are old. But that isn't where the problem lies. I do have two side-by-side setups: the nightly build Torch 2, which is slow, and my locally built Torch 2, which is 3X faster. I wasn't aware of the above, but it looks helpful in figuring out the difference. I provide the results of the slow and fast environments below. If you see something obvious, let me know, because what I am going to do is change my local build to match the nightly and see if I can make my Torch 2 slower; that should tell us what the problem is.

Here is the nightly build venv:
Here is my local build env:
I also have this problem, but I'm almost certain it's because of the version of cuDNN PyTorch was built with, especially if you have the newer Lovelace GPUs. From what I've read, not all the tensor cores are being used with this older version. My Linux setup was compiled with the latest cuDNN, while the PyTorch nightly binary is on the old version. If I can ever figure out how to build PyTorch on Windows, I would be able to test by compiling with the latest cuDNN.
Hmm, I have a wild theory why it might behave like that on Windows, but not on Linux: Windows searches for DLLs using a different mechanism. I don't have a 4090, so I can't test it, but if this theory is correct, then copying the newer cuDNN DLLs into place should fix it. Anyone willing to try that?
@malfet @redredbeard I'll try this now.
BINGO! @ptrblck @malfet @redredbeard
IMO, you need to update the libcudnn.so bundled with PyTorch. If you are providing a cu118 version, then you should add sm_89 and sm_90 to the list of architectures you are building. Torch 2.0 should be state of the art, and at least Ada (sm_89) has been out quite a while now.

FYI, I could file this as a separate bug but I'll mention it here: if you build using CUDA 12.0 and the GPU can't be identified, PyTorch does a generic build for multiple architectures. The problem is that nvcc v12.0 no longer supports sm_35 and will fail. You need to trim it from the list if the CUDA version is 12. Perhaps sm_50 also, but I didn't check that.
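The trimming described above could look like this (a sketch: the semicolon-separated list mirrors the `TORCH_CUDA_ARCH_LIST` convention, and the assumption, per the comment, is that CUDA 12's nvcc dropped sm_35/sm_37 while sm_50 is unverified and deliberately left alone):

```python
def trim_arch_list(arch_list: str, cuda_major: int) -> str:
    """Drop compute capabilities that the given CUDA major version's nvcc rejects.

    arch_list is TORCH_CUDA_ARCH_LIST-style, e.g. "3.5;5.0;8.6;8.9".
    Only 3.5/3.7 are dropped for CUDA >= 12; 5.0 is kept since the
    report above did not verify whether nvcc 12 still accepts it.
    """
    unsupported = {"3.5", "3.7"} if cuda_major >= 12 else set()
    return ";".join(a for a in arch_list.split(";") if a not in unsupported)
```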
It only gets more bizarre. |
On the Windows question: I am getting feedback from some people on reddit r/StableDiffusion that even on Windows some see 13+ it/s. I don't know how the search path works there or where to install the new cuDNN libraries to fix the problem.
This command doesn't work on Windows because pip doesn't have a package with the latest cuDNN. So what I ended up doing is just overwriting the files in "~Anaconda3\envs\pytorch-dev\Lib\site-packages\torch\lib" with the files in "cudnn-windows-x86_64-8.7.0.84_cuda11-archive\bin" from the zip I downloaded directly from Nvidia's website here: https://developer.nvidia.com/rdp/cudnn-download (8.7.0 - Local Installer (Windows)).

This gives exactly the same performance I was getting on Linux. It seems cuDNN 8.7.0 is basically a requirement for the 40-series cards to get any decent performance. Before doing this, the performance was about what I was getting with my 3080. The nightly should really be moved up to compile against and be distributed with this version of cuDNN, considering the 3x performance gain from doing so.

For reference, this was with an Nvidia 4090, and my workload is video enhancement, not stable diffusion.
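The overwrite step can be scripted (a sketch with hypothetical paths — point `cudnn_bin` at the `bin` folder of the extracted cuDNN zip and `torch_lib` at your environment's `site-packages\torch\lib`):

```python
import shutil
from pathlib import Path


def copy_cudnn_dlls(cudnn_bin: Path, torch_lib: Path) -> list:
    """Overwrite torch's bundled cuDNN DLLs with newer ones from cudnn_bin.

    Both directories are assumptions -- adjust to your own environment.
    Returns the names of the files copied.
    """
    copied = []
    for dll in sorted(cudnn_bin.glob("cudnn*.dll")):
        shutil.copy2(dll, torch_lib / dll.name)  # replaces the bundled copy
        copied.append(dll.name)
    return copied
```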
I have heard it also helped some people using a 30xx, though it wasn't 3X. Also, I see no reason not to add sm_87, sm_89 and sm_90 to the architecture list the cu118 build of PyTorch targets. CUDA 11.8's nvcc supports these.
Thanks for the quick verification using the new cuDNN version! |
You will have a lot of happy people now. Many people that I have been telling about this are having a hard time following the process to manually replace their libcudnn with the newer one. If you bundle it with pytorch they should immediately see benefits. I am a bit confused about the sm_89 thing. |
@aifartist Could you please update the performance figures based on the latest nightly builds, with CUDA 11.8 and cuDNN 8.7.0.84? We want to know if this issue is resolved with this PR: pytorch/builder#1271. If it's not resolved, could you please post the environment you are comparing against.
Yes, this issue should be resolved via: pytorch/builder#1271 |
I'd be happy to test this. I'll run a comparison between the current build and one from ?? days ago. I'm having a problem figuring out when the final fix for this was merged into the nightly, and finding an earlier version which doesn't segv on me. The current nightly works, I just tested it, and it is fast. But I want to report both before and after to be complete.
BEFORE: 12.83 it/s

I don't see a real need to test what was there before. For some reason, installing several different older 2023mmdd versions of PyTorch results in SEGVs when I run AUTOMATIC1111. Thus, I can't run the versions before the fix. However, I can copy my cuDNN v8.5 libraries over the v8.7 ones you now provide. Thanks for the fix. I'll tell folks they can now use the nightly build if they want the perf improvement.
I can also confirm the latest nightly is working properly. I do want to note that there is a significant performance difference between Linux and Windows with the same PyTorch build, but I believe that falls outside the scope of this report. One thing I did notice is that CPU usage is significantly higher on Windows than on Linux. I believe this might already be tracked under a different report.
I've had a number of people tell me that on Windows they haven't quite gotten the same numbers as I get on Linux. Often it is just the slower CPUs they have, which I've commented on elsewhere. A 4090 with this fix needs something like a 5.8GHz processor to get the most from it in some cases.

Some GitHub projects like https://github.com/AUTOMATIC1111/stable-diffusion-webui have a Discussions area; PyTorch does not. Where can devs communicate or report an issue? For example, building PyTorch with TensorRT doesn't appear to work, although it might with a very ancient v7.x version. Yesterday I figured out how to get it built with TensorRT v8.5.3. I see some issues when using TensorRT and @torch.compile. I guess I'll report an "issue" for the time being.
Thank you @aifartist for confirming. |
🐛 Describe the bug
This GitHub repo doesn't have a Discussions tab like automatic1111 has, so I'll use this. Forgive me if this is wrong.
Stable Diffusion A1111 image generation using typical defaults: 20 steps, euler_a, simple prompts, the SD 2.1 512 model.
Using the Linux nightly Torch 2.0 on my 4090 only gives about 11 to 13 it/s.
With the Windows nightly Torch 2.0 build, a 4090 gives about 35 to 38 it/s.
I have multiple confirmations of this from other folks.
However, if you build PyTorch locally on Linux, you get about a 3X perf increase, matching the perf seen on Windows.
Today an ex-CTO of a cloud company with GPU resources contacted me to try this on one of his cloud servers, which he loaned me. It also sped up his 4090, and he will test an A4000 GPU tomorrow.
As a suggestion, you might check whether architecture sm_89 is one of the selected architectures listed in the Linux build output.
If there were a simple .py inference perf test, I'd be willing to run it as a repro, but my repro is the entirety of SD AUTOMATIC1111. I have no simple stand-alone PyTorch perf test. Let me know how I can help. Good night.
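There is no stand-alone repro in the thread, but the it/s numbers people quote can be approximated with a tiny timing harness (a sketch; the dummy workload below is a placeholder — in practice the callable would be one sampler step):

```python
import time


def iters_per_second(fn, warmup=3, iters=20):
    """Rough it/s: run fn a few warmup passes, then time `iters` calls."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return iters / (time.perf_counter() - t0)


if __name__ == "__main__":
    # Dummy CPU-bound workload standing in for a real inference step.
    print(f"{iters_per_second(lambda: sum(range(10000))):.2f} it/s")
```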
Versions
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.35
Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.17.0-1019-oem-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA Graphics Device
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.3
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.7.6
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==2.0.0.dev20230113+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==0.11.0
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.0.dev20230116+cu118
[conda] Could not collect
cc @ezyang @gchanan @zou3519 @ngimel @peterjc123 @mszhanyi @skyline75489 @nbcsm @csarofeen @ptrblck @xwang233 @seemethere @malfet