The speed of pytorch with cudatoolkit 11.0 is slower than cudatoolkit 10.2 #47908
Comments
I can confirm this is the case as well. Recently had a 2x speedup downgrading from CUDA 11 to CUDA 10.2 on a GTX 1080 Ti.
This is output within the |
@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1? |
@LukeAI It is not stable; sometimes the speeds are the same, sometimes 11.0 is slower. So I still use 10.2. |
It is for me (about 20% slower on CUDA 11.0 compared to CUDA 10.1). Here are my first logs (CUDA=10.1):
Here are my second logs (CUDA=11.0):
|
@feinsteinben Yeah, I have just tested my code; the speed of 1.7.1 on cuda 11.0 is about 25% slower than on 10.2. That is faster than 1.7.0, but still slower. |
Hi, I'm also facing the same issue (tried on A100 GPUs which I think need cuda >= 11). Was anybody able to overcome this issue? Thanks |
Hi, I'm also facing the same issue on a 1080 while debugging a Libtorch segmentation model. The pytorch branch is v1.7.1.
|
The speed is really still slower when using CUDA 11; I don't know what causes it. |
Have you tried building your own PyTorch with cuda11.1 (cuda11.2 is released, but there is no cudnn support for it yet), or using the nightly PyTorch builds? |
I am also getting a 2x slowdown with cuda 11 vs 10.2 on pytorch 1.7.1 on a GTX1080Ti. |
Same problem on ubuntu 18.04 using a titan rtx. Almost a 2x speedup on batch size 5 when using conda and downgrading. Edit: tested the 1.8 nightly, which came with cuda 11.0 and cudnn 8.0.3, and did not encounter speed issues. |
Is it happening also with Cuda 11.2 (supported by cudnn 8.1.0 since January 26th)? |
CUDA 11.2 is not supported by 1.8.0 yet, so I haven't tried it. |
I just tried 1.8 with 11.1; it is still about 10%~15% slower. |
1.8 with 11.1 is about 40%~45% slower than 1.8 with 10.2. |
If some of the benchmarks mentioned above are public, can someone post a concrete example?
With CUDA-10.2:
|
@malfet Based on the reports (self included), it seems like NVIDIA GPUs which lack tensor cores are affected (or maybe it's just the 10-series). I should have some time today to run those benchmarks, though. Did you use the default arguments? |
@jmuchovej yes, just run it as |
@jmuchovej Not yet. I have tried a TITAN V, which has tensor cores, and it is still slow. And some people use a 2080Ti |
I don't keep my previous builds, so I don't have comparable benchmark results, but the situation for me with an RTX 2060 was like this: I saw a huge performance boost, especially for mixed-precision training, going from pytorch 1.6.0-cuda 10.2 to pytorch 1.7.1-cuda 11.0. Normal training was the same. From pytorch 1.7.1-cuda 11.0 to pytorch 1.8.0-cuda 11.1, I've lost around 15-20% for both mixed and normal training. These results are very similar on both Windows and Ubuntu. |
@malfet I can't reproduce the slowdown with your benchmark; I am not sure why it doesn't show up. But on my own repo I still see a 40% slowdown with pytorch 1.8 and cudatoolkit 11.1. It's mostly just a resnet with a double backward pass. One epoch is ~1:20 with pytorch 1.8 and cudatoolkit 10.2, and ~1:50 with cudatoolkit 11.1. This was tested on a GTX1080Ti. Everything was installed through conda as described on pytorch.org. |
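For context, below is a minimal sketch of a "double backward" workload of the kind described; the model and sizes are illustrative assumptions, not taken from the linked repo.

import torch
import torch.nn as nn

# Toy model; the linked repo uses a resnet, this stand-in just keeps the sketch short.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten()).cuda()
x = torch.randn(8, 3, 32, 32, device='cuda', requires_grad=True)

out = model(x).sum()
# First backward: create_graph=True keeps the graph so it can be differentiated again.
grad_x, = torch.autograd.grad(out, x, create_graph=True)
# A gradient penalty triggers the second backward pass through the first.
penalty = grad_x.pow(2).sum()
penalty.backward()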
@y0ast thank you for the link to repro, will look into this one |
Same problem on a 10-series GPU. Pytorch 1.8.1 with |
@klyjm I met a similar problem on our new server with 8 x A100 GPUs. The difference is that my model runs slower only in DDP mode and runs normally in DP mode. After debugging, I found which operation is slower. The environment I use is: OS: Ubuntu 18.04.5 LTS (x86_64); Python version: 3.8 (64-bit runtime); Nvidia driver version: 450.119.04. According to NVIDIA's official documentation, the TFLOPs for the A100 and V100 are 19 and 15 respectively, which means the A100 should run faster than the V100. I am really confused about the result. |
@graycrown Yeah, and I have tested PyTorch 1.9; there is no difference. This bug is still unsolved. |
It seems to me this is related to cudnn. |
@maxwxzheng can you try updating to PyTorch-1.9 and compare the performance? (Several linking issues that could have negatively affected CuDNN performance were fixed in 1.9) |
@malfet I have tested my code on PyTorch 1.9; there is no difference. PyTorch with cuda 11.1 + cudnn 8.x is still slower than cuda 10.2 + cudnn 7.x |
I tested with pytorch 1.9 on an RTX 3090. No difference as well. Training with cudnn on is still about 30-40% slower than with cudnn off. |
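For reference, turning cuDNN off as described above uses a standard PyTorch flag:

import torch

# Fall back to PyTorch's native kernels instead of cuDNN.
torch.backends.cudnn.enabled = False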
@klyjm, @maxwxzheng what benchmark are you running? |
@malfet As shown at the beginning, I use test.py in ultralytics/yolov5 to test the speed. After upgrading to cuda 11 and cudnn 8, the speed is always slower than with cuda 10 and cudnn 7 |
@klyjm ok, I can observe perf degradation between CUDA-11 and CUDA-10.2 for batch size 1, but if the batch size is larger, the trend is inverted:
|
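A minimal sketch of a batch-size sweep along these lines, assuming a torchvision ResNet rather than the benchmark actually used:

import time
import torch
import torchvision

# Compare per-iteration forward latency at several batch sizes.
model = torchvision.models.resnet50().cuda().eval()
for bs in (1, 8, 32):
    x = torch.randn(bs, 3, 224, 224, device='cuda')
    with torch.no_grad():
        for _ in range(5):  # warmup, so cuDNN algorithm selection is amortized
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            model(x)
        torch.cuda.synchronize()
    print('bs={}: {:.1f} ms/iter'.format(bs, (time.perf_counter() - t0) / 20 * 1e3))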
@malfet Yeah, you are right. The bigger the batch size, the smaller the difference. And I also find that setting |
I did some benchmarks across different pytorch versions. |
cc @ptrblck, can you reproduce these results? |
I couldn't find any examples of how to use the posted repository, but since it seems to reuse timm, I profiled with the code at the end of this comment.

source build, cudnn8.2.2
cudnn.enabled=False, cudnn.benchmark=False, 20.43099s
cudnn.enabled=True, cudnn.benchmark=False, 17.86255s
cudnn.enabled=True, cudnn.benchmark=True, 17.97606s
1.9.0+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.61472s
cudnn.enabled=True, cudnn.benchmark=False, 19.17168s
cudnn.enabled=True, cudnn.benchmark=True, 19.05530s
1.8.1+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.60732s
cudnn.enabled=True, cudnn.benchmark=False, 19.55659s
cudnn.enabled=True, cudnn.benchmark=True, 19.03868s
1.7.1+cu110, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 45.85837s
cudnn.enabled=True, cudnn.benchmark=False, 19.83326s
cudnn.enabled=True, cudnn.benchmark=True, 19.36150s

Code:

import torch
import torch.nn as nn
import time
import timm

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

model = timm.create_model('efficientnet_b0').cuda()
x = torch.randn(12, 3, 960, 640, device='cuda')

# warmup
for _ in range(10):
    out = model(x)
    out.backward(torch.ones_like(out))

grad = torch.ones_like(out)
nb_iters = 100

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    out = model(x)
    out.backward(grad)
torch.cuda.synchronize()
t1 = time.perf_counter()

print('cudnn.enabled={}, cudnn.benchmark={}, {:.5f}s'.format(
    torch.backends.cudnn.enabled, torch.backends.cudnn.benchmark, (t1 - t0))) |
Thanks @ngimel and @ptrblck for looking into this.
|
@ptrblck hello, the same issue on an Ampere RTX 3090. Tensor loading is 2x slower with CUDA 11.3 + libtorch_cu113 (~6 sec) than with CUDA 10.2 + libtorch cu102 (~3 sec). |
@AlexTitovWork could you describe a bit more what "speed in tensor loading" means? |
Hello @ptrblck ! I use a simple test that uploads data into a GPU tensor inside a docker container.
I use the same test on the two GPU platforms.
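A minimal upload test along these lines, with illustrative sizes (this is a sketch, not the exact test used):

import time
import torch

# Initialize the CUDA context first, so the timed copy does not
# include one-time startup cost.
torch.cuda.synchronize()

x = torch.randn(64, 3, 1024, 1024)  # host (CPU) tensor; size is arbitrary
t0 = time.perf_counter()
y = x.cuda()  # host-to-device upload
torch.cuda.synchronize()
print('upload took {:.3f}s'.format(time.perf_counter() - t0))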
|
It seems to be a problem with the docker I'm using. |
Yes, it is very slow while loading the model parameters and doing memory allocation with cuda 11.5 and torch 1.3.1. I am testing on cuda 10.1; I hope it will work. |
Hello! I found the following information about loading and memory allocation on the first start of Libtorch or PyTorch: "Due to the library split, cuDNN version 8.0 API will only load the necessary kernels on the first API call that requires it. In previous versions, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). In version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who desire to have the CUDA context preloaded can call the new cudnnCnnInferVersionCheck() API (or its related cousins), which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls." |
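From PyTorch, one way to absorb that one-time cuDNN load cost before timing anything is a throwaway warmup call; this is a minimal sketch, not an official recipe:

import torch
import torch.nn as nn

# A dummy convolution forces cuDNN 8.x to load its kernels and initialize
# the CUDA context, so later calls do not pay that cost.
conv = nn.Conv2d(3, 8, kernel_size=3).cuda()
with torch.no_grad():
    conv(torch.zeros(1, 3, 32, 32, device='cuda'))
torch.cuda.synchronize()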
Same observation for torch 1.8.2 on a 2080ti machine. The overall speed of one of my training jobs is about 20% slower with cu11 than with cu10.2. |
I also still observe this with pytorch 1.11.0. With https://github.com/y0ast/DUE, an epoch takes ~2 minutes on cudatoolkit 10.2 and ~4 minutes on cudatoolkit 11.3.
Reproduced on two different machines, with a 1080Ti (driver 510.47) and a Titan Xp (driver 510.68). |
I've gone over @ptrblck's example to see where the difference comes from, and just adding:

torch.backends.cudnn.benchmark = True

makes my epoch go from 4:55 to 1:57 with newer CUDA/CuDNN versions on the codebase I linked above. This is the same as it was with CUDA 10.2 (and CuDNN 7+). My hypothesis is that the default convolution algorithm changed in CuDNN 8+. This change is probably fine for newer hardware, but runs badly on older hardware. By setting benchmark to true, CuDNN is forced to re-evaluate that choice and finds that the old choice is better. |
Same here; the fix from @y0ast didn't change anything for me |
Same here. |
"Same here" is unfortunately not actionable. |
🐛 Bug
When I updated pytorch to 1.7, the cudatoolkit was updated automatically to 11.0, and I found that the same code ran much slower than before. So I changed the cudatoolkit version back to 10.2, and the speed was normal. Maybe I should update the cudnn version in Ubuntu?
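As an aside, the versions a given PyTorch build is actually using can be confirmed with standard APIs (this assumes a CUDA device is present):

import torch

print(torch.__version__)               # PyTorch version, e.g. 1.7.0
print(torch.version.cuda)              # CUDA toolkit the build ships with
print(torch.backends.cudnn.version())  # e.g. 8003 for cuDNN 8.0.3
print(torch.cuda.get_device_name(0))   # GPU in use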
To Reproduce
I just use the same code on the same device in the same environment, changing only the version of the cudatoolkit, and the speed is much slower.
Expected behavior
The speed with 11.0 should be no slower than with 10.2.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
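If PyTorch is already installed, the collection script can also be run directly as a module (a standard PyTorch invocation):

python -m torch.utils.collect_env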
Additional context
cc @ngimel @VitalyFedyunin