performance much worse on 2080ti than 1080ti #22961
My guess is that the PyTorch prebuilt binaries are compiled with CUDA compute capabilities up to 7.0 (including 6.1, which targets the GTX 1080 Ti). The RTX 2080 Ti is compute capability 7.5 and thus doesn't benefit from the optimized kernels ... The solution would be to rebuild PyTorch specifically targeting 6.1 and 7.5, as follows:
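(The exact build command was not preserved in this copy. As an illustration only: rebuilding from source for specific architectures is usually done by setting the TORCH_CUDA_ARCH_LIST environment variable, e.g. TORCH_CUDA_ARCH_LIST="6.1 7.5", before running the build. A quick way to check what the installed wheel reports for the current GPU, as a minimal sketch:)

import torch

# Report the CUDA / cuDNN versions the wheel was built against and the
# compute capability of the visible GPU, e.g. (6, 1) for a GTX 1080 Ti
# or (7, 5) for an RTX 2080 Ti.
print("CUDA:", torch.version.cuda, "cuDNN:", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))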
@soumith Do you know if compute_75,sm_75 is supported in the distribution wheels ? |
@ngimel any ideas on what this could be? |
It looks like cudnn is not picking the right algorithm. Are you using torch.backends.cudnn.benchmark = True? |
Thanks for the quick response! @ngimel I am not setting it. Here are the cudnn logs: It might be difficult to give a repro script, but my model is mainly conv2d, conv3d, batch norm, and relu. I tried my benchmark using a subset of the model (no conv3d) and there is no performance discrepancy between the 1080 Ti and the 2080 Ti. I can try building from source against cudnn 7.6. |
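(For anyone wanting to reproduce such cuDNN API logs: cuDNN 7.1+ exposes logging through environment variables. A minimal sketch, with an arbitrary log file name; the variables must be visible before cuDNN is initialized, so exporting them in the shell before launching Python is the safer option.)

import os

# cuDNN API logging (cuDNN 7.1+)
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"  # or "stdout" / "stderr"

import torch
torch.backends.cudnn.benchmark = True  # the setting @ngimel is asking about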
@ngimel Thanks for the link, I was looking for exactly this one earlier today! So, if removing conv3d leads to no discrepancy, you might be hitting the following:
|
update: rebuilding with cudnn 7.6 had no effect |
Same problem for me. |
Same problem |
From the latest (7.6.1) release notes, it seems this is a known issue in all versions of CUDNN:
Although it only mentions single precision on Turing, it also says it applies to Pascal. |
interesting -- from pytorch, is there any way to choose an alternative algorithm? or any other suggested workaround? |
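(There is no public PyTorch API for pinning a specific cuDNN convolution algorithm, but two commonly used knobs are sketched below, as an illustration rather than an official recommendation from this thread: letting the autotuner pick the fastest algorithm per input shape, or bypassing cuDNN for an offending region.)

import torch

# Option 1: let cuDNN benchmark all algorithms per input shape and cache
# the fastest one (helps when input shapes are fixed).
torch.backends.cudnn.benchmark = True

# Option 2: bypass cuDNN entirely for a problematic region and use
# PyTorch's native kernels instead (usually slower overall, but avoids a
# badly chosen cuDNN algorithm). The Conv3d layer here is a placeholder.
conv = torch.nn.Conv3d(8, 8, 3).cuda()
x = torch.randn(1, 8, 32, 32, 32, device="cuda")
with torch.backends.cudnn.flags(enabled=False):
    y = conv(x)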
Please see this link https://github.com/hyperfraise/Apex-bench with reproducible code. Not exactly what you guys are talking about, but related (maybe?). |
After profiling via torch.autograd.profiler.profile, I observed the following issue: a significant amount of time is spent on the CPU side during CudnnConvolutionBackward, cudnn_convolution_backward, CudnnBatchNormBackward, and cudnn_batch_norm_backward. Note that I am using half precision (via apex), and my network uses 3D convolution operations. I use cuDNN 7.6.1, CUDA 10.0, and pytorch 1.1.0. The GPU is an RTX 2080 Ti. In contrast, a dumb approach which uses .half() only spends a tiny fraction of this time on the CPU side.
|
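(For reference, a minimal sketch of this kind of profiling with torch.autograd.profiler; the tiny Conv3d model and input shape are placeholders, not the commenter's actual network.)

import torch
import torch.autograd.profiler as profiler

model = torch.nn.Conv3d(8, 8, 3).cuda().half()        # placeholder model
x = torch.randn(1, 8, 32, 32, 32, device="cuda",
                dtype=torch.half, requires_grad=True)

with profiler.profile(use_cuda=True) as prof:
    for _ in range(10):
        model(x).sum().backward()
torch.cuda.synchronize()

# Sort by GPU time to see which forward/backward ops dominate.
print(prof.key_averages().table(sort_by="cuda_time_total"))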
I also clearly observed this problem, where a RTX 2080 is slower than a GTX 1080.

import numpy as np
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0")
print("torch ", torch.__version__)
print("cudnn ", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(device.index) if device.type == "cuda" else "cpu")

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv3d(1, 8, 3)
        self.conv2 = nn.Conv3d(8, 8, 3)
        self.conv3 = nn.Conv3d(8, 8, 3)
        self.conv4 = nn.Conv3d(8, 1, 3)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        return x.mean()

net = Model()
net.to(device)
input_data = np.zeros((1, 1, 64, 64, 64), np.float32)
inputs = torch.from_numpy(input_data).to(device, dtype=torch.float32)

for it in range(1, 161):
    loss = net(inputs[:, :1])
    loss.backward()
    if it == 20:  # ignore warmup
        start_time = time.time()
    if it % 40 == 0:
        print("%d (%.2f s)" % (it, time.time() - start_time)) |
Running the script above on RTX 2080ti (centos-7 machine, cuda 10.1):
With the older Quadro P4000 (roughly equal to a GTX 1070), on the same machine/config:
Note that, related to the last column of the Quadro results above, another GTX 1080 on a different centos-7 machine, running the script under different torch+cudnn versions, exhibits a large performance difference: the older torch+cudnn had better performance. This is not directly related to the issue here, though.
|
As for the script just above, the default algorithm was apparently inadequate.
|
Hi, I've found a comparable performance discrepancy when running inference on a Mask R-CNN model with an RTX 2070 vs a TITAN RTX. The RTX 2070 is significantly faster, especially on batch_norm. Tested on Win10, PyTorch 1.4, CUDA 10.1, just by swapping the GPU on the same system. Profile for the RTX 2070:
and for TITAN RTX:
|
Any progress on this? |
This is a pretty old issue comparing performance on now-old versions of pytorch and cudnn; tbh, I don't think it's going to be looked at in detail. If you have perf issues with current pytorch/cudnn versions, please file a new issue with a script that demonstrates the performance problems. Don't forget to use torch.backends.cudnn.benchmark=True, use a few warmup iterations, and benchmark and profile over several tens of iterations. Also, the reproducer should be reflective of a real use case, and not just try to find problematic configurations. The convolution search space is very wide; there will always be convolution parameters that are not optimized for and give poor performance. |
In my case, I am facing a 10x increase in iteration time after porting my code from Tensorflow to Pytorch. Upon profiling my code with both torch.utils.bottleneck.profiler and torch.cuda.profiler, I observed that the issue I face is in the backward pass, more precisely in the update of the convolutions; even more precisely, it is the following kernel call:
I can't publicly release my code at the moment; as soon as my publication is accepted, I will. But I can consistently reproduce it on my machine, which has a Titan V, and on a DGX-1, which has a V100. If I use benchmark=True I get even worse performance than with benchmark=False, and nothing in my network is dynamic in size. |
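(One way to tie an nvprof kernel name back to the Python op that launched it, sketched here independently of the poster's unreleased code with a placeholder network, is to wrap the profiled iterations in torch.autograd.profiler.emit_nvtx() and run the script under nvprof with profiling disabled at start.)

import torch
import torch.cuda.profiler as profiler

model = torch.nn.Conv3d(8, 8, 3).cuda()            # placeholder network
x = torch.randn(1, 8, 64, 64, 64, device="cuda", requires_grad=True)

# Launch as: nvprof --profile-from-start off -o trace.nvvp python script.py
with torch.autograd.profiler.emit_nvtx():
    profiler.start()
    for _ in range(20):
        model(x).sum().backward()
    profiler.stop()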
I'm sorry, the information you've provided is not actionable. With proper benchmarking, benchmark=True does not give worse performance, unless your input sizes change at each iteration. You can try using pyprof https://github.com/NVIDIA/apex/tree/master/apex/pyprof to identify the problematic operations, but that's again subject to proper benchmarking: enough warm-up iterations and enough benchmarking iterations. |
Thank you for responding @ngimel. In order to give you the required data, I have the following questions:
|
pyprof provides more detailed information than torch.autograd.profiler (e.g. torch.autograd.profiler does not record all the necessary shapes). If you are certain that your input sizes don't change every iteration, just running with benchmark=True is enough; otherwise, run with both benchmark=True and benchmark=False. There should be, say, 20 warmup iterations (with a synchronization after them) and 100 benchmarking iterations; you can change those numbers if you see that your results are flaky. |
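(A skeleton of that protocol, as a sketch; the model and the fixed input size below are placeholders.)

import time
import torch

torch.backends.cudnn.benchmark = True            # autotune per input shape
model = torch.nn.Conv3d(8, 8, 3).cuda()          # placeholder network
inp = torch.randn(1, 8, 64, 64, 64, device="cuda")

for _ in range(20):                              # warmup iterations
    model(inp).sum().backward()
torch.cuda.synchronize()                         # wait for warmup work to finish

start = time.time()
for _ in range(100):                             # benchmarking iterations
    model(inp).sum().backward()
torch.cuda.synchronize()
print("avg iteration: %.4f s" % ((time.time() - start) / 100))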
@ngimel Thank you very much for the swift response! My network is a particular implementation of the VQ-VAE 2 paper and besides the quantization process, it is fully convolutional with a standard input size of [3,1,192,256,192]. I observed that PyProf does not cover transpose convolutions which I use heavily in my network. Will that be a big problem? Should I post the links to the files here after I generate them? |
Hm, then I'm surprised that cudnn.benchmark hurts - it's possible that it would not help, but it should not hurt. I've verified with the pyprof author that transposed convolution is supported, in the sense that it will capture all the relevant shapes and parameters, but it won't compute achieved flops/bandwidth. If some convolutions are egregiously slow though, just looking at the timing should be enough. Yes, please post the links to the files here (including the sql file generated by pyprof). |
(wrt PyProf): I didn't yet add bytes/flops calculation for transposed convolution. But all the other useful information, e.g. the call stack (file name, line number), tensor shape, datatype and attributes (strides, padding, dilation), kernel duration etc., is captured. The information is visible in the intermediate human-readable dictionary (every line of output corresponds to the same line in the dictionary). The final output is not polished, I agree. As a stop gap, if you can post the sql file, I can help. E.g. if you had

m = nn.ConvTranspose2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2)).cuda()
input = torch.randn(20, 16, 50, 100).cuda()
output = m(input)

the intermediate file will have the following information (it's uglier, so I pretty printed it here). Look at the information contained in the entry below:
{
'kShortName': 'cudnn::detail::dgrad2d_alg1_1',
'kDuration': 1849853,
'layer': [],
'trace': ['conv2d_transpose.py:17'],
'reprMarkers': [],
'marker': ["{'mod': 'torch.nn.functional', 'op': 'conv_transpose2d',
'args': [{'name': '', 'type': 'tensor', 'shape': (20, 16, 50, 100), 'dtype': 'float32'},
{'name': '', 'type': 'tensor', 'shape': (16, 33, 3, 5), 'dtype': 'float32'},
{'name': '', 'type': 'tensor', 'shape': (33,), 'dtype': 'float32'},
{'name': '', 'type': 'tuple', 'value': (2, 1)},
{'name': '', 'type': 'tuple', 'value': (4, 2)},
{'name': '', 'type': 'tuple', 'value': (0, 0)},
{'name': '', 'type': 'int', 'value': 1},
{'name': '', 'type': 'tuple', 'value': (1, 1)}]}"],
'seqMarker': [],
'seqId': [],
'subSeqId': 0,
'altSeqId': [],
'dir': 'fprop',
'mod': ['torch.nn.functional'],
'op': ['conv_transpose2d'],
'tid': 2833028928, 'device': 0, 'stream': 7, 'grid': (9, 20, 1), 'block': (16, 32, 1),
'kLongName': 'void cudnn::detail::dgrad2d_alg1_1<float, 0, 6, 7, 5, 4, 5, true, true>
(int, int, int, float const*, int, float const*, int, float*, kernel_grad_params, int, int, float, int, int)'} |
@ngimel @adityaiitb Thank you for your support. I am working now on installing Apex and running a full-blown profiling. Will come back to you with more details tomorrow morning London time. |
Cool! You don't need to install apex with cpp extensions; for profiling, just the python installation will do. |
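(For completeness, the apex-era pyprof workflow looked roughly like the sketch below. The pyprof.nvtx.init() entry point and the nvprof flags are recalled from the apex pyprof README of that time, so treat them as assumptions and double-check against https://github.com/NVIDIA/apex/tree/master/apex/pyprof; the model is a placeholder.)

import torch
import torch.cuda.profiler as profiler
from apex import pyprof

pyprof.nvtx.init()                 # patch torch ops to emit NVTX markers (per the apex README)

model = torch.nn.Conv3d(8, 8, 3).cuda()          # placeholder network
x = torch.randn(1, 8, 64, 64, 64, device="cuda")

# Run as: nvprof -f -o net.sql --profile-from-start off python script.py
# then post-process net.sql with pyprof's parse.py / prof.py scripts.
with torch.autograd.profiler.emit_nvtx():
    profiler.start()
    for _ in range(20):
        model(x).sum().backward()
    profiler.stop()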
@ngimel I have made my git repo public and you can find it here: https://github.com/danieltudosiu/nmpevqvae. In it you can find net.sql.gz, which is the profiling that I have done. To generate it I used the profiling.py script and the accompanying command line. The python environment that I use can be found in requirements.txt (besides apex, which is just for profiling). The profiling was run inside a Docker container. The hardware used was a V100 from a DGX1 server. Please let me know if I can help in any other way. |
A couple of things:
The command I used to look for the convolutions in the SQL file was |
@adityaiitb Interesting! I will modify the script now and push it to git. 10 calls to conv_transpose3d sounds about right. I need to go and move my workstation from the office to home, covid19 style. Let me know if you need anything else. There are NVTX annotations for each nn.Module that I create, so I could debug it. |
I modified the profiling script to be independent of my data but respect the data shape. I did two more profilings: one that tried to capture between start and stop, called net_independent.sql.gz, and another one which captured everything; this one I was able to parse, and you can find it as net_independent_all.dict.gz since the sql was too big. Let me know if I can be of any further help! |
I did some personal parsing of the dict hoping to find something useful; the most useful thing I found is the following, but I can't make heads or tails of it:
Also, the inbuilt parser crashes due to the unusual return values of some layers (VectorQuantizerEMA, Quantization, Encoder, VectorQuantizedVAE). What would you recommend that I do next? I found out that the PyTorch binaries "ship with their own CUDA, cudnn etc." as per @ptrblck's comments. Does this mean that PyTorch is agnostic to the installed CUDA, CUDNN and drivers? |
If you installed PyTorch from binaries (pip or conda), then yes, it will be agnostic to the system's CUDA/CUDNN. |
@soumith yes I pip installed it. Then my problem is purely a PyTorch one? |
I found out the reason. In my experiment.py I was missing the line of code |
@danieltudosiu it kicks off an internal auto-tuner in CuDNN which picks the best convolution codepath after benchmarking every single available codepath (algorithm) for each input shape.
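(For readers landing here with the same symptom: the setting being described above is the cuDNN autotuner switch. As an illustration, not necessarily the exact line that was missing from experiment.py:)

import torch

# Benchmark every available cuDNN convolution algorithm for each input
# shape encountered and cache the fastest one. Helps when input shapes are
# fixed; can hurt when they change every iteration.
torch.backends.cudnn.benchmark = True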
I have quite a few updates.
torch.cuda.set_device(1)
device="cuda" # instead of "cuda:1"
pyprof2/parse/parse.py net.sql > net.dict
pyprof2/prof/prof.py -w 150 net.dict
OR
pyprof2/prof/prof.py --csv -c idx,dir,sub,mod,op,kernel,params,sil,tc,flops,bytes net.dict > net.csv
|
🐛 Bug
I have a model that I have historically trained on 1080ti, and recently I discovered that the training speed is much worse (almost 2x slower) on 2080ti. The rest of the setup (nvidia driver + cpu + networking) is the same between the two.
I profiled my script using nvprof python my_script.py, and discovered that on the 2080ti way too much time (~70%) is spent in this function:

void cudnn::detail::convolveNd_wgrad_engine<float, int=3, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>(int, int, int, float const *, int, cudnn::detail::convolveNd_wgrad_engine<float, int=3, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>*, float const *, kernel_gradNd_params, int, float, int)
Any ideas what the problem could be?
I have attached the two profiles in case they are helpful.
1080ti.log
2080ti.log
How you installed PyTorch (conda, pip, source): pip