
performance much worse on 2080ti than 1080ti #22961

Open
nikhilmishra000 opened this issue Jul 17, 2019 · 37 comments
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
module: performance - Issues related to performance, either of kernel code or framework glue
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@nikhilmishra000

nikhilmishra000 commented Jul 17, 2019

🐛 Bug

I have a model that I have historically trained on 1080ti, and recently I discovered that the training speed is much worse (almost 2x slower) on 2080ti. The rest of the setup (nvidia driver + cpu + networking) is the same between the two.

I profiled my script using nvprof python my_script.py, and discovered that on the 2080ti, way too much time (~70%) is spent in this function:
void cudnn::detail::convolveNd_wgrad_engine<float, int=3, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>(int, int, int, float const *, int, cudnn::detail::convolveNd_wgrad_engine<float, int=3, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>*, float const , kernel_gradNd_params, int, float, int)

Any ideas what the problem could be?
I have attached the two profiles in case they are helpful.
1080ti.log
2080ti.log

  • PyTorch Version (e.g., 1.0): 1.1.0
  • OS (e.g., Linux): Ubuntu 16.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0 / 7.4
  • GPU models and configuration: 1080ti + 2080ti, nvidia driver 410.78
  • Any other relevant information:
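As a cross-check from inside PyTorch (independent of nvprof), torch.autograd.profiler can surface the same hot spot. A minimal sketch, using a hypothetical single-Conv3d stand-in for the real model:

```python
import torch
from torch.autograd import profiler

# Hypothetical stand-in for the real model: a single Conv3d layer.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv3d(8, 16, kernel_size=3).to(device)
x = torch.randn(2, 8, 16, 16, 16, device=device, requires_grad=True)

with profiler.profile(use_cuda=(device == "cuda")) as prof:
    model(x).sum().backward()

# On the 2080ti, the wgrad kernel from the nvprof trace should dominate here too.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```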
@mfuntowicz
Contributor

mfuntowicz commented Jul 17, 2019

My guess is that the PyTorch prebuilt binaries are compiled with CUDA compute capabilities up to 7.0 (including 6.1, which targets the GTX 1080 Ti).

The RTX 2080 Ti is compute capability 7.5 and thus doesn't benefit from the optimized kernels ... The solution would be to rebuild PyTorch targeting 6.1 and 7.5 specifically, as follows:

TORCH_CUDA_ARCH_LIST="6.1;7.5" python setup.py bdist_wheel

@soumith Do you know if compute_75,sm_75 is supported in the distribution wheels ?
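Before rebuilding, it can be worth checking what the installed wheel reports; a quick sketch (torch.cuda.get_arch_list only exists in newer PyTorch releases, hence the guard):

```python
import torch

# Report the CUDA/cuDNN versions the wheel was built against.
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

if torch.cuda.is_available():
    # Compute capability of the first GPU, e.g. (7, 5) for an RTX 2080 Ti.
    print("device capability:", torch.cuda.get_device_capability(0))

# Only present in newer releases: lists the architectures compiled into the wheel.
if hasattr(torch.cuda, "get_arch_list"):
    print("compiled arches:", torch.cuda.get_arch_list())
```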

@fmassa
Member

fmassa commented Jul 17, 2019

@ngimel any ideas on what this could be?

@fmassa fmassa added the module: cuda, module: performance, and triaged labels Jul 17, 2019
@ngimel
Collaborator

ngimel commented Jul 17, 2019

It looks like cudnn is not picking the right algorithm. Are you using torch.backends.cudnn.benchmark=True?
Thank you, the profiles are very helpful. Can you also please collect cudnn call logs (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging), or alternatively give us a small repro script?
Also, is it possible to try cudnn 7.6? There were some heuristics updates that went into it, so the problem might have been fixed.
@mfuntowicz when the pytorch binaries are built, cudnn pruning still leaves the 7.5-architecture kernels in (https://github.com/pytorch/builder/blob/59ad166ce23abcad030d922aa0331530a7dc7eda/manywheel/Dockerfile_100#L100), so cudnn should still be picking the right kernels for Turing; if it does not, it is a cudnn bug. For pytorch itself, whether the 7.5 architecture is included in the compilation flags or not (I don't know off the top of my head if it is) should not matter much.
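For reference, the cudnn API logging mentioned above is driven by environment variables that cudnn reads at library load time, so they have to be set before torch is imported. A minimal sketch (the log file name is arbitrary):

```python
import os

# cuDNN reads these when the library loads, so set them before importing torch
# (per the NVIDIA API-logging docs linked above).
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"  # or "stdout" / "stderr"

import torch

# Let cudnn benchmark all available algorithms per conv shape and cache the fastest.
torch.backends.cudnn.benchmark = True
```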

@nikhilmishra000
Author

nikhilmishra000 commented Jul 17, 2019

Thanks for the quick response!

@ngimel I am not setting torch.backends.cudnn.benchmark.

Here are the cudnn logs:
1080ti_cudnn.log
2080ti_cudnn.log

It might be difficult to give a repro script, but my model is mainly conv2d, conv3d, batch norm, relu. I tried my benchmark using a subset of the model (no conv3d) and there is no performance discrepancy between 1080ti and 2080ti.

I can try building from source against cudnn 7.6.
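A stripped-down conv3d benchmark along those lines might look like the following sketch (the shapes are hypothetical stand-ins for the real model's sizes):

```python
import time
import torch

# Hypothetical shapes; the real model's conv3d sizes may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = torch.nn.Conv3d(32, 32, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 32, 8, 16, 16, device=device, requires_grad=True)

def step():
    conv(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the kernels so timing is honest

step()  # warm-up: cudnn algorithm selection happens on the first call
t0 = time.time()
for _ in range(10):
    step()
print("avg fwd+bwd: %.1f ms" % ((time.time() - t0) / 10 * 1000))
```

Running the same script on the 1080ti and the 2080ti would show whether the conv3d backward pass alone reproduces the gap.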

@mfuntowicz
Contributor

mfuntowicz commented Jul 17, 2019

@ngimel Thanks for the link, I was looking for exactly this one earlier today!

So, if removing conv3d leads to no discrepancy, you might be hitting the following:

CuDNN 7.5.0 Release notes:

In cuDNN 7.4.2, for some cases the 3D convolution resulted in a reduced performance on Turing GPUs, compared to the previous cuDNN releases. This is fixed.

@nikhilmishra000
Author

Update: rebuilding with cudnn 7.6 had no effect.

@dmenig
Collaborator

dmenig commented Jul 18, 2019

Same problem for me.

@OValery16

Same problem

@xsacha
Contributor

xsacha commented Jul 19, 2019

From the latest (7.6.1) release notes, it seems this is a known issue in all versions of CUDNN:

In cuDNN 7.6.1, on Volta architecture only, there may be a performance degradation when the function cudnnConvolutionBackwardFilter() is used for 3D convolutions with CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1.
In cuDNN 7.6.1, on Turing and Pascal architectures, performance may be degraded for cudnnConvolutionBackwardData(), when used with the following conditions:
CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 for 3D convolutions
wDesc, dyDesc and dxDesc are all in NCDHW
Data type configuration is FLOAT_CONFIG (i.e., single precision data and compute)

Although it only mentions single precision, note that it says it applies to Pascal as well as Turing.

@nikhilmishra000
Author

Interesting -- from pytorch, is there any way to choose an alternative algorithm? Or any other suggested workaround?
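PyTorch doesn't appear to expose cudnn's algorithm enum directly, but two knobs are commonly tried in cases like this; a hedged sketch (run_without_cudnn is just an illustrative helper, not an existing API):

```python
import torch

# Knob 1: let cudnn benchmark every algorithm for each conv shape and cache
# the fastest, instead of trusting its heuristics.
torch.backends.cudnn.benchmark = True

# Knob 2: bypass cudnn for the affected ops and fall back to PyTorch's native
# convolution kernels (generally slower, but sidesteps a cudnn-specific bug).
def run_without_cudnn(module, x):
    with torch.backends.cudnn.flags(enabled=False, benchmark=False):
        return module(x)
```

Whether either helps depends on whether cudnn has any fast wgrad kernel available for these conv3d shapes on Turing.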

@dmenig
Collaborator

dmenig commented Jul 19, 2019

Please see this link with reproducible code: https://github.com/hyperfraise/Apex-bench. It's not exactly what you're talking about, but it may be related.

@OValery16

OValery16 commented Jul 19, 2019

After profiling via torch.autograd.profiler.profile, I observed the following issue: a significant amount of time is spent on the CPU side during CudnnConvolutionBackward, cudnn_convolution_backward, CudnnBatchNormBackward, and cudnn_batch_norm_backward. Note that I am using half precision (via apex), and my network uses 3D convolution operations. I use cuDNN 7.6.1, CUDA 10.0, and pytorch 1.1.0. The GPU is an RTX 2080 ti.

In contrast, a simple approach which just uses .half() spends only a tiny fraction of this time on the CPU side.

  • RTX 2080 ti with torch half
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                  Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
torch::autograd::GraphRoot                      0.01%         30.060us            0.01%         30.060us         30.060us            0.00%          8.320us          8.320us                1
NllLossBackward                                 0.08%        253.392us            0.08%        253.392us        253.392us            0.00%        246.368us        246.368us                1
nll_loss_backward                               0.06%        177.542us            0.06%        177.542us        177.542us            0.00%        176.064us        176.064us                1
LogSoftmaxBackward                              0.03%         92.631us            0.03%         92.631us         92.631us            0.00%         92.160us         92.160us                1
_log_softmax_backward_data                      0.02%         75.321us            0.02%         75.321us         75.321us            0.00%         77.152us         77.152us                1
AddmmBackward                                   0.09%        272.563us            0.09%        272.563us        272.563us            0.01%        272.544us        272.544us                1
unsigned short                                  0.01%         19.150us            0.01%         19.150us         19.150us            0.00%         18.592us         18.592us                1
mm                                              0.04%        123.522us            0.04%        123.522us        123.522us            0.00%        125.408us        125.408us                1
unsigned short                                  0.00%         12.120us            0.00%         12.120us         12.120us            0.00%         12.288us         12.288us                1
mm                                              0.02%         56.040us            0.02%         56.040us         56.040us            0.00%         57.376us         57.376us                1
unsigned short                                  0.00%          7.751us            0.00%          7.751us          7.751us            0.00%          7.168us          7.168us                1
sum                                             0.03%         89.521us            0.03%         89.521us         89.521us            0.00%         90.368us         90.368us                1
view                                            0.00%         15.110us            0.00%         15.110us         15.110us            0.00%         15.488us         15.488us                1
torch::autograd::AccumulateGrad                 0.01%         20.210us            0.01%         20.210us         20.210us            0.00%         20.352us         20.352us                1
TBackward                                       0.01%         16.341us            0.01%         16.341us         16.341us            0.00%         16.096us         16.096us                1
unsigned short                                  0.00%          7.851us            0.00%          7.851us          7.851us            0.00%          7.712us          7.712us                1
torch::autograd::AccumulateGrad                 0.00%          5.730us            0.00%          5.730us          5.730us            0.00%          4.960us          4.960us                1
ViewBackward                                    0.01%         36.970us            0.01%         36.970us         36.970us            0.00%         36.576us         36.576us                1
reshape                                         0.01%         28.080us            0.01%         28.080us         28.080us            0.00%         28.512us         28.512us                1
as_strided                                      0.00%          7.000us            0.00%          7.000us          7.000us            0.00%          7.680us          7.680us                1
AdaptiveAvgPool3DBackward                       0.02%         77.891us            0.02%         77.891us         77.891us            0.02%        808.960us        808.960us                1
adaptive_avg_pool3d_backward                    0.02%         64.461us            0.02%         64.461us         64.461us            0.02%        800.512us        800.512us                1
ReluBackward1                                   0.02%         59.111us            0.02%         59.111us         59.111us            0.00%         40.960us         40.960us                1
threshold_backward                              0.01%         42.751us            0.01%         42.751us         42.751us            0.00%         38.304us         38.304us                1
AddBackward0                                    0.00%          4.440us            0.00%          4.440us          4.440us            0.00%          1.632us          1.632us                1
NativeBatchNormBackward                         0.03%        103.371us            0.03%        103.371us        103.371us            0.00%         74.496us         74.496us                1
native_batch_norm_backward                      0.02%         75.431us            0.02%         75.431us         75.431us            0.00%         71.680us         71.680us                1
torch::autograd::AccumulateGrad                 0.00%          6.361us            0.00%          6.361us          6.361us            0.00%          0.704us          0.704us                1
torch::autograd::AccumulateGrad                 0.00%          4.970us            0.00%          4.970us          4.970us            0.00%          1.824us          1.824us                1
CudnnConvolutionBackward                        0.69%          2.191ms            0.69%          2.191ms          2.191ms            0.04%          2.274ms          2.274ms                1
cudnn_convolution_backward                      0.68%          2.171ms            0.68%          2.171ms          2.171ms            0.04%          2.271ms          2.271ms                1
torch::autograd::AccumulateGrad                 0.00%          6.710us            0.00%          6.710us          6.710us            0.00%          0.929us          0.929us                1
ReluBackward1                                   0.01%         46.211us            0.01%         46.211us         46.211us            0.00%         26.592us         26.592us                1
threshold_backward                              0.01%         33.381us            0.01%         33.381us         33.381us            0.00%         22.880us         22.880us                1
NativeBatchNormBackward                         0.02%         65.761us            0.02%         65.761us         65.761us            0.00%         43.584us         43.584us                1
native_batch_norm_backward                      0.01%         46.851us            0.01%         46.851us         46.851us            0.00%         40.960us         40.960us                1
torch::autograd::AccumulateGrad                 0.00%          6.090us            0.00%          6.090us          6.090us            0.00%          1.729us          1.729us                1
torch::autograd::AccumulateGrad                 0.00%          4.590us            0.00%          4.590us          4.590us            0.00%          0.832us          0.832us                1
CudnnConvolutionBackward                        0.47%          1.495ms            0.47%          1.495ms          1.495ms            0.03%          1.626ms          1.626ms                1
cudnn_convolution_backward                      0.46%          1.479ms            0.46%          1.479ms          1.479ms            0.03%          1.622ms          1.622ms                1
torch::autograd::AccumulateGrad                 0.00%          6.580us            0.00%          6.580us          6.580us            0.00%          2.048us          2.048us                1
ReluBackward1                                   0.01%         43.341us            0.01%         43.341us         43.341us            0.00%         22.688us         22.688us                1
threshold_backward                              0.01%         31.021us            0.01%         31.021us         31.021us            0.00%         19.136us         19.136us                1
NativeBatchNormBackward                         0.02%         64.161us            0.02%         64.161us         64.161us            0.00%         40.320us         40.320us                1
native_batch_norm_backward                      0.01%         45.981us            0.01%         45.981us         45.981us            0.00%         37.312us         37.312us                1
torch::autograd::AccumulateGrad                 0.00%         10.750us            0.00%         10.750us         10.750us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          4.750us            0.00%          4.750us          4.750us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.06%        187.662us            0.06%        187.662us        187.662us            0.03%          1.384ms          1.384ms                1
cudnn_convolution_backward                      0.05%        173.032us            0.05%        173.032us        173.032us            0.03%          1.381ms          1.381ms                1
add                                             0.01%         40.201us            0.01%         40.201us         40.201us            0.00%         34.528us         34.528us                1
torch::autograd::AccumulateGrad                 0.00%          6.110us            0.00%          6.110us          6.110us            0.00%          0.832us          0.832us                1
ReluBackward1                                   0.01%         37.130us            0.01%         37.130us         37.130us            0.00%         45.057us         45.057us                1
threshold_backward                              0.01%         25.600us            0.01%         25.600us         25.600us            0.00%         42.592us         42.592us                1
AddBackward0                                    0.00%          4.000us            0.00%          4.000us          4.000us            0.00%          1.761us          1.761us                1
NativeBatchNormBackward                         0.02%         57.550us            0.02%         57.550us         57.550us            0.00%         76.607us         76.607us                1
native_batch_norm_backward                      0.01%         39.830us            0.01%         39.830us         39.830us            0.00%         75.008us         75.008us                1
torch::autograd::AccumulateGrad                 0.00%          6.060us            0.00%          6.060us          6.060us            0.00%          1.695us          1.695us                1
torch::autograd::AccumulateGrad                 0.00%          4.720us            0.00%          4.720us          4.720us            0.00%          0.736us          0.736us                1
CudnnConvolutionBackward                        0.05%        153.481us            0.05%        153.481us        153.481us            0.03%          1.411ms          1.411ms                1
cudnn_convolution_backward                      0.04%        134.891us            0.04%        134.891us        134.891us            0.03%          1.408ms          1.408ms                1
torch::autograd::AccumulateGrad                 0.00%          6.150us            0.00%          6.150us          6.150us            0.00%          1.568us          1.568us                1
ReluBackward1                                   0.01%         46.971us            0.01%         46.971us         46.971us            0.00%         27.487us         27.487us                1
threshold_backward                              0.01%         31.490us            0.01%         31.490us         31.490us            0.00%         26.111us         26.111us                1
NativeBatchNormBackward                         0.02%         64.061us            0.02%         64.061us         64.061us            0.00%         47.104us         47.104us                1
native_batch_norm_backward                      0.01%         38.801us            0.01%         38.801us         38.801us            0.00%         44.353us         44.353us                1
torch::autograd::AccumulateGrad                 0.00%          5.890us            0.00%          5.890us          5.890us            0.00%          1.695us          1.695us                1
torch::autograd::AccumulateGrad                 0.00%          4.540us            0.00%          4.540us          4.540us            0.00%          0.896us          0.896us                1
CudnnConvolutionBackward                        0.43%          1.358ms            0.43%          1.358ms          1.358ms            0.03%          1.624ms          1.624ms                1
cudnn_convolution_backward                      0.42%          1.343ms            0.42%          1.343ms          1.343ms            0.03%          1.620ms          1.620ms                1
torch::autograd::AccumulateGrad                 0.00%          6.400us            0.00%          6.400us          6.400us            0.00%          1.663us          1.663us                1
ReluBackward1                                   0.02%         49.950us            0.02%         49.950us         49.950us            0.00%         27.553us         27.553us                1
threshold_backward                              0.01%         37.140us            0.01%         37.140us         37.140us            0.00%         24.575us         24.575us                1
NativeBatchNormBackward                         0.02%         63.521us            0.02%         63.521us         63.521us            0.00%         43.391us         43.391us                1
native_batch_norm_backward                      0.01%         45.331us            0.01%         45.331us         45.331us            0.00%         41.119us         41.119us                1
torch::autograd::AccumulateGrad                 0.00%          6.310us            0.00%          6.310us          6.310us            0.00%          1.664us          1.664us                1
torch::autograd::AccumulateGrad                 0.00%          4.830us            0.00%          4.830us          4.830us            0.00%          0.896us          0.896us                1
CudnnConvolutionBackward                        0.04%        135.992us            0.04%        135.992us        135.992us            0.03%          1.393ms          1.393ms                1
cudnn_convolution_backward                      0.04%        118.831us            0.04%        118.831us        118.831us            0.03%          1.389ms          1.389ms                1
add                                             0.01%         28.780us            0.01%         28.780us         28.780us            0.00%         35.008us         35.008us                1
torch::autograd::AccumulateGrad                 0.00%          6.130us            0.00%          6.130us          6.130us            0.00%          1.951us          1.951us                1
ReluBackward1                                   0.01%         42.411us            0.01%         42.411us         42.411us            0.00%         46.943us         46.943us                1
threshold_backward                              0.01%         30.281us            0.01%         30.281us         30.281us            0.00%         44.770us         44.770us                1
AddBackward0                                    0.00%          4.210us            0.00%          4.210us          4.210us            0.00%          2.049us          2.049us                1
NativeBatchNormBackward                         0.02%         62.710us            0.02%         62.710us         62.710us            0.00%         80.287us         80.287us                1
native_batch_norm_backward                      0.01%         44.850us            0.01%         44.850us         44.850us            0.00%         78.209us         78.209us                1
torch::autograd::AccumulateGrad                 0.00%          5.570us            0.00%          5.570us          5.570us            0.00%          1.920us          1.920us                1
torch::autograd::AccumulateGrad                 0.00%          4.750us            0.00%          4.750us          4.750us            0.00%          0.672us          0.672us                1
CudnnConvolutionBackward                        0.17%        544.115us            0.17%        544.115us        544.115us            0.11%          5.790ms          5.790ms                1
cudnn_convolution_backward                      0.17%        528.815us            0.17%        528.815us        528.815us            0.11%          5.787ms          5.787ms                1
torch::autograd::AccumulateGrad                 0.00%         15.000us            0.00%         15.000us         15.000us            0.00%          1.822us          1.822us                1
NativeBatchNormBackward                         0.02%         66.350us            0.02%         66.350us         66.350us            0.00%         76.543us         76.543us                1
native_batch_norm_backward                      0.01%         46.760us            0.01%         46.760us         46.760us            0.00%         74.848us         74.848us                1
torch::autograd::AccumulateGrad                 0.00%          5.750us            0.00%          5.750us          5.750us            0.00%          1.537us          1.537us                1
torch::autograd::AccumulateGrad                 0.00%          5.000us            0.00%          5.000us          5.000us            0.00%          2.049us          2.049us                1
CudnnConvolutionBackward                        0.04%        130.121us            0.04%        130.121us        130.121us            0.03%          1.412ms          1.412ms                1
cudnn_convolution_backward                      0.04%        115.561us            0.04%        115.561us        115.561us            0.03%          1.409ms          1.409ms                1
torch::autograd::AccumulateGrad                 0.00%          5.880us            0.00%          5.880us          5.880us            0.00%          2.049us          2.049us                1
ReluBackward1                                   0.01%         46.741us            0.01%         46.741us         46.741us            0.00%         31.264us         31.264us                1
threshold_backward                              0.01%         35.161us            0.01%         35.161us         35.161us            0.00%         29.727us         29.727us                1
NativeBatchNormBackward                         0.02%         60.841us            0.02%         60.841us         60.841us            0.00%         46.176us         46.176us                1
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 318.945ms
CUDA time total: 5.090s

  • RTX 2080 ti with apex half

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                  Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
to                                              0.00%          4.650us            0.00%          4.650us          4.650us            0.00%          3.904us          3.904us                1
is_floating_point                               0.00%          2.270us            0.00%          2.270us          2.270us            0.00%          2.048us          2.048us                1
mul                                             0.00%         41.471us            0.00%         41.471us         41.471us            0.00%         41.568us         41.568us                1
torch::autograd::GraphRoot                      0.00%         29.271us            0.00%         29.271us         29.271us            0.00%          7.616us          7.616us                1
MulBackward0                                    0.00%        162.214us            0.00%        162.214us        162.214us            0.00%        157.504us        157.504us                1
mul                                             0.00%        107.863us            0.00%        107.863us        107.863us            0.00%        108.768us        108.768us                1
NllLossBackward                                 0.00%        159.613us            0.00%        159.613us        159.613us            0.00%        159.744us        159.744us                1
nll_loss_backward                               0.00%        129.183us            0.00%        129.183us        129.183us            0.00%        128.640us        128.640us                1
LogSoftmaxBackward                              0.00%         78.661us            0.00%         78.661us         78.661us            0.00%         77.856us         77.856us                1
_log_softmax_backward_data                      0.00%         64.331us            0.00%         64.331us         64.331us            0.00%         65.536us         65.536us                1
torch::autograd::CopyBackwards                  0.00%         85.042us            0.00%         85.042us         85.042us            0.00%         85.376us         85.376us                1
to                                              0.00%         63.722us            0.00%         63.722us         63.722us            0.00%         64.833us         64.833us                1
empty                                           0.00%          8.841us            0.00%          8.841us          8.841us            0.00%          9.376us          9.376us                1
AddmmBackward                                   0.00%        233.895us            0.00%        233.895us        233.895us            0.00%        233.792us        233.792us                1
unsigned short                                  0.00%         21.271us            0.00%         21.271us         21.271us            0.00%         21.792us         21.792us                1
mm                                              0.00%         95.212us            0.00%         95.212us         95.212us            0.00%         98.176us         98.176us                1
unsigned short                                  0.00%          8.100us            0.00%          8.100us          8.100us            0.00%          8.160us          8.160us                1
mm                                              0.00%         52.021us            0.00%         52.021us         52.021us            0.00%         53.152us         53.152us                1
unsigned short                                  0.00%         12.400us            0.00%         12.400us         12.400us            0.00%         12.288us         12.288us                1
sum                                             0.00%         88.282us            0.00%         88.282us         88.282us            0.00%         88.736us         88.736us                1
view                                            0.00%         14.390us            0.00%         14.390us         14.390us            0.00%         14.336us         14.336us                1
TBackward                                       0.00%         20.000us            0.00%         20.000us         20.000us            0.00%         18.976us         18.976us                1
unsigned short                                  0.00%         10.590us            0.00%         10.590us         10.590us            0.00%         10.240us         10.240us                1
torch::autograd::CopyBackwards                  0.00%         60.421us            0.00%         60.421us         60.421us            0.00%         59.520us         59.520us                1
to                                              0.00%         49.261us            0.00%         49.261us         49.261us            0.00%         50.176us         50.176us                1
empty                                           0.00%          8.110us            0.00%          8.110us          8.110us            0.00%          8.192us          8.192us                1
torch::autograd::AccumulateGrad                 0.00%         11.930us            0.00%         11.930us         11.930us            0.00%         12.000us         12.000us                1
torch::autograd::CopyBackwards                  0.00%         51.741us            0.00%         51.741us         51.741us            0.00%         52.512us         52.512us                1
to                                              0.00%         37.550us            0.00%         37.550us         37.550us            0.00%         38.400us         38.400us                1
empty                                           0.00%          9.070us            0.00%          9.070us          9.070us            0.00%          8.705us          8.705us                1
torch::autograd::AccumulateGrad                 0.00%          6.170us            0.00%          6.170us          6.170us            0.00%          6.144us          6.144us                1
ViewBackward                                    0.00%         44.671us            0.00%         44.671us         44.671us            0.00%         43.840us         43.840us                1
reshape                                         0.00%         31.741us            0.00%         31.741us         31.741us            0.00%         32.320us         32.320us                1
as_strided                                      0.00%          9.930us            0.00%          9.930us          9.930us            0.00%          8.128us          8.128us                1
AdaptiveAvgPool3DBackward                       0.00%         85.322us            0.00%         85.322us         85.322us            0.02%        810.496us        810.496us                1
adaptive_avg_pool3d_backward                    0.00%         71.082us            0.00%         71.082us         71.082us            0.02%        800.768us        800.768us                1
ReluBackward1                                   0.00%         60.452us            0.00%         60.452us         60.452us            0.00%         42.432us         42.432us                1
threshold_backward                              0.00%         38.450us            0.00%         38.450us         38.450us            0.00%         38.880us         38.880us                1
AddBackward0                                    0.00%          4.560us            0.00%          4.560us          4.560us            0.00%          2.048us          2.048us                1
CudnnBatchNormBackward                          0.03%          1.231ms            0.03%          1.231ms          1.231ms            0.01%        579.008us        579.008us                1
contiguous                                      0.00%          5.170us            0.00%          5.170us          5.170us            0.00%          0.640us          0.640us                1
cudnn_batch_norm_backward                       0.02%          1.190ms            0.02%          1.190ms          1.190ms            0.01%        573.280us        573.280us                1
torch::autograd::AccumulateGrad                 0.00%          7.030us            0.00%          7.030us          7.030us            0.00%          6.176us          6.176us                1
torch::autograd::AccumulateGrad                 0.00%          5.290us            0.00%          5.290us          5.290us            0.00%          5.408us          5.408us                1
CudnnConvolutionBackward                        0.02%          1.134ms            0.02%          1.134ms          1.134ms            0.04%          1.837ms          1.837ms                1
cudnn_convolution_backward                      0.02%          1.116ms            0.02%          1.116ms          1.116ms            0.04%          1.825ms          1.825ms                1
torch::autograd::CopyBackwards                  0.00%         64.342us            0.00%         64.342us         64.342us            0.00%         35.457us         35.457us                1
to                                              0.00%         52.031us            0.00%         52.031us         52.031us            0.00%         32.769us         32.769us                1
empty                                           0.00%         10.860us            0.00%         10.860us         10.860us            0.00%          1.760us          1.760us                1
torch::autograd::AccumulateGrad                 0.00%          6.470us            0.00%          6.470us          6.470us            0.00%          0.640us          0.640us                1
ReluBackward1                                   0.00%         47.741us            0.00%         47.741us         47.741us            0.00%         26.272us         26.272us                1
threshold_backward                              0.00%         35.231us            0.00%         35.231us         35.231us            0.00%         22.752us         22.752us                1
CudnnBatchNormBackward                          0.00%         79.771us            0.00%         79.771us         79.771us            0.00%         34.911us         34.911us                1
contiguous                                      0.00%          4.840us            0.00%          4.840us          4.840us            0.00%          2.049us          2.049us                1
cudnn_batch_norm_backward                       0.00%         51.131us            0.00%         51.131us         51.131us            0.00%         28.673us         28.673us                1
torch::autograd::AccumulateGrad                 0.00%         10.820us            0.00%         10.820us         10.820us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          5.210us            0.00%          5.210us          5.210us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.03%          1.502ms            0.03%          1.502ms          1.502ms            0.03%          1.608ms          1.608ms                1
cudnn_convolution_backward                      0.03%          1.486ms            0.03%          1.486ms          1.486ms            0.03%          1.605ms          1.605ms                1
torch::autograd::CopyBackwards                  0.00%         60.001us            0.00%         60.001us         60.001us            0.00%         16.384us         16.384us                1
to                                              0.00%         48.501us            0.00%         48.501us         48.501us            0.00%         13.409us         13.409us                1
empty                                           0.00%          9.650us            0.00%          9.650us          9.650us            0.00%          0.576us          0.576us                1
torch::autograd::AccumulateGrad                 0.00%          6.470us            0.00%          6.470us          6.470us            0.00%          1.504us          1.504us                1
ReluBackward1                                   0.00%         51.271us            0.00%         51.271us         51.271us            0.00%         25.887us         25.887us                1
threshold_backward                              0.00%         35.340us            0.00%         35.340us         35.340us            0.00%         22.369us         22.369us                1
CudnnBatchNormBackward                          0.00%         78.302us            0.00%         78.302us         78.302us            0.00%         32.385us         32.385us                1
contiguous                                      0.00%          4.910us            0.00%          4.910us          4.910us            0.00%          2.048us          2.048us                1
cudnn_batch_norm_backward                       0.00%         49.581us            0.00%         49.581us         49.581us            0.00%         25.150us         25.150us                1
torch::autograd::AccumulateGrad                 0.00%          6.150us            0.00%          6.150us          6.150us            0.00%          0.960us          0.960us                1
torch::autograd::AccumulateGrad                 0.00%          8.380us            0.00%          8.380us          8.380us            0.00%          1.792us          1.792us                1
CudnnConvolutionBackward                        0.00%        180.934us            0.00%        180.934us        180.934us            0.03%          1.368ms          1.368ms                1
cudnn_convolution_backward                      0.00%        167.003us            0.00%        167.003us        167.003us            0.03%          1.365ms          1.365ms                1
add                                             0.00%         33.070us            0.00%         33.070us         33.070us            0.00%         37.184us         37.184us                1
torch::autograd::CopyBackwards                  0.01%        534.221us            0.01%        534.221us        534.221us            0.00%         37.280us         37.280us                1
to                                              0.01%        522.121us            0.01%        522.121us        522.121us            0.00%         34.816us         34.816us                1
empty                                           0.01%        469.580us            0.01%        469.580us        469.580us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          6.831us            0.00%          6.831us          6.831us            0.00%          0.800us          0.800us                1
ReluBackward1                                   0.00%         41.371us            0.00%         41.371us         41.371us            0.00%         47.104us         47.104us                1
threshold_backward                              0.00%         29.670us            0.00%         29.670us         29.670us            0.00%         43.393us         43.393us                1
AddBackward0                                    0.00%          4.590us            0.00%          4.590us          4.590us            0.00%          1.471us          1.471us                1
CudnnBatchNormBackward                          0.00%         86.972us            0.00%         86.972us         86.972us            0.00%         56.735us         56.735us                1
contiguous                                      0.00%          4.840us            0.00%          4.840us          4.840us            0.00%          2.048us          2.048us                1
cudnn_batch_norm_backward                       0.00%         51.351us            0.00%         51.351us         51.351us            0.00%         49.632us         49.632us                1
torch::autograd::AccumulateGrad                 0.00%          6.170us            0.00%          6.170us          6.170us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          8.880us            0.00%          8.880us          8.880us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.00%        144.583us            0.00%        144.583us        144.583us            0.03%          1.407ms          1.407ms                1
cudnn_convolution_backward                      0.00%        131.113us            0.00%        131.113us        131.113us            0.03%          1.404ms          1.404ms                1
torch::autograd::CopyBackwards                  0.00%         58.572us            0.00%         58.572us         58.572us            0.00%         34.880us         34.880us                1
to                                              0.00%         47.011us            0.00%         47.011us         47.011us            0.00%         32.544us         32.544us                1
empty                                           0.00%         14.510us            0.00%         14.510us         14.510us            0.00%          2.049us          2.049us                1
torch::autograd::AccumulateGrad                 0.00%          6.430us            0.00%          6.430us          6.430us            0.00%          1.888us          1.888us                1
ReluBackward1                                   0.00%         39.851us            0.00%         39.851us         39.851us            0.00%         28.960us         28.960us                1
threshold_backward                              0.00%         28.581us            0.00%         28.581us         28.581us            0.00%         26.623us         26.623us                1
CudnnBatchNormBackward                          0.00%         76.212us            0.00%         76.212us         76.212us            0.00%         35.744us         35.744us                1
contiguous                                      0.00%          4.850us            0.00%          4.850us          4.850us            0.00%          2.047us          2.047us                1
cudnn_batch_norm_backward                       0.00%         48.611us            0.00%         48.611us         48.611us            0.00%         30.016us         30.016us                1
torch::autograd::AccumulateGrad                 0.00%          5.940us            0.00%          5.940us          5.940us            0.00%          1.504us          1.504us                1
torch::autograd::AccumulateGrad                 0.00%          8.201us            0.00%          8.201us          8.201us            0.00%          2.049us          2.049us                1
CudnnConvolutionBackward                        0.03%          1.334ms            0.03%          1.334ms          1.334ms            0.03%          1.620ms          1.620ms                1
cudnn_convolution_backward                      0.03%          1.317ms            0.03%          1.317ms          1.317ms            0.03%          1.617ms          1.617ms                1
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 4.869s
CUDA time total: 5.038s

@bthyreau

I also clearly observed this problem: an RTX 2080 is slower than a GTX 1080.
Here is a minimal script that already exhibits the difference:

import numpy as np
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0")
print("torch ", torch.__version__)
print("cudnn ", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(device.index) if device.type == "cuda" else "cpu")

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv3d(1, 8, 3)
        self.conv2 = nn.Conv3d(8, 8, 3)
        self.conv3 = nn.Conv3d(8, 8, 3)
        self.conv4 = nn.Conv3d(8, 1, 3)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        return x.mean()

net = Model()
net.to(device)

input_data = np.zeros((1,1,64,64,64), np.float32)
inputs = torch.from_numpy(input_data).to(device, dtype=torch.float32)

for it in range(1, 161):

    loss = net(inputs[:, :1])
    loss.backward()

    if it == 20:  # ignore warmup
        torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
        start_time = time.time()
    if it % 40 == 0:
        torch.cuda.synchronize()  # CUDA launches are asynchronous; sync before reading the clock
        print("%d (%.2f s)" % (it, time.time() - start_time))

@bthyreau

bthyreau commented Oct 25, 2019

Running the script above on an RTX 2080 Ti (CentOS 7 machine, CUDA 10.1):

GeForce RTX 2080 Ti GeForce RTX 2080 Ti GeForce RTX 2080 Ti
torch 1.0.0 torch 1.1.0 torch 1.3.0
cudnn 7401 cudnn 7501 cudnn 7603
40 (3.87 s) 40 (3.30 s) 40 (2.77 s)
80 (11.65 s) 80 (10.89 s) 80 (9.10 s)
120 (19.42 s) 120 (18.51 s) 120 (15.44 s)
160 (27.20 s) 160 (26.15 s) 160 (21.78 s)

With the older Quadro P4000 (roughly equivalent to a GTX 1070), on the same machine/config:

Quadro P4000 Quadro P4000 Quadro P4000
torch 1.0.0 torch 1.1.0 torch 1.3.0
cudnn 7401 cudnn 7501 cudnn 7603
40 (0.82 s) 40 (0.80 s) 40 (2.50 s)
80 (2.35 s) 80 (2.34 s) 80 (7.51 s)
120 (3.89 s) 120 (3.88 s) 120 (12.51 s)
160 (5.42 s) 160 (5.42 s) 160 (17.51 s)

Note, regarding the last column of the Quadro table above: another GTX 1080 (CentOS 7, different machine) running the same script under different torch+cudnn versions also exhibits a large performance difference, with older torch+cudnn combinations performing better. This is not directly related to the issue here, though.
All runs use conda binaries, except the last one, which uses pip binaries.

GeForce GTX 1080 GeForce GTX 1080 GeForce GTX 1080 GeForce GTX 1080 GeForce GTX 1080
torch 1.0.0 torch 1.1.0 torch 1.2.0 torch 1.3.0 torch 1.3.0+cu100 pip
cudnn 7401 cudnn 7501 cudnn 7602 cudnn 7603 cudnn 7603
40 (0.68 s) 40 (0.69 s) 40 (2.01 s) 40 (2.18 s) 40 (2.01 s)
80 (2.04 s) 80 (2.05 s) 80 (6.04 s) 80 (6.55 s) 80 (6.02 s)
120 (3.40 s) 120 (3.42 s) 120 (10.06 s) 120 (10.92 s) 120 (10.04 s)
160 (4.76 s) 160 (4.78 s) 160 (14.08 s) 160 (15.28 s) 160 (14.05 s)

@bthyreau

For the script just above, the default algorithm selection was apparently inadequate:
setting torch.backends.cudnn.benchmark = True improved the behavior drastically.

GeForce RTX 2080 Ti GeForce RTX 2080 Ti benchmark=True
torch 1.3.0 torch 1.3.0
cudnn 7603 cudnn 7603
40 (2.77 s) 40 (0.18 s)
80 (9.10 s) 80 (0.68 s)
120 (15.44 s) 120 (1.17 s)
160 (21.78 s) 160 (1.66 s)
In the cuDNN logs, these calls are only performed in the slower (benchmark=False) version:
...
I! CuDNN (v7604) function cudnnGetConvolutionForwardAlgorithm_v7() called:
I! CuDNN (v7604) function cudnnGetConvolutionForwardAlgorithmMaxCount() called:
...
I! CuDNN (v7604) function cudnnGetConvolutionBackwardDataAlgorithm_v7() called:
I! CuDNN (v7604) function cudnnGetConvolutionBackwardDataAlgorithmMaxCount() called:
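The mechanism behind the speedup above is that with benchmark=True, the fastest algorithm is measured once per input configuration and then cached and reused. As a rough, torch-free sketch of that caching idea (this is not PyTorch's or cuDNN's actual implementation; pick_algorithm and the two stand-in "algorithms" below are made up for illustration):

```python
import time

# Minimal sketch of benchmark-style algorithm selection: for each new
# input shape, time every candidate once, cache the winner, and reuse
# it on subsequent calls with the same shape.

_algo_cache = {}  # maps input shape -> fastest candidate found so far

def pick_algorithm(shape, candidates):
    """Return the cached fastest candidate for `shape`, benchmarking on first use."""
    if shape not in _algo_cache:
        timings = []
        for algo in candidates:
            t0 = time.perf_counter()
            algo(shape)  # one trial run of this candidate
            timings.append((time.perf_counter() - t0, algo))
        _algo_cache[shape] = min(timings, key=lambda t: t[0])[1]
    return _algo_cache[shape]

# Two stand-in "algorithms" with clearly different costs.
def slow_conv(shape):
    time.sleep(0.01)

def fast_conv(shape):
    time.sleep(0.001)

best = pick_algorithm((1, 1, 64, 64, 64), [slow_conv, fast_conv])
print(best.__name__)  # the faster candidate wins and is cached for this shape
```

This also explains why the first few iterations with benchmark=True are slow (the tuning runs) and should be excluded from timing, as the warmup in the script above already does.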

@soad89

soad89 commented Feb 14, 2020

Hi, I've found a comparable performance discrepancy when running inference of a Mask R-CNN model with an RTX 2070 vs. a TITAN RTX.

The RTX 2070 shows a significantly faster run, especially on batch_norm.
Note that torch.backends.cudnn.benchmark = True did not improve the behaviour.

Tested on Windows 10, PyTorch 1.4, CUDA 10.1, just by swapping the GPU in the same system.
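One possible explanation for benchmark=True not helping here (an assumption on my part, not confirmed by these profiles) is that detection models like Mask R-CNN see many different input shapes, and the benchmarked algorithm choice is cached per shape, so every new shape pays the tuning cost again. A stdlib sketch of that effect, with made-up relative costs:

```python
# Sketch (assumption, not PyTorch internals): a per-shape algorithm cache
# pays the full tuning cost on every cache miss, so workloads where every
# call has a new input shape never amortize the tuning.

TUNE_COST = 5  # relative cost of benchmarking all candidates once
RUN_COST = 1   # relative cost of one tuned run

def total_cost(shapes):
    cache = set()
    cost = 0
    for s in shapes:
        if s not in cache:
            cost += TUNE_COST  # cache miss: benchmark candidates for this shape
            cache.add(s)
        cost += RUN_COST       # run with the tuned algorithm
    return cost

fixed = [(64, 64)] * 100                      # same shape every call
varied = [(64, 64 + i) for i in range(100)]   # new shape every call

print(total_cost(fixed))   # 105: tuned once, then cheap
print(total_cost(varied))  # 600: re-tuned on every call
```

With a fixed input size the tuning is a one-time cost; with a new shape on every call it dominates, which would make benchmark=True neutral or even harmful for variable-size inference.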

Profile for RTX2070

-------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                             Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes

-------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
to                               0.00%            7.600us          0.00%            7.600us          7.600us          0.00%            2.367us          2.367us          1                []

to                               1.63%            3.473ms          1.63%            3.473ms          3.473ms          0.57%            3.401ms          3.401ms          1                []

empty                            0.03%            53.400us         0.03%            53.400us         53.400us         0.30%            1.749ms          1.749ms          1                []

sub                              0.03%            71.500us         0.03%            71.500us         71.500us         0.02%            102.914us        102.914us        1                []

div                              0.02%            38.300us         0.02%            38.300us         38.300us         0.02%            129.410us        129.410us        1                []

unsqueeze                        0.00%            10.500us         0.00%            10.500us         10.500us         0.00%            2.047us          2.047us          1                []

conv2d                           0.08%            177.900us        0.08%            177.900us        177.900us        0.10%            613.344us        613.344us        1                []

convolution                      0.07%            146.400us        0.07%            146.400us        146.400us        0.10%            576.160us        576.160us        1                []

_convolution                     0.06%            133.400us        0.06%            133.400us        133.400us        0.09%            557.055us        557.055us        1                []

contiguous                       0.01%            15.500us         0.01%            15.500us         15.500us         0.00%            12.352us         12.352us         1                []

cudnn_convolution                0.04%            94.100us         0.04%            94.100us         94.100us         0.08%            495.617us        495.617us        1                []

batch_norm                       0.07%            145.300us        0.07%            145.300us        145.300us        0.08%            481.211us        481.211us        1                []

_batch_norm_impl_index           0.06%            131.500us        0.06%            131.500us        131.500us        0.08%            461.086us        461.086us        1                []

contiguous                       0.01%            12.000us         0.01%            12.000us         12.000us         0.00%            7.867us          7.867us          1                []

contiguous                       0.00%            8.900us          0.00%            8.900us          8.900us          0.01%            41.566us         41.566us         1                []

contiguous                       0.00%            9.200us          0.00%            9.200us          9.200us          0.00%            14.531us         14.531us         1                []

contiguous                       0.00%            8.700us          0.00%            8.700us          8.700us          0.01%            34.816us         34.816us         1                []

contiguous                       0.00%            8.900us          0.00%            8.900us          8.900us          0.00%            15.359us         15.359us         1                []

cudnn_batch_norm                 0.03%            62.000us         0.03%            62.000us         62.000us         0.05%            294.914us        294.914us        1                []

relu_                            0.01%            26.700us         0.01%            26.700us         26.700us         0.04%            226.113us        226.113us        1                []

max_pool2d                       0.03%            69.500us         0.03%            69.500us         69.500us         0.04%            264.191us        264.191us        1                []

max_pool2d_with_indices          0.03%            53.500us         0.03%            53.500us         53.500us         0.04%            231.332us        231.332us        1                []

conv2d                           0.06%            122.800us        0.06%            122.800us        122.800us        0.04%            232.254us        232.254us        1                []

convolution                      0.05%            108.700us        0.05%            108.700us        108.700us        0.03%            192.484us        192.484us        1                []

_convolution                     0.04%            95.600us         0.04%            95.600us         95.600us         0.03%            174.305us        174.305us        1                []

contiguous                       0.01%            13.800us         0.01%            13.800us         13.800us         0.00%            16.387us         16.387us         1                []

cudnn_convolution                0.03%            62.900us         0.03%            62.900us         62.900us         0.02%            116.730us        116.730us        1                []

batch_norm                       0.06%            137.400us        0.06%            137.400us        137.400us        0.05%            318.402us        318.402us        1                []

_batch_norm_impl_index           0.06%            124.400us        0.06%            124.400us        124.400us        0.05%            299.648us        299.648us        1                []

contiguous                       0.01%            11.600us         0.01%            11.600us         11.600us         0.00%            9.660us          9.660us          1                []

contiguous                       0.01%            11.200us         0.01%            11.200us         11.200us         0.01%            41.504us         41.504us         1                []

contiguous                       0.00%            9.600us          0.00%            9.600us          9.600us          0.00%            15.359us         15.359us         1                []

contiguous                       0.00%            8.500us          0.00%            8.500us          8.500us          0.01%            65.664us         65.664us         1                []

contiguous                       0.01%            10.900us         0.01%            10.900us         10.900us         0.00%            15.230us         15.230us         1                []

cudnn_batch_norm                 0.02%            52.700us         0.02%            52.700us         52.700us         0.02%            109.504us        109.504us        1                []

relu_                            0.02%            40.000us         0.02%            40.000us         40.000us         0.01%            63.969us         63.969us         1                []

conv2d                           0.07%            141.300us        0.07%            141.300us        141.300us        0.08%            473.887us        473.887us        1                []

convolution                      0.06%            129.100us        0.06%            129.100us        129.100us        0.07%            437.277us        437.277us        1                []

_convolution                     0.05%            117.300us        0.05%            117.300us        117.300us        0.07%            418.082us        418.082us        1                []

contiguous                       0.01%            16.300us         0.01%            16.300us         16.300us         0.00%            28.676us         28.676us         1                []

cudnn_convolution                0.04%            81.800us         0.04%            81.800us         81.800us         0.06%            355.391us        355.391us        1                []

batch_norm                       0.07%            149.300us        0.07%            149.300us        149.300us        0.05%            298.688us        298.688us        1                []

_batch_norm_impl_index           0.06%            136.500us        0.06%            136.500us        136.500us        0.05%            278.531us        278.531us        1                []

contiguous                       0.01%            11.700us         0.01%            11.700us         11.700us         0.00%            9.027us          9.027us          1                []

contiguous                       0.00%            10.100us         0.00%            10.100us         10.100us         0.01%            44.156us         44.156us         1                []

contiguous                       0.00%            9.500us          0.00%            9.500us          9.500us          0.00%            14.016us         14.016us         1                []

contiguous                       0.00%            9.500us          0.00%            9.500us          9.500us          0.01%            36.125us         36.125us         1                []

contiguous                       0.01%            28.100us         0.01%            28.100us         28.100us         0.00%            14.750us         14.750us         1                []

cudnn_batch_norm                 0.02%            48.700us         0.02%            48.700us         48.700us         0.02%            114.879us        114.879us        1                []

relu_                            0.01%            22.200us         0.01%            22.200us         22.200us         0.01%            63.488us         63.488us         1                []

conv2d                           0.05%            108.400us        0.05%            108.400us        108.400us        0.06%            378.590us        378.590us        1                []

convolution                      0.05%            97.000us         0.05%            97.000us         97.000us         0.06%            342.559us        342.559us        1                []

_convolution                     0.04%            85.100us         0.04%            85.100us         85.100us         0.05%            323.582us        323.582us        1                []

contiguous                       0.01%            12.900us         0.01%            12.900us         12.900us         0.00%            28.254us         28.254us         1                []

cudnn_convolution                0.03%            55.800us         0.03%            55.800us         55.800us         0.04%            252.383us        252.383us        1                []

batch_norm                       0.06%            128.200us        0.06%            128.200us        128.200us        0.16%            938.336us        938.336us        1                []

_batch_norm_impl_index           0.05%            115.800us        0.05%            115.800us        115.800us        0.16%            919.582us        919.582us        1                []

contiguous                       0.01%            11.800us         0.01%            11.800us         11.800us         0.00%            9.504us          9.504us          1                []

contiguous                       0.00%            9.000us          0.00%            9.000us          9.000us          0.01%            39.262us         39.262us         1                []

contiguous                       0.00%            9.100us          0.00%            9.100us          9.100us          0.00%            14.336us         14.336us         1                []

contiguous                       0.00%            9.100us          0.00%            9.100us          9.100us          0.09%            503.805us        503.805us        1                []

contiguous                       0.00%            8.900us          0.00%            8.900us          8.900us          0.00%            15.391us         15.391us         1                []

cudnn_batch_norm                 0.02%            48.200us         0.02%            48.200us         48.200us         0.05%            281.664us        281.664us        1                []

conv2d                           0.05%            104.700us        0.05%            104.700us        104.700us        0.06%            383.297us        383.297us        1                []

convolution                      0.04%            93.800us         0.04%            93.800us         93.800us         0.06%            346.113us        346.113us        1                []

_convolution                     0.04%            82.200us         0.04%            82.200us         82.200us         0.06%            327.777us        327.777us        1                []

contiguous                       0.01%            12.000us         0.01%            12.000us         12.000us         0.00%            29.277us         29.277us         1                []

cudnn_convolution                0.03%            55.000us         0.03%            55.000us         55.000us         0.04%            253.027us        253.027us        1                []

batch_norm                       0.06%            129.100us        0.06%            129.100us        129.100us        0.08%            466.945us        466.945us        1                []

_batch_norm_impl_index           0.05%            116.400us        0.05%            116.400us        116.400us        0.08%            446.340us        446.340us        1                []

contiguous                       0.01%            11.600us         0.01%            11.600us         11.600us         0.00%            8.668us          8.668us          1                []

contiguous                       0.00%            9.700us          0.00%            9.700us          9.700us          0.01%            39.137us         39.137us         1                []

contiguous                       0.00%            9.400us          0.00%            9.400us          9.400us          0.00%            14.496us         14.496us         1                []

contiguous                       0.00%            9.300us          0.00%            9.300us          9.300us          0.01%            35.168us         35.168us         1                []

contiguous                       0.00%            8.100us          0.00%            8.100us          8.100us          0.00%            14.625us         14.625us         1                []

cudnn_batch_norm                 0.02%            48.800us         0.02%            48.800us         48.800us         0.05%            280.832us        280.832us        1                []

add_                             0.01%            21.700us         0.01%            21.700us         21.700us         0.05%            318.305us        318.305us        1                []

relu_                            0.01%            20.700us         0.01%            20.700us         20.700us         0.04%            229.379us        229.379us        1                []

conv2d                           0.05%            107.700us        0.05%            107.700us        107.700us        0.06%            338.812us        338.812us        1                []

convolution                      0.04%            95.600us         0.04%            95.600us         95.600us         0.05%            304.641us        304.641us        1                []

_convolution                     0.04%            84.700us         0.04%            84.700us         84.700us         0.05%            286.270us        286.270us        1                []

contiguous                       0.01%            13.500us         0.01%            13.500us         13.500us         0.00%            28.578us         28.578us         1                []

cudnn_convolution                0.03%            54.800us         0.03%            54.800us         54.800us         0.04%            212.191us        212.191us        1                []

batch_norm                       0.06%            127.500us        0.06%            127.500us        127.500us        0.05%            290.977us        290.977us        1                []

_batch_norm_impl_index           0.05%            116.200us        0.05%            116.200us        116.200us        0.05%            272.031us        272.031us        1                []

contiguous                       0.01%            11.800us         0.01%            11.800us         11.800us         0.00%            9.664us          9.664us          1                []

contiguous                       0.00%            9.000us          0.00%            9.000us          9.000us          0.01%            36.801us         36.801us         1                []

contiguous                       0.00%            8.900us          0.00%            8.900us          8.900us          0.00%            14.531us         14.531us         1                []

contiguous                       0.00%            8.700us          0.00%            8.700us          8.700us          0.01%            38.879us         38.879us         1                []

contiguous                       0.00%            9.100us          0.00%            9.100us          9.100us          0.00%            14.594us         14.594us         1                []

cudnn_batch_norm                 0.02%            49.400us         0.02%            49.400us         49.400us         0.02%            106.016us        106.016us        1                []

relu_                            0.01%            22.300us         0.01%            22.300us         22.300us         0.01%            71.680us         71.680us         1                []

conv2d                           0.05%            113.900us        0.05%            113.900us        113.900us        0.08%            471.969us        471.969us        1                []

convolution                      0.05%            103.100us        0.05%            103.100us        103.100us        0.07%            436.219us        436.219us        1                []

_convolution                     0.04%            92.600us         0.04%            92.600us         92.600us         0.07%            415.840us        415.840us        1                []

contiguous                       0.01%            11.200us         0.01%            11.200us         11.200us         0.00%            26.367us         26.367us         1                []

cudnn_convolution                0.03%            66.000us         0.03%            66.000us         66.000us         0.06%            354.305us        354.305us        1                []

batch_norm                       0.06%            126.800us        0.06%            126.800us        126.800us        0.05%            290.754us        290.754us        1                []

_batch_norm_impl_index           0.05%            114.600us        0.05%            114.600us        114.600us        0.05%            269.090us        269.090us        1                []

contiguous                       0.01%            11.600us         0.01%            11.600us         11.600us         0.00%            9.848us          9.848us          1                []

-------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 213.563ms
CUDA time total: 592.358ms

and for TITAN RTX:

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
to                           0.00%            25.375us         0.00%            25.375us         25.375us         0.00%            2.910us          2.910us          1                []

to                           0.56%            4.238ms          0.56%            4.238ms          4.238ms          0.28%            4.259ms          4.259ms          1                []

empty                        0.01%            59.625us         0.01%            59.625us         59.625us         0.10%            1.493ms          1.493ms          1                []

sub                          0.02%            141.208us        0.02%            141.208us        141.208us        0.03%            448.801us        448.801us        1                []

div                          0.01%            67.916us         0.01%            67.916us         67.916us         0.04%            642.594us        642.594us        1                []

unsqueeze                    0.00%            19.917us         0.00%            19.917us         19.917us         0.00%            2.496us          2.496us          1                []

conv2d                       0.06%            477.917us        0.06%            477.917us        477.917us        0.12%            1.827ms          1.827ms          1                []

convolution                  0.06%            431.625us        0.06%            431.625us        431.625us        0.11%            1.656ms          1.656ms          1                []

_convolution                 0.04%            290.125us        0.04%            290.125us        290.125us        0.09%            1.326ms          1.326ms          1                []

contiguous                   0.00%            20.000us         0.00%            20.000us         20.000us         0.03%            398.238us        398.238us        1                []

cudnn_convolution            0.03%            194.833us        0.03%            194.833us        194.833us        0.05%            723.902us        723.902us        1                []

batch_norm                   0.09%            664.875us        0.09%            664.875us        664.875us        0.20%            3.006ms          3.006ms          1                []

_batch_norm_impl_index       0.08%            592.292us        0.08%            592.292us        592.292us        0.18%            2.676ms          2.676ms          1                []

contiguous                   0.00%            34.958us         0.00%            34.958us         34.958us         0.03%            446.047us        446.047us        1                []

contiguous                   0.00%            25.041us         0.00%            25.041us         25.041us         0.02%            231.422us        231.422us        1                []

contiguous                   0.00%            16.375us         0.00%            16.375us         16.375us         0.01%            148.223us        148.223us        1                []

contiguous                   0.01%            45.292us         0.01%            45.292us         45.292us         0.02%            373.477us        373.477us        1                []

contiguous                   0.01%            79.292us         0.01%            79.292us         79.292us         0.01%            181.309us        181.309us        1                []

cudnn_batch_norm             0.02%            140.666us        0.02%            140.666us        140.666us        0.05%            755.199us        755.199us        1                []

relu_                        0.01%            66.250us         0.01%            66.250us         66.250us         0.01%            163.395us        163.395us        1                []

max_pool2d                   0.03%            192.458us        0.03%            192.458us        192.458us        0.06%            953.824us        953.824us        1                []

max_pool2d_with_indices      0.02%            135.875us        0.02%            135.875us        135.875us        0.03%            520.383us        520.383us        1                []

conv2d                       0.06%            441.666us        0.06%            441.666us        441.666us        0.12%            1.899ms          1.899ms          1                []

convolution                  0.05%            400.291us        0.05%            400.291us        400.291us        0.11%            1.645ms          1.645ms          1                []

_convolution                 0.04%            329.000us        0.04%            329.000us        329.000us        0.09%            1.421ms          1.421ms          1                []

contiguous                   0.01%            62.625us         0.01%            62.625us         62.625us         0.01%            214.309us        214.309us        1                []

cudnn_convolution            0.03%            211.542us        0.03%            211.542us        211.542us        0.04%            637.664us        637.664us        1                []

batch_norm                   0.10%            773.042us        0.10%            773.042us        773.042us        0.15%            2.366ms          2.366ms          1                []

_batch_norm_impl_index       0.09%            700.083us        0.09%            700.083us        700.083us        0.14%            2.136ms          2.136ms          1                []

contiguous                   0.00%            34.000us         0.00%            34.000us         34.000us         0.01%            176.125us        176.125us        1                []

contiguous                   0.01%            86.291us         0.01%            86.291us         86.291us         0.02%            364.000us        364.000us        1                []

contiguous                   0.01%            46.292us         0.01%            46.292us         46.292us         0.01%            167.547us        167.547us        1                []

contiguous                   0.00%            29.958us         0.00%            29.958us         29.958us         0.03%            432.320us        432.320us        1                []

contiguous                   0.00%            25.000us         0.00%            25.000us         25.000us         0.01%            129.121us        129.121us        1                []

cudnn_batch_norm             0.02%            164.541us        0.02%            164.541us        164.541us        0.03%            493.570us        493.570us        1                []

relu_                        0.01%            71.292us         0.01%            71.292us         71.292us         0.03%            489.469us        489.469us        1                []

conv2d                       0.07%            561.167us        0.07%            561.167us        561.167us        0.12%            1.872ms          1.872ms          1                []

convolution                  0.06%            461.292us        0.06%            461.292us        461.292us        0.10%            1.460ms          1.460ms          1                []

_convolution                 0.04%            312.375us        0.04%            312.375us        312.375us        0.08%            1.257ms          1.257ms          1                []

contiguous                   0.00%            20.333us         0.00%            20.333us         20.333us         0.02%            359.012us        359.012us        1                []

cudnn_convolution            0.02%            184.500us        0.02%            184.500us        184.500us        0.04%            542.113us        542.113us        1                []

batch_norm                   0.10%            735.417us        0.10%            735.417us        735.417us        0.18%            2.713ms          2.713ms          1                []

_batch_norm_impl_index       0.09%            678.833us        0.09%            678.833us        678.833us        0.15%            2.297ms          2.297ms          1                []

contiguous                   0.02%            113.916us        0.02%            113.916us        113.916us        0.01%            184.164us        184.164us        1                []

contiguous                   0.00%            31.292us         0.00%            31.292us         31.292us         0.02%            334.496us        334.496us        1                []

contiguous                   0.01%            59.958us         0.01%            59.958us         59.958us         0.01%            192.512us        192.512us        1                []

contiguous                   0.01%            48.583us         0.01%            48.583us         48.583us         0.03%            489.469us        489.469us        1                []

contiguous                   0.00%            34.000us         0.00%            34.000us         34.000us         0.02%            253.797us        253.797us        1                []

cudnn_batch_norm             0.01%            84.250us         0.01%            84.250us         84.250us         0.05%            693.406us        693.406us        1                []

relu_                        0.01%            82.958us         0.01%            82.958us         82.958us         0.01%            149.566us        149.566us        1                []

conv2d                       0.04%            316.042us        0.04%            316.042us        316.042us        0.12%            1.810ms          1.810ms          1                []

convolution                  0.03%            231.125us        0.03%            231.125us        231.125us        0.11%            1.609ms          1.609ms          1                []

_convolution                 0.02%            169.792us        0.02%            169.792us        169.792us        0.09%            1.319ms          1.319ms          1                []

contiguous                   0.00%            24.959us         0.00%            24.959us         24.959us         0.03%            402.141us        402.141us        1                []

cudnn_convolution            0.01%            69.958us         0.01%            69.958us         69.958us         0.03%            387.078us        387.078us        1                []

batch_norm                   0.07%            548.917us        0.07%            548.917us        548.917us        0.20%            3.110ms          3.110ms          1                []

_batch_norm_impl_index       0.07%            518.917us        0.07%            518.917us        518.917us        0.18%            2.728ms          2.728ms          1                []

contiguous                   0.01%            60.291us         0.01%            60.291us         60.291us         0.03%            515.742us        515.742us        1                []

contiguous                   0.01%            90.583us         0.01%            90.583us         90.583us         0.02%            327.617us        327.617us        1                []

contiguous                   0.00%            16.333us         0.00%            16.333us         16.333us         0.03%            382.055us        382.055us        1                []

contiguous                   0.00%            24.959us         0.00%            24.959us         24.959us         0.02%            256.477us        256.477us        1                []

contiguous                   0.00%            30.000us         0.00%            30.000us         30.000us         0.02%            229.367us        229.367us        1                []

cudnn_batch_norm             0.01%            101.292us        0.01%            101.292us        101.292us        0.03%            417.789us        417.789us        1                []

conv2d                       0.05%            397.084us        0.05%            397.084us        397.084us        0.11%            1.615ms          1.615ms          1                []

convolution                  0.05%            359.459us        0.05%            359.459us        359.459us        0.08%            1.224ms          1.224ms          1                []

_convolution                 0.04%            285.459us        0.04%            285.459us        285.459us        0.07%            999.172us        999.172us        1                []

contiguous                   0.01%            90.250us         0.01%            90.250us         90.250us         0.02%            235.938us        235.938us        1                []

cudnn_convolution            0.01%            82.625us         0.01%            82.625us         82.625us         0.04%            577.281us        577.281us        1                []

batch_norm                   0.11%            799.334us        0.11%            799.334us        799.334us        0.21%            3.154ms          3.154ms          1                []

_batch_norm_impl_index       0.09%            710.417us        0.09%            710.417us        710.417us        0.20%            2.988ms          2.988ms          1                []

contiguous                   0.02%            154.250us        0.02%            154.250us        154.250us        0.02%            376.828us        376.828us        1                []

contiguous                   0.00%            19.959us         0.00%            19.959us         19.959us         0.02%            260.000us        260.000us        1                []

contiguous                   0.01%            40.000us         0.01%            40.000us         40.000us         0.04%            579.070us        579.070us        1                []

contiguous                   0.00%            26.292us         0.00%            26.292us         26.292us         0.02%            345.945us        345.945us        1                []

contiguous                   0.00%            34.958us         0.00%            34.958us         34.958us         0.02%            257.508us        257.508us        1                []

cudnn_batch_norm             0.01%            68.958us         0.01%            68.958us         68.958us         0.04%            639.430us        639.430us        1                []

add_                         0.01%            39.916us         0.01%            39.916us         39.916us         0.03%            409.945us        409.945us        1                []

relu_                        0.01%            52.292us         0.01%            52.292us         52.292us         0.02%            364.023us        364.023us        1                []

conv2d                       0.08%            587.125us        0.08%            587.125us        587.125us        0.10%            1.552ms          1.552ms          1                []

convolution                  0.07%            523.208us        0.07%            523.208us        523.208us        0.08%            1.280ms          1.280ms          1                []

_convolution                 0.06%            416.625us        0.06%            416.625us        416.625us        0.07%            1.098ms          1.098ms          1                []

contiguous                   0.01%            45.334us         0.01%            45.334us         45.334us         0.01%            213.594us        213.594us        1                []

cudnn_convolution            0.03%            243.458us        0.03%            243.458us        243.458us        0.03%            499.328us        499.328us        1                []

batch_norm                   0.13%            966.208us        0.13%            966.208us        966.208us        0.23%            3.443ms          3.443ms          1                []

_batch_norm_impl_index       0.12%            872.292us        0.12%            872.292us        872.292us        0.21%            3.229ms          3.229ms          1                []

contiguous                   0.01%            45.292us         0.01%            45.292us         45.292us         0.06%            864.508us        864.508us        1                []

contiguous                   0.01%            84.250us         0.01%            84.250us         84.250us         0.02%            329.438us        329.438us        1                []

contiguous                   0.01%            107.916us        0.01%            107.916us        107.916us        0.02%            337.922us        337.922us        1                []

contiguous                   0.02%            151.833us        0.02%            151.833us        151.833us        0.02%            234.305us        234.305us        1                []

contiguous                   0.01%            39.959us         0.01%            39.959us         39.959us         0.03%            430.078us        430.078us        1                []

cudnn_batch_norm             0.02%            113.875us        0.02%            113.875us        113.875us        0.03%            530.914us        530.914us        1                []

relu_                        0.02%            123.792us        0.02%            123.792us        123.792us        0.01%            131.070us        131.070us        1                []

conv2d                       0.06%            455.375us        0.06%            455.375us        455.375us        0.15%            2.238ms          2.238ms          1                []

convolution                  0.05%            359.083us        0.05%            359.083us        359.083us        0.13%            1.995ms          1.995ms          1                []

_convolution                 0.04%            292.458us        0.04%            292.458us        292.458us        0.09%            1.411ms          1.411ms          1                []

contiguous                   0.01%            41.292us         0.01%            41.292us         41.292us         0.02%            361.891us        361.891us        1                []

cudnn_convolution            0.01%            110.625us        0.01%            110.625us        110.625us        0.06%            846.078us        846.078us        1                []

batch_norm                   0.10%            740.459us        0.10%            740.459us        740.459us        0.15%            2.271ms          2.271ms          1                []

_batch_norm_impl_index       0.09%            675.541us        0.09%            675.541us        675.541us        0.14%            2.148ms          2.148ms          1                []

contiguous                   0.01%            49.917us         0.01%            49.917us         49.917us         0.02%            262.750us        262.750us        1                []

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 751.832ms
CUDA time total: 1.527s

@danieltudosiu
Copy link

Any progress on this?

@ngimel
Copy link
Collaborator

ngimel commented Mar 19, 2020

This is a pretty old issue comparing performance on now-old versions of pytorch and cudnn; tbh, I don't think it's going to be looked at in detail. If you have perf issues with current pytorch/cudnn versions, please file a new issue with a script that demonstrates the performance problem. Don't forget to set torch.backends.cudnn.benchmark=True, use a few warmup iterations, and benchmark and profile over several tens of iterations. Also, the reproducer should be reflective of a real use case, not an attempt to find problematic configurations. The convolution search space is very wide; there will always be convolution parameters that are unoptimized and give poor performance.
cc @ptrblck.
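The recipe above (benchmark mode, warmup iterations with a synchronization, then timing over many iterations) can be sketched as follows. The layer and input sizes here are placeholders, not taken from the original report; substitute the real model:

```python
import time
import torch
import torch.nn as nn

# Let cuDNN autotune convolution algorithms (safe when input sizes are static).
torch.backends.cudnn.benchmark = True

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder layer and batch; substitute your real model and input sizes.
model = nn.Conv2d(16, 16, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 16, 32, 32, device=device)

# Warmup: the first iterations pay for algorithm search and lazy initialization.
for _ in range(20):
    model(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()

# Timed region: average over many iterations for a stable number.
iters = 100
start = time.perf_counter()
for _ in range(iters):
    model(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{1e3 * elapsed / iters:.3f} ms/iter")
```

Without the synchronize calls, the timer would only measure kernel launch overhead, since CUDA kernels run asynchronously.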

@danieltudosiu
Copy link

danieltudosiu commented Mar 19, 2020

In my case, I am facing a 10x increase in iteration time after porting my code from Tensorflow to Pytorch. Upon profiling my code with both torch.utils.bottleneck.profiler and torch.cuda.profiler, I observed that the issue is in the backward pass: more precisely, in the weight-gradient update of the convolutions, and even more precisely, in the following kernel call:

void cudnn::detail::wgrad_alg1_nd_float_engine<float, int=3, int=0, int=5, int=7, int=4, int=3, int=5, bool=1, bool=1>(int, int, int, float const *, int, cudnn::detail::wgrad_alg1_nd_float_engine<float, int=3, int=0, int=5, int=7, int=4, int=3, int=5, bool=1, bool=1>*, float const , kernel_gradNd_params, int, float, float, int, int, int*, kernel_gradNd_params)

I can't publicly release my code at the moment; as soon as my publication is accepted, I will. But I can reproduce it consistently on my machine, which has a TitanV, and on a DGX1, which has a V100. If I use benchmark=True I get even worse performance than with benchmark=False, even though nothing in my network is dynamic in size.

@ngimel
Copy link
Collaborator

ngimel commented Mar 19, 2020

I'm sorry, the information you've provided is not actionable. With proper benchmarking, benchmark=True does not give worse performance unless your input sizes change at each iteration. You can try using pyprof (https://github.com/NVIDIA/apex/tree/master/apex/pyprof) to identify the problematic operation, but that is again subject to proper benchmarking: enough warm-up iterations and enough benchmarking iterations.

@danieltudosiu
Copy link

Thank you for responding @ngimel. For me to give you the required data I have the following questions:

  • Do you recommend PyProf over torch.cuda.profiler + torch.bottleneck.profiler.emit_nvtx + NvProf?
  • Do you want me to run with only torch.backends.cudnn.benchmark=True or with both True and False?
  • How many warm-up iterations are enough?
  • How many profile iterations are enough?
  • What files do you want me to send to you?
  • Can I share my code with the contributors purely for reproducibility?

@ngimel
Copy link
Collaborator

ngimel commented Mar 19, 2020

pyprof provides more detailed information than torch.autograd.profiler (e.g. torch.autograd.profiler does not record all the necessary shapes). If you are certain that your input sizes don't change every iteration, running with benchmark=True is enough; otherwise, run with both benchmark=True and benchmark=False. There should be, say, 20 warmup iterations (with a synchronization after them) and 100 benchmarking iterations; you can change those numbers if you see that your results are flaky.
The ideal case for the files you share is a small self-contained reproducer with no dependencies that runs on random data: that would allow whoever works on the problem to make their own tweaks to the runs and benchmarking, and to try, say, newer builds of pytorch or cudnn. If that's not possible, pyprof should hopefully provide enough information to identify where the slow wgrad kernel is coming from (one would need all the convolution parameters, such as kernel size, padding, stride, and dilation, plus the input sizes).
As for sharing your code: since there are no formal NDA agreements here, that's up to you, but really, what we are interested in is a small snippet demonstrating the poor performance.
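A minimal self-contained reproducer of the kind described might look like the sketch below. The 3D conv parameters and input shape are illustrative only, not taken from the thread; they should be replaced with the ones from the slow network:

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative parameters only; substitute the kernel size, stride, padding,
# channel counts, and input shape from the actual slow network.
conv = nn.Conv3d(1, 16, kernel_size=3, padding=1).to(device)
x = torch.randn(1, 1, 24, 24, 24, device=device, requires_grad=True)

# The slow kernel reported above is a weight-gradient (wgrad) kernel,
# so the reproducer must run the backward pass, not just the forward.
out = conv(x)
out.sum().backward()
assert conv.weight.grad is not None
```

Running on random data keeps the reproducer dependency-free while still exercising the same cuDNN algorithm selection.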

@danieltudosiu
Copy link

@ngimel Thank you very much for the swift response!

My network is a particular implementation of the VQ-VAE 2 paper and besides the quantization process, it is fully convolutional with a standard input size of [3,1,192,256,192].

I observed that PyProf does not cover transposed convolutions, which I use heavily in my network. Will that be a big problem?

Should I post the links to the files here after I generate them?

@ngimel
Copy link
Collaborator

ngimel commented Mar 19, 2020

Hm, then I'm surprised that cudnn.benchmark hurts: it's possible that it would not help, but it should not hurt. I've verified with the pyprof author that transposed convolution is supported, in the sense that it will capture all the relevant shapes and parameters, but it won't compute achieved flops/bandwidth. If some convolutions are egregiously slow, though, just looking at the timing should be enough. Yes, please post the links to the files here (including the sql file generated by pyprof).

@adityaiitb
Copy link

adityaiitb commented Mar 19, 2020

@danieltudosiu

(wrt PyProf): I haven't yet added bytes/flops calculations for transposed convolution, but all the other useful information, e.g. the call stack (file name, line number), tensor shape, datatype and attributes (strides, padding, dilation), kernel duration, etc., is captured. The information is visible in the intermediate human-readable dictionary (every line of output corresponds to the same line in the dictionary). The final output is not polished, I agree. As a stopgap, if you can post the sql file, I can help.

e.g. if you had

import torch
import torch.nn as nn

m = nn.ConvTranspose2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2)).cuda()
input = torch.randn(20, 16, 50, 100).cuda()
output = m(input)

the intermediate file will have the following information (it's uglier, so I pretty-printed it here). Look at the information contained in args; its signature is the same as torch.nn.functional.conv_transpose2d:

{
'kShortName': 'cudnn::detail::dgrad2d_alg1_1',
'kDuration': 1849853, 
'layer': [],
'trace': ['conv2d_transpose.py:17'],
'reprMarkers': [],
'marker': ["{'mod': 'torch.nn.functional', 'op': 'conv_transpose2d',
   'args': [{'name': '', 'type': 'tensor', 'shape': (20, 16, 50, 100), 'dtype': 'float32'},
              {'name': '', 'type': 'tensor', 'shape': (16, 33, 3, 5), 'dtype': 'float32'},
              {'name': '', 'type': 'tensor', 'shape': (33,), 'dtype': 'float32'},
              {'name': '', 'type': 'tuple', 'value': (2, 1)},
              {'name': '', 'type': 'tuple', 'value': (4, 2)},
              {'name': '', 'type': 'tuple', 'value': (0, 0)},
              {'name': '', 'type': 'int', 'value': 1},
              {'name': '', 'type': 'tuple', 'value': (1, 1)}]}"],
'seqMarker': [],
'seqId': [],
'subSeqId': 0,
'altSeqId': [],
'dir': 'fprop',
'mod': ['torch.nn.functional'],
'op': ['conv_transpose2d'],
'tid': 2833028928, 'device': 0, 'stream': 7, 'grid': (9, 20, 1), 'block': (16, 32, 1),
'kLongName': 'void cudnn::detail::dgrad2d_alg1_1<float, 0, 6, 7, 5, 4, 5, true, true>
(int, int, int, float const*, int, float const*, int, float*, kernel_grad_params, int, int, float, int, int)'}
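The shape arguments captured above determine the output size of the transposed convolution. As a hedged illustration (the helper function below is hypothetical, not part of PyProf), the one-dimensional output-size formula from the PyTorch ConvTranspose2d docs can be applied per spatial dimension to the shapes in this example:

```python
# Sketch: compute the output spatial size of a transposed convolution from
# the arguments shown above (input (20, 16, 50, 100), weight (16, 33, 3, 5),
# stride (2, 1), padding (4, 2)). Formula per the PyTorch ConvTranspose2d docs:
#   out = (in - 1) * stride - 2 * padding + dilation * (kernel - 1)
#         + output_padding + 1

def conv_transpose_out_size(in_size, kernel, stride=1, padding=0,
                            output_padding=0, dilation=1):
    """Output length along one spatial dimension of a transposed conv."""
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)

h = conv_transpose_out_size(50, 3, stride=2, padding=4)
w = conv_transpose_out_size(100, 5, stride=1, padding=2)
# batch size stays 20; out_channels (33) come from the weight tensor
print((20, 33, h, w))  # → (20, 33, 93, 100)
```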

@danieltudosiu

@ngimel @adityaiitb Thank you for your support.

I am working now on installing Apex and running a full-blown profiling.

Will come back to you with more details tomorrow morning London time.

@ngimel
Collaborator

ngimel commented Mar 19, 2020

Cool! You don't need to install apex with cpp extensions; for profiling, a python-only installation will do.

@danieltudosiu

danieltudosiu commented Mar 20, 2020

@ngimel I have made public my git repo and you can find it here https://github.com/danieltudosiu/nmpevqvae

In it you can find the net.sql.gz which is the profiling that I have done.

To generate it I used the profiling.py script and the command line nvprof -f -o net.sql --profile-from-start off -- python3 profiling.py.

The python environment that I use can be found in the requirements.txt (besides apex which is just for profiling).

The profiling has been run inside a Docker container that inherits from 10.1-cudnn7-devel-ubuntu18.04 NVIDIA image the Dockerfile can be found in my repo.

The hardware used was a V100 from a DGX1 server.

Please let me know if I can help in any other way.

@adityaiitb

adityaiitb commented Mar 20, 2020

A couple of things:

  1. I looked at the SQL file. It does not have a single GPU kernel. Looks like your application ran on the CPU, or for some reason the profiler did not capture any GPU activity.

  2. I tried running your code but couldn't, because there is a hard-coded path "/raid/danieltudosiu/datasets/neuro_morphology/healthy/train_192". If you can modify that to feed random data, that would help.

  3. The SQL file is big because NVprof captured a lot of NVTX annotations as those always happen on the CPU (which is fine). I did see 10 calls to conv_transpose3d. Attached is the file which has all the parameters for each of those calls.
    conv_transpose3d.txt

The command I used to look for the convolutions in the SQL file was
sqlite3 net.sql "select * from StringTable" | grep -i conv_transpose3d | grep functional

@danieltudosiu

danieltudosiu commented Mar 20, 2020

@adityaiitb Interesting!

I will modify the script now and push it to git.

10 calls to conv_transpose3d sounds about right.

I need to go and move my workstation from the office to home, COVID-19 style. Let me know if you need anything else.

There are NVTX annotations for each nn.Module that I create, so I can debug it.

@danieltudosiu

@adityaiitb @ngimel

I modified the profiling script to be independent of my data but respect the data shape.

I did two more profiling runs: one that tried to capture between start and stop, called net_independent.sql.gz, and another that captured everything. I was able to parse the latter, and you can find it as net_independent_all.dict.gz, since the sql file was too big.

Let me know if I can be of any further help!

@danieltudosiu

danieltudosiu commented Mar 21, 2020

@adityaiitb

I did some parsing of the dict myself, hoping to find something useful. The most useful thing I found is this table, but I can't make heads or tails of it.

                                                             min      mean           std       max      total   count
kShortName                                                                                                           
cudnn::detail::wgrad_alg1_nd_float_engine           1.622400e-05  0.012991  2.407226e-02  0.086074  70.593751    5434
cudnn::detail::implicit_convolveNd_sgemm            3.795100e-05  0.005770  8.815605e-03  0.033361  23.240455    4028
volta_scudnn_128x64_stridedB_splitK_small_nn_v1     3.494400e-05  0.004681  1.163572e-02  0.046304  21.624084    4620
cudnn::detail::convolveNd_wgrad_engine              1.369600e-05  0.009364  1.378996e-01  2.297072  20.919318    2234
cudnn::detail::convolveNd_dgrad_float_engine        1.257600e-05  0.002649  5.326334e-03  0.046430  14.804468    5588
elementwise_kernel                                  9.280000e-07  0.000041  1.653503e-04  0.001723  11.703452  283920
avg_pool3d_cuda_update_output                       3.328000e-06  0.000666  1.345273e-03  0.005348   3.355386    5040
avg_pool3d_single_backward_out_frame_stride1        3.104000e-06  0.000558  1.115472e-03  0.003855   2.812430    5040
reduce_kernel                                       1.824000e-06  0.000083  1.556771e-04  0.000613   1.691088   20280
setTensor5d_kernel                                  1.248000e-06  0.000229  4.797258e-04  0.001454   1.279915    5588
kernelPointwiseApply3                               1.408000e-06  0.000251  5.185644e-04  0.001856   1.264417    5040
volta_scudnn_128x32_stridedB_splitK_xregs_large_nn  4.360067e-02  0.118861  6.979165e-02  0.176072   0.832024       7
kernelPointwiseApply2                               1.120000e-06  0.000139  3.165819e-04  0.001157   0.816679    5880
cudnn::detail::dgrad_alg1_nd_float_engine           3.263350e-04  0.047277  1.015612e-01  0.385441   0.756425      16
cudnn::gemm::setOutputKernel                        9.280000e-07  0.000131  3.467400e-04  0.001468   0.751137    5720
volta_gcgemm_32x32_tn                               2.160000e-05  0.000025  2.614317e-05  0.002271   0.418831   16426
volta_scudnn_128x128_stridedB_splitK_small_nn_v1    7.689500e-05  0.000220  8.117861e-05  0.000495   0.238825    1088
fft3d_r2c_16x16x16                                  4.224000e-06  0.000006  1.294814e-05  0.001555   0.198598   31358
fft3d_c2r_16x16x16                                  4.896000e-06  0.000006  1.387327e-06  0.000041   0.188108   31322
cudnn::gemm::computeOffsetsKernel                   1.216000e-06  0.000031  8.746352e-05  0.000370   0.174604    5692
sgemm_largek_lds64                                  1.195738e-03  0.001297  3.165635e-05  0.001402   0.155656     120
transpose_readWrite_alignment_kernel                1.503000e-06  0.000002  1.791118e-05  0.003037   0.144902   62682
volta_sgemm_32x32_sliced1x4_nn                      1.225600e-05  0.000269  3.576056e-04  0.000811   0.096964     360
gemv2T_kernel_val                                   6.335000e-06  0.000006  2.950772e-07  0.000016   0.095937   14896
volta_scudnn_128x64_stridedB_splitK_medium_nn_v1    1.886720e-02  0.019069  1.830427e-04  0.019249   0.095347       5
volta_sgemm_128x32_tn                               1.484800e-05  0.000283  2.680396e-04  0.000553   0.067963     240
THCudaTensor_scatterFillKernel                      2.272000e-06  0.000076  1.031451e-04  0.000230   0.027233     360
cudnn::gemm::computeBOffsetsKernel                  9.590000e-07  0.000001  2.185772e-07  0.000002   0.006775    5720
volta_sgemm_32x32_sliced1x4_nt                      1.078400e-05  0.000028  1.699259e-05  0.000057   0.006748     240
volta_sgemm_32x32_sliced1x4_tn                      9.055000e-06  0.000009  2.104289e-07  0.000011   0.001140     120
cudnn::gemm::computeWgradOffsetsKernel              1.696000e-06  0.000015  2.410338e-05  0.000072   0.000415      28
scal_kernel                                         1.504000e-06  0.000002  9.983884e-08  0.000002   0.000191     120
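For reference, a summary table like the one above can be produced from pyprof-style records by grouping kernel durations by short name. This is a hedged, stdlib-only sketch: the record layout is an assumption modeled on the dictionary shown earlier in this thread, and the sample records are made up.

```python
# Sketch: aggregate per-kernel durations into min/mean/std/max/total/count,
# sorted by total time so the hot kernels come first. Record keys
# (kShortName, kDuration) follow the pyprof dictionary shown above.
from collections import defaultdict
from statistics import mean, pstdev

def summarize(records):
    groups = defaultdict(list)
    for r in records:
        groups[r["kShortName"]].append(r["kDuration"])
    table = {
        name: {
            "min": min(ds), "mean": mean(ds), "std": pstdev(ds),
            "max": max(ds), "total": sum(ds), "count": len(ds),
        }
        for name, ds in groups.items()
    }
    # hottest kernels (largest total time) first
    return sorted(table.items(), key=lambda kv: -kv[1]["total"])

# made-up sample records for illustration
records = [
    {"kShortName": "wgrad", "kDuration": 2.0},
    {"kShortName": "wgrad", "kDuration": 4.0},
    {"kShortName": "elementwise", "kDuration": 1.0},
]
print(summarize(records)[0][0])  # → wgrad
```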

Also, the inbuilt parser crashes due to the unusual return values of some layers (VectorQuantizerEMA, Quantization, Encoder, VectorQuantizedVAE).

What would you recommend that I do next?

I found out that the PyTorch binaries "ship with their own CUDA, cudnn etc." as per @ptrblck's comments. Does this mean that PyTorch is agnostic to the installed CUDA, cuDNN and drivers?

@soumith
Member

soumith commented Mar 21, 2020

I found out that PyTorch "they ship with their own CUDA, cudnn etc." as per @ptrblck comments. Does this mean that PyTorch is agnostic to the installed CUDA, CUDNN and Drivers?

If you installed PyTorch from binaries (pip or conda), then yes. It will be agnostic to system's CUDA/CUDNN

@danieltudosiu

@soumith Yes, I installed it with pip. Then my problem is purely a PyTorch one?

@danieltudosiu

danieltudosiu commented Mar 23, 2020

@ngimel @adityaiitb

I found out the reason. In my experiment.py I was missing the line of code torch.backends.cudnn.benchmark = True.

@soumith
Member

soumith commented Mar 23, 2020

@danieltudosiu it kicks off an internal auto-tuner in cuDNN, which picks the best convolution codepath after benchmarking every single available codepath (algorithm) for each input shape
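Conceptually, the auto-tuner can be pictured as a benchmark-once-then-cache loop keyed on input shape. This is purely an illustrative sketch, not cuDNN's actual code; the algorithm names and callables are made up.

```python
# Illustrative sketch of what cudnn.benchmark = True does conceptually:
# time every candidate algorithm once per input shape, cache the winner,
# and reuse it for all later calls with the same shape. This is why the
# flag helps when input shapes are fixed and can hurt when they vary.
import time

class AutoTuner:
    def __init__(self, algorithms):
        self.algorithms = algorithms   # name -> callable taking a shape
        self.cache = {}                # shape -> name of fastest algorithm

    def run(self, shape):
        if shape not in self.cache:    # first call for this shape: benchmark
            timings = {}
            for name, algo in self.algorithms.items():
                start = time.perf_counter()
                algo(shape)
                timings[name] = time.perf_counter() - start
            self.cache[shape] = min(timings, key=timings.get)
        return self.algorithms[self.cache[shape]](shape)

# made-up "algorithms" with very different costs
def slow_algo(shape):
    time.sleep(0.01)
    return "slow"

def fast_algo(shape):
    return "fast"

tuner = AutoTuner({"slow": slow_algo, "fast": fast_algo})
print(tuner.run((8, 64, 32, 32)))  # → fast
```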

@adityaiitb

@danieltudosiu @ngimel

I have quite a few updates.

  1. The profile context manager captures only the current device. That's why I did not observe any GPU kernels when I ran your code. There is a very small fix, as follows (thanks to Natalia):
torch.cuda.set_device(1)
device="cuda"  # instead of "cuda:1"
  2. The PyProf in APEX is in a state of flux. For the time being, can you run parse.py and prof.py using my repo at https://github.com/adityaiitb/pyprof2. To repeat: you can still use from apex import pyprof, but for the remaining steps please use my repo.
pyprof2/parse/parse.py net.sql > net.dict
pyprof2/prof/prof.py -w 150 net.dict
OR
pyprof2/prof/prof.py --csv -c idx,dir,sub,mod,op,kernel,params,sil,tc,flops,bytes net.dict > net.csv
  3. After warmup, I profiled 1 step of your code. Attached is the detailed csv file. Please change the extension to .csv. For transposed convolution, the output of prof.py shows the size of the image and filter. All other arguments like stride, padding, dilation are in the net.dict file (on the same line). Sorry, it's still a WIP.
    nmpevqvae.txt

  4. The first thing that jumped out at me was that you are not using fp16. If you don't use fp16, you are not making use of the tensor cores on Turing and are losing out on a lot of performance, both for compute kernels like convolution and for streaming kernels like c = a + b. I think you can speed up your training by ~1.5x.
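One reason fp16 training is usually paired with loss scaling: very small gradient values underflow to zero in half precision, and scaling them up before the backward pass preserves them. A minimal, stdlib-only illustration (the gradient value and scale factor are made up; struct's 'e' format, available since Python 3.6, round-trips a float through IEEE fp16):

```python
# Sketch: why mixed precision needs loss scaling. A tiny gradient
# underflows to zero in fp16, but survives if scaled up first and
# unscaled afterwards in fp32.
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                       # a plausibly tiny gradient (made up)
print(to_fp16(grad))              # → 0.0 (underflows in fp16)

scale = 1024.0                    # hypothetical loss scale
recovered = to_fp16(grad * scale) / scale
print(recovered)                  # small but nonzero: the gradient survives
```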
