nn.Conv3d is not accelerated with tensorcores (using autocast/AMP) #57115
Comments
Tentatively marking as hi-pri because it sounds like we have incorrect behavior (if the source build of pytorch utilizes tensor cores, the nightly should as well). Unless this has something to do with the cuda versions we build binaries for?
The 2080ti system I tested only has cuda 11.0. Binaries are provided with cuda 11.1 (those are the ones I used), so that's probably not it. Thanks for looking into this!
Thanks for creating this issue. You are most likely hitting this bug.
That could indeed be the case. I see the same behavior with nn.Conv2d (not sure why I thought this was a conv3d-only problem): nn.Conv2d instead of 3d, RTX 3090 GPU:
It's interesting that the problem only affects mixed precision training. FP32 has almost the same speed when I compare source vs binary.
Is this fixed?
RTX 3090, pip install, 3D network
RTX 3090, compiled myself, 3D network
Looks like it's fixed, but best test it yourself as well. Note that the binaries were always slower than what I compiled myself.
@FabianIsensee Indeed, you are right. I can verify this: `$ python3 verify.py` reports CUDA Version 11.3 on a GeForce 3060. I suppose you may remove the requirement to build pytorch from source on your nnunet page.
🐛 Bug
When installing either the current nightly or version 1.8.1 via pip/conda, tensor-core acceleration of nn.Conv3d on Nvidia GPUs does not work. However, when compiling pytorch from source it works as intended and gives a ~3x speedup relative to regular fp32 training. Since compiling from source is something not all users are able (or willing) to do, it would be nice if the pip/conda packages were fixed.
To Reproduce
Steps to reproduce the behavior:
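The original repro steps are not shown above; the following is a minimal sketch of the kind of comparison described (layer sizes, iteration count, and the helper `tensor_core_friendly` are illustrative assumptions of mine, not from the report; the benchmark itself requires a CUDA GPU with tensor cores):

```python
import time
import torch
import torch.nn as nn


def tensor_core_friendly(channels: int) -> bool:
    # fp16 tensor-core kernels generally want channel counts divisible by 8
    # (assumption for this sketch; exact constraints depend on the cuDNN version).
    return channels % 8 == 0


def bench(use_amp: bool, iters: int = 20) -> float:
    # Time a few forward/backward passes of a small Conv3d; sizes are illustrative.
    conv = nn.Conv3d(32, 32, kernel_size=3, padding=1).cuda()
    x = torch.randn(2, 32, 64, 64, 64, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = conv(x).sum()
        loss.backward()
        conv.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return time.time() - start


if __name__ == "__main__" and torch.cuda.is_available():
    assert tensor_core_friendly(32)
    print("fp32 time:", bench(use_amp=False))
    # With working tensor cores, the AMP run should be roughly 3x faster.
    print("amp  time:", bench(use_amp=True))
```

On an affected binary install, the two timings come out nearly identical; on a source build they differ by roughly the 3x reported above.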
Expected behavior
fp16/mixed precision should be ~3x faster than fp32. This number comes from my own experiments using pytorch versions that I compiled myself. I tested both Turing and Ampere GPUs. Here are my results:
As you can see, both pytorch 1.7.1 + cuDNN 8.1.0.77 and pytorch 1.8.1 + cuDNN 8.2.0.53 have the expected speedup when using mixed precision. pytorch '1.9.0.dev20210427+cu111' + cuDNN 8005 does not. (I do not have a pytorch version that I compiled myself with cuDNN 8005, but I used to have one and I know that it worked with Turing at least.)
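For reference, the cuDNN build numbers quoted here come from `torch.backends.cudnn.version()`, which returns a single integer. A small sketch to decode it into a readable tuple (the helper name is mine, and it assumes the cuDNN 8.x encoding of major*1000 + minor*100 + patchlevel):

```python
def cudnn_tuple(version: int) -> tuple:
    # Decode cuDNN's integer version, e.g. 8005 -> (8, 0, 5).
    # Assumes the cuDNN 8.x scheme: major*1000 + minor*100 + patchlevel.
    return (version // 1000, (version % 1000) // 100, version % 100)


# The nightly wheel in this report:
print(cudnn_tuple(8005))  # -> (8, 0, 5)
```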
Environment
(this is the RTX 3090 system. If you need info for the RTX 2080ti system as well, let me know)
Thank you!
Best,
Fabian
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @seemethere @malfet @walterddr @ngimel @csarofeen @ptrblck @xwang233 @VitalyFedyunin