libtorch_cuda.so is missing fast kernels from libcudnn_static.a, therefore statically linked cuDNN could be much slower than dynamically linked #50153
Comments
@zasdfgbnm Is it only true for fp16 on Ampere, or for other tensor types/architectures as well?
I do not see cutlass kernels invoked when running on a 2080 with the shared libcudnn.so.7.5.0.
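For reference, kernel names can be checked from Python with the profiler available in newer releases. A minimal sketch; the layer shape, dtype, and the "cutlass"/"xmma" name pattern are assumptions, not the original benchmark:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative conv layer in fp16; shapes are assumptions, not the original test.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().half()
x = torch.randn(32, 64, 56, 56, device="cuda", dtype=torch.half, requires_grad=True)

# Warm up so cuDNN algorithm selection happens before profiling.
for _ in range(3):
    conv(x).sum().backward()
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    conv(x).sum().backward()

# Kernel rows containing "cutlass" or "xmma" would indicate the fast cuDNN paths
# (that naming pattern is an assumption based on this discussion).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))
```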
@malfet Thanks for the quick test!
Update: cudnn static linking should use …
@zasdfgbnm wouldn't that lead to a 50% increase in the binary size? Can we simply reference one symbol that instantiates the cutlass kernels from torch_cuda?
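For illustration, the "reference one symbol" idea works like this: referencing any symbol defined in an object file inside a static archive forces the linker to keep that whole object file. A hedged C++ sketch follows; the symbol used is only an example, and the objects that actually contain the cutlass kernels expose no obvious public symbol, which is exactly the difficulty:

```cpp
// Sketch only, not PyTorch's actual fix. Referencing a symbol defined inside
// libcudnn_static.a forces the static linker to pull in the whole object file
// that defines it, including any embedded device code.
// cudnnGetVersion() is a real cuDNN entry point used here purely as an example;
// the kernel-bearing objects would need one of *their* symbols referenced.
extern "C" size_t cudnnGetVersion(void);

namespace {
// The "used" attribute keeps the compiler from discarding the reference even
// though nothing calls it at runtime (GCC/Clang).
__attribute__((used)) const auto force_link_cudnn = &cudnnGetVersion;
}  // namespace
```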
@malfet Do you know how to tell cmake to replace its …
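In CMake terms, whole-archive linking of the static cuDNN library could be expressed roughly like this; a sketch under assumed names (the target and the `CUDNN_STATIC_LIBRARY` variable are placeholders), not the change that actually landed in PyTorch's build system:

```cmake
# Sketch: wrap the static cuDNN archive in --whole-archive so the linker keeps
# every object file, including ones nothing in the library references directly.
# "my_torch_cuda" and CUDNN_STATIC_LIBRARY are placeholder names.
target_link_libraries(my_torch_cuda PRIVATE
  -Wl,--whole-archive
  ${CUDNN_STATIC_LIBRARY}
  -Wl,--no-whole-archive
)
```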
@zasdfgbnm @malfet We are facing this issue too. Is there any progress on how to address it? Also, is the missing CUDA kernel when linking cuDNN statically a behavior of the linker, or some kind of fault in NVIDIA's cuDNN?
So, does this issue explain why NGC containers have been consistently faster than the official conda builds for a number of PyTorch versions now? NGC == dynamic link, conda/pip == static with this issue? This has a pretty significant impact if that is the case. I ran some benchmarks trying to figure out what was happening, as I've kept bumping into it with new releases: https://gist.github.com/rwightman/bb59f9e245162cee0e38bd66bd8cd77f
@rwightman (Mostly) for Turing and Ampere, these kernels would be missing in the binaries due to this issue, which would explain the performance difference between the binaries and a source build. Besides that, the cuDNN (and CUDA, cuBLAS, etc.) versions could also differ, which might give further performance gains or potential regressions.
The following gist reproduces the problem with static linking. Whole-library linking, or extracting cudnn_static.a and specifying all of its objects explicitly, fixes the problem.
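A sketch of those two workarounds as raw link commands; the archive path is a placeholder and `<rest of link line>` stands for the real objects and flags, so this is schematic rather than the actual PyTorch link line:

```sh
# Workaround 1: force the linker to keep every object in the static archive.
g++ -shared -o libexample_cuda.so <rest of link line> \
    -Wl,--whole-archive /usr/local/cuda/lib64/libcudnn_static.a -Wl,--no-whole-archive

# Workaround 2: unpack the archive and name every object file explicitly,
# so the linker cannot drop "unreferenced" objects.
mkdir -p cudnn_objs
(cd cudnn_objs && ar x /usr/local/cuda/lib64/libcudnn_static.a)
g++ -shared -o libexample_cuda.so <rest of link line> cudnn_objs/*.o
```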
[CMake] Split caffe2::cudnn into public and private (#59721)
Summary: This is only important for builds where cuDNN is linked statically into libtorch_cpu. Before this PR, PyTorch wheels often accidentally contained several partial copies of the cudnn_static library. Splitting the interface into header-only (cudnn-public) and library+headers (cudnn-private) prevents that from happening. Preliminary step towards enabling optional whole-archive linking of the cudnn library to work around the issue reported in #50153.
Pull Request resolved: #59721
Reviewed By: ngimel
Differential Revision: D29000967
Pulled By: malfet
fbshipit-source-id: f054df92b265e9494076ab16c247427b39da9336
Add USE_WHOLE_CUDNN option (#59744)
Summary: It is only enabled if USE_STATIC_CUDNN is enabled. Next step after #59721 towards resolving the fast-kernel stripping reported in #50153.
Pull Request resolved: #59744
Reviewed By: seemethere, ngimel
Differential Revision: D29007314
Pulled By: malfet
fbshipit-source-id: 7091e299c0c6cc2a8aa82fbf49312cecf3bb861a
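For source builds, the option can be exercised roughly like this; a sketch that assumes a pytorch source checkout and a static cuDNN installation the build can find, and whose exact effect depends on the PyTorch version:

```sh
# Sketch: build from source with cuDNN linked statically and the whole archive
# kept, so the fast (cutlass-backed) kernels are not stripped.
export USE_CUDA=1
export USE_STATIC_CUDNN=1
export USE_WHOLE_CUDNN=1   # only has an effect when USE_STATIC_CUDNN=1
python setup.py install
```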
[Binary] Link whole CuDNN for CUDA-11.1 (#59802)
Summary: Fixes #50153.
Pull Request resolved: #59802
Reviewed By: driazati, seemethere
Differential Revision: D29033537
Pulled By: malfet
fbshipit-source-id: e816fc71f273ae0b4ba8a0621d5368a2078561a1
* Move cublas dependency after CuDNN (#58287)
  Summary: Library linking order matters during static linking. Not sure whether it's a bug or a feature, but if cublas is referenced before CuDNN, it will be partially statically linked into the library, even if it is not used.
  Pull Request resolved: #58287
  Reviewed By: janeyx99
  Differential Revision: D28433165
  Pulled By: malfet
  fbshipit-source-id: 8dffa0533075126dc383428f838f7d048074205c
* [CMake] Split caffe2::cudnn into public and private (#59721), as summarized above.
* Add USE_WHOLE_CUDNN option (#59744), as summarized above.
* [Binary] Link whole CuDNN for CUDA-11.1 (#59802), as summarized above.
🐛 Bug
libtorch_cuda.so is missing fast kernels from libcudnn_static.a, therefore statically linked cuDNN could be much slower than dynamically linked.
People at NVIDIA found that the following code is much slower on the backward pass when running with statically linked cuDNN compared to dynamically linked cuDNN:
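The original snippet is not reproduced here; a minimal sketch of this kind of benchmark (layer sizes, dtype, and iteration counts are assumptions, not the exact values NVIDIA used) could look like:

```python
import time
import torch

# Sketch of an fp16 conv forward/backward benchmark where cuDNN kernel choice matters.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().half()
x = torch.randn(64, 256, 56, 56, device="cuda", dtype=torch.half, requires_grad=True)

# Warm up so cudnn.benchmark picks its algorithms before timing.
for _ in range(10):
    conv(x).sum().backward()
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    conv(x).sum().backward()
torch.cuda.synchronize()
print(f"100 fwd+bwd iterations: {time.time() - start:.3f}s")
```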
The backward pass with statically linked cuDNN is about 4x slower than with the dynamically linked one.
Profiling shows that static cuDNN and dynamic cuDNN are calling different kernels:
static:
dynamic:
I am not 100% sure about the reason, but it seems to me that this is because PyTorch does not copy the fast kernels from libcudnn_static.a into libtorch_cuda.so, so cuDNN has to fall back to a slow kernel.
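One rough way to probe this hypothesis is to compare symbol names in the static archive with those that survived into the shared library. This is a heuristic sketch: the paths and the "xmma|cutlass" pattern are assumptions, and device kernels live inside embedded fatbins that nm may not see, so treat a large count in the archive and a near-zero count in the .so as supporting evidence rather than proof:

```sh
# Count fast-kernel-looking symbols in the static cuDNN archive (placeholder path).
nm --defined-only /usr/local/cuda/lib64/libcudnn_static.a | grep -ciE 'xmma|cutlass'

# Count the same pattern in the libtorch_cuda.so that ships with the install.
TORCH_LIB=$(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib', 'libtorch_cuda.so'))")
nm --defined-only "$TORCH_LIB" | grep -ciE 'xmma|cutlass'
```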
Environment
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @Varal7 @malfet @seemethere @walterddr @ngimel @csarofeen @ptrblck @xwang233 @VitalyFedyunin