Libtorch build error when setting both USE_GLOO and USE_SYSTEM_NCCL to ON
#36570
Comments
Hey @gnaggnoyil, sorry to hear you're having a problem building PyTorch. Can you elaborate on what you're trying to accomplish by setting both these flags?
@mruberry Thanks for your reply here! We set |
Got it, thanks for the update @gnaggnoyil. I wonder if we just need to improve the documentation here: https://pytorch.org/docs/stable/distributed.html.
It seems like these two flags should work together. @zhaojuanmao @mrshenli Thoughts?
@pritamdamania87 I agree that they should work together and that this needs a fix. The same issue was reported in #32286 and #36570 as well. To fix it, we can add a condition: when USE_SYSTEM_NCCL is ON, link to the system library; otherwise, link to nccl_external.
Agree with @pritamdamania87 and @zhaojuanmao, they should work together.
I can't reproduce this bug in my env. How about adding a condition? @zhaojuanmao, any advice?
@guol-fnst Not sure what your env was, but I just tried on an Archlinux machine and, unsurprisingly, this issue still exists. The only difference this time is that cmake emits a warning instead of an error:
This time the environment uses the system CUDA, cuDNN and NCCL libraries installed from the official Archlinux software repository, with no PyTorch installed beforehand:
All tools and libraries used are system ones, and no Anaconda envs exist on this machine.
🐛 Bug
Currently, when the USE_GLOO cmake option is set to ON, the target gloo_cuda requires a dependency called nccl_external; however, this target is available if and only if USE_SYSTEM_NCCL is OFF. Thus, if both USE_GLOO and USE_SYSTEM_NCCL are set to ON, cmake reports an error during the configuration phase.
https://github.com/pytorch/pytorch/blob/master/cmake/Dependencies.cmake#L1149
https://github.com/pytorch/pytorch/blob/master/cmake/External/nccl.cmake#L19
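A minimal sketch of the kind of guard that would avoid the missing dependency, along the lines of the condition suggested in the comments. This is an assumption about how such a fix could look, not the actual contents of cmake/Dependencies.cmake; only the target names gloo_cuda and nccl_external and the option names are taken from this issue, and the surrounding if() structure is hypothetical.

```cmake
# Sketch only: the real logic in cmake/Dependencies.cmake may be organized differently.
if(USE_GLOO AND USE_CUDA AND TARGET gloo_cuda)
  # nccl_external is only defined when NCCL is built from the bundled sources,
  # i.e. when USE_SYSTEM_NCCL is OFF. Add the build-order dependency only if
  # that target actually exists.
  if(NOT USE_SYSTEM_NCCL AND TARGET nccl_external)
    add_dependencies(gloo_cuda nccl_external)
  endif()
  # With USE_SYSTEM_NCCL=ON the library is already installed on the system,
  # so no build-order dependency on nccl_external is needed here.
endif()
```

Checking `if(TARGET ...)` before calling `add_dependencies` keeps the configuration step from referencing a target that was never created, whichever way NCCL is provided.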
To Reproduce
Steps to reproduce the behavior:
run
cmake -DBUILD_PYTHON=OFF -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_NCCL=ON -DUSE_SYSTEM_NCCL=ON -DUSE_DISTRIBUTED=ON -DUSE_GLOO=ON /path/to/torchsrc
CMake will then report the following error:
Expected behavior
The configuration step should complete without errors.
Environment
devtoolset-6 enabled
All of the CUDA, cuDNN and NCCL libraries are installed through NVIDIA's official rpm packages; no libtorch was installed before.
I've tried using current master HEAD and tag/v1.5.0-rc3 and the problem still exists.
Additional context
Related commit seems to be 30da84f
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar