
Libtorch build error when setting both USE_GLOO and USE_SYSTEM_NCCL to ON #36570

Open
gnaggnoyil opened this issue Apr 14, 2020 · 8 comments
Labels
module: build Build system issues · module: docs Related to our documentation, both in docs/ and docblocks · module: nccl Problems related to nccl support · oncall: distributed Add this issue/PR to distributed oncall triage queue · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


gnaggnoyil commented Apr 14, 2020

🐛 Bug

Currently, when the USE_GLOO cmake option is set to ON, the target gloo_cuda requires a dependency called nccl_external; however, that target is available if and only if USE_SYSTEM_NCCL is OFF. Thus, if both USE_GLOO and USE_SYSTEM_NCCL are set to ON, cmake reports an error during the configuration phase.

https://github.com/pytorch/pytorch/blob/master/cmake/Dependencies.cmake#L1149
https://github.com/pytorch/pytorch/blob/master/cmake/External/nccl.cmake#L19
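Based on the two linked files, the conflict can be sketched roughly as follows (a simplified sketch, not the actual file contents):

```cmake
# cmake/External/nccl.cmake (simplified sketch):
# the nccl_external target is only defined when the bundled NCCL is built,
# i.e. when the system NCCL is NOT used.
if(NOT USE_SYSTEM_NCCL)
  ExternalProject_Add(nccl_external
    # ... build the bundled NCCL ...
  )
endif()

# cmake/Dependencies.cmake (simplified sketch):
# gloo_cuda unconditionally depends on nccl_external, which cannot be
# resolved when USE_SYSTEM_NCCL=ON.
add_dependencies(gloo_cuda nccl_external)
```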

To Reproduce

Steps to reproduce the behavior:

  1. run cmake -DBUILD_PYTHON=OFF -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_NCCL=ON -DUSE_SYSTEM_NCCL=ON -DUSE_DISTRIBUTED=ON -DUSE_GLOO=ON /path/to/torchsrc

  2. CMake will then report the following error:

    -- Configuring done
    CMake Error at cmake/Dependencies.cmake:1149 (add_dependencies):
      The dependency target "nccl_external" of target "gloo_cuda" does not exist.
    Call Stack (most recent call first):
      CMakeLists.txt:421 (include)
    
    
    -- Generating done
    CMake Generate step failed.  Build files cannot be regenerated correctly.
    

Expected behavior

The configuration step should complete without errors.

Environment

  • OS: CentOS 7 x86_64 with devtoolset-6 enabled
  • CMake version: 3.17.0
  • Cuda version: 9.2.148-1
  • CuDNN version: libcudnn7-7.6.5.31-1.cuda9.2
  • NCCL version: 2.4.8-ga-cuda9.2-1-1

All of the Cuda, CuDNN and NCCL libraries were installed through NVIDIA's official rpm packages; no libtorch was installed beforehand.

I've tried both the current master HEAD and tag v1.5.0-rc3, and the problem still exists.

Additional context

Related commit seems to be 30da84f

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

@colesbury colesbury added the module: build Build system issues label Apr 14, 2020
@mruberry mruberry added module: nccl Problems related to nccl support, triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module, and oncall: distributed Add this issue/PR to distributed oncall triage queue labels Apr 18, 2020
@mruberry
Collaborator

Hey @gnaggnoyil, sorry to hear you're having a problem building PyTorch. Can you elaborate on what you're trying to accomplish by setting both these flags?

@gnaggnoyil
Author

@mruberry Thanks for your reply! We set USE_GLOO to ON because we wanted to see whether we could build libtorch with multi-machine support. We have been setting USE_SYSTEM_NCCL since we first started using libtorch and never hit problems until we enabled USE_GLOO this time. We expected the two flags to be orthogonal and not to affect each other, hence this issue.

@mruberry mruberry added the module: docs Related to our documentation, both in docs/ and docblocks label Apr 20, 2020
@mruberry
Collaborator

mruberry commented Apr 20, 2020

Got it, thanks for the update @gnaggnoyil. I wonder if we just need to improve the documentation here: https://pytorch.org/docs/stable/distributed.html.

@pritamdamania87
Contributor

It seems like these two flags should work together. @zhaojuanmao @mrshenli Thoughts?

@zhaojuanmao
Contributor

zhaojuanmao commented Apr 23, 2020

@pritamdamania87 I agree with you that they should work together and that this needs a fix. The same issue was reported in #32286 and #36570 as well.

For the fix, we can add a condition: when USE_SYSTEM_NCCL is ON, link against the system library; otherwise, link against nccl_external.
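A minimal sketch of that condition (hypothetical; aside from USE_SYSTEM_NCCL and the two targets discussed above, the variable names here, such as NCCL_LIBRARIES, are assumptions):

```cmake
# Hypothetical sketch for cmake/Dependencies.cmake: only depend on the
# nccl_external target when the bundled NCCL is actually built.
if(USE_SYSTEM_NCCL)
  # System NCCL: link gloo_cuda against the library found on the system.
  target_link_libraries(gloo_cuda PRIVATE ${NCCL_LIBRARIES})
else()
  # Bundled NCCL: make sure nccl_external is built before gloo_cuda.
  add_dependencies(gloo_cuda nccl_external)
endif()
```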

@mrshenli
Contributor

Agree with @pritamdamania87 and @zhaojuanmao; they should work together.

@guol-fnst
Contributor

guol-fnst commented Jun 9, 2020

I can't reproduce this bug in my env. So how about adding a condition:
when USE_GLOO=ON, we use ExternalProject_Add to create the nccl_external target, regardless of whether USE_SYSTEM_NCCL is ON or OFF.
https://github.com/pytorch/pytorch/blob/master/cmake/External/nccl.cmake#L4

@zhaojuanmao any advice?
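That alternative would amount to something like the following (a hypothetical sketch, not actual repository code):

```cmake
# Hypothetical sketch for cmake/External/nccl.cmake: also create the
# nccl_external target when Gloo needs it, even with USE_SYSTEM_NCCL=ON.
if(NOT USE_SYSTEM_NCCL OR USE_GLOO)
  ExternalProject_Add(nccl_external
    # ... build the bundled NCCL ...
  )
endif()
```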

@gnaggnoyil
Author

gnaggnoyil commented Jun 10, 2020

@guol-fnst Not sure what your env was, but I just tried on an Arch Linux machine and, sure enough, this issue still exists. The only difference this time is that cmake emits a warning instead of an error:

-- Configuring done
CMake Warning (dev) at cmake/Dependencies.cmake:1239 (add_dependencies):
  Policy CMP0046 is not set: Error on non-existent dependency in
  add_dependencies.  Run "cmake --help-policy CMP0046" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:469 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /home/gnaggnoyil/libtorch_build

This time the env uses the system cuda, cudnn and nccl libraries installed through the official Arch Linux repositories, with no pytorch installed beforehand:

gnaggnoyil@gnaggnoyil-pc ~/libtorch_build % pacman -Qs | grep -E "(cmake )|(cuda )|(cudnn )|(nccl )"
local/cmake 3.17.3-1
local/cuda 10.2.89-5
local/cudnn 7.6.5.32-4
local/nccl 2.6.4-1

All tools and libraries used are system ones, and no Anaconda environments exist on this machine.
