
Libtorch build error when setting both USE_GLOO and USE_SYSTEM_NCCL to ON #36570

Open
gnaggnoyil opened this issue Apr 14, 2020 · 8 comments
Labels
module: build Build system issues · module: docs Related to our documentation, both in docs/ and docblocks · module: nccl Problems related to nccl support · oncall: distributed Add this issue/PR to distributed oncall triage queue · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


gnaggnoyil commented Apr 14, 2020

🐛 Bug

Currently, when the USE_GLOO cmake option is set to ON, the target gloo_cuda requires a dependency called nccl_external; however, that target is available if and only if USE_SYSTEM_NCCL is OFF. Thus, if both USE_GLOO and USE_SYSTEM_NCCL are set to ON, cmake reports an error during the configuration phase.

https://github.com/pytorch/pytorch/blob/master/cmake/Dependencies.cmake#L1149
https://github.com/pytorch/pytorch/blob/master/cmake/External/nccl.cmake#L19
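Based on the two linked files, the conflict can be sketched roughly as follows (a simplified sketch, not the actual file contents):

```cmake
# cmake/External/nccl.cmake (simplified sketch):
# the nccl_external target is only defined when the bundled NCCL is built,
# i.e. when the system NCCL is NOT used.
if(NOT USE_SYSTEM_NCCL)
  ExternalProject_Add(nccl_external
    # ... build the bundled NCCL ...
  )
endif()

# cmake/Dependencies.cmake (simplified sketch):
# gloo_cuda unconditionally depends on nccl_external, which cannot be
# resolved when USE_SYSTEM_NCCL=ON.
add_dependencies(gloo_cuda nccl_external)
```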

To Reproduce

Steps to reproduce the behavior:

  1. run cmake -DBUILD_PYTHON=OFF -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_NCCL=ON -DUSE_SYSTEM_NCCL=ON -DUSE_DISTRIBUTED=ON -DUSE_GLOO=ON /path/to/torchsrc

  2. CMake will then report the following error:

    -- Configuring done
    CMake Error at cmake/Dependencies.cmake:1149 (add_dependencies):
      The dependency target "nccl_external" of target "gloo_cuda" does not exist.
    Call Stack (most recent call first):
      CMakeLists.txt:421 (include)
    
    
    -- Generating done
    CMake Generate step failed.  Build files cannot be regenerated correctly.
    

Expected behavior

The configuration step should complete without errors.

Environment

  • OS: CentOS 7 x86_64 with devtoolset-6 enabled
  • CMake version: 3.17.0
  • Cuda version: 9.2.148-1
  • CuDNN version: libcudnn7-7.6.5.31-1.cuda9.2
  • NCCL version: 2.4.8-ga-cuda9.2-1-1

All of the Cuda, CuDNN and NCCL libraries were installed through NVIDIA's official rpm packages; no libtorch was installed beforehand.

I've tried both the current master HEAD and tag v1.5.0-rc3, and the problem still exists.

Additional context

Related commit seems to be 30da84f

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

@colesbury colesbury added the module: build Build system issues label Apr 14, 2020
@mruberry mruberry added module: nccl Problems related to nccl support, triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module, and oncall: distributed Add this issue/PR to distributed oncall triage queue labels Apr 18, 2020
@mruberry
Collaborator

Hey @gnaggnoyil, sorry to hear you're having a problem building PyTorch. Can you elaborate on what you're trying to accomplish by setting both these flags?

@gnaggnoyil
Author

@mruberry Thanks for your reply! We set USE_GLOO to ON because we wanted to see whether we could build libtorch with multi-machine support. We have been setting USE_SYSTEM_NCCL since we first started using libtorch and never hit problems until we enabled USE_GLOO this time. We expected the two flags to be orthogonal and not to affect each other, hence this issue.

@mruberry mruberry added the module: docs Related to our documentation, both in docs/ and docblocks label Apr 20, 2020
@mruberry
Collaborator

mruberry commented Apr 20, 2020

Got it, thanks for the update @gnaggnoyil. I wonder if we just need to improve the documentation here: https://pytorch.org/docs/stable/distributed.html.

@pritamdamania87
Contributor

It seems like these two flags should work together. @zhaojuanmao @mrshenli Thoughts?

@zhaojuanmao
Contributor

zhaojuanmao commented Apr 23, 2020

@pritamdamania87 I agree with you that they should work together and that this needs a fix. The same issue was reported in #32286 and #36570 as well.

For the fix, we can add a condition: when USE_SYSTEM_NCCL is ON, link against the system library; otherwise, link against nccl_external.
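A minimal sketch of that condition (hypothetical; aside from USE_SYSTEM_NCCL and the two targets discussed above, the variable names here, such as NCCL_LIBRARIES, are assumptions):

```cmake
# Hypothetical sketch for cmake/Dependencies.cmake: only depend on the
# nccl_external target when the bundled NCCL is actually built.
if(USE_SYSTEM_NCCL)
  # System NCCL: link gloo_cuda against the library found on the system.
  target_link_libraries(gloo_cuda PRIVATE ${NCCL_LIBRARIES})
else()
  # Bundled NCCL: make sure nccl_external is built before gloo_cuda.
  add_dependencies(gloo_cuda nccl_external)
endif()
```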

@mrshenli
Contributor

Agree with @pritamdamania87 and @zhaojuanmao; they should work together.

@guol-fnst
Contributor

guol-fnst commented Jun 9, 2020

I can't reproduce this bug in my env. So how about adding a condition:
when USE_GLOO=ON, we use ExternalProject_Add to create the nccl_external target, regardless of whether USE_SYSTEM_NCCL is ON or OFF.
https://github.com/pytorch/pytorch/blob/master/cmake/External/nccl.cmake#L4

@zhaojuanmao any advice?
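That alternative would amount to something like the following (a hypothetical sketch, not actual repository code):

```cmake
# Hypothetical sketch for cmake/External/nccl.cmake: also create the
# nccl_external target when Gloo needs it, even with USE_SYSTEM_NCCL=ON.
if(NOT USE_SYSTEM_NCCL OR USE_GLOO)
  ExternalProject_Add(nccl_external
    # ... build the bundled NCCL ...
  )
endif()
```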

@gnaggnoyil
Author

gnaggnoyil commented Jun 10, 2020

@guol-fnst Not sure what your env was, but I just tried on an Arch Linux machine and, sure enough, this issue still exists. The only difference this time is that cmake emits a warning instead of an error:

-- Configuring done
CMake Warning (dev) at cmake/Dependencies.cmake:1239 (add_dependencies):
  Policy CMP0046 is not set: Error on non-existent dependency in
  add_dependencies.  Run "cmake --help-policy CMP0046" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:469 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /home/gnaggnoyil/libtorch_build

This time the env uses the system cuda, cudnn and nccl libraries installed through the official Arch Linux repositories, with no pytorch installed beforehand:

gnaggnoyil@gnaggnoyil-pc ~/libtorch_build % pacman -Qs | grep -E "(cmake )|(cuda )|(cudnn )|(nccl )"
local/cmake 3.17.3-1
local/cuda 10.2.89-5
local/cudnn 7.6.5.32-4
local/nccl 2.6.4-1

All tools and libraries used are system ones, and no Anaconda environments exist on this machine.
