
Add torch::cuda::nccl::all2all #45900

Closed · wants to merge 1 commit

Conversation


malfet commented Oct 6, 2020

Stack from ghstack:

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes #42517

Here is the NCCL dependency graph:

```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```

When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before #42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `ncclSend`/`ncclRecv` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids linking the static library twice.

A more involved, but more thorough, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
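
For reference, here is a minimal sketch of an all-to-all built from NCCL point-to-point primitives, in the spirit of the `all2all` that this PR moves into `libtorch_cuda.so`. It is illustrative only: the function name, flat buffer layout, and omitted error handling are assumptions, not the exact code added to `torch/csrc/cuda/nccl.cpp`.

```cpp
// Illustrative sketch (not the actual PyTorch implementation).
#include <nccl.h>
#include <cuda_runtime.h>

// Exchanges `count` float elements with every rank. `send` and `recv` are
// device buffers holding nranks * count elements each; chunk `peer` of the
// send buffer goes to rank `peer`, and chunk `peer` of the receive buffer
// is filled by rank `peer`. Error checking is omitted for brevity.
void simple_all2all(
    const float* send,
    float* recv,
    size_t count,
    int nranks,
    ncclComm_t comm,
    cudaStream_t stream) {
  // Group the point-to-point calls so NCCL schedules them as one operation.
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(send + peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recv + peer * count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();
}
```

Because the `ncclSend`/`ncclRecv` references live in a translation unit compiled into `libtorch_cuda.so`, the linker resolves them against `libnccl.a` exactly once.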

Differential Revision: D24138011

Use it from ProcessGroupNCCL

`torch/csrc/cuda/nccl.cpp` is compiled as part of the `torch_cuda` library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into `torch_python`.

Fixes #42517

[ghstack-poisoned]
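
For context, a hedged sketch of the call-site idea in `ProcessGroupNCCL.cpp`: the wrapper name comes from this PR, but the function signature, argument types, and helper shown here are assumptions for illustration, not the actual PyTorch API.

```cpp
// Hypothetical call-site sketch (assumed signature, not the real API).
#include <nccl.h>
#include <ATen/ATen.h>
#include <c10/cuda/CUDAStream.h>
#include <torch/csrc/cuda/nccl.h>  // declares torch::cuda::nccl::all2all

// Instead of issuing ncclSend/ncclRecv directly (which makes the objects in
// libc10d.a reference those symbols and drags a second copy of libnccl.a
// into libtorch_python.so), call the wrapper that is compiled into
// libtorch_cuda.so, so no raw NCCL point-to-point symbols are referenced here.
void launch_all2all(
    at::Tensor& input,
    at::Tensor& output,
    ncclComm_t comm,
    c10::cuda::CUDAStream& stream) {
  // Assumed call shape; the real wrapper may take extra arguments
  // (e.g., world size or split sizes) and different qualifiers.
  torch::cuda::nccl::all2all(input, output, comm, stream);
}
```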
malfet mentioned this pull request Oct 6, 2020
malfet added a commit that referenced this pull request Oct 6, 2020
Use it from ProcessGroupNCCL


ghstack-source-id: 1c5f90f544d06f88f4cc2132bd1587cbdc946911
Pull Request resolved: #45900
facebook-github-bot added the `oncall: distributed` label Oct 6, 2020
malfet requested a review from ngimel October 6, 2020 15:03
torch/csrc/cuda/comm.cpp (review comment, resolved)
@facebook-github-bot

@malfet merged this pull request in c19b9cd.


facebook-github-bot deleted the gh/malfet/2/head branch October 11, 2020 14:18
mingzhe09088 pushed a commit to mingzhe09088/pytorch-1 that referenced this pull request Dec 4, 2020
Summary:
Pull Request resolved: pytorch#45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes pytorch#42517

Here is the NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before pytorch#42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `ncclSend`/`ncclRecv` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids linking the static library twice.

A more involved, but more thorough, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Labels: Merged · oncall: distributed
4 participants