
[Release/1.7] Enable NCCL A2A on OSS #48857

Closed
wants to merge 2 commits

Conversation

mingzhe09088
Contributor

Summary:
Pull Request resolved: #45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes #42517

Here is a NCCL dependency graph:

```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```

When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library unless the `-whole-archive` option is used. Before #42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `ncclSend`/`ncclRecv` calls to `libtorch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double linking the static library.

A more involved, but more robust, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
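
For illustration, here is a minimal sketch of that wrapper approach, assuming simplified buffer and signature choices (the real `torch::cuda::nccl::all2all` signature differs); the comments note which library each piece ends up in:

```cpp
#include <nccl.h>  // ncclSend/ncclRecv/ncclGroupStart/ncclGroupEnd

// Compiled into libtorch_cuda.so (e.g. torch/csrc/cuda/nccl.cpp).
// Because ncclSend/ncclRecv are referenced here, those symbols (and the NCCL
// internals they depend on) are resolved once, inside libtorch_cuda.so.
// Error checking omitted for brevity.
namespace torch { namespace cuda { namespace nccl {

void all2all(const float* sendbuff, float* recvbuff, size_t count_per_rank,
             ncclComm_t comm, cudaStream_t stream) {
  int nranks = 0;
  ncclCommCount(comm, &nranks);
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(sendbuff + peer * count_per_rank, count_per_rank, ncclFloat,
             peer, comm, stream);
    ncclRecv(recvbuff + peer * count_per_rank, count_per_rank, ncclFloat,
             peer, comm, stream);
  }
  ncclGroupEnd();
}

}}} // namespace torch::cuda::nccl

// A call site in ProcessGroupNCCL.cpp (linked via libc10d.a into
// libtorch_python.so) then references only the all2all wrapper symbol, so the
// linker no longer pulls ncclSend/ncclRecv out of libnccl.a a second time.
```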

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1


@dr-ci

dr-ci bot commented Dec 4, 2020

💊 CI failures summary and remediations

As of commit 445963c (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-CircleCI failure(s)

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 10 times.

Summary:
Pull Request resolved: pytorch#45899

Use function overloading to avoid repeated casts.
I.e., instead of writing `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as its input argument.
Also add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
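
A minimal sketch of the overloading idea, assuming simplified torch-side types; the bodies below are illustrative stand-ins for the helpers in `torch/csrc/cuda/nccl.cpp`, not the actual implementation:

```cpp
#include <nccl.h>     // raw NCCL API: ncclResult_t, ncclComm_t, ncclGetErrorString
#include <stdexcept>

namespace torch { namespace cuda { namespace nccl {

// Torch-side opaque types shadowing the raw NCCL ones (simplified assumptions).
struct ncclComm;
enum class ncclResult { Success = 0 /* ... */ };

// Existing variant: callers convert the raw status first, e.g.
//   NCCL_CHECK(from_nccl_result(ncclCommCount(...)));
inline void NCCL_CHECK(ncclResult status) {
  if (status != ncclResult::Success)
    throw std::runtime_error("NCCL error");
}

// New overload: accepts the raw ::ncclResult_t directly, so callers can write
//   NCCL_CHECK(ncclCommCount(...));
inline void NCCL_CHECK(::ncclResult_t status) {
  if (status != ncclSuccess)
    throw std::runtime_error(::ncclGetErrorString(status));
}

// Existing pointer-based cast helper, used as *to_nccl_comm(&comm).
inline ::ncclComm_t* to_nccl_comm(ncclComm** var) {
  return reinterpret_cast<::ncclComm_t*>(var);
}

// New non-pointer overload, used simply as to_nccl_comm(comm).
inline ::ncclComm_t to_nccl_comm(ncclComm* var) {
  return reinterpret_cast<::ncclComm_t>(var);
}

}}} // namespace torch::cuda::nccl
```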

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
@github-actions

github-actions bot commented Feb 3, 2021

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
Stale pull requests will automatically be closed 30 days after being marked Stale.

@github-actions github-actions bot added the Stale label Feb 3, 2021
@github-actions github-actions bot closed this Apr 13, 2022
Labels
cla signed · oncall: distributed · Stale

3 participants