
[Release/1.7] Enable NCCL A2A on OSS #48857

Closed
wants to merge 2 commits into from

Commits on Dec 4, 2020

  1. Cleanup nccl.cpp (pytorch#45899)

    Summary:
    Pull Request resolved: pytorch#45899
    
    Use function polymorphism to avoid repeated casts:
    i.e. instead of writing `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as an input argument.
    Also add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
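    The overload pattern described above can be sketched as follows. The type definitions here are simplified stand-ins, not the actual PyTorch implementation: the real `ncclResult_t` comes from `nccl.h`, and the torch-side error handling differs in detail.

    ```cpp
    #include <stdexcept>

    // Simplified stand-ins for the real NCCL/PyTorch types (illustration only).
    enum class ncclResult { Success = 0, UnhandledCudaError = 1 };
    using ncclResult_t = int; // raw NCCL return code

    // Hypothetical converter mirroring from_nccl_result.
    inline ncclResult from_nccl_result(ncclResult_t code) {
      return code == 0 ? ncclResult::Success : ncclResult::UnhandledCudaError;
    }

    // Original variant: takes the already-converted torch-side enum.
    inline void NCCL_CHECK(ncclResult status) {
      if (status != ncclResult::Success) {
        throw std::runtime_error("NCCL error");
      }
    }

    // New overload: accepts the raw NCCL code directly, so call sites can
    // write NCCL_CHECK(ncclSomeCall(...)) instead of
    // NCCL_CHECK(from_nccl_result(ncclSomeCall(...))).
    inline void NCCL_CHECK(ncclResult_t status) {
      NCCL_CHECK(from_nccl_result(status));
    }
    ```

    The non-pointer `to_nccl_comm` variant follows the same idea: an overload that takes the communicator by value or reference, so callers no longer need the `*to_nccl_comm(&comm)` double-step.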
    
    Test Plan: Imported from OSS
    
    Reviewed By: walterddr
    
    Differential Revision: D24138012
    
    Pulled By: malfet
    
    fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
    malfet authored and mingzhe0908 committed Dec 4, 2020
    Commit fea103d
  2. Add torch::cuda::nccl::all2all (pytorch#45900)

    Summary:
    Pull Request resolved: pytorch#45900
    
    Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
    
    Fixes pytorch#42517
    
    Here is a NCCL dependency graph:
    ```
    libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
        |                                   ^
        |                                   |
        --------> libc10d.a -----------------
    ```
    When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before pytorch#42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
    But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.
    
    This PR adds the `nccl[Send|Recv]` calls to `libtorch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids linking the static library twice.
    
    A more involved, but less error-prone, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
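    The wrapper approach can be sketched with stubbed-out primitives. All names below are simplified stand-ins: the real `all2all` takes tensors, communicators, and CUDA streams, and the real `ncclSend`/`ncclRecv` live in libnccl.

    ```cpp
    #include <utility>
    #include <vector>

    // Stand-ins for the NCCL point-to-point primitives; recording the issued
    // operations lets us observe the call schedule without a GPU.
    static std::vector<std::pair<int, bool>> g_ops; // (peer rank, is_send)

    inline void ncclSend(int peer) { g_ops.push_back({peer, true}); }
    inline void ncclRecv(int peer) { g_ops.push_back({peer, false}); }

    namespace torch { namespace cuda { namespace nccl {

    // Wrapper compiled into libtorch_cuda.so: this becomes the only place
    // that references ncclSend/ncclRecv, so the static libnccl.a is pulled
    // into a single shared library instead of being duplicated in
    // libtorch_python.so as well.
    void all2all(int nranks) {
      // The real implementation brackets this loop with
      // ncclGroupStart()/ncclGroupEnd() and passes buffers and a stream.
      for (int peer = 0; peer < nranks; ++peer) {
        ncclSend(peer);
        ncclRecv(peer);
      }
    }

    }}} // namespace torch::cuda::nccl
    ```

    `ProcessGroupNCCL.cpp` (linked into `libtorch_python.so`) then calls `torch::cuda::nccl::all2all` rather than touching `ncclSend`/`ncclRecv` directly.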
    
    Test Plan: Imported from OSS
    
    Reviewed By: mingzhe09088
    
    Differential Revision: D24138011
    
    Pulled By: malfet
    
    fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
    malfet authored and mingzhe0908 committed Dec 4, 2020
    Commit 445963c