[Release/1.7] Enable NCCL A2A on OSS #48857
Closed
Summary:
Pull Request resolved: #45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
Fixes #42517
Here is a NCCL dependency graph:
When a static library is linked into a dynamic library or an executable, the linker removes all unused or duplicate symbols from that library unless the `-whole-archive` option is used. Before #42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`. But adding `ncclSend`|`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double linking the static library.

A more involved, but less error-prone, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS
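As a rough illustration of the approach described in the summary, the sketch below builds an all-to-all exchange out of point-to-point transfers and keeps those transfers behind a single library-owned function, so callers never touch the low-level primitives directly. It is an in-process model only: the `sketch` namespace, `send_recv`, and `all2all` are hypothetical names chosen for this example, and nothing here is the code from this PR or the NCCL or PyTorch API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical namespace standing in for a library-owned wrapper layer.
// This is an in-process model, not PyTorch or NCCL code.
namespace sketch {

using Buffer = std::vector<int>;

// Models one point-to-point transfer by copying `count` elements from the
// sender's buffer into the receiver's buffer. With a real communication
// library this would be a matching send/receive pair on two different ranks.
void send_recv(const Buffer& src, std::size_t src_off,
               Buffer& dst, std::size_t dst_off, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    dst[dst_off + i] = src[src_off + i];
  }
}

// All-to-all over inputs.size() simulated ranks. Rank r sends chunk p of its
// input to peer p, and peer p stores it as chunk r of its output.
std::vector<Buffer> all2all(const std::vector<Buffer>& inputs,
                            std::size_t chunk) {
  const std::size_t world = inputs.size();
  std::vector<Buffer> outputs(world, Buffer(world * chunk, 0));
  for (std::size_t rank = 0; rank < world; ++rank) {
    // Each rank must provide exactly one chunk per peer.
    assert(inputs[rank].size() == world * chunk);
    for (std::size_t peer = 0; peer < world; ++peer) {
      send_recv(inputs[rank], peer * chunk,
                outputs[peer], rank * chunk, chunk);
    }
  }
  return outputs;
}

}  // namespace sketch
```

With three simulated ranks holding `{0, 1, 2}`, `{10, 11, 12}`, and `{20, 21, 22}` and a chunk size of 1, `sketch::all2all` returns `{0, 10, 20}`, `{1, 11, 21}`, and `{2, 12, 22}`, that is, the transpose of the inputs. Because only `all2all` calls `send_recv`, only the translation unit that defines it needs the low-level primitive.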
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1