Add support for NCCL alltoall #44374
Conversation
💊 CI failures summary (as of commit 3b20dd6; more details on the Dr. CI page):
ci.pytorch.org: 1 failed. (Comment automatically generated by Dr. CI; revised 89 times.)
We already enabled alltoall in #42514; how is this different?
@ngimel That PR only enables alltoall_single; this PR adds alltoall.
And what is the difference? It would be good to have at least a brief description of this PR.
@ngimel I updated #44374 (comment)
@srinivas212 any update on this?
Codecov Report
@@            Coverage Diff             @@
##           master   #44374       +/-   ##
===========================================
+ Coverage   70.46%   80.65%   +10.18%
===========================================
  Files        1899     1899
  Lines      206032   206049       +17
===========================================
+ Hits       145186   166193    +21007
+ Misses      60846    39856    -20990
ping @srinivas212
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
      },
      OpType::ALLTOALL_BASE,
      "nccl:all_to_all");
  }
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
This is causing the current failure, so c10::intrusive_ptr<> is needed here.
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
@@ -1568,6 +1552,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall_base(
      "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0");
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
Same as above: c10::intrusive_ptr<>
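For concreteness, a minimal sketch of the fix both comments are asking for; the parameter list below is an assumption based on the surrounding c10d interface, not something shown in this diff:

// Sketch: return type switched from std::intrusive_ptr<> to c10::intrusive_ptr<>.
// Parameter names/types are assumed from the c10d ProcessGroup interface.
c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
    std::vector<at::Tensor>& outputTensors,
    std::vector<at::Tensor>& inputTensors,
    const AllToAllOptions& opts) {
  // ... body unchanged from the diff above ...
}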
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp lines 1594 to 1599 (at e94b602) create a conflict by redefining what the same file already defines at line 1481 or line 1555 (at e94b602), so I think this block should be removed.
This was removed in the following PR. So yes, we need to remove this.
@srinivas212 I already removed it.
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
LGTM! Thanks for fixing!!
 }
-  switch (t.scalar_type()) {
+ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
+  switch (type) {
     case at::kFloat:
Today's ProcessGroupNCCL also supports at::kBool. Is that the same as at::kByte?
{at::kChar, ncclInt8},
{at::kByte, ncclUint8},
{at::kFloat, ncclFloat},
{at::kDouble, ncclDouble},
{at::kInt, ncclInt32},
{at::kLong, ncclInt64},
{at::kHalf, ncclHalf},
{at::kBool, ncclUint8},
#if defined(__HIP_PLATFORM_HCC__) && HIP_VERSION >= 301
{at::kBFloat16, ncclBfloat16},
#endif
I have added kBool. And yes, I think kByte should be ncclUint8 as well, instead of ncclChar as currently in this file. I have updated this.
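Putting the two remarks together, a minimal sketch of what the updated switch could look like; the case list is assumed from the map quoted above, not the final file contents:

ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
  switch (type) {
    case at::kFloat:  return ncclFloat;
    case at::kDouble: return ncclDouble;
    case at::kHalf:   return ncclHalf;
    case at::kChar:   return ncclInt8;
    case at::kByte:   return ncclUint8;  // was ncclChar in this file before the update
    case at::kBool:   return ncclUint8;  // newly added; NCCL has no native bool type
    case at::kInt:    return ncclInt32;
    case at::kLong:   return ncclInt64;
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}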
torch/csrc/cuda/nccl.cpp
Outdated
@@ -99,6 +96,13 @@ ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
  }
}

ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
  if (!t.is_cuda()) {
    throw std::runtime_error("Unconvertible NCCL type");
nit: this predates this PR, but shall we be more explicit in the error message? Should it be the following?
f"NCCL only supports CUDA tensors, but got a tensor on {t.device}"
@@ -625,7 +629,7 @@ void all_gather(
 #endif
 }

-void all2all(at::Tensor& input,
+void all2all_single_equal_split(at::Tensor& input,
Would I be correct to assume this API is not visible to users?
It seems so? I cannot find anything about nccl at https://pytorch.org/cppdocs/api/library_root.html
NCCL_CHECK(ncclGroupStart());
for (int r = 0; r < numranks; r++) {
  // NCCL uses 0 byte message for synchronization
  // Avoid send/recv when message size is zero
Does this mean that even if all send/recv counts are 0, this would still trigger a zero-byte message to sync across ranks?
If all send counts are zero, wouldn't this be an empty NCCL group?
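For reference, the loop under discussion follows the standard NCCL >= 2.7 grouped point-to-point pattern; a simplified sketch with illustrative names (buffers, counts, and displacements are assumptions, not the PR's exact code):

// Hypothetical helper showing the skip-zero-count pattern.
void all_to_all_sketch(
    char* sendbuff, const size_t* sendcounts, const size_t* senddispls,
    char* recvbuff, const size_t* recvcounts, const size_t* recvdispls,
    size_t elem_size, ncclDataType_t type, int numranks,
    ncclComm_t comm, cudaStream_t stream) {
  NCCL_CHECK(ncclGroupStart());
  for (int r = 0; r < numranks; r++) {
    // NCCL interprets a 0-count send/recv as a zero-byte synchronization
    // message, so zero-count operations are skipped entirely here.
    if (sendcounts[r] != 0) {
      NCCL_CHECK(ncclSend(
          sendbuff + senddispls[r] * elem_size,
          sendcounts[r], type, r, comm, stream));
    }
    if (recvcounts[r] != 0) {
      NCCL_CHECK(ncclRecv(
          recvbuff + recvdispls[r] * elem_size,
          recvcounts[r], type, r, comm, stream));
    }
  }
  // If every count is zero, the group is empty and ncclGroupEnd() launches
  // nothing, so no cross-rank synchronization is triggered.
  NCCL_CHECK(ncclGroupEnd());
}

As far as I can tell, this answers both questions above: skipped operations send no zero-byte message, and an all-zero-counts call degenerates to an empty group that does nothing.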
cc @agolynski
Is there anything this PR is waiting for?
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Rebased today
ping @srinivas212 any update on this?
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@srinivas212 merged this pull request in 44922f2.
return collective(
    inputTensor0,
    outputTensor0,
    [&](at::Tensor& /* unused */,
@zasdfgbnm @mrshenli Hello, I'm a bit confused: why don't the outputTensors need to record the ncclStream to prevent them from being freed before the collective finishes?
This looks like a bug... Thanks for catching it!
You’re welcome :)
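For readers following along: the fix being discussed would register each output tensor with the caching allocator on the NCCL stream, mirroring what is already done for inputs. A hedged sketch using the c10::cuda::CUDACachingAllocator::recordStream API (variable names assumed from the diff context):

// Record the collective's stream on each output tensor so the caching
// allocator does not free/reuse its memory before the NCCL kernel finishes.
for (const auto& tensor : outputTensors) {
  c10::cuda::CUDACachingAllocator::recordStream(
      tensor.storage().data_ptr(), ncclStream);
}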
In #42514, NCCL alltoall_single was already added. This PR adds NCCL alltoall.

The difference between alltoall_single and alltoall is: alltoall_single works on a single tensor and sends/receives slices of that tensor, while alltoall works on a list of tensors and sends/receives the tensors in that list.

cc: @ptrblck @ngimel
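To make the distinction concrete, a minimal usage sketch against the c10d C++ interface; pg, myrank, numranks, and opts are assumed to be set up elsewhere, and the signatures follow the diffs above rather than a verified header:

// alltoall_base (alltoall_single): one flat tensor, sliced equally per rank.
std::vector<int64_t> noSplit;  // empty split sizes => equal split across ranks
at::Tensor in = at::arange(numranks * 4, opts);  // 4 elements for each peer
at::Tensor out = at::empty_like(in);
pg->alltoall_base(out, in, noSplit, noSplit)->wait();

// alltoall (this PR): one tensor per peer; shapes may differ across ranks.
std::vector<at::Tensor> inputs, outputs;
for (int r = 0; r < numranks; r++) {
  inputs.push_back(at::full({r + 1}, /*fill_value=*/r, opts));  // sent to rank r
  outputs.push_back(at::empty({myrank + 1}, opts));             // received from rank r
}
pg->alltoall(outputs, inputs)->wait();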