
Add support for NCCL alltoall #44374

Closed

wants to merge 34 commits into from

Conversation

@zasdfgbnm (Collaborator) commented Sep 9, 2020

NCCL alltoall_single was already added in #42514. This PR adds NCCL alltoall.

The difference between alltoall_single and alltoall is: alltoall_single works on a single tensor and sends/receives slices of that tensor, while alltoall works on a list of tensors and sends/receives the tensors in that list.
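At the C++ ProcessGroup level the two entry points look roughly like this (a signature sketch for illustration; parameter names are not copied from this PR):

// alltoall_single maps to alltoall_base: a single input and output tensor per
// rank, exchanged as contiguous slices (optionally with explicit split sizes).
c10::intrusive_ptr<ProcessGroup::Work> alltoall_base(
    at::Tensor& outputTensor,
    at::Tensor& inputTensor,
    std::vector<int64_t>& outputSplitSizes,
    std::vector<int64_t>& inputSplitSizes,
    const AllToAllOptions& opts);

// alltoall: one tensor per peer rank, so the exchanged tensors may differ in
// shape and need not be slices of a single contiguous buffer.
c10::intrusive_ptr<ProcessGroup::Work> alltoall(
    std::vector<at::Tensor>& outputTensors,
    std::vector<at::Tensor>& inputTensors,
    const AllToAllOptions& opts);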

cc: @ptrblck @ngimel

dr-ci bot commented Sep 9, 2020

💊 CI failures summary and remediations

As of commit 3b20dd6 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@ngimel (Collaborator) commented Sep 9, 2020

We already enabled alltoall in #42514, how is this different?

@zasdfgbnm (Collaborator, Author)

@ngimel That PR only enables alltoall_single, not alltoall.

@ngimel (Collaborator) commented Sep 9, 2020

And what is the difference? It would be good to have at least a brief description of this PR.

ngimel requested review from srinivas212 and removed the request for srinivas212 on September 9, 2020
@zasdfgbnm (Collaborator, Author)

@ngimel I updated #44374 (comment)

mrshenli added the oncall: distributed and triaged labels on Sep 10, 2020
@zasdfgbnm (Collaborator, Author)

@srinivas212 any update on this?

mrshenli removed the oncall: distributed label on Oct 6, 2020
codecov bot commented Oct 16, 2020

Codecov Report

Merging #44374 (3b20dd6) into master (5acb1cc) will increase coverage by 10.18%.
The diff coverage is 45.83%.

@@             Coverage Diff             @@
##           master   #44374       +/-   ##
===========================================
+ Coverage   70.46%   80.65%   +10.18%     
===========================================
  Files        1899     1899               
  Lines      206032   206049       +17     
===========================================
+ Hits       145186   166193    +21007     
+ Misses      60846    39856    -20990     

@zasdfgbnm (Collaborator, Author)

ping @srinivas212

},
OpType::ALLTOALL_BASE,
"nccl:all_to_all");
}
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
Collaborator review comment:

This is causing the current failure, so c10::intrusive_ptr<> is needed here.

@@ -1568,6 +1552,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall_base(
"ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0");
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
Collaborator review comment:

Same as above: c10::intrusive_ptr<>
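For context, a sketch of what both comments ask for (the body is elided and parameter names are illustrative): the base-class interface now uses c10::intrusive_ptr<>, so the override must return the same type.

c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
    std::vector<at::Tensor>& outputTensors,
    std::vector<at::Tensor>& inputTensors,
    const AllToAllOptions& opts) {
  // ... NCCL all-to-all implementation (unchanged); only the return type
  // changes from std::intrusive_ptr<> to c10::intrusive_ptr<> ...
}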

@ptrblck (Collaborator) commented Nov 15, 2020

c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
    std::vector<at::Tensor>& /* unused */,
    std::vector<at::Tensor>& /* unused */,
    const AllToAllOptions& /* unused */) {
  throw std::runtime_error("ProcessGroupNCCL does not support alltoall");
}

should create a conflict by redefining ProcessGroupNCCL::alltoall from the #ifdef ENABLE_NCCL_P2P_SUPPORT block in

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(

or

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(

so I think it should be removed.

@srinivas212

should create a conflict by redefining ProcessGroupNCCL::alltoall from the #ifdef ENABLE_NCCL_P2P_SUPPORT block in

This was removed in the following PR:
#45900

So yes, we need to remove this.

@zasdfgbnm (Collaborator, Author)

@srinivas212 I already removed it.

@facebook-github-bot (Contributor) left a comment:

@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor) left a comment:

LGTM! Thanks for fixing!!

}
switch (t.scalar_type()) {
ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
switch (type) {
case at::kFloat:
Contributor review comment:

Today's ProcessGroupNCCL also supports at::kBool, is that the same as at::kByte?

    {at::kChar, ncclInt8},
    {at::kByte, ncclUint8},
    {at::kFloat, ncclFloat},
    {at::kDouble, ncclDouble},
    {at::kInt, ncclInt32},
    {at::kLong, ncclInt64},
    {at::kHalf, ncclHalf},
    {at::kBool, ncclUint8},
#if defined(__HIP_PLATFORM_HCC__) && HIP_VERSION >= 301
    {at::kBFloat16, ncclBfloat16},
#endif

@zasdfgbnm (Collaborator, Author) replied:

I have added kBool. And yes, I think kByte should be ncclUint8 as well, instead of ncclChar as currently in this file. I have updated this.
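A minimal sketch of the resulting switch (abbreviated; the real function lives in torch/csrc/cuda/nccl.cpp and relies on that file's existing includes):

ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
  switch (type) {
    case at::kFloat:  return ncclFloat;
    case at::kHalf:   return ncclHalf;
    case at::kDouble: return ncclDouble;
    case at::kLong:   return ncclInt64;
    case at::kInt:    return ncclInt32;
    case at::kChar:   return ncclChar;
    case at::kByte:   return ncclUint8;  // previously mapped to ncclChar
    case at::kBool:   return ncclUint8;  // bool travels as uint8, matching ProcessGroupNCCL
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}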

@@ -99,6 +96,13 @@ ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
}
}

ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
if (!t.is_cuda()) {
throw std::runtime_error("Unconvertible NCCL type");
Contributor review comment:

nit: this predates this PR, but shall we be more explicit in the error message? Something like the following?

f"NCCL only supports CUDA tensors, but got a tensor on {t.device}"

@@ -625,7 +629,7 @@ void all_gather(
#endif
}

-void all2all(at::Tensor& input,
+void all2all_single_equal_split(at::Tensor& input,
Contributor review comment:

Would I be correct if I assume this API is not visible to users?

@zasdfgbnm (Collaborator, Author) replied:

It seems so? I cannot find anything about nccl at https://pytorch.org/cppdocs/api/library_root.html

NCCL_CHECK(ncclGroupStart());
for (int r = 0; r < numranks; r++) {
// NCCL uses 0 byte message for synchronization
// Avoid send/recv when message size is zero
Contributor review comment:

Does this mean that even if all send/recv counts are 0, this would still trigger a zero-byte message to do the sync across ranks?

@zasdfgbnm (Collaborator, Author) replied:

If all send counts are zero, wouldn't this be an empty nccl group?
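For reference, the pattern under discussion looks roughly like this (names are illustrative; displacements are assumed to be element offsets on typed device pointers):

NCCL_CHECK(ncclGroupStart());
for (int r = 0; r < numranks; r++) {
  // Skip peers with nothing to exchange so NCCL never emits a 0-byte message.
  if (sendcounts[r] != 0) {
    NCCL_CHECK(ncclSend(
        sendbuff + senddispls[r], sendcounts[r], type, r, comm, stream));
  }
  if (recvcounts[r] != 0) {
    NCCL_CHECK(ncclRecv(
        recvbuff + recvdispls[r], recvcounts[r], type, r, comm, stream));
  }
}
// If every count is zero the group stays empty and ncclGroupEnd() launches nothing.
NCCL_CHECK(ncclGroupEnd());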

@mrshenli (Contributor)

cc @agolynski

@zasdfgbnm (Collaborator, Author)

Is there anything this PR is waiting for?

@facebook-github-bot (Contributor) left a comment:

@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zasdfgbnm (Collaborator, Author)

Rebased today

@zasdfgbnm (Collaborator, Author)

ping @srinivas212 any update on this?

@facebook-github-bot (Contributor) left a comment:

@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@srinivas212 merged this pull request in 44922f2.

zasdfgbnm deleted the nccl-all2all branch on January 20, 2021
return collective(
inputTensor0,
outputTensor0,
[&](at::Tensor& /* unused */,
@cdzhan (Contributor) commented Jul 14, 2022:

@zasdfgbnm @mrshenli Hello, I'm a bit confused: why don't the outputTensors need to record ncclStream to prevent them from being freed before the collective finishes?

@zasdfgbnm (Collaborator, Author) replied:

This looks like a bug... Thanks for catching it!
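A minimal sketch of the kind of fix being discussed, mirroring what is already done for the input tensors (variable names illustrative):

// Mark each output tensor as in use on the NCCL stream so the caching
// allocator does not reclaim (and potentially reuse) its memory before the
// collective running on that stream has finished.
for (const auto& output : outputTensors) {
  c10::cuda::CUDACachingAllocator::recordStream(
      output.storage().data_ptr(), ncclStream);
}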

@cdzhan (Contributor) replied:

You’re welcome :)

Labels: cla signed, Merged, open source, triaged

8 participants