Add support for NCCL alltoall #44374
Conversation
💊 CI failures summary (as of commit 3b20dd6; more details on the Dr. CI page):
ci.pytorch.org: 1 failed. (Comment automatically generated by Dr. CI; revised 89 times.)
We already enabled alltoall in #42514; how is this different?
@ngimel That PR only enables alltoall_single; this PR adds alltoall.
And what is the difference? It would be good to have at least a brief description of this PR.
@ngimel I updated #44374 (comment)
@srinivas212 any update on this?
Codecov Report
@@            Coverage Diff             @@
##           master   #44374       +/-   ##
===========================================
+ Coverage   70.46%   80.65%   +10.18%
===========================================
  Files        1899     1899
  Lines      206032   206049       +17
===========================================
+ Hits       145186   166193    +21007
+ Misses      60846    39856    -20990
ping @srinivas212
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
      },
      OpType::ALLTOALL_BASE,
      "nccl:all_to_all");
  }
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
This is causing the current failure, so c10::intrusive_ptr<> is needed here.
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
@@ -1568,6 +1552,14 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall_base(
      "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0");
}

std::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
Same as above: c10::intrusive_ptr<>
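For concreteness, a minimal sketch of the fix both comments are asking for; the parameter list below is an assumption based on the surrounding c10d interface, not something shown in this diff:

// Sketch: return type switched from std::intrusive_ptr<> to c10::intrusive_ptr<>.
// Parameter names/types are assumed from the c10d ProcessGroup interface.
c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall(
    std::vector<at::Tensor>& outputTensors,
    std::vector<at::Tensor>& inputTensors,
    const AllToAllOptions& opts) {
  // ... body unchanged from the diff above ...
}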
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp lines 1594 to 1599 (at e94b602) create a conflict by redefining what the same file already defines at line 1481 or line 1555 (at e94b602), so I think this block should be removed.
This was removed in the following PR. So yes, we need to remove this.
@srinivas212 I already removed it.
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
LGTM! Thanks for fixing!!
 }
-  switch (t.scalar_type()) {
+ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
+  switch (type) {
     case at::kFloat:
Today's ProcessGroupNCCL also supports at::kBool. Is that the same as at::kByte?
{at::kChar, ncclInt8},
{at::kByte, ncclUint8},
{at::kFloat, ncclFloat},
{at::kDouble, ncclDouble},
{at::kInt, ncclInt32},
{at::kLong, ncclInt64},
{at::kHalf, ncclHalf},
{at::kBool, ncclUint8},
#if defined(__HIP_PLATFORM_HCC__) && HIP_VERSION >= 301
{at::kBFloat16, ncclBfloat16},
#endif
I have added kBool. And yes, I think kByte should be ncclUint8 as well, instead of ncclChar as currently in this file. I have updated this.
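Putting the two remarks together, a minimal sketch of what the updated switch could look like; the case list is assumed from the map quoted above, not the final file contents:

ncclDataType_t to_nccl_data_type(c10::ScalarType type) {
  switch (type) {
    case at::kFloat:  return ncclFloat;
    case at::kDouble: return ncclDouble;
    case at::kHalf:   return ncclHalf;
    case at::kChar:   return ncclInt8;
    case at::kByte:   return ncclUint8;  // was ncclChar in this file before the update
    case at::kBool:   return ncclUint8;  // newly added; NCCL has no native bool type
    case at::kInt:    return ncclInt32;
    case at::kLong:   return ncclInt64;
    default:
      throw std::runtime_error("Unconvertible NCCL type");
  }
}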
torch/csrc/cuda/nccl.cpp
Outdated
@@ -99,6 +96,13 @@ ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
  }
}

ncclDataType_t to_nccl_data_type(const at::Tensor& t) {
  if (!t.is_cuda()) {
    throw std::runtime_error("Unconvertible NCCL type");
nit: this predates this PR, but shall we be more explicit in the error message? Should it be the following?
f"NCCL only supports CUDA tensors, but got a tensor on {t.device}"
@@ -625,7 +629,7 @@ void all_gather(
 #endif
 }

-void all2all(at::Tensor& input,
+void all2all_single_equal_split(at::Tensor& input,
Would I be correct to assume this API is not visible to users?
It seems so? I cannot find anything about nccl at https://pytorch.org/cppdocs/api/library_root.html
NCCL_CHECK(ncclGroupStart());
for (int r = 0; r < numranks; r++) {
  // NCCL uses 0 byte message for synchronization
  // Avoid send/recv when message size is zero
Does this mean that even if all send/recv counts are 0, this would still trigger a zero-byte message to sync across ranks?
If all send counts are zero, wouldn't this be an empty NCCL group?
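For reference, the loop under discussion follows the standard NCCL >= 2.7 grouped point-to-point pattern; a simplified sketch with illustrative names (buffers, counts, and displacements are assumptions, not the PR's exact code):

// Hypothetical helper showing the skip-zero-count pattern.
void all_to_all_sketch(
    char* sendbuff, const size_t* sendcounts, const size_t* senddispls,
    char* recvbuff, const size_t* recvcounts, const size_t* recvdispls,
    size_t elem_size, ncclDataType_t type, int numranks,
    ncclComm_t comm, cudaStream_t stream) {
  NCCL_CHECK(ncclGroupStart());
  for (int r = 0; r < numranks; r++) {
    // NCCL interprets a 0-count send/recv as a zero-byte synchronization
    // message, so zero-count operations are skipped entirely here.
    if (sendcounts[r] != 0) {
      NCCL_CHECK(ncclSend(
          sendbuff + senddispls[r] * elem_size,
          sendcounts[r], type, r, comm, stream));
    }
    if (recvcounts[r] != 0) {
      NCCL_CHECK(ncclRecv(
          recvbuff + recvdispls[r] * elem_size,
          recvcounts[r], type, r, comm, stream));
    }
  }
  // If every count is zero, the group is empty and ncclGroupEnd() launches
  // nothing, so no cross-rank synchronization is triggered.
  NCCL_CHECK(ncclGroupEnd());
}

As far as I can tell, this answers both questions above: skipped operations send no zero-byte message, and an all-zero-counts call degenerates to an empty group that does nothing.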
cc @agolynski
Is there anything this PR is waiting for?
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Rebased today
ping @srinivas212 any update on this?
@srinivas212 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@srinivas212 merged this pull request in 44922f2.
return collective(
    inputTensor0,
    outputTensor0,
    [&](at::Tensor& /* unused */,
@zasdfgbnm @mrshenli Hello, I'm a bit confused: why don't the outputTensors need to record the ncclStream to prevent them from being freed before the collective finishes?
This looks like a bug... Thanks for catching it!
You’re welcome :)
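For readers following along: the fix being discussed would register each output tensor with the caching allocator on the NCCL stream, mirroring what is already done for inputs. A hedged sketch using the c10::cuda::CUDACachingAllocator::recordStream API (variable names assumed from the diff context):

// Record the collective's stream on each output tensor so the caching
// allocator does not free/reuse its memory before the NCCL kernel finishes.
for (const auto& tensor : outputTensors) {
  c10::cuda::CUDACachingAllocator::recordStream(
      tensor.storage().data_ptr(), ncclStream);
}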
In #42514, NCCL alltoall_single was already added. This PR adds NCCL alltoall.

The difference between alltoall_single and alltoall is: alltoall_single works on a single tensor and sends/receives slices of that tensor, while alltoall works on a list of tensors and sends/receives the tensors in that list.

cc: @ptrblck @ngimel
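To make the distinction concrete, a minimal usage sketch against the c10d C++ interface; pg, myrank, numranks, and opts are assumed to be set up elsewhere, and the signatures follow the diffs above rather than a verified header:

// alltoall_base (alltoall_single): one flat tensor, sliced equally per rank.
std::vector<int64_t> noSplit;  // empty split sizes => equal split across ranks
at::Tensor in = at::arange(numranks * 4, opts);  // 4 elements for each peer
at::Tensor out = at::empty_like(in);
pg->alltoall_base(out, in, noSplit, noSplit)->wait();

// alltoall (this PR): one tensor per peer; shapes may differ across ranks.
std::vector<at::Tensor> inputs, outputs;
for (int r = 0; r < numranks; r++) {
  inputs.push_back(at::full({r + 1}, /*fill_value=*/r, opts));  // sent to rank r
  outputs.push_back(at::empty({myrank + 1}, opts));             // received from rank r
}
pg->alltoall(outputs, inputs)->wait();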