Add Futures to ProcessGroupGloo #57818
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations
As of commit ca3a079 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages: test (1/1), Step: "Test PyTorch" (full log | diagnosis details | 🔁 rerun)
ghstack-source-id: a45d96b1ca65dfdb83ccfe3f8d84c6fea8533d60 Pull Request resolved: #57818
ghstack-source-id: ab4f05c1ad0888d067994001d24a0b9f139c2323 Pull Request resolved: #57818
ghstack-source-id: 1985c72ce50b896b287308f31cfa959fc72f9b0c Pull Request resolved: #57818
ghstack-source-id: 9347ea446747d672c8c4bda81e033d9e60774c94 Pull Request resolved: #57818
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
// FIXME: We need to call it here since Future completion requires all
// the work to be synchronized to CUDA.
work->synchronize();
Is this only specific to GLOO?
I think it concerns all backends that implement synchronize().
You can call synchronize() to make sure that tensors arrive on the GPU and are usable there without resorting to stream/event synchronization, or you can use CUDA mechanisms to synchronize.
@mrshenli requested that we use CUDA futures to deal with such synchronizations, which would be more efficient.
torch/lib/c10d/ProcessGroupGloo.cpp
Outdated
}

namespace {
c10::intrusive_ptr<c10::ivalue::Future> CreateFutureForOutput(
nit: AsOutput?
Done
try {
  fut->wait();
} catch (const std::exception& ex) {
  std::cerr << "Exception received: " << ex.what() << std::endl;
I understand that this is not new code, but shouldn't we use LOG(ERROR) here?
Shall we abort here?
> LOG(ERROR) here

Done

> Shall we abort here?

Then we won't be able to test exception throwing?
Thanks for the merge!
Thanks for adding this! I left some questions about consolidating the Work/Future APIs and fields. I am OK with addressing some of those comments in follow-up PRs, but let's not stop here and call the consolidation done.
One example: after the consolidation, we at least shouldn't have both Work and Future keeping a completed_ / exception_ field.
cc @lw
w = pg.broadcast(xs)
w.wait()
output = w.result()
fut = pg.broadcast(xs).get_future()
Curious, do we still have test coverage for work that makes sure the APIs exposed from work still behave the same way?
good point. Added c++ and Py tests for work API
@@ -3591,8 +3602,8 @@ def allreduce_hook(
    )

    @unittest.skipIf(
-       BACKEND != "mpi" and BACKEND != 'nccl',
-       "get_future is only supported on mpi and nccl"
+       BACKEND != "mpi" and BACKEND != "nccl" and BACKEND != "gloo",
Do we still need this skip if we support all three backends now? Or does any 3rd-party backend use this test file?
I don't think any 3rd-party backends use this file, but we also test the 'test' and RR (round-robin) backends. Although they are not covered by this test, I prefer to be explicit.
I see
torch/lib/c10d/ProcessGroupGloo.cpp
Outdated
}

namespace {
c10::intrusive_ptr<c10::ivalue::Future> CreateFutureForOutput(
ReturnFutureWithOutput -> returnFutureWithOutput to stay consistent with other function names in this file.
Done
torch/lib/c10d/ProcessGroupGloo.cpp
Outdated
    c10::ListType::create(c10::TensorType::get()));
}

void ReturnFutureWithOutput(
ditto
Done
    std::vector<std::vector<at::Tensor>> outputTensors,
    const char* profilingTitle,
    const c10::optional<std::vector<at::Tensor>>& inputTensors)
    : ProcessGroup::Work(-1, OpType::UNKNOWN, profilingTitle, inputTensors),
(This is prior to this PR): Curious, any reason for using UNKNOWN OpType here? I saw UNKNOWN is also used for recv, but I don't recall the reason for that.
I've looked around, and it seems like OpType is only used in the NCCL process group for p2p comm in #45873. Maybe there are other uses that I haven't found?
Since this already predates this PR, it shouldn't block this PR from landing. Let's add a BE issue to track this.
work->finish(eptr);

// FIXME: We need to call it here since Future completion requires all
// the work to be synchronized to CUDA.
Since we now have CUDA support in ivalue::Future, should this synchronization work be done by ivalue::Future instead? cc @lw
Completely agree.
BTW, we used to do some other things in synchronize() (e.g. copying tensors); however, it seems like all derived versions of synchronize() now match ivalue::Future::synchronizeWithCurrentStreams().
The plan is to deprecate the Work API in 1.10, so we don't have any more versions of synchronize(), and can then use ivalue::Future::synchronizeWithCurrentStreams() directly.
cc @lw
void ProcessGroupGloo::AsyncWork::finishWorkGlooError(std::exception_ptr eptr) {
  future_->setError(eptr);
  finish(eptr);
Similarly, since ivalue::Future has similar concepts and states for completed_ / exception_, shall we just use that instead? cc @lw
We exposed completed_ and exception_ as protected members of ProcessGroup::Work, so potentially they can be referred to by 3rd parties. Let's make a deprecation announcement for 1.10; then it'd be easier to do a better Work/Future merge, which can also include CUDA synchronization.
let's add a tracking issue for this as well.
I don't get it: how can third parties access the completed_ and exception_ fields? If they are protected it means they can only be accessed by subclasses, but I don't see the point of users subclassing Work: even if they did, ProcessGroupGloo would still return its own classes, and not the user ones, so what does one gain? Moreover, a trailing underscore in a field name is C++'s "convention" for private fields. All in all, I don't think we should worry about users depending on it.
I agree with Shen that now, effectively, we're doing the same thing twice within Work: we have two ways to mark completion (Work's own one, and the one within Future), etc. One way to go about it is to redefine Work so that it just "forwards" each of its methods to the Future it holds (e.g., Work::wait() basically just calls this->future_->wait()).
> I don't get it: how can third-parties access the completed_ and exception_ fields? If they are protected it means they can only be accessed by subclasses, but I don't see the point of users subclassing Work: even if they did, ProcessGroupGloo would still return its own classes, and not the user ones, so what does one gain?

MPI/Gloo subclass Work, and in the same way 3rd-party implementers can also subclass Work and use completed_ and exception_. The ProcessGroupGloo and ProcessGroupMPI Work subclasses use (and 3rd parties can use) ProcessGroup::Work's completed_ and exception_, so if we remove them, 3rd-party Work classes will break.

> Moreover, the trailing underscore in the fields names is C++'s "convention" for private fields.

I thought a trailing underscore just means it's a member, not necessarily private?

> All in all, I don't think we should worry about users depending on it.

I think you are right, but it's good to be cautious.

> I agree with Shen that now, effectively, we're doing the same thing twice within Work: we have two ways to mark completion (Work's own one, and the one within Future), etc. One way to go about it is to redefine Work so that it just "forwards" each of its methods to the Future it holds (e.g., Work::wait() basically just calls this->future_->wait()).

The plan is to remove the fields from ProcessGroup::Work and replace them with a future. Note that right now, the MPI and Gloo subclasses already hold a future.
Right, I hadn't understood that the completed_ and exception_ fields came from the root "abstract" base class Work (which, well, isn't actually abstract it seems). Indeed, people who created their own ProcessGroup implementations (as extensions) would be broken by any change there. But I think we will eventually need to do it, right?
void ProcessGroupGloo::AsyncWork::finishWorkGloo() {
  ReturnFutureWithOutput(future_, outputTensors_);
  finish();
ditto
let's leave it for 1.10?
Yep, looks OK to me, as we are too close to the branch cut.
Differential Revision: [D28304171](https://our.internmc.facebook.com/intern/diff/D28304171) [ghstack-poisoned]
ghstack-source-id: 536de83ce4543a5bfb8f64409317675b404192be Pull Request resolved: #57818
Differential Revision: [D28304171](https://our.internmc.facebook.com/intern/diff/D28304171) [ghstack-poisoned]
ghstack-source-id: 9441cd0457c2d49b287b6833fb305012282819d4 Pull Request resolved: #57818
It's a good first step to add the getFuture method to Gloo too, but we agree that the final solution will be to remove that method and make Work itself a Future subclass, right?
torch/lib/c10d/ProcessGroupGloo.cpp
Outdated
}

void returnFutureWithOutput(
    c10::intrusive_ptr<c10::ivalue::Future> future,
Nit: no need to pass the pointer by value; this function could just take a Future& and thus avoid the copy and the consequent refcount increase and decrease.
done
if (outputTensors.size() > 1) {
  return c10::make_intrusive<c10::ivalue::Future>(
      c10::ListType::create(c10::ListType::create(c10::TensorType::get())));
}
return c10::make_intrusive<c10::ivalue::Future>(
    c10::ListType::create(c10::TensorType::get()));
Is there a reason for this? I find it odd that type information depends on a runtime condition. Types are typically useful to "predict what would happen before it happens" (e.g., static analysis), but by doing this we defeat that purpose.
@lw agreed that this is not good for static analysis.
Right now, the only return value Work supports is a list of tensors; however, some operations need to return a list of lists (the c10d lib has been very spotty on return-value coverage, and we never properly encountered this case). Also, I prefer this PR to be BC, i.e. in most cases we still need to return a list.
How would you prefer to address this?
I'm not very knowledgeable about the PG API. My first impression is that it would be more "regular" and predictable if here we always used a list of tensors, even in the case of a singleton.
If, however, you say that different methods of a PG need to return different types, that's fine, and some can return a list of tensors while others can return a singleton tensor. However, I'd like it if a given method always returns the same type, no matter its arguments.
Agreed. I don't mind returning a singleton tensor wrapped in a list (and we already do), but how would we deal with returning a list of lists?
> One way to go about it is to redefine Work so that it just "forwards" each of its methods to the Future it holds (e.g., Work::wait() basically just calls this->future_->wait()).

Yes.
Differential Revision: [D28304171](https://our.internmc.facebook.com/intern/diff/D28304171) [ghstack-poisoned]
ghstack-source-id: 7a3a006bf6cc234425b2a6d8828fdbd711ccef9d Pull Request resolved: #57818
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@agolynski merged this pull request in 4ef9426.
A bit late to this, but are we planning to also add support for send/recv futures and testing? It looks like we would need to add that to Send/RecvWork.
Yes, we could add handling for that in 1.10. Send/RecvWork (both Gloo and MPI) run in async mode, so they would require some more work (maybe sacrificing some efficiency, or API, or a combination). @rohan-varma, are there use cases for send/recv futures?
Summary: Pull Request resolved: pytorch#57818
Test Plan: Imported from OSS
Reviewed By: SciPioneer
Differential Revision: D28304171
Pulled By: agolynski
fbshipit-source-id: dbf7f5538890d138582831aa0279ede89619ea1e
Stack from ghstack:
Differential Revision: D28304171