
Adding profiling capability to C++ DDP collective functions #46471

Closed
wants to merge 7 commits into gh/mrzzd/5/base

Conversation

@mrzzd (Contributor) commented Oct 16, 2020

Stack from ghstack:

Differential Revision: D23948397

@facebook-github-bot added the oncall: distributed label Oct 16, 2020
mrzzd pushed a commit that referenced this pull request Oct 16, 2020
Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)

ghstack-source-id: 114486799
Pull Request resolved: #46471
@mrzzd mrzzd linked an issue Oct 16, 2020 that may be closed by this pull request
dr-ci bot commented Oct 16, 2020

💊 CI failures summary and remediations

As of commit f995345 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



facebook-github-bot (Contributor) commented Oct 16, 2020

💊 CI failures summary and remediations

As of commit 3f824f2 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

3 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_libtorch_linux_xenial_cuda11_0_cudnn8_py3_gcc7_build | Build | 🔁 rerun |
| CircleCI pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build | Build | 🔁 rerun |
| CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build | Build | 🔁 rerun |

Extra GitHub checks: 1 failed



mrzzd pushed a commit that referenced this pull request Oct 20, 2020
Pull Request resolved: #46471


ghstack-source-id: 114689816

Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)
@pritamdamania87 (Contributor) commented:

Can you update the PR summary to include an example of what the profiling output looks like with your changes?

@pritamdamania87 (Contributor) left a comment:

Thanks for working on this, the overall structure looks good!

torch/lib/c10d/ProcessGroup.cpp (outdated, resolved)
torch/lib/c10d/ProcessGroup.cpp (outdated, resolved)
torch/lib/c10d/ProcessGroupGloo.cpp (outdated, resolved)
// recordFunctionEndCallback_ is normally called in the finish() function by
// the base class, but since finish() is not called by WorkNCCL, we schedule this
// function to be run when work is done.
work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
Contributor:

This would only work in the case where outputs.size() == 1; we should validate that here.

Member:

I guess we would just need to change the if statement from `if (work->recordFunctionEndCallback_) {` to `if (work->recordFunctionEndCallback_ && can_profile) {`.

Contributor Author:

When the outputs size is greater than 1 (and so can_profile is false), profiling_title is null and recordFunctionEndCallback_ is therefore not set. Added a comment.
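For illustration, a minimal sketch of the gating described above, using illustrative names (maybeProfilingTitle, can_profile, profilingTitle) from the discussion rather than the actual ProcessGroupNCCL code:

```cpp
#include <vector>
#include <ATen/ATen.h>

// Hypothetical helper mirroring the gating described above: a profiling title
// is only forwarded when there is exactly one output tensor, so a multi-output
// (single-process multi-device) collective never starts a RecordFunction and
// therefore never sets recordFunctionEndCallback_.
const char* maybeProfilingTitle(
    const std::vector<at::Tensor>& outputs,
    const char* profilingTitle) {
  const bool can_profile = outputs.size() == 1;
  return can_profile ? profilingTitle : nullptr;
}

// Later, in the collective (sketch of the quoted code above): the end callback
// is only scheduled when it was actually installed, i.e. the title was non-null.
//
//   if (work->recordFunctionEndCallback_) {
//     work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
//   }
```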

torch/lib/c10d/ProcessGroupNCCL.cpp (outdated, resolved)
torch/lib/c10d/ProcessGroup.cpp (resolved)
tensor = _build_tensor(src + 1).fill_(master_value if rank == src else worker_value)
if cuda:
tensor = tensor.cuda(rank_to_GPU[rank][0])
self.call_dist_op("reduce", async_op, dist.reduce, tensor, src, op, group_id)
Contributor:

We should also test that send and recv work as well for both gloo and NCCL.

Contributor Author:

We have not implemented it for send and recv. What should we test?


// Store references to outputs and futureNCCLCallbackStream to be used by
// WorkNCCL::getFuture.
work->outputs_ = std::make_shared<std::vector<at::Tensor>>(outputs);
work->futureNCCLCallbackStreams_ = futureNCCLCallbackStreams_;

if (work->recordFunctionEndCallback_) {
Contributor:

We probably need to enhance pointToPoint as well to cover send and recv?

Contributor Author:

I am not familiar with that part. Should we leave it as future work?

Member:

I think it's fine to add as a follow-up PR, but can we file a GH issue for this (and any other follow-up tasks)?

Comment on lines 176 to 183
using namespace torch::autograd::profiler;
// Make sure enabling the profiler does not cause any issues. Note that in
// single-process multi-device mode we do not expect any events to be populated
// for collective operations.
enableProfiler({ProfilerState::CPU});
auto results = pg_->allreduce(tensors_);
disableProfiler();
return results;
Contributor:

Do we need to add a C++ test, since we're already covered by the Python tests?

Member:

I think it would be useful from a profiler perspective to test the C++ API as well, which skips the parsing/event-aggregation logic in the profiler that happens in Python. We have similar tests in https://github.com/pytorch/pytorch/blob/master/test/cpp/jit/test_misc.cpp#L2185-L2198
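For reference, a minimal sketch of what such a C++-side assertion could look like in a single-device-per-process test, extending the enableProfiler/disableProfiler calls already quoted in this PR. It assumes the legacy profiler API of this era, where disableProfiler() returns the per-thread event lists and each event exposes name(); the exact event-name string ("all_reduce" below) and types are assumptions and may differ:

```cpp
#include <string>
#include <c10/util/Exception.h>
#include <torch/csrc/autograd/profiler.h>

using namespace torch::autograd::profiler;

// Run one allreduce under the profiler and check that a matching event was
// recorded. (In single-process multi-device mode no event is expected, as
// discussed above, so this check only makes sense with one device per process.)
enableProfiler({ProfilerState::CPU});
auto work = pg_->allreduce(tensors_);
work->wait();
auto event_lists = disableProfiler();

bool found_allreduce_event = false;
for (const auto& events : event_lists) {
  for (const auto& evt : events) {
    // The exact name depends on the profilingTitle passed by the backend,
    // e.g. something like "nccl:all_reduce".
    if (std::string(evt.name()).find("all_reduce") != std::string::npos) {
      found_allreduce_event = true;
    }
  }
}
TORCH_CHECK(found_allreduce_event, "expected an allreduce event in the profiler output");
```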

@@ -9,6 +9,7 @@
#include <c10/cuda/CUDAGuard.h>
Contributor:

We probably need to add tests to ProcessGroupMPITest to validate the profiling works correctly for that as well.

@rohan-varma (Member) left a comment:

Awesome, this is looking great overall! Left some comments inline. Could you also paste what the profiling output looks like in the PR description (you can get that with `print(prof.key_averages().table())` in one of the tests)?

torch/csrc/distributed/c10d/init.cpp (outdated, resolved)
auto recordingFunction = std::make_shared<at::RecordFunction>(at::RecordScope::USER_SCOPE);
if (recordingFunction->active) {
recordingFunction->before(profiling_title, {});
std::function<void()> end_handler = [this, recordingFunction]() {
Member:

Can we std::move(recordingFunction) since it's not used after this anymore to avoid a copy?

Contributor Author:

It is a shared_ptr and the copy is cheap; right after this block the extra copy is destroyed anyway. If I wanted to use std::move, line 62 would change to something much less readable:
std::function<void()> end_handler = [this, recordingFunction{std::move(recordingFunction)}]()
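
For comparison, a hedged sketch of the two captures being discussed, in isolation (the end() call in the lambda body is illustrative; the real end_handler also runs the profiler teardown logic):

```cpp
#include <functional>
#include <memory>
#include <ATen/record_function.h>

std::function<void()> makeEndHandler(
    std::shared_ptr<at::RecordFunction> recordingFunction) {
  // (a) Current version: copy the shared_ptr into the lambda. The extra copy
  //     is a single refcount bump, and the local pointer is destroyed as soon
  //     as this scope ends anyway.
  std::function<void()> copy_version = [recordingFunction]() {
    recordingFunction->end();
  };

  // (b) C++14 init-capture with std::move: avoids the refcount bump at the
  //     cost of the noisier capture list quoted above.
  std::function<void()> move_version =
      [recordingFunction = std::move(recordingFunction)]() {
        recordingFunction->end();
      };

  (void)copy_version;  // only one of the two would exist in the real code
  return move_version;
}
```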

torch/lib/c10d/ProcessGroup.hpp (resolved)
// recordFunctionEndCallback_ is normally called in the finish() function by
// the base class, but since finish() is not called by WorkNCCL, we schedule this
// function to be run when work is done.
work->getFuture()->addCallback(std::move(work->recordFunctionEndCallback_));
Member:

I guess we would just need to change the if statement from `if (work->recordFunctionEndCallback_) {` to `if (work->recordFunctionEndCallback_ && can_profile) {`.

@@ -1476,7 +1497,8 @@ std::shared_ptr<ProcessGroup::Work> ProcessGroupNCCL::alltoall_base(
comm,
stream.stream());
},
OpType::ALLTOALL_BASE);
OpType::ALLTOALL_BASE,
"all_to_all");
Member:

Just to confirm, this won't include shape information for now, right? That's fine for this diff, but I just wanted to make sure.

Contributor Author:

That's true!

enableProfiler({ProfilerState::CPU});
auto results = pg_->allreduce(tensors_);
disableProfiler();
return results;
Member:

Generally a test should assert something/some condition. Could we search through results and verify there is an allreduce here? You can see https://github.com/pytorch/pytorch/blob/master/test/cpp/jit/test_misc.cpp#L2185-L2198 as an example.

Contributor Author:

Actually, since this is the multi-device, single-process case, no event should be collected unless the number of devices happens to be 1. So I am not sure what we can check.

Member:

Can we do a single-device-per-process test somehow? In the current version, are the profiling results empty?

Also, tangentially related: can you add a comment somewhere appropriate that specifies that it only works for a single process per device?

Member:

Following up here, is it possible to add some asserts on the expected result?

events = [event for event in prof.function_events if partial_key in event.name]
return events[0] if len(events) > 0 else None

recv_event = get_event(profiling_title)
Member:

Nit: here we've matched on a partial key, can we also add an assert for what the exact name would look like?

Contributor Author:

Changed it to match the postfix, since the full name could be e.g. nccl:reduce or gloo:reduce.

work = op(*args, async_op=async_op, **kwargs)
if async_op:
work.wait()
work._get_profiling_future().wait()
Member:

I don't think that's necessarily the case, as work.wait() could return without the profiling callback having run, and this wait ensures that the profiling callback (the one that terminates the record function) has finished successfully. We had to do something similar with RPC (see: https://github.com/pytorch/pytorch/pull/38352/files), although in that case it is transparent to the user.

Ideally, when there's profiling, work.wait() should ensure the profiling callbacks have run before returning, similar to RPC. It might depend on the future/work merge, though maybe we can get it to work now by modifying ::wait() to await the profiling future if one exists.
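
A hedged sketch of that idea (the struct, member, and helper names here are illustrative, not the actual Work API): wait() would also await the profiling future when one exists, so the RecordFunction end callback is guaranteed to have run before wait() returns and the explicit _get_profiling_future().wait() call becomes unnecessary:

```cpp
#include <chrono>
#include <ATen/core/ivalue.h>

// Sketch only: a Work-like object that awaits its profiling future in wait().
struct WorkWithProfiling {
  // Existing completion logic (assumed); returns true once the collective is done.
  bool waitForCompletion(std::chrono::milliseconds timeout);

  // Set only when profiling is enabled; completed once the RecordFunction
  // end callback has run.
  c10::intrusive_ptr<c10::ivalue::Future> profilingFuture_;

  bool wait(std::chrono::milliseconds timeout) {
    const bool completed = waitForCompletion(timeout);
    if (profilingFuture_) {
      profilingFuture_->wait();  // make profiling teardown part of wait()
    }
    return completed;
  }
};
```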

events = [event for event in prof.function_events if partial_key in event.name]
return events[0] if len(events) > 0 else None

recv_event = get_event(profiling_title)
Member:

Nit: Is it always a recv event, or are there different types of collective comm calls here? If the latter is true, can we use a name such as comm_event, which might be less confusing?

Contributor Author:

No event is collected for send/recv for now. Should we add that in a follow-up PR?

Member:

Yes, that is fine. I was mostly asking because I was curious why it was named recv_event.


recv_event = get_event(profiling_title)
if expect_event:
self.assertEqual(recv_event.count, 1)
Member:

Can we have a test where we do > 1 collective comm of the same type, and then > 1 collective comm of different types and validate the counts for those as well?

mrzzd pushed a commit that referenced this pull request Oct 27, 2020
Pull Request resolved: #46471


ghstack-source-id: 115289419

Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)
mrzzd pushed a commit that referenced this pull request Oct 28, 2020
Pull Request resolved: #46471


ghstack-source-id: 115335707

Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)
@rohan-varma rohan-varma self-requested a review October 28, 2020 19:45
@@ -131,6 +156,10 @@ void ProcessGroup::Work::finishAndThrow(std::exception_ptr exception) {
std::unique_lock<std::mutex> lock(mutex_);
completed_ = true;
exception_ = exception;
if (recordFunctionEndCallback_) {
recordFunctionEndCallback_();
recordFunctionEndCallback_ = nullptr;
Member:

Do we have tests that exercise this code path (finishAndThrow)?

Member:

@mrzzd Just following up here - do we need to add these tests?

@@ -137,6 +135,10 @@ class ProcessGroup {

OpType retrieveOpType();

// Keeps track of the future responsible for profiling owner creation
Member:

what does "profiling owner creation" mean? Do you just mean that this is a future that is complete when the profiling has finished?

}



Member:

nit: unneeded extra lines?

if is_async:
for work in works:
work.wait()
work._get_profiling_future().wait()
Member:

I'm assuming that the test is flaky if we remove this call? Is it okay to ship this as a prototype, given that we would need this explicit wait call from user code?

cc @pritamdamania87 - I guess we might be able to hack something, but in the long term this will probably depend on the future/work merge, and we would implement this by adding a then callback like we do for RPC. Do you have any thoughts on what we can do currently?

@rohan-varma (Member) left a comment:

Thank you for the awesome work! It looks great, but some larger things I think we should talk about:

  1. ProcessGroupMPI tests (looks like nccl and gloo are thoroughly tested)
  2. send/recv profiling follow-up, and file GH issues for that and any other follow-up tasks
  3. Discuss a design for removing work._get_profiling_future().wait() before we expose this to users.
  4. Could you also add the profiling output to the PR summary?

@rohan-varma (Member):

Discussed (3) offline; we will try to remove the call, and the profiling should still be done transparently since the NCCL callback should be invoked inline.

mrzzd pushed a commit that referenced this pull request Nov 5, 2020
Pull Request resolved: #46471


ghstack-source-id: 115954679

Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)
@rohan-varma (Member) left a comment:

LGTM overall, thanks for doing this! Had a couple of nits and 2 comments about testing. Feel free to land after taking a look at those.

mrzzd pushed a commit that referenced this pull request Nov 6, 2020
Pull Request resolved: #46471


ghstack-source-id: 116018837

Differential Revision: [D23948397](https://our.internmc.facebook.com/intern/diff/D23948397/)
codecov bot commented Nov 6, 2020

Codecov Report

Merging #46471 into gh/mrzzd/5/base will decrease coverage by 0.01%.
The diff coverage is 30.52%.

@@                 Coverage Diff                 @@
##           gh/mrzzd/5/base   #46471      +/-   ##
===================================================
- Coverage            81.45%   81.43%   -0.02%     
===================================================
  Files                 1798     1798              
  Lines               188242   188300      +58     
===================================================
+ Hits                153333   153345      +12     
- Misses               34909    34955      +46     

facebook-github-bot (Contributor):

This pull request has been merged in 160db3d.

Labels: cla signed, Merged, oncall: distributed
Development: successfully merging this pull request may close the issue "Autograd profiler support for torch.distributed"
4 participants