[pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator #45318
Conversation
This pull request was exported from Phabricator. Differential Revision: D23910257
💊 CI failures summary and remediations

As of commit 7020fea (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚

🚧 1 ongoing upstream failure: these were probably caused by upstream breakages that are not fixed yet.

This comment was automatically generated by Dr. CI.
@ngimel Could you help review this PR? We're following the convention from here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L968
Requesting changes since I think recordStream should be done inside addCallback(). Also, can we run the DDP/HPC benchmarks with this change (with the fp16 hook enabled) to ensure it doesn't cause any issues? You can verify QPS and peak memory to confirm they're not affected significantly.
torch/lib/c10d/ProcessGroupNCCL.hpp (Outdated)

```cpp
void recordStream() {
  TORCH_CHECK(record_stream_cb_);
  // Do not free the underlying data storage of value_ before its usage on
```
cc @pritamdamania87, is there any way we could write a test that would fail without this record-stream callback logic?
I think it might be possible but tricky to do. It would probably have to be a C++ test that interacts with CUDACachingAllocator in a way that makes the caching allocator free the data while we're still using it on futureNCCLCallbackStream.
Currently I hacked the test file to verify locally that, in a test case like fut.then(mult).then(div), CUDACachingAllocator::recordStream is invoked by the first two futures, respectively.
Agreed that a better approach could be a C++ unit test that actually lets CUDACachingAllocator free the cached data.
Per offline discussion, ran the callback right before
Changes LGTM, although there are a lot of CI failures. Can you fix those?
Record FutureNCCL callback stream on CUDA caching allocator (#45318)

Summary: Pull Request resolved: #45318

When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp, because the type of the future value in that context is pyobject rather than TensorList; a type cast would require pybind and introduce a Python dependency, which is not allowed in the c10d library.

I considered creating a util function in a separate file to support this type cast and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.

Finally, a `record_stream_cb_` member is added to FutureNCCL, with a default value of nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.

In addition, a few lines are reformatted by lint. caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.

Closes: #44203

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Reviewed By: pritamdamania87

Differential Revision: D23910257

fbshipit-source-id: fe41338e3b5510b9d9a631f57b78134e8c7e1b12
This pull request has been merged in 98aad93.
Summary:
When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp, because the type of the future value in that context is pyobject rather than TensorList; a type cast would require pybind and introduce a Python dependency, which is not allowed in the c10d library.

I considered creating a util function in a separate file to support this type cast and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.
Finally, a `record_stream_cb_` member is added to FutureNCCL, with a default value of nullptr. A default `record_stream_cb_` implementation is added to `caffe2/torch/csrc/jit/python/pybind_utils.h`, where a Python dependency is allowed.

Currently `record_stream_cb_` is not really customizable, and we assume the input ivalue of this callback is, or can be cast into, a list containing exactly one tensor. This callback will only be used by the NCCL backend.
Test Plan
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
Differential Revision: D23910257