
[pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator #45318

Closed
wants to merge 1 commit into from

Conversation

wayi1
Contributor

@wayi1 wayi1 commented Sep 25, 2020

Summary:
When calling `then()` from WorkNCCL, record the input data pointers on futureNCCLCallbackStream_ before the input callback executes.

Note that the recording cannot be added directly to the lambda used by addCallback in ProcessGroupNCCL.hpp, because the future value in that context has type pyobject rather than TensorList; casting it would require pybind and introduce a Python dependency, which is not allowed in the c10d library.

I considered creating a util function in a separate file to support this type cast and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so that would create a circular dependency.

Finally, a record_stream_cb_ member is added to FutureNCCL, with a default value of nullptr. A default record_stream_cb_ implementation is added to caffe2/torch/csrc/jit/python/pybind_utils.h, where a Python dependency is allowed.

Currently record_stream_cb_ is not really customizable, and we assume the input ivalue of this callback is, or can be cast into, a TensorList that holds exactly one tensor. This callback will only be used by the NCCL backend.

Test Plan:
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Differential Revision: D23910257

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D23910257

@dr-ci

dr-ci bot commented Sep 25, 2020

💊 CI failures summary and remediations

As of commit 7020fea (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.



@pritamdamania87
Contributor

@ngimel Could you help review this PR? We're following the convention from here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L968

Review threads (resolved): torch/lib/c10d/ProcessGroupNCCL.cpp, torch/lib/c10d/ProcessGroupNCCL.hpp

Contributor

@pritamdamania87 pritamdamania87 left a comment


Requesting changes since I think recordStream should be done inside addCallback(). Also, can we run the DDP/HPC benchmarks with this change (with fp16 hook enabled) to ensure it doesn't cause any issues? You can probably verify QPS and peak memory to see they're not affected significantly.

Review threads (resolved): torch/csrc/distributed/c10d/init.cpp, torch/lib/c10d/ProcessGroupNCCL.hpp

void recordStream() {
  TORCH_CHECK(record_stream_cb_);
  // Do not free the underlying data storage of value_ before its
  // usage on futureNCCLCallbackStream_ finishes.
Member


cc @pritamdamania87, is there any way we could write a test that would fail without this record-stream callback logic?

Contributor


I think it might be possible but tricky to do. It would probably have to be a C++ test that interacts with CUDACachingAllocator in a way that makes the caching allocator free the data while we're still using it on futureNCCLCallbackStream_.

Contributor Author


Currently I hacked the test file to verify locally that in a test case like fut.then(mult).then(div), CUDACachingAllocator::recordStream is invoked by the first two futures, respectively.

Agreed that a better approach would be a C++ unit test that lets CUDACachingAllocator actually free the cached data.

@wayi1
Contributor Author

wayi1 commented Oct 14, 2020

> Requesting changes since I think recordStream should be done inside addCallback(). Also, can we run the DDP/HPC benchmarks with this change (with fp16 hook enabled) to ensure it doesn't cause any issues? You can probably verify QPS and peak memory to see they're not affected significantly.

Per offline discussion, the callback is now run right before addCallback is invoked.


Review threads (resolved): aten/src/ATen/core/ivalue_inl.h

Contributor

@pritamdamania87 pritamdamania87 left a comment


Changes LGTM, although there are a lot of CI failures. Can you fix those?


Record FutureNCCL callback stream on CUDA caching allocator (#45318)

Summary:
Pull Request resolved: #45318

When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of future value in that context is pyobject rather than TensorList, but a type casting will require pybind and introduce Python dependency, which should not be allowed in c10d library.

I have considered creating a util function in a separate file to support this type casting, and then placing it under torch/csrc directory where python dependency is allowed. However, torch/csrc has a dependency on c10d, so this will create a circular dependency.

Finally, a `record_stream_cb_` member is added to FutureNCCL, and the default value is nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where Python dependency is allowed.

In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.

Closes: #44203

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Reviewed By: pritamdamania87

Differential Revision: D23910257

fbshipit-source-id: fe41338e3b5510b9d9a631f57b78134e8c7e1b12

@facebook-github-bot
Contributor

This pull request has been merged in 98aad93.
