[pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator #45318
Conversation
This pull request was exported from Phabricator. Differential Revision: D23910257
💊 CI failures summary and remediations

As of commit 7020fea (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚

🚧 1 ongoing upstream failure: these were probably caused by upstream breakages that are not fixed yet.

This comment was automatically generated by Dr. CI.
@ngimel Could you help review this PR? We're following the convention from here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L968
Requesting changes since I think recordStream should be done inside addCallback(). Also, can we run the DDP/HPC benchmarks with this change (with the fp16 hook enabled) to ensure it doesn't cause any issues? You can verify QPS and peak memory to confirm they're not affected significantly.
torch/lib/c10d/ProcessGroupNCCL.hpp (Outdated)

```cpp
void recordStream() {
  TORCH_CHECK(record_stream_cb_);
  // Do not free the underlying data storage of value_ before its usage on
```
cc @pritamdamania87, is there any way we could write a test that would fail without this record-stream callback logic?
I think it might be possible but tricky to do. It would probably have to be a C++ test that interacts with CUDACachingAllocator in a way that makes the caching allocator free the data while we're still using it on futureNCCLCallbackStream.
Currently I hacked the test file to verify locally that, in a test case like fut.then(mult).then(div), CUDACachingAllocator::recordStream is invoked by the first two futures, respectively.
Agreed that a better approach could be a C++ unit test that actually lets CUDACachingAllocator free the cached data.
Per offline discussion, ran the callback right before
Changes LGTM, although there are a lot of CI failures. Can you fix those?
Record FutureNCCL callback stream on CUDA caching allocator (#45318)

Summary: Pull Request resolved: #45318

When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp, because the type of the future value in that context is pyobject rather than TensorList; a type cast would require pybind and introduce a Python dependency, which is not allowed in the c10d library.

I considered creating a util function in a separate file to support this type cast and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.

Finally, a `record_stream_cb_` member is added to FutureNCCL, with a default value of nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.

In addition, a few lines are reformatted by lint. caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.

Closes: #44203

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Reviewed By: pritamdamania87

Differential Revision: D23910257

fbshipit-source-id: fe41338e3b5510b9d9a631f57b78134e8c7e1b12
This pull request has been merged in 98aad93.
Summary:
When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp, because the type of the future value in that context is pyobject rather than TensorList; a type cast would require pybind and introduce a Python dependency, which is not allowed in the c10d library.

I considered creating a util function in a separate file to support this type cast and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.
Finally, a `record_stream_cb_` member is added to FutureNCCL, with a default value of nullptr. A default `record_stream_cb_` implementation is added to `caffe2/torch/csrc/jit/python/pybind_utils.h`, where a Python dependency is allowed.

Currently `record_stream_cb_` is not really customizable, and we assume the input ivalue of this callback is, or can be cast into, a list containing exactly one tensor. This callback will only be used by the NCCL backend.
Test Plan
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test @mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
Differential Revision: D23910257