[NCCL] Dedicated stream to run all FutureNCCL callbacks. #43447
Conversation
Stack from ghstack:

Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:

1. Each time a `then` callback was called, we would get a stream from the pool and run the callback on that stream. With that approach, the stream traces show a lot of streams and debugging becomes more complicated. With a dedicated stream running all `then` callback operations, the trace results are much cleaner and easier to follow.
2. `getStreamFromPool` may eventually return the default stream or a stream that is used for other operations, which can cause slowdowns.

Unless the `then` callback takes longer than the preceding allreduce, this approach is as performant as the previous one.

Differential Revision: [D23277575](https://our.internmc.facebook.com/intern/diff/D23277575/)
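To make the motivation concrete, here is a minimal C++ sketch of the two dispatch strategies. It is illustrative only: the helper names and the simplified `std::function` callback type are hypothetical, not the PR's actual code.

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <functional>
#include <memory>

// Before: each then() callback grabbed a fresh stream from the pool, so
// traces filled up with short-lived streams and (per the motivation above)
// the pool might hand back a stream already used for other operations.
void runCallbackFromPool(const std::function<void()>& cb) {
  at::cuda::CUDAStream stream = at::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream); // callback kernels land here
  cb();
}

// After: every callback reuses one dedicated stream, so traces stay clean
// and the stream is never shared with unrelated work.
void runCallbackOnDedicatedStream(
    const std::function<void()>& cb,
    const std::shared_ptr<at::cuda::CUDAStream>& callbackStream) {
  c10::cuda::CUDAStreamGuard guard(*callbackStream);
  cb();
}
```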
LGTM, please address some of my comments below before landing.
torch/lib/c10d/ProcessGroupNCCL.cpp
if (devices.size() == 1) {
  futureNCCLCallbackStream_ =
      std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool());
}
This should be done in the constructor of ProcessGroupNCCL and not each time we call getNCCLComm.
I've decided to instead make a static vector of streams called futureNCCLCallbackStreams_ that will store one stream per device. I think it will look much better. I'll give it a quick try and let you know if it causes problems.
That's a bit strange, but when I try to initialize even a simple static int using c10::cuda::device_count, as in int ProcessGroupNCCL::test = c10::cuda::device_count();, I get the following error:
THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
ERROR:root:Caught exception:
Traceback (most recent call last):
File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 226, in wrapper
fn()
File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 93, in wrapper
return func(*args, **kwargs)
File "test/distributed/test_c10d.py", line 3226, in test_ddp_comm_hook_allreduce_with_then_hook_nccl
gpu_model = self._gpu_model_with_ddp_comm_hook(
File "test/distributed/test_c10d.py", line 3111, in _gpu_model_with_ddp_comm_hook
TestDdpCommHook().to(device_id),
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 612, in to
return self._apply(convert)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 610, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
File "/home/sinannasir/local/pytorch/torch/cuda/__init__.py", line 173, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54
exiting process with exit code: 10
So instead, I just initialize the streams inside the constructor, and they are not static.
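For reference, a minimal sketch of the approach described above: non-static, per-device callback streams created in the ProcessGroupNCCL constructor (constructor signature and other setup elided; the exact code may differ from the PR). Doing this in the constructor also avoids touching the CUDA runtime during static initialization, which is presumably what triggered the initialization error above.

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAFunctions.h>
#include <memory>
#include <vector>

// Member in ProcessGroupNCCL.hpp (non-static), one entry per device:
//   std::vector<std::shared_ptr<at::cuda::CUDAStream>> futureNCCLCallbackStreams_;

// Inside the ProcessGroupNCCL constructor:
for (int i = 0; i < c10::cuda::device_count(); ++i) {
  // Dedicated callback stream for device i; the vector is indexed by device.
  futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(
      at::cuda::getStreamFromPool(/*isHighPriority=*/false, i)));
}
```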
torch/lib/c10d/ProcessGroupNCCL.hpp
// Store a reference to the NCCL stream that runs all FutureNCCL callbacks.
std::shared_ptr<at::cuda::CUDAStream> futureNCCLCallbackStream_;
Pass a vector here with the streams for all devices. We can keep the single-process multi-device limitation only in FutureNCCL, where it's necessary.
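A sketch of the suggested header change, replacing the single-stream member shown above (comment wording is mine, not the PR's):

```cpp
// Dedicated streams, one per device, that run all FutureNCCL callbacks;
// indexed by device.
std::vector<std::shared_ptr<at::cuda::CUDAStream>> futureNCCLCallbackStreams_;
```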
LGTM!
Codecov Report

@@            Coverage Diff             @@
##   gh/sinannasir/14/base   #43447   +/- ##
=============================================
  Coverage          ?   69.34%
  Files             ?      378
  Lines             ?    46680
  Branches          ?        0
=============================================
  Hits              ?    32369
  Misses            ?    14311
  Partials          ?        0

Continue to review the full report at Codecov.
This pull request has been merged in 7d517cf.