[NCCL] Dedicated stream to run all FutureNCCL callbacks. #43447
Conversation
Stack from ghstack:

Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:

1. Each time a `then` callback was called, we would get a stream from the pool and run the callback on that stream. With that approach, the stream traces show a lot of streams and debugging becomes more complicated. With a dedicated stream running all `then` callback operations, the trace results are much cleaner and easier to follow.
2. `getStreamFromPool` may eventually return the default stream or a stream that is used for other operations, which can cause slowdowns.

Unless the `then` callback takes longer than the preceding allreduce, this approach is as performant as the previous one.

Differential Revision: [D23277575](https://our.internmc.facebook.com/intern/diff/D23277575/)
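To make the motivation concrete, here is a minimal C++ sketch of the two dispatch strategies. It is illustrative only: the helper names and the simplified `std::function` callback type are hypothetical, not the PR's actual code.

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <functional>
#include <memory>

// Before: each then() callback grabbed a fresh stream from the pool, so
// traces filled up with short-lived streams and (per the motivation above)
// the pool might hand back a stream already used for other operations.
void runCallbackFromPool(const std::function<void()>& cb) {
  at::cuda::CUDAStream stream = at::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream); // callback kernels land here
  cb();
}

// After: every callback reuses one dedicated stream, so traces stay clean
// and the stream is never shared with unrelated work.
void runCallbackOnDedicatedStream(
    const std::function<void()>& cb,
    const std::shared_ptr<at::cuda::CUDAStream>& callbackStream) {
  c10::cuda::CUDAStreamGuard guard(*callbackStream);
  cb();
}
```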
LGTM, please address some of my comments below before landing.
torch/lib/c10d/ProcessGroupNCCL.cpp
if (devices.size() == 1) {
  futureNCCLCallbackStream_ =
      std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool());
}
This should be done in the constructor of ProcessGroupNCCL and not each time we call getNCCLComm.
I've decided to instead make a static vector of streams called futureNCCLCallbackStreams_ that will store one stream per device. I think it will look much better. I'll give it a quick try and let you know if it causes problems.
That's a bit strange, but when I try to initialize even a simple static int using c10::cuda::device_count, as in int ProcessGroupNCCL::test = c10::cuda::device_count();, I get the following error:
THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
ERROR:root:Caught exception:
Traceback (most recent call last):
File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 226, in wrapper
fn()
File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 93, in wrapper
return func(*args, **kwargs)
File "test/distributed/test_c10d.py", line 3226, in test_ddp_comm_hook_allreduce_with_then_hook_nccl
gpu_model = self._gpu_model_with_ddp_comm_hook(
File "test/distributed/test_c10d.py", line 3111, in _gpu_model_with_ddp_comm_hook
TestDdpCommHook().to(device_id),
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 612, in to
return self._apply(convert)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 610, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
File "/home/sinannasir/local/pytorch/torch/cuda/__init__.py", line 173, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54
exiting process with exit code: 10
So instead, I just initialize the streams inside the constructor, and they are not static.
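For reference, a minimal sketch of the approach described above: non-static, per-device callback streams created in the ProcessGroupNCCL constructor (constructor signature and other setup elided; the exact code may differ from the PR). Doing this in the constructor also avoids touching the CUDA runtime during static initialization, which is presumably what triggered the initialization error above.

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAFunctions.h>
#include <memory>
#include <vector>

// Member in ProcessGroupNCCL.hpp (non-static), one entry per device:
//   std::vector<std::shared_ptr<at::cuda::CUDAStream>> futureNCCLCallbackStreams_;

// Inside the ProcessGroupNCCL constructor:
for (int i = 0; i < c10::cuda::device_count(); ++i) {
  // Dedicated callback stream for device i; the vector is indexed by device.
  futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(
      at::cuda::getStreamFromPool(/*isHighPriority=*/false, i)));
}
```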
torch/lib/c10d/ProcessGroupNCCL.hpp
// Store a reference to the NCCL stream that runs all FutureNCCL callbacks.
std::shared_ptr<at::cuda::CUDAStream> futureNCCLCallbackStream_;
Pass a vector here with the streams for all devices. We can keep the single-process multi-device limitation only in FutureNCCL, where it's necessary.
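A sketch of the suggested header change, replacing the single-stream member shown above (comment wording is mine, not the PR's):

```cpp
// Dedicated streams, one per device, that run all FutureNCCL callbacks;
// indexed by device.
std::vector<std::shared_ptr<at::cuda::CUDAStream>> futureNCCLCallbackStreams_;
```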
LGTM!
Codecov Report

@@            Coverage Diff             @@
##   gh/sinannasir/14/base   #43447   +/- ##
=============================================
  Coverage          ?   69.34%
  Files             ?      378
  Lines             ?    46680
  Branches          ?        0
=============================================
  Hits              ?    32369
  Misses            ?    14311
  Partials          ?        0

Continue to review the full report at Codecov.
This pull request has been merged in 7d517cf.