Conversation

@sinannasir sinannasir commented Aug 22, 2020

Stack from ghstack:

There are two main engineering motivations for running all FutureNCCL callbacks on a dedicated stream:

  1. Each time a then callback was invoked, we would get a stream from the pool and run the callback on that stream. With that approach, the stream traces show many different streams, which makes debugging more complicated. With a dedicated stream for all then callback operations, the trace results are much cleaner and easier to follow.
  2. getStreamFromPool may eventually return the default stream or a stream that is already used for other operations, which can cause slowdowns.

Unless the then callback takes longer than the preceding allreduce, this approach is as performant as the previous one.
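The contrast between the two approaches can be sketched with a small mock (MockStreamPool and the integer stream ids are hypothetical stand-ins for at::cuda::getStreamFromPool and real CUDA streams, not the actual API):

```cpp
#include <cassert>
#include <set>

// Hypothetical mock of a CUDA stream pool: each call hands out the next
// stream id, mimicking getStreamFromPool() returning varying streams.
struct MockStreamPool {
  int next = 1;  // 0 is reserved for the "default stream" in this mock
  int getStreamFromPool() { return next++; }
};

// Old approach: every then-callback grabs a fresh stream from the pool,
// so traces show one stream per callback invocation.
std::set<int> runCallbacksFromPool(MockStreamPool& pool, int numCallbacks) {
  std::set<int> streamsUsed;
  for (int i = 0; i < numCallbacks; ++i) {
    streamsUsed.insert(pool.getStreamFromPool());  // new stream per callback
  }
  return streamsUsed;
}

// New approach: one dedicated stream is reserved up front and reused,
// so traces show a single callback stream.
std::set<int> runCallbacksDedicated(MockStreamPool& pool, int numCallbacks) {
  const int dedicated = pool.getStreamFromPool();  // reserved once
  std::set<int> streamsUsed;
  for (int i = 0; i < numCallbacks; ++i) {
    streamsUsed.insert(dedicated);  // every callback reuses the same stream
  }
  return streamsUsed;
}
```

With five callbacks, the pool approach touches five distinct streams while the dedicated approach touches exactly one, which is the trace-cleanliness argument above.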

Differential Revision: D23277575

sinannasir added a commit that referenced this pull request Aug 22, 2020

ghstack-source-id: 110500130
Pull Request resolved: #43447
@sinannasir sinannasir changed the title [NCCL] Dedicated stream to run all FutureNCCL callbacks. [wip] [NCCL] Dedicated stream to run all FutureNCCL callbacks. Aug 22, 2020
@sinannasir sinannasir marked this pull request as draft August 22, 2020 02:35
@sinannasir sinannasir marked this pull request as ready for review August 22, 2020 03:48
dr-ci bot commented Aug 22, 2020

💊 CI failures summary and remediations

As of commit 76f16ac (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@sinannasir sinannasir changed the title [wip] [NCCL] Dedicated stream to run all FutureNCCL callbacks. [NCCL] Dedicated stream to run all FutureNCCL callbacks. Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 25, 2020
Pull Request resolved: #43447

ghstack-source-id: 110611337

Differential Revision: [D23277575](https://our.internmc.facebook.com/intern/diff/D23277575/)
@pritamdamania87 pritamdamania87 left a comment

LGTM, please address some of my comments below before landing.

Comment on lines 650 to 653
```cpp
if (devices.size() == 1) {
  futureNCCLCallbackStream_ =
      std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool());
}
```
Contributor:
This should be done in the constructor of ProcessGroupNCCL and not each time we call getNCCLComm.

Contributor Author:
I've decided to make a static vector of streams called futureNCCLCallbackStreams_ instead that will store one stream per device. I think it will look way better. I'll give a quick try and let you know if it causes problems.

Contributor Author:
That's a bit strange, but when I try to initialize even a simple int as a static member using c10::cuda::device_count (int ProcessGroupNCCL::test = c10::cuda::device_count();), I get the following error:

```
THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
ERROR:root:Caught exception:
Traceback (most recent call last):
  File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 226, in wrapper
    fn()
  File "/home/sinannasir/local/pytorch/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "test/distributed/test_c10d.py", line 3226, in test_ddp_comm_hook_allreduce_with_then_hook_nccl
    gpu_model = self._gpu_model_with_ddp_comm_hook(
  File "test/distributed/test_c10d.py", line 3111, in _gpu_model_with_ddp_comm_hook
    TestDdpCommHook().to(device_id),
  File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/home/sinannasir/local/pytorch/torch/nn/modules/module.py", line 610, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/sinannasir/local/pytorch/torch/cuda/__init__.py", line 173, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54
exiting process with exit code: 10
```

I now just initialize the streams inside the constructor, and they are not static.
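The error above is consistent with a static initializer running before the CUDA runtime is (lazily) initialized. A minimal mock of why constructor-time initialization works where static initialization does not; the mock_cuda namespace and MockProcessGroup are hypothetical stand-ins, not the real c10 API:

```cpp
#include <stdexcept>

// Hypothetical mock of the CUDA runtime: device_count() fails unless the
// runtime has been initialized first, mirroring the THCudaCheck
// "initialization error" seen when a static member calls into CUDA.
namespace mock_cuda {
bool initialized = false;
int device_count() {
  if (!initialized) {
    throw std::runtime_error("initialization error");
  }
  return 2;  // pretend two devices are visible once the runtime is up
}
}  // namespace mock_cuda

// Constructor-time initialization runs only when an instance is created,
// i.e. after the runtime is ready, so querying the device count is safe.
// A static member initializer would run before that point and fail.
struct MockProcessGroup {
  int numDevices;
  MockProcessGroup() : numDevices(mock_cuda::device_count()) {}
};
```

Constructing a MockProcessGroup before mock_cuda::initialized is set throws, while constructing it afterwards succeeds, which is the same ordering constraint the real static-member experiment ran into.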

sinannasir added a commit that referenced this pull request Aug 27, 2020
Pull Request resolved: #43447

ghstack-source-id: 110852266

Differential Revision: [D23277575](https://our.internmc.facebook.com/intern/diff/D23277575/)
Comment on lines 139 to 140
```cpp
// Store a reference to the NCCL stream that runs all FutureNCCL callbacks.
std::shared_ptr<at::cuda::CUDAStream> futureNCCLCallbackStream_;
```
Contributor:
Pass a vector of streams here, one per device. We can keep the single-process multi-device limitation only in FutureNCCL, where it's necessary.
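The suggested design might look roughly like the following sketch; MockStream and MockProcessGroupNCCL are hypothetical stand-ins for at::cuda::CUDAStream and ProcessGroupNCCL, not the real types:

```cpp
#include <memory>
#include <vector>

// Hypothetical mock stream type standing in for at::cuda::CUDAStream,
// tagged with the device it belongs to.
struct MockStream {
  int device;
  explicit MockStream(int d) : device(d) {}
};

// Sketch of the suggested design: the process group owns one dedicated
// callback stream per device, and FutureNCCL indexes into the vector by
// device rather than the group being limited to a single device.
struct MockProcessGroupNCCL {
  std::vector<std::shared_ptr<MockStream>> futureNCCLCallbackStreams_;

  explicit MockProcessGroupNCCL(int deviceCount) {
    for (int d = 0; d < deviceCount; ++d) {
      futureNCCLCallbackStreams_.push_back(std::make_shared<MockStream>(d));
    }
  }

  std::shared_ptr<MockStream> callbackStreamForDevice(int d) const {
    return futureNCCLCallbackStreams_.at(d);
  }
};
```

Each device gets exactly one dedicated callback stream for the lifetime of the process group, so the trace-cleanliness property of the single-device case carries over per device.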

sinannasir added a commit that referenced this pull request Aug 27, 2020
Pull Request resolved: #43447

ghstack-source-id: 110900477

Differential Revision: [D23277575](https://our.internmc.facebook.com/intern/diff/D23277575/)
@pritamdamania87 pritamdamania87 left a comment

LGTM!

codecov bot commented Aug 28, 2020

Codecov Report

❗ No coverage uploaded for pull request base (gh/sinannasir/14/base@52de655). Click here to learn what that means.
The diff coverage is n/a.

```
@@                   Coverage Diff                    @@
##             gh/sinannasir/14/base   #43447   +/-   ##
========================================================
  Coverage                         ?   69.34%
========================================================
  Files                            ?      378
  Lines                            ?    46680
  Branches                         ?        0
========================================================
  Hits                             ?    32369
  Misses                           ?    14311
  Partials                         ?        0
```


@facebook-github-bot (Contributor):
This pull request has been merged in 7d517cf.

@facebook-github-bot facebook-github-bot deleted the gh/sinannasir/14/head branch September 1, 2020 14:17