[NCCL] DDP communication hook: getFuture() #41596

sinannasir · 2020-07-17T16:14:12Z

Stack from ghstack:

[NCCL] [For Test] In DDP's reducer merge work and future_work #41840 [NCCL] [For Test] In DDP's reducer merge work and future_work
[NCCL] DDP communication hook: getFuture() #41596 [NCCL] DDP communication hook: getFuture()

We've modified the previous design of convert_dist_work_to_future API in the GH Issue #39272.

Whenever we create a WorkNCCL object, create a Future associated with WorkNCCL and store it with the object.
Add an API c10::intrusive_ptr<c10::ivalue::Future> getFuture() to c10d::ProcessGroup::Work.
This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
To mark the future associated with WorkNCCL completed, implement a cudaStreamCallback function.

cudaStreamAddCallback is marked as deprecated. An alternative is cudaLaunchHostFunc, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to this discussion.

Differential Revision: D22583690

Test plan:

Run old python test/distributed/test_c10d.py.
Some additional tests:
test_ddp_comm_hook_allreduce_hook_nccl: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result result with the case of no hook registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:

........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

dr-ci · 2020-07-17T16:15:41Z

💊 CI failures summary and remediations

As of commit 3a2f0c8 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 61 times.

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

Pull Request resolved: #41596 We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). ghstack-source-id: 108004090 Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

Pull Request resolved: #41596 We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). ghstack-source-id: 108025379 Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

pritamdamania87

I understand the PR is not yet completely ready, but just leaving some early comments that would be helpful when you post a review ready version.

test/distributed/test_c10d.py

torch/lib/c10d/ProcessGroup.cpp

torch/lib/c10d/ProcessGroupNCCL.cpp

torch/lib/c10d/ProcessGroup.hpp

torch/lib/c10d/ProcessGroupNCCL.cpp

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

Pull Request resolved: #41596 We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). ghstack-source-id: 108111856 Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

Pull Request resolved: #41596 We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). ghstack-source-id: 108133913 Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/) [ghstack-poisoned]

Pull Request resolved: #41596 We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](#39272). 1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object. 2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. 3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. 4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr). ghstack-source-id: 108223686 Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 108915990 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 108944549 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 108986334 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109003394 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109005915 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109007778 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109150639 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109156649 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109225244 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109291702 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109392464 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…eamAddCallback" **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/) [ghstack-poisoned]

Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109461507 Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

…ack (#42335) Summary: Pull Request resolved: #42335 **Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff. We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation. We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). ghstack-source-id: 109461507 Test Plan: ```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py Couldn't download test skip set, leaving all tests enabled... ..............................s.....................................................s................................ ---------------------------------------------------------------------- Ran 117 tests in 298.042s OK (skipped=2) ``` ### Facebook Internal: 2\. HPC PT trainer run to validate no regression. Check the QPS number: **Master:** QPS after 1000 iters: around ~34100 ``` hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912 ``` ``` [0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199 ``` [detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs) **getFuture/new design:** QPS after 1000 iters: around ~34030 ``` hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee ``` ``` [0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018 ``` [detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs) **getFuture/new design Run 2:** QPS after 1000 iters: around ~34200 ``` hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee ``` ``` [0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201 ``` [detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs) **getFuture/old design (Regression):** QPS after 1000 iters: around ~31150 ``` hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER”testvideo_OLDgetFutureD22583690 (d904ea5)" --trainers 16 --trainer-version 1cb5cbb ``` ``` priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (https://github.com/pytorch/pytorch/commit/d904ea597277673eefbb3661430d3f905e8760d5).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159 ``` 3\. `flow-cli` tests; roberta_base; world_size=4: **Master:** f210039922 ``` total: 32 GPUs -- 32 GPUs: p25: 0.908 35/s p50: 1.002 31/s p75: 1.035 30/s p90: 1.051 30/s p95: 1.063 30/s forward: 32 GPUs -- 32 GPUs: p25: 0.071 452/s p50: 0.071 449/s p75: 0.072 446/s p90: 0.072 445/s p95: 0.072 444/s backward: 32 GPUs -- 32 GPUs: p25: 0.821 38/s p50: 0.915 34/s p75: 0.948 33/s p90: 0.964 33/s p95: 0.976 32/s optimizer: 32 GPUs -- 32 GPUs: p25: 0.016 2037/s p50: 0.016 2035/s p75: 0.016 2027/s p90: 0.016 2019/s p95: 0.016 2017/s ``` **getFuture new design:** f210285797 ``` total: 32 GPUs -- 32 GPUs: p25: 0.952 33/s p50: 1.031 31/s p75: 1.046 30/s p90: 1.055 30/s p95: 1.070 29/s forward: 32 GPUs -- 32 GPUs: p25: 0.071 449/s p50: 0.072 446/s p75: 0.072 445/s p90: 0.072 444/s p95: 0.072 443/s backward: 32 GPUs -- 32 GPUs: p25: 0.865 37/s p50: 0.943 33/s p75: 0.958 33/s p90: 0.968 33/s p95: 0.982 32/s optimizer: 32 GPUs -- 32 GPUs: p25: 0.016 2037/s p50: 0.016 2033/s p75: 0.016 2022/s p90: 0.016 2018/s p95: 0.016 2017/s ``` Reviewed By: ezyang Differential Revision: D22833298 fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208

sinannasir requested review from mrshenli, pietern, pritamdamania87 and zhaojuanmao as code owners July 17, 2020 16:14

sinannasir requested a review from rohan-varma July 17, 2020 16:17

sinannasir marked this pull request as draft July 17, 2020 17:37

pritamdamania87 reviewed Jul 17, 2020

View reviewed changes

sinannasir requested a review from pritamdamania87 July 22, 2020 04:34

sinannasir marked this pull request as ready for review July 22, 2020 04:35

sinannasir requested a review from apaszke as a code owner July 22, 2020 04:35

sinannasir mentioned this pull request Jul 22, 2020

[NCCL] [For Test] In DDP's reducer merge work and future_work #41840

Closed

mruberry added the Merged label Oct 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[NCCL] DDP communication hook: getFuture() #41596

[NCCL] DDP communication hook: getFuture() #41596

Uh oh!

sinannasir commented Jul 17, 2020 •

edited

Loading

Uh oh!

dr-ci bot commented Jul 17, 2020 •

edited

Loading

Uh oh!

pritamdamania87 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[NCCL] DDP communication hook: getFuture() #41596

[NCCL] DDP communication hook: getFuture() #41596

Uh oh!

Conversation

sinannasir commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan:

Uh oh!

dr-ci bot commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

Uh oh!

pritamdamania87 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sinannasir commented Jul 17, 2020 •

edited

Loading

dr-ci bot commented Jul 17, 2020 •

edited

Loading