Skip to content

[NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback #42335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

sinannasir
Copy link
Contributor

@sinannasir sinannasir commented Jul 30, 2020

Stack from ghstack:

Main goal: For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API c10::intrusive_ptr<c10::ivalue::Future> getFuture() to c10d::ProcessGroup::Work. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in #41596.

Differential Revision: D22833298

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). See the following approach:

**New design:**
1. As we create a `WorkNCCL` object inside `ProcessGroupNCCL::collective`, we also create a `FutureNCCL` (sub class of `Future`) associated with `WorkNCCL` and store it with the object. This future has a reference to the original work object.
2. The Future is marked as completed when its created (to allow for async execution of callbacks).
3. fut.wait() simply synchronizes the streams (synchronizeStreams() in Sinan's diff) and returns the result. This preserves the async execution model for CUDA.
4. When we add a callback to this future (.then()), we synchronizeStreams() and invoke the callback inline.

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 30, 2020
**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). See the following approach:

**New design:**
1. As we create a `WorkNCCL` object inside `ProcessGroupNCCL::collective`, we also create a `FutureNCCL` (sub class of `Future`) associated with `WorkNCCL` and store it with the object. This future has a reference to the original work object.
2. The Future is marked as completed when its created (to allow for async execution of callbacks).
3. fut.wait() simply synchronizes the streams (synchronizeStreams() in Sinan's diff) and returns the result. This preserves the async execution model for CUDA.
4. When we add a callback to this future (.then()), we synchronizeStreams() and invoke the callback inline.

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

ghstack-source-id: 108894839
Pull Request resolved: #42335
@sinannasir sinannasir requested a review from rohan-varma July 30, 2020 22:11
@dr-ci
Copy link

dr-ci bot commented Jul 30, 2020

💊 CI failures summary and remediations

As of commit 6322122 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 63 times.

…eamAddCallback"

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596). See the following approach:

**New design:**
1. As we create a `WorkNCCL` object inside `ProcessGroupNCCL::collective`, we also create a `FutureNCCL` (sub class of `Future`) associated with `WorkNCCL` and store it with the object. This future has a reference to the original work object.
2. The Future is marked as completed when its created (to allow for async execution of callbacks).
3. fut.wait() simply synchronizes the streams (synchronizeStreams() in Sinan's diff) and returns the result. This preserves the async execution model for CUDA.
4. When we add a callback to this future (.then()), we synchronizeStreams() and invoke the callback inline.

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 31, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 108915990

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 31, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 108944549

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 31, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 108986334

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
@sinannasir sinannasir requested a review from mrshenli August 4, 2020 15:21
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 4, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109150639

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 4, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109156649

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 5, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109225244

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
@sinannasir sinannasir requested review from wanchaol and mrshenli August 5, 2020 06:34
…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 5, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109291702

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Let's land

…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 6, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109392464

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the circular reference. LGTM! Please address @pritamdamania87's comments before landing. Thanks!

…eamAddCallback"


**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 7, 2020
Pull Request resolved: #42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](#41596).

ghstack-source-id: 109461507

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 0a804be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants