[NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback #42335
**Main goal:** For the DDP communication hook, provide an API called `get_future` to retrieve a future associated with the completion of `c10d.ProcessGroupNCCL.work`. This diff enables NCCL support for the API.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. In the first version only NCCL supports this API; the default implementation throws UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback, which was potentially causing the performance regression in [#41596](#41596). Instead, we take the following approach.

**New design:**

1. As we create a `WorkNCCL` object inside `ProcessGroupNCCL::collective`, we also create a `FutureNCCL` (a subclass of `Future`) associated with the `WorkNCCL` and store it with the object. This future holds a reference to the original work object.
2. The Future is marked as completed when it is created (to allow for async execution of callbacks).
3. `fut.wait()` simply synchronizes the streams (`synchronizeStreams()` in Sinan's diff) and returns the result. This preserves the async execution model for CUDA.
4. When we add a callback to this future (`.then()`), we call `synchronizeStreams()` and invoke the callback inline.

Differential Revision: [D22833298](https://our.internmc.facebook.com/intern/diff/D22833298/)
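The new design described above can be illustrated with a small, CUDA-free mock. This is only a sketch under stated assumptions, not the code in this diff: `MockWork` and `MockFutureNCCL` are hypothetical names, stream synchronization is replaced by a counter, the result is a plain `std::vector<int>`, and `then()` returns the callback's value directly instead of a new future. The mock also leaves out the part where the work object stores its future.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for WorkNCCL. The real synchronizeStreams() would make
// the current CUDA stream wait on the NCCL stream; here we only count calls.
struct MockWork {
  std::vector<int> result;
  int syncCount = 0;
  void synchronizeStreams() { ++syncCount; }
};

// Hypothetical stand-in for FutureNCCL; holds a reference to its work object.
class MockFutureNCCL {
 public:
  explicit MockFutureNCCL(std::shared_ptr<MockWork> work)
      : work_(std::move(work)) {}

  // Step 2: the future counts as completed from the moment it is created.
  bool completed() const { return true; }

  // Step 3: wait() synchronizes the streams and returns the result.
  const std::vector<int>& wait() {
    work_->synchronizeStreams();
    return work_->result;
  }

  // Step 4: then() synchronizes the streams and runs the callback inline.
  // Simplification: returns the callback's value rather than a new future.
  int then(const std::function<int(const std::vector<int>&)>& callback) {
    assert(static_cast<bool>(callback));
    work_->synchronizeStreams();
    return callback(work_->result);
  }

 private:
  std::shared_ptr<MockWork> work_;
};
```

In this mock, neither `wait()` nor `then()` blocks on anything; each call only records a synchronization, which mirrors the intent of keeping the CUDA execution model asynchronous.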
LGTM! Let's land
Thanks for fixing the circular reference. LGTM! Please address @pritamdamania87's comments before landing. Thanks!
This pull request has been merged in 0a804be.
Stack from ghstack:

* … `work` and `future_work`
* #41840 [NCCL] [For Test] In DDP's reducer merge `work` and `future_work`
**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback, which potentially was causing performance regression in #41596.

Differential Revision: D22833298
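The "default implementation throws, only one backend overrides" pattern from the description can be sketched as follows. All names here (`MockFuture`, `MockWorkBase`, `MockWorkNccl`, `MockWorkOther`, `supportsGetFuture`) are hypothetical, `std::shared_ptr` stands in for `c10::intrusive_ptr`, and `std::runtime_error` stands in for the UnsupportedOperation error; the actual classes in c10d may differ.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical placeholder for c10::ivalue::Future.
struct MockFuture {};

// Hypothetical stand-in for c10d::ProcessGroup::Work.
class MockWorkBase {
 public:
  virtual ~MockWorkBase() = default;

  // Default implementation: backends that do not override this are unsupported.
  virtual std::shared_ptr<MockFuture> getFuture() {
    throw std::runtime_error("getFuture() is not supported by this backend");
  }
};

// The NCCL-like backend creates its future together with the work object
// and hands out the same future on every call.
class MockWorkNccl : public MockWorkBase {
 public:
  MockWorkNccl() : future_(std::make_shared<MockFuture>()) {}
  std::shared_ptr<MockFuture> getFuture() override { return future_; }

 private:
  std::shared_ptr<MockFuture> future_;
};

// Any other backend inherits the throwing default.
class MockWorkOther : public MockWorkBase {};

// Helper for callers that want to probe support instead of handling the throw.
bool supportsGetFuture(MockWorkBase& work) {
  try {
    return work.getFuture() != nullptr;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Keeping the throwing default in the base class means other backends compile unchanged and fail loudly only when a caller actually requests a future from them.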