-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[C10D] Fix coalescedCollective op Flight Recording #120430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120430
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit b54bdbc with merge base 92a2b21 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: c85d4eb Pull Request resolved: #120430
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: 604bda7 Pull Request resolved: #120430
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: a00a1d0 Pull Request resolved: #120430
| workEnqueue(work); | ||
| // TODO(whc) did I delete a relevant codepath here? | ||
| // at::cuda::CUDAGraph::dec_pending_event_queries(); | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you would need to put back the capture_status == c10::cuda::CaptureStatus::None check and the else path.
Would be also nice to pack everything:
inc_pending_event_queries()
check capture_status
dec_pending_event_queries()
inside workEnqueue so that we don't have these checks everywhere. (can be in a different PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my reasoning here was that coalescing_state_ should NEVER be active inside this function. Becuase this function is part of the 'coalescing manager' implementation, which would call its own startCoalescing function later.
If that is correct, i'll add an assertion here to clarify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure how the cudagraph capture path works. Is it exclusive with coalescing or is it additive?
should the new logic be like this? I don't understand why the cuda event increment is unconditional and the decrement is conditional.
// Notify graphs before we check the capture status preemptively
at::cuda::CUDAGraph::inc_pending_event_queries();
workEnqueue(work);
if (capture_status != c10::cuda::CaptureStatus::None) {
at::cuda::CUDAGraph::dec_pending_event_queries();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion.
What I meant is like this:
at::cuda::CUDAGraph::inc_pending_event_queries();
if (capture_status == c10::cuda::CaptureStatus::None) {
workEnqueue(work);
} else {
at::cuda::CUDAGraph::dec_pending_event_queries();
}
It preemptively increases the pending work number, if this work is actually enqueued, then this is fine.
If this work ends up not being enqueued because CUDA Graph capture is active, then it winds back the pending work number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check elsewhere in the code how workEnqueue(work) is conditioned. I guess they are more or less done the same way. Hence I propose putting the above CUDA Graph logic into the workEnqueue function itself, so that we don't have the if-else at every call site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.
is it necessary to preemptively increment and then decrement? or can we just increment when we workEnqueue nad never decrement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a reason. See here:
#110665 (comment)
Cc @eqy
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: b3e7e68 Pull Request resolved: #120430
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: ddb126a Pull Request resolved: #120430
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: cecff23 Pull Request resolved: #120430
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the last fix about workEnqueue. LGTM.
| << "Launching ProcessGroupNCCL abort asynchrounously."; | ||
| std::future<bool> fut = std::async( | ||
| std::launch::async, [this, &reason]() { return this->abort(reason); }); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this explanation is accurate; one nit would maybe be editing "(a) capturing is already in progress, so we can not enqueue" to "(a) capturing is already in progress (we cannot enqueue in this case)."
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: 35d0aac Pull Request resolved: #120430
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
The coalescing manager api works by accumulating operations in python via a contextmanager, and then making one call into c++ to an <op>_coalesced API. It has limited support for ops and has been added recently to avoid overheads of making individual py-cpp calls. This complicates flight recording.. For now, flight recording of coalescing_manager collectives is less detailed than cpp coalesced collectives. TODO test the other 2 ops that are supported by coalescing_manager ghstack-source-id: e8dac5e Pull Request resolved: #120430
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Stack from ghstack (oldest at bottom):
Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang