Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda #48946
Conversation
… not run with use_cuda

Still does not work with use_cuda=True. Debugging the deadlock currently.

Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)

ghstack-source-id: 118016665
Pull Request resolved: #48946
💊 CI failures summary and remediations

As of commit 8a77092 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm: binary_windows_libtorch_3_7_cpu_release_build (1/1), Step: "Checkout code" ❄️
@@ -1126,6 +1115,17 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(
work->opTimeout_ = opTimeout_;
work->store_ = store_;

if (work->recordFunctionEndCallback_) {
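  // (Remainder of the hunk not shown above. Per the PR description, the
  // profiler's end callback is invoked inline at this point, after the
  // blocking portion of the kernel launch, rather than being registered
  // via addCallback on the work's future.)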
Thanks for working on this @rohan-varma. Would you elaborate on why it matters to add the callback later here?
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
Looks good. Thanks!
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
Pull Request resolved: #48946

Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback`, since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock in the `use_cuda=True` case; the fix is being tracked in #48987.

To ensure that the tests are no longer flaky, submitted this PR to ci-all (#48947) and ran the tests a number of times while ssh'd into the CI machine.

ghstack-source-id: 118276464
Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)
Tested on the ci-all PR by ssh'ing into the multigpu instance and repeatedly running test_all_reduce_sum_cuda, test_reduce_multigpu, test_all_gather_multigpu, and test_all_reduce_sum_cuda_async. The command never exited (i.e., no run failed), so the tests should no longer be flaky.
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
…th use_cuda" Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback` since it runs the lambda inline anyways, and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock for the `use_cuda=True` case, fix is being tracked in #48987. To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine and made sure they don't fail. Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/) [ghstack-poisoned]
Hm, the bazel build CI job is marked as pending, but when I go to it, it looks completed: https://app.circleci.com/pipelines/github/pytorch/pytorch/248979/workflows/b86fa46c-c327-4797-991f-562f0ff741cf/jobs/9508765
This pull request has been merged in 696e30a.
Stack from ghstack:

Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback`, since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with `use_cuda=True`. However, we are currently debugging a deadlock in the `use_cuda=True` case; the fix is being tracked in #48987.

To ensure that the tests are no longer flaky, submitted this PR to ci-all (#48947) and ran the tests a number of times while ssh'd into the CI machine, making sure they don't fail.

Differential Revision: D25368322
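To make the change concrete, here is a minimal C++ sketch of the idea. This is a simplification for illustration, not the verbatim PyTorch source: `Work` and `recordFunctionEndCallback_` follow the names in the diff, while the surrounding scaffolding is hypothetical.

```cpp
#include <functional>
#include <memory>

// Hypothetical, pared-down stand-in for ProcessGroupNCCL's work object.
struct Work {
  // Set by the profiler when active; ends the RecordFunction span for this op.
  std::function<void()> recordFunctionEndCallback_;
};

void collective(const std::shared_ptr<Work>& work) {
  // ... blocking portion of launching the NCCL kernel runs here ...

  // Before this PR, the end callback was registered on the work's future via
  // addCallback, which ran the lambda inline anyway and dragged in CUDA
  // stream logic. After the PR, it is invoked directly once the blocking
  // launch has returned, so CPU-side profiling (use_cuda=False) records a
  // sensible end time for the op.
  if (work->recordFunctionEndCallback_) {
    work->recordFunctionEndCallback_();
  }
}
```

Note that with `use_cuda=False` this only brackets the host-side launch; accurately timing the NCCL kernel itself still requires `use_cuda=True`, which is what #48987 tracks.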