Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda #48946

Closed
rohan-varma wants to merge 7 commits from gh/rohan-varma/206/head

Conversation

@rohan-varma (Member) commented Dec 7, 2020:

Stack from ghstack:

Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback`, since it runs the lambda inline anyway and only triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels to be profiled accurately, we should run the profiler with `use_cuda=True`. However, we are currently debugging a deadlock in the `use_cuda=True` case; the fix is being tracked in #48987.

To verify that the tests are no longer flaky, I submitted this PR to ci-all (#48947), ran the tests repeatedly while ssh'd into the CI machine, and confirmed that they do not fail.

Differential Revision: D25368322
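
In rough terms, the change follows the pattern sketched below. This is a simplified illustration of the approach described above, not the PR's actual diff; the `Work` struct and `collectiveSketch` function are stand-ins for the real ProcessGroupNCCL types.

```cpp
// Simplified sketch of the pattern described above -- NOT the actual
// ProcessGroupNCCL diff. `Work` and `collectiveSketch` are illustrative
// stand-ins for the real types touched by this PR.
#include <functional>
#include <memory>

struct Work {
  // Set by the profiler when a RecordFunction is active for this collective;
  // empty otherwise.
  std::function<void()> recordFunctionEndCallback_;
};

void collectiveSketch(const std::shared_ptr<Work>& work) {
  // ... blocking portion of the collective launch happens here: the NCCL
  // kernel has been enqueued on its stream by the time we reach this point.

  // Before this change (conceptually): the callback was registered on the
  // work's future via addCallback, which ran the lambda inline anyway but
  // also exercised CUDA stream callback machinery.
  //
  // After: since the callback does no GPU work, just invoke it directly.
  if (work->recordFunctionEndCallback_) {
    work->recordFunctionEndCallback_();
  }
}

int main() {
  auto work = std::make_shared<Work>();
  work->recordFunctionEndCallback_ = [] { /* end the profiler record */ };
  collectiveSketch(work);
  return 0;
}
```

The point is simply that the profiler end-callback is pure host-side bookkeeping, so nothing is gained by routing it through the future's callback path; accurate timing of the GPU work itself is the job of `use_cuda=True`.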

@facebook-github-bot added the cla signed and oncall: distributed labels Dec 7, 2020
rohan-varma added a commit that referenced this pull request Dec 7, 2020
Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda

Still does not work with use_cuda=True; the deadlock is being debugged.

Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)

ghstack-source-id: 118016665
Pull Request resolved: #48946
@dr-ci bot commented Dec 7, 2020:

💊 CI failures summary and remediations

As of commit 8a77092 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build binary_windows_libtorch_3_7_cpu_release_build (1/1)

Step: "Checkout code" (full log | diagnosis details | 🔁 rerun) ❄️

```
Cloning into '.'...
Creating .ssh directory
Adding github.com and bitbucket.org host keys to known_hosts
Writing SSH key for checkout to id_rsa
ssh: connect to host github.com port 22: Connection timed out
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error setting git remote: exit status 128
```

---
This comment was automatically generated by Dr. CI and has been revised 29 times.

@rohan-varma changed the title from "[wip][not for review] Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda" to "Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda" Dec 8, 2020
```diff
@@ -1126,6 +1115,17 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(
   work->opTimeout_ = opTimeout_;
   work->store_ = store_;
 
+  if (work->recordFunctionEndCallback_) {
```
Contributor (inline review comment):

Thanks for working on this @rohan-varma. Would you elaborate on why it matters to add the callback later here?

@rohan-varma (Member, Author) replied:

@mrzzd As discussed offline, it's equivalent to directly invoking the callback inline, since the callback doesn't do any GPU work. See #48561 for more discussion/context.
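
To make that equivalence concrete, here is a toy sketch. The `InlineFuture` class below is made up purely for illustration and is not the c10 Future API: when the future is already satisfied, `addCallback` just runs the lambda on the calling thread, so calling the callback directly is behaviorally the same while skipping the stream-guard setup the real future performs.

```cpp
// Toy illustration of the equivalence described above. `InlineFuture` is a
// made-up stand-in, not the c10 Future API; the real future additionally sets
// up CUDA stream guards around the callback, which is the overhead avoided here.
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

class InlineFuture {
 public:
  void markCompleted() { completed_ = true; }

  // If the future is already completed, the callback runs immediately on the
  // calling thread -- i.e. "inline".
  void addCallback(std::function<void()> cb) {
    if (completed_) {
      cb();
    } else {
      pending_.push_back(std::move(cb));
    }
  }

 private:
  bool completed_ = false;
  std::vector<std::function<void()>> pending_;
};

int main() {
  // Host-side-only bookkeeping, analogous to ending a profiler record.
  auto endProfilerRecord = [] { std::cout << "record_function end\n"; };

  InlineFuture fut;
  fut.markCompleted();

  fut.addCallback(endProfilerRecord);  // runs the lambda inline on this thread
  endProfilerRecord();                 // ...which is equivalent to calling it directly

  return 0;
}
```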

@mrzzd (Contributor) left a comment:

Looks good. Thanks!

rohan-varma added a commit that referenced this pull request Dec 10, 2020
Pull Request resolved: #48946
ghstack-source-id: 118276464
Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)
@rohan-varma (Member, Author) commented:
Tested on the ci-all PR by ssh'ing into the multigpu instance and running the following:

```sh
while [ $? -eq 0 ] ; do TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="3" python test/distributed/test_distributed_fork.py -v TestDistBackendWithFork.test_all_reduce_sum_cuda ; done
```

I ran this loop for test_all_reduce_sum_cuda, test_reduce_multigpu, test_all_gather_multigpu, and test_all_reduce_sum_cuda_async. The loop never exited, so the tests should no longer be flaky.

@rohan-varma (Member, Author) commented:
Hm, the bazel build CI job is marked as pending, but when I open it, it looks completed: https://app.circleci.com/pipelines/github/pytorch/pytorch/248979/workflows/b86fa46c-c327-4797-991f-562f0ff741cf/jobs/9508765

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 696e30a.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/206/head branch December 14, 2020 15:17
Labels
cla signed · Merged · oncall: distributed

3 participants