Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda #48946

Closed
rohan-varma wants to merge 7 commits from gh/rohan-varma/206/head

Conversation

@rohan-varma (Member) commented Dec 7, 2020:

Stack from ghstack:

Move `recordFunctionEndCallback` to after the blocking portion of launching the NCCL kernel, and remove `addCallback`, since it runs the lambda inline anyway and only triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels to be profiled accurately, we should run the profiler with `use_cuda=True`. However, we are currently debugging a deadlock in the `use_cuda=True` case; the fix is being tracked in #48987.

To verify that the tests are no longer flaky, I submitted this PR to ci-all (#48947), ran the tests repeatedly while ssh'd into the CI machine, and confirmed that they do not fail.

Differential Revision: D25368322
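
In rough terms, the change follows the pattern sketched below. This is a simplified illustration of the approach described above, not the PR's actual diff; the `Work` struct and `collectiveSketch` function are stand-ins for the real ProcessGroupNCCL types.

```cpp
// Simplified sketch of the pattern described above -- NOT the actual
// ProcessGroupNCCL diff. `Work` and `collectiveSketch` are illustrative
// stand-ins for the real types touched by this PR.
#include <functional>
#include <memory>

struct Work {
  // Set by the profiler when a RecordFunction is active for this collective;
  // empty otherwise.
  std::function<void()> recordFunctionEndCallback_;
};

void collectiveSketch(const std::shared_ptr<Work>& work) {
  // ... blocking portion of the collective launch happens here: the NCCL
  // kernel has been enqueued on its stream by the time we reach this point.

  // Before this change (conceptually): the callback was registered on the
  // work's future via addCallback, which ran the lambda inline anyway but
  // also exercised CUDA stream callback machinery.
  //
  // After: since the callback does no GPU work, just invoke it directly.
  if (work->recordFunctionEndCallback_) {
    work->recordFunctionEndCallback_();
  }
}

int main() {
  auto work = std::make_shared<Work>();
  work->recordFunctionEndCallback_ = [] { /* end the profiler record */ };
  collectiveSketch(work);
  return 0;
}
```

The point is simply that the profiler end-callback is pure host-side bookkeeping, so nothing is gained by routing it through the future's callback path; accurate timing of the GPU work itself is the job of `use_cuda=True`.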

@facebook-github-bot added the cla signed and oncall: distributed labels Dec 7, 2020
rohan-varma added a commit that referenced this pull request Dec 7, 2020
Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda

Still does not work with use_cuda=True; the deadlock is being debugged.

Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)

ghstack-source-id: 118016665
Pull Request resolved: #48946
@dr-ci bot commented Dec 7, 2020:

💊 CI failures summary and remediations

As of commit 8a77092 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build binary_windows_libtorch_3_7_cpu_release_build (1/1)

Step: "Checkout code" (full log | diagnosis details | 🔁 rerun) ❄️

```
Cloning into '.'...
Creating .ssh directory
Adding github.com and bitbucket.org host keys to known_hosts
Writing SSH key for checkout to id_rsa
ssh: connect to host github.com port 22: Connection timed out
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error setting git remote: exit status 128
```

---
This comment was automatically generated by Dr. CI and has been revised 29 times.

@rohan-varma changed the title from "[wip][not for review] Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda" to "Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda" Dec 8, 2020
```diff
@@ -1126,6 +1115,17 @@ c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::collective(
   work->opTimeout_ = opTimeout_;
   work->store_ = store_;
 
+  if (work->recordFunctionEndCallback_) {
```
Contributor (inline review comment):

Thanks for working on this @rohan-varma. Would you elaborate on why it matters to add the callback later here?

@rohan-varma (Member, Author) replied:

@mrzzd As discussed offline, it's equivalent to directly invoking the callback inline, since the callback doesn't do any GPU work. See #48561 for more discussion/context.
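
To make that equivalence concrete, here is a toy sketch. The `InlineFuture` class below is made up purely for illustration and is not the c10 Future API: when the future is already satisfied, `addCallback` just runs the lambda on the calling thread, so calling the callback directly is behaviorally the same while skipping the stream-guard setup the real future performs.

```cpp
// Toy illustration of the equivalence described above. `InlineFuture` is a
// made-up stand-in, not the c10 Future API; the real future additionally sets
// up CUDA stream guards around the callback, which is the overhead avoided here.
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

class InlineFuture {
 public:
  void markCompleted() { completed_ = true; }

  // If the future is already completed, the callback runs immediately on the
  // calling thread -- i.e. "inline".
  void addCallback(std::function<void()> cb) {
    if (completed_) {
      cb();
    } else {
      pending_.push_back(std::move(cb));
    }
  }

 private:
  bool completed_ = false;
  std::vector<std::function<void()>> pending_;
};

int main() {
  // Host-side-only bookkeeping, analogous to ending a profiler record.
  auto endProfilerRecord = [] { std::cout << "record_function end\n"; };

  InlineFuture fut;
  fut.markCompleted();

  fut.addCallback(endProfilerRecord);  // runs the lambda inline on this thread
  endProfilerRecord();                 // ...which is equivalent to calling it directly

  return 0;
}
```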

@mrzzd (Contributor) left a comment:

Looks good. Thanks!

rohan-varma added a commit that referenced this pull request Dec 10, 2020
Pull Request resolved: #48946
ghstack-source-id: 118276464
Differential Revision: [D25368322](https://our.internmc.facebook.com/intern/diff/D25368322/)
@rohan-varma (Member, Author) commented:
Tested on the ci-all PR by ssh'ing into the multigpu instance and running the following:

```sh
while [ $? -eq 0 ] ; do TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="3" python test/distributed/test_distributed_fork.py -v TestDistBackendWithFork.test_all_reduce_sum_cuda ; done
```

I ran this loop for test_all_reduce_sum_cuda, test_reduce_multigpu, test_all_gather_multigpu, and test_all_reduce_sum_cuda_async. The loop never exited, so the tests should no longer be flaky.

@rohan-varma (Member, Author) commented:
Hm, the bazel build CI job is marked as pending, but when I open it, it looks completed: https://app.circleci.com/pipelines/github/pytorch/pytorch/248979/workflows/b86fa46c-c327-4797-991f-562f0ff741cf/jobs/9508765

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 696e30a.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/206/head branch December 14, 2020 15:17
Labels
cla signed · Merged · oncall: distributed

3 participants