Add an explicit _shutdown method to ProcessGroupNCCL #111392

pritamdamania87 · 2023-10-16T20:44:55Z

Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how Python GC works and code holding references to the PG in multiple places, in practice calling destroy_process_group doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.

pytorch-bot · 2023-10-16T20:45:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111392

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit f452fa7 with merge base 4ed4753:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

janeyx99 · 2023-10-16T22:05:44Z

cc @wanchaol do you know who would be the best person to take a look?

wconstab · 2023-10-18T17:33:41Z

@XilunWu @fduwjj Could you take a look?

XilunWu

overall the PR LGTM. One suggestion plus one question.

I'm wondering if there's any follow up we can do to make it better.

test/distributed/test_c10d_nccl.py

torch/csrc/distributed/c10d/init.cpp

wanchaol

Can you elaborate on the difference between this API and abort?

torch/csrc/distributed/c10d/init.cpp

pritamdamania87 · 2023-10-21T04:23:26Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+  if (ncclCommWatchdogThread_.joinable()) {
+    ncclCommWatchdogThread_.join();
+  }
 #endif

  if (onCompletionHookThread_.joinable())
    onCompletionHookThread_.join();


Did not include .join() as part of the shutdown method since join() could potentially block.
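A minimal sketch of the split described above, written as a toy Python model rather than the actual C++ code: `shutdown()` only signals the background thread, and the potentially blocking `join()` is left to a separate step. The class `ToyBackend` and its methods are hypothetical names used for illustration only.

```python
import threading


class ToyBackend:
    """Hypothetical stand-in for a backend that owns a watchdog thread."""

    def __init__(self):
        self._stop = threading.Event()
        self._watchdog = threading.Thread(target=self._watch, daemon=True)
        self._watchdog.start()

    def _watch(self):
        # Poll until a stop is requested; wait() returns True once the event is set.
        while not self._stop.wait(timeout=0.01):
            pass

    def shutdown(self):
        # Only signal the thread. No join here, so this call cannot block.
        self._stop.set()

    def join(self, timeout=None):
        # The potentially blocking wait is kept in a separate step.
        if self._watchdog.is_alive():
            self._watchdog.join(timeout)
        return not self._watchdog.is_alive()


backend = ToyBackend()
backend.shutdown()           # returns immediately
stopped = backend.join(5.0)  # the blocking wait happens here instead
```

In this toy model the caller decides when, and for how long, to wait for the thread to exit.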

pritamdamania87 · 2023-10-21T04:24:03Z

@XilunWu @wanchaol @fduwjj Addressed comments

wanchaol

lgtm, one question about gloo compatibility

wanchaol · 2023-10-23T16:51:51Z

torch/csrc/distributed/c10d/init.cpp

-                return self->abort(abortReason);
+              "_shutdown",
+              [](const c10::intrusive_ptr<::c10d::ProcessGroupNCCL>& self) {
+                return self->shutdown();


Does this support gloo too, or shall we add _shutdown for gloo to keep things consistent?

This really depends on whether we want shutdown as a public method in the ProcessGroup.hpp public interface and have all subclasses implement it (not just gloo).

I kept this private here since I wasn't sure if we want this to be part of the ProcessGroup interface.
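To illustrate the trade-off being discussed, here is a rough Python sketch in which only one backend exposes a private `_shutdown`, so callers have to probe for it. All class and function names (`Backend`, `NcclLike`, `GlooLike`, `shutdown_if_supported`) are hypothetical and do not correspond to PyTorch classes.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical common interface; note that it has no shutdown method."""

    @abstractmethod
    def allreduce(self, values):
        ...


class NcclLike(Backend):
    def __init__(self):
        self.active = True

    def allreduce(self, values):
        return sum(values)

    def _shutdown(self):
        # Private, backend-specific cleanup hook.
        self.active = False


class GlooLike(Backend):
    def allreduce(self, values):
        return sum(values)


def shutdown_if_supported(backend):
    # Callers must check for the method because it is not part of the interface.
    fn = getattr(backend, "_shutdown", None)
    if fn is None:
        return False
    fn()
    return True


nccl_result = shutdown_if_supported(NcclLike())
gloo_result = shutdown_if_supported(GlooLike())
```

Promoting the method to the shared interface would remove the `getattr` probe, at the cost of requiring every subclass to implement it.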

fduwjj · 2023-10-23T20:14:54Z

torch/csrc/distributed/c10d/init.cpp

@@ -2239,12 +2239,10 @@ options :class:`~torch.distributed.ProcessGroupNCCL.Options`).
              py::arg("timeout") = ::c10d::kProcessGroupNCCLDefaultTimeout,
              py::call_guard<py::gil_scoped_release>())
          .def(
-              "_abort",


I am ok with this change, but I am not sure whether this is still used somewhere, so it might break some use cases. If it is used, this PR might get reverted.

I don't mind either, I just removed it based on @wanchaol's suggestion here: #111392 (comment)

Well, I didn't find any critical use cases for this IIUC. And I don't like keeping both, so I stamp it for now.

This API was added by @pritamdamania87 a few weeks ago, so I imagine the API usage should be relatively low and probably only @pritamdamania87 atm. We should remove it soon to avoid more people depending on it.

fduwjj · 2023-10-23T20:18:20Z

test/distributed/test_c10d_nccl.py

+        # Destroy pg and validate pg is still in working condition since we hold a
+        # reference above.
+        dist.destroy_process_group()
+        pg.allreduce([t])


Is it possible to call allreduce even after pg destroy? The first allreduce is blocking right?

Yes because destroy_process_group just removes the PG from the map in distributed_c10d.py, but doesn't trigger the destructor of ProcessGroupNCCL since we are holding a reference to the PG object here.

Oh I see. That's a good idea to put a test case. Thanks!

pritamdamania87 · 2023-10-23T23:40:21Z

@pytorchbot merge

pytorchmergebot · 2023-10-23T23:43:16Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2023-10-23T23:43:35Z

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda11_8-build / build

Details for Dev Infra team

Raised by workflow job

pritamdamania87 · 2023-10-24T05:44:23Z

@pytorchbot merge

pytorchmergebot · 2023-10-24T05:46:23Z

Merge started


Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how Python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.

Pull Request resolved: pytorch#111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj

pritamdamania87 requested review from mrshenli, zhaojuanmao, rohan-varma, H-Huang, awgu, kwen2501, wanchaol, fegin, fduwjj, wz337, kiukchung and d4l3k as code owners October 16, 2023 20:44

pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 16, 2023

pytorchbot added the open source label Oct 16, 2023

janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 16, 2023

XilunWu approved these changes Oct 18, 2023

wanchaol requested changes Oct 18, 2023

pritamdamania87 requested review from wanchaol and XilunWu October 18, 2023 18:37

pritamdamania87 changed the title ~~Add an explicit _close method to ProcessGroupNCCL~~ Add an explicit _shutdown method to ProcessGroupNCCL Oct 21, 2023

pritamdamania87 force-pushed the user/pdamania/pgclose branch from 0a2691f to 6754a40 Compare October 21, 2023 04:21

wanchaol approved these changes Oct 23, 2023

fduwjj reviewed Oct 23, 2023

Add an explicit _close method to ProcessGroupNCCL

a438bbe

pritamdamania87 added 2 commits October 23, 2023 18:53

Address comments

aa7cbd0

Fix tests

9014cd8

pritamdamania87 force-pushed the user/pdamania/pgclose branch from 742b5ab to 9014cd8 Compare October 23, 2023 22:54

Fix build

4893e16

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 23, 2023

pytorchmergebot added the merging label Oct 23, 2023

pytorchmergebot removed the merging label Oct 23, 2023

Fix lint

f452fa7

fduwjj approved these changes Oct 24, 2023

pytorchmergebot added the merging label Oct 24, 2023

pytorchmergebot added Merged and removed merging labels Oct 24, 2023

pytorchmergebot closed this in 0ad91c2 Oct 24, 2023

XilunWu mentioned this pull request Oct 31, 2023

ProcessGroup is not automatically destroyed when the process exits #109478

Open
