Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an explicit _shutdown method to ProcessGroupNCCL #111392

Closed

Conversation

pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Oct 16, 2023

Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling destroy_process_group doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 16, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111392

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit f452fa7 with merge base 4ed4753 (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 16, 2023
@janeyx99
Copy link
Contributor

cc @wanchaol do you know who would be the best person to take a look?

@wconstab
Copy link
Contributor

@XilunWu @fduwjj Could you take a look?

Copy link
Contributor

@XilunWu XilunWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall the PR LGTM. One suggestion plus one question.

I'm wondering if there's any follow up we can do to make it better.

test/distributed/test_c10d_nccl.py Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate the difference of this API compare to abort

torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
@pritamdamania87 pritamdamania87 changed the title Add an explicit _close method to ProcessGroupNCCL Add an explicit _shutdown method to ProcessGroupNCCL Oct 21, 2023
Comment on lines +889 to 905
if (ncclCommWatchdogThread_.joinable()) {
ncclCommWatchdogThread_.join();
}
#endif

if (onCompletionHookThread_.joinable())
onCompletionHookThread_.join();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not include .join() as part of the shutdown method since join() could potentially block.

@pritamdamania87
Copy link
Contributor Author

pritamdamania87 commented Oct 21, 2023

@XilunWu @wanchaol @fduwjj Addressed comments

Copy link
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, one question about gloo compatability

return self->abort(abortReason);
"_shutdown",
[](const c10::intrusive_ptr<::c10d::ProcessGroupNCCL>& self) {
return self->shutdown();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supporting gloo too or shall we add the _shutdown for gloo to keep consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really depends if we want shutdown as a public method in the ProcessGroup.hpp public interface and have all subclasses implement it (not just gloo).

I kept this private here since I wasn't sure if we want this to be part of the ProcessGroup interface.

@@ -2239,12 +2239,10 @@ options :class:`~torch.distributed.ProcessGroupNCCL.Options`).
py::arg("timeout") = ::c10d::kProcessGroupNCCLDefaultTimeout,
py::call_guard<py::gil_scoped_release>())
.def(
"_abort",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this change but I am not sure if this is still used somewhere so it might break some use cases.. If it is used this PR might get reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind either, I just removed it based on @wanchaol's suggestion here: #111392 (comment)

Copy link
Contributor

@fduwjj fduwjj Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I didn't find any critical use cases for this IIUC. And I don't like keeping both, so I stamp it for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API was added by @pritamdamania87 a few weeks ago so I imagine the API usage should be relatively low and probably only @pritamdamania87 atm. We should remove it soon to avoid more ppl depend on it.

# Destroy pg and validate pg is still in working condition since we hold a
# reference above.
dist.destroy_process_group()
pg.allreduce([t])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to call allreduce even after pg destroy? The first allreduce is blocking right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes because destroy_process_group just removes the PG from the map in distributed_c10d.py, but doesn't trigger the destructor of ProcessGroupNCCL since we are holding a reference to the PG object here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. That's a good idea to put a test case. Thanks!

@pritamdamania87
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 23, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda11_8-build / build

Details for Dev Infra team Raised by workflow job

@pritamdamania87
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

andreigh pushed a commit to andreigh/pytorch that referenced this pull request Oct 26, 2023
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.
Pull Request resolved: pytorch#111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.
Pull Request resolved: pytorch#111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.
Pull Request resolved: pytorch#111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
acphile pushed a commit to acphile/pytorch that referenced this pull request Jan 5, 2024
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.
Pull Request resolved: pytorch#111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants