New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an explicit _shutdown method to ProcessGroupNCCL #111392
Add an explicit _shutdown method to ProcessGroupNCCL #111392
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111392
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit f452fa7 with merge base 4ed4753 (): BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc @wanchaol do you know who would be the best person to take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall the PR LGTM. One suggestion plus one question.
I'm wondering if there's any follow up we can do to make it better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate the difference of this API compare to abort
0a2691f
to
6754a40
Compare
if (ncclCommWatchdogThread_.joinable()) { | ||
ncclCommWatchdogThread_.join(); | ||
} | ||
#endif | ||
|
||
if (onCompletionHookThread_.joinable()) | ||
onCompletionHookThread_.join(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not include .join()
as part of the shutdown
method since join()
could potentially block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, one question about gloo compatability
return self->abort(abortReason); | ||
"_shutdown", | ||
[](const c10::intrusive_ptr<::c10d::ProcessGroupNCCL>& self) { | ||
return self->shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supporting gloo too or shall we add the _shutdown
for gloo to keep consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really depends if we want shutdown
as a public method in the ProcessGroup.hpp
public interface and have all subclasses implement it (not just gloo).
I kept this private here since I wasn't sure if we want this to be part of the ProcessGroup interface.
@@ -2239,12 +2239,10 @@ options :class:`~torch.distributed.ProcessGroupNCCL.Options`). | |||
py::arg("timeout") = ::c10d::kProcessGroupNCCLDefaultTimeout, | |||
py::call_guard<py::gil_scoped_release>()) | |||
.def( | |||
"_abort", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with this change but I am not sure if this is still used somewhere so it might break some use cases.. If it is used this PR might get reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either, I just removed it based on @wanchaol's suggestion here: #111392 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I didn't find any critical use cases for this IIUC. And I don't like keeping both, so I stamp it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API was added by @pritamdamania87 a few weeks ago so I imagine the API usage should be relatively low and probably only @pritamdamania87 atm. We should remove it soon to avoid more ppl depend on it.
# Destroy pg and validate pg is still in working condition since we hold a | ||
# reference above. | ||
dist.destroy_process_group() | ||
pg.allreduce([t]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to call allreduce even after pg destroy? The first allreduce is blocking right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes because destroy_process_group
just removes the PG from the map in distributed_c10d.py
, but doesn't trigger the destructor of ProcessGroupNCCL since we are holding a reference to the PG object here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. That's a good idea to put a test case. Thanks!
742b5ab
to
9014cd8
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda11_8-build / build Details for Dev Infra teamRaised by workflow job |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor. However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor. As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources. Pull Request resolved: pytorch#111392 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor. However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor. As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources. Pull Request resolved: pytorch#111392 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor. However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor. As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources. Pull Request resolved: pytorch#111392 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor. However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor. As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources. Pull Request resolved: pytorch#111392 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.
However, given how python GC works and code holding references to the PG in multiple places, in practice calling
destroy_process_group
doesn't actually end up invoking the destructor.As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources.