Document torch.distributed.destroy_process_group() #48203
Labels:
- module: c10d (Issues/PRs related to collective communications and process groups)
- module: docs (Related to our documentation, both in docs/ and docblocks)
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Comments
rohan-varma added the module: docs, triaged, and module: c10d labels on Nov 18, 2020.
facebook-github-bot added the oncall: distributed label on Nov 18, 2020.
What is the progress on this? Where should we be calling this function?

Hi there, I was running into the same issue today. Unfortunately, there is still no documentation. However, in the PyTorch DDP tutorial the function is called at the end of each process: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html. Hope this helps!
wconstab added commits referencing this issue (Mar 20, Mar 21, Mar 22, and May 8, 2024) via Pull Request #122358, with the following message:

"This API was not documented. It has already been a source of confusion, but recently has become more urgent as improper destruction can lead to hangs due to ncclCommAbort's requirement of being called collectively. Fixes #48203"
📚 Documentation

torch.distributed.destroy_process_group() is a useful function for de-initializing a process group in order to re-initialize it, for example during error handling/retries in distributed training. However, the function is currently not documented in the docs: https://pytorch.org/docs/master/distributed.html.
cc @jlin27 @mruberry @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd
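The re-initialization use case described in the issue can be sketched as follows. This is a minimal single-process sketch using the gloo backend; the rendezvous address, ports, and the simulated failure are illustrative placeholders, not details from the issue:

```python
import os
import torch.distributed as dist

def init_pg(port: str) -> None:
    # Rendezvous via environment variables (placeholder address/port).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = port
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

init_pg("29501")
try:
    # A training step that hits a transient failure (simulated here).
    raise RuntimeError("simulated transient failure")
except RuntimeError:
    # De-initialize the old process group before retrying; a fresh
    # rendezvous port avoids colliding with a lingering listener socket.
    dist.destroy_process_group()
    init_pg("29502")

assert dist.is_initialized()
dist.destroy_process_group()
```

With a real multi-rank job, every rank would need to reach the destroy/re-init sequence together, since (as the linked PR notes) destruction must happen collectively.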