
Document torch.distributed.destroy_process_group() #48203

Closed

rohan-varma opened this issue Nov 18, 2020 · 2 comments
Labels

module: c10d (Issues/PRs related to collective communications and process groups)
module: docs (Related to our documentation, both in docs/ and docblocks)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@rohan-varma
Member

rohan-varma commented Nov 18, 2020

📚 Documentation

This is a useful function for de-initializing a process group so that it can be re-initialized, for example during error handling/retries in distributed training. However, the function is currently not documented in the docs: https://pytorch.org/docs/master/distributed.html.

cc @jlin27 @mruberry @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd
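
A rough sketch of the de-init/re-init pattern described above (the backend choice, env-var rendezvous, and the train_one_epoch helper are placeholders, not part of any PyTorch API):

```python
import torch.distributed as dist

def run_with_retries(rank, world_size, max_retries=3):
    # Assumes MASTER_ADDR / MASTER_PORT are already set in the environment.
    for attempt in range(max_retries):
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        try:
            train_one_epoch()  # placeholder for the real training step
            break              # success: leave the retry loop
        except RuntimeError:
            continue           # e.g. a collective failed; retry with a fresh group
        finally:
            # De-init the process group so the next attempt can re-init it.
            dist.destroy_process_group()
```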

rohan-varma added the module: docs, triaged, and module: c10d labels on Nov 18, 2020
facebook-github-bot added the oncall: distributed label on Nov 18, 2020
@brando90

brando90 commented Mar 29, 2021

@benni-ser

benni-ser commented

Hi there,

I ran into the same issue today. Unfortunately, there is still no documentation. However, in the PyTorch DDP tutorial, destroy_process_group() is called at the end of each process:

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Hope this helps!
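
For reference, the tutorial's setup/cleanup pattern looks roughly like this (the port, backend, and function names here are illustrative values, not required by the API):

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Initialize the default process group for this rank.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    # The undocumented function this issue is about: tears down the group.
    dist.destroy_process_group()

def demo_basic(rank, world_size):
    setup(rank, world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    cleanup()  # called at the end of each process, as in the tutorial

if __name__ == "__main__":
    world_size = 2
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
```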

wconstab added a commit that referenced this issue Mar 20, 2024
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203

[ghstack-poisoned]
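
To illustrate the failure mode the commit message describes, a minimal sketch (assuming an NCCL process group has already been initialized on every rank):

```python
import torch.distributed as dist

# Risky: only one rank tears the group down. With the NCCL backend the
# underlying communicator abort is effectively collective, so the other
# ranks can hang waiting on it.
# if dist.get_rank() == 0:
#     dist.destroy_process_group()

# Safer: every rank calls destroy_process_group(), after all outstanding
# collectives have completed, so the teardown itself is collective.
dist.destroy_process_group()
```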
wconstab added a commit that referenced this issue Mar 20, 2024
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203

ghstack-source-id: 2e3ba2c493e54013c94919d7d05acb3085e7d8d6
Pull Request resolved: #122358
wconstab added a commit that referenced this issue Mar 21, 2024
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203

ghstack-source-id: 80c863d79a95ddba044edafc7430ab8b6b75b396
Pull Request resolved: #122358
wconstab added a commit that referenced this issue Mar 22, 2024
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203

ghstack-source-id: 4451faa9a7c4fb236b5d558405445f3bffc574f4
Pull Request resolved: #122358
wconstab added a commit that referenced this issue May 8, 2024
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.

Fixes #48203

ghstack-source-id: 4263c6f3b66255e11a3ebd6eb196824d82e7d5b0
Pull Request resolved: #122358