Update DDP checkpoint documentation #84589

Closed
rohan-varma opened this issue Sep 6, 2022 · 3 comments
Labels
good first issue, module: ddp, oncall: distributed, triaged

Comments

@rohan-varma
Member

rohan-varma commented Sep 6, 2022

📚 The doc issue

The DDP checkpoint documentation here:

should be updated in light of the new non-reentrant checkpoint API.

In particular, checkpoint with use_reentrant=False (the non-reentrant implementation) supports the use cases that the documentation currently lists as unsupported.
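For illustration, here is a minimal sketch of the kind of case this is about. It assumes the "unsupported" cases include checkpointing the same layer more than once under DDP; the issue does not spell them out, so treat that as an assumption. In `torch.utils.checkpoint.checkpoint`, the non-reentrant implementation is selected with `use_reentrant=False`. The `TwiceCheckpointed` model and `run_demo` helper are hypothetical names, and the single-process gloo group on CPU is only there so the snippet can run without a launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class TwiceCheckpointed(nn.Module):
    """Hypothetical model that runs one shared layer twice per forward pass."""

    def __init__(self, use_checkpoint: bool):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.shared = nn.Linear(8, 8)
        self.head = nn.Linear(8, 1)

    def _block(self, x):
        return torch.relu(self.shared(x))

    def forward(self, x):
        for _ in range(2):
            if self.use_checkpoint:
                # Non-reentrant checkpointing: activations of _block are
                # recomputed during backward instead of being stored.
                x = checkpoint(self._block, x, use_reentrant=False)
            else:
                x = self._block(x)
        return self.head(x)


def run_demo() -> bool:
    """Return True if checkpointed DDP gradients match a plain reference model."""
    torch.manual_seed(0)
    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: a one-process gloo group, used only for illustration.
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{os.path.join(tmp, 'store')}",
            rank=0,
            world_size=1,
        )
        try:
            reference = TwiceCheckpointed(use_checkpoint=False)
            model = TwiceCheckpointed(use_checkpoint=True)
            model.load_state_dict(reference.state_dict())
            ddp_model = DDP(model)

            x = torch.randn(4, 8)
            reference(x).sum().backward()
            ddp_model(x).sum().backward()

            return all(
                torch.allclose(p.grad, q.grad)
                for p, q in zip(reference.parameters(), model.parameters())
            )
        finally:
            dist.destroy_process_group()


if __name__ == "__main__":
    print("gradients match:", run_demo())
```

A real multi-GPU job would launch one process per rank and pass `device_ids` to DDP; the sketch skips that to stay short.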

Suggest a potential alternative/fix

No response

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

@rohan-varma added the oncall: distributed and module: ddp labels on Sep 6, 2022
@kumpera added the good first issue and triaged labels on Sep 13, 2022
@Shasmita

Hi, I would like to work on this issue.

@the-tyler

Hi, I am available to work on this very interesting issue. If you can assign it to me, then I can immediately start working on it.

@shilpiprd

Hi! Since nobody had made any changes, I assumed that this issue was still open. I changed the documentation a bit and added one example.

ringohoffman pushed a commit to ringohoffman/pytorch that referenced this issue Sep 27, 2023
Amended the documentation for the specified case.

Fixes pytorch#84589

Pull Request resolved: pytorch#106985
Approved by: https://github.com/wanchaol, https://github.com/fduwjj