Fix checkpointing #2267
This reverts the changes that enabled FSDP and multi-device validation and metric computation, by running validation and evaluation on a single process again. I would prefer to keep those capabilities rather than revert, but if we can't find another solution, I suppose this could work.
Hey @pplantinga, yes, it's a revert because we haven't found a way to fix the issues that we are seeing more and more often when DDP is used. In my opinion, the pool of users doing FSDP and/or per-device evaluation is extremely small, while the pool of users using DDP is very large, and I keep hearing from people around me who use DDP that this bug is really frustrating (I actually hit it too). I would suggest that we revert and that you, if you have the time, keep working on this so that we can fix these problems. We of course want to support it, but not at the cost of breaking what was previously working.
Alternative fix proposed in #2268
Lemme try.
There's another reason why reverting this change is not ideal. Having the
I think we need to close that, right? @Adel-Moumen
Closed thanks to #2268
What does this PR do?
This PR attempts to fix checkpointing issues that we are currently facing with the new changes made in #2059.
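For readers unfamiliar with the pattern under discussion, here is a minimal sketch of "run validation and evaluation on a single process" in a DDP job. It is not the code from this PR: the names `is_main_process`, `main_process_only`, and `evaluate` are hypothetical, and the sketch assumes the launcher exports a `RANK` environment variable (as `torchrun` does).

```python
import functools
import os


def is_main_process() -> bool:
    """Return True on rank 0, or when RANK is unset (plain single-process run)."""
    return int(os.environ.get("RANK", "0")) == 0


def main_process_only(func):
    """Hypothetical decorator: run `func` on rank 0 only, return None elsewhere."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if is_main_process():
            return func(*args, **kwargs)
        # Non-main ranks skip the work. A real DDP job would typically also
        # synchronize here (e.g. torch.distributed.barrier()) so that the other
        # ranks wait for rank 0; that is omitted to keep the sketch standalone.
        return None

    return wrapper


@main_process_only
def evaluate(scores):
    """Stand-in for an evaluation step: average a list of per-utterance scores."""
    return sum(scores) / len(scores)
```

The trade-off raised in the conversation is visible here: only rank 0 does the evaluation work, which sidesteps cross-process coordination but gives up per-device evaluation.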