
Fix checkpointing #2267

Closed
Adel-Moumen wants to merge 2 commits

Conversation

Adel-Moumen
Collaborator

What does this PR do?

This PR attempts to fix the checkpointing issues that we are currently facing after the changes made in #2059.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@pplantinga
Collaborator

This reverts the changes that enabled FSDP and allowed validation and metric computation to run across multiple devices; it goes back to running validation and evaluation on a single process. I would prefer to keep those capabilities rather than revert them, but if we can't find another solution, I suppose this could work.
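
For context, here is a minimal sketch of the two evaluation strategies being weighed, using plain PyTorch distributed primitives and assumed function names rather than SpeechBrain's actual API:

```python
# Sketch only: contrasts single-process evaluation (the reverted behaviour)
# with per-rank evaluation plus an all-reduce (the multi-device behaviour).
import torch
import torch.distributed as dist

def evaluate_single_process(per_batch_losses, rank):
    # Reverted behaviour: only the main process computes the metric,
    # the other ranks simply idle during evaluation.
    if rank != 0:
        return None
    return sum(per_batch_losses) / len(per_batch_losses)

def evaluate_all_ranks(per_batch_losses):
    # Multi-device behaviour: each rank averages its own shard of the
    # validation data, then the partial means are averaged across ranks.
    # Requires an initialised process group (torch.distributed.init_process_group).
    local_mean = torch.tensor(sum(per_batch_losses) / len(per_batch_losses))
    dist.all_reduce(local_mean, op=dist.ReduceOp.SUM)
    return local_mean / dist.get_world_size()
```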

@TParcollet
Collaborator

Hey @pplantinga, yes, it's a revert, because we haven't found a way to fix the issues that we are seeing more and more often when DDP is used. In my opinion, the pool of users doing FSDP and/or per-device evaluation is extremely small, while the pool of users relying on DDP is very large, and I keep hearing from people around me who use DDP that this bug is really frustrating (I actually hit it too). I would suggest that we revert and that you, if you have the time, keep working on this so that we can fix these problems. We of course want to support these capabilities, but not at the cost of breaking what was previously working.

@pplantinga
Collaborator

Alternative fix proposed in #2268

@TParcollet
Collaborator

Lemme try.

@pplantinga
Collaborator

There's another reason why reverting this change is not ideal. Having on_stage_end run only on the main process leads to subtle errors in the logic of many recipes, such as the optimizer's learning rate only getting updated in a single process.
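
To see the learning-rate issue concretely, here is a minimal sketch with a hypothetical on_stage_end body, simulating two ranks inside one process instead of using SpeechBrain's actual API:

```python
# Sketch only: a main-process-only stage-end hook leaves the schedulers
# (and hence the learning rates) of the other ranks untouched.
import torch

def on_stage_end(scheduler):
    # Hypothetical hook body: steps the LR scheduler at the end of a stage.
    scheduler.step()

# Two "ranks" simulated side by side; under DDP each process would hold
# its own optimizer/scheduler replica.
optimizers = [
    torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=1.0)
    for _ in range(2)
]
schedulers = [
    torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)
    for opt in optimizers
]

for epoch in range(3):
    for rank, scheduler in enumerate(schedulers):
        if rank == 0:  # hook runs only on the "main" rank
            on_stage_end(scheduler)

print([opt.param_groups[0]["lr"] for opt in optimizers])
# [0.125, 1.0] -> the second rank keeps the initial learning rate,
# the kind of silent divergence described in the comment above.
```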

@mravanelli added the bug (Something isn't working) label on Nov 29, 2023
@mravanelli
Collaborator

I think we need to close that, right? @Adel-Moumen

@Adel-Moumen
Collaborator Author

Closed, thanks to #2268.

@Adel-Moumen closed this on Dec 2, 2023
@mravanelli deleted the fix-checkpointing branch on Feb 27, 2024
Labels: bug (Something isn't working)
4 participants