Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] New train.Checkpoint API: Update lightning, transformers, and tf integrations #38491

Merged
merged 4 commits into from
Aug 16, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 16, 2023

Why are these changes needed?

This PR updates LightningTrainer, TransformersTrainer, and the TF/Keras ReportCheckpointCallback to report a new FrameworkCheckpoint (from #38452) when the feature flag is enabled. The callbacks for the new torch integration APIs are also updated to return a new Checkpoint rather than the old air.Checkpoint.

This PR also makes Backend._encode_data a noop for all frameworks. This is not needed with the new checkpoints.

Notes

  • I am not changing the legacy trainers to report a generic Checkpoint because the lightning/transformers trainers are being deprecated soon, and it's ok to just keep the legacy behavior for them. (Otherwise, there'd be a bunch of unnecessary updates to make.)
  • I am reporting the new TensorflowCheckpoint in ReportCheckpointCallback. This is because it's also a checkpoint that we create for the user, so the user may need some accessors to get stuff out of it. Another option is to add a getter to ReportCheckpointCallback (similar to XGBoostTrainer.get_model) and just return a generic Checkpoint.

Related issue number

#38292

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

nit: Do we have any warning message when the users trying to call legacy checkpoint apis on the new checkpoint objects?

@justinvyu
Copy link
Contributor Author

@woshiyyya Good point. I'll add that as a todo on the issue. I think we can add these legacy methods on the base Checkpoint that all raise and show you how to do the same with from_directory.

@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 16, 2023
@matthewdeng matthewdeng merged commit 05293cd into ray-project:master Aug 16, 2023
69 of 71 checks passed
@justinvyu justinvyu deleted the air/update_trainers branch August 16, 2023 19:07
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… tf integrations (ray-project#38491)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… tf integrations (ray-project#38491)


Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… tf integrations (ray-project#38491)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
… tf integrations (ray-project#38491)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants