[air] pyarrow.fs persistence (8/n): ray.train.Checkpoint restore: resume_from_checkpoint #38143

Merged

justinvyu
Contributor

@justinvyu justinvyu commented Aug 5, 2023

Why are these changes needed?

This PR supports Trainer(resume_from_checkpoint) with the new Checkpoint and adds it as a section of the e2e test.
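For readers unfamiliar with the pattern, here is a minimal sketch of what resuming from a checkpoint means in practice. It does not use Ray: the `train` function, the `state.json` file, and the `checkpoint_{epoch}` directory names are hypothetical stand-ins chosen for illustration, not the implementation in this PR.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def train(num_epochs: int, out_dir: Path, resume_from: Optional[Path] = None) -> Optional[Path]:
    """Run a toy loop that writes one checkpoint directory per epoch.

    If resume_from is given, training continues after the epoch stored there.
    Returns the directory of the most recent checkpoint (or resume_from if no
    new epoch ran).
    """
    start_epoch, weight = 0, 0.0
    if resume_from is not None:
        # Restore the saved state instead of starting from scratch.
        state = json.loads((resume_from / "state.json").read_text())
        start_epoch, weight = state["epoch"] + 1, state["weight"]

    latest = resume_from
    for epoch in range(start_epoch, num_epochs):
        weight += 1.0  # stand-in for a real optimization step
        latest = out_dir / f"checkpoint_{epoch:06d}"
        latest.mkdir(parents=True, exist_ok=True)
        (latest / "state.json").write_text(json.dumps({"epoch": epoch, "weight": weight}))
    return latest


with tempfile.TemporaryDirectory() as tmp:
    # First run trains epochs 0 and 1; the second run picks up at epoch 2.
    first = train(num_epochs=2, out_dir=Path(tmp) / "run1")
    second = train(num_epochs=4, out_dir=Path(tmp) / "run2", resume_from=first)
    final_state = json.loads((second / "state.json").read_text())
```

The second run writes its checkpoints to a new output directory and only reads from the checkpoint it was given, which is the behavior a `resume_from_checkpoint` argument is generally expected to provide.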

This PR also fixes a bug where constructing the Result object raised an error if no checkpoints had been reported.
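The description does not show the fix itself. As an illustration only, the sketch below shows the kind of guard that avoids failing when a run reports metrics but never a checkpoint. `ToyResult` and `build_result` are hypothetical names, not Ray's `Result` class or its construction code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ToyResult:
    metrics: Dict[str, Any]
    checkpoint: Optional[str]  # path of the latest checkpoint, or None
    best_checkpoints: List[Tuple[str, Dict[str, Any]]]


def build_result(reported: List[Tuple[Dict[str, Any], Optional[str]]]) -> ToyResult:
    """Build a result from a list of (metrics, checkpoint_path_or_None) reports."""
    with_ckpt = [(path, metrics) for metrics, path in reported if path is not None]
    # Guard the empty case: indexing with_ckpt[-1] unconditionally would raise
    # IndexError when no report carried a checkpoint.
    latest = with_ckpt[-1][0] if with_ckpt else None
    last_metrics = reported[-1][0] if reported else {}
    return ToyResult(metrics=last_metrics, checkpoint=latest, best_checkpoints=with_ckpt)


# A run that reported metrics but no checkpoint still yields a valid result.
no_ckpt = build_result([({"loss": 0.5}, None)])
```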

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Update to use the new checkpoint id attribute

Add todo comment to remove legacy path

… -> driver

Fix lint

Fix lint

…persistence/storage_context_to_worker_temp
…persistence/new_checkpoint

Fix lint for session.py

Fix lint for storage.py

…persistence/restore_new_checkpoint_autoft

…persistence/restore_new_checkpoint_autoft
…persistence/restore_new_checkpoint_rfc

@justinvyu justinvyu requested a review from ericl August 8, 2023 20:13
@justinvyu justinvyu marked this pull request as ready for review August 8, 2023 20:13
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 8, 2023
@justinvyu
Contributor Author

@ericl Good to merge. #38128 ready for review

@ericl ericl merged commit 55acb74 into ray-project:master Aug 9, 2023
67 of 72 checks passed
yutsai84 pushed a commit to yutsai84/ray that referenced this pull request Aug 9, 2023
… `resume_from_checkpoint` (ray-project#38143)

This PR supports `Trainer(resume_from_checkpoint)` with the new Checkpoint and adds it as a section of the e2e test.

This PR also fixes a bug where no checkpoints being reported causes the Result object to error on construction.

Signed-off-by: Yu-Cheng Tsai <yucheng.tsai@sage.com>
@justinvyu justinvyu deleted the air/persistence/restore_new_checkpoint_rfc branch August 9, 2023 01:15
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
… `resume_from_checkpoint` (ray-project#38143)

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… `resume_from_checkpoint` (ray-project#38143)

Signed-off-by: harborn <gangsheng.wu@intel.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… `resume_from_checkpoint` (ray-project#38143)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
… `resume_from_checkpoint` (ray-project#38143)

Signed-off-by: Victor <vctr.y.m@example.com>