[air] pyarrow.fs persistence: Don't automatically delete the local checkpoint #38507

Merged
merged 6 commits into from
Aug 17, 2023

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Aug 16, 2023

Why are these changes needed?

This PR updates the persist_current_checkpoint logic so that it no longer deletes the user's input checkpoint after uploading it. Cleaning up the local checkpoint is now the user's responsibility, if they want it removed. This is in line with making things more explicit.

The new pattern for reporting a checkpoint (that gets cleaned up automatically) is:

import tempfile

from ray import train
from ray.train._checkpoint import Checkpoint  # soon to be ray.train.Checkpoint

def train_fn(config):
    epochs = config["epochs"]  # assumption: the epoch count is passed in via config
    for i in range(epochs):
        should_checkpoint = ...
        if should_checkpoint:
            # The temporary directory is deleted when the `with` block exits.
            with tempfile.TemporaryDirectory() as tempdir:
                # write files into tempdir...
                train.report({...}, checkpoint=Checkpoint.from_directory(tempdir))
        else:
            train.report({...})
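
To make the ownership change concrete, here is a minimal standard-library sketch of the behavior described above. It does not use Ray: `persist_checkpoint` is a hypothetical stand-in for the upload step, and a local temporary directory plays the role of remote storage. It only illustrates that the upload leaves the source directory alone and that cleanup happens either through the `with` block or through an explicit call by the user.

```python
import shutil
import tempfile
from pathlib import Path


def persist_checkpoint(local_dir: str, storage_dir: str) -> str:
    """Hypothetical stand-in for the upload step: copy the checkpoint
    into storage and leave the source directory untouched."""
    dest = Path(storage_dir) / Path(local_dir).name
    shutil.copytree(local_dir, dest)
    return str(dest)


# A local directory stands in for remote storage (assumption for this sketch).
storage = tempfile.mkdtemp()

# Case 1: a temporary checkpoint directory, removed when the `with` block exits.
with tempfile.TemporaryDirectory() as tempdir:
    Path(tempdir, "model.txt").write_text("weights")
    persist_checkpoint(tempdir, storage)
    tmp_exists_after_upload = Path(tempdir).exists()  # upload did not delete it
tmp_exists_after_with = Path(tempdir).exists()  # the context manager did

# Case 2: a directory the user already owns (e.g. written by another library).
user_dir = tempfile.mkdtemp()
Path(user_dir, "model.txt").write_text("weights")
persisted_user = persist_checkpoint(user_dir, storage)
user_dir_exists_after_upload = Path(user_dir).exists()
persisted_content = Path(persisted_user, "model.txt").read_text()

# Cleanup is explicit and entirely up to the user.
shutil.rmtree(user_dir)
user_dir_exists_after_cleanup = Path(user_dir).exists()

shutil.rmtree(storage)  # remove the stand-in storage as well
```

In the second case the existing directory is handed over as-is, so no extra copy into a temporary directory is needed before reporting.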

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 16, 2023
Member

@woshiyyya woshiyyya left a comment

I love this change.

Previously, users had to create a tmpdir, copy existing ckpts to that tmp directory, and report it to Train. If the checkpoint is large, the copy step is unnecessary and takes a long time.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu
Contributor Author

Ah yeah I didn't even think about that! Saves 1 copy step if they already have a dir created by some integration library.

@justinvyu justinvyu added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 17, 2023
Member

@woshiyyya woshiyyya left a comment

Thank you @justinvyu!

@matthewdeng matthewdeng merged commit b3b900c into ray-project:master Aug 17, 2023
57 of 62 checks passed
@justinvyu justinvyu deleted the air/persistence/no_delete branch August 17, 2023 01:42
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…checkpoint (ray-project#38507)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…checkpoint (ray-project#38507)



Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…checkpoint (ray-project#38507)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…checkpoint (ray-project#38507)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
4 participants