Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] pyarrow.fs persistence (12/n): Patch new persistence path for Class Trainable #38382

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 12, 2023

Why are these changes needed?

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):

  • Checkpoint saving for class trainables now uses storage.persist_current_checkpoint and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their Result.
    • Returning a dict from save_checkpoint will create a "dict" Checkpoint, which is just a directory with a special dict_checkpoint.pkl file and a .is_dict_checkpoint marker.
    • Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. This is a minor change in behavior to be consistent with fn trainables / Trainers
  • Checkpoint loading is patched to give the user a dir/dict depending on what they returned from save_checkpoint.
    • I'm not handling the relative checkpoint dir case. I just always give the user the root checkpoint_0000x dir as the argument to load_checkpoint. This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/disable_for_cls_trainables

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit e23505e.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ants

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

pt 2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/disable_for_cls_trainables_approach2
@justinvyu justinvyu marked this pull request as ready for review August 14, 2023 19:01
# File containing dict data returned by user from `Trainable.save_checkpoint`
_DICT_CHECKPOINT_FILE_NAME = "dict_checkpoint.pkl"
# Marker file indicating that a checkpoint is a dict checkpoint.
_DICT_CHECKPOINT_MARKER = ".is_dict_checkpoint"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check the existence of dict_checkpoint.pkl instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about users writing a file named the same thing in their save_checkpoint. We could add an underscore at the beginning and most likely this won't cause any problems?

Also think that the marker is ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underscore sounds good. This is just a temporary thing anyways right? I want to really get rid of as much marker craft as possible to keep it straightforward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to keep this part of the implementation around even after 2.7. Unless we want to change the save_checkpoint API to not accept a dict anymore. I'll remove the marker.

python/ray/tune/trainable/trainable.py Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/disable_for_cls_trainables_approach2
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023
@justinvyu justinvyu requested a review from ericl August 14, 2023 22:27
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the checkpoint marker.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 15, 2023
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 15, 2023
python/ray/tune/trainable/trainable.py Outdated Show resolved Hide resolved
) as f:
ray_pickle.dump(checkpoint_dict_or_path, f)

# TODO(justinvyu): Ignoring relpaths returned by save_checkpoint for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually surprised this doesn't raise issues. Are we running unit tests with the new storage context somewhere already? Should we add CI jobs that run the full tune/train unit test suite?

E.g. in this mock_trainable we expect to get a path to a file back.

Same in rllib's mock trainable.

The good thing is though that it seems like we only do this in some examples and in the mock trainables. Rllib just returns the original checkpoint path.

In my opinion, we can introduce a breaking change here. For this we should detect if someone didn't return the original checkpoint path and raise an error.

The surface area will be small - it only affects users with class trainables, and of those only those that return a subpath, and the fix for them is very simple.

cc @ericl for the breaking change. Alternatively we have a code path below that does recover the original path passed to tune, but it has to be stored in some checkpoint metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only test_new_persistence has the flag turned on right now. Otherwise, would have been too many tests to update while I was still trying to wrap up the functionality for the new path.

I think the CI jobs you added are very helpful -- was wondering the best way to do this! We can start sharding the failing tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, can we then detect a subpath and raise an actionable error? Would be great to do it in this PR. Then we can merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, PTAL!

…persistence/disable_for_cls_trainables_approach2
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 15, 2023
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

…persistence/disable_for_cls_trainables_approach2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@ericl ericl merged commit cd71d8d into ray-project:master Aug 15, 2023
5 of 28 checks passed
@justinvyu justinvyu deleted the air/persistence/disable_for_cls_trainables_approach2 branch August 15, 2023 20:10
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… Class `Trainable` (ray-project#38382)

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):
* Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`.
    * Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
    * Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers**
* Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`.
    * I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.**

Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… Class `Trainable` (ray-project#38382)

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):
* Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`.
    * Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
    * Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers**
* Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`.
    * I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.**
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… Class `Trainable` (ray-project#38382)

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):
* Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`.
    * Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
    * Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers**
* Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`.
    * I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.**

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
… Class `Trainable` (ray-project#38382)

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):
* Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`.
    * Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
    * Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers**
* Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`.
    * I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.**

Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants