[air] `pyarrow.fs` persistence (12/n): Patch new persistence path for Class `Trainable` #38382

justinvyu · 2023-08-12T23:51:13Z

Why are these changes needed?

This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):

- Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new `Checkpoint` object in internal bookkeeping. As a result, users will also receive a new `Checkpoint` in their `Result`.
  - Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
  - Class trainable checkpoint dirs are now indexed as 0, 1, 2, 3, ..., rather than by the training iteration. **This is a minor change in behavior, made to be consistent with fn trainables / Trainers.**
- Checkpoint loading is patched to give the user a dir or a dict, depending on what they returned from `save_checkpoint`.
  - I'm not handling the relative checkpoint dir case. I always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and we still need to validate that it's ok to do (rllib?). Otherwise we should add back the relpath support.**
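
From the user's side, the save/load contract described above can be sketched roughly as follows. This is only an illustration: the two classes are plain Python stand-ins rather than subclasses of `ray.tune.Trainable`, and `round_trip`, `state.json`, and `step_count` are hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class DictStyleTrainable:
    """Stand-in trainable that checkpoints by returning a dict."""

    def __init__(self, step_count=0):
        self.step_count = step_count

    def save_checkpoint(self, checkpoint_dir):
        # Returning a dict: load_checkpoint later receives a dict.
        return {"step_count": self.step_count}

    def load_checkpoint(self, checkpoint):
        self.step_count = checkpoint["step_count"]


class DirStyleTrainable:
    """Stand-in trainable that checkpoints by writing files into the dir."""

    def __init__(self, step_count=0):
        self.step_count = step_count

    def save_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"step_count": self.step_count}, f)
        # Returning the dir unchanged: load_checkpoint later receives
        # the root checkpoint dir.
        return checkpoint_dir

    def load_checkpoint(self, checkpoint):
        with open(os.path.join(checkpoint, "state.json")) as f:
            self.step_count = json.load(f)["step_count"]


def round_trip(trainable_cls, step_count):
    """Hypothetical driver: save from one instance, restore into a fresh one."""
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        saved = trainable_cls(step_count).save_checkpoint(checkpoint_dir)
        restored = trainable_cls()
        # Hand back a dict if a dict was returned, otherwise the root dir.
        restored.load_checkpoint(
            saved if isinstance(saved, dict) else checkpoint_dir
        )
        return restored.step_count
```

In both styles, `load_checkpoint` gets back the same kind of object that `save_checkpoint` returned, which is the behavior the description aims for.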

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

ericl · 2023-08-14T19:24:27Z

python/ray/tune/trainable/trainable.py

+# File containing dict data returned by user from `Trainable.save_checkpoint`
+_DICT_CHECKPOINT_FILE_NAME = "dict_checkpoint.pkl"
+# Marker file indicating that a checkpoint is a dict checkpoint.
+_DICT_CHECKPOINT_MARKER = ".is_dict_checkpoint"


Could we check the existence of dict_checkpoint.pkl instead?

justinvyu · 2023-08-14

I was worried about users writing a file with the same name in their save_checkpoint. We could add an underscore at the beginning, and most likely this won't cause any problems?

I also think that the marker is ok.

ericl · 2023-08-14

Underscore sounds good. This is just a temporary thing anyway, right? I really want to get rid of as much marker cruft as possible to keep it straightforward.

justinvyu · 2023-08-14

I think we'll need to keep this part of the implementation around even after 2.7, unless we want to change the save_checkpoint API to not accept a dict anymore. I'll remove the marker.
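
The marker-free approach agreed on in this thread could look roughly like the sketch below: a checkpoint dir counts as a dict checkpoint if and only if the reserved pickle file exists. The underscore-prefixed file name is an assumption based on the suggestion above, the helper names are hypothetical, and the standard-library `pickle` stands in for the `ray_pickle` module used in the actual diff.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed name: the existing file name with the proposed leading underscore.
DICT_CHECKPOINT_FILE_NAME = "_dict_checkpoint.pkl"


def write_dict_checkpoint(checkpoint_dir, data):
    """Store a dict returned by save_checkpoint inside the checkpoint dir."""
    with open(Path(checkpoint_dir) / DICT_CHECKPOINT_FILE_NAME, "wb") as f:
        pickle.dump(data, f)


def load_dict_or_dir(checkpoint_dir):
    """Return the stored dict if the reserved file exists, else the dir path."""
    dict_file = Path(checkpoint_dir) / DICT_CHECKPOINT_FILE_NAME
    if dict_file.exists():
        with open(dict_file, "rb") as f:
            return pickle.load(f)
    return str(checkpoint_dir)
```

The trade-off is the one raised above: without a separate marker, a user file that happens to use the reserved name would be misread as a dict checkpoint, which the underscore prefix makes unlikely but not impossible.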

ericl

Please remove the checkpoint marker.

krfricke · 2023-08-15T08:27:18Z

python/ray/tune/trainable/trainable.py

+                    ) as f:
+                        ray_pickle.dump(checkpoint_dict_or_path, f)
+
+                # TODO(justinvyu): Ignoring relpaths returned by save_checkpoint for now


I'm actually surprised this doesn't raise issues. Are we running unit tests with the new storage context somewhere already? Should we add CI jobs that run the full tune/train unit test suite?

E.g. in this mock_trainable we expect to get a path to a file back.

Same in rllib's mock trainable.

The good news is that we seem to only do this in some examples and in the mock trainables. RLlib just returns the original checkpoint path.

In my opinion, we can introduce a breaking change here. For this we should detect if someone didn't return the original checkpoint path and raise an error.

The surface area will be small - it only affects users with class trainables, and of those only those that return a subpath, and the fix for them is very simple.

cc @ericl for the breaking change. Alternatively we have a code path below that does recover the original path passed to tune, but it has to be stored in some checkpoint metadata.

justinvyu · 2023-08-15

Only test_new_persistence has the flag turned on right now. Otherwise, there would have been too many tests to update while I was still trying to wrap up the functionality for the new path.

I think the CI jobs you added are very helpful -- I was wondering about the best way to do this! We can start sharding the failing tests.

krfricke · 2023-08-15

Ok, can we then detect a subpath and raise an actionable error? It would be great to do it in this PR. Then we can merge.
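
A minimal sketch of the requested check, assuming `save_checkpoint` may return either a dict or the `checkpoint_dir` it was given. The function name and error wording are hypothetical and not taken from the merged code.

```python
import os


def validate_save_checkpoint_output(checkpoint_dir, returned):
    """Raise an actionable error if `returned` is not a dict or checkpoint_dir."""
    if isinstance(returned, dict):
        return
    # Normalize both paths so trailing slashes do not cause false positives.
    expected = os.path.abspath(checkpoint_dir)
    actual = os.path.abspath(returned)
    if actual != expected:
        raise ValueError(
            "`save_checkpoint` must return a dict or the `checkpoint_dir` "
            f"it was given ({expected}), but it returned {returned!r}. "
            "Write your files into `checkpoint_dir` and return it unchanged; "
            "`load_checkpoint` will then receive that same directory."
        )
```

A subpath such as `os.path.join(checkpoint_dir, "model.pt")` fails the check, and the message tells the user the one-line fix: return `checkpoint_dir` itself.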

krfricke

Thanks!

justinvyu added 9 commits August 11, 2023 16:45

Disable env var for class trainables

bb7c246

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Add a new env var to indicate class trainables

aee789a

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

option 1: surface level Result.checkpoint change

e23505e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Add unit test

9b32359

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

bc91462

…persistence/disable_for_cls_trainables Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Fix circular import

5b023e4

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Revert "option 1: surface level Result.checkpoint change"

d16153d

This reverts commit e23505e. Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Save working for cls trainables in new path

c72a7e2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

cls trainable restore new persistence

572b352

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested a review from ericl August 12, 2023 23:51

justinvyu assigned ericl Aug 12, 2023

justinvyu commented Aug 12, 2023

justinvyu added 5 commits August 14, 2023 10:53

Make checkpoint dir indexing consistent with fn trainables

4ca9117

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

No need for the cls trainable env var

169a070

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Update test with class trainable + shard some of the ckpt config vari…

da48ae5

…ants Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Remove cls trainable env var completely

85069c2

Signed-off-by: Justin Yu <justinvyu@anyscale.com> pt 2 Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

efad170

…persistence/disable_for_cls_trainables_approach2

justinvyu marked this pull request as ready for review August 14, 2023 19:01

justinvyu assigned krfricke Aug 14, 2023

justinvyu requested a review from krfricke August 14, 2023 19:02

ericl reviewed Aug 14, 2023

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023

justinvyu added 3 commits August 14, 2023 15:10

Assert checkpoint index values

c3ab62b

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Combine the storage path contents assertion

7256bd5

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into air/…

4d7b070

…persistence/disable_for_cls_trainables_approach2

justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023

justinvyu requested a review from ericl August 14, 2023 22:27

ericl reviewed Aug 14, 2023

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023

Remove dict ckpt marker

adb3166

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023

ericl approved these changes Aug 14, 2023

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 15, 2023

justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 15, 2023

krfricke reviewed Aug 15, 2023

justinvyu added 3 commits August 15, 2023 09:27

Merge branch 'master' of https://github.com/ray-project/ray into air/…

a8011d6

…persistence/disable_for_cls_trainables_approach2

Raise an actionable error on relpath output

abd944c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix + address comments

319be54

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 15, 2023

krfricke approved these changes Aug 15, 2023

Merge branch 'master' of https://github.com/ray-project/ray into air/…

cecc60e

…persistence/disable_for_cls_trainables_approach2 Signed-off-by: Justin Yu <justinvyu@anyscale.com>

ericl merged commit cd71d8d into ray-project:master Aug 15, 2023
5 of 28 checks passed

justinvyu deleted the air/persistence/disable_for_cls_trainables_approach2 branch August 15, 2023 20:10

matthewdeng mentioned this pull request Aug 18, 2023

[tune/train] Implement new persistence strategy and roll out as default option #38294

Closed
