[air] `pyarrow.fs` persistence (12/n): Patch new persistence path for Class `Trainable` #38382
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/disable_for_cls_trainables Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit e23505e. Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ants Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com> pt 2 Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/disable_for_cls_trainables_approach2
# File containing dict data returned by user from `Trainable.save_checkpoint`
_DICT_CHECKPOINT_FILE_NAME = "dict_checkpoint.pkl"
# Marker file indicating that a checkpoint is a dict checkpoint.
_DICT_CHECKPOINT_MARKER = ".is_dict_checkpoint"
Could we check the existence of dict_checkpoint.pkl instead?
I was worried about users writing a file with the same name in their `save_checkpoint`. We could add an underscore at the beginning, and most likely this won't cause any problems?
I also think that the marker is ok.
Underscore sounds good. This is just a temporary thing anyways, right? I really want to get rid of as much marker cruft as possible to keep it straightforward.
I think we'll need to keep this part of the implementation around even after 2.7, unless we want to change the `save_checkpoint` API to not accept a dict anymore. I'll remove the marker.
Please remove the checkpoint marker.
) as f:
    ray_pickle.dump(checkpoint_dict_or_path, f)

# TODO(justinvyu): Ignoring relpaths returned by save_checkpoint for now
I'm actually surprised this doesn't raise issues. Are we running unit tests with the new storage context somewhere already? Should we add CI jobs that run the full tune/train unit test suite?
E.g. in this mock_trainable we expect to get a path to a file back.
Same in RLlib's mock trainable.
The good thing, though, is that we seem to do this only in some examples and in the mock trainables. RLlib just returns the original checkpoint path.
In my opinion, we can introduce a breaking change here. For this we should detect if someone didn't return the original checkpoint path and raise an error.
The surface area will be small - it only affects users with class trainables, and of those only those that return a subpath, and the fix for them is very simple.
cc @ericl for the breaking change. Alternatively we have a code path below that does recover the original path passed to tune, but it has to be stored in some checkpoint metadata.
Only `test_new_persistence` has the flag turned on right now. Otherwise, there would have been too many tests to update while I was still trying to wrap up the functionality for the new path.
I think the CI jobs you added are very helpful -- I was wondering about the best way to do this! We can start sharding the failing tests.
Ok, can we then detect a subpath and raise an actionable error? It would be great to do it in this PR. Then we can merge.
Ok, PTAL!
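For readers following the thread, the requested check (detect that `save_checkpoint` returned something other than the original checkpoint directory and fail with an actionable message) might be sketched as follows. The function name `check_returned_checkpoint_path` and the error wording are hypothetical and are not taken from the PR.

```python
import os


def check_returned_checkpoint_path(returned_path: str, checkpoint_dir: str) -> None:
    """Raise if save_checkpoint returned a path other than checkpoint_dir itself."""
    returned = os.path.abspath(returned_path)
    expected = os.path.abspath(checkpoint_dir)
    if returned == expected:
        return

    # commonpath equals `expected` only when `returned` lives inside it.
    is_subpath = os.path.commonpath([returned, expected]) == expected
    detail = (
        "a path inside the checkpoint directory"
        if is_subpath
        else "a path outside the checkpoint directory"
    )
    raise ValueError(
        f"`save_checkpoint` returned {detail}: {returned_path!r}. "
        f"Return the `checkpoint_dir` argument ({checkpoint_dir!r}) unchanged, "
        "or return a dict instead."
    )


# Usage: returning the directory itself passes; returning a file inside it fails.
check_returned_checkpoint_path("/tmp/ckpt", "/tmp/ckpt")
try:
    check_returned_checkpoint_path("/tmp/ckpt/model.pt", "/tmp/ckpt")
except ValueError as exc:
    message = str(exc)
```

Comparing absolute paths keeps the check independent of trailing slashes and relative segments; the message tells the user exactly what to return instead, which is the "actionable" part asked for above.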
Thanks!
… Class `Trainable` (ray-project#38382) This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs): * Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`. * Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker. * Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers** * Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`. * I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.** Signed-off-by: harborn <gangsheng.wu@intel.com>
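The indexing change described in the commit message (checkpoint directories numbered 0, 1, 2, 3, ... rather than by training iteration) can be illustrated with a small standalone sketch. The function `checkpoint_dir_names`, the `checkpoint_frequency` parameter, and the zero-padding width are assumptions made for the example; the message itself only shows the elided form `checkpoint_0000x`.

```python
from typing import List


def checkpoint_dir_names(
    total_iterations: int, checkpoint_frequency: int, width: int = 6
) -> List[str]:
    """Name checkpoint dirs by a running counter, not by the training iteration.

    `width` is an assumed zero-padding width, chosen only for illustration.
    """
    names = []
    next_index = 0
    for iteration in range(1, total_iterations + 1):
        if iteration % checkpoint_frequency == 0:
            # The index advances once per saved checkpoint, whatever the iteration.
            names.append(f"checkpoint_{next_index:0{width}d}")
            next_index += 1
    return names


# Checkpointing every 5 iterations over 15 iterations saves at iterations 5, 10, 15.
names = checkpoint_dir_names(total_iterations=15, checkpoint_frequency=5)
# -> ['checkpoint_000000', 'checkpoint_000001', 'checkpoint_000002']
```

Under the previous behavior the same run would have produced directories keyed by iterations 5, 10, and 15, so any tooling that parses the directory suffix as an iteration number would need adjusting.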
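Why are these changes needed?
This PR enables class trainables to run with the new persistence codepath (storage context, pyarrow fs):
* Checkpoint saving for class trainables now uses `storage.persist_current_checkpoint` and saves a new Checkpoint object in internal book-keeping. This also means that users will receive a new Checkpoint in their `Result`.
* Returning a dict from `save_checkpoint` will create a "dict" Checkpoint, which is just a directory with a special `dict_checkpoint.pkl` file and a `.is_dict_checkpoint` marker.
* Class trainable checkpoint dir indexing is now indexed as 0, 1, 2, 3, ..., rather than by the iteration. **This is a minor change in behavior to be consistent with fn trainables / Trainers**
* Checkpoint loading is patched to give the user a dir/dict depending on what they returned from `save_checkpoint`.
* I'm not handling the relative checkpoint dir case. I just always give the user the root `checkpoint_0000x` dir as the argument to `load_checkpoint`. **This is a minor change in behavior, and needs to be validated if it's ok to do (rllib?). Otherwise we should add back the relpath support.**
Related issue number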
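Checks
* I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
* I've run `scripts/format.sh` to lint the changes in this PR.
* I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.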