[train] New persistence mode: Support the `tune.run(restore="ckpt_path")` API #38804
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@krfricke ok to merge if it looks good. Are you ok with the slight API behavior change here?
ping @zhe-thoughts for approval for 2.7
LGTM, will merge after approval from @zhe-thoughts
…persistence/trial_restore
This is part of Train API clean up
Checked with @aslonnie and confirmed CI is OK
…h")` API (ray-project#38804) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…h")` API (ray-project#38804) Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…h")` API (ray-project#38804) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
`tune.run(restore="/path/to/local/checkpoint")` previously didn't work. This PR allows you to resume training a single trial from a checkpoint. However, this does NOT load back the training iteration from before: the trial starts from scratch, initialized with the checkpoint. It's now more similar to `Trainer(resume_from_checkpoint)`. This is a behavior change due to the "trainable metadata" (including the training iteration) no longer being saved within the checkpoint.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
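
To illustrate the behavior change described under "Why are these changes needed?", here is a minimal toy sketch in plain Python. It is not Ray's implementation and does not call any Ray API: `ToyTrainable`, `run`, and the `state.json` file name are hypothetical names made up for this example. It only assumes what the description states, namely that the checkpoint holds user state but not the training iteration, so a restored trial picks up the saved state while its iteration counter starts again from zero.

```python
import json
import os
import tempfile


class ToyTrainable:
    """Hypothetical stand-in for a trainable; not Ray's actual class."""

    def __init__(self):
        self.weight = 0.0            # user state, written to the checkpoint
        self.training_iteration = 0  # "trainable metadata", NOT written to the checkpoint

    def step(self):
        self.weight += 1.0
        self.training_iteration += 1

    def save_checkpoint(self, checkpoint_dir):
        # Only user state is saved; the iteration counter is deliberately left out.
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"weight": self.weight}, f)

    def load_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            self.weight = json.load(f)["weight"]


def run(num_steps, restore=None):
    """Toy analogue of running one trial, optionally starting from a checkpoint."""
    trainable = ToyTrainable()
    if restore is not None:
        trainable.load_checkpoint(restore)
    for _ in range(num_steps):
        trainable.step()
    return trainable


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = run(num_steps=3)
    first.save_checkpoint(ckpt_dir)
    resumed = run(num_steps=2, restore=ckpt_dir)

# The weight continues from the checkpoint, the iteration count does not.
print(resumed.weight, resumed.training_iteration)
```

In this sketch the resumed run continues from the saved weight, but its `training_iteration` counts only the steps taken after the restore, which mirrors the "starts from scratch with the checkpoint" wording above.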