[air] `pyarrow.fs` persistence (11/n): Support pausing trials (and certain schedulers) #38355
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
(can do final review once dependency is merged)
2e1b691 to 34e0c8e
Generally lgtm but requesting minimal changes for the tune controller changes
def stop_after_save_result(*args, **kwargs):
    self._on_saving_result(*args, **kwargs)
    self._schedule_trial_stop(trial)
    self._set_trial_status(trial, Trial.PAUSED)
@krfricke Is it ok that there's some time between scheduling the save and the save result coming in, during which the trial is still RUNNING?
Previously, the memory checkpoint path would just return a future and immediately set the status to PAUSED.
Ah good point.
I think it is a problem. If a pause instruction doesn't immediately pause the trial, we may send multiple instructions.
How about we set the status to PAUSED immediately, and again after the save result has resolved?
Ok, done
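To make the suggestion in this thread concrete, here is a minimal sketch in plain Python. It is not the Tune controller code from this PR: `ToyTrial`, `ToyPauser`, `Status`, and every method name below are hypothetical stand-ins. It only illustrates the idea of marking the trial PAUSED right away so that a repeated pause instruction is ignored, then setting PAUSED again once the save result comes in. The toy does not model why the second assignment may be needed in the real controller.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"


class ToyTrial:
    """Hypothetical trial object; only tracks a name and a status."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = Status.RUNNING


class ToyPauser:
    """Hypothetical controller fragment (assumption, not Ray Tune's API)."""

    def __init__(self) -> None:
        self.save_requests: List[str] = []  # one entry per scheduled save
        self._callbacks: List[Callable[[], None]] = []

    def pause(self, trial: ToyTrial) -> None:
        if trial.status is Status.PAUSED:
            # A pause is already in flight; do not schedule a second save.
            return
        self.save_requests.append(trial.name)
        # Set PAUSED immediately, before the save result has arrived.
        trial.status = Status.PAUSED

        def on_save_result() -> None:
            # The real code would handle the save result and stop the trial
            # here. The status is then set to PAUSED again, as suggested.
            trial.status = Status.PAUSED

        self._callbacks.append(on_save_result)

    def resolve_saves(self) -> None:
        """Simulate all pending save results arriving."""
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()


pauser = ToyPauser()
trial = ToyTrial("trial_a")
pauser.pause(trial)
pauser.pause(trial)  # second instruction arrives before the save resolves
status_before_resolve = trial.status
pauser.resolve_saves()
```

In this toy, the second `pause` call is a no-op because the status was already PAUSED, so only one save is scheduled.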
if trainable_type == "function":
    monkeypatch.setenv(RAY_AIR_NEW_PERSISTENCE_MODE, "1")
    yield train_fn
    monkeypatch.setenv(RAY_AIR_NEW_PERSISTENCE_MODE, "0")
elif trainable_type == "class":
    yield MyResettableClass
else:
This is just for now, before the other class trainable PR is in.
LGTM
…rtain schedulers) (ray-project#38355)
This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).
Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
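The ordering problem described in the commit message above can be illustrated with a small, self-contained sketch. This is not Ray Tune's implementation: `ToyController` and all of its methods are hypothetical, and the assumption that stopping a trial drops its pending save is taken only from the description ("the save future gets immediately cleared"). The sketch contrasts stopping right after scheduling a save with stopping from the save's completion callback.

```python
from typing import Callable, Dict, List, Optional


class ToyController:
    """Hypothetical stand-in for a trial controller (assumption, not Ray's API)."""

    def __init__(self) -> None:
        # trial name -> handler to run when that trial's save finishes
        self.pending_saves: Dict[str, Callable[[str], None]] = {}
        self.checkpoints: Dict[str, str] = {}
        self.stopped: List[str] = []

    def schedule_save(
        self, trial: str, on_result: Optional[Callable[[str], None]] = None
    ) -> None:
        def handle(checkpoint: str) -> None:
            self.checkpoints[trial] = checkpoint
            if on_result is not None:
                on_result(checkpoint)

        self.pending_saves[trial] = handle

    def schedule_stop(self, trial: str) -> None:
        # Assumption from the PR description: stopping clears pending futures.
        self.pending_saves.pop(trial, None)
        self.stopped.append(trial)

    def resolve_saves(self) -> None:
        """Simulate every still-pending save completing."""
        for trial, handle in list(self.pending_saves.items()):
            del self.pending_saves[trial]
            handle(f"checkpoint_of_{trial}")

    def pause_naive(self, trial: str) -> None:
        # Save and stop back to back: the stop discards the pending save.
        self.schedule_save(trial)
        self.schedule_stop(trial)

    def pause_after_save(self, trial: str) -> None:
        # Stop only once the save result has been handled.
        self.schedule_save(
            trial, on_result=lambda _checkpoint: self.schedule_stop(trial)
        )


naive = ToyController()
naive.pause_naive("trial_a")
naive.resolve_saves()

fixed = ToyController()
fixed.pause_after_save("trial_a")
fixed.resolve_saves()
```

After running this, `naive.checkpoints` is empty while `fixed.checkpoints` holds an entry for `trial_a`, and both controllers have stopped the trial exactly once.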
Why are these changes needed?
This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).
Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.
Main part to review
This is the workaround I have for pausing trials w/o memory checkpoints:
TODO
`pyarrow.fs` persistence (10/n): Unify Tune and Train sessions to support new persistence path in `FunctionTrainable` #38284
Related issue number
Checks
… `git commit -s`) in this PR.
… `scripts/format.sh` to lint the changes in this PR.
… method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.