Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] pyarrow.fs persistence (11/n): Support pausing trials (and certain schedulers) #38355

Merged
merged 41 commits into from
Aug 15, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 11, 2023

Why are these changes needed?

This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).

Instead of placing a save_to_object future in the trial data and immediately stopping the trial, this PR now waits to handle the save future before stopping the trial. Otherwise, the save future gets immediately cleared when schedule_trial_stop is called, causing no checkpoint to be saved --> possibly starting training from scratch.

Main part to review

This is the workaround I have for pausing trials w/o memory checkpoints:

TODO

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/unify_sessions

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 11, 2023
@ericl
Copy link
Contributor

ericl commented Aug 11, 2023

(can do final review once dependency is merged)

…persistence/pause_trials

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu marked this pull request as ready for review August 12, 2023 19:29
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm but requesting minimal changes for the tune controller changes

python/ray/tune/execution/tune_controller.py Outdated Show resolved Hide resolved
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
def stop_after_save_result(*args, **kwargs):
self._on_saving_result(*args, **kwargs)
self._schedule_trial_stop(trial)
self._set_trial_status(trial, Trial.PAUSED)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krfricke Is it ok that there's some time between the schedule save and the save result coming in, where the trial is still RUNNING?

Previously, the memory checkpoint path would just return a future and immediately set to PAUSED.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point.

I think it is a problem. If a pause instruction doesn't immediately pause the trial, we may send multiple instructions.

How about we set the status to PAUSED immediately, and again after the saving result resolved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done

Comment on lines +154 to +160
if trainable_type == "function":
monkeypatch.setenv(RAY_AIR_NEW_PERSISTENCE_MODE, "1")
yield train_fn
monkeypatch.setenv(RAY_AIR_NEW_PERSISTENCE_MODE, "0")
elif trainable_type == "class":
yield MyResettableClass
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for now, before the other class trainable PR is in.

@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 14, 2023
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 15, 2023
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@krfricke krfricke merged commit 921ec47 into ray-project:master Aug 15, 2023
58 of 62 checks passed
@justinvyu justinvyu deleted the air/persistence/pause_trials branch August 15, 2023 19:09
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…rtain schedulers) (ray-project#38355)

This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).

Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…rtain schedulers) (ray-project#38355)

This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).

Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…rtain schedulers) (ray-project#38355)

This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).

Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…rtain schedulers) (ray-project#38355)

This PR gets pausing trials to work without in-memory checkpoints. This is needed for pausing scheduler support (PBT, BOHB).

Instead of placing a `save_to_object` future in the trial data and immediately stopping the trial, this PR now waits to handle the `save` future before stopping the trial. Otherwise, the save future gets immediately cleared when `schedule_trial_stop` is called, causing no checkpoint to be saved --> possibly starting training from scratch.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants