
[train] Storage refactor: Support PBT and BOHB #38736

Merged
merged 51 commits into ray-project:master from tune/pbt-bohb-pause on Aug 25, 2023

Conversation

@krfricke (Contributor) commented on Aug 22, 2023

Why are these changes needed?

Previously, PBT and BOHB did not work with the new storage context, because they directly affect the Ray Tune control flow and rely on specific behavior.

In particular, PBT and BOHB made heavy use of pausing trials, and PBT also saved and restored from "objects".

However, the new storage backend does not support memory checkpoints anymore. Saving and restoring from objects was also removed. This means PBT and BOHB have to be adjusted and the pausing logic has to be revamped.

For BOHB and Hyperband in general, we now avoid controlling trial status directly and instead rely on choose_trial_to_run to unpause trials. This means the Tune control loop has greater control over the pausing logic.
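As a rough, self-contained sketch of that flow (the stub classes below are illustrative only, not Ray's actual scheduler code; trials_to_unpause is the set referenced later in this conversation):

from dataclasses import dataclass, field
from typing import List, Optional, Set

PAUSED = "PAUSED"


@dataclass(eq=False)
class _TrialStub:
    name: str
    status: str = PAUSED


@dataclass
class _BracketStub:
    trials: List[_TrialStub] = field(default_factory=list)
    # New approach: the bracket only records which trials should resume.
    trials_to_unpause: Set[_TrialStub] = field(default_factory=set)


def choose_trial_to_run(brackets: List[_BracketStub]) -> Optional[_TrialStub]:
    # Paused trials marked for unpausing are handed back to the Tune control
    # loop, instead of the scheduler flipping trial.status to PENDING itself.
    for bracket in brackets:
        for trial in bracket.trials:
            if trial.status == PAUSED and trial in bracket.trials_to_unpause:
                return trial
    return None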

In the tune control loop, we now detect double saves. If a save() future is scheduled while another one is in-flight (which can happen in PBT), we don't schedule another save.

In PBT, we now schedule persistent checkpoints instead of memory checkpoints. The main difference is that persistent checkpoints may get deleted before the exploiting trial gets a chance to load from them. For this reason, we detect too-small num_to_keep values and log a warning.
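In practice, this means PBT users should retain enough checkpoints that an exploited checkpoint is still on disk when the cloning trial restores from it. A hedged usage sketch, assuming the Ray 2.7-era tune.Tuner / train.RunConfig / train.CheckpointConfig API (train_fn below is only a placeholder trainable):

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining


def train_fn(config):
    # Placeholder trainable; a real PBT trainable would also restore from
    # train.get_checkpoint() when it is exploited.
    import tempfile

    for step in range(10):
        with tempfile.TemporaryDirectory() as tmpdir:
            train.report(
                {"loss": 1.0 / (step + 1)},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=4,
    hyperparam_mutations={"lr": tune.loguniform(1e-4, 1e-1)},
)

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        scheduler=pbt, num_samples=8, metric="loss", mode="min"
    ),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # Keep enough checkpoints so an exploited checkpoint is not
            # deleted before the cloning trial loads it; None keeps all.
            num_to_keep=None,
        ),
    ),
)
results = tuner.fit()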

Related issue number

Closes #38569

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 26 commits August 17, 2023 13:09
@krfricke (Contributor, Author) left a comment:

Leaving some review guidance

@@ -286,6 +286,8 @@ def __init__(

        # Used for keeping top K checkpoints.
        self._top_persisted_checkpoints: List[_HeapCheckpointWrapper] = []
        # Also keep a set of all existing checkpoints.
        self._all_persisted_checkpoint_data: Set[_TrackedCheckpoint] = set()
@krfricke (author):

We may now schedule multiple saves in succession (e.g. a save is scheduled, then the trial is instructed to pause, which also schedules a save, or PBT schedules an additional save).

Previously, these additional saves were memory checkpoints that were not registered. We want to safeguard here against the same checkpoint being registered multiple times.
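A minimal, self-contained sketch of that safeguard (an illustrative class, not Ray's actual checkpoint manager; the attribute names mirror the diff above):

import heapq
from typing import Any, List, Set


class _CheckpointManagerSketch:
    def __init__(self) -> None:
        # Used for keeping top K checkpoints.
        self._top_persisted_checkpoints: List[Any] = []
        # Also keep a set of all existing checkpoints.
        self._all_persisted_checkpoint_data: Set[Any] = set()

    def register_checkpoint(self, checkpoint: Any, priority: float) -> None:
        # The same checkpoint can now be reported more than once (e.g. a
        # pause-triggered save overlapping with a PBT-triggered save), so
        # skip anything that has already been registered.
        if checkpoint in self._all_persisted_checkpoint_data:
            return
        self._all_persisted_checkpoint_data.add(checkpoint)
        heapq.heappush(
            self._top_persisted_checkpoints, (priority, id(checkpoint), checkpoint)
        )
        # Eviction of low-priority checkpoints (bounded by num_to_keep) would
        # happen here in the real manager.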

@@ -80,6 +80,31 @@ class TrainingResult:
    metadata: Optional[Dict] = None


class _FutureTrainingResult:
@krfricke (author):

This wrapper is only used by PBT and it serves a similar purpose as the previous future in _TrackedCheckpoint.

It's not used anywhere else (except being created in the tune controller).
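For orientation, a hedged sketch of the wrapper's shape (illustrative only; the real class lives in ray.train._internal.session and its error handling differs):

import ray


class _FutureTrainingResultSketch:
    def __init__(self, future: ray.ObjectRef) -> None:
        self.future = future

    def resolve(self):
        # Block until the in-flight save() returns its TrainingResult
        # (the checkpoint plus the metrics attached to it). PBT calls this
        # before exploiting a trial's checkpoint.
        return ray.get(self.future)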

@@ -1,3 +1,6 @@
import os
@krfricke (author):

This just changes the API to the new Checkpoint API

@@ -1,5 +1,6 @@
import argparse
import os
import tempfile
@krfricke (author):

Same here, this just changes the API to the new Checkpoint API

    self._callbacks.on_trial_recover(
        iteration=self._iteration, trials=self._trials, trial=trial
    )
elif trial.status in {Trial.RUNNING, Trial.PENDING}:
@krfricke (author):

Trials may now be PENDING and failing, e.g. when a save resolves late.

Reviewer (Contributor):

This is if a trial is paused, schedules a save, then gets unpaused (set to pending) and then the save fails?

@krfricke (author):

That's right!

Comment on lines +1941 to +1944
if trial.temporary_state.saving_to:
    # If a save is already in progress, don't schedule another one.
    return trial.temporary_state.saving_to

@krfricke (author):

This also de-dupes checkpoints

Reviewer (Contributor):

old codepath

Reviewer (Contributor):

Can we also add a typehint to saving_to in the _TemporaryTrialState class? It's always a _FutureTrainingResult now
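For example, the suggestion amounts to something like the following line in _TemporaryTrialState (a sketch only, assuming Optional is imported from typing and _FutureTrainingResult is imported where the class is defined):

self.saving_to: Optional[_FutureTrainingResult] = None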


if checkpoint.dir_or_data is None:
    logger.debug(f"Not restoring trial {trial}: No checkpoint found.")
    return False

kwargs = {}

-  if checkpoint.storage_mode == CheckpointStorage.MEMORY:
+  if checkpoint.storage_mode == CheckpointStorage.MEMORY or isinstance(
+      checkpoint.dir_or_data, ray.ObjectRef
@krfricke (author):

We schedule some restores directly from the object ref

Reviewer (Contributor):

old codepath

Comment on lines +144 to +145
trial.status == Trial.PAUSED
and trial in bracket.trials_to_unpause
@krfricke (author):

This is a change in the Hyperband scheduler: instead of directly setting the trial status to PENDING, we add trials to the trials_to_unpause set. The scheduler will then select them in choose_trial_to_run.

Reviewer (Contributor):

Nice, I like this.

)

if isinstance(last_checkpoint, _FutureTrainingResult):
    training_result = last_checkpoint.resolve()
@krfricke (author):

Here we synchronously wait for the save to resolve. This may impact performance slightly, but let's see if it's actually a problem. We only wait for trials we actually exploit, and we do need the attached result.

Kai Fricke added 3 commits August 23, 2023 13:39
@justinvyu (Contributor) left a comment:

I think it's good to go.

I have one suggestion for clarity that I'd like to get in, but it's not 100% needed.

Basically, let's make PBTTrialState.last_checkpoint always either a _FutureTrainingResult or a _TrainingResult. Right now it can be those as well as a Checkpoint.

Here, just set clone_state.last_checkpoint = training_result:

last_checkpoint = clone_state.last_checkpoint
logger.debug(
    f"Trial {trial} is in lower quantile. "
    f"Exploiting trial {trial_to_clone}."
)
if isinstance(last_checkpoint, _FutureTrainingResult):
    training_result = last_checkpoint.resolve()
    if training_result:
        clone_state.last_result = training_result.metrics
        clone_state.last_checkpoint = training_result.checkpoint
        last_checkpoint = clone_state.last_checkpoint
    else:

Then, no need to construct a new "training result" on exploit.

python/ray/train/_internal/storage.py (review comment: outdated, resolved)
@krfricke (author):

It makes sense, but it's a bit confusing (why would a checkpoint be a training result?). We also use trial.checkpoint above, which returns the actual checkpoint and not the result.

Let's keep it as is for now (the whole situation is very confusing anyway: two different checkpoint classes, three checkpoint managers, training results, tracked checkpoints...). But I'd like to specifically make time to clean up the leftovers from the old code path.

@justinvyu:

Ok, good with me. Here's what should remain after everything is all cleaned up:

train._internal.checkpoint_manager.CheckpointManager
train.Checkpoint
train._internal.session.TrainingResult
train._internal.session.FutureTrainingResult (hopefully temporary)

@krfricke added the tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.) and Ray 2.7 labels on Aug 25, 2023
@zhe-thoughts (Collaborator) left a comment:

Necessary change for Train

@zhe-thoughts merged commit 5e3b2f7 into ray-project:master on Aug 25, 2023
68 of 72 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
@krfricke deleted the tune/pbt-bohb-pause branch on September 22, 2023 at 22:44
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Labels
Ray 2.7; tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)

Successfully merging this pull request may close these issues.

[tune] Update PBT and BOHB to support _TrainingResult
4 participants