
[train] Storage refactor: Support PBT and BOHB #38736

Merged
merged 51 commits into ray-project:master from tune/pbt-bohb-pause on Aug 25, 2023

Conversation

@krfricke (Contributor) commented on Aug 22, 2023

Why are these changes needed?

Previously, PBT and BOHB did not work with the new storage context, because they directly affect the Ray Tune control flow and rely on specific behavior.

In particular, PBT and BOHB made heavy use of pausing trials, and PBT also saved and restored from "objects".

However, the new storage backend does not support memory checkpoints anymore. Saving and restoring from objects was also removed. This means PBT and BOHB have to be adjusted and the pausing logic has to be revamped.

For BOHB and Hyperband in general, we now avoid controlling trial status directly and instead rely on choose_trial_to_run to unpause trials. This means the Tune control loop has greater control over the pausing logic.
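As a rough, self-contained sketch of that flow (the stub classes below are illustrative only, not Ray's actual scheduler code; trials_to_unpause is the set referenced later in this conversation):

from dataclasses import dataclass, field
from typing import List, Optional, Set

PAUSED = "PAUSED"


@dataclass(eq=False)
class _TrialStub:
    name: str
    status: str = PAUSED


@dataclass
class _BracketStub:
    trials: List[_TrialStub] = field(default_factory=list)
    # New approach: the bracket only records which trials should resume.
    trials_to_unpause: Set[_TrialStub] = field(default_factory=set)


def choose_trial_to_run(brackets: List[_BracketStub]) -> Optional[_TrialStub]:
    # Paused trials marked for unpausing are handed back to the Tune control
    # loop, instead of the scheduler flipping trial.status to PENDING itself.
    for bracket in brackets:
        for trial in bracket.trials:
            if trial.status == PAUSED and trial in bracket.trials_to_unpause:
                return trial
    return None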

In the tune control loop, we now detect double saves. If a save() future is scheduled while another one is in-flight (which can happen in PBT), we don't schedule another save.

In PBT, we now schedule persistent checkpoints instead of memory checkpoints. The main difference is that persistent checkpoints may get deleted before the exploiting trial gets a chance to load from them. For this reason, we detect too-small num_to_keep values and log a warning.
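In practice, this means PBT users should retain enough checkpoints that an exploited checkpoint is still on disk when the cloning trial restores from it. A hedged usage sketch, assuming the Ray 2.7-era tune.Tuner / train.RunConfig / train.CheckpointConfig API (train_fn below is only a placeholder trainable):

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining


def train_fn(config):
    # Placeholder trainable; a real PBT trainable would also restore from
    # train.get_checkpoint() when it is exploited.
    import tempfile

    for step in range(10):
        with tempfile.TemporaryDirectory() as tmpdir:
            train.report(
                {"loss": 1.0 / (step + 1)},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=4,
    hyperparam_mutations={"lr": tune.loguniform(1e-4, 1e-1)},
)

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        scheduler=pbt, num_samples=8, metric="loss", mode="min"
    ),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # Keep enough checkpoints so an exploited checkpoint is not
            # deleted before the cloning trial loads it; None keeps all.
            num_to_keep=None,
        ),
    ),
)
results = tuner.fit()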

Related issue number

Closes #38569

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 26 commits August 17, 2023 13:09
@krfricke (Contributor, Author) left a comment:

Leaving some review guidance

@@ -286,6 +286,8 @@ def __init__(

        # Used for keeping top K checkpoints.
        self._top_persisted_checkpoints: List[_HeapCheckpointWrapper] = []
        # Also keep a set of all existing checkpoints.
        self._all_persisted_checkpoint_data: Set[_TrackedCheckpoint] = set()
@krfricke (author):

We may now schedule multiple saves in succession (e.g. a save is scheduled, then the trial is instructed to pause, which also schedules a save, or PBT schedules an additional save).

Previously, these additional saves were memory checkpoints that were not registered. We want to safeguard here against the same checkpoint being registered multiple times.
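A minimal, self-contained sketch of that safeguard (an illustrative class, not Ray's actual checkpoint manager; the attribute names mirror the diff above):

import heapq
from typing import Any, List, Set


class _CheckpointManagerSketch:
    def __init__(self) -> None:
        # Used for keeping top K checkpoints.
        self._top_persisted_checkpoints: List[Any] = []
        # Also keep a set of all existing checkpoints.
        self._all_persisted_checkpoint_data: Set[Any] = set()

    def register_checkpoint(self, checkpoint: Any, priority: float) -> None:
        # The same checkpoint can now be reported more than once (e.g. a
        # pause-triggered save overlapping with a PBT-triggered save), so
        # skip anything that has already been registered.
        if checkpoint in self._all_persisted_checkpoint_data:
            return
        self._all_persisted_checkpoint_data.add(checkpoint)
        heapq.heappush(
            self._top_persisted_checkpoints, (priority, id(checkpoint), checkpoint)
        )
        # Eviction of low-priority checkpoints (bounded by num_to_keep) would
        # happen here in the real manager.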

@@ -80,6 +80,31 @@ class TrainingResult:
    metadata: Optional[Dict] = None


class _FutureTrainingResult:
@krfricke (author):

This wrapper is only used by PBT and it serves a similar purpose as the previous future in _TrackedCheckpoint.

It's not used anywhere else (except being created in the tune controller).
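For orientation, a hedged sketch of the wrapper's shape (illustrative only; the real class lives in ray.train._internal.session and its error handling differs):

import ray


class _FutureTrainingResultSketch:
    def __init__(self, future: ray.ObjectRef) -> None:
        self.future = future

    def resolve(self):
        # Block until the in-flight save() returns its TrainingResult
        # (the checkpoint plus the metrics attached to it). PBT calls this
        # before exploiting a trial's checkpoint.
        return ray.get(self.future)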

@@ -1,3 +1,6 @@
import os
@krfricke (author):

This just changes the API to the new Checkpoint API

@@ -1,5 +1,6 @@
import argparse
import os
import tempfile
@krfricke (author):

Same here, this just changes the API to the new Checkpoint API

    self._callbacks.on_trial_recover(
        iteration=self._iteration, trials=self._trials, trial=trial
    )
elif trial.status in {Trial.RUNNING, Trial.PENDING}:
@krfricke (author):

Trials may now be PENDING and failing, e.g. when a save resolves late.

Reviewer (Contributor):

This is if a trial is paused, schedules a save, then gets unpaused (set to pending) and then the save fails?

@krfricke (author):

That's right!

Comment on lines +1941 to +1944
if trial.temporary_state.saving_to:
    # If a save is already in progress, don't schedule another one.
    return trial.temporary_state.saving_to

@krfricke (author):

This also de-dupes checkpoints

Reviewer (Contributor):

old codepath

Reviewer (Contributor):

Can we also add a typehint to saving_to in the _TemporaryTrialState class? It's always a _FutureTrainingResult now
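For example, the suggestion amounts to something like the following line in _TemporaryTrialState (a sketch only, assuming Optional is imported from typing and _FutureTrainingResult is imported where the class is defined):

self.saving_to: Optional[_FutureTrainingResult] = None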


if checkpoint.dir_or_data is None:
    logger.debug(f"Not restoring trial {trial}: No checkpoint found.")
    return False

kwargs = {}

-  if checkpoint.storage_mode == CheckpointStorage.MEMORY:
+  if checkpoint.storage_mode == CheckpointStorage.MEMORY or isinstance(
+      checkpoint.dir_or_data, ray.ObjectRef
@krfricke (author):

We schedule some restores directly from the object ref

Reviewer (Contributor):

old codepath

Comment on lines +144 to +145
trial.status == Trial.PAUSED
and trial in bracket.trials_to_unpause
@krfricke (author):

This is a change in the Hyperband scheduler: instead of directly setting the trial status to PENDING, we add trials to the trials_to_unpause set. The scheduler will then select them in choose_trial_to_run.

Reviewer (Contributor):

Nice, I like this.

)

if isinstance(last_checkpoint, _FutureTrainingResult):
    training_result = last_checkpoint.resolve()
@krfricke (author):

Here we synchronously wait for the save to resolve. This may impact performance slightly, but let's see if it's actually a problem. We only wait for trials we actually exploit, and we do need the attached result.

Kai Fricke added 3 commits August 23, 2023 13:39
@justinvyu (Contributor) left a comment:

I think it's good to go.

I have one suggestion for clarity that I'd like to get in, but it's not 100% needed.

Basically, let's make PBTTrialState.last_checkpoint always either a _FutureTrainingResult or a _TrainingResult. Right now it can be those as well as a Checkpoint.

Here, just set clone_state.last_checkpoint = training_result:

last_checkpoint = clone_state.last_checkpoint
logger.debug(
    f"Trial {trial} is in lower quantile. "
    f"Exploiting trial {trial_to_clone}."
)
if isinstance(last_checkpoint, _FutureTrainingResult):
    training_result = last_checkpoint.resolve()
    if training_result:
        clone_state.last_result = training_result.metrics
        clone_state.last_checkpoint = training_result.checkpoint
        last_checkpoint = clone_state.last_checkpoint
    else:

Then, no need to construct a new "training result" on exploit.

python/ray/train/_internal/storage.py (review comment: outdated, resolved)
@krfricke (author):

It makes sense, but it's a bit confusing (why would a checkpoint be a training result?). We also use trial.checkpoint above, which returns the actual checkpoint and not the result.

Let's keep it as is for now (the whole situation is very confusing anyway: two different checkpoint classes, three checkpoint managers, training results, tracked checkpoints...). But I'd like to specifically make time to clean up the leftovers from the old code path.

@justinvyu:

Ok, good with me. Here's what should remain after everything is all cleaned up:

train._internal.checkpoint_manager.CheckpointManager
train.Checkpoint
train._internal.session.TrainingResult
train._internal.session.FutureTrainingResult (hopefully temporary)

@krfricke added the tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.) and Ray 2.7 labels on Aug 25, 2023
@zhe-thoughts (Collaborator) left a comment:

Necessary change for Train

@zhe-thoughts merged commit 5e3b2f7 into ray-project:master on Aug 25, 2023
68 of 72 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
@krfricke deleted the tune/pbt-bohb-pause branch on September 22, 2023 at 22:44
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Labels
Ray 2.7; tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)

Successfully merging this pull request may close these issues.

[tune] Update PBT and BOHB to support _TrainingResult
4 participants