[train+tune] Refactor restoration configuration to be centered around `storage_path` #42853

justinvyu · 2024-01-31T01:30:22Z

Why are these changes needed?

This PR cleans up restoration logic in Tune.

Context

Previously, we had local_dir and upload_dir as two separate configurations for where to store experiment results. Results would first go to local_dir, then to upload_dir. (Checkpoints were the exception, which had a special codepath to upload directly to cloud.)

Now, the storage_path is the only location that users can expect all their run data to go to. The local directory just serves as a staging ground to eventually copy to storage. This is an implementation detail that ideally doesn't need to exist.

At the moment, it's not actually possible to recover successfully from the "local directory" (set by the RAY_AIR_LOCAL_CACHE_DIR environment variable) when a storage path is set. Checkpoints are uploaded directly to the storage path, so restoring from the local staging directory results in trials starting from scratch.

Therefore, the resume="REMOTE" and resume="LOCAL" configurations are outdated artifacts of the old (<2.7) checkpointing/storage implementation. resume="AUTO" is now the only codepath, which greatly simplifies the restoration code.

Change Summary

Given this context, these are the main changes from this PR:

- Remove outdated resume configurations.
- Revamp the resume configuration to use a resume config object instead of a string of settings joined by "+". (Ex: AUTO+RESTART_ERRORED --> ResumeConfig(errored=ResumeConfig.ResumeType.RESTART).)
- Add a ResumeType.IGNORE setting that can be used to restart/resume only a subset of trials based on their statuses.
- Enable finished trials to be resumed, which allows for iterative experimentation (training for more epochs after the original training finished successfully). This is still not the recommended path (we recommend resume_from_checkpoint to start a new experiment), but continuing a run while still tracking the top K checkpoints is a common user ask.
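
To make the string-to-config translation above concrete, here is a minimal, self-contained sketch. It does not import Ray: `ResumeConfigSketch` and `from_legacy_string` are hypothetical names, the default values are assumptions chosen for illustration, and only the `RESTART_ERRORED` addon from the example above is handled.

```python
from dataclasses import dataclass
from enum import Enum


class ResumeType(Enum):
    RESUME = "resume"    # resume from the latest checkpoint
    RESTART = "restart"  # restart from the beginning (with no checkpoint)
    SKIP = "skip"        # leave the trial alone (IGNORE in the first revision of this PR)


@dataclass
class ResumeConfigSketch:
    # Hypothetical stand-in for the real config; the defaults are assumptions.
    finished: ResumeType = ResumeType.SKIP
    unfinished: ResumeType = ResumeType.RESUME
    errored: ResumeType = ResumeType.SKIP


def from_legacy_string(resume: str) -> ResumeConfigSketch:
    """Translate a legacy "+"-joined resume string into a config object."""
    base, *addons = resume.upper().split("+")
    if base != "AUTO":
        # LOCAL / REMOTE no longer make sense once storage_path is the
        # single source of truth for run data.
        raise ValueError(f"Unsupported resume mode: {base}")

    config = ResumeConfigSketch()
    for addon in addons:
        if addon == "RESTART_ERRORED":
            config.errored = ResumeType.RESTART
        else:
            # Other addons are intentionally left out of this sketch.
            raise ValueError(f"Addon not covered by this sketch: {addon}")
    return config
```

For example, `from_legacy_string("AUTO+RESTART_ERRORED")` yields a config whose `errored` field is `ResumeType.RESTART`, while the other fields keep their (assumed) defaults.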

API Change Summary

- Public API: Tuner.restore(..., _resume_config) is now possible, but this change is backwards compatible, since Tuner.restore(resume_errored, ...) etc. are still possible.
- Public API: tune.run(resume_config) is now the simpler alternative to tune.run(resume), and some legacy resume strings have been hard-deprecated. There is no hard API change here though.
- Private API: TuneController(resume) is now TuneController(resume_config)
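
As a rough illustration of why the boolean flags can stay backwards compatible, the sketch below maps `Tuner.restore`-style flags onto a per-status action and then applies it to a set of trials. It is self-contained and does not call Ray; `config_from_flags`, `plan_trials`, the status strings, and the decision to reject conflicting flags are all assumptions made for this example, not the actual implementation.

```python
from enum import Enum
from typing import Dict


class ResumeType(Enum):
    RESUME = "resume"    # continue from the latest checkpoint
    RESTART = "restart"  # start over with no checkpoint
    SKIP = "skip"        # do not touch the trial


def config_from_flags(
    resume_unfinished: bool = True,
    resume_errored: bool = False,
    restart_errored: bool = False,
) -> Dict[str, ResumeType]:
    """Map boolean restore flags to one action per trial status (hypothetical)."""
    if resume_errored and restart_errored:
        # Rejecting this combination is a choice made for this sketch.
        raise ValueError("resume_errored and restart_errored are mutually exclusive")

    if restart_errored:
        errored = ResumeType.RESTART
    elif resume_errored:
        errored = ResumeType.RESUME
    else:
        errored = ResumeType.SKIP

    return {
        "finished": ResumeType.SKIP,
        "unfinished": ResumeType.RESUME if resume_unfinished else ResumeType.SKIP,
        "errored": errored,
    }


def plan_trials(
    trial_statuses: Dict[str, str], config: Dict[str, ResumeType]
) -> Dict[str, ResumeType]:
    """Look up the action for each trial based on its status."""
    return {name: config[status] for name, status in trial_statuses.items()}
```

With `restart_errored=True`, an errored trial is planned as `RESTART`, an unfinished trial as `RESUME`, and a finished trial is skipped.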

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

python/ray/tune/execution/experiment_state.py

matthewdeng

Overall looks good to me, thanks for the much cleaner API. Main consideration is just if we choose to expose this publicly or keep it internal.

matthewdeng · 2024-02-06T01:40:01Z

python/ray/tune/execution/experiment_state.py

+        Members:
+            RESUME: Resume from the latest checkpoint.
+            RESTART: Restart from the beginning (with no checkpoint).
+            IGNORE: Ignore this trial.


nit: Upon first read I wasn't sure what "ignore" meant, but not sure off the top of my head if there's a clearer word we can use here.

Same feeling for me. SKIP might be better?

Updated to SKIP

woshiyyya

Great simplification with _ResumeConfig API! Left some comments

python/ray/train/_internal/utils.py

woshiyyya · 2024-02-06T07:35:05Z

python/ray/tune/execution/experiment_state.py

woshiyyya · 2024-02-06T07:47:45Z

python/ray/tune/tune.py

@@ -259,7 +304,7 @@ def run(
    max_failures: int = 0,
    fail_fast: bool = False,
    restore: Optional[str] = None,
-    resume: Union[bool, str] = False,
+    resume_config: Optional[_ResumeConfig] = None,


Should we just make it a public API? The users will always use this config but cannot find it on our Ray API doc.

Decision: Keep it as a DeveloperAPI for now, and expose it as a "hidden" experimental argument of Tuner.restore. Ideally, we wouldn't have this, but there is the dependency of Trainer.restore on Tuner.restore for now 😢

python/ray/train/base_trainer.py

woshiyyya

Nice work!

… `storage_path` (ray-project#42853)

Simplify the restoration logic by using the `ResumeConfig` internally to determine how to treat finished, errored, and unfinished trials. There are no more "LOCAL", "REMOTE", or "PROMPT" modes of resuming.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu added 16 commits January 30, 2024 11:25

- ff6f2af: update resume config
- 62ede46: suppress useless warning driveby
- 241a8d7: pipe resume config through
- 71f624e: simplify resume implementation
- dbb3ede: resume config not actually needed
- ba0c70a: update implementation to respect new resume config
- 67c453a: pipe resume config from the entrypoint
- 743ef95: fix param counting to only consider required positional args
- b6a8b81: remove a duplicate train func validity check
- c00f142: add test
- aab6975: fix lint
- 704201a: fix legacy config -> resumeconfig translation
- fc39139: allow new config in tune.run_experiments:
- 6330eda: support deprecated resume param in tune.run
- 700cbe1: fix tune restore tests
- f921bc0: fix lint

justinvyu assigned matthewdeng and woshiyyya Jan 31, 2024

justinvyu requested review from richardliaw, krfricke, xwjiang2010, amogkam, matthewdeng, Yard1, maxpumperla, woshiyyya and a team as code owners January 31, 2024 01:30

justinvyu added 3 commits January 30, 2024 22:50

- 91768a4: Merge branch 'master' of https://github.com/ray-project/ray into rest… …ore_terminated_trainer
- fcfd3e7: Revert "remove a duplicate train func validity check" (This reverts commit b6a8b81.)
- d3a050a: fix required param counting for train fn validity check on driver

justinvyu added 7 commits January 30, 2024 23:09

- fe72616: loop through remaining resume str addons
- a30b7fc: fix error -> terminated unintended trial status change
- 93a47bb: remove deprecated usage in tune+rllib test
- 9ff08f8: fix lint
- 518885c: Merge branch 'master' of https://github.com/ray-project/ray into rest… …ore_terminated_trainer
- a123078: fix ckpting integration test
- a4f6f0e: fix callback integration test

justinvyu commented Feb 3, 2024

python/ray/tune/execution/experiment_state.py

matthewdeng reviewed Feb 6, 2024

woshiyyya reviewed Feb 6, 2024

justinvyu added 11 commits February 13, 2024 10:35

- ae4819c: Merge branch 'master' of https://github.com/ray-project/ray into rest… …ore_terminated_trainer
- 2fb7395: rename _ResumeConfig -> ResumeConfig
- 3d467a3: move ResumeConfig and expose as a developer api
- a2e4359: add experimental notice to resume config docstring
- 0b6f1d9: fix merge conflict
- 32a9444: fix tune.run(resume) docstring
- ec6c1c0: fix circular import
- 72b1194: rename IGNORE -> SKIP everywhere
- a86c5f2: don't deprecate tune.run(resume) for now
- 5b21a07: Merge branch 'master' of https://github.com/ray-project/ray into rest… …ore_terminated_trainer
- 36ef2b2: Hide resume config in the Tuner API for now to leave it experimental

woshiyyya approved these changes Feb 13, 2024

justinvyu merged commit 2a83a67 into ray-project:master Feb 13, 2024
9 checks passed

justinvyu deleted the restore_terminated_trainer branch February 13, 2024 23:49

justinvyu mentioned this pull request Feb 14, 2024

[tune] Fix resume="AUTO" compatibility with the new ResumeConfig implementation #43179

Merged

8 tasks
