Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train+tune] Refactor restoration configuration to be centered around storage_path #42853

Merged
merged 37 commits into from Feb 13, 2024

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Jan 31, 2024

Why are these changes needed?

This PR cleans up restoration logic in Tune.

Context

Previously, we had local_dir and upload_dir as two separate configurations for where to store experiment results. Results would first go to local_dir, then to upload_dir. (Checkpoints were the exception, which had a special codepath to upload directly to cloud.)

Now, the storage_path is the only location that users can expect all their run data to go to. The local directory just serves as a staging ground to eventually copy to storage. This is an implementation detail that ideally doesn't need to exist.

At the moment, it's not actually possible to recover successfully from the "local directory" (set by the RAY_AIR_LOCAL_CACHE_DIR environment variable), when a storage path is set. Checkpoints are uploaded directly to the storage path, so restoring from the local staging directory will result in trials starting from scratch.

Therefore, the resume="REMOTE" and resume="LOCAL" configurations are outdated artifacts of the old (<2.7) checkpointing/storage implementation. resume="AUTO" is now the only codepath, which greatly simplifies the restoration code.

Change Summary

Given this context, these are the main changes from this PR:

  • Remove outdated resume configurations.
  • Revamp resume configuration to use a resume config instead of a string of settings joined by "+". (Ex: AUTO+RESTART_ERRORED --> ResumeConfig(errored=ResumeConfig.ResumeType.RESTART).)
  • Adds a ResumeType.IGNORE setting that can be used to only restart/resume subsets of trials based on their statuses.
  • Enables finished trials to be resumed, which allows for iterative experimentation (training for more epochs the original training finished successfully). This is still not the recommended path (resume_from_checkpoint to start a new experiment is what we recommend), but it is a common user ask to be able to continue the run, while still tracking the top K checkpoints.

API Change Summary

  • Public API: Tuner.restore(..., _resume_config) is now possible, but this change is backwards compatible, since Tuner.restore(resume_errored, ...) etc. are still possible.
  • Public API: tune.run(resume_config) is now the simpler alternative to tune.run(resume), and some legacy resume strings have been hard-deprecated. There is no hard API change here though.
  • Private API: TuneController(resume) is now TuneController(resume_config)

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit b6a8b81.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, thanks for the much cleaner API. Main consideration is just if we choose to expose this publicly or keep it internal.

Members:
RESUME: Resume from the latest checkpoint.
RESTART: Restart from the beginning (with no checkpoint).
IGNORE: Ignore this trial.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Upon first read I wasn't sure what "ignore" meant, but not sure off the top of my head if there's a clearer word we can use here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feeling for me. SKIP might be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to SKIP

Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great simplification with _ResumeConfig API! Left some comments

python/ray/train/_internal/utils.py Show resolved Hide resolved
Members:
RESUME: Resume from the latest checkpoint.
RESTART: Restart from the beginning (with no checkpoint).
IGNORE: Ignore this trial.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feeling for me. SKIP might be better?

python/ray/tune/execution/experiment_state.py Show resolved Hide resolved
@@ -259,7 +304,7 @@ def run(
max_failures: int = 0,
fail_fast: bool = False,
restore: Optional[str] = None,
resume: Union[bool, str] = False,
resume_config: Optional[_ResumeConfig] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just make it a public API? The users will always use this config but cannot find it on our Ray API doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decision: Keep it as a DeveloperAPI for now, and expose it as a "hidden" experimental argument of Tuner.restore. Ideally, we wouldn't have this, but there is the dependency of Trainer.restore on Tuner.restore for now 😢

python/ray/train/base_trainer.py Outdated Show resolved Hide resolved
…ore_terminated_trainer

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@justinvyu justinvyu merged commit 2a83a67 into ray-project:master Feb 13, 2024
9 checks passed
@justinvyu justinvyu deleted the restore_terminated_trainer branch February 13, 2024 23:49
kevin85421 pushed a commit to kevin85421/ray that referenced this pull request Feb 17, 2024
… `storage_path` (ray-project#42853)

Simplify the restoration logic by using the `ResumeConfig` internally to determine how to treat finished, errored, and unfinished trials. There are no more "LOCAL", "REMOTE, or "PROMPT" modes of resuming.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants