
[train+tune] Local directory refactor (1/n): Write launcher state files (tuner.pkl, trainer.pkl) directly to storage #43369

Merged

justinvyu merged 32 commits into ray-project:master from justinvyu:upload_pkl_directly on Mar 1, 2024

Conversation

@justinvyu (Contributor) commented Feb 23, 2024

Why are these changes needed?

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload it.

This removes the dependency of these pkl files on `_get_defaults_results_dir`, the directory that defaults to `~/ray_results` and can be overridden by environment variables.
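To illustrate the idea, here is a minimal sketch of writing launcher state straight to storage, assuming a pyarrow-compatible `storage_path` (the helper name `save_launcher_state` is hypothetical, not the PR's actual code):

```python
import pickle

import pyarrow.fs


def save_launcher_state(state: object, storage_path: str, filename: str) -> None:
    """Pickle a launcher state object (e.g. for tuner.pkl) directly to storage."""
    # Resolve the filesystem and path from the URI: local dir, s3://, gs://, etc.
    fs, fs_path = pyarrow.fs.FileSystem.from_uri(storage_path)
    fs.create_dir(fs_path, recursive=True)
    # Write straight to storage -- no ~/ray_results staging copy, no driver sync.
    with fs.open_output_stream(f"{fs_path}/{filename}") as f:
        f.write(pickle.dumps(state))


# Usage sketch:
# save_launcher_state(tuner_state, "s3://my-bucket/ray_results/my_run", "tuner.pkl")
```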

TODOs for follow-up PRs

There's currently an issue where multiple runs with the same `RunConfig(name)` can produce conflicting versions of `tuner.pkl`. This can lead to a bug where the files needed to restore a run can be overwritten by a mismatched version. See #43369 (comment) for more details.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Commits (each signed off by Justin Yu <justinvyu@anyscale.com>):

  • update trainer._save usage in test
  • …entrypoints
Code excerpt under review:

self._run_config.name
or StorageContext.get_experiment_dir_name(self.converted_trainable)
)
storage = StorageContext(
Contributor commented:
Should we be trying to instantiate this StorageContext only in one place and then pass it around?

@justinvyu (Contributor, Author) replied Feb 26, 2024:
This is hard to do, since we'd need to pass the context through from `BaseTrainer.fit` to the public `Tuner` interface.

One alternative we discussed is having a global storage context, but that'd also require `get_or_create_storage_context` logic to account for users coming in through any of the 3 entrypoints (`Tuner`, `Trainer`, `tune.run`).

I will clarify that the usage here is just to access the path and the filesystem, so that we don't re-implement that logic.
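For reference, a hypothetical sketch of that global-context alternative (illustrative only; not what this PR implements, and the `StorageContext` import path is an internal API that may vary by Ray version):

```python
from typing import Optional

from ray.train._internal.storage import StorageContext  # internal API

_storage_context: Optional[StorageContext] = None


def get_or_create_storage_context(*args, **kwargs) -> StorageContext:
    # Lazily build a single shared context so all three entrypoints
    # (Tuner, Trainer, tune.run) resolve the experiment path/filesystem once.
    global _storage_context
    if _storage_context is None:
        _storage_context = StorageContext(*args, **kwargs)
    return _storage_context
```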

Resolved thread: python/ray/tune/impl/tuner_internal.py
Further commits (same sign-off):

  • fix lint
  • …pickled one is instead)
@@ -713,7 +713,7 @@ def create_trainable_with_params():
 )
 return trainable_with_params

-exp_name = "restore_with_params"
+exp_name = f"restore_with_params-{use_function_trainable=}"
@justinvyu (Contributor, Author) commented:
This change was required to avoid a bug where:

  • Both tests would use the same experiment name and save things to the same driver staging dir.
  • tuner.pkl gets downloaded to the staging dir when the test first runs with use_function_trainable=True.
  • Then, when the second run happens, it writes the correct tuner.pkl to storage, but the subsequent driver sync re-uploads the stale tuner.pkl that was downloaded earlier.
  • This caused an error when restoring the second run.

This problem gets fixed in the follow-up PR by using unique staging directories for each new Ray Train experiment (sketched below).
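A minimal sketch of what such unique staging directories could look like (the helper name and timestamp format are assumptions, not the follow-up PR's actual code):

```python
import time
from pathlib import Path


def get_staging_dir(base_dir: Path, experiment_name: str) -> Path:
    # Suffix the staging dir with a launch timestamp so a second run with the
    # same experiment name never reuses (and re-uploads) a stale tuner.pkl.
    timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
    staging_dir = base_dir / f"{experiment_name}_{timestamp}"
    staging_dir.mkdir(parents=True, exist_ok=True)
    return staging_dir
```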

Member commented:
Users should never run Tune twice in the same job? I think this will no longer be a problem after your second PR, which separates driver staging directories with a timestamp.

@woshiyyya (Member) left a review:
Looks good to me. Left some minor comments.

@@ -42,14 +42,6 @@
 _SELF = "self"


-_TUNER_FAILED_MSG = (
Member commented:
Why do we no longer send this failure message?

@justinvyu (Contributor, Author) replied:
This only caught `TuneError`, which only gets raised from `tune.run(raise_on_failed_trial=True)`. However, the `Tuner` always passes `False` for this flag, so this error message never actually gets printed:

raise_on_failed_trial=False,
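A self-contained illustration of why that branch was dead code (simplified stand-ins, not Ray's actual source):

```python
class TuneError(Exception):
    pass


def run(raise_on_failed_trial: bool = True) -> None:
    trial_failed = True  # pretend a trial failed
    if trial_failed and raise_on_failed_trial:
        raise TuneError("Trials did not complete")


def tuner_fit() -> None:
    try:
        # The Tuner hard-codes raise_on_failed_trial=False ...
        run(raise_on_failed_trial=False)
    except TuneError:
        # ... so this branch, which printed _TUNER_FAILED_MSG, is unreachable.
        print("_TUNER_FAILED_MSG")


tuner_fit()  # completes without ever printing the failure message
```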


@@ -302,7 +302,6 @@ def test_run_config_in_trainer_and_tuner(
 assert not (tmp_path / "trainer").exists()
 assert both_msg not in caplog.text
 else:
-assert tuner._local_tuner.get_run_config() == RunConfig()
Member commented:
Is it because we inject a default storage context into the run config if not specified?

@justinvyu (Contributor, Author) replied:
This is because I set `run_config.name` in the trainer and then pass that along to the tuner, so the tuner's run config is no longer the default `RunConfig`. This ensures we don't generate the experiment name multiple times, which could lead to different folders being used by the trainer vs. the tuner (see the sketch below).
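A simplified sketch of that flow (the function name is illustrative; `StorageContext.get_experiment_dir_name` is the generator referenced in the code excerpt earlier in the review, and the import path is an internal API):

```python
from ray.train import RunConfig
from ray.train._internal.storage import StorageContext  # internal API


def resolve_run_config(run_config: RunConfig, trainable) -> RunConfig:
    # Generate the experiment name exactly once, on the trainer side, so the
    # tuner receives the same name and both write to one experiment folder.
    if not run_config.name:
        run_config.name = StorageContext.get_experiment_dir_name(trainable)
    return run_config
```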

@justinvyu justinvyu merged commit 58edaa4 into ray-project:master Mar 1, 2024
9 checks passed
@justinvyu justinvyu deleted the upload_pkl_directly branch March 1, 2024 19:12
hebiao064 pushed a commit to hebiao064/ray that referenced this pull request Mar 12, 2024
…es (`tuner.pkl`, `trainer.pkl`) directly to storage (ray-project#43369)

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload it.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>