[train+tune] Local directory refactor (2/n): Separate driver artifacts and trial working directories #43403

justinvyu · 2024-02-23T21:45:01Z

Change summary

Behavior changes:
- Rebrands the local_dir concept as a driver staging directory and moves its default location from ~/ray_results to a subfolder in the ray temp directory (/tmp/ray/session_*/artifacts).
  - Delegates the customization of this folder to ray.init(_temp_dir=...).
  - Increases the visibility of this staging directory so that users can find and delete it if they want to. Ray already logs things to this directory, so it's a natural directory for users to interact with.
- Provides trial actors (Tune) and distributed training workers (Train) an empty working directory that will only be populated by worker artifacts.
  - This working directory is the storage.trial_working_directory and will only be synced if sync_artifacts=True.
  - This is the same behavior as before, with the added benefit that driver syncing no longer unintentionally uploads worker artifacts.
- Simplifies the default storage_path resolution.
  - storage_path is now resolved immediately on RunConfig initialization, rather than downstream during tune.run. It gets set to the Ray storage URI if that is set up; otherwise, it defaults to ~/ray_results.
  - Prior to this PR, ~/ray_results was already the default location of storage_path. The difference now is that ~/ray_results will no longer be populated if storage_path is set to something else.
- Syncing from the staging directory to the storage path is now the only codepath.
  - There is no more syncer=None special case when local_dir == storage_path.
  - This means that single-node runs do some unnecessary file copying, but this is an acceptable trade-off for simpler codepaths.
- Experiment checkpoint saving and uploading now happens synchronously.
  - Previously, this happened in 2 steps: (1) the files were saved to local_dir, then (2) they were uploaded to storage_path in a background task (this is "driver syncing").
  - The frequency of the second step was gated by SyncConfig(sync_period), which is 5 minutes by default. Now that the local directory only contains driver artifacts, the upload is cheap enough to run synchronously right after saving.
  - The benefit of this is better consistency of the experiment state in storage_path.
  - I also removed the forced syncing behavior of num_to_keep now that uploading driver files is no longer gated by the 5 minute sync_period default. That workaround was intended to increase driver checkpointing frequency, which is also achieved by this change.
  - See 1 and 2 for context on the motivation for this change.
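
The storage_path resolution order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Ray's implementation: `SketchRunConfig` and its `ray_storage_uri` field are hypothetical stand-ins for RunConfig and the Ray storage URI. Only the resolution order and the ~/ray_results fallback come from this PR.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Fallback location named in the PR description.
DEFAULT_STORAGE_PATH = os.path.expanduser("~/ray_results")


@dataclass
class SketchRunConfig:
    """Hypothetical stand-in for RunConfig; not the real Ray class."""

    name: Optional[str] = None
    storage_path: Optional[str] = None
    # Assumption: the Ray storage URI is passed in explicitly for this sketch.
    ray_storage_uri: Optional[str] = None

    def __post_init__(self) -> None:
        # Resolve at construction time instead of later inside tune.run.
        if self.storage_path is None:
            self.storage_path = self.ray_storage_uri or DEFAULT_STORAGE_PATH


explicit = SketchRunConfig(storage_path="s3://bucket/experiments")
from_uri = SketchRunConfig(ray_storage_uri="s3://cluster-storage")
fallback = SketchRunConfig()
```

Because the value is fixed at construction, any code that receives the config can read `storage_path` directly without repeating the fallback logic.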
Some undesirable workarounds made in this PR:
- Adding a unique timestamp to the path of the driver staging directory so that different experiments with the same RunConfig(name) don't have conflicting staging directories.
  - This was the issue that needed to be solved: [train+tune] Local directory refactor (1/n): Write launcher state files (tuner.pkl, trainer.pkl) directly to storage #43369 (comment)
  - Now, StorageContext gets a property of the current timestamp upon creation (which happens on the driver once).
  - The storage context gets propagated to the remote trial actors and shares the same timestamp.
  - On restoration, a new storage context gets created on the driver, and its timestamp differs from the timestamp of the storage context on the restored trial.
  - We work around this by updating the timestamp of the driver's storage context with the restored timestamp.
  - A restored run will use the same staging directory as the original run. A new run will always use a new timestamp.
- The /tmp/ray/session_* folder is only available after ray.init() has been called. I included the ray start fixture in many unit tests to get around this. mock_storage_context patches out this directory to a tempdir.
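
The timestamp workaround above can be sketched roughly as follows. `SketchStorageContext`, the `session_example` folder name, and the directory layout below the artifacts folder are assumptions made for illustration; only the idea of a timestamp set once on the driver and reused on restoration comes from this PR.

```python
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path


def _now() -> str:
    return datetime.now().strftime("%Y-%m-%d_%H-%M-%S")


@dataclass
class SketchStorageContext:
    """Hypothetical stand-in for StorageContext; the layout is assumed."""

    session_artifacts_dir: Path
    experiment_name: str
    # Set once, when the context is created on the driver.
    timestamp: str = field(default_factory=_now)

    @property
    def driver_staging_dir(self) -> Path:
        return self.session_artifacts_dir / self.timestamp / self.experiment_name


# Placeholder session folder; real session names are generated by Ray.
artifacts = Path("/tmp/ray/session_example/artifacts")

# Original run: the same context (and timestamp) is propagated to the trials.
original = SketchStorageContext(artifacts, "my_exp", timestamp="2024-03-05_00-00-00")
trial_storage = original

# Restoration: a fresh driver context starts with a different timestamp...
restored = SketchStorageContext(artifacts, "my_exp", timestamp="2024-03-05_01-00-00")
dirs_differ_before_fix = restored.driver_staging_dir != original.driver_staging_dir

# ...so the driver adopts the timestamp carried by the restored trial.
restored.timestamp = trial_storage.timestamp
```

After the last line, the restored driver and the original run point at the same staging directory, while a brand new run would keep its own fresh timestamp.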
Remaining TODOs for follow-ups:
- A hanging/failing driver file upload will currently block/fail the tune control loop, even though all trials may be running fine. This is a regression compared to the previous behavior of just logging a warning and retrying the upload. This will be fixed in [train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting #43689.
- Possibly re-introduce a way to modify the Ray Train staging directory without needing to move the entire Ray temp dir.
- Clean up Trial and Experiment path properties once and for all.
- Remove usage/mentions of RAY_AIR_LOCAL_CACHE_DIR everywhere.
- Docs updates:
  - Tune relative path FAQ section. ray.train.get_context().get_local_dir doesn't really make sense anymore.
  - Update the persistent storage user guide.

Context / motivation for this change

What are the main user problems to solve?

Problem 1: ~/ray_results gets populated and it’s unclear how this thing is configured and how it can be disabled.
- train.report() erases contents of files when acquiring continue lock #42630
- [Tune/Air] Setting a custom storage_path in air.RunConfig creates duplicates in default and specified storage directory #40009
- https://ray-distributed.slack.com/archives/CSX7HVB5L/p1698324789112559
- https://ray-distributed.slack.com/archives/CNECXMW22/p1694712269714009
- Problem 1 gets solved by moving the contents that used to land in ~/ray_results into the ray session dir, which already gets populated and is easily accessible / configurable.
Problem 2: Implicit uploading of CWD logs from driver syncing causes slow syncing and unintended files getting uploaded.
- [train] Logs from frameworks (lightning_logs, wandb, transformers output_dir) in the working directory can be synced unintentionally #40634
- Problem 2 gets solved by separating the driver artifact folders and the trial working directories into different folders.
Problem 3: If you keep using the same RunConfig(name) and run multiple experiments, each consecutive experiment will upload ALL “trial directories” so far from the local ~/ray_results.
- Ex: exp 1 has 1 trial folder, exp 2 has 2 (the current one plus the previous one), etc.
- [Train] Ray Train sync all previous trial directories from local to remote storage #38522
- Problem 3 gets solved by using the latest ray session dir, so that every run has a blank slate rather than accumulating in a single ~/ray_results/<name> folder.
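
As a rough illustration of how separating the folders addresses Problem 2, the sketch below uses local directories in place of real storage. The folder names (`driver_artifacts`, `working_dirs`, `trial_0`) and the helpers `sync_dir` and `run_sketch` are hypothetical; the only behavior taken from this PR is that driver syncing covers driver artifacts only, and worker artifacts are uploaded only when sync_artifacts is enabled.

```python
import shutil
import tempfile
from pathlib import Path


def sync_dir(src: Path, dst: Path) -> None:
    """Copy src into dst; a local stand-in for uploading to storage_path."""
    shutil.copytree(src, dst, dirs_exist_ok=True)


def run_sketch(sync_artifacts: bool) -> set:
    root = Path(tempfile.mkdtemp())
    staging = root / "staging" / "driver_artifacts"
    working = root / "staging" / "working_dirs" / "trial_0"
    storage = root / "storage" / "my_exp"
    for directory in (staging, working, storage):
        directory.mkdir(parents=True)

    # The driver writes experiment state; the worker writes framework logs.
    (staging / "experiment_state.json").write_text("{}")
    (working / "lightning_logs.txt").write_text("worker output")

    # Driver syncing only ever sees the driver staging directory.
    sync_dir(staging, storage)
    # Worker artifacts are uploaded only when explicitly requested.
    if sync_artifacts:
        sync_dir(working, storage / "trial_0")

    uploaded = {
        p.relative_to(storage).as_posix() for p in storage.rglob("*") if p.is_file()
    }
    shutil.rmtree(root)
    return uploaded
```

Calling `run_sketch(False)` uploads only the driver's experiment state, while `run_sketch(True)` also uploads the worker's log file under the trial folder.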

Related issue number

Closes #40634
Closes #40009
Closes #38522

#42630

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

woshiyyya

Nice to see the tests become green!

As this is a large PR that changes a bunch of behaviors and tests, can you summarize the following things in the PR description?

Behavior changes
- changes in the way of driver syncing (Directly write to storage path)
- how to separate trial and driver artifact directory
- changes in the default storage path (ray_storage_uri)
- ...
The compromises we've made
- The bypassed unit tests
- sync_artifacts
- ...
The TODOs
- doc and faqs to update

Also, before we merge the PR, it'd be good to run a few release tests to ensure it also works in a multi-node setting.

woshiyyya · 2024-03-05T01:30:33Z

python/ray/train/_internal/storage.py

+        # Timestamp is used to create a unique session directory for the current
+        # training job. This is used to avoid conflicts when multiple training jobs
+        # run with the same name in the same cluster.
+        # This is set ONCE at the creation of the storage context, on the driver.


Can you help me remember how we resolved the consistency issue with the timestamp? Did we resolve it by writing files into the driver directory and using background syncing?

See the first bullet point in the PR description under "Some undesirable workarounds made in this PR."

justinvyu · 2024-03-05T07:58:28Z

python/ray/tune/execution/tune_controller.py

+        # NOTE: The restored run should reuse the same driver staging directory.
+        self._storage._timestamp = trials[0].storage._timestamp


Workaround to make sure that a restored run uses the same timestamped staging dir.

justinvyu · 2024-03-05T08:31:21Z

train_multinode_persistence release test passed here: https://buildkite.com/ray-project/release/builds/10131#018e0d55-eb73-42aa-abe9-e3ba603a61f5

I removed the tune_cloud_durable_upload test because it's basically a duplicate of the test above, but also relies on "local directory" implementation details. I need to add back a tune multi-node persistence test as a follow-up.

justinvyu added 30 commits February 21, 2024 14:58

All non-merge commits are signed off by Justin Yu <justinvyu@anyscale.com>.

- c157ad9 add util for getting ray train session tmp dir
- 7891989 Merge branch 'master' of https://github.com/ray-project/ray into separate_driver_and_trial_artifacts
- ad3b136 remove storage local path and introduce driver staging + working dirs
- d8b4800 update trial chdir to use trial_working_dir
- 614fefa rename experiment_local_path -> experiment_local_staging_path
- 08d9892 rename trial_local_path -> trial_local_staging_path
- dd1f202 fix incorrect worker artifact sync dir
- aab939e update syncer = None codepaths
- f981b6a fix test_storage
- 7d45920 fix cwd assert to use resolved path in test
- 41ef191 storage_path default = ~/ray_results
- bf323fa upload trainer pkl directly (update trainer._save usage in test)
- dee3249 upload tuner pkl directly
- bdd58b3 revert storage path default
- 0bd1a58 fix optional storage path dependencies for now
- 24a9fc0 remove todo
- 90be9ca small correction...
- 197a29e Merge branch 'upload_pkl_directly' into separate_driver_and_trial_artifacts
- c18384a remove ipdb
- 4978c45 Merge branch 'upload_pkl_directly' into separate_driver_and_trial_artifacts
- f55ace2 remove some hacks in test
- 75ef6bd upload exp state (with trial states) directly to cloud instead of waiting for driver sync
- cab5e12 use converted trainable in tuner entrypoint
- 09e0273 use non-optional run config
- c0d0ba0 remove local restoration test
- 011ac92 keep base trainer and tuner exp dir name resolution consistent
- c3c03ac add test case for restoration with default RunConfig(name)
- 25cccb5 Merge branch 'master' of https://github.com/ray-project/ray into upload_pkl_directly
- 0a0e37c Merge branch 'upload_pkl_directly' into separate_driver_and_trial_artifacts
- 2577b91 storage path = ~/ray_results by default

justinvyu added 17 commits February 29, 2024 18:47

- a4a51f4 fix test_var
- a1543e0 fix pytest.skip -> pytest.mark.skip
- af3fd67 skip some tests
- 69dcd35 revert test_training_iterator
- 23277eb fix tutorial
- 24a779a fix test_trial + remove delete_syncer option in utility
- 87aa000 only pull error from storage
- 5154421 Merge branch 'master' of https://github.com/ray-project/ray into separate_driver_and_trial_artifacts
- a6fde03 Merge branch 'master' of https://github.com/ray-project/ray into separate_driver_and_trial_artifacts
- e0125fa fix merge error
- 9260fa9 move ray storage handling to RunConfig
- bd2f36a improve docstrings + mark storage context as developer api
- 417bd43 improve storage path docstring
- 3e59baa add storage_fs docstring
- 3b0b96a rename
- ad0a63f fix logdir -> trial working dir
- 64126ba fix lint

woshiyyya reviewed Mar 5, 2024

woshiyyya approved these changes Mar 5, 2024

justinvyu commented Mar 5, 2024

justinvyu added 2 commits March 5, 2024 00:00

- e3991e7 Merge branch 'master' of https://github.com/ray-project/ray into separate_driver_and_trial_artifacts
- 83733dc remove tune cloud durable

justinvyu merged commit 94bbf99 into ray-project:master Mar 5, 2024
9 checks passed

justinvyu deleted the separate_driver_and_trial_artifacts branch March 5, 2024 18:17

This was referenced Mar 5, 2024

[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting #43689

Merged

[no_ci][WIP][train/tune] Prototype removing local_dir #41041

Closed

[tune] Fix flaky test_controller_checkpointing_integration test suite #43880

Merged

justinvyu mentioned this pull request Mar 25, 2024

[train+tune][doc] Remove docs sections recommending RAY_AIR_LOCAL_CACHE_DIR #44284

Merged
