[train+tune] Local directory refactor (2/n): Separate driver artifacts and trial working directories #43403

Merged

Commits (126)
c157ad9
add util for getting ray train session tmp dir
justinvyu Feb 22, 2024
7891989
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 22, 2024
ad3b136
remove storage local path and introduce driver staging + working dirs
justinvyu Feb 22, 2024
d8b4800
update trial chdir to use trial_working_dir
justinvyu Feb 22, 2024
614fefa
rename experiment_local_path -> experiment_local_staging_path
justinvyu Feb 22, 2024
08d9892
rename trial_local_path -> trial_local_staging_path
justinvyu Feb 22, 2024
dd1f202
fix incorrect worker artifact sync dir
justinvyu Feb 22, 2024
aab939e
update syncer = None codepaths
justinvyu Feb 22, 2024
f981b6a
fix test_storage
justinvyu Feb 22, 2024
7d45920
fix cwd assert to use resolved path in test
justinvyu Feb 22, 2024
41ef191
storage_path default = ~/ray_results
justinvyu Feb 22, 2024
bf323fa
upload trainer pkl directly
justinvyu Feb 22, 2024
dee3249
upload tuner pkl directly
justinvyu Feb 22, 2024
bdd58b3
revert storage path default
justinvyu Feb 22, 2024
0bd1a58
fix optional storage path dependencies for now
justinvyu Feb 22, 2024
24a9fc0
remove todo
justinvyu Feb 22, 2024
90be9ca
small correction...
justinvyu Feb 23, 2024
197a29e
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
c18384a
remove ipdb
justinvyu Feb 23, 2024
4978c45
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
f55ace2
remove some hacks in test
justinvyu Feb 23, 2024
75ef6bd
upload exp state (with trial states) directly to cloud instead of wai…
justinvyu Feb 23, 2024
cab5e12
use converted trainable in tuner entrypoint
justinvyu Feb 23, 2024
09e0273
use non-optional run config
justinvyu Feb 23, 2024
c0d0ba0
remove local restoration test
justinvyu Feb 23, 2024
011ac92
keep base trainer and tuner exp dir name resolution consistent
justinvyu Feb 23, 2024
c3c03ac
add test case for restoration with default RunConfig(name)
justinvyu Feb 23, 2024
25cccb5
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 23, 2024
0a0e37c
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
2577b91
storage path = ~/ray_results by default
justinvyu Feb 23, 2024
46cbf02
override storage path with ray storage if set
justinvyu Feb 23, 2024
343224b
Fix lint
justinvyu Feb 23, 2024
b367413
centralize on storage context for path handling in the tuner/trainer …
justinvyu Feb 23, 2024
417a7da
fix errors caused by syncing being enabled to the same dir
justinvyu Feb 23, 2024
6b10a13
key concepts small fix
justinvyu Feb 23, 2024
c218937
separate exp folders for doc code
justinvyu Feb 23, 2024
805757b
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 23, 2024
231d637
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
063d3b2
no need to copy sync_config anymore
justinvyu Feb 23, 2024
f46dd84
different way to do default
justinvyu Feb 23, 2024
21dcd4c
fix test_new_persistence
justinvyu Feb 23, 2024
178d8ca
fix test_tuner_restore
justinvyu Feb 23, 2024
c1357e4
fix lint
justinvyu Feb 23, 2024
55e069a
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 26, 2024
39eaf81
fix trainer._save test usage
justinvyu Feb 26, 2024
75b80a6
use unique exp names for test
justinvyu Feb 27, 2024
876aad2
fix run config validation test
justinvyu Feb 27, 2024
85c2acd
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 27, 2024
1c4dfed
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 27, 2024
d509ff0
set storage path by default in tune.run path as well
justinvyu Feb 27, 2024
7d92e00
no need to test storage_path=None in unit test
justinvyu Feb 27, 2024
42d1378
fix circular import for default storage path const
justinvyu Feb 27, 2024
af68fe1
don't allow accessing ray train session dir outside of ray session
justinvyu Feb 27, 2024
676d5a3
fix test_session
justinvyu Feb 27, 2024
091f3bc
make session dir helper a private fn
justinvyu Feb 27, 2024
a411863
fix storage docstring example
justinvyu Feb 27, 2024
dafc828
[remove RAY_AIR_LOCAL_CACHE_DIR] test_storage
justinvyu Feb 27, 2024
fa61411
[remove RAY_AIR_LOCAL_CACHE_DIR] test_actor_reuse + test_result
justinvyu Feb 27, 2024
96c9a67
[remove RAY_AIR_LOCAL_CACHE_DIR] legacy local_dir test
justinvyu Feb 27, 2024
825cdb6
[remove RAY_AIR_LOCAL_CACHE_DIR] test_tune
justinvyu Feb 27, 2024
3727e3a
[remove RAY_AIR_LOCAL_CACHE_DIR] test_exp_analysis + test_api
justinvyu Feb 27, 2024
fce8a0a
add test for customizing train session dir
justinvyu Feb 27, 2024
b3610de
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 27, 2024
df25a79
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 27, 2024
43d53a0
remove tuner try catch that relies on get_exp_ckpt_dir
justinvyu Feb 27, 2024
533c807
add comment about storage context at the top entrypoint layers
justinvyu Feb 27, 2024
185f57c
fix bug where new storage filesystem is not used on restoration (the …
justinvyu Feb 27, 2024
8f63111
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 27, 2024
baeea52
pass in param space again on restore
justinvyu Feb 27, 2024
6f5277a
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 28, 2024
cdeaa44
fix test_errors
justinvyu Feb 28, 2024
d309990
add timestamp to ray train session dir so that each ray train job get…
justinvyu Feb 28, 2024
e4cf791
dump file and upload entire driver staging dir on exp checkpointing
justinvyu Feb 28, 2024
8fca047
fetch errors from storage and fallback to local staging
justinvyu Feb 28, 2024
559191d
fix trial working dir for class trainables
justinvyu Feb 28, 2024
c96c440
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 28, 2024
4324b50
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 28, 2024
bdd4864
fix can_restore test + remove unused monkeypatch
justinvyu Feb 28, 2024
13aabf4
fix lint
justinvyu Feb 28, 2024
38cfb0f
move find newest exp ckpt logic to exp state manager file
justinvyu Feb 29, 2024
0cfaa90
read exp state directly from storage + restore controller state befor…
justinvyu Feb 29, 2024
439b409
fix test_tuner
justinvyu Feb 29, 2024
560adbc
remove exp state manager resume
justinvyu Feb 29, 2024
1e58f46
remove env var usage from test_tune
justinvyu Feb 29, 2024
8c113b9
remove local_dir legacy test
justinvyu Feb 29, 2024
2ebad79
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 29, 2024
4f03d7f
only fetch error file if the trial error file has been set
justinvyu Feb 29, 2024
f2c0f38
remove arbitrary length check on repr
justinvyu Feb 29, 2024
9bbede6
fix mlflow test
justinvyu Feb 29, 2024
dbf58aa
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
30bdc39
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
1573096
fix test_training_iterator
justinvyu Feb 29, 2024
c3076d9
fix test_tuner_resume_errored_only
justinvyu Feb 29, 2024
a51358f
ray init in test_function_api
justinvyu Feb 29, 2024
abd117f
fix test_api_ckpt_integration
justinvyu Feb 29, 2024
e7d8f9d
make driver staging dir during exp checkpoint if not created yet
justinvyu Feb 29, 2024
03d7ef7
fix a bunch of tests
justinvyu Feb 29, 2024
6cc5c75
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
8a3430e
add test for customizing train session dir
justinvyu Feb 27, 2024
85ec91c
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
b1551c5
fix some tests in test_api
justinvyu Mar 1, 2024
6b98726
Merge branch 'cleanup_air_cache_dir' into separate_driver_and_trial_a…
justinvyu Mar 1, 2024
6007618
fix test_actor_caching
justinvyu Mar 1, 2024
182d069
fix test_actor_reuse
justinvyu Mar 1, 2024
3cb5fc3
patch call to get ray train session dir in mock storage context
justinvyu Mar 1, 2024
af8bf05
remove unneeded ray.init
justinvyu Mar 1, 2024
815cc0b
fix test_run_experiment
justinvyu Mar 1, 2024
a4a51f4
fix test_var
justinvyu Mar 1, 2024
a1543e0
fix pytest.skip -> pytest.mark.skip
justinvyu Mar 1, 2024
af3fd67
skip some tests
justinvyu Mar 1, 2024
69dcd35
revert test_training_iterator
justinvyu Mar 1, 2024
23277eb
fix tutorial
justinvyu Mar 1, 2024
24a779a
fix test_trial + remove delete_syncer option in utility
justinvyu Mar 1, 2024
87aa000
only pull error from storage
justinvyu Mar 1, 2024
5154421
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
a6fde03
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
e0125fa
fix merge error
justinvyu Mar 1, 2024
9260fa9
move ray storage handling to RunConfig
justinvyu Mar 1, 2024
bd2f36a
improve docstrings + mark storage context as developer api
justinvyu Mar 1, 2024
417bd43
improve storage path docstring
justinvyu Mar 1, 2024
3e59baa
add storage_fs docstring
justinvyu Mar 1, 2024
3b0b96a
rename
justinvyu Mar 1, 2024
ad0a63f
fix logdir -> trial working dir
justinvyu Mar 1, 2024
64126ba
fix lint
justinvyu Mar 1, 2024
e3991e7
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 5, 2024
83733dc
remove tune cloud durable
justinvyu Mar 5, 2024
Files changed
26 changes: 22 additions & 4 deletions python/ray/air/config.py
@@ -16,6 +16,7 @@
 
 import pyarrow.fs
 
+from ray._private.storage import _get_storage_uri
 from ray._private.thirdparty.tabulate.tabulate import tabulate
 from ray.util.annotations import PublicAPI, Deprecated
 from ray.widgets import Template, make_table_html_repr
@@ -581,10 +582,14 @@ class RunConfig:
     Args:
         name: Name of the trial or experiment. If not provided, will be deduced
             from the Trainable.
-        storage_path: [Beta] Path to store results at. Can be a local directory or
-            a destination on cloud storage. If Ray storage is set up,
-            defaults to the storage location. Otherwise, this defaults to
-            the local ``~/ray_results`` directory.
+        storage_path: [Beta] Path where all results and checkpoints are persisted.
+            Can be a local directory or a destination on cloud storage.
+            For multi-node training/tuning runs, this must be set to a
+            shared storage location (e.g., S3, NFS).
+            This defaults to the local ``~/ray_results`` directory.
+        storage_filesystem: [Beta] A custom filesystem to use for storage.
+            If this is provided, `storage_path` should be a path with its
+            prefix stripped (e.g., `s3://bucket/path` -> `bucket/path`).
         failure_config: Failure mode configuration.
         checkpoint_config: Checkpointing configuration.
         sync_config: Configuration object for syncing. See train.SyncConfig.
@@ -634,8 +639,21 @@ class RunConfig:
 
     def __post_init__(self):
         from ray.train import SyncConfig
+        from ray.train.constants import DEFAULT_STORAGE_PATH
         from ray.tune.experimental.output import AirVerbosity, get_air_verbosity
 
+        if self.storage_path is None:
+            self.storage_path = DEFAULT_STORAGE_PATH
+
+            # If no remote path is set, try to get Ray Storage URI
+            ray_storage_uri: Optional[str] = _get_storage_uri()
+            if ray_storage_uri is not None:
+                logger.info(
+                    "Using configured Ray Storage URI as the `storage_path`: "
+                    f"{ray_storage_uri}"
+                )
+                self.storage_path = ray_storage_uri
+
         if not self.failure_config:
             self.failure_config = FailureConfig()
 
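For illustration, a minimal sketch of the `storage_path` / `storage_filesystem` behavior documented above (the bucket name and region are placeholders, not taken from this PR):

import pyarrow.fs
from ray.train import RunConfig

# Default: results and checkpoints are persisted under ~/ray_results.
run_config = RunConfig(name="my_experiment")

# Multi-node runs must point at shared storage such as S3 or NFS.
run_config = RunConfig(name="my_experiment", storage_path="s3://my-bucket/results")

# With a custom filesystem, strip the scheme prefix from storage_path.
run_config = RunConfig(
    name="my_experiment",
    storage_filesystem=pyarrow.fs.S3FileSystem(region="us-west-2"),
    storage_path="my-bucket/results",  # note: no "s3://" prefix
)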
10 changes: 5 additions & 5 deletions python/ray/air/integrations/mlflow.py
@@ -3,7 +3,6 @@
 from typing import Dict, Optional, Union
 
 import ray
-from ray.air import session
 from ray.air._internal.mlflow import _MLflowLoggerUtil
 from ray.air._internal import usage as air_usage
 from ray.air.constants import TRAINING_ITERATION
@@ -148,13 +147,14 @@ def train_fn(config):
         )
 
     try:
+        train_context = ray.train.get_context()
+
         # Do a try-catch here if we are not in a train session
-        _session = session._get_session(warn=False)
-        if _session and rank_zero_only and session.get_world_rank() != 0:
+        if rank_zero_only and train_context.get_world_rank() != 0:
             return _NoopModule()
 
-        default_trial_id = session.get_trial_id()
-        default_trial_name = session.get_trial_name()
+        default_trial_id = train_context.get_trial_id()
+        default_trial_name = train_context.get_trial_name()
 
     except RuntimeError:
         default_trial_id = None
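For context, a hedged sketch of how `setup_mlflow` is typically called inside a training function after this change (the experiment name is a placeholder):

from ray.air.integrations.mlflow import setup_mlflow

def train_fn(config):
    # setup_mlflow now reads the trial id/name and world rank from
    # ray.train.get_context(); non-rank-0 workers receive a no-op module
    # when rank_zero_only=True.
    mlflow = setup_mlflow(
        config,
        experiment_name="my_experiment",
        rank_zero_only=True,
    )
    mlflow.log_params(config)
    # ... training loop; call mlflow.log_metrics(...) per iteration ...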
3 changes: 1 addition & 2 deletions python/ray/air/tests/test_configs.py
@@ -34,8 +34,7 @@ def test_repr(config):
 
 def test_storage_filesystem_repr():
     config = RunConfig(storage_filesystem=pyarrow.fs.S3FileSystem())
-    representation = repr(config)
-    assert len(representation) < MAX_REPR_LENGTH
+    repr(config)
 
 
 def test_failure_config_init():
28 changes: 8 additions & 20 deletions python/ray/air/tests/test_integration_mlflow.py
@@ -3,13 +3,11 @@
 import tempfile
 import unittest
 from collections import namedtuple
-from unittest.mock import patch
+from unittest.mock import patch, MagicMock
 
 from mlflow.tracking import MlflowClient
 
 from ray._private.dict import flatten_dict
-from ray.train._internal.session import init_session, shutdown_session
-from ray.train._internal.storage import StorageContext
 from ray.air.integrations.mlflow import MLflowLoggerCallback, setup_mlflow, _NoopModule
 from ray.air._internal.mlflow import _MLflowLoggerUtil
@@ -49,8 +47,7 @@ def setUp(self):
         assert client.get_experiment_by_name("existing_experiment").experiment_id == "1"
 
     def tearDown(self) -> None:
-        # Shutdown session to clean up for next test
-        shutdown_session()
+        pass
 
     def testMlFlowLoggerCallbackConfig(self):
         # Explicitly pass in all args.
@@ -225,23 +222,14 @@ def testMlFlowSetupExplicit(self):
         )
         mlflow.end_run()
 
-    def testMlFlowSetupRankNonRankZero(self):
+    @patch("ray.train.get_context")
+    def testMlFlowSetupRankNonRankZero(self, mock_get_context):
         """Assert that non-rank-0 workers get a noop module"""
-        storage = StorageContext(
-            storage_path=tempfile.mkdtemp(),
-            experiment_dir_name="exp_name",
-            trial_dir_name="trial_name",
-        )
-        init_session(
-            training_func=None,
-            world_rank=1,
-            local_rank=1,
-            node_rank=1,
-            local_world_size=2,
-            world_size=2,
-            storage=storage,
-        )
+        mock_context = MagicMock()
+        mock_context.get_world_rank.return_value = 1
+
+        mock_get_context.return_value = mock_context
+
         mlflow = setup_mlflow({})
         assert isinstance(mlflow, _NoopModule)
 
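As a complement to the patched test above, a hypothetical rank-zero case under the same mocking approach might look like this (assuming `mlflow` is installed, since rank 0 proceeds to configure the real module):

@patch("ray.train.get_context")
def testMlFlowSetupRankZero(self, mock_get_context):
    """Sketch: rank-0 workers should get the real mlflow module."""
    mock_context = MagicMock()
    mock_context.get_world_rank.return_value = 0
    mock_context.get_trial_id.return_value = "trial_id_0"
    mock_context.get_trial_name.return_value = "trial_name_0"
    mock_get_context.return_value = mock_context

    mlflow = setup_mlflow({}, rank_zero_only=True)
    assert not isinstance(mlflow, _NoopModule)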
11 changes: 5 additions & 6 deletions python/ray/train/_internal/session.py
@@ -222,15 +222,14 @@ def reset(
         self.training_started = False
         self._first_report = True
 
-        # Change the working directory to the local trial directory.
-        # -> All workers on the same node share a working directory.
-        os.makedirs(storage.trial_local_path, exist_ok=True)
[Inline review thread on this change]

woshiyyya (Member, Feb 29, 2024):
Intuitively, we should only save the small metadata and logs in this Ray session dir. I'm actually a bit worried, from a performance perspective, that we changed the working dir to the Ray global session dir. For example, users may save large checkpoints into the /tmp folder, which could
  • be slow, since it's on the root disk
  • use up the disk storage when checkpointing LLMs
A workaround is to move the Ray global session dir onto another disk (e.g., a mounted SSD with larger storage and higher bandwidth). However, all other logs (Ray Core, Data, ...) would be moved there as a consequence. We probably don't want our Ray Train feature to affect other libraries' behavior.

justinvyu (Contributor, Author):
/tmp/ray is the place where large Ray objects get spilled, and I am also a bit worried that random things written to the working directory will cause out-of-disk errors on the /tmp drive.
For now, it's possible to work around this by setting a different output directory for saving checkpoints, so let's just keep it as is. The other driver files are very small, so it's fine for the general case.
In the future, if there are user issues related to this, we can consider adding an environment variable to allow users to set the Ray Train staging path without needing to move all of the Ray logs.

woshiyyya (Member):
Yes, it'd be better to allow users to specify a custom Ray Train staging path. Let's create an issue / add a TODO?
+        # Change the working directory to a special trial folder.
+        # This is to ensure that all Ray Train workers have a common working directory.
+        os.makedirs(storage.trial_working_directory, exist_ok=True)
         if bool(int(os.environ.get(RAY_CHDIR_TO_TRIAL_DIR, "1"))):
             logger.debug(
-                "Switching the working directory to the trial directory: "
-                f"{storage.trial_local_path}"
+                f"Changing the working directory to: {storage.trial_working_directory}"
             )
-            os.chdir(storage.trial_local_path)
+            os.chdir(storage.trial_working_directory)
 
     def pause_reporting(self):
         """Ignore all future ``session.report()`` calls."""
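For illustration, a hedged sketch of opting out of the chdir behavior via the RAY_CHDIR_TO_TRIAL_DIR environment variable checked above (assuming the variable must be visible to the worker processes, e.g., propagated through the runtime environment):

import ray

# Workers keep their launch directory as cwd instead of changing into
# storage.trial_working_directory; relative paths in the training function
# then resolve against the original working directory.
ray.init(runtime_env={"env_vars": {"RAY_CHDIR_TO_TRIAL_DIR": "0"}})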