
[Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS #37177

Closed
justinvyu opened this issue Jul 7, 2023 · 9 comments
Labels: Ray 2.7, train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

justinvyu commented Jul 7, 2023

Quicklinks

User guide on configuring storage for Ray Train/Tune
User guide on checkpointing and how checkpoints interact with storage

Summary

Starting in Ray 2.7, Ray Train and Tune will require users to pass in a cloud storage or NFS path if running distributed training or tuning jobs.

In other words, Ray Train / Tune will no longer support the synchronization of checkpoints and other artifacts from worker nodes to the head node.

In Ray 2.6, syncing directories to the head node will no longer be the default storage configuration; attempting to rely on it will raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS.

Please leave any comments or concerns on this thread below -- we would be happy to better understand your perspective.

Code Changes

For single-node Ray Train and Ray Tune experiments, nothing changes and no code modifications are required.

For multi-node Ray Train and Ray Tune experiments, you should switch to using one of the following persistent storage options:

  1. Cloud storage. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="s3://bucket-name/experiment_results",
)

# Use cloud storage in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to s3://bucket-name/experiment_results/experiment_name

  2. A network filesystem mounted on all nodes. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="/mnt/shared_storage/experiment_results",
)

# Use NFS in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to /mnt/shared_storage/experiment_results/experiment_name

If needed, you can re-enable this behavior by setting the environment variable: RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
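
For example, you can set the variable in the shell that launches your script, or at the top of your driver script before any Ray Train/Tune run starts:

import os

# Opt back into the deprecated head-node syncing behavior.
# This must be in place before the Train/Tune run is created.
os.environ["RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE"] = "1"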

Background Context

In a multi-node Ray cluster, Ray Train and Ray Tune assume access to some form of persistent storage that stores outputs from all worker nodes. This includes files such as logged metrics, artifacts, and checkpoints.

Without some form of external shared storage (cloud storage or NFS):

  1. Ray AIR cannot restore a training run from the latest checkpoint for fault tolerance. Without saving checkpoints to external storage, the latest checkpoint may no longer exist if the node it was saved on has crashed.
  2. You cannot access results after training has finished. If the Ray cluster has already been terminated (e.g., from automatic cluster downscaling), then the trained model checkpoints cannot be accessed if they have not been persisted to external storage.

Why are we removing support?

  1. Cloud storage and NFS are cheap, easy to set up, and ubiquitous in today's machine learning landscape.
  2. Syncing to the head node introduces major performance bottlenecks and does not scale to a large number of worker nodes or larger model sizes.
    1. The speed of communication is limited by the network bandwidth of a single (head) node, and with large models, disk space on the head node even becomes an issue.
    2. Generally, putting more load on the head node increases the risk of cluster-level failures.
  3. The maintenance burden of the legacy sync has become substantial. The ML team wants to focus on making the cloud storage path robust and performant, which is much easier without having to maintain two duplicate synchronization stacks.
@JingChen23
Contributor

Is Minio also supported?

@pcmoritz
Contributor

pcmoritz commented Aug 1, 2023

@JingChen23 We are using pyarrow under the hood, which exposes overriding an S3 endpoint via https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html so I believe this should be supported (I know they have some minio tests in their test suite).

You can configure the underlying filesystem with the storage_filesystem option of RunConfig. I believe this is not yet working in Ray 2.6 but will be working in Ray 2.7.
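
A rough sketch of what that could look like (bucket name, credentials, and endpoint below are placeholders, and this assumes the Ray 2.7 API):

import pyarrow.fs
from ray.train import RunConfig  # RunConfig location as of Ray 2.7

# Point pyarrow's native S3 filesystem at an S3-compatible endpoint such as MinIO.
minio_fs = pyarrow.fs.S3FileSystem(
    access_key="miniokey...",
    secret_key="asecretkey...",
    endpoint_override="https://minio.example.com:9000",
)

run_config = RunConfig(
    storage_path="minio_bucket/experiment_results",
    storage_filesystem=minio_fs,
)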

@justinvyu Please correct me if I'm wrong

@justinvyu
Contributor Author

@JingChen23 Yes, another option is passing in a custom fsspec (s3fs) filesystem, then wrapping that as a pyarrow.fs.FileSystem. s3fs has some examples with minio: https://s3fs.readthedocs.io/en/latest/#s3-compatible-storage

Concretely, in 2.7, this will (tentatively) look like:

import pyarrow.fs
import s3fs
from ray.train import RunConfig  # available from ray.train in Ray 2.7

s3_fs = s3fs.S3FileSystem(
  key='miniokey...',
  secret='asecretkey...',
  endpoint_url='https://...'
)
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))

run_config = RunConfig(storage_path="minio_bucket", storage_filesystem=custom_fs)

See also:

@JingChen23
Contributor

Guys, thanks for the reply!

@AwesomeLemon

In our lab (in a public research institute), there are multiple servers that don't have any shared storage. Do I understand correctly that such a setup is no longer supported?

@justinvyu changed the title [AIR] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS → [Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS Sep 26, 2023
@justinvyu
Contributor Author

justinvyu commented Sep 26, 2023

@AwesomeLemon Yes, Ray Train/Tune will require cloud storage or NFS in 2.7+ for multi-node training.

One detail is that this is only strictly enforced (i.e., we will raise an error) if you try to report a checkpoint without setting up persistent storage.
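
As a rough illustration (the metric values and checkpoint path are just placeholders):

from ray import train

def train_fn(config):
    # Reporting metrics alone works without persistent storage configured.
    train.report({"loss": 0.1})

    # Reporting a checkpoint on a multi-node cluster without cloud storage/NFS
    # configured is what raises the error:
    # train.report(
    #     {"loss": 0.1},
    #     checkpoint=train.Checkpoint.from_directory("/tmp/ckpt"),
    # )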

@anyscalesam added the train (Ray Train Related Issue) label and removed the air label Oct 27, 2023
@justinvyu
Contributor Author

Closing this issue, since the change has already been made, but feel free to keep posting questions here!

@alvitawa

How can I do this anyway? I just have a small local cluster, for which it doesn't make sense to add a cloud provider. Any recommendations for ways to, for example, set the head node up as an NFS server as well?
