
[Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS #37177

Closed
justinvyu opened this issue Jul 7, 2023 · 9 comments
Labels: Ray 2.7, train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

justinvyu commented Jul 7, 2023

Quicklinks

User guide on configuring storage for Ray Train/Tune
User guide on checkpointing and how checkpoints interact with storage

Summary

Starting in Ray 2.7, Ray Train and Tune will require users to pass in a cloud storage or NFS path if running distributed training or tuning jobs.

In other words, Ray Train / Tune will no longer support the synchronization of checkpoints and other artifacts from worker nodes to the head node.

In Ray 2.6, syncing directories to the head node will no longer be the default storage configuration; attempting to rely on it will raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS.

Please leave any comments or concerns on this thread below -- we would be happy to better understand your perspective.

Code Changes

For single-node Ray Train and Ray Tune experiments, nothing changes and no code modifications are required.

For multi-node Ray Train and Ray Tune experiments, you should switch to using one of the following persistent storage options:

  1. Cloud storage. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="s3://bucket-name/experiment_results",
)

# Use cloud storage in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to s3://bucket-name/experiment_results/experiment_name

  2. A network filesystem mounted on all nodes. See here for a configuration guide.
from ray import tune
from ray.train.torch import TorchTrainer
from ray.air.config import RunConfig

run_config = RunConfig(
    name="experiment_name",
    storage_path="/mnt/shared_storage/experiment_results",
)

# Use NFS in Train/Tune by configuring `RunConfig(storage_path)`.
trainer = TorchTrainer(..., run_config=run_config)
tuner = tune.Tuner(..., run_config=run_config)

# All experiment results will be persisted to /mnt/shared_storage/experiment_results/experiment_name

If needed, you can re-enable this behavior by setting the environment variable: RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
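
For example, you can set the variable in the shell that launches your script, or at the top of your driver script before any Ray Train/Tune run starts:

import os

# Opt back into the deprecated head-node syncing behavior.
# This must be in place before the Train/Tune run is created.
os.environ["RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE"] = "1"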

Background Context

In a multi-node Ray cluster, Ray Train and Ray Tune assume access to some form of persistent storage that stores outputs from all worker nodes. This includes files such as logged metrics, artifacts, and checkpoints.

Without some form of external shared storage (cloud storage or NFS):

  1. Ray AIR cannot restore a training run from the latest checkpoint for fault tolerance. Without saving checkpoints to external storage, the latest checkpoint may no longer exist if the node it was saved on has crashed.
  2. You cannot access results after training has finished. If the Ray cluster has already been terminated (e.g., from automatic cluster downscaling), then the trained model checkpoints cannot be accessed if they have not been persisted to external storage.

Why are we removing support?

  1. Cloud storage and NFS are cheap, easy to set up, and ubiquitous in today's machine learning landscape.
  2. Syncing to the head node introduces major performance bottlenecks and does not scale to a large number of worker nodes or larger model sizes.
    1. The speed of communication is limited by the network bandwidth of a single (head) node, and with large models, disk space on the head node even becomes an issue.
    2. Generally, putting more load on the head node increases the risk of cluster-level failures.
  3. The maintenance burden of the legacy sync has become substantial. The ML team wants to focus on making the cloud storage path robust and performant, which is much easier without having to maintain two duplicate synchronization stacks.
@JingChen23
Contributor

Is Minio also supported?

@pcmoritz
Contributor

pcmoritz commented Aug 1, 2023

@JingChen23 We are using pyarrow under the hood, which exposes overriding an S3 endpoint via https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html so I believe this should be supported (I know they have some minio tests in their test suite).

You can configure the underlying filesystem with the storage_filesystem option of RunConfig. I believe this is not yet working in Ray 2.6 but will be working in Ray 2.7.
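
A rough sketch of what that could look like (bucket name, credentials, and endpoint below are placeholders, and this assumes the Ray 2.7 API):

import pyarrow.fs
from ray.train import RunConfig  # RunConfig location as of Ray 2.7

# Point pyarrow's native S3 filesystem at an S3-compatible endpoint such as MinIO.
minio_fs = pyarrow.fs.S3FileSystem(
    access_key="miniokey...",
    secret_key="asecretkey...",
    endpoint_override="https://minio.example.com:9000",
)

run_config = RunConfig(
    storage_path="minio_bucket/experiment_results",
    storage_filesystem=minio_fs,
)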

@justinvyu Please correct me if I'm wrong

@justinvyu
Contributor Author

@JingChen23 Yes, another option is passing in a custom fsspec (s3fs) filesystem, then wrapping that as a pyarrow.fs.FileSystem. s3fs has some examples with minio: https://s3fs.readthedocs.io/en/latest/#s3-compatible-storage

Concretely, in 2.7, this will (tentatively) look like:

import pyarrow.fs
import s3fs
from ray.train import RunConfig  # available from ray.train in Ray 2.7

s3_fs = s3fs.S3FileSystem(
  key='miniokey...',
  secret='asecretkey...',
  endpoint_url='https://...'
)
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))

run_config = RunConfig(storage_path="minio_bucket", storage_filesystem=custom_fs)

See also:

@JingChen23
Contributor

Guys, thanks for the reply!

@AwesomeLemon

In our lab (in a public research institute), there are multiple servers that don't have any shared storage. Do I understand correctly that such a setup is no longer supported?

@justinvyu changed the title [AIR] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS → [Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS Sep 26, 2023
@justinvyu
Contributor Author

justinvyu commented Sep 26, 2023

@AwesomeLemon Yes, Ray Train/Tune will require cloud storage or NFS in 2.7+ for multi-node training.

One detail is that this is only strictly enforced (i.e., we will raise an error) if you try to report a checkpoint without setting up persistent storage.
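
As a rough illustration (the metric values and checkpoint path are just placeholders):

from ray import train

def train_fn(config):
    # Reporting metrics alone works without persistent storage configured.
    train.report({"loss": 0.1})

    # Reporting a checkpoint on a multi-node cluster without cloud storage/NFS
    # configured is what raises the error:
    # train.report(
    #     {"loss": 0.1},
    #     checkpoint=train.Checkpoint.from_directory("/tmp/ckpt"),
    # )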

@anyscalesam added the train (Ray Train Related Issue) label and removed the air label Oct 27, 2023
@justinvyu
Contributor Author

Closing this issue, since the change has already been made, but feel free to keep posting questions here!

@alvitawa

How can I do this anyway? I just have a small local cluster, for which it doesn't make sense to add a cloud provider. Any recommendations for ways to, for example, set the head node up as an NFS server as well?
