[Train][Tune] Syncing files to the head node to be removed in Ray 2.7 in favor of cloud storage/NFS #37177
Comments
Is MinIO also supported?
@JingChen23 We are using pyarrow under the hood, which allows overriding the S3 endpoint (see https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html), so I believe this should be supported (I know they have some MinIO tests in their test suite). You can configure the underlying filesystem with the `storage_filesystem` argument. @justinvyu please correct me if I'm wrong.
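For reference, a minimal sketch (not from the original thread) of what overriding the S3 endpoint looks like with pyarrow directly; the endpoint URL, credentials, and bucket name below are placeholders:

```python
import pyarrow.fs

# Sketch: point pyarrow's native S3 filesystem at a MinIO deployment.
# All values here are placeholders, not real credentials or endpoints.
minio_fs = pyarrow.fs.S3FileSystem(
    access_key="miniokey...",
    secret_key="asecretkey...",
    endpoint_override="https://minio.example.com:9000",  # hypothetical MinIO endpoint
)

# Sanity check: list the contents of a bucket to verify connectivity.
for info in minio_fs.get_file_info(pyarrow.fs.FileSelector("minio_bucket")):
    print(info.path)
```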
@JingChen23 Yes, another option is passing in a custom `storage_filesystem`. Concretely, in 2.7, this will (tentatively) look like:

```python
import pyarrow.fs
import s3fs

from ray.train import RunConfig  # Ray 2.7+

# Create an fsspec S3 filesystem pointed at the MinIO endpoint.
s3_fs = s3fs.S3FileSystem(
    key='miniokey...',
    secret='asecretkey...',
    endpoint_url='https://...'
)

# Wrap it so it satisfies pyarrow's filesystem interface, which is what
# Ray expects for storage_filesystem.
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))

run_config = RunConfig(storage_path="minio_bucket", storage_filesystem=custom_fs)
```
Guys, thanks for the reply!
In our lab (in a public research institute), there are multiple servers that don't have any shared storage. Do I understand correctly that such a setup is no longer supported? |
@AwesomeLemon Yes, Ray Train/Tune will require cloud storage or NFS in 2.7+ for multi-node training. One detail is that this is only strictly enforced (i.e., an error is raised) if you try to report a checkpoint without setting up persistent storage.
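To make that enforcement point concrete, here is a minimal sketch (function and path names are placeholders) of the call that triggers the check in 2.7+:

```python
# Sketch: in a multi-node run without cloud storage/NFS configured,
# reporting a checkpoint like this is what raises the error in 2.7+.
# Reporting metrics alone does not.
from ray import train
from ray.train import Checkpoint

def train_fn(config):  # hypothetical training function
    # ... training step ...
    train.report(
        {"loss": 0.1},
        checkpoint=Checkpoint.from_directory("/tmp/ckpt"),  # placeholder path
    )
```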
Closing this issue, since the change has already been made, but feel free to keep posting questions here! |
How can I do this anyway? I just have a small local cluster for which it doesn't make sense to add a cloud provider. Any recommendations for, say, setting up the head node as an NFS server as well?
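For what it's worth, one possible setup (a sketch, not an official recommendation from this thread): export a directory from the head node over NFS at the OS level (e.g., via `/etc/exports`), mount it at the same path on every node, and point `storage_path` at that mount. The paths and names below are placeholders:

```python
from ray.train import RunConfig

# Assumes /mnt/cluster_storage is an NFS share exported by the head node
# and mounted at the same path on every node in the cluster.
run_config = RunConfig(
    storage_path="/mnt/cluster_storage",  # placeholder NFS mount point
    name="my_experiment",                 # hypothetical experiment name
)
```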
Quicklinks
- User guide on configuring storage for Ray Train/Tune
- User guide on checkpointing and how checkpoints interact with storage
Summary
Starting in Ray 2.7, Ray Train and Tune will require users to pass in a cloud storage or NFS path if running distributed training or tuning jobs.
In other words, Ray Train / Tune will no longer support the synchronization of checkpoints and other artifacts from worker nodes to the head node.
In Ray 2.6, syncing directories to the head node will no longer be the default storage configuration. Instead, attempting to sync to the head node will raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS.
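For illustration, a minimal sketch of the recommended cloud-storage setup (the bucket path and trainable are hypothetical):

```python
from ray import train, tune
from ray.train import RunConfig

def my_trainable(config):  # trivial placeholder trainable
    train.report({"score": 1.0})

# Persist all experiment outputs (metrics, artifacts, checkpoints) to
# cloud storage instead of syncing them to the head node.
run_config = RunConfig(storage_path="s3://my-bucket/ray-experiments")  # placeholder bucket

tuner = tune.Tuner(my_trainable, run_config=run_config)
results = tuner.fit()
```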
Please leave any comments or concerns on this thread below -- we would be happy to better understand your perspective.
Code Changes
For single-node Ray Train and Ray Tune experiments, this does not change anything or require any modifications to your code.
For multi-node Ray Train and Ray Tune experiments, you should switch to using one of the following persistent storage options:
- Cloud storage (e.g., AWS S3, Google Cloud Storage)
- A network filesystem (e.g., NFS) mounted on all nodes of the cluster
If needed, you can re-enable this behavior by setting the environment variable `RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1`.
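A sketch of what that escape hatch could look like in practice, assuming the variable is set before Ray Train/Tune reads its configuration (exporting it in the shell that launches the driver works as well):

```python
import os

# Re-enable the deprecated sync-to-head-node behavior (removed in 2.7).
# This must be set before Ray Train/Tune is used.
os.environ["RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE"] = "1"
```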
Background Context
In a multi-node Ray cluster, Ray Train and Ray Tune assume access to some form of persistent storage that stores outputs from all worker nodes. This includes files such as logged metrics, artifacts, and checkpoints.
Without some form of external shared storage (cloud storage or NFS), these outputs previously had to be synced from worker nodes to the head node, which is exactly the behavior being removed.
Why are we removing support?