
[RFC] Ray 2.0 Feature Proposals #22833

Closed
ericl opened this issue Mar 5, 2022 · 13 comments
Labels: RFC, stale

Comments

@ericl
Contributor

ericl commented Mar 5, 2022

What is Ray 2.0?

Ray 2.0 (ETA: August this year) will improve on the Ray 1.x line in three major ways:

  • Usability: Ease-of-use improvements include new observability tooling, improved interoperability of libraries in the Ray ML ecosystem, and higher-level APIs for scheduling, model pipelines, and durable workflows.
  • Performance: The Ray dataplane and object store are critical to the performance of data-intensive ML workloads. We plan to add optimizations that will enable Ray to be used for petabyte-scale shuffle workloads.
  • Production: Cluster fault tolerance, a new KubeRay operator, RESTful APIs for Serve, and a new Job submission API will make Ray 2.0 substantially better for production serving workloads.

The following is a list of current proposals we have for 2.0. For each feature on the list we will later publish a REP (with design details) for community feedback here: https://github.com/ray-project/enhancements

Please feel free to comment in this thread, as well as propose new REPs for 2.0!

Ray Core

State Observability

Observability has been a common pain point for Ray users. In particular, Ray currently lacks a unified API for inspecting the entities that together comprise a running Ray application (e.g., tasks, actors, nodes, placement groups). We plan to add APIs that provide access to a core set of system states via the CLI, REST, and the UI. This API will cover Ray tasks, actors, objects, placement groups, nodes, resources, runtime envs, and namespaces.
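
As a rough illustration, programmatic access to this state might look like the sketch below; the module path, function names, and filter syntax are assumptions about the proposed API, not a committed interface. A matching CLI surface (e.g., something like `ray list actors`) would expose the same data from a terminal.

```python
# Illustrative only: the module path, function names, and filter syntax are
# assumptions about the proposed state API, not a committed interface.
import ray
from ray.experimental.state.api import list_actors, list_tasks  # hypothetical path

ray.init()

# List every actor in the cluster along with its current state.
for actor in list_actors():
    print(actor["actor_id"], actor["state"])

# List only the tasks that are still running.
for task in list_tasks(filters=[("state", "=", "RUNNING")]):
    print(task["task_id"], task["name"])
```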

Cluster Fault Tolerance

Ray's fault tolerance model is currently designed for batch jobs, but serving workloads require greater degrees of availability within a single cluster. We plan to add support for failover of Ray's global control store (GCS). This means that jobs can continue running even if the GCS fails, though certain operations like launching new tasks or actors will be unavailable until the GCS recovers (a few seconds to minutes, depending on the cluster manager).

For Ray Serve users, this means single-cluster deployments can continue serving user traffic uninterrupted during GCS failover. For Serve users who are already using multi-cluster HA (i.e., load balancing traffic across multiple Serve clusters), GCS fault tolerance will reduce the frequency of single-cluster failures that can cause latency spikes / overloads even in a multi-cluster setup, and reduce the amount of resources that need to be provisioned.

Scalable Data Shuffle

Shuffle is an important primitive for ML preprocessing and training use cases. We plan to scale Ray's data shuffle support to large clusters. Current prototypes of shuffle in Ray run 100TB-scale jobs at 70%+ hardware efficiency. In Ray 2.0, we plan to productionize these prototypes and integrate them with the Datasets sort / repartition / shuffle APIs to bring scalable shuffle to Ray library users.

Implementation-wise, this effort will involve (1) adding more advanced shuffle algorithms to Datasets (e.g., Cosco, pipelined shuffle), and (2) optimizing and hardening the Ray dataplane to operate performantly at this scale (e.g., reducing metadata overheads and improving object transfer protocols).
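
For context, the sketch below exercises the existing Datasets entry points this work targets; the dataset is tiny here, but a petabyte-scale job would go through the same repartition / random_shuffle / sort calls.

```python
# Minimal sketch of the Datasets shuffle-related APIs mentioned above,
# exercised on a toy dataset.
import ray

ray.init()

ds = ray.data.range(100_000)

ds = ds.repartition(200)   # change the number of blocks across the cluster
ds = ds.random_shuffle()   # full global shuffle of all rows
ds = ds.sort()             # distributed sort
print(ds.count())
```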

Ray AI Runtime

In 2.0, we're aligning Ray's existing ML libraries to work together as the first third-generation AI runtime. For ML platform builders, this runtime will provide a unified compute layer for ML apps and services. For individual library users, this effort will improve interoperability between Ray libraries and between Ray and third-party ML libraries.

See [RFC] Introducing Ray AI Runtime for more information. This effort also includes a number of standalone improvements to Ray libraries:

Serve Pipelines GA

Serve Pipelines is a feature of Serve built to aid the development and deployment of multi-model inference pipelines, also known as model composition. Pipelines will enable users to compose and test dynamic, multi-model pipelines using familiar Ray task and actor primitives, and then use Serve to deploy these same pipelines to production for low-latency serving of user traffic.
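
To make the idea concrete, here is a sketch of composing two models behind a single Serve pipeline; the bind/run composition style and handle-call semantics are assumptions about the eventual Pipelines API, not its final form.

```python
# Sketch only: the bind/run composition style and handle-call semantics are
# assumptions about the Serve Pipelines API, not its final form.
from ray import serve

@serve.deployment
class Preprocess:
    def __call__(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classify:
    def __call__(self, text: str) -> str:
        return "positive" if "good" in text else "negative"

@serve.deployment
class Pipeline:
    def __init__(self, preprocess, classify):
        # Handles to the upstream deployments, injected at graph-build time.
        self._preprocess = preprocess
        self._classify = classify

    async def __call__(self, text: str) -> str:
        cleaned = await (await self._preprocess.remote(text))
        return await (await self._classify.remote(cleaned))

# Compose the multi-model pipeline and deploy it for low-latency serving.
app = Pipeline.bind(Preprocess.bind(), Classify.bind())
serve.run(app)
```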

Serve JSON API

In addition to its Pythonic API, we plan to add an equivalent JSON-format API to Serve, which can be used to declaratively update deployments and pipelines via CLI or REST request. This API improves the MLOps CI/CD story for Serve by adding a pathway to inspect and update Serve deployments without needing to execute Python code.

Train GA

As part of the Ray AI Runtime, we plan to GA the Ray Train library. Ray Train interoperates with Ray Tune to tune your distributed models and with Ray Datasets to train on large amounts of data, and it will become the standard integration point for third-party ML training frameworks (e.g., XGBoost, PyTorch, HuggingFace) to run on Ray.
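
A sketch of that interop, assuming AIR-style TorchTrainer and ScalingConfig names (the final GA API may differ slightly):

```python
# Sketch, assuming AIR-style TorchTrainer / ScalingConfig names; shows a
# Ray Dataset feeding a distributed training job.
import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

train_ds = ray.data.range(1_000)  # stand-in for a real training dataset

def train_loop_per_worker(config):
    # Real model setup, training epochs, and checkpointing would live here.
    print("worker starting with lr =", config["lr"])

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 1e-3},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_ds},
)
result = trainer.fit()
# The same trainer can be handed to ray.tune.Tuner for hyperparameter search,
# which is the Tune interop mentioned above.
```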

RLlib API Improvements

We plan two major API improvements to RLlib in 2.0:

First, "RLlib Connectors" will extract the common observation preprocessing code connecting env and RLlib models into a stand-alone connectors library. This will make the Model to Env interfacing code RLlib has more transparent and understandable for users needing to handle environments with complex action and observation spaces.

Second, we plan to refactor the "execution plans" of RLlib algorithms to follow a more imperative pattern, while preserving their distributed functionality. At a high level, this means that algorithm engineers will be able to express the distributed execution of algorithms in more familiar imperative patterns rather than working with distributed result iterators.
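
To illustrate the direction with plain Ray tasks and actors (RLlib's real Algorithm and RolloutWorker classes are deliberately not used here), an imperative training step might read like this:

```python
# Schematic analogy only: sample in parallel from rollout workers, then apply
# a synchronous update, all in ordinary imperative Python.
import random
import ray

ray.init()

@ray.remote
class RolloutWorker:
    def sample(self, n: int = 32):
        # Stand-in for collecting a batch of environment transitions.
        return [random.random() for _ in range(n)]

workers = [RolloutWorker.remote() for _ in range(4)]

def training_step():
    # Gather batches from all workers, then update the "policy" and report metrics.
    batches = ray.get([w.sample.remote() for w in workers])
    train_batch = [x for batch in batches for x in batch]
    return {"num_samples": len(train_batch),
            "mean": sum(train_batch) / len(train_batch)}

for _ in range(3):
    print(training_step())
```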

Ray Clusters

KubeRay GA

KubeRay is the official way to run Ray applications on Kubernetes. In Ray 2.0, KubeRay will reach GA. See the KubeRay project for more details on its roadmap: https://github.com/ray-project/kuberay

Job Submission API GA

The Ray job submission API enables users to package code for execution in Ray clusters. This API will reach GA in 2.0, providing a standard for external Job managers to interact with Ray clusters: https://docs.ray.io/en/latest/ray-job-submission/overview.html
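
A sketch of driving the job submission API from Python, assuming a cluster whose jobs endpoint is reachable at the default dashboard address:

```python
# Assumes a running cluster with its dashboard / jobs endpoint on port 8265.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python my_script.py",      # command executed on the cluster
    runtime_env={"working_dir": "./"},     # ship local code along with the job
)
print(client.get_job_status(job_id))
```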

Ray 2.1 and beyond

There's a lot more we want to do for Ray. A few items we are considering for 2.1 and beyond include:

Workflows GA

We plan to eventually GA the Ray workflows library. This will allow ML applications that only require lightweight orchestration to be built entirely within Ray, improving performance and reducing operational complexity. Key elements of GA include (1) updating the workflows API to align with ray.remote syntax instead of having a separate decorator, (2) stabilizing the storage API, and (3) adding production features such as limiting the number of concurrent workflows.
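
A hypothetical sketch of point (1), a workflows API aligned with the ray.remote / .bind() syntax (the exact entry points may change before GA):

```python
# Hypothetical sketch of the ray.remote-aligned workflows API; entry points
# may change before GA.
import ray
from ray import workflow

@ray.remote
def extract() -> list:
    return [1, 2, 3]

@ray.remote
def transform(rows: list) -> int:
    return sum(rows)

# Compose the DAG with the usual .bind() syntax and run it durably, keyed by a
# workflow id so the run can be resumed after a failure.
dag = transform.bind(extract.bind())
print(workflow.run(dag, workflow_id="etl-demo"))  # 6
```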

Scalability to 2k+ nodes and 100k+ actors

Currently, Ray's scalability is only tested up to 1,000 nodes and 10,000 actors. We plan to address the bottlenecks that prevent applications requiring more resources than this from running on Ray.

Pyarrow Compute for Datasets

We plan to leverage the pyarrow compute library (or other accelerated libraries like polars) more heavily in Ray Datasets, which will accelerate operations such as aggregations commonly used for ML preprocessing.
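
For a sense of what this buys, the snippet below shows the kind of vectorized pyarrow.compute kernels that Datasets aggregations could dispatch to instead of per-row Python code:

```python
# Vectorized aggregations with pyarrow.compute; these kernels run in native
# code rather than interpreting Python per row.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [3.0, 5.0, 7.0], "qty": [2, 1, 4]})

total_qty = pc.sum(table["qty"])
mean_price = pc.mean(table["price"])
print(total_qty.as_py(), mean_price.as_py())
```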

Placement Group API Improvements

To make the placement groups API easier to use, especially for more challenging use cases, we plan to add a higher-level gang scheduling API. This means that instead of creating a placement group and then scheduling actors to the bundles within the group, users would be able to create a set of actors atomically as a group with given scheduling constraints.
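
For contrast, the sketch below shows today's two-step pattern (reserve a placement group, then pin each actor to a bundle) that a gang scheduling API would collapse into a single atomic call:

```python
# Today's two-step pattern the proposal aims to simplify: reserve a placement
# group first, then schedule each actor onto a bundle inside it.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "pong"

pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # block until all bundles have been reserved

workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(2)
]
print(ray.get([w.ping.remote() for w in workers]))
# A gang scheduling API would instead create this group of actors atomically
# with the same constraints, without managing bundles by hand.
```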

Please let us know your feedback below!

@ericl ericl added the RFC label Mar 5, 2022
@ericl ericl pinned this issue Mar 5, 2022
@Jeffwan
Contributor

Jeffwan commented Mar 7, 2022

  1. Current prototypes of shuffle in Ray run 100TB-scale jobs at 70%+ hardware efficiency.

    Is there any reference for the hardware efficiency figure? How is it defined?

  2. Is there an actionable task list for the Beta -> GA features? We would like to understand the current limitations and see whether we can help move them toward maturity.

@ericl
Contributor Author

ericl commented Mar 8, 2022

Is there any reference for the hardware efficiency figure? How is it defined?

I believe this was defined with respect to the aggregate disk performance of the cluster (cc @stephanie-wang ), which was the bottleneck.

Is there an actionable task list for the Beta -> GA features? We would like to understand the current limitations and see whether we can help move them toward maturity.

For Datasets you can follow the P1/P2s in the milestone: https://github.com/ray-project/ray/milestone/53; otherwise we'll be posting REPs which will contain more details. Are there particular questions you have?

@xgreat8

xgreat8 commented Mar 8, 2022

State Observability
Resource utilization observability is needed to support a finer-grained pricing model.

@tianlinzx

tianlinzx commented Mar 17, 2022

Is there any timeline estimate for the alpha, beta, RC, and final releases? @ericl

@ericl
Contributor Author

ericl commented Mar 18, 2022

@tianlinzx thanks for the question! Roughly speaking, we plan to have a community preview around the end of July, though many of the features will land in master in some form before then. The actual release will be Aug 21.

@kuizhiqing

Hi @ericl, as far as I know, Ray's object store (a.k.a. Plasma) does not support HBM yet, nor does it support RDMA, NVLink, etc. Is there any plan for that, or is there perhaps some related work already done?
Thank you in advance.

@ericl
Contributor Author

ericl commented Apr 14, 2022

Hi @ericl, as far as I know, Ray's object store (a.k.a. Plasma) does not support HBM yet, nor does it support RDMA, NVLink, etc. Is there any plan for that, or is there perhaps some related work already done?
Thank you in advance.

Hey @kuizhiqing , we don't have a concrete timeline but are very interested in requirements here (cc @clarkzinzow @scv119 ). What are your use cases and needs for plasma HBM?

Btw, some functionality of this category is available in the ray.util.collective library for use at the application level: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html

@denkensk

Ray's scalability is only tested up to 1,000 nodes and 10,000 actors.

Thanks. Are there any reference documents or tests on scalability?

@scv119
Contributor

scv119 commented Apr 29, 2022

@denkensk there is a benchmark directory for each Ray release: https://github.com/ray-project/ray/blob/ray-1.12.0/benchmarks/README.md

@mwtian mwtian unpinned this issue May 19, 2022
@ericl ericl pinned this issue May 23, 2022
@viksharma1987

Is there any Ray 2.0 beta? We would like to test our apps with it before the final release comes out.

@zhe-thoughts
Collaborator

Hi @viksharma1987, yes, we are working on that plan and will share it with the community soon!

@stale

stale bot commented Sep 24, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Sep 24, 2022
@zhe-thoughts
Collaborator

Closing this issue as Ray 2.0 has been released.
