
[RFC] Ray 2.0 Feature Proposals #22833

Closed
ericl opened this issue Mar 5, 2022 · 13 comments
Labels: RFC, stale

Comments

@ericl
Contributor

ericl commented Mar 5, 2022

What is Ray 2.0?

Ray 2.0 (ETA: August this year) will improve on the Ray 1.x line in three major ways:

  • Usability: Ease-of-use improvements include new observability tooling, improved interoperability of libraries in the Ray ML ecosystem, and higher-level APIs for scheduling, model pipelines, and durable workflows.
  • Performance: The Ray dataplane and object store are critical to the performance of data-intensive ML workloads. We plan to add optimizations that will enable Ray to be used for petabyte-scale shuffle workloads.
  • Production: Cluster fault tolerance, a new KubeRay operator, RESTful APIs for Serve, and a new Job submission API will make Ray 2.0 substantially better for production serving workloads.

The following is a list of current proposals we have for 2.0. For each feature on the list we will later publish a REP (with design details) for community feedback here: https://github.com/ray-project/enhancements

Please feel free to comment in this thread, as well as propose new REPs for 2.0!

Ray Core

State Observability

Observability has been a common pain point for Ray users. In particular, Ray currently lacks a unified API for inspecting the entities that together comprise a running Ray application (e.g., tasks, actors, nodes, placement groups). We plan to add APIs that provide access to a core set of system states via the CLI, REST, and the UI. This API will cover Ray tasks, actors, objects, placement groups, nodes, resources, runtime envs, and namespaces.
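
As a rough illustration, programmatic access to this state might look like the sketch below; the module path, function names, and filter syntax are assumptions about the proposed API, not a committed interface. A matching CLI surface (e.g., something like `ray list actors`) would expose the same data from a terminal.

```python
# Illustrative only: the module path, function names, and filter syntax are
# assumptions about the proposed state API, not a committed interface.
import ray
from ray.experimental.state.api import list_actors, list_tasks  # hypothetical path

ray.init()

# List every actor in the cluster along with its current state.
for actor in list_actors():
    print(actor["actor_id"], actor["state"])

# List only the tasks that are still running.
for task in list_tasks(filters=[("state", "=", "RUNNING")]):
    print(task["task_id"], task["name"])
```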

Cluster Fault Tolerance

Ray's fault tolerance model is currently designed for batch jobs, but serving workloads require greater degrees of availability within a single cluster. We plan to add support for failover of Ray's global control store (GCS). This means that jobs can continue running even if the GCS fails, though certain operations like launching new tasks or actors will be unavailable until the GCS recovers (a few seconds to minutes, depending on the cluster manager).

For Ray Serve users, this means single-cluster deployments can continue serving user traffic uninterrupted during GCS failover. For Serve users who are already using multi-cluster HA (i.e., load balancing traffic across multiple Serve clusters), GCS fault tolerance will reduce the frequency of single-cluster failures that can cause latency spikes / overloads even in a multi-cluster setup, and reduce the amount of resources that need to be provisioned.

Scalable Data Shuffle

Shuffle is an important primitive for ML preprocessing and training use cases. We plan to scale Ray's data shuffle support to large clusters. Current prototypes of shuffle in Ray run 100TB-scale jobs at 70%+ hardware efficiency. In Ray 2.0, we plan to productionize these prototypes and integrate them with the Datasets sort / repartition / shuffle APIs to bring scalable shuffle to Ray library users.

Implementation-wise, this effort will involve (1) adding more advanced shuffle algorithms to Datasets (e.g., Cosco, pipelined shuffle), and (2) optimizing and hardening the Ray dataplane to operate performantly at this scale (e.g., reducing metadata overheads and improving object transfer protocols).
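
For context, the sketch below exercises the existing Datasets entry points this work targets; the dataset is tiny here, but a petabyte-scale job would go through the same repartition / random_shuffle / sort calls.

```python
# Minimal sketch of the Datasets shuffle-related APIs mentioned above,
# exercised on a toy dataset.
import ray

ray.init()

ds = ray.data.range(100_000)

ds = ds.repartition(200)   # change the number of blocks across the cluster
ds = ds.random_shuffle()   # full global shuffle of all rows
ds = ds.sort()             # distributed sort
print(ds.count())
```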

Ray AI Runtime

In 2.0, we're aligning Ray's existing ML libraries to work together as the first third-generation AI runtime. For ML platform builders, this runtime will provide a unified compute layer for ML apps and services. For individual library users, this effort will improve interoperability between Ray libraries and between Ray and third-party ML libraries.

See [RFC] Introducing Ray AI Runtime for more information. This effort also includes a number of standalone improvements to Ray libraries:

Serve Pipelines GA

Serve Pipelines is a feature of Serve built to aid the development and deployment of multi-model inference pipelines, also known as model composition. Pipelines will enable users to compose and test dynamic, multi-model pipelines using familiar Ray task and actor primitives, and then use Serve to deploy these same pipelines to production for low-latency serving of user traffic.
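
To make the idea concrete, here is a sketch of composing two models behind a single Serve pipeline; the bind/run composition style and handle-call semantics are assumptions about the eventual Pipelines API, not its final form.

```python
# Sketch only: the bind/run composition style and handle-call semantics are
# assumptions about the Serve Pipelines API, not its final form.
from ray import serve

@serve.deployment
class Preprocess:
    def __call__(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classify:
    def __call__(self, text: str) -> str:
        return "positive" if "good" in text else "negative"

@serve.deployment
class Pipeline:
    def __init__(self, preprocess, classify):
        # Handles to the upstream deployments, injected at graph-build time.
        self._preprocess = preprocess
        self._classify = classify

    async def __call__(self, text: str) -> str:
        cleaned = await (await self._preprocess.remote(text))
        return await (await self._classify.remote(cleaned))

# Compose the multi-model pipeline and deploy it for low-latency serving.
app = Pipeline.bind(Preprocess.bind(), Classify.bind())
serve.run(app)
```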

Serve JSON API

In addition to its Pythonic API, we plan to add an equivalent JSON-format API to Serve, which can be used to declaratively update deployments and pipelines via CLI or REST request. This API improves the MLOps CI/CD story for Serve by adding a pathway to inspect and update Serve deployments without needing to execute Python code.

Train GA

As part of the Ray AI Runtime, we plan to GA the Ray Train library. Ray Train interoperates with Ray Tune to tune your distributed models and with Ray Datasets to train on large amounts of data, and it will become the standard integration point for third-party ML training frameworks (e.g., XGBoost, PyTorch, HuggingFace) to run on Ray.
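
A sketch of that interop, assuming AIR-style TorchTrainer and ScalingConfig names (the final GA API may differ slightly):

```python
# Sketch, assuming AIR-style TorchTrainer / ScalingConfig names; shows a
# Ray Dataset feeding a distributed training job.
import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

train_ds = ray.data.range(1_000)  # stand-in for a real training dataset

def train_loop_per_worker(config):
    # Real model setup, training epochs, and checkpointing would live here.
    print("worker starting with lr =", config["lr"])

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"lr": 1e-3},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_ds},
)
result = trainer.fit()
# The same trainer can be handed to ray.tune.Tuner for hyperparameter search,
# which is the Tune interop mentioned above.
```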

RLlib API Improvements

We plan two major API improvements to RLlib in 2.0:

First, "RLlib Connectors" will extract the common observation preprocessing code connecting env and RLlib models into a stand-alone connectors library. This will make the Model to Env interfacing code RLlib has more transparent and understandable for users needing to handle environments with complex action and observation spaces.

Second, we plan to refactor the "execution plans" of RLlib algorithms to follow a more imperative pattern, while preserving their distributed functionality. At a high level, this means that algorithm engineers will be able to express the distributed execution of algorithms in more familiar imperative patterns rather than working with distributed result iterators.
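
To illustrate the direction with plain Ray tasks and actors (RLlib's real Algorithm and RolloutWorker classes are deliberately not used here), an imperative training step might read like this:

```python
# Schematic analogy only: sample in parallel from rollout workers, then apply
# a synchronous update, all in ordinary imperative Python.
import random
import ray

ray.init()

@ray.remote
class RolloutWorker:
    def sample(self, n: int = 32):
        # Stand-in for collecting a batch of environment transitions.
        return [random.random() for _ in range(n)]

workers = [RolloutWorker.remote() for _ in range(4)]

def training_step():
    # Gather batches from all workers, then update the "policy" and report metrics.
    batches = ray.get([w.sample.remote() for w in workers])
    train_batch = [x for batch in batches for x in batch]
    return {"num_samples": len(train_batch),
            "mean": sum(train_batch) / len(train_batch)}

for _ in range(3):
    print(training_step())
```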

Ray Clusters

KubeRay GA

KubeRay is the official way to run Ray applications on Kubernetes. In Ray 2.0, KubeRay will reach GA. See the KubeRay project for more details on its roadmap: https://github.com/ray-project/kuberay

Job Submission API GA

The Ray job submission API enables users to package code for execution in Ray clusters. This API will reach GA in 2.0, providing a standard for external Job managers to interact with Ray clusters: https://docs.ray.io/en/latest/ray-job-submission/overview.html
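
A sketch of driving the job submission API from Python, assuming a cluster whose jobs endpoint is reachable at the default dashboard address:

```python
# Assumes a running cluster with its dashboard / jobs endpoint on port 8265.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python my_script.py",      # command executed on the cluster
    runtime_env={"working_dir": "./"},     # ship local code along with the job
)
print(client.get_job_status(job_id))
```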

Ray 2.1 and beyond

There's a lot more we want to do for Ray. A few items we are considering for 2.1 and beyond include:

Workflows GA

We plan to eventually GA the Ray workflows library. This will allow ML applications that only require lightweight orchestration to be built entirely within Ray, improving performance and reducing operational complexity. Key elements of GA include (1) updating the workflows API to align with ray.remote syntax instead of having a separate decorator, (2) stabilizing the storage API, and (3) adding production features such as limiting the number of concurrent workflows.
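
A hypothetical sketch of point (1), a workflows API aligned with the ray.remote / .bind() syntax (the exact entry points may change before GA):

```python
# Hypothetical sketch of the ray.remote-aligned workflows API; entry points
# may change before GA.
import ray
from ray import workflow

@ray.remote
def extract() -> list:
    return [1, 2, 3]

@ray.remote
def transform(rows: list) -> int:
    return sum(rows)

# Compose the DAG with the usual .bind() syntax and run it durably, keyed by a
# workflow id so the run can be resumed after a failure.
dag = transform.bind(extract.bind())
print(workflow.run(dag, workflow_id="etl-demo"))  # 6
```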

Scalability to 2k+ nodes and 100k+ actors

Currently, Ray's scalability is only tested up to 1,000 nodes and 10,000 actors. We plan to address the bottlenecks that prevent applications requiring more resources than this from running on Ray.

Pyarrow Compute for Datasets

We plan to leverage the pyarrow compute library (or other accelerated libraries like polars) more heavily in Ray Datasets, which will accelerate operations such as aggregations commonly used for ML preprocessing.
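
For a sense of what this buys, the snippet below shows the kind of vectorized pyarrow.compute kernels that Datasets aggregations could dispatch to instead of per-row Python code:

```python
# Vectorized aggregations with pyarrow.compute; these kernels run in native
# code rather than interpreting Python per row.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [3.0, 5.0, 7.0], "qty": [2, 1, 4]})

total_qty = pc.sum(table["qty"])
mean_price = pc.mean(table["price"])
print(total_qty.as_py(), mean_price.as_py())
```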

Placement Group API Improvements

To make the placement groups API easier to use, especially for more challenging use cases, we plan to add a higher-level gang scheduling API. This means that instead of creating a placement group and then scheduling actors to the bundles within the group, users would be able to create a set of actors atomically as a group with given scheduling constraints.
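
For contrast, the sketch below shows today's two-step pattern (reserve a placement group, then pin each actor to a bundle) that a gang scheduling API would collapse into a single atomic call:

```python
# Today's two-step pattern the proposal aims to simplify: reserve a placement
# group first, then schedule each actor onto a bundle inside it.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "pong"

pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # block until all bundles have been reserved

workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(2)
]
print(ray.get([w.ping.remote() for w in workers]))
# A gang scheduling API would instead create this group of actors atomically
# with the same constraints, without managing bundles by hand.
```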

Please let us know your feedback below!

@ericl ericl added the RFC label Mar 5, 2022
@ericl ericl pinned this issue Mar 5, 2022
@Jeffwan
Contributor

Jeffwan commented Mar 7, 2022

  1. Current prototypes of shuffle in Ray run 100TB-scale jobs at 70%+ hardware efficiency.

    Is there any reference for the hardware efficiency figure? How is it defined?

  2. Is there an actionable task list for the Beta -> GA features? We would like to understand the current limitations and see whether we can help move them toward maturity.

@ericl
Contributor Author

ericl commented Mar 8, 2022

Is there any reference for the hardware efficiency figure? How is it defined?

I believe this was defined with respect to the aggregate disk performance of the cluster (cc @stephanie-wang ), which was the bottleneck.

Is there an actionable task list for the Beta -> GA features? We would like to understand the current limitations and see whether we can help move them toward maturity.

For Datasets you can follow the P1/P2s in the milestone: https://github.com/ray-project/ray/milestone/53; otherwise we'll be posting REPs which will contain more details. Are there particular questions you have?

@xgreat8

xgreat8 commented Mar 8, 2022

State Observability
Resource utilization observability is needed to support a finer-grained pricing model.

@tianlinzx

tianlinzx commented Mar 17, 2022

Is there any timeline estimate for the alpha, beta, RC, and final releases? @ericl

@ericl
Contributor Author

ericl commented Mar 18, 2022

@tianlinzx thanks for the question! Roughly speaking, we plan to have a community preview around the end of July, though many of the features will land in master in some form before then. The actual release will be Aug 21.

@kuizhiqing

Hi @ericl, as far as I know, Ray's object store (a.k.a. Plasma) does not support HBM yet, nor does it support RDMA, NVLink, etc. Is there any plan for that, or is there perhaps some related work already done?
Thank you in advance.

@ericl
Contributor Author

ericl commented Apr 14, 2022

Hi @ericl, as far as I know, Ray's object store (a.k.a. Plasma) does not support HBM yet, nor does it support RDMA, NVLink, etc. Is there any plan for that, or is there perhaps some related work already done?
Thank you in advance.

Hey @kuizhiqing , we don't have a concrete timeline but are very interested in requirements here (cc @clarkzinzow @scv119 ). What are your use cases and needs for plasma HBM?

Btw, some functionality of this category is available in the ray.util.collective library for use at the application level: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html

@denkensk

Ray's scalability is only tested up to 1,000 nodes and 10,000 actors.

Thanks. Are there any reference documents or tests on scalability?

@scv119
Contributor

scv119 commented Apr 29, 2022

@denkensk there is a benchmark directory for each Ray release: https://github.com/ray-project/ray/blob/ray-1.12.0/benchmarks/README.md

@mwtian mwtian unpinned this issue May 19, 2022
@ericl ericl pinned this issue May 23, 2022
@viksharma1987

Is there any Ray 2.0 beta? We would like to test our apps with it before the final release comes out.

@zhe-thoughts
Collaborator

Hi @viksharma1987, yes, we are working on that plan and will share it with the community soon!

@stale

stale bot commented Sep 24, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Sep 24, 2022
@zhe-thoughts
Collaborator

Closing this issue as Ray 2.0 has been released.
