Ray-2.7.0

@GeneDer GeneDer released this 17 Sep 18:13
· 1 commit to releases/2.7.0rc0 since this release
b4bba47

Release Highlights

The Ray 2.7 release brings important stability improvements and enhancements to Ray libraries, with Ray Train and Ray Serve becoming generally available. Ray 2.7 is accompanied by a GA release of KubeRay.

  • Following user feedback, we are rebranding “Ray AI Runtime (AIR)” to “Ray AI Libraries”. Without reducing any of the underlying functionality of the original Ray AI runtime vision as put forth in Ray 2.0, the underlying namespace (ray.air) is consolidated into ray.data, ray.train, and ray.tune. This change reduces the friction for new machine learning (ML) practitioners to quickly understand and leverage Ray for their production machine learning use cases.
  • With this release, Ray Serve and Ray Train’s PyTorch support are becoming Generally Available, indicating that the core APIs have been marked stable and that both libraries have undergone significant production hardening.
  • In Ray Serve, we are introducing a new backwards-compatible DeploymentHandle API to unify various existing Handle APIs, a high-performance gRPC proxy to serve gRPC requests through Ray Serve, along with various stability and usability improvements.
  • In Ray Train, we are consolidating various PyTorch-based trainers into the TorchTrainer, reducing the amount of refactoring work new users need to scale existing training scripts. We are also introducing a new train.Checkpoint API, which provides a consolidated way of interacting with remote and local storage, along with various stability and usability improvements.
  • In Ray Core, we’ve added initial integrations with TPUs and AWS accelerators, enabling Ray to natively detect these devices and schedule tasks/actors onto them. Ray Core also now officially supports actor task cancellation and has an experimental streaming generator that supports streaming responses to the caller.

Take a look at our refreshed documentation and the Ray 2.7 migration guide and let us know your feedback!

Ray Libraries

Ray AIR

🏗 Architecture refactoring:

Ray Data

🎉 New Features:

  • In this release, we’ve integrated the Ray Core streaming generator API by default, which allows us to reduce memory footprint throughout the data pipeline (#37736).
  • Avoid unnecessary data buffering between Read and Map operators (zero-copy fusion) (#38789)
  • Add Dataset.write_images to write images (#38228)
  • Add Dataset.write_sql() to write to SQL databases (#38544)
  • Support sort on multiple keys (#37124)
  • Support reading and writing JSONL file format (#37637)
  • Support class constructor args for Dataset.map() and flat_map() (#38606)
  • Implement streamed read from Hugging Face Dataset (#38432)
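As a rough illustration of the multi-key sort listed above (#37124), the sketch below sorts a few in-memory rows by two columns. The column names (group, score) and both helper functions are hypothetical, and it assumes that Dataset.sort accepts a list of column names and sorts ascending by each key in turn. The Ray import is deferred so that the pure-Python reference ordering can be checked without a cluster.

```python
from operator import itemgetter

# Hypothetical sample rows; the column names are illustrative only.
ROWS = [
    {"group": "b", "score": 2},
    {"group": "a", "score": 3},
    {"group": "b", "score": 1},
    {"group": "a", "score": 1},
]


def reference_order(rows):
    """Pure-Python ordering that a sort on (group, score) should match."""
    return sorted(rows, key=itemgetter("group", "score"))


def sort_with_ray_data(rows):
    """Sort the same rows with Ray Data (needs Ray 2.7+ with the data extras)."""
    import ray  # deferred so this module loads without Ray installed

    ds = ray.data.from_items(rows)
    # Assumption: a list of keys sorts by each key in turn, ascending.
    return ds.sort(["group", "score"]).take_all()
```

Calling sort_with_ray_data(ROWS) on a machine with Ray installed should return the rows in the same order as reference_order(ROWS).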

💫Enhancements:

  • Read data with multi-threading for FileBasedDataSource (#39493)
  • Optimization to reduce ArrowBlock building time for blocks of size 1 (#38988)
  • Add partition_filter parameter to read_parquet (#38479)
  • Apply limit to Dataset.take() and related methods (#38677)
  • Postpone reader.get_read_tasks until execution (#38373)
  • Lazily construct metadata providers (#38198)
  • Support writing each block to a separate file (#37986)
  • Make iter_batches an Iterable (#37881)
  • Remove default limit on Dataset.to_pandas() (#37420)
  • Add Dataset.to_dask() parameter to toggle consistent metadata check (#37163)
  • Add Datasource.on_write_start (#38298)
  • Remove support for DatasetDict as input into from_huggingface() (#37555)

🔨 Fixes:

  • Backwards compatibility for Preprocessors that have been fit in older versions (#39488)
  • Do not eagerly free root RefBundles (#39085)
  • Retry open files with exponential backoff (#38773)
  • Avoid passing local_uri to all non-Parquet data sources (#38719)
  • Add ctx parameter to Datasource.write (#38688)
  • Preserve block format on map_batches over empty blocks (#38161)
  • Fix args and kwargs passed to ActorPool map_batches (#38110)
  • Add tif file extension to ImageDatasource (#38129)
  • Raise error if PIL can't load image (#38030)
  • Allow automatic handling of string features as byte features during TFRecord serialization (#37995)
  • Remove unnecessary file system wrapping (#38299)
  • Remove _block_udf from FileBasedDatasource reads (#38111)

📖Documentation:

Ray Train

🤝 API Changes

💫Enhancements:

  • Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
  • Raise actionable error message for missing dependencies (#38497)
  • Use posix paths throughout library code (#38319)
  • Group consecutive workers by IP (#38490)
  • Split all Ray Datasets by default (#38694)
  • Add static Trainer methods for getting tree-based models (#38344)
  • Don't set rank-specific local directories for Train workers (#38007)

🔨 Fixes:

  • Fix trainer restoration from S3 (#38251)

🏗 Architecture refactoring:

📖Documentation:

Ray Tune

🤝 API Changes

💫Enhancements:

🔨 Fixes:

🏗 Architecture refactoring:

Ray Serve

🎉 New Features:

  • Added keep_alive_timeout_s to Serve config file to allow users to configure HTTP proxy’s duration to keep idle connections alive when no requests are ongoing.
  • Added a gRPC proxy to serve gRPC requests through Ray Serve. It comes with feature parity with HTTP while offering better performance. It also replaces the previous experimental gRPC direct ingress.
  • Ray 2.7 introduces a new DeploymentHandle API that will replace the existing RayServeHandle and RayServeSyncHandle APIs in a future release. You are encouraged to migrate to the new API to avoid breakages in the future. To opt in, either use handle.options(use_new_handle_api=True) or set the global environment variable RAY_SERVE_ENABLE_NEW_HANDLE_API=1 (for example, export RAY_SERVE_ENABLE_NEW_HANDLE_API=1). See https://docs.ray.io/en/latest/serve/model_composition.html for more details.
  • Added a new API get_app_handle that gets a handle used to send requests to an application. The API uses the new DeploymentHandle API.
  • Added a new developer API get_deployment_handle that gets a handle that can be used to send requests to any deployment in any application.
  • Added replica placement group support.
  • Added a new API serve.status which can be used to get the status of proxies and Serve applications (and their deployments and replicas). This is the pythonic equivalent of the CLI serve status.
  • A --reload option has been added to the serve run CLI.
  • Added support for X-Request-ID in the HTTP header.
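The handle-related items above can be sketched roughly as follows. This is a minimal sketch, not the documented usage: the Greeter class, the application name greeter_app, and the run_greeter helper are hypothetical, and it assumes that the response returned by handle.remote() exposes a blocking .result() method and that serve.status() exposes an applications mapping keyed by application name. The Serve import is deferred so the environment variable is set before Serve is loaded.

```python
import os

# Opt in globally before Serve starts (environment variable named in the notes).
os.environ["RAY_SERVE_ENABLE_NEW_HANDLE_API"] = "1"


class Greeter:
    """Plain class; wrapped as a Serve deployment inside run_greeter()."""

    def __call__(self, name: str) -> str:
        return f"Hello, {name}!"


def run_greeter(name: str = "Ray") -> str:
    """Deploy Greeter and call it through a handle from get_app_handle."""
    from ray import serve  # deferred; needs Ray 2.7+ with the serve extras

    serve.run(serve.deployment(Greeter).bind(), name="greeter_app")
    handle = serve.get_app_handle("greeter_app")
    # Assumption: .remote() returns a response whose .result() blocks for the value.
    reply = handle.remote(name).result()
    # Assumption: serve.status() exposes per-application status by name.
    print(serve.status().applications["greeter_app"].status)
    serve.shutdown()
    return reply
```

On a machine with Ray Serve installed, run_greeter("Ray") should deploy the application, print its status, and return the greeting.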

💫Enhancements:

  • Downstream handlers will now be canceled when the HTTP client disconnects or an end-to-end timeout occurs.
  • Ray Serve is now “generally available,” so the core APIs have been marked stable.
  • Added a new metric (ray_serve_num_ongoing_http_requests) to track the number of ongoing requests in each proxy
  • Added the RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S flag to control how long to wait for a model ID match.
  • Reduce the multiplexed model id information publish interval.
  • Add Multiplex metrics into dashboard
  • Added metrics to track controller restarts and control loop progress
  • Various stability, flexibility, and performance enhancements to Ray Serve’s autoscaling.

🔨 Fixes:

  • Fixed a memory leak in Serve components by upgrading gRPC: #38591.
  • Fixed a memory leak due to asyncio.Events not being removed in the long poll host: #38516.
  • Fixed a bug where bound deployments could not be passed within custom objects: #38809.
  • Fixed a bug where all replica handles were unnecessarily broadcast to all proxies every minute: #38539.
  • Fixed a bug where ray_serve_deployment_queued_queries wouldn’t decrement when clients disconnected: https://github.com/ray-project/ray/pull/37965.

📖Documentation:

  • Added docs for how to use keep_alive_timeout_s in the Serve config file.
  • Added usage and examples for serving gRPC requests through Serve’s gRPC proxy.
  • Added example for passing deployment handle responses by reference.
  • Added a Ray Serve Autoscaling guide to the Ray Serve docs that goes over basic configurations and autoscaling examples. Also added an Advanced Ray Serve Autoscaling guide that goes over more advanced configurations and autoscaling examples.
  • Added docs explaining how to debug memory leaks in Serve.
  • Added docs that explain how Serve cancels disconnected requests and how to handle those disconnections.

RLlib

🎉 New Features:

  • In Ray RLlib, we have implemented Google’s new DreamerV3, a sample-efficient, model-based, and hyperparameter hassle-free algorithm. It solves a wide variety of challenging reinforcement learning environments out-of-the-box (e.g. the MineRL diamond challenge), for arbitrary observation- and action-spaces as well as dense and sparse reward functions.

💫Enhancements:

  • Added support for Gymnasium 0.28.1 (#35698)
  • Dreamer V3 tuned examples and support for “XL” Dreamer models (#38461)
  • Added an action masking example for RL Modules (#38095)

🔨 Fixes:

  • Multiple fixes to DreamerV3 (#37979) (#38259) (#38461) (#38981)
  • Fixed TorchBinaryAutoregressiveDistribution.sampled_action_logp() returning probs not log probs. (#37240)
  • Fix a bug in Multi-Categorical distribution. It should use logp and not log_p. (#36814)
  • Index tensors in slate epsilon greedy properly so SlateQ does not fail on multiple GPUs (#37481)
  • Removed excessive deprecation warnings in exploration related files (#37404)
  • Fixed missing agent index in policy input dict on environment reset (#37544)

📖Documentation:

  • Added docs for DreamerV3 (#37978)
  • Added docs on torch.compile usage (#37252)
  • Added docs for the Learner API (#37729)
  • Improvements to Catalogs and RL Modules docs + Catalogs improvements (#37245)
  • Extended our metrics and callbacks example to showcase how to do custom summarisation on custom metrics (#37292)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Actor task cancellation is officially supported.
  • The experimental streaming generator is now available. Yielded output is sent to the caller before the task finishes, which overcomes the limitations of the num_returns="dynamic" generator. To use the API, specify num_returns="streaming". Ray Data and Ray Serve already use it to support streaming use cases. See the test script to learn how to use the API. The documentation will be available in a few days.
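The streaming generator described above might be used along these lines. This is a minimal sketch under stated assumptions: the squares generator and the stream_with_ray helper are hypothetical, and it assumes that iterating the object returned by .remote() yields one ObjectRef per item as the task produces it. The API is experimental, so the details may change.

```python
def squares(n):
    """Ordinary Python generator that yields one value at a time."""
    for i in range(n):
        yield i * i


def stream_with_ray(n):
    """Run squares() as a Ray task and consume items as they are yielded."""
    import ray  # deferred; needs Ray 2.7+

    ray.init(ignore_reinit_error=True)
    # num_returns="streaming" is the option named in the release notes.
    streaming_squares = ray.remote(num_returns="streaming")(squares)
    results = []
    # Assumption: each iteration yields an ObjectRef for the next item.
    for ref in streaming_squares.remote(n):
        results.append(ray.get(ref))
    return results
```

With Ray installed, stream_with_ray(4) should collect the same values that list(squares(4)) produces locally, but each value becomes available to the caller before the task has finished.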

💫Enhancements:

  • Minimal Ray installation pip install ray doesn't require the Python grpcio dependency anymore.
  • [Breaking change] ray job submit now exits with 1 if the job fails instead of 0. To get the old behavior back, you may use ray job submit ... || true. (#38390)
  • [Breaking change] get_assigned_resources in pg will return the name of the original resources instead of formatted name (#37421)
  • [Breaking change] Every env var specified via ${ENV_VAR} now can be replaced. Previous versions only supported a limited number of env vars. (#36187)
  • [Java] Update Guava package (#38424)
  • [Java] Update Jackson Databind XML Parsing (#38525)
  • [Spark] Allow specifying CPU / GPU / Memory resources for head node of Ray cluster on Spark (#38056)
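Because ray job submit now exits with 1 when the job fails, wrapper scripts can branch on the exit code. The sketch below shows one way to do that from Python. The run_and_report helper is hypothetical, and a small Python subprocess that exits with 1 stands in for a failing ray job submit call so the sketch runs without a cluster.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return (exit_code, succeeded)."""
    completed = subprocess.run(command)
    return completed.returncode, completed.returncode == 0


# Stand-in for a failing `ray job submit ...` call (exits with status 1).
failing_job = [sys.executable, "-c", "import sys; sys.exit(1)"]
code, ok = run_and_report(failing_job)
print(f"exit code={code}, succeeded={ok}")
```

In real use, the command list would be something like ["ray", "job", "submit", "--", "python", "script.py"], where script.py is a placeholder for your own entrypoint.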

🔨 Fixes:

  • [Core] Internal gRPC version is upgraded from 1.46.6 to 1.50.2, which fixes the memory leak issue
  • [Core] Bind jemalloc to raylet and GCS (#38644) to fix memory fragmentation issue
  • [Core] Previously, when Ray was started with ray start --node-ip-address=..., the driver also had to specify ray.init(_node_ip_address). Now Ray finds the node IP address automatically. (#37644)
  • [Core] Child processes of workers are cleaned up automatically when a raylet dies (#38439)
  • [Core] Fix the issue where there are lots of threads created when using async actor (#37949)
  • [Core] Fixed a bug where tracing did not work when an actor/task was defined prior to calling ray.init: #26019
  • Various other bug fixes
    • [Core] loosen the check on release object (#39570)
    • [Core][agent] fix the race condition where the worker process terminated during the get_all_workers call #37953
    • [Core] Fix PG leakage caused by GCS restart when PG has not been successfully removed after the job died (#35773)
    • [Core] Fix internal_kv del api bug in client proxy mode (#37031)
    • [Core] Pass logs through if sphinx-doctest is running (#36306)
    • [Core][dashboard] Make intentional ray system exit from worker exit non task failing (#38624)
    • [Core][dashboard] Add worker pid to task info (#36941)
    • [Core] Use 1 thread for all fibers for an actor scheduling queue. (#37949)
    • [runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105 (#34956)

Ray Clusters

💫Enhancements:

  • New Cluster Launcher for vSphere #37815
  • TPU pod support for cluster launcher #37934

📖Documentation:

Thanks

Many thanks to all those who contributed to this release!

@simran-2797, @can-anyscale, @akshay-anyscale, @c21, @EdwardCuiPeacock, @rynewang, @volks73, @sven1977, @alexeykudinkin, @mattip, @Rohan138, @larrylian, @DavidYoonsik, @scv119, @alpozcan, @JalinWang, @peterghaddad, @rkooo567, @avnishn, @JoshKarpel, @tekumara, @zcin, @jiwq, @nikosavola, @seokjin1013, @shrekris-anyscale, @ericl, @yuxiaoba, @vymao, @architkulkarni, @rickyyx, @bveeramani, @SongGuyang, @jjyao, @sihanwang41, @kevin85421, @ArturNiederfahrenhorst, @justinvyu, @pleaseupgradegrpcio, @aslonnie, @kukushking, @94929, @jrosti, @MattiasDC, @edoakes, @PRESIDENT810, @cadedaniel, @ddelange, @alanwguo, @noahjax, @matthewdeng, @pcmoritz, @richardliaw, @vitsai, @Michaelvll, @tanmaychimurkar, @smiraldr, @wfangchi, @amogkam, @crypdick, @WeichenXu123, @darthhexx, @angelinalg, @chaowanggg, @GeneDer, @xwjiang2010, @peytondmurray, @z4y1b2, @scottsun94, @chappidim, @jovany-wang, @jaidisido, @krfricke, @woshiyyya, @Shubhamurkade, @ijrsvt, @scottjlee, @kouroshHakha, @allenwang28, @raulchen, @stephanie-wang, @iycheng