Ray-2.56.0

@sai-miduthuri released this 29 Jun 20:32 · 4 commits to releases/2.56.0 since this release · 637fd06

Highlights

  • Ray Data Stability: This release adds several stability improvements: running multiple datasets in one cluster, automatic batch size selection for CPU-based map_batches, and a default logical memory configuration to prevent OOMs. We've also tightened iter_batches stability by reducing hidden buffering and shutting down the executor when consumers exit early (#63660, #63682, #62949), which reduces object-store spilling for common training workloads.
  • Ray Serve: We re-architected Ray Serve LLM by decoupling request handling from the token-streaming response path (#62667, #62680, #62668, #62669, #63167), resulting in significant LLM serving performance improvements. We've also introduced new routing policies: session-sticky routing via consistent hashing with ConsistentHashRouter (#62905, #63096, #62906), and CapacityQueueRouter (#62323), which benefits supply-constrained workloads.
  • Ray Core: We've added GPU-domain-aware placement groups using label locality (#61442, #61614, #62487, #62533). This enables placement groups to pack bundles onto nodes that share a ray.io/gpu-domain label instead of only packing at the single-node level. We've also added initial Kubernetes in-place pod resizing support for Autoscaler v2 (#55961, #62369, #62215), enabling Ray clusters to resize CPU and memory on existing worker pods before scaling out new pods.

Ray Data

🎉 New Features

  • Support multiple datasets per cluster via subcluster labels and resource partitioning (#63331, #63375, #63982)
  • Add Dataset.mix() public API and MixOperator for weighted dataset mixing (#63168, #62450)
  • New DataSourceV2 framework: ParquetDatasourceV2, chunked reader, predicate splitting, listing/scanner infra (#63113, #63454, #63163, #62975, #63027, #62182)
  • Add batch_size='auto' to map_batches to derive batch row count from target row batch size (#62648)
  • Implement distributed upsert for Iceberg using task-based merge algorithm, preventing performance bottleneck on driver (#63482)
  • Add include_row_hash to read_parquet (#61408)
  • Add JAX data iterator (#61630)
  • Expose flag to run read tasks on isolated worker processes via isolate_read_workers (#63490)
  • Expose flag to set default logical memory for map operators via default_map_logical_memory_enabled (#63814)
  • Support predicate pushdown for Lance format (#61400)
  • Support per-partition start_offset and end_offset for read_kafka (#61620)
  • Add obstore async download backend for download operator (#61735)
  • Support UDF retries on transient exceptions (#63023)
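
Weighted dataset mixing, as in the new Dataset.mix() API, can be pictured as repeatedly choosing which source supplies the next item, with probability proportional to that source's weight. The following is a minimal pure-Python sketch of that idea only. It is not Ray's MixOperator, and the name `mix_iterators` and its parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mix_iterators(
    sources: Sequence[Iterable[T]],
    weights: Sequence[float],
    seed: int = 0,
) -> Iterator[T]:
    """Yield items from `sources`, choosing the next source at random in
    proportion to `weights`. Exhausted sources are dropped from the mix."""
    if len(sources) != len(weights):
        raise ValueError("sources and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")

    rng = random.Random(seed)  # seeded so a run is reproducible
    active: List[Iterator[T]] = [iter(s) for s in sources]
    active_weights: List[float] = list(weights)

    while active:
        idx = rng.choices(range(len(active)), weights=active_weights, k=1)[0]
        try:
            yield next(active[idx])
        except StopIteration:
            # This source is empty; remove it and keep mixing the rest.
            del active[idx]
            del active_weights[idx]
```

Each source keeps its internal order; only the interleaving between sources is random. What happens when one source runs out (stop, drop, or repeat) is a design choice; this sketch drops the source.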

💫 Enhancements

  • Fix iter_batches spilling by replacing make_async_gen with iter_threaded and reducing buffered batches (#63660, #63682)
  • Gate restore_original_order in iter_batches behind preserve_order (#63792)
  • Convert drop_columns to a Project logical operator when input schema is known (#63813)
  • Make ConcatAggregation and TurbopufferDatasink use polars for sorting (#61904)
  • Boost and vectorize hash_partition with sort_indices, zero-copy slices, and pandas (#63498, #62757, #63152, #62587)
  • Enable GPU_SHUFFLE in grouped_data.py (#62410)
  • Eager StarExpr expansion, schema inference for non-black-box UDFs, and Expressions struct support (#63776, #63387, #62560)
  • Make logging configurable via RAY_DATA_LOG_LEVEL and log RAY_DATA env vars at execution start (#63487, #63380)
  • Display and track logical memory in progress bar (#63379)
  • Honor compute= in filter(expr=...) and deprecate concurrency= (#63576)
  • Enable filter pushdown through StreamingRepartition and read stage column-rename removal (#62347, #63384, #63582)
  • Cache deserialized Arrow schemas in BlockMetadataWithSchema (#63462)
  • Track scheduling-loop step duration (p50/p90/max), peak USS/object-store memory, and task block locality (#63586, #63345, #63489, #63418, #62249)
  • Replace TaskDurationStats and Timer with DistributionTracker (#63488, #63530, #63825)
  • Introduce BlockEntry on RefBundle in place of (ref, metadata) tuples (#63654)
  • Pre-resolve filesystem in threaded download to avoid IMDS herd (#62898)
  • Convert logical operators to frozen dataclasses and consolidate operator base/repr (#62593, #62568, #62400, #63137, #63140, #63108)
  • Non-blocking default autoscaling coordinator and resource-aware auto-downscaling (#62725, #62574)
  • Release pinned blocks after dataset execution and shut down executor on early DataIterator exit (#62456, #62949)
  • Optimize local shuffle with incremental index and configurable compaction threshold (#62539)
  • Speed up checkpoint filter and reduce memory usage (#60294)
  • Preserve Arrow types through pandas roundtrip and reorder block columns by name before schema ops (#63017, #63582)
  • Block pickle object columns when reading untrusted Parquet and gate unsafe WebDataset deserialization (#63470, #63469)
  • Move backpressure escape hatch across all policies (#63539)
  • Update pandas, modin, and pyarrow minimum versions (#62899)
  • Add utilization monitoring and correct logical resource usage for ActorPool (#61987, #61528)
  • Deprecate ConcurrencyCapBackpressurePolicy, DataIterator.to_torch, and pandas UDF batches (#63392, #62540, #61733)
  • Rank actors per node in a heap and avoid re-exporting actor class via .options (#62309, #62722)
  • read_delta reads from preconfigured pyarrow dataset (#61721)
  • Include column name and target type in ArrowConversionError; reduce arrow conversion warning verbosity (#62407, #61486, #62521)
  • Show external consumer bytes in verbose operator progress log (#63728)
  • Disable DataSourceV2 by default after earlier enabling (#63674, #63326)

🔨 Fixes

  • Rename subcluster label key from __subcluster__ to ray-subcluster (#63982)
  • Fix get_or_create_stats_actor crash in Ray Client mode (#63402)
  • Fix datasource pushdown crashes for generic UDFExpr filter predicates (#63781)
  • Fix hash-shuffle aggregator memory estimation: metadata propagation, node-size clamp, column pruning (#63809)
  • Fix CheckpointConfig FileNotFoundError on Azure Blob Storage (#63606)
  • Fix silent credential drop for fsspec-S3 in download expression (#62897)
  • Fix missing f-string prefix in _concatenate_extension_column (#62939)
  • Fix HashAggregate duplicate group rows for AggregateFnV2 (#63066)
  • Fix JSONL read retry with advanced file cursor (#63233)
  • Fix read_parquet ArrowNotImplementedError for nested column types exceeding ~2GB row group (#61824)
  • Fix read_parquet nested-type fallback and parquet scanner memory accumulation (#63175, #62745)
  • Fix memory leak in DataIterator.to_torch() by switching to PyArrow (#60966)
  • Fix ZipOperator freeing shared blocks via _split_at_indices (#62665)
  • Fix concurrent writes race condition in write_parquet (#62377)
  • Fix GPU shuffle output ordering when using ShuffleStrategy.GPU_SHUFFLE (#62351)
  • Fix incorrect DatasetStat uuid propagation (#62255)
  • Fix None issue when DATA_ENABLE_OP_RESOURCE_RESERVATION=False (#61718)
  • Fix filesystem compatibility check for fsspec-wrapped PyFileSystem (#61850)
  • Forward try_create_dir to pyarrow.dataset.write_dataset (#58302)
  • Fix autoscaler bug blocking timely release of leased resources (#62592)
  • Ensure consistent nan_is_null/nans-as-nulls semantics in encoder (#62623, #62618)
  • Skip unconditional null strip in find_partition_index (#62594)
  • V1 _split_predicate_by_columns correctness fix (#63176)
  • Avoid importing cudf in _is_cudf_dataframe when cudf not loaded (#62302)
  • Revert raw-modulo hash partition fast path (#63097)
  • Remove tfx-bsl support from read_tfrecords (#63245)

📖 Documentation

  • Document isolate_read_workers for read_parquet (#63816)
  • Remove docs recommending increased object store memory proportion (#63389)
  • Update docs minimum version for build_processor and "auto" batch size (#61757, #62790)
  • Remove outdated limitation of DefaultClusterAutoscalerV2 and stale object-store-memory warnings (#62385, #62387)

Ray Serve

🎉 New Features

  • Add custom ingress request router app interfaces and HAProxy ingress dispatch path (#62680, #62668, #62669, #62667)
  • Expose choose_replica/dispatch on deployment handles and AsyncioRouter with replica-side slot reservation (#63255, #63254, #63252)
  • Introduce experimental round robin router and ConsistentHashRouter for session-sticky routing (#63238, #62906, #63096, #62905)
  • Central capacity queue for token-based request routing via CapacityQueueRouter (#62323)
  • Add experimental ray-haproxy support behind RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY (#62589)
  • Add deployment actor context API and broadcast API for deployment handles (#62532, #61472)
  • Add ControllerOptions for configurable controller runtime_env (#63352)
  • Make rolling update percentage configurable (#62160)
  • Support per-request timeout and disconnect in HTTP proxy path (#62867)
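
Session-sticky routing via consistent hashing maps each session ID to a point on a hash ring and sends it to the replica that owns the next point. The sketch below illustrates the general technique in plain Python. It is not the ConsistentHashRouter implementation, and `HashRing`, `pick`, and the `vnodes` parameter are hypothetical names.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it cannot be used for routing that must agree across processes.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, replicas: List[str], vnodes: int = 64) -> None:
        self._points: List[int] = []
        self._owner: Dict[int, str] = {}
        for replica in replicas:
            for i in range(vnodes):
                point = _hash(f"{replica}#{i}")
                self._owner[point] = replica
                bisect.insort(self._points, point)

    def pick(self, session_id: str) -> str:
        """Return the replica owning the first ring point after the key."""
        if not self._points:
            raise ValueError("ring has no replicas")
        idx = bisect.bisect_right(self._points, _hash(session_id))
        return self._owner[self._points[idx % len(self._points)]]
```

The property that makes this useful for sticky sessions is that removing a replica only remaps the sessions that replica owned; every other session keeps its assignment.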

💫 Enhancements

  • HAProxy stability improvements: wait for old workers before drain, redirect stdout/stderr, redispatch+retry-on, coalesce broadcasts, quarantine released ports (#63620, #63621, #63622, #63623, #63628)
  • Bind direct ingress ports to 0.0.0.0 for cross-node HAProxy routing (#62515)
  • HAProxy ingress request router metrics, enable splice by default, TCP_NODELAY default 1, optional retry knobs, RAY_SERVE_HAPROXY_STATS_PORT (#63356, #63531, #63353, #63415, #62979)
  • Resolve bundled ray-haproxy binary before RAY_SERVE_HAPROXY_BINARY_PATH; HAProxy abspath env var (#63829, #62610)
  • Replace socat subprocess with Python socket for HAProxy admin communication; bump HAProxy to avoid CVE-2025-11230 (#61897, #62585)
  • Expose controller health metrics via /api/serve/applications/ API; add max_replicas_per_node to response (#63556, #63234)
  • Run health check on user execution path to detect request-serving stalls (#61621)
  • Mark widely-used APIs as stable (#62932)
  • Retain recently-stopped replica logs in the dashboard (#63678)
  • Add observability logs for pack scheduling decisions (#63603)
  • Gate ingress request router body forwarding behind escape hatch (#63183)
  • Avoid rolling replicas for no-op config overrides (#63034)
  • Gate replica/deployment creation during shutdown (#62761)
  • Defer PG creation for TPU Serve deployments to accelerator backend (#62941)
  • Expose DeploymentStateManager APIs for controller access (#62950)
  • Add tracing support for Windows and gRPC tracing improvements (#62821, #63833)
  • Split node vs requested resources in deployment scheduler (#62778)
  • Defer DEPLOYMENT_TARGETS broadcast while replicas are RECOVERING (#62751)
  • Evict per-deployment LongPollHost state on deployment delete; enable logs when client stops its event loop (#62820, #63028)
  • Add metrics: max replica processing latency, objref resolution latency, serve_autoscaling_target_ongoing_requests (#62381, #62355, #62421)
  • Filter stale bootstrap observations from serve_long_poll_latency_ms (#62868)
  • Retry build_serve_application task on failure (#62987)
  • Scale down non-matching primary-label replicas first (#61488)
  • Refactor internal autoscaling policy state extraction into a single helper (#62452)
  • Catalog Ray Serve env vars (#62006)
  • Remove or raise clear error for deprecated deployment items; remove deprecated DeploymentMode (#63548, #63510)

🔨 Fixes

  • Fix orphaned actors on controller crash during shutdown; drop and replace replicas surviving a controller crash without rank assignment (#62823, #63139)
  • Fix deployment actors creating 15K OS threads for sync actor classes (#62661)
  • Fix gang scheduling PG leak when deployment actors are starting (#62469)
  • Fix app-level autoscaling policy state cross-deployment contamination and state loss for skipped deployments (#62484)
  • Fix Serve autoscaling delay to use wall-clock time (#62144)
  • Fix race condition in multiplex LRU cache update using move_to_end() (#62548)
  • Normalize multiplexed model ID header to support proxy-transformed names (#61869)
  • Fix AttributeError when request_router is None in update_deployment_config (#63180)
  • Fix potential UnboundLocalError in ActorReplicaWrapper.check_stopped() (#63339)
  • Fail loud when ingress request router dispatch fails (#63215)
  • Fix stale _global_client cache across driver sessions (#62368)
  • Fix start_metrics_pusher crash when deployment has record_autoscaling_stats but no autoscaling config (#62123)
  • Fix high-cardinality namespace tag on long poll metrics (#62386)
  • Fix Java long poll timeout serialization (#61875)
  • Avoid destructor error when FastAPI ingress init fails (#62172)
  • Avoid proxy readiness future timeout race (#62194)
  • Avoid self-cause on non-gRPC replica exceptions (#62412)
  • Fix HAProxy startup timeout propagation (#61752)
  • Include ingress_request_router.lua.tmpl in package_data (#63145)
  • Revert support for root_path parameter across uvicorn versions (#62529)
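
For context on the multiplex LRU cache fix, an LRU cache built on `collections.OrderedDict` can refresh an entry's recency with `move_to_end()` instead of removing and re-inserting it, so the entry is never momentarily missing. The sketch below shows that pattern under a lock. It is an illustration only, not the Serve multiplexing code, and `LRUCache` and `get_or_load` are hypothetical names.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class LRUCache:
    """Small thread-safe LRU cache keyed by model ID (or any hashable)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get_or_load(self, key: Hashable, loader: Callable[[Any], Any]) -> Any:
        with self._lock:
            if key in self._items:
                # Refresh recency in place; the entry stays in the dict
                # the whole time, unlike a pop-then-reinsert update.
                self._items.move_to_end(key)
                return self._items[key]
            value = loader(key)
            self._items[key] = value
            if len(self._items) > self._capacity:
                self._items.popitem(last=False)  # evict least recently used
            return value

    def keys(self) -> List[Hashable]:
        with self._lock:
            return list(self._items)
```

Holding the lock while `loader` runs keeps the sketch short; a real cache would likely load outside the lock to avoid blocking other keys.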

📖 Documentation

  • Add round robin and consistent hashing router documentation (#63636)
  • Introduce gang scheduling documentation (#61737)
  • Add deployment scope actor docs (#62735)
  • Add Kuberay guide for RayService with HAProxy and High Throughput mode (#62408)
  • Add Ray Serve office hours invite into documentation (#62176)

Ray Train

🎉 New Features

  • Add LoggingConfig for configuring the ray.train logger on controller and workers (#61550)
  • Allow DataParallelTrainer's train_fn to return data (#62021)
  • Add async checkpointing/validation with Torch Lightning (#62370)

💫 Enhancements

  • Report time spent syncing and transferring checkpoints to storage in ray.train.report(checkpoint) (#62027)
  • Block until create_or_update_train_run completes on Train initialization (#63432)
  • Implement DatasetManager (#63309)
  • Forward label_selector to AutoscalingCoordinator (#63287)
  • Add log line before launching training function (#62911)
  • Allow contextlib.redirect_stdout() to bypass print redirect to logs (#61075)
  • Add timeouts to validation functions of ray.train.report (#62916)
  • Prevent ray.train.report from hanging across replica group restarts; let Ray Train manage replica group restarts (#62651, #61475)
  • Swallow RayTaskError during BackendSetupCallback shutdown (#63143)
  • Improve JaxTrainer TPU multi-slice fault tolerance and reservation ergonomics (#62893)
  • Export default data execution options (#62784)
  • Consolidate Train run metadata sanitization and improve readability (#63182)
  • Fix PlacementGroupCleaner race condition: drain queue before cleanup on controller death (#62754)
  • Harden against unsafe pickle deserialization (#62807)
  • Raise error when checkpoint is within experiment directory and delete_local_checkpoint_after_upload=True (#62555)
  • Add timeout_s to ray.train.get_all_reported_checkpoints (#61761)
  • Change remaining pytorch_lightning imports (#61291)
  • Make controller resilient to errors in all lifecycle hooks (#60900)
  • Remove Predictor from Train v1 (#63461)

🔨 Fixes

  • Fix missing comma in DataBatchType Union type (#63872)
  • Handle Arrow-backed pandas dtypes in LightGBM examples (#63427)
  • Fix exclude_resources regression for V1 Train + V2 cluster autoscaler (#62827)
  • Add missing %s to logger.debug (#63039)
  • Increase get_actor timeout (#62516)

📖 Documentation

  • Document S3-compatible storage (#63103)
  • Add Azure Files to persistent storage docs (#63406)
  • Uncomment Result.from_path in docs (#62887)
  • Document how to tune async validation (#62227)
  • Document why validation runs need unique names (#62224)

Ray Tune

💫 Enhancements

  • Fix Tune search for Python 3.14 (#63575)
  • Modernize AxSearch for Ax Platform 1.0.0+ (#60522)
  • Use built-in inspect for argument capture (#60049)

🔨 Fixes

  • Fix import count in CIFAR PyTorch tutorial (#62756)

Ray LLM

🎉 New Features

  • Major Ray Serve LLM performance improvement with direct streaming (#63167, #63468, #63779)
  • TPU support: Add topology field to LLMConfig for multi-host TPU support (#61906)
  • Add per-host bundles default and fix fractional TPUs for TPUAccelerator (#63177)
  • Enable Ray Serve LLM session-stickiness routing policy via RAY_SERVE_SESSION_ID_HEADER_KEY (#63362)

💫 Enhancements

  • Upgrade vLLM to 0.22.0 (#63730, #63396, #62970, #62349)
  • Co-locate DP rank 0 worker with advertised master address (#63803)
  • Add pick-only fast path to AsyncioRouter for LLM ingress (#63517)
  • Replace LLM ingress router replica selection with choose_replica; don't fetch LLMConfig from replicas at startup (#63280, #63065)
  • Promote max_tasks_in_flight_per_actor to a first-class config field and adjust defaults (#63214)
  • Validate accelerator_type against CPU-only configs; replace GPUType alias with AcceleratorType (#62139, #62978)
  • Add rate-limiter for per-request traceback spam (#62440)
  • Promote SGLang integration to user guide and move engine to _internal (#62570)
  • Lazy-load batch stage/processor submodules and make boto3/botocore imports lazy (#62861, #62383)
  • LLM telemetry bugfixes (#63782)

🔨 Fixes

  • Fix flaky GPU-0 worker and NIXL port collisions (#63810)
  • Fix P/D direct streaming OpenAI routing (#63679)
  • Remove guided_decoding, truncate_prompt_tokens, build_llm_processor (#63569)
  • Fix misleading ImportError when vLLM is installed but fails to import (#63305)
  • Fix max_pending_requests default to track vLLM's GPU-dependent max_num_seqs (#62918)
  • Fix HF config loading for models with custom rope_scaling (#62464)
  • Wait for request router init in LLMRouter constructor (#63206)
  • Materialize chat completion message content in sanitizer (#63119)
  • Fix lora_request not forwarded to vLLM engine + add regression tests (#62609)
  • Fix SGLangEngineProcessor telemetry for trust_remote_code models (#62102)
  • Fix TOKENIZER_ONLY downloads missing chat_template for S3-backed models (#62121)
  • Fix SGLang chat tokenize to respect add_generation_prompt (#61688)
  • Fix bool serialization in benchmark_vllm CLI builder (#63516)

📖 Documentation

  • Document multimodal pixel-budget gotchas and vLLM compatibility (#63593)
  • Add tokenization disaggregation documentation (#62494)
  • Add benchmark docs and refactor into submodules (#62204)
  • Remove VLLM_USE_V1 from docs and examples (#63001)
  • Fix wrong documented default for max_tasks_in_flight_per_actor (#62917)

Ray RLlib

🎉 New Features

  • Add custom_resources_per_learner config and custom_resources_for_main_process to AlgorithmConfig (#63303, #62475)
  • Add Importance Sampling APPO metrics to the torch learner (#63675)

💫 Enhancements

  • Put only one copy of weights into the object store (#63529)
  • Handle the all-evaluation-workers-unhealthy case uniformly across modes (#63128)
  • Stop IMPALA/APPO learner thread gracefully to avoid misleading error messages (#62763)
  • Improve invalid input error messages (#62324)

🔨 Fixes

  • Fix two substantial edge cases in PPO's value target calculation (#59958)
  • Fix EnvRunner crash loops (#62884)
  • Fix extra model outputs hanging val indexing (#62960)
  • Fix ValueError in MultiAgentEpisode.get_rewards() when an agent is inactive for all requested env steps (#62907)
  • Preserve Torch optimizer param-group scalar types on restore (#61937)
  • Fix wrong assert variable in _update_env_seed_if_necessary (#61823)
  • Maintain value in EMAStat (#63064)

📖 Documentation

  • Clarify extra model output docstrings (#63524)

Ray Core

🎉 New Features

  • Add support for Furiosa AI NPU (#63035) and register_collective_backend API for custom collective backends (#60701)
  • In-place pod resizing (IPPR) on Kubernetes 1.35: initial implementation and standalone KubeRay IPPR provider (#55961, #62369, #62215)
  • Label locality support: GPU-domain-aware placement groups, autoscaler proto changes, and state API observability (#61442, #61614, #62487, #62533)
  • Publish platform events via Ray Event Recorder and support single-event emission in the Python layer (#63329, #60858)
  • Autoscaler v2: priority-based worker group selection (#62997) and noDriverTimeoutSeconds for KubeRay cluster termination (#63465)
  • RDT: concurrent one-sided transfers for multiple ObjectRefs in ray.get (#61773), retry support (#62842), and NIXL memory deregistration via deregister_nixl_memory (#62341)
  • Support .tar.gz archives for remote working_dir URIs (#62813)
  • Add IPv6 localhost and all-interfaces support (#60023)

💫 Enhancements

  • Resource isolation: event-based memory monitor, multi-memory-monitor factory, time-based group killing policy, idle-worker prioritization, system/user slice bounds, and OOM policy tuning (#62060, #62705, #62643, #62378, #62168, #63521, #63324, #63067, #62957)
  • Compute per-component memory usage in MiB (#63932) and add host vs container memory distinction to memory panels (#63111)
  • Consider cgroup limit when fetching CPU (#63685) and correct worker OOM score adjustment logic (#62470)
  • Replace NodeAffinitySchedulingStrategy with Label Selector API when soft=False (#54940)
  • Improve SlicePlacementGroup lifecycle and support explicit bundle_label_selector for TPUs (#63171); add TPU head resource for Ironwood TPU (#62786), chips_per_vm arg (#62526), and v6e single-host fixes (#62306)
  • Batch placement group bundle removal RPCs (#63839); remove PG resource deduction from GCS in favor of resource broadcast (#63723)
  • Migrate Raylet/GCS timing logic to a shared ClockInterface with a fake clock for testing (#62562, #62502, #62476)
  • Refactor asio build targets and add IOContextMonitor; run GCS health check on io_service (#63042, #63166, #62608, #62374)
  • Autoscaler v2 performance: skip serializations for debug logs (#63778); accept fractional resource values in request_resources (#63306)
  • Reduce traffic: halve task arg pubsub by skipping redundant raylet pull (#62583), avoid extra memcpy when spilling fused objects (#63653), and resolve task dependencies synchronously when objects exist (#62561)
  • Improve inspect_serializability messages and traversal context (#63501, #63373, #63258); better worker startup error messages (#63714)
  • Warn when runtime_env package approaches upload size limit (#63404); harden zip extraction path containment (#63786, #62813)
  • Include owner node ID in OwnerDiedError (#63727); add dependency info to taskspec debug string (#62316)
  • Add unexpected worker failure metric and dashboard panel (#62297); group observability APIs in ray CLI help (#62748)
  • Normalize OTel metric labels before Prometheus export (#63744) and retry/log when Prometheus queries fail (#63578); add GPU usage instance filter (#62214)
  • Move observability and control-plane pubsub to dedicated services and rename InternalPubSub* to ControlPlanePubSub* (#62806, #63044, #62461)
  • AMD GPU: replace rocm-smi ctypes binding with amd-smi Python interface (#62393); detect NVIDIA Blackwell consumer GPUs (#63322)
  • Add task retry delay for ACTOR_UNAVAILABLE retries (#62330); improve State API filter key handling (#63638)
  • Patch setproctitle to skip launch services IPC calls (#63366); add timeout for first redis probe (#63148)
  • Clarify head node commands in ray up output (#63409); pass logging_config through Ray Client ray.init (#62192)
  • Print subprocess log tails with exit codes on unexpected exit (#61905); add warning log when GPU profiling command times out (#63706)
  • Add unique suffix to log filenames (#62365); disable profiling endpoints by default (#62531)
  • Remove pydantic v1 support (#62716); update Starlette to v1.0.1 (#63722)
  • Deprecate DAGNode.execute() (#63716); remove experimental _owner support for ray.put (#63520)
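
As background for the zip extraction hardening, path containment means checking that every archive member resolves to a location inside the destination directory before anything is written. The following is a minimal sketch of that check using only the standard library. It is not Ray's runtime_env code, and `safe_extract` is a hypothetical helper name.

```python
import os
import zipfile


def safe_extract(archive: zipfile.ZipFile, dest_dir: str) -> None:
    """Extract `archive` into `dest_dir`, rejecting any member whose
    resolved path would land outside `dest_dir`."""
    dest_root = os.path.realpath(dest_dir)
    for member in archive.infolist():
        # Resolve "..", absolute names, and symlinked parents up front.
        target = os.path.realpath(os.path.join(dest_root, member.filename))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"unsafe path in archive: {member.filename!r}")
    # All members validated; extract only after the full scan passes.
    archive.extractall(dest_root)
```

Validating the whole archive before extracting anything avoids leaving a partially extracted directory behind when a bad member appears late in the listing.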

🔨 Fixes

  • Fix ray.get hanging forever when an object's owner dies during pull (#63694); resolve ReferenceCounter race on WORKER_REF_REMOVED_CHANNEL (#60495)
  • Fix resource leaks in subprocess management (#63878) and runtime_env cache not detecting changes in -r-referenced requirements files (#63403)
  • Fix replica actor zombie process after GCS restart (#63764); fix actor creation race condition (#62994); fix actor state counter bug (#63647)
  • Fix placement groups with label domain stuck on the infeasible queue (#62483); log status for failed PG PrepareResources/CommitResources (#62836)
  • Fix env var expansion in ray job submit CLI via shlex.join (#63797) and --working-dir for local zip files and http:// URLs (#62843)
  • Surface WebSocket close codes and errors in job log streaming (#63364); fix ray stop failing to terminate dashboard/runtime_env agents on Windows (#62428)
  • Fix ray down not stopping Docker containers on worker nodes for local clusters (#62169); fix delayed/missing worker logs in Jupyter by flushing stdout/stderr (#63599)
  • Fix Python log monitor handling for same-inode truncated files (#63720); avoid os.getcwd() on import by lazily evaluating scratch_dir (#63040)
  • Fix accelerator detection on NVIDIA Blackwell consumer GPUs (#63322); avoid FabricManager stall on NVLink systems in GpuProfilingManager (#63312)
  • Fix POSIX semaphore crash in experimental mutable objects (#62328); fix overflow on exponential backoff multiplication (#62366)
  • Fix OOM kill message wrong threshold with resource isolation (#62948); fix OpenTelemetryMetricRecorder singleton init guard (#63081)
  • Fix MarkFootprintAsBusy clearing saved idle state for unrelated items (#62588); fix HandleIsLocalWorkerDead for drivers (#62688)
  • Fix AttributeError on trace in client mode (#62955); fix IndexError in legacy post-mortem debugging (#61479)
  • Keep strong references to fire-and-forget asyncio tasks (#63291); validate JobConfig code_search_path type (#62499)
  • Fix uv existence check in UVProcessor (#62818); fix invalid default stats factory in ClusterStatus (#62934)
  • Fix autoscaler v2 instance_type_name in autoscaling state (#62101) and stopped-node metric double counting (#62026)
  • Fix ReadOnlyProviderConfigReader max_workers counting bug (#62819); fix circular import in ray_print_logs thread (#63410)
  • Fix wrong container in spill-fusion threshold check (#63605); avoid emitting idle worker failure for unregistered failed workers (#62789)
  • Avoid return in finally block (Python 3.14 SyntaxWarning) (#63742); fix typos and replace type() checks with isinstance() (#62154)

📖 Documentation

  • Add "bring your own transport" docs page for RDT (#60308); doc changes for label locality support (#62551)
  • Fix misleading docstrings on drain_node APIs (#62942); update outdated description for max_direct_call_object_size (#63164)

Dashboard

🎉 New Features

  • Add Platform Events module with K8s event ingestion/caching and frontend UI (#62314, #63332)
  • Show TPU stats on the Cluster tab (#63774)

💫 Enhancements

  • Add py-spy --idle and --subprocesses flags to profiling endpoints (#63852)
  • Pass Grafana cluster filter to Serve metrics URLs (#63211)
  • Show last data load time (#63618)
  • Add Name column to Jobs view from job_name metadata (#62257)
  • Mask password arguments in get_entrypoint_name() to prevent password exposure (#61995)
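
To illustrate what masking password arguments in an entrypoint string can look like, here is a small sketch that redacts the value following a sensitive flag. It is not the dashboard's get_entrypoint_name() logic; the function name `mask_secrets`, the flag list, and the `REDACTED` placeholder are all assumptions made for the example.

```python
import shlex
from typing import List

# Hypothetical set of flags whose values should never be displayed.
SENSITIVE_FLAGS = ("--password", "--token")
MASK = "REDACTED"


def mask_secrets(entrypoint: str) -> str:
    """Return `entrypoint` with values of sensitive flags replaced."""
    masked: List[str] = []
    hide_next = False
    for tok in shlex.split(entrypoint):
        if hide_next:
            # Previous token was a sensitive flag in "--flag value" form.
            masked.append(MASK)
            hide_next = False
            continue
        flag, sep, _value = tok.partition("=")
        if flag in SENSITIVE_FLAGS:
            if sep:
                masked.append(f"{flag}={MASK}")  # "--flag=value" form
            else:
                masked.append(flag)
                hide_next = True
            continue
        masked.append(tok)
    return shlex.join(masked)
```

For example, `mask_secrets("python job.py --password hunter2 --epochs 3")` returns `python job.py --password REDACTED --epochs 3`.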

🔨 Fixes

  • Fix TPU metrics (#63998)
  • Guard against zero num_cpus in k8s_utils.cpu_percent (#63729)
  • Fix invalid PromQL when global_filters is empty in Grafana dashboard generation (#63687)
  • Fix unexpected log line details pop-up in log viewer UI (#62637)

Ray Wheels and Images

  • Bumped the Ray version for the 2.56.0 release.
  • Bumped the minimum Python version in pyproject.toml (#62569).
  • Added TPU release images (#62113) and updated the TPU Docker image base dependencies (#63006).
  • Added a torchft image for Torch trainer tests (#63361) and ran apt-get upgrade for slim base images (#62666).
  • Numerous dependency lockfile and CI image updates (raydepsets migration, depset regeneration across core/ML/RLlib/docs/macOS CI images).

Documentation

  • Established doc/redirects/current.yaml as the redirects source of truth with legacy-version 404 redirect coverage (#63367, #63880).
  • Added an agent context guide for Ray documentation and an ipython3 lexer hook for notebook shell/magic cells (#63227, #63515).
  • Added Sphinx /llms.txt and /llms-full.txt generation, excluding Jupyter notebooks (#63130, #63228).
  • Upgraded doc toolchain: pydata-sphinx-theme 0.17.1, myst-nb 1.4.0, added sphinxext-opengraph, unpinned yanked tf-keras (#63344, #63360, #63343, #63358).
  • Banned new .rst files under doc/source/ and added CI to skip RTD builds for PRs that don't touch docs (#63057, #63431).
  • Added meta descriptions to ray-contribute pages and anonymized personal paths in Tune notebook outputs (#63832, #63464).
  • Tune: updated deprecated sample_from examples to config-dict style and documented time_attr scheduler values (#63804, #32467).
  • RLlib: clarified DQN hiddens as dueling-only, removed a broken parametric-actions link, fixed broken doc links (#43051, #54671, #47146).
  • Ray Data: added a map_batches shuffle section, streaming generator docs, and fixed a broken README link (#62576, #63791, #63412).
  • Ray Train: documented iter_jax_batches for JaxTrainer and updated TPU scaling config docs (#63294, #62584).
  • Kubernetes/TPU: added a GKE Gateway ingress example, fixed the GKE TPU guide, and replaced deprecated example images (#63546, #63209, #63019).
  • Added a RayCronJob quick-start guide and clarified KAI Scheduler RayJob submission modes (#62151, #61332).
  • Added a Slurm guide for running Ray inside Docker containers (#63221).
  • Documented AutoscalingConfig replica/target fields and corrected max_calls default docs (#48601, #63894).

Dependencies

This is the last Ray release to support the dependency versions listed below. For the 2.57.0 release, Ray will raise its minimum required versions for several core dependencies. If your environment pins any of these packages below the new minimums, plan to upgrade before moving to the next Ray release.

Dependency   Last supported in this release   New minimum (next release)
numpy        < 2.1                            >= 2.1
protobuf     < 5.26                           >= 5.26
pandas       < 2.2.3                          >= 2.2.3
pyarrow      < 18.0.0                         >= 18.0.0
pydantic     < 2.9                            >= 2.9
grpcio       < 1.66                           >= 1.66
scipy        (previously unpinned)            >= 1.14.1

Most users on recent releases of these packages are unaffected.
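
One way to check an environment against the new minimums ahead of time is a short script like the one below. It is a sketch that uses only the standard library and compares the numeric release segment of each installed version; it ignores pre-release, post-release, and epoch markers, so treat the result as a rough guide. The helper names are hypothetical.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# New minimums listed in the dependency table for the next release.
NEXT_MINIMUMS: Dict[str, str] = {
    "numpy": "2.1",
    "protobuf": "5.26",
    "pandas": "2.2.3",
    "pyarrow": "18.0.0",
    "pydantic": "2.9",
    "grpcio": "1.66",
    "scipy": "1.14.1",
}


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric segment, e.g. '2.2.3rc1' -> (2, 2, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in NEXT_MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Calling `check_environment()` in the target environment returns one entry per package; any `False` entry marks a package to upgrade before moving to the next Ray release.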

Thanks

Many thanks to all those who contributed to this release!

@khluu, @Krishnachaitanyakc, @leewyang, @ssam18, @christian-pinto, @Hyunoh-Yeo, @hango880623, @yuanzhuoyang1-bit, @marwan116, @aaronscalene, @tianyi-ge, @TriNguyen1208, @andrewsykim, @leonaIee, @OneSizeFitsQuorum, @AksodFlare, @limarkdcunha, @dayshah, @jade710, @pedrojeronim0, @dev-miro26, @DonPalius, @TimothySeah, @abrarsheikh, @nathon-lee, @prince8273, @Bye-legumes, @rayhhome, @Yunnglin, @spencer-p, @ryanaoleary, @herin049, @stephanie-wang, @liulehui, @slxswaa1993, @psaikaushik, @cyhapun, @tdat1465, @akyang-anyscale, @chenshi5012, @zzchun, @ryankert01, @EagleLo, @mzjp2, @justinvyu, @petern48, @YuangGao, @sjp611, @wingkitlee0, @AndySung320, @dstrodtman, @Accurio, @JasonLi1909, @peterjc123, @eicherseiji, @kyuds, @Chong-Li, @joaquinhuigomez, @IrvinFan, @XuQianJin-Stars, @AJamesPhillips, @harshit-anyscale, @claytonlin1110, @nhquana2, @Rruop, @win5923, @raulchen, @rohankmr414, @andrew-anyscale, @YoyinZyc, @doanxem99, @liujp, @dancingactor, @Evelynn-V, @SohamRajpure, @dragongu, @ShockYoungCHN, @ljstrnadiii, @WFY123wfy, @axreldable, @pseudo-rnd-thoughts, @H4ck2, @mvcb, @xinyuangui2, @edoakes, @ankushbbbr, @ps2181, @dominikkawka, @vinhuytran0810-cell, @siyuanfoundation, @MengjinYan, @Chronostasys, @jeffreywang88, @lalitc375, @sampan-s-nayak, @ArturNiederfahrenhorst, @srini047, @ChangyuWang, @adam360x, @Yicheng-Lu-llll, @thakoreh, @Aydin-ab, @manhld0206, @oab24413gmai, @ayushk7102, @tycao0338-cpu, @slfan1989, @myandpr, @rueian, @ans9868, @Ziy1-Tan, @elliot-barn, @as-jding, @daiping8, @robertnishihara, @MatthewCWeston, @Cursx, @laysfire, @karticam, @Mr-Neutr0n, @jjyao, @zent1n0, @aslonnie, @DenBuzz, @michael-pryor, @goanpeca, @nadongjun, @ronny-anyscale, @GoparapukethaN, @werkt, @carolynwang, @kamil-kaczmarek, @madiyar-wayve, @peterxcli, @pqkzzz, @Future-Outlier, @iamjustinhsu, @micah-yong-ai, @wxwmd, @owenowenisme, @sai-miduthuri, @lonexreb, @prassanna-ravishankar, @wanadzhar913, @kouroshHakha, @tobby168, @johntaylor-cell, @richabanker, @Kunchd, @vincere-mori, @vaishdho1, 
@wenhaozhao011-cmd, @bveeramani, @bittoby, @Phucvt123, @aschuh-hf, @RudrenduPaul, @xyuzh, @Sparks0219, @yancanmao, @eureka0928, @yuhuan130, @goutamvenkat-anyscale, @Zerui18, @machichima, @Lucas61000, @weimingdiit, @xi377266, @EmaFerrao, @awen11123, @Lawson-Darrow, @suppagoddo