[serve] Refresh GCS node-info cache off the control loop (async) by johntaylor-cell · Pull Request #64510 · ray-project/ray

johntaylor-cell · 2026-07-02T15:59:24Z

At high replica counts with direct ingress, the controller refreshes GCS node info on every control-loop iteration (ClusterNodeInfoCache.update -> get_all_node_info / get_all_resource_usage), and that refresh is a synchronous RPC on the event loop. When the query exceeds its deadline during cluster churn (~12K+ nodes/replicas), the blocking call freezes the control loop for the full duration of the RPC, the cache goes stale, and direct-ingress / HAProxy backends empty out.
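To illustrate the failure mode described above, here is a minimal, self-contained sketch (not Ray code). `slow_rpc` and `count_ticks` are hypothetical names, and `time.sleep` stands in for a slow synchronous GCS query. A heartbeat coroutine plays the role of the control loop: it makes no progress while the call blocks the event loop, but keeps ticking when the same call runs in an executor thread.

```python
import asyncio
import time


def slow_rpc(seconds: float) -> str:
    # Stand-in for a slow synchronous RPC. time.sleep releases the GIL,
    # much like a blocking call that waits on the network.
    time.sleep(seconds)
    return "reply"


async def count_ticks(blocking: bool, rpc_seconds: float = 0.3) -> int:
    """Return how many heartbeat ticks happened while the RPC was in flight."""
    ticks = 0
    done = False

    async def heartbeat() -> None:
        nonlocal ticks
        while not done:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start and park on its first sleep

    if blocking:
        # Runs on the event-loop thread: nothing else can be scheduled.
        slow_rpc(rpc_seconds)
    else:
        # Runs on a worker thread: the event loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_rpc, rpc_seconds)

    ticks_during_rpc = ticks
    done = True
    await hb
    return ticks_during_rpc


if __name__ == "__main__":
    print("ticks while blocking:", asyncio.run(count_ticks(blocking=True)))
    print("ticks with executor:", asyncio.run(count_ticks(blocking=False)))
```

In the blocking case the heartbeat never gets a turn until the call returns, which is the freeze the description refers to.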

Add RAY_SERVE_ASYNC_NODE_INFO (default off) and refactor update() into two steps: a pure _compute_snapshot() (no self-mutation, safe to run in a thread) and _apply_snapshot() (atomic assignment on the event-loop thread). When the flag is on, a background _node_info_refresh_loop runs _compute_snapshot via run_in_executor. The GCS calls release the GIL (with nogil), so the control loop keeps running while a slow reply is in flight, and reads always see the last complete snapshot. At most one refresh is in flight at a time, and the synchronous per-step update() is gated off when async is on.

Behavior is unchanged with the flag off: update() still computes and applies the snapshot synchronously, with the same result as before. At high node/replica counts it is safe to raise RAY_GCS_RPC_TIMEOUT_S once async is on, since a longer query no longer blocks the loop.
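The compute/apply split can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `NodeInfoCacheSketch`, `NodeSnapshot`, and the injected `fetch_nodes` callable are hypothetical, and only the method names `update`, `_compute_snapshot`, `_apply_snapshot`, and `refresh_async` are taken from the PR text (their real signatures may differ).

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class NodeSnapshot:
    # Immutable, so readers can never observe a half-built snapshot.
    alive_nodes: Tuple[Tuple[str, str], ...]
    alive_node_ids: FrozenSet[str]


class NodeInfoCacheSketch:
    def __init__(self, fetch_nodes: Callable[[], List[Dict]]):
        # fetch_nodes is a hypothetical stand-in for the blocking GCS query.
        self._fetch_nodes = fetch_nodes
        self._snapshot: Optional[NodeSnapshot] = None
        self._refresh_in_flight = False

    def _compute_snapshot(self) -> NodeSnapshot:
        # Pure: builds a new object and never mutates self, so it is safe
        # to run on a worker thread.
        nodes = self._fetch_nodes()
        alive = tuple((n["node_id"], n["ip"]) for n in nodes if n["alive"])
        return NodeSnapshot(alive, frozenset(node_id for node_id, _ in alive))

    def _apply_snapshot(self, snapshot: NodeSnapshot) -> None:
        # Single reference assignment, done on the event-loop thread.
        self._snapshot = snapshot

    def update(self) -> None:
        # Synchronous path (flag off): compute and apply in one step.
        self._apply_snapshot(self._compute_snapshot())

    async def refresh_async(self) -> bool:
        # Async path (flag on): at most one refresh in flight at a time.
        if self._refresh_in_flight:
            return False
        self._refresh_in_flight = True
        try:
            loop = asyncio.get_running_loop()
            snapshot = await loop.run_in_executor(None, self._compute_snapshot)
            self._apply_snapshot(snapshot)
            return True
        finally:
            self._refresh_in_flight = False

    def get_alive_node_ids(self) -> FrozenSet[str]:
        # Readers always see the last complete snapshot (or empty before any).
        return self._snapshot.alive_node_ids if self._snapshot else frozenset()
```

Because the snapshot is replaced by one assignment rather than updated field by field, a reader on the event loop sees either the old snapshot or the new one, never a mix.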

gemini-code-assist

Code Review

This pull request introduces an asynchronous mechanism to refresh the cluster node-info cache off the main control loop, preventing slow GCS queries from blocking the controller. The changes include adding a non-blocking refresh_async method to ClusterNodeInfoCache and running a background refresh loop in the controller when RAY_SERVE_ASYNC_NODE_INFO is enabled. The feedback recommends using run_background_task instead of asyncio.ensure_future to avoid garbage collection of the background task, utilizing the get_env_float_positive helper for environment variable parsing, and adding a type hint to the alive_id_set parameter.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: john.taylor <john.taylor@anyscale.com>

johntaylor-cell requested a review from a team as a code owner July 2, 2026 15:59

gemini-code-assist Bot reviewed Jul 2, 2026

Comment thread python/ray/serve/_private/controller.py Outdated

Comment thread python/ray/serve/_private/constants.py Outdated

Comment thread python/ray/serve/_private/cluster_node_info_cache.py

johntaylor-cell force-pushed the serve-async-node-info branch 2 times, most recently from 3e96a8d to dcb3ce9 Compare July 2, 2026 17:17

johntaylor-cell force-pushed the serve-async-node-info branch from dcb3ce9 to 2f8d7fb Compare July 2, 2026 17:19

johntaylor-cell added the "go" label (add ONLY when ready to merge, run all tests) Jul 2, 2026

ray-gardener Bot added the "serve" label (Ray Serve Related Issue) Jul 2, 2026

ci: re-run premerge (flaky test_metrics/test_grpc shard, unrelated)

f08244a

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: john.taylor <john.taylor@anyscale.com>

johntaylor-cell added the performance label Jul 2, 2026

johntaylor-cell requested a review from abrarsheikh July 3, 2026 20:05

johntaylor-cell wants to merge 2 commits into ray-project:master from johntaylor-cell:serve-async-node-info
