[serve] Fixes Router skips cache invalidation on gRPC request failure #63272

Closed
wanadzhar913 wants to merge 1 commit into ray-project:master from wanadzhar913:bugfix/grpc-cache-invalidation-63261

@wanadzhar913 wanadzhar913 commented May 11, 2026

Description

This PR addresses an issue with the gRPC transport: after a gRPC request fails, the AsyncioRouter's request-completion callback never invalidates the queue-length cache entry for the failed replica.

Hence, here's my approach:

  • I'm proposing that the main change be in gRPCReplicaResult.add_done_callback, in python/ray/serve/_private/replica_result.py.
  • It should wrap the completion callback for gRPC calls: if await call.code(), run on the gRPC loop, returns grpc.StatusCode.UNAVAILABLE, it passes an ActorUnavailableError to the router callback.
  • AsyncioRouter._process_finished_request should then handle that error by invalidating the replica's queue-length cache entry via on_replica_actor_unavailable.
  • Tests should cover the gRPC callback mapping and the router cache invalidation path, in test_grpc_replica_result.py and test_router.py.
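The mapping described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StatusCode`, `ActorUnavailableError`, and `FakeCall` are stand-ins defined here so the example runs without grpc or Ray installed, and `process_done_callback` is a hypothetical free-function version of the wrapper.

```python
import asyncio
import enum
from typing import Any, Callable, List


class StatusCode(enum.Enum):
    """Stand-in for grpc.StatusCode (only the members used here)."""

    OK = 0
    UNAVAILABLE = 14


class ActorUnavailableError(Exception):
    """Stand-in for Ray's ActorUnavailableError."""


class FakeCall:
    """Hypothetical stub for a completed grpc.aio.Call."""

    def __init__(self, status: StatusCode) -> None:
        self._status = status

    async def code(self) -> StatusCode:
        return self._status


async def process_done_callback(
    call: FakeCall, callback: Callable[[Any], None]
) -> None:
    """Hand the callback an error if the call ended as UNAVAILABLE."""
    if await call.code() is StatusCode.UNAVAILABLE:
        callback(ActorUnavailableError("replica unavailable"))
    else:
        # Any other status: pass the call through unchanged.
        callback(call)


def run_demo() -> List[Any]:
    received: List[Any] = []
    asyncio.run(
        process_done_callback(FakeCall(StatusCode.UNAVAILABLE), received.append)
    )
    asyncio.run(process_done_callback(FakeCall(StatusCode.OK), received.append))
    return received
```

With this shape, a router callback that already checks for ActorUnavailableError would not need to know anything about gRPC status codes.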

Related issues

Fixes #63261

Additional information

Tests:
python3 -m pytest python/ray/serve/tests/unit/test_grpc_replica_result.py python/ray/serve/tests/unit/test_router.py -q

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
@wanadzhar913 wanadzhar913 requested a review from a team as a code owner May 11, 2026 17:17
@wanadzhar913 wanadzhar913 marked this pull request as draft May 11, 2026 17:17
@wanadzhar913 wanadzhar913 changed the title from "fixes https://github.com/ray-project/ray/issues/63261" to "[serve] Fixes Router skips cache invalidation on gRPC request failure" May 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies gRPCReplicaResult to map gRPC UNAVAILABLE status codes to ActorUnavailableError within completion callbacks, which allows the router to invalidate its queue length cache. Unit tests were added to verify this error mapping and the resulting cache behavior. The reviewer identified a thread-safety concern where the callback might be executed on a background thread, potentially leading to race conditions in the router, and provided a code suggestion to ensure the callback runs on the caller's asyncio loop.

Comment on lines 582 to +590
     def add_done_callback(self, callback: Callable):
-        self._call.add_done_callback(callback)
+        def wrapped_callback(call: grpc.aio.Call):
+            self._grpc_call_loop.call_soon_threadsafe(
+                lambda: self._grpc_call_loop.create_task(
+                    self._process_done_callback(call, callback)
+                )
+            )
+
+        self._call.add_done_callback(wrapped_callback)

high

The callback is executed on self._grpc_call_loop. When _on_separate_loop is enabled for gRPC transport, this loop runs in a background thread. However, the AsyncioRouter (which typically provides this callback) and its internal state (like the queue-length cache) are not thread-safe. Executing the callback on a background thread can lead to race conditions when invalidating the cache or updating metrics in the router.

Consider capturing the current asyncio loop when add_done_callback is called and ensuring the callback is executed on that loop using call_soon_threadsafe.

Suggested change
-    def add_done_callback(self, callback: Callable):
-        self._call.add_done_callback(callback)
-        def wrapped_callback(call: grpc.aio.Call):
-            self._grpc_call_loop.call_soon_threadsafe(
-                lambda: self._grpc_call_loop.create_task(
-                    self._process_done_callback(call, callback)
-                )
-            )
-        self._call.add_done_callback(wrapped_callback)
+    def add_done_callback(self, callback: Callable):
+        caller_loop = asyncio.get_running_loop()
+        def wrapped_callback(call: grpc.aio.Call):
+            self._grpc_call_loop.call_soon_threadsafe(
+                lambda: self._grpc_call_loop.create_task(
+                    self._process_done_callback(
+                        call, lambda arg: caller_loop.call_soon_threadsafe(callback, arg)
+                    )
+                )
+            )
+        self._call.add_done_callback(wrapped_callback)
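
The hop back to the caller's loop can be shown in isolation with plain asyncio, without gRPC or Ray. This is only a sketch under assumptions: `start_background_loop`, `demo`, and `finish_on_background_loop` are hypothetical names, and the background loop stands in for the gRPC call loop running on a separate thread.

```python
import asyncio
import threading
from typing import Any, Dict, Tuple


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Run a new event loop on a daemon thread (stand-in for the gRPC loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def demo() -> Tuple[bool, Any]:
    caller_loop = asyncio.get_running_loop()
    caller_thread = threading.get_ident()
    done = asyncio.Event()
    seen: Dict[str, Any] = {}

    def callback(arg: Any) -> None:
        # Runs on the caller's loop, so loop-bound state is safe to touch.
        seen["thread"] = threading.get_ident()
        seen["arg"] = arg
        done.set()

    async def finish_on_background_loop() -> None:
        # Schedule the callback on the caller's loop instead of calling it
        # directly from the background thread.
        caller_loop.call_soon_threadsafe(callback, "result")

    background_loop = start_background_loop()
    asyncio.run_coroutine_threadsafe(finish_on_background_loop(), background_loop)
    await asyncio.wait_for(done.wait(), timeout=5)
    background_loop.call_soon_threadsafe(background_loop.stop)
    return seen["thread"] == caller_thread, seen["arg"]
```

The first element of the returned tuple reports whether the callback ran on the caller's thread, which is the property the suggestion is after.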

@abrarsheikh
Contributor

> never invalidates the queue-length cache entry for a failed replica

what does this mean?

@abrarsheikh abrarsheikh reopened this May 11, 2026
@wanadzhar913
Author

wanadzhar913 commented May 11, 2026

> > never invalidates the queue-length cache entry for a failed replica
>
> what does this mean?

From what I understand of the issue, when a replica is temporarily unavailable, we should remove that replica’s cached queue length (ReplicaQueueLengthCache) so the next request doesn’t trust stale data and route there again without probing. In RequestRouter, that is done by calling self._replica_queue_len_cache.invalidate_key(replica_id).

On the gRPC path, the completion callback gets a raw grpc.aio.Call instead of an ActorUnavailableError, so the router skips on_replica_actor_unavailable(), which is the method that invalidates that cache entry.

def on_replica_actor_unavailable(self, replica_id: ReplicaID):
    """Invalidate cache entry so active probing is required for the next request."""
    self._replica_queue_len_cache.invalidate_key(replica_id)
    self._queue_len_gauge_last_update.pop(replica_id, None)
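
To illustrate why the invalidation matters, here is a toy model of the cache behavior. It is only a sketch: `QueueLenCache` and `needs_probe` are hypothetical, simplified stand-ins, not the real ReplicaQueueLengthCache or RequestRouter logic.

```python
from typing import Dict, Optional


class QueueLenCache:
    """Hypothetical, simplified stand-in for ReplicaQueueLengthCache."""

    def __init__(self) -> None:
        self._cache: Dict[str, int] = {}

    def update(self, replica_id: str, queue_len: int) -> None:
        self._cache[replica_id] = queue_len

    def get(self, replica_id: str) -> Optional[int]:
        return self._cache.get(replica_id)

    def invalidate_key(self, replica_id: str) -> None:
        # Dropping the entry means the next lookup is a cache miss.
        self._cache.pop(replica_id, None)


def needs_probe(cache: QueueLenCache, replica_id: str) -> bool:
    """A cache miss means the replica must be probed before routing to it."""
    return cache.get(replica_id) is None


cache = QueueLenCache()
cache.update("replica-1", 0)

# While the entry exists, the router would trust the cached queue length.
trusted_before = not needs_probe(cache, "replica-1")

# After a failure, invalidating the entry forces a probe on the next request.
cache.invalidate_key("replica-1")
probe_after = needs_probe(cache, "replica-1")
```

If the invalidation step is skipped, as on the gRPC path described above, the stale entry keeps the replica looking idle and routable.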

@wanadzhar913 wanadzhar913 marked this pull request as ready for review May 12, 2026 01:12
@ray-gardener ray-gardener Bot added the serve (Ray Serve Related Issue) and community-contribution (Contributed by the community) labels May 12, 2026
@wanadzhar913
Author

Hi @jeffreywang-anyscale @abrarsheikh lmk what you guys think. Tysm!

@abrarsheikh
Contributor

thanks for your contribution, closing this PR in favor of #63371
