[serve] Fixes Router skips cache invalidation on gRPC request failure by wanadzhar913 · Pull Request #63272 · ray-project/ray

wanadzhar913 · 2026-05-11T17:17:43Z

Description

This PR addresses an issue on the gRPC transport: after a gRPC failure, the AsyncioRouter's request-completion callback never invalidates the queue-length cache entry for a failed replica.

Hence, here's my approach:

I'm proposing that the main change go in gRPCReplicaResult.add_done_callback (python/ray/serve/_private/replica_result.py).
It should wrap completed gRPC calls and, if await call.code() on the gRPC loop returns grpc.StatusCode.UNAVAILABLE, pass an ActorUnavailableError to the router callback.
AsyncioRouter._process_finished_request should then handle that error by invalidating the replica's queue-length cache entry via on_replica_actor_unavailable.
Tests in test_grpc_replica_result.py and test_router.py should cover the gRPC callback mapping and the router cache-invalidation path.
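
The flow described above can be sketched in isolation. This is a minimal illustration, not the PR's code: StatusCode, ActorUnavailableError, FakeCall, ToyRouter, process_done_callback and run_demo are all hypothetical stand-ins defined here so the example runs without grpcio or Ray, and passing the raw call through on non-UNAVAILABLE codes is an assumption.

```python
import asyncio
from enum import Enum
from typing import Callable, Dict


class StatusCode(Enum):
    """Stand-in for grpc.StatusCode so the sketch runs without grpcio."""
    OK = 0
    UNAVAILABLE = 14


class ActorUnavailableError(Exception):
    """Stand-in for Ray's ActorUnavailableError."""


class FakeCall:
    """Hypothetical stand-in for a completed grpc.aio.Call."""

    def __init__(self, code: StatusCode):
        self._code = code

    async def code(self) -> StatusCode:
        return self._code


async def process_done_callback(call: FakeCall, callback: Callable) -> None:
    # Translate UNAVAILABLE into the error type the router reacts to;
    # every other outcome is passed through unchanged (an assumption).
    if await call.code() is StatusCode.UNAVAILABLE:
        callback(ActorUnavailableError("replica unavailable"))
    else:
        callback(call)


class ToyRouter:
    """Toy router holding a replica_id -> queue length cache."""

    def __init__(self, queue_len_cache: Dict[str, int]):
        self.queue_len_cache = queue_len_cache

    def process_finished_request(self, replica_id: str, result) -> None:
        if isinstance(result, ActorUnavailableError):
            # Drop the stale entry so the next request has to probe.
            self.queue_len_cache.pop(replica_id, None)


def run_demo(code: StatusCode) -> Dict[str, int]:
    router = ToyRouter({"replica-1": 3})

    def callback(result) -> None:
        router.process_finished_request("replica-1", result)

    asyncio.run(process_done_callback(FakeCall(code), callback))
    return router.queue_len_cache
```

Calling run_demo(StatusCode.UNAVAILABLE) leaves the toy cache empty, while run_demo(StatusCode.OK) leaves the entry for replica-1 in place.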

Related issues

Fixes #63261

Additional information

Tests:
python3 -m pytest python/ray/serve/tests/unit/test_grpc_replica_result.py python/ray/serve/tests/unit/test_router.py -q

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

gemini-code-assist

Code Review

This pull request modifies gRPCReplicaResult to map gRPC UNAVAILABLE status codes to ActorUnavailableError within completion callbacks, which allows the router to invalidate its queue length cache. Unit tests were added to verify this error mapping and the resulting cache behavior. The reviewer identified a thread-safety concern where the callback might be executed on a background thread, potentially leading to race conditions in the router, and provided a code suggestion to ensure the callback runs on the caller's asyncio loop.

gemini-code-assist · 2026-05-11T17:20:35Z

    def add_done_callback(self, callback: Callable):
-        self._call.add_done_callback(callback)
+        def wrapped_callback(call: grpc.aio.Call):
+            self._grpc_call_loop.call_soon_threadsafe(
+                lambda: self._grpc_call_loop.create_task(
+                    self._process_done_callback(call, callback)
+                )
+            )
+
+        self._call.add_done_callback(wrapped_callback)


The callback is executed on self._grpc_call_loop. When _on_separate_loop is enabled for gRPC transport, this loop runs in a background thread. However, the AsyncioRouter (which typically provides this callback) and its internal state (like the queue-length cache) are not thread-safe. Executing the callback on a background thread can lead to race conditions when invalidating the cache or updating metrics in the router.

Consider capturing the current asyncio loop when add_done_callback is called and ensuring the callback is executed on that loop using call_soon_threadsafe.

Suggested change

    def add_done_callback(self, callback: Callable):
        caller_loop = asyncio.get_running_loop()

        def wrapped_callback(call: grpc.aio.Call):
            self._grpc_call_loop.call_soon_threadsafe(
                lambda: self._grpc_call_loop.create_task(
                    self._process_done_callback(
                        call, lambda arg: caller_loop.call_soon_threadsafe(callback, arg)
                    )
                )
            )

        self._call.add_done_callback(wrapped_callback)
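
The thread-hopping pattern in this suggestion can be tried on its own with only the standard library. The following is a rough sketch under assumptions: start_background_loop, finish_call and main are hypothetical names, and the background loop merely plays the role of the separate gRPC loop. It checks that a callback scheduled with call_soon_threadsafe runs on the caller's thread rather than the background one.

```python
import asyncio
import threading


def start_background_loop():
    """Start an event loop in a daemon thread (stands in for the gRPC loop)."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


async def main():
    caller_loop = asyncio.get_running_loop()
    caller_thread_id = threading.get_ident()
    done = asyncio.Event()
    seen = {}

    def callback(arg):
        # Runs on the caller's loop, so touching caller-owned state is safe.
        seen["thread_id"] = threading.get_ident()
        seen["arg"] = arg
        done.set()

    async def finish_call():
        # Runs on the background loop; hand the callback back to the caller.
        caller_loop.call_soon_threadsafe(callback, "result")

    bg_loop, bg_thread = start_background_loop()
    asyncio.run_coroutine_threadsafe(finish_call(), bg_loop)
    await asyncio.wait_for(done.wait(), timeout=5)

    # Shut the background loop down cleanly.
    bg_loop.call_soon_threadsafe(bg_loop.stop)
    bg_thread.join(timeout=5)
    return seen["thread_id"] == caller_thread_id, seen["arg"]
```

Running asyncio.run(main()) returns a pair: whether the callback ran on the caller's thread, and the argument it received.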

abrarsheikh · 2026-05-11T18:14:38Z

never invalidates the queue-length cache entry for a failed replica

what does this mean?

wanadzhar913 · 2026-05-11T23:39:27Z

never invalidates the queue-length cache entry for a failed replica

what does this mean?

From what I understand from the issue, when a replica is temporarily unavailable, we should remove that replica's cached queue length (ReplicaQueueLengthCache) so the next request doesn't trust stale data and route there again without probing (self._replica_queue_len_cache.invalidate_key(replica_id) in RequestRouter).

On the gRPC path, the completion callback gets a raw grpc.aio.Call instead of an ActorUnavailableError, so the router skips on_replica_actor_unavailable(), which is the method that invalidates that cache entry.

ray/python/ray/serve/_private/request_router/request_router.py

Lines 784 to 787 in 409bc23

    
    def on_replica_actor_unavailable(self, replica_id: ReplicaID):
        """Invalidate cache entry so active probing is required for the next request."""
        self._replica_queue_len_cache.invalidate_key(replica_id)
        self._queue_len_gauge_last_update.pop(replica_id, None)
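
To make the effect of invalidate_key concrete, here is a deliberately simplified sketch. ToyQueueLenCache, choose_replica and the probe callable are hypothetical and do not reflect Ray's actual routing algorithm or the real ReplicaQueueLengthCache API; the only idea carried over is that a missing cache entry forces an active probe.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyQueueLenCache:
    """Hypothetical cache of replica_id -> last known queue length."""

    def __init__(self) -> None:
        self._cache: Dict[str, int] = {}

    def update(self, replica_id: str, queue_len: int) -> None:
        self._cache[replica_id] = queue_len

    def get(self, replica_id: str) -> Optional[int]:
        return self._cache.get(replica_id)

    def invalidate_key(self, replica_id: str) -> None:
        self._cache.pop(replica_id, None)


def choose_replica(
    cache: ToyQueueLenCache,
    replica_ids: List[str],
    probe: Callable[[str], Optional[int]],
) -> Tuple[Optional[str], List[str]]:
    """Pick the replica with the shortest queue, probing uncached replicas.

    probe returns a queue length, or None if the replica is unreachable.
    """
    lengths: Dict[str, int] = {}
    probed: List[str] = []
    for replica_id in replica_ids:
        queue_len = cache.get(replica_id)
        if queue_len is None:
            # No cached value: fall back to an active probe.
            probed.append(replica_id)
            queue_len = probe(replica_id)
            if queue_len is None:
                continue  # unreachable replica, skip it
            cache.update(replica_id, queue_len)
        lengths[replica_id] = queue_len
    best = min(lengths, key=lengths.get) if lengths else None
    return best, probed
```

With a stale entry saying replica "a" has an empty queue, choose_replica keeps picking "a" without probing. After cache.invalidate_key("a"), the same call probes "a" first and can route elsewhere if the probe fails.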

wanadzhar913 · 2026-05-15T11:17:37Z

Hi @jeffreywang-anyscale @abrarsheikh lmk what you guys think. Tysm!

abrarsheikh · 2026-05-15T19:20:07Z

Thanks for your contribution, closing this PR in favor of #63371

fixes ray-project#63261

ca3f8d1

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

wanadzhar913 requested a review from a team as a code owner May 11, 2026 17:17

wanadzhar913 marked this pull request as draft May 11, 2026 17:17

wanadzhar913 changed the title ~~fixes https://github.com/ray-project/ray/issues/63261~~ [serve] Fixes Router skips cache invalidation on gRPC request failure May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026

abrarsheikh closed this May 11, 2026

abrarsheikh reopened this May 11, 2026

wanadzhar913 marked this pull request as ready for review May 12, 2026 01:12

ray-gardener Bot added the serve and community-contribution labels May 12, 2026

abrarsheikh closed this May 15, 2026

wanadzhar913 wants to merge 1 commit into ray-project:master from wanadzhar913:bugfix/grpc-cache-invalidation-63261
