[serve] Translate gRPC UNAVAILABLE into ActorUnavailableError in done-callback #63371
Open
isotauon wants to merge 1 commit into
…-callback

When a ``DeploymentHandle`` uses ``_by_reference=False`` (gRPC transport), ``gRPCReplicaResult.add_done_callback`` registered the user's callback directly on the underlying ``grpc.aio.Call`` object, so the done-callback fired with the call itself regardless of outcome.

The ``AsyncioRouter._process_finished_request`` consumer dispatches on the Ray error type of its ``result`` argument; on the gRPC path, neither ``_get_actor_died_error`` nor the ``ActorUnavailableError`` branch fired, so the queue-length cache for a failed replica was never invalidated. Power-of-two-choices then continued routing to the failed replica until either the next request's rejection path or a controller long-poll refreshed state.

Wrap the user callback in a translating shim that inspects the completed call: when ``call.exception()`` is an ``AioRpcError`` with ``StatusCode.UNAVAILABLE``, deliver an ``ActorUnavailableError`` instead, matching what ``ActorReplicaResult.add_done_callback`` already does for the actor transport. Cancellations and other status codes pass through unchanged. Existing router error-handling branches now fire uniformly across transports.

Add unit-test coverage for the three relevant cases (UNAVAILABLE translated, successful call passes through, non-UNAVAILABLE gRPC failure passes through).

Closes ray-project#63261

Signed-off-by: Isabel Zhou <isabelxzhou@gmail.com>
Code Review
This pull request modifies gRPCReplicaResult.add_done_callback to translate gRPC UNAVAILABLE status codes into ActorUnavailableError. This change ensures a consistent error contract for routers, allowing them to handle transport-level failures and actor-path failures uniformly for cache invalidation. Corresponding unit tests were added to verify the translation logic for various gRPC call outcomes. I have no feedback to provide.
abrarsheikh
approved these changes
May 15, 2026
This was referenced May 15, 2026
Why are these changes needed?
Closes #63261.
When a `DeploymentHandle` uses `_by_reference=False` (gRPC transport), `gRPCReplicaResult.add_done_callback` previously registered the user's callback directly on the underlying `grpc.aio.Call`.

`grpc.aio.Call` invokes its done-callbacks with the call object itself, not the deserialized result. The router's `AsyncioRouter._process_finished_request` dispatches on the Ray error type of its `result` argument.

So on the gRPC path neither error branch fired, the queue-length cache was never invalidated for a failed replica, and power-of-two-choices kept routing to that replica until either the next request's rejection path or a controller long-poll pushed a refresh.
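A minimal sketch of the mismatch described above may help. Every name in it (`StubCall`, `StubActorUnavailableError`, `dispatch_on_result`) is a hypothetical stand-in rather than the real Ray Serve or `grpc.aio` type; the only behavior it borrows is that done-callbacks receive the call object itself.

```python
# Hypothetical stand-ins: none of these classes are the real Ray Serve or
# grpc.aio types; they only mimic the behavior described above.


class StubActorUnavailableError(Exception):
    """Stand-in for Ray's ActorUnavailableError."""


class StubCall:
    """Stand-in for a grpc.aio.Call that has already failed."""

    def __init__(self):
        self._callbacks = []

    def add_done_callback(self, callback):
        self._callbacks.append(callback)

    def finish(self):
        # Each done-callback receives the call object itself, not an error.
        for callback in self._callbacks:
            callback(self)


def dispatch_on_result(result):
    """Type-based dispatch, loosely modeled on the router's done-callback."""
    if isinstance(result, StubActorUnavailableError):
        return "invalidate queue-length cache"
    return "no action"


outcomes = []
call = StubCall()
# Pre-fix behavior: the router callback is registered directly on the call.
call.add_done_callback(lambda result: outcomes.append(dispatch_on_result(result)))
call.finish()
print(outcomes)
```

Because the callback only ever sees a `StubCall`, the `isinstance` check cannot match, and the sketch takes the "no action" branch even though the call failed.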
Fix
Wrap the user callback in a translating shim. When the underlying call is done and not cancelled, inspect `call.exception()`: an `AioRpcError` with `StatusCode.UNAVAILABLE` is translated into an `ActorUnavailableError`, matching what `ActorReplicaResult.add_done_callback` delivers for the actor transport and the existing pattern already used elsewhere in `replica_result.py` (e.g. `replica_result.py:374-379`, `replica_result.py:506-510`).

Cancellations and other status codes pass the call through unchanged so existing callers are unaffected. Failures during translation are logged and the call is passed through (no callback chain breakage).
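The shim's control flow can be sketched roughly as follows. This is not the code in the PR: `StubStatusCode`, `StubAioRpcError`, `StubActorUnavailableError`, `StubCall`, `make_translating_callback`, and `deliver` are all hypothetical names, and the stubs replace the real `grpc` and Ray types so the example runs without either library installed.

```python
import enum
import logging

logger = logging.getLogger(__name__)


# Hypothetical stand-ins for grpc.StatusCode, grpc.aio.AioRpcError, and
# Ray's ActorUnavailableError; they are not the real classes.
class StubStatusCode(enum.Enum):
    OK = 0
    INTERNAL = 13
    UNAVAILABLE = 14


class StubAioRpcError(Exception):
    def __init__(self, code):
        super().__init__(code.name)
        self._code = code

    def code(self):
        return self._code


class StubActorUnavailableError(Exception):
    pass


class StubCall:
    """Stand-in for a completed call, exposing only what the shim inspects."""

    def __init__(self, exc=None, cancelled=False):
        self._exc = exc
        self._cancelled = cancelled

    def done(self):
        return True

    def cancelled(self):
        return self._cancelled

    def exception(self):
        return self._exc


def make_translating_callback(user_callback):
    """Wrap user_callback so UNAVAILABLE failures arrive as an unavailable error."""

    def _shim(call):
        result = call  # Default: pass the call through unchanged.
        try:
            if call.done() and not call.cancelled():
                exc = call.exception()
                if (
                    isinstance(exc, StubAioRpcError)
                    and exc.code() is StubStatusCode.UNAVAILABLE
                ):
                    result = StubActorUnavailableError(str(exc))
        except Exception:
            # A translation failure must not break the callback chain.
            logger.exception("Failed to translate gRPC call outcome")
            result = call
        user_callback(result)

    return _shim


def deliver(call):
    """Run the shim on a finished call and return what the user callback saw."""
    received = []
    make_translating_callback(received.append)(call)
    return received[0]
```

In this sketch, `deliver(StubCall(exc=StubAioRpcError(StubStatusCode.UNAVAILABLE)))` hands back a `StubActorUnavailableError`, while a successful, cancelled, or `INTERNAL` call comes back as the same call object.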
Scope intentionally limited to UNAVAILABLE
Only `UNAVAILABLE` is translated, matching the existing precedent at lines 374-379 and 506-510. `INTERNAL`/`UNKNOWN` could in theory indicate replica issues too, but they can also reflect application-level errors that aren't the router's concern. Happy to widen the translation if maintainers prefer.

Related issue number
Closes #63261
Checks
`git commit -s`) in this PR.
`scripts/format.sh` to lint the changes in this PR. (verified manually; new code stays within the file's existing line-width budget)
`python/ray/serve/tests/unit/test_grpc_replica_result.py`; not run locally because the local environment has no Ray install. Verified `python -m py_compile` on the changed files; relies on CI to confirm the existing suite still passes.)