Skip to content

[serve] Translate gRPC UNAVAILABLE into ActorUnavailableError in done-callback#63371

Open
isotauon wants to merge 1 commit into
ray-project:masterfrom
isotauon:serve/router-grpc-cache-invalidation-63261
Open

[serve] Translate gRPC UNAVAILABLE into ActorUnavailableError in done-callback#63371
isotauon wants to merge 1 commit into
ray-project:masterfrom
isotauon:serve/router-grpc-cache-invalidation-63261

Conversation

@isotauon
Copy link
Copy Markdown

Why are these changes needed?

Closes #63261.

When a DeploymentHandle uses _by_reference=False (gRPC transport), gRPCReplicaResult.add_done_callback previously registered the user's callback directly on the underlying grpc.aio.Call:

def add_done_callback(self, callback: Callable):
    self._call.add_done_callback(callback)

grpc.aio.Call invokes its done-callbacks with the call object itself, not the deserialized result. The router's AsyncioRouter._process_finished_request dispatches on the Ray error type of its result argument:

actor_died_error = self._get_actor_died_error(result)
if actor_died_error is not None: ...
elif isinstance(result, ActorUnavailableError):
    self.request_router.on_replica_actor_unavailable(replica_id)

So on the gRPC path neither error branch fired, the queue-length cache was never invalidated for a failed replica, and power-of-two-choices kept routing to that replica until either the next request's rejection path or a controller long-poll pushed a refresh.

Fix

Wrap the user callback in a translating shim. When the underlying call is done and not cancelled, inspect call.exception(): an AioRpcError with StatusCode.UNAVAILABLE is translated into an ActorUnavailableError — matching what ActorReplicaResult.add_done_callback delivers for the actor transport, and matching the existing pattern already used elsewhere in replica_result.py (e.g. replica_result.py:374-379, replica_result.py:506-510).

if call.done() and not call.cancelled():
    exc = call.exception()
    if isinstance(exc, grpc.aio.AioRpcError) and exc.code() == grpc.StatusCode.UNAVAILABLE:
        translated = ActorUnavailableError(
            "Actor is unavailable.",
            self._actor_id.binary(),
        )

Cancellations and other status codes pass the call through unchanged so existing callers are unaffected. Failures during translation are logged and the call is passed through (no callback chain breakage).

Scope intentionally limited to UNAVAILABLE

Only UNAVAILABLE is translated, matching the existing precedent at lines 374-379 and 506-510. INTERNAL / UNKNOWN could in theory indicate replica issues too, but they can also reflect application-level errors that aren't the router's concern. Happy to widen the translation if maintainers prefer.

Related issue number

Closes #63261

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR. (verified manually; new code stays within the file's existing line-width budget)
  • I've included any doc changes needed for https://docs.ray.io/en/master/. (no doc changes needed — bugfix in internal behavior)
  • I've added any new APIs to the API Reference. (N/A — no new public APIs)
  • I've made sure the tests are passing. (new unit tests added in python/ray/serve/tests/unit/test_grpc_replica_result.py; not run locally because the local environment has no Ray install. Verified python -m py_compile on the changed files; relies on CI to confirm the existing suite still passes.)
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…-callback

When a ``DeploymentHandle`` uses ``_by_reference=False`` (gRPC transport),
``gRPCReplicaResult.add_done_callback`` registered the user's callback
directly on the underlying ``grpc.aio.Call`` object, so the done-callback
fired with the call itself regardless of outcome. The
``AsyncioRouter._process_finished_request`` consumer dispatches on the
Ray error type of its ``result`` argument; on the gRPC path, neither
``_get_actor_died_error`` nor the ``ActorUnavailableError`` branch fired,
so the queue-length cache for a failed replica was never invalidated.
Power-of-two-choices then continued routing to the failed replica until
either the next request's rejection path or a controller long-poll
refreshed state.

Wrap the user callback in a translating shim that inspects the completed
call: when ``call.exception()`` is an ``AioRpcError`` with
``StatusCode.UNAVAILABLE``, deliver an ``ActorUnavailableError`` instead
— matching what ``ActorReplicaResult.add_done_callback`` already does
for the actor transport. Cancellations and other status codes pass
through unchanged. Existing router error-handling branches now fire
uniformly across transports.

Add unit-test coverage for the three relevant cases (UNAVAILABLE
translated, successful call passes through, non-UNAVAILABLE gRPC failure
passes through).

Closes ray-project#63261

Signed-off-by: Isabel Zhou <isabelxzhou@gmail.com>
@isotauon isotauon requested a review from a team as a code owner May 15, 2026 17:41
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies gRPCReplicaResult.add_done_callback to translate gRPC UNAVAILABLE status codes into ActorUnavailableError. This change ensures a consistent error contract for routers, allowing them to handle transport-level failures and actor-path failures uniformly for cache invalidation. Corresponding unit tests were added to verify the translation logic for various gRPC call outcomes. I have no feedback to provide.

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[serve] Router skips cache invalidation on gRPC request failure

2 participants