[Serve][1/2] Add TracingConfig and wire through proxy and replica paths #63273

Open
suppagoddo wants to merge 2 commits into ray-project:master from suppagoddo:tracing-config-prA-wiring

Conversation

@suppagoddo
Contributor

@suppagoddo suppagoddo commented May 11, 2026

Schema:

  • New TracingConfig pydantic model (schema.py) with enabled, exporter_import_path, sampling_ratio
  • Exported in ray.serve public API + API doc entry
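As a rough illustration of the shape such a model might take, here is a minimal sketch. It uses a standard-library dataclass as a stand-in for the pydantic model so that it runs without dependencies. The class name `TracingConfigSketch`, the default values, and the [0, 1] bound on `sampling_ratio` are assumptions for illustration and are not taken from schema.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracingConfigSketch:
    """Stand-in for the TracingConfig model; all defaults are assumed."""

    enabled: bool = False
    # Assumed: an empty string means "no exporter configured".
    exporter_import_path: str = ""
    # Assumed: fraction of requests to trace, from 0.0 to 1.0.
    sampling_ratio: float = 1.0

    def __post_init__(self) -> None:
        # Assumed validation rule: reject ratios outside [0, 1].
        if not 0.0 <= self.sampling_ratio <= 1.0:
            raise ValueError(
                f"sampling_ratio must be in [0, 1], got {self.sampling_ratio}"
            )


cfg = TracingConfigSketch(enabled=True, sampling_ratio=0.25)
```

In the real model, pydantic field constraints would likely replace the hand-written check.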

setup_tracing API:

  • Accepts TracingConfig directly instead of raw kwargs
  • When tracing_config is provided: reads config from it
  • When tracing_config is None: falls back to env vars (backward compatible)
  • Tracing setup errors now propagate (fail fast) instead of being silently swallowed
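The resolution order and fail-fast behavior described above can be sketched as follows. This is not the actual `setup_tracing` implementation: `TracingSettings`, `resolve_settings`, `setup_tracing_sketch`, the `EXAMPLE_*` environment variable names, and the `module:attribute` path format are all hypothetical placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TracingSettings:
    """Hypothetical stand-in for TracingConfig."""

    enabled: bool
    exporter_import_path: str
    sampling_ratio: float = 1.0


def resolve_settings(
    tracing_config: Optional[TracingSettings],
    environ: Mapping[str, str],
) -> TracingSettings:
    """Prefer the explicit config; otherwise fall back to env vars."""
    if tracing_config is not None:
        return tracing_config
    # Placeholder env var names, not the ones Ray Serve reads.
    path = environ.get("EXAMPLE_TRACING_EXPORTER_IMPORT_PATH", "")
    ratio = float(environ.get("EXAMPLE_TRACING_SAMPLING_RATIO", "1.0"))
    return TracingSettings(
        enabled=bool(path), exporter_import_path=path, sampling_ratio=ratio
    )


def setup_tracing_sketch(
    tracing_config: Optional[TracingSettings] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True if tracing was set up, False if it is disabled."""
    settings = resolve_settings(
        tracing_config, os.environ if environ is None else environ
    )
    if not settings.enabled:
        return False
    # Assumed path format "module:attribute"; a malformed path raises
    # instead of being logged and ignored (fail fast).
    module, sep, attr = settings.exporter_import_path.partition(":")
    if not (module and sep and attr):
        raise ValueError(
            f"Invalid exporter_import_path: {settings.exporter_import_path!r}"
        )
    return True
```

Passing the environment as a parameter keeps the fallback path easy to exercise in unit tests without mutating `os.environ`.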

Replica path:

  • Replica fetches TracingConfig from controller via get_tracing_config() and passes to setup_tracing

Proxy path:

  • ProxyActorInterface accepts tracing_config param (plumbed to setup_tracing)
  • Proxy currently receives None (env-var fallback, same as master). Full proxy wiring via ProxyStateManager deferred to PR 2.

Controller:

  • Stores global_tracing_config and exposes get_tracing_config() method
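The controller and replica bullets above can be pictured with plain classes. This is a toy model only: `ControllerSketch` and `ReplicaSketch` are hypothetical names, the config is a plain dict, and a direct method call stands in for what would be a remote actor call in Ray Serve.

```python
from typing import Optional


class ControllerSketch:
    """Holds one cluster-wide tracing config, set once at startup."""

    def __init__(self, global_tracing_config: Optional[dict] = None) -> None:
        self._global_tracing_config = global_tracing_config

    def get_tracing_config(self) -> Optional[dict]:
        return self._global_tracing_config


class ReplicaSketch:
    """Fetches the config from the controller during initialization."""

    def __init__(self, controller: ControllerSketch) -> None:
        # A plain method call here; a remote call in the real system.
        self.tracing_config = controller.get_tracing_config()


controller = ControllerSketch({"enabled": True, "sampling_ratio": 0.5})
replica = ReplicaSketch(controller)
```

Because the config is read once in the constructor, a replica created this way never observes later changes, which matches the "static, no hot-reload" design point.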

Decorator fix:

  • tracing_decorator_factory now checks is_tracing_enabled() at call-time instead of decoration-time, because TracingConfig arrives at runtime (after decorators are applied at import time)
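The difference between the two timings can be shown with a toy example. The decorators and the module-level flag below are simplified stand-ins, not the code in this PR; only the name `is_tracing_enabled` is borrowed from it.

```python
import functools

_tracing_enabled = False  # flipped at runtime, after decorators have run
recorded_spans = []


def is_tracing_enabled() -> bool:
    return _tracing_enabled


def enable_tracing() -> None:
    global _tracing_enabled
    _tracing_enabled = True


def decoration_time_check(func):
    # Decides once, when the decorator is applied (typically at import).
    if not is_tracing_enabled():
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        recorded_spans.append(func.__name__)
        return func(*args, **kwargs)

    return wrapper


def call_time_check(func):
    # Always wraps; decides on every call.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if is_tracing_enabled():
            recorded_spans.append(func.__name__)
        return func(*args, **kwargs)

    return wrapper


@decoration_time_check
def old_style():
    return "old"


@call_time_check
def new_style():
    return "new"


enable_tracing()  # the config "arrives" after decoration
old_style()
new_style()
print(recorded_spans)  # ['new_style']
```

The call-time variant adds one flag check per call, which is the cost of supporting configuration that arrives after import.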

Public API:

  • serve.start(tracing_config=...) parameter

Design

  • Global only: cluster-level, not per-app or per-deployment
  • Static: set once at startup, no hot-reload of existing actors
  • No proto changes: constructor arg through actor hierarchy
  • Backward compatible: env vars still work as defaults when tracing_config is None

What's deferred to PR 2

  • Proxy wiring through ProxyStateManager (proxy currently falls back to env vars)
  • Runtime propagation (push config updates to live actors)
  • HAProxy tracing support

Test plan

  • Unit tests for TracingConfig validation (test_schema.py)
  • Updated tracing e2e tests to use TracingConfig
  • Existing proxy/replica/deployment tests pass
  • No Replica class restructuring — existing class unchanged

@suppagoddo suppagoddo requested a review from a team as a code owner May 11, 2026 17:51
@suppagoddo suppagoddo changed the title from "Tracing config pr a wiring" to "[Serve][1/2] Add TracingConfig and wire through proxy and replica paths" May 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces TracingConfig to enable OpenTelemetry tracing across Ray Serve components, including the controller, proxies, and replicas. It refactors the Replica class into an abstract base class and a concrete implementation to streamline tracing setup and request handling. Feedback identifies a parameter naming inconsistency in _on_request_cancelled that violates the Liskov Substitution Principle and a missing return type hint in _can_accept_request.

Comment thread python/ray/serve/_private/replica.py Outdated
Comment on lines +1993 to +2014
    def _on_request_cancelled(
        self, metadata: RequestMetadata, e: asyncio.CancelledError
    ):
        """Recursively cancel child requests.

        This includes all requests that are pending assignment, and gRPC
        requests that have already been assigned.
        """
        # Cancel child requests pending assignment
        requests_pending_assignment = (
            ray.serve.context._get_requests_pending_assignment(
                metadata.internal_request_id
            )
        )
        for task in requests_pending_assignment.values():
            task.cancel()

        # Cancel child requests that have already been assigned.
        # This is for gRPC requests and direct ingress requests.
        in_flight_requests = _get_in_flight_requests(metadata.internal_request_id)
        for replica_result in in_flight_requests.values():
            replica_result.cancel()
Contributor

medium

The parameter name metadata in Replica._on_request_cancelled is inconsistent with the base class ReplicaBase._on_request_cancelled, which uses request_metadata. This can lead to issues if the method is called using keyword arguments and violates the Liskov Substitution Principle.

    def _on_request_cancelled(
        self, request_metadata: RequestMetadata, e: asyncio.CancelledError
    ):
        """Recursively cancel child requests.

        This includes all requests that are pending assignment, and gRPC
        requests that have already been assigned.
        """
        # Cancel child requests pending assignment
        requests_pending_assignment = (
            ray.serve.context._get_requests_pending_assignment(
                request_metadata.internal_request_id
            )
        )
        for task in requests_pending_assignment.values():
            task.cancel()

        # Cancel child requests that have already been assigned.
        # This is for gRPC requests and direct ingress requests.
        in_flight_requests = _get_in_flight_requests(request_metadata.internal_request_id)
        for replica_result in in_flight_requests.values():
            replica_result.cancel()
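To make the keyword-argument concern concrete, here is a small self-contained sketch. The class names are hypothetical and the method bodies are placeholders; only the parameter names mirror the ones discussed in this thread.

```python
class ReplicaBaseSketch:
    def _on_request_cancelled(self, request_metadata, e):
        return f"base:{request_metadata}"


class RenamedParam(ReplicaBaseSketch):
    # First parameter renamed relative to the base class.
    def _on_request_cancelled(self, metadata, e):
        return f"renamed:{metadata}"


class SameParam(ReplicaBaseSketch):
    # Parameter names match the base class.
    def _on_request_cancelled(self, request_metadata, e):
        return f"same:{request_metadata}"


def notify(replica: ReplicaBaseSketch) -> str:
    # Caller written against the base-class signature, using keywords.
    return replica._on_request_cancelled(request_metadata="req-1", e=None)


print(notify(SameParam()))  # same:req-1
try:
    notify(RenamedParam())
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Positional calls would work with either subclass, so the mismatch only surfaces once some caller switches to keyword arguments.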

Comment thread python/ray/serve/_private/replica.py Outdated
        if ray.util.pdb._is_ray_debugger_post_mortem_enabled():
            ray.util.pdb._post_mortem()

    def _can_accept_request(self, request_metadata: RequestMetadata):
Contributor

medium

The method _can_accept_request is missing a return type hint. It should return bool to be consistent with the base class ReplicaBase.

Suggested change
- def _can_accept_request(self, request_metadata: RequestMetadata):
+ def _can_accept_request(self, request_metadata: RequestMetadata) -> bool:

Comment thread python/ray/serve/_private/replica.py Outdated
Comment thread python/ray/serve/_private/replica.py Outdated
Comment thread python/ray/serve/_private/haproxy.py
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from e81c4c8 to 67af77c Compare May 11, 2026 19:26
Comment thread python/ray/serve/_private/replica.py
Comment thread python/ray/serve/_private/replica.py
@ray-gardener ray-gardener Bot added the serve, docs, observability, and community-contribution labels May 11, 2026
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from 67af77c to cb4a579 Compare May 11, 2026 19:47
@abrarsheikh abrarsheikh added the go label (add ONLY when ready to merge, run all tests) May 11, 2026
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch 3 times, most recently from 393b5aa to fe8e0b6 Compare May 12, 2026 00:18
Comment thread python/ray/serve/_private/replica.py Outdated
Comment thread python/ray/serve/_private/proxy.py Outdated
Comment on lines 1678 to 1682
Contributor

I think we should start failing fast here. The user has already expressed intent to set up tracing, so it is better to ensure that it is set up or to fail loudly.

Contributor Author

Got it, so we will mark this setup as failed then? Just curious: won't it be too harsh to fail the setup?

Comment thread python/ray/serve/_private/proxy.py Outdated
Comment on lines 1671 to 1675
  is_tracing_setup_successful = setup_tracing(
-     component_name="proxy", component_id=node_ip_address
+     component_name="proxy",
+     component_id=node_ip_address,
+     **get_tracing_kwargs(self._tracing_config),
  )
Contributor

Let's change the API of setup_tracing to accept the TracingConfig object directly instead of denormalizing it. The advantage is that when we need to add a new parameter to setup_tracing through TracingConfig, we don't have to change get_tracing_kwargs.

Contributor Author

ACK, will make this change here.

Comment thread python/ray/serve/_private/replica.py Outdated
Comment on lines +1168 to +1169
    except AttributeError:
        self._tracing_config = None
Contributor

When do we expect AttributeError?

Contributor Author

This was to support rolling deployments during a version update (the old version does not have the get_tracing_config method). But I think we discussed on a different thread that a version upgrade will have a new deployment, and therefore a new controller. So it is not necessary, and I will remove it.

Comment thread python/ray/serve/_private/replica.py Outdated
Comment on lines 1171 to 1184
@@ -1169,10 +1183,6 @@ def __init__(
"The replica will continue running, but traces will not be exported."
)
Contributor

same comment here

Contributor Author

ACK, will fail fast here as well.

Comment thread python/ray/serve/_private/replica.py Outdated


- class Replica:
+ class ReplicaBase(ABC):
Contributor

Is the refactor to Replica necessary for adding tracing? If not, that's better done in a follow-up PR. For context, Replica is probably the most critical module in the Serve replica path, and I am reluctant to make sweeping changes here.

Contributor Author

ACK, that's a valid concern! I will do a follow up PR.

Comment on lines +15 to +16
if TYPE_CHECKING:
from ray.serve.schema import TracingConfig
Contributor

I believe this was introduced to avoid a circular dependency. Please revert this change and figure out how we can avoid the circular dependency issue.

Contributor Author

Yes, during the type-check phase. I will move it to a leaf module.

@@ -176,32 +181,38 @@ def my_function(obj):
"""

def tracing_decorator(func):
Contributor

Is this refactor strictly necessary? If not, let's leave them out for now.

Contributor Author

In the older version, when tracing is enabled via env variables, the decorator sets the wrapper correctly based on the value of is_tracing_enabled().

But in the newer version, where TracingConfig is initialized via a schema, some classes (for example, the proxy and replica, whose TracingConfig comes from the controller at runtime) only get tracing enabled at runtime through the new setup_tracing(). For those, the wrapper would never activate, because is_tracing_enabled() returned False at decoration time.

This is the reason for this refactor: the check moves into each wrapper function. Now we always return the wrapper, and each function creates spans correctly based on is_tracing_enabled().

Contributor Author

So yes, it is necessary, because we don't set it via env variables now.

Comment thread python/ray/serve/schema.py Outdated
Comment on lines +285 to +289
    @model_validator(mode="after")
    def fill_default_exporter_when_enabled(self):
        if self.enabled and not self.exporter_import_path:
            self.exporter_import_path = DEFAULT_TRACING_EXPORTER_IMPORT_PATH
        return self
Contributor

Why is this needed if we already do

exporter_import_path: str = Field(
        default_factory=lambda: RAY_SERVE_TRACING_EXPORTER_IMPORT_PATH

@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from a9828fc to d2811b1 Compare May 17, 2026 21:12
Comment thread python/ray/serve/_private/proxy.py
Comment thread python/ray/serve/tests/unit/test_schema.py Outdated
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from 0187fff to c0fdc9a Compare May 18, 2026 00:10
Comment thread python/ray/serve/_private/api.py
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from c0fdc9a to 922afb8 Compare May 18, 2026 00:36
…ierarchy

Introduce TracingConfig as a first-class configuration object for
OpenTelemetry tracing in Ray Serve. Key changes:

- Add TracingConfig(BaseModel) to schema.py with enabled, exporter_import_path,
  and sampling_ratio fields
- setup_tracing() now accepts TracingConfig directly instead of raw kwargs
- Controller stores TracingConfig and exposes get_tracing_config() method
- Replica and Proxy fetch TracingConfig from controller at init
- Tracing decorators check is_tracing_enabled() at call-time (not decoration-time)
  to support runtime TracingConfig that arrives after import
- serve.start() accepts tracing_config parameter
- Fail fast: tracing setup errors propagate instead of being swallowed

No Replica class restructuring — the existing class remains unchanged.

Signed-off-by: Udit Agrawal <uagrawal@apple.com>
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch 5 times, most recently from 93ecd5a to 54ca9f5 Compare May 19, 2026 17:19
component_name="proxy",
component_id=node_ip_address,
tracing_config=self._tracing_config,
)

Proxy tracing config never wired through, always None

High Severity

ProxyActor.__init__ doesn't accept tracing_config as a parameter and doesn't pass it to super().__init__(), so self._tracing_config used in the setup_tracing call is always None. Additionally, the controller never passes global_tracing_config to ProxyStateManager, and ProxyStateManager never passes it when creating proxy actors. The entire proxy tracing path described in the PR design is broken — the user-configured TracingConfig only reaches replicas (which fetch it from the controller), never proxies.

Additional Locations (2)
Reviewed by Cursor Bugbot for commit 54ca9f5.

Contributor Author

This will be handled in the [2/2] PR for this feature.

Signed-off-by: Udit Agrawal <uagrawal@apple.com>
@suppagoddo suppagoddo force-pushed the tracing-config-prA-wiring branch from 54ca9f5 to 37f3b98 Compare May 19, 2026 18:47
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 37f3b98.

tracing_config=tracing_config,
)
if is_tracing_setup_successful:
logger.info("Successfully set up tracing for replica")

Replica crashes on tracing setup failure instead of degrading

Medium Severity

The try/except block around setup_tracing() was removed from the replica initialization path. If setup_tracing raises (e.g., ImportError for missing opentelemetry, or an invalid exporter_import_path), the replica actor will crash during init. Previously, failures were caught and logged as warnings, allowing the replica to continue serving without tracing.

Reviewed by Cursor Bugbot for commit 37f3b98.

Contributor Author

We are failing fast, so there is no need for this try/except here.

@suppagoddo suppagoddo requested a review from abrarsheikh May 19, 2026 20:29
Labels

  • community-contribution: Contributed by the community
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
  • serve: Ray Serve Related Issue

2 participants