[Serve][1/2] Add TracingConfig and wire through proxy and replica paths#63273
[Serve][1/2] Add TracingConfig and wire through proxy and replica paths#63273suppagoddo wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces TracingConfig to enable OpenTelemetry tracing across Ray Serve components, including the controller, proxies, and replicas. It refactors the Replica class into an abstract base class and a concrete implementation to streamline tracing setup and request handling. Feedback identifies a parameter naming inconsistency in _on_request_cancelled that violates the Liskov Substitution Principle and a missing return type hint in _can_accept_request.
| def _on_request_cancelled( | ||
| self, metadata: RequestMetadata, e: asyncio.CancelledError | ||
| ): | ||
| """Recursively cancel child requests. | ||
|
|
||
| This includes all requests that are pending assignment, and gRPC | ||
| requests that have already been assigned. | ||
| """ | ||
| # Cancel child requests pending assignment | ||
| requests_pending_assignment = ( | ||
| ray.serve.context._get_requests_pending_assignment( | ||
| metadata.internal_request_id | ||
| ) | ||
| ) | ||
| for task in requests_pending_assignment.values(): | ||
| task.cancel() | ||
|
|
||
| # Cancel child requests that have already been assigned. | ||
| # This is for gRPC requests and direct ingress requests. | ||
| in_flight_requests = _get_in_flight_requests(metadata.internal_request_id) | ||
| for replica_result in in_flight_requests.values(): | ||
| replica_result.cancel() |
There was a problem hiding this comment.
The parameter name metadata in Replica._on_request_cancelled is inconsistent with the base class ReplicaBase._on_request_cancelled, which uses request_metadata. This can lead to issues if the method is called using keyword arguments and violates the Liskov Substitution Principle.
def _on_request_cancelled(
self, request_metadata: RequestMetadata, e: asyncio.CancelledError
):
"""Recursively cancel child requests.
This includes all requests that are pending assignment, and gRPC
requests that have already been assigned.
"""
# Cancel child requests pending assignment
requests_pending_assignment = (
ray.serve.context._get_requests_pending_assignment(
request_metadata.internal_request_id
)
)
for task in requests_pending_assignment.values():
task.cancel()
# Cancel child requests that have already been assigned.
# This is for gRPC requests and direct ingress requests.
in_flight_requests = _get_in_flight_requests(request_metadata.internal_request_id)
for replica_result in in_flight_requests.values():
replica_result.cancel()| if ray.util.pdb._is_ray_debugger_post_mortem_enabled(): | ||
| ray.util.pdb._post_mortem() | ||
|
|
||
| def _can_accept_request(self, request_metadata: RequestMetadata): |
There was a problem hiding this comment.
e81c4c8 to
67af77c
Compare
67af77c to
cb4a579
Compare
393b5aa to
fe8e0b6
Compare
There was a problem hiding this comment.
i think we should start to fail fast here. User has already expressed intent to setup tracing, i think its better to ensure its setup or fail loudly.
There was a problem hiding this comment.
Got it, will mark this setup as failed then? Just curious won't it be too harsh to fail the setup.
| is_tracing_setup_successful = setup_tracing( | ||
| component_name="proxy", component_id=node_ip_address | ||
| component_name="proxy", | ||
| component_id=node_ip_address, | ||
| **get_tracing_kwargs(self._tracing_config), | ||
| ) |
There was a problem hiding this comment.
let's change the API of setup_tracing to accept the TracingConfig object directly instead of denormalizing it. The advantage is that when we need to add a new parameter to setup_tracing through TracingConfig, we dont have to make change to get_tracing_kwargs.
There was a problem hiding this comment.
ACK, will make this change here.
| except AttributeError: | ||
| self._tracing_config = None |
There was a problem hiding this comment.
when do we expect AttributeError
There was a problem hiding this comment.
This was to support rolling deployment when there is a version update [ Old version does not have get_tracing_config method ], but I think on a different thread we discussed, that a version upgrade will have new deployment, so new controller. So it is not neccasary and I will remove it.
| @@ -1169,10 +1183,6 @@ def __init__( | |||
| "The replica will continue running, but traces will not be exported." | |||
| ) | |||
There was a problem hiding this comment.
ACK, will fast here as well.
|
|
||
|
|
||
| class Replica: | ||
| class ReplicaBase(ABC): |
There was a problem hiding this comment.
Is the refactor to Replica necessary for adding tracing? If not, that's better done in a follow-up PR. The context Replica is probably the most critical module in the serve replica path, and I am reluctant to make sweeping changes here.
There was a problem hiding this comment.
ACK, that's a valid concern! I will do a follow up PR.
| if TYPE_CHECKING: | ||
| from ray.serve.schema import TracingConfig |
There was a problem hiding this comment.
i believe this was introduced to avoid circular dependency, please revert this change and figure out how we can avoid the circular dependency issue.
There was a problem hiding this comment.
Yes during the type check phase, will move it to leaf module.
| @@ -176,32 +181,38 @@ def my_function(obj): | |||
| """ | |||
|
|
|||
| def tracing_decorator(func): | |||
There was a problem hiding this comment.
Is this refactor strictly necessary? If not, let's leave them out for now.
There was a problem hiding this comment.
In older version when tracing is enabled via env variables decorater sets the wrapper correctly based on is_tracing_enabled() value.
But in the newer version of TracingConfig intialized via a schema, if there is a class [ Example: for proxy/replica TracingConfig comes from the controller at runtime] which gets tracing via this new setup_tracing() which makes tracing enabled at the runtime then for those, wrapper will never activate because at the decorater time, is_tracing_enabled returned false.
This is the reason for this refactor to now move this check to each wrapper function. Now we always set/return the wrapper and each function spans correctly based on is_tracing_enabled
There was a problem hiding this comment.
So yes it is necessary because we don't set it via env variables now.
| @model_validator(mode="after") | ||
| def fill_default_exporter_when_enabled(self): | ||
| if self.enabled and not self.exporter_import_path: | ||
| self.exporter_import_path = DEFAULT_TRACING_EXPORTER_IMPORT_PATH | ||
| return self |
There was a problem hiding this comment.
why is this needed if we already do
exporter_import_path: str = Field(
default_factory=lambda: RAY_SERVE_TRACING_EXPORTER_IMPORT_PATH
a9828fc to
d2811b1
Compare
0187fff to
c0fdc9a
Compare
c0fdc9a to
922afb8
Compare
…ierarchy Introduce TracingConfig as a first-class configuration object for OpenTelemetry tracing in Ray Serve. Key changes: - Add TracingConfig(BaseModel) to schema.py with enabled, exporter_import_path, and sampling_ratio fields - setup_tracing() now accepts TracingConfig directly instead of raw kwargs - Controller stores TracingConfig and exposes get_tracing_config() method - Replica and Proxy fetch TracingConfig from controller at init - Tracing decorators check is_tracing_enabled() at call-time (not decoration-time) to support runtime TracingConfig that arrives after import - serve.start() accepts tracing_config parameter - Fail fast: tracing setup errors propagate instead of being swallowed No Replica class restructuring — the existing class remains unchanged. Signed-off-by: Udit Agrawal <uagrawal@apple.com>
93ecd5a to
54ca9f5
Compare
| component_name="proxy", | ||
| component_id=node_ip_address, | ||
| tracing_config=self._tracing_config, | ||
| ) |
There was a problem hiding this comment.
Proxy tracing config never wired through, always None
High Severity
ProxyActor.__init__ doesn't accept tracing_config as a parameter and doesn't pass it to super().__init__(), so self._tracing_config used in the setup_tracing call is always None. Additionally, the controller never passes global_tracing_config to ProxyStateManager, and ProxyStateManager never passes it when creating proxy actors. The entire proxy tracing path described in the PR design is broken — the user-configured TracingConfig only reaches replicas (which fetch it from the controller), never proxies.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 54ca9f5. Configure here.
There was a problem hiding this comment.
This will be handled in the [2/2] PR for this feature.
Signed-off-by: Udit Agrawal <uagrawal@apple.com>
54ca9f5 to
37f3b98
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 37f3b98. Configure here.
| tracing_config=tracing_config, | ||
| ) | ||
| if is_tracing_setup_successful: | ||
| logger.info("Successfully set up tracing for replica") |
There was a problem hiding this comment.
Replica crashes on tracing setup failure instead of degrading
Medium Severity
The try/except block around setup_tracing() was removed from the replica initialization path. If setup_tracing raises (e.g., ImportError for missing opentelemetry, or an invalid exporter_import_path), the replica actor will crash during init. Previously, failures were caught and logged as warnings, allowing the replica to continue serving without tracing.
Reviewed by Cursor Bugbot for commit 37f3b98. Configure here.
There was a problem hiding this comment.
We are failing fast, so no need for this try/catch here.


Schema:
TracingConfigpydantic model (schema.py) withenabled,exporter_import_path,sampling_ratioray.servepublic API + API doc entrysetup_tracingAPI:TracingConfigdirectly instead of rawkwargstracing_configis provided: reads config from ittracing_configisNone: falls back to env vars (backward compatible)Replica path:
TracingConfigfrom controller viaget_tracing_config()and passes tosetup_tracingProxy path:
ProxyActorInterfaceacceptstracing_configparam (plumbed tosetup_tracing)None(env-var fallback, same as master). Full proxy wiring viaProxyStateManagerdeferred to PR 2.Controller:
global_tracing_configand exposesget_tracing_config()methodDecorator fix:
tracing_decorator_factorynow checksis_tracing_enabled()at call-time instead of decoration-time, becauseTracingConfigarrives at runtime (after decorators are applied at import time)Public API:
serve.start(tracing_config=...)parameterDesign
tracing_configisNoneWhat's deferred to PR 2
ProxyStateManager(proxy currently falls back to env vars)HAProxytracing supportTest plan
TracingConfigvalidation (test_schema.py)