Conversation
Instrument the inference server with OpenTelemetry to participate in distributed traces initiated by the async-serverless system. When enabled via `OTEL_TRACING_ENABLED=True`, the server extracts W3C `traceparent` headers from incoming requests, creates spans for key operations, and exports traces to an OTLP collector.

Key changes:
- New `inference/core/telemetry.py` module with TracerProvider setup, span helpers, and error recording
- FastAPI auto-instrumentation for HTTP server spans
- Manual spans on `model.load`, `model.infer`, and all Roboflow API calls
- Error recording on all route exception handlers
- Trace context (`traceparent`) propagation through SDK HTTP calls for remote workflow step execution
- OTel context propagation through workflow `ThreadPoolExecutor`
- `trace_id`/`span_id` injection into structlog entries
- Zero overhead when disabled (default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
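The `trace_id`/`span_id` structlog injection mentioned above could take the shape of a processor like the sketch below. It is a minimal sketch assuming the public opentelemetry API, not the server's actual implementation; it degrades gracefully when OTel is not installed or no span is active.

```python
def trace_context_log_processor(logger, method_name, event_dict):
    """structlog processor: add trace_id/span_id when a span is active."""
    try:
        from opentelemetry import trace
    except ImportError:
        return event_dict  # OTel not installed: leave the entry untouched
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict
```

Registered in the structlog processor chain, this runs on every log call and is a no-op outside of a recording span.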
When the inference server is hit directly (no parent `traceparent`), the sampling rate controls which requests are traced. This adds an `X-Force-Trace: true` header that overrides sampling for a specific request, useful for debugging in production.

How it works:
- `_ForceTraceASGIMiddleware` (outermost ASGI layer) reads the header and sets a ContextVar before the OTel instrumentor runs
- `_ForceTraceRootSampler` checks the ContextVar and force-samples when set, otherwise delegates to the ratio-based sampler
- `ParentBased` wrapping ensures that when a parent `traceparent` exists (e.g. from async-serverless), the parent's decision is honoured and the force-trace header is ignored — no duplicate tracing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
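The header-to-ContextVar step can be sketched as below. The constant and ContextVar names mirror the description; `read_force_trace` is a hypothetical helper, and the real middleware wraps the ASGI app rather than exposing a function.

```python
from contextvars import ContextVar

FORCE_TRACE_HEADER = b"x-force-trace"  # ASGI header names are lowercase bytes
_force_trace_flag: ContextVar[bool] = ContextVar("force_trace", default=False)

def read_force_trace(scope: dict) -> bool:
    """Set the flag from raw ASGI headers, before the OTel instrumentor runs.

    The root sampler later reads _force_trace_flag to decide whether to
    force-sample this request instead of applying the ratio sampler.
    """
    if scope.get("type") == "http":
        for name, value in scope.get("headers", []):
            if name == FORCE_TRACE_HEADER and value.lower() == b"true":
                _force_trace_flag.set(True)
                return True
    return False
```

Because the flag lives in a ContextVar, each request task sees its own value, and the sampler can observe it without any coupling to the HTTP layer.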
- `telemetry.py`: wrap opentelemetry imports in try/except so the module is always safe to import even without opentelemetry installed. All public helpers (`start_span`, `record_error`, `inject_trace_context`) degrade to noops.
- Move `from inference.core.telemetry` imports to top level in `roboflow_api.py`, `error_handlers.py`, `base.py`
- Add `inject_trace_context()` helper to `inference_sdk/config.py`, replacing scattered try/except ImportError blocks in `client.py`, `executors.py`, and `request_building.py`
- Add top-level guarded `otel_context` import in workflow executor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
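The guarded-import pattern can be sketched as follows. This is a simplified sketch, not the actual `telemetry.py`: it shows two helpers degrading to noops when opentelemetry is absent.

```python
import contextlib

try:
    from opentelemetry import trace  # optional dependency
    _OTEL_AVAILABLE = True
except ImportError:
    _OTEL_AVAILABLE = False

def start_span(name: str):
    # Real span context manager when OTel is importable, nullcontext
    # otherwise, so callers never need to guard.
    if not _OTEL_AVAILABLE:
        return contextlib.nullcontext()
    return trace.get_tracer("inference").start_as_current_span(name)

def record_error(error: Exception) -> None:
    # No-op unless a real, recording span is active.
    if not _OTEL_AVAILABLE:
        return
    span = trace.get_current_span()
    if span.is_recording():
        span.record_exception(error)
```

The key property is that importing the module never fails and every helper is callable unconditionally, so business logic stays OTel-unaware.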
…metry

- Add `set_span_attribute()` to `telemetry.py` — callers no longer need to guard against None spans; business logic stays OTel-unaware
- Move `add_trace_context` from `logger.py` to `telemetry.py` as `trace_context_log_processor`, using the module's `_OTEL_AVAILABLE` flag instead of per-call try/except ImportError
- Remove unused `as span` capture in `base.py` model.load block
- Add None guard to SDK's `inject_trace_context` for consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nager

When `USE_INFERENCE_MODELS=True`, model loading happens inside the inference-models package and is invisible to the inference server. This adds a `TracingModelAccessManager` that hooks into AutoModel's callback system to create spans covering the full loading pipeline.

- `TracingModelAccessManager` subclasses `LiberalModelAccessManager` (preserves default behavior, adds tracing on top)
- `on_model_package_access_granted` starts an `inference_models.load` span
- `on_model_loaded` ends the span (captures total load time)
- File operations recorded as span events
- Factory function returns None when OTel/inference-models unavailable
- All 5 adapter classes pass the manager to `AutoModel.from_pretrained()`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds OpenTelemetry metrics exported via OTLP to the same collector as traces. Metrics are never sampled — they aggregate across 100% of requests regardless of trace sampling rate.

Metrics added:
- `inference.models.loaded` (UpDownCounter) — currently loaded model count
- `inference.model.loads` (Counter by `model.id`) — total cold starts
- `inference.model.unloads` (Counter by `model.id`) — total unloads
- `inference.model.load.duration` (Histogram by `model.id`) — load time
- `inference.model.infer.count` (Counter by `model.id`) — inference count
- `inference.model.infer.duration` (Histogram by `model.id`) — inference latency
- `inference.roboflow_api.duration` (Histogram by function) — API call latency
- `inference.errors` (Counter by `error.type`) — error count

MeterProvider shares the same OTLP endpoint, protocol, and resource as the TracerProvider. Metrics are pushed every 10 seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
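Creating and recording such instruments can be sketched with the opentelemetry metrics API. The instrument names come from the list above; the helper `record_inference` is a hypothetical wrapper, and the guard keeps the module importable without opentelemetry.

```python
try:
    from opentelemetry import metrics
    _meter = metrics.get_meter("inference")
    _infer_count = _meter.create_counter("inference.model.infer.count")
    _infer_duration = _meter.create_histogram(
        "inference.model.infer.duration", unit="s"
    )
except ImportError:
    _infer_count = _infer_duration = None

def record_inference(model_id: str, seconds: float) -> None:
    """Record one inference. Runs for 100% of requests: metrics bypass
    trace sampling entirely, so counts stay accurate at any sample rate."""
    if _infer_count is None:
        return  # opentelemetry not installed: noop
    _infer_count.add(1, {"model.id": model_id})
    _infer_duration.record(seconds, {"model.id": model_id})
```

With no MeterProvider configured by the application, the API hands back noop instruments, so the calls are safe in every deployment mode.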
Uses GLOBAL_INFERENCE_SERVER_ID so metrics backends (Grafana, etc.) can distinguish pods. Enables per-pod grouping for gauges like inference.models.loaded while still allowing fleet-wide aggregation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CompositePropagator is at opentelemetry.propagators.composite (not opentelemetry.propagation.composite), and TraceContextTextMapPropagator is at opentelemetry.trace.propagation.tracecontext. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The OTLP HTTP exporter doesn't accept insecure=True (gRPC-only). HTTP is insecure by default when using http:// scheme. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`development/otel/start-otel-dev.sh` spins up grafana/otel-lgtm in Docker and imports a pre-configured Inference Server dashboard with panels for inference count, latency, model loads, and traces.

Usage:
- `./development/otel/start-otel-dev.sh` — start
- `./development/otel/start-otel-dev.sh stop` — stop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the OTLP collector is down, the SDK logs full connection-refused tracebacks every export cycle. This is expected and harmless (data is dropped), but clutters server logs. Set the export loggers to CRITICAL so only truly fatal issues are logged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the previous CRITICAL-level suppression with a log filter that emits a single warning on first export failure and suppresses the noisy tracebacks. Resets when the collector comes back. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
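Such a filter might look like the sketch below, built on the stdlib `logging.Filter`. It assumes that any non-error record on the exporter's logger signals recovery; the real reset condition may differ.

```python
import logging

class ExportErrorFilter(logging.Filter):
    """Warn once on the first export failure, swallow the repeated
    tracebacks, and re-arm once the collector comes back."""

    def __init__(self) -> None:
        super().__init__()
        self._failing = False

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.ERROR:
            if self._failing:
                return False  # suppress repeated connection-refused tracebacks
            self._failing = True
            record.msg = "OTLP export failing; suppressing repeats until recovery"
            record.args = ()
            record.exc_info = None  # drop the noisy traceback
            return True
        self._failing = False  # collector is back; warn again on next failure
        return True
```

Attached to the exporter loggers, this keeps one actionable warning per outage instead of a traceback every export cycle.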
The metrics exporter logger is opentelemetry.sdk.metrics._internal.export, not opentelemetry.sdk.metrics.export. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New env vars:
- `OTEL_TRACE_EXPORT_INTERVAL_MS` (default: 5000) — BatchSpanProcessor
- `OTEL_METRIC_EXPORT_INTERVAL_MS` (default: 10000) — PeriodicExportingMetricReader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
By default, data persists across stop/start (docker start reuses the container). Use --clean to wipe all traces/metrics and start fresh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move OTel instrumentation from the inference server's external `TracingModelAccessManager` into the inference-models library itself, following OTel library author guidelines.

inference-models now:
- Depends on opentelemetry-api (optional extra) — noop when not installed
- Wraps `AutoModel.from_pretrained()` in an `inference_models.from_pretrained` span
- Auto-instruments returned model instances (`infer`, `pre_process`, `forward`, `post_process`) via monkey-patching — zero burden on model authors
- Instruments Roboflow API calls and weight downloads with spans
- All instrumentation is noop when no SDK is configured by the application

Removes `TracingModelAccessManager` from the inference server — the library handles its own tracing now. Spans automatically appear as children of the inference server's model.load span via in-process context propagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
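The monkey-patching approach can be sketched as below. `instrument_model` and the span names are illustrative assumptions, not the library's actual API; the method list matches the commit.

```python
import contextlib
import functools

def _span(name: str):
    # Real span when opentelemetry is importable, nullcontext otherwise.
    try:
        from opentelemetry import trace
    except ImportError:
        return contextlib.nullcontext()
    return trace.get_tracer("inference_models").start_as_current_span(name)

def instrument_model(model):
    """Wrap the hot-path methods of one model instance in spans.

    Patching the instance (not the class) means model authors never see
    the instrumentation, and uninstrumented instances are unaffected.
    """
    for name in ("infer", "pre_process", "forward", "post_process"):
        method = getattr(model, name, None)
        if method is None:
            continue

        # Bind method/name as defaults to avoid the late-binding closure trap.
        @functools.wraps(method)
        def wrapper(*args, _method=method, _name=name, **kwargs):
            with _span(f"inference_models.{_name}"):
                return _method(*args, **kwargs)

        setattr(model, name, wrapper)
    return model
```

When no TracerProvider is configured, the spans are non-recording, so the wrapped methods behave identically to the originals.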
…ation

Replace manual `inject_trace_context()` calls in the SDK with the `RequestsInstrumentor`, which globally patches the requests library to:
- Automatically inject `traceparent`/`tracestate` on all outgoing requests
- Create `http.client` child spans with standard HTTP attributes

This covers all `requests.*` calls across the inference server, SDK, and inference-models — including ones we might have missed manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove built-in OTel from inference-models per maintainer request. The inference server's model.load and model.infer spans already cover timing from the server's perspective. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```python
        timeout=ROBOFLOW_API_REQUEST_TIMEOUT,
        verify=ROBOFLOW_API_VERIFY_SSL,
    )
with start_span(
```
Since RequestsInstrumentor().instrument() is called in telemetry.py, every requests call automatically gets an HTTP client span with http.url, method, status code, etc.
This span looks redundant.
```python
# Extract trace_id from traceparent header if present
# (reading from header due to ContextVar isolation in BaseHTTPMiddleware)
traceparent = request.headers.get("traceparent")
if traceparent:
    parts = traceparent.split("-")
    if len(parts) >= 3:
        log_fields["trace_id"] = parts[1]
```
You might be able to extract the current trace_id using the otel library, something like:

```python
# in telemetry.py
from opentelemetry import trace

def current_trace_id():
    span_context = trace.get_current_span().get_span_context()
    if span_context.is_valid:
        return format(span_context.trace_id, "032x")
    return None
```
ecarrara left a comment
## Code Review: Add OpenTelemetry tracing and metrics
### Strengths

- Clean API surface — all public helpers are noop-safe, so callers never need to guard against OTel being absent
- Force-trace feature (`X-Force-Trace` header) is a thoughtful addition for on-demand debugging in production
- Export error filter suppresses noisy connection-refused tracebacks — good operational ergonomics
- Dev tooling — the shell script + Grafana dashboard make it easy to try locally
- Zero overhead when disabled — feature-flagged at startup, not per-request
### Issues & Suggestions

**1. `otel_context.attach()` without `detach()` leaks context in thread pool (Bug)**

`inference/core/workflows/execution_engine/v1/executor/core.py` — `otel_context.attach()` returns a token that must be passed to `otel_context.detach(token)` when the scope ends. In a `ThreadPoolExecutor`, threads are reused, so attached contexts accumulate:

```python
# Current:
otel_context.attach(otel_ctx)

# Should be:
token = otel_context.attach(otel_ctx)
try:
    ...  # step execution
finally:
    otel_context.detach(token)
```

This is a correctness bug — leaked contexts can cause spans in subsequent tasks on the same thread to be parented incorrectly.
**2. `_force_trace_flag` ContextVar is never reset**

`_ForceTraceASGIMiddleware` sets `_force_trace_flag.set(True)` but never resets it to False. Since ASGI runs requests in separate tasks, this is probably fine in practice (each task gets the default False). However, if any middleware or framework code reuses the same task context, the flag could leak. A try/finally reset would be safer:

```python
async def __call__(self, scope, receive, send):
    if scope["type"] == "http":
        for header_name, header_value in scope.get("headers", []):
            if header_name == FORCE_TRACE_HEADER:
                if header_value.lower() == b"true":
                    _force_trace_flag.set(True)
                break
    try:
        await self.app(scope, receive, send)
    finally:
        _force_trace_flag.set(False)
```

**3. Duplicate `record_api_call` on every exception path (DRY)**
In `roboflow_api.py`, `record_api_call(function.__name__, time.perf_counter() - t_start)` is repeated in every except block plus the happy path. A try/finally would be cleaner and also ensures the metric is recorded if an unexpected exception type is raised:

```python
try:
    try:
        return function(*args, **kwargs)
    except RetryRequestError as error:
        raise error.inner_error
    except ...:
        ...
finally:
    record_api_call(function.__name__, time.perf_counter() - t_start)
```

**4. Access log trace_id parses raw header instead of OTel context**
http_api.py — The comment says this is due to "ContextVar isolation in BaseHTTPMiddleware", which is a known Starlette issue. However, this means the access log only gets the incoming traceparent trace ID, which won't exist for server-initiated traces (e.g., X-Force-Trace without a parent). Consider documenting this limitation or using trace.get_current_span() if you can move the access log away from BaseHTTPMiddleware.
**5. `insecure=True` hardcoded for gRPC exporter**
telemetry.py — The gRPC exporter always uses insecure=True. Fine for local dev but could be a concern in production if someone points OTEL_EXPORTER_ENDPOINT at a remote collector. Consider making this configurable or defaulting based on the endpoint.
**6. `OTEL_SERVICE_NAME` collides with standard OTel env var**
The OTel SDK itself reads OTEL_SERVICE_NAME as a standard environment variable. Since you're also reading it in env.py and passing it explicitly to Resource.create(), there's a risk of confusion — the SDK's auto-configuration and your explicit config may interact unexpectedly. Consider using a prefixed name (e.g., INFERENCE_OTEL_SERVICE_NAME) or documenting that you intentionally reuse the standard var.
**7. No tests**
At minimum, unit tests for:
- `start_span`/`record_error`/`set_span_attribute` noop behavior when OTel is absent
- `_ForceTraceRootSampler` sampling logic
- `_ExportErrorFilter` suppress/reset behavior
- `trace_context_log_processor` structlog integration
### Summary
| Priority | Issue |
|---|---|
| Must fix | otel_context.attach() without detach() — context leak in thread pool |
| Should fix | record_api_call duplication → use try/finally |
| Should fix | Add unit tests for the telemetry module |
| Consider | Reset _force_trace_flag in finally block |
| Consider | Make gRPC insecure configurable |
| Consider | Rename or document OTEL_SERVICE_NAME collision with standard OTel env var |
## Summary
Adds OpenTelemetry distributed tracing and metrics to the inference server. Integrates with the tracing in async-serverless#192 for end-to-end traces across the system.
Disabled by default (`OTEL_TRACING_ENABLED=False`). Zero overhead when off. Safe when no collector is running.

## Tracing

- `opentelemetry-instrumentation-fastapi` (extracts `traceparent`, creates server spans)
- `opentelemetry-instrumentation-requests` (injects `traceparent`, creates `http.client` spans)
- `model.load` and `model.infer` spans with `model.id` attributes
- `roboflow_api.call` and `roboflow_api.http_get` spans
- OTel context propagation through the workflow `ThreadPoolExecutor`
- `trace_id`/`span_id` injected into structlog entries
- `X-Force-Trace: true` header forces sampling on standalone requests; honoured parent decisions take priority

## Metrics (alongside existing Prometheus)
Never sampled — 100% of requests.

| Metric | Labels |
|---|---|
| `inference.models.loaded` | |
| `inference.model.loads` / `.unloads` | `model.id` |
| `inference.model.load.duration` | `model.id` |
| `inference.model.infer.count` | `model.id` |
| `inference.model.infer.duration` | `model.id` |
| `inference.roboflow_api.duration` | `roboflow_api.function` |
| `inference.errors` | `error.type` |
All metrics include `service.instance.id` for per-pod breakdown.

## Configuration

| Variable | Default | Notes |
|---|---|---|
| `OTEL_TRACING_ENABLED` | `False` | |
| `OTEL_SERVICE_NAME` | `inference-server` | |
| `OTEL_EXPORTER_PROTOCOL` | `grpc` | `grpc` or `http` |
| `OTEL_EXPORTER_ENDPOINT` | `localhost:4317` | |
| `OTEL_SAMPLING_RATE` | `1.0` | |
| `OTEL_TRACE_EXPORT_INTERVAL_MS` | `5000` | |
| `OTEL_METRIC_EXPORT_INTERVAL_MS` | `10000` | |

## Local dev tool
Pre-configured dashboard at http://localhost:3000 (admin/admin).
## Files changed (12 files)

- `requirements/requirements.http.txt`
- `inference/core/env.py` — `OTEL_*` env vars
- `inference/core/telemetry.py`
- `inference/core/interfaces/http/http_api.py`
- `inference/core/managers/base.py`
- `inference/core/interfaces/http/error_handlers.py`
- `inference/core/roboflow_api.py`
- `inference/core/logger.py`
- `inference/core/workflows/.../executor/core.py`
- `inference_sdk/http/client.py`
- `development/otel/start-otel-dev.sh`
- `development/otel/dashboard.json`

🤖 Generated with Claude Code