
Add OpenTelemetry tracing and metrics#2141

Open
hansent wants to merge 24 commits into main from feat/inference-models-otel

Conversation


@hansent hansent commented Mar 23, 2026

Summary

Adds OpenTelemetry distributed tracing and metrics to the inference server. Integrates with the tracing in async-serverless#192 for end-to-end traces across the system.

Disabled by default (OTEL_TRACING_ENABLED=False). Zero overhead when off. Safe when no collector is running.

Tracing

  • HTTP requests — auto-instrumented via opentelemetry-instrumentation-fastapi (extracts traceparent, creates server spans)
  • Outgoing requests — auto-instrumented via opentelemetry-instrumentation-requests (injects traceparent, creates http.client spans)
  • Model load/infer — model.load and model.infer spans with model.id attributes
  • Roboflow API calls — roboflow_api.call and roboflow_api.http_get spans
  • Error recording — all route handler errors recorded on active spans
  • Workflow thread propagation — OTel context forwarded through ThreadPoolExecutor
  • Log correlation — trace_id/span_id injected into structlog entries
  • On-demand tracing — an X-Force-Trace: true header forces sampling on standalone requests; when a parent traceparent exists, the parent's sampling decision is honoured and the header is ignored

Metrics (alongside existing Prometheus)

Never sampled — 100% of requests.

| Metric | Type | Attributes |
| --- | --- | --- |
| inference.models.loaded | UpDownCounter | — |
| inference.model.loads / .unloads | Counter | model.id |
| inference.model.load.duration | Histogram | model.id |
| inference.model.infer.count | Counter | model.id |
| inference.model.infer.duration | Histogram | model.id |
| inference.roboflow_api.duration | Histogram | roboflow_api.function |
| inference.errors | Counter | error.type |

All metrics include service.instance.id for per-pod breakdown.
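For orientation, the instruments in the table map onto the OTel metrics API roughly as follows. This is a hedged sketch — instrument creation only, with provider setup and the real module layout omitted; without an SDK configured, `get_meter()` returns no-op instruments, which matches the PR's noop-safe design.

```python
# Sketch: creating two of the instruments from the table with the
# OpenTelemetry metrics API. Guarded import so the snippet is safe
# even when opentelemetry is not installed.
try:
    from opentelemetry import metrics
except ImportError:
    metrics = None

instruments = {}
if metrics is not None:
    meter = metrics.get_meter("inference")
    instruments["models_loaded"] = meter.create_up_down_counter(
        "inference.models.loaded", description="Currently loaded model count"
    )
    instruments["infer_duration"] = meter.create_histogram(
        "inference.model.infer.duration",
        unit="s",
        description="Inference latency by model",
    )
    # Recording with attributes, e.g. after an inference completes:
    instruments["models_loaded"].add(1)
    instruments["infer_duration"].record(0.123, {"model.id": "example-model/1"})
```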

Configuration

| Env Var | Default | Description |
| --- | --- | --- |
| OTEL_TRACING_ENABLED | False | Master switch |
| OTEL_SERVICE_NAME | inference-server | Service name |
| OTEL_EXPORTER_PROTOCOL | grpc | grpc or http |
| OTEL_EXPORTER_ENDPOINT | localhost:4317 | OTLP collector |
| OTEL_SAMPLING_RATE | 1.0 | Trace sampling (parent-based) |
| OTEL_TRACE_EXPORT_INTERVAL_MS | 5000 | Span push interval |
| OTEL_METRIC_EXPORT_INTERVAL_MS | 10000 | Metric push interval |

Local dev tool

./development/otel/start-otel-dev.sh          # Grafana + Tempo + Prometheus
./development/otel/start-otel-dev.sh --clean   # fresh start
./development/otel/start-otel-dev.sh stop

OTEL_TRACING_ENABLED=True OTEL_EXPORTER_PROTOCOL=http OTEL_EXPORTER_ENDPOINT=localhost:4318 python debugrun.py

Pre-configured dashboard at http://localhost:3000 (admin/admin).

Files changed (12 files)

| File | Change |
| --- | --- |
| requirements/requirements.http.txt | OTel packages |
| inference/core/env.py | OTEL_* env vars |
| inference/core/telemetry.py | New — tracing, metrics, force-trace, export error handling |
| inference/core/interfaces/http/http_api.py | Setup, shutdown, CORS, access log |
| inference/core/managers/base.py | Spans + metrics on model load/infer/unload |
| inference/core/interfaces/http/error_handlers.py | Error recording + metric |
| inference/core/roboflow_api.py | API call spans + timing |
| inference/core/logger.py | trace_context_log_processor |
| inference/core/workflows/.../executor/core.py | OTel context through ThreadPool |
| inference_sdk/http/client.py | Cleanup (manual injection removed) |
| development/otel/start-otel-dev.sh | Local dev tool |
| development/otel/dashboard.json | Grafana dashboard |

🤖 Generated with Claude Code

hansent and others added 22 commits March 20, 2026 19:50
Instrument the inference server with OpenTelemetry to participate in
distributed traces initiated by the async-serverless system. When enabled
via OTEL_TRACING_ENABLED=True, the server extracts W3C traceparent
headers from incoming requests, creates spans for key operations, and
exports traces to an OTLP collector.

Key changes:
- New inference/core/telemetry.py module with TracerProvider setup,
  span helpers, and error recording
- FastAPI auto-instrumentation for HTTP server spans
- Manual spans on model.load, model.infer, and all Roboflow API calls
- Error recording on all route exception handlers
- Trace context (traceparent) propagation through SDK HTTP calls for
  remote workflow step execution
- OTel context propagation through workflow ThreadPoolExecutor
- trace_id/span_id injection into structlog entries
- Zero overhead when disabled (default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
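The ThreadPoolExecutor propagation mentioned above can be sketched as follows. The helper name `run_with_context` is hypothetical; the attach/detach pairing matters because pool threads are reused between tasks.

```python
# Sketch: carrying the caller's OTel context into a worker thread.
# Guarded import so the pattern degrades to a plain call without OTel.
from concurrent.futures import ThreadPoolExecutor

try:
    from opentelemetry import context as otel_context
except ImportError:
    otel_context = None

def run_with_context(fn, *args, **kwargs):
    # Capture the submitting thread's context at submission time.
    captured = otel_context.get_current() if otel_context else None

    def wrapper():
        if captured is None:
            return fn(*args, **kwargs)
        token = otel_context.attach(captured)
        try:
            return fn(*args, **kwargs)
        finally:
            # Detach so the reused pool thread doesn't keep a stale context.
            otel_context.detach(token)

    return wrapper

with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(run_with_context(lambda: "ok"))
    result = future.result()
```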
When the inference server is hit directly (no parent traceparent), the
sampling rate controls which requests are traced. This adds an
X-Force-Trace: true header that overrides sampling for a specific
request, useful for debugging in production.

How it works:
- _ForceTraceASGIMiddleware (outermost ASGI layer) reads the header
  and sets a ContextVar before the OTel instrumentor runs
- _ForceTraceRootSampler checks the ContextVar and force-samples
  when set, otherwise delegates to the ratio-based sampler
- ParentBased wrapping ensures that when a parent traceparent exists
  (e.g. from async-serverless), the parent's decision is honoured
  and the force-trace header is ignored — no duplicate tracing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
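The middleware/sampler handshake can be illustrated with plain ContextVars. This is a sketch with stand-in names and string decisions in place of the SDK's SamplingResult; the real sampler also sits inside a ParentBased wrapper.

```python
# Sketch: ContextVar set by the ASGI middleware, read by the root sampler.
from contextvars import ContextVar

_force_trace_flag: ContextVar[bool] = ContextVar("force_trace", default=False)

class ForceTraceRootSampler:
    """Force-samples when the flag is set, otherwise delegates (e.g. ratio-based)."""

    def __init__(self, delegate):
        self._delegate = delegate

    def should_sample(self, *args, **kwargs):
        if _force_trace_flag.get():
            return "RECORD_AND_SAMPLE"  # stand-in for the SDK's sampling decision
        return self._delegate(*args, **kwargs)

# Delegate that drops everything, standing in for a low sampling rate.
sampler = ForceTraceRootSampler(lambda *a, **k: "DROP")

# What the middleware does when it sees X-Force-Trace: true:
token = _force_trace_flag.set(True)
try:
    forced_decision = sampler.should_sample()
finally:
    _force_trace_flag.reset(token)

default_decision = sampler.should_sample()
```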
- telemetry.py: wrap opentelemetry imports in try/except so the module
  is always safe to import even without opentelemetry installed. All
  public helpers (start_span, record_error, inject_trace_context)
  degrade to noops.
- Move from inference.core.telemetry imports to top-level in
  roboflow_api.py, error_handlers.py, base.py
- Add inject_trace_context() helper to inference_sdk/config.py,
  replacing scattered try/except ImportError blocks in client.py,
  executors.py, and request_building.py
- Add top-level guarded otel_context import in workflow executor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
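The guarded-import pattern described above might look like this. A sketch only — the real helper signatures in inference/core/telemetry.py may differ.

```python
# Sketch: telemetry helpers that degrade to noops without opentelemetry.
from contextlib import contextmanager

try:
    from opentelemetry import trace  # optional dependency
    _OTEL_AVAILABLE = True
except ImportError:
    _OTEL_AVAILABLE = False

@contextmanager
def start_span(name, attributes=None):
    """Yields a real span when OTel is importable, otherwise None."""
    if not _OTEL_AVAILABLE:
        yield None
        return
    tracer = trace.get_tracer("inference")
    with tracer.start_as_current_span(name, attributes=attributes or {}) as span:
        yield span

def set_span_attribute(span, key, value):
    """No-op when span is None, so callers never guard against OTel's absence."""
    if span is not None:
        span.set_attribute(key, value)
```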
…metry

- Add set_span_attribute() to telemetry.py — callers no longer need
  to guard against None spans, business logic stays OTel-unaware
- Move add_trace_context from logger.py to telemetry.py as
  trace_context_log_processor, using the module's _OTEL_AVAILABLE
  flag instead of per-call try/except ImportError
- Remove unused `as span` capture in base.py model.load block
- Add None guard to SDK's inject_trace_context for consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
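A minimal version of the trace_context_log_processor described above could look like this (sketch; assumes structlog's three-argument processor signature):

```python
# Sketch: structlog-style processor that injects trace/span IDs.
# The import guard mirrors the module-level _OTEL_AVAILABLE flag mentioned above.
try:
    from opentelemetry import trace
    _OTEL_AVAILABLE = True
except ImportError:
    _OTEL_AVAILABLE = False

def trace_context_log_processor(logger, method_name, event_dict):
    """Add trace_id/span_id to the log entry when inside an active span."""
    if not _OTEL_AVAILABLE:
        return event_dict
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict
```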
…nager

When USE_INFERENCE_MODELS=True, model loading happens inside the
inference-models package and is invisible to the inference server.
This adds a TracingModelAccessManager that hooks into AutoModel's
callback system to create spans covering the full loading pipeline.

- TracingModelAccessManager subclasses LiberalModelAccessManager
  (preserves default behavior, adds tracing on top)
- on_model_package_access_granted starts an inference_models.load span
- on_model_loaded ends the span (captures total load time)
- File operations recorded as span events
- Factory function returns None when OTel/inference-models unavailable
- All 5 adapter classes pass the manager to AutoModel.from_pretrained()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds OpenTelemetry metrics exported via OTLP to the same collector as
traces. Metrics are never sampled — they aggregate across 100% of
requests regardless of trace sampling rate.

Metrics added:
- inference.models.loaded (UpDownCounter) — currently loaded model count
- inference.model.loads (Counter by model.id) — total cold starts
- inference.model.unloads (Counter by model.id) — total unloads
- inference.model.load.duration (Histogram by model.id) — load time
- inference.model.infer.count (Counter by model.id) — inference count
- inference.model.infer.duration (Histogram by model.id) — inference latency
- inference.roboflow_api.duration (Histogram by function) — API call latency
- inference.errors (Counter by error.type) — error count

MeterProvider shares the same OTLP endpoint, protocol, and resource
as the TracerProvider. Metrics are pushed every 10 seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses GLOBAL_INFERENCE_SERVER_ID so metrics backends (Grafana, etc.)
can distinguish pods. Enables per-pod grouping for gauges like
inference.models.loaded while still allowing fleet-wide aggregation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CompositePropagator is at opentelemetry.propagators.composite (not
opentelemetry.propagation.composite), and TraceContextTextMapPropagator
is at opentelemetry.trace.propagation.tracecontext.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The OTLP HTTP exporter doesn't accept insecure=True (gRPC-only).
HTTP is insecure by default when using http:// scheme.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
development/otel/start-otel-dev.sh spins up grafana/otel-lgtm in
Docker and imports a pre-configured Inference Server dashboard with
panels for inference count, latency, model loads, and traces.

Usage:
  ./development/otel/start-otel-dev.sh       # start
  ./development/otel/start-otel-dev.sh stop  # stop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the OTLP collector is down, the SDK logs full connection-refused
tracebacks every export cycle. This is expected and harmless (data is
dropped), but clutters server logs. Set the export loggers to CRITICAL
so only truly fatal issues are logged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the previous CRITICAL-level suppression with a log filter that
emits a single warning on first export failure and suppresses the noisy
tracebacks. Resets when the collector comes back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
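One way to implement the warn-once/suppress/reset behaviour is a standard logging.Filter. A sketch — the actual class name and message matching in telemetry.py may differ.

```python
# Sketch: filter that emits one warning on the first export failure,
# suppresses repeats, and resets once a non-failure record comes through.
import logging

class ExportErrorFilter(logging.Filter):
    def __init__(self):
        super().__init__()
        self._failing = False

    def filter(self, record: logging.LogRecord) -> bool:
        is_failure = "Failed to export" in record.getMessage()
        if is_failure:
            if self._failing:
                return False  # suppress repeated connection-refused tracebacks
            self._failing = True
            record.msg = "OTLP export failing; suppressing further errors until recovery"
            record.args = ()
            return True
        self._failing = False  # collector is back; arm the warning again
        return True
```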
The metrics exporter logger is opentelemetry.sdk.metrics._internal.export,
not opentelemetry.sdk.metrics.export.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New env vars:
- OTEL_TRACE_EXPORT_INTERVAL_MS (default: 5000) — BatchSpanProcessor
- OTEL_METRIC_EXPORT_INTERVAL_MS (default: 10000) — PeriodicExportingMetricReader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
By default, data persists across stop/start (docker start reuses the
container). Use --clean to wipe all traces/metrics and start fresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move OTel instrumentation from the inference server's external
TracingModelAccessManager into the inference-models library itself,
following OTel library author guidelines.

inference-models now:
- Depends on opentelemetry-api (optional extra) — noop when not installed
- Wraps AutoModel.from_pretrained() in inference_models.from_pretrained span
- Auto-instruments returned model instances (infer, pre_process, forward,
  post_process) via monkey-patching — zero burden on model authors
- Instruments Roboflow API calls and weight downloads with spans
- All instrumentation is noop when no SDK is configured by the application

Removes TracingModelAccessManager from inference server — the library
handles its own tracing now. Spans automatically appear as children of
the inference server's model.load span via in-process context propagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hansent hansent changed the title Add built-in OTel instrumentation to inference-models Add OpenTelemetry distributed tracing, metrics, and inference-models instrumentation Mar 23, 2026
@hansent hansent changed the title Add OpenTelemetry distributed tracing, metrics, and inference-models instrumentation Open Telemetry instrumentation Mar 23, 2026
hansent and others added 2 commits March 23, 2026 12:31
…ation

Replace manual inject_trace_context() calls in the SDK with the
RequestsInstrumentor which globally patches the requests library to:
- Automatically inject traceparent/tracestate on all outgoing requests
- Create http.client child spans with standard HTTP attributes

This covers all requests.* calls across inference server, SDK, and
inference-models — including ones we might have missed manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove built-in OTel from inference-models per maintainer request.
The inference server's model.load and model.infer spans already cover
timing from the server's perspective.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hansent hansent changed the title Open Telemetry instrumentation Add OpenTelemetry tracing and metrics Mar 23, 2026
timeout=ROBOFLOW_API_REQUEST_TIMEOUT,
verify=ROBOFLOW_API_VERIFY_SSL,
)
with start_span(
Since RequestsInstrumentor().instrument() is called in telemetry.py, every requests call automatically gets an HTTP client span with http.url, method, status code, etc.

This span looks redundant.

Comment on lines +807 to +813
# Extract trace_id from traceparent header if present
# (reading from header due to ContextVar isolation in BaseHTTPMiddleware)
traceparent = request.headers.get("traceparent")
if traceparent:
parts = traceparent.split("-")
if len(parts) >= 3:
log_fields["trace_id"] = parts[1]
You might be able to extract the current trace_id using the otel library, something like:

# in telemetry.py

def current_trace_id():
    from opentelemetry import trace

    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        return format(ctx.trace_id, "032x")


@ecarrara ecarrara left a comment


Code Review: Add OpenTelemetry tracing and metrics

Strengths

  • Clean API surface — all public helpers are noop-safe, so callers never need to guard against OTel being absent
  • Force-trace feature (X-Force-Trace header) is a thoughtful addition for on-demand debugging in production
  • Export error filter suppresses noisy connection-refused tracebacks — good operational ergonomics
  • Dev tooling — the shell script + Grafana dashboard make it easy to try locally
  • Zero overhead when disabled — feature-flagged at startup, not per-request

Issues & Suggestions

1. otel_context.attach() without detach() leaks context in thread pool (Bug)

inference/core/workflows/execution_engine/v1/executor/core.py — otel_context.attach() returns a token that must be passed to otel_context.detach(token) when the scope ends. In a ThreadPoolExecutor, threads are reused, so attached contexts accumulate:

# Current:
otel_context.attach(otel_ctx)

# Should be:
token = otel_context.attach(otel_ctx)
try:
    # ... step execution ...
finally:
    otel_context.detach(token)

This is a correctness bug — leaked contexts can cause spans in subsequent tasks on the same thread to be parented incorrectly.

2. _force_trace_flag ContextVar is never reset

_ForceTraceASGIMiddleware sets _force_trace_flag.set(True) but never resets it to False. Since ASGI runs requests in separate tasks, this is probably fine in practice (each task gets the default False). However, if any middleware or framework code reuses the same task context, the flag could leak. A try/finally reset would be safer:

async def __call__(self, scope, receive, send):
    if scope["type"] == "http":
        for header_name, header_value in scope.get("headers", []):
            if header_name == FORCE_TRACE_HEADER:
                if header_value.lower() == b"true":
                    _force_trace_flag.set(True)
                break
    try:
        await self.app(scope, receive, send)
    finally:
        _force_trace_flag.set(False)

3. Duplicate record_api_call on every exception path (DRY)

In roboflow_api.py, record_api_call(function.__name__, time.perf_counter() - t_start) is repeated in every except block plus the happy path. A try/finally would be cleaner and also ensures the metric is recorded if an unexpected exception type is raised:

try:
    try:
        result = function(*args, **kwargs)
        return result
    except RetryRequestError as error:
        raise error.inner_error
except ...:
    ...
finally:
    record_api_call(function.__name__, time.perf_counter() - t_start)

4. Access log trace_id parses raw header instead of OTel context

http_api.py — The comment says this is due to "ContextVar isolation in BaseHTTPMiddleware", which is a known Starlette issue. However, this means the access log only gets the incoming traceparent trace ID, which won't exist for server-initiated traces (e.g., X-Force-Trace without a parent). Consider documenting this limitation or using trace.get_current_span() if you can move the access log away from BaseHTTPMiddleware.

5. insecure=True hardcoded for gRPC exporter

telemetry.py — The gRPC exporter always uses insecure=True. Fine for local dev but could be a concern in production if someone points OTEL_EXPORTER_ENDPOINT at a remote collector. Consider making this configurable or defaulting based on the endpoint.

6. OTEL_SERVICE_NAME collides with standard OTel env var

The OTel SDK itself reads OTEL_SERVICE_NAME as a standard environment variable. Since you're also reading it in env.py and passing it explicitly to Resource.create(), there's a risk of confusion — the SDK's auto-configuration and your explicit config may interact unexpectedly. Consider using a prefixed name (e.g., INFERENCE_OTEL_SERVICE_NAME) or documenting that you intentionally reuse the standard var.

7. No tests

At minimum, unit tests for:

  • start_span / record_error / set_span_attribute noop behavior when OTel is absent
  • _ForceTraceRootSampler sampling logic
  • _ExportErrorFilter suppress/reset behavior
  • trace_context_log_processor structlog integration

Summary

Priority Issue
Must fix otel_context.attach() without detach() — context leak in thread pool
Should fix record_api_call duplication → use try/finally
Should fix Add unit tests for the telemetry module
Consider Reset _force_trace_flag in finally block
Consider Make gRPC insecure configurable
Consider Rename or document OTEL_SERVICE_NAME collision with standard OTel env var
