Distributed Tracing Support for Fine-Grained Observability #311

@Xunzhuo

Description

Is your feature request related to a problem? Please describe.

Currently, vLLM Semantic Router provides basic observability through Prometheus metrics and structured logging. However, these approaches have limitations when it comes to understanding the complete request lifecycle across distributed components:

  • Limited Request Context: Metrics provide aggregated data but lack per-request visibility into the routing decision flow
  • Difficult Root Cause Analysis: When issues occur (e.g., high latency, routing errors), it's challenging to trace the exact path a request took through classification, routing, security checks, and backend selection
  • No Cross-Service Correlation: As the system integrates with vLLM engines and other components, there's no unified way to correlate traces across service boundaries
  • Missing Fine-Grained Timing: While we have overall latency metrics, we lack detailed breakdowns of time spent in each processing stage (classification, PII detection, jailbreak detection, cache lookup, model selection, etc.)

This becomes especially problematic when:

  • Debugging production issues where specific requests fail or perform poorly
  • Optimizing the routing pipeline by identifying bottlenecks
  • Understanding the impact of different routing strategies on end-to-end latency
  • Integrating with the broader vLLM Production Stack for unified observability

Describe the solution you'd like

Implement comprehensive distributed tracing support using industry-standard OpenTelemetry instrumentation, leveraging either:

  1. OpenInference (https://github.com/Arize-ai/openinference) - Specialized for LLM observability with semantic conventions for AI/ML workloads
  2. OpenLLMetry (https://github.com/traceloop/openllmetry) - Purpose-built for LLM application tracing with automatic instrumentation

Key Implementation Requirements:

1. Core Tracing Infrastructure

  • Integrate the OpenTelemetry SDK for Go in the main router service (see the initialization sketch after this list)
  • Add trace context propagation through HTTP headers and gRPC metadata
  • Support multiple trace exporters (OTLP, Jaeger, Zipkin)
  • Configure sampling strategies (always-on for development, probabilistic for production)
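
A minimal initialization sketch for the Go router (package name, function name, and option values are illustrative; only OpenTelemetry Go SDK APIs are assumed, and the actual values would come from the configuration described in section 4):

package tracing

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracer installs a global tracer provider with an OTLP gRPC exporter,
// async batched export, and parent-based probabilistic sampling.
func InitTracer(ctx context.Context, endpoint string, sampleRate float64) (func(context.Context) error, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(endpoint), // e.g. "localhost:4317"
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter), // non-blocking, batched span export
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(sampleRate))),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("vllm-semantic-router"),
        )),
    )
    otel.SetTracerProvider(tp)

    // W3C Trace Context + Baggage propagation over HTTP headers / gRPC metadata.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, propagation.Baggage{},
    ))
    return tp.Shutdown, nil
}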

2. Instrumentation Points

Instrument the following critical paths with spans:

Request Processing Pipeline:

  • semantic_router.request.received - Entry point span
  • semantic_router.classification - Category classification with model name and confidence
  • semantic_router.security.pii_detection - PII detection with results
  • semantic_router.security.jailbreak_detection - Jailbreak detection with results
  • semantic_router.cache.lookup - Semantic cache operations
  • semantic_router.routing.decision - Model selection logic with reasoning
  • semantic_router.backend.selection - Endpoint selection
  • semantic_router.upstream.request - Forwarding to vLLM backend
  • semantic_router.response.processing - Response handling

Span Attributes (following OpenInference conventions; an instrumentation sketch follows this list):

  • Request metadata: request_id, user_id, session_id
  • Model information: model.name, model.provider, model.version
  • Classification: category.name, category.confidence, classifier.type
  • Routing: routing.strategy, routing.reason, original_model, selected_model
  • Security: pii.detected, jailbreak.detected, security.action
  • Performance: token.count.prompt, token.count.completion, cache.hit
  • Reasoning: reasoning.enabled, reasoning.effort, reasoning.family
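
As a rough sketch of how one stage could be instrumented with the span name and attribute keys proposed above (the Category type, the classify callback, and the "bert" value are placeholders, not the router's actual types):

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

// Category is a placeholder for the router's classification result.
type Category struct {
    Name       string
    Confidence float64
}

// classifyWithSpan wraps a classifier call in a span carrying the
// classification attributes listed above.
func classifyWithSpan(ctx context.Context, prompt string,
    classify func(context.Context, string) (Category, error)) (Category, error) {

    ctx, span := otel.Tracer("semantic_router").Start(ctx, "semantic_router.classification")
    defer span.End()

    category, err := classify(ctx, prompt)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "classification failed")
        return Category{}, err
    }

    span.SetAttributes(
        attribute.String("category.name", category.Name),
        attribute.Float64("category.confidence", category.Confidence),
        attribute.String("classifier.type", "bert"),
    )
    return category, nil
}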

3. Integration with vLLM Production Stack

  • Propagate trace context to vLLM engine requests (see the propagation sketch after this list)
  • Correlate router traces with vLLM engine traces
  • Support unified trace visualization across the full stack
  • Enable end-to-end latency analysis from router to model inference
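
Propagation itself can reuse the globally configured propagator; a minimal sketch for an outgoing HTTP request to the backend (the helper name is illustrative):

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

// injectTraceContext copies the active trace context into the outgoing
// request headers so an instrumented vLLM engine can join the same trace.
func injectTraceContext(ctx context.Context, req *http.Request) {
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
}

Wrapping the upstream HTTP client with the otelhttp transport from opentelemetry-go-contrib is an alternative that adds client spans on top of header propagation.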

4. Configuration

Add tracing configuration to config.yaml (a possible Go struct mapping is sketched after the example):

observability:
  tracing:
    enabled: true
    provider: "opentelemetry"  # or "openinference", "openllmetry"
    exporter:
      type: "otlp"  # otlp, jaeger, zipkin, stdout
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "probabilistic"  # always_on, always_off, probabilistic
      rate: 0.1  # 10% sampling for production
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
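
On the Go side this could map onto config structs along the following lines (field and type names are a sketch mirroring the YAML keys above, not the router's existing config types):

// TracingConfig mirrors the observability.tracing section of config.yaml.
type TracingConfig struct {
    Enabled  bool           `yaml:"enabled"`
    Provider string         `yaml:"provider"` // "opentelemetry", "openinference", "openllmetry"
    Exporter ExporterConfig `yaml:"exporter"`
    Sampling SamplingConfig `yaml:"sampling"`
    Resource ResourceConfig `yaml:"resource"`
}

type ExporterConfig struct {
    Type     string `yaml:"type"` // otlp, jaeger, zipkin, stdout
    Endpoint string `yaml:"endpoint"`
    Insecure bool   `yaml:"insecure"`
}

type SamplingConfig struct {
    Type string  `yaml:"type"` // always_on, always_off, probabilistic
    Rate float64 `yaml:"rate"`
}

type ResourceConfig struct {
    ServiceName           string `yaml:"service_name"`
    ServiceVersion        string `yaml:"service_version"`
    DeploymentEnvironment string `yaml:"deployment_environment"`
}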

5. Visualization and Analysis

  • Provide example Jaeger/Grafana Tempo configurations (e.g. the local Jaeger setup shown after this list)
  • Create sample trace queries for common debugging scenarios
  • Document trace analysis workflows for performance optimization
  • Include dashboard templates for trace-based metrics
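
For local development, one possible setup is a Jaeger all-in-one container with OTLP ingestion enabled (default ports assumed; 4317 matches the OTLP endpoint in the configuration example above):

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

The trace UI is then reachable at http://localhost:16686.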

6. Performance Considerations

  • Minimize tracing overhead (< 1% latency impact)
  • Use async span export to avoid blocking request processing
  • Implement efficient span batching (see the batcher tuning sketch after this list)
  • Support dynamic sampling rate adjustment
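
With the OpenTelemetry Go SDK, batching behaviour can be bounded through batch span processor options; the values below are illustrative starting points to tune, not recommendations:

import (
    "time"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// batcherOption builds a WithBatcher option with explicit limits so span
// export stays off the request path.
func batcherOption(exporter sdktrace.SpanExporter) sdktrace.TracerProviderOption {
    return sdktrace.WithBatcher(exporter,
        sdktrace.WithMaxQueueSize(2048),          // drop spans rather than block when the queue is full
        sdktrace.WithBatchTimeout(5*time.Second), // flush at least every 5 seconds
        sdktrace.WithMaxExportBatchSize(512),     // cap spans per export call
    )
}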

Additional context

Benefits:

  • Enhanced Debugging: Quickly identify where requests fail or slow down in the pipeline
  • Performance Optimization: Pinpoint exact bottlenecks in the routing logic
  • Production Readiness: Industry-standard observability for enterprise deployments
  • vLLM Stack Integration: Seamless correlation with vLLM engine traces for full-stack visibility
  • Compliance & Auditing: Detailed audit trails for security and compliance requirements

Implementation Approach:

  1. Phase 1: Core OpenTelemetry integration with basic span instrumentation
  2. Phase 2: Add OpenInference/OpenLLMetry semantic conventions for LLM-specific attributes
  3. Phase 3: Integrate with vLLM engine tracing for end-to-end correlation
  4. Phase 4: Advanced features (trace sampling strategies, custom exporters, trace-based alerting)

Related Projects:

  • OpenInference: https://github.com/Arize-ai/openinference
  • OpenLLMetry: https://github.com/traceloop/openllmetry
  • vLLM Production Stack (for end-to-end trace correlation)

Example Trace Visualization:

Request Trace (trace_id: abc123)
├─ semantic_router.request.received [2ms]
│  ├─ semantic_router.classification [45ms]
│  │  └─ attributes: {category: "coding", confidence: 0.95, classifier: "bert"}
│  ├─ semantic_router.security.pii_detection [8ms]
│  │  └─ attributes: {pii.detected: false}
│  ├─ semantic_router.security.jailbreak_detection [12ms]
│  │  └─ attributes: {jailbreak.detected: false}
│  ├─ semantic_router.cache.lookup [3ms]
│  │  └─ attributes: {cache.hit: false}
│  ├─ semantic_router.routing.decision [5ms]
│  │  └─ attributes: {original_model: "auto", selected_model: "deepseek-coder-v3", reasoning.enabled: true}
│  ├─ semantic_router.backend.selection [2ms]
│  │  └─ attributes: {endpoint: "endpoint1", address: "127.0.0.1:8000"}
│  └─ semantic_router.upstream.request [1250ms]
│     └─ vllm.engine.generate [1245ms]  # Correlated vLLM trace
│        ├─ vllm.tokenization [15ms]
│        ├─ vllm.inference [1200ms]
│        └─ vllm.detokenization [30ms]
└─ semantic_router.response.processing [5ms]
   └─ attributes: {token.count.prompt: 150, token.count.completion: 500, status: "success"}

Total Duration: 1332ms

This feature will significantly enhance the observability capabilities of vLLM Semantic Router and position it as a production-ready component of the vLLM ecosystem with enterprise-grade monitoring.
