Is your feature request related to a problem? Please describe.
Currently, vLLM Semantic Router provides basic observability through Prometheus metrics and structured logging. However, these approaches have limitations when it comes to understanding the complete request lifecycle across distributed components:
- Limited Request Context: Metrics provide aggregated data but lack per-request visibility into the routing decision flow
 - Difficult Root Cause Analysis: When issues occur (e.g., high latency, routing errors), it's challenging to trace the exact path a request took through classification, routing, security checks, and backend selection
 - No Cross-Service Correlation: As the system integrates with vLLM engines and other components, there's no unified way to correlate traces across service boundaries
 - Missing Fine-Grained Timing: While we have overall latency metrics, we lack detailed breakdowns of time spent in each processing stage (classification, PII detection, jailbreak detection, cache lookup, model selection, etc.)
 
This becomes especially problematic when:
- Debugging production issues where specific requests fail or perform poorly
 - Optimizing the routing pipeline by identifying bottlenecks
 - Understanding the impact of different routing strategies on end-to-end latency
 - Integrating with the broader vLLM Production Stack for unified observability
 
Describe the solution you'd like
Implement comprehensive distributed tracing support using industry-standard OpenTelemetry instrumentation, leveraging either:
- OpenInference (https://github.com/Arize-ai/openinference) - Specialized for LLM observability with semantic conventions for AI/ML workloads
 - OpenLLMetry (https://github.com/traceloop/openllmetry) - Purpose-built for LLM application tracing with automatic instrumentation
 
Key Implementation Requirements:
1. Core Tracing Infrastructure
- Integrate OpenTelemetry SDK for Go (main router service)
 - Add trace context propagation through HTTP headers and gRPC metadata
 - Support multiple trace exporters (OTLP, Jaeger, Zipkin)
- Configure sampling strategies (always-on for development, probabilistic for production); see the setup sketch below
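
For illustration, a minimal setup along these lines could wire the pieces above together using the OpenTelemetry Go SDK with an OTLP/gRPC exporter and a parent-based probabilistic sampler. The package and function names, hard-coded resource attributes, and semconv version are assumptions, not existing router code:

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracing wires up an OTLP/gRPC exporter, a parent-based probabilistic
// sampler, and W3C trace-context propagation for the router service.
func initTracing(ctx context.Context, endpoint string, sampleRate float64) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "localhost:4317"
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("vllm-semantic-router"),
		semconv.ServiceVersionKey.String("v0.1.0"),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // asynchronous, batched span export
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(sampleRate))),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}
```

`initTracing` would be called once at startup, with `tp.Shutdown(ctx)` deferred so buffered spans are flushed on exit.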
 
2. Instrumentation Points
Instrument the following critical paths with spans:
Request Processing Pipeline:
- semantic_router.request.received - Entry point span
- semantic_router.classification - Category classification with model name and confidence
- semantic_router.security.pii_detection - PII detection with results
- semantic_router.security.jailbreak_detection - Jailbreak detection with results
- semantic_router.cache.lookup - Semantic cache operations
- semantic_router.routing.decision - Model selection logic with reasoning
- semantic_router.backend.selection - Endpoint selection
- semantic_router.upstream.request - Forwarding to vLLM backend
- semantic_router.response.processing - Response handling
Span Attributes (following OpenInference conventions):
- Request metadata: request_id, user_id, session_id
- Model information: model.name, model.provider, model.version
- Classification: category.name, category.confidence, classifier.type
- Routing: routing.strategy, routing.reason, original_model, selected_model
- Security: pii.detected, jailbreak.detected, security.action
- Performance: token.count.prompt, token.count.completion, cache.hit
- Reasoning: reasoning.enabled, reasoning.effort, reasoning.family
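
As a sketch of how these span names and attributes could be emitted, the classification stage might be wrapped roughly as follows; `classifyWithSpan` and the `classify` callback are hypothetical stand-ins for the router's actual classifier code:

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// classifyWithSpan wraps a classifier call in the semantic_router.classification
// span and records the category attributes listed above.
func classifyWithSpan(
	ctx context.Context,
	query string,
	classify func(context.Context, string) (string, float64, error),
) (string, float64, error) {
	tracer := otel.Tracer("semantic_router")
	ctx, span := tracer.Start(ctx, "semantic_router.classification")
	defer span.End()

	category, confidence, err := classify(ctx, query)
	if err != nil {
		span.RecordError(err)
		return "", 0, err
	}

	span.SetAttributes(
		attribute.String("category.name", category),
		attribute.Float64("category.confidence", confidence),
		attribute.String("classifier.type", "bert"),
	)
	return category, confidence, nil
}
```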
3. Integration with vLLM Production Stack
- Propagate trace context to vLLM engine requests (see the propagation sketch after this list)
 - Correlate router traces with vLLM engine traces
 - Support unified trace visualization across the full stack
 - Enable end-to-end latency analysis from router to model inference
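
A rough sketch of the propagation step, assuming a plain HTTP hop to the backend; the function name and JSON content type are illustrative, and a gRPC or Envoy ExtProc forwarding path would inject the same trace context into gRPC metadata instead of HTTP headers:

```go
package observability

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// forwardToBackend injects the current trace context into the outgoing
// request headers (W3C traceparent/tracestate) so that spans emitted by the
// vLLM engine join the same trace as the router's spans.
func forwardToBackend(ctx context.Context, endpoint string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}
```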
 
4. Configuration
Add tracing configuration to config.yaml:
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"  # or "openinference", "openllmetry"
    exporter:
      type: "otlp"  # otlp, jaeger, zipkin, stdout
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "probabilistic"  # always_on, always_off, probabilistic
      rate: 0.1  # 10% sampling for production
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
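
For reference, this block might map onto Go configuration types roughly as follows; the struct and field names are hypothetical, not the router's existing config schema:

```go
package observability

// TracingConfig mirrors the proposed observability.tracing block in config.yaml.
type TracingConfig struct {
	Enabled  bool           `yaml:"enabled"`
	Provider string         `yaml:"provider"` // "opentelemetry", "openinference", "openllmetry"
	Exporter ExporterConfig `yaml:"exporter"`
	Sampling SamplingConfig `yaml:"sampling"`
	Resource ResourceConfig `yaml:"resource"`
}

type ExporterConfig struct {
	Type     string `yaml:"type"` // otlp, jaeger, zipkin, stdout
	Endpoint string `yaml:"endpoint"`
	Insecure bool   `yaml:"insecure"`
}

type SamplingConfig struct {
	Type string  `yaml:"type"` // always_on, always_off, probabilistic
	Rate float64 `yaml:"rate"`
}

type ResourceConfig struct {
	ServiceName           string `yaml:"service_name"`
	ServiceVersion        string `yaml:"service_version"`
	DeploymentEnvironment string `yaml:"deployment_environment"`
}
```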
5. Visualization and Analysis
- Provide example Jaeger/Grafana Tempo configurations
 - Create sample trace queries for common debugging scenarios
 - Document trace analysis workflows for performance optimization
 - Include dashboard templates for trace-based metrics
 
6. Performance Considerations
- Minimize tracing overhead (< 1% latency impact)
- Use async span export to avoid blocking request processing (see the batching sketch after this list)
 - Implement efficient span batching
 - Support dynamic sampling rate adjustment
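
A sketch of the batching setup using the SDK's batch span processor; the queue size, batch size, and flush interval below are placeholder values to tune against real traffic, not project recommendations:

```go
package observability

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newBatchProcessor builds an asynchronous, batching span processor so that
// span export never blocks request handling.
func newBatchProcessor(exporter sdktrace.SpanExporter) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxQueueSize(2048),          // spans are dropped, not blocked on, when the queue is full
		sdktrace.WithMaxExportBatchSize(512),     // spans per export call
		sdktrace.WithBatchTimeout(5*time.Second), // maximum delay before a partial batch is flushed
	)
}
```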
 
Additional context
Benefits:
- Enhanced Debugging: Quickly identify where requests fail or slow down in the pipeline
 - Performance Optimization: Pinpoint exact bottlenecks in the routing logic
 - Production Readiness: Industry-standard observability for enterprise deployments
 - vLLM Stack Integration: Seamless correlation with vLLM engine traces for full-stack visibility
 - Compliance & Auditing: Detailed audit trails for security and compliance requirements
 
Implementation Approach:
- Phase 1: Core OpenTelemetry integration with basic span instrumentation
 - Phase 2: Add OpenInference/OpenLLMetry semantic conventions for LLM-specific attributes
 - Phase 3: Integrate with vLLM engine tracing for end-to-end correlation
 - Phase 4: Advanced features (trace sampling strategies, custom exporters, trace-based alerting)
 
Related Projects:
- OpenInference: https://github.com/Arize-ai/openinference
 - OpenLLMetry: https://github.com/traceloop/openllmetry
 - OpenTelemetry Go: https://github.com/open-telemetry/opentelemetry-go
 - vLLM Production Stack: Future integration point for unified observability
 
Example Trace Visualization:
Request Trace (trace_id: abc123)
├─ semantic_router.request.received [2ms]
│  ├─ semantic_router.classification [45ms]
│  │  └─ attributes: {category: "coding", confidence: 0.95, classifier: "bert"}
│  ├─ semantic_router.security.pii_detection [8ms]
│  │  └─ attributes: {pii.detected: false}
│  ├─ semantic_router.security.jailbreak_detection [12ms]
│  │  └─ attributes: {jailbreak.detected: false}
│  ├─ semantic_router.cache.lookup [3ms]
│  │  └─ attributes: {cache.hit: false}
│  ├─ semantic_router.routing.decision [5ms]
│  │  └─ attributes: {original_model: "auto", selected_model: "deepseek-coder-v3", reasoning.enabled: true}
│  ├─ semantic_router.backend.selection [2ms]
│  │  └─ attributes: {endpoint: "endpoint1", address: "127.0.0.1:8000"}
│  └─ semantic_router.upstream.request [1250ms]
│     └─ vllm.engine.generate [1245ms]  # Correlated vLLM trace
│        ├─ vllm.tokenization [15ms]
│        ├─ vllm.inference [1200ms]
│        └─ vllm.detokenization [30ms]
└─ semantic_router.response.processing [5ms]
   └─ attributes: {tokens.prompt: 150, tokens.completion: 500, status: "success"}
Total Duration: 1332ms
This feature will significantly enhance the observability capabilities of vLLM Semantic Router and position it as a production-ready component of the vLLM ecosystem with enterprise-grade monitoring.