Is your feature request related to a problem? Please describe.
Currently, vLLM Semantic Router provides basic observability through Prometheus metrics and structured logging. However, these approaches have limitations when it comes to understanding the complete request lifecycle across distributed components:
- Limited Request Context: Metrics provide aggregated data but lack per-request visibility into the routing decision flow
 - Difficult Root Cause Analysis: When issues occur (e.g., high latency, routing errors), it's challenging to trace the exact path a request took through classification, routing, security checks, and backend selection
 - No Cross-Service Correlation: As the system integrates with vLLM engines and other components, there's no unified way to correlate traces across service boundaries
 - Missing Fine-Grained Timing: While we have overall latency metrics, we lack detailed breakdowns of time spent in each processing stage (classification, PII detection, jailbreak detection, cache lookup, model selection, etc.)
 
This becomes especially problematic when:
- Debugging production issues where specific requests fail or perform poorly
 - Optimizing the routing pipeline by identifying bottlenecks
 - Understanding the impact of different routing strategies on end-to-end latency
 - Integrating with the broader vLLM Production Stack for unified observability
 
Describe the solution you'd like
Implement comprehensive distributed tracing support using industry-standard OpenTelemetry instrumentation, leveraging either:
- OpenInference (https://github.com/Arize-ai/openinference) - Specialized for LLM observability with semantic conventions for AI/ML workloads
 - OpenLLMetry (https://github.com/traceloop/openllmetry) - Purpose-built for LLM application tracing with automatic instrumentation
 
Key Implementation Requirements:
1. Core Tracing Infrastructure
- Integrate OpenTelemetry SDK for Go (main router service)
 - Add trace context propagation through HTTP headers and gRPC metadata
 - Support multiple trace exporters (OTLP, Jaeger, Zipkin)
- Configure sampling strategies (always-on for development, probabilistic for production); see the setup sketch below
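
For illustration, a minimal setup along these lines could wire the pieces above together using the OpenTelemetry Go SDK with an OTLP/gRPC exporter and a parent-based probabilistic sampler. The package and function names, hard-coded resource attributes, and semconv version are assumptions, not existing router code:

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracing wires up an OTLP/gRPC exporter, a parent-based probabilistic
// sampler, and W3C trace-context propagation for the router service.
func initTracing(ctx context.Context, endpoint string, sampleRate float64) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "localhost:4317"
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("vllm-semantic-router"),
		semconv.ServiceVersionKey.String("v0.1.0"),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // asynchronous, batched span export
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(sampleRate))),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}
```

`initTracing` would be called once at startup, with `tp.Shutdown(ctx)` deferred so buffered spans are flushed on exit.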
 
2. Instrumentation Points
Instrument the following critical paths with spans:
Request Processing Pipeline:
- semantic_router.request.received - Entry point span
- semantic_router.classification - Category classification with model name and confidence
- semantic_router.security.pii_detection - PII detection with results
- semantic_router.security.jailbreak_detection - Jailbreak detection with results
- semantic_router.cache.lookup - Semantic cache operations
- semantic_router.routing.decision - Model selection logic with reasoning
- semantic_router.backend.selection - Endpoint selection
- semantic_router.upstream.request - Forwarding to vLLM backend
- semantic_router.response.processing - Response handling
Span Attributes (following OpenInference conventions):
- Request metadata: request_id, user_id, session_id
- Model information: model.name, model.provider, model.version
- Classification: category.name, category.confidence, classifier.type
- Routing: routing.strategy, routing.reason, original_model, selected_model
- Security: pii.detected, jailbreak.detected, security.action
- Performance: token.count.prompt, token.count.completion, cache.hit
- Reasoning: reasoning.enabled, reasoning.effort, reasoning.family
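
As a sketch of how these span names and attributes could be emitted, the classification stage might be wrapped roughly as follows; `classifyWithSpan` and the `classify` callback are hypothetical stand-ins for the router's actual classifier code:

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// classifyWithSpan wraps a classifier call in the semantic_router.classification
// span and records the category attributes listed above.
func classifyWithSpan(
	ctx context.Context,
	query string,
	classify func(context.Context, string) (string, float64, error),
) (string, float64, error) {
	tracer := otel.Tracer("semantic_router")
	ctx, span := tracer.Start(ctx, "semantic_router.classification")
	defer span.End()

	category, confidence, err := classify(ctx, query)
	if err != nil {
		span.RecordError(err)
		return "", 0, err
	}

	span.SetAttributes(
		attribute.String("category.name", category),
		attribute.Float64("category.confidence", confidence),
		attribute.String("classifier.type", "bert"),
	)
	return category, confidence, nil
}
```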
3. Integration with vLLM Production Stack
- Propagate trace context to vLLM engine requests (see the propagation sketch after this list)
 - Correlate router traces with vLLM engine traces
 - Support unified trace visualization across the full stack
 - Enable end-to-end latency analysis from router to model inference
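
A rough sketch of the propagation step, assuming a plain HTTP hop to the backend; the function name and JSON content type are illustrative, and a gRPC or Envoy ExtProc forwarding path would inject the same trace context into gRPC metadata instead of HTTP headers:

```go
package observability

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// forwardToBackend injects the current trace context into the outgoing
// request headers (W3C traceparent/tracestate) so that spans emitted by the
// vLLM engine join the same trace as the router's spans.
func forwardToBackend(ctx context.Context, endpoint string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}
```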
 
4. Configuration
Add tracing configuration to config.yaml:
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"  # or "openinference", "openllmetry"
    exporter:
      type: "otlp"  # otlp, jaeger, zipkin, stdout
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "probabilistic"  # always_on, always_off, probabilistic
      rate: 0.1  # 10% sampling for production
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
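
For reference, this block might map onto Go configuration types roughly as follows; the struct and field names are hypothetical, not the router's existing config schema:

```go
package observability

// TracingConfig mirrors the proposed observability.tracing block in config.yaml.
type TracingConfig struct {
	Enabled  bool           `yaml:"enabled"`
	Provider string         `yaml:"provider"` // "opentelemetry", "openinference", "openllmetry"
	Exporter ExporterConfig `yaml:"exporter"`
	Sampling SamplingConfig `yaml:"sampling"`
	Resource ResourceConfig `yaml:"resource"`
}

type ExporterConfig struct {
	Type     string `yaml:"type"` // otlp, jaeger, zipkin, stdout
	Endpoint string `yaml:"endpoint"`
	Insecure bool   `yaml:"insecure"`
}

type SamplingConfig struct {
	Type string  `yaml:"type"` // always_on, always_off, probabilistic
	Rate float64 `yaml:"rate"`
}

type ResourceConfig struct {
	ServiceName           string `yaml:"service_name"`
	ServiceVersion        string `yaml:"service_version"`
	DeploymentEnvironment string `yaml:"deployment_environment"`
}
```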
5. Visualization and Analysis
- Provide example Jaeger/Grafana Tempo configurations
 - Create sample trace queries for common debugging scenarios
 - Document trace analysis workflows for performance optimization
 - Include dashboard templates for trace-based metrics
 
6. Performance Considerations
- Minimize tracing overhead (< 1% latency impact)
- Use async span export to avoid blocking request processing (see the batching sketch after this list)
 - Implement efficient span batching
 - Support dynamic sampling rate adjustment
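
A sketch of the batching setup using the SDK's batch span processor; the queue size, batch size, and flush interval below are placeholder values to tune against real traffic, not project recommendations:

```go
package observability

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newBatchProcessor builds an asynchronous, batching span processor so that
// span export never blocks request handling.
func newBatchProcessor(exporter sdktrace.SpanExporter) sdktrace.SpanProcessor {
	return sdktrace.NewBatchSpanProcessor(exporter,
		sdktrace.WithMaxQueueSize(2048),          // spans are dropped, not blocked on, when the queue is full
		sdktrace.WithMaxExportBatchSize(512),     // spans per export call
		sdktrace.WithBatchTimeout(5*time.Second), // maximum delay before a partial batch is flushed
	)
}
```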
 
Additional context
Benefits:
- Enhanced Debugging: Quickly identify where requests fail or slow down in the pipeline
 - Performance Optimization: Pinpoint exact bottlenecks in the routing logic
 - Production Readiness: Industry-standard observability for enterprise deployments
 - vLLM Stack Integration: Seamless correlation with vLLM engine traces for full-stack visibility
 - Compliance & Auditing: Detailed audit trails for security and compliance requirements
 
Implementation Approach:
- Phase 1: Core OpenTelemetry integration with basic span instrumentation
 - Phase 2: Add OpenInference/OpenLLMetry semantic conventions for LLM-specific attributes
 - Phase 3: Integrate with vLLM engine tracing for end-to-end correlation
 - Phase 4: Advanced features (trace sampling strategies, custom exporters, trace-based alerting)
 
Related Projects:
- OpenInference: https://github.com/Arize-ai/openinference
 - OpenLLMetry: https://github.com/traceloop/openllmetry
 - OpenTelemetry Go: https://github.com/open-telemetry/opentelemetry-go
 - vLLM Production Stack: Future integration point for unified observability
 
Example Trace Visualization:
Request Trace (trace_id: abc123)
├─ semantic_router.request.received [2ms]
│  ├─ semantic_router.classification [45ms]
│  │  └─ attributes: {category: "coding", confidence: 0.95, classifier: "bert"}
│  ├─ semantic_router.security.pii_detection [8ms]
│  │  └─ attributes: {pii.detected: false}
│  ├─ semantic_router.security.jailbreak_detection [12ms]
│  │  └─ attributes: {jailbreak.detected: false}
│  ├─ semantic_router.cache.lookup [3ms]
│  │  └─ attributes: {cache.hit: false}
│  ├─ semantic_router.routing.decision [5ms]
│  │  └─ attributes: {original_model: "auto", selected_model: "deepseek-coder-v3", reasoning.enabled: true}
│  ├─ semantic_router.backend.selection [2ms]
│  │  └─ attributes: {endpoint: "endpoint1", address: "127.0.0.1:8000"}
│  └─ semantic_router.upstream.request [1250ms]
│     └─ vllm.engine.generate [1245ms]  # Correlated vLLM trace
│        ├─ vllm.tokenization [15ms]
│        ├─ vllm.inference [1200ms]
│        └─ vllm.detokenization [30ms]
└─ semantic_router.response.processing [5ms]
   └─ attributes: {tokens.prompt: 150, tokens.completion: 500, status: "success"}
Total Duration: 1332ms
This feature will significantly enhance the observability capabilities of vLLM Semantic Router and position it as a production-ready component of the vLLM ecosystem with enterprise-grade monitoring.