
[Feat] Model Ensemble Support #730

@rootfs

Description

Introduction

Support a model ensemble orchestration service that can intelligently combine outputs from multiple LLM endpoints using configurable aggregation strategies, enabling improved reliability, accuracy, and flexible cost-performance trade-offs.

Use Case

Problem Statement

  1. Single model limitations: Individual models have reliability and accuracy constraints that affect production deployments
  2. No orchestration layer: Current router lacks the ability to coordinate multiple model inferences and combine results
  3. Fixed routing: Requests are routed to a single model, missing opportunities for consensus-based decision making
  4. Limited reliability options: No built-in mechanisms for fallback, voting, or ensemble strategies

Real-World Scenarios

Critical Applications

  • Medical diagnosis assistance where consensus from multiple models increases confidence
  • Legal document analysis requiring high accuracy verification
  • Financial advisory systems where reliability directly impacts business outcomes
  • Safety-critical AI systems (content moderation, fraud detection)

Cost Optimization

  • Query multiple smaller models instead of one large expensive model
  • Start with fast/cheap models, escalate to ensemble for uncertain cases
  • Adaptive routing based on query complexity or confidence thresholds

Reliability & Accuracy

  • Voting mechanisms to reduce hallucinations and errors
  • Consensus-based outputs for higher confidence results
  • Graceful degradation with fallback chains
  • A/B testing and gradual rollout of new models

Model Diversity

  • Combine outputs from different model architectures (e.g., GPT-style + Llama-style)
  • Ensemble different model sizes for balanced performance
  • Cross-validate responses from models with different training data

Architecture

graph TB
    Client[Client Request] --> Router[Semantic Router]
    Router --> Orchestrator[Ensemble Orchestrator]
    
    Orchestrator --> Strategy{Routing Strategy}
    
    Strategy -->|Parallel Query| M1[Model Endpoint 1]
    Strategy -->|Parallel Query| M2[Model Endpoint 2]
    Strategy -->|Parallel Query| M3[Model Endpoint N]
    
    M1 --> Aggregator[Aggregation Engine]
    M2 --> Aggregator
    M3 --> Aggregator
    
    Aggregator --> Voting[Voting Strategy]
    Aggregator --> Weighted[Weighted Consensus]
    Aggregator --> Ranking[Reranking]
    Aggregator --> Average[Score Averaging]
    Aggregator --> FirstSuccess[First Success]
    
    Voting --> Response[Final Response]
    Weighted --> Response
    Ranking --> Response
    Average --> Response
    FirstSuccess --> Response
    
    style Orchestrator fill:#e1f5ff
    style Aggregator fill:#fff4e1
    style Response fill:#e1ffe1

Core Components

1. Ensemble Orchestrator

Coordinates parallel or sequential requests to multiple model endpoints:

  • Manages concurrent inference requests
  • Handles timeouts and partial failures
  • Tracks response metadata (latency, confidence scores)
  • Supports both synchronous and streaming responses
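
As a rough illustration, the fan-out step could look like the Go sketch below. This is a minimal sketch, not a proposed implementation: ModelResponse, QueryFunc, and FanOut are assumed names, and the real orchestrator would also carry confidence scores and streaming handling.

package ensemble

import (
	"context"
	"time"
)

// ModelResponse is a hypothetical container for one endpoint's answer
// plus the metadata the orchestrator tracks (latency, error).
type ModelResponse struct {
	Model   string
	Content string
	Latency time.Duration
	Err     error
}

// QueryFunc abstracts a single-model inference call (assumed shape,
// not an existing router API).
type QueryFunc func(ctx context.Context, model, prompt string) (string, error)

// FanOut queries every endpoint concurrently and returns whatever
// arrives before the timeout; partial failures are kept so the
// aggregation engine can decide whether enough responses came back.
func FanOut(ctx context.Context, models []string, prompt string, timeout time.Duration, query QueryFunc) []ModelResponse {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make(chan ModelResponse, len(models))
	for _, m := range models {
		go func(model string) {
			start := time.Now()
			out, err := query(ctx, model, prompt)
			results <- ModelResponse{Model: model, Content: out, Latency: time.Since(start), Err: err}
		}(m)
	}

	responses := make([]ModelResponse, 0, len(models))
	for range models {
		select {
		case r := <-results:
			responses = append(responses, r)
		case <-ctx.Done():
			return responses // timeout: hand back partial results
		}
	}
	return responses
}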

2. Aggregation Engine

Combines multiple model outputs using configurable strategies:

  • Voting: Majority consensus for classification/multiple choice
  • Weighted Consensus: Confidence-weighted combination
  • Score Averaging: Average numerical outputs or probabilities
  • First Success: Return first valid response (latency optimization)
  • Reranking: Use a separate model to rank and select best output
  • Longest Common Subsequence: Find common patterns across responses
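
A small interface would keep these strategies pluggable; a minimal sketch follows, reusing the ModelResponse type from the fan-out sketch above. Strategy, Register, and Lookup are assumed names, with the idea that the registered name matches the value sent in X-Ensemble-Strategy.

package ensemble

import "fmt"

// Strategy combines several model responses into one final answer.
// Each strategy in this proposal (voting, weighted consensus,
// reranking, ...) would satisfy this interface.
type Strategy interface {
	Name() string
	Aggregate(responses []ModelResponse) (ModelResponse, error)
}

var strategies = map[string]Strategy{}

// Register makes a strategy selectable by name, e.g. via the
// X-Ensemble-Strategy header.
func Register(s Strategy) { strategies[s.Name()] = s }

// Lookup resolves the strategy named in the request so the router can
// surface a clear error for unknown values.
func Lookup(name string) (Strategy, error) {
	s, ok := strategies[name]
	if !ok {
		return nil, fmt.Errorf("unknown ensemble strategy %q", name)
	}
	return s, nil
}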

3. Configuration Interface

Flexible control mechanisms:

  • Header-based routing (e.g., X-Ensemble-Models, X-Ensemble-Strategy)
  • JSON configuration for ensemble policies
  • Per-request override capabilities
  • Global defaults with request-level customization

4. Adaptive Triggering

Intelligent decision-making for when to use ensemble:

  • Confidence threshold triggers (ensemble on low-confidence queries)
  • Query complexity detection (ensemble for complex questions)
  • Cost-aware routing (balance cost vs accuracy)
  • Fallback chains (start cheap, escalate as needed)
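
A hedged sketch of the trigger decision, assuming the router already has a confidence score for the primary model's answer and a complexity score for the query; TriggerPolicy and its fields are illustrative names, not an existing config schema.

package ensemble

// TriggerPolicy holds the knobs for adaptive ensembling.
type TriggerPolicy struct {
	ConfidenceThreshold float64 // ensemble when the primary answer scores below this
	MaxComplexity       float64 // ensemble when the query scores above this
}

// ShouldEnsemble decides whether to escalate from the single fast
// model to the full ensemble for this request.
func (p TriggerPolicy) ShouldEnsemble(primaryConfidence, queryComplexity float64) bool {
	if primaryConfidence < p.ConfidenceThreshold {
		return true // low-confidence answer: ask more models
	}
	if queryComplexity > p.MaxComplexity {
		return true // complex query: worth the extra cost up front
	}
	return false
}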

Expected Benefits

Accuracy & Reliability

  • Accuracy improvement on complex reasoning tasks through multi-model consensus
  • Reduced hallucination rate via voting and cross-validation
  • Higher confidence outputs from aggregated responses
  • Graceful degradation with fallback mechanisms

Cost Optimization

  • Lower cost per query by using multiple small models instead of one large model
  • Adaptive cost management by triggering ensemble only when needed
  • Flexible trade-offs between accuracy and inference cost
  • Better ROI on model investments by combining existing deployments

Operational Excellence

  • Improved reliability through redundancy and failover
  • A/B testing capabilities built into routing logic
  • Gradual model rollout by blending old and new models
  • Performance insights from comparing model outputs

Technical Design

API Interface

Header-Based Control

Enable ensemble via HTTP headers:

X-Ensemble-Enable: true
X-Ensemble-Models: model-a,model-b,model-c
X-Ensemble-Strategy: voting
X-Ensemble-Min-Responses: 2

Response Metadata

Provide transparency into ensemble process:

{
  "choices": [{
    "message": {...},
    "ensemble_metadata": {
      "strategy": "voting",
      "models_queried": 3,
      "responses_received": 3,
      "aggregation_details": {
        "votes": {"A": 2, "B": 1},
        "confidence_scores": [0.85, 0.82, 0.45],
        "selected_answer": "A"
      },
      "performance": {
        "total_latency_ms": 1250,
        "model_latencies_ms": [1200, 850, 1100]
      }
    }
  }]
}

Aggregation Strategies

1. Voting (Majority Consensus)

Best for: Classification, multiple choice, yes/no questions

  • Collect responses from all models
  • Extract answers (A, B, C, etc.)
  • Return most frequent answer
  • Handle ties with configurable rules
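
A minimal sketch of the vote count over already-extracted answer labels (answer extraction itself is a separate concern). Ties here fall to the answer seen first, just one of the configurable tie-break rules mentioned above.

package ensemble

// MajorityVote returns the most frequent answer label. Because the
// count must strictly exceed the current best, a tie keeps the answer
// that appeared first.
func MajorityVote(answers []string) string {
	counts := map[string]int{}
	best, bestCount := "", 0
	for _, a := range answers {
		counts[a]++
		if counts[a] > bestCount {
			best, bestCount = a, counts[a]
		}
	}
	return best
}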

2. Weighted Consensus

Best for: Combining models with different reliability profiles

  • Weight responses by confidence scores
  • Support manual model weight configuration
  • Dynamic weighting based on historical accuracy
  • Confidence-weighted probability averaging
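
A sketch of the weighted variant: each vote counts in proportion to a per-response weight, which could be the model's confidence score or a manually configured model weight.

package ensemble

// WeightedVote sums per-answer weights and returns the answer with
// the largest total weight. answers[i] is weighted by weights[i].
func WeightedVote(answers []string, weights []float64) string {
	if len(answers) != len(weights) {
		return "" // caller error: mismatched inputs
	}
	totals := map[string]float64{}
	best, bestWeight := "", 0.0
	for i, a := range answers {
		totals[a] += weights[i]
		if totals[a] > bestWeight {
			best, bestWeight = a, totals[a]
		}
	}
	return best
}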

3. Score Averaging

Best for: Numerical outputs, probability distributions

  • Average numerical scores across models
  • Support weighted averages
  • Configurable outlier handling
  • Standard deviation reporting
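
A sketch of averaging with simple outlier handling: scores further than a configurable number of standard deviations from the mean are dropped before the final average. The cutoff rule is illustrative, not prescribed.

package ensemble

import "math"

// AverageScores averages numerical model outputs, dropping values more
// than maxStdDev standard deviations away from the mean.
func AverageScores(scores []float64, maxStdDev float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	mean, std := meanStd(scores)
	var sum float64
	var kept int
	for _, s := range scores {
		if std == 0 || math.Abs(s-mean) <= maxStdDev*std {
			sum += s
			kept++
		}
	}
	if kept == 0 {
		return mean // everything flagged as an outlier: fall back to the raw mean
	}
	return sum / float64(kept)
}

func meanStd(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var variance float64
	for _, x := range xs {
		variance += (x - mean) * (x - mean)
	}
	variance /= float64(len(xs))
	return mean, math.Sqrt(variance)
}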

4. First Success

Best for: Latency-sensitive applications

  • Return first valid response
  • Continue querying for backup if needed
  • Fallback to slower models if fast models fail
  • Optimize for p50 latency
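
A sketch of first-success on top of the fan-out pattern, reusing the ModelResponse and QueryFunc types from the orchestrator sketch: the first non-error response is returned and the remaining calls are cancelled.

package ensemble

import (
	"context"
	"errors"
	"time"
)

// FirstSuccess queries all models concurrently and returns the first
// response that is not an error, cancelling the rest.
func FirstSuccess(ctx context.Context, models []string, prompt string, timeout time.Duration, query QueryFunc) (ModelResponse, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make(chan ModelResponse, len(models))
	for _, m := range models {
		go func(model string) {
			start := time.Now()
			out, err := query(ctx, model, prompt)
			results <- ModelResponse{Model: model, Content: out, Latency: time.Since(start), Err: err}
		}(m)
	}

	for range models {
		select {
		case r := <-results:
			if r.Err == nil {
				return r, nil // first valid response wins
			}
		case <-ctx.Done():
			return ModelResponse{}, ctx.Err()
		}
	}
	return ModelResponse{}, errors.New("all ensemble endpoints failed")
}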

5. Reranking

Best for: Generation tasks, open-ended responses

  • Collect multiple candidate responses
  • Use separate model to rank quality
  • Select highest-ranked response
  • Support custom ranking criteria
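
A sketch of the reranking path: candidates are scored by a separate judge (for example another model call behind the ScoreFunc below, which is an assumed signature) and the highest-scoring candidate is returned.

package ensemble

// ScoreFunc rates one candidate response, e.g. by prompting a separate
// reranker/judge model.
type ScoreFunc func(prompt, candidate string) float64

// Rerank returns the candidate the scoring function ranks highest.
func Rerank(prompt string, candidates []string, score ScoreFunc) string {
	best, bestScore := "", 0.0
	for i, c := range candidates {
		s := score(prompt, c)
		if i == 0 || s > bestScore {
			best, bestScore = c, s
		}
	}
	return best
}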

6. Longest Common Subsequence (LCS)

Best for: Finding consensus in generated text

  • Identify common patterns across responses
  • Extract most consistent information
  • Filter out model-specific hallucinations
  • Build consensus response from shared elements
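
The pairwise building block is sketched below as a token-level LCS; extending it across N responses (for example by folding pairwise) and turning the shared tokens back into fluent text is the part this proposal would need to define.

package ensemble

import "strings"

// LCSTokens returns the longest common subsequence of two responses,
// computed over whitespace-separated tokens, as a rough signal for the
// content both models agree on.
func LCSTokens(a, b string) []string {
	x, y := strings.Fields(a), strings.Fields(b)
	// dp[i][j] = LCS length of x[:i] and y[:j]
	dp := make([][]int, len(x)+1)
	for i := range dp {
		dp[i] = make([]int, len(y)+1)
	}
	for i := 1; i <= len(x); i++ {
		for j := 1; j <= len(y); j++ {
			switch {
			case x[i-1] == y[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] >= dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	// Walk back through the table to recover the common tokens.
	var out []string
	for i, j := len(x), len(y); i > 0 && j > 0; {
		switch {
		case x[i-1] == y[j-1]:
			out = append([]string{x[i-1]}, out...)
			i--
			j--
		case dp[i-1][j] >= dp[i][j-1]:
			i--
		default:
			j--
		}
	}
	return out
}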

Configuration Options

Global Configuration

Set default ensemble behaviors:

  • Default model lists for different query types
  • Fallback chains and timeout policies
  • Cost limits and budget controls
  • Monitoring and logging preferences
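
For illustration, one possible shape for the global defaults, in JSON to match the examples above; every key here is hypothetical and not an existing configuration schema.

{
  "ensemble": {
    "default_strategy": "voting",
    "default_models": ["model-a", "model-b", "model-c"],
    "min_responses": 2,
    "timeout_ms": 5000,
    "fallback_chain": ["model-small", "model-large"],
    "max_cost_per_request_usd": 0.02,
    "log_model_latencies": true
  }
}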

Request-Level Override

Allow per-request customization:

  • Override default ensemble strategy
  • Specify custom model combinations
  • Adjust timeout and minimum response requirements
  • Enable/disable ensemble for specific requests

Adaptive Policies

Intelligent routing decisions:

  • Confidence-based: Ensemble if initial response confidence < threshold
  • Complexity-based: Ensemble for queries above complexity score
  • Cost-aware: Balance accuracy vs inference cost
  • Hybrid: Combine multiple triggers

Validation Approach

Benchmarking Strategy

Test ensemble effectiveness across standard benchmarks:

  • MMLU-Pro: Multi-subject knowledge (validated: significant improvement)
  • GSM8K: Math reasoning
  • HumanEval: Code generation
  • HellaSwag: Commonsense reasoning

Comparison Metrics

  • Accuracy: Absolute and relative improvement
  • Latency: p50, p95, p99 response times
  • Cost: Total token usage and inference cost
  • Reliability: Error rates and failure handling
  • ROI: Accuracy gain per unit cost increase

Test Scenarios

  • Single model baseline
  • 2-model ensemble (various strategies)
  • 3+ model ensemble
  • Adaptive triggering effectiveness
  • Failure and degradation handling
