# Module E: Intelligent Model Routing Lab

**Goal:** Route prompts to the optimal LLM using NVIDIA’s router approach, balancing accuracy, latency, and cost.

**Persona:** Cost-Conscious Solutions Architect

## Workshop Overview

### Part 1: Understand the NVIDIA Router (20 min)
- Task + complexity classification with a lightweight DeBERTa model
- Why routing matters (the cost-quality-latency triangle)

### Part 2: Build a Routing System (30 min)
- Generate synthetic text-to-SQL prompts (simple → complex)
- Classify prompts with NVIDIA’s router model
- Route to a model pool: **3B** (cheap), **SQL specialist**, **70B** (best quality)

### Part 3: Evaluate + Decide (20 min)
- Benchmark routed vs always-70B
- Visualize cost savings and quality impacts
- Tune thresholds and pick your production policy

## Why Model Routing Matters for FICO

In production, not every prompt needs a 70B model.
- **Simple** prompts → small model (fast, inexpensive)
- **SQL** prompts → SQL specialist model
- **Hard / ambiguous** prompts → 70B model (highest quality)

The win: lower cost and latency on easy traffic, with the 70B model reserved for the prompts that actually need it.


---

## The NVIDIA Model Router Blueprint (Conceptual)

```
UserPrompt
  |
  v
RouterClassifier(DeBERTa)
  |-- task_type (11-class)
  |-- complexity_dimensions (6 scores)
  |-- overall_complexity
  v
RoutingPolicy(thresholds + task-aware overrides)
  |-- small_model (3B)
  |-- sql_specialist (7B)
  |-- large_model (70B)
  v
LLMResponse
```

We’ll implement the same pattern: classifier → policy → model pool → metrics.


---

## Environment Setup

Run the next cell to import dependencies, seed the random number generators, and confirm your runtime (CPU/GPU).


In [1]:
import os
import sys
import time
import json
import random
from dataclasses import dataclass
from typing import Dict, List, Optional

import numpy as np
import pandas as pd

import torch

from IPython.display import display, Markdown, HTML, clear_output

print("Python:", sys.version)
print("Executable:", sys.executable)
print("CWD:", os.getcwd())

print("\nPyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory/1e9:.1f} GB")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("\nUsing DEVICE =", DEVICE)


Python: 3.10.12 (main, Nov  4 2025, 08:48:33) [GCC 11.4.0]
Executable: /home/shadeform/workshop-v1/fico/.venv/bin/python
CWD: /home/shadeform/workshop-v1/fico

PyTorch: 2.9.1+cu128
CUDA available: True
GPU: NVIDIA H200
VRAM: 150.0 GB

Using DEVICE = cuda


---

## Load NVIDIA Router Classifier (DeBERTa)

We use NVIDIA’s **Prompt Task and Complexity Classifier** (`nvidia/prompt-task-and-complexity-classifier`).

This is a lightweight classifier model (not an LLM) that predicts:
- **Task type** (11 classes)
- **Complexity** across 6 dimensions

The blueprint then uses a policy (thresholds + overrides) to pick which LLM to call.


In [2]:
from transformers import AutoTokenizer, AutoModel

ROUTER_MODEL_ID = "nvidia/prompt-task-and-complexity-classifier"

print("Loading NVIDIA router classifier:", ROUTER_MODEL_ID)

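# NOTE: trust_remote_code=True lets transformers run custom model code shipped with a checkpoint.
# If the load log reports many "newly initialized" weights (as in the output below), AutoModel
# probably did not resolve the classifier's own multi-head architecture, so treat its outputs with suspicion.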
router_tokenizer = AutoTokenizer.from_pretrained(ROUTER_MODEL_ID)
router_model = AutoModel.from_pretrained(ROUTER_MODEL_ID, trust_remote_code=True)
router_model.to(DEVICE)
router_model.eval()

print("✅ Router loaded")


Loading NVIDIA router classifier: nvidia/prompt-task-and-complexity-classifier


Some weights of DiaModel were not initialized from the model checkpoint at nvidia/prompt-task-and-complexity-classifier and are newly initialized: ['decoder.embeddings.embed.weight', 'decoder.layers.0.cross_attention.k_proj.weight', 'decoder.layers.0.cross_attention.o_proj.weight', 'decoder.layers.0.cross_attention.q_proj.weight', 'decoder.layers.0.cross_attention.v_proj.weight', 'decoder.layers.0.mlp.down_proj.weight', 'decoder.layers.0.mlp.gate_up_proj.weight', 'decoder.layers.0.pre_ca_norm.weight', 'decoder.layers.0.pre_mlp_norm.weight', 'decoder.layers.0.pre_sa_norm.weight', 'decoder.layers.0.self_attention.k_proj.weight', 'decoder.layers.0.self_attention.o_proj.weight', 'decoder.layers.0.self_attention.q_proj.weight', 'decoder.layers.0.self_attention.v_proj.weight', 'decoder.layers.1.cross_attention.k_proj.weight', 'decoder.layers.1.cross_attention.o_proj.weight', 'decoder.layers.1.cross_attention.q_proj.weight', 'decoder.layers.1.cross_attention.v_proj.weight', 'decoder.layers.1.

✅ Router loaded


---

## Synthetic Data: Text-to-SQL Prompts

We’ll generate synthetic prompts across three bands:
- **Simple**: single-table lookups and filters
- **Medium**: joins + aggregations
- **Complex**: CTEs / subqueries / window-function style instructions

This synthetic set gives us a controllable testbed for routing and benchmarking.


In [3]:
SIMPLE_TEMPLATES = [
    "Get all {entity} from {location}",
    "Show me {entity} where {field} is {value}",
    "List all {entity} with {field} greater than {value}",
    "Count the number of {entity}",
    "List {entity} ordered by {field}",
]

MEDIUM_TEMPLATES = [
    "Find the average {metric} by {grouping}",
    "Show total {metric} for each {grouping} in {time_period}",
    "List {entity1} along with their {entity2}",
    "Get top 10 {entity} by {metric}",
    "List {entity1} and count of {entity2} for each",
]

COMPLEX_TEMPLATES = [
    "Rank {entity} by {metric} within each {grouping} and show top {value}",
    "Calculate month-over-month growth rate of {metric} by {grouping}",
    "Calculate the running total of {metric} for each {entity} over time",
    "Show {entity} with the largest {metric} change between {period1} and {period2}",
    "Identify {entity} whose {metric} increased compared to previous {time_period}",
]

ENTITIES = ["customers", "accounts", "transactions", "credit_applications", "risk_scores", "loans"]
METRICS = ["credit_score", "balance", "payment_amount", "risk_rating", "utilization_rate", "delinquency_rate"]
GROUPINGS = ["region", "customer_segment", "product_type", "risk_category", "account_type"]
LOCATIONS = ["California", "New York", "Texas", "Florida", "Illinois"]
TIME_PERIODS = ["last month", "Q4 2024", "the past year", "2023", "last quarter"]

print("Templates:", len(SIMPLE_TEMPLATES), len(MEDIUM_TEMPLATES), len(COMPLEX_TEMPLATES))


Templates: 5 5 5


In [4]:
def _fill(template: str) -> str:
    replacements = {
        "{entity}": random.choice(ENTITIES),
        "{entity1}": random.choice(ENTITIES),
        "{entity2}": random.choice(ENTITIES),
        "{field}": random.choice(METRICS + ["name", "id", "status", "date"]),
        "{metric}": random.choice(METRICS),
        "{grouping}": random.choice(GROUPINGS),
        "{location}": random.choice(LOCATIONS),
        "{time_period}": random.choice(TIME_PERIODS),
        "{period1}": "Q3 2024",
        "{period2}": "Q4 2024",
        "{value}": str(random.randint(1, 25)),
    }

    out = template
    for k, v in replacements.items():
        out = out.replace(k, v)
    return out


def make_synth_dataset(n_per_class: int = 30) -> pd.DataFrame:
    rows = []
    for label, templates in [
        ("simple", SIMPLE_TEMPLATES),
        ("medium", MEDIUM_TEMPLATES),
        ("complex", COMPLEX_TEMPLATES),
    ]:
        for _ in range(n_per_class):
            rows.append({
                "prompt": _fill(random.choice(templates)),
                "gold_complexity": label,
            })

    df = pd.DataFrame(rows).sample(frac=1.0, random_state=SEED).reset_index(drop=True)
    return df


df_prompts = make_synth_dataset(n_per_class=30)
print("Rows:", len(df_prompts))
display(df_prompts.head(10))
print("\nDistribution:\n", df_prompts["gold_complexity"].value_counts())


Rows: 90


|   | prompt | gold_complexity |
|---|--------|-----------------|
| 0 | Find the average credit_score by region | medium |
| 1 | Show me risk_scores where balance is 2 | simple |
| 2 | Show total delinquency_rate for each region in... | medium |
| 3 | Show transactions with the largest risk_rating... | complex |
| 4 | Get all customers from California | simple |
| 5 | List all customers with risk_rating greater th... | simple |
| 6 | List loans and count of transactions for each | medium |
| 7 | Identify loans whose risk_rating increased com... | complex |
| 8 | List all accounts with name greater than 5 | simple |
| 9 | List risk_scores and count of loans for each | medium |



Distribution:
 gold_complexity
medium     30
simple     30
complex    30
Name: count, dtype: int64


### Exercise: Add FICO-Specific Templates

Add 2 templates to each list (`SIMPLE_TEMPLATES`, `MEDIUM_TEMPLATES`, `COMPLEX_TEMPLATES`) that reflect realistic credit/financial analytics questions you expect from internal users.

Then re-run `make_synth_dataset` and keep these prompts for the router + benchmark sections.
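
As a starting point for this exercise, the sketch below lists a few illustrative templates (hypothetical examples, not an official FICO set) that reuse only the placeholders `_fill` already understands. The `unknown_placeholders` helper is also hypothetical; it simply catches typos such as `{metrics}` before they reach the dataset.

```python
import re

# Placeholders handled by the notebook's `_fill` helper.
KNOWN_PLACEHOLDERS = {
    "entity", "entity1", "entity2", "field", "metric", "grouping",
    "location", "time_period", "period1", "period2", "value",
}

# Illustrative additions (assumptions for this sketch).
EXTRA_SIMPLE = [
    "List {entity} opened in {time_period}",
    "Show {entity} in {location} where {field} is below {value}",
]
EXTRA_MEDIUM = [
    "Compare average {metric} across {grouping} for {time_period}",
    "Count {entity} per {grouping} with {metric} above {value}",
]
EXTRA_COMPLEX = [
    "Compute the 3-month moving average of {metric} for each {grouping}",
    "Find {entity} in the top {value} percent of {metric} within each {grouping} for {period1} but not {period2}",
]


def unknown_placeholders(template: str) -> set:
    """Return placeholder names in `template` that `_fill` would leave unreplaced."""
    return set(re.findall(r"\{(\w+)\}", template)) - KNOWN_PLACEHOLDERS


# Fail fast if any new template uses a placeholder `_fill` does not know.
for t in EXTRA_SIMPLE + EXTRA_MEDIUM + EXTRA_COMPLEX:
    assert not unknown_placeholders(t), t
```

To use them, extend the existing lists (for example `SIMPLE_TEMPLATES += EXTRA_SIMPLE`) before calling `make_synth_dataset` again.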


In [5]:
import plotly.express as px

fig = px.histogram(
    df_prompts,
    x="gold_complexity",
    color="gold_complexity",
    title="Synthetic Prompt Distribution",
)
fig.update_layout(showlegend=False, height=350)
fig.show()


---

## Build the Router Wrapper

We’ll wrap NVIDIA’s classifier behind a small interface that returns:
- `task_type`
- `complexity_dimensions` (6 scores)
- `overall_complexity`

Because model heads can be implemented differently across versions, we’ll implement **robust parsing** and a helpful `debug()` method that prints the raw output structure.
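
The wrapper in this notebook takes the plain mean of the six dimension scores as `overall_complexity`. A weighted combination is a natural alternative. The sketch below assumes a set of weights that emphasise creativity and reasoning; the specific values are an assumption for illustration and should be checked against the model card for `nvidia/prompt-task-and-complexity-classifier`. The names `ASSUMED_WEIGHTS` and `weighted_complexity` are hypothetical.

```python
from typing import Dict

# Assumed weights (hypothetical for this sketch; verify against the model card).
ASSUMED_WEIGHTS: Dict[str, float] = {
    "creativity": 0.35,
    "reasoning": 0.25,
    "constraints": 0.15,
    "domain_knowledge": 0.15,
    "contextual_knowledge": 0.05,
    "few_shot": 0.05,
}


def weighted_complexity(dims: Dict[str, float], weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted average of per-dimension scores; weights are normalised to sum to 1."""
    missing = set(weights) - set(dims)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    total = sum(weights.values())
    return sum(dims[name] * w for name, w in weights.items()) / total


# Example scores in [0, 1], keyed like the notebook's COMPLEXITY_DIMS.
example = {
    "creativity": 0.1, "reasoning": 0.6, "contextual_knowledge": 0.2,
    "domain_knowledge": 0.8, "constraints": 0.5, "few_shot": 0.0,
}
score = weighted_complexity(example)
plain_mean = sum(example.values()) / len(example)
print(f"weighted={score:.3f} mean={plain_mean:.3f}")
```

Whichever aggregation you pick, the thresholds in the routing policy need to be tuned against it, since the two scores are on different effective scales.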


In [6]:
TASK_LABELS = [
    "Open QA",
    "Closed QA",
    "Summarization",
    "Text Generation",
    "Code Generation",
    "Chatbot",
    "Classification",
    "Rewrite",
    "Brainstorming",
    "Extraction",
    "Other",
]

COMPLEXITY_DIMS = [
    "creativity",
    "reasoning",
    "contextual_knowledge",
    "domain_knowledge",
    "constraints",
    "few_shot",
]

def _softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    return torch.softmax(x, dim=dim)


In [7]:
class NVIDIARouter:
    def __init__(
        self,
        model,
        tokenizer,
        low_threshold: float = 0.40,
        high_threshold: float = 0.70,
        sql_task_label: str = "Code Generation",
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.low_threshold = low_threshold
        self.high_threshold = high_threshold
        self.sql_task_label = sql_task_label

    @torch.inference_mode()
    def debug(self, prompt: str) -> None:
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
        out = self.model(**inputs)
        print("Output type:", type(out))
        if isinstance(out, dict):
            print("Keys:", list(out.keys()))
        else:
            print("Attrs (subset):", [a for a in dir(out) if a.endswith("logits")][:20])
        try:
            print("repr(out) snippet:", str(out)[:500])
        except Exception:
            pass

    @torch.inference_mode()
    def classify(self, prompt: str) -> Dict:
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
        out = self.model(**inputs)

        # Best-effort parsing. Different implementations may expose heads differently.
        task_probs = None
        dim_scores = None

        # Case A: dict-like outputs with explicit heads
        if isinstance(out, dict):
            if "task_logits" in out:
                task_probs = _softmax(out["task_logits"][0]).detach().cpu().numpy()
            if "complexity_logits" in out:
                # Expect shape [B, 6] or [B, 6, K]. If [B,6], treat as scores.
                c = out["complexity_logits"][0]
                if c.ndim == 1 and c.shape[0] == len(COMPLEXITY_DIMS):
                    dim_scores = torch.sigmoid(c).detach().cpu().numpy()

        # Case B: object outputs exposing logits attributes
        if task_probs is None and hasattr(out, "task_logits"):
            task_probs = _softmax(out.task_logits[0]).detach().cpu().numpy()
        if dim_scores is None and hasattr(out, "complexity_logits"):
            c = out.complexity_logits[0]
            if c.ndim == 1 and c.shape[0] == len(COMPLEXITY_DIMS):
                dim_scores = torch.sigmoid(c).detach().cpu().numpy()

        # Case C: last-resort: single logits tensor. Assume first 11 correspond to task.
        if task_probs is None and hasattr(out, "logits"):
            logits = out.logits
            if isinstance(logits, (tuple, list)):
                logits = logits[0]
            if isinstance(logits, torch.Tensor) and logits.ndim == 2 and logits.shape[1] >= len(TASK_LABELS):
                task_probs = _softmax(logits[0, : len(TASK_LABELS)]).detach().cpu().numpy()

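        if task_probs is None:
            # If we can’t parse the task head, fall back to a uniform distribution for the notebook demo.
            # np.argmax then returns index 0, so task_type defaults to TASK_LABELS[0] ("Open QA").
            task_probs = np.ones(len(TASK_LABELS), dtype=float) / len(TASK_LABELS)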

        task_idx = int(np.argmax(task_probs))
        task_type = TASK_LABELS[task_idx]

        if dim_scores is None:
            # Heuristic complexity score from keywords (only used if parsing dims isn’t possible).
            p = prompt.lower()
            score = 0.25
            if any(k in p for k in ["join", "group", "average", "top"]):
                score = max(score, 0.55)
            # Avoid a bare "with" here: it also appears in simple prompts ("List all accounts with ...").
            if any(k in p for k in ["rank", "running total", "month-over-month", "window", "change between", "compared to previous"]):
                score = max(score, 0.80)
            dim_scores = np.array([score] * len(COMPLEXITY_DIMS), dtype=float)

        overall = float(np.mean(dim_scores))

        return {
            "task_type": task_type,
            "task_probs": task_probs,
            "complexity_dimensions": {d: float(dim_scores[i]) for i, d in enumerate(COMPLEXITY_DIMS)},
            "overall_complexity": overall,
        }

    def route(self, prompt: str) -> Dict:
        r = self.classify(prompt)
        c = r["overall_complexity"]
        task = r["task_type"]

        # Task-aware override: if it’s Code Generation and not too hard, prefer SQL specialist.
        if task == self.sql_task_label and c < self.high_threshold:
            model_role = "sql"
        else:
            if c < self.low_threshold:
                model_role = "small"
            elif c < self.high_threshold:
                model_role = "sql"
            else:
                model_role = "large"

        r["model_role"] = model_role
        r["policy"] = {
            "low_threshold": self.low_threshold,
            "high_threshold": self.high_threshold,
            "sql_task_label": self.sql_task_label,
        }
        return r

router = NVIDIARouter(router_model, router_tokenizer)
print("✅ NVIDIARouter ready")


✅ NVIDIARouter ready


In [8]:
# Quick sanity check on router output
sample = df_prompts.iloc[0]["prompt"]
print("Prompt:", sample)
out = router.route(sample)
print(json.dumps({k: out[k] for k in ["task_type", "overall_complexity", "model_role"]}, indent=2))


Prompt: Find the average credit_score by region


/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: indexSelectSmallIndex: block: [0,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1478: index

AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
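
The assertion `srcIndex < srcSelectDimSize` generally means an index-select operation (an embedding lookup, for example) received an index outside its table. One plausible cause here, given the "newly initialized" weights warning at load time, is a mismatch between the tokenizer's ids and the embedding size of the architecture that was actually instantiated. The sketch below is a hedged way to check that; `out_of_range_ids` is a hypothetical helper that works on plain Python lists so it can be exercised without a GPU.

```python
from typing import List, Tuple


def out_of_range_ids(token_ids: List[int], vocab_size: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that a table with `vocab_size` rows cannot index."""
    return [(pos, tid) for pos, tid in enumerate(token_ids) if tid < 0 or tid >= vocab_size]


# Toy example: ids 7 and 9 do not fit a table with 5 rows.
bad = out_of_range_ids([0, 4, 7, 2, 9], vocab_size=5)
print(bad)  # [(2, 7), (4, 9)]
```

In the notebook, one could pass `router_tokenizer(sample)["input_ids"]` as `token_ids` and `router_model.get_input_embeddings().num_embeddings` as `vocab_size`; this assumes the loaded model exposes an input embedding layer. A device-side assert leaves the CUDA context unusable, so restart the kernel before retrying, and consider reproducing the call with the model on CPU, where the error is reported at the failing operation.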


---

## Model Pool (Hugging Face)

We define 3 tiers for routing and cost demonstration:

| Tier | Hugging Face Model | Notes | Cost/1K tokens (demo) |
|------|---------------------|------|-----------------------|
| Small | `Qwen/Qwen2.5-3B-Instruct` | cheap + fast | $0.001 |
| SQL | `defog/sqlcoder-7b-2` | specialized | $0.005 |
| Large | `meta-llama/Llama-3.1-70B-Instruct` | best quality | $0.09 |

Note: the 70B model is gated on Hugging Face, so loading it requires accepting the license and supplying an access token. For the workshop, we’ll default to a simulator for repeatable benchmarking.
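
To get a feel for what these demo prices imply, the sketch below computes a blended price per 1K tokens for a traffic mix. The prices come from the table above; the 40/40/20 mix and the `blended_price` helper are assumptions for illustration, not measured traffic.

```python
# Demo prices per 1K tokens, taken from the model pool table.
PRICE_PER_1K = {"small": 0.001, "sql": 0.005, "large": 0.09}


def blended_price(mix: dict) -> float:
    """Expected price per 1K tokens for a traffic mix {role: share}; shares must sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(PRICE_PER_1K[role] * share for role, share in mix.items())


# Hypothetical mix: 40% small, 40% SQL specialist, 20% large.
mix = {"small": 0.4, "sql": 0.4, "large": 0.2}
routed = blended_price(mix)
savings_pct = 100.0 * (1.0 - routed / PRICE_PER_1K["large"])
print(f"{routed:.4f} per 1K tokens, {savings_pct:.1f}% below always-large")
```

Because the large tier is 18 to 90 times more expensive than the other two in this table, the share of traffic sent to it dominates the blended price.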


In [None]:
@dataclass
class ModelSpec:
    role: str
    hf_id: str
    approx_params_b: float
    cost_per_1k_tokens: float
    base_latency_ms: float

MODEL_POOL: Dict[str, ModelSpec] = {
    "small": ModelSpec(
        role="small",
        hf_id="Qwen/Qwen2.5-3B-Instruct",
        approx_params_b=3.0,
        cost_per_1k_tokens=0.001,
        base_latency_ms=60.0,
    ),
    "sql": ModelSpec(
        role="sql",
        hf_id="defog/sqlcoder-7b-2",
        approx_params_b=7.0,
        cost_per_1k_tokens=0.005,
        base_latency_ms=120.0,
    ),
    "large": ModelSpec(
        role="large",
        hf_id="meta-llama/Llama-3.1-70B-Instruct",
        approx_params_b=70.0,
        cost_per_1k_tokens=0.09,
        base_latency_ms=600.0,
    ),
}

pd.DataFrame([vars(v) for v in MODEL_POOL.values()]).sort_values("approx_params_b")


### Optional: Load Real LLMs

For the lab, we default to simulation (fast, repeatable). If you want to load real models:
- Expect GPU memory pressure
- 70B models typically require multi-GPU or remote endpoints
- Use quantization where possible

We provide a stub below that you can adapt to your serving environment.


In [None]:
LOAD_REAL_MODELS = False

# If you enable this, consider: bitsandbytes 4-bit, device_map="auto", and endpoints for 70B.
# from transformers import AutoModelForCausalLM, AutoTokenizer
# real_models = {}
# if LOAD_REAL_MODELS:
#     for role, spec in MODEL_POOL.items():
#         tok = AutoTokenizer.from_pretrained(spec.hf_id, trust_remote_code=True)
#         mdl = AutoModelForCausalLM.from_pretrained(spec.hf_id, device_map="auto", trust_remote_code=True)
#         real_models[role] = (mdl, tok)

print("LOAD_REAL_MODELS =", LOAD_REAL_MODELS)


---

## Simulator (for Benchmarking)

We simulate LLM performance using a simple quality model: accuracy depends on (model tier × prompt complexity). This keeps the lab interactive and lets us focus on routing economics.
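
Because the simulator draws correctness from a fixed accuracy table, the expected accuracy of any routing assignment can be computed in closed form before running the benchmark. The sketch below copies the table from the simulator cell and compares an idealised assignment (simple → small, medium → sql, complex → large) with always-large on a balanced dataset. The idealised assignment is an assumption; the actual router may route differently.

```python
# Accuracy by (role, gold_complexity), copied from the ModelSimulator cell.
ACC = {
    ("small", "simple"): 0.95, ("small", "medium"): 0.65, ("small", "complex"): 0.35,
    ("sql", "simple"): 0.90, ("sql", "medium"): 0.93, ("sql", "complex"): 0.70,
    ("large", "simple"): 0.96, ("large", "medium"): 0.94, ("large", "complex"): 0.90,
}
BANDS = ["simple", "medium", "complex"]


def expected_accuracy(assignment: dict) -> float:
    """Mean accuracy over equally weighted bands for a {band: role} assignment."""
    return sum(ACC[(assignment[b], b)] for b in BANDS) / len(BANDS)


ideal = expected_accuracy({"simple": "small", "medium": "sql", "complex": "large"})
always_large = expected_accuracy({b: "large" for b in BANDS})
print(f"ideal routing: {ideal:.3f}, always-large: {always_large:.3f}")
```

Under these assumptions an ideal router gives up less than one accuracy point relative to always-large. With only a few dozen sampled prompts, individual benchmark runs will scatter by several points around these expectations.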


In [None]:
class ModelSimulator:
    def __init__(self, pool: Dict[str, ModelSpec]):
        self.pool = pool
        # Accuracy by (role, gold_complexity)
        self.acc = {
            ("small", "simple"): 0.95,
            ("small", "medium"): 0.65,
            ("small", "complex"): 0.35,
            ("sql", "simple"): 0.90,
            ("sql", "medium"): 0.93,
            ("sql", "complex"): 0.70,
            ("large", "simple"): 0.96,
            ("large", "medium"): 0.94,
            ("large", "complex"): 0.90,
        }

    def _estimate_tokens(self, prompt: str) -> int:
        # Rough heuristic for the lab
        return int(max(80, 2.5 * len(prompt.split())))

    def generate(self, role: str, prompt: str, gold_complexity: str) -> Dict:
        spec = self.pool[role]
        tokens = self._estimate_tokens(prompt)
        cost = (tokens / 1000.0) * spec.cost_per_1k_tokens
        latency = spec.base_latency_ms * (1.0 + random.uniform(-0.15, 0.15))
        p = self.acc.get((role, gold_complexity), 0.7)
        ok = random.random() < p

        # Mock SQL output (good enough for demo charts)
        sql = "SELECT * FROM customers WHERE region = 'California';"
        if "average" in prompt.lower():
            sql = "SELECT customer_segment, AVG(credit_score) FROM customers GROUP BY customer_segment;"
        if "rank" in prompt.lower() or "running total" in prompt.lower():
            sql = "WITH t AS (...) SELECT * FROM t;"
        if not ok:
            sql = sql.replace("SELECT", "SELCET")

        return {
            "role": role,
            "hf_id": spec.hf_id,
            "tokens": tokens,
            "cost": cost,
            "latency_ms": latency,
            "is_correct": ok,
            "sql": sql,
        }

sim = ModelSimulator(MODEL_POOL)
print("✅ Simulator ready")


---

## Benchmark: Routed vs Always-70B

We compare two policies:
1. **Routed**: use NVIDIA router → choose small/sql/large
2. **Always-Large**: always call the 70B model

Metrics:
- **Accuracy** (simulated correctness)
- **Latency**
- **Cost**


In [None]:
from tqdm.auto import tqdm

def run_benchmark(df: pd.DataFrame, router: NVIDIARouter, sim: ModelSimulator, n: int = 60) -> pd.DataFrame:
    df = df.sample(n=min(n, len(df)), random_state=SEED).reset_index(drop=True)
    rows = []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        prompt = row["prompt"]
        gold = row["gold_complexity"]

        routed = router.route(prompt)
        routed_role = routed["model_role"]

        r1 = sim.generate(routed_role, prompt, gold)
        r2 = sim.generate("large", prompt, gold)

        rows.append({
            "prompt": prompt,
            "gold_complexity": gold,
            "task_type": routed["task_type"],
            "overall_complexity": routed["overall_complexity"],
            "routed_role": routed_role,
            "routed_cost": r1["cost"],
            "routed_latency_ms": r1["latency_ms"],
            "routed_correct": r1["is_correct"],
            "always_large_cost": r2["cost"],
            "always_large_latency_ms": r2["latency_ms"],
            "always_large_correct": r2["is_correct"],
        })

    return pd.DataFrame(rows)

results = run_benchmark(df_prompts, router, sim, n=60)
display(results.head(5))
print("Rows:", len(results))


In [None]:
def summarize(results: pd.DataFrame) -> Dict:
    routed_cost = float(results["routed_cost"].sum())
    large_cost = float(results["always_large_cost"].sum())
    cost_savings = 100.0 * (1.0 - routed_cost / large_cost)

    routed_lat = float(results["routed_latency_ms"].mean())
    large_lat = float(results["always_large_latency_ms"].mean())
    lat_improve = 100.0 * (1.0 - routed_lat / large_lat)

    routed_acc = float(results["routed_correct"].mean())
    large_acc = float(results["always_large_correct"].mean())

    return {
        "routed_total_cost": routed_cost,
        "always_large_total_cost": large_cost,
        "cost_savings_pct": cost_savings,
        "routed_avg_latency_ms": routed_lat,
        "always_large_avg_latency_ms": large_lat,
        "latency_improvement_pct": lat_improve,
        "routed_accuracy": routed_acc,
        "always_large_accuracy": large_acc,
    }

summary = summarize(results)
print(json.dumps(summary, indent=2))
print("\nRouted role distribution:\n", results["routed_role"].value_counts())


---

## Visualizations

We visualize the economics: cost, latency, and accuracy.


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=3, subplot_titles=("Total Cost", "Avg Latency", "Accuracy"))
fig.add_trace(go.Bar(x=["Routed", "Always-70B"], y=[summary["routed_total_cost"], summary["always_large_total_cost"]]), row=1, col=1)
fig.add_trace(go.Bar(x=["Routed", "Always-70B"], y=[summary["routed_avg_latency_ms"], summary["always_large_avg_latency_ms"]]), row=1, col=2)
fig.add_trace(go.Bar(x=["Routed", "Always-70B"], y=[summary["routed_accuracy"], summary["always_large_accuracy"]]), row=1, col=3)
fig.update_layout(height=350, title_text="Routed vs Always-70B Summary")
fig.show()


In [None]:
role_counts = results["routed_role"].value_counts()
fig = go.Figure(data=[go.Pie(labels=role_counts.index.tolist(), values=role_counts.values.tolist(), hole=0.35)])
fig.update_layout(title="Routed Model Mix", height=350)
fig.show()


In [None]:
# Radar chart for a single prompt’s complexity dimensions
example_prompt = df_prompts.sample(n=1, random_state=SEED).iloc[0]["prompt"]
r = router.classify(example_prompt)
dims = r["complexity_dimensions"]

theta = list(dims.keys()) + [list(dims.keys())[0]]
radial = list(dims.values()) + [list(dims.values())[0]]

fig = go.Figure()
fig.add_trace(go.Scatterpolar(r=radial, theta=theta, fill="toself", name="complexity"))
fig.update_layout(title="Router Complexity Dimensions (Example)", polar=dict(radialaxis=dict(visible=True, range=[0, 1])), height=450)
print("Prompt:", example_prompt)
print("Task:", r["task_type"], "Overall complexity:", r["overall_complexity"])
fig.show()


---

## Interactive Router Console

Enter a prompt, see the router scores, chosen model tier, and a simulated response with cost/latency.


In [None]:
import ipywidgets as widgets

prompt_box = widgets.Textarea(
    value="Find the average credit_score by customer_segment for last quarter",
    description="Prompt:",
    layout=widgets.Layout(width="100%", height="90px"),
)

run_btn = widgets.Button(description="Analyze & Route", button_style="primary")
out = widgets.Output(layout=widgets.Layout(border="1px solid #444", padding="10px"))

def _on_click(_):
    with out:
        clear_output(wait=True)
        p = prompt_box.value.strip()
        if not p:
            print("Enter a prompt.")
            return
        r = router.route(p)
        role = r["model_role"]
        # No gold label here; pick medium as neutral for simulation
        sim_out = sim.generate(role, p, gold_complexity="medium")

        display(Markdown(f"**Task:** `{r['task_type']}`\n\n**Overall complexity:** `{r['overall_complexity']:.2f}`\n\n**Selected model role:** `{role}` → `{MODEL_POOL[role].hf_id}`"))

        dims = r["complexity_dimensions"]
        theta = list(dims.keys()) + [list(dims.keys())[0]]
        radial = list(dims.values()) + [list(dims.values())[0]]
        fig = go.Figure()
        fig.add_trace(go.Scatterpolar(r=radial, theta=theta, fill="toself"))
        fig.update_layout(title="Complexity Dimensions", polar=dict(radialaxis=dict(visible=True, range=[0, 1])), height=350)
        fig.show()

        display(Markdown("**Simulated SQL:**"))
        print(sim_out["sql"])
        display(Markdown(f"**Latency:** `{sim_out['latency_ms']:.0f} ms`  |  **Tokens:** `{sim_out['tokens']}`  |  **Cost:** `${sim_out['cost']:.6f}`"))

run_btn.on_click(_on_click)
display(widgets.VBox([prompt_box, run_btn, out]))


---

## Exercise: Threshold Tuning

Tune the routing thresholds to hit your target tradeoff.

Goal examples:
- **Max savings**: push more traffic to 3B/SQL
- **Max quality**: route uncertain prompts to 70B

Try different `(low_threshold, high_threshold)` and re-run the benchmark.
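
A small grid sweep makes the tradeoff visible without re-running the full benchmark. The sketch below is self-contained: it uses the demo prices and simulator accuracies from earlier cells, a complexity-only policy that mirrors the non-override branch of `NVIDIARouter.route`, and a handful of hypothetical `(overall_complexity, gold band)` pairs standing in for real router scores.

```python
from itertools import product

PRICE = {"small": 0.001, "sql": 0.005, "large": 0.09}  # demo prices per 1K tokens
ACC = {
    ("small", "simple"): 0.95, ("small", "medium"): 0.65, ("small", "complex"): 0.35,
    ("sql", "simple"): 0.90, ("sql", "medium"): 0.93, ("sql", "complex"): 0.70,
    ("large", "simple"): 0.96, ("large", "medium"): 0.94, ("large", "complex"): 0.90,
}

# Hypothetical scores; in the lab these would come from the `results` DataFrame.
SCORED = [
    (0.20, "simple"), (0.30, "simple"), (0.45, "medium"),
    (0.55, "medium"), (0.65, "complex"), (0.85, "complex"),
]


def pick_role(score: float, low: float, high: float) -> str:
    """Complexity-only policy: below `low` → small, below `high` → sql, else large."""
    if score < low:
        return "small"
    if score < high:
        return "sql"
    return "large"


def evaluate(low: float, high: float) -> tuple:
    """Return (expected accuracy, mean price per 1K tokens) for one threshold pair."""
    roles = [(pick_role(s, low, high), band) for s, band in SCORED]
    acc = sum(ACC[(r, b)] for r, b in roles) / len(roles)
    cost = sum(PRICE[r] for r, _ in roles) / len(roles)
    return acc, cost


for low, high in product([0.35, 0.50], [0.60, 0.75]):
    acc, cost = evaluate(low, high)
    print(f"low={low:.2f} high={high:.2f} expected_acc={acc:.3f} price_per_1k={cost:.4f}")
```

Raising `high_threshold` moves borderline prompts from the large tier to the SQL specialist, which lowers cost but also lowers expected accuracy on complex prompts.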


In [None]:
# TODO: Try different thresholds, then re-run run_benchmark + summarize
# Example:
# router = NVIDIARouter(router_model, router_tokenizer, low_threshold=0.35, high_threshold=0.75)
# results = run_benchmark(df_prompts, router, sim, n=60)
# summary = summarize(results)
# print(json.dumps(summary, indent=2))

print("Edit thresholds in this cell and re-run the benchmark.")


---

## Production Considerations (FICO Context)

- **Caching**: route+response caching for frequent prompts
- **Observability**: log (task, complexity, selected_model, latency, user rating)
- **Fallbacks**: low confidence or failures → escalate to 70B
- **Drift**: prompt distribution changes over time → re-evaluate thresholds
- **Governance**: document routing policy for cost/risk audits
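
The fallback bullet can be sketched as a thin wrapper around a routing decision. Everything named here is hypothetical: `task_confidence` stands for something like the top task probability, `ESCALATION` is one possible escalation order, and `generate` is whatever callable invokes your serving endpoint.

```python
from typing import Callable, Dict

ESCALATION = {"small": "sql", "sql": "large"}  # hypothetical escalation order


def route_with_fallback(
    decision: Dict,
    generate: Callable[[str], Dict],
    min_confidence: float = 0.5,
) -> Dict:
    """Send low-confidence decisions to the large tier and step up one tier on failure."""
    role = decision["model_role"]
    if decision["task_confidence"] < min_confidence:
        role = "large"
    while True:
        try:
            result = generate(role)
            result["final_role"] = role
            return result
        except RuntimeError:
            # No tier above "large": surface the failure to the caller.
            if role not in ESCALATION:
                raise
            role = ESCALATION[role]


def flaky_generate(role: str) -> Dict:
    # Stand-in for a real model call: the small tier always times out here.
    if role == "small":
        raise RuntimeError("simulated timeout")
    return {"sql": "SELECT 1;"}


outcome = route_with_fallback({"model_role": "small", "task_confidence": 0.9}, flaky_generate)
print(outcome["final_role"])  # sql
```

Logging both the original `model_role` and `final_role` would also feed the observability bullet, since escalation rate is a useful drift signal.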


---

## Summary

You now have:
- An NVIDIA-style **router classifier wrapper**
- A **task/complexity-driven routing policy**
- A **3-tier model pool** (3B, SQL specialist, 70B)
- A **benchmark** that quantifies cost/latency/accuracy tradeoffs
- An **interactive console** for hands-on exploration
