# Session 3 – Benchmark Open-Source Models

Benchmark latency & approximate tokens/sec for multiple model aliases via Foundry Local.

### Explanation: Dependency Installation
Installs minimal packages for benchmarking:
- `foundry-local-sdk` for managing/attaching to local models.
- `openai` for a simple chat completion client.
- `numpy`, which the cells below do not use; it is installed for future extensions such as vector ops.

The install is idempotent, so the cell is safe to re-run.

# Scenario
This benchmark notebook measures latency and approximate throughput (tokens/sec) for one or more locally hosted open‑source model aliases via Foundry Local. It:
- Discovers available model IDs (or respects BENCH_MODELS env override).
- Warms each model once to mitigate first-token cold start.
- Executes multiple chat completion rounds per model and aggregates latency + token usage.
- Outputs JSON plus a Markdown-friendly summary table.

Use this to compare small language model trade-offs (speed vs. capability) before integrating routing or cost heuristics.

In [None]:
!pip install -q foundry-local-sdk openai numpy

### Explanation: Preview Model Discovery
Lists the models Foundry Local currently serves, giving immediate feedback before the configuration cell builds the final list. Any error (for example, the service has not started yet) is caught and printed rather than raised.

In [None]:
import os, time, statistics, json
from foundry_local import FoundryLocalManager
from openai import OpenAI

# Early model discovery preview (optional)
# This gives users immediate feedback on what Foundry Local currently has loaded
# before the configuration/discovery cell refines the model list.
try:
    _preview_base = os.getenv('BASE_URL','http://127.0.0.1:57127/v1')
    _preview_key = os.getenv('API_KEY','not-needed')
    _preview_client = OpenAI(base_url=_preview_base, api_key=_preview_key)
    _preview_models = [m.id for m in _preview_client.models.list().data]
    if _preview_models:
        print(f"Foundry Local models discovered (preview): {_preview_models}")
    else:
        print(f"No models discovered yet at {_preview_base}. Load models before benchmarking (e.g., 'foundry model run phi-4-mini').")
except Exception as _e:
    print(f"Model preview unavailable: {_e}")

### Explanation: Benchmark Configuration & Model Filtering
Sets environment-driven benchmarking parameters (rounds, prompt, generation settings). Discovers models, optionally filters to requested list, and prints the final set to benchmark with basic validation and helpful warnings.

In [None]:
# Benchmark configuration & model discovery (override via environment variables)
BASE_URL = os.getenv('BASE_URL','http://127.0.0.1:57127/v1')
API_KEY = os.getenv('API_KEY','not-needed')

_raw_models = os.getenv('BENCH_MODELS','').strip()
requested_models = [m.strip() for m in _raw_models.split(',') if m.strip()] if _raw_models else []

ROUNDS = int(os.getenv('BENCH_ROUNDS','3'))
if ROUNDS < 1:
    raise ValueError('BENCH_ROUNDS must be >= 1')
PROMPT = os.getenv('BENCH_PROMPT','Explain retrieval augmented generation briefly.')
MAX_TOKENS = int(os.getenv('BENCH_MAX_TOKENS','120'))
TEMPERATURE = float(os.getenv('BENCH_TEMPERATURE','0.2'))

def _discover_models():
    try:
        c = OpenAI(base_url=BASE_URL, api_key=API_KEY)
        data = c.models.list().data
        return [m.id for m in data]
    except Exception as e:
        print(f"Model discovery failed: {e}")
        return []

_discovered = _discover_models()
if not _discovered:
    print("Warning: No models discovered at BASE_URL. Ensure Foundry Local is running and models are loaded.")

if not requested_models or requested_models == ['auto'] or 'ALL' in requested_models:
    MODELS = _discovered
else:
    # Filter requested models to those actually discovered
    MODELS = [m for m in requested_models if m in _discovered] or requested_models  # fallback to requested even if not discovered
    missing = [m for m in requested_models if m not in _discovered]
    if missing:
        print(f"Notice: The following requested models were not discovered and may fail during benchmarking: {missing}")

MODELS = [m for m in MODELS if m]
if not MODELS:
    raise ValueError("No models available to benchmark. Start a model (e.g., 'foundry model run phi-4-mini') or set BENCH_MODELS.")

print(f"Benchmarking models: {MODELS}\nRounds: {ROUNDS}  Max Tokens: {MAX_TOKENS}  Temp: {TEMPERATURE}")
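
The selection rule in the cell above can be isolated into two small pure functions, which makes its edge cases easier to see. This is a minimal sketch, not part of the notebook: `parse_model_list`, `select_models`, and the `discovered` list (including the `qwen2.5-0.5b` alias) are hypothetical names and values used only for illustration.

```python
def parse_model_list(raw):
    """Split a comma-separated BENCH_MODELS value into clean aliases."""
    return [m.strip() for m in raw.split(',') if m.strip()]


def select_models(requested, discovered):
    """Mirror the configuration cell's rule; return (models, missing)."""
    # Empty request, ['auto'], or any 'ALL' entry means "benchmark everything discovered"
    if not requested or requested == ['auto'] or 'ALL' in requested:
        return list(discovered), []
    missing = [m for m in requested if m not in discovered]
    kept = [m for m in requested if m in discovered]
    # Fall back to the raw request only when nothing requested was discovered
    return (kept or list(requested)), missing


discovered = ['phi-4-mini', 'qwen2.5-0.5b']  # hypothetical discovery result
print(select_models(parse_model_list(''), discovered))
print(select_models(parse_model_list(' phi-4-mini , gpt-oss-20b '), discovered))
print(select_models(parse_model_list('gpt-oss-20b'), discovered))
```

Note the asymmetry this exposes: when at least one requested model is discovered, undiscovered ones are dropped from the run; the full request is only kept as a fallback when none of them were discovered.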

### Explanation: Model Access Helper
`ensure_loaded(alias)` attaches to Foundry Local through `FoundryLocalManager`, resolves the concrete model id, and returns a `(manager, client, model_id)` tuple. If the model cannot be accessed it raises a `RuntimeError` with guidance, which the benchmark loop catches so it can skip that model and continue.

In [None]:
def ensure_loaded(alias):
    """Return (manager, client, model_id) ensuring the alias is accessible.
    Raises RuntimeError with guidance if the model cannot be accessed."""
    try:
        m = FoundryLocalManager(alias)  # triggers bootstrap / attaches to existing
        info = m.get_model_info(alias)
        model_id = getattr(info, 'id', alias)
        c = OpenAI(base_url=m.endpoint, api_key=m.api_key or 'not-needed')
        return m, c, model_id
    except Exception as e:
        # Chain the original exception so the full traceback stays available
        raise RuntimeError(
            f"Failed to access model '{alias}'. Ensure it is loaded in Foundry Local, or check BENCH_MODELS. Original error: {e}"
        ) from e

### Explanation: Single Round Execution
`run_round` performs one chat completion and returns latency + token usage fields if the backend provides them. This atomic unit powers warmup and statistical aggregation.

In [None]:
def run_round(client, model_id):
    """Run one chat completion; return (latency_s, total, prompt, completion tokens)."""
    # perf_counter is monotonic, so it is better suited to timing intervals than time.time
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{'role':'user','content':PROMPT}],
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
    )
    latency = time.perf_counter() - start
    # Token counts stay None when the backend omits usage data
    usage = getattr(resp, 'usage', None)
    total_tokens = getattr(usage, 'total_tokens', None) if usage else None
    prompt_tokens = getattr(usage, 'prompt_tokens', None) if usage else None
    completion_tokens = getattr(usage, 'completion_tokens', None) if usage else None
    return latency, total_tokens, prompt_tokens, completion_tokens
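
The values returned by `run_round` support more than one definition of throughput. The sketch below shows a generation-only rate (completion tokens divided by latency) next to the total-token rate; `generation_tokens_per_sec` and the token counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def generation_tokens_per_sec(completion_tokens, latency_s):
    """Completion tokens per second, or None when it cannot be computed."""
    if completion_tokens is None or latency_s <= 0:
        return None
    return completion_tokens / latency_s


# Hypothetical round: 30 prompt tokens, 120 completion tokens, 4.0 s wall time
prompt_tokens, completion_tokens, latency_s = 30, 120, 4.0

print(generation_tokens_per_sec(completion_tokens, latency_s))  # 30.0
print((prompt_tokens + completion_tokens) / latency_s)          # 37.5
print(generation_tokens_per_sec(None, latency_s))               # None
```

Both figures are approximate, because the measured latency covers prompt processing as well as generation. The total-token rate grows with prompt length, so it is less comparable across prompts.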

### Explanation: Benchmark Loop & Aggregation
Iterates each model:
- Warmup (excluded from stats) to mitigate cold start.
- Multiple measured rounds capturing latency + tokens.
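- Aggregates mean latency, p95 latency, and tokens/sec (total tokens, prompt plus completion, divided by round latency). With only a few rounds, `statistics.quantiles` extrapolates the p95, so it can exceed the slowest observed round.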
Stores per-model summary dicts for later rendering.

In [6]:
summary = []
for alias in MODELS:
    try:
        m, client, model_id = ensure_loaded(alias.strip())
    except Exception as e:
        print(e)
        continue
    # Warmup (not recorded)
    try:
        run_round(client, model_id)
    except Exception as e:
        print(f"Warmup failed for {alias}: {e}")
        continue

    latencies, tps = [], []
    prompt_tokens_total = 0
    completion_tokens_total = 0
    token_rounds = 0

    for _ in range(ROUNDS):
        try:
            latency, total_tokens, p_tokens, c_tokens = run_round(client, model_id)
        except Exception as e:
            print(f"Round failed for {alias}: {e}")
            continue
        latencies.append(latency)
        if total_tokens:
            tps.append(total_tokens/latency)
        if p_tokens is not None:
            prompt_tokens_total += p_tokens
        if c_tokens is not None:
            completion_tokens_total += c_tokens
            token_rounds += 1

    if not latencies:
        print(f"Skipping {alias}: no successful rounds.")
        continue

    latency_avg = statistics.mean(latencies)
    latency_p95 = statistics.quantiles(latencies, n=20)[-1] if len(latencies) > 1 else latencies[0]
    tokens_per_sec_avg = statistics.mean(tps) if tps else None

    summary.append({
        'alias': alias,
        'latency_avg_s': latency_avg,
        'latency_p95_s': latency_p95,
        'tokens_per_sec_avg': tokens_per_sec_avg,
        'prompt_tokens_total': prompt_tokens_total if token_rounds else None,
        'completion_tokens_total': completion_tokens_total if token_rounds else None,
        'rounds_ok': len(latencies),
        'configured_rounds': ROUNDS,
    })

Round failed for gpt-oss-20b: Connection error.
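
With a small number of rounds, the p95 produced by `statistics.quantiles(latencies, n=20)[-1]` is an extrapolation rather than an observed value. The sketch below compares the default (exclusive) method, the `inclusive` method, and a nearest-rank percentile on three hypothetical latencies; `p95_nearest_rank` is an illustrative helper, not part of the notebook.

```python
import statistics


def p95_nearest_rank(samples):
    """Nearest-rank 95th percentile: always one of the observed values."""
    if not samples:
        raise ValueError('samples must not be empty')
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


latencies = [1.0, 2.0, 3.0]  # hypothetical measurements, in seconds

exclusive = statistics.quantiles(latencies, n=20)[-1]
inclusive = statistics.quantiles(latencies, n=20, method='inclusive')[-1]

print(f"exclusive (default): {exclusive:.2f}")  # 3.80, above the slowest round
print(f"inclusive:           {inclusive:.2f}")  # 2.90
print(f"nearest rank:        {p95_nearest_rank(latencies):.2f}")  # 3.00
```

If the reported p95 should never exceed the slowest measured round, the inclusive method or a nearest-rank percentile is one option; raising `BENCH_ROUNDS` also makes any of the three estimates more meaningful.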


### Explanation: Results Rendering
Outputs a JSON summary (machine-friendly) and a Markdown table (human-friendly) with aligned columns. The table includes p95 latency as a tail indicator; tokens/sec and the token totals show `-` when the backend returned no usage data.

In [7]:
# Render results as JSON and markdown table
print("\nJSON Summary:\n" + json.dumps(summary, indent=2))

if summary:
    # Build table
    headers = ["alias","lat_avg(s)","lat_p95(s)","tok/s(avg)","prompt_tokens","completion_tokens","rounds_ok"]
    rows = []
    for r in summary:
        rows.append([
            r['alias'],
            f"{r['latency_avg_s']:.3f}",
            f"{r['latency_p95_s']:.3f}",
            f"{r['tokens_per_sec_avg']:.1f}" if r['tokens_per_sec_avg'] else '-',
            r.get('prompt_tokens_total') or '-',
            r.get('completion_tokens_total') or '-',
            f"{r['rounds_ok']}/{r['configured_rounds']}"
        ])
    col_widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    def fmt_row(row):
        return " | ".join(str(c).ljust(w) for c, w in zip(row, col_widths))
    print("\nMarkdown Table:\n")
    print(fmt_row(headers))
    print(" | ".join('-'*w for w in col_widths))
    for row in rows:
        print(fmt_row(row))
else:
    print("No results to display.")


JSON Summary:
[
  {
    "alias": "gpt-oss-20b",
    "latency_avg_s": 229.65615725517273,
    "latency_p95_s": 229.65615725517273,
    "tokens_per_sec_avg": null,
    "prompt_tokens_total": null,
    "completion_tokens_total": null,
    "rounds_ok": 1,
    "configured_rounds": 3
  },
  {
    "alias": "Phi-4-mini",
    "latency_avg_s": 22.810930013656616,
    "latency_p95_s": 28.72445616722107,
    "tokens_per_sec_avg": null,
    "prompt_tokens_total": null,
    "completion_tokens_total": null,
    "rounds_ok": 3,
    "configured_rounds": 3
  }
]

Markdown Table:

alias       | lat_avg(s) | lat_p95(s) | tok/s(avg) | prompt_tokens | completion_tokens | rounds_ok
----------- | ---------- | ---------- | ---------- | ------------- | ----------------- | ---------
gpt-oss-20b | 229.656    | 229.656    | -          | -             | -                 | 1/3      
Phi-4-mini  | 22.811     | 28.724     | -          | -             | -                 | 3/3      
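
When comparing rows like the ones above, it helps to rank by average latency and flag models whose rounds did not all succeed, since an average over a single round is weak evidence. The following is a minimal sketch over a trimmed copy of the summary (latencies rounded as in the table); `rank_by_latency` and its output keys are hypothetical names.

```python
def rank_by_latency(rows):
    """Sort summary rows by average latency and flag incomplete runs."""
    if not rows:
        return []
    ranked = sorted(rows, key=lambda r: r['latency_avg_s'])
    fastest = ranked[0]['latency_avg_s']
    return [
        {
            'alias': r['alias'],
            'slowdown_vs_fastest': r['latency_avg_s'] / fastest,
            'complete': r['rounds_ok'] == r['configured_rounds'],
        }
        for r in ranked
    ]


# Trimmed copy of the summary printed above
rows = [
    {'alias': 'gpt-oss-20b', 'latency_avg_s': 229.656, 'rounds_ok': 1, 'configured_rounds': 3},
    {'alias': 'Phi-4-mini', 'latency_avg_s': 22.811, 'rounds_ok': 3, 'configured_rounds': 3},
]

for entry in rank_by_latency(rows):
    note = '' if entry['complete'] else '  (incomplete run, treat with caution)'
    print(f"{entry['alias']}: {entry['slowdown_vs_fastest']:.1f}x{note}")
```

In this run `gpt-oss-20b` completed only 1 of 3 rounds, so its roughly tenfold slowdown relative to `Phi-4-mini` should be read as indicative rather than as a stable measurement.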
