# Day 3 — Inference Physics (NIM-first)

**Goal:** Build intuition for throughput, tail latency, queueing, and why batching/concurrency matter.

**Target:** NIM gateway (OpenAI-compatible) at `NIM_BASE_URL` (default `http://localhost:8000`).

**Outputs (what you’ll see):**
- Latency distribution (p50/p95/p99)
- Latency vs concurrency curves
- Throughput vs concurrency curve
- Latency vs token budget curve

**Timebox:** 60–90 minutes.



In [None]:
# Setup + preflight

import os
import sys
import time
import math
from dataclasses import dataclass
from typing import Any
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np
import pandas as pd
import plotly.express as px
import requests

NIM_BASE_URL = os.environ.get("NIM_BASE_URL", "http://localhost:8000").rstrip("/")
NIM_CHAT_PATH = os.environ.get("NIM_CHAT_PATH", "/v1/chat/completions")
NIM_GEN_MODEL = os.environ.get("NIM_GEN_MODEL", "meta/llama-3.1-8b-instruct")

px.defaults.template = "plotly_white"

print("sys.executable:", sys.executable)
print("NIM_BASE_URL:", NIM_BASE_URL)
print("NIM_CHAT_PATH:", NIM_CHAT_PATH)
print("NIM_GEN_MODEL:", NIM_GEN_MODEL)



In [None]:
def nim_chat_once(*, prompt: str, max_tokens: int = 32, temperature: float = 0.0, timeout_s: float = 60.0) -> tuple[str, float]:
    """One OpenAI-style chat request to NIM; returns (text, latency_s)."""
    url = f"{NIM_BASE_URL}{NIM_CHAT_PATH}"
    payload = {
        "model": NIM_GEN_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": int(max_tokens),
        "temperature": float(temperature),
    }

    t0 = time.perf_counter()
    r = requests.post(url, headers={"Content-Type": "application/json"}, json=payload, timeout=float(timeout_s))
    dt = time.perf_counter() - t0
    r.raise_for_status()
    j = r.json()

    choices = j.get("choices") or []
    msg = (choices[0].get("message") if choices else {}) or {}
    content = msg.get("content")
    if content is None:
        content = (choices[0].get("text") if choices else "")

    return str(content or "").strip(), float(dt)


def nim_preflight() -> bool:
    print("\n=== NIM preflight ===")
    try:
        txt, dt = nim_chat_once(prompt="Reply with only: OK", max_tokens=4, temperature=0.0, timeout_s=20)
        print(f"✅ NIM reachable: {dt:.3f}s | sample={txt!r}")
        return True
    except Exception as e:
        print(f"❌ NIM not reachable at {NIM_BASE_URL}: {type(e).__name__}: {str(e)[:200]}")
        print("\nTo start local NIMs:")
        print("  cd fico")
        print("  export NGC_API_KEY=...   # needed to pull nvcr.io images")
        print("  ./scripts/start_nims.sh")
        print("\nThen re-run this cell.")
        return False


if not nim_preflight():
    raise RuntimeError("NIM preflight failed")



## Mental model (pretty short)

End-to-end latency is roughly:

\[
T = T_{queue} + T_{net} + T_{compute}
\]

- Under light load, **p50** is mostly \(T_{net} + T_{compute}\).
- Under contention, **p95/p99** are dominated by \(T_{queue}\) (waiting).
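
To see why waiting time takes over near saturation, here is a minimal sketch using the textbook M/M/1 queue (one server, Poisson arrivals, exponential service times). This is an idealized model, not a description of how NIM schedules requests, and the service rate `mu` is a made-up number chosen for illustration.

```python
def mm1_mean_time_in_system(lam: float, mu: float) -> float:
    """Mean time in system W = 1 / (mu - lam) for a stable M/M/1 queue.

    lam: arrival rate (req/s); mu: service rate (req/s).
    Returns infinity when the queue is unstable (lam >= mu).
    """
    if lam >= mu:
        return float("inf")
    return 1.0 / (mu - lam)


mu = 10.0  # hypothetical service rate, not a measured NIM value
for util in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    w = mm1_mean_time_in_system(util * mu, mu)
    print(f"utilization={util:.2f}  mean time in system={w:.2f}s")
```

The pure service time here is `1 / mu` = 0.1 s, so everything above that is queueing. The growth is gentle at low utilization and very steep in the last few percent before saturation.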

A useful intuition (Little’s Law):

\[
L = \lambda W
\]

- \(L\): average number of requests in the system
- \(\lambda\): throughput (req/s)
- \(W\): time-in-system (seconds)

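In a fixed-concurrency test, \(L\) is roughly the concurrency you set. While the server has headroom, raising \(L\) raises \(\lambda\) and \(W\) stays about flat. Once \(\lambda\) saturates, any further increase in \(L\) shows up entirely as higher \(W\) (latency).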
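
Little's Law can be turned into a back-of-the-envelope estimate for a fixed-concurrency client. The sketch below assumes an idealized server with a constant light-load latency and a hard throughput ceiling; `closed_loop_estimate` and both numbers are hypothetical, chosen only to make the arithmetic visible.

```python
def closed_loop_estimate(concurrency: int, base_latency_s: float, max_rps: float) -> tuple[float, float]:
    """Return (throughput_rps, latency_s) for an idealized closed-loop client.

    Below saturation, each of the `concurrency` workers completes one request
    every `base_latency_s`; above it, throughput is pinned at `max_rps`.
    Latency then follows from Little's Law: W = L / lambda.
    """
    throughput = min(concurrency / base_latency_s, max_rps)
    latency = concurrency / throughput
    return throughput, latency


# Hypothetical server: 0.5 s per request at light load, ceiling of 4 req/s.
for c in (1, 2, 4, 8, 16):
    tput, w = closed_loop_estimate(c, base_latency_s=0.5, max_rps=4.0)
    print(f"concurrency={c:>2}  throughput={tput:.1f} req/s  latency={w:.2f}s")
```

In this toy model, throughput stops growing at concurrency 2, and every doubling after that doubles latency without serving any more requests per second.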

Your goal is to find the **knee**: the concurrency level beyond which throughput stops improving while p95 keeps climbing.




## Baseline: single-request latency distribution

We send the same request repeatedly (sequentially) to estimate the **noise floor** before adding concurrency.
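
One caveat when reading tail percentiles from a short run: with only a dozen samples, p95 and p99 are interpolated between the two slowest requests, so a single outlier sets both. A small sketch with made-up latencies (not measured data) shows the effect:

```python
import numpy as np

# Made-up latencies in seconds with one slow outlier; not measured data.
samples = [0.50, 0.52, 0.51, 0.49, 0.53, 0.50, 0.52, 0.51, 0.50, 0.49, 0.52, 1.20]

# np.percentile uses linear interpolation between order statistics by default.
p50, p95, p99 = np.percentile(samples, [50, 95, 99])
print(f"n={len(samples)}  p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s  max={max(samples):.3f}s")
```

If you need stable p99 estimates, collect a few hundred samples or more and treat small-run tails as a rough indicator.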


In [None]:
def pct(xs: list[float], p: float) -> float:
    if not xs:
        return float("nan")
    return float(np.percentile(np.array(xs, dtype=np.float64), p))


def summarize_latencies(lat_s: list[float]) -> dict[str, float]:
    return {
        "n": float(len(lat_s)),
        "mean_s": float(np.mean(lat_s)) if lat_s else float("nan"),
        "p50_s": pct(lat_s, 50),
        "p95_s": pct(lat_s, 95),
        "p99_s": pct(lat_s, 99),
        "min_s": float(min(lat_s)) if lat_s else float("nan"),
        "max_s": float(max(lat_s)) if lat_s else float("nan"),
    }


BASE_PROMPT = "Summarize why batching affects throughput in one short paragraph."
BASE_MAX_TOKENS = 96
BASE_TEMP = 0.2



In [None]:
N_BASELINE = 12

lat = []
for _ in range(N_BASELINE):
    _, dt = nim_chat_once(prompt=BASE_PROMPT, max_tokens=BASE_MAX_TOKENS, temperature=BASE_TEMP, timeout_s=60)
    lat.append(float(dt))

display(pd.DataFrame([summarize_latencies(lat)]))

df_lat = pd.DataFrame({"latency_s": lat})

fig_h = px.histogram(df_lat, x="latency_s", nbins=12, title="Baseline latency histogram", labels={"latency_s": "seconds"})
fig_h.show()

fig_b = px.box(df_lat, y="latency_s", title="Baseline latency box plot", labels={"latency_s": "seconds"})
fig_b.show()



## Concurrency sweep: the p95 cliff

We’ll run a small load test by issuing many requests while limiting **in-flight concurrency**.

What to look for:
- Throughput rises, then saturates.
- p95/p99 often blow up near saturation.
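
One simple way to locate the saturation point from a sweep is to look for the first step where throughput gains become small. The helper below is a hypothetical sketch: `find_knee`, the 10% threshold, and the sample numbers are all assumptions for illustration, not measurements.

```python
def find_knee(levels: list[int], throughput_rps: list[float], min_gain: float = 0.10) -> int:
    """Return the last concurrency level that still earned a worthwhile gain.

    Walks the sweep in order and stops at the first step whose relative
    throughput improvement is below `min_gain` (10% by default).
    """
    for i in range(len(levels) - 1):
        prev, nxt = throughput_rps[i], throughput_rps[i + 1]
        if prev <= 0 or (nxt - prev) / prev < min_gain:
            return levels[i]
    return levels[-1]  # never flattened within the tested range


# Illustrative sweep only; substitute the numbers from your own run.
levels = [1, 2, 4, 8, 16]
tput = [2.0, 3.9, 7.1, 7.4, 7.5]
print("knee at concurrency:", find_knee(levels, tput))
```

A threshold like this is a heuristic. It is worth checking the p95 curve at the same level before settling on a cap.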



In [None]:
@dataclass
class RunResult:
    ok: bool
    latency_s: float
    error: str | None = None


def _sync_one(timeout_s: float) -> RunResult:
    t0 = time.perf_counter()
    try:
        nim_chat_once(prompt=BASE_PROMPT, max_tokens=BASE_MAX_TOKENS, temperature=BASE_TEMP, timeout_s=timeout_s)
        return RunResult(ok=True, latency_s=float(time.perf_counter() - t0), error=None)
    except Exception as e:
        return RunResult(ok=False, latency_s=float(time.perf_counter() - t0), error=f"{type(e).__name__}: {str(e)[:160]}")


def run_load(*, total_requests: int, concurrency: int, timeout_s: float) -> dict[str, Any]:
    """Threadpool-based loadgen: portable and works in notebooks."""
    total_requests = int(total_requests)
    concurrency = int(concurrency)

    t0 = time.perf_counter()
    results: list[RunResult] = []

    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        futs = [ex.submit(_sync_one, float(timeout_s)) for _ in range(total_requests)]
        for f in as_completed(futs):
            results.append(f.result())

    wall_s = float(time.perf_counter() - t0)
    lat_ok = [r.latency_s for r in results if r.ok]
    err = [r for r in results if not r.ok]

    # Count only successful requests (goodput); timed-out or failed requests would inflate the rate.
    throughput = (len(lat_ok) / wall_s) if wall_s > 0 else float("nan")
    err_rate = (len(err) / len(results)) if results else float("nan")

    return {
        "concurrency": concurrency,
        "total_requests": total_requests,
        "timeout_s": float(timeout_s),
        "wall_s": wall_s,
        "throughput_rps": float(throughput),
        "error_rate": float(err_rate),
        "p50_s": pct(lat_ok, 50),
        "p95_s": pct(lat_ok, 95),
        "p99_s": pct(lat_ok, 99),
        "mean_s": float(np.mean(lat_ok)) if lat_ok else float("nan"),
        "ok": int(len(lat_ok)),
        "err": int(len(err)),
        "sample_error": (err[0].error if err else None),
    }


CONCURRENCY_LEVELS = [1, 2, 4, 8, 16]
TOTAL_REQ = 80
TIMEOUT_S = 60.0

rows = [run_load(total_requests=TOTAL_REQ, concurrency=c, timeout_s=TIMEOUT_S) for c in CONCURRENCY_LEVELS]
df_c = pd.DataFrame(rows).sort_values("concurrency")
display(df_c)

fig_lat = px.line(
    df_c,
    x="concurrency",
    y=["p50_s", "p95_s", "p99_s"],
    markers=True,
    title="Latency vs concurrency (p50/p95/p99)",
    labels={"value": "seconds"},
)
fig_lat.show()

fig_tp = px.line(df_c, x="concurrency", y="throughput_rps", markers=True, title="Throughput vs concurrency", labels={"throughput_rps": "req/s"})
fig_tp.show()

fig_err = px.bar(df_c, x="concurrency", y="error_rate", title="Error rate vs concurrency")
fig_err.update_yaxes(range=[0, max(0.05, float(df_c["error_rate"].max()) * 1.2 if len(df_c) else 0.1)])
fig_err.show()



## Token budget: generation dominates

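Roughly, latency scales with:
- the number of output tokens actually generated (capped by `max_tokens`), because decoding produces one token at a time
- and, usually to a lesser degree, input size (prompt length), which is processed in a single prefill pass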

We’ll vary both and plot how p95 latency responds.
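
A useful first-order way to read such curves is a linear model: latency is roughly a fixed cost (network plus prefill) plus a per-output-token decode time. The sketch below fits that line with `np.polyfit`. The data points are illustrative, not measured, and the model ignores batching effects and early stops (if the model finishes before `max_tokens`, the real token count is lower than the cap).

```python
import numpy as np

# Illustrative (output_tokens, latency_s) pairs; replace with your own measurements.
tokens = np.array([32, 64, 128, 256], dtype=float)
latency = np.array([0.42, 0.74, 1.38, 2.66])

# Degree-1 fit: latency ~= per_token_s * tokens + fixed_s
per_token_s, fixed_s = np.polyfit(tokens, latency, 1)

print(f"fixed cost ~ {fixed_s * 1000:.0f} ms, decode ~ {per_token_s * 1000:.1f} ms/token")
print(f"predicted latency at 512 tokens ~ {fixed_s + per_token_s * 512:.2f} s")
```

If the server's response includes an OpenAI-style `usage.completion_tokens` field, using that count on the x-axis instead of `max_tokens` should give a cleaner fit.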



In [None]:
def make_prompt(chars: int) -> str:
    base = "Summarize the following text in one sentence.\n\n"
    filler = ("lorem ipsum ") * 5000
    return base + filler[: int(chars)]


def measure_p95(*, prompt: str, max_tokens: int, n: int = 6) -> dict[str, Any]:
    """Sequentially time n requests; with small n, p95 is close to the slowest sample."""
    lats = []
    for _ in range(int(n)):
        _, dt = nim_chat_once(prompt=prompt, max_tokens=int(max_tokens), temperature=0.2, timeout_s=60)
        lats.append(float(dt))
    return {
        "prompt_chars": int(len(prompt)),
        "max_tokens": int(max_tokens),
        "p95_s": pct(lats, 95),
        "p50_s": pct(lats, 50),
    }


PROMPT_SIZES = [200, 800, 2000]
MAX_TOKENS_GRID = [32, 64, 128, 256]

rows = []
for pc in PROMPT_SIZES:
    p = make_prompt(pc)
    for mt in MAX_TOKENS_GRID:
        rows.append(measure_p95(prompt=p, max_tokens=mt, n=6))

df_tok = pd.DataFrame(rows)
display(df_tok)

fig_tok = px.line(
    df_tok,
    x="max_tokens",
    y="p95_s",
    color=df_tok["prompt_chars"].astype(str),
    markers=True,
    title="p95 latency vs max_tokens (colored by input size)",
    labels={"p95_s": "p95 seconds", "color": "prompt_chars"},
)
fig_tok.show()

hm = df_tok.pivot_table(index="prompt_chars", columns="max_tokens", values="p95_s", aggfunc="mean")
fig_hm = px.imshow(hm, title="Heatmap: p95 latency (seconds)", labels={"x": "max_tokens", "y": "prompt_chars", "color": "p95 seconds"})
fig_hm.show()



## Backpressure + timeouts (the failure mode)

When you overload an inference server:
- p95/p99 get worse (queueing)
- then requests start timing out / failing

A practical rule: **cap concurrency** and set a timeout that matches your SLO.
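
Capping concurrency on the client side can be as simple as a semaphore that rejects work once the cap is reached, rather than letting it pile up. The class below is a minimal sketch: `ConcurrencyGate` is a hypothetical helper, not part of NIM or `requests`, and rejecting with `RuntimeError` is just one possible policy.

```python
import threading
from typing import Any, Callable


class ConcurrencyGate:
    """Cap in-flight calls; shed load instead of queueing when the cap is hit."""

    def __init__(self, max_in_flight: int, acquire_timeout_s: float = 0.0):
        self._sem = threading.BoundedSemaphore(max_in_flight)
        self._acquire_timeout_s = acquire_timeout_s

    def run(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Wait briefly for a slot if a timeout is configured, else fail fast.
        if self._acquire_timeout_s > 0:
            got = self._sem.acquire(timeout=self._acquire_timeout_s)
        else:
            got = self._sem.acquire(blocking=False)
        if not got:
            raise RuntimeError("overloaded: concurrency cap reached")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()


gate = ConcurrencyGate(max_in_flight=4)
print(gate.run(lambda: "admitted"))
```

In this notebook you could route calls through it with something like `gate.run(nim_chat_once, prompt=BASE_PROMPT, timeout_s=5.0)`, so the request timeout and the in-flight cap are both explicit.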



In [None]:
# Demonstrate overload by lowering timeout and increasing concurrency

OVERLOAD_TIMEOUT_S = 5.0
OVERLOAD_TOTAL_REQ = 60
OVERLOAD_CONCURRENCY = [4, 8, 16, 32]

rows = [run_load(total_requests=OVERLOAD_TOTAL_REQ, concurrency=c, timeout_s=OVERLOAD_TIMEOUT_S) for c in OVERLOAD_CONCURRENCY]
df_over = pd.DataFrame(rows).sort_values("concurrency")
display(df_over)

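fig = px.line(
    df_over,
    x="concurrency",
    y=["p95_s", "error_rate"],
    markers=True,
    title=f"Overload demo (timeout={OVERLOAD_TIMEOUT_S}s)",
    # The two series share one axis but use different units.
    labels={"value": "p95 (seconds) / error rate (fraction)"},
)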
fig.show()



## Practical tuning checklist (NIM / TensorRT-style serving)

- **Start with a baseline**: measure p50/p95 at concurrency=1.
- **Find the knee**: increase concurrency until throughput stops improving.
- **Protect tail latency**:
  - cap concurrency below the knee
  - set timeouts intentionally (match your SLO)
- **Control generation cost**:
  - reduce `max_tokens`
  - keep prompts short; avoid dumping huge context
- **Operationally**:
  - watch error rate
  - avoid blind retries under overload (use backoff)
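
For the last point, a common pattern is capped exponential backoff with jitter, so that clients do not all retry at the same instant. The function below is a minimal sketch: `retry_with_backoff` and its defaults are assumptions, and it retries on any exception for brevity, whereas real code would usually retry only timeouts and overload-type errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn(); on failure wait a random time up to a doubling, capped delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng() * cap)  # "full jitter": uniform in [0, cap)
    raise ValueError("max_attempts must be >= 1")


# Demo with a fake sleep so it runs instantly.
waits: list[float] = []
print(retry_with_backoff(lambda: "ok", sleep=waits.append), "after", len(waits), "retries")
```

Keep `max_attempts` small under overload. Every retry is extra load on a server that is already behind.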



## Optional appendix: mapping the same ideas to vLLM (concepts only)

Even though we used NIM here, the same “physics” shows up in any serving stack:
- **Batching** improves GPU utilization (higher throughput) but can add queueing delay.
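- **KV-cache** memory grows with every token in a sequence, so long prompts and long generations limit how many requests fit in a batch; decoding is also sequential, so output tokens usually dominate latency.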
- **Scheduling** decisions move you along the latency/throughput frontier.

If you later benchmark vLLM, you’ll typically see the same p95 cliff as you approach saturation.
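
To make the KV-cache point concrete, here is a rough sizing sketch. The formula (keys and values stored for every layer) is standard for transformer decoders. The model shape below is an assumption for an 8B Llama-style model with grouped-query attention, so check your model's config before relying on the numbers.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that one token occupies: K and V for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache entries.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2)
seq_len = 4096

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_token * seq_len / 2**20:.0f} MiB for one {seq_len}-token sequence")
```

Multiply the per-sequence figure by the number of concurrent sequences to see how quickly long contexts eat into the memory available for batching.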

