### Promptimizer Lift (Definitions)

**Goal:** we want to know whether rewriting a prompt with the *promptimizer* makes model answers score better (according to the judge).

**Definition of $\Delta_i$:** for prompt $i$, $\Delta_i$ is the *change in the average judge score* (averaged across the candidate models) when the models answer the promptimizer-rewritten prompt instead of the original prompt.
- If $\Delta_i = +0.4$, that means the promptimizer version scored 0.4 points higher *on average across the models*.
- If $\Delta_i = -0.4$, that means it scored 0.4 points lower on average.

### The math (same idea, written precisely)

$$
\hspace{-2.5em}{\LARGE\begin{aligned}
\Delta_i &= \bar y_i^{\mathrm{opt}} - \bar y_i^{\mathrm{base}}
\end{aligned}}
$$

$$
\hspace{-2.5em}{\LARGE\begin{aligned}
\bar y_i^{\mathrm{opt}} &= \frac{1}{J}\sum_{j=1}^{J} y_{ij}^{\mathrm{opt}} \\
\bar y_i^{\mathrm{base}} &= \frac{1}{J}\sum_{j=1}^{J} y_{ij}^{\mathrm{base}}
\end{aligned}}
$$
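
Because both averages run over the same $J$ models, the lift can also be read as the average per-model change. This is a small rearrangement of the two definitions above, and it assumes every model has a judge score under both conditions for prompt $i$:

$$
\Delta_i
= \frac{1}{J}\sum_{j=1}^{J} y_{ij}^{\mathrm{opt}} - \frac{1}{J}\sum_{j=1}^{J} y_{ij}^{\mathrm{base}}
= \frac{1}{J}\sum_{j=1}^{J}\left(y_{ij}^{\mathrm{opt}} - y_{ij}^{\mathrm{base}}\right)
$$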

### What each symbol means (easy terms)
- **Prompt**: the question/request we ask the models (example: “Explain statelessness in one sentence.”).
- **Model**: one of the candidate AIs we’re comparing (e.g., llama vs gemma).
- **Judge score** ($y$): a rating from the judge model (for example 1–5) that measures how good an answer is for the original user request (higher is better).
- **Baseline** (“base”): using the original prompt with no promptimizer.
- **Optimized** (“opt”): using the prompt after the promptimizer rewrites it.
- $i \in \{1,\ldots,N\}$: which prompt we’re talking about (prompt #1, prompt #2, …).
- $j \in \{1,\ldots,J\}$: which model we’re talking about (model #1, model #2, …).
- $J$: how many models are in the comparison “basket”.
- $y_{ij}^{\mathrm{base}}$: the judge score for **prompt $i$** answered by **model $j$** using the **baseline** prompt.
- $y_{ij}^{\mathrm{opt}}$: the judge score for **prompt $i$** answered by **model $j$** using the **optimized** (promptimizer-rewritten) prompt.
- $\bar y_i^{\mathrm{base}}$: the *average* baseline judge score across all $J$ models for prompt $i$.
- $\bar y_i^{\mathrm{opt}}$: the *average* optimized judge score across all $J$ models for prompt $i$.
- $\Delta_i$ (“lift”): **optimized average minus baseline average** for prompt $i$ (how much the promptimizer helps or hurts).
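
To make the indexing concrete, here is a minimal sketch that stores the scores as two $N \times J$ tables (rows are prompts $i$, columns are models $j$) and returns one $\Delta_i$ per prompt. The function name `prompt_lifts` and the score values are hypothetical, chosen only for illustration; they are not part of the pipeline.

```python
from statistics import mean


def prompt_lifts(base: list[list[float]], opt: list[list[float]]) -> list[float]:
    """Return [Δ_1, ..., Δ_N] from two N×J score tables (rows = prompts, columns = models)."""
    if len(base) != len(opt):
        raise ValueError("base and opt must cover the same prompts")
    lifts = []
    for base_row, opt_row in zip(base, opt):
        # Each prompt needs scores from the same non-empty set of J models.
        if not base_row or len(base_row) != len(opt_row):
            raise ValueError("each prompt needs the same non-empty set of models")
        # Δ_i = ȳ_i^opt - ȳ_i^base
        lifts.append(mean(opt_row) - mean(base_row))
    return lifts


# Hypothetical judge scores: N = 2 prompts, J = 2 models.
y_base = [[3.0, 4.0], [2.0, 5.0]]
y_opt = [[4.0, 4.0], [2.0, 4.0]]

lifts = prompt_lifts(y_base, y_opt)
print(lifts)  # [0.5, -0.5]
```

In this made-up data the promptimizer helps on the first prompt ($\Delta_1 = +0.5$) and hurts on the second ($\Delta_2 = -0.5$).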

### Interpretation
- $\Delta_i > 0$: promptimizer improved the score (by $\Delta_i$ points, on average across models).
- $\Delta_i < 0$: promptimizer hurt the score.
- $\Delta_i = 0$: no change.

**Tiny example:** if the average baseline score is $\bar y_i^{\mathrm{base}}=3.2$ and the average promptimized score is $\bar y_i^{\mathrm{opt}}=3.6$, then $\Delta_i=3.6-3.2=0.4$ (a +0.4 lift).

In [None]:
# setup
import asyncio
import json
import aiohttp

# Pipeline helpers from metrics.swap_testing
from metrics.swap_testing import promptimizer, call_all_models, send_judge, CANDIDATE_KEYS


async def run_one_prompt(user_prompt: str) -> dict:
    """Baseline vs promptimized for one prompt; returns judge JSONs."""
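    # All client-side timeouts are disabled, so a slow model or judge call is never cut off
    # (and a hung call will block indefinitely).
    timeout = aiohttp.ClientTimeout(total=None, connect=None, sock_read=None, sock_connect=None)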
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # Baseline: candidates answer the original prompt
        baseline_answers = await call_all_models(session, user_prompt)

        # Promptimizer: rewrite prompt, then candidates answer the rewritten prompt
        optimized_prompt = await promptimizer(session, user_prompt)
        optimized_answers = await call_all_models(session, optimized_prompt)

        # Judge both sets against the ORIGINAL user prompt (fair comparison)
        baseline_judge = json.loads(await send_judge(session, user_prompt, baseline_answers))
        optimized_judge = json.loads(await send_judge(session, user_prompt, optimized_answers))

        return {
            "user_prompt": user_prompt,
            "optimized_prompt": optimized_prompt,
            "candidate_models": list(CANDIDATE_KEYS),
            "baseline": baseline_judge,
            "optimized": optimized_judge,
        }


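# A notebook kernel already runs an event loop, so asyncio.run() would raise RuntimeError here.
# Use top-level await instead; in a plain script, use asyncio.run(run_one_prompt(...)).
result = await run_one_prompt("Explain statelessness in one sentence.")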
print("Candidates:", result["candidate_models"])
print("Optimized prompt:\n", result["optimized_prompt"], "\n")
print("Baseline overall winner:", result["baseline"].get("overall_winner_model"))
print("Optimized overall winner:", result["optimized"].get("overall_winner_model"))
print("Baseline avg scores:", result["baseline"].get("avg_scores"))
print("Optimized avg scores:", result["optimized"].get("avg_scores"))

In [None]:
from statistics import mean

# This cell assumes Cell 2 already ran and created `result`.
baseline_avg_scores = result["baseline"].get("avg_scores") or {}
optimized_avg_scores = result["optimized"].get("avg_scores") or {}

# Convert dict values -> list of floats (skip anything non-numeric).
def _to_floats(d: dict) -> list[float]:
    out: list[float] = []
    for v in d.values():
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            # float() raises these for None, non-numeric strings, nested objects, etc.
            continue
    return out

base_vals = _to_floats(baseline_avg_scores)
opt_vals = _to_floats(optimized_avg_scores)

y_bar_base = mean(base_vals) if base_vals else float("nan")
y_bar_opt = mean(opt_vals) if opt_vals else float("nan")
delta_i = y_bar_opt - y_bar_base

print(f"ȳ_base = {y_bar_base:.3f}")
print(f"ȳ_opt  = {y_bar_opt:.3f}")
print(f"Δ_i    = {delta_i:+.3f}")

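if not base_vals or not opt_vals:
    # Without scores on both sides Δ_i is NaN; do not report that as "no change".
    print("Δ_i is undefined: no numeric avg_scores on at least one side.")
elif delta_i > 0:
    print("Promptimizer helped (Δ_i > 0).")
elif delta_i < 0:
    print("Promptimizer hurt (Δ_i < 0).")
else:
    print("No change (Δ_i = 0).")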

### What to compute (one row per prompt)
For each prompt $i$, compute the lift:
$$
\Delta_i = \bar y_i^{\mathrm{opt}} - \bar y_i^{\mathrm{base}}
$$

Your dataset is just one row per prompt:
- `prompt_id` (like 1, 2, 3, …)
- `delta` (this is $\Delta_i$)

Then answer the only question we care about:
- **Does promptimizer help on average?** (is mean($\Delta_i$) > 0?)

We’ll show:
1) A single bar for mean($\Delta_i$) with a 95% CI error bar
2) A one-sample t-test of $\Delta_i$ vs 0
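
For reference, both quantities come from the same three summary numbers. The notation $\bar\Delta$ (mean of the $\Delta_i$) and $s_\Delta$ (their sample standard deviation) is introduced here for convenience. The test treats the $N$ prompts as independent and the $\Delta_i$ as roughly normally distributed, which is only an approximation for a handful of prompts:

$$
\bar\Delta = \frac{1}{N}\sum_{i=1}^{N}\Delta_i,
\qquad
s_\Delta = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\Delta_i - \bar\Delta\right)^2}
$$

$$
t = \frac{\bar\Delta}{s_\Delta/\sqrt{N}},
\qquad
\bar\Delta \;\pm\; t_{0.975,\,N-1}\,\frac{s_\Delta}{\sqrt{N}}
$$

The first expression is the one-sample $t$ statistic against 0 (with $N-1$ degrees of freedom); the second is the 95% confidence interval drawn as the error bar.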

In [None]:
import asyncio
import math
from statistics import mean, stdev

# Run Cell 2 first (defines run_one_prompt).

# Prompts to test (keep this list short while iterating).
PROMPTS = [
    {"prompt_id": 1, "text": "Explain statelessness in one sentence."},
    {"prompt_id": 2, "text": "Write a Python function that deduplicates a list while preserving order."},
    {"prompt_id": 3, "text": "Give a concise checklist for debugging an async Python deadlock."},
    {"prompt_id": 4, "text": "Summarize pros/cons of k3s vs k8s for a small homelab."},
    {"prompt_id": 5, "text": "Design a short logging strategy for an aiohttp service."},
]


def _mean_ci_95(values: list[float]) -> tuple[float, float, float]:
    """Returns (mean, ci_low, ci_high). Uses t critical if SciPy is available, else normal approx."""
    n = len(values)
    m = mean(values)
    if n < 2:
        return (m, m, m)
    s = stdev(values)
    se = s / math.sqrt(n)
    df = n - 1
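    try:
        from scipy import stats  # type: ignore
        tcrit = float(stats.t.ppf(0.975, df))
    except ImportError:
        # Normal approximation. It is too narrow for small n:
        # with df = 4 the t critical value is about 2.776, not 1.96.
        tcrit = 1.96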
    half = tcrit * se
    return (m, m - half, m + half)


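def _extract_prompt_delta(judge_obj_base: dict, judge_obj_opt: dict) -> float:
    """Computes Δ_i = mean(optimized avg_scores) - mean(baseline avg_scores) across models."""
    base_scores = judge_obj_base.get("avg_scores") or {}
    opt_scores = judge_obj_opt.get("avg_scores") or {}
    if not base_scores or not opt_scores:
        # statistics.mean() raises StatisticsError on empty input; fail with a clearer message.
        raise ValueError("judge output has no avg_scores; cannot compute Δ_i")
    base_vals = [float(v) for v in base_scores.values()]
    opt_vals = [float(v) for v in opt_scores.values()]
    return mean(opt_vals) - mean(base_vals)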


delta_rows: list[dict] = []
for p in PROMPTS:
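    # Top-level await works in Jupyter/IPython; asyncio.run() raises RuntimeError
    # when the kernel's event loop is already running.
    r = await run_one_prompt(p["text"])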
    delta_i = _extract_prompt_delta(r["baseline"], r["optimized"])
    delta_rows.append({"prompt_id": p["prompt_id"], "delta": float(delta_i)})

deltas = [row["delta"] for row in delta_rows]
delta_mean, delta_lo, delta_hi = _mean_ci_95(deltas)

print("delta_rows:")
for row in delta_rows:
    print(row)
print()
print(f"Mean Δ: {delta_mean:+.3f} (95% CI [{delta_lo:+.3f}, {delta_hi:+.3f}])")

# ---- Plot: overall effect (single bar) ----
try:
    import matplotlib.pyplot as plt  # type: ignore
except Exception:
    plt = None
    print("\n(matplotlib not available; skipping plot)\nInstall with: pip install matplotlib")

if plt is not None:
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.bar(["mean(Δ)"], [delta_mean], yerr=[[delta_mean - delta_lo], [delta_hi - delta_mean]], capsize=8)
    ax.axhline(0, linewidth=1)
    ax.set_ylabel("Δ (promptimized - baseline)")
    ax.set_title("Does promptimizer help on average?")
    fig.tight_layout()
    plt.show()

In [None]:
import math
from statistics import mean, stdev

# One-sample test on Δ_i values: H0: E[Δ_i] = 0
# Assumes the previous cell created: deltas (list[float])
n = len(deltas)
df = n - 1

delta_mean = mean(deltas)
delta_sd = stdev(deltas) if n > 1 else float("nan")
se = (delta_sd / math.sqrt(n)) if n > 1 else float("nan")
t_stat = (delta_mean / se) if n > 1 and se != 0 else float("nan")

used = "scipy"
try:
    from scipy import stats  # type: ignore
    test = stats.ttest_1samp(deltas, popmean=0.0)
    p_value = float(test.pvalue)
    tcrit = float(stats.t.ppf(0.975, df))
except ImportError:
    # SciPy is not installed: fall back to a normal approximation.
    # For small N this understates the p-value and the CI width.
    used = "normal_approx"
    z = abs(t_stat)
    p_value = float(math.erfc(z / math.sqrt(2)))  # two-sided p-value, 2 * (1 - Phi(z))
    tcrit = 1.96

ci_low = delta_mean - tcrit * se if n > 1 else float("nan")
ci_high = delta_mean + tcrit * se if n > 1 else float("nan")

print("=== One-sample t-test on Δ_i vs 0 ===")
print(f"N = {n}, df = {df}")
print(f"mean(Δ) = {delta_mean:+.4f}")
print(f"t = {t_stat:+.4f}")
print(f"p-value = {p_value:.6g}  (computed using: {used})")
print(f"95% CI for mean(Δ): [{ci_low:+.4f}, {ci_high:+.4f}]")