# Week 01 — Data Cleaning & Coverage Audit (AIME)

**What**: Cleaned AIME outputs; normalized key fields into a consistent view (year/variant/problem); and applied a strict quality gate (avg_neg_logprob > 0, total_neg_logprob > 0, non-empty token_neg_logprobs).

**Why**: Ensure a trustworthy, uniform dataset before hardness modeling.

**Results**: For every problem that exists in the source dataset, we now have ≥ 8 quality-passing generations per (year, variant, problem) cell (cap=8). Reasoning length (token count) emerges as a stable, informative signal; correct solutions tend to be shorter/more focused than incorrect ones. Logprobs are used for QC/diagnostics, not as modeling features.

**Implication**: Clean, uniformly covered data + a minimal logprob gate ⇒ reliable comparisons and a simple, strong baseline feature (reasoning length) for the first hardness models.

**Next**: Add a second small model, optionally increase generations, and build first baselines with reasoning length plus a few word-based cues.

This week I cleaned the AIME generations and enforced a strict quality gate (non-empty token logprobs with positive averages/totals). After capping at 8 gens per year–problem cell, I rebuilt the plots directly from the filtered CSVs. The main signal that’s shaping up is response length (in tokens): correct solutions tend to be shorter/more focused than incorrect ones, which is promising for simple heuristic baselines and for a classifier feature set. Logprobs were useful for QC and filtering, but the length distribution is the more actionable metric here. I also verified coverage (AIME I pre-2000, I+II from 2000 on) and produced two rerun lists: (a) filesystem-based completeness, and (b) a dataset-aware list against di-zhang-fdu/AIME_1983_2024. Net-net: the pipeline is reproducible, coverage is where we expect it, and we have a reliable length feature to carry forward.

## 1) Load weekly report & sanity checks (counts, coverage, exclusions) 

In [14]:
# === 1) Load strict report & quick sanity checks (cap = 8 per (year, problem)) ===
# Works even if pandas/matplotlib aren't installed in this Jupyter env.

import os, csv, math
from collections import Counter, defaultdict

REPORT_DIR = "../results/week01_qstrict_cap8"

# --- Optional deps (nice to have, but not required) ---
try:
    import pandas as pd
except Exception:
    pd = None

try:
    import matplotlib.pyplot as plt
except Exception:
    plt = None

from IPython.display import display, Image, Markdown

def load_csv(path):
    if not os.path.isfile(path):
        print(f"[WARN] Missing CSV: {path}")
        return None
    if pd is not None:
        try:
            return pd.read_csv(path)
        except Exception:
            pass
    # csv.DictReader fallback
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# ---- Load data ----
per_record_path = os.path.join(REPORT_DIR, "per_record_stats.csv")
by_corr_path    = os.path.join(REPORT_DIR, "by_correctness_summary.csv")
coverage_path   = os.path.join(REPORT_DIR, "coverage_by_cell.csv")
excl_path       = os.path.join(REPORT_DIR, "excluded_by_reason.csv")

per_record = load_csv(per_record_path)
by_corr    = load_csv(by_corr_path)
coverage   = load_csv(coverage_path)
excluded   = load_csv(excl_path)

# ---- Quick summary ----
def as_rows(obj):
    # normalize to list-of-dicts for simple fallbacks
    if obj is None: return []
    if pd is not None and hasattr(obj, "to_dict"):
        return obj.to_dict(orient="records")
    return obj

pr_rows = as_rows(per_record)

total_selected = len(pr_rows)
num_correct = sum(1 for r in pr_rows if str(r.get("correct")).lower() == "true")
num_incorrect = sum(1 for r in pr_rows if str(r.get("correct")).lower() == "false")
num_unknown = total_selected - num_correct - num_incorrect
acc = (num_correct / max(1, (num_correct + num_incorrect))) * 100.0

display(Markdown(f"""
### Report: `{REPORT_DIR}`
- Selected records: **{total_selected:,}**
- Correct: **{num_correct:,}**
- Incorrect: **{num_incorrect:,}**
- Unknown: **{num_unknown:,}**
- Accuracy on selected: **{acc:.2f}%**
"""))

# ---- Show already-rendered PNGs if matplotlib is missing in this notebook ----
pngs = [
    "resp_len_boxplot.png",
    "tok_len_boxplot.png",
    "mean_nlp_hist_correct.png",
    "mean_nlp_hist_incorrect.png",
]
pngs = [p for p in pngs if os.path.isfile(os.path.join(REPORT_DIR, p))]

if not pngs and plt is None:
    print("[INFO] No PNGs found and matplotlib not available in this environment.")

if pngs and plt is None:
    display(Markdown("#### Rendered figures from the container"))
    for p in pngs:
        display(Image(filename=os.path.join(REPORT_DIR, p)))

# ---- If matplotlib is available here, recreate a simple plot from CSV ----
if plt is not None and pr_rows:
    # Convert types
    def to_float(x):
        try:
            if isinstance(x, bool): return math.nan
            return float(x)
        except Exception:
            return math.nan

    tok_len = [to_float(r.get("tok_len")) for r in pr_rows]
    is_corr = [str(r.get("correct")).lower() == "true" for r in pr_rows]
    is_inc  = [str(r.get("correct")).lower() == "false" for r in pr_rows]

    tok_corr = [t for t, c in zip(tok_len, is_corr) if not math.isnan(t) and c]
    tok_inc  = [t for t, c in zip(tok_len, is_inc)  if not math.isnan(t) and c]

    if tok_corr or tok_inc:
        plt.figure()
        data, labels = [], []
        if tok_corr: data.append(tok_corr); labels.append("Correct")
        if tok_inc:  data.append(tok_inc);  labels.append("Incorrect")
        # deprecation fix: labels -> tick_labels
        plt.boxplot(data, tick_labels=labels, showfliers=False)
        plt.ylabel("Response length (tokens)")
        plt.title("Token Length by Correctness (quality-strict, cap=8)")
        plt.tight_layout()
        plt.show()

# ---- Coverage check: which (year, problem) cells are under the cap of 8? ----
cov_rows = as_rows(coverage)
if cov_rows:
    underfilled = []
    for r in cov_rows:
        try:
            y = int(r.get("year"))
            p = int(r.get("problem"))
            sel = int(r.get("selected"))
            if sel < 8:
                underfilled.append((y, p, sel))
        except Exception:
            pass
    underfilled.sort()
    if underfilled:
        display(Markdown("#### Cells with fewer than 8 selected generations"))
        head = "\n".join(f"- {y} / problem {p}: selected {sel}" for (y,p,sel) in underfilled[:25])
        display(Markdown(head))
    else:
        display(Markdown("✅ Every (year, problem) cell has at least 8 selected generations."))

# ---- Exclusion reasons (why records were dropped) ----
ex_rows = as_rows(excluded)
if ex_rows:
    display(Markdown("#### Exclusion reasons (counts)"))
    if pd is not None:
        display(pd.DataFrame(ex_rows))
    else:
        for r in ex_rows:
            print(f"{r.get('reason')}: {r.get('count')}")



### Report: `../results/week01_qstrict_cap8`
- Selected records: **7,744**
- Correct: **2,804**
- Incorrect: **4,818**
- Unknown: **122**
- Accuracy on selected: **36.79%**


✅ Every (year, problem) cell has at least 8 selected generations.

#### Exclusion reasons (counts)

Unnamed: 0,reason,count
0,failed_quality_gate,27952
1,hit_cell_cap,4864


Figures will be saved under `figures/week01`.

## 2) Generate figures (token length & mean −log p) → ../figures/week01

In [6]:
# ==== Params =========================================================
REPORT_DIR = "../reports/week01_qstrict_cap8"   # folder produced by nb.py
FIG_TAG    = "week01"                        # subfolder for saved figures
# ====================================================================

import os, sys, importlib, subprocess

# Auto-install if missing (keeps the cell self-contained)
for pkg in ["matplotlib", "pandas"]:
    try:
        importlib.import_module(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# Safe headless plotting
import matplotlib
if "DISPLAY" not in os.environ:
    matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

FIG_DIR = os.path.join("..","figures", FIG_TAG)
os.makedirs(FIG_DIR, exist_ok=True)

def _read_csv(path):
    if os.path.exists(path):
        return pd.read_csv(path)
    print(f"[WARN] Missing CSV: {path}")
    return None

per_record = _read_csv(os.path.join(REPORT_DIR, "per_record_stats.csv"))
by_corr    = _read_csv(os.path.join(REPORT_DIR, "by_correctness_summary.csv"))
coverage   = _read_csv(os.path.join(REPORT_DIR, "coverage_by_cell.csv"))

print("Loaded:",
      f"\n  per_record: {0 if per_record is None else len(per_record)} rows",
      f"\n  by_correctness_summary: {0 if by_corr is None else len(by_corr)} rows",
      f"\n  coverage_by_cell: {0 if coverage is None else len(coverage)} rows")

# --- Fig 1: Token length boxplot (Correct vs Incorrect) ---
if per_record is not None and {"tok_len","correct"}.issubset(per_record.columns):
    df = per_record.dropna(subset=["tok_len","correct"])
    data, labels = [], []
    if not df[df["correct"] == True].empty:
        data.append(df[df["correct"] == True]["tok_len"].tolist()); labels.append("Correct")
    if not df[df["correct"] == False].empty:
        data.append(df[df["correct"] == False]["tok_len"].tolist()); labels.append("Incorrect")
    if data:
        plt.figure()
        plt.boxplot(data, labels=labels, showfliers=False)
        plt.ylabel("Response length (tokens)")
        plt.title("Token Length by Correctness")
        plt.tight_layout()
        out1 = os.path.join(FIG_DIR, "01_tok_len_boxplot.png")
        plt.savefig(out1, dpi=150); plt.show()
        print("Saved:", out1)

# --- Fig 2a/b: Mean negative logprob histograms ---
if per_record is not None and {"mean_negative_logprob","correct"}.issubset(per_record.columns):
    d = per_record.dropna(subset=["mean_negative_logprob","correct"])
    dc  = d[d["correct"] == True]
    dic = d[d["correct"] == False]
    if not dc.empty:
        plt.figure(); plt.hist(dc["mean_negative_logprob"].tolist(), bins=40)
        plt.xlabel("Mean negative logprob"); plt.ylabel("Count"); plt.title("Mean -log p — Correct")
        plt.tight_layout()
        out2a = os.path.join(FIG_DIR, "02_mean_nlp_hist_correct.png")
        plt.savefig(out2a, dpi=150); plt.show(); print("Saved:", out2a)
    if not dic.empty:
        plt.figure(); plt.hist(dic["mean_negative_logprob"].tolist(), bins=40)
        plt.xlabel("Mean negative logprob"); plt.ylabel("Count"); plt.title("Mean -log p — Incorrect")
        plt.tight_layout()
        out2b = os.path.join(FIG_DIR, "03_mean_nlp_hist_incorrect.png")
        plt.savefig(out2b, dpi=150); plt.show(); print("Saved:", out2b)

print("Done.")


Loaded: 
  per_record: 7744 rows 
  by_correctness_summary: 4 rows 
  coverage_by_cell: 596 rows
Saved: ../figures/week01/01_tok_len_boxplot.png


  plt.boxplot(data, labels=labels, showfliers=False)


Saved: ../figures/week01/02_mean_nlp_hist_correct.png
Saved: ../figures/week01/03_mean_nlp_hist_incorrect.png
Done.


## 3. Coverage audit w/ AIME variant (I/II): underfilled cells + heatmaps

Point: sanity-check that coverage is complete per AIME variant (I vs II). It verifies that, after your quality filter, you still have ≥ 8 generations for every (year, variant, problem), and it visualizes that coverage.

What it catches: mis-labeled paths (e.g., an “II” folder missed), holes that were hidden when variants were merged, or cells that slipped below 8 after filtering.

In [9]:
import os, re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pick one of the reports you generated inside the container:
# "reports/scan_results_only_qstrict_cap8"  (1983–2009)
# "reports/scan_exp_only_qstrict_cap8"      (2010–2024)
#REPORT_DIR = "../reports/scan_exp_only_qstrict_cap8"
REPORT_DIR = "../reports/scan_results_only_qstrict_cap8"

pr_path = os.path.join(REPORT_DIR, "per_record_stats.csv")
pr = pd.read_csv(pr_path)

def parse_variant_from_path(path, year):
    p = str(path).upper()
    if "/II/" in p or p.endswith("/II"): return "II"
    if "/I/"  in p or p.endswith("/I"):  return "I"
    # Pre-2000 AIME only has I
    try:
        if int(year) <= 1999: return "I"
    except Exception:
        pass
    return None  # unknown; will show as underfilled

pr["variant"] = [parse_variant_from_path(p, y) for p, y in zip(pr["path"], pr["year"])]

# Coverage with variant
cov = (pr.groupby(["year","variant","problem"])
         .size()
         .reset_index(name="selected"))

# Quick check: cells with < 8 selected
under = cov[cov["selected"] < 8].sort_values(["year","variant","problem"])
print("Underfilled (<8) cells:", len(under))
display(under.head(30))

# Heatmaps by variant
for v in sorted(cov["variant"].dropna().unique()):
    sub = cov[cov["variant"]==v]
    piv = sub.pivot_table(index="year", columns="problem", values="selected", fill_value=0).sort_index()
    Z = np.clip(piv.to_numpy(), 0, 8)
    plt.figure(figsize=(9,6))
    im = plt.imshow(Z, aspect="auto", origin="lower", cmap="viridis", vmin=0, vmax=8)
    plt.colorbar(im, label="# selected (cap=8)")
    plt.yticks(range(len(piv.index)), piv.index.tolist())
    plt.xticks(range(len(piv.columns)), piv.columns.tolist())
    plt.xlabel("Problem"); plt.ylabel("Year")
    plt.title(f"Coverage (selected per year/problem) — AIME {v}")
    plt.tight_layout(); plt.show()


Underfilled (<8) cells: 0


Unnamed: 0,year,variant,problem,selected


## 4) Filesystem audit (AIME I/II) + quality filter + rerun checklist — outputs only


Scope: checks generated files vs. the expected 1–15 grid.

Output: aime_rerun_todo_fs.csv.

In [10]:
# Filesystem coverage audit (AIME I/II) with quality filter and rerun checklist
import os, re, json, csv, math, collections

# --- Config: adjust if your notebook sits elsewhere ---
ROOTS = ["../results", "../experiment_archive"]   # both trees scanned recursively
QUALITY_STRICT = True                              # set False to see raw presence (ignores logprob quality)
CAP_PER_CELL = 8                                   # target gens per (year, variant, problem)

YEAR_RE   = re.compile(r'^(19[8-9]\d|20[0-2]\d)$')  # 1980–2029 safe range
VARIANTS  = {"I","II"}

def safe_json_load(path):
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            return json.load(f)
    except Exception:
        # try JSONL
        try:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                first = f.readline()
                return json.loads(first) if first.strip() else None
        except Exception:
            return None

def quality_ok(rec):
    if not QUALITY_STRICT:
        return True
    try:
        avg   = rec.get("avg_neg_logprob")
        total = rec.get("total_neg_logprob") or rec.get("sum_neg_logprob") or rec.get("neg_logprob_total")
        toks  = rec.get("token_neg_logprobs")
        if not (isinstance(avg,(int,float)) and avg > 0): return False
        if not (isinstance(total,(int,float)) and total > 0): return False
        if not (isinstance(toks, list) and len(toks) > 0):    return False
        return True
    except Exception:
        return False

def parse_cell_from_path(path):
    """
    Extract (year:int, variant:'I'|'II' or None, problem:int|None) from a path like:
    .../aime/.../<year>/<variant>/<problem>/.../*.json
    Robust to the deep experiment_archive timestamp prefix.
    """
    parts = os.path.normpath(path).split(os.sep)
    # walk parts; when we see a year, look ahead for variant & problem
    for i, part in enumerate(parts):
        if YEAR_RE.match(part):
            year = int(part)
            variant = None
            problem = None
            # variant right after year?
            if i+1 < len(parts) and parts[i+1] in VARIANTS:
                variant = parts[i+1]
                # problem right after variant?
                if i+2 < len(parts) and parts[i+2].isdigit():
                    p = int(parts[i+2])
                    if 1 <= p <= 15: problem = p
            else:
                # no explicit variant; for pre-2000 default to I if problem follows
                if year <= 1999:
                    if i+1 < len(parts) and parts[i+1].isdigit():
                        p = int(parts[i+1])
                        if 1 <= p <= 15:
                            variant = "I"
                            problem = p
            if year and (variant or year <= 1999) and problem:
                # if still no variant post-1999, leave None (we'll skip those)
                if variant is None and year <= 1999:
                    variant = "I"
                return year, variant, problem
    return None, None, None

# Scan filesystem
cells_total     = collections.defaultdict(int)      # raw file count per cell
cells_quality   = collections.defaultdict(int)      # quality-passing file count per cell
problems_by_yv  = collections.defaultdict(set)      # problems present (quality-passing) per (year,variant)
variants_by_y   = collections.defaultdict(set)      # variants seen per year (quality-passing)

scanned_files = 0
for root in ROOTS:
    if not os.path.isdir(root):
        print(f"[WARN] Missing root: {root}")
        continue
    for dirpath, _, files in os.walk(root):
        for fn in files:
            if not fn.lower().endswith((".json",".jsonl",".ndjson")):
                continue
            fpath = os.path.join(dirpath, fn)
            y, v, p = parse_cell_from_path(fpath)
            if y is None or p is None:
                continue
            if y >= 2000 and v not in VARIANTS:
                # post-1999 must explicitly have I or II in path
                continue
            if y <= 1999 and v is None:
                v = "I"
            scanned_files += 1
            cells_total[(y,v,p)] += 1
            rec = safe_json_load(fpath)
            if rec is None:
                continue
            if quality_ok(rec):
                cells_quality[(y,v,p)] += 1
                problems_by_yv[(y,v)].add(p)
                variants_by_y[y].add(v)

def vset_string(vset):
    return "+".join(sorted(vset)) if vset else "I"

# Summary table (quality-passing presence)
print("year  variants  problems_I  problems_II  expected  selected_rows(approx)")
for y in sorted({y for (y,_,_) in cells_total} | set(variants_by_y.keys())):
    vset = variants_by_y.get(y, set())
    probs_I  = len(problems_by_yv.get((y,"I"), set()))
    probs_II = len(problems_by_yv.get((y,"II"), set()))
    expected = 15 if y <= 1999 else 30
    approx_selected = (probs_I + probs_II) * CAP_PER_CELL
    print(f"{y:<5} {vset_string(vset):<8} {probs_I:<11} {probs_II:<11} {expected:<8} {approx_selected}")

# Missing problems (no quality-passing gen found)
missing = []
for y in sorted({y for (y,_,_) in cells_total} | set(variants_by_y.keys())):
    targets = ["I"] if y <= 1999 else ["I","II"]
    for v in targets:
        seen = problems_by_yv.get((y,v), set())
        expected_probs = set(range(1,16))
        miss = sorted(expected_probs - seen)
        if miss:
            missing.append((y, v, miss))

print("\n=== Missing problems (no quality-passing gens) ===")
if missing:
    for y, v, miss in missing[:120]:
        print(f"  {y} {v}: missing {miss}")
else:
    print("None")

# Underfilled (< CAP_PER_CELL) among quality-passing cells
underfilled = []
for (y,v,p), n in cells_quality.items():
    if n < CAP_PER_CELL:
        underfilled.append((y,v,p,n))

print("\n=== Underfilled cells (< {} quality-passing gens) ===".format(CAP_PER_CELL))
if underfilled:
    for y,v,p,n in sorted(underfilled)[:200]:
        print(f"  {y} {v} problem {p}: {n} gens")
else:
    print("None")

# Save rerun checklist (missing -> 8, underfilled -> top-up)
todo = []
for y, v, miss in missing:
    for p in miss:
        todo.append({"year": y, "variant": v, "problem": p, "needed": CAP_PER_CELL})
for y,v,p,n in underfilled:
    todo.append({"year": y, "variant": v, "problem": p, "needed": CAP_PER_CELL - n})

os.makedirs("../reports", exist_ok=True)
todo_path = "../reports/aime_rerun_todo_fs.csv"
with open(todo_path, "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["year","variant","problem","needed"])
    w.writeheader()
    for r in sorted(todo, key=lambda r:(r["year"], r["variant"], r["problem"])):
        w.writerow(r)

print(f"\nScanned files: {scanned_files}")
print(f"Saved rerun checklist -> {todo_path}")



year  variants  problems_I  problems_II  expected  selected_rows(approx)
1983  I        15          0           15       120
1984  I        14          0           15       112
1985  I        15          0           15       120
1986  I        14          0           15       112
1987  I        14          0           15       112
1988  I        10          0           15       80
1989  I        9           0           15       72
1990  I        12          0           15       96
1991  I        11          0           15       88
1992  I        15          0           15       120
1993  I        14          0           15       112
1994  I        9           0           15       72
1995  I        14          0           15       112
1996  I        14          0           15       112
1997  I        14          0           15       112
1998  I        15          0           15       120
1999  I        15          0           15       120
2000  I+II     15          12          30       

## 5) AIME dataset cross-check (expected vs. generated) + rerun checklist — detects dataset omissions

Scope: uses di-zhang-fdu/AIME_1983_2024 to define expected problems, then compares to outputs with the quality filter.

 Output: aime_rerun_todo_hf.csv.

In [11]:
# Audit coverage against HF dataset "di-zhang-fdu/AIME_1983_2024"
# and local outputs (quality-strict; cap=8)

import os, re, json, csv, collections, math

# ---- Config ----
ROOTS_OUTPUT = ["../results", "../experiment_archive"]  # your generated outputs
CAP_PER_CELL = 8
QUALITY_STRICT = True

# ---- Helpers ----
YEAR_RE = re.compile(r'^(19[8-9]\d|20[0-2]\d)$')  # 1980–2029
VARIANTS = {"I", "II"}  # restrict to I/II only

def safe_json_load(path):
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            # try JSON object
            return json.load(f)
    except Exception:
        # try first line JSONL
        try:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                line = f.readline()
                return json.loads(line) if line.strip() else None
        except Exception:
            return None

def parse_cell_from_path(path):
    """Extract (year:int, variant:'I'|'II', problem:int) from .../<year>/<I|II>/<1..15>/... (or pre-2000 /<year>/<1..15>/...)"""
    parts = os.path.normpath(path).split(os.sep)
    for i, part in enumerate(parts):
        if YEAR_RE.match(part):
            y = int(part)
            v = None
            p = None
            if i+1 < len(parts) and parts[i+1] in VARIANTS:
                v = parts[i+1]
                if i+2 < len(parts) and parts[i+2].isdigit():
                    pp = int(parts[i+2])
                    if 1 <= pp <= 15: p = pp
            else:
                # pre-2000: only AIME I, problems live directly under year/
                if y <= 1999 and i+1 < len(parts) and parts[i+1].isdigit():
                    pp = int(parts[i+1])
                    if 1 <= pp <= 15:
                        v = "I"; p = pp
            if y and v in VARIANTS and p:
                return y, v, p
    return None, None, None

def quality_ok(rec):
    if not QUALITY_STRICT:
        return True
    try:
        avg   = rec.get("avg_neg_logprob")
        total = rec.get("total_neg_logprob") or rec.get("sum_neg_logprob") or rec.get("neg_logprob_total")
        toks  = rec.get("token_neg_logprobs")
        return (isinstance(avg,(int,float)) and avg > 0 and
                isinstance(total,(int,float)) and total > 0 and
                isinstance(toks, list) and len(toks) > 0)
    except Exception:
        return False

# ---- 1) Build EXPECTED inventory from HF dataset ----
try:
    !pip install datasets
    from datasets import load_dataset
except Exception as e:
    raise RuntimeError("You need `pip install datasets` in this environment.") from e

ds = load_dataset("di-zhang-fdu/AIME_1983_2024")
splits = [ds[k] for k in ds.keys()]  # train/validation/etc., whatever exists

def infer_yvp(ex):
    """
    Robustly infer (year, variant, problem) from a dataset example.
    Tries common field names; falls back to parsing a title/source string.
    """
    # common explicit fields
    y = ex.get("year")
    v = ex.get("variant") or ex.get("exam") or ex.get("set")
    p = ex.get("problem_number") or ex.get("problem_index") or ex.get("question_number") or ex.get("index")

    # normalize types
    try: y = int(y) if y is not None and str(y).isdigit() else None
    except: y = None
    v = str(v).strip().upper() if isinstance(v, str) else None
    try: p = int(p) if p is not None and str(p).isdigit() else None
    except: p = None

    # If still missing, parse a composite string if present
    blob = None
    for k in ("source", "title", "name", "id"):
        if isinstance(ex.get(k), str) and ex[k]:
            blob = ex[k]
            break
    if blob:
        # e.g., "AIME 2006 II Problem 7"
        m = re.search(r"AIME\s+(19[8-9]\d|20[0-2]\d)\s+(I{1,3}|II)\b.*?(?:Problem\s+(\d{1,2}))?", blob, re.I)
        if m:
            y = y or int(m.group(1))
            v = v or m.group(2).upper().replace("III","II")  # sanitize weird "III"
            if not p and m.group(3): p = int(m.group(3))

    # defaults: before 2000 only I
    if y and y <= 1999 and not v:
        v = "I"

    # final guardrails
    if y and (v in VARIANTS) and p and 1 <= p <= 15:
        return y, v, p
    return None, None, None

expected = collections.defaultdict(set)  # (year, variant) -> {problems}
bad_rows = 0
for split in splits:
    for ex in split:
        y, v, p = infer_yvp(ex)
        if y and v and p:
            expected[(y, v)].add(p)
        else:
            bad_rows += 1

# ---- 2) Count SEEN quality-passing gens from local outputs (no cap) ----
cells_quality = collections.defaultdict(int)   # (y,v,p) -> count
scanned_files = 0
for root in ROOTS_OUTPUT:
    if not os.path.isdir(root): continue
    for dirpath, _, files in os.walk(root):
        for fn in files:
            if not fn.lower().endswith((".json",".jsonl",".ndjson")):
                continue
            fpath = os.path.join(dirpath, fn)
            y, v, p = parse_cell_from_path(fpath)
            if not (y and v and p): continue
            rec = safe_json_load(fpath)
            scanned_files += 1
            if rec is None: continue
            if quality_ok(rec):
                cells_quality[(y, v, p)] += 1

# ---- 3) Summaries ----
def vstr(vs): return "+".join(sorted(vs)) if vs else "I"

years = sorted({y for (y,_) in expected})
print("year  variants  problems_I  problems_II  expected_total")
for y in years:
    vset = {v for (yy,v) in expected if yy == y}
    pI   = len(expected.get((y,"I"), set()))
    pII  = len(expected.get((y,"II"), set()))
    print(f"{y:<5} {vstr(vset):<8} {pI:<11} {pII:<11} {pI + pII}")

# ---- 4) Missing/Underfilled relative to HF inventory ----
missing = []
underfilled = []
for (y,v), exp_probs in sorted(expected.items()):
    for p in sorted(exp_probs):
        n = cells_quality.get((y,v,p), 0)
        if n == 0:
            missing.append((y, v, p))
        elif n < CAP_PER_CELL:
            underfilled.append((y, v, p, n))

print("\n=== Missing (no quality-passing gens but present in HF dataset) ===")
if missing:
    for y,v,p in missing[:150]:
        print(f"  {y} {v} problem {p}")
else:
    print("None")

print("\n=== Underfilled (< {} quality-passing gens) ===".format(CAP_PER_CELL))
if underfilled:
    for y,v,p,n in underfilled[:150]:
        print(f"  {y} {v} problem {p}: {n}")
else:
    print("None")

# ---- 5) Rerun checklist (only what HF says should exist) ----
os.makedirs("../reports", exist_ok=True)
todo_path = "../reports/aime_rerun_todo_hf.csv"
with open(todo_path, "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["year","variant","problem","needed"])
    w.writeheader()
    for y,v,p in missing:
        w.writerow({"year": y, "variant": v, "problem": p, "needed": CAP_PER_CELL})
    for y,v,p,n in underfilled:
        w.writerow({"year": y, "variant": v, "problem": p, "needed": CAP_PER_CELL - n})

print(f"\nScanned files: {scanned_files}")
print(f"HF rows with unparsed y/v/p: {bad_rows}")
print(f"Saved rerun checklist -> {todo_path}")


Defaulting to user installation because normal site-packages is not writeable


  from .autonotebook import tqdm as notebook_tqdm


year  variants  problems_I  problems_II  expected_total

=== Missing (no quality-passing gens but present in HF dataset) ===
None

=== Underfilled (< 8 quality-passing gens) ===
None

Scanned files: 22904
HF rows with unparsed y/v/p: 933
Saved rerun checklist -> ../reports/aime_rerun_todo_hf.csv
