# QEPC NBA — Leakage-free Eoin totals backtest (fast, reproducible)

This notebook backtests **team points + game totals** using the Eoin Kaggle dataset and (optionally) the Kaggle long odds dataset.

Design goals:
- **Portable across machines**: no hardcoded `C:\Users\...` paths.
- **No leakage**: features are computed using **only games strictly before** the current game.
- **Fast**: single pass through games (no per-date rebuild loops).
- **Odds-aware** (optional): compares QEPC totals vs Vegas totals where available.


In [1]:
# --- Cell 1: Robust project-root bootstrap (portable; no hardcoded paths) ---
from __future__ import annotations

from pathlib import Path
import sys
import datetime as dt

import numpy as np
import pandas as pd

# Find the nearest parent directory that contains qepc/__init__.py
_cwd = Path.cwd().resolve()
PROJECT_ROOT = None
for p in [_cwd] + list(_cwd.parents):
    if (p / "qepc" / "__init__.py").exists():
        PROJECT_ROOT = p
        break

if PROJECT_ROOT is None:
    raise RuntimeError(
        f"Could not locate PROJECT_ROOT above: {_cwd}\n"
        "Expected to find: <PROJECT_ROOT>/qepc/__init__.py"
    )

# Put repo root on sys.path BEFORE importing qepc
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Now QEPC imports should work
from qepc.utils.paths import get_project_root

PROJECT_ROOT = get_project_root(PROJECT_ROOT)

CONFIG = {
    # Only use modern-era games to build strengths (matches QEPC defaults)
    "modern_cutoff": dt.date(2022, 10, 1),

    # Optional: set to dt.date(YYYY,MM,DD) to shorten run
    "backtest_start": None,
    "backtest_end": None,

    # Require some history before scoring a game (filters noisy early games).
    # History accumulates from modern_cutoff and is not reset each season.
    "min_games_per_team": 5,

    # Schedule adjustments (keep these aligned with matchups_eoin defaults)
    "home_bonus": 1.5,
    "away_penalty": 0.5,
    "b2b_penalty": 1.5,

    # Apply qepc/nba/calibration_eoin.py linear calibration
    "use_calibration": True,
}

print("PROJECT_ROOT:", PROJECT_ROOT)
print("python:", sys.executable)
print("cwd:", _cwd)
print("CONFIG:", CONFIG)


PROJECT_ROOT: C:\Users\wdorsey\qepc_project
python: C:\Users\wdorsey\AppData\Local\anaconda3_1\python.exe
cwd: C:\Users\wdorsey\qepc_project\notebooks\nba
CONFIG: {'modern_cutoff': datetime.date(2022, 10, 1), 'backtest_start': None, 'backtest_end': None, 'min_games_per_team': 5, 'home_bonus': 1.5, 'away_penalty': 0.5, 'b2b_penalty': 1.5, 'use_calibration': True}


## Load Eoin games (and attach odds if available)

This cell expects that you have already created:
- `cache/imports/eoin_games_qepc.parquet`

Odds are optional. If the odds CSV exists at:
- `data/raw/nba/odds_long/nba_2008-2025.csv`

…we'll join it to games to compare QEPC totals vs market totals.
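
The internals of `attach_odds_to_games` are not shown in this notebook. As a rough illustration of what such a join involves, the sketch below left-joins a toy odds table onto a toy games table and counts matches on both sides. The key columns and the data are assumptions for illustration only; the real loader may match on different keys.

```python
import pandas as pd

# Hypothetical mini-tables; the real loaders may use different key columns.
games_demo = pd.DataFrame({
    "game_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "home_team_id": [1, 2, 3],
    "away_team_id": [4, 5, 6],
})
odds_demo = pd.DataFrame({
    "game_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "home_team_id": [1, 3, 9],
    "away_team_id": [4, 6, 8],
    "total_points": [221.5, 226.0, 230.0],
})

keys = ["game_date", "home_team_id", "away_team_id"]

# Left join: keep every game; games without a line get NaN in the odds columns.
merged = games_demo.merge(odds_demo, on=keys, how="left", indicator=True)
matched_games = int((merged["_merge"] == "both").sum())

# Odds rows with no corresponding game (checked from the odds side).
odds_check = odds_demo.merge(games_demo, on=keys, how="left", indicator=True)
unmatched_odds = int((odds_check["_merge"] == "left_only").sum())

merged = merged.drop(columns="_merge")
print(f"matched {matched_games} of {len(games_demo)} games; "
      f"{unmatched_odds} odds rows unmatched")
```

A left join keeps the games table at its original length, so games outside the odds coverage simply carry missing values in the odds columns.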


In [2]:
from qepc.nba.eoin_data_source import load_eoin_games

# Optional odds join
HAS_ODDS = False
try:
    from qepc.nba.odds_long_loader import load_long_odds, attach_odds_to_games
    HAS_ODDS = True
except Exception as e:
    print("[warn] Odds modules not available; continuing without odds. Error:", repr(e))
    HAS_ODDS = False

games = load_eoin_games(PROJECT_ROOT).copy()

# Ensure clean date type
games["game_date"] = pd.to_datetime(games["game_date"]).dt.date

# Keep only final games
if "is_final" in games.columns:
    games = games[games["is_final"] == True].copy()

print("games rows:", len(games))
print("date range:", games["game_date"].min(), "→", games["game_date"].max())
print("columns sample:", [c for c in games.columns if c in [
    "game_id","game_date","home_team_id","away_team_id","home_score","away_score"
]])

# Attach odds if possible
if HAS_ODDS:
    try:
        odds = load_long_odds()  # uses default project paths
        games, diag = attach_odds_to_games(games, odds)
        print(f"[odds] attached. matched {diag.matched_rows} of {diag.total_games} games; "
              f"{diag.unmatched_odds} odds rows unmatched.")
    except Exception as e:
        print("[warn] Failed to attach odds; continuing without odds. Error:", repr(e))
        HAS_ODDS = False

games.head()


games rows: 72311
date range: 1946-11-26 → 2025-12-10
columns sample: ['game_id', 'home_team_id', 'away_team_id', 'home_score', 'away_score', 'game_date']
[odds] attached. matched 23113 of 72311 games; 5 odds rows unmatched.


Unnamed: 0,game_id,game_datetime,home_team_city,home_team_name,home_team_id,away_team_city,away_team_name,away_team_id,home_score,away_score,...,score_home,spread_home,spread_away,total_points,moneyline_away,moneyline_home,p_away,p_home,regular,playoffs
0,22501204,2025-12-10 17:00:00+00:00,Los Angeles,Lakers,1610612747,San Antonio,Spurs,1610612759,119,132,...,,,,,,,,,,
1,22501203,2025-12-10 14:30:00+00:00,Oklahoma City,Thunder,1610612760,Phoenix,Suns,1610612756,138,89,...,,,,,,,,,,
2,22501202,2025-12-09 15:30:00+00:00,Toronto,Raptors,1610612761,New York,Knicks,1610612752,101,117,...,,,,,,,,,,
3,22501201,2025-12-09 13:00:00+00:00,Orlando,Magic,1610612753,Miami,Heat,1610612748,117,108,...,,,,,,,,,,
4,22500366,2025-12-08 15:00:00+00:00,New Orleans,Pelicans,1610612740,San Antonio,Spurs,1610612759,132,135,...,,,,,,,,,,


## Leakage-free backtest (single pass)

We iterate games in chronological order and maintain **per-team running totals**.
For each game, we compute pre-game:
- `off_ppg` (points for / games played)
- `def_ppg` (points allowed / games played)
- back-to-back flags from each team's last game date
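
The same "prior games only" features can be illustrated on a toy table. This is a vectorized sketch with hypothetical column names (`team_id`, `pts_for`) on a one-row-per-team-per-game layout; it is not the loop used in this notebook, only a way to see what "strictly before" means.

```python
import pandas as pd

# Toy long-format log: one row per team per game, already in date order.
log = pd.DataFrame({
    "team_id": [1, 1, 1, 2, 2],
    "game_date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-04", "2024-01-01", "2024-01-05"]
    ),
    "pts_for": [100.0, 110.0, 120.0, 90.0, 95.0],
})

grp = log.groupby("team_id")

# Games played BEFORE each row (0 for a team's first game).
log["gp_prev"] = grp.cumcount()

# Cumulative points minus the current game = points scored strictly before it.
log["pf_prev"] = grp["pts_for"].cumsum() - log["pts_for"]

# Pre-game average; left as NaN until the team has at least one prior game.
log["off_ppg_prev"] = (log["pf_prev"] / log["gp_prev"]).where(log["gp_prev"] > 0)

# Back-to-back flag: the team's previous game was exactly one day earlier.
log["is_b2b"] = grp["game_date"].diff().dt.days.eq(1)

print(log)
```

Subtracting the current row from the cumulative sum plays the same role as updating the running state only after a game has been scored.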

Then we predict:
- `home_raw = (home_off + away_def)/2 + home_bonus - b2b_penalty(if home b2b)`
- `away_raw = (away_off + home_def)/2 - away_penalty - b2b_penalty(if away b2b)`
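
As a worked check of the two formulas, the sketch below plugs in the pre-game averages reported for game 22200011 in the backtest output, together with the `CONFIG` schedule values. The helper name `raw_team_points` is hypothetical and only mirrors the arithmetic; calibration is not applied here.

```python
def raw_team_points(off_ppg, opp_def_ppg, is_home, is_b2b,
                    home_bonus=1.5, away_penalty=0.5, b2b_penalty=1.5):
    """Average own offense with opponent defense, then apply schedule tweaks."""
    pts = (off_ppg + opp_def_ppg) / 2.0
    pts += home_bonus if is_home else -away_penalty
    if is_b2b:
        pts -= b2b_penalty
    return pts

# Pre-game averages for game 22200011 (neither team on a back-to-back).
home_off, home_def = 103.0, 113.8
away_off, away_def = 98.8, 116.6

home_raw = raw_team_points(home_off, away_def, is_home=True, is_b2b=False)
away_raw = raw_team_points(away_off, home_def, is_home=False, is_b2b=False)

print(round(home_raw, 1), round(away_raw, 1))  # 111.3 105.8
```

The home side works out to (103.0 + 116.6) / 2 + 1.5 = 111.3 and the away side to (98.8 + 113.8) / 2 - 0.5 = 105.8, both before calibration.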

Calibration (optional) uses `qepc/nba/calibration_eoin.py`.
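
The internals of `calibrate_team_totals` are not shown in this notebook. As a hedged sketch of what a linear calibration generally looks like, the snippet below fits a slope and intercept with `np.polyfit` on made-up numbers. The function names and the data are illustrative assumptions, not the contents of `calibration_eoin.py`.

```python
import numpy as np

def fit_linear_calibration(raw_pred, actual):
    """Least-squares fit of actual ~ slope * raw_pred + intercept."""
    slope, intercept = np.polyfit(
        np.asarray(raw_pred, dtype=float),
        np.asarray(actual, dtype=float),
        deg=1,
    )
    return float(slope), float(intercept)

def apply_linear_calibration(raw_pred, slope, intercept):
    """Map raw predictions onto the calibrated scale."""
    return slope * np.asarray(raw_pred, dtype=float) + intercept

# Made-up training data in which raw predictions run 4 points high.
raw_train = np.array([100.0, 105.0, 110.0, 115.0, 120.0])
actual_train = raw_train - 4.0

slope, intercept = fit_linear_calibration(raw_train, actual_train)
calibrated = apply_linear_calibration([111.3, 105.8], slope, intercept)
print(round(slope, 3), round(intercept, 3), np.round(calibrated, 1))
```

To keep the backtest leakage-free, coefficients of this kind would need to be fitted on games that precede the scoring window rather than on the games being scored.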


In [3]:
from qepc.nba.calibration_eoin import calibrate_team_totals

# Identify actual score columns robustly
HOME_COL = "home_score" if "home_score" in games.columns else None
AWAY_COL = "away_score" if "away_score" in games.columns else None
if HOME_COL is None or AWAY_COL is None:
    raise KeyError("Expected 'home_score' and 'away_score' columns in Eoin games table.")

# Backtest date filters
modern_cutoff = CONFIG["modern_cutoff"]
start = CONFIG["backtest_start"] or modern_cutoff
end = CONFIG["backtest_end"] or games["game_date"].max()

# We always WARM UP team state from modern_cutoff → start (no rows emitted),
# then we SCORE games from start → end.
games_all = games[(games["game_date"] >= modern_cutoff) & (games["game_date"] <= end)].copy()
games_all = games_all.sort_values(["game_date", "game_id"]).reset_index(drop=True)

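# Note: when start == modern_cutoff there is no warm-up period, and the line below
# prints the cutoff date on both sides of the arrow.
print("state warmup:", modern_cutoff, "→", (start - dt.timedelta(days=1)) if start > modern_cutoff else modern_cutoff)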
print("scoring range:", start, "→", end)
print("total games processed:", len(games_all))

# Running team state (no leakage because we update AFTER each game)
gp = {}         # games played
pf = {}         # points for
pa = {}         # points allowed
last_date = {}  # last game date per team

rows = []
skipped_no_history = 0
skipped_warmup = 0

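def _ppg(team_id: int) -> tuple[float, float, int]:
    """Return (points-for per game, points-allowed per game, games played) from prior games only."""
    n = gp.get(team_id, 0)
    if n <= 0:
        # No prior games yet: averages are undefined.
        return np.nan, np.nan, 0
    return pf.get(team_id, 0.0) / n, pa.get(team_id, 0.0) / n, n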

for _, g in games_all.iterrows():
    gid = int(g["game_id"])
    gdate = g["game_date"]
    home_id = int(g["home_team_id"])
    away_id = int(g["away_team_id"])

    # Predict only inside scoring window; otherwise just warm up state.
    in_scoring_window = (gdate >= start)

    home_off, home_def, home_gp = _ppg(home_id)
    away_off, away_def, away_gp = _ppg(away_id)

    if in_scoring_window:
        # Require minimum history for both teams
        if home_gp < CONFIG["min_games_per_team"] or away_gp < CONFIG["min_games_per_team"]:
            skipped_no_history += 1
        else:
            # Back-to-back flags
            home_b2b = False
            away_b2b = False

            if home_id in last_date:
                home_b2b = (gdate - last_date[home_id]).days == 1
            if away_id in last_date:
                away_b2b = (gdate - last_date[away_id]).days == 1

            # Raw symmetric model + schedule tweaks
            exp_home_raw = (home_off + away_def) / 2.0
            exp_away_raw = (away_off + home_def) / 2.0

            exp_home_raw += CONFIG["home_bonus"]
            exp_away_raw -= CONFIG["away_penalty"]

            if home_b2b:
                exp_home_raw -= CONFIG["b2b_penalty"]
            if away_b2b:
                exp_away_raw -= CONFIG["b2b_penalty"]

            # Optional calibration
            if CONFIG["use_calibration"]:
                exp_home, exp_away = calibrate_team_totals(exp_home_raw, exp_away_raw)
            else:
                exp_home, exp_away = exp_home_raw, exp_away_raw

            actual_home = float(g[HOME_COL])
            actual_away = float(g[AWAY_COL])

            total_pred = exp_home + exp_away
            total_act = actual_home + actual_away

            row = {
                "game_id": gid,
                "game_date": gdate,
                "home_team_id": home_id,
                "away_team_id": away_id,

                "home_off_ppg_prev": home_off,
                "home_def_ppg_prev": home_def,
                "away_off_ppg_prev": away_off,
                "away_def_ppg_prev": away_def,

                "home_is_b2b": bool(home_b2b),
                "away_is_b2b": bool(away_b2b),

                "exp_home_pts_raw": exp_home_raw,
                "exp_away_pts_raw": exp_away_raw,
                "exp_home_pts": exp_home,
                "exp_away_pts": exp_away,

                "actual_home_pts": actual_home,
                "actual_away_pts": actual_away,

                "home_abs_err": abs(actual_home - exp_home),
                "away_abs_err": abs(actual_away - exp_away),

                "total_pred": total_pred,
                "total_actual": total_act,
                "total_abs_err": abs(total_act - total_pred),
            }

            # Odds comparison (if attached)
            if HAS_ODDS and "total_points" in g.index:
                vegas_total = g.get("total_points")
                row["vegas_total"] = float(vegas_total) if pd.notna(vegas_total) else np.nan
                row["vegas_total_abs_err"] = abs(total_act - row["vegas_total"]) if pd.notna(row["vegas_total"]) else np.nan

            rows.append(row)
    else:
        skipped_warmup += 1

    # Update running totals AFTER processing this game (prevents leakage)
    # Home update
    gp[home_id] = gp.get(home_id, 0) + 1
    pf[home_id] = pf.get(home_id, 0.0) + float(g[HOME_COL])
    pa[home_id] = pa.get(home_id, 0.0) + float(g[AWAY_COL])
    last_date[home_id] = gdate

    # Away update
    gp[away_id] = gp.get(away_id, 0) + 1
    pf[away_id] = pf.get(away_id, 0.0) + float(g[AWAY_COL])
    pa[away_id] = pa.get(away_id, 0.0) + float(g[HOME_COL])
    last_date[away_id] = gdate

backtest_df = pd.DataFrame(rows)

print("built backtest rows:", len(backtest_df))
print("skipped warmup games:", skipped_warmup)
print("skipped (insufficient history in scoring window):", skipped_no_history)
backtest_df.head()


state warmup: 2022-10-01 → 2022-10-01
scoring range: 2022-10-01 → 2025-12-10
total games processed: 4584
built backtest rows: 4496
skipped warmup games: 0
skipped (insufficient history in scoring window): 88


Unnamed: 0,game_id,game_date,home_team_id,away_team_id,home_off_ppg_prev,home_def_ppg_prev,away_off_ppg_prev,away_def_ppg_prev,home_is_b2b,away_is_b2b,...,exp_away_pts,actual_home_pts,actual_away_pts,home_abs_err,away_abs_err,total_pred,total_actual,total_abs_err,vegas_total,vegas_total_abs_err
0,22200011,2022-10-20,1610612759,1610612766,103.0,113.8,98.8,116.6,False,False,...,98.3456,102.0,129.0,5.4328,30.6544,205.7784,231.0,25.2216,221.5,9.5
1,22200015,2022-10-20,1610612755,1610612749,113.8,107.4,105.0,116.4,False,False,...,98.1459,88.0,90.0,29.5876,8.1459,215.7335,178.0,37.7335,224.5,46.5
2,22200017,2022-10-21,1610612766,1610612740,103.833333,114.166667,116.666667,110.333333,True,False,...,116.551583,112.0,124.0,12.646333,7.448417,215.90525,236.0,20.09475,226.0,10.0
3,22200018,2022-10-21,1610612754,1610612759,113.2,112.8,102.833333,116.333333,False,True,...,98.378883,134.0,137.0,17.051067,38.621117,215.327817,271.0,55.672183,232.5,38.5
4,22200020,2022-10-21,1610612737,1610612753,113.6,110.8,106.833333,106.5,False,False,...,103.371383,108.0,98.0,0.0882,5.371383,211.283183,206.0,5.283183,225.5,19.5


## Summary metrics

In [4]:
if backtest_df.empty:
    raise RuntimeError("No backtest rows built. Try lowering min_games_per_team or widening the date range.")

mae_home = backtest_df["home_abs_err"].mean()
mae_away = backtest_df["away_abs_err"].mean()
mae_total = backtest_df["total_abs_err"].mean()

print("Backtest rows:", len(backtest_df))
print("MAE home:", round(mae_home, 3))
print("MAE away:", round(mae_away, 3))
print("MAE total:", round(mae_total, 3))

if "vegas_total_abs_err" in backtest_df.columns:
    vegas = backtest_df.dropna(subset=["vegas_total_abs_err"])
    if len(vegas) > 0:
        mae_vegas = vegas["vegas_total_abs_err"].mean()
        print("\nOdds-covered games:", len(vegas))
        print("MAE vegas total:", round(mae_vegas, 3))
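        # Caveat: mae_total averages all backtest rows, whereas mae_vegas covers only
        # odds-covered games, so this difference is not a matched-sample comparison.
        print("QEPC - Vegas (lower is better):", round(mae_total - mae_vegas, 3))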


Backtest rows: 4496
MAE home: 9.631
MAE away: 9.797
MAE total: 15.611

Odds-covered games: 3937
MAE vegas total: 14.28
QEPC - Vegas (lower is better): 1.331
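
The Vegas MAE above covers only odds-covered games, while the QEPC total MAE covers every backtest row, so a like-for-like comparison would restrict the QEPC error to the same rows. A minimal sketch on a toy frame that reuses the `backtest_df` column names (the values are invented, and `demo` is a stand-in, not real results):

```python
import numpy as np
import pandas as pd

# Toy stand-in for backtest_df; the third game has no Vegas line.
demo = pd.DataFrame({
    "total_abs_err": [10.0, 20.0, 30.0, 12.0],
    "vegas_total_abs_err": [8.0, 18.0, np.nan, 14.0],
})

covered = demo.dropna(subset=["vegas_total_abs_err"])

mae_qepc_all = demo["total_abs_err"].mean()         # all rows
mae_qepc_matched = covered["total_abs_err"].mean()  # odds-covered rows only
mae_vegas_demo = covered["vegas_total_abs_err"].mean()

# Share of odds-covered games where the model's total was closer than the line.
qepc_closer_rate = (
    covered["total_abs_err"] < covered["vegas_total_abs_err"]
).mean()

print("QEPC MAE (all rows):", round(mae_qepc_all, 3))
print("QEPC MAE (matched):", round(mae_qepc_matched, 3))
print("Vegas MAE:", round(mae_vegas_demo, 3))
print("QEPC - Vegas (matched):", round(mae_qepc_matched - mae_vegas_demo, 3))
```

On the real data the same pattern would be `backtest_df.dropna(subset=["vegas_total_abs_err"])["total_abs_err"].mean()`, compared against the Vegas MAE computed on those rows.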
