# Monthly NBA Model Update - Incremental Training

**Purpose:** Update your NBA prediction models each month with new games, without a full retrain.

**Strategy:**
- Full historical data: 1947-present (team models)
- Player models: 2022-present (Kaggle dataset limitation)
- LightGBM: Warm start from previous month
- TabNet: Retrain on sliding window (last 3 years)
- Recalibration: Update weekly

**Runtime:** ~30-60 minutes on GPU (vs 2-3 hours full retrain)

---

## When to Run:
- **Monthly:** Full update (this notebook)
- **Weekly:** Recalibration only (faster, separate script)
- **Yearly:** Full retrain recommended (quality check)

## Step 1: Setup & Install Dependencies

In [None]:
# kagglehub is imported in Step 5, so install it alongside the kaggle CLI
!pip install -q pytorch-tabnet lightgbm scikit-learn pandas numpy torch kaggle kagglehub
print("[OK] Dependencies installed")

## Step 2: Check GPU

In [None]:
import torch

if torch.cuda.is_available():
    print(f"[OK] GPU: {torch.cuda.get_device_name(0)}")
    USE_GPU = True
else:
    print("[WARNING] No GPU - will be slower")
    USE_GPU = False

## Step 3: Clone Your GitHub Repo

**Option A:** Clone from GitHub (recommended)

In [None]:
# Clone your repo (replace with your actual repo URL)
!git clone https://github.com/YOUR_USERNAME/nba_predictor.git
%cd nba_predictor

print("[OK] Repo cloned")

**Option B:** Upload files manually (if not using GitHub)

Upload these files:
- `train_auto.py`
- `neural_hybrid.py`
- `monthly_update.py` (we'll create this)

## Step 4: Setup Kaggle Credentials

Upload your `kaggle.json` file or set credentials:

In [None]:
# Option A: Upload kaggle.json
from google.colab import files
files.upload()  # Upload kaggle.json

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

print("[OK] Kaggle credentials configured")

## Step 5: Download Latest Data from Kaggle

In [None]:
import kagglehub

# Download latest version of dataset
print("[INFO] Downloading latest NBA data from Kaggle...")
path = kagglehub.dataset_download("eoinamoore/historical-nba-data-and-player-box-scores")

print(f"[OK] Data downloaded to: {path}")

# Check what's new
import pandas as pd
games_df = pd.read_csv(f"{path}/Game.csv")
games_df['GAME_DATE_EST'] = pd.to_datetime(games_df['GAME_DATE_EST'])

latest_game = games_df['GAME_DATE_EST'].max()
print(f"\n[INFO] Latest game in dataset: {latest_game}")
print(f"[INFO] Total games: {len(games_df):,}")
print(f"[INFO] Date range: {games_df['GAME_DATE_EST'].min()} to {latest_game}")
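
File and column names differ between Kaggle datasets, so it can help to confirm what was actually downloaded before relying on `Game.csv` and `GAME_DATE_EST`. The sketch below lists each CSV in a directory with its header row. The helper name `list_csv_columns` is hypothetical and not part of the repo.

```python
import csv
from pathlib import Path


def list_csv_columns(data_dir):
    """Map each CSV filename in data_dir to its header row (column names)."""
    columns = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        # utf-8-sig reads files with or without a byte-order mark
        with open(csv_path, newline="", encoding="utf-8-sig") as f:
            columns[csv_path.name] = next(csv.reader(f), [])
    return columns


# In this notebook, pass the `path` returned by kagglehub.dataset_download:
# for name, cols in list_csv_columns(path).items():
#     print(name, cols[:10])
```

If the expected file or date column is missing from the listing, adjust the `read_csv` call and column name in the cell above to match.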

## Step 6: Configure Update Strategy

Choose your training approach:

In [None]:
import datetime

# Update configuration
UPDATE_CONFIG = {
    # Data configuration
    'full_history': True,          # Use all data back to 1947 (team models)
    'player_start_year': 2022,     # Player data only available from 2022
    
    # Training strategy
    'strategy': 'hybrid',          # 'full_retrain', 'sliding_window', or 'hybrid'
    'sliding_window_years': 3,     # For sliding window: last N years
    
    # LightGBM warm start
    'lgb_warm_start': True,        # Build on previous model
    'lgb_new_rounds': 200,         # Additional boosting rounds
    
    # TabNet settings
    'tabnet_retrain': 'monthly',   # 'monthly' or 'quarterly'
    'tabnet_epochs': 30,           # Epochs for TabNet (reduce for faster)
    
    # Model versioning
    'version': datetime.datetime.now().strftime('%Y%m'),  # e.g., '202411'
    'previous_version': None,      # Set to previous month's version for warm start
}

print("[CONFIG] Update Strategy:")
print(f"  Strategy: {UPDATE_CONFIG['strategy']}")
print(f"  LightGBM warm start: {UPDATE_CONFIG['lgb_warm_start']}")
print(f"  TabNet retrain: {UPDATE_CONFIG['tabnet_retrain']}")
print(f"  Model version: v{UPDATE_CONFIG['version']}")

## Step 7: Identify New Games Since Last Update

In [None]:
# Set the date of your last model update
# Change this to the date you last trained
LAST_UPDATE_DATE = '2024-10-01'  # CHANGE THIS

last_update = pd.to_datetime(LAST_UPDATE_DATE)
new_games = games_df[games_df['GAME_DATE_EST'] > last_update]

print(f"\n[INFO] Games since last update ({LAST_UPDATE_DATE}):")
print(f"  New games: {len(new_games):,}")

if new_games.empty:
    print("\n[WARNING] No new games found. Your model is already up to date!")
else:
    # Print the range only when games exist; min()/max() return NaT on an empty frame
    print(f"  Date range: {new_games['GAME_DATE_EST'].min()} to {new_games['GAME_DATE_EST'].max()}")
    print(f"\n[OK] Found {len(new_games):,} new games to incorporate")

## Step 8: Load Previous Models (for Warm Start)

In [None]:
import pickle
import os

# Check if previous models exist
PREVIOUS_MODEL_DIR = 'model_cache'  # Must match the directory train_auto.py writes to (see Step 11)

previous_models = None  # Stays None unless a previous ensemble is loaded below

if UPDATE_CONFIG['lgb_warm_start'] and UPDATE_CONFIG['previous_version']:
    prev_version = UPDATE_CONFIG['previous_version']
    prev_model_path = f"{PREVIOUS_MODEL_DIR}/ensemble_{prev_version}.pkl"
    
    if os.path.exists(prev_model_path):
        print(f"[OK] Found previous model: {prev_model_path}")
        # Only unpickle files you created yourself; pickle can execute arbitrary code
        with open(prev_model_path, 'rb') as f:
            previous_models = pickle.load(f)
        print("[OK] Loaded previous models for warm start")
    else:
        print("[WARNING] Previous model not found, will train from scratch")
        UPDATE_CONFIG['lgb_warm_start'] = False
else:
    print("[INFO] No warm start - will train from scratch")

## Step 9: Run Incremental Training

**Choose your approach:**

### Approach A: Full Retrain with All Historical Data (Safest)

Train on ALL data from 1947-present. Slower but most accurate.

In [None]:
# Full retrain (2-3 hours on GPU)
!python train_auto.py \
  --dataset eoinamoore/historical-nba-data-and-player-box-scores \
  --verbose \
  --fresh \
  --enable-window-ensemble \
  --neural-device gpu

print("\n[OK] Full retrain complete")

### Approach B: Sliding Window (Faster - 30-60 min)

Train only on recent data (e.g., last 3 years)

In [None]:
# TODO: Create sliding window training script
# For now, use full retrain or implement custom logic

print("[TODO] Sliding window training - coming soon")
print("[INFO] Use full retrain for now")
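
Until a sliding-window script exists, the data selection itself can be sketched as below: keep only rows within the last N years of the most recent game. The helper name `sliding_window` is hypothetical. The sketch only filters a DataFrame, and whether `train_auto.py` can consume the filtered data is not covered here.

```python
import pandas as pd


def sliding_window(df, date_col="GAME_DATE_EST", years=3):
    """Return rows whose date falls within the last `years` years of the data."""
    latest = df[date_col].max()
    cutoff = latest - pd.DateOffset(years=years)
    return df[df[date_col] > cutoff].copy()


# Tiny demo frame standing in for games_df
demo = pd.DataFrame({
    "GAME_DATE_EST": pd.to_datetime(
        ["2019-01-15", "2021-11-05", "2022-03-01", "2024-10-30"]
    )
})
recent = sliding_window(demo, years=3)
print(f"{len(recent)} of {len(demo)} games fall inside the window")
```

In this notebook the call would look like `sliding_window(games_df, years=UPDATE_CONFIG['sliding_window_years'])`. Anchoring the cutoff to the latest game rather than to today's date keeps the window stable when the dataset lags.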

## Step 10: Validate New Models

Test performance on recent games:

In [None]:
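# Validate on the last 2 weeks of games in the dataset.
# The cutoff is anchored to the latest game date, so the set is not empty
# when the dataset lags behind today's date (or during the offseason).
validation_cutoff = games_df['GAME_DATE_EST'].max() - pd.Timedelta(days=14)
validation_games = games_df[games_df['GAME_DATE_EST'] >= validation_cutoff]

print(f"[INFO] Validation set: {len(validation_games)} games from the final 14 days of data")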

# Run predictions (using your existing prediction script)
# !python predict.py --validate

print("[TODO] Add validation script")

## Step 11: Save Updated Models

In [None]:
# Models are already saved by train_auto.py to model_cache/
# Copy them to a versioned directory

import shutil

version = UPDATE_CONFIG['version']
version_dir = f"models_v{version}"

# Copy all model files; dirs_exist_ok lets the cell be re-run safely (Python 3.8+)
shutil.copytree('model_cache', version_dir, dirs_exist_ok=True)

print(f"[OK] Models saved to {version_dir}/")
print(f"\n[INFO] Model files:")
!ls -lh {version_dir}/

## Step 12: Download Updated Models

In [None]:
# Zip and download
zip_name = f"{version_dir}.zip"  # e.g., models_v202411.zip
!zip -r {zip_name} {version_dir}/

from google.colab import files
files.download(zip_name)

print(f"[OK] Models downloaded: {zip_name}")

## Summary & Next Steps

### What You Just Did:
1. ✅ Downloaded latest NBA data from Kaggle
2. ✅ Identified new games since last update
3. ✅ Trained updated models (full retrain; the sliding-window path is still a TODO)
4. ✅ Selected a validation set of recent games (the scoring script is still a TODO)
5. ✅ Saved versioned models

### Monthly Workflow:
- **Day 1 of month:** Run this notebook
- **Weekly:** Run recalibration (separate notebook)
- **As needed:** Update predictions for upcoming games

### Model Versions:
- Current: the `vYYYYMM` value of `UPDATE_CONFIG['version']` set in Step 6 (e.g., `v202411`)
- Keep last 3 months of models for rollback
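
The retention rule above can be sketched as a small cleanup helper. It assumes the `models_vYYYYMM` directory naming from Step 11. The helper name `prune_old_versions` is hypothetical, and it defaults to a dry run so nothing is deleted by accident.

```python
import re
import shutil
from pathlib import Path

VERSION_DIR = re.compile(r"^models_v(\d{6})$")


def prune_old_versions(root=".", keep=3, dry_run=True):
    """Return version dirs older than the newest `keep`; delete them unless dry_run."""
    dirs = [
        p for p in Path(root).iterdir()
        if p.is_dir() and VERSION_DIR.match(p.name)
    ]
    dirs.sort(key=lambda p: p.name, reverse=True)  # YYYYMM sorts chronologically
    stale = dirs[keep:]
    if not dry_run:
        for p in stale:
            shutil.rmtree(p)
    return [p.name for p in stale]


# Inspect first, then delete:
# print(prune_old_versions(keep=3))           # lists what would be removed
# prune_old_versions(keep=3, dry_run=False)   # actually removes it
```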

### Performance Monitoring:
Track these metrics monthly:
- MAE on validation set
- Calibration error
- Hit rate on over/under predictions
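
The three metrics above can be computed roughly as follows. This is a sketch under assumptions: predictions are either point estimates (for MAE) or probabilities of the "over" (for hit rate and calibration), and "calibration error" is taken to mean expected calibration error with equal-width bins. All function names are hypothetical.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted values."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(diff)))


def hit_rate(outcomes, prob_over, threshold=0.5):
    """Fraction of picks (over if prob >= threshold, else under) that were right."""
    outcomes = np.asarray(outcomes, int)
    picks = (np.asarray(prob_over, float) >= threshold).astype(int)
    return float(np.mean(picks == outcomes))


def expected_calibration_error(outcomes, prob_over, n_bins=10):
    """Weighted average gap between predicted probability and observed frequency."""
    outcomes = np.asarray(outcomes, float)
    probs = np.asarray(prob_over, float)
    # Assign each prediction to an equal-width bin; a probability of 1.0 goes in the last bin
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(outcomes[mask].mean() - probs[mask].mean())
    return float(ece)
```

Logging these three numbers per model version makes month-over-month regressions easier to spot.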

---

## Next: Weekly Recalibration

Between monthly updates, run weekly recalibration to adjust predictions without full retrain.

See: `weekly_recalibration.ipynb` (coming next)
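
The recalibration notebook is not shown here. As an illustration of the general idea only, one common technique is isotonic regression: fit a monotone mapping from the model's raw probabilities to observed outcomes on recent games, then apply it to new predictions without touching the underlying model. The sketch below uses synthetic data and scikit-learn's `IsotonicRegression`. The helper name `fit_recalibrator` is hypothetical, and this may differ from what `weekly_recalibration.ipynb` ends up doing.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def fit_recalibrator(raw_probs, outcomes):
    """Fit a monotone map from raw model probabilities to observed frequencies."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(raw_probs, float), np.asarray(outcomes, float))
    return iso


# Synthetic example: a deliberately miscalibrated score (true probability squared)
rng = np.random.default_rng(42)
true_p = rng.uniform(0.05, 0.95, size=2000)
outcomes = rng.binomial(1, true_p)
raw = true_p ** 2

recalibrator = fit_recalibrator(raw, outcomes)
calibrated = recalibrator.predict(raw)

brier_raw = float(np.mean((raw - outcomes) ** 2))
brier_cal = float(np.mean((calibrated - outcomes) ** 2))
print(f"Brier score raw: {brier_raw:.4f}, recalibrated: {brier_cal:.4f}")
```

In practice the mapping should be fit on games the model did not train on, and then applied to the upcoming week's predictions.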