# NBA Props Model Retraining Guide

This notebook walks through the model retraining workflow:
1. **Data Quality Checks** - Validate data before training
2. **Build Training Dataset** - Extract features from historical props
3. **Train Models** - Two-head stacked LightGBM
4. **Evaluate Performance** - AUC, calibration, SHAP analysis
5. **Deploy** - Move models to production

**Current Production Models:** XL (102 features), trained Nov 6, 2025

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Project imports
from nba.core.data_quality_checks import DataQualityChecker
from nba.config.database import get_intelligence_db_config

pd.set_option('display.max_columns', 50)
%matplotlib inline

## 1. Data Quality Checks

Always validate data before training. These checks catch:
- Stale data (old game logs)
- Missing coverage (players without rolling stats)
- Null values in critical fields
- Home/away imbalance (the bug that invalidated Nov 2025 models)
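
For intuition, two of these checks can be approximated directly in pandas. This is only a sketch, not the logic inside `DataQualityChecker`: the function names `home_share_ok` and `days_stale` and the 0.40 to 0.60 band are assumptions chosen for illustration. The column names `is_home` and `game_date` match the training CSV loaded later in this notebook.

```python
import pandas as pd

def home_share_ok(df, col="is_home", low=0.40, high=0.60):
    """Return True when the share of home rows falls inside [low, high]."""
    share = df[col].astype(float).mean()
    return bool(low <= share <= high)

def days_stale(df, as_of, col="game_date"):
    """Whole days between the newest row and the reference date `as_of`."""
    latest = pd.to_datetime(df[col]).max()
    return (pd.Timestamp(as_of) - latest).days

# Tiny synthetic frame, for illustration only
sample = pd.DataFrame({
    "game_date": ["2025-11-01", "2025-11-02", "2025-11-03", "2025-11-04"],
    "is_home": [1, 0, 1, 0],
})
print(home_share_ok(sample))
print(days_stale(sample, "2025-11-06"))
```

A frame where every row has `is_home == 1` fails the first check, which is the failure mode described in the last bullet.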

In [None]:
# Run pre-training checks
checker = DataQualityChecker()
checker.connect()

# Run all checks
success = checker.run_pre_training_checks()

if not success:
    print("\n⚠️  Fix data issues before proceeding!")
else:
    print("\n✅ Ready to train")

checker.close()

## 2. Load & Inspect Training Data

Training datasets are pre-built in `nba/features/datasets/`. To rebuild them, run `make build-dataset` (see Quick Reference). The next cell loads an existing dataset.

In [None]:
# Load existing training data
MARKET = 'POINTS'  # Set to 'REBOUNDS' to inspect the rebounds market

df = pd.read_csv(f'../nba/features/datasets/xl_training_{MARKET}_2023_2025.csv')
print(f"Dataset shape: {df.shape}")
print(f"Date range: {df['game_date'].min()} to {df['game_date'].max()}")
print("\nTarget distribution:")
print(df['hit_over'].value_counts(normalize=True))

In [None]:
# Check home/away balance (critical!)
print("Home/Away Distribution:")
print(df['is_home'].value_counts(normalize=True))

# Expect roughly 50/50. If every row is a home game, the data is corrupted (see Section 1).

In [None]:
# Feature completeness
null_pct = df.isnull().sum() / len(df) * 100
high_null = null_pct[null_pct > 5].sort_values(ascending=False)

if len(high_null) > 0:
    print("Features with >5% null:")
    print(high_null)
else:
    print("✅ No feature exceeds 5% nulls")

## 3. Train Model

The training script handles:
- Temporal train/test split (70/30)
- Two-head architecture (regressor + classifier)
- Isotonic calibration
- Blending (60% classifier, 40% residual)
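
As a rough illustration of two items above: a temporal split holds out the most recent rows by date instead of sampling at random, and blending is a weighted average of two probability estimates. The sketch below is not the code in `train_market.py`. The names `temporal_split` and `blend`, and the reading of the "residual" head as a second probability array `p_residual`, are assumptions.

```python
import numpy as np
import pandas as pd

def temporal_split(df, date_col="game_date", train_frac=0.70):
    """Sort by date and cut once, so no test row is earlier than any training row."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cut = int(round(len(ordered) * train_frac))
    return ordered.iloc[:cut], ordered.iloc[cut:]

def blend(p_classifier, p_residual, w_classifier=0.60):
    """Weighted average of two probability arrays (60/40 by default)."""
    p_classifier = np.asarray(p_classifier, dtype=float)
    p_residual = np.asarray(p_residual, dtype=float)
    return w_classifier * p_classifier + (1.0 - w_classifier) * p_residual

# Ten synthetic daily rows, for illustration only
games = pd.DataFrame({
    "game_date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "hit_over": [0, 1] * 5,
})
train, test = temporal_split(games)
print(len(train), len(test))
print(blend([0.70], [0.50]))
```

Cutting on sorted dates keeps future games out of the training set, which a random split would not guarantee.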

In [None]:
# Option 1: Train via make (recommended)
!cd .. && make train-points

# Option 2: Call the training script directly
# !python ../nba/models/train_market.py --market POINTS --data ../nba/features/datasets/xl_training_POINTS_2023_2025.csv

# Note: both commands are hard-coded to POINTS and ignore the MARKET variable set above.

## 4. Evaluate Model Performance

In [None]:
import json
from pathlib import Path

# Load model metadata
metadata_path = Path(f'../nba/models/saved_xl/{MARKET.lower()}_xl_metadata.json')

if metadata_path.exists():
    with open(metadata_path) as f:
        metadata = json.load(f)
    
    print(f"Model: {MARKET}")
    print(f"Trained: {metadata.get('trained_date', 'Unknown')}")
    print(f"Features: {metadata.get('features', {}).get('count', 'Unknown')}")
    print("\nRegressor Metrics:")
    for k, v in metadata.get('metrics', {}).get('regressor', {}).items():
        print(f"  {k}: {v}")
    print("\nClassifier Metrics:")
    for k, v in metadata.get('metrics', {}).get('classifier', {}).items():
        print(f"  {k}: {v}")
else:
    print(f"No metadata found at {metadata_path}")

## 5. SHAP Feature Importance

SHAP values attribute each prediction to individual features, showing which features drive the model's output and in which direction.
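
The bar plot in the next cell ranks features by mean absolute SHAP value. The following is a minimal sketch of that ranking, assuming a SHAP matrix of shape (rows, features) has already been computed. The helper `rank_by_mean_abs_shap`, the feature names, and the values are made up for illustration.

```python
import numpy as np

def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (name, mean |SHAP|) pairs, largest first."""
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy SHAP matrix: 3 rows x 3 hypothetical features
toy_shap = [[0.2, -0.1, 0.0],
            [-0.4, 0.1, 0.0],
            [0.3, -0.1, 0.1]]
names = ["feat_a", "feat_b", "feat_c"]
ranking = rank_by_mean_abs_shap(toy_shap, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value first matters: a feature that pushes predictions up on some rows and down on others would otherwise average out to near zero.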

In [None]:
from IPython.display import Image, display

# Display pre-generated SHAP plots
shap_bar = Path(f'../nba/models/model_cards/images/{MARKET.lower()}_shap_bar.png')
shap_beeswarm = Path(f'../nba/models/model_cards/images/{MARKET.lower()}_shap_beeswarm.png')

if shap_bar.exists():
    print("Top Features by Mean |SHAP|:")
    display(Image(filename=str(shap_bar), width=600))

if shap_beeswarm.exists():
    print("\nFeature Impact Distribution:")
    display(Image(filename=str(shap_beeswarm), width=600))

In [None]:
# Generate fresh SHAP analysis (takes ~2 min)
# !python -m nba.models.generate_feature_importance --market POINTS --debug

## 6. Calibration Check

Good calibration means predicted probabilities match actual hit rates: props predicted at 60% should hit about 60% of the time.
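
A single-number summary can complement the calibration curve plotted in the next cell. The sketch below computes the Brier score and a simple expected calibration error (ECE) with equal-width bins. Both are standard metrics, but the function names and the 10-bin default are choices made here, not part of the training pipeline, and the sample arrays are synthetic.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-weighted average gap between observed hit rate and mean prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins; clip so that p == 1.0 lands in the last bin
    bin_ids = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the share of rows in the bin
    return float(ece)

# Synthetic outcomes and predictions, for illustration only
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.1, 0.8, 0.7, 0.3, 0.2, 0.6, 0.4]
print(brier_score(y, p), expected_calibration_error(y, p))
```

Lower is better for both metrics; an ECE of 0 means every bin's average prediction equals its observed hit rate.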

In [None]:
# Load test predictions (if saved during training)
test_preds_path = Path(f'../nba/models/saved_xl/{MARKET.lower()}_test_predictions.csv')

if test_preds_path.exists():
    test_df = pd.read_csv(test_preds_path)
    
    # Calibration plot
    from sklearn.calibration import calibration_curve
    
    prob_true, prob_pred = calibration_curve(
        test_df['actual'], test_df['predicted_prob'], n_bins=10
    )
    
    plt.figure(figsize=(8, 6))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    plt.plot(prob_pred, prob_true, 's-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.title(f'{MARKET} Calibration Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("Test predictions not saved. Re-run training with --save-predictions flag.")

## 7. Compare to Previous Model

Before deploying, compare the new model to the current production model.
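
One way to make the comparison explicit is a per-market AUC delta with a minimum required gain. This is a hypothetical sketch: `compare_to_production`, the 0.005 threshold, and the AUC values are illustrative assumptions, not the project's deployment rule. Using AUC alone is a simplification, since calibration also matters.

```python
def compare_to_production(candidate_auc, production_auc, min_gain=0.005):
    """Map each market to (AUC delta, deploy flag); markets without a baseline are skipped."""
    report = {}
    for market, new_auc in candidate_auc.items():
        if market not in production_auc:
            continue
        delta = new_auc - production_auc[market]
        report[market] = (round(delta, 4), delta >= min_gain)
    return report

# Made-up AUC values, for illustration only
candidate = {"POINTS": 0.742, "REBOUNDS": 0.690}
production = {"POINTS": 0.731, "REBOUNDS": 0.688}
report = compare_to_production(candidate, production)
for market, (delta, deploy) in report.items():
    print(f"{market}: delta={delta:+.4f} deploy={deploy}")
```

Requiring a minimum gain guards against swapping models over differences that are within noise.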

In [None]:
# Load model registry (requires the third-party `toml` package)
import toml

registry_path = Path('../nba/models/MODEL_REGISTRY.toml')
if registry_path.exists():
    registry = toml.load(registry_path)
    
    print("Model Registry:")
    for market, info in registry.get('models', {}).items():
        print(f"\n{market}:")
        print(f"  Version: {info.get('version', 'Unknown')}")
        print(f"  AUC: {info.get('auc', 'Unknown')}")
        print(f"  Status: {info.get('status', 'Unknown')}")
else:
    print(f"No registry found at {registry_path}")

## 8. Deploy New Model

If the new model outperforms production:

```bash
# Models are automatically saved to nba/models/saved_xl/
# Update MODEL_REGISTRY.toml with new metrics
# Run validation on recent picks
./nba/nba-predictions.sh validate --7d
```

---

## Quick Reference

**Rebuild training data:**
```bash
make build-dataset
```

**Train all models:**
```bash
make train
```

**Check drift:**
```bash
python -m nba.core.cli_drift_check --market POINTS --check-latest
```