# üèÄ NBA Predictor - Neural Hybrid Training (Game + Player Models)

## Features:
- ‚úÖ **FULL NBA HISTORY**: 1947-2026 (80 seasons, 1.6M player-games)
- ‚úÖ **Neural hybrid: TabNet + LightGBM** with 24-dim embeddings
- ‚úÖ Basketball Reference priors: ~68 advanced stats (already merged!)
- ‚úÖ **Game Models**: Moneyline + Spread predictions
- ‚úÖ **Player Models**: Minutes, Points, Rebounds, Assists, Threes
- ‚úÖ **Phase 1-7 features**: Built automatically during training

## Quick Start:
1. **Add "meeper" dataset** in Kaggle (Add Data ‚Üí search "meeper")
2. **Enable GPU** (P100 or T4 x2)
3. **Run Cell 1** (setup - 2 min)
4. **Run Cell 2** (training - 7-8 hours)
5. **Close browser** - Kaggle keeps running!
6. **Come back later** and download models

**GPU Required:** P100 (best) or T4 x2

## What You Get:
- **Game Models** (2): Moneyline classifier + Spread regressor
- **Player Models** (5): Minutes, Points, Rebounds, Assists, Threes
- **All with 24-dim embeddings** from TabNet + LightGBM
- **Uncertainty quantification** via sigma models

## Training Time:
- **P100**: ~7 hours
- **T4**: ~8 hours
- Feature building: 90 min
- Game models: 1 hour  
- Player models: 5 hours

## Expected Performance:
- **Game accuracy**: 63.5-64.5% (beats Vegas 52.4% vig)
- **Points MAE**: ~2.0-2.1 (22% better than baseline 2.65)
- **Embeddings**: 24-dimensional, 15-40% feature importance

In [None]:
# ============================================================
# SETUP (Kaggle Version)
# ============================================================

print("Installing packages...")
!pip install -q pytorch-tabnet lightgbm scikit-learn pandas numpy tqdm

print("\nDownloading training code from GitHub...")
import os

# Navigate to Kaggle working directory
os.chdir('/kaggle/working')

# Clone your repository
!git clone https://github.com/tyriqmiles0529-pixel/meep.git
os.chdir('meep')

print("\nCode version:")
!git log -1 --oneline

# Check GPU
import torch
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'Not available'
print(f"\nGPU: {gpu_name}")

# Verify dataset exists (added via "Add Data" in Kaggle UI)
dataset_path = '/kaggle/input/meeper/aggregated_nba_data.csv.gzip'
if os.path.exists(dataset_path):
    size_mb = os.path.getsize(dataset_path) / 1024 / 1024
    print(f"\nDataset found: {size_mb:.1f} MB")
    print(f"   Path: {dataset_path}")
    print(f"   Full NBA history: 1947-2026 (80 seasons, 1.6M player-games)")
    print(f"   Training will use: ALL DATA (no cutoff)")
else:
    print("\nDataset not found!")
    print("   Make sure you added 'meeper' dataset to this notebook")
    print("   Click 'Add Data' -> search 'meeper' -> Add")

print("\nSetup complete!")

In [None]:
# ============================================================
# TRAIN NEURAL HYBRID MODELS - GAME + PLAYER
# ============================================================

import os

# Make sure we're in the code directory
if not os.path.exists('/kaggle/working/meep'):
    print("ERROR: Repository not found!")
    print("Run Cell 1 first to clone the repository")
    raise FileNotFoundError("Repository directory /kaggle/working/meep does not exist")

os.chdir('/kaggle/working/meep')

print("="*70)
print("NBA NEURAL HYBRID TRAINING - GAME + PLAYER MODELS")
print("="*70)

print("\nDataset Info:")
print("   Source: /kaggle/input/meeper/aggregated_nba_data.csv.gzip")
print("   Full range: 1947-2026 (80 seasons, 1.6M player-games)")
print("   Training on: ALL DATA (no cutoff)")
print("\nWhat will happen:")
print("   1. Load aggregated data (30 sec)")
print("   2. Build Phase 1-6 features (90 min)")
print("   3. Train game models: Moneyline + Spread (1 hour)")
print("   4. Train 5 player props with neural hybrid (5 hours)")
print("\nArchitecture:")
print("   Game Models: Ensemble (TabNet + LightGBM)")
print("   Player Models: TabNet (24-dim embeddings) + LightGBM")

import torch
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'

if 'P100' in gpu_name:
    print("\nExpected time: ~7-8 hours total (P100)")
elif 'T4' in gpu_name:
    print("\nExpected time: ~8-9 hours total (T4)")
else:
    print("\nExpected time: ~7-9 hours")

print("\nModels to train:")
print("   Game: Moneyline (win probability), Spread (margin)")
print("   Player: Minutes, Points, Rebounds, Assists, Threes")
print("\n" + "="*70)
print("STARTING TRAINING...")
print("="*70 + "\n")

# Run training - FULL DATASET, GAME + PLAYER MODELS
!python train_auto.py \
    --aggregated-data /kaggle/input/meeper/aggregated_nba_data.csv.gzip \
    --use-neural \
    --game-neural \
    --neural-epochs 30 \
    --neural-device gpu \
    --verbose \
    --no-window-ensemble

# CORRECT FLAG: --aggregated-data (loads pre-aggregated CSV)
# NO --skip-game-models = trains both game + player
# NO --player-season-cutoff = uses all 1947-2026 data
# --no-window-ensemble = single-pass training (not 5-year windows)

print("\n" + "="*70)
print("TRAINING COMPLETE!")
print("="*70)
print("\nModels saved to: /kaggle/working/meep/models/")
print("\nNext: Run validation cell to check embeddings")

In [None]:
# ============================================================
# VALIDATE 24-DIM EMBEDDINGS
# ============================================================

import os
import pickle
import numpy as np
import pandas as pd
from pathlib import Path

# Navigate to code directory
os.chdir('/kaggle/working/meep')

print("üîç Validating TabNet embeddings...\n")

models_dir = Path('./models')

# Check if models directory exists
if not models_dir.exists():
    print("‚ùå Models directory not found!")
    print("   Expected: /kaggle/working/meep/models/")
    print("   Run training cell first")
else:
    # Find points model
    model_files = list(models_dir.glob('*points*.pkl'))
    
    if not model_files:
        print("‚ùå No points model found!")
        print(f"   Searching in: {models_dir.absolute()}")
        print(f"   Files found: {list(models_dir.glob('*.pkl'))}")
    else:
        model_path = model_files[0]
        print(f"üì¶ Loading model: {model_path.name}")
        
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        
        print(f"   Model type: {type(model).__name__}")
        
        # Check if it's a NeuralHybridPredictor
        if hasattr(model, 'tabnet'):
            print(f"   ‚úÖ Neural hybrid detected")
            print(f"   TabNet: {type(model.tabnet).__name__}")
            print(f"   LightGBM: {type(model.lgbm).__name__}")
            
            # Test embedding extraction
            print("\nüß™ Testing embedding extraction...")
            
            # Create dummy data (match actual feature count)
            n_features = 150  # Adjust if needed
            dummy_features = pd.DataFrame(
                np.random.randn(10, n_features),
                columns=[f'feature_{i}' for i in range(n_features)]
            )
            
            # Get embeddings
            try:
                if hasattr(model, '_get_embeddings'):
                    embeddings = model._get_embeddings(dummy_features)
                    print(f"\n‚úÖ SUCCESS!")
                    print(f"   Embedding shape: {embeddings.shape}")
                    print(f"   Expected: (10, 24)")
                    
                    if embeddings.shape[1] == 24:
                        print(f"\nüéØ PERFECT: Got 24-dimensional embeddings")
                        print(f"   Mean: {embeddings.mean():.4f}")
                        print(f"   Std: {embeddings.std():.4f}")
                        
                        # Check LightGBM uses embeddings
                        if hasattr(model.lgbm, 'feature_name_'):
                            lgbm_features = model.lgbm.feature_name_
                            embedding_features = [f for f in lgbm_features if 'embedding' in f.lower()]
                            print(f"   LightGBM sees {len(embedding_features)} embedding features")
                        
                        print(f"\n‚úÖ Model validation PASSED!")
                        print(f"   Ready for predictions")
                    else:
                        print(f"\n‚ö†Ô∏è  Warning: Got {embeddings.shape[1]}-dim embeddings")
                else:
                    print(f"‚ö†Ô∏è  Model doesn't have _get_embeddings method")
            
            except Exception as e:
                print(f"‚ùå Error: {e}")
                import traceback
                traceback.print_exc()
        else:
            print(f"   ‚ö†Ô∏è  LightGBM-only model (no neural hybrid)")
        
        # Display model info
        print(f"\nüìä Model Summary:")
        if hasattr(model, 'lgbm'):
            print(f"   LightGBM trees: {model.lgbm.n_estimators if hasattr(model.lgbm, 'n_estimators') else 'N/A'}")
            if hasattr(model.lgbm, 'feature_name_'):
                print(f"   Features used: {len(model.lgbm.feature_name_)}")
        
        if hasattr(model, 'sigma_model'):
            print(f"   Uncertainty model: {'Yes' if model.sigma_model else 'No'}")

print("\n‚úÖ Validation complete!")

In [None]:
# ============================================================
# TRAINING RESULTS SUMMARY
# ============================================================

import os
from pathlib import Path

os.chdir('/kaggle/working/meep')

models_dir = Path('./models')

print("="*70)
print("TRAINING RESULTS")
print("="*70)

if not models_dir.exists():
    print("
No models directory found!")
else:
    model_files = list(models_dir.glob('*.pkl'))
    
    if not model_files:
        print("
No models found!")
        print(f"   Directory: {models_dir.absolute()}")
    else:
        print(f"
Found {len(model_files)} trained models:
")
        
        for model_path in sorted(model_files):
            print(f"   {model_path.name}")
            size_mb = model_path.stat().st_size / 1024 / 1024
            print(f"      Size: {size_mb:.1f} MB")
            
            # Try to load and check type
            try:
                import pickle
                with open(model_path, 'rb') as f:
                    model = pickle.load(f)
                
                if hasattr(model, 'tabnet'):
                    print(f"      Type: Neural Hybrid")
                else:
                    print(f"      Type: LightGBM only")
                
                if hasattr(model, 'sigma_model') and model.sigma_model:
                    print(f"      Uncertainty: Yes")
                
                print()
            except Exception as e:
                print(f"      Error loading: {e}
")

print("="*70)
print("
Models location: /kaggle/working/meep/models/")
print("
Next: Package and download models")

In [None]:
# ============================================================
# PACKAGE MODELS FOR DOWNLOAD
# ============================================================

import os

os.chdir('/kaggle/working')

print("üì¶ Packaging models...")

# Check if models exist
if not os.path.exists('meep/models'):
    print("\n‚ùå No models directory found!")
    print("   Run training first")
else:
    # Create zip file
    !zip -r nba_models_trained.zip meep/models/ meep/model_cache/ 2>/dev/null
    
    # Check if zip was created
    if os.path.exists('nba_models_trained.zip'):
        size_mb = os.path.getsize('nba_models_trained.zip') / 1024 / 1024
        print(f"\n‚úÖ Package created: nba_models_trained.zip ({size_mb:.1f} MB)")
        print("\nüì• To download:")
        print("   1. Look at the right sidebar")
        print("   2. Click 'Output' tab")
        print("   3. Find 'nba_models_trained.zip'")
        print("   4. Click the download icon (‚Üì)")
        print("\nOr use this command to download via Kaggle API:")
        print("   (This requires Kaggle notebook to be public)")
    else:
        print("\n‚ùå Failed to create zip file")
        print("   Check if models exist in meep/models/")

---

## üìã What You Get After Training

### Game Models (2 files):
- `moneyline_ensemble_1947_2026.pkl` - Win probability predictions
- `spread_ensemble_1947_2026.pkl` - Margin predictions
- Both with TabNet + LightGBM ensemble architecture
- Expected accuracy: 63.5-64.5%

### Player Models (5 files):
- `minutes_hybrid_1947_2026.pkl`
- `points_hybrid_1947_2026.pkl`
- `rebounds_hybrid_1947_2026.pkl`
- `assists_hybrid_1947_2026.pkl`
- `threes_hybrid_1947_2026.pkl`
- All with 24-dimensional TabNet embeddings + LightGBM
- Points MAE: ~2.0-2.1 (22% improvement over baseline)

### Features Included:
- **Raw stats**: 40 box score features from 1947-2026
- **Basketball Reference priors**: 68 advanced stats (already merged!)
- **Phase 1-6 features**: 150+ engineered features (built during training)
- **Total**: ~235 features per prediction

---

## ‚è±Ô∏è Training Timeline

```
Time    Phase                               Duration
------  ----------------------------------  ---------
0:00    Cell 1: Setup                       2 min
0:02    Cell 2: Training starts
0:02    Load aggregated data                1 min
0:03    Build Phase 1 features              15 min
0:18    Build Phase 2-6 features            75 min
1:33    Train Game: Moneyline               30 min
2:03    Train Game: Spread                  30 min
2:33    Train Player: Minutes               60 min
3:33    Train Player: Points                70 min
4:43    Train Player: Rebounds              60 min
5:43    Train Player: Assists               60 min
6:43    Train Player: Threes                50 min
7:33    Training complete
7:33    Cell 3: Validation                  1 min
7:34    Cell 4: Summary                     10 sec
7:35    Cell 5: Package + Download          1 min
------
7:36    DONE
```

**Total: ~7.5 hours on P100, ~8.5 hours on T4**

---

## üöÄ Quick Start Instructions

### 1. Setup Kaggle Notebook
- Create new notebook at kaggle.com/code
- Add "meeper" dataset: Click "Add Data" ‚Üí search "meeper" ‚Üí Add
- Enable GPU: Settings ‚Üí Accelerator ‚Üí GPU P100 or T4
- Set Internet: On (needed to clone GitHub repo)

### 2. Run Cells
- **Cell 1** (Setup): 2 minutes
  - Installs packages
  - Clones GitHub repo
  - Verifies dataset exists
  
- **Cell 2** (Training): 7-8 hours
  - This is the main training cell
  - You can close your browser after starting!
  - Kaggle keeps running in background
  
- **Cell 3** (Validation): 1 minute
  - Checks that 24-dim embeddings work
  
- **Cell 4** (Summary): 10 seconds
  - Lists all trained models
  
- **Cell 5** (Download): 1 minute
  - Packages models into zip file
  - Download from Output tab

### 3. Download Trained Models
- Look at right sidebar ‚Üí Output tab
- Find `nba_models_trained.zip`
- Click download icon (‚Üì)
- Extract to your local `nba_predictor/` folder

---

## üí° Tips

**Save Money/Time:**
- Close browser after starting Cell 2 (training continues!)
- Come back 8 hours later to download models
- Kaggle gives you 30 hours/week GPU time for free

**Monitor Progress:**
- Keep notebook tab open to see real-time output
- Or check back periodically to see which model is training

**If Training Fails:**
- Check you added "meeper" dataset (Add Data menu)
- Verify GPU is enabled (Settings ‚Üí Accelerator)
- Try running cells 1-2 again (it's safe to restart)

**Expected Output:**
```
Loading aggregated data... 
  Loaded 1,632,909 player-games (1947-2026)

Building Phase 1 features...
  Rolling averages (L3, L5, L10)
  Per-minute rates
  
Training POINTS model...
  TabNet training (GPU)... 15 min
  Extracting 24-dim embeddings... 1 min
  LightGBM training... 2 min
  MAE: 2.05 (baseline: 2.65) ‚Üê 22.6% improvement!
  ‚úÖ Saved: models/points_hybrid_1947_2026.pkl
```

---

## ‚ùì Troubleshooting

**"Dataset not found"**
- Add "meeper" dataset: Add Data ‚Üí search "meeper" ‚Üí Add
- Restart cell 1

**"No GPU available"**
- Settings ‚Üí Accelerator ‚Üí GPU (P100 or T4)
- Restart notebook

**"Session timeout"**
- Kaggle may disconnect after 12 hours (free tier)
- Models are saved incrementally, won't lose progress
- Just re-run from where it stopped

**"Out of memory"**
- Shouldn't happen (peak RAM ~2 GB, Kaggle has 13 GB)
- If it does: Settings ‚Üí Restart ‚Üí Re-run cells

---

## üìä About the Dataset

**Source**: Kaggle dataset "meeper" (uploaded by you)
- **File**: aggregated_nba_data.csv.gzip
- **Size**: ~100-150 MB compressed
- **Contents**: Raw NBA box scores + Basketball Reference priors
- **Date Range**: November 1946 ‚Üí November 2025 (80 seasons)
- **Records**: 1.6 million player-game statistics

**What's Pre-Computed:**
- ‚úÖ Raw box scores (points, rebounds, assists, etc.)
- ‚úÖ Basketball Reference priors (68 advanced stats)
- ‚úÖ Player name fuzzy matching

**What Builds During Training:**
- Phase 1-6 features (rolling averages, momentum, etc.)
- This takes ~90 minutes

---

## üéØ Expected Performance

**Game Models:**
- Moneyline accuracy: 63.5-64.5%
- Spread RMSE: ~10.2 points
- Beats Vegas vig (52.4%)

**Player Models:**
- Points MAE: ~2.0-2.1 (baseline: 2.65)
- Minutes MAE: ~4.5 (baseline: 6.0)
- Rebounds MAE: ~1.8 (baseline: 2.3)
- Assists MAE: ~1.5 (baseline: 2.0)
- Threes MAE: ~0.9 (baseline: 1.2)

**Why It's Good:**
- 10x more data than 2002+ cutoff (1.6M vs 125K games)
- Neural hybrid architecture (TabNet + LightGBM)
- 24-dimensional embeddings capture player patterns
- Uncertainty quantification via sigma models

---

Ready to start training! Run Cell 1 when you're ready.