# 🏀 NBA Predictor - Complete Cloud Training
## Neural Network + Full Features + GPU Acceleration

### What This Notebook Does:
✅ Trains with ALL features (Team priors, Player priors, Optimization features, Phase 7)  
✅ Neural Network (TabNet + LightGBM) EMBEDDED (not optional)  
✅ GPU-accelerated for faster training (~20-30 min instead of hours)  
✅ Downloads trained models to your computer  
✅ Shows accuracy metrics for moneyline AND spread  
✅ **FIXED**: Player data filtering bug (was returning 0 rows)

### 🔥 Latest Update (v2.1):
**CRITICAL FIX**: Resolved type mismatch bug causing player data to be filtered to 0 rows.
- Fixed: train_auto.py lines 5018 & 5040 (season type conversion)
- Added: diagnose_player_filter.py (pre-flight check)
- Fuzzy matching: Already included (name normalization + season offset)

### Steps:
1. **RUN TEST FIRST** (verify GPU and environment)
2. Upload your `priors_data.zip` and `PlayerStatistics.csv.zip`
3. Run all cells (Runtime → Run all)
4. Download your trained models
5. Done!

In [None]:
# ============================================================
# STEP 0: ENVIRONMENT & DATA TEST (RUN THIS FIRST!)
# ============================================================
# This verifies GPU AND tests player data preparation
# CRITICAL: If player data fails, no player models = system useless!

print("🔍 Testing Colab Environment...\n")

# Check GPU
import torch
gpu_available = torch.cuda.is_available()
print(f"🎮 GPU Available: {gpu_available}")

if gpu_available:
    print(f"   ✅ GPU: {torch.cuda.get_device_name(0)}")
else:
    print("   ❌ NO GPU - Go to Runtime → Change runtime type → GPU")

# Check disk space
import shutil
total, used, free = shutil.disk_usage('/')
print(f"\n💾 Disk: {free / (1024**3):.1f} GB free")

print("\n" + "="*70)
print("📥 UPLOADING TEST DATA (for data preparation check)")
print("="*70)
print("Upload PlayerStatistics.csv.zip (39.5 MB) to test player data...")
print("This will verify the data can be loaded and prepared correctly.")
print("\nIf upload fails, check:")
print("  1. File is named exactly 'PlayerStatistics.csv.zip'")
print("  2. File is the compressed version (39.5 MB, not 302 MB)")
print("  3. File was created with compress_csvs_for_colab.py")

from google.colab import files
import os

uploaded = files.upload()

# Extract if zip (-o overwrites an earlier extraction instead of prompting)
if os.path.exists('PlayerStatistics.csv.zip'):
    print("\n📦 Extracting PlayerStatistics.csv...")
    !unzip -o -q PlayerStatistics.csv.zip
    !rm PlayerStatistics.csv.zip

# Clone repo for test script
if not os.path.exists('/content/meep'):
    print("\n📥 Cloning repo for test scripts...")
    !git clone -q https://github.com/tyriqmiles0529-pixel/meep.git /content/meep

# Copy the CSV on every run, even when the repo was already cloned
!cp PlayerStatistics.csv /content/meep/ 2>/dev/null

%cd /content/meep

# Install minimal dependencies for test
print("\n📦 Installing test dependencies...")
!pip install -q pandas numpy

# Run the player data test
print("\n" + "="*70)
print("🧪 RUNNING PLAYER DATA PREPARATION TEST")
print("="*70)
!python3 test_priors_merge.py

print("\n" + "="*70)
print("VERDICT")
print("="*70)
print("\n⚠️ Check output above for:")
print("  ✓ personId column found")
print("  ✓ home column found with valid data")
print("  ✓ date column found and parseable")
print("  ✓ season_end_year populated (>50%)")
print("\nIf ANY are missing:")
print("  ❌ PLAYER MODELS WILL NOT TRAIN")
print("  → Fix the data issues before proceeding")
print("  → Check PlayerStatistics.csv has correct columns")
print("\nIf all checks passed:")
print("  ✅ PROCEED - Upload priors_data.zip in next cell")
print("="*70)

In [None]:
# ============================================================
# STEP 1: Upload Your Data Files
# ============================================================
# You need 2 files:
# 1. priors_data.zip (6 CSVs with Basketball Reference stats)
# 2. PlayerStatistics.csv.zip (39.5 MB - COMPRESSED for faster upload!)

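from google.colab import files
import os

# Step 0 leaves the working directory at /content/meep, which Step 2 deletes
# before re-cloning. Upload to /content so the files survive.
%cd /content

print("📤 Upload priors_data.zip:")
uploaded = files.upload()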

# Extract priors
!rm -rf /content/priors_data
!unzip -q priors_data.zip -d /content

# Verify priors
csv_files = !ls /content/priors_data/*.csv 2>/dev/null | wc -l
csv_count = int(csv_files[0]) if csv_files else 0
if csv_count >= 6:
    print(f"✅ Priors data uploaded! Found {csv_count} CSV files")
    !ls /content/priors_data/*.csv
else:
    print(f"⚠️ Only found {csv_count} files. Expected 6+ CSV files.")
    print("Make sure you uploaded the correct priors_data.zip")

print("\n📤 Upload PlayerStatistics.csv.zip (39.5 MB - 87% smaller!):")
print("This compressed file contains 20+ years of player game logs")
print("Upload time: ~8 seconds instead of ~61 seconds!")
uploaded = files.upload()

# Extract PlayerStatistics (-o overwrites the copy left by Step 0 instead of prompting)
if os.path.exists('PlayerStatistics.csv.zip'):
    print("\n📦 Extracting PlayerStatistics.csv...")
    !unzip -o -q PlayerStatistics.csv.zip
    !rm PlayerStatistics.csv.zip
    size_mb = os.path.getsize('PlayerStatistics.csv') / 1024 / 1024
    print(f"✅ PlayerStatistics.csv extracted! ({size_mb:.1f} MB)")
elif os.path.exists('PlayerStatistics.csv'):
    size_mb = os.path.getsize('PlayerStatistics.csv') / 1024 / 1024
    print(f"✅ PlayerStatistics.csv uploaded! ({size_mb:.1f} MB)")
else:
    print("⚠️ PlayerStatistics.csv not found!")
    print("Player models will only train on current season data.")

In [None]:
# ============================================================
# STEP 2: Install Dependencies & Download Code
# ============================================================

print("📦 Installing packages...")
!pip install -q nba-api kagglehub pytorch-tabnet lightgbm scikit-learn pandas numpy tqdm

print("\n📥 Forcing fresh clone from GitHub (latest fixes)...")
import os

# Change to parent directory first, then remove and clone
%cd /content
!rm -rf /content/meep
!git clone https://github.com/tyriqmiles0529-pixel/meep.git /content/meep

%cd /content/meep

# Show current commit to verify latest code
print("\n📍 Current commit:")
!git log -1 --oneline

print("\n✅ Code downloaded!")
print(f"📁 Working directory: {os.getcwd()}")

# Copy uploaded files to working directory
!cp /content/PlayerStatistics.csv . 2>/dev/null || echo "No PlayerStatistics.csv to copy"
!cp -r /content/priors_data . 2>/dev/null || echo "No priors_data to copy"

# Check GPU
import torch
print(f"\n🎮 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

# Run diagnostic tests
print("\n🔍 Running data preparation test...")
!python3 test_priors_merge.py

print("\n🔍 Running player data filtering test (critical fix verification)...")
!python3 diagnose_player_filter.py

In [None]:
# ============================================================
# STEP 3: Train Models with Neural Network + Full Features
# ============================================================

print("🚀 Starting training...")
print("⏱️  This will take 20-30 minutes with GPU")
print("☕ Get coffee!\n")

# Run training with ALL features + local PlayerStatistics.csv
!python3 train_auto.py \
    --priors-dataset /content/priors_data \
    --player-csv /content/PlayerStatistics.csv \
    --verbose \
    --fresh \
    --neural-device gpu \
    --neural-epochs 50

print("\n✅ TRAINING COMPLETE!")

In [None]:
# ============================================================
# STEP 4: Display Training Metrics
# ============================================================

print("📊 Training Metrics:\n")
!python3 show_metrics.py

# Show file structure
print("\n📁 Trained Models:")
!ls -lh models/*.pkl models/*.json 2>/dev/null || echo "No models found"

print("\n📊 Model Cache (windowed models):")
!ls -lh model_cache/*.pkl 2>/dev/null || echo "No cached models found"

In [None]:
# ============================================================
# STEP 5: Download Trained Models to Your Computer
# ============================================================

from google.colab import files
import os

print("📦 Preparing models for download...")

# Zip everything
!zip -r nba_models_trained.zip models/ model_cache/ -x '*.git*'

print("\n💾 Downloading models to your computer...")
files.download('nba_models_trained.zip')

print("\n" + "="*80)
print("✅ DONE!")
print("="*80)
print("\nNext steps:")
print("1. Extract nba_models_trained.zip to your local nba_predictor folder")
print("2. Run predictions locally with the new models")
print("3. Models include:")
print("   • Moneyline & Spread models (with accuracy metrics)")
print("   • Player prop models (Points, Rebounds, Assists, 3PM, Minutes)")
print("   • Neural hybrid models (TabNet + LightGBM)")
print("   • Ensemble models (Ridge + Elo + Four Factors)")
print("\n🎯 Your models are now trained on 20+ years of data with:")
print("   ✓ Team statistical priors (O/D ratings, pace, four factors)")
print("   ✓ Player statistical priors (~68 features from Basketball Reference)")
print("   ✓ Optimization features (momentum, consistency, fatigue)")
print("   ✓ Phase 7 features (situational context, adaptive weighting)")
print("   ✓ Neural network embeddings (deep feature learning)")

---

## 🔧 Advanced: Run Custom Predictions in Colab

Want to test predictions right here instead of downloading? Run the cells below:

In [None]:
# Test predictions for today's games
!python3 -c "
from player_ensemble_enhanced import predict_all_props
import json

predictions = predict_all_props()
print(json.dumps(predictions, indent=2))
"

---

## 📊 Accuracy Metrics Explained

### Moneyline Model:
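- **Log Loss**: Lower is better (0.65 = good, 0.55 = excellent). Predicting 50% for every game scores about 0.693.
- **Brier Score**: Mean squared error of the predicted win probabilities, lower is better (0.22 = good, 0.18 = excellent). Predicting 50% for every game scores 0.25.
- **Accuracy**: % of games predicted correctly (60%+ is strong; whether it is profitable depends on the odds you bet at)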
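
For reference, the three moneyline metrics can be computed by hand as in the minimal sketch below. The function names and the sample outcomes/probabilities are made up for illustration; `show_metrics.py` may compute them differently.

```python
import math

def log_loss(y_true, p_home, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, p in zip(y_true, p_home):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, p_home):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_home)) / len(y_true)

def accuracy(y_true, p_home, threshold=0.5):
    """Share of games where the predicted side (p >= threshold) won."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_home))
    return hits / len(y_true)

# Hypothetical data: 1 = home team won, probabilities are made-up model outputs.
outcomes = [1, 0, 1, 1, 0]
probs = [0.70, 0.40, 0.55, 0.80, 0.65]

print(f"log loss: {log_loss(outcomes, probs):.3f}")
print(f"brier:    {brier_score(outcomes, probs):.3f}")
print(f"accuracy: {accuracy(outcomes, probs):.0%}")
```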

### Spread Model:
- **RMSE**: Root Mean Squared Error (10-12 points = good)
- **MAE**: Mean Absolute Error (8-10 points = good)
- **Coverage**: % of predictions within ±5 points (70%+ = excellent)
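
The spread metrics follow the same pattern. This is a small sketch with a hypothetical `spread_metrics` helper and made-up margins (home score minus away score), not the project's own evaluation code.

```python
import math

def spread_metrics(actual_margin, predicted_margin, tolerance=5.0):
    """RMSE, MAE, and share of predictions within +/- tolerance points."""
    errors = [p - a for a, p in zip(actual_margin, predicted_margin)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    coverage = sum(abs(e) <= tolerance for e in errors) / n
    return {"rmse": rmse, "mae": mae, "coverage": coverage}

# Hypothetical final margins and model predictions for four games.
actual = [5, -3, 10, -8]
predicted = [2, 1, 4, -6]

metrics = spread_metrics(actual, predicted)
print({name: round(value, 2) for name, value in metrics.items()})
```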

### Player Props:
- **RMSE**: Root Mean Squared Error for Points/Rebounds/Assists (6-8 = good for points)
- **MAE**: Mean Absolute Error (4-6 = good for points)
- **Hit Rate**: % of over/under picks that win (55%+ = profitable; break-even at standard -110 odds is about 52.4%)
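
Whether a hit rate is profitable depends on the price of each bet. The sketch below works out the break-even hit rate and the expected return for American odds; the helper names are hypothetical, and -110 is assumed here as a typical over/under price.

```python
def break_even_rate(american_odds):
    """Win rate needed to break even at the given American odds."""
    if american_odds < 0:
        risk, win = -american_odds, 100
    else:
        risk, win = 100, american_odds
    return risk / (risk + win)

def expected_roi(hit_rate, american_odds):
    """Expected profit per unit risked at a given hit rate."""
    if american_odds < 0:
        payout = 100 / -american_odds
    else:
        payout = american_odds / 100
    return hit_rate * payout - (1 - hit_rate)

print(f"break-even at -110: {break_even_rate(-110):.1%}")
print(f"ROI at 55% hit rate: {expected_roi(0.55, -110):.1%}")
```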

---

## ❓ Troubleshooting

### "Loaded 0 player-games for window"
✅ **FIXED in v2.1!** This was a type mismatch bug (float64 vs int).
- The fix is in the latest code from GitHub
- diagnose_player_filter.py will verify the fix worked
- You should now see: "Loaded 245,892 player-games for window" (or similar)
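
To illustrate the kind of normalization the fix applies, here is a minimal sketch that converts a float64 season column to pandas' nullable `Int64` dtype before filtering by a set of integer seasons. The DataFrame, the `window_seasons` set, and the variable names are hypothetical; the exact comparison in train_auto.py is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical player log: one missing season forces the column to float64.
player_logs = pd.DataFrame({
    "personId": [101, 102, 103, 104],
    "season_end_year": [2023.0, 2024.0, np.nan, 2019.0],
})
window_seasons = {2023, 2024}  # integer seasons for one training window

# Normalize to the nullable integer dtype so seasons are true integers
# and missing values stay missing instead of becoming NaN floats.
player_logs["season_end_year"] = player_logs["season_end_year"].astype("Int64")

in_window = player_logs["season_end_year"].isin(list(window_seasons))
window_df = player_logs[in_window]
print(f"Loaded {len(window_df):,} player-games for window")
```

Rows with a missing season are excluded by the filter rather than raising an error.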

### "No models found"
- Training failed - check the error output above
- Most common: priors_data.zip not uploaded correctly

### "GPU not available"
- Go to Runtime → Change runtime type → Hardware accelerator → GPU
- Training will still work on CPU (just slower)
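
The Step 3 command hard-codes `--neural-device gpu`. One way to fall back automatically is sketched below. It assumes train_auto.py also accepts `cpu` as a value for that flag, which is not confirmed here; `pick_neural_device` is a hypothetical helper.

```python
def pick_neural_device():
    """Return "gpu" when PyTorch can see a CUDA device, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"

device = pick_neural_device()

# Same flags as Step 3; only the device value changes.
# ASSUMPTION: "cpu" is a valid value for --neural-device.
train_cmd = [
    "python3", "train_auto.py",
    "--priors-dataset", "/content/priors_data",
    "--player-csv", "/content/PlayerStatistics.csv",
    "--verbose",
    "--fresh",
    "--neural-device", device,
    "--neural-epochs", "50",
]
print(" ".join(train_cmd))
```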

### "Out of memory"
- Restart runtime: Runtime → Restart runtime
- Then re-run from Step 1

### Player prior merging issues
- Fuzzy matching is already enabled (name normalization + season offset)
- Check diagnose_player_filter.py output for merge statistics
- Sample player IDs should appear in diagnostic output

### Need help?
- Check QUICK_REFERENCE.txt in downloaded files
- Or create a GitHub issue at: https://github.com/tyriqmiles0529-pixel/meep/issues

---

## 🎯 Why This Works Better Than Local Training:

1. **GPU Acceleration**: 5-10x faster than CPU
2. **More RAM**: 12GB+ vs your laptop's limits
3. **No System Slowdown**: Your computer stays responsive
4. **Free**: Google Colab is free for up to 12 hours/session
5. **Consistent Environment**: No dependency conflicts

---

## 📈 Model Architecture (What You're Training):

### Game Models:
1. **Ridge Regression** (baseline)
2. **Dynamic Elo** (momentum-based ratings)
3. **Four Factors** (advanced stats)
4. **LightGBM** (gradient boosting)
5. **Meta-Learner** (combines all 4)
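
As background for the Elo component, this is the textbook Elo update, not the repo's "dynamic" variant (whose details are not shown in this notebook). The K-factor of 20 and the 100-point home-court bonus are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(home, away, home_won, k=20.0, home_advantage=100.0):
    """Return updated (home, away) ratings after one game."""
    exp_home = expected_score(home + home_advantage, away)
    delta = k * ((1.0 if home_won else 0.0) - exp_home)
    return home + delta, away - delta

# Two evenly rated teams on a neutral floor: the winner gains k/2 points.
new_home, new_away = update_elo(1500.0, 1500.0, home_won=True, home_advantage=0.0)
print(new_home, new_away)
```

The update is zero-sum: whatever one team gains, the other loses, so the league-wide average rating stays constant.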

### Player Models:
1. **TabNet** (deep learning for feature extraction)
2. **LightGBM** (using raw + deep features)
3. **Sigma Model** (uncertainty quantification)
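
One common use of a predicted uncertainty is turning a point prediction into an over/under probability. The sketch below assumes the stat is roughly normally distributed around the predicted mean with the predicted sigma; that distributional assumption and the `prob_over` helper are illustrative, not taken from the project code.

```python
import math

def prob_over(line, mean, sigma):
    """P(stat > line) for a normal distribution with the given mean and sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (line - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical prop: model predicts 25 points with sigma 5, line is 22.5.
print(f"P(over 22.5) = {prob_over(22.5, 25.0, 5.0):.3f}")
```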

### Feature Pipeline:
- **Phase 1-5**: Basic stats + rolling averages + team context
- **Phase 6**: Optimization (momentum, consistency, fatigue)
- **Phase 7**: Situational (schedule density, opponent history)
- **Basketball Reference Priors**: Historical statistical context (~68 features)
- **Fuzzy Matching**: Name normalization + season offset (±1 year)

**Total Features**: ~120-150 per model
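
The fuzzy matching idea (name normalization plus a ±1 season offset) can be sketched as follows. The normalization rules, the `find_prior` helper, and the sample prior values are all hypothetical; the actual matching logic lives in the repo and may differ.

```python
import re
import unicodedata

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize_name(name):
    """Lowercase, strip accents and punctuation, drop generational suffixes."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    tokens = re.sub(r"[^a-z\s]", "", ascii_name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def find_prior(priors, name, season, max_offset=1):
    """Look up (name, season), then fall back to neighbouring seasons."""
    key = normalize_name(name)
    offsets = [0]
    for step in range(1, max_offset + 1):
        offsets += [-step, step]  # prefer the earlier season, then the later one
    for offset in offsets:
        hit = priors.get((key, season + offset))
        if hit is not None:
            return hit
    return None

# Made-up prior table keyed by (normalized name, season_end_year).
priors = {(normalize_name("Luka Doncic"), 2023): {"usage_rate": 0.37}}

print(find_prior(priors, "Luka Dončić", 2024))  # matched via the -1 season offset
```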

---

## 🔄 Re-training Schedule:

- **Daily**: Not needed (models are stable)
- **Weekly**: Run for current season updates
- **Monthly**: Full retrain recommended
- **Mid-Season**: After All-Star break (team dynamics change)
- **Playoffs**: Retrain with playoff-specific weights

You can upload your previous model_cache/ to speed up retraining (only new data is trained).

---

## 🐛 Changelog

### v2.1 (November 2025) - CRITICAL FIX
- **Fixed**: Player data filtering bug (0 rows issue)
  - Root cause: Type mismatch (float64 season vs int set)
  - Solution: Added .astype('Int64') in train_auto.py
- **Added**: diagnose_player_filter.py diagnostic tool
- **Verified**: Fuzzy matching already implemented

### v2.0 (November 2025)
- Neural Network (TabNet) embedded as default
- Full feature pipeline (Phase 1-7)
- GPU acceleration support
- Compressed CSV upload (87% smaller)

---

**Version**: 2.1 (Player Data Fix)

**Last Updated**: November 6, 2025