# Feature Importance Analysis

This notebook analyzes what drives our NBA player prop predictions using:
- **LightGBM native importance** (gain, split)
- **SHAP values** for per-prediction feature attributions
- **Feature category analysis** to understand model behavior

**Key Question**: Which features most influence whether a player goes OVER their prop line?

In [None]:
import pickle
import json
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: SHAP for interpretability
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not installed. Run: pip install shap")

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

## 1. Load Production Models

In [None]:
MODELS_DIR = Path("../nba/models/saved_xl")
MARKET = "points"  # Change to 'rebounds' to analyze that market

# Load model components
with open(MODELS_DIR / f"{MARKET}_market_regressor.pkl", "rb") as f:
    regressor = pickle.load(f)

with open(MODELS_DIR / f"{MARKET}_market_classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

with open(MODELS_DIR / f"{MARKET}_market_features.pkl", "rb") as f:
    feature_names = pickle.load(f)

with open(MODELS_DIR / f"{MARKET}_market_metadata.json", "r") as f:
    metadata = json.load(f)

print(f"Market: {MARKET.upper()}")
print(f"Trained: {metadata['trained_date']}")
print(f"Features: {len(feature_names)}")
print(f"Architecture: {metadata['architecture']}")

## 2. LightGBM Native Feature Importance

LightGBM provides two importance metrics:
- **Gain**: Total gain (reduction in loss) from splits on this feature
- **Split**: Number of times this feature was used for splitting
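
To make the two definitions concrete, here is a minimal sketch that aggregates a hand-made split log. The log, its column names (`feature`, `gain`), and the feature name `ema_points_L3` are hypothetical and are not read from the production models; with a real LightGBM model the same two numbers come from the booster's `feature_importance(importance_type=...)` method.

```python
import pandas as pd

# Hypothetical split log: one row per split made anywhere in the ensemble.
# `gain` is the loss reduction achieved by that single split.
splits = pd.DataFrame({
    "feature": ["expected_diff", "ema_points_L3", "expected_diff",
                "is_home", "ema_points_L3", "is_home", "is_home"],
    "gain":    [120.0, 40.0, 95.0, 2.0, 35.0, 1.5, 1.0],
})

# Split importance = how often a feature is used; gain importance = summed gain.
importance = splits.groupby("feature")["gain"].agg(split="count", gain="sum")
print(importance.sort_values("gain", ascending=False))
```

In this toy log `is_home` is split on most often but contributes the least gain, which is why the two metrics can rank features very differently.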

In [None]:
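# Gain-based importance from both models (assumes LGBMRegressor / LGBMClassifier).
# feature_importances_ reports split counts unless the model was built with
# importance_type='gain', so request gain explicitly from the booster.
reg_importance = regressor.booster_.feature_importance(importance_type='gain')
clf_importance = classifier.booster_.feature_importance(importance_type='gain')

# Create importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'regressor_importance': reg_importance,
    'classifier_importance': clf_importance
})

# Combined importance: average each head's share of its own total, because raw
# gain is measured in each model's loss units and is not comparable across heads
reg_share = importance_df['regressor_importance'] / importance_df['regressor_importance'].sum()
clf_share = importance_df['classifier_importance'] / importance_df['classifier_importance'].sum()
importance_df['combined'] = (reg_share + clf_share) / 2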

importance_df = importance_df.sort_values('combined', ascending=False)
importance_df.head(20)

In [None]:
# Plot top 25 features
fig, axes = plt.subplots(1, 2, figsize=(16, 10))

# Regressor importance
top_reg = importance_df.nlargest(25, 'regressor_importance')
axes[0].barh(top_reg['feature'], top_reg['regressor_importance'], color='steelblue')
axes[0].set_xlabel('Importance (Gain)')
axes[0].set_title(f'{MARKET.upper()} Regressor - Top 25 Features')
axes[0].invert_yaxis()

# Classifier importance
top_clf = importance_df.nlargest(25, 'classifier_importance')
axes[1].barh(top_clf['feature'], top_clf['classifier_importance'], color='darkorange')
axes[1].set_xlabel('Importance (Gain)')
axes[1].set_title(f'{MARKET.upper()} Classifier - Top 25 Features')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('feature_importance_native.png', dpi=150, bbox_inches='tight')
plt.show()

## 3. Feature Category Analysis

Group features by category to understand which types of information matter most.

In [None]:
def categorize_feature(name):
    """Assign a feature to a category by naming convention; the first matching rule wins."""
    if name.startswith(('ema_', 'fg_pct', 'ft_rate')):
        return 'Player Rolling Stats'
    elif name.startswith('h2h_'):
        return 'Head-to-Head History'
    elif name.startswith('prop_'):
        return 'Prop History'
    elif name.startswith('bp_'):
        return 'BettingPros Data'
    elif name.startswith('vegas_'):
        return 'Vegas Lines'
    elif 'deviation' in name or 'line' in name.lower() or 'book' in name:
        return 'Book Disagreement'
    elif 'team' in name or 'opp' in name or 'pace' in name:
        return 'Team Context'
    elif 'rest' in name or 'b2b' in name or 'travel' in name or 'game' in name:
        return 'Schedule/Game'
    elif name in ['is_home', 'expected_diff', 'starter_flag', 'position_encoded']:
        return 'Computed'
    else:
        return 'Other'

importance_df['category'] = importance_df['feature'].apply(categorize_feature)

# Aggregate by category
category_importance = importance_df.groupby('category').agg({
    'combined': 'sum',
    'feature': 'count'
}).rename(columns={'feature': 'num_features'})

category_importance['avg_importance'] = category_importance['combined'] / category_importance['num_features']
category_importance = category_importance.sort_values('combined', ascending=False)
category_importance
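
Because the categorizer above returns the first rule that matches, rule order and loose substring checks decide where ambiguous names land. The sketch below uses a reduced, hypothetical rule list and made-up feature names to show the effect, and lists whatever falls through to `Other` so it can be reviewed by hand.

```python
# Reduced, hypothetical rule list: (substrings, category), checked in order.
RULES = [
    (("deviation", "line", "book"), "Book Disagreement"),
    (("team", "opp", "pace"), "Team Context"),
    (("rest", "b2b", "travel", "game"), "Schedule/Game"),
]

def categorize(name, rules=RULES):
    """Return the category of the first rule with a substring found in `name`."""
    lowered = name.lower()
    for substrings, category in rules:
        if any(s in lowered for s in substrings):
            return category
    return "Other"

# Made-up names chosen to expose ambiguous matches.
names = ["line_spread", "opp_pace", "days_rest", "team_baseline_pts", "minutes_proj"]
assigned = {n: categorize(n) for n in names}
unmatched = [n for n, c in assigned.items() if c == "Other"]

for n, c in assigned.items():
    print(f"{n:20s} -> {c}")
print("Falls through to Other:", unmatched)
```

Here `team_baseline_pts` is filed under Book Disagreement because `baseline` contains `line` and that rule is checked before the team rule. Running the same kind of check on the real `feature_names` is a cheap way to audit the category totals.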

In [None]:
# Visualize category importance
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Total importance by category
colors = plt.cm.Set2(np.linspace(0, 1, len(category_importance)))
axes[0].barh(category_importance.index, category_importance['combined'], color=colors)
axes[0].set_xlabel('Total Importance')
axes[0].set_title('Total Feature Importance by Category')
axes[0].invert_yaxis()

# Pie chart of importance distribution
axes[1].pie(category_importance['combined'], labels=category_importance.index, 
            autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Importance Distribution')

plt.tight_layout()
plt.savefig('feature_category_importance.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. SHAP Analysis (if available)

SHAP values provide interpretable, additive feature attributions that explain individual predictions.
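
Additivity means that a baseline value plus the per-feature attributions reproduces the model output for every row. The sketch below checks this on a toy linear model, where the SHAP value of feature `i` reduces to `w_i * (x_i - mean(x_i))` under the assumption of independent features. The weights and data are made up, and neither the `shap` library nor the production LightGBM models are involved.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))      # made-up feature matrix
w = np.array([2.0, -1.0, 0.5])     # made-up model weights
b = 0.3                            # made-up intercept

def predict(X):
    """Toy linear model standing in for a trained predictor."""
    return X @ w + b

# Baseline = average prediction over the data.
base_value = predict(X).mean()

# Linear-model SHAP values: weight times deviation from the feature mean.
phi = w * (X - X.mean(axis=0))

# Additivity: baseline + sum of attributions equals the prediction, row by row.
reconstructed = base_value + phi.sum(axis=1)
print(np.allclose(reconstructed, predict(X)))
```

Tree SHAP computes the attributions differently, but the same row-by-row identity is what makes the summary plots readable as contributions to a single prediction.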

In [None]:
if SHAP_AVAILABLE:
    # Rows to explain, read from the saved feature datasets
    sample_data_path = Path("../nba/features/datasets")

    try:
        # Sort so the same file is picked on every run
        candidates = sorted(sample_data_path.glob(f"*{MARKET.upper()}*.csv"))
        if not candidates:
            raise FileNotFoundError(f"No {MARKET.upper()} dataset in {sample_data_path}")
        df = pd.read_csv(candidates[0], nrows=1000)  # Sample for speed

        # Get feature columns only
        X = df[feature_names].values

        print(f"Loaded {len(X)} rows; explaining the first {min(200, len(X))}")
        print("Computing SHAP values (this may take a minute)...")

        # Create SHAP explainer (no background data is passed to it)
        explainer = shap.TreeExplainer(classifier)
        shap_values = explainer.shap_values(X[:200])  # Subset for speed

        # Some SHAP versions return (n_samples, n_features, n_classes);
        # keep the positive class so the plots below receive a 2-D array
        if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
            shap_values = shap_values[:, :, 1]

        print("SHAP values computed!")
    except Exception as e:
        print(f"Could not compute SHAP values: {e}")
        SHAP_AVAILABLE = False
else:
    print("SHAP not available. Install with: pip install shap")

In [None]:
if SHAP_AVAILABLE and 'shap_values' in dir():
    # SHAP summary plot
    plt.figure(figsize=(12, 10))
    shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values, 
                      X[:200], 
                      feature_names=feature_names,
                      max_display=25,
                      show=False)
    plt.title(f'{MARKET.upper()} Classifier - SHAP Summary')
    plt.tight_layout()
    plt.savefig('shap_summary.png', dpi=150, bbox_inches='tight')
    plt.show()

In [None]:
if SHAP_AVAILABLE and 'shap_values' in dir():
    # SHAP bar plot (mean absolute SHAP values)
    plt.figure(figsize=(12, 10))
    shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values, 
                      X[:200], 
                      feature_names=feature_names,
                      plot_type='bar',
                      max_display=25,
                      show=False)
    plt.title(f'{MARKET.upper()} - Mean |SHAP| Values')
    plt.tight_layout()
    plt.savefig('shap_bar.png', dpi=150, bbox_inches='tight')
    plt.show()
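
The plotting cells above take `shap_values[1]` when the explainer returns a list. Depending on the installed SHAP release, a binary classifier's values may instead arrive as a single 2-D array or as a 3-D array with the class on the last axis. A small helper, sketched here with dummy arrays, can normalize all three layouts before plotting; the name `positive_class_shap` is hypothetical and not part of SHAP.

```python
import numpy as np

def positive_class_shap(values):
    """Return a 2-D (n_samples, n_features) array for the positive class."""
    if isinstance(values, list):      # one array per class
        return np.asarray(values[1])
    values = np.asarray(values)
    if values.ndim == 3:              # (n_samples, n_features, n_classes)
        return values[:, :, 1]
    return values                     # already (n_samples, n_features)

# Dummy arrays standing in for explainer output: 4 rows, 3 features, 2 classes.
per_class = [np.zeros((4, 3)), np.ones((4, 3))]
stacked = np.stack(per_class, axis=-1)

print(positive_class_shap(per_class).shape)
print(positive_class_shap(stacked).shape)
print(positive_class_shap(np.ones((4, 3))).shape)
```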

## 5. Key Insights

### Top Predictive Features

In [None]:
print("="*60)
print(f"TOP 15 MOST IMPORTANT FEATURES - {MARKET.upper()} MODEL")
print("="*60)

for rank, (_, row) in enumerate(importance_df.head(15).iterrows(), start=1):
    print(f"{rank:2d}. {row['feature']:40s} ({row['category']})")

print("\n" + "="*60)
print("CATEGORY RANKING")
print("="*60)

total_importance = category_importance['combined'].sum()
for cat, row in category_importance.iterrows():
    pct = row['combined'] / total_importance * 100
    print(f"{cat:25s}: {pct:5.1f}% ({int(row['num_features']):3d} features)")

## 6. Interpretation

### What This Tells Us About the Model

1. **`expected_diff` dominates** - The difference between the regressor's predicted value and the prop line is the strongest predictor in the classifier. This is consistent with the two-head architecture: the regressor's prediction minus the line gives the classifier a single, direct signal.

2. **Recent performance matters** - EMA rolling stats over the last 3 and 5 games (L3, L5) rank highly, so the model leans on recent games more than on longer windows.

3. **Book disagreement is valuable** - Line spread and book deviation features contribute significantly, which supports our multi-source line shopping approach.

4. **Head-to-head history helps** - H2H features rank among the top categories, suggesting that matchup-specific performance carries signal.

5. **Prop history provides edge** - Bayesian hit rates and historical patterns contribute to the predictions.
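
As a worked illustration of item 1, the sketch below computes `expected_diff` as the regressor's prediction minus the prop line and appends it to the classifier's inputs. The numbers are invented, and the way the production pipeline actually assembles its feature matrix is an assumption.

```python
import numpy as np

# Invented values for three player props.
predicted_points = np.array([24.5, 18.0, 30.2])   # stand-in for regressor output
prop_line = np.array([22.5, 19.5, 30.5])          # sportsbook line

# Positive values lean OVER, negative values lean UNDER.
expected_diff = predicted_points - prop_line

# Assumed layout: base features with expected_diff appended as the last column.
base_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
classifier_input = np.column_stack([base_features, expected_diff])

print(expected_diff)
print(classifier_input.shape)
```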

### Actionable Insights

- **Trust high `expected_diff`**: When the model predicts significantly above or below the line, the classifier is relying on its strongest signal, so higher confidence is warranted
- **Line spread matters**: Props where books disagree (high spread) give the model more signal to work with
- **Recent form > season averages**: L3/L5 stats carry more importance in the model than L20 stats
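
One way to check the last point against the importance table is to sum importance by rolling-window length. The sketch below assumes the window is encoded as an `_L<n>` suffix in the feature name (for example `ema_points_L3`); that naming scheme, the feature names, and the importance values are all hypothetical.

```python
import pandas as pd

# Hypothetical importance table; in the notebook this would be importance_df.
toy = pd.DataFrame({
    "feature": ["ema_points_L3", "ema_minutes_L3", "ema_points_L5",
                "ema_points_L20", "is_home"],
    "combined": [0.30, 0.10, 0.25, 0.05, 0.02],
})

# Extract the window length from the assumed `_L<n>` suffix; NaN when absent.
toy["window"] = toy["feature"].str.extract(r"_L(\d+)$", expand=False).astype(float)

# Total importance per window length, shortest window first.
by_window = (toy.dropna(subset=["window"])
                .groupby("window")["combined"]
                .sum()
                .sort_index())
print(by_window)
```

Summing favors windows that have more features, so comparing the per-window mean alongside the sum gives a fairer read.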