# Gridiron Guru Feature Engineering - Google Colab

This notebook demonstrates how to use the Gridiron Guru feature engineering pipeline in Google Colab for model training.

## What This Does

1. **Loads 2024 season-end data** to create realistic team baselines
2. **Generates 28 features** that match the trained ML model expectations
3. **Shows team differentiation** (KC: 88.2% win rate, DET: +22.0 point diff, etc.)
4. **Creates feature vectors** ready for model training or prediction

## Setup Instructions

1. Upload your JSON data files to Colab:
   - `game_log.json` (historical game data 2008-2024)
   - `season_data_by_team.json` (team statistics by season/week)
2. Run the cells below in order
3. Use the feature engineering functions for your model training
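
Before running the loading cells, it can help to confirm that both uploads actually landed in the working directory. The following is a minimal sketch of such a check; `find_missing_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration and are not part of the Gridiron Guru pipeline.

```python
from pathlib import Path

# File names taken from the setup list above
REQUIRED_FILES = ["game_log.json", "season_data_by_team.json"]


def find_missing_files(filenames, directory="."):
    """Return the names in `filenames` that do not exist in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print(f"Missing files, upload them before continuing: {missing}")
else:
    print("All required data files are present")
```

The check only looks at file presence; it does not validate the JSON contents.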


In [None]:
# Install required packages
!pip install pandas numpy scikit-learn xgboost joblib


In [None]:
# Load the feature engineering code.
# Upload gridiron_guru_feature_engineering.py to the Colab working directory first
# (or paste its contents into this cell instead).

# Run the file in the notebook namespace so its functions and classes are available below
with open('gridiron_guru_feature_engineering.py') as f:
    exec(f.read())


## Step 1: Load Data Files

First, upload your JSON data files to Colab and load them:


In [None]:
# Load the data files
game_log = load_game_log_data("game_log.json")
season_data = load_season_data("season_data_by_team.json")

print(f"Loaded {len(game_log)} games from game log")
print(f"Loaded {len(season_data)} team records from season data")


## Step 2: Create 2024 Baseline Data

This builds the per-team 2024 baselines, such as win percentage and point differential, that differentiate one team from another:


In [None]:
# Create 2024 baseline data (this shows the team differentiation)
baseline_data = load_2024_baseline_data(season_data)

print(f"\nLoaded 2024 baseline for {len(baseline_data)} teams")
print("\nTeam differentiation examples:")
for team_name, stats in list(baseline_data.items())[:10]:
    print(f"  {team_name}: {stats.win_percentage:.1%} win rate, {stats.point_differential:+.1f} point diff")


## Step 3: Validate Feature Engineering

Check that the feature engineering is working correctly:


In [None]:
# Validate feature engineering
validate_feature_engineering(baseline_data, game_log)


## Step 4: Create Sample Prediction

Generate features for a sample game to see the feature engineering in action:


In [None]:
# Create a sample prediction
sample_features = create_sample_prediction(baseline_data, game_log)

print(f"\nGenerated {len(sample_features)} features for model input")
print("\nFeature values:")
for name, value in sample_features.items():
    print(f"  {name}: {value:.4f}")
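
A quick way to confirm that a generated feature dictionary lines up with the 28 names the model expects is to compare key sets. The sketch below assumes the feature dictionary is keyed by the same names as the `expected_features` list in Step 6; `compare_feature_names` is a hypothetical helper, and `demo_features` is a stand-in for `sample_features` so the cell runs on its own.

```python
# Feature names copied from the expected_features list in Step 6 of this notebook
EXPECTED_FEATURES = [
    "win_pct_diff", "point_diff_diff", "off_epa_diff", "off_epa_ratio",
    "def_epa_diff", "def_epa_ratio", "pass_epa_diff", "pass_def_epa_diff",
    "rush_epa_diff", "rush_def_epa_diff", "turnover_margin_diff", "red_zone_eff_diff",
    "third_down_diff", "sack_rate_diff", "sos_diff", "recent_form_diff",
    "pythagorean_diff", "luck_diff", "h2h_win_pct", "h2h_avg_point_diff",
    "h2h_games_count", "home_field_advantage", "rest_advantage", "division_game",
    "playoff_implications", "early_season", "mid_season", "late_season",
]


def compare_feature_names(features, expected=EXPECTED_FEATURES):
    """Return (missing, unexpected) feature names for one feature dictionary."""
    present = set(features)
    missing = [name for name in expected if name not in present]
    unexpected = sorted(present - set(expected))
    return missing, unexpected


# Hypothetical stand-in for sample_features (all values are placeholders)
demo_features = {name: 0.0 for name in EXPECTED_FEATURES}
missing, unexpected = compare_feature_names(demo_features)
print(f"Missing: {missing}, unexpected: {unexpected}")
```

In the notebook itself, passing `sample_features` instead of `demo_features` would report any names that the pipeline did not produce or that the model does not expect.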


## Step 5: Create Features for Multiple Games

Here's how to create features for multiple games (useful for training data):


In [None]:
def create_training_features_for_games(baseline_data, game_log, games_list):
    """
    Create features for multiple games.
    
    Args:
        baseline_data: 2024 baseline team data
        game_log: Historical game log
        games_list: List of tuples (home_team, away_team, week)
    
    Returns:
        List of feature dictionaries
    """
    all_features = []
    
    for home_team_name, away_team_name, week in games_list:
        # Find team stats in baseline data by abbreviation.
        # Two independent checks, so one match never masks the other.
        home_team = None
        away_team = None
        
        for stats in baseline_data.values():
            if stats.team_abbr == home_team_name:
                home_team = stats
            if stats.team_abbr == away_team_name:
                away_team = stats
        
        if home_team is not None and away_team is not None:
            game_context = GameContext(
                game_id=f"{home_team_name}_{away_team_name}_{week}",
                season=2025,
                week=week,
                home_team=home_team_name,
                away_team=away_team_name
            )
            
            features = create_prediction_features(home_team, away_team, game_context, game_log)
            all_features.append(features)
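        else:
            # Name only the abbreviation(s) that were not found
            missing = [name for name, found in ((home_team_name, home_team), (away_team_name, away_team)) if found is None]
            print(f"Warning: skipping {home_team_name} vs {away_team_name}; not found in baseline data: {', '.join(missing)}")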
    
    return all_features

# Example: Create features for multiple games
sample_games = [
    ("KC", "BUF", 1),
    ("PHI", "SF", 1),
    ("DET", "DAL", 1),
    ("BAL", "MIA", 1)
]

multiple_features = create_training_features_for_games(baseline_data, game_log, sample_games)
print(f"Created features for {len(multiple_features)} games")

# Show features for first game
if multiple_features:
    print("\nFeatures for first game:")
    for name, value in list(multiple_features[0].items())[:10]:
        print(f"  {name}: {value:.4f}")
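
The helper above scans `baseline_data` once per game. For long game lists, a lookup table keyed by abbreviation avoids the repeated scan. This is a sketch under the assumption that every stats object exposes `team_abbr`, as used above; `index_by_abbreviation` is a hypothetical helper, and the `SimpleNamespace` objects with made-up abbreviations only stand in for the real stats class.

```python
from types import SimpleNamespace


def index_by_abbreviation(baseline_data):
    """Map team abbreviation -> stats object (assumes abbreviations are unique)."""
    return {stats.team_abbr: stats for stats in baseline_data.values()}


# Hypothetical stand-ins for the real baseline entries
demo_baseline = {
    "Team A": SimpleNamespace(team_abbr="AAA"),
    "Team B": SimpleNamespace(team_abbr="BBB"),
}

teams_by_abbr = index_by_abbreviation(demo_baseline)
home_team = teams_by_abbr.get("AAA")   # stats object for Team A
away_team = teams_by_abbr.get("ZZZ")   # None, because the abbreviation is unknown
print(home_team is not None, away_team is None)
```

Building the index once before the loop and using `.get()` inside it keeps the missing-team handling the same: an unknown abbreviation still yields `None`.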


## Step 6: Convert to DataFrame for Model Training

Convert the features to a pandas DataFrame for easy use with scikit-learn:


In [None]:
import pandas as pd
import numpy as np

def features_to_dataframe(features_list):
    """
    Convert list of feature dictionaries to pandas DataFrame.
    
    Args:
        features_list: List of feature dictionaries
    
    Returns:
        pandas DataFrame with features as columns
    """
    if not features_list:
        return pd.DataFrame()
    
    # Create DataFrame: one row per game, one column per feature name
    df = pd.DataFrame(features_list)
    
    # Ensure all expected features are present
    expected_features = [
        "win_pct_diff", "point_diff_diff", "off_epa_diff", "off_epa_ratio",
        "def_epa_diff", "def_epa_ratio", "pass_epa_diff", "pass_def_epa_diff",
        "rush_epa_diff", "rush_def_epa_diff", "turnover_margin_diff", "red_zone_eff_diff",
        "third_down_diff", "sack_rate_diff", "sos_diff", "recent_form_diff",
        "pythagorean_diff", "luck_diff", "h2h_win_pct", "h2h_avg_point_diff",
        "h2h_games_count", "home_field_advantage", "rest_advantage", "division_game",
        "playoff_implications", "early_season", "mid_season", "late_season"
    ]
    
    # Add missing features with default values
    for feature in expected_features:
        if feature not in df.columns:
            df[feature] = 0.0
    
    # Reorder columns to match expected order
    df = df[expected_features]
    
    return df

# Convert features to DataFrame
features_df = features_to_dataframe(multiple_features)

print(f"Features DataFrame shape: {features_df.shape}")
print(f"\nColumn names: {list(features_df.columns)}")
print("\nFirst few rows:")
print(features_df.head())
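
scikit-learn estimators generally expect a numeric array without missing values, so a final conversion step is often useful before training. The sketch below is one possible way to do that, not part of the pipeline: `to_model_matrix` is a hypothetical helper, `demo_df` stands in for `features_df` with placeholder values, and filling gaps with 0.0 simply mirrors the default used for missing features above.

```python
import numpy as np
import pandas as pd


def to_model_matrix(df):
    """Return (X, feature_names) with X as a float64 array containing no NaNs."""
    # Coerce every column to numeric; values that cannot be parsed become NaN
    numeric = df.apply(pd.to_numeric, errors="coerce")
    n_missing = int(numeric.isna().sum().sum())
    if n_missing:
        print(f"Filling {n_missing} missing values with 0.0")
    X = numeric.fillna(0.0).to_numpy(dtype=np.float64)
    return X, list(numeric.columns)


# Hypothetical two-game frame standing in for features_df (placeholder values)
demo_df = pd.DataFrame({
    "win_pct_diff": [0.25, None],
    "point_diff_diff": [3.5, -1.0],
})
X, names = to_model_matrix(demo_df)
print(X.shape, names)
```

Keeping the returned `names` list alongside `X` preserves the column order, which matters because the model expects features in a fixed order.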
