 # NFL Game Prediction Model



 This notebook builds a machine learning model to predict NFL game outcomes using historical performance data.



 ## Overview

 - **Data Source**: nflreadpy library (2021-2025 seasons)

 - **Model**: Logistic Regression with feature selection

 - **Features**: EWMA (Exponentially Weighted Moving Average) team statistics

 - **Target**: Home team win/loss



 ## Table of Contents

 1. [Setup & Imports](#1-setup--imports)

 2. [Load Game Schedules](#2-load-game-schedules)

 3. [Load & Engineer Team Stats](#3-load--engineer-team-stats)

 4. [Calculate EWMA Features](#4-calculate-ewma-features)

 5. [Merge Stats to Games](#5-merge-stats-to-games)

 6. [Feature Selection](#6-feature-selection)

 7. [Train Final Model](#7-train-final-model)

 8. [Save Model & Artifacts](#8-save-model--artifacts)

 9. [Make Predictions](#9-make-predictions)

 10. [Visualize Results](#10-visualize-results)

 ---

 ## 1. Setup & Imports



 Import required libraries and create directory structure.

In [None]:
import nflreadpy as nfl
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from pathlib import Path
import pickle
import joblib
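from datetime import datetime

# Create the directories that later cells write to
for folder in ('../data', '../models', '../outputs'):
    Path(folder).mkdir(parents=True, exist_ok=True)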



 ---

 ## 2. Load Game Schedules



 Load NFL schedule data from 2021-2025 seasons and filter for completed regular season games.

In [None]:
print("Loading schedules...")
schedule = nfl.load_schedules([2021, 2022, 2023, 2024, 2025]).to_pandas()

# Filter for completed regular season games
games = schedule[
    (schedule['game_type'] == 'REG') &  # Regular season only
    (schedule['home_score'].notna()) &  # Game has been played
    (schedule['away_score'].notna())
].copy()

# Create target variable
games['home_win'] = (games['home_score'] > games['away_score']).astype(int)

print(f" Loaded {len(games):,} completed games")
print(f"   Seasons: {games['season'].min()} - {games['season'].max()}")
print(f"   Home team wins: {games['home_win'].sum():,} ({games['home_win'].mean():.1%})")

# Preview data
games[['season', 'week', 'away_team', 'home_team', 'away_score', 'home_score', 'home_win']].head()


 ---

 ## 3. Load & Engineer Team Stats



 Load team-level statistics and create derived features:

 - **Turnovers Offense**: Interceptions + fumbles lost

 - **Turnovers Defense**: Interceptions + fumbles recovered

 - **Turnover Margin**: Defense turnovers - Offense turnovers

 - **Completion Percentage**: Completions / Attempts
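
As a quick sanity check of the definitions above, the sketch below applies them to a two-row toy frame. The values are invented, and the zero-attempt guard is an assumption added for illustration; it is not part of the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy box scores (invented values), using the same column names as the notebook
toy = pd.DataFrame({
    "passing_interceptions": [1, 0],
    "sack_fumbles_lost": [0, 0],
    "rushing_fumbles_lost": [1, 0],
    "receiving_fumbles_lost": [0, 0],
    "def_interceptions": [2, 0],
    "def_fumbles": [0, 1],
    "completions": [20, 0],
    "attempts": [32, 0],
})

giveaways = toy[[
    "passing_interceptions", "sack_fumbles_lost",
    "rushing_fumbles_lost", "receiving_fumbles_lost",
]].sum(axis=1)
takeaways = toy[["def_interceptions", "def_fumbles"]].sum(axis=1)

# Positive margin means the team took the ball away more often than it gave it up
toy["turnover_margin"] = takeaways - giveaways

# Assumed guard: treat zero attempts as missing instead of dividing by zero
toy["completion_pct"] = toy["completions"] / toy["attempts"].replace(0, np.nan)

print(toy[["turnover_margin", "completion_pct"]])
```

The second row shows why the guard can matter: without it, a game with no attempts would produce a non-finite value that then propagates into the moving average.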

In [None]:
print("Loading team statistics...")
team_stats = nfl.load_team_stats([2021, 2022, 2023, 2024, 2025]).to_pandas()

print(f"Loaded {len(team_stats):,} team game records")
print(f"   Teams: {team_stats['team'].nunique()}")
print(f"   Columns: {len(team_stats.columns)}")


In [None]:
print("\nCreating derived features...")

# Offensive turnovers
team_stats['turnovers_offense'] = (
    team_stats['passing_interceptions'] + 
    team_stats['sack_fumbles_lost'] +
    team_stats['rushing_fumbles_lost'] +
    team_stats['receiving_fumbles_lost']
)

# Defensive turnovers
team_stats['turnovers_defense'] = (
    team_stats['def_interceptions'] +
    team_stats['def_fumbles']
)

# Turnover margin
team_stats['turnover_margin'] = (
    team_stats['turnovers_defense'] - 
    team_stats['turnovers_offense']
)

# Completion percentage
team_stats['completion_pct'] = (
    team_stats['completions'] / team_stats['attempts']
)

print(" Created 4 derived features")

# Preview new features
team_stats[['team', 'week', 'turnovers_offense', 'turnovers_defense', 
            'turnover_margin', 'completion_pct']].head(10)


 ---

 ## 4. Calculate EWMA Features



 Calculate Exponentially Weighted Moving Averages (EWMA) for each feature.



 **Why EWMA?**

 - Emphasizes recent performance over older games

 - Alpha = 0.4 means each game counts 0.6x as much as the game after it, so the most recent game has ~13x the weight of a game played 5 weeks earlier (1 / 0.6^5 ≈ 12.9)

 - Captures team momentum and current form
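
To make the weighting concrete: with `adjust=False`, pandas applies the recursion y_t = alpha * x_t + (1 - alpha) * y_(t-1), so (apart from the very first observation) a game k weeks back carries weight alpha * (1 - alpha)^k. The sketch below uses invented numbers, not notebook output. It computes those weights for alpha = 0.4 and checks a hand-rolled recursion against `ewm`.

```python
import numpy as np
import pandas as pd

alpha = 0.4

# Weight on a game k weeks back (k = 0 is the most recent game)
k = np.arange(6)
weights = alpha * (1 - alpha) ** k
ratio_vs_5_back = weights[0] / weights[5]  # equals (1 / 0.6) ** 5

# Hand-rolled recursion on an invented series of weekly passing yards
yards = pd.Series([250.0, 180.0, 310.0, 275.0])
manual = [yards.iloc[0]]
for x in yards.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

pandas_ewma = yards.ewm(alpha=alpha, adjust=False).mean()

print(np.round(weights, 4))
print(round(ratio_vs_5_back, 2))
print(np.allclose(manual, pandas_ewma))
```

A smaller alpha flattens the weights (longer memory), while a larger alpha makes the average track the latest game more closely.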

In [None]:
# Define features to use
independent_variables = [
    'completions',
    'passing_yards',
    'passing_tds',
    'rushing_yards',
    'sacks_suffered',
    'rushing_tds',
    'completion_pct',
    'turnovers_offense',
    'turnovers_defense',
    'turnover_margin',
    'def_tackles_for_loss',
    'penalty_yards',
    'fg_pct',
    'pat_pct',
]

print(f"Calculating EWMA for {len(independent_variables)} features...")
print("Alpha = 0.4 (gives more weight to recent games)")

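# Sort chronologically so each team's EWMA runs in week order
team_stats = team_stats.sort_values(['team', 'season', 'week']).reset_index(drop=True)

# Calculate EWMA for each feature (restarts for every team-season)
for var in independent_variables:
    team_stats[f'{var}_ewma'] = team_stats.groupby(['team', 'season'])[var].transform(
        lambda x: x.ewm(alpha=0.4, adjust=False).mean()
    )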

# Select columns to keep
ewma_cols = [f'{col}_ewma' for col in independent_variables]
keep_cols = ['season', 'week', 'team', 'opponent_team'] + ewma_cols
df_filtered = team_stats[keep_cols]

# Save processed data
df_filtered.to_csv('../data/df_clean.csv', index=False)

print(f" Created {len(ewma_cols)} EWMA features")
print(f" Saved to data/df_clean.csv")

# Preview EWMA features
df_filtered.head(10)


 ---

 ## 5. Merge Stats to Games



 Merge EWMA statistics for both home and away teams to each game, then create difference features.



 **Difference Features**: Home stat - Away stat

 - Positive value = Home team advantage in that stat

 - Negative value = Away team advantage in that stat
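
One thing worth checking when merging on `season` and `week`: an EWMA row for week *w* already includes week *w*'s own box score, so joining it to the week-*w* game gives the model information from the game it is predicting. One possible way to build strictly pre-game features is to lag each team's EWMA by one game. The sketch below uses an invented two-team frame, and the `_pre` column name is hypothetical.

```python
import pandas as pd

stats = pd.DataFrame({
    "team": ["KC", "KC", "KC", "BUF", "BUF", "BUF"],
    "season": [2024] * 6,
    "week": [1, 2, 3, 1, 2, 3],
    "passing_yards": [300.0, 200.0, 250.0, 150.0, 350.0, 275.0],
}).sort_values(["team", "season", "week"])

# EWMA through the current game (what a same-week merge would attach)
stats["passing_yards_ewma"] = (
    stats.groupby(["team", "season"])["passing_yards"]
    .transform(lambda x: x.ewm(alpha=0.4, adjust=False).mean())
)

# EWMA as of kickoff: shift by one game so week w only sees weeks before w
stats["passing_yards_ewma_pre"] = (
    stats.groupby(["team", "season"])["passing_yards_ewma"].shift(1)
)

print(stats)
```

With this lag, each team's first game of a season has no pre-game value, so those rows would need to be dropped or seeded some other way. For upcoming games the unlagged value is still the right input, since every game it includes has already been played.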

In [None]:
print("Merging home team stats...")
# Prepare home stats
home_stats = df_filtered[['season', 'week', 'team'] + ewma_cols].copy()
home_stats.columns = ['season', 'week', 'home_team'] + [f'home_{col}' for col in ewma_cols]

print("Merging away team stats...")
# Prepare away stats
away_stats = df_filtered[['season', 'week', 'team'] + ewma_cols].copy()
away_stats.columns = ['season', 'week', 'away_team'] + [f'away_{col}' for col in ewma_cols]

# Merge both to games
games_with_stats = games.merge(
    home_stats,
    on=['season', 'week', 'home_team'],
    how='left'
).merge(
    away_stats,
    on=['season', 'week', 'away_team'],
    how='left'
)

print(f" Merged stats to {len(games_with_stats):,} games")


In [None]:
print("Creating difference features (Home - Away)...")

feature_columns = []
for col in ewma_cols:
    diff_col = f'{col}_diff'
    games_with_stats[diff_col] = (
        games_with_stats[f'home_{col}'] - games_with_stats[f'away_{col}']
    )
    feature_columns.append(diff_col)

print(f" Created {len(feature_columns)} difference features")

# Fill missing kicking-stat differences before saving so the CSV matches the in-memory frame
games_with_stats = games_with_stats.fillna({'pat_pct_ewma_diff': 0, 'fg_pct_ewma_diff': 0})

# Save merged data
games_with_stats.to_csv('../data/games_with_stats.csv', index=False)
print(" Saved to data/games_with_stats.csv")

# Preview difference features and count any remaining missing values
display_cols = ['season', 'week', 'home_team', 'away_team', 'home_win'] + feature_columns
print(f"Total NA values: {games_with_stats[display_cols].isna().sum().sum()}")
games_with_stats[display_cols].head()

 ---

 ## 6. Feature Selection



 Use Random Forest to identify the most important features for prediction.



 This helps us:

 - Reduce model complexity

 - Improve interpretability

 - Potentially improve generalization
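
The idea can be sketched on synthetic data where the answer is known in advance: only one column drives the label, so it should rank first. The column names and data below are invented for illustration and are unrelated to the NFL features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic data: only `signal_diff` drives the label; the rest is noise
X = pd.DataFrame({
    "signal_diff": rng.normal(size=n),
    "noise_a_diff": rng.normal(size=n),
    "noise_b_diff": rng.normal(size=n),
})
y = (X["signal_diff"] + 0.3 * rng.normal(size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances sum to 1 across features
importance = (
    pd.Series(rf.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
top_k = importance.head(2).index.tolist()  # keep the k highest-ranked columns

print(importance)
print(top_k)
```

Impurity-based importances are computed on the training data only. `sklearn.inspection.permutation_importance` on held-out games is a common cross-check if the ranking looks surprising.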

In [None]:
print("Splitting data into train/test sets...")
print(f"  Training: Seasons 2021-2024")
print(f"  Testing: Season 2025")

# Split by season
train_data = games_with_stats[games_with_stats['season'] < 2025]
test_data = games_with_stats[games_with_stats['season'] == 2025]

X_train = train_data[feature_columns]
y_train = train_data['home_win']

X_test = test_data[feature_columns]
y_test = test_data['home_win']

print(f"\n Training set: {len(X_train):,} games")
print(f" Testing set: {len(X_test):,} games")


In [None]:
print("\nTraining Random Forest for feature selection...")
rf_model = RandomForestClassifier(
    n_estimators=200, 
    random_state=67, 
    max_depth=20, 
    min_samples_split=20,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)

print(f" Random Forest trained")
print(f"   Training accuracy: {rf_model.score(X_train, y_train):.4f}")
print(f"   Testing accuracy: {rf_model.score(X_test, y_test):.4f}")


In [None]:
print("\nCalculating feature importance...")

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False).reset_index(drop=True)

top_features = feature_importance.head(10)
feature_list = top_features['feature'].to_list()

print(f"\n{'='*60}")
print("TOP 10 MOST IMPORTANT FEATURES")
print(f"{'='*60}")
for i, row in top_features.iterrows():
    print(f"{i+1:2d}. {row['feature']:40s} {row['importance']:.4f}")
print(f"{'='*60}")

# Save feature importance
feature_importance.to_csv('../outputs/feature_importance.csv', index=False)


In [None]:
print("\nPlotting feature importance...")

plt.figure(figsize=(10, 6))
plt.barh(range(len(top_features)), top_features['importance'], color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'].to_list())
plt.xlabel('Importance', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Top 10 Feature Importance', fontsize=14, fontweight='bold', pad=20)
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../outputs/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print(" Plot saved to outputs/feature_importance.png")


 ---

 ## 7. Train Final Model



 Train a Logistic Regression model using only the top 10 most important features.



 **Why Logistic Regression?**

 - Fast and interpretable

 - Outputs probabilities (not just predictions)

 - Works well with scaled features
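
As a minimal sketch of the scale-then-fit pattern, the example below wraps both steps in a scikit-learn pipeline and reads class-1 probabilities from `predict_proba`. The data is synthetic. The pipeline is an alternative packaging shown for illustration; the notebook itself keeps the scaler and the model as separate objects.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for two difference features on very different scales
X = np.column_stack([
    rng.normal(scale=50.0, size=400),   # e.g. a yardage differential
    rng.normal(scale=0.05, size=400),   # e.g. a completion-rate differential
])
y = (X[:, 0] / 50.0 + X[:, 1] / 0.05 + rng.normal(size=400) > 0).astype(int)

# The scaler is fit on the training rows only, then reused for the test rows
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X[:300], y[:300])

probs = clf.predict_proba(X[300:])      # columns: P(class 0), P(class 1)
home_win_prob = probs[:, 1]

print(home_win_prob[:5])
print(clf.score(X[300:], y[300:]))
```

Standardizing first puts the coefficients on a comparable footing, so their magnitudes can be read as rough per-standard-deviation effects on the log-odds.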

In [None]:
print("Preparing data with selected features...")

# Select top features
X_train_selected = train_data[feature_list]
X_test_selected = test_data[feature_list]

print(f"Using {len(feature_list)} features:")
for feat in feature_list:
    print(f"  - {feat}")


In [None]:
print("\nScaling features...")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

print(" Features scaled (mean=0, std=1)")


In [None]:
print("\nTraining Logistic Regression model...")

model = LogisticRegression(random_state=41, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)

print(f"\n{'='*60}")
print("MODEL PERFORMANCE")
print(f"{'='*60}")
print(f"Training Accuracy:  {train_score:.4f} ({train_score:.1%})")
print(f"Testing Accuracy:   {test_score:.4f} ({test_score:.1%})")
print(f"{'='*60}")

# Get predictions and probabilities
y_pred_train = model.predict(X_train_scaled)
y_pred_test = model.predict(X_test_scaled)
y_prob_train = model.predict_proba(X_train_scaled)[:, 1]
y_prob_test = model.predict_proba(X_test_scaled)[:, 1]

print("\n Model trained successfully!")


 ---

 ## 8. Save Model & Artifacts



 Save the trained model, scaler, and feature list for future use.
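
A quick way to confirm that saved artifacts are usable is a round trip: dump, reload, and compare predictions. The sketch below does this in a temporary directory with a toy model. Bundling everything into a single file (`bundle.joblib`, a hypothetical name) is an assumption made for brevity, whereas the notebook writes three separate files.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
feature_list = ["a_diff", "b_diff", "c_diff"]  # hypothetical feature names

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "models"
    out_dir.mkdir(parents=True, exist_ok=True)

    # One bundle keeps the model, scaler, and column order together
    joblib.dump(
        {"model": model, "scaler": scaler, "features": feature_list},
        out_dir / "bundle.joblib",
    )
    bundle = joblib.load(out_dir / "bundle.joblib")

# The reloaded objects should reproduce the original probabilities exactly
restored = bundle["model"].predict_proba(bundle["scaler"].transform(X))
original = model.predict_proba(scaler.transform(X))
print(np.allclose(restored, original))
```

Whichever layout is used, the feature list matters as much as the model: the scaler expects columns in the same order they had at fit time.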

In [None]:
print("Saving model and artifacts...")

# Save model
with open('../models/finalized_model.pkl', 'wb') as f:
    pickle.dump(model, f)
print("Model saved to models/finalized_model.pkl")

# Save scaler  
joblib.dump(scaler, '../models/scaler.pkl')
print("Scaler saved to models/scaler.pkl")

# Save feature list
with open('../models/feature_list.pkl', 'wb') as f:
    pickle.dump(feature_list, f)
print("Feature list saved to models/feature_list.pkl")

# Save most recent stats for predictions
most_recent_stats = df_filtered.sort_values(['team', 'season', 'week']).groupby('team').last().reset_index()
most_recent_stats.to_csv('../data/most_recent_stats1.csv', index=False)
print("Most recent stats saved to data/most_recent_stats1.csv")

print("\n All artifacts saved!")
most_recent_stats

 ---

 ## 9. Make Predictions

- Use the trained model to predict outcomes for the current week's games.

- Compare the model's predictions with the Vegas moneylines.
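
To put the moneylines on the same footing as the model's probabilities, they can be converted to implied win probabilities. The sketch below assumes the `home_moneyline` and `away_moneyline` columns hold American odds. The lines are invented, and `implied_prob` and `vegas_home_prob` are hypothetical names.

```python
import pandas as pd


def implied_prob(moneyline: float) -> float:
    """Convert American moneyline odds to the bookmaker's implied win probability."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100
        return -moneyline / (-moneyline + 100.0)
    # Underdog: risk 100 to win moneyline
    return 100.0 / (moneyline + 100.0)


# Invented lines for two games
lines = pd.DataFrame({
    "home_moneyline": [-150, 120],
    "away_moneyline": [130, -140],
})

home_raw = lines["home_moneyline"].map(implied_prob)
away_raw = lines["away_moneyline"].map(implied_prob)

# The raw probabilities sum to more than 1 (the bookmaker's margin);
# dividing by the total rescales each game so the two sides sum to 1
lines["vegas_home_prob"] = home_raw / (home_raw + away_raw)

print(lines)
```

Comparing `vegas_home_prob` with the model's `home_win_prob` game by game shows where the model disagrees with the market, which is usually more informative than comparing picks alone.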

In [None]:
print("Getting current week and season...")

current_week = nfl.get_current_week()
current_season = nfl.get_current_season()

print(f"Current: Week {current_week}, {current_season} Season")


In [None]:
print(f"\nLoading games for Week {current_week}...")

week_games = schedule[
    (schedule['week'] == current_week) & 
    (schedule['season'] == current_season)
]

print(f"Found {len(week_games)} games")

if len(week_games) == 0:
    print(" No games found for current week")


In [None]:
print(f"\nLoading Vegas odds for Week {current_week}...")

# Moneylines are columns of the schedule data
vegas_lines = schedule[
    (schedule['season'] == current_season) &
    (schedule['week'] == current_week)
][['game_id', 'away_moneyline', 'home_moneyline']]

# Report how many games actually have a posted line
n_with_odds = vegas_lines['home_moneyline'].notna().sum()
print(f"Found Vegas odds for {n_with_odds} of {len(vegas_lines)} games")

In [None]:
if len(week_games) > 0:
    print("\nPreparing prediction data...")
    
    # Merge home team stats
    df_matchups = week_games[['game_id', 'away_team', 'home_team', 'gameday']].merge(
        most_recent_stats[keep_cols],
        left_on='home_team',
        right_on='team',
        how='left'
    )
    
    stats_to_rename = [col for col in keep_cols if col != 'team']
    df_matchups = df_matchups.rename(columns={col: f"{col}_home" for col in stats_to_rename})
    
    # Merge away team stats
    df_matchups = df_matchups.merge(
        most_recent_stats[keep_cols],
        left_on='away_team',
        right_on='team',
        how='left',
        suffixes=("", "_away")
    )
    
    df_matchups = df_matchups.rename(columns={
        col: f"{col}_away" for col in stats_to_rename if col in df_matchups.columns
    })
    
    # Create difference features
    for stat in ewma_cols:
        diff_col = f"{stat}_diff"
        df_matchups[diff_col] = df_matchups[f"{stat}_home"] - df_matchups[f"{stat}_away"]
    
    print(" Prediction data prepared")
    
    # Make predictions
    X_pred = df_matchups[feature_list]
    X_pred_scaled = scaler.transform(X_pred)
    win_probs = model.predict_proba(X_pred_scaled)[:, 1]

    
    # Create results DataFrame
    results = pd.DataFrame({
        'game_id': df_matchups['game_id'],
        'matchup': df_matchups['away_team'] + ' @ ' + df_matchups['home_team'],
        'away_team': df_matchups['away_team'],
        'home_team': df_matchups['home_team'],
        'game_date': df_matchups['gameday'],
        'home_win_prob': win_probs,
        'away_win_prob': 1 - win_probs
    })
    
    # Add predicted winner
    results['predicted_winner'] = results.apply(
        lambda row: row['home_team'] if row['home_win_prob'] > 0.5 else row['away_team'],
        axis=1
    )
    results['confidence'] = results[['home_win_prob', 'away_win_prob']].max(axis=1)

    # Add Vegas lines
    results = results.merge(vegas_lines, on='game_id', how='left')
    
    # Save predictions
    results.to_csv('../data/latest_predictions.csv', index=False)
    print(" Predictions saved to data/latest_predictions.csv")




In [None]:
if len(week_games) > 0:
    print(f"\n{'='*70}")
    print(f"PREDICTIONS FOR WEEK {current_week}, {current_season}")
    print(f"{'='*70}\n")
    
    for _, row in results.iterrows():
        print(f" {row['matchup']}")
        print(f"   Game Date: {row['game_date']}")
        print(f"   Home Win Probability: {row['home_win_prob']:.1%}")
        print(f"   Away Win Probability: {row['away_win_prob']:.1%}")
        print(f"    Predicted Winner: {row['predicted_winner']} ({row['confidence']:.1%} confidence)")
        print()
    
    print(f"{'='*70}\n")
    
    # Display as table
    display(results[['matchup', 'game_date', 'home_win_prob', 'predicted_winner', 'confidence', 'home_moneyline', 'away_moneyline']])


 ---

 ## 10. Visualize Results



 Create a visualization showing win probabilities for each game.

In [None]:
if len(week_games) > 0:
    print("Creating visualization...")
    
    fig, ax = plt.subplots(figsize=(12, len(results) * 0.6 + 2))
    
    # Color bars based on home/away favorite
    colors = ['#d32f2f' if p < 0.5 else '#388e3c' for p in results['home_win_prob']]
    
    # Create horizontal bar chart
    bars = ax.barh(results['matchup'], results['home_win_prob'], color=colors, edgecolor='white', linewidth=2)
    
    # Add probability labels
    for i, (prob, matchup) in enumerate(zip(results['home_win_prob'], results['matchup'])):
        ax.text(prob + 0.02, i, f'{prob:.1%}', va='center', fontweight='bold', fontsize=11)
    
    # Add 50% reference line
    ax.axvline(x=0.5, color='gray', linestyle='--', linewidth=2, alpha=0.5)
    
    # Formatting
    ax.set_xlabel('Home Team Win Probability', fontsize=13, fontweight='bold')
    ax.set_title(f'NFL Predictions - Week {current_week}, {current_season}', 
                 fontsize=15, fontweight='bold', pad=20)
    ax.set_xlim(0, 1)
    ax.set_xticks([0, 0.25, 0.5, 0.75, 1.0])
    ax.set_xticklabels(['0%', '25%', '50%', '75%', '100%'])
    ax.grid(axis='x', alpha=0.3, linestyle=':', linewidth=1)
    
    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='#388e3c', label='Home Favorite'),
        Patch(facecolor='#d32f2f', label='Away Favorite')
    ]
    ax.legend(handles=legend_elements, loc='lower right', fontsize=10)
    
    plt.tight_layout()
    plt.savefig('../outputs/predictions.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(" Visualization saved to outputs/predictions.png")


 ---

 ## Summary



 ### Model Performance

 - Training Accuracy: ~80-82%

 - Testing Accuracy: ~80-82%



 ### Key Features (Top 3)

 1. Completion Percentage Differential

 2. Passing TDs Differential

 3. Rushing TDs Differential



 ### Files Created

 - `data/df_clean.csv` - Processed team statistics

 - `data/games_with_stats.csv` - Games with merged features

 - `data/most_recent_stats1.csv` - Latest team stats

 - `data/latest_predictions.csv` - Current week predictions

 - `models/finalized_model.pkl` - Trained model

 - `models/scaler.pkl` - Feature scaler

 - `models/feature_list.pkl` - Selected features

 - `outputs/feature_importance.csv` - Feature importance scores

 - `outputs/feature_importance.png` - Feature importance chart

 - `outputs/predictions.png` - Predictions visualization



 ### Next Steps

 - Run `streamlit run app.py` to view interactive predictions

 - Retrain model weekly as new games are played

 - Consider adding weather, injuries, or home field advantage features
