# Penalty Prediction Models

## Introduction
This notebook builds a series of negative binomial regression models that predict how many penalties of each type a team commits in a given NFL game scenario. Each penalty type gets its own model.
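
For reference, a negative binomial regression with a log link can be written as follows. The symbols are introduced here for illustration: $y_i$ is the count for one team-game, $x_i$ its predictor vector, and $\alpha$ the dispersion parameter. This is the NB2 form, which is the variance function used by the statsmodels `NegativeBinomial` family.

$$
\log \mu_i = \beta_0 + x_i^\top \beta, \qquad
\mathbb{E}[y_i] = \mu_i, \qquad
\operatorname{Var}(y_i) = \mu_i + \alpha \mu_i^2
$$

As $\alpha \to 0$ the variance approaches $\mu_i$ and the model reduces to Poisson regression.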

## Data Preparation
We will start by loading and preparing the dataset, which includes encoding categorical variables and aggregating penalty counts.
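
As a small illustration of the encoding step mentioned above, here is a minimal sketch of what `LabelEncoder` does to a column of team abbreviations. The sample values are hypothetical; the point is that codes are assigned in sorted label order and can be mapped back later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of team abbreviations
teams = ["PIT", "TEN", "DAL", "PIT"]

le = LabelEncoder()
codes = le.fit_transform(teams)

print(le.classes_)                  # labels in sorted order
print(codes)                        # integer code for each input value
print(le.inverse_transform(codes))  # round trip back to the labels
```

Note that the integer codes carry no meaningful order, which matters when they are fed to a regression as numeric predictors.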

In [21]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
from statsmodels.formula.api import glm
from statsmodels.genmod.families import NegativeBinomial
from sklearn.metrics import mean_squared_error, r2_score

## Data Loading and Preprocessing

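We load the processed penalty log, keep the frequent non-special-teams penalties, aggregate them into counts per team and game, and label-encode the categorical columns.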

In [22]:
# Load the dataset
data_path = '../data/processed/penalties.csv'
df = pd.read_csv(data_path)

# Count the frequency of each penalty type
penalties_count = df['penalty'].value_counts()

# Filter for penalties that occur at least 50 times
frequent_penalties = penalties_count[penalties_count >= 50].index.tolist()

# Exclude special teams penalties and filter by frequent penalties
penalties_data = df[(df['phase'] != 'ST') & (df['penalty'].isin(frequent_penalties))]

df_filtered = penalties_data.loc[:, ['game_id', 'team_id', 'opp_id', 'penalty', 'year', 'week', 'ref_crew', 'home', 'postseason']]

# Aggregate data to get the count of each penalty type per game and team
df_grouped = df_filtered.groupby(['game_id', 'team_id', 'opp_id', 'year', 'week', 'ref_crew', 'home', 'postseason', 'penalty']).size().reset_index(name='count')

# Encode categorical variables
label_encoders = {}
for column in ['team_id', 'opp_id', 'ref_crew', 'home', 'postseason', 'penalty']:
    le = LabelEncoder()
    df_grouped[column] = le.fit_transform(df_grouped[column])
    label_encoders[column] = le

# Display the first few rows of the filtered data (before aggregation and encoding)
df_filtered.head()

| | game_id | team_id | opp_id | penalty | year | week | ref_crew | home | postseason |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009_1_TEN_PIT | PIT | TEN | Def_Unnecessary_Roughness | 2009 | 1 | Bill Leavy | Yes | No |
| 1 | 2009_1_TEN_PIT | TEN | PIT | Off_Illegal_Formation | 2009 | 1 | Bill Leavy | No | No |
| 2 | 2009_1_TEN_PIT | PIT | TEN | Off_Holding | 2009 | 1 | Bill Leavy | Yes | No |
| 3 | 2009_1_TEN_PIT | PIT | TEN | Off_Holding | 2009 | 1 | Bill Leavy | Yes | No |
| 4 | 2009_1_TEN_PIT | PIT | TEN | Def_Pass_Interference | 2009 | 1 | Bill Leavy | Yes | No |
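
One property of the aggregation step is worth keeping in mind: `groupby(...).size()` only produces rows for combinations that actually occurred, so a team-game with no penalties of a given type has no row at all rather than a count of 0. The sketch below shows one possible way to add those zeros on a toy penalty log. The rows and the `counts_with_zeros` name are hypothetical, and it only covers team-games that appear in the log at least once.

```python
import pandas as pd

# Toy penalty log (hypothetical rows): one row per flagged penalty
toy = pd.DataFrame({
    "game_id": ["g1", "g1", "g1", "g2"],
    "team_id": ["PIT", "PIT", "TEN", "PIT"],
    "penalty": ["Off_Holding", "Off_Holding", "Off_False_Start", "Off_False_Start"],
})

# Count penalties per team-game and type; only observed combinations appear
counts = toy.groupby(["game_id", "team_id", "penalty"]).size()

# Build every (team-game, penalty type) combination, then fill the gaps with 0
team_games = toy[["game_id", "team_id"]].drop_duplicates()
penalty_types = pd.DataFrame({"penalty": toy["penalty"].unique()})
full_grid = team_games.merge(penalty_types, how="cross")

counts_with_zeros = (
    full_grid.merge(counts.rename("count").reset_index(), how="left",
                    on=["game_id", "team_id", "penalty"])
    .fillna({"count": 0})
    .astype({"count": int})
)
print(counts_with_zeros)
```

Whether zero-filling is appropriate depends on the modeling goal; without it, each model only sees team-games where the penalty was called at least once.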


## Model Building
For each of the five most common penalty types, we fit a separate negative binomial regression model.

In [23]:
# Dictionary to store models
models = {}

# Define predictor variables
predictors = ['team_id', 'opp_id', 'year', 'week', 'ref_crew', 'home', 'postseason']

# Identify the 5 penalty types that appear in the most team-game rows
top_penalty_codes = df_grouped['penalty'].value_counts().nlargest(5).index.tolist()

# Store MSE and R-squared values for overall evaluation
overall_mse = []
overall_r2 = []

# Train and evaluate a model for each of the top 5 penalty types
for penalty_code in top_penalty_codes:
    # Filter data for the current penalty type
    df_penalty = df_grouped[df_grouped['penalty'] == penalty_code]
    
    # Model formula
    formula = 'count ~ ' + ' + '.join(predictors)
    
    # Fix the dispersion at alpha=1.0 explicitly (avoids a statsmodels warning;
    # alpha is not estimated from the data here)
    model = glm(formula, data=df_penalty, family=NegativeBinomial(alpha=1.0)).fit()
    models[penalty_code] = model
    
    # Predict on the training data (in-sample, so the metrics below are optimistic)
    y_pred = model.predict(df_penalty[predictors])
    y_true = df_penalty['count']
    
    # Calculate MSE and R-squared
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    # Append to overall lists
    overall_mse.append(mse)
    overall_r2.append(r2)
    
    # Print the model's evaluation
    print(f"Penalty: {label_encoders['penalty'].inverse_transform([penalty_code])[0]}")
    print(f"Weights: {model.params}")
    print(f"MSE: {mse}")
    print(f"R^2: {r2}")
    print("\n---\n")

# Average the per-model metrics (a simple mean across the five models,
# not the score of a combined ensemble)
ensemble_mse = np.mean(overall_mse)
ensemble_r2 = np.mean(overall_r2)
print(f"Mean MSE across models: {ensemble_mse}")
print(f"Mean R^2 across models: {ensemble_r2}")

Penalty: Off_Holding
Weights: Intercept    -56.036432
team_id        0.000627
opp_id         0.000716
year           0.028097
week          -0.008158
ref_crew       0.000468
home           0.028761
postseason    -0.167842
dtype: float64
MSE: 1.136740533579205
R^2: 0.05256882622285752

---

Penalty: Off_False_Start
Weights: Intercept    -29.218550
team_id       -0.000827
opp_id        -0.001180
year           0.014800
week          -0.002910
ref_crew       0.000070
home          -0.002231
postseason    -0.113933
dtype: float64
MSE: 1.0853138637897506
R^2: 0.017467152606337133

---

Penalty: Def_Pass_Interference
Weights: Intercept    -51.618504
team_id       -0.000964
opp_id        -0.001356
year           0.025811
week          -0.003856
ref_crew       0.000266
home          -0.038040
postseason    -0.053744
dtype: float64
MSE: 0.4962166675359705
R^2: 0.057064251375285036

---

Penalty: Def_Holding
Weights: Intercept    -56.940973
team_id        0.001148
opp_id         0.000794
year   
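
Because the models above receive `team_id`, `opp_id`, and `ref_crew` as label-encoded integers, each of them is fitted as a single numeric slope over arbitrary codes. A possible alternative, sketched here under the assumption that the columns are kept as strings, is to wrap them in `C(...)` so the formula interface dummy-codes them. The data is synthetic and the reduced predictor set is hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import glm
from statsmodels.genmod.families import NegativeBinomial

# Synthetic team-game counts (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "team_id": rng.choice(["DAL", "PIT", "SEA"], size=n),
    "home": rng.choice(["Yes", "No"], size=n),
    "week": rng.integers(1, 18, size=n),
})
toy["count"] = rng.poisson(1.5, size=n)

# C(...) dummy-codes a column, so each level gets its own coefficient
# relative to a reference level instead of sharing one numeric slope
formula = "count ~ C(team_id) + C(home) + week"
model = glm(formula, data=toy, family=NegativeBinomial(alpha=1.0)).fit()
print(model.params)
```

With this encoding the number of coefficients grows with the number of teams and crews, which is the usual trade-off against the compact but hard-to-interpret integer codes.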

## Prediction Function
The function below encodes the details of a game scenario with the fitted label encoders and returns the expected count of each modeled penalty type.

In [24]:
def predict_penalties(team_id, opp_id, year, week, ref_crew, home, postseason):
    # Encode input data
    input_data = {
        'team_id': label_encoders['team_id'].transform([team_id])[0],
        'opp_id': label_encoders['opp_id'].transform([opp_id])[0],
        'year': year,
        'week': week,
        'ref_crew': label_encoders['ref_crew'].transform([ref_crew])[0],
        'home': label_encoders['home'].transform([home])[0],
        'postseason': label_encoders['postseason'].transform([postseason])[0]
    }
    
    predictions = {}
    for penalty_code, model in models.items():
        features_df = pd.DataFrame([input_data])
        predicted_count = model.predict(features_df).iloc[0]
        penalty_type = label_encoders['penalty'].inverse_transform([penalty_code])[0]
        # The log link already yields positive means; max() is only a safeguard
        predictions[penalty_type] = max(0, predicted_count)
    
    return pd.DataFrame([predictions], index=['Predicted Count'])

# Example prediction
predict_penalties('DAL', 'SEA', 2023, 10, 'Bill Leavy', 'Yes', 'No')

| | Off_Holding | Off_False_Start | Def_Pass_Interference | Def_Holding | Def_Offside |
|---|---|---|---|---|---|
| Predicted Count | 2.174033 | 1.920161 | 1.610521 | 1.691706 | 1.573771 |
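
To make the link between the printed weights and a predicted count concrete, the sketch below fits a small model on synthetic data and reproduces `model.predict` by hand. With the family's default log link, the prediction is the exponential of the linear predictor, so it is always positive. The data, the two-predictor formula, and the scenario values are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import glm
from statsmodels.genmod.families import NegativeBinomial

# Synthetic data (hypothetical), only to have a fitted model to inspect
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "week": rng.integers(1, 18, size=200),
    "home": rng.integers(0, 2, size=200),
})
toy["count"] = rng.poisson(1.5, size=200)
model = glm("count ~ week + home", data=toy,
            family=NegativeBinomial(alpha=1.0)).fit()

# One hypothetical scenario to score
scenario = pd.DataFrame([{"week": 10, "home": 1}])
predicted = model.predict(scenario).iloc[0]

# Same number by hand: exponentiate the linear predictor
b = model.params
by_hand = np.exp(b["Intercept"] + b["week"] * 10 + b["home"] * 1)
print(predicted, by_hand)
```

The same arithmetic applies to the per-penalty models above once the scenario has been label-encoded.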
