# Gradient Boosting Model for Drive Results Prediction

This notebook builds and evaluates two Gradient Boosting Machine (GBM) models: a classifier that predicts the outcome of a football drive and a regressor that predicts the points scored on it. Both use penalty counts, penalty yardage, line of scrimmage, and time remaining as inputs.

In [44]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

## Load the Data

The dataset includes details about football drives, such as penalties and yardage. We drop the team and coach columns, which are not used as features.

In [45]:
# Load the dataset
data = pd.read_csv('../data/processed/drives.csv')
data = data.drop(columns=['home_team', 'away_team', 'home_coach', 'away_coach'])

data.head()

Unnamed: 0,game_id,team_id,num,quarter,time,los,plays,length,net_yds,result,...,Off_Illegal_Blindside_Block,Off_12_On-field,Off_Disqualification,Def_Too_Many_Men_on_Field,Def_Lowering_the_Head_to_Initiate_Contact,Off_Too_Many_Men_on_Field,total_off_pen,total_def_pen,total_off_pen_yards,total_def_pen_yards
0,2009_1_TEN_PIT,PIT,1,1,15:00,58,3,1:44,2,Punt,...,0,0,0,0,0,0,0,0,0,0
1,2009_1_TEN_PIT,TEN,1,1,13:16,98,3,1:52,2,Punt,...,0,0,0,0,0,0,0,0,0,0
2,2009_1_TEN_PIT,PIT,2,1,11:24,43,5,3:04,2,Punt,...,0,0,0,0,0,0,0,0,0,0
3,2009_1_TEN_PIT,TEN,2,1,8:20,89,6,1:36,70,Missed FG,...,0,0,0,0,0,0,0,1,0,15
4,2009_1_TEN_PIT,PIT,3,1,6:44,73,3,1:55,-6,Punt,...,0,0,0,0,0,0,0,0,0,0


## Prepare the Data

In [46]:
# Function to convert "HH:MM:SS" to total seconds
def time_to_seconds(time_str):
    h, m, s = map(int, time_str.split(':'))
    return h * 3600 + m * 60 + s

# Apply the conversion to the 'time_left' column
data['time_left_seconds'] = data['time_left'].apply(time_to_seconds)

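# Keep 'Touchdown' and 'Field Goal'; collapse every other outcome (e.g. 'Punt', 'Missed FG') into 'Zero'
data['result'] = data['result'].apply(lambda x: x if x in ['Touchdown', 'Field Goal'] else 'Zero')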

# Select features and target
features = data[['total_off_pen', 'total_def_pen', 'total_off_pen_yards', 'total_def_pen_yards', 'los', 'time_left_seconds']]
target = data['result']

# Encode the target variable
le = LabelEncoder()
target_encoded = le.fit_transform(target)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target_encoded, test_size=0.3, random_state=42)
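
The same preparation can also be written with vectorized pandas operations instead of `apply`. The sketch below is not the notebook's code: it assumes `time_left` holds `HH:MM:SS` strings, as the helper above expects, and it runs on a small made-up frame (`toy`) rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real drives data
toy = pd.DataFrame({
    'time_left': ['00:15:00', '00:01:44', '01:00:00'],
    'result': ['Punt', 'Touchdown', 'Field Goal'],
})

# pd.to_timedelta parses "HH:MM:SS" strings; total_seconds() returns floats
toy['time_left_seconds'] = pd.to_timedelta(toy['time_left']).dt.total_seconds()

# Keep scoring outcomes and replace everything else with 'Zero'
scoring = ['Touchdown', 'Field Goal']
toy['result'] = toy['result'].where(toy['result'].isin(scoring), 'Zero')

print(toy)
```

On a dataset of this size, the vectorized form mainly saves a Python-level loop over rows; the resulting columns should match those produced by the `apply` version.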

## Model Training

We train a Gradient Boosting Classifier on the drive data, then report per-class precision and recall along with the feature importances.

In [47]:
# Initialize the Gradient Boosting Classifier
gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
gbm_model.fit(X_train, y_train)

# Predicting the test set results
y_pred = gbm_model.predict(X_test)

# Evaluate the model by specifying all labels known to the LabelEncoder
classification_report_result = classification_report(y_test, y_pred, labels=np.arange(len(le.classes_)), target_names=le.classes_)
print("Classification Report:\n", classification_report_result)

# Extracting and printing the feature importance
feature_importance = gbm_model.feature_importances_
features_df = pd.DataFrame({'Feature': features.columns, 'Importance': feature_importance}).sort_values(by='Importance', ascending=False)
print("\nFeature Importance:\n", features_df)

Classification Report:
               precision    recall  f1-score   support

  Field Goal       0.43      0.02      0.03      4034
   Touchdown       0.54      0.21      0.30      5857
        Zero       0.68      0.97      0.80     18001

    accuracy                           0.67     27892
   macro avg       0.55      0.40      0.38     27892
weighted avg       0.62      0.67      0.58     27892


Feature Importance:
                Feature  Importance
1        total_def_pen    0.429765
4                  los    0.242568
5    time_left_seconds    0.187329
3  total_def_pen_yards    0.080356
2  total_off_pen_yards    0.031086
0        total_off_pen    0.028895


## Prediction Function

Create a function that takes input features and returns the predicted drive result.

In [48]:
def predict_drive_result(total_off_pen, total_def_pen, total_off_pen_yards, total_def_pen_yards, los, time_left_seconds):
    input_features = pd.DataFrame([{
        'total_off_pen': total_off_pen, 
        'total_def_pen': total_def_pen, 
        'total_off_pen_yards': total_off_pen_yards, 
        'total_def_pen_yards': total_def_pen_yards, 
        'los': los, 
        'time_left_seconds': time_left_seconds
    }])
    prediction = gbm_model.predict(input_features)
    return le.inverse_transform(prediction)[0]

# Example usage of the prediction function
result = predict_drive_result(2, 1, 15, 10, 50, 900)
print(f"Predicted Drive Result: {result}")

Predicted Drive Result: Zero
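
Because the classifier predicts `Zero` for most inputs, class probabilities can be more informative than a single label. The sketch below shows one way to return them with `predict_proba`. It is self-contained and trains on random stand-in data, so `X_demo`, `clf_demo`, `le_demo`, and `predict_drive_proba` are hypothetical names and the printed probabilities carry no meaning for real drives.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder

feature_names = ['total_off_pen', 'total_def_pen', 'total_off_pen_yards',
                 'total_def_pen_yards', 'los', 'time_left_seconds']

# Synthetic stand-in data (random values, NOT the real drives dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.integers(0, 100, size=(300, 6)), columns=feature_names)
y_demo_labels = rng.choice(['Field Goal', 'Touchdown', 'Zero'], size=300)

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(y_demo_labels)

clf_demo = GradientBoostingClassifier(n_estimators=20, random_state=42)
clf_demo.fit(X_demo, y_demo)

def predict_drive_proba(model, encoder, **feature_values):
    """Return a {class label: probability} dict for one drive."""
    # Build a one-row frame with columns in the training order
    row = pd.DataFrame([feature_values])[feature_names]
    proba = model.predict_proba(row)[0]
    # Columns of predict_proba follow model.classes_ (the encoded labels)
    labels = encoder.inverse_transform(model.classes_)
    return dict(zip(labels, proba))

probs = predict_drive_proba(
    clf_demo, le_demo,
    total_off_pen=2, total_def_pen=1, total_off_pen_yards=15,
    total_def_pen_yards=10, los=50, time_left_seconds=900,
)
print(probs)
```

With the notebook's own objects, the same function could be called as `predict_drive_proba(gbm_model, le, ...)`.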


## Preparing Data for Points Prediction

In this section, we derive a new `points` column from the `result` column: 7 points for a Touchdown, 3 points for a Field Goal, and 0 points for all other outcomes.

In [49]:
# Define a function to convert drive results to points
def result_to_points(result):
    if result == 'Touchdown':
        return 7
    elif result == 'Field Goal':
        return 3
    else:
        return 0

# Apply the function to the 'result' column to create a new target variable
data['points'] = data['result'].apply(result_to_points)

# Display the distribution of the points column to verify
data['points'].value_counts()

points
0    60483
7    19362
3    13128
Name: count, dtype: int64
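
The counts above are enough to work out the average points per drive and the share of scoreless drives by hand. The short calculation below uses only the numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Drive counts per points value, copied from the value_counts() output above
counts = {0: 60483, 7: 19362, 3: 13128}

n_drives = sum(counts.values())

# Weighted mean of the points values
mean_points = sum(points * n for points, n in counts.items()) / n_drives

# Fraction of drives that end with no points
scoreless_share = counts[0] / n_drives

print(f"Drives: {n_drives}")
print(f"Mean points per drive: {mean_points:.3f}")
print(f"Share of scoreless drives: {scoreless_share:.3f}")
```

This gives about 1.88 points per drive, with roughly 65% of drives ending scoreless, so the regression target is heavily concentrated at zero.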

## Preparing Data for Regression

Using the `points` column created in the previous section as the target, we select the same six features as the classifier and split the data for regression.

In [50]:
# The 'points' column was created in the previous cell from 'result'
# (Touchdown -> 7, Field Goal -> 3, anything else -> 0) and is reused here

# Select features for the regression model
features = data[['total_off_pen', 'total_def_pen', 'total_off_pen_yards', 'total_def_pen_yards', 'los', 'time_left_seconds']]
target = data['points']

# Split data into train and test sets for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(features, target, test_size=0.3, random_state=42)

## Gradient Boosting Regression Model

Now we'll configure and train a Gradient Boosting Machine (GBM) to predict the number of points scored at the end of each drive based on the specified inputs.

In [51]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Gradient Boosting Regressor
gbm_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the regressor
gbm_regressor.fit(X_train_reg, y_train_reg)

# Predicting the test set results
y_pred_reg = gbm_regressor.predict(X_test_reg)

# Evaluate the regression model
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 6.88506327387569
R^2 Score: 0.13576086162277534
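
These two metrics are easier to read side by side with a baseline. Since R^2 = 1 - MSE / Var(y_test), the reported values determine the MSE of a model that always predicts the test-set mean. The sketch below derives that figure and the RMSE from the printed output alone.

```python
import math

# Metrics copied from the output above
mse = 6.88506327387569
r2 = 0.13576086162277534

# RMSE is expressed in the same units as the target (points)
rmse = math.sqrt(mse)

# R^2 = 1 - MSE / Var(y_test), so the test-set variance can be recovered;
# it is also the MSE of a model that always predicts the test-set mean
baseline_mse = mse / (1 - r2)

print(f"RMSE: {rmse:.3f} points")
print(f"MSE of mean-only baseline: {baseline_mse:.3f}")
```

The regressor's typical error is about 2.6 points, and it lowers the MSE from roughly 7.97 for the mean-only baseline to 6.89, which is what an R^2 of about 0.14 expresses.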


## Prediction Function for Regression

We also wrap the regressor in a function that predicts the number of points from the same six input features.

In [62]:
def predict_points(total_off_pen, total_def_pen, total_off_pen_yards, total_def_pen_yards, los, time_left_seconds):
    input_features = pd.DataFrame([{
        'total_off_pen': total_off_pen, 
        'total_def_pen': total_def_pen, 
        'total_off_pen_yards': total_off_pen_yards, 
        'total_def_pen_yards': total_def_pen_yards, 
        'los': los, 
        'time_left_seconds': time_left_seconds
    }])
    predicted_points = gbm_regressor.predict(input_features)
    return predicted_points[0]

# Example usage of the prediction function for regression
points = predict_points(0, 3, 0, 30, 25, 800)
print(f"Predicted Points: {points}")

Predicted Points: 5.046858987798735
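
A value such as 5.05 is not an achievable score under the 0/3/7 encoding. A regressor trained with squared error estimates an average, so the output is best read as expected points. The sketch below illustrates this reading with made-up outcome probabilities; the numbers in `hypothetical_probs` are assumptions chosen for illustration and do not come from either model.

```python
# Points assigned to each outcome, as defined earlier in the notebook
points_per_outcome = {'Touchdown': 7, 'Field Goal': 3, 'Zero': 0}

# Hypothetical outcome probabilities for one drive (illustrative only)
hypothetical_probs = {'Touchdown': 0.60, 'Field Goal': 0.28, 'Zero': 0.12}

# Expected points = sum over outcomes of points * probability
expected_points = sum(
    points_per_outcome[outcome] * p for outcome, p in hypothetical_probs.items()
)

print(f"Expected points: {expected_points:.2f}")
```

Under these assumed probabilities the expectation is 5.04, close to the value printed above, even though no single drive can score that amount.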
