# NFL Data Models and Results

This notebook walks through training models on various subsets of the cleaned data and summarizes the results for passing and running plays in the 2018 NFL season. The data come from the NFL's 2020 and 2021 Big Data Bowls.

The main focus of this analysis is to see how offensive and defensive personnel, the formation, and the defensive players on the field affect the decision to run or pass the ball. It is also an opportunity to use player tracking / location data (which may be explored in a separate notebook).

Data bowls for reference:
- https://www.kaggle.com/c/nfl-big-data-bowl-2020: Forecast yardage gained on run plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2021/: Evaluate defensive performance on pass plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2022: Analyze special teams data
- https://github.com/nfl-football-ops/Big-Data-Bowl: Inaugural data bowl from 2019, with useful R code for animating tracking data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import xgboost as xgb
import os
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
# Read in cleaned dataset for general models
bdb = pd.read_csv('bdb_2018_dummy.csv')

# Separate the response variable from the features
bdb_y = np.where(bdb['Type']=='run', 1, 0) ## convert response variable to 1 for run, 0 for pass
bdb_x = bdb.drop(['Type'], axis=1)

# Models

In this section, we train three models on a 75/25 train/test split: logistic regression, a random forest, and gradient-boosted decision trees with XGBoost.

To do:
- K-fold cross-validation
- Tune hyperparameters; so far these are the bare default versions of each model.
- Identify important features in prediction
- Set up a function to train these models at once
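Two of the to-do items (k-fold cross-validation and a single function that trains all models) could be combined roughly as follows. This is a minimal sketch, not the notebook's implementation: `cross_validate_models` is a hypothetical helper, and the data is a synthetic stand-in from `make_classification` rather than `bdb_x` / `bdb_y`. XGBoost is left out to keep the example dependency-light; since `xgb.XGBClassifier()` follows the scikit-learn estimator interface, it should be possible to add it to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validate_models(models, x, y, n_splits=5, seed=2022):
    """Return {model name: mean cross-validated accuracy} (hypothetical helper)."""
    # Stratified folds keep the run/pass ratio roughly constant in each fold
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, x, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in for bdb_x / bdb_y (assumption: not the real play data)
x_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=2022)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=2022),
}

cv_scores = cross_validate_models(models, x_demo, y_demo)
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, the call would presumably look like `cross_validate_models(models, bdb_x, bdb_y)`.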

Other ideas:
- Segment the data by field position (e.g. train separate models for when the offense is inside its own 20, between its own 20 and midfield, between midfield and the opposing 20, and in the red zone inside the opposing 20)
- Train models by team on offense
- Train models by down
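The segmentation ideas above share one pattern: group the rows by a column, then fit and score one model per group. Below is a hedged sketch of that pattern for the "by down" case. The `Down` and `Distance` column names come from the notebook; the `is_run` column, the `fit_by_segment` helper, and the randomly generated rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic plays (assumption: random values, not the real dataset)
rng = np.random.default_rng(2022)
n_rows = 600
demo = pd.DataFrame({
    "Down": rng.integers(1, 5, size=n_rows),       # downs 1 through 4
    "Distance": rng.integers(1, 21, size=n_rows),  # yards to go, 1 through 20
})
demo["is_run"] = (rng.random(n_rows) < 0.4).astype(int)  # 1 = run, 0 = pass


def fit_by_segment(df, segment_col, feature_cols, target_col, seed=2022):
    """Fit one logistic regression per segment; return {segment: test accuracy}."""
    scores = {}
    for value, group in df.groupby(segment_col):
        x_tr, x_te, y_tr, y_te = train_test_split(
            group[feature_cols], group[target_col],
            test_size=0.25, random_state=seed,
        )
        model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        scores[value] = model.score(x_te, y_te)
    return scores


scores_by_down = fit_by_segment(demo, "Down", ["Distance"], "is_run")
for down, score in scores_by_down.items():
    print(f"Down {down}: accuracy {score:.3f}")
```

The same helper would work for team or field-position segments by changing `segment_col`, provided each segment has enough rows of both classes to train on.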

In [3]:
# Create train / test split
x_train, x_test, y_train, y_test = train_test_split(bdb_x, bdb_y, test_size=0.25, random_state=2022)
print('Training Rows: {}\nTest Rows: {}'.format(x_train.shape[0], x_test.shape[0]))

Training Rows: 21439
Test Rows: 7147


In [4]:
## Train Logistic Regression model
lr = LogisticRegression()
lr.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
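The warning above means the solver hit its iteration limit before converging. As the message suggests, two common remedies are scaling the features and raising `max_iter`. The sketch below shows both in a scikit-learn pipeline; it runs on synthetic data (an assumption, with one column deliberately inflated to mimic unscaled features), and `max_iter=1000` is an illustrative value rather than a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the play data (assumption)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2022)
x_demo[:, 0] *= 1000  # inflate one feature's scale to mimic unscaled columns

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=2022
)

# The scaler is fit on the training rows only, then applied to the test rows
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_tr, y_tr)

scaled_accuracy = scaled_lr.score(x_te, y_te)
n_iterations = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print(f"Accuracy: {scaled_accuracy:.3f} after {n_iterations} iterations")
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fit, which also matters once cross-validation is added.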

In [5]:
## Train RF model
rf = RandomForestClassifier()
rf.fit(x_train, y_train)

RandomForestClassifier()
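A fitted random forest exposes impurity-based importances through `feature_importances_`, which is one way to start on the "identify important features" to-do item. The sketch below uses synthetic data and placeholder column names (`feature_0` and so on are assumptions); with the notebook's objects, the equivalent would presumably be `pd.Series(rf.feature_importances_, index=x_train.columns)`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with placeholder column names (assumption)
x_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=2022
)
x_demo = pd.DataFrame(x_arr, columns=[f"feature_{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=2022)
rf_demo.fit(x_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = pd.Series(
    rf_demo.feature_importances_, index=x_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on the test split may be a useful cross-check.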

In [6]:
## Train XGBoost model
xgc = xgb.XGBClassifier()
xgc.fit(x_train, y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

# Results

In [7]:
1 - (sum(bdb_y) / len(bdb_y)) ## Benchmark: guessing pass on every play is correct 62.16% of the time (share of pass plays in the full dataset)

0.6215979850276359
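The benchmark above is computed on the full dataset, while the models are scored on the test split only. As a small check, the same always-pass baseline can be computed for the test split from its class counts, which appear in the `support` column of the classification reports in this section (4435 pass plays, 2712 run plays). The variable names are illustrative.

```python
# Class counts on the test split, read from the classification reports
n_pass_test = 4435
n_run_test = 2712

# Accuracy of a model that predicts "pass" for every test play
baseline_test_accuracy = n_pass_test / (n_pass_test + n_run_test)
print(f"Always-pass accuracy on the test split: {baseline_test_accuracy:.4f}")
```

The test-split baseline (about 62.05%) is close to the full-data figure, so either works as the reference point for the accuracies reported here.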

In [8]:
## Predict logistic regression
y_pred = lr.predict(x_test)
print(lr.score(x_test, y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.739611025605149
[[3680  755]
 [1106 1606]]
              precision    recall  f1-score   support

           0       0.77      0.83      0.80      4435
           1       0.68      0.59      0.63      2712

    accuracy                           0.74      7147
   macro avg       0.72      0.71      0.72      7147
weighted avg       0.74      0.74      0.74      7147
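To make the report easier to read, the run-class metrics can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the matrix out as rows = true class and columns = predicted class, so with pass = 0 and run = 1 the cells are `[[TN, FP], [FN, TP]]`. This is only a worked check of the printed numbers; the variable names are illustrative.

```python
# Confusion matrix cells from the logistic regression output above
tn, fp = 3680, 755    # true pass: predicted pass, predicted run
fn, tp = 1106, 1606   # true run:  predicted pass, predicted run

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_run = tp / (tp + fp)   # of plays predicted run, share that were runs
recall_run = tp / (tp + fn)      # of actual runs, share predicted as run
f1_run = 2 * precision_run * recall_run / (precision_run + recall_run)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_run:.2f}")
print(f"recall:    {recall_run:.2f}")
print(f"f1:        {f1_run:.2f}")
```

The low run recall is the main weakness: 1106 of the 2712 actual run plays are predicted as passes.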



In [9]:
## Predict RF
y_pred = rf.predict(x_test)
print(rf.score(x_test, y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.7446481040996222
[[3587  848]
 [ 977 1735]]
              precision    recall  f1-score   support

           0       0.79      0.81      0.80      4435
           1       0.67      0.64      0.66      2712

    accuracy                           0.74      7147
   macro avg       0.73      0.72      0.73      7147
weighted avg       0.74      0.74      0.74      7147



In [10]:
## Predict XGB
y_pred = xgc.predict(x_test)
print(xgc.score(x_test, y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.7543025045473626
[[3583  852]
 [ 904 1808]]
              precision    recall  f1-score   support

           0       0.80      0.81      0.80      4435
           1       0.68      0.67      0.67      2712

    accuracy                           0.75      7147
   macro avg       0.74      0.74      0.74      7147
weighted avg       0.75      0.75      0.75      7147



In [11]:
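## Reference model: logistic regression on Down and Distance only
## Note: this refits `lr` in place, replacing the full-feature model trained above
lr.fit(x_train[['Down','Distance']], y_train)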
y_pred = lr.predict(x_test[['Down','Distance']])
print(lr.score(x_test[['Down','Distance']], y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.6363509164684483
[[4272  163]
 [2436  276]]
              precision    recall  f1-score   support

           0       0.64      0.96      0.77      4435
           1       0.63      0.10      0.18      2712

    accuracy                           0.64      7147
   macro avg       0.63      0.53      0.47      7147
weighted avg       0.63      0.64      0.54      7147
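One way to summarize the results is each model's accuracy gain over the always-pass benchmark. The sketch below only does arithmetic on the accuracies printed in this section; the dictionary keys are illustrative labels, not names used elsewhere in the notebook.

```python
# Accuracies copied from the outputs in this section
baseline_accuracy = 0.6215979850276359  # always guess pass (full dataset)
test_accuracy = {
    "logistic_regression": 0.739611025605149,
    "random_forest": 0.7446481040996222,
    "xgboost": 0.7543025045473626,
    "down_distance_only": 0.6363509164684483,
}

# Gain over the benchmark, in percentage points
lift_points = {
    name: 100 * (acc - baseline_accuracy) for name, acc in test_accuracy.items()
}
for name, lift in sorted(lift_points.items(), key=lambda item: -item[1]):
    print(f"{name}: +{lift:.2f} points")

best_model = max(test_accuracy, key=test_accuracy.get)
```

Read this way, the three full-feature models gain roughly 12 to 13 points over the benchmark, while Down and Distance alone add about 1.5 points, which suggests most of the predictive signal comes from the other features.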

