# NFL Data Models and Results

This notebook walks through the methodology of data collection and cleaning, and returns a finished dataset narrowed down to passing and running plays in the 2018 NFL season. The data is obtained from the NFL's 2020 and 2021 Big Data Bowls.

*There is potential to combine in the 2020 Big Data Bowl data, which contains similar information to the 2021 data but covers rushing plays from 2017-2019. Combining these sources would produce a notebook similar to the original tendency analysis: less data, but more information in our columns.*

The main focus of this analysis is to see how the offensive and defensive personnel, the formation, and the defensive players on the field affect the decision to run or pass the ball. It is also an opportunity to use player tracking and location data (which may be explored in a separate notebook).

Data bowls for reference:
- https://www.kaggle.com/c/nfl-big-data-bowl-2020: Forecast yardage gained on run plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2021/: Forecast yardage gained on pass plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2022: Analyze special teams data
- https://github.com/nfl-football-ops/Big-Data-Bowl: Inaugural data bowl from 2019, useful R code on animation of tracking

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import xgboost as xgb
import os
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'

In [20]:
# Read in cleaned dataset
bdb = pd.read_csv('bdb_2018.csv')

# Split the response variable from the feature columns
bdb_y = np.where(bdb['Type']=='run', 1, 0) ## convert response variable to 1 for run, 0 for pass
bdb_x = bdb.drop(['Type'], axis=1)

# Models

In this section, we train three models: logistic regression, a random forest, and gradient-boosted decision trees with XGBoost.

We do this on a 75/25 train/test split. I will include cross-validation in a more thorough development of these models.
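As a rough idea of what the cross-validation step could look like, here is a minimal sketch using stratified 5-fold CV. It runs on synthetic data from `make_classification` as a stand-in for `bdb_x` / `bdb_y`; the names `X_demo`, `y_demo`, and `cv_scores`, the fold count, and `max_iter=1000` are all assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for bdb_x / bdb_y (assumption: roughly 62% pass, 38% run)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.62, 0.38], random_state=2022
)

# Stratified folds keep the run/pass ratio about the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring='accuracy'
)

print('Fold accuracies: {}'.format(np.round(cv_scores, 3)))
print('Mean accuracy: {:.3f} (std {:.3f})'.format(cv_scores.mean(), cv_scores.std()))
```

Reporting the mean and spread across folds gives a steadier estimate than a single 75/25 split.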

In [21]:
# Create train / test split
x_train, x_test, y_train, y_test = train_test_split(bdb_x, bdb_y, test_size=0.25, random_state=2022)
print('Training Rows: {}\nTest Rows: {}'.format(x_train.shape[0], x_test.shape[0]))

Training Rows: 21439
Test Rows: 7147


In [22]:
## Train Logistic Regression model
lr = LogisticRegression()
lr.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
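The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, assuming all feature columns are numeric, is below. It uses synthetic data with deliberately mismatched feature scales rather than `bdb_x`, and the names `X_demo`, `lr_scaled`, and the value `max_iter=1000` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption: all columns numeric)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2022)
# Stretch the columns to very different scales, which tends to slow lbfgs
X_demo = X_demo * np.logspace(0, 4, 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2022
)

# The scaler is fit on the training split only, inside the pipeline
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scaled.fit(x_tr, y_tr)

scaled_acc = lr_scaled.score(x_te, y_te)
n_iter_used = lr_scaled.named_steps['logisticregression'].n_iter_[0]
print('Test accuracy: {:.3f}, iterations used: {}'.format(scaled_acc, n_iter_used))
```

Wrapping the scaler in a pipeline means the test split is transformed with the training statistics, so no information leaks from the test rows.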

In [23]:
## Train RF model
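The cell above is empty in this version of the notebook. As a placeholder, here is one way a random forest could be trained, shown on synthetic data. The hyperparameters (`n_estimators=200`, `n_jobs=-1`) and the names `X_demo`, `rf_demo`, and `rf_acc` are assumptions, not the settings used for any results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for bdb_x / bdb_y (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2022)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2022
)

# Hypothetical settings; tune these on the real data
rf_demo = RandomForestClassifier(n_estimators=200, random_state=2022, n_jobs=-1)
rf_demo.fit(x_tr, y_tr)

rf_acc = rf_demo.score(x_te, y_te)
rf_importances = rf_demo.feature_importances_  # one value per feature, sums to 1
print('Test accuracy: {:.3f}'.format(rf_acc))
```

Tree ensembles do not need feature scaling, so the same split used for logistic regression can be passed in unchanged.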

In [24]:
## Train XGBoost model
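This cell is also empty. A minimal sketch of a gradient-boosted model follows, again on synthetic data. It uses `xgboost.XGBClassifier` when the package is installed and falls back to scikit-learn's `GradientBoostingClassifier` otherwise; the fallback, the hyperparameters, and the names `X_demo`, `booster`, and `gb_acc` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical hyperparameters; both classifiers accept these four arguments
params = dict(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=2022)

try:
    from xgboost import XGBClassifier
    booster = XGBClassifier(**params)
except ImportError:
    # Stand-in when xgboost is unavailable (not the same implementation)
    from sklearn.ensemble import GradientBoostingClassifier
    booster = GradientBoostingClassifier(**params)

# Synthetic stand-in for bdb_x / bdb_y (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2022)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2022
)

booster.fit(x_tr, y_tr)
gb_pred = booster.predict(x_te)
gb_acc = (gb_pred == y_te).mean()
print('Test accuracy: {:.3f}'.format(gb_acc))
```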

# Results

In [42]:
1 - (sum(bdb_y) / len(bdb_y))  ## Benchmark: guessing pass on every play is correct 62.16% of the time

0.6215979850276359
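The benchmark above is computed on the full dataset. For a like-for-like comparison with test-set scores, the same "always guess pass" baseline can be evaluated on the test split. The sketch below rebuilds the test labels from the class supports in the classification report further down (4435 pass, 2712 run); `y_test_demo` and `X_placeholder` are illustrative names. In practice the dummy model would be fit on `y_train`; it is fit on the reconstructed test labels here only because the majority class is pass either way.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test labels rebuilt from the reported supports: 4435 pass (0), 2712 run (1)
y_test_demo = np.array([0] * 4435 + [1] * 2712)
# DummyClassifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y_test_demo)

baseline_acc = baseline.score(X_placeholder, y_test_demo)
print(round(baseline_acc, 4))  # 0.6205
```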

In [14]:
## Predict logistic regression
y_pred = lr.predict(x_test)
lr.score(x_test, y_test)

0.7412900517699734

In [15]:
confusion_matrix(y_test, y_pred)

array([[3668,  767],
       [1082, 1630]], dtype=int64)
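In scikit-learn's layout, rows are the true class and columns are the predicted class, so the matrix above reads as 3668 correct pass calls, 767 passes predicted as runs, 1082 runs predicted as passes, and 1630 correct run calls. The short sketch below recomputes the headline metrics for the run class directly from those four counts; the variable names are my own.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[3668, 767],
               [1082, 1630]])
tn, fp, fn, tp = cm.ravel()  # "positive" is class 1, i.e. run

accuracy = (tn + tp) / cm.sum()
precision_run = tp / (tp + fp)  # of plays predicted run, share that were runs
recall_run = tp / (tp + fn)     # of actual runs, share that were caught
f1_run = 2 * precision_run * recall_run / (precision_run + recall_run)

print('accuracy={:.4f} precision={:.2f} recall={:.2f} f1={:.2f}'.format(
    accuracy, precision_run, recall_run, f1_run))
# accuracy=0.7413 precision=0.68 recall=0.60 f1=0.64
```

The low recall on runs (0.60) shows that most of the model's errors come from calling runs as passes.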

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.83      0.80      4435
           1       0.68      0.60      0.64      2712

    accuracy                           0.74      7147
   macro avg       0.73      0.71      0.72      7147
weighted avg       0.74      0.74      0.74      7147

