For this project, I wanted to build a model that predicts whether a pass play will be completed, based on the other variables included in the plays dataset.

The goal is two-fold:

(1) For NFL defenses, the model could help determine which personnel choices (defenders in the box, number of pass rushers) are most likely to force an incompletion in a given offensive situation (down, yards to go).

(2) For NFL offenses, it could be used to estimate the likelihood of a pass completion given the same variables.


In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

I had to clean up the plays dataset so that it would be ready for a Decision Tree or Random Forest model, so the file loaded below is my edited version.

In [None]:
#Read plays data file
playsFile = "../input/preprocessed-plays-data/Preprocessed_plays.csv"
playsData = pd.read_csv(playsFile)

playsData.head()
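
The cleaning steps themselves are not shown in this notebook. As a rough illustration of the kind of preparation scikit-learn tree models need (all-numeric features and an encoded target), here is a minimal sketch. The column names, the values, and the "C"/"I" result codes are assumptions made for the example, not the actual contents of the preprocessed file.

```python
import pandas as pd

# Toy stand-in for raw plays data; columns and values are illustrative assumptions
raw = pd.DataFrame({
    "down": [1, 3, 2, 4],
    "yardsToGo": [10, 7, 3, 1],
    "offenseFormation": ["SHOTGUN", "EMPTY", "SHOTGUN", "I_FORM"],
    "passResult": ["C", "I", "C", "I"],
})

# Encode the target as 1 (complete) / 0 (incomplete)
raw["passResult"] = (raw["passResult"] == "C").astype(int)

# One-hot encode the categorical column so every feature is numeric
prepared = pd.get_dummies(raw, columns=["offenseFormation"])

print(prepared.dtypes)
```

One-hot encoding is only one option; an ordinal encoding would also work with tree models, at the cost of implying an order between categories.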

The following steps create the train and test datasets and separate the target column from the features.

In [None]:
#Hold out 20% of the plays as a test set; random_state (arbitrary value) makes the split repeatable
trainData, testData = train_test_split(playsData, test_size=0.20, random_state=42)

trainData.head()

In [None]:
y_train = trainData["passResult"]
X_train = trainData.drop(["passResult"], axis=1) #training features without the target column
y_test = testData["passResult"]
X_test = testData.drop(["passResult"], axis=1) #test features without the target column

X_train.head()
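
If completions and incompletions are not evenly represented, a plain random split can leave the two subsets with slightly different class ratios. A minimal sketch of a stratified split, using a synthetic table with made-up values in place of the plays data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the plays table (values are made up)
toy = pd.DataFrame({
    "down": [1, 2, 3, 4] * 25,
    "yardsToGo": list(range(1, 11)) * 10,
    "passResult": [1] * 70 + [0] * 30,
})

X = toy.drop(columns=["passResult"])
y = toy["passResult"]

# stratify=y keeps the class ratio the same in both subsets;
# random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

print("train completion rate:", y_tr.mean())
print("test completion rate:", y_te.mean())
```

Passing features and target together also returns all four pieces in one call, which avoids splitting first and dropping the target column afterwards.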


Here we fit a Decision Tree classifier with default settings as a starting point.

In [None]:
#Decision Tree Classifier ========================================================================
#Fit a default decision tree and report its test-set accuracy
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
clf_predict = clf.predict(X_test)
print("Accuracy score (test set) for Decision Tree: {0:.6f}".format(clf.score(X_test, y_test)))
print()
print("Confusion Matrix for Decision Tree")
print(confusion_matrix(y_test, clf_predict))
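
An accuracy figure is easier to judge next to a naive baseline that always predicts the most common outcome. The sketch below uses random synthetic data (not the plays data) in which the features carry no signal, so the tree has no real advantage over the baseline; `DummyClassifier` is scikit-learn's built-in baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(500, 3))
# Roughly 65% positive labels, generated independently of X (made-up data)
y = (rng.random(500) < 0.65).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print(f"baseline: {baseline_acc:.3f}  tree: {model_acc:.3f}")
```

On the real data, the same comparison would show how much of the decision tree's accuracy comes from the features rather than from the class balance alone.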

To improve accuracy, we tune the model's hyperparameters with a randomized search: 15 sampled parameter combinations, each scored with 5-fold cross-validation.

In [None]:
#Hyperparameter tuning for the decision tree classifier using a randomized search
print("RandomizedSearchCV-Decision tree")
parameters = {'min_samples_leaf': range(10, 100, 10), 'max_depth': range(5, 30, 5), 'criterion': ['gini', 'entropy']}
clf_random = RandomizedSearchCV(clf, parameters, n_iter=15, cv=5)
clf_random.fit(X_train, y_train)
grid_parm = clf_random.best_params_
print(grid_parm)
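
Because the search samples combinations at random, the reported best parameters can differ from run to run. One way to avoid copying them by hand is to fix `random_state` and reuse the refitted `best_estimator_`. This is a sketch on synthetic data from `make_classification`, not the plays data; the parameter ranges mirror the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the plays features
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

param_distributions = {
    "min_samples_leaf": list(range(10, 100, 10)),
    "max_depth": list(range(5, 30, 5)),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=15,
    cv=5,
    random_state=0,  # fixes which 15 combinations are sampled
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")

# With refit=True (the default), best_estimator_ is already retrained on all of X_tr
tuned = search.best_estimator_
test_acc = tuned.score(X_te, y_te)
print(f"test accuracy: {test_acc:.3f}")
```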



In [None]:
#Refit with tuned hyperparameters (hard-coded here; clf_random.best_params_ can vary between runs)
clf = DecisionTreeClassifier(min_samples_leaf=10, max_depth=20, criterion='entropy')
clf.fit(X_train, y_train)
clf_predict = clf.predict(X_test)
print("Accuracy score (test set) for Decision Tree: {0:.6f}".format(clf.score(X_test, y_test)))
print()
print("Confusion Matrix for Decision Tree")
print(confusion_matrix(y_test, clf_predict))

In [None]:
#Print Classification Report
print(classification_report(y_test,clf_predict))
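
The precision and recall figures in the report can be traced back to the confusion matrix, whose rows are true classes and whose columns are predicted classes. A small worked sketch with hand-written labels, assuming a binary 0/1 encoding where 1 means a completed pass (the actual encoding of `passResult` may differ):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written example labels: 1 = complete, 0 = incomplete (assumed encoding)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted completions, how many were real
recall = tp / (tp + fn)     # of the real completions, how many were found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```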

In [None]:
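#Visualization for the Decision Tree (top levels only; the full tree is too deep to read)
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, max_depth=3, feature_names=list(X_train.columns), filled=True)
plt.show()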

Here we fit a Random Forest classifier on the same dataset.

In [None]:
#Random Forest =============================================================
#Default mode
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rf_pred = rfc.predict(X_test) #predict once and reuse for the confusion matrix

print("Accuracy score (test set) for Random Forest: {0:.6f}".format(rfc.score(X_test, y_test)))
print()
print("Confusion Matrix for Random Forest")
print(confusion_matrix(y_test, rf_pred))
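
Since the stated goal is to learn which situational and personnel variables matter, the forest's `feature_importances_` attribute is one place to look. The sketch below uses synthetic data with a made-up rule in which only yards to go affects the outcome; the column names are assumptions chosen to echo the variables mentioned in the introduction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic features; names and ranges are illustrative assumptions
X = pd.DataFrame({
    "down": rng.integers(1, 5, n),
    "yardsToGo": rng.integers(1, 21, n),
    "defendersInTheBox": rng.integers(4, 9, n),
    "numberOfPassRushers": rng.integers(3, 7, n),
})

# Made-up rule: completion probability falls as yards to go increases
p_complete = 0.9 - 0.04 * X["yardsToGo"]
y = (rng.random(n) < p_complete).astype(int)

rfc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features
importances = pd.Series(rfc.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking rather than a measure of causal effect.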

The next step for this model would be to gather data from prior NFL seasons so that more plays are available for training, which may improve predictive accuracy. From there, teams could use the models to shape offensive and defensive game plans for varying situations.