# SDAC VT Big Data Bowl Challenge
[Link to NFL Big Data Bowl 2021 Challenge](https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview)
* Brock Morgan
* Jacob Parker
* Nick Grifasi
* Cullen Wallace

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/plays.csv')
games = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/games.csv')

# Task 1: Data Cleaning
## For our first task, we considered it critical to standardize field position so that it would be easy to use in later modeling and functions

In [None]:
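weeks = []
for week in tqdm(range(1, 18)):
    weeks.append(pd.read_csv(f'/kaggle/input/nfl-big-data-bowl-2021/week{week}.csv'))
# ignore_index gives every tracking row a unique label across the 17 weekly files
tracking = pd.concat(weeks, ignore_index=True)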

In [None]:
# Look up the home team abbreviation for each play through its gameId
home_by_game = games.set_index('gameId')['homeTeamAbbr']
a = plays['gameId'].map(home_by_game).tolist()

In [None]:
plays['homeTeam'] = pd.Series(a)

In [None]:
x = []
p_yd = plays['yardlineNumber']
p_sd = plays['yardlineSide']

# Yard lines on the home team's side map to 11-59; all others (including midfield) map to 60-109
for index, value in p_yd.items():
    if p_sd[index] == plays['homeTeam'][index]:
        x.append(value+10)
    else:
        x.append(110-value)

In [None]:
plays['standardYardLine'] = pd.Series(x)

### This list shows the final standardized field positions

In [None]:
plays[['yardlineNumber', 'yardlineSide', 'homeTeam', 'standardYardLine']]
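
The same standardization rule can also be written without an explicit loop. The block below is a minimal sketch on a hypothetical toy frame (the `toy` rows are made up for illustration; only the column names mirror the notebook), assuming the rule is exactly the one used above: `value + 10` on the home team's side, `110 - value` otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names mirror plays.csv plus the homeTeam column built above
toy = pd.DataFrame({
    'yardlineNumber': [25, 25, 50],
    'yardlineSide': ['PHI', 'ATL', None],  # side is missing at midfield in this toy example
    'homeTeam': ['PHI', 'PHI', 'PHI'],
})

# Same rule as the loop: home side maps to value + 10, anything else to 110 - value
on_home_side = toy['yardlineSide'] == toy['homeTeam']
toy['standardYardLine'] = np.where(
    on_home_side,
    toy['yardlineNumber'] + 10,
    110 - toy['yardlineNumber'],
)
print(toy)
```

A missing `yardlineSide` compares as not equal to the home team, so it falls into the `110 - value` branch, just as it does in the loop.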

In [None]:
nDB = []
nLB = []
nDL = []
for idx, val in plays['personnelD'].items():
    if not pd.isnull(val):
        # Each count is the single character just before its position label, e.g. '4 DL, 2 LB, 5 DB'
        nLB.append(val[val.find(' LB')-1])
        nDL.append(val[val.find(' DL')-1])
        nDB.append(val[val.find(' DB')-1])
    else:
        nLB.append(None)
        nDB.append(None)
        nDL.append(None)
plays['nLB'] = nLB
plays['nDB'] = nDB
plays['nDL'] = nDL
plays.head()
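
One caveat with the character-offset approach is that `str.find` returns -1 when a label is absent, so the lookup silently picks up an unrelated character. The sketch below shows one possible alternative using a regular expression. `count_position` is a hypothetical helper, not part of the notebook, and it returns integers rather than the single-character strings stored above.

```python
import re

def count_position(personnel, label):
    """Return the integer count that precedes `label`, or None if it is absent."""
    # Missing values in pandas arrive as float NaN, so anything that is not a string is skipped
    if not isinstance(personnel, str):
        return None
    match = re.search(r'(\d+) ' + re.escape(label), personnel)
    return int(match.group(1)) if match else None

example = '4 DL, 2 LB, 5 DB'
counts = {label: count_position(example, label) for label in ('DL', 'LB', 'DB')}
print(counts)
```

If adopted, it could be applied per row, for example with `plays['personnelD'].apply(count_position, label='DB')`.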

# Task 2: Data Modeling

For the second portion of the challenge, we analyze some of the data prepared in the first task. We want to look at how defensive schemes, yardline number, and tracking data affect expected points added (EPA).


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
plays['yardsToGo2'] = np.array(plays['yardsToGo'])**2
# Note: dropna() with no subset drops every row that has a missing value in any column
plays = plays.dropna()
inputs = plays.loc[:,['nLB','nDB','nDL','yardsToGo','yardsToGo2','down','yardlineNumber']]
inputs = pd.get_dummies(data=inputs, drop_first=True)

regmod = LinearRegression()
lm = regmod.fit(inputs, plays['epa'])

# Slope coefficients, one per input column
print("Coefficients: ", lm.coef_)
print()
# Intercept (b)
print("b-intercept: ", lm.intercept_)
print()

plays['epa_pred'] = lm.predict(inputs)
print('R^2 for the linear regression', r2_score(plays['epa'], plays['epa_pred']))
print()

inputs.head()

Based on our model, all of the terms in it are significant. The numbers of linebackers, defensive backs, and defensive linemen all have an effect on the EPA of a play. Other terms we found to be significant were the yards to go and yards to go squared, as well as the yardline number and the down. Together, however, these terms produced an R-squared of only about 2.6%, meaning the model explains only that much of the variance in the data. Unfortunately, this means the model cannot be widely applied, but it does suggest that each of these terms could play a part in a model that accurately predicts the success of a play.
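
To make the R-squared figure concrete, here is a minimal sketch of the coefficient of determination, computed as one minus the ratio of the residual sum of squares to the total sum of squares. `r_squared` is a hypothetical helper written for illustration and the numbers are toy values, not results from the model above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                            # predictions match exactly
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predicting the mean
print(perfect, baseline)
```

An R-squared near zero therefore means the predictions do little better than always guessing the average EPA.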

# Task 3: Data Visualization

In [None]:
tracking['position'].unique()

In [None]:
offense_p = ['QB', 'WR', 'RB', 'TE', 'FB', 'HB']
defense_p = ['SS', 'FS', 'MLB', 'CB', 'LB', 'OLB', 'ILB', 'DL', 'DB', 'NT', 'S', 'DE', 'DT']
special_p = ['P', 'LS', 'K']

In [None]:
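t_p = tracking['position']
side = []
# Label each tracking row as offense (O), defense (D), special teams (S), or other/football (F)
for idx, val in t_p.items():
    if val in offense_p:
        side.append('O')
    elif val in defense_p:
        side.append('D')
    elif val in special_p:
        side.append('S')
    else:
        side.append('F')

# Assign the list directly so values are matched by row position rather than by index label
tracking['O/D/S/F'] = side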

In [None]:
tracking.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
yardline_play_result = plays[['standardYardLine','playType','isDefensivePI']]
playTypes = np.zeros([99,4])
yardLineList = []
for idx, val in yardline_play_result['standardYardLine'].items():
    yardLineList.append(val)
    if yardline_play_result['playType'][idx] == 'play_type_pass':
        playTypes[val-11, 0] += 1
    elif yardline_play_result['playType'][idx] == 'play_type_sack':
        playTypes[val-11,1] += 1
    if yardline_play_result['isDefensivePI'][idx] == True:
        playTypes[val -11, 2] += 1
    playTypes[val-11,3] += 1
    
play_frequencies = pd.DataFrame(data = {'passFreq': playTypes[:,0]/playTypes[:,3],
                                   'sackFreq': playTypes[:,1]/playTypes[:,3],
                                   'passInterference': playTypes[:,2]/playTypes[:,3],
                                   'yardLineNumber': np.arange(99)+1})
play_frequencies.head()

In [None]:
def plotPlayFreq(frequencies, start_yardline = 0, end_yardline = 100, play_type = 'all'):
    fig = plt.figure()
    plt.style.use('fivethirtyeight')
    ax = fig.add_axes([0,0,1,1])
    # Slice the requested yardline range once and reuse it below
    freq = frequencies[start_yardline:end_yardline]
    yards = freq['yardLineNumber']
    if play_type == 'all':
        # Draw the tallest stack first so the shorter bars overlay it
        ax.bar(yards, freq['sackFreq'] + freq['passFreq'] + freq['passInterference'], color = 'r', width = 0.25)
        ax.bar(yards, freq['sackFreq'] + freq['passFreq'], color = 'g', width = 0.25)
        ax.bar(yards, freq['passFreq'], color = 'b', width = 0.25)
        ax.legend(labels=['DPI', 'Sack', 'Pass'], loc = 4)
    elif play_type == 'pass':
        ax.bar(yards + 0.00, freq['passFreq'], color = 'b', width = 0.25)
        ax.legend(labels=['Pass'])
    elif play_type == 'sack':
        ax.bar(yards + 0.25, freq['sackFreq'], color = 'g', width = 0.25)
        ax.legend(labels=['Sack'])
    elif play_type == 'PI':
        ax.bar(yards + 0.50, freq['passInterference'], color = 'r', width = 0.25)
        ax.legend(labels=['Defensive Pass Interference'])
    ax.set_xlabel('Yardline')
    ax.set_ylabel('Frequency')
    ax.set_title('Yardline vs Play Frequency')
    plt.show()

## For our first analysis, we looked into the frequency of defensive pass interference based on the yardline

In [None]:
plotPlayFreq(play_frequencies, 0,100, 'PI')

The graph shows a high rate of DPI primarily within the first and last 25 yardlines, with a level rate in between. Nearly 50% of plays run in the first and last five yards resulted in a DPI, which is a significant finding. We believe one reason for this might be the cost of allowing a touchdown if the defensive player is beaten by the receiver. It would better serve the defense to sacrifice the yards rather than give up a touchdown down the stretch.
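
A per-yardline rate like the one plotted above can also be computed with a group-by, since the mean of a boolean column is the fraction of `True` values. The block below is a sketch on hypothetical toy rows (the data is invented; only the column names match the notebook).

```python
import pandas as pd

# Hypothetical toy plays; only the column names mirror the notebook
toy = pd.DataFrame({
    'standardYardLine': [11, 11, 60, 60, 60, 109],
    'isDefensivePI': [True, False, False, False, True, False],
})

# Mean of a boolean column within each group = share of plays flagged for DPI
dpi_rate = toy.groupby('standardYardLine')['isDefensivePI'].mean()
print(dpi_rate)
```

Yardlines with very few plays can produce extreme rates, so it may be worth checking the play count per yardline alongside the rate.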


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
yardline_play_result = plays[['standardYardLine','passResult', 'personnelD', 'isDefensivePI', 'epa']]
schemes = yardline_play_result['personnelD'].unique()
schemes = schemes[:len(schemes)-1]
# Columns: completions, incompletions, completions + incompletions, DPI on incompletions, summed EPA
playRes = np.zeros([len(schemes),5])
for idx, val in yardline_play_result['standardYardLine'].items():
    # Find the row for this play's defensive personnel; skip plays whose scheme is not in the list
    match = np.where(schemes == yardline_play_result['personnelD'][idx])[0]
    if len(match) == 0:
        continue
    i = match[0]
    result = yardline_play_result['passResult'][idx]
    if result == 'C':
        playRes[i,0] += 1
        playRes[i,2] += 1
    elif result == 'I':
        playRes[i,1] += 1
        playRes[i,2] += 1
        if yardline_play_result['isDefensivePI'][idx] == True:
            playRes[i,3] += 1
    playRes[i,4] += yardline_play_result['epa'][idx]

result_frequencies = pd.DataFrame(data = {'compFreq': playRes[:,0]/playRes[:,2],
                                          'icompFreq': playRes[:,1]/playRes[:,2],
                                          'PIFreq': playRes[:,3]/playRes[:,2],
                                          'Total Plays': playRes[:,2],
                                          'Defense': schemes,
                                          'AvgEPA': playRes[:,4]/playRes[:,2]})
result_frequencies = result_frequencies.sort_values(by=['AvgEPA'], ascending = False)
result_frequencies

In [None]:
def plotDFreq(frequencies, min_plays = 250):
    frequencies = frequencies[frequencies['Total Plays'] >= min_plays]
    fig = plt.figure(figsize = (15,10))
    plt.style.use('fivethirtyeight')
    ax = fig.add_axes([0,0,1,1])
    X = np.arange(len(frequencies))
    ax.bar(X, frequencies['AvgEPA'], color = 'b', width = 0.25)
    ax.legend(labels=['Average EPA'])
    plt.xticks(X, frequencies['Defense'], rotation = 'vertical')
    ax.set_xlabel('Defense Personnel')
    ax.set_ylabel('EPA')
    ax.set_title(f'EPA vs Defense Personnel (Min {min_plays} Plays)')
    ax.axhline(y = 0, color = 'r')
    plt.show()

## Our next analysis looks at each defensive personnel grouping in the data and shows its average EPA

In [None]:
plotDFreq(result_frequencies, 20)

The EPA decreases when defensive personnel groupings include more defensive backs and fewer linebackers and defensive linemen. This trend holds across groupings as the number of DBs increases.
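
For comparison, the sketch below computes an average EPA per personnel grouping with a minimum-play filter, using hypothetical toy rows. It averages over every play in each group, which is an assumption and not quite the same denominator as the loop above, which divides summed EPA by completions plus incompletions.

```python
import pandas as pd

# Hypothetical toy plays; the EPA values are invented for illustration
toy = pd.DataFrame({
    'personnelD': ['4 DL, 2 LB, 5 DB'] * 3 + ['4 DL, 3 LB, 4 DB'] * 2 + ['3 DL, 4 LB, 4 DB'],
    'epa': [-0.5, 0.1, -0.2, 0.4, 0.6, 2.0],
})

# Mean EPA and number of plays for each grouping
summary = toy.groupby('personnelD')['epa'].agg(['mean', 'size'])

# Drop groupings with too few plays, then rank by average EPA
min_plays = 2
kept = summary[summary['size'] >= min_plays].sort_values('mean', ascending=False)
print(kept)
```

The filter matters here: the single-play grouping has the highest raw average in the toy data, but one play is too small a sample to rank.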

In [None]:
tracking.head(23)

In [None]:
temp = tracking[tracking['gameId']==2018090600]
temp = temp[temp['playId'] == 75]
temp = temp[temp['frameId'] == 1]

## This next chart was our way of visualizing players on the field in a 2D space

In [None]:
colordict = dict({'home': 'green',
              'away': 'red',
             'football': 'brown'})
color = temp['team'].apply(lambda x: colordict[x])
plt.scatter(temp['x'], temp['y'], c = color)
plt.title('Player Locations (Home = Green, Away = Red, Football = Brown)')

For the graph above, we can see the defensive personnel lined up in green and the offensive personnel in red. This graph shows where every player is physically located on the field. It can be adapted to show player locations at any point during the play, not just at the start as shown here.
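
One way to generalize the three filtering steps to any moment of a play is a small helper that takes the frame as a parameter. `select_frame` is a hypothetical name, and the toy rows below are invented; only the column names (`gameId`, `playId`, `frameId`, `x`, `y`, `team`) follow the tracking data used above.

```python
import pandas as pd

def select_frame(tracking_df, game_id, play_id, frame_id):
    """Return the tracking rows for one frame of one play."""
    mask = (
        (tracking_df['gameId'] == game_id)
        & (tracking_df['playId'] == play_id)
        & (tracking_df['frameId'] == frame_id)
    )
    return tracking_df[mask]

# Hypothetical toy tracking rows
toy = pd.DataFrame({
    'gameId': [1, 1, 1, 1],
    'playId': [75, 75, 75, 80],
    'frameId': [1, 2, 2, 1],
    'x': [30.0, 31.5, 45.2, 60.0],
    'y': [20.0, 21.0, 26.7, 10.0],
    'team': ['home', 'home', 'away', 'football'],
})

frame2 = select_frame(toy, game_id=1, play_id=75, frame_id=2)
print(frame2)
```

The returned frame can be passed to the same scatter-plot code, so looping over `frameId` values would give one snapshot per moment of the play.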


In [None]:
epa_result = plays[['standardYardLine', 'epa']]
epa = np.zeros([99,2])
for idx, val in epa_result['standardYardLine'].items():
    epa[val-11, 0] += epa_result['epa'][idx]
    epa[val-11, 1] += 1
rel_epa = epa[:,0]/epa[:,1]

## For this analysis, we combined some of the first two graphs to investigate the average EPA at each yardline

In [None]:
# rel_epa[0] holds standardYardLine 11, so label the bars 1-99 rather than 0-98
plt.bar(np.arange(1, 100), rel_epa)
plt.xlabel('Yardline')
plt.ylabel('Expected Points Added (EPA)')
plt.title('Expected Points Added vs Yardline')

Based on the graph above, it’s clear the distribution is not uniform. However, we can see that the EPA is lowest around the 50-yard line, while the two highest EPAs occur around the 10. The low EPA near the 50-yard line could result from “shot plays”: teams often attempt high-risk plays around that point on the field, and these plays often result in no gain or, even worse, a turnover. The high EPA near the ten-yard line could be caused by the offense’s wide choice of plays, which makes it hard for the defense to defend properly.


In [None]:
nDB = []
nLB = []
nDL = []
for idx, val in result_frequencies['Defense'].items():
    # Each count is the single character just before its position label
    nLB.append(val[val.find(' LB')-1])
    nDL.append(val[val.find(' DL')-1])
    nDB.append(val[val.find(' DB')-1])

In [None]:
result_frequencies['nLB'] = nLB
result_frequencies['nDB'] = nDB
result_frequencies['nDL'] = nDL

In [None]:
result_frequencies.head()