# NFL - Punt Data Analytics 
 _Kyle Killion, Former NFL Player and aspiring Data Scientist_

<hr>

__Introduction__

This notebook describes the datasets and variables provided for the analysis of football punt plays.  The data provided for analysis are specific to punt plays during the 2016 to 2017 seasons.  Four different data sources are provided which describe various elements of each punt play and player.  This notebook describes the specifics of each variable contained within the datasets as well as guidelines on the best approach to use for analysis.  

## Data Description
<hr>
**Data Relation**<br>
The following datasets are provided for the 2016 and 2017 NFL seasons.  Each dataset can be merged at the game, play, or player level using the provided key variables (Table 1).  GameKey is a unique identifier for a specific game and is unique across NFL seasons.  PlayID identifies a unique play within a specified GameKey.  GSISID is a unique identifier for a player across all seasons.
![Data Relation](../input/informativePics/DataRelation.png "Data Relationship")
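
As a rough illustration of merging at the game, play, and player level, here is a minimal sketch using toy frames. Only the key names (`GameKey`, `PlayID`, `GSISID`) and `Role` come from this notebook; the `Week` column and every value below are made up for the example. The `validate` argument asks pandas to confirm that the right-hand table really has one row per key.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are hypothetical.
games = pd.DataFrame({"GameKey": [1, 2], "Week": [1, 1]})
plays = pd.DataFrame({"GameKey": [1, 1, 2], "PlayID": [100, 200, 100]})
roles = pd.DataFrame({
    "GameKey": [1, 1, 2],
    "PlayID": [100, 100, 100],
    "GSISID": [11, 12, 11],
    "Role": ["P", "PR", "P"],
})

# Play level -> game level: many plays map to one game.
play_game = plays.merge(games, on="GameKey", how="left", validate="many_to_one")

# Player-role level -> play level: many player rows map to one play.
role_play = roles.merge(
    play_game, on=["GameKey", "PlayID"], how="left", validate="many_to_one"
)
print(role_play)
```

If a key turns out not to be unique on the right-hand side, `validate` raises a `MergeError` instead of silently duplicating rows.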

**Game Data**<br>
Game level data that specifies the type of season (pre, reg, post), week and the hosting city and team.  Each game is uniquely identified across all seasons using GameKey.  

![Game Data](../input/informativePics/GameData.png "Game Data Variables")
<br>

**Play Information** <br>
Play level data that describes the type of play, possession team, score and a brief narrative of each play.  Plays are uniquely identified using a PlayID together with the corresponding GameKey.  PlayIDs are not unique on their own: the same PlayID can appear in more than one game.  
<br>
![Play Data](../input/informativePics/PlayData.png "Play Data Variables")
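
Because a PlayID is only unique within a game, filtering on PlayID alone can pull in plays from unrelated games. The sketch below uses made-up values to compare a PlayID-only filter with a merge on the composite key (`GameKey`, `PlayID`).

```python
import pandas as pd

# Hypothetical tracking rows: PlayID 100 occurs in two different games.
ngs = pd.DataFrame({
    "GameKey": [1, 1, 2, 2],
    "PlayID": [100, 200, 100, 300],
    "GSISID": [11, 11, 21, 21],
})

# Hypothetical review table: only the play in game 1 is of interest.
review = pd.DataFrame({"GameKey": [1], "PlayID": [100]})

# Filtering on PlayID alone also matches game 2's PlayID 100.
by_play_only = ngs[ngs.PlayID.isin(review.PlayID)]

# Merging on the composite key keeps only the intended play.
by_game_and_play = ngs.merge(review, on=["GameKey", "PlayID"], how="inner")

print(len(by_play_only), len(by_game_and_play))
```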

**Player Punt Data**<br>
Player level data that specifies the traditional football position for each player.  Each player is identified using his GSISID.  
<br>
![Player Punt Data](../input/informativePics/PlayerPuntData.png "Player Punt Data Variables")

__Play Player Role Data__<br>
Play and player level data that specifies a punt-specific player role.  This dataset specifies each player that took part in each play.  A player’s role in a play is uniquely defined by the GameKey, PlayID, and GSISID.  

![Play Player Role Data](../input/informativePics/PlayPlayerRoleData.png "Play Player Role Data Variables")

__Video Review__<br>
Injury level data that provides a detailed description of the concussion-producing event.  Video Review data are only available in cases in which the injury play can be identified.  Each video review case can be identified using a combination of GameKey, PlayID, and GSISID.  A brief narrative of the play events is provided.
![Video Review Data](../input/informativePics/VideoReviewData1.png "Video Review Data Variables")
![Video Review Data](../input/informativePics/VideoReviewData2.png "Video Review Data Variables")

__NGS__ – __Next Gen Stats__<br>
Player level data that describes the movement of each player during a play.  The NGS data is identified using GameKey, PlayID, and GSISID.  Player data for each play is provided as a function of time (Time) for the duration of the play.  
![NGSData1 Data](../input/informativePics/NGSData1.png "Player Movement Data Variables")
![NGSData1 Data](../input/informativePics/NGSData2.png "Player Movement Data Variables")
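
Since each player's track is a function of time, per-player quantities are usually computed within a (`GameKey`, `PlayID`, `GSISID`) group after sorting by `Time`. The sketch below is only an illustration on made-up rows: it derives the elapsed time within the play and the straight-line step between consecutive samples. The derived column names `elapsed_s` and `step` are hypothetical, and the units of `x` and `y` are whatever the dataset uses.

```python
import numpy as np
import pandas as pd

# Hypothetical tracking samples for two players on one play.
track = pd.DataFrame({
    "GameKey": [1, 1, 1, 1],
    "PlayID": [100, 100, 100, 100],
    "GSISID": [11, 11, 12, 12],
    "Time": pd.to_datetime([
        "2017-01-01 13:00:00.0", "2017-01-01 13:00:00.1",
        "2017-01-01 13:00:00.0", "2017-01-01 13:00:00.1",
    ]),
    "x": [10.0, 10.3, 50.0, 50.0],
    "y": [20.0, 20.4, 25.0, 24.0],
})

keys = ["GameKey", "PlayID", "GSISID"]
track = track.sort_values(keys + ["Time"]).reset_index(drop=True)

# Seconds since the player's first sample on this play.
first_time = track.groupby(keys)["Time"].transform("min")
track["elapsed_s"] = (track["Time"] - first_time).dt.total_seconds()

# Straight-line distance between consecutive samples of the same player.
dx = track.groupby(keys)["x"].diff()
dy = track.groupby(keys)["y"].diff()
track["step"] = np.hypot(dx, dy)

print(track[["GSISID", "elapsed_s", "step"]])
```

The first sample of each player has no previous point, so its `step` is NaN.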

In [None]:
# Import Tools
import os
import re
import glob as glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# KungFu 
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering


# Use some configs
%matplotlib inline
plt.rcParams['figure.dpi'] = 170

# Exploring Data Set

In [None]:
import os

os.chdir('../input/NFL-Punt-Analytics-Competition/')

# Read in data
data = pd.read_csv('NGS-2017-post.csv', parse_dates=True)

# Carry the Event column forward so each row is labelled with the latest event
data['Event'] = data['Event'].ffill()

# Get Roles
role = pd.read_csv('play_player_role_data.csv')

# Get the Concussion data
events = pd.read_csv('video_review.csv')

# Take a summary (info() prints directly and returns None)
for frame in (data, role, events):
    frame.info()
    print('\n\n')

In [None]:
# First Merge the movement data with the Player Roles
posFrame = data.merge(role, how='inner', on=['Season_Year','GameKey','PlayID','GSISID'])

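posFrame.info()
# NOTE: PlayID alone is not unique across games, so this filter can also match
# plays from other games that share a PlayID with a reviewed play.
posFrame[(posFrame.PlayID.isin(events.PlayID))].sort_values('Time').head()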

In [None]:
# Look to see what Plays Resulted in Concussions
print('Number of Plays: ', len(events.PlayID.unique()))
print(events.PlayID.unique())



# Look at all the players involved with Blocking or being Blocked
events[(events.Primary_Partner_Activity_Derived == 'Blocking') | (events.Player_Activity_Derived == 'Blocking')].head()

In [None]:
# The Roles and Movement during those particular plays
# Need to look at the only players involved in concussions
conFrame = posFrame[(posFrame.PlayID.isin(events.PlayID))]


print('\nList of the Plays in the Frame\n')
# conFrame is already restricted to the reviewed PlayIDs; drop rows without an x position
conFrame = conFrame[~conFrame['x'].isna()]
print(conFrame.PlayID.unique())
print('\nList of the Positions/Role in the Frame\n')
print(conFrame.Role.unique())
print('\n###############################################################\n')

conFrame.info()
conFrame.head()

In [None]:
conFrame.loc[:,['x','y']]\
.ffill().dropna()\
.plot(x='x', y='y',kind='scatter')

endZone1 = [0, -10]
endZone2 = [100, 110]

plt.axvspan(endZone1[0], endZone1[1], alpha=.5, lw=0, color='red')
plt.axvspan(endZone2[0], endZone2[1], alpha=.5, lw=0, color='green')
plt.xticks(np.arange(0, 101, 10))
plt.grid()

# Data preprocessing - working with the whole Data Set

In [None]:
# Get the Concussion data
events = pd.read_csv('video_review.csv')

# Get Roles
role = pd.read_csv('play_player_role_data.csv')

# Run through all the data here to find all plays of interest
fileList = glob.glob(os.getcwd() + "/*.csv")

# Regex to match for NGS data
match = ['NGS']

# The Frame to build
bigFrame = pd.DataFrame()


for s in fileList:
    
    # Get all the movement NGS Data
    if (re.findall(r"(?=("+'|'.join(match)+r"))", s)):
        
        print('Processing %s...' % os.path.basename(s))

        df = pd.read_csv(s, 
                    parse_dates=['Time'],
                    infer_datetime_format=True,
                    dtype = {'Event' : str}) 
        
        # Carry play Event forward (assign back; in-place fill on a column may not stick)
        df['Event'] = df['Event'].ffill()
        
        
        # Filter for only Concussion plays
        df = df[(df.PlayID.isin(list(events.PlayID.values)))].sort_values('Time')
        
        # Now get the players involved
        df = df[df.GSISID.isin(list(events.GSISID.values))].sort_values('Time')

        
        # Fill in the Roles/Positions
        df = df.merge(role, how='inner', on=['Season_Year','GameKey','PlayID','GSISID'])
        df.reset_index(inplace=True, drop=True)
        
        # Concatenate the Data with bigFrame
        bigFrame = pd.concat([bigFrame, df], sort=False)

print('\n\nFinal DataFrame:\n\n')
bigFrame.info()
bigFrame.tail()

In [None]:
# Sanity Check
print(sorted([int(x) for x in bigFrame.GSISID.unique()]))
print('\n\n', sorted(events.GSISID.unique()))

In [None]:
# Merge Events 
bigFrame1 = bigFrame.merge(events, how='inner', on=['Season_Year','GameKey','PlayID','GSISID'])
bigFrame1 = bigFrame1[bigFrame1.Primary_Partner_GSISID != 'Unclear'].dropna().apply(pd.to_numeric, errors='ignore')
bigFrame1.info()

In [None]:
bigFrame2 = pd.get_dummies(data=bigFrame1, columns=['Role', 
                                                    'Player_Activity_Derived',
                                                    'Primary_Impact_Type', 
                                                    'Primary_Partner_Activity_Derived'])

# Remove for clustering analytics
del bigFrame2['Event']
del bigFrame2['Turnover_Related']
del bigFrame2['Friendly_Fire']


bigFrame2.info()

In [None]:
# Normalize Movement data (work on a copy so bigFrame2 keeps the raw values)
scaler = StandardScaler()
clusterFrame = bigFrame2.copy()
clusterFrame[['dir','dis','o','x','y']] = scaler.fit_transform(bigFrame2[['dir','dis','o','x','y']])

clusterFrame.set_index('Time', inplace=True)
clusterFrame.info()
clusterFrame.head()

# Spectral Clustering

In [None]:
# Spectral Clustering
specCluster = SpectralClustering(n_clusters=4, eigen_solver='arpack', random_state=43, assign_labels='discretize')

clusterFrame['clusters'] = specCluster.fit_predict(clusterFrame.values)
clusterFrame.info()


# Cluster Analytics 

In [None]:
# Select several columns with a list; tuple-style selection is not accepted by recent pandas
clusterFrame.groupby(['clusters'])[['Primary_Impact_Type_Helmet-to-helmet',
                                   'Primary_Impact_Type_Helmet-to-body']]\
.agg(['sum','count','max','mean','min','first','last'])

## Primary Partner Activity

In [None]:
clusterFrame.groupby(['clusters'])[['Primary_Partner_Activity_Derived_Blocked',
                                   'Primary_Partner_Activity_Derived_Blocking']]\
.agg(['sum','count','max','mean','min','first','last'])

In [None]:
clusterFrame.groupby(['clusters'])[['Primary_Partner_Activity_Derived_Tackled',
                                   'Primary_Partner_Activity_Derived_Tackling']]\
.agg(['sum','count','max','mean','min','first','last'])

## Player Activity

In [None]:
clusterFrame.groupby(['clusters'])[['Player_Activity_Derived_Tackled',
                                   'Player_Activity_Derived_Tackling']]\
.agg(['sum','count','max','mean','min','first','last'])

In [None]:
clusterFrame.groupby(['clusters'])[['Player_Activity_Derived_Blocked',
                                   'Player_Activity_Derived_Blocking']]\
.agg(['sum','count','max','mean','min','first','last'])

In [None]:
# Look through the clusters

sumFrame = pd.DataFrame()

for cluster in [0,1,2,3]:
    interestFrame = clusterFrame[clusterFrame.clusters == cluster]

    print('\n\n')
    print('Number of Plays in Cluster %s :' % cluster, len(interestFrame['PlayID'].unique()))
    print('Plays from Cluster %s : ' % cluster, interestFrame['PlayID'].unique())
    print('Players from Cluster %s : ' % cluster, interestFrame['GSISID'].unique(), interestFrame.Primary_Partner_GSISID.unique())

    
    players = interestFrame.Primary_Partner_GSISID.unique()
    
    activityFrame = interestFrame[interestFrame.Primary_Partner_GSISID.isin([int(x) for x in players])]\
          .loc[:,['PlayID',
                  'Primary_Partner_GSISID',
                  'GSISID',
                  'Player_Activity_Derived_Blocked',
                  'Player_Activity_Derived_Blocking',             
                  'Player_Activity_Derived_Tackled',              
                  'Player_Activity_Derived_Tackling',
                  'Primary_Partner_Activity_Derived_Blocked',
                  'Primary_Partner_Activity_Derived_Blocking',
                  'Primary_Partner_Activity_Derived_Tackled',
                  'Primary_Partner_Activity_Derived_Tackling',
                  'Primary_Impact_Type_Helmet-to-helmet',
                  'Primary_Impact_Type_Helmet-to-body']]


    print('\n\n')
    results = activityFrame[activityFrame > 0].groupby(['PlayID','GSISID', 'Primary_Partner_GSISID']).last().sum()
    print(results)
    
    # Select the activity columns with a list so the groupby works on recent pandas
    activityFrame.groupby(['PlayID','GSISID', 'Primary_Partner_GSISID'])[['Player_Activity_Derived_Blocked',
                  'Player_Activity_Derived_Blocking',             
                  'Player_Activity_Derived_Tackled',              
                  'Player_Activity_Derived_Tackling',
                  'Primary_Partner_Activity_Derived_Blocked',
                  'Primary_Partner_Activity_Derived_Blocking',
                  'Primary_Partner_Activity_Derived_Tackled',
                  'Primary_Partner_Activity_Derived_Tackling',
                  'Primary_Impact_Type_Helmet-to-helmet',
                  'Primary_Impact_Type_Helmet-to-body']]\
    .last().sum().plot(kind='bar', legend=False)
    
    plt.show()
    
    filter_cols = [col for col in clusterFrame if col.startswith('Role')]
    filter_cols.append('clusters')
    filter_cols.append('PlayID')

    interestFrame.loc[:,filter_cols].groupby(['clusters','PlayID'])\
    .last().sum().plot(kind='bar',legend=False)
    
    plt.show()
    
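    # DataFrame.append was removed in pandas 2.0; add the Series as one row via concat
    sumFrame = pd.concat([sumFrame, results.to_frame().T], ignore_index=True)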
    

# Overall Cluster Sums 

In [None]:
sumFrame.sum().plot(kind='bar')
print(sumFrame.loc[:,['Player_Activity_Derived_Tackled',              
                  'Player_Activity_Derived_Tackling',
                'Primary_Partner_Activity_Derived_Tackled',
                  'Primary_Partner_Activity_Derived_Tackling']].sum())
sumFrame.loc[:,['Player_Activity_Derived_Blocked','Player_Activity_Derived_Blocking',
                'Primary_Partner_Activity_Derived_Blocked','Primary_Partner_Activity_Derived_Blocking']].sum()

# Investigated Instance

http://a.video.nfl.com//films/vodzilla/153280/Wing_37_yard_punt-cPHvctKg-20181119_165941654_5000k.mp4

In [None]:
from IPython.display import HTML

HTML('<video width="560" height="315" controls> <source src="http://a.video.nfl.com//films/vodzilla/153280/Wing_37_yard_punt-cPHvctKg-20181119_165941654_5000k.mp4" type="video/mp4"></video>')

# Conclusion

After working through the data with an unsupervised Spectral Clustering approach, the results were interesting to look through. A wide variety of insights can be drawn, and several improvements could strengthen the analysis. One would be to separate the two sides of the ball (punting team and return team) so the algorithm is aware of that relationship. Another would be to introduce punt hang time, in seconds, into the study, since it strongly influences the decisions players make. 
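
As a rough sketch of how hang time might be derived from the NGS `Event` column, the example below takes the first timestamp of a start event and an end event per play and subtracts them. The event labels `punt` and `punt_received`, the helper name `hang_time_seconds`, and all values are assumptions for illustration; the actual labels should be checked against the data.

```python
import pandas as pd

# Hypothetical event timeline for a single play.
ngs_events = pd.DataFrame({
    "GameKey": [1, 1, 1, 1],
    "PlayID": [100, 100, 100, 100],
    "Time": pd.to_datetime([
        "2017-01-01 13:00:00.0", "2017-01-01 13:00:02.0",
        "2017-01-01 13:00:06.4", "2017-01-01 13:00:10.0",
    ]),
    "Event": ["ball_snap", "punt", "punt_received", "tackle"],
})


def hang_time_seconds(frame, start_event="punt", end_event="punt_received"):
    """Seconds between the first start_event and first end_event of each play."""
    keys = ["GameKey", "PlayID"]
    first_seen = (
        frame[frame["Event"].isin([start_event, end_event])]
        .groupby(keys + ["Event"])["Time"].min()
        .unstack("Event")
        # Guarantee both columns exist even if one event never occurs.
        .reindex(columns=[start_event, end_event])
    )
    return (first_seen[end_event] - first_seen[start_event]).dt.total_seconds()


hang = hang_time_seconds(ngs_events)
print(hang)
```

Plays that are missing either event come back as NaN, which covers punts that are never fielded.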

__Clustering Summary:__

I selected an arbitrary number of 4 clusters for Spectral Clustering which are as follows:<br>

__Cluster 0__<br>
This cluster predominantly favored tackling, with 10 plays. The tackling concussions came mainly from the Primary Partner Activity. Notably, no Primary Partners were blocking. This cluster also leaned toward helmet-to-body impacts and singled out 3 plays involving the Punt Returner. 

__Cluster 1__ <br>
This cluster also favored tackling; however, the activity came largely from the Player Activity Tackling category, which accounted for 5 of the 8 total plays in the cluster. 

__Cluster 2__<br>
This cluster was fairly uniform, with a slight preference for tackling in 4 of the 7 plays involving injury.  It was an odd collection of plays, with long change-of-field returns and a fake punt on which the punter suffered a concussion on the tackle. Within the tackling plays, the Primary Partner activity was much more common than the Player Activity. 

__Cluster 3__<br>
This cluster was predominantly a blocking/blocked cluster, with 5 of the 7 total plays. Notably, 3 of the plays involved wall blocking schemes. The cluster also had no Primary Partners making tackles. PlayID 1683 was a textbook helmet-to-helmet hit on a defenseless tackler. 
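
The choice of four clusters above was arbitrary. One hedged way to compare candidate cluster counts is the silhouette score, sketched below on synthetic blobs rather than the punt data. This is not the procedure used in this notebook, and the range of `k` values tried is an assumption.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (not the punt data).
X, _ = make_blobs(n_samples=90, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit Spectral Clustering for several cluster counts and score each labelling.
scores = {}
for k in range(2, 6):
    labels = SpectralClustering(n_clusters=k, random_state=43).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher score is only a guide; the clusters still need to make football sense when reviewed against the film.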

But while considering all that was in front of me, I knew that upholding the game was important to me. The first time the defenseless-receiver rule was called, I became frustrated. Why was it not imposed on Special Teams as well, if we were only going to enforce it for the Offense? Special Teams is a third of the game, and these plays are specifically designed to take someone’s head off. There are Right and Left Wedges/Walls and snipe blocks coming off the heels targeting guys.

So with this in mind, and after going through the game film (specifically PlayID 1683 in the investigated instance) and the data, the fairest and most feasible rule to impose is one that protects a defenseless tackler from helmet-to-helmet contact, especially the guys inside the box. This would significantly affect roughly 50% of the concussions during punts. Most importantly, it would preserve the game while maintaining equal safety protections over the course of a football game. 
