# Strides Towards Safety 
### NFL 1st and Future - Playing Surface Analytics -  Submission by Katherine Lordi

# My Findings <a id="1"></a>
Given the provided [datasets](https://www.kaggle.com/c/nfl-playing-surface-analytics/data), I found specific player movement patterns that may present an elevated risk of acute onset of injury in NFL athletes. 
### *Metrics of Player Movement That May Influence Risk of Injury:*
#### 1) Frequency of Slows/Stops: When players slow to speeds near 0 (or completely stop) more frequently, they may be at greater risk for injuries on Synthetic Turf than Natural Grass on average. 
* On average, number of slows/stops per sec  for injuries was higher on Synthetic Turf than Natural Grass by 30%. 

#### 2) Frequency of Changes of Direction: When players change their direction more frequently, they may be at greater risk for injuries on Synthetic Turf than Natural Grass on average. 
- On average, number of changes in direction per sec  for injuries was higher on Synthetic Turf than Natural Grass by 37%. 

#### 3) Frequency of Changes of Orientation: When players change their orientation (turn or pivot) more frequently, they may be at greater risk for injuries on Synthetic Turf than Natural Grass on average.
- On average, number of changes in orientation per sec for injuries was higher on Synthetic Turf than Natural Grass by 63%. 

#### 4) Differences in Metrics Across Playing Surfaces for Non-Injuries: For the non-injured player population, I found that the metrics of player movement that may influence risk of injury (frequencies of slows/stops, change in direction, and change in orientation) do not differ across playing surfaces. 

### *How Player Movement and Game Environment Interact to Influence Risk of Injury :*
#### 1) On average in outdoor stadiums, the severity of injuries that occurred on synthetic turf was 87% higher than on natural grass.
- Holds across player movement metrics (frequency of slows/stops, direction, orientation) that may influence risk of injury
- Looked at play types with the most injuries (e.g., pass, punt)
- Looked at positions with the most injuries (e.g., WR) 

### *Extensions of My Findings to Next Gen Stats to Help Safety Efforts*
Based on my findings above and my attempt at having a decision tree model classify things such as field type and DM from the data, here are some ideas to help reduce lower extremity injuries in the NFL, the highest driver of days missed by players. 
#### Incorporate these player movement metrics into AI models aiming to predict and prevent injury 
- Frequency of slows/stops, direction and orientation changes
- Other historical player-specific and game-level data 

#### Look at these player movement metrics in conjunction with specific types of cleats, grass and synthetic fields and incorporate into AI models 
- Cleat Tracking Technology 
- Injury Rates Per Cleat 

![](https://media.phillyvoice.com/media/images/Wentz_Peters.2e16d0ba.fill-735x490.png)
Lower extremity injuries such as ACL tears and Achilles tendon ruptures are two of the most serious season-ending injuries ([Bleacher Report](http://bleacherreport.com/articles/2089798-how-and-why-minimal-or-non-contact-injuries-occur-in-nfl-workouts)). Carson Wentz (QB) and Jason Peters (T) of the Philadelphia Eagles suffered season-ending knee injuries prior to the Eagles Super Bowl LII win. 


# Contents 
* [My Findings](#1)  
* [Challenge & Dataset](#2)
* [Overview of My Process](#3)
* [1) Preprocessing the Data](#4) 
* [2) Understanding the Data: Preliminary Visualizations of Injuries](#5)
* [3) Visualizing Player Movement](#6)
* [4) Preprocessing Player Tracking Data: Flattening the Data](#7)
* [5) A Look at Player Movement: Slows/Stops, Direction, Orientation](#8)
* [6) A Look at Outdoor Stadiums](#9)
* [7) Decision Tree to Predict Injury](#10)
* [8) Extensions to Next Gen Stats](#11)

# Overview: Challenge & Dataset <a id="2"></a>

#### **The Challenge:** To characterize differences in player movement between the natural grass and synthetic turf and identify specific variables that may influence player movement and the risk of injury. For the analysis, we used [3 data files](https://www.kaggle.com/c/nfl-playing-surface-analytics/data) that are linked using the fields PlayerKey, GameID, and PlayKey. 

- **InjuryRecord:** Contains 105 lower-limb non-contact injuries, with the injured body part, the field type on which the injury occurred, and severity based on the number of days a player missed during regular season games over two seasons.
- **PlayList:** Contains details for the 267,005 player-plays including the player's roster and play positions, stadium type, field type, weather, and play type.
- **PlayerTrackData:** Contains 76,366,748 data points of player tracking data via [Next Gen Stats](https://nextgenstats.nfl.com/) that describe the location, orientation, speed, and direction of each player during a play. 10 observations were recorded per second. 

### Data Constraints 
With these three files, we have information about the injuries, the injured player (position, day #, game #), tracking data on the field, play type, and environment (playing surface, stadium type, weather, temperature). However, here are some constraints that I took into consideration: 
- We only have data on 105 non-contact lower limb injuries that occurred in 267,005 plays (0.039% of plays have a non-contact lower-limb injury)
- When exactly in the play does the injury occur? 
- What kind of injury is it exactly? (ie. ankle sprain, ACL tear, etc.) 
- Name? Age? Years Active in the NFL? Height? Weight? Teams? Players are more than just a PlayerKey. 
- What is the player's injury history? Is the player suffering or recovering from any other injuries? Having other injuries certainly impacts health and the potential for future injuries.
- Not all players play the same # of games on the same types of fields
- Specifics about type of cleats differ for many players across natural grass VS synthetic turf 
- Type of playing surface that a player and team practices on
- Home Field playing surface 
- How often a player plays on natural grass VS synthetic turf 
- Player preferences for playing surface 
- etc. 
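
The injury rate quoted in the first constraint follows directly from the two counts given in the dataset description. A quick check, using only those two numbers:

```python
# Counts taken from the dataset description above.
n_injuries = 105
n_player_plays = 267_005

# Share of player-plays with a recorded non-contact lower-limb injury.
injury_rate_pct = 100 * n_injuries / n_player_plays
print(f"{injury_rate_pct:.3f}% of player-plays")  # 0.039% of player-plays
```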

The above points are limitations of our dataset, and player factors such as these certainly play a major role in understanding player biomechanics and potential risks for injury. However, rather than focusing on biomechanics and these factors, **we focus on examining player movement itself on synthetic turf and natural grass** and **variables that may influence movements and risk of injury** such as player tracking data and environmental/game variables such as weather, temperature, play type, etc. 

# Overview: My Process <a id="3"></a>
In football, there is so much variability in player movement that differs across every single play across so many game and environmental variables. For example, Wide Receivers (WR) constantly change their speed and direction to get open, and Cornerbacks (CB) that defend them  quickly anticipate their movements to prevent them from catching a pass. Additionally, we only have 105 non-contact lower limb injuries total, so these injuries are pretty rare. 

After doing my preliminary analysis of the data, I saw that there were no significant correlations between player tracking data (player's x and y position on the field, speed, acceleration, orientation, and direction) and environmental/game variables to injuries on different field types. Understanding that no playing surface is perfect (but there is [NFL Field Certification](https://operations.nfl.com/the-game/game-day-behind-the-scenes/nfl-field-certification/)!) and given 2 years of Player Tracking Data, I sought to dive deeper into player movements, such as rates and frequencies of sudden stops, directional changes, and orientation changes, in order to understand how they relate to risk of injury. In this notebook, I show my process and the notable visualizations that helped me get to my findings. 

# 1) Preprocessing The Data <a id="4"></a>
First, I imported the files and joined InjuryRecord with PlayList into one file in order to visualize relationships between factors such as injury location, severity, play type, field type, position, weather, temperature, etc. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import export_graphviz
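from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from datetime import datetime  # used for timing in the flattening code further down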
from IPython.display import Image
import seaborn as sns 
import numpy as np

In [None]:
# Import Data Files and display length 
ir = pd.read_csv("../input/nfl-playing-surface-analytics/InjuryRecord.csv")
pl = pd.read_csv("../input/nfl-playing-surface-analytics/PlayList.csv").set_index("PlayKey")
ptd = pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv").set_index("PlayKey")
l1 = len(ir)
l2 = len(pl)
l3 = len(ptd)
print("Length of InjuryRecord is ", l1)
print("Length of PlayList is ", l2)
print("Length of PlayerTrackData is ", l3)

I also modified InjuryRecord by adding the appropriate missing values in the PlayKey column and merging the multiple DM (Days Missed) columns indicating severity  into one column for simplicity. Therefore, DM = 1 means 1+ days missed while a DM = 42 means 42 + days missed. See the head of this file below. 
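
To illustrate the severity merge on its own, here is a minimal sketch on made-up rows. It assumes the four `DM_M*` flags are cumulative (a player who missed 28+ days also has the 1+ and 7+ flags set), which is what a sum-based mapping implies. The `toy` frame and `flags_to_days` lookup are names introduced only for this example.

```python
import pandas as pd

# Toy rows with the four severity flags (made-up values, assumed cumulative).
toy = pd.DataFrame({
    "DM_M1":  [1, 1, 1, 1],
    "DM_M7":  [0, 1, 1, 1],
    "DM_M28": [0, 0, 1, 1],
    "DM_M42": [0, 0, 0, 1],
})

# Number of flags set -> lower bound on days missed.
flags_to_days = {0: 0, 1: 1, 2: 7, 3: 28, 4: 42}
flag_cols = ["DM_M1", "DM_M7", "DM_M28", "DM_M42"]
toy["DM"] = toy[flag_cols].sum(axis=1).map(flags_to_days)

print(toy["DM"].tolist())  # [1, 7, 28, 42]
```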

In [None]:
# Modifying InjuryRecord: Adding PlayKey for missing rows and Merging DM columns (indicate severity) into one column. 
def fdm(d):
    DM=[1,7,28,42]
    return DM[d-1] if d else 0

ir = ir.fillna("*")

ir['dmi'] = ir['DM_M1'] + ir['DM_M7'] + ir['DM_M28'] + ir['DM_M42']
ir['DM'] = ir['dmi'].apply(fdm)

gidpk = {}
for gid in ir.query('PlayKey == "*"').GameID:
    gidpk[gid] = pl.query('GameID=="%s"' % gid).index[-1]


def fpku(a,b):
    return gidpk[b] if a == "*" else a

ir['PlayKey'] = ir.apply(lambda x: fpku(x.PlayKey, x.GameID), axis=1)

InjuryRecord = ir.set_index("PlayKey")[['BodyPart','DM']]
InjuryRecord.head()

Next, I joined InjuryRecord and PlayList into a new file and then checked that the # of injuries in this file is 105.

In [None]:
# joining InjuryRecord and PlayList 
IRPL = pl.join(InjuryRecord).fillna(0)
IRPL.head()

In [None]:
# does a check to ensure that the # of injuries in the dataset is 105. 
IRPL.query('DM > 0').count()

# 2) Understanding the Data: Preliminary Visualizations <a id="5"></a>
First, I looked at some tables and visualizations in order to better understand the data, visualize relationships, brainstorm ideas, and guide my next steps. 

### Injuries by Body Part and Field Type 
The graphs below visualize the distribution of injury type on each playing surface and how severity differed across the playing surfaces. Some things to note are:
- Knee and ankle injuries are the most common. 
- There are an equal number of knee injuries on synthetic turf and natural grass (24). 
- There are more ankle injuries on synthetic turf (25) than natural grass (17). 
- There are more foot injuries on natural grass than synthetic turf.
- There are more toe injuries on synthetic turf than natural grass.
- Ankle injuries are more severe (more days missed) on synthetic turf than natural grass.  

In [None]:
fig = plt.figure(figsize=(20,16))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)
ax1.set_title('# Injuries by Body Part')
ax1.set_ylabel('Count')
ax1.set_xlabel('Body Part')
ax2.set_title('# Injuries by Field Type')
ax2.set_ylabel('Count')
ax2.set_xlabel('Body Part')
ax3.set_title('# Natural Grass: Injuries by Body Part by Severity')
ax3.set_ylabel('Count')
ax3.set_xlabel('Body Part')
ax3.set_ylim(0,12)
ax4.set_title('# Synthetic Turf: Injuries by Body Part by Severity')
ax4.set_ylabel('Count')
ax4.set_xlabel('Body Part')
ax4.set_ylim(0,12)
plot1 = IRPL.query('DM>0').pivot_table(index=['BodyPart'], columns=[], values='DM', aggfunc='count').fillna(0).plot(kind='bar',ax=ax1,legend=False)
plot2 = IRPL.query('DM>0').pivot_table(index=['BodyPart'], columns=['FieldType'], values='DM', aggfunc='count').fillna(0).plot(kind='bar',ax=ax2)
plot3 = IRPL.query('DM>0 and FieldType == "Natural"').pivot_table(index=['BodyPart'], columns=['DM'], values='PlayerKey', aggfunc='count').fillna(0).plot(kind='bar',ax=ax3)
plot4 = IRPL.query('DM>0 and FieldType == "Synthetic"').pivot_table(index=['BodyPart'], columns=['DM'], values='PlayerKey', aggfunc='count').fillna(0).plot(kind='bar',ax=ax4)

### Injuries By Position, Field Type, Severity, & Play Type 
The graph below shows positions with an injury by body part. We notice the following in the graph below:
- Wide Receivers (WR) see the most non-contact ankle and knee injuries.
- Cornerbacks (CB) have the most ankle and knee injuries of the DB Position Group.
- Outside Linebackers (OLB) see the second highest # of knee injuries (next to WR) and also a notable number of ankle injuries.  
- Running Backs (RB) also see greater amounts of non-contact knee injuries than other positions but have fewer ankle injuries. 


In [None]:
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(111)
ax1.set_title('# Injuries by Position and Body Part')
ax1.set_ylabel('Count')
ax1.set_xlabel('Position Group, Position')
IRPL.query('DM>0').pivot_table(index=['PositionGroup','Position'], columns=['BodyPart'], values='DM', aggfunc='count').fillna(0).plot(kind='bar', ax = ax1)
plt.show()

Next, I looked at injuries through the perspective of field type and severity. In the table below, count is the number of injuries while sum is their total severity (DM == Days Missed). I noticed the following in the table below:
- Cornerbacks (CB) only have ankle injuries on synthetic turf (7) and report 0 ankle injuries on natural grass. 
- Strong Safeties (SS) only have ankle injuries on synthetic turf (3) and report 0 ankle injuries on natural grass. 
- Wide Receivers (WR) have 5 ankle injuries on natural grass and 4 on synthetic turf. 
- There are an equal number of knee injuries on synthetic turf and natural grass. 

In [None]:
IRPL.query('DM>0').pivot_table(index=['PositionGroup', 'Position','FieldType'], columns=['BodyPart'], values='DM',aggfunc=['count', 'sum']).fillna(0)

I also looked at injuries through the perspective of field type and play type. I noticed the following patterns in the table and graph below:
- Pass and rush plays have the most injuries. 
- The most knee injuries occurred during a pass (10 natural, 10 synthetic) followed by rush (5 natural, 5 synthetic) and then kickoff (4 natural, 3 synthetic), and punt (4 natural, 2 synthetic). 
- The most ankle injuries occurred during a pass (7 natural, 14 synthetic --> largest discrepancy) followed by rush (7 natural, 7 synthetic), kickoff, and then punt. 

In [None]:
IRPL.query('DM>0').pivot_table(index=['PlayType', 'FieldType'], columns=['BodyPart'], values='DM',aggfunc=['count', 'sum']).fillna(0)

In [None]:
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(111)
ax1.set_title('# Injuries by Play Type and Body Part')
ax1.set_ylabel('Count')
ax1.set_xlabel('Play Type')
IRPL.query('DM>0').pivot_table(index=['PlayType'], columns=['BodyPart'], values='DM', aggfunc='count').fillna(0).plot(kind='bar', ax = ax1)
plt.show()

It is difficult to speculate from the above tables about WHY certain positions and play types have different numbers of injuries in different body parts based on the playing surface. Therefore, my next step was to look at correlations between data variables and analyze player movements using the Player Tracking Data. 

### Correlations Between Injuries and PlayList Columns (Field Type, StadiumType, Temperature, Weather, PlayType, Position)

I made a correlation heat map to see if there are any obvious correlations between injuries and game/player variables. After seeing no correlations, I dove into trying to analyze player movements using the Player Tracking Data since 2 seasons worth was provided. 
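
Before computing correlations, the categorical columns have to be turned into integers. As a minimal sketch of that encoding step on made-up values, `pd.factorize` assigns codes in order of first appearance, so the mapping is the same on every run. The `toy` frame and the `encode_column` helper are hypothetical names used only here. The codes are arbitrary labels, so correlations computed on them are best read as a rough screen.

```python
import pandas as pd

# Made-up rows; column names mirror PlayList.
toy = pd.DataFrame({
    "FieldType": ["Natural", "Synthetic", "Natural", "Synthetic"],
    "PlayType": ["Pass", "Rush", "Pass", "Punt"],
})

def encode_column(frame, source, target):
    """Add an integer-coded copy of `source` as `target`; return the category order."""
    codes, categories = pd.factorize(frame[source])
    frame[target] = codes
    return list(categories)

field_categories = encode_column(toy, "FieldType", "ifield")
play_categories = encode_column(toy, "PlayType", "iplay")

print(field_categories)        # ['Natural', 'Synthetic']
print(toy["iplay"].tolist())   # [0, 1, 0, 2]
```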

In [None]:
def ftoint(d):
    l = list(set(d))
    ld = {l[k]:k for k in range(len(l))}

    def fmap(x):
        return ld[x]
    
    return fmap

def fint(s, k):
    f =  ftoint(IRPL[s])
    IRPL[k] = IRPL[s].apply(f)

In [None]:
fint('RosterPosition', 'irp')
fint('StadiumType', 'istadium')
fint('FieldType', 'ifield')
fint('Weather', 'iweather')
fint('PlayType', 'iplay')
fint('Position', 'iposition')
fint('PositionGroup', 'igroup')
fint('BodyPart', 'ibody')

fig = plt.figure(figsize=(20,20))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax1.set_title('Heat Map of Correlations of Features to Injury')
ax2.set_title('Heat Map of Correlations of Features')
sns.heatmap(IRPL.query('DM > 0')[['PlayerDay','PlayerGame','PlayerGamePlay','Temperature','iweather','ifield','istadium','iposition','igroup','ibody']].corr(),ax=ax1)
sns.heatmap(IRPL[['PlayerDay','PlayerGame','PlayerGamePlay','Temperature','iweather','ifield','istadium','iposition','igroup','ibody']].corr(),ax=ax2)
plt.show()

# 3) Visualizing Player Movement <a id="6"></a> 
I did some visualizations of Player Movement using the Player Tracking Data in order to see how different positions move on the field. In these graphs of the player's direction, change in direction, velocity, and change in orientation over time, we can better understand how the player is moving on the field.
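
To make "change in direction" concrete, here is a minimal sketch on made-up headings. It unwraps the angles so that a turn through north (355 to 2 degrees) counts as a 7 degree change, not 353, and then bins the absolute change per sample. Converting to radians before calling `np.unwrap` is a choice made for this example and may differ in detail from the notebook cells.

```python
import numpy as np
import pandas as pd

# Made-up headings in degrees, one sample every 0.1 s; the player turns through north.
headings_deg = np.array([350.0, 355.0, 2.0, 10.0, 95.0])

# Unwrap in radians so the 355 -> 2 step becomes +7 degrees instead of -353.
unwrapped_deg = np.degrees(np.unwrap(np.radians(headings_deg)))

# Absolute change between consecutive samples (the first sample gets 0).
change_deg = np.abs(np.diff(unwrapped_deg, prepend=unwrapped_deg[0]))

# Bin the changes with the same 30-degree edges used in the notebook.
edges = [0, 30, 60, 90, 120, 150, 180, 360]
bins = pd.cut(change_deg, bins=edges, labels=edges[1:])

print(np.round(change_deg, 6).tolist())  # [0.0, 5.0, 7.0, 8.0, 85.0]
print(list(bins))
```

A change of exactly 0 falls outside the first bin `(0, 30]`, so it is left uncounted.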

In [None]:
# Change this to change the player and play being looked at: 
PLAYER='43518'
PK='43518-6-25'

In [None]:
PK0 = IRPL.query('PlayerKey==%s' % PLAYER).index[0]
PKN = IRPL.query('PlayerKey==%s' % PLAYER).index[-1]
(PK0, PKN)

df1 = ptd[PK0:PKN]
df2 = IRPL[PK0:PKN]

df = df1.join(df2).fillna(0)

df['rdir'] = df['dir'] * np.pi / 180.

PK = '43518-1-12'
dfx = df.query('PlayKey == "%s"' % PK).reset_index().set_index('time')

funwrap = lambda c : np.unwrap(dfx[c], discont=180.)
fdiff = lambda c : abs(dfx[[c]].diff(axis=0).fillna(0))

dfx['udir'] = funwrap('dir')
dfx['vudir'] = fdiff('udir')

dfx['uo'] = funwrap('o')
dfx['vuo'] = fdiff('uo')


In [None]:
fbins = lambda c, b : pd.cut(x=dfx[c], bins=b, labels=b[1:])
sbins = [0., 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 5.0 ]
scols = ['cs_%d' % int(i*100) for i in sbins[1:] ]
dfx['s_bin'] = fbins('s', sbins)

bins = [0,30,60,90,120,150,180,360]
cols1 = ['cdir_%d' % i for i in bins[1:]]
dfx['vudir_bin'] = fbins('vudir', bins)

cols2 = ['co_%d' % i for i in bins[1:]]
dfx['vuo_bin'] = fbins('vuo', bins)

fzip = lambda c, h : dict(zip(h, list(dfx.groupby(c)[c].count())))

x = fzip('vudir_bin', cols1)
x.update(fzip('vuo_bin', cols2))
x.update(fzip('s_bin', scols))

#dfx.groupby('s_bin')['s_bin'].count()

In [None]:
x = dfx.groupby('vudir_bin')['vudir_bin'].count()
cols = ['cdir_%d' % i for i in bins[1:]]
(cols,list(x))

dict(zip(cols,list(x)))

fig = plt.figure(figsize=(20,12))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

ax1.set_title('Player Direction Over Time')
ax1.set_ylabel('Degrees')
ax1.set_xlabel('time (sec)')
ax2.set_title('Change in Direction Over Time')
ax2.set_ylabel('Degrees')
ax2.set_xlabel('time (sec)')
ax3.set_title('Player Velocity Over Time')
ax3.set_ylabel('yards/sec')
ax3.set_xlabel('time (sec)')
ax4.set_title('Change in Orientation Over Time')
ax4.set_ylabel('degrees')
ax4.set_xlabel('time (sec)')
ax4.set_ylim(-90,90)


dfx['udir'].plot(ax=ax1)
dfx['vudir'].plot(ax=ax2)
dfx['s'].plot(ax=ax3)
dfx['vuo'].plot(ax=ax4)
dfx.query('s_bin <= 0.05')['s'].plot(style='.', ax=ax3)
plt.show()

# 4) Preprocessing Player Tracking Data: Flattening the Data <a id="7"></a>
After visualizing player movement in different ways, I hypothesized that when players are constantly shifting their direction, orientation, speed, and making cuts on the field that they may be at greater risk of lower limb injuries. I identified the following variables of interest: change in direction, change in orientation, changes in speed near 0. 

Since Player Tracking Data is so large and each row represents 0.1 seconds of a play, I decided to flatten it in order to reduce the data to one data point per play. I did the following when flattening the Player Tracking Data:
- Classified changes in direction and orientation by binning  in 30 degree buckets
- Characterized velocity in the x and y direction by binning the absolute change in position between consecutive samples (0.1 seconds apart) in buckets from 0.01 to 5 yards. 

See the code for flattening the data below (do not run it; I load the resulting file afterwards): 

In [None]:
# Flattening Player Tracking Data in order to decrease the amount of data points to one data point per play
# Classified changes in direction and orientation by binning  in 30 degree buckets
# Characterized velocity in the x and y direction by binning in buckets of 0.01 yards/second to 5 yards/second. 

class Play(object):
    
    def __init__(self):
        self.header=list(pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv",nrows=2).keys())
        self.current_play_df = None
        self.next_row = 1
    
    def get_play1(self):
        dft_prev = None
        n = 0
        for df in pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv",names=self.header,chunksize=1000,skiprows=1):
            n = n + 1
            dfc = df[['PlayKey']]
            pkeys_j = dfc.ne(dfc.shift()).apply(lambda x: x.index[x].tolist())
            pkeys = [dfc.PlayKey[j] for j in pkeys_j[0]]
            for playkey in pkeys[0:-1]:
                if dft_prev is None:
                    dft = df.query('PlayKey == "%s"' % playkey)
                else:
                    dft = pd.concat([dft_prev, df.query('PlayKey == "%s"' % playkey)])
                    dft_prev = None
                
                yield dft.reset_index()
                
            dft_prev = df.query('PlayKey == "%s"' % pkeys[-1])
            
    
    def next_play(self):
        dft = pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv",names=self.header,skiprows=self.next_row,nrows=1000)
        playkey = dft.PlayKey[0]
        self.current_play_df = dft.query('PlayKey == "%s"' % playkey)
        l = len(self.current_play_df.PlayKey)
        self.next_row = self.next_row + l
        return self.current_play_df
    
    def get_playkey(self):
        return self.current_play_df.PlayKey[0]
    
    def get_events(self):
        return set(self.current_play_df.event)
    
    def get_plays(self, n):
        
        sbins = [0., 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 5.0 ]
        scols = ['cs_%d' % int(i*100) for i in sbins[1:] ]
        scolsx = ['cvx_%d' % int(i*100) for i in sbins[1:] ]
        scolsy = ['cvy_%d' % int(i*100) for i in sbins[1:] ]
        
        bins = [0,30,60,90,120,150,180,360]
        cols1 = ['cdir_%d' % i for i in bins[1:]] 
        cols2 = ['co_%d' % i for i in bins[1:]] 

        funwrap = lambda dfx, c : np.unwrap(dfx[c], discont=180.)
        fdiff = lambda dfx, c : abs(dfx[[c]].diff(axis=0).fillna(0))
        fbins = lambda dfx, c, b : pd.cut(x=dfx[c], bins=b, labels=b[1:])
        fzip = lambda dfx, c, h : dict(zip(h, list(dfx.groupby(c)[c].count())))
        
        data = []
        t0 = datetime.now()
        i = 0
        for df in self.get_play1():
            d = {'PlayKey': df.PlayKey[0]}
            last = len(df) - 1
            d['Events'] = list(set(df.event))
            d['Duration'] = df.time[last]
            d['TotalDis'] = df.dis.sum()
            
            df1 = df.set_index('time')
            df1['udir'] = funwrap(df1, 'dir')
            df1['vudir'] = fdiff(df1, 'udir')           
            df1['uo'] = funwrap(df1, 'o')
            df1['vuo'] = fdiff(df1, 'uo')
            
            df1['vx'] = fdiff(df1, 'x')
            df1['vy'] = fdiff(df1, 'y')
            
            df1['vx_bin'] = fbins(df1, 'vx', sbins)
            df1['vy_bin'] = fbins(df1, 'vy', sbins)

            df1['vudir_bin'] = fbins(df1, 'vudir', bins)
            df1['vuo_bin'] = fbins(df1, 'vuo', bins)
            
            d.update(fzip(df1, 'vudir_bin', cols1))
            d.update(fzip(df1, 'vuo_bin', cols2))
            d.update(fzip(df1, 'vx_bin', scolsx))
            d.update(fzip(df1, 'vy_bin', scolsy))
            
            data.append(d)
            i = i + 1
            if i%10000 == 0:
                t1 = datetime.now()
                print(i,t1-t0)
                t0=t1
            if i >= n: break
        return data
    

print("Start:", datetime.now())
play = Play()
data = play.get_plays(300000)
dfplays = pd.DataFrame(data).set_index('PlayKey')
print("End:", datetime.now())

dfplays.to_parquet('plays4.parquet', compression='GZIP')
dfplays.to_csv('plays4.csv')
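# NOTE: dfpi is not defined in this notebook; it is presumably the joined PlayList/InjuryRecord frame (IRPL above).
df = dfpi.join(dfplays).fillna(0)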
df.to_csv("df1.csv")

After flattening the Player Tracking Data, I had per-play counts of slows/stops (speed near 0), changes in direction, and changes in orientation. However, I realized that counts can vary greatly depending on the duration of the play. Therefore, I created new columns for the following in order to normalize the data:
- change in direction per second: cdir_rate
- change in orientation per second: co_rate
- change in speed to a slow or stop per second: cvx_rate and cvy_rate 
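
As a minimal sketch of this normalization on made-up numbers, each rate is just a count divided by the play's duration in seconds. The guard against zero-length plays is an addition for this example and is not part of the cell below.

```python
import numpy as np
import pandas as pd

# Made-up per-play counts; column names mirror the flattened data.
plays = pd.DataFrame({
    "Duration": [10.0, 20.0, 0.0],  # seconds
    "cdir_total": [2, 2, 1],        # number of direction changes in the play
})

# Treat zero-length plays as missing so the division cannot produce inf.
duration = plays["Duration"].replace(0, np.nan)
plays["cdir_rate"] = plays["cdir_total"] / duration

print(plays["cdir_rate"].tolist())  # [0.2, 0.1, nan]
```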

In [None]:
ptd2 = pd.read_csv("../input/joined2/df2.csv")
#ptd2.keys()

In [None]:
ptd2['cdir_total'] = ptd2['cdir_60']  + ptd2['cdir_90'] + ptd2['cdir_120'] + ptd2['cdir_150'] + ptd2['cdir_180']
ptd2['co_total'] = ptd2['co_60'] + ptd2['co_90'] + ptd2['co_120'] + ptd2['co_150'] + ptd2['co_180']
ptd2['cdir_rate'] = ptd2['cdir_total'] / ptd2['Duration']
ptd2['co_rate'] = ptd2['co_total'] / ptd2['Duration']
ptd2['sum_vx'] = ptd2['cvx_1'] + ptd2['cvx_5'] + ptd2['cvx_10'] + ptd2['cvx_20'] + ptd2['cvx_30'] + ptd2['cvx_40'] + ptd2['cvx_50'] + ptd2['cvx_100'] + ptd2['cvx_500']    
ptd2['sum_vy'] = ptd2['cvy_1'] + ptd2['cvy_5'] + ptd2['cvy_10'] + ptd2['cvy_20'] + ptd2['cvy_30'] + ptd2['cvy_40'] + ptd2['cvy_50'] + ptd2['cvy_100'] + ptd2['cvy_500'] 
ptd2['cvx_rate'] = ptd2['cvx_1'] / ptd2['Duration']
ptd2['cvy_rate'] = ptd2['cvy_1'] / ptd2['Duration']
ptd2['avg_velocity'] = ptd2['TotalDis'] / ptd2['Duration']
ptd2.head()

# 5) A Look at Player Movement: Slows/Stops, Direction, Orientation <a id="8"></a>
Looking at the graphs below and also converting them into tables to see the actual values, I found that frequency of slows and stops, changes in direction, and changes in orientation may influence risk of injury on average. I compared graphs with the average frequencies across the injured and non-injured population.
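
The "X% higher on Synthetic Turf" comparisons come down to a relative difference between two group means. A minimal sketch of that arithmetic, using made-up rates that are NOT values from the dataset:

```python
import pandas as pd

# Made-up rates for injury plays; these are illustrative, not dataset values.
toy = pd.DataFrame({
    "FieldType": ["Natural", "Natural", "Synthetic", "Synthetic"],
    "cdir_rate": [0.10, 0.10, 0.12, 0.18],
})

# Mean rate per surface, then the relative difference with Natural as the base.
mean_rate = toy.groupby("FieldType")["cdir_rate"].mean()
pct_higher = 100 * (mean_rate["Synthetic"] - mean_rate["Natural"]) / mean_rate["Natural"]

print(f"Synthetic is {pct_higher:.0f}% higher than Natural")  # Synthetic is 50% higher than Natural
```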

In [None]:
fig = plt.figure(figsize=(20,18))
ax1 = fig.add_subplot(321)
ax2 = fig.add_subplot(322)
ax3 = fig.add_subplot(323)
ax4 = fig.add_subplot(324)
ax5 = fig.add_subplot(325)
ax6 = fig.add_subplot(326)

ax1.set_title('Injury: Average # Slows/Stops per Second')
ax1.set_ylabel('Average')
ax1.set_xlabel('Field Type')
ax1.set_ylim(0,1)

ax2.set_title('No Injury: Average # Slows/Stops per Second')
ax2.set_ylabel('Average')
ax2.set_xlabel('Field Type')
ax2.set_ylim(0,1)

ax3.set_title('Injury: Average # Changes in Direction per Second')
ax3.set_ylabel('Average')
ax3.set_xlabel('Field Type')
ax3.set_ylim(0,0.25)

ax4.set_title('No Injury: Average # Changes in Direction per Second')
ax4.set_ylabel('Average')
ax4.set_xlabel('Field Type')
ax4.set_ylim(0,0.25)

ax5.set_title('Injury: Average # Changes in Orientation per Second')
ax5.set_ylabel('Average')
ax5.set_xlabel('Field Type')
ax5.set_ylim(0,0.040)

ax6.set_title('No Injury: Average # Changes in Orientation per Second')
ax6.set_ylabel('Average')
ax6.set_xlabel('Field Type')
ax6.set_ylim(0,0.040)

# Select the metric columns before averaging so non-numeric columns are never passed to mean()
ptd2.query('DM > 0').groupby('FieldType')[['cvx_rate', 'cvy_rate']].mean().plot(kind='bar', ax=ax1)
ptd2.query('DM == 0').groupby('FieldType')[['cvx_rate', 'cvy_rate']].mean().plot(kind='bar', ax=ax2)
ptd2.query('DM > 0').groupby('FieldType')['cdir_rate'].mean().plot(kind='bar', ax=ax3)
ptd2.query('DM == 0').groupby('FieldType')['cdir_rate'].mean().plot(kind='bar', ax=ax4)
ptd2.query('DM > 0').groupby('FieldType')['co_rate'].mean().plot(kind='bar', ax=ax5)
ptd2.query('DM == 0').groupby('FieldType')['co_rate'].mean().plot(kind='bar', ax=ax6)
plt.show()

# 6) A Look at Outdoor Stadiums <a id="9"></a>
After analyzing player movement metrics that may influence risk of injury, I looked at how factors such as playing surface, game scenario, player movement, and weather interact to influence the risk of injury. After graphing my flattened data and looking at the relationships between these factors and injury on natural and synthetic surfaces, I found that in outdoor stadiums, injuries on synthetic turf were 87% more severe than on natural grass on average.
- Holds across player movements (freq. slows/stops, orientation) that may influence risk of injury 

In [None]:
fig = plt.figure(figsize=(20,18))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

ax1.set_title('Injury: Average DM on Synthetic VS Natural')
ax1.set_ylabel('Average')
ax1.set_xlabel('Field Type')

ax2.set_title('Injury: Average # Slows/Stops per Sec on Synthetic VS Natural')
ax2.set_ylabel('Average')
ax2.set_xlabel('Field Type')

ax3.set_title('Injury: Average # Changes in Direction per Sec on Synthetic VS Natural')
ax3.set_ylabel('Average')
ax3.set_xlabel('Field Type')

ax4.set_title('Injury: Average # Changes in Orientation per Second')
ax4.set_ylabel('Average')
ax4.set_xlabel('Field Type')

# Select the metric columns before averaging so non-numeric columns are never passed to mean()
ptd2.query('DM > 0 and StadiumType == "Outdoor"').groupby('FieldType')[['DM']].mean().plot(kind='bar', ax=ax1)
ptd2.query('DM > 0 and StadiumType == "Outdoor"').groupby('FieldType')[['cvx_rate', 'cvy_rate']].mean().plot(kind='bar', ax=ax2)
ptd2.query('DM > 0 and StadiumType == "Outdoor"').groupby('FieldType')[['cdir_rate']].mean().plot(kind='bar', ax=ax3)
ptd2.query('DM > 0 and StadiumType == "Outdoor"').groupby('FieldType')[['co_rate']].mean().plot(kind='bar', ax=ax4)
ptd2.query('DM > 0 and StadiumType == "Outdoor" and Position == "WR"').groupby('FieldType')[['DM']].mean().plot(kind='bar', ax=ax1)
plt.show()

# 7) Decision Tree to Predict Injury <a id="10"></a>
I explored Decision Trees and Neural Networks to try to build a model from the data and the player movement metrics that I found significant (frequency of slows/stops, directional changes, and orientation changes) that can predict outputs such as Days Missed and field type for a given play. I tried the white box Decision Tree algorithm since it can train faster on large amounts of data. See the code for building this model below, inspired by [this article on Datacamp](https://www.datacamp.com/community/tutorials/decision-tree-classification-python). Working on this decision tree helped me think through how findings from this project could be incorporated into the NFL's Next Gen Stats, based on the player safety innovations they are currently working on. 

In [None]:
# columns considered as inputs to the Decision Tree 
feature_cols = ['Temperature','irp', 'istadium', 'iweather', 'iplay', 'iposition', 'ifield',
       'igroup', 'ibody', 'Duration', 'TotalDis', 'cdir_30',
       'cdir_60', 'cdir_90', 'cdir_120', 'cdir_150', 'cdir_180', 'cdir_360',
       'co_30', 'co_60', 'co_90', 'co_120', 'co_150', 'co_180', 'co_360',
       'cvx_1', 'cvx_5', 'cvx_10', 'cvx_20', 'cvx_30', 'cvx_40', 'cvx_50',
       'cvx_100', 'cvx_500', 'cvy_1', 'cvy_5', 'cvy_10', 'cvy_20', 'cvy_30',
       'cvy_40', 'cvy_50', 'cvy_100', 'cvy_500', 'cdir_total', 'co_total',
       'cdir_rate', 'co_rate', 'sum_vx', 'sum_vy', 'cvx_rate', 'cvy_rate',
       'avg_velocity']
# OUTPUT change this for varying what is predicted 
output = ['DM']

In [None]:
# training on 50% of the data 
data_train = ptd2.sample(frac=0.50).fillna(0)

# testing on 50% of injury data (hoping to avoid cross-referencing with training data)
data_test = ptd2.query('DM > 0').fillna(0)
data_test = data_test.sample(frac=0.50)
data_test.head()

In [None]:
# *** BUILD DECISION TREE USING SCIKIT-LEARN ***
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(data_train[feature_cols], data_train[output])

In [None]:
#Predict the response for test dataset
output_pred = clf.predict(data_test[feature_cols])
data_test['dt_output'] = output_pred
print('Decision Tree Result:')
print(data_test.head(100))

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy: ",metrics.accuracy_score(data_test[output], output_pred))

Given 50% of the flattened player tracking data (see [4) Preprocessing Player Tracking Data: Flattening the Data](#7)) used for training and 50% of injuries (~50) used for testing, the Decision Tree model achieved ~60% accuracy. This isn't great, but we only have 105 injuries. Further test cases and different splits of the input data are needed to improve the accuracy of the model. 
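
As one possible starting point for that follow-up work, here is a minimal sketch of a train/test split in which no play appears in both sets and each set keeps the same share of injuries. It runs on synthetic stand-in data. The `injured` column and the 1,000-row frame are assumptions made for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the flattened plays: 1,000 rows, 20 of them "injured".
plays = pd.DataFrame({
    "cdir_rate": rng.random(1000),
    "injured": [1] * 20 + [0] * 980,
})

# Sample half of each class for training, then test on the untouched rows.
train = plays.groupby("injured", group_keys=False).sample(frac=0.5, random_state=0)
test = plays.drop(train.index)

# Accuracy of always predicting "no injury"; a useful model should beat this.
baseline_accuracy = (test["injured"] == 0).mean()

print(len(train), len(test), train.index.intersection(test.index).size)  # 500 500 0
print(baseline_accuracy)
```

With classes this imbalanced, the always-"no injury" baseline already scores very high on a mixed test set, so per-class measures such as recall on the injured rows may be more informative than plain accuracy.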

# Conclusions & Extensions to Next Gen Stats <a id="11"></a>
![](https://nflops.blob.core.windows.net/cachenflops-lb/c/9/8/8/d/2/c988d26c9837b9783c487e53006f96068123cd14.png)
As we know, [Next Gen Stats](https://nextgenstats.nfl.com/) captures real time location, speed, and acceleration for every player on every play. Through technologies such as the Digital Athlete, AI can help enhance safety in the NFL. I thought about how my findings about player movement can be incorporated into Next Gen Stats to help reduce the number of lower limb non-contact injuries in the future of the game. 

#### Incorporate these player movement metrics into AI models aiming to predict and prevent injury 
- Frequency of slows/stops, direction and orientation changes
- Other player and game-level data 

#### Look at these player movement metrics in conjunction with specific types of cleats, grass and synthetic fields
- Cleat Tracking Technology 
- Injury Rates Per Cleat 


# Challenges Faced, Reflections & Thank You
#### Some Challenges Faced:
- Since we have such a small amount of injury data for non-contact injuries in lower limbs, it was difficult to find meaningful correlations and figure out how to create an accurate prediction/classification model (due to so many noninjuries compared to injuries). More years would have been useful.
- Figuring out how to flatten Player Tracking Data after identifying the variables I cared about (change of direction, change of orientation, velocity) to decrease the amount of data points to one data point per play. 

A little bit about me: I am a recent graduate of Bucknell University (Lewisburg, PA) with a dual degree in computer science & engineering and management. I grew up outside of Philadelphia and am an Eagles fan #gobirds with a lacrosse and running background. Seeing this challenge, I was excited to contribute to innovations in [enhancing player safety in the NFL](https://www.playsmartplaysafe.com/). I really enjoyed this analytics problem centered around the health and safety aspects of the NFL and want to do more problem solving around athlete safety in the future. 