<div class="h1">Reducing Lower Body Injuries</div>
<img src="https://nonwebstorage.s3.amazonaws.com/lowerbody/zeke.jpg" alt="zeke" height="400" width="720"/>

In [None]:
%%HTML
<style type="text/css">

div.h1 {
    font-size: 32px; 
    margin-bottom:2px;
}
div.h2 {
    background-color: steelblue; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 24px; 
    max-width: 1500px; 
    margin-top: 50px;
    margin-bottom:4px;
}
div.h3 {
    color: steelblue; 
    font-size: 18px; 
    margin-top: 4px; 
    margin-bottom:8px;
}
div.h4 {
    font-size: 15px; 
    margin-top: 20px; 
    margin-bottom: 8px;
}
span.note {
    font-size: 5; 
    color: gray; 
    font-style: italic;
}
span.lists {
    font-size: 16px; 
    color: dimgray; 
    font-weight: bold;
    vertical-align: top;
}
span.captions {
    font-size: 5; 
    color: dimgray; 
    font-style: italic;
    margin-left: 130px;
    vertical-align: top;
}
hr {
    display: block; 
    color: gray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;
}
hr.light {
    display: block; 
    color: lightgray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;
}
table.dataframe th 
{
    border: 1px darkgray solid;
    color: black;
    background-color: white;
}
table.dataframe td 
{
    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 14px;
    text-align: center;
} 
table.rules th 
{
    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 14px;
}
table.rules td 
{
    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 13px;
    text-align: center;
} 
table.rules tr.best
{
    color: green;
}

</style>

In [None]:
# Install on commit
! conda install -c conda-forge hvplot=0.5.1 lifelines=0.23.7 -y

In [None]:
# Import
import warnings
import numpy as np
import pandas as pd
from IPython.display import HTML, Image

import matplotlib.pyplot as plt
%matplotlib inline
import holoviews as hv
hv.extension("bokeh")
from holoviews import opts
import hvplot.pandas
from bokeh.models import HoverTool

from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Set additional display options for report
pd.set_option("display.max_columns", 100)
th_props = [('font-size', '13px'), ('background-color', 'white'), 
            ('color', '#666666')]
td_props = [('font-size', '15px'), ('background-color', 'white')]
styles = [dict(selector="td", props=td_props), dict(selector="th", 
            props=th_props)]

The NFL has challenged the data science community to join them in reducing lower body injuries. We were asked to examine the effects of playing on synthetic turf versus natural turf along with other factors that might influence player movements and safety. This report represents my analysis. 

Based on the data provided, it is my opinion that playing on synthetic turf over two or more seasons is associated with a higher rate of injury to knees, ankles, and feet. I did not find that a history of playing on synthetic fields was the most important indicator of player injury. The following factors showed more importance in the final model used for this analysis:

  - The amount of rest between games
  - Lateral changes in speed from cutting or turning
  - Changes in speed from starting and stopping
  - Environmental factors of precipitation, temperature and stadium type

I found only a weak connection to the playing surface as measured at the time of injury. In other words, the plays ending in player injury on synthetic turf did not appear different from plays without injuries. Rather, it was a player's exposure to the factors over time that correlated with increased risk.

An interesting finding is that the proportion of games played on synthetic turf varied quite a bit among the 250 players in the sample. It also varies across all 32 NFL teams. The chart below shows the number of games played on each type of turf for the 2017 season. 


<span class="note"> <i>Hover over the bars to see exact numbers.</i> </span>

In [None]:
#%% Data from NFL BigDataBowl
cols = ['GameId', 'Turf', 'HomeTeamAbbr', 'VisitorTeamAbbr']
train = pd.read_csv("../input/nfl-big-data-bowl-2020/train.csv", 
                    usecols=cols).drop_duplicates('GameId')

train = train[train.GameId.astype('str').str.startswith('2017')]

grass_labels = ['grass', 'natural']  #includes hybrids (16 games)
train['HomeSynth'] = np.where(train.Turf.str.lower().str.contains('|'.join(grass_labels)), 
                              0, 1)

games = pd.melt(train, id_vars=['GameId', 'HomeSynth'],  value_name='Team',
                value_vars=['HomeTeamAbbr', 'VisitorTeamAbbr'])\
                .sort_values('GameId')

homes = train[['HomeTeamAbbr', 'HomeSynth']].drop_duplicates()\
                                            .set_index('HomeTeamAbbr')

totals = games.groupby('Team').agg({'HomeSynth': ['size', 'sum']})
totals.columns = ['Games', 'Synthetic']
totals['Natural'] = totals.Games - totals.Synthetic

totals = totals.join(homes).reset_index().sort_values('Natural')

# plot
plot_opts = {'invert': True,
             'height': 500,
             'width': 320,
             'grid': True,
             'line_alpha': 0,
             'ylim': (0, 16),
             'xticks': np.arange(0,17,2).tolist()
             }
natural = totals.hvplot.bar(x='Team', y='Natural', flip_xaxis=True, 
                        title='2017 Games by Field Surface', **plot_opts)
synth = totals.hvplot.bar(x='Team', y='Synthetic', color='orange', 
                          yaxis='right', **plot_opts)
display(natural + synth)

Team experience varied in 2017 between the Oakland Raiders, with 1 game on synthetic turf, and the Atlanta Falcons, with 13 games. I discuss the differences in more detail within the body of the analysis.

My report contains the following sections:
  - <a href='#bg'>Background</a>
  - <a href='#mt'>Methodology</a>
  - <a href='#an'>Analysis</a>
  - <a href='#ap'>Application</a>
  - <a href='#ax'>Appendix</a>
  
Thank you and happy reading!

John Miller<br />
Customer Data Scientist, H2O.ai


<hr class="light">
UPDATE: You can find my slides at https://www.slideshare.net/JohnMiller153/nfl-first-and-future. These are slightly different than the actual submission. They include some improvements based on feedback from preparing for the presentation.

<a id='bg'></a>
<div class="h2">Background</div>


<div class="h3">THE ISSUE</div>
Lower body injuries place a high burden on players, teams, and the league overall. [Earlier this year](https://apnews.com/2fdaa276b87f4a71b248476cf12f3d36), the league's chief medical officer, Dr. Allen Sills, stressed the need to target the lower extremity injuries "the same way as with concussions." "We are thinking about injury burden," Sills said. "Not only how many injuries but how long a player missed."

The chart below shows injury counts for torn ligaments in the knee. The figures don't include other types of knee injury, or injury to ankles, feet, etc.

In [None]:
tears = pd.DataFrame({'Season': [2016, 2017, 2018], 
                      'ACL Tears': [29, 21, 24], 
                      'MCL Tears': [86, 97, 94]})
display(HTML('<span style="font-weight:bold">' + 'Knee Injuries, Regular Season Games' + '</span>'), 
                tears,
        HTML('<span class="captions">' + 'Source: IQVIA' + '</span>'))

ACL and MCL tears requiring surgery are particularly damaging to player careers. Players miss the rest of the season and often find that when they do return, their peak performance is less than what it was before. 

The impact of player injuries on team resources is also huge. A [December 2019 article](https://www.espn.com/blog/new-york-jets/post/_/id/81835/the-big-hurt-jets-roster-salary-cap-crushed-by-injuries) by ESPN describes the extreme case of the New York Jets. According to the article, the combined salaries of players on their Injured Reserve list are eating up \$61.5 million in salary cap space -- roughly 1/3 of the \$188 million Base Salary Cap set for the NFL. Even though these numbers include injuries other than those to the lower body, they give an idea of the financial impact.

<hr class="light">

<div class="h3">RELEVANT FACTORS</div>

The NFL is interested in the effect of field surface type on lower extremity injuries. Recent studies cited in this challenge show higher injury rates on synthetic turf compared to natural turf. Also of interest are the following factors included in our data:
  - Stadium type
  - Weather
  - Temperature
  - Rest between games
  - Play types
  - Player position
  - Player movement (as recorded by NGS)

Other factors not included in this challenge have also been linked to lower extremity injuries:
  - Cleat-turf interaction
  - Player conditioning
  - Injury history
  - Player anatomical, hormonal, and neuromuscular factors


<a id='mt'></a>
<div class="h2">Methodology</div>

I used the following high-level methodology for this analysis:


<div>
<span style="color:#445f7c; font-size:16px; line-height:35px;">
  1. Examine player survivability relative to field surface <br/>
  2. Characterize player movement using NGS data<br/>
  3. Build a machine learning model to identify relevance of factors<br/>
  4. Examine model factors<br/>
</span>
</div>


<a id='an'></a>
<div class="h2">Analysis</div>

<div class="h3">1. PLAYER SURVIVABILITY</div>

I first looked at player injuries using Survival Analysis. Survival analysis was developed to model the expected time until an event happens for members of a given population. In this case I applied it to estimate how long players might go without injury, and to see if playing on synthetic turf affects the probability of injury. This study presented several challenges to using an approach that is statistically valid: 

  - Many players started or stopped playing at various times during the study period. 
  - There is no record of injuries before or after the study period.
  - Risk is not proportional over time.

I used the [Kaplan-Meier estimator](https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator) to overcome these challenges. Kaplan-Meier is often used in medical research and was used by a team from Cleveland Clinic in 2018 to model the [effects of concussion on player longevity](https://consultqd.clevelandclinic.org/concussion-in-the-nfl-new-study-demonstrates-significant-detrimental-effects-in-the-short-term/). It is considered to be a reliable indicator when the sample data accurately represent the population.
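To make the estimator concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation. It is an illustration only, not the code used for this report (the analysis relies on `lifelines`). The function name `kaplan_meier` and the toy data are assumptions made for the example. Players who leave the sample without an injury are treated as censored: they reduce the number at risk but do not lower the survival estimate.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] for each time t at which at least one injury occurred.

    durations: number of plays each player was observed
    events:    1 if the player was injured at that point, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # injured or censored, both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Hypothetical toy data: five players, two of them censored (event = 0)
plays = [100, 250, 250, 400, 600]
injured = [1, 1, 0, 0, 1]
km_curve = kaplan_meier(plays, injured)
print(km_curve)
```

In the toy data, the censored players at 250 and 400 plays shrink the risk set without causing a drop in the curve, which is how the estimator copes with players who start or stop playing partway through the study period.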

In looking at survival rates, I defined "non-survival" as missing more than 7 days to injury. That is consistent with the [2018 Mack study](https://www.ncbi.nlm.nih.gov/pubmed/30452873) on the effects of synthetic turf that used injuries with 8 or more days missed as a classifier.

I divided the sample of 250 players into two groups: those who played over 50% of all plays on synthetic turf, and those who played less than 50%. The mean percent for all players is 43% so the groups are of roughly equal size. 

The chart below shows the survival rates. For each group you can see the proportion of players who drop out as the number of plays increases from left to right. When one curve is below the other it represents a lower survival rate; i.e., a higher injury rate. 

In [None]:
key_cols = ['PlayerKey', 'GameID', 'PlayKey']
playlist = pd.read_parquet('../input/nfl-1standfuture-dataprep/PlayListLabeled.parq').sort_values(key_cols)
last_plays = playlist[playlist.PlayCount == playlist.FinalPlay]

#%%#######################
## Kaplan Meier curves
#  Playing Surface
#


# Plot options         
plt.figure(figsize=(12,8))
plt.style.use('seaborn-whitegrid')
SMALL_SIZE = 16
MEDIUM_SIZE = 20
BIGGER_SIZE = 24
plt.rc('font', size=SMALL_SIZE)
plt.rc('axes', titlesize=MEDIUM_SIZE)
plt.rc('axes', labelsize=MEDIUM_SIZE)
plt.rc('xtick', labelsize=SMALL_SIZE)
plt.rc('ytick', labelsize=SMALL_SIZE)
plt.rc('legend', fontsize=SMALL_SIZE)
plt.rc('figure', titlesize=BIGGER_SIZE)


# Synthetic turf
CONFIDENCE = 0.9
BREAK_PT = 0.5 #mean is 0.43
idx_ = (last_plays.PctPlaysSynthetic < BREAK_PT)

durations_1 = last_plays.loc[idx_, 'FinalPlay']
injuries_1 = last_plays.loc[idx_, 'Missed7Days']
kmf1 = KaplanMeierFitter()
kmf1.fit(durations_1, injuries_1, alpha=(1-CONFIDENCE), label=f'< {BREAK_PT:.0%} of Plays on Synthetic')
a1 = kmf1.plot()

durations_2 =  last_plays.loc[~idx_, 'FinalPlay']
injuries_2 =   last_plays.loc[~idx_, 'Missed7Days']
kmf2 = KaplanMeierFitter()
kmf2.fit(durations_2, injuries_2, alpha=(1-CONFIDENCE), label=f'> {BREAK_PT:.0%} of Plays on Synthetic')
a2 = kmf2.plot(ax=a1)

plt.ylim(0, 1)
plt.title("Survival Rate of Players")
plt.xlabel("Number of Plays")
plt.ylabel("Fraction of Players not Injured")
plt.show()

The solid lines represent the observed survival rate for each group. The shaded areas represent the 90% confidence interval. Confidence intervals that do not overlap indicate a meaningful difference between the groups. In this case the lines are separate, indicating a higher injury rate for players in the sample with over 50% of plays on synthetic turf. The difference is most noticeable before players reach 500 plays. Of players playing mostly on synthetic turf, 80% make it to their 250th play, whereas 80% of players playing mostly on natural turf make it to their 500th play.

On the other hand, the shaded confidence intervals overlap. This indicates the difference observed in our sample of 250 players may not accurately represent the entire NFL player population.

I used a statistical test and calculated the p-value to more formally assess the relevance of synthetic turf. A p-value is the probability of seeing a difference at least as large as the one we observed if synthetic turf had no real effect on injury rates. Because we are using a sample of 250 players instead of all players, such an outcome is possible by chance alone. The tables below show the results of the formal test, which gives a p-value of 0.34.
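The test applied here is the log-rank test from `lifelines`. For readers who want to see what it computes, the following is a simplified pure-Python sketch of the standard two-group log-rank statistic. The function name `logrank` and the toy data are assumptions for illustration; the report's figures come from `lifelines`, not from this code.

```python
import math

def logrank(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    samples = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in samples if e}):
        n = sum(1 for s, _, _ in samples if s >= t)              # at risk, both groups
        n_a = sum(1 for s, _, g in samples if s >= t and g == 0)  # at risk, group A
        d = sum(1 for s, e, _ in samples if s == t and e)         # events at time t
        d_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 d.o.f.
    return stat, p_value

# Hypothetical toy data: every player is injured (event = 1)
early = [1, 2, 3, 4, 5]
late = [6, 7, 8, 9, 10]
all_events = [1, 1, 1, 1, 1]

stat_same, p_same = logrank(early, all_events, early, all_events)
stat_diff, p_diff = logrank(early, all_events, late, all_events)
print(p_same, p_diff)
```

Two identical groups produce a statistic of zero and a p-value of 1, while groups whose injuries occur at clearly different times produce a small p-value.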


In [None]:
results_turf = logrank_test(durations_1, durations_2, injuries_1, injuries_2, alpha=.90)
results_turf.print_summary()

A p-value of 0.34 is higher than the widely accepted threshold of 0.05. The data does not strictly support the idea that players with more plays on synthetic turf have a higher injury rate.

Even so, this doesn't mean that the effect of playing on synthetic turf is nowhere to be seen. To give this number some perspective, think of a weather person telling you it will rain today. She thinks it will rain and admits there's a 34% probability her conclusion is based on random chance. You probably wouldn't bet a day's pay on her being right about the rain, but you certainly might take an umbrella to work. There's still value in her forecast, just as with this one.

<hr class="light">

<div class="h3">2. CHARACTERIZING MOVEMENT</div>

Kaplan-Meier estimators work very well for assessing factors one at a time. They are not able to discover interactions among factors. Machine learning models, on the other hand, are very good at it. Before building the actual model I looked at player movement data and captured it in a way that is useful for machine learning.

The Next Generation Stats (NGS) system captures player position and orientation 10 times per second. The data provided for this challenge consisted of 76 million snapshots covering 267,000 plays. I used these base measurements to calculate speed, direction of travel, and acceleration. My objective was to transform these metrics into a useful form that relates to injury-causing forces. 
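As a rough sketch of how speed, direction of travel, and acceleration can be derived from position snapshots taken every 1/10 of a second, the example below uses simple finite differences. The function names and the toy coordinates are assumptions for illustration; the actual calculations used in this report are described in the Appendix.

```python
import math

DT = 0.1  # seconds between NGS snapshots (10 per second)

def speed_and_heading(xs, ys, dt=DT):
    """Return per-interval speed (yards/s) and direction of travel (radians)."""
    speeds, headings = [], []
    for i in range(1, len(xs)):
        dx, dy = xs[i] - xs[i - 1], ys[i] - ys[i - 1]
        speeds.append(math.hypot(dx, dy) / dt)
        headings.append(math.atan2(dy, dx))
    return speeds, headings

def acceleration(speeds, dt=DT):
    """Change in speed between consecutive intervals (yards/s^2)."""
    return [(speeds[i] - speeds[i - 1]) / dt for i in range(1, len(speeds))]

# Hypothetical positions in yards: the player speeds up along the x axis
xs = [0.0, 0.3, 0.7, 1.2]
ys = [0.0, 0.0, 0.0, 0.0]
speeds, headings = speed_and_heading(xs, ys)
accels = acceleration(speeds)
print(speeds, headings, accels)
```

Finite differences amplify sensor noise, so in practice some smoothing of the positions or speeds may be needed before differencing.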

The video below shows wide receiver Jarvis Landry practicing a route from when he was with the Miami Dolphins. Landry accelerates straight ahead and then cuts to the left.

<span class="note"> <i>Play the video to see Landry's skills.</i> </span>

In [None]:
from IPython.display import Video

display(Video("https://i.imgur.com/H2nvXRc.mp4"))

I used information from the field of kinesiology - the study of human movement - to derive relevant measurements. Knees and ankles are generally designed to move a person forward. They can take heavier stress in the forward-backward direction and in the up-down direction than in a sideways direction. Because of this difference, I decomposed the acceleration vector at each point into longitudinal acceleration (forward-backward) and lateral acceleration (sideways). The diagram below shows the forces acting on Landry's body, particularly on the leg in contact with the ground, as he makes the cut.


<img src="https://nonwebstorage.s3.amazonaws.com/lowerbody/landry_grid2b.png" alt="landry" height="420" width="720" />



The two components of acceleration have these characteristics at every point along a player's path. 
  - Longitudinal acceleration is proportional to changes in player speed. A player speeding up or slowing down experiences acceleration in the direction of travel.
  - Lateral acceleration is proportional to changes in speed and in a player's direction. A player turning and running in another direction experiences lateral acceleration. Both speed and change in direction contribute to lateral acceleration.
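
The decomposition described above can be sketched with a dot product and a 2-D cross product. This is a minimal illustration under my own assumptions: velocities are given as (x, y) pairs in yards per second, samples are 0.1 seconds apart, and the sign convention for lateral acceleration (positive when turning left) is chosen for the example. The function name `decompose_acceleration` is hypothetical.

```python
import math

def decompose_acceleration(v_prev, v_next, dt=0.1):
    """Split the acceleration between two velocity samples into
    (longitudinal, lateral) components relative to the direction of travel."""
    ax = (v_next[0] - v_prev[0]) / dt
    ay = (v_next[1] - v_prev[1]) / dt
    speed = math.hypot(v_prev[0], v_prev[1])
    if speed == 0:
        # No direction of travel yet: treat a start from rest as longitudinal
        return math.hypot(ax, ay), 0.0
    ux, uy = v_prev[0] / speed, v_prev[1] / speed  # unit vector along travel
    longitudinal = ax * ux + ay * uy               # projection onto travel direction
    lateral = ux * ay - uy * ax                    # signed, positive = turning left
    return longitudinal, lateral

# Hypothetical samples: speeding up in a straight line, then cutting left
speeding_up = decompose_acceleration((5.0, 0.0), (6.0, 0.0))
cutting_left = decompose_acceleration((5.0, 0.0), (5.0, 1.0))
print(speeding_up, cutting_left)
```

A player who only changes speed gets a purely longitudinal result, while a player who changes direction at roughly constant speed gets a mostly lateral one.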



The top chart below shows the movement pattern of a single player on one play as seen from above. The play is shown from when the ball is snapped until the play ends. The x-y plot represents the path traveled with dots at each 1/10 of a second. The two plots that follow show longitudinal and lateral accelerations during the play.


In [None]:
tracks = pd.read_parquet('../input/nfl-1standfuture-dataprep/PlayerTrackData.parq')

one_play = tracks[(tracks.PlayerKey == 35611) &\
                        (tracks.GameID == 7) &\
                        (tracks.PlayKey == 42)
                        ].copy()

display(one_play.hvplot.scatter(x='x', y='y', color='darkgray',
                    # , xlim=(0,120), ylim=(0,54)
                    ),
        one_play.hvplot.line(x='time', y='AccelLong', height=200),
        one_play.hvplot.line(x='time', y='AccelLateral', height=200, color='red')
       )

When play begins, acceleration is typically in line with the direction of movement as a player speeds up (longitudinal). When the player reaches running speed, and changes direction, there is lateral acceleration across the direction of movement. If the player also changes speed during a turn, longitudinal acceleration occurs as well.

Players also experience acceleration in the up-down direction and angular acceleration from twisting the body. I discuss treatment of these variables and my assumptions in the Appendix. I also include details on the calculation of longitudinal and lateral acceleration.

To better compare movement patterns across players, I standardized the NGS data. Below are standardized movements for six players. Distances and play time are normalized to a scale of 0 to 1. Also, routes are standardized to go from left to right. Actual magnitudes for distance, time, and acceleration are given in the reference table.

<span class="note"> <i>Hover over points and lines on the chart to see more information.</i> </span>


In [None]:
#%%###################
## Scaled Movements
#

# Grab some players for a demo
grabs = [46038, 43518, 40405, 41209, 43532, 44165]
tracks2 = tracks[tracks.PlayerKey.isin(grabs)].reset_index(drop=True)
# tracks2 = tracks #full set
tracks2['LastPlay'] = tracks2.groupby(key_cols[0:2])['PlayKey'].transform('max')
tracks2['LastGame'] = tracks2.groupby('PlayerKey')['GameID'].transform('max')

tracks2 = tracks2[(tracks2.GameID == tracks2.LastGame) &\
                  (tracks2.PlayKey == tracks2.LastPlay)]


# Create reference table
aggdict = {'dis': 'sum',
           'time': 'max',
           'AccelLateral': 'max', 
           'AccelLong': 'max'
           }
tablecols = ['PlayerKey', 'dis', 'time', 'AccelLateral', 'AccelLong', 
             'RosterPosition', 'FieldType',
             'PlayType', 'BodyPart', 'DaysMissed'
             ]
tracks_table = tracks2.groupby(key_cols).agg(aggdict)\
                      .merge(playlist, how='left', right_on=key_cols, 
                             left_index=True)[tablecols]\
                      .round(3)\
                      .sort_values('DaysMissed', ascending=False)

# Explore
# pd.set_option("display.max_rows", 250)
# tracks_table.reset_index()


# Scale all player movement to 0-1
def scaler(series_name):
    # Min-max scale one column within each player's selected play
    grouped = tracks2.groupby('PlayerKey')[series_name]
    player_min = grouped.transform('min')
    player_max = grouped.transform('max')
    scaled = (tracks2[series_name] - player_min) / (player_max - player_min)
    return scaled

# series_names = ['x', 'y', 'AccelLateral', 'AccelLong', 'time']
series_names = ['x', 'y', 'time']
scaled_names = [name+'_scaled' for name in series_names]

for name_pair in zip(series_names, scaled_names):
    tracks2[name_pair[1]] = scaler(name_pair[0])

    
# Standardize movement from left to right
tracks2['x_first'] = tracks2.groupby('PlayerKey')['x'].transform('first')
tracks2['x_last'] = tracks2.groupby('PlayerKey')['x'].transform('last')
tracks2.loc[tracks2.x_last < tracks2.x_first, 'x_scaled'] = 1 - tracks2.x_scaled

tracks2['AccelLateral'] = tracks2.AccelLateral.round(2)
tracks2['AccelLong'] = tracks2.AccelLong.round(2)
tracks2['x'] = tracks2.x.round(0).astype(int)
tracks2['y'] = tracks2.y.round(0).astype(int)


In [None]:

#%%###################
## Grid of plots
#

#Make table element
plays = hv.Table(tracks_table)

#Make plot elements
numbers = tracks_table.PlayerKey
positions = tracks_table.RosterPosition
layout_list = []
for playa,pos in zip(numbers, positions):
    label_string = f'{playa} {pos}'
    # Create plot elements
    pts = hv.Points(tracks2[tracks2.PlayerKey == playa],
               kdims=['x_scaled', 'y_scaled'],
               vdims=['x', 'y', 'time_scaled'],
               label = label_string
                     )
    tooltips_pts = [('X Scaled', '@x_scaled'),
                ('Y Scaled', '@y_scaled')
               ]
    hover_pts = HoverTool(tooltips=tooltips_pts)

    longitudinal = hv.Curve(tracks2[tracks2.PlayerKey == playa],
                   kdims=['time_scaled', 'AccelLong'], #####
                   vdims=['x_scaled', 'y_scaled'],
                   label='Longitudinal',
                   )  
    lateral = hv.Curve(tracks2[tracks2.PlayerKey == playa],
               kdims=['time_scaled', 'AccelLateral'], #####
               vdims=['x_scaled', 'y_scaled'],
               label='Lateral',
               )
    tooltips_cv = [('X Original', '@x'),
                   ('Y Original', '@y'),
                   ('Time Scaled', '@time_scaled')
                   ]
    hover_cv = HoverTool(tooltips=tooltips_cv)


    # Overlay and listify
    accel = (longitudinal*lateral).opts(show_legend=False)
    
    layout_list.append(pts)
    layout_list.append(accel)

neworder = [0, 2, 4, 1, 3, 5, 6, 8, 10, 7, 9, 11]
layout_list = [layout_list[i] for i in neworder]
layout = hv.Layout(layout_list).opts(opts.Points(width=250, height=220, size=4, color="darkgray", show_grid=True, tools=[hover_cv]), 
                 opts.Curve(width=250, height=120, line_width=1, ylabel='Acceleration', ylim=(0,10), yticks=[0, 5, 10], tools=[hover_pts])
                 ).cols(3)


# Show
def highlight_max(s):
    is_large = s.nlargest(2).values
    return ['background-color: #ffffcc' if v in is_large else '' for v in s]

def highlight_hurt(s):
    return ['background-color: #ffb3b3' if v >0 else '' for v in s]

display(tracks_table.reset_index(drop=True).style\
                    .apply(highlight_max, subset=['AccelLong', 'AccelLateral'])\
                    .apply(highlight_hurt, subset='DaysMissed'), layout)
       


In the sample charts above, it can be hard to pick out differences between the 3 injured players in the top row and the 3 non-injured players in the bottom row. Sometimes there appears to be a difference such as with the player in the top right graph, 46038 Safety. Notice the red line representing lateral acceleration. The player reaches peak lateral acceleration of 6.5 yards/second per second and maintains it throughout the turn. That's the equivalent of sustaining over 1/2 a g-force for 5 seconds!

At the same time, other play charts are less clear. 41209 Linebacker in the lower left experiences the highest lateral acceleration of the group and is on synthetic turf. He doesn't get hurt. 
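
For reference, the comparison between 6.5 yards/second per second and g-force quoted above can be checked with a short conversion. The constants are the standard definitions of the yard and of standard gravity; the function name is an assumption for this example.

```python
YARD_IN_METERS = 0.9144     # exact, by definition
STANDARD_GRAVITY = 9.80665  # m/s^2

def yards_per_s2_to_g(accel_yards):
    """Convert an acceleration in yards/s^2 to multiples of standard gravity."""
    return accel_yards * YARD_IN_METERS / STANDARD_GRAVITY

peak_lateral_g = yards_per_s2_to_g(6.5)
print(f"{peak_lateral_g:.2f} g")  # about 0.61 g
```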

The most appropriate use of these charts would be in conjunction with a lookup tool. For instance, teams could pull a player's play history over the last several games following an injury, focusing on the same type of play. They might also compare the player's movement patterns to another player who played the same position without getting hurt.

<hr>

<div class="h3">3. MACHINE LEARNING</div>

One challenge in building a model for machine learning with the dataset is identifying the temporal aspect of the injury when it occurs. As someone who tore his ACL and endured numerous ankle sprains playing sports, I understand the potential for both sudden injuries and injuries that come from stress accumulated over time. 

To account for both types of injuries, I used these steps:
  - Aggregate the movement metrics over the course of a play, recording maximum and cumulative values.
  - Merge the play-level data to other play characteristics and injuries including turf type, weather, days missed, etc.
  - Aggregate play-level data by player, again using averages, maximums, and percentiles of play conditions 
  
Using these steps is a common technique to encapsulate lower-level data into a higher level. Given the nature of player injuries, and the goal of improving safety on a per-player basis, a model focused on player history was most appropriate.

It was important to avoid factors that introduce leakage to the model, meaning anything that hints at whether or not a player was injured. For instance, including the number of plays as a factor allows the algorithm to learn that players only playing a few plays in total were probably injured. Likewise, I chose not to use play type or player position. The data reflects what is already known - certain combinations of plays and positions are more likely to result in injury.

The table below shows the factors provided as inputs to the machine learning algorithm. As with the survivability analysis and the [Mack study](https://www.ncbi.nlm.nih.gov/pubmed/30452873) mentioned earlier, I defined the target variable by injuries with over 7 days missed.

In [None]:
tracks = pd.read_parquet('../input/nfl-1standfuture-dataprep/PlayerTrackData.parq')

################################
## Aggregate NGS data by play 
#

key_cols = ['PlayerKey', 'GameID', 'PlayKey']
aggdict = {'time': ['min', 'max'],
           'dis':['sum'],
           'VelocityIn':['max'],
           'AccelLateral': ['sum', 'max'],
           'AccelLong': ['sum', 'max']
           }
tracks_agg = tracks.groupby(key_cols).agg(aggdict)


tracks_agg.columns = [c[0]+'_'+c[1] for c in tracks_agg.columns]
tracks_agg['PlayDuration'] = tracks_agg.time_max - tracks_agg.time_min
tracks_agg = tracks_agg.fillna(0).reset_index() #should not be any na's now


##########################################
## Merge NGS data with other play factors 
#

playlist = playlist.drop_duplicates(key_cols)
tracks_agg = tracks_agg.merge(playlist, how='inner', on=key_cols)

# null out season breaks and long injuries contained in DaysRest
tracks_agg.loc[tracks_agg.DaysRest > 11, 'DaysRest'] = np.nan #prevents leakage

# get targets
tracks_agg['PlayerOut7Days'] = tracks_agg.groupby('PlayerKey')['Missed7Days']\
                                        .transform(max)
tracks_agg['PlayerOut1Day'] = tracks_agg.groupby('PlayerKey')['Missed1Day']\
                                        .transform(max)


################################
## Aggregate play-level data by player 
#

aggdict_words = {'RosterPosition': ['last'],
                'FieldType': ['last'],
                'PlayType': ['last'],
                }

aggdict_nums = {'PctWetWeather': ['mean'],
                'PctOpenStadium': ['mean'],
                'Temperature': ['mean'],
                'DaysRest': ['mean'],
                'AccelLateral_max': ['mean'],
                'AccelLong_max': ['mean'],
                'PctPlaysSynthetic': ['last'],
                'PlayerOut7Days': ['last']
                }

df_words = tracks_agg.groupby('PlayerKey').agg(aggdict_words)\
                     .reset_index(drop=True)
df_nums = tracks_agg.groupby('PlayerKey').agg(aggdict_nums)\
                    .reset_index(drop=True)

plays_agg = pd.concat([df_words, df_nums], axis=1)
plays_agg.columns = [c[0]+'_'+c[1] for c in plays_agg.columns]


#%%
display(plays_agg.drop(columns='RosterPosition_last').head())


Note that orientation - the direction a player was facing - was not used in this study. There is an issue with inconsistencies between years in which the sensor data was 90 degrees off from the correct value. The issue is documented by Michael Lopez, Director of Football Data & Analytics, in [this post](https://www.kaggle.com/c/nfl-big-data-bowl-2020/discussion/112303).

Before identifying which of the above factors are most relevant to lower-body injuries, it is important to make the most accurate model possible. I used a high-performing machine learning algorithm known as [Extreme Gradient Boosting](https://en.wikipedia.org/wiki/XGBoost) as the basis for a classifier model. As with other machine learning algorithms of this type, it works by playing the Hot and Cold game with itself. The program first searches for patterns to learn what injured players look like as reflected in the data. Then it applies the pattern to data it has not seen and guesses which of these players are injured. After recording where it was correct, it searches for patterns again and makes a new prediction. If the new prediction is better, the model "gets hotter" and continues down that path. If the new prediction is "colder", the algorithm changes course and uses the factors a different way to find better patterns.
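To give a feel for the boosting idea, here is a deliberately simplified sketch: each round fits a one-split "stump" to the errors left by the previous rounds and adds a small correction. This is not XGBoost and not the model used in this report. It uses squared error on a single made-up feature, whereas XGBoost adds regularization, second-order gradient information, and deeper trees. All names and data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=50, eta=0.3):
    """Repeatedly fit stumps to the remaining errors and add scaled corrections."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [p + eta * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

# Hypothetical data: one feature, target flips from 0 to 1 above x = 3
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, preds = boost(xs, ys)
print([round(p, 3) for p in preds])
```

The learning rate `eta` plays the same role as the `eta` parameter in the model settings: smaller values take more cautious steps and need more rounds.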

The initial model scored 0.65 AUC using the factors above as inputs. [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which stands for Area Under the Curve, is a common way to measure performance for classifiers. Here the score measures the model's ability to correctly separate players injured for more than seven days from those who were not. A score of 0.5 is what you would get by ranking players at random by their likelihood of injury. A score of 1.0 is perfect, meaning that the groups are correctly ranked and separated.
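
One way to read AUC is as the fraction of (injured, not injured) player pairs in which the injured player receives the higher predicted score. The short sketch below computes it that way on made-up numbers. The function name `auc_by_ranking` and the toy labels and scores are assumptions for illustration; the report's scores come from XGBoost's built-in `auc` metric.

```python
def auc_by_ranking(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = injured) and model scores
labels = [1, 1, 0, 0]
auc_mixed = auc_by_ranking(labels, [0.9, 0.4, 0.5, 0.1])    # 3 of 4 pairs correct
auc_perfect = auc_by_ranking(labels, [0.9, 0.8, 0.2, 0.1])  # every pair correct
auc_constant = auc_by_ranking(labels, [0.5, 0.5, 0.5, 0.5])  # no information
print(auc_mixed, auc_perfect, auc_constant)
```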

In [None]:

#%%################################
## Initial model
## Xgboost 
#

from category_encoders import OrdinalEncoder, WOEEncoder, TargetEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import shap
shap.initjs()

oe = OrdinalEncoder()

y_initial = plays_agg.pop('PlayerOut7Days_last')
X_initial = oe.fit_transform(plays_agg)
print(X_initial.columns)
xgbdata = xgb.DMatrix(data=X_initial,label=y_initial.to_numpy())

params = {'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'eta': 0.03,
          'max_depth': 4
          }
cv_scores = xgb.cv(dtrain=xgbdata,
                    params=params,
                    nfold=5,
                    num_boost_round=1800,
                    early_stopping_rounds=60,
                    verbose_eval=False,
                    as_pandas=True,
                    seed=135
                    )

best_round = cv_scores['test-auc-mean'].idxmax()
best_score = cv_scores['test-auc-mean'].max()


In [None]:
print(f'Model Score: {best_score:.2f} AUC after {best_round} iterations.')

I improved the model using the following techniques:

  - Combined features into pairs to explicitly include interactions
  - Randomly shuffled features and selected only those contributing to model performance
  - Tuned the parameters of the model to fit the data
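
The shuffling technique in the list above can be sketched as follows. This is a minimal illustration on made-up data, not the selection code used for this notebook: `toy_model` is a hypothetical stand-in for a fitted classifier, and the three unnamed features are generated at random. Shuffling one feature at a time breaks its link to the label, and the resulting drop in AUC indicates how much the model relies on it.

```python
import random

def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier; it ignores the third feature."""
    return 0.7 * row[0] + 0.3 * row[1]

rng = random.Random(42)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(300)]
# Made-up labels: driven by the first two features plus some noise
labels = [int(toy_model(r) + rng.gauss(0, 0.1) > 0.5) for r in rows]

baseline = auc(labels, [toy_model(r) for r in rows])

drops = []
for j in range(3):
    column = [r[j] for r in rows]
    rng.shuffle(column)  # break the link between feature j and the label
    shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
    drops.append(baseline - auc(labels, [toy_model(r) for r in shuffled]))

# Keep only the features whose shuffling hurts the score
kept = [j for j, d in enumerate(drops) if d > 0]
print(f"baseline AUC: {baseline:.3f}")
print("AUC drop per feature:", [round(d, 3) for d in drops])
print("features kept:", kept)
```

The third feature is never used by `toy_model`, so shuffling it leaves the score unchanged and it is dropped. In practice the shuffle is usually repeated several times and evaluated on held-out data to reduce noise.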
  
The results below show the AUC of the improved model.

In [None]:
#####################
## Optimized Model
#

y = y_initial

# Keep best factors
final_cols = ['AccelLateral_max_mean', 
              'DaysRest_mean',
              'PctWetWeather_mean',
              'AccelLong_max_mean',
              'PctPlaysSynthetic_last',
              'Temperature_mean',
              'PctOpenStadium_mean',
              'FieldType_last'
              ]
X = plays_agg[final_cols].copy()

# Combine
# X['DaysRest_*Stadium'] = X.DaysRest_mean * X.PctOpenStadium_mean


# # Categorize PctPlaysSynthetic into buckets and encode
# bin_labels = ['b'+str(num) for num in range(10)]
# X['RelativePctSynth'] = pd.qcut(X.PctPlaysSynthetic_last, 10, labels=bin_labels)

te = TargetEncoder(cols='FieldType_last')
X = te.fit_transform(X, y=y)

# X = X.drop(columns=['AccelLong_max_mean', 'PctPlaysSynthetic_last'])

params_final = {'objective': 'binary:logistic',
                  'eval_metric': 'auc',
                  'colsample_bytree': 0.7,
                  'subsample': 0.4,
                  'gamma': 0.1,
                  'eta': 0.05,
                  'max_depth': 10,
                  'max_leaves': 1024,
                  'grow_policy': 'depthwise',
                  'min_child_weight': 1,
                  'lambda': 1.0,
                  'alpha': 0.0,
                  'scale_pos_weight': 1.0,
                  'tree_method': 'hist',
                  'max_bin': 256,
                  'seed': 429
                  }

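# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
skf = StratifiedKFold(n_splits=14, shuffle=True, random_state=2233)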
oof_preds = np.zeros((len(plays_agg)))
for train_idx, val_idx in skf.split(X, y):
    # split() yields positional indices, so select rows by position rather than by label
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.to_numpy()[train_idx], y.to_numpy()[val_idx]
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    dval = xgb.DMatrix(data=X_val, label=y_val)
    watchlist = [(dtrain, 'train'), (dval, 'eval')]
    model_final = xgb.train(params_final,
                            dtrain,
                            num_boost_round=1200,
                            evals=watchlist,
                            early_stopping_rounds=60,
                            verbose_eval=False
                            )
    preds = model_final.predict(dval, ntree_limit=model_final.best_ntree_limit)
    oof_preds[val_idx] = preds                       


In [None]:
score = roc_auc_score(y, oof_preds)
print(f'Optimized Model Score: {score:.2f} AUC.')

This score lies roughly halfway between random choice and perfection, so the model is clearly separating some signal from the noise. That is enough for factor importance to be directionally correct, but the prediction accuracy is not high enough to reliably predict risk for players the model has not seen. Since we are using the model only to assess factor relevance, we can move forward.

The following chart shows the relative importance of factors used in the model. Specific definitions are as follows:

In [None]:
#%%##############
## plot shaps
#

explainer = shap.TreeExplainer(model_final, data=X,
                               model_output='probability')
shap_values_final = explainer.shap_values(X)



In [None]:
# summary_plot draws the figure itself and returns None, so no display() wrapper is needed
shap.summary_plot(shap_values_final, X, plot_type="bar")
# shap.summary_plot(shap_values_final, X)

The chart above shows the following:
  - The factor most relevant to identifying injured players was the average days of rest between games, followed closely by the percent of games in wet or snowy weather. 
  - Lateral and longitudinal acceleration were fairly important
  - The percent of plays on a synthetic field had a measurable influence, but it was less than several other factors
  - The Field Type on the player's last play, including for those who were injured, was barely a factor
  
The relative importance is based on [SHAP](https://github.com/slundberg/shap) values for the factors, which take into account each factor on its own as well as its interactions with other factors. The method is considered one of the most comprehensive ways to assess factor importance. In the next section I show the factors one at a time.
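
To illustrate how a Shapley-style attribution credits a factor both on its own and through interactions, here is a small sketch that computes exact Shapley values by averaging each factor's marginal contribution over every possible ordering. The risk numbers in `toy_risk` are invented for the example and are not results from this analysis, and this brute-force approach is not the algorithm `shap.TreeExplainer` uses internally.

```python
from itertools import permutations

def toy_risk(present):
    """Made-up risk score given the set of factors that are 'switched on'."""
    risk = 0.0
    if "rest" in present:
        risk += 0.10
    if "wet" in present:
        risk += 0.04
    if "wet" in present and "synthetic" in present:
        risk += 0.06  # interaction: appears only when both factors are present
    return risk

def shapley_values(value_fn, names):
    """Average each factor's marginal contribution over all orderings."""
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        present = set()
        for name in order:
            before = value_fn(present)
            present.add(name)
            totals[name] += value_fn(present) - before
    return {name: total / len(orders) for name, total in totals.items()}

factors = ["rest", "wet", "synthetic"]
phi = shapley_values(toy_risk, factors)
for name in factors:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy setup "synthetic" contributes nothing by itself, yet it still receives half of the interaction term, while "wet" receives its standalone effect plus the other half. The attributions also add up to the total risk when all factors are present.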

<hr class="light">

<div class="h3">4. EXAMINING MODEL FACTORS</div>

Below are charts examining the distribution of each factor for injured vs. non-injured players. Each of these factors shows some difference in distribution, even if it is small in some cases. As noted before, the factor importance calculated above includes interactions among factors and may not completely match the differences seen in one factor at a time.

In [None]:
df = pd.concat([X, y], axis=1)
df['Injured'] = np.where(df.PlayerOut7Days_last == 0, 'Good to Play', 'Missed >7 Days')
plotlist = []
for col in df.columns.tolist()[:-2]:
    kde_ = (df.hvplot.kde(y=col, by='Injured', alpha=0.6, yaxis=None, 
                          title=col, width=450, height=250))
    plotlist.append(kde_)
    
display(hv.Layout(plotlist).cols(2))

The single-factor distributions show differences that are generally consistent with the model importance. As mentioned before, this is not always the case, so it provides added assurance when they agree.

<a id='ap'></a>
<div class="h2">Potential Application</div>

In this section I discuss two areas of opportunity to address the factors found to influence lower body injury rates:
  - Scheduling games
  - Selecting plays
  <br/>
  <br/>
  <br/>
  
<div class="h3">SCHEDULING GAMES</div>

Setting the schedule of games for an NFL season is an enormous task. Here are a few of the many considerations:
  - Fan preferences 
  - Rules for the competitive framework
  - Priorities of broadcast partners
  - Travel considerations for teams
  - Rest between games

Because the model identifies rest as an important factor, any change to the number of games per year or to rest between games should be carefully considered.

I did not come across any mention of field surface as a consideration for scheduling. The percent of play on synthetic fields, which is relevant to injury rates, differs quite a bit among players. As one might expect, players based at stadiums with natural turf play over half their games on natural turf, and vice versa for players based at stadiums with synthetic turf.

The chart below shows how big the difference can be in practice. It contains each team's games by type of field surface for the 2018 season.

<span class="note"> <i>Hover over the bars to see exact numbers.</i> </span>

In [None]:
#%%
display(natural + synth)

The Atlanta Falcons of the NFC South played only three games on natural turf in 2018. Under NFL scheduling rules, they had to play 5 teams based on synthetic fields: 2 division rivals, 2 teams from the NFC East, and 1 team from the AFC North. However, there may have been an opportunity in their home/away schedule: they played 4 games at home against teams based at fields with natural turf.

Adding field surface into the scheduling equation is a daunting proposition. Except for a few swaps, it is fairly certain that current scheduling rules, which have largely been in place since the early 2000s, would have to change. Proposals in the past have included relaxing the constraints on divisional matches, adjusting the rotating division matches, and even changing the league structure. Others have mentioned holding games at neutral sites on occasion, which could be chosen in part based on the playing surface.

<hr>

<div class="h3">PLAY SELECTION</div>

High acceleration has been identified as being relevant to the likelihood of injury to the lower extremities. It's possible that teams can avoid certain plays as they become more aware of how patterns repeated over time increase player risk. Zachary Binney, an epidemiologist and consultant who has worked with Major League Baseball and college sports teams on injury prevention, had the following to say in a [recent article](https://www.theverge.com/2019/12/6/20999403/amazon-nfl-injuries-concussions-big-data-machine-learning):

> “You could look at what happens when a wide receiver moving this quickly makes this sharp turn, and might be able to tease something out. ... One thing they could do is put the information out there and tell coaches that when they ask a lineman to do some kind of block, or route, it creates the sorts of changes in direction or deceleration when we see bad things happen. It’s in the best interest of coaches, after all, to keep players healthy, and there could be alternatives to riskier plays.”

I speculate that coaches and players already consider situational risk to the extent possible and make suitable choices. The opportunity going forward is to present additional information to further reduce risk.

<hr>

<br/>
<br/>
In conclusion, there is evidence that playing on synthetic turf over two or more seasons is associated with a higher rate of injury to knees, ankles, and feet. Other factors also show evidence of influencing the likelihood of player injury over time: 

  - Lateral changes in speed from cutting or turning
  - Changes in speed from starting and stopping
  - Rain or snow during a game
  - Days of rest between games
  - Games played in open vs. closed stadiums
  
There may be solid opportunities to address some of these factors for the coming season. Pursuing viable options would improve player health and safety and continue the NFL's commitment to keep the game safe and relevant.

<a id='ax'></a>
<div class="h2">Appendix</div>

<div class="h3">QUALIFICATIONS</div>

I work as a Customer Data Scientist for [H2O.ai](https://www.h2o.ai/), helping everyday companies use AI to make better decisions. My functional expertise includes predictive modeling, machine learning, and statistics.

I graduated from MIT with Master’s degrees in Business and Engineering. I also have a BS from the United States Military Academy at West Point. Professional certifications include Microsoft Data Science Professional and Lean Six Sigma Master Black Belt.

<hr>


<div class="h3" style="margin-top: 50px">CALCULATING ACCELERATION</div>

I made the following assumptions during my treatment of player movement data:

  - Movement is assumed to be in the horizontal plane. I did not estimate acceleration in the up-down direction, which is proportional to a player's mass and the number of steps taken.
  - Body mechanics and angles are not considered. Players with correct running technique and strong mechanics transfer lateral force on the joints to compression force.
  - Instantaneous changes in direction and speed are smoothed by using a rolling mean over 3/10 of a second (three successive measurements).
  - As mentioned previously, angular acceleration from twisting is not considered due to problems with orientation data.
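
The assumptions above can be sketched in code. This is a simplified illustration, not the exact calculation used for the model features. It assumes x/y positions sampled every 0.1 seconds, estimates velocity and acceleration by finite differences, splits acceleration into components along and across the direction of travel, and smooths each with a three-sample rolling mean. The function names and the constant `DT` are hypothetical.

```python
import math

DT = 0.1  # seconds between tracking samples (assumed 10 measurements per second)

def rolling_mean(values, window=3):
    """Trailing rolling mean; the first window - 1 entries are dropped."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def accelerations(xs, ys, dt=DT):
    """Return smoothed (longitudinal, lateral) acceleration from x/y positions."""
    vx = [(b - a) / dt for a, b in zip(xs, xs[1:])]
    vy = [(b - a) / dt for a, b in zip(ys, ys[1:])]
    longitudinal, lateral = [], []
    for i in range(1, len(vx)):
        ax = (vx[i] - vx[i - 1]) / dt
        ay = (vy[i] - vy[i - 1]) / dt
        speed = math.hypot(vx[i - 1], vy[i - 1])
        if speed == 0:
            # Direction of travel is undefined for a stationary player
            longitudinal.append(0.0)
            lateral.append(0.0)
            continue
        ux, uy = vx[i - 1] / speed, vy[i - 1] / speed  # unit vector along direction of travel
        longitudinal.append(ax * ux + ay * uy)   # speeding up or slowing down
        lateral.append(-ax * uy + ay * ux)       # cutting or turning
    return rolling_mean(longitudinal), rolling_mean(lateral)

# Made-up example: a player speeding up in a straight line at 2 units/s^2
t = [i * DT for i in range(8)]
xs = [0.5 * 2.0 * ti ** 2 for ti in t]
ys = [0.0] * len(t)
lon, lat = accelerations(xs, ys)
print([round(v, 3) for v in lon])
print([round(v, 3) for v in lat])
```

For straight-line motion all of the acceleration shows up in the longitudinal component and the lateral component stays at zero. A sharp cut at constant speed would show the opposite pattern.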
