The focus of the 2021-2022 Big Data Bowl is special teams. This demo provides sample code to help you get started and covers the following:

[Speed on Kickoff Plays](#whatis) - explains the metric of interest and how it can be used to analyze members of the kickoff coverage unit.

[Read Data](#ReadData) - will read the tracking / non-tracking data.

[Clean Data](#CleanData) - will filter for plays of interest and filter out players.

[Mixed-Effects Model](#ModelAll) - will create a mixed-effects model of max speed using player and surface.

[Analyze Stability](#ModelAcross) - will analyze the stability of the modelling technique.

<a id="whatis"></a>
# Speed on Kickoff Plays

When NFL kickoff specialists kick the ball to the opposing team at the beginning of a half or after a score, the other members of the kickoff coverage unit typically sprint down the field to tackle the returner. To limit the return yards, the kickoff team wants to get down the field and meet the returner as fast as possible. Thus, the max speed a player reaches during the initial sprint (before being blocked or obstructed by the return team) can be a valuable metric for analyzing players on the kickoff team. However, taking the raw average of max speed can be deceiving because other factors, such as field surface, can affect player speeds as well. This demo will show how to model player speeds on the kickoff coverage unit while accounting for surface.


<img src="https://operations.nfl.com/images/infocus-rules/evolution-of-kickoff-poster.jpg" style="width:500px;">

<a id="ReadData"></a>
# Read Data

Before we can do anything, we must read in the data.

In [None]:
import numpy as np 
import pandas as pd

In [None]:
##reading in non-tracking data

#includes info on games
df_games = pd.read_csv("../input/nfl-big-data-bowl-2022/games.csv")

#includes play-by-play info on specific plays
df_plays = pd.read_csv("../input/nfl-big-data-bowl-2022/plays.csv")

#includes background info for players
df_players = pd.read_csv("../input/nfl-big-data-bowl-2022/players.csv")

#Reading tracking data (2020 only for this case)
df_tracking = pd.read_csv("../input/nfl-big-data-bowl-2022/tracking2020.csv")

#includes scouting info on specific plays
df_PFFScouting = pd.read_csv("../input/nfl-big-data-bowl-2022/PFFScoutingData.csv")

#loading data from Lee Sharpe's public GitHub repository. It includes info on field surface.
df_leeSharpeGames = pd.read_csv("https://raw.githubusercontent.com/nflverse/nfldata/master/data/games.csv")

<a id="CleanData"></a>
# Clean Data

To start the cleaning process, we must first use PFF scouting data to identify special teams safeties who are not actively advancing downfield. To be able to merge this information, we will use tracking data to get a map of a player's jersey number to their `nflId` in each game.

In [None]:
#using df_tracking to merge to jersey numbers

#selecting variables of interest & dropping duplicates - jersey # is constant throughout game
df_jerseyMap = df_tracking.drop_duplicates(subset = ["gameId", "team", "jerseyNumber", "nflId"])

#joining to games
df_jerseyMap = pd.merge(df_jerseyMap, df_games, left_on=['gameId'], right_on =['gameId'])

#getting name of team
conditions = [
    (df_jerseyMap['team'] == "home"),
    (df_jerseyMap['team'] != "home"),
]

values = [df_jerseyMap['homeTeamAbbr'], df_jerseyMap['visitorTeamAbbr']]

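#replacing "home" / "away" with the team abbreviation
df_jerseyMap['team'] = np.select(conditions, values)

#adjusting jersey number so that it includes 0 when < 10
#(the length check below assumes jerseyNumber was read as a float, e.g. 5.0 -> "5.0", 12.0 -> "12.0")
df_jerseyMap['jerseyNumber'] = df_jerseyMap['jerseyNumber'].astype(str)

#single-digit numbers have fewer than 4 characters ("5.0"), so they get a leading 0
df_jerseyMap.loc[df_jerseyMap['jerseyNumber'].map(len) < 4, 'jerseyNumber'] = "0"+df_jerseyMap.loc[df_jerseyMap['jerseyNumber'].map(len) < 4, 'jerseyNumber'].str[:2]

#keeping the first two characters only ("05", "12")
df_jerseyMap['jerseyNumber'] = df_jerseyMap['jerseyNumber'].str[:2]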

#getting team and jersey
df_jerseyMap['teamJersey'] = df_jerseyMap['team'] + ' ' + df_jerseyMap['jerseyNumber'].str[:2]

#map to merge nflId to teamJersey
df_jerseyMap = df_jerseyMap[['gameId', 'nflId', 'teamJersey']]

df_jerseyMap = df_jerseyMap.sort_values(['gameId', 'nflId', 'teamJersey'])

df_jerseyMap.head()
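
For intuition, the zero-padding step can be sketched on a toy table. This is a minimal illustration rather than the notebook's exact logic: the `toy` frame and its values are made up, and it assumes every row has a jersey number (the real tracking data may contain rows without one, which this shortcut would not handle).

```python
import pandas as pd

# made-up rows; jersey numbers stored as floats, as they would be if the column contains missing values
toy = pd.DataFrame({"team": ["KC", "KC", "HOU"], "jerseyNumber": [7.0, 15.0, 3.0]})

# cast to int, then to str, then left-pad with zeros to width 2
toy["jerseyNumber"] = toy["jerseyNumber"].astype(int).astype(str).str.zfill(2)

# build the same "TEAM NN" style key used for merging
toy["teamJersey"] = toy["team"] + " " + toy["jerseyNumber"]

print(toy["teamJersey"].tolist())
```

The resulting key has the same shape as the entries split out of `specialTeamsSafeties`, which is what makes the later merge possible.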

In [None]:
#dataframe will include gameId, playId and nflId for each special teams safety
df_PFF_specialTeamSafeties = df_PFFScouting.copy()

#splitting into a column for each special teams safety
df_PFF_specialTeamSafeties[['teamJersey1', 'teamJersey2', 'teamJersey3', 'teamJersey4', 'teamJersey5', 'teamJersey6']] = df_PFF_specialTeamSafeties['specialTeamsSafeties'].str.split('; ',expand=True)

#selecting jersey numbers for each team
df_PFF_specialTeamSafeties = df_PFF_specialTeamSafeties[['gameId', 'playId', 'teamJersey1', 'teamJersey2', 'teamJersey3', 'teamJersey4', 'teamJersey5', 'teamJersey6']]

#gathering data
df_PFF_specialTeamSafeties = pd.melt(df_PFF_specialTeamSafeties, id_vars =['gameId', 'playId'], value_vars =['teamJersey1', 'teamJersey2', 'teamJersey3', 'teamJersey4', 'teamJersey5', 'teamJersey6'],
               value_name = 'teamJersey')

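#dropping rows without a special teams safety (the result must be assigned back)
df_PFF_specialTeamSafeties = df_PFF_specialTeamSafeties.dropna(subset = ['teamJersey'])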

#joining to jersey map
df_PFF_specialTeamSafeties = pd.merge(df_PFF_specialTeamSafeties, df_jerseyMap, on = ['gameId', 'teamJersey'])

#selecting variables of interest
df_PFF_specialTeamSafeties = df_PFF_specialTeamSafeties[['gameId', 'playId', 'nflId']]

df_PFF_specialTeamSafeties = df_PFF_specialTeamSafeties.sort_values(['gameId', 'playId', 'nflId'])

df_PFF_specialTeamSafeties.head()
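
As an alternative sketch of the split-and-reshape step, `str.split` followed by `DataFrame.explode` produces one row per safety without fixing the number of columns in advance. The toy values below are invented for illustration, and this is not the approach the cell above takes.

```python
import pandas as pd

# made-up scouting rows: one play with two safeties, one play with none
toy_scouting = pd.DataFrame({
    "gameId": [1, 1],
    "playId": [10, 20],
    "specialTeamsSafeties": ["KC 07; KC 15", None],
})

toy_safeties = (
    toy_scouting
    # turn "KC 07; KC 15" into the list ["KC 07", "KC 15"]
    .assign(teamJersey=toy_scouting["specialTeamsSafeties"].str.split("; "))
    # one row per list element
    .explode("teamJersey")
    # plays without safeties produce a missing value, which is dropped here
    .dropna(subset=["teamJersey"])
    [["gameId", "playId", "teamJersey"]]
    .reset_index(drop=True)
)

print(toy_safeties)
```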

Next, we will create a dataframe to only include deep kickoffs so that we can remove all unnecessary rows in the tracking data.

In [None]:
#creating data frame that will only include deep kickoffs
df_deepKickoffs = df_plays.copy()

#joining the scouting data
df_deepKickoffs = pd.merge(df_deepKickoffs, df_PFFScouting, on = ['gameId', 'playId'])

#filtering for kickoff plays only & deep kickoffs only
df_deepKickoffs = df_deepKickoffs[(df_deepKickoffs['specialTeamsPlayType'] == 'Kickoff') & (df_deepKickoffs['kickType'] == 'D')]

#selecting variables of interest
df_deepKickoffs = df_deepKickoffs[['gameId', 'playId', 'kickerId', 'possessionTeam']]

df_deepKickoffs.head()

Now, we will use the data frames we created to filter the tracking data. We will use `df_PFF_specialTeamSafeties` to filter out players who were special teams safeties and use `df_deepKickoffs` to remove plays where there was not a deep kickoff. For each player on each play in the tracking data, we will keep the first 40 frames starting at the kickoff, which approximately correspond to the initial sprint portion of the play. Over that interval, we calculate the maximum speed reached by each player on the play.

In [None]:
df_maxSpeeds = df_tracking.copy()

#joining games
df_maxSpeeds = pd.merge(df_maxSpeeds, df_games, on = 'gameId')

#using a join to remove special teams safeties
df_maxSpeeds = pd.merge(left = df_maxSpeeds, right = df_PFF_specialTeamSafeties, how='left', indicator=True, on = ['gameId', 'playId', 'nflId'])

df_maxSpeeds = df_maxSpeeds.loc[df_maxSpeeds._merge == 'left_only', :].drop(columns = '_merge')

#joining deep kickoffs
df_maxSpeeds = pd.merge(df_maxSpeeds, df_deepKickoffs, on = ['gameId', 'playId'])

#removing the kicker and keeping only players on the kicking team
df_maxSpeeds = df_maxSpeeds[(df_maxSpeeds['kickerId'] != df_maxSpeeds['nflId']) &
                            
                            #player is on home team and kicking team is home
                            ((df_maxSpeeds['team']=='home') & (df_maxSpeeds['possessionTeam'] == df_maxSpeeds['homeTeamAbbr']) |
                             
                             #or player is on away team and kicking team is away
                            (df_maxSpeeds['team']=='away') & (df_maxSpeeds['possessionTeam'] == df_maxSpeeds['visitorTeamAbbr']))]

#select variables of interest
df_maxSpeeds = df_maxSpeeds[['gameId', 'playId', 'frameId', 'nflId', 'event', 's']]

#arranging by gameId, playId and frameId
df_maxSpeeds = df_maxSpeeds.sort_values(['gameId', 'playId', 'frameId'])

#grouping by gameId and playId & filtering for frames at or after the kickoff
df_maxSpeeds = df_maxSpeeds.loc[df_maxSpeeds.groupby(['gameId', 'playId']).event.transform(lambda z: np.cumsum(z.isin(['kickoff', 'free_kick'])) >= 1)]

#grouping by gameId, playId and nflId & filtering for first 40 observations
df_maxSpeeds = df_maxSpeeds.groupby(['gameId', 'playId', 'nflId']).head(40).reset_index()

#calculating max speed for given play / player
df_maxSpeeds = df_maxSpeeds.groupby(['gameId', 'playId', 'nflId']).s.apply(lambda z: z.max()).reset_index()

#renaming speed column as maxSpeed
df_maxSpeeds = df_maxSpeeds.rename(columns={"s" : "maxSpeed"})
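
The "join to remove" step above is an anti-join: a left merge with `indicator=True`, keeping only the rows that found no match. A minimal sketch on made-up ids (the `toy_*` names are hypothetical):

```python
import pandas as pd

# made-up tracking rows for three players on one play
toy_tracking = pd.DataFrame({"gameId": [1, 1, 1], "playId": [10, 10, 10], "nflId": [100, 101, 102]})

# made-up list of players to exclude on that play
toy_excluded = pd.DataFrame({"gameId": [1], "playId": [10], "nflId": [101]})

# the left merge adds a "_merge" column saying where each row was found
toy_merged = pd.merge(toy_tracking, toy_excluded, how="left",
                      on=["gameId", "playId", "nflId"], indicator=True)

# "left_only" rows had no match in toy_excluded, so they are the ones to keep
toy_kept = toy_merged.loc[toy_merged["_merge"] == "left_only"].drop(columns="_merge")

print(toy_kept["nflId"].tolist())
```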

In [None]:
df_maxSpeeds.head()
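
To make the windowing logic concrete, here is a small sketch on a single made-up player. It assumes a window of 3 frames instead of 40 and computes the "at or after kickoff" flag per player rather than per play, so it illustrates the idea and is not a drop-in replacement for the cell above.

```python
import pandas as pd

# made-up frames for one player; the fast pre-kick speed (9.9) should be ignored
toy_frames = pd.DataFrame({
    "gameId": [1] * 6,
    "playId": [10] * 6,
    "nflId": [100] * 6,
    "frameId": [1, 2, 3, 4, 5, 6],
    "event": ["None", "None", "kickoff", "None", "None", "None"],
    "s": [9.9, 1.0, 2.0, 5.0, 7.0, 6.0],
})

keys = ["gameId", "playId", "nflId"]
toy_frames = toy_frames.sort_values(keys + ["frameId"])

# 1 on the kickoff frame, 0 elsewhere; the running sum is >= 1 from the kick onward
is_kick = toy_frames["event"].isin(["kickoff", "free_kick"]).astype(int)
after_kick = is_kick.groupby([toy_frames[k] for k in keys]).cumsum() >= 1

# first 3 frames from the kick onward (assumed window; the notebook uses 40)
window = toy_frames[after_kick].groupby(keys).head(3)

# max speed inside the window
toy_max = window.groupby(keys)["s"].max().reset_index(name="maxSpeed")

print(toy_max)
```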

In [None]:
#will have max speeds for each player with surface info
df_maxSpeeds2 = df_maxSpeeds.copy()

#merging to Lee Sharpe's data
df_maxSpeeds2 = pd.merge(df_maxSpeeds, df_leeSharpeGames, left_on = ['gameId'], right_on = ['old_game_id'])

#selecting variables of interest
df_maxSpeeds2 = df_maxSpeeds2[['gameId', 'playId', 'nflId', 'week', 'surface', 'maxSpeed']]

#stripping the surface column to remove extra spaces at the end
df_maxSpeeds2['surface'] = df_maxSpeeds2['surface'].str.strip()

df_maxSpeeds2['surface'].unique()

<a id="ModelAll"></a>
# Mixed Effects Model

To account for the lack of independence among observations from the same player, we fit a mixed effects model of player speed that includes a random intercept for each player (called a random effect) and a fixed effect for surface type. The random effect represents how that player compares to the league average after some shrinkage towards it. Random effects let us avoid the overfitting that would come from, e.g., a separate fixed effect for each player, and the shrinkage pulls some players more than others: those with fewer observations get pulled further towards the league average.
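
The shrinkage can be sketched numerically. In a random-intercept model, a player's estimated effect is roughly the player's average residual multiplied by a factor between 0 and 1 that grows with the number of observations. The function name and the variance values below are assumptions chosen for illustration, not estimates from this data.

```python
def shrinkage_factor(n_obs, var_player, var_resid):
    """Weight applied to a player's average residual in a random-intercept model."""
    return var_player / (var_player + var_resid / n_obs)

# assumed variances, for illustration only
var_player = 0.25   # between-player variance
var_resid = 1.0     # within-player (residual) variance

# the factor approaches 1 as the player accumulates more kickoff plays
factors = {n: shrinkage_factor(n, var_player, var_resid) for n in (2, 10, 50)}

for n, f in factors.items():
    print(n, round(f, 3))
```

With these assumed values, a player with 2 plays keeps about a third of their raw deviation from the league average, while a player with 50 plays keeps most of it.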

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

#fitting model
model = smf.mixedlm(formula= "maxSpeed ~ C(surface, Treatment(reference='grass'))", data = df_maxSpeeds2, groups = 'nflId')

modelF = model.fit(method=["lbfgs"])

Let's view a summary of the model:

In [None]:
print(modelF.summary())

As you can see in the "Random effects" section of the summary, the `nflId` effect has a standard deviation of ~0.5. This implies that the model estimated the population of player effects as a distribution with a mean of 0 and a standard deviation of ~0.5.

In [None]:
modelF.params

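Moreover, each of the surface effects appears to be significant, and Matrix Turf stands out as the largest. Because grass is the reference level, the intercept corresponds to the average max speed on grass, and each surface coefficient is a difference from grass.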
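
To see how treatment (reference-level) coding shapes the coefficients, here is a small sketch using ordinary least squares in NumPy. It deliberately leaves out the player random effect, and the speeds are made up, so it only illustrates how the intercept and a surface coefficient relate to group averages.

```python
import numpy as np

# made-up max speeds on two surfaces
grass = np.array([8.0, 9.0])
turf = np.array([9.0, 10.0])

y = np.concatenate([grass, turf])

# design matrix: a column of ones (intercept) and a 0/1 dummy for the non-reference surface
is_turf = np.concatenate([np.zeros(len(grass)), np.ones(len(turf))])
X = np.column_stack([np.ones(len(y)), is_turf])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, turf_effect = coef

# intercept equals the grass average; turf_effect equals the turf average minus the grass average
print(intercept, turf_effect)
```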

Let's analyze the player effects:

In [None]:
#saving results of model in data frame. extracting ids:
df_modelResults = pd.DataFrame.from_dict(modelF.random_effects, orient = 'index').reset_index()

#renaming the columns appropriately
df_modelResults = df_modelResults.rename(columns={"index" : "nflId", "nflId" : "effect"})

#joining by nflId
df_modelResults = pd.merge(df_modelResults, df_players, on = 'nflId')

#selecting variables of interest
df_modelResults = df_modelResults[['nflId', 'displayName', 'effect']]

df_modelResults.head()
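
The reshaping above can be sketched with a hand-built dictionary. The stand-in below assumes, as the rename in the cell above implies, that each entry maps a player id to a one-element Series labeled with the grouping column name; the ids and values are made up.

```python
import pandas as pd

# hypothetical stand-in for the fitted model's random effects
toy_random_effects = {
    101: pd.Series({"nflId": 0.4}),
    102: pd.Series({"nflId": -0.2}),
}

# keys become the index, the inner label becomes the column name
toy_results = pd.DataFrame.from_dict(toy_random_effects, orient="index").reset_index()

# "index" holds the player ids and "nflId" holds the effect, so both are renamed
toy_results = toy_results.rename(columns={"index": "nflId", "nflId": "effect"})

print(toy_results)
```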

Our results give us a player effect for each player which we can use to rank them. First, we plot the top 25 players with the highest player effect:

In [None]:
#ordering for top effects
df_modelResults = df_modelResults.sort_values(by = 'effect', ascending = False)

#filtering for top 25
df_visual1 = df_modelResults.head(25)

In [None]:
#plotting
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,10))
plt.rc('grid', linestyle=':', color='lightgray', linewidth=0.5)
plt.grid(True, zorder = 0)

#reordering to show largest effect at the top
df_visual1 = df_visual1.sort_values(by = 'effect')

#creating bar graph
plt.barh(list(df_visual1['displayName']), list(df_visual1['effect']), color = "lightblue")
plt.xticks([0.0, 0.5, 1, 1.5])

#setting labels
plt.title("Top 25 Players in Max Speed Player Effect", fontsize = 16)
plt.xlabel("Max Speed Player Effect", fontsize = 14)
plt.ylabel("Player Name", fontsize = 14)

plt.show()

Next, we analyze the density of player effects:

In [None]:
import seaborn as sns

fig = plt.figure(figsize=(15,10))

#plotting density plot
sns.kdeplot(df_modelResults['effect'], fill = True)

#setting theme
plt.rc('grid', linestyle=':', color='lightgray', linewidth=0.5)
plt.grid(True, zorder = 0)

#setting labels
plt.title("Density of Max Speed Player Effects", fontsize = 16)
plt.xlabel("Max Speed Player Effects", fontsize = 14)
plt.xticks([-3, -2, -1, 0, 1])

plt.show()

The density curve shows that the distribution of player effects departs somewhat from a normal distribution. While most players fall between -1 and 1, there are some outliers on the left side of the curve. Although we removed special teams safeties and kickers in the data cleaning process, some of these players likely have a non-traditional role on the kickoff team.

<a id="ModelAcross"></a>
# Analyze Stability

To ensure that our results are meaningful, it is important to check the stability of this analysis across weeks. If each player effect is mostly independent of the player's previous weeks' player effect, it might suggest that the analysis cannot be used to properly measure performance. Here, we will make a model using data from weeks 1-8 only and compare the results to a model using weeks 9-17 only.

In [None]:
#dataframe for week 1-8
df_wk1to8 = df_maxSpeeds2.loc[df_maxSpeeds2['week'] <= 8].copy()

#fitting model filtering for week 1-8
model1 = smf.mixedlm(formula= "maxSpeed ~ C(surface, Treatment(reference='grass'))", data = df_wk1to8, groups = 'nflId')
model_wk1to8 = model1.fit(method=["lbfgs"])

#saving results of effects
df_model_wk1to8Results = pd.DataFrame.from_dict(model_wk1to8.random_effects, orient = 'index').reset_index()
df_model_wk1to8Results = df_model_wk1to8Results.rename(columns={"index" : "nflId", "nflId" : "effect"})
df_model_wk1to8Results = pd.merge(df_model_wk1to8Results, df_players, on = 'nflId')
df_model_wk1to8Results = df_model_wk1to8Results[['nflId', 'displayName', 'effect']]

df_model_wk1to8Results.head()

In [None]:
#dataframe for week 9-17
df_wk9to17 = df_maxSpeeds2.loc[df_maxSpeeds2['week'] >= 9].copy()

#fitting model filtering for week 9-17
model2 = smf.mixedlm(formula= "maxSpeed ~ C(surface, Treatment(reference='grass'))", data = df_wk9to17, groups = 'nflId')
model_wk9to17 = model2.fit(method=["lbfgs"])

#saving results of effects
df_model_wk9to17Results = pd.DataFrame.from_dict(model_wk9to17.random_effects, orient = 'index').reset_index()
df_model_wk9to17Results = df_model_wk9to17Results.rename(columns={"index" : "nflId", "nflId" : "effect"})
df_model_wk9to17Results = pd.merge(df_model_wk9to17Results, df_players, on = 'nflId')
df_model_wk9to17Results = df_model_wk9to17Results[['nflId', 'displayName', 'effect']]

df_model_wk9to17Results.head()

In [None]:
#merging model results (the left frame gets the first suffix, the right frame gets the second)
df_model_ResultsCompare = pd.merge(df_model_wk1to8Results, df_model_wk9to17Results, on = ['nflId', 'displayName'], suffixes = ('_wk1to8', '_wk9to17'))

df_model_ResultsCompare.head()
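
When merging two frames that share a column name, the order of `suffixes` follows the order of the frames: the first suffix goes to the left frame and the second to the right. A minimal sketch with made-up values:

```python
import pandas as pd

# made-up effects from two halves of a season
toy_first = pd.DataFrame({"nflId": [1, 2], "effect": [0.1, 0.2]})
toy_second = pd.DataFrame({"nflId": [1, 2], "effect": [0.3, 0.4]})

# toy_first is the left frame, so its "effect" column receives "_first"
toy_both = pd.merge(toy_first, toy_second, on="nflId", suffixes=("_first", "_second"))

print(toy_both)
```

Swapping the two frames without swapping the suffixes would silently mislabel the columns, so it is worth checking a few rows after the merge.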

An easy way to measure stability is to graph the values in a scatter plot. We will also draw a line with a slope of 1 and an intercept of 0 as a reference: points on this line have the same player effect in both samples. If the points align well with the line, we can feel good that the first sample is well correlated with the second sample.

In [None]:
#plotting scatter plot
fig = plt.figure(figsize=(10,10))
plt.rc('grid', linestyle=':', color='lightgray', linewidth=0.5)
plt.grid(True, zorder = 0)

#finding the outliers
df_model_ResultsCompare_highlight = df_model_ResultsCompare.loc[((abs(df_model_ResultsCompare['effect_wk1to8'] - df_model_ResultsCompare['effect_wk9to17']) > 1) |
                                                                (df_model_ResultsCompare['effect_wk9to17'] == max(df_model_ResultsCompare['effect_wk9to17'])) |
                                                                (df_model_ResultsCompare['effect_wk1to8'] == max(df_model_ResultsCompare['effect_wk1to8'])))]

#adding points
plt.plot(df_model_ResultsCompare['effect_wk1to8'], df_model_ResultsCompare['effect_wk9to17'], 'o', color = 'black')

#labeling outliers
for x, y, l in zip(df_model_ResultsCompare_highlight['effect_wk1to8'],df_model_ResultsCompare_highlight['effect_wk9to17'], df_model_ResultsCompare_highlight['displayName']):
    plt.text(x, y, l)

#adding the identity line (y = x) as a reference
x = np.linspace(-3, 1)
y = x
plt.plot(x, y,  color='red')
plt.gca().set_aspect('equal')

#setting labels
plt.title("Stability of Max Speed Player Effect from Weeks 1-8 to Weeks 9-17 \n (Each dot represents a player)", fontsize = 16)
plt.xlabel("Weeks 1-8 Max Speed Player Effect", fontsize = 14)
plt.xticks([-3, -2, -1, 0, 1])
plt.ylabel("Weeks 9-17 Max Speed Player Effect", fontsize = 14)
plt.yticks([-4, -2, 0])

plt.show()

As suggested under the density plot, some of the players that are shown as outliers on the graph likely have a non-traditional role on the kickoff team. To analyze the trend more easily, let's remove the players who are outliers:

In [None]:
#plotting scatter plot
fig = plt.figure(figsize=(10,10))
plt.rc('grid', linestyle=':', color='lightgray', linewidth=0.5)
plt.grid(True, zorder = 0)

#filtering out outliers
df_model_ResultsCompareNoOutliers = df_model_ResultsCompare.loc[(df_model_ResultsCompare['effect_wk1to8'] >= -2) & 
                                        (df_model_ResultsCompare['effect_wk1to8'] <= 2) &
                                        (df_model_ResultsCompare['effect_wk9to17'] >= -2) &
                                        (df_model_ResultsCompare['effect_wk9to17'] <= 2)].copy()

#finding the outliers
df_model_ResultsCompare_highlight = df_model_ResultsCompareNoOutliers.loc[((abs(df_model_ResultsCompareNoOutliers['effect_wk1to8'] - df_model_ResultsCompareNoOutliers['effect_wk9to17']) > 1) |
                                                                (df_model_ResultsCompareNoOutliers['effect_wk9to17'] == max(df_model_ResultsCompareNoOutliers['effect_wk9to17'])) |
                                                                (df_model_ResultsCompareNoOutliers['effect_wk1to8'] == max(df_model_ResultsCompareNoOutliers['effect_wk1to8'])))]

#adding points
plt.plot(df_model_ResultsCompareNoOutliers['effect_wk1to8'], df_model_ResultsCompareNoOutliers['effect_wk9to17'], 'o', color = 'black')

#labeling outliers
for x, y, l in zip(df_model_ResultsCompare_highlight['effect_wk1to8'],df_model_ResultsCompare_highlight['effect_wk9to17'], df_model_ResultsCompare_highlight['displayName']):
    plt.text(x, y, l)

#adding the identity line (y = x) as a reference
x = np.linspace(-2, 1)
y = x
plt.plot(x, y,  color='red')
plt.gca().set_aspect('equal')

#setting labels
plt.title("Stability of Max Speed Player Effect from Weeks 1-8 to Weeks 9-17 \n (Each dot represents a player)", fontsize = 16)
plt.xlabel("Weeks 1-8 Max Speed Player Effect", fontsize = 14)
plt.xticks([-1, 0, 1])
plt.ylabel("Weeks 9-17 Max Speed Player Effect", fontsize = 14)
plt.yticks([-1, 0, 1])

plt.show()

Another way to measure stability is to calculate the $R^2$ of the player effects across weeks:

In [None]:
#calculating R^2 with all the data
corr_AllData = np.corrcoef(df_model_ResultsCompare['effect_wk1to8'], df_model_ResultsCompare['effect_wk9to17'])
corr_xy_AllData = corr_AllData[0,1]
Rsquared_AllData = corr_xy_AllData**2

#printing R^2
print('R^2 from week 1-8 player effect to week 9-17 player effect (all players):',Rsquared_AllData)

#calculating R^2 after removing the outliers
corr_NoOutliers = np.corrcoef(df_model_ResultsCompareNoOutliers['effect_wk1to8'], df_model_ResultsCompareNoOutliers['effect_wk9to17'])
corr_xy_NoOutliers = corr_NoOutliers[0,1]
Rsquared_NoOutliers = corr_xy_NoOutliers**2

#printing R^2
print('R^2 from week 1-8 player effect to week 9-17 player effect (outliers removed):',Rsquared_NoOutliers)
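
Squaring the Pearson correlation gives the same number as the $R^2$ of a simple linear regression of one sample on the other. The sketch below checks this on made-up effects; the arrays are illustrative and not taken from the model output.

```python
import numpy as np

# made-up player effects in the two halves of the season
x = np.array([-0.5, 0.0, 0.3, 0.8])
y = np.array([-0.4, 0.1, 0.2, 0.9])

# route 1: squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# route 2: fit y = slope * x + intercept and compare residual to total variation
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared_fit = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(r_squared, r_squared_fit)
```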

Considering the moderately high $R^2$ value and the clear trend in the scatter plot, it is fair to say that this technique has reasonable stability.