# Preface

The NFL has released several data sets about the plays made by Special Teams.

More concretely, we will use four of those data sets, containing the following columns:

**plays** 
* *gameId*: Game identifier, unique (numeric)
* *playId*: Play identifier, not unique across games (numeric)
* *specialTeamsPlayType*: Formation of play: Extra Point, Field Goal, Kickoff or Punt (text)
* *specialTeamsResult*: Special Teams outcome of play dependent on play type: Blocked Kick Attempt, Blocked Punt, Downed, Fair Catch, Kick Attempt Good, Kick Attempt No Good, Kickoff Team Recovery, Muffed, Non-Special Teams Result, Out of Bounds, Return or Touchback (text)
* *returnerId*: nflId(s) of returner(s) on play if there was a special teams return. Multiple returners on a play are separated by a ; (text)
* *penaltyYards*: yards gained by possessionTeam by penalty (numeric)
* *kickLength*: Kick length in air of kickoff, field goal or punt (numeric)
* *kickReturnYardage*: Yards gained by return team if there was a return on a kickoff or punt (numeric)
* *playResult*: Net yards gained by the kicking team, including penalty yardage (numeric)

**tracking2018, 2019 and 2020**
* *gameId*: Game identifier, unique (numeric)
* *playId*: Play identifier, not unique across games (numeric)
* *frameId*: Frame identifier for each play, starting at 1 (numeric)
* *nflId*: Player identification number, unique across players (numeric)
* *s*: Speed in yards/second (numeric)
* *a*: Acceleration in yards/second^2 (numeric)
* *event*: Tagged play details, including moment of ball snap, pass release, pass catch, tackle, etc (text)

# Goal

In this notebook we want to have a look at whether speed or acceleration correlates more strongly with a long punt return.

We do this by using the **plays** dataset and all its data, where *specialTeamsPlayType* is Punt and *specialTeamsResult* is Return, which means the ball was punted and returned.

Further we will use the **tracking** datasets to add the average and top speed and acceleration of the Punt Returners in these plays.

This will be done by matching the *returnerId* of **plays** to the corresponding *nflId* in **tracking**.

# Data Preparation

In [None]:
import matplotlib.pyplot as pl
import pandas as pd
import numpy as np
import seaborn as sns
import math

In [None]:
tracking2018 = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/tracking2018.csv')
tracking2019  = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/tracking2019.csv')
tracking2020 = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/tracking2020.csv')
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/plays.csv')

## Punt_return data set

First we extract all the punt returns from the **plays** data set.

In [None]:
punt_return = plays[(plays['specialTeamsResult'] == 'Return') & (plays['specialTeamsPlayType'] == 'Punt')]

Second, we drop all columns we do not need and check whether the entries we care about are complete.

In [None]:
punt_return = punt_return[['gameId', 'playId', 'returnerId',
       'kickLength', 'kickReturnYardage', 'playResult', 'penaltyYards']]

In [None]:
((punt_return.isna().sum())/punt_return.shape[0]).sort_values()

We see immediately that 0.04% of the *returnerId* values and 0.35% of the *kickReturnYardage* values are missing. The missing *returnerId* values are a very small portion and cannot easily be reconstructed, hence we exclude these rows.

Most of the *penaltyYards* values are also missing, but this does not matter at this point.

In [None]:
punt_return = punt_return[(~punt_return.returnerId.isna())]

The missing values in *kickReturnYardage* might be reconstructed as 

*kickReturnYardage* = *kickLength* - *playResult* + *penaltyYards*, (1)

where *penaltyYards* is often missing. It is quite possible, though, that *penaltyYards* is missing exactly when there was no penalty, which is equivalent to *penaltyYards* = 0.

Let's check this hypothesis:

In [None]:
punt_return['penaltyYards'] = punt_return['penaltyYards'].fillna(0)
1 - sum(punt_return['kickReturnYardage'] == punt_return['kickLength'] - punt_return['playResult'] + punt_return['penaltyYards']) / punt_return.shape[0]

Unfortunately the check fails: in 1.6% of the rows formula (1) does not hold, so it is not reliable enough for reconstructing missing values.

As a consequence we will also remove the rows with missing *kickReturnYardage*.

In [None]:
punt_return = punt_return[~punt_return.kickReturnYardage.isna()]
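The consistency check of formula (1) can be illustrated on a small hand-made table. This is only a sketch: the three rows in `toy` are invented for illustration and are not taken from the NFL data.

```python
import pandas as pd

# Hypothetical toy rows (not real NFL data) to illustrate the check of formula (1)
toy = pd.DataFrame({
    'kickLength':        [50.0, 40.0, 45.0],
    'playResult':        [40.0, 35.0, 30.0],
    'penaltyYards':      [None, 10.0, None],
    'kickReturnYardage': [10.0, 15.0, 12.0],
})

# Treat a missing penalty as "no penalty", as hypothesised in the text
toy['penaltyYards'] = toy['penaltyYards'].fillna(0)

# Right-hand side of formula (1)
reconstructed = toy['kickLength'] - toy['playResult'] + toy['penaltyYards']

# Share of rows where formula (1) does not reproduce the recorded value
mismatch_share = (reconstructed != toy['kickReturnYardage']).mean()
print(mismatch_share)
```

In this toy table the third row does not satisfy formula (1), so one third of the rows are flagged. Note that a row with a missing *kickReturnYardage* would also count as a mismatch, because a comparison with a missing value is never equal.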

In [None]:
punt_return[punt_return['returnerId'].str.contains(';')]

Finally, there are three punt returns (above) with two people returning the ball (a backward pass must have happened). We exclude these as speed and acceleration have a lesser/different meaning in such a play.

In [None]:
punt_return = punt_return[~punt_return['returnerId'].str.contains(';')]

And we make sure *returnerId* is a float as *nflId* in our **tracking** data set will also be a float.

In [None]:
punt_return['returnerId'] = punt_return['returnerId'].astype(float)

## Tracking Data set

In [None]:
playIds = punt_return['playId'].unique()
gameIds = punt_return['gameId'].unique()
tracking2018 = tracking2018[(tracking2018.playId.isin(playIds)) & (tracking2018.gameId.isin(gameIds))]
tracking2019 = tracking2019[(tracking2019.playId.isin(playIds)) & (tracking2019.gameId.isin(gameIds))]
tracking2020 = tracking2020[(tracking2020.playId.isin(playIds)) & (tracking2020.gameId.isin(gameIds))]
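Because *playId* is not unique across games, filtering *playId* and *gameId* independently keeps a superset of the punt returns: a row survives whenever its *playId* and its *gameId* each occur in some punt return, even if that exact pair does not. This is harmless here, since later lookups always match on both columns, but a stricter filter could be sketched as follows. The helper `filter_to_plays` and the toy frames are hypothetical.

```python
import pandas as pd

def filter_to_plays(tracking_df, plays_df):
    """Keep only tracking rows whose (gameId, playId) pair occurs in plays_df."""
    keys = plays_df[['gameId', 'playId']].drop_duplicates()
    return tracking_df.merge(keys, on=['gameId', 'playId'], how='inner')

# Hypothetical toy data: play 10 is a punt return in game 1, play 20 in game 2
toy_plays = pd.DataFrame({'gameId': [1, 2], 'playId': [10, 20]})
toy_tracking = pd.DataFrame({
    'gameId':  [1, 1, 2, 2],
    'playId':  [10, 20, 10, 20],
    'frameId': [1, 1, 1, 1],
})

# Independent isin filters keep all four rows ...
loose = toy_tracking[toy_tracking.playId.isin(toy_plays.playId)
                     & toy_tracking.gameId.isin(toy_plays.gameId)]
# ... while the pair-wise filter keeps only the two matching plays
strict = filter_to_plays(toy_tracking, toy_plays)
print(len(loose), len(strict))
```

One design note: `merge` returns a fresh integer index, so any code that relies on the original row labels would need to be adapted.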

Next, we will examine the **tracking** data and the columns needed for our research, but first we combine the three seasons into one data set.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
tracking = pd.concat([tracking2018, tracking2019, tracking2020])
tracking = tracking[['s', 'a', 'nflId', 'gameId','playId', 'frameId', 'event']]
((tracking.isna().sum())/tracking.shape[0]).sort_values()

Here we see that in 4.3% of the rows the *nflId* is missing, which is very concerning.

In the next chapter we will add the average and top speed and acceleration of the punt returner. As all of these stats would be affected by missing data in the given play, we need to seriously consider either deleting all punt returns with missing data or finding a way to reconstruct it.

It is important to realize at this point that all players on the field are tracked for a given play, but we actually only care about the punt returner.

This means we should check how many frames of the punt returner are missing per play (the **tracking** data set is a collection of frames).

In [None]:
for index, row in punt_return.iterrows():
    playId = int(row['playId'])
    gameId = int(row['gameId'])
    returnerId = int(row['returnerId'])
    # The tracking data of the given play
    df_check = tracking[(tracking['playId'] == playId) & (tracking['gameId'] == gameId)]
    # The tracking data of the punt returner in the given play
    df_check_frames = df_check[df_check['nflId'] == returnerId]
    if df_check_frames.empty:
        # The returner is not found in the tracking data for the given play at all
        punt_return.loc[index, 'missingFrames'] = 1
    else:
        # Number of frames expected between the first and last recorded frame (inclusive)
        expected = df_check_frames['frameId'].max() - df_check_frames['frameId'].min() + 1
        # Share of the expected frames in which the returner is missing (0 if none are missing)
        punt_return.loc[index, 'missingFrames'] = (expected - df_check_frames.shape[0]) / expected

In [None]:
n_complete = sum(punt_return.missingFrames == 0)
print(n_complete, 'fully captured punt returns, or in percent:', 100 * n_complete / punt_return.shape[0], '%')

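This is very good news, as there seem to be no frames missing between the first and last recorded frame of a returner. Frames missing at the very beginning or end of a play would not be detected by this check and are still possible.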
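The same gap check can be sketched without an explicit loop by grouping the frames per game, play and player. The helper name `missing_frame_share` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def missing_frame_share(frames):
    """Share of frames missing between a player's first and last recorded frame.

    `frames` holds one row per recorded frame with the columns gameId, playId,
    nflId and frameId. Returns a Series indexed by (gameId, playId, nflId).
    """
    grouped = frames.groupby(['gameId', 'playId', 'nflId'])['frameId']
    expected = grouped.max() - grouped.min() + 1   # frames that should exist
    recorded = grouped.nunique()                   # frames that do exist
    return (expected - recorded) / expected

# Hypothetical toy data: player 7 has frames 1-4, player 8 lacks frame 3
toy = pd.DataFrame({
    'gameId':  [1] * 7,
    'playId':  [10] * 7,
    'nflId':   [7, 7, 7, 7, 8, 8, 8],
    'frameId': [1, 2, 3, 4, 1, 2, 4],
})
share = missing_frame_share(toy)
print(share)
```

By default `groupby` drops rows whose key is missing, so rows without an *nflId* are ignored by this sketch.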

Next, we shrink the huge **tracking** data set one more time by keeping only the rows that contain the *nflId* of a punt returner.

In [None]:
tracking = tracking[tracking['nflId'].isin(list(punt_return['returnerId'].unique()))]

# Adding speed and acceleration to the data set

First, we drop the *missingFrames*, *playResult* and *penaltyYards* columns.

In [None]:
df = punt_return.drop(['missingFrames','playResult','penaltyYards'], axis = 1)

Next, we calculate the average and top speed and the top acceleration.

It is important here that we only consider the frames between the `punt_received` event and the end of the return (`tackle`, `out_of_bounds` or `touchdown`).

The following code is hard to read. Essentially, it excludes all plays without a clear beginning and ending of the punt return and computes the statistics for the rest.

In [None]:
for index, row in df.iterrows():
    playId = int(row['playId'])
    gameId = int(row['gameId'])
    nflId = int(row['returnerId'])
    punt_returner_frames = tracking[(tracking['playId'] == playId) & (tracking['gameId'] == gameId) & (tracking['nflId'] == nflId)]
    list_of_events = punt_returner_frames.event.unique()
    if 'punt_received' not in list_of_events:
        ## If there is no clear beginning to the punt return, we skip
        continue
    elif not ('tackle' in list_of_events or 'out_of_bounds' in list_of_events or 'touchdown' in list_of_events):
        ## If there is no clear ending to the punt return, we skip
        continue
    else:
        ## First we extract the index when the punt was received
        index_punt_received = punt_returner_frames.event[(punt_returner_frames.event == 'punt_received')].index.tolist()[0]
        
        ## Second we extract the index when the punt return ended (tackled, out of bounds, touchdown)
        conditions = (punt_returner_frames.event == 'tackle') | (punt_returner_frames.event == 'out_of_bounds') | (punt_returner_frames.event == 'touchdown')
        index_tackle = punt_returner_frames.event[conditions].index.tolist()[0]
        
        punt_returner_frames = punt_returner_frames.loc[index_punt_received:index_tackle + 1, :]
        df.loc[index, 'topAcceleration'] = punt_returner_frames['a'].max()
        df.loc[index, 'topSpeed'] = punt_returner_frames['s'].max()
        df.loc[index, 'meanSpeed'] = punt_returner_frames['s'].mean()
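The window extraction in the loop above can be sketched as a small helper that works on *frameId* instead of row labels. The function name `return_window_stats` and the toy data are hypothetical. Unlike the loop above, this sketch ends the window at the end-event frame itself and does not include the row after it.

```python
import pandas as pd

END_EVENTS = ['tackle', 'out_of_bounds', 'touchdown']

def return_window_stats(frames):
    """Top/mean speed and top acceleration between punt_received and the end event.

    `frames` holds the returner's rows for one play (columns frameId, event, s, a).
    Returns None when the return has no clear beginning or ending.
    """
    frames = frames.sort_values('frameId')
    start = frames.loc[frames['event'] == 'punt_received', 'frameId']
    end = frames.loc[frames['event'].isin(END_EVENTS), 'frameId']
    if start.empty or end.empty:
        return None
    # Keep the frames from the catch up to and including the first end event
    window = frames[frames['frameId'].between(start.iloc[0], end.iloc[0])]
    return {
        'topAcceleration': window['a'].max(),
        'topSpeed': window['s'].max(),
        'meanSpeed': window['s'].mean(),
    }

# Hypothetical toy play: catch at frame 2, tackle at frame 5
toy = pd.DataFrame({
    'frameId': [1, 2, 3, 4, 5, 6],
    'event':   ['None', 'punt_received', 'None', 'None', 'tackle', 'None'],
    's':       [1.0, 2.0, 6.0, 8.0, 3.0, 9.0],
    'a':       [0.5, 1.0, 4.0, 2.0, 5.0, 7.0],
})
stats = return_window_stats(toy)
print(stats)
```

Frames 1 and 6 lie outside the return window, so the high speed in frame 6 does not affect the result.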

In [None]:
df.info()

As one can see, only a few rows were left without values because they did not have a clear beginning and ending of the punt return.

Let's exclude these.

In [None]:
df = df[~df['topSpeed'].isna()]

# Finding answers

Finally we are ready to answer our initial question:

What is more important for a punt returner, speed or acceleration?

We will discuss this by looking at the respective correlations with:
* **kickReturnYardage**
* **kickLength**

In [None]:
corr = df[['kickLength', 'kickReturnYardage',
        'topAcceleration', 'topSpeed', 'meanSpeed']].corr()
corr_2 = corr.iloc[[1,0], 1:6]
f, ax = pl.subplots(figsize=(12, 9))
sns.set(font_scale = 1.25)
sns.heatmap(corr_2, cbar = True, annot = True, fmt = '.2f', vmax=1, annot_kws={'size': 10}, square=True);

At first sight, it seems that speed is far more important for a punt returner than acceleration, based on the correlation with *kickReturnYardage*.

On the other hand, we must also consider that *kickReturnYardage* correlates with *kickLength*, which skews the heatmap.

This makes sense, as a longer kick would give the punt returner more time to accelerate and gain speed.

To address this problem, we will group the kick lengths into kick length ranges.

In [None]:
df.kickLength.describe()

Based on the 25th and 75th percentiles of *kickLength*, we group it into the following intervals:
* **Short**: 0 - 45 yards
* **Medium**: 45 - 55 yards
* **Long**: 55+ yards

In [None]:
df_short = df[df['kickLength'] <= 45]
df_medium = df[(df['kickLength'] > 45) & (df['kickLength'] < 55)]
df_long = df[df['kickLength'] >= 55]
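As an alternative to three separate data frames, the same ranges could be stored as one label column. The sketch below uses `np.select` so that the boundaries match the filters above exactly (45 counts as short, 55 as long). The column name `kickRange` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical kick lengths in yards, including both boundary values
toy = pd.DataFrame({'kickLength': [38, 45, 46, 54, 55, 62]})

# Same boundaries as the three filters above; everything else is "medium"
conditions = [toy['kickLength'] <= 45, toy['kickLength'] >= 55]
toy['kickRange'] = np.select(conditions, ['short', 'long'], default='medium')
print(toy)
```

A single label column keeps all rows in one data frame, which makes group-wise operations such as `groupby('kickRange')` straightforward.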

Now we will check whether the correlations differ based on these newly formed ranges.

In [None]:
corr_short = df_short[['kickLength', 'kickReturnYardage',
        'topAcceleration', 'topSpeed', 'meanSpeed']].corr()
corr_short = corr_short.iloc[[1,0], 1:6]
corr_medium = df_medium[['kickLength', 'kickReturnYardage',
        'topAcceleration', 'topSpeed', 'meanSpeed']].corr()
corr_medium = corr_medium.iloc[[1,0], 1:6]
corr_long = df_long[['kickLength', 'kickReturnYardage',
        'topAcceleration', 'topSpeed', 'meanSpeed']].corr()
corr_long = corr_long.iloc[[1,0], 1:6]
fig = pl.figure(figsize=(20, 9))
sns.set(font_scale = 1.25)
ax1 = fig.add_subplot(3,3,1)
ax1.set_title('Short kicks')
ax2 = fig.add_subplot(3,3,2)
ax2.set_title('Medium kicks')
ax3 = fig.add_subplot(3,3,3)
ax3.set_title('Long kicks')
sns.heatmap(corr_short, cbar = False, ax = ax1, annot = True, fmt = '.2f', vmax=1, annot_kws={'size': 10});
sns.heatmap(corr_medium, cbar = False, ax = ax2, annot = True, fmt = '.2f', vmax=1, annot_kws={'size': 10}, yticklabels = False);
sns.heatmap(corr_long, cbar = False, ax = ax3, annot = True, fmt = '.2f', vmax=1, annot_kws={'size': 10}, yticklabels = False);
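The per-range correlations can also be computed in one pass when the range is available as a label column. This is a sketch with invented numbers: the column `kickRange` and the toy values are hypothetical and say nothing about the real results.

```python
import pandas as pd

# Hypothetical toy data with a range label per punt return
toy = pd.DataFrame({
    'kickRange':         ['short'] * 4 + ['long'] * 4,
    'kickReturnYardage': [2, 5, 9, 14, 3, 8, 12, 20],
    'topSpeed':          [4.0, 5.5, 6.5, 8.0, 8.0, 6.0, 7.0, 5.0],
})

# Pearson correlation of return yardage and top speed, computed per range
per_range = {
    name: group['kickReturnYardage'].corr(group['topSpeed'])
    for name, group in toy.groupby('kickRange')
}
print(per_range)
```

Each value in `per_range` corresponds to one cell of the per-range heatmaps, computed without repeating the selection code for every range.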

This now paints a clearer picture, as we were able to reduce the correlation between *kickLength* and *kickReturnYardage* within each range.

It still holds true that speed correlates more strongly with *kickReturnYardage* than acceleration in all kick length ranges, especially for long kicks.

Another interesting observation is the correlation of *kickLength* with speed and acceleration for the long kicks. This makes sense, since the defence is less on top of the punt returner the further the kick goes, giving him the chance to accelerate and gain speed.

# Difficulties with this approach

### Causation

One big problem is to figure out whether the *kickReturnYardage* is high because, for example, the *topSpeed* is high, or whether the *topSpeed* is high because the *kickReturnYardage* is high. So the question of causation is not answered at this point.

If we look at the scatterplot, we see that clearly the touchdowns drive the *topSpeed* up a lot.

In [None]:
sns.scatterplot(data=df, y="kickReturnYardage", x="topSpeed")

Similarly for the *meanSpeed*.

In [None]:
sns.scatterplot(data=df, y="kickReturnYardage", x="meanSpeed")

This suggests that the causation goes both ways: a long return gives the returner room to reach a high speed, while a higher speed also helps to achieve a longer punt return.

### Missing variable hangTime

To make this research more sound, we should have also considered hangTime of the punt, as kicks of similar lengths, but differing hang times are very much different scenarios.

This variable could be derived from the number of frames between the events 'punt_kicked' and 'punt_received'.
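A minimal sketch of this idea is shown below. It makes two assumptions that should be checked against the data: the tracking data is sampled at 10 frames per second, and the event labels are the ones named in the text. The helper `hang_time_seconds` and the toy play are hypothetical.

```python
import pandas as pd

FRAMES_PER_SECOND = 10  # assumption: tracking data sampled at 10 frames per second

def hang_time_seconds(frames, kick_event='punt_kicked', catch_event='punt_received'):
    """Seconds between the kick and the catch for one play, or None if an event is missing.

    `frames` needs the columns frameId and event. The event names follow the text
    and are assumptions; adjust them to the labels actually used in the data.
    """
    kick = frames.loc[frames['event'] == kick_event, 'frameId']
    catch = frames.loc[frames['event'] == catch_event, 'frameId']
    if kick.empty or catch.empty:
        return None
    return (catch.min() - kick.min()) / FRAMES_PER_SECOND

# Hypothetical toy play with 60 frames: kick at frame 12, catch at frame 57
events = ['None'] * 60
events[11] = 'punt_kicked'
events[56] = 'punt_received'
toy = pd.DataFrame({'frameId': range(1, 61), 'event': events})
hang_time = hang_time_seconds(toy)
print(hang_time)
```

Under these assumptions the toy play yields 45 frames between kick and catch, which corresponds to 4.5 seconds.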

### Comparing punt returners

To be able to make a clearer statement about what makes a great punt returner, we have to compare the punt returners and not the punt returns.

The problem with that approach would have been though that there are not many punt returners that have a significant amount of punt returns, increasing the variance of our research.
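Such a per-returner comparison could be sketched by aggregating the punt returns per *returnerId* and keeping only returners with enough returns. The threshold `MIN_RETURNS` and the toy values are hypothetical.

```python
import pandas as pd

MIN_RETURNS = 3  # hypothetical threshold for a "significant" number of returns

# Hypothetical toy data: one row per punt return
toy = pd.DataFrame({
    'returnerId':        [1, 1, 1, 2, 2, 2, 2, 3],
    'kickReturnYardage': [5, 12, 7, 20, 3, 9, 8, 40],
    'topSpeed':          [6.0, 8.0, 7.0, 9.0, 5.0, 7.0, 7.0, 10.0],
})

# One row per returner: number of returns and average performance
per_returner = toy.groupby('returnerId').agg(
    returns=('kickReturnYardage', 'size'),
    meanReturnYardage=('kickReturnYardage', 'mean'),
    meanTopSpeed=('topSpeed', 'mean'),
)

# Drop returners with too few returns to say anything meaningful
per_returner = per_returner[per_returner['returns'] >= MIN_RETURNS]
print(per_returner)
```

Returner 3 has a single, very long return and is dropped, which illustrates the trade-off described above: a higher threshold reduces noise but leaves fewer returners to compare.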

# Conclusion

Based on our research, we conclude that the top and average speed achieved by a punt returner correlate more strongly with the kick return yardage than the top acceleration does, especially for punts beyond 55 yards.

This does not mean that speed is more important in a punt returner than acceleration. It merely means that when the kick return yardage is high, then, on average, the top and mean speed were high as well.

For acceleration, a smaller correlation holds for short and medium kicks, but not for long kicks.

# Interesting open questions

1. What makes a great punt returner? 
2. What is more important in a good kick? kickLength or hangTime?