# A New Metric for Kick Returns - Yardage Pressure Ratio (YPR)

We have been giddy with excitement about the possibility of contributing to the NFL experience since the start of this challenge. As fans of the League who watch our home teams regularly (we are Dolphins and Seahawks fans), we were thrilled with the abundance of spatial data from *Next Gen Stats* that would allow us to focus on ways to measure special teams plays. We decided to narrow our focus to kick returns for simplicity, but we believe that our approach can be adapted to other special teams plays. 

There were countless directions we considered taking, but ultimately we asked ourselves how we could distinguish two kick return plays where the returners both ran the same distance. We wanted to find a way to compare the difficulty of kick returns that took into account not only the kick returner's ability, but also the performance of the defense.  

In this notebook, we focus on kick return plays and introduce a novel metric to assess the difficulty of a kick return.

We call the metric the **Yardage Pressure Ratio** or the **YPR**. The **YPR** quantifies how far the kick returner was able to run in relation to how close the other players were to him--in other words, how much *pressure* the defense placed on average during the play. The metric reflects the kick returner's ability to cover more yards while overcoming tighter defense.

The **YPR** is calculated by dividing the total distance the kick returner ran during the return (in yards) by the sum of the distances from each opponent to the kick returner (in yards), then dividing that ratio by the duration of the play in seconds and multiplying the result by 100 for easier interpretation. 
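
As a minimal sketch of that arithmetic, the snippet below computes the ratio for two made-up returns that cover the same distance in the same time. The helper name `yardage_pressure_ratio` and all of the input numbers are assumptions for illustration only; they are not taken from the tracking data.

```python
def yardage_pressure_ratio(returner_yards, opponent_distance_sum, duration_sec):
    """Hypothetical helper mirroring the YPR definition in the text.

    returner_yards: total yards the returner ran on the play
    opponent_distance_sum: summed opponent-to-returner distances, in yards
    duration_sec: duration of the play in seconds
    """
    if opponent_distance_sum <= 0 or duration_sec <= 0:
        raise ValueError("distance sum and duration must be positive")
    return 100 * (returner_yards / opponent_distance_sum) / duration_sec


# Two made-up returns covering the same 40 yards in 8 seconds.
loose_coverage = yardage_pressure_ratio(40, 1000, 8)  # defenders farther away
tight_coverage = yardage_pressure_ratio(40, 500, 8)   # defenders closer in
print(loose_coverage, tight_coverage)
```

With equal yardage and duration, the return against the smaller opponent distance sum receives the larger value, which is the behavior the metric is meant to capture.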

The remainder of the notebook briefly walks through the methodology and compares some of the best kick returners in the NFL over the 2018-2020 seasons.

## Code Setup

The following section will load in the datasets, perform data cleaning and wrangling and do the necessary calculations to get a dataframe with the **YPR**s.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import os
import math
import matplotlib.pyplot as plt 
import seaborn as sns

# import data
# players
players_path = "../input/nfl-big-data-bowl-2022/players.csv"
players_raw = pd.read_csv(players_path)

# tracking 2018
nfl_2018_path = "../input/nfl-big-data-bowl-2022/tracking2018.csv"
nfl_2018 = pd.read_csv(nfl_2018_path)

# tracking 2019
nfl_2019_path = "../input/nfl-big-data-bowl-2022/tracking2019.csv"
nfl_2019 = pd.read_csv(nfl_2019_path)

# tracking 2020
nfl_2020_path = "../input/nfl-big-data-bowl-2022/tracking2020.csv"
nfl_2020 = pd.read_csv(nfl_2020_path)

# get play data -- only look at kick returns
plays_path = "../input/nfl-big-data-bowl-2022/plays.csv"
plays_raw = pd.read_csv(plays_path)


# Get a subset of the *plays* data where the special teams play is **Kickoff** and the result 
# is a **Return**. We also don't want plays where the kick was blocked, so we 
# need **kickBlockerId** to be `NaN`.

kickoffReturns = plays_raw.loc[
    (plays_raw["specialTeamsPlayType"] == "Kickoff") &  #### Can change to include punts ####
    (plays_raw["specialTeamsResult"] == "Return") & 
    (plays_raw["kickBlockerId"].isna())
]

# grab only the columns we need from each.
kickoffReturnCols = [
    "gameId", "playId", "returnerId"
]

kickoffReturns = kickoffReturns[kickoffReturnCols]

nflTrackingCols = [
    "gameId", "playId", "x", "y", 
    "nflId", "time", "team", "frameId", 
    "dis"
]

nfl_2018 = nfl_2018[nflTrackingCols]
nfl_2019 = nfl_2019[nflTrackingCols]
nfl_2020 = nfl_2020[nflTrackingCols]


# Must join all years by both gameId and playId 
def mergeWithTracking(nfl_df, kickoffReturns = kickoffReturns):
    """Merges nfl tracking dataframe with plays dataframe 
    subset for kickoff returns and no blocked kicks. 
    
    Returns merged dataframes
    """
    merged_df = pd.merge(
        kickoffReturns, 
        nfl_df, 
        on = ["gameId", "playId"], 
        how = "inner"
    )
    return merged_df

nfl_2018_tracking = mergeWithTracking(nfl_2018)
nfl_2019_tracking = mergeWithTracking(nfl_2019)
nfl_2020_tracking = mergeWithTracking(nfl_2020)

kickReturnTracking = pd.concat(
    [nfl_2018_tracking, nfl_2019_tracking, nfl_2020_tracking]
).reset_index(
    drop = True
).drop_duplicates()


# data cleanup
# kickReturnTracking.dtypes

# parse times
kickReturnTracking["time"] = kickReturnTracking["time"].apply(
    lambda t: datetime.strptime(t, "%Y-%m-%dT%H:%M:%S.%f")
)

# Drop rows where `returnerId` contains a semicolon, i.e. multiple returners.
# These are < 1% of the data. Rows where `returnerId` is `NaN` are dropped below.
has_multiple_returners = kickReturnTracking["returnerId"].str.contains(";", na = False)

# Filter with a row mask; DataFrame.isin() would only blank the matching cells to NaN
kickReturnTracking = kickReturnTracking[~has_multiple_returners]

assert kickReturnTracking["returnerId"].str.contains(";", na = False).sum() == 0

# make `returnerId` numeric
kickReturnTracking["returnerId"] = pd.to_numeric(kickReturnTracking["returnerId"])

kickReturnTracking = kickReturnTracking[
    ~np.isnan(kickReturnTracking["returnerId"])
]

assert np.isnan(kickReturnTracking["returnerId"]).sum() == 0

# Creating boolean for kick returner
kickReturnTracking["isKickReturner"] = kickReturnTracking["returnerId"].eq(kickReturnTracking["nflId"])

# Start creating dictionary to hold key value pairs of game and plays for each game.
unique_game_ids = kickReturnTracking["gameId"].unique()
# drop nan cases
unique_game_ids = unique_game_ids[~np.isnan(unique_game_ids)]

# Make empty dictionary that will hold the play IDs for each Game ID.
game_plays_dict = {}

# Loop through the game IDs
for game in unique_game_ids:
    # start with an empty array to hold the play IDs
    play_arr = np.empty(0)
    
    # add the play IDs from each Game ID
    play_arr = np.append(
        play_arr,
        kickReturnTracking.loc[
            kickReturnTracking["gameId"] == game
        ]["playId"]
    )
    play_arr = np.unique(play_arr)
    
    # Assign the key value pair to the dictionary
    game_plays_dict[game] = play_arr
    
    
### Helper Functions ###
# function to get the desired game and play from dataframe
def getPlay(df, game, play):
    """
    Returns single game and single play subset of dataframe. Meant to be
    used with the kickReturnTracking dataframe.
    
    Args: 
        df: should be kickReturnTracking dataframe 
        game: gameId desired
        play: playId desired. Will be subset of game.
    """
    game_df = df.loc[df["gameId"] == game]
    play_df = game_df.loc[game_df["playId"] == play]
    return play_df

# Write distance formula
def distanceFormula(x1, x2, y1, y2):
    """Distance formula 
    Args:
        x1: x coordinate of player 1 
        x2: x coordinate of player 2
        y1: y coordinate of player 1 
        y2: y coordinate of player 2
    """
    d = math.sqrt(
        ((x2 - x1) ** 2) + ((y2 - y1) ** 2)
    )
    return d

# Function to get distance covered by kickReturner for each play
def getKickReturnerDistanceForPlay(play_df):
    """
    Returns dataframe of kickReturner distance covered for each play
    Args:
        play_df: should be returned from getPlay() function using
        kickReturner dataframe.
    """
    
    # get the row of the kick returner
    kick_returner_row = play_df.loc[
        play_df["isKickReturner"] == True
    ]
    
    kickReturnerDistance = kick_returner_row["dis"].sum()
    
    # makes dataframe 
    kr_df = pd.DataFrame({
        "gameId" : play_df["gameId"], 
        "playId" : play_df["playId"],
        "nflId" : play_df["nflId"],
        "time" : play_df["time"], 
        "kickReturnerDistance" : kickReturnerDistance
    })
    
    return kr_df
        
    
# Function to get the opponent's distances for each play
def getOpponentsDistanceForPlay(play_df):
    """
    Calculates the summed distance from the kick returner to the
    opposing team players.
    
    Args:
        play_df: should be from getPlay function
    """
    # Get all the frames per play
    frames = play_df["frameId"]
    
    # For loop of each frame of the play_df subset. 
    # make an array to hold the distances of each frame 
    distance = np.empty(0)

    for frame in frames:
        play_i_df = play_df.loc[
            play_df["frameId"] == frame
        ]
    
        # get the row of the kick returner
        kick_returner_row = play_i_df.loc[
            play_i_df["isKickReturner"] == True
        ]
        
        # Some plays do not have exactly one kick returner row in a frame.
        # We bail out on those for now and revisit later. Note that this
        # returns the scalar sum collected so far rather than a dataframe.
        if len(kick_returner_row) != 1:
            distance = np.append(distance, 0)
            return np.sum(distance) 
       
       
        # capture kick returner team and position as scalar values
        kick_returner_team = kick_returner_row["team"].iloc[0]

        kick_returner_x = kick_returner_row["x"].iloc[0]
        kick_returner_y = kick_returner_row["y"].iloc[0]

        # filter for rows of opposing team players; comparing against the
        # scalar team label excludes the returner's own teammates
        opposing_team_players = play_i_df.loc[
            (play_i_df["team"] != "football") & 
            (play_i_df["team"] != kick_returner_team)
        ]
        
        # Iterate through the opposing players to get each of their 
        # x and y positions and calculate distance from kick returner
        for opposing_player_i in range(len(opposing_team_players)):
            opposing_player_x = opposing_team_players.iloc[opposing_player_i]["x"]
            opposing_player_y = opposing_team_players.iloc[opposing_player_i]["y"]
            
            # Will use distance formula; euclidean distance 
            distance_diff = distanceFormula(
                kick_returner_x, 
                opposing_player_x, 
                kick_returner_y, 
                opposing_player_y
            )
            distance = np.append(distance, distance_diff)
            
        all_opposing_player_dist = np.sum(distance)
        
        # Make a dataframe to return 
        distance_df = pd.DataFrame({
            "gameId" : play_df["gameId"],
            "playId" : play_df["playId"],
            "nflId" : play_df["nflId"], 
            "frameId" : play_df["frameId"],
            "time" : play_df["time"],
            "allOpposingPlayerDist" : all_opposing_player_dist 
        })
        
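        # Note: this return sits inside the frame loop, so the function exits
        # after the first frame it visits and the sum reflects that frame only.
        return distance_df    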
    

# Function to get time for play
def getTimeForPlay(play_df):
    """
    Adds a play-duration column (in seconds) to the dataframe built from 
    getKickReturnerDistanceForPlay() and getOpponentsDistanceForPlay().
    This produces the final output dataframe needed for the remainder of 
    the analysis.
    """
    kr_distance_df = getKickReturnerDistanceForPlay(play_df)
    opp_distance_df = getOpponentsDistanceForPlay(play_df)

    # append kr distance Column to opponent df
    opp_distance_df["kickReturnerDistance"] = kr_distance_df["kickReturnerDistance"]
    
    # remove cases where kr distance is 0
    opp_distance_df = opp_distance_df[
        opp_distance_df["kickReturnerDistance"] > 0
    ]

    # Get play duration
    play_duration = opp_distance_df["time"].iloc[-1] - opp_distance_df["time"].iloc[0] 

    play_duration_seconds = play_duration.value / 1e9
    opp_distance_df["playDurationSec"] = play_duration_seconds
    
    return opp_distance_df

# Iterate through all the plays and build the final nfl_kr_opp_df dataframe.
# One-row frames are collected in a list and concatenated once, since
# DataFrame.append was removed in pandas 2.0.
first_rows = []
for g, plays in game_plays_dict.items():
    for p in plays:
        # Get the play from the game
        play_df = getPlay(kickReturnTracking, g, p)
        # get the time, opponent distance, and kr distance df
        kr_dist_df = getTimeForPlay(play_df)
        # keep the first row 
        first_rows.append(kr_dist_df.iloc[[0]])

nfl_kr_opp_df = pd.concat(first_rows, ignore_index = True)

# get YPR metric
nfl_kr_opp_df["ypr"] = (
    100 * \
    ((nfl_kr_opp_df["kickReturnerDistance"] / nfl_kr_opp_df["allOpposingPlayerDist"]) \
     / nfl_kr_opp_df["playDurationSec"])
)

# merge player names to nfl_kr_opp_df 
nfl_players_ypr_df = pd.merge(
    nfl_kr_opp_df, 
    players_raw, 
    on = "nflId", 
    how = "left"
)

## Methodology 

To calculate the **YPR**, we looked only at kick return special teams plays. For each play we calculated the total distance the kick returner covered, the sum of the distances from the opponents to the kick returner, and the duration of the play. The *Next Gen Stats* data provide the `x` and `y` coordinates of each player on the field, which we used to calculate the distances with the Euclidean distance formula. The resulting ratio is multiplied by 100 for easier interpretation. We compiled all the data in a dataframe that holds the **YPR** for each available play.
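
The distance step can also be written without an explicit loop over players. The sketch below is one possible vectorized version for a single frame, using `np.hypot` in place of the notebook's `distanceFormula` helper. The toy frame, its coordinates, and the variable names `frame`, `returner`, `opponents`, and `frame_pressure_sum` are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up single frame: a returner and a teammate on "home",
# two opponents on "away", and the ball.
frame = pd.DataFrame({
    "team": ["home", "home", "away", "away", "football"],
    "x": [10.0, 12.0, 13.0, 10.0, 10.0],
    "y": [20.0, 25.0, 24.0, 32.0, 20.0],
    "isKickReturner": [True, False, False, False, False],
})

# Single row for the kick returner
returner = frame.loc[frame["isKickReturner"]].iloc[0]

# Opponents: everyone who is neither the ball nor on the returner's team
opponents = frame.loc[
    (frame["team"] != "football") & (frame["team"] != returner["team"])
]

# Euclidean distance from the returner to every opponent at once
opponent_distances = np.hypot(
    opponents["x"] - returner["x"],
    opponents["y"] - returner["y"],
)
frame_pressure_sum = opponent_distances.sum()
print(frame_pressure_sum)
```

In this toy frame the two opponents sit 5 and 12 yards from the returner, so the frame contributes 17 yards to the denominator.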

## Analysis

Below is a table of the top 10 **YPR** values. There are some impressive runs above 1.0! 

In [None]:
nfl_players_ypr_df.sort_values("ypr", ascending = False).head(10)

We also wanted to consider players who performed well *consistently*. So we filtered for players who recorded at least five kick returns. Below are the top **YPR**s from that subset.

In [None]:
nflIdValueCounts = nfl_players_ypr_df["nflId"].value_counts()

nfl_players_ypr_df_gt5 = nfl_players_ypr_df[
    nfl_players_ypr_df["nflId"].isin(nflIdValueCounts.index[nflIdValueCounts.gt(4)])
]

nfl_players_ypr_df_gt5.sort_values("ypr", ascending = False).head(10)

#### Comparison

To further drive the point home, we wanted to show how players with the highest average yards per kick return (also with at least five returns) differ from those with the highest average **YPR**. Below are those respective lists.

In [None]:
avg_run_dist_per_player = nfl_players_ypr_df_gt5.groupby(
    "displayName"
)["kickReturnerDistance"].mean().sort_values(ascending = False)

avg_ypr_per_player = nfl_players_ypr_df_gt5.groupby(
    "displayName"
)["ypr"].mean().sort_values(ascending = False)

- Top kick return distances on average: 

In [None]:
avg_run_dist_per_player

- Average **YPR** per player:

In [None]:
avg_ypr_per_player

Even though Nate Stupar had the highest average kick return distance among players in our subset, the **YPR** uncovers that Derek Carrier's runs had more pressure on average. The **YPR** surfaces Derek Carrier's otherwise hidden accomplishment. In fact, Derek Carrier is **eighth from the bottom in average kick return distance**. Simply focusing on average kick return distance would rank players like Derek Carrier significantly below their true ability. The **YPR** uncovers a player's ability to overcome tighter defense--something coaches might want to consider if they are playing a notoriously defense-heavy team, or a play near the goal line.

In [None]:
avg_run_dist_per_player.loc["Derek Carrier":]
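
To see the two rankings side by side rather than as separate lists, one option is to put both ranks in a single table and look at how far each player moves. The sketch below does this on made-up values for three hypothetical players; the column names `distanceRank`, `yprRank`, and `rankShift` are assumptions introduced here, not part of the notebook's dataframes.

```python
import pandas as pd

# Made-up per-play values for three hypothetical players.
toy = pd.DataFrame({
    "displayName": ["A", "A", "B", "B", "C", "C"],
    "kickReturnerDistance": [50.0, 46.0, 40.0, 44.0, 30.0, 34.0],
    "ypr": [0.20, 0.24, 0.35, 0.31, 0.50, 0.46],
})

# Average both measures per player
per_player = toy.groupby("displayName")[["kickReturnerDistance", "ypr"]].mean()

# Rank 1 is the best (largest) average in each column
per_player["distanceRank"] = (
    per_player["kickReturnerDistance"].rank(ascending=False).astype(int)
)
per_player["yprRank"] = per_player["ypr"].rank(ascending=False).astype(int)

# Positive shift: the player ranks higher by YPR than by raw distance
per_player["rankShift"] = per_player["distanceRank"] - per_player["yprRank"]
print(per_player.sort_values("yprRank"))
```

A player with a large positive `rankShift` is the kind of case the comparison above highlights: modest average distance, but a strong ratio once pressure is accounted for.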

#### Descriptive Statistics

For a more complete picture of the **YPR**, see below for the distribution of **YPR**s and the median and mean values for plays in the subset.

The distribution of **YPR**s is slightly right skewed, but otherwise unimodal with a peak around 0.3.

In [None]:
sns.set(rc = {"figure.figsize": (10, 6)})
sns.set_theme(style = "whitegrid")
hist = sns.histplot(
    nfl_players_ypr_df_gt5["ypr"], 
    bins = 30
)
hist.set_title("Distribution of YPRs")
hist.set_xlabel("YPR");

In [None]:
print(f'The mean YPR value is {np.mean(nfl_players_ypr_df_gt5["ypr"])}\nThe median YPR value is {np.median(nfl_players_ypr_df_gt5["ypr"])}')
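
A right skew can also be checked numerically: the mean sits above the median and the sample skewness is positive. The sketch below shows that check on made-up values, assuming pandas' `Series.skew()` as the skewness estimate; `toy_ypr` and `summary` are illustrative names and the numbers are not results from the tracking data.

```python
import pandas as pd

# Made-up YPR values with one unusually high play in the right tail.
toy_ypr = pd.Series([0.20, 0.25, 0.30, 0.30, 0.35, 0.40, 0.90])

summary = {
    "mean": toy_ypr.mean(),
    "median": toy_ypr.median(),
    "skew": toy_ypr.skew(),  # sample skewness; positive means a longer right tail
}
print(summary)
```

The same three calls could be applied to `nfl_players_ypr_df_gt5["ypr"]` to quantify the skew described above.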

## Conclusion

We hope this brief introduction to the **Yardage Pressure Ratio** highlights a potential new way to assess special teams plays. We envision the **YPR** as a useful metric for coaches and the NFL to assess the performance of not just the kick returner, but also the defense. The **YPR** uncovers otherwise ignored performances by kick returners who are consistently able to run farther against tighter defense. Conversely, it could also be helpful for teams tracking their defense's performance across plays. 

This is just the tip of the iceberg in terms of ways to use the **YPR**. One could also imagine the **YPR** being used to assess punts, field goals, and even general defensive plays. 