# Using EPA allowed per play to measure secondary performance

# Objective: 

EPA stands for expected points added. For an explanation of EPA, see this summary: https://www.advancedfootballanalytics.com/index.php/home/stats/stats-explained/expected-points-and-epa-explained. It's generally viewed as a good way to measure the impact of any particular play in football, because it shows how the play changes the number of points a team is expected to score and, through that, its chances of winning.
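
As a rough illustration of the definition, the EPA of a play is the expected points of the situation after the play minus the expected points of the situation before it. The expected-points values below are made-up placeholders, not output from any real expected-points model.

```python
def expected_points_added(ep_before, ep_after):
    """EPA of a single play: change in the offense's expected points."""
    return ep_after - ep_before


# Hypothetical values: a completion moves the offense from a situation
# worth 0.9 expected points to one worth 2.1 expected points.
completion_epa = expected_points_added(ep_before=0.9, ep_after=2.1)

# Hypothetical values: an incompletion leaves the offense worse off.
incompletion_epa = expected_points_added(ep_before=0.9, ep_after=0.4)

print(round(completion_epa, 2))    # positive: good for the offense
print(round(incompletion_epa, 2))  # negative: good for the defense
```

From the defense's point of view the sign flips in meaning: a positive EPA is value given up, a negative EPA is value taken away.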

It's common to look at EPA per play to evaluate individual quarterback, running back and receiver performance. In particular, Ben Baldwin and Sebastian Carl's site, https://rbsdm.com/stats/stats/, has popularized the notion of evaluating quarterbacks by EPA per play. But it is less common to look at individual EPA per play stats for defensive players.

Secondary player performance metrics are sparse compared to offensive player metrics (simply because there are more counting stats for offensive players). To judge secondary performance we can look at things like Interceptions or Passes Defensed. Neither of these is ideal, as both of them only look at a fraction of coverage snaps over the course of the game. They also don't give enough value to a player who is so good in coverage that a QB decides not to throw his way at all, since those plays never show up in these counting stats.

The other common way to measure secondary performance is to look at a more complex metric such as passer rating against. This is the passer rating that a defensive back or linebacker allows when they are targeted in coverage. It takes all targets into account, but it does not fix the scenario mentioned previously (it ignores the value created when a player is NOT targeted). Passer rating also does not take into account the relative value of each target/reception/incompletion, which EPA does. In addition, passer rating against usually requires people to look at all plays manually to determine targets, so it cannot be calculated automatically.
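
For reference, here is a minimal sketch of passer rating against, assuming the standard NFL passer rating formula applied to the throws charged to one defender. The stat line is invented for illustration.

```python
def passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """Standard NFL passer rating (0 to 158.3) for a set of pass attempts."""
    if attempts == 0:
        raise ValueError("passer rating is undefined with zero attempts")

    def clamp(value):
        # Each of the four components is limited to the range [0, 2.375].
        return max(0.0, min(value, 2.375))

    a = clamp((completions / attempts - 0.3) * 5)
    b = clamp((yards / attempts - 3) * 0.25)
    c = clamp((touchdowns / attempts) * 20)
    d = clamp(2.375 - (interceptions / attempts) * 25)
    return (a + b + c + d) / 6 * 100


# Hypothetical stat line for throws into one defender's coverage.
rating_against = passer_rating(attempts=40, completions=22, yards=260,
                               touchdowns=1, interceptions=2)
print(round(rating_against, 1))
```

Note that the formula only sees attempts, completions, yards, touchdowns and interceptions. A 10-yard completion counts the same on 3rd and 5 as on 3rd and 15, which is the kind of context EPA captures and passer rating does not.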

A more labor-intensive way to measure secondary performance is to use a manual grading system such as PFF grades (https://www.pff.com/grades) or a similar scouting system. This involves an intelligent analyst watching each play on the All-22 Coaches Tape and assigning a grade to each play. This does account for the scenario discussed previously and should properly grade a shutdown corner who is so good that he receives few looks over the course of a game. However, this grading system is not particularly scalable and cannot be calculated automatically.

In this notebook, we'll seek to measure the effectiveness of secondary/coverage play by looking at how much EPA a player allows when in coverage, and we'll try to calculate it automatically using tracking data and play-by-play data. We'll first calculate it for an individual game and look at the relationship between secondary performance and EPA allowed per targeted play for that game, then we'll look at it over the course of a season. If successful, we should be able to use this metric as an automated way to look at secondary performance whenever we have access to tracking data.

In this metric, if the EPA allowed per targeted play is high, that means the player is poor in coverage and if the EPA allowed per targeted play is low, that means the player is good in coverage. 

All code is in this notebook, and is also available on GitHub: https://github.com/adubashi/NFLData. Additionally, the data that we precompute from tracking data is available in this public data set: https://www.kaggle.com/arjundubashi/nfl-modified-tracking-data/. The process for precomputing this data is explained in this notebook, but it takes a while to run, so we have precomputed it to let the notebook run at a reasonable speed.

Note: All data used to compute the metric is from the provided data in the nfl-big-data-bowl-data. PFF grades are used to prove the effectiveness of the metric but are not used in the calculation. 



# Algorithm:

The algorithm we'll use is as follows: 
1. For each passing play (one that results in a completion, interception, incompletion, or secondary penalty), calculate the distance from the football for each player in coverage.
2. We will determine that the defensive player closest to the football when the pass arrives is the player IN COVERAGE on that particular play. This is an assumption we are making. 

**Is this a correct assumption?**
This is a complicated question. If a defense is in man coverage, it's reasonable to assume that the closest defensive player to the targeted receiver when the pass arrives is the player in coverage. In zone coverage there is more potential for error, but we feel it is still a reasonable assumption to make. However, regardless of the coverage scheme, the distance from the football/targeted receiver does not account for mistakes by secondary players that cause cascading events on a play. For example, if a defensive player vacates his coverage assignment in either zone or man, and another defensive player comes over to help and ends up closer to the football when the pass arrives, then the helping player is unfairly charged with the EPA associated with the pass, even though he was not the one who made the mistake. These errors can happen in any metric like this, even manual grades such as PFF. Watching the All-22 of each play to determine every target (as PFF does) would be more accurate, but it could not be calculated automatically from tracking data.

3. Assign the responsible defensive player for each play based on which defensive player is in coverage, intercepts the pass or commits a penalty in the secondary.
4. Assign the EPA given up based on the player in coverage/responsible player. 
5. Aggregate the EPA given up per player to produce a report for a game. 
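
Steps 3 through 5 can be sketched on a toy table with one row per pass play. The column names `closest_defender`, `intercepted_by` and `penalized_defender` are hypothetical and chosen only for illustration, and the EPA values are invented.

```python
import pandas as pd

plays = pd.DataFrame({
    "playId": [1, 2, 3, 4],
    "closest_defender": ["A", "B", "A", "B"],
    "intercepted_by": [None, None, "C", None],
    "penalized_defender": [None, None, None, "A"],
    "epa": [1.5, -0.5, -4.0, 2.0],
})

# Step 3: an interceptor or penalized defender overrides the closest defender.
plays["responsible_player"] = (
    plays["intercepted_by"]
    .fillna(plays["penalized_defender"])
    .fillna(plays["closest_defender"])
)

# Steps 4 and 5: charge each play's EPA to the responsible player and aggregate.
report = (
    plays.groupby("responsible_player")["epa"]
    .agg(epa="sum", epa_play_count="count")
    .reset_index()
)
report["epa_per_targeted_play"] = report["epa"] / report["epa_play_count"]
print(report)
```

In this toy game, defender C is credited with the large negative EPA of the interception even though defender A was closest to the ball on that play.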

The following code calculates the distance from the football and the player in coverage, which we'll use to assign the EPA allowed from the tracking data. This code takes a while to run, so we have precomputed the data and included it in the nfl-modified-tracking-data data set, but we include the code that calculates it for each week below.
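
Much of the runtime comes from looping over rows and running a `query` per row. One possible way to speed this up is to merge the ball position onto every player row and compute all distances at once. This is only a sketch, not the code used to build the precomputed data set: the helper name `closest_defender_per_frame` is hypothetical, and it assumes the tracking frame has the columns `gameId`, `playId`, `frameId`, `displayName`, `position`, `x` and `y`, with the ball stored as a row whose `displayName` is "Football".

```python
import numpy as np
import pandas as pd

DEFENSIVE_POSITIONS = ["LB", "MLB", "OLB", "CB", "FS", "SS"]


def closest_defender_per_frame(tracking):
    """Return the defender nearest the ball for each (gameId, playId, frameId)."""
    keys = ["gameId", "playId", "frameId"]
    ball = (
        tracking.loc[tracking["displayName"] == "Football", keys + ["x", "y"]]
        .rename(columns={"x": "ball_x", "y": "ball_y"})
    )
    # Attach the ball position to every player row of the same frame.
    merged = tracking.merge(ball, on=keys, how="inner")
    merged["dist_to_ball"] = np.hypot(merged["x"] - merged["ball_x"],
                                      merged["y"] - merged["ball_y"])
    defenders = merged[merged["position"].isin(DEFENSIVE_POSITIONS)]
    # idxmin picks the row with the smallest distance within each frame.
    nearest = defenders.loc[defenders.groupby(keys)["dist_to_ball"].idxmin()]
    return nearest[keys + ["displayName", "dist_to_ball"]].reset_index(drop=True)


# Toy frame: the ball and three players at the moment the pass arrives.
toy = pd.DataFrame({
    "gameId": [1] * 4, "playId": [75] * 4, "frameId": [47] * 4,
    "displayName": ["Football", "Receiver", "Corner", "Safety"],
    "position": [None, "WR", "CB", "FS"],
    "x": [30.0, 30.5, 31.0, 40.0],
    "y": [10.0, 10.0, 10.0, 20.0],
})
result = closest_defender_per_frame(toy)
print(result)
```

In practice the input would first be filtered to the `pass_arrived` (or `pass_outcome_interception`) frames. Note that this sketch keeps positions as floats rather than truncating them to integers, so its distances can differ slightly from the loop-based version.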

As an example, the following code runs on the first pass play of the 2018 season, in the opening game between the Falcons and the Eagles.

The following is the play by play. 
1st & 15 at ATL 20
(15:00 - 1st) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins).

The coverage player is determined to be "Malcolm Jenkins". When we watch the All-22 for this particular play, we can verify that this is correct.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import math


offensive_positions = ['WR', 'RB', 'TE']
defensive_positions = ['LB', 'MLB', 'OLB', 'CB', 'FS', 'SS']

event_priority = {'pass_arrived': 1, 'pass_outcome_interception': 2}
football_Name = "Football"


def generate_football_distance_for_tracking_week(filename):
    tracking_df = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/" + filename)
    generate_football_distance(tracking_df)
    generate_coverage_player(tracking_df)
    correct_interceptions(tracking_df)
    tracking_df.to_csv("/kaggle/working/" + filename + "-modified-with-distance-and-coverage" + ".csv")
    return tracking_df
    
def generate_football_distance(tracking):
    for i, row in tracking.iterrows():
        if tracking.at[i, 'event'] == 'pass_outcome_interception':
            xPosition = int(tracking.at[i, "x"])
            yPosition = int(tracking.at[i, "y"])
            play_Id = tracking.at[i, "playId"]
            frame_Id = tracking.at[i, "frameId"]
            game_Id = tracking.at[i, "gameId"]
            tracking.at[i, "total_distance_from_football_pass_outcome_interception"] = generate_distance(tracking, play_Id, frame_Id, game_Id, xPosition, yPosition)
        if tracking.at[i, 'event'] == 'pass_arrived':
            xPosition = int(tracking.at[i, "x"])
            yPosition = int(tracking.at[i, "y"])
            play_Id = tracking.at[i, "playId"]
            frame_Id = tracking.at[i, "frameId"]
            game_Id = tracking.at[i, "gameId"]
            tracking.at[i, "total_distance_from_football_pass_arrived"] = generate_distance(tracking, play_Id, frame_Id, game_Id, xPosition, yPosition)


def generate_distance(tracking, play_Id, frame_Id, game_Id, xPosition, yPosition):
    football_event = tracking.query('playId == @play_Id').query('frameId == @frame_Id').query('displayName == @football_Name').query('@game_Id == gameId')
    if not (football_event.empty):
        xFootballPosition = int(football_event.iloc[0].at["x"])
        yFootballPosition = int(football_event.iloc[0].at["y"])
        return math.sqrt((xFootballPosition - xPosition) ** 2 + ((yFootballPosition - yPosition) ** 2))

def generate_coverage_player(tracking):
    for i, row in tracking.iterrows():
        if tracking.at[i, 'event'] == 'pass_arrived':
           play_Id = tracking.at[i, "playId"]
           frame_Id = tracking.at[i, "frameId"]
           game_Id = tracking.at[i, "gameId"]
           # Filter on gameId as well, because playId values repeat across games.
           players_at_pass_arrival = tracking.query('gameId == @game_Id and playId == @play_Id and frameId == @frame_Id')
           offensive_players_at_pass_arrival =  players_at_pass_arrival.query('position in @offensive_positions')
           defensive_players_at_pass_arrival = players_at_pass_arrival.query('position in @defensive_positions')
           targeted_player_row = offensive_players_at_pass_arrival[offensive_players_at_pass_arrival.total_distance_from_football_pass_arrived ==
                                             offensive_players_at_pass_arrival.total_distance_from_football_pass_arrived.min()]
           coverage_player_row = defensive_players_at_pass_arrival[
               defensive_players_at_pass_arrival.total_distance_from_football_pass_arrived ==
               defensive_players_at_pass_arrival.total_distance_from_football_pass_arrived.min()]
           if not (targeted_player_row.empty):
              tracking.at[i, "targeted_player_name"] = targeted_player_row.iloc[0].at["displayName"]
           if not (coverage_player_row.empty):
               tracking.at[i, "player_in_coverage"] = coverage_player_row.iloc[0].at["displayName"]

def correct_interceptions(tracking):
    for i, row in tracking.iterrows():
        if tracking.at[i, 'event'] == 'pass_outcome_interception':
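           play_Id = tracking.at[i, "playId"]
           frame_Id = tracking.at[i, "frameId"]
           game_Id = tracking.at[i, "gameId"]
           # Filter on gameId as well, because playId values repeat across games.
           players_at_pass_arrival = tracking.query('gameId == @game_Id and playId == @play_Id and frameId == @frame_Id')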
           offensive_players_at_pass_arrival =  players_at_pass_arrival.query('position in @offensive_positions')
           defensive_players_at_pass_arrival = players_at_pass_arrival.query('position in @defensive_positions')
           targeted_player_row = offensive_players_at_pass_arrival[offensive_players_at_pass_arrival.total_distance_from_football_pass_outcome_interception ==
                                             offensive_players_at_pass_arrival.total_distance_from_football_pass_outcome_interception.min()]
           coverage_player_row = defensive_players_at_pass_arrival[
               defensive_players_at_pass_arrival.total_distance_from_football_pass_outcome_interception ==
               defensive_players_at_pass_arrival.total_distance_from_football_pass_outcome_interception.min()]
           if not (targeted_player_row.empty):
                tracking.at[i, "targeted_player_name"] = targeted_player_row.iloc[0].at["displayName"]
           if not (coverage_player_row.empty):
                tracking.at[i, "player_in_coverage"] = coverage_player_row.iloc[0].at["displayName"]
            
tracking_df = generate_football_distance_for_tracking_week("week1-2018090600-sample.csv")
# playId and frameId are integer columns, so compare them against integers.
play_Id = 75
frame_Id = 47
first_pass_play = tracking_df.query("event == 'pass_arrived' and playId == @play_Id and frameId == @frame_Id")
print(first_pass_play.iloc[0]["player_in_coverage"])

# Looking at an individual game

Once we have calculated the player in coverage, we can proceed with the remaining steps:
1. Correct for interceptions and penalties -> Assign the responsible player to the player who intercepts a pass or commits a penalty on a play where that is relevant.  
2. Assign the EPA given up based on the player in coverage/responsible player.
3. Aggregate the EPA given up per player to produce a report
This will allow us to calculate a game report for an individual game, to see how each defensive player did for that game. 
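
The interception correction relies on keeping exactly one event per play and preferring the interception frame over the pass-arrival frame. A minimal sketch of that rule is shown here. The priority mapping mirrors the `event_priority` dictionary used in this notebook, while the toy rows and player labels are invented.

```python
import pandas as pd

event_priority = {"pass_arrived": 1, "pass_outcome_interception": 2}

events = pd.DataFrame({
    "playId": [10, 10, 11],
    "event": ["pass_outcome_interception", "pass_arrived", "pass_arrived"],
    "player_in_coverage": ["Safety", "Corner", "Nickel"],
})

events["event_priority"] = events["event"].map(event_priority)

one_per_play = (
    events.sort_values("event_priority")                     # lowest priority first
          .drop_duplicates(subset=["playId"], keep="last")   # keep the highest priority
          .sort_values("playId")
          .reset_index(drop=True)
)
print(one_per_play)
```

The important detail is that `sort_values` returns a new DataFrame, so its result has to be used (or assigned) before `drop_duplicates` runs. Otherwise `keep="last"` simply keeps whichever row happened to come last in the original order.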

The following code performs those steps for the first game of the 2018 season, Falcons at Eagles. We also plot the EPA given up per targeted play for each player on a bar graph, excluding players who didn't have enough plays. For this example, we require each player to have at least 3 plays where they were determined to be in coverage.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import math

excluded_penalty_list = ["RPS"]
event_priority = {'pass_arrived': 1, 'pass_outcome_interception': 2}

def calculate_epa_game_report(game_id):
    game_id = int(game_id)  # gameId is an integer column in the CSV files
    gameListToFileName = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/game_list_to_file_name_kaggle.csv")
    fileNameForGameId = gameListToFileName.query("@game_id == gameId")
    tracking = pd.read_csv("/kaggle/input/" + fileNameForGameId.iloc[0].at["tracking_csv_file_name"])
    # on_bad_lines="skip" replaces error_bad_lines=False (requires pandas >= 1.3)
    plays = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/plays.csv", on_bad_lines="skip")
    players = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/players.csv")
    games = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/games.csv")
    return calculate_epa_game_report_with_df_tracking_set(game_id, gameListToFileName, tracking, plays,players, games)

def calculate_epa_game_report_with_df_tracking_set(game_id, gameListToFileName, tracking, plays, players, games):
    ##Get tracking data only from the individual game
    game_id = int(game_id)  # accept "2018090600" as well as 2018090600
    tracking = tracking.query('gameId == @game_id')
    plays = plays.query('gameId == @game_id')

    ##Join with Play by Play Data
    joined_with_play_data = tracking.merge(plays, left_on=['playId', 'gameId'],
                                                  right_on=['playId', 'gameId'],
                                                  how='inner')
    ##Get only one frame per play event(I.E one per pass_arrived, intercepted), take interceptions over pass arrived.
    ##Only one event per play(interception or PASS arived, but not both).
    cleaned_play_data = get_cleaned_play_data(joined_with_play_data)

    ##Set the Away Team/Home Team, so we can set players, jersey number and penalty abbreviations
    set_defense_and_offense_team(cleaned_play_data, games)

    ##Create a mapping of each player in the game to jersey number, team and penalty abbreviation
    players_by_game = create_players_by_game(tracking, cleaned_play_data)

    ##Remove certain penalties and assign player in coverage to the penalized player
    correct_penalties(cleaned_play_data, players_by_game)

    ##Generate the epa data by defensive players.
    epa_game_report = generate_epa_game_report(cleaned_play_data, players)
    
    return epa_game_report

def generate_epa_game_report(cleaned_play_data, players):
    epa_game_report = cleaned_play_data.groupby(['player_in_coverage', 'defenseTeam'])[['epa']].agg('sum').reset_index();
    epa_game_report = players.merge(epa_game_report
                                    , right_on='player_in_coverage'
                                    , left_on='displayName'
                                    , how='inner')
    ##Keep displayName and nflId, because the season aggregation groups on them
    epa_game_report = epa_game_report.drop(['height', 'weight', 'birthDate', 'collegeName'], axis=1)
    for i, row in epa_game_report.iterrows():
        playerInCoverage = epa_game_report.at[i, "player_in_coverage"]
        df = cleaned_play_data.query("@playerInCoverage == player_in_coverage")
        epa_game_report.at[i, "epa_play_count"] = len(df.index)
        epa_game_report.at[i, "epa_per_targeted_play"] = epa_game_report.at[i, "epa"] / len(df.index)
    return epa_game_report

def correct_penalties(cleaned_play_data, players_by_game):
    ##Drop excluded penalties in place so that the caller's DataFrame is updated
    excluded = cleaned_play_data["penaltyCodes"].isin(excluded_penalty_list)
    cleaned_play_data.drop(index=cleaned_play_data[excluded].index, inplace=True)
    for i, row in cleaned_play_data.iterrows():
        penaltyJerseyNumbersFromRow = cleaned_play_data.at[i, "penaltyJerseyNumbers"]
        if not pd.isna(penaltyJerseyNumbersFromRow):
            rows = players_by_game.query("penaltyAbbr in @penaltyJerseyNumbersFromRow")
            if not rows.empty:
                ##Charge the play to the penalized defender
                playerName = rows.iloc[0].at["displayName"]
                cleaned_play_data.at[i, 'player_in_coverage'] = playerName

def set_defense_and_offense_team(cleaned_play_data, games):
    for i, row in cleaned_play_data.iterrows():
        game_Id = cleaned_play_data.at[i, "gameId"]
        homeTeam = games.query('gameId == @game_Id').iloc[0].at["homeTeamAbbr"]
        awayTeam = games.query('gameId == @game_Id').iloc[0].at["visitorTeamAbbr"]
        if homeTeam == cleaned_play_data.at[i, "possessionTeam"]:
            cleaned_play_data.at[i, "defenseTeam"] = awayTeam
        else:
            cleaned_play_data.at[i, "defenseTeam"] = homeTeam

def create_players_by_game(tracking, cleaned_play_data):
    ##One row per name in the tracking data (DataFrame.append was removed in pandas 2.0)
    players_by_game = pd.DataFrame({'displayName': tracking.displayName.unique()})
    players_by_game['jerseyNumber'] = np.nan
    for i, row in players_by_game.iterrows():
        display_Name = players_by_game.at[i, 'displayName']
        players_by_game.at[i, 'jerseyNumber'] = tracking.query("displayName == @display_Name").iloc[0].at[
            "jerseyNumber"]
        defensivePlayer = cleaned_play_data.query("player_in_coverage == @display_Name")
        offensivePlayer = cleaned_play_data.query("displayName == @display_Name")
        if not offensivePlayer.empty:
            players_by_game.at[i, 'teamAbbr'] = offensivePlayer.iloc[0].at["possessionTeam"]
            players_by_game.at[i, 'penaltyAbbr'] = players_by_game.at[i, 'teamAbbr'] + str(
                int(players_by_game.at[i, 'jerseyNumber']))
        if not defensivePlayer.empty:
            players_by_game.at[i, 'teamAbbr'] = defensivePlayer.iloc[0].at["defenseTeam"]
            players_by_game.at[i, 'penaltyAbbr'] = players_by_game.at[i, 'teamAbbr'] + " " + str(
                int(players_by_game.at[i, 'jerseyNumber']))
    return players_by_game

def get_event_priority(event):
    return event_priority[event]

def get_cleaned_play_data(joined_with_play_data):
    ##Get only one frame per play event(I.E one per pass_arrived, intercepted).
    cleaned_play_data = joined_with_play_data[joined_with_play_data["displayName"] == joined_with_play_data["targeted_player_name"]].copy()

    ##Set the event priority
    cleaned_play_data["event_priority"] = cleaned_play_data.apply(lambda row : get_event_priority(row["event"]), axis = 1)

    ##Sort Based on Event Priority (sort_values returns a new DataFrame, so keep the result)
    cleaned_play_data = cleaned_play_data.sort_values("event_priority", ascending=True)

    ##Remove duplicates on play ID and ONLY keep the highest priority event
    cleaned_play_data = cleaned_play_data.drop_duplicates(subset=['playId'], keep="last")
    return cleaned_play_data

def plot_game_id(plotDF):
    plotDF = plotDF.loc[plotDF['epa_play_count'] >= 3]
    plotDF = plotDF.sort_values("epa_per_targeted_play", ascending=True)
    plotDF.plot.bar(x="player_in_coverage", y="epa_per_targeted_play", rot=0, title="EPA by Player")
    plt.xticks(rotation=90)
    plt.show()

def calculate_and_plot(game_id):
    plotDF = calculate_epa_game_report(game_id)
    plotDF.to_csv(str(game_id) + "-" + "epa_game_report.csv")
    plot_game_id(plotDF)

game_report = calculate_epa_game_report("2018090600")
print(game_report)
plot_game_id(game_report)

# Comparing with PFF Grades

This is cool, but how does it correlate with player performance? To test the effectiveness of this metric, we need a separate source of defensive player performance data. For this we'll use Pro Football Focus grades for the first game of the 2018 season, Falcons vs Eagles.

In this case, we'll try to compare the EPA given up per targeted play with the PFF grade in coverage.

Because giving up EPA is bad from a defensive perspective if the epa_per_targeted_play is NEGATIVE/LOW that means the defensive player played WELL in coverage. Conversely, if the epa_per_targeted_play is POSITIVE/HIGH that means the defensive player played BADLY in coverage.

Because of this, we would expect the PFF Coverage Grade (which is high when the player played well and low when the player played poorly) to correlate negatively with the epa_per_targeted_play. Let's go ahead and look at the example.
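
To make the expected sign concrete, here is a minimal sketch that computes Pearson's r and R^2 by hand. The five data points are invented and only illustrate the shape of the relationship we are looking for. They are not real grades or real EPA values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values: lower EPA allowed paired with higher coverage grades.
epa_per_targeted_play = [-0.6, -0.2, 0.1, 0.5, 0.9]
coverage_grade = [85.0, 74.0, 70.0, 58.0, 49.0]

r = pearson_r(epa_per_targeted_play, coverage_grade)
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))
```

R^2 is always non-negative, so on its own it does not show the direction of the relationship. The direction comes from the sign of r, or equivalently from the slope of the trendline in the plots.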

Note: The game report used below is precomputed for notebook speed, but we do the same comparison/plotting on the data previously plotted. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
import numpy as np # linear algebra

minimum_epa_play_count = 3

def plot():
    ###
    # Create the pandas DataFrame
    pff_grade = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/2018090600-PFF-Grades.csv")
    epa_report = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/2018090600-epa_game_report-correct.csv")

    epa_report_with_grade = epa_report.merge(pff_grade, left_on=['player_in_coverage'],
                                                  right_on=['displayName'],
                                                  how='inner')

    epa_report_with_grade = epa_report_with_grade.query("epa_play_count >= @minimum_epa_play_count")
    # Pass x and y explicitly; rvalue ** 2 is the R^2 shown in the plot title.
    rvalue = scipy.stats.linregress(epa_report_with_grade['epa_per_targeted_play'], epa_report_with_grade['coverageGrade']).rvalue ** 2

    ax = epa_report_with_grade.plot.scatter(x = 'epa_per_targeted_play', y = 'coverageGrade')
    epa_report_with_grade[['epa_per_targeted_play',
                   'coverageGrade',
                   'player_in_coverage']].apply(lambda row: ax.text(*row), axis=1);
    x = epa_report_with_grade['epa_per_targeted_play']
    y = epa_report_with_grade['coverageGrade']
    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    plt.plot(x,p(x),"r--")
    plt.suptitle("EPA Per Play Versus PFF Coverage Grade: Falcons vs Eagles, September 6th 2018", y=1.05, fontsize=18)
    ax.set_title('R^2 Value = ' + str(round(rvalue,2)))

    plt.xticks(rotation=90)
    plt.show()

    
plot()

# Looking at Season Long EPA Allowed Per Play data for individual players

For this game at least, the results are consistent with our hypothesis. There's a rough correlation between PFF Coverage Grade and EPA per targeted play, which was the intention of this metric. The trendline shows that a player who played well by PFF grade tends to have a low EPA per targeted play, while a player who played poorly by PFF grade tends to have a high EPA per targeted play. But a single game is a small sample, so we need to test more games to have more confidence that this metric correlates strongly with player performance. We can look at the same metric for players over the course of a season. Let's try that and look at two players over the 2018 season. The code below is what we use to aggregate players by their EPA given up over the course of the season.
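
One detail of the season aggregation is worth spelling out: the season-level value is the total EPA divided by the total number of targeted plays, not the average of the per-game averages. The sketch below shows the difference on invented per-game numbers for a placeholder player.

```python
import pandas as pd

# Invented per-game reports for one placeholder player.
game_reports = pd.DataFrame({
    "displayName": ["Player X"] * 3,
    "epa": [3.0, -1.0, -2.0],
    "epa_play_count": [2, 4, 4],
})

# Season value: sum the EPA and the play counts, then divide.
season = (
    game_reports.groupby("displayName")
    .agg(epa=("epa", "sum"), epa_play_count=("epa_play_count", "sum"))
    .reset_index()
)
season["epa_per_targeted_play"] = season["epa"] / season["epa_play_count"]

# For contrast: averaging per-game rates gives a 2-target game the same
# weight as a 4-target game.
mean_of_game_rates = (game_reports["epa"] / game_reports["epa_play_count"]).mean()

print(season)
print(round(mean_of_game_rates, 3))
```

Summing first means every targeted play carries the same weight, so a game with only a couple of targets cannot dominate the season number.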

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # linear algebra

def get_merged_epa_report_for_player_list(playerList):
    merged_tracking_df = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/aggregated-tracking-data.csv")
    dfDict = {}
    for playerName in playerList:
            epa_report_data_frame = get_epa_report_for_player(playerName, merged_tracking_df)
            dfDict[playerName] = epa_report_data_frame
    return merge_epa_reports_for_player_list(dfDict)

def get_epa_report_for_player(playerName, merged_tracking_df):
    epa_reports = calculate_epa_reports_for_player(playerName, merged_tracking_df)
    df_epa_report = aggregate_game_reports_for_player(epa_reports, playerName)
    return df_epa_report

def calculate_epa_reports_for_player(playerName, merged_tracking_df):
    gameListToFileName = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/game_list_to_file_name.csv")
    # on_bad_lines="skip" replaces error_bad_lines=False (requires pandas >= 1.3)
    plays = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/plays.csv", on_bad_lines="skip")
    players = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/players.csv")
    games = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/games.csv")
    tracking_df_with_player = merged_tracking_df.query('displayName == @playerName')
    game_id_list = tracking_df_with_player.gameId.unique()
    game_id_list_str = [str(i) for i in game_id_list]
    epa_reports = calculate_epa_reports_for_player_df(merged_tracking_df, plays, players, games, game_id_list_str, gameListToFileName)
    return epa_reports

def calculate_epa_reports_for_player_df(tracking, plays, players, games, gameIdList, gameListToFileName):
    dfDict = {}
    for i in gameIdList:
        game_data_frame = calculate_epa_game_report_with_df_tracking_set(i, gameListToFileName, tracking, plays, players, games)
        dfDict[i] = game_data_frame
    return dfDict

def aggregate_game_reports_for_player(epa_reports, player_Name):
    epa_report_list = epa_reports.values()
    if len(epa_report_list) == 0:
       return pd.DataFrame()
    df_epa_report = pd.concat(epa_report_list)
    df_epa_report = df_epa_report.query("@player_Name == displayName")
    df_epa_report = df_epa_report.groupby(['displayName', 'nflId', 'position', 'defenseTeam'])\
        .agg({'epa': 'sum', 'epa_play_count': 'sum'}).reset_index()
    for i, row in df_epa_report.iterrows():
        df_epa_report.at[i, "epa_per_targeted_play"] = df_epa_report.at[i, "epa"] / df_epa_report.at[i, "epa_play_count"]
    return df_epa_report

def merge_epa_reports_for_player_list(epaReportDict):
    dfEpaReport = epaReportDict.values()
    epa_report_with_names = pd.concat(dfEpaReport)
    return epa_report_with_names

def plot_df(plotDF):
    plotDF = plotDF.loc[plotDF['epa_play_count'] >= 3]
    plotDF = plotDF.sort_values("epa_per_targeted_play", ascending=True)
    plotDF.plot.bar(x="displayName", y="epa_per_targeted_play", rot=0, title="EPA by Player")
    plt.xticks(rotation=90)
    plt.show()

players = ["Stephon Gilmore", "Ahkello Witherspoon"]
epa_report = get_merged_epa_report_for_player_list(players)
print(epa_report)
plot_df(epa_report)

# Comparison with Other Secondary Performance Metrics 

As shown previously, we calculated season-long EPA allowed per targeted play for two players in the 2018 season, Stephon Gilmore and Ahkello Witherspoon. Stephon Gilmore was one of the best cover corners in the league in 2018 according to PFF, and Ahkello Witherspoon was one of the worst according to PFF (among players who played over 50 percent of their team's snaps). Their EPA per targeted play numbers roughly reflect this.

But this isn't enough evidence that our metric is correct. We need to look at season-long EPA per targeted play numbers and compare them to PFF coverage grades (or a similar metric) to have more confidence that our metric correlates well with secondary performance. Because computing the report for this many players takes a while, we'll import a precomputed report, but we include all the code used to compute the comparison.

We'll go ahead and look at cornerbacks in the 2018 season with over 950 snaps. We'll look at three plots/correlations.
1. EPA Per Targeted Play vs PFF Coverage Grade 
2. EPA Per Targeted Play vs Passer Rating Against
3. PFF Coverage Grade vs Passer Rating Against 
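
The three comparisons can also be summarised in a single table of pairwise correlations. The sketch below uses the column names from the precomputed report (`epa_per_targeted_play`, `grades_coverage_defense`, `qb_rating_against`), but the values are invented for illustration and are not real player data.

```python
import pandas as pd

# Invented values for illustration only.
report = pd.DataFrame({
    "epa_per_targeted_play": [-0.30, -0.10, 0.05, 0.20, 0.45],
    "grades_coverage_defense": [88.0, 79.0, 72.0, 65.0, 51.0],
    "qb_rating_against": [62.0, 80.0, 85.0, 99.0, 118.0],
})

r = report.corr(method="pearson")  # pairwise Pearson r, sign included
r_squared = r ** 2                 # pairwise R^2, sign removed

print(r.round(2))
print(r_squared.round(2))
```

If the metric behaves as intended, we would expect EPA per targeted play to correlate negatively with coverage grade and positively with passer rating against, since a higher value of either EPA allowed or passer rating allowed indicates worse coverage.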

In [None]:
import pandas as pd
import numpy as np
import scipy.stats

import plotly.express as px
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from matplotlib import cm

def compare_pff_with_epa_report(playerList):
    playerToEPAReport = get_merged_epa_report_for_player_list(playerList)
    defense_grades = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/defense-grades.csv")

    epaReportDefenseGrades = playerToEPAReport.merge(defense_grades, left_on=['displayName'],
                                                  right_on=['player'],
                                                  how='inner')
    epaReportDefenseGrades_DroppedColumns = epaReportDefenseGrades[["displayName", "nflId", "position_x","defenseTeam","epa_play_count","epa","epa_per_targeted_play","player_game_count","snap_counts_total","snap_counts_coverage","grades_coverage_defense","qb_rating_against"]].copy()
    return epaReportDefenseGrades_DroppedColumns

epaReportDefenseGrades = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/epa-report-corner-2018.csv")
snap_count_min = 950
epaReportDefenseGrades = epaReportDefenseGrades.query("snap_counts_total > @snap_count_min")
epaReportDefenseGrades_Plot = epaReportDefenseGrades.copy()

rvalue_squaredValueEPAVSPFF = scipy.stats.linregress(epaReportDefenseGrades['epa_per_targeted_play'], epaReportDefenseGrades['grades_coverage_defense']).rvalue ** 2

colormap = cm.viridis
colorlist = [colors.rgb2hex(colormap(i)) for i in np.linspace(0, 0.9, len(epaReportDefenseGrades_Plot['displayName']))]
ax = epaReportDefenseGrades_Plot.plot.scatter(x='epa_per_targeted_play', y='grades_coverage_defense')
epaReportDefenseGrades_Plot[['epa_per_targeted_play',
                   'grades_coverage_defense',
                   'displayName']].apply(lambda row: ax.text(*row, horizontalalignment = "center", verticalalignment='bottom'), axis=1);

for i,c in enumerate(colorlist):
    x = epaReportDefenseGrades_Plot['epa_per_targeted_play'].iloc[i]
    y = epaReportDefenseGrades_Plot['grades_coverage_defense'].iloc[i]
    l = epaReportDefenseGrades_Plot['displayName'].iloc[i]
    ax.scatter(x, y, label=l, s=50, linewidth=0.1, c=c)
    
x = epaReportDefenseGrades_Plot['epa_per_targeted_play']
y = epaReportDefenseGrades_Plot['grades_coverage_defense']
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.suptitle("Cornerback EPA Per Targeted Play Versus PFF Coverage Grade: 2018 Season, Over 950 Snaps. ", y=1.05, fontsize=18)
ax.set_title('R^2 Value = ' + str(round(rvalue_squaredValueEPAVSPFF,2)))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

In [None]:
import pandas as pd
import numpy as np
import scipy.stats

import plotly.express as px
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from matplotlib import cm


epaReportDefenseGrades = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/epa-report-corner-2018.csv")
snap_count_min = 950
epaReportDefenseGrades = epaReportDefenseGrades.query("snap_counts_total > @snap_count_min")
epaReportDefenseGrades_Plot = epaReportDefenseGrades.copy()

rvalue_squaredValueEPAPasserRating = scipy.stats.linregress(epaReportDefenseGrades['epa_per_targeted_play'], epaReportDefenseGrades['qb_rating_against']).rvalue ** 2

colormap = cm.viridis
colorlist = [colors.rgb2hex(colormap(i)) for i in np.linspace(0, 0.9, len(epaReportDefenseGrades_Plot['displayName']))]
ax = epaReportDefenseGrades_Plot.plot.scatter(x='epa_per_targeted_play', y='qb_rating_against')
epaReportDefenseGrades_Plot[['epa_per_targeted_play',
                   'qb_rating_against',
                   'displayName']].apply(lambda row: ax.text(*row, horizontalalignment = "center", verticalalignment='bottom'), axis=1);

for i,c in enumerate(colorlist):
    x = epaReportDefenseGrades_Plot['epa_per_targeted_play'].iloc[i]
    y = epaReportDefenseGrades_Plot['qb_rating_against'].iloc[i]
    l = epaReportDefenseGrades_Plot['displayName'].iloc[i]
    ax.scatter(x, y, label=l, s=50, linewidth=0.1, c=c)
    
x = epaReportDefenseGrades_Plot['epa_per_targeted_play']
y = epaReportDefenseGrades_Plot['qb_rating_against']
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.suptitle("Cornerback EPA Per Targeted Play Versus QB Rating Against: 2018 Season, Over 950 Snaps. ", y=1.05, fontsize=18)
ax.set_title('R^2 Value = ' + str(round(rvalue_squaredValueEPAPasserRating,2)))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

In [None]:
import pandas as pd
import numpy as np
import scipy.stats

import plotly.express as px
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from matplotlib import cm

epaReportDefenseGrades = pd.read_csv("/kaggle/input/nfl-modified-tracking-data/epa-report-corner-2018.csv")
snap_count_min = 950
epaReportDefenseGrades = epaReportDefenseGrades.query("snap_counts_total > @snap_count_min")
epaReportDefenseGrades_Plot = epaReportDefenseGrades.copy()

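# Pass x and y as separate arguments; recent SciPy releases deprecate the single 2-D array form.
rvalue_squaredValuePasserRatingVsPFFGrade = scipy.stats.linregress(
    epaReportDefenseGrades['grades_coverage_defense'],
    epaReportDefenseGrades['qb_rating_against']).rvalue ** 2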

colormap = cm.viridis
colorlist = [colors.rgb2hex(colormap(i)) for i in np.linspace(0, 0.9, len(epaReportDefenseGrades_Plot['displayName']))]
ax = epaReportDefenseGrades_Plot.plot.scatter(x='qb_rating_against', y='grades_coverage_defense')
epaReportDefenseGrades_Plot[['qb_rating_against',
                   'grades_coverage_defense',
                   'displayName']].apply(lambda row: ax.text(*row, horizontalalignment = "center", verticalalignment='bottom'), axis=1);

for i,c in enumerate(colorlist):
    x= epaReportDefenseGrades_Plot['qb_rating_against'].iloc[i]
    y = epaReportDefenseGrades_Plot['grades_coverage_defense'].iloc[i]
    l = epaReportDefenseGrades_Plot['displayName'].iloc[i]
    ax.scatter(x, y, label=l, s=50, linewidth=0.1, c=c)
    
x = epaReportDefenseGrades_Plot['qb_rating_against']
y = epaReportDefenseGrades_Plot['grades_coverage_defense']
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.suptitle("Cornerback PFF Grade Versus QB Rating Against: 2018 Season, Over 950 Snaps. ", y=1.05, fontsize=18)
ax.set_title('R^2 Value = ' + str(round(rvalue_squaredValuePasserRatingVsPFFGrade,2)))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

# Findings: Correlations with Other Secondary Performance Metrics

Running these regressions and viewing the plots gives the following R^2 values:


* EPA per targeted play vs PFF coverage grade: 0.42
* EPA per targeted play vs passer rating against: 0.63
* Passer rating against vs PFF coverage grade: 0.38
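
The figures in the list above appear to be the R^2 values shown in the plot titles, that is, the squared `rvalue` returned by `scipy.stats.linregress`. As a rough aid to interpreting them, the sketch below computes a Pearson correlation in plain Python and converts the reported R^2 values back to the magnitude of r. The helper `pearson_r`, the dictionary keys, and the toy numbers are illustrative assumptions, not part of the original analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): lower EPA allowed paired with a higher coverage grade.
toy_epa = [-0.2, -0.1, 0.0, 0.1, 0.2]
toy_grade = [85.0, 80.0, 75.0, 70.0, 65.0]
toy_r = pearson_r(toy_epa, toy_grade)
toy_r_squared = toy_r ** 2  # the same quantity as linregress(...).rvalue ** 2

# R^2 values reported in the list above, converted to |r|.
reported_r2 = {
    "epa_vs_pff_grade": 0.42,
    "epa_vs_passer_rating": 0.63,
    "passer_rating_vs_pff_grade": 0.38,
}
implied_abs_r = {name: round(math.sqrt(r2), 2) for name, r2 in reported_r2.items()}
print(implied_abs_r)
```

Squaring discards the sign of r, so the R^2 values alone do not show the direction of each relationship. The slope of the red trend line in each plot does.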

What do these correlations tell us? First, EPA per targeted play correlates well with both passer rating against (R^2 of 0.63) and the PFF coverage grade (R^2 of 0.42). This is another encouraging finding: it suggests that EPA per targeted play can be used to evaluate player performance, which was the objective of this notebook and the reason for defining the metric.

The second conclusion is that, among players with a significant number of snaps (over 950 here), EPA per targeted play has a slightly stronger correlation with the PFF grade (0.42) than passer rating against does (0.38). In other words, EPA per targeted play tracks the judgment of someone watching the All-22 film of every play (which in my view is the ideal that automated metrics should aim for) more closely than passer rating against does.

Why are these findings valuable? Mainly because we have shown that EPA per targeted play correlates well with the PFF grade (which is entirely manual) and with passer rating against (which is partially automated but still requires knowing the targeted receiver on each play). This means that with tracking data, we could use EPA per targeted play as an automated metric for evaluating secondary performance!
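
To make the "automated metric" idea concrete, here is a minimal sketch of the final aggregation step. It assumes each targeted play has already been attributed to a defender (for example, from tracking data) and carries an EPA value. The defender names, the EPA numbers, and the function `epa_per_targeted_play` are hypothetical and are not taken from the dataset used in this notebook.

```python
from collections import defaultdict

# Hypothetical play-level records: (defender targeted in coverage, EPA of the play).
targeted_plays = [
    ("Defender A", 0.8),
    ("Defender A", -0.5),
    ("Defender B", 1.2),
    ("Defender B", 0.4),
    ("Defender A", -0.9),
]


def epa_per_targeted_play(plays):
    """Average EPA allowed per targeted play for each defender."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for defender, epa in plays:
        totals[defender] += epa
        counts[defender] += 1
    return {defender: totals[defender] / counts[defender] for defender in totals}


epa_by_defender = epa_per_targeted_play(targeted_plays)

# Lower (more negative) EPA allowed is better for the defender.
for defender, value in sorted(epa_by_defender.items(), key=lambda item: item[1]):
    print(f"{defender}: {value:+.2f} EPA per targeted play")
```

In practice a minimum snap or target threshold, like the 950-snap cutoff used in the plots above, would be applied before comparing players, since averages over a handful of targets are noisy.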

# Conclusion

At the start of this notebook, we wanted to determine whether EPA per targeted play can be used to evaluate secondary performance. Based on the correlations with PFF grade and passer rating against discussed above, we have shown that EPA per targeted play is a suitable metric for evaluating secondary performance and that it can be calculated automatically from tracking data. This can be an advantage for teams and scouts who want to supplement their own tape study with a metric that matches the eye test. Any team with access to tracking data can calculate this metric and start using it to evaluate secondary performance.
