# Defensive Back Starting Positions Reveal Key Predictors of Pass Completion Probabilities

Agnim Agarwal <sup>1</sup>, Brandon Guo <sup>1</sup>, Neha Dalia <sup>1</sup>, Pranav Patil <sup>1</sup>

<sup>1</sup> Caltech, Pasadena, CA

## Introduction

The NFL Big Data Bowl challenges the data science community to explore insights for defending the pass play. Here, we explore the importance of a defensive back's position relative to his receiver at the beginning of each play, offering some relevant, actionable insights to NFL organizations.

We chose this topic primarily because it is actionable. There is a wealth of research relating in-play statistics to a play's outcome or to possible strategies. However, many of these studies rely on information that is not known *a priori*, such as the routes of the receivers. It therefore remains a challenge to implement these ideas in practice, especially since a defensive strategy designed against one type of pass play may be completely undermined by a different one.

We have not seen much work on how defenders line up at the beginning of a play, but we believe that insights on this matter can be implemented directly during games *without loss of generality*, which makes them particularly valuable to organizations.

Thus the goal of our work is to determine the most influential factors in different coverage types (position of the defensive back). Knowing this information would allow defensive coaches to determine the most favorable coverage matchups in a certain situation, for any type of defensive position they take. 

We divide this notebook submission into a few main sections for clarity and organization. At each technical step of our work, we offer analysis and interpretation of key figures and results, building to a few key conclusions and recommendations at the end of the notebook. Source code is typically hidden but can be expanded for detailed viewing.


Many of the intermediate analyses depend on fairly long scripts, so we have implemented them as external functions to keep the code in the body of the report readable and concise. All functions not explicitly implemented in the body are included in the Appendix, sorted alphabetically by function name.

## Background


Analyzing pass defenders is a difficult task. Looking at their overall metrics is a good start. Breaking down their play by coverage type and alignment is even better, as man vs zone schemes and slot vs wide alignment are all fundamentally different. Going a step further, however, previous studies ([FiveThirtyEight](https://fivethirtyeight.com/features/our-new-metric-shows-how-good-nfl-receivers-are-at-creating-separation/), [FiveThirtyEight](https://fivethirtyeight.com/features/what-the-nfls-new-pass-defense-metric-can-and-cant-tell-us/)) have found it helpful to break down their play by Press Type. This allows us to separate off coverage from tight bump-and-run coverage, which again are fundamentally different techniques and situations.

While evaluating a defender's press coverage ability can take hours of film study and can be subjective at times, tracking data can provide the same insight from just the first seconds of the play ([MaddenGuides](https://www.maddenguides.com/press-normal-and-soft-coverage/)). In the clip below, the cornerback successfully uses this technique to disrupt the timing of the receiver's route.

![](https://images.actionnetwork.com/blog/2018/11/brad.gif)

This form of coverage is largely dependent on the distance between the defender and the receiver for the duration of the play. As such, we are motivated to further pursue this separation distance in the broader context of quantifying pass defender performance. As we see in our exploratory data analysis, this separation can be made in both the *x* and *y* axes, defined as shown below (from the competition page).

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3258%2F820e86013d48faacf33b7a32a15e814c%2FIncreasing%20Dir%20and%20O.png?generation=1572285857588233&alt=media)

We will refer to the *x* and *y* axes in this way throughout the report.


## Exploratory Data Analysis (EDA)

We begin by exploring the given data regarding pass distances to see if there are any surface-level underlying features driving pass distance, which would in turn influence pass coverage.

In [None]:
import numpy as np # linear algebra
import pandas as pd
import os

pd.options.mode.chained_assignment = None

# Loading in all the raw datasets we will use here
pass_plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")
pass_plays = pass_plays[pass_plays['playType'] == "play_type_pass"]
players = pd.read_csv("../input/nfl-big-data-bowl-2021/players.csv")
all_plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")



# Setting global parameters to make the plots look nicer
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
mpl.rcParams["axes.spines.right"] = False
mpl.rcParams["axes.spines.top"] = False
mpl.rcParams['axes.linewidth'] = 2
mpl.rcParams['ytick.labelsize'] = 12
mpl.rcParams['xtick.labelsize'] = 12
mpl.rcParams['axes.titlesize'] = 12
mpl.rcParams['axes.labelsize'] = 12


We are interested in exploring the role of defender starting position in a given pass play. To begin with, we perform some basic analyses on passing plays to see if there are any anomalies in the data. If not, we can confidently continue with our specific analysis on defender starting position without worrying about outliers or underlying trends.

In [None]:

# This code loops through all the data to collect all the macro-level data about a given play. (down, time of play, duration, etc.)
df_list = []

for i in range(1, 18):
    week_str = "../input/nfl-big-data-bowl-2021/week" + str(i) + ".csv"
    week = pd.read_csv(week_str)

    # Only care about the football, at the snap and when the pass arrives. This gives us the intended pass distance.
    football = week[(week['event'].isin(['ball_snap', 'pass_arrived'])) & (week['displayName'] == 'Football')]
    distances = football.groupby(['gameId', 'playId'])['x'].diff()
    football['distance'] = np.abs(distances)
    football = football.dropna(subset = ['distance'])
    # Take just the relevant columns
    df_list.append(football[['gameId', 'playId', 'distance']])


pass_distances = pd.concat(df_list)
pass_plays = pass_plays.merge(pass_distances, on = ['gameId', 'playId'])
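
As a sanity check on the distance calculation above, here is a minimal sketch on a hand-made tracking table. The rows and values are invented for illustration and are not taken from the competition data. It shows that differencing the football's `x` within each `(gameId, playId)` group leaves exactly one non-null distance per play.

```python
import pandas as pd

# Toy tracking rows for the football only; all values are invented for illustration.
toy = pd.DataFrame({
    "gameId": [1, 1, 1, 1],
    "playId": [10, 10, 20, 20],
    "event": ["ball_snap", "pass_arrived", "ball_snap", "pass_arrived"],
    "x": [30.0, 42.5, 80.0, 71.0],
})

# diff() within each play: NaN on the snap row, (arrival x - snap x) on the arrival row.
toy["distance"] = toy.groupby(["gameId", "playId"])["x"].diff().abs()
toy_distances = toy.dropna(subset=["distance"])[["gameId", "playId", "distance"]]
print(toy_distances)
```

Note that this measures displacement along the x-axis only, so a pass thrown mostly sideways registers as a short pass.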

Since the choice of defender position depends on the type of play the defense expects (long pass vs. short pass), we first check whether specific factors affect the types of pass plays chosen by the offense, which might confound our analysis of how a defensive team responds.

In [None]:
## Plot for Down analysis




different_downs = [np.array(pass_plays[pass_plays['down'] == i]['distance']) for i in range(1, 5)]
# Take middle 95 percent (robust to outliers): trim both the lower and upper 2.5 percent
different_downs = [x[(x > np.percentile(x, 2.5)) & (x < np.percentile(x, 97.5))] for x in different_downs]
different_downs = [np.log(x + 0.01) for x in different_downs]


parts = plt.violinplot(different_downs, showmeans=False, showmedians=False,
        showextrema=False)
plt.xticks([1, 2, 3, 4])
plt.title("Relative pass length on different downs", fontweight = "bold")
plt.xlabel("Down Number")
plt.ylabel("Length of Pass (log scale)")
colors = ["#2b8cbe", "#43a2ca", "#a8ddb5", "#e0f3db"]
for i in range(len(parts['bodies'])):
    pc = parts['bodies'][i]
    pc.set_facecolor(colors[i])
    pc.set_edgecolor('black')
    pc.set_linewidths(1.5)
    pc.set_alpha(1)
plt.show()


In [None]:
different_downs = [np.array(pass_plays[pass_plays['yardsToGo'] == i]['distance']) for i in range(1, 20)]
# Take middle 95 percent (robust to outliers)
different_downs = [x[ x > np.percentile(x, 2.5)] for x in different_downs]
different_downs = [x[ x < np.percentile(x, 97.5)] for x in different_downs]
different_downs = [np.log(x + 0.01) for x in different_downs]
plt.figure(figsize=(11,3))
plt.xlabel("Yards left to go")
plt.ylabel("Length of Pass (log scale)")
plt.violinplot(different_downs)
plt.title("Relative pass length in different \"yard-to-go\" scenarios", fontweight = "bold")
plt.show()

In [None]:
pass_plays['time_elapsed'] = time_elapsed(pass_plays['quarter'], pass_plays['gameClock'])
pass_plays = pass_plays[pass_plays['time_elapsed'] < 3600]
# Overtime data is excluded due to lack of data points
pass_plays['time_binned'] = np.floor(pass_plays['time_elapsed'] / 60)
# Average only the pass distance, since the other columns include non-numeric data
temp = pass_plays.groupby('time_binned')['distance'].mean()
plt.figure(figsize=(11,3))
plt.plot(temp.index.values, temp.values, 'o-')
plt.xlabel("Minutes elapsed in game (regular time only)")
plt.ylabel("Average Pass length (yards)")
plt.title("Average pass length in games", fontweight = "bold")
plt.show()
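
The `time_elapsed` helper used above is defined in the Appendix rather than in the body. For readers who want the idea without scrolling, the following is a rough sketch of what such a helper could look like. It assumes 15-minute quarters and a `gameClock` string of the form `MM:SS:00` that counts down within each quarter; both the format and the name `time_elapsed_sketch` are assumptions made for illustration, and missing clock values are not handled.

```python
import pandas as pd

def time_elapsed_sketch(quarter, game_clock):
    """Seconds elapsed since kickoff, assuming 15-minute quarters and an
    'MM:SS:00' clock string that counts down within each quarter."""
    parts = game_clock.str.split(":", expand=True)
    remaining = parts[0].astype(int) * 60 + parts[1].astype(int)
    return (quarter - 1) * 900 + (900 - remaining)

# Invented example rows, not taken from the competition data.
demo = pd.DataFrame({
    "quarter": [1, 2, 4],
    "gameClock": ["15:00:00", "07:30:00", "00:05:00"],
})
demo["time_elapsed"] = time_elapsed_sketch(demo["quarter"], demo["gameClock"])
print(demo)
```

Under these assumptions, regulation time corresponds to values below 3600 seconds, which is the cutoff used in the cell above.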



From the figures above, none of the factors we examined (the down, the yards to go, or the time elapsed in the game) has an obvious, quantifiable effect on the length of pass that teams select. From a defensive standpoint, these measures therefore carry little information and should not play a large role in defensive strategy.

For example, defenses should not necessarily assume a quarterback will select a long pass as the game elapses or when the offense has many yards to go. Sometimes a short pass can yield a large yardage gain, especially if the defense expects a very long pass. 

### Exploration of Defender Starting Position

Now we look more closely at which defender positions occur across the pass plays, and which positions can help force an incomplete pass. We show here a heatmap of defensive backs' starting positions for all passing plays in the data. For each play, we identify the receiver to whom the ball was thrown and whether the pass was complete or incomplete. Then, for this receiver and his corresponding defensive back, we explore the tracking data further. The detailed implementation is in the code.

In [None]:
# Where are defenders lining up?

deltax = []
deltay = []
pass_result = []

for n in range(1, 18):
    print(n)
    df = pd.read_csv("../input/nfl-big-data-bowl-2021/week" + str(n) + ".csv")
    unique_plays = df.drop_duplicates(['playId', 'gameId'])

    plays = np.unique(df['playId'])

    for i in range(len(unique_plays)):

        PLAY_NUMBER = unique_plays.iloc[i, :]['playId']
        game_id = unique_plays.iloc[i, :]['gameId']
        result = get_details(PLAY_NUMBER, game_id)

        play = df[df['playId'] == PLAY_NUMBER]
        play = play[play['gameId'] == game_id]
        play['nflId'] = play['nflId'].replace(np.nan, 0)

        # Get the vectors of each of the defenders lining up
        ## Code for finding the starting distance for each receiver

        if len(find_all_receivers(play)) == 0:
            # means there is no passing this play
            continue

        try:
            pairings = find_closest_defender(play, "all")
        except (IndexError, ValueError):
            continue

        if result == 'I':
            pass_result.append('I')
        elif result == 'C':
            pass_result.append('C')
        else:
            pass_result.append('O')

        # Finding the frame 
        frame = list(play[play['event'] == 'ball_snap']['frameId'])[0]

        for i in range(len(pairings)):
            try:
                rec = pairings.iloc[i, :]['receiver']
                dfndr = pairings.iloc[i, :]['closest_def']
                temp = play[(play['nflId'] == dfndr) & (play['frameId'] == frame)]
                x_def, y_def = float(pd.to_numeric(temp['x'])) , float(pd.to_numeric(temp['y']))

                temp = play[(play['nflId'] == rec) & (play['frameId'] == frame)]
                x_rec, y_rec = float(pd.to_numeric(temp['x'])) , float(pd.to_numeric(temp['y']))

                dire = play.iloc[0, :]['playDirection']
                x = x_def - x_rec
                y = y_def - y_rec
                if dire == "left":
                    deltax.append(-1 * x)
                    deltay.append(-1 * y)
                else:
                    deltax.append(x)
                    deltay.append(y)
            except TypeError:
                pass
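
The core step of the loop above is to express the defender's position relative to his receiver and to flip the sign for plays moving left, so that all plays share one frame of reference. A minimal standalone sketch of that step follows; the function name `relative_offset` and the example coordinates are hypothetical.

```python
def relative_offset(x_def, y_def, x_rec, y_rec, play_direction):
    """Defender position relative to the receiver, with the sign flipped
    for plays moving left so that all plays point the same way."""
    dx, dy = x_def - x_rec, y_def - y_rec
    if play_direction == "left":
        dx, dy = -dx, -dy
    return dx, dy

# Invented coordinates: the defender lines up 5 yards deeper than the receiver.
print(relative_offset(45.0, 20.0, 50.0, 21.0, "left"))
print(relative_offset(55.0, 20.0, 50.0, 21.0, "right"))
```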

In [None]:
# gaussian_filter is exposed directly by scipy.ndimage (the .filters namespace is deprecated)
from scipy.ndimage import gaussian_filter
import matplotlib.pyplot as plt

heatmap, xedges, yedges = np.histogram2d(deltax, deltay, bins=40, density = True, range = [[0, 10], [-4, 4]])
heatmap = gaussian_filter(heatmap, sigma=0.01)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.figure(figsize=(9,6))
plt.clf()
plt.imshow(heatmap.T, extent=extent, origin='lower', cmap = "PiYG")
cb = plt.colorbar()
cb.set_label('density')
plt.title("Heatmap of defender starting position", fontweight = "bold")
plt.xlabel("Yards off line of scrimmage (x-axis direction)")
plt.ylabel("Yards from receiver (y-axis direction)")
plt.show()

Based on the heatmap, the distribution is roughly symmetrical about the x-axis (y = 0), which suggests that the fruit of the analysis lies along different values of x (distance from the line of scrimmage). We also notice that there is very little variation along the y-axis, i.e., defenders typically line up directly across from their receiver. This inspired us to approach the problem in terms of distance along the x-axis, since different starting depths are established ideas in defensive strategy (e.g., press coverage).
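
One detail of the heatmap code worth checking is the smoothing width. In `scipy.ndimage.gaussian_filter`, `sigma` is measured in bins, so a value as small as 0.01 appears to leave the histogram essentially unchanged. The sketch below illustrates this on synthetic offsets (random numbers invented for the purpose, not the tracking data) and compares it with a one-bin smoothing width.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
# Synthetic offsets, invented purely to exercise the histogram pipeline.
dx = rng.uniform(0, 10, size=5000)
dy = rng.normal(0, 1, size=5000)

hist, _, _ = np.histogram2d(dx, dy, bins=40, density=True, range=[[0, 10], [-4, 4]])

barely_smoothed = gaussian_filter(hist, sigma=0.01)  # sigma is in units of bins
smoothed = gaussian_filter(hist, sigma=1.0)

print("sigma=0.01 changes the histogram:", not np.allclose(barely_smoothed, hist))
print("sigma=1.0 changes the histogram:", not np.allclose(smoothed, hist))
```

If a visibly smoother map is wanted, a `sigma` on the order of one bin is a reasonable starting point, though the right value is a matter of taste.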

We now look at defender positioning along these two axes in more detail. In particular, we want a preliminary view of which positions, along both the x-axis and the y-axis, are associated with a pass being complete or incomplete. A heatmap as above is not practical for comparing such cases directly, so we instead compare positioning along each axis individually, operating under the hypothesis that positioning along the y-axis might not be very meaningful:


In [None]:
## Compare incomplete and complete passes

# same runner code, one per play.
def exploration():
    deltax = []
    deltay = []
    pass_result = []
    all_plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")

    for n in range(1, 18):
        print(n)
        df = pd.read_csv("../input/nfl-big-data-bowl-2021/week" + str(n) + ".csv")
        unique_plays = df.drop_duplicates(['playId', 'gameId'])

        plays = np.unique(df['playId'])

        for i in range(len(unique_plays)):

            PLAY_NUMBER = unique_plays.iloc[i, :]['playId']
            game_id = unique_plays.iloc[i, :]['gameId']
            result = get_details(PLAY_NUMBER, game_id)

            play = df[df['playId'] == PLAY_NUMBER]
            play = play[play['gameId'] == game_id]
            play['nflId'] = play['nflId'].replace(np.nan, 0)

            # Get the vectors of each of the defenders lining up
            ## Code for finding the starting distance for each receiver

            if len(find_all_receivers(play)) == 0:
                # means there is no passing this play
                continue

            try:
                pairings = find_closest_defender(play, "all")
            except (IndexError, ValueError):
                continue

            

            # Finding the frame 
            frame = list(play[play['event'] == 'ball_snap']['frameId'])[0]

            receiver = ball_thrown_to(play)
            pairings = pairings[pairings['receiver'] == receiver]
            for i in range(len(pairings)):
                try:
                    rec = pairings.iloc[i, :]['receiver']
                    dfndr = pairings.iloc[i, :]['closest_def']
                    temp = play[(play['nflId'] == dfndr) & (play['frameId'] == frame)]
                    x_def, y_def = float(pd.to_numeric(temp['x'])) , float(pd.to_numeric(temp['y']))

                    temp = play[(play['nflId'] == rec) & (play['frameId'] == frame)]
                    x_rec, y_rec = float(pd.to_numeric(temp['x'])) , float(pd.to_numeric(temp['y']))

                    dire = play.iloc[0, :]['playDirection']
                    x = x_def - x_rec
                    y = y_def - y_rec
                    if dire == "left":
                        deltax.append(-1 * x)
                        deltay.append(-1 * y)
                    else:
                        deltax.append(x)
                        deltay.append(y)
                        
                    if result == 'I':
                        pass_result.append('I')
                    elif result == 'C':
                        pass_result.append('C')
                    else:
                        pass_result.append('O')
                except TypeError:
                    pass
    return pd.DataFrame({'x': deltax, 'y': deltay, 'result': pass_result})

passes_df = exploration()

In [None]:
from scipy.stats import gaussian_kde
from numpy import linspace

complete_x = passes_df[passes_df['result'] == "C"]['x']
incomplete_x = passes_df[passes_df['result'] == "I"]['x']

plt.figure(figsize = (12, 5))
plt.subplot(1, 2, 1)
plt.hist(complete_x, density = True, color = 'green', bins = 45, alpha = 0.2, range = (0, 20))
plt.hist(incomplete_x, density = True, color = 'red', bins = 45, alpha = 0.2, range = (0, 20))
kde = gaussian_kde( complete_x )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(complete_x), max(complete_x), 300 )
# plot the results

plt.xlim((0, 20))
plt.plot( dist_space, kde(dist_space), color = 'green', ls = '--', label = 'complete passes')
kde = gaussian_kde( incomplete_x )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(incomplete_x), max(incomplete_x), 300 )


plt.plot( dist_space, kde(dist_space), color = 'red', ls = '--', label = 'incomplete passes')
plt.legend(loc='upper right')
plt.ylabel("density")
plt.xlabel("Yards from line of scrimmage along x-axis")
plt.title("X-axis positioning of defensive back", fontweight = "bold")
plt.xticks(2 * np.arange(1, 10))


plt.subplot(1, 2, 2)
complete_x = passes_df[passes_df['result'] == "C"]['y']
incomplete_x = passes_df[passes_df['result'] == "I"]['y']
plt.hist(complete_x, density = True, color = 'green', bins = 35, alpha = 0.2, range = (-7, 7))
plt.hist(incomplete_x, density = True, color = 'red', bins = 35, alpha = 0.2, range = (-7, 7))
kde = gaussian_kde( complete_x )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(complete_x), max(complete_x), 300 )
# plot the results
plt.xlim((-7, 7))
plt.plot( dist_space, kde(dist_space), color = 'green', ls = '--', label = 'complete passes')
kde = gaussian_kde( incomplete_x )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(incomplete_x), max(incomplete_x), 150 )
plt.plot( dist_space, kde(dist_space), color = 'red', ls = '--', label = 'incomplete passes')
plt.legend(loc='upper right')
plt.ylabel("density")
plt.xlabel("Yards from receiver along y-axis")
plt.title("Y-axis positioning of defensive back", fontweight = "bold")
plt.tight_layout()
plt.show()

In the left figure above, positioning along the x-axis appears to be clearly associated with whether a pass falls complete or incomplete. In the right figure, which shows positioning along the y-axis, the distributions for complete and incomplete passes have essentially the same mean. Although their variances differ slightly, both are centered at zero with a roughly Gaussian shape, which suggests that the remaining differences are plausibly noise.
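
A visual comparison like this can be complemented by a formal two-sample test. The sketch below applies a Kolmogorov-Smirnov test via `scipy.stats.ks_2samp` to synthetic stand-ins for the two y-axis distributions; the arrays are random numbers invented for illustration, so the printed values say nothing about the real data. On the actual data, one would pass the `y` columns of `passes_df` for complete and incomplete passes instead.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic stand-ins for the y-axis offsets on complete and incomplete passes.
complete_y = rng.normal(0.0, 1.0, size=2000)
incomplete_y = rng.normal(0.0, 1.0, size=1000)

result = ks_2samp(complete_y, incomplete_y)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")
```

A large p-value would be consistent with the two samples coming from the same distribution, while a small one would suggest the difference is not just noise.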

Thus we focus our attention solely on positioning along the x-axis (left figure). Fortunately, there is a good way to describe this type of positioning in the context of a real football game: these distances fall into one of three categories (press, normal, and soft coverage). Additionally, the figure above suggests that defensive backs who start closer to the receiver may do better at preventing a completed pass, but this is only an association. The causality and certainty of such a claim will be explored in depth throughout the rest of this analysis.

We also wish to be concrete about what we consider "press", "normal", or "soft" coverage. To do this, we plot the distribution of all defensive back starting positions in the data. There appear to be three clear "peaks" in the data, which matches the conventional idea of three types of coverage. To preserve this natural clustering, we draw the following partitions in the data to distinguish press, normal, and soft coverage.

We include a table along with the figure below, which explicitly outlines our conditions for the different types of coverage.

In [None]:
## 1 - D histogram of the yards off line of scrimmage
from scipy.stats import gaussian_kde
from numpy import linspace
plt.figure(figsize=(9, 5))
plt.hist(deltax, density = True, bins = 30, alpha = 0.2, range = (0, 20))
kde = gaussian_kde( deltax )
# these are the values over which your kernel will be evaluated
dist_space = linspace( min(deltax), max(deltax), 300 )
# plot the results
plt.plot( dist_space, kde(dist_space) )
plt.xlim((0, 20))
plt.xticks(2 * np.arange(1, 10))
plt.axvline(x = 4, color = 'red', ls = '--', linewidth = 3)
plt.axvline(x = 8, color = 'red', ls = '--', linewidth = 3)
plt.title("Distribution of defender starting positions, and identification of coverage types", fontweight = "bold")
plt.xlabel("Starting position from line of scrimmage")
plt.ylabel("density")
plt.show()

### Three types of defensive back coverage
| Coverage Type      | Starting Distance (yards) |
| :----------------: | :-----------------:|
| Press      | under 4        |
| Normal   | 4 to under 8         |
| Soft        | 8 or more          |
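
The coverage table can be expressed directly in code with `pandas.cut`. The sketch below is one possible way to label starting distances; the helper name `coverage_type` is hypothetical, and the bin edges are chosen to match the `< 4`, `>= 4` and `< 8`, and `>= 8` filters used when the models are built later in the notebook.

```python
import numpy as np
import pandas as pd

def coverage_type(start_dist):
    """Label starting distances (yards) as press [0, 4), normal [4, 8) or soft [8, inf)."""
    return pd.cut(start_dist, bins=[0, 4, 8, np.inf], right=False,
                  labels=["press", "normal", "soft"])

# Invented distances that include both boundary values.
labels = coverage_type(pd.Series([0.5, 4.0, 7.9, 8.0, 12.3]))
print(labels.tolist())
```

Negative distances fall outside every bin and would be labelled as missing, which may or may not be the desired behavior.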


Our exploratory analysis found no obvious outliers or latent variables affecting the question posed, and it motivated us to study the distance at which a defensive back lines up from the line of scrimmage as a fruitful route for analysis.

## Comparison of Defensive Back Positioning Schemes

In [None]:
data = {
    "avg_separation": [],
    "avg_separation_air": [],
    "down": [],
    "ori_difference": [],
    "rec": [],
    "rec_name": [],
    "rec_position": [],
    "rec_weight": [],
    "rec_height": [],
    "rec_avg_ori": [],
    "rec_avg_speed": [],
    "rec_max_speed": [],
    "rec_distance_ran": [],
    "def": [],
    "def_name": [],
    "def_weight": [],
    "def_height": [],
    "def_position": [],
    "def_avg_ori": [],
    "def_avg_speed": [],
    "def_max_speed": [],
    "def_distance_ran": [],
    "distance_to_go": [],
    "total_time": [],
    "route": [],
    "def_starting_dist": [],
    "sep_decreasing": [],
    "game_id": [],
    "play_id": [],
    "is_completion": []
}

for i in range(1, 18):
    print (i)
    week_str = "../input/nfl-big-data-bowl-2021/week" + str(i) + ".csv"
    df = pd.read_csv(week_str)

    unique_plays = df.drop_duplicates(['playId', 'gameId'])
    plays = np.unique(df['playId'])
    
    for i in range(len(unique_plays)):
        PLAY_NUMBER = unique_plays.iloc[i, :]['playId']
        game_id = unique_plays.iloc[i, :]['gameId']
        result = get_details(PLAY_NUMBER, game_id)
        
        play = df[df['playId'] == PLAY_NUMBER]
        play = play[play['gameId'] == game_id]
        play['nflId'] = play['nflId'].replace(np.nan, 0)
        
        try:
            total_time = time_to_num(play[play['event'] == 'pass_arrived']['time'].iloc[0])\
                            - time_to_num(play[play['event'] == 'ball_snap']['time'].iloc[0])
        except Exception:  # skip plays whose timing cannot be computed (e.g. a missing snap or pass_arrived event)
            continue
        
        try:
            receiver = ball_thrown_to(play)
            if np.isnan(receiver):
                continue
            defenders = find_closest_defender_details(play, receiver, 'in_air', 'snap_reached')
        except (IndexError, ValueError, TypeError):
            continue
        
        for i in range(len(defenders)):
            # get average separation
            data['avg_separation'].append(defenders['avg_distance'].iloc[i])
            data['avg_separation_air'].append(defenders['avg_distance_air'].iloc[i])
            
            data['sep_decreasing'].append(defenders['sep_decreasing'].iloc[i])
            data['game_id'].append(game_id)
            data['play_id'].append(PLAY_NUMBER)
            
            # get direction of play (plays going left will be flipped)
            dire = play.iloc[0, :]['playDirection']
            
            # get orientation and difference between orientation
            rec_ori = defenders['rec_ori'].iloc[i]
            def_ori = defenders['def_ori'].iloc[i]
            
            if dire == "left":
                rec_ori = 360 - rec_ori
                def_ori = 360 - def_ori
            
            data['rec_avg_ori'].append(rec_ori)
            data['def_avg_ori'].append(def_ori)
            data['ori_difference'].append(angle_diff(def_ori, rec_ori))
            
            data['rec_avg_speed'].append(defenders['rec_speed'].iloc[i])
            data['rec_max_speed'].append(defenders['rec_max_speed'].iloc[i])
            data['rec_distance_ran'].append(defenders['rec_dist'].iloc[i])
            
            data['def_avg_speed'].append(defenders['def_speed'].iloc[i])
            data['def_max_speed'].append(defenders['def_max_speed'].iloc[i])
            data['def_distance_ran'].append(defenders['def_dist'].iloc[i])
            data['route'].append(defenders['route'].iloc[i])
            data['def_starting_dist'].append(defenders['def_starting_dist'].iloc[i])
            
            # get information about receiver and defender
            rec_id = defenders['receiver'].iloc[i]
            receiver = players[players['nflId'] == rec_id].iloc[0]
            def_id = defenders['closest_def'].iloc[i]
            closest_def = players[players['nflId'] == def_id].iloc[0]
            
            data['def'].append(def_id)
            data['def_name'].append(closest_def['displayName'])
            data['def_position'].append(closest_def['position'])
            data['def_weight'].append(closest_def['weight'])
            data['def_height'].append(height_to_num(closest_def['height']))
            
            data['rec'].append(rec_id)
            data['rec_name'].append(receiver['displayName'])
            data['rec_position'].append(receiver['position'])
            data['rec_weight'].append(receiver['weight'])
            data['rec_height'].append(height_to_num(receiver['height']))
            
            data["is_completion"].append(1 if result == 'C' else 0)
            data["down"].append(all_plays[(all_plays['playId'] == PLAY_NUMBER) & \
                                          (all_plays['gameId'] == game_id)]['down'].iloc[0])
            data["distance_to_go"].append(all_plays[(all_plays['playId'] == PLAY_NUMBER) & \
                                          (all_plays['gameId'] == game_id)]['yardsToGo'].iloc[0])
            data["total_time"].append(total_time)

all_defenders = pd.DataFrame(data)

all_defenders.head()

For each play in the season, we collected data about the receiver targeted and the closest defender as the ball was in the air, tracking various features:

* avg. separation (from snap to throw)
* avg. separation (when ball is in the air)
* avg. difference in orientation between receiver and defender
* down and distance
* receiver position, height, weight
* receiver avg. orientation
* receiver avg. speed
* receiver max speed
* receiver total distance ran
* defender position, height, weight
* defender avg. orientation
* defender avg. speed
* defender max speed
* defender total distance ran
* time elapsed from snap to pass arrival
* defender starting distance away from the receiver
* proportion of frames where separation is decreasing
* if the play resulted in a completion
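
To make the "proportion of frames where separation is decreasing" feature concrete, here is a minimal sketch of one way it could be computed from a per-frame separation series. The actual definition lives in `find_closest_defender_details` in the Appendix and may differ; the function name and the sample values here are assumptions for illustration.

```python
import numpy as np

def proportion_decreasing(separation):
    """Fraction of frame-to-frame steps in which the separation shrinks."""
    steps = np.diff(np.asarray(separation, dtype=float))
    return float(np.mean(steps < 0)) if steps.size else float("nan")

# Invented separation values (yards) over five consecutive frames.
print(proportion_decreasing([5.0, 4.0, 4.5, 3.0, 2.0]))
```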

Orientation was adjusted based on the direction of the play.
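
The orientation handling can be sketched in isolation. The `angle_diff` helper used in the loop is defined in the Appendix, so the version below is only an assumed equivalent: it returns the smallest absolute difference between two angles in degrees. The flip mirrors the `360 - angle` adjustment in the loop, with an added modulo so the result stays in [0, 360).

```python
def flip_orientation(angle_deg, play_direction):
    """Mirror an orientation for plays moving left, as in the 360 - angle adjustment."""
    return (360 - angle_deg) % 360 if play_direction == "left" else angle_deg

def angle_diff_sketch(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180] degrees."""
    return abs((a_deg - b_deg + 180) % 360 - 180)

print(angle_diff_sketch(350, 10))
print(angle_diff_sketch(flip_orientation(350, "left"), flip_orientation(10, "left")))
```

With this definition the orientation difference is the same before and after the flip, so only the average orientation features depend on the direction adjustment.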
As a quick test, we rank defenders by their lowest average separation while the ball is in the air (minimum 20 targets).

In [None]:
# Mean separation while the ball is in the air, per defender (minimum 20 targets)
per_defender = all_defenders.groupby('def_name')['avg_separation_air'].agg(['mean', 'count'])
per_defender = per_defender[per_defender['count'] >= 20]
sorted_df = per_defender[['mean']].rename(columns={'mean': 'avg_separation_air'})
sorted_df = sorted_df.sort_values('avg_separation_air')
sorted_df.head(10)

With this data, we then fit three separate logistic regressions on these inputs to predict the probability of a catch. The models cover the press, normal, and soft cases, defined by the defender starting less than 4, between 4 and 8, and at least 8 yards from the receiver at the snap (these groups were chosen by looking at the distribution of the data earlier). Non-numerical (categorical) inputs are label-encoded.

We weighted the classes so that each model predicts completions at roughly the real completion rate. This guards against a model that is biased toward completions because of the uneven numbers of completions and incompletions (a model that predicts only completions would have a high accuracy but would not be very insightful).
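
To illustrate how class weights move the predicted completion rate, the sketch below fits scikit-learn's `LogisticRegression` on synthetic, imbalanced data with several weights on the "incompletion" class. The data, the weights, and the variable names are invented for illustration and do not reproduce the models in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic, imbalanced data in which label 1 ("completion") is the majority class.
X = rng.normal(size=(3000, 3))
logits = 0.8 + 1.0 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.uniform(size=3000) < 1 / (1 + np.exp(-logits))).astype(int)

predicted_rates = {}
for w0 in (1.0, 1.5, 2.5):
    # A larger weight on class 0 penalizes missed incompletions more heavily.
    clf = LogisticRegression(class_weight={0: w0, 1: 1.0}).fit(X, y)
    predicted_rates[w0] = clf.predict(X).mean()
    print(f"weight on class 0 = {w0}: predicted rate = {predicted_rates[w0]:.3f}, "
          f"actual rate = {y.mean():.3f}")
```

In practice the weight would be tuned until the predicted rate on held-out data is close to the observed completion rate for that coverage type.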

We chose a logistic regression model to predict the probability of a catch for several reasons. It is very efficient to train, which matters given the high volume of data, and it makes no assumptions about the distribution of classes in the feature space. Because many of the feature weights differed between the coverages, we did not want any such assumptions to be made. The model also readily provides a measure of how important each feature is, along with the direction of its association, which was essential since our main goal was to identify the most important features for predicting completions. Finally, this model is less prone to overfitting.
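
As an illustration of how feature importance and direction can be read from such a model, the sketch below standardizes synthetic features, fits a logistic regression, and ranks the coefficients by absolute value. The feature names and data are hypothetical; with standardized inputs, coefficient magnitudes are roughly comparable across features, which is what makes the ranking meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with hypothetical names; only the first two drive the label.
names = ["feature_a", "feature_b", "feature_c"]
X = rng.normal(size=(4000, 3)) * np.array([1.0, 5.0, 0.2])
logits = 1.5 * X[:, 0] - 0.2 * X[:, 1]
y = (rng.uniform(size=4000) < 1 / (1 + np.exp(-logits))).astype(int)

X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_std, y)

# Sign gives the direction of association, magnitude gives the relative importance.
importance = pd.Series(clf.coef_[0], index=names).sort_values(key=np.abs, ascending=False)
print(importance)
```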

In [None]:
import warnings
from sklearn.exceptions import DataConversionWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

def encode_df_col(df, col):
    # Replace a categorical column with integer labels, in place
    df[col] = LabelEncoder().fit_transform(df[col])
    
def clean(df):
    return df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

desired_cols = ['avg_separation', 'avg_separation_air', 'down', 'ori_difference', 'rec_position', 'rec_weight', 'rec_height', 'rec_avg_ori',
       'rec_avg_speed', 'rec_max_speed', 'rec_distance_ran', 'def_weight', 'def_height', 'def_position', 'def_avg_ori',
       'def_avg_speed', 'def_max_speed', 'def_distance_ran', 'distance_to_go',
       'total_time', 'def_starting_dist', 'sep_decreasing', 'is_completion']

# Drop rows with missing or infinite values in every coverage group, not just soft
press = clean(all_defenders[all_defenders['def_starting_dist'] < 4][desired_cols])
normal = clean(all_defenders[(all_defenders['def_starting_dist'] >= 4) & (all_defenders['def_starting_dist'] < 8)][desired_cols])
soft = clean(all_defenders[all_defenders['def_starting_dist'] >= 8][desired_cols])
for df in [press, normal, soft]:
    encode_df_col(df, 'rec_position')
    encode_df_col(df, 'def_position')
    
x_press = press.drop(press.columns[[len(desired_cols)-1]], axis=1)
y_press = press.drop(press.columns[np.arange(len(desired_cols)-1)], axis=1)

scaler_press = StandardScaler()

x_press = scaler_press.fit_transform(x_press)

x_normal = normal.drop(normal.columns[[len(desired_cols)-1]], axis=1)
y_normal = normal.drop(normal.columns[np.arange(len(desired_cols)-1)], axis=1)

scaler_normal = StandardScaler()
x_normal = scaler_normal.fit_transform(x_normal)

x_soft = soft.drop(soft.columns[[len(desired_cols)-1]], axis=1)
y_soft = soft.drop(soft.columns[np.arange(len(desired_cols)-1)], axis=1)

scaler_soft = StandardScaler()
x_soft = scaler_soft.fit_transform(x_soft)

X_train_press, X_test_press, y_train_press, y_test_press = train_test_split(x_press, y_press, test_size=0.2, random_state=42)
X_train_normal, X_test_normal, y_train_normal, y_test_normal = train_test_split(x_normal, y_normal, test_size=0.2, random_state=42)
X_train_soft, X_test_soft, y_train_soft, y_test_soft = train_test_split(x_soft, y_soft, test_size=0.2, random_state=42)

print ("Press coverage: ")
clf_press = LogisticRegression(class_weight={ 0:1.5, 1:1 })
clf_press.fit(X_train_press, y_train_press)
pred_press = clf_press.predict(X_test_press)
print("Accuracy: " + "{perc:.2f}%".format(perc=100 * accuracy_score(y_test_press, pred_press)))
print ("% completions predicted: " + "{perc:.2f}%".format(perc=100 * np.count_nonzero(pred_press) / len(pred_press)))

print ("")

print ("Normal coverage: ")
clf_normal = LogisticRegression(class_weight={ 0:2.6, 1:1 })
clf_normal.fit(X_train_normal, y_train_normal)
pred_normal = clf_normal.predict(X_test_normal)
print("Accuracy: " + "{perc:.2f}%".format(perc=100 * accuracy_score(y_test_normal, pred_normal)))
print ("% completions predicted: " + "{perc:.2f}%".format(perc=100 * np.count_nonzero(pred_normal) / len(pred_normal)))

print ("")

print ("Soft coverage: ")
clf_soft = LogisticRegression(class_weight={ 0:2.2, 1:1 })
clf_soft.fit(X_train_soft, y_train_soft)
pred_soft = clf_soft.predict(X_test_soft)
print("Accuracy: " + "{perc:.2f}%".format(perc=100 * accuracy_score(y_test_soft, pred_soft)))
print ("% completions predicted: " + "{perc:.2f}%".format(perc=100 * np.count_nonzero(pred_soft) / len(pred_soft)))
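
The cell above fits each `StandardScaler` on the full coverage subset before splitting. One possible variant, sketched here on synthetic data rather than the tracking data, puts the scaler and the classifier in a scikit-learn `Pipeline` so that the scaling statistics come from the training rows only. The array shapes, the random data, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one coverage subset (NOT the tracking data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit inside the pipeline, so it only ever sees the training rows.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight={0: 1.5, 1: 1}),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
coef = model.named_steps["logisticregression"].coef_[0]
scaler_mean = model.named_steps["standardscaler"].mean_
print("Accuracy: {perc:.2f}%".format(perc=100 * test_accuracy))
```

The fitted weights are still available through `named_steps`, so the sensitivity analysis would work the same way with this setup.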

### Sensitivity Analysis 

Logistic regression is highly interpretable. The sign of each weight indicates whether a larger value of that feature raises or lowers the predicted completion probability. Because the inputs were standardized (mean 0, standard deviation 1), the weights are on a common scale and can be compared directly: a larger magnitude means the variable has a greater impact on the model, which we interpret as a more important variable when analyzing defenses. This comparison is the critical insight of this analysis.
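
As a rough illustration of how standardized weights can be read, the sketch below fits a logistic regression on two synthetic features with very different raw scales and converts each weight into an odds ratio per one standard deviation. The data, the feature names `feature_a` and `feature_b`, and the coefficients used to generate the labels are all made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic features on very different raw scales (illustrative only).
raw = np.column_stack([rng.normal(0, 1, 2000), rng.normal(0, 50, 2000)])
logits = 1.5 * raw[:, 0] - 0.01 * raw[:, 1]
y = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize so each weight is "change in log-odds per +1 standard deviation".
z = StandardScaler().fit_transform(raw)
clf = LogisticRegression().fit(z, y)

weights = clf.coef_[0]
odds_ratios = np.exp(weights)  # multiplicative change in odds per +1 SD
for name, w, r in zip(["feature_a", "feature_b"], weights, odds_ratios):
    print(f"{name}: weight={w:+.2f}, odds ratio per SD={r:.2f}")
```

After standardization the two weights are directly comparable even though the raw features differ in scale by a factor of 50.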

In [None]:
f, ax = plt.subplots(1,1,figsize=(12,5))
w = 0.2

ax.bar(x=np.arange(len(desired_cols)-1)-w, height=clf_soft.coef_[0], width=w, alpha=0.5, color='blue', align='center')
ax.bar(x=np.arange(len(desired_cols)-1), height=clf_normal.coef_[0], width=w, alpha=0.5, color='orange', align='center')
ax.bar(x=np.arange(len(desired_cols)-1)+w, height=clf_press.coef_[0], width=w, alpha=0.5, color='green', align='center')
ax.autoscale(tight=True)

plt.xticks(np.arange(len(desired_cols)-1), desired_cols[:-1], alpha=0.3, rotation=90)

colors = {'soft':'blue', 'normal':'orange', 'press':'green'}         
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label], alpha=0.5) for label in labels]
plt.legend(handles, labels)
plt.title ("Logistic Regression Weights Comparison")
plt.axhline(0, color = 'black', lw = 1, alpha = 0.5)
plt.show()

From these weights, we can see that the most important factor in all levels of coverage is average separation when the ball is in the air. This serves as a bit of a sanity check, as it is clear that a wide open receiver (high separation) is more likely to be a completion than a closely covered receiver. The other weights provide a bit more nuance in the analysis of defenses. 

We will get the top 7 most significant variables (by the top 7 weights in terms of magnitude). These are each labeled with whether they contribute to a completion (+) or incompletion (-).

In [None]:
clf_soft_abs = np.abs(clf_soft.coef_[0])
clf_normal_abs = np.abs(clf_normal.coef_[0])
clf_press_abs = np.abs(clf_press.coef_[0])

biggest_soft = []
biggest_normal = []
biggest_press = []
for i in range(len(clf_soft.coef_[0])):
    biggest_soft.append([clf_soft_abs[i], i])
    biggest_normal.append([clf_normal_abs[i], i])
    biggest_press.append([clf_press_abs[i], i])
    
biggest_soft.sort()
biggest_normal.sort()
biggest_press.sort()

top_n = 7

weight_rank = [[] for i in range(4)]
weight_rank[0] = [i+1 for i in range(top_n)]

for i in range (top_n):
    weight_rank[1].append(desired_cols[biggest_soft[-1-i][1]] + " (" + \
           ("+" if clf_soft.coef_[0][biggest_soft[-1-i][1]] > 0 else "-") + ")")

for i in range (top_n):
    weight_rank[2].append (desired_cols[biggest_normal[-1-i][1]] + " (" + \
           ("+" if clf_normal.coef_[0][biggest_normal[-1-i][1]] > 0 else "-") + ")")

for i in range (top_n):
    weight_rank[3].append (desired_cols[biggest_press[-1-i][1]] + " (" + \
           ("+" if clf_press.coef_[0][biggest_press[-1-i][1]] > 0 else "-") + ")")

print ("rank | soft | normal | press")
print ("--- | --- | --- | ---")
print ("\n".join(["|".join(x) for x in np.transpose(weight_rank)]))

### Top Weights for probability models of each coverage type
rank | soft | normal | press
--- | --- | --- | ---
1|avg_separation_air (+)|avg_separation_air (+)|avg_separation_air (+)
2|def_distance_ran (-)|total_time (-)|def_avg_speed (-)
3|rec_distance_ran (+)|def_max_speed (+)|def_max_speed (+)
4|rec_avg_speed (-)|avg_separation (-)|def_distance_ran (+)
5|avg_separation (-)|def_avg_speed (-)|total_time (-)
6|def_max_speed (+)|rec_position (+)|rec_avg_speed (-)
7|def_avg_speed (+)|def_weight (+)|sep_decreasing (-)
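
A ranking like the one above can also be produced more compactly with `np.argsort`. The helper below is a sketch, not the code that generated the table; `top_features` and the example weights are hypothetical.

```python
import numpy as np

def top_features(coefs, names, top_n=7):
    """Return the top_n feature names by |weight|, tagged with the weight's sign."""
    coefs = np.asarray(coefs)
    order = np.argsort(-np.abs(coefs))[:top_n]  # indices, largest magnitude first
    return [f"{names[i]} ({'+' if coefs[i] > 0 else '-'})" for i in order]

# Hypothetical weights, for illustration only.
example_names = ["avg_separation_air", "total_time", "def_max_speed", "down"]
example_coefs = [1.2, -0.6, 0.4, 0.05]
ranked = top_features(example_coefs, example_names, top_n=3)
print(ranked)
```

With a fitted model, the call would look like `top_features(clf_soft.coef_[0], desired_cols[:-1])`.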

These weights show how much each factor matters in the way defenders cover.

1. The biggest factor is the average separation while the ball is in the air. More separation means a completion is more likely.

2. In normal coverage, most of the weights are less significant than average separation. However, the defender's max speed has a positive weight while the defender's average speed has a negative weight. This suggests that a defender will be unsuccessful (high catch probability for the receiver) when there are more changes in speed or direction, situations which cause the defender to slow down. In turn, this suggests that a hitch route may be successful against normal coverage.

3. The weight on the defender's distance ran varies greatly across the three cases. In soft coverage, the weight is large and negative. In normal coverage, the weight is near 0. In press coverage, the weight is large and positive. This indicates that in soft coverage, a defender running far has had time to recover the separation the receiver starts with, whereas in press coverage, a defender running far means the receiver has had time to create separation.

4. The distances the defender and receiver ran are both important for soft coverage (the distance the receiver ran is only weighted heavily in soft coverage). A receiver needs to run further than the defender (creating separation in the process) for an overall net zero contribution from these inputs. This also suggests that a defender does not need to move as far to be successful in soft coverage.

5. The longer a play goes, the less likely it is to be a completion (which may be indicative of a quarterback scrambling). Time of play has more of an effect on press coverage, which is tied to closer (likely man) coverages: if receivers aren't open after a few seconds, the quarterback may simply throw the ball away or be sacked.
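
The "net zero contribution" in point 4 can be written out as a short worked equation. This is a sketch in standardized units, and the symbols are notation introduced here rather than names from the notebook: let $w_r > 0$ and $w_d < 0$ be the soft-coverage weights on `rec_distance_ran` and `def_distance_ran`, and let $z_r$ and $z_d$ be the corresponding standardized inputs. Their combined contribution to the log-odds of a completion vanishes when

$$
w_r z_r + w_d z_d = 0 \quad\Longleftrightarrow\quad z_r = \frac{|w_d|}{w_r}\, z_d .
$$

Since `def_distance_ran` ranks above `rec_distance_ran` by magnitude in the soft model, $|w_d| / w_r > 1$, so for $z_d > 0$ the receiver's standardized distance must exceed the defender's for the two terms to cancel. Because the inputs are standardized, this comparison is in standard deviations from each feature's mean, not in raw yards.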

### Analysis of Special Cases

Here we study, for each type of coverage, one particular play which has a very low catch probability. This obviously indicates a successful defensive effort. We look at the attributes of this play to draw some general conclusions of such plays predicted by the model.
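
The cells in this section pick a low-probability play with `np.argpartition`. As a brief sketch of what that call does, using made-up probabilities rather than model output: after partitioning at `k`, the first `k` positions hold the indices of the `k` smallest values, in no guaranteed order.

```python
import numpy as np

# Hypothetical predicted catch probabilities, for illustration only.
probs = np.array([0.81, 0.12, 0.55, 0.93, 0.20, 0.67, 0.11, 0.74])

k = 3
# Positions 0..k-1 hold the indices of the k smallest values (unordered).
lowest_idx = np.argpartition(probs, k)[:k]
# Sort just those k indices if a ranked list is needed.
lowest_ranked = lowest_idx[np.argsort(probs[lowest_idx])]
print(lowest_ranked, probs[lowest_ranked])
```

Indexing the partitioned array at a fixed position below `k` therefore returns one of the `k` lowest-probability plays, though not necessarily the same rank on every dataset.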

### Successful soft coverage

Exploring one of the lowest catch probabilities predicted by the soft coverage model, we see a 12% catch probability for an RB running a flat route. The key variables that contributed to this low catch probability are:

1. Extremely low separation in the air
2. High average separation (represents a successful soft coverage, maintaining a cushion of space until the ball is thrown so as not to get beat). 
3. The defender's average speed was lower than the receiver's average speed, although the max speeds were similar, signifying that the defender made up ground only when the RB entered the appropriate zone. The orientation also allows us to understand that this was likely a zone coverage. 

#### Conclusions
For soft coverage, a defensive back should not commit to a receiver too early, but needs the speed to catch up if the receiver chooses to accelerate.

In [None]:
pred_soft = clf_soft.predict_proba(X_test_soft)[:,1]

worst = np.argpartition(pred_soft, 10)[3]

print ("Low catch probability (soft coverage): {prob:.2f}%".format(prob=pred_soft[worst] * 100))

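# inverse_transform expects a 2D array, so reshape the single row
worst_features = scaler_soft.inverse_transform(X_test_soft[worst].reshape(1, -1))[0]
worst_play = list(zip(desired_cols, worst_features))
_, worst_separation = worst_play[0]
# Hard-coded identifiers of the play examined here
play_id = 840
game_id = 2018100709

print (list(all_plays[(all_plays['playId'] == play_id) & (all_plays['gameId'] == game_id)]['playDescription'])[0])

# Match with a tolerance: the round trip through the scaler is not bit-exact
all_defenders[np.isclose(all_defenders['avg_separation'], worst_separation)]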

### Successful normal coverage

Exploring one of the lowest catch probabilities predicted by the normal coverage model, we see an 11.28% catch probability for a deep ball thrown on a go route. The key variables that contributed to this low catch probability are:

1. Low separation in the air
2. High total time until the ball reached the target (lots of time for receiver to make up yardage)
3. High average defender speed relative to receiver average speed. 

#### Conclusions
In order to be successful at normal coverage, a defender needs to keep track of the receiver extremely tightly. This type of coverage may favor a receiver with high speed and one who is heavier/taller. The defender should also have a high average speed, but not necessarily a high top speed, which means it is less likely that they are caught in their tracks, but rather consistently following their receiver. 

In [None]:
pred_normal = clf_normal.predict_proba(X_test_normal)[:,1]

worst = np.argpartition(pred_normal, 10)[6]

print ("Low catch probability (normal coverage): {prob:.2f}%".format(prob=pred_normal[worst] * 100))

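# inverse_transform expects a 2D array, so reshape the single row
worst_features = scaler_normal.inverse_transform(X_test_normal[worst].reshape(1, -1))[0]
worst_play = list(zip(desired_cols, worst_features))
_, worst_separation = worst_play[0]
# Hard-coded identifiers of the play examined here
play_id = 2098
game_id = 2018121603

print (list(all_plays[(all_plays['playId'] == play_id) & (all_plays['gameId'] == game_id)]['playDescription'])[0])

# Match with a tolerance: the round trip through the scaler is not bit-exact
all_defenders[np.isclose(all_defenders['avg_separation'], worst_separation)]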

### Successful press coverage

Exploring one of the lowest catch probabilities predicted by the press coverage model, we see a 20% catch probability for another deep ball, this time thrown on a post route. The key variables that contributed to this low catch probability are:

1. Fairly low separation in the air (given the speeds, this is significant)
2. Extremely high average defender speed (only slightly lower than receiver metric)
3. The maximum defender speed was not vastly different than the average speed, meaning that the defender maintained the high speed and was not beaten on something like a double move. 

The defender's speed was what made the model predict low catch probability even though the defender had to run a large distance.

#### Conclusions
Press coverage is successful when the defender is fast and can respond quickly to the situation. In this play, the defender is able to change direction extremely quickly, which is significant because he starts very close to his receiver and cannot afford to get beat.

In [None]:
pred_press = clf_press.predict_proba(X_test_press)[:,1]

worst = np.argpartition(pred_press, 10)[6]

print ("Low catch probability (press coverage): {prob:.2f}%".format(prob=pred_press[worst] * 100))

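# inverse_transform expects a 2D array, so reshape the single row
worst_features = scaler_press.inverse_transform(X_test_press[worst].reshape(1, -1))[0]
worst_play = list(zip(desired_cols, worst_features))
_, worst_separation = worst_play[0]
# Hard-coded identifiers of the play examined here
play_id = 3613
game_id = 2018091610

print (list(all_plays[(all_plays['playId'] == play_id) & (all_plays['gameId'] == game_id)]['playDescription'])[0])

# Match with a tolerance: the round trip through the scaler is not bit-exact
all_defenders[np.isclose(all_defenders['avg_separation'], worst_separation)]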

## Conclusions

In this report, we motivate and perform an analysis of the defender's starting position during a pass play. We begin by showing why this is an important problem to study and how the problem is significantly nuanced. We heuristically define three types of coverage which depend on this metric: press, normal, and soft. Then, we demonstrate that a high amount of the variance regarding whether a pass is completed can be explained by the feature set we propose. We build logistic models for each of the coverage types and use sensitivity analysis to study which features increase and decrease in importance in each model. Using these models, evaluated on held-out test data, we are then able to show what types of plays might be successful against each type of coverage, which allows us to make suggestions based on which attributes of a certain type of coverage are important.

Importantly, we find that what makes a certain type of coverage (especially press vs. soft) successful **depends on different criteria** (listed above), which teams should look into depending on the defensive coverage scheme they wish to use. Further research could formalize these criteria and carry out a probabilistic analysis of how they empirically affect the catch probability.

## Appendix
This appendix contains the helper functions used throughout this report.

In [None]:
def angle_diff (ang1, ang2):
    """Given two angles from 0 to 360, find the difference between the angles, ranging from -180 to 180"""
    diff = ang1 - ang2
    if diff < -180:
        diff += 360
    elif diff > 180:
        diff -= 360
    return diff

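def ball_thrown_to(play):
    """Given a play, figure out to whom the ball is thrown, by finding the receiver closest to
    the football at the 'pass_arrived' frame, which exists for every throw."""

    pass_arrived = play[play['event'] == 'pass_arrived']
    football = pass_arrived[pass_arrived['displayName'] == "Football"]
    if len(football) == 0:
        return float("nan")
    ball_x, ball_y = football['x'].iloc[0], football['y'].iloc[0]

    closest_receiver = float("nan")  # np.float was removed in NumPy 1.24
    mindist = float("inf")
    for rec in find_all_receivers(play):
        rec_row = pass_arrived[pass_arrived['nflId'] == rec]
        if len(rec_row) == 0:
            continue
        # Distance from this receiver to the football at the arrival frame
        dist = np.hypot(rec_row['x'].iloc[0] - ball_x, rec_row['y'].iloc[0] - ball_y)
        if dist < mindist:
            mindist = dist
            closest_receiver = rec
    return closest_receiver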

def distance(play, id1, id2):
    """For a given play, given the ids of two players, calculate the average distance per frame
    between them over the frames in `play`. Helper function"""

    p1 = play[play['nflId'] == id1]
    p2 = play[play['nflId'] == id2]

    # No frames for either player: treat them as infinitely far apart
    if len(p1) == 0 or len(p2) == 0:
        return float('inf')

    x1 = np.array(p1['x'])
    x2 = np.array(p2['x'])
    y1 = np.array(p1['y'])
    y2 = np.array(p2['y'])
    dist = np.sum(((x1 - x2) **2 + (y1 - y2) ** 2) ** 0.5)
    # Return value is a float in yards
    return dist/len(p1)

def dist_decreasing(play, id1, id2):
    """For a given play, given the id's of two players, calculate the proportion of frames
    where the distance between them decreases"""
     
    p1 = play[play['nflId'] == id1]
    p2 = play[play['nflId'] == id2]
     
    x1 = np.array(p1['x'])
    x2 = np.array(p2['x'])
    y1 = np.array(p1['y'])
    y2 = np.array(p2['y'])
    dist = ((x1 - x2) **2 + (y1 - y2) ** 2) ** 0.5
    change = dist[1:] - dist[:-1]
    
    if len (p1) <= 1:
        return 0
    
    proportion = np.count_nonzero(change < 0) / len (change)
    
    # Return value is a proportion between 0 and 1
    return proportion


def find_all_receivers(play):
    """Takes in a DF of a certain play, and returns the nfl IDs of all offensive receivers."""
    temp = play.dropna(subset = ['route'])
    return np.unique(temp['nflId'])

def find_closest_defender(play, interval):
    """Given list of IDs for offensive receivers, find their closest defender across an interval.
    Returns a DF with the schema: (receiver, closest_def, average distance).
    params: 
    play - tracking DF for a certain play, from weekN.csv dataset
    interval - pre_throw, in_air, after, all 
    """
    receivers = find_all_receivers(play)
    closest_defenders = []
    dst = []

    if len(receivers) == 0:
        raise IndexError("No receivers here")

    recs = play[play['nflId'].isin(receivers)]
    side = np.unique(recs['team'])
    assert len(side) == 1

    if side[0] == "home":
        facing = "away"
    elif side[0] == "away":
        facing = "home"
    else:
        raise ValueError("Bad side given.")

    opponents = play[play['team'] == facing]
    # unique list of opponents
    opponents = np.unique(opponents['nflId'])

    ## Setting Frames and subsetting based on intervals
    snap_frame = play[play['event'] == 'ball_snap']['frameId'].iloc[0]
    throw_frame = play[play['event'] == 'pass_forward']['frameId'].iloc[0]
    end_signals = ['pass_arrived', 'pass_outcome_caught',
       'pass_outcome_incomplete', 'pass_outcome_interception',
       'pass_outcome_touchdown']
    reached_frame = play[play['event'].isin(end_signals)]['frameId'].iloc[0]


    num_frames = np.max(play['frameId'])
    # If specified all or nothing
    num_index = np.arange(snap_frame, num_frames)
    if interval == "pre_throw":
        num_index = np.arange(snap_frame, throw_frame)
    elif interval == "in_air":
        num_index = np.arange(throw_frame, reached_frame)
    elif interval == "after":
        num_index = np.arange(reached_frame, num_frames)


    # Subset play DF to contain desired frames
    positions = play[play['frameId'].isin(num_index)]

    # Begin distance calculations

    for rec in receivers:
    # Store per-frame distances in hashmap
        distances = {}
        for opp in opponents:
            dist = distance(positions, rec, opp)
            distances[opp] = dist

        closest = min(distances, key = distances.get)
        closest_defenders.append(closest)
        dst.append(distances[closest])

    return pd.DataFrame({'receiver': receivers, 'closest_def': closest_defenders, 'avg_distance': dst})

def find_closest_defender_details(play, receiver, rec_interval, calc_interval):
  """Given the ID of one receiver, find his closest defender over one interval (rec_interval)
  and compute statistics for the pair over a second interval (calc_interval).
  Returns a one-row DF with the receiver, closest defender, average distances, orientations,
  speeds, distances ran, route, starting distance, and proportion of frames with decreasing separation.
  params: 
  play - tracking DF for a certain play, from weekN.csv dataset
  receiver - nflId of the receiver
  rec_interval, calc_interval - pre_throw, in_air, after, snap_reached, or all
  """
  receivers = [receiver]
  
  closest_defenders = []
  dst = []
  rec_ori = []
  def_ori = []
  def_speed = []
  rec_speed = []
  rec_max_speed = []
  def_max_speed = []
  def_dist = []
  rec_dist = []
  route = []
  def_starting_dist = []
  avg_distance_air = []
  sep_decreasing = []

  if len(receivers) == 0:
    raise IndexError("No receivers here")

  recs = play[play['nflId'].isin(receivers)]
  side = np.unique(recs['team'])
  assert len(side) == 1

  if side[0] == "home":
      facing = "away"
  elif side[0] == "away":
      facing = "home"
  else:
      raise ValueError("Bad side given.")
      
  opponents = play[play['team'] == facing]
  # unique list of opponents
  opponents = np.unique(opponents['nflId'])

  ## Setting Frames and subsetting based on intervals
  snap_frame = play[play['event'] == 'ball_snap']['frameId'].iloc[0]
  throw_frame = play[play['event'] == 'pass_forward']['frameId'].iloc[0]
  end_signals = ['pass_arrived', 'pass_outcome_caught',
       'pass_outcome_incomplete', 'pass_outcome_interception',
       'pass_outcome_touchdown']
  reached_frame = play[play['event'].isin(end_signals)]['frameId'].iloc[0]

  num_frames = np.max(play['frameId'])
  # If specified all or nothing
  rec_num_index = np.arange(snap_frame, num_frames)
  if rec_interval == "pre_throw":
    rec_num_index = np.arange(snap_frame, throw_frame)
  elif rec_interval == "in_air":
    rec_num_index = np.arange(throw_frame, reached_frame)
  elif rec_interval == "after":
    rec_num_index = np.arange(reached_frame, num_frames)
  elif rec_interval == "snap_reached":
    rec_num_index = np.arange(snap_frame, reached_frame)
    
  # If specified all or nothing
  num_index = np.arange(snap_frame, num_frames)
  if calc_interval == "pre_throw":
    num_index = np.arange(snap_frame, throw_frame)
  elif calc_interval == "in_air":
    num_index = np.arange(throw_frame, reached_frame)
  elif calc_interval == "after":
    num_index = np.arange(reached_frame, num_frames)
  elif calc_interval == "snap_reached":
    num_index = np.arange(snap_frame, reached_frame)

  # Subset play DF to contain desired frames
  positions_rec = play[play['frameId'].isin(rec_num_index)]
  positions = play[play['frameId'].isin(num_index)]

  # Begin distance calculations

  for rec in receivers:
    # Store per-frame distances in hashmap
    distances = {}
    distances_full = {}
    for opp in opponents:
      dist = distance(positions_rec, rec, opp)
      distances[opp] = dist
    
    closest = min(distances, key = distances.get)
    closest_defenders.append(closest)
    dst.append(distance(positions, rec, closest))
    avg_distance_air.append(distance(positions_rec, rec, closest))
    
    sep_decreasing.append(dist_decreasing(positions, rec, closest))
    
    ori, avg_speed, max_speed, dist = play_details(positions, rec)
    rec_ori.append(ori)
    rec_speed.append(avg_speed)
    rec_max_speed.append (max_speed)
    rec_dist.append(dist)
    
    ori, avg_speed, max_speed, dist = play_details(positions, closest)
    def_ori.append(ori)
    def_speed.append(avg_speed)
    def_max_speed.append (max_speed)
    def_dist.append(dist)
    
    # get most common route (only make it nan if that's the only route)
    route_frames = positions[positions['nflId'] == rec].dropna(subset=['route'])
    rt = np.nan
    if len(route_frames) > 0:
        rt = route_frames['route'].value_counts().idxmax()
    route.append(rt)
    
    defender = play[(play['event'] == 'ball_snap') & (play['nflId'] == closest)]
    receiver = play[(play['event'] == 'ball_snap') & (play['nflId'] == rec)]
    
    # .iloc[0] extracts the scalar; calling float() on a Series is deprecated in recent pandas
    def_starting_dist.append (np.abs(float(defender['x'].iloc[0]) - float(receiver['x'].iloc[0])))

  return pd.DataFrame({'receiver': receivers, 'closest_def': closest_defenders, 'avg_distance': dst, \
                       'rec_ori': rec_ori, 'def_ori': def_ori, 'rec_speed': rec_speed, 'def_speed': def_speed, \
                       'rec_max_speed': rec_max_speed, 'def_max_speed': def_max_speed, 'rec_dist': rec_dist, 'def_dist': def_dist, \
                       'route': route, 'def_starting_dist': def_starting_dist, 'avg_distance_air': avg_distance_air, \
                       'sep_decreasing': sep_decreasing})

def get_details(playId, gameId):
    """Given a gameId, and a play ID (this uniquely defines a play), we extract a few important statistics.
    Namely, we have the following returns:
    is_completed: whether pass play is completed
    yardage: number of yards gained
    yardline: starting yardline
    """
    temp = all_plays[(all_plays['playId'] == playId) & (all_plays['gameId'] == gameId)]
    assert len(temp) == 1
    complete = list(temp['passResult'])[0]
    ## TODO: Add other fields as needed
    return complete

def get_ball_details (gameId, playId, df):
    """
    Given a DataFrame with frame-details of plays, and game ID and play ID
    return the average speed of the ball, the direction it is heading (in degrees),
    and the distance thrown
    """
    ball = df[(df['playId'] == playId) & (df['gameId'] == gameId) & (df['displayName'] == 'Football')]

    # speed is just an average
    speed = ball['s'].mean()

    # direction is the angle from where the pass is thrown to where it arrives
    temp = ball[ball['event'] == 'pass_forward']
    assert len(temp) == 1
    start_x, start_y = float(temp['x'].iloc[0]), float(temp['y'].iloc[0])

    temp = ball[ball['event'] == 'pass_arrived']
    assert len(temp) == 1
    end_x, end_y = float(temp['x'].iloc[0]), float(temp['y'].iloc[0])

    distance_thrown = np.sqrt((end_x - start_x)**2 + (end_y - start_y)**2)
    if distance_thrown == 0:
        return (speed, np.nan, 0.0)

    # The original referenced an undefined `y`; the y-displacement is assumed here,
    # which measures the angle from the +y axis, ranging from 0 to 180 deg
    direction = np.arccos((end_y - start_y) / distance_thrown) * 180 / np.pi
    if end_x < start_x:
        direction = 360 - direction # corrects if angle is actually between 180 and 360 deg

    return (speed, direction, distance_thrown)


def height_to_num (str_height):
    """convert height from string format of 6-1 (feet-inches) or 73 (inches) to integer inches"""
    if "-" not in str_height:
        return int (str_height)
    else:
        return int(str_height[0:str_height.index("-")]) * 12 + int(str_height[str_height.index("-")+1:])

def play_details(play, id1):
    """For a given play, given the id of a player, calculate the average orientation through
    entire play, average speed, max speed, and distance ran. Helper function"""
     
    p1 = play[play['nflId'] == id1]
     
    ori = np.array(p1['o'])
    speed = np.array(p1['s'])
    dist = np.array(p1['dis'])
    
    if len (p1) == 0:
        # Callers unpack four values, so return a 4-tuple of placeholders
        return (np.nan, np.nan, np.nan, 0.0)
    
    return (np.mean(ori), np.mean(speed), np.amax(speed), np.sum(dist))

def time_to_num (str_time):
    """convert time of day to number"""
    time = str_time[str_time.index("T") + 1:-1].split(":")
    hour, minute, sec = int(time[0]), int(time[1]), float(time[2])
    return hour * 60 * 60 + minute * 60 + sec

def time_elapsed(quarter, gameclock):
    """Computes seconds elapsed given quarter and gameclock. Up to 70 minutes (4200 seconds) including overtime."""
    def compute_elapsed(string):
        if string != string:
            return 0
        m, s, _ = string.split(':')
        elapsed = 15 * 60 - 60 * int(m) - int(s)
        return elapsed
    gameclock = gameclock.apply(compute_elapsed)
    t = 15 * 60 * (quarter - 1) + gameclock
    # Overtime is now only 10 minutes long
    t[t > 3600] -= 300
    return t
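
As a worked example of the clock arithmetic in `time_elapsed`, the sketch below repeats the function in self-contained form and applies it to a few hypothetical clock readings. It assumes the game clock is stored as `MM:SS:00` strings, as the three-way split implies; the sample values are made up.

```python
import pandas as pd

def time_elapsed(quarter, gameclock):
    """Seconds of game time elapsed, given quarter numbers and 'MM:SS:00' clock strings."""
    def compute_elapsed(string):
        if string != string:  # NaN is the only value not equal to itself
            return 0
        m, s, _ = string.split(':')
        return 15 * 60 - 60 * int(m) - int(s)
    elapsed = gameclock.apply(compute_elapsed)
    t = 15 * 60 * (quarter - 1) + elapsed
    # The overtime clock starts at 10:00 rather than 15:00, so remove the offset
    t[t > 3600] -= 300
    return t

# Hypothetical readings: start of game, mid Q2, end of regulation, start of overtime
quarter = pd.Series([1, 2, 4, 5])
gameclock = pd.Series(["15:00:00", "07:30:00", "00:00:00", "10:00:00"])
result = time_elapsed(quarter, gameclock)
print(result.tolist())  # [0, 1350, 3600, 3600]
```

The end of regulation and the start of overtime both map to 3600 seconds, which is the effect of the 300-second correction.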

    
