This is the NFL Big Data Bowl 2021 submission from team **isaihchrischarlie** by collaborators [Isaih](https://www.kaggle.com/zayuhtheiv), [Chris](https://www.kaggle.com/christopherhanes), and [Charlie](https://www.kaggle.com/danoff).

# Introduction

Drawing on our team strengths, our approach was to focus less on sophisticated machine learning and more on crisp data visualizations and clear writing informed by football expertise. We all have experience doing data analysis. Chris has extensive knowledge of databases, Charlie writes, and Isaih is our [Subject Matter Expert](https://en.wikipedia.org/w/index.php?title=Subject-matter_expert) (SME) on the game, drawing on his time playing defensive back in high school. Isaih and Charlie have been researching sports analytics for multiple years and [submitted an entry](https://www.kaggle.com/danoff/neural-networks-isaih-divya-charlie-version?scriptVersionId=24162374) to the last Big Data Bowl.

This time we wanted to evaluate which defenders close gaps best while the ball is in the air, that is, from the moment the ball leaves the quarterback’s hands to the moment it arrives at the targeted receiver. On one play, for example, Marcus Williams was 8.19 yards away from the targeted receiver at the "pass forward" event and 5.96 yards away at the "pass arrived" event, good for a 27.2% gap decrease.
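
The arithmetic behind that example can be sketched in a few lines of Python. The function name `percent_gap_closed` is ours, chosen for illustration, and is not part of the submission's code.

```python
def percent_gap_closed(dist_at_throw, dist_at_arrival):
    """Percent of the throw-time gap that is closed by the time the ball arrives."""
    return (dist_at_throw - dist_at_arrival) / dist_at_throw * 100


# Distances in yards from the Marcus Williams play described above.
pct = percent_gap_closed(8.19, 5.96)
print(round(pct, 1))  # 27.2
```

A defender who loses ground while the ball is in the air ends up with a negative value under this definition.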

Another factor we wanted to consider was which specific defender we were going to analyze on a given play. We decided that a corner on the other side of the field defending a non-targeted receiver should not be part of the evaluation. We created a dummy variable to show who is closest to the targeted receiver when the ball arrives. 

To gain an outside perspective on the best defensive backs in the year we analyzed, we reviewed the [Top 25 Cornerbacks in the NFL in 2018](https://www.pff.com/news/pro-top-25-cornerbacks-in-the-nfl-in-2018) from Pro Football Focus (PFF). As a way to audit the quality of our metric, we check whether our top-rated defenders align with the [PFF grades](https://www.pff.com/grades).

## Focus of Analysis

Ultimately the question we decided to focus on was: 

*Is a pass defender’s percent distance closed to the target receiver, during the pass air time, a meaningful measure of pass defender performance?*

# Methodology

Now we will outline our methodological approach to preparing the base data for analysis.

1. We define pass defenders as any player playing any of the following positions during a play: 
    * Cornerback
    * Defensive Back
    * Free Safety
    * Inside Linebacker
    * Middle Linebacker
    * Outside Linebacker
    * Safety
    * Strong Safety

2. We calculate the distance of each pass defender to the target receiver at the moment of pass arrival.

3. We identify the pass defender closest to the target receiver.

4. We calculate the identified pass defender’s distance to the target receiver at the moment the pass was thrown (i.e., at the pass forward event).

5. We calculate a distance closed variable by subtracting the identified defender’s distance to the target receiver at the moment of pass arrival from the distance at the moment the pass was thrown.

6. We normalize the distance closed variable by turning it into a percentage and call it Percent Distance Closed (PDC).
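
The steps above can be sketched for a single play as follows. This is a minimal illustration, not the code used for the submission: the toy coordinates, the `TARGET_ID` constant, and the `distances_to_target` helper are hypothetical, the column names `event`, `nflId`, `x`, and `y` and the underscored event labels are assumed, and dividing by the throw-time distance in step 6 is inferred from the 27.2% example in the introduction.

```python
import numpy as np
import pandas as pd

# Toy tracking rows for one play (all values invented for illustration).
# Player 1 is the targeted receiver; players 2 and 3 are pass defenders,
# i.e. the rows are assumed to be filtered by position already (step 1).
tracking = pd.DataFrame({
    "event": ["pass_forward"] * 3 + ["pass_arrived"] * 3,
    "nflId": [1, 2, 3] * 2,
    "x": [30.0, 36.0, 30.0, 40.0, 42.0, 44.0],
    "y": [20.0, 28.0, 35.0, 22.0, 22.0, 25.0],
})
TARGET_ID = 1  # hypothetical identifier of the targeted receiver


def distances_to_target(frame_rows):
    """Distance in yards from each non-target player to the targeted receiver."""
    target = frame_rows.loc[frame_rows["nflId"] == TARGET_ID].iloc[0]
    others = frame_rows.loc[frame_rows["nflId"] != TARGET_ID]
    dist = np.hypot(others["x"] - target["x"], others["y"] - target["y"])
    return pd.Series(dist.to_numpy(), index=others["nflId"].to_numpy())


at_throw = distances_to_target(tracking[tracking["event"] == "pass_forward"])
at_arrival = distances_to_target(tracking[tracking["event"] == "pass_arrived"])

closest_id = at_arrival.idxmin()                                    # steps 2-3
dist_closed = at_throw.loc[closest_id] - at_arrival.loc[closest_id]  # steps 4-5
pdc = dist_closed / at_throw.loc[closest_id] * 100                   # step 6

print(closest_id, round(dist_closed, 2), round(pdc, 1))
```

In this toy play the closest defender at arrival starts 10 yards from the receiver and finishes 2 yards away, so the distance closed is 8 yards and the PDC is 80%.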

# Results

For the initial analysis we included only data from defensive backs (i.e., we removed all linebackers) and excluded one play that had two incomplete pass events (Game ID 2018090900, Play ID 2037). The mean PDC was 22.7% with a standard deviation of 59.2%. The large spread makes sense given the wide range of outcomes, from defenders getting badly burned to close encounters. On average, the closest defensive back made up around one fifth of the distance between them and the targeted receiver while the ball was in the air. The histogram below shows the shape of the data:

![histogram](http://danoff.org/hist_pct_distance_closed.png)

Next, we looked to see whether our key metric increased the odds of successful on-field outcomes for defenders. We ran a logistic regression with PDC as the independent variable and whether the pass was incomplete as the dependent variable. We found that PDC was not significant at the 10% level (*p* = 0.18). You can see the full results of all our regressions in the appendix.

We tried again with more independent variables: 
* Number of defenders in the box
* Number of pass rushers
* Distance closed (total, not percentage)
* Defender weight
* Defender height
* Targeted receiver height
* Targeted receiver weight
* Yardline
* Yards to go
* Expected Points Added (EPA)

This time PDC was significant at the 10% level (*B* = 0.008, *p* = 0.029, Odds Ratio = 1.008). In this case, the better the defensive back is at closing the gap between themselves and the targeted receiver while the ball is in the air, the better the odds of an incomplete pass. The other significant predictors were number of pass rushers, distance closed (in total), yardline number, yards to go, and EPA.

We ran it again with only the significant predictors. This time they were all significant at the 10% level, except for yards to go. Of the significant variables, the number of pass rushers had the highest odds ratio of 1.491, indicating that in this regression the number of pass rushers increased the odds of an incomplete pass the most. PDC was second at 1.008. This indicates that the number of pass rushers (which may lead to the quarterback being more pressured) is more influential on the outcome of a pass than how much the closest defensive back closes the gap to the target receiver.
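
Because an odds ratio applies per one-unit change in its predictor, the two figures above sit on different scales: one additional pass rusher versus one percentage point of PDC. The sketch below only rearranges the two reported odds ratios. It assumes PDC enters the regression in percentage points, which is consistent with the histogram range in the appendix, and the variable names are ours.

```python
import math

OR_PASS_RUSHERS = 1.491  # reported odds ratio per additional pass rusher
OR_PDC = 1.008           # reported odds ratio per percentage point of PDC

# An odds ratio is exp(coefficient), so the coefficient is log(odds ratio).
coef_pdc = math.log(OR_PDC)
coef_rushers = math.log(OR_PASS_RUSHERS)

# Odds ratios multiply across units, so a 10-point PDC increase gives:
or_pdc_10_points = OR_PDC ** 10

# PDC points needed to match the odds change from one extra pass rusher.
pdc_points_per_rusher = coef_rushers / coef_pdc

print(round(coef_pdc, 3))             # 0.008
print(round(or_pdc_10_points, 3))     # 1.083
print(round(pdc_points_per_rusher))   # 50
```

Read this way, a swing of roughly 50 percentage points in PDC corresponds to about the same change in odds as one extra pass rusher, holding the other predictors fixed.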

Next we looked to see how individual defensive backs scored in PDC. In the chart below we show the top 15 defensive backs in PDC for the 2018 season with a minimum of 40 plays where they were the closest defender to the targeted receiver. 

![Top 15 Defensive Backs](http://danoff.org/Top15DefensiveBacksDash3.png)

Adrian Amos, a safety for the Bears, had the highest average PDC at 44%. He was not one of the top defenders according to Pro Football Focus, but his fellow Bears safety Eddie Jackson was 4th on the list at 36% and was listed as [the top safety of the year by PFF](https://www.pff.com/news/pro-best-player-at-every-position-in-the-nfl-in-2018). Amongst cornerbacks, Johnathan Joseph and Kareem Jackson of the Texans, Adoree’ Jackson of the Titans, and A.J. Bouye of the Jaguars were listed in the [top 25 cornerbacks of 2018 by PFF](https://www.pff.com/news/pro-top-25-cornerbacks-in-the-nfl-in-2018).
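
The ranking described above, an average per defender with a minimum number of plays, can be sketched with pandas. Only `km_pctDistClosed` comes from our data; the `defender` column name, the toy values, and the small thresholds are assumptions made so the example runs on six rows.

```python
import pandas as pd

# Toy per-play rows, one row per play where the defender was closest to the target.
plays = pd.DataFrame({
    "defender": ["A", "A", "A", "B", "B", "C"],
    "km_pctDistClosed": [40.0, 50.0, 30.0, 10.0, 20.0, 90.0],
})

MIN_PLAYS = 2  # the analysis above uses 40
TOP_N = 2      # the chart above shows 15

# Count plays and average PDC for each defender.
summary = plays.groupby("defender")["km_pctDistClosed"].agg(
    n_plays="count", mean_pdc="mean"
)

# Keep defenders with enough plays, then rank by average PDC.
ranking = (
    summary[summary["n_plays"] >= MIN_PLAYS]
    .sort_values("mean_pdc", ascending=False)
    .head(TOP_N)
)
print(ranking)
```

Defender C has the highest single-play value in the toy data but is dropped for having too few plays, which is the purpose of the minimum-play filter.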

We then looked to see how different positions fared. Free Safeties had the highest mean PDC at 29%. Cornerbacks were the lowest amongst defensive backs at 17%. Not far behind them were Middle Linebackers at 15%. 

We felt that this made sense, since cornerbacks typically track their assignments more closely when the ball is thrown in the first place, as our data shows. On average, cornerbacks were 4.57 yards away from the targeted receiver when the pass was thrown (n=6,879). For safeties of all types the average distance was larger at 6.24 yards (n=2,609).

We therefore have reason to believe that PDC is higher for safeties because cornerbacks are already in a better position to make a play both when the ball is thrown and when it arrives. When the ball arrives, cornerbacks are 3.49 yards away on average (n=6,879), compared with 4.16 yards for safeties (n=2,609).

With the larger gap for safeties they have more of an incentive to “get on their horse” and cover as much ground as possible by the time the ball does arrive, hence the larger percentages illustrated below:

![Comparing Positions](http://danoff.org/ComparingPositions3.png)
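
As a rough cross-check, the average distances quoted above can be turned into a share of the average gap closed. This is a sketch on the reported averages only, and the variable names are ours. The result is not the same quantity as mean PDC, which averages the per-play percentages and can be pulled down by plays with large negative values.

```python
# Average distances in yards reported above: (at throw, at arrival).
avg_dist = {
    "cornerbacks": (4.57, 3.49),
    "safeties": (6.24, 4.16),
}

pct_of_avg_gap = {}
for position, (at_throw, at_arrival) in avg_dist.items():
    closed = at_throw - at_arrival
    pct_of_avg_gap[position] = closed / at_throw * 100
    print(f"{position}: closed {closed:.2f} yd, "
          f"{pct_of_avg_gap[position]:.1f}% of the average gap")
```

On these averages, cornerbacks close about 1.08 yards (23.6% of their average gap) and safeties about 2.08 yards (33.3%), the same ordering as in the chart.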

We also looked at a scatter plot with average PDC on the y-axis and average pass distance on the x-axis. Similar to our finding above, amongst defensive backs, free safeties covered the longest passes on average and cornerbacks the shortest.

![Scatter Plot](http://danoff.org/ScatterPlot.png)

## Including Zone as a Variable

We thought that there could be some interesting findings within our data if we split plays by zone and man coverage for context, given the different responsibilities of defensive backs (as explained well by [former college corner Michael Felder](https://bleacherreport.com/articles/1745443-how-to-read-and-react-a-college-cornerbacks-guide-to-pre-snap-pass-defense)). Within our final dataset of approximately 13,896 plays we found 737 observations (5%) that had information on what coverage the defense was in (thanks to the [bonus dataset](https://www.kaggle.com/tombliss/nfl-big-data-bowl-2021-bonus)). Of those 737 observations, 509 (69%) were zone and 228 (31%) were man coverage. Here is a quick rundown of the coverage-contextualized findings:

* The average number of defenders in the box when in man coverage was 6.37 to zone’s 5.95
* The average expected points added was 0.27 when man coverage was called versus 0.07 in zone.
    * This falls in line with our expectations where higher stakes situations call for tighter coverage.
* An offense’s yards to go on average was 8.02 in man, and 9.62 in zone
    * This also falls in line with the expectations of higher stakes situations calling for man coverage since it is typically tighter
* As expected, the defensive back’s average distance to the targeted receiver when the ball is passed forward in man coverage (3.33 yards) is lower than in zone (5.90)
* As expected, the defensive back’s average distance to the targeted receiver when the ball arrives in man coverage (2.67 yards) is lower than in zone (4.38)
* As expected, the average distance closed by a defensive back on a targeted receiver is lower in man coverage (0.65 yards) than in zone (1.52)
* The average PDC by a defensive back on a targeted receiver is lower in man coverage (7.37%) than in zone (20.88%)
* Offenses typically earn nearly one extra yard on average when passing versus man coverage (8.97) over zone (8.02) 
* In man coverage, the offense’s pass outcome is found to be a touchdown 3.51% of the time compared to 1.38% in zone. 
    * Zone is considered to be a safer option. For example, at the end of a game defenses may employ a prevent defense, which can be thought of as a “hyper” zone. As mentioned above, man coverage is used more in tighter situations, as evidenced by the EPA data, so it makes sense that defensive backs in man are scored upon more often: they are in a riskier situation in the first place.
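
To put the size of the coverage-labeled sample in perspective, the counts and rates quoted above can be recombined as follows. This is a sketch on the reported figures only, assuming the touchdown rates were computed over the labeled plays. The variable names are ours.

```python
TOTAL_PLAYS = 13_896                        # plays in the final dataset
labeled = {"zone": 509, "man": 228}         # plays with a coverage label
td_rate_pct = {"zone": 1.38, "man": 3.51}   # reported touchdown rates

n_labeled = sum(labeled.values())
share_labeled = n_labeled / TOTAL_PLAYS * 100
coverage_share = {cov: n / n_labeled * 100 for cov, n in labeled.items()}

# Approximate touchdown counts implied by the reported rates.
implied_tds = {cov: round(td_rate_pct[cov] / 100 * n) for cov, n in labeled.items()}

print(n_labeled, round(share_labeled, 1))
print({cov: round(pct) for cov, pct in coverage_share.items()})
print(implied_tds)
```

Under that assumption the touchdown rates correspond to roughly 8 plays in man and 7 in zone, so the coverage comparisons rest on a small sample.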

Finally, we compared PDC by position and coverage. In all cases the PDC was higher for zone than it was for man.

![Comparing Positions by Coverage](http://danoff.org/ComparingPositionsbyCoverage3.png)

# Future Work

In the future we would like to understand the quantitative measurements of defensive backs more deeply. For instance, similar to what [Dutta, Yurko, and Ventura](https://arxiv.org/abs/1906.11373) suggested, can we create specific models for free safeties, strong safeties, and even nickel backs? Can we come up with a metric for how effective defensive backs are at stopping the run? Or, now that we know more about PDC, can we connect it to [data from the annual NFL combine](https://www.pro-football-reference.com/draft/2020-combine.htm) to see whether drills such as the shuttle actually relate to on-field performance? As [Caio Brighenti suggested](https://operations.nfl.com/media/4201/bdb_brighenti.pdf), can we consider pitch control in our analysis? Could these defensive metrics help fantasy players choose which defense to start in a given week, or help bettors make the winning pick?

# Conclusion

Our key takeaway is that PDC can significantly increase the odds of an incomplete pass and that it roughly aligns with qualitative assessments of top defensive backs by Pro Football Focus grades. Additionally, PDC is meaningfully affected by both position and the type of defensive coverage.

Given all of this, we can see that situational football plays heavily into the type of coverage called on any given play. Thus, there is a lot of context to be considered when evaluating defensive back play. Coaches should consider PDC and the other distance tracking metrics referenced here when evaluating their defensive backs' ability to read and react and, in turn, to prevent chunk yardage and touchdowns.

# Appendix

Links

* [GitHub Repo](https://github.com/danoff/Mind-the-PDC-Gap)

## Code

In [None]:
# Load libraries and check versions
 
# Python version
import sys
print('Python: {}'.format(sys.version))

# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
from scipy import stats

# numpy
import numpy as np
print('numpy: {}'.format(np.__version__))

# matplotlib

import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
%matplotlib inline
import matplotlib.pyplot as plt

# seaborn

import seaborn


# pandas
import pandas as pd
print('pandas: {}'.format(pd.__version__))
from pandas.plotting import scatter_matrix

# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# statsmodels

import statsmodels
from statsmodels.compat import lzip
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import statsmodels.api as sm

# read CSV of selected 2018 pass data 

data = pd.read_csv('../input/onlydbs/alpha_analysis_detail_few_columns_only_dbs.csv', header=0)
data = data.dropna()
# preview the first rows (the bare attribute `data.head` does not call the method)
print(data.head())

# make PDC histogram

data.km_pctDistClosed.hist(color='yellow')
plt.suptitle('Percent of Distance Closed to Targeted Receiver by Defensive Back')
plt.title('While Pass is in the Air Histogram')
plt.xlabel('% of Distance Closed')
plt.ylabel('Frequency')
plt.xlim(-900, 100)
# turn the grid off explicitly (the `b` keyword is not accepted by newer matplotlib releases)
plt.grid(False)
plt.text(-300, 350, r'$\mu=22.7$')
plt.text(-300, 300, r'$\sigma=59.2$')
plt.text(-300, 250, r'min = -855.8')
plt.text(-300, 200, r'max = 95.2')
plt.savefig('hist_pct_distance_closed')

In [None]:
# Binary logistic regression to predict incomplete passes take one

y=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x=data[['km_pctDistClosed', 'constant']]

logit_model=sm.Logit(y,x)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))

In [None]:
# Binary logistic regression to predict incomplete passes take two

y2=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x2=data[['km_pctDistClosed', 'p_defendersInTheBox', 'p_numberOfPassRushers', 'km_distClosed', 'tdf_weight', 'tdf_height', 'ttr_height', 'ttr_weight', 'p_yardlineNumber', 'p_yardsToGo', 'p_epa',
        'constant']]

logit_model=sm.Logit(y2,x2)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))

In [None]:
# Binary logistic regression to predict incomplete passes take three

y3=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x3=data[['km_pctDistClosed', 'p_numberOfPassRushers', 'km_distClosed', 'p_epa', 'p_yardlineNumber', 'p_yardsToGo',
        'constant']]

logit_model=sm.Logit(y3,x3)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))