# Defensive Back Domination
### NFL Big Data Bowl 2021 - Submission by Katherine Lordi 

# Project Outcomes: Evaluating Defensive Back Domination ❌🏈

When do Defensive Backs (DBs) have a dominating presence on the field? The DB Position Group includes Safeties (S, FS, SS) and Cornerbacks (CB). DBs dominate when they effectively shut down their Receivers: forcing an incomplete pass, causing a turnover, intercepting the ball, etc. For example, see [this video](https://www.youtube.com/watch?v=tcFi07b12jY) to hear former Philadelphia Eagles Safety Malcolm Jenkins break down some tactics that make for a successful Safety.

Here is a summary of my project's outcomes for evaluating Defensive Back (DB) Domination for the [NFL Big Data Bowl 2021](https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview), using the [data provided](https://www.kaggle.com/c/nfl-big-data-bowl-2021/data) on drop-back pass plays for the 2018 Regular Season. See the full code solution on GitHub [here](https://github.com/klordi/lordi_nfl_big_data_bowl_2021). 


### Figure 1. Summary of Project Outcomes: Evaluating Defensive Back Domination  

In [None]:
from IPython.display import Image
Image("../input/overview-main1/NFL Data Bowl 2021 Outcomes Overview (8).png")

# Introduction

There are two ways to win a football game: 1) score more points than your opponent 🏈🏈 or 2) stop them from scoring more points than you ❌🏈. "[Defense wins championships](https://www.forbes.com/sites/briangoff/2019/01/14/defense-wins-championships-in-the-nfl-fact-or-folklore/?sh=6569c6786053)." Because the [2021 Big Data Bowl](https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview) focuses on analyzing defense on pass plays, with no linemen info included, I kept a shorter, focused question in mind: evaluating the Defensive Backs (DBs) who cover the pass.

## How can analyzing [NFL game, play and tracking data](https://www.kaggle.com/c/nfl-big-data-bowl-2021/data) of all drop-back pass plays from the 2018 Regular Season help the league assess DB Performance and teams improve their DB's performance? 

This is a broad question, but as a growing data scientist and NFL analyst 😂, I explored it thoughtfully to help identify unique and impactful approaches and applications to measure DB performance on the pass, which lag behind metrics for analyzing offenses in the [Next Gen Stats Glossary](https://nextgenstats.nfl.com/glossary). We need a defense category added!

I took a [machine learning approach to my analysis rather than a statistical one](https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3) in order to deepen my knowledge of ML concepts, practice implementation, and challenge myself to make repeatable predictions of DB performance, which can be assessed by the number of Completions VS the number of Incompletions (and Interceptions) for the guarded Target Receiver on offense. 

Additionally, focusing on the 2018 regular season data, which is the kind of season data that teams may potentially have access to in real time in the future, can help me figure out how this data can actually be used throughout the season by the League, Defensive Coordinators, teams, and individual players.


# Notable Engineered Features, Visualizations, & Data Formatting

I used Python to create the following notable engineered features and visualizations. Before getting to this point, I conducted Research (see Appendix A - Research) and started formatting the data (see Appendix B - Data Formatting for more details and full code on Github [here](https://github.com/klordi/lordi_nfl_big_data_bowl_2021)). 

## 1A. TR-DB-Ball Triangle Visualization

In addition to calculating the distances between the Target Receiver, the Football, and Every Defender on the Field and adding them as new columns in the Tracking Data, I created a visualization that displays the distances between the target receiver of the pass, the 🏈, and every defender on the field as a triangle that changes size and shape over the course of the play. I created this viz because these distances were an influential input for improving the accuracy of the XGBoost Implementation to Classify Pass Outcome (Complete or Incomplete).


In [None]:
from IPython.display import Image
Image("../input/figures-lordi/fig2.png")

Figure 2. TR-DB-Ball Triangle Visualization. Compare to all player route info on the left to see the route targeted by the QB. QB Aaron Rodgers passes deep left to Davante Adams, pushed ob at DET 9 for 30 yards (Closest Defender: Darius Slay, CB of DET). See Davante Adams highlights [here at 2:56](https://www.youtube.com/watch?v=6cLUzFGGoDo%3Fstart%3D76&end=120). 
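
The distance features behind the triangle can be sketched roughly as follows. This is a minimal illustration, not the code from my repository: the function names, the player labels, and the (x, y) positions are all hypothetical, and it assumes positions are given in yards for a single frame of the tracking data.

```python
import math

def triangle_sides(receiver, defender, ball):
    """Return the three side lengths (yards) of the receiver-defender-ball triangle."""
    return {
        "receiver_defender": math.dist(receiver, defender),
        "defender_ball": math.dist(defender, ball),
        "receiver_ball": math.dist(receiver, ball),
    }

def closest_defender(receiver, defenders):
    """Return the label of the defender nearest to the target receiver."""
    return min(defenders, key=lambda name: math.dist(receiver, defenders[name]))

# Hypothetical positions (x, y) in yards for one frame of one play
receiver = (30.0, 10.0)
ball = (30.0, 22.0)
defenders = {"CB1": (33.0, 14.0), "FS": (45.0, 25.0)}

nearest = closest_defender(receiver, defenders)
sides = triangle_sides(receiver, defenders[nearest], ball)
```

Repeating this for every frame of a play gives the changing triangle shape that the visualization draws.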

## 1B. Normalized DB Separation Visualization
As input to the Machine Learning XGBoost Trees that classify pass outcome as Complete (0,C) or Incomplete (1,I), I thought it would be beneficial to normalize the positions of defenders and the football to the target receiver at (x,y) = (0,0), to see how these normalized positions change over the course of the play at critical events such as:
- Ball_Snap 
- Pass_Forward (moment ball leaves the QB's hands)
- Pass_Arrived (moment ball arrives at Target Receiver)
- Pass_Outcome (Caught, Touchdown, or Incomplete)
- Time Intervals after Pass_Outcome_Caught to assess how well DBs stop Receivers gaining yardage after a catch (Future Work)
- Time Intervals between each of the events above
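
A minimal sketch of this normalization, assuming each event of a play is available as a mapping from a player label (or the football) to an (x, y) position in yards. The labels, coordinates, and the `normalize_to_receiver` helper are hypothetical and only illustrate the shift to the target receiver's origin.

```python
def normalize_to_receiver(positions, receiver_id):
    """Shift every (x, y) so that the target receiver sits at (0, 0)."""
    rx, ry = positions[receiver_id]
    return {pid: (x - rx, y - ry) for pid, (x, y) in positions.items()}

# Hypothetical positions (yards) keyed by event, then by player label or "football"
play = {
    "ball_snap": {"WR1": (25.0, 10.0), "CB1": (27.0, 11.0), "football": (25.0, 26.0)},
    "pass_arrived": {"WR1": (50.0, 8.0), "CB1": (51.5, 9.0), "football": (50.0, 8.5)},
}

# Normalize each critical event independently
normalized = {event: normalize_to_receiver(pos, "WR1") for event, pos in play.items()}
```

After the shift, a defender's coordinates are directly his separation from the receiver along each axis, which makes plays from anywhere on the field comparable.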

Then, I visualized the positions of DBs relative to the Normalized Target Receiver using a [Hexagonal Bin Plot](https://datavizproject.com/data-type/hexagonal-binning/#:~:text=Hexagonal%20binning%20plots%20density%2C%20rather,the%20area%20of%20the%20hexagons.&text=There%20are%20many%20reasons%20for,2D%20surface%20as%20a%20plane.). Hexagonal Binning is useful because it shows density of the points, so we can see where more DBs tend to position themselves around the target receiver on Successful Completions (C) VS Incompletions (I). I binned the densities by log base 2 in order to make the color pop.

In the figures below, the plots show the target receiver in blue at the origin and the density of DB positions around them across ALL PLAYS IN 2018, for Completions (Figure 3) and Incompletions (Figure 4), within 2, 5, and 10 yards of the target receiver. Comparing the figures, we can see that on Completions more DBs are behind the target receiver relative to the direction of the ball. This makes sense since receivers are more likely to make a completion when they're open and don't have a defender between them and the QB. 

Questions: Can we further query on the orientation of defenders? Can we label each chart better? I feel like some fancy math can be done to better quantify this. Should we make it a % so that it is comparable across the data, since there are way more completions than incompletions? 
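
One possible way to approach the percentage question is to report, for each outcome, the share of defenders that fall within each radius of the normalized target receiver. The sketch below is only an illustration under that assumption: the `share_within` helper and the offsets are made up, and this is not how the hexbin figures were produced.

```python
import math

def share_within(offsets, radii=(2, 5, 10)):
    """Fraction of defender offsets (dx, dy) that lie within each radius of the origin."""
    distances = [math.hypot(dx, dy) for dx, dy in offsets]
    total = len(distances)
    if total == 0:
        return {r: 0.0 for r in radii}
    return {r: sum(d <= r for d in distances) / total for r in radii}

# Hypothetical normalized defender offsets (yards) at pass_arrived
completion_offsets = [(1, 1), (3, 0), (-6, 2), (0, -9), (12, 5)]
incompletion_offsets = [(0.5, 0), (1, -1), (4, 3)]

completion_share = share_within(completion_offsets)
incompletion_share = share_within(incompletion_offsets)
```

Because each share is divided by its own sample size, the two outcomes can be compared even though there are far more completions than incompletions.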

In [None]:
Image("../input/figures-lordi/fig3.png")

In [None]:
Image("../input/figures-lordi/fig4.png")

Figure 4: Normalized DB Separation Visualization for Event = Pass_Arrived; Outcome = Incomplete (I); showing densities of defender position within 2, 5, 10 yards; Legend: Blue = Target Receiver of Pass; Red = Density of Defenders in Binned Region

### How can the Normalized DB Separation Visualization be impactful? 
Querying on different parameters, reading these plots, and deriving further data from them can be meaningful for the league and for teams in many ways. For example, the league and Next Gen Stats could report a metric called *DB Separation Rank* (or rename it something better), a new Next Gen Stat under a new Defensive Stats Category, which assesses and ranks DBs by how well they cover receivers at critical points during the play such as pass_arrived and pass_forward (the moment the QB throws the ball). By creating more in-depth stats beneath this category, named *DB Separation Zone Rank* and *DB Separation Man Rank* for example, that query on data such as defensive coverage, the target receiver's route, yards to touchdown (to quantify the amount of space available to defend on the field), down, etc., one can identify how well certain DBs and teams perform in a variety of:
* Game Scenarios: Down, # yards to 1st down, # yards to touchdown, type dropback by QB, weather, etc.
* Offensive Formations and Strategies: Route run by receiver, offensive formation, etc.
* Defensive Formations and Strategies: Defense Coverage Scheme (e.g. Cover 0-6, Man VS Zone), individual player tendencies and skillsets 

For example, this data can be used to assist with watching game film in prep for the week to determine what defensive strategies / skillsets need to be optimized in the gameplan and during practice to help with root cause analysis of how specific DBs can best limit separation in different game scenarios and defensive schemes.
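
To make the *DB Separation Rank* idea more concrete, here is one possible definition, offered only as a sketch: rank each DB by his average distance to the target receiver at pass_arrived, with a smaller average meaning tighter coverage. The metric formula, the `separation_rank` helper, and the sample records are all assumptions for illustration; the write-up above does not fix a formula.

```python
from collections import defaultdict
from statistics import mean

def separation_rank(records):
    """Rank defenders by mean separation (yards); rank 1 is the tightest coverage."""
    by_defender = defaultdict(list)
    for defender, separation in records:
        by_defender[defender].append(separation)
    averages = {name: mean(values) for name, values in by_defender.items()}
    ordered = sorted(averages.items(), key=lambda item: item[1])
    return [(rank, name, avg) for rank, (name, avg) in enumerate(ordered, start=1)]

# Hypothetical (defender, separation at pass_arrived) records, one per targeted play
records = [
    ("DB_A", 1.5), ("DB_A", 2.5),
    ("DB_B", 4.0),
    ("DB_C", 1.0), ("DB_C", 2.0),
]
ranking = separation_rank(records)
```

Filtering the records first (for example by coverage scheme or route) would give the Zone, Man, or route-specific variants of the rank.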

## 1C. Vector Visualization

This visualization is pretty simple. It normalizes the line of scrimmage across all plays and displays the player and ball positions, orientation, and direction of motion as vectors, given an input play and event (a single moment in time). With the event as an input, one can see the play and position information at a discrete and often critical point in time. Below, see the play in week 5 in which Aaron Rodgers passes deep left to Davante Adams, pushed ob at DET 9 for 30 yards (Closest Defender: Darius Slay, CB of DET). See Davante Adams highlights [here at 2:56](https://www.youtube.com/watch?v=6cLUzFGGoDo%3Fstart%3D76&end=120). 

In [None]:
from IPython.display import Image
Image("../input/figures-lordi/fig5.png")

Figure 5. Normalized Player Position Visualization for the following play: Aaron Rodgers passes deep left to Davante Adams, pushed ob at DET 9 for 30 yards (Closest Defender: Darius Slay, CB of DET). See Davante Adams highlights [here at 2:56](https://www.youtube.com/watch?v=6cLUzFGGoDo%3Fstart%3D76&end=120). 
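
The vector construction behind this kind of plot can be sketched as below. It assumes angles are in degrees, measured clockwise from the positive y-axis; please check that convention against the tracking data documentation before relying on it. The helper names and sample values are hypothetical.

```python
import math

def heading_vector(angle_deg, length=1.0):
    """Convert an angle (degrees, clockwise from +y; an assumed convention) to (dx, dy)."""
    theta = math.radians(angle_deg)
    return (length * math.sin(theta), length * math.cos(theta))

def relative_to_scrimmage(x, line_of_scrimmage_x):
    """Shift an x coordinate so that the line of scrimmage sits at x = 0."""
    return x - line_of_scrimmage_x

# Hypothetical player: x = 39 yards, moving at 5 yards/s with direction = 90 degrees
x_rel = relative_to_scrimmage(39.0, 30.0)
dx, dy = heading_vector(90.0, length=5.0)
```

The arrow for this player would start 9 yards past the line of scrimmage and point along the x-axis; the same helper can be applied to the orientation angle to draw a second arrow.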

# Machine Learning: XGBoost Trees Classify Pass Outcome 
 
**Goal**: Accurately classify Pass Play Outcome as Complete (C) or Incomplete (I) given training data with multiple feature inputs that characterize Defensive Backs (DBs) and their relationships to Football and Target Receiver on Offense.
Note: For this model, I tried to classify without time-series analysis due to limited computational capabilities.

**Motivation Behind XGBoost Gradient Boosting Decision Trees:** [XGBoost](https://arxiv.org/pdf/1603.02754.pdf) is a powerful gradient boosting decision tree framework for regression and classification problems. Its strengths include modeling non-linear relationships between coordinates and handling data that may be sparse. 

**Language Used:** [XGBoost Python Package](https://xgboost.readthedocs.io/en/latest/python/python_intro.html) 

**Inputs for each Defender:**
* Height
* Weight 
* Orientation
* Direction of Motion 
* Normalized  x and y Coordinates (Normalized to the target receiver) at the Event == Pass Arrived
* Distance from the Target Receiver

**Inputs for each Play:**
* Yards To Go for 1st Down
* Down
* Route 
* Defensive Team
* Offensive Formation
* Dropback Type
* Number of Defenders in Box
* Number of Pass Rushers
* Absolute Yardline Number 

### Output: 
* Binary Classification: Pass Complete (0) or Incomplete (1)

### Model Parameters:
Test size of 0.3
param = {
    'eta': 0.3, 
    'max_depth': 6,  
    'objective': 'multi:softprob',  
    'num_class': 3} 

param1 = {
    'eta': 0.3, 
    'max_depth': 6,  
    'objective': 'binary:logistic',  
    'num_class': 2} 

steps = 40  # The number of training iterations


### Result: 78% Accuracy Test, 89% Accuracy Training

Adding more refined features and continuing feature engineering (see Next Steps & Future Work) can help increase the accuracy of the model. Personally, I am impressed with the accuracy given that we are only including the normalized positions of defenders at Event = Pass_Arrived. 
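
As a rough illustration of how an accuracy figure like this is obtained: with the `binary:logistic` objective, XGBoost returns a probability per row, which is then thresholded into a class label and compared with the true outcome. The sketch below uses made-up probabilities and labels, and `to_labels` and `accuracy` are hypothetical helpers. The `binary_params` dictionary is an assumed variant of the binary parameter set listed above; it leaves out `num_class` because, as far as I know, XGBoost documents that parameter for the multi-class objectives.

```python
# Assumed binary parameter set (not necessarily the exact one used for the result above)
binary_params = {
    "eta": 0.3,
    "max_depth": 6,
    "objective": "binary:logistic",
}

def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities of class 1 (Incomplete) into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up predicted probabilities and true outcomes (0 = Complete, 1 = Incomplete)
probabilities = [0.10, 0.80, 0.40, 0.65, 0.30]
y_true = [0, 1, 1, 1, 0]

y_pred = to_labels(probabilities)
test_accuracy = accuracy(y_true, y_pred)
```

Computing the same number on the training rows and on the held-out 30% gives the two accuracies reported above; a large gap between them is a hint that the trees are overfitting.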





# Potential Applications & Innovations for the NFL

### How can the League and Individual NFL Teams and Coaches use the model and visualizations to help Defensive Backs be more dominant in shutting down the pass???! ❌🏈🐐

## For the League:
* **DB Separation Rank:** Use Normalized DB Separation Visualization and queries on parameters such as Team, Player, and Receiver Route to quantify this metric on a weekly basis given input parameters. Examples:
    * DB Separation Zone Rank
    * DB Separation Man Rank 
    * DB Separation Cover Rank
    * DB Separation Route 
    * etc.


## For Individual NFL Teams, Defense Coordinators, & Players:
* Data-assisted film watching & game prep using model outcomes and Normalized DB Separation Visualization 
* Root cause analysis for team on how DBs can best limit separation in different game scenarios and defensive schemes.

# Next Steps & Future Work
There is a lot more I could have done and can learn from, which is good!! #gainz Here are a few ideas on my mind as I wrapped up my analysis.

### DB Performance Analysis & Strategy
* Assess the quality of the position of the DB after the catch using statistical and/or machine learning methods. For example, based on a DB's depth, position, speed, acceleration, etc. following the catch by the receiver, how many yards is the receiver likely to gain? Analyzing this is another beast. 
* Assess certain special skills that successful DBs employ depending on factors such as Game Scenario, Offensive Scheme, and Defensive Coverage. For example, like I said at the start of this notebook, see [this video](https://www.youtube.com/watch?v=tcFi07b12jY) to hear former Philadelphia Eagles Safety Malcolm Jenkins break down tactics that make for a successful Safety. Also, see [this video](https://www.youtube.com/watch?v=F6kYFFoVOp4) to hear CB Xavier Rhodes break down his technique. 
* So much more; it's football. 

### Machine Learning Model
* Investigate classifying defensive coverages (Cover 0-6, Nickel, Dime, Quarters, etc.) using supervised or unsupervised machine learning methods. This can help enhance current XGBoost Model to Classify Pass Outcome. 
* Investigate classifying a play as Man VS Zone as players move around field over time. See [this paper (Dutta, Yurko, Ventura, 2020)](https://arxiv.org/pdf/1906.11373.pdf). This can help enhance current XGBoost Model to Classify Pass Outcome. 
* Identify feature importance of the XGBoost Model using a method for interpreting model predictions, such as [SHapley Additive exPlanations (SHAP)](https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf).

Thank you for reading! 🙂 See Appendix for research topics, references, and plots not directly in submission. I look forward to learning from everyone's submissions, methods used, the winners, the NFL, etc. and seeing what everyone came up with in the time they were able to dedicate to this project. 
See the full code on Github [here](https://github.com/klordi/lordi_nfl_big_data_bowl_2021). 

# ----------------------------------------------------------------

# Appendix

## Appendix A - Research & References 

### Research Areas on NFL Defenses
* Defensive Coverages 
* Highlights 
* Watching lots of football with my dog 
* etc.

### Select References

I did some reading and research on past work in analyzing defenses and data science to see how my project could build on past and current work. Here are some of the links I looked at in depth for inspiration:

[1] Cheong, Chi, Noori, Schaefer, Tyagi, Zeng. "Predicting Defender Trajectories in NFL’s Next Gen Stats" 2020. https://aws.amazon.com/blogs/machine-learning/predicting-defender-trajectories-in-nfls-next-gen-stats/

[2] Dutta, Yurko, Ventura. "Unsupervised Methods for Identifying Pass Coverage Among Defensive Backs with NFL Player Tracking Data" 2020. https://arxiv.org/pdf/1906.11373.pdf









 

## Appendix B - Data Formatting

See my GitHub [here in the data_pickled folder](https://github.com/klordi/lordi_nfl_big_data_bowl_2021/tree/main/Data) for all files I used for analysis after data formatting.

1. Read in the original raw data.
2. Made text columns numerical by adding "i" (integer) columns: converted each text field into a category and then converted that category to a number. I did this because the data needs to be numerical for my preliminary exploratory analysis and to be used as input to Machine Learning models. 
3. After adding derived columns, [pickled the data](https://docs.python.org/3/library/pickle.html#:~:text=%E2%80%9CPickling%E2%80%9D%20is%20the%20process%20whereby,back%20into%20an%20object%20hierarchy.) in order to make its size more manageable and saved as csv files. The Python pickle module serializes and de-serializes a Python object structure.
4. Joined new pickled tracking data with player, play, and game data to analyze relationships between variables. 
5. Normalized the data using two transforms: (1) Normalized tracking data to line of scrimmage x = 0. (2) Normalized Defender and Football Position in Tracking Data to Target Receiver at (x,y) = (0,0).
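
Steps 2 and 3 above can be sketched with the standard library alone. This is only an illustration: the `encode_category` helper, the `route_i` column name, and the sample values are hypothetical, and the codes are assigned in sorted order of the distinct values, which is an assumption about the encoding rather than a description of the files in the repository.

```python
import pickle

def encode_category(values):
    """Map each distinct text value to an integer code (codes follow sorted order)."""
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    return [code_of[value] for value in values], categories

# Hypothetical text column and its derived integer ("i") column
routes = ["GO", "SLANT", "GO", "OUT"]
route_i, route_categories = encode_category(routes)

# Serialize the derived table to bytes and read it back
table = {"route": routes, "route_i": route_i}
payload = pickle.dumps(table)
restored = pickle.loads(payload)
```

Keeping the list of categories alongside the codes makes it possible to translate model inputs back into readable labels later.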


