<h1 style="color:black; text-align:center;">🏈 Bring down 85! Metrics to predict Expected YAC 🏈</h1>

![display gif](https://cdn.vox-cdn.com/uploads/chorus_asset/file/19535152/Pass_06_Kittle_4th_Down_Catch_vs._Saints.gif)

# Introduction

The short or screen pass on third and long has long been a debated call. It is commonly criticized for coming up short and can be seen as a wasted play. How effective is it to throw a short or screen pass on third and long? Is it even worth attempting? Who has the best chance of getting you the first down or the most yardage?

“YAC” (yards after catch) measures the yards a receiver gains after he catches the ball. YAC can be a crucial measure when evaluating players, and can be instrumental in coaching and team building. San Francisco 49ers wide receiver Deebo Samuel knows the importance of this stat: he calls himself, along with teammates George Kittle and Brandon Aiyuk, the “YAC bros.” These types of players take immense pressure off the quarterback by extending short and medium passes into big gains. For example, they let quarterbacks feel comfortable making shorter throws on third and long, with the expectation that the receiver can pick up the necessary yardage. However, a variety of factors beyond the receiver himself determine how many yards can be gained after the catch.

![display gif](https://thumbs.gfycat.com/NippyIgnorantHedgehog-size_restricted.gif)

We created a metric to determine how defender positions and distances relative to the intended receiver affect the receiver’s ability to turn upfield for yards after the catch. We also accounted for the skill level of the players involved by assigning each a score: tackling ability for defenders, and the ability to break tackles for receivers.

# Step 1: Evaluating Pass Catchers


We first set out to quantify each receiver’s ability to break tackles, and thus gain yards after the catch. We created a “receiver score” by finding each receiver’s average yards after contact, and then scaling this number based on the total catches the player made over the season. By combining catch totals with estimates of each player's ability to extend plays, we constructed a metric tailored to predicting YAC. Figure 1 displays the top 15 pass catchers according to this metric. The names in the graph are those of elite pass catchers who possess both speed and strength, making them lethal in the open field. It comes as no surprise that tight end George Kittle, who had a historic 2018 campaign, ranks third on this list, behind only workhorses Christian McCaffrey and Saquon Barkley.
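
As a rough illustration of the receiver score, the sketch below averages yards after contact per receiver and then shrinks that average for players with few catches. The exact scaling we used is not spelled out here, so the `n / (n + k)` shrinkage, the constant `k`, and the input layout are assumptions made for the example, not the formula behind Figure 1.

```python
from collections import defaultdict

def receiver_scores(catches, k=4):
    """Score receivers from (name, yards_after_contact) pairs, one per catch.

    Assumed scaling: average yards after contact * n / (n + k), where n is the
    receiver's catch count and k is a hypothetical shrinkage constant.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, yards_after_contact in catches:
        totals[name] += yards_after_contact
        counts[name] += 1

    scores = {}
    for name, n in counts.items():
        average = totals[name] / n
        scores[name] = average * n / (n + k)  # low-volume receivers are pulled toward 0
    return scores

# Hypothetical catch log: receiver "A" has four catches, "B" has one.
sample = [("A", 4.0), ("A", 6.0), ("B", 8.0), ("A", 5.0), ("A", 5.0)]
scores = receiver_scores(sample)
```

With this scaling, a receiver with one big play does not outrank a high-volume receiver with a solid average, which is the behavior the combined metric is meant to capture.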

In [None]:
from IPython.display import Image
import os

print('Figure 1')
Image("../input/gitrepo/graphs/rec_graph.png")

# Step 2: Evaluating Tacklers

We similarly evaluated each defensive player’s ability to tackle. This metric was built from two components: the total number of tackles the player made, and the player's ability to convert contact into tackles. The former was fairly straightforward, and involved a simple tally. The latter was derived by identifying the plays where a defender made first contact with a receiver, and checking whether that defender made the tackle. Figure 2 shows the 15 best players according to this metric. As expected, the list features Darius Leonard and “Wolfhunter” Leighton Vander Esch, who finished 1st and 3rd respectively in total tackles over the 2018 season. However, this graph is notably missing Blake Martinez, who finished second in that stat. Our metric places emphasis not only on gross totals but on efficiency as well, which lends itself better to predicting YAC.
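
The two components can be sketched as follows. How the tally and the conversion rate are combined is not stated above, so multiplying them is an assumption for illustration, as are the field names `tackler` and `first_contact`.

```python
def tackler_scores(plays):
    """Score defenders from plays given as dicts with 'tackler' and
    'first_contact' keys (defender names, or None when unknown).

    Assumed combination: total tackles * (first contacts converted into
    tackles / first contacts).
    """
    tackles, contacts, converted = {}, {}, {}
    for play in plays:
        tackler = play["tackler"]
        first = play["first_contact"]
        if tackler is not None:
            tackles[tackler] = tackles.get(tackler, 0) + 1
        if first is not None:
            contacts[first] = contacts.get(first, 0) + 1
            if first == tackler:
                converted[first] = converted.get(first, 0) + 1

    scores = {}
    for name in set(tackles) | set(contacts):
        n_contacts = contacts.get(name, 0)
        rate = converted.get(name, 0) / n_contacts if n_contacts else 0.0
        scores[name] = tackles.get(name, 0) * rate
    return scores

# Hypothetical plays: X converts both of his first contacts, Y only one of two.
sample_plays = [
    {"tackler": "X", "first_contact": "X"},
    {"tackler": "X", "first_contact": "Y"},
    {"tackler": "Y", "first_contact": "Y"},
    {"tackler": "X", "first_contact": "X"},
]
tackle_scores = tackler_scores(sample_plays)
```

A defender with many tackles but a poor conversion rate is penalized under this kind of combination, which is consistent with a high-volume tackler dropping out of the top 15.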

In [None]:
print('Figure 2')
Image("../input/gitrepo/graphs/tackle_graph.png")

# Step 3: Scoring player positions at the time of the catch

We wanted a metric that scores each completed pass play based on how the defenders are positioned at the moment the ball is caught. Every play starts with an initial score of 20, and each defender then subtracts an amount related to how far he is from the receiver and to his ability score from the previous section. We chose to give higher weight to defenders who were in front of the receiver, meaning between him and the end zone the offense is attacking. We found this direction from the orientation of the quarterback when the ball was snapped, since the quarterback faces the direction the offense is moving.

As the distance of each defender to the receiver increases, the amount that is subtracted from the score decreases. In addition, we only accounted for defenders that are within a 10 yard radius of the receiver once he catches the ball because these are the defenders who are able to make the first contact or tackle him. 
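
A minimal sketch of this scoring rule is shown below. The base score of 20 and the 10 yard radius come from the description above; the linear distance falloff, the `front_weight` value, and the function signature are assumptions for illustration, so this will not reproduce the exact play scores reported in this notebook.

```python
import math

def play_score(receiver_xy, defenders, offense_dir=1, base=20.0,
               radius=10.0, front_weight=1.5):
    """Score a catch from defender positions.

    receiver_xy : (x, y) of the receiver at the catch, in yards.
    defenders   : iterable of (x, y, tackler_score) tuples.
    offense_dir : +1 if the offense moves toward increasing x, -1 otherwise.
    """
    rx, ry = receiver_xy
    score = base
    for x, y, ability in defenders:
        dist = math.hypot(x - rx, y - ry)
        if dist > radius:
            continue  # too far away to make first contact
        in_front = (x - rx) * offense_dir > 0
        weight = front_weight if in_front else 1.0
        # Assumed falloff: full penalty at 0 yards, none at the edge of the radius.
        score -= ability * weight * (1.0 - dist / radius)
    return score

# Hypothetical catch at (50, 20) with the offense moving toward increasing x.
example_score = play_score(
    (50.0, 20.0),
    [(55.0, 20.0, 2.0),   # 5 yards in front
     (47.0, 24.0, 1.0),   # 5 yards away, behind the receiver
     (70.0, 20.0, 3.0)],  # 20 yards away, ignored
)
```

Here the defender in front subtracts 1.5, the trailing defender subtracts 0.5, and the distant one is ignored, leaving a score of 18.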

On this particular play from Week 3, George Kittle catches the ball at the opponent’s 44 yard line. He is then met with three defenders, as shown in the circle representing a 10 yard radius, in positions that seem to block every angle he wants to go. This is exactly what you want to do as a defense after a receiver has caught the ball - limit the damage. Anthony Hitchens, being one of the better tacklers in the league, as shown by his tackling score, decreases this particular play score even more. In the end, this play received a score of 6.96, due to three close defenders and their positions to stop Kittle from advancing further.

In [None]:
Image('../input/gitrepo/graphs/Week_3_Pass_to_George_Kittle.png')

# Additional features and feature correlation

In addition to the features mentioned above, we also devised two others to build our model: the distance between the closest defender and the pass catcher, and that defender's tackler score. A heatmap of how these features correlate can be seen below.

In [None]:
Image('../input/gitrepo/graphs/Feature_Corr.png')

When two features have a strong relationship, they are considered either positively or negatively correlated. Highly correlated variables are usually avoided when building models because they can skew the output. If two supposedly independent variables represent the same occurrence, they add redundancy, “noise,” or inaccuracy to the model. Models rely solely on the information they are given to produce a useful output, and correlated inputs can inflate the variance of at least one of the regression estimates.

However, these correlations can also help us when studying stats like these, because they surface interesting facts about the game. **One of the most important things to note is that no matter who the defender is, there is a high chance that the receiver goes down if there is a defender near him**. The heat map’s high correlation between Closest Distance and Play Score showcases this.
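
For reference, each cell of a correlation heatmap like this one is a pairwise correlation coefficient. The sketch below computes the Pearson coefficient by hand; the feature values are made up for illustration and are not taken from our data. With pandas, `DataFrame.corr()` computes the same quantity for every pair of columns.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: play score tends to rise as the closest defender
# gets farther away, because fewer points are subtracted from the base score.
closest_distance = [1.0, 2.5, 4.0, 6.0, 9.0]
play_scores = [8.0, 11.0, 13.5, 16.0, 19.0]
r = pearson(closest_distance, play_scores)
```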


![display gif](https://thumbs.gfycat.com/AnchoredClutteredIchidna-mobile.mp4)

# Step 4: Determining actual YAC information

We determined the actual YAC amount for each eligible play across all games.

In [None]:
import pandas as pd
actual_yac = pd.read_csv('../input/gitrepo/intermediate_data/yac_labels.csv')
actual_yac.head()  # call the method to display the first five rows

# Step 5: Constructing the model

After gathering the above inputs and determining the true YAC, we aggregated this data for each eligible play across every game. To transform the tackler scores into a feature we could use, we performed an operation similar to the one used to compute play scores, weighting each defender's score by his distance to the pass catcher. We then built an XGBoost supervised regression model to predict YAC.

XGBoost handles large numbers of features well and is often among the most accurate models for tabular data. At first, we were consistently getting high MSE values, signifying high error, when passing in our data as plain arrays. We then converted our arrays into DMatrix objects (the data structure XGBoost uses to optimize memory efficiency and training speed), and the model’s performance improved significantly.

**How XGBoost Works:**
1. The mean of the target values is used as the initial prediction, and the corresponding initial residual errors are computed.
2. A model (a shallow decision tree) is trained with the independent variables as inputs and the residual errors as targets.
3. The model's output, scaled by a learning rate, is added to the previous predictions, and the residual errors are recomputed.
4. Steps 2 and 3 are repeated M times, until the required number of models has been built.
5. The final prediction is the initial prediction plus the sum of the scaled outputs of all the models.
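
The steps above can be sketched in plain Python with one-split trees (stumps) on a single feature. This is ordinary gradient boosting with squared error, written only to make the loop concrete; XGBoost itself adds regularization, second-order gradient information, and many engineering optimizations that are not shown. All function names here are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so no split is possible
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                    # step 1: start from the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):                   # step 4: repeat M times
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)        # step 2: fit to the residuals
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)  # step 3
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model         # step 5: additive prediction
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data with a jump between x = 3 and x = 4.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(toy_x, toy_y)
```

The learning rate keeps each tree's contribution small, so the residuals shrink gradually over the rounds instead of being fit in one step.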

In [None]:
Image('../input/gitrepo/graphs/xgboost.png')

# Conclusion
The results of our efforts to estimate YAC were very promising: the model reached an MSE of 4.2277. Because MSE is measured in squared yards, this corresponds to a root mean squared error of roughly 2.06 yards per catch, while accounting only for the receiver’s ability, the defenders’ ability, and the closest defender(s). This model would be very useful in play-to-play scenarios when trying to estimate the best receiver/matchup combination to target.
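
Since MSE is expressed in squared yards, taking its square root gives an error in yards. The short sketch below shows both quantities; the helper names are illustrative, and only the value 4.2277 comes from our results.

```python
import math

def mse(actual, predicted):
    """Mean squared error, in squared units of the target."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

# Converting the reported MSE into yards.
reported_mse = 4.2277
reported_rmse = math.sqrt(reported_mse)  # about 2.06 yards
```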

We opted to use the supervised regression model over its decision tree counterpart as it yielded better predictions. A residual plot detailing the accuracy of the model is shown below.

A residual plot has the residual values on the vertical axis and the independent variable on the horizontal axis. It is typically used to diagnose problems with a regression. This residual plot describes how well our regression results matched the actual yardage gained on each play. The seemingly random scatter of the errors suggests that the model is not systematically biased and that the residuals are roughly normally distributed. The skew on the right-hand side showed that we needed to modify the data before running the regression, so we removed several outliers, which produced a much more normal result.

In [None]:
Image('../input/gitrepo/graphs/YAC_Residuals.png')

# Future Improvements
One way we can improve our model is by factoring into the play score not only player positions, but also their speed and acceleration at the moment the ball is caught. This way, we can better estimate how much ground a receiver can gain before being met by an opponent. In addition, our model accounts for defenders within a ten yard radius on all plays, but in the future we could use a dynamic radius that takes the sideline and field position into account, as 10 yards in the red zone is not nearly the same as 10 yards in your own half of the field.

Other factors we could add in further research are the play type and the down on which the pass was thrown. These could signify key moments in games and also help determine which plays are most effective at which times. Moreover, the position the defender plays could be noted to help determine which type of coverage to attack. This is just the first study in what could be a wealth of useful information for offensive pass schemes.


[Code, Graphs, and Modules can be found here](https://github.com/adithyashanmugam/NFL-Big-Data-Bowl-2021)

Authors: Adithya Shanmugam, Vinayak Nadig, Praveen Ravisankar