# Introduction

Defenses in the National Football League are complex systems that require collaboration among all 11 players on the field, coordinators on the sideline, and coaches in the skybox. Ultimately, an NFL defense has one goal: stop the opposing offense, get off the field, and let its own offense take over for a chance to score. Understanding how well different pass coverage packages perform, and when to use them, goes a long way toward playing successful situational football. As part of the 2021 NFL Big Data Bowl, our team analyzed and visualized thousands of rows of data to better understand the success of pass defense coverages.

In this notebook, we set out to address a few central questions that would give NFL teams more insight into how and when to use each coverage. Our questions heading into this project were:

1. What different coverage schemes do defenses employ?
2. Which coverages tend to perform better, and in which situations do they perform best?
3. How does a defense react to certain offensive plays and formations?


# Data Preprocessing

Though we were given the NFL's internal data from all passing plays in the 2018 season, the data required considerable cleanup and preprocessing to be useful for this project.

Our preprocessing began with creating a SQL Server database to store the provided data and build new data sets. We first imported all the provided files and assigned primary and secondary keys as required. We then joined each week's data set with *plays.csv* on ‘playId’ and ‘gameId’, creating a new table for each week. This gave us a more detailed data set for each week that contained information about both plays and players. We also cleaned the data so that coordinates were flipped where necessary to match the direction of the offense’s target end zone. Furthermore, we added a calculated 'distance from line of scrimmage' column to capture player positioning and assist our machine learning algorithm. Since coverage labels were provided for each play in Week 1, we joined the Week 1 and coverage data sets on ‘gameId’ and ‘playId’, which added a coverage column to the Week 1 data.
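
The join, coordinate flip, and distance calculation described above can also be sketched in pandas. This is only an illustration, not the SQL Server pipeline we actually ran: the function name `merge_and_normalize`, the output column `distFromLOS`, and the input columns `playDirection` and `absoluteYardlineNumber` are assumptions about how the data is laid out, as are the field dimensions used for the flip.

```python
import pandas as pd

FIELD_LENGTH = 120.0     # yards, including both end zones (assumed coordinate range)
FIELD_WIDTH = 160.0 / 3  # yards, about 53.3 (assumed coordinate range)


def merge_and_normalize(week: pd.DataFrame, plays: pd.DataFrame) -> pd.DataFrame:
    """Join tracking rows to play-level data and orient every play the same way."""
    merged = week.merge(plays, on=["gameId", "playId"], how="inner")

    # Assumed column: playDirection is "left" when the offense moves toward x = 0.
    moving_left = merged["playDirection"] == "left"
    merged.loc[moving_left, "x"] = FIELD_LENGTH - merged.loc[moving_left, "x"]
    merged.loc[moving_left, "y"] = FIELD_WIDTH - merged.loc[moving_left, "y"]

    # Assumed column: absoluteYardlineNumber is the line of scrimmage in the same
    # x coordinates, so it has to be flipped along with the players.
    los = merged["absoluteYardlineNumber"].where(
        ~moving_left, FIELD_LENGTH - merged["absoluteYardlineNumber"]
    )

    # Negative values are behind the line of scrimmage from the offense's view.
    merged["distFromLOS"] = merged["x"] - los
    return merged
```

After this step, a player five yards behind the line of scrimmage gets the same `distFromLOS` value whichever way the offense was originally facing.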

We used a combination of technologies for our analysis and visualizations. We wrote our model and animated plays in Python on Google Colab. We used SQL, R, and Excel to process data, calculate statistics, and find correlations between coverages and other features, and we used Tableau to create visualizations.



# Coverage Model

Since we were given coverage schemes for Week 1, we wanted to use that data to predict coverage schemes for the remaining weeks. We understood that a defense can show one coverage before the snap, disguise it, and end up running a different one. Therefore, we limited our model to predicting coverages at the moment the ball was snapped. To do so, we filtered the ‘*event*’ column in Week 1 by “ball_snap”. We then filtered the data to include only offensive players, because we wanted to predict coverage schemes based on offensive characteristics.

### Feature Selection

The features we used to train our model were *‘x’, ‘y’, ‘a’, ‘dis’, ‘o’, ‘dir’, ‘quarter’, ‘down’, 'yardsToGo', 'offenseFormation', 'personnelO', 'typeDropback'*, and we set our target feature to be ‘coverage’. Since several of these features are categorical, we created dummy columns for them before training our model. We then standardized the continuous columns. Since coverage is also a categorical value, we encoded its labels. Finally, we split the data into train and test sets.

### Model Selection

We tested various machine learning models, including decision trees, SVC, XGBoost, and LightGBM. Comparing the accuracies of these models, we found that random forest worked best, achieving 85% train accuracy. We also tried deep learning and TabNet methods, but the results fell below our expectations, so we chose a random forest with 100 estimators. The best estimator was a decision tree with the Gini criterion.
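
A comparison like this can be sketched with scikit-learn as follows. The data here is synthetic (generated with `make_classification`), only three of the model families are included, and the sketch scores each model with 5-fold cross-validation rather than the train accuracy we reported, so its numbers say nothing about the coverage data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and coverage labels (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svc": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean held-out accuracy across 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Scoring on held-out folds instead of the training set gives a more conservative estimate, which matters when tree ensembles can fit their training data very closely.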

### Feature Importance

For our random forest model, the features with the highest importance were, in order: *'x', 'yardsToGo', 'o', 'dir', 'a', 'y', 'quarter', 'dis', 'down'*. After these, the next highest importances belonged to the dummy features created for *‘typeDropback’, ‘offenseFormation’* and *‘personnelO’*. We had hypothesized that yardsToGo and the x and y coordinates of players would influence our coverage model greatly, but we were surprised that offensive personnel and offensive formation did not carry as much importance when predicting coverage.
We used our random forest model below to predict coverage for all 17 weeks and then performed further coverage analysis from there.
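
For reference, a ranking like the one above can be read off a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below uses synthetic data; the column names are attached only as labels, so the order it prints is meaningless for football.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels borrowed from our numeric features; the data itself is synthetic (assumption).
feature_names = ["x", "y", "a", "dis", "o", "dir", "quarter", "down", "yardsToGo"]
X, y = make_classification(
    n_samples=500, n_features=len(feature_names), n_informative=5,
    n_classes=3, random_state=0,
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```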


In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocessing(df):
    
    print('Preprocessing Started...')
    
    # Keep only the frame at the snap and only offensive players
    week = df[df['event'] == 'ball_snap']
    options = ['QB', 'RB', 'WR', 'TE', 'HB', 'FB']
    week = week[week['position'].isin(options)]
    
    # Select the model features plus the target, then drop incomplete rows
    # together so that x and y stay aligned
    features = ['x', 'y', 'a', 'dis', 
                'o', 'dir',
                'quarter', 'down',
                'yardsToGo', 'personnelO', 'offenseFormation', 'typeDropback']
    week = week[features + ['coverage']].dropna()
    y = week['coverage']
    x = week[features]
    
    # Creating dummies for categorical columns
    x = pd.get_dummies(x)
    print('Scaling Started...')
    
    # Standardizing the numeric columns
    sc = StandardScaler()
    col = ['x', 'y', 'a', 'dis', 'o', 'dir', 'quarter', 'down', 'yardsToGo']
    x[col] = sc.fit_transform(x[col])
    print('Preprocessing Done')
    
    # Encoding the coverage column since it is categorical
    print('Label Encoding Started....')
    lenc = LabelEncoder()
    lenc.fit(y)
    y = lenc.transform(y)
    print('Label Encoding Finished...')
    
    return x, y, lenc

def training(x,y):
    X_train,X_test,y_train,y_test = train_test_split(x,y)
    rf = RandomForestClassifier()
    rf.fit(X_train,y_train)
    rf_pred = rf.predict(X_test)
    print(classification_report(y_test,rf_pred))
    return rf


def prediction(data,model,lenc):
    pred = model.predict(data)
    pred = lenc.inverse_transform(pred)
    return pred

### Cover 3 Zone Play Visualization
Here the Bears line up in a Cover 3 Zone at the snap of the ball, with three deep defenders. However, this left the underneath route open, and Aaron Rodgers was able to throw a short pass to the left to Davante Adams for a touchdown.


In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import numpy as np
import matplotlib.image as mpimg
%matplotlib inline

i1 = mpimg.imread('../input/visualizations/cover3_final.png')
i2 = mpimg.imread('../input/visualizations/cover3_TD_final.png')
i3 = mpimg.imread('../input/visualizations/cover6_final.png')
i4 = mpimg.imread('../input/visualizations/cover6_TD_final.png')


fig = plt.figure(figsize=(60., 60.))

grid = ImageGrid(fig, 111,  # similar to subplot(111)
                 nrows_ncols=(1, 2),  # creates 1x2 grid of axes
                  axes_pad=0.5 # pad between axes in inch.
                 )

for ax, im in zip(grid, [i1, i2]):
    ax.axis('off')
    ax.imshow(im)  

### Cover 6 Zone Play Visualization
Here the Ravens line up in a Cover 6 Zone at the snap of the ball, with three deep defenders who split the deep coverage into two quarters and a half. Although the defenders' roles here are similar to those in Cover 3 Zone, the Ravens were able to confuse the Bills offense at the start of the play. Josh Allen was only able to throw an incomplete pass to Kelvin Benjamin in the middle of the field.


In [None]:
fig = plt.figure(figsize=(60., 60.))

grid = ImageGrid(fig, 111,  # similar to subplot(111)
                 nrows_ncols=(1, 2),  # creates 1x2 grid of axes
                  axes_pad=0.5 # pad between axes in inch.
                 )

for ax, im in zip(grid, [i3, i4]):
    ax.axis('off')
    ax.imshow(im)

# Findings

After building our model, we were able to visualize the predicted coverage data and answer our questions.

### 3rd and Long (10 or more yards to go)

Third and long is a common occurrence in NFL games, and it is critical for the defense to come up with a stop in these scenarios to force a punt and get the ball back to its offense. We wanted to analyze which coverages defenses use most in these situations and which ones perform best. To do so, we analyzed the coverages used by defenses on 3rd and long plays in each quarter of the game.

In every quarter we found that Cover 3 Zone and Cover 1 Man were used most often. However, these schemes allowed on average a yard more per play than a Cover 4 Zone defense did. Cover 4 Zone was not used as often but performed better: it allowed a big play (20+ yards) only 3.8% of the time, whereas Cover 3 and Cover 1 did so 7% of the time.
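
A per-coverage summary of this kind can be sketched in pandas as below. We produced our figures with SQL, R, Excel, and Tableau, so this is only an equivalent illustration; the rows are made up, and the column name `playResult` is an assumption about how yards gained is stored.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the competition data.
plays = pd.DataFrame({
    "down":       [3, 3, 3, 3, 3, 2],
    "yardsToGo":  [12, 10, 15, 11, 4, 10],
    "coverage":   ["Cover 3 Zone", "Cover 3 Zone", "Cover 4 Zone",
                   "Cover 4 Zone", "Cover 1 Man", "Cover 3 Zone"],
    "playResult": [25, 5, 8, 0, 6, 30],
})

# 3rd and long: third down with 10 or more yards to go.
third_and_long = plays[(plays["down"] == 3) & (plays["yardsToGo"] >= 10)]

summary = (
    third_and_long
    .assign(big_play=third_and_long["playResult"] >= 20)  # 20+ yards allowed
    .groupby("coverage")
    .agg(
        plays=("playResult", "size"),
        avg_yards=("playResult", "mean"),
        big_play_rate=("big_play", "mean"),
    )
)
print(summary)
```

Keeping the play count next to each rate makes it easier to spot coverages whose numbers rest on only a handful of snaps.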

### Red Zone

The red zone is another key area where we thought coverage analysis could be useful. In the red zone, defenses once again used Cover 3 and Cover 1 more often than other coverages. These allowed a touchdown around 23 to 25 percent of the time, whereas a Cover 6 Zone defense allowed a touchdown only 18 percent of the time. On average, Cover 6 also allowed fewer yards per play.

To take our analysis a step further, we also investigated which offensive formations were used the most in the red zone. We found that Empty and Shotgun formations were used the most by offenses. Against each of these formations, we found that showing a Cover 6 scheme at the snap of the ball was the best at stopping touchdowns.

### Goal-to-Go

Goal-to-go situations are very important for the offense to score in and are particularly difficult for teams to defend. To assess how well defenses performed in these situations, we looked at what percent of the time each coverage allowed a touchdown. Once again, we saw that Cover 6 and Cover 4 were the best at stopping touchdowns. Showing these coverages allowed a touchdown only 20% of the time, whereas other coverages allowed touchdowns upwards of 30% of the time.

### EPA and Play Result

In our play-by-play analysis, we found that Cover 4 and Cover 6 Zones performed particularly well at reducing the offense's EPA (expected points added). They also allowed a smaller offensive play result on average than other coverages. We also found that Cover 2 Man allowed by far the lowest EPA (-0.23) but the second-highest offensive play result (7.235 yards on average). On the flip side, Cover 0 Man allowed the highest EPA (0.03) but the second-lowest average play result (5.82 yards).
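
The two metrics can rank the same coverage very differently, which is easier to see when both rankings sit side by side. The sketch below shows one way to build such a table in pandas; the coverage labels A, B, and C and all values are made up, and the column names `epa` and `playResult` are assumptions.

```python
import pandas as pd

# Made-up per-play values for illustration only.
plays = pd.DataFrame({
    "coverage":   ["A", "A", "B", "B", "C", "C"],
    "epa":        [-0.5, 0.1, 0.3, 0.1, -0.1, 0.0],
    "playResult": [9, 7, 4, 6, 5, 3],
})

# Average EPA and yards allowed per coverage.
by_coverage = plays.groupby("coverage")[["epa", "playResult"]].mean()

# Rank 1 is best for the defense (lowest value) on each metric.
ranks = by_coverage.rank(method="min").astype(int).add_suffix("_rank")

report = by_coverage.join(ranks).sort_values("epa")
print(report)
```

In this toy table, coverage A is the best by EPA and the worst by yards allowed, the same kind of split we saw for Cover 2 Man.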

### Defenders in the Box and Number of Pass Rushers

To dig a little deeper after our coverage analysis, we also analyzed how the number of defenders in the box and the number of pass rushers affected the play result.

Defenses were able to hold a Shotgun offense to negative or zero yardage about 50% of the time when they had 8 defenders in the box. However, teams used 4, 5, 6, or 7 defenders in the box more often than 8. With fewer than 8 defenders in the box, defenses held the offense to negative or zero yardage only around 43% of the time. We hypothesized that having 8 defenders in the box can be more effective in short-yardage situations.

Additionally, rushing 5 resulted in a sack 9% of the time and rushing 6 resulted in a sack 12% of the time, compared to just 6% when 4 pass rushers were sent.

In [None]:
from IPython.display import IFrame

# Import Tableau Visualization 
IFrame('https://public.tableau.com/views/VisualizationofCoverageData/EPAandPlayResultbyCoverage?:embed=y&:display_count=yes&?:showVizHome=no', width = 750, height = 900)

# Summary, Conclusions and Future Considerations

Through our analysis, we concluded that although Cover 3 Zone and Cover 1 Man are the most frequently used defensive schemes, there are key situations in which they are not the most effective. In goal-to-go and red zone situations, initially showing Cover 4 and Cover 6 Zones is the most effective way to stop opposing offenses from scoring touchdowns. We recommend that NFL coaches use these coverages more often in goal-to-go and red zone situations, as doing so would give them a better chance of stopping the opposing offense.

We also found that showing Cover 4 and Cover 6 at the snap allowed a smaller play result and lower EPA than other coverages. However, in football context, Cover 4 and Cover 6 are largely used to prevent the deep ball, so it makes sense that they allow fewer yards: opposing quarterbacks are likely to settle for a shorter route or the check-down against them.

Man coverages appear to be the highest-risk, highest-reward option for an NFL defense. While Cover 2 Man certainly has the potential to stop opposing offenses from scoring, it can allow a lot of yards because it depends so heavily on cornerbacks winning their one-on-one matchups with only two safeties to help. If head coaches and defensive coordinators trust their defense to beat opposing receivers one-on-one, they should utilize Cover 2 Man the most. On the other side, Cover 0 Man has a high chance of allowing a score but also a good chance of allowing only a small number of yards. This is due to the risk that comes with blitzing and isolating corners in one-on-ones without safety help; it could pay off with a sack or result in a touchdown. See the final play of Jets vs. Raiders in Week 13 of the 2020 NFL season for an example.

Finally, our defenders in the box findings encourage the increased use of the 46 defense. The 46 defense operates with 8 defenders in the box, a single high safety, and two corners. Since having 8 defenders in the box holds shotgun offenses to negative or 0 yardage 50 percent of the time, NFL defenses should be encouraged to use 46 more, especially in short yardage situations.

Our analysis is limited by the accuracy of our machine learning algorithm. Our model is limited to 85 percent accuracy and to predicting coverages at the time of the snap. In the future, if we knew all the coverages used with 100 percent accuracy, we could better assess the success and failure of various coverages.