# Uncovering NFL Playbooks: Automatic Play Clustering

## Introduction
Gaining a better understanding of NFL defensive play efficacy can improve a team's chance of success. Existing NFL dataset fields, such as offensive formation or defenders in the box, are one way to build that understanding. However, these fields are coarse and may not offer key insight into play performance.

We contribute a novel data mining method that automatically determines a set of offensive and defensive play type categories based on the movement of the players and the football. In summary, we transform offensive and defensive player movement from each play into trajectories in higher dimensional space and cluster them into similar play categories.

Our data mining method populates two new columns in the Plays table: offensive play category and defensive play category. Parameters allow for finer or coarser play categories, so that one can delve into the nuances of player movement. This is an improvement over existing dataset fields such as defensive formation, since our new play categories offer richer and more detailed information.

We essentially uncover the playbook, both offensive and defensive, for all teams.

## Data Mining Method
Several steps, listed below, are involved in mining offensive and defensive play categories from player movement data.

**Step 1: Normalize each play**<br>
(i) Rotate players and the football in each play such that the offense always moves toward the same direction.<br>
(ii) Translate players and football so that the first frame has the football starting at coordinate (0,0).
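
The two normalization steps can be sketched as follows. This is a minimal illustration, not the project's code: the name `normalize_play`, the input layout (one dict per frame mapping a role label to an `(x, y)` pair), the `"left"`/`"right"` direction flag, and the field dimensions of 120 by 53.3 yards are all assumptions.

```python
FIELD_LENGTH = 120.0    # yards, including end zones (assumed tracking convention)
FIELD_WIDTH = 160 / 3   # yards, about 53.3 (assumed tracking convention)


def normalize_play(frames, play_direction, ball_key="FTBL"):
    """Normalize one play.

    frames: list of dicts, one per frame, mapping role label -> (x, y).
    play_direction: "left" or "right" (hypothetical flag for offense direction).
    """
    # (i) Rotate by 180 degrees about the field center when the offense
    # moves left, so that every play ends up moving to the right.
    if play_direction == "left":
        frames = [
            {role: (FIELD_LENGTH - x, FIELD_WIDTH - y) for role, (x, y) in f.items()}
            for f in frames
        ]
    # (ii) Translate so the football in the first frame sits at (0, 0).
    bx, by = frames[0][ball_key]
    return [
        {role: (x - bx, y - by) for role, (x, y) in f.items()}
        for f in frames
    ]
```

Rotating about the field center (rather than only mirroring x) keeps left/right alignments consistent from the offense's point of view.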

**Step 2: Determine all the unique offensive and defensive player role sets for all plays**<br>
Each offensive or defensive play contains a set of player positions/roles. We identify all unique player role sets. As an example, for offense there were 13305 plays with the player role set of (QB, RB, TE, WR, WR2).

If two players in a play have the same role, we append an incremental number (e.g. WR, WR2), numbering from the top of the field, and then sort the resulting labels in ascending order.
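
One way to build and count these role sets is sketched below. The function names and the input layout (a list of `(role, y)` pairs per play) are hypothetical, and treating a larger `y` as the "top of field" is an assumption about the coordinate convention.

```python
from collections import Counter, defaultdict


def role_labels(players):
    """Return the sorted role set for one play.

    players: list of (role, y) pairs, with y taken at the first frame.
    Duplicate roles are numbered from the top of the field
    (assumed here to mean the largest y): WR, WR2, WR3, ...
    """
    seen = defaultdict(int)
    labels = []
    for role, _y in sorted(players, key=lambda p: (p[0], -p[1])):
        seen[role] += 1
        labels.append(role if seen[role] == 1 else f"{role}{seen[role]}")
    return tuple(sorted(labels))


def count_role_sets(plays):
    """Count how many plays share each unique role set."""
    return Counter(role_labels(players) for players in plays)
```

For example, a play with `[("WR", 10.0), ("QB", 26.0), ("WR", 40.0), ("RB", 25.0), ("TE", 30.0)]` yields the role set `("QB", "RB", "TE", "WR", "WR2")`.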

**Step 3: For each play compute an offensive trajectory and a defensive trajectory in higher dimensional space**<br>
A trajectory in d-dimensional space is sourced from the normalized player position table and is a matrix of rows (frames) and columns (player x/y coordinates). Each row corresponds to a trajectory vertex and the number of columns represents the size of the d-dimensional space. For example, an offensive play with 100 frames and player role set of (QB, RB, TE, WR, WR2, FTBL) will have a corresponding trajectory matrix of 100 rows and 12 columns (QB has two columns for x and y, RB has two columns for x and y, etc). In this example, the trajectory is in 12-dimensional space.
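
As a rough sketch of this construction, the function below flattens each frame into one row. The name `trajectory_matrix` and the per-frame dict layout are assumptions made for illustration; the column order simply follows the order of `roles`.

```python
def trajectory_matrix(frames, roles):
    """Build a trajectory as a list of rows (one row per frame).

    frames: list of dicts mapping role label -> (x, y).
    roles: ordered role labels; each contributes an x and a y column,
           so every row has 2 * len(roles) values.
    """
    return [
        [coord for role in roles for coord in frame[role]]
        for frame in frames
    ]
```

With the role set (QB, RB, TE, WR, WR2, FTBL) and 100 frames, the result has 100 rows and 12 columns, matching the example above.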

**Step 4: Create clusters**<br>
For each offensive and defensive player role set, obtain the underlying trajectories, and then cluster them, increasing the number of clusters *k* until the radius of the largest cluster is less than *m* (in our experiments we set *m* to 1000, which results in clusters that have visually similar player movement). A larger *m* results in coarser play categories, and a smaller *m* results in finer play categories.

Each offensive and defensive play cluster is assigned a unique ID, which is used to populate the offensive play category or defensive play category column in the Plays table. Play categories therefore represent clusters of similar plays.
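
A radius-bounded clustering loop of this kind can be sketched with a farthest-first traversal, which is the strategy behind Gonzalez-style k-center clustering. The code below is an illustrative sketch rather than the project's implementation: the function name, the return format, and the choice of the first item as the initial center are assumptions, and `dist` stands in for whatever trajectory distance is used.

```python
def gonzalez_cluster(items, dist, max_radius):
    """Farthest-first clustering that stops on a radius threshold.

    Centers are added one at a time (the item farthest from all current
    centers) until every item is closer than max_radius to its center.
    Returns (centers, assignment): indices of the center items, and for
    each item the position of its center in the centers list.
    """
    if max_radius <= 0:
        raise ValueError("max_radius must be positive")
    if not items:
        return [], []
    centers = [0]  # arbitrary first center (assumption)
    nearest = [dist(items[0], it) for it in items]
    assignment = [0] * len(items)
    while True:
        far = max(range(len(items)), key=nearest.__getitem__)
        if nearest[far] < max_radius:
            break  # largest cluster radius is now below the threshold
        centers.append(far)
        cid = len(centers) - 1
        for i, it in enumerate(items):
            d = dist(items[far], it)
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = cid
    return centers, assignment
```

Each added center costs one pass over the items, so the total work is proportional to the number of clusters times the number of items, counted in distance evaluations.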

We use the [Gonzalez (1985)](https://www.sciencedirect.com/science/article/pii/0304397585902245) clustering algorithm, since it produces clusters that have radii at most two times the optimal cluster radii. We use the Dynamic Time Warping (DTW) [Keogh et al. (2005)](https://www.researchgate.net/publication/225230134_Exact_indexing_of_dynamic_time_warping) distance measure to compute distances between trajectories, since it can smooth out outlier movement by individual players.
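
For readers unfamiliar with DTW, a textbook dynamic-programming version is sketched below. It is not taken from the project's code: the function name is hypothetical, and using the Euclidean distance between trajectory rows as the per-vertex cost is an assumption.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two trajectories.

    a, b: lists of rows (coordinate sequences of equal length d).
    The cost of matching two rows is their Euclidean distance (assumed).
    Runs in O(len(a) * len(b)) time and O(len(b)) extra space.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix matched with empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            # extend the cheapest of: advance in a, advance in b, advance in both
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]
```

Because the total is a sum over matched vertices, a single stray vertex contributes only its own cost instead of dominating the distance.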

**Outliers**<br>
We exclude outlier offensive/defensive player role sets and outlier offensive/defensive clusters, i.e. exclude those that have few plays associated with them. All visuals presented only contain data from clusters that contain at least ten different plays.
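
A filter of this kind might look like the sketch below. The function name and the `play -> cluster ID` mapping are hypothetical; the default of ten plays mirrors the threshold used for the visuals, and whether the same cutoff applies elsewhere is an assumption.

```python
from collections import Counter


def drop_small_clusters(play_to_cluster, min_plays=10):
    """Keep only plays whose cluster contains at least min_plays plays.

    play_to_cluster: dict mapping a play identifier -> cluster ID.
    """
    sizes = Counter(play_to_cluster.values())
    return {
        play: cluster
        for play, cluster in play_to_cluster.items()
        if sizes[cluster] >= min_plays
    }
```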

**Cluster Creation Speedup**<br>
The DTW algorithm runtime is quadratic in the number of trajectory vertices (frames), which can be computationally expensive for larger trajectories. We simplify the trajectories (reduce the vertices) using a straightforward linear-time algorithm by [Driemel et al. (2012)](https://www.researchgate.net/publication/221589940_An_algorithmic_framework_for_segmenting_trajectories_based_on_spatio-temporal_criteria). This results in much faster preprocessing runtimes.

Our simplification error for the experiments is 3 yards, which results in much smaller trajectories that still describe the movement well.
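
The sketch below shows one simple single-pass simplification scheme: keep the first vertex, then keep each vertex that lies at least `epsilon` away from the last kept vertex, and always keep the final vertex. Whether this matches the project's implementation exactly is an assumption, as are the function name and the Euclidean distance between rows.

```python
import math


def simplify(trajectory, epsilon=3.0):
    """Greedy linear-time trajectory simplification.

    trajectory: list of rows (coordinate sequences).
    epsilon: minimum distance between consecutive kept vertices
             (3.0 mirrors the 3-yard error mentioned above).
    """
    if len(trajectory) <= 2:
        return list(trajectory)
    kept = [trajectory[0]]
    for vertex in trajectory[1:-1]:
        if math.dist(kept[-1], vertex) >= epsilon:
            kept.append(vertex)
    kept.append(trajectory[-1])  # the endpoint is always retained
    return kept
```

Since DTW is quadratic in the number of vertices, shrinking each trajectory by a constant factor reduces the cost of each distance computation by roughly the square of that factor.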

**Cluster Example**<br>
As a proof of concept, the algorithm placed the two defensive plays shown below into the same play category. Visually, it is easy to verify that the two plays are quite similar.

We used functions from [Rob Mulla's notebook](https://www.kaggle.com/robikscube/nfl-big-data-bowl-plotting-player-position) to visualize plays.

In [None]:
from IPython.display import Image

In [None]:
Image("../input/nflvisuals/CINvsPIT.png")

In [None]:
Image("../input/nflvisuals/DALvsDET.png")

## Experiments using new play categories

### Overview

We first consider which categories had the best and worst outcomes for both offensive and defensive plays, measured by the yards gained by the offensive team. The term ClusterID corresponds to a unique play category.

In [None]:
Image("../input/nflvisuals/BestWorstDefClusters.png")

In [None]:
Image("../input/nflvisuals/BestWorstOffClusters.png")

The following tables identify which defensive categories resulted in the greatest overall proportion of incomplete passes, interceptions, and sacks.

In [None]:
Image("../input/nflvisuals/MostIncompletes.png")

In [None]:
Image("../input/nflvisuals/MostInterceptions.png")

In [None]:
Image("../input/nflvisuals/MostSacks.png")
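
Per-category proportions like those in the tables above could be computed along the following lines. This is only a sketch: the function name, the `(cluster_id, pass_result)` input layout, the outcome codes, and the minimum cluster size of ten are assumptions, and the data in the usage example is made up.

```python
from collections import Counter, defaultdict


def outcome_rate_by_cluster(plays, outcome, min_plays=10):
    """Rank clusters by the share of their plays ending in `outcome`.

    plays: iterable of (cluster_id, pass_result) pairs.
    Returns (cluster_id, rate) pairs, highest rate first, skipping
    clusters with fewer than min_plays plays.
    """
    counts = defaultdict(Counter)
    for cluster_id, result in plays:
        counts[cluster_id][result] += 1
    rates = {}
    for cluster_id, results in counts.items():
        total = sum(results.values())
        if total >= min_plays:
            rates[cluster_id] = results[outcome] / total
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example: "I" stands for an incomplete pass, "C" for a completion.
example = [(7, "I")] * 6 + [(7, "C")] * 4 + [(8, "I")] * 2 + [(8, "C")] * 8
ranking = outcome_rate_by_cluster(example, "I")
```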

The histogram below shows a distribution of average yards gained by each defensive cluster. The majority of defensive play clusters fall within the 2.5 to 7 yards gained range.

In [None]:
Image("../input/nflvisuals/PlayResultDistribution.png")

### Comparing play categories

The heatmap below shows five of the most common defensive and offensive play categories used throughout the season and compares how well they perform against one another based on yards gained by the offensive team. Certain offensive play clusters work better against certain defensive clusters, and vice versa.

We observe that the defensive play cluster with an ID of 340 was very effective against the offensive play cluster with an ID of 249, with an average play result of over 5 lost yards by the offense. However, this same defensive play cluster performed very poorly against the offensive cluster with ID 1606.

Thus, knowing which defensive plays perform best overall is simply not enough; we must analyze their performance with regard to the offensive play of the other team.

In [None]:
Image("../input/nflvisuals/heatmap.png")
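
The values behind a matchup heatmap like this one are average yards for each pair of defensive and offensive categories. A minimal sketch of that aggregation follows; the function name and the `(def_cluster, off_cluster, yards_gained)` tuple layout are assumptions.

```python
from collections import defaultdict


def matchup_means(plays):
    """Average yards gained for each (defensive, offensive) cluster pair.

    plays: iterable of (def_cluster, off_cluster, yards_gained) tuples.
    Returns a dict keyed by (def_cluster, off_cluster).
    """
    acc = defaultdict(lambda: [0.0, 0])  # running [sum, count] per pair
    for def_cluster, off_cluster, yards in plays:
        entry = acc[(def_cluster, off_cluster)]
        entry[0] += yards
        entry[1] += 1
    return {pair: total / n for pair, (total, n) in acc.items()}
```

Keying by the pair rather than by the defensive category alone is what exposes cases where a defense that is strong on average does poorly against one particular offensive category.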

The catplot below shows a subset of defensive clusters and plays at equal intervals. Each column of points represents a single play cluster, with each point representing a single play. The lower the point, the fewer yards gained by the offensive team, representing a better defensive outcome.

On average, play clusters to the left outperform those on the right.

In [None]:
Image("../input/nflvisuals/playResultCatplot.png")

### Breakdown by team

Perhaps the greatest use of the clustering algorithm is that it allows for an analysis of the plays most frequently used by teams in the NFL. Essentially, we get to uncover their playbooks.

The visuals below show the defensive play clusters most commonly used by the Seattle Seahawks and the San Francisco 49ers, along with the average yards gained against each by the opposing offense. Defensive play clusters with IDs 2069 and 2580 show up for both teams, with very similar efficacy.

Many of the above experiments can be performed on data for specific teams, allowing coaches to uncover not only which plays other teams are most likely to run in certain scenarios, but also how effective each of those plays is.

In [None]:
Image("../input/nflvisuals/SEA_plays_bar.png")

In [None]:
Image("../input/nflvisuals/SF_plays_bar.png")

## Future Work

Several improvements can be made to our framework:
<ul>
<li>Other trajectory distance measures such as the Fréchet distance can be tested (this computes a worst-case distance, but may be useful for some types of analysis). 

<li>Other clustering algorithms such as k-means or more recent techniques may result in more meaningful clusters.

<li>With regard to data limitations, the supplied player positional table usually did not include the movement of all players; it would be interesting to see clusters that are created from "full" data.

<li>The outliers that were excluded may be interesting to look at in more detail (e.g. perhaps there are some outlier plays that are very successful and therefore should be utilized more often).

<li>The Gonzalez (1985) clustering algorithm runtime is O(kn) where n is the number of trajectories. We can heuristically speed up the algorithm by using a lower bound that can reduce the number of clusters that must be checked during each k-iteration.
</ul>
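
As an illustration of the first item, the discrete Fréchet distance can be computed with a dynamic program that has the same shape as DTW but takes a maximum instead of a sum. This is a standard textbook formulation offered as a sketch; choosing the discrete variant and Euclidean distance between rows are assumptions.

```python
import math


def discrete_frechet(a, b):
    """Discrete Fréchet distance between two non-empty trajectories.

    a, b: lists of rows (coordinate sequences of equal length d).
    The result is the smallest achievable value of the largest
    row-to-row distance over all monotone alignments of a and b.
    """
    if not a or not b:
        raise ValueError("trajectories must be non-empty")
    prev = []
    for i in range(len(a)):
        curr = []
        for j in range(len(b)):
            d = math.dist(a[i], b[j])
            if i == 0 and j == 0:
                best = d
            elif i == 0:
                best = max(curr[j - 1], d)
            elif j == 0:
                best = max(prev[0], d)
            else:
                best = max(min(prev[j], prev[j - 1], curr[j - 1]), d)
            curr.append(best)
        prev = curr
    return prev[-1]
```

Unlike DTW, one badly matched vertex determines the whole distance here, which is the worst-case behavior noted in the list above.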

## Links

**Code:**

All code for preprocessing, DTW, and Gonzalez clustering was custom written for this project, and can be found at this [GitHub repository](https://github.com/evanpfeifer/nfl-big-data-bowl).

**Contact:**
<ul>
    <li>Evan Pfeifer: evanpfeifer@berkeley.edu
    <li>John Pfeifer: johnapfeifer@yahoo.com
</ul>