# PaVE the Way for NFL Passing Analytics: Passing Value in Expectation

> Udit Ranasaria (@uditranasaria; Microsoft)
> <br>
> Rishav Dutta (@rishavd64; Sofy.ai)
> <br>
> [Suvansh Sanjeev](https://suvan.sh) (@SuvanshSanjeev; Carnegie Mellon University)
> <br>
> Aditya Murali (@aditya_murali7; Johns Hopkins University)

<div style="width:100%;height:0px;position:relative;padding-bottom:56.250%;"><iframe src="https://streamable.com/e/xuhfjm?autoplay=1" frameborder="0" width="100%" height="100%" allowfullscreen allow="autoplay" style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe></div>

## Introduction

Is it possible to quantify the dynamics of pass plays independently of the pass that was actually thrown? How can we evaluate defensive coverage across all the valuable areas of the field? We _tackle_ these questions by building the **Passing Value in Expectation ($PaVE$)** metric from the provided tracking data.

Any casual football viewer knows that an incomplete pass doesn’t necessarily reflect immaculate defensive coverage, and passes can be completed despite multiple defenders draped over the receiver ([see Gronkowski, Rob](https://arc-anglerfish-arc2-prod-tbt.s3.amazonaws.com/public/NI5UXEMOHJCZBNBVQUBC6RDDIE.jpg)). Rather, coaches aim to design defensive game-plans that position defenders to minimize the value of potential passes a QB can throw based on the game situation and offensive positioning.


Key highlights of our approach: 

> * **Intuitive, Modular Modeling:** All core aspects of the model should represent how football is naturally played, with each decomposed part easily modified or replaced without affecting the overall pipeline.
> * **Trajectory-Based Analysis:** Defense is often played beyond just the arrival point of the ball, and we want to capture the motion of the ball through the air. Thus, we analyze an average of 1.3 million potential passes per second of game time, using ball trajectories provided by [NFL3D](https://dutta.github.io/nfl3d.html) (Dutta et al.).
> * **Predictivity:** The metric should correlate with offensive passing output and be predictive of future offensive passing output.

Much of the high-level conceptual framework behind $PaVE$ comes from _off-ball scoring opportunity_ in soccer ([Spearman](https://www.researchgate.net/publication/327139841_Beyond_Expected_Goals) and [Spearman et al.](https://www.researchgate.net/publication/315166647_Physics-Based_Modeling_of_Pass_Probabilities_in_Soccer)).


## Methodology
To evaluate offensive value over all potential passes across the field, we _break up_ each pass into 3 components: 
> 1. **Selection:** The probability that a given pass is selected to be thrown by the quarterback ($S(p)$)
> 
> 2. **Influence:** Aggregate probability that a pass will be ultimately influenced by a member of the offense given that that pass is thrown ($I(p)$)
> 
> 3. **Value:** Given that a certain pass is thrown, value gained for a completion ($VC(p)$) or incompletion ($VI(p)$)

$PaVE$ is defined as the expected value of a pass, taken over the selection distribution $S$ on the set of potential passes $\mathcal{P}$:
\begin{align*}
PaVE_f&=\sum_{p\in\mathcal{P}}\underbrace{\left(VC(p)\cdot I(p)+VI(p)\cdot(1-I(p))\right)}_{\text{expected pass value}}\cdot S(p) \\
PaVE&=\frac{1}{N}\sum_{f=1}^N PaVE_f,
\end{align*}
where $PaVE_f$ is the value for frame $f$, and $PaVE$ is evaluated on a play with $N$ frames from snap to throw. We evaluate defenses as minimizing $PaVE$ and offenses as maximizing it.
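
As a minimal sketch of the two sums above, assume the candidate passes of a frame have already been flattened into equal-length lists holding $VC(p)$, $VI(p)$, $I(p)$ and $S(p)$. The function names and the toy numbers are ours for illustration and are not taken from the released code.

```python
def pave_frame(vc, vi, infl, sel):
    """Per-frame value: sum over candidate passes of expected value times S(p).

    vc[p] = VC(p), vi[p] = VI(p), infl[p] = I(p), sel[p] = S(p).
    """
    total = 0.0
    for vc_p, vi_p, i_p, s_p in zip(vc, vi, infl, sel):
        expected_value = vc_p * i_p + vi_p * (1.0 - i_p)
        total += expected_value * s_p
    return total


def pave_play(frame_values):
    """Average the per-frame values over the N frames from snap to throw."""
    return sum(frame_values) / len(frame_values)


# Toy frame with three candidate passes (made-up numbers, S sums to 1).
vc = [1.5, 0.4, 2.0]
vi = [-0.5, -0.5, -0.5]
infl = [0.2, 0.8, 0.5]
sel = [0.1, 0.6, 0.3]
frame = pave_frame(vc, vi, infl, sel)
```

Because each frame is scored independently, any of the three components can be swapped out without touching this aggregation step.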

For a given frame, we take the quarterback's location as the starting point of every possible pass. The set of possible passes is $\mathcal{P}=\mathcal{L}\times\mathcal{T}$, where:
> $\mathcal{L}=[0,120]\times[0,54]$ is the set of field locations to which a pass can be thrown.
> 
> $\mathcal{T}=[0,4]$ is the set of times of flight of the pass, in seconds.

Any given pass can then be specified as $p=(\ell,T)$. In our implementation, we discretize both $\mathcal{L}$ and $\mathcal{T}$.


### Influence

#### Player Influence

The basic building block for pass influence is evaluating the probability that a player $j$ can arrive at a location $\ell$ by the time the ball arrives there, $T$. We first define $ToA(j, \ell)$ to be the time of arrival of player $j$ at $\ell$. We assume players have a reaction time $t_{\text{react}}$ to the pass and project their velocity at that time onto the vector from their location to $\ell$ in order to get their initial velocity during ball pursuit. From here, we model player movement with constant acceleration up to a maximum speed until they reach $\ell$. 
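
The kinematics described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: we assume the player keeps their current velocity during the reaction time, and that an initial speed above the maximum is capped. The default parameter values are the tuned values reported in the Hyperparameter Tuning section, and `time_of_arrival` is a hypothetical name.

```python
import math

A_MAX = 7.67   # yd/s^2, tuned value reported later in this writeup
V_MAX = 9.42   # yd/s
T_REACT = 0.2  # s


def time_of_arrival(pos, vel, target, a_max=A_MAX, v_max=V_MAX, t_react=T_REACT):
    """Seconds until a player at pos (yd) with velocity vel (yd/s) reaches target."""
    # Assumption: the player drifts at their current velocity while reacting.
    x = pos[0] + vel[0] * t_react
    y = pos[1] + vel[1] * t_react
    dx, dy = target[0] - x, target[1] - y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return t_react
    # Project the velocity onto the unit vector pointing at the target.
    v0 = (vel[0] * dx + vel[1] * dy) / dist
    v0 = min(v0, v_max)  # assumption: cap the initial speed at v_max
    # Constant acceleration until v_max is reached ...
    t_acc = (v_max - v0) / a_max
    d_acc = v0 * t_acc + 0.5 * a_max * t_acc ** 2
    if dist <= d_acc:
        # Target reached while still accelerating: solve 0.5*a*t^2 + v0*t = dist.
        t_run = (-v0 + math.sqrt(v0 ** 2 + 2.0 * a_max * dist)) / a_max
    else:
        # ... then constant speed v_max for the remaining distance.
        t_run = t_acc + (dist - d_acc) / v_max
    return t_react + t_run
```

A negative projected speed (a player initially moving away from the target) is handled by the same equations: the player decelerates, turns around, and then accelerates toward $\ell$.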

Now if $T\geq ToA(j,\ell)$, the player is projected to arrive in time to influence the ball, and otherwise they are not. To smooth this step function and account for temporal uncertainty due to unmodeled factors, we apply a logistic function. Lastly, we assume that the ball must have a height $z$ between $z_{\text{min}}$ and $z_{\text{max}}$ at $\ell$ in order for a player to influence it. This is incorporated as a simple indicator variable on $z$, a component of $\ell$. Altogether, we have
$$
P_\mathrm{inf}(\ell,T,j)=\left(1+\exp\left(-\frac{\pi(T-ToA(j,\ell))}{\sqrt{3}\sigma}\right)\right)^{-1}\cdot\mathbf{1}(z\in[z_\mathrm{min},z_\mathrm{max}]),
$$

where $\sigma$ is tuned on the provided dataset, with lower values representing less uncertainty and a function closer to a step function. 
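
A direct transcription of this formula might look like the sketch below. The function name `p_influence` is hypothetical, and the defaults are the values reported in the Hyperparameter Tuning section. Note that a logistic distribution with scale $s$ has standard deviation $s\pi/\sqrt{3}$, so the scale $\sqrt{3}\sigma/\pi$ used here makes $\sigma$ the standard deviation of the arrival-time uncertainty.

```python
import math


def p_influence(T, toa, z, sigma=0.31, z_min=1.0, z_max=3.0):
    """Probability that a player with time of arrival `toa` influences a ball
    that reaches the location at time T and height z (seconds and yards)."""
    # Indicator on the ball height: out-of-reach balls cannot be influenced.
    if not (z_min <= z <= z_max):
        return 0.0
    x = math.pi * (T - toa) / (math.sqrt(3.0) * sigma)
    # Numerically stable logistic: avoids overflow in exp for large |x|.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

When the player and the ball arrive at the same moment the probability is exactly 0.5, and shrinking `sigma` pushes the curve toward the original step function.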

<figure>
    <figcaption style='text-align:center; font-size: 12px'>Figure 1: Example calculation of $PPI$ by team over 3D trajectory.</figcaption>
    <img src="https://raw.githubusercontent.com/uditrana/BigDataBowl/master/bdb_throw_example.png"/>
</figure>

#### Integrating Across a Pass Trajectory

Having modeled the probability that a player influences a given location at a given point in time, we can model potential player influence $PPI(p)$ on a pass $p=(\ell,T)$ as a sequence of Bernoulli random variables, one for every player-time combination along the trajectory, with probabilities given by $P_\mathrm{inf}$. The trials are ordered chronologically. Within each timestep, defensive players come before offensive players: defensive influence only requires tipping the ball rather than catching it, so we give the “tiebreaker” to the defense to capture the relative ease of tipping over catching once near the ball.

By passing the ball through this sequence of Bernoulli trials, we can calculate the probability that any given player-time combination is the first success (representing the play made on the ball). Integrating these probabilities over players and over the trajectory gives us the total potential player influence for the offense ($PPI_\mathrm{off}$) and defense ($PPI_\mathrm{def}$), with the remaining probability, denoted $PPI_{\text{rem.}}$, corresponding to neither team influencing the ball:

$$
\begin{align}
PPI_\mathrm{def}(p)&=\sum_{t=0}^{T-1}\underbrace{\left(\prod_{j\in\mathcal{O}\cup\mathcal{D}}\prod_{t'=0}^{t-1}\left(1-P_\mathrm{inf}(\ell(t'),t',j)\right)\right)}_{P(\text{ball is still in play at }t)}~\underbrace{\left(1-\prod_{j\in\mathcal{D}}\left(1-P_\mathrm{inf}(\ell(t),t,j)\right)\right)}_{P(\text{at least 1 defender influences ball at }t)} \\
PPI_\mathrm{off}(p)&=\sum_{t=0}^{T-1}\underbrace{\left(\prod_{j\in\mathcal{O}\cup\mathcal{D}}\prod_{t'=0}^{t-1}\left(1-P_\mathrm{inf}(\ell(t'),t',j)\right)\right)}_{P(\text{ball is still in play at }t)}~\underbrace{\left(\prod_{j\in\mathcal{D}}\left(1-P_\mathrm{inf}(\ell(t),t,j)\right)\right)}_{P(\text{no defender influences ball at }t)}~\underbrace{\left(1-\prod_{j\in\mathcal{O}}\left(1-P_\mathrm{inf}(\ell(t),t,j)\right)\right)}_{P(\text{at least 1 receiver influences ball at }t)} \\
PPI_\mathrm{rem.}(p)&=1-PPI_\mathrm{def}(p)-PPI_\mathrm{off}(p)
\end{align}
$$

Here, $\ell(t)$ refers to the location of the football at time $t$ along the trajectory of pass $p$. For the purpose of individual player rankings, we credit player $j$ with influence $PPI_j$ in proportion to the contribution of their $P_\mathrm{inf}$ to the team $PPI$. See Figure 1 for an example of the $PPI$ calculation from given $P_\mathrm{inf}$ values along a sample ball trajectory.
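
The three team-level quantities can be accumulated in a single pass over the timesteps, as in the sketch below. It assumes the per-player $P_\mathrm{inf}$ values have already been computed for each timestep of the trajectory; `team_ppi` and its argument layout are hypothetical names chosen for illustration.

```python
from math import prod


def team_ppi(p_off, p_def):
    """Return (PPI_off, PPI_def, PPI_rem) for one pass trajectory.

    p_off[t] and p_def[t] hold the P_inf values of every offensive and
    defensive player at timestep t along the trajectory.
    """
    in_play = 1.0  # probability that nobody has influenced the ball yet
    ppi_off = 0.0
    ppi_def = 0.0
    for off_t, def_t in zip(p_off, p_def):
        no_def = prod(1.0 - q for q in def_t)
        no_off = prod(1.0 - q for q in off_t)
        # The defense gets the tiebreaker within a timestep.
        ppi_def += in_play * (1.0 - no_def)
        ppi_off += in_play * no_def * (1.0 - no_off)
        in_play *= no_def * no_off
    return ppi_off, ppi_def, 1.0 - ppi_off - ppi_def
```

The running `in_play` term is the first product in both equations, so after the last timestep it equals $PPI_\mathrm{rem.}$, which gives a convenient consistency check.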

### Selection

Pass selection consists of both QB ability and QB decision-making. For the former, we define a distribution over passes $p=(\ell,T)\in\mathcal{P}$ using the historical distribution of times of flight conditioned on the distance the pass travels on the field:
$$H(p) = P(T|\ell)$$

As for QB decision-making, we assume that a QB intelligently selects passes that he deems likely to be influenced by his team. Conveniently, the offensive influence metric developed in the previous section models exactly this.

Combining these two factors, we have
$$S(p) = H(p) \cdot (PPI_{\text{off}}(p))^{\alpha}$$

Here, $\alpha$ is a tuned hyperparameter, and $S(p)$ is normalized to be a valid probability distribution over $\mathcal{P}$.
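
As a small sketch of the selection step, assume $H(p)$ and $PPI_\mathrm{off}(p)$ are available as equal-length lists over the discretized passes. The function name `selection` is ours, and the default `alpha` is the tuned value reported in the Hyperparameter Tuning section.

```python
def selection(h, ppi_off, alpha=1.2):
    """Normalized selection probabilities S(p) over the candidate passes."""
    weights = [h_p * (ppi_p ** alpha) for h_p, ppi_p in zip(h, ppi_off)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no candidate pass has positive selection weight")
    return [w / total for w in weights]
```

With $\alpha>1$ the distribution concentrates on the passes the offense is most likely to influence, while $\alpha=0$ would reduce $S$ to the normalized historical distribution $H$.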

### Value

During a play, a defense should position itself to prevent the offense from hitting on explosive plays or throws likely to move the sticks. Yet it must also be aware that giving low-value passes too much space can concede valuable yards after catch (YAC). To capture this, we model expected YAC (xYAC) and expected points added (EPA) to get expected EPA (xEPA).

#### xYAC Model
We used XGBoost to train a tree-based model similar to the nflfastR xYAC model. From the tracking data we created a set of 21 features, upon which we trained the model. The model was trained on completions from weeks 1-4 and validated on the remaining weeks; hyperparameters were manually tuned to maximize accuracy.

The features are the y-coordinate of the ball plus, for each of the 5 closest defenders, their speed, x-coordinate, y-coordinate, and distance at the time of the catch. After training, the 3 most important features were the closest defender's speed, the ball's y-coordinate, and the closest defender's y-coordinate.

The trained model achieved a validation accuracy of 81%, with predictions an average of 2.4 yards from the true YAC.

#### xEPA Model
    
Our model is largely similar in training and feature construction to the nflfastR EP model by [Baldwin et al](https://www.opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/). However, we dropped the time feature, since the clock time spent on a play is an unmodeled source of uncertainty.

The model was trained on the entire publicly available play-by-play dataset from nflfastR using the weighting technique suggested by [Yurko et al](https://arxiv.org/pdf/1802.00998.pdf), following the same training algorithm and hyperparameters as [Baldwin et al](https://www.opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/). Our EPA values have a 0.8 correlation with nflfastR EPA.

Now we define the value of a completion $VC(p)$ by appropriately updating down, distance, and yardline based on $x+xYAC(p)$, where $x$ is the $x$-component of $\ell$. Similarly, incompletion value $VI(p)$ is retrieved by incrementing down without adjusting distance or yardline. $xEPA$ can be calculated as the expected value of $EPA$ over completion probability.
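
The game-state updates behind $VC(p)$ and $VI(p)$ can be sketched as below. This is a simplified illustration with hypothetical names (`GameState`, `after_completion`, `after_incompletion`): it measures the pass in air yards past the line of scrimmage rather than in raw field coordinates, and it leaves turnovers on downs and safeties to the caller. An expected-points model would then be evaluated on the returned states.

```python
from typing import NamedTuple, Optional


class GameState(NamedTuple):
    down: int             # 1 through 4
    distance: float       # yards to go for a first down
    yards_to_goal: float  # yards from the line of scrimmage to the goal line


def after_completion(state: GameState, air_yards: float, xyac: float) -> Optional[GameState]:
    """State after a completion that gains air_yards + xYAC."""
    gain = air_yards + xyac
    yards_to_goal = state.yards_to_goal - gain
    if yards_to_goal <= 0.0:
        # Touchdown: valued directly, not through a new down and distance.
        return None
    if gain >= state.distance:
        # First down; the distance is capped in goal-to-go situations.
        return GameState(1, min(10.0, yards_to_goal), yards_to_goal)
    return GameState(state.down + 1, state.distance - gain, yards_to_goal)


def after_incompletion(state: GameState) -> GameState:
    """State after an incompletion: only the down changes."""
    # Assumption: the caller treats down == 5 as a turnover on downs.
    return state._replace(down=state.down + 1)
```

Keeping the state update separate from the expected-points model means either piece can be replaced independently, in line with the modular design goal stated in the introduction.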

<figure>
    <figcaption style='text-align:center; font-size: 12px'>Figure 2: Demonstration of $PaVE$ and its constituent components on a sample play. Top row: $PPI_\mathrm{off}$, $S$. Bottom row: $xEPA$, $PaVE$. Notice that between frames 35 and 40, Kendricks takes away many of the available passes to Gordon, but by the time the pass is thrown, Gordon moves past him into an open area. Furthermore, Edelman is the most open receiver by $PPI$, but after factoring in value, $PaVE$ favors Gordon over Edelman. </figcaption>
    <div style="width:100%;height:0px;position:relative;padding-bottom:55.556%;"><iframe src="https://streamable.com/e/sj00un?autoplay=1" frameborder="0" width="100%" height="100%" allowfullscreen allow="autoplay" style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe></div>
</figure>

## Validation

### Predictivity
<figure>
    <figcaption style='text-align:center; font-size: 12px'>Table 3: The Pearson correlation coefficient between successive 4-game bins within a season for three per-dropback metrics: $PaVE$, $EPA$, Passing Yards.</figcaption>
    <style type="text/css">
    .tg  {border-collapse:collapse;border-spacing:0;}
    .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
      overflow:hidden;padding:12px 20px;word-break:normal;}
    .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
      font-weight:normal;overflow:hidden;padding:12px 20px;word-break:normal;}
    .tg .tg-whk6{background-color:#c0c0c0;border-color:inherit;font-family:Arial, Helvetica, sans-serif !important;font-size:13px;
      text-align:left;vertical-align:middle}
    .tg .tg-9wq8{border-color:inherit;text-align:center;vertical-align:middle}
    .tg .tg-jaco{background-color:#c0c0c0;border-color:inherit;text-align:left;vertical-align:middle}
    .tg .tg-lupf{background-color:#c0c0c0;border-color:inherit;color:#000000;font-weight:bold;text-align:left;vertical-align:middle}
    .tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
    .tg .tg-llyw{background-color:#c0c0c0;border-color:inherit;text-align:left;vertical-align:top}
    .tg .tg-lgs3{background-color:#c0c0c0;border-color:inherit;color:#000000;font-weight:bold;text-align:left;vertical-align:top}
    .tg .tg-biip{background-color:#c0c0c0;border-color:#000000;font-weight:bold;text-align:left;vertical-align:middle}
    .tg .tg-0a7q{border-color:#000000;text-align:left;vertical-align:middle}
    .tg .tg-73oq{border-color:#000000;text-align:left;vertical-align:top}
    .tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
    </style>
    <table class="tg" style="margin-left:auto;margin-right:auto">
    <thead>
      <tr>
        <th class="tg-whk6"><span style="font-weight:bold;font-style:italic">bin\bin+1</span></th>
        <th class="tg-lupf">PaVE</th>
        <th class="tg-lgs3">EPA</th>
        <th class="tg-lgs3">Yards</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td class="tg-biip"><span style="font-weight:bold">PaVE</span></td>
        <td class="tg-0a7q">0.19</td>
        <td class="tg-73oq">0.30</td>
        <td class="tg-73oq">0.44</td>
      </tr>
      <tr>
        <td class="tg-jaco"><span style="font-weight:bold">EPA</span></td>
        <td class="tg-9wq8">~</td>
        <td class="tg-0pky">0.36</td>
        <td class="tg-0pky">0.33</td>
      </tr>
      <tr>
        <td class="tg-llyw"><span style="font-weight:bold">Yards</span></td>
        <td class="tg-c3ow">~</td>
        <td class="tg-c3ow">~</td>
        <td class="tg-0pky">0.43</td>
      </tr>
    </tbody>
    </table>
</figure>

Football is a dynamic sport with many interacting factors and high-variance outcomes. Amid this variance, we find that $PaVE$ is more predictive of future passing yards (0.44) than either EPA (0.33) or passing yards itself (0.43). It also shows a moderate correlation with future EPA (0.30). All correlations are computed on metric values averaged per dropback in consecutive 4-game bins over the 2018 NFL season.
<br/>

### Hyperparameter Tuning
There are four model parameters that we tuned based on ground-truth passes in the dataset:
> * maximum player acceleration ($a_\text{max}$)
> * maximum player velocity ($v_\text{max}$)
> * influence logistic function parameter $\sigma$
> * selection parameter $\alpha$

The first three parameters ($a_\text{max}$, $v_\text{max}$, and $\sigma$) were tuned jointly to minimize the logistic loss between the individual potential player influence $PPI_j(p)$ and the true pass outcome. We tuned the selection parameter $\alpha$ with a logistic loss on the selection probability $S(p)$ of the ground-truth pass. All parameters were tuned in PyTorch ([Paszke et al.](https://arxiv.org/abs/1912.01703)) with the Adam optimizer ([Kingma and Ba](https://arxiv.org/abs/1412.6980)). The final tuned values were $a_\text{max} = 7.67\,\frac{\mathrm{yd}}{\mathrm{s}^2}$, $v_\text{max} = 9.42\,\frac{\mathrm{yd}}{\mathrm{s}}$, $\sigma = 0.31\,\mathrm{s}$, and $\alpha = 1.2$.
Additional model hyperparameters were set as $t_\mathrm{react}=0.2\,\mathrm{s}$, $z_\mathrm{min}=1\,\mathrm{yd}$, and $z_\mathrm{max}=3\,\mathrm{yd}$.


## Applications

### Defensive Positional Breakdown
Breaking $PaVE$ down by position reveals the expected trend in Figure 4: players in positions closer to the receivers and in more valuable areas of the field have better defensive $PaVE$. This is consistent with cornerbacks being by far the most valuable position with respect to pass defense.

<figure>
    <figcaption style='text-align:center; font-size: 12px'>Figure 4: Individual defensive positional rankings by $PaVE$.</figcaption>
    <img style='width: 60%' src='https://github.com/uditrana/BigDataBowl/blob/master/colorized.png?raw=true'/>
</figure>

### Coverage Comparison on Route Combinations
In Figure 5, we identify two "GO-OUT-CROSS-IN" plays, one with Kirk Cousins throwing into good coverage and one with Jared Goff throwing into poor coverage, as measured by $PaVE$. In the top play, Eli Apple picks up the in route, while in the bottom play, Robert Woods' in route is left unguarded, conceding high $PaVE$ to the offense. We foresee $PaVE$ being used in this way to identify the most successful coverages against various route combinations.

<figure>
    <figcaption style='text-align:center; font-size: 12px'>Figure 5: Note that the color scale is shifted more positive on the bottom play, showing that the defensive coverage was weaker.</figcaption>
    <div style="width:100%;height:0px;position:relative;padding-bottom:94.340%;"><iframe src="https://streamable.com/e/eqsp8i?autoplay=1" frameborder="0" width="100%" height="100%" allowfullscreen allow="autoplay" style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe></div>
</figure>

### Defensive Player Optimization

Having developed $PaVE$ to quantify defensive coverage strength, we investigate its application as an objective function for the joint optimization of defensive player trajectories over the course of a play. This can serve as a guide to uncover new coverage strategies against various route combinations. We present a first look at a naïve greedy optimization procedure in Figure 6.

<figure>
    <figcaption style='text-align:center; font-size: 12px'>Figure 6: Top: original play with a blown coverage on Taylor Gabriel's seam route. Bottom: optimized defensive trajectories given the offensive trajectories. Note that the optimizer is a causal system and does not look ahead in the tracking data.</figcaption>
    <div style="width:100%;height:0px;position:relative;padding-bottom:94.340%;"><iframe src="https://streamable.com/e/cd5pbm?autoplay=1" frameborder="0" width="100%" height="100%" allowfullscreen allow="autoplay" style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe></div>
</figure>

## Conclusion

$PaVE$ represents a novel attempt at comprehensively modeling the highly variable action that makes up an NFL passing play. This first attempt is not only a predictive measure that passes the eye test, but also a foundation for many further applications. For example, a more mature rendition of $PaVE$ could be used by NFL teams each week by swapping out or honing specific components. If the Rams are game-planning for the Seahawks, they might shift the selection probability $P(T|\ell)$ to reflect RW3’s propensity for the [moonball](https://youtu.be/DEKhWgrB0qY?t=795) while also simulating various defensive schemes to minimize DK Metcalf’s value based on his past route tendencies. We truly hope that this metric **PaVE**s the way to new findings for passing in the NFL.

Our code is available [here](https://github.com/uditrana/BigDataBowl/).