# Introduction

In [1]:
suppressMessages(suppressWarnings(install.packages("reactablefmtr")))
suppressMessages(suppressWarnings(install.packages("nflfastR")))
suppressMessages(suppressWarnings(install.packages("nflplotR")))

suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(nflfastR)))
suppressMessages(suppressWarnings(library(reactable)))
suppressMessages(suppressWarnings(library(reactablefmtr)))
suppressMessages(suppressWarnings(library(viridis)))
suppressMessages(suppressWarnings(library(scales)))
suppressMessages(suppressWarnings(library(htmlwidgets)))
suppressMessages(suppressWarnings(library(IRdisplay)))
suppressMessages(suppressWarnings(library(gt)))

library(dplyr)
library(ggplot2)
library(tidyr)

dir.create(file.path("plots/"), showWarnings = FALSE)

rankings <- read.csv("../input/xt-rankings/player_rankings.csv", row.names=1) 

Although tackling is involved in almost every snap, judging a player's tackling talent remains much more an art than a science. This raises the question - how should we measure a player's tackling ability?

## Current Metrics

A common approach is to simply count the number of tackles a player gets in a season. But there's an obvious issue with this:

**Cumulative tackles do not account for how often a player is put in a position to make a tackle.**

If player A has 100 tackles but was targeted 150 times, and player B has only 20 tackles yet was targeted only 20 times, who would we say is the better tackler? This gives us Missed Tackles Percentage (MIS%) - simply the ratio of missed tackles to total tackle attempts.
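
As a quick worked version of the example above, here is a minimal sketch in R. It assumes that "targeted" means the same thing as a tackle attempt; the data frame and column names are made up for illustration.

```r
# Hypothetical totals for the two players in the example above.
players <- data.frame(
  player   = c("A", "B"),
  tackles  = c(100, 20),
  attempts = c(150, 20)   # assumption: "targeted" == tackle attempts
)

# Missed tackles are the attempts that did not end in a tackle.
players$missed <- players$attempts - players$tackles

# Missed Tackles Percentage: missed tackles over total attempts.
players$mis_pct <- 100 * players$missed / players$attempts

print(players)
```

Player A leads in raw tackles yet misses a third of his attempts, while player B misses none, which is exactly the information a cumulative count hides.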

But this leads to the natural next question - how do we define a tackle attempt (and subsequent tackle failure)? The extreme cases are obvious, like when a defender wraps himself around the ball carrier and slides off while the runner continues unabated.

But what about the non-extreme cases? Imagine we have a wide receiver who catches the ball in the middle of the field with a lone safety between him and the endzone. Since the safety is the only player in the ball carrier's path, this is clearly a great tackling opportunity, right? Or should it depend on how much space is available to the receiver? Or if he caught the ball in stride?

**Missed Tackles Percentage does not differentiate failed tackles by the quality of the opportunity.**

It treats tackle opportunities as a binary process (1: opportunity, 0: no opportunity). A lot of nuance in player tackling ability is lost in this discretization.



# Tackles over Expected (ToE)

Cumulative tackles and missed tackles are flawed because they do not account for the context of the situation surrounding the tackle. So what's the alternative?

Similar to how we now use expected points and expected completion percentage, we can better measure tackling ability using Tackles over Expected, defined as **the difference between the tackles a given player makes and the average number of tackles a player would make in the same states of the plays** (factoring in variables such as location, speed, position). This would then tell us how well a player compares to his peers at tackling adjusted for their surrounding circumstances.


## How to Calculate Expected Tackles (xT)

For any given snap, the $j$th player's expected number of tackles **$xT$ equals the probability that the average player gets a tackle on the play**:

$$
\begin{aligned}
xT_{j} &= E[\text{numberOfTackles} \mid \text{isTackler}] \cdot P(\text{isTackler}) + E[\text{numberOfTackles} \mid \text{isNotTackler}] \cdot P(\text{isNotTackler}) \\
&= 1 \cdot P(\text{isTackler}) + 0 \cdot P(\text{isNotTackler}) \\
&= P(\text{isTackler})
\end{aligned}
$$

Formally, we want the probability that player $j$ is the tackler at time $t$ given all the events that have previously occurred:

$$
P(T_{jt} | S_{t}, S_{t-1}, S_{t-2},...,S_{0})$$
where $S_{k}$ is the state of the game at time $k$. Due to project time constraints, we make the simplifying assumption that $P(T_{jt} | S_{t}, S_{t-1}, S_{t-2},...,S_{0}) = P(T_{jt} | S_{t})$; i.e. we only consider the current state of the field for our predictions.

For a given player $j$, this gives us a sequence of probabilities as the play develops: $p_{j0}, p_{j1}, p_{j2},...,p_{jT}$. To summarize this sequence in a single statistic, we take its average to get the $j$th player's final $xT$ for the play. Subtracting this from the player's true number of tackles on the play (1 if he made the tackle, 0 otherwise) gives us our Tackles over Expected (ToE) metric for the play.
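
To make the arithmetic concrete, here is a minimal sketch in R. The frame-level probabilities and the `plays` data frame are made-up values for illustration, not model output, and summing play-level values to get a cumulative total is our assumption about how the aggregation works.

```r
# Hypothetical frame-level tackle probabilities for one defender on one play.
frame_probs <- c(0.05, 0.10, 0.20, 0.45, 0.70)

# xT for the play: the average of the per-frame probabilities.
xT_play <- mean(frame_probs)

# Observed outcome: 1 if the defender made the tackle, 0 otherwise.
made_tackle <- 1

# Tackles over Expected for this single play.
toe_play <- made_tackle - xT_play

# Assumed aggregation: sum play-level tackles and xT per player.
plays <- data.frame(
  player = c("A", "A", "B"),
  tackle = c(1, 0, 1),
  xT     = c(0.3, 0.2, 0.6)
)
totals <- aggregate(cbind(tackle, xT) ~ player, data = plays, FUN = sum)
totals$ToE <- totals$tackle - totals$xT

print(totals)
```

A defender who makes a tackle the model considered unlikely earns a large positive ToE on that play, while a defender who fails on a high-probability opportunity is penalized by the same logic.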

All that's left is to actually generate the probabilities.

## A New Approach - Going Deeper

The historical way of tackling such a problem was to handcraft useful features and plug these into older machine learning methods (such as tree-based models or SVMs). This has even been the framework in past Big Data Bowl winners. In the 2022 winning submission [Punt Returns: Using the Math to Find the Path](https://www.kaggle.com/code/robynritchie/punt-returns-using-the-math-to-find-the-path/notebook), the authors need to calculate what they define as Penalized Expected Arrival Time to the Returner:

> Intuitively speaking, this time penalty is:  
1-5 seconds when the blocker is <5 yards to the tackler and directly in his path,  
0.1-1 seconds when the blocker is >5 yards from the tackler but in the neighbourhood of his path, or  
0-0.1 seconds when he is far enough to the side of the tackler’s path that he will likely not be able to block him.
>

A priori, we don't know how well these explicit assumptions and parameterizations hold. A weighted Gaussian kernel is used, but is it parameterized well? Could a non-parametric approach perform better, one that has the representational power to generate multimodal blocker time distributions, conditioned on blocker success or talent?

**Deep learning models do not require such restrictive assumptions**, being able to take as input raw data and learn whatever representations are supported by the data. For this reason, we turn our *attention* to more flexible architectures.

## Attention-Based Transformer Models

**The model is a Transformer-based architecture built on the concept of [attention](https://arxiv.org/abs/1706.03762) in machine learning, which has revolutionized how models process sequences.** If we think of the 23 agents on the field at a specific time (22 players plus the football) as an unordered sequence, a transformer can focus on the most relevant players at a given time and predict who will be the final tackler, all without affixing an arbitrary ordering to the players.

<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Neural Network Architecture</b></h2>
        <i> The model takes in each player's feature vector, passes it through a Transformer's Encoder block, and eventually outputs the probability that the player will make a tackle.
        </i>
        <br>
        <br>
    </div>
        <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/tackle_arch_high_sat.jpeg?raw=true" width=100%>
    </div>
</center>
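
To illustrate why attention does not need an ordering of the players, here is a simplified sketch of single-head scaled dot-product self-attention in base R. It is not our actual architecture: the weights are random, the dimension `d_k` is arbitrary, and the helper names are hypothetical. It only shows that shuffling the rows of the input shuffles the rows of the output in the same way.

```r
set.seed(42)
n_agents   <- 23   # 22 players plus the football
n_features <- 9
d_k        <- 4    # arbitrary projection size for this sketch

# Random stand-ins for a state matrix and learned projection weights.
X  <- matrix(rnorm(n_agents * n_features), n_agents, n_features)
Wq <- matrix(rnorm(n_features * d_k), n_features, d_k)
Wk <- matrix(rnorm(n_features * d_k), n_features, d_k)
Wv <- matrix(rnorm(n_features * d_k), n_features, d_k)

# Row-wise softmax, subtracting each row's max for numerical stability.
row_softmax <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# Single-head self-attention: every agent attends to every other agent.
self_attention <- function(X) {
  Q <- X %*% Wq
  K <- X %*% Wk
  V <- X %*% Wv
  A <- row_softmax(Q %*% t(K) / sqrt(d_k))
  A %*% V
}

out      <- self_attention(X)
perm     <- sample(n_agents)
out_perm <- self_attention(X[perm, ])

# Permuting the agents only permutes the output rows.
print(all.equal(out[perm, ], out_perm))
```

Because no positional encoding is added, the output for each agent depends only on the set of agents on the field and not on the order in which they are listed.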

At time $t$, we define our state $S_{t}$ as a $(23,P)$ matrix, where the $i^{th}$ row represents the $i^{th}$ agent on the field (the 22 players plus the ball) and $P$ is the number of features per agent. In our base model, $P=9$, with features:

**$[$IsAttackingTeam, IsFootball, IsBallCarrier, XCoord, YCoord, Direction, Orientation, Speed, Acceleration$]$**

Successive models include height, weight, and position. The only feature preprocessing we did was to standardize the features according to [Michael Lopez's notebook](https://www.kaggle.com/statsbymichaellopez/nfl-tracking-wrangling-voronoi-and-sonars) and normalize them to improve training speed.
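
As a rough sketch of what one state matrix looks like, the R snippet below builds a $(23, 9)$ matrix from random values and normalizes the continuous columns. The values are synthetic, the row layout (offense in rows 1-11, ball in row 23) is arbitrary, and z-scoring with `scale()` is an assumption about the normalization step rather than a description of our exact pipeline.

```r
set.seed(1)

feature_names <- c("IsAttackingTeam", "IsFootball", "IsBallCarrier",
                   "XCoord", "YCoord", "Direction", "Orientation",
                   "Speed", "Acceleration")
n_agents <- 23  # 22 players plus the football

state <- matrix(0, nrow = n_agents, ncol = length(feature_names),
                dimnames = list(NULL, feature_names))

# Indicator features (row assignments are arbitrary for this sketch).
state[1:11, "IsAttackingTeam"] <- 1
state[23, "IsFootball"]        <- 1
state[1, "IsBallCarrier"]      <- 1

# Synthetic tracking values; the field is 120 x 53.3 yards in tracking data.
state[, "XCoord"]       <- runif(n_agents, 0, 120)
state[, "YCoord"]       <- runif(n_agents, 0, 53.3)
state[, "Direction"]    <- runif(n_agents, 0, 360)
state[, "Orientation"]  <- runif(n_agents, 0, 360)
state[, "Speed"]        <- runif(n_agents, 0, 10)
state[, "Acceleration"] <- runif(n_agents, 0, 5)

# Assumed normalization: z-score each continuous column.
continuous <- c("XCoord", "YCoord", "Direction", "Orientation",
                "Speed", "Acceleration")
state[, continuous] <- scale(state[, continuous])

print(dim(state))
```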

# Accounting for Player Positions

## Including Tactical Context

Before moving on to the model outputs, have we accounted for all the context we need on the field? With NGS data we have all the physical movements, but can we incorporate the **tactical context** going on during the play?

For example, suppose Derrick Henry (RB) is carrying the ball up the middle, breaking tackle attempts from player A (MLB) and player B (FS). Given nearly identical NGS features (think acceleration, angle, distance to ball carrier), should we have different tackling expectations based on their position's responsibilities?

## Defining Positional Embeddings

We want to be able to account for player position. The most obvious approach would be to assign dummy variables to each position. For example, a 1 if quarterback, 0 if not. But with so many positions, that would mean adding 20 dummy variables to our data, 19 of which would be 0 for any given player. One issue with this is that adding a large number of dummy variables effectively [reduces](https://files.eric.ed.gov/fulltext/ED493866.pdf) the amount of training data.

Additionally, using 1s (player is a cornerback) and 0s (is not a cornerback) ignores relationships that exist between positions (isn’t a cornerback more like a safety in their behavior than a guard?). If we want to preserve these relationships, we need a continuous representation (e.g. 0.75 for CBs, 0.7 for S, and 0.2 for G).
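
The difference can be seen in a small sketch in R. The three positions and the 2-D embedding values below are invented for illustration; they are not our learned embeddings.

```r
# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

positions <- c("CB", "S", "G")

# Dummy (one-hot) encoding: one column per position.
one_hot <- diag(length(positions))
dimnames(one_hot) <- list(positions, positions)

# Any two different positions are equally (dis)similar: cosine is 0.
cosine(one_hot["CB", ], one_hot["S", ])
cosine(one_hot["CB", ], one_hot["G", ])

# Hypothetical continuous embeddings (values made up).
emb <- rbind(CB = c(0.9, 0.1),
             S  = c(0.8, 0.3),
             G  = c(0.1, 0.9))

# Now CB is closer to S than to G.
cosine(emb["CB", ], emb["S", ]) > cosine(emb["CB", ], emb["G", ])
```

With dummy variables every pair of distinct positions looks identical to the model, whereas a continuous representation leaves room for a cornerback to sit closer to a safety than to a guard.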


## The Graph2Vec Algorithm

At a high level, the Graph2Vec algorithm learns a representation of the player's position by considering both their location on the field and distance to every other player at every moment they're on the football field. The inspiration behind this [algorithm](https://nlp.stanford.edu/pubs/glove.pdf) in the NLP community is that a word can be defined by its surrounding words in a sentence.

Translated to sports, **we can define a player's position by the surrounding players in a play**. We do this by representing each player on the field as a node in a graph, connected by edges (weighted by their distance).

## Positional Relationships

Let's first make it clear what our output is. Below would be a 20-dimensional dummy-encoding vector with each dimension (column) having an interpretable meaning - the first representing "is quarterback", the second "is wide receiver", etc:

Cooper_Kupp = $[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]$

For our 32-dimensional positional embedding vector, we instead have:

Cooper_Kupp = $[-3.07, -5.81, 14.20, 2.41, 6.12, 4.34, -1.89, -3.83, 3.15, ...]$

What does each dimension represent? It's not explicitly defined - it's what the model learned. We can try to intuit some meaning from them by viewing them on a graph:

<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Clustered 2-D Projection of Positional Embeddings</b></h2>
        <i> We can visually inspect player vectors by projecting them into a 2-D vector space using <br>
            the t-SNE algorithm. We can identify similar groups within this space using spectral clustering.
        </i>
        <br>
        <br>
    </div>
        <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/embeddings_32_dim_cluster_plot_no_titles.png?raw=true" width=100%>
    </div>
</center>

For example, take the green cluster. Our (projected) Cooper Kupp positional embedding would look approximately like:

Cooper_Kupp = $[40, -21.0]$

Though there are a few wide receivers in an adjacent cluster (like Michael Thomas, Skyy Moore), we can see Kupp surrounded by guys like Stefon Diggs and Gabe Davis. In fact, it seems like a lot of the wide receivers have a dimension 2 value of less than -15. One possible interpretation of our embeddings is that negative values of dimension 2 encode some sort of measure of "wide receiver-ness". 

We can also see that **the relationships we would expect to exist between positions arise naturally from our embeddings** by taking the similarity of every pair of players' embeddings and grouping by position:

<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Similarity Between Positions</b></h2>  
        <i> Note that the positions we would expect to be similar (FS, SS, CB) are similar,<br>
            while the positions we would expect to be dissimilar (e.g. CB & G) are dissimilar
        </i>
        <br>
        <br>
    </div>
        <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/similarity_32_dim_no_titles.png?raw=true" width=100%>
    </div>
</center>
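
One way to compute position-level similarities like these is sketched below in base R. It assumes cosine similarity as the metric and uses random 4-dimensional vectors for six hypothetical players, so the numbers it prints carry no football meaning.

```r
set.seed(7)

# Hypothetical players and random stand-in embeddings.
players <- data.frame(
  name     = paste0("player_", 1:6),
  position = c("CB", "CB", "FS", "FS", "G", "G")
)
emb <- matrix(rnorm(6 * 4), nrow = 6,
              dimnames = list(players$name, NULL))

# Cosine similarity for every pair of players.
unit <- emb / sqrt(rowSums(emb^2))
sim  <- unit %*% t(unit)

# Average the player-level similarities within each pair of positions.
pos        <- players$position
pos_levels <- unique(pos)
pos_sim <- sapply(pos_levels, function(p2) {
  sapply(pos_levels, function(p1) {
    block <- sim[pos == p1, pos == p2, drop = FALSE]
    # Within a position, skip each player's similarity with himself.
    if (p1 == p2) mean(block[upper.tri(block)]) else mean(block)
  })
})

print(round(pos_sim, 3))
```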

### Pass-Catching Running Backs - Case Study

The original motivation behind this idea was how to properly represent players who don't fit nicely into a single well-defined position. The current 49ers roster, headlined by Deebo Samuel and Christian McCaffrey, has an abundance of these types of players.

In fact, there were two WRs with over 100 rushing yards last season - Deebo Samuel (SF) and Curtis Samuel (WAS). The top three RBs in both receptions and receiving yards were Austin Ekeler, Leonard Fournette, and Christian McCaffrey. As a test of our positional embeddings, shouldn't these WRs be more similar to RBs than the average WR is (and vice versa for the listed RBs)?

This turns out to be the case: the similarities between these WRs and RBs are higher than the average WR-RB similarity. **This is part of the tactical context that positional embeddings capture but dummy-variable encoding and NGS data ignore**.


<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Similarity Between Players</b></h2>
    <i>Our selected players (left graph) have a similarity higher than the average similarity
        between a WR and RB (bottom right cell), <br> and compared to other players (right graph)
    </i>
    </div>
    <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/similarity_between_players_heatmap.png?raw=true" width=100%>
</div>
</center>

The result also holds for slot receivers like Cooper Kupp and Tyler Boyd - they are more similar to the average RB (avg_rb row) than the average WR is (bottom-right cell).

# Model Breakdown

Now that we have inputs that capture the context surrounding a player, we can model his probability of finishing the play with the tackle.

<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:22px"><b>Example Play</b></h2>  
        <i> Here we can see the state of the field, the current probabilities of each defender
            registering a tackle, and how these probabilities change over time (and vary within each model)
        </i>
        <br>
        <br>
    </div>
        <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/videos/video_2022100213_3554.gif?raw=True" width=100%>
        <br>
        <img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/model_preds_line_800_300_names.gif?raw=True" width=100%>
    </div>
</center>

If we look closer at a few specific timesteps, we can see one of the model's greatest strengths - **its ability to learn the effects of high-level features (like blocking) from raw data**.

<center>
<img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/video_img_1_caption_box.png?raw=true" width=100%>
</center>

<center>
<img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/video_img_2_caption_box.png?raw=true" width=100%>
</center>

## Summary Statistics

As a baseline, we simply predicted the tackler as the defender closest to the ball. **Our models predict the correct tackler more often than this naive baseline** on every metric except the base model's 3/4-mark accuracy. The full model, which adds height, weight, and our own novel positional embeddings, performs best at the 1/4 mark, 1/2 mark, and final frame, and on play-level accuracy.
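
For reference, the closest-defender baseline can be sketched in a few lines of R. The `frame` data frame, its column names, and the coordinates are hypothetical; only the rule itself (pick the defender nearest the ball) comes from the description above.

```r
# Hypothetical single frame: the football plus three defenders.
frame <- data.frame(
  name        = c("football", "defender_A", "defender_B", "defender_C"),
  is_football = c(TRUE, FALSE, FALSE, FALSE),
  is_defender = c(FALSE, TRUE, TRUE, TRUE),
  x           = c(50, 52, 47, 60),
  y           = c(26, 27, 20, 26)
)

# Baseline rule: predict the defender with the smallest distance to the ball.
predict_closest_defender <- function(frame) {
  ball      <- frame[frame$is_football, ]
  defenders <- frame[frame$is_defender, ]
  dist_to_ball <- sqrt((defenders$x - ball$x)^2 + (defenders$y - ball$y)^2)
  defenders$name[which.min(dist_to_ball)]
}

print(predict_closest_defender(frame))
```

The rule ignores speed, direction, and blockers entirely, which is why it makes a useful floor for the learned models.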

In [2]:

model_data <- data.frame(
  'Model' = c('Baseline', 'Base Model', 'Height and Weight', 'Height, Weight, and Position'),
  'quarter' = c(34.5, 43.8, 42.9, 51.8),
  'half' = c(45.2, 50.9, 51.8, 53.6),
  'third' = c(64.6, 61.6, 78.6, 67.9),
  'final_frame' = c(54.0, 64.3, 60.7, 73.2),
  'anytime_acc' = c(91.3, 92.1, 92.9, 96.4)
)

gt_tbl <- gt(model_data, rowname_col='Model') %>%
  gt::tab_header(title = gt::md("Model Statistics"),
                 #subtitle = gt::md("not now")
                ) %>%
  gt::cols_label(Model = "Model",
                 quarter = "1/4 mark",
                 half = "1/2 mark",
                 third = "3/4 mark",
                 final_frame = "Final Frame",
                 anytime_acc = "Anytime Acc%") %>% 
  gt::tab_spanner(columns = c("quarter", "half", "third", "final_frame"),
                  label = "Frame-Level Accuracy %") %>% 
  gt::tab_spanner(columns = c("anytime_acc"),
                  label = "Play-Level Accuracy %") %>%
  gt::tab_footnote(footnote = "We provide classification accuracy at four standardized points within every play 
        - at the 1/4 mark, 1/2 mark, 3/4 mark, and final frame of the play. \n
        We also calculate Anytime Acc% - the percentage of plays where the model
        identified the correct tackler with the highest probability at any time."
        ) %>%
  gt::data_color(
    #columns = vars(`quarter`, `half`, `third`, `final_frame`, `anytime_acc`),
    fn = scales::col_numeric(
               palette = c("#35b0ab", "#ffffff", "#ffffff", "#ffffff", "#ffffff"),
               domain = NULL,
               reverse = TRUE)
  )

display_html(as_raw_html(gt_tbl))


Model Statistics

| Model                        | 1/4 mark | 1/2 mark | 3/4 mark | Final Frame | Anytime Acc% |
|------------------------------|----------|----------|----------|-------------|--------------|
| Baseline                     | 34.5     | 45.2     | 64.6     | 54.0        | 91.3         |
| Base Model                   | 43.8     | 50.9     | 61.6     | 64.3        | 92.1         |
| Height and Weight            | 42.9     | 51.8     | 78.6     | 60.7        | 92.9         |
| Height, Weight, and Position | 51.8     | 53.6     | 67.9     | 73.2        | 96.4         |

The 1/4 mark, 1/2 mark, 3/4 mark, and Final Frame columns are Frame-Level Accuracy %; Anytime Acc% is Play-Level Accuracy %.

We provide classification accuracy at four standardized points within every play - at the 1/4 mark, 1/2 mark, 3/4 mark, and final frame of the play. We also calculate Anytime Acc% - the percentage of plays where the model identified the correct tackler with the highest probability at any time.


# Player Rankings

Using our Tackles Over Expected metric, we can now rank players based on their performance relative to how their peers would perform on average. Below we have rankings for both (cumulative) **Tackles over Expected** and **Tackles over Expected per Snap** *(divided by total number of snaps)*.

The notable aspect of our model is that, while properly identifying players we know to be good tacklers, it also **highlights players who we suspect are good tacklers but are overlooked by traditional metrics such as missed tackle percentage**.

In [3]:
first_last <- function(name) {
  first <- stringr::word(name, 1)
  last <- stringr::str_trim(stringr::str_extract(name, " .*"))
  glue::glue("<div style='line-height:11px'><span style ='font-family:Arial;font-weight:bold;color:grey;font-size:10px'>{first}</span></div>\n    
<div style='line-height:9px'><span style='font-weight:bold;font-variant:small-caps;font-size:13px'>{last}</span></div>")
}

t0 <- rankings %>%
  left_join(nflfastR::teams_colors_logos, by = c('Team' = 'team_abbr')) %>%
  select(c(name, Position, Team, team_logo_espn, Total.Snaps, xT, xT.snap)) %>%
  mutate(xT = round(xT, digits = 3),
         xT.snap = round(xT.snap, digits = 3),
         name = first_last(name))

tackle <- reactable(t0,
          pagination = TRUE,
          highlight = TRUE,
          striped = TRUE,
          defaultSorted = "xT",
          defaultSortOrder = "desc",
          theme = espn(),
          defaultPageSize = 10,
          defaultColDef = colDef(align = "center"),
          columns = list(
            name = colDef(name = "name", maxWidth = 120, style=list(fontFamily = "Arial"),
                          html = TRUE),
            # Keys in `columns` must match the column names of t0
            Position = colDef(name = "Position", maxWidth = 70, 
                         style = list(fontWeight = "bold", fontFamily = "Arial")),
            Total.Snaps = colDef(name = "Total Snaps", maxWidth = 70, 
                           style = list(fontWeight = "bold", fontFamily = "Arial")),
            team_logo_espn = colDef(name = "Team", maxWidth = 70,
                                    cell = embed_img(height = 20, width = 20)), 
            xT = colDef(name = "xT", maxWidth = 120, 
                              cell = color_tiles(t0,colors = viridis::plasma(10, direction = -1),
                                                 bold_text = TRUE,
                                                 box_shadow = TRUE)),
            xT.snap = colDef(name = "xT/Snap", maxWidth = 120, 
                               cell = color_tiles(t0,colors = viridis::plasma(10, direction = -1), 
                                                  bold_text = TRUE,
                                                  box_shadow = TRUE,
                                                  number_fmt = scales::number_format(accuracy = 0.01)))
            )
)

f0 <-"plots/tackle.html"
saveWidget(tackle, file.path(normalizePath(dirname(f0)),basename(f0)))

display_html('<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Tackling Metrics</b></h2>
        <i>2022 NFL Season: Weeks 1-10</i>
    </div>
    <iframe src="plots/tackle.html" align="center" width="100%" height="500" frameBorder="0"></iframe>
    <span style="font-style:italic;font-size:15px">
        While underrepresented by traditional tackling metrics, players like Maxx Crosby and Denzel Perryman <br> 
        (recently suspended for excessively <a href="https://bleacherreport.com/articles/10097293-texans-denzel-perryman-suspended-3-games-for-violating-nfl-player-safety-rules">violent</a>
        hits) score highly in Tackles over Expected</span><br>
</div>
</center>')

# Innovation & Utility

We now have a new way to measure how well a player performs at tackling relative to his peers. The model can learn football-specific structure (such as the effect of blocking) directly from raw tracking data, thanks to its deep, attention-based transformer and its novel method of capturing the relationships between player positions via positional embeddings.

Its uses extend to player evaluation, acquisition, and gameday strategy. The positional embeddings framework also has many uses independent of tackling, such as any modeling that uses position as an input or potential player similarity metrics.

<a href="https://github.com/scottmaran/big_data_bowl_2024">Code</a>

# Appendix

(not finished)

In [4]:
sim_matrix <- read.csv("../input/xt-rankings/percentile_similarity_full_df.csv", row.names=1) 
sim_df <- as.data.frame(sim_matrix) %>% mutate_if(is.numeric, round, digits=3)

sim_df_name <- function(name) {
  first <- stringr::word(name, 1)
  last <- stringr::str_trim(stringr::str_extract(name, " .*"))
  #glue::glue("<div style='line-height:31px'><span style ='font-family:Arial;font-weight:bold;font-variant:small-caps;font-size:30px'>{first}</span></div>     
  #<div style='line-height:29px'><span style='font-weight:bold;font-variant:small-caps;font-size:33px'>{last}</div>")
  glue::glue("{first} {last}")
}

base_name_df <- data.frame("name" = rownames(sim_df)) %>%
  left_join(rankings, by = c('name' = 'name')) %>%
    select(c(name, Position, Team)) %>%
  left_join(nflfastR::teams_colors_logos, by = c('Team' = 'team_abbr')) %>%
  select(c(name, Position, Team, team_logo_espn)) %>%
  mutate(name = sim_df_name(name))

first_names <- c("Tom Brady", "L'Jarius Sneed", "Stefon Diggs", "Bobby Wagner", "Josh Jacobs")#, "Budda Baker")
# Filter the data based on the default search value
filtered_data <- base_name_df[order(base_name_df$name %in% first_names,decreasing=TRUE),]
sim_df <- sim_df[order(base_name_df$name %in% first_names,decreasing=TRUE),]

# Function to format the row details
row_details <- function(index) {
  player_name <- sim_df_name(rownames(sim_df)[index])
  similarities <- sim_df[index, -ncol(sim_df)]
  top_similar <- sort(similarities, decreasing = TRUE)[2:6]
  least_similar <- sort(similarities, decreasing = FALSE)[1:5]
  
  top_similar_df <- data.frame('Most' = names(top_similar),
                               'Similar' = unname(unlist(top_similar)), 
                               'Least'= names(least_similar),
                               'Similar' = unname(unlist(least_similar)))
  
  htmltools::div(
    style = "padding: 16px",
    reactable::reactable(top_similar_df, outlined = TRUE)
  )
}

# Create the reactable table
player_similarity_table <- reactable(
  filtered_data,
  pagination = TRUE,
  highlight = TRUE,
  striped = TRUE,
  #theme = espn(),
  defaultPageSize = 5,
  #height=500,
  #width=100,
  defaultColDef = colDef(align = "center"),
  searchable = TRUE,
  defaultSortOrder = "desc",
  details = row_details,
  columns = list(
            name = colDef(name = "Name", style = list(fontFamily = "Arial"),
                          html = TRUE),
            Position = colDef(name = "Position", #maxWidth = 120, 
                         style = list(fontFamily = "Arial")),
            Team = colDef(name = "Team", maxWidth = 70, 
                         style = list(fontFamily = "Arial")),
            team_logo_espn = colDef(name = "Logo", maxWidth = 200,
                                    cell = embed_img(height = 75, width = 75))
  )
)

p0 <-"plots/sim_table.html"
saveWidget(player_similarity_table, file.path(normalizePath(dirname(p0)),basename(p0)))

display_html('<center>
<div class="prs">
    <div class="title" style="text-align:left;">
        <h2 style="font-size:24px"><b>Player Similarity Database</b></h2>
        <i>Search for a player by name to get their Most and Least Similar Comparables!</i>
    </div>
    <iframe src="plots/sim_table.html" align="center" width="100%" height="400" frameBorder="0"></iframe>
</div>
</center>')

### Node2Vec - algo

Now we're thinking of our players as vectors that represent their position, where each entry of this vector will be a "trait" that the model learns.

Given a sequence of players, the algorithm updates their vectors to maximize the probability of each player's surroundings given that player's vector. For example, if we had a sequence such as [McCaffrey, Sneed, Aiyuk], we want to maximize:

$$
\begin{aligned}
P(\text{McCaffrey} \mid \text{Sneed}) \cdot P(\text{McCaffrey} \mid \text{Aiyuk}) \cdot P(\text{Sneed} \mid \text{McCaffrey}) \cdot P(\text{Sneed} \mid \text{Aiyuk}) \cdot P(\text{Aiyuk} \mid \text{McCaffrey}) \cdot P(\text{Aiyuk} \mid \text{Sneed})
\end{aligned}
$$

The inspiration behind this algorithm in the NLP community is that a word can be defined by the surrounding words in the sentence. Translated to sports, we can define a player's position by the other players' movements around him.

If you're skeptical that this could work, we can walk through a small example. Let's take one potential sequence for each of the 49ers running backs:

[McCaffrey (RB) - Sneed (CB) - Aiyuk (WR)]

[Mitchell (RB) - Williams (LT) - Jones (DT)]

[Jordan Mason (RB) - Williams (LT) - Derrick Nnadi (DT)]

Looking at these sequences, we can tell that McCaffrey differs more from Mitchell and Mason than Mitchell and Mason differ from one another. Mason and Mitchell have more linemen in their sequences while McCaffrey has more defensive backs. But how does the model learn this?

For sequence 1, some of the components we are maximizing are P(McCaffrey|Sneed) and P(McCaffrey|Aiyuk). If we define the likelihood of a player pair as the dot product between their vectors, then maximizing it amounts to making them as similar as possible. Thus we want the McCaffrey, Sneed, and Aiyuk vectors to be similar to one another. Since Sneed and Aiyuk aren't in Mitchell's (or Mason's) sequences, we don't get terms like P(Mitchell|Sneed) in their calculations, so Mitchell's vector doesn't need to be similar to Sneed's or Aiyuk's.
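
A small sketch in R shows how dot products turn into these probabilities. It assumes a softmax over dot products and a single vector per player, which is a simplification; the 2-D vectors are made up so that McCaffrey, Sneed, and Aiyuk point in a similar direction and Williams does not.

```r
# Hypothetical 2-D player vectors (values invented for illustration).
emb <- rbind(
  McCaffrey = c(1.0, 0.2),
  Sneed     = c(0.9, 0.4),
  Aiyuk     = c(0.8, 0.1),
  Williams  = c(-0.7, 0.9)
)

# P(context | center): softmax of the dot products between the center
# player's vector and every other player's vector.
context_prob <- function(center, emb) {
  others <- emb[rownames(emb) != center, , drop = FALSE]
  scores <- as.vector(others %*% emb[center, ])
  p <- exp(scores - max(scores))  # subtract max for numerical stability
  setNames(p / sum(p), rownames(others))
}

p <- context_prob("McCaffrey", emb)
print(round(p, 3))
```

Players whose vectors align with McCaffrey's receive more probability mass, so raising P(Sneed|McCaffrey) during training means pulling those two vectors closer together.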

Thus the main steps of the Graph2Vec algorithm are:
1) Generate random samples of node sequences

2) Use Word2Vec to get embeddings - predict the probability of a context node given the center node

How do we generate these sequences, though? We take biased random walks along graph representations of the field; i.e. each node is a player and the edges are weighted by the Euclidean distances between players. The idea is that, if we are more likely to walk between players close to one another, we get sequences like the above example with the 49ers running backs.
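
The walk-generation step can be sketched in base R as follows. The coordinates are invented, and using inverse distance as the transition weight is an assumption made for this sketch; it is one simple way to make nearby players more likely to follow each other in a sequence.

```r
set.seed(3)

# Hypothetical (x, y) locations for six players in one frame.
coords <- rbind(
  McCaffrey = c(40, 45), Sneed    = c(43, 47), Aiyuk = c(38, 48),
  Mitchell  = c(30, 26), Williams = c(31, 24), Jones = c(33, 25)
)

# Pairwise Euclidean distances between players.
d <- as.matrix(dist(coords))

# Assumed bias: transition weight is inverse distance, no self-transitions.
w <- 1 / d
diag(w) <- 0
trans <- w / rowSums(w)

# Sample one walk of a given length starting from `start`.
random_walk <- function(start, length, trans) {
  walk <- character(length)
  walk[1] <- start
  for (i in seq_len(length - 1)) {
    walk[i + 1] <- sample(colnames(trans), size = 1, prob = trans[walk[i], ])
  }
  walk
}

walks <- replicate(200, random_walk("McCaffrey", 3, trans), simplify = FALSE)
print(walks[[1]])
```

These sampled sequences would then play the role of sentences for the embedding step, with each player name acting as a word.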

### Expand on clustered visualized embeddings

We looked at which positions ended up in which clusters and found patterns that distinguished between different types of positions. This indicates the algorithm is capturing some aspects of position.

<center>
<img src="https://github.com/scottmaran/big_data_bowl_2024/blob/main/statistics/report_data/images/embeddings-32-dim-cluster-pos-dis.png?raw=true" width=100%>
</center>

### Pos Embeddings - Dimension size

We computed position embeddings of dimension sizes 3, 9, and 32 and compared the results.

Intra-Position Similarity for 3-Dim Embeddings (e.g. WR-WR, CB-CB)

| WR       | CB       | SS       | FS       | ILB      | RB       | T        | OLB      | C        | TE       | DT       | G        | QB       | NT       | DE       | FB       | MLB      | DB   | LS   |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|------|------|
| 0.522215 | 0.359815 | 0.5418   | 0.436683 | 0.927439 | 0.876778 | 0.967161 | 0.989147 | 0.966166 | 0.978499 | 0.992412 | 0.993798 |


The 3-dimensional vectors capture a lot of information but miss out on a few key aspects. We can see that the intra-positional similarities for WR, CB, SS, and FS are all relatively low (below 0.6), which isn't the case in the higher-dimensional embeddings. For the other positions with values shown above, the similarities are high (about 0.88 or above).

The one advantage of the lower-dimensional embeddings is that they are more discriminative than their higher-dimensional counterparts. For example, the position similarity between G and CB under the different embeddings is: \
3-dim: 0.033 \
9-dim: 0.642 \
32-dim: 0.535


#### Player Similarity

Relationship between TE and QB

### Why not 100% for baseline at the end?

The careful reader may notice an interesting phenomenon. If our baseline predicts the tackler as the closest defender to the ball, shouldn't it be 100% accurate at the last frame of the play? This should be the frame where the tackler is making contact with the ball carrier; i.e. zero distance!

The baseline model is only correct at the final frame on 54% of all plays. We visually inspected 20 random plays and confirmed that it was correct on only 45% of them.

While initially strange, this should not be surprising when we consider the limited temporal precision of our data (one frame every 1/10 of a second) and the imprecise nature of defining when *exactly* a "tackle" takes place. Several other defenders usually surround the ball carrier and are also moving in to try to tackle him.

### model stats

We report classification accuracy at relative points in the play rather than at fixed times (e.g. 3 seconds in) to account for plays of different lengths. For example, a prediction two seconds in means something very different for a play that is two seconds long than for one that is 12 seconds long.
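
A minimal sketch of how the relative checkpoints could be located is shown below in R. The helper `checkpoint_frames` is hypothetical, and rounding up with `ceiling()` is an assumption; the 10 frames per second rate follows from the tracking data being sampled every 1/10 of a second.

```r
# Frame indices for the 1/4, 1/2, 3/4, and final checkpoints of a play
# with `n_frames` frames (rounding up is an assumption of this sketch).
checkpoint_frames <- function(n_frames) {
  fractions <- c(quarter = 0.25, half = 0.5, three_quarter = 0.75, final = 1)
  pmax(ceiling(fractions * n_frames), 1)
}

# A two-second play (20 frames) and a twelve-second play (120 frames).
short_play <- checkpoint_frames(20)
long_play  <- checkpoint_frames(120)

print(short_play)
print(long_play)
```

Frame 20 is the final checkpoint of the short play but falls before the 1/4 mark of the long one, which is why fixed-time comparisons would mix very different game situations.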

### Uses

This extends to:
- player evaluation - if a team wants to understand how good their players are at tackling
- player acquisition - if a team wants to understand how good available players are at tackling
- gameday strategy - if a team wants to understand how good their opponents are at tackling

The positional embeddings framework also has many uses independent of tackling:

- Model building - for any model that uses positions as input, position embeddings theoretically should provide a better alternative
- Player similarity metrics - similar to how we looked at Christian McCaffrey (CMC) and how he compared to different players across the league, we can do this for any player. If a team needs to replace a player, they can look at how similar potential replacements' positional embeddings are to the departing player's. The hypothesis is that players with similar positional embeddings play a similar style, and it might be easier for similar players to slide into a different team's strategy.