# NFL Data Tracking Analysis / Classification

This notebook outlines how to use NFL player tracking data to classify pass coverages and wide receiver (WR) routes. It follows the framework from this paper:

Dutta et al., 2019, "Unsupervised Methods for Identifying Pass Coverage Among Defensive Backs with NFL Player Tracking Data" https://arxiv.org/abs/1906.11373

The animations and plots of tracking data are from the following Kaggle notebooks:

https://www.kaggle.com/robikscube/nfl-big-data-bowl-plotting-player-position

https://www.kaggle.com/ar2017/nfl-big-data-bowl-2021-animating-players-movement

The ultimate objective is first to follow the paper's methodology to classify man vs. zone coverage, then to adapt that work to see whether I can build a supervised classification model that identifies the route run by a receiver when they were targeted.

## Methods of the Paper

The authors first define a set of features from the tracking data that distinguish "man" from "zone" coverage. They then fit mixture models whose clusters correspond to each coverage type, which gives a probabilistic assignment of each observation to a coverage type (or cluster). They also quantify each feature's influence in distinguishing defensive pass coverage types. As a first step, I will attempt to replicate the features they built. I originally tested this on week 1 of the tracking data only; the cells below load all 17 weeks.
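
To illustrate what a probabilistic cluster assignment looks like, here is a minimal sketch using scikit-learn's `GaussianMixture` on synthetic data. The two feature columns and their values are made up for illustration and are not taken from the paper or from the tracking data; the paper's actual model specification may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-ins for two per-defender features (hypothetical values)
tight = rng.normal(loc=[1.5, 0.5], scale=0.3, size=(200, 2))  # "man-like" group
loose = rng.normal(loc=[6.0, 3.0], scale=0.8, size=(200, 2))  # "zone-like" group
X = np.vstack([tight, loose])

# Fit a two-component mixture; each component plays the role of one coverage type
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

# One row per observation, one column per cluster; each row sums to 1
probs = gmm.predict_proba(X)
labels = probs.argmax(axis=1)
print(probs[:3].round(3))
```

The mixture itself does not know which component is "man" and which is "zone"; that interpretation has to come from inspecting the fitted clusters.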

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from ipywidgets import interact, fixed

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib import animation
from matplotlib.animation import FFMpegWriter
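# use the full option name; the bare 'max_columns' pattern is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)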

import dateutil
from math import radians
from IPython.display import Video

from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

import os
import warnings
from tqdm import tqdm
import gc
warnings.filterwarnings('ignore')

In [2]:
plays = pd.read_csv('nfl-big-data-bowl/pass-2018.csv')
players = pd.read_csv('nfl-big-data-bowl/players.csv')

In [3]:
%%time
# To test on week 1 only, use: week = pd.read_csv('nfl-big-data-bowl/week1.csv')
# Read all 17 weekly tracking files and stack them into one dataframe
week = pd.concat([pd.read_csv(f'nfl-big-data-bowl/week{i}.csv') for i in range(1, 18)],
                 ignore_index=True)

Wall time: 37.5 s


<img src="time-snap.png" width=640 height=480 />

We will split each play into three time periods, similar to the image above from the paper: "before snap", "after snap, before throw", and "after throw". The tracking data records when the ball is snapped ("ball_snap") and when the pass is thrown ("pass_forward"), so I can build the required variable from the "event" column.

In [4]:
print(week['event'].unique())
week.columns ## event is 8, frame is 13

['None' 'ball_snap' 'pass_forward' 'pass_arrived' 'pass_outcome_caught'
 'out_of_bounds' 'pass_outcome_incomplete' 'first_contact' 'tackle'
 'man_in_motion' 'play_action' 'qb_sack' 'fumble'
 'fumble_offense_recovered' 'handoff' 'pass_tipped'
 'pass_outcome_interception' 'qb_strip_sack' 'pass_shovel' 'line_set'
 'shift' 'touchdown' 'fumble_defense_recovered' 'pass_outcome_touchdown'
 'run' 'touchback' 'penalty_flag' 'penalty_accepted' 'qb_spike'
 'field_goal_blocked' 'punt_fake' 'snap_direct' 'run_pass_option'
 'pass_lateral' 'lateral' 'field_goal_fake' 'huddle_start_offense'
 'huddle_break_offense' 'timeout_home' 'safety' 'field_goal_play']


Index(['time', 'x', 'y', 's', 'a', 'dis', 'o', 'dir', 'event', 'nflId',
       'displayName', 'jerseyNumber', 'position', 'frameId', 'team', 'gameId',
       'playId', 'playDirection', 'route'],
      dtype='object')

In [5]:
# Convert dataframe to array to make iteration faster
weekArray = np.array(week)
# Create previous event variable to overwrite and check against when setting new ones
prevEvent = 'ball_snap'

for i, play in enumerate(tqdm(weekArray)):
    # The "event" on the field, i.e. ball snap or between snap or after throw
    event = play[8]
    frameId = play[13]
    # Create before snap category
    # Set equal to "ball_snap" if not available yet, or if it is first frame of motion
    if (prevEvent == 'ball_snap' and event != 'ball_snap') or frameId == 1:
        weekArray[i][8] = 'ball_snap'
        prevEvent = 'ball_snap'
    # Default value for ball_snap in dataset was for the first frame the ball was actually snapped
    elif (event == 'ball_snap'):
        prevEvent = 'between_snap'
    # Set value equal to "between_snap" if it is after snap and before pass thrown
    elif (prevEvent == 'between_snap' and event != 'pass_forward'):
        weekArray[i][8] = 'between_snap'
        prevEvent = 'between_snap'
    # Default value for "pass_forward" in dataset was for the first frame the ball left QB's hand
    elif (event == 'pass_forward'):
        weekArray[i][8] = 'after_throw'
        prevEvent = 'after_throw'
    # Extra cases where the ball has been thrown and is not the first frame of action
    elif (prevEvent == 'after_throw' and frameId != 1):
        weekArray[i][8] = 'after_throw'
        prevEvent = 'after_throw'
        
# Write the relabeled events back into the dataframe (column 8 is 'event')
week['event'] = weekArray[:, 8]
# weekMod is another name for the same dataframe, not a copy
weekMod = week

100%|██████████████████████████████████████████████████████████████████| 18309388/18309388 [00:27<00:00, 658019.31it/s]


In [6]:
week['event'] = ['before_snap' if e=='ball_snap' else e for e in week['event']]
print(week['event'].unique()) ## desired result

['before_snap' 'between_snap' 'after_throw']
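
The same three-way split can also be sketched without a row-by-row loop, by finding each play's snap and throw frames and comparing every row's frameId against them. This is an illustrative alternative, not the code that produced the output above; `label_phases` and the `phase` column are hypothetical names, and edge cases (for example, plays with no "ball_snap" event) may be labeled differently than by the loop.

```python
import numpy as np
import pandas as pd

def label_phases(tracking):
    """Add a 'phase' column: before_snap, between_snap, or after_throw."""
    keys = ['gameId', 'playId']
    # first frame of each play that carries the snap / throw event
    snap = (tracking[tracking['event'] == 'ball_snap']
            .groupby(keys)['frameId'].min().rename('snapFrame'))
    throw = (tracking[tracking['event'] == 'pass_forward']
             .groupby(keys)['frameId'].min().rename('throwFrame'))
    out = tracking.join(snap, on=keys).join(throw, on=keys)
    # comparisons against a missing (NaN) frame are False, so those rows
    # fall through to the next condition or to the default
    out['phase'] = np.select(
        [out['frameId'] >= out['throwFrame'], out['frameId'] > out['snapFrame']],
        ['after_throw', 'between_snap'],
        default='before_snap')
    return out.drop(columns=['snapFrame', 'throwFrame'])

# toy play: snap on frame 3, throw on frame 5
toy = pd.DataFrame({
    'gameId': [1] * 6, 'playId': [1] * 6, 'frameId': [1, 2, 3, 4, 5, 6],
    'event': ['None', 'None', 'ball_snap', 'None', 'pass_forward', 'None'],
})
labeled = label_phases(toy)
print(labeled[['frameId', 'phase']])
```

As in the loop, the snap frame itself counts as "before_snap" and the throw frame counts as "after_throw".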


## Features generated in the paper

The paper created the following features:

<img src="feat-create.png" width=640 height=480 />
<img src="feat-create2.png" width=640 height=480 />

We will use similar features, computed per player for each of the three time periods. The variance features are computed for all players, offensive and defensive; the distance-based features in the cells below are computed only for defensive backs (SS, FS, CB). As outlined in the paper, we compute the variance of x, y, and speed. We then compute each player's distance to the nearest opposing player and to the nearest teammate in every frame, and aggregate both by mean and variance. We also compute the difference in direction between the player and their nearest opponent in each frame, then aggregate it by mean and variance. Finally, for each frame we compute a ratio: the player's distance to their nearest opponent divided by the distance between that opponent and the player's nearest teammate. This ratio is also aggregated by mean and variance.

I also add features that the paper does not describe: the orientation of the player relative to the nearest opposing player, relative to the line of scrimmage, and relative to the position of the football.

In [7]:
%%time
## var in the x, y, speed
varX = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['x']\
.agg(['var']).reset_index().rename(columns={"var": "varX"})
varY = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['y']\
.agg(['var']).reset_index().rename(columns={"var": "varY"})
varS = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['s']\
.agg(['var']).reset_index().rename(columns={"var": "varS"})

Wall time: 13.8 s


In [8]:
%%time
## mean, var distance of nearest opposing player
# first need to group by each frame of the play by the game
groupedWeek = weekMod.groupby(['gameId', 'playId', 'frameId'])
playerXY = {} # initialize dictionary of player coordinates and direction
# get relevant player direction information
for name, group in tqdm(groupedWeek):
    playerXY[name] = [] # holds player id, home/away, x, y, dir, orientation
    for row in group.iterrows():
        data = [row[1]['nflId'], row[1]['team'], row[1]['x'], row[1]['y'], row[1]['dir'], row[1]['o']]
        playerXY[name].append(data)

100%|███████████████████████████████████████████████████████████████████████| 1247711/1247711 [33:31<00:00, 620.22it/s]

Wall time: 34min 14s
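
Most of the 34 minutes above is likely spent in `iterrows`, which builds a pandas Series for every row. A possible faster way to build the same kind of lookup is to slice the needed columns per group and convert them to lists in one call. This is a sketch that I have not timed on the full dataset; `build_frame_lookup` is a hypothetical helper name.

```python
import pandas as pd

def build_frame_lookup(tracking):
    """Map (gameId, playId, frameId) -> list of [nflId, team, x, y, dir, o] rows."""
    cols = ['nflId', 'team', 'x', 'y', 'dir', 'o']
    return {key: group[cols].values.tolist()
            for key, group in tracking.groupby(['gameId', 'playId', 'frameId'])}

# toy data: one play, two frames, two players per frame
toy = pd.DataFrame({
    'gameId': [1, 1, 1, 1], 'playId': [1, 1, 1, 1], 'frameId': [1, 1, 2, 2],
    'nflId': [10, 20, 10, 20], 'team': ['home', 'away', 'home', 'away'],
    'x': [1.0, 2.0, 1.5, 2.5], 'y': [5.0, 6.0, 5.5, 6.5],
    'dir': [90.0, 270.0, 95.0, 265.0], 'o': [80.0, 260.0, 85.0, 255.0],
})
lookup = build_frame_lookup(toy)
print(lookup[(1, 1, 1)])
```

The resulting dictionary has the same shape as playerXY, so the later cells could index it the same way.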





In [19]:
%%time
# get list of features
features = list(weekMod.columns)
# convert entire dataset to numpy array, to make iteration faster
weekArray = np.array(weekMod)
minOppDist = []
# for each player in each frame, want to get their distance from nearest opposing player
# this loop goes through each player in each frame
for player in tqdm(weekArray):
    if (player[features.index('position')] in ['SS','FS','CB']):
        # exclude the actual tracking of the football
        if player[features.index('team')] != 'football':
            # grabs every player's ID, home/away, x, y, dir, orientation in this frame of play
            opponentPositions = playerXY[(player[features.index('gameId')],
                                          player[features.index('playId')],
                                          player[features.index('frameId')])]
            # create list to store following metrics
            distances, directions, opponents, xs, ys, orients = [], [], [], [], [], []
            # for the player distances, get each opponent in frame, then grab key metrics
            for oppPos in opponentPositions:
                # only apply to players not on same team (oppPos[1] is home/away/football marker)
                if (player[features.index('team')] != oppPos[1]
                and player[features.index('team')] != 'football'
                and oppPos[1] != 'football'):
                    # calculate squared difference in x coordinate between current player, opposing player
                    dx = (player[features.index('x')] - oppPos[2])**2
                    # same thing for y coordinate
                    dy = (player[features.index('y')] - oppPos[3])**2
                    # calculate distance, append it to list
                    dist = np.sqrt(dx+dy)
                    distances.append(dist)
                    # append opponent direction to list
                    directions.append(oppPos[4])
                    # append opponent ID to list
                    opponents.append(oppPos[0])
                    # opponent X and Y coords
                    xs.append(oppPos[2])
                    ys.append(oppPos[3])
                    # opponent orientation
                    orients.append(oppPos[5])

            # find closest player via minimum distance
            # also log their direction and XY coords
            if len(distances) != 0:
                closest = np.argmin(distances)  # index of the nearest opponent
                minDist = distances[closest]
                closestOpponent = opponents[closest]
                opponentDir = directions[closest]
                opponentX = xs[closest]
                opponentY = ys[closest]
                opponentO = orients[closest]
            else:
                # no opponents in this frame: fill every metric, including direction, with NaN
                minDist, closestOpponent, opponentDir = np.nan, np.nan, np.nan
                opponentX, opponentY, opponentO = np.nan, np.nan, np.nan
            # collect all of this to one list
            summary = [player[features.index('gameId')],
                       player[features.index('playId')],
                       player[features.index('frameId')],
                       player[features.index('nflId')],
                       minDist, closestOpponent, opponentDir, opponentX, opponentY, opponentO]
            # append to massive list for all observations
            minOppDist.append(summary)

# convert multidimensional list to pandas df
minOppDist = pd.DataFrame(minOppDist, columns=['gameId', 'playId', 'frameId', 'nflId',
                                               'oppMinDist', 'closestOpp(nflId)', 'oppDir', 'oppX', 'oppY', 'oppO'])
# add these metrics to main df weekMod
weekMod = pd.merge(weekMod, minOppDist, how='left', on=['gameId', 'frameId', 'playId', 'nflId'])

# calculate nearest opponent distance variance and mean for each play
oppVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMinDist']\
.agg(['var']).reset_index().rename(columns={"var": "oppVar"})
oppMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMinDist']\
.agg(['mean']).reset_index().rename(columns={"mean": "oppMean"})

100%|███████████████████████████████████████████████████████████████████| 18309388/18309388 [09:52<00:00, 30881.82it/s]


Wall time: 13min 48s
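
For a single frame, the nearest-opponent search can also be written with NumPy broadcasting instead of nested Python loops. The sketch below works on one frame's rows and masks same-team pairs before taking the minimum. `nearest_opponent` is a hypothetical helper, it handles every player in the frame instead of only defensive backs, and it has not been benchmarked against the loop above.

```python
import numpy as np
import pandas as pd

def nearest_opponent(frame):
    """For each player in one frame, distance to and nflId of the nearest opponent."""
    players = frame[frame['team'] != 'football'].reset_index(drop=True)
    xy = players[['x', 'y']].to_numpy(dtype=float)
    # pairwise Euclidean distances, shape (n_players, n_players)
    dists = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))
    team = players['team'].to_numpy()
    # mask teammates (and the player itself) so only opponents remain
    dists[team[:, None] == team[None, :]] = np.inf
    idx = dists.argmin(axis=1)
    out = players[['nflId']].copy()
    # stays inf if a player has no opponents in the frame
    out['oppMinDist'] = dists[np.arange(len(players)), idx]
    out['closestOpp'] = players['nflId'].to_numpy()[idx]
    return out

# toy frame: two home players, two away players, and the football
frame = pd.DataFrame({
    'nflId': [1, 2, 3, 4, np.nan],
    'team': ['home', 'home', 'away', 'away', 'football'],
    'x': [0.0, 10.0, 1.0, 8.0, 5.0],
    'y': [0.0, 0.0, 0.0, 0.0, 0.0],
})
result = nearest_opponent(frame)
print(result)
```

Applying this per (gameId, playId, frameId) group would give the same kind of table as minOppDist, under the assumption that each frame holds at most a few dozen rows so the pairwise matrix stays small.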


In [20]:
%%time
## mean, var distance of nearest teammate
# same procedure as for opponent, but changing selection criteria to only same team
features = list(weekMod.columns)
weekArray = np.array(weekMod)
minMateDist = []
for player in tqdm(weekArray):
    if (player[features.index('position')] in ['SS','FS','CB']):
        if player[features.index('team')] != 'football':
            matePositions = playerXY[(player[features.index('gameId')],
                                      player[features.index('playId')],
                                      player[features.index('frameId')])]
            distances, mates, xs, ys, orients = [], [], [], [], []
            for matePos in matePositions: 
                if (player[features.index('team')] == matePos[1]
                and player[features.index('nflId')] != matePos[0]
                and player[features.index('team')] != 'football'
                and matePos[1] != 'football'):
                    dx = (player[features.index('x')] - matePos[2])**2
                    dy = (player[features.index('y')] - matePos[3])**2
                    dist = np.sqrt(dx+dy)
                    distances.append(dist)
                    mates.append(matePos[0])
                    xs.append(matePos[2])
                    ys.append(matePos[3])
                    orients.append(matePos[5])
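            if distances:
                closest = np.argmin(distances)  # index of the nearest teammate
                minDist = distances[closest]
                closestMate = mates[closest]
                mateX = xs[closest]
                mateY = ys[closest]
                mateO = orients[closest]
            else:
                # no teammates tracked in this frame
                minDist, closestMate, mateX, mateY, mateO = (np.nan,) * 5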
            summary = [player[features.index('gameId')],
                       player[features.index('playId')],
                       player[features.index('frameId')],
                       player[features.index('nflId')],
                       minDist, closestMate, mateX, mateY, mateO]
            minMateDist.append(summary)
        
minMateDist = pd.DataFrame(minMateDist, columns=['gameId', 'playId', 'frameId', 'nflId',
                                                 'mateMinDist', 'closestMate(nflId)', 'mateX', 'mateY', 'mateO'])

weekMod = pd.merge(weekMod, minMateDist, how='left', on=['gameId', 'frameId', 'playId', 'nflId'])

mateVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['mateMinDist']\
.agg(['var']).reset_index().rename(columns={"var": "mateVar"})
mateMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['mateMinDist']\
.agg(['mean']).reset_index().rename(columns={"mean": "mateMean"})

100%|███████████████████████████████████████████████████████████████████| 18312098/18312098 [10:02<00:00, 30374.88it/s]


Wall time: 15min 14s


In [21]:
%%time
## create direction relative to opposition
diffDir = np.absolute(weekMod['dir'] - weekMod['oppDir'])
weekMod['diffDir'] = diffDir
## create mean/var of direction relative to opposition
oppDirVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['diffDir']\
.agg(['var']).reset_index().rename(columns={"var": "oppDirVar"})
oppDirMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['diffDir']\
.agg(['mean']).reset_index().rename(columns={"mean": "oppDirMean"})

## create ratio of distances
ratio = weekMod['oppMinDist'] / np.sqrt((weekMod['oppX'] - weekMod['mateX'])**2 + (weekMod['oppY'] - weekMod['mateY'])**2)
weekMod['oppMateDistRatio'] = ratio
## create mean/var of ratios
oppMateDistRatioMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMateDistRatio']\
.agg(['mean']).reset_index().rename(columns={"mean": "meanOppMateDistRatio"})
oppMateDistRatioVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMateDistRatio']\
.agg(['var']).reset_index().rename(columns={"var": "varOppMateDistRatio"})

Wall time: 12.5 s
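
One thing to keep in mind with diffDir: 'dir' is an angle in degrees, so a plain absolute difference treats 350 and 10 as 340 degrees apart even though the two headings differ by only 20 degrees. If a wraparound-aware difference is wanted, a small helper like the one below could be used. This is an optional variation, not what the cell above computes, and `angular_difference` is a hypothetical name.

```python
import numpy as np

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees, in [0, 180]."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % 360.0
    return np.minimum(diff, 360.0 - diff)

print(angular_difference(350, 10))   # 20.0
print(angular_difference(90, 270))   # 180.0
print(angular_difference([0, 45], [0, 315]))  # [ 0. 90.]
```

The same helper would apply to the orientation difference between 'o' and 'oppO', since orientation is also recorded in degrees. Missing values pass through as NaN.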


In [22]:
%%time
## create orientation relative to line of scrimmage on play
## (mean/var of the raw orientation 'o', not adjusted for playDirection)
orientMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['o']\
.agg(['mean']).reset_index().rename(columns={"mean": "orientMean"})
orientVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['o']\
.agg(['var']).reset_index().rename(columns={"var": "orientVar"})

## orientation relative to opposition location
orientDiff = np.absolute(weekMod['o'] - weekMod['oppO'])
weekMod['orientDiff'] = orientDiff
## create mean/var of orientation relative to opposition
oppOrientMean = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['orientDiff']\
.agg(['mean']).reset_index().rename(columns={"mean": "oppOrientMean"})
oppOrientVar = weekMod.groupby(['gameId', 'playId', 'event', 'nflId'])['orientDiff']\
.agg(['var']).reset_index().rename(columns={"var": "oppOrientVar"})

Wall time: 11.7 s


In [23]:
%%time
## collect created features and merge into main df
features = [varX, varY, varS,
            oppVar, oppMean,
            mateVar, mateMean,
            oppDirVar, oppDirMean,
            oppMateDistRatioMean, oppMateDistRatioVar,
            orientMean, orientVar, oppOrientMean, oppOrientVar]

for feature in features:
    weekMod = pd.merge(weekMod, feature, how='left', on=['gameId', 'event', 'playId', 'nflId'])

Wall time: 2min 50s
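
Since every feature is aggregated over the same four keys, the separate groupby calls and the fifteen merges could in principle be collapsed into one groupby with named aggregation followed by a single merge. The sketch below shows the pattern on toy data with three of the features. It is an illustration of the pandas API, not a drop-in replacement tested on the full dataset.

```python
import pandas as pd

# toy data: one player, three frames in each of two time periods
toy = pd.DataFrame({
    'gameId': [1] * 6, 'playId': [1] * 6, 'nflId': [10] * 6,
    'event': ['before_snap'] * 3 + ['between_snap'] * 3,
    'x': [1.0, 2.0, 3.0, 3.0, 5.0, 7.0],
    'y': [4.0, 4.0, 4.0, 4.0, 5.0, 6.0],
    's': [0.0, 0.5, 1.0, 2.0, 4.0, 6.0],
})
keys = ['gameId', 'playId', 'event', 'nflId']

# one pass over the groups; each keyword is output_name=(input_column, function)
agg = (toy.groupby(keys)
          .agg(varX=('x', 'var'), varY=('y', 'var'), varS=('s', 'var'))
          .reset_index())

# a single merge attaches all aggregated features at once
merged = toy.merge(agg, how='left', on=keys)
print(agg)
```

The mean and variance features from the earlier cells would fit the same pattern, for example oppMean=('oppMinDist', 'mean').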


In [24]:
print(weekMod.columns[33:])

Index(['varX', 'varY', 'varS', 'oppVar', 'oppMean', 'mateVar', 'mateMean',
       'oppDirVar', 'oppDirMean', 'meanOppMateDistRatio',
       'varOppMateDistRatio', 'orientMean', 'orientVar', 'oppOrientMean',
       'oppOrientVar'],
      dtype='object')


In [25]:
## save to csv so I don't have to recompute these features every time
weekMod.to_csv('weekMod17.csv', index=False)

In [30]:
## send only relevant positions to csv
weekMod.loc[weekMod['position'].isin(['WR','SS','FS','CB'])].to_csv('weekModpos.csv', index=False)