# Returner Quality 
Judging the quality of a returner is more complex than counting how many yards are gained on returns. For example, a return of a few positive yards may have come at the expense of greater yardage if the player did not choose an optimal path away from tacklers. Rather than using total yardage as the measure of returner quality, we can evaluate returns with two model-based metrics: a tackle probability metric and an expected return yards metric. Combining the outputs of the two into a composite gives a measure of the returner that is more statistically informed than total yardage. I call this metric Returner Quality.

## Measuring Returner Quality  
A primary responsibility of a returner is to gain yardage beyond the initial reception point. A player can take all the available space in front of them, and can create more space by making potential tacklers miss. Players who excel in return yardage and also avoid tackles can be considered "good" returners, and players who are below average in both measures can be considered "bad" returners. Players who excel in return yardage but do not avoid tackles can be labeled "trains", as they are theoretically straight-line (vertical) runners. Players who excel in avoiding tackles but are below average in return yardage can be labeled "dancers", as they theoretically make many cut moves (horizontal runners). We can illustrate this by plotting each player's average return yardage against their average missed tackles per return (min 20 returns).  
![](https://i.imgur.com/ogcXuLI.png)  
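
The four labels can be read as quadrants of that plot. Below is a minimal sketch of the labeling, assuming the cutoffs are the means of the two per-return averages (the text only says "below-average", so the exact cutoff is an assumption). The function name, column names, and data are hypothetical.

```python
import pandas as pd

def label_returner_type(avg_yards, avg_missed, yards_cutoff, missed_cutoff):
    """Quadrant label from two per-return averages and their cutoffs."""
    gains = avg_yards > yards_cutoff
    elusive = avg_missed > missed_cutoff
    if gains and elusive:
        return "good"
    if gains:
        return "train"
    if elusive:
        return "dancer"
    return "bad"

# Hypothetical per-return averages, for illustration only.
returners = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "avg_return_yards": [25.0, 24.0, 14.0, 15.0],
    "avg_missed_tackles": [0.9, 0.2, 0.8, 0.3],
})

# Assumed cutoffs: the mean of each measure across qualified returners.
yards_cutoff = returners["avg_return_yards"].mean()
missed_cutoff = returners["avg_missed_tackles"].mean()

returners["returner_type"] = [
    label_returner_type(y, m, yards_cutoff, missed_cutoff)
    for y, m in zip(returners["avg_return_yards"], returners["avg_missed_tackles"])
]
print(returners)
```
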

However, per-play averages and season totals can tell different stories, because players with more returns inevitably accumulate more return yardage and avoided tackles. For example, 2018 Tyler Lockett is considered a bad returner based on his average return yards (15.3) and average tackles avoided (0.3), but he is considered a good returner by season totals because he had 38 returns (583 yards gained, 10 tackles avoided).  
![](https://i.imgur.com/LKVKaTY.png)  

So is Tyler Lockett good or bad? Rather than using season totals, I created one model that estimates return yardage and another that estimates tackle probability. Each model produces an estimate for every frame of tracking data, allowing for precise event-based measurements over time. The estimated return yardage at the moment a returner receives the ball is what I consider "expected return yardage". Kicking team members who come within 4 yards of the returner and do _not_ tackle the returner are counted as avoided tackles, and their tackle probability is treated as the amount of a tackle "avoided".

Below is an example of a play that tracks these metrics. 
![](https://i.imgur.com/01KioxM.gif) 

The dotted black line is the line of scrimmage, the solid red line is the actual return yards gained, the dotted red line is the frame-by-frame estimated return yards, and the dotted blue line is the expected return yards at the point of reception. In the example above, Lockett gains 9 yards but was expected to gain 12 yards. He also avoids 0 tackles in this example.  

Using this expected return yards value, based on nearly all return plays from 2018-2020, I can subtract expected from actual to get "return yards over expected", where a positive value is good for the returner. Likewise, using the estimated tackles avoided metric, we can assess the elusiveness of a player.
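
As a rough sketch of how the two per-return measures could be aggregated, the example below computes return yards over expected as actual minus expected and averages both measures per returner. The column names and numbers are hypothetical (the first row mirrors the 9-versus-12-yard example above). The way tackle probabilities are collapsed into one "tackles avoided" number per return is an assumption, not something taken from the notebook.

```python
import pandas as pd

# Hypothetical per-return rows; column names are assumptions for illustration.
returns = pd.DataFrame({
    "returner": ["A", "A", "B"],
    "actual_yards": [9.0, 30.0, 22.0],
    # model estimate at the frame the ball is received
    "expected_yards": [12.0, 21.0, 20.0],
    # assumed: summed tackle probability of nearby non-tacklers on the return
    "tackles_avoided_est": [0.0, 1.4, 0.6],
})

# Positive values mean the returner gained more than the model expected.
returns["yards_over_expected"] = returns["actual_yards"] - returns["expected_yards"]

per_returner = returns.groupby("returner", as_index=False).agg(
    n_returns=("actual_yards", "size"),
    avg_yards_over_expected=("yards_over_expected", "mean"),
    avg_tackles_avoided_est=("tackles_avoided_est", "mean"),
)
print(per_returner)
```
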

We can plot these two new measures of returner quality just like we plotted return yardage and tackles avoided while still categorizing each player based off of their original returner types:  
![](https://i.imgur.com/ovouW66.png)  

On these per-return measures, Lockett would still classify as a bad returner. We can also see if this is the case in the season totals:  
![](https://i.imgur.com/q6FFjzK.png) 

Over the course of the 2018 season, he would technically qualify as a returner who dances but does not generate above-average return yardage.  

Scaling and combining average return yards over expected and average estimated tackles avoided results in a new metric, returner quality. 
### Top 10 player-seasons, by returner quality 
Rank |	Player |	# Returns |	Returner Quality |	Returner Type  
--- |--- |--- |--- |---  
1 |	2018  Darius Jennings |	23 |	1.11 |	train
2 |	2018  Desmond King |	42 |	0.89 |	good
3 |	2018  Andre Roberts |	62 |	0.81 |	good
4 |	2018  Cordarrelle Patterson |	22 |	0.80 |	train
5 |	2018  Tremon Smith |	32 |	0.73 |	train
6 |	2019  Cordarrelle Patterson |	27 |	0.68 |	good
7 |	2019  Desmond King |	30 |	0.63 |	dances
8 |	2018  Alex Erickson |	59 |	0.61 |	train
9 |	2020  Isaiah Rodgers |	25 |	0.60 |	good
10 |	2020  Deonte Harris |	32 |	0.58 |	good   
  
### Bottom 10 player-seasons, by returner quality 
Rank |	Player |	# Returns |	Returner Quality |	Returner Type  
--- |--- |--- |--- |---  
81 |	2018 Isaiah McKenzie |	26 |	-0.48 |	bad
82 |	2020 Braxton Berrios |	20 |	-0.49 |	bad
83 |	2020 Brandon Powell |	30 |	-0.53 |	bad
84 |	2019 Richie James |	44 |	-0.54 |	bad
85 |	2020 Greg Ward |	20 |	-0.56 |	bad
86 |	2019 T.J. Logan |	23 |	-0.59 |	dances
87 |	2018 Adam Humphries |	21 |	-0.61 |	bad
88 |	2019 Tyler Lockett |	27 |	-0.66 |	bad
89 |	2020 Christian Kirk |	22 |	-0.82 |	bad
90 |	2018 Pharoh Cooper |	22 |	-0.93 |	bad  

2018 Tyler Lockett ranks 67th in returner quality (-0.20).  

Returner quality has an average of 0.076 (sd: 0.404). The distribution for qualified returners ranges between -0.94 and 1.12.  
![](https://i.imgur.com/AGSwUwy.png)   

Returner quality also correlates positively with actual return yardage and tackles avoided as determined by PFF.  
![](https://i.imgur.com/KikFY6u.png)  

Overall, if a returner had to choose between running vertically and running horizontally, vertical running has the stronger positive correlation with returner quality. In more general terms, juking your opponent out is no replacement for gaining positive yardage.

Returner quality does have limitations. One area of limitation is blocker usage by the returner. 
![](https://i.imgur.com/fDkRnmw.gif)  

The tackle probability model considers the distances between blockers, kicking team players, and the ball carrier, but it has trouble assigning tackle probability when a returner's best path relies on a blocker. In the example above, the returner passes behind their blocker even though the nearest tackler is assigned a high tackle probability. This perhaps indicates overfitting to the tackler's distance from the returner, and may be best solved with a model that takes into account future areas controlled by the returning team.  

## Returner Quality, applied  
Two application areas are return optimization and identifying players whose return abilities may be overlooked. Since this is a per-play metric, each play can be evaluated for returner quality, and we can also assess whether the result on a play was driven by return yards over expected or by estimated tackles avoided. Simple return optimization could examine whether a returner followed an optimal path and, if not, at what point in the return the path stopped being optimal.  

Identifying players who have positive returner quality but are perhaps underappreciated is another potential application. For example, 2018 Chester Rogers classifies as a bad returner based on below-average return yards and below-average tackles avoided. By returner quality, however, he scores 0.181, which is above average and similar to 2020 Cordarrelle Patterson (0.187). The gap between the two players is that Patterson gains 2.17 yards over expected whereas Rogers gains 1.72. Although small, this is the difference between a first-team All-Pro year and not.

There are many areas of improvement for returner quality but this model provides good context for a returner using basic kick/punt return concepts. Future steps would include model improvement and predictability of return quality over seasons. 

## Model 

For the data inputs to the tackle probability model, I used the majority of the player tracking information (time, space, speed) as well as player-specific statistics (height/weight, average speed metrics) and play-context information (kicking team player distance from ball carrier, closest teammate, nearest teammate to ball carrier, and nearest opponent player). The target of the model is whether a player tackled the ball carrier. The target label is static throughout each play: non-tacklers get a 0 on every row and the tackler gets a 1 on every row of the target column. The model is a `LightGBM` gradient boosted decision tree classifier, with hyperparameters tuned using `optuna`.

After the model tuning, predicted probabilities are generated on all available plays using a 5-fold cross validation, saving the out-of-sample predictions per iteration. 
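
The out-of-sample prediction step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's training code: it uses scikit-learn's `LogisticRegression` as a lightweight stand-in for the tuned `LightGBM` classifier, random data in place of the tracking features, and assumes folds are grouped by a play identifier (the notebook imports `GroupKFold`, but the grouping key here is a guess).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_rows = 500
X = rng.normal(size=(n_rows, 3))                                    # stand-in features
y = (X[:, 0] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)  # stand-in label
groups = rng.integers(0, 50, size=n_rows)                           # stand-in play ids

# Every row receives a prediction from a model that never saw its group.
oof_pred = np.full(n_rows, np.nan)
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model = LogisticRegression()          # stand-in for the tuned classifier
    model.fit(X[train_idx], y[train_idx])
    oof_pred[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

print(oof_pred[:5])
```

Grouping the folds keeps all frames of one play on the same side of the split, so a play's own frames cannot leak into the model that scores it.
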

These outputs are used in conjunction with a feature set similar to the one The Zoo used in their 2020 Big Data Bowl [expected rush yards notebook](https://www.kaggle.com/jccampos/nfl-2020-winner-solution-the-zoo). Rather than using a convolutional neural net, I simplified the approach by using a `LightGBM` gradient boosted decision tree regressor fed with data from only the highest-probability tackler, as opposed to all 11 kicking team players. This was in addition to the information regarding the returner, which also includes the distance to the nearest tackler as well as the probability of tackle from the previous model. The target is yards gained on the return. This produces a prediction of yards, where outputs were also saved through out-of-sample 5-fold cross validation.

This results in two predictions: return yards and tackle probability. I scale both of these values using a z-transform and add the results together to create returner quality. 
![](https://i.imgur.com/7sNC8xc.jpeg)  
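
A minimal sketch of the scale-and-add step, assuming the inputs are per-returner averages of return yards over expected and estimated tackles avoided, and that both z-scores are weighted equally. The numbers are hypothetical. The population standard deviation is used, which matches the behavior of `sklearn.preprocessing.scale`.

```python
import numpy as np

def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical per-returner averages for three players.
avg_yards_over_expected = np.array([2.0, 0.0, -2.0])
avg_tackles_avoided_est = np.array([0.5, 0.3, 0.1])

# Composite: sum of the two standardized measures.
returner_quality = zscore(avg_yards_over_expected) + zscore(avg_tackles_avoided_est)
print(returner_quality)
```

Standardizing first puts yards and tackle probabilities on the same unitless scale, so neither measure dominates the sum simply because of its units.
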





# installs

In [None]:
!apt install imagemagick
!pip install optuna
!pip uninstall lightgbm -y
!git clone --recursive --branch v3.2.1 https://github.com/Microsoft/LightGBM
!apt-get install -y -qq libboost-all-dev

In [None]:
%%bash
cd LightGBM
rm -rf build
mkdir build
cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j$(nproc)

In [None]:
!cd LightGBM/python-package/;python3 setup.py install --precompile
!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
!rm -r LightGBM

# imports 

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import os 
from tqdm.notebook import tqdm

import optuna
import optuna.integration.lightgbm as lgb
import lightgbm as lgb_vanilla
from sklearn.metrics import mean_squared_error, log_loss, accuracy_score, roc_auc_score, roc_curve, auc
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import minmax_scale, scale
from scipy.stats import pearsonr, spearmanr, percentileofscore
from scipy.spatial import distance

# for mpl animation
import matplotlib.animation as animation
from matplotlib import rc
rc('animation', html='html5')


# functions 

In [None]:
def tuple2range(df):
  x = range(int(df[0]),int(df[1]))
  return x

def is_event_str(event):
  return isinstance(event, str)

def is_event_int(event):
  # accept numpy integers too (e.g. a frameId pulled out of a DataFrame)
  return isinstance(event, (int, np.integer))

def is_event_math(event):
  event = event.split(' ')  
  return event[0] in ['add', 'sub', 'mul', 'div']

def do_event_math(event, df):
  new_series = None
  event = event.split(' ')  
  if event[0] == 'add':
    new_series = df['frameId'].add(int(event[1]))
  elif event[0] == 'sub':
    new_series = df['frameId'].sub(int(event[1]))
  elif event[0] == 'mul':
    new_series = df['frameId'].mul(int(event[1]))
  elif event[0] == 'div':
    new_series = df['frameId'].div(int(event[1]))
  return new_series

def explode_dataframe(df, from_event, to_event):
  if is_event_str(from_event):
    _df1 = df.loc[(df['event']==from_event), ['gameId', 'playId', 'frameId']].drop_duplicates()

  elif is_event_int(from_event):
    _df1 = df.loc[(df['frameId']==from_event), ['gameId', 'playId', 'frameId']].drop_duplicates()

  if is_event_str(to_event): 
    if is_event_math(to_event): 
      _df2 = _df1.loc[:, ['gameId', 'playId', 'frameId']].drop_duplicates() 
      _df2['frameId'] = do_event_math(to_event, _df1)
    else:
      _df2 = df.loc[(df['event']==to_event), ['gameId', 'playId', 'frameId']].drop_duplicates()

  elif is_event_int(to_event): 
    _df2 = df.loc[(df['frameId']==to_event), ['gameId', 'playId', 'frameId']].drop_duplicates() 
  
  _df1 = _df1.merge(_df2, on=['gameId', 'playId'], suffixes=('_from', '_to'))
  _df1['explode'] = pd.Series(_df1.loc[:, ['frameId_from', 'frameId_to']].values.tolist(), index=_df1.index)
  _df1['explode'] = _df1['explode'].apply(tuple).apply(tuple2range)
  _df1 = _df1.explode('explode')
  _df = _df1.drop(['frameId_from', 'frameId_to'], axis=1).rename(columns={'explode':'frameId'})
  return _df 

def clean_raw_data(df):
  df['is_football'] = 0
  df.loc[df['team']=='football', 'is_football'] = 1
  _df = df.loc[df['position'].isin(['P','K']), ['gameId', 'playId', 'team']].drop_duplicates()
  _df['is_kicking_team'] = 1
  df = df.merge(_df, how='left')
  df['is_kicking_team'] = df['is_kicking_team'].fillna(0).astype(int)
  df['is_going_left'] = df['playDirection'].replace(['left', 'right'], [1,0])
  return df

def clean_players_data(players):
  _height = players.loc[players['height'].str.contains('-'),'height'].str.split('-', expand=True)
  _height['inches'] = _height[0].astype(int).mul(12).add(_height[1].fillna(0).astype(int))
  players['inches'] = _height['inches']
  _height = players.loc[~players['height'].str.contains('-'),'height'].astype(int)
  players['inches'] = players['inches'].fillna(_height)
  return players      

def generate_tackles_df(df, plays, pff):
  _df = df.loc[(df['is_kicking_team']==1) & (df['event']=='tackle'), ['gameId', 'playId', 'frameId', 'nflId', 'jerseyNumber']].drop_duplicates()
  _df = plays.loc[:, ['gameId', 'playId', 'possessionTeam']].merge(_df)
  _df['jerseyNumber'] = _df['jerseyNumber'].astype(int)
  _df['tackler'] = _df['possessionTeam'].add(' ').add(_df['jerseyNumber'].astype(str).str.rjust(2,'0'))
  _pff = pff.loc[pff['tackler'].notnull(), ['gameId', 'playId', 'tackler']]
  # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
  _pff = pd.concat([_pff, pff.loc[pff['tackler'].notnull(), ['gameId', 'playId', 'assistTackler']].rename(columns={'assistTackler':'tackler'})]).dropna(subset=['tackler'])
  _pff['makes_tackle'] = 1
  _df = _df.merge(_pff)
  df_tackles = _df.loc[:, ['gameId', 'playId']].drop_duplicates().merge(df)
  df_tackles = df_tackles.merge(_df, how='left')
  df_tackles['makes_tackle'] = df_tackles['makes_tackle'].fillna(0).astype(int)
  return df_tackles

def generate_agg_bullshit(df, players, plays):
  _df = explode_dataframe(df, 'ball_snap', 'add 3')
  df_agg = df.merge(_df)
  df_agg = df_agg.loc[:, ['nflId', 's', 'a']].groupby(['nflId'], as_index=False).mean()
  
  df_agg = players.loc[:, ['nflId', 'Position', 'inches', 'weight']].merge(df_agg)
  df_agg = df_agg.rename(columns={'s':'snapBurstS', 'a':'snapBurstA'})

  _df = explode_dataframe(df, 1, 'ball_snap')
  _df_agg = df.merge(_df)
  _df_agg = _df_agg.loc[:, ['gameId', 'playId','nflId', 'dis']].groupby(['gameId', 'playId','nflId'], as_index=False).sum()
  _df_agg = _df_agg.loc[:, ['nflId', 'dis']].groupby('nflId', as_index=False).mean()
  _df_agg = _df_agg.rename(columns={'dis':'preSnapDis'})  
  df_agg = df_agg.merge(_df_agg)
  df_agg = players.loc[:, ['nflId','displayName']].merge(df_agg)

  pids = df.loc[~df['nflId'].isin(df_agg['nflId'].unique()), 'nflId'].unique()
  _df = explode_dataframe(df, 'autoevent_kickoff', 'add 3')
  _df1 = explode_dataframe(df, 'ball_snap', 'add 3')
  _df2 = explode_dataframe(df, 'kickoff', 'add 3')
  _df3 = explode_dataframe(df, 'onside_kick', 'add 3')
  _df4 = explode_dataframe(df, 'drop_kick', 'add 3')
  _df5 = explode_dataframe(df, 'free_kick', 'add 3')
  _df = pd.concat([_df, _df1, _df2, _df3, _df4, _df5]).drop_duplicates()
  burst_fill = df.loc[df['nflId'].isin(pids)].merge(_df)
  burst_fill = burst_fill.loc[:, ['nflId', 's', 'a']].groupby(['nflId'], as_index=False).mean()

  _df = explode_dataframe(df, 1, 'autoevent_kickoff')
  _df1 = explode_dataframe(df, 1, 'ball_snap')
  _df2 = explode_dataframe(df, 1, 'kickoff')
  _df3 = explode_dataframe(df, 1, 'onside_kick')
  _df4 = explode_dataframe(df, 1, 'drop_kick')
  _df5 = explode_dataframe(df, 1, 'free_kick')
  _df = pd.concat([_df, _df1, _df2, _df3, _df4, _df5]).drop_duplicates()
  motion_fill = df.loc[df['nflId'].isin(pids)].merge(_df)
  motion_fill = motion_fill.loc[:, ['gameId', 'playId','nflId', 'dis']].groupby(['gameId', 'playId', 'nflId'], as_index=False).sum()
  motion_fill = motion_fill.loc[:, ['nflId', 'dis']].groupby(['nflId'], as_index=False).mean()

  agg_fill = players.loc[:, ['nflId', 'displayName', 'Position', 'inches', 'weight']].merge(burst_fill).merge(motion_fill).rename(columns={'s':'snapBurstS', 'a':'snapBurstA', 'dis':'preSnapDis'})

  df_agg = pd.concat([df_agg, agg_fill], ignore_index=True).drop_duplicates(subset=['nflId'])

  _plays = plays.loc[plays['specialTeamsResult']=='Return', ['returnerId']].rename(columns={'returnerId':'nflId'}).dropna().drop_duplicates()
  _plays['nflId'] = _plays['nflId'].str.split(';')
  _plays = _plays.explode('nflId')
  _plays['nflId'] = _plays['nflId'].astype(int)

  _df = explode_dataframe(df, 'kick_received', 'add 3')
  _df_agg = df.merge(_df)
  _df_agg = _df_agg.loc[:, ['nflId', 's', 'a']].groupby('nflId', as_index=False).mean()
  _df_agg1 = _df_agg.rename(columns={'s':'kickReturnBurstS', 'a':'kickReturnBurstA'})  

  _df = explode_dataframe(df, 'kick_received', 'add 3')
  _df_agg = df.merge(_df)
  _df_agg = _df_agg.loc[:, ['gameId', 'playId','nflId', 'dis']].groupby(['gameId', 'playId','nflId'], as_index=False).sum()
  _df_agg = _df_agg.loc[:, ['nflId', 'dis']].groupby('nflId', as_index=False).mean()
  _df_agg2 = _df_agg.rename(columns={'dis':'kickReturnBurstDis'})  

  _df = explode_dataframe(df, 'punt_received', 'add 3')
  _df_agg = df.merge(_df)
  _df_agg = _df_agg.loc[:, ['nflId', 's', 'a']].groupby('nflId', as_index=False).mean()
  _df_agg3 = _df_agg.rename(columns={'s':'puntReturnBurstS', 'a':'puntReturnBurstA'})  

  _df = explode_dataframe(df, 'punt_received', 'add 3')
  _df_agg = df.merge(_df)
  _df_agg = _df_agg.loc[:, ['gameId', 'playId','nflId', 'dis']].groupby(['gameId', 'playId','nflId'], as_index=False).sum()
  _df_agg = _df_agg.loc[:, ['nflId', 'dis']].groupby('nflId', as_index=False).mean()
  _df_agg4 = _df_agg.rename(columns={'dis':'puntReturnBurstDis'})  

  _df_agg = _df_agg1.merge(_df_agg2).merge(_df_agg3).merge(_df_agg4)
  _df_agg = _plays.merge(_df_agg)

  df_agg = df_agg.merge(_df_agg, how='left')
  return df_agg

def generate_distance2football(df):
  df_dist2fb = pd.DataFrame()
  total = df.loc[:, ['gameId', 'playId']].drop_duplicates().shape[0]
  for idx, row in tqdm(df.loc[:, ['gameId', 'playId']].drop_duplicates().iterrows(), total=total, desc='dist2fb', leave=False):
    gid, pid = row['gameId'], row['playId']
    one_play = df.loc[(df['gameId']==gid) & (df['playId']==pid)]
    defteam_ids = one_play.loc[(one_play['is_kicking_team']==1), 'nflId'].astype(int).unique()
    id2generic = {x:idx for idx,x in enumerate(defteam_ids)}
    generic2id = {v:k for k,v in id2generic.items()}
    n_defenders = defteam_ids.shape[0]
    def_loc = one_play.loc[one_play['nflId'].isin(defteam_ids), ['x', 'y']]
    football_loc = one_play.loc[one_play['is_football']==1, ['x', 'y']]
    _dist = pd.DataFrame(distance.cdist(football_loc, def_loc, 'euclidean')) 

    _stage = pd.DataFrame()
    for generic_id in range(len(defteam_ids)):
      nflId = generic2id[generic_id]
      player_cols = _dist.index.values + (generic_id * one_play['frameId'].unique().shape[0])
      _dist1 = _dist.loc[:, player_cols]
      _dist1.columns = range(_dist1.shape[1])
      _df = pd.DataFrame(np.diag(_dist1), index=[_dist1.index, _dist1.columns]).reset_index(drop=True)
      _df['frameId'] = range(1,_df.shape[0]+1)
      _df['nflId'] = nflId
      _df['gameId'] = gid
      _df['playId'] = pid
      _df = _df.rename(columns={0:'distanceToFootball'})
      _stage = pd.concat([_stage, _df], ignore_index=True)
    df_dist2fb = pd.concat([df_dist2fb, _stage], ignore_index=True)
  return df_dist2fb       

def generate_nearest_blocker(df):
  nearest_blocker = pd.DataFrame()
  total = df.loc[:, ['gameId', 'playId']].drop_duplicates().shape[0]
  for idx, row in tqdm(df.loc[:, ['gameId', 'playId']].drop_duplicates().iterrows(), total=total, desc='nearest blocker', leave=False):
    gid, pid = row['gameId'], row['playId']
    one_play = df.loc[(df['gameId']==gid) & (df['playId']==pid)]

    n_frames = one_play['frameId'].max()
    posteam_ids = one_play.loc[(one_play['is_kicking_team']==0) & (one_play['is_football']==0), 'nflId'].astype(int).unique()
    defteam_ids = one_play.loc[(one_play['is_kicking_team']==1), 'nflId'].astype(int).unique()

    defid2generic = {x:idx for idx,x in enumerate(defteam_ids)}
    defgeneric2id = {v:k for k,v in defid2generic.items()}

    posid2generic = {x:idx for idx,x in enumerate(posteam_ids)}
    posgeneric2id = {v:k for k,v in posid2generic.items()}

    n_footguys = posteam_ids.shape[0]
    n_defenders = defteam_ids.shape[0]

    pos_loc = one_play.loc[one_play['nflId'].isin(posteam_ids), ['x', 'y']]
    def_loc = one_play.loc[one_play['nflId'].isin(defteam_ids), ['x', 'y']]
    _dist = pd.DataFrame(distance.cdist(pos_loc, def_loc, 'euclidean')) 
    diagonal = _dist.copy()
    for col in diagonal.columns:
      diagonal[col].values[:] = 0
    _stage = pd.DataFrame()
    for def_generic_id in range(len(defteam_ids)):
      def_nflId = defgeneric2id[def_generic_id]
      def_player_cols = np.array(range(n_frames)) + (def_generic_id * one_play['frameId'].unique().shape[0])
      _dist1 = _dist.loc[:, def_player_cols]
      _dist1.columns = range(_dist1.shape[1])
      
      _diagonal = diagonal.copy()
      for pos_generic_id in range(len(posteam_ids)):
        pos_nflId = posgeneric2id[pos_generic_id]
        pos_player_cols = np.array(range(n_frames)) + (pos_generic_id * one_play['frameId'].unique().shape[0])
        _dist2 = _dist1.loc[pos_player_cols]
        _dist2.columns = pos_player_cols
        _diagonal.loc[pos_player_cols, pos_player_cols] = _dist2
      _df = pd.DataFrame(np.diag(_diagonal))
      _df['frameId'] = list(range(1,n_frames+1)) * len(posteam_ids)
      _df['posId'] = np.repeat(posteam_ids, n_frames)
      _df = pd.pivot_table(_df, index=['frameId'], columns=['posId'])
      _df.columns = [x[1] for x in _df.columns.to_flat_index()]
      _df['nearestBlockerDistance'] = _df.min(axis=1)
      _df['nearestBlockerId'] = _df.idxmin(axis=1)
      _nearest_blocker = _df.loc[:, ['nearestBlockerDistance', 'nearestBlockerId']].reset_index()
      _nearest_blocker['nflId'] = def_nflId
      _nearest_blocker['gameId'] = gid
      _nearest_blocker['playId'] = pid  
      _stage = pd.concat([_stage, _nearest_blocker])
    nearest_blocker = pd.concat([nearest_blocker, _stage])    
  nearest_blocker = nearest_blocker.reset_index(drop=True)  
  return nearest_blocker  

def standardization(df):
  # numpy is used for the trig below; the `math` module is not among the notebook's imports
  df['dir_rad'] = np.mod(90 - df['dir'], 360) * np.pi/180.0
  df['x_std'] = df['x']
  df.loc[df['is_going_left']==1, 'x_std'] = 120 - df.loc[df['is_going_left']==1, 'x_std']
  df['y_std'] = df['y']
  df.loc[df['is_going_left']==1, 'y_std'] = 160/3 - df.loc[df['is_going_left']==1, 'y']
  df['dir_std'] = df['dir_rad']
  df.loc[df['is_going_left']==1, 'dir_std'] = np.mod(np.pi + df.loc[df['is_going_left']==1, 'dir_rad'], 2*np.pi)
  
  # kinda crude. could make this sharper (aka eliminate those who come back into the end zone)
  df['touchback_possible'] = 0
  df.loc[df['x_std']>100, 'touchback_possible'] = 1

  # replace null directions: kicking team defaults to 0, return team to pi
  df.loc[(df['is_kicking_team']==1) & (df['dir_std'].isna()), 'dir_std'] = 0
  df.loc[(df['is_kicking_team']==0) & (df['is_football']==0) & (df['dir_std'].isna()), 'dir_std'] = np.pi

  # speed components along the standardized x and y axes
  df['sx'] = df['s']*np.cos(df['dir_std'])
  df['sy'] = df['s']*np.sin(df['dir_std'])
  return df  

def get_play_desc(gid, pid):
  return plays.loc[(plays['gameId']==gid) & (plays['playId']==pid), 'playDescription'].values[0]

def get_xret(one_play, fid):  
  received = 1
  is_going_left = one_play['is_going_left'].unique()[0]
  received_at = one_play.loc[(one_play['event'].str.contains('receive')) & (one_play['is_football']==1), 'x'].values[0]
  received_frame = one_play.loc[(one_play['event'].str.contains('receive')) & (one_play['is_football']==1), 'frameId'].values[0]
  received_est_yards = one_play.loc[(one_play['frameId']==received_frame), 'kickReturnYardage_pred'].values[0]
  if received_frame >= fid:
    received = 0
  if is_going_left:
    est_yards = received_at + one_play.loc[one_play['frameId']==fid, 'kickReturnYardage_pred'].values[0]
    actual_yards = received_at + one_play.loc[one_play['frameId']==fid, 'kickReturnYardage'].values[0]    
    received_est_yards = received_at + received_est_yards
  else:
    est_yards = received_at - one_play.loc[one_play['frameId']==fid, 'kickReturnYardage_pred'].values[0]
    actual_yards = received_at - one_play.loc[one_play['frameId']==fid, 'kickReturnYardage'].values[0]
    received_est_yards = received_at - received_est_yards

  return est_yards, actual_yards, received, received_est_yards

def get_play_by_frame(fid, ax, los, one_play):
  ax.cla()
  gid = one_play['gameId'].unique()[0]
  pid = one_play['playId'].unique()[0]
  play_desc = get_play_desc(gid, pid)

  one_frame = one_play.loc[one_play['frameId']==fid]    
  # print([fid, est_yards, actual_yards])

  try:
    fig1 = sns.scatterplot(x='x',y='y',data=one_frame, hue='team', ax=ax, size='target_scaled', sizes=(60,200))
    est_yards, actual_yards, received, received_est_yards = get_xret(one_play, fid)
    fig1.axvline(received_est_yards, c='b', ls=':', alpha=received)
    fig1.axvline(est_yards, c='r', ls=':')
    fig1.axvline(actual_yards, c='r', ls='-')
  except Exception:
    # fall back to fixed-size markers when scaled sizes or predictions are unavailable
    fig1 = sns.scatterplot(x='x',y='y',data=one_frame, hue='team', ax=ax, s=100)

  fig1.axvline(los, c='k', ls=':')
  fig1.axvline(0, c='k', ls='-')
  fig1.axvline(100, c='k', ls='-')
  fig1.set_title(f"{play_desc}\ngame {gid} play {pid}")
  fig1.legend([]).set_visible(False)
  sns.despine(left=True)
  fig1.set_ylabel('')

  fig1.set_yticks([])
  fig1.set_xlim(-10,110)    
  fig1.set_ylim(0,54) 

def animate_predicted_play(one_play, suffix='_pred', use_pred_values=False, 
                           pred_value=None):    
  gid = one_play['gameId'].unique()[0]
  pid = one_play['playId'].unique()[0]

  if use_pred_values:
    if pred_value is None:
      raise ValueError('pred_value is required when use_pred_values=True')
    # min-max scale the predictions within each team and frame to set marker sizes
    scaled = []
    for idx, row in one_play.loc[:, ['team','frameId']].drop_duplicates().iterrows():
      fid = row['frameId']
      tid = row['team']
      _one_play = one_play.loc[(one_play['team']==tid)&(one_play['frameId']==fid), pred_value]
      scaled.append(pd.Series(minmax_scale(_one_play, feature_range=(0,1)), index=_one_play.index))
    # DataFrame/Series.append was removed in pandas 2.0, so collect and concat instead
    scaled = pd.concat(scaled)
      
    one_play['target_scaled'] = scaled.fillna(scaled.quantile(.3))
    one_play.loc[one_play['team']=='football', 'target_scaled'] = scaled.quantile(.15) 

  los = one_play.loc[(one_play['frameId']==1) & (one_play['team']=='football'), 'x'].values[0]

  fig = plt.figure(figsize=(14.4, 6.4))
  ax = fig.gca()
  # pass the actual frame ids: an integer count would start at 0 and never reach the last frame
  ani = animation.FuncAnimation(fig, get_play_by_frame, 
                                frames=np.sort(one_play['frameId'].unique()),
                                interval=100, repeat=True, fargs=(ax,los,one_play,))

  plt.close()
  ani.save(f'{gid}_{pid}.gif', writer='imagemagick', fps=10)
  return ani      
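
The burst features above depend on `explode_dataframe` turning an event into a short window of frames. With `'add 3'` the window is `range(frame, frame + 3)`, meaning the event frame plus the next two. The sketch below reproduces that idea on a tiny hypothetical play so the mechanics are easier to follow; it is a simplified illustration, not a replacement for the function.

```python
import pandas as pd

# Hypothetical one-play tracking slice for a single player.
tracking = pd.DataFrame({
    "gameId": [1] * 6,
    "playId": [10] * 6,
    "frameId": [1, 2, 3, 4, 5, 6],
    "event": ["None", "kick_received", "None", "None", "None", "tackle"],
    "s": [0.0, 1.0, 3.0, 5.0, 6.0, 2.0],
})

# Frame at which the event occurs, one row per play.
keys = ["gameId", "playId", "frameId"]
window = tracking.loc[tracking["event"] == "kick_received", keys].copy()

# Expand each start frame into the frames [start, start + 3).
window["frameId"] = window["frameId"].apply(lambda f: list(range(f, f + 3)))
window = window.explode("frameId")
window["frameId"] = window["frameId"].astype(int)

# An inner merge keeps only the tracking rows inside the window.
burst = tracking.merge(window, on=keys)
mean_burst_speed = burst["s"].mean()
print(burst[["frameId", "s"]], mean_burst_speed)
```
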

# data load and preprocessing 

load data in, including `nflfastR` 2018-2020 data 

In [None]:
project_dir = 'drive/My Drive/2021bdb'
fastr_dir = 'drive/My Drive/nflfastR-data/data'

fns = [x for x in os.listdir(fastr_dir) if ('csv.gz' in x)]
fns = [f"{fastr_dir}/{x}" for x in fns if x[-11:-7] in ['2018', '2019', '2020']]
# read every season file and stack them (DataFrame.append was removed in pandas 2.0)
fastr_data = pd.concat([pd.read_csv(fn) for fn in fns], ignore_index=True)

plays = pd.read_csv(f"{project_dir}/data/plays.csv.zip", compression='zip')
plays['specialTeamsPlayType_code'] = plays['specialTeamsPlayType'].astype('category').cat.codes

_df = fastr_data.loc[:, ['old_game_id', 'play_id', 'ep', 'epa']].rename(columns={'old_game_id':'gameId', 'play_id':'playId'})
plays = plays.merge(_df)

players = pd.read_csv(f"{project_dir}/data/players.csv")
players = clean_players_data(players)
pff = pd.read_csv(f"{project_dir}/data/PFFScoutingData.csv.zip", compression='zip')

preprocess raw data (~3 hrs) 

In [None]:
tracking_fns = [x for x in os.listdir(f"{project_dir}/data") if 'tracking' in x]
for season in tqdm(range(2018,2021), desc='preprocessing'):
  df = pd.read_csv(f"{project_dir}/data/tracking{season}.csv.zip", compression='zip')
  df = clean_raw_data(df)  
  
  df_tackles = generate_tackles_df(df, plays, pff)
  df_tackles.to_csv(f'{project_dir}/processed/{season}_df_tackles.csv', index=False)      

  df_agg = generate_agg_bullshit(df, players, plays)
  df_agg.to_csv(f'{project_dir}/processed/{season}_df_agg.csv', index=False)      

  df_dist2fb = generate_distance2football(df)
  df_dist2fb.to_csv(f'{project_dir}/processed/{season}_kicking_team_dist2fb.csv', index=False)
  
  nearest_blocker = generate_nearest_blocker(df)
  nearest_blocker.to_csv(f'{project_dir}/processed/{season}_nearest_blocker.csv', index=False)
  

Merge the preprocessed data into one dataset per season (~30 mins).

In [None]:
for season in tqdm(range(2018,2021), desc='merging and saving'):
  df = pd.read_csv(f"{project_dir}/data/tracking{season}.csv.zip", compression='zip')
  df = clean_raw_data(df)  
  df_tackles = pd.read_csv(f'{project_dir}/processed/{season}_df_tackles.csv')      
  df_agg = pd.read_csv(f'{project_dir}/processed/{season}_df_agg.csv')      
  df_dist2fb = pd.read_csv(f'{project_dir}/processed/{season}_kicking_team_dist2fb.csv')
  nearest_blocker = pd.read_csv(f'{project_dir}/processed/{season}_nearest_blocker.csv')
  df_with_agg = (df
  .merge(df_dist2fb, how='left')
  .merge(df_agg
          .loc[:, ['nflId', 'Position', 
                  'inches', 'weight', 'snapBurstS', 
                  'snapBurstA', 'preSnapDis']]
          .drop_duplicates(), how='left')
  .merge(nearest_blocker.drop_duplicates(), how='left')
  .merge(df_tackles, how='left')
  .merge(plays.loc[:, ['gameId', 'playId', 'specialTeamsPlayType_code']]))
  # smallest distanceToFootball in each frame (NaN rows are ignored by min)
  _df_min = df_with_agg.loc[:, ['gameId', 'playId', 'frameId', 'distanceToFootball']].groupby(['gameId', 'playId', 'frameId'], as_index=False).min()
  df_with_agg = df_with_agg.merge(_df_min.rename(columns={'distanceToFootball':'closestDefenderDistance'}))
  # 0 for the player closest to the football in that frame, positive for everyone else
  df_with_agg['distanceToLikelyTackler'] = df_with_agg['distanceToFootball'].sub(df_with_agg['closestDefenderDistance'])
  df_with_agg['is_going_left'] = df_with_agg['playDirection'].replace(['left', 'right'], [1,0])
  # rows with no tackle record are treated as "no tackle"
  df_with_agg['makes_tackle'] = df_with_agg['makes_tackle'].fillna(0)
  df_with_agg.to_csv(f'{project_dir}/processed/{season}_processed_data.csv', index=False) 
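
The two distance columns built in the merge cell are easier to follow on a toy frame. The sketch below uses made-up values and the same column names as above; it only illustrates the groupby-min step and is not a substitute for the full pipeline.

```python
import pandas as pd

# Toy frame: three coverage players in two frames of one play (values are made up).
toy = pd.DataFrame({
    'gameId': [1, 1, 1, 1, 1, 1],
    'playId': [10, 10, 10, 10, 10, 10],
    'frameId': [1, 1, 1, 2, 2, 2],
    'nflId': [101, 102, 103, 101, 102, 103],
    'distanceToFootball': [12.0, 8.5, 20.0, 9.0, 9.5, 18.0],
})

keys = ['gameId', 'playId', 'frameId']

# Smallest distance to the football within each frame.
closest = (toy.groupby(keys, as_index=False)['distanceToFootball'].min()
              .rename(columns={'distanceToFootball': 'closestDefenderDistance'}))
toy = toy.merge(closest, on=keys)

# 0 for the closest player in the frame, positive for everyone else.
toy['distanceToLikelyTackler'] = toy['distanceToFootball'] - toy['closestDefenderDistance']

# Rows with a gap of 0 are the "likely tackler" of each frame.
likely_tacklers = toy.loc[toy['distanceToLikelyTackler'] == 0, ['frameId', 'nflId']]
print(likely_tacklers)
```

The xreturn preprocessing later selects rows with `distanceToLikelyTackler == 0` in the same way to pick one coverage player per frame.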

# Tackle probability model

Build the tackle probability model dataset from the processed data (kicking-team players only).

In [None]:
id_cols = ['gameId', 'playId','nflId', 'displayName','position']
model_feats = ['frameId', 'specialTeamsPlayType_code', 'is_going_left', 'x', 'y', 's', 'a', 'dis', 'o', 
               'dir', 'distanceToFootball', 'inches', 'weight', 'snapBurstS', 
               'snapBurstA', 'preSnapDis', 'closestDefenderDistance',
               'distanceToLikelyTackler', 'nearestBlockerDistance']
target = 'makes_tackle'
tackle_model_frames = []
for season in tqdm(range(2018,2021), desc='make tackle model data'):
  df_with_agg = pd.read_csv(f'{project_dir}/processed/{season}_processed_data.csv') 
  # keep kicking-team rows that have every feature and the target
  tackle_model_frames.append(df_with_agg.loc[(df_with_agg['is_kicking_team']==1), id_cols+model_feats+[target]].drop_duplicates().dropna(subset=model_feats+[target]))

# stack the seasons into one frame
tackle_model_data = pd.concat(tackle_model_frames)

tackle_model_data.to_csv(f'{project_dir}/processed/tackle_model_data.csv', index=False)

Load the tackle probability model data and label every frame of the eventual tackler as a positive example.

In [None]:
model_data = pd.read_csv(f'{project_dir}/processed/tackle_model_data.csv') 
_df1 = model_data.loc[:, ['gameId', 'playId', 'nflId', 'frameId']].reset_index()
_df2 = model_data.loc[model_data['makes_tackle']==1, ['gameId', 'playId', 'nflId']]
_df = _df1.merge(_df2)
idx = _df['index'].unique()

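# 1 for every row of a player who makes the tackle on that play, 0 for all other rows
_s = pd.Series(1, index=idx)
model_data['frame_of_tackle'] = model_data['makes_tackle']
model_data['makes_tackle'] = _s.reindex(model_data.index, fill_value=0)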

_tacklers_only = model_data.loc[(model_data['makes_tackle']==1), ['gameId', 'playId', 'nflId']].drop_duplicates()
model_data['gameIdPlayId'] = model_data['gameId'].astype(str).add(model_data['playId'].astype(str))
model_data['gameIdPlayId'] = model_data['gameIdPlayId'].astype(int)
tacklers_only = _tacklers_only.merge(model_data)
id_cols = ['gameId', 'playId','nflId', 'displayName','position']
model_feats = ['frameId', 'specialTeamsPlayType_code', 'is_going_left', 'x', 'y', 's', 'a', 'dis', 
               'o', 'dir', 'distanceToFootball', 'inches', 'weight', 
               'snapBurstS', 'snapBurstA', 'preSnapDis', 
               'closestDefenderDistance', 'distanceToLikelyTackler', 
               'nearestBlockerDistance']
target = 'makes_tackle'
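
The relabelling in the cell above turns a frame-level tackle flag into a flag that covers every frame of the tackler on that play. A minimal sketch of the same idea on toy data follows; it uses `groupby(...).transform('max')` as a stand-in for the index merge in the notebook, and the values are made up.

```python
import pandas as pd

# Toy play: player 101 makes the tackle at frame 3, player 102 never does.
toy = pd.DataFrame({
    'gameId': [1] * 6,
    'playId': [10] * 6,
    'nflId': [101, 101, 101, 102, 102, 102],
    'frameId': [1, 2, 3, 1, 2, 3],
    'makes_tackle': [0, 0, 1, 0, 0, 0],
})

# Keep the original frame-level flag, as the notebook does.
toy['frame_of_tackle'] = toy['makes_tackle']

# Spread the flag to every frame of the same player on the same play.
toy['makes_tackle'] = (toy.groupby(['gameId', 'playId', 'nflId'])['makes_tackle']
                          .transform('max'))
print(toy)
```

With this labelling the model is asked, at every frame, whether this player will end up making the tackle, rather than whether the tackle happens on this exact frame.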

Produce out-of-fold tackle probabilities for all kicking-team players.

In [None]:
df_pred = pd.DataFrame()
folds = 5
kf = GroupKFold(folds)

# learned from hyperparam tuning with optuna 
p = {'bagging_fraction': 0.9911851261849397,
 'bagging_freq': 3,
 'device': 'gpu',
 'feature_fraction': 0.948,
 'feature_pre_filter': False,
 'gpu_device_id': 0,
 'gpu_platform_id': 0,
 'lambda_l1': 1.0919192786057997e-05,
 'lambda_l2': 0.23713111023668204,
 'learning_rate': 0.1,
 'max_bin': 63,
 'min_child_samples': 20,
 'num_leaves': 3,
 'num_thread': 28,
 'objective': 'binary',
 'verbosity': -1}

for train_idx, test_idx in tqdm(kf.split(model_data, groups=model_data['gameIdPlayId']), total=folds):
  train_data = model_data.iloc[train_idx]
  # copy so the prediction column is added to an independent frame, not a slice of model_data
  test_data = model_data.iloc[test_idx].copy()

  lgb_train = lgb_vanilla.Dataset(train_data.loc[:, model_feats], train_data[target])
  lgb_test = lgb_vanilla.Dataset(test_data.loc[:, model_feats], test_data[target])
  # early stopping and silent logging via callbacks (LightGBM >= 3.3); the held-out fold doubles as the validation set
  _model = lgb_vanilla.train(p, lgb_train, num_boost_round=4000, valid_sets=[lgb_test],
                             callbacks=[lgb_vanilla.early_stopping(200, verbose=False), lgb_vanilla.log_evaluation(0)])
  test_data[f'{target}_pred'] = pd.Series(_model.predict(test_data.loc[:, model_feats]), index=test_data.index)
  df_pred = pd.concat([df_pred, test_data])

df_pred.to_csv(f'{project_dir}/processed/tackle_pred.csv', index=False) 
df_pred = pd.read_csv(f'{project_dir}/processed/tackle_pred.csv') 
df_pred['season'] = df_pred['gameId'].astype(str).str[:4].astype(int)
for season in range(2018,2021):
  df_pred.loc[df_pred['season']==season, ['gameId', 'playId', 'frameId', 'nflId', 'makes_tackle_pred']].to_csv(f'{project_dir}/processed/{season}_tackle_pred.csv', index=False) 
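
Grouping the folds by play is what keeps frames of the same play from landing in both the training and the held-out split. The sketch below checks that property on toy data with scikit-learn's `GroupKFold`; the group ids and fold count are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 plays with 4 frames each; the group id plays the role of gameIdPlayId.
groups = np.repeat(np.arange(6), 4)
X = np.arange(len(groups)).reshape(-1, 1)

kf = GroupKFold(n_splits=3)
leaks = 0
seen_test_rows = []
for train_idx, test_idx in kf.split(X, groups=groups):
    # Plays that appear on both sides of the split would leak information.
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    leaks += len(overlap)
    seen_test_rows.extend(test_idx.tolist())

print('plays shared between train and test:', leaks)
print('rows predicted exactly once:', sorted(seen_test_rows) == list(range(len(groups))))
```

Because every row falls in exactly one held-out fold, concatenating the fold predictions gives one out-of-fold prediction per row.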

# Expected return (xreturn) model

Preprocess the xreturn model data: pair each returner frame with the closest kicking-team player in that frame.

In [None]:
id_cols = ['gameId', 'playId','nflId', 'displayName','position']
returner_feats = ['frameId', 'x_std', 'y_std', 'sx', 'sy', 'touchback_possible', 'specialTeamsPlayType_code']
tackler_feats = ['frameId', 'x_std', 'y_std', 'sx', 'sy', 
                 'nearestBlockerDistance', 'makes_tackle_pred']
target = 'kickReturnYardage'

for season in tqdm(range(2018,2021)):
  df_with_agg = pd.read_csv(f'{project_dir}/processed/{season}_processed_data.csv') 
  tackle_pred = pd.read_csv(f'{project_dir}/processed/{season}_tackle_pred.csv') 
  df_with_agg = standardization(df_with_agg)
  _returner = plays.loc[plays['specialTeamsResult']=='Return', ['gameId', 'playId', 'returnerId', 'kickReturnYardage']].rename(columns={'returnerId':'nflId'})
  # returnerId can list several players separated by ';': split so explode gives each returner a row
  _returner['nflId'] = _returner['nflId'].str.split(';')
  _returner = _returner.explode('nflId')
  _returner['nflId'] = _returner['nflId'].astype(float)
  _returner['is_returner'] = 1
  df_with_agg = df_with_agg.merge(_returner, how='left')
  df_with_agg['is_returner'] = df_with_agg['is_returner'].fillna(0)
  df_with_agg = df_with_agg.merge(tackle_pred, how='left')
  df_with_agg['uid'] = df_with_agg['gameId'].astype(str).add(df_with_agg['playId'].astype(str))



  _tackler = df_with_agg.loc[df_with_agg['distanceToLikelyTackler']==0, id_cols+tackler_feats]
  _tackler['makes_tackle_pred'] = _tackler['makes_tackle_pred'].fillna(_tackler['makes_tackle_pred'].median())
  _returner = df_with_agg.loc[df_with_agg['is_returner']==1, id_cols+returner_feats+[target]]
  drop_plays = _returner.loc[_returner['sx'].isnull(), ['gameId', 'playId']].drop_duplicates()
  drop_plays['uid'] = drop_plays['gameId'].astype(str).add(drop_plays['playId'].astype(str))
  df_with_agg = df_with_agg.loc[~df_with_agg['uid'].isin(drop_plays['uid'].unique())]
  _returner = _returner.dropna()
  rename_cols = {'x_std':'kick_x_std', 'y_std':'kick_y_std', 'sx':'kick_sx', 'sy':'kick_sy'}
  _tackler = _tackler.loc[:, ['gameId', 'playId', 'frameId', 'x_std', 'y_std', 'sx', 'sy', 'nearestBlockerDistance', 'makes_tackle_pred']].rename(columns=rename_cols)
  xret_model_data = _returner.merge(_tackler)
  xret_model_data['x_diff'] = xret_model_data['x_std'].sub(xret_model_data['kick_x_std'])
  xret_model_data['y_diff'] = xret_model_data['y_std'].sub(xret_model_data['kick_y_std'])
  xret_model_data['sx_diff'] = xret_model_data['sx'].sub(xret_model_data['kick_sx'])
  xret_model_data['sy_diff'] = xret_model_data['sy'].sub(xret_model_data['kick_sy'])
  xret_model_data.to_csv(f'{project_dir}/processed/{season}_xret_model_data.csv', index=False)
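
The returner handling near the top of this cell deals with plays whose `returnerId` holds more than one id separated by `;`. A minimal sketch of that split-and-explode step on a toy `plays` frame (made-up ids) is shown here.

```python
import pandas as pd

# Toy plays table: the second play lists two returners.
toy_plays = pd.DataFrame({
    'gameId': [1, 1],
    'playId': [10, 20],
    'returnerId': ['101', '102;103'],
})

returners = toy_plays.rename(columns={'returnerId': 'nflId'})

# Split on ';' to get a list per play, then give each id its own row.
returners['nflId'] = returners['nflId'].str.split(';')
returners = returners.explode('nflId')

# Tracking data stores nflId as a number, so convert before merging.
returners['nflId'] = returners['nflId'].astype(float)
print(returners)
```

After the explode, a play with two listed returners contributes two rows, so both players are flagged with `is_returner` when merged onto the tracking data.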

Combine the seasons and make the data model-ready.

In [None]:
# stack the per-season xreturn model data into one frame with a fresh index
model_data = pd.concat(
  [pd.read_csv(f'{project_dir}/processed/{season}_xret_model_data.csv') for season in range(2018,2021)],
  ignore_index=True)
model_data['gameIdPlayId'] = model_data['gameId'].astype(str).add(model_data['playId'].astype(str))
model_data['gameIdPlayId'] = model_data['gameIdPlayId'].astype(int)
id_cols = ['gameId', 'playId','nflId', 'displayName','position']
model_feats = ['frameId', 'specialTeamsPlayType_code', 'x_std', 'y_std', 'sx', 'sy', 'touchback_possible',
               'kick_x_std', 'kick_y_std', 'kick_sx', 'kick_sy','x_diff',
               'y_diff', 'sx_diff', 'sy_diff', 'nearestBlockerDistance', 
               'makes_tackle_pred']
target = 'kickReturnYardage'

In [None]:
df_pred = pd.DataFrame()
folds = 5
kf = GroupKFold(folds)

# params from optuna
p ={'bagging_fraction': 1.0,
 'bagging_freq': 0,
 'device': 'gpu',
 'feature_fraction': 1.0,
 'feature_pre_filter': False,
 'gpu_device_id': 0,
 'gpu_platform_id': 0,
 'lambda_l1': 0.0,
 'lambda_l2': 0.0,
 'learning_rate': 0.1,
 'max_bin': 63,
 'metric': 'rmse',
 'min_child_samples': 20,
 'num_leaves': 251,
 'num_thread': 28,
 'objective': 'regression'}

for train_idx, test_idx in tqdm(kf.split(model_data, groups=model_data['gameIdPlayId']), total=folds):
  train_data = model_data.iloc[train_idx]
  # copy so the prediction column is added to an independent frame, not a slice of model_data
  test_data = model_data.iloc[test_idx].copy()

  lgb_train = lgb_vanilla.Dataset(train_data.loc[:, model_feats], train_data[target])
  lgb_test = lgb_vanilla.Dataset(test_data.loc[:, model_feats], test_data[target])
  # early stopping and silent logging via callbacks (LightGBM >= 3.3); the held-out fold doubles as the validation set
  _model = lgb_vanilla.train(p, lgb_train, num_boost_round=4000, valid_sets=[lgb_test],
                             callbacks=[lgb_vanilla.early_stopping(200, verbose=False), lgb_vanilla.log_evaluation(0)])
  test_data[f'{target}_pred'] = pd.Series(_model.predict(test_data.loc[:, model_feats]), index=test_data.index)
  df_pred = pd.concat([df_pred, test_data])

df_pred.to_csv(f'{project_dir}/processed/xret.csv', index=False) 
df_pred['season'] = df_pred['gameId'].astype(str).str[:4].astype(int)
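for season in range(2018,2021):
  # save per-season expected return predictions under their own name (the file name is a choice made here)
  # so the tackle model's {season}_tackle_pred.csv files are not overwritten
  df_pred.loc[df_pred['season']==season, ['gameId', 'playId', 'frameId', 'nflId', 'kickReturnYardage_pred']].to_csv(f'{project_dir}/processed/{season}_xret_pred.csv', index=False) 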

# Munge model outputs together

Take the outputs of both models and munge them into one dataframe.

In [None]:
_merge = pd.DataFrame()
for season in [2018,2019,2020]:
  df_with_agg = pd.read_csv(f'{project_dir}/processed/{season}_processed_data.csv') 
  _df = df_with_agg.loc[:, ['gameId', 'playId', 'frameId', 'distanceToFootball', 'makes_tackle']].sort_values('distanceToFootball').drop_duplicates(subset=['gameId', 'playId', 'frameId'])
  _df['season'] = season
  _merge = pd.concat([_merge, _df])


received_frames = pd.DataFrame()
for season in [2018,2019,2020]:
  df = pd.read_csv(f"{project_dir}/data/tracking{season}.csv.zip", compression='zip')
  df = clean_raw_data(df)  
  df = standardization(df)
  _received_frames = df.loc[df['event'].str.contains('receive'), ['gameId', 'playId', 'frameId']]
  _received_frames = _received_frames.drop_duplicates()
  _received_frames['receives_ball'] = 1
  received_frames = pd.concat([received_frames, _received_frames])

xret = pd.read_csv(f'{project_dir}/processed/xret.csv')
# count missed tacklers per play: missedTackler holds ';'-separated entries and is empty when nobody missed
pff['tacklesAvoided'] = pff['missedTackler'].str.split(';').str.len().fillna(0).astype(int)
xret = xret.merge(pff.loc[:, ['gameId', 'playId', 'tacklesAvoided']].drop_duplicates())
xret = xret.merge(_merge, on=['gameId', 'playId', 'frameId'], how='left')
xret = xret.merge(received_frames, how='left')
xret['receives_ball'] = xret['receives_ball'].fillna(0)
xret = xret.merge(plays.loc[:, ['gameId', 'playId','specialTeamsPlayType']])
xret['kickReturnYardage_oe'] = xret['kickReturnYardage'].sub(xret['kickReturnYardage_pred'])
xret_oe = xret.loc[xret['receives_ball']==1, ['season','nflId', 'kickReturnYardage', 'tacklesAvoided', 'kickReturnYardage_oe']].groupby(['season','nflId'], as_index=False).sum()
_xret_oe = xret.loc[xret['receives_ball']==1, ['season','nflId', 'kickReturnYardage_oe']].groupby(['season','nflId'], as_index=False).count().rename(columns={'kickReturnYardage_oe':'n_returns'})
xret_oe = players.loc[:, ['nflId', 'displayName']].drop_duplicates(subset=['nflId'], keep='last').merge(xret_oe).merge(_xret_oe)
xret_oe = xret_oe.sort_values('kickReturnYardage_oe', ascending=False).reset_index(drop=True)

_ep = plays.loc[:, ['returnerId', 'ep', 'epa']].rename(columns={'returnerId':'nflId'}).dropna()
# returnerId can hold several ';'-separated ids: split so explode gives each returner a row
_ep['nflId'] = _ep['nflId'].str.split(';')
_ep = _ep.explode('nflId')
_ep['nflId'] = _ep['nflId'].astype(int)
_ep = _ep.groupby('nflId', as_index=False).sum()

xret_oe = xret_oe.merge(_ep)

target = 'makes_tackle_pred'
_df1 = xret.loc[xret['receives_ball']==1, ['gameId', 'playId', 'season', 'nflId', 'displayName', 'kickReturnYardage', 'tacklesAvoided', 'kickReturnYardage_oe']]
_df2 = xret.loc[(xret['makes_tackle']!=1)&(xret['distanceToFootball']<=4), ['gameId', 'playId','season','nflId', target]].groupby(['gameId', 'playId','season','nflId'], as_index=False).sum()
wtf = _df1.merge(_df2)
for col in ['kickReturnYardage_oe', 'makes_tackle_pred']:
  wtf[f"{col}_scaled"] = scale(wtf[col])

WINNING_METRIC_NAME = "returner quality" 
wtf[WINNING_METRIC_NAME] = wtf.loc[:, ['kickReturnYardage_oe_scaled', 'makes_tackle_pred_scaled']].sum(axis=1)  

cols = ['player', 'n_returns', 'kickReturnYardage',
       'tacklesAvoided', 'kickReturnYardage_oe', 'makes_tackle_pred',
       WINNING_METRIC_NAME]

_df = wtf.loc[:, ['season','nflId', 'displayName', WINNING_METRIC_NAME]].groupby(['season','nflId', 'displayName'], as_index=False).count().rename(columns={WINNING_METRIC_NAME:'n_returns'})
df_agg = wtf.loc[:, ['season','nflId', 'displayName', 'kickReturnYardage', 'tacklesAvoided', 'kickReturnYardage_oe', 'makes_tackle_pred', WINNING_METRIC_NAME]].groupby(['season','nflId', 'displayName'], as_index=False).mean()
df_agg = df_agg.merge(_df)
df_agg['player'] = df_agg['season'].astype(str).add(' ').add(df_agg['displayName'])
df_agg = df_agg.loc[:, cols]
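
The composite built above is the sum of two standardized columns: return yards over expected and the summed tackle probability from frames where the nearest defender was within 4 yards without making the tackle. A minimal sketch of that arithmetic on made-up per-return values follows; the `zscore` helper is a hypothetical stand-in for scikit-learn's `scale`, which also uses the population standard deviation.

```python
import numpy as np

def zscore(values):
    """Standardize to zero mean and unit variance (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up per-return inputs for four returns.
yards_over_expected = [5.0, -2.0, 0.0, -3.0]
avoided_tackle_prob = [0.2, 0.8, 0.5, 0.1]

# Each component contributes on the same scale once standardized.
returner_quality = zscore(yards_over_expected) + zscore(avoided_tackle_prob)
print(returner_quality.round(3))
```

Standardizing first keeps the yardage component, which is measured in yards, from swamping the probability component; averaging the per-return values by player and season then gives the table in `df_agg`.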