# Goal
This notebook analyzes data on kickoff and punt plays.  

On a punt or kickoff play, a player on the receiving team catches the ball and runs to return it, while the kicking team runs toward the returner and tries to tackle them to stop the return.

To plan a better strategy, it is useful to predict the return distance of a punt from the player tracking data.
However, the exact return distance cannot be predicted from the position data at a given moment.
The return distance increases significantly when the returner slips through a tackle, but this does not always happen.

In this notebook, instead of predicting the return distance directly, we estimate the probability distribution of the distance the returner travels from the current position, taking into account the success rate of the tackle and the positions of all players.

# 1. Load the Libraries and Datasets
We use the following datasets and columns.  
To reduce memory usage, we apply a `downcast` function copied from the following page: <br>
https://www.kaggle.com/werooring/nfl-big-data-bowl-basic-eda-for-beginner

* **games.csv** : gameId, homeTeamAbbr

* **plays.csv** : gameId, playId, possessionTeam, specialTeamsPlayType, specialTeamsResult, returnerId, kickReturnYardage

* **tracking20\*\*.csv** : x, y, s, a, dir, event, nflId, displayName, team, frameId, gameId, playId, playDirection

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
#players = pd.read_csv('../input/nfl-big-data-bowl-2022/players.csv')
#pffs = pd.read_csv('../input/nfl-big-data-bowl-2022/PFFScoutingData.csv')
track18 = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2018.csv')
#track19 = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2019.csv')
#track20 = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2020.csv')
games = pd.read_csv('../input/nfl-big-data-bowl-2022/games.csv')
plays = pd.read_csv('../input/nfl-big-data-bowl-2022/plays.csv')

In [None]:
# Copied from https://www.kaggle.com/werooring/nfl-big-data-bowl-basic-eda-for-beginner

def downcast(df, verbose=True):
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name == 'object':
            pass
        elif dtype_name == 'bool':
            df[col] = df[col].astype('int8')
        elif dtype_name.startswith('int') or (df[col].round() == df[col]).all():
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage().sum() / 1024**2
#    if verbose:
#        print('{:.1f}% Compressed'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


# Drop unneeded columns
games = games.drop(['season', 'week', 'gameDate', 'gameTimeEastern','visitorTeamAbbr'], axis=1)
games = downcast(games)

plays = plays.drop(['playDescription', 'quarter', 'down', 'yardsToGo', 
                    'kickerId', 'kickBlockerId', 'yardlineSide', 'yardlineNumber', 
                    'gameClock', 'penaltyCodes', 'penaltyJerseyNumbers', 'penaltyYards', 
                    'preSnapHomeScore', 'preSnapVisitorScore', 'passResult', 
                    'kickLength', 'playResult', 'absoluteYardlineNumber'], axis=1)
plays = downcast(plays)

df_list = []
for df_temp in [track18]:#, track19, track20]:
    df_temp = df_temp.drop(['time','dis','o','jerseyNumber','position'], axis=1)
    df_temp = downcast(df_temp)
    df_list.append(df_temp)
track_data = pd.concat(df_list, axis=0).reset_index(drop=True)

# Free memory
del track18#, track19, track20

# 2. Data Preparation
According to the `event` column in the tracking data, a normal punt play proceeds in the following order.

* ball_snap
* punt
* punt_received
* first_contact
* tackle
  
We will examine the return yardage from `punt_received` (or `kick_received` on a kickoff play) to `tackle`.


### Modifying "plays.csv" data (+ "games.csv")
1. Keep only the plays in plays.csv on which a return occurred.
1. Add home team information ("homeTeamAbbr") from games.csv.
1. Add the returner's ID ("returnerId") as an explanatory variable.

### Modifying "tracking20\*\*.csv" data

1. Restrict the tracking data to plays on which a return occurred.
1. Correct the coordinates according to "playDirection".
1. Convert the team label in the coordinate data from (home / away) to (offense / defense). The code spells the latter label `'diffense'`.
1. Restrict the data to the time between the reception and the tackle.
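
The coordinate correction in step 2 can be illustrated on a toy table. This is a minimal sketch, not the notebook's own code: `normalize_direction` and the `demo` frame are hypothetical names, and the field size (120 x 53.3 yards) is taken from the constants used in the notebook. Plays with `playDirection == 'left'` are mirrored so that every play is expressed in the same orientation.

```python
import numpy as np
import pandas as pd

def normalize_direction(df, field_length=120.0, field_width=53.3):
    """Mirror plays moving left so all plays share one orientation (hypothetical helper)."""
    out = df.copy()
    left = out["playDirection"] == "left"
    # Point reflection through the centre of the field for left-moving plays
    out["x"] = np.where(left, field_length - out["x"], out["x"])
    out["y"] = np.where(left, field_width - out["y"], out["y"])
    # Rotate the movement direction by 180 degrees for the mirrored plays
    out["dir"] = np.where(left, (out["dir"] + 180.0) % 360.0, out["dir"])
    return out

# Toy data: the same raw position recorded on a right-moving and a left-moving play
demo = pd.DataFrame({
    "x": [30.0, 30.0],
    "y": [10.0, 10.0],
    "dir": [90.0, 270.0],
    "playDirection": ["right", "left"],
})
normalized = normalize_direction(demo)
print(normalized)
```

After normalization the left-moving row is mirrored while the right-moving row is unchanged, so distances "in front of" the returner can be measured along a single x direction.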

In [None]:
# Keep only plays that resulted in a return
return_plays = plays[(plays['specialTeamsResult']== 'Return')]

# Add home-team information from the games data
return_plays = pd.merge(return_plays, games, on=['gameId'], how='inner')

# Add the returner's ID as an explanatory variable
return_plays = return_plays.dropna(subset=['returnerId'])
return_plays['returnerId'] = return_plays['returnerId'].str[:5]
return_plays['returnerId'] = return_plays['returnerId'].astype(int)


###
# Restrict the tracking data to plays on which a return occurred
track_data = pd.merge(track_data, return_plays, on=['gameId','playId'], how='inner')

# Teams appear to switch sides, so correct the coordinates according to the play direction (playDirection):
# plays moving left are mirrored, plays moving right are left unchanged
is_left = (track_data['playDirection'] == 'left')
track_data['x'] = np.where(is_left, 120 - track_data['x'], track_data['x'])
track_data['y'] = np.where(is_left, 53.3 - track_data['y'], track_data['y'])
track_data['dir'] = np.mod(track_data['dir'] + 180 * is_left, 360)

# Convert `team` in the coordinate data from home / away to offense / diffense
homeplayer = (track_data['team'] == 'home')                                         # 'football' rows are treated like 'away'
hometeam   = (track_data['possessionTeam'] == track_data['homeTeamAbbr'])
track_data['team'] = (homeplayer == hometeam).replace({True:'offense', False:'diffense'})


###
# Further restrict to the interval between the reception and the tackle
receive = track_data[(track_data['displayName']=='football')&((track_data['event']=='punt_received') | (track_data['event']=='kick_received'))]
receive = receive[['gameId', 'playId','frameId']]
receive = receive.rename(columns={'frameId': 'frame_start'})

tackle = track_data[(track_data['displayName']=='football')&(track_data['event']=='tackle')]
tackle = tackle[['frameId','gameId', 'playId','x', 'returnerId']]
tackle = tackle.rename(columns={'frameId': 'frame_end', 'x': 'x_end'})

return_frame = pd.merge(receive, tackle, on=['gameId', 'playId'], how='inner')
#return_frame.head()


###
# Clean up variables that are no longer needed
del games, plays, receive, tackle

track_data = track_data[track_data['displayName'] != 'football']
track_data = track_data[['x', 'y', 's', 'a', 'dir', 'nflId', 'team', 'frameId', 'gameId', 'playId']]
#track_data.head()

# 3. Add New Feature
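From the original dataset, the following variables are derived and used as features.
In the code, `offense` is the possession (kicking) team, i.e. the returner's opponents, and `diffense` is the returning team.

* ball_neighbour(df_track, dist) : number of opponents and teammates within (dist) yards of the returner, for the player positions given in (df_track)
* pos_neighbour(df_track, dist) : number of opponents and teammates within (dist) yards **in front of** the returner, for the player positions given in (df_track)
* free-offense counts : number of `offense` players within (dist) yards of the returner who are more than 0.5 yards away from every returning-team player other than the returner

In addition, the average of these counts over the last five sampled frames is added as an explanatory variable.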
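
As a small illustration of these neighbour counts, the sketch below counts players per team around a returner on made-up positions. It is only a sketch under assumptions: `count_neighbours` and the toy coordinates are hypothetical and not part of the notebook. "In front of" follows the notebook's convention that a player with a smaller x than the returner is ahead of them.

```python
import numpy as np
import pandas as pd

def count_neighbours(frame, returner_xy, dist, ahead_only=False):
    """Count players per team within `dist` yards of the returner (hypothetical helper)."""
    dx = returner_xy[0] - frame["x"]   # positive when the player is in front of the returner
    dy = frame["y"] - returner_xy[1]
    near = np.hypot(dx, dy) < dist
    if ahead_only:
        near = near & (dx >= 0)
    return frame.loc[near, "team"].value_counts().to_dict()

# Toy frame: the returner stands at (50, 25)
frame = pd.DataFrame({
    "x":    [48.0, 53.0, 49.0, 60.0],
    "y":    [25.0, 25.0, 26.0, 25.0],
    "team": ["offense", "offense", "diffense", "diffense"],
})
around = count_neighbours(frame, (50.0, 25.0), dist=5)
ahead = count_neighbours(frame, (50.0, 25.0), dist=5, ahead_only=True)
print(around, ahead)
```

Restricting the count to players ahead of the returner drops the `offense` player at x = 53, who is behind the returner and so less likely to affect the remaining return distance.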

In [None]:
# Features: the number of teammates and opponents within a given distance of the ball carrier
def ball_neighbour(df_track, dist):
    offense_temp = df_track[(df_track['distance'] < dist)&(df_track['team']=='offense')]
    diffense_temp = df_track[(df_track['distance'] < dist)&(df_track['team']=='diffense')]
    return [offense_temp.shape[0], diffense_temp.shape[0]]

def pos_neighbour(df_track, dist):
    offense_temp = df_track[(df_track['distance'] < dist)&(df_track['distance_x'] >= 0)&(df_track['team']=='offense')]
    diffense_temp = df_track[(df_track['distance'] < dist)&(df_track['distance_x'] >= 0)&(df_track['team']=='diffense')]
    return [offense_temp.shape[0], diffense_temp.shape[0]]

# Build the features from the player position data
result_list = []
for i_index in return_frame.index:
    game_id = return_frame.loc[i_index,'gameId']
    play_id = return_frame.loc[i_index,'playId']
    frame_start = return_frame.loc[i_index,'frame_start']
    frame_end = return_frame.loc[i_index,'frame_end']
    x_end = return_frame.loc[i_index,'x_end']
    returner = return_frame.loc[i_index, 'returnerId']

    # Use the frames between the reception and the tackle (every third frame)
    for frame_index in range(frame_start, frame_end, 3):
        df_temp = track_data[(track_data['gameId']==game_id)&(track_data['playId']==play_id)&(track_data['frameId']==frame_index)].copy()
        df_temp1 = df_temp[df_temp['team']=='offense'].copy()
        df_temp2 = df_temp[(df_temp['team']=='diffense')&(df_temp['nflId']!=returner)].copy()

        returner_loc = df_temp[df_temp['nflId']==returner][['x','y','s','a','dir']].values[0]
        df_temp['distance'] = ((df_temp['x'] - returner_loc[0]) ** 2 + (df_temp['y'] - returner_loc[1]) ** 2) ** 0.5
        df_temp['distance_x'] = -(df_temp['x'] - returner_loc[0])     # positive when the player is in front of the returner

        # For each offense player, find the distance to the nearest diffense player other than the returner
        df_temp1['distance'] = 999999
        for diffense_index in df_temp2.index:
            diffense_loc = df_temp.loc[diffense_index,['x','y']].values
            df_temp1['distance'] = np.minimum(df_temp1['distance'], ((df_temp1['x'] - diffense_loc[0]) ** 2 + (df_temp1['y'] - diffense_loc[1]) ** 2) ** 0.5)
        df_temp1 = df_temp1[df_temp1['distance'] > 0.5]
        df_temp1['distance'] = ((df_temp1['x'] - returner_loc[0]) ** 2 + (df_temp1['y'] - returner_loc[1]) ** 2) ** 0.5

        feature_item = [i_index, game_id, play_id, frame_index] + returner_loc.tolist()
        feature_item += ball_neighbour(df_temp, 7) + ball_neighbour(df_temp, 5) + ball_neighbour(df_temp, 3) + ball_neighbour(df_temp, 1) + ball_neighbour(df_temp, 0.5)
        feature_item += pos_neighbour(df_temp, 7) + pos_neighbour(df_temp, 5) + pos_neighbour(df_temp, 3) + pos_neighbour(df_temp, 1) + pos_neighbour(df_temp, 0.5)
        feature_item += [ball_neighbour(df_temp1, 7)[0]] + [ball_neighbour(df_temp1, 3)[0]] + [ball_neighbour(df_temp1, 1)[0]]
        feature_item += [-(x_end-returner_loc[0])]
        
        result_list.append(feature_item)

df_train = pd.DataFrame(result_list, columns=['gameplayId','gameId','playId','frameId', 
                                              'returner_x','returner_y','returner_s','returner_a','returner_dir', 
                                              'offense7','diffense7','offense5','diffense5','offense3','diffense3','offense1','diffense1','offense.5','diffense.5',
                                              'offense+7','diffense+7','offense+5','diffense+5','offense+3','diffense+3','offense+1','diffense+1','offense+.5','diffense+.5',
                                              'free-offense7','free-offense3','free-offense1','return_yard'])

In [None]:
df_train.to_csv('intermediate_file.csv')
#df_train = pd.read_csv('../input/nfl2022-app/intermediate_file.csv', index_col=0)

In [None]:
# Data used for machine learning
df_list = []
for i_index in return_frame.index:
    df_temp = df_train[df_train['gameplayId']==i_index]
    df_temp = df_temp.rolling(5, min_periods=1).sum() / 5    # rolling sum over the last 5 sampled frames, divided by 5
    df_list.append(df_temp)

df_train_mean = pd.concat(df_list, axis=0)
df_train_mean = df_train_mean.drop(['gameplayId','gameId','playId','frameId','returner_s','returner_a','returner_dir','return_yard'], axis=1)
df_train_mean = df_train_mean.add_suffix('_mean')

df_train = pd.concat([df_train, df_train_mean], axis=1)
df_train = df_train.astype(float)

# 4. Predict Success Rate of the Tackle
From the data above, we predict the probability that the return is stopped.  
For simplicity, a return is considered stopped if the returner advances no more than 5 yards in the x-axis direction from the current frame.
* 1 : The returner is stopped within 5 yards.
* 0 : The returner advances more than 5 yards.

Logistic regression is used for prediction.  
Games played before December 2018 (`gameId` < 2018120000) are used as training data and games from December 2018 as test data.
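
The labelling rule and the date-based split can be checked on a tiny example. This is a sketch with made-up values: the `toy` frame and its `gameId` values are hypothetical, and only the threshold of 5 yards and the cutoff 2018120000 come from the notebook.

```python
import pandas as pd

# Hypothetical rows: two frames from a November game, one from a December game
toy = pd.DataFrame({
    "gameId": [2018110400, 2018110400, 2018120900],
    "return_yard": [3.0, 8.5, 5.0],
})

# 1 = returner stopped within 5 yards, 0 = returner advances more than 5 yards
label = (toy["return_yard"] <= 5).astype(int)

# Frames from games in December 2018 or later form the test set
is_test = toy["gameId"] >= 2018120000
y_train_toy, y_test_toy = label[~is_test], label[is_test]
print(y_train_toy.tolist(), y_test_toy.tolist())
```

Splitting by `gameId` rather than at random keeps all frames of one play on the same side of the split, which matters here because consecutive frames of a play are highly correlated.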

In [None]:
# Build the target and the feature matrix
from sklearn.model_selection import train_test_split

y_train = (df_train['return_yard'] <= 5) * 1
X_train = df_train.drop(['gameplayId','gameId','playId','frameId','return_yard'], axis=1)

mean_x = X_train.mean()
std_x = X_train.std()
#X_train = (X_train - mean_x) / std_x

# Split into training and test data
X_test = X_train[df_train['gameId'] >= 2018120000]
y_test = y_train[df_train['gameId'] >= 2018120000]
X_train = X_train[df_train['gameId'] < 2018120000]
y_train = y_train[df_train['gameId'] < 2018120000]

# Fit the model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.metrics import precision_score, recall_score

lr5 = LogisticRegression(solver='liblinear', penalty='l1')
lr5.fit(X_train, y_train)
y_train_pred = lr5.predict(X_train)
y_test_pred = lr5.predict(X_test)
print('F-score(train data): ', f1_score(y_train.values, y_train_pred))
print('F-score(test data): ', f1_score(y_test.values, y_test_pred))

This model gives us not only a prediction of whether the return is stopped, but also the probability that it is stopped.  
The calibration curve below shows the relationship between the predicted probability (x axis) and the actual success rate in each bin.
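
To make the meaning of one point on a calibration curve concrete, the sketch below computes the two quantities by hand for two bins. The probabilities and outcomes are made-up numbers, not results from this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities and observed outcomes (1 = return stopped)
pred = np.array([0.10, 0.20, 0.15, 0.80, 0.90, 0.70])
actual = np.array([0, 0, 1, 1, 1, 0])

# Two equal-width bins: [0, 0.5) gets id 0 and [0.5, 1] gets id 1
bin_id = np.digitize(pred, [0.5])

# x coordinate of each point: mean predicted probability in the bin
mean_predicted = np.array([pred[bin_id == b].mean() for b in (0, 1)])
# y coordinate of each point: fraction of actual positives in the bin
fraction_positive = np.array([actual[bin_id == b].mean() for b in (0, 1)])
print(mean_predicted, fraction_positive)
```

A well calibrated model has points close to the diagonal, that is, the mean predicted probability in each bin is close to the observed fraction of positives.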

In [None]:
# Plot the calibration curve
from sklearn.calibration import calibration_curve

prob = lr5.predict_proba(X_test)[:, 1] # predicted probability that the target is 1
prob_true, prob_pred = calibration_curve(y_true=y_test, y_prob=prob, n_bins=20)

fig, ax1 = plt.subplots()
ax1.plot(prob_pred, prob_true, marker='s', label='calibration plot', color='skyblue') # draw the calibration plot
ax1.plot([0, 1], [0, 1], linestyle='--', label='ideal', color='limegreen') # draw the 45-degree line
ax1.legend(bbox_to_anchor=(1.12, 1), loc='upper left')
ax1.set_xlabel("Mean Predicted Probability")
ax1.set_ylabel("Fraction of positives")
#ax2 = ax1.twinx() # add a second axis
#ax2.hist(prob, bins=20, histtype='step', color='orangered') # also plot a histogram of the scores
#ax2.set_ylabel("Counts")
plt.show()

Similarly, we fit one model per distance threshold to estimate, for each distance, the probability that the returner is stopped within that distance.
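
The per-threshold probabilities can be read as a cumulative distribution of the remaining return distance. Because each threshold gets its own independently fitted model, the raw values are not guaranteed to be non-decreasing in the distance. The sketch below shows one possible post-processing step, which is an assumption and not something the notebook does: it enforces monotonicity with a running maximum and then differences the result into per-interval probabilities. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical thresholds (yards) and raw model outputs P(return_yard <= d)
yards = np.array([1.0, 2.0, 3.0, 4.0])
raw_cdf = np.array([0.20, 0.45, 0.40, 0.80])

# Enforce a non-decreasing curve, then keep it inside [0, 1]
cdf = np.clip(np.maximum.accumulate(raw_cdf), 0.0, 1.0)

# Probability that the return ends in each interval (0, 1], (1, 2], (2, 3], (3, 4]
interval_prob = np.diff(cdf, prepend=0.0)

# Remaining probability mass: the return goes beyond the last threshold
beyond_last = 1.0 - cdf[-1]
print(cdf, interval_prob, beyond_last)
```

Other choices, such as isotonic regression across thresholds, would also produce a valid distribution; the running maximum is used here only because it is the shortest to write down.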

In [None]:
# Repeat the fit while gradually changing the distance that defines a stopped return
yard_list = np.arange(0.5, 15, 0.5)
model_list = []

for yard_index in yard_list:
    y_train = (df_train['return_yard'] <= yard_index) * 1

    # Split into training and test data
    y_test = y_train[df_train['gameId'] >= 2018120000]
    y_train = y_train[df_train['gameId'] < 2018120000]

    # Fit the model
    lr = LogisticRegression(solver='liblinear', penalty='l1')
    lr.fit(X_train, y_train)
    y_train_pred = lr.predict(X_train)
    y_test_pred = lr.predict(X_test)
    model_list.append(lr)

### Output Example
As an example, we predict how far the returner will advance from a given frame.
In the first figure, the horizontal axis represents the distance within which the returner is stopped and the vertical axis represents the corresponding probability.
The second figure shows the players' positions in that frame. The blue dots represent the kicking team (`offense` in the code), and the orange dots represent the returning team (`diffense` in the code).

In [None]:
X_sample = X_test.iloc[240:241,:]
prob_list = []
for lr in model_list:
    prob = lr.predict_proba(X_sample)[:, 1] # predicted probability that the target is 1
    prob_list.append(prob[0])
plt.scatter(yard_list, prob_list)
plt.plot(yard_list, prob_list)
plt.ylim(0,1)
plt.show()

In [None]:
X_sample = df_train.loc[X_test.index[240]]    # same row as the test sample X_test.iloc[240:241] used for the probability plot
gameId = X_sample['gameId']
playId = X_sample['playId']
frameId = X_sample['frameId']

df_temp = track_data[(track_data['gameId']==gameId)&(track_data['playId']==playId)&(track_data['frameId']==frameId)]
df_temp1 = df_temp[df_temp['team']=='offense']
df_temp2 = df_temp[df_temp['team']=='diffense']

plt.scatter(df_temp1['x'], df_temp1['y'], alpha=0.9, label='offense')
plt.scatter(df_temp2['x'], df_temp2['y'], alpha=0.9, label='diffense')
plt.ylim(0,53.3)
plt.legend()
plt.show()