This is a series of kernels for the NFL Punt Analytics competition:
1. [NFL Punt: Game Mechanics](https://www.kaggle.com/argentium/nfl-punt-game-mechanics)
2. [Group Dynamics](https://www.kaggle.com/argentium/nfl-punt-group-dynamics)
3. [Penalties](https://www.kaggle.com/argentium/nfl-punt-penalties)

In the previous kernel, I identified the following key findings:

1. Coverage Formation Blocking
2. Punt Returner Collisions
3. Gunner Collisions
4. Block Opposition
5. Friendly-Fire Collisions (Without the Gunner)

#### Note: This kernel uses metric units (unless specified otherwise).
1. Distances are in meters.
2. Speeds are in kilometers per hour (kph).

In [None]:
import os
import re
import pandas as pd
import numpy as np
import seaborn as sns

import scipy
import math
from matplotlib import pyplot as plt

import statsmodels.api as sm
from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings("ignore")

In [None]:
df_play_info = pd.read_csv('../input/play_information.csv')
df_injury = pd.read_csv('../input/video_review.csv')
df_punt_role = pd.read_csv('../input/play_player_role_data.csv')

In [None]:
team_positions = {'Return': 
                  ['VR', 'VRo', 'VRi', 
                   'VL', 'VLo', 'VLi',
                   'PDR1', 'PDR2', 'PDR3', 'PDR4', 'PDR5', 'PDR6',
                   'PDM',
                   'PDL1', 'PDL2', 'PDL3', 'PDL4', 'PDL5', 'PDL6',
                   'PLR', 'PLR1', 'PLR2', 'PLR3',
                   'PLM', 'PLM1',
                   'PLL', 'PLL1', 'PLL2', 'PLL3', 'PLLi',
                   'PR', 'PFB'
                   ],
     'Coverage': ['GR', 'GRo', 'GRi',
                  'GL', 'GLo', 'GLi',
                  'PRG', 'PRT', 'PRW',
                  'PPR', 'PPRo', 'PPRi', 
                  'PPL', 'PPLo', 'PPLi',
                  'P', 'PC', 'PLS',
                  'PLW', 'PLT', 'PLG'
                  ]}

role_categories = {'G': ['GR', 'GRo', 'GRi',
                        'GL', 'GLo', 'GLi'],
                      'Coverage_Center': ['PRG', 'PLG', 'PRT', 'PLT', 'PRW', 'PLW'],
                  'PP': ['PPR', 'PPRo', 'PPRi',
                         'PPL', 'PPLo', 'PPLi'],
                  'P': ['P'],
                  'PC': ['PC'],
                  'PLS': ['PLS'],
                    'V': ['VR', 'VRo', 'VRi',
                        'VL', 'VLo', 'VLi'],
                  'PD': ['PDR1', 'PDR2', 'PDR3', 'PDR4', 'PDR5', 'PDR6',
                          'PDM',
                         'PDL1', 'PDL2', 'PDL3', 'PDL4', 'PDL5', 'PDL6'],
                  'PL': ['PLR', 'PLR1', 'PLR2', 'PLR3',
                         'PLM', 'PLM1',
                         'PLL', 'PLL1', 'PLL2', 'PLL3', 'PLLi'],
                  'PR': ['PR'],
                  'PFB': ['PFB']
                 }

sides = {'Right': ['GR', 'GRo', 'GRi', 'PRG', 'PRT', 'PRW',
                  'PPR', 'PPRo', 'PPRi',
                   'PDR1', 'PDR2', 'PDR3', 'PDR4', 'PDR5', 'PDR6',
                   'PLR', 'PLR1', 'PLR2', 'PLR3',
                  'VR', 'VRo', 'VRi'],
         'Left': ['GL', 'GLo', 'GLi', 'PLG', 'PLT', 'PLW',
                 'PPL', 'PPLo', 'PPLi',
                  'PDL1', 'PDL2', 'PDL3', 'PDL4', 'PDL5', 'PDL6',
                  'PLL', 'PLL1', 'PLL2', 'PLL3', 'PLLi',
                 'VL', 'VLo', 'VLi'],
         'Center': ['P', 'PC', 'PLS', 'PDM', 'PLM', 'PLM1', 'PR', 'PFB']
                 }

# Return the category (team, side, or role group) whose list contains the given role
def set_category(role, dictionary):
    for category in dictionary.keys():
        if str(role) in dictionary[category]:
            return str(category)
    return None

df_punt_role['Team'] = df_punt_role.apply(lambda row: 
                                          set_category(row['Role'], team_positions), axis=1)
df_punt_role['Side'] = df_punt_role.apply(lambda row: 
                                          set_category(row['Role'], sides), axis=1)
df_punt_role['Role_Category'] = df_punt_role.apply(lambda row: 
                                                   set_category(row['Role'], role_categories),
                                                   axis=1)
df_punt_role = df_punt_role.drop(columns=['Season_Year'])

In [None]:
def get_goal(activity):
    if (activity == 'Blocking') or (activity == 'Tackled'):
        return 'Offensive'
    else:
        return 'Defensive'

# Derive the play phase from the player's team and the goal of his activity
def set_phase(row):
    goal = get_goal(row['Player_Activity_Derived'])
    if row['Team'] == 'Coverage':
        if goal == 'Offensive':
            return 1
        else:
            return 2
    else: # Return Team
        if goal == 'Offensive':
            return 2
        else:
            return 1

# Convert to int data type
df_injury['Primary_Partner_GSISID'] = df_injury.apply(lambda row: 
                                                                  row['Primary_Partner_GSISID'] 
                                                                  if (row['Primary_Partner_GSISID'] != 'Unclear')
                                                                 else 0,
                                                                 axis=1)
df_injury['Primary_Partner_GSISID'] = df_injury['Primary_Partner_GSISID'].fillna(0)
df_injury['Primary_Partner_GSISID'] = df_injury['Primary_Partner_GSISID'].astype(int)

# Merge with df_punt_role
df_injury = df_injury.merge(df_punt_role,
                                right_on=['GameKey', 'PlayID', 'GSISID'],
                                 left_on=['GameKey', 'PlayID', 'GSISID'],
                           how='left')
df_injury = df_injury.merge(df_punt_role,
                                right_on=['GameKey', 'PlayID', 'GSISID'],
                                 left_on=['GameKey', 'PlayID', 'Primary_Partner_GSISID'],
                            suffixes=('', '_Partner'),
                           how='left')
df_injury['Phase'] = df_injury.apply(lambda row: 
                                                set_phase(row), 
                                                axis=1)

In [None]:
df_yardline = df_play_info['YardLine'].str.split(" ", n = 1, expand = True)
df_play_info['yard_team'] = df_yardline[0]
df_play_info['yard_number'] = df_yardline[1].astype(float)

# Process Team Sides
df_home_visit = df_play_info['Home_Team_Visit_Team'].str.split("-", n = 1, expand = True)
df_play_info['home'] = df_home_visit[0]
df_play_info['visit'] = df_home_visit[1]

# Convert to coordinate system, origin at goal line
def convert_yardage(row):
    actual_yards = row['yard_number'] + 10
    if row['yard_team'] == row['home']:
        return actual_yards
    else:
        return 120 - actual_yards

# Convert to goal line distance
def convert_goal_distance(row):
    if row.loc[('Poss_Team')] == row.loc[('home')]:
        return row.loc[('Scrimmage_Line')]
    else:
        return 120 - row.loc[('Scrimmage_Line')]

df_play_info['Home_Poss'] = df_play_info.apply(lambda row: row.loc[('Poss_Team')] == row.loc[('home')], axis=1)
df_play_info['Scrimmage_Line'] = df_play_info.apply(lambda row: convert_yardage(row), axis=1)
df_play_info['Goal_Line'] = df_play_info.apply(lambda row: convert_goal_distance(row), axis=1)

# Convert to meters
df_play_info['Scrimmage_Line'] = 0.9144 * df_play_info['Scrimmage_Line']
df_play_info['Goal_Line'] = 0.9144 * df_play_info['Goal_Line']

In [None]:
# Graph
def graph_distribution(column):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})
    mean = column.mean()
    median = column.median()
    # Series.get_values() was removed in pandas 1.0; take the first mode positionally
    mode = column.mode().iloc[0]

    sns.boxplot(x=column, ax=ax_box)
    ax_box.axvline(mean, color='r', linestyle='--')
    ax_box.axvline(median, color='g', linestyle='-')
    ax_box.axvline(mode, color='b', linestyle='-')

    sns.distplot(column, ax=ax_hist)
    ax_hist.axvline(mean, color='r', linestyle='--')
    ax_hist.axvline(median, color='g', linestyle='-')
    ax_hist.axvline(mode, color='b', linestyle='-')

    plt.legend({'Mean':mean,'Median':median,'Mode':mode})

    ax_box.set(xlabel='')
    plt.show()

The data pre-processing above is carried over from the Game Mechanics kernel. The new code starts below.

## Statistics 101:

### ANOVA (Analysis of Variance)
ANOVA tests whether the means of two or more groups differ by more than the variation within the groups would explain. It is used to test a hypothesis, as described in [Hypothesis Testing](https://www.analyticsvidhya.com/blog/2015/09/hypothesis-testing-explained/).

In layman's terms, it checks whether the results from, for example, a control group and an experimental group are really different. Two values in the ANOVA table are worth noting:

- F value
> The F-value is the variance between groups divided by the variance within groups. In simple terms, a high F-value indicates that the groups have distinct characteristics that separate them from one another.
- PR(>F)
> The p-value is the probability of getting an F-value at least this large if the group means were actually equal. A value below the 0.05 significance level is treated as evidence that the groups differ.
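
To make the two measures concrete, here is a minimal sketch that computes the F-value and PR(>F) by hand for a one-way ANOVA. The three groups are made-up numbers, not NFL data, and the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

# Made-up measurements for three groups (illustrative only, not from the NFL data)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F = (variance between groups) / (variance within groups)
f_value = (ss_between / df_between) / (ss_within / df_within)
# PR(>F): probability of an F-value at least this large if all group means were equal
p_value = stats.f.sf(f_value, df_between, df_within)

print(f_value, p_value)
```

The `anova_lm` tables printed later in this kernel report the same two quantities in their `F` and `PR(>F)` columns.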

### Correlation Test
A correlation test indicates whether two variables are related.
This kernel uses two correlation tests:
1. Pearson test: measures the strength of a linear relationship between variables.
2. Spearman test: measures the strength of a monotonic relationship, which may be non-linear, by comparing ranks instead of raw values.

Both tests yield values ranging from -1 to 1.
The values have the following meanings:
- -1 = perfect negative correlation
-  0 = no correlation
- +1 = perfect positive correlation

A negative correlation means that as one variable increases, the other decreases, and vice versa.
A positive correlation means that both variables move in the same direction (both increasing or both decreasing).

The absolute value of the result indicates the strength of the relationship: the farther the value is from 0, the stronger the relationship.
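
The difference between the two tests can be sketched on made-up data where one variable grows monotonically but not linearly with the other. The `ranks` helper is hypothetical and assumes there are no tied values. It shows that Spearman is simply Pearson applied to ranks.

```python
import numpy as np

# Made-up data: y grows monotonically but not linearly with x
x = np.arange(1, 11, dtype=float)
y = x ** 3

def ranks(values):
    # Rank of each value (1 = smallest); valid only when there are no ties
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

# Pearson works on the raw values, so the curvature lowers the coefficient
pearson_r = np.corrcoef(x, y)[0, 1]
# Spearman is Pearson on the ranks, so any strictly increasing relation gives 1
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(pearson_r, spearman_rho)
```

`scipy.stats.pearsonr` and `scipy.stats.spearmanr`, used later in this kernel, return these coefficients together with a p-value.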

----
### NGS Data

For the data heavy processing, we use the code from [How to Import Large CSV Files](https://www.kaggle.com/akosciansky/how-to-import-large-csv-files-and-save-efficiently)

In [None]:
import gc
import tqdm
import feather

dtypes = {'Season_Year': 'int16',
         'GameKey': 'int64',
         'PlayID': 'int16',
         'GSISID': 'float32',
         'Time': 'str',
         'x': 'float32',
         'y': 'float32',
         'dis': 'float32',
         'o': 'float32',
         'dir': 'float32',
         'Event': 'str'}
col_names = list(dtypes.keys())

ngs_files = ['NGS-2016-pre.csv',
             'NGS-2016-reg-wk1-6.csv',
             'NGS-2016-reg-wk7-12.csv',
             'NGS-2016-reg-wk13-17.csv',
             'NGS-2016-post.csv',
             'NGS-2017-pre.csv',
             'NGS-2017-reg-wk1-6.csv',
             'NGS-2017-reg-wk7-12.csv',
             'NGS-2017-reg-wk13-17.csv',
             'NGS-2017-post.csv']

# Load each ngs file and append it to a list. 
# We will turn this into a DataFrame in the next step

df_list = []

for i in tqdm.tqdm(ngs_files):
    df = pd.read_csv('../input/'+i, usecols=col_names,dtype=dtypes)
    
    df_list.append(df)

# Merge all dataframes into one dataframe
ngs = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
gc.collect()

# Convert Time to datetime
ngs['Time'] = pd.to_datetime(ngs['Time'], format='%Y-%m-%d %H:%M:%S')

# There are 2536 out of 66,492,490 cases where GSISID is NAN. Let's drop those to convert the data type
ngs = ngs[~ngs['GSISID'].isna()]

# Convert GSISID to integer
ngs['GSISID'] = ngs['GSISID'].astype('int32')

# ngs.set_index(['GameKey', 'PlayID', 'GSISID'], inplace=True)

# I. Data Pre-processing

To save memory, I extracted only the plays with injuries.

- Assumption: The injury data is a sufficient sample, and there is little or no difference between the injury plays and the non-injury plays.
- Sidenote: I compared the timing of some important events in both sets of plays and found no significant difference between the two.

In [None]:
# Get Injury Moves
# Ensure same data types
columns = ['GameKey', 'PlayID', 'GSISID']
for col in columns:
    df_injury[col] = df_injury[col].astype(ngs[col].dtype)
    df_punt_role[col] = df_punt_role[col].astype(ngs[col].dtype)

df_injury['Primary_Partner_GSISID'] = df_injury['Primary_Partner_GSISID'].astype(ngs['GSISID'].dtype)

# Keep only the plays with injuries
df_injury_moves = ngs.merge(df_injury[['GameKey', 'PlayID']],
                                  left_on=['GameKey', 'PlayID'],
                                  right_on=['GameKey', 'PlayID'],
                                 suffixes=('', '_Injury'))

# # Get All Moves of PR
# df_pr = df_punt_role[df_punt_role['Role']=='PR']
# df_pr_moves = ngs.merge(df_pr[['GameKey', 'PlayID', 'GSISID']],
#                                   left_on=['GameKey', 'PlayID', 'GSISID'],
#                                   right_on=['GameKey', 'PlayID', 'GSISID'])

# Delete the full NGS dataframe to release memory
del ngs
gc.collect()

## A. Basic Info
I categorized the movements according to the player roles.

In [None]:
# I added the role for easier categorization
df_injury_moves = df_injury_moves.merge(df_punt_role,
                                  left_on=['GameKey', 'PlayID', 'GSISID'],
                                  right_on=['GameKey', 'PlayID', 'GSISID'],
                                 suffixes=('', '_Injury'))

# Create Gameplay ID
df_injury_moves['Gameplay'] = df_injury_moves.apply(lambda row: 
                                            str(row['GameKey']) + '_' + 
                                            str(row['PlayID']),
                                           axis=1)

### 1. Distance (Conversion of yards to meters)

In [None]:
df_injury_moves['x'] = 0.9144 * df_injury_moves['x']
df_injury_moves['y'] = 0.9144 * df_injury_moves['y']
df_injury_moves['dis'] = 0.9144 * df_injury_moves['dis']

### 2. Time (Time elapsed since the start of the play)
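
Elapsed time is measured from the `ball_snap` event of each play. As a possible alternative to a row-by-row lookup, the sketch below attaches the snap time to every row with a merge and subtracts it in one step. The data is made up, and `SnapTime` is a hypothetical helper column that does not exist in the original data.

```python
import pandas as pd

# Made-up tracking rows for one play; column names follow the NGS data
moves = pd.DataFrame({
    'GameKey': [1, 1, 1, 1],
    'PlayID': [10, 10, 10, 10],
    'Time': pd.to_datetime(['2016-09-01 20:00:00.0',
                            '2016-09-01 20:00:00.5',
                            '2016-09-01 20:00:01.0',
                            '2016-09-01 20:00:02.5']),
    'Event': [None, 'ball_snap', None, 'punt'],
})

# One snap time per play (SnapTime is a hypothetical helper column)
snap_times = (moves[moves['Event'] == 'ball_snap']
              .groupby(['GameKey', 'PlayID'])['Time'].min()
              .rename('SnapTime')
              .reset_index())

# Attach the snap time to every row of the same play, then subtract
moves = moves.merge(snap_times, on=['GameKey', 'PlayID'], how='left')
moves['PlayStartTime'] = (moves['Time'] - moves['SnapTime']).dt.total_seconds()

print(moves[['Event', 'PlayStartTime']])
```

Rows recorded before the snap get a negative `PlayStartTime`, which is why the cell below keeps only values greater than or equal to 0.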

In [None]:
# Get only events
df_events = df_injury_moves.dropna(subset=['Event'])
df_events_indexed = df_events.set_index(['GameKey', 'PlayID', 'Event'])

# Get event
df_injury_moves[['Event']] = df_injury_moves[['Event']].fillna('')
df_injury_moves_gameplay = df_injury_moves.groupby(['GameKey', 'PlayID']).agg({'Time': ['min', 'max'],
                                                                               'dis': ['min', 'median', 'max', 'sum'],
                                                                               'dir': ['min', 'median', 'max'],
                                                                               'o': ['min', 'median', 'max']})

# Get start time by subtracting the time at play start in the specified GameKey, PlayID
def get_start_time(row):
    start = df_events_indexed.loc[(row['GameKey'], row['PlayID'], 'ball_snap')]['Time'][0]
    end = row['Time']
    return (end - start).total_seconds()

df_injury_moves['PlayStartTime'] = df_injury_moves.apply(lambda row: 
                                                         get_start_time(row),
                                                         axis=1)
df_injury_moves = df_injury_moves[df_injury_moves['PlayStartTime'] >= 0]
df_injury_moves['Sequence'] = df_injury_moves.apply(lambda row: 
                                                    str(int(row['PlayStartTime']*10)), 
                                                    axis=1)
df_injury_moves['Second'] = df_injury_moves.apply(lambda row: 
                                                    str(int(row['PlayStartTime'])), 
                                                    axis=1)

### 3. Speed (in kilometers per hour or kph)
I computed the speed from the distance covered between consecutive tracking samples (`dis`) and the time between them (0.1 seconds in the code below).
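
The conversion is plain arithmetic: meters per sample divided by the sample interval gives meters per second, and multiplying by 3.6 gives kph. A minimal vectorized sketch on made-up distances follows. The constant names are hypothetical, and the 0.1-second interval is taken from `get_speed`.

```python
import pandas as pd

SAMPLE_INTERVAL_S = 0.1   # assumed time between tracking samples, as in get_speed
MPS_TO_KPH = 3.6          # 1 m/s = 3.6 km/h

# Made-up distances (in meters) covered between consecutive samples
moves = pd.DataFrame({'dis': [0.0, 0.5, 1.0]})

# Column-wise arithmetic avoids calling a Python function once per row
moves['kph'] = moves['dis'] / SAMPLE_INTERVAL_S * MPS_TO_KPH

print(moves)
```

For example, a player who covers 1 meter in one sample is moving at 10 m/s, or 36 kph.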

In [None]:
def get_speed(row):
    meters_per_sec = row['dis'] / 0.1
    kph = 3.6 * meters_per_sec
    return kph

df_injury_moves = df_injury_moves.sort_values(by=['GameKey', 'PlayID', 'GSISID', 'PlayStartTime'])
df_injury_moves['kph'] = df_injury_moves.apply(lambda row: get_speed(row), axis=1)

### 4. PR Distance
Since the PR plays a central role in the punt play, I computed each player's distance from the PR at every time step.
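
The distance is the Euclidean distance between a player and the PR at the same time step of the same play. The sketch below shows one way to compute it without a per-row lookup: merge the PR's coordinates onto every row, then apply the Pythagorean formula. The data is made up, and `x_PR` and `y_PR` are hypothetical helper columns.

```python
import numpy as np
import pandas as pd

# Made-up positions (meters) at one time step of one play
moves = pd.DataFrame({
    'GameKey': [1, 1, 1],
    'PlayID': [10, 10, 10],
    'Sequence': ['5', '5', '5'],
    'Role': ['PR', 'GL', 'PLS'],
    'x': [10.0, 13.0, 10.0],
    'y': [20.0, 24.0, 32.0],
})

keys = ['GameKey', 'PlayID', 'Sequence']

# PR coordinates per play and time step (x_PR / y_PR are hypothetical names)
pr = (moves.loc[moves['Role'] == 'PR', keys + ['x', 'y']]
      .rename(columns={'x': 'x_PR', 'y': 'y_PR'}))

# Attach the PR position to every player row of the same time step
moves = moves.merge(pr, on=keys, how='left')
moves['PR_X'] = (moves['x'] - moves['x_PR']).abs()
moves['PR_Y'] = (moves['y'] - moves['y_PR']).abs()
moves['PR_Distance'] = np.hypot(moves['PR_X'], moves['PR_Y'])

print(moves[['Role', 'PR_X', 'PR_Y', 'PR_Distance']])
```

This assumes exactly one PR row per play and time step. Plays without a PR row would get missing distances instead of raising an error.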

In [None]:
# Get the PR Moves
df_pr_moves = df_injury_moves[df_injury_moves['Role']=='PR']
df_pr_moves_indexed = df_pr_moves.set_index(['GameKey', 'PlayID', 'Sequence'])

# Compute each distance from the PR
def get_axial_distance(row, axis):
    try:
        coordinates = df_pr_moves_indexed.loc[(row['GameKey'], row['PlayID'], row['Sequence'])]
        return abs(coordinates[axis] - row[axis])
    except:
        return None

def get_distance(row):
    try:
#         coordinates = df_pr_moves.loc[(row['GameKey'], row['PlayID'], row['Sequence'])]
#         return math.sqrt(pow(coordinates['x'] - row['x'], 2) + pow(coordinates['y'] - row['y'], 2))
        return math.sqrt(pow(row['PR_X'], 2) + pow(row['PR_Y'], 2))
    except:
        return None

df_injury_moves['PR_X'] = df_injury_moves.apply(lambda row: 
                                                get_axial_distance(row, 'x'), 
                                                axis=1)
df_injury_moves['PR_Y'] = df_injury_moves.apply(lambda row: 
                                                get_axial_distance(row, 'y'), 
                                                axis=1)
df_injury_moves['PR_Distance'] = df_injury_moves.apply(lambda row: 
                                                       get_distance(row), 
                                                       axis=1)

### 5. Collision Time
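
The collision time is taken as the moment when the injured player and his primary partner are closest to each other. The idea can be sketched on made-up coordinates: compute the pair distance at each time step, then pick the row with the smallest distance in each play.

```python
import numpy as np
import pandas as pd

# Made-up positions (meters) of an injured player and his partner over time
pairs = pd.DataFrame({
    'GameKey': [1, 1, 1, 1],
    'PlayID': [10, 10, 10, 10],
    'PlayStartTime': [0.0, 0.1, 0.2, 0.3],
    'x_Player': [0.0, 1.0, 2.0, 3.0],
    'y_Player': [0.0, 0.0, 0.0, 0.0],
    'x_Partner': [6.0, 4.0, 2.5, 5.0],
    'y_Partner': [0.0, 0.0, 0.0, 0.0],
})

# Euclidean distance between the two players at each time step
pairs['PairDistance'] = np.hypot(pairs['x_Player'] - pairs['x_Partner'],
                                 pairs['y_Player'] - pairs['y_Partner'])

# Row label of the smallest distance within each play
closest_rows = pairs.groupby(['GameKey', 'PlayID'])['PairDistance'].idxmin()
collision_point = pairs.loc[closest_rows]

print(collision_point[['PlayStartTime', 'PairDistance']])
```

The minimum distance is only a proxy for the actual moment of contact, so the result should be read as an estimate.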

In [None]:
def get_distance(row):
    return math.sqrt(pow(row['x_Player'] - row['x_Partner'], 2) + 
                     pow(row['y_Player'] - row['y_Partner'], 2))


df_injury_moves_player = df_injury_moves.merge(df_injury[['GameKey', 'PlayID', 'GSISID']],
                                                left_on=['GameKey', 'PlayID', 'GSISID'],
                                                 right_on=['GameKey', 'PlayID', 'GSISID'],
                                                 how='right')
df_injury_moves_partner = df_injury_moves.merge(df_injury[['GameKey', 'PlayID', 
                                                          'Primary_Partner_GSISID']],
                                                left_on=['GameKey', 'PlayID', 'GSISID'],
                                                 right_on=['GameKey', 'PlayID', 'Primary_Partner_GSISID'],
                                                 how='right')

# Put it side-by-side in a row
df_involved_pairs = df_injury_moves_player.merge(df_injury_moves_partner,
                                                left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                right_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                suffixes=('_Player', '_Partner'))

# Compute distance of pairs
df_involved_pairs['PairDistance'] = df_involved_pairs.apply(lambda row:
                                                            get_distance(row),
                                                            axis=1)

# Collision point at minimum pair distance
df_min_distances = df_involved_pairs.groupby(["GameKey", "PlayID"])['PairDistance'].idxmin()
df_collision_point = df_involved_pairs.loc[df_min_distances]

# II. Analysis

## A. Coverage Formation Blocking

This is better discussed in the [Collision Pairs](https://www.kaggle.com/argentium/nfl-punt-collision-pairs) kernel.

## B. Punt Returner Collisions

For a better understanding of time-related events, I checked the median times of certain landmark events during a punt in the injury plays. These events are the following:
1. Punt
2. Punt Received
3. Tackle

### What is the range of the time events since the start of the play?

In [None]:
events = ['punt', 'punt_received', 'tackle']
events_list = []
for event in events:
    df_event = df_injury_moves[df_injury_moves['Event']==event]
    events_list.append(df_event)

df_events = pd.concat(events_list)

# ANOVA
mod = ols('PlayStartTime ~ Event',
            data=df_events).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

# Graph
ax = sns.boxplot(x="PlayStartTime", y="Event", data=df_events)
ax.set_title('Timeline')

### What is the trend of the Coverage team's distance from the PR across time?

In [None]:
df_injury_moves_coverage = df_injury_moves[(df_injury_moves['Team']=='Coverage') & 
                                          (df_injury_moves['Role']!='P')]
df_injury_moves_coverage = df_injury_moves_coverage[df_injury_moves_coverage['PlayStartTime'] < 15]

df_injury_moves_coverage = df_injury_moves_coverage.dropna()

# Pearson for linear correlation
output = scipy.stats.pearsonr(df_injury_moves_coverage['PlayStartTime'], 
                    df_injury_moves_coverage['PR_Distance'])
print(output)

# Spearman for non-linear correlation
output = scipy.stats.spearmanr(df_injury_moves_coverage['PlayStartTime'], 
                    df_injury_moves_coverage['PR_Distance'])
print(output)

# Graph
sns.jointplot(x="PlayStartTime", y="PR_Distance", kind='hex', data=df_injury_moves_coverage)

Both the Pearson and Spearman tests show a strong negative correlation between PR Distance and Time Elapsed.

This means that the Coverage team as a whole is generally moving towards the Punt Returner as the play progresses.

### How long was the ball's hang time?


In [None]:
def get_time_diff(row, start_event, end_event):
    try:
        start = df_events_indexed.loc[(row['GameKey'], row['PlayID'], start_event)]['Time'][0]
        end = df_events_indexed.loc[(row['GameKey'], row['PlayID'], end_event)]['Time'][0]
        return (end - start).total_seconds()
    except:
        return None

df_injury['waiting_time'] = df_injury.apply(lambda row: 
                                                   get_time_diff(row, 'punt', 'punt_received'),
                                                  axis=1)
graph_distribution(df_injury['waiting_time'].dropna())
df_injury['waiting_time'].describe()


### Which team was closer when the PR received the ball?

In [None]:
df_injury_moves_received = df_injury_moves[(df_injury_moves['Event']=='punt_received')]
df_injury_moves_received_nostars = df_injury_moves_received[(df_injury_moves_received['Role']!='PR') &
                                                           (df_injury_moves_received['Role']!='P')]

# Graph
ax = sns.boxplot(x='PR_Distance', 
                 y='Team',
                 data=df_injury_moves_received_nostars)
ax.set_title("Team Distance from PR\nwhen Punt is Received")

# ANOVA
mod = ols('PR_Distance ~ Team',
            data=df_injury_moves_received_nostars).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

The Coverage team is generally closer to the PR than the Return team. Although the PR(>F) value is below 0.05, it is very close to that significance level, so the evidence for this difference is not strong.

### How far was the coverage team when the PR received the ball?

In [None]:
df_injury_moves_received_nostars_coverage = df_injury_moves_received_nostars[df_injury_moves_received_nostars['Team']=='Coverage']
graph_distribution(df_injury_moves_received_nostars_coverage['PR_Distance'].dropna())
df_injury_moves_received_nostars_coverage['PR_Distance'].describe()

Half of the Coverage team was around 23 meters from the PR when the PR received the ball.

### How fast was the coverage team when the PR received the ball?

In [None]:
graph_distribution(df_injury_moves_received_nostars_coverage['kph'].dropna())
df_injury_moves_received_nostars_coverage['kph'].describe()

The Coverage team was running fast, at around 28 kph, when the ball was received.

### How soon after the PR received the ball does the tackle occur?

In [None]:
def get_time_diff(row, start_event, end_event):
    try:
        start = df_events_indexed.loc[(row['GameKey'], row['PlayID'], start_event)]['Time'][0]
        end = df_events_indexed.loc[(row['GameKey'], row['PlayID'], end_event)]['Time'][0]
        return (end - start).total_seconds()
    except:
        return None

df_injury['reaction_time'] = df_injury.apply(lambda row: 
                                                   get_time_diff(row, 'punt_received', 'tackle'),
                                                  axis=1)
graph_distribution(df_injury['reaction_time'].dropna())
df_injury['reaction_time'].describe()

The ball stays in the Punt Returner's hands for only around 3-4 seconds before he gets tackled. The average human reaction time to a visual stimulus is 0.25 seconds ([Reaction Time](https://backyardbrains.com/experiments/reactiontime)). However, that assumes the PR is paying attention to the stimulus from the start. If the PR is focused on the ball, he must react quickly once he catches it.

Considering the distance and speed of the Coverage team when the punt is received, it is no surprise that the PR gets tackled so soon. Unfortunately for the Punt Returner, his team was behind the Coverage team when the punt was received, so there was not much protection between him and the Coverage team.
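
As a rough check on this reasoning, the sketch below estimates how long a Coverage player would need to reach the PR, using the approximate figures reported above (23 meters and 28 kph). Constant speed and a straight path are simplifying assumptions, so this is only a back-of-the-envelope estimate.

```python
# Figures quoted in this section; constant speed and a straight path are assumptions
distance_m = 23.0   # approximate Coverage distance from the PR at the catch
speed_kph = 28.0    # approximate Coverage speed at the catch

# Convert kph to meters per second, then divide distance by speed
speed_mps = speed_kph / 3.6
seconds_to_reach_pr = distance_m / speed_mps

print(round(seconds_to_reach_pr, 2))
```

The estimate comes out at roughly 3 seconds, which is in line with the 3-4 seconds observed between the catch and the tackle.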

### How fast was the PR when he got tackled?

In [None]:
df_moves_pr = df_injury_moves[df_injury_moves['Role']=='PR']
df_moves_pr_tackle = df_moves_pr[df_moves_pr['Event']=='tackle']

graph_distribution(df_moves_pr_tackle['kph'].dropna())
df_moves_pr_tackle['kph'].describe()

The PR was moving at only around 9 kph when he got tackled. He had not yet built up enough speed because he had just caught the ball.

### What were the teams' speeds when the PR got tackled?

In [None]:
df_injury_moves_received = df_injury_moves[(df_injury_moves['Event']=='tackle')]
df_injury_moves_received_nostars = df_injury_moves_received[(df_injury_moves_received['Role']!='PR') &
                                                           (df_injury_moves_received['Role']!='P')]

# Graph
ax = sns.boxplot(x='kph', 
                 y='Team',
                 data=df_injury_moves_received_nostars)
ax.set_title("Team Speed\nwhen a Tackle Occurred")

# ANOVA
mod = ols('kph ~ Team',
            data=df_injury_moves_received_nostars).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

The Coverage team was faster than the Return team when the PR got tackled.

### How far were the teams when the PR got tackled?

In [None]:
df_injury_moves_received = df_injury_moves[(df_injury_moves['Event']=='tackle')]
df_injury_moves_received_nostars = df_injury_moves_received[(df_injury_moves_received['Role']!='PR') &
                                                           (df_injury_moves_received['Role']!='P')]

# Graph
ax = sns.boxplot(x='PR_Distance', 
                 y='Team',
                 data=df_injury_moves_received_nostars)
ax.set_title("Team Distance from PR\nwhen a Tackle Occurred")

# ANOVA
mod = ols('PR_Distance ~ Team',
            data=df_injury_moves_received_nostars).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

The PR was already surrounded by the coverage team when he got tackled.

### How far was the crowd when the PR got tackled?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_opponent = df_injury_phase2[df_injury_phase2['Friendly_Fire']=='No']
df_injury_phase2_opponent_tackled = df_injury_phase2_opponent[df_injury_phase2_opponent['Player_Activity_Derived']=='Tackled']

# Get only the tackle collisions
df_collision_tackled = df_collision_point.merge(df_injury_phase2_opponent_tackled[['GameKey',
                                                   'PlayID']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                   how='left')

df_collision_tackled_indexed = df_collision_tackled.set_index(['GameKey', 'PlayID', 'PlayStartTime'])

# Get all players at the moment of the tackle collision
df_collision_tackled_moves = df_injury_moves.merge(df_collision_tackled,
                                        left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                        right_on=['GameKey', 'PlayID', 'PlayStartTime'])
df_collision_tackled_moves = df_collision_tackled_moves[(df_collision_tackled_moves['Role']!='PR') &
                                                       (df_collision_tackled_moves['Role']!='P')]
graph_distribution(df_collision_tackled_moves['PR_Distance'])
df_collision_tackled_moves['PR_Distance'].describe()

In the tackle injuries, the crowd was already close to the PR when he got tackled. Around half of the other players were within 10 meters of the PR during the tackle.

### How far from the gridline where the ball was received did the PR get tackled?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_tackle = df_injury_phase2[(df_injury_phase2['Player_Activity_Derived']=='Tackling') |
                                         (df_injury_phase2['Player_Activity_Derived']=='Tackled')]
df_collision_point_info = df_collision_point.merge(df_injury_phase2_tackle[['GameKey', 'PlayID',
                                                                     'Player_Activity_Derived']],
                                                  left_on=['GameKey', 'PlayID'],
                                                  right_on=['GameKey', 'PlayID'],
                                                  how='right')

# Get the PR Moves
df_pr_moves = df_injury_moves[df_injury_moves['Role']=='PR']
df_pr_moves = df_pr_moves.set_index(['GameKey', 'PlayID', 'Event'])

df_collision_point_indexed = df_collision_point.set_index(['GameKey', 'PlayID'])
df_collision_point_indexed.head()

# Compute the x-axis distance between the catch location and the collision point
# TODO: punt_received, fumble, fair catch
def get_collision_distance_from_ball_landing(row):
    try:
        # fair_catch, punt_received,fumble
        ball_location = df_pr_moves.loc[(row['GameKey'], row['PlayID'], 'punt_received')]
        collision_location = df_collision_point_indexed.loc[(row['GameKey'], row['PlayID'])]
        return abs(ball_location['x'][0] - collision_location['x_Player'])
    except:
        return None

df_collision_point_info['Ball_X'] = df_collision_point_info.apply(lambda row: get_collision_distance_from_ball_landing(row), 
                                                       axis=1)

# Graph
graph_distribution(df_collision_point_info['Ball_X'].dropna())
df_collision_point_info['Ball_X'].describe()

The PR got tackled at a median of 6.2 meters (6.75 yards), measured along the x-axis, from the gridline where the ball was caught.

### What are the basic statistics of the tackle?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_opponent = df_injury_phase2[df_injury_phase2['Friendly_Fire']=='No']
df_injury_phase2_opponent_tackle = df_injury_phase2_opponent[(df_injury_phase2_opponent['Player_Activity_Derived']=='Tackled') |
                                            (df_injury_phase2_opponent['Player_Activity_Derived']=='Tackling')]

df_moves = df_injury_moves.merge(df_injury_phase2_opponent_tackle[['GameKey',
                                                   'PlayID']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                        how='right')
df_moves_collision = df_moves.merge(df_collision_point[['GameKey',
                                                   'PlayID', 'PlayStartTime']],
                                        left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                        right_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                        how='inner')
df_moves_collision.describe()
# df_injury_moves_received = df_injury_moves[(df_injury_moves['Event']=='tackle')]
# df_injury_moves_received_nostars = df_injury_moves_received[(df_injury_moves_received['Role']!='PR') &
#                                                            (df_injury_moves_received['Role']!='P')]

# # Graph
# df_injury_moves_received_nostars.describe()

The crowd has the following statistics:

- The tackles often occur at around 9 seconds since the ball snap.
- The speed of collision is around 12 kph.
- The involved players' distance from the PR is around 6.6 meters.

The data indicates that tackle injuries tend to occur at moderate speeds, when most of the players are close to the PR.

#### Finding:
1. The PR and the tackling player use the time differently. As a result, the PR lacks sufficient time to react to the oncoming tackler.
    1. The PR spends most of his time waiting for the ball.
    2. The tackling player uses the waiting time as a head start to run towards the PR.

## C. Gunner Collisions

### What are the fastest roles?

In [None]:
df_injury_moves_normal = df_injury_moves[(df_injury_moves['PlayStartTime'] < 15) &
                                        (df_injury_moves['kph'] < 50)]

# Ensure that there is only one instance of player per GameKey PlayID
df_injury_speed_max_roles = df_injury_moves_normal.groupby(['GameKey', 'PlayID', 'GSISID', 'Role_Category'])['kph'].max().reset_index()

df_medians = df_injury_speed_max_roles.groupby(['Role_Category'])['kph'].median().reset_index().sort_values('kph')

# Graph
ax = sns.boxplot(x='kph', y='Role_Category', 
            order = df_medians['Role_Category'].values,
            data = df_injury_speed_max_roles)
ax.set_title('Maximum Speeds (kph)')

# Table
df_injury_speed_roles_g = df_injury_speed_max_roles.groupby(['Role_Category']).agg({'kph':
                                                                            ['min', 'max',
                                                                            'median', 'mean']})
df_injury_speed_roles_g.head(40)

The Gunner (G), PP and V tend to have the highest median maximum speeds. In the 30-50 kph range, the Gunner reaches high speeds more often than the other roles, closely followed by the V position. Given the formation, their speeds make sense because they are away from the crowd and no one is blocking them. Unfortunately, their speeds increase the chances of injuries.

Although the PR is expected to run fast while carrying the ball, the maximum speed of the PR is spread out. This means there is little consistency in the maximum speeds reached by the PR.


### Are there sufficient differences between the maximum speeds per role?

In [None]:
# ANOVA
mod = ols('kph ~ Role_Category',
            data=df_injury_speed_max_roles).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

There is considerable variance between the speeds of the roles. The F value indicates that there is more variance between groups than within groups.
In simple terms, the speed ranges are distinct for each role category.
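To make the "between groups versus within groups" idea concrete, here is a minimal sketch that computes a one-way ANOVA F value by hand. The speeds in `toy_speeds` are made up for illustration and are not taken from the competition data; the helper name `one_way_anova_f` is also hypothetical. For a single categorical factor, this ratio should correspond to the F column printed by `anova_lm` above.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of {label: [values]}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups (role categories)
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: group means versus the grand mean
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # Within-group sum of squares: observations versus their own group mean
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical maximum speeds in kph (illustration only)
toy_speeds = {
    'G':  [30.0, 32.0, 34.0],
    'PR': [20.0, 25.0, 30.0],
    'PP': [22.0, 24.0, 26.0],
}
f_value = one_way_anova_f(toy_speeds)
print(round(f_value, 2))
```

A large F value means the group means are far apart relative to the spread inside each group, which is how the ANOVA tables in this kernel are read.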

### What were the speeds of the Gunner when the Gunner is injured?

In [None]:
df_collision_point_gunner = df_collision_point[df_collision_point['Role_Category_Player']=='G']
df_collision_point_gunner[['GameKey', 'PlayID', 'kph_Player', 'kph_Partner']].head(10)

In most cases, the Gunner is very fast (above 20 kph). Also, the Gunner is usually faster than his collision partner.

### What were the speeds of the Gunner as the partner role in the injuries?

In [None]:
df_collision_point_gunner = df_collision_point[df_collision_point['Role_Category_Partner']=='G']
df_collision_point_gunner[['GameKey', 'PlayID', 'kph_Player', 'kph_Partner']].head(10)

### How soon after the crowd's peak speeds did the collisions occur according to the injured role?

In [None]:
# Get time of min-max points
df_moves_agg = df_injury_moves.groupby(['GameKey','PlayID', 'GSISID']).agg({'kph': ['idxmin', 'idxmax']})
df_moves_agg['kph_Max'] = df_injury_moves.loc[df_moves_agg[('kph', 'idxmax')].values]['PlayStartTime'].values
df_moves_agg['kph_Min'] = df_injury_moves.loc[df_moves_agg[('kph', 'idxmin')].values]['PlayStartTime'].values

df_moves_agg = df_moves_agg.reset_index()

# Get Collision Time
df_moves_agg = df_moves_agg.merge(df_collision_point[['GameKey',
                                                   'PlayID',
                                                   'PlayStartTime']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                        how='right')

# Get Time difference of turning point to collision time
df_moves_agg['kph_Diff'] = df_moves_agg.apply(lambda row:
                                            row['PlayStartTime'] - row[('kph_Max', '')],
                                            axis=1)

# Get activity of involved injury
df_moves_agg = df_moves_agg.merge(df_injury[['GameKey','PlayID', 'GSISID', 'Role_Category',
                                             'Player_Activity_Derived', 
                                             'Primary_Impact_Type',
                                            'Primary_Partner_Activity_Derived']],
                                  left_on=['GameKey', 'PlayID'],
                                  right_on=['GameKey', 'PlayID'],
                                  how='right')

df_grouped = df_moves_agg.groupby(['Role_Category'])['kph_Diff'].median().reset_index()
df_grouped = df_grouped.sort_values('kph_Diff')

# Remove outliers for the graph
df_moves_agg = df_moves_agg[(df_moves_agg['kph_Diff'] > -10) &
                           (df_moves_agg['kph_Diff'] < 30)]
# Graph
ax = sns.boxplot(x='kph_Diff', y='Role_Category', 
                 order = df_grouped['Role_Category'],
            data=df_moves_agg)
ax.set_title('Time Since Peak Speed (seconds)')
df_grouped.head()

In the collisions where the Gunner got injured, the collisions often occurred less than 2 seconds after the crowd's peak speed. This means that the crowd was just starting to slow down when the Gunner got injured.
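The "time since peak speed" measure can be sketched on a tiny example. The frame `toy_moves` and the value of `collision_time` are hypothetical; only the column names (`GSISID`, `PlayStartTime`, `kph`) follow the ones used in this kernel.

```python
import pandas as pd

# Hypothetical tracking rows for two players in one play (illustration only)
toy_moves = pd.DataFrame({
    'GSISID':        [1, 1, 1, 1, 2, 2, 2, 2],
    'PlayStartTime': [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    'kph':           [5.0, 20.0, 28.0, 15.0, 4.0, 25.0, 18.0, 10.0],
})
collision_time = 3.0  # hypothetical collision timestamp, in seconds

# Row label of each player's fastest sample, then the time stored on that row
peak_rows = toy_moves.groupby('GSISID')['kph'].idxmax()
peak_times = toy_moves.loc[peak_rows].set_index('GSISID')['PlayStartTime']

# Positive values mean the collision happened after the player's peak speed
time_since_peak = collision_time - peak_times
print(time_since_peak)
```

A small positive value means the player had only just passed his top speed when the collision happened.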

### Is there sufficient variance in the crowd deceleration time depending on the role category of the injured?

In [None]:
# ANOVA
mod = ols('kph_Diff ~ Role_Category',
            data=df_moves_agg).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

The PR(>F) value is below 0.025, so there is sufficient variance between the role categories.

### What role is farthest from the PR at the formation?

In [None]:
# Get only the position at the start of the play
df_injury_moves_normal_start = df_injury_moves_normal[df_injury_moves_normal['PlayStartTime']==0]

df_injury_group = df_injury_moves_normal_start.groupby(['GameKey', 'PlayID', 'GSISID', 'Role_Category'])['PR_Distance'].max().reset_index()
df_injury_group = df_injury_group.sort_values('PR_Distance')

df_medians = df_injury_group.groupby(['Role_Category'])['PR_Distance'].median().reset_index().sort_values('PR_Distance')

# Graph
ax = sns.boxplot(x='PR_Distance', 
            y='Role_Category', 
            order = df_medians['Role_Category'].values,
            data = df_injury_group)
ax.set_title('PR Distance in formation (meters)')

1. Punter: 
    - The Punter is often the farthest role because he lines up 15 yards behind the line of scrimmage for protection before he punts. However, his job is mostly done after the kick, and he no longer has to tackle the PR (although he is still allowed to do so).
2. PP
    - The Punt Protector is close to the Punter because his main job is to protect the punter.
3. Gunner:
    - The Gunner is the third farthest role from the PR. The main role of the Gunner is to run fast and tackle the PR. Because he also starts among the farthest roles from the PR, he has to cover a much larger distance in a short amount of time to reach the PR.
    
### Is there sufficient variance between the maximum PR distances of the roles?

In [None]:
# ANOVA
mod = ols('PR_Distance ~ Role_Category',
            data=df_injury_group).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

Yes. There is sufficient variance to indicate that the Role Categories have distinct PR distances.

### What is the distance of the players from the border line in their respective formations?

In [None]:
width = 53.3

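# Work on a copy so the new column is not written to a slice of df_injury_moves_normal
df_injury_moves_normal_start = df_injury_moves_normal_start.copy()

# Distance to the nearer border line
df_injury_moves_normal_start['BorderDistance'] = np.minimum(width - df_injury_moves_normal_start['y'],
                                                            df_injury_moves_normal_start['y'])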

df_grouped = df_injury_moves_normal_start.groupby(['Role_Category'])['BorderDistance'].median().reset_index()
df_grouped = df_grouped.sort_values('BorderDistance')

# Graph
ax = sns.boxplot(x="BorderDistance", y="Role_Category", 
            order=df_grouped['Role_Category'],
            data=df_injury_moves_normal_start)
ax.set_title('Borderline Distance per Role Category')

# ANOVA
mod = ols('BorderDistance ~ Role_Category',
            data=df_injury_moves_normal_start).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

The Gunner is the player closest to the border. This also means that he is the farthest from the center of the action, which is around 26.65 yards from the border. The PR, by contrast, is much closer to the center line.

In simple terms, the Gunner must run a much longer distance (on the y axis) to reach the Punt Returner. Hence, he must be really fast to cover that distance.
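The link between "close to the border" and "far from the center" follows directly from the border-distance formula. A minimal sketch, assuming the same field width of 53.3 as the cell above; the helper names and the sample `y` values are hypothetical:

```python
FIELD_WIDTH = 53.3  # same value as `width` in the cell above

def border_distance(y, width=FIELD_WIDTH):
    """Distance from a y coordinate to the nearer border line."""
    return min(y, width - y)

def center_offset(y, width=FIELD_WIDTH):
    """Lateral distance from a y coordinate to the center line."""
    return abs(y - width / 2)

# Hypothetical y coordinates: a wide player and a player near the middle
for y in (3.0, 25.0, 50.0):
    print(y, round(border_distance(y), 2), round(center_offset(y), 2))
```

For any position on the field, the two values add up to half the field width (26.65), so a small border distance always implies a large lateral distance to cover.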

#### Finding: 

1. The Gunner has the highest maximum speeds. This makes it hard for him to decelerate in collisions.
2. His location away from the crowd forces him to run faster to reach locations that players in the center can reach more easily.

## D. Block Opposition

### If the Coverage Team is moving towards the PR, what is the trend of the Return team's PR distance across time?

In [None]:
df_injury_moves_return = df_injury_moves[(df_injury_moves['Team']=='Return') & 
                                          (df_injury_moves['Role']!='PR')]
df_injury_moves_return = df_injury_moves_return[df_injury_moves_return['PlayStartTime'] < 15]

sns.jointplot(x="PlayStartTime", y="PR_Distance", kind='hex', data=df_injury_moves_return)

At the start of the play, the players are in formation around 40-50 yards away from the Punt Returner. The Return team initially charges towards the coverage team. After the kick, the players move closer to the Punt Returner. At 10 seconds onwards, the players are already very near the Punt Returner.

### Is there a correlation between the Return Team's PR distance and Time?

In [None]:
df_injury_moves_return = df_injury_moves_return.dropna()

# Pearson for linear correlation
output = scipy.stats.pearsonr(df_injury_moves_return['PlayStartTime'], 
                    df_injury_moves_return['PR_Distance'])
print(output)

# Spearman for non-linear correlation
output = scipy.stats.spearmanr(df_injury_moves_return['PlayStartTime'], 
                    df_injury_moves_return['PR_Distance'])
print(output)

The correlation tests show a moderate relationship between time and PR distance. The negative correlation indicates that the distance from the PR decreases as time increases.
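The two coefficients answer slightly different questions: Pearson measures how linear the relationship is, while Spearman is the Pearson correlation of the ranks and therefore measures how monotonic it is. The sketch below uses made-up data (an exponentially shrinking distance) and hypothetical helper names; it is not the kernel's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 to n; assumes no ties, which holds for the toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: distance to the PR shrinks quickly, then levels off
times = list(range(10))
distances = [50 * math.exp(-0.4 * t) for t in times]

r = pearson(times, distances)
rho = spearman(times, distances)
print(round(r, 3), round(rho, 3))
```

Because the toy distance always decreases, Spearman reaches -1, while Pearson stays weaker because the decrease is not a straight line.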

### How far from the PR do the block injuries occur?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_block = df_injury_phase2[(df_injury_phase2['Player_Activity_Derived']=='Blocking') |
                                         (df_injury_phase2['Player_Activity_Derived']=='Blocked')]
df_collision_point_info = df_collision_point.merge(df_injury_phase2_block[['GameKey', 'PlayID',
                                                                     'Player_Activity_Derived']],
                                                  left_on=['GameKey', 'PlayID'],
                                                  right_on=['GameKey', 'PlayID'],
                                                  how='right')

# Graph
graph_distribution(df_collision_point_info['PR_Distance_Player'].dropna())
df_collision_point_info['PR_Distance_Player'].describe()

The block collisions occurred close to the PR, with a median distance of 4.3 meters from the gridline of the Punt Returner.

### How far from the ball landing zone did the block injuries occur?

In [None]:
# Get the PR Moves
df_pr_moves = df_injury_moves[df_injury_moves['Role']=='PR']
df_pr_moves = df_pr_moves.set_index(['GameKey', 'PlayID', 'Event'])

df_collision_point_indexed = df_collision_point.set_index(['GameKey', 'PlayID'])
df_collision_point_indexed.head()

# Compute each distance from the PR
# TODO: punt_received, fumble, fair catch
def get_collision_distance_from_ball_landing(row):
    try:
        # fair_catch, punt_received,fumble
        ball_location = df_pr_moves.loc[(row['GameKey'], row['PlayID'], 'punt_received')]
        collision_location = df_collision_point_indexed.loc[(row['GameKey'], row['PlayID'])]
        return abs(ball_location['x'][0] - collision_location['x_Player'])
    except:
        return None

df_collision_point_info['Ball_X'] = df_collision_point_info.apply(lambda row: get_collision_distance_from_ball_landing(row), 
                                                       axis=1)

# Graph
graph_distribution(df_collision_point_info['Ball_X'].dropna())
df_collision_point_info['Ball_X'].describe()

Most of the block collisions occurred near the landing zone, with a median value of around 6 meters from the gridline of where the ball was caught.

### What is the distribution of the team PR distances across time?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='PR') & (df_injury_moves['Role']!='P')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]

df_injury_moves_nostars = df_injury_moves_nostars.dropna(subset=['PR_Distance'])
df_injury_moves_nostars['Second'] = df_injury_moves_nostars['Second'].astype(int)

ax = sns.boxplot(x="Second", y="PR_Distance", 
            hue="Team",
            data=df_injury_moves_nostars)
ax.set_title('PR Distance Distribution\n(meters vs seconds)')

Both teams move towards the PR. However, notice the difference in the medians of the distributions. The Return team tends to lag behind the Coverage team. This means that there are more chances that the blocks will come from behind. (This can be observed in the Block Opposition injuries in the Collision Pairs kernel, where the blocking person was behind along the x-axis.)

### Is there sufficient variance between the teams' PR distance distributions in each second?

In [None]:
# ANOVA
for second in range(15):
    df_injury_moves_second = df_injury_moves_nostars[df_injury_moves_nostars['Second'] == second]
    print("Second " + str(second) + ':')
    mod = ols('PR_Distance ~ Team',
                data=df_injury_moves_second).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    print(aov_table)
    print()

There is insufficient variance for the first 4 seconds, while the players are still leaving the formation.
Afterwards, there is sufficient variance between the PR distances of the two teams in every second because they are already on the move.

### How do the players move across the length of the field in reference to the location of the PR?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='PR') & (df_injury_moves['Role']!='P')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]
df_injury_moves_nostars_gameplay = df_injury_moves_nostars[(df_injury_moves_nostars['GameKey']==5) &
                                                          (df_injury_moves_nostars['PlayID']==3129)]
df_injury_moves_nostars_gameplay['Second'] = df_injury_moves_nostars_gameplay['Second'].astype(int)

ax = sns.boxplot(x="Second", y="PR_X", 
            hue="Team",
            data=df_injury_moves_nostars_gameplay)
ax.set_title("PR's X Distance Distribution\n(meters vs seconds)")

For this, I only used a sample of a single game because the X-Axis movement varies according to the game situation.

The players move closer to the PR's position on the x-axis. However, there is a turning point when the PR passes them, after which their x-axis difference increases. This turning point varies per play because it depends on several variables (punt distance, where the PR went, etc.). Thus, I used the time elapsed since the change in x-axis direction, as shown below.

### How long after the direction change did the collision occur according to the injured player's activity type?

In [None]:
# Filter
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_opponent = df_injury_phase2[df_injury_phase2['Friendly_Fire']=='No']

df_moves = df_injury_moves.merge(df_injury_phase2_opponent[['GameKey',
                                                   'PlayID']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                        how='right')

# Get time of min-max points
df_moves_agg = df_moves.groupby(['GameKey','PlayID', 'GSISID']).agg({'x': ['idxmin', 'idxmax']})
df_moves_agg['X_Max'] = df_moves.loc[df_moves_agg[('x', 'idxmax')].values]['PlayStartTime'].values
df_moves_agg['X_Min'] = df_moves.loc[df_moves_agg[('x', 'idxmin')].values]['PlayStartTime'].values

df_moves_agg = df_moves_agg.reset_index()

# Get Collision Time
df_moves_agg = df_moves_agg.merge(df_collision_point[['GameKey',
                                                   'PlayID',
                                                   'PlayStartTime']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                        how='right')


# Get Time difference of turning point to collision time
df_moves_agg['X_Diff'] = df_moves_agg.apply(lambda row:
#                                             get_facing(row),
                                            min(abs(row['PlayStartTime'] - row[('X_Max', '')]),
                                                abs(row['PlayStartTime'] - row[('X_Min', '')])),
                                            axis=1)

# Get activity of involved injury
df_moves_agg = df_moves_agg.merge(df_injury[['GameKey','PlayID',
                                             'Player_Activity_Derived']],
                                  left_on=['GameKey', 'PlayID'],
                                  right_on=['GameKey', 'PlayID'],
                                  how='right')

# Graph
ax = sns.boxplot(x='X_Diff', y='Player_Activity_Derived', 
            data=df_moves_agg)
ax.set_title('Deceleration Time Before Collision (seconds)')
df_moves_agg.groupby(['Player_Activity_Derived'])['X_Diff'].median().reset_index().head()

At some point in the play, the players either slow down or change direction (in the x-axis) across the length of the field.
Most of the blocked and blocking incidents occurred less than 2 seconds after the crowd changed directions. The median human reaction time is 0.25 seconds.
This means that the sudden change in direction may be a factor in the block collisions.
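The `X_Diff` measure treats the moment a player reaches his smallest or largest x position as the turning point. A minimal sketch of that idea for a single player, with a made-up track and a hypothetical helper name:

```python
def time_since_x_turning_point(track, collision_time):
    """track: list of (time_s, x) samples for one player.

    Returns the smaller gap between the collision time and the times at
    which x reached its minimum and its maximum, mirroring X_Diff.
    """
    t_at_min = min(track, key=lambda sample: sample[1])[0]
    t_at_max = max(track, key=lambda sample: sample[1])[0]
    return min(abs(collision_time - t_at_min),
               abs(collision_time - t_at_max))

# Hypothetical player: x grows until t = 6 s, then he turns back
toy_track = [(0, 30.0), (2, 40.0), (4, 52.0), (6, 58.0), (8, 55.0), (10, 50.0)]
x_diff = time_since_x_turning_point(toy_track, collision_time=7.5)
print(x_diff)
```

One caveat of this proxy is that one of the two extremes is often simply the player's starting position, so taking the minimum of the two gaps is what keeps the measure tied to the actual change of direction.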

### Is there sufficient variance in the time elapsed between the activity types?

In [None]:
# ANOVA
mod = ols('X_Diff ~ Player_Activity_Derived',
            data=df_moves_agg).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
aov_table.head()

There is sufficient variance. The PR(>F) value is below the 0.05 significance level.

### How far was the crowd from the injured player during the blocked collisions?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_opponent = df_injury_phase2[df_injury_phase2['Friendly_Fire']=='No']
df_injury_phase2_opponent_block = df_injury_phase2_opponent[(df_injury_phase2_opponent['Player_Activity_Derived']=='Blocked') |
                                            (df_injury_phase2_opponent['Player_Activity_Derived']=='Blocking')]

# Get only the block collisions
df_collision_block = df_collision_point.merge(df_injury_phase2_opponent_block[['GameKey',
                                                   'PlayID']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                   how='left')

df_collision_block_indexed = df_collision_block.set_index(['GameKey', 'PlayID', 'PlayStartTime'])

# Get all players during the block collision
df_collision_block_moves = df_injury_moves.merge(df_collision_block,
                                        left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                        right_on=['GameKey', 'PlayID', 'PlayStartTime'])

# Compute each distance from the PR
def get_player_axial_distance(row, axis):
    try:
        coordinates = df_collision_block_indexed.loc[(row['GameKey'], row['PlayID'], row['PlayStartTime'])]
        index = axis+'_Player'
        return abs(coordinates[index] - row[axis])
    except:
        return None

def get_player_distance(row):
    try:
        return math.sqrt(pow(row['Player_X'], 2) + pow(row['Player_Y'], 2))
    except:
        return None

df_collision_block_moves['Player_X'] = df_collision_block_moves.apply(lambda row: 
                                                                     get_player_axial_distance(row, 'x'),
                                                                     axis=1)
df_collision_block_moves['Player_Y'] = df_collision_block_moves.apply(lambda row: 
                                                                     get_player_axial_distance(row, 'y'),
                                                                     axis=1)
df_collision_block_moves['Player_Distance']= df_collision_block_moves.apply(lambda row: 
                                                                     get_player_distance(row),
                                                                     axis=1)

# Graph
graph_distribution(df_collision_block_moves['Player_Distance'])
df_collision_block_moves['Player_Distance'].describe()

Half of the other players were within about 8 meters of the injured player when the block collisions occurred. The graph shows a clear accumulation of the players near the location of the collision.
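The distance used here is the straight-line distance built from the two axial differences. A minimal sketch with made-up coordinates (the helper name and the positions are hypothetical):

```python
import math
from statistics import median

def distance_to_collision(player_xy, collision_xy):
    """Straight-line distance from a player to the collision point."""
    dx = abs(player_xy[0] - collision_xy[0])  # corresponds to Player_X
    dy = abs(player_xy[1] - collision_xy[1])  # corresponds to Player_Y
    return math.hypot(dx, dy)

# Hypothetical collision point and three surrounding players
collision = (40.0, 20.0)
others = [(43.0, 24.0), (40.0, 28.0), (52.0, 25.0)]

distances = [distance_to_collision(p, collision) for p in others]
print(distances, median(distances))
```

The median of these distances is the value that splits the surrounding players in half, which is how the median in the `describe()` output is read.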

### What are the basic statistics of the collision?

In [None]:
df_injury_phase2 = df_injury[df_injury['Phase']==2]
df_injury_phase2_opponent = df_injury_phase2[df_injury_phase2['Friendly_Fire']=='No']
df_injury_phase2_opponent_block = df_injury_phase2_opponent[(df_injury_phase2_opponent['Player_Activity_Derived']=='Blocked') |
                                            (df_injury_phase2_opponent['Player_Activity_Derived']=='Blocking')]

# Get only the moves for block injuries
df_moves = df_injury_moves.merge(df_injury_phase2_opponent_block[['GameKey',
                                                   'PlayID']],
                                        left_on=['GameKey', 'PlayID'],
                                        right_on=['GameKey', 'PlayID'],
                                        how='right')

# Get only the moves during the collision time
df_moves_collision = df_moves.merge(df_collision_point[['GameKey',
                                                   'PlayID', 'PlayStartTime']],
                                        left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                        right_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                   how='inner')
df_moves_collision.describe()

The crowd has the following statistics:
- The speeds have a median of around 18 kph.
- The collisions occurred around 10 seconds after the ball snap.
- The median PR Distance is 11 meters.

#### Finding: 

1. Both teams are moving towards the PR.
2. The players have higher speeds and also a wide speed interval in the 10-20 second range.
3. The block collisions occur close to the PR at speeds of around 17.4 kph.

In summary, the teams were moving towards the PR. However, block collisions occur when the Return team catches up with the coverage team after their peak speeds.

## E. Friendly-Fire Collisions

### What team is generally faster?

In [None]:
# Ensure there is only one instance per person
df_injury_max_speeds = df_injury_moves_nostars.groupby(['GameKey', 'PlayID', 'GSISID', 'Team'])['kph'].max().reset_index()

# Graph
ax = sns.boxplot(x='kph', y='Team', data = df_injury_max_speeds)
ax.set_title('Maximum Team Speed (kph)')
ax.set(xlabel='kph', ylabel='Team')

df_injury_max_speeds_g = df_injury_max_speeds.groupby(['Team']).agg({'kph':
                                                                            ['min', 'max',
                                                                            'median', 'mean']})
df_injury_max_speeds_g.head(40)

In general, the Coverage team has higher peak speeds than the Return team. This makes them dangerous both to others and to themselves.

### Is there sufficient variance between the maximum speeds of the teams?

In [None]:
mod = ols('kph ~ Team',
            data=df_injury_max_speeds).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

This means that the team has an influence on the maximum speeds.

### How far were the maximum distances of the teams to the PR?

In [None]:
# Ensure there is only one instance per person
df_injury_max_distances = df_injury_moves_nostars.groupby(['GameKey', 'PlayID', 'GSISID', 'Team'])['PR_Distance'].max().reset_index()

# Graph
ax = sns.boxplot(x='PR_Distance', y='Team', data = df_injury_max_distances)
ax.set_title('Maximum Team-PR Distance (meters)')
ax.set(xlabel='PR Distance', ylabel='Team')

df_injury_max_distances_g = df_injury_max_distances.groupby(['Team']).agg({'PR_Distance':
                                                                            ['min', 'max',
                                                                            'median', 'mean']})
df_injury_max_distances_g.head(40)

The maximum distances of the members of the Coverage team are generally greater than those of the Return team. Considering the starting formations, this makes sense because the Return team's formation is in between the Coverage team and the PR. However, this also means that the Coverage team ran a longer distance to get closer to the PR.

### Is there sufficient variance between each team's maximum distance from the PR?

In [None]:
mod = ols('PR_Distance ~ Team',
            data=df_injury_max_distances).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

While the variance is small, the PR(>F) value is still below 0.025. There is sufficient variance between the teams' maximum distances.

### How close were the minimum distances of the teams to the PR?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='PR') & (df_injury_moves['Role']!='P')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 30]

# Ensure there is only one instance per person
df_injury_min_distances = df_injury_moves_nostars.groupby(['GameKey', 'PlayID', 'GSISID', 'Team'])['PR_Distance'].min().reset_index()

# Graph
ax = sns.boxplot(x='PR_Distance', y='Team', data = df_injury_min_distances)
ax.set_title('Minimum Team-PR Distances (meters)')
ax.set(xlabel='PR Distance', ylabel='Team')

df_injury_min_distances_g = df_injury_min_distances.groupby(['Team']).agg({'PR_Distance':
                                                                            ['min', 'max',
                                                                            'median', 'mean']})
df_injury_min_distances_g.head(40)

Notice how much closer the Coverage team is compared to the Return team. The Return team maintains a fair distance from the PR to ensure that the PR gets enough freedom of movement. On the other hand, the Coverage team does not bother much about space and attempts to get as close as possible to the PR. When people are clustered around a single point without ample space, the number of interactions between them increases. This creates a higher risk of injuries.

### Is there sufficient variance between each team's minimum distance from the PR?

In [None]:
mod = ols('PR_Distance ~ Team',
            data=df_injury_min_distances).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

There is considerable variance between the teams. This means there is some difference in the minimum distance clustering of each team.

#### Finding: 
The friendly-fire collisions can be attributed to the Coverage team's behavior:
1. The Coverage team runs faster than the Return team.
2. The Coverage team travels a much longer distance to the PR than the Return team.
3. The Coverage team clusters much closer to the PR than the Return team.

When combined, all these factors increase the chances of friendly fire. In other words, their behavior is similar to the single-door emergency situation when the whole crowd is running towards a single exit. The urgency of tackling the PR makes the Coverage team compete amongst themselves.

# III. Experiments

### What is the overall trend of PR Distance over time?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='P') & 
                                          (df_injury_moves['Role']!='PR')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]
df_injury_moves_nostars = df_injury_moves_nostars.dropna()

# Pearson for linear correlation
output = scipy.stats.pearsonr(df_injury_moves_nostars['PlayStartTime'], 
                    df_injury_moves_nostars['PR_Distance'])
print(output)

# Spearman for non-linear correlation
output = scipy.stats.spearmanr(df_injury_moves_nostars['PlayStartTime'], 
                    df_injury_moves_nostars['PR_Distance'])
print(output)

sns.jointplot(x="PlayStartTime", y="PR_Distance", kind='hex', data=df_injury_moves_nostars)

### What is the overall trend of PR Distance over time per team?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='PR') & (df_injury_moves['Role']!='P')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]

df_injury_moves_nostars = df_injury_moves_nostars.dropna(subset=['PR_Distance'])
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PR_Distance'] < 50]
df_injury_moves_nostars['Second'] = df_injury_moves_nostars['Second'].astype(int)

# Graph
ax = sns.boxplot(x="Second", y="PR_Distance", 
            hue="Team",
            data=df_injury_moves_nostars)
ax.set_title('PR Distance Distribution\n(meters vs seconds)')

# ANOVA
for second in range(15):
    df_injury_moves_second = df_injury_moves_nostars[df_injury_moves_nostars['Second'] == second]
    print("Second " + str(second) + ':')
    mod = ols('PR_Distance ~ Team',
                data=df_injury_moves_second).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    print(aov_table)
    print()

### What is the overall trend of Speed over time?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='P') & 
                                          (df_injury_moves['Role']!='PR')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]

# Pearson for linear correlation
output = scipy.stats.pearsonr(df_injury_moves_nostars['PlayStartTime'], 
                    df_injury_moves_nostars['kph'])
print(output)

# Spearman for non-linear correlation
output = scipy.stats.spearmanr(df_injury_moves_nostars['PlayStartTime'], 
                    df_injury_moves_nostars['kph'])
print(output)

sns.jointplot(x="PlayStartTime", y="kph", kind='hex', data=df_injury_moves_nostars)

There is a parabolic trend of speed because of the acceleration and deceleration.
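A parabolic trend is also a case where a linear correlation coefficient says little, because the rise and the fall cancel out. The sketch below uses a made-up, perfectly symmetric speed curve (not the kernel's data) and fits a second-degree polynomial with `numpy.polyfit`:

```python
import numpy as np

# Hypothetical speed profile: accelerate, peak at t = 7 s, then decelerate
t = np.arange(0, 15)
speed = -0.4 * (t - 7) ** 2 + 22

# Linear (Pearson) correlation is close to zero for a symmetric arch
linear_r = np.corrcoef(t, speed)[0, 1]

# A quadratic fit recovers the shape: a negative leading coefficient
a, b, c = np.polyfit(t, speed, deg=2)
peak_time = -b / (2 * a)

print(round(linear_r, 3), round(a, 3), round(peak_time, 2))
```

A negative leading coefficient `a` indicates an arch (acceleration followed by deceleration), and `-b / (2a)` gives the time of the fitted peak.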

### What is the overall trend of Speed over time per team?

In [None]:
df_injury_moves_nostars = df_injury_moves[(df_injury_moves['Role']!='PR') & (df_injury_moves['Role']!='P')]
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['PlayStartTime'] < 15]

df_injury_moves_nostars = df_injury_moves_nostars.dropna(subset=['kph'])
df_injury_moves_nostars = df_injury_moves_nostars[df_injury_moves_nostars['kph'] < 50]
df_injury_moves_nostars['Second'] = df_injury_moves_nostars['Second'].astype(int)

ax = sns.boxplot(x="Second", y="kph", 
            hue="Team",
            data=df_injury_moves_nostars)
ax.set_title('Speed Distribution\n(kph vs seconds)')

### What axis has the highest speed?

In [None]:
df_injury_moves = df_injury_moves.sort_values(by=['GameKey', 'PlayID', 'GSISID', 'PlayStartTime'])

# Delta: Change in movements
df_injury_moves['dx'] = df_injury_moves['x'] - df_injury_moves.groupby(['GameKey', 'PlayID', 'GSISID'])['x'].shift(1)
df_injury_moves['dy'] = df_injury_moves['y'] - df_injury_moves.groupby(['GameKey', 'PlayID', 'GSISID'])['y'].shift(1)
df_injury_moves['dt'] = df_injury_moves['PlayStartTime'] - df_injury_moves.groupby(['GameKey', 'PlayID', 'GSISID'])['PlayStartTime'].shift(1)

# Velocity: Convert meters per second to kph
df_injury_moves['vx'] = 3.6 * (df_injury_moves['dx']/0.1)
df_injury_moves['vy'] = 3.6 * (df_injury_moves['dy']/0.1)

# Velocity: Convert to absolute value speed
df_injury_moves['vx'] = df_injury_moves['vx'].abs()
df_injury_moves['vy'] = df_injury_moves['vy'].abs()

# Put the velocities into one column
df_velocities = pd.melt(df_injury_moves, 
                        id_vars=['GameKey', 'PlayID', 'GSISID', 'PlayStartTime'], 
                        value_vars=['vx', 'vy'])

# Get the maximum speed of player per gameplay
df_v_max = df_velocities.groupby(['GameKey', 'PlayID', 'GSISID', 'variable'])['value'].max().reset_index()

# Clean data that is faster than maximum human speed
df_v_max_limit = df_v_max[df_v_max['value'] < 50]

# Graph
ax = sns.boxplot(y="variable", x="value", data=df_v_max_limit)
ax.set_title('Axial Velocity (kph)')

# ANOVA
# Note: this uses the unfiltered maxima (df_v_max), not df_v_max_limit as in the graph
mod = ols('value ~ variable',
            data=df_v_max).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

Running Area Ranges:
- X-Axis: 0 - 91.44 meters
- Y-Axis: 0 - 48.74 meters

Movement along the x-axis is faster than along the y-axis. This is expected: the Punt Returner is much further away along the x-axis than along the y-axis, so the players run mostly straight along the x-axis. The running area ranges of the two axes also differ: the maximum width (y-axis) is only around half the maximum length (x-axis).

Note: The ANOVA shows a significant difference in maximum speed between the two axes.
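
The cell above computes `dt` but divides by a fixed 0.1 seconds. As a hedged alternative, the sketch below divides by the actual time step per player. The small DataFrame is hypothetical and only mimics the column names used in this kernel.

```python
import pandas as pd

# Hypothetical tracking sample: one player, positions in meters, time in seconds.
df = pd.DataFrame({
    'GSISID': [1, 1, 1, 1],
    'PlayStartTime': [0.0, 0.1, 0.3, 0.4],   # note the 0.2 s gap
    'x': [0.0, 0.5, 1.5, 2.0],
    'y': [0.0, 0.1, 0.3, 0.4],
})

grouped = df.groupby('GSISID')
dt = grouped['PlayStartTime'].diff()

# Axial speed in kph: |delta position| / delta time, then m/s -> kph (x 3.6).
df['vx'] = 3.6 * (grouped['x'].diff() / dt).abs()
df['vy'] = 3.6 * (grouped['y'].diff() / dt).abs()

print(df[['vx', 'vy']].max())
```

With a fixed 0.1 second step, the row after the gap would be reported at twice its actual speed, which matters when the maximum per play is taken.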

### What are the trends in collision speed per activity?

In [None]:
df_injury_moves_details = df_injury_moves.merge(df_injury[['GameKey', 'PlayID', 'Player_Activity_Derived']],
                                                left_on=['GameKey', 'PlayID'],
                                                 right_on=['GameKey', 'PlayID'],
                                                 suffixes=('', '_Player'))
df_collision_point_injury = df_collision_point.merge(df_injury_moves_details[['GameKey', 'PlayID', 
                                                                              'Player_Activity_Derived',
                                                               'kph', 'PR_Distance', 'PlayStartTime',
                                                                              'Team']],
                                                left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                 right_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                    how='left')
# Graph
ax = sns.boxplot(x='kph', 
                 y='Player_Activity_Derived',
                 hue='Team',
                 data=df_collision_point_injury)
ax.set_title("Players' Speed\nwhen Collision Occurred")

# ANOVA
mod = ols('kph ~ Player_Activity_Derived',
            data=df_collision_point_injury).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print('Activity Variance:')
print(aov_table)
print()

# ANOVA
print('Coverage vs Return:')
activities=['Tackling', 'Blocked', 'Blocking', 'Tackled']
for activity in activities:
    df_activity = df_collision_point_injury[df_collision_point_injury['Player_Activity_Derived']==activity]
    mod = ols('kph ~ Team',
                data=df_activity).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    print(activity + ':')
    print(aov_table)
    print()

Tackling and blocked injuries tend to occur when the coverage team is moving at high speed. Combined with the injury statistics, tackling and blocked are also the most frequent activities. This suggests that the coverage team is in a bottleneck, or one-door emergency, situation in which everyone rushes towards a single point.

Between teams, only the tackling and blocked injuries show a significant difference (p-value below 0.05). This means that the higher speed of the coverage team at those moments may affect the collision outcome. For both tackling and blocked, the faster team appears to have a higher chance of injury.
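
The 0.05 threshold can be checked programmatically rather than by reading the printed table. The sketch below uses `scipy.stats.f_oneway`, which for a single factor such as `Team` performs the same one-way F-test as the `ols` / `anova_lm` calls above. The two samples are synthetic and are not the kernel's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic speeds (kph) at the moment of collision; assumptions for illustration.
coverage = rng.normal(loc=24.0, scale=3.0, size=40)
returning = rng.normal(loc=18.0, scale=3.0, size=40)

# One-way ANOVA: is the difference between the group means larger than
# the within-group spread would explain?
f_stat, p_value = stats.f_oneway(coverage, returning)

alpha = 0.05
print(f"F={f_stat:.2f}, p={p_value:.4g}, significant={p_value < alpha}")
```

A p-value below 0.05 indicates a difference in mean speed between the groups; by itself it does not establish that speed causes the injuries.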

### What were the player distances to the PR when the collisions occurred?

In [None]:
df_injury_moves_details = df_injury_moves.merge(df_injury[['GameKey', 'PlayID', 'Player_Activity_Derived']],
                                                left_on=['GameKey', 'PlayID'],
                                                 right_on=['GameKey', 'PlayID'],
                                                 suffixes=('', '_Player'))
df_collision_point_injury = df_collision_point.merge(df_injury_moves_details[['GameKey', 'PlayID', 
                                                                              'Player_Activity_Derived',
                                                               'kph', 'PR_Distance', 'PlayStartTime',
                                                                              'Team']],
                                                left_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                 right_on=['GameKey', 'PlayID', 'PlayStartTime'],
                                                    how='left')
# Graph
ax = sns.boxplot(x='PR_Distance', 
                 y='Player_Activity_Derived',
                 hue='Team',
                 data=df_collision_point_injury)
ax.set_title("Players' PR Distance\nwhen Collision Occurred")

# ANOVA
mod = ols('PR_Distance ~ Player_Activity_Derived',
            data=df_collision_point_injury).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print('Activity Variance:')
print(aov_table)
print()

# ANOVA
print('Coverage vs Return:')
activities=['Tackling', 'Blocked', 'Blocking', 'Tackled']
for activity in activities:
    df_activity = df_collision_point_injury[df_collision_point_injury['Player_Activity_Derived']==activity]
    mod = ols('PR_Distance ~ Team',
                data=df_activity).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    print(activity + ':')
    print(aov_table)
    print()

Across activities, the PR distance varies significantly. Between teams, however, there is insufficient variance.

Tackle-related injuries occur when the players are near the PR. Between blocked and blocking injuries, the blocked injuries often occur closer to the PR than the blocking injuries. However, I have to note that around 3 blocking injuries occurred before the punt, near the players' respective formations.

### For the punt received cases, how far did the injuries occur from the gridline where the ball was caught?

In [None]:
# Get the PR Moves
df_pr_moves = df_injury_moves[df_injury_moves['Role']=='PR']
df_pr_moves = df_pr_moves.set_index(['GameKey', 'PlayID', 'Event'])

df_collision_point_indexed = df_collision_point.set_index(['GameKey', 'PlayID'])
df_collision_point_indexed.head()

# Compute the x-axis distance between each collision and the spot where the PR received the punt
# TODO: punt_received, fumble, fair catch
def get_collision_distance_from_ball_landing(row):
    try:
        # Only 'punt_received' is handled; 'fair_catch' and 'fumble' are not covered yet
        ball_location = df_pr_moves.loc[(row['GameKey'], row['PlayID'], 'punt_received')]
        collision_location = df_collision_point_indexed.loc[(row['GameKey'], row['PlayID'])]
        return abs(ball_location['x'][0] - collision_location['x_Player'])
    except Exception:
        # Plays without a 'punt_received' event (or with unusable data) get no distance
        return None

df_collision_point['Ball_X'] = df_collision_point.apply(get_collision_distance_from_ball_landing, 
                                                       axis=1)
df_collision_point['Ball_X_yards'] = 1.09361 * df_collision_point['Ball_X']

# Graph
graph_distribution(df_collision_point['Ball_X_yards'].dropna())
df_collision_point['Ball_X_yards'].describe()

Most of the injuries occurred within 7.4 yards (6.77 meters) of the gridline where the ball was caught.
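
To make "most" concrete, the share of collisions within a given distance can be computed directly. The sketch below is a minimal illustration: the helper names `meters_to_yards` and `share_within` and the sample distances are hypothetical, and only the conversion factor 1.09361 comes from the cell above.

```python
import numpy as np

YARDS_PER_METER = 1.09361  # same factor as in the cell above

def meters_to_yards(meters):
    """Convert meters to yards."""
    return meters * YARDS_PER_METER

def share_within(distances, threshold):
    """Fraction of non-missing distances at or below the threshold."""
    d = np.asarray(distances, dtype=float)
    d = d[~np.isnan(d)]
    return float((d <= threshold).mean())

# Hypothetical x-distances (meters) between collision and catch gridline.
ball_x_m = np.array([1.2, 3.0, 5.5, 6.7, 9.0, np.nan, 20.0])
ball_x_yd = meters_to_yards(ball_x_m)

print(round(meters_to_yards(6.77), 2))
print(share_within(ball_x_yd, 7.4))
```

Applied to `df_collision_point['Ball_X_yards']`, `share_within` would report the exact fraction of injuries within 7.4 yards.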

### What is the frequency of the events?

In [None]:
df_pr_moves = df_injury_moves[df_injury_moves['Role']=='PR']
df_pr_moves['Event'].value_counts()