Hello,

Here is our submission to the NFL Punt Analytics Competition. In this kernel we do not dive directly into the statistics: we believe it is important to first think about how to approach the data, which parameters can actually be changed, and what has been done in other sports and other leagues.
Once this groundwork is done, I'll proceed to a statistical analysis and report what I found.

## Table of contents
### I. Approaching the data  
1. Punt plays  
[1.1. The incentive behind punt plays](#1.1.-The-incentive-behind-punt-plays)  
[1.2. Why are punt plays a source of concussions](#1.2.-Punt-plays,-a-source-of-concussions)  

2. What parameters can be changed ?  
[2.1. Game environment](#2.1.-Game-environment)  
[2.2. Games rules (teams, players, referees)](#2.2.-Games-rules-(teams,-players,-referees))  

3. What has been done ?  
[3.1. In the Canadian Football League](#3.1-What-has-been-done-in-American-Football)    
[3.2. In French Rugby](#3.2-What-has-been-done-in-Rugby)    

[4. The way to approach new rules](#4.-The-way-to-approach-new-rules)  

### II. Analyzing the data  
[1.  Building the dataframe](#1.-Building-the-Dataframe)  

2. Game environment analysis  
[2.1. Turf analysis](#2.1.-Turf-analysis)  
[2.2. Weather analysis](#2.2.-Weather-analysis)  
[2.3. Match start time analysis](#2.3.-Start-time-analysis)  
[2.4. Match day analysis](#2.4.-Match-day-analysis)  

3. Play analysis focused on players  
[3.1. Punt returners](#3.1.-Punt-returners)  
[3.2. The rest](#3.2.-The-rest)  

[4. Play analysis focused on game](#4.-Play-analysis-focused-on-game)

[5. Conclusion](#5.-Conclusion)

[6. Bibliography](#6.-Bibliography)


## I. Approaching the data
### 1.1. The incentive behind punt plays
A punt play is a last-resort solution, usually done on the final down available. At first it seems like a high-risk, high-reward scheme:  
__Risks:__
- The opposing team takes control of the ball and attempts a punt return

__Rewards:__
- Avoid a turnover on downs
  * If done well, it can lead to points
  * If the punt returner gets the ball, he can still be tackled

Even if it looks like a high-risk, high-reward scheme at first, it is actually a medium-to-low-risk, high-reward scheme, so there is a strong incentive to punt, as punts tend to be really profitable for the punting team.

### 1.2. Punt plays, a source of concussions
During a punt play, players line up at the line of scrimmage, which leads to contact between them.
In addition, if the punt returner gets the ball, the opposing team will try to tackle him in order to stop him.

This is why punt plays encourage contact between players.
Of course, contact at full speed while trying to tackle the punt returner is very different from contact at the line of scrimmage, and we'll see later which is more dangerous and which causes the most concussions.

### 2.1. Game environment
We define the game environment as the type of turf, the weather, and the start time and day of the game.
Can we change those, and would it really be a worthwhile change?   
For example, if there were a clear correlation between rain and concussions, it would be wise to delay matches when it rains on outdoor stadiums. This type of change would work well for the turf type or the weather.  

Regarding the start time or the day of the match, the majority of matches are naturally played during the weekend or in the evening, so we should keep this in mind when searching for correlations. Even if it turned out to be an important factor, changing the dates of the matches would hurt viewership because fewer people would be able to watch them.

### 2.2. Games rules (teams, players, referees)
Game rules are the category where most of the changes should happen, but who is affected by the game rules?  
First, those changes could concern team compositions: if we see that a certain team composition is related to concussions, we should avoid it.  
Those changes could also give more rights to the referees, for example the right to suspend players who suffer from concussions or who tend to cause concussions.  

There are a lot of possible changes within the game rules, but changing them might alter the most efficient tactic available, so they should be changed wisely. There is no way of telling with 100% accuracy how changes in this category will affect the concussion rate.

### 3.1 What has been done in American Football
I found that a rule change related to punt plays occurred in the Canadian Football League in 2015. Here is what I found online about it:
> The rule changes, in a nutshell, are:  
- holding the interior five players from past the line of scrimmage until the ball is kicked on punts   
- adding five-yard no-yards penalty to the end of the play   
- **limiting contact on a receiver to five yards past the line of scrimmage**  
- pushing the convert back to a 32-yard kick and advancing a two-point convert to the three-yard line   
- no re-kick option on a kickoff which goes out of bounds;   
- coach can no longer request a measurement   
- the removal of a rule which made a slotback ineligible if he was too close to his offensive tackle.   

[[1]](https://calgaryherald.com/sports/football/cfl/cfls-glen-johnson-explains-the-new-rules-for-2015)

Some of the other new rule changes:
- The setup zone
- The use of helmet rule


### 3.2 What has been done in Rugby
I thought it would be interesting to take a look at rugby, as this sport also suffers from a high concussion rate.  
For the 2018-2019 season, the French Rugby Federation adopted a list of rule changes:
> - The referee can send a player to the bench if he detects clear signs of concussion; the resulting substitution is not counted against the team's total available substitutions.  
- Each team in the national league must have a doctor and 2 physiotherapists on its staff

[[2]](https://www.sudouest.fr/2018/07/05/rugby-quelles-sont-les-nouvelles-regles-pour-proteger-la-sante-des-joueurs-5206706-773.php) [[3]](https://www.bbc.com/sport/rugby-union/43802462)

These give good examples of parameters that we can change: giving more responsibilities to the referees, changing in-game rules, or changing strategic rules for the coaches.

### 4. The way to approach new rules
We have 2 ways to approach this submission:
- The first one is to find parameters that change the game directly at the player level, i.e. rules concerning the micro-game
- The second one is to propose rule changes for the strategic side of the game, i.e. rules concerning the macro-game

## II. Analyzing the data
### 1. Building the Dataframe
Gathering all the data in a single dataframe to make it easier to analyze

Using [@kmader's kernel as model](https://www.kaggle.com/kmader/convert-to-feather-for-use-in-other-kernels) to handle all the NGS files.

In [None]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
import seaborn as sns
import warnings
import os
warnings.filterwarnings('ignore')
plt.style.use('bmh')
%matplotlib inline
plt.rcParams['figure.dpi'] = 100
init_notebook_mode(connected=True) 

#print(os.listdir('../input/NFL-Punt-Analytics-Competition'))

data = pd.read_csv('../input/NFL-Punt-Analytics-Competition/video_review.csv')
players = pd.read_csv('../input/NFL-Punt-Analytics-Competition/play_player_role_data.csv')
gd = pd.read_csv('../input/NFL-Punt-Analytics-Competition/game_data.csv')
play_info = pd.read_csv('../input/NFL-Punt-Analytics-Competition/play_information.csv')
video = pd.read_csv('../input/NFL-Punt-Analytics-Competition/video_footage-injury.csv')
positions = pd.read_csv('../input/NFL-Punt-Analytics-Competition/player_punt_data.csv')

# GSISID columns keep their original dtypes so they can be compared across tables below
gd['Turf'] = gd['Turf'].str.lower()
gd['GameWeather'] = gd['GameWeather'].str.lower()

for index, row in data.iterrows():
    role = players[(players.GSISID == row['GSISID']) & (players.GameKey == row['GameKey']) & (players.PlayID == row['PlayID'])]['Role'].tolist()[0]
    
    data.loc[index, 'game_clock'] = play_info[(play_info.PlayID == row['PlayID']) & (play_info.GameKey == row['GameKey'])]['Game_Clock'].tolist()[0]
    data.loc[index, 'video_url'] = video[(video.gamekey == row['GameKey']) & (video.playid == row['PlayID'])]['PREVIEW LINK (5000K)'].tolist()[0]
    data.loc[index, 'game_day'] = gd[(gd.GameKey == row['GameKey'])]['Game_Day'].tolist()[0]
    data.loc[index, 'turf'] = gd[(gd.GameKey == row['GameKey'])]['Turf'].tolist()[0]
    data.loc[index, 'start_time'] = gd[(gd.GameKey == row['GameKey'])]['Start_Time'].tolist()[0]
    data.loc[index, 'weather'] = gd[(gd.GameKey == row['GameKey'])]['GameWeather'].tolist()[0]
    data.loc[index, 'temperature'] = gd[(gd.GameKey == row['GameKey'])]['Temperature'].tolist()[0]
    data.loc[index, 'player_role'] = role
    data.loc[index, 'ball_possession'] = (role in ['GL','PLW','PLT','PLG','PLS','PRG','PRT','PRW','GR','PC','PPR','P'])
    #data.loc[index, 'description'] = video[(video.gamekey == row['GameKey']) & (video.playid == row['PlayID'])]['PlayDescription'].tolist()[0]
    data.loc[index, 'description'] = play_info[(play_info.PlayID == row['PlayID']) & (play_info.GameKey == row['GameKey'])]['PlayDescription'].tolist()[0]
    data.loc[index, 'illegal'] = ('illegal' in video[(video.gamekey == row['GameKey']) & (video.playid == row['PlayID'])]['PlayDescription'].tolist()[0].lower().split(' '))
    
    gsisid = row['GSISID']
    data.loc[index, 'player_pos'] = positions[positions.GSISID == gsisid]['Position'].tolist()[0]
    #print(i, gsisid, positions[positions.GSISID == gsisid]['Position'].tolist()[0])
    
    if str(row['Primary_Partner_GSISID']) != 'nan' and str(row['Primary_Partner_GSISID']) != 'Unclear':
        gsisid = int(row['Primary_Partner_GSISID'])
        data.loc[index, 'partner_pos'] = positions[positions.GSISID == gsisid]['Position'].tolist()[0]
        data.loc[index, 'partner_role'] = players[(players.GSISID == int(row['Primary_Partner_GSISID'])) & (players.GameKey == row['GameKey']) & (players.PlayID == row['PlayID'])]['Role'].tolist()[0]
        #data.loc[index, 'ROLES'] = '{}-{}'.format(str(data.loc[index]['player_role']),str(data.loc[index]['PP_GSISROLE']))
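
The row-by-row lookups above can also be expressed as table joins. Below is a minimal sketch on tiny made-up frames: the values are invented for illustration, only the column names `GameKey`, `PlayID`, `GSISID`, `Role`, `Turf` and `Game_Day` come from the competition files, and it assumes each key combination appears once in the lookup tables.

```python
import pandas as pd

# Toy stand-ins for video_review.csv, play_player_role_data.csv and game_data.csv
# (values are invented; only the column names mirror the real files)
review = pd.DataFrame({'GameKey': [1, 2], 'PlayID': [10, 20], 'GSISID': [100, 200]})
roles = pd.DataFrame({'GameKey': [1, 1, 2], 'PlayID': [10, 10, 20],
                      'GSISID': [100, 101, 200], 'Role': ['PR', 'GL', 'PLW']})
games = pd.DataFrame({'GameKey': [1, 2], 'Turf': ['grass', 'fieldturf'],
                      'Game_Day': ['Sunday', 'Monday']})

# Left joins keep every reviewed concussion even if a lookup row is missing
merged = (review
          .merge(roles, on=['GameKey', 'PlayID', 'GSISID'], how='left')
          .merge(games, on='GameKey', how='left')
          .rename(columns={'Role': 'player_role', 'Turf': 'turf', 'Game_Day': 'game_day'}))

print(merged)
```

A join like this avoids filtering the lookup tables once per row, at the cost of having to check for duplicated keys beforehand.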

In [None]:
import tqdm
PATH = '../input/NFL-Punt-Analytics-Competition/'

dtypes = {'Season_Year': 'int16',
         'GameKey': 'int16',
         'PlayID': 'int16',
         'GSISID': 'float32',
         'Time': 'str',
         'x': 'float32',
         'y': 'float32',
         'dis': 'float32',
         'o': 'float32',
         'dir': 'float32',
         'Event': 'str'}

col_names = list(dtypes.keys())

ngs_files = ['NGS-2016-pre.csv',
             'NGS-2016-reg-wk1-6.csv',
             'NGS-2016-reg-wk7-12.csv',
             'NGS-2016-reg-wk13-17.csv',
             'NGS-2016-post.csv',
             'NGS-2017-pre.csv',
             'NGS-2017-reg-wk1-6.csv',
             'NGS-2017-reg-wk7-12.csv',
             'NGS-2017-reg-wk13-17.csv',
             'NGS-2017-post.csv']

df_list = []

for i in tqdm.tqdm(ngs_files):
    df = pd.read_csv(f'{PATH}'+i, usecols=col_names,dtype=dtypes)
    
    df_list.append(df)

In [None]:
import gc
# Merge all dataframes into one dataframe
ngs = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
gc.collect()
ngs = ngs.drop(columns=['Event'])

In [None]:
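def normalize_angle(angle):
    # Wrap an angle in degrees into [-180, 180) so that e.g. 350 is read as -10
    return (angle + 180) % 360 - 180

def find_direction(angle):
    # Quadrant of the other player relative to the movement direction
    angle = normalize_angle(angle)
    if -45 <= angle < 45:
        return 'front'
    elif -135 <= angle < -45:
        return 'left'
    elif 45 <= angle < 135:
        return 'right'
    else:
        return 'behind'

def find_half(angle):
    # Front or back half relative to the movement direction
    angle = normalize_angle(angle)
    if -90 <= angle < 90:
        return 'front'
    else:
        return 'behind'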
    

def find_impact(PlayID, GameKey, GSISID, Partner_GSISID, data_index):
    player_coords = ngs[(ngs.GSISID == GSISID) & (ngs.GameKey == GameKey) & (ngs.PlayID == PlayID)]
    partner_coords = ngs[(ngs.GSISID == Partner_GSISID) & (ngs.GameKey == GameKey) & (ngs.PlayID == PlayID)]

    player_coords = player_coords.sort_values(by=['Time'])
    partner_coords = partner_coords.sort_values(by=['Time'])

    p1 = player_coords.index.tolist()
    p2 = partner_coords.index.tolist()
    
    # Stop at the shorter track so both players always have a sample to compare
    for i in range(min(len(p1), len(p2))):
        player_index = p1[i]
        partner_index = p2[i]

        distance = math.sqrt((player_coords.loc[player_index]['x'] - partner_coords.loc[partner_index]['x'])**2 + (player_coords.loc[player_index]['y'] - partner_coords.loc[partner_index]['y'])**2)

        if distance < 1.2:
            
            rad = math.atan2(partner_coords.loc[partner_index]['y'] - player_coords.loc[player_index]['y'], partner_coords.loc[partner_index]['x'] - player_coords.loc[player_index]['x'])
            rad = (rad * 180) / math.pi
            if rad < 0:
                rad += 360
                
            rad2 = math.atan2(player_coords.loc[player_index]['y'] - partner_coords.loc[partner_index]['y'], player_coords.loc[player_index]['x'] - partner_coords.loc[partner_index]['x'])
            rad2 = (rad2 * 180) / math.pi
            if rad2 < 0:
                rad2 += 360
                
            collision_angle = player_coords.loc[player_index]['dir']  - rad
            collision_angle_partner = partner_coords.loc[partner_index]['dir']  - rad2
            
            data.loc[data_index, 'collision_angle'] = collision_angle
            data.loc[data_index, 'collision_angle_partner'] = collision_angle_partner
            data.loc[data_index, 'player_speed'] = player_coords.loc[player_index]['dis'] * 0.9144 * 10
            data.loc[data_index, 'partner_speed'] = partner_coords.loc[partner_index]['dis'] * 0.9144 * 10
            data.loc[data_index, 'player_x'] = player_coords.loc[player_index]['x']
            data.loc[data_index, 'player_y'] = player_coords.loc[player_index]['y']
            data.loc[data_index, 'partner_x'] = partner_coords.loc[partner_index]['x']
            data.loc[data_index, 'partner_y'] = partner_coords.loc[partner_index]['y']
            data.loc[data_index, 'collision_time1'] = player_coords.loc[player_index]['Time']
            data.loc[data_index, 'collision_time2'] = partner_coords.loc[partner_index]['Time']
            data.loc[data_index, 'player_dir'] = player_coords.loc[player_index]['dir']
            data.loc[data_index, 'player_o'] = player_coords.loc[player_index]['o']
            data.loc[data_index, 'partner_dir'] = partner_coords.loc[partner_index]['dir']
            data.loc[data_index, 'partner_o'] = partner_coords.loc[partner_index]['o']
            data.loc[data_index, 'distance'] = distance
            
            partner_side = find_direction(collision_angle)
            partner_half = find_half(collision_angle)
            
            player_side = find_direction(collision_angle_partner)
            player_half = find_half(collision_angle_partner)            
                
            data.loc[data_index, 'partner_side'] = partner_side
            data.loc[data_index, 'partner_half'] = partner_half
            
            data.loc[data_index, 'player_side'] = player_side
            data.loc[data_index, 'player_half'] = player_half
            
            avg_player_speed = 0
            avg_partner_speed = 0
            
            """
            for y in range(0, i):
                p_index = p1[y]
                p2_index = p2[y]
                
                avg_player_speed += player_coords.loc[p_index]['dis'] * 0.9144
                avg_partner_speed += partner_coords.loc[p2_index]['dis'] * 0.9144
                
            #data.loc[data_index, 'player_avg_speed'] = (avg_player_speed / i) * 10
            #data.loc[data_index, 'partner_avg_speed'] = (avg_partner_speed / i) * 10
            """
            # Number of players within 3 yards of the concussed player at the moment of impact
            impact = player_coords.loc[player_index]
            same_frame = ngs[(ngs.GameKey == GameKey) & (ngs.PlayID == PlayID) & (ngs.Time == impact['Time'])]
            frame_distances = np.sqrt((same_frame['x'] - impact['x'])**2 + (same_frame['y'] - impact['y'])**2)

            # Remove the concussed player and the partner, who are both inside the perimeter
            perimeter_players = int((frame_distances < 3).sum()) - 2
            data.loc[data_index, 'perimeter_players'] = perimeter_players
            
            return True
        

for i in data.index.tolist():
    PlayID = data.loc[i]['PlayID']
    GameKey = data.loc[i]['GameKey']
    GSISID = float(data.loc[i]['GSISID'])
    
    if str(data.loc[i]['Primary_Partner_GSISID']) != 'nan' and str(data.loc[i]['Primary_Partner_GSISID']) != 'Unclear':
        Partner_GSISID = float(data.loc[i]['Primary_Partner_GSISID'])
        
        find_impact(PlayID, GameKey, GSISID, Partner_GSISID, i)

data.head()
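
To make the angle logic in `find_impact` easier to follow, here is a small self-contained sketch on made-up coordinates. It assumes, as the code above does, that `dir` and the `atan2` bearing are measured in the same angular convention; the helper names `bearing_deg` and `relative_angle` are hypothetical and not part of the kernel.

```python
import math

def bearing_deg(from_xy, to_xy):
    """Angle in degrees, in [0, 360), of the vector going from one point to another."""
    angle = math.degrees(math.atan2(to_xy[1] - from_xy[1], to_xy[0] - from_xy[0]))
    return angle % 360

def relative_angle(direction_deg, bearing):
    """Difference between a movement direction and a bearing, wrapped into [-180, 180)."""
    return (direction_deg - bearing + 180) % 360 - 180

# Made-up example: the player moves along dir = 10 degrees and the partner
# sits at a bearing of roughly 350 degrees from the player.
player, partner = (0.0, 0.0), (1.0, -0.17633)
b = bearing_deg(player, partner)
print(round(b), round(relative_angle(10.0, b)))
```

Without the wrap-around, the raw difference here would be about -340 degrees, which thresholds at plus or minus 45 degrees would not recognise as a frontal contact.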

### 2. Game environment analysis
The idea here is to plot the match count and the concussion count for each environment, to see whether there is an environment in which the ratio of concussions is really high compared to the number of games played in that environment.  
For some environments the sample size is too low to be significant.
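
As a rough sketch of that ratio, assuming one label per game and one label per concussion (the lists below are invented, and `concussion_rates` is a hypothetical helper, not part of the kernel):

```python
from collections import Counter

def concussion_rates(game_labels, concussion_labels, min_games=5):
    """Concussions per game for every environment with at least `min_games` games."""
    games = Counter(game_labels)
    concussions = Counter(concussion_labels)
    return {label: concussions[label] / total
            for label, total in games.items()
            if total >= min_games}

# Invented example: 10 games on grass, 6 on turf, 2 on a rarely used surface
game_labels = ['grass'] * 10 + ['turf'] * 6 + ['rare'] * 2
concussion_labels = ['grass', 'turf', 'turf', 'rare']
rates = concussion_rates(game_labels, concussion_labels)
print(rates)
```

The `min_games` threshold plays the same role as the "at least 5 games" filter used in the plots that follow.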

### 2.1. Turf analysis
The hypothesis for the turf type is that some types might offer less grip, which would disturb the players' trajectories.

In [None]:
# Creating a dataframe to show turf relative concussions rate
graph_data = pd.DataFrame()

def clean_turf(turf):
    # data cleaning: group every spelling starting with 'nat' under 'natural grass'
    if isinstance(turf, str) and turf.startswith('nat'):
        return 'natural grass'
    return turf

game_turf = gd['Turf'].map(clean_turf)
concussion_turf = data['turf'].map(clean_turf)

for turf in game_turf.dropna().unique():
    total = (game_turf == turf).sum()

    if total >= 5:
        concussions = (concussion_turf == turf).sum()
        graph_data.loc[turf, 'Total'] = total
        graph_data.loc[turf, 'Concussions'] = concussions
        graph_data.loc[turf, 'NoConcussions'] = total - concussions
    
# Sorting the data!
graph_data = graph_data.sort_values(by=['Total'])

ind = np.arange(len(graph_data))
width = 0.7

# Putting a part of the plot above the other one
p1 = plt.bar(ind, graph_data['Concussions'].tolist(), width, yerr=None, color='red')
p2 = plt.bar(ind, graph_data['NoConcussions'].tolist(), width, yerr=None, color='lightblue', bottom=graph_data['Concussions'].tolist())

# Labelling, legends, etc...
plt.ylabel('Match count')
plt.title('Concussion relative to total games on each turf (at least 5 games played on it)')
plt.xticks(ind, graph_data.index.tolist(), rotation=90)
plt.yticks(np.arange(0, 250, 25))
plt.legend((p1[0], p2[0]), ('Match with a punt concussion', 'Match with no punt concussion'))

plt.show()

### 2.2. Weather analysis
The hypothesis for the weather type is that rain might make the ground slippery and reduce grip.

In [None]:
# Creating a dataframe to show weather relative concussions rate
graph_data = pd.DataFrame()
for weather in gd['GameWeather'].unique().tolist():
    if type(weather) is str and gd[(gd.GameWeather == weather)]['GameWeather'].count() >= 5:
        
        graph_data.loc[weather, 'Total'] = gd[(gd.GameWeather == weather)]['GameWeather'].count()
        graph_data.loc[weather, 'Concussions'] = data[(data.weather == weather)]['weather'].count()
        graph_data.loc[weather, 'NoConcussions'] = gd[(gd.GameWeather == weather)]['GameWeather'].count() - data[(data.weather == weather)]['weather'].count()
    

# Sorting the data!
graph_data = graph_data.sort_values(by=['Total'])

ind = np.arange(len(graph_data))
width = 0.7

# Putting a part of the plot above the other one
p1 = plt.bar(ind, graph_data['Concussions'].tolist(), width, yerr=None, color='red')
p2 = plt.bar(ind, graph_data['NoConcussions'].tolist(), width, yerr=None, color='lightblue', bottom=graph_data['Concussions'].tolist())

# Labelling, legends, etc...
plt.ylabel('Match count')
plt.title('Concussion relative to total games for each weather (at least 5 games played on it)')
plt.xticks(ind, graph_data.index.tolist(), rotation=90)
plt.yticks(np.arange(0, 170, 15))
plt.legend((p1[0], p2[0]), ('Match with a punt concussion', 'Match with no punt concussion'))

plt.show()

### 2.3. Start time analysis
The hypothesis for the start time is that evening matches might have reduced visibility.

In [None]:
# Creating a dataframe to show start time relative concussions rate
graph_data = pd.DataFrame()

for start_time in gd['Start_Time'].unique().tolist():
    if gd[(gd.Start_Time == start_time)]['Start_Time'].count() >= 5:
        graph_data.loc[start_time, 'Total'] = gd[(gd.Start_Time == start_time)]['Start_Time'].count()
        graph_data.loc[start_time, 'Concussions'] = data[(data.start_time == start_time)]['start_time'].count()
        graph_data.loc[start_time, 'NoConcussions'] = gd[(gd.Start_Time == start_time)]['Start_Time'].count() - data[(data.start_time == start_time)]['start_time'].count()


# Sorting the data!
graph_data = graph_data.sort_values(by=['Total'])

ind = np.arange(len(graph_data))
width = 0.7

# Putting a part of the plot above the other one
p1 = plt.bar(ind, graph_data['Concussions'].tolist(), width, yerr=None, color='red')
p2 = plt.bar(ind, graph_data['NoConcussions'].tolist(), width, yerr=None, color='lightblue', bottom=graph_data['Concussions'].tolist())

# Labelling, legends, etc...
plt.ylabel('Match count')
plt.title('Concussion relative to total games for each start time (at least 5 games played on it)')
plt.xticks(ind, graph_data.index.tolist(), rotation=90)
plt.yticks(np.arange(0, 210, 15))
plt.legend((p1[0], p2[0]), ('Match with a punt concussion', 'Match with no punt concussion'))

plt.show()

### 2.4. Match day analysis
I had no clear hypothesis for this one but wanted to look at it anyway.

In [None]:
# Creating a dataframe to show game day relative concussions rate
graph_data = pd.DataFrame()
for game_day in gd['Game_Day'].unique().tolist():
    graph_data.loc[game_day, 'Total'] = gd[(gd.Game_Day == game_day)]['Game_Day'].count()
    graph_data.loc[game_day, 'Concussions'] = data[(data.game_day == game_day)]['game_day'].count()
    graph_data.loc[game_day, 'NoConcussions'] = gd[(gd.Game_Day == game_day)]['Game_Day'].count() - data[(data.game_day == game_day)]['game_day'].count()
    

# Sorting the data!
graph_data = graph_data.sort_values(by=['Total'])

ind = np.arange(len(graph_data))
width = 0.7

# Putting a part of the plot above the other one
p1 = plt.bar(ind, graph_data['Concussions'].tolist(), width, yerr=None, color='red')
p2 = plt.bar(ind, graph_data['NoConcussions'].tolist(), width, yerr=None, color='lightblue', bottom=graph_data['Concussions'].tolist())

# Labelling, legends, etc...
plt.ylabel('Match count')
plt.title('Concussion relative to total games for each game day')
plt.xticks(ind, graph_data.index.tolist(), rotation=90)
plt.yticks(np.arange(0, 450, 25))
plt.legend((p1[0], p2[0]), ('Match with a punt concussion', 'Match with no punt concussion'))

plt.show()

None of the hypotheses is validated: these graphs show a concussion count that is roughly proportional to the number of games played in each environment.  
So changing the game environment might not be the way to go.  
Some turfs don't have any concussions, but they represent a small share of the given dataset (fewer than 50 games), so this is not significant and could be attributed to pure coincidence.  
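
One way to quantify how little a zero-concussion turf tells us is a confidence interval on the rate. The sketch below uses the Wilson score interval, which is my choice of method rather than something used in this kernel. It treats each game as a trial with at most one concussion, the 50-game figure mirrors the bound mentioned above, and `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError('trials must be positive')
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(0, 50)
print(f'{low:.3f} to {high:.3f}')
```

With 0 concussions in 50 games, the upper bound is still about 7 concussions per 100 games, so a zero count on a rarely used turf does not show that the turf is safer.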

### 3. Play analysis focused on players
Firstly let's try to find out what roles are the most involved in concussion plays.

In [None]:
sns.countplot(y='player_role', data=data, order=data['player_role'].value_counts().index)
plt.title('Role victim of concussion count')
plt.ylabel('Roles')
plt.show()

In [None]:
sns.countplot(y='partner_role', data=data, order=data['partner_role'].value_counts().index)
plt.title('Partner role count involved in concussion')
plt.ylabel('Roles')
plt.show()

### 3.1. Punt returners
As seen in the 2 graphs above, punt returners are involved in 13 out of the 37 concussion plays (about 35%).  
Let's separate the punt returners from the rest.

First we'll build a correlation matrix to try to get some insights about the data.


In [None]:
corr_data = data[(data['partner_role'] == 'PR') | (data['player_role'] == 'PR')]
corr_data = corr_data.drop(['Season_Year', 'GameKey', 'PlayID', 'GSISID', 'Primary_Partner_GSISID','collision_time1','collision_time2', 'video_url', 'description'], axis=1)
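# Columns pandas already treats as numeric (booleans included); the others are label-encoded below
numeric_columns = corr_data.select_dtypes(include=['number', 'bool']).columns.tolist()

for column in corr_data.columns:
    if column not in numeric_columns:
        corr_data[column] = corr_data[column].astype('category').cat.codes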

corr = corr_data.corr()
f, ax = plt.subplots(figsize=(16, 8))
heatmap = sns.heatmap(corr)

#### 3.1.1. Activities
Activities performed in-game

In [None]:
sns.countplot(y='Player_Activity_Derived', data=data[(data['player_role'] == 'PR')], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Punt returner (as victim) activity')
plt.ylabel('Activity')
plt.show()

100% of punt returners (as victims) were being tackled.  
And when the punt returner is the target of the victim:

In [None]:
sns.countplot(y='Player_Activity_Derived', data=data[(data['partner_role'] == 'PR')], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Victim activity when the partner is a Punt Returner')
plt.ylabel('Activity')
plt.show()

100% of concussions where the punt returner is the partner happen on a tackle.  
Even though the sample size is pretty small, this suggests that most of the victims are the players tackling and not the one being tackled.  
The difference between injured tacklers and injured tackled players is 3, and 3/13 is roughly 23%. This is noticeable within this sample, but it might not hold with a bigger dataset.

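
To check whether such a gap could be due to chance, one option is an exact two-sided binomial test against a 50/50 split. This is a sketch under assumptions: the 8 versus 5 split below is only what a difference of 3 out of 13 implies if every case is either a tackler or a tackled player, and `binomial_two_sided_p` is a hypothetical helper.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes out of n under a fair 50/50 split."""
    k = max(k, n - k)  # by symmetry, test the larger side
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

p_value = binomial_two_sided_p(8, 13)
print(round(p_value, 3))
```

A p-value of about 0.58 means a gap of this size is fully compatible with chance at this sample size.
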
#### 3.1.2. Impact types


In [None]:
sns.countplot(y='Primary_Impact_Type', data=data[(data['player_role'] == 'PR')], order=data['Primary_Impact_Type'].value_counts().index)
plt.title('Punt returner (as victims) impact type')
plt.ylabel('Impact type')
plt.show()

In [None]:
sns.countplot(y='Primary_Impact_Type', data=data[(data['partner_role'] == 'PR')], order=data['Primary_Impact_Type'].value_counts().index)
plt.title('Victim impact type when the partner is a Punt Returner')
plt.ylabel('Impact type')
plt.show()

In [None]:
sns.countplot(y='Primary_Impact_Type', data=data[(data['partner_role'] == 'PR') | (data['player_role'] == 'PR')], order=data['Primary_Impact_Type'].value_counts().index)
plt.title('Sum of both cases')
plt.show()

Impacts are equally spread between helmet-to-helmet and helmet-to-body, and there is no significant difference between the cases where the victim is the punt returner and those where the partner is the punt returner.
This might come from the small sample size we've been given, which is not enough to conclude anything about impacts on punt returners.

#### 3.1.3. Punt returners partners
Trying to find out who is tackling the punt returners

In [None]:
# merging punt returners partners
part1 = data[(data['player_role'] == 'PR')]['partner_role']
part2 = data[(data['partner_role'] == 'PR')]['player_role']

partners = pd.DataFrame()
for i in part1.index:
    partners.loc[i, 'partner_role'] = part1.loc[i]

for i in part2.index:
    partners.loc[i, 'partner_role'] = part2.loc[i]

sns.countplot(y='partner_role', data=partners, order=partners['partner_role'].value_counts().index)
plt.title('Punt returner partners')
plt.ylabel('Role')
plt.show()

Given role appendix 1 in the dataset description: ![](https://storage.googleapis.com/kaggle-media/competitions/NFL%20player%20safety%20analytics/punt_coverage.png)   
We clearly see that the majority of players suffering or inflicting concussions in plays involving punt returners are players from the front line of the punting team.

#### 3.1.4. Orientations

In [None]:
sns.countplot(y='partner_side', data=data[(data['player_role'] == 'PR') & (data['Player_Activity_Derived'] == 'Tackled')], order=data['partner_side'].value_counts().index)
plt.title('Side from which Punt Returners (as victims) are tackled')
plt.ylabel('side')
plt.show()

In [None]:
sns.countplot(y='player_side', data=data[(data['partner_role'] == 'PR') & (data['Player_Activity_Derived'] == 'Tackling')], order=data['player_side'].value_counts().index)
plt.title('Side from which the player (as victim) tackles the punt returner')
plt.ylabel('side')
plt.show()

Behind is the most frequent side in both cases. Even though we are working with pretty small samples, we can expect this to be an important parameter.

#### 3.1.5. Punt returners speed distribution
In this section, keep in mind that 10 m/s is close to Usain Bolt's average speed during his 100 m world record (9.58 s, about 10.4 m/s), 5 m/s is roughly the pace of a 2h20 marathon, and 2 m/s is a brisk walk or slow jog.
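
For reference, the speeds in this section come from the NGS `dis` field, which the kernel treats as yards travelled per 0.1 s sample. A minimal sketch of that conversion (the sample value is invented and `dis_to_mps` is a hypothetical helper):

```python
YARD_IN_METERS = 0.9144
SAMPLES_PER_SECOND = 10  # assumes one tracking sample every 0.1 s, as in find_impact

def dis_to_mps(dis_yards):
    """Convert a per-sample distance in yards into a speed in metres per second."""
    return dis_yards * YARD_IN_METERS * SAMPLES_PER_SECOND

print(round(dis_to_mps(0.8), 2))
```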

In [None]:
part1 = data[(data['player_role'] == 'PR')]['player_speed']
part2 = data[(data['partner_role'] == 'PR')]['partner_speed']

result = pd.DataFrame()
for i in part1.index:
    result.loc[i, 'speed'] = part1.loc[i]

for i in part2.index:
    result.loc[i, 'speed'] = part2.loc[i]

result = result.dropna()
sns.distplot(result['speed'])
plt.title('Punt returners (as victims) speed distribution')
plt.xlabel('speed in m/s')
plt.show()

In [None]:
part1 = data[(data['player_role'] == 'PR')]['partner_speed']
part2 = data[(data['partner_role'] == 'PR')]['player_speed']

result = pd.DataFrame()
for i in part1.index:
    result.loc[i, 'speed'] = part1.loc[i]

for i in part2.index:
    result.loc[i, 'speed'] = part2.loc[i]

result = result.dropna()
sns.distplot(result['speed'])
plt.title('Victims speed distribution when tackling')
plt.xlabel('speed in m/s')
plt.show()

The punt returners are "easy targets": the collisions happen between tacklers at high speed and punt returners at low speed.

### 3.2. The rest
Now that we've studied the punt returners' case, let's try to understand the remaining concussion plays and see what we can find about them.  
In this section we will mostly focus on the rest of the players by excluding the data from concussions involving punt returners, as those can be studied separately.  
For clarity, I'll add "PRs included" or "PRs excluded" to graph titles to tell whether punt returner related concussions are part of the data.

Let's build a correlation matrix first to get some insights.

In [None]:
corr_data = data[(data['partner_role'] != 'PR') & (data['player_role'] != 'PR')]
corr_data = corr_data.drop(['Season_Year', 'GameKey', 'PlayID', 'GSISID', 'Primary_Partner_GSISID','collision_time1','collision_time2', 'video_url', 'description'], axis=1)
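# Columns pandas already treats as numeric (booleans included); the others are label-encoded below
numeric_columns = corr_data.select_dtypes(include=['number', 'bool']).columns.tolist()

for column in corr_data.columns:
    if column not in numeric_columns:
        corr_data[column] = corr_data[column].astype('category').cat.codes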

corr = corr_data.corr()
f, ax = plt.subplots(figsize=(18, 9))
heatmap = sns.heatmap(corr)

#### 3.2.1. Ball possession
The team that has the ball possession is the team currently punting, whereas the team without ball possession is the team doing the punt return.
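
In code, this boils down to checking the victim's role against the punting-team roles listed when building the dataframe. A minimal sketch (`PUNTING_ROLES` and `is_punting_team` are hypothetical names; the role list is copied from the `ball_possession` line of the dataframe-building cell and may not cover every role variant in the data):

```python
# Roles the kernel treats as belonging to the punting team
PUNTING_ROLES = {'GL', 'PLW', 'PLT', 'PLG', 'PLS', 'PRG', 'PRT', 'PRW', 'GR', 'PC', 'PPR', 'P'}

def is_punting_team(role):
    """True when the role is one of the punting-team roles, False otherwise."""
    return role in PUNTING_ROLES

print(is_punting_team('GL'), is_punting_team('PR'))
```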

In [None]:
sns.countplot(y='ball_possession', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR')], order=data['ball_possession'].value_counts().index)
plt.title('Ball possession')
plt.ylabel('Victim from punting team ? (PRs excluded)')
plt.show()

From this graph we can see that players from the punting team are the most likely to get concussed. Let's review their activities, impact types and roles.

#### 3.2.2. Activities

In [None]:
sns.countplot(y='Player_Activity_Derived', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR')], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Both sides - Victim activity (PRs excluded)')
plt.ylabel('Activity')
plt.show()

In [None]:
has_ball = True
sns.countplot(y='Player_Activity_Derived', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == has_ball)], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Punting team - Victim activity (PRs excluded)')
plt.ylabel('Activity')
plt.show()

In [None]:
has_ball = False
sns.countplot(y='Player_Activity_Derived', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == has_ball)], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Returning team - Victim activity (PRs excluded)')
plt.ylabel('Activity')
plt.show()

For the rest of the players, the activities leading to injury are mostly related to blocking (65+%).  
With such a small sample it is hard to be sure, but we can hypothesize that the player most exposed to concussion is the one being blocked. A larger sample would be needed to confirm this.
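
As a rough sketch of how such a share can be computed, `value_counts(normalize=True)` gives the proportion of each activity. The toy labels below are made up for illustration and do not reproduce the real counts; on the notebook's data the same two lines would be applied to the PRs-excluded subset of `data`.

```python
import pandas as pd

# Toy activity labels for illustration only; these are not the real counts.
toy = pd.DataFrame({
    'Player_Activity_Derived': ['Blocking', 'Blocked', 'Tackling', 'Blocked', 'Tackled'],
})

# Proportion of each activity among the concussed players
share = toy['Player_Activity_Derived'].value_counts(normalize=True)

# Combined share of the two blocking-related activities
blocking_share = share.reindex(['Blocking', 'Blocked']).fillna(0).sum()

print(share)
print('Blocking-related share: {:.0%}'.format(blocking_share))
```

Comparing `share['Blocked']` with `share['Blocking']` is one simple way to look at the hypothesis above, keeping in mind that the sample is small.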

#### 3.2.3. Impact types

In [None]:
sns.countplot(y='Primary_Impact_Type', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR')], order=data['Primary_Impact_Type'].value_counts().index)
plt.title('Impact types (PRs excluded)')
plt.ylabel('Impact')
plt.show()

Again, the impacts are spread fairly evenly between helmet-to-body and helmet-to-helmet. The split stays roughly the same when looking at each subgroup in detail.
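
One way to look at the detail is a contingency table of activity against impact type. This is only a sketch on made-up rows; the column names are the ones used in this notebook, the values are illustrative.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the competition data.
toy = pd.DataFrame({
    'Player_Activity_Derived': ['Blocking', 'Blocked', 'Tackling', 'Blocked'],
    'Primary_Impact_Type': ['Helmet-to-body', 'Helmet-to-helmet',
                            'Helmet-to-helmet', 'Helmet-to-body'],
})

# Rows: activity of the concussed player, columns: primary impact type
table = pd.crosstab(toy['Player_Activity_Derived'], toy['Primary_Impact_Type'])
print(table)
```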

#### 3.2.4. Roles
Who is injuring whom?

In [None]:
sns.countplot(y='player_role', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == True)], order=data['player_role'].value_counts().index)
plt.title('Punting team - Victim role (PRs excluded)')
plt.ylabel('Role')
plt.show()

According to the appendix, these are mostly players from the sides and from the front line.

In [None]:
sns.countplot(y='player_role', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == False)], order=data['player_role'].value_counts().index)
plt.title('Returning team - Victim role (PRs excluded)')
plt.ylabel('Role')
plt.show()

According to the appendix, the concussed players on the return team are again from the front line.
![](https://storage.googleapis.com/kaggle-media/competitions/NFL%20player%20safety%20analytics/punt_return.png)

In [None]:
sns.countplot(y='partner_role', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == False)], order=data['partner_role'].value_counts().index)
plt.title('Returning team - Blocked partners (PRs excluded)')
plt.ylabel('Role')
plt.show()

In [None]:
sns.countplot(y='partner_role', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['ball_possession'] == True)], order=data['partner_role'].value_counts().index)
plt.title('Punting team - Partner roles (PRs excluded)')
plt.ylabel('Role')
plt.show()

Most of the roles involved in these concussions line up on the line of scrimmage.

#### 3.2.5. Speed
Some interesting distributions concerning tackling and blocking:

In [None]:
sns.distplot(data[(data['player_role'] != 'PR') & (data['Player_Activity_Derived'] =='Tackling')]['partner_speed'].dropna())
sns.distplot(data[(data['player_role'] != 'PR') & (data['Player_Activity_Derived'] =='Tackling')]['player_speed'].dropna(), hist_kws={'alpha':0.25})
plt.legend(['Target speed','Victim speed'])
plt.title('Tackling speed distribution (PRs excluded)')
plt.xlabel('speed m/s')
plt.show()

In [None]:
sns.distplot(data[(data['player_role'] != 'PR') & (data['Player_Activity_Derived'] == 'Blocking')]['partner_speed'].dropna())
sns.distplot(data[(data['player_role'] != 'PR') & (data['Player_Activity_Derived'] == 'Blocking')]['player_speed'].dropna(), hist_kws={'alpha':0.25})
plt.legend(['Target speed','Victim speed'])
plt.title('Blocking speed distribution (PRs excluded)')
plt.xlabel('speed m/s')
plt.show()

The speed difference between the two players is large and should be taken into account when designing the rules.
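
To put a number on that difference, one could compute the absolute speed gap per collision and summarize it by activity. The sketch below uses made-up speeds in m/s; `speed_gap` is a hypothetical column name introduced here, while `player_speed` and `partner_speed` are the columns plotted above.

```python
import pandas as pd

# Toy speeds in m/s for illustration only; they are not real measurements.
toy = pd.DataFrame({
    'Player_Activity_Derived': ['Tackling', 'Blocking', 'Blocking'],
    'player_speed': [2.0, 6.5, 1.0],
    'partner_speed': [7.0, 3.0, None],
})

# Absolute speed difference between the two players involved in the collision
toy['speed_gap'] = (toy['partner_speed'] - toy['player_speed']).abs()

# Median gap per activity; rows with a missing speed are ignored by median()
summary = toy.groupby('Player_Activity_Derived')['speed_gap'].median()
print(summary)
```

The median is used rather than the mean because, with so few concussion plays, a single extreme collision would otherwise dominate the summary.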

#### 3.2.6. Orientation

In [None]:
sns.countplot(y='partner_side', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR')], order=data['partner_side'].value_counts().index)
plt.title('Side from which the partner arrives in concussion plays (PRs excluded)')
plt.ylabel('side')
plt.show()

In [None]:
sns.countplot(y='Player_Activity_Derived', data=data[(data['player_role'] != 'PR') & (data['partner_role'] != 'PR') & (data['player_side'] == 'behind')], order=data['Player_Activity_Derived'].value_counts().index)
plt.title('Activities leading to concussion when coming from behind the target (PRs excluded)')
plt.ylabel('Activity')
plt.show()

### 3.3. Global role analysis
#### 3.3.1. Global orientation

In [None]:
sns.countplot(y='partner_side', data=data[(data['Friendly_Fire'] == 'No')], order=data['partner_side'].value_counts().index)
plt.title('Side from which the partner arrives in concussion plays | Friendly fire excluded')
plt.ylabel('side')
plt.show()

### 4. Play analysis focused on game
#### 4.1. Move legality regarding game rules
Let's look at parameters related to the moves leading to concussions: are they illegal? Do they happen on long or short punts? Is their relative frequency abnormal?

In [None]:
sns.countplot(y='illegal', data=data, order=data['illegal'].value_counts().index)
plt.title('Illegal move leading to concussion count')
plt.ylabel('Move illegal ?')
plt.show()

Most of the moves are currently legal, which is why our rule changes should propose new rules rather than try to enforce existing ones more strictly.

#### 4.2. Punt distance distribution
Let's take a look at the distance distribution:

In [None]:
for i in play_info.index:
    if 'punts' in play_info.loc[i, 'PlayDescription'].split(' '):
        description = play_info.loc[i, 'PlayDescription'].split(' ')
        play_info.loc[i, 'punt_dist'] = int(description[description.index('punts') + 1])
    else:
        play_info.loc[i, 'punt_dist'] = -1
        
for i in data.index:
    if 'punts' in data.loc[i, 'description'].split(' '):
        description = data.loc[i, 'description'].split(' ')
        data.loc[i, 'punt_dist'] = int(description[description.index('punts') + 1])
    else:
        data.loc[i, 'punt_dist'] = -1

sns.distplot(play_info[play_info.punt_dist > 0]['punt_dist'])
sns.distplot(data[data.punt_dist > 0]['punt_dist'], hist_kws={'alpha':0.25})        
plt.title('Punt distance distribution')
plt.legend(['All punts', 'Punt with concussion'])
plt.xlabel('Punt distance in yards')
plt.show()

On this graph, the punt distance distribution of the 6000+ punts from play_information.csv is shown in blue. It is approximately the same as the distribution of the 37 punts with a concussion from video_review.csv. We therefore can't conclude anything from punt distance: the most frequent distances simply account for the most concussions, which is expected. The graph would have been informative if the two distributions differed markedly, for example if the concussion distribution were heavily shifted to the left.
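
The token-splitting loops used to extract the punt distance could also be written around a regular expression. This is only a sketch: `parse_punt_distance` is a hypothetical helper, the sample descriptions are made up, and it returns `None` where the loops store `-1`.

```python
import re

# The word "punts" followed by the distance in yards, as the loops above assume
PUNT_RE = re.compile(r'\bpunts (\d+)\b')


def parse_punt_distance(description):
    """Return the punt distance in yards, or None if no distance is found."""
    match = PUNT_RE.search(description)
    return int(match.group(1)) if match else None


# Made-up descriptions for illustration only
print(parse_punt_distance('(4:12) A.Punter punts 52 yards to XYZ 20.'))
print(parse_punt_distance('(4:12) Punt is blocked.'))
```

On the notebook's dataframes it could be applied with `play_info['PlayDescription'].apply(parse_punt_distance)`. Unlike the `int(...)` call in the loops, it does not raise when "punts" is not followed by a number.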

#### 4.3. Number of players near the incident
We thought it would be interesting to know how many players are near the concussion when it happens.

In [None]:
sample = data['perimeter_players'].dropna()
sns.distplot(sample)
plt.title('Players in perimeter distribution')
plt.xlabel('Number of players within a 3-yard perimeter')
plt.show()

A high density of players in the same place leads to collisions, and at high speed these rarely end well.

#### 4.4. Average number of punts per game

In [None]:
# Games in which at least one punt concussion was recorded (a set removes duplicate keys)
concussion_gamekey = set(data['GameKey'])
concussion_punts = 0
normal_punts = 0
concussion_games = 0
normal_games = 0

for i in play_info['GameKey'].unique().tolist():
    n_punts = len(play_info[play_info.GameKey == i])
    if i in concussion_gamekey:
        concussion_punts += n_punts
        concussion_games += 1
    else:
        normal_punts += n_punts
        normal_games += 1

# Each average is divided by the number of games in its own group
print('Average punts in a game with a concussion: {} ({} punts / {} games)'.format(concussion_punts / concussion_games, concussion_punts, concussion_games))
print('Average punts in a game with no concussion: {} ({} punts / {} games)'.format(normal_punts / normal_games, normal_punts, normal_games))

### 5. Conclusion
#### Micro-game rules :
**1. Forbid tackling / blocking players who are moving at low speed (2 to 3 m/s, a fast walk to slow jog); in particular, do not tackle the punt returner until he starts running (give him 2 to 3 seconds)**
- **Goal :** We've seen that impacts leading to concussions usually involve a large speed difference between the two players.  
- **Risks :** Other players on the field may not expect the slowdown, which could create new impacts (although a player has a clear trajectory when his target is not moving fast, so this should be rare). As a side effect, the rule may also keep players from gathering in one area.  
- **Advantages :** Avoids unnecessary contact and large speed differences.  
- **Disadvantages :** May slow the game down a bit (making it less spectacular?); players feinting a slowdown, or others accelerating, might make tackles and blocks harder to perform.

**2. Not really a rule : agreeing on a side to avoid helmet impacts in head-on collisions**
- **Goal :** This idea is borrowed from aviation, where aircraft approaching head-on both turn to the same side. Players should likewise be given a side on which to place their head, to avoid the helmet-to-helmet impact, which at high speed is a major source of concussions. This side can be left or right but has to be the same for every player.
- **Risks :** None   
- **Advantages :** Simple to set up, costs no money, doesn't affect the excitement of the game.  
- **Disadvantages :** Players might not think about it in the heat of the moment.

**3. Forbid tackling / blocking from behind the opponent**
- **Goal :** Reduce the number of dangerous impacts.  
- **Risks :** None    
- **Advantages :** Easy to set up. It forces players to think more about their positioning, which adds strategy.
- **Disadvantages :** Makes clean tackles harder to perform.

**4. Forbid very low tackles**
- **Goal :** In some of the given plays, players dove head first to perform very low tackles. Tackling too high or too low is dangerous: tackling too low can cause spine injuries and raises the chances of helmet-to-ground concussions. So tackling too low should be forbidden.
- **Risks :** None  
- **Advantages :** Same as forbidding tackles / blocks from behind
- **Disadvantages :** None

#### Macro-game rules : 
**1. Teams with concussed players should receive sanctions (ex: lose a number of points in the overall standings)**  
- **Goal :** Such a sanction links player health to team success, so it incentivizes teams to avoid concussions. To do so they will spend money on prevention, better and safer training methods (15% of concussions happen during training [[4]](#6.-Bibliography)), better medical staff, and so on. It also incentivizes players to play safely and avoid being a burden to their teammates.   
- **Risk :** Teams might not declare concussions in order to avoid the sanction. Such teams should receive an even harsher sanction (ex: disqualification from the playoffs), as teams should not gamble with their players' health.  
- **Advantages :** Costs the league no money, doesn't impact the show, helps in the long run with all types of injuries, and is not hard to apply with the current concussion checking system.
- **Disadvantages :** Teams won't like having to spend money.

**2. Limit the number of punts to 4 per team for the entire game**
- **Goal :** As punts are a big source of concussions, limiting them is a simple way to reduce the problem. We've seen that games with a concussion have an average of more than 10 punts whereas games with no concussion have an average below 10 punts. This is not a magic formula, but punts are physically demanding for players, so limiting them might be a good way to go.  
- **Risk :** None  
- **Advantages :** Costs no money, preserves the show, adds a new dimension to strategies  
- **Disadvantages :** Might be disliked by coaches

**3. A rule for referees : ability to send players suspected of concussion to the bench (decision supported by medical staff)**
- **Goal :** Avoid making an existing concussion worse
- **Risk :** Wrong decision from the referee  
- **Advantages :** Doesn't make the player's condition worse
- **Disadvantages :** The team or the player might contest the decision; it is a hard decision to make even with the help of competent medical staff; it costs money and is hard to set up

### 6. Bibliography  
[[1] https://calgaryherald.com](https://calgaryherald.com/sports/football/cfl/cfls-glen-johnson-explains-the-new-rules-for-2015) - Recap of the 2015 CFL rule changes  
[[2] https://sudouest.fr](https://www.sudouest.fr/2018/07/05/rugby-quelles-sont-les-nouvelles-regles-pour-proteger-la-sante-des-joueurs-5206706-773.php) - French rugby  
[[3] https://bbc.com](https://www.bbc.com/sport/rugby-union/43802462) - Rugby  
[[4] https://playsmartplaysafe.com](https://www.playsmartplaysafe.com/newsroom/reports/2017-injury-data/) - Concussion report