# Table of contents

- <a href='#intro'><b>1 Introduction</b></a>
- <a href='#importing'><b>2 Importing and installing dependencies</b></a>
- <a href='#game_data'><b>3 Game Data</b></a>
    - <a href='#null'><b>3.1 Checking for Null values</b></a>
        - <a href='#null_needs'>3.1.1 Checking whether or not we need the columns with missing values</a>
        - <a href='#null_drop'>3.1.2 Dropping unnecessary columns</a>
    - <a href='#types'><b>3.2 Checking column types</b></a>
        - <a href='#memory_usage'> 3.2.1 Changing column types for less memory usage</a>
    - <a href='#scatter_plot'><b>3.3 Plotting the games of the season</b></a>
- <a href='#video_footage_injury'><b>4 Video Footage Injury</b></a>
    - <a href='#video_footage_injury_null'><b>4.1 Checking for Null values</b></a>
    - <a href='#video_footage_injury_types'><b>4.2 Checking column types</b></a>
    - <a href='#video_footage_injury_season'><b>4.3 Plotting concussions by season and year</b></a>
        - <a href='#video_footage_injury_season_season'>4.3.1 Concussions by season and year</a>
        - <a href='#video_footage_injury_season_week'>4.3.2 Concussions by week and quarter</a>
- <a href='#video_review'><b>5 Video Review</b></a>
    - <a href='#video_review_null'><b>5.1 Checking for Null values</b></a>
        - <a href='#video_review_null_needs'>5.1.1 Checking whether or not we need the columns with missing values</a>
    - <a href='#video_review_types'><b>5.2 Checking column types</b></a>
        - <a href='#video_review_usage'> 5.2.1 Changing column types for less memory usage</a>
    - <a href='#video_review_plot'><b>5.3 Plotting concussions by category</b></a>
- <a href='#video_footage_control'><b>6 Video Footage Control</b></a>
    - <a href='#video_footage_control_null'><b>6.1 Checking for Null values</b></a>
    - <a href='#video_footage_control_types'><b>6.2 Checking column types</b></a>
    - <a href='#video_footage_control_types_plot'><b>6.3 Plotting video reviews by quarter</b></a>
- <a href='#play_information'><b>7 Play Information</b></a>
    - <a href='#play_information_null'><b>7.1 Checking for Null values</b></a>
    - <a href='#play_information_types'><b>7.2 Checking column types</b></a>
    - <a href='#play_information_types_plot'><b>7.3 Plotting total amount of plays per week</b></a>
- <a href='#play_player_role_data'><b>8 Play Player Role Data</b></a>
    - <a href='#play_player_role_data_null'><b>8.1 Checking for Null values</b></a>
    - <a href='#play_player_role_data_types'><b>8.2 Checking column types</b></a>
    - <a href='#play_player_role_data_plot'><b>8.3 Plotting amount of players per role</b></a>
- <a href='#player_punt_data'><b>9 Player Punt Data</b></a>
    - <a href='#player_punt_data_null'><b>9.1 Checking for Null values</b></a>
    - <a href='#player_punt_data_types'><b>9.2 Checking column types</b></a>
    - <a href='#player_punt_data_plot'><b>9.3 Plotting each player's position</b></a>
- <a href='#player_data'><b>10 Player Data</b></a>
    - <a href='#player_data_plotting_injuries'><b>10.1 Plotting injuries from punts</b></a>
    - <a href='#player_data_plotting_aggressive_teams'><b>10.2 Plotting most aggressive teams</b></a>
    - <a href='#player_data_plotting_locations'><b>10.3 Plotting location of injuries in the field</b></a>
    - <a href='#player_data_plotting_time'><b>10.4 Plotting time of injuries</b></a>
- <a href='#play_data'><b>11 Play Data</b></a>

# <a id='intro'><b>1 Introduction:</b></a>

The National Football League is America's most popular sports league, comprised of 32 franchises that compete each year to win the Super Bowl, the world's biggest annual sporting event. Founded in 1920, the NFL developed the model for the successful modern sports league, including national and international distribution, extensive revenue sharing, competitive excellence, and strong franchises across the country.

The NFL is committed to advancing progress in the diagnosis, prevention and treatment of sports-related injuries. The NFL's ongoing health and safety efforts include support for independent medical research and engineering advancements and a commitment to look at anything and everything to protect players and make the game safer, including enhancements to medical protocols and improvements to how our game is taught and played.

As more is learned, the league evaluates and changes rules to evolve the game and try to improve protections for players. Since 2002 alone, the NFL has made 50 rules changes intended to eliminate potentially dangerous tactics and reduce the risk of injuries.

For more information about the NFL's health and safety efforts, please visit www.PlaySmartPlaySafe.com.

# <a id='importing'><b>2 Importing and installing dependencies:</b></a>

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set()

print('All dependencies installed')

# <a id='game_data'><b>3 Game Data:</b></a>

In [None]:
def lower_case(dataset):
    dataset.columns = [col.lower() for col in dataset.columns]
    return dataset

game_data = pd.read_csv('../input/game_data.csv', parse_dates=True)
game_data = lower_case(game_data)
print(game_data.shape)
game_data.head()

## <a id='null'><b>3.1 Checking for Null values</b></a>

In [None]:
def num_missing(x):
    # Print the number of missing values in each column
    print('Missing values per column:\n', x.isnull().sum())

num_missing(game_data)
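Raw counts are hard to compare across columns, so a percentage view can help when deciding which columns to keep. The following is a minimal sketch: `percent_missing` is a hypothetical helper that is not part of the original notebook, and the toy frame is made up purely for illustration.

```python
import pandas as pd

def percent_missing(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Made-up toy frame, only to show the shape of the output
toy = pd.DataFrame({
    'turf': ['Grass', None, 'Turf', 'Grass'],
    'temperature': [70.0, None, None, 55.0],
    'week': [1, 2, 3, 4],
})

missing_pct = percent_missing(toy)
print(missing_pct)
```

Calling `percent_missing(game_data)` would give the same kind of summary for the real data.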

### <a id='null_needs'>3.1.1 Checking whether or not we need the columns with missing values</a>

In [None]:
stadium_type = game_data['stadiumtype'].value_counts()
turf = game_data['turf'].value_counts()
game_weather = game_data['gameweather'].value_counts()
temperature = game_data['temperature'].value_counts()
outdoor_weather = game_data['outdoorweather'].value_counts()

print(stadium_type, '\n', '-'*50, '\n', turf, '\n', '-'*50, '\n', game_weather, '\n', '-'*50, '\n', temperature,  '\n', '-'*50, '\n', outdoor_weather)

- stadium_type: most of the values are repeated or labelled differently but mean the same thing. Most stadiums are outdoors, and a few have retractable roofs. The few missing values can be looked up easily.
- turf: same as stadium_type; most values are the same but labelled differently, and the missing ones can be looked up easily.
- game_weather: around 15% of the values are missing in this column. These can be looked up, but it will be more time-consuming.
- temperature: around 10% of the values are missing in this column, which seems a little odd since game_weather has more missing values.
- outdoor_weather: around 38% of the data is missing, and the column may not have a very big impact on the analysis.
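Since several labels mean the same thing but are spelled differently, one option is to collapse them into a few coarse groups before plotting. This is only a sketch: the raw labels and the matching rules below are assumptions made for illustration, and `normalize_stadium_type` is a hypothetical helper. The real column should be inspected with `value_counts()` first.

```python
import pandas as pd

# Hypothetical raw labels; the real column has its own spellings
raw = pd.Series(['Outdoor', 'outdoors', 'Outdoors ', 'Indoor',
                 'Retr. Roof - Closed', None])

def normalize_stadium_type(value):
    """Collapse free-text stadium labels into coarse groups (assumed rules)."""
    if pd.isnull(value):
        return value  # leave missing values untouched
    text = value.strip().lower()
    if 'retr' in text:
        return 'retractable'
    if text.startswith('out'):
        return 'outdoor'
    if text.startswith('in') or 'dome' in text:
        return 'indoor'
    return 'other'

cleaned = raw.map(normalize_stadium_type)
print(cleaned.value_counts(dropna=False))
```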

### <a id='null_drop'>3.1.2 Dropping unnecessary columns</a>

In [None]:
game_data = game_data.drop(columns=['outdoorweather', 'gameweather'])
game_data.info()

After dropping these 2 columns and changing the types of the remaining columns (section 3.2.1), the memory usage went down by more than 21 KB.

## <a id='types'><b>3.2 Checking column types</b></a>

In [None]:
game_data.info()

### <a id='memory_usage'> 3.2.1 Changing column types for less memory usage</a>

In [None]:
category_columns = ['season_type', 'stadiumtype', 'turf']
float_columns = ['temperature']

game_data[category_columns] = game_data[category_columns].astype('category')
game_data[float_columns] = game_data[float_columns].astype(float)
# Keep only the date part and store the parsed result back in the frame
game_data['game_date'] = pd.to_datetime(game_data['game_date'].str.split(expand=True)[0], format='%Y-%m-%d')

game_data.info()

## <a id='scatter_plot'><b>3.3 Plotting the games of the season</b></a>

In [None]:
plt.figure(figsize=(20, 9))

_ = sns.scatterplot(x='home_team', y='visit_team', hue='week', data=game_data)
plt.xticks(rotation=90, fontsize=14)
plt.yticks(fontsize=15)
plt.xlabel('Home Team')
plt.ylabel('Visiting Team')

plt.show()

It seems that the dataset includes entries for NFC and AFC, which are conferences and not teams.
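If these conference entries should be kept out of team-level plots, they can be filtered with a boolean mask. This is a minimal sketch on made-up rows; it assumes that 'AFC' and 'NFC' are the only non-team codes in the 'home_team' and 'visit_team' columns, which should be checked against the real data.

```python
import pandas as pd

# Made-up rows for illustration only
games = pd.DataFrame({
    'home_team': ['NE', 'AFC', 'GB', 'DAL'],
    'visit_team': ['NYJ', 'NFC', 'CHI', 'NFC'],
})

conference_labels = ['AFC', 'NFC']  # assumed to be the only non-team codes

# True for any row where either side is a conference rather than a team
is_conference_game = (
    games['home_team'].isin(conference_labels)
    | games['visit_team'].isin(conference_labels)
)

team_games = games[~is_conference_game]
print(team_games)
```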

# <a id='video_footage_injury'><b>4 Video Footage Injury:</b></a>

In [None]:
video_footage_injury = pd.read_csv('../input/video_footage-injury.csv', parse_dates=True)
video_footage_injury = lower_case(video_footage_injury)
print(video_footage_injury.shape)
video_footage_injury.head()

## <a id='video_footage_injury_null'><b>4.1 Checking for Null values</b></a>

In [None]:
num_missing(video_footage_injury)

## <a id='video_footage_injury_types'><b>4.2 Checking column types</b></a>

In [None]:
video_footage_injury.info()

## <a id='video_footage_injury_season'><b>4.3 Plotting concussions by season and year</b></a>
### <a id='video_footage_injury_season_season'>4.3.1 Concussions by season and year</a>

In [None]:
plt.figure(figsize=(20, 7.5))

plt.subplot(1, 2, 1)
_ = sns.countplot(x='type', data=video_footage_injury)
plt.title('Concussions per season type:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Type of season', fontsize=15)
plt.ylabel('Total amount of concussions', fontsize=15)

plt.subplot(1, 2, 2)
_ = sns.countplot(x='season', data=video_footage_injury)
plt.title('Concussions per year:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Year', fontsize=15)
plt.ylabel('Total amount of concussions', fontsize=15)

plt.tight_layout()
plt.show()

More concussions occur during the regular season than during the pre-season, and 2016 shows a slight increase in concussions.

### <a id='video_footage_injury_season_week'>4.3.2 Concussions by week and quarter</a>

In [None]:
plt.figure(figsize=(20, 10))

plt.subplot(2, 1, 1)
_ = sns.swarmplot(x='week', y='type', hue='qtr', data=video_footage_injury)
plt.title('Concussions per week:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Week', fontsize=15)
plt.ylabel('Season Type', fontsize=15)

plt.subplot(2, 1, 2)
_ = sns.boxplot(x='qtr', y='week', data=video_footage_injury)
plt.title('Concussions per quarter:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Quarter', fontsize=15)
plt.ylabel('Week', fontsize=15)

plt.tight_layout()
plt.show()

It seems that concussions become more prevalent as the season reaches its final stages. This may be related to teams trying to make the playoffs rather than getting eliminated. Also, most concussions occur during the 4th quarter, followed by the 2nd quarter. The 1st, 2nd, and 3rd quarters show an almost identical distribution.
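The plots give a visual impression; a small table of counts can back it up with numbers. The sketch below uses `pd.crosstab` on made-up rows that reuse the 'qtr' and 'type' column names; the values are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'type': ['Reg', 'Reg', 'Pre', 'Reg', 'Pre', 'Reg'],
    'qtr': [4, 2, 4, 4, 1, 3],
})

# Rows are quarters, columns are season types, 'All' holds the totals
by_quarter = pd.crosstab(toy['qtr'], toy['type'], margins=True)
print(by_quarter)
```

Running the same call on `video_footage_injury` would show how many concussions fall in each quarter for each season type.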

# <a id='video_review'><b>5 Video Review:</b></a>

In [None]:
video_review = pd.read_csv('../input/video_review.csv')
video_review = lower_case(video_review)
print(video_review.shape)
video_review.head()

## <a id='video_review_null'><b>5.1 Checking for Null values</b></a>

In [None]:
num_missing(video_review)

### <a id='video_review_null_needs'>5.1.1 Checking whether or not we need the columns with missing values</a>

In [None]:
video_review = video_review.drop(columns=['primary_partner_gsisid'])
video_review = video_review.dropna()
video_review.info()

I decided to drop the column 'Primary_Partner_GSISID', which was not relevant, and to drop the 2 rows with missing values.

## <a id='video_review_types'><b>5.2 Checking column types</b></a>

In [None]:
video_review.info()

### <a id='video_review_usage'> 5.2.1 Changing column types for less memory usage</a>

In [None]:
category_columns = ['player_activity_derived', 'turnover_related', 'primary_impact_type', 'primary_partner_activity_derived', 'friendly_fire']

video_review[category_columns] = video_review[category_columns].astype('category')
video_review.info()

By dropping one of the columns and changing the type of 5 columns, memory usage was brought down by 0.5 KB.
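The saving comes from storing each repeated label once and keeping only small integer codes per row. A minimal way to measure it is to compare `memory_usage(deep=True)` before and after the conversion. The toy column below is made up and only reuses the 'friendly_fire' name for illustration.

```python
import pandas as pd

# Made-up column with a few labels repeated many times
toy = pd.DataFrame({'friendly_fire': ['No', 'Yes', 'Unclear', 'No'] * 250})

# deep=True counts the actual string contents, not just the pointers
before = toy.memory_usage(deep=True).sum()
toy['friendly_fire'] = toy['friendly_fire'].astype('category')
after = toy.memory_usage(deep=True).sum()

print(f'before: {before} bytes, after: {after} bytes')
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be.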

## <a id='video_review_plot'><b>5.3 Plotting concussions by category</b></a>

In [None]:
plt.figure(figsize=(20, 7.5))

plt.subplot(1, 2, 1)
_ = sns.countplot(x='player_activity_derived', data=video_review)
plt.title('Concussion-related incidents:', fontsize=20)
plt.xlabel('')
plt.ylabel('')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

plt.subplot(1, 2, 2)
_ = sns.countplot(x='friendly_fire', data=video_review)
plt.title('Incidents from same team or opposing:', fontsize=20)
plt.xlabel('')
plt.ylabel('')
plt.xticks([0, 1, 2], ['Opposing Team', 'Unclear', 'Same Team'], fontsize=15)
plt.yticks(fontsize=15)

plt.tight_layout()
plt.show()

Most concussions occur while tackling a player from the opposing team. New rules for the players doing the tackling might lower the incident rate.

# <a id='video_footage_control'><b>6 Video Footage Control:</b></a>

In [None]:
video_footage_control = pd.read_csv('../input/video_footage-control.csv')
video_footage_control = lower_case(video_footage_control)
print(video_footage_control.shape)
video_footage_control.head()

## <a id='video_footage_control_null'><b>6.1 Checking for Null values</b></a>

In [None]:
num_missing(video_footage_control)

## <a id='video_footage_control_types'><b>6.2 Checking column types</b></a>

In [None]:
video_footage_control.info()

## <a id='video_footage_control_types_plot'><b>6.3 Plotting video reviews by quarter</b></a>

In [None]:
quarter_count = video_footage_control['qtr'].value_counts()

plt.figure(figsize=(20, 7.5))

_ = sns.scatterplot(x='qtr', y='home_team', data=video_footage_control)
plt.title('Distribution of video reviews per quarter:', fontsize=20)
plt.xlabel('Quarter', fontsize=15)
plt.ylabel('')
plt.xticks([1, 2, 3, 4], fontsize=15)
plt.yticks(fontsize=15)

plt.show()

In 'video_footage_control', most reviews occur in the 1st quarter, and most teams show a similar number of reviews.

# <a id='play_information'><b>7 Play Information:</b></a>

In [None]:
play_information = pd.read_csv('../input/play_information.csv')
play_information = lower_case(play_information)
print(play_information.shape)
play_information.head()

## <a id='play_information_null'><b>7.1 Checking for Null values</b></a>

In [None]:
num_missing(play_information)

## <a id='play_information_types'><b>7.2 Checking column types</b></a>

In [None]:
play_information.info()

In [None]:
category_columns = ['season_type']
play_information[category_columns] = play_information[category_columns].astype('category')
play_information['game_clock'] = pd.to_datetime(play_information['game_clock'], format='%H:%M').dt.time
play_information.info()
play_information.head()

## <a id='play_information_types_plot'><b>7.3 Plotting total amount of plays per week</b></a>

In [None]:
plt.figure(figsize=(20, 7.5))

_ = sns.countplot(x='week', data=play_information)
# _ = sns.swarmplot(x='Week', y='YardLine', data=play_information, hue='Poss_Team')
plt.title('Total amount of plays per week:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Week', fontsize=15)
plt.ylabel('')

plt.show()

The number of plays per week is fairly even across the season, with the exception of weeks 2 through 5, which jump to almost double the closest other week.

# <a id='play_player_role_data'><b>8 Play Player Role Data:</b></a>

In [None]:
play_player_role_data = pd.read_csv('../input/play_player_role_data.csv')
play_player_role_data = lower_case(play_player_role_data)
print(play_player_role_data.shape)
play_player_role_data.head()

## <a id='play_player_role_data_null'><b>8.1 Checking for Null values</b></a>

In [None]:
num_missing(play_player_role_data)

## <a id='play_player_role_data_types'><b>8.2 Checking column types</b></a>

In [None]:
play_player_role_data.info()

## <a id='play_player_role_data_plot'><b>8.3 Plotting amount of players per role</b></a>

In [None]:
plt.figure(figsize=(20, 7.5))

_ = sns.countplot(x='role', data=play_player_role_data)
plt.title('Counting the total amount of players in each position per play:', fontsize=20)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Role', fontsize=15)
plt.ylabel('')

plt.show()

This chart should NOT be confused with the total number of players in each position, since each row represents a play and not a game. Many plays can occur in each game.
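The difference between rows per role and distinct players per role can be made explicit with a `groupby`. The sketch below uses a made-up frame and assumes that the lower-cased columns are named 'gamekey', 'playid', 'gsisid', and 'role'.

```python
import pandas as pd

# Made-up rows: the same two players appear in two different plays
toy = pd.DataFrame({
    'gamekey': [1, 1, 1, 1],
    'playid': [10, 10, 20, 20],
    'gsisid': [100, 200, 100, 200],
    'role': ['P', 'PR', 'P', 'PR'],
})

# What the countplot shows: one count per (play, player) row
rows_per_role = toy['role'].value_counts()

# Distinct players that ever filled each role
players_per_role = toy.groupby('role')['gsisid'].nunique()

print(rows_per_role)
print(players_per_role)
```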

# <a id='player_punt_data'><b>9 Player Punt Data:</b></a>

In [None]:
player_punt_data = pd.read_csv('../input/player_punt_data.csv')
player_punt_data = lower_case(player_punt_data)
print(player_punt_data.shape)
player_punt_data.head()

## <a id='player_punt_data_null'><b>9.1 Checking for Null values</b></a>

In [None]:
num_missing(player_punt_data)

## <a id='player_punt_data_types'><b>9.2 Checking column types</b></a>

In [None]:
player_punt_data.info()

## <a id='player_punt_data_plot'><b>9.3 Plotting each player's position</b></a>

In [None]:
plt.figure(figsize=(20, 7.5))

_ = sns.countplot(x='position', data=player_punt_data)
plt.title('Count of each individual position:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Positions', fontsize=15)
plt.ylabel('')

plt.show()

After importing, plotting and analyzing each of the datasets, I can see that some datasets can be combined in order to have a better picture of the overall injuries.

# <a id='player_data'><b>10 Player Data:</b></a>

In [None]:
player_data = pd.merge(play_player_role_data, player_punt_data)
player_data = pd.merge(player_data, play_information)
player_data = pd.merge(player_data, video_review)
player_data.head()
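The merges above rely on pandas matching every column name the two frames share. Naming the keys explicitly and adding `indicator=True` makes it easier to see which rows actually matched. This is a sketch on made-up frames; the key columns 'gamekey', 'playid', and 'gsisid' are assumed to be the shared identifiers.

```python
import pandas as pd

# Made-up frames for illustration only
roles = pd.DataFrame({
    'gamekey': [1, 1, 2],
    'playid': [10, 10, 30],
    'gsisid': [100, 200, 300],
    'role': ['P', 'PR', 'P'],
})
reviews = pd.DataFrame({
    'gamekey': [1],
    'playid': [10],
    'gsisid': [200],
    'player_activity_derived': ['Tackled'],
})

# A left join keeps every role row; '_merge' reports where each row came from
merged = pd.merge(
    roles, reviews,
    on=['gamekey', 'playid', 'gsisid'],
    how='left',
    indicator=True,
)
print(merged['_merge'].value_counts())
```

With the default inner join used in the notebook, only the rows marked 'both' would remain.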

## <a id='player_data_plotting_injuries'><b>10.1 Plotting injuries from punts</b></a>

In [None]:
player_data = player_data.sort_values('poss_team')

plt.figure(figsize=(20, 7.5))

_ = sns.countplot(x='poss_team', data=player_data)
plt.title('Teams that suffered injuries from punts:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Possessing Team', fontsize=15)
plt.ylabel('')

plt.show()

Interestingly, Green Bay seems to get all its injuries through blocks, whereas for other teams tackling and blocking are the main factors. The tackler rarely gets injured; therefore, we can concentrate on rules that protect the player being tackled or blocked and that penalize very aggressive tacklers.

In [None]:
plt.figure(figsize=(20, 7.5))

_ = sns.countplot(y=player_data['player_activity_derived'], hue=player_data['poss_team'])
plt.title('Activity when suffering injuries from punts:', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('', fontsize=15)
plt.ylabel('Player Activity Derived', fontsize=15)
plt.legend(fontsize=15)

plt.show()

## <a id='player_data_plotting_aggressive_teams'><b>10.2 Plotting most aggressive teams</b></a>

In [None]:
plt.figure(figsize=(20, 7.5))

_ = sns.countplot(x='home_team_visit_team', hue='player_activity_derived', data=player_data)
plt.title('Analyzing if teams play more aggressively against others:', fontsize=20)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Home Team / Visiting Team', fontsize=15)
plt.ylabel('')
plt.legend(fontsize=15)

plt.show()

Some teams, like WAS, play more aggressively than others. Maybe fines could be imposed on teams that consistently injure players more often than average.

## <a id='player_data_plotting_locations'><b>10.3 Plotting location of injuries in the field</b></a>

In [None]:
player_data['yard'] = player_data['yardline'].str.split(' ').str[1].astype(int)
player_data.sort_values(by=['yard'], inplace=True)

In [None]:
plt.figure(figsize=(20, 10))

plt.subplot(2, 1, 1)
_ = sns.countplot(x='yard', data=player_data)
plt.title('Analyzing location of injuries:', fontsize=20)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Yard', fontsize=15)
plt.ylabel('')

plt.subplot(2, 1, 2)
_ = sns.swarmplot(x='poss_team', y='yard', data=player_data, hue='quarter', s=7.5, palette=['Green', 'Blue', 'Orange', 'Red'])
plt.title('Analyzing location of injuries by team:', fontsize=20)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Possessing Team', fontsize=15)
plt.ylabel('Yard', fontsize=15)
plt.legend(fontsize=15)

plt.tight_layout()
plt.show()

There is a slight concentration of injuries between the 20- and 30-yard lines, as the team gets closer to the red zone. Surprisingly, only a few were marked in the red zone, and slightly more were in the teens. Also, some teams display more injuries than others.
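To put numbers on the zones mentioned above, the yard value can be parsed from the 'yardline' strings and grouped into bins with `pd.cut`. This is a sketch under assumptions: the strings are taken to look like 'NE 35' (team code, space, yard number), and the bin edges are chosen only for illustration.

```python
import pandas as pd

# Made-up yardline values in the assumed 'TEAM YARD' format
yardline = pd.Series(['NE 35', 'GB 18', 'DAL 50', 'WAS 27'])

# Split each value into the side of the field and the yard number
parts = yardline.str.extract(r'^(?P<side>[A-Z]+)\s+(?P<yard>\d+)$')
yard = parts['yard'].astype(int)

# Bins are right-inclusive: (0, 20], (20, 30], (30, 50]
zones = pd.cut(yard, bins=[0, 20, 30, 50], labels=['1-20', '21-30', '31-50'])
print(zones.value_counts().sort_index())
```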

## <a id='player_data_plotting_time'><b>10.4 Plotting time of injuries</b></a>

In [None]:
player_data.sort_values(['quarter', 'game_clock'])

In [None]:
plt.figure(figsize=(20, 7.5))

# Use the actual number of rows instead of a hard-coded 16
_ = sns.barplot(x='game_clock', y=np.arange(len(player_data)), data=player_data)
plt.xticks(rotation=90)

plt.show()

Not surprisingly, as the clock winds down, the chance of injury slowly increases.
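A numeric "seconds left in the quarter" value can be easier to sort and plot than clock labels. The sketch below assumes that the raw 'game_clock' values are 'MM:SS' strings, as they would be before the conversion to time objects in section 7.2. The sample values are made up.

```python
import pandas as pd

# Made-up clock values in the assumed 'MM:SS' format
clock = pd.Series(['14:52', '02:05', '00:07'])

# Split into minutes and seconds, then combine into total seconds
minutes_seconds = clock.str.split(':', expand=True).astype(int)
seconds_left = minutes_seconds[0] * 60 + minutes_seconds[1]

print(seconds_left.tolist())
```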

# <a id='play_data'><b>11 Play Data:</b></a>

In [None]:
play_data = pd.merge(play_information, game_data, on='gamekey')
# Print the first five play descriptions
for description in play_data['playdescription'].head(5):
    print(description)