## Examination of efficient NHL roster design

### purpose of notebook:

- import all csv files from the 2010 season
- categorize and merge all csv files to create final data frame
- remove irrelevant observations
- keep the first two games of the season for analysis
- store final data frame

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy

### import data frames

- import play by play csv file

In [2]:
dm = pd.read_csv('source_data/t_play_by_play_o.csv')
dm = dm.sort_values(['Season', 'GameNumber', 'Period', 'EventNumber'], ascending=[True, True, True, True])
dm = dm.rename(columns={'VPlayer1Position': 'VPosition1', 'VPlayer2Position': 'VPosition2', 'VPlayer3Position': 'VPosition3', 'VPlayer4Position': 'VPosition4', 'VPlayer5Position': 'VPosition5', 'VPlayer6Position': 'VPosition6', 'HPlayer1Position': 'HPosition1', 'HPlayer2Position': 'HPosition2', 'HPlayer3Position': 'HPosition3', 'HPlayer4Position': 'HPosition4', 'HPlayer5Position': 'HPosition5', 'HPlayer6Position': 'HPosition6' })
dm = dm [['Season', 'GameNumber', 'EventNumber', 'Period', 'AdvantageType', 'EventTimeFromZero', 'EventTimeFromTwenty', 'EventType', 'EventDetail', 'VPlayer1', 'VPosition1', 'VPlayer2', 'VPosition2', 'VPlayer3', 'VPosition3', 'VPlayer4', 'VPosition4', 'VPlayer5', 'VPosition5', 'VPlayer6', 'VPosition6', 'HPlayer1', 'HPosition1', 'HPlayer2', 'HPosition2', 'HPlayer3', 'HPosition3', 'HPlayer4', 'HPosition4', 'HPlayer5', 'HPosition5', 'HPlayer6', 'HPosition6']]
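
The position renames above follow a regular pattern (`VPlayer1Position` becomes `VPosition1`, and so on for both teams and all six skater slots). As a minimal sketch, the same mapping could be generated instead of typed out; the name `position_renames` is hypothetical and not part of the original notebook.

```python
# Hypothetical helper: build the 12-entry rename mapping programmatically.
# 'V' is the visiting team prefix and 'H' the home team prefix, as in the column names above.
position_renames = {
    f"{side}Player{slot}Position": f"{side}Position{slot}"
    for side in ("V", "H")
    for slot in range(1, 7)
}

# The mapping could then be passed to dm.rename(columns=position_renames).
print(position_renames["VPlayer1Position"])
```
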

- import play by play giveaway csv file and rename columns

In [3]:
dg = pd.read_csv('source_data/t_play_by_play_giveaway_detail_o.csv')
dg = dg.rename(columns={'GivePlayerNumber': 'PlayerNumber', 'GivePlayerLName': 'PlayerName', 'GiveTeamCode': 'TeamCode'})
dg['EventType'] = 'GIVE'

- import play by play takeaway csv file and rename columns

In [4]:
dt = pd.read_csv('source_data/t_play_by_play_takeaway_detail_o.csv')
dt = dt.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dt = dt.rename(columns={'TakePlayerNumber': 'PlayerNumber', 'TakePlayerLName': 'PlayerName', 'TakeTeamCode': 'TeamCode'})
dt['EventType'] = 'TAKE'

- import goal detail csv file and rename column

In [5]:
dl = pd.read_csv('source_data/t_play_by_play_goal_detail_o.csv')
dl = dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dl = dl.rename(columns={'PlayerLName': 'PlayerName'})
dl['EventType'] = 'GOAL'

- import play by play shot detail csv file and rename column

In [6]:
ds = pd.read_csv('source_data/t_play_by_play_shot_detail_o.csv')
ds = ds.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
ds = ds.rename(columns={'PlayerLName': 'PlayerName'})
ds['EventType'] = 'SHOT'

- import play by play miss detail csv file and rename column

In [7]:
dn = pd.read_csv('source_data/t_play_by_play_miss_detail_o.csv')
dn = dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dn = dn.rename(columns={'PlayerLName': 'PlayerName'})
dn['EventType'] = 'MISS'

- import play by play block detail csv file and rename columns

In [8]:
db = pd.read_csv('source_data/t_play_by_play_block_detail_o.csv')
db = db.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
db = db.rename(columns={'ShotPlayerLName': 'ShotPlayerName', 'BlockPlayerLName': 'PlayerName', 'BlockPlayerNumber': 'PlayerNumber', 'BlockTeamCode': 'TeamCode'})
db['EventType'] = 'BLOCK'

- import play by play hit detail csv file and rename columns

In [9]:
dh = pd.read_csv('source_data/t_play_by_play_hit_detail_o.csv')
dh = dh.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dh = dh.rename(columns={'HitterTeamCode': 'TeamCode', 'HitterPlayerNumber': 'PlayerNumber', 'HitterPlayerLName': 'HitterPlayerName', 'HitteePlayerLName': 'HitteePlayerName'})
dh['EventType'] = 'HIT'

- import play by play faceoff detail csv file and rename columns

In [10]:
df = pd.read_csv('source_data/t_play_by_play_faceoff_detail_o.csv')
df = df.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
df = df.rename(columns={'WinTeamCode': 'TeamCode', 'VPlayerLName': 'VName', 'HPlayerLName': 'HName', 'VPlayerNumber': 'VPlayer', 'HPlayerNumber': 'HPlayer'})
df['EventType'] = 'FAC'


- import play by play penalty detail csv file and rename columns

In [11]:
dp = pd.read_csv('source_data/t_play_by_play_penalty_detail_o.csv')
dp = dp.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dp = dp.rename(columns={'PenaltyTeamCode': 'TeamCode', 'PenaltyPlayerNumber': 'PlayerNumber', 'PenaltyPlayerLName': 'PenaltyPlayerName', 'DrawnByPlayerLName': 'DrawnByPlayerName'})
dp['EventType'] = 'PENL'
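
Most of the detail files above go through the same four steps: read, sort, rename, tag with an event type. The sketch below wraps those steps in one function. The function name `load_detail` and the toy rows are assumptions for illustration only; an in-memory csv stands in for the real source files.

```python
import io
import pandas as pd

SORT_KEYS = ['Season', 'GameNumber', 'EventNumber']

def load_detail(source, event_type, renames=None):
    """Hypothetical helper: read a detail csv, sort it, rename columns, tag the event type."""
    frame = pd.read_csv(source)
    frame = frame.sort_values(SORT_KEYS)
    frame = frame.rename(columns=renames or {})
    frame['EventType'] = event_type
    return frame

# Made-up rows standing in for a takeaway detail file.
toy_csv = io.StringIO(
    "Season,GameNumber,EventNumber,TakePlayerNumber,TakePlayerLName,TakeTeamCode,Zone\n"
    "2010,20002,7,12,SMITH,TOR,Off\n"
    "2010,20001,15,8,JONES,MTL,Def\n"
)
toy_take = load_detail(
    toy_csv,
    'TAKE',
    {'TakePlayerNumber': 'PlayerNumber', 'TakePlayerLName': 'PlayerName', 'TakeTeamCode': 'TeamCode'},
)
print(toy_take)
```
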

- import game detail csv file 

In [12]:
dd = pd.read_csv('source_data/t_game_detail_o.csv')

- import scoring csv file and rename columns

In [13]:
dc = pd.read_csv('source_data/t_scoring_detail_o.csv')
dc['GoalTime'] = dc['Time'].copy()
dc = dc.rename(columns={'Time': 'EventTimeFromZero', 'GoalType': 'AdvantageType'})
dc = dc [['Season', 'GameNumber', 'GoalNumber', 'Period', 'GoalTime', 'EventTimeFromZero', 'AdvantageType', 'TeamCode', 'Assist1Player', 'Assist2Player']]

- import shift csv file and rename columns

In [14]:
dz = pd.read_csv('source_data/t_player_shift_o.csv')
dz = dz.rename(columns={'season': 'Season', 'gamenumber': 'GameNumber', 'period': 'Period', 'teamcode': 'TeamCode', 'playernumber': 'PlayerNumber'})
dz = dz [['Season', 'GameNumber', 'Period', 'TeamCode', 'PlayerNumber', 'starttime', 'endtime']]

### group and merge data frames based on similarity

After renaming, the giveaway (dg) and takeaway (dt) data frames share the same columns, so they are combined first. dt is merged onto dg:

### merge giveaways and takeaways 

In [15]:
dg = dg.merge(dt, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone'], how='outer')
dg.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**dg** now contains observations from both types of events. The merged data frame is sorted in ascending order by season, game number and event number.
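
When two frames share all of their columns and the outer merge uses every one of them as a key, the result behaves like stacking the rows. The toy example below illustrates this with made-up rows; it is a sketch of the idea, not a check run on the real data.

```python
import pandas as pd

keys = ['Season', 'GameNumber', 'EventNumber', 'EventType',
        'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone']

# Made-up rows: one giveaway and one takeaway.
give = pd.DataFrame([[2010, 20001, 5, 'GIVE', 'TOR', 'SMITH', 12, 'Def']], columns=keys)
take = pd.DataFrame([[2010, 20001, 9, 'TAKE', 'MTL', 'JONES', 8, 'Off']], columns=keys)

merged = give.merge(take, on=keys, how='outer')     # approach used in this notebook
stacked = pd.concat([give, take], ignore_index=True)  # row-wise stacking for comparison

merged_rows = merged.sort_values(keys).values.tolist()
stacked_rows = stacked.sort_values(keys).values.tolist()
print(merged_rows == stacked_rows)
```
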

### merge goal, shot, miss and block data frames

Goals, misses and blocks are all outcomes of a shot. For that reason, these data frames are merged together.

- merge shot data frame (ds) onto goal data frame (dl):

In [16]:
dl = dl.merge(ds, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone', 'ShotType', 'Length'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)


**dl** now contains observations from both the shot and goal data sets. The merged data frame is sorted in ascending order by season, game number and event number.

- merge miss data frame (dn) onto goal data frame (dl):

In [17]:
dl = dl.merge(dn, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone', 'ShotType', 'Length'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**dl** now contains observations from goal, shot and miss events. The merged data frame is sorted in ascending order by season, game number and event number.

- merge block data frame (db) onto goal data frame (dl):

In [18]:
dl = dl.merge(db, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone', 'ShotType'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**dl** now contains observations from goal, shot, miss and block events. The merged data frame is sorted in ascending order by season, game number and event number.

### merge faceoff, hit and penalty together

- merge hit data frame (dh) onto faceoff data frame (df):

In [19]:
df = df.merge(dh, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'Zone', 'TeamCode'], how='outer')
df.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**df** now contains observations from both faceoff and hit events. The merged data frame is sorted in ascending order by season, game number and event number.

- merge penalty data frame (dp) onto faceoff data frame (df):

In [20]:
df = df.merge(dp, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerNumber', 'Zone'], how='outer')
df.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**df** now contains observations from faceoff, hit and penalty events. The merged data frame is sorted in ascending order by season, game number and event number.

### merge merged data frames

Three merged data frames have been created: dg, dl and df. The next step is to combine them into a single data frame.

- merge dg onto dl:

In [21]:
dl = dl.merge(dg, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)


**dl** now contains observations from goal, shot, miss, block, giveaway and takeaway events. The merged data frame is sorted in ascending order by season, game number and event number.

- merge df onto dl:

In [22]:
dl = dl.merge(df, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'TeamCode','PlayerNumber', 'Zone'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)


**dl** now contains observations from goal, shot, miss, block, giveaway, takeaway, faceoff, hit and penalty events. The merged data frame is sorted in ascending order by season, game number and event number.

### merge game detail (dd) onto all on-ice events (dl)

In [23]:
dl = dl.merge(dd, on=['Season', 'GameNumber', 'VTeamCode', 'HTeamCode'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

### merge player shifts (dz) onto all on-ice events (dl)

In [24]:
dl = dl.merge(dz, on=['Season', 'GameNumber', 'TeamCode', 'PlayerNumber'], how='outer')
dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)
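
A merge keyed on season, game number, team code and player number matches each event to every shift that player took in that game, so the row count can grow. The toy example below shows the effect on made-up rows; the time values and their units are assumptions, and comparing row counts before and after is only a suggested sanity check.

```python
import pandas as pd

keys = ['Season', 'GameNumber', 'TeamCode', 'PlayerNumber']

# Made-up data: one event by one player, and three shifts by the same player in that game.
events = pd.DataFrame({
    'Season': [2010], 'GameNumber': [20001], 'EventNumber': [5],
    'TeamCode': ['TOR'], 'PlayerNumber': [12],
})
shifts = pd.DataFrame({
    'Season': [2010] * 3, 'GameNumber': [20001] * 3,
    'TeamCode': ['TOR'] * 3, 'PlayerNumber': [12] * 3,
    'Period': [1, 1, 2],
    'starttime': [0, 300, 60], 'endtime': [45, 350, 110],
})

expanded = events.merge(shifts, on=keys, how='outer')
rows_before, rows_after = len(events), len(expanded)
print(rows_before, rows_after)
```
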

### merge all on-ice events (dl) onto play by play (dm)

The merged data frame containing all on-ice events (dl) is merged onto the play by play data frame (dm) by season, game number, event number, event type and period.

In [25]:
dm = dm.merge(dl, on=['Season', 'GameNumber', 'EventNumber', 'EventType', 'Period'], how='outer')
dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

### merge scoring detail (dc) onto play by play (dm)

The final merge adds the scoring detail data frame (dc) to the play by play data frame (dm). The purpose is to bring the first and second assist players for each goal into the data frame that will be used for the player evaluation analysis.

In [26]:
dm = dm.merge(dc, on=['Season', 'GameNumber', 'Period', 'EventTimeFromZero', 'AdvantageType', 'TeamCode'], how='outer')
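
To illustrate what this merge does, the sketch below attaches assist columns to a goal event in a two-row toy frame. All values are made up, and the `MM:SS` string format used for `EventTimeFromZero` is an assumption about the source files.

```python
import pandas as pd

keys = ['Season', 'GameNumber', 'Period', 'EventTimeFromZero', 'AdvantageType', 'TeamCode']

# Made-up play by play rows: a shot and a goal by the same team.
events = pd.DataFrame({
    'Season': [2010, 2010],
    'GameNumber': [20001, 20001],
    'Period': [1, 1],
    'EventTimeFromZero': ['04:10', '05:30'],
    'AdvantageType': ['EV', 'EV'],
    'TeamCode': ['TOR', 'TOR'],
    'EventType': ['SHOT', 'GOAL'],
})
# Made-up scoring detail row matching the goal.
scoring = pd.DataFrame({
    'Season': [2010], 'GameNumber': [20001], 'Period': [1],
    'EventTimeFromZero': ['05:30'], 'AdvantageType': ['EV'], 'TeamCode': ['TOR'],
    'Assist1Player': [19], 'Assist2Player': [44],
})

combined = events.merge(scoring, on=keys, how='outer')
goal_row = combined[combined['EventType'] == 'GOAL'].iloc[0]
shot_rows = combined[combined['EventType'] == 'SHOT']
print(combined[['EventType', 'Assist1Player', 'Assist2Player']])
```

Only the goal row receives assist values; the shot row is left with missing values in the assist columns.
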

### keep only regular season games

In [27]:
dm = dm[dm['GameNumber'] <= 21230]

### remove irrelevant observations


TV time-outs, goalie time-outs and icings are recorded as "stoppage" events. They are removed from the data frame because they have no impact on the probability of a goal being scored.

In [28]:
dm = dm[dm['EventType']!='STOP']
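
As a small illustration of this filter on made-up rows: the `!=` comparison drops the `STOP` row and keeps everything else, including a row whose event type is missing.

```python
import numpy as np
import pandas as pd

# Made-up events: a faceoff, a stoppage, a shot and a row with a missing event type.
toy = pd.DataFrame({
    'EventNumber': [1, 2, 3, 4],
    'EventType': ['FAC', 'STOP', 'SHOT', np.nan],
})

kept = toy[toy['EventType'] != 'STOP']
print(kept)
```
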

### exclude overtime and shootouts

In [29]:
dm = dm[dm['Period'] <= 3]
dm = dm[dm['Period'] >= 1]
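
The two period filters can also be written as a single range check with `Series.between`, which is inclusive on both ends by default. A minimal sketch on made-up period values, assuming periods are stored as integers:

```python
import pandas as pd

# Made-up periods: 1 to 3 are regulation, 4 and 5 stand in for overtime and shootout.
toy = pd.DataFrame({'Period': [1, 2, 3, 4, 5]})

# Two-step filter, as in the cell above.
two_step = toy[toy['Period'] <= 3]
two_step = two_step[two_step['Period'] >= 1]

# Equivalent one-step filter.
one_step = toy[toy['Period'].between(1, 3)]
print(list(one_step['Period']))
```
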

### man-advantage scenarios

In [30]:
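# Drop power-play (PP) and short-handed (SH) events; rows with a missing AdvantageType are kept
value_list = ['PP', 'SH']
dm = dm[~dm['AdvantageType'].isin(value_list)].copy()
# Label the remaining rows with a missing AdvantageType as even strength (EV)
dm['AdvantageType'] = dm['AdvantageType'].fillna('EV')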

Since the player evaluation model uses only even-strength situations, man-advantage scenarios are dropped from the data frame. Any remaining missing advantage type is labelled as even strength (EV).
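
The toy example below sketches how this filter treats missing values: rows tagged `PP` or `SH` are dropped, while a row with no advantage type survives and is then labelled `EV`. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up events with a power play, a missing value, a short-handed and an even-strength row.
toy = pd.DataFrame({
    'EventNumber': [1, 2, 3, 4],
    'AdvantageType': ['PP', np.nan, 'SH', 'EV'],
})

# isin() is False for missing values, so negating it keeps them.
even = toy[~toy['AdvantageType'].isin(['PP', 'SH'])].copy()
even['AdvantageType'] = even['AdvantageType'].fillna('EV')
print(even)
```
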

### store final data frame

The merged play by play data frame is written to `out_data/pbpmerge.csv`.

In [31]:
# index=False (a boolean, not the string 'False') keeps the row index out of the file
dm.to_csv('out_data/pbpmerge.csv', index=False, sep=',')
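
The `index` argument of `to_csv` is a boolean that controls whether the row index is written as an extra leading column. The sketch below compares the header line produced by both settings on a made-up frame, writing to in-memory buffers rather than to disk.

```python
import io
import pandas as pd

# Made-up two-row frame.
toy = pd.DataFrame({'GameNumber': [20001, 20002], 'EventType': ['SHOT', 'GOAL']})

with_index = io.StringIO()
toy.to_csv(with_index, index=True)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

With `index=True` the header starts with an empty field for the unnamed index; with `index=False` only the data columns are written.
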

The next step is to reshape the data set from wide to long.