## Examination of efficient NHL roster design

### purpose of notebook:

- import all csv files from the 2010 season
- categorize and merge all csv files to create final data frame
- remove irrelevant observations
- keep the first two games of the season for analysis
- store final data frame

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy
import zipfile


pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)




In [2]:
pwd

'/home/kmongeon/Documents/GIT/nhl_roster_design'

In [3]:
zipfile.ZipFile('source_data.zip')

<zipfile.ZipFile filename='source_data.zip' mode='r'>
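Constructing a `ZipFile` object only opens the archive, whereas the `pd.read_csv('source_data/...')` calls further down expect the files to exist on disk. A minimal sketch of the extraction step, using a throwaway archive built in a temporary directory (the member name `example.csv` and its contents are made up for illustration):

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the sketch is self-contained.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, 'source_data.zip')
with zipfile.ZipFile(archive_path, 'w') as zf:
    zf.writestr('source_data/example.csv', 'Season,GameNumber\n2010,20001\n')

# Opening the archive alone writes nothing to disk;
# extractall() unpacks every member under the given directory.
with zipfile.ZipFile(archive_path) as zf:
    member_names = zf.namelist()
    zf.extractall(workdir)

extracted_path = os.path.join(workdir, 'source_data', 'example.csv')
```

For the real archive, the equivalent call would presumably be `zipfile.ZipFile('source_data.zip').extractall()`, assuming the archive stores its members under a `source_data/` folder.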

### import data frames

- import play by play csv file

In [4]:
dm = pd.read_csv('source_data/t_play_by_play_o.csv')
dm = dm.sort_values(['Season', 'GameNumber', 'Period', 'EventNumber'], ascending=[True, True, True, True])
dm = dm.rename(columns={'VPlayer1Position': 'VPosition1', 'VPlayer2Position': 'VPosition2', 'VPlayer3Position': 'VPosition3', 'VPlayer4Position': 'VPosition4', 'VPlayer5Position': 'VPosition5', 'VPlayer6Position': 'VPosition6', 'HPlayer1Position': 'HPosition1', 'HPlayer2Position': 'HPosition2', 'HPlayer3Position': 'HPosition3', 'HPlayer4Position': 'HPosition4', 'HPlayer5Position': 'HPosition5', 'HPlayer6Position': 'HPosition6' })
dm = dm [['Season', 'GameNumber', 'EventNumber', 'Period', 'AdvantageType', 'EventTimeFromZero', 'EventTimeFromTwenty', 'EventType', 'EventDetail', 'VPlayer1', 'VPosition1', 'VPlayer2', 'VPosition2', 'VPlayer3', 'VPosition3', 'VPlayer4', 'VPosition4', 'VPlayer5', 'VPosition5', 'VPlayer6', 'VPosition6', 'HPlayer1', 'HPosition1', 'HPlayer2', 'HPosition2', 'HPlayer3', 'HPosition3', 'HPlayer4', 'HPosition4', 'HPlayer5', 'HPosition5', 'HPlayer6', 'HPosition6']]
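The twelve position renames follow a single pattern, so the mapping could also be generated rather than typed out. A small sketch (the variable name `position_renames` is only illustrative):

```python
# Build the twelve-entry mapping
# '<side>Player<slot>Position' -> '<side>Position<slot>'
# for the visiting (V) and home (H) sides and roster slots 1 to 6.
position_renames = {
    f'{side}Player{slot}Position': f'{side}Position{slot}'
    for side in ('V', 'H')
    for slot in range(1, 7)
}
```

It would then be applied with `dm.rename(columns=position_renames)`.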

- import game detail csv file 

In [5]:
dd = pd.read_csv('source_data/t_game_detail_o.csv')

**merge game detail on play by play**

In [6]:
dm = dm.merge(dd, on=['Season', 'GameNumber'], how='outer')
dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)
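An outer merge keeps rows that exist in only one of the two tables, so it can be useful to check how many rows actually matched. The sketch below uses toy stand-ins for the two tables (all values, including the dates, are made up) together with the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the play-by-play (events) and game-detail tables.
events = pd.DataFrame({
    'Season': [2010, 2010, 2010],
    'GameNumber': [20001, 20001, 20002],
    'EventNumber': [1, 2, 1],
})
games = pd.DataFrame({
    'Season': [2010, 2010],
    'GameNumber': [20001, 20003],
    'GameDate': ['2010-10-07', '2010-10-08'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
# validate='many_to_one' raises if a game appears more than once in `games`.
checked = events.merge(games, on=['Season', 'GameNumber'], how='outer',
                       indicator=True, validate='many_to_one')
merge_counts = checked['_merge'].value_counts().to_dict()
```

Any `left_only` or `right_only` rows point to games that are missing from one of the source files.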

### import giveaway and takeaway

- import play by play giveaway csv file and rename columns

In [7]:
dg = pd.read_csv('source_data/t_play_by_play_giveaway_detail_o.csv')
dg = dg.rename(columns={'GivePlayerLName': 'PlayerName', 'GiveTeamCode': 'TeamCode', 'GivePlayerNumber': 'PlayerNumber'})


- import play by play takeaway csv file and rename columns

In [8]:
dt = pd.read_csv('source_data/t_play_by_play_takeaway_detail_o.csv')
dt = dt.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dt = dt.rename(columns={'TakePlayerLName': 'PlayerName', 'TakeTeamCode': 'TeamCode', 'TakePlayerNumber': 'PlayerNumber'})


### merge giveaways and takeaways

In [9]:
dg = dg.merge(dt, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName'], how='outer')
dg.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

### import goal detail, shot detail, miss detail, block detail and scoring detail

- import play by play miss detail csv file and rename column

In [10]:
dn = pd.read_csv('source_data/t_play_by_play_miss_detail_o.csv')
dn = dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dn = dn.rename(columns={'PlayerLName': 'PlayerName'})

- import play by play goal detail csv file and rename column

In [11]:
dl = pd.read_csv('source_data/t_play_by_play_goal_detail_o.csv')
dl = dl.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dl = dl.rename(columns={'PlayerLName': 'PlayerName'})


- import play by play shot detail csv file and rename column

In [12]:
ds = pd.read_csv('source_data/t_play_by_play_shot_detail_o.csv')
ds = ds.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
ds = ds.rename(columns={'PlayerLName': 'PlayerName'})


- import play by play block detail csv file and rename columns

In [13]:
db = pd.read_csv('source_data/t_play_by_play_block_detail_o.csv')
db = db.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
db = db.rename(columns={'ShotPlayerLName': 'ShotPlayerName', 'BlockPlayerLName': 'PlayerName', 'BlockPlayerNumber': 'PlayerNumber', 'BlockTeamCode': 'TeamCode' })
db = db[['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName', 'ShotType']]

### merge goal, shot, miss and block dataframes

Goals, misses and blocks are all outcomes of a shot attempt. For that reason, their detail tables are merged together with the shot table.

In [14]:
dn = dn.merge(dl, on=['Season', 'GameNumber', 'EventNumber', 'TeamCode', 'PlayerNumber', 'PlayerName', 'ShotType', 'Length', 'Zone'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)


In [15]:
dn = dn.merge(ds, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName', 'ShotType', 'Length'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

In [16]:
dn = dn.merge(db, on=['Season', 'GameNumber', 'EventNumber', 'TeamCode', 'PlayerName', 'PlayerNumber', 'Zone', 'ShotType'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

**dn** now contains observations from goal, shot, miss and block events. The merged data frame is sorted in ascending order by season, game number and event number.
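To illustrate what these outer merges do, here is a sketch with two toy detail tables. It assumes that a given event appears in only one detail table; the columns `MissType` and `GoalFlag` and all values are hypothetical:

```python
import pandas as pd

keys = ['Season', 'GameNumber', 'EventNumber', 'TeamCode']
misses = pd.DataFrame({'Season': [2010], 'GameNumber': [20001],
                       'EventNumber': [5], 'TeamCode': ['TOR'],
                       'MissType': ['Wide']})
goals = pd.DataFrame({'Season': [2010], 'GameNumber': [20001],
                      'EventNumber': [9], 'TeamCode': ['MTL'],
                      'GoalFlag': [1]})

# No event number appears in both tables, so the outer merge keeps every row
# and fills the other table's columns with NaN.
merged = misses.merge(goals, on=keys, how='outer')
merged = merged.sort_values(['Season', 'GameNumber', 'EventNumber'])
```

Under that assumption the outer merge behaves like stacking the tables, with one row per event and missing values in the columns that belong to the other event types.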

### merge dg on dn

In [17]:
dn = dn.merge(dg, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

### import hit, faceoff, penalty detail

- import play by play faceoff detail csv file and rename columns

In [18]:
df = pd.read_csv('source_data/t_play_by_play_faceoff_detail_o.csv')
df = df.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
df = df.rename(columns={'VPlayerLName': 'VPlayerName', 'HPlayerLName': 'HPlayerName', 'WinTeamCode': 'TeamCode'})
df['PlayerNumber'] = np.where(df['TeamCode'] == df['VTeamCode'], df['VPlayerNumber'], df['HPlayerNumber'])
df['PlayerName'] = np.where(df['TeamCode'] == df['VTeamCode'], df['VPlayerName'], df['HPlayerName'])
df = df[['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName']]
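The two `np.where` calls select the player on the team that won the faceoff. A toy sketch of the same selection (team codes and player numbers are made up):

```python
import numpy as np
import pandas as pd

faceoffs = pd.DataFrame({
    'TeamCode': ['TOR', 'MTL'],     # team that won the faceoff
    'VTeamCode': ['TOR', 'TOR'],    # visiting team
    'VPlayerNumber': [42, 42],
    'HPlayerNumber': [14, 14],
})

# Where the winner is the visiting team take the visitor's player,
# otherwise fall back to the home player.
faceoffs['PlayerNumber'] = np.where(
    faceoffs['TeamCode'] == faceoffs['VTeamCode'],
    faceoffs['VPlayerNumber'],
    faceoffs['HPlayerNumber'],
)
winner_numbers = faceoffs['PlayerNumber'].tolist()
```

In the first row the visiting team won, so the visitor's number is kept; in the second row the home player's number is used.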

- import play by play hit detail csv file and rename columns

In [19]:
dh = pd.read_csv('source_data/t_play_by_play_hit_detail_o.csv')
dh = dh.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dh = dh.rename(columns={'HitterPlayerLName': 'PlayerName', 'HitterTeamCode': 'TeamCode', 'HitterPlayerNumber': 'PlayerNumber', 'HitteePlayerLName': 'HitteePlayerName'})
dh = dh[['GameNumber', 'EventNumber', 'TeamCode', 'PlayerNumber', 'PlayerName', 'Zone', 'Season']]

- import play by play penalty detail csv file and rename columns

In [20]:
dp = pd.read_csv('source_data/t_play_by_play_penalty_detail_o.csv')
dp = dp.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])
dp = dp.rename(columns={'PenaltyPlayerLName': 'PlayerName', 'PenaltyPlayerNumber': 'PlayerNumber', 'PenaltyTeamCode': 'TeamCode', 'DrawnByPlayerLName': 'DrawnByPlayerName'})
dp = dp[['GameNumber', 'EventNumber', 'TeamCode', 'PlayerNumber', 'PlayerName','PenaltyType', 'Zone','Season']]

### merge faceoff, hit and penalty on dn 

In [21]:
dn = dn.merge(df, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

In [22]:
dn = dn.merge(dh, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

- merge penalty data frame (dp) onto dn:

In [23]:
dn = dn.merge(dp, on=['Season', 'GameNumber', 'EventNumber', 'Zone', 'TeamCode', 'PlayerNumber', 'PlayerName'], how='outer')
dn.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

### merge dn on dm

In [24]:
dm = dm.merge(dn, on=['Season', 'GameNumber', 'EventNumber'], how='outer')
dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

In [25]:
dm.to_csv('pbp_merged.csv', index=False)

The merged data frame containing all on-ice events (dn) is merged onto the play by play data frame (dm) by season, game number and event number.
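A note on exporting: the `index` argument of `to_csv` is a boolean flag, and a non-empty string such as `'False'` is truthy in Python, so passing it should still write the index column. A small sketch on a toy frame, written to in-memory buffers instead of files:

```python
import io
import pandas as pd

toy = pd.DataFrame({'EventNumber': [1, 2]})

buf_string = io.StringIO()
toy.to_csv(buf_string, index='False')   # non-empty string is truthy
buf_bool = io.StringIO()
toy.to_csv(buf_bool, index=False)       # boolean False drops the index

header_with_string = buf_string.getvalue().splitlines()[0]
header_with_bool = buf_bool.getvalue().splitlines()[0]
```

Comparing the two header lines shows whether an extra, unnamed index column was written.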

### keep only regular season games

In [None]:
dm = dm[dm['GameNumber'] <= 21230]

### remove irrelevant observations


TV time-outs, goalie time-outs and icing calls are listed as "stoppage" events (`STOP`). They are removed from the data frame, together with the `EISTR` and `EIEND` events, because they have no impact on the probability of a goal being scored.

In [None]:
dm = dm[dm['EventType']!='STOP']
dm = dm[dm['EventType']!='EISTR']
dm = dm[dm['EventType']!='EIEND']
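The three filters can also be written as a single `isin` mask. A sketch on a toy column (the codes `FAC`, `SHOT` and `GOAL` are placeholders for event types that are kept):

```python
import pandas as pd

toy = pd.DataFrame(
    {'EventType': ['FAC', 'STOP', 'SHOT', 'EISTR', 'EIEND', 'GOAL']}
)

removed_types = ['STOP', 'EISTR', 'EIEND']
# ~ negates the boolean mask, keeping rows whose type is not in the list.
kept = toy[~toy['EventType'].isin(removed_types)]
kept_types = kept['EventType'].tolist()
```

Keeping the removed codes in one list makes it easier to add or drop an event type later.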

### exclude overtime and shootouts

In [None]:
dm = dm[dm['Period'] <= 3]
dm = dm[dm['Period'] >= 1]

### man-advantage scenarios

In [None]:
#value_list = ['PP', 'SH']
#dm[dm['AdvantageType'].isin(value_list)]
#dm = dm[dm['AdvantageType'] != 'PP']
#dm = dm[dm['AdvantageType'] != 'SH']
#dm['AdvantageType'] = dm['AdvantageType'].fillna('EV')

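The player evaluation model uses only even-strength situations, so man-advantage scenarios (power play and short-handed) are meant to be dropped from the data frame. The filter in the cell above is currently commented out, so these rows remain in the data.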
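A sketch of what an even-strength filter could look like on a toy frame. It assumes, as the `fillna('EV')` line in the commented-out cell suggests, that a missing `AdvantageType` means even strength:

```python
import pandas as pd

toy = pd.DataFrame({
    'EventNumber': [1, 2, 3, 4],
    'AdvantageType': [None, 'PP', 'SH', None],
})

# Assumption: a missing AdvantageType means even strength ('EV').
toy['AdvantageType'] = toy['AdvantageType'].fillna('EV')

# Drop power-play (PP) and short-handed (SH) events.
even_strength = toy[~toy['AdvantageType'].isin(['PP', 'SH'])]
even_events = even_strength['EventNumber'].tolist()
```

Only the events without a man-advantage label remain after the filter.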

### store final data frame

The merged play by play data frame is stored at this point (the export in the cell below is currently commented out).

In [None]:
#dm.to_csv('pbpmerge.csv', index=False, sep=',')

The next step is to reshape the data set from wide to long.

In [None]:
dm.shape

In [None]:
dm.isnull().sum()

- Once each roster position has been determined, the next step is to reshape the data set from wide to long. Instead of two columns for each roster position (24 in total), all players are listed in four columns: two for the visiting team (**'VPlayer'** and **'VPosition'**) and two for the home team (**'HPlayer'** and **'HPosition'**).

In [None]:
dm = dm.sort_values(['Season', 'GameNumber', 'Period', 'EventNumber'], ascending=[True, True, True, True])

In [None]:
a = [col for col in dm.columns if 'VPlayer' in col]
b = [col for col in dm.columns if 'HPlayer' in col]
c = [col for col in dm.columns if 'VPosition' in col]
d = [col for col in dm.columns if 'HPosition' in col]
dm = pd.lreshape(dm, {'VPlayer' : a, 'HPlayer' : b, 'VPosition' : c, 'HPosition': d})
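To make the reshape concrete, here is a sketch of `pd.lreshape` on a toy frame with two visiting-team roster slots (all values are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    'EventNumber': [1, 2],
    'VPlayer1': [11, 11], 'VPosition1': ['C', 'C'],
    'VPlayer2': [22, 23], 'VPosition2': ['D', 'D'],
})

player_cols = [col for col in wide.columns if 'VPlayer' in col]
position_cols = [col for col in wide.columns if 'VPosition' in col]

# Each group of wide columns is stacked into one long column; the remaining
# columns (here EventNumber) are repeated once per stacked slot.
long = pd.lreshape(wide, {'VPlayer': player_cols, 'VPosition': position_cols})
long = long.sort_values(['EventNumber', 'VPlayer']).reset_index(drop=True)
```

Each event now has one row per roster slot. Note that `lreshape` takes a `dropna` argument that defaults to `True`, so rows with a missing value in any of the stacked columns are dropped unless `dropna=False` is passed.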

In [None]:
dm.columns

In [None]:
dm.shape

In [None]:
dm = dm.rename(columns={'PlayerNumber': 'EventPlayerNumber', 'TeamCode': 'EventTeamCode', 'PlayerName': 'EventPlayerName' })
dm = dm[['Season', 'GameNumber', 'GameDate', 'Period', 'AdvantageType', 'Zone', 'EventNumber', 'EventType', 'EventDetail', 'EventTeamCode', 'EventPlayerNumber', 'EventPlayerName', 'EventTimeFromZero', 'EventTimeFromTwenty', 'VTeamCode', 'VPlayer', 'VPosition', 'HTeamCode', 'HPlayer', 'HPosition', 'ShotType', 'ShotResult', 'Length', 'PenaltyType']]

In [None]:
dm = dm.sort_values(['Season', 'GameNumber', 'Period', 'EventNumber'], ascending=[True, True, True, True])

In [None]:
dm['EventPlayerNumber'] = dm['EventPlayerNumber'].fillna('TEAM')

In [None]:
dm.isnull().sum()

In [None]:
dm.head(10)

In [None]:
dm.to_csv('play_by_play.csv', index=False, sep=',')