## roster model estimation

To determine the impact each roster position has on team success, we need to examine the quality of players per game and the result of each game. For each roster position, every team has elite players and secondary players. Elite players are assigned a value of 1 and secondary players a value of 2.

### import data sets "play by play goal detail" and "game detail"

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy

In [2]:
ds = pd.read_csv('source_data/t_play_by_play_goal_detail_o.csv')
ds = ds.rename(columns={'TeamCode': 'GoalTeamCode'})
ds = ds[ds['GameNumber'] <= 21230]

In [3]:
ds.columns

Index(['GameNumber', 'EventNumber', 'GoalTeamCode', 'PlayerNumber',
       'PlayerLName', 'ShotType', 'Zone', 'Length', 'Season'],
      dtype='object')

- keep regular season games only.

In [4]:
dg = pd.read_csv('source_data/t_game_detail_o.csv')
dg = dg[dg['GameNumber'] <= 21230]

In [5]:
dg.shape

(1230, 5)

In [6]:
ds.columns

Index(['GameNumber', 'EventNumber', 'GoalTeamCode', 'PlayerNumber',
       'PlayerLName', 'ShotType', 'Zone', 'Length', 'Season'],
      dtype='object')

### merge game detail onto scoring detail

In [7]:
ds = ds.merge(dg, on=['Season', 'GameNumber'], how='left')
ds.sort_values(['Season', 'GameNumber'], ascending=[True, True], inplace=True)
ds = ds[['Season', 'GameNumber', 'GameDate', 'EventNumber', 'VTeamCode', 'HTeamCode', 'GoalTeamCode', 'PlayerNumber', 'PlayerLName', 'ShotType', 'Zone', 'Length']]
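
A left merge like the one above silently duplicates goal rows if a game appears more than once in the game table, and leaves NaN team codes if a game is missing. The sketch below shows one way to guard against both using the `validate` and `indicator` options of `DataFrame.merge`. The toy frames and their values are made up for illustration and are not taken from the source files.

```python
import pandas as pd

# Toy stand-ins for the goal-level and game-level tables (values are made up).
goals = pd.DataFrame({
    'Season': [2011, 2011, 2011],
    'GameNumber': [20001, 20001, 20002],
    'GoalTeamCode': ['BOS', 'PHI', 'TOR'],
})
games = pd.DataFrame({
    'Season': [2011, 2011],
    'GameNumber': [20001, 20002],
    'VTeamCode': ['PHI', 'TOR'],
    'HTeamCode': ['BOS', 'MTL'],
})

# validate='many_to_one' raises MergeError if a game appears twice in `games`;
# indicator=True adds a '_merge' column flagging goals with no matching game.
merged = goals.merge(games, on=['Season', 'GameNumber'], how='left',
                     validate='many_to_one', indicator=True)
unmatched = int((merged['_merge'] != 'both').sum())
print(merged.shape, unmatched)
```

If `unmatched` is greater than zero, the later goal and winner columns would be computed on rows with missing team codes, so it is worth checking before continuing.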

- create a column with the number of goals by team per game. Generate two goal columns, one for the visiting team and one for the home team. Identify the team that won each game and create an indicator showing whether the home team won. Visitor and home goals start as NaN on rows where the other team scored and are then filled within each game; a team that did not score in a given game gets 0.

In [8]:
ds['goals'] = ds.groupby(['Season', 'GameNumber', 'GoalTeamCode'])['GoalTeamCode'].transform('count')


In [9]:
ds['vgoals'] = np.where(ds['GoalTeamCode'] == ds['VTeamCode'], ds['goals'], np.nan)
ds['hgoals'] = np.where(ds['GoalTeamCode'] == ds['HTeamCode'], ds['goals'], np.nan)

In [10]:
ds['vgoals'] = ds.groupby(['Season', 'GameNumber'])['vgoals'].ffill()
ds['vgoals'] = ds.groupby(['Season', 'GameNumber'])['vgoals'].bfill()
ds['vgoals'] = ds['vgoals'].fillna(0)
ds['hgoals'] = ds.groupby(['Season', 'GameNumber'])['hgoals'].ffill()
ds['hgoals'] = ds.groupby(['Season', 'GameNumber'])['hgoals'].bfill()
ds['hgoals'] = ds['hgoals'].fillna(0)

In [11]:
# home team wins only with strictly more goals; equal counts fall through to the visitor
ds['WinTeamCode'] = np.where(ds['hgoals'] > ds['vgoals'], ds['HTeamCode'], ds['VTeamCode'])

In [12]:
ds['HomeWin'] = (ds['WinTeamCode'] == ds['HTeamCode']).astype(int)
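
The goal counts, the zero fill for a team that was shut out, and the home-win flag can also be derived in one aggregation step. The following is a minimal sketch on made-up rows, not the notebook's own method: it flags each goal as a visitor or home goal and sums the flags per game, so a team without goals gets 0 without any forward or backward fill. The helper columns `is_v` and `is_h` are hypothetical names.

```python
import pandas as pd

# Toy goal-level rows (made-up values): one row per goal scored.
toy = pd.DataFrame({
    'Season': [2011] * 5,
    'GameNumber': [20001, 20001, 20001, 20002, 20002],
    'VTeamCode': ['PHI', 'PHI', 'PHI', 'TOR', 'TOR'],
    'HTeamCode': ['BOS', 'BOS', 'BOS', 'MTL', 'MTL'],
    'GoalTeamCode': ['BOS', 'PHI', 'BOS', 'TOR', 'TOR'],
})

# Flag each goal as scored by the visitor or by the home team.
toy['is_v'] = (toy['GoalTeamCode'] == toy['VTeamCode']).astype(int)
toy['is_h'] = (toy['GoalTeamCode'] == toy['HTeamCode']).astype(int)

# Summing the flags gives goals per side and collapses to one row per game.
per_game = (toy.groupby(['Season', 'GameNumber', 'VTeamCode', 'HTeamCode'],
                        as_index=False)
               .agg(vgoals=('is_v', 'sum'), hgoals=('is_h', 'sum')))

# As in the cells above, the home team wins only with strictly more goals.
per_game['HomeWin'] = (per_game['hgoals'] > per_game['vgoals']).astype(int)
print(per_game)
```

Because the result already has one row per game, this variant would also cover the later step that keeps a single observation per game.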

- keep one observation per game.

In [13]:
ds.isnull().sum()

Season          0
GameNumber      0
GameDate        0
EventNumber     0
VTeamCode       0
HTeamCode       0
GoalTeamCode    0
PlayerNumber    0
PlayerLName     0
ShotType        0
Zone            0
Length          0
goals           0
vgoals          0
hgoals          0
WinTeamCode     0
HomeWin         0
dtype: int64

In [14]:
ds.shape

(7043, 17)

In [15]:
ds = ds.drop_duplicates(['GameNumber'])

- save data set.

In [16]:
ds.to_csv('out_data/game_win_team.csv', index=False, sep=',')

### merge game win team onto team quality roster

- keep only columns that are relevant to team quality, merge data frames and sort values.

In [17]:
ds = ds[['Season', 'GameNumber', 'GameDate', 'VTeamCode', 'HTeamCode', 'vgoals', 'hgoals', 'WinTeamCode', 'HomeWin']]

In [18]:
dm = pd.read_csv('out_data/team_roster_quality_without_goalies.csv')
dm = dm.drop('Unnamed: 0', axis=1)
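
An `Unnamed: 0` column like the one dropped above is what pandas typically produces when a CSV was written together with its index and then read back. Assuming that is how the roster quality file was created, the round trip below illustrates the effect on a made-up frame; it uses an in-memory buffer instead of a file.

```python
import io
import pandas as pd

toy = pd.DataFrame({'GameNumber': [20001, 20002], 'HomeWin': [1, 0]})

# Writing with the index (the default) adds an unlabeled first column,
# which read_csv then names 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Passing the boolean index=False omits the index from the file.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Note that `index` expects a boolean; a non-empty string such as `'False'` is truthy in Python, so the index is still written.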

In [19]:
dm = dm.merge(ds, on=['Season', 'GameNumber'], how='left')
dm = dm[['Season', 'GameNumber', 'GameDate', 'VTeamCode', 'HTeamCode', 'vgoals', 'hgoals', 'WinTeamCode', 'HomeWin', 'VF1', 'VF2', 'VC1', 'VC2', 'VL1', 'VL2', 'VR1', 'VR2', 'VD1', 'VD2', 'HF1', 'HF2', 'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2', 'HD1', 'HD2' ]]

- Calculate the difference in player quality per game for all positions with respect to the home team (home team - visitor team). There are 5 positions and 2 types of player quality, which gives a total of 10 differences.

In [20]:
dm.columns

Index(['Season', 'GameNumber', 'GameDate', 'VTeamCode', 'HTeamCode', 'vgoals',
       'hgoals', 'WinTeamCode', 'HomeWin', 'VF1', 'VF2', 'VC1', 'VC2', 'VL1',
       'VL2', 'VR1', 'VR2', 'VD1', 'VD2', 'HF1', 'HF2', 'HC1', 'HC2', 'HL1',
       'HL2', 'HR1', 'HR2', 'HD1', 'HD2'],
      dtype='object')

In [21]:
dm.shape

(1230, 29)

- add the total number of forwards and defensemen per game for both the home and visiting team. To estimate the roster model, the difference between home and visitor players per quality level is calculated. For that reason, both teams must have exactly the same number of **forwards (12) and defensemen (6)**. Games where a team did not have exactly 12 forwards or 6 defensemen are dropped from the data set.

In [22]:
dm['VF'] = dm['VF1'] + dm['VF2']
dm['VD'] = dm['VD1'] + dm['VD2']
dm['HF'] = dm['HF1'] + dm['HF2']
dm['HD'] = dm['HD1'] + dm['HD2']

In [23]:
dm = dm[['Season', 'GameNumber', 'GameDate', 'VTeamCode', 'HTeamCode', 'vgoals', 'hgoals', 'WinTeamCode', 'HomeWin', 'VF', 'VD', 'VF1', 'VF2', 'VC1', 'VC2', 'VL1', 'VL2', 'VR1', 'VR2', 'VD1', 'VD2', 'HF', 'HD', 'HF1', 'HF2', 'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2', 'HD1', 'HD2']]

In [24]:
df = dm[((dm['VF'] != 12) | (dm['HF'] != 12) | (dm['VD'] != 6) | (dm['HD'] != 6))]

In [25]:
df.shape

(214, 33)

- save the games where a team did not have exactly 12 forwards and 6 defensemen (18 players on the roster), then drop these observations.

In [26]:
df.to_csv('out_data/games_with_less_players.csv', index=False, sep=',')

In [27]:
dm = dm[((dm['VF'] == 12) & (dm['HF'] == 12) & (dm['VD'] == 6) & (dm['HD'] == 6))]

In [28]:
dm.to_csv('out_data/games_with_exact_quantity_of_players.csv', index=False, sep=',')

In [29]:
dm.shape

(1016, 33)

- 214 games have been dropped from the data set. 
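
The kept and dropped sets above are built from two hand-written conditions that must stay exact complements of each other. A minimal sketch of an alternative, on made-up roster counts, builds a single boolean mask and uses its negation for the dropped games, so the two subsets always add up to the full data set.

```python
import pandas as pd

# Made-up roster totals for three games.
toy = pd.DataFrame({
    'GameNumber': [20001, 20002, 20003],
    'VF': [12, 11, 12], 'HF': [12, 12, 12],
    'VD': [6, 7, 6],    'HD': [6, 6, 5],
})

# True only when both teams dress exactly 12 forwards and 6 defensemen.
complete = (toy[['VF', 'HF']].eq(12).all(axis=1)
            & toy[['VD', 'HD']].eq(6).all(axis=1))

kept = toy[complete]       # games used for estimation
dropped = toy[~complete]   # games set aside

print(len(kept), len(dropped))
```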

In [30]:
dm.columns

Index(['Season', 'GameNumber', 'GameDate', 'VTeamCode', 'HTeamCode', 'vgoals',
       'hgoals', 'WinTeamCode', 'HomeWin', 'VF', 'VD', 'VF1', 'VF2', 'VC1',
       'VC2', 'VL1', 'VL2', 'VR1', 'VR2', 'VD1', 'VD2', 'HF', 'HD', 'HF1',
       'HF2', 'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2', 'HD1', 'HD2'],
      dtype='object')

- independent variables capture the talent differential between the home and visitor team per position.

In [31]:
dm['x1'] = dm['HF1'] - dm['VF1']
dm['x2'] = dm['HF2'] - dm['VF2']
dm['x3'] = dm['HD1'] - dm['VD1']
dm['x4'] = dm['HD2'] - dm['VD2']

In [32]:
dm.isnull().sum()

Season         0
GameNumber     0
GameDate       0
VTeamCode      0
HTeamCode      0
vgoals         0
hgoals         0
WinTeamCode    0
HomeWin        0
VF             0
VD             0
VF1            0
VF2            0
VC1            0
VC2            0
VL1            0
VL2            0
VR1            0
VR2            0
VD1            0
VD2            0
HF             0
HD             0
HF1            0
HF2            0
HC1            0
HC2            0
HL1            0
HL2            0
HR1            0
HR2            0
HD1            0
HD2            0
x1             0
x2             0
x3             0
x4             0
dtype: int64

In [33]:
dm.shape

(1016, 37)

### estimate roster model 

- regress home win on the difference in the number of home and visitor players by position and quality (predictor variables). Add a constant to the predictors and use OLS. The purpose is to determine the impact each roster position has on home team success.

In [34]:
y = dm['HomeWin']  
X = dm[['x1', 'x2', 'x3', 'x4']] 
X = sm.add_constant(X)  

In [35]:
result = sm.OLS(y, X).fit()
result.summary()

Dep. Variable:    HomeWin             R-squared:            0.005
Model:            OLS                 Adj. R-squared:       0.003
Method:           Least Squares       F-statistic:          2.713
Date:             Mon, 30 Oct 2017    Prob (F-statistic):   0.0668
Time:             11:47:18            Log-Likelihood:       -733.74
No. Observations: 1016                AIC:                  1473.0
Df Residuals:     1013                BIC:                  1488.0
Df Model:         2
Covariance Type:  nonrobust

          coef   std err        t   P>|t|   [95.0% Conf. Int.]
const   0.5143     0.016   31.473   0.000    0.482   0.546
x1      0.0093     0.005    1.992   0.047    0.000   0.018
x2     -0.0093     0.005   -1.992   0.047   -0.018  -0.000
x3     -0.0067     0.008   -0.877   0.381   -0.022   0.008
x4      0.0067     0.008    0.877   0.381   -0.008   0.022

Omnibus:          1.272     Durbin-Watson:       1.883
Prob(Omnibus):    0.529     Jarque-Bera (JB):    165.756
Skew:             -0.086    Prob(JB):            1.02e-36
Kurtosis:         1.029     Cond. No.            6.44e+16


In [36]:
result.params

const    0.514307
x1       0.009281
x2      -0.009281
x3      -0.006692
x4       0.006692
dtype: float64
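
The mirrored coefficients (x1 = -x2, x3 = -x4), the `Df Model` of 2 for four predictors, and the very large condition number all point to perfect collinearity. This appears to follow from the roster filter: once every team has exactly 12 forwards and 6 defensemen, the secondary count is fixed by the elite count, so x2 = -x1 and x4 = -x3. The sketch below reproduces the pattern on randomly generated, made-up counts using NumPy only. It assumes the estimates come from a minimum-norm (pseudoinverse-style) least squares solution, which splits each effect evenly between the two mirrored columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up elite counts; each team dresses 12 forwards and 6 defensemen,
# so the secondary count is fixed once the elite count is known.
HF1, VF1 = rng.integers(0, 13, n), rng.integers(0, 13, n)
HD1, VD1 = rng.integers(0, 7, n), rng.integers(0, 7, n)

x1 = HF1 - VF1
x2 = (12 - HF1) - (12 - VF1)   # equals -x1
x3 = HD1 - VD1
x4 = (6 - HD1) - (6 - VD1)     # equals -x3

X = np.column_stack([np.ones(n), x1, x2, x3, x4])
y = rng.integers(0, 2, n)      # made-up binary outcome standing in for HomeWin

# lstsq returns the minimum-norm solution and the numerical rank of X.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)
print(beta)
```

Under these assumptions only two slopes are identified. Regressing on `x1` and `x3` alone would give the same fit, with each slope equal to twice the corresponding split coefficient, so the separate x1 and x2 estimates should not be read as distinct effects of elite and secondary players.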