## roster model estimation

To determine the impact each roster position has on team success, we need to examine the quality of the players dressed in each game and the result of that game. For each roster position, every team has elite players and secondary players. Elite players are assigned a value of 1, whereas secondary players are assigned a value of 2.

### import data sets "play by play goal detail" and "game detail"

In [28]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy
import zipfile


pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)


In [29]:
pwd

'/home/kmongeon/Documents/GIT/nhl_roster_design'

In [30]:
d0 = pd.read_csv('season_games.csv', index_col=0)
d1 = pd.read_csv('team_roster_quality_without_goalies.csv', index_col=0)

In [31]:
dm = d0.merge(d1, on=['Season', 'GameNumber'], how='left')

- Calculate the difference in player quality per game for every position, taken from the home team's perspective (home team minus visiting team). There are 5 positions and 2 types of player quality, which gives a total of 10 differences.
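The differencing step described above can be sketched on a small made-up frame. The `toy` data and the loop are illustrative assumptions, not the notebook's own code; only the column naming (`H`/`V` prefix, position letter, quality digit) follows the columns of `dm`.

```python
import pandas as pd

# Toy stand-in for `dm`: two games, forwards (F) and defensemen (D) only.
# The numbers are invented for illustration.
toy = pd.DataFrame({
    "HF1": [4, 2], "HF2": [8, 10], "HD1": [1, 2], "HD2": [5, 4],
    "VF1": [3, 5], "VF2": [9, 7], "VD1": [2, 0], "VD2": [4, 6],
})

positions = ["F", "D"]  # extend to ["F", "C", "L", "R", "D"] for all five positions
qualities = ["1", "2"]  # 1 = elite, 2 = secondary

# Home minus visitor for every position/quality pair, e.g. DF1 = HF1 - VF1.
for pos in positions:
    for q in qualities:
        toy[f"D{pos}{q}"] = toy[f"H{pos}{q}"] - toy[f"V{pos}{q}"]

print(toy[["DF1", "DF2", "DD1", "DD2"]])
```

A positive value means the home team dressed more players of that position and quality than the visitor did in that game.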

In [32]:
dm.columns

Index(['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD',
       'WinTeam', 'VF1', 'VF2', 'VC1', 'VC2', 'VL1', 'VL2', 'VR1', 'VR2',
       'VD1', 'VD2', 'HF1', 'HF2', 'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2',
       'HD1', 'HD2'],
      dtype='object')

In [33]:
dm.shape

(1316, 28)

In [61]:
dm['VF'] = dm['VF1'] + dm['VF2']
dm['VD'] = dm['VD1'] + dm['VD2']
dm['HF'] = dm['HF1'] + dm['HF2']
dm['HD'] = dm['HD1'] + dm['HD2']

In [62]:
dm['VF'].value_counts()

12.0    1015
Name: VF, dtype: int64

In [63]:
dm['HF'].value_counts()

12.0    1015
Name: HF, dtype: int64

In [64]:
dm['VD'].value_counts()

6.0    1015
Name: VD, dtype: int64

In [65]:
dm['HD'].value_counts()

6.0    1015
Name: HD, dtype: int64

In [66]:
dm = dm[((dm['VF'] == 12) & (dm['HF'] == 12) & (dm['VD'] == 6) & (dm['HD'] == 6))]

## Summary analysis

In [80]:
dm.describe()

,Season,GameNumber,HGF,VGF,GD,VF1,VF2,VD1,VD2,HF1,HF2,HD1,HD2,HomeWin,VF,VD,HF,HD,DF1,DF2,DD1,DD2
count,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0
mean,2010.0,20622.621675,2.941872,2.739901,0.20197,3.002956,8.997044,0.991133,5.008867,3.461084,8.538916,1.079803,4.920197,0.521182,12.0,6.0,12.0,6.0,0.458128,-0.458128,0.08867,-0.08867
std,0.0,352.180753,1.716485,1.634,2.438513,1.098228,1.098228,0.722231,0.722231,1.282646,1.282646,0.726055,0.726055,0.499797,0.0,0.0,0.0,0.0,1.702177,1.702177,1.039657,1.039657
min,2010.0,20001.0,0.0,0.0,-8.0,0.0,6.0,0.0,4.0,0.0,6.0,0.0,4.0,0.0,12.0,6.0,12.0,6.0,-4.0,-6.0,-2.0,-2.0
25%,2010.0,20319.5,2.0,2.0,-1.0,2.0,8.0,0.0,4.0,3.0,8.0,1.0,4.0,0.0,12.0,6.0,12.0,6.0,-1.0,-2.0,-1.0,-1.0
50%,2010.0,20628.0,3.0,3.0,1.0,3.0,9.0,1.0,5.0,3.0,9.0,1.0,5.0,1.0,12.0,6.0,12.0,6.0,0.0,0.0,0.0,0.0
75%,2010.0,20927.5,4.0,4.0,2.0,4.0,10.0,2.0,6.0,4.0,9.0,2.0,5.0,1.0,12.0,6.0,12.0,6.0,2.0,1.0,1.0,1.0
max,2010.0,21230.0,9.0,10.0,7.0,6.0,12.0,2.0,6.0,6.0,12.0,2.0,6.0,1.0,12.0,6.0,12.0,6.0,6.0,4.0,2.0,2.0


In [53]:
dm = dm[['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD','WinTeam',
         'VF1', 'VF2', 'VD1', 'VD2', 
         'HF1', 'HF2', 'HD1', 'HD2']]

In [69]:
dm['HomeWin'] = (dm['WinTeam'] == 'HOME').astype(int)  # 1 if the home team won, else 0
dm['DF1'] = dm['HF1'] - dm['VF1']
dm['DF2'] = dm['HF2'] - dm['VF2']
dm['DD1'] = dm['HD1'] - dm['VD1']
dm['DD2'] = dm['HD2'] - dm['VD2']

In [72]:
dm.groupby(['WinTeam'])[['DF1', 'DF2', 'DD1', 'DD2']].describe()

| WinTeam | Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| AWAY | DF1 | 486.0 | 0.339506 | 1.684974 | -4.0 | -1.0 | 0.0 | 1.0 | 5.0 |
| AWAY | DF2 | 486.0 | -0.339506 | 1.684974 | -5.0 | -1.0 | 0.0 | 1.0 | 4.0 |
| AWAY | DD1 | 486.0 | 0.12963 | 1.044212 | -2.0 | -1.0 | 0.0 | 1.0 | 2.0 |
| AWAY | DD2 | 486.0 | -0.12963 | 1.044212 | -2.0 | -1.0 | 0.0 | 1.0 | 2.0 |
| HOME | DF1 | 529.0 | 0.567108 | 1.712182 | -4.0 | -1.0 | 0.0 | 2.0 | 6.0 |
| HOME | DF2 | 529.0 | -0.567108 | 1.712182 | -6.0 | -2.0 | 0.0 | 1.0 | 4.0 |
| HOME | DD1 | 529.0 | 0.05104 | 1.035014 | -2.0 | -1.0 | 0.0 | 1.0 | 2.0 |
| HOME | DD2 | 529.0 | -0.05104 | 1.035014 | -2.0 | -1.0 | 0.0 | 1.0 | 2.0 |


## Mean number of F1, F2, D1, D2 per team

* create a season-team dataframe
  * number of wins/points/winning percentage
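One possible way to build such a season-team frame is sketched below. It assumes a game-level frame with the same column names as `dm`; the `games` data, the `team_rows` helper, and the output column names are hypothetical. Points are left out because the columns shown here carry no overtime or shootout indicator, and `F2`/`D2` can be added in the same way as `F1`/`D1`.

```python
import pandas as pd

# Toy game-level frame using the column names of `dm` (values are made up).
games = pd.DataFrame({
    "Season": [2010, 2010, 2010],
    "HTeamCode": ["AAA", "BBB", "AAA"],
    "VTeamCode": ["BBB", "AAA", "BBB"],
    "WinTeam": ["HOME", "AWAY", "AWAY"],
    "HF1": [4, 2, 3], "HD1": [1, 2, 1],
    "VF1": [3, 5, 2], "VD1": [2, 0, 1],
})

def team_rows(df, side, win_label):
    """Return one row per game from the perspective of `side` ('H' or 'V')."""
    return pd.DataFrame({
        "Season": df["Season"],
        "Team": df[f"{side}TeamCode"],
        "Win": (df["WinTeam"] == win_label).astype(int),
        "F1": df[f"{side}F1"],
        "D1": df[f"{side}D1"],
    })

# Stack home and visitor perspectives so each team-game appears exactly once.
long = pd.concat(
    [team_rows(games, "H", "HOME"), team_rows(games, "V", "AWAY")],
    ignore_index=True,
)

# Aggregate to one row per season and team.
season_team = (
    long.groupby(["Season", "Team"])
        .agg(Games=("Win", "size"), Wins=("Win", "sum"),
             MeanF1=("F1", "mean"), MeanD1=("D1", "mean"))
        .reset_index()
)
season_team["WinPct"] = season_team["Wins"] / season_team["Games"]
print(season_team)
```

Stacking the two perspectives first keeps the aggregation to a single `groupby`, instead of computing home and away totals separately and merging them.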

### estimate roster model 

- Regress the home-win indicator on the differences between the number of home and visitor players by position and quality (the predictor variables). Add a constant to the predictors and estimate by OLS. The purpose is to determine the impact each roster position has on home-team success.
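Written out for the two predictors that enter the fit in the cells that follow, the regression is a linear probability model. The symbols $\beta_0, \beta_1, \beta_2$, the error term $\varepsilon_g$, and the game index $g$ are notation introduced here for illustration:

```latex
\mathrm{HomeWin}_g = \beta_0 + \beta_1\,\mathrm{DF1}_g + \beta_2\,\mathrm{DD1}_g + \varepsilon_g,
\qquad
\mathrm{DF1}_g = \mathrm{HF1}_g - \mathrm{VF1}_g,
\quad
\mathrm{DD1}_g = \mathrm{HD1}_g - \mathrm{VD1}_g
```

Under this specification, $\beta_1$ is read as the change in the home team's win probability when it dresses one more elite forward than the visitor, holding the elite-defenseman difference fixed, and $\beta_0$ is the home-win probability when both differences are zero.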

In [78]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1']] )

In [79]:
result = sm.OLS(y, X).fit()
result.summary()

Dep. Variable:,HomeWin,R-squared:,0.005
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,2.658
Date:,"Mon, 30 Oct 2017",Prob (F-statistic):,0.0706
Time:,18:14:19,Log-Likelihood:,-733.11
No. Observations:,1015,AIC:,1472.0
Df Residuals:,1012,BIC:,1487.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.5140,0.016,31.442,0.000,0.482,0.546
DF1,0.0183,0.009,1.965,0.050,2.95e-05,0.037
DD1,-0.0134,0.015,-0.878,0.380,-0.043,0.017

Omnibus:,1.217,Durbin-Watson:,1.884
Prob(Omnibus):,0.544,Jarque-Bera (JB):,165.661
Skew:,-0.084,Prob(JB):,1.06e-36
Kurtosis:,1.028,Cond. No.,1.97


In [47]:
result.params

const    0.513976
DF1      0.009163
DF2     -0.009163
DD1     -0.006705
DD2      0.006705
dtype: float64
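The `result.params` output above lists `DF2` and `DD2`, so it appears to come from a fit on all four differences rather than the two-predictor model in the summary. In the filtered sample every team dresses 12 forwards and 6 defensemen, which forces `DF2 = -DF1` and `DD2 = -DD1`, so that design is rank deficient. A pseudoinverse-based fit (statsmodels' default `method='pinv'`) then returns the minimum-norm solution, which splits each effect evenly across the collinear pair: 0.009163 and -0.009163 for the forwards correspond to the 0.0183 reported for `DF1` in the summary. The sketch below reproduces that behaviour with NumPy on synthetic data; the data and variable names are assumptions, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: df2 is the exact negative of df1, as a fixed roster size implies.
df1 = rng.integers(-4, 7, size=n).astype(float)
df2 = -df1
y = 0.5 + 0.02 * df1 + rng.normal(0, 0.1, size=n)  # made-up outcome

X_full = np.column_stack([np.ones(n), df1, df2])  # rank-deficient design
X_red = np.column_stack([np.ones(n), df1])        # full-rank design

# lstsq returns the minimum-norm least-squares solution when the design is
# rank deficient, which is the same solution a pseudoinverse gives.
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)

print("full design:   ", b_full)
print("reduced design:", b_red)
```

The coefficient on `df1` in the full design is half the coefficient in the reduced design, and the coefficient on `df2` is its negative, so only the combined effect is identified. Dropping one column of each collinear pair, as the two-predictor fit does, gives directly interpretable estimates.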