## roster model estimation

To determine the impact each roster position has on team success, we need to examine the quality of players per game and the result of each game. For each roster position, each team has elite players and secondary players. Elite players are assigned a value of 1, whereas secondary players are assigned a value of 2.

### import data sets  "play by play goal detail" and "game detail"

In [159]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy
import zipfile


pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)


In [160]:
pwd

'/Users/stefanostselios/Desktop/nhl_roster_design-master'

In [161]:
d0 = pd.read_csv('season_games.csv', index_col=0)
d1 = pd.read_csv('season_game_roster.csv', index_col=0)

In [162]:
d0['WinTeam'] = d0.apply(lambda x: 'HOME' if x['GD'] > 0 else 'AWAY', axis=1)
d0 = d0[['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD', 'WinTeam']]
d0.head()

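| | Season | GameNumber | VTeamCode | HTeamCode | HGF | VGF | GD | WinTeam |
|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 20001 | MTL | TOR | 3 | 2 | 1 | HOME |
| 1 | 2010 | 20002 | PHI | PIT | 2 | 3 | -1 | AWAY |
| 2 | 2010 | 20003 | CAR | MIN | 3 | 4 | -1 | AWAY |
| 3 | 2010 | 20004 | CHI | COL | 4 | 3 | 1 | HOME |
| 4 | 2010 | 20005 | CGY | EDM | 4 | 0 | 4 | HOME |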


In [163]:
dm = d0.merge(d1, on=['Season', 'GameNumber'], how='left')
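
Because this is a left join, any game without a matching roster row keeps `NaN` in the roster columns. A minimal sketch of how that could be checked, using toy frames (the values below are made up for illustration and are not taken from the data set) together with the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for d0 (games) and d1 (rosters); values are illustrative only.
games = pd.DataFrame({
    'Season': [2010, 2010, 2010],
    'GameNumber': [20001, 20002, 20003],
    'GD': [1, -1, -1],
})
rosters = pd.DataFrame({
    'Season': [2010, 2010],
    'GameNumber': [20001, 20003],
    'HF1': [2.0, 2.0],
    'VF1': [2.0, 3.0],
})

# validate='one_to_one' raises if a game key is duplicated on either side;
# indicator=True adds a '_merge' column recording where each row came from.
merged = games.merge(rosters, on=['Season', 'GameNumber'], how='left',
                     validate='one_to_one', indicator=True)

# Games present in `games` but missing from `rosters`.
unmatched = merged[merged['_merge'] == 'left_only']
n_unmatched = len(unmatched)
print(n_unmatched, unmatched['GameNumber'].tolist())
```

Counting the `left_only` rows right after the merge shows how many games will carry missing roster values into later steps.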

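- Calculate the difference in player quality per game for each position, taken with respect to the home team (home team - visitor team). There are 5 positions and 2 types of player quality, which gives a total of 10 differences. The roster columns used in this section group the positions into forwards (F) and defense (D), so four differences (DF1, DF2, DD1, DD2) are computed here.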

In [164]:
dm = dm[dm['GameNumber'] <= 21230]
dm.shape

(1230, 16)

In [165]:
dm['VF'] = dm['VF1'] + dm['VF2']
dm['VD'] = dm['VD1'] + dm['VD2']
dm['HF'] = dm['HF1'] + dm['HF2']
dm['HD'] = dm['HD1'] + dm['HD2']

In [166]:
dm['F'] = dm['VF'] + dm['HF']
dm['D'] = dm['VD'] + dm['HD']
dm.head()

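| | Season | GameNumber | VTeamCode | HTeamCode | HGF | VGF | GD | WinTeam | VF1 | VF2 | VD1 | VD2 | HF1 | HF2 | HD1 | HD2 | VF | VD | HF | HD | F | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 20001 | MTL | TOR | 3 | 2 | 1 | HOME | 2.0 | 10.0 | 1.0 | 5.0 | 2.0 | 10.0 | 1.0 | 5.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 1 | 2010 | 20002 | PHI | PIT | 2 | 3 | -1 | AWAY | 5.0 | 7.0 | 2.0 | 4.0 | 5.0 | 7.0 | 3.0 | 3.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 2 | 2010 | 20003 | CAR | MIN | 3 | 4 | -1 | AWAY | 3.0 | 9.0 | 1.0 | 5.0 | 2.0 | 10.0 | 1.0 | 5.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 3 | 2010 | 20004 | CHI | COL | 4 | 3 | 1 | HOME | 4.0 | 8.0 | 2.0 | 4.0 | 2.0 | 10.0 | 1.0 | 5.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 4 | 2010 | 20005 | CGY | EDM | 4 | 0 | 4 | HOME | 3.0 | 9.0 | 1.0 | 5.0 | 0.0 | 12.0 | 1.0 | 5.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |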


In [167]:
dm.shape

(1230, 22)

In [168]:
dm.isnull().sum()

Season          0
GameNumber      0
VTeamCode       0
HTeamCode       0
HGF             0
VGF             0
GD              0
WinTeam         0
VF1           215
VF2           215
VD1           215
VD2           215
HF1           215
HF2           215
HD1           215
HD2           215
VF            215
VD            215
HF            215
HD            215
F             215
D             215
dtype: int64

In [169]:
dm = dm[((dm['VF'] == 12) & (dm['VD'] == 6) & (dm['HF'] == 12) & (dm['HD'] == 6))]

In [170]:
dm.shape

(1015, 22)

In [171]:
dm['HF'].value_counts()

12.0    1015
Name: HF, dtype: int64

In [172]:
dm['HD'].value_counts()

6.0    1015
Name: HD, dtype: int64

In [173]:
dm['VF'].value_counts()

12.0    1015
Name: VF, dtype: int64

In [174]:
dm['VD'].value_counts()

6.0    1015
Name: VD, dtype: int64

## Summary analysis

In [175]:
dm.describe()

| | Season | GameNumber | HGF | VGF | GD | VF1 | VF2 | VD1 | VD2 | HF1 | HF2 | HD1 | HD2 | VF | VD | HF | HD | F | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 | 1015.0 |
| mean | 2010.0 | 20622.621675 | 2.941872 | 2.739901 | 0.20197 | 2.825616 | 9.174384 | 1.40197 | 4.59803 | 3.26601 | 8.73399 | 1.599015 | 4.400985 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| std | 0.0 | 352.180753 | 1.716485 | 1.634 | 2.438513 | 1.527568 | 1.527568 | 0.946287 | 0.946287 | 1.590832 | 1.590832 | 1.012642 | 1.012642 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| min | 2010.0 | 20001.0 | 0.0 | 0.0 | -8.0 | 0.0 | 6.0 | 0.0 | 3.0 | 0.0 | 6.0 | 0.0 | 3.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 25% | 2010.0 | 20319.5 | 2.0 | 2.0 | -1.0 | 2.0 | 8.0 | 1.0 | 4.0 | 2.0 | 7.0 | 1.0 | 4.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 50% | 2010.0 | 20628.0 | 3.0 | 3.0 | 1.0 | 3.0 | 9.0 | 1.0 | 5.0 | 3.0 | 9.0 | 2.0 | 4.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| 75% | 2010.0 | 20927.5 | 4.0 | 4.0 | 2.0 | 4.0 | 10.0 | 2.0 | 5.0 | 5.0 | 10.0 | 2.0 | 5.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |
| max | 2010.0 | 21230.0 | 9.0 | 10.0 | 7.0 | 6.0 | 12.0 | 3.0 | 6.0 | 6.0 | 12.0 | 3.0 | 6.0 | 12.0 | 6.0 | 12.0 | 6.0 | 24.0 | 12.0 |


In [176]:
dm = dm[['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD','WinTeam',
         'VF1', 'VF2', 'VD1', 'VD2', 
         'HF1', 'HF2', 'HD1', 'HD2']]

In [177]:
dm['HomeWin'] = (dm['WinTeam'] == 'HOME').astype(int)  # 1 if the home team won, else 0
dm['DF1'] = dm['HF1'] - dm['VF1']
dm['DF2'] = dm['HF2'] - dm['VF2']
dm['DD1'] = dm['HD1'] - dm['VD1']
dm['DD2'] = dm['HD2'] - dm['VD2']
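
Since the sample was restricted in In [169] to games where each team dresses 12 forwards and 6 defensemen, the secondary differentials are the exact negatives of the elite ones. A small sketch with made-up roster counts (not rows from the data set) illustrates the identity:

```python
import pandas as pd

# Made-up roster counts; each team has 12 forwards (F1 + F2) and 6 defensemen (D1 + D2).
toy = pd.DataFrame({
    'HF1': [2, 5, 0], 'HF2': [10, 7, 12],
    'VF1': [4, 3, 3], 'VF2': [8, 9, 9],
    'HD1': [1, 3, 1], 'HD2': [5, 3, 5],
    'VD1': [2, 1, 1], 'VD2': [4, 5, 5],
})

# Home minus visitor differentials, one column per position/quality pair.
for pos in ('F', 'D'):
    for q in ('1', '2'):
        toy[f'D{pos}{q}'] = toy[f'H{pos}{q}'] - toy[f'V{pos}{q}']

# With fixed totals, HF2 - VF2 = (12 - HF1) - (12 - VF1) = -(HF1 - VF1).
identity_holds = bool((toy['DF2'] == -toy['DF1']).all()
                      and (toy['DD2'] == -toy['DD1']).all())
print(identity_holds)
```

This matters for the regressions that follow, because a model containing both DF1 and DF2 (or both DD1 and DD2) has redundant predictors.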

In [178]:
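# Select several columns with a list (double brackets); tuple-style selection is not supported in recent pandas
dm.groupby(['WinTeam'])[['DF1', 'DF2', 'DD1', 'DD2']].describe()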

| WinTeam | | DF1 | DF2 | DD1 | DD2 |
|---|---|---|---|---|---|
| AWAY | count | 486.0 | 486.0 | 486.0 | 486.0 |
| AWAY | mean | 0.3107 | -0.3107 | 0.131687 | -0.131687 |
| AWAY | std | 2.183388 | 2.183388 | 1.378459 | 1.378459 |
| AWAY | min | -5.0 | -6.0 | -3.0 | -3.0 |
| AWAY | 25% | -1.0 | -2.0 | -1.0 | -1.0 |
| AWAY | 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| AWAY | 75% | 2.0 | 1.0 | 1.0 | 1.0 |
| AWAY | max | 6.0 | 5.0 | 3.0 | 3.0 |
| HOME | count | 529.0 | 529.0 | 529.0 | 529.0 |
| HOME | mean | 0.559546 | -0.559546 | 0.257089 | -0.257089 |


## Mean number of F1, F2, D1, D2 per team

* create a season-team dataframe
  * number of wins/points/winning percentage
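
One way the season-team dataframe outlined above might be built is to stack each game into a home row and a visitor row and then aggregate by team. The sketch below uses a few made-up games with the same column names as `dm`; the helper name `team_season_table` is hypothetical, and points are left out because the columns shown in this section do not record overtime or shootout results.

```python
import pandas as pd

def team_season_table(games: pd.DataFrame) -> pd.DataFrame:
    """Aggregate game rows into one row per (Season, Team). Hypothetical helper."""
    home = pd.DataFrame({
        'Season': games['Season'], 'Team': games['HTeamCode'],
        'Win': (games['GD'] > 0).astype(int),        # same rule as WinTeam == 'HOME'
        'F1': games['HF1'], 'F2': games['HF2'],
        'D1': games['HD1'], 'D2': games['HD2'],
    })
    away = pd.DataFrame({
        'Season': games['Season'], 'Team': games['VTeamCode'],
        'Win': (games['GD'] <= 0).astype(int),       # same rule as WinTeam == 'AWAY'
        'F1': games['VF1'], 'F2': games['VF2'],
        'D1': games['VD1'], 'D2': games['VD2'],
    })
    stacked = pd.concat([home, away], ignore_index=True)
    out = stacked.groupby(['Season', 'Team']).agg(
        Games=('Win', 'size'), Wins=('Win', 'sum'),
        F1=('F1', 'mean'), F2=('F2', 'mean'),
        D1=('D1', 'mean'), D2=('D2', 'mean'),
    )
    out['WinPct'] = out['Wins'] / out['Games']
    return out.reset_index()

# Three made-up games (illustrative values only).
toy = pd.DataFrame({
    'Season': [2010, 2010, 2010],
    'VTeamCode': ['MTL', 'TOR', 'MTL'],
    'HTeamCode': ['TOR', 'PIT', 'PIT'],
    'GD': [1, -1, 2],
    'HF1': [2, 5, 4], 'HF2': [10, 7, 8], 'HD1': [1, 3, 2], 'HD2': [5, 3, 4],
    'VF1': [3, 2, 3], 'VF2': [9, 10, 9], 'VD1': [1, 1, 2], 'VD2': [5, 5, 4],
})
table = team_season_table(toy)
print(table)
```

The resulting table has one row per team and season with games played, wins, winning percentage, and the mean number of F1, F2, D1 and D2 players dressed.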

### estimate roster model 

- Regress home win on the difference between the number of home and visitor players by position and quality (the predictor variables). Add a constant to the predictors and use OLS. The purpose is to determine the impact each roster position has on home team success.
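
Written out, the linear probability model described above could be stated as follows, where $g$ indexes games and the $\beta$ symbols are notation introduced here for illustration:

$$
\text{HomeWin}_g = \beta_0 + \beta_1\,\text{DF1}_g + \beta_2\,\text{DD1}_g + \beta_3\,\text{DF2}_g + \beta_4\,\text{DD2}_g + \varepsilon_g
$$

Because the sample keeps only games with 12 forwards and 6 defensemen per team, $\text{DF2}_g = -\text{DF1}_g$ and $\text{DD2}_g = -\text{DD1}_g$, so the model reduces to

$$
\text{HomeWin}_g = \beta_0 + (\beta_1 - \beta_3)\,\text{DF1}_g + (\beta_2 - \beta_4)\,\text{DD1}_g + \varepsilon_g
$$

and only the differences $\beta_1 - \beta_3$ and $\beta_2 - \beta_4$ can be estimated from the data.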

In [179]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1', 'DF2', 'DD2']] )
result = sm.OLS(y, X).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | HomeWin | R-squared: | 0.003 |
| Model: | OLS | Adj. R-squared: | 0.001 |
| Method: | Least Squares | F-statistic: | 1.635 |
| Date: | Mon, 27 Nov 2017 | Prob (F-statistic): | 0.196 |
| Time: | 15:23:17 | Log-Likelihood: | -734.13 |
| No. Observations: | 1015 | AIC: | 1474.0 |
| Df Residuals: | 1012 | BIC: | 1489.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.5155 | 0.016 | 32.242 | 0.000 | 0.484 0.547 |
| DF1 | 0.0058 | 0.005 | 1.129 | 0.259 | -0.004 0.016 |
| DD1 | 0.0012 | 0.008 | 0.153 | 0.878 | -0.015 0.017 |
| DF2 | -0.0058 | 0.005 | -1.129 | 0.259 | -0.016 0.004 |
| DD2 | -0.0012 | 0.008 | -0.153 | 0.878 | -0.017 0.015 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.213 | Durbin-Watson: | 1.889 |
| Prob(Omnibus): | 0.545 | Jarque-Bera (JB): | 167.003 |
| Skew: | -0.084 | Prob(JB): | 5.44e-37 |
| Kurtosis: | 1.02 | Cond. No. | 1e+17 |


In [180]:
result.params

const    0.515545
DF1      0.005848
DD1      0.001235
DF2     -0.005848
DD2     -0.001235
dtype: float64

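- Increasing the home-minus-visitor differential in **elite** forwards or defensemen by one player is associated with an **increase** in the home win probability of about 0.6 and 0.1 percentage points, respectively (coefficients 0.0058 and 0.0012).
- Increasing the home-minus-visitor differential in **secondary** forwards or defensemen by one player is associated with a **decrease** of the same size (coefficients -0.0058 and -0.0012).
- Neither effect is statistically significant (p = 0.259 for forwards, p = 0.878 for defense). Because every game in the sample has 12 forwards and 6 defensemen per team, DF2 = -DF1 and DD2 = -DD1, so the four predictors are perfectly collinear (condition number 1e+17) and the elite and secondary coefficients cannot be estimated separately.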
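
The mirrored coefficients in the four-predictor fit (0.0058 on DF1 and -0.0058 on DF2) are what a minimum-norm least-squares solution produces when one predictor is the exact negative of another; statsmodels' `OLS.fit` uses a pseudoinverse by default, which behaves this way. The sketch below reproduces the pattern on synthetic data with NumPy's `lstsq`. It is an illustration under assumed values (true slope 0.02, noise scale 0.1, seed 0), not a re-run of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.integers(-5, 7, size=n).astype(float)      # stand-in for DF1
y = 0.5 + 0.02 * x + rng.normal(0.0, 0.1, size=n)  # synthetic outcome

# Design matrix with a constant, x, and its exact negative (stand-in for DF2).
X_full = np.column_stack([np.ones(n), x, -x])
X_reduced = np.column_stack([np.ones(n), x])

rank_full = np.linalg.matrix_rank(X_full)  # one column is redundant

# lstsq returns the minimum-norm solution when the design is rank deficient.
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)

# The reduced-model slope is split evenly, with opposite signs, across x and -x.
print(rank_full, beta_full, beta_reduced)
```

The same arithmetic links the notebook's own fits: the DF1 coefficient is 0.0058 when DF2 is included and 0.0117 when it is dropped.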

In [181]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1']] )
result = sm.OLS(y, X).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | HomeWin | R-squared: | 0.003 |
| Model: | OLS | Adj. R-squared: | 0.001 |
| Method: | Least Squares | F-statistic: | 1.635 |
| Date: | Mon, 27 Nov 2017 | Prob (F-statistic): | 0.196 |
| Time: | 15:23:17 | Log-Likelihood: | -734.13 |
| No. Observations: | 1015 | AIC: | 1474.0 |
| Df Residuals: | 1012 | BIC: | 1489.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.5155 | 0.016 | 32.242 | 0.000 | 0.484 0.547 |
| DF1 | 0.0117 | 0.010 | 1.129 | 0.259 | -0.009 0.032 |
| DD1 | 0.0025 | 0.016 | 0.153 | 0.878 | -0.029 0.034 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.213 | Durbin-Watson: | 1.889 |
| Prob(Omnibus): | 0.545 | Jarque-Bera (JB): | 167.003 |
| Skew: | -0.084 | Prob(JB): | 5.44e-37 |
| Kurtosis: | 1.02 | Cond. No. | 2.92 |


In [183]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1']] )
result = sm.Logit(y, X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.690637
         Iterations 4


| | | | |
|---|---|---|---|
| Dep. Variable: | HomeWin | No. Observations: | 1015.0 |
| Model: | Logit | Df Residuals: | 1012.0 |
| Method: | MLE | Df Model: | 2.0 |
| Date: | Mon, 27 Nov 2017 | Pseudo R-squ.: | 0.002329 |
| Time: | 15:23:24 | Log-Likelihood: | -701.0 |
| converged: | True | LL-Null: | -702.63 |
| | | LLR p-value: | 0.1947 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.0624 | 0.064 | 0.973 | 0.331 | -0.063 0.188 |
| DF1 | 0.0470 | 0.042 | 1.128 | 0.259 | -0.035 0.129 |
| DD1 | 0.0099 | 0.065 | 0.153 | 0.879 | -0.117 0.137 |
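
Logit coefficients are on the log-odds scale, so they are not directly comparable with the OLS coefficients. As a rough sketch, the reported estimates (const 0.0624, DF1 0.0470) can be converted to probabilities with the logistic function; the helper name `logistic` is introduced here for illustration.

```python
import math

def logistic(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the Logit summary (DF1, DD1 model).
const, b_df1 = 0.0624, 0.0470

p_even = logistic(const)              # home win probability when DF1 = DD1 = 0
p_plus_one = logistic(const + b_df1)  # home side has one more elite forward
odds_ratio = math.exp(b_df1)          # multiplicative change in the odds per unit of DF1

print(round(p_even, 4), round(p_plus_one, 4),
      round(p_plus_one - p_even, 4), round(odds_ratio, 3))
```

The implied change in win probability, roughly 0.012, is close to the DF1 coefficient of 0.0117 in the two-predictor OLS fit, so the two models tell the same (statistically insignificant) story.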


In [184]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF2', 'DD2']] )
result = sm.Logit(y, X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.690637
         Iterations 4


| | | | |
|---|---|---|---|
| Dep. Variable: | HomeWin | No. Observations: | 1015.0 |
| Model: | Logit | Df Residuals: | 1012.0 |
| Method: | MLE | Df Model: | 2.0 |
| Date: | Mon, 27 Nov 2017 | Pseudo R-squ.: | 0.002329 |
| Time: | 15:23:28 | Log-Likelihood: | -701.0 |
| converged: | True | LL-Null: | -702.63 |
| | | LLR p-value: | 0.1947 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.0624 | 0.064 | 0.973 | 0.331 | -0.063 0.188 |
| DF2 | -0.0470 | 0.042 | -1.128 | 0.259 | -0.129 0.035 |
| DD2 | -0.0099 | 0.065 | -0.153 | 0.879 | -0.137 0.117 |


In [185]:
y = dm['GD']  
X = sm.add_constant(dm[['DF1', 'DD1', 'DF2', 'DD2']] )
result = sm.OLS(y, X).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | GD | R-squared: | 0.003 |
| Model: | OLS | Adj. R-squared: | 0.001 |
| Method: | Least Squares | F-statistic: | 1.435 |
| Date: | Mon, 27 Nov 2017 | Prob (F-statistic): | 0.239 |
| Time: | 15:23:31 | Log-Likelihood: | -2343.0 |
| No. Observations: | 1015 | AIC: | 4692.0 |
| Df Residuals: | 1012 | BIC: | 4707.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.1787 | 0.078 | 2.291 | 0.022 | 0.026 0.332 |
| DF1 | 0.0409 | 0.025 | 1.617 | 0.106 | -0.009 0.090 |
| DD1 | -0.0324 | 0.039 | -0.824 | 0.410 | -0.110 0.045 |
| DF2 | -0.0409 | 0.025 | -1.617 | 0.106 | -0.090 0.009 |
| DD2 | 0.0324 | 0.039 | 0.824 | 0.410 | -0.045 0.110 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.427 | Durbin-Watson: | 1.918 |
| Prob(Omnibus): | 0.808 | Jarque-Bera (JB): | 0.516 |
| Skew: | 0.001 | Prob(JB): | 0.773 |
| Kurtosis: | 2.89 | Cond. No. | 1e+17 |
