## roster model estimation

To determine the impact each roster position has on team success, we need to examine the quality of the players dressed in each game and the result of that game. For each roster position, every team has elite players and secondary players. Elite players are assigned a value of 1 and secondary players a value of 2.

### import data sets  "play by play goal detail" and "game detail"

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy
import zipfile


pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 200)


In [2]:
pwd

'/Users/stefanostselios/Desktop/nhl_roster_design-master'

In [3]:
d0 = pd.read_csv('season_games.csv', index_col=0)
d1 = pd.read_csv('game_team_roster_quality.csv', index_col=0)

In [4]:
d0['WinTeam'] = d0.apply(lambda x: 'HOME' if x['GD'] > 0 else 'AWAY', axis=1)
d0 = d0[['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD', 'WinTeam']]
d0.head()

Unnamed: 0,Season,GameNumber,VTeamCode,HTeamCode,HGF,VGF,GD,WinTeam
0,2010,20001,MTL,TOR,3,2,1,HOME
1,2010,20002,PHI,PIT,2,3,-1,AWAY
2,2010,20003,CAR,MIN,3,4,-1,AWAY
3,2010,20004,CHI,COL,4,3,1,HOME
4,2010,20005,CGY,EDM,4,0,4,HOME


In [5]:
dm = d0.merge(d1, on=['Season', 'GameNumber'], how='left')

- Calculate the difference in player quality per game for each position, taken from the home team's perspective (home team - visitor team). There are 5 positions and 2 types of player quality. This will give us a total of 10 differences.

In [6]:
dm = dm[dm['GameNumber'] <= 21230]
dm.shape

(1230, 16)

In [7]:
dm['VF'] = dm['VF1'] + dm['VF2']
dm['VD'] = dm['VD1'] + dm['VD2']
dm['HF'] = dm['HF1'] + dm['HF2']
dm['HD'] = dm['HD1'] + dm['HD2']

In [8]:
dm['F'] = dm['VF'] + dm['HF']
dm['D'] = dm['VD'] + dm['HD']
dm.head()

Unnamed: 0,Season,GameNumber,VTeamCode,HTeamCode,HGF,VGF,GD,WinTeam,VF1,VF2,VD1,VD2,HF1,HF2,HD1,HD2,VF,VD,HF,HD,F,D
0,2010,20001,MTL,TOR,3,2,1,HOME,2.0,10.0,1.0,5.0,2.0,10.0,1.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0
1,2010,20002,PHI,PIT,2,3,-1,AWAY,5.0,7.0,1.0,5.0,4.0,8.0,2.0,4.0,12.0,6.0,12.0,6.0,24.0,12.0
2,2010,20003,CAR,MIN,3,4,-1,AWAY,3.0,9.0,1.0,5.0,2.0,10.0,1.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0
3,2010,20004,CHI,COL,4,3,1,HOME,4.0,8.0,2.0,4.0,2.0,10.0,1.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0
4,2010,20005,CGY,EDM,4,0,4,HOME,3.0,9.0,1.0,5.0,0.0,12.0,1.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0


In [9]:
dm['F'].value_counts()

24.0    1018
23.0     154
25.0      31
22.0      12
21.0       1
Name: F, dtype: int64

In [10]:
dm['D'].value_counts()

12.0    1018
13.0     154
11.0      31
14.0      12
15.0       1
Name: D, dtype: int64

In [11]:
dm = dm[((dm['VF'] == 12) & (dm['VD'] == 6) & (dm['HF'] == 12) & (dm['HD'] == 6))]

In [12]:
dm['VD'].value_counts()

6.0    1015
Name: VD, dtype: int64

In [13]:
dm['HD'].value_counts()

6.0    1015
Name: HD, dtype: int64

In [14]:
dm['VF'].value_counts()

12.0    1015
Name: VF, dtype: int64

In [15]:
dm['HF'].value_counts()

12.0    1015
Name: HF, dtype: int64

## Summary analysis

In [16]:
dm.describe()

Unnamed: 0,Season,GameNumber,HGF,VGF,GD,VF1,VF2,VD1,VD2,HF1,HF2,HD1,HD2,VF,VD,HF,HD,F,D
count,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0,1015.0
mean,2010.0,20622.621675,2.941872,2.739901,0.20197,2.805911,9.194089,1.121182,4.878818,3.209852,8.790148,1.26601,4.73399,12.0,6.0,12.0,6.0,24.0,12.0
std,0.0,352.180753,1.716485,1.634,2.438513,1.519357,1.519357,0.685216,0.685216,1.583414,1.583414,0.667783,0.667783,0.0,0.0,0.0,0.0,0.0,0.0
min,2010.0,20001.0,0.0,0.0,-8.0,0.0,6.0,0.0,4.0,0.0,6.0,0.0,4.0,12.0,6.0,12.0,6.0,24.0,12.0
25%,2010.0,20319.5,2.0,2.0,-1.0,2.0,8.0,1.0,4.0,2.0,7.0,1.0,4.0,12.0,6.0,12.0,6.0,24.0,12.0
50%,2010.0,20628.0,3.0,3.0,1.0,3.0,9.0,1.0,5.0,3.0,9.0,1.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0
75%,2010.0,20927.5,4.0,4.0,2.0,4.0,10.0,2.0,5.0,5.0,10.0,2.0,5.0,12.0,6.0,12.0,6.0,24.0,12.0
max,2010.0,21230.0,9.0,10.0,7.0,6.0,12.0,2.0,6.0,6.0,12.0,2.0,6.0,12.0,6.0,12.0,6.0,24.0,12.0


In [17]:
dm = dm[['Season', 'GameNumber', 'VTeamCode', 'HTeamCode', 'HGF', 'VGF', 'GD','WinTeam',
         'VF1', 'VF2', 'VD1', 'VD2', 
         'HF1', 'HF2', 'HD1', 'HD2']]

In [18]:
dm['HomeWin'] = dm.apply(lambda x: 1 if x['WinTeam']=='HOME' else 0, axis=1)
dm['DF1'] = dm['HF1'] - dm['VF1']
dm['DF2'] = dm['HF2'] - dm['VF2']
dm['DD1'] = dm['HD1'] - dm['VD1']
dm['DD2'] = dm['HD2'] - dm['VD2']

In [19]:
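# select several columns with a list (double brackets); the bare-tuple form is not accepted by recent pandas
dm.groupby(['WinTeam'])[['DF1', 'DF2', 'DD1', 'DD2']].describe()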

WinTeam,Statistic,DF1,DF2,DD1,DD2
AWAY,count,486.0,486.0,486.0,486.0
AWAY,mean,0.261317,-0.261317,0.080247,-0.080247
AWAY,std,2.204874,2.204874,0.967375,0.967375
AWAY,min,-5.0,-6.0,-2.0,-2.0
AWAY,25%,-1.0,-2.0,-1.0,-1.0
AWAY,50%,0.0,0.0,0.0,0.0
AWAY,75%,2.0,1.0,1.0,1.0
AWAY,max,6.0,5.0,2.0,2.0
HOME,count,529.0,529.0,529.0,529.0
HOME,mean,0.534972,-0.534972,0.204159,-0.204159


## Mean number of F1, F2, D1, D2 per team

* create a season-team dataframe
  * number of wins/points/winning percentage
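
One way the season-team dataframe could be built is sketched below. It is only an illustration: the `games` frame is a made-up stand-in that borrows the `Season`, `VTeamCode`, `HTeamCode` and `GD` column names from `d0`, and the names `team_games`, `season_team` and `WinPct` are hypothetical. Points are left out because the columns shown in this notebook do not record overtime or shootout losses, which NHL standings points depend on.

```python
import pandas as pd

# Toy stand-in for d0: column names follow the notebook, values are invented.
games = pd.DataFrame({
    'Season':    [2010, 2010, 2010, 2010],
    'VTeamCode': ['MTL', 'TOR', 'MTL', 'PHI'],
    'HTeamCode': ['TOR', 'PHI', 'PHI', 'MTL'],
    'GD':        [1, -2, 3, -1],   # home goals minus visitor goals
})

# One row per team per game: stack the home and the visitor perspective.
home = pd.DataFrame({'Season': games['Season'],
                     'Team': games['HTeamCode'],
                     'Win': (games['GD'] > 0).astype(int)})
away = pd.DataFrame({'Season': games['Season'],
                     'Team': games['VTeamCode'],
                     'Win': (games['GD'] < 0).astype(int)})
team_games = pd.concat([home, away], ignore_index=True)

# Aggregate to one row per season and team.
season_team = (team_games.groupby(['Season', 'Team'])['Win']
               .agg(Games='count', Wins='sum')
               .reset_index())
season_team['WinPct'] = season_team['Wins'] / season_team['Games']

print(season_team)
```

The same stacking step could carry the `F1`, `F2`, `D1`, `D2` counts along, so that their per-team means come out of the same `groupby`.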

### estimate roster model 

- Regress home win on the difference between the number of home and visitor players by position and quality (the predictor variables). Add a constant to the predictors and use OLS. The purpose is to determine the impact each roster position has on home team success.

In [20]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1', 'DF2', 'DD2']] )
result = sm.OLS(y, X).fit()
result.summary()

0,1,2,3
Dep. Variable:,HomeWin,R-squared:,0.005
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,2.508
Date:,"Sun, 26 Nov 2017",Prob (F-statistic):,0.0819
Time:,16:31:38,Log-Likelihood:,-733.26
No. Observations:,1015,AIC:,1473.0
Df Residuals:,1012,BIC:,1487.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,0.5147,0.016,32.281,0.000,0.483 0.546
DF1,0.0043,0.004,0.985,0.325,-0.004 0.013
DD1,0.0104,0.010,1.055,0.292,-0.009 0.030
DF2,-0.0043,0.004,-0.985,0.325,-0.013 0.004
DD2,-0.0104,0.010,-1.055,0.292,-0.030 0.009

0,1,2,3
Omnibus:,1.199,Durbin-Watson:,1.89
Prob(Omnibus):,0.549,Jarque-Bera (JB):,165.846
Skew:,-0.084,Prob(JB):,9.7e-37
Kurtosis:,1.027,Cond. No.,2.1e+16


In [21]:
result.params

const    0.514692
DF1      0.004312
DD1      0.010381
DF2     -0.004312
DD2     -0.010381
dtype: float64

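- Increasing the differential of **elite** players (home team – visitor team) by one unit **increases** the estimated probability of a home win by about 0.4 percentage points for forwards and 1.0 percentage point for defense.
- Increasing the differential of **secondary** players (home team – visitor team) by one unit **decreases** it by the same amounts.
- These are not separate effects. The sample keeps only games in which both teams dress 12 forwards and 6 defensemen, so DF2 = -DF1 and DD2 = -DD1, and the four predictors carry only two independent pieces of information (note `Df Model: 2` and the condition number of 2.1e+16). The two-predictor model in the next cell gives the combined effect of swapping one secondary player for an elite one: about 0.9 percentage points for forwards and 2.1 for defense. None of the coefficients is statistically significant at conventional levels (all p > 0.29).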
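
The mirror-image coefficients in the table can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not the notebook's data: the roster counts and outcomes are randomly generated, and plain NumPy least squares stands in for `sm.OLS` (which solves the problem with a pseudoinverse by default). With a fixed 12 forwards per team, the secondary differential is the exact negative of the elite one, and the minimum-norm solution splits the single identifiable slope evenly between the two columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical elite-forward counts (0..6); each team dresses 12 forwards.
hf1 = rng.integers(0, 7, size=n)
vf1 = rng.integers(0, 7, size=n)
hf2, vf2 = 12 - hf1, 12 - vf1

df1 = hf1 - vf1          # elite differential, home minus visitor
df2 = hf2 - vf2          # secondary differential, home minus visitor
assert np.array_equal(df2, -df1)

# Simulated home-win indicator; the true slope here is an arbitrary choice.
y = (rng.random(n) < 0.5 + 0.01 * df1).astype(float)

X_full = np.column_stack([np.ones(n), df1, df2])   # const, DF1, DF2
X_red = np.column_stack([np.ones(n), df1])         # const, DF1

rank_full = np.linalg.matrix_rank(X_full)

# lstsq returns the minimum-norm solution when X is rank deficient.
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
beta_red = np.linalg.lstsq(X_red, y, rcond=None)[0]

print("rank of 3-column design:", rank_full)
print("full model   (const, DF1, DF2):", beta_full)
print("reduced model (const, DF1):    ", beta_red)
```

In the full fit the DF1 and DF2 coefficients come out equal in size and opposite in sign, each half of the DF1 slope from the reduced fit. This is the same pattern as 0.0043 and -0.0043 here against 0.0086 in the two-predictor model.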

In [23]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1']] )
result = sm.OLS(y, X).fit()
result.summary()

0,1,2,3
Dep. Variable:,HomeWin,R-squared:,0.005
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,2.508
Date:,"Sun, 26 Nov 2017",Prob (F-statistic):,0.0819
Time:,16:32:00,Log-Likelihood:,-733.26
No. Observations:,1015,AIC:,1473.0
Df Residuals:,1012,BIC:,1487.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,0.5147,0.016,32.281,0.000,0.483 0.546
DF1,0.0086,0.009,0.985,0.325,-0.009 0.026
DD1,0.0208,0.020,1.055,0.292,-0.018 0.059

0,1,2,3
Omnibus:,1.199,Durbin-Watson:,1.89
Prob(Omnibus):,0.549,Jarque-Bera (JB):,165.846
Skew:,-0.084,Prob(JB):,9.7e-37
Kurtosis:,1.027,Cond. No.,3.05


In [22]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1', 'DF2', 'DD2']] )
result = sm.Logit(y, X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.689779
         Iterations 4


LinAlgError: Singular matrix
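
The error is consistent with a rank-deficient design matrix: the optimizer reports convergence, but the summary presumably needs an inverse that does not exist when DF2 and DD2 are exact negatives of DF1 and DD1. A small guard like the one below could catch this before fitting. It is a sketch only: `drop_collinear` is a hypothetical helper, not part of statsmodels or pandas, and the `toy` frame is invented data that mimics the column names used here.

```python
import numpy as np
import pandas as pd

def drop_collinear(X: pd.DataFrame) -> pd.DataFrame:
    """Keep columns, left to right, only if each one raises the matrix rank."""
    kept = []
    for col in X.columns:
        trial = X[kept + [col]].to_numpy(dtype=float)
        if np.linalg.matrix_rank(trial) == len(kept) + 1:
            kept.append(col)
    return X[kept]

# Invented differentials; DF2 and DD2 are exact negatives, as in the filtered sample.
toy = pd.DataFrame({'DF1': [1, -2, 0, 3, -1],
                    'DD1': [0, 1, -1, 1, 0]})
toy['DF2'] = -toy['DF1']
toy['DD2'] = -toy['DD1']
toy.insert(0, 'const', 1.0)

reduced = drop_collinear(toy)
print(list(reduced.columns))
```

Because columns are tested left to right, the elite differentials survive and the redundant secondary ones are dropped, which matches the two-predictor specification fitted in the next cell.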

In [24]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF1', 'DD1']] )
result = sm.Logit(y, X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.689779
         Iterations 4


0,1,2,3
Dep. Variable:,HomeWin,No. Observations:,1015.0
Model:,Logit,Df Residuals:,1012.0
Method:,MLE,Df Model:,2.0
Date:,"Sun, 26 Nov 2017",Pseudo R-squ.:,0.003569
Time:,16:32:15,Log-Likelihood:,-700.13
converged:,True,LL-Null:,-702.63
,,LLR p-value:,0.08144

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
const,0.0591,0.064,0.922,0.356,-0.066 0.185
DF1,0.0347,0.035,0.985,0.324,-0.034 0.104
DD1,0.0835,0.079,1.055,0.291,-0.072 0.239
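
Logit coefficients are on the log-odds scale, so they are not directly comparable to the OLS slopes. The arithmetic below is a rough sketch of one common conversion, using the coefficients from the table above: the baseline probability when both differentials are zero, the odds ratio per unit, and the marginal effect evaluated at that baseline. The variable names are illustrative only.

```python
import math

# Coefficients copied from the Logit summary (const, DF1, DD1).
const, b_df1, b_dd1 = 0.0591, 0.0347, 0.0835

def logistic(z: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

# Estimated home-win probability when both differentials are zero.
p0 = logistic(const)

# Multiplicative change in the odds of a home win per one-unit increase.
odds_ratio_df1 = math.exp(b_df1)
odds_ratio_dd1 = math.exp(b_dd1)

# Marginal effect on the probability, evaluated at the baseline: p(1-p) * beta.
me_df1 = p0 * (1.0 - p0) * b_df1
me_dd1 = p0 * (1.0 - p0) * b_dd1

print(f"baseline P(home win): {p0:.4f}")
print(f"odds ratios  DF1: {odds_ratio_df1:.4f}  DD1: {odds_ratio_dd1:.4f}")
print(f"marginal effects at baseline  DF1: {me_df1:.4f}  DD1: {me_dd1:.4f}")
```

The baseline probability and the marginal effects land close to the constant and slopes of the two-predictor OLS model, which is expected when fitted probabilities stay near 0.5.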


In [25]:
y = dm['HomeWin']  
X = sm.add_constant(dm[['DF2', 'DD2']] )
result = sm.Logit(y, X).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.689779
         Iterations 4


0,1,2,3
Dep. Variable:,HomeWin,No. Observations:,1015.0
Model:,Logit,Df Residuals:,1012.0
Method:,MLE,Df Model:,2.0
Date:,"Sun, 26 Nov 2017",Pseudo R-squ.:,0.003569
Time:,16:41:22,Log-Likelihood:,-700.13
converged:,True,LL-Null:,-702.63
,,LLR p-value:,0.08144

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
const,0.0591,0.064,0.922,0.356,-0.066 0.185
DF2,-0.0347,0.035,-0.985,0.324,-0.104 0.034
DD2,-0.0835,0.079,-1.055,0.291,-0.239 0.072
