## team roster quality

To determine the impact each roster position has on team success, we need to examine the quality of the players dressed in each game alongside the result of that game. For each roster position, every team has elite players and secondary players. Elite players are assigned a value of 1 and secondary players a value of 2.

In [137]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy

### import data set

In [138]:
dm = pd.read_csv('out_data/play_by_play_with_player_rank.csv')
dm = dm.drop('Unnamed: 0', axis=1)



In [139]:
dm.columns

Index(['Season', 'GameNumber', 'GameDate', 'Period', 'AdvantageType', 'Zone',
       'EventNumber', 'EventType', 'EventDetail', 'EventTeamCode',
       'EventPlayerNumber', 'EventPlayerName', 'EventTimeFromZero',
       'EventTimeFromTwenty', 'TeamCode', 'PlayerNumber', 'PlayerPosition',
       'ShotType', 'ShotResult', 'Length', 'PenaltyType', 'Rank'],
      dtype='object')

In [140]:
dm.shape

(3063332, 22)

- work on a separate copy of the data set and name it dq (data quality), so that new columns are not added to dm as well.

In [141]:
dq = dm.copy()
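
Plain assignment binds a second name to the same DataFrame, whereas .copy() creates an independent one. The sketch below illustrates the difference on a toy frame; dm_toy, alias and independent are hypothetical names used only for this example.

```python
import pandas as pd

# Toy frame standing in for dm (hypothetical values).
dm_toy = pd.DataFrame({'EventType': ['GOAL', 'SHOT']})

alias = dm_toy               # same object under a second name
independent = dm_toy.copy()  # separate object with its own data

# Adding a column through the alias also adds it to dm_toy.
alias['goal'] = [1, 0]

print('goal' in dm_toy.columns)       # the write through the alias is visible here
print('goal' in independent.columns)  # the copy is unaffected
```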

- create a variable that flags whether an on-ice event is a goal: a value of 1 is assigned if the event is a goal and 0 otherwise.

In [142]:
dq['goal'] = (dq['EventType'] == 'GOAL').astype(int)

- group by season, game number and team to sum the goals each team scores per game.

In [143]:
dq['score'] = dq.groupby(['Season', 'GameNumber', 'TeamCode'])['goal'].transform('sum')
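
As a minimal sketch of what transform('sum') does here, the toy frame below (hypothetical values, not taken from the data set) shows that the group total is written back onto every row of the group, so the number of rows does not change.

```python
import pandas as pd

# Hypothetical play-by-play rows for a single game.
toy = pd.DataFrame({
    'Season': [2010] * 5,
    'GameNumber': [20001] * 5,
    'TeamCode': ['MTL', 'MTL', 'MTL', 'TOR', 'TOR'],
    'EventType': ['GOAL', 'SHOT', 'GOAL', 'GOAL', 'HIT'],
})

# Flag goal rows, then broadcast the per-team total to each row.
toy['goal'] = (toy['EventType'] == 'GOAL').astype(int)
toy['score'] = (
    toy.groupby(['Season', 'GameNumber', 'TeamCode'])['goal'].transform('sum')
)

print(toy[['TeamCode', 'goal', 'score']])
```

Note that transform('sum') adds up flagged rows, so it equals the goals scored only when each goal appears on exactly one row per group.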

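- keep one observation per season, game, team and player (drop duplicates). Taking a copy of the result avoids pandas warnings when new columns are assigned later.

In [144]:
dq = dq.drop_duplicates(['Season', 'GameNumber', 'TeamCode', 'PlayerNumber']).copy()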

In [145]:
dq.isnull().sum()

Season                     0
GameNumber                 0
GameDate                   0
Period                     0
AdvantageType              0
Zone                       0
EventNumber                0
EventType                  0
EventDetail                0
EventTeamCode              0
EventPlayerNumber          0
EventPlayerName            0
EventTimeFromZero          0
EventTimeFromTwenty        0
TeamCode                   0
PlayerNumber               0
PlayerPosition             0
ShotType               37953
ShotResult             44978
Length                 39953
PenaltyType            46498
Rank                       0
goal                       0
score                      0
dtype: int64

### count the number of quality players per position for each game

- group by season, game number, team and player to count the occurrences of each player per game, then sum those counts per team. For the data set to be correct, there should be 19 players per team and 38 per game.

In [146]:
dq['playercount'] = dq.groupby(['Season', 'GameNumber', 'TeamCode', 'PlayerNumber'])['PlayerNumber'].transform('count')



In [147]:
dq['roster'] = dq.groupby(['Season', 'GameNumber', 'TeamCode'])['playercount'].transform('sum')
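
The roster total can be used to check the expectation stated above. The following is a sketch on toy data, assuming the same column names; expected_roster is a hypothetical parameter, set to 3 here so the toy example stays small (the text expects 19).

```python
import pandas as pd

# Hypothetical de-duplicated rows: one row per player, team and game.
toy = pd.DataFrame({
    'Season': [2010] * 5,
    'GameNumber': [20001] * 5,
    'TeamCode': ['MTL', 'MTL', 'MTL', 'TOR', 'TOR'],
    'PlayerNumber': [11, 14, 31, 9, 35],
})

keys = ['Season', 'GameNumber', 'TeamCode']
toy['playercount'] = (
    toy.groupby(keys + ['PlayerNumber'])['PlayerNumber'].transform('count')
)
toy['roster'] = toy.groupby(keys)['playercount'].transform('sum')

# Assumed roster size for the toy data (19 in the real data set).
expected_roster = 3

# One row per team and game whose roster size differs from the expectation.
bad = toy.loc[toy['roster'] != expected_roster, keys + ['roster']].drop_duplicates()
print(bad)
```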



- create a column that shows the number of players of each quality rank per position, team and game.

In [148]:
dq['rosterposition'] = dq.groupby(['Season', 'GameNumber', 'TeamCode', 'PlayerPosition', 'Rank'])['playercount'].transform('sum')



In [149]:
dq.isnull().sum()

Season                     0
GameNumber                 0
GameDate                   0
Period                     0
AdvantageType              0
Zone                       0
EventNumber                0
EventType                  0
EventDetail                0
EventTeamCode              0
EventPlayerNumber          0
EventPlayerName            0
EventTimeFromZero          0
EventTimeFromTwenty        0
TeamCode                   0
PlayerNumber               0
PlayerPosition             0
ShotType               37953
ShotResult             44978
Length                 39953
PenaltyType            46498
Rank                       0
goal                       0
score                      0
playercount                0
roster                     0
rosterposition             0
dtype: int64

- the next step is to group players by game number, team code, position and rank to show the quality of players each team has per position. **Pivot table** by player position and rank using the roster position values, with season, game number and team as the index. We then join the column levels to produce one column per position and rank (10 columns).


In [150]:
dq = pd.pivot_table(dq, index=['Season', 'GameNumber', 'TeamCode'], columns=['PlayerPosition', 'Rank'], values=['rosterposition'])
dq = dq.reset_index()
dq.columns = ['_'.join(str(s).strip() for s in col if s) for col in dq.columns]
dq = dq.fillna(0)

In [151]:
dq = dq.rename(columns={'rosterposition_C_1': 'C1', 'rosterposition_C_2': 'C2', 'rosterposition_D_1': 'D1', 'rosterposition_D_2': 'D2', 'rosterposition_G_1' : 'G1', 'rosterposition_G_2': 'G2', 'rosterposition_L_1': 'L1', 'rosterposition_L_2': 'L2', 'rosterposition_R_1': 'R1', 'rosterposition_R_2': 'R2' })
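
The pivot and column flattening can be sketched on a toy frame as follows. This is a variation rather than the notebook's exact code: it passes values as a single string so the columns have only two levels (position, rank), and builds the short names directly instead of renaming afterwards. All values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one per team, position and rank.
toy = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001] * 4,
    'TeamCode': ['MTL', 'MTL', 'TOR', 'TOR'],
    'PlayerPosition': ['C', 'C', 'C', 'D'],
    'Rank': [1, 2, 2, 1],
    'rosterposition': [1, 6, 3, 1],
})

wide = pd.pivot_table(
    toy,
    index=['Season', 'GameNumber', 'TeamCode'],
    columns=['PlayerPosition', 'Rank'],
    values='rosterposition',
)

# Each column label is a (position, rank) tuple; join it into e.g. 'C1'.
wide.columns = [f'{pos}{rank}' for pos, rank in wide.columns]

# Position/rank combinations a team does not have become 0.
wide = wide.reset_index().fillna(0)
print(wide)
```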


- the data set now shows the number of players of each quality rank per team for every regular season game, with two rows per game (one per team). We have to pivot the table by team in order to have one observation per game.

In [152]:
dq.head(10)

Unnamed: 0,Season,GameNumber,TeamCode,C1,C2,D1,D2,G1,G2,L1,L2,R1,R2
0,2010,20001,MTL,1.0,6.0,1.0,5.0,1.0,0.0,0.0,4.0,0.0,1.0
1,2010,20001,TOR,2.0,3.0,1.0,5.0,0.0,1.0,2.0,1.0,0.0,4.0
2,2010,20002,PHI,3.0,2.0,2.0,4.0,1.0,0.0,1.0,4.0,1.0,1.0
3,2010,20002,PIT,2.0,6.0,2.0,4.0,1.0,0.0,1.0,2.0,0.0,1.0
4,2010,20003,CAR,2.0,3.0,2.0,4.0,1.0,0.0,0.0,3.0,1.0,3.0
5,2010,20003,MIN,1.0,3.0,1.0,5.0,1.0,0.0,0.0,3.0,1.0,4.0
6,2010,20004,CHI,1.0,2.0,2.0,4.0,0.0,1.0,0.0,2.0,3.0,4.0
7,2010,20004,COL,2.0,4.0,1.0,5.0,0.0,1.0,0.0,2.0,1.0,3.0
8,2010,20005,CGY,1.0,2.0,1.0,5.0,1.0,0.0,1.0,5.0,1.0,2.0
9,2010,20005,EDM,1.0,5.0,1.0,5.0,0.0,1.0,1.0,3.0,1.0,1.0


- create an index variable to determine whether a team is treated as visitor or home for a given game. The column is named "A". The 1st observation per game is taken as the visitor team and assigned a value of 1. The 2nd and final observation per game is taken as the home team, so we fill the remaining NaN values in "A" with 2 (home team). Note that pivot_table sorts its index, so the two rows of a game are ordered alphabetically by TeamCode; the first row is the visitor only when that order happens to match.

In [153]:
dq.loc[dq.groupby(['Season', 'GameNumber']).head(1).index, 'A'] = 1
dq['A'] = dq['A'].fillna(2)
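
An equivalent way to number the rows within each game is groupby(...).cumcount(). The sketch below uses hypothetical values and is an alternative to the approach above, not the notebook's own code; it only numbers rows in their existing order and does not by itself establish which team is the visitor.

```python
import pandas as pd

# Hypothetical frame with two rows (teams) per game.
toy = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001, 20001, 20002, 20002],
    'TeamCode': ['MTL', 'TOR', 'PHI', 'PIT'],
    'C1': [1.0, 2.0, 3.0, 2.0],
})

# cumcount() numbers rows 0, 1, ... within each group; add 1 to get 1 and 2.
toy['A'] = toy.groupby(['Season', 'GameNumber']).cumcount() + 1
print(toy)
```

Because this version of A holds integers, column names flattened after a later pivot would end in _1 and _2 rather than _1.0 and _2.0.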

- **pivot table using season and game number as index by whether a team is visitor (1) or home (2)**. The table shows the number of players of each quality rank per position and team. The next step is to join the column levels by team and player quality value. Each team has 10 columns (5 positions x 2 quality ranks). We rename the columns as follows: VC1 is the number of elite centers for the visitor team, HC1 is the number of elite centers for the home team, and so on. We then order the columns by team, position and quality.

In [154]:
dq = pd.pivot_table(dq, index=['Season', 'GameNumber'], columns=['A'], values=['C1', 'C2', 'D1', 'D2', 'G1', 'G2', 'L1', 'L2', 'R1', 'R2'])
dq = dq.reset_index()
dq.columns = ['_'.join(str(s).strip() for s in col if s) for col in dq.columns]

In [155]:
dq = dq.rename(columns={'C1_1.0': 'VC1', 'C2_1.0': 'VC2', 'D1_1.0': 'VD1', 'D2_1.0': 'VD2', 'G1_1.0': 'VG1', 'G2_1.0': 'VG2', 'L1_1.0': 'VL1', 'L2_1.0': 'VL2', 'R1_1.0': 'VR1', 'R2_1.0': 'VR2', 'C1_2.0': 'HC1', 'C2_2.0': 'HC2', 'D1_2.0': 'HD1', 'D2_2.0': 'HD2', 'G1_2.0': 'HG1', 'G2_2.0': 'HG2', 'L1_2.0': 'HL1', 'L2_2.0': 'HL2', 'R1_2.0': 'HR1', 'R2_2.0': 'HR2', })
dq = dq[['Season', 'GameNumber', 'VC1', 'VC2', 'VL1', 'VL2', 'VR1', 'VR2',  'VD1', 'VD2', 'VG1', 'VG2',  'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2',  'HD1', 'HD2', 'HG1', 'HG2']]
dq.sort_values(['Season', 'GameNumber'], ascending=[True, True], inplace=True)
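
The V/H column names can also be generated rather than listed by hand. The following is a sketch on toy data with two value columns, assuming A holds the integers 1 (visitor) and 2 (home); the prefix mapping and the names wide and ordered are hypothetical.

```python
import pandas as pd

# Hypothetical frame: two rows per game, flagged 1 (visitor) or 2 (home).
toy = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001, 20001, 20002, 20002],
    'A': [1, 2, 1, 2],
    'C1': [1.0, 2.0, 3.0, 2.0],
    'C2': [6.0, 3.0, 2.0, 6.0],
})

wide = pd.pivot_table(
    toy, index=['Season', 'GameNumber'], columns='A', values=['C1', 'C2']
)

# Column labels are (value name, A) tuples; turn ('C1', 1) into 'VC1'.
prefix = {1: 'V', 2: 'H'}
wide.columns = [prefix[a] + name for name, a in wide.columns]
wide = wide.reset_index()

# Order columns by team first, then position and quality.
ordered = ['Season', 'GameNumber'] + [
    side + name for side in ('V', 'H') for name in ('C1', 'C2')
]
wide = wide[ordered].sort_values(['Season', 'GameNumber'])
print(wide)
```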

In [156]:
dq.head(10)

Unnamed: 0,Season,GameNumber,VC1,VC2,VL1,VL2,VR1,VR2,VD1,VD2,...,HC1,HC2,HL1,HL2,HR1,HR2,HD1,HD2,HG1,HG2
0,2010,20001,1.0,6.0,0.0,4.0,0.0,1.0,1.0,5.0,...,2.0,3.0,2.0,1.0,0.0,4.0,1.0,5.0,0.0,1.0
1,2010,20002,3.0,2.0,1.0,4.0,1.0,1.0,2.0,4.0,...,2.0,6.0,1.0,2.0,0.0,1.0,2.0,4.0,1.0,0.0
2,2010,20003,2.0,3.0,0.0,3.0,1.0,3.0,2.0,4.0,...,1.0,3.0,0.0,3.0,1.0,4.0,1.0,5.0,1.0,0.0
3,2010,20004,1.0,2.0,0.0,2.0,3.0,4.0,2.0,4.0,...,2.0,4.0,0.0,2.0,1.0,3.0,1.0,5.0,0.0,1.0
4,2010,20005,1.0,2.0,1.0,5.0,1.0,2.0,1.0,5.0,...,1.0,5.0,1.0,3.0,1.0,1.0,1.0,5.0,0.0,1.0
5,2010,20006,2.0,3.0,1.0,3.0,0.0,3.0,0.0,6.0,...,4.0,3.0,2.0,2.0,0.0,1.0,1.0,5.0,1.0,0.0
6,2010,20007,2.0,4.0,3.0,1.0,0.0,2.0,0.0,6.0,...,1.0,3.0,3.0,0.0,0.0,4.0,0.0,7.0,1.0,0.0
7,2010,20008,2.0,3.0,0.0,3.0,1.0,3.0,2.0,4.0,...,1.0,4.0,0.0,3.0,1.0,3.0,1.0,5.0,1.0,0.0
8,2010,20009,1.0,3.0,0.0,2.0,3.0,3.0,1.0,5.0,...,2.0,4.0,1.0,2.0,1.0,2.0,2.0,4.0,1.0,0.0
9,2010,20010,1.0,6.0,1.0,0.0,2.0,2.0,1.0,5.0,...,0.0,5.0,0.0,3.0,1.0,3.0,2.0,4.0,0.0,1.0


- save the new data set.

In [157]:
dq.to_csv('out_data/team_roster_quality.csv', index=False, sep=',')
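
The index argument expects a boolean. A non-empty string such as 'False' is truthy in Python, so it behaves like index=True and the row index ends up in the file. The sketch below uses an in-memory buffer and a hypothetical one-row frame to show the effect of writing the index when the file is read back.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Season': [2010], 'GameNumber': [20001]})

# A non-empty string is truthy, whatever its text says.
print(bool('False'))

# Writing the index adds an unnamed leading column to the CSV.
with_index = io.StringIO()
toy.to_csv(with_index, index=True)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# Writing without the index keeps only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```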