## team roster quality

To determine the impact each roster position has on team success, we need to examine the quality of the players dressed for each game and the result of each game. For each roster position, every team has elite players and secondary players. Elite players are assigned a value of 1, whereas secondary players are assigned a value of 2.

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy



In [2]:
if not os.path.exists('out_data'):
    os.makedirs('out_data')

### import data set

In [39]:
dm = pd.read_csv('out_data/play_by_play_with_player_rank.csv')
dm = dm.drop('Unnamed: 0', axis=1)



In [40]:
dm.columns

Index(['Season', 'GameNumber', 'GameDate', 'Period', 'AdvantageType', 'Zone',
       'EventNumber', 'EventType', 'EventDetail', 'EventTeamCode',
       'EventPlayerNumber', 'EventPlayerName', 'EventTimeFromZero',
       'EventTimeFromTwenty', 'TeamCode', 'PlayerNumber', 'PlayerPosition',
       'ShotType', 'ShotResult', 'Length', 'PenaltyType', 'Rank'],
      dtype='object')

In [41]:
dm.shape

(3061930, 22)

- work on a copy of the data set, named dq (data quality), so that later column assignments do not modify dm.

In [42]:
dq = dm.copy()

- create a variable that indicates whether an on-ice event is a goal: 1 if the event is a goal, 0 otherwise.

In [43]:
# vectorized comparison; same result as a row-wise apply, but much faster on 3 million rows
dq['goal'] = (dq['EventType'] == 'GOAL').astype(int)

- group by season, game number and team to sum up the goal events recorded for each team per game.

In [44]:
dq['score'] = dq.groupby(['Season', 'GameNumber', 'TeamCode'])['goal'].transform('sum')
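The two steps above can be sketched on a small, hypothetical event table (the rows below are made up for illustration and are not taken from the play-by-play file). It shows how `transform('sum')` broadcasts each group's total back onto every row of the group:

```python
import pandas as pd

# Hypothetical toy events, one row per event (not real play-by-play data).
events = pd.DataFrame({
    'Season': [2010] * 5,
    'GameNumber': [20001] * 5,
    'TeamCode': ['MTL', 'MTL', 'TOR', 'TOR', 'TOR'],
    'EventType': ['GOAL', 'SHOT', 'GOAL', 'FAC', 'GOAL'],
})

# A boolean comparison cast to int gives the 0/1 goal indicator.
events['goal'] = (events['EventType'] == 'GOAL').astype(int)

# transform('sum') returns one value per input row: the total of its group.
keys = ['Season', 'GameNumber', 'TeamCode']
events['score'] = events.groupby(keys)['goal'].transform('sum')

print(events[['TeamCode', 'EventType', 'goal', 'score']])
```

If the real table holds one row per on-ice player per event, as the repeated first event in the `head()` output further down suggests, the sum counts goal rows rather than goals on the scoreboard, so `score` should be read with that in mind.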

- keep one observation per season, game, team and player (drop duplicates).

In [45]:
dq = dq.drop_duplicates(['Season', 'GameNumber', 'TeamCode', 'PlayerNumber'])

In [46]:
# exclude goalies; copy so that new columns can be added without a SettingWithCopyWarning
dq = dq[dq['PlayerPosition'] != 'G'].copy()
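A minimal sketch of the de-duplication and goalie filter, using hypothetical rows. Including `Season` in the key is an assumption made here to guard against game numbers repeating across seasons:

```python
import pandas as pd

# Hypothetical rows: players appear once per event they were on the ice for.
rows = pd.DataFrame({
    'Season': [2010] * 5,
    'GameNumber': [20001] * 5,
    'TeamCode': ['MTL'] * 5,
    'PlayerNumber': [11, 11, 31, 79, 79],
    'PlayerPosition': ['C', 'C', 'G', 'D', 'D'],
})

# Keep the first row for each (season, game, team, player) combination.
key = ['Season', 'GameNumber', 'TeamCode', 'PlayerNumber']
skaters = rows.drop_duplicates(key)

# Drop goalies so that only skaters remain.
skaters = skaters[skaters['PlayerPosition'] != 'G'].copy()

print(skaters)
```

`drop_duplicates` keeps the first occurrence by default, so the event columns of the surviving row describe whichever event happened to come first for that player.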

In [47]:
dq.isnull().sum()

Season                     0
GameNumber                 0
GameDate                   0
Period                     0
AdvantageType              0
Zone                       0
EventNumber                0
EventType                  0
EventDetail                0
EventTeamCode              0
EventPlayerNumber          0
EventPlayerName            0
EventTimeFromZero          0
EventTimeFromTwenty        0
TeamCode                   0
PlayerNumber               0
PlayerPosition             0
ShotType               35299
ShotResult             42322
Length                 37297
PenaltyType            43846
Rank                       0
goal                       0
score                      0
dtype: int64

### count the number of quality players per position for each game

- group by season, game number, team and player to count the occurrences of each player per game, then sum those counts per team. With goalies excluded, a complete roster should show 18 skaters per team (36 per game).

In [48]:
dq['playercount'] = dq.groupby(['Season', 'GameNumber', 'TeamCode', 'PlayerNumber'])['PlayerNumber'].transform('count')

In [49]:
dq['roster'] = dq.groupby(['Season', 'GameNumber', 'TeamCode'])['playercount'].transform('sum')

- create a column that holds the number of players of each quality level (rank) per position, team and game.

In [50]:
dq['rosterposition'] = dq.groupby(['Season', 'GameNumber', 'TeamCode', 'PlayerPosition', 'Rank'])['playercount'].transform('sum')
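The three counting columns can be illustrated on a hypothetical roster fragment (player numbers, positions and ranks below are invented). Each `transform` call groups at a different level and writes the group total back to every row:

```python
import pandas as pd

# Hypothetical de-duplicated roster rows: one row per player per game.
players = pd.DataFrame({
    'Season': [2010] * 6,
    'GameNumber': [20001] * 6,
    'TeamCode': ['MTL'] * 4 + ['TOR'] * 2,
    'PlayerNumber': [11, 14, 15, 79, 37, 3],
    'PlayerPosition': ['C', 'C', 'C', 'D', 'C', 'D'],
    'Rank': [1, 2, 2, 2, 2, 1],
})

team = ['Season', 'GameNumber', 'TeamCode']

# 1 per player once duplicates are gone.
players['playercount'] = (
    players.groupby(team + ['PlayerNumber'])['PlayerNumber'].transform('count')
)
# Roster size per team and game.
players['roster'] = players.groupby(team)['playercount'].transform('sum')
# Number of players sharing the same position and rank within a team.
players['rosterposition'] = (
    players.groupby(team + ['PlayerPosition', 'Rank'])['playercount'].transform('sum')
)

print(players[['TeamCode', 'PlayerPosition', 'Rank', 'roster', 'rosterposition']])
```

In this toy frame the two secondary centers on the first team both receive `rosterposition` 2, which is the value the later pivot picks up for that position and rank.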

In [51]:
dq.isnull().sum()

Season                     0
GameNumber                 0
GameDate                   0
Period                     0
AdvantageType              0
Zone                       0
EventNumber                0
EventType                  0
EventDetail                0
EventTeamCode              0
EventPlayerNumber          0
EventPlayerName            0
EventTimeFromZero          0
EventTimeFromTwenty        0
TeamCode                   0
PlayerNumber               0
PlayerPosition             0
ShotType               35299
ShotResult             42322
Length                 37297
PenaltyType            43846
Rank                       0
goal                       0
score                      0
playercount                0
roster                     0
rosterposition             0
dtype: int64

In [52]:
dq.head()

Unnamed: 0,Season,GameNumber,GameDate,Period,AdvantageType,Zone,EventNumber,EventType,EventDetail,EventTeamCode,...,ShotType,ShotResult,Length,PenaltyType,Rank,goal,score,playercount,roster,rosterposition
0,2010,20001,2010-10-07,1,EV,N,1,FAC,MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 B...,MTL,...,,,,,2,0,30,1.0,18.0,6.0
1,2010,20001,2010-10-07,1,EV,N,1,FAC,MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 B...,MTL,...,,,,,2,0,30,1.0,18.0,1.0
2,2010,20001,2010-10-07,1,EV,N,1,FAC,MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 B...,MTL,...,,,,,2,0,30,1.0,18.0,4.0
3,2010,20001,2010-10-07,1,EV,N,1,FAC,MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 B...,MTL,...,,,,,2,0,30,1.0,18.0,5.0
4,2010,20001,2010-10-07,1,EV,N,1,FAC,MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 B...,MTL,...,,,,,2,0,30,1.0,18.0,5.0


### Pivot Table

- the next step is to group players by game number, team code, position and rank, to display the quality of players each team has per position. We build a **pivot table** by player position and rank, using the roster position values. Season, game number, team and roster size are the indexes. We then join the column levels to generate one column per position and rank (4 skater positions x 2 ranks = 8 columns).


In [53]:
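dq = pd.pivot_table(dq, index=['Season', 'GameNumber', 'TeamCode', 'roster'], columns=['PlayerPosition', 'Rank'], values=['rosterposition'])
dq = dq.reset_index()
# flatten the multi-level column names, e.g. ('rosterposition', 'C', 1) -> 'rosterposition_C_1'
dq.columns = ['_'.join(str(s).strip() for s in col if s) for col in dq.columns]
# a missing position/rank combination means the team dressed no such player
dq = dq.fillna(0)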

In [54]:
dq = dq.rename(columns={'rosterposition_C_1': 'C1', 'rosterposition_C_2': 'C2', 'rosterposition_D_1': 'D1', 'rosterposition_D_2': 'D2', 'rosterposition_L_1': 'L1', 'rosterposition_L_2': 'L2', 'rosterposition_R_1': 'R1', 'rosterposition_R_2': 'R2' })
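The pivot and the column flattening can be sketched on hypothetical counts. The example assumes ranks are the integers 1 and 2; a rank of 0 would be dropped by the `if s` filter in the join:

```python
import pandas as pd

# Hypothetical counts per team, position and rank.
counts = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001] * 4,
    'TeamCode': ['MTL', 'MTL', 'TOR', 'TOR'],
    'PlayerPosition': ['C', 'D', 'C', 'C'],
    'Rank': [1, 2, 1, 2],
    'rosterposition': [1, 5, 2, 3],
})

wide = pd.pivot_table(
    counts,
    index=['Season', 'GameNumber', 'TeamCode'],
    columns=['PlayerPosition', 'Rank'],
    values=['rosterposition'],
)
wide = wide.reset_index()

# Each column label is a tuple; join its non-empty parts with underscores.
wide.columns = ['_'.join(str(s).strip() for s in col if s) for col in wide.columns]

# Combinations a team does not have come out as NaN; treat them as zero players.
wide = wide.fillna(0)

print(wide)
```

Note that `pivot_table` aggregates with the mean by default. That is harmless here because every row in a group carries the same `rosterposition` value.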


- the data set now shows the number of elite and secondary players per position for each team in every regular season game. We will have to pivot the table by team in order to have one observation per game.

In [55]:
dq.head(10)

Unnamed: 0,Season,GameNumber,TeamCode,roster,C1,C2,D1,D2,L1,L2,R1,R2
0,2010,20001,MTL,18.0,1.0,6.0,1.0,5.0,0.0,4.0,0.0,1.0
1,2010,20001,TOR,18.0,2.0,3.0,1.0,5.0,2.0,1.0,0.0,4.0
2,2010,20002,PHI,18.0,3.0,2.0,2.0,4.0,1.0,4.0,1.0,1.0
3,2010,20002,PIT,18.0,2.0,6.0,2.0,4.0,1.0,2.0,0.0,1.0
4,2010,20003,CAR,18.0,2.0,3.0,2.0,4.0,0.0,3.0,1.0,3.0
5,2010,20003,MIN,18.0,1.0,3.0,1.0,5.0,0.0,3.0,1.0,4.0
6,2010,20004,CHI,18.0,1.0,2.0,2.0,4.0,0.0,2.0,3.0,4.0
7,2010,20004,COL,18.0,2.0,4.0,1.0,5.0,0.0,2.0,1.0,3.0
8,2010,20005,CGY,18.0,1.0,2.0,1.0,5.0,1.0,5.0,1.0,2.0
9,2010,20005,EDM,18.0,1.0,5.0,1.0,5.0,1.0,3.0,1.0,1.0


In [56]:
dq['F1'] = dq['C1'] + dq['L1'] + dq['R1']
dq['F2'] = dq['C2'] + dq['L2'] + dq['R2']

In [57]:
dq['F'] = dq['F1'] + dq['F2']
dq['D'] = dq['D1'] + dq['D2']

In [58]:
dq = dq[['Season', 'GameNumber', 'TeamCode', 'F', 'D', 'roster', 'F1', 'F2', 'C1', 'C2', 'L1', 'L2', 'R1', 'R2', 'D1', 'D2' ]]

In [59]:
dq.to_csv('out_data/team_roster_without_goalies.csv', index=False, sep=',')

- create a data set that contains every team whose roster for a game differs from 18 skaters. Most of these teams played with fewer than 18 but, for a reason not yet identified, Minnesota shows 22 players in one game.

In [60]:
dz = dq[dq['roster'] != 18]

In [61]:
dz.to_csv('out_data/teams_with_different_total_roster_number.csv', index=False, sep=',')
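Before filtering, it can help to see how roster sizes are distributed. A minimal sketch with hypothetical team codes and roster values:

```python
import pandas as pd

# Hypothetical roster sizes; team codes are placeholders, not real teams.
rosters = pd.DataFrame({
    'GameNumber': [20001, 20001, 20002, 20002],
    'TeamCode': ['AAA', 'BBB', 'CCC', 'DDD'],
    'roster': [18.0, 17.0, 18.0, 22.0],
})

# How many team-games fall on each roster size.
print(rosters['roster'].value_counts().sort_index())

# Team-games that deviate from the expected 18 skaters, in either direction.
irregular = rosters[rosters['roster'] != 18]
print(irregular)
```

Because the filter is `!= 18`, it catches rosters that are too large as well as those that are too small.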

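- create an index variable to determine whether a team is considered visitor or home for a given game. The column is named "A". The 1st observation per game is treated as the visiting team and is assigned a value of 1. The 2nd and final observation per game is treated as the home team, so we fill the remaining NaN values in "A" with 2. Because `pivot_table` returns rows sorted by its index, the 1st observation per game is the team whose code sorts first, so this labelling should be checked against the schedule.

In [62]:
dq = dq.copy()  # avoid assigning into a slice of the previous frame
dq.loc[dq.groupby(['Season', 'GameNumber']).head(1).index, 'A'] = 1
dq['A'] = dq['A'].fillna(2)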
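The labelling step can be sketched on a hypothetical two-game table. `groupby(...).head(1)` returns the first row of each group with its original index, which is then used to set the flag:

```python
import pandas as pd

# Hypothetical one-row-per-team table, two games.
teams = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001, 20001, 20002, 20002],
    'TeamCode': ['MTL', 'TOR', 'PHI', 'PIT'],
})

# Index labels of the first row in each game.
first_rows = teams.groupby(['Season', 'GameNumber']).head(1).index

# Creating 'A' through .loc leaves NaN in all other rows.
teams.loc[first_rows, 'A'] = 1

# Everything that is not first in its game gets the value 2.
teams['A'] = teams['A'].fillna(2)

print(teams)
```

The column ends up as floats (1.0 and 2.0) because it held NaN at one point, which is why the flattened column names in the next step end in `_1.0` and `_2.0`.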

- **pivot the table using season and game number as the index, with columns split by whether a team is visitor (1) or home (2)**. The result displays the number of elite and secondary players per position for each team. We then join the column levels by team and player quality, which gives 10 columns per team (5 position groups x 2 quality levels). The columns are renamed as follows: VC1 is the number of elite centers for the visiting team, HC1 is the number of elite centers for the home team, and so on. Finally, we order the columns by team, position and quality, and sort the rows by season and game number.

In [63]:
dq = pd.pivot_table(dq, index=['Season', 'GameNumber'], columns=['A'], values=['F1', 'F2', 'C1', 'C2', 'D1', 'D2', 'L1', 'L2', 'R1', 'R2'])
dq = dq.reset_index()
# flatten the two column levels, e.g. ('C1', 1.0) -> 'C1_1.0'
dq.columns = ['_'.join(str(s).strip() for s in col if s) for col in dq.columns]

In [64]:
dq = dq.rename(columns={'F1_1.0': 'VF1', 'F2_1.0': 'VF2', 'C1_1.0': 'VC1', 'C2_1.0': 'VC2', 'D1_1.0': 'VD1', 'D2_1.0': 'VD2', 'L1_1.0': 'VL1', 'L2_1.0': 'VL2', 'R1_1.0': 'VR1', 'R2_1.0': 'VR2', 'F1_2.0': 'HF1', 'F2_2.0': 'HF2', 'C1_2.0': 'HC1', 'C2_2.0': 'HC2', 'D1_2.0': 'HD1', 'D2_2.0': 'HD2', 'L1_2.0': 'HL1', 'L2_2.0': 'HL2', 'R1_2.0': 'HR1', 'R2_2.0': 'HR2', })
dq = dq[['Season', 'GameNumber', 'VF1', 'VF2', 'VC1', 'VC2', 'VL1', 'VL2', 'VR1', 'VR2',  'VD1', 'VD2', 'HF1', 'HF2', 'HC1', 'HC2', 'HL1', 'HL2', 'HR1', 'HR2', 'HD1', 'HD2']]
dq = dq.sort_values(['Season', 'GameNumber'], ascending=[True, True])
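A minimal sketch of the reshape from one row per team to one row per game, restricted to two hypothetical count columns so the output stays readable:

```python
import pandas as pd

# Hypothetical per-team counts with the visitor/home flag 'A' already set.
teams = pd.DataFrame({
    'Season': [2010] * 4,
    'GameNumber': [20001, 20001, 20002, 20002],
    'A': [1.0, 2.0, 1.0, 2.0],
    'C1': [1.0, 2.0, 3.0, 2.0],
    'D1': [1.0, 1.0, 2.0, 2.0],
})

games = pd.pivot_table(teams, index=['Season', 'GameNumber'],
                       columns=['A'], values=['C1', 'D1'])
games = games.reset_index()

# ('C1', 1.0) becomes 'C1_1.0', ('C1', 2.0) becomes 'C1_2.0', and so on.
games.columns = ['_'.join(str(s).strip() for s in col if s) for col in games.columns]

# V = visitor (A == 1), H = home (A == 2).
games = games.rename(columns={'C1_1.0': 'VC1', 'C1_2.0': 'HC1',
                              'D1_1.0': 'VD1', 'D1_2.0': 'HD1'})
games = games[['Season', 'GameNumber', 'VC1', 'VD1', 'HC1', 'HD1']]
games = games.sort_values(['Season', 'GameNumber'])

print(games)
```

Assigning the result of `sort_values` instead of passing `inplace=True` avoids modifying a frame that was itself produced by a column selection.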

In [65]:
dq.head(10)

Unnamed: 0,Season,GameNumber,VF1,VF2,VC1,VC2,VL1,VL2,VR1,VR2,...,HF1,HF2,HC1,HC2,HL1,HL2,HR1,HR2,HD1,HD2
0,2010,20001,1.0,11.0,1.0,6.0,0.0,4.0,0.0,1.0,...,4.0,8.0,2.0,3.0,2.0,1.0,0.0,4.0,1.0,5.0
1,2010,20002,5.0,7.0,3.0,2.0,1.0,4.0,1.0,1.0,...,3.0,9.0,2.0,6.0,1.0,2.0,0.0,1.0,2.0,4.0
2,2010,20003,3.0,9.0,2.0,3.0,0.0,3.0,1.0,3.0,...,2.0,10.0,1.0,3.0,0.0,3.0,1.0,4.0,1.0,5.0
3,2010,20004,4.0,8.0,1.0,2.0,0.0,2.0,3.0,4.0,...,3.0,9.0,2.0,4.0,0.0,2.0,1.0,3.0,1.0,5.0
4,2010,20005,3.0,9.0,1.0,2.0,1.0,5.0,1.0,2.0,...,3.0,9.0,1.0,5.0,1.0,3.0,1.0,1.0,1.0,5.0
5,2010,20006,3.0,9.0,2.0,3.0,1.0,3.0,0.0,3.0,...,6.0,6.0,4.0,3.0,2.0,2.0,0.0,1.0,1.0,5.0
6,2010,20007,5.0,7.0,2.0,4.0,3.0,1.0,0.0,2.0,...,4.0,7.0,1.0,3.0,3.0,0.0,0.0,4.0,0.0,7.0
7,2010,20008,3.0,9.0,2.0,3.0,0.0,3.0,1.0,3.0,...,2.0,10.0,1.0,4.0,0.0,3.0,1.0,3.0,1.0,5.0
8,2010,20009,4.0,8.0,1.0,3.0,0.0,2.0,3.0,3.0,...,4.0,8.0,2.0,4.0,1.0,2.0,1.0,2.0,2.0,4.0
9,2010,20010,4.0,8.0,1.0,6.0,1.0,0.0,2.0,2.0,...,1.0,11.0,0.0,5.0,0.0,3.0,1.0,3.0,2.0,4.0


- save the new data set.

In [66]:
dq.to_csv('out_data/team_roster_quality_without_goalies.csv', index=False, sep=',')
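The `index` argument of `to_csv` expects a boolean. The sketch below, on a hypothetical two-row frame, shows why the string `'False'` is not a substitute for `False` (any non-empty string is truthy in Python) and what a written index looks like when the file is read back:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the final data set.
frame = pd.DataFrame({'GameNumber': [20001, 20002], 'VC1': [1.0, 3.0]})

# A non-empty string is truthy, whatever it spells.
print(bool('False'))


def round_trip(write_index):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(True)
without_index = round_trip(False)

print(list(with_index.columns))
print(list(without_index.columns))
```

When the index is written, it comes back as an extra `Unnamed: 0` column, which is the column the notebook drops right after loading the play-by-play file.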