## player quality

This notebook determines the quality of players per team and position for the whole season. Each player is manually assigned a value of 1 or 2 based on their productivity. The result shows how many elite players each team has and helps estimate the value of each roster position to team success.
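As a rough illustration of the end product, the sketch below counts elite players per team and position from a ranked roster table. The team codes, player numbers and ranks are made up, and treating `Rank == 1` as "elite" is an assumption; the notebook does not say which of the two values marks the better players.

```python
import pandas as pd

# Made-up ranked roster; the real table comes from the manual ranking step.
ranked = pd.DataFrame({
    'TeamCode': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'PlayerPosition': ['C', 'C', 'D', 'C', 'G'],
    'PlayerNumber': [10, 20, 30, 40, 50],
    'Rank': [1, 2, 1, 2, 1],
})

ELITE = 1  # assumption: 1 marks an elite player, 2 a regular one

# One row per team, one column per position, cells hold elite-player counts.
elite_counts = (ranked[ranked['Rank'] == ELITE]
                .groupby(['TeamCode', 'PlayerPosition'])
                .size()
                .unstack(fill_value=0))
print(elite_counts)
```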

### import modules

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import datetime, time
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pylab import hist, show
import scipy

### import data 

In [2]:
dm = pd.read_csv('out_data/play_by_play.csv')
dm = dm.drop('Unnamed: 0', axis=1)


In [3]:
dm.shape

(1531666, 24)

create a new data set and keep only these variables:
- (a) season and game number.
- (b) visitor team information.
- (c) home team information.

In [4]:
df = dm[['Season', 'GameNumber', 'VTeamCode', 'VPlayer', 'VPosition', 'HTeamCode', 'HPlayer', 'HPosition']]

In [5]:
df.shape

(1531666, 8)

- reshape the data so that home and visitor team observations sit under the same columns.

In [6]:
a = [col for col in df.columns if 'Player' in col]
b = [col for col in df.columns if 'Position' in col]
c = [col for col in df.columns if 'TeamCode' in col]
df = pd.lreshape(df, {'Player' : a, 'Position' : b, 'TeamCode' : c})
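To make the reshape easier to follow, here is a minimal sketch of what `pd.lreshape` does on a toy frame. The team codes, player numbers and positions are made up; only the column naming pattern is taken from the notebook.

```python
import pandas as pd

# Toy wide frame: one row per event, with visitor (V*) and home (H*) columns.
wide = pd.DataFrame({
    'GameNumber': [1, 1],
    'VTeamCode': ['AAA', 'AAA'],
    'VPlayer': [10, 20],
    'VPosition': ['C', 'D'],
    'HTeamCode': ['BBB', 'BBB'],
    'HPlayer': [30, 40],
    'HPosition': ['R', 'L'],
})

# Each key is a new long column; each value lists the wide columns stacked into it.
groups = {
    'Player': ['VPlayer', 'HPlayer'],
    'Position': ['VPosition', 'HPosition'],
    'TeamCode': ['VTeamCode', 'HTeamCode'],
}
long_df = pd.lreshape(wide, groups)
print(long_df)
```

Every input row yields one visitor row and one home row, so the row count doubles while columns not named in `groups` (here `GameNumber`) are repeated.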

- drop duplicates on game number, team and player to keep only one observation per player per game.

In [7]:
df = df.drop_duplicates(['GameNumber', 'TeamCode', 'Player'])
df = df[['Season', 'GameNumber', 'TeamCode', 'Player', 'Position']]
df = df.rename(columns={'Player': 'PlayerNumber', 'Position': 'PlayerPosition' })
df = df.sort_values(['Season', 'GameNumber', 'TeamCode', 'PlayerPosition', 'PlayerNumber'], ascending=[True, True, True, True, True])

In [8]:
df.shape

(46920, 5)
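The drop from millions of event rows to one row per player per game can be illustrated on a toy frame. The values are made up; the point is that `drop_duplicates` keeps the first row for each key combination, so the position recorded for a player is the one from their first event in that game.

```python
import pandas as pd

# Made-up events: player 10 appears twice in game 1 and once in game 2.
events = pd.DataFrame({
    'GameNumber': [1, 1, 1, 2],
    'TeamCode': ['AAA', 'AAA', 'AAA', 'AAA'],
    'Player': [10, 10, 20, 10],
    'Position': ['C', 'C', 'D', 'C'],
})

# Keep the first occurrence of each (game, team, player) combination.
per_game = events.drop_duplicates(['GameNumber', 'TeamCode', 'Player'])
print(per_game)
```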

- store player quality to csv file

In [9]:
# index must be the boolean False; the string 'False' is truthy and writes the index
df.to_csv('out_data/player_quality.csv', index=False, sep=',')

In [10]:
df.shape

(46920, 5)

#### roster position follows.

- group data by team and player to keep one observation per player for the whole season (drop duplicates).

In [11]:
dc = df[['Season', 'TeamCode', 'PlayerNumber', 'PlayerPosition']]

In [12]:
dc = dc.drop_duplicates(['TeamCode', 'PlayerNumber', 'PlayerPosition'])
dc = dc.sort_values(['TeamCode', 'PlayerPosition', 'PlayerNumber'], ascending=[True, True, True])

In [13]:
dc.shape

(1059, 4)

- save the data set and manually rank players per team.

In [14]:
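# index must be the boolean False; the string 'False' is truthy and writes the index
dc.to_csv('out_data/player_rank.csv', index=False, sep=',')

#### next step is to rank players manually and merge player rank into play by play

- import manual player rank csv file.

In [15]:
dr = pd.read_csv('out_data/player_rank_manual.csv')
# drop the stray index column only if the file was saved with one
dr = dr.drop(columns='Unnamed: 0', errors='ignore')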

In [16]:
dr.shape

(1059, 5)
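Because the rank file is edited by hand, a quick sanity check before merging may be worthwhile. The following is a minimal sketch, not part of the original workflow: the helper name `check_manual_rank` is hypothetical, the sample rows are made up, and it assumes the hand-entered column is called `Rank` and holds only 1 or 2.

```python
import pandas as pd

KEYS = ['Season', 'TeamCode', 'PlayerNumber', 'PlayerPosition']

def check_manual_rank(ranked, allowed=(1, 2)):
    """Return a list of problems found in a hand-edited rank table."""
    problems = []
    # Each roster key should appear once, otherwise a later merge duplicates rows.
    if ranked.duplicated(KEYS).any():
        problems.append('duplicate roster keys')
    # Missing or mistyped ranks (NaN, 0, 3, ...) fail the isin test.
    if not ranked['Rank'].isin(allowed).all():
        problems.append('rank outside allowed values')
    return problems

# Made-up rows for illustration only.
good = pd.DataFrame({
    'Season': [2011, 2011],
    'TeamCode': ['AAA', 'AAA'],
    'PlayerNumber': [10, 20],
    'PlayerPosition': ['C', 'D'],
    'Rank': [1, 2],
})
bad = good.assign(Rank=[1, 3])

print(check_manual_rank(good))
print(check_manual_rank(bad))
```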

- **rename columns** for event team and player so that the 'Player' and 'TeamCode' name filters used in the reshape do not pick them up.

In [17]:
dm = dm.rename(columns={'EventPlayerNumber': 'EventPNumber', 'EventTeamCode': 'EventTeam', 'EventPlayerName': 'EventPName' })
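The reason for the rename can be seen with plain Python: the reshape selects columns by substring, and the event columns contain the same substrings as the home and visitor columns. The sketch below uses a subset of the column names from this notebook.

```python
columns = ['VPlayer', 'HPlayer', 'EventPlayerNumber', 'EventPlayerName',
           'VTeamCode', 'HTeamCode', 'EventTeamCode']

# Without the rename, the substring filter also catches the event columns.
before = [col for col in columns if 'Player' in col]

renames = {'EventPlayerNumber': 'EventPNumber',
           'EventTeamCode': 'EventTeam',
           'EventPlayerName': 'EventPName'}
renamed = [renames.get(col, col) for col in columns]

# After the rename, only the home and visitor columns match.
after = [col for col in renamed if 'Player' in col]
team_after = [col for col in renamed if 'TeamCode' in col]

print(before)
print(after)
print(team_after)
```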

### reshape data for home and visitor observations to be under the same columns.

In [18]:
a = [col for col in dm.columns if 'Player' in col]
b = [col for col in dm.columns if 'Position' in col]
c = [col for col in dm.columns if 'TeamCode' in col]
dm = pd.lreshape(dm, {'Player' : a, 'Position' : b, 'TeamCode' : c})

In [19]:
dm = dm.rename(columns={'Player': 'PlayerNumber', 'Position': 'PlayerPosition', 'EventPNumber': 'EventPlayerNumber', 'EventTeam': 'EventTeamCode', 'EventPName': 'EventPlayerName'})
dm = dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True])

In [20]:
dm.shape

(3063332, 21)

### merge player rank data set onto play_by_play

In [21]:
dm = dm.merge(dr, on=['Season', 'TeamCode', 'PlayerNumber', 'PlayerPosition'], how='left')
dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)
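A left merge like this can silently add rows (if the rank table has duplicate keys) or leave ranks empty (if a player is missing from it). The sketch below shows one way to guard against both using the `validate` and `indicator` options of `DataFrame.merge`; the frames are made-up stand-ins for `dm` and `dr`.

```python
import pandas as pd

KEYS = ['Season', 'TeamCode', 'PlayerNumber', 'PlayerPosition']

# Made-up play-by-play rows; player 99 has no entry in the rank table.
plays = pd.DataFrame({
    'Season': [2011, 2011, 2011],
    'TeamCode': ['AAA', 'AAA', 'AAA'],
    'PlayerNumber': [10, 10, 99],
    'PlayerPosition': ['C', 'C', 'D'],
    'EventNumber': [1, 2, 3],
})
ranks = pd.DataFrame({
    'Season': [2011, 2011],
    'TeamCode': ['AAA', 'AAA'],
    'PlayerNumber': [10, 20],
    'PlayerPosition': ['C', 'D'],
    'Rank': [1, 2],
})

# validate raises MergeError if a key is duplicated in the rank table;
# indicator adds a _merge column showing where each row found a match.
merged = plays.merge(ranks, on=KEYS, how='left',
                     validate='many_to_one', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[KEYS])
```

An empty `unmatched` frame and an unchanged row count together indicate that every play-by-play row received exactly one rank.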

In [22]:
dm = dm[['Season', 'GameNumber', 'GameDate', 'Period', 'AdvantageType', 'Zone', 'EventNumber', 'EventType',  'EventDetail', 'EventTeamCode', 'EventPlayerNumber', 'EventPlayerName',  'EventTimeFromZero', 'EventTimeFromTwenty', 'TeamCode', 'PlayerNumber', 'PlayerPosition', 'ShotType', 'ShotResult', 'Length', 'PenaltyType', 'Rank']]
dm.sort_values(['Season', 'GameNumber', 'EventNumber'], ascending=[True, True, True], inplace=True)

In [23]:
dm.shape

(3063332, 22)

- check which columns contain NaN values, and how many.

In [24]:
dm.isnull().sum()

Season                       0
GameNumber                   0
GameDate                     0
Period                       0
AdvantageType                0
Zone                         0
EventNumber                  0
EventType                    0
EventDetail                  0
EventTeamCode                0
EventPlayerNumber            0
EventPlayerName              0
EventTimeFromZero            0
EventTimeFromTwenty          0
TeamCode                     0
PlayerNumber                 0
PlayerPosition               0
ShotType               1749130
ShotResult             2789200
Length                 2084206
PenaltyType            2946770
Rank                         0
dtype: int64
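With many columns, it can be easier to list only those that actually contain missing values. A minimal sketch on a toy frame follows; the event type labels and values are made up, and the idea that shot and penalty columns are empty for other event types is an assumption about this data, not something the output above confirms.

```python
import numpy as np
import pandas as pd

# Made-up events: shot details exist only for the shot, penalty type only for the penalty.
toy = pd.DataFrame({
    'EventType': ['SHOT', 'PENALTY', 'FACEOFF'],
    'ShotType': ['Wrist', np.nan, np.nan],
    'PenaltyType': [np.nan, 'Hooking', np.nan],
    'Rank': [1, 2, 1],
})

null_counts = toy.isnull().sum()
# Keep only the columns that have at least one missing value.
with_nulls = null_counts[null_counts > 0]
print(with_nulls)
```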

- save new data set.

In [25]:
# index must be the boolean False; the string 'False' is truthy and writes the index
dm.to_csv('out_data/play_by_play_with_player_rank.csv', index=False, sep=',')

#### the next step is to estimate the value of each roster position.