# NHL Draft data from NHL Records API

This dataset was generated from the JSON response returned by the NHL Records API for a request for all draft records.

For details, see notebook `notebooks/feature_extraction/nhl_api.ipynb`.

# Data cleanup and feature extraction

## Load data

In [1]:
import numpy as np
import pandas as pd
from glob import glob
from time import time
import os

In [5]:
draft_api_data_path = '../../data/nhl_api/nhl_draft_all.csv'
t = time()
df = pd.read_csv(draft_api_data_path)
# df = df.rename(columns={'Unnamed: 0': 'id'})
elapsed = time() - t
print("----- DataFrame with NHL Draft Data loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)

----- DataFrame with NHL Draft Data loaded
in 0.10 seconds
with 11,587 rows
and 25 columns
-- Column names:
 Index(['amateurClubName', 'amateurLeague', 'birthDate', 'birthPlace',
       'countryCode', 'csPlayerId', 'draftYear', 'draftedByTeamId',
       'firstName', 'height', 'id', 'lastName', 'overallPickNumber',
       'pickInRound', 'playerId', 'playerName', 'position', 'removedOutright',
       'removedOutrightWhy', 'roundNumber', 'shootsCatches',
       'supplementalDraft', 'teamPickHistory', 'triCode', 'weight'],
      dtype='object')


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11587 entries, 0 to 11586
Data columns (total 25 columns):
amateurClubName       11492 non-null object
amateurLeague         11471 non-null object
birthDate             11526 non-null object
birthPlace            11530 non-null object
countryCode           11580 non-null object
csPlayerId            4854 non-null float64
draftYear             11587 non-null int64
draftedByTeamId       11587 non-null int64
firstName             11580 non-null object
height                11502 non-null float64
id                    11587 non-null int64
lastName              11580 non-null object
overallPickNumber     11587 non-null int64
pickInRound           11587 non-null int64
playerId              10800 non-null float64
playerName            11580 non-null object
position              11580 non-null object
removedOutright       11587 non-null object
removedOutrightWhy    88 non-null object
roundNumber           11587 non-null int64
shootsCatches     

## Data cleanup

Fixing inconsistent capitalizations

In [23]:
col = 'amateurClubName'
df[col] = df[col].str.title()
print("Capitalizations fixed!")

Capitalizations fixed!


### Some inconsistencies in amateur club names

In [25]:
col = 'amateurClubName'
mask = df[col].str.contains('london', case=False).fillna(False)
df.loc[mask,col].value_counts()

London Knights           115
London                    64
London Nationals           3
London Jr. B               1
London Diamonds Jr. B      1
Name: amateurClubName, dtype: int64

In [None]:
mask = df[col].str.contains('peterbo', case=False).fillna(False)
df.loc[mask,col].value_counts()
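
The keyword search used in the two cells above can be wrapped in a small helper so that each suspicious club name does not need its own copy of the mask. This is only a sketch: `name_variants` is a hypothetical helper, and the `clubs` series is toy data rather than the real dataset.

```python
import pandas as pd


def name_variants(names: pd.Series, keyword: str) -> pd.Series:
    """Count the distinct values in `names` that contain `keyword`."""
    # case=False ignores capitalization, na=False treats missing names
    # as non-matches, regex=False matches the keyword literally.
    mask = names.str.contains(keyword, case=False, na=False, regex=False)
    return names[mask].value_counts()


# Toy data (not taken from the real dataset)
clubs = pd.Series(["London Knights", "London", "London Knights",
                   None, "Peterborough Petes"])
london = name_variants(clubs, "london")
print(london)
```

Calling `name_variants(df[col], 'peterbo')` on the real frame should give the same listing as the cell above.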

In [30]:
ska = ['Ska Leningrad', 
       'St. Petersburg Ska',
       'Ska St. Petersburg',
       'Leningrad Ska',
       'St. Petersburg Ska St. Petersburg',
       'Ska St. Petersburg 2']
mask = df[col].isin(ska)
new_ska = "St. Petersburg SKA"
df.loc[mask, col] = new_ska
print("The following Amateur Team names:\n", ska,
      "\nhave been renamed to: \n'{0}'"
      .format(new_ska))

The following Amateur Team names:
 ['Ska Leningrad', 'St. Petersburg Ska', 'Ska St. Petersburg', 'Leningrad Ska', 'St. Petersburg Ska St. Petersburg', 'Ska St. Petersburg 2'] 
have been renamed to: 
'St. Petersburg SKA'


In [32]:
cska = ['Cska Moscow', 'Hc Cska', 'Cska',
        'Cska 2', 'Cska Jr.',
        'Cska 2 Cska Moscow 2']
# Compare on whitespace-stripped names so that variants with
# stray leading/trailing spaces (e.g. 'Cska 2 ') are matched too.
mask = df[col].str.strip().isin(cska)
new_cska = "Moscow CSKA"
df.loc[mask, col] = new_cska
print("The following Amateur Team names:\n", cska,
      "\nhave been renamed to: \n'{0}'"
      .format(new_cska))

The following Amateur Team names:
 ['Cska Moscow', 'Hc Cska', 'Cska', 'Cska 2', 'Cska Jr.', 'Cska 2 Cska Moscow 2'] 
have been renamed to: 
'Moscow CSKA'
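
As more clubs need consolidating, the rename-by-list pattern used for SKA and CSKA could be driven by a single mapping instead of one cell per club. The sketch below assumes a hypothetical `canonical` dictionary (canonical name to list of variants) and runs on toy data; the variant lists are shortened for illustration.

```python
import pandas as pd

# Hypothetical mapping: canonical club name -> known variants
canonical = {
    "St. Petersburg SKA": ["Ska Leningrad", "Leningrad Ska",
                           "Ska St. Petersburg"],
    "Moscow CSKA": ["Cska Moscow", "Hc Cska", "Cska"],
}

# Invert to variant -> canonical name so it can be passed to replace()
variant_to_canonical = {variant: name
                        for name, variants in canonical.items()
                        for variant in variants}

# Toy data (not taken from the real dataset)
clubs = pd.Series(["Ska Leningrad", "Hc Cska", "London Knights", None])

# replace() with a dict swaps exact matches and leaves everything else alone
cleaned = clubs.replace(variant_to_canonical)
print(cleaned.tolist())
```

Names that are not listed as variants, including missing values, pass through unchanged.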


In [31]:
mask1 = df[col].str.contains('peterbu', case=False).fillna(False)
mask2 = df[col].str.contains('ska', case=False).fillna(False)
df.loc[mask1 | mask2,col].value_counts()

Saskatoon Blades             93
Cska Moscow                  58
Saskatoon                    36
St. Petersburg SKA           15
Cska 2                       14
Hc Cska                      12
Cska                          5
Cska Jr.                      4
U. Of Alaska-Anchorage        4
U. Of Nebraska-Omaha          3
Banska Bystrica               3
U. Of Alaska-Fairbanks        2
Chaska                        2
Oskarshamn                    2
Nebraska-Omaha                2
Fort Saskatchewan             2
Skalica                       1
Spisska Nova Ves              1
Skalica Jr.                   1
Spisska Nova Ves Jr.          1
Cska 2 Cska Moscow 2          1
Fort Saskatchewan Traders     1
Saskatoon J'S                 1
Alaska All-Stars              1
Alaska-Fairbanks              1
Name: amateurClubName, dtype: int64

In [None]:
df[col].value_counts()

## Data cleanup

Cleanup summary:

* summarized positions
    * corrected for consistency
    * C/RW, C/LW, C/W, F, _etc._ = C
    * L/RW, W = RW
    * players who can play center are assumed to be centers for the purposes of this analysis
    * universal (left/right) wingers are assumed to be right wingers

### Fix positions

In [None]:
df['Pos'].value_counts()

In [None]:
df['Pos'] = df['Pos'].str.replace("C RW", "C")
df['Pos'] = df['Pos'].str.replace("C; LW", "C")
df['Pos'] = df['Pos'].str.replace("F", "C")
df['Pos'] = df['Pos'].str.replace("C/W", "C")
df['Pos'] = df['Pos'].str.replace("C/LW", "C")
df['Pos'] = df['Pos'].str.replace("C/RW", "C")
df['Pos'] = df['Pos'].str.replace("L/RW", "RW")
mask = df['Pos'] == "W"
df['Pos'] = np.where(mask, "RW", df['Pos'])
df['Pos'].value_counts()
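
The rules from the cleanup summary can also be written as one function rather than a chain of string replacements. This is a sketch of that alternative, not the method used in the cell above: `primary_position` is a hypothetical helper, and the separator handling (`;`, space, `/`) is an assumption based on the labels that appear in the replacements.

```python
def primary_position(pos: str) -> str:
    """Collapse a multi-position label into a single primary position."""
    # Treat ';', ' ' and '/' as equivalent separators, drop empty pieces
    parts = [p for p in pos.replace(";", "/").replace(" ", "/").split("/")
             if p]
    if not parts:
        return pos
    if "C" in parts or "F" in parts:
        return "C"    # can play center, or listed as a generic forward
    if "W" in parts or "L" in parts:
        return "RW"   # generic winger ('W') or universal winger ('L/RW')
    return parts[0]   # single positions such as 'D', 'G', 'LW', 'RW'


labels = ["C RW", "C; LW", "F", "C/W", "C/LW", "C/RW",
          "L/RW", "W", "LW", "D", "G"]
collapsed = {label: primary_position(label) for label in labels}
print(collapsed)
```

On a DataFrame it could be applied with `df['Pos'].map(primary_position)`, assuming the column has no missing values.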

### Fix player names

In [None]:
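# Split each 'Player' value once on the backslash: the text before it
# is the name, the text after it the alias (NaN if there is none).
parts = df['Player'].str.split("\\", n=1)
df['name'] = parts.str[0]
df['alias'] = parts.str[1]
print("Player names split into columns 'name' and 'alias'.")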

## New features
* `league`: string, junior league of the player
* `year`: int, year of NHL draft, extracted from .csv file names
* `num_teams`: int, number of teams in each draft year
* `round_ratio`: float, ratio of each pick: 
    * $\text{round_ratio}=\large{\frac{\text{# Overall} - 1}{\text{number of teams}}}$ 
    * the number of teams represents the number of picks per round
    * each overall pick number (e.g., 171) is divided by the number of picks per round to determine the round in which each prospect was selected (and, via the fractional part, how late in that round)
    * subtracting 1 from the overall pick number is needed to ensure a proper boundary between rounds
    * so, for example, for pick #171 in a 30-team draft, $\text{round ratio}=\frac{171 - 1}{30} \approx 5.67$
* `round`: int, round in which a prospect was selected
    * `round_ratio` is rounded down and 1 is added
    * $\text{round} = \text{int}(\text{round ratio}) + 1$
* `1st_round`: boolean, whether the prospect was selected in the $1^{st}$ round
    * binary indicator for $1^{st}$ round picks
    * True if `round` == 1, False otherwise
* `gpg`: float, average goals per game
* `apg`: float, average assists per game
* `ppg`: float, average points per game
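
The `round_ratio`, `round`, and `1st_round` definitions above can be checked by hand with a few lines of plain Python. This is a minimal sketch; `round_features` is a hypothetical helper, and the 30-team draft is the same assumption as in the worked example for pick #171.

```python
def round_features(overall: int, num_teams: int):
    """Return (round_ratio, round, 1st_round) for one overall pick number."""
    round_ratio = (overall - 1) / num_teams   # the -1 fixes round boundaries
    draft_round = int(round_ratio) + 1        # round down, then add 1
    return round_ratio, draft_round, draft_round == 1


ratio, rnd, first = round_features(171, 30)
print("{0:.2f} {1} {2}".format(ratio, rnd, first))  # 5.67 6 False

# Boundary check: pick 30 closes round 1, pick 31 opens round 2
print(round_features(30, 30)[1], round_features(31, 30)[1])  # 1 2
```

Without the subtraction, pick #30 would give a ratio of exactly 1.0 and be assigned to round 2.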

### Extract junior `league` from `Amateur Team`

In [None]:
df['league'] = df['Amateur Team'].str.extract(pat=r'\((.*?)\)', expand=False)
print("New column 'league' added to df.")
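
To illustrate what the pattern captures, here is a sketch on toy strings. It assumes that `Amateur Team` values have the form `Club (LEAGUE)`, which is what the regular expression implies; the sample values are made up for illustration.

```python
import pandas as pd

# Toy values, assuming the "Club (LEAGUE)" layout
teams = pd.Series(["London Knights (OHL)",
                   "Saskatoon Blades (WHL)",
                   "Cska Moscow"])

# \( and \) match literal parentheses; (.*?) lazily captures what is inside.
# expand=False returns a Series instead of a one-column DataFrame.
league = teams.str.extract(r"\((.*?)\)", expand=False)
print(league.tolist())
```

Rows without parentheses get a missing value rather than an empty string.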

In [None]:
num_teams = df.groupby('year')['Team'].nunique()
num_teams.name = 'num_teams'
df = pd.merge(df, num_teams, 
              left_on='year',
              right_index=True)
print("New column 'num_teams' added to df.")
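
An alternative to the group-then-merge step is `groupby(...).transform('nunique')`, which broadcasts the per-year count back to every row directly. The sketch below uses a toy frame with made-up team labels; it is not the approach taken in the cell above.

```python
import pandas as pd

# Toy draft picks (team labels are placeholders)
picks = pd.DataFrame({
    "year": [2017, 2017, 2017, 2018, 2018, 2018],
    "Team": ["A", "B", "A", "A", "B", "C"],
})

# Number of distinct teams per draft year, repeated on each row of that year
picks["num_teams"] = picks.groupby("year")["Team"].transform("nunique")
print(picks)
```

This keeps the original row order and index, so no merge keys need to be managed.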

In [None]:
df['round_ratio'] = (df['Overall'] - 1)/ df['num_teams']
df['round'] = df['round_ratio'].astype('int') + 1
df['1st_round'] = df['round'] == 1
print("New columns 'round_ratio', 'round', and '1st_round' added to df.")

In [None]:
df['gpg'] = df['G'] / df['GP']
df['apg'] = df['A'] / df['GP']
df['ppg'] = df['PTS'] / df['GP']
print("New columns `gpg`, `apg`, and `ppg` added to df.")

## Sanity checks

In [None]:
df.groupby('year')['Team'].nunique()

In [None]:
year = 2016
pick = 1
# #1 overall pick from 2016
mask1 = df['year'] == year
mask2 = df['Overall'] == pick
print("#{0} pick in {1} was {2} picked by the {3}."
      .format(pick, year, 
              df.loc[mask1 & mask2, 'Player']
                .values[0].split('\\')[0],
              df.loc[mask1 & mask2, 'Team'].values[0]))
print("\nPicks by round:")
subset = df.loc[mask1, 
       ['Player', 'Overall', 
        'round', 'round_ratio']]
print("Total picks by round in {0}:"
      .format(year))
subset['round'].value_counts().sort_index()

### Counts by #overall
Since the data covers 10 draft seasons, almost every #overall should have exactly 10 players drafted.

In [None]:
counts = df['Overall'].value_counts()
len(counts[counts == 10]) / len(counts)

Displaying all #overall for which there are NOT exactly 10 players ("anomalies").

In [None]:
# overall pick numbers that do not occur exactly 10 times
counts[counts != 10]

Most draft picks (by # overall) have been selected 10 times, corresponding to 10 drafts that took place from 2009 to 2018.

In [None]:
df['year'].value_counts().sort_index()

The last two seasons had slightly longer drafts (217 players in total, compared to 210 or 211 prior to 2017), hence the irregular counts for overall pick numbers above 215 seen above.

### Goals/assists/points per game

In [None]:
focus_id = 0
print("GPG * GP =", df.loc[focus_id, 'gpg'] 
      * df.loc[focus_id, 'GP'])
print("G =", df.loc[focus_id, 'G'])
print("APG * GP =", df.loc[focus_id, 'apg'] 
      * df.loc[focus_id, 'GP'])
print("A =", df.loc[focus_id, 'A'])
print("PPG * GP =", df.loc[focus_id, 'ppg'] 
      * df.loc[focus_id, 'GP'])
print("PTS =", df.loc[focus_id, 'PTS'])
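
The single-row check above can be extended to every row with `np.isclose`, which tolerates floating-point rounding. This is a sketch on toy numbers; treating players with zero games played as missing (rather than dividing by zero) is an assumption, not something the cells above do.

```python
import numpy as np
import pandas as pd

# Toy stats (not real players)
stats = pd.DataFrame({"GP": [82, 40, 0], "G": [30, 5, 0]})

# Assumption: per-game rates are undefined (NaN) when GP == 0
games = stats["GP"].where(stats["GP"] > 0)
stats["gpg"] = stats["G"] / games

# For players with at least one game, gpg * GP should reproduce G
played = stats["GP"] > 0
ok = np.isclose(stats.loc[played, "gpg"] * stats.loc[played, "GP"],
                stats.loc[played, "G"])
print("All rows consistent:", ok.all())
```

The same pattern would apply to `apg` with `A` and `ppg` with `PTS`.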

### Goals by position

In [None]:
pos = 'C'
mask = df['Pos'] == pos
print("Players from position {0} "
      "scored on average {1:.2f} goals "
      "in total.".format(pos, 
            df.loc[mask, 'G'].mean()))

In [None]:
pos = 'D'
mask = df['Pos'] == pos
print("Players from position {0} "
      "scored on average {1:.2f} goals "
      "in total.".format(pos, 
            df.loc[mask, 'G'].mean()))

In [None]:
pos = 'G'
mask = df['Pos'] == pos
print("Players from position {0} "
      "scored on average {1:.2f} goals "
      "in total.".format(pos, 
            df.loc[mask, 'G'].mean()))

All results appear reasonable: centers score many more goals on average than defencemen, and goalies score 0.
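
The three per-position cells above could be collapsed into a single `groupby`, which reports every position at once. A minimal sketch on toy numbers (not the real dataset):

```python
import pandas as pd

# Toy players: two centers, two defencemen, one goalie
players = pd.DataFrame({
    "Pos": ["C", "C", "D", "D", "G"],
    "G": [20, 10, 4, 2, 0],
})

# Mean total goals per position
mean_goals = players.groupby("Pos")["G"].mean()
print(mean_goals)
```

On the real frame the equivalent call would be `df.groupby('Pos')['G'].mean()`.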

In [None]:
mask1 = df['Pos'] == 'C'
mask2 = df['G'] == 0
mask3 = df['1st_round'] == True
s_display_cols = ['name', 'Nat.', 'Pos', 
                  'Overall', 'Team', 'year',
                  'Amateur Team', 'GP', 'PTS', 'ppg', '+/-']
df.loc[mask1 & mask2 & mask3, s_display_cols] 

In [None]:
mask1 = df['Pos'] == 'C'
mask2 = df['PTS'] > 400
mask3 = df['1st_round'] == True
s_display_cols = ['name', 'Nat.', 'Pos', 
                  'Overall', 'Team', 'year',
                  'Amateur Team', 'GP', 'PTS', 'ppg', '+/-']
df.loc[mask1 & mask2 & mask3, s_display_cols] 

## Record results to a new .csv file

In [None]:
save_path = '../../data/nhl_draft_picks_2009-2018.csv'
t = time()
df.to_csv(save_path)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))

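One detail worth noting when saving: `to_csv` writes the row index as an unnamed first column by default, which comes back as `Unnamed: 0` when the file is read again. The sketch below demonstrates this on a toy frame using in-memory buffers instead of files; whether the index should be kept for this dataset is a judgment call left to the reader.

```python
import io

import pandas as pd

frame = pd.DataFrame({"name": ["A", "B"], "year": [2016, 2017]})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)   # ['Unnamed: 0', 'name', 'year']
print(cols_no_index)  # ['name', 'year']
```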