# NHL 2009-2018 Draft data

Dataset downloaded from Kaggle.

# Data cleanup and feature extraction

## Load data

In [1]:
import numpy as np
import pandas as pd
from glob import glob
from time import time
import os

In [4]:
draft_data_path = '../../data/nhl-draft-picks-2009-2018/'
os.listdir(draft_data_path)

['2009.csv',
 '2010.csv',
 '2011.csv',
 '2012.csv',
 '2013.csv',
 '2014.csv',
 '2015.csv',
 '2016.csv',
 '2017.csv',
 '2018.csv']

In [5]:
t = time()
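# Collect all .csv files with NHL Draft data;
# sort them because glob() returns paths in arbitrary order
pattern = '*.csv'
csv_files = sorted(glob(os.path.join(draft_data_path, pattern)))

# Read each file into a DataFrame and tag it with its draft year
frames = []
for csv in csv_files:
    df = pd.read_csv(csv)
    # File names look like '2009.csv', so the year is the 4 characters before '.csv'
    df['year'] = int(csv[-8:-4])
    frames.append(df)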
df = pd.concat(frames)
elapsed = time() - t
print("----- DataFrame with NHL Draft Data loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)

----- DataFrame with NHL Draft Data loaded
in 0.21 seconds
with 2,119 rows
and 21 columns
-- Column names:
 Index(['Overall', 'Team', 'Player', 'Nat.', 'Pos', 'Age', 'To', 'Amateur Team',
       'GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'GP.1', 'W', 'L', 'T/O', 'SV%',
       'GAA', 'year'],
      dtype='object')
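The year is taken from the last characters of each path (`csv[-8:-4]`), which relies on every file being named `YYYY.csv`. A slightly more defensive sketch is shown below. It uses `pathlib` and a hypothetical helper, `year_from_path`, which is not part of the original notebook; the two paths are just examples built from the folder name used above.

```python
from pathlib import Path


def year_from_path(path):
    """Return the draft year encoded in a file name such as '2016.csv'.

    Hypothetical helper: assumes the file stem is a four-digit year.
    """
    stem = Path(path).stem  # '2016.csv' -> '2016'
    if not (stem.isdigit() and len(stem) == 4):
        raise ValueError("unexpected file name: {0}".format(path))
    return int(stem)


# Sorting the paths makes the load order deterministic,
# since glob() does not guarantee any particular order.
paths = sorted(['../../data/nhl-draft-picks-2009-2018/2010.csv',
                '../../data/nhl-draft-picks-2009-2018/2009.csv'])
years = [year_from_path(p) for p in paths]
print(years)
```

Failing loudly on an unexpected file name avoids silently tagging rows with a wrong year if another .csv file ends up in the same folder.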


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2119 entries, 0 to 216
Data columns (total 21 columns):
Overall         2119 non-null int64
Team            2119 non-null object
Player          2119 non-null object
Nat.            2118 non-null object
Pos             2118 non-null object
Age             2118 non-null float64
To              728 non-null float64
Amateur Team    2118 non-null object
GP              728 non-null float64
G               728 non-null float64
A               728 non-null float64
PTS             728 non-null float64
+/-             724 non-null float64
PIM             728 non-null float64
GP.1            55 non-null float64
W               54 non-null float64
L               54 non-null float64
T/O             54 non-null float64
SV%             55 non-null float64
GAA             55 non-null float64
year            2119 non-null int64
dtypes: float64(14), int64(2), object(5)
memory usage: 364.2+ KB


## Data cleanup

Cleanup summary:

* consolidated player positions
    * corrected inconsistent labels (e.g., `C RW`, `C; LW`)
    * `C/RW`, `C/LW`, `C/W`, `F`, _etc._ are mapped to `C`
    * `L/RW` and `W` are mapped to `RW`
    * players who can play center are treated as centers for the purposes of this analysis
    * universal (left/right) wingers are treated as right wingers

In [7]:
df['Pos'].value_counts()

D        718
C        517
LW       319
RW       283
G        219
C/LW      31
C/RW      14
W          9
C RW       2
F          2
C; LW      2
L/RW       1
C/W        1
Name: Pos, dtype: int64

In [8]:
df['Pos'] = df['Pos'].str.replace("C RW", "C")
df['Pos'] = df['Pos'].str.replace("C; LW", "C")
df['Pos'] = df['Pos'].str.replace("F", "C")
df['Pos'] = df['Pos'].str.replace("C/W", "C")
df['Pos'] = df['Pos'].str.replace("C/LW", "C")
df['Pos'] = df['Pos'].str.replace("C/RW", "C")
df['Pos'] = df['Pos'].str.replace("L/RW", "RW")
mask = df['Pos'] == "W"
df['Pos'] = np.where(mask, "RW", df['Pos'])
df['Pos'].value_counts()

D     718
C     569
LW    319
RW    293
G     219
Name: Pos, dtype: int64
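The same consolidation can also be written as a single lookup table, which keeps all the rules in one place. The sketch below is an alternative, not the code used above; `position_map` and the toy Series are illustrative names and values.

```python
import pandas as pd

# Raw labels as they appear in the value counts above -> consolidated label
position_map = {
    'C RW': 'C', 'C; LW': 'C', 'C/LW': 'C', 'C/RW': 'C', 'C/W': 'C', 'F': 'C',
    'L/RW': 'RW', 'W': 'RW',
}

# Toy Series standing in for df['Pos'] (illustrative values only)
pos = pd.Series(['D', 'C/LW', 'W', 'F', 'L/RW', 'LW'])

# Series.replace with a dict matches whole values,
# so labels such as 'LW' and 'RW' are left untouched
cleaned = pos.replace(position_map)
print(cleaned.tolist())
```

Because whole values are matched rather than substrings, the order of the rules does not matter, unlike a chain of `str.replace` calls.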

## New features
* `year`: int, year of NHL draft, extracted from .csv file names
* `num_teams`: int, number of teams in each draft year
* `round_ratio`: float, position of each pick expressed in rounds: 
    * $\text{round_ratio}=\large{\frac{\text{# Overall} - 1}{\text{number of teams}}}$ 
    * the number of teams represents the number of picks per round
    * each overall pick number (e.g., 171) is divided by the number of picks per round to determine in which round each prospect was selected (and, via the fractional part, how late in that round)
    * subtracting 1 ensures a proper boundary between rounds: without it, the last pick of a round (e.g., #30 with 30 teams) would yield a ratio of exactly 1 and be assigned to the next round
    * so, for example, for pick #171 $\text{round_ratio}=\frac{171 - 1}{30} = 5.67$
* `round`: int, round in which a prospect was selected
    * `round_ratio` is rounded down and 1 is added
    * $\text{round} = \text{int}(\text{round_ratio}) + 1$
* `1st_round`: boolean, whether the prospect was selected in the $1^{st}$ round
    * binary indicator for $1^{st}$ round picks
    * True if `round` == 1, False otherwise
* `gpg`: float, average goals per game
* `apg`: float, average assists per game
* `ppg`: float, average points per game
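The round arithmetic described above can be checked on a few boundary picks. This is a minimal sketch with a hypothetical helper, `draft_round`, assuming every round has exactly `num_teams` picks:

```python
def draft_round(overall, num_teams):
    """Return (round_ratio, round) for an overall pick number.

    Hypothetical helper: assumes each round has exactly `num_teams` picks.
    """
    round_ratio = (overall - 1) / num_teams
    return round_ratio, int(round_ratio) + 1


# Picks 30 and 31 sit on either side of the first round boundary with 30 teams
for overall in (1, 30, 31, 171):
    ratio, rnd = draft_round(overall, 30)
    print(overall, round(ratio, 2), rnd)
```

With 30 teams, pick #30 stays in round 1 and pick #31 opens round 2, which is the boundary behavior the subtraction of 1 is meant to guarantee.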

In [9]:
num_teams = df.groupby('year')['Team'].nunique()
num_teams.name = 'num_teams'
df = pd.merge(df, num_teams, 
              right_on=num_teams.index,
              left_on='year')
print("New column 'num_teams' added to df.")

New column 'num_teams' added to df.
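An alternative to the merge is `groupby(...).transform('nunique')`, which broadcasts the per-year team count back to every row in one step. The sketch below uses a toy frame with made-up rows and is not the code used above:

```python
import pandas as pd

# Toy data standing in for the draft DataFrame (illustrative values only)
toy = pd.DataFrame({
    'year': [2016, 2016, 2016, 2017, 2017, 2017],
    'Team': ['TOR', 'WPG', 'TOR', 'NJD', 'PHI', 'DAL'],
})

# Count distinct teams per year and broadcast the count back to every row
toy['num_teams'] = toy.groupby('year')['Team'].transform('nunique')
print(toy['num_teams'].tolist())
```

Note that `transform` keeps the frame's existing index, so with the repeated index left by `pd.concat` (0 to 216 within each year) one may want to call `reset_index(drop=True)` first.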


In [10]:
df['round_ratio'] = (df['Overall'] - 1)/ df['num_teams']
df['round'] = df['round_ratio'].astype('int') + 1
df['1st_round'] = df['round'] == 1
print("New columns 'round_ratio', 'round', and '1st_round' added to df.")

New columns 'round_ratio', 'round', and '1st_round' added to df.


In [22]:
df['gpg'] = df['G'] / df['GP']
df['apg'] = df['A'] / df['GP']
df['ppg'] = df['PTS'] / df['GP']
print("New columns `gpg`, `apg`, and `ppg` added to df.")

New columns `gpg`, `apg`, and `ppg` added to df.
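Most prospects have no NHL games on record (`GP` is non-null for only 728 of 2,119 rows), so the per-game columns are mostly NaN. If a row ever had `GP` equal to 0, plain division would produce `inf` or trigger a division by zero instead. The sketch below shows one way to guard against that; the numbers in the toy frame are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: a player with games, one with zero games, one with no NHL record
toy = pd.DataFrame({'G': [279.0, 0.0, np.nan],
                    'GP': [500.0, 0.0, np.nan]})

# Treat zero games as missing so the ratio is NaN rather than a division by zero
games = toy['GP'].replace(0, np.nan)
toy['gpg'] = toy['G'] / games
print(toy['gpg'].tolist())
```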


## Sanity checks

In [11]:
df.groupby('year')['Team'].nunique()

year
2009    30
2010    30
2011    30
2012    30
2013    30
2014    30
2015    30
2016    30
2017    31
2018    31
Name: Team, dtype: int64

In [12]:
year = 2016
pick = 1
# Look up the #1 overall pick from 2016
mask1 = df['year'] == year
mask2 = df['Overall'] == pick
top_pick = df.loc[mask1 & mask2].iloc[0]
# Keep only the text before the backslash in 'Player'
print("#{0} pick in {1} was {2} picked by the {3}."
      .format(pick, year,
              top_pick['Player'].split('\\')[0],
              top_pick['Team']))
print("\nTotal picks by round in {0}:"
      .format(year))
df.loc[mask1, 'round'].value_counts().sort_index()

#1 pick in 2016 was Auston Matthews picked by the Toronto Maple Leafs.

Total picks by round in 2016:


1    30
2    30
3    30
4    30
5    30
6    30
7    30
8     1
Name: round, dtype: int64

### Counts by #overall
Since the data covers 10 drafts, almost every overall pick number should appear exactly 10 times.

In [13]:
counts = df['Overall'].value_counts()
len(counts[counts == 10]) / len(counts)

0.9631336405529954

The overall pick numbers that do NOT appear exactly 10 times ("anomalies") are listed below.

In [14]:
# overall pick numbers that do not appear exactly 10 times
counts[counts != 10]

118    9
211    8
215    2
212    2
214    2
216    2
213    2
217    2
Name: Overall, dtype: int64

Most overall pick numbers appear 10 times, once for each of the 10 drafts held from 2009 to 2018.

In [17]:
df['year'].value_counts().sort_index()

2009    210
2010    210
2011    211
2012    211
2013    211
2014    210
2015    211
2016    211
2017    217
2018    217
Name: year, dtype: int64

The last two drafts were slightly longer (217 players each, compared to 210 or 211 before 2017), which is why overall pick numbers 212 through 217 appear only twice in the counts above.
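The counts above also show that overall #118 appears only 9 times, which suggests that one draft skips that pick number. One way to locate such gaps is to compare the number of rows per year with the highest pick number in that year. The sketch below does this on a toy frame with invented rows and assumes no pick number is duplicated within a year:

```python
import pandas as pd

# Toy data: 2010 skips pick #3 (illustrative values only)
toy = pd.DataFrame({'year': [2009] * 4 + [2010] * 4,
                    'Overall': [1, 2, 3, 4, 1, 2, 4, 5]})

# If no pick number is skipped, the row count equals the highest pick number
summary = toy.groupby('year')['Overall'].agg(['count', 'max'])
summary['missing'] = summary['max'] - summary['count']
print(summary)
```

A non-zero `missing` value flags a year in which at least one pick number is absent from the data.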

### Goals/assists/points per game

In [29]:
focus_id = 0
print("GPG * GP =", df.loc[focus_id, 'gpg'] 
      * df.loc[focus_id, 'GP'])
print("G =", df.loc[focus_id, 'G'])
print("APG * GP =", df.loc[focus_id, 'apg'] 
      * df.loc[focus_id, 'GP'])
print("A =", df.loc[focus_id, 'A'])
print("PPG * GP =", df.loc[focus_id, 'ppg'] 
      * df.loc[focus_id, 'GP'])
print("PTS =", df.loc[focus_id, 'PTS'])

GPG * GP = 279.0
G = 279.0
APG * GP = 354.0
A = 354.0
PPG * GP = 633.0
PTS = 633.0


## Record results to a new .csv file

In [30]:
save_path = '../../data/nhl_draft_picks_2009-2018.csv'
t = time()
df.to_csv(save_path)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))


DataFrame saved to file:
 ../../data/nhl_draft_picks_2009-2018.csv 
took 0.09 seconds
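`df.to_csv(save_path)` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is reloaded with `pd.read_csv`. If the index carries no information, passing `index=False` avoids this. The sketch below illustrates the round trip on a toy frame written to a temporary folder; the file name and values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned draft data (illustrative values only)
toy = pd.DataFrame({'Player': ['A', 'B'], 'Overall': [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path, index=False)  # do not write the row index
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```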
