# NHL Draft dataset
# Feature extraction
# Records
## Most Goals by a Rookie in a Single Season
This notebook presents feature extraction from NHL Records data obtained from the Records endpoint of the NHL Records API.
### Data collection summary
The dataset was generated from a JSON response returned by the NHL Records API to a request for all draft records.

For details, see notebook `notebooks/feature_extraction/nhl_api.ipynb`.

## Preparations
### Import dependencies

In [1]:
from time import time
import os

os.chdir('Documents/repos/nhl_draft/')

from src.io_utils import csv_to_df_rec, df_to_csv
from src.preproc_utils import norm_cols

os.listdir()

['.git',
 '.gitattributes',
 '.gitignore',
 '.idea',
 'data',
 'img',
 'main.py',
 'methodology',
 'models',
 'notebooks',
 'README.md',
 'requirements.txt',
 'src']

### Load data

In [2]:
rec_name = 'most-goals-rookie-one-season'
suffix = '_new_cols'
df, name = csv_to_df_rec(rec_name, suffix)

----- NHL Records
--- Most Goals, Rookie, Season 

----- DataFrame with NHL Records Data loaded
in 0.21 seconds
with 3,377 rows
and 60 columns
-- Column names:
 Index(['activePlayer', 'assists', 'assistsPerGpMin20', 'firstGoals',
       'firstName', 'fiveGoalGames', 'fourGoalGames', 'gameWinningGoals',
       'gamesInSchedule', 'gamesPlayed', 'goals', 'goalsPerGpMin20',
       'goalsPerGpMin50', 'id', 'lastName', 'overtimeAssists', 'overtimeGoals',
       'overtimePoints', 'penalties', 'penaltyMinutes', 'playerId', 'points',
       'pointsPerGpMin50', 'positionCode', 'powerPlayGoals', 'rookieFlag',
       'seasonId', 'sevenGoalGames', 'shorthandedGoals', 'shots',
       'sixGoalGames', 'teamAbbrevs', 'teamNames', 'threeGoalGames',
       'threeOrMoreGoalGames', 'assists_norm', 'firstGoals_norm',
       'gameWinningGoals_norm', 'gamesPlayed_norm', 'goals_norm',
       'overtimeGoals_norm', 'overtimePoints_norm', 'penalties_norm',
       'penaltyMinutes_norm', 'points_norm', 'powerPlayGo

## New variables
### Defence
Binary indicator for defencemen (`True` if the player is a defenceman, `False` for all other positions). This is equivalent to keeping only the `D` column of a one-hot encoding of `positionCode`.

In [3]:
df['def'] = df['positionCode'] == 'D'
print("Value counts of all positions:\n",
      df['positionCode'].value_counts(),
      "\n\nValue counts of the new variable 'def':\n",
      df['def'].value_counts())

Value counts of all positions:
 C    1004
L     865
R     843
D     665
Name: positionCode, dtype: int64 

Value counts of the new variable 'def':
 False    2712
True      665
Name: def, dtype: int64


## Rescaling variables
Two techniques can be used to rescale data consistently: normalization and standardization.
### Normalization 
Normalization rescales the data from its original range so that all values fall within a fixed range, most commonly 0 to 1.
* Types of normalization:
    * Rescaling (min-max normalization)  
$ \large{ X' = \frac{ X - X_{min} } { X_{max} - X_{min} } } $
    * Rescaling between an arbitrary set of values  
$ \large{ X' = a + \frac{ (X - X_{min})(b - a) } { X_{max} - X_{min} } } $
    * Mean normalization  
$ \large{ X' = \frac{ X - \mu_X } { X_{max} - X_{min} } } $
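
As a worked illustration of the three formulas above, here is a minimal pure-Python sketch. The function names and the sample values are made up for this example and are not part of the project code; the sketch also assumes the feature is not constant, since a zero range would cause a division by zero.

```python
def min_max(values):
    """Rescale values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def rescale(values, a, b):
    """Rescale values to an arbitrary range [a, b]."""
    lo, hi = min(values), max(values)
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]


def mean_normalize(values):
    """Centre values on their mean, then divide by the range."""
    lo, hi = min(values), max(values)
    mu = sum(values) / len(values)
    return [(x - mu) / (hi - lo) for x in values]


# Hypothetical goal totals, chosen only to keep the arithmetic easy to follow
goals = [2, 4, 6, 10]

print(min_max(goals))         # [0.0, 0.25, 0.5, 1.0]
print(rescale(goals, -1, 1))  # [-1.0, -0.5, 0.0, 1.0]
print(mean_normalize(goals))  # [-0.4375, -0.1875, 0.0625, 0.5625]
```

Note that mean normalization centres the values on zero, so its output is not confined to the range of 0 to 1.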

Variables can be normalized using the `scikit-learn` object `MinMaxScaler`.
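
A minimal sketch of `MinMaxScaler` on a small array is shown here for orientation. The array `X` and its values are hypothetical; this is not how `norm_cols` is necessarily implemented in this project.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: one row per player, columns = (goals, assists)
X = np.array([[2.0, 0.0],
              [4.0, 8.0],
              [6.0, 16.0],
              [10.0, 32.0]])

scaler = MinMaxScaler()           # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)  # each column is rescaled independently

# The fitted scaler stores the per-column minimum and maximum it learned
print(scaler.data_min_, scaler.data_max_)
print(X_norm)
```

Passing `feature_range=(a, b)` to the constructor gives the rescaling to an arbitrary range instead of 0 to 1.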

### Standardization (Z-score normalization)
Standardization is another type of rescaling. It is more robust than normalization to new values that fall outside the expected range, because it does not depend on the observed minimum and maximum. Standardizing a feature subtracts its mean and divides by its standard deviation, so the rescaled feature has zero mean and unit variance.
* Standardization is most meaningful when observations approximately follow a [Gaussian distribution](http://hyperphysics.phy-astr.gsu.edu/hbase/Math/gaufcn.html) (bell curve) with a well-behaved mean and standard deviation.
    * Data can still be standardized if this expectation is not met, but the results may be less reliable.

* General standardization is defined as
$ \large{ X' = \frac{ X - \mu_X } { \sigma_X } } $,
where $\mu_X$ is the mean of the feature and $\sigma_X$ is its standard deviation.

Variables can be standardized using the `scikit-learn` object `StandardScaler`.
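
The following sketch compares `StandardScaler` with the formula above computed by hand. The array `X` is hypothetical, and the comparison is only meant to show which standard deviation the scaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column (e.g. goals for four players)
X = np.array([[2.0], [4.0], [6.0], [8.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Manual z-score: NumPy's std defaults to ddof=0 (population standard
# deviation), which is also what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, manual))  # True
print(X_std.mean(), X_std.std())
```

One detail to keep in mind: `pandas` computes `Series.std()` with `ddof=1` by default, so a z-score computed directly with pandas defaults will differ slightly from the `StandardScaler` result.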

In [4]:
# list of columns to normalize and standardize
cols = ['assists', 'firstGoals', 'gameWinningGoals',
        'gamesPlayed', 'goals', 'overtimeAssists', 'overtimeGoals',
        'overtimePoints', 'penalties', 'penaltyMinutes',
        'points', 'powerPlayGoals', 'shots']
df = norm_cols(df, cols, op='norm')
df = norm_cols(df, cols, op='std')
df.info()

----- Normalizing features:
 ['assists', 'firstGoals', 'gameWinningGoals', 'gamesPlayed', 'goals', 'overtimeAssists', 'overtimeGoals', 'overtimePoints', 'penalties', 'penaltyMinutes', 'points', 'powerPlayGoals', 'shots']

Feature: assists
Min: 0.000000, Max: 70.000000

Feature: firstGoals
Min: 0.000000, Max: 14.000000

Feature: gameWinningGoals
Min: 0.000000, Max: 9.000000

Feature: gamesPlayed
Min: 1.000000, Max: 84.000000

Feature: goals
Min: 3.000000, Max: 76.000000

Feature: overtimeAssists
Min: 0.000000, Max: 4.000000

Feature: overtimeGoals
Min: 0.000000, Max: 4.000000

Feature: overtimePoints
Min: 0.000000, Max: 6.000000

Feature: penalties
Min: 0.000000, Max: 103.000000

Feature: penaltyMinutes
Min: 0.000000, Max: 377.000000

Feature: points
Min: 3.000000, Max: 132.000000

Feature: powerPlayGoals
Min: 0.000000, Max: 31.000000

Feature: shots
Min: 3.000000, Max: 425.000000

 ----- All columns normalized!
----- Standardizing features:
 ['assists', 'firstGoals', 'gameWinningGoals'
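
The source of `norm_cols` is not shown in this notebook. As a rough sketch of what a helper with similar behaviour could look like, the hypothetical function `rescale_cols` below adds one new column per feature. The `_norm` suffix matches the column names in the loaded data; the `_std` suffix and the use of the population standard deviation are assumptions.

```python
import pandas as pd


def rescale_cols(df, cols, op="norm"):
    """Hypothetical stand-in for norm_cols: return a copy with rescaled columns added."""
    out = df.copy()
    for col in cols:
        if op == "norm":
            # min-max normalization to the range [0, 1]
            lo, hi = out[col].min(), out[col].max()
            out[col + "_norm"] = (out[col] - lo) / (hi - lo)
        elif op == "std":
            # z-score; ddof=0 (population standard deviation) is an assumption
            out[col + "_std"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return out


# Toy data, not taken from the NHL Records dataset
toy = pd.DataFrame({"goals": [2, 4, 6, 10], "assists": [0, 8, 16, 32]})
toy = rescale_cols(toy, ["goals", "assists"], op="norm")
toy = rescale_cols(toy, ["goals", "assists"], op="std")
print(toy.columns.tolist())
# ['goals', 'assists', 'goals_norm', 'assists_norm', 'goals_std', 'assists_std']
```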

## Record results to a new .csv file

In [5]:
suffix = '_new_cols'
save_path = 'data/nhl_api/records/' + \
            rec_name + suffix + '.csv'
df_to_csv(df, save_path)

DataFrame saved to file:
 data/nhl_api/records/most-goals-rookie-one-season_new_cols.csv 
took 0.31 seconds
