# NHL Draft dataset
# Feature extraction
# Records: 
## Most Goals by a Rookie in a Single Season
This notebook extracts features from NHL records data obtained from the Records endpoint of the NHL Records API.
### Data collection summary
The dataset was generated from the JSON response of the NHL Records API to a request for all draft records.

For details, see notebook `notebooks/feature_extraction/nhl_api.ipynb`.

## Preparations
### Import dependencies

In [1]:
import pandas as pd
from math import sqrt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from time import time
import sys
import os

In [2]:
os.chdir('Documents/repos/nhl_draft/')
sys.path.append('src')
os.listdir()

['.git',
 '.gitattributes',
 '.gitignore',
 '.idea',
 'data',
 'design',
 'main.py',
 'models',
 'notebooks',
 'README.md',
 'requirements.txt',
 'src']

In [3]:
from preproc_utils import norm_cols

### Load data

In [4]:
# Look up the human-readable description of the selected record
rec_name = 'most-goals-rookie-one-season'
rec_file = 'data/nhl_api/records/records_main.csv'
df_rec = pd.read_csv(rec_file)
mask = df_rec['descriptionKey'] == rec_name
name = df_rec.loc[mask, 'description'].values[0]
print("----- NHL Records\n---", name, 'dataset\n')

# Load the dataset for this record and time the read
file = 'data/nhl_api/records/' + rec_name + '.csv'
t = time()
df = pd.read_csv(file)
elapsed = time() - t
print("----- DataFrame with NHL Draft Data loaded"
      "\nin {0:.2f} seconds".format(elapsed) + 
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(df.shape[0], df.shape[1]) + 
      "\n-- Column names:\n", df.columns)

----- NHL Records
--- Most Goals, Rookie, Season dataset

----- DataFrame with NHL Draft Data loaded
in 0.07 seconds
with 3,377 rows
and 35 columns
-- Column names:
 Index(['activePlayer', 'assists', 'assistsPerGpMin20', 'firstGoals',
       'firstName', 'fiveGoalGames', 'fourGoalGames', 'gameWinningGoals',
       'gamesInSchedule', 'gamesPlayed', 'goals', 'goalsPerGpMin20',
       'goalsPerGpMin50', 'id', 'lastName', 'overtimeAssists', 'overtimeGoals',
       'overtimePoints', 'penalties', 'penaltyMinutes', 'playerId', 'points',
       'pointsPerGpMin50', 'positionCode', 'powerPlayGoals', 'rookieFlag',
       'seasonId', 'sevenGoalGames', 'shorthandedGoals', 'shots',
       'sixGoalGames', 'teamAbbrevs', 'teamNames', 'threeGoalGames',
       'threeOrMoreGoalGames'],
      dtype='object')


## Rescaling variables
From [machinelearningmastery.com](https://machinelearningmastery.com/normalize-standardize-time-series-data-python/), [Wikipedia](https://en.wikipedia.org/wiki/Feature_scaling), and a [lecture by Andrew Ng](http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=03.1-LinearRegressionII-FeatureScaling&speed=100/) on feature scaling:

Some machine learning algorithms perform better when the data has a consistent scale or distribution. Because the range of raw feature values can vary widely, the objective functions of some algorithms will not work properly without normalization.

For example, many classifiers measure the distance between two points with the Euclidean distance. If one feature has a much broader range of values than the others, that feature will dominate the distance. The ranges of all features should therefore be normalized so that each feature contributes approximately proportionately to the final distance.
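
To make the effect concrete, here is a minimal sketch using two columns of this dataset. The ranges are the ones reported in this notebook's normalization output (`shots`: 3 to 425, `overtimeGoals`: 0 to 4); the two players are hypothetical values chosen for illustration, not rows from the data.

```python
from math import sqrt

# Observed (min, max) per feature, as reported in this notebook's normalization output.
RANGES = {"shots": (3.0, 425.0), "overtimeGoals": (0.0, 4.0)}

# Two hypothetical rookies (illustrative values, not rows from the dataset).
player_a = {"shots": 100.0, "overtimeGoals": 0.0}
player_b = {"shots": 110.0, "overtimeGoals": 4.0}


def euclidean(p, q):
    """Euclidean distance between two feature dicts with the same keys."""
    return sqrt(sum((p[k] - q[k]) ** 2 for k in p))


def min_max(p):
    """Rescale every feature to [0, 1] using the observed ranges."""
    return {k: (v - RANGES[k][0]) / (RANGES[k][1] - RANGES[k][0])
            for k, v in p.items()}


def shots_share(p, q):
    """Fraction of the squared distance contributed by 'shots'."""
    return (p["shots"] - q["shots"]) ** 2 / euclidean(p, q) ** 2


raw_share = shots_share(player_a, player_b)
scaled_share = shots_share(min_max(player_a), min_max(player_b))

print("raw distance:    {:.3f} (shots share {:.1%})".format(
    euclidean(player_a, player_b), raw_share))
print("scaled distance: {:.3f} (shots share {:.1%})".format(
    euclidean(min_max(player_a), min_max(player_b)), scaled_share))
```

On the raw values, the 10-shot difference accounts for 100 of the 116 units of squared distance, even though the overtime-goal difference spans that feature's entire observed range. After rescaling, the overtime-goal difference dominates instead.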

Another reason to apply feature scaling is that gradient descent converges much faster with it than without it. In stochastic gradient descent, feature scaling can sometimes improve the convergence speed of the algorithm, and in support vector machines it can reduce the time needed to find support vectors. Note that feature scaling changes the SVM result.
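
The convergence claim can be checked with a small experiment. The sketch below runs batch gradient descent for least squares on synthetic data (not the NHL dataset) whose two features have very different ranges, once on the raw features and once on z-scored features. The step size `1 / largest eigenvalue of the Hessian` is an assumption made so that both runs use a comparable, stable step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (ranges loosely mimic shots and overtimeGoals).
X = np.column_stack([rng.uniform(3, 425, n), rng.uniform(0, 4, n)])
y = 1.0 + 0.05 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, n)


def gd_iterations(X, y, tol=1e-6, max_iter=20_000):
    """Batch gradient descent on mean squared error.

    Returns the number of iterations until the gradient norm drops
    below `tol`, or `max_iter` if that never happens.
    """
    Xb = np.column_stack([np.ones(len(X)), X])    # add intercept column
    hessian = Xb.T @ Xb / len(y)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # assumed step size
    w = np.zeros(Xb.shape[1])
    for i in range(1, max_iter + 1):
        grad = Xb.T @ (Xb @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter


X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score each column

iters_raw = gd_iterations(X, y)
iters_std = gd_iterations(X_std, y)
print("iterations, raw features:  ", iters_raw)
print("iterations, standardized:  ", iters_std)
```

With the raw features the loop is likely to stop only because it reaches `max_iter`, since the loss surface is stretched along the large-range feature; with standardized features it should meet the tolerance after a small number of steps.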

Two techniques that can be used to rescale data consistently are normalization and standardization.
### Normalization 
* Also known as feature scaling or unity-based normalization
* Normalization rescales the data from its original range so that all values lie between 0 and 1.
* Normalization can be useful, and even required, in some machine learning algorithms when the input values have differing scales.
* It may be required for algorithms such as k-nearest neighbors, which relies on distance calculations, and for linear regression and artificial neural networks, which weight input values.
* Normalization requires the knowledge or accurate estimation of the minimum and maximum observable values (can be estimated from the available data).
* If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting.
* If the data is a time series that is trending up or down, estimating the expected minimum and maximum may be difficult, and normalization may not be the best method to use.
* Types of normalization:
    * Rescaling (min-max normalization)  
$ \large{ X' = \frac{ X - X_{min} } { X_{max} - X_{min} } } $
    * Rescaling between an arbitrary set of values  
$ \large{ X' = a + \frac{ (X - X_{min})(b - a) } { X_{max} - X_{min} } } $
    * Mean normalization  
$ \large{ X' = \frac{ X - \mu_X } { X_{max} - X_{min} } } $
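
The three formulas above, plus the inverse of the min-max transform, can be sketched directly in NumPy. The input values are a toy array; only its endpoints (3 and 76) are taken from the observed range of `goals` in this notebook.

```python
import numpy as np

x = np.array([3.0, 21.25, 39.5, 76.0])  # toy values; 3 and 76 match the goals range
x_min, x_max = x.min(), x.max()

# Rescaling (min-max normalization) to [0, 1]
x_minmax = (x - x_min) / (x_max - x_min)

# Rescaling to an arbitrary interval [a, b]
a, b = -1.0, 1.0
x_ab = a + (x - x_min) * (b - a) / (x_max - x_min)

# Mean normalization: centered on 0, divided by the range
x_mean_norm = (x - x.mean()) / (x_max - x_min)

# Inverting the min-max transform recovers the original scale
x_restored = x_minmax * (x_max - x_min) + x_min

print(x_minmax)
print(x_ab)
print(x_mean_norm)
```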

Variables can be normalized using the `scikit-learn` object `MinMaxScaler`.
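
The `norm_cols` helper used in this notebook is imported from `preproc_utils` and its source is not shown here. As a rough, hypothetical sketch of the same idea, the function below appends min-max scaled copies of selected columns with `MinMaxScaler`. The name `add_minmax_columns` and the `_norm` suffix are assumptions made for illustration, not the helper's actual interface or naming.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def add_minmax_columns(df, cols, suffix="_norm"):
    """Hypothetical helper: append min-max scaled copies of `cols`.

    The '_norm' suffix is an assumption, not the naming used by norm_cols.
    """
    out = df.copy()
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(out[cols].to_numpy(dtype=float))
    for i, col in enumerate(cols):
        print("Feature: {}\nMin: {:f}, Max: {:f}\n".format(
            col, scaler.data_min_[i], scaler.data_max_[i]))
        out[col + suffix] = scaled[:, i]
    return out, scaler


# Toy frame with made-up values (not rows from the dataset).
toy = pd.DataFrame({"goals": [3, 21, 76], "shots": [3, 214, 425]})
toy_scaled, scaler = add_minmax_columns(toy, ["goals", "shots"])

# The fitted scaler can map scaled values back to the original units.
restored = scaler.inverse_transform(
    toy_scaled[["goals_norm", "shots_norm"]].to_numpy())
print(toy_scaled)
```

Returning the fitted scaler alongside the frame is a design choice of this sketch: it keeps the minimum and maximum available for `inverse_transform` and for scaling new data the same way.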

In [5]:
# list of columns to normalize
cols = ['assists', 'firstGoals', 'gameWinningGoals',
        'gamesPlayed', 'goals', 'overtimeGoals',
        'overtimePoints', 'penalties', 'points',
        'powerPlayGoals', 'shots']
df = norm_cols(df, cols, op='norm')

----- Normalizing features:
 ['assists', 'firstGoals', 'gameWinningGoals', 'gamesPlayed', 'goals', 'overtimeGoals', 'overtimePoints', 'penalties', 'points', 'powerPlayGoals', 'shots']

Feature: assists
Min: 0.000000, Max: 70.000000

Feature: firstGoals
Min: 0.000000, Max: 14.000000

Feature: gameWinningGoals
Min: 0.000000, Max: 9.000000

Feature: gamesPlayed
Min: 1.000000, Max: 84.000000

Feature: goals
Min: 3.000000, Max: 76.000000

Feature: overtimeGoals
Min: 0.000000, Max: 4.000000

Feature: overtimePoints
Min: 0.000000, Max: 6.000000

Feature: penalties
Min: 0.000000, Max: 103.000000

Feature: points
Min: 3.000000, Max: 132.000000

Feature: powerPlayGoals
Min: 0.000000, Max: 31.000000

Feature: shots
Min: 3.000000, Max: 425.000000

 ----- All columns normalized!


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3377 entries, 0 to 3376
Data columns (total 46 columns):
activePlayer             3377 non-null bool
assists                  3377 non-null int64
assistsPerGpMin20        731 non-null float64
firstGoals               3377 non-null int64
firstName                3377 non-null object
fiveGoalGames            308 non-null float64
fourGoalGames            308 non-null float64
gameWinningGoals         3377 non-null int64
gamesInSchedule          3377 non-null int64
gamesPlayed              3377 non-null int64
goals                    3377 non-null int64
goalsPerGpMin20          340 non-null float64
goalsPerGpMin50          4 non-null float64
id                       3377 non-null int64
lastName                 3377 non-null object
overtimeAssists          3377 non-null int64
overtimeGoals            3377 non-null int64
overtimePoints           3377 non-null int64
penalties                3377 non-null int64
penaltyMinutes           3377 non-

### Standardization (Z-score normalization)
* Standardization is another type of rescaling that is more robust to new values being outside the range of expected values than normalization. 
* Feature standardization gives the values of each feature zero mean (by subtracting the mean in the numerator) and unit variance.
    * This can be thought of as subtracting the mean value, or centering the data, and scaling by standard deviation.
* Like normalization, standardization can be useful, and even required in some machine learning algorithms when data has input values with differing scales.
    * This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks).
* Standardization assumes that observations fit a [Gaussian distribution](http://hyperphysics.phy-astr.gsu.edu/hbase/Math/gaufcn.html) (bell curve) with a well behaved mean and standard deviation. 
    * Data can still be standardized if this expectation is not met, but results might not be reliable.
* Standardization requires the knowledge or accurate estimation of the mean and standard deviation of observable values. 
    * These values can be estimated from training data.
* Types of standardization
    * General standardization  
$ \large{ X' = \frac{ X - \mu_X } { \sigma_X } } $,
where $\mu_X$ is the mean of the feature and $\sigma_X$ is its standard deviation
    * Scaling to unit length
        * Another option that is widely used in machine-learning is to scale the components of a feature vector such that the complete vector has length one. 
        * This usually means dividing each component by the [Euclidean length](https://en.wikipedia.org/wiki/Euclidean_length) of the vector:  
$ \large{ X' = \frac{ X } { ||X||_2 } } $
        * In some applications (e.g. histogram features) it can be more practical to use the L1 norm (i.e. Manhattan distance, city-block length or [taxicab geometry](https://en.wikipedia.org/wiki/Taxicab_Geometry)) of the feature vector. 
        * This is especially important if the Scalar Metric is used as a distance measure in the subsequent learning steps.
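
Both kinds of standardization listed above can be written in a few lines of NumPy. This is a sketch on toy values, chosen so the mean and standard deviation come out as round numbers.

```python
import numpy as np

# Z-score standardization of one feature (toy values)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu, sigma = x.mean(), x.std()   # np.std defaults to the population std (ddof=0)
z = (x - mu) / sigma

# Scaling one feature vector to unit length
v = np.array([3.0, 4.0])
v_l2 = v / np.linalg.norm(v, ord=2)   # Euclidean (L2) length becomes 1
v_l1 = v / np.linalg.norm(v, ord=1)   # Manhattan (L1) length becomes 1

print("z-scores:", z)
print("unit L2: ", v_l2)
print("unit L1: ", v_l1)
```

Note that the z-score acts on one feature across all observations, while unit-length scaling acts on one observation across all of its features.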

Variables can be standardized using the `scikit-learn` object `StandardScaler`.
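
As a minimal sketch of that usage, the example below fits `StandardScaler` on toy training values and then applies the learned mean and standard deviation to a new value outside the training range. The data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training column (one feature) and a new, larger observation.
train = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])
new = np.array([[13.0]])   # larger than anything seen during fitting

scaler = StandardScaler().fit(train)   # learns mean_ and scale_ from training data only
z_train = scaler.transform(train)
z_new = scaler.transform(new)          # reuses the training mean and std

# The transform is invertible, just like min-max scaling.
new_restored = scaler.inverse_transform(z_new)

print("mean:", scaler.mean_[0], "std:", scaler.scale_[0])
print("z-score of new value:", z_new[0, 0])
```

A min-max scaler fitted on the same training values would map the new observation outside the interval from 0 to 1, whereas the z-score has no fixed bounds and simply reports how many standard deviations the value lies from the training mean.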

In [7]:
# list of columns to standardize
cols = ['assists', 'firstGoals', 'gameWinningGoals',
        'gamesPlayed', 'goals', 'overtimeGoals',
        'overtimePoints', 'penalties', 'points',
        'powerPlayGoals', 'shots']
df = norm_cols(df, cols, op='std')

----- Standardizing features:
 ['assists', 'firstGoals', 'gameWinningGoals', 'gamesPlayed', 'goals', 'overtimeGoals', 'overtimePoints', 'penalties', 'points', 'powerPlayGoals', 'shots']

Feature: assists
Mean: 12.947883, StandardDeviation: 10.205453

Feature: firstGoals
Mean: 1.493041, StandardDeviation: 1.602493

Feature: gameWinningGoals
Mean: 1.254960, StandardDeviation: 1.408284

Feature: gamesPlayed
Mean: 51.437074, StandardDeviation: 20.844453

Feature: goals
Mean: 9.479123, StandardDeviation: 7.316703

Feature: overtimeGoals
Mean: 0.077880, StandardDeviation: 0.308078

Feature: overtimePoints
Mean: 0.177376, StandardDeviation: 0.503688

Feature: penalties
Mean: 14.138585, StandardDeviation: 13.005445

Feature: points
Mean: 22.427006, StandardDeviation: 16.056978

Feature: powerPlayGoals
Mean: 1.954246, StandardDeviation: 2.714523

Feature: shots
Mean: 84.595841, StandardDeviation: 51.647488

 ----- All columns standardized!


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3377 entries, 0 to 3376
Data columns (total 57 columns):
activePlayer             3377 non-null bool
assists                  3377 non-null int64
assistsPerGpMin20        731 non-null float64
firstGoals               3377 non-null int64
firstName                3377 non-null object
fiveGoalGames            308 non-null float64
fourGoalGames            308 non-null float64
gameWinningGoals         3377 non-null int64
gamesInSchedule          3377 non-null int64
gamesPlayed              3377 non-null int64
goals                    3377 non-null int64
goalsPerGpMin20          340 non-null float64
goalsPerGpMin50          4 non-null float64
id                       3377 non-null int64
lastName                 3377 non-null object
overtimeAssists          3377 non-null int64
overtimeGoals            3377 non-null int64
overtimePoints           3377 non-null int64
penalties                3377 non-null int64
penaltyMinutes           3377 non-

## Record results to a new .csv file

In [9]:
suffix = '_new_cols'
save_path = 'data/nhl_api/records/' + \
            rec_name + suffix + '.csv'
t = time()
df.to_csv(save_path)
elapsed = time() - t
print("DataFrame saved to file:\n", save_path,
      "\ntook {0:.2f} seconds".format(elapsed))


DataFrame saved to file:
 data/nhl_api/records/most-goals-rookie-one-season_new_cols.csv 
took 0.37 seconds
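
One detail to keep in mind when reloading the saved file: `DataFrame.to_csv` writes the index as an unnamed first column by default. The sketch below shows the round trip on a toy frame written to a temporary directory; the column names and values are made up.

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({"goals": [3.0, 39.5, 76.0], "scaled": [0.0, 0.5, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_new_cols.csv")
    toy.to_csv(path)                            # index written as an unnamed first column

    naive = pd.read_csv(path)                   # index comes back as a data column
    roundtrip = pd.read_csv(path, index_col=0)  # first column restored as the index

print(naive.columns.tolist())
print(roundtrip.columns.tolist())
```

Reading with `index_col=0` restores the original columns; alternatively, passing `index=False` to `to_csv` avoids writing the index at all.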
