# Standard Scaler Normalization

This notebook explains how to use the standard scaler from `scikit-learn`. The scaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every scaled feature has a mean of 0 and a standard deviation of 1.
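
As a minimal sketch of that arithmetic, the example below standardizes a small made-up array (not the flights data) by hand with `numpy`. The array values and variable names are illustrative assumptions.

```python
import numpy as np

# Toy feature matrix (made-up values, not the flights data): 4 rows, 2 columns.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: subtract the mean, then divide by the standard deviation.
Z = (X - mean) / std

print(Z.mean(axis=0))  # close to 0 for each column
print(Z.std(axis=0))   # close to 1 for each column
```

Both columns end up on the same scale even though the second column's raw values are ten times larger.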

This notebook uses data on flights that departed from New York City airports in 2013.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [numpy](https://numpy.org/doc/stable/)
* [scikit-learn](https://scikit-learn.org/stable/)
    * [sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
    * [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [category_encoders](https://contrib.scikit-learn.org/category_encoders/)

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

## Reading the data

The data is from `rdatasets` imported using the Python package `statsmodels`.

In [2]:
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute    

## Feature Engineering

### Handle null values

In [3]:
df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

Because this model will predict arrival delay, rows without an arrival delay cannot be used. The null values come from flights that were cancelled or diverted, so these rows can be excluded from this analysis.

In [4]:
df.dropna(inplace=True)

### Convert the times from floats or ints to hours and minutes

In [5]:
# Times are stored as HHMM numbers (e.g. 1545 means 15:45):
# integer-divide by 100 for the hour, take the remainder for the minute.
df['arr_hour'] = (df.arr_time // 100).astype(int)
df['arr_minute'] = (df.arr_time % 100).astype(int)
df['sched_arr_hour'] = (df.sched_arr_time // 100).astype(int)
df['sched_arr_minute'] = (df.sched_arr_time % 100).astype(int)
df['sched_dep_hour'] = (df.sched_dep_time // 100).astype(int)
df['sched_dep_minute'] = (df.sched_dep_time % 100).astype(int)
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
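
The conversion above relies on the times being encoded as HHMM numbers. As a small sketch of the same arithmetic on single values, the hypothetical helper `split_hhmm` below (not part of the notebook) uses `divmod` to get the hour and minute in one step.

```python
def split_hhmm(hhmm):
    """Split a clock time encoded as an HHMM number into (hour, minute)."""
    # divmod returns the quotient (hour) and the remainder (minute) together.
    hour, minute = divmod(int(hhmm), 100)
    return hour, minute


print(split_hhmm(517.0))  # (5, 17)
print(split_hhmm(1545))   # (15, 45)
print(split_hhmm(2400))   # (24, 0): a value of 2400 would yield hour 24
```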

## Prepare data for modeling

### Set up train-test split

In [6]:
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes

month                 int64
day                   int64
carrier              object
flight               object
tailnum              object
origin               object
dest                 object
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object

### Encode categorical variables

We convert the categorical features to numeric ones with the leave-one-out encoder from `category_encoders`. The encoder replaces each categorical feature with a single numeric feature, so the number of columns does not change. This step is needed because the scaler can only be applied to numeric features.

In [7]:
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
X_train_loo.shape

(261876, 17)
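
To illustrate the idea behind leave-one-out encoding, here is a simplified sketch in plain `pandas` on made-up data. It is not the `category_encoders` implementation: details such as single-row categories and categories unseen during fitting are handled by the library and are omitted here. On training rows, each category value is replaced by the mean target of the *other* rows in that category; on test rows, it is replaced by the plain category mean from the training data.

```python
import pandas as pd

# Made-up training data: one categorical feature and the target.
train = pd.DataFrame({
    "carrier": ["AA", "AA", "AA", "UA", "UA"],
    "arr_delay": [10.0, 20.0, 30.0, 0.0, 8.0],
})

grouped = train.groupby("carrier")["arr_delay"]
cat_sum = grouped.transform("sum")      # per-row sum of the row's category
cat_count = grouped.transform("count")  # per-row size of the row's category
cat_mean = grouped.mean()               # one mean per category, for test rows

# Training rows: leave the row's own target out of its category mean.
train["carrier_loo"] = (cat_sum - train["arr_delay"]) / (cat_count - 1)
print(train["carrier_loo"].tolist())  # [25.0, 20.0, 15.0, 8.0, 0.0]

# Test rows: no target is available, so use the training category mean.
test = pd.DataFrame({"carrier": ["UA", "AA"]})
test["carrier_loo"] = test["carrier"].map(cat_mean)
print(test["carrier_loo"].tolist())  # [4.0, 20.0]
```

Leaving out the row's own target keeps each training row from seeing its own label through the encoded feature.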

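### Apply the standard scaler

We fit the standard scaler from `scikit-learn` on the training data only, so that the test set does not influence the means and standard deviations used for scaling.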

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_loo)  # StandardScaler ignores y, so it is not passed
X_train_scaled.shape

(261876, 17)

In [9]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled_df.describe()

| | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 |
| mean | -1.301713e-16 | 4.598161e-17 | 1.415887e-16 | -2.153564e-16 | 6.105726e-17 | 6.058388e-16 | -2.516668e-16 | 1.274292e-16 | 8.999015e-17 | 7.951318e-15 | -1.764622e-15 | 6.00907e-16 | -1.60843e-16 | -1.117432e-15 | -2.422899e-16 | 7.951318e-15 | -1.764622e-15 |
| std | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 | 1.000002 |
| min | -1.63054 | -1.676892 | -3.336721 | -5.885651 | -7.954211 | -0.8216296 | -4.807096 | -1.395737 | -1.316373 | -1.746525 | -1.359586 | -2.764704 | -1.698053 | -3.023737 | -1.667935 | -1.746525 | -1.359586 |
| 25% | -0.7520552 | -0.8798827 | -0.89843 | -0.6827688 | -0.6975891 | -0.8125293 | -0.8734769 | -0.7331093 | -0.7327537 | -0.8880329 | -0.9449566 | -0.6990625 | -0.8915001 | -0.8111693 | -0.8635545 | -0.8880329 | -0.9449566 |
| 50% | 0.1264298 | 0.03098504 | 0.1138961 | -0.1431278 | -0.04615378 | -0.6735426 | 0.09236695 | -0.230795 | -0.2171556 | -0.02954102 | 0.1434451 | 0.05207997 | -0.02733634 | -0.006599369 | 0.0557374 | -0.02954102 | 0.1434451 |
| 75% | 1.004915 | 0.8279943 | 0.5012644 | 0.5623605 | 0.5821325 | 1.33682 | 0.6148843 | 0.4318324 | 0.4644133 | 0.8289509 | 0.920875 | 0.8032224 | 0.8944383 | 0.7979706 | 0.8601178 | 0.8289509 | 0.920875 |
| max | 1.590571 | 1.738862 | 2.408367 | 15.73161 | 24.29566 | 1.337449 | 7.96019 | 5.818352 | 5.353752 | 2.116689 | 1.698305 | 1.74215 | 1.700991 | 1.602541 | 1.721954 | 2.116689 | 1.698305 |
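
The `std` row above shows 1.000002 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), which is larger by a factor of sqrt(n / (n - 1)). With n = 261876 rows, that factor is about 1.000002. The sketch below shows the same effect on a small synthetic column; the data and names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic single-column data (an assumption, not the flights data).
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(5.0, 2.0, size=1000)})

Z = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
n = len(Z)

print(Z["a"].std(ddof=0))    # population std, what StandardScaler normalizes to 1
print(Z["a"].std())          # sample std (pandas default, ddof=1), slightly above 1
print(np.sqrt(n / (n - 1)))  # the expected ratio between the two
```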


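Next, scale the test set with the scaler that was fitted on the training data. The scaled test set can then be passed to the `predict` method (or, for a classifier, `predict_proba`) of a model trained on the scaled training data.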

In [10]:
X_test_scaled = scaler.transform(X_test_loo)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_train.columns)
X_test_scaled_df.describe()

| | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 |
| mean | -0.005041 | 0.007379 | -0.003772 | 0.001724 | 0.000449 | -0.003078 | -0.002807 | 0.004899 | 0.005081 | 0.003615 | 0.000465 | 0.001644 | -0.002205 | 0.004651 | 0.004332 | 0.003615 | 0.000465 |
| std | 0.997754 | 0.996867 | 1.000505 | 0.993298 | 0.986614 | 0.999333 | 1.01071 | 1.006461 | 1.005689 | 1.00292 | 1.000407 | 1.002833 | 0.999758 | 1.000658 | 1.005285 | 1.00292 | 1.000407 |
| min | -1.63054 | -1.676892 | -3.057897 | -4.411746 | -4.97766 | -0.812695 | -4.439205 | -1.385049 | -1.316373 | -1.746525 | -1.359586 | -2.764704 | -1.698053 | -3.023737 | -1.667935 | -1.746525 | -1.359586 |
| 25% | -0.752055 | -0.879883 | -0.89786 | -0.677545 | -0.700821 | -0.812695 | -0.873506 | -0.733109 | -0.742277 | -0.888033 | -0.944957 | -0.699062 | -0.8915 | -0.811169 | -0.863555 | -0.888033 | -0.944957 |
| 50% | 0.12643 | 0.030985 | -0.626442 | -0.142877 | -0.042937 | -0.67364 | 0.091824 | -0.230795 | -0.217156 | -0.029541 | 0.143445 | 0.05208 | 0.001469 | -0.006599 | 0.055737 | -0.029541 | 0.143445 |
| 75% | 1.004915 | 0.827994 | 0.50108 | 0.560253 | 0.577151 | 1.336826 | 0.614185 | 0.44252 | 0.479378 | 0.828951 | 0.920875 | 0.803222 | 0.894438 | 0.797971 | 0.917574 | 0.828951 | 0.920875 |
| max | 1.590571 | 1.738862 | 2.4038 | 8.853391 | 15.483197 | 1.336826 | 7.76981 | 5.722164 | 5.353752 | 2.116689 | 1.698305 | 1.74215 | 1.700991 | 1.602541 | 1.721954 | 2.116689 | 1.698305 |
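
The test-set means and standard deviations are close to, but not exactly, 0 and 1, because the test set is scaled with the means and standard deviations learned from the training set. One way to keep the fit-on-train, transform-on-test pattern from being applied inconsistently is to wrap the scaler and a model in a `scikit-learn` pipeline. The sketch below is only an illustration: the synthetic data and the choice of `LinearRegression` are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption): two features on very different scales.
rng = np.random.default_rng(1066)
X = rng.normal(loc=[50.0, 1000.0], scale=[10.0, 300.0], size=(500, 2))
y = 0.5 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1066)

# fit() fits the scaler on X_train only and trains the model on the scaled data;
# predict() reuses the training means and standard deviations to scale X_test.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(preds.shape)                  # (100,)
print(model.score(X_test, y_test))  # R^2 on the held-out data
```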
