# MinMax Scaler Normalization

This notebook explains how to use the MinMax scaler from `scikit-learn`. The scaler uses only the minimum and maximum values of each feature to rescale that feature to a value between **0** and **1**.
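
As a minimal sketch of the arithmetic, and not scikit-learn's own implementation, each value `x` of a feature is mapped to `(x - min) / (max - min)`. The helper name `min_max_scale` and the sample numbers below are hypothetical and exist only for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature has no spread; this sketch maps every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up distances, for illustration only.
distances = [100, 350, 600, 1100]
scaled = min_max_scale(distances)
print(scaled)
```

The smallest value becomes 0, the largest becomes 1, and everything else lands proportionally in between.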

This notebook will use data for flights in and out of NYC in 2013.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [numpy](https://numpy.org/doc/stable/)
* [scikit-learn](https://scikit-learn.org/stable/)
    * [sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
    * [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [category_encoders](https://contrib.scikit-learn.org/category_encoders/)

In [1]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import category_encoders as ce

## Reading the data

The data is from `rdatasets` imported using the Python package `statsmodels`.

In [2]:
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute    

## Feature Engineering

### Handle null values

In [3]:
df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

Because this model will predict arrival delay, rows with `Null` values, which come from flights that were cancelled or diverted, can be excluded from this analysis.

In [4]:
df.dropna(inplace=True)

### Convert the times from floats or ints to hours and minutes

In [5]:
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
# Treat the flight number as a categorical label rather than a quantity
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
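
The times are stored in `HHMM` form, so the hour is the value divided by 100 (rounded down) and the minute is the remainder. As a hedged alternative to the `apply` calls above, the same split can be sketched with vectorized integer division and modulo. The `times` series below is made-up sample data, and the sketch assumes null values have already been dropped, since `astype(int)` fails on missing values.

```python
import pandas as pd

# Made-up clock times in HHMM form, stored as floats like dep_time and arr_time.
times = pd.Series([517.0, 1004.0, 2359.0, 5.0])

hours = (times // 100).astype(int)    # 517.0 -> 5
minutes = (times % 100).astype(int)   # 517.0 -> 17

print(pd.DataFrame({'hour': hours, 'minute': minutes}))
```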

## Prepare data for modeling

### Set up train-test split

In [6]:
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes

month                 int64
day                   int64
carrier              object
flight               object
tailnum              object
origin               object
dest                 object
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object

### Encode categorical variables

We convert the categorical features to numerical ones with the leave-one-out encoder from `category_encoders`. This leaves a single numeric feature in place of each categorical feature, which is needed so that the scaler can be applied to all features in the training data.

In [7]:
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
X_train_loo.shape

(261876, 17)
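
To make the idea behind leave-one-out encoding concrete, here is a rough sketch in plain pandas. It is not the `category_encoders` implementation, and the library's exact behavior may differ in details. Each training row gets the mean target of the *other* rows in its category, and rows without a known target get the plain category mean. The data is made up, and edge cases such as a category that occurs only once (where the division below would be by zero) are ignored.

```python
import pandas as pd

# Made-up training rows: one categorical feature and a numeric target.
train = pd.DataFrame({
    'carrier': ['AA', 'AA', 'AA', 'UA', 'UA'],
    'arr_delay': [10.0, 20.0, 30.0, 0.0, 8.0],
})

grouped = train.groupby('carrier')['arr_delay']
cat_sum = grouped.transform('sum')      # per-row sum of the target within the category
cat_count = grouped.transform('count')  # per-row size of the category
category_means = grouped.mean()         # one mean per category

# Training rows: mean of the target over the other rows in the same category.
train['carrier_loo'] = (cat_sum - train['arr_delay']) / (cat_count - 1)

# Rows with no target to leave out (e.g. test data): use the category mean.
test_carrier = pd.Series(['AA', 'UA'])
test_loo = test_carrier.map(category_means)

print(train)
print(test_loo)
```

Leaving the row's own target out of its encoding is meant to reduce the leakage of the target into the feature during training.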

We fit the MinMax scaler from `scikit-learn` on the encoded training data and transform it.

In [8]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_loo, y_train)
X_train_scaled.shape

(261876, 17)

In [9]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled_df.describe()

| | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 | 261876.0 |
| mean | 0.506204 | 0.490929 | 0.580795 | 0.272266 | 0.246643 | 0.380546 | 0.376517 | 0.193474 | 0.197354 | 0.452091 | 0.444616 | 0.613444 | 0.499568 | 0.6536 | 0.492032 | 0.452091 | 0.444616 |
| std | 0.310452 | 0.292762 | 0.174062 | 0.046259 | 0.031008 | 0.463161 | 0.078325 | 0.138618 | 0.149923 | 0.258852 | 0.327023 | 0.221885 | 0.294201 | 0.216157 | 0.294995 | 0.258852 | 0.327023 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.272727 | 0.233333 | 0.424413 | 0.240682 | 0.225012 | 0.004215 | 0.308101 | 0.091852 | 0.087497 | 0.222222 | 0.135593 | 0.458333 | 0.237288 | 0.478261 | 0.237288 | 0.222222 | 0.135593 |
| 50% | 0.545455 | 0.5 | 0.60062 | 0.265645 | 0.245212 | 0.068588 | 0.383751 | 0.161481 | 0.164797 | 0.444444 | 0.491525 | 0.625 | 0.491525 | 0.652174 | 0.508475 | 0.444444 | 0.491525 |
| 75% | 0.818182 | 0.733333 | 0.668046 | 0.298281 | 0.264694 | 0.999709 | 0.424678 | 0.253333 | 0.266979 | 0.666667 | 0.745763 | 0.791667 | 0.762712 | 0.826087 | 0.745763 | 0.666667 | 0.745763 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


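Scale the test set with the scaler fitted on the training data. The result can now be passed to the `predict` method of a trained model (or `predict_proba` for a classifier). Because the minimum and maximum come from the training set, the test features are not guaranteed to span exactly 0 to 1.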

In [10]:
X_test_scaled = scaler.transform(X_test_loo)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_train.columns)
X_test_scaled_df.describe()

| | month | day | carrier | flight | tailnum | origin | dest | air_time | distance | dep_hour | dep_minute | arr_hour | arr_minute | sched_arr_hour | sched_arr_minute | sched_dep_hour | sched_dep_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 | 65470.0 |
| mean | 0.504639 | 0.493089 | 0.580139 | 0.272346 | 0.246657 | 0.379121 | 0.376297 | 0.194153 | 0.198115 | 0.453027 | 0.444768 | 0.613809 | 0.498919 | 0.654606 | 0.49331 | 0.453027 | 0.444768 |
| std | 0.309755 | 0.291844 | 0.17415 | 0.045949 | 0.030593 | 0.462851 | 0.079164 | 0.139513 | 0.150775 | 0.259608 | 0.327156 | 0.222513 | 0.294129 | 0.216299 | 0.296554 | 0.259608 | 0.327156 |
| min | 0.0 | 0.0 | 0.048532 | 0.068182 | 0.092297 | 0.004138 | 0.028815 | 0.001481 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.272727 | 0.233333 | 0.424512 | 0.240923 | 0.224912 | 0.004138 | 0.308099 | 0.091852 | 0.08607 | 0.222222 | 0.135593 | 0.458333 | 0.237288 | 0.478261 | 0.237288 | 0.222222 | 0.135593 |
| 50% | 0.545455 | 0.5 | 0.471756 | 0.265657 | 0.245312 | 0.068543 | 0.383709 | 0.161481 | 0.164797 | 0.444444 | 0.491525 | 0.625 | 0.5 | 0.652174 | 0.508475 | 0.444444 | 0.491525 |
| 75% | 0.818182 | 0.733333 | 0.668014 | 0.298183 | 0.264539 | 0.999711 | 0.424623 | 0.254815 | 0.269223 | 0.666667 | 0.745763 | 0.791667 | 0.762712 | 0.826087 | 0.762712 | 0.666667 | 0.745763 |
| max | 1.0 | 1.0 | 0.999205 | 0.681818 | 0.726744 | 0.999711 | 0.985088 | 0.986667 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
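
Several test columns above do not reach exactly 0 or 1. That is expected: `transform` reuses the minimum and maximum learned from the training data instead of recomputing them. The toy sketch below, with made-up numbers, shows the same effect and also that a test value outside the training range lands outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data, for illustration only.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [20.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=10 and max=30 from the training data only

print(scaler.data_min_, scaler.data_max_)

# 5 is below the training minimum and 40 is above the training maximum,
# so their scaled values fall below 0 and above 1 respectively.
test_scaled = scaler.transform(test)
print(test_scaled.ravel())
```

If out-of-range values are a problem for a downstream model, recent scikit-learn versions accept `MinMaxScaler(clip=True)`, which clips transformed values to the target range. Whether clipping is appropriate depends on the model and is not something this notebook evaluates.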
