# Data Standardization and Normalization


In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR','Legal','Marketing','Management']
})

In [10]:
df

| | Income | Age | Department |
|---|---|---|---|
| 0 | 15000 | 25 | HR |
| 1 | 1800 | 18 | Legal |
| 2 | 120000 | 42 | Marketing |
| 3 | 10000 | 51 | Management |


Before applying any feature transformation or scaling technique, we need to deal with the categorical column, Department, because non-numeric values cannot be scaled.

To do that, we first create a copy of our DataFrame and store the numerical feature names in a list, along with their values:


In [11]:
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
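
As an alternative to typing the column names by hand, the numeric columns can be picked up programmatically. The following is a minimal sketch, assuming the same DataFrame as above; the variable name `numeric_cols` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR', 'Legal', 'Marketing', 'Management']
})

# Select every numeric column automatically; Department (strings) is skipped
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['Income', 'Age']
```

This gives the same list as `col_names` here, and it keeps working if more numeric columns are added later.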

## Min-Max Scaler (Normalization)

The MinMax scaler is one of the simplest scalers to understand. By default, it rescales each feature to the range 0 to 1. The scaled value is calculated as:

                            x_scaled = (x - x_min) / (x_max - x_min)
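
To make the formula concrete, here is a small worked sketch that applies it by hand to the Income values from our DataFrame, using pandas only. The variable names are chosen here for illustration.

```python
import pandas as pd

income = pd.Series([15000, 1800, 120000, 10000], name='Income')

# x_scaled = (x - x_min) / (x_max - x_min)
income_scaled = (income - income.min()) / (income.max() - income.min())
print(income_scaled.round(6).tolist())
```

The smallest income (1800) maps to 0 and the largest (120000) maps to 1. For example, 15000 becomes (15000 - 1800) / (120000 - 1800) = 13200 / 118200, which is roughly 0.1117.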
                            
Note that this scaling is applied to every feature separately. Although (0, 1) is the default range, we can specify a different minimum and maximum. The MinMax scaler can be implemented as follows:

In [12]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [13]:
df_scaled[col_names] = scaler.fit_transform(features.values)

In [14]:
df_scaled

| | Income | Age | Department |
|---|---|---|---|
| 0 | 0.111675 | 0.212121 | HR |
| 1 | 0.0 | 0.0 | Legal |
| 2 | 1.0 | 0.727273 | Marketing |
| 3 | 0.069374 | 1.0 | Management |


In the scaled DataFrame, the minimum value in each column became 0 and the maximum became 1, with the other values falling in between. However, suppose we don't want Income or Age to take values like 0. Let us set the range to (5, 10) instead:

In [15]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(5, 10))

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

| | Income | Age | Department |
|---|---|---|---|
| 0 | 5.558376 | 6.060606 | HR |
| 1 | 5.0 | 5.0 | Legal |
| 2 | 10.0 | 8.636364 | Marketing |
| 3 | 5.34687 | 10.0 | Management |
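
A custom range can be understood as a two-step calculation: first scale each column to 0..1, then stretch and shift it into the target range. The sketch below reproduces that by hand with NumPy and compares it against `MinMaxScaler`; the names `low`, `high`, `X_unit`, and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income and Age columns from the DataFrame above
X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)
low, high = 5, 10

# Step 1: scale each column to the 0..1 range
X_unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch to the width of the target range and shift to its lower bound
X_manual = X_unit * (high - low) + low

X_sklearn = MinMaxScaler(feature_range=(low, high)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))  # True
```

For instance, the Age value 25 first becomes (25 - 18) / 33, roughly 0.2121, and then 0.2121 * 5 + 5, roughly 6.0606.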


## Standard Scaler

Just like the MinMax Scaler, the Standard Scaler is another popular scaler that is very easy to understand and implement.

For each feature, the Standard Scaler rescales the values so that the mean is 0 and the standard deviation is 1 (and hence the variance is also 1).

                             x_scaled = (x - mean) / std_dev
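
As a quick check of the formula, the sketch below standardizes the Income values by hand with NumPy. It assumes the population standard deviation (`ddof=0`), which is NumPy's default and the convention `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np

income = np.array([15000, 1800, 120000, 10000], dtype=float)

# x_scaled = (x - mean) / std_dev, with np.std defaulting to ddof=0
income_scaled = (income - income.mean()) / income.std()
print(income_scaled.round(6))
```

The mean income is 36700, so the first value becomes (15000 - 36700) divided by the standard deviation (about 48323.6), which is roughly -0.449.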
                             
However, the Standard Scaler is most appropriate when the variable is approximately normally distributed. If the variables are not normally distributed, we can:

<ul>
<li>either choose a different scaler, or</li>
<li>convert the variables to an approximately normal distribution and then apply this scaler.</li>
</ul>
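
The second option can be sketched as follows. This example assumes that a log transform (`np.log1p`) is a reasonable way to reduce the right skew in Income; that choice is an illustration, not something prescribed by this notebook, and other transforms (for example scikit-learn's `PowerTransformer`) may suit other data better.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

income = pd.DataFrame({'Income': [15000, 1800, 120000, 10000]})

# Assumption: log1p is used here only as one possible skew-reducing transform
income_log = np.log1p(income)

# Standardize the transformed values: mean 0, standard deviation 1
income_log_scaled = StandardScaler().fit_transform(income_log)
print(income_log_scaled.ravel())
```

With only four rows this is purely illustrative; on real data it is worth checking a histogram before and after the transform.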

Implementing the Standard Scaler is very similar to implementing the MinMax scaler. Just like before, we first import StandardScaler and then use it to transform our variables.

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled


| | Income | Age | Department |
|---|---|---|---|
| 0 | -0.449056 | -0.685248 | HR |
| 1 | -0.722214 | -1.218219 | Legal |
| 2 | 1.723796 | 0.60911 | Marketing |
| 3 | -0.552525 | 1.294358 | Management |


In [17]:
df_scaled.describe()

| | Income | Age |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 0.0 | -5.5511150000000004e-17 |
| std | 1.154701 | 1.154701 |
| min | -0.722214 | -1.218219 |
| 25% | -0.594947 | -0.818491 |
| 50% | -0.500791 | -0.03806935 |
| 75% | 0.094157 | 0.7804217 |
| max | 1.723796 | 1.294358 |
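
The `std` row shows 1.154701 rather than 1. This is most likely because `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). With 4 rows the two differ by a factor of sqrt(4/3). The sketch below checks this; the DataFrame is rebuilt here so the block is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Income': [15000, 1800, 120000, 10000],
                   'Age': [25, 18, 42, 51]})
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(scaled.std(ddof=0))  # population std: 1.0 for both columns
print(scaled.std())        # sample std (pandas default, ddof=1)
print(np.sqrt(4 / 3))      # expected ratio for n = 4, about 1.1547
```

The mean of -5.55e-17 for Age is floating-point rounding error and is effectively 0.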
