# Standardization & Normalization

Feature transformation is an essential step in most data science projects, and it can significantly improve the performance of a model.

For example, Linear Regression is often fitted with Gradient Descent, which searches for the coefficients (slopes/parameters) that minimize the loss, ideally reaching the global minimum. When the features are on a similar scale, the descent toward that minimum is smoother and faster.
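The effect of scale on Gradient Descent can be illustrated with the condition number of the Gram matrix `X^T X`, which governs the curvature of a least-squares loss: the larger it is, the more elongated the loss surface and the smaller the steps Gradient Descent can safely take. The sketch below uses made-up `age` and `income` columns (hypothetical names and values, not from the dataset used later) and compares the condition number before and after standardizing.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)               # values in the tens
income = rng.uniform(20_000, 100_000, size=200)   # values in the tens of thousands
X = np.column_stack([age, income])


def gram_condition_number(X):
    """Condition number of the Gram matrix of the mean-centered features."""
    Xc = X - X.mean(axis=0)
    return np.linalg.cond(Xc.T @ Xc)


# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = gram_condition_number(X)
cond_std = gram_condition_number(X_std)
print(f"raw: {cond_raw:.1f}, standardized: {cond_std:.2f}")
```

A condition number close to 1 means the loss surface is nearly round, so a single learning rate suits every coefficient.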

KNN relies on Euclidean distance to find the nearest points, for either classification or regression problems. K-Means and Hierarchical Clustering also use Euclidean distance.
For instance, if a dataset consists of 2 features as below:

|#|Feature 1|Feature 2|
|---|----|----|
|0. |24|56|
|1. |21|100|
|2. |45|67|

Where (x1,y1) = (24,56)
and (x2,y2) = (21,100)

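Each row can be treated as a point (a vector) in feature space, so the distance between two rows depends on the magnitude of every feature.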
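To see why feature magnitude matters for distance-based methods, the Euclidean distance between the two points above can be worked out by hand. This is a small sketch using only the values from the table; the variable names are illustrative.

```python
import numpy as np

# Rows 0 and 1 of the table above.
p1 = np.array([24, 56])
p2 = np.array([21, 100])

# Contribution of each feature to the squared distance.
squared_diffs = (p1 - p2) ** 2
distance = np.sqrt(squared_diffs.sum())

print(squared_diffs)
print(distance)
```

Feature 2 contributes 1936 of the 1945 units of squared distance, so it decides almost entirely which points count as "near".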

Scaling puts the features on a comparable range, which makes distance calculations fairer and optimization faster.
Examples of such scaling include MinMaxScaler and StandardScaler.

On the other hand, for Tree/Ensemble based techniques, transformation is not necessarily required, because tree splits depend only on the ordering of the values within a feature.

For Deep Learning techniques such as ANN, CNN, or RNN, scaling (standardization or normalization) is also generally needed.

---
### Types of Transformation
---

1. Normalization and Standardization
2. Scaling to Minimum and Maximum Values
3. Scaling to Median and Quantiles
4. Gaussian Transformation 
    * a. Logarithmic Transformation
    * b. Reciprocal Transformation
    * c. Square Root Transformation
    * d. Exponential Transformation
    * e. Box Cox Transformation

In [None]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input/titanicdataset-traincsv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv('../input/titanicdataset-traincsv/train.csv', usecols=['Pclass', 'Age', 'Fare', 'Survived'])
df.head()

#### 1. Standardization

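Standardization brings all features to a similar scale by centering each variable at zero and scaling it to unit variance.

z = (x - mean(x)) / std(x)

where mean(x) and std(x) are the mean and standard deviation of the feature.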
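As a quick check of the formula, here is a minimal NumPy sketch that standardizes the three Feature 2 values from the earlier table. It uses the population standard deviation (`ddof=0`), which is NumPy's default and also what scikit-learn's `StandardScaler` uses; pandas' `Series.std()` defaults to `ddof=1` and would give slightly different numbers.

```python
import numpy as np

x = np.array([56.0, 100.0, 67.0])  # Feature 2 from the earlier table

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```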


In [None]:
df.isnull().sum()

In [None]:
# Median imputation is used here only to keep the example simple;
# it is not good practice in general.

df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()

In [None]:
#Performing standardization
#Using StandardScaler from Scikit-Learn

from sklearn.preprocessing import StandardScaler
scalar = StandardScaler()

|fit|fit_transform|
|----|----|
|Learns the scaling parameters (e.g. mean and standard deviation) from the data without changing it|Learns the parameters and returns the transformed data in a single step|
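The distinction matters most when there is held-out data: the statistics should be learned from the training rows only and then reused on unseen rows. The class below is a hypothetical, simplified stand-in written for illustration (`SimpleStandardScaler` is not part of scikit-learn), and `X_test` is a made-up row; it only mirrors the `fit` / `transform` / `fit_transform` split.

```python
import numpy as np


class SimpleStandardScaler:
    """Hypothetical, simplified stand-in for a standard scaler (illustration only)."""

    def fit(self, X):
        # Learn per-column statistics; the data itself is not changed.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Apply the statistics learned in fit().
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        # Convenience: fit and transform the same data in one call.
        return self.fit(X).transform(X)


X_train = np.array([[24.0, 56.0], [21.0, 100.0], [45.0, 67.0]])  # table from earlier
X_test = np.array([[30.0, 80.0]])                                # made-up unseen row

demo_scaler = SimpleStandardScaler()
X_train_scaled = demo_scaler.fit_transform(X_train)  # statistics come from training data
X_test_scaled = demo_scaler.transform(X_test)        # reuse them; do not refit on test data
print(X_test_scaled)
```

Calling `fit_transform` on the test rows instead would rescale them with their own statistics, so train and test values would no longer be comparable.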


In [None]:
# Note: df still includes the target 'Survived', which is scaled here as well
df_scaled = scalar.fit_transform(df)
df_scaled

In [None]:
pd.DataFrame(df_scaled, columns=df.columns)

The transformation is applied to each column independently.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.hist(df['Pclass'],bins=20)

In [None]:
plt.hist(df_scaled[:,1],bins=20)

In [None]:
plt.hist(df['Age'],bins=20)

In [None]:
plt.hist(df_scaled[:,2],bins=20)

In [None]:
plt.hist(df['Fare'],bins=20)

In [None]:
plt.hist(df_scaled[:,3],bins=20)

#### 2. MinMax Scaling
The aim of this scaling is to bring the values into the range 0 to 1.
It works well with DL techniques such as CNN.

X_scaled = (X - X.min) / (X.max - X.min)
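Applied by hand to the three Feature 2 values from the earlier table, the formula gives the following. This is a minimal NumPy sketch, independent of `MinMaxScaler`.

```python
import numpy as np

x = np.array([56.0, 100.0, 67.0])  # Feature 2 from the earlier table

# The minimum maps to 0, the maximum to 1, everything else in between.
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # values: 0.0, 1.0 and 0.25
```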

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()

In [None]:
df_min_max = pd.DataFrame(min_max.fit_transform(df), columns=df.columns)
df_min_max

In [None]:
plt.hist(df['Pclass'],bins=20)

In [None]:
plt.hist(df_min_max['Pclass'],bins=20)

In [None]:
plt.hist(df['Age'],bins=20)

In [None]:
plt.hist(df_min_max['Age'],bins=20)

In [None]:
plt.hist(df['Fare'],bins=20)

In [None]:
plt.hist(df_min_max['Fare'],bins=20)

#### 3. Robust Scaler
Scales features using the median and the interquartile range (IQR).

Best suited to data with outliers, since the median and IQR are barely affected by extreme values.

IQR = Q3 - Q1

X_scaled = (X - X.median) / IQR
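The following sketch applies the formula to a small made-up sample containing one outlier and compares it with min-max scaling. Quartiles are taken with `np.percentile` and its default linear interpolation; the data and variable names are assumptions for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values; 100 is an outlier

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

x_robust = (x - median) / iqr
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_robust)
print(x_minmax)
```

With robust scaling the four ordinary values stay spread between -1 and 0.5, while min-max scaling squeezes them all below 0.05 because the outlier sets the range.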

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

In [None]:
df_robust_scaler = pd.DataFrame(scaler.fit_transform(df),columns = df.columns)
df_robust_scaler

In [None]:
plt.hist(df['Pclass'],bins=20)

In [None]:
plt.hist(df_robust_scaler['Pclass'],bins=20)

In [None]:
plt.hist(df['Age'],bins=20)

In [None]:
plt.hist(df_robust_scaler['Age'],bins=20)

In [None]:
plt.hist(df['Fare'],bins=20)

In [None]:
plt.hist(df_robust_scaler['Fare'],bins=20)

#### 4. Gaussian Transformation 

The main purpose of this type of transformation is that some ML algorithms, such as Linear and Logistic Regression, tend to perform better when the features are approximately normally distributed. Hence, the following techniques are applied to make the features more normally (Gaussian) distributed.

* a. Logarithmic Transformation
* b. Reciprocal Transformation
* c. Square Root Transformation
* d. Exponential Transformation
* e. Box Cox Transformation
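One rough way to compare these transformations is to look at skewness, which is 0 for a perfectly symmetric distribution. The sketch below uses a made-up, strictly positive, right-skewed sample (log-normal) rather than the Titanic columns, and a hand-written `skewness` helper; Box-Cox is left out because it needs a fitted λ.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()


# Made-up, strictly positive, right-skewed sample.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

candidates = {
    "raw": x,
    "log": np.log(x),
    "reciprocal": 1 / x,
    "square root": np.sqrt(x),
    "power 1/1.2": x ** (1 / 1.2),
}
for name, values in candidates.items():
    print(f"{name:12s} skewness = {skewness(values):+.2f}")
```

Which transformation works best depends on the shape of the feature, so it is worth checking each candidate rather than assuming one always wins.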

In [None]:
df = df.drop(columns='Pclass')
df.head()

Q-Q plots are used to check whether a distribution is normal (Gaussian).

In [None]:
import scipy.stats as stat
import pylab

In [None]:
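def plot_data(df, feature):
    """Plot a histogram and a normal Q-Q plot of one feature side by side."""
    plt.figure(figsize=(10, 6))
    # 1 row, 2 columns, 1st panel: histogram
    plt.subplot(1, 2, 1)
    df[feature].hist()
    # 1 row, 2 columns, 2nd panel: Q-Q plot against the normal distribution
    plt.subplot(1, 2, 2)
    stat.probplot(df[feature], dist='norm', plot=pylab)
    plt.show()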

--- 

Age
---
---

In [None]:
plot_data(df, 'Age')

If the plotted points fall along the red line, the feature is approximately normally distributed.
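The visual check can be backed by a number: the correlation between the sorted sample and the matching normal quantiles, which is what the two axes of a Q-Q plot contain. The sketch below is an assumption-laden illustration, not SciPy's implementation: the helper names are hypothetical, the samples are made up, and the plotting positions `(i - 0.5) / n` differ slightly from those used by `probplot`.

```python
import numpy as np
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical normal quantiles, sorted sample values)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = len(ordered)
    # Plotting positions strictly between 0 and 1.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, ordered


def qq_correlation(sample):
    """Correlation between the two Q-Q axes; 1.0 means a perfectly straight line."""
    theoretical, ordered = qq_points(sample)
    return np.corrcoef(theoretical, ordered)[0, 1]


rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=30, scale=10, size=500)   # made-up data
skewed_sample = rng.exponential(scale=30, size=500)      # made-up data

print(qq_correlation(normal_sample))
print(qq_correlation(skewed_sample))
```

The closer the value is to 1, the more closely the points follow a straight line.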

#### a. Logarithmic Transformation

In [None]:
df['Age_log'] = np.log(df['Age'])
plot_data(df,'Age_log')

#### b. Reciprocal Transformation

In [None]:
df['Age_reciprocal'] = 1/df['Age']
plot_data(df,'Age_reciprocal')

#### c. Square Root Transformation


In [None]:
df['Age_sqrt'] = df.Age**(1/2)
plot_data(df,'Age_sqrt')

#### d. Exponential Transformation

In [None]:
# Here the feature is raised to a fractional power (1/1.2); exp() is not applied
df['Age_exp'] = df.Age**(1/1.2)
plot_data(df, 'Age_exp')

#### e. Box Cox Transformation

The Box-Cox transformation is defined as:

T(Y) = (Y^λ − 1) / λ for λ ≠ 0, and T(Y) = ln(Y) for λ = 0

where Y is the response variable, which must be strictly positive, and λ is the transformation parameter. λ is typically searched over the range -5 to 5, and the value that brings the transformed variable closest to a normal distribution is selected.
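For a fixed λ the Box-Cox transformation is a one-liner, with the natural logarithm as the limiting case at λ = 0. The helper below is a minimal sketch with a hypothetical name (`box_cox`) and made-up input values; it does not search for the best λ, which `scipy.stats.boxcox` does by maximum likelihood when no λ is passed.

```python
import numpy as np


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)  # limiting case as lambda approaches 0
    return (y ** lam - 1) / lam


y = np.array([1.0, 4.0, 9.0])  # made-up values
print(box_cox(y, 0.5))  # equals 2 * (sqrt(y) - 1)
print(box_cox(y, 0))    # equals np.log(y)
```

Special cases help build intuition: λ = 1 only shifts the data, λ = 0.5 behaves like a square root, and λ = -1 behaves like a reciprocal.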

In [None]:
# boxcox returns the transformed values and the fitted lambda (stored in `parameters`)
df['AgeBcox'],parameters = stat.boxcox(df['Age'])

In [None]:
print(parameters)

In [None]:
plot_data(df, 'AgeBcox')

--- 

Fare
---
---

In [None]:
plot_data(df,'Fare')

In [None]:
# log1p computes log(1 + x), which is safe for the zero fares in this column
df['Fare_log'] = np.log1p(df['Fare'])
plot_data(df,'Fare_log')

In [None]:
df['Fare_exp'] = df.Fare**(1/1.2)
plot_data(df, 'Fare_exp')

In [None]:
df['Fare_sqrt'] = df.Fare**(1/2)
plot_data(df,'Fare_sqrt')

In [None]:
# Box-Cox needs strictly positive input, so shift Fare by 1 to handle zero fares
df['FareBcox'],parameters = stat.boxcox(df['Fare']+1)
plot_data(df,'FareBcox')