# Feature Transformation

## Feature Scaling

Definition: A technique for bringing the independent features of a dataset onto a common, fixed range of values.
Note: Apply feature scaling just before model building.
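
One way to keep scaling "just before model building" is to chain the scaler and the model in a single pipeline. The sketch below is only an illustration: it assumes scikit-learn's built-in wine dataset (`load_wine`) and a k-nearest-neighbours classifier, neither of which is used elsewhere in this notebook, and the variable names (`X_tr`, `X_te`, `model`) are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the built-in wine dataset stands in for real project data.
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is the step right before the model; calling fit() on the
# pipeline fits the scaler on the training data only.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)

# score() scales the test data with the training min/max, then predicts.
accuracy = model.score(X_te, y_te)
print(accuracy)
```

With a pipeline, the test set is always transformed using parameters learned from the training set, which avoids leaking test information into the scaler.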

## Normalization

Definition: A data preprocessing technique often used in machine learning. <br> The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information.

### Types:
1) Min-Max scaling (standard normalization) <br> 2) Mean normalization <br> 3) Max absolute scaling <br> 4) Robust scaling


### Min-Max scaling: x_new=(x_old-x_min)/(x_max-x_min), range [0,1], class-->MinMaxScaler
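
The min-max formula can be checked by hand against scikit-learn. This is a minimal sketch on a made-up toy matrix (`X` and the other names are assumptions for illustration, not part of the wine data used later).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 500.0]])

# Manual min-max scaling, column by column: (value - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation using scikit-learn's MinMaxScaler.
X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Each column ends up with a minimum of 0 and a maximum of 1, regardless of its original units.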

### Mean Normalization: x_new=(x_old-x_mean)/(x_max-x_min), range [-1,1]
This technique mean-centers the data, as standardization does, so it is rarely used; standardization is preferred instead. Scikit-learn has no separate class for it, so it has to be coded manually.
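
Since there is no ready-made class, a manual version could look like the sketch below. The function name `mean_normalize` and the toy data are hypothetical; the sketch also assumes no column is constant, since a constant column would make the denominator zero.

```python
import numpy as np

def mean_normalize(X):
    """Column-wise mean normalization: (x - mean) / (max - min)."""
    X = np.asarray(X, dtype=float)
    # Assumption: every column has max > min, otherwise this divides by zero.
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Hypothetical toy data with two features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [6.0, 60.0]])

X_norm = mean_normalize(X)
print(X_norm)
print(X_norm.mean(axis=0))  # each column is centered around 0
```

After the transformation every column has mean 0 and all values lie within [-1, 1].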

### Max absolute scaling: x_new=(x_old)/max(|x|), range [-1,1], class-->MaxAbsScaler
Mostly used for sparse data (data with many zeros), because it does not shift the values and so keeps zero entries at zero.
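
A short sketch of max absolute scaling on sparse input, assuming a small made-up SciPy CSR matrix (the data and variable names are illustrative only):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse matrix: most entries are zero.
X = sparse.csr_matrix(np.array([[0.0, -4.0, 0.0],
                                [2.0,  0.0, 0.0],
                                [0.0,  8.0, 5.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.max_abs_)     # per-column maximum absolute value
print(X_scaled.toarray())  # dense view, only for display
```

Because the data is only divided and never shifted, the zeros stay zero and the result remains a sparse matrix.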

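### Robust Scaling: x_new=(x_old-x_median)/IQR, class-->RobustScaler
IQR is the interquartile range (75th percentile minus 25th percentile). This scaling is robust to outliers, because the median and the IQR are barely affected by extreme values.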
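
The robust scaling formula can be reproduced by hand and compared with `RobustScaler`. The sketch assumes a made-up single-feature array containing one extreme value, and relies on the scaler's default quantile range of the 25th to 75th percentile.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one obvious outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # default: center on the median, scale by the IQR
x_scaled = scaler.fit_transform(x)

# Manual version of the same formula: (x - median) / IQR.
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
x_manual = (x - median) / (q3 - q1)

print(x_scaled.ravel())
print(np.allclose(x_scaled, x_manual))
```

The outlier is still far from the rest after scaling, but it does not influence the median or the IQR, so the ordinary values keep a sensible spread.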

# Topics covered:
1. How to normalize<br>
2. Effect of normalization on outliers<br>
3. Normalization vs standardization

In [1]:
#Importing dependencies/libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load only the first three columns (class label, alcohol, malic acid); the file has no header row
df1 = pd.read_csv("wine_data.csv", header=None, usecols=[0, 1, 2])

In [3]:
df1.columns=["Class label","Alcohol","Malic acid"]

# Min-Max Scaling 

In [4]:
# test-train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1.drop("Class label", axis=1), df1["Class label"], test_size=0.3, random_state=0)

In [5]:
#shapes:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((124, 2), (54, 2), (124,), (54,))

In [6]:
# Normalization:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler on the training set only, so that it learns each feature's min and max
scaler.fit(X_train)
# Transform the train and test sets using the parameters learned from the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
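
To make the fit-then-transform step concrete, here is a small sketch on made-up numbers (the `*_demo` names are hypothetical, chosen so they do not overwrite the notebook's variables). It shows which parameters the scaler learns and what happens to test values that fall outside the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature train and test sets.
X_train_demo = np.array([[11.0], [13.0], [15.0]])
X_test_demo = np.array([[10.0], [14.0], [16.0]])

# fit() stores the per-feature min and max seen in the training data.
demo_scaler = MinMaxScaler().fit(X_train_demo)
print(demo_scaler.data_min_, demo_scaler.data_max_)

# transform() reuses those training parameters on the test data.
X_test_demo_scaled = demo_scaler.transform(X_test_demo)
print(X_test_demo_scaled.ravel())
```

Test values below the training minimum or above the training maximum land outside [0, 1]. That is expected, because the scaler never sees the test set during fitting.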

In [7]:
# By default, the scaler's transform returns a NumPy array, not a DataFrame
X_train_scaled

array([[0.72043011, 0.20378151],
       [0.31989247, 0.08403361],
       [0.60215054, 0.71218487],
       [0.57258065, 0.56302521],
       [0.76075269, 0.1302521 ],
       [0.48924731, 0.5       ],
       [0.75537634, 0.67857143],
       [0.61021505, 0.17436975],
       [0.54301075, 0.62394958],
       [0.39784946, 0.07352941],
       [0.33870968, 0.1092437 ],
       [0.46774194, 0.53361345],
       [0.5188172 , 0.53781513],
       [0.70967742, 0.07563025],
       [0.57258065, 0.30882353],
       [0.36021505, 0.0105042 ],
       [0.38709677, 0.13235294],
       [0.20967742, 0.25840336],
       [0.59408602, 0.64915966],
       [0.82526882, 0.26680672],
       [0.15591398, 0.09663866],
       [0.52688172, 0.16386555],
       [0.46774194, 0.31512605],
       [0.65860215, 0.16386555],
       [0.1155914 , 0.5987395 ],
       [0.27956989, 0.26680672],
       [0.21236559, 0.12184874],
       [0.65053763, 0.59033613],
       [0.31451613, 0.44957983],
       [0.54301075, 0.17647059],
       [0.

In [8]:
# So we convert the arrays back into DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [9]:
#first 5 rows of X_train_scaled
X_train_scaled.head()

| | Alcohol | Malic acid |
|---|---|---|
| 0 | 0.72043 | 0.203782 |
| 1 | 0.319892 | 0.084034 |
| 2 | 0.602151 | 0.712185 |
| 3 | 0.572581 | 0.563025 |
| 4 | 0.760753 | 0.130252 |


In [10]:
# Statistical summary before scaling
X_train.describe()

| | Alcohol | Malic acid |
|---|---|---|
| count | 124.0 | 124.0 |
| mean | 12.983065 | 2.38371 |
| std | 0.80134 | 1.136696 |
| min | 11.03 | 0.89 |
| 25% | 12.3625 | 1.6075 |
| 50% | 13.04 | 1.885 |
| 75% | 13.64 | 3.2475 |
| max | 14.75 | 5.65 |


In [11]:
# Statistical summary after scaling
np.round(X_train_scaled.describe(), 1)

| | Alcohol | Malic acid |
|---|---|---|
| count | 124.0 | 124.0 |
| mean | 0.5 | 0.3 |
| std | 0.2 | 0.2 |
| min | 0.0 | 0.0 |
| 25% | 0.4 | 0.2 |
| 50% | 0.5 | 0.2 |
| 75% | 0.7 | 0.5 |
| max | 1.0 | 1.0 |


## Effect of Normalization on Outliers:

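Min-Max scaling squeezes every value into a fixed range, so outliers are squeezed along with the rest of the data, but they are not removed: an extreme value still defines the min or the max and pushes the remaining values into a narrow part of the range. Outliers therefore need to be handled separately.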
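
The effect is easy to see on a toy example. The sketch below uses made-up values (not the wine data) and compares Min-Max scaling with and without a single extreme value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values, evenly spread between 10 and 18.
values = np.array([[10.0], [12.0], [14.0], [16.0], [18.0]])
# The same values plus one outlier.
with_outlier = np.vstack([values, [[200.0]]])

scaled_clean = MinMaxScaler().fit_transform(values)
scaled_outlier = MinMaxScaler().fit_transform(with_outlier)

print(scaled_clean.ravel())    # spread evenly over [0, 1]
print(scaled_outlier.ravel())  # ordinary values crowd near 0, outlier sits at 1
```

With the outlier present, the five ordinary values all end up below 0.05, so most of the [0, 1] range is spent on a single point.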

## Normalization Vs Standardization
1. The choice depends on the type of data <br>2. Most problems are solved with standardization<br>3. Normalization (MinMaxScaler) is mostly used when the min and max values are already known, for example in CNNs (image processing), where 8-bit pixel values range from 0 to 255<br>4. When there are outliers, it is best to use robust scaling<br>5. When the data is sparse, try MaxAbs scaling<br>6. When in doubt, simply use standardization
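
As a rough side-by-side, the sketch below applies both techniques to the same synthetic feature. It assumes "standardization" means scikit-learn's `StandardScaler` (zero mean, unit variance); the random data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: 200 draws from a normal distribution.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

X_minmax = MinMaxScaler().fit_transform(X)
X_std = StandardScaler().fit_transform(X)

# Normalization fixes the range; standardization fixes the mean and spread.
print("min-max :", X_minmax.min(), X_minmax.max())
print("standard:", X_std.mean(), X_std.std())
```

Min-Max scaling guarantees the bounds 0 and 1 on the fitted data, while standardization guarantees mean 0 and standard deviation 1 but leaves the range unbounded.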
