# Feature Scaling

Feature scaling in machine learning consists of transforming numeric columns to a common scale.

It is also known as **Data Normalization** and is generally performed during the data preprocessing step.

In many datasets, features span very different ranges: one feature may take values around 1 while another takes values in the hundreds or thousands. Without scaling, the features with larger values can dominate the learning process, especially in algorithms that rely on distances or gradients.
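To make the effect of mismatched scales concrete, here is a minimal sketch that computes the Euclidean distance between two wine samples using only `alcohol` and `proline`. The values are copied from the first two rows of the wine dataset loaded below; the choice of these two features and of Euclidean distance is an assumption made for illustration.

```python
import math

# (alcohol, proline) for rows 0 and 1 of the wine dataset.
row_0 = {"alcohol": 14.23, "proline": 1065.0}
row_1 = {"alcohol": 13.20, "proline": 1050.0}

# Squared difference contributed by each feature.
squared_diffs = {name: (row_0[name] - row_1[name]) ** 2 for name in row_0}

# Euclidean distance and the fraction of it that comes from proline.
distance = math.sqrt(sum(squared_diffs.values()))
proline_share = squared_diffs["proline"] / sum(squared_diffs.values())

print(f"Euclidean distance: {distance:.3f}")
print(f"Share of squared distance from proline: {proline_share:.1%}")
```

Because `proline` is measured in the hundreds to thousands, it accounts for more than 99% of the squared distance here, and the difference in `alcohol` is almost invisible to a distance-based model.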

In [152]:
import pandas as pd
from sklearn.datasets import load_wine

In [153]:
# Load the wine dataset as a Bunch that wraps pandas objects
wine = load_wine(as_frame=True)

In [154]:
# Extract the full DataFrame (features plus target)
df = wine['frame']
df

index,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0,2
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0,2
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0,2
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0,2


In [155]:
# Keep the target and two features to demonstrate scaling
new_df = df[["target", "alcohol", "malic_acid"]]
new_df

index,target,alcohol,malic_acid
0,0,14.23,1.71
1,0,13.20,1.78
2,0,13.16,2.36
3,0,14.37,1.95
4,0,13.24,2.59
...,...,...,...
173,2,13.71,5.65
174,2,13.40,3.91
175,2,13.27,4.28
176,2,13.17,2.59


## Min-Max normalization

This technique scales the values of a feature to a range **between 0 and 1**. This is done by subtracting the minimum value of the feature from each value, and then dividing by the range of the feature.

The formula for min-max normalization is:

$$
    x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}
$$

where:

$ x_{norm} $ is the normalized form of $ x $

$ x $ is the feature vector from the original data set

$ \min(x) $ and $ \max(x) $ are the minimum and maximum values of $ x $, respectively.
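The formula can be applied by hand before reaching for a library. The following is a minimal sketch in plain Python; the helper name `min_max_scale` and the handling of a constant feature are assumptions made for illustration, not part of scikit-learn.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has zero range; map everything to 0.0
        # (an assumption made here to avoid division by zero).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# First five alcohol values from the table above.
alcohol_sample = [14.23, 13.20, 13.16, 14.37, 13.24]
scaled = min_max_scale(alcohol_sample)
print([round(v, 4) for v in scaled])
```

The smallest value in the sample maps to 0 and the largest to 1. These numbers will not match a scaler fitted on the whole column, because the minimum and maximum here come from five rows only.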

In [156]:
from sklearn.preprocessing import MinMaxScaler

In [157]:
scaling = MinMaxScaler()

In [158]:
scaling.fit_transform(new_df[['alcohol', 'malic_acid']])

array([[0.84210526, 0.1916996 ],
       [0.57105263, 0.2055336 ],
       [0.56052632, 0.3201581 ],
       [0.87894737, 0.23913043],
       [0.58157895, 0.36561265],
       [0.83421053, 0.20158103],
       [0.88421053, 0.22332016],
       [0.79736842, 0.27865613],
       [1.        , 0.17786561],
       [0.74473684, 0.12055336],
       [0.80789474, 0.28063241],
       [0.81315789, 0.14624506],
       [0.71578947, 0.19565217],
       [0.97894737, 0.19565217],
       [0.88157895, 0.22332016],
       [0.68421053, 0.21146245],
       [0.86052632, 0.23320158],
       [0.73684211, 0.16403162],
       [0.83157895, 0.16798419],
       [0.68684211, 0.46640316],
       [0.79736842, 0.17588933],
       [0.5       , 0.60474308],
       [0.70526316, 0.22134387],
       [0.47894737, 0.16996047],
       [0.65      , 0.21146245],
       [0.53157895, 0.25889328],
       [0.62105263, 0.20355731],
       [0.59736842, 0.19367589],
       [0.74736842, 0.22924901],
       [0.78684211, 0.18577075],
       [0.

## Z-Score normalization

Also known as **standardization**, this technique scales the values of a feature to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean of the feature from each value, and then dividing by the standard deviation.

The formula for Z-score normalization is:

$$
    Z = \frac{x - \text{mean}}{\text{sd}}
$$

where:

$ Z $ is the normalized form of $ x $

$ x $ is the feature vector from the original data set

$ \text{mean} $ is the mean of the feature vector $ x $

$ \text{sd} $ is the standard deviation of $ x $
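As with min-max normalization, the Z-score formula can be sketched in plain Python. The helper name `z_score` is hypothetical. The sketch uses the population standard deviation (`statistics.pstdev`), which, to the best of my knowledge, is also what scikit-learn's `StandardScaler` uses; treat that correspondence as an assumption to verify against the library documentation.

```python
import statistics


def z_score(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        # A constant feature has no spread; return zeros
        # (an assumption made here to avoid division by zero).
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# First five malic_acid values from the table above.
malic_acid_sample = [1.71, 1.78, 2.36, 1.95, 2.59]
standardized = z_score(malic_acid_sample)

print([round(v, 4) for v in standardized])
print("mean:", round(statistics.mean(standardized), 6))
print("sd:", round(statistics.pstdev(standardized), 6))
```

After the transformation the sample has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point error. As before, the values differ from a scaler fitted on the full column, since the mean and standard deviation are computed from five rows only.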

In [159]:
from sklearn.preprocessing import StandardScaler

In [160]:
scaling = StandardScaler()

In [161]:
scaling.fit_transform(new_df[['alcohol', 'malic_acid']])

array([[ 1.51861254, -0.5622498 ],
       [ 0.24628963, -0.49941338],
       [ 0.19687903,  0.02123125],
       [ 1.69154964, -0.34681064],
       [ 0.29570023,  0.22769377],
       [ 1.48155459, -0.51736664],
       [ 1.71625494, -0.4186237 ],
       [ 1.3086175 , -0.16727801],
       [ 2.25977152, -0.62508622],
       [ 1.0615645 , -0.88540853],
       [ 1.3580281 , -0.15830138],
       [ 1.38273339, -0.76871232],
       [ 0.92568536, -0.54429654],
       [ 2.16095032, -0.54429654],
       [ 1.70390229, -0.4186237 ],
       [ 0.77745356, -0.47248348],
       [ 1.60508109, -0.37374054],
       [ 1.02450655, -0.68792264],
       [ 1.46920194, -0.66996938],
       [ 0.78980621,  0.68550197],
       [ 1.3086175 , -0.63406285],
       [-0.08723191,  1.31386618],
       [ 0.87627476, -0.42760033],
       [-0.18605311, -0.66099274],
       [ 0.61686912, -0.47248348],
       [ 0.06099988, -0.25704433],
       [ 0.48098997, -0.50839001],
       [ 0.36981612, -0.55327317],
       [ 1.07391715,