Feature scaling transforms numerical features so that they share a comparable, typically small, range of values.

In [1]:
# ------------------------------------------------------------
# Feature Scaling Explanation:
# ------------------------------------------------------------
# When features have different ranges (e.g., one feature: 1-100, another: 5-300),
# scaling keeps the larger-ranged feature from dominating distance- or gradient-based ML algorithms.
#
# Normalization (Min-Max scaling):
# - Scales features to a range like 0-1 or -1 to 1.
# - Preferred when data is NOT Gaussian.
# - Useful for algorithms like Neural Networks and KNN, which don't assume a specific data distribution.
#
# Standardization (Z-score scaling):
# - Scales features to have mean = 0 and standard deviation = 1.
# - Preferred when data has a Gaussian (normal) distribution.
#
# Robust Scaling:
# - Uses median and interquartile range (IQR) instead of mean/std.
# - Less sensitive to outliers.
# - Good choice when data contains extreme values or outliers.
#
# Rule of thumb:
# - Unknown distribution → start with Normalization.
# - Neural Networks / KNN → Normalization works well.
# - Outliers present → Robust Scaler is preferable.
# ------------------------------------------------------------
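To make the rule of thumb about outliers concrete, here is a minimal sketch that fits all three scalers on a hypothetical toy column containing one extreme value. The array `x` and the `spreads` dictionary are illustrative names, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical toy column: five ordinary values plus one extreme outlier.
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [1000.0]])

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    # How much of the scaled range is left for the five ordinary values?
    inliers = scaled[:5]
    spreads[type(scaler).__name__] = inliers.max() - inliers.min()

for name, spread in spreads.items():
    print(f"{name:>14}: spread of ordinary values = {spread:.3f}")
```

On this toy column, min-max scaling squeezes the ordinary values into a narrow band near 0 because the outlier defines the maximum, while the robust scaler leaves them well separated.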


# 1. Normalization

Normalization is a scaling technique that transforms a numerical feature to the range of values between 0 and 1, using the feature's minimum $X_{min}$ and maximum $X_{max}$.

$$
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$
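The formula can be applied by hand with NumPy before reaching for a library. This is a minimal sketch on a hypothetical toy matrix (the values in `X` are made up for illustration); it assumes no column is constant, since a constant column would make the denominator zero.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [5.0, 600.0]])

# Column-wise minimum and maximum (one value per feature).
X_min = X.min(axis=0)
X_max = X.max(axis=0)

# Min-max normalization: every column now spans exactly [0, 1].
X_norm = (X - X_min) / (X_max - X_min)
print(X_norm)
```

Both columns end up on the same 0 to 1 scale even though their original ranges differ by two orders of magnitude.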

In [2]:
import numpy as np
import pandas as pd
from seaborn import load_dataset

tip_data = load_dataset('tips')
tip_data.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Select all numerical features from the data above.

In [3]:
num_feats = tip_data[['total_bill', 'tip', 'size']]

In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(num_feats)

The output of the scaler is a NumPy array. I can convert it back to a Pandas DataFrame.

In [5]:
df = pd.DataFrame(scaled_data, columns=num_feats.columns)
df.head()

,total_bill,tip,size
0,0.291579,0.001111,0.2
1,0.152283,0.073333,0.4
2,0.375786,0.277778,0.4
3,0.431713,0.256667,0.2
4,0.450775,0.29,0.6
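The opening notes mention that normalization can also target a range such as -1 to 1. With scikit-learn this can be done through the `feature_range` parameter of `MinMaxScaler`. The sketch below uses a hypothetical single-column array rather than the tips data so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with four samples.
X = np.array([[1.0], [2.0], [3.0], [5.0]])

# Scale to [-1, 1] instead of the default [0, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())

# The fitted scaler can map scaled values back to the original units.
X_restored = scaler.inverse_transform(X_scaled)
print(X_restored.ravel())
```

Keeping the fitted `scaler` object around is useful because the same minimum and maximum can then be reused on new data with `scaler.transform`.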


# 2. Standardization

$$
X_{std} = \frac{X - \mu}{\sigma}
$$

$X_{std}$ is the standardized feature, $X$ is the original feature, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.
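As a sanity check, the standardization formula can be computed directly with NumPy. This is a minimal sketch on a hypothetical toy matrix; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses, whereas pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [6.0, 700.0]])

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature population standard deviation (ddof=0)

# Z-score standardization: subtract the mean, divide by the standard deviation.
X_std = (X - mu) / sigma

print("Mean:", np.round(X_std.mean(axis=0), 2))
print("Std: ", np.round(X_std.std(axis=0), 2))
```

Unlike min-max normalization, the result is not bounded to a fixed interval; values simply express how many standard deviations each sample lies from the feature mean.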

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_feats)

In [7]:
# Quick check: mean ~0, std ~1
print("Mean of scaled data:", np.round(scaled_data.mean(axis=0), 2))
print("Std of scaled data:", np.round(scaled_data.std(axis=0), 2))

Mean of scaled data: [-0.  0. -0.]
Std of scaled data: [1. 1. 1.]


In [8]:
# Converting back to DataFrame
scaled_data_df = pd.DataFrame(scaled_data, columns=num_feats.columns)
scaled_data_df.head()

,total_bill,tip,size
0,-0.314711,-1.439947,-0.600193
1,-1.063235,-0.969205,0.453383
2,0.13778,0.363356,0.453383
3,0.438315,0.225754,-0.600193
4,0.540745,0.44302,1.506958


# 3. Robust Scaler

In [9]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(num_feats)

In [10]:
# Converting back to DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=num_feats.columns)
scaled_df.head()

,total_bill,tip,size
0,-0.074675,-1.2096,0.0
1,-0.691558,-0.7936,1.0
2,0.298237,0.384,1.0
3,0.545918,0.2624,0.0
4,0.630334,0.4544,2.0
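As noted at the top, robust scaling uses the median and the interquartile range (IQR) instead of the mean and standard deviation. The sketch below writes that out by hand on a hypothetical toy column and compares it with `RobustScaler` under its default settings (centering on the median, scaling by the 25th to 75th percentile range). The array and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: (X - median) / (Q3 - Q1).
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q3 - q1)

# Same computation with scikit-learn's defaults.
X_sklearn = RobustScaler().fit_transform(X)

print(X_manual.ravel())
print(np.allclose(X_manual, X_sklearn))
```

The outlier remains far from the rest after scaling, but it no longer influences the center or the scale applied to the ordinary values.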
