# 3. Normalization and Standardization

In this notebook, we will explain why it is essential to normalize and standardize data, and provide examples of both techniques.

## Why Normalize & Standardize?

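- Data often comes from different sources, so features can differ in magnitude by several orders (for example, an age in years next to a salary in currency units).
- This difference in scale can hurt models that rely on distances or gradient-based optimization, because the large-valued features dominate.
- Two preprocessing techniques are commonly used to put features on a comparable scale: normalization and standardization.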


## Normalization

- Min-max scaling transforms each numerical value $x$ into another value $x_e \in [0, 1]$ using the minimum and maximum values in the data.
- Because the transformation is linear, it preserves the relative spacing between the values of a feature.

$x_e = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$

where $x_{\text{min}}$ is the minimum value of the variable $x$ and $x_{\text{max}}$ is the maximum value.
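The formula above can be checked by hand on a few values. The sketch below applies it with NumPy to five ages taken from the rows displayed in the sample output further down (rows 0, 1, 3, 4, and 99). It assumes those five values include the column's minimum (20) and maximum (59), which the normalized output suggests since rows 0 and 3 map to 0 and 1.

```python
import numpy as np

# Ages from rows 0, 1, 3, 4 and 99 of the sample DataFrame shown in the output
x = np.array([20.0, 23.0, 59.0, 29.0, 41.0])

# Min-max scaling: subtract the minimum, divide by the range
x_min, x_max = x.min(), x.max()
x_e = (x - x_min) / (x_max - x_min)

print(np.round(x_e, 6))
```

Under that assumption, the result should agree with the `Age` column of the normalized table for the same rows. Note that the formula is undefined for a constant feature, where $x_{\text{max}} = x_{\text{min}}$ and the denominator is zero.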


In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Generate a sample DataFrame with 100 rows (np.random.randint excludes the upper bound)
np.random.seed(0)
data = {
    'Age': np.random.randint(20, 60, 100),
    'Salary': np.random.randint(30000, 120000, 100),
    'Experience': np.random.randint(1, 40, 100)
}

df = pd.DataFrame(data)

# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df)
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)

print("Original Data:\n", df)
print("\nNormalized Data:\n", df_normalized)

Original Data:
     Age  Salary  Experience
0    20   54777          29
1    23   43824           3
2    23   32418          28
3    59   42843          20
4    29  108778          26
..  ...     ...         ...
95   23   63930          29
96   54   76774          37
97   33   58651          26
98   59   92885          33
99   41   45997          15

[100 rows x 3 columns]

Normalized Data:
          Age    Salary  Experience
0   0.000000  0.273841    0.756757
1   0.076923  0.150450    0.054054
2   0.076923  0.021956    0.729730
3   1.000000  0.139399    0.513514
4   0.230769  0.882186    0.675676
..       ...       ...         ...
95  0.076923  0.376953    0.756757
96  0.871795  0.521647    0.972973
97  0.333333  0.317483    0.675676
98  1.000000  0.703144    0.864865
99  0.538462  0.174930    0.378378

[100 rows x 3 columns]


## Standardization

- Standardization (z-score scaling) is often applied when a variable is roughly normally distributed (Gaussian), but it does not require normality.
- Standardization shifts and rescales a variable without changing the shape of its distribution. The standardized variable $x_s$ has:
  - a mean of 0
  - a standard deviation of 1

- Standardization transforms $x$ into:

$x_s = \frac{x - \mu}{\sigma}$

where $\mu$ is the mean of the original variable $x$ and $\sigma$ is its standard deviation.
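As a minimal sketch of the formula, the snippet below standardizes a small array by hand with NumPy. The values are hypothetical and chosen only so that the arithmetic is easy to follow. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses; pandas' `Series.std()` defaults to `ddof=1`, so the two can differ slightly on small samples.

```python
import numpy as np

# Hypothetical toy values: mean 5, population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # mean of the original values
sigma = x.std(ddof=0)  # population standard deviation

# z-score: center on the mean, then divide by the standard deviation
x_s = (x - mu) / sigma

print(x_s)
print(x_s.mean(), x_s.std(ddof=0))
```

The standardized values have mean 0 and standard deviation 1, but their histogram keeps the same shape as the original.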

In [15]:
from sklearn.preprocessing import StandardScaler

# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
df_standardized = pd.DataFrame(standardized_data, columns=df.columns)

print("\nOriginal Data:\n", df)
print("\nStandardized Data:\n", df_standardized)


Original Data:
     Age  Salary  Experience
0    20   54777          29
1    23   43824           3
2    23   32418          28
3    59   42843          20
4    29  108778          26
..  ...     ...         ...
95   23   63930          29
96   54   76774          37
97   33   58651          26
98   59   92885          33
99   41   45997          15

[100 rows x 3 columns]

Standardized Data:
          Age    Salary  Experience
0  -1.500053 -0.687615    0.763871
1  -1.262702 -1.096027   -1.603310
2  -1.262702 -1.521330    0.672826
3   1.585499 -1.132606   -0.055538
4  -0.788002  1.325956    0.490735
..       ...       ...         ...
95 -1.262702 -0.346321    0.763871
96  1.189915  0.132602    1.492234
97 -0.471536 -0.543163    0.490735
98  1.585499  0.733343    1.128053
99  0.161398 -1.015001   -0.510765

[100 rows x 3 columns]


The tables above illustrate the differences between the original data, normalized data, and standardized data.

- **Original Data**: The raw data before any transformation.
- **Normalized Data**: Data scaled to a range of [0, 1] using min-max scaling.
- **Standardized Data**: Data shifted and rescaled so that each feature has a mean of 0 and a standard deviation of 1.

Both transformations put the features on a comparable scale, which keeps features with larger magnitudes from dominating scale-sensitive models during training.
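To make the effect of scale concrete, the sketch below measures how much of the squared Euclidean distance between rows 0 and 1 comes from `Salary`, before and after min-max scaling. The numbers are copied from the printed outputs above. The helper `salary_share` is a hypothetical name introduced for this illustration only.

```python
import numpy as np

def salary_share(p, q):
    """Fraction of the squared Euclidean distance contributed by Salary (index 1)."""
    squared_diff = (p - q) ** 2
    return squared_diff[1] / squared_diff.sum()

# Rows 0 and 1 as printed above: (Age, Salary, Experience)
raw_0 = np.array([20.0, 54777.0, 29.0])
raw_1 = np.array([23.0, 43824.0, 3.0])

# The same rows after min-max scaling, copied from the normalized output
norm_0 = np.array([0.000000, 0.273841, 0.756757])
norm_1 = np.array([0.076923, 0.150450, 0.054054])

raw_share = salary_share(raw_0, raw_1)
norm_share = salary_share(norm_0, norm_1)

print(f"Salary share of squared distance (raw):        {raw_share:.4f}")
print(f"Salary share of squared distance (normalized): {norm_share:.4f}")
```

On the raw data, `Salary` accounts for almost the entire distance. After scaling, it contributes only a few percent for this pair of rows, and `Experience`, where the two rows actually differ most in relative terms, carries most of the weight.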
