### Import Dependencies

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### Important Concepts

**Standardization** and **Normalization** are two common techniques for feature scaling in data preprocessing:

- **Standardization** transforms data so that it has a mean of 0 and a standard deviation of 1.

  - **Formula:**  
    \( z = \frac{x - \mu}{\sigma} \)

  - **Use case:**  
    Useful when data is normally distributed or for algorithms that assume zero-centered data (e.g., linear models, PCA).

- **Normalization** (here meaning Min-Max scaling, the most common usage) rescales data to a fixed range, usually [0, 1].

  - **Formula:**  
    \( x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \)

  - **Use case:**  
    Useful when you need bounded values, such as inputs that must lie in [0, 1].
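
Both formulas can be checked by hand against scikit-learn. The sketch below is a minimal illustration on a toy array (the values are made up, not taken from the churn dataset). It assumes `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature column; the values are illustrative only.
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Standardization: z = (x - mu) / sigma, with sigma the population std (ddof=0).
z_manual = (x - x.mean()) / x.std()

# Min-Max normalization: x_norm = (x - x_min) / (x_max - x_min).
x_norm_manual = (x - x.min()) / (x.max() - x.min())

# The scikit-learn scalers should reproduce the hand-computed values.
z_sklearn = StandardScaler().fit_transform(x)
x_norm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))            # True
print(np.allclose(x_norm_manual, x_norm_sklearn))  # True
```

After standardization the column has mean 0 and standard deviation 1. After Min-Max scaling its smallest value is 0 and its largest is 1.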

### Basic Processing

In [2]:
df = pd.read_csv('data/processed/ChurnModelling_Encoded.csv')
df.head()

|  | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | CreditScoreBins | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.0 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | True | False | False | True | False |
| 1 | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | False | False | True | True | False |
| 2 | 42.0 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | True | False | False | True | False |
| 3 | 38.91 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 2 | True | False | False | True | False |
| 4 | 43.0 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 4 | False | False | True | True | False |


| **Condition**                                                 | **Min-Max Scaling**                             | **Standardization (Z-score)**                    |
|---------------------------------------------------------------|--------------------------------------------------|--------------------------------------------------|
| Data has a known, fixed range                                 | ✅ Yes                                           | ❌ Not ideal                                     |
| Data contains outliers                                        | ❌ Sensitive (min and max are set by extremes)   | ⚠️ Less sensitive, but mean and std still shift  |
| Data is normally distributed                                  | ❌ Not necessary                                 | ✅ Preferred                                     |
| Data is not normally distributed (e.g., skewed)               | ✅ If shape needs to be preserved                | ✅ Often works well after log-transform          |
| Model is distance-based (KNN, SVM)                            | ✅ Recommended                                   | ✅ Also acceptable                               |
| Model is neural network                                       | ✅ Strongly recommended                          | ✅ Also works well                               |
| Model is linear or uses regularization                        | ❌ Not ideal                                     | ✅ Helps with convergence                        |
| Input features need bounded values (0–1)                      | ✅ Required                                      | ❌ Not bounded                                   |
| Applying PCA or LDA                                           | ❌ May distort variance                          | ✅ Recommended (centers data, equalizes variance)|
| Want to preserve original distribution shape                  | ✅ Maintains feature shape                       | ✅ Maintains shape but centers data              |
| Working with tree-based models                                | ❌ Not needed                                    | ❌ Not needed                                    |
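
The outlier row is worth checking on a toy array, since both scalers are linear transforms whose parameters are estimated from the data. The sketch below is illustrative only: the values and the helper `inlier_spread` are hypothetical and not part of the notebook. It measures how much room five inliers occupy after scaling, with and without one extreme point.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy values, not from the churn dataset.
inliers = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
with_outlier = np.vstack([inliers, [[1000.0]]])  # same values plus one extreme point

def inlier_spread(scaler_cls, data):
    """Fit the scaler on `data` and return max - min over the first five (inlier) rows."""
    scaled = scaler_cls().fit_transform(data)
    return float(scaled[:5].max() - scaled[:5].min())

for scaler_cls in (MinMaxScaler, StandardScaler):
    before = inlier_spread(scaler_cls, inliers)
    after = inlier_spread(scaler_cls, with_outlier)
    print(f"{scaler_cls.__name__}: inlier spread {before:.3f} -> {after:.3f}")
```

On this toy data the inliers are squeezed into a narrow band under both scalers once the outlier is present. Min-Max additionally pins the outlier to exactly 1, so the whole output range is defined by that single point.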


In [3]:
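columns_need_to_be_scaled = ['Age', 'Tenure', 'Balance', 'EstimatedSalary']

# StandardScaler standardizes each column independently,
# so a single fit_transform call covers all four columns.
standard_scaler = StandardScaler()
df[columns_need_to_be_scaled] = standard_scaler.fit_transform(df[columns_need_to_be_scaled])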

df

|  | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | CreditScoreBins | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.302983 | -1.041760 | -1.225848 | 1 | 1 | 1 | 0.021886 | 1 | 1 | True | False | False | True | False |
| 1 | 0.204867 | -1.387538 | 0.117350 | 1 | 0 | 1 | 0.216534 | 0 | 1 | False | False | True | True | False |
| 2 | 0.302983 | 1.032908 | 1.333053 | 3 | 1 | 0 | 0.240687 | 1 | 0 | True | False | False | True | False |
| 3 | -0.000196 | -1.387538 | -1.225848 | 2 | 0 | 0 | -0.108918 | 0 | 2 | True | False | False | True | False |
| 4 | 0.401100 | -1.041760 | 0.785728 | 1 | 1 | 1 | -0.365276 | 0 | 4 | False | False | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 0.008634 | -0.004426 | -1.225848 | 2 | 1 | 0 | -0.066419 | 0 | 3 | True | False | False | False | True |
| 9996 | -0.383831 | 1.724464 | -0.306379 | 1 | 1 | 1 | 0.027988 | 0 | 0 | True | False | False | False | True |
| 9997 | -0.285715 | 0.687130 | -1.225848 | 1 | 0 | 1 | -1.008643 | 1 | 2 | True | False | False | True | False |
| 9998 | 0.302983 | -0.695982 | -0.022608 | 2 | 1 | 0 | -0.125231 | 1 | 3 | False | True | False | False | True |
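
A quick sanity check after standardizing is to confirm that each scaled column has mean close to 0 and standard deviation close to 1. The sketch below runs that check on a synthetic stand-in frame, since the CSV is not reproduced here. The frame `demo` and its random values are assumptions; only the four column names come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data: random values, real column names.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.normal(39, 10, 200),
    'Tenure': rng.integers(0, 11, 200).astype(float),
    'Balance': rng.uniform(0, 200_000, 200),
    'EstimatedSalary': rng.uniform(10_000, 200_000, 200),
})

cols = ['Age', 'Tenure', 'Balance', 'EstimatedSalary']
demo[cols] = StandardScaler().fit_transform(demo[cols])

# ddof=0 matches the population standard deviation that StandardScaler divides by;
# pandas defaults to ddof=1, which would give values slightly above 1.
means = demo[cols].mean()
stds = demo[cols].std(ddof=0)
print(pd.DataFrame({'mean': means.round(6), 'std': stds.round(6)}))
```

The same two lines (`mean()` and `std(ddof=0)`) can be run on the real `df` right after the scaling cell.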


In [4]:
df.to_csv('data/processed/ChurnModelling_Final.csv', index=False)
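
The `index=False` argument matters when the file is read back later. The sketch below, using a toy frame and a temporary directory (both hypothetical, chosen so that nothing under `data/processed/` is touched), shows what each choice looks like after a round trip through `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame; the values are illustrative only.
demo = pd.DataFrame({'Age': [0.30, 0.20], 'Exited': [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    demo.to_csv(with_index)                  # default: the row index is written as a column
    demo.to_csv(without_index, index=False)  # the row index is left out

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Age', 'Exited']
print(cols_without_index)  # ['Age', 'Exited']
```

Writing without the index keeps the saved file limited to the actual feature and target columns.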