In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle

In [19]:
## Load the dataset
data = pd.read_csv("../dataset/Churn_Modelling.csv")
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [20]:
# Preprocessing the data
## Drop irrelevant columns/features
data = data.drop(columns=["RowNumber", "CustomerId", "Surname"])
data.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Now we will convert the categorical values to numerical values
The **Gender** and **Geography** columns will be converted.

# 🔑 LabelEncoder and `fit_transform`

### 📌 LabelEncoder
- Part of **scikit-learn (`sklearn.preprocessing`)**.  
- Used to convert **categorical labels (strings)** into **numeric values**.  
- Example:  

["cat", "dog", "dog", "mouse"] → [0, 1, 1, 2]

--- 
`LabelEncoder` doesn't assign numbers randomly. It works like this:

1. It sorts the unique classes alphabetically (lexicographically).
2. It assigns integers to them in that order, starting from 0.
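
The sorting rule can be checked directly on the fitted encoder. The sketch below uses a small made-up list of labels (not the churn data) and reads the learned order from `classes_`; the `mapping` dictionary is only a helper built for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels in non-alphabetical order (illustrative, not the churn data)
genders = ["Male", "Female", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_)}

print(le.classes_)                   # ['Female' 'Male']
print(mapping)                       # {'Female': 0, 'Male': 1}
print(codes)                         # [1 0 0 1]
print(le.inverse_transform([0, 1]))  # ['Female' 'Male']
```

Because "Female" sorts before "Male", it receives 0 even though "Male" appears first in the input.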

---

### 📌 `fit_transform()`
- **`fit()`** → Learns the mapping (e.g., "cat" → 0, "dog" → 1, "mouse" → 2).  
- **`transform()`** → Applies this mapping to the data.  
- **`fit_transform()`** → Combines both steps in one.  


### ✅ Example of label encoder in Python
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

animals = ["cat", "dog", "dog", "mouse"]
encoded = le.fit_transform(animals)

print(encoded)      # [0 1 1 2]
print(le.classes_)  # ['cat' 'dog' 'mouse']
```


In [21]:
# Encode categorical variables
label_encoder_gender = LabelEncoder()
data['Gender'] = label_encoder_gender.fit_transform(data['Gender'])
data.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,0,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,502,France,0,42,8,159660.8,3,1,0,113931.57,1
3,699,France,0,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,0,43,2,125510.82,1,1,1,79084.1,0


# 🔑 One Hot Encoding

**One Hot Encoding** is a technique to convert categorical values into a **binary (0/1) matrix**.  
Unlike `LabelEncoder`, which assigns an integer to each class, **One Hot Encoding creates a separate column for each unique category**.

---
## When to Use?

One Hot Encoding is used when categorical values do not have an **inherent order** (nominal data).

Example: colors (red, green, blue), animals (cat, dog, mouse), cities (London, Paris, Delhi).


## Code Example

In [22]:
## Example

# Suppose we have the following animals:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

animals = pd.DataFrame({"Animal": ["cat", "dog", "mouse", "dog"]})

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(animals[["Animal"]])

print(encoder.categories_)  
print(encoded)

[array(['cat', 'dog', 'mouse'], dtype=object)]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


In [23]:
## One Hot Encode 'Geography'
from sklearn.preprocessing import OneHotEncoder

onehot_encoder_geo = OneHotEncoder(sparse_output=False)
geo_encoded = onehot_encoder_geo.fit_transform(data[["Geography"]])
print(onehot_encoder_geo.categories_)
geo_encoded

[array(['France', 'Germany', 'Spain'], dtype=object)]


array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [24]:
onehot_encoder_geo.get_feature_names_out(['Geography'])

array(['Geography_France', 'Geography_Germany', 'Geography_Spain'],
      dtype=object)

In [25]:
## converting encoded values into dataframe

geo_encoded_df = pd.DataFrame(geo_encoded, columns=onehot_encoder_geo.get_feature_names_out(['Geography']))
geo_encoded_df

Unnamed: 0,Geography_France,Geography_Germany,Geography_Spain
0,1.0,0.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0
...,...,...,...
9995,1.0,0.0,0.0
9996,1.0,0.0,0.0
9997,1.0,0.0,0.0
9998,0.0,1.0,0.0


In [26]:
## Combine the one-hot encoded Geography columns with the original data
## This cell may be re-run, so check first to avoid duplicate columns and errors

new_columns = list(geo_encoded_df.columns)
if not all(col in data.columns for col in new_columns):
    data = pd.concat([data.drop(columns=['Geography'], errors='ignore'), geo_encoded_df], axis=1)
data

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain
0,619,0,42,2,0.00,1,1,1,101348.88,1,1.0,0.0,0.0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,0.0,0.0,1.0
2,502,0,42,8,159660.80,3,1,0,113931.57,1,1.0,0.0,0.0
3,699,0,39,1,0.00,2,0,0,93826.63,0,1.0,0.0,0.0
4,850,0,43,2,125510.82,1,1,1,79084.10,0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,1,39,5,0.00,2,1,0,96270.64,0,1.0,0.0,0.0
9996,516,1,35,10,57369.61,1,1,1,101699.77,0,1.0,0.0,0.0
9997,709,0,36,7,0.00,1,0,1,42085.58,1,1.0,0.0,0.0
9998,772,1,42,3,75075.31,2,1,0,92888.52,1,0.0,1.0,0.0


# 🥒 Pickle File in Python

A **pickle file** stores a Python object that has been **serialized** (saved) with the built-in `pickle` module, so that it can later be **deserialized** (loaded) back into memory.  

- **Serialization** → Convert Python objects into a byte stream (so they can be stored in a file or sent over a network).  
- **Deserialization** → Convert the byte stream back to Python objects.  

---

## 📌 What is a Pickle File?
- A file with extension `.pkl` (or `.pickle`).
- Stores Python objects in **binary format**.
- Useful for saving trained models, data, or any Python object to reuse later without recomputation.

---

## 📌 Use Cases

### 1. Machine Learning Models
Save a trained ML model to reuse without retraining:


In [27]:
import pickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit([[0,0],[1,1]], [0,1])

# Save model
with open("../pickle_files/model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load model
with open("../pickle_files/model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[2,2]]))

[1]


### 2. Data Storage

Save preprocessed data (e.g., tokenized text, feature matrices) for later use.

### 3. Object Persistence

Store Python objects (lists, dicts, custom classes) across program runs.
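
As a minimal sketch of object persistence, the round trip below serializes a dictionary to bytes in memory and restores it. The `settings` dictionary is a hypothetical example object, not something used elsewhere in this notebook.

```python
import pickle

# A plain Python object to persist (hypothetical settings, for illustration)
settings = {"test_size": 0.2, "random_state": 42, "features": ["Age", "Balance"]}

# dumps() serializes to bytes in memory; dump() would write to a file instead
payload = pickle.dumps(settings)

# loads() rebuilds an equal but separate object from the bytes
restored = pickle.loads(payload)

print(type(payload))         # <class 'bytes'>
print(restored == settings)  # True
print(restored is settings)  # False
```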

### ⚠️ Caution

Pickle files are not secure.

Loading a pickle from an untrusted source can execute malicious code.

Use only with trusted data.

In [28]:
## Save the encoders in pickle file

with open("../pickle_files/label_encoder_gender.pkl", "wb") as file:
    pickle.dump(label_encoder_gender, file)

with open("../pickle_files/onehot_encoder_geo.pkl", "wb") as file:
    pickle.dump(onehot_encoder_geo, file)
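
The reason for saving the encoders is to apply the same mapping to new data later. The following is a sketch only: a small stand-in encoder is fitted on two labels, and a temporary directory replaces `../pickle_files/` so the example runs anywhere.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Fit a small encoder (a stand-in for label_encoder_gender)
encoder = LabelEncoder().fit(["Female", "Male"])

# A temporary directory stands in for ../pickle_files/
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "label_encoder_gender.pkl"
    with open(path, "wb") as file:
        pickle.dump(encoder, file)
    with open(path, "rb") as file:
        loaded_encoder = pickle.load(file)

# Use transform (not fit_transform) so the saved mapping is reused
new_customers = ["Male", "Female", "Male"]
new_codes = loaded_encoder.transform(new_customers)
print(new_codes)  # [1 0 1]
```

Calling `fit_transform` on the new data instead would relearn the mapping from scratch, which may not match the one used during training.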
    

In [29]:
## Divide the dataset into dependent and independent features

## Exited is the dependent feature as it is the output and rest are the independent features as they are the inputs
# X is input, y is output
X = data.drop(columns=['Exited'], errors='ignore')
y = data['Exited']

## let's split our dataset in training and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Scale these features

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit computes the mean and standard deviation of the training set, then transforms it
X_test = scaler.transform(X_test)  # reuse the training mean and standard deviation so the test set is scaled consistently

X_train


array([[ 0.35649971,  0.91324755, -0.6557859 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [-0.20389777,  0.91324755,  0.29493847, ..., -0.99850112,
         1.72572313, -0.57638802],
       [-0.96147213,  0.91324755, -1.41636539, ..., -0.99850112,
        -0.57946723,  1.73494238],
       ...,
       [ 0.86500853, -1.09499335, -0.08535128, ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.15932282,  0.91324755,  0.3900109 , ...,  1.00150113,
        -0.57946723, -0.57638802],
       [ 0.47065475,  0.91324755,  1.15059039, ..., -0.99850112,
         1.72572313, -0.57638802]])
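
To see what "fit on train, transform on test" means in numbers, here is a small sketch with one made-up feature. The `toy_*` names are chosen so they do not overwrite the notebook's own `X_train`, `X_test`, and `scaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only)
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean_ and scale_ from the training rows
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses them; nothing is re-learned

print(toy_scaler.mean_)         # training mean: 20.0
print(toy_scaler.scale_)        # training standard deviation: about 8.165
print(toy_test_scaled.ravel())  # about [0.0, 2.449]
```

The test value 40 maps to about 2.449 because it is measured against the training mean and standard deviation, not against statistics of the test set itself.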

# 📌 What is `train_test_split`?

`train_test_split` is a utility function in **Scikit-learn** used to **split your dataset** into **training** and **testing** sets.

---

## 🔑 Why do we split?
- **Training set** → Used to train the machine learning model (teach it patterns).  
- **Testing set** → Used to evaluate how well the trained model performs on unseen data.  

👉 This helps us check if the model generalizes well, not just memorizes the training data.

---

## 🛠 Syntax
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

## Parameters

- X → Features (independent variables)

- y → Labels/Target (dependent variable)

- test_size → Proportion of data to be used as test set (e.g., 0.2 = 20% test, 80% train)

- train_size → (Optional) Directly set training size

- random_state → Ensures reproducibility (same split every time if you set it)

- shuffle=True → Randomly shuffles before splitting (default = True)
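
The effect of `random_state` and `shuffle` can be seen on a tiny array. This is a sketch on made-up data (the integers 0 to 9), not on the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(10)  # toy data: 0..9

# Same random_state -> identical split on every call
first_train, first_test = train_test_split(values, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(values, test_size=0.2, random_state=42)
print(np.array_equal(first_test, second_test))  # True

# shuffle=False keeps the original order: the last 20% becomes the test set
ordered_train, ordered_test = train_test_split(values, test_size=0.2, shuffle=False)
print(ordered_train)  # [0 1 2 3 4 5 6 7]
print(ordered_test)   # [8 9]
```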

## Example


In [30]:
## save scaler as pickle file

with open("../pickle_files/scaler.pkl", "wb") as file:
    pickle.dump(scaler, file)

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample data (named sample_data so the churn DataFrame `data` is not overwritten)
sample_data = {
    "Age": [20, 21, 22, 23, 24],
    "Salary": [20000, 25000, 27000, 30000, 32000],
    "Purchased": [0, 1, 0, 1, 1]
}
df = pd.DataFrame(sample_data)

a = df[["Age", "Salary"]]   # Features
b = df["Purchased"]         # Target

# Split data
A_train, A_test, b_train, b_test = train_test_split(a, b, test_size=0.4, random_state=1)

print("X_train:\n", A_train)
print("X_test:\n", A_test)
print("y_train:\n", b_train)
print("y_test:\n", b_test)


X_train:
    Age  Salary
4   24   32000
0   20   20000
3   23   30000
X_test:
    Age  Salary
2   22   27000
1   21   25000
y_train:
 4    1
0    0
3    1
Name: Purchased, dtype: int64
y_test:
 2    0
1    1
Name: Purchased, dtype: int64


### 🎯 Output meaning

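- `X_train`, `y_train` (`A_train`, `b_train` in the code above) → Used to train the model

- `X_test`, `y_test` (`A_test`, `b_test` in the code above) → Used to test the model's accuracy on unseen data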

## 👉 In short:
**`train_test_split` ensures that your model is trained on one part of the data and evaluated on unseen data, which lets you detect overfitting (it does not prevent it by itself).**
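
A gap between training and test accuracy is how a held-out split reveals overfitting. The sketch below is an illustration only: it uses synthetic data from `make_classification` and an unpruned decision tree, neither of which is part of this notebook's churn pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y), so memorizing cannot generalize
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unpruned decision tree can fit the training rows exactly
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score alone would suggest a perfect model; only the score on the held-out rows shows how much of that is memorization.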

# 📌 What is Feature Scaling?

**Feature Scaling** is a technique used to bring all independent variables (features) onto a comparable **scale**.  
This ensures that no feature dominates the others just because its values are numerically large.

---

## 🔑 Why do we need scaling?
- Many machine learning algorithms rely on **distance** (e.g., **KNN, SVM**) or are trained with **gradient-based optimization** (e.g., **Logistic Regression, Neural Networks**).  
- If one feature has much larger values than the others, it can dominate distance calculations and **bias the model**.  
- Scaling makes features **comparable** and can improve **model accuracy and convergence speed**.

---

## ⚙️ Common Techniques

### 1. Standardization (Z-score Normalization)
Transforms data to have **mean = 0** and **standard deviation = 1**.

$$
z = \frac{x - \mu}{\sigma}
$$

Example:  
- Age: [20, 30, 40] → [-1.22, 0, 1.22]
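
The numbers in this example can be reproduced by hand with NumPy. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default; the sample standard deviation (`ddof=1`) would give different values.

```python
import numpy as np

ages = np.array([20, 30, 40], dtype=float)

mu = ages.mean()     # 30.0
sigma = ages.std()   # population std (ddof=0), about 8.165
z = (ages - mu) / sigma
print(np.round(z, 2))  # about [-1.22, 0, 1.22]

# With the sample standard deviation (ddof=1) the result differs
z_sample = (ages - mu) / ages.std(ddof=1)
print(z_sample)  # about [-1, 0, 1]
```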

---

### 2. Min-Max Scaling (Normalization)
Rescales data to a **fixed range [0, 1]**.

$$
x' = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

Example:  
- Salary: [20000, 40000, 60000] → [0, 0.5, 1]

---

## ✅ Example in Python

In [32]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data (named sample_data so the churn DataFrame `data` is not overwritten)
sample_data = {"Age": [20, 30, 40], "Salary": [20000, 40000, 60000]}
df = pd.DataFrame(sample_data)

# Standardization
scaler_std = StandardScaler()
standardized = scaler_std.fit_transform(df)

# Min-Max Scaling
scaler_mm = MinMaxScaler()
normalized = scaler_mm.fit_transform(df)

print("Standardized:\n", standardized)
print("Normalized:\n", normalized)

Standardized:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
Normalized:
 [[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]


# ⚖️ Difference Between Scaling and Removing Outliers

## 1️⃣ Scaling Features
- **Definition**: Scaling changes the *range* of feature values but does not remove any data points.
- **Purpose**: Ensures features are on a similar scale so that models (like KNN, SVM, Logistic Regression, Neural Networks) perform correctly.
- **How**: Techniques like
  - Standardization (mean=0, std=1)
  - Min-Max Scaling (range 0–1)

✅ Example:  
- Before Scaling: [10, 20, 30]
- After Standard Scaling: [-1.22, 0, 1.22]


---

## 2️⃣ Removing Outliers
- **Definition**: Outlier removal identifies and *removes unusual/extreme values* that do not follow the general data distribution.
- **Purpose**: Prevents extreme values from skewing mean, std, or model performance.
- **How**: Techniques like
  - Z-score method (flag values with |z| > 3)
  - IQR method (flag values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`)

✅ Example:  
- Original Data: [10, 12, 11, 13, 100]
- After Removing Outlier: [10, 12, 11, 13]


---

## 🎯 Key Difference
- **Scaling**: Adjusts the *scale* of all values → keeps every point.  
- **Outlier Removal**: Deletes extreme points → changes dataset itself.  

👉 Outlier removal is often done **before scaling**, because otherwise the outliers distort the statistics (mean, standard deviation, min, max) that the scaler learns.
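
The distortion is easy to see with Min-Max scaling. As a small sketch on the example values from the outlier section, a single extreme point squeezes the remaining values into a narrow band near 0:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

with_outlier = np.array([[10.0], [12.0], [11.0], [13.0], [100.0]])
without_outlier = with_outlier[:-1]  # drop the last row (the value 100)

scaled_with = MinMaxScaler().fit_transform(with_outlier).ravel()
scaled_without = MinMaxScaler().fit_transform(without_outlier).ravel()

print(np.round(scaled_with, 3))     # about [0, 0.022, 0.011, 0.033, 1]
print(np.round(scaled_without, 3))  # about [0, 0.667, 0.333, 1]
```

With the outlier present, the four ordinary values occupy only about 3% of the [0, 1] range; once it is removed they spread across the whole range.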