### **Module 8: Advanced Pandas for Machine Learning**
This final module covers the pandas and scikit-learn techniques most often needed when preparing data for machine learning: feature engineering, categorical encoding, scaling, and outlier handling.

#### **Topics:**
- **Feature Engineering:**
  - Creating new features from existing data (e.g., interactions, ratios).
  
- **Encoding Categorical Variables:**
  - One-hot encoding using `pd.get_dummies()`.
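  - Label encoding using `sklearn.preprocessing.LabelEncoder` (scikit-learn designs it for target labels; `OrdinalEncoder` is its counterpart for input features).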

- **Scaling and Normalization:**
  - Using `MinMaxScaler` and `StandardScaler` from `sklearn` for feature scaling.

- **Dealing with Outliers:**
  - Detecting and removing outliers using methods like Z-score and IQR.

#### **Hands-on Lab:**
- Prepare a dataset for machine learning by encoding categorical features, scaling numeric features, and handling outliers.

---

## **1. Feature Engineering**

### **Real-world Scenario 1: Customer Data Analysis**  
You have a **customer dataset** with columns such as `total_purchases` and `total_visits`. You want to **create new features** such as the **average purchase amount per visit** and each customer's **share of total purchases**.

### **Example: Creating New Features**  

In [2]:
import pandas as pd

# Sample customer data
customer_data = {
    'customer_id': [1, 2, 3],
    'total_purchases': [500, 200, 300],
    'total_visits': [5, 10, 6]
}

df = pd.DataFrame(customer_data)

# Create new features
# Average purchase amount per visit
df['avg_purchase_per_visit'] = df['total_purchases'] / df['total_visits']
# Each customer's share of all purchases in the dataset
df['purchase_ratio'] = df['total_purchases'] / df['total_purchases'].sum()

display(df)

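|   | customer_id | total_purchases | total_visits | avg_purchase_per_visit | purchase_ratio |
|---|-------------|-----------------|--------------|------------------------|----------------|
| 0 | 1 | 500 | 5 | 100.0 | 0.5 |
| 1 | 2 | 200 | 10 | 20.0 | 0.2 |
| 2 | 3 | 300 | 6 | 50.0 | 0.3 |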
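
Ratio features break down when the denominator can be zero. The sketch below is one way to guard against that, and it also adds a simple interaction feature. The fourth customer and the column name `purchases_x_visits` are hypothetical and not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: customer 4 has no recorded visits
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'total_purchases': [500, 200, 300, 0],
    'total_visits': [5, 10, 6, 0],
})

# Replace 0 visits with NaN so a zero denominator yields NaN
# (pandas would otherwise return inf for nonzero / 0)
visits = df['total_visits'].replace(0, np.nan)
df['avg_purchase_per_visit'] = df['total_purchases'] / visits

# Interaction feature: product of two existing columns
df['purchases_x_visits'] = df['total_purchases'] * df['total_visits']

print(df)
```

Whether to leave the resulting `NaN` in place, fill it, or drop the row depends on the model being trained.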


## **2. Encoding Categorical Variables**

### **Real-world Scenario 2: Car Sales Data**  
You have a **car sales dataset** with a `fuel_type` column. You need to **encode this categorical column** for machine learning.

### **Example 1: One-Hot Encoding**  

In [3]:
# Sample car sales data
car_data = {
    'car_id': [1, 2, 3],
    'fuel_type': ['Petrol', 'Diesel', 'Electric'],
    'price': [10000, 12000, 30000]
}

df = pd.DataFrame(car_data)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['fuel_type'], drop_first=True)

print(df_encoded)

   car_id  price  fuel_type_Electric  fuel_type_Petrol
0       1  10000               False              True
1       2  12000               False             False
2       3  30000                True             False
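
When new data arrives later, it may contain categories the training data never had, or lack some that it did. The sketch below shows one possible way to keep the encoded columns consistent using `reindex`. The `new` frame and its `'Hybrid'` value are hypothetical, and `drop_first` is omitted here to keep the column layout easy to read.

```python
import pandas as pd

train = pd.DataFrame({'fuel_type': ['Petrol', 'Diesel', 'Electric']})
new = pd.DataFrame({'fuel_type': ['Diesel', 'Hybrid']})  # 'Hybrid' was never seen in training

# dtype=int yields 0/1 columns instead of booleans
train_encoded = pd.get_dummies(train, columns=['fuel_type'], dtype=int)

# Encode the new rows, then force the training column layout:
# columns missing from the new data are filled with 0,
# columns for unseen categories are dropped
new_encoded = pd.get_dummies(new, columns=['fuel_type'], dtype=int)
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_encoded)
```

With this approach the `'Hybrid'` row ends up as all zeros, which is indistinguishable from "none of the known categories"; that may or may not be acceptable for a given model.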


### **Example 2: Label Encoding**

In [4]:
from sklearn.preprocessing import LabelEncoder

# Sample customer gender data
gender_data = {
    'customer_id': [1, 2, 3],
    'gender': ['Male', 'Female', 'Female']
}

df = pd.DataFrame(gender_data)

# Label encoding
label_encoder = LabelEncoder()
df['gender_encoded'] = label_encoder.fit_transform(df['gender'])

print(df)

   customer_id  gender  gender_encoded
0            1    Male               1
1            2  Female               0
2            3  Female               0
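
The integer codes are assigned in sorted label order, which is why `Female` becomes 0 and `Male` becomes 1. A minimal sketch of how to inspect that mapping and reverse it, using the same gender values as above (the `mapping` variable is introduced here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

genders = pd.Series(['Male', 'Female', 'Female'])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(genders)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
decoded = label_encoder.inverse_transform(codes)
print(list(decoded))
```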


## **3. Scaling and Normalization**

### **Real-world Scenario 3: House Prices Data**  
You have a **house prices dataset** with columns `price`, `size_in_sqft`, and `number_of_rooms`. You need to **scale the features** for ML preprocessing.

### **Example: Feature Scaling**  

In [6]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample house prices data
house_data = {
    'house_id': [1, 2, 3],
    'price': [300000, 500000, 700000],
    'size_in_sqft': [1500, 2500, 3500]
}

df = pd.DataFrame(house_data)

# Min-Max Scaling
scaler = MinMaxScaler()
df[['price_scaled', 'size_scaled']] = scaler.fit_transform(df[['price', 'size_in_sqft']])

# Standardization (Z-score)
std_scaler = StandardScaler()
df[['price_standardized', 'size_standardized']] = std_scaler.fit_transform(df[['price', 'size_in_sqft']])

display(df)

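|   | house_id | price | size_in_sqft | price_scaled | size_scaled | price_standardized | size_standardized |
|---|----------|-------|--------------|--------------|-------------|--------------------|-------------------|
| 0 | 1 | 300000 | 1500 | 0.0 | 0.0 | -1.224745 | -1.224745 |
| 1 | 2 | 500000 | 2500 | 0.5 | 0.5 | 0.0 | 0.0 |
| 2 | 3 | 700000 | 3500 | 1.0 | 1.0 | 1.224745 | 1.224745 |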
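
The example above fits and transforms the same rows. In a typical ML workflow the scaler is fitted on training data only and then reused on unseen data, so that information from the test set does not leak into preprocessing. A minimal sketch, assuming a hypothetical `test` frame with two new prices:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'price': [300000, 500000, 700000]})
test = pd.DataFrame({'price': [400000, 900000]})  # hypothetical unseen rows

scaler = MinMaxScaler()
scaler.fit(train[['price']])  # learn min and max from the training data only

train['price_scaled'] = scaler.transform(train[['price']]).ravel()
test['price_scaled'] = scaler.transform(test[['price']]).ravel()

print(test)
```

Values outside the training range scale outside [0, 1]; here 900000 maps to 1.5 because it exceeds the training maximum of 700000.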


## **4. Dealing with Outliers**

### **Real-world Scenario 4: Employee Salary Data**  
You have an **employee salary dataset**. You need to **detect and remove outliers** in the `salary` column using the **Z-score method** and **IQR method**.

### **Example 1: Z-Score Method** 

In [7]:
import numpy as np

# Sample employee salary data
salary_data = {
    'employee_id': [1, 2, 3, 4, 5, 6],
    'salary': [50000, 52000, 48000, 49000, 47000, 150000]  # 150000 is an outlier
}

df = pd.DataFrame(salary_data)

# Calculate Z-scores
df['z_score'] = (df['salary'] - df['salary'].mean()) / df['salary'].std()

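# Keep rows with |z| < 3. With only 6 rows, |z| cannot exceed (n - 1) / sqrt(n), about 2.04,
# so this threshold removes nothing here and the 150000 row stays in the output below.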
df_no_outliers = df[df['z_score'].abs() < 3]

print(df_no_outliers)

   employee_id  salary   z_score
0            1   50000 -0.388469
1            2   52000 -0.339910
2            3   48000 -0.437027
3            4   49000 -0.412748
4            5   47000 -0.461306
5            6  150000  2.039460
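
The 150000 salary survives the filter because its Z-score is only about 2.04. With the sample standard deviation, no Z-score in a dataset of n rows can exceed (n - 1) / sqrt(n), so a cutoff of 3 only becomes usable with larger samples. The sketch below checks that bound and, purely as an illustration, applies a lower cutoff of 2; the cutoff value is an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000, 52000, 48000, 49000, 47000, 150000])

# Same formula as above: pandas .std() is the sample standard deviation (ddof=1)
z = (salary - salary.mean()) / salary.std()

# Upper bound on |z| for n observations when using the sample standard deviation
n = len(salary)
max_abs_z = (n - 1) / np.sqrt(n)
print(round(float(max_abs_z), 3))  # about 2.041

# An illustrative cutoff of 2 does exclude the 150000 salary
filtered = salary[z.abs() < 2]
print(filtered.tolist())
```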


### **Example 2: IQR Method (Interquartile Range)**  

In [8]:
# Calculate IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
df_no_outliers = df[(df['salary'] >= Q1 - 1.5 * IQR) & (df['salary'] <= Q3 + 1.5 * IQR)]

print(df_no_outliers)

   employee_id  salary   z_score
0            1   50000 -0.388469
1            2   52000 -0.339910
2            3   48000 -0.437027
3            4   49000 -0.412748
4            5   47000 -0.461306
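
Because the same three lines of IQR arithmetic recur for every column, it can help to wrap them in a small helper. The function name `iqr_bounds` and its `k` parameter are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

salary = pd.Series([50000, 52000, 48000, 49000, 47000, 150000])
lower, upper = iqr_bounds(salary)
print(lower, upper)  # 43375.0 56375.0

# between() is inclusive on both ends by default, matching the >= / <= filter above
kept = salary[salary.between(lower, upper)]
print(kept.tolist())
```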


## **5. Full Hands-on Lab: Preprocessing a Dataset for ML**

### **Scenario 5: Preprocessing Customer Churn Dataset**  
You have a **customer churn dataset** with both categorical and numerical features. You need to:
1. **One-hot encode** the `contract_type` column.
2. **Scale** the `monthly_charges` column.
3. **Remove outliers** from the `tenure` column.

In [10]:
from sklearn.preprocessing import MinMaxScaler

# Sample customer churn data
churn_data = {
    'customer_id': [1, 2, 3, 4, 5],
    'contract_type': ['Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year'],
    'monthly_charges': [70.5, 99.9, 85.4, 110.7, 80.0],
    'tenure': [1, 24, 12, 36, 2]  # no value here falls outside the 1.5 * IQR fences (-31 to 57)
}

df = pd.DataFrame(churn_data)

# Step 1: One-hot encoding
df_encoded = pd.get_dummies(df, columns=['contract_type'], drop_first=True)

# Step 2: Feature scaling
scaler = MinMaxScaler()
df_encoded[['monthly_charges_scaled']] = scaler.fit_transform(df_encoded[['monthly_charges']])

# Step 3: Outlier removal using IQR method for 'tenure'
Q1 = df_encoded['tenure'].quantile(0.25)
Q3 = df_encoded['tenure'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df_encoded[(df_encoded['tenure'] >= Q1 - 1.5 * IQR) & (df_encoded['tenure'] <= Q3 + 1.5 * IQR)]

display(df_no_outliers)

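|   | customer_id | monthly_charges | tenure | contract_type_One year | contract_type_Two year | monthly_charges_scaled |
|---|-------------|-----------------|--------|------------------------|------------------------|------------------------|
| 0 | 1 | 70.5 | 1 | False | False | 0.0 |
| 1 | 2 | 99.9 | 24 | False | True | 0.731343 |
| 2 | 3 | 85.4 | 12 | True | False | 0.370647 |
| 3 | 4 | 110.7 | 36 | False | False | 1.0 |
| 4 | 5 | 80.0 | 2 | False | True | 0.236318 |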
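
All five rows remain because no `tenure` value falls outside the IQR fences for this sample. One possible variation is to wrap the three steps in a function and filter outliers before scaling, since min-max scaling is sensitive to extreme values. The function name `preprocess_churn` is hypothetical, and the reordering is a design choice made for this sketch rather than part of the lab above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess_churn(df):
    """Filter tenure outliers, one-hot encode contract_type, scale monthly_charges."""
    # Step 1: drop tenure outliers first so they cannot affect later steps
    q1, q3 = df['tenure'].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df['tenure'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = df.loc[mask].copy()

    # Step 2: one-hot encode the contract type
    out = pd.get_dummies(out, columns=['contract_type'], drop_first=True)

    # Step 3: min-max scale the monthly charges of the remaining rows
    scaled = MinMaxScaler().fit_transform(out[['monthly_charges']])
    out['monthly_charges_scaled'] = scaled.ravel()
    return out

churn = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'contract_type': ['Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year'],
    'monthly_charges': [70.5, 99.9, 85.4, 110.7, 80.0],
    'tenure': [1, 24, 12, 36, 2],
})

result = preprocess_churn(churn)
print(result)
```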


## **6. Additional Hands-on Labs:**

### **Scenario 6: Loan Approval Dataset**  
- **Task:** Encode `loan_type` (categorical) using `LabelEncoder`.
- **Task:** Scale `loan_amount` using `StandardScaler`.

In [12]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

loan_data = {
    'applicant_id': [101, 102, 103],
    'loan_type': ['Home Loan', 'Car Loan', 'Personal Loan'],
    'loan_amount': [200000, 15000, 30000]
}

df = pd.DataFrame(loan_data)

# Label Encoding
le = LabelEncoder()
df['loan_type_encoded'] = le.fit_transform(df['loan_type'])

# Feature scaling
scaler = StandardScaler()
df[['loan_amount_standardized']] = scaler.fit_transform(df[['loan_amount']])

display(df)

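|   | applicant_id | loan_type | loan_amount | loan_type_encoded | loan_amount_standardized |
|---|--------------|-----------|-------------|-------------------|--------------------------|
| 0 | 101 | Home Loan | 200000 | 1 | 1.410441 |
| 1 | 102 | Car Loan | 15000 | 0 | -0.794615 |
| 2 | 103 | Personal Loan | 30000 | 2 | -0.615827 |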


### **Scenario 7: E-commerce Order Dataset**  
- **Task:** Create a new feature `order_value` as `price * quantity`.
- **Task:** One-hot encode the `payment_method` column.

In [14]:
ecommerce_data = {
    'order_id': [1, 2, 3],
    'price': [50, 30, 100],
    'quantity': [2, 5, 1],
    'payment_method': ['Credit Card', 'Debit Card', 'PayPal']
}

df = pd.DataFrame(ecommerce_data)

# Create a new feature
df['order_value'] = df['price'] * df['quantity']

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['payment_method'], drop_first=True)

display(df_encoded)

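|   | order_id | price | quantity | order_value | payment_method_Debit Card | payment_method_PayPal |
|---|----------|-------|----------|-------------|---------------------------|-----------------------|
| 0 | 1 | 50 | 2 | 100 | False | False |
| 1 | 2 | 30 | 5 | 150 | True | False |
| 2 | 3 | 100 | 1 | 100 | False | True |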


## **Summary of Examples:**

| **Scenario**                 | **Task**                                  | **Description**                                                       |
|------------------------------|-------------------------------------------|-----------------------------------------------------------------------|
| Customer Data Analysis       | Feature Engineering                       | Created `avg_purchase_per_visit` and `purchase_ratio`.                |
| Car Sales                    | One-hot Encoding and Label Encoding       | One-hot encoded fuel types; label encoded customer genders.           |
| House Prices                 | Min-Max Scaling and Standardization       | Scaled house prices and sizes.                                        |
| Employee Salary              | Z-score and IQR Method                    | Applied both methods to salary data; only IQR removed the outlier.    |
| Customer Churn               | Full Preprocessing Example                | Encoded, scaled, and applied IQR outlier filtering to churn data.     |
| Loan Approval                | Label Encoding and Feature Scaling        | Encoded loan types and standardized loan amounts.                     |
| E-commerce Orders            | Feature Engineering and One-hot Encoding  | Created `order_value` and one-hot encoded payment methods.            |

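Together, these hands-on examples cover the core machine learning preprocessing steps: creating features, encoding categorical variables, scaling numeric columns, and handling outliers.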