# Day 4: Data Preparation & Feature Engineering

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Understand data quality properties and the knowledge hierarchy
- Distinguish structured vs unstructured data
- Handle missing data using different strategies
- Detect and handle outliers
- Apply normalization techniques
- Perform categorical encoding
- Create new features through feature engineering

---

## Part 1: Setup and Data Loading (5 mins)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import IsolationForest

print("✓ Libraries imported successfully!")

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

---
## Part 2: Data Quality Assessment (15 mins)

### The Knowledge Hierarchy
- **Data:** Raw facts (e.g., "Age: 22")
- **Information:** Processed data (e.g., "Average age is 29.7")
- **Knowledge:** Understanding patterns (e.g., "Younger passengers had better survival rates")
- **Wisdom:** Applying knowledge (e.g., "Prioritize evacuating children in emergencies")

### Exercise 2.1: Data Quality Properties

In [None]:
# TODO: Check for missing values
# Hint: Use df.isnull().sum()

missing_values = # YOUR CODE HERE
print("Missing Values per Column:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

In [None]:
# TODO: Visualize missing data
# Create a heatmap showing where data is missing

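missing_data = df.isnull().astype(int)  # 1 = missing, 0 = present
fig = px.imshow(missing_data.T,
                labels=dict(x="Passenger", y="Feature", color="Missing"),
                color_continuous_scale='gray',  # 0 -> black, 1 -> white
                aspect='auto',
                title="Missing Data Heatmap (White = Missing)")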
fig.show()

### Exercise 2.2: Structured vs Unstructured Data

**Structured Data:** Organized in tables with rows and columns (like our Titanic dataset)  
**Unstructured Data:** No predefined format (e.g., text, images, audio)

In [None]:
# TODO: Identify different data types in our dataset
# Hint: Use df.dtypes

print("Data Types:")
# YOUR CODE HERE

# Separate numerical and categorical features
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"\nNumerical features: {numerical_cols}")
print(f"Categorical features: {categorical_cols}")

---
## Part 3: Handling Missing Data (20 mins)

### Missing Data Mechanisms
- **MCAR** (Missing Completely At Random): Missingness is unrelated to any variable, observed or unobserved
- **MAR** (Missing At Random): Missingness depends only on other observed variables
- **MNAR** (Missing Not At Random): Missingness depends on the value that is missing
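
To make the three mechanisms concrete, here is a minimal sketch on synthetic data (not the Titanic set). The `age` and `income` columns, the missingness probabilities, and the `observed_means` dictionary are all assumptions chosen for illustration. It shows how the mean of the values that remain observed behaves under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (an assumption for illustration): income rises with age.
age = rng.uniform(20, 70, n)
income = 1_000 * age + rng.normal(0, 5_000, n)
toy = pd.DataFrame({"age": age, "income": income})
true_mean = toy["income"].mean()

# MCAR: every row has the same 30% chance of losing its income value.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing income depends on age, which stays observed.
mar_mask = rng.random(n) < np.where(toy["age"] > 45, 0.5, 0.1)

# MNAR: the chance depends on the income value that goes missing.
high_income = toy["income"] > toy["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.5, 0.1)

# Mean of the income values that are still observed under each mechanism.
observed_means = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_means[name] = toy.loc[~mask, "income"].mean()
    print(f"{name}: observed mean = {observed_means[name]:,.0f} "
          f"(true mean = {true_mean:,.0f})")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR it is biased downward in this setup. The MAR bias can be corrected by conditioning on `age`, while the MNAR bias cannot be corrected from the observed columns alone.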

### Exercise 3.1: Analyze Missing Age Data

In [None]:
# TODO: Calculate percentage of missing age values
age_missing_pct = # YOUR CODE HERE (use df['age'].isnull().mean())
print(f"Missing age values: {age_missing_pct*100:.1f}%")

# Check if age is MCAR, MAR, or MNAR
# Compare survival rates for passengers with/without age data
has_age = df[df['age'].notnull()]['survived'].mean()
no_age = df[df['age'].isnull()]['survived'].mean()

print(f"\nSurvival rate (age known): {has_age*100:.1f}%")
print(f"Survival rate (age missing): {no_age*100:.1f}%")
print(f"\nDifference: {abs(has_age - no_age)*100:.1f} percentage points")

**Question:** Based on the survival rate difference, is the missing age data MCAR, MAR, or MNAR?

Your answer: ___________________________________

### Exercise 3.2: Imputation Strategies

Let's try different ways to fill in missing age values:

In [None]:
# Strategy 1: Mean imputation
# TODO: Fill missing ages with the mean age
df['age_mean_imputed'] = # YOUR CODE HERE (use df['age'].fillna())

print(f"Original mean age: {df['age'].mean():.2f}")
print(f"After mean imputation: {df['age_mean_imputed'].mean():.2f}")

In [None]:
# Strategy 2: Median imputation
# TODO: Fill missing ages with the median age
df['age_median_imputed'] = # YOUR CODE HERE

print(f"Original median age: {df['age'].median():.2f}")
print(f"After median imputation: {df['age_median_imputed'].median():.2f}")

In [None]:
# Strategy 3: Group-based imputation (by passenger class and sex)
# TODO: Fill missing ages with the median age of the same class and gender
df['age_group_imputed'] = df.groupby(['pclass', 'sex'])['age'].transform(
    lambda x: x.fillna(x.median())
)

print("\nMedian age by class and gender:")
print(df.groupby(['pclass', 'sex'])['age'].median())

In [None]:
# Compare the distributions
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['age'], name='Original (with missing)', opacity=0.7))
fig.add_trace(go.Histogram(x=df['age_mean_imputed'], name='Mean Imputed', opacity=0.7))
fig.add_trace(go.Histogram(x=df['age_group_imputed'], name='Group Imputed', opacity=0.7))
fig.update_layout(title='Comparison of Imputation Strategies',
                  xaxis_title='Age',
                  yaxis_title='Count',
                  barmode='overlay')
fig.show()

**Question:** Which imputation strategy preserves the distribution best? Why?

Your answer: ___________________________________

### Exercise 3.3: Handling Missing Cabin Data

In [None]:
# TODO: Check missing cabin data
# Note: seaborn's Titanic copy has no 'cabin' column; it stores the cabin's deck letter in 'deck'
cabin_missing_pct = # YOUR CODE HERE (use df['deck'].isnull().mean())
print(f"Missing cabin values: {cabin_missing_pct*100:.1f}%")

# Create a binary feature: cabin_known (1 if the deck is known, 0 otherwise)
df['cabin_known'] = # YOUR CODE HERE (use df['deck'].notnull().astype(int))

# Check if having cabin information correlates with survival
print("\nSurvival rate by cabin information:")
print(df.groupby('cabin_known')['survived'].mean())

---
## Part 4: Outlier Detection and Handling (20 mins)

### Exercise 4.1: Detect Outliers Using IQR Method

**IQR (Interquartile Range) Method:**
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers: values < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
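
As a quick worked example of the rule above, the sketch below applies it to a small made-up array rather than the Titanic fares. The values and the variable names are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy values; 95 is the planted outlier.
values = np.array([7, 8, 9, 10, 11, 12, 13, 95], dtype=float)

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 x IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", values[is_outlier])
```

Here Q1 = 8.75 and Q3 = 12.25, so IQR = 3.5 and the fences are 3.5 and 17.5. Only 95 falls outside them.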

In [None]:
# TODO: Detect outliers in 'fare' using IQR method
Q1 = # YOUR CODE HERE (use df['fare'].quantile(0.25))
Q3 = # YOUR CODE HERE (use df['fare'].quantile(0.75))
IQR = # YOUR CODE HERE

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1:.2f}")
print(f"Q3: {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"Lower bound: {lower_bound:.2f}")
print(f"Upper bound: {upper_bound:.2f}")

# Identify outliers
outliers = df[(df['fare'] < lower_bound) | (df['fare'] > upper_bound)]
print(f"\nNumber of outliers: {len(outliers)}")

In [None]:
# TODO: Create a box plot to visualize outliers
fig = # YOUR CODE HERE (use px.box())
fig.update_layout(title='Fare Distribution with Outliers',
                  yaxis_title='Fare (£)')
fig.show()

### Exercise 4.2: Z-Score Method for Outlier Detection

In [None]:
# TODO: Calculate Z-scores for fare
# Z-score = (value - mean) / standard deviation
df['fare_zscore'] = # YOUR CODE HERE (use stats.zscore())

# Identify outliers (|Z-score| > 3)
outliers_zscore = df[np.abs(df['fare_zscore']) > 3]
print(f"Number of outliers (|Z-score| > 3): {len(outliers_zscore)}")

# Show the outliers (seaborn's Titanic copy has no 'name' column, so it is not selected here)
print("\nOutliers:")
print(outliers_zscore[['fare', 'pclass', 'fare_zscore']].sort_values('fare', ascending=False))

**Question:** Should we remove these outliers? Why or why not?

Your answer: ___________________________________

### Exercise 4.3: Isolation Forest for Multivariate Outlier Detection

**Isolation Forest:** An ensemble of random trees that scores each point by how many random splits it takes to isolate it. Anomalies tend to be isolated in fewer splits than normal points.

In [None]:
# TODO: Use Isolation Forest to detect outliers
# Select numerical features for analysis
features_for_outliers = ['age_group_imputed', 'fare', 'sibsp', 'parch']
X = df[features_for_outliers].copy()

# Create and fit Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
df['outlier'] = iso_forest.fit_predict(X)
# -1 = outlier, 1 = normal

print(f"Number of outliers detected: {(df['outlier'] == -1).sum()}")
print(f"Percentage: {(df['outlier'] == -1).mean()*100:.1f}%")

In [None]:
# Visualize outliers in 2D space (Age vs Fare)
fig = px.scatter(df, x='age_group_imputed', y='fare', 
                 color=df['outlier'].map({1: 'Normal', -1: 'Outlier'}),
                 title='Outlier Detection with Isolation Forest',
                 labels={'color': 'Status'})
fig.show()

---
## Part 5: Normalization & Standardization (15 mins)

### Why Normalize?
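- Different features have different scales (e.g., age in years vs. fare in pounds)
- Distance-based and gradient-based algorithms (e.g., k-NN, SVM, neural networks) usually perform better with scaled data; tree-based models are largely insensitive to scale
- Prevents features with large values from dominating distance and gradient calculations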
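
The points above can be illustrated with a minimal sketch that computes both scalings by hand. The four passengers in `X` are made-up values, not rows from the dataset. The sketch uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical passengers: columns are age (years) and fare (pounds).
X = np.array([
    [22.0,   7.25],
    [38.0,  71.28],
    [26.0,   7.93],
    [35.0, 512.33],
])

# Min-max scaling: (x - min) / (max - min), per column -> range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Z-score standardization: (x - mean) / std, per column -> mean 0, std 1.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between passengers 0 and 3, before and after scaling.
raw_distance = np.linalg.norm(X[0] - X[3])
scaled_distance = np.linalg.norm(X_minmax[0] - X_minmax[3])
print(f"Raw distance:    {raw_distance:.2f}  (almost entirely fare)")
print(f"Scaled distance: {scaled_distance:.2f}  (age and fare both contribute)")
```

Before scaling, the fare difference of about 505 swamps the age difference of 13. After min-max scaling, both features contribute on a comparable 0-1 scale.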

### Exercise 5.1: Min-Max Normalization (0-1 scaling)

In [None]:
# TODO: Apply Min-Max scaling to age and fare
scaler_minmax = MinMaxScaler()

df['age_normalized'] = scaler_minmax.fit_transform(df[['age_group_imputed']])
df['fare_normalized'] = # YOUR CODE HERE (use scaler_minmax.fit_transform())

print("Original values:")
print(df[['age_group_imputed', 'fare']].describe())
print("\nNormalized values (0-1 range):")
print(df[['age_normalized', 'fare_normalized']].describe())

### Exercise 5.2: Z-Score Standardization (mean=0, std=1)

In [None]:
# TODO: Apply Z-score standardization
scaler_standard = StandardScaler()

df['age_standardized'] = scaler_standard.fit_transform(df[['age_group_imputed']])
df['fare_standardized'] = # YOUR CODE HERE

print("Standardized values (mean≈0, std≈1):")
print(df[['age_standardized', 'fare_standardized']].describe())

In [None]:
# Compare original vs normalized vs standardized
fig = go.Figure()
fig.add_trace(go.Box(y=df['fare'], name='Original Fare'))
fig.add_trace(go.Box(y=df['fare_normalized'], name='Normalized Fare'))
fig.add_trace(go.Box(y=df['fare_standardized'], name='Standardized Fare'))
fig.update_layout(title='Comparison of Scaling Methods',
                  yaxis_title='Value')
fig.show()

---
## Part 6: Categorical Encoding (10 mins)

### Exercise 6.1: One-Hot Encoding

Most machine learning models require numerical input, so categorical variables must be converted to numbers first. One-hot encoding creates one 0/1 indicator column per category.
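
Before working on the real column, a minimal sketch on a made-up `port` column shows what the encoded output looks like. The toy frame and the `port` prefix are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column with three port codes.
ports = pd.DataFrame({"port": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(ports["port"], prefix="port", dtype=int)
print(one_hot)

# drop_first=True drops the first category's column. That column is redundant,
# because it can be inferred from the remaining ones.
reduced = pd.get_dummies(ports["port"], prefix="port", drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Each row has exactly one 1 across the indicator columns. Dropping one column is a common choice for linear models, where fully redundant columns can cause collinearity.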

In [None]:
# TODO: Apply one-hot encoding to 'embarked' column
# Hint: Use pd.get_dummies()

embarked_encoded = # YOUR CODE HERE
print("Original column:")
print(df['embarked'].value_counts())
print("\nOne-hot encoded:")
print(embarked_encoded.head())

In [None]:
# TODO: Encode 'sex' column
# Create a binary encoding: male=1, female=0
df['sex_encoded'] = # YOUR CODE HERE (use df['sex'].map({'male': 1, 'female': 0}))

print("Sex encoding:")
print(df[['sex', 'sex_encoded']].drop_duplicates())

---
## Part 7: Feature Engineering (15 mins)

### Creating New Features

Feature engineering is the art of creating new features from existing data to improve model performance.

### Exercise 7.1: Family Size Feature

In [None]:
# TODO: Create family_size feature
df['family_size'] = # YOUR CODE HERE (df['sibsp'] + df['parch'] + 1)

# Create is_alone feature
df['is_alone'] = # YOUR CODE HERE (1 if family_size==1, else 0)

print("Family size distribution:")
print(df['family_size'].value_counts().sort_index())
print(f"\nPassengers traveling alone: {df['is_alone'].sum()}")

In [None]:
# Analyze survival by family size
family_survival = df.groupby('family_size')['survived'].mean()
fig = px.bar(x=family_survival.index, y=family_survival.values,
             title='Survival Rate by Family Size',
             labels={'x': 'Family Size', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

### Exercise 7.2: Extract Title from Name

In [None]:
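# TODO: Extract title from name (Mr., Mrs., Miss., etc.)
# Note: this needs a 'name' column. seaborn's Titanic copy does not include one,
# so add it from a version of the dataset that has passenger names before running this cell.
df['title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)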

print("Titles found:")
print(df['title'].value_counts())

In [None]:
# TODO: Group rare titles into 'Other'
# Keep only common titles: Mr, Miss, Mrs, Master
common_titles = ['Mr', 'Miss', 'Mrs', 'Master']
df['title_grouped'] = df['title'].apply(
    lambda x: x if x in common_titles else 'Other'
)

print("\nGrouped titles:")
print(df['title_grouped'].value_counts())

In [None]:
# Analyze survival by title
title_survival = df.groupby('title_grouped')['survived'].mean().sort_values(ascending=False)
fig = px.bar(x=title_survival.index, y=title_survival.values,
             title='Survival Rate by Title',
             labels={'x': 'Title', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

### Exercise 7.3: Age Groups

In [None]:
# TODO: Create age groups
# Categories: Child (0-12), Teen (13-19), Adult (20-59), Senior (60+)

bins = [0, 12, 19, 59, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']
df['age_group'] = # YOUR CODE HERE (use pd.cut() on df['age_group_imputed'] so no ages are missing)

print("Age group distribution:")
print(df['age_group'].value_counts())

In [None]:
# Survival by age group
age_group_survival = df.groupby('age_group', observed=False)['survived'].mean()
fig = px.bar(x=age_group_survival.index, y=age_group_survival.values,
             title='Survival Rate by Age Group',
             labels={'x': 'Age Group', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

### Exercise 7.4: Fare Per Person

In [None]:
# TODO: Calculate fare per person (fare / family_size)
df['fare_per_person'] = # YOUR CODE HERE

print("Fare per person statistics:")
print(df['fare_per_person'].describe())

---
## Part 8: Creating the Final Cleaned Dataset (10 mins)

### Exercise 8.1: Select and Prepare Final Features

In [None]:
# TODO: Create final dataset with cleaned and engineered features
final_features = [
    'survived',           # Target variable
    'pclass',            # Original feature
    'sex_encoded',       # Encoded feature
    'age_group_imputed', # Imputed feature
    'fare_per_person',   # Engineered feature
    'family_size',       # Engineered feature
    'is_alone',          # Engineered feature
    'cabin_known'        # Engineered feature
]

df_final = df[final_features].copy()
print(f"Final dataset shape: {df_final.shape}")
print("\nFirst few rows:")
df_final.head()

In [None]:
# TODO: Check for any remaining missing values
print("Missing values in final dataset:")
print(df_final.isnull().sum())

In [None]:
# Summary statistics
print("\nFinal Dataset Summary:")
print(df_final.describe())

---
## Summary & Reflection

### Key Takeaways

Today we learned:
- ✓ How to assess data quality and where raw data sits in the knowledge hierarchy
- ✓ Strategies for handling missing data (mean, median, group-based imputation)
- ✓ Multiple methods for detecting outliers (IQR, Z-score, Isolation Forest)
- ✓ Normalization techniques (Min-Max, Z-score standardization)
- ✓ Categorical encoding (one-hot encoding, binary encoding)
- ✓ Feature engineering to create meaningful new features

### Data Preparation Pipeline Summary

```
Raw Data → Missing Data Handling → Outlier Detection → 
Normalization → Encoding → Feature Engineering → Clean Dataset
```
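
The stages in the diagram can be sketched as a single function applied to a small made-up frame. The `prepare` helper, the toy rows, and the specific choice at each stage (median imputation, capping at the upper IQR fence, min-max scaling) are assumptions for illustration, not the only valid options.

```python
import numpy as np
import pandas as pd

def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: run the pipeline stages in order on a copy."""
    out = raw.copy()

    # 1. Missing data: fill age with the median.
    out["age"] = out["age"].fillna(out["age"].median())

    # 2. Outliers: cap fare at the upper IQR fence instead of dropping rows.
    q1, q3 = out["fare"].quantile([0.25, 0.75])
    out["fare"] = out["fare"].clip(upper=q3 + 1.5 * (q3 - q1))

    # 3. Normalization: min-max scale fare to [0, 1].
    fare_range = out["fare"].max() - out["fare"].min()
    out["fare"] = (out["fare"] - out["fare"].min()) / fare_range

    # 4. Encoding: binary-encode sex.
    out["sex"] = out["sex"].map({"male": 1, "female": 0})

    # 5. Feature engineering: family size = siblings/spouses + parents/children + self.
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    return out

# Made-up rows that mimic a few Titanic columns.
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 26.0, 35.0, 54.0],
    "fare":  [7.25, 71.28, 7.93, 53.10, 512.33],
    "sex":   ["male", "female", "female", "female", "male"],
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0],
})
clean = prepare(toy)
print(clean)
```

Working on a copy keeps the raw data intact, so any stage can be rerun with a different strategy and compared against the original.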

### Reflection Questions

1. Which imputation strategy worked best for the age variable and why?

   Your answer: ___________________________________

2. Why is feature engineering important for machine learning?

   Your answer: ___________________________________

3. What new feature did you find most insightful?

   Your answer: ___________________________________

---
## Bonus Challenge (Optional)

### Create Your Own Feature!

In [None]:
# TODO: Create a new feature that you think might be useful
# for predicting survival. Explain your reasoning!

# YOUR CODE HERE

# Analyze how your feature relates to survival
# YOUR CODE HERE

---
## Resources

- **Scikit-learn Preprocessing:** https://scikit-learn.org/stable/modules/preprocessing.html
- **Missing Data Handling:** https://pandas.pydata.org/docs/user_guide/missing_data.html
- **Feature Engineering Guide:** https://www.kaggle.com/learn/feature-engineering

**See you on Day 6 for Machine Learning!** 🤖