# Day 4: Data Preparation & Feature Engineering - SOLUTIONS

**Duration:** 90 minutes  
**Dataset:** Titanic Passenger Data

## Learning Objectives
- Understand data quality properties and the knowledge hierarchy
- Distinguish structured vs unstructured data
- Handle missing data using different strategies
- Detect and handle outliers
- Apply normalization techniques
- Perform categorical encoding
- Create new features through feature engineering

---

## Part 1: Setup and Data Loading (5 mins)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import IsolationForest

print("✓ Libraries imported successfully!")

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

---
## Part 2: Data Quality Assessment (15 mins)

### The Knowledge Hierarchy
- **Data:** Raw facts (e.g., "Age: 22")
- **Information:** Processed data (e.g., "Average age is 29.7")
- **Knowledge:** Understanding patterns (e.g., "Younger passengers had better survival rates")
- **Wisdom:** Applying knowledge (e.g., "Prioritize evacuating children in emergencies")

### Exercise 2.1: Data Quality Properties

In [None]:
# SOLUTION: Check for missing values
# We use df.isnull().sum() to count missing values in each column

missing_values = df.isnull().sum()
print("Missing Values per Column:")
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

**Interpretation:** 
- The `age` column has significant missing data (about 20%)
- The `deck` column has a very high missing percentage (about 77%); seaborn's copy of the dataset exposes the deck letter rather than a raw `cabin` column
- The `embarked` column has minimal missing data (only 2 values)
- Understanding missing data patterns is crucial for deciding how to handle them

In [None]:
# SOLUTION: Visualize missing data
# Create a heatmap showing where data is missing
# This helps us see patterns in missing data across different passengers

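# Cast to int so the heatmap gets numeric values: 1 = missing, 0 = present
missing_data = df.isnull().astype(int)
fig = px.imshow(missing_data.T, 
                labels=dict(x="Passenger", y="Feature", color="Missing"),
                color_continuous_scale='gray',  # 0 -> black, 1 -> white, matching the title
                aspect='auto',                  # stretch the wide, short image to fill the figure
                title="Missing Data Heatmap (White = Missing)")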
fig.show()

**Interpretation:**
- The heatmap reveals that `deck` is missing for most passengers, a systematic rather than scattered pattern
- `age` has scattered missing values across passengers
- This visualization helps identify if missing data is random or follows a pattern

### Exercise 2.2: Structured vs Unstructured Data

**Structured Data:** Organized in tables with rows and columns (like our Titanic dataset)  
**Unstructured Data:** No predefined format (e.g., text, images, audio)

In [None]:
# SOLUTION: Identify different data types in our dataset
# Understanding data types helps us choose appropriate preprocessing methods

print("Data Types:")
print(df.dtypes)

# Separate numerical and categorical features
# Numerical features can be used in calculations; categorical need encoding
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"\nNumerical features: {numerical_cols}")
print(f"Categorical features: {categorical_cols}")

**Explanation:**
- **Numerical features** (int64, float64): Can be directly used in mathematical operations
  - Examples: age, fare, sibsp, parch
- **Categorical features** (object, category): Need encoding before use in ML models
  - Examples: sex, embarked, class, who
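- Different preprocessing techniques apply to each type
- Boolean columns (e.g., `adult_male`, `alone`) match neither selector above, so they appear in neither list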

---
## Part 3: Handling Missing Data (20 mins)

### Missing Data Mechanisms
- **MCAR** (Missing Completely At Random): No pattern
- **MAR** (Missing At Random): Related to other observed variables
- **MNAR** (Missing Not At Random): Related to the missing value itself
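
To make the difference between MCAR and MAR concrete, here is a minimal simulation sketch. It uses synthetic data, not the Titanic set, and the missing-rate values (20% overall, 5%/10%/40% by class) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (assumption): a class label and an age column
toy = pd.DataFrame({
    "pclass": rng.choice([1, 2, 3], size=n),
    "age": rng.normal(30, 10, size=n),
})

# MCAR: every row has the same 20% chance of losing its age
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of losing age depends on an observed column (pclass)
p_missing = toy["pclass"].map({1: 0.05, 2: 0.10, 3: 0.40})
mar_mask = rng.random(n) < p_missing

# Missing rate per class: roughly flat under MCAR, class-dependent under MAR
rates = pd.DataFrame({
    "mcar": pd.Series(mcar_mask).groupby(toy["pclass"]).mean(),
    "mar": pd.Series(mar_mask).groupby(toy["pclass"]).mean(),
})
print(rates.round(3))
```

Comparing missing rates across an observed variable, as in the last step, is the same kind of check Exercise 3.1 performs with survival. MNAR cannot be demonstrated this way, because it depends on the values that are missing.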

### Exercise 3.1: Analyze Missing Age Data

In [None]:
# SOLUTION: Calculate percentage of missing age values
# .isnull().mean() gives us the proportion of missing values
age_missing_pct = df['age'].isnull().mean()
print(f"Missing age values: {age_missing_pct*100:.1f}%")

# Check if age is MCAR, MAR, or MNAR
# Compare survival rates for passengers with/without age data
# If there's a difference, the missing data is NOT completely random
has_age = df[df['age'].notnull()]['survived'].mean()
no_age = df[df['age'].isnull()]['survived'].mean()

print(f"\nSurvival rate (age known): {has_age*100:.1f}%")
print(f"Survival rate (age missing): {no_age*100:.1f}%")
print(f"\nDifference: {abs(has_age - no_age)*100:.1f} percentage points")

**Question:** Based on the survival rate difference, is the missing age data MCAR, MAR, or MNAR?

**Answer:** The missing age data is most likely **MAR (Missing At Random)**, although **MNAR (Missing Not At Random)** cannot be excluded from the observed data alone. The survival rate difference (about 11 percentage points: roughly 41% when age is known vs 29% when it is missing) suggests that whether age is recorded is related to other factors. For example:
- Third-class passengers may have had less thorough record-keeping (MAR - related to passenger class)
- Passengers who boarded at certain ports may be less likely to have age recorded
- Since there's a systematic difference in outcomes, MCAR is unlikely

**Why this matters:** Understanding the missing data mechanism helps us choose appropriate imputation strategies. For MAR data, group-based imputation (using the related observed features) is often better than simple mean imputation. MNAR is harder: no imputation from observed columns fully corrects it.

### Exercise 3.2: Imputation Strategies

Let's try different ways to fill in missing age values:

In [None]:
# Strategy 1: Mean imputation
# SOLUTION: Fill missing ages with the mean age
# Simple but can distort the distribution by adding many values at the center
df['age_mean_imputed'] = df['age'].fillna(df['age'].mean())

print(f"Original mean age: {df['age'].mean():.2f}")
print(f"After mean imputation: {df['age_mean_imputed'].mean():.2f}")

**Explanation:**
- Mean imputation replaces all missing values with the average
- **Pros:** Simple, preserves the mean
- **Cons:** Reduces variance, creates artificial peak at the mean, ignores relationships with other variables

In [None]:
# Strategy 2: Median imputation
# SOLUTION: Fill missing ages with the median age
# More robust to outliers than mean imputation
df['age_median_imputed'] = df['age'].fillna(df['age'].median())

print(f"Original median age: {df['age'].median():.2f}")
print(f"After median imputation: {df['age_median_imputed'].median():.2f}")

**Explanation:**
- Median imputation uses the middle value
- **Pros:** Less affected by outliers than mean, still simple
- **Cons:** Still reduces variance, ignores relationships with other variables

In [None]:
# Strategy 3: Group-based imputation (by passenger class and sex)
# SOLUTION: Fill missing ages with the median age of the same class and gender
# This preserves relationships between variables
# For example: First-class females tend to be older than third-class males
df['age_group_imputed'] = df.groupby(['pclass', 'sex'])['age'].transform(
    lambda x: x.fillna(x.median())
)

print("\nMedian age by class and gender:")
print(df.groupby(['pclass', 'sex'])['age'].median())

**Explanation:**
- Group-based imputation considers that different groups have different age distributions
- For example: 1st class male passengers had median age of ~40, while 3rd class females had median age of ~21
- **Pros:** Preserves relationships between variables, more realistic values
- **Cons:** More complex, requires identifying relevant grouping variables

In [None]:
# Compare the distributions
fig = go.Figure()
fig.add_trace(go.Histogram(x=df['age'], name='Original (with missing)', opacity=0.7))
fig.add_trace(go.Histogram(x=df['age_mean_imputed'], name='Mean Imputed', opacity=0.7))
fig.add_trace(go.Histogram(x=df['age_group_imputed'], name='Group Imputed', opacity=0.7))
fig.update_layout(title='Comparison of Imputation Strategies',
                  xaxis_title='Age',
                  yaxis_title='Count',
                  barmode='overlay')
fig.show()

**Question:** Which imputation strategy preserves the distribution best? Why?

**Answer:** Of the three strategies, **group-based imputation** preserves the distribution best because:

1. **Retains more variance:** Imputed values are spread over six group medians instead of piling up at a single value
2. **Preserves relationships:** Accounts for the fact that age distributions differ by passenger class and gender
3. **More realistic:** Each imputed value is the median age of passengers with the same class and sex
4. **Avoids one large spike:** Mean imputation places all 177 missing ages at ~30, which is unrealistic

You can see in the histogram that:
- Mean imputation creates an artificial spike at the mean age (~30)
- Group imputation produces several smaller spikes at the group medians, so the spread stays closer to the original
- Group imputation better represents the underlying population
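
The variance argument can be checked numerically. The sketch below uses a small made-up table (the `group` labels and ages are assumptions, not Titanic values) and compares the standard deviation after each strategy.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two groups with clearly different typical ages
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "age": [38.0, 40.0, 42.0, np.nan, np.nan, 20.0, 22.0, 24.0, np.nan, np.nan],
})

# Strategy 1: one global mean for every missing value
mean_imputed = toy["age"].fillna(toy["age"].mean())

# Strategy 3: the median of the row's own group
group_imputed = toy.groupby("group")["age"].transform(lambda s: s.fillna(s.median()))

spread = pd.Series({
    "observed only": toy["age"].std(),
    "mean imputed": mean_imputed.std(),
    "group-median imputed": group_imputed.std(),
})
print(spread.round(2))
```

In this toy case both strategies shrink the spread relative to the observed values, but group-median imputation shrinks it far less than mean imputation does.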

### Exercise 3.3: Handling Missing Cabin Data

In [None]:
# SOLUTION: Check missing cabin data
# Note: seaborn's copy of the dataset has no raw `cabin` column;
# it provides `deck` (the deck letter of the cabin), which we use instead
cabin_missing_pct = df['deck'].isnull().mean()
print(f"Missing cabin values: {cabin_missing_pct*100:.1f}%")

# Create a binary feature: cabin_known (1 if cabin is known, 0 otherwise)
# When >70% of data is missing, imputation isn't useful
# Instead, we create a feature that captures whether the information was recorded
df['cabin_known'] = df['deck'].notnull().astype(int)

# Check if having cabin information correlates with survival
print("\nSurvival rate by cabin information:")
print(df.groupby('cabin_known')['survived'].mean())

**Explanation:**
- With 77% missing data, imputing cabin numbers would be unreliable
- However, WHETHER cabin data exists is highly predictive!
- Passengers with known cabins had ~67% survival vs ~30% for unknown
- This makes sense: cabin assignments were more complete for first-class passengers
- **Key insight:** Sometimes the presence/absence of data is more informative than the data itself!
- This is an example of **feature engineering from missing data patterns**
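
The same idea generalizes to any column with heavy missingness. Below is a minimal sketch of a reusable helper; the function name `add_missing_indicators` and the `_known` suffix are hypothetical choices, not part of pandas.

```python
import pandas as pd

def add_missing_indicators(frame, columns, suffix="_known"):
    """Return a copy of `frame` with one 0/1 column per entry in `columns`:
    1 if the value was recorded, 0 if it is missing."""
    out = frame.copy()
    for col in columns:
        out[f"{col}{suffix}"] = out[col].notnull().astype(int)
    return out

# Tiny made-up example
toy = pd.DataFrame({
    "deck": ["C", None, None, "E"],
    "survived": [1, 0, 0, 1],
})
toy = add_missing_indicators(toy, ["deck"])
print(toy.groupby("deck_known")["survived"].mean())
```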

---
## Part 4: Outlier Detection and Handling (20 mins)

### Exercise 4.1: Detect Outliers Using IQR Method

**IQR (Interquartile Range) Method:**
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers: values < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
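
As a worked example of these formulas, the sketch below applies them to ten made-up numbers. The helper name `iqr_bounds` is hypothetical; `np.percentile` uses linear interpolation by default, the same default as pandas' `quantile`.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]

# Q1 = 3.25, Q3 = 7.75, IQR = 4.5 -> fences at -3.5 and 14.5
print(lower, upper, outliers)
```

Only the value 100 falls outside the fences, so it is the single outlier in this sample.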

In [None]:
# SOLUTION: Detect outliers in 'fare' using IQR method
# The IQR method is robust and commonly used for identifying outliers
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1:.2f}")
print(f"Q3: {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"Lower bound: {lower_bound:.2f}")
print(f"Upper bound: {upper_bound:.2f}")

# Identify outliers
outliers = df[(df['fare'] < lower_bound) | (df['fare'] > upper_bound)]
print(f"\nNumber of outliers: {len(outliers)}")

**Explanation:**
- **Q1 (25th percentile):** 25% of passengers paid less than this
- **Q3 (75th percentile):** 75% of passengers paid less than this
- **IQR:** The range containing the middle 50% of data
- **1.5×IQR rule:** Standard threshold from Tukey's method
- Values beyond Q1-1.5×IQR or Q3+1.5×IQR are considered outliers
- This method is less sensitive to extreme values than mean-based methods

In [None]:
# SOLUTION: Create a box plot to visualize outliers
# Box plots are specifically designed to show the IQR and outliers
fig = px.box(df, y='fare', 
             title='Fare Distribution with Outliers',
             labels={'fare': 'Fare (£)'})
fig.update_layout(yaxis_title='Fare (£)')
fig.show()

**Interpretation:**
- The box shows the IQR (25th to 75th percentile)
- The line inside the box is the median
- The whiskers extend to the most extreme data points that lie within 1.5×IQR of the box
- Points beyond the whiskers are outliers
- We can see several high-fare outliers (expensive first-class tickets)
- These outliers are real data points, not errors!

### Exercise 4.2: Z-Score Method for Outlier Detection

In [None]:
# SOLUTION: Calculate Z-scores for fare
# Z-score measures how many standard deviations away from the mean a value is
# Z-score = (value - mean) / standard deviation
df['fare_zscore'] = stats.zscore(df['fare'], nan_policy='omit')

# Identify outliers (|Z-score| > 3)
# The "3-sigma rule": values beyond 3 standard deviations are rare (<0.3% in normal distribution)
outliers_zscore = df[np.abs(df['fare_zscore']) > 3]
print(f"Number of outliers (|Z-score| > 3): {len(outliers_zscore)}")

# Show the outliers
# (seaborn's copy of the dataset has no `name` column, so we show class and sex instead)
print("\nOutliers:")
print(outliers_zscore[['fare', 'pclass', 'sex', 'fare_zscore']].sort_values('fare', ascending=False))

**Explanation:**
- **Z-score formula:** (value - mean) / std
- **Interpretation:** 
  - Z-score = 0: value is at the mean
  - Z-score = 1: value is 1 standard deviation above mean
  - Z-score = -2: value is 2 standard deviations below mean
- **Threshold of 3:** In a normal distribution, 99.7% of data falls within ±3 standard deviations
- **Assumption:** The Z-score method assumes data is approximately normally distributed; `fare` is strongly right-skewed, so treat the threshold of 3 as a rough guide here
- **Findings:** The outliers are mostly first-class passengers with expensive tickets (e.g., fare > £200)
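
One detail worth knowing when reproducing the formula by hand: `stats.zscore` divides by the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on a few illustrative fare-like values, shows the two conventions giving slightly different scores.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative values only, not a sample drawn from the dataset
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 71.28, 512.33])

z_scipy = stats.zscore(fares)                           # population std (ddof=0)
z_manual = (fares - fares.mean()) / fares.std(ddof=0)   # same formula by hand
z_sample = (fares - fares.mean()) / fares.std()         # pandas default, ddof=1

print(np.allclose(z_scipy, z_manual))   # the two ddof=0 versions agree
print(np.allclose(z_scipy, z_sample))   # the ddof=1 version differs
```

With 891 rows the gap between the two conventions is tiny, but on small samples it can move values across the threshold.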

**Question:** Should we remove these outliers? Why or why not?

**Answer:** **No, we should NOT remove these outliers** for several reasons:

1. **They are legitimate data points:** These are real passengers who paid expensive fares for first-class tickets
2. **Domain knowledge:** In the Titanic context, expensive tickets for luxury suites are expected
3. **Predictive value:** High fares are associated with higher survival rates (first-class passengers had priority for lifeboats)
4. **Sample size:** We have a relatively small dataset (891 passengers), so removing data reduces our ability to learn patterns
5. **Alternative approaches:**
   - Use robust algorithms that handle outliers well (e.g., tree-based models)
   - Apply log transformation to reduce the impact of extreme values
   - Cap values at a reasonable threshold rather than removing them

**When to remove outliers:**
- Data entry errors (e.g., age = 999)
- Measurement errors
- Data from a different population

**General rule:** Only remove outliers if you have strong evidence they are errors, not just because they are unusual!
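
The two alternatives to removal mentioned above, log transformation and capping, can be sketched as follows. The values are illustrative, and the 95th percentile cap is an arbitrary assumption rather than a recommended setting.

```python
import numpy as np
import pandas as pd

# Illustrative fare-like values, including a zero fare
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.55, 71.28, 263.0, 512.33])

# Option 1: log transform. log1p(x) = log(1 + x), so a fare of 0 stays defined.
fare_log = np.log1p(fares)

# Option 2: cap (winsorize) at an upper percentile instead of dropping rows.
cap = fares.quantile(0.95)
fare_capped = fares.clip(upper=cap)

print(pd.DataFrame({"fare": fares, "log1p": fare_log, "capped": fare_capped}).round(2))
```

Both options keep every row and preserve the ordering of the values, which is the point: the extreme fares still rank highest, they just pull less weight.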

### Exercise 4.3: Isolation Forest for Multivariate Outlier Detection

**Isolation Forest:** Machine learning algorithm that detects anomalies by isolating outliers

In [None]:
# SOLUTION: Use Isolation Forest to detect outliers
# Select numerical features for analysis
features_for_outliers = ['age_group_imputed', 'fare', 'sibsp', 'parch']
X = df[features_for_outliers].copy()

# Create and fit Isolation Forest
# contamination=0.1 means we expect ~10% of data to be outliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)
df['outlier'] = iso_forest.fit_predict(X)
# -1 = outlier, 1 = normal

print(f"Number of outliers detected: {(df['outlier'] == -1).sum()}")
print(f"Percentage: {(df['outlier'] == -1).mean()*100:.1f}%")

**Explanation:**
- **Isolation Forest** is an unsupervised learning algorithm for anomaly detection
- **How it works:** 
  - Builds random decision trees
  - Outliers are easier to "isolate" (require fewer splits) than normal points
  - Points that are isolated quickly are classified as outliers
- **Advantages over IQR/Z-score:**
  - Works with multiple features simultaneously (multivariate)
  - Doesn't assume normal distribution
  - Can detect complex outlier patterns
- **contamination parameter:** Expected proportion of outliers (we set to 10%)
- **Use case:** Detecting passengers with unusual combinations of features (e.g., old age + large family + high fare)
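
To see the isolation idea on something smaller than the Titanic data, here is a sketch with 200 synthetic points plus one planted anomaly. The data, the planted point, and `contamination=0.05` are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 200 points around the origin, plus one point far away (the last row)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])

forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X)      # -1 = outlier, 1 = normal
scores = forest.score_samples(X)    # lower score = easier to isolate

print("flagged as outliers:", (labels == -1).sum())
print("label of planted point:", labels[-1])
```

Note that `contamination` fixes roughly how many points get flagged, so some ordinary points near the edge of the cloud are labeled outliers too. The score from `score_samples` is often more informative than the hard label.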

In [None]:
# Visualize outliers in 2D space (Age vs Fare)
fig = px.scatter(df, x='age_group_imputed', y='fare', 
                 color=df['outlier'].map({1: 'Normal', -1: 'Outlier'}),
                 title='Outlier Detection with Isolation Forest',
                 labels={'color': 'Status'})
fig.show()

**Interpretation:**
- Points labeled "Outlier" in the legend are passengers with unusual combinations of features
- Notice outliers are not just at extreme values of a single feature
- They are points that are "isolated" from the main clusters
- Examples: very old passengers with high fares, or unusual family configurations

---
## Part 5: Normalization & Standardization (15 mins)

### Why Normalize?
- Different features have different scales
- Many ML algorithms perform better with normalized data
- Prevents features with large values from dominating

### Exercise 5.1: Min-Max Normalization (0-1 scaling)

In [None]:
# SOLUTION: Apply Min-Max scaling to age and fare
# Formula: (x - min) / (max - min)
# This scales all values to the range [0, 1]
scaler_minmax = MinMaxScaler()

df['age_normalized'] = scaler_minmax.fit_transform(df[['age_group_imputed']])
# Create a new scaler for fare to avoid mixing the min/max from age
scaler_minmax_fare = MinMaxScaler()
df['fare_normalized'] = scaler_minmax_fare.fit_transform(df[['fare']])

print("Original values:")
print(df[['age_group_imputed', 'fare']].describe())
print("\nNormalized values (0-1 range):")
print(df[['age_normalized', 'fare_normalized']].describe())

**Explanation:**
- **Min-Max Scaling:** Transforms data to a fixed range [0, 1]
- **Formula:** normalized_value = (value - min) / (max - min)
- **Properties:**
  - Minimum value → 0
  - Maximum value → 1
  - Preserves the shape of the distribution
  - Preserves relationships between values
- **When to use:**
  - When you need bounded values (e.g., for neural networks)
  - When features have different units but similar distributions
- **Limitation:** Sensitive to outliers (they compress the rest of the data)
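
The outlier limitation is easy to see by applying the formula by hand. In this sketch the helper `min_max_scale` is hypothetical, and mapping a constant column to zeros is an assumption made to avoid dividing by zero.

```python
import numpy as np

def min_max_scale(values):
    """Scale values to [0, 1] with (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)  # assumption: constant column -> all zeros
    return (values - values.min()) / span

without_outlier = min_max_scale([10, 20, 30, 40])
with_outlier = min_max_scale([10, 20, 30, 40, 500])

print(without_outlier.round(3))
print(with_outlier.round(3))
```

Adding the single value 500 squeezes the first four values from an even spread over [0, 1] into the bottom few percent of the range.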

### Exercise 5.2: Z-Score Standardization (mean=0, std=1)

In [None]:
# SOLUTION: Apply Z-score standardization
# Formula: (x - mean) / std
# This centers data at 0 with standard deviation of 1
scaler_standard = StandardScaler()

df['age_standardized'] = scaler_standard.fit_transform(df[['age_group_imputed']])
# Create a new scaler for fare
scaler_standard_fare = StandardScaler()
df['fare_standardized'] = scaler_standard_fare.fit_transform(df[['fare']])

print("Standardized values (mean≈0, std≈1):")
print(df[['age_standardized', 'fare_standardized']].describe())

**Explanation:**
- **Standardization:** Transforms data to have mean=0 and standard deviation=1
- **Formula:** standardized_value = (value - mean) / std
- **Properties:**
  - Mean becomes 0 (or very close to 0)
  - Standard deviation becomes 1
  - No fixed minimum or maximum
  - Preserves the shape of the distribution
- **When to use:**
  - Most common for ML algorithms (especially those using distance metrics)
  - When features have different units and scales
  - For algorithms like SVM, KNN, linear regression, PCA
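- **Advantage over Min-Max:** Somewhat less sensitive to outliers, since a single extreme value does not define the whole output range (it still shifts the mean and inflates the std)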

In [None]:
# Compare original vs normalized vs standardized
fig = go.Figure()
fig.add_trace(go.Box(y=df['fare'], name='Original Fare'))
fig.add_trace(go.Box(y=df['fare_normalized'], name='Normalized Fare'))
fig.add_trace(go.Box(y=df['fare_standardized'], name='Standardized Fare'))
fig.update_layout(title='Comparison of Scaling Methods',
                  yaxis_title='Value')
fig.show()

**Visual Comparison:**
- **Original:** Wide range, outliers clearly visible
- **Normalized:** Compressed to [0, 1]; outliers keep the same relative positions, they only look smaller on the shared axis
- **Standardized:** Centered at 0, most values between -3 and 3

**Key Insight:** Both scaling methods are linear transformations, so they preserve the relative relationships between data points - they just change the scale!

---
## Part 6: Categorical Encoding (10 mins)

### Exercise 6.1: One-Hot Encoding

Machine learning models need numerical input. We need to convert categorical variables!

In [None]:
# SOLUTION: Apply one-hot encoding to 'embarked' column
# One-hot encoding creates a binary column for each category
embarked_encoded = pd.get_dummies(df['embarked'], prefix='embarked', drop_first=False)
print("Original column:")
print(df['embarked'].value_counts())
print("\nOne-hot encoded:")
print(embarked_encoded.head())

**Explanation:**
- **One-Hot Encoding:** Creates binary (0/1) columns for each category
- **Example:** 
  - Original: embarked = 'S'
  - Encoded: embarked_S=1, embarked_C=0, embarked_Q=0
- **Why it works:** 
  - Doesn't assume any ordering between categories
  - Each category gets its own feature
- **drop_first parameter:**
  - drop_first=True removes one column to avoid multicollinearity
  - drop_first=False keeps all columns (easier to interpret)
- **When to use:** 
  - Nominal categories (no inherent order): sex, embarked, who
  - Works best with features that have few categories (<10)
- **Limitation:** Creates many columns if there are many categories ("curse of dimensionality")
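
The effect of `drop_first` is easiest to see side by side. This sketch uses four made-up port codes and passes `dtype=int` so the output shows 0/1 rather than booleans.

```python
import pandas as pd

ports = pd.Series(["S", "C", "Q", "S"], name="embarked")

full = pd.get_dummies(ports, prefix="embarked", dtype=int)
reduced = pd.get_dummies(ports, prefix="embarked", drop_first=True, dtype=int)

print(full)
print(reduced)
```

With `drop_first=True` the first category in sorted order (`C`) loses its column, and a passenger from that port is represented by zeros in all remaining columns.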

In [None]:
# SOLUTION: Encode 'sex' column
# Create a binary encoding: male=1, female=0
# For binary categories, simple mapping is often clearer than one-hot encoding
df['sex_encoded'] = df['sex'].map({'male': 1, 'female': 0})

print("Sex encoding:")
print(df[['sex', 'sex_encoded']].drop_duplicates())

**Explanation:**
- **Binary Encoding:** For features with only 2 categories, use simple 0/1 encoding
- **Advantages:**
  - Uses only 1 column instead of 2 (more efficient)
  - Easier to interpret coefficients in linear models
  - Same information content as one-hot encoding for binary variables
- **Convention:** Usually encode as 0/1, but the choice of which is 0 vs 1 is arbitrary
- **Alternative:** Could also use -1/1 encoding for some algorithms
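
One caveat with `.map()` and a dictionary: any label that is not a key becomes `NaN` silently. The sketch below uses made-up values with an inconsistent spelling to show the effect and one possible guard.

```python
import pandas as pd

sex = pd.Series(["male", "female", "Female", None])

# "Female" is not a dictionary key, so it maps to NaN just like the missing value
encoded = sex.map({"male": 1, "female": 0})
print(encoded.isna().sum())

# Lower-casing first recovers the mislabeled row; the truly missing row stays NaN
encoded_clean = sex.str.lower().map({"male": 1, "female": 0})
print(encoded_clean.isna().sum())
```

Checking `isna().sum()` right after a mapping step is a cheap way to catch unexpected category labels.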

---
## Part 7: Feature Engineering (15 mins)

### Creating New Features

Feature engineering is the art of creating new features from existing data to improve model performance.

### Exercise 7.1: Family Size Feature

In [None]:
# SOLUTION: Create family_size feature
# SibSp = number of siblings/spouses aboard
# Parch = number of parents/children aboard
# family_size = SibSp + Parch + 1 (the +1 is for the passenger themselves)
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Create is_alone feature
# A passenger is alone if family_size == 1
df['is_alone'] = (df['family_size'] == 1).astype(int)

print("Family size distribution:")
print(df['family_size'].value_counts().sort_index())
print(f"\nPassengers traveling alone: {df['is_alone'].sum()}")

**Explanation:**
- **family_size:** Combines sibsp and parch into a single meaningful feature
  - sibsp: siblings and spouses aboard
  - parch: parents and children aboard
  - +1: counts the passenger themselves
- **is_alone:** Binary feature for solo travelers
  - Hypothesis: Solo travelers may have different survival patterns
  - Easier to interpret than family_size in some contexts
- **Why this helps:**
  - Reduces 2 features (sibsp, parch) to 1 (family_size) - simpler model
  - May have non-linear relationship: small families had better survival than very large or solo
  - Captures domain knowledge: family groups may stick together during evacuation

In [None]:
# Analyze survival by family size
family_survival = df.groupby('family_size')['survived'].mean()
fig = px.bar(x=family_survival.index, y=family_survival.values,
             title='Survival Rate by Family Size',
             labels={'x': 'Family Size', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

**Key Insights:**
- **Solo travelers (family_size=1):** ~30% survival - may have lacked help/coordination
- **Small families (2-4):** ~50-70% survival - optimal for group coordination
- **Large families (5+):** Lower survival - may have been harder to keep together during evacuation
- **Non-linear relationship:** This is why combining sibsp and parch into family_size is valuable

### Exercise 7.2: Extract Title from Name

In [None]:
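# SOLUTION: Extract title from name (Mr., Mrs., Miss., etc.)
# Names follow the pattern: "Surname, Title. Firstname"
# We use regex to extract the title between space and period
# Note: this cell needs a `name` column. seaborn's copy of the dataset does not
# include one; the Kaggle Titanic CSV provides it as `Name`.
# The pattern is a raw string (r'...') so that \. is not treated as a string escape
df['title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)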

print("Titles found:")
print(df['title'].value_counts())

**Explanation:**
- **Regex pattern:** ` ([A-Za-z]+)\.`
  - Space before the title
  - Captures one or more letters
  - Followed by a period
- **Why extract titles:**
  - Captures social status (Mr., Mrs., Master, Dr., Rev.)
  - Indicates age group (Master for boys, Miss for unmarried women)
  - May correlate with survival (women and children first)
- **Titles found:**
  - Common: Mr (men), Miss (unmarried women), Mrs (married women), Master (boys)
  - Rare: Dr, Rev, Col, Major, Countess, etc.
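
The pattern can be tried out on its own with Python's `re` module. The names below are hypothetical strings in the "Surname, Title. Firstname" layout, and `extract_title` is a helper written for this sketch.

```python
import re

TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first word that follows a space and ends with a period."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Hypothetical names, not rows from the dataset
samples = [
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Major)",
    "Poe, Master. Tim",
    "Noble, the Countess. of (Ann Example)",
    "No Title Here",
]
titles = [extract_title(name) for name in samples]
print(titles)
```

A name without any period returns `None`, which corresponds to `NaN` in the pandas version, so it is worth checking for missing titles after extraction.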

In [None]:
# SOLUTION: Group rare titles into 'Other'
# Keep only common titles: Mr, Miss, Mrs, Master
# This prevents overfitting on rare categories
common_titles = ['Mr', 'Miss', 'Mrs', 'Master']
df['title_grouped'] = df['title'].apply(
    lambda x: x if x in common_titles else 'Other'
)

print("\nGrouped titles:")
print(df['title_grouped'].value_counts())

**Explanation:**
- **Why group rare titles:**
  - Rare categories (e.g., Countess, Don) have few examples
  - Models may overfit or fail to generalize from single examples
  - Grouping increases sample size for 'Other' category
- **Alternative approaches:**
  - Could group by semantics: nobility (Sir, Lady, Countess), professional (Dr, Rev), military (Col, Major)
  - Could drop rare titles entirely
- **Trade-off:** Lose some information, but gain stability
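
Instead of hard-coding the list of common titles, rare categories can also be picked by frequency. This is a sketch of that alternative, not the approach used in the solution above; the helper `group_rare` and the threshold of 10 are assumptions.

```python
import pandas as pd

def group_rare(series, min_count=10, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

# Made-up title counts
titles = pd.Series(["Mr"] * 12 + ["Miss"] * 10 + ["Dr"] * 2 + ["Rev"])
grouped = group_rare(titles, min_count=10)
print(grouped.value_counts())
```

A frequency threshold adapts automatically if the data changes, at the cost of a cutoff that still has to be chosen by hand.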

In [None]:
# Analyze survival by title
title_survival = df.groupby('title_grouped')['survived'].mean().sort_values(ascending=False)
fig = px.bar(x=title_survival.index, y=title_survival.values,
             title='Survival Rate by Title',
             labels={'x': 'Title', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

**Key Insights:**
- **Mrs (married women):** ~80% survival - highest rate, prioritized in "women and children first"
- **Miss (unmarried women):** ~70% survival - also prioritized
- **Master (boys):** ~60% survival - children were prioritized
- **Mr (men):** ~16% survival - lowest rate, given last priority
- **Other (rare titles):** Mixed results, often included wealthy/noble passengers

**Why this feature is powerful:**
- Captures both gender and social status in one feature
- More predictive than gender alone
- Shows clear pattern aligned with historical accounts of evacuation priority

### Exercise 7.3: Age Groups

In [None]:
# SOLUTION: Create age groups
# Categories: Child (0-12), Teen (13-19), Adult (20-59), Senior (60+)
# Binning continuous variables can capture non-linear relationships

bins = [0, 12, 19, 59, 100]
labels = ['Child', 'Teen', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age_group_imputed'], bins=bins, labels=labels)

print("Age group distribution:")
print(df['age_group'].value_counts())

**Explanation:**
- **pd.cut():** Converts continuous variable into categorical bins
- **Bin choices:**
  - Child (0-12): Young children, "children first" policy
  - Teen (13-19): Adolescents, between child and adult
  - Adult (20-59): Working-age adults
  - Senior (60+): Elderly passengers
- **Why bin age:**
  - May have non-linear relationship with survival
  - Different age groups may have been treated differently
  - More interpretable than continuous age
  - Can handle imputed ages without assuming exact precision
- **Trade-offs:**
  - Loses information within bins (e.g., 25 vs 35 are treated the same)
  - Bin boundaries are somewhat arbitrary
  - Creates categorical variable that needs encoding
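
Because `pd.cut` builds right-closed intervals by default, the bins above are really (0, 12], (12, 19], (19, 59] and (59, 100]. The sketch below checks a few made-up boundary ages to show where each one lands.

```python
import pandas as pd

bins = [0, 12, 19, 59, 100]
labels = ["Child", "Teen", "Adult", "Senior"]

# Boundary cases, including fractional ages and an age of exactly 0
ages = pd.Series([0, 0.42, 12, 12.5, 19, 19.5, 59, 60])

default_groups = pd.cut(ages, bins=bins, labels=labels)
# include_lowest=True closes the first bin on the left, so age 0 is kept
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_groups, "include_lowest": inclusive_groups}))
```

An age of exactly 0 falls outside the first interval and becomes `NaN` unless `include_lowest=True` is set, and fractional ages such as 19.5 land in the bin above the boundary.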

In [None]:
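# Survival by age group
# observed=False is stated explicitly because age_group is categorical;
# recent pandas versions warn when it is left implicit
age_group_survival = df.groupby('age_group', observed=False)['survived'].mean()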
fig = px.bar(x=age_group_survival.index, y=age_group_survival.values,
             title='Survival Rate by Age Group',
             labels={'x': 'Age Group', 'y': 'Survival Rate'})
fig.update_layout(yaxis_tickformat='.0%')
fig.show()

**Key Insights:**
- **Children:** ~60% survival - "children first" policy clearly visible
- **Teens:** ~40% survival - lower priority than young children
- **Adults:** ~40% survival - varied by gender (women higher, men lower)
- **Seniors:** ~30% survival - may have had mobility challenges

**Interpretation:**
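- The drop is steepest between children and teens and then flattens, a pattern that age groups capture better than a single linear age term
- Consistent with our binning strategy, although the exact bin edges remain a judgment call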
- Interaction with gender likely important (adult women vs adult men)

### Exercise 7.4: Fare Per Person

In [None]:
# SOLUTION: Calculate fare per person (fare / family_size)
# Some tickets were purchased for entire families
# Fare per person better represents individual wealth/class
df['fare_per_person'] = df['fare'] / df['family_size']

print("Fare per person statistics:")
print(df['fare_per_person'].describe())

**Explanation:**
- **Problem with raw fare:**
  - Families often purchased tickets together
  - A family of 4 paying £80 (£20 each) is different from a solo traveler paying £80
  - Raw fare confounds family size with wealth/class
- **fare_per_person solution:**
  - Divides total fare by family size
  - Better represents individual spending/class
  - Normalizes for group ticket purchases
- **Why this helps:**
  - More accurate proxy for individual wealth
  - Removes confounding with family size
  - May be more predictive than raw fare
- **Example:**
  - Family of 4, fare £80 → £20 per person (3rd class)
  - Solo traveler, fare £80 → £80 per person (1st class suite)

**Additional insight:** This is an example of **interaction between features** - fare and family_size interact in a meaningful way.

---
## Part 8: Creating the Final Cleaned Dataset (10 mins)

### Exercise 8.1: Select and Prepare Final Features

In [None]:
# SOLUTION: Create final dataset with cleaned and engineered features
# Select the most useful features for machine learning
final_features = [
    'survived',           # Target variable
    'pclass',            # Original feature
    'sex_encoded',       # Encoded feature
    'age_group_imputed', # Imputed feature
    'fare_per_person',   # Engineered feature
    'family_size',       # Engineered feature
    'is_alone',          # Engineered feature
    'cabin_known'        # Engineered feature
]

df_final = df[final_features].copy()
print(f"Final dataset shape: {df_final.shape}")
print("\nFirst few rows:")
df_final.head()

**Feature Selection Rationale:**

1. **survived:** Target variable (what we're predicting)

2. **pclass:** Original feature, strong predictor (1st class had better survival)

3. **sex_encoded:** Encoded categorical, very strong predictor (women survived more)

4. **age_group_imputed:** Imputed with group-based strategy, preserves age patterns

5. **fare_per_person:** Engineered feature, better than raw fare

6. **family_size:** Engineered feature combining sibsp + parch

7. **is_alone:** Binary version of family_size, captures important pattern

8. **cabin_known:** Engineered from missing data pattern, surprisingly predictive

**Features NOT included:**
- **name:** Too unique (the extracted title could be added after encoding)
- **ticket:** Mostly unique, not generalizable
- **cabin:** Too many missing values, used cabin_known instead
- **embarked:** Weaker predictor (could add if desired)
- **sibsp, parch:** Replaced by family_size
- **fare:** Replaced by fare_per_person
- **age:** Replaced by age_group_imputed

In [None]:
# SOLUTION: Check for any remaining missing values
# After all our preprocessing, we should have no missing data
print("Missing values in final dataset:")
print(df_final.isnull().sum())

**Success Check:**
- All missing values should be 0
- If any remain, we need to go back and handle them
- Many machine learning algorithms cannot handle missing values
- Our preprocessing pipeline successfully handled all missing data through:
  - Age: Group-based imputation
  - Cabin: Created cabin_known binary feature
  - Other features: Either had no missing values or were excluded

In [None]:
# Summary statistics
print("\nFinal Dataset Summary:")
print(df_final.describe())

**Dataset Quality Check:**
- **Count:** All features should have 891 values (no missing data)
- **Survived:** 38% survival rate (historically accurate)
- **Pclass:** Mean ~2.3 (more 3rd class than 1st class)
- **Sex_encoded:** Mean ~0.65 (65% male, 35% female)
- **Age_group_imputed:** Mean ~30 years, reasonable range
- **Fare_per_person:** Wide range, some outliers (expensive first-class suites)
- **Family_size:** Mean ~1.9, most traveling alone or in small groups
- **Is_alone:** ~60% traveling alone
- **Cabin_known:** ~23% have cabin information

**This dataset is now ready for machine learning!**

---
## Summary & Reflection

### Key Takeaways

Today we learned:
- ✓ How to assess data quality and where raw data sits in the knowledge hierarchy
- ✓ Strategies for handling missing data (mean, median, group-based imputation)
- ✓ Multiple methods for detecting outliers (IQR, Z-score, Isolation Forest)
- ✓ Normalization techniques (Min-Max, Z-score standardization)
- ✓ Categorical encoding (one-hot encoding, binary encoding)
- ✓ Feature engineering to create meaningful new features

### Data Preparation Pipeline Summary

```
Raw Data → Missing Data Handling → Outlier Detection → 
Normalization → Encoding → Feature Engineering → Clean Dataset
```

### Reflection Questions

**1. Which imputation strategy worked best for the age variable and why?**

**Answer:** **Group-based imputation** (by passenger class and sex) worked best because:

- **Preserves relationships:** Age varies significantly by passenger class and gender. First-class male passengers were typically older (~40 years median) than third-class female passengers (~21 years median).

- **Maintains distribution:** Unlike overall mean/median imputation, which piles every imputed value onto a single point and creates an artificial peak, group-based imputation spreads values more naturally across the distribution.

- **Reflects reality:** Using the median age of similar passengers (same class and gender) produces more realistic imputed values than using the overall population mean.

- **Better for MAR data:** Since age missingness is related to other variables (likely MAR, not MCAR), group-based imputation accounts for these relationships.

- **Improved model performance:** By preserving the relationship between age, class, and gender, we provide more informative features to ML models.

**Example:** A missing age for a 1st class female is imputed as ~35 (median for that group) rather than ~28 (overall median), which is more plausible and preserves the class-age relationship.
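
To make the contrast concrete, here is a minimal sketch on a six-row toy frame. The values are invented for illustration and chosen so that the group and overall medians are easy to check by hand; the column names `age_overall` and `age_group_imputed` are used here only for the toy.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: invented values, not the real Titanic rows)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30.0, 40.0, np.nan, 20.0, 26.0, np.nan],
})

# Strategy A: one median for everyone
overall_median = toy["age"].median()
toy["age_overall"] = toy["age"].fillna(overall_median)

# Strategy B: median of passengers with the same class and sex
group_median = toy.groupby(["pclass", "sex"])["age"].transform("median")
toy["age_group_imputed"] = toy["age"].fillna(group_median)

print(toy[["pclass", "sex", "age", "age_overall", "age_group_imputed"]])
```

In this toy frame both missing ages receive 28 under the overall median, while the group-based version assigns 35 to the first-class woman and 23 to the third-class man.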

**2. Why is feature engineering important for machine learning?**

**Answer:** Feature engineering is crucial because:

**A. Captures domain knowledge:**
- Incorporates human understanding into the model
- Example: Creating `family_size` reflects the intuition that families may stay together during evacuation
- Example: Extracting `title` captures social status and evacuation priority

**B. Reveals hidden patterns:**
- Combines features to expose non-obvious relationships
- Example: `fare_per_person` removes confounding between fare and family size
- Example: `cabin_known` reveals that presence of data is itself predictive

**C. Improves model performance:**
- Good features can dramatically increase accuracy
- May allow simpler models to achieve better results
- Example: `title` may be more predictive than gender alone

**D. Handles non-linearity:**
- Creates features that capture non-linear relationships
- Example: Binning age into groups captures the non-linear effect (children survived more, but not linearly with age)

**E. Reduces dimensionality:**
- Combines correlated features into single meaningful features
- Example: `family_size` replaces `sibsp` + `parch` with one feature
- Simpler models, less overfitting

**A remark widely attributed to Andrew Ng:** "Applied machine learning is basically feature engineering." Good features often matter more than algorithm choice!
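
As a compact illustration of the engineered features named in this answer, the sketch below rebuilds `family_size`, `is_alone`, and `fare_per_person` on a three-row toy frame. The fares are invented, and the formulas follow the definitions in this notebook's summary (family size counts the passenger, fare is divided by family size).

```python
import pandas as pd

# Toy data (assumption: invented values for illustration)
toy = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 0, 2],
    "fare": [71.28, 7.25, 31.50],
})

# Family size counts siblings/spouses, parents/children, and the passenger
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Binary flag for passengers with no relatives aboard
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

# Spread the fare across the family
toy["fare_per_person"] = toy["fare"] / toy["family_size"]

print(toy)
```

The third row shows why the per-person view can matter: a 31.50 fare shared by six people works out to 5.25 each, which is less than the solo traveler's 7.25.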

**3. What new feature did you find most insightful?**

**Answer:** The most insightful engineered feature is **`cabin_known`** because:

**A. Counter-intuitive discovery:**
- We typically think of missing data as a problem to fix
- Here, the missingness itself is highly informative
- Passengers with known cabins had ~67% survival vs ~30% for unknown

**B. Reveals hidden proxy:**
- `cabin_known` is a proxy for passenger class and wealth
- First-class passengers had assigned cabins, third-class often did not
- Having a cabin may also indicate proximity to lifeboats on upper decks

**C. Lesson in creative thinking:**
- Instead of trying to impute 77% missing cabin data (unreliable)
- We extracted signal from the pattern of missingness
- Demonstrates that "how you handle missing data" can be feature engineering

**D. Generalizable principle:**
- Look for patterns in missingness across your datasets
- Missing data is not always missing at random
- The fact that data is recorded/missing can be informative
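
The "missingness as signal" idea can be sketched in a few lines. The frame below is a toy with invented values, so the rates it prints are not the real Titanic figures; it only shows the mechanics of building the indicator and comparing outcomes.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: invented values, a raw 'cabin' column with gaps)
toy = pd.DataFrame({
    "cabin": ["C85", np.nan, "E46", np.nan, np.nan, "B28"],
    "survived": [1, 0, 1, 0, 1, 0],
})

# 1 if a cabin was recorded, 0 if it is missing
toy["cabin_known"] = toy["cabin"].notna().astype(int)

# Compare the target across the two groups
rates = toy.groupby("cabin_known")["survived"].mean()
print(rates)
```

The same two steps apply to any column where the reason a value is missing might be related to the outcome.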

**Alternative answers:** `title` (extracts social status from text), `fare_per_person` (removes confounding), or `family_size` (combines related features) are all excellent answers with strong justification!

**The key insight:** Feature engineering requires creativity, domain knowledge, and thinking beyond the obvious transformations. The best features often come from understanding the context and story behind the data.

---
## Bonus Challenge (Optional)

### Create Your Own Feature!

In [None]:
# SOLUTION: Create a new feature that might be useful for predicting survival
# Here are several examples:

# Example 1: Wealth indicator (combining class and fare)
# Hypothesis: Within each class, higher fare indicates better accommodations (closer to lifeboats)
df['wealth_score'] = df['fare_per_person'] / df['pclass']
# Higher score = more wealth (high fare, low class number)

# Example 2: Young female indicator
# Hypothesis: Young women ("women and children first") had highest survival
df['young_female'] = ((df['sex'] == 'female') & (df['age_group_imputed'] < 40)).astype(int)

# Example 3: Deck level (from cabin letter)
# Hypothesis: Passengers on higher decks (A, B, C) were closer to lifeboats
# Note: assumes a raw 'cabin' column is available; seaborn's built-in titanic
# dataset ships a 'deck' column instead
df['deck_level'] = df['cabin'].str[0]  # First letter of cabin
# Map deck letters to numbers (A=1 is highest/best; T is assigned 8)
deck_mapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8}
df['deck_numeric'] = df['deck_level'].map(deck_mapping)
df['deck_numeric'] = df['deck_numeric'].fillna(9)  # Unknown deck gets worst value

# Example 4: Ticket sharing (detect group tickets)
# Hypothesis: People with shared tickets may have stayed together
# Note: assumes a 'ticket' column is available (seaborn's built-in titanic
# dataset does not include one)
df['ticket_count'] = df.groupby('ticket')['ticket'].transform('count')
df['shared_ticket'] = (df['ticket_count'] > 1).astype(int)

print("New features created!")
print("\nFeature correlations with survival:")
print(df[['survived', 'wealth_score', 'young_female', 'deck_numeric', 'shared_ticket']].corr()['survived'].sort_values(ascending=False))

In [None]:
# Analyze survival by your new feature
# Example: Analyze young_female feature
print("Survival rate by young female status:")
print(df.groupby('young_female')['survived'].agg(['mean', 'count']))

fig = px.bar(df.groupby('young_female')['survived'].mean(),
             title='Survival Rate: Young Females vs Others',
             labels={'value': 'Survival Rate', 'young_female': 'Young Female (1) vs Others (0)'})
fig.update_layout(yaxis_tickformat='.0%', showlegend=False)
fig.show()

**Bonus Feature Analysis:**

**1. wealth_score (fare_per_person / pclass):**
- Combines two indicators of wealth
- Higher score indicates more resources and better accommodations
- May capture within-class variation (luxury vs. standard first-class)

**2. young_female:**
- Combines gender and age into single indicator
- Approximates the "women" half of the "women and children first" policy (male children are not captured by this flag)
- Shows very high survival rate (~75%) vs others (~25%)

**3. deck_numeric:**
- Extracts deck information from cabin strings
- Higher decks (A, B) were closer to lifeboats
- Deals with missing data by assigning unknown decks the worst (highest-numbered) value, 9

**4. shared_ticket:**
- Identifies passengers traveling on group tickets
- May indicate families or groups that stayed together
- Different from family_size (could be friends, colleagues)

**Feature Engineering Process:**
1. **Start with hypothesis:** What factors might affect survival?
2. **Test correlation:** Does the feature relate to the target?
3. **Validate with domain knowledge:** Does it make sense?
4. **Compare with existing features:** Does it add new information?
5. **Check practical utility:** Is it worth the complexity?

**Great feature engineering balances:**
- Domain knowledge with data-driven discovery
- Complexity with interpretability
- Novelty with reliability

**Remember:** Not every feature will be useful! The goal is to experiment and test which features actually improve model performance.

---
## Complete Data Preparation Pipeline

### Summary of Our Journey:

**1. Raw Data Assessment:**
- Identified missing values (age: 20%, cabin: 77%)
- Categorized data types (numerical vs categorical)
- Understood the knowledge hierarchy (Data → Information → Knowledge → Wisdom)

**2. Missing Data Handling:**
- Age: Group-based imputation (by class and sex)
- Cabin: Created cabin_known binary feature
- Learned about MCAR, MAR, and MNAR mechanisms

**3. Outlier Detection:**
- IQR method: Simple, robust, box plot visualization
- Z-score method: Assumes normality, good for univariate analysis
- Isolation Forest: Advanced, multivariate, no distribution assumptions
- Decided to keep outliers (legitimate expensive tickets)
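
The two univariate rules can be compared on a tiny sample. This is a sketch with invented fares; the 1.5 × IQR fence and the |z| > 3 cutoff are the conventional defaults and are assumed here rather than taken from the notebook's cells.

```python
import pandas as pd

# Toy fares (assumption: invented values with one extreme ticket)
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 31.5, 52.0, 512.33])

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
iqr_outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (fares - fares.mean()) / fares.std()
z_outliers = fares[z_scores.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

In this toy sample the IQR rule flags the 512.33 fare while the z-score rule flags nothing, because the extreme value inflates the very mean and standard deviation it is measured against. The effect is exaggerated by the tiny sample size, but it illustrates why the IQR method is described as the more robust of the two.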

**4. Normalization & Standardization:**
- Min-Max scaling: [0, 1] range, preserves the relative spacing of values but is stretched by extreme values
- Z-score standardization: Mean=0, std=1, less sensitive to outliers than Min-Max scaling
- Prepared data for ML algorithms
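
A short sketch of both scalers on a single toy column follows. The fares are invented, and the scalers are the ones imported at the top of this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy fares as a single-column 2D array (assumption: invented values)
fares = np.array([[7.25], [13.0], [26.0], [52.0], [512.33]])

# Min-Max scaling: smallest value maps to 0, largest to 1
minmax = MinMaxScaler().fit_transform(fares)

# Z-score standardization: subtract the mean, divide by the standard deviation
zscores = StandardScaler().fit_transform(fares)

print(np.round(minmax.ravel(), 3))
print(np.round(zscores.ravel(), 3))
```

With one extreme fare in the column, Min-Max scaling pushes the four ordinary fares close to 0, which is the sensitivity to extreme values noted above.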

**5. Categorical Encoding:**
- Binary encoding: sex → 0/1
- One-hot encoding: embarked → multiple binary columns
- Converted text to numbers for ML models
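
Both encodings can be sketched on a four-row toy frame. The mapping direction (male = 1) is an assumption inferred from the "65% male" reading of the `sex_encoded` mean earlier in this notebook.

```python
import pandas as pd

# Toy data (assumption: invented rows)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# Binary encoding: two categories become a single 0/1 column
toy["sex_encoded"] = toy["sex"].map({"female": 0, "male": 1})

# One-hot encoding: one 0/1 column per port of embarkation
embarked_dummies = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=int)

encoded = pd.concat([toy[["sex_encoded"]], embarked_dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 across the `embarked_*` columns, which is the defining property of a one-hot encoding.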

**6. Feature Engineering:**
- family_size: Combined sibsp + parch + 1
- is_alone: Binary indicator for solo travelers
- title: Extracted from name (Mr, Mrs, Miss, Master)
- age_group: Binned age into categories
- fare_per_person: Normalized fare by family size
- cabin_known: Binary indicator from missing data pattern

**7. Final Dataset:**
- 891 rows, 8 columns (7 features plus the `survived` target)
- No missing values
- Ready for machine learning!

### Key Principles Learned:

1. **Understand before transforming:** Always explore data first
2. **Context matters:** Domain knowledge guides better decisions
3. **Missing data tells a story:** Sometimes absence is information
4. **Not all outliers are errors:** Understand before removing
5. **Feature engineering is creative:** Best features come from insight
6. **Pipeline thinking:** Each step builds on the previous
7. **Documentation is crucial:** Explain your choices and reasoning

### What's Next?

With our clean, engineered dataset, we're ready for:
- **Day 6:** Machine Learning (training models on this prepared data)
- Building classifiers to predict survival
- Evaluating feature importance
- Understanding which of our engineered features are most valuable

**Great job completing Day 4!** You now have the essential skills for data preparation and feature engineering - often considered the most important (and time-consuming) parts of the machine learning pipeline!

---
## Resources

### Documentation:
- **Scikit-learn Preprocessing:** https://scikit-learn.org/stable/modules/preprocessing.html
- **Missing Data Handling:** https://pandas.pydata.org/docs/user_guide/missing_data.html
- **Feature Engineering Guide:** https://www.kaggle.com/learn/feature-engineering

### Further Reading:
- **Feature Engineering for Machine Learning** by Alice Zheng & Amanda Casari (O'Reilly)
- **Isolation Forest Paper:** Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation forest. ICDM.
- **Missing Data Mechanisms:** Rubin, D. B. (1976). Inference and missing data. Biometrika.

### Practice:
- Try this pipeline on other datasets (house prices, customer churn, etc.)
- Experiment with different imputation strategies
- Create your own engineered features
- Compare model performance with/without feature engineering

**See you on Day 6 for Machine Learning!**