# Day 4 — Solutions Notebook

*Auto-generated notebook based on provided lecture slides.*

## Solutions — Day 4: Data Prep & Feature Engineering
Worked solutions and explanations.

In [None]:
# Setup: installs and imports
# If packages are missing in your environment (e.g. Google Colab), uncomment the pip install below.
# !pip install pandas numpy seaborn plotly scikit-learn matplotlib

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
sns.set_theme(style='whitegrid')

# Load dataset (seaborn's titanic dataset) - we'll use this across all notebooks
df = sns.load_dataset('titanic')
df_original = df.copy()  # keep a pristine copy
print('Loaded titanic dataset with shape:', df.shape)
df.head()


### Data types & conversions (solution)

In [None]:
print(df.dtypes)
# Convert survived and pclass (if needed)
df['survived'] = df['survived'].astype('int')
if 'pclass' in df.columns:
    df['pclass'] = df['pclass'].astype('category')
print('\nConverted types:\n', df.dtypes)
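
Passenger class has a natural order, so one option is an ordered categorical rather than a plain one. The sketch below uses a small made-up frame (not the Titanic data) and assumes the classes are coded 1, 2, 3:

```python
import pandas as pd

# Toy data standing in for the 'pclass' column (illustrative values only)
toy = pd.DataFrame({'pclass': [3, 1, 2, 3, 1]})

# An ordered categorical keeps the 1 < 2 < 3 ranking explicit
class_dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
toy['pclass'] = toy['pclass'].astype(class_dtype)

print(toy['pclass'].dtype)
# min/max and comparisons are only defined because the categories are ordered
print(toy['pclass'].min(), toy['pclass'].max())
print((toy['pclass'] > 1).tolist())
```

Whether ordering helps depends on the model; for one-hot encoding later on, the order is ignored anyway.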

### Missing values (solution)
Two approaches are shown: dropping rows with missing values, and median imputation.

In [None]:
# Drop rows example
df_drop = df.dropna(subset=['age','fare'])
print('After dropping rows with missing age/fare:', df_drop.shape)

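# Median imputation example (SimpleImputer is imported in the setup cell)
imp = SimpleImputer(strategy='median')
df_imputed = df.copy()
# fit_transform returns a 2D array of shape (n, 1); flatten it before assigning to one column
df_imputed['age'] = imp.fit_transform(df[['age']]).ravel()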
print('Missing age after impute:', df_imputed['age'].isna().sum())
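
Before choosing between the two approaches, it can help to check how much of each column is missing. A minimal sketch on a made-up frame (the values are illustrative, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps (illustrative values only)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan, 54.0],
    'fare': [7.25, 71.28, 8.05, np.nan, 51.86],
})

# Share of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Option 1: drop rows with a missing value in either column
dropped = toy.dropna(subset=['age', 'fare'])

# Option 2: fill each column with its own median
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print('rows kept after drop:', len(dropped))
print('medians learned by the imputer:', imputer.statistics_)
```

Dropping loses whole rows, including their non-missing values; imputation keeps every row but pulls the filled values toward the centre of the distribution.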


### Outlier handling (solution)
Two options: remove rows whose fare lies outside the 1.5 * IQR fences, or cap fares at those fences (a form of winsorizing). We'll show both.

In [None]:
# Remove rows whose fare lies outside the 1.5 * IQR fences
q1 = df['fare'].quantile(0.25)
q3 = df['fare'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df_iqr_removed = df[~((df['fare'] < lower) | (df['fare'] > upper))]
print('After IQR removal:', df_iqr_removed.shape)

# Capping (winsorizing)
df_capped = df.copy()
df_capped['fare'] = np.clip(df_capped['fare'], lower, upper)
print('Capped fares min/max:', df_capped['fare'].min(), df_capped['fare'].max())
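
The fence calculation can be wrapped in a small helper so that removal and capping share the same bounds. This is a sketch: `iqr_bounds` is a hypothetical helper name, and the fares are made-up values chosen so the arithmetic is easy to follow.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative fares with one extreme value
fares = pd.Series([5.0, 7.0, 8.0, 9.0, 10.0, 12.0, 13.0, 15.0, 500.0])
lower, upper = iqr_bounds(fares)

# Option 1: keep only values inside the fences (changes the row count)
removed = fares[fares.between(lower, upper)]

# Option 2: cap values at the fences (keeps every row)
capped = fares.clip(lower, upper)

print('fences:', lower, upper)
print('rows after removal:', len(removed), '| rows after capping:', len(capped))
print('max after capping:', capped.max())
```

Here Q1 is 8 and Q3 is 13, so the fences are 0.5 and 20.5; removal drops the 500 row while capping replaces it with 20.5.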


### Normalization & Encoding (solution)
We show Min-Max and Standard scaling, followed by one-hot encoding with pandas.

In [None]:
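# Start from the age-imputed frame; fill any missing fares with the median
X = df_imputed.copy()
X['fare'] = X['fare'].fillna(X['fare'].median())
num_cols = ['age', 'fare']

# Min-Max scaling: rescale each column to [0, 1]
mm = MinMaxScaler()
X_mm = X.copy()
X_mm[num_cols] = mm.fit_transform(X_mm[num_cols])

# Standard scaling: zero mean and unit variance per column
ss = StandardScaler()
X_ss = X.copy()
X_ss[num_cols] = ss.fit_transform(X_ss[num_cols])

# One-hot encoding (applied to the Min-Max scaled frame); drop_first=True drops one level per column.
# Rows with a missing 'embarked' get zeros in every embarked_ column.
X_encoded = pd.get_dummies(X_mm, columns=['sex', 'embarked'], drop_first=True)
print('Encoded cols sample:', [c for c in X_encoded.columns if 'sex_' in c or 'embarked_' in c])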
X_encoded.head()
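
To make the two scalers less of a black box, the sketch below recomputes them by hand on a tiny made-up array and compares the result with scikit-learn. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-column data
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-Max: (x - min) / (max - min), which maps the column to [0, 1]
manual_mm = (x - x.min()) / (x.max() - x.min())

# Standard: (x - mean) / std, with the population std (ddof=0)
manual_ss = (x - x.mean()) / x.std()

sk_mm = MinMaxScaler().fit_transform(x)
sk_ss = StandardScaler().fit_transform(x)

print('Min-Max matches:', np.allclose(manual_mm, sk_mm))
print('Standard matches:', np.allclose(manual_ss, sk_ss))
```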


### Notes
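- Scaling matters for distance-based models such as KNN, where features with large ranges dominate the distance, and for models fit by gradient-based optimization, which tends to converge faster on scaled inputs.
- One-hot encoding is needed for nominal categorical variables when the algorithm expects numeric inputs, as most scikit-learn estimators do.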
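
The cells above fit the imputer and scalers on the full dataset, which is fine for demonstration. When a model is evaluated on held-out data, a common alternative is to fit the preprocessing on the training split only, so no test-set statistics leak in. A minimal sketch, assuming a synthetic stand-in for the data (random values and an arbitrary target, not the Titanic dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in columns (random values, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.normal(30, 12, n),
    'fare': rng.exponential(30, n),
    'sex': rng.choice(['male', 'female'], n),
})
toy.loc[::10, 'age'] = np.nan  # inject some missing ages
# Arbitrary noisy target so the sketch has something to predict
y = (toy['fare'] + rng.normal(0, 10, n) > 30).astype(int)

# Numeric columns: median imputation, then standard scaling
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
preprocess = ColumnTransformer([
    ('num', numeric, ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])
model = Pipeline([
    ('prep', preprocess),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)  # medians, means and category levels come from X_train only
score = model.score(X_test, y_test)
print('test accuracy:', score)
```

Putting the steps in a `Pipeline` means `fit` learns the statistics from the training rows and `score` reuses them on the test rows.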