# Day 4 — Student Notebook

*Auto-generated notebook based on provided lecture slides.*

## Day 4 — Data Preparation & Feature Engineering
**Goals:** understand data types, missing values, basic imputation, outlier detection, normalization, and one-hot encoding.

In [None]:
# Setup: imports used throughout the notebook.
# If a package is missing in your environment (e.g. Google Colab), uncomment the pip line below.
# !pip install pandas numpy seaborn plotly scikit-learn matplotlib

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
sns.set_theme(style='whitegrid')

# Load dataset (seaborn's titanic dataset) - we'll use this across all notebooks
df = sns.load_dataset('titanic')
df_original = df.copy()  # keep a pristine copy
print('Loaded titanic dataset with shape:', df.shape)
df.head()


### 1) Inspect data types and convert if necessary
- Check `df.dtypes`
- Convert `survived` to integer and `pclass` to categorical if needed

**Task:** perform conversions.
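
As a minimal sketch of the two conversions, the example below uses a small made-up frame (hypothetical values, not the Titanic data). It assumes you want `pclass` as an *ordered* categorical, which is one reasonable choice and not something the task requires.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "survived": [0.0, 1.0, 1.0, 0.0],  # 0/1 flag that ended up stored as float
    "pclass": [3, 1, 2, 3],            # integer codes that really name a category
})

# Float flag -> integer
toy["survived"] = toy["survived"].astype("int64")

# An ordered categorical keeps the 1 < 2 < 3 ranking of the classes
class_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
toy["pclass"] = toy["pclass"].astype(class_type)

print(toy.dtypes)
print(toy["pclass"].cat.categories.tolist())
```

A plain `astype('category')` gives an unordered categorical; the explicit `CategoricalDtype` is only needed when the order of the levels matters.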

In [None]:
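# Student: inspect and convert
print(df.dtypes)  # a bare `df.dtypes` in the middle of a cell is not displayed

# Convert 'survived' to an integer dtype if it is not one already
if not pd.api.types.is_integer_dtype(df['survived']):
    df['survived'] = df['survived'].astype('int64')
# Treat passenger class as a category rather than a number
if 'pclass' in df.columns:
    df['pclass'] = df['pclass'].astype('category')
print(df.dtypes)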

### 2) Missing values handling
- Identify columns with missing values
- Compare two strategies: dropping rows with missing critical fields, or imputing `age` with the median

**Tasks:**
1. Show missing counts.
2. Create `df_imputed` where `age` is imputed with median.
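
The two strategies can be compared on a tiny example first. This is a sketch on made-up values (hypothetical, not the Titanic data); it uses pandas `fillna` instead of `SimpleImputer`, which gives the same result for a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 26.55, 13.00],
})

# Strategy 1: drop rows where the critical field is missing
dropped = toy.dropna(subset=["age"])

# Strategy 2: keep every row and fill the gaps with the median of the observed ages
median_age = toy["age"].median()  # NaN values are skipped by default
imputed = toy.assign(age=toy["age"].fillna(median_age))

print("rows:", len(toy), "after drop:", len(dropped), "after impute:", len(imputed))
print("median age used:", median_age)
```

Dropping loses two of the five rows here, while imputing keeps them all at the cost of placing extra mass at the median.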

In [None]:
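# Student: missing value handling
print(df.isna().sum())  # missing count per column

# SimpleImputer was already imported in the setup cell
imp = SimpleImputer(strategy='median')
df_imputed = df.copy()
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to one column
df_imputed['age'] = imp.fit_transform(df[['age']]).ravel()
print('Missing after impute (age):', df_imputed['age'].isna().sum())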


### 3) Outlier detection (IQR) — student exercise
- Use the IQR method on `fare`

**Task:** Flag outliers in `fare` using the 1.5 × IQR rule and show the count.
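
To see the rule on numbers small enough to check by hand, here is a sketch with a made-up fare list. The helper name `iqr_bounds` and the values are hypothetical; `np.percentile` uses linear interpolation by default, as does pandas `quantile`.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical fares: mostly small, one extreme value
fares = np.array([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 512.0])

lower, upper = iqr_bounds(fares)
is_outlier = (fares < lower) | (fares > upper)

print("bounds:", lower, upper)
print("outliers:", fares[is_outlier])
```

Here Q1 = 8.0 and Q3 = 17.75, so the fences are -6.625 and 32.375 and only the 512.0 fare is flagged. A negative lower fence is common for right-skewed data such as fares, which means the rule only catches high values in practice.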

In [None]:
# Student: IQR outlier detection
# Quartiles and interquartile range of 'fare'
q1 = df['fare'].quantile(0.25)
q3 = df['fare'].quantile(0.75)
iqr = q3 - q1
# Fences: anything outside [lower, upper] is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df['fare'] < lower) | (df['fare'] > upper)]
print('IQR bounds', lower, upper)
print('Outlier count:', outliers.shape[0])
outliers.head()


### 4) Normalization & encoding
- Normalize numeric columns (`age`, `fare`) with `MinMaxScaler` or `StandardScaler`
- One-hot encode `sex` and `embarked`

**Task:** create a cleaned and encoded DataFrame `X_ready`.
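
The sketch below compares the two scalers and one-hot encoding on a three-row made-up frame (hypothetical values). It uses `pd.get_dummies` for the encoding, which is one of several ways to do it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "sex": ["male", "female", "female"],
})

# MinMaxScaler maps the column onto [0, 1]
minmax = MinMaxScaler().fit_transform(toy[["age"]]).ravel()

# StandardScaler centres on 0 and divides by the (population) standard deviation
standard = StandardScaler().fit_transform(toy[["age"]]).ravel()

# One-hot encode 'sex'; drop_first=True keeps one column for a two-level feature
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print("min-max:", minmax)
print("standard:", np.round(standard, 3))
print(encoded.columns.tolist())
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test rows leaks into the transformation.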

In [None]:
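# Student: normalization & one-hot encoding
num_cols = ['age', 'fare']
X = df_imputed.copy()
# Safety net: fill any numeric gaps that remain ('age' was already imputed above)
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
# Rescale to [0, 1]; swap in StandardScaler() for zero mean and unit variance
scaler = MinMaxScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])
# drop_first=True drops one dummy per encoded column to avoid redundancy.
# Rows with a missing 'embarked' get all-zero dummies.
X = pd.get_dummies(X, columns=['sex', 'embarked'], drop_first=True)
print('Columns after encoding:', X.columns.tolist()[:20])
# Note: X still holds the target 'survived' (and its duplicate 'alive'); drop them before modelling
X_ready = X
X_ready.head()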


### Short reflection
- Why do we normalize numeric features for some models?
- When is one-hot encoding necessary? Write a one-paragraph answer.
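
As a starting point for the first question, the sketch below shows how a distance-based model such as k-nearest neighbours can change its mind once features are scaled. The three passengers and the assumed feature ranges (age 0 to 80, fare 0 to 500) are hypothetical.

```python
import numpy as np

# Hypothetical passengers as (age in years, fare)
a = np.array([30.0, 100.0])
b = np.array([70.0, 110.0])  # 40 years older, similar fare
c = np.array([31.0, 160.0])  # same age group, fare higher by 60

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

raw_ab, raw_ac = dist(a, b), dist(a, c)

# Min-max scaling with assumed ranges: age 0-80, fare 0-500
lo = np.array([0.0, 0.0])
hi = np.array([80.0, 500.0])

def scale(p):
    return (p - lo) / (hi - lo)

sc_ab = dist(scale(a), scale(b))
sc_ac = dist(scale(a), scale(c))

print("raw:    a-b =", round(raw_ab, 2), " a-c =", round(raw_ac, 2))
print("scaled: a-b =", round(sc_ab, 3), " a-c =", round(sc_ac, 3))
```

On the raw values `b` is the nearer neighbour of `a`, because fare has the larger numeric range and dominates the distance. After scaling, `c` is nearer, since 40 years is half of the assumed age range while a fare gap of 60 is a small fraction of the fare range.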