# Day 6 — Student Notebook

*Auto-generated notebook based on provided lecture slides.*

## Day 6 — Introduction to Machine Learning
**Goals:** understand supervised vs. unsupervised learning, split data into train and test sets, and train simple classifiers (logistic regression, KNN) on the Titanic dataset.

In [None]:
# Setup: optional installs and imports
# If a package is missing in your environment, uncomment the pip line below.
# !pip install pandas numpy seaborn plotly scikit-learn matplotlib

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
sns.set_theme(style='whitegrid')

# Load dataset (seaborn's titanic dataset) - we'll use this across all notebooks
df = sns.load_dataset('titanic')
df_original = df.copy()  # keep a pristine copy
print('Loaded titanic dataset with shape:', df.shape)
df.head()


### 1) Prepare data for supervised learning
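- Predict `survived` using a small set of features: `class` (the categorical form of `pclass`), `sex`, `age`, `fare`, `embarked`.
- Use median imputation for the numeric columns and one-hot encoding (`pd.get_dummies`) for the categorical ones.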

**Task:** create `X` and `y`.

In [None]:
# Student: create X and y
df_ml = df.copy()
# Impute missing numeric values with the column median
df_ml['age'] = df_ml['age'].fillna(df_ml['age'].median())
df_ml['fare'] = df_ml['fare'].fillna(df_ml['fare'].median())
# One-hot encode the categorical columns; drop_first=True drops one level per column as the reference
df_ml = pd.get_dummies(df_ml, columns=['sex', 'embarked', 'class'], drop_first=True)
# Keep the numeric features plus every dummy column created above
dummy_prefixes = ('sex_', 'embarked_', 'class_')
features = ['age', 'fare'] + [c for c in df_ml.columns if c.startswith(dummy_prefixes)]
X = df_ml[features]
y = df_ml['survived']
print('Features used:', features)
print('Shape X, y:', X.shape, y.shape)
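
To see what `drop_first=True` does, and how `pd.get_dummies` treats missing values, the encoding step can be tried on a tiny frame. This is a minimal sketch: the `toy` frame below is made up for illustration and is not part of the Titanic data.

```python
import pandas as pd

# Hypothetical three-row frame (not Titanic data); the last row has no port recorded
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embarked': ['S', 'C', None],
})

# Same call pattern as the cell above, restricted to two columns
encoded = pd.get_dummies(toy, columns=['sex', 'embarked'], drop_first=True)

# Cast to int so the output looks the same whether pandas returns bool or integer dummies
encoded = encoded.astype(int)
print(encoded)
```

The first level of each column in sorted order (`female`, `C`) is dropped and becomes the implicit reference. The row with a missing `embarked` gets 0 in every `embarked_` column, so with `drop_first=True` it looks the same as a passenger who embarked at `C`. If that distinction matters, one option is to fill the missing ports before encoding.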


### 2) Train/test split and a baseline model (student)
- Split data (80/20)
- Train LogisticRegression
- Evaluate accuracy

**Task:** implement train/test split and fit logistic regression.

In [None]:
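# Student: train/test + logistic regression
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# max_iter=200 gives the solver more iterations than the default of 100; raise it if a ConvergenceWarning appears
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
# Rows are true classes, columns are predicted classes
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))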
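
Accuracy is only one summary of the confusion matrix printed above. As a minimal sketch, the function below derives accuracy, precision, recall and F1 from the four cells of a binary confusion matrix, using scikit-learn's layout `[[TN, FP], [FN, TP]]`. The helper name `metrics_from_confusion` and the counts in the usage example are hypothetical; they are not results from this notebook.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: of the passengers predicted to survive, what fraction did?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the passengers who survived, what fraction was predicted to?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }


# Made-up counts for illustration only (not output from the cell above)
example = metrics_from_confusion(tn=90, fp=15, fn=20, tp=54)
for name, value in example.items():
    print(f'{name}: {value:.3f}')
```

In the notebook, the four counts can be unpacked with `tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()`. The results should agree with `precision_score`, `recall_score` and `f1_score`, which the setup cell already imports.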


### 3) Try K-Nearest Neighbors (student)
- Train KNN with k=5 and compare its accuracy with the logistic regression result

**Task:** fit KNN and compare results to logistic regression.

In [None]:
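# Student: KNN
# k=5: each prediction is a majority vote among the 5 nearest training rows
knn = KNeighborsClassifier(n_neighbors=5)
# Features are unscaled here, so large-valued columns such as fare dominate the distance
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print('KNN Accuracy:', accuracy_score(y_test, y_pred_knn))
# Rows are true classes, columns are predicted classes
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred_knn))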
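
KNN classifies by distance, so a feature measured on a large scale (such as `fare`) can outweigh the others. The sketch below illustrates the effect on synthetic data, not on the Titanic features: one small-scale feature decides the label and one large-scale feature is pure noise. All names (`X_demo`, `raw_knn`, `scaled_knn`) are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends only on the small-scale feature
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(0, 1, n)   # small scale, determines the label
noise = rng.normal(0, 1000, n)      # large scale, unrelated to the label
X_demo = np.column_stack([informative, noise])
y_demo = (informative > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the noise column
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(Xd_train, yd_train)

# KNN after standardization: the scaler is fitted on the training rows only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(Xd_train, yd_train)

acc_raw = raw_knn.score(Xd_test, yd_test)
acc_scaled = scaled_knn.score(Xd_test, yd_test)
print('KNN accuracy, raw features:   ', acc_raw)
print('KNN accuracy, scaled features:', acc_scaled)
```

The same pipeline could be fitted on this notebook's data with `.fit(X_train, y_train)`. On the Titanic features the gain may be smaller than in this deliberately extreme example, or absent, so it is worth measuring rather than assuming.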


### Short reflection
- Which model performed better? Why might that be the case?
- What are the weaknesses of these simple approaches?
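
When comparing the two accuracies, it can help to know what a model that ignores the features entirely would score. A minimal sketch of such a reference point is a majority-class baseline, which always predicts the most common training label. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    test_labels = list(test_labels)
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels for illustration (0 = did not survive, 1 = survived)
toy_train = [0, 0, 0, 1, 1, 0, 1, 0]
toy_test = [0, 1, 0, 0, 1]
label, acc = majority_baseline_accuracy(toy_train, toy_test)
print('Majority label:', label, '| baseline accuracy:', acc)
```

In the notebook this could be called as `majority_baseline_accuracy(y_train, y_test)`. A classifier that does not clearly beat this number has learned little from the features.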