### Session 1: End-to-End ML Hands-on (Titanic)

In this hands-on, you'll build an end-to-end ML workflow on the Titanic dataset using pandas and scikit-learn.

Learning goals:
- Understand the full flow: problem framing → EDA → preprocessing → modeling → evaluation → iteration → export → inference.
- Learn how `Pipeline` and `ColumnTransformer` organize preprocessing and models.
- Practice evaluating models beyond accuracy and prepare a Kaggle-style submission.
- Finish with a mini-challenge: engineer new features to improve accuracy.


In [None]:
# Install minimal dependencies (Colab-friendly). 
!pip -q install pandas scikit-learn seaborn matplotlib joblib requests


### Imports and configuration

Technical notes:
- pandas: data loading and manipulation.
- seaborn/matplotlib: quick EDA visualizations.
- scikit-learn: split, preprocessing, pipelines, models, metrics, hyperparameter search.
- joblib: save/load trained pipelines.


In [1]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
                             confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay)
import joblib



### Data access (dual-mode)

We'll support two approaches:
- Download from public URLs into `/content/data/raw/` (best for Colab and reproducibility).
- Load from local repo path `../data/raw/` (if running locally).

Set `USE_URLS = True` to download; otherwise, the notebook will try local paths.

TODO: Update `TRAIN_URL` and `TEST_URL` to your hosted raw CSV links if using URLs.


In [None]:
# Toggle between URL download and local path
USE_URLS = True  # set to False to use local CSVs under ../data/raw/

# TODO: replace with your public raw URLs (e.g., GitHub raw).
TRAIN_URL = 'https://raw.githubusercontent.com/your-org/your-repo/main/data/raw/train.csv'
TEST_URL  = 'https://raw.githubusercontent.com/your-org/your-repo/main/data/raw/test.csv'

DATA_DIR_COLAB = '/content/data/raw'
DATA_DIR_LOCAL = os.path.join('..', 'data', 'raw')
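if USE_URLS:
    # Only create the download directory when it is actually used; /content may not be writable locally.
    os.makedirs(DATA_DIR_COLAB, exist_ok=True)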

def download_if_needed(url: str, dst_dir: str) -> str:
    filename = os.path.basename(url)
    dst_path = os.path.join(dst_dir, filename)
    if not os.path.exists(dst_path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(dst_path, 'wb') as f:
            f.write(r.content)
    return dst_path

if USE_URLS:
    train_path = download_if_needed(TRAIN_URL, DATA_DIR_COLAB)
    test_path = download_if_needed(TEST_URL, DATA_DIR_COLAB)
else:
    train_path = os.path.join(DATA_DIR_LOCAL, 'train.csv')
    test_path = os.path.join(DATA_DIR_LOCAL, 'test.csv')

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
train_df.shape, test_df.shape


### Problem framing
We are solving a binary classification problem: predict `Survived` (0/1) from passenger features.
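Common features include `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, and `Embarked`. The `Name`, `Ticket`, and `Cabin` fields are free text or sparsely populated, so the baseline leaves them out; they need feature engineering before they become useful.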


In [None]:
target_col = 'Survived'
X = train_df.drop(columns=[target_col])
y = train_df[target_col]
X.head()


### Quick EDA
Goals:
- Inspect schema, missingness, and distributions.
- Identify candidate features for modeling and preprocessing needs.

TODO: Add one more insightful plot of your choice.


In [None]:
display(train_df.head())
train_df.info()
train_df.isna().mean().sort_values(ascending=False).to_frame('missing_frac')


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))
sns.countplot(data=train_df, x='Survived', ax=axes[0])
axes[0].set_title('Target balance')
sns.histplot(data=train_df, x='Age', kde=True, ax=axes[1])
axes[1].set_title('Age distribution')
plt.tight_layout()
plt.show()


### Train/validation split
We hold out 20% of the rows as a validation set to estimate generalization. We stratify by the target so that both splits preserve the class balance.


In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_valid.shape
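
To see what `stratify` does in the split above, here is a minimal sketch on toy data. The `toy_*` names and the 70/30 class ratio are illustrative assumptions, not values from the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 70 negatives and 30 positives (illustrative only).
toy_y = pd.Series([0] * 70 + [1] * 30, name="Survived")
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

toy_X_tr, toy_X_va, toy_y_tr, toy_y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both splits keep the positive rate of the full data.
print("full:", toy_y.mean(), "train:", toy_y_tr.mean(), "valid:", toy_y_va.mean())
```

Running the same check on `y`, `y_train`, and `y_valid` should show similar positive rates in all three.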


### Preprocessing with ColumnTransformer
Concepts:
- Numeric: impute missing values (median), scale features.
- Categorical: impute missing (most_frequent), one-hot encode.
- `ColumnTransformer` applies different pipelines to column subsets.

TODO: Choose imputation strategy for `Age` and `Embarked` (keep defaults or change).


In [None]:
numeric_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Pclass', 'Sex', 'Embarked']

numeric_preprocess = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_preprocess = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_preprocess, numeric_features),
        ('cat', categorical_preprocess, categorical_features),
    ]
)
preprocessor


### Baseline model: Logistic Regression inside a Pipeline
Why pipeline:
- Guarantees preprocessing is applied consistently in training and inference.
- Simplifies cross-validation and exporting a single object.


In [None]:
clf = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_valid)
y_proba = clf.predict_proba(X_valid)[:, 1]

acc = accuracy_score(y_valid, y_pred)
prec = precision_score(y_valid, y_pred)
rec = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
roc = roc_auc_score(y_valid, y_proba)
acc, prec, rec, f1, roc
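
Because preprocessing lives inside the pipeline, cross-validation refits the imputer, scaler, and encoder on each training fold, so no statistics leak from the held-out fold. The sketch below illustrates this with `cross_val_score` on synthetic data; `demo_X`, `demo_y`, and `demo_clf` are hypothetical names, and the data is randomly generated rather than taken from Titanic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the label mostly follows Sex, with 20% of labels flipped.
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Sex": rng.choice(["male", "female"], n),
})
demo_X.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan  # inject missing values
demo_y = ((demo_X["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

demo_clf = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds fits the whole pipeline, preprocessing included, from scratch.
scores = cross_val_score(demo_clf, demo_X, demo_y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

In the notebook, the equivalent call would be `cross_val_score(clf, X_train, y_train, cv=5)`.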


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10,4))
ConfusionMatrixDisplay.from_predictions(y_valid, y_pred, ax=ax[0])
ax[0].set_title('Confusion Matrix')
RocCurveDisplay.from_predictions(y_valid, y_proba, ax=ax[1])
ax[1].set_title('ROC Curve')
plt.tight_layout()
plt.show()
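
Precision, recall, and F1 all derive from the four counts in the confusion matrix. The sketch below recomputes them by hand on a small made-up label vector (`y_true_demo` and `y_pred_demo` are illustrative, not model output) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Ten made-up labels and predictions (illustrative only).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```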


### Hyperparameter search (small grid)
We explore a small grid over `C`, the inverse of the regularization strength in `LogisticRegression` (smaller `C` means stronger regularization).

TODO: Extend the grid or try a different model like `RandomForestClassifier`.


In [None]:
param_grid = {
    'model__C': [0.1, 1.0, 3.0],
    'model__penalty': ['l2'],
    'model__solver': ['lbfgs']
}
grid = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_
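
As one possible way to approach the TODO, the sketch below runs the same kind of grid search with a `RandomForestClassifier`. It uses synthetic data from `make_classification` so that it runs on its own; the names `rf_pipe`, `rf_grid`, and `rf_search` and the grid values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

rf_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
rf_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [3, None],
}

rf_search = GridSearchCV(rf_pipe, param_grid=rf_grid, cv=3, scoring="accuracy")
rf_search.fit(X_demo, y_demo)
print(rf_search.best_params_, rf_search.best_score_)
```

In the notebook, the same idea would keep `('preprocess', preprocessor)` as the first step and swap only the `'model'` step, then fit on `X_train` and `y_train`.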


In [None]:
best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_valid)
y_proba_best = best_model.predict_proba(X_valid)[:, 1]
acc_b = accuracy_score(y_valid, y_pred_best)
prec_b = precision_score(y_valid, y_pred_best)
rec_b = recall_score(y_valid, y_pred_best)
f1_b = f1_score(y_valid, y_pred_best)
roc_b = roc_auc_score(y_valid, y_proba_best)
acc_b, prec_b, rec_b, f1_b, roc_b


### Error analysis
Inspect the rows the model misclassifies to decide what to improve next.

TODO: Propose one hypothesis to improve, and test it (e.g., different imputation).


In [None]:
errors = X_valid.copy()
errors['y_true'] = y_valid.values
errors['y_pred'] = y_pred_best
errors[errors['y_true'] != errors['y_pred']].head(10)
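
Looking at individual misclassified rows is a start; grouping errors by a feature often shows where they concentrate. The sketch below does this on a made-up six-row table (`errors_demo` is illustrative and does not come from the model above).

```python
import pandas as pd

# Made-up validation rows with true and predicted labels (illustrative only).
errors_demo = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [3, 3, 1, 3, 1, 3],
    "y_true": [0, 1, 1, 0, 1, 1],
    "y_pred": [0, 0, 1, 1, 0, 1],
})
errors_demo["is_error"] = errors_demo["y_true"] != errors_demo["y_pred"]

# Error rate and group size per (Sex, Pclass) combination, worst groups first.
error_by_group = (
    errors_demo.groupby(["Sex", "Pclass"])["is_error"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "error_rate", "size": "n"})
    .sort_values("error_rate", ascending=False)
)
print(error_by_group)
```

Applied to the `errors` frame in the notebook, small groups with high error rates should be read with caution, since a handful of rows can swing the rate.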


### Export the best pipeline
Exporting the entire pipeline ensures preprocessing is part of the saved artifact.


In [None]:
joblib.dump(best_model, 'titanic_best_pipeline.joblib')
os.path.getsize('titanic_best_pipeline.joblib')
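
A quick round trip confirms that a saved pipeline reloads and predicts identically. The sketch below uses a small synthetic pipeline and a temporary directory; `demo_pipe` and `demo_pipeline.joblib` are hypothetical names used only here.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem standing in for the Titanic pipeline (illustrative only).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)  # only load joblib files from sources you trust

# The restored pipeline carries its fitted preprocessing with it.
same_predictions = np.array_equal(demo_pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

In the notebook, the equivalent is `joblib.load('titanic_best_pipeline.joblib').predict(X_valid)`. Loading generally works best with the same scikit-learn version that saved the file.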


### Inference on test set and Kaggle-style submission
We'll refit the best estimator on the full training data (training and validation rows together), predict on `test.csv`, and create `submission.csv` with the columns `PassengerId` and `Survived`.
You can upload this file to the Kaggle Titanic competition to get a leaderboard score and compare it with the benchmark.


In [None]:
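# Refit on all labeled rows. The .joblib file saved earlier still holds the model fit on X_train only;
# call joblib.dump again after this line if the exported artifact should match the submission.
best_model.fit(X, y)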
test_pred = best_model.predict(test_df)
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})
submission.to_csv('submission.csv', index=False)
submission.head()
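
Before uploading, it can help to sanity-check the file. The sketch below defines a hypothetical helper, `check_submission` (not part of pandas or Kaggle tooling), and runs it on toy frames. The checks reflect the two-column format described above and are an assumption about what is worth verifying, not an official validator.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    if list(sub.columns) != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {list(sub.columns)}"]
    problems = []
    if len(sub) != len(expected_ids):
        problems.append("row count differs from the test set")
    if not sub["PassengerId"].is_unique:
        problems.append("duplicate PassengerId values")
    if set(sub["PassengerId"]) != set(expected_ids):
        problems.append("PassengerId values do not match the test set")
    if not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived contains values other than 0 and 1")
    return problems

# Toy IDs and predictions (illustrative only).
toy_ids = pd.Series([892, 893, 894])
good = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0.0, 0.7, 1.0]})

good_problems = check_submission(good, toy_ids)
bad_problems = check_submission(bad, toy_ids)
print(good_problems)
print(bad_problems)
```

In the notebook, the call would be `check_submission(submission, test_df['PassengerId'])`.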


### Mini-challenge: Feature engineering to improve accuracy
Ideas to try (implement one or more and re-evaluate):
- `FamilySize = SibSp + Parch + 1`
- Extract `Title` from `Name` and group rare titles
- Bin `Age` or `Fare`
- Try `RandomForestClassifier`, or `XGBClassifier` (requires installing the `xgboost` package)

TODO: Implement `FamilySize` and include it in `numeric_features`, retrain, and compare metrics and submission score.
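
As a starting point, the first two ideas can be sketched as a single function. `add_engineered_features` is a hypothetical helper, the passenger names below are made up, and the set of titles kept as "common" is an assumption. The regular expression assumes names follow the `Surname, Title. Given names` layout.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize and a grouped Title column added."""
    out = df.copy()
    # The passenger plus siblings/spouses and parents/children aboard.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Capture the text between the comma and the first period, e.g. "Mr".
    title = out["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
    common_titles = {"Mr", "Mrs", "Miss", "Master"}  # assumption: everything else is rare
    out["Title"] = title.where(title.isin(common_titles), "Rare")
    return out

# Made-up rows in the same layout as the Name, SibSp, and Parch columns.
toy_passengers = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Doe, Mrs. Jane (Mary Roe)", "Roe, Dr. Alex", "Poe, Master. Tim"],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 2, 0, 1],
})
engineered = add_engineered_features(toy_passengers)
print(engineered[["FamilySize", "Title"]])
```

To use it in the notebook, apply the function to both `X` and `test_df`, add `'FamilySize'` to `numeric_features` and `'Title'` to `categorical_features`, then rebuild the preprocessor and refit.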
