# Titanic Survival Prediction – Model Training & Evaluation

## Introduction

This notebook continues the Titanic survival analysis by focusing on building and evaluating predictive models.  
The cleaned and feature-enhanced dataset from the previous EDA notebook is used as input.

## Load Processed Dataset

We load the preprocessed DataFrame saved after the EDA phase.  
It includes engineered features such as `FamilySize`, `Deck`, `LogFare`, and cleaned versions of `Age`, `Fare`, and `Embarked`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

df = pd.read_csv("titanic_processed_train.csv")

## Feature Selection

We select a subset of relevant features that showed predictive power during EDA, including:
- Passenger class (`Pclass`)
- Gender (`Sex`)
- Age, Fare, and FamilySize
- Embarked port
- Extracted cabin deck

The target variable is `Survived`.

In [2]:
features = [
    "Pclass", "Sex", "Age", "Fare", "Embarked",
    "FamilySize", "Deck"
]

target = "Survived"

df_model = df[features + [target]].copy()

In [3]:
X = df_model.drop(columns=[target])
y = df_model[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
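
As a rough illustration of what `stratify=y` does, the sketch below splits a small synthetic label vector (not the Titanic data) and compares the share of positive labels in each partition. The names `X_demo` and `y_demo` and the 40% positive rate are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 1000 labels, 40% positive (illustrative only)
y_demo = np.array([1] * 400 + [0] * 600)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratification, both partitions should keep roughly the same positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters more on a small dataset.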

## Preprocessing Pipeline

We use `sklearn`'s `ColumnTransformer` and `Pipeline` to automate:
- Imputation of missing values
- Scaling of numerical features
- One-hot encoding of categorical variables

This ensures that the same preprocessing is applied consistently across models.

In [4]:
numeric_features = ["Age", "Fare", "FamilySize"]
categorical_features = ["Pclass", "Sex", "Embarked", "Deck"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
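
To make the effect of the transformer concrete, here is a minimal sketch that applies the same kind of `ColumnTransformer` to a few made-up rows. The toy values and the name `toy_preprocessor` are assumptions for illustration; only a subset of the notebook's columns is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up), with one missing Age and one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Embarked": ["S", "C", np.nan, "S"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Fare"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Embarked"]),
])

toy_matrix = toy_preprocessor.fit_transform(toy)

# Two scaled numeric columns plus one indicator column per Embarked category
print(toy_matrix.shape)
```

The missing `Age` is filled with the column median before scaling, and the missing `Embarked` is filled with the most frequent port before encoding, so no missing values reach the classifier.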

## Model Training

We train two models for comparison:
- Logistic Regression (as a simple and interpretable baseline)
- Random Forest Classifier (a non-linear ensemble that can capture feature interactions and provides feature importance insights)

Both models are evaluated on a hold-out test set.

In [5]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

## Evaluation Metrics

We evaluate the models using:
- Confusion Matrix
- Precision, Recall, F1-Score
- ROC-AUC Score

These metrics capture both overall accuracy and performance on the minority class (survivors).

In [6]:
for name, model in models.items():
    # Chain the shared preprocessing with the current estimator so that
    # imputation, scaling, and encoding are fit on the training split only
    clf = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)               # hard class labels
    y_proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1 (survived)

    print(f"\n===== {name} =====")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")


===== Logistic Regression =====
Confusion Matrix:
[[96 14]
 [23 46]]

Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       110
           1       0.77      0.67      0.71        69

    accuracy                           0.79       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.79      0.79      0.79       179

ROC AUC: 0.8472

===== Random Forest =====
Confusion Matrix:
[[95 15]
 [24 45]]

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.86      0.83       110
           1       0.75      0.65      0.70        69

    accuracy                           0.78       179
   macro avg       0.77      0.76      0.76       179
weighted avg       0.78      0.78      0.78       179

ROC AUC: 0.8354
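
The positive-class figures in the classification reports follow directly from the confusion matrices printed above. The short sketch below recomputes them by hand; the helper name `positive_class_metrics` is made up for this example, and the matrices are copied from the output.

```python
# Confusion matrices copied from the output above: rows = actual, columns = predicted
matrices = {
    "Logistic Regression": [[96, 14], [23, 46]],
    "Random Forest": [[95, 15], [24, 45]],
}

def positive_class_metrics(cm):
    """Return precision, recall, F1 for class 1 and overall accuracy."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

for name, cm in matrices.items():
    p, r, f1, acc = positive_class_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")

# Logistic Regression: precision=0.77 recall=0.67 f1=0.71 accuracy=0.79
# Random Forest: precision=0.75 recall=0.65 f1=0.70 accuracy=0.78
```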


## Results Summary

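- Logistic Regression performs slightly better overall: ROC AUC is 0.8472 vs. 0.8354, and the positive-class F1-score is 0.71 vs. 0.70.
- Random Forest yields comparable results but does not outperform the simpler model in this case. The two confusion matrices differ by a single test passenger in each row, so the gap is small relative to the 179-sample test set.
- Both models classify non-survivors more reliably than survivors (recall of 0.86 to 0.87 for class 0 against 0.65 to 0.67 for class 1), which is common in slightly imbalanced datasets.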

Further improvements could come from cross-validation, hyperparameter tuning, or additional feature engineering.
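
As one possible starting point for the cross-validation step, the sketch below scores a pipeline with 5-fold stratified cross-validation. It runs on synthetic data from `make_classification` rather than the Titanic set, and the names `X_demo`, `y_demo`, and `cv_clf` are assumptions for illustration; in this notebook the pipeline and the full `X`, `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a mild class imbalance (about 60/40)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    weights=[0.6, 0.4], random_state=0
)

cv_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 5-fold stratified CV, scored with ROC AUC; preprocessing is refit inside each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(cv_clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"ROC AUC per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the spread across folds gives a sense of whether a gap as small as the one between the two models here is larger than fold-to-fold variation.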