# **Titanic Survival Prediction**

**Author:** Milos Saric [https://saricmilos.com/]  
**YouTube:** English: @realskillsoverdegrees, Serbian: @saricmilos  
**Date:** October 7, 2025  
**Dataset:** Titanic Passenger Data  

---

This notebook explores the classic Titanic dataset to predict passenger survival using machine learning.  
The analysis will guide you through the full data science workflow, including:

1. **Problem Definition** – Clearly outline the objective and scope of the project.

2. **Data Collection** – Gather the relevant datasets from Kaggle.

3. **Exploratory Data Analysis (EDA)** – Analyze and visualize data to uncover patterns and insights.

4. **Feature Engineering** – Create, transform, or select meaningful features to improve model performance.

5. **Model Development** – Build and train predictive or analytical models.

6. **Evaluation & Testing** – Assess model performance using appropriate metrics and validate results.

The goal of this project is to apply practical data science techniques to a real-world dataset and gain insights into the factors that influenced survival on the Titanic.


## 1. **Problem Definition**

This phase involves clearly understanding the challenge we aim to solve. This step sets the foundation for the entire project and ensures all efforts are aligned toward a common goal.

Key aspects include:

- **Objective**: Predict whether a passenger survived the Titanic disaster based on available features such as age, gender, class, fare and newly created features.  

- **Scope**: The analysis focuses on the provided Titanic dataset. Predictions are limited to the passengers listed in the dataset, without considering external historical data or additional features beyond what is provided.  

- **Stakeholders**:  
  - **Data Scientists / ML Practitioners**: To practice and improve predictive modeling skills.  
  - **Kaggle Community**: Participants competing in the Titanic challenge.  
  - **Educators / Students**: Learning tool for understanding classification problems and feature engineering.  

- **Success Criteria**: Achieve high prediction accuracy on the test dataset, measured with the **accuracy score**. A successful model reliably distinguishes between survivors and non-survivors.

>A well-defined problem statement is half the solution!


In [None]:
from IPython.display import Image

Image(filename=r"C:\Users\Milos\Desktop\ESCAPE 9-5\PYTHON\GitHub Kaggle Projects\1. Titanic Survival Predictor\Images\titanic.jpg")

## **2. Data Collection**

The **Data Collection** phase is all about gathering the data we need and setting up the tools for analysis. In this step, we also import essential libraries and create reusable functions to streamline our workflow.
The training and testing datasets for this project are provided by Kaggle. You can either:

 - **1.** Download them directly from my GitHub: https://github.com/saricmilos/titanic-survival-prediction

 - **2.** Or access them from Kaggle itself: Titanic: Machine Learning from Disaster

Both sources contain the same dataset, so you can choose whichever is more convenient.

## **2.1. Import Libraries**
   Import libraries for data handling, visualization, and modeling:  

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Any
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

## **2.2. Create Reusable Functions**
Functions to avoid repetitive tasks and keep code clean:

### **2.2.1. Data Loading**

In [None]:
# Load our datasets
def load_dataset(csv_path: Path, **read_csv_kwargs: Any) -> pd.DataFrame:
    """     
    Load a CSV file into a pandas DataFrame.
    
    Args:
        csv_path (Path): Full path to the CSV file
        **read_csv_kwargs: Optional arguments for pd.read_csv

    Returns:
        pd.DataFrame
     """
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")
    return pd.read_csv(csv_path, **read_csv_kwargs)

### **2.2.2. Data Preparation**

In [None]:
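# Extract the title (e.g. "Mr", "Miss") from a name formatted as "Surname, Title. Given names"
def extract_title(name):
    # Matches a single word between ", " and "."; names that do not fit return "Unknown"
    match = re.search(r", (\w+)\.", name)
    return match.group(1) if match else "Unknown"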

In [None]:
# Create a 'Title' column that encodes each passenger's title as a numeric category
def process_titles(df, rare_titles, title_mapping):
    """
    Extracts and encodes passenger titles into numeric categories.
    
    Parameters:
    - df: DataFrame, the dataset to process
    - rare_titles: list of titles to group as 'Rare'
    - title_mapping: dict mapping titles to numeric values
    
    Returns:
    - df: DataFrame with a new 'Title' column encoded numerically
    """
    # Extract titles
    df['Title'] = df['Name'].apply(extract_title)
    
    # Replace rare titles
    df['Title'] = df['Title'].replace(rare_titles, 'Rare')
    
    # Ensure all remaining titles exist in the mapping
    df['Title'] = df['Title'].apply(lambda x: x if x in title_mapping else 'Unknown')
    
    # Map to numeric
    df['Title'] = df['Title'].map(title_mapping)
    
    return df

In [None]:
# Categorize family size (the passenger plus siblings, spouses, parents, and children aboard)
def family_category(size):
    if size == 1:
        return "Single"
    elif size <= 4:
        return "SmallFamily"
    else:
        return "LargeFamily"

In [None]:
# Add family-oriented features: family size, family-size category, and a travelling-alone flag
def process_family_features(df, family_category_func, family_mapping):
    """
    Adds family-related features to a DataFrame:
    - FamilySize: total number of family members aboard
    - FamilyCategory: categorical encoding of family size
    - IsAlone: 1 if the passenger is alone, 0 otherwise

    Parameters:
    - df: pandas DataFrame
    - family_category_func: function to categorize family size
    - family_mapping: dict mapping family categories to numeric values

    Returns:
    - df: DataFrame with new family features
    """
    # Compute family size
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    
    # Categorize family size and map to numeric
    df["FamilyCategory"] = df["FamilySize"].apply(family_category_func).map(family_mapping)
    
    # Flag passengers who are alone
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    
    return df


In [None]:
def process_age_features(df, bins, labels, age_mapping):
    """
    Adds age-related features to a DataFrame:
    - AgeMissing: 1 if Age is missing, 0 otherwise
    - Age: fills missing values using median per Title
    - AgeGroup: numeric age group for modeling

    Parameters:
    - df: pandas DataFrame
    - bins: list of numeric bin edges for age groups
    - labels: list of labels for each age group
    - age_mapping: dict mapping age group labels to numeric codes

    Returns:
    - df: DataFrame with new age features
    """
    # Flag missing ages
    df["AgeMissing"] = df["Age"].isna().astype(int)
    
    # Fill missing ages with median per Title
    df["Age"] = df.groupby("Title")["Age"].transform(lambda x: x.fillna(x.median()))
    
    # Categorize ages into bins
    age_groups = pd.cut(df["Age"], bins=bins, labels=labels)
    
    # Map labels to numeric codes and convert to integer
    df["AgeGroup"] = age_groups.map(age_mapping).astype(int)
    
    return df

In [None]:
# Function to plot confusion matrix
def plot_confusion_matrix(y_true, y_pred, model_name=None, labels=None, figsize=(6, 4), normalize=False):
    """
    Plot a confusion matrix using Seaborn.
    
    Parameters:
        y_true : array-like, true labels
        y_pred : array-like, predicted labels
        model_name : str, optional, name of the model for the title
        labels : list, optional, class labels
        figsize : tuple, optional, size of the figure
        normalize : bool, optional, if True, show each row as fractions of the actual class instead of raw counts
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.figure(figsize=figsize)
    sns.heatmap(cm, annot=True, fmt=".2f" if normalize else "d",
                cmap="Blues", cbar=False,
                xticklabels=labels if labels is not None else True,
                yticklabels=labels if labels is not None else True)
    
    title = "Confusion Matrix"
    if model_name:
        title += f" - {model_name}"
    plt.title(title)
    plt.ylabel('Actual Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
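
To illustrate the `normalize=True` branch of `plot_confusion_matrix`, here is a minimal sketch with a made-up 2x2 matrix of counts (the numbers are hypothetical, not results from this notebook). Dividing each row by its total turns counts into per-class fractions, so every row adds up to 1.

```python
import numpy as np

# Hypothetical counts: rows = actual class, columns = predicted class
cm = np.array([[50, 10],
               [5, 35]])

# Same normalization as in plot_confusion_matrix: divide each row by its total
cm_normalized = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

print(cm_normalized.round(3))
```

Read this way, the diagonal of the normalized matrix is the recall of each class.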

In [None]:
def process_fare_bins(df, column="Fare", bins=4, labels=None):
    """
    Converts a continuous fare column into quantile-based bins, handling missing values,
    and ensures the bin column is numeric.

    Parameters:
    - df: pandas DataFrame
    - column: column name to bin
    - bins: number of quantile bins
    - labels: list of labels for each bin (numeric or categorical)

    Returns:
    - df: DataFrame with new 'FareBin' column as numeric
    """
    # Fill missing fares with median
    df[column] = df[column].fillna(df[column].median())
    
    if labels is None:
        labels = list(range(bins))
    
    # Create quantile bins
    fare_groups = pd.qcut(df[column], q=bins, labels=labels)
    
    # Convert to numeric
    df["FareBin"] = fare_groups.astype(int)
    
    return df

In [None]:
def encode_categorical(df, column, encoding_type="onehot", prefix=None, dummy_na=True):
    """
    Encodes a categorical column in different ways and drops the original column.
    
    Parameters:
    - df: pandas DataFrame
    - column: column name to encode
    - encoding_type: str, type of encoding: "onehot" or "label"
    - prefix: string to prefix dummy columns (only for one-hot encoding)
    - dummy_na: bool, include a column for NaNs (only for one-hot encoding)
    
    Returns:
    - df: DataFrame with encoded column(s)
    """
    
    if encoding_type == "onehot":
        if prefix is None:
            prefix = column
        # One-hot encode with optional NaN column
        dummies = pd.get_dummies(df[column], prefix=prefix, dummy_na=dummy_na).astype(int)
        df[dummies.columns] = dummies
        df.drop(columns=[column], inplace=True)
    
    elif encoding_type == "label":
        # Label encode
        le = LabelEncoder()
        # Fill NaN temporarily to encode
        df[column] = df[column].fillna("NaN")  
        df[column] = le.fit_transform(df[column])
    
    else:
        raise ValueError("Unsupported encoding_type. Choose 'onehot' or 'label'.")
    
    return df

In [None]:
def process_deck(df, all_decks):
    """
    Extract deck from Cabin and one-hot encode it,
    ensuring all columns exist and are in consistent order.
    Drops the original Cabin column.
    """
    # Extract deck, fill missing as "Missing"
    df["Deck"] = df["Cabin"].apply(lambda x: str(x)[0] if pd.notna(x) else "Missing")
    
    # One-hot encode
    deck_dummies = pd.get_dummies(df["Deck"], prefix="Deck").astype(int)
    
    # Add missing columns
    for col in all_decks:
        if col not in deck_dummies:
            deck_dummies[col] = 0
    
    # Ensure column order
    deck_dummies = deck_dummies[all_decks]
    
    # Add one-hot columns to DataFrame
    df[all_decks] = deck_dummies
    
    # Drop original Cabin column
    df.drop(columns=["Cabin"], inplace=True)
    
    return df

In [None]:
def process_ticket(df, ticket_counts=None):
    """
    Processes the Ticket column:
    - Adds TicketGroupSize
    - Extracts TicketPrefix ('None' for purely numeric tickets)
    - Drops original Ticket column

    Parameters:
    - df: pandas DataFrame
    - ticket_counts: precomputed ticket counts (dict or Series). 
                     If None, computes from df.

    Returns:
    - df: processed DataFrame
    - ticket_counts: Series of ticket counts
    """
    if ticket_counts is None:
        ticket_counts = df['Ticket'].value_counts()
    
    # Ticket group size
    df['TicketGroupSize'] = df['Ticket'].map(ticket_counts)
    
    # Ticket prefix
    df['TicketPrefix'] = df['Ticket'].apply(lambda x: str(x).split()[0] if not str(x).isdigit() else 'None')
    
    # Drop original ticket
    df.drop(columns=['Ticket'], inplace=True)
    
    return df, ticket_counts

### **2.2.3. Data Visualization**

In [None]:
# Function to plot most important features
def plot_feature_importance(model, feature_names, top_n=15):
    # Requires a fitted model that exposes `feature_importances_` (e.g. RandomForest, XGBoost)
    importance = model.feature_importances_
    fi = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance
    }).sort_values('Importance', ascending=False).head(top_n)
    
    plt.figure(figsize=(8, 6))
    plt.barh(fi['Feature'], fi['Importance'])
    plt.gca().invert_yaxis()
    plt.title(f"Top {top_n} Feature Importances for {type(model).__name__}")
    plt.xlabel("Importance Score")
    plt.show()

In [None]:
def bar_chart(feature, dataset_name='train', datasets=None):
    """
    Plots a stacked bar chart of Survived vs Not Survived counts for a given feature.
    Shows counts on bars and percentages in parentheses.
    """
    if datasets is None or dataset_name not in datasets:
        raise ValueError("Dataset not found in datasets dictionary")
    
    df = datasets[dataset_name]

    # Count values for each group
    counts = df.groupby(['Survived', feature],observed=False).size().unstack(fill_value=0)

    # Plot stacked bar
    ax = counts.T.plot(kind='bar', stacked=True, figsize=(10,6), color=['red','green'])
    
    plt.title(f'Survival by {feature.capitalize()}')
    plt.xlabel(feature.capitalize())
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.legend(title='Survived', labels=['Not Survived', 'Survived'])

    # Add counts with percentages on bars
    for i, col in enumerate(counts.columns):
        total = counts[col].sum()  # total for this feature value
        bottom = 0
        for j in range(len(counts)):
            height = counts.iloc[j, i]
            if height > 0:
                percent = height / total * 100
                ax.text(
                    i,  # x-coordinate = bar index
                    bottom + height / 2,
                    f'{int(height)} ({percent:.1f}%)',
                    ha='center', va='center', color='white', fontsize=10
                )
            bottom += height
    
    plt.tight_layout()
    plt.show()
    
    return counts


## **2.3. Load Datasets**

In [None]:
dataset_folder = Path(r"C:\Users\Milos\Desktop\ESCAPE 9-5\PYTHON\GitHub Kaggle Projects\1. Titanic Survival Predictor\Data")
datasets = {}

In [None]:
for csv_file in dataset_folder.glob("*.csv"):
    datasets[csv_file.stem] = load_dataset(csv_file)

In [None]:
print(datasets.keys())

##  **3. Exploratory Data Analysis (EDA)**

Exploratory Data Analysis is all about **understanding the dataset**, uncovering patterns, spotting anomalies, and generating insights that will guide feature engineering and modeling.

## What Caused the “Unsinkable” Titanic to Go Down?

1. **11:40 pm** – Titanic strikes an iceberg, and seawater floods her bow.  
2. **12:00 am** – With the keel tilted upward, massive stress strains the hull.  
3. **2:15 am** – The hull begins to break apart; Titanic splits along a joint.  
4. **2:18 am** – The wheelhouse crumbles under the force of the sea.  
5. **2:20 am** – The stern rises into the sky, floats briefly, and then Titanic sinks into the Atlantic.

---

In [None]:
Image(filename=r"C:\Users\Milos\Desktop\ESCAPE 9-5\PYTHON\GitHub Kaggle Projects\1. Titanic Survival Predictor\Images\howtitanicsank.png")

The first five rows of the training dataset:

In [None]:
datasets["train"].head()

##  **Data Dictionary**

| Feature       | Description |
|---------------|-------------|
| **PassengerId** | Unique identifier for each passenger |
| **Survived**    | Survival status (0 = No, 1 = Yes) |
| **Pclass**      | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| **Name**        | Full name of the passenger |
| **Sex**         | Gender of the passenger (male/female) |
| **Age**         | Age of the passenger in years |
| **SibSp**       | Number of siblings or spouses aboard the Titanic |
| **Parch**       | Number of parents or children aboard the Titanic |
| **Ticket**      | Ticket number |
| **Fare**        | Passenger fare (in British pounds) |
| **Cabin**       | Cabin number |
| **Embarked**    | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |



There are 891 rows and 12 columns in our training dataset.

In [None]:
datasets["train"].shape

In [None]:
datasets["train"].info()

The test dataset contains 418 rows and 11 columns. Note that unlike the training dataset, it **does not include the target column `Survived`**, which we aim to predict using our model.

In [None]:
datasets["test"].shape

In [None]:
datasets["test"].info()

We can observe that several features have missing values:

- **Age**: Out of 891 rows in the training dataset, the Age is available for only 714 passengers, meaning 177 values are missing.  
- **Cabin**: The Cabin feature is missing for the majority of passengers, with only 204 out of 891 rows containing a value.
- **Embarked**: The Embarked feature is missing 2 values.    

> Missing values are important to identify, as they may affect model performance and will need to be handled during data preprocessing.
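
As a quick check of the arithmetic above, the sketch below turns the reported non-null counts into missing counts and percentages. The row count and the `Age` and `Cabin` figures are the ones quoted above; the 889 for `Embarked` is derived from its 2 missing values.

```python
# Row count and non-null counts for the training set, as quoted in the text
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing count = total rows minus non-null rows
missing = {col: total_rows - n for col, n in non_null.items()}

# Share of rows with a missing value, in percent
missing_pct = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

Roughly one in five ages and more than three quarters of the cabin values are missing, which is why the two columns are treated differently later on.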


In [None]:
datasets["train"].isna().sum()

It's important to inspect the **test dataset** for missing values before making predictions. The missing values for each column are as follows:

- **Age**: Out of 418 rows in the test dataset, the Age is available for only **332 passengers**, meaning **86 values are missing**.  
- **Cabin**: The Cabin feature is missing for most passengers, with only **91 out of 418 rows** containing a value.  
- **Fare**: There is **1 missing value** in the Fare column.

> Identifying missing values in the test dataset is important, as they need to be handled properly to ensure accurate predictions from our model.


In [None]:
datasets["train"][datasets["train"]["Survived"] == 1]["Sex"].value_counts()

In [None]:
datasets["test"].isna().sum()

In [None]:
datasets["train"].info()

## **3.1. Bar Charts for Categorical Features**

## Women and Children First
The cry of *"women and children first"* echoed across the decks of the Titanic.  
And it was obeyed.  

While chaos spread through the freezing night, women and children were guided into lifeboats. Many men stepped back, allowing others a chance at survival. Some lived. Many did not.  

The fate of each soul depended not only on courage, but also on where they stood on the ship when the iceberg struck.

---

In [None]:
survived_sex = bar_chart("Sex","train",datasets)

## Passenger Class and Survival

The decks of the Titanic were divided not just by cabins, but by **class**—first, second, and third.  
Where you slept often determined whether you lived or perished.  

First-class passengers had easier access to lifeboats, wider staircases, and closer proximity to the deck. Second-class passengers had fewer advantages, and third-class passengers faced long corridors and locked gates.  

In the chaos of the sinking, survival was not just about courage; it was shaped by **where you were in the ship’s hierarchy**.

In [None]:
survived_class = bar_chart("Pclass","train",datasets)

## Port of Embarkation and Survival

Where passengers boarded the Titanic—**Southampton (S), Cherbourg (C), or Queenstown (Q)**—also influenced their chances of survival.  
The port was more than a starting point; it often reflected class, cabin location, and access to lifeboats.  

Passengers who boarded at **Cherbourg (C)** were more likely to be first-class and closer to the upper decks, giving them a higher chance of survival. Those from **Southampton (S)** and **Queenstown (Q)** included more second- and third-class passengers, who faced longer routes to safety and crowded corridors.  

In the tragedy of that night, survival was shaped not only by courage, but also by **where you entered the ship**.

In [None]:
survived_embarked = bar_chart("Embarked","train",datasets)

In [None]:
label_counts = datasets['train']["Survived"].value_counts()
print(label_counts)


label_percentages = datasets['train']["Survived"].value_counts(normalize=True) * 100
print(label_percentages)

In [None]:
x = datasets["train"]["Survived"].value_counts().values 
labels = datasets["train"]["Survived"].value_counts().index 

# Bar plot of how many passengers fall into each class of the target
sns.barplot(x=labels, y=x)
plt.title('Frequency Table of the Label')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Frequency')
plt.show()

# Print the total number of labels
print('Total number of labels: ', sum(x))

### Correlation matrix

In [None]:
corr = datasets["train"].select_dtypes(include=["number"]).corr(method="spearman")
corr1 = corr.abs()

In [None]:
# Mask the upper triangle (diagonal included) so each pair is shown once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(16, 14))
# The heatmap takes its tick labels from corr's index and columns
sns.heatmap(corr, annot=True, fmt=".2f", mask=mask, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
plt.show()

In [None]:
# Same plot for the absolute correlations, to compare strength regardless of sign
mask = np.zeros_like(corr1, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(16, 14))
# The heatmap takes its tick labels from corr1's index and columns
sns.heatmap(corr1, annot=True, fmt=".2f", mask=mask, vmin=0, vmax=1, ax=ax)
plt.show()

In [None]:
datasets["train"]["Embarked"].value_counts()

In [None]:
datasets["train"].describe()

In [None]:
numeric_cols = datasets["train"].select_dtypes(include='number').columns
fig, axes = plt.subplots(nrows=len(numeric_cols), ncols=1, figsize=(10, 5*len(numeric_cols)))

for ax, col in zip(axes, numeric_cols):
    ax.hist(datasets["train"][col], bins=50, color='skyblue', edgecolor='black')
    ax.set_title(col)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")

plt.tight_layout()
plt.show()

## **4. Feature Engineering**

Feature engineering is the process of **transforming raw data into meaningful features** that improve model performance.  
It involves creating new variables, encoding categorical data, handling missing values, and selecting the most informative attributes.  

Good feature engineering leverages insights from Exploratory Data Analysis (EDA) to **highlight patterns, enhance predictive power, and make the data more suitable for modeling**.


### 4.1. Title of Each Passenger

The **title** in each passenger's name, such as *Mr*, *Mrs*, *Miss*, or *Master*, provides valuable information about their **social status, age group, and gender**.  

Titles can help us understand survival patterns on the Titanic, as certain groups (like women and children) were more likely to survive.  
Rare or unusual titles are grouped into a **"Rare"** category to simplify the analysis, and names with no recognizable title, or with a title outside the mapping, are labeled **"Unknown"**.  

By converting titles into **numeric categories**, we create a feature that can improve predictive modeling.
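
The extraction and encoding steps can be sketched on a few names in the dataset's `Surname, Title. Given names` layout. This is an illustration only: `sketch_extract_title` is a hypothetical stand-in that reuses the regular expression from `extract_title`, and the mapping mirrors the `title_mapping` used in this notebook.

```python
import re

import pandas as pd

# Same pattern as extract_title: a comma, a space, one word, then a period
TITLE_PATTERN = re.compile(r", (\w+)\.")

# Sample names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])


def sketch_extract_title(name: str) -> str:
    """Hypothetical stand-in for extract_title, for illustration only."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else "Unknown"


titles = names.apply(sketch_extract_title)

# Encode titles as numeric categories
mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Rare": 4, "Unknown": 5}
title_codes = titles.map(mapping)

print(pd.DataFrame({"Title": titles, "Code": title_codes}))
```

Note that a two-word title such as "the Countess" does not fit the single-word pattern, so it falls back to "Unknown" rather than being matched as "Countess".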


In [None]:
rare_titles = ['Dr', 'Rev', 'Col', 'Major', 'Mlle', 'Countess', 'Ms', 'Lady', 
               'Jonkheer', 'Don', 'Capt', 'Sir']
title_mapping = {"Mr":0, "Miss":1, "Mrs":2, "Master":3, "Rare":4, "Unknown":5}

In [None]:
for key in ['train', 'test']:
    datasets[key] = process_titles(datasets[key], rare_titles, title_mapping)

In [None]:
datasets["train"].isna().sum()

In [None]:
datasets["test"].isna().sum()

In [None]:
survived_title = bar_chart("Title","train",datasets)

### 4.2. Family Size

The **family size** of each passenger is the number of siblings/spouses (`SibSp`) plus the number of parents/children (`Parch`) aboard, plus one for the passenger themselves. It can provide insight into survival patterns.  

Passengers traveling alone often had different survival chances compared to those in larger families. To capture this, we create additional features:  

- **FamilySize**: total number of family members aboard  
- **FamilyCategory**: a categorical representation of family size (Single, SmallFamily, LargeFamily)  
- **IsAlone**: a binary indicator of whether the passenger was traveling alone  

These features help models understand **social dynamics and group behavior**, which were crucial factors during the Titanic disaster.
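
A minimal sketch of the three family features on a hypothetical four-passenger frame. The `categorize` helper is an illustrative copy of the thresholds in `family_category` (1 = Single, 2 to 4 = SmallFamily, 5 or more = LargeFamily).

```python
import pandas as pd

# Hypothetical passengers: siblings/spouses and parents/children aboard
toy = pd.DataFrame({"SibSp": [0, 1, 3, 0], "Parch": [0, 0, 2, 2]})

# Family size counts the passenger themselves, hence the + 1
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1


def categorize(size: int) -> str:
    """Illustrative copy of the family_category thresholds."""
    if size == 1:
        return "Single"
    if size <= 4:
        return "SmallFamily"
    return "LargeFamily"


toy["FamilyCategory"] = toy["FamilySize"].apply(categorize)

# 1 when the passenger travels without any family member
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

`IsAlone` carries the same information as the `Single` category, so a model may end up using only one of the two.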

In [None]:
family_mapping = {'Single': 0, 'SmallFamily': 1, 'LargeFamily': 2}

In [None]:
for key in ["train", "test"]:
    datasets[key] = process_family_features(datasets[key], family_category, family_mapping)

In [None]:
datasets["train"].isna().sum()

In [None]:
datasets["test"].isna().sum()

In [None]:
datasets["train"]["FamilySize"].value_counts()

In [None]:
survived_familysize = bar_chart("FamilySize","train",datasets)

In [None]:
survived_familycategoty = bar_chart("FamilyCategory","train",datasets)

In [None]:
survived_isalone = bar_chart("IsAlone","train",datasets)

### 4.3. Age Category

A passenger's **age** played a role in survival on the Titanic, as children and younger passengers were often prioritized during evacuation.  

To capture this, we create the following features:  

- **AgeMissing**: a flag indicating if the age was missing, which can itself be informative  
- **Age**: missing ages are filled using the median age of passengers with the same **Title**, preserving social/age patterns  
- **AgeGroup**: passengers are categorized into **Child, Teen, Adult, MiddleAge, and Senior**, and these groups are mapped to numeric codes for modeling  

These age-related features help the model capture the relationship between **age and survival** while handling missing ages.
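
The imputation and binning can be sketched on a hypothetical six-row frame with two title codes. Missing ages are filled with the median of the same title group, then `pd.cut` assigns each age to a right-inclusive bin, so an age of exactly 35 still counts as "Adult".

```python
import numpy as np
import pandas as pd

# Hypothetical data: Title 0 = Mr, Title 3 = Master
toy = pd.DataFrame({
    "Title": [0, 0, 0, 3, 3, 3],
    "Age": [30.0, 40.0, np.nan, 4.0, 10.0, np.nan],
})

# Record which ages were missing before filling them
toy["AgeMissing"] = toy["Age"].isna().astype(int)

# Fill each missing age with the median age of the same title group
toy["Age"] = toy.groupby("Title")["Age"].transform(lambda s: s.fillna(s.median()))

# Bin edges and labels as used in this notebook; intervals are (left, right]
bins = [0, 12, 18, 35, 60, 120]
labels = ["Child", "Teen", "Adult", "MiddleAge", "Senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

print(toy)
```

If every age in a title group were missing, the group median would itself be missing and the gap would remain, so it is worth checking `isna().sum()` after this step.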


In [None]:
bins = [0,12,18,35,60,120]
labels = ["Child","Teen","Adult","MiddleAge","Senior"]
age_mapping = {"Child": 0,"Teen": 1,"Adult": 2,"MiddleAge": 3,"Senior": 4}

In [None]:
for key in ["train","test"]:
    datasets[key] = process_age_features(datasets[key],bins,labels,age_mapping)

In [None]:
datasets["train"].isna().sum()

In [None]:
datasets["test"].isna().sum()

In [None]:
survived_age = bar_chart("AgeGroup","train",datasets)

In [None]:
datasets["train"].head()

### 4.4. Fare Price Category

Ticket fares on the Titanic varied greatly, from a few pounds to extravagant sums.  
Because the distribution of fares is highly **skewed**, we convert the continuous `Fare` values into **four quantile-based categories** (quartiles).  

- **FareBin**: divides passengers into 4 groups (0 = lowest fares, 3 = highest fares)  
- This reduces the effect of extreme outliers and allows the model to capture **relative wealth levels** more effectively.  
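
A minimal sketch of quantile binning on hypothetical fares. It also shows one possible variation, which is not what `process_fare_bins` does: the bin edges learned from the training fares are reused for the test fares via `pd.cut`, so that a given bin number covers the same fare range in both sets.

```python
import pandas as pd

# Hypothetical fares, not taken from the dataset
train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 263.0, 512.33])
test_fare = pd.Series([7.9, 30.0, 100.0])

bin_labels = [0, 1, 2, 3]

# Quartile bins on the training fares; retbins=True also returns the bin edges
train_bins, edges = pd.qcut(train_fare, q=4, labels=bin_labels, retbins=True)

# Open the outer edges so test fares outside the training range still get a bin
open_edges = [float("-inf"), *edges[1:-1], float("inf")]

# Variation (assumption, not the notebook's method): reuse the training edges
test_bins = pd.cut(test_fare, bins=open_edges, labels=bin_labels)

print(train_bins.astype(int).tolist())
print(test_bins.astype(int).tolist())
```

Calling `pd.qcut` separately on each dataset, as the helper in this notebook does, always yields four equally filled bins, but the edges can then differ between train and test.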

In [None]:
for key in ["train", "test"]:
    datasets[key] = process_fare_bins(datasets[key])

In [None]:
datasets["train"].isna().sum()

In [None]:
datasets["test"].isna().sum()

### 4.5. Gender

Following the principle of *"women and children first"*, women were far more likely to be given places in lifeboats.  

To capture this, we encode gender into a numeric feature:  

- **SexLabels**:  
  - 0 = Male  
  - 1 = Female  

This transformation allows models to directly use gender as a feature while preserving the critical survival pattern linked to it.


In [None]:
sex_mapping = {"male": 0, "female": 1}
for df in [datasets["train"],datasets["test"]]:
    df["SexLabels"] = df["Sex"].map(sex_mapping)

In [None]:
datasets["train"].head(10)

### 4.6. Embarked One-Hot Encoding

The port of embarkation where passengers boarded the Titanic can provide insight into survival patterns.  

To make this feature usable for machine learning models, we apply **one-hot encoding**:  

- Each port (`C`, `Q`, `S`) is converted into a separate binary column:  
  - `Embarked_C` = 1 if the passenger embarked at Cherbourg, else 0  
  - `Embarked_Q` = 1 if the passenger embarked at Queenstown, else 0  
  - `Embarked_S` = 1 if the passenger embarked at Southampton, else 0  
- **Missing values** are captured in an additional column (`Embarked_nan`) to preserve all information.  

This transformation allows models to directly use the port of embarkation while handling categorical values and missing data efficiently.
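
A minimal sketch of the encoding on a hypothetical four-value column, using `pd.get_dummies` with `dummy_na=True` as in `encode_categorical`. Every row ends up with exactly one 1, and the missing value gets its own `Embarked_nan` column.

```python
import numpy as np
import pandas as pd

# Hypothetical ports, including one missing value
ports = pd.Series(["S", "C", np.nan, "Q"], name="Embarked")

# One binary column per port, plus one column that flags missing values
dummies = pd.get_dummies(ports, prefix="Embarked", dummy_na=True).astype(int)

print(dummies)
```

With `dummy_na=True` the `Embarked_nan` column is created even when a dataset has no missing ports, which keeps the train and test columns aligned.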

In [None]:
for key in ["train", "test"]:
    datasets[key] = encode_categorical(datasets[key], "Embarked", prefix="Embarked")

In [None]:
datasets["train"].isna().sum()

### 4.7. Cabin One-Hot Encoding

Cabins on the Titanic were labeled with letters indicating the **deck level**.  
Passengers’ location on the ship affected their **access to lifeboats** and thus survival chances.  

We extract the **first letter** of the Cabin as the deck and apply **one-hot encoding**:  

- Each deck letter found in the data (`A`–`G` and `T`) and missing cabins are converted into separate binary columns (`Deck_A`, `Deck_B`, …, `Deck_Missing`).  
- This ensures that models can use **deck information numerically** while handling missing values consistently.  
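
A minimal sketch of the deck extraction on hypothetical cabin values. For brevity it uses `DataFrame.reindex` to add absent deck columns and fix their order, rather than the explicit loop in `process_deck`; the shortened `expected_cols` list is an assumption for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including a missing one and a multi-cabin entry
cabins = pd.Series(["C85", np.nan, "E46", "B57 B59"], name="Cabin")

# The deck is the first character of the cabin string
deck = cabins.apply(lambda c: str(c)[0] if pd.notna(c) else "Missing")

# Shortened column list for illustration; the notebook uses the full all_decks list
expected_cols = ["Deck_A", "Deck_B", "Deck_C", "Deck_E", "Deck_Missing"]

# One-hot encode, then add absent decks as all-zero columns in a fixed order
deck_dummies = (
    pd.get_dummies(deck, prefix="Deck")
    .astype(int)
    .reindex(columns=expected_cols, fill_value=0)
)

print(deck_dummies)
```

Fixing the column list up front means the train and test frames get identical deck columns even if a deck appears in only one of them.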

In [None]:
all_decks = ['Deck_A','Deck_B','Deck_C','Deck_D','Deck_E','Deck_F','Deck_G','Deck_Missing','Deck_T']

for key in ["train", "test"]:
    datasets[key] = process_deck(datasets[key], all_decks)

In [None]:
datasets["train"].isna().sum()

### 4.8. Ticket

The `Ticket` column contains alphanumeric ticket numbers.  
While the raw ticket string is not directly useful for modeling, we can extract useful features from it:  

- **TicketGroupSize**: counts how many passengers in the training set share the same ticket, capturing families or travel companions.  
- **TicketPrefix**: extracts any letter or symbol prefix, which may reflect booking type or passenger group.  

After extracting these features, the original `Ticket` column is dropped to keep the dataset clean for modeling.
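
A minimal sketch of both ticket features on hypothetical ticket strings. Group sizes are counted on the training tickets only and then mapped onto the test tickets, so a ticket that never appears in training has no count and is filled with 1.

```python
import pandas as pd

# Hypothetical ticket strings
train_tickets = pd.Series(["113803", "113803", "PC 17599", "A/5 21171"])
test_tickets = pd.Series(["113803", "STON/O2. 3101282"])

# Count how often each ticket occurs in the training set
counts = train_tickets.value_counts()

train_group = train_tickets.map(counts)

# Tickets unseen in training map to NaN, which is replaced by a group size of 1
test_group = test_tickets.map(counts).fillna(1).astype(int)

# Prefix: first whitespace-separated token, or 'None' for purely numeric tickets
train_prefix = train_tickets.apply(
    lambda t: str(t).split()[0] if not str(t).isdigit() else "None"
)

print(train_group.tolist(), test_group.tolist(), train_prefix.tolist())
```

Because the counts come from the training set alone, a group that is split between train and test is undercounted; counting over both sets combined would be an alternative design.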

In [None]:
# Apply to datasets
# Compute ticket counts from training set to ensure consistency
datasets["train"], ticket_counts = process_ticket(datasets["train"])

# Apply same counts to test set
datasets["test"], _ = process_ticket(datasets["test"], ticket_counts=ticket_counts)

In [None]:
# Fill missing TicketGroupSize in test set
datasets["test"]['TicketGroupSize'] = datasets["test"]['TicketGroupSize'].fillna(1).astype(int)

In [None]:
datasets["train"]["TicketPrefix"].value_counts()

The `TicketPrefix` column is messy because it contains many unique values, and most of them appear only a handful of times.  
Using it directly in models is difficult, especially with one-hot encoding, because it would create many sparse columns that add little signal.

In [None]:
# Define frequent prefixes (example: those appearing at least 10 times)
freq_prefixes = datasets["train"]['TicketPrefix'].value_counts()[lambda x: x >= 10].index.tolist()

# Map rare prefixes to "Rare"
for df in [datasets["train"], datasets["test"]]:
    df['TicketPrefix'] = df['TicketPrefix'].apply(lambda x: x if x in freq_prefixes else 'Rare')

In [None]:
datasets["train"]["TicketPrefix"].value_counts()

In [None]:
ticket_prefix_dummies = pd.get_dummies(datasets["train"]['TicketPrefix'], prefix='TicketPrefix').astype(int)
datasets["train"] = pd.concat([datasets["train"], ticket_prefix_dummies], axis=1)

ticket_prefix_dummies_test = pd.get_dummies(datasets["test"]['TicketPrefix'], prefix='TicketPrefix').astype(int)
datasets["test"] = pd.concat([datasets["test"], ticket_prefix_dummies_test], axis=1)

# Ensure same columns in train and test
for col in ticket_prefix_dummies.columns:
    if col not in datasets["test"]:
        datasets["test"][col] = 0
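
The loop above adds dummy columns that are missing from the test set. A hedged alternative, sketched here on toy data with made-up prefix values, is `DataFrame.reindex`, which also drops columns that appear only in the test set and fixes the column order in one step:

```python
import pandas as pd

# Toy prefixes (hypothetical values for illustration)
train_prefix = pd.Series(["PC", "Rare", "CA", "PC"], name="TicketPrefix")
test_prefix = pd.Series(["CA", "CA", "Rare"], name="TicketPrefix")

train_dummies = pd.get_dummies(train_prefix, prefix="TicketPrefix").astype(int)
test_dummies = pd.get_dummies(test_prefix, prefix="TicketPrefix").astype(int)

# Align the test dummies to the training columns; absent categories become 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```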

In [None]:
datasets["train"].drop(columns=["TicketPrefix"], inplace=True)
datasets["test"].drop(columns=["TicketPrefix"], inplace=True)

In [None]:
datasets["train"].head()

### 5.9. Pclass x Deck

In [None]:
for deck in ['Deck_A','Deck_B','Deck_C','Deck_D','Deck_E','Deck_F','Deck_G','Deck_Missing','Deck_T']:
    for df in [datasets["train"], datasets["test"]]:
        df[f'{deck}_Pclass'] = df[deck] * df['Pclass']

### 5.10. Sex x AgeGroup

In [None]:
datasets["train"].info()

In [None]:
for df in [datasets["train"], datasets["test"]]:
    df['Sex_AgeGroup'] = df['SexLabels'] * df['AgeGroup']

### 5.11. FamilySize x Pclass

In [None]:
for df in [datasets["train"], datasets["test"]]:
    df['FamilySize_Pclass'] = df['FamilySize'] * df['Pclass']

## 6) Model Training & Evaluation

### Removing Unnecessary Original Columns

In [None]:
columns_to_drop = ["PassengerId","Name","Sex","Fare","Deck","Age"]

In [None]:
train_clean = datasets["train"].drop(columns=columns_to_drop)
test_clean = datasets["test"].drop(columns=columns_to_drop)

In [None]:
X_train = train_clean.drop(['Survived'], axis=1) 
y_train_true = train_clean['Survived']

In [None]:
X_test = test_clean.copy()

In [None]:
X_test

In [None]:
# Ensure the test set has the same columns, in the same order, as the training set.
# This must happen before scaling: the fitted scaler expects the training columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

In [None]:
# Initialize Scaler
scaler = StandardScaler()
# Fit only on training data, then transform both
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

In [None]:
num_folds = 5
cross_validation = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
error_metrics = ['accuracy', 'roc_auc', 'f1']

In [None]:
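# Fix random_state for the stochastic models so that results are reproducible
models = [
    ('MLP', MLPClassifier(random_state=42)),
    ('RFC', RandomForestClassifier(random_state=42)),
    ('SVC', SVC()),
    ('AdaB', AdaBoostClassifier(random_state=42)),
    ('GBC', GradientBoostingClassifier(random_state=42)),
    ('DTC', DecisionTreeClassifier(random_state=42)),
    ('XGB', XGBClassifier(random_state=42)),
    ('LR', LogisticRegression(max_iter=500)),
]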

### Train Models

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Cross-validation setup (redefines the earlier cell; roc_auc is not scored here)
num_folds = 5
cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
error_metrics = ['accuracy', 'f1']

trained_models = {}
cv_results_summary = []

for name, model in models:
    print(f"Training model: {name}...")
    
    # Fit model on the entire training set
    model.fit(X_train_scaled, y_train_true)
    
    # Store trained model
    trained_models[name] = model
    
    # Cross-validation scores (cross_val_score fits clones, so the model fitted above is untouched)
    metric_scores = {}
    for scoring in error_metrics:
        scores = cross_val_score(model, X_train_scaled, y_train_true, cv=cv, scoring=scoring)
        metric_scores[scoring] = (scores.mean(), scores.std())
        print(f"{name} - {scoring}: Mean={scores.mean():.4f}, Std={scores.std():.4f}")
    
    cv_results_summary.append((name, metric_scores))
    print("-"*50)
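
The loop above calls `cross_val_score` once per metric, so every model is refit for each metric. As a sketch on synthetic data (the dataset and the single `LogisticRegression` model are stand-ins, not the notebook's features), scikit-learn's `cross_validate` can compute several metrics in one pass over the folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X_train_scaled / y_train_true
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_metrics = ["accuracy", "f1"]

# One call evaluates every metric on the same folds
cv_out = cross_validate(
    LogisticRegression(max_iter=500), X_demo, y_demo, cv=cv_demo, scoring=demo_metrics
)

for metric in demo_metrics:
    scores = cv_out[f"test_{metric}"]
    print(f"{metric}: Mean={scores.mean():.4f}, Std={scores.std():.4f}")
```

The returned dictionary holds one array per metric under the key `test_<metric>`, with one score per fold.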

In [None]:
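from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Loop through all trained models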
for name, model in trained_models.items():
    print(f"Evaluating model: {name}")
    
    # Generate cross-validated predictions
    y_train_pred = cross_val_predict(model, X_train_scaled, y_train_true, cv=cv)
    
    # Compute metrics
    acc = accuracy_score(y_train_true, y_train_pred)
    precision = precision_score(y_train_true, y_train_pred)
    recall = recall_score(y_train_true, y_train_pred)
    
    # Print metrics
    print(f"{name} - Accuracy: {acc:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")
    
    # Plot confusion matrix
    plot_confusion_matrix(y_train_true, y_train_pred, name)
    print("-"*50)

In [None]:
# Dictionary to store results
metrics_summary = {
    'Model': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1': []
}

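# Loop through all models (cross_val_predict fits a fresh clone of each one per fold)
for name, model in models:
    print(f"Evaluating {name}...")
    
    # Get cross-validated predictions on the same shuffled, stratified folds as above
    y_pred = cross_val_predict(model, X_train_scaled, y_train_true, cv=cv)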
    
    # Calculate metrics
    acc = accuracy_score(y_train_true, y_pred)
    prec = precision_score(y_train_true, y_pred)
    rec = recall_score(y_train_true, y_pred)
    f1 = f1_score(y_train_true, y_pred)
    
    # Store results
    metrics_summary['Model'].append(name)
    metrics_summary['Accuracy'].append(acc)
    metrics_summary['Precision'].append(prec)
    metrics_summary['Recall'].append(rec)
    metrics_summary['F1'].append(f1)

# Convert to DataFrame
metrics_df = pd.DataFrame(metrics_summary)
print(metrics_df)

# Plot bar charts for each metric
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1']
plt.figure(figsize=(16, 6))

for i, metric in enumerate(metrics_to_plot, 1):
    plt.subplot(1, 4, i)
    sns.barplot(x='Model', y=metric, data=metrics_df, palette='viridis')
    plt.xticks(rotation=45)
    plt.ylim(0, 1)
    plt.title(metric)

plt.tight_layout()
plt.show()

### Feature Importance

In [None]:
plot_feature_importance(trained_models['XGB'], X_train_scaled.columns)

In [None]:
plot_feature_importance(trained_models['RFC'], X_train_scaled.columns)

### Hyperparameter Tuning

In [None]:
# Define models and coarse grids
models_params = {
    "MLP": {
        "model": MLPClassifier(max_iter=500, random_state=42),
        "params": {
            "hidden_layer_sizes": [(50,), (100,)],
            "alpha": [0.0001, 0.001],
            "learning_rate_init": [0.001, 0.01]
        }
    },
    "RFC": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            "n_estimators": [100, 200],
            "max_depth": [None, 5, 10],
            "min_samples_split": [2, 5]
        }
    },
    "SVC": {
        "model": SVC(random_state=42),
        "params": {
            "C": [0.1, 1, 10],
            "kernel": ["rbf", "linear"],
            "gamma": ["scale", "auto"]
        }
    },
    "AdaB": {
        "model": AdaBoostClassifier(random_state=42),
        "params": {
            "n_estimators": [50, 100, 200],
            "learning_rate": [0.5, 1, 1.5]
        }
    },
    "GBC": {
        "model": GradientBoostingClassifier(random_state=42),
        "params": {
            "n_estimators": [100, 200],
            "learning_rate": [0.05, 0.1],
            "max_depth": [3, 5]
        }
    },
    "DTC": {
        "model": DecisionTreeClassifier(random_state=42),
        "params": {
            "max_depth": [None, 5, 10],
            "min_samples_split": [2, 5, 10]
        }
    },
    "XGB": {
        # use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
        "model": XGBClassifier(eval_metric='logloss', random_state=42),
        "params": {
            "n_estimators": [100, 200],
            "learning_rate": [0.05, 0.1],
            "max_depth": [3, 5]
        }
    },
    "LR": {
        "model": LogisticRegression(max_iter=500, random_state=42),
        "params": {
            "C": [0.1, 1, 10],
            "penalty": ["l2"],
            "solver": ["lbfgs"]
        }
    }
}
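
Before running the search it can help to know how large each grid is. A small sketch, using scikit-learn's `ParameterGrid` on two grids copied from the dictionary above, counts the candidate combinations; with 5-fold cross-validation each candidate is fit five times:

```python
from sklearn.model_selection import ParameterGrid

# Two of the grids defined above, repeated here so the snippet runs on its own
demo_grids = {
    "RFC": {"n_estimators": [100, 200], "max_depth": [None, 5, 10], "min_samples_split": [2, 5]},
    "SVC": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]},
}

n_folds = 5
grid_sizes = {}
for name, params in demo_grids.items():
    # ParameterGrid enumerates every combination of the listed values
    grid_sizes[name] = len(ParameterGrid(params))
    print(f"{name}: {grid_sizes[name]} candidates, {grid_sizes[name] * n_folds} fits")
```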

In [None]:
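from sklearn.pipeline import Pipeline

# Perform GridSearchCV for each model.
# Scaling is placed inside a Pipeline so it is re-fit on each CV training fold
# and the tuned estimator accepts the unscaled X_train / X_test directly.
best_models = {}
for name, mp in models_params.items():
    print(f"\nRunning GridSearch for {name}...")
    pipe = Pipeline([("scaler", StandardScaler()), ("model", mp["model"])])
    param_grid = {f"model__{key}": values for key, values in mp["params"].items()}
    grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1', n_jobs=-1)
    grid.fit(X_train, y_train_true)
    print(f"Best F1: {grid.best_score_:.4f} | Best Params: {grid.best_params_}")
    best_models[name] = grid.best_estimator_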

## 7) Making Predictions on the Test Set Using the Best Model

In [None]:
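# Use the tuned XGBoost model from the grid search
xgb_model = best_models['XGB']

# Align the test columns with the training features (missing dummy columns become 0)
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

# Make predictions
y_test_pred = xgb_model.predict(X_test_aligned)

# Survival probabilities (for ROC curves or threshold tuning)
y_test_proba = xgb_model.predict_proba(X_test_aligned)[:, 1]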

# Create a DataFrame for submission
submission = pd.DataFrame({
    'PassengerId': datasets['test']['PassengerId'],
    'Survived': y_test_pred
})

# Save to CSV
submission.to_csv('titanic_xgb_predictions.csv', index=False)

print("Submission file saved: titanic_xgb_predictions.csv")
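
Before uploading, a quick check of the submission format can catch mistakes early. The sketch below uses a tiny made-up frame in place of the real `submission`, and the helper name `check_submission` is hypothetical. For the Kaggle Titanic test set the expected row count is 418.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a Titanic-style submission frame."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Toy frame (made-up values) standing in for the real submission
demo_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(demo_submission, expected_rows=3))  # an empty list means no problems
```

On the real data the call would be `check_submission(submission, expected_rows=418)`.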

## 🏁 Conclusion

- The MLP and XGB models achieved the highest accuracy, at 82%.
- The newly engineered `Title` feature was the most important predictor of survival.
- Future improvements could include a finer hyperparameter search and ensemble stacking.