# Titanic Survival Analysis
## *A Story Hidden in Data*

---

April 15, 1912. The North Atlantic. The RMS Titanic, the largest ship afloat at the time, has struck an iceberg on her maiden voyage. Within hours, she sinks. Of the roughly 2,224 people on board, more than 1,500 perish.

But here's what makes this tragedy hauntingly unequal: **survival wasn't random.** Your chances depended on *who you were.*

Today, you have data on 891 of those passengers: their sex, age, ticket class, fare, and whether they lived or died. The question is simple:

> **Can you figure out *why* some people survived and others didn't?**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="ticks", font_scale=1.15)

# Load data
df = pd.read_csv("../data/titanic.csv")
print(f"Loaded {len(df)} passengers, {df['survived'].sum()} survived ({df['survived'].mean()*100:.1f}%)")

Loaded 891 passengers, 342 survived (38.4%)


In [4]:
df.sample(8, random_state=42)

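| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 1 | 3 | male | NaN | 1 | 1 | 15.2458 | C | Third | man | True | NaN | Cherbourg | yes | False |
| 439 | 0 | 2 | male | 31.0 | 0 | 0 | 10.5 | S | Second | man | True | NaN | Southampton | no | True |
| 840 | 0 | 3 | male | 20.0 | 0 | 0 | 7.925 | S | Third | man | True | NaN | Southampton | no | True |
| 720 | 1 | 2 | female | 6.0 | 0 | 1 | 33.0 | S | Second | child | False | NaN | Southampton | yes | False |
| 39 | 1 | 3 | female | 14.0 | 1 | 0 | 11.2417 | C | Third | child | False | NaN | Cherbourg | yes | False |
| 290 | 1 | 1 | female | 26.0 | 0 | 0 | 78.85 | S | First | woman | False | NaN | Southampton | yes | True |
| 300 | 1 | 3 | female | NaN | 0 | 0 | 7.75 | Q | Third | woman | False | NaN | Queenstown | yes | True |
| 333 | 0 | 3 | male | 16.0 | 2 | 0 | 18.0 | S | Third | man | True | NaN | Southampton | no | False |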


## Before You Look at the Data...

Pause for a moment. Imagine you're on the Titanic as it begins to sink. There aren't enough lifeboats for everyone.

*Who do you think gets on a lifeboat first?* What factors might matter: age? wealth? gender? a cabin close to the lifeboats?

Hold that thought. Now let's see if the data agrees with your intuition. Use the filters below to slice the passengers by sex and ticket class. Watch how the survival rate shifts. Try different combinations.

**What surprises you?**

In [2]:
@widgets.interact(
    sex=widgets.ToggleButtons(options=["All", "Male", "Female"], description="Sex:", button_style=""),
    pclass=widgets.ToggleButtons(options=["All Classes", "1st Class", "2nd Class", "3rd Class"], description="Class:", button_style=""),
)
def explore(sex, pclass):
    filtered = df.copy()
    if sex != "All":
        filtered = filtered[filtered["sex"] == sex.lower()]
    if pclass != "All Classes":
        cls_map = {"1st Class": 1, "2nd Class": 2, "3rd Class": 3}
        filtered = filtered[filtered["pclass"] == cls_map[pclass]]

    if len(filtered) == 0:
        print("No passengers match the selected filters.")
        return

    survived    = filtered["survived"].sum()
    total       = len(filtered)
    surv_rate   = survived / total * 100

    fig, axes = plt.subplots(1, 3, figsize=(14, 5))

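    counts = filtered["survived"].value_counts().rename({0: "Did Not Survive", 1: "Survived"})
    # value_counts() sorts by frequency, so pick each wedge's color by label, not by position
    outcome_colors = {"Did Not Survive": "#E8575A", "Survived": "#5B8FB9"}
    axes[0].pie(counts, labels=counts.index, autopct="%1.1f%%",
                colors=[outcome_colors[label] for label in counts.index], startangle=90,
                wedgeprops={"edgecolor": "white", "linewidth": 2})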
    axes[0].set_title(f"Survival Rate\n{survived}/{total} ({surv_rate:.0f}%)", weight="bold")

    age_data = filtered.dropna(subset=["age"])
    for val, label, color in [(1, "Survived", "#5B8FB9"), (0, "Did Not Survive", "#E8575A")]:
        axes[1].hist(age_data[age_data["survived"] == val]["age"],
                     bins=20, alpha=0.6, label=label, color=color)
    axes[1].set_xlabel("Age")
    axes[1].set_ylabel("Count")
    axes[1].set_title("Age Distribution", weight="bold")
    axes[1].legend()

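    surv_by_class = filtered.groupby("class")["survived"].mean() * 100
    surv_by_class = surv_by_class.reindex(["First", "Second", "Third"]).dropna()
    # Key colors by class name so a bar keeps its color when a single class is selected
    class_colors = {"First": "#F4A261", "Second": "#5B8FB9", "Third": "#6BCB77"}
    axes[2].bar(surv_by_class.index, surv_by_class.values,
                color=[class_colors[name] for name in surv_by_class.index],
                edgecolor="white", linewidth=1.5)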
    axes[2].set_ylabel("Survival Rate (%)")
    axes[2].set_title("Survival Rate by Class", weight="bold")
    axes[2].set_ylim(0, 100)
    for i, v in enumerate(surv_by_class.values):
        axes[2].text(i, v + 2, f"{v:.0f}%", ha="center", fontweight="bold")

    plt.suptitle(f"Titanic — {sex} · {pclass} · {total} passengers",
                 fontsize=13, weight="bold", y=1.02)
    sns.despine()
    plt.tight_layout()
    plt.show()


## From Intuition to Algorithm

By now, you've probably noticed some clear patterns — perhaps that women survived at much higher rates, or that first-class passengers had a significant advantage.
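Those patterns can also be checked without the widget. The sketch below is one possible way to do it: it tabulates the survival rate for every combination of sex and ticket class using `pivot_table`. The helper name `survival_rate_table` is made up for this example, and the demo data is only the `sex`, `pclass`, and `survived` values of the eight passengers sampled earlier, so its numbers illustrate the mechanics and nothing more.

```python
import pandas as pd


def survival_rate_table(passengers: pd.DataFrame) -> pd.DataFrame:
    """Survival rate (%) for each sex x ticket-class combination."""
    rates = passengers.pivot_table(
        index="sex", columns="pclass", values="survived", aggfunc="mean"
    )
    return (rates * 100).round(1)


# Stand-in data: the eight passengers from the sample shown earlier (not the full dataset)
sample_passengers = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female", "female", "male"],
    "pclass": [3, 2, 3, 2, 3, 1, 3, 3],
    "survived": [1, 0, 0, 1, 1, 1, 1, 0],
})

table = survival_rate_table(sample_passengers)
print(table)
```

In the notebook itself, `survival_rate_table(df)` would give the same table for all 891 passengers. A combination with no passengers in the data shows up as `NaN` rather than 0%.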

You built that understanding by looking at the data yourself: filtering, comparing, noticing differences. But what if we could teach a computer to discover these patterns on its own?

That's the idea behind a **decision tree**. It looks at the passenger data and figures out which questions to ask (*"Is this passenger female?" "What class ticket do they hold?"*) to best predict who survives. Nobody tells it which questions matter. It works them out from examples whose outcomes are already known.

Here's the twist: we train it on *most* of the data (80%), then test it on the remaining 20%, passengers it has **never seen**. If it still predicts well, it has truly *learned* the pattern, not just memorized the answers. Use the slider to control how many questions in a row the tree is allowed to ask (its depth), and watch what happens as it grows more complex.
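How does a tree decide which question is "best"? scikit-learn's `DecisionTreeClassifier` defaults to the Gini impurity criterion: a good question splits the passengers into groups that are each as close as possible to all-survived or all-died. The sketch below works through that idea by hand in plain Python. The helper names (`gini`, `split_impurity`) and the two candidate questions are illustrative, and the data is just the eight passengers sampled earlier, far too few to draw conclusions from. The real algorithm also searches numeric thresholds on features such as age and fare.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 mix of two outcomes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(rows, question):
    """Size-weighted Gini impurity of the two groups a yes/no question creates."""
    yes = [survived for features, survived in rows if question(features)]
    no = [survived for features, survived in rows if not question(features)]
    return (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)


# Stand-in data: (features, survived) for the eight passengers sampled earlier
sample_rows = [
    ({"sex": "male", "pclass": 3}, 1),
    ({"sex": "male", "pclass": 2}, 0),
    ({"sex": "male", "pclass": 3}, 0),
    ({"sex": "female", "pclass": 2}, 1),
    ({"sex": "female", "pclass": 3}, 1),
    ({"sex": "female", "pclass": 1}, 1),
    ({"sex": "female", "pclass": 3}, 1),
    ({"sex": "male", "pclass": 3}, 0),
]

# Two hypothetical candidate questions the tree could ask first
questions = {
    "Is this passenger female?": lambda p: p["sex"] == "female",
    "Is this passenger in third class?": lambda p: p["pclass"] == 3,
}

baseline = gini([survived for _, survived in sample_rows])
scores = {text: split_impurity(sample_rows, ask) for text, ask in questions.items()}
best_question = min(scores, key=scores.get)

for text, score in scores.items():
    print(f"{text:<36} impurity {baseline:.3f} -> {score:.3f}")
print("Best first question:", best_question)
```

On these eight rows, asking about sex lowers the impurity from about 0.47 to about 0.19, while asking about third class barely moves it, so the tree would ask about sex first. A deeper tree repeats the same search inside each resulting group.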

In [None]:
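features_df = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]].copy()
features_df["sex"] = (features_df["sex"] == "female").astype(int)
target = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    features_df, target, test_size=0.2, random_state=42
)

# Fill missing values with medians from the training set only,
# so nothing about the test passengers leaks into training
train_medians = X_train.median(numeric_only=True)
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)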

feature_names = ["Ticket Class", "Female?", "Age", "Siblings/Spouse", "Parents/Children", "Fare"]
class_names   = ["Did Not Survive", "Survived"]

@widgets.interact(depth=widgets.IntSlider(
    value=3, min=1, max=6, step=1,
    description="Tree depth:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="400px"),
))
def train_and_show(depth):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    preds    = clf.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    cm       = confusion_matrix(y_test, preds)

    fig, axes = plt.subplots(1, 2, figsize=(16, max(4, depth * 1.8)))

    plot_tree(clf, feature_names=feature_names, class_names=class_names,
              filled=True, rounded=True, fontsize=9, ax=axes[0])
    axes[0].set_title(f"Decision Tree (depth = {depth})", weight="bold", fontsize=13)

    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axes[1],
                xticklabels=class_names, yticklabels=class_names,
                linewidths=1, linecolor="white")
    axes[1].set_xlabel("Predicted", fontsize=11)
    axes[1].set_ylabel("Actual", fontsize=11)
    axes[1].set_title(f"Confusion Matrix — Accuracy: {accuracy*100:.1f}%",
                      weight="bold", fontsize=13)

    plt.tight_layout()
    plt.show()

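    # Training accuracy is shown next to test accuracy so memorization is visible as a gap
    train_accuracy = accuracy_score(y_train, clf.predict(X_train))
    print(f"\nModel accuracy on unseen test data: {accuracy*100:.1f}%")
    print(f"Model accuracy on its own training data: {train_accuracy*100:.1f}%")
    print(f"Training set: {len(X_train)} passengers | Test set: {len(X_test)} passengers")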

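The slider shows one depth at a time. To compare depths side by side, a small loop can record training and test accuracy for each setting. The sketch below is one way to write it. The helper name `depth_sweep` is made up for this example, and the demo runs on synthetic data from `make_classification` (not Titanic passengers) with 10% of the labels deliberately flipped as noise, so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def depth_sweep(X_train, X_test, y_train, y_test, depths):
    """Return (depth, train accuracy, test accuracy) for each max_depth."""
    results = []
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        results.append((depth, train_acc, test_acc))
    return results


# Synthetic stand-in data (NOT Titanic passengers); flip_y adds label noise
X_demo, y_demo = make_classification(
    n_samples=891, n_features=6, n_informative=3, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# max_depth=None lets the tree grow until every training example is classified
sweep = depth_sweep(Xd_train, Xd_test, yd_train, yd_test, depths=[1, 3, 6, None])
for depth, train_acc, test_acc in sweep:
    label = "unlimited" if depth is None else depth
    print(f"max_depth={label!s:<9} train {train_acc:.1%}  test {test_acc:.1%}")
```

In the notebook, `depth_sweep(X_train, X_test, y_train, y_test, depths=range(1, 7))` would cover the same depths as the slider. A training accuracy that keeps climbing while test accuracy stalls or falls is the usual sign that the tree has started to memorize.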

## What Did You Discover?

Think about the journey you just took. You started with a question — *who survives a shipwreck?* — and no clear answer.

You formed a hypothesis, explored the data with your own eyes, and found patterns that revealed an uncomfortable truth: survival on the Titanic was deeply tied to social class and gender. The "women and children first" protocol was real — but it favored those in first class far more than those in third.

Then you watched a machine learn those same patterns entirely on its own. Nobody told the algorithm "women first"; it discovered that rule from the data, just like you did. Did you notice what happened to the test accuracy as the tree got deeper? More complex isn't always better: a deep tree can *memorize* quirks of the training data instead of learning the real pattern, and then it does worse on passengers it has never seen.

That is the heart of data science. Not the code, not the math — but the ability to **ask a meaningful question, explore the evidence, and let the data guide you toward an answer.**

---

*The Titanic dataset is where many data scientists got their start. Now you have too.*