# 🚢 Titanic Survival Analysis
## *Can a machine predict who survives?*

---

On April 15, 1912, the RMS Titanic sank after hitting an iceberg.  
**An estimated 1,502 of the 2,224 passengers and crew died.**

For each passenger in our dataset, we know their age, class, sex, fare paid, and whether they survived.

**Two questions we'll answer today:**
1. Which groups of people had the best (and worst) survival rates? → *Data Exploration*
2. Can we build a program that predicts survival from passenger info? → *Machine Learning*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="whitegrid", font_scale=1.15)

# Load data
df = pd.read_csv("../data/titanic.csv")
print(f"Loaded {len(df)} passengers, {df['survived'].sum()} survived ({df['survived'].mean()*100:.1f}%)")
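
Before exploring, it can help to check which columns have gaps, since the later steps drop or fill missing ages. The sketch below uses a few made-up rows (`toy` is a hypothetical stand-in, not the real file) to show one way to count missing values per column with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real notebook loads ../data/titanic.csv instead
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "pclass":   [3, 1, 3, 2, 1],
    "sex":      ["male", "female", "female", "male", "female"],
    "age":      [22.0, 38.0, np.nan, 35.0, np.nan],
    "fare":     [7.25, 71.28, 7.92, 13.00, 53.10],
})

# Count missing values per column, keeping only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `df` would show which real columns need attention.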

## Part 1: Explore the Data

Let's look at who was on board and filter by different characteristics.

In [None]:
@widgets.interact(
    sex=widgets.ToggleButtons(options=["All", "Male", "Female"], description="Sex:", button_style=""),
    pclass=widgets.ToggleButtons(options=["All Classes", "1st Class", "2nd Class", "3rd Class"], description="Class:", button_style=""),
)
def explore(sex, pclass):
    filtered = df.copy()
    if sex != "All":
        filtered = filtered[filtered["sex"] == sex.lower()]
    if pclass != "All Classes":
        cls_map = {"1st Class": 1, "2nd Class": 2, "3rd Class": 3}
        filtered = filtered[filtered["pclass"] == cls_map[pclass]]

    if len(filtered) == 0:
        print("No passengers match the selected filters.")
        return

    survived    = filtered["survived"].sum()
    total       = len(filtered)
    surv_rate   = survived / total * 100

    fig, axes = plt.subplots(1, 3, figsize=(14, 5))

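    # Fixed colour per outcome: value_counts() orders slices by size, so a plain
    # colour list would swap red and blue whenever survivors are the majority
    palette = {"Did Not Survive": "#E8575A", "Survived": "#5B8FB9"}
    counts = filtered["survived"].value_counts().rename({0: "Did Not Survive", 1: "Survived"})
    axes[0].pie(counts, labels=counts.index, autopct="%1.1f%%",
                colors=[palette[label] for label in counts.index], startangle=90,
                wedgeprops={"edgecolor": "white", "linewidth": 2})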
    axes[0].set_title(f"Survival Rate\n{survived}/{total} ({surv_rate:.0f}%)", weight="bold")

    age_data = filtered.dropna(subset=["age"])
    for val, label, color in [(1, "Survived", "#5B8FB9"), (0, "Did Not Survive", "#E8575A")]:
        axes[1].hist(age_data[age_data["survived"] == val]["age"],
                     bins=20, alpha=0.6, label=label, color=color)
    axes[1].set_xlabel("Age")
    axes[1].set_ylabel("Count")
    axes[1].set_title("Age Distribution", weight="bold")
    axes[1].legend()

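    surv_by_class = filtered.groupby("class")["survived"].mean() * 100
    surv_by_class = surv_by_class.reindex(["First", "Second", "Third"]).dropna()
    # Fixed colour per class, so a bar keeps its colour when a class filter is active
    class_colors = {"First": "#F4A261", "Second": "#5B8FB9", "Third": "#6BCB77"}
    axes[2].bar(surv_by_class.index, surv_by_class.values,
                color=[class_colors[c] for c in surv_by_class.index],
                edgecolor="white", linewidth=1.5)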
    axes[2].set_ylabel("Survival Rate (%)")
    axes[2].set_title("Survival Rate by Class", weight="bold")
    axes[2].set_ylim(0, 100)
    for i, v in enumerate(surv_by_class.values):
        axes[2].text(i, v + 2, f"{v:.0f}%", ha="center", fontweight="bold")

    plt.suptitle(f"Titanic: {sex} · {pclass} · {total} passengers",
                 fontsize=13, weight="bold", y=1.02)
    sns.despine()
    plt.tight_layout()
    plt.show()
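
The filters above can also be summarised in one table. Here is a minimal sketch, assuming the same column names (`sex`, `pclass`, `survived`) but using a handful of made-up rows rather than the real data, that computes the survival rate for every sex and class combination at once.

```python
import pandas as pd

# Hypothetical stand-in rows with the same column names the notebook uses
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
rate_by_group = (
    toy.groupby(["sex", "pclass"])["survived"]
       .agg(rate="mean", passengers="count")
)
rate_by_group["rate"] = (rate_by_group["rate"] * 100).round(1)
print(rate_by_group)
```

Swapping `toy` for `df` should give the real rates; small groups deserve caution, because a rate based on a few passengers can swing a lot.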

## Part 2: Machine Learning with a Decision Tree

A **decision tree** is a machine learning model that learns rules like:

```
IF sex = female
   AND class = 1st or 2nd  →  likely SURVIVED
IF sex = male
   AND age > 15            →  likely DID NOT SURVIVE
```
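
Rules like these can be read back from a fitted tree as text. The sketch below is an illustration on made-up passengers (the `female` and `age` columns and their values are hypothetical), using scikit-learn's `export_text` to print what the tree learned.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy passengers: every woman survived, every man did not
X = pd.DataFrame({
    "female": [1, 1, 1, 1, 0, 0, 0, 0],
    "age":    [29, 4, 41, 63, 35, 8, 52, 19],
})
y = [1, 1, 1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned splits as indented if/else rules
rules = export_text(tree, feature_names=["female", "age"])
print(rules)
```

In this toy data a single split on `female` separates the two outcomes, so the tree stops after one rule. Real data is messier and needs more splits.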

We'll:
1. Split data into **training** (80%) and **test** (20%) sets
2. Train the tree on the training data
3. Test how accurate it is on data it's *never seen before*

In [None]:
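features_df = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]].copy()
features_df["sex"] = (features_df["sex"] == "female").astype(int)  # 1 = female, 0 = male
target = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    features_df, target, test_size=0.2, random_state=42
)

# Fill missing values with medians learned from the training set only,
# so nothing about the test passengers leaks into training
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)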

feature_names = ["Ticket Class", "Female?", "Age", "Siblings/Spouse", "Parents/Children", "Fare"]
class_names   = ["Did Not Survive", "Survived"]

@widgets.interact(depth=widgets.IntSlider(
    value=3, min=1, max=6, step=1,
    description="Tree depth:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="400px"),
))
def train_and_show(depth):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    preds    = clf.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    cm       = confusion_matrix(y_test, preds)

    fig, axes = plt.subplots(1, 2, figsize=(16, max(4, depth * 1.8)))

    plot_tree(clf, feature_names=feature_names, class_names=class_names,
              filled=True, rounded=True, fontsize=9, ax=axes[0])
    axes[0].set_title(f"Decision Tree (depth = {depth})", weight="bold", fontsize=13)

    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axes[1],
                xticklabels=class_names, yticklabels=class_names,
                linewidths=1, linecolor="white")
    axes[1].set_xlabel("Predicted", fontsize=11)
    axes[1].set_ylabel("Actual", fontsize=11)
    axes[1].set_title(f"Confusion Matrix (accuracy: {accuracy*100:.1f}%)",
                      weight="bold", fontsize=13)

    plt.tight_layout()
    plt.show()

    print(f"\nModel accuracy on unseen test data: {accuracy*100:.1f}%")
    print(f"Training set: {len(X_train)} passengers | Test set: {len(X_test)} passengers")
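
To see how the accuracy figure relates to the confusion matrix, here is a small sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not model output). It unpacks the four cells and recomputes accuracy from them.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# For binary labels [0, 1], rows are actual and columns are predicted:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal
accuracy = (tp + tn) / cm.sum()
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy:.2f}")
```

The two off-diagonal cells are the mistakes: a false positive predicts survival for someone who died, and a false negative predicts death for someone who survived.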

## Key Takeaways

1. **Data tells stories.** The Titanic data shows stark differences: women and 1st-class passengers had far higher survival rates.
2. **Machine learning learns patterns from examples.** The decision tree found these rules automatically; we never told it "women first".
3. **Deeper trees = more complex rules.** But if the tree is too deep, the model *memorizes* the training data instead of learning general patterns. This is called **overfitting**.
4. **The confusion matrix** shows exactly where the model makes mistakes: predicting that a passenger died when they actually survived, and vice versa.
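
One way to spot overfitting is to compare accuracy on the training data with accuracy on the test data as the tree gets deeper. The sketch below uses synthetic data from `make_classification` with deliberately noisy labels. It is an assumed stand-in for illustration, not the Titanic data, and the depth values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (flip_y), not Titanic data
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []  # (max_depth, train accuracy, test accuracy)
for depth in [1, 2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    results.append((depth, clf.score(X_train, y_train), clf.score(X_test, y_test)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth!s:>4}  train={train_acc:.2f}  test={test_acc:.2f}")
```

A widening gap between the two columns is the warning sign: the unlimited tree fits its training data perfectly, yet it cannot do the same on passengers it has not seen.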

---
*This dataset is also used in real ML courses at top universities: you've just done what data science students do!*