# Bivariate EDA: Crash Severity vs Single Factors

This notebook uses the cleaned crash dataset to explore how **Crash Severity** varies with one factor at a time (e.g., area type, functional class, lighting, driver/vehicle factors). This is **Step 4** of the analysis: we read the cleaned data produced by the data-quality notebook and perform no additional cleaning here.

## 4.1 Setup and load cleaned data

In [None]:
# Setup and load cleaned data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

cleaned_df = pd.read_pickle("./data/cleaned/crash_2018_cleaned.pkl")

cleaned_df["Crash Severity"].value_counts().sort_index()
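
One detail worth checking after loading: `value_counts().sort_index()` orders a plain string column alphabetically (Fatal, Minor, PDO, Serious), whereas an ordered categorical sorts from least to most severe. A pickle preserves dtypes, so if the data-quality notebook already stored the column as an ordered categorical, nothing more is needed. The sketch below uses made-up toy values (not the crash data) and assumes the label set is the four levels used as `order_sev` later in this notebook; `SEVERITY_ORDER` and `toy` are illustrative names.

```python
import pandas as pd

# Assumed label set, taken from the order_sev default used later in this notebook
SEVERITY_ORDER = ["PDO", "Minor", "Serious", "Fatal"]

# Toy stand-in for cleaned_df["Crash Severity"]; values are invented
toy = pd.Series(
    ["PDO", "Fatal", "Minor", "PDO", "Serious", "PDO"], name="Crash Severity"
)

# Labels outside the expected set would silently become NaN in the cast below
unexpected = set(toy.dropna().unique()) - set(SEVERITY_ORDER)
print("Unexpected labels:", unexpected)

# Cast to an ordered categorical so sorting follows severity, not the alphabet
severity_dtype = pd.CategoricalDtype(categories=SEVERITY_ORDER, ordered=True)
ordered = toy.astype(severity_dtype)

counts = ordered.value_counts().sort_index()
print(counts)
```

If `unexpected` is non-empty on the real data, that would point back to the data-quality notebook rather than to a fix here.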

## 4.2 Crash severity vs categorical factors

We start with simple bivariate views: for each categorical factor, we compare the percentage distribution of crash severity across that factor's categories.
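
As a small illustration of what "percentage distribution within each category" means, the sketch below builds a column-normalized crosstab on invented toy data. The `Urban` and `Rural` category values are hypothetical and chosen only for the example; the real `Area Type` categories may differ.

```python
import pandas as pd

# Invented toy data; "Urban"/"Rural" are hypothetical category values
toy_df = pd.DataFrame(
    {
        "Crash Severity": ["PDO", "PDO", "Minor", "Fatal", "PDO", "Minor", "Serious", "PDO"],
        "Area Type": ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural", "Rural", "Urban"],
    }
)

# normalize="columns" divides each column by its own total,
# so every Area Type category sums to 100%
toy_ct = pd.crosstab(
    toy_df["Crash Severity"], toy_df["Area Type"], normalize="columns"
) * 100

# Put the rows in severity order; levels missing from the data become 0
toy_ct = toy_ct.reindex(index=["PDO", "Minor", "Serious", "Fatal"], fill_value=0)

print(toy_ct.round(1))
print(toy_ct.sum(axis=0))  # each column total should be 100
```

Because each column is normalized separately, the percentages are comparable across categories even when the categories have very different crash counts.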


In [None]:
# Variable names to show on plots
nice_names = {
    "Functional Class Recode": "Functional Class",
    "Crash Manner Recode": "Crash Manner",
    "Lighting Conditions Recode": "Lighting Conditions",
    "Visibility Obstruction Recode": "Visibility Obstruction",
    "Vehicle Type Recode": "Vehicle Type",
    "Driver Gender Recode": "Driver Gender",
    "Driver License Validity": "Driver License Validity",
    "Raised Median": "Raised Median",
}

from matplotlib.colors import ListedColormap

severity_colors = ListedColormap(
    [
        "#2ca02c",  # PDO - green
        "#bcbd22",  # Minor - yellow-green
        "#ff7f0e",  # Serious - orange
        "#d62728",  # Fatal - red
    ]
)


def plot_severity_grid(
    df,
    factors,
    severity_col="Crash Severity",
    order_sev=("PDO", "Minor", "Serious", "Fatal"),
    ncols=2,
    figsize=(12, 8),
    cmap=severity_colors,
):
    import math

    n = len(factors)
    nrows = math.ceil(n / ncols)
    # squeeze=False always returns a 2D array of axes, even for a 1x1 grid
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize, squeeze=False)
    axes = axes.flatten()

    for i, factor in enumerate(factors):
        ax = axes[i]
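        # Column-normalize so each category of the factor sums to 100%
        ct = pd.crosstab(df[severity_col], df[factor], normalize="columns") * 100
        # Fix the stacking order; severity levels absent from the data become 0, not NaN
        ct = ct.reindex(index=list(order_sev), fill_value=0)

        # Transpose so each category of the factor becomes one stacked bar.
        # (nice_names maps variable names, not category values, so it is
        # applied to the subplot title below rather than to the columns.)
        ct = ct.T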

        ct.plot(kind="bar", stacked=True, cmap=cmap, ax=ax, legend=False)
        ax.set_title(nice_names.get(factor, factor), fontsize=9)
        ax.set_ylabel("Percent")
        ax.set_xlabel("")
        ax.tick_params(axis="x", labelrotation=30, labelsize=7)

        for patch in ax.patches:
            h = patch.get_height()
            if h < 7:
                continue
            x = patch.get_x() + patch.get_width() / 2
            y = patch.get_y() + h / 2
            ax.text(x, y, f"{h:.0f}%", ha="center", va="center", fontsize=7, color="black")

    # Remove any unused axes left over in the grid
    for j in range(n, len(axes)):
        fig.delaxes(axes[j])

    handles, labels = axes[0].get_legend_handles_labels()
    fig.legend(handles, labels, title=severity_col, bbox_to_anchor=(1.02, 0.5), loc="center left")
    plt.tight_layout()
    plt.show()


print("Distribution of crash severity (%) among categories\n")
plot_severity_grid(
    cleaned_df,
    ["Area Type", "Functional Class Recode", "Lighting Conditions Recode", "Crash Manner Recode"],
)

plot_severity_grid(
    cleaned_df, ["Visibility Obstruction Recode", "Raised Median", "Curvature", "Grade"]
)

plot_severity_grid(cleaned_df, ["Driver Gender Recode", "Driver License Validity"])

plot_severity_grid(cleaned_df, ["Vehicle Type Recode"])
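
Percent-stacked bars hide how many crashes sit behind each bar, so a category with only a handful of records can look dramatic. As a possible companion to the plots, the sketch below tabulates the crash count and the share of Serious or Fatal crashes per category. The helper name `severe_share_table`, the choice of "Serious" and "Fatal" as the severe levels, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def severe_share_table(
    df, factor, severity_col="Crash Severity", severe_levels=("Serious", "Fatal")
):
    """Hypothetical helper: crash count and % severe crashes per category."""
    # Boolean flag: True where the crash is in one of the severe levels
    is_severe = df[severity_col].isin(severe_levels)
    out = (
        df.assign(_severe=is_severe)
        .groupby(factor, observed=True)["_severe"]
        .agg(n_crashes="size", pct_severe="mean")
    )
    # The mean of a boolean flag is a proportion; convert to percent
    out["pct_severe"] = out["pct_severe"] * 100
    return out.sort_values("pct_severe", ascending=False)


# Invented toy data; "Urban"/"Rural" are hypothetical category values
toy_df = pd.DataFrame(
    {
        "Crash Severity": ["PDO", "PDO", "Minor", "Fatal", "PDO", "Minor", "Serious", "PDO"],
        "Area Type": ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural", "Rural", "Urban"],
    }
)

table = severe_share_table(toy_df, "Area Type")
print(table)
```

On the real data this could be called as `severe_share_table(cleaned_df, "Area Type")`; reading `n_crashes` alongside the plots helps flag categories too small to interpret.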

## Observations

...As we can see, 