# Multivariate Analysis of Crash Severity 

This notebook covers **Task - 4: Multivariate Analysis and Transformations**. We model crash severity (PDO, Minor, Serious, Fatal) as an ordered outcome and study how roadway, driver, and environmental factors jointly relate to more severe outcomes. Based on our bivariate results, some variables, such as Weekend, Raised Median, County, Number of Lanes, Driver Gender, and Driver Age, had minimal impact on severity. However, we keep all variables except County (too many categories) and Driver BAC (available only for DUI cases) to develop a full model. Then, based on significance, we aim to develop a reduced model.


# 4.1 Setup and Data Loading

In [None]:
# Setup and cleaned data loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

cleaned_df = pd.read_pickle("../data/cleaned/crash_2018_cleaned.pkl")

## 4.1.1 Selection of Key Variables from Earlier EDA
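Based on our bivariate results, variables such as County, Number of Lanes, Number of Vehicles, and Driver Gender had minimal impact on severity. County, Number of Lanes, and Driver Gender are excluded from the multivariate analysis; Number of Vehicles is still included in the selection below, where it enters the numeric correlation matrix. To reduce skewness, AADT is log-transformed with `log1p`, i.e. log(1 + AADT).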
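As a quick illustration of why the log transform helps, the sketch below compares skewness before and after `np.log1p`. The data are synthetic (a hypothetical lognormal volume series named `aadt_demo`), not the crash records, so the numbers only illustrate the effect.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed traffic volumes (hypothetical; not the crash data)
rng = np.random.default_rng(0)
aadt_demo = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000), name="AADT")

# log1p(x) = log(1 + x), so a zero volume maps to 0 instead of -inf
log_aadt_demo = np.log1p(aadt_demo)

skew_before = aadt_demo.skew()
skew_after = log_aadt_demo.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness close to zero after the transform suggests the variable is closer to symmetric, which makes Pearson correlations less sensitive to a few very high-volume roads.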

In [None]:
# select key variables
multi_vars = [
    "Crash Severity",
    "Weekend",
    "Time of Day",
    "Area Type",
    "Functional Class Recode",
    "Number of Vehicles Num",
    "Curvature",
    "Grade",
    "Raised Median",
    "AADT",
    "Speed Limit Num",
    "Impact Speed Num",
    "Vehicle Type Recode",
    "Driver License Validity",
    "Crash Manner Recode",
    "BAC Available",
    "Visibility Obstruction Recode",
    "Lighting Conditions Recode",
    "Driver Age",
    "Driver BAC",
]

multi_df = cleaned_df[multi_vars].copy()

# transform AADT to reduce skew
multi_df["log_AADT"] = np.log1p(multi_df["AADT"])

# 4.2 Correlation Among Numeric Variables

In [None]:
num_vars = [
    "log_AADT",
    "Speed Limit Num",
    "Impact Speed Num",
    "Driver Age",
    "Number of Vehicles Num",
    "Driver BAC",
]

corr_num = multi_df[num_vars].corr()
corr_num

In [None]:
plt.figure(figsize=(8, 8))
sns.heatmap(corr_num, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation among numeric variables")
plt.tight_layout()
plt.show()

## Observations

**Strong Correlations:**
- No pair of variables shows a strong correlation, positive or negative.

**Moderate Correlations:**
- Speed limit and impact speed are positively correlated (0.56): crashes on roads with higher speed limits tend to have higher impact speeds.

**Weak Correlations:**
- Speed limit and AADT (log) are positively correlated (0.34): roads with higher AADT tend to have higher speed limits.
- Number of vehicles and impact speed are negatively correlated (-0.38): impact speed tends to decrease as the number of vehicles involved in a crash increases.
- Driver age and impact speed show a very weak negative correlation (-0.12).

Most of the remaining correlations are negligible, indicating little or no linear relationship.
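The strong / moderate / weak grouping used above can be made reproducible with a small helper that lists each variable pair once and labels it. This is only a sketch: the function names `label_strength` and `rank_pairs` are hypothetical, and the cut-offs (0.1, 0.4, 0.7) are assumptions chosen to match the grouping in these observations, not a standard. The demo matrix comes from synthetic data.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def label_strength(r, weak=0.1, moderate=0.4, strong=0.7):
    """Map a correlation coefficient to a band (cut-offs are illustrative assumptions)."""
    a = abs(r)
    if a >= strong:
        return "strong"
    if a >= moderate:
        return "moderate"
    if a >= weak:
        return "weak"
    return "negligible"


def rank_pairs(corr):
    """Return each unique variable pair once, sorted by absolute correlation."""
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_1", "var_2", "r"])
    pairs["strength"] = pairs["r"].map(label_strength)
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


# Synthetic demo: y depends on x, z is independent noise
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
demo = pd.DataFrame(
    {"x": x, "y": 0.7 * x + rng.normal(size=2000), "z": rng.normal(size=2000)}
)
ranked = rank_pairs(demo.corr())
print(ranked)
```

In the notebook, `rank_pairs(corr_num)` would produce the same kind of table for the numeric crash variables.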

# 4.3 Pairplots to Visualize the Linear Relationships
For clearer visualization, we use a small random sample (n = 2000).

In [None]:
severity_palette = {
    "PDO": "green",
    "Minor": "palegreen",  # mint-like
    "Serious": "orange",
    "Fatal": "red",
}

sample = multi_df.dropna(
    subset=["Crash Severity", "log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age"]
).sample(n=2000, random_state=0)

sns.pairplot(
    sample,
    vars=["log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age"],
    hue="Crash Severity",
    palette=severity_palette,
    plot_kws=dict(alpha=0.4, s=20),
    diag_kind="hist",
)
plt.suptitle("Pairwise relationships of numeric variables by crash severity", y=1.02)
plt.show()

Since PDO crashes make up a very high proportion of the data, let's look at the relationships excluding PDO crashes.
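The class imbalance can be checked directly with `value_counts(normalize=True)`. As an alternative to dropping a level, the sketch below draws an equal number of rows per severity level so that rare levels stay visible in a plot. The data frame, the class shares, and the `per_class` size are all hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic severities with made-up shares (hypothetical; not the crash data)
rng = np.random.default_rng(5)
levels = ["PDO", "Minor", "Serious", "Fatal"]
demo = pd.DataFrame(
    {
        "Crash Severity": rng.choice(levels, size=10000, p=[0.80, 0.15, 0.04, 0.01]),
        "Impact Speed Num": rng.uniform(0, 70, size=10000),
    }
)

# Share of each severity level in the full data
shares = demo["Crash Severity"].value_counts(normalize=True)
print(shares)

# Draw up to `per_class` rows from each level
per_class = 50
balanced = pd.concat(
    [
        grp.sample(n=min(per_class, len(grp)), random_state=0)
        for _, grp in demo.groupby("Crash Severity")
    ]
)
print(balanced["Crash Severity"].value_counts())
```

A balanced sample is useful for visual comparison only; it no longer reflects the true severity distribution.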

In [None]:
# Filter out PDO to focus on higher severities
sample_severe = (
    multi_df[multi_df["Crash Severity"].isin(["Minor", "Serious", "Fatal"])]
    .dropna(
        subset=["Crash Severity", "log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age"]
    )
    .sample(n=2000, random_state=1)
)

# IMPORTANT: remove unused 'PDO' category
sample_severe["Crash Severity"] = sample_severe["Crash Severity"].cat.remove_unused_categories()

severity_palette_no_pdo = {
    "Minor": "palegreen",
    "Serious": "orange",
    "Fatal": "red",
}

sns.pairplot(
    sample_severe,
    vars=["log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age"],
    hue="Crash Severity",
    palette=severity_palette_no_pdo,
    plot_kws=dict(alpha=0.5, s=25),
    diag_kind="hist",
)
plt.suptitle("Pairwise relationships (excluding PDO)", y=1.02)
plt.show()

## Observations
- Impact speed vs severity: Fatal (red) and Serious (orange) points tend to occur at higher impact speeds than Minor; the lower‑left of the Impact‑Speed panels is mostly light green.

- There is no noticeable overall pattern, but impact speed and speed limit appear to have a positive linear relationship for serious and fatal crashes.

- For fatal crashes (red dots), impact speed seems to be linearly correlated with driver age.

- Other plots: No clear visual separation by severity; colors are fairly mixed.
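Visual impressions from a pairplot can be backed by a number. The sketch below computes the Pearson correlation between two variables within each severity level. The helper `corr_by_group` is hypothetical and the data are synthetic, so the printed values say nothing about the real crashes.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): impact speed rises with speed limit plus noise
rng = np.random.default_rng(2)
n = 600
demo = pd.DataFrame(
    {
        "Crash Severity": rng.choice(["Minor", "Serious", "Fatal"], size=n),
        "Speed Limit Num": rng.choice([25, 35, 45, 55, 65], size=n).astype(float),
    }
)
demo["Impact Speed Num"] = 0.6 * demo["Speed Limit Num"] + rng.normal(0, 10, size=n)


def corr_by_group(df, group_col, x, y):
    """Pearson r between x and y within each level of group_col, with group sizes."""
    rows = []
    for level, grp in df.groupby(group_col, observed=True):
        rows.append({group_col: level, "n": len(grp), "r": grp[x].corr(grp[y])})
    return pd.DataFrame(rows).set_index(group_col)


by_sev = corr_by_group(demo, "Crash Severity", "Speed Limit Num", "Impact Speed Num")
print(by_sev)
```

Reporting `n` next to each coefficient matters here, because a correlation estimated from a small group (such as fatal crashes in a 2000-row sample) is much less stable than one from a large group.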

# 4.4 Encoding Crash Severity and Ranking of Variables

In [None]:
# simple numeric encoding for severity (ordered)
severity_map = {"PDO": 1, "Minor": 2, "Serious": 3, "Fatal": 4}
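# cast to float so .corr() treats the codes as numeric even if "Crash Severity" is categorical
multi_df["Severity_Num"] = multi_df["Crash Severity"].map(severity_map).astype(float)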

# choose a small set of important predictors to encode
cat_for_corr = [
    "Area Type",
    "Time of Day",
    "Curvature",
    "Grade",
    "Vehicle Type Recode",
    "BAC Available",
]

dummies = pd.get_dummies(multi_df[cat_for_corr], drop_first=True)

corr_with_sev = pd.concat(
    [
        multi_df[["Severity_Num", "log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age"]],
        dummies,
    ],
    axis=1,
).corr()

# extract correlations with Severity_Num, dropping its trivial self-correlation (r = 1)
corr_sev = corr_with_sev["Severity_Num"].drop("Severity_Num").sort_values(ascending=False)
corr_sev

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(
    x=corr_sev.values,
    y=corr_sev.index,
    orient="h",
    color="orange",
)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Correlation with severity (Severity_Num)")
plt.ylabel("Variables")
plt.title("Variables ranked by correlation with crash severity")
plt.tight_layout()
plt.show()

In [None]:
# absolute correlations, sorted strongest to weakest
corr_abs = corr_sev.abs().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
sns.barplot(
    x=corr_abs.values,
    y=corr_abs.index,
    orient="h",
    color="orange",
)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Absolute correlation with severity (|r|)")
plt.ylabel("Variables")
plt.title("Strength (absolute correlation)")
plt.tight_layout()
plt.show()

## Observations
- **Impact speed** has the strongest positive association with severity (r ≈ 0.21): higher impact speed tends to mean more severe crashes.

- **Curvature, higher speed limits, nighttime, and grades** all show small positive correlations with severity, consistent with our bivariate results.

- **BAC available** is also positively related to severity, but modestly.

- **Urban area and higher AADT** are negatively correlated with severity (r ≈ −0.14 and −0.11), indicating relatively more severe crashes on lower‑AADT, non‑urban roads.

- **Vehicle type and driver age** show very weak linear correlations with severity in this summary.
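Because `Severity_Num` is an ordinal code, Pearson correlation implicitly treats the steps 1, 2, 3, 4 as equally spaced. A rank-based (Spearman) coefficient uses only the ordering and can serve as a cross-check. The sketch below uses synthetic data and computes Spearman as Pearson on ranks; the variable names and the "stretched" recoding are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): a latent severity that rises with impact speed
rng = np.random.default_rng(3)
n = 3000
speed = pd.Series(rng.uniform(0, 70, size=n), name="Impact Speed Num")
latent = 0.03 * speed + rng.normal(size=n)

# Cut the latent score into ordered codes 1 to 4 (thresholds are arbitrary)
sev_code = pd.Series(np.digitize(latent, [1.5, 2.5, 3.2]) + 1, name="Severity_Num")

pearson_r = speed.corr(sev_code)                  # assumes equal spacing of codes
spearman_r = speed.rank().corr(sev_code.rank())   # uses only the ordering

# Same ordering, different spacing between levels
stretched = sev_code.map({1: 1, 2: 2, 3: 5, 4: 20})
pearson_stretched = speed.corr(stretched)
spearman_stretched = speed.rank().corr(stretched.rank())

print(f"Pearson:  {pearson_r:.3f} -> {pearson_stretched:.3f} after recoding")
print(f"Spearman: {spearman_r:.3f} -> {spearman_stretched:.3f} after recoding")
```

The Spearman value does not change under the recoding, while the Pearson value does, which is why a rank-based ranking of predictors is less dependent on the numeric labels chosen for the severity levels.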

# 4.5 Ranking Influencing Factors for DUI Crashes

In [None]:
# DUI subset
dui_df = multi_df[multi_df["BAC Available"]].copy()

# numeric severity
severity_map = {"PDO": 1, "Minor": 2, "Serious": 3, "Fatal": 4}
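# cast to float so .corr() treats the codes as numeric even if "Crash Severity" is categorical
dui_df["Severity_Num"] = dui_df["Crash Severity"].map(severity_map).astype(float)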

# choose predictors to correlate with severity
cat_for_corr_dui = [
    "Area Type",
    "Time of Day",
    "Curvature",
    "Grade",
    "Vehicle Type Recode",
    "Lighting Conditions Recode",
]

num_vars_dui = ["log_AADT", "Speed Limit Num", "Impact Speed Num", "Driver Age", "Driver BAC"]

dummies_dui = pd.get_dummies(dui_df[cat_for_corr_dui], drop_first=True)

corr_with_sev_dui = pd.concat([dui_df[["Severity_Num"] + num_vars_dui], dummies_dui], axis=1).corr()

# drop the trivial self-correlation (r = 1) before ranking
corr_sev_dui = corr_with_sev_dui["Severity_Num"].drop("Severity_Num").sort_values(ascending=False)
corr_abs_dui = corr_sev_dui.abs().sort_values(ascending=False)

plt.figure(figsize=(10, 5))
sns.barplot(
    x=corr_sev_dui.values,
    y=corr_sev_dui.index,
    orient="h",
    color="orange",
)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Correlation with severity (Severity_Num)")
plt.ylabel("Variables")
plt.title("Variables ranked by correlation with crash severity for DUI cases")
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(
    x=corr_abs_dui.values,
    y=corr_abs_dui.index,
    orient="h",
    color="orange",
)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Absolute correlation with severity (|r|)")
plt.ylabel("Variables")
plt.title("Strength (absolute correlation) for DUI cases")
plt.tight_layout()
plt.show()

## Observations
For DUI cases:
- The speed limit correlation roughly doubles (0.08 → 0.15), suggesting that posted speed is more strongly associated with severity in DUI crashes.

- Area Type_Urban becomes more negative (−0.14 → −0.17): severe DUI crashes are especially concentrated in non‑urban settings.

- Driver Age and Vehicle Type_Other/Truck show slightly stronger positive correlations in the DUI subset, suggesting that older drivers and trucks are somewhat more often involved in severe DUI crashes.

- Some lighting categories show clearer roles:

    - Daylight has a small positive correlation.

    - Night‑lighted / Illuminated categories show negative correlations, hinting that better lighting may slightly reduce severity in DUI crashes.
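The "all crashes vs. DUI" comparisons above can be collected in one table instead of being read off two separate bar charts. The sketch below is one way to do that. The helper `compare_corr` is hypothetical, and the demo uses synthetic data in which the relationship is deliberately stronger inside the flagged subset.

```python
import numpy as np
import pandas as pd


def compare_corr(df, mask, target, predictors):
    """Correlation of each predictor with target: full data vs. the masked subset."""
    full = df[predictors].corrwith(df[target])
    sub = df.loc[mask, predictors].corrwith(df.loc[mask, target])
    out = pd.DataFrame({"all": full, "subset": sub})
    out["change"] = out["subset"] - out["all"]
    # largest shifts first, regardless of sign
    return out.sort_values("change", key=lambda s: s.abs(), ascending=False)


# Synthetic demo (hypothetical): x1 matters more when flag is True, x2 never matters
rng = np.random.default_rng(4)
n = 4000
demo = pd.DataFrame(
    {
        "flag": rng.random(n) < 0.25,
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
    }
)
slope = np.where(demo["flag"], 0.8, 0.2)
demo["target"] = slope * demo["x1"] + rng.normal(size=n)

table = compare_corr(demo, demo["flag"], "target", ["x1", "x2"])
print(table.round(3))
```

In the notebook, a call along the lines of `compare_corr(frame, frame["BAC Available"], "Severity_Num", numeric_columns)` would give the same layout, assuming `frame` holds the numeric predictors and dummy columns. Small shifts should be read with caution, since the DUI subset is smaller and its correlations are noisier.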