# Aviation Accidents — Safety Metrics Across Models/Makes

This notebook explores safety metrics derived in the cleaning notebook, focusing on:

- **Serious/Fatal injury fraction** (`SeriousOrFatal.Rate`) as the primary injury risk metric  
- **Destroyed fraction** (`Was.Destroyed`) as the primary robustness-to-destruction metric  
- Separate recommendations for **small** (≤ 20) vs **large** (> 20) airplanes, split on estimated passengers (`Passengers.Est`) with a threshold of **20**
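
The cleaning notebook is not shown here, so the following is only a minimal sketch of how the derived columns could be built. The raw column names (`Total.Fatal.Injuries`, `Total.Serious.Injuries`, `Total.Minor.Injuries`, `Total.Uninjured`, `Aircraft.damage`) and the toy values are assumptions for illustration, and the actual cleaning logic may differ.

```python
import pandas as pd

# Toy accident records; the raw column names are assumptions, not confirmed by this notebook.
raw = pd.DataFrame({
    "Total.Fatal.Injuries":   [0, 2, 0, 1],
    "Total.Serious.Injuries": [0, 1, 0, 0],
    "Total.Minor.Injuries":   [1, 0, 0, 0],
    "Total.Uninjured":        [3, 1, 30, 0],
    "Aircraft.damage":        ["Substantial", "Destroyed", "Minor", "Destroyed"],
})

injury_cols = ["Total.Fatal.Injuries", "Total.Serious.Injuries",
               "Total.Minor.Injuries", "Total.Uninjured"]

# Estimated persons on board: injured plus uninjured.
raw["Passengers.Est"] = raw[injury_cols].sum(axis=1)

# Fraction of persons on board with a serious or fatal injury.
serious_or_fatal = raw["Total.Fatal.Injuries"] + raw["Total.Serious.Injuries"]
raw["SeriousOrFatal.Rate"] = serious_or_fatal / raw["Passengers.Est"]

# 1 if the aircraft was destroyed, else 0; a group mean of this is the destroyed fraction.
raw["Was.Destroyed"] = (raw["Aircraft.damage"] == "Destroyed").astype(int)

# Same split rule as this notebook: small means at most 20 estimated passengers.
raw["Size.Class"] = raw["Passengers.Est"].le(20).map({True: "small", False: "large"})

print(raw[["Passengers.Est", "SeriousOrFatal.Rate", "Was.Destroyed", "Size.Class"]])
```

Because `Was.Destroyed` is a 0/1 indicator in this sketch, averaging it within a make gives the share of that make's events that ended with the aircraft destroyed.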


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load cleaned data produced from the cleaning notebook
df = pd.read_csv("AviationData_Cleaned.csv", low_memory=False)

# Basic inspection
print("Shape:", df.shape)
display(df.head())
df.info()  # prints its own summary and returns None, so it should not be wrapped in display()


In [None]:
# Split into 'small' vs 'large' airplanes using passenger threshold of 20
# Assumption: Passengers.Est is a conservative estimate of onboard persons (injured + uninjured).
PASSENGER_THRESHOLD = 20

df["Event.Date"] = pd.to_datetime(df["Event.Date"], errors="coerce")

df_small = df[df["Passengers.Est"] <= PASSENGER_THRESHOLD].copy()
df_large = df[df["Passengers.Est"] > PASSENGER_THRESHOLD].copy()

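print("Small rows:", df_small.shape[0])
print("Large rows:", df_large.shape[0])
# Rows with a missing Passengers.Est fail both comparisons and land in neither group
print("Unassigned rows (missing Passengers.Est):", df["Passengers.Est"].isna().sum())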


In [None]:
def mean_by_category(df_group, category_col, metric_col, min_count=10):
    """Mean and non-null count of metric_col per category.

    Keeps categories with at least min_count values, sorted by ascending mean.
    """
    return (df_group
            .groupby(category_col)[metric_col]
            .agg(['mean', 'count'])
            .reset_index()
            .query("count >= @min_count")
            .sort_values("mean", ascending=True)
           )

def top_lowest_mean(df_group, group_col, metric_col, n=15, min_count=10):
    """Return a dataframe of the n categories with lowest mean metric, requiring min_count."""
    return mean_by_category(df_group, group_col, metric_col, min_count=min_count).head(n)


In [None]:
# Top 15 makes (small and large) with lowest mean SeriousOrFatal.Rate
top15_small_makes = top_lowest_mean(df_small, "Make", "SeriousOrFatal.Rate", n=15, min_count=50)
top15_large_makes = top_lowest_mean(df_large, "Make", "SeriousOrFatal.Rate", n=15, min_count=50)

display(top15_small_makes)
display(top15_large_makes)

# Side-by-side plot (small vs large)
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=False)

axes[0].barh(top15_small_makes["Make"][::-1], top15_small_makes["mean"][::-1])
axes[0].set_title("Small airplanes: 15 lowest mean serious/fatal fraction")
axes[0].set_xlabel("Mean SeriousOrFatal.Rate")

axes[1].barh(top15_large_makes["Make"][::-1], top15_large_makes["mean"][::-1])
axes[1].set_title("Large airplanes: 15 lowest mean serious/fatal fraction")
axes[1].set_xlabel("Mean SeriousOrFatal.Rate")

plt.tight_layout()
plt.show()


In [None]:
# Violin plot: distribution of SeriousOrFatal.Rate for the 10 lowest-mean makes (small airplanes)
top10_small_makes = top_lowest_mean(df_small, "Make", "SeriousOrFatal.Rate", n=10, min_count=50)
makes_order = top10_small_makes["Make"].tolist()

data = [df_small.loc[df_small["Make"] == m, "SeriousOrFatal.Rate"].dropna().values for m in makes_order]

fig, ax = plt.subplots(figsize=(14, 6))
vp = ax.violinplot(data, showmeans=True, showextrema=False)

ax.set_xticks(np.arange(1, len(makes_order)+1))
ax.set_xticklabels(makes_order, rotation=45, ha="right")
ax.set_title("Small airplanes: distribution of serious/fatal fraction (10 lowest-mean makes)")
ax.set_ylabel("SeriousOrFatal.Rate")
plt.tight_layout()
plt.show()

display(top10_small_makes)


In [None]:
# Strip-plot style (jittered scatter) for the 10 lowest-mean makes (large airplanes)
top10_large_makes = top_lowest_mean(df_large, "Make", "SeriousOrFatal.Rate", n=10, min_count=50)
makes_order = top10_large_makes["Make"].tolist()

fig, ax = plt.subplots(figsize=(14, 6))

rng = np.random.default_rng(0)
for i, m in enumerate(makes_order, start=1):
    y = df_large.loc[df_large["Make"] == m, "SeriousOrFatal.Rate"].dropna().values
    x = i + rng.normal(0, 0.06, size=len(y))  # jitter
    ax.scatter(x, y, alpha=0.35, s=10)

ax.set_xticks(np.arange(1, len(makes_order)+1))
ax.set_xticklabels(makes_order, rotation=45, ha="right")
ax.set_title("Large airplanes: distribution of serious/fatal fraction (10 lowest-mean makes)")
ax.set_ylabel("SeriousOrFatal.Rate")
plt.tight_layout()
plt.show()

display(top10_large_makes)


In [None]:
# Evaluate destroyed fraction by make for both small and large aircraft
destroyed_small = top_lowest_mean(df_small, "Make", "Was.Destroyed", n=15, min_count=50)
destroyed_large = top_lowest_mean(df_large, "Make", "Was.Destroyed", n=15, min_count=50)

destroyed_small = destroyed_small.rename(columns={"mean": "Destroyed.Fraction"})
destroyed_large = destroyed_large.rename(columns={"mean": "Destroyed.Fraction"})

display(destroyed_small)
display(destroyed_large)

fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=False)

axes[0].barh(destroyed_small["Make"][::-1], destroyed_small["Destroyed.Fraction"][::-1])
axes[0].set_title("Small airplanes: 15 lowest destroyed fraction (by make)")
axes[0].set_xlabel("Mean Was.Destroyed")

axes[1].barh(destroyed_large["Make"][::-1], destroyed_large["Destroyed.Fraction"][::-1])
axes[1].set_title("Large airplanes: 15 lowest destroyed fraction (by make)")
axes[1].set_xlabel("Mean Was.Destroyed")

plt.tight_layout()
plt.show()


## Discussion: Makes (Summary)

**Injury risk (Serious/Fatal fraction):**
- For **small airplanes**, the plot highlights the *15 makes* with the lowest mean `SeriousOrFatal.Rate` among makes with ≥50 events.
- For **large airplanes**, the analogous plot shows the makes with the lowest mean `SeriousOrFatal.Rate` under the same ≥50-event minimum.

**Destroyed fraction:**
- The destroyed-fraction plot provides a robustness view: for makes with a low mean `Was.Destroyed`, a smaller share of recorded events ended with the aircraft destroyed.

**Recommendations approach:**
- Prefer makes that are consistently low on **both** (i) serious/fatal injury fraction **and** (ii) destroyed fraction.
- Use the distribution plots (violin/strip-style) to check **variance**: a low mean with heavy tails may indicate occasional severe outcomes.

> Note: These statistics are descriptive. They do not prove causality and are sensitive to reporting, operational context, and exposure mix.
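
One way to act on the "low on both metrics" recommendation is to join the two per-make summaries and rank on the worse of the two positions. The sketch below uses made-up makes and numbers shaped like the outputs of `top_lowest_mean`; the `Worst.Rank` column is a hypothetical helper, not something the notebook computes.

```python
import pandas as pd

# Hypothetical per-make summaries, shaped like the outputs of top_lowest_mean().
injury = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "mean": [0.05, 0.08, 0.10, 0.12],
    "count": [120, 300, 75, 60],
})
destroyed = pd.DataFrame({
    "Make": ["B", "A", "E", "C"],
    "Destroyed.Fraction": [0.02, 0.04, 0.05, 0.09],
    "count": [300, 120, 90, 75],
})

# Keep only makes that appear in both shortlists.
both = injury.rename(columns={"mean": "Injury.Fraction"}).merge(
    destroyed[["Make", "Destroyed.Fraction"]], on="Make", how="inner"
)

# Rank each metric (1 = lowest) and sort by the worse of the two ranks,
# so a make must do well on both metrics to place near the top.
both["Injury.Rank"] = both["Injury.Fraction"].rank(method="min")
both["Destroyed.Rank"] = both["Destroyed.Fraction"].rank(method="min")
both["Worst.Rank"] = both[["Injury.Rank", "Destroyed.Rank"]].max(axis=1)
both = both.sort_values(["Worst.Rank", "Injury.Fraction"]).reset_index(drop=True)

print(both)
```

Sorting on the worse rank is only one possible design choice; a weighted sum of the two fractions would favor makes that are excellent on one metric and mediocre on the other.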


In [None]:
# Plane type analysis: mean serious/fatal fraction for small vs large planes

def plane_type_summary(df_group, min_count=10):
    return (df_group
            .groupby("Plane.Type")["SeriousOrFatal.Rate"]
            .agg(['mean','count'])
            .reset_index()
            .query("count >= @min_count")
            .sort_values("mean", ascending=True)
           )

pt_small = plane_type_summary(df_small, min_count=10)
pt_large = plane_type_summary(df_large, min_count=10)

# Plot: show top 20 safest plane types in each group for readability
topn = 20
fig, axes = plt.subplots(1, 2, figsize=(14, 7), sharey=False)

axes[0].barh(pt_small["Plane.Type"].head(topn)[::-1], pt_small["mean"].head(topn)[::-1])
axes[0].set_title("Small airplanes: mean serious/fatal fraction (top 20 safest plane types)")
axes[0].set_xlabel("Mean SeriousOrFatal.Rate")

axes[1].barh(pt_large["Plane.Type"].head(topn)[::-1], pt_large["mean"].head(topn)[::-1])
axes[1].set_title("Large airplanes: mean serious/fatal fraction (top 20 safest plane types)")
axes[1].set_xlabel("Mean SeriousOrFatal.Rate")

plt.tight_layout()
plt.show()

display(pt_small.head(30))
display(pt_large.head(30))


In [None]:
# Distributional plot by airplane type (choice: strip-style jitter scatter)
# Filter to plane types with >=10 events and limit to top 10 safest types in each group.

rng = np.random.default_rng(1)

def strip_by_plane_type(df_group, title, top_k=10, min_count=10):
    pt = (df_group.groupby("Plane.Type")["SeriousOrFatal.Rate"]
          .agg(['mean','count'])
          .reset_index()
          .query("count >= @min_count")
          .sort_values("mean", ascending=True)
          .head(top_k))
    types = pt["Plane.Type"].tolist()
    
    fig, ax = plt.subplots(figsize=(14, 6))
    for i, t in enumerate(types, start=1):
        y = df_group.loc[df_group["Plane.Type"] == t, "SeriousOrFatal.Rate"].dropna().values
        x = i + rng.normal(0, 0.06, size=len(y))
        ax.scatter(x, y, alpha=0.35, s=10)
    ax.set_xticks(np.arange(1, len(types)+1))
    ax.set_xticklabels(types, rotation=45, ha="right")
    ax.set_title(title)
    ax.set_ylabel("SeriousOrFatal.Rate")
    plt.tight_layout()
    plt.show()
    display(pt)

strip_by_plane_type(df_small, "Small airplanes: serious/fatal fraction distribution (10 safest plane types)")
strip_by_plane_type(df_large, "Large airplanes: serious/fatal fraction distribution (10 safest plane types)")


In [None]:
# Additional requirement:
# "Filter plane types, ensuring that you have at least 10 individual examples in each model/make to average over.
#  For smaller planes, limit your plotted results to the makes with the 10 lowest mean serious/fatal injury fractions."

# Identify 10 safest small-airplane makes (by mean serious/fatal fraction, with >=50 events)
safe10_small_makes = top_lowest_mean(df_small, "Make", "SeriousOrFatal.Rate", n=10, min_count=50)["Make"].tolist()
df_small_safe_makes = df_small[df_small["Make"].isin(safe10_small_makes)].copy()

pt_small_safe_makes = (df_small_safe_makes.groupby("Plane.Type")["SeriousOrFatal.Rate"]
                       .agg(['mean','count'])
                       .reset_index()
                       .query("count >= 10")
                       .sort_values("mean", ascending=True))

# Plot top 20 plane types within the 10 safest makes
fig, ax = plt.subplots(figsize=(14, 7))
ax.barh(pt_small_safe_makes["Plane.Type"].head(20)[::-1], pt_small_safe_makes["mean"].head(20)[::-1])
ax.set_title("Small airplanes: plane types within 10 safest makes (20 lowest mean serious/fatal fraction)")
ax.set_xlabel("Mean SeriousOrFatal.Rate")
plt.tight_layout()
plt.show()

display(pd.DataFrame({"Safe10.Small.Makes": safe10_small_makes}))
display(pt_small_safe_makes.head(30))


## Discussion: Specific Airplane Types

- Compare the *best-performing* airplane types in the small vs large segments.  
- Look for airplane types that have both **low average serious/fatal fraction** and **tight distributions** (few extreme outliers).
- Remember to respect the minimum sample size (≥10 events) to avoid unstable averages.
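
To put numbers on "tight distributions" and "unstable averages", the per-type mean can be reported next to its standard deviation and standard error. This is a minimal sketch on toy data; the type labels `X`, `Y`, `Z` and the lowered `MIN_COUNT` are illustrative assumptions.

```python
import pandas as pd

# Toy per-event rates for three hypothetical plane types.
events = pd.DataFrame({
    "Plane.Type": ["X"] * 4 + ["Y"] * 4 + ["Z"] * 2,
    "SeriousOrFatal.Rate": [0.0, 0.0, 0.0, 1.0,
                            0.2, 0.3, 0.2, 0.3,
                            0.0, 0.0],
})

MIN_COUNT = 4  # toy threshold for this example; the notebook uses 10

spread = (events.groupby("Plane.Type")["SeriousOrFatal.Rate"]
          .agg(["mean", "std", "count"])
          .reset_index())

# Standard error of the mean: larger values flag less stable averages.
spread["sem"] = spread["std"] / spread["count"] ** 0.5

# Drop types with too few events, then order by mean and spread.
spread = (spread[spread["count"] >= MIN_COUNT]
          .sort_values(["mean", "std"])
          .reset_index(drop=True))

print(spread)
```

Types `X` and `Y` have the same mean here, but `X` gets it from one severe event among otherwise injury-free ones, which shows up as a much larger `std` and `sem`. Type `Z` is dropped because it has too few events.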


In [None]:
# Exploring Other Variables
# We'll analyze two factors:
# 1) Weather.Condition
# 2) Broad.phase.of.flight
#
# For each, we look at:
# - Mean serious/fatal injury fraction
# - Mean destroyed fraction
# - Sample counts
#
# We keep categories with at least 100 events to reduce noise.

def summarize_factor(df_in, factor, min_count=100):
    s = (df_in.groupby(factor)
         .agg(
             InjuryRateMean=("SeriousOrFatal.Rate", "mean"),
             DestroyedMean=("Was.Destroyed", "mean"),
             Count=(factor, "size")
         )
         .reset_index()
         .query("Count >= @min_count")
         .sort_values("InjuryRateMean", ascending=True))
    return s

weather_sum = summarize_factor(df, "Weather.Condition", min_count=100)
phase_sum = summarize_factor(df, "Broad.phase.of.flight", min_count=100)

display(weather_sum)
display(phase_sum)

# Weather: bar plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].barh(weather_sum["Weather.Condition"][::-1], weather_sum["InjuryRateMean"][::-1])
axes[0].set_title("Weather condition vs mean serious/fatal fraction")
axes[0].set_xlabel("Mean SeriousOrFatal.Rate")

axes[1].barh(weather_sum["Weather.Condition"][::-1], weather_sum["DestroyedMean"][::-1])
axes[1].set_title("Weather condition vs mean destroyed fraction")
axes[1].set_xlabel("Mean Was.Destroyed")

plt.tight_layout()
plt.show()

# Phase: bar plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].barh(phase_sum["Broad.phase.of.flight"][::-1], phase_sum["InjuryRateMean"][::-1])
axes[0].set_title("Phase of flight vs mean serious/fatal fraction")
axes[0].set_xlabel("Mean SeriousOrFatal.Rate")

axes[1].barh(phase_sum["Broad.phase.of.flight"][::-1], phase_sum["DestroyedMean"][::-1])
axes[1].set_title("Phase of flight vs mean destroyed fraction")
axes[1].set_xlabel("Mean Was.Destroyed")

plt.tight_layout()
plt.show()


## Discussion: Other Variables

### Weather Condition
- Compare mean serious/fatal fraction and mean destroyed fraction across weather categories.
- Typically, **IMC** (instrument meteorological conditions, i.e. poor-weather) categories show elevated risk, but this should be interpreted alongside sample sizes and operational context.
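
To read a category mean alongside its sample size, one option is a percentile bootstrap interval for the mean. The helper `bootstrap_mean_ci` below is a hypothetical function written for this sketch, and the two samples are toy data rather than values from the dataset.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

# Toy per-event rates for two weather categories (made-up data).
vmc = [0.0] * 180 + [1.0] * 20   # many events, mean 0.10
imc = [0.0] * 6 + [1.0] * 4      # few events, mean 0.40

for label, vals in [("VMC", vmc), ("IMC", imc)]:
    mean, lo, hi = bootstrap_mean_ci(vals)
    print(f"{label}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), n={len(vals)}")
```

The category with few events gets a much wider interval, so a large gap between two means is less convincing when one of them rests on a small sample.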

### Phase of Flight
- Accident severity often concentrates in **takeoff/climb** and **approach/landing** phases due to reduced altitude margins and workload.
- Use the destroyed-fraction plot to see where hull loss is more common, and compare against injury rate means.
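
The comparison between destroyed fraction and injury rate can be made explicit by ranking phases on each metric and looking at where the ranks disagree. The sketch below uses a toy table shaped like `phase_sum`; the numbers are made up and the `Rank.Gap` column is a hypothetical helper.

```python
import pandas as pd

# Toy summary shaped like phase_sum; the numbers are made up.
phase = pd.DataFrame({
    "Broad.phase.of.flight": ["Taxi", "Landing", "Cruise", "Takeoff", "Maneuvering"],
    "InjuryRateMean": [0.02, 0.05, 0.30, 0.25, 0.45],
    "DestroyedMean":  [0.01, 0.04, 0.28, 0.30, 0.40],
})

# Rank both metrics (1 = lowest) and compute where the ranks disagree.
phase["Injury.Rank"] = phase["InjuryRateMean"].rank()
phase["Destroyed.Rank"] = phase["DestroyedMean"].rank()
phase["Rank.Gap"] = phase["Destroyed.Rank"] - phase["Injury.Rank"]

# Spearman correlation computed as the Pearson correlation of the ranks.
rho = phase["Injury.Rank"].corr(phase["Destroyed.Rank"])

print(phase.sort_values("Rank.Gap"))
print("Spearman rho:", round(rho, 3))
```

A positive `Rank.Gap` marks a phase where hull loss ranks worse than injury severity, and a negative one marks the reverse.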
