**Aviation Accident Risk Assessment: Analyzing Aircraft Safety**

For this project, we will be working with a [dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses/data) sourced from the National Transportation Safety Board (NTSB), which contains information on aviation accidents that occurred between 1962 and 2022. The dataset includes several key attributes, such as:

**Aircraft.Type:** The model of the aircraft involved in the accident.

**Event.Date:** The specific date when the accident took place.

**Location:** The geographic location where the accident occurred.

**Weather.Condition:** The weather conditions present at the time of the accident.

**Total.Fatal.Injuries:** The total number of fatalities resulting from the accident.

The project will follow these main steps:

**Data Cleaning:** This involves handling missing values and ensuring data consistency.

**Exploratory Data Analysis (EDA):** We will examine patterns in the data related to aircraft types, locations, and weather conditions.


**Data Visualization:** Various visual tools will be created to present the findings clearly.

**Business Insights:** Based on the analysis, we will generate recommendations to help guide decisions in the aviation industry.


In [1]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

Next, we load `AviationData.csv` and `USState_Codes.csv` into pandas DataFrames.

In [3]:
# Load the aviation data and state codes data
aviation_df = pd.read_csv('./AviationData.csv', encoding='ISO-8859-1')
state_codes_df = pd.read_csv('./USState_Codes.csv')

**Step 1: Data Cleaning**

Before performing any analysis, we need to clean the data. This involves:
- **Handling missing values**: We need to check if any data is missing and decide how to handle it (by either removing or filling the missing data).
- **Ensuring correct data types**: For example, the `Event.Date` column should be in **datetime** format, and columns like `Latitude` and `Longitude` should be numeric.

We will perform these steps to ensure the dataset is clean and ready for analysis.
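
Before choosing between removing and filling, it can help to see how much of each column is actually missing. The sketch below is one possible way to do that; the helper name `missing_summary` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first.

    Hypothetical helper for illustration only.
    """
    summary = pd.DataFrame({
        "missing": df.isna().sum(),    # number of NaN/None values per column
        "fraction": df.isna().mean(),  # share of rows that are missing
    })
    return summary.sort_values("fraction", ascending=False)


# Toy stand-in for the aviation data (values are made up).
toy_df = pd.DataFrame({
    "Event.Id": ["A1", "A2", "A3", "A4"],
    "Make": ["Cessna", None, "Piper", "Boeing"],
    "Latitude": [np.nan, np.nan, np.nan, 34.5],
})

toy_summary = missing_summary(toy_df)
print(toy_summary)
```

Columns that are mostly empty are usually poor candidates for row-wise dropping, because removing every row with a gap in them would discard most of the data.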

In [None]:
# Replace placeholder values ("-") with NaN
aviation_df.replace("-", np.nan, inplace=True)

# Check the number of missing values in each column
print("Missing values before cleaning:")
print(aviation_df.isnull().sum())

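# Drop rows with missing values.
# Note: dropna() with no arguments removes every row that has a NaN in ANY column,
# so on a sparsely populated dataset this can remove most or all rows.
# .copy() makes the result an independent DataFrame for the column assignments below.
aviation_df_cleaned = aviation_df.dropna().copy()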

# Confirm the dataset size after cleaning
print(f"Dataset size after cleaning: {aviation_df_cleaned.shape}")

# Convert data types where necessary
aviation_df_cleaned['Event.Date'] = pd.to_datetime(aviation_df_cleaned['Event.Date'], errors='coerce')
aviation_df_cleaned['Latitude'] = pd.to_numeric(aviation_df_cleaned['Latitude'], errors='coerce')
aviation_df_cleaned['Longitude'] = pd.to_numeric(aviation_df_cleaned['Longitude'], errors='coerce')

# Quick check for invalid dates
if aviation_df_cleaned['Event.Date'].isnull().any():
    print("Warning: Some dates could not be parsed. These rows may need further investigation.")

# Check for invalid or missing coordinates
missing_coords = aviation_df_cleaned[['Latitude', 'Longitude']].isnull().any(axis=1).sum()
if missing_coords > 0:
    print(f"Warning: {missing_coords} rows have missing or invalid coordinates.")

# Ensure no placeholder values are left
if "-" in aviation_df_cleaned.values:
    print("Some placeholder values ('-') still exist in the dataset.")
else:
    print("All placeholder values have been successfully replaced.")

# Display a preview of the cleaned dataset
print("Preview of the cleaned data:")
print(aviation_df_cleaned.head())
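
A blanket `dropna()` removes a row as soon as any single column is missing. A gentler alternative, sketched below on made-up data, is to drop rows only when columns that the analysis depends on are missing. The choice of `key_columns` here is an assumption for illustration, not a rule taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the aviation data (values are made up).
toy_df = pd.DataFrame({
    "Event.Id": ["A1", "A2", "A3"],
    "Event.Date": ["2001-05-01", "2003-07-15", None],
    "Make": ["Cessna", "Piper", "Boeing"],
    "Latitude": [np.nan, 40.1, np.nan],
    "Airport.Code": [np.nan, np.nan, "LAX"],
})

# Blanket drop: every row has at least one gap, so nothing survives.
blanket = toy_df.dropna()

# Targeted drop: only require the columns assumed to be essential.
key_columns = ["Event.Id", "Event.Date", "Make"]  # assumption for this sketch
targeted = toy_df.dropna(subset=key_columns).copy()

print(f"Rows after blanket dropna(): {len(blanket)}")
print(f"Rows after dropna(subset=key_columns): {len(targeted)}")
```

Optional columns such as coordinates can then be left as NaN and filtered only in the specific analyses that need them.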


**Step 2: Exploratory Data Analysis (EDA)**

In the EDA phase, we will:
- Examine the data to find patterns, trends, or relationships.
- Answer key questions about the data (e.g., which aircraft types are most involved in accidents? How do weather conditions affect accidents?).
- Generate insights to help create business recommendations.

**Key Questions for EDA:**


1.   Which aircraft types are most involved in accidents?
2.   How do weather conditions affect accident rates?
3.   What time periods or locations show the most accidents?
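
Questions 2 and 3 can be approached with simple frequency counts. The sketch below uses a small made-up DataFrame with the column names described earlier; the weather codes and locations are placeholder values, and the counts say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the aviation data (values are made up).
toy_df = pd.DataFrame({
    "Event.Date": pd.to_datetime([
        "1999-03-02", "1999-08-19", "2005-01-10", "2005-06-30", "2005-11-11",
    ]),
    "Weather.Condition": ["VMC", "IMC", "VMC", None, "VMC"],
    "Location": ["Austin, TX", "Denver, CO", "Austin, TX", "Miami, FL", "Austin, TX"],
})

# Question 2: accidents per weather condition (missing values labeled explicitly).
weather_counts = toy_df["Weather.Condition"].fillna("Unknown").value_counts()

# Question 3a: accidents per calendar year, in chronological order.
accidents_per_year = toy_df["Event.Date"].dt.year.value_counts().sort_index()

# Question 3b: accidents per location, most frequent first.
accidents_per_location = toy_df["Location"].value_counts()

print(weather_counts)
print(accidents_per_year)
print(accidents_per_location.head(3))
```

These are raw counts. Turning them into rates would need exposure data such as the number of flights per condition or per year.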


We start by counting accidents per aircraft make (the `Make` column). Note that these are raw accident counts, not accident rates: without fleet-size or flight-hour data, a high count may simply reflect a widely used manufacturer.

In [7]:
# Group by aircraft make and count the number of accidents
accidents_by_type = aviation_df_cleaned.groupby('Make')['Event.Id'].count()

# Sort results in descending order so the makes with the most recorded accidents come first
accidents_by_type = accidents_by_type.sort_values(ascending=False)

# Display the top 10 aircraft makes with the highest number of accidents
print("Top 10 aircraft makes with the most accidents:")
print(accidents_by_type.head(10))


Top 10 aircraft makes with the most accidents:
Series([], Name: Event.Id, dtype: int64)


In [44]:
# Check if dataset is empty
if aviation_df_cleaned.empty:
    print("The dataset is empty after cleaning. Please check the data cleaning process.")
else:
    # Group by aircraft make and count the number of accidents
    if 'Make' in aviation_df_cleaned.columns and 'Event.Id' in aviation_df_cleaned.columns:
        accidents_by_type = aviation_df_cleaned.groupby('Make')['Event.Id'].count()

        # Sort results in descending order
        accidents_by_type = accidents_by_type.sort_values(ascending=False)

        # Display results if there are any
        if not accidents_by_type.empty:
            print("Top 10 aircraft makes with the most accidents:")
            print(accidents_by_type.head(10))
        else:
            print("No accident data available for grouping.")
    else:
        print("Columns 'Make' or 'Event.Id' are missing in the dataset. Please check your data.")


The dataset is empty after cleaning. Please check the data cleaning process.


In [None]:
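# The blanket dropna() earlier left aviation_df_cleaned empty, so filling values in it would change nothing.
# Rebuild it from the original data and fill the grouping columns instead of dropping rows.
aviation_df_cleaned = aviation_df.copy()

# Fill missing values for 'Make' with "Unknown"
aviation_df_cleaned['Make'] = aviation_df_cleaned['Make'].fillna('Unknown')

# 'Event.Id' is a text identifier, so fill missing values with a text placeholder rather than 0
aviation_df_cleaned['Event.Id'] = aviation_df_cleaned['Event.Id'].fillna('Unknown')

# Re-apply the type conversions from Step 1, since we restarted from the raw data
aviation_df_cleaned['Event.Date'] = pd.to_datetime(aviation_df_cleaned['Event.Date'], errors='coerce')
for col in ['Latitude', 'Longitude']:
    aviation_df_cleaned[col] = pd.to_numeric(aviation_df_cleaned[col], errors='coerce')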

# Check the dataset's shape after filling missing values
print(f"Dataset shape after filling missing values: {aviation_df_cleaned.shape}")

# Quick check for any remaining missing values
print("Remaining missing values in each column:")
print(aviation_df_cleaned.isnull().sum())
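
Once `Make` has no missing values, the grouping can be repeated. If the column mixes spellings of the same manufacturer (different capitalization or stray spaces), the counts will be split across several labels. The sketch below shows one way to normalize the labels before counting; it uses made-up data, and whether the real `Make` column needs this step is an assumption to verify.

```python
import pandas as pd

# Toy stand-in for the aviation data (values are made up).
toy_df = pd.DataFrame({
    "Event.Id": ["A1", "A2", "A3", "A4", "A5"],
    "Make": ["Cessna", "CESSNA", " cessna ", "Piper", None],
})

# Fill gaps, trim whitespace, and unify capitalization so variants collapse together.
normalized_make = (
    toy_df["Make"]
    .fillna("Unknown")
    .str.strip()
    .str.title()
)

# Count accidents per normalized make, most frequent first.
accidents_by_make = normalized_make.value_counts()

print("Top makes by number of recorded accidents:")
print(accidents_by_make.head(10))
```

`value_counts()` returns the same kind of result as grouping by `Make` and counting `Event.Id`, already sorted in descending order.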