# Exploratory Data Analysis of Aviation Accidents

---

## Project Information

**Authors**: 
- Vladyslav Lysenko
- Vasudha Chandna
- Om Mistry
  
**Course**: SCS 3250-095 Foundations of Data Science  
**Instructor**: Wilson So  
**Institution**: University of Toronto School of Continuing Studies  
**Submission Date**: August 2025  
**Purpose**:  
This project was completed as the term project for the "Foundations of Data Science" course.  
It involves scraping real-world data from the Aviation Safety Network, preparing the dataset, and conducting exploratory data analysis to discover insights related to aviation accidents.

![University Logo](https://learn.utoronto.ca/themes/custom/de_theme/logo.svg)


---

## Table of Contents

1. [Project Information](#Project-Information)
2. [Introduction](#Introduction)
3. [Data Acquisition](#Data-Acquisition)
4. [Data Preparation](#Data-Preparation)
5. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
    - 5.1 [Descriptive Statistics](#Descriptive-Statistics)
    - 5.2 [Temporal Trends](#Temporal-Trends)
    - 5.3 [Aircraft & Operator Analysis](#Aircraft-&-Operator-Analysis)
    - 5.4 [Geographic Distribution](#Geographic-Distribution)
6. [Key Findings](#Key-Findings)
7. [Limitations](#Limitations)
8. [Future Work](#future-work)
9. [Conclusion](#conclusion)
10. [Appendix](#Appendix)

---

## Introduction

### Objective

This project aims to conduct an exploratory data analysis (EDA) of aviation accident records from the Aviation Safety Network (ASN) to uncover trends, patterns, and potential contributing factors to aviation incidents between 1995 and 2024. The analysis is intended to highlight how publicly available aviation data can be examined to reveal high-level patterns and risk areas. While the insights are not intended to support direct operational or regulatory decisions, they offer a preliminary view into aviation safety dynamics based on historical occurrences.


![Aviation Safety Network Logo](https://asn.flightsafety.org/graphics/ASN%20Logo_FSF.png)

### Dataset Overview

The dataset was obtained by programmatically scraping detailed aviation occurrence records from the Aviation Safety Network’s public database. After collection and preparation, the dataset comprised **7,999** records and **18** attributes, including:

- Temporal attributes (e.g., accident date, time, year of manufacture)
- Aircraft information (e.g., type, operator)
- Accident details (e.g., location, phase of flight, nature of flight, damage severity)
- Casualty information (e.g., fatalities, occupants)

Despite some inconsistencies and missing values, the dataset provides a broad and rich view of aviation accidents across the globe, including both civil and, to a limited extent, military or non-scheduled operations.

### Motivation

Aviation remains one of the safest modes of transport, yet every accident is a critical event that merits careful analysis. Studying decades of aviation occurrences can help reveal:

- Common operational risks or contributing factors
- Trends in accident frequency or severity over time
- Aircraft types, operators, or flight phases associated with increased occurrence rates
- Geographical or environmental contexts that may play a role

By analyzing structured data from a reputable aviation source, this project aims to demonstrate how historical data can be used to explore aviation safety dynamics and uncover high-level insights. While the findings are not intended to inform regulatory decisions, they may support further research or awareness regarding aviation safety patterns.

### Disclaimer on Data Accuracy and Purpose

We, as students conducting this analysis as part of a data science course project, acknowledge that while the dataset used is based on real-world aviation accident records from the Aviation Safety Network, it is not fully accurate or comprehensive.

Several important limitations must be considered:

- The dataset may contain missing, inconsistent, or outdated information due to the nature of web scraping and publicly reported data.
- Some entries lack full investigative outcomes, or include uncertain classifications (e.g., "Unknown").
- Critical operational context (such as total flight volume, regulatory environments, weather conditions, or maintenance practices) is not available in this dataset, which limits the depth and reliability of any causal interpretation.

Therefore, we explicitly state that this analysis is intended solely for educational purposes.
It does not aim to produce authoritative conclusions about aviation safety and should not be used as the basis for any operational, regulatory, or policy decisions.

All insights presented here are meant to demonstrate the application of data science techniques (such as data cleaning, transformation, visualization, and exploratory analysis) and should be interpreted within that learning context.



---

## Data Acquisition

### Web Scraping Approach

To build a comprehensive dataset of global aviation accidents, we performed web scraping from the Aviation Safety Network (ASN) website using Python-based tools. The ASN database provides detailed accident listings, each with a link to a dedicated detail page containing additional structured and semi-structured information.

#### Tools and Libraries Used

We used the following Python libraries for the scraping pipeline:

- `requests` - for sending HTTP requests to retrieve HTML pages

- `BeautifulSoup` (from `bs4`) - for parsing and navigating HTML documents

- `re` (regular expressions) - for extracting structured data (e.g., coordinates)

- `concurrent.futures.ThreadPoolExecutor` - for multithreaded scraping of detail pages

- `pandas` - for post-processing and tabular structuring of data



#### Pagination and URL Structure

The main accident listing pages on ASN follow a paginated URL pattern. 

Initially, our approach relied on using `pandas.read_html()` to parse accident tables directly from each page. However, we quickly found that essential information - such as coordinates, full location names, and other detailed metadata - was only accessible on the individual accident detail pages.

To handle pagination, we built the listing URL dynamically and updated it each time we reached the end of a page or of a year's occurrences. Iterating in this way over all accident index pages from 1995 to 2024, we extracted the detail-page links and summary-level fields from each page. In a second pass, we iterated through the collected links to extract the data for each individual event.
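The two-pass approach can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the `INDEX_URL` template, the `pages_per_year` callback, and the injected `fetch` function are all hypothetical stand-ins (the real ASN URL pattern is not reproduced here), and the demo runs offline with a fake fetcher in place of `requests.get(url).text`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Hypothetical URL template; the real ASN listing pattern is an assumption here.
INDEX_URL = "https://example.org/database/year/{year}/{page}"


def index_urls(years: Iterable[int], pages_per_year: Callable[[int], int]) -> list[str]:
    """Build one listing URL per (year, page) pair."""
    return [
        INDEX_URL.format(year=year, page=page)
        for year in years
        for page in range(1, pages_per_year(year) + 1)
    ]


def fetch_all(urls: list[str], fetch: Callable[[str], str], max_workers: int = 8) -> list[str]:
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Offline demo: a stand-in fetcher replaces a real HTTP request.
urls = index_urls(range(1995, 1997), pages_per_year=lambda year: 2)
pages = fetch_all(urls, fetch=lambda url: f"<html>{url}</html>")
```

Passing the fetcher in as a parameter keeps the concurrency logic testable without network access; the same `fetch_all` helper could be reused for the detail-page pass.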

#### Extracted Fields

From the main list pages, we extracted:

- Date of accident
- Aircraft type
- Operator
- Fatalities
- Location
- URL to accident detail page

From the detail pages, we extracted additional fields:

- Phase of flight
- Nature of flight
- Departure and destination airports
- Year of manufacture
- Damage type
- Coordinates (if available via embedded map)
- Several other fields of lesser relevance to this analysis

### Challenges and Solutions

#### 1. Transition from `pandas.read_html` to `BeautifulSoup`

Initially, we attempted to extract occurrence data using `pandas.read_html()` on paginated tables. However, ASN structures detailed data (such as phase, damage, coordinates) only on separate accident pages, not in the summary tables. As a result, we rewrote the scraping pipeline using BeautifulSoup to parse both the summary listings and detail pages programmatically.



#### 2. Extracting Coordinates via Embedded Map

Coordinates were not explicitly listed anywhere, but they were embedded in an iframe src attribute pointing to a Google Maps preview. To retrieve these values, we used a regular expression to parse latitude and longitude from the embedded map URL.
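A minimal sketch of this kind of extraction is shown below. The exact shape of the embedded map URL is an assumption (a `q=<lat>,<lon>` query parameter), and `parse_coordinates` is a hypothetical helper name, so the pattern would need adjusting to the real `src` values.

```python
import re
from typing import Optional

# Assumed shape of the embedded map URL: "...?q=<lat>,<lon>&...".
COORD_RE = re.compile(r"[?&]q=(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)")


def parse_coordinates(iframe_src: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) from a map URL, or None if absent or invalid."""
    match = COORD_RE.search(iframe_src)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject values outside the valid geographic ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


example = parse_coordinates(
    "https://www.google.com/maps/embed/v1/place?key=KEY&q=35.5494,139.7798&zoom=8"
)
```

Returning `None` instead of raising keeps records without a map usable, which matches the fact that coordinates were missing in many cases.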

#### 3. Handling Inconsistent Formats and Missing Data

Data across pages exhibited inconsistent formatting:

- Dates were in various string formats and had to be normalized into datetime objects.
- Coordinates were missing in many cases.
- Encoding issues occasionally arose due to non-ASCII characters in location or operator names.

To address these, we applied defensive parsing logic and fallback defaults where necessary. All pages were parsed using UTF-8 encoding and missing values were explicitly handled using pandas data cleaning techniques.

---
## Data Preparation

Once the aviation safety dataset was collected via web scraping, significant preparation was required to clean and structure the data for meaningful analysis. The raw dataset consisted of both high-level records and detailed entries scraped from accident report pages. However, as with most real-world datasets, particularly those compiled from semi-structured web sources, the raw data exhibited a range of issues including redundancy, missingness, inconsistent formatting, and noisy categorical fields.

This section outlines the systematic steps taken to transform the raw dataset into a clean and analyzable form. Each step was carefully documented and implemented in Python using pandas, numpy, and other standard libraries.

### Tools and Process Overview

Data preparation was performed using a combination of tools tailored for cleaning, transformation, and feature engineering:

- `pandas` - for tabular data manipulation, missing value handling, and feature creation
- `numpy` - for numerical operations and handling missing values
- `datetime` - for time-based parsing and transformation
- `ast` (Abstract Syntax Trees) - for safely evaluating stringified list representations of coordinates

The data cleaning process was iterative. It began with exploring the structure of the raw dataset, identifying key inconsistencies, and progressively transforming the dataset into a format suitable for exploratory data analysis (EDA). Key tasks included handling missing data, parsing and splitting composite values, and constructing new features for deeper insights.



### Initial Cleanup and Structural Simplification

The first step in preparing the dataset involved removing structurally redundant or non-informative columns. For example:

- Columns like `"Unnamed: 6"`, `"MSN"`, `"Engine model"`, `"Other fatalities"` and external links (e.g., `"link"`) were removed as they either contained no data or provided no analytical value.
- Duplicate representations of the same data (e.g., `acc. date` vs `Date`, or `type` vs `Type`) were resolved by selecting the most complete or better-formatted version.

Additionally, entries flagged as low-confidence or incomplete were dropped to improve data quality and reliability.

### Normalization of Missing Values

To ensure consistency in handling missing data:

- Placeholder values such as empty strings ("") and dashes ("-") were systematically replaced with `NaN`.
- Some categorical fields with missing values were temporarily filled with `"Unknown"` during intermediate stages (e.g., `accident_type`, `aircraft_nature`). However, in the final version of the dataset, these were converted back to `NaN` to avoid inflating category counts during statistical analysis or visualizations.

This normalization step was essential to make the dataset compatible with analytical tools and to avoid misleading outcomes due to inconsistent null representations.
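As an illustration, placeholder normalization could look like the sketch below. The sample frame and the `PLACEHOLDERS` list are assumptions made for the example; the project's actual list of placeholder values may differ.

```python
import pandas as pd

# Toy data; column names follow the report, values are invented.
raw = pd.DataFrame({
    "operator": ["Delta Air Lines", "", "-"],
    "phase": ["Landing", " ", "Unknown"],
})

# Values treated as "missing" in this sketch (an assumption).
PLACEHOLDERS = ["", "-", "Unknown"]

# Strip whitespace first so " " collapses to "", then blank out placeholders.
stripped = raw.apply(lambda col: col.str.strip())
clean = stripped.mask(stripped.isin(PLACEHOLDERS))
```

After this step, `clean.isna()` reports all placeholder cells as missing, so downstream counts and plots treat them uniformly.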

### Parsing and Cleaning Composite Fields

Several fields required decomposition and cleaning:

- The combined `"Fatalities"` field, often represented as a string like "Fatalities: 2 / Occupants: 5", was split into two separate numeric columns: `fatalities` and `occupants`.

- Extraneous labels were removed using string replacement, and missing values were filled with `0` or `NaN` as appropriate.

- The new columns were then explicitly cast to `int` to enable proper calculations and aggregation.

This allowed for accurate computation of critical measures like the fatality rate, which would later serve as a key feature in analysis.
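The split can be sketched with a single regular expression, as below. This is an illustrative variant rather than the project's code: it uses `str.extract` instead of string replacement, and it keeps unknown counts as missing in pandas' nullable `Int64` type rather than casting to plain `int`.

```python
import pandas as pd

# Toy inputs in the "Fatalities: X / Occupants: Y" shape described above.
raw = pd.Series([
    "Fatalities: 2 / Occupants: 5",
    "Fatalities: 0 / Occupants: 12",
    "Fatalities:  / Occupants: 3",   # fatalities not reported
    None,                            # field missing entirely
])

pattern = r"Fatalities:\s*(?P<fatalities>\d+)?\s*/\s*Occupants:\s*(?P<occupants>\d+)?"
parts = raw.str.extract(pattern)

# Convert the captured digits to numbers; unmatched groups stay missing.
parts = parts.apply(pd.to_numeric, errors="coerce").astype("Int64")
```

Whether to fill the remaining gaps with `0` or leave them missing is an analysis decision; leaving them missing avoids understating fatalities for records where the count is simply unknown.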



### Date and Time Parsing

Date and time data were parsed and validated to support time-based analysis:

- The `accident_date` field was converted to `datetime` format using `pandas.to_datetime()` with error coercion.
- Entries with missing dates were removed to ensure temporal consistency across the dataset. Additionally, records from the year 2025 were excluded, as the year is still in progress and the data may be incomplete or subject to change.
- The `accident_time` field, often incomplete or missing, was parsed separately with format specification (`"%H:%M"`). Entries that could not be parsed were marked as NaT.
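The parsing steps above could be sketched as follows. The sample rows and the ISO date format are assumptions made for the example; the raw ASN date strings came in several formats and would need their own handling.

```python
import pandas as pd

# Toy rows; real raw dates were in various string formats.
df = pd.DataFrame({
    "accident_date": ["2003-03-12", "2025-01-05", "date unk.", "2010-07-01"],
    "accident_time": ["14:35", "09:00", None, "ca 10:15"],
})

# Unparseable dates become NaT instead of raising an error.
df["accident_date"] = pd.to_datetime(
    df["accident_date"], format="%Y-%m-%d", errors="coerce"
)

# Drop rows without a date, then exclude the incomplete year 2025.
df = df.dropna(subset=["accident_date"])
df = df[df["accident_date"].dt.year < 2025].copy()

# Times that do not match HH:MM are marked as NaT.
df["accident_time"] = pd.to_datetime(
    df["accident_time"], format="%H:%M", errors="coerce"
)
```

In this sketch the parsed time is stored as a full timestamp with a placeholder date; only the hour and minute components are meaningful.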

### Column Renaming and Reordering

To enhance clarity and maintain naming consistency throughout the analysis:

- All columns were renamed with standardized snake_case naming (e.g., `aircraft_type`, `damage_severity`, `departure_airport`).
- The columns were reordered into logical groups - for example: aircraft-related fields, accident context (date, time, location), and outcome measures (fatalities, damage level).

This organization greatly improved the readability and usability of the dataset for both exploratory work and presentation.



### Categorical Data Cleaning and Consolidation

Many columns contained categorical fields with inconsistent or overly granular labels:

- The `aircraft_type` field included variants with prefixes and suffixes (e.g., `"Boeing 737-200 ?"`), which were cleaned using regular expressions.
- The `aircraft_nature` field had a high number of rare and ambiguous labels (e.g., `"Passenger - Scheduled"`, `"Passenger - Non-Scheduled/charter/Air Taxi"`), which were grouped into broader categories like `"Passenger"`, `"Cargo"`, `"Private"`, etc.
- Similar grouping and simplification were applied to fields like `damage_severity`, `phase`, and `operator` to avoid sparsity during aggregation or plotting.

This step ensured that plots and summaries would not be dominated by outliers or inconsistently labeled entries.
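One simple way to express this kind of grouping is a prefix-based mapping, sketched below. The `BROAD` category tuple and the `consolidate_nature` helper are illustrative assumptions; the project's actual mapping rules were likely more detailed.

```python
import pandas as pd

# Broad categories kept as-is in this sketch; everything else becomes "Other".
BROAD = ("Passenger", "Cargo", "Military", "Private")


def consolidate_nature(label: str) -> str:
    """Map a detailed label such as 'Passenger - Scheduled' to its broad category."""
    head = label.split(" - ")[0].strip()
    return head if head in BROAD else "Other"


nature = pd.Series([
    "Passenger - Scheduled",
    "Passenger - Non-Scheduled/charter/Air Taxi",
    "Cargo",
    "Ferry/positioning",
    None,
])

# na_action="ignore" leaves missing labels untouched.
grouped = nature.map(consolidate_nature, na_action="ignore")
```

Keeping the mapping in one small function makes it easy to audit which raw labels end up in "Other".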



### Handling Missing and Sparse Features

Some fields, while potentially useful, suffered from extreme sparsity:

- `airframe_hours`: Present in less than half of the dataset. Retained for reference, but not used in aggregated analysis.
- `cycles`: Missing in over 60% of records. Dropped entirely due to low information density.
- `year_of_manufacture`: Missing in ~9% of rows, filled using the median year grouped by aircraft type to retain contextual accuracy.

These decisions balanced completeness with the need to maintain reliable, interpretable statistics.
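The group-wise median fill for `year_of_manufacture` can be sketched as follows. The sample values are invented; in this sketch, aircraft types with no known year at all simply remain missing.

```python
import pandas as pd

df = pd.DataFrame({
    "aircraft_type": [
        "Cessna 208B", "Cessna 208B", "Cessna 208B",
        "Boeing 737", "Boeing 737", "Rare Type",
    ],
    "year_of_manufacture": [1998, 2004, None, 1990, None, None],
})

# Median year per aircraft type, aligned row by row with the original frame.
group_median = df.groupby("aircraft_type")["year_of_manufacture"].transform("median")

# Fill only the missing entries; known values are left unchanged.
df["year_of_manufacture"] = df["year_of_manufacture"].fillna(group_median)
```

Using `transform` keeps the result aligned with the original index, so it can be passed directly to `fillna`.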



### Feature Engineering

A few additional features were derived during preparation:

- Fatality Rate: A computed column as `fatalities` / `occupants`, accounting for division by zero where applicable.
- Country and Airport Parsing: The combined location field was split into two components - `accident_location` and `accident_country` - using regex to separate entries like "Tokyo-Haneda Airport (HND) - Japan".

These engineered features enabled deeper insights into geographical trends and severity analysis.
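Both derived features can be sketched in a few lines. The sample rows and the raw `location` column name are assumptions for illustration, and the regular expression is one plausible pattern rather than the one used in the project.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "location": [
        "Tokyo-Haneda Airport (HND) - Japan",
        "Anchorage, AK - United States of America",
    ],
    "fatalities": [0, 2],
    "occupants": [0, 5],
})

# Replace zero occupants with NaN so the division yields NaN instead of inf.
df["fatality_rate"] = df["fatalities"] / df["occupants"].replace(0, np.nan)

# The greedy first group makes the split happen at the last " - " separator,
# so hyphens inside the place name are preserved.
pattern = r"^(?P<accident_location>.*)\s-\s(?P<accident_country>.+)$"
df[["accident_location", "accident_country"]] = df["location"].str.extract(pattern)
```

A rate of `NaN` for records with zero occupants distinguishes "not computable" from a genuine rate of 0.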

### Final Dataset Review and Export

After all cleaning steps, the dataset was reviewed for type consistency using `df.info()` and sample validation checks.

The clean, normalized dataset was exported in `.feather` format to preserve data types and ensure fast loading for analysis and visualization.

---
## Exploratory Data Analysis (EDA)

### Descriptive Statistics

Before diving into patterns or relationships in the dataset, we first explored the basic characteristics of the data using descriptive statistics. This step allowed us to understand the overall structure of the dataset, identify possible irregularities, and get a general sense of the types of accidents we’re dealing with.

We focused on both `numerical` variables (like fatalities, occupants, and aircraft age) and `categorical` variables (like aircraft type, accident phase, and operator nature). The goal was to summarize the data in a simple and meaningful way before looking deeper.

#### Summary statistics of numerical columns

For the numerical part of the dataset, we looked at values such as:

- `Fatalities`: how many people died in each accident
- `Occupants`: total number of people onboard
- `Aircraft Year of Manufacture`: to later calculate aircraft age
- `Airframe Hours`: how many hours the aircraft had flown at the time of the accident

After calculating the usual summary statistics, here are some key observations:

- Most accidents involved zero or very few fatalities, with many records showing 0. However, there were a few accidents with extremely high fatalities, some exceeding 300 people, which heavily affect the average.
- The number of occupants varied from very small aircraft (1-2 people) to large commercial jets with over 500. This wide range reflects the diversity of aircraft types in the dataset.
- The year of manufacture showed that most aircraft were relatively modern, but some were quite old at the time of the accident. This observation later helped us look into the relationship between aircraft age and accident severity.
- Airframe hours also had a large range, but due to many missing values, we only considered this field briefly.

| Variable                | Count | Mean    | Std Dev   | Min    | 25%     | 50%    | 75%      | Max       |
| ----------------------- | ----- | ------- | --------- | ------ | ------- | ------ | -------- | --------- |
| **Fatalities**          | 7,999 | 3.45    | 17.50     | 0.0    | 0.0     | 0.0    | 0.0      | 312.0     |
| **Occupants**           | 7,999 | 40.82   | 72.57     | 0.0    | 2.0     | 6.0    | 44.0     | 560.0     |
| **Year of Manufacture** | 7,808 | 1987.93 | 15.29     | 1938.0 | 1978.0  | 1989.0 | 1999.0   | 2023.0    |
| **Airframe Hours**      | 3,087 | 22,178  | 21,374.57 | 0.0    | 6,219.5 | 14,597 | 32,447.5 | 126,184.0 |

<center><i>Table 5.1: Summary statistics for key numeric variables</i></center>


<p align="center">
  <img src="media/numerical-distributions-of-aviation-accidents.png" alt="Numerical Distributions of Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.1.1: Numerical Distributions of Aviation Accidents</em></p>

#### Summary Statistics of Categorical Features

Next, we explored the categorical variables to see what types of events and aircraft we were dealing with. These columns help us understand the nature of accidents rather than just the numbers.

Some of the key columns here included:

- `Accident Type`: Most records were labeled as “Accident”
- `Aircraft Type`: The dataset included a large variety of aircraft, but Cessna, Boeing, and Airbus models were among the most common. This reflects both general aviation and commercial aircraft.
- `Aircraft Nature`: A large portion of accidents involved Passenger aircraft. Other types included Cargo, Military, and Private, along with sizable “Other” and “Special Operation” categories, which grouped specialized or rare missions (e.g., training, aerial firefighting, executive flights).
- `Phase of Flight`: Interestingly, many accidents occurred during the Landing phase, followed by Approach and Cruise. This aligns with known aviation risks, as these phases are more technically demanding.

| Variable                | Non-Null Count | Unique Values | Most Frequent Value                      | Frequency |
| ----------------------- | -------------- | ------------- | ---------------------------------------- | --------- |
| **Accident Time**       | 5,310          | 1,191         | 11:00:00                                 | 47        |
| **Accident Type**       | 7,954          | 5             | Accident                                 | 7,295     |
| **Aircraft Type**       | 7,999          | 1,980         | Cessna 208B Grand Caravan                | 283       |
| **Aircraft Nature**     | 7,129          | 8             | Passenger                                | 3,800     |
| **Operator**            | 7,885          | 4,058         | Delta Air Lines                          | 140       |
| **Phase**               | 7,624          | 9             | Landing                                  | 2,488     |
| **Departure Airport**   | 6,764          | 3,046         | Chicago-O'Hare International Airport, IL | 45        |
| **Destination Airport** | 6,785          | 3,473         | Chicago-O'Hare International Airport, IL | 54        |
| **Damage Severity**     | 7,907          | 5             | Substantial                              | 3,532     |
| **Damage Outcome**      | 5,696          | 2             | Written-Off                              | 3,966     |
| **Accident Location**   | 7,990          | 5,856         | Chicago-O'Hare International Airport, IL | 32        |
| **Accident Country**    | 7,999          | 214           | United States of America                 | 2,303     |

<center><i>Table 5.2: Summary statistics for key categorical variables</i></center>


<p align="center">
  <img src="media/categorical-distributions-of-aviation-accidents.png" alt="Categorical Distributions of Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.1.2: Categorical Distributions of Aviation Accidents</em></p>

### Temporal Trends

One of the most important parts of this analysis was to understand how aviation accidents have changed over time. By looking at when accidents happened, we hoped to see if there are any trends or patterns: whether safety is improving, whether certain periods are riskier, and how accident outcomes differ across time.

We broke this part of the analysis into three main areas: accidents by year, fatalities by year, and seasonal patterns.

#### Accidents Over the Years

We started by grouping the accidents by year to see how the total number of accidents changed over time.

- The results showed a clear downward trend, especially starting from the 2000s. This suggests that aviation safety has improved over the years, likely due to stricter regulations, improved aircraft technology, better pilot training, and enhanced safety protocols.
- There were some years where the number of accidents suddenly went up. These might be due to specific large-scale events or reporting inconsistencies that should be looked at further.

<p align="center">
  <img src="media/accidents-over-time.png" alt="Accidents Over Time" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.1: Accidents Over Time</em></p>

#### Fatalities and Average Fatalities Per Year

We then looked at how many people died each year in aviation accidents and how fatal the average accident was:

- Total fatalities fluctuate a lot more than accident counts. This makes sense, because one major disaster can cause a sudden spike.
- To understand this better, we calculated the average number of fatalities per accident per year. This gave a more balanced view.
- The average number remained relatively stable over the decades, though it started to slightly decrease in recent years, which may indicate that not only are accidents rarer, but they also tend to be less deadly.
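The yearly aggregates discussed above can be computed with one grouped aggregation, sketched here on invented sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "accident_date": pd.to_datetime(["1995-02-01", "1995-08-15", "1996-03-03"]),
    "fatalities": [0, 10, 4],
})

# One row per year: accident count, total fatalities, and their ratio.
yearly = df.groupby(df["accident_date"].dt.year)["fatalities"].agg(
    accidents="count",
    total_fatalities="sum",
    fatalities_per_accident="mean",
)
```

Because the mean is taken over accidents within a year, a single large disaster still moves it noticeably in years with few records.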



<p align="center">
  <img src="media/fatalities-over-time.png" alt="Fatalities Over Time" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.2: Fatalities Over Time</em></p>

<p align="center">
  <img src="media/fatalities-per-accident-ratio-over-time.png" alt="Fatalities per Accident Ratio Over Time" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.3: Fatalities per Accident Ratio Over Time</em></p>

<p align="center">
  <img src="media/accidents-vs-fatalities-over-time.png" alt="Accidents vs Fatalities Over Time" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.4: Accidents vs Fatalities Over Time</em></p>

<p align="center">
  <img src="media/fatality-distribution-in-aviation-accidents.png" alt="Fatality Distribution in Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.5: Fatality Distribution in Aviation Accidents</em></p>

#### Seasonal Patterns

To investigate whether aviation accidents tend to happen more during certain parts of the year, we assigned each accident to a season based on its date, while accounting for hemisphere differences (so, for example, July in Brazil counts as Winter).
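A hemisphere-aware season assignment could be sketched as below. Two assumptions are made for illustration: seasons follow the meteorological (whole-month) convention, and the hemisphere is inferred from the sign of the latitude; the project may instead have derived it from the country.

```python
from datetime import date

# Meteorological seasons for the northern hemisphere, keyed by month.
NORTHERN = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Seasons are mirrored south of the equator.
OPPOSITE = {"Winter": "Summer", "Summer": "Winter", "Spring": "Fall", "Fall": "Spring"}


def season_for(day: date, latitude: float) -> str:
    """Return the local season for a date, flipping it for southern latitudes."""
    season = NORTHERN[day.month]
    return OPPOSITE[season] if latitude < 0 else season


brazil_july = season_for(date(2010, 7, 15), latitude=-15.8)
```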

The bar chart (see Figure 5.2.6) shows the overall distribution of accidents by season:

- Summer stands out as the season with the highest number of accidents.
- Winter and Spring have nearly identical accident counts, slightly lower than Summer.
- Fall had the fewest accidents among all seasons.

This result likely reflects increased flight activity during summer months, especially due to tourism, which raises exposure. It might also be connected to extreme weather patterns like thunderstorms or heat-related equipment issues in some regions.

To go deeper, we also examined how the number of accidents in each season changed over the years (Figure 5.2.7). This chart breaks the data into four subplots, one for each season, and displays accident counts from 1995 to 2024.

From this chart, we observed:

- All seasons show a slight decline or stabilization in the number of accidents over the years, especially after 2000.
- Summer accidents remain consistently high over time and even show noticeable peaks around the late 2010s and early 2020s.
- Winter and Spring trends are more stable, with some variability but no sharp increases.
- Fall shows more fluctuations, with some years having sudden drops or spikes, possibly tied to isolated major events or smaller flight volumes in that season.

Overall, these trends suggest that while accidents occur in every season, summer deserves particular attention - possibly due to a mix of higher air traffic and environmental factors. Still, the differences are not extreme, and aviation safety appears to have improved across all seasons over the past few decades.

<p align="center">
  <img src="media/accidents-by-season.png" alt="Accidents by Season" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.6: Accidents by Season</em></p>


<p align="center">
  <img src="media/seasonal-aviation-accident-trends-over-the-years.png" alt="Seasonal Aviation Accidents Trends Over The Years" width="1200"/>
</p>

<p align="center"><em>Figure 5.2.7: Seasonal Aviation Accidents Trends Over The Years</em></p>

### Aircraft & Operator Analysis

In this section, we explored how different aircraft characteristics and operator types relate to aviation accidents. The goal was to understand whether certain types of aircraft or types of use (like passenger vs. cargo) are more commonly involved in accidents, and whether aircraft age plays a noticeable role.

We also briefly examined operator data to see if there were any patterns in how often different companies or organizations appeared.

#### Aircraft Types

We started by identifying which specific aircraft types were most frequently involved in accidents. This gave us an overview of which models show up more often in the dataset, although it's important to remember that more appearances don’t mean a plane is unsafe. In most cases, it just reflects how widely the model is used around the world.

As seen in Figure 5.3.1, the most common aircraft types involved in accidents are largely general aviation aircraft. These include:

- Cessna 208B Grand Caravan (by far the most common)
- DHC-6 Twin Otter 300
- Antonov An-2R
- Various Beechcraft and Learjet models

Most of these planes are used for regional flights, cargo delivery, training, or charter services, often in rural or remote areas. Their popularity in non-commercial and utility aviation is the main reason they appear so frequently in accident records.

<p align="center">
  <img src="media/top-10-aircraft-types-in-accidents.png" alt="Top 10 Aircraft Types in Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.3.1: Top 10 Aircraft Types in Accidents</em></p>

However, for most people, the part of the aviation industry that matters most is passenger aircraft, especially the commercial jets used by airlines. This is the field with the highest visibility, media attention, and public concern.

When we filtered the dataset to focus only on passenger aircraft, we observed the presence of well-known manufacturers like Boeing and Airbus, along with some regional jets and turboprops that are commonly used by smaller airlines. These aircraft types didn’t dominate the overall top 10 list simply because they fly fewer total flights than the general aviation sector, but they are still critical for safety analysis due to the high number of people onboard.

So, while general aviation accidents make up the majority of entries in the dataset, it’s important to analyze both sectors separately: one tells us about operational risks in smaller, utility-based flights, and the other about safety in the large-scale passenger transport systems we rely on every day.

<p align="center">
  <img src="media/top-10-passenger-aircraft-types-in-accidents.png" alt="Top 10 Passenger Aircraft Types in Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.3.2: Top 10 Passenger Aircraft Types in Accidents</em></p>

#### Aircraft Nature (Purpose of Flight)

Next, we analyzed the purpose or nature of each aircraft involved in accidents. To simplify this analysis, we grouped the detailed values from the dataset into broader categories: Passenger, Cargo, Military, Private, Special Purpose, and Other (which included training, calibration, positioning, and similar types).

- Passenger aircraft made up the majority of accident records. This is expected since they represent the bulk of air traffic worldwide.
- Cargo and Military aircraft followed but at much lower levels.
- The “Other” category was larger than we initially expected, mostly due to specialized flights like training or ferry operations, which are common in general aviation.

This distribution shows that while most accidents involve passenger aircraft, other sectors of aviation are still vulnerable, especially non-commercial operations, where safety measures might differ.

<p align="center">
  <img src="media/accident-by-aircraft-nature.png" alt="Accidents by Aircraft Nature" width="1200"/>
</p>

<p align="center"><em>Figure 5.3.3: Accidents by Aircraft Nature</em></p>

#### Aircraft Age at Accident

We also wanted to examine whether aircraft age has any visible relationship with the severity of accidents, measured by the number of fatalities. To explore this, we grouped aircraft into age ranges (e.g., 0-10 years, 10-20 years, etc.) and plotted the number of fatalities for each group.
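The age grouping can be sketched with `pandas.cut`, as below. The sample rows, the `accident_year` column, and the exact bin edges and labels are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "accident_year": [2005, 2010, 2020, 1999],
    "year_of_manufacture": [2001, 1975, 1968, 1999],
    "fatalities": [0, 3, 1, 150],
})

# Aircraft age at the time of the accident, in years.
df["aircraft_age"] = df["accident_year"] - df["year_of_manufacture"]

# Left-closed bins: an age of exactly 10 falls into "10-20".
bins = [0, 10, 20, 30, 40, 50, float("inf")]
labels = ["0-10", "10-20", "20-30", "30-40", "40-50", "50+"]
df["age_group"] = pd.cut(df["aircraft_age"], bins=bins, labels=labels, right=False)
```

Setting `right=False` resolves the overlap in range names such as "0-10" and "10-20" by assigning each boundary value to the higher group.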

As shown in Figure 5.3.4, several clear patterns emerged:

- Accidents with the highest number of fatalities occurred mostly in younger to mid-age aircraft, particularly in the 0-20 year range.
- While older aircraft (above 40-50 years) were still involved in accidents, they were less often associated with high fatality counts.
- All age groups had many accidents with zero or very few fatalities, indicating that not all accidents are severe, regardless of aircraft age.
- There are many outliers (some accidents had over 200 or even 300 fatalities), mostly concentrated in the newer age groups, which are also more likely to include large commercial aircraft.

One explanation is that larger aircraft used for commercial passenger flights are often newer or mid-aged and tend to carry more people, so when a disaster happens, the fatality count is high. On the other hand, older aircraft are more often used in general aviation (e.g., private, cargo, or training), where fewer people are onboard.

So, while aircraft age alone is not a direct indicator of risk, it seems to be related to the context of the flight: how the plane is used, and how many people it typically carries. Maintenance, regulations, and usage type may all be more important than age by itself.



<p align="center">
  <img src="media/aircraft-age-vs-fatalities.png" alt="Aircraft Age vs Fatalities" width="1200"/>
</p>

<p align="center"><em>Figure 5.3.4: Aircraft Age vs Fatalities</em></p>

#### Operator Overview

Lastly, we explored the operators responsible for the flights. The dataset had over 4,000 unique operators, which included commercial airlines, private companies, military organizations, and training schools.

- Most operators only appeared once or twice.
- A few operators had many records, which usually reflects fleet size and number of operations rather than a greater tendency toward accidents.
- We selected the top 10 most frequent operators for clearer visualization.

It’s important to interpret this data carefully: higher accident counts don’t automatically imply poor safety performance. These operators may simply fly more often or have better reporting standards.
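The operator summary described above (a long tail of rarely seen operators plus a short list of frequent ones) can be reproduced with a simple frequency count. The sketch below uses `collections.Counter` from the standard library; the function name, the `rare_max` threshold, and the operator names are hypothetical. With pandas, `Series.value_counts()` would give the same counts.

```python
from collections import Counter


def summarize_operators(operators, top_n=10, rare_max=2):
    """Return the top_n most frequent operators and the share of operators
    that appear at most rare_max times."""
    counts = Counter(operators)
    if not counts:
        return [], 0.0
    top = counts.most_common(top_n)
    rare = sum(1 for c in counts.values() if c <= rare_max)
    return top, rare / len(counts)


# Hypothetical operator names, one entry per accident record.
sample = (
    ["Operator A"] * 5
    + ["Operator B"] * 3
    + ["Operator C", "Operator D", "Operator D", "Operator E"]
)
top, rare_share = summarize_operators(sample, top_n=2)

print(top)
print(f"{rare_share:.0%} of operators appear at most twice")
```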

<p align="center">
  <img src="media/top-10-operators-involved-in-accidents.png" alt="Top 10 Operators Involved in Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.3.5: Top 10 Operators Involved in Accidents</em></p>

### Geographic Distribution

In this section, we explored the locations where aviation accidents occurred to understand if certain countries or regions are more affected than others. The dataset included both country names and, in many cases, geographic coordinates (latitude and longitude), which allowed us to look at the data from both a national and global spatial perspective.

#### Accidents by Country

First, we grouped the data by country and counted how many accidents occurred in each one. The top 10 countries with the most accidents are shown in Figure 5.4.1.

- The United States had the highest number of accidents by a large margin. This isn’t surprising, as the U.S. has one of the largest and busiest aviation industries in the world, covering commercial, cargo, military, and general aviation.
- Other countries in the top 10 included Russia, Canada, Brazil, Colombia, and several in Europe and Asia.
- Countries with high accident counts generally have either large air traffic volume, difficult flying environments (like mountains or jungle), or large geographic areas that require air transport.

It's important to mention that these numbers don't directly reflect safety issues: they are also influenced by how many flights take place and how well incidents are reported.
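To see why raw counts can mislead, consider a rate that divides accidents by flight volume. The example below is purely illustrative: the two countries and all of their figures are made up, and the ASN data used in this report contains accident counts only, so no such rate was computed in the analysis.

```python
def accidents_per_million_flights(accidents, flights):
    """Normalize a raw accident count by the number of flights flown."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    return accidents * 1_000_000 / flights


# Entirely hypothetical figures for two made-up countries.
hypothetical = {
    "Country A": {"accidents": 200, "flights": 100_000_000},
    "Country B": {"accidents": 20, "flights": 2_000_000},
}

rates = {
    name: accidents_per_million_flights(v["accidents"], v["flights"])
    for name, v in hypothetical.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} accidents per million flights")
```

In this made-up case, Country A has ten times as many accidents as Country B but a lower rate per flight, which is the kind of reversal that raw counts cannot reveal.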

<p align="center">
  <img src="media/top-10-countries-by-number-of-accidents.png" alt="Top 10 Countries by Number of Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.4.1: Top 10 Countries by Number of Accidents</em></p>

#### Global Distribution by Coordinates

We visualized the actual geographic spread of accidents using the dataset's latitude and longitude values. Each point on the map represents the location of one accident (see Figure 5.4.2).

The map clearly shows that aviation accidents have occurred all around the world, but there are visible clusters in North America, Europe, and parts of Asia. Some areas, such as the Amazon region, remote parts of Africa, and mountainous areas, also show clusters, which may reflect the challenging flight conditions in those places (e.g., limited infrastructure, poor weather, or outdated aircraft).

Major urban areas and busy international flight paths naturally appear more frequently due to high air traffic density. In contrast, very isolated regions (like the deep ocean, Antarctica, or central deserts) have little or no recorded data, possibly due to both low flight volume and lack of data reporting.



<p align="center">
  <img src="media/global-distribution-of-aviation-accidents.png" alt="Global Distribution of Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 5.4.2: Global Distribution of Aviation Accidents</em></p>

#### Hemisphere Comparison

To further generalize the spatial distribution, we divided all accident locations into the Northern and Southern Hemispheres, based on their latitude.

As shown in Figure 5.4.3:

- The Northern Hemisphere accounts for a much larger share of aviation accidents.
- This difference reflects the global distribution of air traffic, since most countries with high flight volumes and dense populations are in the Northern Hemisphere.
- It also aligns with the presence of major air hubs, commercial airlines, and military operations in countries like the U.S., China, and European nations.
- The Southern Hemisphere, while still important, tends to have fewer flights and more remote operations, which may explain the lower count.
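The split itself only needs the sign of the latitude. The following sketch assumes that records without coordinates are kept in a separate "Unknown" bucket and that a latitude of exactly 0 counts as Northern; both choices, like the sample values, are assumptions rather than details taken from the notebooks.

```python
from collections import Counter


def hemisphere(latitude):
    """Classify a latitude in decimal degrees; None means no coordinates."""
    if latitude is None:
        return "Unknown"
    return "Northern" if latitude >= 0 else "Southern"


# Hypothetical latitudes, including one record without coordinates.
sample_latitudes = [43.7, 51.5, -23.5, None, 0.0, -33.9]
counts = Counter(hemisphere(lat) for lat in sample_latitudes)

print(counts)
```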



<p align="center">
  <img src="media/accidents-by-hemisphere.png" alt="Accidents by Hemisphere" width="1200"/>
</p>

<p align="center"><em>Figure 5.4.3: Accidents by Hemisphere</em></p>

---
## Key Findings

Throughout this project, we conducted a detailed exploratory analysis of global aviation accident data. This analysis included statistical summaries, time trends, aircraft and operator evaluations, and geographic distribution. Here are the main findings we uncovered:

### Accidents Have Decreased Over Time

From the time series analysis, it’s clear that the number of aviation accidents has decreased significantly from the 1990s to the 2020s. This trend suggests strong progress in aviation safety, likely due to better technologies, stricter regulations, and more advanced training practices. A noticeable drop after 2019 may also reflect the impact of the COVID-19 pandemic on air traffic.
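A yearly aggregation of this kind can be sketched as below. It is a standard-library illustration with hypothetical records and a hypothetical function name; a pandas `groupby` on the year column would be the more typical route in a notebook.

```python
from collections import defaultdict


def yearly_totals(records):
    """Return {year: (accident_count, total_fatalities)}, ordered by year."""
    totals = defaultdict(lambda: [0, 0])
    for year, fatalities in records:
        totals[year][0] += 1
        totals[year][1] += fatalities
    return {year: tuple(totals[year]) for year in sorted(totals)}


# Hypothetical (year, fatalities) pairs, one per accident record.
sample = [(2010, 0), (1996, 0), (1996, 230), (1996, 2), (2010, 5), (2021, 0)]
per_year = yearly_totals(sample)

for year, (accidents, fatalities) in per_year.items():
    print(year, "accidents:", accidents, "fatalities:", fatalities)
```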

<p align="center">
  <img src="media/accidents-vs-fatalities-over-time.png" alt="Accidents vs Fatalities Over Time" width="1200"/>
</p>

<p align="center"><em>Figure 6.1: Accidents vs Fatalities Over Time</em></p>

### Fatalities Are Often Low, but Outliers Exist

Most accidents in the dataset involved zero or very few fatalities, showing that not every accident leads to major loss of life. However, a few tragic events involved more than 100 or even 300 deaths, highlighting the potential severity of rare catastrophic crashes.

<p align="center">
  <img src="media/fatality-distribution-in-aviation-accidents.png" alt="Fatality Distribution in Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 6.2: Fatality Distribution in Aviation Accidents</em></p>

### Passenger Aircraft Are the Most Common in Accidents

Passenger aircraft dominate the dataset, with more than triple the number of accidents of any other category. This is expected because most air traffic around the world is passenger-based. Categories like cargo, military, and private flights follow with much lower counts.

<p align="center">
  <img src="media/accident-by-aircraft-nature.png" alt="Accidents by Aircraft Nature" width="1200"/>
</p>

<p align="center"><em>Figure 6.3: Accidents by Aircraft Nature</em></p>

### Landing and En Route Phases Are Most Risky

Most accidents happened during the landing phase, followed by the en route phase. Landing is a high-workload segment flown close to the ground, which is consistent with aviation safety research; the en route share may partly reflect how much of each flight is spent in that phase.

<p align="center">
  <img src="media/flight_phase_at_accident.png" alt="Flight Phase Distribution" width="1200"/>
</p>

<p align="center"><em>Figure 6.4: Flight Phase Distribution</em></p>

### Geographically, North America and Europe Have the Most Recorded Accidents

The United States alone accounts for over 2,000 accidents, with Canada, Russia, and the United Kingdom also showing high numbers. This likely reflects a combination of large aviation activity and more complete reporting systems in these countries.

<p align="center">
  <img src="media/top-10-countries-by-number-of-accidents.png" alt="Top 10 Countries by Number of Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 6.5: Top 10 Countries by Number of Accidents</em></p>

<p align="center">
  <img src="media/global-distribution-of-aviation-accidents.png" alt="Global Distribution of Aviation Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 6.6: Global Distribution of Aviation Accidents</em></p>

### Major Airlines Show High Counts

Delta, American, and United Airlines top the list of operators involved in accidents. However, this is expected due to their massive daily flight operations. The data does not imply they are unsafe, only that high exposure leads to higher counts.

<p align="center">
  <img src="media/top-10-operators-involved-in-accidents.png" alt="Top 10 Operators Involved in Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 6.7: Top 10 Operators Involved in Accidents</em></p>

### Certain Passenger Aircraft Types Stand Out in Accidents

While the Cessna 208B Grand Caravan appears most frequently in accident records, likely due to its extensive use in remote or regional operations, it's not the only aircraft of interest. Well-known commercial passenger jets like the Boeing 737-800 and various Airbus A320 series models also appear in the top 10.

Importantly, Boeing 737s are a backbone of many global fleets, including those operated by Air Canada, WestJet, and Sunwing, which are major carriers in Canada. Similarly, Fokker aircraft like the Fokker 100 and Fokker 50, although older, were widely used in North America and Europe, especially in regional airlines during the 1990s and 2000s.

Their presence in the dataset reflects how commonly these aircraft types were (or still are) used in commercial aviation, rather than indicating any specific safety issue. These aircraft typically operate in high-frequency routes, making them more likely to appear in incident data purely due to their high exposure.

<p align="center">
  <img src="media/top-10-passenger-aircraft-types-in-accidents.png" alt="Top 10 Passenger Aircraft Types in Accidents" width="1200"/>
</p>

<p align="center"><em>Figure 6.8: Top 10 Passenger Aircraft Types in Accidents</em></p>

<p align="center">
  <img src="https://images.aircharterservice.com/global/aircraft-guide/private-charter/cessna-208-grand-caravan-.jpg" alt="Cessna 208B Grand Caravan" width="1000"/>
</p>

<p align="center"><em>Cessna 208B Grand Caravan</em></p>

<p align="center">
  <img src="https://live.staticflickr.com/65535/50828661088_2f4209f85a_b.jpg" alt="Airbus A320-232" width="1000"/>
</p>

<p align="center"><em>Airbus A320-232</em></p>

<p align="center">
  <img src="https://assets.skiesmag.com/wp-content/uploads/2024/07/Air_Canada_Air_Canada_To_Receive_Eight_Boeing_737_8_Aircraft.jpg" alt="Boeing 737-8AS" width="1000"/>
</p>

<p align="center"><em>Boeing 737-8AS</em></p>

### Aircraft Are Often Older and Heavily Used

Most aircraft in the dataset were manufactured between the 1970s and early 2000s, with some recording more than 30,000 flight hours. This shows that many incidents involve aircraft with significant operational history, which could contribute to wear-related risks.

<p align="center">
  <img src="media/aircraft_age_and_usage_distributions.png" alt="Aircraft Age and Usage Distributions" width="1200"/>
</p>

<p align="center"><em>Figure 6.9: Aircraft Age and Usage Distributions</em></p>

---
## Limitations

While this analysis provided several valuable insights into aviation accidents, there are some limitations that should be acknowledged:

**Incomplete Data**: Some records had missing or inconsistent entries, particularly in columns such as `airframe_hours`, `operator`, or exact location coordinates. While we made efforts to clean and impute where possible, these gaps could impact the accuracy of our findings.

**Bias Toward Reported Cases**: The dataset only includes accidents that were officially reported and listed in the Aviation Safety Network. This may exclude minor incidents or cases from underreported regions, leading to geographic or demographic bias.

**Lack of Contextual Variables**: Important variables like weather conditions, pilot experience, maintenance history, or organizational safety culture were not available. These could significantly influence accident outcomes but were not part of this dataset.

**Time-Based Reporting Gaps**: The number of accidents in recent years might appear lower due to delays in investigation or reporting, which could skew the trend analysis toward the end of the timeline.

**Assumptions in Derived Features**: For example, when assigning seasons, we used hemisphere-aware logic based on latitude, but this still simplifies many real-world environmental conditions.
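To make the simplification concrete, hemisphere-aware season logic of this kind can be sketched as follows. The month-to-season mapping (meteorological seasons), the season names, and the function name are assumptions made for illustration and may differ from the notebook implementation.

```python
# Meteorological seasons for the Northern Hemisphere, keyed by month number.
NORTHERN_SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Seasons are reversed south of the equator.
OPPOSITE_SEASON = {
    "Winter": "Summer",
    "Summer": "Winter",
    "Spring": "Autumn",
    "Autumn": "Spring",
}


def assign_season(month, latitude):
    """Return the season for a month (1-12) at a latitude in decimal degrees."""
    if month not in NORTHERN_SEASON_BY_MONTH:
        raise ValueError(f"month must be between 1 and 12, got {month!r}")
    northern = NORTHERN_SEASON_BY_MONTH[month]
    return northern if latitude >= 0 else OPPOSITE_SEASON[northern]


print(assign_season(1, 43.7))   # January in the Northern Hemisphere
print(assign_season(1, -33.9))  # January in the Southern Hemisphere
```

A rule like this treats a location just north of the equator the same as one at high latitude, which is one reason the derived season is only a rough proxy for actual conditions.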

These limitations suggest that while our analysis can point out general trends and associations, it should not be taken as a definitive cause-and-effect interpretation of aviation safety factors.

---
## Future Work

This project opens up several directions for more detailed and robust analysis in the future:

**Incorporate External Data**: Merging this dataset with weather reports, pilot logs, or aircraft maintenance data could help explore deeper causal relationships.

**Natural Language Processing (NLP)**: Many detailed accident reports are available as text. Applying NLP techniques could help extract more context and explanations from the narratives.

**Predictive Modeling**: With cleaned and enriched data, machine learning models could be trained to predict factors such as accident severity or fatality likelihood based on known inputs.

**Temporal Decomposition**: More advanced time series techniques like seasonal-trend decomposition (STL) could provide better insights into long-term and short-term accident patterns.

**Interactive Dashboards**: Building a dashboard (e.g., using Power BI or Tableau) could help stakeholders explore the data more intuitively and perform custom filtering by country, aircraft, or accident type.

A good example of this is the [Aviation Safety Network (ASN) dashboard](https://aviation-safety.net/dashboard/safetyreport2023).

By addressing the current limitations and applying more advanced tools, future analysis can help improve the accuracy and usefulness of insights derived from aviation accident data.

---
## Conclusion

In this project, we conducted an exploratory data analysis of aviation accident data sourced from the Aviation Safety Network. We examined the dataset from multiple perspectives: statistical summaries, temporal trends, aircraft and operator characteristics, geographic patterns, and fatality data.

Key observations include:

- A noticeable decline in both accidents and fatalities over recent decades, suggesting improvements in aviation safety.
- Small aircraft such as Cessnas are involved in a large number of accidents, likely due to their high global presence and use in general aviation.
- Large commercial jets such as Boeing and Airbus models appear less often than the most frequent small types, but their accidents tend to involve more fatalities.
- Accidents are concentrated in the Northern Hemisphere, especially in countries with higher aviation traffic.
- Most accidents involved few or no fatalities, but rare high-fatality events still occur and demand attention.

Despite some data limitations, the analysis provides meaningful insights into aviation safety trends and highlights opportunities for future enhancements using more advanced techniques. It also demonstrates the value of structured data exploration in identifying both risks and areas of improvement in global aviation.

---
## Appendix

### A. Notebooks

All data preparation, analysis, and visualization were performed using separate Jupyter notebooks. Each step described in the report corresponds directly to a specific notebook. These notebooks contain the full code and can be accessed and downloaded using the links below:

### B. Data Source

The dataset used in this project was obtained by scraping the [Aviation Safety Network](https://asn.flightsafety.org/), covering global aviation accidents from 1995 to 2025. After preprocessing and cleaning, the dataset contained `7,999` records and `19` columns.

### C. References

- Aviation Safety Network. (2025). Aviation Accident Database. Retrieved from https://aviation-safety.net/

- Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95.

- Caswell, T. et al. (2025). Matplotlib Documentation. Retrieved from https://matplotlib.org/stable/index.html

- Waskom, M. et al. (2024). Seaborn: Statistical Data Visualization. Retrieved from https://seaborn.pydata.org/

- The pandas development team. (2025). pandas Documentation. Retrieved from https://pandas.pydata.org/

- McKinney, W. (2022). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter (3rd ed.). O'Reilly Media.

---
<div style="text-align: right; margin-top: 40px;">
<strong>Vladyslav Lysenko</strong> &nbsp;&nbsp; <strong>Vasudha Chandna</strong> &nbsp;&nbsp; <strong>Om Mistry</strong><br>
University of Toronto School of Continuing Studies<br>
August 2025
</div>