# Aviation Crash Data Analysis 

## Business Understanding 

### Overview
This project analyzes the [aviation accident dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) on Kaggle to determine which factors affect aircraft safety. The findings will inform recommendations to a client that wants to expand its business by purchasing and operating airplanes for commercial and private enterprises. Insights from the data will be used to identify the lowest-risk aircraft with which the client can start this endeavor.

### Background Information
Before proceeding, we do some research to gain domain knowledge. Different measures can be used to assess aircraft safety. For example, the *number of fatal crashes per 100,000 flights* seems like a standard measure to consider. The *number of non-fatal incidents that led to injuries* or required the pilot to take extreme measures to land the aircraft may be considered a secondary measure [REF](https://assets.performance.gov/APG/files/2023/june/FY2023_June_DOT_Progress_Aviation_Safety.pdf).
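
As a rough illustration of the first measure, the rate can be computed from a count of fatal crashes and a count of flights. The sketch below is only an assumption about how such a rate would be calculated: the function name `fatal_crash_rate_per_100k` and the example figures are hypothetical, and the flight count is assumed to come from an external source, since an accident table records events rather than total flights.

```python
def fatal_crash_rate_per_100k(fatal_crashes: int, total_flights: int) -> float:
    """Return the number of fatal crashes per 100,000 flights."""
    if total_flights <= 0:
        raise ValueError("total_flights must be positive")
    # Multiply before dividing so that round figures stay exact.
    return fatal_crashes * 100_000 / total_flights


# Hypothetical figures, for illustration only.
example_rate = fatal_crash_rate_per_100k(fatal_crashes=3, total_flights=600_000)
print(example_rate)  # 0.5
```

Without a flight count for each aircraft type, only raw counts or shares of recorded events can be compared, which is a weaker basis for ranking aircraft by risk.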

Some aspects of safety we can investigate are: 
- aircraft make and model: certain models may be more susceptible to incidents due to inherent manufacturing/design flaws.
- number of engines: a 4-engine aircraft might be expected to be safer than a 2-engine one; this is a hypothesis to check against the data.
- location: mountainous regions or locations with extreme weather conditions could be more accident-prone.
- manufacturer's safety compliance policies: quantifiable metrics such as the frequency of inspections could be studied provided that the data is available.
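
Once the data is loaded, an aspect such as the number of engines could be examined by grouping the records and comparing the share of events that involved at least one fatality. The following is a minimal sketch on a made-up table, not the project's actual analysis: the `toy` frame and the `Fatal` helper column are hypothetical, and the column names assume the cleaned naming used later in this notebook.

```python
import pandas as pd

# Toy records for illustration only; these are not rows from the real dataset.
toy = pd.DataFrame({
    "Number of engines": [1, 1, 2, 2, 4],
    "Total fatal injuries": [0, 2, 0, 0, 1],
})

# Flag events with at least one fatality (hypothetical helper column).
toy["Fatal"] = toy["Total fatal injuries"] > 0

# Share of recorded events that were fatal, per engine count.
fatal_share = toy.groupby("Number of engines")["Fatal"].mean()
print(fatal_share)
```

This gives the share of *recorded events* that were fatal, not a rate per flight, so it says nothing about how often each configuration flies.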

## Data Understanding

In this section, we take a first look at the data to get a preliminary understanding of its structure and contents.

In [91]:
# importing the required modules
import pandas as pd
#pd.set_option('display.max_rows', None)

In [113]:
# read the data
df = pd.read_csv("./data/Aviation_Data.csv", low_memory=False)
# inspect the data (column names, non-null counts, dtypes)
df.info()

## improve aesthetics
# replace . with space in column names to increase readability
df.rename(columns=lambda x: x.replace('.', ' '), inplace=True)
# standardize the capitalization of column names
df.columns = [column.capitalize() for column in df.columns]
# number of columns
len(df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

31
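
The column-name cleanup in the cell above can also be written with pandas' vectorized string methods. The sketch below is an alternative phrasing, not the notebook's own code; it runs on an empty, hypothetical `sample` frame whose three column names are copied from the `df.info()` output, so the effect of the two steps is easy to see.

```python
import pandas as pd

# Empty frame carrying three of the original column names.
sample = pd.DataFrame(columns=["Event.Id", "Investigation.Type", "Aircraft.damage"])

# Replace literal dots with spaces, then capitalize only the first letter.
sample.columns = sample.columns.str.replace(".", " ", regex=False).str.capitalize()

cleaned_names = list(sample.columns)
print(cleaned_names)  # ['Event id', 'Investigation type', 'Aircraft damage']
```

Passing `regex=False` matters here: treated as a regular expression, `.` would match every character rather than only the literal dot.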

### Some Observations 
- The data has 31 columns. There are two data types: object (string/text) and float.

In [108]:
# taking a peek at the last few rows of the data
df.tail(3)

Unnamed: 0,Event id,Investigation type,Accident number,Event date,Location,Country,Latitude,Longitude,Airport code,Airport name,...,Purpose of flight,Air carrier,Total fatal injuries,Total serious injuries,Total minor injuries,Total uninjured,Weather condition,Broad phase of flight,Report status,Publication date
90345,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
90346,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,
90347,20221230106513,Accident,ERA23LA097,2022-12-29,"Athens, GA",United States,,,,,...,Personal,,0.0,1.0,0.0,1.0,,,,30-12-2022


In [106]:
# percentage of missing values in each column
df.isna().sum()/len(df)*100

Event id                   1.614867
Investigation type         0.000000
Accident number            1.614867
Event date                 1.614867
Location                   1.672422
Country                    1.865011
Latitude                  61.944924
Longitude                 61.954886
Airport code              44.512330
Airport name              41.665560
Injury severity            2.721698
Aircraft damage            5.150086
Aircraft category         64.263736
Registration number        3.144508
Make                       1.684597
Model                      1.716695
Amateur built              1.727764
Number of engines          8.348829
Engine type                9.468942
Far description           64.555939
Schedule                  86.073848
Purpose of flight          8.468367
Air carrier               81.573471
Total fatal injuries      14.233851
Total serious injuries    15.461327
Total minor injuries      14.822686
Total uninjured            8.158454
Weather condition          6
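
One possible way to act on these percentages is to flag columns whose share of missing values exceeds a cutoff. The sketch below uses a made-up `toy` frame and a hypothetical `threshold` of 60%; neither the cutoff nor the decision to drop such columns comes from the notebook, so treat it as an assumption to revisit during data preparation.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", None, "Boeing"],
    "Latitude": [None, None, None, "341525N"],
    "Schedule": [None, None, None, None],
})

# Same quantity as isna().sum() / len(df) * 100, written with mean().
missing_pct = toy.isna().mean() * 100

threshold = 60  # hypothetical cutoff, in percent
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()
print(sparse_columns)  # ['Latitude', 'Schedule']
```

Columns flagged this way are candidates for removal or for separate treatment, depending on how important they are to the safety questions listed earlier.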

In [83]:
# list the distinct air carrier names
df["Air carrier"].unique()

array([nan, 'Air Canada', 'Rocky Mountain Helicopters, In', ...,
       'SKY WEST AVIATION INC TRUSTEE', 'GERBER RICHARD E',
       'MC CESSNA 210N LLC'], dtype=object)

In [56]:
#df.groupby('Model',dropna=False)['Total.Fatal.Injuries'].sum()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
69435,20110613X22328,Accident,DCA11WA069,2011-01-09,"Urmia, Iran (Islamic Republic of)",Iran,,,OITR,Orumiyeh Airport,...,,Iran Air,0.0,0.0,0.0,0.0,,,,25-09-2020
76661,20150928X01411,Incident,ENG15WA034,2015-02-15,"Tehran, Iran (Islamic Republic of)",Iran,,,,,...,,,0.0,0.0,0.0,0.0,,,,25-09-2020
78254,20160128X04914,Accident,DCA16WA076,2016-01-28,"Mashad, Iran (Islamic Republic of)",Iran,,,OIMM,Mashad International Airport,...,,Zagros Air,0.0,9.0,0.0,142.0,,,,25-09-2020
80223,20170328X20623,Accident,DCA17WA080,2017-03-27,"Ardabil, Iran (Islamic Republic of)",Iran,,,OITL,Ardabil,...,,,0.0,0.0,0.0,185.0,,,,25-09-2020
81928,20180321X95714,Accident,DCA18WA106,2018-03-11,"Shahr-e Kurd, Iran (Islamic Republic of)",Iran,,,,,...,,,11.0,0.0,0.0,0.0,,,,25-09-2020


### Data Preparation

### Exploratory Data Analysis

### Conclusion

### Limitations

## Recommendations

## Next Steps