# Aviation Crash Data Analysis 

<center><img src="./images/Header.png" width="1000"></center>

## Business Understanding 

### Business Objective
This project analyzes the [aviation accident dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) from Kaggle. The findings will be used to make recommendations to a client that wants to expand its business by purchasing and operating airplanes for commercial and private enterprises.

### Background Information
Before going any further, we do some research to gain domain knowledge. Several measures can be used to assess the safety of aircraft. For example, the *number of fatal crashes per 100,000 flights* appears to be a standard primary measure. The *number of non-fatal incidents that led to injuries* or required the pilot to take extreme measures to land the aircraft may be considered a secondary measure [REF](https://assets.performance.gov/APG/files/2023/june/FY2023_June_DOT_Progress_Aviation_Safety.pdf).
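As a minimal sketch of how the primary measure is computed, the helper below normalizes a crash count by the number of flights. The function name `rate_per_100k` and the counts are hypothetical and only illustrate the arithmetic. The dataset lists accidents and incidents but not the total number of flights, so the denominator would have to come from an external source.

```python
def rate_per_100k(fatal_crashes: int, flights: int) -> float:
    """Return the number of fatal crashes per 100,000 flights."""
    if flights <= 0:
        raise ValueError("flights must be a positive number")
    return fatal_crashes / flights * 100_000


# Hypothetical counts, for illustration only (not taken from the dataset).
example_rate = rate_per_100k(fatal_crashes=3, flights=1_200_000)
print(f"{example_rate:.2f} fatal crashes per 100,000 flights")
```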

Some aspects of safety we can investigate are: 
- aircraft make and model: certain models may be more susceptible to incidents due to inherent manufacturing/design flaws.
- number of engines: a 4-engine aircraft might be expected to be safer than a 2-engine one because of the added redundancy; the data can be used to check this.
- location: mountainous regions or locations with extreme weather conditions could be more accident-prone.
- manufacturer's safety compliance policies: quantifiable metrics such as the frequency of inspections could be studied provided that the data is available.

### Data Mining Goals
A successful mining of the data would determine potential factors that affect the safety of aircraft. The correlation of each factor with aircraft safety, and the strength of that correlation, will help identify the lowest-risk aircraft with which the client can start its business endeavor.
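One possible way to relate a factor to safety is to compare, for each value of that factor, the share of occupants who were fatally injured. The sketch below does this on a toy table with made-up numbers. The column names mirror those used later in the notebook, and the helper `fatal_share_by` is a hypothetical name, not part of the analysis itself.

```python
import pandas as pd

# Toy records with hypothetical values (not taken from the dataset).
toy = pd.DataFrame({
    "Number of engines": [1.0, 1.0, 2.0, 2.0, 4.0],
    "Total fatal injuries": [2.0, 0.0, 1.0, 0.0, 0.0],
    "Total serious injuries": [0.0, 1.0, 0.0, 0.0, 0.0],
    "Total minor injuries": [0.0, 0.0, 1.0, 0.0, 2.0],
    "Total uninjured": [0.0, 1.0, 2.0, 4.0, 98.0],
})

PEOPLE_COLUMNS = [
    "Total fatal injuries",
    "Total serious injuries",
    "Total minor injuries",
    "Total uninjured",
]


def fatal_share_by(df: pd.DataFrame, factor: str) -> pd.Series:
    """Share of all occupants who were fatally injured, per value of `factor`."""
    totals = df.groupby(factor)[PEOPLE_COLUMNS].sum()
    occupants = totals.sum(axis=1)
    return (totals["Total fatal injuries"] / occupants).rename("Fatal share")


shares = fatal_share_by(toy, "Number of engines")
print(shares)
```

A share computed this way only describes the events that were reported; it says nothing about how often each type of aircraft flies, so it should be read as a rough comparison rather than a crash rate.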

### Project Plan
We will first skim the data to get a preliminary understanding of what is available and whether it is clean and ready for use. Next, we will determine which parts of the data are useful for our analysis. This portion of the data will be prepared, cleaned, and organized. In the end, we will use insights from the data along with visualizations to make appropriate business recommendations. As a complement, the limitations of the work and potential future investigations will be highlighted.

## Data Understanding

In this section, we take a first look at the data to get a preliminary understanding of its type and what it contains. 

In [91]:
# importing the required modules
import pandas as pd
#pd.set_option('display.max_rows', None)

In [133]:
# read the data
df = pd.read_csv("./data/Aviation_Data.csv", low_memory=False)
# inspect the data (column names are still the original dotted ones at this point)
df.info()

## improve aesthetics
# replace . with space in column names to increase readability
df.rename(columns=lambda x: x.replace('.', ' '), inplace=True)
# standardize the capitalization of column names (first letter upper case, the rest lower case)
df.columns = [column.capitalize() for column in df.columns]
# restore the acronym that capitalize() turned into "Far"
df.rename(columns={"Far description": "FAR description"}, inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

In [144]:
# taking a peek at a few rows of the data
pd.set_option('display.max_columns', None)
df.head(5)

|  | Event id | Investigation type | Accident number | Event date | Location | Country | Latitude | Longitude | Airport code | Airport name | Injury severity | Aircraft damage | Aircraft category | Registration number | Make | Model | Amateur built | Number of engines | Engine type | FAR description | Schedule | Purpose of flight | Air carrier | Total fatal injuries | Total serious injuries | Total minor injuries | Total uninjured | Weather condition | Broad phase of flight | Report status | Publication date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | NaN | NaN | NaN | NaN | Fatal(2) | Destroyed | NaN | NC6404 | Stinson | 108-3 | No | 1.0 | Reciprocating | NaN | NaN | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | NaN |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | NaN | NaN | NaN | NaN | Fatal(4) | Destroyed | NaN | N5069P | Piper | PA24-180 | No | 1.0 | Reciprocating | NaN | NaN | Personal | NaN | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.922223 | -81.878056 | NaN | NaN | Fatal(3) | Destroyed | NaN | N5142R | Cessna | 172M | No | 1.0 | Reciprocating | NaN | NaN | Personal | NaN | 3.0 | NaN | NaN | NaN | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | NaN | NaN | NaN | NaN | Fatal(2) | Destroyed | NaN | N1168J | Rockwell | 112 | No | 1.0 | Reciprocating | NaN | NaN | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | NaN | NaN | NaN | NaN | Fatal(1) | Destroyed | NaN | N15NY | Cessna | 501 | No | NaN | NaN | NaN | NaN | Personal | NaN | 1.0 | 2.0 | NaN | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |
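
In the rows above, the event date is written as year-month-day while the publication date is written as day-month-year, and both are stored as text. A minimal sketch for turning them into proper datetimes is given below. It assumes that every row follows the format seen in these first five rows, which has not been verified, so `errors="coerce"` is used to turn anything unexpected into a missing value instead of raising an error.

```python
import pandas as pd

# A few values copied from the rows shown above.
sample = pd.DataFrame({
    "Event date": ["1948-10-24", "1962-07-19", "1974-08-30"],
    "Publication date": [None, "19-09-1996", "26-02-2007"],
})

# Event date looks like YYYY-MM-DD.
sample["Event date"] = pd.to_datetime(
    sample["Event date"], format="%Y-%m-%d", errors="coerce"
)
# Publication date looks like DD-MM-YYYY (assumed for all rows).
sample["Publication date"] = pd.to_datetime(
    sample["Publication date"], format="%d-%m-%Y", errors="coerce"
)

print(sample.dtypes)
print(sample)
```

With both columns as datetimes, it becomes straightforward to filter by year or to compute the delay between an event and the publication of its report.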


In [159]:
# Take a look at some of the items you don't understand
print('Categorical data for Broad phase of flight:')
print(df["Broad phase of flight"].unique(),'\n')

print('Categorical data for FAR description:')
print(df["FAR description"].unique(),'\n')

print('Categorical data for Engine type:')
print(df["Engine type"].unique())

Categorical data for Broad phase of flight:
['Cruise' 'Unknown' 'Approach' 'Climb' 'Takeoff' 'Landing' 'Taxi'
 'Descent' 'Maneuvering' 'Standing' 'Go-around' 'Other' nan] 

Categorical data for FAR description:
[nan 'Part 129: Foreign' 'Part 91: General Aviation'
 'Part 135: Air Taxi & Commuter' 'Part 125: 20+ Pax,6000+ lbs'
 'Part 121: Air Carrier' 'Part 137: Agricultural'
 'Part 133: Rotorcraft Ext. Load' 'Unknown' 'Part 91F: Special Flt Ops.'
 'Non-U.S., Non-Commercial' 'Public Aircraft' 'Non-U.S., Commercial'
 'Public Use' 'Armed Forces' 'Part 91 Subpart K: Fractional' '091' 'NUSC'
 '135' 'NUSN' '121' '137' '129' '133' '091K' 'UNK' 'PUBU' 'ARMF' '103'
 '125' '437' '107'] 

Categorical data for Engine type:
['Reciprocating' nan 'Turbo Fan' 'Turbo Shaft' 'Unknown' 'Turbo Prop'
 'Turbo Jet' 'Electric' 'Hybrid Rocket' 'Geared Turbofan' 'LR' 'NONE'
 'UNK']
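
The output above shows that FAR description mixes long labels (e.g., 'Part 121: Air Carrier') with short codes (e.g., '121'), and that unknown values appear under more than one spelling ('Unknown', 'UNK'). Below is a minimal sketch of how such values could be consolidated. The code-to-label pairs are an assumption, inferred only by matching the codes to the labels in the output; codes without an obvious counterpart (such as '103') are left untouched.

```python
import pandas as pd

# Assumed code-to-label pairs, inferred from the unique values printed above.
FAR_CODE_TO_LABEL = {
    "091": "Part 91: General Aviation",
    "121": "Part 121: Air Carrier",
    "135": "Part 135: Air Taxi & Commuter",
    "137": "Part 137: Agricultural",
    "NUSC": "Non-U.S., Commercial",
    "NUSN": "Non-U.S., Non-Commercial",
    "UNK": "Unknown",
}


def consolidate_far(values: pd.Series) -> pd.Series:
    """Replace known short codes with their long labels; leave other values as is."""
    return values.replace(FAR_CODE_TO_LABEL)


# Small example containing codes, labels, a missing value, and an unmapped code.
far = pd.Series(["091", "Part 91: General Aviation", "121", "UNK", None, "103"])
clean_far = consolidate_far(far)
print(clean_far.value_counts(dropna=False))
```

The same pattern could be applied to Engine type, where 'Unknown' and 'UNK' also appear side by side.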


**Holistic Description of the Data:**
- The data has 31 columns and two data types: object (string/text) and float. The float type is workable for the numeric columns (i.e., number of engines, total fatal/serious/minor injuries, and total uninjured). An int type would describe these counts more naturally, but the columns contain missing values, and changing the data type might introduce unwanted issues when performing statistical analysis, so we will keep it as is for now.
- Some columns have a substantial amount of missing data. For example, Latitude, Longitude, and Aircraft category are each populated in fewer than 40% of the rows. We will address this in more detail in the next sections (Data Preparation and EDA).
- We took some time to understand what each column represents and whether it can be leveraged for the type of analysis we're doing. The items that were not immediately clear are explained below:
    - Investigation type: accident vs. incident. Incidents are occurrences that affect (or could affect) the safety of operations but do not result in death, serious injury, or substantial damage to the aircraft.
    - FAR description: Based on the values listed above, this column identifies the part of the Federal Aviation Regulations under which the flight was operated (e.g., 'Part 91: General Aviation', 'Part 121: Air Carrier'). It mixes long labels with short codes (e.g., 'Part 121: Air Carrier' and '121'), so it needs to be consolidated before use. It can help separate commercial operations from private ones.
    - Engine type: includes reciprocating, turbo jet, etc. Engine type has been documented to have an effect on aircraft safety [REF](https://dk.upce.cz/bitstream/handle/10195/74791/Use_of_Aircraft_Engine_Type_and_Quantity_and_their_Impact_on_Air_Transport_Safety.pdf?sequence=1&isAllowed=y).
    - Broad phase of flight: indicates the phase of flight during which the accident or incident happened. Includes categorical data such as "Cruise", "Taxi", etc. This may be useful for identifying risks associated with each phase of flight, but it is not necessarily relevant to our analysis.
    - Report status: shows whether the report on the accident is final or still in progress.
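
Regarding the data types discussed above: if whole-number counts are preferred later on, pandas offers a nullable integer dtype, `Int64` (capital "I"), that can hold integers together with missing values. The sketch below shows the conversion on a toy table with values similar to the rows displayed earlier. Whether this dtype suits the later statistical steps is an open question, so this is an option to consider, not a change made in this notebook.

```python
import pandas as pd

# Toy float columns with a missing value, similar to the dataset's count columns.
counts = pd.DataFrame({
    "Number of engines": [1.0, 1.0, None],
    "Total fatal injuries": [2.0, 4.0, 1.0],
})

# "Int64" is pandas' nullable integer dtype; missing values become <NA>.
as_int = counts.astype("Int64")

print(as_int.dtypes)
print(as_int)
```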

Based on what was discussed above, the following features will be of interest for our analysis:
- Investigation type, location, country, event date, injury severity, aircraft damage, aircraft category, make, model, number of engines, engine type, FAR description, air carrier, total injuries (fatal, serious, and minor), and total uninjured.
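
A minimal sketch of how the features of interest could be pulled into a smaller working table is shown below. The column names follow the renamed columns from earlier in the notebook, "total injuries" is interpreted here as the three injury-count columns, and `select_features` is a hypothetical helper name.

```python
import pandas as pd

# Features of interest, spelled as in the renamed column names.
FEATURES = [
    "Investigation type", "Location", "Country", "Event date",
    "Injury severity", "Aircraft damage", "Aircraft category",
    "Make", "Model", "Number of engines", "Engine type",
    "FAR description", "Air carrier",
    "Total fatal injuries", "Total serious injuries",
    "Total minor injuries", "Total uninjured",
]


def select_features(df: pd.DataFrame, features=FEATURES) -> pd.DataFrame:
    """Return a copy of df restricted to the requested columns that exist in it."""
    present = [name for name in features if name in df.columns]
    missing = [name for name in features if name not in df.columns]
    if missing:
        print(f"Columns not found and skipped: {missing}")
    return df[present].copy()


# Toy table: two columns of interest and one that is not on the list.
toy = pd.DataFrame({"Make": ["Cessna"], "Model": ["172M"], "Schedule": [None]})
subset = select_features(toy)
print(subset.columns.tolist())
```

Working on a copy keeps the original table intact, so dropped columns can still be brought back if they turn out to be useful.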


## Data Preparation

In [None]:
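# check for missing data: percentage of missing values per column
df.isna().sum() / len(df) * 100
# other checks used while exploring (column names as renamed above)
#df["FAR description"].unique()
#df.groupby('Model', dropna=False)['Total fatal injuries'].sum()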

## Exploratory Data Analysis (EDA)

### Conclusion

### Limitations

## Recommendations

## Next Steps