# Aviation Crash Data Analysis 

<center><img src="./images/Header.png" width="1000"></center>

## Business Understanding 

### Business Objective
This project analyzes the [aviation accident dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) from Kaggle. The findings will inform recommendations to a client looking to expand its business by purchasing and operating airplanes for commercial and private enterprises.

### Background Information
Before proceeding, we do some research to gain domain knowledge. Several measures can be used to assess aircraft safety. For example, the *number of fatal crashes per 100,000 flights* is a natural primary measure. The *number of non-fatal incidents that led to injuries* or required the pilot to take extreme measures to land the aircraft may serve as a secondary measure [REF](https://assets.performance.gov/APG/files/2023/june/FY2023_June_DOT_Progress_Aviation_Safety.pdf).
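
As a rough illustration of the primary measure, the rate can be obtained by scaling the count of fatal crashes to a fixed number of flights. The helper name `fatal_crash_rate` and the figures below are hypothetical and not taken from the dataset; note that the calculation needs the total number of flights, which an accident database alone may not provide.

```python
def fatal_crash_rate(fatal_crashes: int, flights: int, per: int = 100_000) -> float:
    """Return fatal crashes per `per` flights (hypothetical helper)."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return fatal_crashes * per / flights


# Made-up numbers for illustration only: 3 fatal crashes in 1.2 million flights.
rate = fatal_crash_rate(3, 1_200_000)
print(rate)
```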

Some aspects of safety we can investigate are: 
- aircraft make and model: certain models may be more susceptible to incidents due to inherent manufacturing or design flaws.
- number of engines: an aircraft with four engines may be safer than one with two, a hypothesis the data can help test.
- location: mountainous regions or locations with extreme weather conditions could be more accident-prone.
- manufacturer's safety compliance policies: quantifiable metrics such as the frequency of inspections could be studied provided that the data is available.
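
To make the first factor concrete, here is a minimal sketch of how events could be summarized by make and model with pandas. The rows are invented toy data, and the column names assume the cleaned-up naming used later in this notebook (`Make`, `Model`, `Total fatal injuries`). Raw shares like these say nothing about how many flights each model flew, so they are only a starting point.

```python
import pandas as pd

# Toy records, invented for illustration; not rows from the real dataset.
events = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Piper", "Piper", "Piper"],
    "Model": ["172", "172", "PA-28", "PA-28", "PA-28"],
    "Total fatal injuries": [0.0, 2.0, 0.0, 0.0, 1.0],
})

# Flag events with at least one fatality.
events["Fatal"] = events["Total fatal injuries"] > 0

# Count events and compute the share of fatal events per make/model pair.
summary = (
    events.groupby(["Make", "Model"])["Fatal"]
    .agg(n_events="count", fatal_share="mean")
    .reset_index()
)
print(summary)
```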

### Data Mining Goals
Successful mining of the data would identify potential factors that affect aircraft safety. The correlation of each factor with aircraft safety, and the strength of that correlation, will help identify the lowest-risk aircraft with which the client can start its business endeavor.
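
One simple way to quantify such a relationship is a Pearson correlation between a numeric factor and a fatal/non-fatal flag. The sketch below uses invented toy values; the column name `Number of engines` is an assumption about the dataset, and a correlation on accident records alone does not establish causation.

```python
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "Number of engines": [1, 1, 1, 2, 2, 4],  # assumed column name
    "Total fatal injuries": [1.0, 0.0, 2.0, 0.0, 0.0, 0.0],
})

# 1 if the event had at least one fatality, else 0.
toy["Fatal"] = (toy["Total fatal injuries"] > 0).astype(int)

# Series.corr defaults to the Pearson correlation coefficient.
corr = toy["Number of engines"].corr(toy["Fatal"])
print(round(corr, 3))
```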

### Project Plan
We will first skim the data to get a preliminary understanding of what is available and whether it is clean and ready for use. Next, we will determine which parts of the data are useful for our analysis. This portion of the data will be prepared, cleaned, and organized. Finally, we will use insights from the data, along with visualizations, to make appropriate business recommendations. To complement this, we will highlight the limitations of the work and potential future investigations.

## Data Understanding

In this section, we take a first look at the data to get a preliminary understanding of its type and what it contains. 

In [91]:
# importing the required modules
import pandas as pd
#pd.set_option('display.max_rows', None)

In [133]:
# read the data 
df = pd.read_csv("./data/Aviation_Data.csv",low_memory=False)
# inspect the data
df.info()

## improve aesthetics
# replace . with space in column names to increase readability
df.rename(columns=lambda x: x.replace('.',' '),inplace=True)
# standardize the capitalization of column names
df.columns = [column.capitalize() for column in df.columns]
df.rename(columns={"Far description": "FAR description"},inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

**Holistic Description of the Data:**
- The data has 31 columns. There are two data types: object (string/text) and float.
- We notice that some columns have a substantial amount of missing data. We will address this in more detail while doing EDA.
- We take some time to make sure we understand what each column represents and brainstorm how it can be leveraged for the type of analysis we're doing. For example, the FAR description column contains descriptions or codes that specify which Federal Aviation Regulations are relevant to each accident.

Next, we take a look at a small chunk of the data:

In [134]:
# taking a peek at the last few rows of the data
df.tail(3)

Unnamed: 0,Event id,Investigation type,Accident number,Event date,Location,Country,Latitude,Longitude,Airport code,Airport name,...,Purpose of flight,Air carrier,Total fatal injuries,Total serious injuries,Total minor injuries,Total uninjured,Weather condition,Broad phase of flight,Report status,Publication date
90345,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
90346,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,
90347,20221230106513,Accident,ERA23LA097,2022-12-29,"Athens, GA",United States,,,,,...,Personal,,0.0,1.0,0.0,1.0,,,,30-12-2022


In [None]:
We are now prepared to move on to data preparation.

### Data Preparation

In [None]:
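# check for missing data: percentage of missing values per column
#df.isna().sum()/len(df)*100
# unique values of the FAR description column (name as renamed above)
#df["FAR description"].unique()
# total fatal injuries per model, keeping rows where the model is missing
#df.groupby('Model',dropna=False)['Total fatal injuries'].sum()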
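
The missing-data check in this cell could feed a simple rule for flagging sparse columns. The sketch below is one possible approach on invented toy data; the 50% cutoff is a hypothetical choice, not a value used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; column names mirror the notebook's.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", None, "Boeing"],
    "Latitude": [None, None, "341525N", None],
    "Total fatal injuries": [0.0, np.nan, 1.0, 0.0],
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# Hypothetical cutoff: flag columns with more than half of their values missing.
threshold = 50
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct.sort_values(ascending=False))
print(sparse_columns)
```

Whether a flagged column should be dropped, imputed, or kept depends on how central it is to the analysis, so the list is best treated as a prompt for review rather than an automatic decision.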

### Exploratory Data Analysis

### Conclusion

### Limitations

## Recommendations

## Next Steps