# Exploratory Analysis on Airplane Data

### Business Problem

Your company is expanding into new industries to diversify its portfolio. Specifically, it is interested in purchasing and operating airplanes for commercial and private enterprises, but it does not know anything about the potential risks of aircraft.

You are charged with determining which aircraft are the lowest risk for the company to start this new business endeavor.

You must then translate your findings into actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase.

In [1]:
#Import libraries
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
warnings.filterwarnings('ignore')

## Preview the Data Available

In [2]:
#This will be the main DataFrame used to analyze our business problem

df = pd.read_csv('data/Aviation_Data.csv')
df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


## Data Cleaning

In [3]:
# Collect the columns with more than 20% missing values (they are dropped below)
nan_cols = list(df.columns[df.isna().mean() > .2])

#drop uninformative columns
nan_cols.extend(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Injury.Severity',\
                 'Registration.Number', 'Report.Status', 'Publication.Date'])

# get a list of columns to keep
non_nan_cols = [x for x in df.columns if x not in nan_cols]

# remove columns from df
df = df[non_nan_cols].copy()
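
The threshold rule used in this cell can be checked in isolation on a small frame. The following is a minimal sketch on made-up data; the `toy`, `missing_share`, and `kept` names and all values are hypothetical and are not part of the aviation dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values standing in for the aviation data
toy = pd.DataFrame({
    'Make': ['Cessna', 'Piper', 'Boeing', 'Cessna', 'Piper', 'Stinson'],
    # 4 of 6 values missing (about 67%), so this column exceeds the threshold
    'Latitude': [np.nan, np.nan, 36.9, np.nan, np.nan, 40.1],
    # 1 of 6 values missing (about 17%), so this column is kept
    'Country': ['United States', None, 'United States',
                'United States', 'United States', 'United States'],
})

# Share of missing values per column; isna().mean() equals isna().sum() / len(toy)
missing_share = toy.isna().mean()

# Keep only the columns whose missing share does not exceed 20%
kept = toy.loc[:, missing_share <= 0.2]
print(list(kept.columns))
```

Printing `missing_share` before dropping anything is a quick way to see which columns sit close to the 20% cut-off.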

### Cleaning individual columns

In [4]:
# Clean Weather Condition
# Spell out the "IMC" and "VMC" abbreviations so each condition is clear.
# The result is assigned back because in-place edits on a selected column are unreliable in recent pandas.

df['Weather.Condition'] = df['Weather.Condition'].replace({'Unk': 'Unknown', 'UNK': 'Unknown',
                                                           'VMC': 'Visual Meteorological Conditions',
                                                           'IMC': 'Instrumental Meteorological Conditions'})

# Fill NaN values with "Unknown" for further analysis.

df['Weather.Condition'] = df['Weather.Condition'].fillna('Unknown')

In [5]:
# Clean Engine Type
# If the number of engines is zero, set Engine.Type to "No Engine"
df['Engine.Type'] = df.apply(lambda row: 'No Engine' if row['Number.of.Engines'] == 0
                             else row['Engine.Type'], axis=1)

# Replace NaN values and the 'UNK' code with the clearer label 'Unknown'.

df['Engine.Type'] = df['Engine.Type'].fillna('UNK').replace({'UNK': 'Unknown'})
df['Engine.Type'].value_counts(dropna=False)

Reciprocating      69528
Unknown             9374
Turbo Shaft         3609
Turbo Prop          3391
Turbo Fan           2481
No Engine           1226
Turbo Jet            703
Geared Turbofan       12
None                  11
Electric              10
LR                     2
Hybrid Rocket          1
Name: Engine.Type, dtype: int64

In [7]:
#Cleaning the 'Make' Column

#Create the list 'characters_to_remove' so we can drop any punctuation that could make otherwise identical names differ.

characters_to_remove = ['(', ')', ',', '.', '%', '?','-']

# Fill 'NaN' values with UNKNOWN.

df['Make'] = df['Make'].str.title().fillna('UNKNOWN')
df['Make'] = df['Make'].map(lambda x: ''.join(char for char in x if char not in characters_to_remove))

#Further analysis showed that some makes appear under several name variations or with extra words.
# The 'names_var' dictionary unifies the makes that have many variations.

names_var = {'Boeing': 'Boeing', 'Cirrus':'Cirrus','Airbus':'Airbus','Douglas':'Mcdonnel Douglas', \
             'Air Tractor':'Air Tractor','Embraer':'Embraer','Bombardier':'Bombardier'}
for key, value in names_var.items():
    df.loc[df['Make'].str.contains(key), 'Make'] = value
    
#Drop all rows whose 'Make' is helicopter related, since we are analyzing only airplanes.

df = df[~df['Make'].str.contains('helicopter|copter|robinson', case=False)]

#Since many makes have a very low number of incidents, we keep just the top 50.

top_50_makes = df['Make'].value_counts().index[:50]
df = df[df['Make'].isin(top_50_makes)]

#Cleaning the "Model" column
df['Model'] = df['Model'].str.title().fillna('UNKNOWN')
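
To see what the 'Make' normalization does to individual values, the same steps can be run on a handful of names. This is a minimal sketch; the sample names in `makes` are hypothetical and only the 'Boeing' unification is shown.

```python
import pandas as pd

# Hypothetical raw values illustrating case, punctuation, and naming variations
makes = pd.Series(['CESSNA', 'Cessna', 'The Boeing Company', 'BOEING',
                   'Piper Aircraft, Inc.', None])

characters_to_remove = ['(', ')', ',', '.', '%', '?', '-']

# Title-case first; missing values stay NaN through .str.title() and are then labeled
cleaned = makes.str.title().fillna('UNKNOWN')

# Strip punctuation so 'Piper Aircraft, Inc.' and 'Piper Aircraft Inc' would match
cleaned = cleaned.map(lambda x: ''.join(ch for ch in x if ch not in characters_to_remove))

# Collapse every value containing 'Boeing' into the single label 'Boeing'
cleaned.loc[cleaned.str.contains('Boeing', regex=False)] = 'Boeing'

print(cleaned.value_counts())
```

Because the match is a substring test, a short key can also capture unrelated manufacturers, so it is worth inspecting `value_counts()` before and after unifying.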

In [8]:
#Replacing NaN values with 'Unknown' in 'Purpose.of.flight':
df['Purpose.of.flight'] = df['Purpose.of.flight'].fillna('Unknown')

In [9]:
# Check whether any of the injury columns holds a recorded value. If any does, fill NaNs with 0s.
# If all columns are NaN, assume the data was not logged and keep them as NaN.
injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured']
# notna() is used because Python's any() treats NaN as truthy, which would also flag all-NaN rows
injury_data_exists = df[injury_cols].notna().any(axis=1)

for col in injury_cols:
    df.loc[injury_data_exists & df[col].isna(), col] = 0
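
The rule stated in the comments (fill gaps only when at least one injury count was recorded) can be verified on three representative rows. This is a minimal sketch with made-up values; it builds the mask with `notna().any(axis=1)` because Python's built-in `any()` treats NaN as truthy, and the `toy` and `has_any_record` names are hypothetical.

```python
import numpy as np
import pandas as pd

injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries',
               'Total.Minor.Injuries', 'Total.Uninjured']

# Hypothetical rows: partially logged, not logged at all, fully logged
toy = pd.DataFrame(
    [[3.0, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan],
     [0.0, 0.0, 1.0, 2.0]],
    columns=injury_cols,
)

# True where at least one injury column holds a recorded (non-null) value
has_any_record = toy[injury_cols].notna().any(axis=1)

# Fill gaps with 0 only in rows that have some record; all-null rows stay null
toy.loc[has_any_record, injury_cols] = toy.loc[has_any_record, injury_cols].fillna(0)

print(toy)
```

Under this rule the first row becomes 3, 0, 0, 0, while the second row keeps its missing values, so later sums over the injury columns stay missing for incidents that were never logged.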

In [10]:
# Replacing NaN values in Aircraft.damage with 'Unknown'
df['Aircraft.damage'] = df['Aircraft.damage'].fillna('Unknown')
df['Aircraft.damage'].value_counts(dropna = False)

Substantial    50647
Destroyed      14587
Unknown         4147
Minor           2322
Name: Aircraft.damage, dtype: int64

### Adding Columns, Engineering features on the dataset

#### Now that most of our columns are clean, we alter them to extract more valuable information for future analysis:

In [11]:
# We decided to keep only aircraft that are not amateur built.
# Remove rows where Amateur.Built is Yes or NaN, then remove the Amateur.Built column
df = df.drop(df.loc[(df['Amateur.Built']=='Yes') |( df['Amateur.Built'].isna())].index)
df.reset_index(drop = True, inplace = True)
df.drop(columns = 'Amateur.Built', inplace = True)
df.head()

Unnamed: 0,Event.Date,Location,Country,Aircraft.damage,Make,Model,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
0,1948-10-24,"MOOSE CREEK, ID",United States,Destroyed,Stinson,108-3,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,Unknown
1,1962-07-19,"BRIDGEPORT, CA",United States,Destroyed,Piper,Pa24-180,1.0,Reciprocating,Personal,4.0,0.0,0.0,0.0,Unknown
2,1974-08-30,"Saltville, VA",United States,Destroyed,Cessna,172M,1.0,Reciprocating,Personal,3.0,0.0,0.0,0.0,Instrumental Meteorological Conditions
3,1977-06-19,"EUREKA, CA",United States,Destroyed,Rockwell,112,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,Instrumental Meteorological Conditions
4,1979-08-02,"Canton, OH",United States,Destroyed,Cessna,501,,Unknown,Personal,1.0,2.0,0.0,0.0,Visual Meteorological Conditions


For the Event.Date, we decided to break it down into "year", "month" and "day" for further analysis.

In [12]:
#Transform the Event Date. Dropping 'Event.Date' once the columns are created:

df['Event.Date'] = pd.to_datetime(df['Event.Date'])
df['Event.Day'] = df['Event.Date'].map(lambda x: x.day)
df['Event.Month'] = df['Event.Date'].map(lambda x: x.month)
df['Event.Month.Name'] = df['Event.Date'].map(lambda x: x.month_name())
df['Event.Year'] = df['Event.Date'].map(lambda x: x.year)
df.drop(['Event.Date'], axis=1, inplace=True)
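
The same date parts can also be produced with the vectorized `.dt` accessor instead of mapping a lambda over each value. The following is a minimal sketch on two sample dates taken from the preview above; the `dates` and `parts` names are hypothetical.

```python
import pandas as pd

# Two event dates from the data preview, parsed into datetime values
dates = pd.to_datetime(pd.Series(['1948-10-24', '1979-08-02']))

# The .dt accessor exposes the date components for the whole Series at once
parts = pd.DataFrame({
    'Event.Day': dates.dt.day,
    'Event.Month': dates.dt.month,
    'Event.Month.Name': dates.dt.month_name(),
    'Event.Year': dates.dt.year,
})

print(parts)
```

Both approaches give the same columns; the accessor form is usually faster on large frames.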

We needed to know how many people were on board in each incident. We derived this number by adding up the four injury-related columns.

Additionally, we created percentages for the passengers who were injured, who were uninjured, and who died.

In [13]:
#Obtaining the total of passengers
df['Total.Passengers'] = df['Total.Fatal.Injuries'] + df['Total.Serious.Injuries'] + df['Total.Minor.Injuries'] \
+ df['Total.Uninjured']
df['Total.Injured'] = df['Total.Fatal.Injuries'] + df['Total.Serious.Injuries'] + df['Total.Minor.Injuries']

#Creating Percentages for every kind of situation.
# Note that we had to account for division by 0 where the plane had no passengers.

df['Percent.Injured'] = (df['Total.Injured'] / df['Total.Passengers']) * 100
df['Percent.Uninjured'] = (df['Total.Uninjured'] / df['Total.Passengers']) * 100
df['Percent.Died'] = (df['Total.Fatal.Injuries'] / df['Total.Passengers']) * 100
df.loc[df['Total.Passengers'] == 0, ['Percent.Injured','Percent.Died','Percent.Uninjured']] = 0
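
The division-by-zero handling can be illustrated with one percentage column on three rows. This is a minimal sketch with made-up numbers; the `toy` name and its values are hypothetical.

```python
import pandas as pd

# Hypothetical incidents: half injured, nobody injured, nobody on board
toy = pd.DataFrame({
    'Total.Injured': [2.0, 0.0, 0.0],
    'Total.Passengers': [4.0, 3.0, 0.0],
})

# pandas does not raise on 0 / 0; the third row simply becomes NaN here
toy['Percent.Injured'] = (toy['Total.Injured'] / toy['Total.Passengers']) * 100
had_nan = toy['Percent.Injured'].isna().tolist()

# Rows with nobody on board are then set to 0 explicitly
toy.loc[toy['Total.Passengers'] == 0, 'Percent.Injured'] = 0

print(toy)
```

Setting these rows to 0 is a modeling choice: it treats an empty aircraft as having no injuries rather than leaving the percentage undefined.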

### Commercial or private?

We needed a way to differentiate commercial from private planes. We decided to use the number of engines as a guide: every aircraft with 2 or fewer engines is considered private, and every aircraft with more than 2 is considered commercial.

In [14]:
# Creating the division between Private and Commercial

df['Airplane.Type'] = df['Number.of.Engines'].apply(lambda x: 'Private' if x <= 2 else 'Commercial')
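
One detail of this rule is worth checking on a small example: a missing engine count compares as False against `<= 2`, so it lands in the 'Commercial' group. The following is a minimal sketch; the `engines` values are made up, and the second labeling, which leaves missing counts unlabeled, is a hypothetical variation and not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical engine counts, including one missing value
engines = pd.Series([1.0, 2.0, 4.0, np.nan], name='Number.of.Engines')

# Same rule as the cell above: NaN <= 2 is False, so NaN falls into 'Commercial'
as_written = engines.apply(lambda x: 'Private' if x <= 2 else 'Commercial')

# Variation (an assumption for illustration): keep missing counts unlabeled
labels = pd.Series(np.where(engines <= 2, 'Private', 'Commercial'), index=engines.index)
labels = labels.where(engines.notna())

print(pd.DataFrame({'engines': engines, 'as_written': as_written, 'variation': labels}))
```

Rows with a missing engine count are dropped later in the notebook, so the difference only matters for any analysis run before that step.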

### Number of Engines

For the number of engines, we found occurrences where the number of engines is equal to zero.

We check if this is the case because there's actually no engine in that type of plane, or if it's a missing value.

In [15]:
# Check whether Number.of.Engines == 0 really corresponds to an aircraft without an engine:
# flag rows with zero engines whose Engine.Type is nevertheless a known value
zero_engine_mask = ((df['Number.of.Engines'] == 0)
                    & (df['Engine.Type'] != 'NONE')
                    & df['Engine.Type'].notna()
                    & (df['Engine.Type'] != 'Unknown'))
df[zero_engine_mask]

# Treat the engine count in those rows as unknown and drop them,
# together with the rows where the number of engines is null
df = df[~zero_engine_mask]
df = df.dropna(subset=['Number.of.Engines'])

### Working with the location

We decided to keep data from every country, so we could have more data to analyze the performance of the airplanes.

But, since we're not interested in the locations outside of the US, we rename all of the other countries "Foreign Country" to make the classification process easier.

For the Location column, we decided to keep only the code of the corresponding state. All occurrences that were either in a foreign country or unknown are categorized as "Unknown/Foreign Location".

In [16]:
#Grouping all the countries outside of the US as "Foreign Country"
df.loc[df['Country'] != 'United States', 'Country'] = 'Foreign Country'

#Filling the null values in the Location column with "UNKNOWN"
df['Location'] = df['Location'].fillna('UNKNOWN')

# Keep only the state code; label unknown or non-US locations accordingly
df['Location'] = df['Location'].apply(lambda location: location.split(', ')[-1] if
                                      len(location.split(', ')) > 1 and len(location.split(', ')[-1]) == 2 else
                                      'Unknown/Foreign Location')
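
The state-code rule is easier to read and test as a small named function. The following is a minimal sketch of the same logic; the `extract_state` name is hypothetical and the sample locations other than "MOOSE CREEK, ID" are made up.

```python
def extract_state(location: str) -> str:
    """Return the two-letter code after the last ', ', or a fallback label."""
    parts = location.split(', ')
    # A usable state code needs a comma-separated suffix of exactly two characters
    if len(parts) > 1 and len(parts[-1]) == 2:
        return parts[-1]
    return 'Unknown/Foreign Location'


for sample in ['MOOSE CREEK, ID', 'UNKNOWN', 'Paris, France']:
    print(sample, '->', extract_state(sample))
```

The function could be passed directly to `apply` on the Location column. Note that the rule only checks the length of the suffix, so any two-character ending is accepted as a state code.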

## Cleaned Dataframes

Once we were done with the cleaning process, we created two different DataFrames, one for private planes and another for commercial planes, to go deeper into the analysis of each category.

Each DataFrame can be written to its own CSV file, with a third file holding all of the cleaned, consolidated data.

In [17]:
#Create df_priv and df_comm to differentiate the data for private and commercial airplanes

df_priv = df[df['Airplane.Type'] == 'Private'].reset_index(drop = True)
df_comm = df[df['Airplane.Type'] == 'Commercial'].reset_index(drop = True)

#Creating corresponding CSV files
# df_priv.to_csv('priv_data_clean.csv', index_label = 'index')
# df_comm.to_csv('comm_data_clean.csv', index_label = 'index')
#df.to_csv('data_clean.csv', index_label = 'index')

In [18]:
df.shape

(64662, 23)

In [19]:
# Engine type decreases the difference from 955 to 316
# Number of engines decreases the difference from 955 to 177
# Switching them changes the difference from 955 to 558