# Data Cleaning

<center><img src="https://en.meming.world/images/en/a/ab/Traumatized_Mr._Incredible.jpg" title="Before and after data cleaning"/></center>

<center> Before and after data cleaning </center>

In [2]:
#Importing relevant libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns


In [3]:
# Importing our data file
df_original = pd.read_csv('../Data/AviationData.csv', encoding='latin1', low_memory=False)

# Making a copy of the original dataframe in case I need it
df = df_original.copy()

# Listing the columns to decide which ones are irrelevant to our goal and can be pruned
df.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

# Foreword

Most of the data cleaning is contained in this notebook, with a couple of exceptions. String cleaning for the Make and Model columns is done in the stringcleaning.ipynb notebook and then used both here and in EDA later on. Further pruning of the dataset was done in EDA as the need arose.

## Data cleaning to-do list


1. Drop irrelevant columns
2. Deal with missing values
3. Clean up rows/intra-column clean-up: filter Investigation.Type (now Type) to only accidents, Amateur.Built to only non-amateur aircraft, and (*maybe*) Number.of.Engines to only 1 & 2, among other things
4. Check whether the remaining columns are worth keeping
5. Replace all null values in passenger numbers with 0 (assumption here is that all nulls are 0s that were not manually entered into data)
6. Clean up duplicates in the Make column

<center><img src="https://cdn-icons-png.flaticon.com/512/10179/10179118.png" width="260" height="260"/></center>

### __1. Column cleaning__

In [4]:
# Dropping all irrelevant columns
irrelevant_columns = ['Event.Id', 'Accident.Number', 'Latitude', 'Longitude', 'Airport.Code', 'Airport.Name', 
'Registration.Number', 'FAR.Description', 'Purpose.of.flight', 'Air.carrier', 'Report.Status', 'Publication.Date']

df = df.drop(columns=irrelevant_columns)
df.columns

Index(['Investigation.Type', 'Event.Date', 'Location', 'Country',
       'Injury.Severity', 'Aircraft.damage', 'Aircraft.Category', 'Make',
       'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type',
       'Schedule', 'Total.Fatal.Injuries', 'Total.Serious.Injuries',
       'Total.Minor.Injuries', 'Total.Uninjured', 'Weather.Condition',
       'Broad.phase.of.flight'],
      dtype='object')

In [5]:
# Checking for the number of missing values in our remaining columns
df.isna().sum()

Investigation.Type            0
Event.Date                    0
Location                     52
Country                     226
Injury.Severity            1000
Aircraft.damage            3194
Aircraft.Category         56602
Make                         63
Model                        92
Amateur.Built               102
Number.of.Engines          6084
Engine.Type                7096
Schedule                  76307
Total.Fatal.Injuries      11401
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Uninjured            5912
Weather.Condition          4492
Broad.phase.of.flight     27165
dtype: int64

In [6]:
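# Further pruning of columns with more than 25% of their data missing. Aircraft.Category also exceeds
# that threshold but is kept because its missing values are imputed later (see 2.3)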
df = df.drop(columns=['Schedule', 'Broad.phase.of.flight'])
df.columns

Index(['Investigation.Type', 'Event.Date', 'Location', 'Country',
       'Injury.Severity', 'Aircraft.damage', 'Aircraft.Category', 'Make',
       'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type',
       'Total.Fatal.Injuries', 'Total.Serious.Injuries',
       'Total.Minor.Injuries', 'Total.Uninjured', 'Weather.Condition'],
      dtype='object')
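
The 25% rule above can also be applied programmatically instead of reading the counts off by eye. The following is a minimal sketch on a made-up toy frame, not the notebook's actual code; the names `threshold`, `keep_anyway`, and `missing_share` are hypothetical, and the allow-list reflects the decision to keep Aircraft.Category for later imputation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'Make': ['Cessna', 'Piper', 'Beech', 'Boeing'],
    'Schedule': [np.nan, np.nan, np.nan, 'SCHD'],
    'Aircraft.Category': [np.nan, np.nan, 'Airplane', np.nan],
})

threshold = 0.25                     # drop columns with more than 25% missing
keep_anyway = ['Aircraft.Category']  # hypothetical allow-list: imputed later instead

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean()
too_sparse = missing_share[missing_share > threshold].index.difference(keep_anyway)

toy_pruned = toy.drop(columns=too_sparse)
print(list(toy_pruned.columns))
```

Keeping the exceptions in an explicit list makes it visible which sparse columns were retained on purpose.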

In [7]:
# Renaming columns
new_column_names = {'Investigation.Type': 'Type', 'Event.Date': 'Date', 'Injury.Severity': 'Injury_Severity', 'Aircraft.damage': 'Damage_Type', 
'Number.of.Engines': 'Engines', 'Engine.Type': 'Engine_Type', 'Total.Fatal.Injuries': 'Fatal_Injuries', 'Total.Serious.Injuries': 'Serious_Injuries', 
'Total.Minor.Injuries': 'Minor_Injuries', 'Total.Uninjured': 'Uninjured', 'Weather.Condition': 'Weather', 'Amateur.Built': 'Amateur_Built',
'Aircraft.Category': 'Aircraft_Category'}
df.rename(columns=new_column_names, inplace=True)

In [8]:
df.head()

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather
0,Accident,1948-10-24,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,,Stinson,108-3,No,1.0,Reciprocating,2.0,0.0,0.0,0.0,UNK
1,Accident,1962-07-19,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,,Piper,PA24-180,No,1.0,Reciprocating,4.0,0.0,0.0,0.0,UNK
2,Accident,1974-08-30,"Saltville, VA",United States,Fatal(3),Destroyed,,Cessna,172M,No,1.0,Reciprocating,3.0,,,,IMC
3,Accident,1977-06-19,"EUREKA, CA",United States,Fatal(2),Destroyed,,Rockwell,112,No,1.0,Reciprocating,2.0,0.0,0.0,0.0,IMC
4,Accident,1979-08-02,"Canton, OH",United States,Fatal(1),Destroyed,,Cessna,501,No,,,1.0,2.0,,0.0,VMC


#### 1.1 Injury_Severity pruning

In [9]:
df['Fatal_Injuries'].sum()

50201.0

In [10]:
# Creating a new column 'Fatality' to extract the bracketed number of fatalities from the 'Injury_Severity' column
df['Fatality'] = df['Injury_Severity'].str.extract(r'\((\d+)\)')

df['Fatality'] = pd.to_numeric(df['Fatality'])

df['Fatality'].sum()

35617.0

In [11]:
# Checking if there are rows where Fatal_Injuries contains a 0 and Fatality is bigger than 0
df.loc[(df['Fatal_Injuries'] == 0) & (df['Fatality'] > 0)]

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Fatality


In [12]:
# Checking if there are rows where Fatal_Injuries is bigger than 0 while Fatality is 0
df.loc[(df['Fatal_Injuries'] > 0) & (df['Fatality'] == 0)]

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Fatality


In [13]:
# Checking if there are rows where Fatal_Injuries is null while Fatality is not null
df.loc[(df['Fatal_Injuries'].isna()) & (df['Fatality'].notna())]

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Fatality


In [14]:
# Checking if there are rows where Fatal_Injuries is not null while Fatality is null
df.loc[(df['Fatal_Injuries'].notna()) & (df['Fatality'].isna())]

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Fatality
7,Accident,1982-01-01,"PULLMAN, WA",United States,Non-Fatal,Substantial,Airplane,Cessna,140,No,1.0,Reciprocating,0.0,0.0,0.0,2.0,VMC,
8,Accident,1982-01-01,"EAST HANOVER, NJ",United States,Non-Fatal,Substantial,Airplane,Cessna,401B,No,2.0,Reciprocating,0.0,0.0,0.0,2.0,IMC,
9,Accident,1982-01-01,"JACKSONVILLE, FL",United States,Non-Fatal,Substantial,,North American,NAVION L-17B,No,1.0,Reciprocating,0.0,0.0,3.0,0.0,IMC,
10,Accident,1982-01-01,"HOBBS, NM",United States,Non-Fatal,Substantial,,Piper,PA-28-161,No,1.0,Reciprocating,0.0,0.0,0.0,1.0,VMC,
11,Accident,1982-01-01,"TUSKEGEE, AL",United States,Non-Fatal,Substantial,,Beech,V35B,No,1.0,Reciprocating,0.0,0.0,0.0,1.0,VMC,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,Accident,2022-12-26,"Annapolis, MD",United States,Minor,,,PIPER,PA-28-151,No,,,0.0,1.0,0.0,0.0,,
88885,Accident,2022-12-26,"Hampton, NH",United States,,,,BELLANCA,7ECA,No,,,0.0,0.0,0.0,0.0,,
88886,Accident,2022-12-26,"Payson, AZ",United States,Non-Fatal,Substantial,Airplane,AMERICAN CHAMPION AIRCRAFT,8GCBC,No,1.0,,0.0,0.0,0.0,1.0,VMC,
88887,Accident,2022-12-26,"Morgan, UT",United States,,,,CESSNA,210N,No,,,0.0,0.0,0.0,0.0,,


In [15]:
# Checking if there are rows where Fatal_Injuries is 0 while Fatality is null
df.loc[(df['Fatal_Injuries'] == 0) & (df['Fatality'].isna())]

Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,Amateur_Built,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Fatality
7,Accident,1982-01-01,"PULLMAN, WA",United States,Non-Fatal,Substantial,Airplane,Cessna,140,No,1.0,Reciprocating,0.0,0.0,0.0,2.0,VMC,
8,Accident,1982-01-01,"EAST HANOVER, NJ",United States,Non-Fatal,Substantial,Airplane,Cessna,401B,No,2.0,Reciprocating,0.0,0.0,0.0,2.0,IMC,
9,Accident,1982-01-01,"JACKSONVILLE, FL",United States,Non-Fatal,Substantial,,North American,NAVION L-17B,No,1.0,Reciprocating,0.0,0.0,3.0,0.0,IMC,
10,Accident,1982-01-01,"HOBBS, NM",United States,Non-Fatal,Substantial,,Piper,PA-28-161,No,1.0,Reciprocating,0.0,0.0,0.0,1.0,VMC,
11,Accident,1982-01-01,"TUSKEGEE, AL",United States,Non-Fatal,Substantial,,Beech,V35B,No,1.0,Reciprocating,0.0,0.0,0.0,1.0,VMC,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,Accident,2022-12-26,"Annapolis, MD",United States,Minor,,,PIPER,PA-28-151,No,,,0.0,1.0,0.0,0.0,,
88885,Accident,2022-12-26,"Hampton, NH",United States,,,,BELLANCA,7ECA,No,,,0.0,0.0,0.0,0.0,,
88886,Accident,2022-12-26,"Payson, AZ",United States,Non-Fatal,Substantial,Airplane,AMERICAN CHAMPION AIRCRAFT,8GCBC,No,1.0,,0.0,0.0,0.0,1.0,VMC,
88887,Accident,2022-12-26,"Morgan, UT",United States,,,,CESSNA,210N,No,,,0.0,0.0,0.0,0.0,,


In [16]:
# Making a temporary dataframe in which 'Fatal_Injuries' does not equal 'Fatality'
df_temp = df.loc[(df['Fatal_Injuries']) != (df['Fatality'])]

In [18]:
df_temp['Fatality'].sum()

0.0

In [19]:
df_temp['Fatal_Injuries'].sum()

14584.0

__There is no extra information in the Injury_Severity column that is missing from Fatal_Injuries. In fact, Fatal_Injuries contains more information than the numbers extracted from the Injury_Severity column. The temporary column Fatality can therefore be dropped, and the numbers in brackets can be stripped from the Injury_Severity column so that only "Fatal" remains.__

In [20]:
df = df.drop(columns=['Fatality'])

In [21]:
# Stripping the bracket enclosed numbers from the Injury_Severity column
df['Injury_Severity'] = df['Injury_Severity'].str.replace(r'Fatal\(\d+\)', 'Fatal', regex=True)
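
To see what the two regular expressions in this section do, here is a small sketch on a hand-made Series. The sample values are assumptions chosen to cover the cases seen in the column (a bracketed count, a plain label, and a missing value).

```python
import pandas as pd

severity = pd.Series(['Fatal(2)', 'Non-Fatal', 'Fatal', None, 'Minor'])

# Capture the digits inside the parentheses; rows without a match become NaN
counts = pd.to_numeric(severity.str.extract(r'\((\d+)\)')[0])

# Collapse 'Fatal(n)' to 'Fatal' and leave every other label untouched
labels = severity.str.replace(r'Fatal\(\d+\)', 'Fatal', regex=True)

print(counts.tolist())
print(labels.tolist())
```

Because the pattern requires the literal parentheses, labels such as 'Non-Fatal' pass through unchanged.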

### __2. Dealing with missing values__

In [22]:
df.isna().sum()

Type                     0
Date                     0
Location                52
Country                226
Injury_Severity       1000
Damage_Type           3194
Aircraft_Category    56602
Make                    63
Model                   92
Amateur_Built          102
Engines               6084
Engine_Type           7096
Fatal_Injuries       11401
Serious_Injuries     12510
Minor_Injuries       11933
Uninjured             5912
Weather               4492
dtype: int64

#### 2.1 Replacing all missing values from Fatal_Injuries, Serious_Injuries, Minor_Injuries, Uninjured with 0 and creating a new column Total_Passengers

In [23]:
# Replacing missing values with 0
df['Fatal_Injuries'] = df['Fatal_Injuries'].replace({np.nan: 0})
df['Serious_Injuries'] = df['Serious_Injuries'].replace({np.nan: 0})
df['Minor_Injuries'] = df['Minor_Injuries'].replace({np.nan: 0})
df['Uninjured'] = df['Uninjured'].replace({np.nan: 0})

In [24]:
# Confirming the previous operation succeeded
df.isna().sum()

Type                     0
Date                     0
Location                52
Country                226
Injury_Severity       1000
Damage_Type           3194
Aircraft_Category    56602
Make                    63
Model                   92
Amateur_Built          102
Engines               6084
Engine_Type           7096
Fatal_Injuries           0
Serious_Injuries         0
Minor_Injuries           0
Uninjured                0
Weather               4492
dtype: int64

In [25]:
# Creating a Total_Passengers column
df['Total_Passengers'] = df['Fatal_Injuries'] + df['Serious_Injuries'] + df['Minor_Injuries'] + df['Uninjured']
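
The fill-and-sum steps in this section can be written more compactly by operating on the four count columns as a group. This is only a sketch on toy data; the list name `people_cols` is hypothetical, and it relies on the same assumption as above, namely that a missing count means 0.

```python
import numpy as np
import pandas as pd

people_cols = ['Fatal_Injuries', 'Serious_Injuries', 'Minor_Injuries', 'Uninjured']

toy = pd.DataFrame({
    'Fatal_Injuries': [2.0, np.nan, 0.0],
    'Serious_Injuries': [np.nan, 1.0, 0.0],
    'Minor_Injuries': [0.0, np.nan, np.nan],
    'Uninjured': [np.nan, 3.0, np.nan],
})

# Treat missing counts as 0 (the assumption stated in the to-do list)
toy[people_cols] = toy[people_cols].fillna(0)

# Row-wise sum over the four count columns
toy['Total_Passengers'] = toy[people_cols].sum(axis=1)
print(toy['Total_Passengers'].tolist())
```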

#### 2.2 Dropping rows with null values

In [26]:
# Dropping rows that are missing any of these key fields
df = df.dropna(subset=['Make', 'Model', 'Amateur_Built', 'Country'])

#### 2.3 Imputing missing values in Aircraft_Category

The result of this step is stored in aircraft_category_filled.csv. The imputation was done with the help of ChatGPT, using the prompt "Can you fill in missing data in the Aircraft_Category column based on the type of aircraft that you identify through the Make and Model columns?". ChatGPT was able to impute most of the missing values, leaving 13,796 values empty due to "unclear classifications or missing data".

A closer look at some of these unclear cases showed that they were all airplanes. Usually the Make was a shortened or alternative version of the manufacturer's full name (e.g. Beech instead of Beechcraft), which ChatGPT was unable to match to an aircraft manufacturer. I then brute-force imputed the rest of the missing values as Airplane, because it is by far the most common value (the mode) in Aircraft_Category.
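
For readers who want a reproducible alternative to the ChatGPT step, the same idea can be sketched in pandas: normalize the Make names, look up a category where one is known, and fall back to the mode. This is not how the file above was produced. The alias table `make_aliases`, the lookup `known_categories`, and the toy rows are all hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Make': ['Beech', 'Cessna', 'Bell', 'Piper', 'Acme Aero'],
    'Aircraft_Category': [np.nan, 'Airplane', np.nan, 'Airplane', np.nan],
})

# Hypothetical alias table: short Make names mapped to the full manufacturer name
make_aliases = {'beech': 'beechcraft'}
make_normalized = toy['Make'].str.lower().replace(make_aliases)

# Hypothetical lookup from manufacturer to category; unknown makes map to NaN
known_categories = {'beechcraft': 'Airplane', 'bell': 'Helicopter'}
looked_up = make_normalized.map(known_categories)

# The mode is taken from the values that were present before any filling
most_common = toy['Aircraft_Category'].mode().iloc[0]

# First use the lookup, then fall back to the most frequent category
toy['Aircraft_Category'] = toy['Aircraft_Category'].fillna(looked_up).fillna(most_common)
print(toy)
```

The mode fallback is only reasonable when one category dominates, as Airplane does in this dataset.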

### __3. Let's clean up some rows__

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88678 entries, 0 to 88888
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Type               88678 non-null  object 
 1   Date               88678 non-null  object 
 2   Location           88630 non-null  object 
 3   Country            88453 non-null  object 
 4   Injury_Severity    87699 non-null  object 
 5   Damage_Type        85524 non-null  object 
 6   Aircraft_Category  32226 non-null  object 
 7   Make               88678 non-null  object 
 8   Model              88678 non-null  object 
 9   Amateur_Built      88678 non-null  object 
 10  Engines            82734 non-null  float64
 11  Engine_Type        81731 non-null  object 
 12  Fatal_Injuries     88678 non-null  float64
 13  Serious_Injuries   88678 non-null  float64
 14  Minor_Injuries     88678 non-null  float64
 15  Uninjured          88678 non-null  float64
 16  Weather            84297 no

#### 3.1 Filtering Type and Amateur_Built and Engines

In [29]:
df['Type'].value_counts()

Type
Accident    84881
Incident     3797
Name: count, dtype: int64

In [30]:
# Getting rid of all events labeled 'Incident'
df = df.loc[df['Type'] == 'Accident']

In [31]:
# Eliminating all amateur-built planes; we aren't interested in those
df = df.loc[df['Amateur_Built'] == 'No']

In [32]:
df['Engines'].value_counts()

Engines
1.0    61035
2.0     9324
0.0     1141
4.0      224
3.0      206
8.0        3
6.0        1
Name: count, dtype: int64

In [33]:
# Filtering out aircraft that have no engines and ones that have more than 4. After looking through the aircraft with more than
# 4 engines, I realized that they were all electric airplanes or drones, which are not relevant to our analysis.
# Rows with a missing engine count are dropped as well, because NaN fails both comparisons
df = df.loc[df['Engines'] > 0]
df = df.loc[df['Engines'] <= 4]
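
The two comparisons above can be expressed as a single range check. The sketch below uses a toy Series and assumes engine counts are whole numbers, in which case `between(1, 4)` selects the same rows as `> 0` followed by `<= 4`.

```python
import numpy as np
import pandas as pd

engines = pd.Series([1.0, 2.0, 0.0, 4.0, 8.0, np.nan], name='Engines')

# between() is inclusive on both ends by default, so this keeps 1 to 4 engines
mask = engines.between(1, 4)

# NaN compares as False, so rows with an unknown engine count are dropped too
print(engines[mask].tolist())
```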


#### 3.2 Converting Date column to datetime objects and making a time cutoff

In [34]:
# Converting the Date column to datetime values
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

In [35]:
# Adding Year and Month columns in case we need them for analysis later on
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
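
If some date strings might not follow the expected format, `pd.to_datetime` can be asked to coerce them to NaT so they can be counted rather than raising an error. A small sketch with made-up values (the bad string is an assumption for illustration; the notebook's conversion ran without it):

```python
import pandas as pd

raw_dates = pd.Series(['1948-10-24', '1982-01-01', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().sum())   # how many values failed to parse
print(parsed.dt.year.tolist())
```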

In [40]:
# Checking dataset by setting a time cut-off. The time cut-off operation itself is done later in this notebook
df[df['Year'] >= 1985].shape

(61574, 20)

#### 3.3 Create fraction/percentage rates for fatalities and injuries

See EDA.ipynb for this step.

In [38]:
# Half cleaned dataset
df.to_csv('aviation_data_half_clean.csv')

#### 3.4 Cleaning up duplicates in the 'Make' column

The Make column was cleaned in another notebook stringcleaning.ipynb.

In [42]:
df = pd.read_csv('aircraft_category_filled.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70789 entries, 0 to 70788
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       70789 non-null  int64  
 1   Unnamed: 0         70789 non-null  int64  
 2   Type               70789 non-null  object 
 3   Date               70789 non-null  object 
 4   Location           70771 non-null  object 
 5   Country            70597 non-null  object 
 6   Injury_Severity    70717 non-null  object 
 7   Damage_Type        69942 non-null  object 
 8   Aircraft_Category  70789 non-null  object 
 9   Make               70789 non-null  object 
 10  Model              70789 non-null  object 
 11  Amateur_Built      70789 non-null  object 
 12  Engines            70789 non-null  float64
 13  Engine_Type        68723 non-null  object 
 14  Fatal_Injuries     70789 non-null  float64
 15  Serious_Injuries   70789 non-null  float64
 16  Minor_Injuries     707

#### 3.5 Filtering Aircraft_Category to just airplanes

In [45]:
# Drop all rows that aren't airplanes
df = df.loc[df['Aircraft_Category'] == 'Airplane']
df['Aircraft_Category'].value_counts()

Aircraft_Category
Airplane    64313
Name: count, dtype: int64

In [None]:
# Dropping extra columns that appeared because index=False was not set when saving our dataframes to csv in other steps
df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True)
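
The 'Unnamed' columns come from the default behaviour of `to_csv`, which writes the index as a first column without a header. The sketch below reproduces this on a toy frame using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Make': ['cessna', 'piper']}, index=[10, 20])

# By default to_csv writes the index as an unnamed first column...
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# ...while index=False leaves it out
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Each save and reload without `index=False` adds one more such column, which is why two of them had accumulated here.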

In [53]:
df.drop(columns=['Type', 'Amateur_Built'], inplace=True)

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64313 entries, 0 to 70788
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               64313 non-null  object 
 1   Location           64300 non-null  object 
 2   Country            64195 non-null  object 
 3   Injury_Severity    64249 non-null  object 
 4   Damage_Type        63522 non-null  object 
 5   Aircraft_Category  64313 non-null  object 
 6   Make               64313 non-null  object 
 7   Model              64313 non-null  object 
 8   Engines            64313 non-null  float64
 9   Engine_Type        62554 non-null  object 
 10  Fatal_Injuries     64313 non-null  float64
 11  Serious_Injuries   64313 non-null  float64
 12  Minor_Injuries     64313 non-null  float64
 13  Uninjured          64313 non-null  float64
 14  Weather            63394 non-null  object 
 15  Total_Passengers   64313 non-null  float64
 16  Year               64313 no

#### 3.6 Truncating the dataset further by dropping accidents prior to the year 1990 and accidents that occurred outside of the US

In [69]:
# We decided to drop entries that had a location outside of the US and focus only on accidents that occurred in 1990 or later.
# The location decision is due to our company operating only in the US, and the year cut-off is just an estimate of which data will
# actually be relevant for our analysis. While we did not comb through all of the airplane models in the dataset, it is safe to say
# that some of the models and even manufacturers are no longer in operation today.

df_filtered = df[df['Year'] >= 1990]
df_filtered = df_filtered[df_filtered['Country'] == 'United States']
df_filtered.info()

#df = df[df['Country'] == 'United States']

<class 'pandas.core.frame.DataFrame'>
Index: 43045 entries, 21550 to 70788
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               43045 non-null  object 
 1   Location           43042 non-null  object 
 2   Country            43045 non-null  object 
 3   Injury_Severity    43016 non-null  object 
 4   Damage_Type        42485 non-null  object 
 5   Aircraft_Category  43045 non-null  object 
 6   Make               43045 non-null  object 
 7   Model              43045 non-null  object 
 8   Engines            43045 non-null  float64
 9   Engine_Type        41630 non-null  object 
 10  Fatal_Injuries     43045 non-null  float64
 11  Serious_Injuries   43045 non-null  float64
 12  Minor_Injuries     43045 non-null  float64
 13  Uninjured          43045 non-null  float64
 14  Weather            42687 non-null  object 
 15  Total_Passengers   43045 non-null  float64
 16  Year               4304

In [83]:
df_filtered.shape

(43045, 19)

In [87]:
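# It seems there were entries where the total number of people on board was 0. Instead of manually looking up the reports
# to verify the counts, we simply drop the 47 entries that had nobody on board.
# The mask is built from df_filtered itself so that it shares the index of the frame being filtered
df_filtered = df_filtered.loc[df_filtered['Total_Passengers'] > 0]
df_filtered.shape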


(42998, 19)
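
A boolean mask is safest when it is built from the same frame that is being filtered, so that the mask and the frame share one index. A minimal sketch on toy data (the frame and variable names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1985, 1995, 2001, 2010],
                    'Total_Passengers': [2.0, 0.0, 3.0, 1.0]})

recent = toy[toy['Year'] >= 1990]

# Build the mask from `recent` itself so mask and frame share one index
recent_with_people = recent.loc[recent['Total_Passengers'] > 0]
print(len(recent) - len(recent_with_people))  # rows removed by the filter
```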

In [95]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42998 entries, 21550 to 70788
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               42998 non-null  object 
 1   Location           42995 non-null  object 
 2   Country            42998 non-null  object 
 3   Injury_Severity    42998 non-null  object 
 4   Damage_Type        42440 non-null  object 
 5   Aircraft_Category  42998 non-null  object 
 6   Make               42998 non-null  object 
 7   Model              42998 non-null  object 
 8   Engines            42998 non-null  float64
 9   Engine_Type        41599 non-null  object 
 10  Fatal_Injuries     42998 non-null  float64
 11  Serious_Injuries   42998 non-null  float64
 12  Minor_Injuries     42998 non-null  float64
 13  Uninjured          42998 non-null  float64
 14  Weather            42648 non-null  object 
 15  Total_Passengers   42998 non-null  float64
 16  Year               4299

In [125]:
df_filtered['Make'].value_counts()[:30]

Make
cessna               16719
piper                 9225
beechcraft            3314
grumman                952
mooney                 912
boeing                 735
air tractor            529
aeronca                429
maule                  422
hughes                 386
cirrus                 351
champion               336
stinson                301
luscombe               294
mcdonnell douglas      255
north american         254
rockwell               252
taylorcraft            249
aero commander         247
de havilland           221
schweizer              204
air tractor inc        202
ayres                  147
aerospatiale           146
aviat                  136
airbus                 131
hiller                 129
enstrom                125
waco                   105
gulfstream             102
Name: count, dtype: int64

In [127]:
# Next up, we filter out Makes that have fewer than 100 entries in our dataset
value_counts = df_filtered['Make'].value_counts()

In [128]:
values_to_keep = value_counts[value_counts > 99].index
values_to_keep

Index(['cessna', 'piper', 'beechcraft', 'grumman', 'mooney', 'boeing',
       'air tractor', 'aeronca', 'maule', 'hughes', 'cirrus', 'champion',
       'stinson', 'luscombe', 'mcdonnell douglas', 'north american',
       'rockwell', 'taylorcraft', 'aero commander', 'de havilland',
       'schweizer', 'air tractor inc', 'ayres', 'aerospatiale', 'aviat',
       'airbus', 'hiller', 'enstrom', 'waco', 'gulfstream',
       'ercoupe (eng & research corp.)'],
      dtype='object', name='Make')

In [129]:
filtered_df = df_filtered[df_filtered['Make'].isin(values_to_keep)]
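
The same frequency filter can be written without an intermediate index of names by mapping each row to the size of its Make group. This is a sketch on toy data, with a hypothetical cutoff `min_rows` of 2 in place of the 100 used above.

```python
import pandas as pd

makes = pd.Series(['cessna', 'cessna', 'cessna', 'piper', 'piper', 'waco'], name='Make')
min_rows = 2  # hypothetical cutoff for the toy data; the notebook uses 100

# Map each row to the size of its Make group, then keep rows in large-enough groups
group_size = makes.map(makes.value_counts())
kept = makes[group_size >= min_rows]
print(kept.unique().tolist())
```

Note that a count-based filter treats spelling variants as separate makes, so 'air tractor' and 'air tractor inc' in the list above are counted independently.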

In [130]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37910 entries, 21550 to 70787
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               37910 non-null  object 
 1   Location           37908 non-null  object 
 2   Country            37910 non-null  object 
 3   Injury_Severity    37910 non-null  object 
 4   Damage_Type        37461 non-null  object 
 5   Aircraft_Category  37910 non-null  object 
 6   Make               37910 non-null  object 
 7   Model              37910 non-null  object 
 8   Engines            37910 non-null  float64
 9   Engine_Type        36837 non-null  object 
 10  Fatal_Injuries     37910 non-null  float64
 11  Serious_Injuries   37910 non-null  float64
 12  Minor_Injuries     37910 non-null  float64
 13  Uninjured          37910 non-null  float64
 14  Weather            37649 non-null  object 
 15  Total_Passengers   37910 non-null  float64
 16  Year               3791

In [131]:
filtered_df.to_csv('data_cleaned_final.csv', index=True)

### __4. Surprise data cleaning tasks that emerge during analysis__

In [143]:
df = pd.read_csv('data_cleaned_final.csv')

In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37910 entries, 0 to 37909
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               37910 non-null  object 
 1   Location           37908 non-null  object 
 2   Country            37910 non-null  object 
 3   Injury_Severity    37910 non-null  object 
 4   Damage_Type        37461 non-null  object 
 5   Aircraft_Category  37910 non-null  object 
 6   Make               37910 non-null  object 
 7   Engines            37910 non-null  float64
 8   Engine_Type        36837 non-null  object 
 9   Fatal_Injuries     37910 non-null  float64
 10  Serious_Injuries   37910 non-null  float64
 11  Minor_Injuries     37910 non-null  float64
 12  Uninjured          37910 non-null  float64
 13  Weather            37649 non-null  object 
 14  Total_Passengers   37910 non-null  float64
 15  Year               37910 non-null  int64  
 16  Month              379

#### 4.1 Cleaning up the Model column

This clean-up was done in stringcleaning.ipynb; see that notebook for the details.

In [147]:
df = df.rename(columns={'Model_final': 'Model'})

#### 4.2 Miscellaneous clean-up tasks

In [149]:
# Dropping rows with missing values in Damage_Type
df = df.dropna(subset=['Damage_Type'])

In [151]:
# Data cleaning should now be complete, but of course this hubris will come back to bite me
df.to_csv('data_cleaned_final.csv', index=False)

#### We were in fact not done with data cleaning. Next up, EDA.ipynb.