# String cleaning of the 'Make' and 'Model' columns

There is no strong reason why the string cleaning is done here rather than in the main Dataclean notebook. At the time, Dataclean felt too long, and string cleaning seemed like a separate enough task to get its own notebook.

In [191]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

# Cleaning the Make column

In hindsight, it would have been smart to write a function for some of the tasks in this notebook. We live and we learn.

In [147]:
df = pd.read_csv('aircraft_category_filled.csv')
df

Unnamed: 0.1,Unnamed: 0,Type,Date,Location,Country,Injury_Severity,Damage_Type,Aircraft_Category,Make,Model,...,Engines,Engine_Type,Fatal_Injuries,Serious_Injuries,Minor_Injuries,Uninjured,Weather,Total_Passengers,Year,Month
0,0,Accident,1948-10-24,"MOOSE CREEK, ID",United States,Fatal,Destroyed,Airplane,Stinson,108-3,...,1.0,Reciprocating,2.0,0.0,0.0,0.0,UNK,2.0,1948,10
1,1,Accident,1962-07-19,"BRIDGEPORT, CA",United States,Fatal,Destroyed,Airplane,Piper,PA24-180,...,1.0,Reciprocating,4.0,0.0,0.0,0.0,UNK,4.0,1962,7
2,2,Accident,1974-08-30,"Saltville, VA",United States,Fatal,Destroyed,Airplane,Cessna,172M,...,1.0,Reciprocating,3.0,0.0,0.0,0.0,IMC,3.0,1974,8
3,3,Accident,1977-06-19,"EUREKA, CA",United States,Fatal,Destroyed,Airplane,Rockwell,112,...,1.0,Reciprocating,2.0,0.0,0.0,0.0,IMC,2.0,1977,6
4,5,Accident,1979-09-17,"BOSTON, MA",United States,Non-Fatal,Substantial,Airplane,Mcdonnell Douglas,DC9,...,2.0,Turbo Fan,0.0,0.0,1.0,44.0,VMC,45.0,1979,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70784,88869,Accident,2022-12-13,"Lewistown, MT",United States,Non-Fatal,Substantial,Airplane,PIPER,PA42,...,2.0,,0.0,0.0,0.0,1.0,,1.0,2022,12
70785,88873,Accident,2022-12-14,"San Juan, PR",United States,Non-Fatal,Substantial,Airplane,CIRRUS DESIGN CORP,SR22,...,1.0,,0.0,0.0,0.0,1.0,VMC,1.0,2022,12
70786,88876,Accident,2022-12-15,"Wichita, KS",United States,Non-Fatal,Substantial,Airplane,SWEARINGEN,SA226TC,...,2.0,,0.0,0.0,0.0,1.0,,1.0,2022,12
70787,88877,Accident,2022-12-16,"Brooksville, FL",United States,Minor,Substantial,Airplane,CESSNA,R172K,...,1.0,,0.0,1.0,0.0,0.0,VMC,1.0,2022,12


In [148]:
df.isna().sum()

Unnamed: 0              0
Type                    0
Date                    0
Location               18
Country               192
Injury_Severity        72
Damage_Type           847
Aircraft_Category       0
Make                    0
Model                   0
Amateur_Built           0
Engines                 0
Engine_Type          2066
Fatal_Injuries          0
Serious_Injuries        0
Minor_Injuries          0
Uninjured               0
Weather              1113
Total_Passengers        0
Year                    0
Month                   0
dtype: int64

In [149]:
df['Aircraft_Category'].value_counts()

Aircraft_Category
Airplane             64313
Helicopter            6113
Weight-Shift           132
Powered Parachute       81
Glider                  75
Gyrocraft               30
Ultralight              14
Balloon                 14
WSFT                     9
Blimp                    4
Powered-Lift             2
Rocket                   1
Unknown                  1
Name: count, dtype: int64

### Cleaning up the Make column

In [150]:
df['Make'].value_counts()

Make
Cessna                     21408
Piper                      11565
CESSNA                      4372
Beech                       4017
PIPER                       2577
                           ...  
Golden Circle                  1
FUJI                           1
PALEN                          1
BUTLER AIRCRAFT COMPANY        1
ORLICAN S R O                  1
Name: count, Length: 1994, dtype: int64

In [151]:
df['Make'] = df['Make'].str.lower()

In [152]:
df['Make'].value_counts()

Make
cessna                    25780
piper                     14142
beech                      4905
bell                       2340
mooney                     1272
                          ...  
gt ultralights                1
evektor aerotechnik as        1
lyons                         1
wheat                         1
orlican s r o                 1
Name: count, Length: 1632, dtype: int64

### Manually cleaning out duplicates

In [153]:
# This cell and the many cells below consolidate variant spellings in the 'Make' column. The method is neither efficient nor fully accurate:
# any value containing the substring is overwritten, so models produced jointly by two or more partner companies cannot be told apart.
# Instead, such names are combined and grouped under umbrellas like 'cessna' or 'beechcraft'.

df.loc[df['Make'].str.contains('cessna', na=False), 'Make'] = 'cessna'
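The same substring-based consolidation can be written once as a function and driven by a mapping. The following is a minimal sketch on toy data, not the code this notebook actually ran: `consolidate_makes`, `MAKE_PATTERNS`, and the sample values are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical mapping: substring to look for -> canonical name to assign.
MAKE_PATTERNS = {
    "cessna": "cessna",
    "beech": "beechcraft",
    "vodochody": "aero vodochody",
}

def consolidate_makes(makes: pd.Series, patterns: dict) -> pd.Series:
    """Return a cleaned copy where any value containing a key becomes its canonical name."""
    result = makes.str.lower().str.strip()
    for substring, canonical in patterns.items():
        # regex=False matches the substring literally; na=False leaves missing values unmatched.
        mask = result.str.contains(substring, regex=False, na=False)
        result = result.mask(mask, canonical)
    return result

# Toy data standing in for df['Make'].
toy = pd.Series([
    "Cessna", "CESSNA AIRCRAFT CO", "Beech",
    "Hawker Beechcraft", "Aero Vodochody", "Bell",
])
cleaned = consolidate_makes(toy, MAKE_PATTERNS)
print(cleaned.value_counts())
```

Patterns are applied in dictionary order, so a later pattern can overwrite an earlier one if a value contains both substrings. Short substrings need care: `'bell'` would also match `'bellanca'`.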

In [154]:
df['Make'].value_counts()

Make
cessna                    25835
piper                     14142
beech                      4905
bell                       2340
mooney                     1272
                          ...  
collard                       1
gt ultralights                1
evektor aerotechnik as        1
lyons                         1
orlican s r o                 1
Name: count, Length: 1620, dtype: int64

In [155]:
df.loc[df['Make'].str.contains('beech', na=False), 'Make'] = 'beechcraft'


In [156]:
df['Make'].value_counts()

Make
cessna                            25835
piper                             14142
beechcraft                         4978
bell                               2340
mooney                             1272
                                  ...  
bell-olympic helicopters, inc.        1
fisher michael h                      1
antares us                            1
sportflight international llc         1
orlican s r o                         1
Name: count, Length: 1608, dtype: int64

In [157]:
df.loc[df['Make'].str.contains('boeing', na=False), 'Make'] = 'boeing'


In [158]:
df['Make'].value_counts()

Make
cessna                    25835
piper                     14142
beechcraft                 4978
bell                       2340
mooney                     1272
                          ...  
cubcrafter                    1
collard                       1
gt ultralights                1
evektor aerotechnik as        1
orlican s r o                 1
Name: count, Length: 1597, dtype: int64

In [159]:
df.loc[df['Make'].str.contains('airbus', na=False), 'Make'] = 'airbus'

In [160]:
df['Make'].value_counts()

Make
cessna            25835
piper             14142
beechcraft         4978
bell               2340
mooney             1272
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1591, dtype: int64

In [161]:
df.loc[df['Make'].str.contains('piper', na=False), 'Make'] = 'piper'

In [162]:
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
mooney                           1272
                                ...  
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
collard                             1
orlican s r o                       1
Name: count, Length: 1575, dtype: int64

In [164]:
df.loc[df['Make'].str.contains('mooney', na=False), 'Make'] = 'mooney'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
mooney             1321
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1568, dtype: int64

In [166]:
df.loc[df['Make'].str.contains('cirrus', na=False), 'Make'] = 'cirrus'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
mooney             1321
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1564, dtype: int64

In [167]:
df.loc[df['Make'].str.contains('robinson', na=False), 'Make'] = 'robinson'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
robinson           1442
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1558, dtype: int64

In [168]:
df.loc[df['Make'].str.contains('rockwell', na=False), 'Make'] = 'rockwell'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
robinson           1442
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1550, dtype: int64

In [169]:
df.loc[df['Make'].str.contains('grumman', na=False), 'Make'] = 'grumman'
df['Make'].value_counts()

Make
cessna                            25835
piper                             14213
beechcraft                         4978
bell                               2340
grumman                            1597
                                  ...  
bell-olympic helicopters, inc.        1
dart                                  1
kociemba robert h                     1
engle david                           1
orlican s r o                         1
Name: count, Length: 1537, dtype: int64

In [170]:
df.loc[df['Make'].str.contains('bellanca', na=False), 'Make'] = 'bellanca'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
grumman            1597
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1535, dtype: int64

In [171]:
df.loc[df['Make'].str.contains('raytheon', na=False), 'Make'] = 'raytheon'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
grumman            1597
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1533, dtype: int64

In [172]:
df.loc[df['Make'].str.contains('bombardier', na=False), 'Make'] = 'bombardier'
df['Make'].value_counts()

Make
cessna                    25835
piper                     14213
beechcraft                 4978
bell                       2340
grumman                    1597
                          ...  
cubcrafter                    1
collard                       1
gt ultralights                1
evektor aerotechnik as        1
orlican s r o                 1
Name: count, Length: 1529, dtype: int64

In [173]:
df.loc[df['Make'].str.contains('embraer', na=False), 'Make'] = 'embraer'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
collard                             1
orlican s r o                       1
Name: count, Length: 1523, dtype: int64

In [174]:
df.loc[df['Make'].str.contains('lockheed', na=False), 'Make'] = 'lockheed'
df['Make'].value_counts()

Make
cessna                25835
piper                 14213
beechcraft             4978
bell                   2340
grumman                1597
                      ...  
dart                      1
ultralight soaring        1
kociemba robert h         1
engle david               1
orlican s r o             1
Name: count, Length: 1522, dtype: int64

In [175]:
df.loc[df['Make'].str.contains('dassault', na=False), 'Make'] = 'dassault'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
collard                             1
orlican s r o                       1
Name: count, Length: 1518, dtype: int64

In [176]:
df.loc[df['Make'].str.contains('honda', na=False), 'Make'] = 'honda'
df['Make'].value_counts()

Make
cessna                            25835
piper                             14213
beechcraft                         4978
bell                               2340
grumman                            1597
                                  ...  
bell-olympic helicopters, inc.        1
dart                                  1
kociemba robert h                     1
engle david                           1
orlican s r o                         1
Name: count, Length: 1516, dtype: int64

In [177]:
df.loc[df['Make'].str.contains('bell textron', na=False), 'Make'] = 'bell textron'
df['Make'].value_counts()

Make
cessna                            25835
piper                             14213
beechcraft                         4978
bell                               2340
grumman                            1597
                                  ...  
bell-olympic helicopters, inc.        1
dart                                  1
kociemba robert h                     1
engle david                           1
orlican s r o                         1
Name: count, Length: 1516, dtype: int64
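The number of distinct makes stays at 1516 after the `'bell textron'` step, which suggests that pattern merged nothing. Before overwriting values, it can help to preview what a substring actually matches. This is a small sketch on toy data; `preview_matches` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def preview_matches(makes: pd.Series, substring: str) -> pd.Series:
    """Count the distinct values containing `substring`, without modifying anything."""
    mask = makes.str.contains(substring, regex=False, na=False)
    return makes[mask].value_counts()

# Toy data standing in for df['Make'].
toy = pd.Series(["bell", "bell", "bellanca", "bell helicopter textron", "piper"])

# No value contains this exact substring, so the result is empty.
print(preview_matches(toy, "bell textron"))

# A shorter substring matches more than intended, including 'bellanca'.
print(preview_matches(toy, "bell"))
```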

In [178]:
df.loc[df['Make'].str.contains('saab', na=False), 'Make'] = 'saab'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
collard                             1
orlican s r o                       1
Name: count, Length: 1514, dtype: int64

In [179]:
df.loc[df['Make'].str.contains('vodochody', na=False), 'Make'] = 'aero vodochody'
df['Make'].value_counts()

Make
cessna            25835
piper             14213
beechcraft         4978
bell               2340
grumman            1597
                  ...  
bristol               1
cubcrafter            1
collard               1
gt ultralights        1
orlican s r o         1
Name: count, Length: 1511, dtype: int64

In [180]:
df.loc[df['Make'].str.contains('pilatus', na=False), 'Make'] = 'pilatus'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
collard                             1
orlican s r o                       1
Name: count, Length: 1508, dtype: int64

In [181]:
df.loc[df['Make'].str.contains('piaggio', na=False), 'Make'] = 'piaggio'
df['Make'].value_counts()

Make
cessna                 25835
piper                  14213
beechcraft              4978
bell                    2340
grumman                 1597
                       ...  
ultralight soaring         1
dart                       1
scheibe flugzeugbau        1
kociemba robert h          1
orlican s r o              1
Name: count, Length: 1505, dtype: int64

In [182]:
df.loc[df['Make'].str.contains('gulfstream', na=False), 'Make'] = 'gulfstream'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
wsl pzl                             1
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
orlican s r o                       1
Name: count, Length: 1494, dtype: int64

In [183]:
df.loc[df['Make'].str.contains('mcdonnell', na=False), 'Make'] = 'mcdonnell douglas'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
wsl pzl                             1
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
orlican s r o                       1
Name: count, Length: 1487, dtype: int64

In [184]:
df.loc[df['Make'].str.contains('fairchild', na=False), 'Make'] = 'fairchild'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
wsl pzl                             1
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
orlican s r o                       1
Name: count, Length: 1481, dtype: int64

In [185]:
df.loc[df['Make'].str.contains('israel', na=False), 'Make'] = 'israel aerospace industries'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
wsl pzl                             1
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
orlican s r o                       1
Name: count, Length: 1480, dtype: int64

In [186]:
df.loc[df['Make'].str.contains('convair', na=False), 'Make'] = 'convair'
df['Make'].value_counts()

Make
cessna                          25835
piper                           14213
beechcraft                       4978
bell                             2340
grumman                          1597
                                ...  
wsl pzl                             1
eclipse aviation corporation        1
bristol                             1
cubcrafter                          1
orlican s r o                       1
Name: count, Length: 1479, dtype: int64

In [190]:
# index=False avoids writing the row index as an extra 'Unnamed: 0' column
df.to_csv('aircraft_category_filled.csv', index=False)
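The `Unnamed: 0` and `Unnamed: 0.1` columns visible when the CSV is loaded at the top of this notebook are what pandas produces when a frame is written with its index and read back as ordinary columns. A minimal sketch of that round trip, using an in-memory buffer and toy data rather than the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Make": ["cessna", "piper"]})

# Writing with the index and reading back turns the index into an unnamed column.
buf = io.StringIO()
toy.to_csv(buf, index=True)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# Writing without the index keeps the columns unchanged on reload.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(list(without_index.columns))
```

Each save with the index followed by a reload adds one more such column, which is why two of them can accumulate.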

# Cleaning the Model column

### Another manual clean-up job

In [207]:
df = pd.read_csv('data_cleaned_mostly_clean.csv')

In [208]:
# Convert all to upper case and remove leading and trailing spaces
df['Model_cleaned'] = df['Model'].str.upper().str.strip()

In [209]:
# Replace dashes with spaces
df['Model_cleaned'] = df['Model_cleaned'].str.replace('-', ' ', regex=False)

In [210]:
# Remove extra spaces by replacing multiple spaces with single space
df['Model_cleaned'] = df['Model_cleaned'].str.replace(r'\s+', ' ', regex=True).str.strip()
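The three steps above (upper-case, dashes to spaces, collapse whitespace) can be chained into one helper. This is a sketch on toy data; `normalize_model` and the sample model strings are hypothetical names used only for illustration.

```python
import pandas as pd

def normalize_model(models: pd.Series) -> pd.Series:
    """Upper-case, turn dashes into spaces, collapse runs of whitespace, and trim."""
    return (
        models.str.upper()
        .str.replace("-", " ", regex=False)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

# Toy data standing in for df['Model'].
toy = pd.Series(["pa-28-140", " PA28  140 ", "172m", "PA 28 140"])
normalized = normalize_model(toy)
print(normalized.tolist())
```

Spellings that differ only in where a space or dash appears, such as `PA28 140` and `PA 28 140`, remain distinct after this cleaning, so some duplicates will survive.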

In [211]:
# Put the final result into a Model_final column
df['Model_final'] = df['Model_cleaned']

In [212]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37910 entries, 0 to 37909
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         37910 non-null  int64  
 1   Date               37910 non-null  object 
 2   Location           37908 non-null  object 
 3   Country            37910 non-null  object 
 4   Injury_Severity    37910 non-null  object 
 5   Damage_Type        37461 non-null  object 
 6   Aircraft_Category  37910 non-null  object 
 7   Make               37910 non-null  object 
 8   Model              37910 non-null  object 
 9   Engines            37910 non-null  float64
 10  Engine_Type        36837 non-null  object 
 11  Fatal_Injuries     37910 non-null  float64
 12  Serious_Injuries   37910 non-null  float64
 13  Minor_Injuries     37910 non-null  float64
 14  Uninjured          37910 non-null  float64
 15  Weather            37649 non-null  object 
 16  Total_Passengers   379

In [215]:
# Drop 'Unnamed: 0' (left over from saving without index=False in to_csv) and the intermediate 'Model' and 'Model_cleaned' columns
df = df.drop(columns=['Unnamed: 0', 'Model', 'Model_cleaned'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37910 entries, 0 to 37909
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date               37910 non-null  object 
 1   Location           37908 non-null  object 
 2   Country            37910 non-null  object 
 3   Injury_Severity    37910 non-null  object 
 4   Damage_Type        37461 non-null  object 
 5   Aircraft_Category  37910 non-null  object 
 6   Make               37910 non-null  object 
 7   Engines            37910 non-null  float64
 8   Engine_Type        36837 non-null  object 
 9   Fatal_Injuries     37910 non-null  float64
 10  Serious_Injuries   37910 non-null  float64
 11  Minor_Injuries     37910 non-null  float64
 12  Uninjured          37910 non-null  float64
 13  Weather            37649 non-null  object 
 14  Total_Passengers   37910 non-null  float64
 15  Year               37910 non-null  int64  
 16  Month              379

In [216]:
# Save to data_cleaned_final.csv 
df.to_csv('data_cleaned_final.csv', index=False)