## Introduction

This is a basic data cleaning notebook for the aircraft pricing dataset. The data was scraped at the beginning of July 2020 using the Scrapy framework for Python.

I created this notebook mainly for practice and to improve my skills using various numpy, pandas, and regex methods/functions. This notebook is great for beginners as I explain my logic every step of the way. Feel free to reach out with any questions or suggestions. 

Let's get started!

# DATA CLEANING - Aircraft Pricing Dataset

In [None]:
# Import necessary libraries

import numpy as np
import pandas as pd
import re

pd.options.mode.chained_assignment = None  # Silence pandas' SettingWithCopyWarning

In [None]:
# Import dataset

df = pd.read_csv('/kaggle/input/used-aircraft-pricing/aircraft_data.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

There is a lot of missing data. 

Keep in mind that the dataset was scraped from 2 different websites. The following columns - 'Engine 1 Hours', 'Engine 2 Hours', 'Prop 1 Hours', 'Prop 2 Hours', 'Total Seats' and 'Flight Rules' - were only available for some aircraft and on only one of the sites scraped, hence the missing data. 

However, we also have missing data in the following columns - 'Condition', 'Currency', 'Location', 'Total Hours' and 'National Origin'.
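
As a quick sketch of how the missing data can be summarised per column, here is a small example. The `toy` frame and its values are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    'Make': ['CESSNA', 'PIPER', 'BEECHCRAFT'],
    'Condition': ['Used', np.nan, np.nan],
    'Total Seats': [np.nan, np.nan, 4.0],
})

# Count and percentage of missing values per column, worst first.
missing = toy.isnull().sum().to_frame('missing')
missing['percent'] = (missing['missing'] / len(toy) * 100).round(1)
missing = missing.sort_values('missing', ascending=False)
print(missing)
```

Running the same three lines on `df` instead of `toy` should give a compact view of which columns need the most attention.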

Let's dive in and clean the data one column at a time, starting with the easiest columns first. 

## Make - column

In [None]:
# Luckily, there isn't any missing data here.

df['Make'].isnull().sum()

In [None]:
df['Make'].value_counts()[:30]

In [None]:
df['Make'].nunique()

The 'Make' data seems fine at first glance. There are 187 different makes of aircraft throughout the dataset, Cessna being the most popular. Let's make all of the entries uppercase to stay consistent. 

In [None]:
print('There are a total of {}/{} uppercase rows in this column'.format((df['Make'].str.isupper().sum()), (len(df))))
df['Make'] = df['Make'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['Make'].str.isupper().sum()), (len(df))))
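
The same print / upper / print pattern comes up for several columns in this notebook. A small helper could wrap it; `uppercase_column` is a hypothetical name I'm introducing here, and the toy frame is only for illustration.

```python
import pandas as pd

def uppercase_column(frame, column):
    """Uppercase a text column in place and return the (before, after) uppercase counts."""
    before = int(frame[column].str.isupper().sum())
    frame[column] = frame[column].str.upper()
    after = int(frame[column].str.isupper().sum())
    return before, after

# Toy data, not from the real dataset.
toy = pd.DataFrame({'Make': ['Cessna', 'PIPER', 'Beechcraft']})
before, after = uppercase_column(toy, 'Make')
print('Uppercase rows: {} -> {} of {}'.format(before, after, len(toy)))
```

Entries without any letters (for example purely numeric model names) are never counted by `str.isupper()`, so the "after" count can stay below the number of rows.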

## Model - column

In [None]:
# No missing data.

df['Model'].isnull().sum()

In [None]:
df['Model'].value_counts()

In [None]:
# Applying upper method to the Model column. Some aircraft models are strictly numbers, which is why not all have been converted
# to uppercase. 

print('There are a total of {}/{} uppercase rows in this column'.format((df['Model'].str.isupper().sum()), (len(df))))
df['Model'] = df['Model'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['Model'].str.isupper().sum()), (len(df))))

In [None]:
# There are a total of 962 different models of aircraft in our dataset.

df['Model'].nunique()

In [None]:
df['Model'].value_counts()

## National Origin - column

In [None]:
df['National Origin'].isnull().sum()

In [None]:
# Filter for rows that have NaN values. 

df.loc[df['National Origin'].isnull()]

1. [MICCO](http://aso.com/seller/5487/old/micco.htm) - A quick Google search revealed that this is a US company.
2. [STAUDACHER](https://alumni.msu.edu/stay-informed/magazine/article.cfm?id=253) - US manufacturer. 
3. [HOMEBUILT](https://en.wikipedia.org/wiki/Homebuilt_aircraft) - Homebuilt aircraft can vary significantly from one another. The specs depend on the person building the aircraft. I'm going to drop all 'HOMEBUILT' aircraft since they aren't built by a specific aircraft manufacturer. 
4. MINX - Limited information available on this manufacturer.
5. BONSALL DC - Limited information available on this manufacturer.

##### MICCO - change origin to US

In [None]:
df.loc[137, ['National Origin']] = df.loc[137, ['National Origin']].replace(np.nan, 'United States')
df.loc[137]

##### STAUDACHER - change origin to US

In [None]:
df.loc[711, ['National Origin']] = df.loc[711, ['National Origin']].replace(np.nan, 'United States')
df.loc[711]
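
Hard-coded index labels like 137 and 711 are easy to mistype. One possible alternative, sketched on a toy frame with made-up rows, is to select the rows by manufacturer name instead:

```python
import numpy as np
import pandas as pd

# Toy frame, not the real dataset.
toy = pd.DataFrame({
    'Make': ['MICCO', 'STAUDACHER', 'CESSNA'],
    'National Origin': [np.nan, np.nan, 'United States'],
})

# Select rows by manufacturer instead of by a hard-coded index label.
us_makes = ['MICCO', 'STAUDACHER']
mask = toy['Make'].isin(us_makes) & toy['National Origin'].isnull()
toy.loc[mask, 'National Origin'] = 'United States'
print(toy)
```

This only fills rows that are still missing an origin, so existing values are left alone.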

##### HOMEBUILT - drop all

In [None]:
#Confirm that all 4 were dropped

print(len(df.loc[df['Make'] == 'HOMEBUILT']))
df = df.drop(df.loc[df['Make'] == 'HOMEBUILT'].index)
print(len(df.loc[df['Make'] == 'HOMEBUILT']))

##### MINX & BONSALL DC - drop both

In [None]:
# There is only 1 listing for each manufacturer. We can go ahead and drop both as not having them shouldn't 
# notably impact our dataset. 

print(len(df.loc[df['Make'] == 'MINX']))
print(len(df.loc[df['Make'] == 'BONSALL DC']))
df = df.drop(df.loc[df['Make'] == 'MINX'].index)
df = df.drop(df.loc[df['Make'] == 'BONSALL DC'].index)
print(len(df.loc[df['Make'] == 'MINX']))
print(len(df.loc[df['Make'] == 'BONSALL DC']))

In [None]:
# Confirm that the right number of rows were dropped: the length below should be 6 less than
# the row count reported by df.shape at the top of the notebook.

print('Length of dataset after dropping 6 rows: {}'.format(len(df)))
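
For a before/after comparison to be meaningful, the "before" length has to be captured prior to the drop. A minimal sketch on a toy frame (the rows are made up):

```python
import pandas as pd

# Toy frame, not the real dataset.
toy = pd.DataFrame({'Make': ['HOMEBUILT', 'CESSNA', 'MINX', 'PIPER', 'BONSALL DC']})

rows_before = len(toy)  # capture the length BEFORE dropping
toy = toy[~toy['Make'].isin(['HOMEBUILT', 'MINX', 'BONSALL DC'])]
rows_after = len(toy)

print('Rows before: {}, after: {}, dropped: {}'.format(rows_before, rows_after, rows_before - rows_after))
```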

In [None]:
# Similar to the other columns, apply the upper method to stay consistent.

print('There are a total of {}/{} uppercase rows in this column'.format((df['National Origin'].str.isupper().sum()), (len(df))))
df['National Origin'] = df['National Origin'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['National Origin'].str.isupper().sum()), (len(df))))

In [None]:
df['National Origin'].value_counts()

When I first went through this dataset, I didn't notice that Switzerland is listed twice above: once spelled correctly and once as 'SWTIZERLAND'. This is clearly a typo, so we need to correct the spelling.

In [None]:
# Find the index value of the row that has the incorrect spelling of Switzerland

df.loc[df['National Origin'] == 'SWTIZERLAND']

In [None]:
# Update and confirm the results

df.loc[186, 'National Origin'] = df.loc[186, 'National Origin'].replace('SWTIZERLAND', 'SWITZERLAND')
df['National Origin'].value_counts()

In [None]:
# According to our dataset, most aircraft listings are made in the United States.

print('Aircraft found in the dataset are manufactured in {} different countries.'.format(df['National Origin'].nunique()))
print('')
print(df['National Origin'].value_counts())

##### Lastly, I'm going to rename this column to 'Country of Origin'.

In [None]:
df = df.rename(columns={'National Origin': 'Country of Origin'})

In [None]:
df.head()

## Category - column

In [None]:
# No missing data 

df['Category'].isnull().sum()

In [None]:
df['Category'].value_counts()

As mentioned previously, this data was scraped from two different websites and therefore isn't consistent. I'm going to merge a few of the categories because they are duplicates. 

1. Single Engine Piston = Single Piston 
2. Multi Engine Piston = Twin Piston
3. Turboprop = Turboprops

In [None]:
df['Category'] = np.where((df['Category'] == 'Single Piston'), 'Single Engine Piston', df['Category'])
df['Category'] = np.where((df['Category'] == 'Twin Piston'), 'Multi Engine Piston', df['Category'])
df['Category'] = np.where((df['Category'] == 'Turboprops'), 'Turboprop', df['Category'])

In [None]:
# Amend 'Gliders | Sailplanes' to 'Gliders/Sailplanes' to stay consistent with the 'Military/Classic/Vintage' name format.

df['Category'] = np.where((df['Category'] == 'Gliders | Sailplanes'), 'Gliders/Sailplanes', df['Category'])
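
The four `np.where` calls above could also be written as a single lookup table. This is just a sketch on a toy frame; 'Jets' is a made-up extra label to show that unmapped values pass through unchanged.

```python
import pandas as pd

# Toy frame; 'Jets' is a hypothetical label used only for illustration.
toy = pd.DataFrame({'Category': ['Single Piston', 'Twin Piston', 'Turboprops',
                                 'Gliders | Sailplanes', 'Jets']})

# One mapping from duplicate label -> canonical label.
category_map = {
    'Single Piston': 'Single Engine Piston',
    'Twin Piston': 'Multi Engine Piston',
    'Turboprops': 'Turboprop',
    'Gliders | Sailplanes': 'Gliders/Sailplanes',
}

# Series.replace with a dict swaps exact matches and leaves every other value untouched.
toy['Category'] = toy['Category'].replace(category_map)
print(toy['Category'].tolist())
```

Keeping the mapping in one dict makes it easier to review which labels were merged.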

In [None]:
# Ensure the changes have been applied

df['Category'].value_counts()

In [None]:
# Similar to the rest of the columns, apply the upper method to stay consistent.

print('There are a total of {}/{} uppercase rows in this column'.format((df['Category'].str.isupper().sum()), (len(df))))
df['Category'] = df['Category'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['Category'].str.isupper().sum()), (len(df))))

In [None]:
# Let's see what the dataframe looks like after making various changes to the Category, 
# Make, Model, and Country of Origin columns

df.head()

## Year - column

In [None]:
# No missing data in this column

df['Year'].isnull().sum()

In [None]:
df['Year'].value_counts()

In [None]:
# The data is for 94 different years
df['Year'].nunique()

In [None]:
df['Year'].unique()

Let's take a closer look at the 'Not Listed' and '-' entries. 

In [None]:
print("There are {} rows that have 'Not Listed' entered in the year column.".format(len(df[df['Year'] == 'Not Listed'])))
print('')
df[df['Year'] == 'Not Listed'].head()

In [None]:
print("There are {} rows that have '-' entered in the year column.".format(len(df[df['Year'] == '-'])))
print('')
df[df['Year'] == '-'].head()

In [None]:
# Drop the rows that don't have an actual year. Total rows to drop = 34 + 44 = 78 

print('Length of dataset prior to dropping rows with missing data: {}'.format(len(df)))
df = df.drop(df.loc[df['Year'] == 'Not Listed'].index)
df = df.drop(df.loc[df['Year'] == '-'].index)
print('Length of dataset after dropping rows with missing data: {}'.format(len(df)))

In [None]:
# Ensure the right amount of rows were dropped

2524-2446

In [None]:
# Looks good.

df['Year'].unique()

In [None]:
# Last step is to convert the 'Year' column to type = integer

df['Year'] = df['Year'].astype(np.int64)
df['Year'].dtype
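
An alternative I could have used here is `pd.to_numeric` with `errors='coerce'`, which turns every non-numeric placeholder into NaN in one step. A sketch on toy values:

```python
import pandas as pd

# Toy values mimicking the placeholders seen in the Year column.
toy = pd.DataFrame({'Year': ['1978', 'Not Listed', '2019', '-']})

# Anything that is not a number becomes NaN, so no placeholder needs to be named explicitly.
toy['Year'] = pd.to_numeric(toy['Year'], errors='coerce')

# Drop the NaN rows and cast the remaining years to integers.
toy = toy.dropna(subset=['Year']).astype({'Year': 'int64'})
print(toy)
```

The trade-off is that unexpected placeholders are dropped silently, so it is still worth checking `unique()` first, as done above.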

## Total Hours - column

In [None]:
df.head()

In [None]:
df['Total Hours'].isnull().sum()

In [None]:
df['Total Hours'].value_counts(dropna=False)

##### First, address the 'NaN', '-', and '0' hours. Convert 'NaN' and '-' to 0 hours.

In [None]:
len(df.loc[df['Total Hours'].isnull()])

In [None]:
print("Total number of NaN: {} and '-': {} BEFORE converting to 0.".format((len(df.loc[df['Total Hours'].isnull()])), len(df.loc[df['Total Hours'] == '-'])))

df['Total Hours'] = np.where((df['Total Hours'] == '-'), 0, df['Total Hours'])
df['Total Hours'] = np.where((df['Total Hours'].isnull()), 0, df['Total Hours'])
df['Total Hours'] = np.where((df['Total Hours'] == '0'), 0, df['Total Hours'])

print("Total number of NaN: {} and '-': {} AFTER converting to 0.".format((len(df.loc[df['Total Hours'].isnull()])), len(df.loc[df['Total Hours'] == '-'])))

In [None]:
# The data is very messy. There are letters, commas, colons, and periods within the data. Let's clean it up.

df['Total Hours'].value_counts()[-20:]

In [None]:
# Let's filter for rows that aren't pure digits. 

df_messy_hours = df.loc[~df['Total Hours'].astype(str).str.isdigit()]
print(len(df_messy_hours))
df_messy_hours

In [None]:
# Search for rows that have 'h' or 'H' within them.

contains_h = df_messy_hours[df_messy_hours['Total Hours'].str.contains('h')]
print(len(contains_h))
contains_h.head()

In [None]:
contains_H = df_messy_hours[df_messy_hours['Total Hours'].str.contains('H')]
print(len(contains_H))
contains_H.tail()

In [None]:
# Remove all letters from these rows so that only numbers remain

contains_h['Total Hours'] = contains_h['Total Hours'].astype(str).str.replace('[^0-9]', '', regex=True)
contains_H['Total Hours'] = contains_H['Total Hours'].astype(str).str.replace('[^0-9]', '', regex=True)

In [None]:
# Drop row 1928 since it already exists in the 'contains_h' subset

contains_H.drop(1928, inplace=True)
len(contains_H)

In [None]:
contains_H

After dropping the letters, some numbers look too big. For example row 2397 - 1883145 hours.
Before dropping the letters, the row stated: "1883tt 145 since O/H". We'll have to review the total hours before/after dropping the letters to make sure that the correct number of hours is reflected in the 'Total Hours' column.
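
To show why stripping every non-digit glues separate numbers together, here is a small sketch that instead pulls out only the first number in the text. `leading_hours` is a hypothetical helper, and treating the first number as the total time is an assumption that won't hold for every listing, so the results would still need a manual review.

```python
import re

def leading_hours(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    # Remove thousands separators such as the comma in '2,450'.
    return int(match.group().replace(',', ''))

print(leading_hours('1883tt 145 since O/H'))
print(leading_hours('2,450 hrs'))
print(leading_hours('unknown'))
```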


In [None]:
# Amending total hours to correct number.

contains_H.loc[2325, 'Total Hours'] = contains_H.loc[2325, 'Total Hours'].replace('122200', 'NaN')
contains_H.loc[2372, 'Total Hours'] = contains_H.loc[2372, 'Total Hours'].replace('8207', '821')
contains_H.loc[2397, 'Total Hours'] = contains_H.loc[2397, 'Total Hours'].replace('1883145', 'NaN')

In [None]:
# Update our df with the correct values from above. 

df.update(contains_h)
df.update(contains_H)
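
`DataFrame.update` aligns on the index and overwrites the matching cells in place, which is why the cleaned subsets can be written back this way. A sketch on toy values, taking an explicit `.copy()` of the subset before editing it:

```python
import pandas as pd

# Toy frame; the index labels and values are made up.
df_toy = pd.DataFrame({'Total Hours': ['1200', '350h', '75 H']}, index=[10, 11, 12])

# Work on an explicit copy of the messy rows.
subset = df_toy.loc[[11, 12]].copy()
subset['Total Hours'] = subset['Total Hours'].str.replace(r'[^0-9]', '', regex=True)

# Write the cleaned rows back; rows not present in `subset` are left untouched.
df_toy.update(subset)
print(df_toy)
```

Note that `update` skips NaN values in the subset, so it can't be used to blank out a cell.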

In [None]:
df

In [None]:
df.loc[~df['Total Hours'].astype(str).str.isdigit()]

In [None]:
# Filter rows that aren't pure digits again

messy_hours2 = df.loc[~df['Total Hours'].astype(str).str.isdigit()]

In [None]:
no_letters = messy_hours2[pd.to_numeric(messy_hours2['Total Hours'], errors='coerce').notnull()]
print(len(no_letters))
no_letters[30:]

I reviewed the Total Hours column above to ensure that each value will be converted correctly when cast to a float data type.

Example - row 1929 - '1.978' would be converted to 2 hours (1.978 rounds to 2) instead of 1978. Because of this I had to manually change it to '1978'. See this and the other manual changes I had to make in the cell below.

In [None]:
no_letters.loc[1929, 'Total Hours'] = no_letters.loc[1929, 'Total Hours'].replace('1.978', '1978')
no_letters.loc[2023, 'Total Hours'] = no_letters.loc[2023, 'Total Hours'].replace('5.198', '5198')
no_letters.loc[2093, 'Total Hours'] = no_letters.loc[2093, 'Total Hours'].replace('5.74', '574')
no_letters.loc[2115, 'Total Hours'] = no_letters.loc[2115, 'Total Hours'].replace('5.497', '5497')
no_letters.loc[2127, 'Total Hours'] = no_letters.loc[2127, 'Total Hours'].replace('5.615', '5615')
no_letters.loc[2230, 'Total Hours'] = no_letters.loc[2230, 'Total Hours'].replace('1.06', '106')
no_letters.loc[2331, 'Total Hours'] = no_letters.loc[2331, 'Total Hours'].replace('10.72', '1072')

In [None]:
# Convert the numbers to integers

no_letters['Total Hours'] = no_letters['Total Hours'].astype(float).round().astype(int)

In [None]:
# Update the dataframe with the updated rows from the 'no_letters' df. 

df.update(no_letters)

In [None]:
# We currently don't have any values that are null in the 'Total Hours' column

df['Total Hours'].isnull().sum()

In [None]:
# Convert to numeric; values that can't be parsed become NaN

df['Total Hours'] = pd.to_numeric(df['Total Hours'], errors='coerce')

In the interest of time, I'm not going to look into the rest of the 'Total Hours' rows that aren't pure digits. I will convert any value that can be converted to an integer, while the remaining rows will be converted to np.nan format. I will drop the NaN rows and continue cleaning the rest of the data. 

Since this is just for fun/practice I don't want to spend too much time on this one column especially since the amount of rows that will be dropped is around 100, which will leave more than 2000 to work with. 

In [None]:
# After the numeric conversion, 109 rows will need to be dropped.

print("There are {} rows with NaN that need to be dropped".format(df['Total Hours'].isnull().sum()))
df.dropna(subset=['Total Hours'], inplace=True)
print("There are {} rows with NaN remaining in the 'Total Hours' column".format(df['Total Hours'].isnull().sum()))

## Condition - column

In [None]:
df.head()

In [None]:
df['Condition'].value_counts(dropna=False)

There are lots of NaN values in the Condition column. This should be an easy fix if we make a few assumptions.

1. Any aircraft that has 0 Total Hours and was manufactured between 2018-2020 will be considered a new aircraft.  
2. Aircraft with Total Hours > 0 but a NaN value in the Condition column will be considered Used. 

Let's begin.
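
As a compact sketch of the two assumptions, here they are expressed as boolean masks on a toy frame (the rows are made up). The cells below apply the same logic to the real data.

```python
import numpy as np
import pandas as pd

# Toy frame, not the real dataset.
toy = pd.DataFrame({
    'Total Hours': [0, 1200, 0, 0],
    'Year': [2019, 1975, 1980, 2020],
    'Condition': [np.nan, np.nan, np.nan, 'New'],
})

missing = toy['Condition'].isnull()
# Assumption 1: 0 hours and built 2018-2020 -> New
is_new = missing & (toy['Total Hours'] == 0) & toy['Year'].between(2018, 2020)
# Assumption 2: more than 0 hours -> Used
is_used = missing & (toy['Total Hours'] > 0)

toy.loc[is_new, 'Condition'] = 'New'
toy.loc[is_used, 'Condition'] = 'Used'
print(toy)
```

The 1980 aircraft with 0 hours matches neither rule and stays NaN, which mirrors the leftover listings handled further down.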

In [None]:
# Let's filter for aircraft with 0 Total Hours, manufactured between 2018-2020 and don't have a listed condition. 
# These should all be listed as New according to the assumptions above. 

print(len(df.loc[(df['Total Hours'] == 0) & (df['Year'] >= 2018) & (df['Condition'].isnull())]))
df['Condition'] = np.where((df['Total Hours'] == 0) & (df['Year'] >= 2018) & (df['Condition'].isnull()) , 'New', df['Condition'])
print(len(df.loc[(df['Total Hours'] == 0) & (df['Year'] >= 2018) & (df['Condition'].isnull())]))

In [None]:
# 12 rows were updated, 600 remain. 

df['Condition'].isnull().sum()

In [None]:
# Filter for aircraft that are listed as having Total Hours greater than 0 and the condition listed as NaN.
# Change these all to Used

print(len(df.loc[(df['Total Hours'] != 0) & (df['Condition'].isnull())]))
df['Condition'] = np.where((df['Total Hours'] != 0) & (df['Condition'].isnull()) , 'Used', df['Condition'])
print(len(df.loc[(df['Total Hours'] != 0) & (df['Condition'].isnull())]))

In [None]:
# 575 rows were updated, 25 remain. 

df['Condition'].isnull().sum()

In [None]:
# Let's look at the remaining listings

df[df['Condition'].isnull()]

I'm going to drop the 25 listings above so that we have more accurate data to work with. I believe that some or even most of the aircraft above have been rebuilt, and hence are showing 0 Total Hours. I believe these are anomalies and don't reflect the majority of the data in the dataset.

Additionally, I'll drop aircraft stated as Used but have 0 Total Hours.

In [None]:
print('Length of dataset prior to dropping NaN values from the Condition column: {}'.format(len(df)))
df.dropna(subset=['Condition'], inplace=True)
print('Length of dataset after dropping NaN values from the Condition column: {}'.format(len(df)))

In [None]:
df['Condition'].value_counts(dropna=False)

In [None]:
# Drop used aircraft with 0 total hours:

len(df[(df['Total Hours'] == 0) & (df['Condition'] == 'Used')])

In [None]:
print('Length of dataset prior to dropping Used aircraft with 0 Total Hours: {}'.format(len(df)))
df = df.drop(df[(df['Total Hours'] == 0) & (df['Condition'] == 'Used')].index)
print('Length of dataset after dropping Used aircraft with 0 Total Hours: {}'.format(len(df)))

In [None]:
# Ensure the correct amount of rows were dropped.

2312-2241

In [None]:
# Project aircraft, similar to the homebuilt aircraft mentioned above, can vary widely and there isn't sufficient data
# to carry out a meaningful analysis. But for the sake of curiosity I'll leave this for now. 

df[df['Condition'] == 'Project'] 

In [None]:
# Looks good.

df['Condition'].value_counts()

In [None]:
# One more time - apply the upper method on the entire column.

print('There are a total of {}/{} uppercase rows in this column'.format((df['Condition'].str.isupper().sum()), (len(df))))
df['Condition'] = df['Condition'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['Condition'].str.isupper().sum()), (len(df))))

## Price & Currency - column

In [None]:
# The data is a little messy.

df['Price'].value_counts().tail(20)

In [None]:
# No missing data in the Price column. Let's check the currency column.

df['Price'].isnull().sum()

In [None]:
# 415 NaN values in the Currency column

df['Currency'].isnull().sum()

In [None]:
# There are 5 different currencies within our dataset.

df['Currency'].value_counts(dropna=False)

In [None]:
df[df['Currency'].isnull()].head()

Some of the data didn't transfer correctly from the csv input file: the euro and pound symbols show up as mis-encoded characters. 

A few key points:

1. Rows that only have an integer value in the Price column and no matching Currency are USD
2. Rows that have the following symbol - 'â‚¬' - are Euros. 
3. Rows that have the following symbol - 'Â£' - are GBP. 

I'm going to start by filtering by the different currencies and only extracting the integers from each row. 
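
The three key points can be sketched as one small function. `parse_price` and the sample strings are hypothetical, and the sketch assumes that prices have no decimal part and that any row without a recognised symbol is USD, so it doesn't cover listings that already carry another currency such as CAD or CHF.

```python
import re

# Mis-encoded symbols exactly as they appear in the Price column.
SYMBOL_TO_CURRENCY = {'â‚¬': 'EUR', 'Â£': 'GBP'}

def parse_price(raw):
    """Return (amount, currency) for a raw Price string; amount is None if no digits are found."""
    currency = 'USD'  # assumption from point 1: no symbol means USD
    for symbol, code in SYMBOL_TO_CURRENCY.items():
        if symbol in raw:
            currency = code
    # Keep only the digits, dropping labels, symbols, spaces, and commas.
    digits = re.sub(r'[^0-9]', '', raw)
    amount = int(digits) if digits else None
    return amount, currency

print(parse_price('Price: â‚¬85,000'))
print(parse_price('Â£120,000'))
print(parse_price('Price: USD $45,900'))
```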

In [None]:
df[df['Currency'] == 'EUR'].tail()

In [None]:
df[df['Currency'] == 'GBP'].tail()

In [None]:
# Create a variable for EUR

price_eur = df[df['Price'].str.contains(r'â‚¬', flags=re.IGNORECASE, regex=True, na=False)]
price_eur

In [None]:
# Update the Currency column for the listing below

price_eur[price_eur['Currency'].isnull()]

In [None]:
df.loc[2361, ['Currency']] = df.loc[2361, ['Currency']].replace(np.nan, 'EUR')
df.loc[2361, 'Currency']

In [None]:
# We can now go ahead and remove the unwanted characters within the Price column for Euros

price_eur['Price'] = price_eur['Price'].str.replace('Price:', '')
price_eur['Price'] = price_eur['Price'].str.replace('â‚¬', '')
print(len(price_eur))
price_eur

In [None]:
df.update(price_eur)

In [None]:
# Let's repeat the steps above for GBP now. 

price_gbp = df[df['Price'].str.contains(r'£', flags=re.IGNORECASE, regex=True, na=False)]
print(len(price_gbp))
price_gbp

In [None]:
# No missing values for Currency = GBP

price_gbp[price_gbp['Currency'].isnull()]

In [None]:
# Remove all unwanted characters within the GBP rows in the Price column

price_gbp['Price'] = price_gbp['Price'].str.replace('Price: ', '')
price_gbp['Price'] = price_gbp['Price'].str.replace('Â£', '')
print(len(price_gbp))
price_gbp

In [None]:
df.update(price_gbp)

In [None]:
# Lastly, follow the steps above for USD

price_usd = df[df['Price'].str.contains(r'USD', flags=re.IGNORECASE, regex=True, na=False)]
print(len(price_usd))
price_usd

In [None]:
# Replace the NaN rows in the Currency column with 'USD'

price_usd['Currency'] = price_usd['Currency'].replace(np.nan, 'USD')
price_usd.head()

In [None]:
# Remove all unwanted characters

price_usd['Price'] = price_usd['Price'].str.replace('Price: USD', '')
price_usd['Price'] = price_usd['Price'].str.replace('$', '', regex=False)
price_usd.head()

In [None]:
# Update the main df

df.update(price_usd)

In [None]:
# Only 1 NaN remains. 
# Drop row with index 2260 since it doesn't have a price, nor currency.

print(df['Currency'].isnull().sum())
df[df['Currency'].isnull()]

In [None]:
print('Length of dataset before dropping row at index 2260: {}'.format(len(df)))
df.drop(2260, inplace=True)
print('Length of dataset after dropping row at index 2260: {}'.format(len(df)))

In [None]:
df['Currency'].isnull().sum()

In [None]:
df['Currency'].value_counts()

In [None]:
# Let's quickly take a look at the other currencies
# CAD looks good

df[df['Currency'] == 'CAD']

In [None]:
# Let's quickly take a look at the other currencies
# CHF looks good

df[df['Currency'] == 'CHF']

In [None]:
df['Price'].value_counts()

In [None]:
# Lastly, drop unwanted characters such as the '$' symbol from the entire Price column.

df['Price'] = df['Price'].str.replace('$', '', regex=False)
df['Price'] = df['Price'].str.replace(' ', '', regex=False)
df['Price'] = df['Price'].str.replace(',', '', regex=False)

df['Price'].value_counts()

In [None]:
# Price column looks good now, we can move on. 

df.head()

## Location - column

In [None]:
df.head()

This column was definitely the most time consuming to clean up. I had to apply a lot of manual changes. 

In [None]:
# 11 missing rows in this column

df['Location'].isnull().sum()

In [None]:
df[df['Location'].isnull()]

In [None]:
# Drop the 11 rows with NaN values in the Location column.

print('Length of dataset before dropping NaN values from the Location column: {}'.format(len(df)))
df.dropna(subset=['Location'], inplace=True)
print('Length of dataset after dropping NaN values from the Location column: {}'.format(len(df)))

In [None]:
df

In [None]:
# Remove blank spaces from Location column

df.Location = df.Location.str.replace(' ', '')

In [None]:
# I manually kept adding countries to this list as I went through the Location column.
# Spaces were stripped from Location above, so only the no-space spellings can match.

country_list = ['UnitedKingdom', 'Monaco', 'USA', 'Canada', 'Luxembourg', 'Germany', 'Austria',
                'Poland', 'Belgium', 'Netherlands', 'Sweden',
                'Norway', 'Switzerland', 'France', 'Spain', 'Denmark', 'Lithuania', 'Turkey', 'Italy',
                'Iceland', 'SouthAfrica', 'UnitedStates', 'CzechRepublic', 'NewZealand', 'Brazil', 'Australia',
                'Bulgaria', 'CostaRica', 'RussianFederation', 'Chile', 'Nigeria', 'Pakistan', 'Indonesia',
                'Venezuela', 'Malaysia', 'Congo', 'NewGuinea', 'UnitedArabEmirates', 'Singapore', 'CAN', 'POL',
                'DEU', 'FRA', 'ITA', 'ZAF', 'AUS', 'ARG', 'SRB', 'CZE', 'NLD', 'MEX', 'ESP', 'URY', 'KEN', 'CHE']

pattern = '|'.join(country_list)

In [None]:
# Create a function to search through the Location column and extract country names. A Country column is created with the 
# individual Country names

def pattern_search(search_str:str, search_list:str):

    search_obj = re.search(search_list, search_str)
    if search_obj :
        return_str = search_str[search_obj.start(): search_obj.end()]
    else:
        return_str = np.nan
    return return_str

df['Country'] = df['Location'].astype(str).apply(lambda x: pattern_search(search_str=x, search_list=pattern))
df

In [None]:
df['Country'].isnull().sum()

In [None]:
# In the interest of time I'm going to completely drop these rows.

df.dropna(subset=['Country'], inplace=True)
df['Country'].isnull().sum()

In [None]:
df.head()

In [None]:
# Some of the countries are repeated. I'll have to manually update these. 

df['Country'].value_counts()

In [None]:
# Manual changes to ensure that countries aren't double counted and have a consistent format.

df['Country'] = np.where((df['Country'] == 'FRA'), 'France', df['Country'])
df['Country'] = np.where((df['Country'] == 'MEX'), 'Mexico', df['Country'])
df['Country'] = np.where((df['Country'] == 'URY'), 'Uruguay', df['Country'])
df['Country'] = np.where((df['Country'] == 'KEN'), 'Kenya', df['Country'])
df['Country'] = np.where((df['Country'] == 'ITA'), 'Italy', df['Country'])
df['Country'] = np.where((df['Country'] == 'ESP'), 'Spain', df['Country'])
df['Country'] = np.where((df['Country'] == 'NLD'), 'Netherlands', df['Country'])
df['Country'] = np.where((df['Country'] == 'CZE'), 'Czech Republic', df['Country'])
df['Country'] = np.where((df['Country'] == 'SRB'), 'Serbia', df['Country'])
df['Country'] = np.where((df['Country'] == 'ARG'), 'Argentina', df['Country'])
df['Country'] = np.where((df['Country'] == 'CzechRepublic'), 'Czech Republic', df['Country'])
df['Country'] = np.where((df['Country'] == 'CostaRica'), 'Costa Rica', df['Country'])
df['Country'] = np.where((df['Country'] == 'UnitedArabEmirates'), 'United Arab Emirates', df['Country'])
df['Country'] = np.where((df['Country'] == 'RussianFederation'), 'Russia', df['Country'])
df['Country'] = np.where((df['Country'] == 'POL'), 'Poland', df['Country'])
df['Country'] = np.where((df['Country'] == 'DEU'), 'Germany', df['Country'])
df['Country'] = np.where((df['Country'] == 'AUS'), 'Australia', df['Country'])
df['Country'] = np.where((df['Country'] == 'SouthAfrica'), 'South Africa', df['Country'])
df['Country'] = np.where((df['Country'] == 'ZAF'), 'South Africa', df['Country'])
df['Country'] = np.where((df['Country'] == 'UnitedKingdom'), 'United Kingdom', df['Country'])
df['Country'] = np.where((df['Country'] == 'CHE'), 'Switzerland', df['Country'])
df['Country'] = np.where((df['Country'] == 'CAN'), 'Canada', df['Country'])
df['Country'] = np.where((df['Country'] == 'NewGuinea'), 'New Guinea', df['Country'])
df['Country'] = np.where((df['Country'] == 'UnitedStates'), 'USA', df['Country'])

In [None]:
df['Country'].value_counts()

In [None]:
# Double check the data. 

df[df['Country'] == 'Canada']

It seems like some of the Country values aren't correct. Take row 2508 as an example: the Country column indicates Canada, but according to the Location column it should actually be the United States. This happens because the search function returns the first match it finds when reading the Location string from left to right, so a value such as 'NorthAmerica+Canada,UnitedStates' yields 'Canada'.

We'll need to look at each country individually to ensure that our data is accurate. Let's do that, but first I'll add a 'US - State' column. 
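
A related pitfall is that short country codes can match inside longer words. The sketch below contrasts the plain alternation with a stricter pattern that only accepts a code when it isn't glued to other letters. The `strict` pattern is my own assumption about what a tighter match could look like, and it doesn't help with values that contain two full country names.

```python
import re

codes = ['URY', 'AUS', 'CAN', 'USA']

# Plain alternation, as used by pattern_search above.
loose = re.compile('|'.join(codes))
# Assumption: a code only counts when it is not glued to other letters.
strict = re.compile(r'(?<![A-Za-z])(?:' + '|'.join(codes) + r')(?![A-Za-z])')

location = 'HAWKESBURY,\n\tON\n\tCAN'
print(loose.search(location).group())   # matches inside HAWKESBURY
print(strict.search(location).group())  # skips it and finds the trailing code
```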

In [None]:
# Copy/Paste from https://gist.github.com/JeffPaine/3083347

states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

pattern = '|'.join(states)

In [None]:
df['US - State'] = df['Location'].astype(str).apply(lambda x: pattern_search(search_str=x, search_list=pattern))

In [None]:
df.head()

In [None]:
# Manual changes to correct the Country column

df['Country'] = np.where((df['Country'] == 'Canada') & (df['Location'] == 'NorthAmerica+Canada,Mexico'), 'Mexico', df['Country'])
df['Country'] = np.where((df['Country'] == 'Canada') & (df['Location'] == 'NorthAmerica+Canada,UnitedStates'), 'USA', df['Country'])
df['Country'] = np.where((df['Country'] == 'Canada') & (df['Location'] == 'NorthAmerica+Canada,Canada'), 'Canada', df['Country'])
df['Country'] = np.where((df['Country'] == 'Canada') & (df['US - State'] != 'CA') & (df['US - State'].notnull()), 'USA', df['Country'])
df['US - State'] = np.where((df['Country'] == 'Canada') & (df['US - State'].notnull()), np.nan, df['US - State'])
df['Country'] = np.where((df['Country'] == 'Canada') & (df['Location'] == 'NorthAmerica+Canada,UnitedStates-CA'), 'USA', df['Country'])
df['US - State'] = np.where((df['Country'] == 'USA') & (df['Location'] == 'NorthAmerica+Canada,UnitedStates-CA'), 'CA', df['US - State'])
df['US - State'] = np.where((df['Country'] != 'USA') & (df['US - State'].notnull()), np.nan, df['US - State'])

In [None]:
# While cross checking each country using the code below (I applied the filter below for all countries, one at a time), 
# I noticed that Uruguay was incorrectly entered as the Country for some of the listings. 

df[(df['Country'] == 'Uruguay')]

In [None]:
# Amendments to correct Uruguay:

df['Country'] = np.where((df['Country'] == 'Uruguay') & (df['Location'] == 'HAWKESBURY,\n\tON\n\tCAN'), 'Canada', df['Country']) 
df['Country'] = np.where((df['Country'] == 'Uruguay') & (df['Location'] == 'HAWKESBURY,\n\tQC\n\tCAN'), 'Canada', df['Country']) 
df['Country'] = np.where((df['Country'] == 'Uruguay') & (df['Location'] == 'HAWKESBURY\n\t\n\tUSA'), 'USA', df['Country'])
df[(df['Country'] == 'Uruguay')]

In [None]:
df['Country'].value_counts()[:10]

In [None]:
# Australia also has several errors. See cell below for amendments. 

df[(df['Country'] == 'Australia')]

In [None]:
# Australia amendments

df['Country'] = np.where((df['Country'] == 'Australia') & (df['Location'] == 'Australia&NZ,NewZealand'), 'New Zealand', df['Country']) 
df['Country'] = np.where((df['Country'] == 'Australia') & (df['Location'] == 'AUSTIN,\n\tTX\n\tUSA'), 'USA', df['Country']) 

In [None]:
df['Country'].isnull().sum()

In [None]:
df['Country'].value_counts()

In [None]:
df['US - State'].isnull().sum()

In [None]:
df['US - State'].value_counts()

In [None]:
# Last step - Applying upper method

print('There are a total of {}/{} uppercase rows in this column'.format((df['Country'].str.isupper().sum()), (len(df))))
df['Country'] = df['Country'].str.upper()
print('There are a total of {}/{} uppercase rows in this column'.format((df['Country'].str.isupper().sum()), (len(df))))

In [None]:
df.head()

# DATA CLEANING - FINAL STEPS

In [None]:
df.info()

A few last steps:

1. Drop the Location column because we've created the Country and US - State columns
2. Drop the Engine 1 Hours, Engine 2 Hours, Prop 1 Hours, Prop 2 Hours, Total Seats, and Flight Rules columns since there is too much missing data. I might decide to add them back later, but I won't be using those columns for the initial analysis.  
3. Drop the S/N and REG columns. These columns won't be very useful in analyzing aircraft prices. The Serial and Registration numbers are unique to each aircraft and have no bearing on the price. These numbers are similar to an automobile's license plate, which doesn't have any impact on the price. 
4. Rearrange and rename certain columns 
5. Convert the data into correct data types. For example, the Price and Total Hours columns should be Dtype = integer.
6. Convert Currency to USD to make it easier to work with the data
7. Create output csv file

In [None]:
# Drop - Location column

location_df = df['Location']
df.drop(['Location'], axis=1, inplace=True)
df.head()

In [None]:
# Drop - Engine 1 Hours, Engine 2 Hours, Prop 1 Hours, Prop 2 Hours, Total Seats, Flight Rules

unused_columns = df[['Engine 1 Hours', 'Engine 2 Hours', 'Prop 1 Hours', 'Prop 2 Hours', 'Total Seats', 'Flight Rules']]
df.drop(['Engine 1 Hours', 'Engine 2 Hours', 'Prop 1 Hours', 'Prop 2 Hours', 'Total Seats', 'Flight Rules'], axis=1, inplace=True)

In [None]:
# Drop S/N and REG columns

sn_reg = df[['S/N', 'REG']]
df.drop(['S/N', 'REG'], axis=1, inplace=True)

In [None]:
# Rename the columns

df = df.rename(columns={'Country': 'Location - Country', 'US - State': 'Location - US State'})

df.head()

In [None]:
# Rearrange the columns

rearrange_columns = [
 'Condition',
 'Category',
 'Year',
 'Make',
 'Model',
 'Country of Origin',
 'Total Hours',
 'Location - Country',
 'Location - US State',
 'Price',
 'Currency', 
 ]

In [None]:
df = df[rearrange_columns]
df.head()

In [None]:
# Convert columns to correct data types

df['Total Hours'] = df['Total Hours'].astype(np.int64)
df['Price'] = df['Price'].astype(np.int64)
df['Year'] = pd.to_datetime(df['Year'], format='%Y').dt.year
df.info()

In [None]:
df.head()

Let's convert the Price column to USD. I will use the July 1st, 2020 close rates.

- [1 USD = 0.8888 EUR](https://www.poundsterlinglive.com/best-exchange-rates/us-dollar-to-euro-exchange-rate-on-2020-07-01)
- [1 USD = 0.8023 GBP](https://www.poundsterlinglive.com/best-exchange-rates/us-dollar-to-british-pound-exchange-rate-on-2020-07-01)
- [1 USD = 1.3591 CAD](https://www.poundsterlinglive.com/best-exchange-rates/us-dollar-to-canadian-dollar-exchange-rate-on-2020-07-01)
- [1 USD = 0.9458 CHF](https://www.poundsterlinglive.com/best-exchange-rates/us-dollar-to-swiss-franc-exchange-rate-on-2020-07-01)

In [None]:
# Convert all Prices to USD

# Cast to float first so the converted (fractional) values can be assigned cleanly
df['Price'] = df['Price'].astype(float)
df.loc[df['Currency'] == 'EUR', 'Price'] = df.loc[df['Currency'] == 'EUR', 'Price']*(1/0.8888)
df.loc[df['Currency'] == 'GBP', 'Price'] = df.loc[df['Currency'] == 'GBP', 'Price']*(1/0.8023)
df.loc[df['Currency'] == 'CAD', 'Price'] = df.loc[df['Currency'] == 'CAD', 'Price']*(1/1.3591)
df.loc[df['Currency'] == 'CHF', 'Price'] = df.loc[df['Currency'] == 'CHF', 'Price']*(1/0.9458)
df['Price'] = df['Price'].astype(np.int64)
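
The same conversion can be sketched with a single rate lookup. The `per_usd` dict uses the rates listed above, while the toy prices are made up. Unlike the cell above, this version rounds to the nearest dollar instead of truncating.

```python
import pandas as pd

# Units of each currency per 1 USD, taken from the rates listed above.
per_usd = {'USD': 1.0, 'EUR': 0.8888, 'GBP': 0.8023, 'CAD': 1.3591, 'CHF': 0.9458}

# Toy prices, not from the real dataset.
toy = pd.DataFrame({'Price': [100000, 88880, 135910],
                    'Currency': ['USD', 'EUR', 'CAD']})

# Look up each row's rate, then divide to get the USD amount.
rate = toy['Currency'].map(per_usd)
toy['Price USD'] = (toy['Price'] / rate).round().astype('int64')
print(toy)
```

A currency that is missing from the dict would map to NaN and make the integer cast fail, which acts as a useful check that every currency has a rate.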

In [None]:
# We can now drop the Currency column since all of the prices are in USD

df.drop(['Currency'], axis=1, inplace=True)

## Clean Dataset:

In [None]:
df.head()

### Last Step! 

Create a csv output of the clean dataset.

In [None]:
df.to_csv('clean_aircraft_data.csv')
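
Because many rows were dropped along the way, the index no longer runs from 0 to n-1, and `to_csv` writes it out as an extra unnamed column by default. If that isn't wanted, `index=False` leaves it out. A sketch using an in-memory buffer and made-up rows:

```python
import io
import pandas as pd

# Toy frame with gaps in the index, as happens after dropping rows.
toy = pd.DataFrame({'Make': ['CESSNA', 'PIPER'], 'Price': [45900, 120000]}, index=[7, 42])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # leave the leftover index labels out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored)
```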

Thank you for going through this notebook!

Feel free to leave any questions or comments below.  