# BrainStation Capstone: Flight Data and CO2 Emissions

**Author:** Xenel Nazar

**Contact Info:** xenel.nazar@gmail.com

**Submission Date:** Sept 26, 2022

**Notebook:** 3 of 4

Table of Contents:

[Introduction](#Introduction) \
[Pre-Processing](#Pre-Processing) \
[Balance of Categorical Columns](#Balance-of-Categorical-Columns) \
[Dataset for Modeling](#Dataset-for-Modeling) \
[Sources](#Sources)

# Introduction

For the following notebook, we will be looking at pre-processing the data from the .csv file generated at the end of our first notebook. 

We will iterate through the numerical and categorical columns and pre-process, as required, to help prepare the data for modeling.

### Import Libraries

We will first import all necessary libraries to help with pre-processing.

In [None]:
# import libraries
import numpy as np
import pandas as pd
pd.options.display.max_columns = 999
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import OneHotEncoder

### Import Data

We will now import the .csv file that was generated in our first notebook.

In [None]:
# Read in the data 
df = pd.read_csv('data/flight_data_edit.csv')

In [None]:
# verify
df.info()

# Pre-Processing

We will first split the columns into numerical and categorical columns.

In [None]:
# Separate Numeric and Categorical Columns

# Separate Numeric Columns
numeric_col_list = list(df.select_dtypes('number').columns)

#Separate Categorical Columns
categorical_col_list = list(df.select_dtypes('object').columns)
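A minimal sketch of the same split on a toy dataframe (the values are made up; only the column names are borrowed from the dataset). Selecting "everything that is not numeric" with `exclude` is an assumption made here so that every column lands in exactly one list; it is not necessarily equivalent to `select_dtypes('object')` on the real data.

```python
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset
toy = pd.DataFrame({
    "price": [420.0, 1310.5],
    "stops": [1, 2],
    "airline_1": ["Lufthansa", "Qantas"],
})

# Numeric columns, and everything else (assumed here to be categorical)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Every column should appear in exactly one of the two lists
covered = sorted(numeric_cols + other_cols) == sorted(toy.columns)
print(numeric_cols, other_cols, covered)
```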

In [None]:
# verify numerical columns
print("Numeric Columns:")
numeric_col_list

In [None]:
# verify categorical columns
print("Categorical Columns:")
categorical_col_list

### Pre-Process Numeric Columns

We will first look at tackling the numerical columns.

In [None]:
# verify numerical columns
numeric_col_list

We can now transfer the numerical columns to a new dataframe that we will use for modeling.

In [None]:
# Create new dataframe to store numerical data
df_clean = df[numeric_col_list].copy()

In [None]:
# Verify new dataframe
df_clean.head()

In [None]:
# Describe numerical columns
df_clean.describe()

As the scales of the numerical columns vary, we may consider using a standard scaler when modeling.
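As a rough illustration of what standard scaling would do, the sketch below applies scikit-learn's `StandardScaler` to a toy dataframe with made-up values. It is only a sketch of the idea, not the scaling step used in the modeling notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "duration": [120.0, 300.0, 480.0],
    "price": [200.0, 800.0, 5000.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy),  # subtract the column mean, divide by the column std
    columns=toy.columns,
    index=toy.index,
)
print(scaled.round(2))
```

After scaling, each column has a mean of 0 and a (population) standard deviation of 1. When modeling, the scaler would normally be fit on the training split only and then applied to the test split.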

In [None]:
# Generate histograms for numeric columns
for i, column in enumerate(df_clean.columns, start=1):
    fig = px.histogram(df_clean, x=column,
                       opacity=0.8,
                       color_discrete_sequence=['cornflowerblue']
                       )
    fig.update_layout(title=column,
                      xaxis_title=column, yaxis_title='Count')
    fig.show()
    # export graph, numbering the files in column order
    fig.write_html(f"visualizations/numerical_column_{i}_Overview.html")

The Latitude and Longitude coordinates for the Origin and Destination airports do not show any sign of normality in the data.

The `duration` column shows a clearly bell-shaped distribution, with a slight right skew.

The `stops` column shows a similar pattern, with most values falling around 1 to 3 stops and a slight right skew.

The `price` column does not show a normal distribution, with a majority of the points under the $5,000 USD price range. 

`co2_emissions` shows a bimodal distribution, with one cluster from ~0-500 lbs and another above 500 lbs, with a right skew.

The `Distance_km` column shows some form of a bimodal distribution, with a right-skewed cluster from 0-4,000 km and another from 5,000-15,000 km.

The calculated `KM/LBS` column clearly shows a bimodal distribution. A right-skewed cluster from 0-4 km/lb is classified in the `KM/LBS_Classification` column as 0, or Low Utilization/Efficiency. A second, roughly normally distributed cluster from 4-12 km/lb is classified in the `KM/LBS_Classification` column as 1, or High Utilization/Efficiency.
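The classification itself was created in an earlier notebook and is not shown here. As a hedged sketch only, a 4 km/lb cut-off like the one described above could be applied as follows; treating a value of exactly 4 as High is an assumption, as are the toy values.

```python
import pandas as pd

# Cut-off described in the text; how the boundary value is handled is assumed
THRESHOLD_KM_PER_LB = 4

toy = pd.DataFrame({"KM/LBS": [1.2, 3.9, 4.0, 7.5, 11.0]})

# 0 = Low Utilization/Efficiency, 1 = High Utilization/Efficiency
toy["KM/LBS_Classification"] = (toy["KM/LBS"] >= THRESHOLD_KM_PER_LB).astype(int)
print(toy)
```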

With regard to `KM/LBS_Classification`, we do not see a normal distribution but a clearly imbalanced classification, with the majority classified as 1, or High Utilization/Efficiency.

The imbalance can make modeling more difficult if we need to over- or under-sample to improve the model's performance. Note that oversampling could lead to overfitting issues. Tree-based models, such as Decision Trees and Random Forests, could help handle the imbalanced data.
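One way to quantify the imbalance, and one possible alternative to resampling, is sketched below on made-up labels. The weights follow the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn's tree-based classifiers apply when given `class_weight="balanced"`. Class weighting is not something this notebook does; it is shown only as an option.

```python
import pandas as pd

# Hypothetical target: 7 High (1) vs. 3 Low (0)
y = pd.Series([1, 1, 1, 1, 1, 1, 1, 0, 0, 0], name="KM/LBS_Classification")

share = y.value_counts(normalize=True)  # class proportions
counts = y.value_counts()               # class counts

# "Balanced" heuristic: rarer classes receive proportionally larger weights
weights = len(y) / (counts.size * counts)
print(share)
print(weights.round(3))
```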


In [None]:
# Review Correlation 
sns.heatmap(df_clean.corr().round(2), vmin=-1, vmax=1, cmap='coolwarm', annot=True)
plt.show()

Looking at the correlation plot for the columns, we can see various positive and negative correlations.

A number of positive correlations are seen with the relationship between the `duration`, `stops`, `price`, `co2_emissions`, and `Distance_km` columns.

We do see some negative correlations between the latitude columns (`from_Airport_Lat` & `dest_Airport_Lat`) and the `duration`, `stops`, `price`, `co2_emissions`, and `Distance_km` columns.

We will likely need to drop a number of these columns prior to modeling, as strongly correlated columns carry overlapping information.
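To shortlist candidates for dropping, the strongly correlated pairs can be pulled out of the correlation matrix. The sketch below uses toy values and an arbitrary 0.9 threshold, both of which are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; distance and duration move together, stops does not
toy = pd.DataFrame({
    "Distance_km": [500, 1500, 4000, 9000, 12000],
    "duration": [90, 200, 420, 800, 1100],
    "stops": [0, 1, 0, 2, 1],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once (no self-pairs)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()
high = pairs[pairs > 0.9]  # arbitrary threshold for this sketch
print(high)
```

Only the `Distance_km`/`duration` pair exceeds the threshold in this toy example, so one of the two would be a candidate to drop.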

### Pre-Process Categorical Columns

We can now look at pre-processing the categorical columns.

In [None]:
# verify categorical columns
categorical_col_list

#### from_Airport_City

We will first look at the `from_Airport_City` column. 

In [None]:
# Check distribution of values
column = 'from_Airport_City'
rvw = df['from_Airport_City'].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value %',showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

We can see that for the origin airport cities, each of the top 10 values accounts for roughly 5-6% of the overall dataset.

We can now one hot encode each of the various Origin Airport Cities.

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

encoded_df.columns = [
"from_Airport_City_Addis Ababa","from_Airport_City_Algiers"
    ,"from_Airport_City_Athens","from_Airport_City_Beijing"
    ,"from_Airport_City_Bogota","from_Airport_City_Brussels"
    ,"from_Airport_City_Buenos Aires","from_Airport_City_Cairo"
    ,"from_Airport_City_Chengdu","from_Airport_City_Confins"
    ,"from_Airport_City_Copenhagen","from_Airport_City_Delhi"
    ,"from_Airport_City_Dublin","from_Airport_City_Frankfurt"
    ,"from_Airport_City_Guangzhou","from_Airport_City_Hangzhou"
    ,"from_Airport_City_Melbourne","from_Airport_City_Mumbai"
    ,"from_Airport_City_Munich","from_Airport_City_Paris"
    ,"from_Airport_City_Santiago","from_Airport_City_Sao Paulo"
    ,"from_Airport_City_Shanghai","from_Airport_City_Shenzhen"
    ,"from_Airport_City_Sydney","from_Airport_City_Toronto"
    ,"from_Airport_City_Vienna","from_Airport_City_Xi'an"
]

# verify
encoded_df
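The column names are typed out by hand above. As a hedged alternative, they could be derived from the fitted encoder so that the labels always match the encoder's category order. The sketch below shows this on a toy column with three of the origin cities.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

column = "from_Airport_City"
toy = pd.DataFrame({column: ["Paris", "Sydney", "Paris", "Xi'an"]})

ohe = OneHotEncoder()
dense = ohe.fit_transform(toy[[column]]).toarray()

# categories_ holds one array per input column; [0] is the only one here
names = [f"{column}_{category}" for category in ohe.categories_[0]]

encoded_toy = pd.DataFrame(dense.astype(int), columns=names, index=toy.index)
print(encoded_toy)
```

Each row contains exactly one 1, and the column labels follow the encoder's sorted category order.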

We can now combine the encoded columns with the clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

In [None]:
# verify
df_clean.head()
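Note that `pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook both dataframes appear to carry the default index, so the rows line up. The toy sketch below shows what would happen otherwise, and one way to guard against it.

```python
import pandas as pd

left = pd.DataFrame({"price": [100, 200, 300]}, index=[0, 1, 2])
right = pd.DataFrame({"flag": [1, 0, 1]}, index=[10, 11, 12])  # mismatched index

# Rows are matched on index labels, so nothing lines up and NaNs appear
misaligned = pd.concat([left, right], axis=1)

# Giving both frames the same index restores row-by-row alignment
aligned = pd.concat([left, right.set_axis(left.index, axis=0)], axis=1)

print(misaligned.shape, aligned.shape)
```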

We can now remove the `from_Airport_City` column from our categorical column list.

In [None]:
# Remove from categorical list
categorical_col_list.remove(column)
categorical_col_list

#### from_airport_code

In [None]:
# Check distribution of values
column = 'from_airport_code'
rvw = df['from_airport_code'].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value %',showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

We see similar data being presented in the `from_airport_code` column, and a similar value breakdown as the `from_Airport_City` column. 

We can drop the `from_airport_code` column from our categorical column list and use the values from the `from_Airport_City` column.

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list
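Before discarding a column as redundant, the overlap can be checked directly. The sketch below tests, on toy rows with illustrative airport codes, whether every city maps to exactly one code and vice versa. Whether the full dataset passes this check is not verified here.

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset
toy = pd.DataFrame({
    "from_airport_code": ["CDG", "CDG", "SYD", "MEL"],
    "from_Airport_City": ["Paris", "Paris", "Sydney", "Melbourne"],
})

codes_per_city = toy.groupby("from_Airport_City")["from_airport_code"].nunique()
cities_per_code = toy.groupby("from_airport_code")["from_Airport_City"].nunique()

# True only if the two columns carry the same information
is_one_to_one = bool((codes_per_city == 1).all() and (cities_per_code == 1).all())
print(is_one_to_one)
```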

#### from_country

In [None]:
# Check distribution of values
column = 'from_country'
rvw = df[column].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value %',showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

We can see that the origin country with the highest proportion of trips in the dataset is Germany, at ~12%, followed by Australia, at ~10%.

We will now encode each of the origin countries listed.

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

encoded_df.columns = ["from_country_Algeria","from_country_Argentina"
                      ,"from_country_Australia","from_country_Austria"
                      ,"from_country_Belgium","from_country_Brazil"
                      ,"from_country_Canada","from_country_Chile"
                      ,"from_country_China","from_country_Columbia"
                      ,"from_country_Denmark","from_country_Dublin"
                      ,"from_country_Egypt","from_country_Ethiopia"
                      ,"from_country_France","from_country_Germany"
                      ,"from_country_Greece","from_country_India",]

# verify
encoded_df

We will now add the encoded origin countries to our clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

In [None]:
# verify
df_clean.head()

We can now remove the `from_country` column from our categorical list.

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list

#### dest_Airport_City

We can now look at the destination Airport cities.

In [None]:
# Check distribution of values
column = 'dest_Airport_City'
rvw = df[column].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value %', showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

London and Tokyo are the most common destination cities in the dataset, at 3.6% and 3.5%, respectively.

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

# add prefix
encoded_df.columns = [
    "dest_Airport_City_Addis Ababa","dest_Airport_City_Algiers"
    ,"dest_Airport_City_Amsterdam","dest_Airport_City_Athens"
    ,"dest_Airport_City_Atlanta","dest_Airport_City_Bangalore"
    ,"dest_Airport_City_Bangkok","dest_Airport_City_Beijing"
    ,"dest_Airport_City_Bogota","dest_Airport_City_Brussels"
    ,"dest_Airport_City_Buenos Aires","dest_Airport_City_Cairo"
    ,"dest_Airport_City_Cape Town","dest_Airport_City_Casablanca"
    ,"dest_Airport_City_Charlotte","dest_Airport_City_Chengdu"
    ,"dest_Airport_City_Chicago","dest_Airport_City_Confins"
    ,"dest_Airport_City_Copenhagen","dest_Airport_City_Delhi"
    ,"dest_Airport_City_Doha","dest_Airport_City_Dubai"
    ,"dest_Airport_City_Dublin","dest_Airport_City_Fiumicino"
    ,"dest_Airport_City_Frankfurt","dest_Airport_City_Guangzhou"
    ,"dest_Airport_City_Hangzhou","dest_Airport_City_Ho Chi Minh City"
    ,"dest_Airport_City_Houston","dest_Airport_City_Istanbul"
    ,"dest_Airport_City_Jakarta","dest_Airport_City_Johannesburg"
    ,"dest_Airport_City_Kuala Lumpur","dest_Airport_City_Lima"
    ,"dest_Airport_City_Lisbon","dest_Airport_City_London"
    ,"dest_Airport_City_Los Angeles","dest_Airport_City_Madrid"
    ,"dest_Airport_City_Manchester","dest_Airport_City_Manila"
    ,"dest_Airport_City_Melbourne","dest_Airport_City_Mexico City"
    ,"dest_Airport_City_Miami","dest_Airport_City_Milan"
    ,"dest_Airport_City_Moscow","dest_Airport_City_Mumbai"
    ,"dest_Airport_City_Munich","dest_Airport_City_Nairobi"
    ,"dest_Airport_City_New York","dest_Airport_City_Orlando"
    ,"dest_Airport_City_Oslo","dest_Airport_City_Panama City"
    ,"dest_Airport_City_Paris","dest_Airport_City_Phoenix"
    ,"dest_Airport_City_San Francisco","dest_Airport_City_Santiago"
    ,"dest_Airport_City_Sao Paulo","dest_Airport_City_Seattle"
    ,"dest_Airport_City_Seoul","dest_Airport_City_Shanghai"
    ,"dest_Airport_City_Shenzhen","dest_Airport_City_Singapore"
    ,"dest_Airport_City_Stockholm","dest_Airport_City_Sydney"
    ,"dest_Airport_City_Taipei","dest_Airport_City_Tokyo"
    ,"dest_Airport_City_Toronto","dest_Airport_City_Vienna"
    ,"dest_Airport_City_Xi'an","dest_Airport_City_Zurich"
]

# verify
encoded_df

We will now add the encoded destination Airport cities to our clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

In [None]:
# verify
df_clean.head()

We will now remove the `dest_Airport_City` column from our categorical column list.

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list

#### dest_airport_code

As we dropped the similar column for origin airport codes, we can drop this column as well.

In [None]:
# Remove from categorical list
categorical_col_list.remove('dest_airport_code')
categorical_col_list

#### dest_country

In [None]:
# Check distribution of values
column = 'dest_country'
rvw = df[column].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value %', showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

The United States as a destination has the highest proportion of values at ~22%. The second highest is the United Kingdom at ~5%.

We can now work on encoding each of the destination countries.

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

# add prefix; use dest_country_ so these names do not collide with the
# dest_Airport_City_ columns added earlier (e.g. Dublin, Singapore, Zurich)
encoded_df.columns = [
    "dest_country_Algeria","dest_country_Argentina"
    ,"dest_country_Australia","dest_country_Austria"
    ,"dest_country_Belgium","dest_country_Brazil"
    ,"dest_country_Canada","dest_country_Chile"
    ,"dest_country_China","dest_country_Columbia"
    ,"dest_country_Denmark","dest_country_Dublin"
    ,"dest_country_Egypt","dest_country_Ethiopia"
    ,"dest_country_France","dest_country_Germany"
    ,"dest_country_Greece","dest_country_India"
    ,"dest_country_Indonesia","dest_country_Italy"
    ,"dest_country_Japan","dest_country_Kenya"
    ,"dest_country_Malaysia","dest_country_Mexico"
    ,"dest_country_Morocco","dest_country_Netherlands"
    ,"dest_country_Norway","dest_country_Panama"
    ,"dest_country_Peru","dest_country_Philippines"
    ,"dest_country_Portugal","dest_country_Qatar"
    ,"dest_country_Rome","dest_country_Russia"
    ,"dest_country_Singapore","dest_country_South Africa"
    ,"dest_country_South Korea","dest_country_Spain"
    ,"dest_country_Sweden","dest_country_Taiwan"
    ,"dest_country_Thailand","dest_country_Turkey"
    ,"dest_country_United Arab Emirates","dest_country_United Kingdom"
    ,"dest_country_United States","dest_country_Vietnam"
    ,"dest_country_Zurich"
]

# verify
encoded_df

We will now add the encoded destination countries to our clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

In [None]:
# verify
df_clean.head()
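Because several encoded dataframes are concatenated in turn, a quick check for duplicate column labels can catch prefix collisions. Singapore, for instance, appears both as a destination city and as a destination country. The sketch below shows such a check on a toy example.

```python
import pandas as pd

# Two encoded frames that accidentally share a column label
cities = pd.DataFrame({"dest_Airport_City_Singapore": [1, 0]})
countries = pd.DataFrame({"dest_Airport_City_Singapore": [1, 0]})

combined = pd.concat([cities, countries], axis=1)

# Labels that occur more than once in the combined frame
duplicated = combined.columns[combined.columns.duplicated()].tolist()
print(duplicated)
```

An empty list would indicate that every column label is unique.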

We will now remove the `dest_country` column from the categorical column list.

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list

#### aircraft_type_1

We will now look at the Aircraft Types. 

In [None]:
# Check distribution of values
column = 'aircraft_type_1'
rvw = df[column].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value%', showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

We see the most prominent Aircraft is the Airbus A320, at ~19%. According to Airbus, the A320 family is the most successful line ever ([1](https://www.airbus.com/en/products-services/commercial-aircraft/passenger-aircraft/a320-family)). 

We can also see that some other modes of transportation, such as train and bus, are listed in the dataset. We will have to remove these values before adding the encoded columns to our clean dataframe.

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

In [None]:
# verify
encoded_df

We can now drop the train and bus values.

In [None]:
encoded_df = encoded_df.drop(['Bus','Train'], axis=1)
encoded_df

There also appears to be a "NaN" column, which we will drop.

In [None]:
# drop last column for "NaN"
encoded_df = encoded_df.iloc[:,:-1]
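Dropping the last column by position relies on the "NaN" category being sorted last. As a hedged alternative, the unwanted columns can be selected by label instead. The sketch below swaps in `pd.get_dummies` (not the `OneHotEncoder` used in this notebook) on toy values to keep the example short.

```python
import numpy as np
import pandas as pd

column = "aircraft_type_1"
aircraft = pd.Series(
    ["Airbus A320", "Bus", np.nan, "Boeing 737", "Train"], name=column
)

# dummy_na=True adds an explicit indicator column for missing values
dummies = pd.get_dummies(aircraft, dummy_na=True, dtype=int)

# Keep real aircraft only: skip the NaN column and the non-aircraft modes
non_aircraft = {"Bus", "Train"}
keep = [c for c in dummies.columns if not pd.isna(c) and c not in non_aircraft]

encoded_toy = dummies[keep].add_prefix(f"{column}_")
print(encoded_toy)
```

Rows for bus, train, or missing values remain in the dataframe, but with 0 in every aircraft column.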

In [None]:
# verify
encoded_df

In [None]:
# add prefix
encoded_df.columns = ["aircraft_type_1_ATR 42","aircraft_type_1_ATR 42/72"
                      ,"aircraft_type_1_ATR 72","aircraft_type_1_Airbus A220 Passenger"
                      ,"aircraft_type_1_Airbus A220-100 Passenger"
                      ,"aircraft_type_1_Airbus A220-300 Passenger","aircraft_type_1_Airbus A318"
                      ,"aircraft_type_1_Airbus A319","aircraft_type_1_Airbus A319neo"
                      ,"aircraft_type_1_Airbus A320","aircraft_type_1_Airbus A320neo"
                      ,"aircraft_type_1_Airbus A321","aircraft_type_1_Airbus A321 (Sharklets)"
                      ,"aircraft_type_1_Airbus A321neo","aircraft_type_1_Airbus A330"
                      ,"aircraft_type_1_Airbus A330-800neo Passenger"
                      ,"aircraft_type_1_Airbus A330-900neo","aircraft_type_1_Airbus A340"
                      ,"aircraft_type_1_Airbus A350","aircraft_type_1_Airbus A380"
                      ,"aircraft_type_1_Avro RJ","aircraft_type_1_Boeing 717"
                      ,"aircraft_type_1_Boeing 737","aircraft_type_1_Boeing 737MAX 8 Passenger"
                      ,"aircraft_type_1_Boeing 737MAX 9 Passenger","aircraft_type_1_Boeing 747"
                      ,"aircraft_type_1_Boeing 757","aircraft_type_1_Boeing 767"
                      ,"aircraft_type_1_Boeing 777","aircraft_type_1_Boeing 787"
                      ,"aircraft_type_1_Boeing 787-10"
                      ,"aircraft_type_1_Bombardier Regional Jet 550"
                      ,"aircraft_type_1_Canadair RJ 1000","aircraft_type_1_Canadair RJ 200"
                      ,"aircraft_type_1_Canadair RJ 700","aircraft_type_1_Canadair RJ 900"
                      ,"aircraft_type_1_Canadair Reg. Jet","aircraft_type_1_Comac ARJ21-700"
                      ,"aircraft_type_1_De Havilland-Bombardier Dash-8"
                      ,"aircraft_type_1_Embraer 170","aircraft_type_1_Embraer 175"
                      ,"aircraft_type_1_Embraer 190","aircraft_type_1_Embraer 195"
                      ,"aircraft_type_1_Embraer 195 E2","aircraft_type_1_Embraer ERJ-145"
                      ,"aircraft_type_1_Embraer RJ-170/190","aircraft_type_1_SAAB SF 340"]

In [None]:
# verify
encoded_df

We can now combine the encoded aircraft types with the clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

In [None]:
# verify
df_clean

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list

#### airline_1

We will now look at the airline information in the dataset.

In [None]:
# Check distribution of values
column = 'airline_1'
rvw = df[column].value_counts(normalize=True)*100
rvw

In [None]:
# Graph Categorical column
fig = px.bar(rvw,
                   opacity=0.8,
                   color_discrete_sequence=['cornflowerblue']
                   )
fig.update_layout( title=column,
                  xaxis_title=column, yaxis_title='Value%', showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Categorical_Column_{column}.html")

We can see that the airline with the highest proportion of values in the dataset is Lufthansa, at ~16%. As Lufthansa is a German airline, this is not surprising, as the most common origin cities are German cities and the most common origin country is Germany ([2](https://www.lufthansagroup.com/en/company/company-portrait.html#cid6471)).

We will now encode each of the airlines. 

In [None]:
# Instantiate the OneHotEncoder
ohe = OneHotEncoder()
subcategory = pd.DataFrame(df[column])

# Fit the OneHotEncoder to the subcategory column and transform
encoded = ohe.fit_transform(subcategory)

# Convert from sparse matrix to dense
dense_array = encoded.toarray()

# Put into a dataframe to get column names
encoded_df = pd.DataFrame(dense_array, columns=ohe.categories_, dtype=int)

# add prefix
encoded_df.columns = [
    "airline_1_ANA","airline_1_ASL Airlines","airline_1_Aegean"
    ,"airline_1_Aer Lingus","airline_1_Aerolineas Argentinas"
    ,"airline_1_Aeromexico","airline_1_Air Algerie","airline_1_Air Arabia"
    ,"airline_1_Air Arabia Maroc","airline_1_Air Astana"
    ,"airline_1_Air Austral","airline_1_Air Baltic","airline_1_Air Canada"
    ,"airline_1_Air China","airline_1_Air Dolomiti","airline_1_Air Europa"
    ,"airline_1_Air France","airline_1_Air India","airline_1_Air Macau"
    ,"airline_1_Air Malta","airline_1_Air Mauritius","airline_1_Air Moldova"
    ,"airline_1_Air New Zealand","airline_1_Air Niugini","airline_1_Air Serbia"
    ,"airline_1_Air Seychelles","airline_1_Air Tahiti Nui","airline_1_Air Transat"
    ,"airline_1_Air-India Express","airline_1_AirAsia (India)","airline_1_AirAsia X"
    ,"airline_1_Aircalin","airline_1_American","airline_1_Arkia","airline_1_Asiana"
    ,"airline_1_Austrian","airline_1_Avianca","airline_1_Azores Airlines","airline_1_Azul"
    ,"airline_1_Bamboo Airways","airline_1_Biman","airline_1_Blue Air","airline_1_BoA"
    ,"airline_1_British Airways","airline_1_Brussels Airlines","airline_1_Bulgaria Air"
    ,"airline_1_COPA","airline_1_CSA","airline_1_Cathay Pacific","airline_1_Cebu Pacific"
    ,"airline_1_China Airlines","airline_1_China Eastern","airline_1_China Southern"
    ,"airline_1_Corendon","airline_1_Croatia","airline_1_Cyprus Airways","airline_1_Delta"
    ,"airline_1_EVA Air","airline_1_EgyptAir","airline_1_El Al","airline_1_Emirates"
    ,"airline_1_Ethiopian","airline_1_Etihad","airline_1_Eurowings"
    ,"airline_1_Eurowings Discover","airline_1_Fiji Airways","airline_1_Finnair"
    ,"airline_1_Flair Airlines","airline_1_Fly One","airline_1_Flynas","airline_1_Flyr AS"
    ,"airline_1_GO FIRST","airline_1_Garuda Indonesia","airline_1_Gol","airline_1_Gulf Air"
    ,"airline_1_Hainan","airline_1_Hawaiian","airline_1_Hong Kong Airlines","airline_1_ITA"
    ,"airline_1_Iberia","airline_1_Iberia Express","airline_1_Icelandair","airline_1_IndiGo"
    ,"airline_1_JAL","airline_1_Jazeera","airline_1_Jet2","airline_1_JetBlue","airline_1_Jetstar"
    ,"airline_1_Juneyao Airlines","airline_1_KLM","airline_1_Kenya Airways","airline_1_Korean Air"
    ,"airline_1_Kuwait Airways","airline_1_LATAM","airline_1_LOT"
    ,"airline_1_Lanmei Airlines (Cambodia)","airline_1_Loganair","airline_1_Lufthansa"
    ,"airline_1_Lufthansa CityLine","airline_1_Luxair","airline_1_MEA","airline_1_MIAT"
    ,"airline_1_Malaysia Airlines","airline_1_Malindo Air","airline_1_Neos"
    ,"airline_1_Nepal Airlines","airline_1_Nile Air","airline_1_Norwegian","airline_1_Oman Air"
    ,"airline_1_Pacific Airways","airline_1_Pakistan","airline_1_Paranair","airline_1_Pegasus"
    ,"airline_1_Philippine Airlines","airline_1_Qantas","airline_1_Qatar Airways","airline_1_Rex"
    ,"airline_1_Royal Air Maroc","airline_1_Royal Brunei","airline_1_Royal Jordanian"
    ,"airline_1_RwandAir","airline_1_Ryanair","airline_1_SAS","airline_1_SNCF","airline_1_SWISS"
    ,"airline_1_Saudia","airline_1_Scoot","airline_1_Shandong","airline_1_Shanghai Airlines"
    ,"airline_1_Shenzhen","airline_1_Sichuan Airlines","airline_1_Singapore Airlines"
    ,"airline_1_Sky Airline","airline_1_Sky Express","airline_1_SpiceJet","airline_1_Spirit"
    ,"airline_1_SriLankan","airline_1_SunExpress","airline_1_Swoop","airline_1_TAAG"
    ,"airline_1_TAROM","airline_1_THAI","airline_1_TUI fly","airline_1_Tap Air Portugal"
    ,"airline_1_Thai Smile","airline_1_Transavia","airline_1_Tunisair"
    ,"airline_1_Turkish Airlines","airline_1_Uni Airways","airline_1_United"
    ,"airline_1_VOEPASS","airline_1_VietJet Air","airline_1_Virgin Atlantic"
    ,"airline_1_Virgin Australia","airline_1_Vistara","airline_1_Viva Air"
    ,"airline_1_VivaAerobus","airline_1_Volaris","airline_1_Vueling","airline_1_WestJet"
    ,"airline_1_Wideroe","airline_1_Wingo","airline_1_Wizz Air","airline_1_XiamenAir"
    ,"airline_1_easyJet","airline_1_flydubai","airline_1_jetSMART"
]

# verify
encoded_df

We will now combine the encoded airline data to our clean dataframe.

In [None]:
# Combine Encoded Dataframe to df_clean
df_clean = pd.concat([df_clean, encoded_df], axis=1)

We can now remove the `airline_1` column from the categorical column list.

In [None]:
# Remove from categorical list
print(f"The {column} column is now removed from the categorical column list")
categorical_col_list.remove(column)
categorical_col_list

# Balance of Categorical Columns

The balance of the categorical columns will be dropped for convenience, as well as to remove duplicate information.

The following columns contain duplicate or similar information already listed in other columns:
- `Route` - Similar information already available with Origin and Destination City information
- `from_Airport_Name` - Similar information in Origin City information
- `from_airport_coordinates` - Latitude and Longitude information already listed in numerical column
- `dest_Airport_Name` - Similar information in Destination City information 
- `dest_airport_coordinates` - Latitude and Longitude information already listed in numerical column

For this capstone project, we will only look at the first aircraft used, and first airline listed in the first leg of a trip only. Possible future work can be done on these columns to help improve model performance.
 - `aircraft_type_2`
 - `aircraft_type_3`
 - `aircraft_type_4`
 - `aircraft_type_5`
 - `aircraft_type_6`
 - `aircraft_type_7`
 - `airline_2`
 - `airline_3`
 - `airline_4`
 - `airline_5`

# Dataset for Modeling

We can now look at the pre-processed dataset that we will use for modeling.

In [None]:
# verify
print("Dataframe Shape", df_clean.shape)

The dataframe contains 388 columns, so dimensionality reduction might be needed for modeling.
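One simple way to shrink a wide one-hot dataframe, sketched below on toy data, is to drop encoded columns that are active in only a small share of rows. The 20% cut-off and the prefix list are assumptions for illustration, and this is not necessarily the approach taken in the modeling notebook.

```python
import pandas as pd

# Hypothetical frame: two encoded airline columns and one numeric column
toy = pd.DataFrame({
    "airline_1_Lufthansa": [1, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    "airline_1_Rex":       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "price":               [300, 450, 900, 320, 1500, 410, 700, 380, 990, 120],
})

MIN_SHARE = 0.2                     # arbitrary cut-off for this sketch
ENCODED_PREFIXES = ("airline_1_",)  # prefixes that mark one-hot columns

encoded_cols = [c for c in toy.columns if c.startswith(ENCODED_PREFIXES)]

# For a 0/1 column, the mean is the share of rows where it is active
rare = [c for c in encoded_cols if toy[c].mean() < MIN_SHARE]

reduced = toy.drop(columns=rare)
print(rare, reduced.shape)
```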

We will now export the dataframe as a csv to our data folder for simplicity and use in the modeling notebook.

In [None]:
df_clean.to_csv('data/flight_data_processed.csv', index=False)

# Sources
(1) - https://www.airbus.com/en/products-services/commercial-aircraft/passenger-aircraft/a320-family

(2) - https://www.lufthansagroup.com/en/company/company-portrait.html#cid6471