<img src="../images/airplane-symbol.jpg" style="float: left; margin: 20px;" width="50" height="50"> 
#  Predicting Flight Delays (<i>a Proof-of-Concept</i>)

Author: Solomon Heng

---

# (6) Feature Engineering & Selection

## Processes covered in this notebook:
1. [Importing dataset](#(1)-Importing-dataset)
2. [Dropping unnecessary columns](#(2)-Dropping-unnecesary-columns)
3. [Converting features into correct dtypes and representation](#(3)-Converting-features-into-correct-dtypes-and-representation)
4. [Selecting top 5 origin airports](#(4)-Selecting-top-5-origin-airports)
5. [Creating the multiclass targets](#(5)-Creating-the-multiclass-targets)
6. [Train_test_split for the 5 datasets](#(6)-Train_test_split-for-the-5-datasets)
7. [Feature selection, encoding, scaling, SMOTE and exporting for ORD](#(7)-Feature-selection,-encoding,-scaling,-SMOTE-and-exporting-for-ORD)
8. [Feature selection, encoding, scaling, SMOTE and exporting for LGA](#(8)-Feature-selection,-encoding,-scaling,-SMOTE-and-exporting-for-LGA)
9. [Feature selection, encoding, scaling, SMOTE and exporting for PHL](#(9)-Feature-selection,-encoding,-scaling,-SMOTE-and-exporting-for-PHL)
10. [Feature selection, encoding, scaling, SMOTE and exporting for DFW](#(10)-Feature-selection,-encoding,-scaling,-SMOTE-and-exporting-for-DFW)
11. [Feature selection, encoding, scaling, SMOTE and exporting for MCO](#(11)-Feature-selection,-encoding,-scaling,-SMOTE-and-exporting-for-MCO)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE

import pickle

Using TensorFlow backend.


---
### (1) Importing dataset

---

In [2]:
df = pd.read_csv('../datasets/combined_data_feature_selection.csv')

In [3]:
pd.set_option('display.max_columns', 100)
df.head()

Unnamed: 0,DATETIME,DAY_OF_WEEK,SCHEDULED_ARRIVAL_YEAR,SCHEDULED_ARRIVAL_MONTH,SCHEDULED_ARRIVAL_DAY,SCHEDULED_ARRIVAL_HOUR,SCHEDULED_ARRIVAL_MINS,AIRLINE_CODE,AIRLINE_NAME,ORIGIN_AIRPORT,ORIGIN_AIRPORT_NAME,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_AIRPORT_LATITUDE,ORIGIN_AIRPORT_LONGITUDE,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,WHEELS_OFF,SCHEDULED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp
0,2015-01-01,4,2015,1,1,7,11,DL,Delta Air Lines Inc.,LAS,McCarran International Airport,Las Vegas,NV,USA,36.08036,-115.15233,ATL,30,33.0,3.0,45.0,221.0,186.0,1747,651.0,711,656.0,-15.0,0.0,30.36,0.0,0,0,0,0,0,0,4.0,0,0,10.0,0.0,0.0,0,0,20.0,0.0
1,2015-01-01,4,2015,1,1,14,0,DL,Delta Air Lines Inc.,LAS,McCarran International Airport,Las Vegas,NV,USA,36.08036,-115.15233,ATL,715,712.0,-3.0,726.0,225.0,191.0,1747,1337.0,1400,1342.0,-18.0,0.0,30.31,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,330.0,7.0,0,0,57.0,6.062178
2,2015-01-01,4,2015,1,1,17,58,DL,Delta Air Lines Inc.,LAS,McCarran International Airport,Las Vegas,NV,USA,36.08036,-115.15233,ATL,1115,1116.0,1.0,1131.0,223.0,191.0,1747,1742.0,1758,1747.0,-11.0,0.0,30.33,-3.0,0,0,0,0,0,0,0.0,0,0,10.0,0.0,0.0,0,0,51.666667,0.0
3,2015-01-01,4,2015,1,1,20,33,DL,Delta Air Lines Inc.,LAS,McCarran International Airport,Las Vegas,NV,USA,36.08036,-115.15233,ATL,1345,1353.0,8.0,1406.0,228.0,194.0,1747,2020.0,2033,2023.0,-10.0,0.0,30.36,-3.0,0,0,0,0,0,0,8.0,0,0,10.0,310.0,4.0,0,0,43.666667,2.57115
4,2015-01-01,4,2015,1,1,22,0,DL,Delta Air Lines Inc.,LAS,McCarran International Airport,Las Vegas,NV,USA,36.08036,-115.15233,ATL,1519,1518.0,-1.0,1530.0,221.0,190.0,1747,2140.0,2200,2144.0,-16.0,0.0,30.28,-6.0,0,0,0,0,0,0,12.0,0,0,10.0,260.0,3.0,0,0,13.0,0.520945


In [4]:
df.shape

(341852, 47)

---
### (2) Dropping unnecesary columns

---

In [5]:
df[df['ORIGIN_AIRPORT'] ==  'LAS'][['SCHEDULED_TIME']][:10]

Unnamed: 0,SCHEDULED_TIME
0,221.0
1,225.0
2,223.0
3,228.0
4,221.0
5,221.0
6,221.0
7,221.0
8,221.0
9,226.0


Notice that the scheduled time varies even though the origin airport is the same. This is likely because the planned flight time is adjusted for seasonal winds _(e.g. a north-east bound flight takes more or less time depending on which direction the prevailing winds favour in a given season)_.

As such, we should be able to drop the relative longitude & latitude between the airports, since the effect of the seasonal winds is already accounted for in the scheduled flight time.
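
One way to sanity-check the seasonal-wind explanation is to look at how the scheduled time on a single route moves from month to month. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) with the same column names as `df`.

```python
import pandas as pd

# Toy stand-in for the real data: one route, two months (values are made up).
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["LAS"] * 6,
    "SCHEDULED_ARRIVAL_MONTH": [1, 1, 1, 7, 7, 7],
    "SCHEDULED_TIME": [221.0, 225.0, 223.0, 232.0, 235.0, 233.0],
})

# Mean, spread and count of the scheduled time per route and month.
by_month = (
    toy.groupby(["ORIGIN_AIRPORT", "SCHEDULED_ARRIVAL_MONTH"])["SCHEDULED_TIME"]
       .agg(["mean", "std", "count"])
)
print(by_month)
```

If the monthly means on the real data differ by more than the within-month spread, that would be consistent with a seasonal adjustment, although it would not prove that wind is the cause.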

In [6]:
df.drop(['DATETIME', 'SCHEDULED_ARRIVAL_YEAR', 'SCHEDULED_ARRIVAL_DAY', 'SCHEDULED_ARRIVAL_MINS', 'DAY_OF_WEEK', 
         'AIRLINE_NAME', 'ORIGIN_AIRPORT_NAME', 'ORIGIN_CITY', 'ORIGIN_STATE', 'ORIGIN_COUNTRY',
         'ORIGIN_AIRPORT_LATITUDE', 'ORIGIN_AIRPORT_LONGITUDE', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 
         'DEPARTURE_TIME', 'WHEELS_OFF', 'AIR_TIME', 'WHEELS_ON', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'DISTANCE'], 
         axis=1, inplace=True)

# DISTANCE is also dropped since we are looking at single routes, where it is constant

In [7]:
df.head()

Unnamed: 0,SCHEDULED_ARRIVAL_MONTH,SCHEDULED_ARRIVAL_HOUR,AIRLINE_CODE,ORIGIN_AIRPORT,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp
0,1,7,DL,LAS,3.0,221.0,-15.0,0.0,30.36,0.0,0,0,0,0,0,0,4.0,0,0,10.0,0.0,0.0,0,0,20.0,0.0
1,1,14,DL,LAS,-3.0,225.0,-18.0,0.0,30.31,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,330.0,7.0,0,0,57.0,6.062178
2,1,17,DL,LAS,1.0,223.0,-11.0,0.0,30.33,-3.0,0,0,0,0,0,0,0.0,0,0,10.0,0.0,0.0,0,0,51.666667,0.0
3,1,20,DL,LAS,8.0,228.0,-10.0,0.0,30.36,-3.0,0,0,0,0,0,0,8.0,0,0,10.0,310.0,4.0,0,0,43.666667,2.57115
4,1,22,DL,LAS,-1.0,221.0,-16.0,0.0,30.28,-6.0,0,0,0,0,0,0,12.0,0,0,10.0,260.0,3.0,0,0,13.0,0.520945


---
### (3) Converting features into correct dtypes and representation

---

**(i) Employing sine/cosine encoding for the month and hour categories**

In [8]:
# Creating function to help convert SCHEDULED_ARRIVAL_MONTH & SCHEDULED_ARRIVAL_HOUR into sine, cosine encoding

def encode(data, col, max_val):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
    return data

In [9]:
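encode(df, 'SCHEDULED_ARRIVAL_MONTH', 12)
# Note: if hours run from 0 to 23 (24 distinct values), max_val=23 maps hour 0 and hour 23 to the same point
encode(df, 'SCHEDULED_ARRIVAL_HOUR', 23)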

# Dropping the initial features as we already have the sine/cosine encoded version
df.drop(['SCHEDULED_ARRIVAL_MONTH', 'SCHEDULED_ARRIVAL_HOUR'], axis=1, inplace=True)
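
To illustrate why the choice of period matters, here is a minimal sketch of the same sine/cosine encoding on a toy `hour` column. The helper `encode_cyclical` and the helper `gap` are hypothetical names introduced for this example only. It compares a period of 24 (the number of distinct hours on a 24-hour clock) with the period of 23 used above.

```python
import numpy as np
import pandas as pd

def encode_cyclical(data, col, period):
    """Return a copy of `data` with sine/cosine columns for a cyclical feature."""
    out = data.copy()
    angle = 2 * np.pi * out[col] / period
    out[col + "_sin"] = np.sin(angle)
    out[col + "_cos"] = np.cos(angle)
    return out

hours = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
enc_24 = encode_cyclical(hours, "hour", 24)  # period = number of distinct hours
enc_23 = encode_cyclical(hours, "hour", 23)  # the value used in the cell above

def gap(enc, i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    a = enc.loc[i, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = enc.loc[j, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print(gap(enc_24, 0, 4))  # hour 0 vs hour 23 with period 24: small but non-zero
print(gap(enc_23, 0, 4))  # hour 0 vs hour 23 with period 23: effectively zero
```

With a period of 24, hour 23 sits next to hour 0 on the circle without coinciding with it. The month encoding is not affected in the same way, since months run from 1 to 12 and a period of 12 keeps all twelve values distinct.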

**(ii) Encoding the binary features into 1s & 0s**

In [10]:
df['lightning'] = [0 if i=='0' else 1 for i in df['lightning']]
df['low_intensity'] = [0 if i=='0' else 1 for i in df['low_intensity']]
df['rain'] = [0 if i=='0' else 1 for i in df['rain']]
df['shower'] = [0 if i=='0' else 1 for i in df['shower']]
df['snow'] = [0 if i=='0' else 1 for i in df['snow']]
df['squall'] = [0 if i=='0' else 1 for i in df['squall']]
df['thunderyshower'] = [0 if i=='0' else 1 for i in df['thunderyshower']]
df['vicinity'] = [0 if i=='0' else 1 for i in df['vicinity']]
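
The comparison `i=='0'` only matches the string `'0'`. If a column happens to hold a mix of the string `'0'` and the integer `0` (an assumption here, which can occur when pandas infers types chunk by chunk on a large CSV), the integer zeros would be encoded as 1. The sketch below shows the difference on a made-up column and one possible way to guard against it.

```python
import pandas as pd

# Toy column mixing the string '0', the integer 0 and weather codes
# (hypothetical values, for illustration only).
toy = pd.DataFrame({"rain": ["0", 0, "RA", "0", "-RA", 0]}, dtype=object)

# Original approach: only the string '0' is treated as "no event".
naive = [0 if i == "0" else 1 for i in toy["rain"]]

# Normalising to string first treats '0' and 0 alike.
robust = (toy["rain"].astype(str).str.strip() != "0").astype(int)

print(naive)            # the integer zeros are flagged as events
print(robust.tolist())  # only the weather codes are flagged
```

Checking `df[col].map(type).value_counts()` on the real columns before encoding would show whether this case applies.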

**(iii) Label encoding windgusts**

In [11]:
df['windgust'].unique()

array(['0', 'G1', 'G2', 'G3'], dtype=object)

In [12]:
df['windgust'] = df['windgust'].map({'0':0, 'G1':1, 'G2':2, 'G3':3})
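
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a category that was not seen in `unique()` above would silently become missing. A small check, sketched here on a toy series with a made-up `'G4'` value, can catch that.

```python
import pandas as pd

gust_codes = {"0": 0, "G1": 1, "G2": 2, "G3": 3}
toy = pd.Series(["0", "G1", "G3", "G4"])  # 'G4' is a made-up unseen category

mapped = toy.map(gust_codes)

# Values that had no entry in the dictionary end up as NaN after mapping.
unmapped = toy[mapped.isna()].unique()
print(unmapped)
```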

---
### (4) Selecting top 5 origin airports

Selecting only the top 5 origin airports (by delay count) and creating an individual dataframe for each, for subsequent feature selection

---

In [13]:
df_ord = df[df['ORIGIN_AIRPORT'] == 'ORD']
df_lga = df[df['ORIGIN_AIRPORT'] == 'LGA']
df_phl = df[df['ORIGIN_AIRPORT'] == 'PHL']
df_dfw = df[df['ORIGIN_AIRPORT'] == 'DFW']
df_mco = df[df['ORIGIN_AIRPORT'] == 'MCO']

In [14]:
# Dropping origin airport since we no longer need it
df_ord.drop('ORIGIN_AIRPORT', axis=1, inplace=True)
df_lga.drop('ORIGIN_AIRPORT', axis=1, inplace=True)
df_phl.drop('ORIGIN_AIRPORT', axis=1, inplace=True)
df_dfw.drop('ORIGIN_AIRPORT', axis=1, inplace=True)
df_mco.drop('ORIGIN_AIRPORT', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
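
The warning above appears because each per-airport dataframe is a filtered slice of `df` that is then modified in place. A minimal sketch of one way to avoid it, assuming we are happy to hold an independent copy per airport, is to take an explicit `.copy()` when slicing. The toy frame and the `frames` dictionary are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ORD", "LGA", "ORD", "PHL"],
    "ARRIVAL_DELAY": [5.0, 20.0, 70.0, -3.0],
})

airports = ["ORD", "LGA", "PHL"]

# One independent dataframe per airport, with the now-constant column removed.
frames = {
    code: toy[toy["ORIGIN_AIRPORT"] == code].drop(columns="ORIGIN_AIRPORT").copy()
    for code in airports
}

# Adding a column to a copy does not touch the parent frame.
frames["ORD"]["DELAYED"] = frames["ORD"]["ARRIVAL_DELAY"] >= 15
print(frames["ORD"])
```

Keeping the frames in a dictionary also makes it easier to apply the same step to all five airports in a loop.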


---
### (5) Creating the multiclass targets

---

**We will be classifying the delays into 4 groups:**
1. less than 15 minutes _(group 0)_
2. 15 minutes to less than 1 hour _(group 1)_
3. 1 hour to less than 3 hours _(group 2)_
4. 3 hours and above _(group 3)_

**Rationale:**
1. _(<15 mins) is the "no delay" category, where operations are normal_
2. _(15 mins to 1 hr) is the category in which the airport or airline may decide whether ground resources need to be redeployed_
3. _(1 hr to 3 hrs) is the category in which the airline or airport may decide on the actions needed to mitigate the impact of the delay (e.g. rebooking transit passengers onto another flight so that the departure of the connecting flight is not delayed)_
4. _(>3 hrs) is the category in which compensation is technically already due (in the EU), and airlines or airports may decide how to do "damage control"_

_Note: In the EU, customers have to be compensated once a delay exceeds 3 hours. Since we are looking at US flights, where there is no mandated compensation for flight delays, we use the EU threshold as a benchmark for the last class (> 3 hours). Also, if a delay is long, the airline would likely reschedule the departure to a later time._

In [15]:
df_ord.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos
29486,DL,2.0,127.0,-21.0,0.0,30.35,-1.0,0,0,0,0,0,0,3.0,0,0,10.0,320.0,6.0,0,0,39.0,4.596267,0.5,0.866025,0.631088,-0.775711
29487,DL,2.0,123.0,-18.0,0.0,30.35,-2.0,0,0,0,0,0,0,2.0,0,0,10.0,320.0,6.0,0,0,40.333333,4.596267,0.5,0.866025,0.136167,-0.990686
29488,DL,20.0,116.0,4.0,0.0,30.32,-3.0,0,0,0,0,0,0,2.0,0,0,10.0,0.0,0.0,0,0,54.666667,0.0,0.5,0.866025,-0.398401,-0.917211
29489,DL,4.0,119.0,-6.0,0.0,30.31,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,330.0,7.0,0,0,57.0,6.062178,0.5,0.866025,-0.631088,-0.775711
29490,DL,1.0,120.0,-18.0,0.0,30.3,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,310.0,3.0,0,0,57.666667,1.928363,0.5,0.866025,-0.81697,-0.57668


In [16]:
def delay_bin(x):
    """Group an arrival delay (in minutes) into one of the 4 delay categories."""
    if x < 15:
        output = 0
    elif x < 60:
        output = 1
    elif x < 180:
        output = 2
    else:
        # x >= 180; a missing (NaN) delay would also fall through to here
        output = 3

    return output
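
As an alternative sketch, the same left-closed bins can be expressed with `pd.cut`, which works on the whole column at once. One difference worth noting: `delay_bin` sends a missing delay to group 3 (every comparison with `NaN` is false), whereas `pd.cut` leaves it missing. The toy series below is made up to cover the bin edges.

```python
import numpy as np
import pandas as pd

delays = pd.Series([-21.0, 14.0, 15.0, 59.0, 60.0, 179.0, 180.0, np.nan])

# Left-closed bins: [-inf, 15), [15, 60), [60, 180), [180, inf)
binned = pd.cut(
    delays,
    bins=[-np.inf, 15, 60, 180, np.inf],
    labels=[0, 1, 2, 3],
    right=False,
)
print(binned.tolist())
```

The result is a categorical column, so it may need `.astype(int)` (after handling any missing values) before being used as a target.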

In [17]:
df_ord['DELAY'] = df_ord['ARRIVAL_DELAY'].apply(delay_bin)
df_lga['DELAY'] = df_lga['ARRIVAL_DELAY'].apply(delay_bin)
df_phl['DELAY'] = df_phl['ARRIVAL_DELAY'].apply(delay_bin)
df_dfw['DELAY'] = df_dfw['ARRIVAL_DELAY'].apply(delay_bin)
df_mco['DELAY'] = df_mco['ARRIVAL_DELAY'].apply(delay_bin)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

In [18]:
df_ord.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
29486,DL,2.0,127.0,-21.0,0.0,30.35,-1.0,0,0,0,0,0,0,3.0,0,0,10.0,320.0,6.0,0,0,39.0,4.596267,0.5,0.866025,0.631088,-0.775711,0
29487,DL,2.0,123.0,-18.0,0.0,30.35,-2.0,0,0,0,0,0,0,2.0,0,0,10.0,320.0,6.0,0,0,40.333333,4.596267,0.5,0.866025,0.136167,-0.990686,0
29488,DL,20.0,116.0,4.0,0.0,30.32,-3.0,0,0,0,0,0,0,2.0,0,0,10.0,0.0,0.0,0,0,54.666667,0.0,0.5,0.866025,-0.398401,-0.917211,0
29489,DL,4.0,119.0,-6.0,0.0,30.31,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,330.0,7.0,0,0,57.0,6.062178,0.5,0.866025,-0.631088,-0.775711,0
29490,DL,1.0,120.0,-18.0,0.0,30.3,-3.0,0,0,0,0,0,0,1.0,0,0,10.0,310.0,3.0,0,0,57.666667,1.928363,0.5,0.866025,-0.81697,-0.57668,0


---
### (6) Train_test_split for the 5 datasets

---

In [19]:
df_ord, df_test_ord= train_test_split(df_ord, test_size=0.2, shuffle=True, stratify=df_ord['DELAY'], random_state=42)
df_lga, df_test_lga= train_test_split(df_lga, test_size=0.2, shuffle=True, stratify=df_lga['DELAY'], random_state=42)
df_phl, df_test_phl= train_test_split(df_phl, test_size=0.2, shuffle=True, stratify=df_phl['DELAY'], random_state=42)
df_dfw, df_test_dfw= train_test_split(df_dfw, test_size=0.2, shuffle=True, stratify=df_dfw['DELAY'], random_state=42)
df_mco, df_test_mco= train_test_split(df_mco, test_size=0.2, shuffle=True, stratify=df_mco['DELAY'], random_state=42)
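
To show what `stratify` does, here is a small sketch on a made-up imbalanced target: the class proportions in the train and test parts match those of the full frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced 4-class target (made-up data).
toy = pd.DataFrame({
    "x": range(100),
    "DELAY": [0] * 70 + [1] * 20 + [2] * 5 + [3] * 5,
})

train, test = train_test_split(
    toy, test_size=0.2, shuffle=True, stratify=toy["DELAY"], random_state=42
)

# Class shares are preserved in both parts of the split.
print(train["DELAY"].value_counts(normalize=True).sort_index())
print(test["DELAY"].value_counts(normalize=True).sort_index())
```

This matters here because the rarest class (delays of 3 hours and above) makes up only a small share of each airport's flights, so an unstratified split could leave very few of them in the test set.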

In [20]:
df_ord.shape

(5000, 28)

In [21]:
df_lga.shape

(6342, 28)

In [22]:
df_phl.shape

(5200, 28)

In [23]:
df_dfw.shape

(5544, 28)

In [24]:
df_mco.shape

(6493, 28)

In [25]:
df_test_ord.shape

(1251, 28)

In [26]:
df_test_lga.shape

(1586, 28)

In [27]:
df_test_phl.shape

(1300, 28)

In [28]:
df_test_dfw.shape

(1387, 28)

In [29]:
df_test_mco.shape

(1624, 28)

---
### (7) Feature selection, encoding, scaling, SMOTE and exporting for ORD

---

**(i) Feature Selection for Classification Modeling (ORD)**

In [30]:
df_ord.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
31687,DL,-3.0,123.0,-16.0,0.0,29.94,21.0,1,1,1,1,1,0,23.0,1,1,10.0,280.0,4.0,0,0,78.666667,0.6945927,-0.5,-0.8660254,-0.81697,-0.57668,0
29706,DL,-4.0,125.0,-10.0,0.0,30.29,-3.0,1,1,1,0,1,0,11.0,1,1,10.0,40.0,9.0,0,0,78.666667,6.8944,0.5,0.8660254,-0.887885,0.460065,0
259116,F9,-2.0,115.0,1.0,0.0,29.93,8.0,1,1,1,1,1,0,14.0,1,1,10.0,230.0,3.0,0,0,61.666667,1.928363,0.866025,0.5,0.398401,-0.917211,0
271891,UA,-2.0,128.0,-9.0,0.0,30.23,8.0,1,1,1,1,1,1,15.0,1,1,10.0,90.0,9.0,0,0,64.666667,1.102182e-15,-1.0,-1.83697e-16,-0.997669,-0.068242,0
31854,DL,-5.0,124.0,-15.0,0.0,29.97,21.0,1,1,1,1,1,1,25.0,1,1,10.0,250.0,4.0,0,0,56.666667,1.368081,-0.866025,-0.5,0.136167,-0.990686,0


In [31]:
features_class = df_ord.drop(['DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'ARRIVAL_DELAY'], axis=1)
target_class = df_ord['DELAY']

In [32]:
features_class_numeric = features_class.drop(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity'], axis=1)
features_class_cat = features_class[['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity']]

**Using ANOVA for Numerical Features**

In [33]:
for i in zip(features_class_numeric.columns, f_classif(features_class_numeric, target_class)[1]):
    print (i)

('DEPARTURE_DELAY', 0.0)
('SCHEDULED_TIME', 0.02281499623591434)
('LATE_AIRCRAFT_DELAY', 0.0)
('QNH', 0.006450878014547138)
('dew_point', 0.000246763208067444)
('temp', 0.05841106258956433)
('visibility', 0.00307516043359877)
('winddir', 0.029775601960626185)
('windspd', 0.34829973795639013)
('windgust', 0.8311423392565764)
('NUM_ARR_AVG_3HOUR', 6.1649685764056955e-06)
('crosswind_comp', 0.0069688759721214895)
('SCHEDULED_ARRIVAL_MONTH_sin', 0.05197071273131168)
('SCHEDULED_ARRIVAL_MONTH_cos', 0.8194841934934984)
('SCHEDULED_ARRIVAL_HOUR_sin', 4.894644512023902e-14)
('SCHEDULED_ARRIVAL_HOUR_cos', 7.49892630022929e-16)


**Using Chi-Square for Categorical Features**

In [34]:
features_class_cat.columns

Index(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow',
       'squall', 'thunderyshower', 'vicinity'],
      dtype='object')

In [35]:
# Need to label encode before we can use chi2

le = LabelEncoder()
features_class_cat['AIRLINE_CODE'] = le.fit_transform(features_class_cat['AIRLINE_CODE'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [36]:
features_class_cat.head()

Unnamed: 0,AIRLINE_CODE,lightning,low_intensity,rain,shower,snow,squall,thunderyshower,vicinity
31687,1,1,1,1,1,1,0,1,1
29706,1,1,1,1,0,1,0,1,1
259116,3,1,1,1,1,1,0,1,1
271891,7,1,1,1,1,1,1,1,1
31854,1,1,1,1,1,1,1,1,1


In [37]:
for i in zip(features_class_cat.columns, chi2(features_class_cat, target_class)[1]):
    print (i)

('AIRLINE_CODE', 6.203540456934237e-26)
('lightning', 0.9758025131299192)
('low_intensity', 0.9997929945815938)
('rain', 0.9997929945815938)
('shower', 0.9509914575576246)
('snow', 0.7713102163478002)
('squall', 0.07895470080161845)
('thunderyshower', 0.9758025131299192)
('vicinity', 0.9997929945815938)
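
A caveat on the `AIRLINE_CODE` result: scikit-learn's `chi2` expects non-negative values such as booleans or frequencies, so with label encoding the integer codes (which follow alphabetical order) are treated as magnitudes. An alternative sketch, assuming SciPy is available, is a chi-squared test of independence on the contingency table of airline against delay class. The toy data below is made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: two airlines with different delay rates (made-up counts).
toy = pd.DataFrame({
    "AIRLINE_CODE": ["DL"] * 40 + ["UA"] * 40,
    "DELAY": [0] * 35 + [1] * 5 + [0] * 20 + [1] * 20,
})

# Rows are airlines, columns are delay classes, cells are flight counts.
table = pd.crosstab(toy["AIRLINE_CODE"], toy["DELAY"])
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(p_value)
```

This version does not depend on how the airline codes are numbered. One-hot encoding the airline before calling `chi2` would be another option.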


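Based on the ANOVA & chi-squared tests above, we test each feature against the null hypothesis of no relationship with the delay class (no difference in class means for ANOVA, independence for chi-squared) at the 5% significance level (95% confidence). Features with a p-value above 0.05 fail to reject the null hypothesis.

Hence, we will drop the following feature(s) _(note that SCHEDULED_TIME, visibility, winddir and crosswind_comp are dropped as well, even though their p-values above are below 0.05)_: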
1. lightning
2. low_intensity
3. rain
4. shower
5. snow
6. squall
7. thunderyshower
8. vicinity
9. SCHEDULED_TIME
10. temp
11. visibility
12. winddir
13. windspd
14. windgust
15. crosswind_comp
16. SCHEDULED_ARRIVAL_MONTH_sin
17. SCHEDULED_ARRIVAL_MONTH_cos

In [38]:
df_class_ord = df_ord.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'temp', 'visibility', 'winddir', 'windspd', 'windgust', 'crosswind_comp', 'SCHEDULED_ARRIVAL_MONTH_sin', 'SCHEDULED_ARRIVAL_MONTH_cos'], axis=1)

# Duplicating transformation for test dataset
df_class_test_ord = df_test_ord.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'temp', 'visibility', 'winddir', 'windspd', 'windgust', 'crosswind_comp', 'SCHEDULED_ARRIVAL_MONTH_sin', 'SCHEDULED_ARRIVAL_MONTH_cos'], axis=1)

In [39]:
df_class_ord.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
31687,DL,-3.0,0.0,29.94,21.0,78.666667,-0.81697,-0.57668,0
29706,DL,-4.0,0.0,30.29,-3.0,78.666667,-0.887885,0.460065,0
259116,F9,-2.0,0.0,29.93,8.0,61.666667,0.398401,-0.917211,0
271891,UA,-2.0,0.0,30.23,8.0,64.666667,-0.997669,-0.068242,0
31854,DL,-5.0,0.0,29.97,21.0,56.666667,0.136167,-0.990686,0


In [40]:
df_class_test_ord.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
32663,DL,2.0,0.0,30.19,10.0,63.333333,-0.942261,-0.33488,0
266986,OO,-3.0,0.0,30.13,-1.0,55.333333,-0.942261,-0.33488,1
30429,DL,8.0,0.0,30.14,11.0,70.666667,-0.979084,0.203456,0
29983,DL,22.0,0.0,30.02,-8.0,60.666667,-0.997669,-0.068242,0
31783,DL,-1.0,0.0,30.02,23.0,60.666667,-0.519584,0.854419,0


**(ii) Encoding Features (for classification)**

In [41]:
# Setting X_train, X_test, y_train & y_test
X_train = df_class_ord.drop('DELAY', axis=1)
y_train = df_class_ord['DELAY']
X_test = df_class_test_ord.drop('DELAY', axis=1)
y_test = df_class_test_ord['DELAY']

In [42]:
X_train.shape

(5000, 8)

In [43]:
X_test.shape

(1251, 8)

In [44]:
X_train = pd.get_dummies(X_train, columns=['AIRLINE_CODE'], drop_first=True)
X_test = pd.get_dummies(X_test, columns=['AIRLINE_CODE'], drop_first=True)

In [45]:
X_train.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA
31687,-3.0,0.0,29.94,21.0,78.666667,-0.81697,-0.57668,1,0,0,0,0,0,0
29706,-4.0,0.0,30.29,-3.0,78.666667,-0.887885,0.460065,1,0,0,0,0,0,0


In [46]:
X_test.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA
32663,2.0,0.0,30.19,10.0,63.333333,-0.942261,-0.33488,1,0,0,0,0,0,0
266986,-3.0,0.0,30.13,-1.0,55.333333,-0.942261,-0.33488,0,0,0,0,0,1,0


In [47]:
X_train.shape

(5000, 14)

In [48]:
X_test.shape

(1251, 14)

In [49]:
# Arranging the order of features to be the same
X_test = X_test[X_train.columns]
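
Indexing with `X_train.columns` raises a `KeyError` if an airline present in the training set never appears in the test set. In addition, `drop_first=True` applied separately to the test set drops whichever category comes first there, which may not be the one dropped in training. A sketch of a more defensive alignment, using made-up frames, is to create all dummies on the test set and then `reindex` to the training columns.

```python
import pandas as pd

train = pd.DataFrame({
    "AIRLINE_CODE": ["AA", "DL", "UA", "DL"],
    "DEPARTURE_DELAY": [1.0, 2.0, 3.0, 4.0],
})
test = pd.DataFrame({
    "AIRLINE_CODE": ["DL", "AA"],          # 'UA' never appears in the test set
    "DEPARTURE_DELAY": [5.0, 6.0],
})

X_train = pd.get_dummies(train, columns=["AIRLINE_CODE"], drop_first=True)

# Keep every dummy on the test side, then align to the training layout:
# missing columns are added as zeros and extra ones are discarded.
X_test = pd.get_dummies(test, columns=["AIRLINE_CODE"])
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_test)
```

With this approach, an airline that appears only in the test set ends up with zeros in every dummy column, i.e. it is treated like the baseline category.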

**Establishing Baseline Accuracy**

In [50]:
# Baseline Accuracy
y_train.value_counts(normalize=True)

0    0.7726
1    0.1424
2    0.0668
3    0.0182
Name: DELAY, dtype: float64

Baseline Accuracy: <u>77.26%</u>

**(iii) Scaling dataset**

In [51]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [52]:
# Exporting scaler for deployment
filename = '../scalers/ord_scaler.sav'
with open(filename, 'wb') as f:
    pickle.dump(ss, f)
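
For completeness, here is a sketch of how an exported scaler could be loaded back and applied to new observations at deployment time. The temporary path and toy arrays are placeholders for illustration, and the new data must have the same columns, in the same order, as the data the scaler was fitted on.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy training data (made-up values).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
ss = StandardScaler().fit(X_train)

# Hypothetical location; the notebook writes to '../scalers/' instead.
path = os.path.join(tempfile.mkdtemp(), "toy_scaler.sav")
with open(path, "wb") as f:
    pickle.dump(ss, f)

# Later, e.g. in the deployed application: load and reuse the fitted scaler.
with open(path, "rb") as f:
    loaded = pickle.load(f)

new_obs = np.array([[2.0, 20.0]])
print(loaded.transform(new_obs))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them, so it is worth recording that version alongside the file.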

**(iv) Employing Oversampling (SMOTE) to balance the training dataset**

In [53]:
sm = SMOTE(random_state=42)
X_train_sc_res, y_train_res = sm.fit_resample(X_train_sc, y_train)
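
To give some intuition for what the oversampling does, the sketch below reproduces only the interpolation step at the core of SMOTE in plain NumPy: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration with a hypothetical helper `smote_like_sample`, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in a 2-D (already scaled) feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(points, k, rng):
    """Draw one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because the new points are interpolated, one-hot columns such as the `AIRLINE_CODE_*` dummies can take values between their two scaled levels after resampling. imblearn also provides `SMOTENC` for data that mixes continuous and categorical features, which may be worth considering.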

In [54]:
y_train_res.value_counts(normalize=True)

3    0.25
2    0.25
1    0.25
0    0.25
Name: DELAY, dtype: float64

In [55]:
df_classification_enc = pd.DataFrame(X_train_sc_res, columns=X_train.columns)
df_classification_enc['DELAY'] = y_train_res

df_classification_enc_test = pd.DataFrame(X_test_sc, columns=X_test.columns)
df_classification_enc_test['DELAY'] = y_test.values

In [56]:
df_classification_enc.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA,DELAY
0,-0.34979,-0.242388,-1.072925,0.838466,1.174306,-0.873559,-0.490446,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
1,-0.368073,-0.242388,1.225008,-1.428832,1.174306,-0.998968,0.937312,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
2,-0.331507,-0.242388,-1.13858,-0.389654,0.130878,1.275746,-0.95941,-1.182405,-0.08396,4.313681,-0.172818,-0.276453,-0.450001,-0.209903,0
3,-0.331507,-0.242388,0.831077,-0.389654,0.315013,-1.193113,0.209751,-1.182405,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,4.764104,0
4,-0.386357,-0.242388,-0.875959,0.838466,-0.176012,0.812001,-1.060596,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0


In [57]:
df_classification_enc_test.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA,DELAY
0,-0.258373,-0.242388,0.568456,-0.200713,0.233175,-1.095128,-0.157449,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
1,-0.34979,-0.242388,0.174525,-1.239891,-0.257849,-1.095128,-0.157449,-1.182405,-0.08396,-0.231821,-0.172818,-0.276453,2.222216,-0.209903,1
2,-0.148673,-0.242388,0.24018,-0.106242,0.683281,-1.160247,0.583922,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
3,0.107294,-0.242388,-0.547683,-1.901186,0.0695,-1.193113,0.209751,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
4,-0.313223,-0.242388,-0.547683,1.027407,0.0695,-0.347651,1.480399,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0


In [58]:
df_classification_enc.shape

(15452, 15)

In [59]:
df_classification_enc_test.shape

(1251, 15)

**(v) Exporting the data for modeling**

In [60]:
df_classification_enc.to_csv('../datasets/combined_data_class_ord.csv', index=False)
df_classification_enc_test.to_csv('../datasets/combined_data_class_test_ord.csv', index=False)

---
### (8) Feature selection, encoding, scaling, SMOTE and exporting for LGA

---

**(i) Feature Selection for Classification Modeling (LGA)**

In [61]:
df_lga.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
258893,F9,-10.0,150.0,-13.0,0.0,30.26,11.0,1,1,1,1,1,0,17.0,1,1,10.0,80.0,4.0,0,0,46.333333,0.694593,0.5,-0.866025,-0.136167,-0.990686,0
26684,DL,15.0,157.0,-20.0,0.0,30.07,17.0,1,1,1,1,1,0,28.0,1,1,10.0,340.0,8.0,0,0,39.0,7.517541,1.224647e-16,-1.0,-0.269797,0.962917,0
26340,DL,-1.0,152.0,-12.0,0.0,30.24,14.0,1,1,1,1,1,0,23.0,1,1,10.0,150.0,5.0,0,0,63.666667,4.330127,0.5,-0.866025,0.398401,-0.917211,0
337802,MQ,18.0,137.0,8.0,0.0,30.1,14.0,1,1,1,1,1,0,22.0,1,1,10.0,0.0,0.0,0,0,76.333333,0.0,0.5,-0.866025,0.81697,-0.57668,0
337368,MQ,4.0,159.0,-17.0,0.0,30.51,-11.0,1,1,1,0,0,0,-1.0,1,1,10.0,100.0,5.0,0,0,54.666667,0.868241,0.5,0.866025,-0.136167,-0.990686,0


In [62]:
features_class = df_lga.drop(['DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'ARRIVAL_DELAY'], axis=1)
target_class = df_lga['DELAY']

In [63]:
features_class_numeric = features_class.drop(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity'], axis=1)
features_class_cat = features_class[['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity']]

**Using ANOVA for Numerical Features**

In [64]:
for i in zip(features_class_numeric.columns, f_classif(features_class_numeric, target_class)[1]):
    print (i)

('DEPARTURE_DELAY', 0.0)
('SCHEDULED_TIME', 0.002655885436822153)
('LATE_AIRCRAFT_DELAY', 0.0)
('QNH', 2.262144249912829e-05)
('dew_point', 0.00048071049224697893)
('temp', 0.09951006230878832)
('visibility', 0.001560370530422956)
('winddir', 8.84291488975131e-11)
('windspd', 0.00032805854237820094)
('windgust', 0.0011032556950852182)
('NUM_ARR_AVG_3HOUR', 8.311747666376677e-15)
('crosswind_comp', 3.90844485906365e-06)
('SCHEDULED_ARRIVAL_MONTH_sin', 1.678630151689781e-07)
('SCHEDULED_ARRIVAL_MONTH_cos', 0.10443095919608177)
('SCHEDULED_ARRIVAL_HOUR_sin', 5.681802049349261e-10)
('SCHEDULED_ARRIVAL_HOUR_cos', 4.257618888403148e-30)


**Using Chi-Square for Categorical Features**

In [65]:
features_class_cat.columns

Index(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow',
       'squall', 'thunderyshower', 'vicinity'],
      dtype='object')

In [66]:
# Need to label encode before we can use chi2

le = LabelEncoder()
features_class_cat['AIRLINE_CODE'] = le.fit_transform(features_class_cat['AIRLINE_CODE'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [67]:
features_class_cat.head()

Unnamed: 0,AIRLINE_CODE,lightning,low_intensity,rain,shower,snow,squall,thunderyshower,vicinity
258893,1,1,1,1,1,1,0,1,1
26684,0,1,1,1,1,1,0,1,1
26340,0,1,1,1,1,1,0,1,1
337802,2,1,1,1,1,1,0,1,1
337368,2,1,1,1,0,0,0,1,1


In [68]:
for i in zip(features_class_cat.columns, chi2(features_class_cat, target_class)[1]):
    print (i)

('AIRLINE_CODE', 1.3417687932355723e-05)
('lightning', 0.9987429591134359)
('low_intensity', 0.9998763676835255)
('rain', 0.9998763676835255)
('shower', 0.9116233234064776)
('snow', 0.9594023291974768)
('squall', 6.878053061588576e-08)
('thunderyshower', 0.9987429591134359)
('vicinity', 0.9998763676835255)


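Based on the ANOVA & chi-squared tests above, we test each feature against the null hypothesis of no relationship with the delay class (no difference in class means for ANOVA, independence for chi-squared) at the 5% significance level (95% confidence). Features with a p-value above 0.05 fail to reject the null hypothesis.

Hence, we will drop the following feature(s):
1. lightning
2. low_intensity
3. rain
4. shower
5. snow
6. thunderyshower
7. vicinity
8. temp

_Note: we will not drop SCHEDULED_ARRIVAL_MONTH_cos, as it should always be kept together with its sine counterpart._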

In [69]:
df_class_lga = df_lga.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'thunderyshower', 'vicinity', 'temp'], axis=1)

# Duplicating transformation for test dataset
df_class_test_lga = df_test_lga.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'thunderyshower', 'vicinity', 'temp'], axis=1)

In [70]:
df_class_lga.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,squall,visibility,winddir,windspd,windgust,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
258893,F9,-10.0,150.0,0.0,30.26,11.0,0,10.0,80.0,4.0,0,46.333333,0.694593,0.5,-0.866025,-0.136167,-0.990686,0
26684,DL,15.0,157.0,0.0,30.07,17.0,0,10.0,340.0,8.0,0,39.0,7.517541,1.224647e-16,-1.0,-0.269797,0.962917,0
26340,DL,-1.0,152.0,0.0,30.24,14.0,0,10.0,150.0,5.0,0,63.666667,4.330127,0.5,-0.866025,0.398401,-0.917211,0
337802,MQ,18.0,137.0,0.0,30.1,14.0,0,10.0,0.0,0.0,0,76.333333,0.0,0.5,-0.866025,0.81697,-0.57668,0
337368,MQ,4.0,159.0,0.0,30.51,-11.0,0,10.0,100.0,5.0,0,54.666667,0.868241,0.5,0.866025,-0.136167,-0.990686,0


**(ii) Encoding Features (for classification)**

In [71]:
# Setting X_train, X_test, y_train & y_test
X_train = df_class_lga.drop('DELAY', axis=1)
y_train = df_class_lga['DELAY']
X_test = df_class_test_lga.drop('DELAY', axis=1)
y_test = df_class_test_lga['DELAY']

In [72]:
X_train.shape

(6342, 17)

In [73]:
X_test.shape

(1586, 17)

In [74]:
X_train = pd.get_dummies(X_train, columns=['AIRLINE_CODE'], drop_first=True)
X_test = pd.get_dummies(X_test, columns=['AIRLINE_CODE'], drop_first=True)

In [75]:
X_train.head(2)

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,squall,visibility,winddir,windspd,windgust,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_WN
258893,-10.0,150.0,0.0,30.26,11.0,0,10.0,80.0,4.0,0,46.333333,0.694593,0.5,-0.866025,-0.136167,-0.990686,1,0,0
26684,15.0,157.0,0.0,30.07,17.0,0,10.0,340.0,8.0,0,39.0,7.517541,1.224647e-16,-1.0,-0.269797,0.962917,0,0,0


In [76]:
X_test.head(2)

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,squall,visibility,winddir,windspd,windgust,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_WN
24573,1.0,168.0,0.0,30.04,3.0,0,10.0,230.0,6.0,0,66.666667,3.856726,0.5,0.866025,-0.81697,-0.57668,0,0,0
337959,23.0,165.0,0.0,30.2,20.0,0,10.0,300.0,4.0,0,60.666667,2.0,1.224647e-16,-1.0,-0.979084,0.203456,0,1,0


In [77]:
X_train.shape

(6342, 19)

In [78]:
X_test.shape

(1586, 19)

In [79]:
# Arranging the order of features to be the same
X_test = X_test[X_train.columns]

**Establishing Baseline Accuracy**

In [80]:
# Baseline Accuracy
y_train.value_counts(normalize=True)

0    0.791075
1    0.129927
2    0.063387
3    0.015610
Name: DELAY, dtype: float64

Baseline Accuracy: <u>79.11%</u>
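
The baseline above is the share of the majority class (class 0, no delay), which is the accuracy of a model that always predicts "no delay". A minimal sketch of that calculation, using a small made-up label series (`y_demo` is hypothetical and stands in for `y_train`):

```python
import pandas as pd

# Hypothetical labels standing in for y_train (0 = no delay, 1-3 = delay classes)
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 3], name='DELAY')

# The majority-class share is the accuracy of always predicting that class
class_share = y_demo.value_counts(normalize=True)
baseline_accuracy = class_share.max()
majority_class = class_share.idxmax()

print(f'Baseline accuracy: {baseline_accuracy:.2%} (always predict class {majority_class})')
```

Any classifier trained later should beat this number on the original (non-resampled) class distribution to be considered useful.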

**(iii) Scaling dataset**

In [81]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [82]:
# Exporting scaler for deployment
filename = '../scalers/lga_scaler.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(ss, f)

**(iv) Employing Oversampling (SMOTE) to balance the training dataset**

In [83]:
sm = SMOTE(random_state=42)
X_train_sc_res, y_train_res = sm.fit_resample(X_train_sc, y_train)

In [84]:
y_train_res.value_counts(normalize=True)

3    0.25
2    0.25
1    0.25
0    0.25
Name: DELAY, dtype: float64

In [85]:
df_classification_enc = pd.DataFrame(X_train_sc_res, columns=X_train.columns)
df_classification_enc['DELAY'] = y_train_res

df_classification_enc_test = pd.DataFrame(X_test_sc, columns=X_test.columns)
df_classification_enc_test['DELAY'] = y_test.values

In [86]:
df_classification_enc.head()

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,squall,visibility,winddir,windspd,windgust,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_WN,DELAY
0,-0.518875,-1.458929,-0.245583,1.019847,-0.104704,-0.733113,0.352208,-0.923471,-0.630803,-0.265583,-0.875782,-0.888842,0.624847,-1.062664,0.343113,-1.106962,6.196072,-0.379121,-0.492746,0
1,0.004073,-0.639261,-0.245583,-0.232773,0.470217,-0.733113,0.352208,1.341418,0.434165,-0.265583,-1.335951,1.26664,-0.107049,-1.248873,0.107194,1.55658,-0.161393,-0.379121,-0.492746,0
2,-0.330614,-1.224738,-0.245583,0.887993,0.182756,-0.733113,0.352208,-0.313694,-0.364561,-0.265583,0.211888,0.259683,0.624847,-1.062664,1.286872,-1.006787,-0.161393,-0.379121,-0.492746,0
3,0.066827,-2.981169,-0.245583,-0.034991,0.182756,-0.733113,0.352208,-1.620361,-1.695771,-0.265583,1.006725,-1.108275,0.624847,-1.062664,2.02584,-0.542507,-0.161393,2.637677,-0.492746,0
4,-0.226024,-0.40507,-0.245583,2.668032,-2.212746,-0.733113,0.352208,-0.749249,-0.364561,-0.265583,-0.352864,-0.833984,0.624847,1.344693,0.343113,-1.106962,-0.161393,2.637677,-0.492746,0


In [87]:
df_classification_enc_test.head()

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,squall,visibility,winddir,windspd,windgust,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_WN,DELAY
0,-0.288778,0.648788,-0.245583,-0.430555,-0.871265,-0.733113,0.352208,0.383196,-0.098319,-0.265583,0.400139,0.110128,0.624847,1.344693,-0.858819,-0.542507,-0.161393,-0.379121,-0.492746,0
1,0.171417,0.297502,-0.245583,0.624283,0.757677,-0.733113,0.352208,0.992974,-0.630803,-0.265583,0.023638,-0.476442,-0.107049,-1.248873,-1.145026,0.521131,-0.161393,2.637677,-0.492746,0
2,-0.309696,0.180407,-0.245583,-0.100918,-0.104704,-0.733113,0.352208,1.428529,0.966649,-0.265583,0.211888,2.002895,1.356743,0.141015,-0.858819,-0.542507,-0.161393,-0.379121,-0.492746,0
3,-0.05868,1.11717,-0.245583,0.756138,-1.542006,-0.733113,0.352208,-0.139471,0.167923,-0.265583,-2.507288,1.069544,1.160632,0.835959,0.107194,1.55658,-0.161393,-0.379121,-0.492746,0
4,-0.037762,-0.873452,-0.245583,-1.353539,0.853497,-0.733113,0.352208,1.167196,-0.364561,-0.265583,-0.018196,0.101755,-0.838946,-1.062664,0.343113,-1.106962,6.196072,-0.379121,-0.492746,0


In [88]:
df_classification_enc.shape

(20068, 20)

In [89]:
df_classification_enc_test.shape

(1586, 20)

**(v) Exporting the data for modeling**

In [90]:
df_classification_enc.to_csv('../datasets/combined_data_class_lga.csv', index=False)
df_classification_enc_test.to_csv('../datasets/combined_data_class_test_lga.csv', index=False)

---
### (9) Feature selection, encoding, scaling, SMOTE and exporting for PHL

---

**(i) Feature Selection for Classification Modeling (PHL)**

In [91]:
df_phl.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
334680,AA,-5.0,134.0,-7.0,0.0,29.98,12.0,1,1,1,1,1,1,16.0,1,1,10.0,320.0,7.0,0,0,62.0,5.362311,-0.8660254,-0.5,-0.9422609,-0.33488,0
260179,F9,83.0,145.0,60.0,60.0,30.06,19.0,1,1,1,1,1,1,23.0,1,1,10.0,170.0,7.0,0,1,14.0,6.893654,-2.449294e-16,1.0,-2.449294e-16,1.0,2
63043,DL,-7.0,138.0,17.0,0.0,30.14,21.0,1,1,1,1,1,1,25.0,1,1,10.0,0.0,0.0,0,1,65.0,0.0,-1.0,-1.83697e-16,-0.9976688,-0.068242,1
263690,NK,-6.0,131.0,-14.0,0.0,30.06,18.0,1,1,1,1,1,1,22.0,1,1,10.0,80.0,6.0,0,0,72.666667,1.041889,-0.8660254,-0.5,0.6310879,-0.775711,0
63844,DL,-2.0,141.0,-3.0,0.0,30.33,2.0,1,1,1,1,1,1,8.0,1,1,10.0,100.0,7.0,0,0,59.666667,1.215537,-2.449294e-16,1.0,-0.9422609,-0.33488,0


In [92]:
features_class = df_phl.drop(['DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'ARRIVAL_DELAY'], axis=1)
target_class = df_phl['DELAY']

In [93]:
features_class_numeric = features_class.drop(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity'], axis=1)
features_class_cat = features_class[['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity']]

**Using ANOVA for Numerical Features**

In [94]:
for i in zip(features_class_numeric.columns, f_classif(features_class_numeric, target_class)[1]):
    print (i)

('DEPARTURE_DELAY', 0.0)
('SCHEDULED_TIME', 0.13358156136104601)
('LATE_AIRCRAFT_DELAY', 0.0)
('QNH', 2.57548207996603e-07)
('dew_point', 2.372762067300908e-19)
('temp', 1.6337383931038633e-05)
('visibility', 2.54806424874493e-15)
('winddir', 0.9679984722477728)
('windspd', 0.4339963708331276)
('windgust', 0.3157730458380467)
('NUM_ARR_AVG_3HOUR', 0.0033823378465511395)
('crosswind_comp', 0.3745542374615344)
('SCHEDULED_ARRIVAL_MONTH_sin', 0.18131139083258385)
('SCHEDULED_ARRIVAL_MONTH_cos', 0.561402763861705)
('SCHEDULED_ARRIVAL_HOUR_sin', 2.9141224834236803e-16)
('SCHEDULED_ARRIVAL_HOUR_cos', 6.484450201067113e-26)


**Using Chi-Square for Categorical Features**

In [95]:
features_class_cat.columns

Index(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow',
       'squall', 'thunderyshower', 'vicinity'],
      dtype='object')

In [96]:
# chi2 requires numeric input, so label encode AIRLINE_CODE first.
# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning.

le = LabelEncoder()
features_class_cat = features_class_cat.copy()
features_class_cat['AIRLINE_CODE'] = le.fit_transform(features_class_cat['AIRLINE_CODE'])


In [97]:
features_class_cat.head()

Unnamed: 0,AIRLINE_CODE,lightning,low_intensity,rain,shower,snow,squall,thunderyshower,vicinity
334680,0,1,1,1,1,1,1,1,1
260179,2,1,1,1,1,1,1,1,1
63043,1,1,1,1,1,1,1,1,1
263690,3,1,1,1,1,1,1,1,1
63844,1,1,1,1,1,1,1,1,1


In [98]:
for i in zip(features_class_cat.columns, chi2(features_class_cat, target_class)[1]):
    print (i)

('AIRLINE_CODE', 9.281505241998701e-13)
('lightning', 0.9996489028108656)
('low_intensity', 0.999977848491319)
('rain', 0.999977848491319)
('shower', 0.9964340176882278)
('snow', 0.9418254882210063)
('squall', 0.7504776812279135)
('thunderyshower', 0.9996489028108656)
('vicinity', 0.999977848491319)


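Based on the ANOVA and chi-squared tests above, we fail to reject the null hypothesis (that a feature has no relationship with the delay class) at the 5% significance level, i.e. 95% confidence, for every feature with a p-value above 0.05.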

Hence, we will drop the following feature(s):
1. lightning
2. low_intensity
3. rain
4. shower
5. snow
6. squall
7. thunderyshower
8. vicinity
9. SCHEDULED_TIME
10. winddir
11. windspd
12. windgust
13. crosswind_comp
14. SCHEDULED_ARRIVAL_MONTH_sin
15. SCHEDULED_ARRIVAL_MONTH_cos
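
The drop list above was assembled by reading the p-values by hand. A minimal sketch of deriving it programmatically is shown here on synthetic data; the column names `informative` and `noise`, the `alpha` threshold variable, and the `*_demo` names are made up for illustration and are not part of the notebook's dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
n = 600

# Synthetic 4-class target standing in for DELAY
target_demo = pd.Series(rng.integers(0, 4, size=n), name='DELAY')

# One feature that depends on the target and one that is pure noise
features_demo = pd.DataFrame({
    'informative': target_demo * 2.0 + rng.normal(0, 1, size=n),
    'noise': rng.normal(0, 1, size=n),
})

alpha = 0.05  # 5% significance level
_, p_values = f_classif(features_demo, target_demo)
p_series = pd.Series(p_values, index=features_demo.columns)

# Keep features whose p-value is at or below alpha, drop the rest
to_keep = p_series[p_series <= alpha].index.tolist()
to_drop = p_series[p_series > alpha].index.tolist()

print('keep:', to_keep)
print('drop:', to_drop)
```

The same pattern would apply to the chi-squared p-values, and it keeps the drop list in sync with the test output if the data changes.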

In [99]:
df_class_phl = df_phl.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'winddir', 'windspd', 'windgust', 'crosswind_comp', 'SCHEDULED_ARRIVAL_MONTH_sin', 'SCHEDULED_ARRIVAL_MONTH_cos'], axis=1)

# Duplicating transformation for test dataset
df_class_test_phl = df_test_phl.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'winddir', 'windspd', 'windgust', 'crosswind_comp', 'SCHEDULED_ARRIVAL_MONTH_sin', 'SCHEDULED_ARRIVAL_MONTH_cos'], axis=1)

In [100]:
df_class_phl.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
334680,AA,-5.0,0.0,29.98,12.0,16.0,10.0,62.0,-0.9422609,-0.33488,0
260179,F9,83.0,60.0,30.06,19.0,23.0,10.0,14.0,-2.449294e-16,1.0,2
63043,DL,-7.0,0.0,30.14,21.0,25.0,10.0,65.0,-0.9976688,-0.068242,1
263690,NK,-6.0,0.0,30.06,18.0,22.0,10.0,72.666667,0.6310879,-0.775711,0
63844,DL,-2.0,0.0,30.33,2.0,8.0,10.0,59.666667,-0.9422609,-0.33488,0


**(ii) Encoding Features (for classification)**

In [101]:
# Setting X_train, X_test, y_train & y_test
X_train = df_class_phl.drop('DELAY', axis=1)
y_train = df_class_phl['DELAY']
X_test = df_class_test_phl.drop('DELAY', axis=1)
y_test = df_class_test_phl['DELAY']

In [102]:
X_train.shape

(5200, 10)

In [103]:
X_test.shape

(1300, 10)

In [104]:
X_train = pd.get_dummies(X_train, columns=['AIRLINE_CODE'], drop_first=True)
X_test = pd.get_dummies(X_test, columns=['AIRLINE_CODE'], drop_first=True)

In [105]:
X_train.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_US,AIRLINE_CODE_WN
334680,-5.0,0.0,29.98,12.0,16.0,10.0,62.0,-0.9422609,-0.33488,0,0,0,0,0
260179,83.0,60.0,30.06,19.0,23.0,10.0,14.0,-2.449294e-16,1.0,0,1,0,0,0


In [106]:
X_test.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_US,AIRLINE_CODE_WN
61781,-4.0,0.0,30.14,11.0,22.0,10.0,59.666667,0.942261,-0.33488,1,0,0,0,0
335136,0.0,0.0,30.33,6.0,11.0,10.0,65.666667,-0.730836,0.682553,0,0,0,0,0


In [107]:
X_train.shape

(5200, 14)

In [108]:
X_test.shape

(1300, 14)

In [109]:
# Arranging the order of features to be the same
X_test = X_test[X_train.columns]

**Establishing Baseline Accuracy**

In [110]:
# Baseline Accuracy
y_train.value_counts(normalize=True)

0    0.801346
1    0.134423
2    0.053654
3    0.010577
Name: DELAY, dtype: float64

Baseline Accuracy: <u>80.13%</u>

**(iii) Scaling dataset**

In [111]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [112]:
# Exporting scaler for deployment
filename = '../scalers/phl_scaler.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(ss, f)

**(iv) Employing Oversampling (SMOTE) to balance the training dataset**

In [113]:
sm = SMOTE(random_state=42)
X_train_sc_res, y_train_res = sm.fit_resample(X_train_sc, y_train)

In [114]:
y_train_res.value_counts(normalize=True)

3    0.25
2    0.25
1    0.25
0    0.25
Name: DELAY, dtype: float64

In [115]:
df_classification_enc = pd.DataFrame(X_train_sc_res, columns=X_train.columns)
df_classification_enc['DELAY'] = y_train_res

df_classification_enc_test = pd.DataFrame(X_test_sc, columns=X_test.columns)
df_classification_enc_test['DELAY'] = y_test.values

In [116]:
df_classification_enc.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_US,AIRLINE_CODE_WN,DELAY
0,-0.360432,-0.229425,-0.811446,-0.020557,-0.15768,0.354861,0.027284,-1.052649,-0.190386,-1.081797,-0.197911,-0.165112,-0.387541,-0.403795,0
1,1.679034,2.445886,-0.285887,0.654788,0.640921,0.354861,-3.223399,0.483358,1.706147,-1.081797,5.052783,-0.165112,-0.387541,-0.403795,2
2,-0.406783,-0.229425,0.239673,0.847744,0.869092,0.354861,0.230452,-1.142971,0.18844,0.924388,-0.197911,-0.165112,-0.387541,-0.403795,1
3,-0.383608,-0.229425,-0.285887,0.55831,0.526835,0.354861,0.749658,1.512112,-0.816698,-1.081797,-0.197911,6.056497,-0.387541,-0.403795,0
4,-0.290905,-0.229425,1.487877,-0.985336,-1.070366,0.354861,-0.130735,-1.052649,-0.190386,0.924388,-0.197911,-0.165112,-0.387541,-0.403795,0


In [117]:
df_classification_enc_test.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_US,AIRLINE_CODE_WN,DELAY
0,-0.337256,-0.229425,0.239673,-0.117035,0.526835,0.354861,-0.130735,2.019364,-0.190386,0.924388,-0.197911,-0.165112,-0.387541,-0.403795,0
1,-0.244553,-0.229425,1.487877,-0.599425,-0.728109,0.354861,0.2756,-0.707999,1.255134,-1.081797,-0.197911,-0.165112,-0.387541,-0.403795,1
2,-0.244553,-0.229425,-0.942836,0.461832,0.298663,0.354861,0.862529,-0.545397,-0.816698,0.924388,-0.197911,-0.165112,-0.387541,-0.403795,0
3,0.358016,-0.229425,2.538996,-1.853638,-0.95628,0.354861,-2.252709,0.043554,1.653462,-1.081797,-0.197911,-0.165112,-0.387541,-0.403795,0
4,-0.267729,-0.229425,-0.614361,0.268877,0.983178,0.354861,0.885104,-0.707999,1.255134,0.924388,-0.197911,-0.165112,-0.387541,-0.403795,0


In [118]:
df_classification_enc.shape

(16668, 15)

In [119]:
df_classification_enc_test.shape

(1300, 15)

**(v) Exporting the data for modeling**

In [120]:
df_classification_enc.to_csv('../datasets/combined_data_class_phl.csv', index=False)
df_classification_enc_test.to_csv('../datasets/combined_data_class_test_phl.csv', index=False)

---
### (10) Feature selection, encoding, scaling, SMOTE and exporting for DFW

---

**(i) Feature Selection for Classification Modeling (DFW)**

In [121]:
df_dfw.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
331141,AA,-4.0,136.0,-12.0,0.0,30.16,12.0,1,1,1,1,1,0,24.0,1,1,10.0,350.0,5.0,0,0,72.333333,4.924039,0.5,-0.866025,-0.979084,0.203456,0
17378,DL,-4.0,130.0,-2.0,0.0,30.08,21.0,1,1,1,1,1,1,24.0,1,1,10.0,0.0,0.0,0,0,55.666667,0.0,-0.866025,-0.5,0.136167,-0.990686,0
330186,AA,69.0,118.0,55.0,0.0,30.04,3.0,1,1,1,0,1,0,7.0,1,1,10.0,320.0,4.0,0,1,57.666667,3.064178,0.5,0.866025,-0.942261,-0.33488,1
331728,AA,74.0,139.0,46.0,0.0,30.06,21.0,1,1,1,1,1,0,27.0,1,1,10.0,30.0,5.0,0,1,66.666667,4.330127,-0.5,-0.866025,-0.979084,0.203456,1
331292,AA,2.0,130.0,-4.0,0.0,30.21,19.0,1,1,1,1,1,0,21.0,1,1,10.0,140.0,5.0,0,0,55.666667,3.830222,0.5,-0.866025,0.136167,-0.990686,0


In [122]:
features_class = df_dfw.drop(['DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'ARRIVAL_DELAY'], axis=1)
target_class = df_dfw['DELAY']

In [123]:
features_class_numeric = features_class.drop(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity'], axis=1)
features_class_cat = features_class[['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity']]

**Using ANOVA for Numerical Features**

In [124]:
for i in zip(features_class_numeric.columns, f_classif(features_class_numeric, target_class)[1]):
    print (i)

('DEPARTURE_DELAY', 0.0)
('SCHEDULED_TIME', 0.5819070510422226)
('LATE_AIRCRAFT_DELAY', 0.0)
('QNH', 0.053763818376469645)
('dew_point', 1.0404124124920358e-09)
('temp', 0.0012135916751461159)
('visibility', 9.894194759420698e-09)
('winddir', 0.252286118044437)
('windspd', 0.34664471294783405)
('windgust', 0.8440766089140338)
('NUM_ARR_AVG_3HOUR', 2.7649231831735332e-06)
('crosswind_comp', 0.0005668492339203077)
('SCHEDULED_ARRIVAL_MONTH_sin', 0.007421610706167633)
('SCHEDULED_ARRIVAL_MONTH_cos', 0.1953351009113758)
('SCHEDULED_ARRIVAL_HOUR_sin', 2.5487195130160083e-08)
('SCHEDULED_ARRIVAL_HOUR_cos', 9.457502595841745e-18)


**Using Chi-Square for Categorical Features**

In [125]:
features_class_cat.columns

Index(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow',
       'squall', 'thunderyshower', 'vicinity'],
      dtype='object')

In [126]:
# chi2 requires numeric input, so label encode AIRLINE_CODE first.
# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning.

le = LabelEncoder()
features_class_cat = features_class_cat.copy()
features_class_cat['AIRLINE_CODE'] = le.fit_transform(features_class_cat['AIRLINE_CODE'])


In [127]:
features_class_cat.head()

Unnamed: 0,AIRLINE_CODE,lightning,low_intensity,rain,shower,snow,squall,thunderyshower,vicinity
331141,0,1,1,1,1,1,0,1,1
17378,1,1,1,1,1,1,1,1,1
330186,0,1,1,1,0,1,0,1,1
331728,0,1,1,1,1,1,0,1,1
331292,0,1,1,1,1,1,0,1,1


In [128]:
for i in zip(features_class_cat.columns, chi2(features_class_cat, target_class)[1]):
    print (i)

('AIRLINE_CODE', 0.6878684165023368)
('lightning', 0.9790626921285849)
('low_intensity', 0.9997255675924507)
('rain', 0.9997255675924507)
('shower', 0.9143548597135644)
('snow', 0.9559356354331762)
('squall', 0.0005555255661217424)
('thunderyshower', 0.9790626921285849)
('vicinity', 0.9997255675924507)
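
One caveat on the `AIRLINE_CODE` result: scikit-learn's `chi2` expects non-negative values such as booleans or counts, so the score for a label-encoded column depends on the arbitrary integer assigned to each airline. A possible alternative, sketched here on synthetic data (the airline codes, delay probabilities, and variable names are made up for illustration), is a chi-squared test of independence on the contingency table of airline against delay, using `scipy.stats.chi2_contingency`:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000

# Synthetic airline codes; in this toy data F9 flights are delayed more often
airline = rng.choice(['AA', 'DL', 'F9'], size=n)
delay_prob = np.where(airline == 'F9', 0.6, 0.2)
delay = (rng.random(n) < delay_prob).astype(int)

# Contingency table: rows are airlines, columns are delay classes
table = pd.crosstab(
    pd.Series(airline, name='AIRLINE_CODE'),
    pd.Series(delay, name='DELAY'),
)

# Test of independence between airline and delay class
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2_stat:.1f}, dof = {dof}, p-value = {p_value:.3g}')
```

This test does not depend on how the categories are numbered, so it may be worth comparing its p-value against the label-encoded one before dropping a categorical feature.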


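Based on the ANOVA and chi-squared tests above, we fail to reject the null hypothesis (that a feature has no relationship with the delay class) at the 5% significance level, i.e. 95% confidence, for every feature with a p-value above 0.05.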

Hence, we will drop the following feature(s):
1. AIRLINE_CODE
2. lightning
3. low_intensity
4. rain
5. shower
6. snow
7. thunderyshower
8. vicinity
9. SCHEDULED_TIME
10. QNH
11. winddir
12. windspd
13. windgust

_Note: we keep SCHEDULED_ARRIVAL_MONTH_cos even though its p-value exceeds 0.05, because the sine and cosine components only encode the month unambiguously as a pair._
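
To see why the two components belong together, consider a minimal sketch that assumes the month was encoded as `sin(2π·month/12)` and `cos(2π·month/12)`. The exact formula is not shown in this section, so treat it and the helper `encode_month` as assumptions for illustration:

```python
import numpy as np

def encode_month(month):
    """Map a month (1-12) to a point on the unit circle (assumed encoding)."""
    angle = 2 * np.pi * month / 12
    return np.sin(angle), np.cos(angle)

sin_1, cos_1 = encode_month(1)   # January
sin_5, cos_5 = encode_month(5)   # May

# January and May share the same sine value, so sine alone cannot separate them
print(f'January: sin = {sin_1:.3f}, cos = {cos_1:.3f}')
print(f'May:     sin = {sin_5:.3f}, cos = {cos_5:.3f}')
```

With only the sine column, the two months collapse onto the same value; the cosine column is what tells them apart.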

In [129]:
df_class_dfw = df_dfw.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'QNH', 'winddir', 'windspd', 'windgust'], axis=1)

# Duplicating transformation for test dataset
df_class_test_dfw = df_test_dfw.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'thunderyshower', 'vicinity', 'SCHEDULED_TIME', 'QNH', 'winddir', 'windspd', 'windgust'], axis=1)

In [130]:
df_class_dfw.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,dew_point,squall,temp,visibility,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
331141,-4.0,0.0,12.0,0,24.0,10.0,72.333333,4.924039,0.5,-0.866025,-0.979084,0.203456,0
17378,-4.0,0.0,21.0,1,24.0,10.0,55.666667,0.0,-0.866025,-0.5,0.136167,-0.990686,0
330186,69.0,0.0,3.0,0,7.0,10.0,57.666667,3.064178,0.5,0.866025,-0.942261,-0.33488,1
331728,74.0,0.0,21.0,0,27.0,10.0,66.666667,4.330127,-0.5,-0.866025,-0.979084,0.203456,1
331292,2.0,0.0,19.0,0,21.0,10.0,55.666667,3.830222,0.5,-0.866025,0.136167,-0.990686,0


**(ii) Encoding Features (for classification)**

In [131]:
# Setting X_train, X_test, y_train & y_test
X_train = df_class_dfw.drop('DELAY', axis=1)
y_train = df_class_dfw['DELAY']
X_test = df_class_test_dfw.drop('DELAY', axis=1)
y_test = df_class_test_dfw['DELAY']

In [132]:
X_train.shape

(5544, 12)

In [133]:
X_test.shape

(1387, 12)

In [134]:
X_train.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,dew_point,squall,temp,visibility,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos
331141,-4.0,0.0,12.0,0,24.0,10.0,72.333333,4.924039,0.5,-0.866025,-0.979084,0.203456
17378,-4.0,0.0,21.0,1,24.0,10.0,55.666667,0.0,-0.866025,-0.5,0.136167,-0.990686


In [135]:
X_test.head(2)

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,dew_point,squall,temp,visibility,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos
16891,-1.0,0.0,19.0,0,23.0,10.0,65.333333,5.0,-0.5,-0.8660254,-0.997669,-0.068242
17606,-4.0,0.0,22.0,1,24.0,10.0,65.0,2.298133,-1.0,-1.83697e-16,-0.136167,-0.990686


In [136]:
X_train.shape

(5544, 12)

In [137]:
X_test.shape

(1387, 12)

In [138]:
# Arranging the order of features to be the same
X_test = X_test[X_train.columns]

**Establishing Baseline Accuracy**

In [139]:
# Baseline Accuracy
y_train.value_counts(normalize=True)

0    0.818001
1    0.117965
2    0.051768
3    0.012266
Name: DELAY, dtype: float64

Baseline Accuracy: <u>81.80%</u>

**(iii) Scaling dataset**

In [140]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [141]:
# Exporting scaler for deployment
filename = '../scalers/dfw_scaler.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(ss, f)

**(iv) Employing Oversampling (SMOTE) to balance the training dataset**

In [142]:
sm = SMOTE(random_state=42)
X_train_sc_res, y_train_res = sm.fit_resample(X_train_sc, y_train)

In [143]:
y_train_res.value_counts(normalize=True)

3    0.25
2    0.25
1    0.25
0    0.25
Name: DELAY, dtype: float64

In [144]:
df_classification_enc = pd.DataFrame(X_train_sc_res, columns=X_train.columns)
df_classification_enc['DELAY'] = y_train_res

df_classification_enc_test = pd.DataFrame(X_test_sc, columns=X_test.columns)
df_classification_enc_test['DELAY'] = y_test.values

In [145]:
df_classification_enc.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,dew_point,squall,temp,visibility,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
0,-0.366729,-0.200019,0.040071,-0.724109,0.809347,0.366635,0.821866,0.433515,0.580484,-1.122557,-1.069125,0.545816,0
1,-0.366729,-0.200019,0.87629,1.381008,0.809347,0.366635,-0.271767,-1.108409,-1.400841,-0.61033,1.039344,-1.119966,0
2,1.438444,-0.200019,-0.796147,-0.724109,-1.069316,0.366635,-0.140531,-0.148886,0.580484,1.301324,-0.999508,-0.205142,1
3,1.562086,-0.200019,0.87629,-0.724109,1.140876,0.366635,0.450031,0.247536,-0.869946,-1.122557,-1.069125,0.545816,1
4,-0.218358,-0.200019,0.690463,-0.724109,0.477818,0.366635,-0.271767,0.090995,0.580484,-1.122557,1.039344,-1.119966,0


In [146]:
df_classification_enc_test.head()

Unnamed: 0,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,dew_point,squall,temp,visibility,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
0,-0.292544,-0.200019,0.690463,-0.724109,0.698838,0.366635,0.36254,0.457301,-0.869946,-1.122557,-1.104261,0.166807,0
1,-0.366729,-0.200019,0.969203,1.381008,0.809347,0.366635,0.340667,-0.388767,-1.595161,0.089384,0.524477,-1.119966,0
2,-0.292544,-0.200019,4.035337,-0.724109,-0.406258,0.366635,-0.118658,-1.108409,1.111378,-0.61033,1.039344,-1.119966,0
3,-0.243087,-0.200019,0.318811,-0.724109,0.477818,0.366635,0.931229,0.927014,0.580484,-1.122557,-0.762635,-0.542444,0
4,-0.342,-0.200019,0.87629,-0.724109,0.698838,0.366635,1.34681,-0.890902,-0.869946,-1.122557,-0.762635,-0.542444,0


In [147]:
df_classification_enc.shape

(18140, 13)

In [148]:
df_classification_enc_test.shape

(1387, 13)

**(v) Exporting the data for modeling**

In [149]:
df_classification_enc.to_csv('../datasets/combined_data_class_dfw.csv', index=False)
df_classification_enc_test.to_csv('../datasets/combined_data_class_test_dfw.csv', index=False)

---
### (11) Feature selection, encoding, scaling, SMOTE and exporting for MCO

---

**(i) Feature Selection for Classification Modeling (MCO)**

In [150]:
df_mco.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,ARRIVAL_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust,ARRIVAL_DELAY/NO_DELAY,NUM_ARR_AVG_3HOUR,crosswind_comp,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
234138,WN,-4.0,90.0,-21.0,0.0,30.09,21.0,1,1,1,1,1,0,23.0,1,1,10.0,0.0,0.0,0,0,65.0,0.0,1.224647e-16,-1.0,0.398401,-0.917211,0
74316,DL,-4.0,97.0,-10.0,0.0,30.06,15.0,1,1,1,1,1,1,19.0,1,1,10.0,180.0,3.0,0,0,54.0,3.0,-2.449294e-16,1.0,-0.519584,0.854419,0
260318,F9,-6.0,90.0,-6.0,0.0,29.93,19.0,1,1,1,1,1,0,22.0,1,1,10.0,320.0,6.0,0,0,56.0,4.596267,1.224647e-16,-1.0,0.136167,-0.990686,0
74643,DL,5.0,100.0,-6.0,0.0,29.98,16.0,1,1,1,1,1,1,16.0,1,1,0.25,130.0,6.0,0,0,71.666667,3.856726,-2.449294e-16,1.0,0.81697,-0.57668,0
72510,DL,-3.0,90.0,-16.0,0.0,30.16,22.0,1,1,1,1,1,0,24.0,1,1,10.0,230.0,3.0,0,0,57.0,1.928363,-0.5,-0.866025,0.136167,-0.990686,0


In [151]:
features_class = df_mco.drop(['DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'ARRIVAL_DELAY'], axis=1)
target_class = df_mco['DELAY']

In [152]:
features_class_numeric = features_class.drop(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity'], axis=1)
features_class_cat = features_class[['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity']]

**Using ANOVA for Numerical Features**

In [153]:
for i in zip(features_class_numeric.columns, f_classif(features_class_numeric, target_class)[1]):
    print (i)

('DEPARTURE_DELAY', 0.0)
('SCHEDULED_TIME', 2.106113803677254e-06)
('LATE_AIRCRAFT_DELAY', 0.0)
('QNH', 9.696047485125571e-06)
('dew_point', 1.4357355664071547e-31)
('temp', 7.865253709984343e-30)
('visibility', 0.000455060099280293)
('winddir', 0.9378862891736903)
('windspd', 0.00641597523551054)
('windgust', 0.3087787804788761)
('NUM_ARR_AVG_3HOUR', 0.2725881151019376)
('crosswind_comp', 0.19125362884862782)
('SCHEDULED_ARRIVAL_MONTH_sin', 3.2004442689146905e-06)
('SCHEDULED_ARRIVAL_MONTH_cos', 5.903236375840817e-11)
('SCHEDULED_ARRIVAL_HOUR_sin', 1.3747413379689274e-47)
('SCHEDULED_ARRIVAL_HOUR_cos', 1.2559011552524842e-46)


**Using Chi-Square for Categorical Features**

In [154]:
features_class_cat.columns

Index(['AIRLINE_CODE', 'lightning', 'low_intensity', 'rain', 'shower', 'snow',
       'squall', 'thunderyshower', 'vicinity'],
      dtype='object')

In [155]:
# chi2 requires numeric input, so label encode AIRLINE_CODE first.
# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning.

le = LabelEncoder()
features_class_cat = features_class_cat.copy()
features_class_cat['AIRLINE_CODE'] = le.fit_transform(features_class_cat['AIRLINE_CODE'])


In [156]:
features_class_cat.head()

Unnamed: 0,AIRLINE_CODE,lightning,low_intensity,rain,shower,snow,squall,thunderyshower,vicinity
234138,4,1,1,1,1,1,0,1,1
74316,0,1,1,1,1,1,1,1,1
260318,2,1,1,1,1,1,0,1,1
74643,0,1,1,1,1,1,1,1,1
72510,0,1,1,1,1,1,0,1,1


In [157]:
for i in zip(features_class_cat.columns, chi2(features_class_cat, target_class)[1]):
    print (i)

('AIRLINE_CODE', 2.6710957992456045e-79)
('lightning', 0.9970788006188568)
('low_intensity', 0.9999765263022226)
('rain', 0.9999765263022226)
('shower', 0.4723258212473813)
('snow', 0.9696522387410988)
('squall', 0.6321808017580879)
('thunderyshower', 0.9970788006188568)
('vicinity', 0.9999765263022226)


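Based on the ANOVA and chi-squared tests above, we fail to reject the null hypothesis (that a feature has no relationship with the delay class) at the 5% significance level, i.e. 95% confidence, for every feature with a p-value above 0.05.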

Hence, we will drop the following feature(s):
1. lightning
2. low_intensity
3. rain
4. shower
5. snow
6. squall
7. thunderyshower
8. vicinity
9. winddir
10. windgust
11. NUM_ARR_AVG_3HOUR
12. crosswind_comp

_This is the first time NUM_ARR_AVG_3HOUR has not been significant for predicting delay (p ≈ 0.27). There are several possible explanations; for example, certain flights may follow a more regular scheduling pattern._

_Their schedules might already be optimized to avoid crowded airspace, which would explain why the feature carries little signal here._

_This is a potential area for further research._

In [158]:
df_class_mco = df_mco.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'winddir', 'windgust', 'NUM_ARR_AVG_3HOUR', 'crosswind_comp'], axis=1)

# Duplicating transformation for test dataset
df_class_test_mco = df_test_mco.drop(['ARRIVAL_DELAY', 'ARRIVAL_DELAY/NO_DELAY', 'lightning', 'low_intensity', 'rain', 'shower', 'snow', 'squall', 'thunderyshower', 'vicinity', 'winddir', 'windgust', 'NUM_ARR_AVG_3HOUR', 'crosswind_comp'], axis=1)

In [159]:
df_class_mco.head()

Unnamed: 0,AIRLINE_CODE,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,windspd,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,DELAY
234138,WN,-4.0,90.0,0.0,30.09,21.0,23.0,10.0,0.0,1.224647e-16,-1.0,0.398401,-0.917211,0
74316,DL,-4.0,97.0,0.0,30.06,15.0,19.0,10.0,3.0,-2.449294e-16,1.0,-0.519584,0.854419,0
260318,F9,-6.0,90.0,0.0,29.93,19.0,22.0,10.0,6.0,1.224647e-16,-1.0,0.136167,-0.990686,0
74643,DL,5.0,100.0,0.0,29.98,16.0,16.0,0.25,6.0,-2.449294e-16,1.0,0.81697,-0.57668,0
72510,DL,-3.0,90.0,0.0,30.16,22.0,24.0,10.0,3.0,-0.5,-0.866025,0.136167,-0.990686,0


**(ii) Encoding Features (for classification)**

In [160]:
# Setting X_train, X_test, y_train & y_test
X_train = df_class_mco.drop('DELAY', axis=1)
y_train = df_class_mco['DELAY']
X_test = df_class_test_mco.drop('DELAY', axis=1)
y_test = df_class_test_mco['DELAY']

In [161]:
X_train.shape

(6493, 13)

In [162]:
X_test.shape

(1624, 13)

In [163]:
X_train = pd.get_dummies(X_train, columns=['AIRLINE_CODE'], drop_first=True)
X_test = pd.get_dummies(X_test, columns=['AIRLINE_CODE'], drop_first=True)

In [164]:
X_train.head(2)

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,windspd,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_WN
234138,-4.0,90.0,0.0,30.09,21.0,23.0,10.0,0.0,1.224647e-16,-1.0,0.398401,-0.917211,0,0,0,1
74316,-4.0,97.0,0.0,30.06,15.0,19.0,10.0,3.0,-2.449294e-16,1.0,-0.519584,0.854419,0,0,0,0


In [165]:
X_test.head(2)

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,windspd,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_WN
74131,1.0,103.0,0.0,30.1,10.0,14.0,10.0,10.0,-0.5,0.866025,-0.979084,0.203456,0,0,0
74711,53.0,97.0,33.0,30.14,19.0,21.0,0.25,7.0,-2.449294e-16,1.0,-0.730836,0.682553,0,0,0


In [166]:
X_train.shape

(6493, 16)

In [167]:
X_test.shape

(1624, 15)

Notice that the train dataset has one more column than the test dataset. Let us find out which column is extra.

In [168]:
difference = []

for i in X_train.columns:
    if i not in X_test.columns:
        difference.append(i)

In [169]:
difference

['AIRLINE_CODE_EV']

It seems that the test dataset has no flights operated by EV. So that the model can later be evaluated on the same feature set it was trained on, we add an 'AIRLINE_CODE_EV' column to the test dataset and set every value to 0 _(did not happen)_.

In [170]:
X_test['AIRLINE_CODE_EV'] = 0

In [171]:
# Arranging the order of features to be the same
X_test = X_test[X_train.columns]
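
The two steps above (adding the missing dummy column, then reordering) can also be done in a single call. A minimal sketch, assuming toy frames `train_demo` and `test_demo` (hypothetical names and data, not part of this notebook) and using `DataFrame.reindex` with `fill_value=0`:

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_test
train_demo = pd.DataFrame({"AIRLINE_CODE": ["DL", "EV", "WN"], "temp": [23.0, 19.0, 22.0]})
test_demo = pd.DataFrame({"AIRLINE_CODE": ["DL", "WN"], "temp": [14.0, 21.0]})

train_enc = pd.get_dummies(train_demo, columns=["AIRLINE_CODE"], drop_first=True)
test_enc = pd.get_dummies(test_demo, columns=["AIRLINE_CODE"], drop_first=True)

# Add any dummy column seen only in training (filled with 0) and
# put the columns in the training order, all in one call
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

One thing to keep in mind: `reindex` also silently drops any column that appears only in the test frame, so it is worth checking for those separately if that case is possible.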

**Establishing Baseline Accuracy**

In [172]:
# Baseline Accuracy
y_train.value_counts(normalize=True)

0    0.848760
1    0.102726
2    0.040659
3    0.007855
Name: DELAY, dtype: float64

Baseline Accuracy: <u>84.88%</u> (the share of the majority class, 0, in the training set)
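
The baseline is the accuracy of a model that always predicts the most frequent class. A small sketch of how that number can be read off programmatically, using a hypothetical toy label series `y_demo` in place of `y_train`:

```python
import pandas as pd

# Hypothetical toy labels; in the notebook this would be y_train
y_demo = pd.Series([0, 0, 0, 0, 1, 0, 2, 0, 1, 0])

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = y_demo.value_counts(normalize=True).max()

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier trained later should beat this figure on accuracy to be considered useful, although with classes this imbalanced, per-class metrics are likely to be more informative.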

**(iii) Scaling dataset**

In [173]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [174]:
# Exporting scaler for deployment
filename = '../scalers/mco_scaler.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(ss, f)
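
At deployment time the saved scaler has to be loaded back and applied to new rows with `transform` (not `fit_transform`). A minimal round-trip sketch, assuming toy data `X_demo` and a temporary file path (both hypothetical; the notebook itself writes to `'../scalers/mco_scaler.sav'`):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(X_demo)

# Hypothetical temporary path used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo_scaler.sav")

# Save the fitted scaler
with open(path, "wb") as f:
    pickle.dump(scaler, f)

# Load it back, as a deployment script would
with open(path, "rb") as f:
    loaded_scaler = pickle.load(f)

# The loaded scaler should reproduce the original transformation
same = np.allclose(loaded_scaler.transform(X_demo), scaler.transform(X_demo))
print(same)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so it may be worth recording the version alongside the file.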

**(iv) Employing Oversampling (SMOTE) to balance the training dataset**

In [175]:
sm = SMOTE(random_state=42)
X_train_sc_res, y_train_res = sm.fit_resample(X_train_sc, y_train)

In [176]:
y_train_res.value_counts(normalize=True)

3    0.25
2    0.25
1    0.25
0    0.25
Name: DELAY, dtype: float64
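
As a quick sanity check on the balanced output: with its default `sampling_strategy`, SMOTE oversamples every minority class up to the majority class count, so the resampled training set should hold four times the number of class 0 rows. A small sketch using the figures printed earlier in this section (6493 training rows, class 0 share of 0.848760); the variable names are illustrative only:

```python
# Figures taken from the outputs earlier in this section
n_train = 6493             # rows in X_train
majority_share = 0.848760  # share of class 0 in y_train
n_classes = 4              # DELAY takes the values 0, 1, 2, 3

# Number of class 0 rows before resampling
majority_count = round(n_train * majority_share)

# After SMOTE every class is brought up to the majority count
resampled_rows = majority_count * n_classes

print(majority_count, resampled_rows)
```

The result should equal the row count of the resampled training frame.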

In [177]:
df_classification_enc = pd.DataFrame(X_train_sc_res, columns=X_train.columns)
df_classification_enc['DELAY'] = y_train_res

df_classification_enc_test = pd.DataFrame(X_test_sc, columns=X_test.columns)
df_classification_enc_test['DELAY'] = y_test.values
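
Note the `.values` on `y_test`: the test frame is built from a NumPy array and so gets a fresh 0..n-1 index, while `y_test` still carries the original row labels. A minimal sketch with hypothetical toy data showing what happens if the Series is assigned directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: scaled features (NumPy array) and labels
# that keep their original, non-consecutive row index
X_demo_sc = np.array([[0.1, -0.2], [1.3, 0.4], [-0.7, 0.9]])
y_demo = pd.Series([0, 1, 0], index=[74131, 74711, 72510], name="DELAY")

frame = pd.DataFrame(X_demo_sc, columns=["a", "b"])  # index is 0, 1, 2

# Assigning the Series aligns on index labels: nothing matches, so all NaN
frame["DELAY_aligned"] = y_demo

# Assigning the underlying array is positional, which is what we want here
frame["DELAY"] = y_demo.values

print(frame)
```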

In [178]:
df_classification_enc.head()

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,windspd,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_WN,DELAY
0,-0.325555,-0.829324,-0.19734,-0.114255,0.868599,0.685551,0.355234,-1.662574,-0.101151,-1.304766,0.911083,-1.017175,-0.012411,-0.196758,-0.195912,1.593847,0
1,-0.325555,0.130607,-0.19734,-0.30895,0.300318,0.230538,0.355234,-0.873113,-0.101151,1.458055,-0.46556,1.595281,-0.012411,-0.196758,-0.195912,-0.627413,0
2,-0.382248,-0.829324,-0.19734,-1.15263,0.679172,0.571798,0.355234,-0.083652,-0.101151,-1.304766,0.517827,-1.125521,-0.012411,5.082379,-0.195912,-0.627413,0
3,-0.070436,0.542006,-0.19734,-0.828138,0.395031,-0.110722,-2.880781,-0.083652,-0.101151,1.458055,1.538784,-0.515027,-0.012411,-0.196758,-0.195912,-0.627413,0
4,-0.297209,-0.829324,-0.19734,0.340035,0.963313,0.799304,0.355234,-0.873113,-0.831951,-1.119693,0.517827,-1.125521,-0.012411,-0.196758,-0.195912,-0.627413,0


In [179]:
df_classification_enc_test.head()

Unnamed: 0,DEPARTURE_DELAY,SCHEDULED_TIME,LATE_AIRCRAFT_DELAY,QNH,dew_point,temp,visibility,windspd,SCHEDULED_ARRIVAL_MONTH_sin,SCHEDULED_ARRIVAL_MONTH_cos,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_NK,AIRLINE_CODE_WN,DELAY
0,-0.183823,0.953405,-0.19734,-0.049356,-0.17325,-0.338229,0.355234,0.968963,-0.831951,1.272981,-1.154643,0.635367,-0.012411,-0.196758,-0.195912,-0.627413,0
1,1.290198,0.130607,1.6463,0.210238,0.679172,0.458044,-2.880781,0.179502,-0.101151,1.458055,-0.782361,1.341846,-0.012411,-0.196758,-0.195912,-0.627413,1
2,0.241376,0.404873,-0.19734,-1.412224,-0.078537,0.116784,0.355234,-1.662574,1.164632,-0.614061,-0.28383,-1.017175,-0.012411,-0.196758,-0.195912,-0.627413,0
3,-0.325555,1.776202,-0.19734,0.859223,-0.078537,-0.565736,-2.797807,0.442656,1.360449,0.076644,-0.782361,1.341846,-0.012411,-0.196758,-0.195912,-0.627413,0
4,0.836653,-0.143659,1.199357,-0.568544,0.300318,-0.110722,0.355234,-1.662574,-0.101151,1.458055,-1.017878,1.013764,-0.012411,-0.196758,-0.195912,1.593847,1


In [180]:
df_classification_enc.shape

(22044, 17)

In [181]:
df_classification_enc_test.shape

(1624, 17)

**(v) Exporting the data for modeling**

In [182]:
df_classification_enc.to_csv('../datasets/combined_data_class_mco.csv', index=False)
df_classification_enc_test.to_csv('../datasets/combined_data_class_test_mco.csv', index=False)