<img src="../images/airplane-symbol.jpg" style="float: left; margin: 20px;" width="50" height="50"> 
#  Predicting Flight Delays (<i>a Proof-of-Concept</i>)

Author: Solomon Heng

---

# 7a. Classification Modeling ORD

## Processes covered in this notebook:
1. [Importing dataset](#(1)-Importing-dataset)
2. [Setting X_train, X_test, y_train & y_test](#(2)-Setting-X_train,-X_test,-y_train-&-y_test)
3. [Model (I): Logistic Regression](#(3)-Model-(I):-Logistic-Regression)
4. [Model (II): RandomForest](#(4)-Model-(II):-RandomForest)
5. [Model (III): XGBoost](#(5)-Model-(III):-XGBoost)
6. [Model (IV): Neural Networks](#(6)-Model-(IV):-Neural-Networks)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from keras import backend as K

import xgboost as xgb

import pickle

sns.set()

Using TensorFlow backend.


---
### (1) Importing dataset

---

In [2]:
df = pd.read_csv('../datasets/combined_data_class_ord.csv')
df_test = pd.read_csv('../datasets/combined_data_class_test_ord.csv')

In [3]:
pd.set_option('display.max_columns', 100)
df.head()

,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA,DELAY
0,-0.34979,-0.242388,-1.072925,0.838466,1.174306,-0.873559,-0.490446,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
1,-0.368073,-0.242388,1.225008,-1.428832,1.174306,-0.998968,0.937312,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
2,-0.331507,-0.242388,-1.13858,-0.389654,0.130878,1.275746,-0.95941,-1.182405,-0.08396,4.313681,-0.172818,-0.276453,-0.450001,-0.209903,0
3,-0.331507,-0.242388,0.831077,-0.389654,0.315013,-1.193113,0.209751,-1.182405,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,4.764104,0
4,-0.386357,-0.242388,-0.875959,0.838466,-0.176012,0.812001,-1.060596,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0


In [4]:
df_test.head()

,DEPARTURE_DELAY,LATE_AIRCRAFT_DELAY,QNH,dew_point,NUM_ARR_AVG_3HOUR,SCHEDULED_ARRIVAL_HOUR_sin,SCHEDULED_ARRIVAL_HOUR_cos,AIRLINE_CODE_DL,AIRLINE_CODE_EV,AIRLINE_CODE_F9,AIRLINE_CODE_MQ,AIRLINE_CODE_NK,AIRLINE_CODE_OO,AIRLINE_CODE_UA,DELAY
0,-0.258373,-0.242388,0.568456,-0.200713,0.233175,-1.095128,-0.157449,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
1,-0.34979,-0.242388,0.174525,-1.239891,-0.257849,-1.095128,-0.157449,-1.182405,-0.08396,-0.231821,-0.172818,-0.276453,2.222216,-0.209903,1
2,-0.148673,-0.242388,0.24018,-0.106242,0.683281,-1.160247,0.583922,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
3,0.107294,-0.242388,-0.547683,-1.901186,0.0695,-1.193113,0.209751,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0
4,-0.313223,-0.242388,-0.547683,1.027407,0.0695,-0.347651,1.480399,0.845734,-0.08396,-0.231821,-0.172818,-0.276453,-0.450001,-0.209903,0


---
### (2) Setting X_train, X_test, y_train & y_test

---

In [5]:
X_train = df.drop('DELAY', axis=1)
y_train = df['DELAY']
X_test = df_test.drop('DELAY', axis=1)
y_test = df_test['DELAY']

In [6]:
X_train.shape

(15452, 14)

In [7]:
X_test.shape

(1251, 14)

In [8]:
X_train.columns

Index(['DEPARTURE_DELAY', 'LATE_AIRCRAFT_DELAY', 'QNH', 'dew_point',
       'NUM_ARR_AVG_3HOUR', 'SCHEDULED_ARRIVAL_HOUR_sin',
       'SCHEDULED_ARRIVAL_HOUR_cos', 'AIRLINE_CODE_DL', 'AIRLINE_CODE_EV',
       'AIRLINE_CODE_F9', 'AIRLINE_CODE_MQ', 'AIRLINE_CODE_NK',
       'AIRLINE_CODE_OO', 'AIRLINE_CODE_UA'],
      dtype='object')

In [9]:
len(X_train.columns)

14

---
**We attempted to use PCA to reduce dimensionality**

PCA did reduce the number of dimensions, but at the cost of lower accuracy and sensitivity, which runs counter to our goal: a higher sensitivity (recall) for the '15mins to 1hr' class. The models below are therefore trained on all 14 original features.
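
Sensitivity in this notebook is per-class recall: of all flights that truly belong to a class, the share the model assigns to that class. The following is a minimal pure-Python sketch of that definition. The helper `per_class_recall` and the toy labels are hypothetical and exist only for illustration; the notebook itself relies on `classification_report`.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correct predictions / actual members."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / actual[label] for label in sorted(actual)}

# Hypothetical toy labels (0 = <15mins, 1 = 15mins to 1hr); not taken from the dataset.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0]

recalls = per_class_recall(y_true, y_pred)
print(recalls)
```

In this toy case class 0 is recovered three times out of four and class 1 only two times out of four, which is the kind of gap the notebook tries to narrow for the '15mins to 1hr' class.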


---
### (3) Model (I): Logistic Regression

---

In [10]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
cross_val_score(lr, X_train, y_train).mean()

0.87432097989884

In [12]:
lr.score(X_test, y_test)

0.8625099920063949

In [13]:
target_names = ['<15mins', '15mins to 1hr', '1 to 3hrs', '>3hrs']
print(classification_report(y_test, lr.predict(X_test),target_names=target_names))

               precision    recall  f1-score   support

      <15mins       0.94      0.92      0.93       967
15mins to 1hr       0.54      0.55      0.55       178
    1 to 3hrs       0.70      0.82      0.76        83
        >3hrs       0.84      0.91      0.87        23

     accuracy                           0.86      1251
    macro avg       0.76      0.80      0.78      1251
 weighted avg       0.87      0.86      0.86      1251
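
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. As a small sketch using the rounded recall values and supports printed above: the macro average treats every class equally, while the weighted average weights each class by its support, so the large '<15mins' class dominates it.

```python
# Per-class recall and support, copied from the logistic regression report above.
recall = [0.92, 0.55, 0.82, 0.91]
support = [967, 178, 83, 23]

# Macro average: plain mean over classes.
macro_recall = sum(recall) / len(recall)

# Weighted average: mean over classes, weighted by the number of test samples.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 2), round(weighted_recall, 2))
```

Both values round to the figures in the report (0.80 and 0.86). The gap between them reflects the weak recall on the smaller '15mins to 1hr' class, which the weighted average largely hides.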



In [14]:
# Exporting model

# lr_filename = '../models/classification_logreg_model.sav'
# pickle.dump(lr, open(lr_filename, 'wb'))

---
### (4) Model (II): RandomForest

---

In [15]:
rf = RandomForestClassifier()

In [16]:
rf_params = {
  "n_estimators":[100,200,300],
  "min_samples_split":[10,20,30],
  "max_depth":[10,15,20]
}

rf_cv = RandomizedSearchCV(rf, param_distributions=rf_params, scoring='f1_micro', n_iter=2, n_jobs=4, verbose=2)

In [17]:
rf_cv.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:   15.4s finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
            

In [18]:
rf_cv.best_params_

{'n_estimators': 300, 'min_samples_split': 20, 'max_depth': 20}
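
With `n_iter=2`, the search above evaluates only two of the possible parameter combinations. The sketch below enumerates the full grid with the standard library and draws two candidates at random. It illustrates the idea only and is not how scikit-learn samples internally; the seed is a hypothetical choice made to keep the sketch repeatable.

```python
import itertools
import random

rf_params = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [10, 20, 30],
    "max_depth": [10, 15, 20],
}

# Every combination an exhaustive grid search would try.
names = list(rf_params)
grid = [dict(zip(names, values)) for values in itertools.product(*rf_params.values())]
print(len(grid))

# A randomized search with n_iter=2 evaluates only two of them.
rng = random.Random(0)  # hypothetical seed, for repeatability of this sketch only
sampled = rng.sample(grid, 2)
print(sampled)
```

Because so few candidates are drawn, `best_params_` can change from run to run. Passing `random_state` to `RandomizedSearchCV`, or raising `n_iter`, would make the result more stable.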

In [19]:
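# Note: these values differ from rf_cv.best_params_ shown above. With n_iter=2 and no
# random_state, RandomizedSearchCV samples different candidates on every run.
opt_rf = RandomForestClassifier(n_estimators=100, min_samples_split=20, max_depth=15)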

In [20]:
opt_rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [21]:
cross_val_score(opt_rf, X_train, y_train).mean()

0.9315955184641913

In [22]:
cross_val_score(opt_rf, X_train, y_train, scoring='f1_macro').mean()

0.9331287855068604

In [23]:
opt_rf.score(X_test, y_test)

0.8848920863309353

In [24]:
target_names = ['<15mins', '15mins to 1hr', '1 to 3hrs', '>3hrs']
print(classification_report(y_test, opt_rf.predict(X_test),target_names=target_names))

               precision    recall  f1-score   support

      <15mins       0.93      0.96      0.95       967
15mins to 1hr       0.65      0.53      0.59       178
    1 to 3hrs       0.72      0.78      0.75        83
        >3hrs       0.88      0.91      0.89        23

     accuracy                           0.88      1251
    macro avg       0.80      0.80      0.79      1251
 weighted avg       0.88      0.88      0.88      1251



In [25]:
# Exporting model

rf_filename = '../models/classification_rf_model_ord.sav'
# Use a context manager so the file handle is closed after writing
with open(rf_filename, 'wb') as f:
    pickle.dump(opt_rf, f)

---
### (5) Model (III): XGBoost

---

In [26]:
xgbc = xgb.XGBClassifier()

In [27]:
xgbc_params = {
  "learning_rate":[0.01, 0.1, 0.2, 0.3],
  "max_depth":[5, 10, 15]
}

xgbc_cv = RandomizedSearchCV(xgbc, param_distributions=xgbc_params, scoring='f1_micro', n_iter=2, n_jobs=4, verbose=2)

In [28]:
xgbc_cv.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:   51.5s finished


RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=0, reg_alpha=0,
                                           reg_lambda=1, scale_pos_weight=1,
                                           seed=None, silent=None, subsample=1,
                                           verbosity=1),
                   iid=

In [29]:
xgbc_cv.best_params_

{'max_depth': 10, 'learning_rate': 0.3}

In [30]:
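# Note: learning_rate differs from xgbc_cv.best_params_ shown above. With n_iter=2 and no
# random_state, RandomizedSearchCV samples different candidates on every run.
opt_xgb = xgb.XGBClassifier(max_depth=10, learning_rate=0.2)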

In [31]:
opt_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [32]:
cross_val_score(opt_xgb, X_train, y_train).mean()

0.961689883668946

In [33]:
opt_xgb.score(X_test, y_test)

0.9000799360511591

In [34]:
target_names = ['<15mins', '15mins to 1hr', '1 to 3hrs', '>3hrs']
print(classification_report(y_test, opt_xgb.predict(X_test),target_names=target_names))

               precision    recall  f1-score   support

      <15mins       0.94      0.96      0.95       967
15mins to 1hr       0.72      0.59      0.65       178
    1 to 3hrs       0.81      0.81      0.81        83
        >3hrs       0.84      0.91      0.87        23

     accuracy                           0.90      1251
    macro avg       0.83      0.82      0.82      1251
 weighted avg       0.89      0.90      0.90      1251



In [35]:
# Exporting model

# xgb_filename = '../models/classification_xgb_model_ord.pkl'
# pickle.dump(opt_xgb, open(xgb_filename, 'wb'))

---
### (6) Model (IV): Neural Networks

---

In [36]:
from keras.utils import to_categorical

In [37]:
# One Hot Encoding the target
y_train_enc = to_categorical(y_train)
y_test_enc = to_categorical(y_test)
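
For intuition, one-hot encoding turns each integer class label into a row with a single 1 at the label's index, which is the target shape a 4-unit softmax layer expects. The pure-Python sketch below mimics that behaviour. The helper `one_hot` is hypothetical and is not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, e.g. 2 -> [0.0, 0.0, 1.0, 0.0]."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]

# Three example labels out of the four delay classes (0 to 3).
encoded = one_hot([0, 1, 3], num_classes=4)
print(encoded)

# Taking the index of the largest entry in each row recovers the labels,
# which is what argmax does on the network's predicted probabilities.
decoded = [row.index(max(row)) for row in encoded]
print(decoded)
```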

In [38]:
model = Sequential()

In [39]:
model.add(Dense(10, 
                input_dim=14, 
                activation='relu'))

# Dropout did not help improve accuracy in this case
# model.add(Dropout(0.5))

model.add(Dense(10, 
                activation='relu'))

model.add(Dense(4, 
                activation='softmax'))

In [40]:
model.summary() 

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 10)                150       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 44        
Total params: 304
Trainable params: 304
Non-trainable params: 0
_________________________________________________________________
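
The parameter counts in the summary follow from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. This short sketch checks the figures above; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 14 input features -> 10 units -> 10 units -> 4 output classes
layer_sizes = [14, 10, 10, 4]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts, total)
```

The result matches the 150, 110 and 44 parameters reported per layer and the total of 304.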


In [41]:
es = EarlyStopping(monitor='val_loss', patience=3)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [42]:
model.fit(X_train, y_train_enc, batch_size=8, epochs=200, validation_split=0.2, callbacks=[es])

Train on 12361 samples, validate on 3091 samples
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200


<keras.callbacks.callbacks.History at 0x2034b041f08>
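
Two details of this run can be sketched in plain Python. First, `validation_split=0.2` holds out the last 20% of the rows, which reproduces the 12361/3091 split in the log, assuming the split index is truncated to an integer. Second, `EarlyStopping(patience=3)` stops training once the validation loss has failed to improve for three consecutive epochs. The function `stopping_epoch` and the loss values below are hypothetical, since the per-epoch losses are not shown above.

```python
# Validation split: the training part is the first 80% of rows, truncated to an integer.
n_samples = 15452
n_train = int(n_samples * (1 - 0.2))
n_val = n_samples - n_train
print(n_train, n_val)

def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, chosen only to illustrate the rule.
val_losses = [0.50, 0.42, 0.38, 0.36, 0.37, 0.365, 0.361, 0.30]
stopped_at = stopping_epoch(val_losses, patience=3)
print(stopped_at)
```

In this made-up sequence the best loss occurs at epoch 4 and training stops at epoch 7, so the later improvement at epoch 8 is never reached. A larger `patience` trades longer training for a lower risk of stopping too early.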

In [43]:
model.evaluate(X_test, y_test_enc)



[0.3550378794072391, 0.8745003938674927]

In [56]:
y_nn_pred = model.predict(X_test).argmax(axis=-1)
# Same class labels as for the other models (y_test is unchanged)
target_names = ['<15mins', '15mins to 1hr', '1 to 3hrs', '>3hrs']

print(classification_report(y_test, y_nn_pred, target_names=target_names))

               precision    recall  f1-score   support

      <15mins       0.93      0.95      0.94       967
15mins to 1hr       0.60      0.53      0.57       178
    1 to 3hrs       0.71      0.82      0.76        83
        >3hrs       0.94      0.74      0.83        23

     accuracy                           0.87      1251
    macro avg       0.80      0.76      0.77      1251
 weighted avg       0.87      0.87      0.87      1251



In [57]:
# Exporting model (Keras models are usually saved with model.save() rather than pickle)

# nn_filename = '../models/classification_nn_model.sav'
# pickle.dump(model, open(nn_filename, 'wb'))