# Flight Delay Prediction

The goal of this notebook is to build a classifier that predicts flight delays using the 2015 US domestic flight delay dataset.

* [Part 1: Baseline Model](#baseline)

#### References
* [Kaggle Dataset](https://www.kaggle.com/usdot/flight-delays/kernels)
* [Data Columns Definition](https://www.kaggle.com/usdot/flight-delays/discussion/29308)

In [1]:
import datetime
import numpy as np
import pandas as pd
from scipy import stats

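from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OneHotEncoder, StandardScaler, RobustScaler
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer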
from sklearn.model_selection import ShuffleSplit, train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, roc_auc_score
from sklearn.decomposition import PCA, FastICA, FactorAnalysis

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'



---
<a id='baseline'></a>
## Part 1: Baseline Model

### 1. Load and prepare data
* Sample 10% of the original dataset to save computation time
* Drop rows with missing values in the selected columns
* Create labels: {1: arrival delay > 10 min, 0: otherwise}
* One-hot encode the categorical features
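
The order of the steps above matters: a comparison against `NaN` evaluates to `False`, so creating labels before dropping missing values would mark flights with unknown delays as "not delayed". A minimal sketch on a made-up toy frame (the values and the names `toy` and `clean` are illustrative assumptions, not taken from `flights.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the flight data (values are made up).
toy = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "OO", "AA"],
    "ARRIVAL_DELAY": [25.0, -3.0, np.nan, 10.0],
})

# NaN > 10 is False, so labeling first would silently assign label 0
# to the row whose delay is unknown.
label_before_drop = (toy["ARRIVAL_DELAY"] > 10).astype(int)

# Dropping rows with missing values first removes that row instead.
clean = toy.dropna().copy()
clean["label"] = (clean["ARRIVAL_DELAY"] > 10).astype(int)

print(label_before_drop.tolist())
print(clean["label"].tolist())
```

Note that a delay of exactly 10 minutes gets label 0, because the threshold is a strict inequality.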

In [4]:
# load and prepare data

df = pd.read_csv("data/flights.csv", low_memory=False)
df = df.sample(frac=0.1, random_state=0)
# df = df[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
#                  "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
df = df[["AIRLINE","DAY_OF_WEEK","DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]

# drop rows with missing values in the selected columns
df = df.dropna()

# create label: 1 if arrival delay > 10 min, else 0
df["label"] = (df["ARRIVAL_DELAY"] > 10).astype(int)
df = df.drop(["ARRIVAL_DELAY"], axis=1)

df.head()

| | AIRLINE | DAY_OF_WEEK | DEPARTURE_TIME | DISTANCE | label |
|---|---|---|---|---|---|
| 1678797 | EV | 5 | 2106.0 | 310 | 0 |
| 5099982 | AA | 7 | 1305.0 | 600 | 0 |
| 5652947 | OO | 1 | 1018.0 | 1550 | 1 |
| 5058446 | VX | 4 | 1645.0 | 1381 | 0 |
| 387609 | DL | 1 | 1337.0 | 515 | 0 |


In [5]:
# one-hot encode categorical features
# cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
cols = ["AIRLINE", "DAY_OF_WEEK"]
for col in cols:
    # map category values to integer codes 0..n-1
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

    # expand the integer codes into one indicator column per category
    enc = OneHotEncoder()
    col_oh = enc.fit_transform(df[col].values.reshape(-1, 1)).toarray()

    # name the new columns <col>0, <col>1, ...
    col_names = [col + str(i) for i in range(col_oh.shape[1])]

    # reset the index so rows align, then replace the original column
    df_col = pd.DataFrame(col_oh, columns=col_names)
    df = pd.concat([df.reset_index(drop=True), df_col], axis=1).drop([col], axis=1)
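
As a possible alternative to the loop above, `pd.get_dummies` can one-hot encode several columns in a single call. This is only a sketch on a made-up toy frame, not the notebook's method, and it names columns after the category values (for example `AIRLINE_EV`) rather than after integer codes (`AIRLINE0`, `AIRLINE1`, ...):

```python
import pandas as pd

# Toy frame with the same kind of columns as the notebook (values are made up).
toy = pd.DataFrame({
    "AIRLINE": ["EV", "AA", "OO"],
    "DAY_OF_WEEK": [5, 7, 1],
    "DISTANCE": [310, 600, 1550],
})

# Encode both categorical columns; the untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["AIRLINE", "DAY_OF_WEEK"], dtype=float)

print(encoded.columns.tolist())
```

One caveat with this approach: the encoded columns depend on the categories present in the frame, so it should be applied before the train/test split (as the notebook does) or the columns should be realigned afterwards.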

In [6]:
df.head()

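| | DEPARTURE_TIME | DISTANCE | label | AIRLINE0 | AIRLINE1 | AIRLINE2 | AIRLINE3 | AIRLINE4 | AIRLINE5 | AIRLINE6 | ... | AIRLINE11 | AIRLINE12 | AIRLINE13 | DAY_OF_WEEK0 | DAY_OF_WEEK1 | DAY_OF_WEEK2 | DAY_OF_WEEK3 | DAY_OF_WEEK4 | DAY_OF_WEEK5 | DAY_OF_WEEK6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2106.0 | 310 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1305.0 | 600 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1018.0 | 1550 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1645.0 | 1381 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1337.0 | 515 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |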


### 2. Train a Random Forest classifier as a baseline model
* Split the dataset 70/30 into a training set and a test set
* Train a Random Forest classifier with default parameters
* Evaluate on the test set using the AUC metric
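
The three steps above can be sketched end to end on synthetic data. This is an illustrative sketch only: the data comes from `make_classification` rather than the flight dataset, and the helper `auc_scores` is a hypothetical name that takes the labels as explicit arguments instead of reading them from global variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the flight features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70/30 train-test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)


def auc_scores(model, X_tr, y_tr, X_te, y_te):
    """Return (train AUC, test AUC) from the predicted positive-class probabilities."""
    return (roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
            roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))


# Random Forest with default parameters, as in the baseline.
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
train_auc, test_auc = auc_scores(model, X_tr, y_tr, X_te, y_te)
print(train_auc, test_auc)
```

Reporting both scores makes the gap between training and test performance visible, which is the usual first check for overfitting.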

In [8]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(df.drop(["label"], axis=1), df["label"], 
                                                    random_state=0, test_size=0.3)

print(X_train.shape, X_test.shape)

(399998, 23) (171428, 23)


In [14]:
# fit and evaluate model
def auc(model, train, test):
    '''Return ROC AUC on the training set and test set (uses the global y_train and y_test)'''
    return (roc_auc_score(y_train, model.predict_proba(train)[:,1]),
            roc_auc_score(y_test, model.predict_proba(test)[:,1]))

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
auc(rf, X_train, X_test)

(0.9968294425293539, 0.633841988128292)