This project predicts whether a flight will arrive more than 15 minutes late, using a Scikit-learn Decision Tree Classifier trained on US flight data from January 2019 and January 2020. The dataset and the description of its variables are available here:

https://www.kaggle.com/divyansh22/flight-delay-prediction

The challenge here is that our data is significantly imbalanced: flights are on time far more often than they are delayed. We therefore need a model that can effectively separate the two classes, 'on time' and 'delayed'. This is a binary classification problem, and the AUC (Area Under the ROC Curve) will be the most relevant metric to evaluate our model.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from pandas_profiling import ProfileReport as pr
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_roc_curve

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df1 = pd.read_csv('/kaggle/input/flight-delay-prediction/Jan_2019_ontime.csv')
df2 = pd.read_csv('/kaggle/input/flight-delay-prediction/Jan_2020_ontime.csv')

Checking column structures before concatenating dataframes:

In [None]:
set(df1.columns) == set(df2.columns)

In [None]:
df = pd.concat([df1,df2])

Our main dataframe now consists of 22 columns and 1191331 rows:

In [None]:
df.info()

In [None]:
pd.set_option('display.max_columns', None)
df.head()

To explore the data, let's generate a pandas-profiling report:

In [None]:
pr(df)

The above data profile report allows us to establish relevant information such as column redundancy, feature correlation and missing values at a glance. To solve our problem, we will choose 'ARR_DEL15' as the binary target label (0= on time, 1= late) and select the following features:

    'DAY_OF_WEEK': day of the week as a number, starting from Monday (1). Will be set to 0 for a weekday and 1 for a weekend day.
    'OP_UNIQUE_CARRIER': carrier identifier.
    'DEP_TIME_BLK': departure time block (time-of-day chunks over 24 hours).
    'ORIGIN': departure airport identifier.
    'DEST': destination airport identifier.
    'DISTANCE': flight distance.

We will drop all flights that were diverted or cancelled, for which ARR_DEL15 is NaN. 
We will also drop all rows that contain NaN values.

In [None]:
df = df[df['DIVERTED'] == 0]
df = df[df['CANCELLED'] == 0]

In [None]:
df = df[['DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'DEP_TIME_BLK', 'ORIGIN', 'DEST', 'DISTANCE', 'ARR_DEL15']]

In [None]:
# dropna() returns a new dataframe, so the result must be assigned back
df = df.dropna()
df = df.reset_index(drop=True)

In [None]:
df = df.rename(columns = {'DAY_OF_WEEK' : 'ON_WEEKEND'})
df['ON_WEEKEND'] = (df['ON_WEEKEND'] > 5).astype(int)

As per our Pandas profiling report, the data is mostly categorical, and some features such as 'ORIGIN' and 'DEST' have a large number of distinct values (353 each).

Our strategy is to rank the categories of each categorical feature by the number of delays they account for (the sum of ARR_DEL15), bin them into quantiles, and use the quantile index as a weight: higher quantile = more delays = higher weight. Our model will eventually be trained on the resulting ordinal data.
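As a minimal sketch of this encoding on made-up data (the toy dataframe, the `CARRIER` column name and the choice of 2 quantiles are assumptions for illustration only), the three steps are: sum the delays per category, bin those totals with `pd.qcut`, and map each row to its bin index.

```python
import pandas as pd

# Hypothetical toy data: a carrier code and a binary delay flag per flight.
toy = pd.DataFrame({
    "CARRIER": ["AA", "AA", "AA", "BB", "BB", "CC", "CC", "CC", "DD", "DD"],
    "ARR_DEL15": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0],
})

# Step 1: total number of delays per category (AA=3, BB=2, CC=0, DD=1).
delays = toy.groupby("CARRIER")["ARR_DEL15"].sum()

# Step 2: bin the categories into 2 quantiles of that total.
# labels=False returns the integer bin index (0 = fewest delays).
weights = pd.qcut(delays, 2, labels=False)

# Step 3: replace each row's category with its ordinal weight.
toy["CARRIER_cat"] = toy["CARRIER"].map(weights)
```

Because the totals are sums, a category with many flights can rank high simply through volume; ranking by the mean of ARR_DEL15 (the delay rate) would be one alternative to consider.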

Let's start with the carriers. There are 17 of them; let's rank them in quantiles by the number of delays they generate:

In [None]:
df['OP_UNIQUE_CARRIER'].nunique()

In [None]:
carrier_df = df[['OP_UNIQUE_CARRIER','ARR_DEL15']].groupby('OP_UNIQUE_CARRIER').sum().sort_values(by='ARR_DEL15',ascending=False)
carrier_df['CARRIER_cat'] = pd.qcut(carrier_df['ARR_DEL15'], 17, labels = False)
carrier_df

Now let's replace the carrier identifier with its quantile number/weight in the main dataframe 'df':

In [None]:
data_carrier = carrier_df.loc[df['OP_UNIQUE_CARRIER']].reset_index()
df['CARRIER_cat'] = data_carrier['CARRIER_cat']
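Note that this lookup assigns a freshly re-indexed column back to `df`, so it relies on `df` having a default 0..n-1 index (which the earlier `reset_index(drop=True)` provides). The sketch below, on hypothetical toy frames (`lookup` and `flights` are illustrative names), compares it with `Series.map`, which keeps the original row index and so does not depend on that.

```python
import pandas as pd

# Hypothetical lookup table: one weight per carrier, indexed by carrier code.
lookup = pd.DataFrame(
    {"CARRIER_cat": [0, 1]},
    index=pd.Index(["AA", "BB"], name="OP_UNIQUE_CARRIER"),
)

# Hypothetical flights table whose index is NOT 0..n-1.
flights = pd.DataFrame({"OP_UNIQUE_CARRIER": ["BB", "AA", "BB"]}, index=[10, 20, 30])

# Approach used in the notebook: label lookup, then a fresh 0..n-1 index.
via_loc = lookup.loc[flights["OP_UNIQUE_CARRIER"]].reset_index()["CARRIER_cat"]

# Alternative: map each code to its weight, keeping the flights index.
via_map = flights["OP_UNIQUE_CARRIER"].map(lookup["CARRIER_cat"])

# Assignment aligns on index, so via_loc only lines up with a 0..n-1 index.
misaligned = flights.assign(CARRIER_cat=via_loc)["CARRIER_cat"]
```

Both approaches give the same weights; on a frame with a non-default index, only the `map` result can be assigned back without producing NaN values.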

Now let's have a look at the time blocks. There are 19 of them, so let's apply the same quantile indexing method: rush hours, which generate more delays, will be penalized with a higher weight.

In [None]:
df['DEP_TIME_BLK'].nunique()

In [None]:
time_blk_df = df[['DEP_TIME_BLK','ARR_DEL15']].groupby('DEP_TIME_BLK').sum().sort_values(by='ARR_DEL15',ascending=False)
time_blk_df['TIME_cat'] = pd.qcut(time_blk_df['ARR_DEL15'], 19, labels = False)
time_blk_df

In [None]:
data_time = time_blk_df.loc[df['DEP_TIME_BLK']].reset_index()
df['DEP_TIME_cat'] = data_time['TIME_cat']

Now let's have a look at the departure airports feature 'ORIGIN':

In [None]:
df['ORIGIN'].nunique()

This time around, let's generate 25 quantiles, each containing a group of departure airports that generate a similar number of delays. We have observed that using more quantile bins does not improve our model's performance.

In [None]:
origin_df = df[['ORIGIN','ARR_DEL15']].groupby('ORIGIN').sum().sort_values(by='ARR_DEL15',ascending=False)
origin_df['ORIGIN_cat'] = pd.qcut(origin_df['ARR_DEL15'], 25, labels = False)
origin_df

In [None]:
data_origin = origin_df.loc[df['ORIGIN']].reset_index()
df['ORIGIN_cat'] = data_origin['ORIGIN_cat']

Same approach with the destination airports feature 'DEST':

In [None]:
df['DEST'].nunique()

In [None]:
dest_df = df[['DEST','ARR_DEL15']].groupby('DEST').sum().sort_values(by='ARR_DEL15',ascending=False)
dest_df['DEST_cat'] = pd.qcut(dest_df['ARR_DEL15'], 25, labels = False)
dest_df

In [None]:
data_dest = dest_df.loc[df['DEST']].reset_index()
df['DEST_cat'] = data_dest['DEST_cat']

Let's have a look at the newly generated dataframe, fully numerical with ordinal values and a target feature:

In [None]:
df = df[['ON_WEEKEND', 'CARRIER_cat','DEP_TIME_cat', 'ORIGIN_cat', 'DEST_cat', 'DISTANCE', 'ARR_DEL15']]
df

Let's generate the feature matrix and the binary target vector to feed our model:

In [None]:
df_X = df.drop('ARR_DEL15', axis=1)
df_y =  df[['ARR_DEL15']]

In [None]:
X = df_X.values
y = df_y.values

In [None]:
y = LabelBinarizer().fit_transform(y)

As our data is significantly imbalanced (977724 rows for class '0 = on time' and 187507 rows for class '1 = delayed'), let's use SMOTE (Synthetic Minority Oversampling TEchnique) to generate more examples of class 1:

In [None]:
df_y.value_counts()

In [None]:
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
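To give an intuition for what SMOTE generates, here is a minimal sketch of its core idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The helper `smote_like_samples` and its parameters are hypothetical and written for illustration; this is not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Illustrative SMOTE-style interpolation between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest minority neighbours of each point.
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick random base points and, for each, one of its k neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picks = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position in [0, 1) along the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picks] - X_min[base])

# Hypothetical minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
```

Since every synthetic point is an interpolation between two existing minority points, the new samples stay inside the region already covered by the minority class.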

Before splitting the data into train and test datasets (80% & 20%), let's standardize and then min-max scale our feature matrix so that all features share the same 0-1 range.

Important note: as per sklearn documentation, decision tree models convert their input to float32 internally. Tree splits depend only on the ordering of the values within each feature, so this rescaling is not required by the model and is not expected to change which splits it can find; it simply puts every feature on the same 0-1 scale.

In [None]:
X = StandardScaler().fit_transform(X)
X = MinMaxScaler().fit_transform(X)
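A small check on hypothetical random data (`X_demo` is made up for illustration) suggests the two steps can be simplified: min-max scaling maps each column to the 0-1 range regardless of any earlier shift or rescale, so applying StandardScaler first should give the same result as MinMaxScaler alone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix with an arbitrary location and scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 3))

# The two-step pipeline used above.
both = MinMaxScaler().fit_transform(StandardScaler().fit_transform(X_demo))

# Min-max scaling on its own.
minmax_only = MinMaxScaler().fit_transform(X_demo)

# True when the two results agree up to floating-point error.
same = np.allclose(both, minmax_only)
```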

In [None]:
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
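In the cells above, oversampling runs before `train_test_split`, so the test set also contains synthetic samples and is balanced rather than reflecting the original delay rate of roughly 16%. One common alternative, sketched here on hypothetical random data, is to split first (stratified) and oversample the training set only. Plain random oversampling with `sklearn.utils.resample` stands in for SMOTE so that the example needs only scikit-learn; with imblearn, `SMOTE().fit_resample(X_tr, y_tr)` would take its place.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: roughly 16% positives.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 4))
y_demo = (rng.random(1000) < 0.16).astype(int)

# Stratified split: both sets keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Oversample the minority class in the training set only.
minority = X_tr[y_tr == 1]
n_majority = int((y_tr == 0).sum())
extra = resample(
    minority, replace=True, n_samples=n_majority - len(minority), random_state=0
)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])
```

The model would then be fitted on `X_bal` and `y_bal` and evaluated on the untouched `X_te` and `y_te`, so the reported metrics describe performance at the real class ratio.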

In [None]:
model = tree.DecisionTreeClassifier()
model = model.fit(X_train, y_train)

In [None]:
y_pred_test = model.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred_test))

In [None]:
print(classification_report(y_test, y_pred_test))

In [None]:
plot_roc_curve(model, X_test, y_test)

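When data is highly skewed, a model can reach good accuracy simply by always predicting the majority class. Here, always predicting 'on time' would be right for 977724 of the 1165231 original rows, about 84% accuracy, without ever detecting a delay. For the model to be useful, it has to predict the minority class well.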

Therefore the most relevant metric is AUC, or Area Under the Curve:

    If AUC=50%, the model is useless: it ranks a delayed flight above an on-time one no more often than random guessing would.
    If AUC=100%, the model is perfect: it separates the two classes correctly every time.
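These reference points can be reproduced with `roc_auc_score` on a few hand-made scores (the labels and scores below are hypothetical and only illustrate the metric):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels: three on-time flights (0) and two delayed (1).
y_true = [0, 0, 0, 1, 1]

# Every delayed flight scored above every on-time flight.
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.3, 0.8, 0.9])

# Same score for every flight: no ranking information at all.
constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5, 0.5])

# Ranking fully reversed: every on-time flight scored above every delayed one.
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.7, 0.2, 0.1])
```

The AUC depends only on how the scores rank the two classes, which is why it remains informative when one class is much larger than the other.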

In conclusion, we get an AUC of 83% on the test data (a held-out 20% of the oversampled dataset), meaning that our model separates the two classes well on data it was not trained on and can predict flight delays effectively (75% accuracy and recall on both classes).