# Predicting Flight Delays using Decision Tree Classifier

About this Dataset

# Context
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.

# Acknowledgements
The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.


A decision tree is a simple model for classifying examples. It is a supervised machine learning method in which the data is repeatedly split according to the value of a chosen feature. We will build the tree through binary recursive partitioning: an iterative process that splits the data into two partitions and then splits each branch further.

Using the decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG), that is, the largest reduction in uncertainty about the final decision.
In an iterative process, we then repeat this splitting procedure at each child node until the leaves are pure, meaning that the samples at each leaf node all belong to the same class. In practice, a stopping rule such as a maximum depth usually ends the process earlier.
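To make the idea of information gain concrete, here is a minimal sketch that computes entropy and the gain of a single candidate split. The helper names (`entropy`, `information_gain`) and the toy values are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(labels, left_mask):
    """Entropy reduction obtained by splitting `labels` into left_mask / ~left_mask."""
    labels = np.asarray(labels)
    left_mask = np.asarray(left_mask, dtype=bool)
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_child_entropy

# Toy data (made up): 1 = delayed, 0 = on time
delayed = np.array([1, 1, 1, 0, 0, 0, 0, 0])
late_aircraft_minutes = np.array([45, 60, 35, 0, 0, 5, 0, 10])

# Candidate split: "late aircraft delay > 20 minutes"
ig = information_gain(delayed, late_aircraft_minutes > 20)
print(round(ig, 3))
```

In this toy case the split separates the two classes perfectly, so both children are pure and the gain equals the entropy of the parent node. A tree builder evaluates many such candidate splits and keeps the one with the highest gain.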

First, we import all required packages.
* NumPy is required because many machine learning algorithms expect their input as NumPy arrays.
* Pandas is useful for general data processing, manipulation, and cleaning.
* Scikit-learn (sklearn) provides various machine learning algorithms, including decision trees.
* Seaborn is used to visualize data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

import sklearn.tree as tree
#import pydotplus
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from IPython.display import Image


from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import auc
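from sklearn.metrics import RocCurveDisplay  # replaces plot_roc_curve, which was removed in scikit-learn 1.2
from numpy import interp  # scipy.interp was a deprecated alias for numpy.interp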

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#flights = pd.read_csv('../input/flight-delays/flights.csv')
#flights_sub = flights.sample(n = 10000, random_state = 123)
#flights_sub.info()
#flights_sub.to_csv('flights_sample.csv')

The original dataset consists of roughly 6 million records. We will use a representative random sample of 10,000 records for this study.

In [None]:
file = '../working/flights_sample.csv'
flights_sub = pd.read_csv(file)
print(flights_sub.shape)

In [None]:
flights_sub.head()

# EDA

Most delays are shorter than 300 minutes, and these can be called 'normal delays'. Anything above this is a special case and not something we would expect in the day-to-day operations of an airline. As the distributions below show, this assumption is needed to get rid of the outliers.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, sharey = False, figsize=(12,3))
sns.distplot(flights_sub['DEPARTURE_DELAY'], kde = True, bins = 5, ax = ax1)
sns.boxplot(data = flights_sub, x = 'DEPARTURE_DELAY', ax = ax2)

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (18,3))
sns.distplot(flights_sub.query('DEPARTURE_DELAY > 30 & DEPARTURE_DELAY < 300')['DEPARTURE_DELAY'], kde = True, bins = 5, ax = ax1)
sns.stripplot(data = flights_sub.query('DEPARTURE_DELAY > 30 & DEPARTURE_DELAY < 300'), x = 'DEPARTURE_DELAY', jitter = True, ax = ax2)
sns.stripplot(data = flights_sub.query('DEPARTURE_DELAY > 30 & DEPARTURE_DELAY < 300'), x = 'DEPARTURE_DELAY', y ='AIRLINE', jitter = True, ax = ax3)

I am assuming that only flights delayed by more than 30 minutes count as DELAYED, and that delays of 300 minutes or more are SPECIAL situations; all such instances are removed from further study. This assumption is based on intuition, and we can come back and change these thresholds later.
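To make the two cut-offs easy to revisit, they can be kept as named constants. This is a minimal sketch on made-up data; the constant names `DELAY_THRESHOLD` and `SPECIAL_THRESHOLD` and the `toy` frame are illustrative assumptions.

```python
import pandas as pd

DELAY_THRESHOLD = 30     # minutes; above this a flight counts as DELAYED
SPECIAL_THRESHOLD = 300  # minutes; at or above this the flight is excluded

# Made-up departure delays, including a missing value (e.g. a cancelled flight)
toy = pd.DataFrame({'DEPARTURE_DELAY': [-5, 12, 31, 120, 299, 300, 450, None]})

# Comparisons with NaN are False, so rows with a missing delay are dropped here too
within_scope = toy[toy['DEPARTURE_DELAY'] < SPECIAL_THRESHOLD].copy()
within_scope['DELAYED'] = within_scope['DEPARTURE_DELAY'] > DELAY_THRESHOLD

print(within_scope)
```

Changing either constant and re-running the downstream cells is then enough to test a different definition of "delayed".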

In [None]:
print(flights_sub['DEPARTURE_DELAY'].isna().sum())
print(flights_sub[flights_sub['DEPARTURE_DELAY'].isna()]['CANCELLED'].sum())
print(flights_sub[flights_sub['DEPARTURE_DELAY'].isna()].groupby(['CANCELLATION_REASON'])['CANCELLATION_REASON'].count())
flights_sub['DELAYED'] = flights_sub['DEPARTURE_DELAY']>30
flights_sub = flights_sub[flights_sub['DEPARTURE_DELAY']<300]


There are missing values in the DEPARTURE_DELAY column, and these records are the ones where the flight was cancelled. I am curious to know the reasoning behind the cancellations.

Reason for Cancellation of flight: 
* A - Airline/Carrier; 
* B - Weather; 
* C - National Air System; 
* D - Security

Most flights are cancelled due to weather conditions, and a good number are due to airline issues or issues with the National Air System.

Next, we will try to get a sense of the airline operators. The dataset uses two-character airline codes. For the reader's reference, the airline name for each code is listed below.

* UA	United Air Lines Inc.
* AA	American Airlines Inc.
* US	US Airways Inc.
* F9	Frontier Airlines Inc.
* B6	JetBlue Airways
* OO	Skywest Airlines Inc.
* AS	Alaska Airlines Inc.
* NK	Spirit Air Lines
* WN	Southwest Airlines Co.
* DL	Delta Air Lines Inc.
* EV	Atlantic Southeast Airlines
* HA	Hawaiian Airlines Inc.
* MQ	American Eagle Airlines Inc.
* VX	Virgin America


In [None]:
# catplot is a figure-level function and creates its own figure, so no ax is passed
sns.catplot(data = flights_sub, kind = 'count', y = 'AIRLINE', hue = 'DELAYED', aspect = 0.8)

Here is a comparison plot of delayed vs. on-time flights for each airline. Southwest, Delta and American Airlines are clearly the bigger players; they have more delayed flights, understandably, as they operate more flights overall.

In [None]:
pd.concat([round((flights_sub[(flights_sub['DEPARTURE_DELAY']>30)]['AIRLINE'].value_counts() / flights_sub['AIRLINE'].value_counts())*100,0).sort_values(ascending = False) ,
          flights_sub['AIRLINE'].value_counts()], axis = 1, sort = False)


To better understand the efficiency of each airline, the table above shows the percentage of delayed flights out of total flights. Spirit Airlines has the poorest record, with 17% of its flights delayed, while 16% of United Airlines flights are delayed.

Among the larger airlines, Southwest and American Airlines have 11% of flights delayed, while only 8% of Delta flights are delayed.
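The per-airline percentages can also be derived with a `groupby`, which may be easier to read than dividing two `value_counts()` results. This is a sketch on made-up data, assuming the same `AIRLINE` and `DEPARTURE_DELAY` column names and the 30-minute threshold; the numbers are not from the real dataset.

```python
import pandas as pd

# Made-up sample: 4 Spirit (NK) flights and 5 Delta (DL) flights
toy = pd.DataFrame({
    'AIRLINE': ['NK', 'NK', 'NK', 'NK', 'DL', 'DL', 'DL', 'DL', 'DL'],
    'DEPARTURE_DELAY': [45, 5, 0, 31, 2, 0, 90, -3, 10],
})
toy['DELAYED'] = toy['DEPARTURE_DELAY'] > 30

# The mean of a boolean column is the share of True values; size is the flight count
summary = toy.groupby('AIRLINE')['DELAYED'].agg(['mean', 'size'])
summary['pct_delayed'] = (100 * summary['mean']).round()
summary = summary.sort_values('pct_delayed', ascending=False)

print(summary[['pct_delayed', 'size']])
```

Keeping the flight count next to the percentage helps when reading the table, since a high percentage for an airline with few flights in the sample is less reliable.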

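Two kinds of airport codes are used in the ORIGIN_AIRPORT column (three-letter codes and longer numeric DOT codes), and I have tried to fix this here. Because the step that maps the DOT codes back is commented out, rows whose code is longer than three characters are simply dropped.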

In [None]:
airport_codes = pd.read_csv('../input/airport-codes/Airport_codes.csv', dtype ={'DOT Code': str,'Code':str})
airport_codes.head()


In [None]:
flights_sub['ORIGIN_AIRPORT'] = flights_sub['ORIGIN_AIRPORT'].astype(str)
flights_sub.reset_index(inplace=True, drop=True)


airports_fixed = pd.merge(flights_sub, airport_codes, left_on='ORIGIN_AIRPORT', right_on = 'DOT Code')
flights_sub.drop(flights_sub[flights_sub['ORIGIN_AIRPORT'].str.len()>3].index, inplace = True)


In [None]:
#airports_fixed['ORIGIN_AIRPORT'] = airports_fixed['DOT Code']
#flights_sub = pd.concat([airports_fixed, flights_sub], axis = 0, sort = False)


In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (15,3))
sns.countplot(data = flights_sub.query('DELAYED == True'), y = 'ORIGIN_AIRPORT', ax = ax2)
sns.countplot(data = flights_sub, y = 'ORIGIN_AIRPORT', ax = ax1)
sns.distplot(flights_sub['ORIGIN_AIRPORT'].value_counts(), ax = ax3)

There are 304 airports in this dataset and many of them have a very low number of flights. In fact, many airports have only one flight. Since we cannot use each airport as its own category, I binned them into 4 categories based on traffic volume, using percentiles of the per-airport flight counts.

* Heavy: airports above the 97th percentile (the top 3% of airports)
* Medium: airports between the 90th and 97th percentiles
* Light: airports between the 75th and 90th percentiles
* Very Light: the remaining 75% of airports

In [None]:
bin_pct_97 = flights_sub['ORIGIN_AIRPORT'].value_counts().quantile(0.97)
bin_pct_90 = flights_sub['ORIGIN_AIRPORT'].value_counts().quantile(0.90)
bin_pct_75 = flights_sub['ORIGIN_AIRPORT'].value_counts().quantile(0.75)

airports = flights_sub['ORIGIN_AIRPORT'].value_counts()
airports_index = flights_sub[flights_sub['ORIGIN_AIRPORT'].isin(airports[airports > bin_pct_97].index)].index
flights_sub.loc[airports_index,'AIRPORT_TYPE'] = 'Heavy'

airports_index = flights_sub[flights_sub['ORIGIN_AIRPORT'].isin(airports[(airports > bin_pct_90)&(airports <= bin_pct_97)].index)].index
flights_sub.loc[airports_index,'AIRPORT_TYPE'] = 'Medium'

airports_index = flights_sub[flights_sub['ORIGIN_AIRPORT'].isin(airports[(airports > bin_pct_75)&(airports <= bin_pct_90)].index)].index
flights_sub.loc[airports_index,'AIRPORT_TYPE'] = 'Light'

airports_index = flights_sub[flights_sub['ORIGIN_AIRPORT'].isin(airports[airports <= bin_pct_75].index)].index
flights_sub.loc[airports_index,'AIRPORT_TYPE'] = 'Very Light'
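The four-way binning can be written more compactly with `pd.cut`, whose right-inclusive intervals match the `>` / `<=` comparisons used above. This is a sketch on made-up per-airport counts; the `counts` series and the airport volumes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up number of flights per origin airport
counts = pd.Series({'ATL': 400, 'ORD': 300, 'DEN': 200, 'SEA': 80,
                    'PDX': 40, 'BOI': 10, 'GEG': 5, 'XNA': 1})

# Percentile cut-offs of the per-airport flight counts
q75, q90, q97 = counts.quantile([0.75, 0.90, 0.97])

# Intervals are (-inf, q75], (q75, q90], (q90, q97], (q97, inf)
airport_type = pd.cut(
    counts,
    bins=[-np.inf, q75, q90, q97, np.inf],
    labels=['Very Light', 'Light', 'Medium', 'Heavy'],
).astype(str)

# Attach the category to each flight through its origin airport
flights = pd.DataFrame({'ORIGIN_AIRPORT': ['ATL', 'XNA', 'ORD']})
flights['AIRPORT_TYPE'] = flights['ORIGIN_AIRPORT'].map(airport_type)
print(flights)
```

With so few toy airports a bin can end up empty (here no airport falls into 'Medium'); on the full sample every bin should be populated.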

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows = 1, ncols = 4, figsize = (20,5), sharey = False)
sns.countplot(data = flights_sub[flights_sub['AIRPORT_TYPE']=='Very Light'], y = 'ORIGIN_AIRPORT', ax = ax4)
sns.countplot(data = flights_sub[flights_sub['AIRPORT_TYPE']=='Light'], y = 'ORIGIN_AIRPORT', ax = ax3)
sns.countplot(data = flights_sub[flights_sub['AIRPORT_TYPE']=='Medium'], y = 'ORIGIN_AIRPORT', ax = ax2)
sns.countplot(data = flights_sub[flights_sub['AIRPORT_TYPE']=='Heavy'], y = 'ORIGIN_AIRPORT', ax = ax1)

The plot above shows the airports in these 4 categories.

In [None]:
flights_sub['Date'] = pd.to_datetime(flights_sub[['YEAR', 'MONTH', 'DAY']])
flights_sub['MONTH'] = flights_sub['Date'].dt.month

In [None]:
flights_sub['WEATHER_DELAY'] = flights_sub['WEATHER_DELAY'].fillna(0)
flights_sub['WEATHER_DELAY'].values
flights_sub[flights_sub['DELAYED']==True]['WEATHER_DELAY'].values
plt.hist(flights_sub[flights_sub['DELAYED']==True]['WEATHER_DELAY'].values, log = True)
plt.hist(flights_sub[flights_sub['DELAYED']==False]['WEATHER_DELAY'].values, log = True)

plt.clf()
plt.hist(flights_sub[flights_sub['DELAYED']==True]['SECURITY_DELAY'].values, log = False, bins = 3)
plt.hist(flights_sub[flights_sub['DELAYED']==False]['SECURITY_DELAY'].values, log = False, bins = 3)

plt.clf()
plt.hist(flights_sub['AIR_SYSTEM_DELAY'].values, log = False, bins = 3)
plt.hist(flights_sub[flights_sub['DELAYED']==True]['AIR_SYSTEM_DELAY'].values, log = False, bins = 3)

plt.clf()
plt.hist(flights_sub['AIRLINE_DELAY'].values, log = False, bins = 50)
plt.hist(flights_sub[flights_sub['DELAYED']==True]['AIRLINE_DELAY'].values, log = False, bins = 50)

plt.clf()
plt.hist(flights_sub['LATE_AIRCRAFT_DELAY'].values, log = False, bins = 50)
plt.hist(flights_sub[flights_sub['DELAYED']==True]['LATE_AIRCRAFT_DELAY'].values, log = False, bins = 50)


Based on the above analysis, the variables below appear to be the most relevant features for building the model.

The categorical columns (AIRLINE and AIRPORT_TYPE) are label encoded to produce the NumPy input for the model.

The data is then split into training and testing datasets.

In [None]:
predictors = ['MONTH', 'AIRLINE','AIRPORT_TYPE', 'SECURITY_DELAY', 'WEATHER_DELAY', 'AIR_SYSTEM_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY']
features = ['MONTH','AIRLINE', 'AIRPORT_TYPE', 'SECURITY_DELAY', 'WEATHER_DELAY', 'AIR_SYSTEM_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY','DELAYED']
flights_model = flights_sub[features]
flights_model = flights_model.fillna(0)
random_state = 40

y = flights_model['DELAYED'].values
X = flights_model.drop('DELAYED', axis = 1).values

labelencoder_X = LabelEncoder()
X[:,1] =labelencoder_X.fit_transform(X[:,1])
X[:,2] =labelencoder_X.fit_transform(X[:,2])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = random_state)
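Reusing a single `LabelEncoder` for both columns works, but the mapping of the first column is overwritten by the second `fit_transform`. A minimal sketch that keeps one encoder per column, so the integer codes can be translated back later; the `toy` frame and the `encoders` dictionary are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical columns with the same names as in the notebook
toy = pd.DataFrame({
    'AIRLINE': ['WN', 'DL', 'AA', 'DL'],
    'AIRPORT_TYPE': ['Heavy', 'Light', 'Heavy', 'Very Light'],
})

encoders = {}
encoded = toy.copy()
for col in ['AIRLINE', 'AIRPORT_TYPE']:
    encoders[col] = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    encoded[col] = encoders[col].fit_transform(toy[col])

print(encoded)
print(list(encoders['AIRLINE'].classes_))

# The stored encoder can turn codes back into the original labels
original_airlines = encoders['AIRLINE'].inverse_transform(encoded['AIRLINE'])
```

Note that label encoding assigns an arbitrary order to the categories. A tree can still split on such codes, but the order itself carries no meaning.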



In [None]:
dt = DecisionTreeClassifier(criterion = 'gini',max_depth = 4, random_state = random_state)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

accuracy_gini = accuracy_score(y_test, y_pred)
print(accuracy_gini)

In [None]:
dt = DecisionTreeClassifier(criterion = 'entropy',max_depth = 4, random_state = random_state)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

accuracy_entropy = accuracy_score(y_test, y_pred)
print(accuracy_entropy)

Two decision tree models were built, one using Gini impurity as the criterion and the other using entropy. Both give a very similar accuracy of about 97%. However, this could be an overfitted model, so to get a more reliable estimate I will introduce cross-validation.

In [None]:

#cv = StratifiedKFold(n_splits=6)
cv = 50
clf = DecisionTreeClassifier(criterion = 'entropy',max_depth = 50, min_samples_leaf = 0.05, random_state = random_state)
clf.fit(X_train, y_train)
clf.predict(X_test)

scores = cross_val_score(clf, X_train, y_train, cv = cv, n_jobs = -1)
scores.mean()

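With 50-fold cross-validation and entropy as the criterion, we get an accuracy of 93.9%, a drop of around 3 percentage points from the original model, which suggests the original model was overfitting. Note that this classifier also uses different hyperparameters (max_depth = 50, min_samples_leaf = 0.05), so part of the drop may come from that change rather than from overfitting alone.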
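To separate the effect of cross-validation from the effect of changed hyperparameters, the same estimator can be scored both on its own training data and with k-fold cross-validation. This is a sketch on synthetic data from `make_classification`, not on the flight sample; all `*_demo` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the flight data
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=40)

clf_demo = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=40)

# Stratified folds keep the class ratio roughly equal in every fold
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=40)
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo)

# Accuracy on the data the tree was fitted on, for comparison
train_accuracy = clf_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"training accuracy:        {train_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

A large gap between the two numbers for the same hyperparameters is a more direct sign of overfitting than a comparison between two differently configured trees.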

Another way of evaluating the model is with the confusion matrix, which gives a clear picture of how well the model predicts each class. In this case, 87% of the test samples are true negatives and 10% are true positives, which means 97% of the predictions are correct.

Depending on the requirements for sensitivity and specificity, we can evaluate a model using this matrix.
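Sensitivity and specificity can be read directly off the four cells of a binary confusion matrix. A minimal sketch with made-up labels (1 = delayed, 0 = on time); the `*_demo` arrays are illustrative and unrelated to the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)  # share of delayed flights that were caught
specificity = tn / (tn + fp)  # share of on-time flights correctly left alone
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```

When one class dominates, as on-time flights do here, accuracy alone can look good even if sensitivity is poor, which is why both rates are worth checking.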

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
cf_matrix_per = cf_matrix/np.sum(cf_matrix)

print(y_test.shape)
print(cf_matrix)
sns.heatmap(cf_matrix_per, annot=True, fmt='.2%', cmap='Blues')


In [None]:
# dot_data = StringIO()
# tree.export_graphviz(clf, 
# out_file=dot_data, 
# class_names=['0','1'], # the target names.
# feature_names=predictors, # the feature names.
# filled=True, # Whether to fill in the boxes with colours.
# rounded=True, # Whether to round the corners of the boxes.
# special_characters=True)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
# Image(graph.create_png())

In [None]:
#bc = BaggingClassifier(base_estimator = clf, n_estimators = 100, n_jobs = -1)
#bc.fit(X_train, y_train)
#y_pred = bc.predict(X_test)

#accuracy = accuracy_score(y_test, y_pred)
#print(round(accuracy,2))

In [None]:
#bc = BaggingClassifier(base_estimator = clf, n_estimators = 100, n_jobs = -1, oob_score = True)
#bc.fit(X_train, y_train)
#y_pred = bc.predict(X_test)

#oob_accuracy = bc.oob_score_
#print(oob_accuracy)

In [None]:
rf = RandomForestClassifier(n_estimators = 500, criterion = 'entropy', max_depth = 50, min_samples_leaf = 0.02, n_jobs = -1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
cf_matrix_per = cf_matrix/np.sum(cf_matrix)

print(y_test.shape)
print(cf_matrix)
sns.heatmap(cf_matrix_per, annot=True, fmt='.2%', cmap='Blues')


In [None]:
importances_rf = pd.Series(rf.feature_importances_, index = predictors)
sorted_importances_rf = importances_rf.sort_values()
sorted_importances_rf.plot(kind = 'barh', color = 'm')
plt.show()

Interpretation: with the model ready, one important step is to interpret how it uses the features to reach its predictions. The feature importance chart above shows the relative importance of each variable.

With an importance of more than 50%, late aircraft arrival is the main driver of flight delays, followed by delays caused by the airline and then air system delays.