## Machine Learning model to predict airline delay

The objective of our model is to predict arrival delay. Arrival delay (ARR_DELAY) is highly skewed: the majority of flights have zero or only a small arrival delay. We therefore break the problem into two subparts:

### Delay Classification Model
* Classify [0/1] whether a flight is delayed by 5 minutes or more (ARR_DELAY>=5)
* Train a Logistic Regression model
* Train each model on a resampled set of 400,000 positive (delayed) and 600,000 negative samples
* Average predictions over n=10 models (the value of n used in the code below)
* Output probability of delay P(delay)
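
The resample-and-average scheme in the list above can be sketched on a small scale as follows. This is a minimal illustration, not the notebook's actual run: the data is synthetic, and the names `n_models`, `n_pos`, `n_neg` and `p_delay` as well as the sample sizes are assumptions chosen so that it finishes quickly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in for the flight data (assumption, not the real dataset)
n_rows = 5000
X = rng.normal(size=(n_rows, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 1.5
y = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

n_models = 5
n_pos, n_neg = 400, 600  # 40:60 positive:negative, mirroring the 400000/600000 split

models = []
for _ in range(n_models):
    # Draw a fresh training set with the fixed class ratio for every model
    idx = np.concatenate([
        rng.choice(pos_idx, n_pos, replace=True),
        rng.choice(neg_idx, n_neg, replace=False),
    ])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# P(delay): positive-class probability averaged over all models
p_delay = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print(p_delay[:5])
```

Averaging probabilities rather than hard 0/1 votes keeps a continuous score, so the decision threshold can still be tuned afterwards.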

### Predicted Delay
* Regression using Linear Regression on the log of the delay
* Trained only on positive delays (ARR_DELAY>=5)
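
As a rough sketch of the regression stage, the snippet below fits a linear model to the log of the delay and maps predictions back to minutes before measuring the error. The data and the two features are synthetic and purely illustrative; only the log/exp round trip reflects what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption): delays of at least 5 minutes
# whose logarithm depends roughly linearly on two made-up features.
X = rng.normal(size=(2000, 2))
noise = rng.normal(scale=0.5, size=2000)
delay = np.maximum(5.0, np.exp(3.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + noise))

# Fit on the log scale, where the skewed target is closer to symmetric
lm = LinearRegression().fit(X, np.log(delay))

# Map predictions back to minutes before computing errors
predicted = np.exp(lm.predict(X))
mae = np.mean(np.abs(delay - predicted))
print("MAE in minutes:", mae)
```

Restricting the fit to delays of at least 5 minutes keeps the logarithm well defined, since zero and negative delays never reach this stage.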

In [1]:
import pandas as pd
import numpy as np
import time
import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases

In [2]:
tic = time.time()

#PREPARE DF FOR REGRESSION WITH CLIMATE
df = pd.read_csv('Airline+Weather_data.csv')

toc = time.time()
print("Finished reading CSV file in " + str(toc-tic) + " seconds")

Finished reading CSV file in 133.84136319160461 seconds


In [3]:
#Prepare the data
tic = time.time()

#Drop variables that are uncorrelated with arrival delay or are not known before the flight
df.drop(['YEAR','DAY_OF_MONTH','FL_NUM','CRS_DEP_TIME','DEP_TIME','DEP_DELAY','CRS_ARR_TIME','ARR_TIME','ACTUAL_ELAPSED_TIME','AIR_TIME','DEP_AVG_HOURLYVISIBILITY','DEP_AVG_HOURLYDRYBULBTEMPC','DEP_AVG_HOURLYWindSpeed','DEP_AVG_HOURLYPrecip','ARR_AVG_HOURLYVISIBILITY','ARR_AVG_HOURLYDRYBULBTEMPC','ARR_AVG_HOURLYWindSpeed','ARR_AVG_HOURLYPrecip'],axis=1, inplace=True)
#Hour 24 and hour 0 denote the same hour; map 24 to 0 to remove the redundancy
df['ARR_HOUR'] = df['ARR_HOUR'].apply(lambda x:0 if x == 24 else x)
#Drop rows with Null Values
df.dropna(inplace=True)

#Convert to Dummy Variables
df = pd.concat([df,pd.get_dummies(df['MONTH'],drop_first=True,prefix="MONTH")],axis=1)
df = pd.concat([df,pd.get_dummies(df['DAY_OF_WEEK'],drop_first=True,prefix="DAY_OF_WEEK")],axis=1)
df = pd.concat([df,pd.get_dummies(df['UNIQUE_CARRIER'],drop_first=True,prefix="UNIQUE_CARRIER")],axis=1)
df = pd.concat([df,pd.get_dummies(df['ORIGIN'],drop_first=True,prefix="ORIGIN")],axis=1)
df = pd.concat([df,pd.get_dummies(df['DEST'],drop_first=True,prefix="DEST")],axis=1)
df = pd.concat([df,pd.get_dummies(df['DEP_HOUR'],drop_first=True,prefix="DEP_HOUR")],axis=1)
df = pd.concat([df,pd.get_dummies(df['ARR_HOUR'],drop_first=True,prefix="ARR_HOUR")],axis=1)

df.drop(['MONTH','DAY_OF_WEEK','UNIQUE_CARRIER','ORIGIN','DEST','DEP_HOUR','ARR_HOUR'],axis=1,inplace=True)
#DELAY_YN -> Delay Yes or No -> 1 if Delay>=5 minutes, else 0
df['DELAY_YN'] = df['ARR_DELAY'].apply(lambda x:1 if x>=5 else 0)

toc = time.time()
print("Finished preparing data in " + str(toc-tic) + " seconds")
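
To make the dummy-variable step concrete, here is a toy example of `pd.get_dummies` with `drop_first=True`. The three carrier codes are illustrative values only, not taken from the dataset.

```python
import pandas as pd

# Toy column with three categories (illustrative values)
toy = pd.DataFrame({"UNIQUE_CARRIER": ["AA", "DL", "UA", "AA"]})

dummies = pd.get_dummies(toy["UNIQUE_CARRIER"], drop_first=True,
                         prefix="UNIQUE_CARRIER")
print(list(dummies.columns))
```

The first category in sorted order (`AA` here) gets no column of its own and is represented by all other indicators being zero. This avoids perfectly collinear columns in the linear models.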

In [85]:
#Create 'n' different Logistic Regression Models

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

n = 10 #Number of models to average over

for i in range(n):
    
    tic = time.time()
    #Create a randomly selected smaller dataset for training purpose
    #Each dataset should have negative and positive classes in the ratio 60:40
    df_split = df.loc[np.random.choice(df[df['DELAY_YN']==1].index, 400000, replace = True)]
    df_split2 = df.loc[np.random.choice(df[df['DELAY_YN']==0].index, 600000, replace = False)]
    df_split = pd.concat([df_split, df_split2], ignore_index=True)  #DataFrame.append was removed in pandas 2.0

    X_train, X_test, y_train, y_test = train_test_split(df_split.drop(['DELAY_YN','ARR_DELAY'],axis=1),
                                                    df_split['DELAY_YN'], test_size=0.10, random_state=101)

    logmodel = LogisticRegression()
    logmodel.fit(X_train,y_train)
    
    predictions = logmodel.predict(X_test)

    truePos = X_test[((predictions == 1) & (y_test == predictions))]
    falsePos = X_test[((predictions == 1) & (y_test != predictions))]
    trueNeg = X_test[((predictions == 0) & (y_test == predictions))]
    falseNeg = X_test[((predictions == 0) & (y_test != predictions))]

    TP = truePos.shape[0]
    FP = falsePos.shape[0]
    TN = trueNeg.shape[0]
    FN = falseNeg.shape[0]

    accuracy = float(TP + TN)/float(TP + TN + FP + FN)
    print('Accuracy: '+str(accuracy))
    
    joblib.dump(logmodel, str(i)+'_logmodel.pkl') 
    
    toc = time.time()
    print(str(i+1)+"th fold took " + str(toc-tic) + " seconds")    

Accuracy: 0.68899
1th fold took 130.5088438987732 seconds
Accuracy: 0.68074
2th fold took 119.45041298866272 seconds
Accuracy: 0.70721
3th fold took 129.92791295051575 seconds
Accuracy: 0.66568
4th fold took 121.2253348827362 seconds
Accuracy: 0.673
5th fold took 128.84892797470093 seconds
Accuracy: 0.69951
6th fold took 132.01799082756042 seconds
Accuracy: 0.66381
7th fold took 118.65123891830444 seconds
Accuracy: 0.69405
8th fold took 127.84729886054993 seconds
Accuracy: 0.68276
9th fold took 126.42221188545227 seconds
Accuracy: 0.69443
10th fold took 125.44659209251404 seconds
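
The four counts built with boolean masks in the cell above can also be obtained from `sklearn.metrics.confusion_matrix`. A minimal sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not notebook data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels [0, 1], ravel() yields the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(TN, FP, FN, TP, accuracy)
```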


In [86]:
#Test the model performance on a random sample of the full dataset (this overlaps the training data, so it is not a held-out test)
df2 = df.loc[np.random.choice(df.index, 1000000, replace = False)]
X_test = df2.drop(['ARR_DELAY','DELAY_YN'],axis=1)
y_test = df2['DELAY_YN']

n = 10 #Number of models to average over
df2['DELAY_YN'] = np.zeros(len(df2.index))

for i in range(n):
    logmodel = joblib.load(str(i)+'_logmodel.pkl') 
    predictions = logmodel.predict(X_test)
    
    df2['DELAY_YN'] = df2['DELAY_YN'] + logmodel.predict_proba(X_test)[:,1]
    
    truePos = X_test[((predictions == 1) & (y_test == predictions))]
    falsePos = X_test[((predictions == 1) & (y_test != predictions))]
    trueNeg = X_test[((predictions == 0) & (y_test == predictions))]
    falseNeg = X_test[((predictions == 0) & (y_test != predictions))]

    TP = truePos.shape[0]
    FP = falsePos.shape[0]
    TN = trueNeg.shape[0]
    FN = falseNeg.shape[0]

    accuracy = float(TP + TN)/float(TP + TN + FP + FN)
    print('Accuracy: '+str(accuracy))

Accuracy: 0.732409
Accuracy: 0.725514
Accuracy: 0.747362
Accuracy: 0.711334
Accuracy: 0.719382
Accuracy: 0.74036
Accuracy: 0.71206
Accuracy: 0.738032
Accuracy: 0.726516
Accuracy: 0.733834


In [92]:
#Take Average of probabilities for positive class (DELAY_YN = 1). If average probability>0.46, assign value=1
df2['DELAY_YN_vote'] = df2['DELAY_YN']/n
df2['DELAY_YN_vote'] = df2['DELAY_YN_vote'].apply(lambda x:1 if x>0.46 else 0) #Take Vote

truePos = X_test[((df2['DELAY_YN_vote'] == 1) & (y_test == df2['DELAY_YN_vote']))]
falsePos = X_test[((df2['DELAY_YN_vote'] == 1) & (y_test != df2['DELAY_YN_vote']))]
trueNeg = X_test[((df2['DELAY_YN_vote'] == 0) & (y_test == df2['DELAY_YN_vote']))]
falseNeg = X_test[((df2['DELAY_YN_vote'] == 0) & (y_test != df2['DELAY_YN_vote']))]

TP = truePos.shape[0]
FP = falsePos.shape[0]
TN = trueNeg.shape[0]
FN = falseNeg.shape[0]

accuracy = float(TP + TN)/float(TP + TN + FP + FN)
print('Final Accuracy: '+str(accuracy))
print('TP: '+str(TP))
print('FP: '+str(FP))
print('TN: '+str(TN))
print('FN: '+str(FN))
print('% of positive predictions:')
print(len(df2[df2['DELAY_YN_vote']==1].index)/len(df2.index))

Final Accuracy: 0.712627
TP: 158341
FP: 145880
TN: 554286
FN: 141493
% of positive predictions:
0.304221
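
Accuracy alone hides how the classifier does on the delayed class. Precision and recall follow directly from the four counts printed above; the short calculation below uses only those reported numbers.

```python
# Counts reported in the output above
TP, FP, TN, FN = 158341, 145880, 554286, 141493

total = TP + FP + TN + FN
accuracy = (TP + TN) / total
precision = TP / (TP + FP)   # share of predicted delays that were real delays
recall = TP / (TP + FN)      # share of real delays that were caught
base_rate = (TP + FN) / total  # share of flights that were actually delayed

print("accuracy :", round(accuracy, 4))
print("precision:", round(precision, 4))
print("recall   :", round(recall, 4))
print("base rate:", round(base_rate, 4))
```

Both precision and recall come out a little above 0.52, while about 30% of the sampled flights are actually delayed. The 71% accuracy is therefore driven largely by the on-time majority.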


In [93]:
#Linear Regression on all delayed flights (DELAY_YN == 1), modeling the log of the delay
df_late = df[df['DELAY_YN']==1].copy()
df_late['log_delay'] = np.log(df_late['ARR_DELAY'])

print('Total positive delay datapoints:' + str(len(df_late.index)))

Total positive delay datapoints:3165541


In [94]:
#Modeling ARR_DELAY
tic = time.time()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_late.drop(['DELAY_YN','log_delay'],axis=1), 
                                                    df_late['log_delay'], test_size=0.30, random_state=101)

print('Training...')
from sklearn.linear_model import LinearRegression
lm = LinearRegression()  #the normalize argument was removed in scikit-learn 1.2; ordinary least-squares predictions do not depend on it
lm.fit(X_train.drop('ARR_DELAY',axis=1),y_train)

print('Predicting on test set...')
predictions = lm.predict(X_test.drop('ARR_DELAY',axis=1))

X_test['predicted'] = np.exp(predictions)  #back-transform from log scale to minutes

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(X_test['ARR_DELAY'],X_test['predicted']))
print('MSE:', metrics.mean_squared_error(X_test['ARR_DELAY'],X_test['predicted']))
print('RMSE:', np.sqrt(metrics.mean_squared_error(X_test['ARR_DELAY'],X_test['predicted'])))

joblib.dump(lm, 'linearmodel.pkl')

toc = time.time()
print("Finished fitting Linear Regression in " + str(toc-tic) + " seconds")

Training...
Predicting on test set...
MAE: 25.9202623843
MSE: 2647.93839559
RMSE: 51.4581227368
Finished fitting Linear Regression in 149.62631487846375 seconds
