# Flight Delay Prediction (Jan 2019)
The dataset contains records gathered by the Bureau of Transportation Statistics (BTS)[9] to provide historical comparisons of monthly on-time reports filed by large US airlines.

_Only data for 2019 was selected, intentionally: COVID-19 had a substantial effect on the aviation industry in 2020, and analyzing that factor is out of scope for this project._
### Source: Bureau of Transportation Statistics
### Number of rows: 1,984,933
### Dataset features and description:

| Variable name               | Description                                                                                            |
|-----------------------------|--------------------------------------------------------------------------------------------------------|
| DayOfMonth                  | Day of Month                                                    |
| DayOfWeek                   | Day of Week (1 = Monday, 2 = Tuesday, 3 = Wednesday, ..., 7 = Sunday)                                  |
| Operating\_Airline          | Carrier Code                                                                                           |
| Origin                      | Origin Airport                                                                                         |
| Dest                        | Destination Airport                                                                                    |
| DepTime                     | Actual Departure Time (local time: hhmm)                                                               |
| DepDelay                    | Difference between scheduled and actual departure time (Minutes)                                       |
| DepDel15                    | Departure Delay Indicator (1=Yes)                                                                      |
| DepartureDelayGroups        | Departure Delay intervals (every 15 minutes, from <-15 to >180)                                        |
| TaxiOut                     | Taxi-Out Time (Minutes)                                                                                |
| TaxiIn                      | Taxi-In Time (Minutes)                                                                                 |
| **ArrDelay (target)**       | **Difference in minutes between scheduled and actual arrival time**. Early arrivals show negative numbers. |
| ArrDel15                    | Arrival Delay Indicator, 15 Minutes or More (1=Yes)                                                    |
| ArrivalDelayGroups          | Arrival Delay intervals (every 15 minutes, from <-15 to >180)                                          |
| Cancelled                   | Cancelled Flight Indicator (1=Yes)                                                                     |
| CancellationCode            | Specifies The Reason For Cancellation                                                                  |
| ActualElapsedTime           | Elapsed Time of Flight, in Minutes                                                                     |
| AirTime                     | Flight Time, in Minutes                                                                                |
| Distance                    | Distance between airports (Miles)                                                                      |
| DistanceGroup               | Distance Flight Segment (every 250 Miles)                                                              |
| CarrierDelay                | Delay by Carrier (Minutes)                                                                             |
| NASDelay                    | Delay by NAS (Minutes)                                                                                 |
| SecurityDelay               | Delay by Security (Minutes)                                                                            |
| LateAircraftDelay           | Delay by Late Aircraft (in Minutes)                                                                    |
| WeatherDelay                | Delay caused by Weather (Minutes)                                                                      |
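
The indicator and group columns in the table above are derived from the raw delay in minutes. The following is a minimal sketch of that relationship, not BTS code: the helper names are hypothetical, and the group coding (15-minute buckets clipped to the range -2 to 12) is an assumption inferred from the "<-15 to >180" description.

```python
import math


def delay_indicator(delay_minutes: float) -> int:
    """Return 1 for a delay of 15 minutes or more, else 0 (mirrors DepDel15 / ArrDel15)."""
    return 1 if delay_minutes >= 15 else 0


def delay_group(delay_minutes: float) -> int:
    """Return the 15-minute bucket index of a delay.

    Assumed coding: -2 means earlier than -15 minutes, 12 means 180 minutes or more.
    """
    bucket = math.floor(delay_minutes / 15)
    return max(-2, min(12, bucket))


# Print indicator and group for a few sample delays (in minutes)
for minutes in (-20, -5, 0, 14, 15, 47, 200):
    print(minutes, delay_indicator(minutes), delay_group(minutes))
```

Because the indicator and group columns are functions of the delay itself, they would leak the target if used as predictors of the matching delay.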

In [None]:
#Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

In [None]:
#Setup pandas display parameters
pd.options.display.max_columns = 50
pd.options.display.max_rows = 200
pd.options.display.max_colwidth= 50
pd.options.display.precision = 3

In [None]:
#Define initial model parameters
cv_n_split = 3
random_state = 42
test_train_split = 0.25
sample = True
sample_size = 0.1

In [None]:
#data = pd.read_csv("../1019415451_T_ONTIME_MARKETING-4.csv")
df01_csv = pd.read_csv("../../Downloads/ontime-2019-01.csv",dtype={'a': str})
print ("Imported {} with {} variables".format(df01_csv.shape[0],df01_csv.shape[1]))

In [None]:
df01_csv.shape

In [None]:
if sample:
    #Sample without replacement so no flight appears twice in the subsample
    df01_csv = df01_csv.sample(frac = sample_size, replace=False, random_state=random_state)
df01_csv.shape

# Data Cleaning 
* Select the columns relevant to the prediction and discard the rest
* Clean missing values
* Transform variables
* Set variable types
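
The four steps above can be sketched on a toy frame before applying them to the full dataset. This is an illustration under stated assumptions: the rows are made up, the `UNUSED` column is hypothetical, and the rule "a missing delay cause means 0 minutes" is the convention adopted in the cells below, not something the data dictionary guarantees.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the on-time data; all values are invented for illustration
toy = pd.DataFrame({
    "ORIGIN": ["JFK", "LAX", "JFK", "ORD"],
    "CANCELLED": [0.0, 0.0, 1.0, 0.0],
    "ARR_DELAY": [12.0, -3.0, np.nan, 45.0],
    "WEATHER_DELAY": [np.nan, np.nan, np.nan, 30.0],
    "UNUSED": ["a", "b", "c", "d"],
})

# 1. Keep only the relevant columns and lowercase their names
keep = ["ORIGIN", "CANCELLED", "ARR_DELAY", "WEATHER_DELAY"]
clean = toy[keep].copy()
clean.columns = clean.columns.str.lower()

# 2. Missing values: drop cancelled flights, then treat a missing delay cause as 0 minutes
clean = clean[clean["cancelled"] == 0].copy()
clean["weather_delay"] = clean["weather_delay"].fillna(0)

# 3. Variable transformation: binary flag for arrivals 15 or more minutes late
clean["arr_del15"] = (clean["arr_delay"] >= 15).astype(int)

# 4. Variable types: integer flag and categorical airport code
clean["cancelled"] = clean["cancelled"].astype(int)
clean["origin"] = clean["origin"].astype("category")

print(clean.dtypes)
print(clean)
```

Taking an explicit `.copy()` after each row or column selection keeps later assignments from triggering pandas' `SettingWithCopyWarning`.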

In [None]:

#Select most relevant columns, drop the rest
relevant_columns = ['DAY_OF_MONTH','DAY_OF_WEEK', # Date information
                    'OP_CARRIER','TAIL_NUM', #Airline and Aircraft Identifier
                    'ORIGIN','DEST',#Origin and destination
                    'DEP_DELAY','DEP_DELAY_NEW','DEP_DEL15','DEP_DELAY_GROUP','DEP_TIME_BLK','TAXI_OUT', #departure delays
                    'ARR_DELAY','ARR_DELAY_NEW','ARR_DEL15','ARR_DELAY_GROUP','ARR_TIME_BLK',#Arrival information
                    'CRS_ELAPSED_TIME','ACTUAL_ELAPSED_TIME','AIR_TIME','FLIGHTS','DISTANCE','DISTANCE_GROUP',#Flight summaries (distance_group is used in later cells)
                    'CANCELLED','DIVERTED', #Cancelled/Diverted information
                    'CARRIER_DELAY','WEATHER_DELAY','NAS_DELAY','SECURITY_DELAY','LATE_AIRCRAFT_DELAY']
#Extract relevant columns from csv dataframe; copy() avoids SettingWithCopyWarning on later assignments
df01 = df01_csv[df01_csv.columns.intersection(relevant_columns)].copy()
df01.columns = df01.columns.str.lower()  #Set column names to lowercase
df01.head()

In [None]:
df01.dtypes.sort_values()

In [None]:
#Review dataframe columns summary for diagnostics and cleaning
pd.DataFrame({ 
            'unique':df01.nunique(),
            'missing total': df01.isna().sum(),
            'missing %': df01.isna().mean() * 100,  #share of all rows, not of non-missing rows
            'type':df01.dtypes})

In [None]:
#Fix data types for bool columns: any positive value becomes 1, everything else (including NaN) becomes 0
for col in ["diverted", "cancelled", "arr_del15", "dep_del15"]:
    df01[col] = (df01[col] > 0).astype(int)

In [None]:
pd.crosstab(df01.dep_delay.isna(), df01.cancelled).sort_values(by=1,ascending=False)

In [None]:
df01.flights.value_counts()

In [None]:
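#Keep flights that actually arrived; copy() so the columns added later do not write into a view of df01
df_delayed = df01[(df01['cancelled'] == 0) & (df01['diverted']==0)].copy()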
df_delayed.shape

In [None]:
#Replace all NA values in the delay-cause columns with 0
delay_cols = ['carrier_delay','weather_delay','nas_delay','security_delay','late_aircraft_delay']
df_delayed[delay_cols] = df_delayed[delay_cols].fillna(0)

In [None]:
#Review dataframe columns summary for diagnostics and cleaning
pd.DataFrame({ 
            'unique':df_delayed.nunique(),
            'missing total': df_delayed.isna().sum(),
            'missing %': df_delayed.isna().mean() * 100,  #share of all rows, not of non-missing rows
            'type':df_delayed.dtypes})

In [None]:
#Fix data types for categorical columns
for col in ['day_of_month','day_of_week','op_carrier','origin','dest','dep_delay_group','dep_time_blk','arr_time_blk' ,'arr_delay_group','distance_group']:
    df01[col] = df01[col].astype('category')
#Fix data types for string columns
df01['tail_num'] = df01['tail_num'].astype('string')
df01.dtypes

# Data Exploration and Visualizations

In [None]:
# Set up plot environment
import seaborn as sns
sns.set()
sns.set_context('notebook',rc = {"grid.linewidth": 5})
sns.set_style("whitegrid")
colors = ["#345E6F","#264653","#287271","#2a9d8f","#e9c46a","#efb366","#f4a261","#ee8959","#e76f51","#e87153","#e97c61", '#902C14']
bin_colors = ["#264653","#2A9D8F","#ee8959","#e97c61"]
sns.set_palette(sns.color_palette(colors))
mul_palette = sns.color_palette(colors)
bin_palette = sns.color_palette(bin_colors)
sns.set(rc={'figure.figsize':(10,5)}, font_scale=1.5)
sns.set_style({'axes.facecolor':'white', 'grid.color': '.8','grid.linestyle': '--'})

In [None]:
#Single-series histogram: pass a color, since palette only applies together with hue
ax = sns.histplot(data=df_delayed, x="arr_delay", kde=True, bins = 100, color=colors[3])
ax.set(xlabel = "", ylabel = "",title = 'Arrival delay Histogram')
ax.axes.yaxis.set_visible(False)

In [None]:
ax = sns.scatterplot(x="dep_delay", y="arr_delay", hue='arr_del15', data = df_delayed, palette=('YlGn'))
ax.set(xlabel = "Departure Delay", ylabel = "Arrival Delay",title = 'Departure delay vs arrival delay (Minutes)')
ax.legend(['On-Time', 'Delayed'])
ax.axes.yaxis.set_ticks([])

In [None]:
#Sort by time block so the x axis runs in chronological order
ax = sns.scatterplot(x="arr_time_blk", y="arr_delay_new", data = df_delayed.sort_values("arr_time_blk"))
ax.set(xlabel = "Arrival Time Block", ylabel = "Arrival Delay",title = 'Arrival time block vs arrival delay (Minutes)')
#Rotate the time-block labels so they stay readable
ax.tick_params(axis='x', labelrotation=90)
#Hide Y labels
#ax.axes.yaxis.set_visible(False)

In [None]:
#Fix the bar order explicitly so tick labels always match the bars
blk_order = sorted(df_delayed['dep_time_blk'].unique())
ax = sns.countplot(x="dep_time_blk", hue='arr_del15', data = df_delayed, order = blk_order, palette = bin_palette)
ax.set(xlabel = "Departure time block", ylabel = "Number of flights",title = 'Arrival delay vs Departure time block')
#Rotate the time-block labels so they stay readable
ax.tick_params(axis='x', labelrotation=90)
ax.legend(['On-Time', 'Delayed'])

In [None]:
#create plot
ax = sns.countplot(x = 'arr_del15', hue = 'day_of_week' , data = df_delayed,palette = mul_palette)
#Set Title
ax.set(xlabel = "",title = 'Delayed Flights per day of the week')
#set labels friendly name
ax.legend(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.set_xticklabels(['On Time','Delayed'])
#Hide Y labels
ax.axes.yaxis.set_visible(False)

In [None]:
#create plot
ax = sns.countplot(x = 'day_of_month' , data = df_delayed[df_delayed['arr_del15']>0],palette = mul_palette)
#Set Title
ax.set(xlabel = "",title = 'Delayed flights per day of the month')
#set labels friendly name
#ax.set_xticklabels(['On Time','Delayed'])
#Hide Y labels
ax.axes.yaxis.set_visible(False)

In [None]:
#catplot is figure-level and returns a FacetGrid, so style the grid instead of the previous cell's ax
#Note: swarm plots become very slow as the number of points grows
g = sns.catplot(x="day_of_week", y="arr_delay", data=df_delayed, palette = mul_palette, kind="swarm")
g.set(xlabel = "",title = 'Delayed time per day of week')
#set labels friendly name
g.set_xticklabels(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],rotation=30)
#Hide Y labels
g.ax.yaxis.set_visible(False)

In [None]:
ax = sns.scatterplot( x= "actual_elapsed_time", y = "arr_delay_new", hue = 'arr_del15', data = df_delayed)
ax.set(xlabel = "Time of Flight - Actual (min.)", ylabel = "Arrival Delay (min.)",title = 'Time of Flight (actual) vs arrival delay (Minutes)')
ax.legend(['On Time','Delayed'])

In [None]:
ax = sns.scatterplot(x = "crs_elapsed_time", y = "arr_delay_new", hue = 'arr_del15',data = df_delayed)
ax.set(xlabel = "Time of Flight (CRS)", ylabel = "Arrival Delay",title = 'Time of Flight (CRS) vs arrival delay (Minutes)')
ax.legend(['On Time','Delayed'])

In [None]:
ax = sns.scatterplot(x = "distance", y = "arr_delay_new", hue = 'arr_del15',data = df_delayed)
ax.set(xlabel = "Distance", ylabel = "Arrival Delay",title = 'Distance vs arrival delay (Minutes)')


In [None]:
ax = sns.scatterplot(x = "crs_elapsed_time", y = "arr_delay_new",data = df_delayed)
ax.set(xlabel = "Time of Flight (CRS)", ylabel = "Arrival Delay",title = 'Time of Flight (CRS) vs arrival delay (Minutes)')

In [None]:
fig, axes = plt.subplots(2, 2)
sns.set_palette("bright")
#fig.suptitle('Linear relationship')
sns.scatterplot( x= "actual_elapsed_time", y = "arr_delay_new", hue = 'arr_del15', data = df_delayed,ax=axes[0,0])
sns.scatterplot(x = "crs_elapsed_time", y = "arr_delay_new",   hue = 'arr_del15',data = df_delayed, ax=axes[0,1])
sns.scatterplot(x = "air_time", y = "arr_delay_new",   hue = 'arr_del15',data = df_delayed, ax=axes[1,0])
sns.scatterplot(x = "distance", y = "arr_delay_new", hue = 'arr_del15',data = df_delayed, ax=axes[1,1])
axes[0,0].set(xlabel = "", ylabel = "",title = 'Time of Flight (Actual)')
axes[0,1].set(xlabel = "", ylabel = "",title = 'Time of Flight (CRS)')
axes[1,0].set(xlabel = "", ylabel = "",title = 'Air Time')
axes[1,1].set(xlabel = "", ylabel = "",title = 'Distance')

#Hide the legend and both axes on every panel
for panel in axes.flat:
    panel.legend([],[], frameon=False)
    panel.yaxis.set_visible(False)
    panel.xaxis.set_visible(False)

In [None]:
#hue is needed for the On Time / Delayed legend to refer to real plot elements
ax = sns.scatterplot(x = "dep_delay_group", y = "arr_delay_new", hue = 'arr_del15', data = df_delayed)
ax.set(xlabel = "Departure delay group", ylabel = "Arrival Delay",title = 'Departure Delay vs arrival delay (Minutes)')
ax.legend(['On Time','Delayed'])

# Variable transformations
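
The cells in this section replace high-cardinality categorical columns (origin, destination, departure time block) with the share of rows in which each value occurs, commonly called frequency encoding. A minimal sketch of the idea on a made-up frame, assuming nothing about the real airport distribution:

```python
import pandas as pd

# Toy data: five flights from three airports (invented values)
toy = pd.DataFrame({"origin": ["JFK", "LAX", "JFK", "ORD", "JFK"]})

# Relative frequency of each category: JFK 3/5, LAX 1/5, ORD 1/5
origin_freq = toy["origin"].value_counts(normalize=True)

# Replace each category by its relative frequency
toy["origin_freq_encoding"] = toy["origin"].map(origin_freq)

print(toy)
```

One trade-off of this encoding is that categories with equal counts (LAX and ORD in the toy data) receive the same value and become indistinguishable to the model.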

In [None]:
df_delayed

In [None]:
origin_freq_encoding = (df_delayed.groupby('origin').size()) / len(df_delayed)
df_delayed['origin_freq_encoding'] = df_delayed['origin'].map(origin_freq_encoding)

In [None]:
dest_freq_encoding = (df_delayed.groupby('dest').size()) / len(df_delayed)
df_delayed['dest_freq_encoding'] = df_delayed['dest'].map(dest_freq_encoding)

In [None]:
dept_blk_freq = (df_delayed.groupby('dep_time_blk').size()) / len(df_delayed)
df_delayed['dept_blk_freq'] = df_delayed['dep_time_blk'].map(dept_blk_freq)
df_delayed.head()

In [None]:
df_train = df_delayed.drop(['cancelled','diverted','tail_num','op_carrier','flights','crs_elapsed_time','actual_elapsed_time','dep_delay','arr_delay','arr_del15','arr_delay_group','dep_delay_group','distance_group','origin','dest','dep_time_blk','arr_time_blk'],axis='columns', inplace=False)
df_train.dtypes

In [None]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(df_train.corr()[['arr_delay_new']].sort_values(by='arr_delay_new', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features correlating with arrival delay time', fontdict={'fontsize':12}, pad=16);

# Model Building

In [None]:
import numpy as np

#Import ML models to be used
from sklearn.linear_model import LinearRegression
#Libraries for model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import StratifiedKFold
#Libraries for model evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

In [None]:
def run_model(model, X_train, y_train, X_test, y_test, verbose=True, desc = 'No name model'):
    """Fit the model; when verbose, print its test score and plot the residuals."""
    #fit() takes no verbose argument in scikit-learn, so always fit the same way
    model.fit(X_train,y_train)
    if not verbose:
        return

    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print('Model: ', desc)
    #coef_ only exists on linear models, so read it defensively
    print('Coefficients: ', getattr(model, 'coef_', None))
    #For regressors, score() returns the coefficient of determination (R^2)
    print('Variance score: {}'.format(model.score(X_test, y_test)))

    ## setting plot style
    plt.style.use('fivethirtyeight')

    ## plotting residual errors (prediction - actual) in training data
    plt.scatter(pred_train, pred_train - y_train,
                color = "green", s = 10, label = 'Train data')

    ## plotting residual errors in test data
    plt.scatter(pred_test, pred_test - y_test,
                color = "blue", s = 10, label = 'Test data')

    ## plotting line for zero residual error across the range of predictions
    plt.hlines(y = 0, xmin = min(pred_train.min(), pred_test.min()),
               xmax = max(pred_train.max(), pred_test.max()), linewidth = 2)

    ## plotting legend
    plt.legend(loc = 'upper right')

    ## plot title
    plt.title("Residual errors")

    ## method call for showing the plot
    plt.show()

In [None]:
#Define initial model parameters
cv_n_split = 3
random_state = 42
test_train_split = 0.25
cv_iter = 5
# Create empty dict to hold model results
model_results = {}

In [None]:
y = df_train.pop('arr_delay_new')
X = df_train
print(X.shape)
print(y.shape)

In [None]:
X.dtypes

In [None]:
#Split into training and test sets
X_train,X_test, y_train,y_test = train_test_split(X , y , random_state = random_state, shuffle = True, test_size = test_train_split)

In [None]:
print(X_train.shape,X_test.shape, y_train.shape,y_test.shape)

In [None]:
#Create Linear Regression model
model_lr1 = LinearRegression()
run_model(model_lr1, X_train, y_train, X_test, y_test)
#model_lr1 = model_lr1.fit(X_train,y_train)

In [None]:
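#Predict the response for the test dataset
y_pred = model_lr1.predict(X_test)
#Regression errors in minutes (accuracy_score only applies to classifiers and fails on continuous targets)
print("MAE:", metrics.mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))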

In [None]:
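#Coefficient of determination (R^2) of the fitted linear model on the held-out test set
r_sq = model_lr1.score(X_test, y_test)
print('coefficient of determination:', r_sq)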

In [None]:
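import xgboost as xgb
#'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, reg_alpha = 10, n_estimators = 10, random_state = random_state)
xg_reg.fit(X_train,y_train)
y_pred = xg_reg.predict(X_test)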