# 2015 Flight Delay & Cancellation

### Business Problem

Classify whether a flight will arrive more than 10 minutes late.

**Feature Columns**

MONTH, DAY, DAY_OF_WEEK: data type int <br>
AIRLINE: data type string (airline code, treated as categorical) <br>
FLIGHT_NUMBER: data type int <br>
ORIGIN_AIRPORT and DESTINATION_AIRPORT: data type string <br>
SCHEDULED_DEPARTURE, DEPARTURE_TIME, DEPARTURE_DELAY,
SCHEDULED_ARRIVAL, ARRIVAL_TIME: data type float <br>
ARRIVAL_DELAY: the target, transformed into a boolean variable indicating an arrival delay of more than 10 minutes <br>
DISTANCE and AIR_TIME: data type float <br>


You can learn more about this dataset from the following Kaggle link:
https://www.kaggle.com/usdot/flight-delays/data?source=post_page---------------------------

**Objective**:
To determine whether a flight will be delayed.

### Import necessary packages

In [None]:
!pip freeze | grep pandas

In [None]:
import pandas as pd
import numpy as np
import time
import pandas_profiling
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import seaborn as sns

## Loading Data

In [None]:
df = pd.read_csv("../input/flight-delays/flights.csv")                  # Reading the dataset
df.head()

## Data Summary

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

- The dataset is large, with about 5.8 million (58 lakh) rows and 31 features, so we reduce it by dropping the features we do not need.
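
One way to cut the column count early is to ask pandas to parse only the columns of interest while reading. The sketch below uses a hypothetical three-row stand-in for `flights.csv` (the values and the `wanted` list are illustrative assumptions) and the `usecols` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for flights.csv, with one column (TAIL_NUMBER) we do not need
csv_text = """MONTH,DAY,AIRLINE,TAIL_NUMBER,ARRIVAL_DELAY
1,1,AS,N407AS,-22
1,1,AA,N3KUAA,-9
1,1,US,N171US,5
"""

# Columns to keep (illustrative subset of the notebook's feature list)
wanted = ["MONTH", "DAY", "AIRLINE", "ARRIVAL_DELAY"]

# usecols makes pandas parse only the listed columns
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.shape)
```

On the full file, the same argument can be passed alongside the real path so that unwanted columns are never loaded into memory.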

## Selecting Features

In [None]:
# Selecting important features

df = df[["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT", 
         "SCHEDULED_DEPARTURE", "DEPARTURE_TIME", "DEPARTURE_DELAY", 
         "SCHEDULED_ARRIVAL", "ARRIVAL_TIME", "ARRIVAL_DELAY", "AIR_TIME", "DISTANCE"]]

## Data Sample

In [None]:
df = df.sample(n=10000, random_state= 10, axis=0)
df.shape

## Pandas Profiling

In [None]:
report = pandas_profiling.ProfileReport(df)
report.to_file('flight_df.html')

In [None]:
from IPython.display import display, HTML, IFrame
display(HTML(open('flight_df.html').read()))

**Observation:-** As per Pandas Profiling
- High Correlation between:-
    - MONTH & df_index, 
    - DEPARTURE_TIME & SCHEDULED_DEPARTURE, 
    - ARRIVAL_DELAY & DEPARTURE_DELAY, 
    - DISTANCE & AIR_TIME. 
- DESTINATION_AIRPORT and ORIGIN_AIRPORT mostly hold 3-letter alphabetic codes, but some values are numeric and need cleaning.
- DEPARTURE_DELAY has 572 (5.7%) zeros; these are on-time departures, so no action is needed.
- ARRIVAL_DELAY has 232 (2.3%) zeros; these are on-time arrivals, so no action is needed.
- DEPARTURE_TIME has 179 (1.8%) missing values
- DEPARTURE_DELAY has 179 (1.8%) missing values
- ARRIVAL_TIME has 185 (1.8%) missing values
- ARRIVAL_DELAY has 207 (2.1%) missing values
- AIR_TIME has 207 (2.1%) missing values
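
The missing-value and zero counts reported by the profiling tool can also be reproduced directly with pandas. This is a minimal sketch on a hypothetical toy frame (the values are made up); the same three lines would apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two columns of the sample
toy = pd.DataFrame({
    "DEPARTURE_DELAY": [5.0, np.nan, 0.0, 12.0],
    "ARRIVAL_DELAY": [3.0, np.nan, np.nan, 20.0],
})

missing_count = toy.isna().sum()        # number of NaN values per column
missing_pct = toy.isna().mean() * 100   # share of NaN values per column, in percent
zero_count = (toy == 0).sum()           # exact zeros per column (NaN never equals 0)

print(missing_count, missing_pct, zero_count, sep="\n")
```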

## Missing Values

In [None]:
# A few origin and destination airport values are numeric

# Replace every non-alphabetic value in the origin and destination airport features with np.nan
def Replace(i):
    # Keep purely alphabetic airport codes as strings
    if isinstance(i, str) and i.isalpha():
        return i
    # Anything else (numeric codes, missing values) becomes NaN
    return np.nan


In [None]:
# Applying function to replace
df['DESTINATION_AIRPORT'] = df['DESTINATION_AIRPORT'].apply(func=Replace)
df['ORIGIN_AIRPORT'] = df['ORIGIN_AIRPORT'].apply(func=Replace)
df.isna().sum()

**Observation**
- Departure Time, Departure Delay, Arrival Time, Arrival Delay and Air Time have missing values.
- Let's drop these rows: they make up only about 2% of the sample, and filling them with the median or mode would not give a realistic departure or arrival time.
- Air Time is missing for the same rows that are missing arrival and departure times and delays, so we drop all NaN rows.
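
Both claims in the observation can be checked before dropping anything: whether two columns are missing on the same rows, and what share of rows `dropna` would remove. A minimal sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; AIR_TIME is missing exactly where ARRIVAL_DELAY is missing
toy = pd.DataFrame({
    "DEPARTURE_DELAY": [5.0, np.nan, 0.0, 12.0, -3.0],
    "ARRIVAL_DELAY": [3.0, np.nan, np.nan, 20.0, -8.0],
    "AIR_TIME": [90.0, np.nan, np.nan, 45.0, 120.0],
})

# True when both columns are missing on exactly the same rows
same_rows = toy["AIR_TIME"].isna().equals(toy["ARRIVAL_DELAY"].isna())

# Share of rows removed when every row containing a NaN is dropped
cleaned = toy.dropna()
lost_share = 1 - len(cleaned) / len(toy)

print(same_rows, len(cleaned), lost_share)
```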

In [None]:
# Dropping all NAN missing values
df.dropna(inplace=True)
df.shape

**We now have 9,013 rows with 15 features.**

In [None]:
df.head()

# Data Visualization

## Avg. Departure Delay based on AIRLINE

In [None]:
df_delay = df[df.DEPARTURE_DELAY >= 1]
dep_delayed_flights = df_delay.groupby(['AIRLINE'], as_index=False).agg({'DEPARTURE_DELAY': 'mean'})

f,ax = plt.subplots(figsize=(10, 8))
sns.barplot(x='AIRLINE', y='DEPARTURE_DELAY', data=dep_delayed_flights, ax=ax)
ax.set_title('Airline Departure Delay Distribution', fontsize=16)
ax.set_ylabel("Departure Delay", fontsize=16)
ax.set_xlabel("Airlines", fontsize=16)
plt.close(2)
plt.show()

## Avg. Arrival Delay based on AIRLINE

In [None]:
# Keep only flights that arrived late, then average the arrival delay per airline
df_delay1 = df[df.ARRIVAL_DELAY >= 1]
arr_delayed_flights = df_delay1.groupby(['AIRLINE'], as_index=False).agg({'ARRIVAL_DELAY': 'mean'})

f,ax = plt.subplots(figsize=(10, 8))
sns.barplot(x='AIRLINE', y='ARRIVAL_DELAY', data=arr_delayed_flights, ax=ax)
ax.set_title('Airline Arrival Delay Distribution', fontsize=16)
ax.set_ylabel("Arrival Delay", fontsize=16)
ax.set_xlabel("Airlines", fontsize=16)
plt.close(2)
plt.show()

## Top 10 Airports with max DEPARTURE_DELAY

In [None]:
# To find the max 10th departure delay
df.nlargest(10, 'DEPARTURE_DELAY')[9:]

In [None]:
# We see that the 10th largest value for Departure Delay is 429 minutes;
# nlargest(10) selects the top 10 rows without hard-coding a threshold

dep_delay_airports = df.nlargest(10, 'DEPARTURE_DELAY')[['ORIGIN_AIRPORT', 'DEPARTURE_DELAY']].copy()

dep_delay_airports['ORIGIN_AIRPORT'] = dep_delay_airports['ORIGIN_AIRPORT'].astype('category')

f, ax= plt.subplots(figsize=(10, 6))
sns.barplot(x='ORIGIN_AIRPORT', y='DEPARTURE_DELAY', data=dep_delay_airports, ax=ax)
ax.set_title('Departure Delay Distribution of Origin Airports', fontsize=16)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
plt.close(2)
plt.show()


## Top 10 Airports with max ARRIVAL_DELAY

In [None]:
# To find the max 10th arrival delay
df.nlargest(10, 'ARRIVAL_DELAY')[9:]

In [None]:
# We see that the 10th largest value for Arrival Delay is 434 minutes;
# nlargest(10) selects the top 10 rows without hard-coding a threshold

arr_delay_airports = df.nlargest(10, 'ARRIVAL_DELAY')[['DESTINATION_AIRPORT', 'ARRIVAL_DELAY']].copy()
arr_delay_airports['DESTINATION_AIRPORT'] = arr_delay_airports['DESTINATION_AIRPORT'].astype('category')


f, ax= plt.subplots(figsize=(10, 6))
sns.barplot(x='DESTINATION_AIRPORT', y='ARRIVAL_DELAY', data=arr_delay_airports, ax=ax, saturation=.8)
ax.set_title('Arrival Delay Distribution of Destination Airports', fontsize=16)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
plt.close(2)
plt.show()


## Departure Delay on Monthly basis

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(x='MONTH', y="DEPARTURE_DELAY", data=df, size='DEPARTURE_DELAY', hue='AIRLINE', sizes=(50, 200))
plt.legend(bbox_to_anchor=(1.5,1) , loc='upper right')

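**Observation** - Our sample has no rows for month 10 (October). This is most likely a side effect of the cleaning step, where rows with numeric airport codes were set to NaN and dropped.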

##  Arrival Delay on Monthly basis

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(x='MONTH', y="ARRIVAL_DELAY", data=df, size='ARRIVAL_DELAY', hue='AIRLINE', sizes=(50, 200))
plt.legend(bbox_to_anchor=(1.5,1) , loc='upper right')

**Observation**
- Both plots above show that the largest delays occur in February, June and December.
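
The same comparison can be read off a small table instead of a scatter plot. The sketch below groups hypothetical toy values by month and ranks the months by their largest delay; applied to `df`, the same `groupby` would give the per-month maximum and mean of `ARRIVAL_DELAY`.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the sampled flights
toy = pd.DataFrame({
    "MONTH": [1, 1, 2, 2, 6, 6],
    "ARRIVAL_DELAY": [5.0, 30.0, 400.0, 12.0, 250.0, -4.0],
})

# Largest and average arrival delay per month
by_month = toy.groupby("MONTH")["ARRIVAL_DELAY"].agg(["max", "mean"])

# Months ordered from the largest single delay to the smallest
worst_months = by_month["max"].sort_values(ascending=False)
print(worst_months)
```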

## Top 10 maximum delay flight numbers

In [None]:
# Select the 10 flights with the largest arrival delay
arr_delay_flightnum = df.nlargest(10, 'ARRIVAL_DELAY')[['FLIGHT_NUMBER', 'ARRIVAL_DELAY', 'AIRLINE']]
f, ax = plt.subplots(figsize=(14, 8))
sns.barplot(x='FLIGHT_NUMBER', y='ARRIVAL_DELAY', data=arr_delay_flightnum, hue='AIRLINE', ax=ax)

ax.legend(bbox_to_anchor=(1, 1), loc='upper right')


## Feature Engineering

In [None]:
# Inspect the raw ARRIVAL_DELAY values before converting the column to a binary target
df['ARRIVAL_DELAY'].value_counts()

In [None]:
df["ARRIVAL_DELAY"] = (df["ARRIVAL_DELAY"] > 10).astype(int)    # 1 if the delay is greater than 10 mins, else 0
df['ARRIVAL_DELAY'].value_counts()

In [None]:
# So we see that 2033 flights in our sample data have an arrival delay of more than 10 minutes
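
The two counts reported so far (9,013 rows after cleaning, 2,033 of them delayed) give the class balance. The short calculation below is only arithmetic on those numbers; the variable names are illustrative. It shows that about 22.6% of the sample is delayed, so a model that always predicts "not delayed" would be right about 77.4% of the time on data with this balance, which is a useful reference point when reading accuracy figures.

```python
n_rows = 9013      # rows left after dropping missing values (from the notebook)
n_delayed = 2033   # flights delayed by more than 10 minutes (from the notebook)

# Share of the positive class and the accuracy of always predicting "not delayed"
delayed_share = n_delayed / n_rows
baseline_accuracy = 1 - delayed_share

print(f"delayed share: {delayed_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```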

In [None]:
df.head()

## Converting Data Type

In [None]:
df.info()

In [None]:
# We have features like AIRLINE, DESTINATION_AIRPORT, ORIGIN_AIRPORT which are categorical data
# Hence convert them to category

In [None]:
# Categorical columns

cols = ["AIRLINE","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    df[item] = df[item].astype("category")

# Let's check the data types again
df.info()

In [None]:
# Now let's label-encode the categorical features for model building
from sklearn.preprocessing import LabelEncoder
col = ['AIRLINE', 'DESTINATION_AIRPORT', 'ORIGIN_AIRPORT']

# Fit a separate encoder per column so each feature gets its own 0..n-1 integer codes
df[col] = df[col].apply(lambda s: LabelEncoder().fit_transform(s.astype(str)))
df.head()

## Splitting Data in X & y

In [None]:
X = df.drop(columns='ARRIVAL_DELAY')
y = df['ARRIVAL_DELAY']

In [None]:
X.head()

In [None]:
# Standardizing data X

from sklearn.preprocessing import StandardScaler

# Let's use StandardScaler to standardize the data (zero mean, unit variance)
scaler = StandardScaler()
scaler.fit(X)

# Scale and center the data
X_norm = scaler.transform(X)

# Create a pandas DataFrame
X = pd.DataFrame(data=X_norm, index=X.index, columns=X.columns)


In [None]:
# Train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=0.3)
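
The cells above fit the scaler on all of `X` before splitting. A common alternative is to split first and fit the scaler on the training part only, so that test rows do not influence the scaling statistics. The sketch below illustrates that ordering on hypothetical random data (`X_toy`, `y_toy` and the other names are assumptions, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical random features and binary labels standing in for X and y
rng = np.random.default_rng(10)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, using the same test_size and random_state as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=10)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)

# Apply the same training statistics to both parts
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
```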

### XGBoost

In [None]:
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

In [None]:
# Function for model evaluation

def auc(m, X_train, X_test): 
    return (metrics.roc_auc_score(y_train,m.predict_proba(X_train)[:,1]),
            metrics.roc_auc_score(y_test,m.predict_proba(X_test)[:,1]))

In [None]:
%%time
# XGBoost Model (%%time must be the first line of the cell to time the whole cell)
model = xgb.XGBClassifier(max_depth=50, min_child_weight=1, n_estimators=200,
                          n_jobs=-1, verbosity=1, learning_rate=0.2)
model.fit(X_train, y_train)

auc(model, X_train, X_test)

In [None]:
y_pred = model.predict(X_test)

In [None]:
print('Accuracy: ', metrics.accuracy_score(y_test,y_pred))
print('')
print('********************************************')
print('Confusion matrix')
lr_cfm=metrics.confusion_matrix(y_test, y_pred)


# Rows and columns follow the class order 0 (not delayed), 1 (delayed)
lbl1=["Predicted 0", "Predicted 1"]
lbl2=["Actual 0", "Actual 1"]

sns.heatmap(lr_cfm, annot=True, cmap="Blues", fmt="d", xticklabels=lbl1, yticklabels=lbl2)
plt.show()

print('**********************************************')
print(metrics.classification_report(y_test,y_pred))

### LightGBM

In [None]:
import lightgbm as lgb  # LightGBM gradient boosting library

In [None]:
# Function to evaluate a fitted classifier (used for LightGBM and CatBoost)

def auc2(m, X_train, X_test):
    # Ranking metrics need positive-class probabilities, not hard 0/1 labels
    y_train_prob = m.predict_proba(X_train)[:, 1]
    y_test_prob = m.predict_proba(X_test)[:, 1]
    # The confusion matrix needs hard class labels
    y_test_pred = m.predict(X_test)

    print('ROC AUC Train Score: ', metrics.roc_auc_score(y_train, y_train_prob))
    print('ROC AUC Test Score: ', metrics.roc_auc_score(y_test, y_test_prob))
    print('Avg. Precision Score: ', metrics.average_precision_score(y_test, y_test_prob))
    print('Confusion Matrix: \n', metrics.confusion_matrix(y_test, y_test_pred))

In [None]:
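def gini(y_true, y_score):
    # Gini coefficient derived from the ROC curve: 2 * AUC - 1
    fpr, tpr, thr = metrics.roc_curve(y_true, y_score, pos_label=1)
    g = 2 * metrics.auc(fpr, tpr) - 1
    return g

def gini_lgb(y_true, y_score):
    # Custom metric in the (y_true, y_pred) form used by the LightGBM scikit-learn API
    # Normalized Gini: model Gini divided by the Gini of a perfect ranking
    score = gini(y_true, y_score) / gini(y_true, y_true)
    return 'gini', score, True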
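
The `gini` helper above relies on the identity Gini = 2 × AUC − 1. A minimal sketch on four hypothetical scores shows the relationship, using `roc_auc_score` from scikit-learn; the arrays are made-up illustration data.

```python
from sklearn import metrics

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ranked correctly, so AUC is 0.75
auc_value = metrics.roc_auc_score(y_true, y_score)

# Gini rescales AUC so that a random ranking scores 0 and a perfect ranking scores 1
gini_value = 2 * auc_value - 1

print(auc_value, gini_value)
```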


In [None]:
%%time
model2 = lgb.LGBMClassifier(n_estimators=90, 
                     silent=False, 
                     random_state =94, 
                     max_depth=5, 
                     num_leaves=30, 
                     objective='binary', 
                     metrics ='auc')

model2.fit(X_train, y_train, eval_metric=gini_lgb)

In [None]:
auc2(model2, X_train, X_test)

In [None]:
y_test_pred = model2.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test,y_test_pred))
print('********************************************')
print('Confusion matrix')
lr_cfm=metrics.confusion_matrix(y_test, y_test_pred)


# Rows and columns follow the class order 0 (not delayed), 1 (delayed)
lbl1=["Predicted 0", "Predicted 1"]
lbl2=["Actual 0", "Actual 1"]

sns.heatmap(lr_cfm, annot=True, cmap="Blues", fmt="d", xticklabels=lbl1, yticklabels=lbl2)
plt.show()

print('**********************************************')
print(metrics.classification_report(y_test,y_test_pred))

### Catboost

In [None]:
!pip install catboost

In [None]:
import catboost as cb

In [None]:
cat_features_index = [0,1,2,3,4,5,6]  # indices of the categorical columns (defined for reference; not passed to the model below)

In [None]:
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(X_train,y_train)


In [None]:
auc2(clf, X_train, X_test)

In [None]:
y_test_p = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test,y_test_p))
print('********************************************')
print('Confusion matrix')
lr_cfm=metrics.confusion_matrix(y_test, y_test_p)


# Rows and columns follow the class order 0 (not delayed), 1 (delayed)
lbl1=["Predicted 0", "Predicted 1"]
lbl2=["Actual 0", "Actual 1"]

sns.heatmap(lr_cfm, annot=True, cmap="Blues", fmt="d", xticklabels=lbl1, yticklabels=lbl2)
plt.show()

print('**********************************************')
print(metrics.classification_report(y_test, y_test_p))

**Conclusion:-**
- Based on the above analysis, XGBoost is the best of the three models (XGBoost, LightGBM and CatBoost), with 93% accuracy.
- The other two models reach 92% accuracy.