# Airline Arrivals

We have a dataset with details about flights, and we want to predict whether a flight arrives late. A flight counts as late only if it arrives more than 30 minutes behind schedule, so we first inspect the data to see how to derive that label.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

In [2]:
df = pd.read_csv('C:/Users/vivek/Downloads/2005.csv')

In [3]:
df.head()

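| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005 | 1 | 28 | 5 | 1603.0 | 1605 | 1741.0 | 1759 | UA | 541 | ... | 4 | 23 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2005 | 1 | 29 | 6 | 1559.0 | 1605 | 1736.0 | 1759 | UA | 541 | ... | 6 | 15 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2005 | 1 | 30 | 7 | 1603.0 | 1610 | 1741.0 | 1805 | UA | 541 | ... | 9 | 18 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2005 | 1 | 31 | 1 | 1556.0 | 1605 | 1726.0 | 1759 | UA | 541 | ... | 11 | 10 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2005 | 1 | 2 | 7 | 1934.0 | 1900 | 2235.0 | 2232 | UA | 542 | ... | 5 | 10 | 0 | | 0 | 0 | 0 | 0 | 0 | 0 |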


Our data contains a column `ArrDelay`, which records how many minutes late each flight arrived. We check whether this value is greater than 30 and store the result in a new label column.

In [4]:
df['Delayed'] = 0

In [5]:
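# .at accepts only a single row/column label; boolean masks need .loc.
# Rows keep the default 0 unless ArrDelay exceeds 30 (a missing ArrDelay also stays 0).
df.loc[df['ArrDelay'] > 30, 'Delayed'] = 1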
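
As a small illustration of the labelling rule, the same comparison can be written as a single vectorized expression. The sketch below uses a made-up `toy` frame (not the flight data) to show how boundary and missing values are treated; the `ArrDelay` values in it are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy ArrDelay values in minutes; NaN stands in for a flight with no recorded delay.
toy = pd.DataFrame({'ArrDelay': [-5.0, 30.0, 31.0, 120.0, np.nan]})

# Comparing NaN with a number yields False, so missing delays are labelled 0 here.
toy['Delayed'] = (toy['ArrDelay'] > 30).astype(int)

print(toy['Delayed'].tolist())  # [0, 0, 1, 1, 0]
```

A delay of exactly 30 minutes is labelled 0, which matches the "more than 30 minutes" definition above.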

In [6]:
df.columns

Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
       'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
       'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
       'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay',
       'Delayed'],
      dtype='object')

We have a lot of columns that we don't need, so let's drop them.

In [7]:
df = df.drop(['ArrTime', 'TaxiIn', 'TaxiOut', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'Origin', 'Dest', 'ActualElapsedTime', 'TailNum', 'CancellationCode', 'ArrDelay', 'UniqueCarrier'], axis=1)

In [8]:
df.describe()

| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | CRSArrTime | FlightNum | CRSElapsedTime | AirTime | DepDelay | Distance | Cancelled | Diverted | Delayed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7140596.0 | 7140596.0 | 7140596.0 | 7140596.0 | 7006866.0 | 7140596.0 | 7140596.0 | 7140596.0 | 7140596.0 | 6992838.0 | 7006866.0 | 7140596.0 | 7140596.0 | 7140596.0 | 7140596.0 |
| mean | 2005.0 | 6.48116 | 15.71931 | 3.944549 | 1344.534 | 1337.973 | 1499.84 | 2042.659 | 125.9049 | 101.2756 | 8.674313 | 723.7402 | 0.01872813 | 0.001964542 | 0.1150949 |
| std | 0.0 | 3.410521 | 8.78596 | 1.989965 | 476.7736 | 464.2816 | 480.4065 | 1841.605 | 69.75044 | 82.46829 | 31.19505 | 571.1465 | 0.1355632 | 0.0442796 | 0.3191364 |
| min | 2005.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | -40.0 | -1428.0 | -1199.0 | 11.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2005.0 | 4.0 | 8.0 | 2.0 | 933.0 | 930.0 | 1120.0 | 584.0 | 75.0 | 54.0 | -4.0 | 316.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2005.0 | 6.0 | 16.0 | 4.0 | 1331.0 | 1328.0 | 1523.0 | 1446.0 | 107.0 | 83.0 | 0.0 | 562.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2005.0 | 9.0 | 23.0 | 6.0 | 1735.0 | 1725.0 | 1912.0 | 3172.0 | 155.0 | 130.0 | 7.0 | 950.0 | 0.0 | 0.0 | 0.0 |
| max | 2005.0 | 12.0 | 31.0 | 7.0 | 2805.0 | 2359.0 | 2359.0 | 9584.0 | 660.0 | 1956.0 | 1930.0 | 4962.0 | 1.0 | 1.0 | 1.0 |


The dataset contains 7,140,596 rows, so it is large. Because of limited computational power, we won't compare different machine learning algorithms and will use a random forest only.
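
One common way to make such comparisons affordable is to work on a stratified subsample. This is not something this notebook does; the sketch below only illustrates the idea on synthetic stand-in data. The arrays `X_demo` and `y_demo`, the 10% sample size, and the roughly 11.5% positive rate are all assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 10,000 rows with roughly 11.5% positive labels.
y_demo = (rng.random(10_000) < 0.115).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# Keep 10% of the rows; stratify=y_demo preserves the class proportions.
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.1, stratify=y_demo, random_state=0
)

print(len(y_small), y_demo.mean(), y_small.mean())
```

Because the split is stratified, the share of positive labels in `y_small` stays close to the share in `y_demo`.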

In [9]:
df = df.dropna()

In [10]:
X = df.drop('Delayed', axis=1)
Y = df['Delayed']

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [12]:
rfc = RandomForestClassifier(n_estimators=10)
param_grid = {
    'n_estimators': [10, 20],
    # For classifiers 'auto' was an alias of 'sqrt'; recent scikit-learn releases no longer accept it.
    'max_features': ['auto', 'sqrt']
}
CV_rfc = GridSearchCV(rfc, param_grid=param_grid, cv=3)
CV_rfc.fit(X_train, Y_train)
print("Best parameters: ", CV_rfc.best_params_)
print("Train data score: ", CV_rfc.score(X_train, Y_train))
print("Test data score: ", CV_rfc.score(X_test, Y_test))

Best parameters:  {'max_features': 'sqrt', 'n_estimators': 20}
Train data score:  0.9983301008193527
Test data score:  0.9661796923710538


The model does a decent job here, with about 96.6% accuracy on the test set and 99.8% on the training set. The size of our data may be partly responsible for such a successful model. Now let's try reducing its dimensionality using PCA and check how the score is affected.
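
To put the accuracy in context, it helps to compare it with a trivial baseline that always predicts "not delayed". The arithmetic below reuses the mean of `Delayed` from the `describe()` output. That mean was computed before `dropna()`, so the class balance of the rows actually used for training may differ somewhat; treat the result as a rough estimate.

```python
# Values copied from the describe() and score outputs above.
delayed_fraction = 0.1150949          # mean of 'Delayed' before dropna()
test_accuracy = 0.9661796923710538    # grid-searched random forest

# Accuracy of a hypothetical model that always predicts "not delayed".
baseline_accuracy = 1 - delayed_fraction

print(round(baseline_accuracy, 4))                  # 0.8849
print(round(test_accuracy - baseline_accuracy, 4))  # 0.0813
```

Under that assumption, the random forest gains roughly 8 percentage points over the always-on-time baseline.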

### PCA

In [13]:
pca = PCA(.95)
pca.fit(X_train)
x_pca = pca.transform(X_train)
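
Passing a float such as `.95` to `PCA` keeps the smallest number of components whose cumulative explained variance reaches that fraction. PCA is sensitive to feature scale, and the features here are not standardized, so large-valued columns may dominate the components. The sketch below illustrates this effect on synthetic data only; the three toy columns and the use of `StandardScaler` are assumptions for the example, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical independent features on very different scales
# (standard deviations of 1, 10 and 500).
X_toy = np.column_stack([
    rng.normal(0, 1, 1000),
    rng.normal(0, 10, 1000),
    rng.normal(0, 500, 1000),
])

# Unscaled: the widest column carries almost all of the variance.
pca_raw = PCA(0.95).fit(X_toy)

# Standardized: each column contributes roughly equally.
pca_scaled = PCA(0.95).fit(StandardScaler().fit_transform(X_toy))

print(pca_raw.n_components_, np.cumsum(pca_raw.explained_variance_ratio_))
print(pca_scaled.n_components_, np.cumsum(pca_scaled.explained_variance_ratio_))
```

On this toy data the unscaled fit keeps a single component, while the standardized fit keeps all three. Whether scaling would change the result on the flight data has not been tested here.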

In [15]:
rfc = RandomForestClassifier(n_estimators=20, max_features='sqrt')
rfc.fit(x_pca, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [16]:
x_test_pca = pca.transform(X_test)
print("Models score on train data: ", rfc.score(x_pca, Y_train))
print("Models score on test data: ", rfc.score(x_test_pca, Y_test))

Models score on train data:  0.9973039079225448
Models score on test data:  0.9334404905589145


So PCA lowered test accuracy by about 3 percentage points (from 96.6% to 93.3%), which would not be acceptable in a production environment. Accuracy on the training set, however, stays close to the previous model's (99.7% versus 99.8%).