## The Setup
Use this dataset of airline arrival information to predict how late flights will be. A flight only counts as late if it is more than 30 minutes late.

I am going to treat this as a classification problem: each flight is either late (more than 30 minutes) or not.

## Exploring the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing




In [2]:
df = pd.read_csv('2008.csv')

In [3]:
# Right away I can tell that this is a large data set that will require long processing times, so let's take a random sample.
flights = df.sample(frac=0.05)

# Pro: faster processing. Con: randomly removing data
# might discard valuable insights.

# When prototyping, working on a subset of the data is fine, but in production we would want to train on the full data set.
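
One thing the sampling step leaves open is reproducibility: without a seed, every rerun draws different rows. A minimal sketch, assuming a small made-up frame (`full` is a hypothetical stand-in for the 2008 table) and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full 2008 table; only the row count matters here.
full = pd.DataFrame({"ArrDelay": np.arange(1000, dtype=float)})

# random_state pins the draw, so two runs select exactly the same rows.
sample_a = full.sample(frac=0.05, random_state=42)
sample_b = full.sample(frac=0.05, random_state=42)

print(len(sample_a))                           # number of sampled rows
print(sample_a.index.equals(sample_b.index))   # same rows in the same order?
```

Fixing the seed makes the downstream counts and scores repeatable while prototyping; it does not change the trade-off of working on a subset.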


In [4]:
flights.info()

# Could also limit the data set by selecting certain airports or carriers.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350486 entries, 5146665 to 6760897
Data columns (total 29 columns):
Year                 350486 non-null int64
Month                350486 non-null int64
DayofMonth           350486 non-null int64
DayOfWeek            350486 non-null int64
DepTime              343752 non-null float64
CRSDepTime           350486 non-null int64
ArrTime              342948 non-null float64
CRSArrTime           350486 non-null int64
UniqueCarrier        350486 non-null object
FlightNum            350486 non-null int64
TailNum              346343 non-null object
ActualElapsedTime    342784 non-null float64
CRSElapsedTime       350442 non-null float64
AirTime              342784 non-null float64
ArrDelay             342784 non-null float64
DepDelay             343752 non-null float64
Origin               350486 non-null object
Dest                 350486 non-null object
Distance             350486 non-null int64
TaxiIn               342948 non-null float64
Ta

In [5]:
flights = flights.reset_index(drop=True)
flights.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,9,21,7,1015.0,1016,1141.0,1142,EV,4761,...,2.0,29.0,0,,0,,,,,
1,2008,5,1,4,2130.0,2107,33.0,2356,NW,250,...,4.0,30.0,0,,0,18.0,0.0,14.0,0.0,5.0
2,2008,4,16,3,1048.0,1045,1200.0,1210,WN,664,...,6.0,8.0,0,,0,,,,,
3,2008,8,25,1,23.0,2359,824.0,815,B6,88,...,9.0,10.0,0,,0,,,,,
4,2008,4,27,7,2016.0,2010,2139.0,2139,US,1007,...,6.0,20.0,0,,0,,,,,


In [6]:
#check for nulls
print(flights.isnull().sum())

Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                6734
CRSDepTime                0
ArrTime                7538
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                4143
ActualElapsedTime      7702
CRSElapsedTime           44
AirTime                7702
ArrDelay               7702
DepDelay               6734
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 7538
TaxiOut                6784
Cancelled                 0
CancellationCode     343683
Diverted                  0
CarrierDelay         274306
WeatherDelay         274306
NASDelay             274306
SecurityDelay        274306
LateAircraftDelay    274306
dtype: int64


I wasn't sure whether to drop the rows with null values in the delay columns. After looking, about 85% of these flights were cancelled (possibly more, given the missing data), so it is fair to assume the values are null because the flights were cancelled. The remaining flights appear to have been diverted, so I am going to drop those as well.
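
The check behind that statement can be sketched as follows. This is a toy illustration with made-up rows (the `toy` frame is hypothetical), not the notebook's actual calculation; it only assumes the `ArrDelay`, `Cancelled`, and `Diverted` columns shown above:

```python
import numpy as np
import pandas as pd

# Toy rows, not real data: two cancelled, one diverted, two completed flights.
toy = pd.DataFrame({
    "ArrDelay":  [np.nan, np.nan, np.nan, 12.0, 45.0],
    "Cancelled": [1, 1, 0, 0, 0],
    "Diverted":  [0, 0, 1, 0, 0],
})

# Restrict to rows with no arrival delay recorded.
missing = toy[toy["ArrDelay"].isnull()]

# Share of those rows explained by a cancellation or a diversion.
share_cancelled = missing["Cancelled"].mean()
share_diverted = missing["Diverted"].mean()

# Rows with a missing delay that neither flag explains.
unexplained = missing[(missing["Cancelled"] == 0) & (missing["Diverted"] == 0)]

print(share_cancelled, share_diverted, len(unexplained))
```

If `unexplained` comes back empty on the real sample, dropping the null `ArrDelay` rows removes only cancelled and diverted flights.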

In [7]:
flights = flights[pd.notnull(flights['ArrDelay'])]

In [8]:
flights = flights[flights.Cancelled != 1]
flights = flights[flights.Diverted != 1]
flights= flights.drop(['Cancelled','Diverted'],axis=1)

In [9]:
flights.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 342784 entries, 0 to 350485
Data columns (total 27 columns):
Year                 342784 non-null int64
Month                342784 non-null int64
DayofMonth           342784 non-null int64
DayOfWeek            342784 non-null int64
DepTime              342784 non-null float64
CRSDepTime           342784 non-null int64
ArrTime              342784 non-null float64
CRSArrTime           342784 non-null int64
UniqueCarrier        342784 non-null object
FlightNum            342784 non-null int64
TailNum              342784 non-null object
ActualElapsedTime    342784 non-null float64
CRSElapsedTime       342784 non-null float64
AirTime              342784 non-null float64
ArrDelay             342784 non-null float64
DepDelay             342784 non-null float64
Origin               342784 non-null object
Dest                 342784 non-null object
Distance             342784 non-null int64
TaxiIn               342784 non-null float64
TaxiOut  

It now makes sense to drop CancellationCode too, since the cancelled flights have already been removed.

In [10]:
flights= flights.drop(['CancellationCode'],axis=1)

In [11]:
# Likely going to want to drop other columns, so let's count unique values to spot likely categorical columns.
for col in flights.columns: 
    print('There are {} unique values for {}'.format((len(flights[col].unique())),col))

There are 1 unique values for Year
There are 12 unique values for Month
There are 31 unique values for DayofMonth
There are 7 unique values for DayOfWeek
There are 1372 unique values for DepTime
There are 1168 unique values for CRSDepTime
There are 1436 unique values for ArrTime
There are 1330 unique values for CRSArrTime
There are 20 unique values for UniqueCarrier
There are 7413 unique values for FlightNum
There are 5278 unique values for TailNum
There are 571 unique values for ActualElapsedTime
There are 482 unique values for CRSElapsedTime
There are 554 unique values for AirTime
There are 597 unique values for ArrDelay
There are 563 unique values for DepDelay
There are 300 unique values for Origin
There are 301 unique values for Dest
There are 1368 unique values for Distance
There are 113 unique values for TaxiIn
There are 227 unique values for TaxiOut
There are 440 unique values for CarrierDelay
There are 303 unique values for WeatherDelay
There are 357 unique values for NASDelay


We don't need Year, but how should I handle Month, DayofMonth, and DayOfWeek?

Also, for the types of delay, should I look at each one as a fraction of the total delay?

I will also drop Origin and Dest: they are categorical, but with about 300 unique values each it would take far too much computational power to create dummies. The same goes for FlightNum and TailNum. UniqueCarrier has only 20 values, so it can be turned into dummies.
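
On the calendar question, one option (not the one this notebook ends up using, since the columns are dropped later) is a cyclical encoding, which avoids dummies and keeps December next to January. A minimal sketch, assuming 1-based values as in the data; the helper `add_cyclical` and the `toy` frame are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Month": [1, 4, 7, 12], "DayOfWeek": [1, 3, 5, 7]})

def add_cyclical(frame, col, period):
    """Return a copy with sin/cos columns for a 1-based periodic column."""
    out = frame.copy()
    # Map 1..period onto the unit circle, with value 1 at angle 0.
    angle = 2 * np.pi * (out[col] - 1) / period
    out[col + "_sin"] = np.sin(angle)
    out[col + "_cos"] = np.cos(angle)
    return out

encoded = add_cyclical(add_cyclical(toy, "Month", 12), "DayOfWeek", 7)
print(encoded.round(3))
```

Each calendar column becomes two numeric features instead of up to 31 dummies. Whether that helps here is an open question given how weakly these columns correlate with the target.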



In [12]:
flights= flights.drop(['Year','Dest','Origin','FlightNum','TailNum',],axis=1)

In [13]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342784 entries, 0 to 350485
Data columns (total 21 columns):
Month                342784 non-null int64
DayofMonth           342784 non-null int64
DayOfWeek            342784 non-null int64
DepTime              342784 non-null float64
CRSDepTime           342784 non-null int64
ArrTime              342784 non-null float64
CRSArrTime           342784 non-null int64
UniqueCarrier        342784 non-null object
ActualElapsedTime    342784 non-null float64
CRSElapsedTime       342784 non-null float64
AirTime              342784 non-null float64
ArrDelay             342784 non-null float64
DepDelay             342784 non-null float64
Distance             342784 non-null int64
TaxiIn               342784 non-null float64
TaxiOut              342784 non-null float64
CarrierDelay         76180 non-null float64
WeatherDelay         76180 non-null float64
NASDelay             76180 non-null float64
SecurityDelay        76180 non-null float64
LateAi

In [14]:
# For the delay-type columns, I want to remove all the null values. Shorter delays tend to have more null values,
# so let's assume the type of delay there is 'other' and replace the null values with 0.
cols = ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay',
        'LateAircraftDelay']
for i in cols:
    flights[i] = flights[i].fillna(0)

In [15]:
#want to turn delay into indicator
flights['ArrDelayInd'] = np.where(flights['ArrDelay'] > 30,1,0)
flights = flights.drop(['ArrDelay'], axis=1)

In [16]:
#let's check how many were delayed 
flights['ArrDelayInd'].value_counts()

0    297412
1     45372
Name: ArrDelayInd, dtype: int64

Wow, only 13% of flights were delayed. We are dealing with a class imbalance that we would likely want to fix.
Would we want to downsample?
@Vincent, can we go over this?

Per Vincent: when you train a model, it learns the relationship between the features and the target, and it also learns the distribution of the outcome variable (whether one class is more common than the other), which is valuable.

Run the model first, then try some subsampling to see whether it makes the model better.
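
The subsampling idea can be sketched as follows. This is a minimal illustration on a made-up frame (the `toy` data and the seed are assumptions), not something the notebook runs:

```python
import pandas as pd

# Toy frame with the same kind of imbalance: 4 delayed rows, 16 on-time rows.
toy = pd.DataFrame({
    "TaxiOut": range(20),
    "ArrDelayInd": [1] * 4 + [0] * 16,
})

delayed = toy[toy["ArrDelayInd"] == 1]
on_time = toy[toy["ArrDelayInd"] == 0]

# Keep every delayed row and draw an equally sized random set of on-time rows.
on_time_down = on_time.sample(n=len(delayed), random_state=0)

# Combine and shuffle so the classes are interleaved.
balanced = pd.concat([delayed, on_time_down]).sample(frac=1, random_state=0)

print(balanced["ArrDelayInd"].value_counts())
```

If this is tried, the downsampling would normally be applied to the training split only, so the test set keeps the real class proportions.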

In [17]:
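# UniqueCarrier is a string column; recent pandas versions need numeric_only=True to skip it.
corrmat = flights.corr(numeric_only=True)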

corrmat['ArrDelayInd'].sort_values(ascending=False)

ArrDelayInd          1.000000
DepDelay             0.684936
LateAircraftDelay    0.504109
NASDelay             0.433202
CarrierDelay         0.377119
TaxiOut              0.291878
DepTime              0.189259
WeatherDelay         0.172276
CRSDepTime           0.131860
CRSArrTime           0.126867
ActualElapsedTime    0.093182
ArrTime              0.093077
TaxiIn               0.088545
AirTime              0.041141
CRSElapsedTime       0.031160
SecurityDelay        0.025075
Distance             0.018812
DayOfWeek            0.011401
DayofMonth          -0.002252
Month               -0.036813
Name: ArrDelayInd, dtype: float64

In [18]:
#ok rather than creating dummies for month, day of month, day of week, it might be easier to just remove them 
flights = flights.drop(['Month','DayofMonth','DayOfWeek'], axis=1)

In [19]:

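# Build the feature table: drop the target and every other delay column
# (DepDelay and the five delay-cause columns).
df = flights.copy()
df = df.drop(['ArrDelayInd'], axis=1)
delay_cols = [name for name in df.columns if 'Delay' in name]
df = df.drop(delay_cols, axis=1)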


In [21]:
# df= df.drop(['CancellationCode'],axis=1)

In [22]:
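# Now I'm going to create dummies for carrier.
# dtype='uint8' keeps the dummies numeric; recent pandas versions default to bool.
df2 = pd.get_dummies(df['UniqueCarrier'], dtype='uint8')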

In [23]:
# df2 was built from df, so both share the same index; join_axes was removed in pandas 1.0.
feats = pd.concat([df2, df], axis=1)

In [24]:
feats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342784 entries, 0 to 350485
Data columns (total 31 columns):
9E                   342784 non-null uint8
AA                   342784 non-null uint8
AQ                   342784 non-null uint8
AS                   342784 non-null uint8
B6                   342784 non-null uint8
CO                   342784 non-null uint8
DL                   342784 non-null uint8
EV                   342784 non-null uint8
F9                   342784 non-null uint8
FL                   342784 non-null uint8
HA                   342784 non-null uint8
MQ                   342784 non-null uint8
NW                   342784 non-null uint8
OH                   342784 non-null uint8
OO                   342784 non-null uint8
UA                   342784 non-null uint8
US                   342784 non-null uint8
WN                   342784 non-null uint8
XE                   342784 non-null uint8
YV                   342784 non-null uint8
DepTime              342784 n

In [25]:
feats = feats.drop(['UniqueCarrier'], axis=1)
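
The three steps in cells 22 to 25 (make dummies, concatenate, drop the original column) can be seen on a small scale. A minimal sketch on a made-up frame; the `toy` rows and the non-contiguous index are assumptions chosen to mimic the filtered data:

```python
import pandas as pd

# Toy frame with a gappy index, like the flights table after filtering.
toy = pd.DataFrame(
    {"UniqueCarrier": ["WN", "AA", "WN", "DL"], "Distance": [300, 1200, 450, 800]},
    index=[3, 7, 8, 10],
)

# One 0/1 column per carrier code, in sorted order.
dummies = pd.get_dummies(toy["UniqueCarrier"], dtype="uint8")

# Both frames share toy's index, so a column-wise concat lines the rows up.
feats_toy = pd.concat([dummies, toy.drop(columns=["UniqueCarrier"])], axis=1)

print(feats_toy)
```

Because the dummies inherit the index of the column they came from, no explicit join argument is needed for the rows to match.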

## Baseline Logistic Regression

In [26]:
# feats = feats.drop(['ArrDelayInd'], axis=1)

# Select only numeric variables to scale.
df_num = feats.select_dtypes(include=[np.number]).dropna()

# Save the column names.
names = df_num.columns

# Scale, then turn the resulting numpy array back into a data frame with the correct column names.
# Use df_num's index so the rows still line up if dropna() removed any.
feats_scaled = pd.DataFrame(preprocessing.scale(df_num), columns=names, index=df_num.index)
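
Scaling the whole table before splitting lets the test rows contribute to the means and variances. A minimal sketch of one alternative, assuming synthetic data from `make_classification` (not the flight data) and a scikit-learn pipeline, so the scaler is fit on the training rows only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly the same class balance (about 13% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.87, 0.13], random_state=0
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from X_tr only; score() reuses them on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```

The same pipeline object can be passed to `cross_val_score`, which then refits the scaler inside every fold.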

In [27]:
feats_scaled['ArrDelayInd'] = flights['ArrDelayInd']
from sklearn.linear_model import LogisticRegression
X = feats_scaled.drop(['ArrDelayInd'], axis=1, inplace=False)
# Dependent variable
Y = feats_scaled['ArrDelayInd']


In [28]:
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import linear_model

X_train, X_test, Y_train, Y_test =model_selection.train_test_split(X,Y, test_size=0.30, random_state=42) 
regr = LogisticRegression()
# Fit the variables to the logistic model.
regr.fit(X_train, Y_train)

Y_pred = regr.predict(X_test)
print('\n Logistic Accuracy by ArrDelayInd')
print(pd.crosstab(Y_pred, Y_test))

print('\nThe accuracy for train set: ',format(regr.score(X_train, Y_train)))
print('The accuracy for test set: ',format(regr.score(X_test, Y_test)))


 Logistic Accuracy by ArrDelayInd
ArrDelayInd      0     1
row_0                   
0            89166  6268
1              135  7267

The accuracy for train set:  0.9363778818744061
The accuracy for test set:  0.9377358123614299


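Most of the errors are flights predicted to be on time that were actually delayed (6,268 in the test set, against only 135 on-time flights predicted as delayed).

The model is learning something beyond the baseline of predicting every flight to be on time, which would be right about 87% of the time.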
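
The comparison with the baseline can be worked out directly from the test-set crosstab above. A short sketch of the arithmetic; the variable names are mine, and the counts are copied from the printed table:

```python
# Counts copied from the logistic crosstab (rows: predicted, columns: actual).
true_neg, false_neg = 89166, 6268   # predicted on time
false_pos, true_pos = 135, 7267     # predicted delayed

total = true_neg + false_neg + false_pos + true_pos

# Share of all test flights classified correctly.
accuracy = (true_neg + true_pos) / total

# Accuracy of always predicting "on time": the share of flights that really were on time.
baseline = (true_neg + false_pos) / total

# Share of the truly delayed flights that the model flags as delayed.
recall = true_pos / (true_pos + false_neg)

print(f"accuracy {accuracy:.4f}  baseline {baseline:.4f}  recall {recall:.4f}")
```

Accuracy beats the baseline by about seven percentage points, but recall shows that only a little over half of the delayed flights are caught, which is where the 6,268 misses come from.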

In [29]:
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)



Coefficients: 
 [[ -3.23904052e-02   5.53609262e-02  -3.13559388e-02  -1.53731755e-03
    5.81817893e-03  -8.53200441e-03  -7.63409014e-02   2.68437286e-02
   -3.44475137e-02  -1.19966061e-02  -6.77596801e-02   4.75076305e-02
   -3.74782388e-02  -1.65794797e-02  -2.37951163e-02   4.93099586e-02
   -5.92939620e-02   5.43225569e-02   1.58941419e-02   1.51232640e-02
    1.10365670e+01  -1.03882937e+01  -1.17114933e-01   1.06937150e-01
    2.20903836e+00  -4.02626371e+00   2.18814631e+00  -3.78959219e-01
    1.83463370e-01   5.74253205e-01]]

Intercept: 
 [-2.59817196]


In [30]:
score = cross_val_score(regr, X, Y, cv=10)
# cross_val_score uses the classifier's default scorer, which is accuracy, not R2.
print('\nEach cross-validated accuracy score: \n', score)
print("\nOverall Logistic Regression accuracy: %0.2f (+/- %0.2f)\n" % (score.mean(), score.std() * 2))


Each cross-validated accuracy score: 
 [ 0.93550175  0.9367853   0.93981563  0.93681078  0.93529377  0.93593559
  0.93546881  0.9364607   0.9349437   0.93517708]

Overall Logistic Regression accuracy: 0.94 (+/- 0.00)
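
For a classifier, `cross_val_score` without a `scoring` argument reports accuracy. A minimal sketch on synthetic data (an assumption, not the flight data) that makes this explicit and shows how to ask for a metric that is more sensitive to the minority class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the flight features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.87, 0.13], random_state=0
)
clf = LogisticRegression(max_iter=1000)

# No scoring argument: falls back to the estimator's own score(), i.e. accuracy.
default_scores = cross_val_score(clf, X_demo, y_demo, cv=5)

# Same folds, metric named explicitly.
accuracy_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

# Recall on the positive (delayed) class instead of overall accuracy.
recall_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="recall")

print(default_scores.mean(), recall_scores.mean())
```

With an imbalanced target, the recall scores are usually the more informative of the two, since accuracy is dominated by the on-time class.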



In [31]:
predict_train = regr.predict(X_train)
predict_test = regr.predict(X_test)

# Accuracy tables.
table_train = pd.crosstab(Y_train, predict_train, margins=True)
table_test = pd.crosstab(Y_test, predict_test, margins=True)

train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']

test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']

print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))

Training set accuracy:
Percent Type I errors: 0.0014253088169103305
Percent Type II errors: 0.06219680930868355

Test set accuracy:
Percent Type I errors: 0.0013127698471352444
Percent Type II errors: 0.060951417791434905
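
The crosstab lookups above can also be wrapped in a small helper. A minimal sketch using `sklearn.metrics.confusion_matrix`; the function name `error_rates` and the toy labels are hypothetical, and the rates are fractions of all rows, matching the convention used in this notebook:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def error_rates(y_true, y_pred):
    """Return (type_I, type_II) error rates as fractions of all rows."""
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tn + fp + fn + tp
    # Type I: on-time flight predicted delayed. Type II: delayed flight predicted on time.
    return fp / total, fn / total

# Toy labels: one false positive and two false negatives out of ten rows.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1])

type_i, type_ii = error_rates(y_true, y_pred)
print(type_i, type_ii)
```

Dividing by the number of actual negatives or positives instead of the total would give the false positive and false negative rates, which are easier to compare across data sets with different class balance.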


In [32]:
corrmat = feats_scaled.corr()

corrmat['ArrDelayInd'].sort_values(ascending=False)

ArrDelayInd          1.000000
TaxiOut              0.291878
DepTime              0.189259
CRSDepTime           0.131860
CRSArrTime           0.126867
ActualElapsedTime    0.093182
ArrTime              0.093077
TaxiIn               0.088545
AirTime              0.041141
AA                   0.033746
CRSElapsedTime       0.031160
UA                   0.026892
Distance             0.018812
OH                   0.016491
B6                   0.016045
XE                   0.015883
MQ                   0.015233
YV                   0.014829
CO                   0.012854
EV                   0.010777
FL                  -0.002576
NW                  -0.007380
AS                  -0.009221
DL                  -0.009911
AQ                  -0.010843
OO                  -0.013416
9E                  -0.013961
F9                  -0.015782
US                  -0.024311
HA                  -0.025485
WN                  -0.035896
Name: ArrDelayInd, dtype: float64

In [35]:
# from IPython.display import display
# correlation_matrix = feats_scaled.corr()
# display(np.where(correlation_matrix.values<1.0,correlation_matrix.values,0).max())


In [36]:
# correlation_matrix

This is pretty great. I am going to try a couple of other models to see if they improve on it.


## Random Forest

In [39]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

X = feats_scaled.drop(['ArrDelayInd'], axis=1, inplace=False)
# Dependent variable
Y = feats_scaled['ArrDelayInd']
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.30, random_state=42)

# Fit a random forest with default settings on the same split as the logistic model.
rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, Y_train)
Y_pred = rfc.predict(X_test)

print('\n RF Accuracy by ArrDelayInd')
print(pd.crosstab(Y_pred, Y_test))

print("\nAccuracy on training set: {:.3f}".format(rfc.score(X_train, Y_train)))
print("\nAccuracy on test set: {:.3f}".format(rfc.score(X_test, Y_test)))


rfc_score = cross_val_score(rfc, X, Y, cv=5)
# cross_val_score reports accuracy for a classifier, not R2.
print('\nEach cross-validated accuracy score: \n', rfc_score)
print("\nOverall Random Forest accuracy: %0.2f (+/- %0.2f)\n" % (rfc_score.mean(), rfc_score.std() * 2))


 RF Accuracy by ArrDelayInd
ArrDelayInd      0     1
row_0                   
0            88884  3741
1              417  9794

Accuracy on training set: 0.998

Accuracy on test set: 0.960

Each cross-validated accuracy score: 
 [ 0.96308235  0.96054436  0.96149134  0.96500671  0.96440866]

Overall Random Forest accuracy: 0.96 (+/- 0.00)



In [40]:
predict_train = rfc.predict(X_train)
predict_test = rfc.predict(X_test)

# Accuracy tables.
table_train = pd.crosstab(Y_train, predict_train, margins=True)
table_test = pd.crosstab(Y_test, predict_test, margins=True)

train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']

test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']

print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))

Training set accuracy:
Percent Type I errors: 5.417840532115292e-05
Percent Type II errors: 0.0021004550986046975

Test set accuracy:
Percent Type I errors: 0.004055000194484422
Percent Type II errors: 0.03637831109728111


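Random Forest does perform better overall. It reduces our Type II error (delayed flights predicted as on time) from about 6.1% to 3.6% of the test set and improves test accuracy from about 93.8% to 96.0%. The Type I error rises slightly, from about 0.13% to 0.41%, and the gap between training accuracy (0.998) and test accuracy (0.960) suggests some overfitting.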
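
The two models can be put side by side using only the test-set crosstabs printed in this notebook. A short sketch of the arithmetic; the dictionary layout and names are mine:

```python
# Test-set counts copied from the two crosstabs (rows: predicted, columns: actual).
models = {
    "logistic":      {"tn": 89166, "fn": 6268, "fp": 135, "tp": 7267},
    "random_forest": {"tn": 88884, "fn": 3741, "fp": 417, "tp": 9794},
}

summary = {}
for name, c in models.items():
    total = sum(c.values())
    summary[name] = {
        # Share of all flights classified correctly.
        "accuracy": (c["tn"] + c["tp"]) / total,
        # Share of truly delayed flights that were flagged.
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        # Share of flagged flights that really were delayed.
        "precision": c["tp"] / (c["tp"] + c["fp"]),
    }

for name, metrics in summary.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The random forest catches a clearly larger share of the delayed flights, while the logistic model is slightly more precise when it does predict a delay.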