# Predicting Airline Data using a Generalized Linear Model (GLM) in R

In particular, we will predict the probability that a flight is late based on its departure date/time, the expected flight time and distance, and the origin and destination airports.

The core of the machine learning part will be the [GLM function of R](http://www.statmethods.net/advstats/glm.html).

### Considerations

The objective of this notebook is to define a simple model offering a point of comparison in terms of computing performance across data science languages and libraries.  In other words, this notebook is not for you if you are looking for the most accurate model for airline predictions.  

## Install and Load useful libraries

#install.packages('caret', repos='http://cran.rstudio.com/')
#install.packages('ROCR', repos='http://cran.rstudio.com/')

In [1]:
library(caret)
library(ROCR)

Loading required package: lattice
Loading required package: ggplot2
Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess



## Load the data

- The dataset is taken from [http://stat-computing.org](http://stat-computing.org/dataexpo/2009/the-data.html).  We take the data corresponding to year 2008.
- We restrict the dataset to the first million rows
- We print all column names and the first 5 rows of the dataset

In [2]:
df = read.csv("2008.csv") 
nrow(df)

In [3]:
df = df[1:1000000,]

In [4]:
names(df)

In [5]:
head(df, 5)

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,⋯,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,⋯,4,8,0,,0,,,,,
2008,1,3,4,754,735,1002,1000,WN,3231,⋯,5,10,0,,0,,,,,
2008,1,3,4,628,620,804,750,WN,448,⋯,3,17,0,,0,,,,,
2008,1,3,4,926,930,1054,1100,WN,1746,⋯,3,7,0,,0,,,,,
2008,1,3,4,1829,1755,1959,1925,WN,3920,⋯,3,10,0,,0,2.0,0.0,0.0,0.0,32.0


## Data preparation for training

- We create a new "binary" column indicating whether the flight was delayed or not.
- We turn the origin/destination categorical data into a "one-hot" encoded representation
- We show the first 5 rows of the modified dataset
- We split the dataset into two parts:  a training dataset and a testing dataset containing 80% and 20% of the rows, respectively.
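
To illustrate what a "one-hot" encoding of a categorical column looks like, here is a minimal sketch on a toy data frame. The airport codes and the names `toy`, `onehot` and `dummy` are made up for this example and are not part of the notebook's dataset.

```r
# Toy data frame; the airport codes are only illustrative.
toy <- data.frame(Origin = factor(c("SFO", "JFK", "SFO", "ORD")))

# "- 1" removes the intercept so that every level gets its own 0/1 column.
onehot <- model.matrix(~ Origin - 1, data = toy)
print(onehot)

# With the default intercept, the first level (alphabetically) becomes the
# baseline and only the remaining levels get an indicator column.
dummy <- model.matrix(~ Origin, data = toy)
print(colnames(dummy))
```

Note that `glm()` performs this kind of expansion by itself when a predictor is a factor, so an explicit encoding step is mostly useful for inspecting the design matrix.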

In [6]:
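df = df[!is.na(df$ArrDelay),] # drop rows where the arrival delay is NA
df["IsArrDelayed"] <- as.numeric(df["ArrDelay"]>0) # 1 if the flight arrived late, 0 otherwise
# model.matrix() returns a multi-column matrix that cannot be stored in a single
# data frame column, so we store factors instead; glm() expands them into indicator columns.
df["Origin"] <- as.factor(df[,c("Origin")])
df["Dest"  ] <- as.factor(df[,c("Dest")])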

In [7]:
head(df, 5)

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,⋯,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed
2008,1,3,4,2003,1955,2211,2225,WN,335,⋯,8,0,,0,,,,,,0
2008,1,3,4,754,735,1002,1000,WN,3231,⋯,10,0,,0,,,,,,1
2008,1,3,4,628,620,804,750,WN,448,⋯,17,0,,0,,,,,,1
2008,1,3,4,926,930,1054,1100,WN,1746,⋯,7,0,,0,,,,,,0
2008,1,3,4,1829,1755,1959,1925,WN,3920,⋯,10,0,,0,2.0,0.0,0.0,0.0,32.0,1


In [8]:
trainIndex = sample(1:nrow(df), size = round(0.8*nrow(df)), replace=FALSE)
train = df[ trainIndex, ]
test  = df[-trainIndex, ]
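
The split above draws a different random sample on every run. Below is a minimal sketch of the same 80/20 split on a toy data frame, with a fixed seed so that the result is repeatable. The seed value and the names `toy`, `toy_train` and `toy_test` are assumptions added for illustration.

```r
set.seed(42)  # arbitrary seed, added here only for reproducibility

toy <- data.frame(id = 1:10)

# Sample 80% of the row indices without replacement.
idx <- sample(seq_len(nrow(toy)), size = round(0.8 * nrow(toy)), replace = FALSE)
toy_train <- toy[idx, , drop = FALSE]
toy_test  <- toy[-idx, , drop = FALSE]

# The two parts are disjoint and together cover every row.
print(nrow(toy_train))
print(nrow(toy_test))
print(length(intersect(toy_train$id, toy_test$id)))
```

Negative indexing (`-idx`) keeps exactly the rows that were not sampled, which is why the test set needs no second call to `sample()`.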

## Model building

- We define the generalized linear model with a binomial family, i.e. a logistic regression.
- We train the model and measure the training time: ~19 min on an Intel i7-6700K (4.0 GHz) for 800K rows
- We show the model summary
- We show the 10 most important variables
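
As a small-scale illustration of the steps listed above, the following sketch fits a logistic regression with `glm(..., family = binomial)` on simulated data and times it with `system.time()`. The simulated variables (`x`, `y`), the coefficients used to generate them and the seed are all assumptions made for the example; timings on the real dataset will be very different.

```r
set.seed(1)  # arbitrary seed for a repeatable simulation
n <- 1000
x <- rnorm(n)

# Simulated ground truth: log-odds = -0.5 + 1.5 * x
p <- plogis(-0.5 + 1.5 * x)
y <- rbinom(n, size = 1, prob = p)
toy <- data.frame(x = x, y = y)

# system.time() returns user, system and elapsed times in seconds.
timing <- system.time(
  toy_model <- glm(y ~ x, data = toy, family = binomial)
)
print(timing["elapsed"])
print(coef(toy_model))

# type = "response" returns probabilities instead of log-odds.
new_prob <- predict(toy_model, newdata = data.frame(x = c(-1, 0, 1)), type = "response")
print(new_prob)
```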

In [None]:
system.time(
    model <- glm(IsArrDelayed ~ Year + Month + DayofMonth + DayOfWeek + DepTime + AirTime + Origin + Dest + Distance
             ,data=train,family = binomial) 
)

In [None]:
summary(model)

In [None]:
vi <- varImp(model, scale = FALSE)
vi$Variable<-rownames(vi)
rownames(vi) <- NULL
vi = vi[ order(-vi[,1]), c("Variable", "Overall") ]
head(vi, 10)
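
To give an idea of what this importance score represents: as far as I understand, for a `glm` object `varImp` reports the absolute value of each coefficient's test statistic. The sketch below reproduces that idea with base R only, on simulated data; the variables `strong` and `weak` and the data frame `importance` are hypothetical and only serve the illustration.

```r
set.seed(2)  # arbitrary seed for a repeatable simulation
n <- 500
toy <- data.frame(strong = rnorm(n), weak = rnorm(n))

# "strong" drives the outcome much more than "weak" by construction.
toy$y <- rbinom(n, size = 1, prob = plogis(2 * toy$strong + 0.1 * toy$weak))
fit <- glm(y ~ strong + weak, data = toy, family = binomial)

# Take the absolute z statistic of every coefficient except the intercept.
coefs <- summary(fit)$coefficients
z <- abs(coefs[rownames(coefs) != "(Intercept)", "z value"])

# Sort from most to least important, mirroring the table built above.
importance <- data.frame(Variable = names(z), Overall = unname(z))
importance <- importance[order(-importance$Overall), ]
print(importance)
```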

## Model testing

- We add a model prediction column to the testing dataset
- We show the first 10 rows of the test dataset (with the new column)
- We show the model ROC curve
- We measure the model Area Under Curve (AUC) to be 0.706 on the testing dataset.  

This tells us that our model is not very accurate (we generally assume that a model is reasonable at predicting when it has an AUC above 0.8). That is acceptable here, since we are not trying to build the best possible model, only to compare data science code and performance across languages and libraries.
If you nonetheless want to improve this result, you should try adding more feature columns to the model.
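
For intuition about what the AUC measures, here is a minimal sketch that computes it without any package, using the rank-based (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The helper `auc_rank` and the toy `labels` and `scores` are hypothetical and not part of ROCR.

```r
# Rank-based AUC; labels must be 0/1 and scores numeric.
auc_rank <- function(scores, labels) {
  r <- rank(scores)              # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(scores, labels))  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random scores and 1 to a perfect ranking, which puts the value measured in this notebook in context.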

In [None]:
test["IsArrDelayedPred"] <- predict(model, newdata=test, type="response")
head(test, 10)

In [None]:
pred <- prediction(test$IsArrDelayedPred, test$IsArrDelayed)
perf <- performance(pred, measure = "tpr", x.measure = "fpr") 
plot(perf, col=rainbow(10))

In [None]:
AUC = performance(pred, measure = "auc")@y.values
AUC

## Key takeaways

- We built a GLM model predicting airline delay probability
- We trained it on 800K rows in ~19 min on an Intel i7-6700K (4.0 GHz)
- We measured an AUC of 0.706, which is not very accurate but sufficient for this performance comparison
- We demonstrated a typical workflow in the R language in a Jupyter notebook

I might be biased, but I didn't find the R documentation very easy to read (compared to the Python equivalent).