## Predicting airplane delays using Random Forests ##

In this exercise we are going to play a bit with a well-known *Big Dataset* about plane trips. This notebook is an adaptation of a Scala notebook from the spark-notebook project:

[https://github.com/andypetrella/spark-notebook](https://github.com/andypetrella/spark-notebook)

We have translated the notebook to Python and adapted it for the Jupyter platform.

### The data ###

For this small example we are going to use a subset of the [*Airline on-time performance*](http://stat-computing.org/dataexpo/2009/) data, restricted to the year 2008. This data is already present in the workshop environment, so we only have to load it into Spark to get started. Let's explore what we have:

In [None]:
# initialize Spark
from pyspark import SparkContext, SparkConf
if 'sc' not in globals():
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

rawData = sc.textFile("file://///home/jovyan/work/data/2008.csv.gz")

In [None]:
print(rawData.count())

In [None]:
print(type(rawData))
header = rawData.first().split(",")
print(header)

Let's strip the header and convert each line into an array. We will also do a random split and keep only a small fraction of the rows, to reduce the amount of data for this notebook:

In [None]:
data = rawData.filter(lambda l: not(l.startswith("Year"))).map(lambda l: l.split(","))
data, rest = data.randomSplit([0.001,0.999], 123456)
print(data.count())
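
To get a feel for what a seeded, weighted random split does, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper `random_split` and the stand-in `rows` list are hypothetical and only illustrate the idea that each row lands in a bucket with probability proportional to its weight, so the resulting sizes are approximate rather than exact.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of the normalized weights
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                buckets[i].append(item)
                break
    return buckets

rows = list(range(100000))  # hypothetical stand-in for the parsed rows
sample, rest = random_split(rows, [0.001, 0.999], 123456)
print(len(sample), len(rest))
```

With weights of 0.001 and 0.999 the first bucket holds roughly one row in a thousand, and reusing the same seed reproduces the same split.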

In [None]:
print(data.take(2))

Let's pretty-print a row with headers to see the kind of data we have:

In [None]:
printSample = data.first()
from IPython.display import display, HTML

th = ["<th>" + d + "</th>" for d in header]
td = ["<td>" + d + "</td>" for d in printSample]

display(HTML("<table><thead><tr>" + "".join(th) + "</tr></thead><tbody><tr>" + "".join(td) + "</tr></tbody></table>"))

A list of airports will probably come in handy, so let's make one. The origin and destination airport codes are in positions 16 and 17 of each row, so we simply use those array indices to build a list of distinct airports.

In [None]:
airportsRDD = data.filter(lambda a : (a[16] != "NA" and a[17] != "NA")).flatMap(lambda a : a[16:18]).distinct()
print(airportsRDD.count())
print(airportsRDD.take(10))
airports = airportsRDD.collect()
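
The same "flatten, then deduplicate" idea can be sketched in plain Python. The `routes` list below is made up for illustration. Sorting the result is an assumption on our part, not something the notebook does: as far as we know Spark does not promise a particular order for the output of `distinct()`, and a sorted list would keep the airport-to-index mapping stable between runs.

```python
# Made-up (origin, destination) pairs, for illustration only
routes = [("SFO", "JFK"), ("ORD", "SFO"), ("JFK", "ORD"), ("NA", "LAX")]

codes = set()
for origin, dest in routes:
    # Skip rows where either airport is missing, as in the filter above
    if origin != "NA" and dest != "NA":
        codes.update((origin, dest))

airports_sorted = sorted(codes)  # sorting is our assumption, for stable indices
print(airports_sorted)  # ['JFK', 'ORD', 'SFO']
```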

### More exploration

Now that we have a good idea of the structure of the data, it's time to delve a little deeper. Let's examine the distributions of the various delays.

In [None]:
arrDelays = data.filter(lambda a : (a[14] != "NA")).flatMap(lambda a : ((str(a[16]), int(a[14])), (str(a[17]), int(a[14]))))
depDelays = data.filter(lambda a : (a[15] != "NA")).flatMap(lambda a : ((str(a[16]), int(a[15])), (str(a[17]), int(a[15]))))

In [None]:
print(arrDelays.take(4))
print(depDelays.take(4))

In order to make histograms we will group delays by airport.

In [None]:
# Collect the delays of each airport into a plain list
arrDelaysByAirportHist = arrDelays.groupByKey().mapValues(list)
depDelaysByAirportHist = depDelays.groupByKey().mapValues(list)
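
For readers new to key-value RDDs, grouping by key amounts to the following plain-Python sketch. The helper `group_by_key` and the sample pairs are hypothetical and exist only to show the shape of the result: one list of delays per airport code.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Made-up (airport, delay) pairs, for illustration only
pairs = [("SFO", 12), ("JFK", -3), ("SFO", 45), ("JFK", 0)]
by_airport = group_by_key(pairs)
print(by_airport["SFO"])  # [12, 45]
```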

With the delays grouped we can plot some histograms for San Francisco. Note that this can take a while - actual evaluation of the RDD will take place here.

In [None]:
# flatMap unpacks the grouped delays so that collect() returns one flat list
arrDelaysH = arrDelaysByAirportHist.filter(lambda kv: kv[0] == "SFO").flatMap(lambda kv: kv[1]).collect()
depDelaysH = depDelaysByAirportHist.filter(lambda kv: kv[0] == "SFO").flatMap(lambda kv: kv[1]).collect()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(arrDelaysH, bins=100, histtype='stepfilled', color="b", label="arrival")
plt.hist(depDelaysH, bins=100, histtype='stepfilled', color="r", alpha=0.5, label="departure")
plt.title("Delays for San Francisco airport: SFO")
plt.xlabel("Delay")
plt.ylabel("Frequency")
plt.legend()
plt.show()

### Machine Learning: Random Forest

Right, now let's apply some machine learning. First import some needed types.

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import DenseVector

Let's drop some of the categorical features for simplicity. We will label the data with the departure delay and consider the following as features:

In [None]:
features = header[0:8] + header[11:15] + header[16:18] + header[18:21]
print(features)

Now on to the boring (and slightly messy) part:
* clean the data: if a row contains NA, drop it
* convert each airport code to its index in the airport list, for use in the feature vector
* transform our data to the appropriate MLlib type: *LabeledPoint*

In [None]:
selectedData = data.filter(lambda a: "NA" not in a[0:21]).map(lambda a : 
                                   LabeledPoint(float(a[15]), 
                                   DenseVector([float(x) for x in a[0:8]] + 
                                               [float(x) for x in a[11:15]] + 
                                               [float(airports.index(a[16]))] + 
                                               [float(airports.index(a[17]))] + 
                                               [float(x) for x in a[18:21]]))).cache()

As usual we split this data in training and test sets:

In [None]:
training, testing = selectedData.randomSplit([0.7,0.3], 123456)

After all this work we can finally train the model. Note again that actual evaluation of the training RDD will take place here and can take a few minutes:

In [None]:
categoricalFeaturesInfo = {12:len(airports), 13:len(airports)}
# For actual applications we would typically use many more trees
numTrees = 10
featureSubsetStrategy = "auto"
impurity = "variance"
maxDepth = 4
maxBins = len(airports)

model = RandomForest.trainRegressor(training, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
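
The keys 12 and 13 in `categoricalFeaturesInfo` are positions in the feature vector, not columns of the original rows. A small plain-Python check, using only the slices from the cell that builds the feature vector, shows where the two airport columns (16 and 17) end up. The variable names are our own.

```python
# Source column numbers in the order they enter the feature vector:
# a[0:8], a[11:15], a[16], a[17], a[18:21]
source_columns = list(range(0, 8)) + list(range(11, 15)) + [16, 17] + list(range(18, 21))

origin_pos = source_columns.index(16)  # position of the origin airport
dest_pos = source_columns.index(17)    # position of the destination airport
print(origin_pos, dest_pos, len(source_columns))  # 12 13 17
```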

Although they are hard to interpret, we can take a look at the trained trees:

In [None]:
print("Learned regression forest model:\n" + model.toDebugString())

Now let's try to predict some test data:

In [None]:
test = testing.take(10)
predictions = [(point.label, model.predict(point.features), point.label - model.predict(point.features)) for point in test]

from IPython.display import display, HTML
tbody = ""
for tup in predictions:
    tbody = tbody + "<tr><td>" + str(tup[0]) + "</td><td>" + str(tup[1]) + "</td><td>" + str(tup[2]) + "</td></tr>"
# Each row in tbody already carries its own <tr>...</tr>, so no extra wrapper is needed
display(HTML("<table><thead><tr><th>Actual delay</th><th>Predicted delay</th><th>Difference</th></tr></thead><tbody>" + tbody + "</tbody></table>"))

Now for the MSE. First a baseline and then our model:

In [None]:
# Get some baseline
avgDelay = training.map(lambda p: p.label).mean()
print(avgDelay)
# Pair every test label with the constant baseline prediction
baseLineData = testing.map(lambda p: (p.label, avgDelay))
print(baseLineData.first())
mseBaseLine = baseLineData.map(lambda p: (p[0] - p[1])**2).mean()
print(mseBaseLine)

In [None]:
predictions = model.predict(testing.map(lambda p: p.features))
labelsAndPredictions = testing.map(lambda p: p.label).zip(predictions)
print(labelsAndPredictions.first())
testMSE = labelsAndPredictions.map(lambda p: (p[0] - p[1])**2).mean()
print(testMSE)
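
As a reminder of what the two numbers mean, here is the same calculation on a handful of made-up values in plain Python. The labels and predictions below are invented for illustration and are not results from the model. The baseline always predicts the mean of the training labels.

```python
def mean_squared_error(pairs):
    """Average of the squared differences between label and prediction."""
    pairs = list(pairs)
    return sum((label - prediction) ** 2 for label, prediction in pairs) / float(len(pairs))

# Made-up numbers, for illustration only
train_labels = [5.0, 15.0, 10.0]
test_labels = [10.0, -5.0, 30.0, 0.0]
model_predictions = [8.0, -1.0, 25.0, 2.0]

baseline = sum(train_labels) / len(train_labels)  # constant prediction: 10.0
baseline_mse = mean_squared_error((y, baseline) for y in test_labels)
model_mse = mean_squared_error(zip(test_labels, model_predictions))
print(baseline_mse, model_mse)  # 181.25 12.25
```

A model is only worth keeping if its MSE on the test set is clearly below that of the constant baseline.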

This concludes the notebook. Feel free to experiment more with the model and the data. As an exercise, try plotting the predicted and actual values. Alternatively, you can try to find out whether there are certain airports where the predictions are significantly better. Have fun!