## Which airports have the largest number of departure delays?

This Jupyter notebook contains executable Spark code written in Scala.

To try it out, first install [Jupyter](http://jupyter.org/) and the [Spark Kernel](https://github.com/ibm-et/spark-kernel). The kernel has to be built from source, which takes a *very* long time. Follow the instructions in the project's wiki to install it and integrate it with Jupyter.

Let's get to work!

Load the 2008 file from the [dataset](http://stat-computing.org/dataexpo/2009/the-data.html) and split each line into fields.


In [8]:
import org.apache.spark._

val data = sc.textFile("2008.csv").map(_.split(","))

The next chunk, which removes the CSV header row, is copied from [a Stack Overflow answer](http://stackoverflow.com/questions/24299427/how-do-i-convert-csv-file-to-rdd).


In [9]:
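// Maps each column name in the header row to its position, so fields can be read by name.
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}
// The first row of the file is the header.
val header = new SimpleCSVHeader(data.first())
// Drop the header row: it is the only row whose Origin field is the literal column name.
val filtered = data.filter(line => header(line, "Origin") != "Origin")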
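
To see what the header helper does without a Spark cluster, here is a minimal sketch on plain Scala collections. The sample rows and the `sample*` names are made up for illustration, and only three of the file's columns are included.

```scala
// Same helper as in the notebook, repeated so this sketch runs on its own.
class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}

// Hypothetical sample; the real file has many more columns.
val sampleLines = Seq(
  "DepDelay,Origin,Dest",
  "8,IAD,TPA",
  "NA,IND,BWI",
  "-4,IND,BWI"
)
val sampleRows = sampleLines.map(_.split(","))
val sampleHeader = new SimpleCSVHeader(sampleRows.head)

// Drop the header row by checking whether the Origin field holds the column name itself.
val sampleFiltered = sampleRows.filter(row => sampleHeader(row, "Origin") != "Origin")

// Read a field by name instead of by position.
val sampleOrigins = sampleFiltered.map(row => sampleHeader(row, "Origin"))
```

Looking fields up by name keeps the code readable and avoids hard-coded column positions.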

Keep only the departure delay and origin airport fields, and retain the flights whose departure delay is greater than zero.

In [10]:
// Column 15 is DepDelay and column 16 is Origin; keep flights that departed late.
val delayed = filtered.map(row => (row(15), row(16))).filter { case (depDelay, _) => depDelay != "NA" && depDelay.toInt > 0 }
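
The filter above assumes that every value other than `NA` parses as an integer. A more defensive variant could look like the sketch below, written against plain Scala collections rather than an RDD. The sample pairs and the `parseDelay` helper are hypothetical and not part of the notebook.

```scala
import scala.util.Try

// Hypothetical (DepDelay, Origin) pairs, standing in for the two fields kept above.
val samplePairs = Seq(("8", "IAD"), ("NA", "IND"), ("-4", "IND"), ("34", "ATL"), ("", "ORD"))

// Parse the delay once; anything that is not an integer (including "NA") becomes None.
def parseDelay(raw: String): Option[Int] = Try(raw.trim.toInt).toOption

// Keep only flights that left later than scheduled (delay strictly greater than zero).
val sampleDelayed = samplePairs.flatMap { case (rawDelay, origin) =>
  parseDelay(rawDelay).filter(_ > 0).map(minutes => (minutes, origin))
}
```

With this approach a malformed field is skipped instead of throwing a `NumberFormatException` partway through the job.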

Count the delayed departures per airport and collect the top ten.

In [23]:
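// Count delayed departures per origin airport, then take the ten largest counts.
val results = delayed.groupBy(_._2).map { case (origin, flights) => (origin, flights.size) }.sortBy(_._2, ascending = false).take(10)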
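
The aggregation above collects every delayed flight into a group per airport only to measure the size of each group. Since only the counts are needed, a running total per airport is enough. The sketch below illustrates that idea on plain Scala collections; the sample data and the `countSample`, `counts`, and `sampleTop` names are made up for illustration.

```scala
// Hypothetical (DepDelay, Origin) pairs for flights that already passed the delay filter.
val countSample = Seq(
  ("8", "IAD"), ("34", "ATL"), ("19", "IAD"),
  ("7", "ORD"), ("52", "ATL"), ("3", "ATL")
)

// Fold each record into a running count per origin; no per-airport list is ever built.
val counts = countSample.foldLeft(Map.empty[String, Int]) { case (acc, (_, origin)) =>
  acc.updated(origin, acc.getOrElse(origin, 0) + 1)
}

// Sort by count, largest first, and keep the top entries (two here, ten in the notebook).
val sampleTop = counts.toSeq.sortBy { case (_, n) => -n }.take(2)
```

In Spark, the same idea is usually expressed by mapping each record to `(origin, 1)` and calling `reduceByKey(_ + _)`, which combines counts within each partition before the shuffle. Whether that matters for a single year of data is a judgment call.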

The 10 airports with the largest number of delayed departures, along with their delay counts, are:

In [24]:
results.foreach(println)

(ATL,175017)
(ORD,159427)
(DFW,127749)
(DEN,104414)
(LAX,87258)
(IAH,87139)
(PHX,82915)
(LAS,76240)
(EWR,69612)
(DTW,59837)
