We will analyze a dataset of flights in the USA provided by the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time). In the zip file provided, you will find a directory *airline_data* containing the files 2016_1.csv, ..., 2016_6.csv. Each line of a file (except the header) contains, in this order, the flight date, the airline ID, the flight number, the origin airport, the destination airport, the departure time, the departure delay in minutes, the arrival time, the arrival delay in minutes, the air time in minutes, and the distance between both airports in miles.

Preparation (if not done already):

1. Copy all the CSV files to your virtual machine (e.g., via scp). Alternatively, you can log in to the virtual machine and download the zip file directly from Absalon.
2. Create a directory 'airline_data' on your local Hadoop cluster. Afterwards, copy the CSV files from your virtual machine to the Hadoop cluster via 'hadoop fs -put airline_data/*.csv airline_data/'.

In [None]:
# generate an RDD based on all the csv files in the airline_data directory.
# MAKE SURE THAT THE DIRECTORY CONTAINS ONLY THE 2016_*.csv FILES
airline_data = sc.textFile("hdfs:///user/lsda/airline_data/*.csv")

In [None]:
# each csv file contains a header describing the data
header = airline_data.first()
print("Header information given in the first csv file:\n\n{}".format(header))

In [None]:
# filter the RDD to remove the header lines (each csv file starts
# with the same header, so comparing against the first one suffices)
airline_data = airline_data.filter(lambda line: line != header)

# get the first 10 elements and print them
print("First 10 elements of the RDD:")
airline_data.take(10)
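
The header filter can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming that every file starts with an identical header line. The header names and data rows are made up for illustration and are not taken from the actual files.

```python
# Hypothetical file contents: each "file" starts with the same header line.
header = "FL_DATE,AIRLINE_ID,FL_NUM"
file_1 = [header, "2016-01-01,19805,43", "2016-01-02,19805,44"]
file_2 = [header, "2016-02-01,19790,12"]

# Concatenating the files mimics reading them through one wildcard pattern.
all_lines = file_1 + file_2

# Take the first line as the header and drop every line equal to it.
first = all_lines[0]
data_lines = [line for line in all_lines if line != first]

print(len(all_lines), len(data_lines))  # 5 3
```

If the headers differed between files, the comparison against the first header would leave the other header lines in the data.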

In [None]:
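def parse(line):
    """Parse one csv line into (airline_id, origin, dest, dep_delay, arr_delay).

    Returns None if the line cannot be parsed, e.g., if a delay field is empty.
    """
    fields = line.split(',')

    try:
        airline_id = fields[1]
        origin = fields[3].strip('"')
        dest = fields[4].strip('"')
        dep_delay = float(fields[6])
        arr_delay = float(fields[8])

        return (airline_id, origin, dest, dep_delay, arr_delay)

    except (IndexError, ValueError):
        # too few fields or a non-numeric delay: simply return 'None'
        return None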
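
To see how the parsing function behaves, it can be tried on a few lines outside of Spark. The sketch below repeats a compact copy of the function so that it runs on its own. The sample lines are hypothetical and only mimic the column order described above; the real files may format fields differently.

```python
def parse(line):
    # Compact copy of the parsing function, repeated so the example is self-contained.
    fields = line.split(',')
    try:
        return (fields[1],
                fields[3].strip('"'),
                fields[4].strip('"'),
                float(fields[6]),
                float(fields[8]))
    except (IndexError, ValueError):
        return None

# Hypothetical sample lines (not copied from the dataset).
good = '2016-01-06,19805,43,"DFW","DTW","1057",-3.00,"1432",-6.00,132.00,986.00,'
no_delays = '2016-01-07,19805,43,"DFW","DTW","",,"",,,986.00,'
too_short = 'not,a,flight'

print(parse(good))       # ('19805', 'DFW', 'DTW', -3.0, -6.0)
print(parse(no_delays))  # None, because float('') raises a ValueError
print(parse(too_short))  # None, because fields[3] raises an IndexError
```

Note that splitting on ',' assumes that no quoted field contains a comma itself. This holds for the sample lines here, but it is an assumption about the data.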

In [None]:
# apply the parsing function to each element via the map
# transformation; afterwards, remove all elements that
# could not be parsed properly.
airlines = airline_data.map(parse)
airlines = airlines.filter(lambda record: record is not None)

In [None]:
# let's inspect the first ten elements
airlines.take(10)

In [None]:
# (a) Shortest Flight Distance

# YOUR CODE HERE

In [None]:
# (b) Late Arrival Counts

# YOUR CODE HERE

In [None]:
# (c) Mean and Standard Deviation for Arrival Delays

# YOUR CODE HERE

In [None]:
# (d) Top-10 of Arrival Delays

# YOUR CODE HERE

In [16]:
import numpy as np
np.mean(4-2+2)
np.std([4,5,6,7,8,9])

1.707825127659933
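
The value above is the population standard deviation, since np.std divides by n by default (ddof=0). The following sketch recomputes it with the standard library only and contrasts it with the sample standard deviation, which divides by n - 1. The variable names are chosen for illustration.

```python
import math
import statistics

values = [4, 5, 6, 7, 8, 9]

mean = sum(values) / len(values)
# Population variance: average squared deviation from the mean (divide by n).
pop_var = sum((v - mean) ** 2 for v in values) / len(values)
pop_std = math.sqrt(pop_var)

# Sample standard deviation: divide by n - 1 instead.
sample_std = statistics.stdev(values)

print(round(pop_std, 4))     # 1.7078
print(round(sample_std, 4))  # 1.8708
```

Which of the two definitions is wanted should be stated explicitly when reporting a standard deviation.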