# NYC Taxi Dataset Project - Data Prep

## Overall Steps

**Step 0:** Prerequisites

**Step 1:** Start Spark Cluster

**Step 2:** Upload this notebook and packages

**Step 3:** Clean Data

**Step 4:** Aggregate Data

**Step 5:** Save RDD to file on s3


Note: Step 1 is based on the CS109 [instructions](https://piazza.com/class/icf0cypdc3243c?cid=1369). However, it includes modifications that optimize performance for this project.

### Step 0: Prerequisites

1. You need the files CS109.pem and credentials.csv. If you followed the CS109 instructions (for Lab 8 or HW5), you will already have these files.

2. You will need a directory containing the following files:
    
    a) CS109.pem
    
    b) credentials.csv
    
    c) Setup Project.ipynb
    
    d) myConfig.json
    
    e) DataPrep.ipynb (this notebook)
    
    f) geohash.py

### Step 1: Start Spark cluster and sanity check

#### Step 1a) Start your Spark cluster as described in Step 1 from Setup Project (unless your Spark cluster is already running)
#### Step 1b) Sanity check: make sure Spark cluster is working

In [1]:
env = 'local' # 'aws'

In [2]:
import numpy as np
import scipy as sp
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

In [3]:
if env == 'aws':
    import sys
    rdd = sc.parallelize(xrange(10),10)
    aa = rdd.map(lambda x: sys.version)
    aa.cache()
    aa.count()   
else:
    import geopy as gp
    import findspark
    findspark.init()
    import pyspark
    conf = (pyspark.SparkConf()
        .setMaster('local')
        .setAppName('pyspark')
        .set("spark.driver.memory", "3g")
        .set("spark.executor.memory", "3g"))
    sc = pyspark.SparkContext(conf=conf)

### Step 2: Upload this notebook and load data from s3

#### Upload this Jupyter Notebook and geohash.py using the console from http://localhost:8989

Notes: 
1. All the steps in Step 2 are to be executed from the Jupyter Notebook itself.
2. We will frequently be loading data from the S3 bucket you created in Step 3 of Setup Project. (I will use the bucket name "sdaultontestbucket"; replace this with your own.)

#### Gather data filenames


In [4]:
#Setup the variables

#Yellow/green cab filename prefix
yCabFNPrefix = "yellow_tripdata_"
gCabFNPrefix = "green_tripdata_"

#Availability of data set by month & year
yDict = {}
gDict = {}

#availability for Yellow cab
yDict[2015] = range(1,7) #available through June 2015
yDict[2014] = range(1,13)
yDict[2013] = range(1,13)

#availability for Green cab
gDict[2015] = range(1,7) #available through June 2015
gDict[2014] = range(1,13)
gDict[2013] = range(8,13) #available only from August 2013

# Yellow cab data file name list
# file name is of format:  yellow_tripdata_2015-01.csv
yCabUrls = []
for year, monthList in yDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = yCabFNPrefix+yearStr+'-'+monthStr+".csv"
        yCabUrls.append(url)

#  green cab data file name list
gCabUrls = []
for year, monthList in gDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = gCabFNPrefix+yearStr+'-'+monthStr+".csv"
        gCabUrls.append(url)
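
The two loops above differ only in the filename prefix and the availability dictionary. A minimal sketch of a shared helper follows; the name `build_filenames` is hypothetical and not part of the notebook, and it sorts the years so that the resulting order does not depend on dictionary iteration order.

```python
def build_filenames(prefix, availability):
    """Return names such as 'yellow_tripdata_2015-01.csv' for every (year, month) listed."""
    names = []
    for year in sorted(availability):
        for month in availability[year]:
            # '%02d' zero-pads single-digit months, replacing the manual "0" + monthStr step
            names.append("%s%d-%02d.csv" % (prefix, year, month))
    return names

# Hypothetical usage with two of the green cab availability ranges from above
example_names = build_filenames("green_tripdata_", {2013: range(8, 13), 2014: range(1, 13)})
```

With this helper, `yCabUrls` and `gCabUrls` could each be built by a single call with the matching prefix and dictionary.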

#### Preprocess the Yellow Cab Data

In [None]:
for filename in yCabUrls:

#### Process the Yellow Cab Data
1. Create an RDD to store all taxi data
2. Get the schema of the data file
3. Get Relevant Data:
    a. Pickup datetime
    b. Pickup latitude
    c. Pickup longitude
4. Round latitude and longitude to discretize locations
5. Get day of the week and hour for each pickup
6. Calculate the number of pickups per (day of the week, hour, location)
7. Aggregate data from all datafiles
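
The steps above can be sketched without Spark on a handful of rows. The sample rows below are invented for illustration (they are not taken from the dataset), and rounding to three decimal places stands in for the geohash step used later in this notebook.

```python
from collections import Counter
from datetime import datetime

# Invented sample rows: (pickup_datetime, pickup_latitude, pickup_longitude)
sample_rows = [
    ("2014-01-06 08:15:00", "40.75012", "-73.99321"),
    ("2014-01-06 08:40:00", "40.75049", "-73.99288"),
    ("2014-01-07 23:05:00", "40.64110", "-73.78010"),
]

def to_key(row):
    """Map one row to a (day of week, hour, location) key."""
    pickup = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    # Rounding discretizes the location; the notebook itself uses a geohash instead
    location = (round(float(row[1]), 3), round(float(row[2]), 3))
    return (pickup.isoweekday(), pickup.hour + 1, location)

# Count pickups per key, the local analogue of map(...) followed by reduceByKey(...)
pickup_counts = Counter(to_key(row) for row in sample_rows)
```

The first two rows fall into the same (day, hour, location) bucket, so their pickups are counted together.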

#### Create an RDD to store all taxi data

In [7]:
if env == 'aws':
    base_path = "s3://sdaultontestbucket/nyc/"
else:
    base_path = "data/"

In [39]:
taxi_rdd = sc.textFile(base_path+yCabUrls[0])
# %time taxi_rdd.cache()
# %time taxi_rdd.count()

#### Get the schema of the data file


In [40]:
taxi_rdd = taxi_rdd.map(lambda line: tuple(line.split(','))).zipWithIndex()
taxi_rdd = taxi_rdd.filter(lambda (row,idx): idx < 100000)
schema = taxi_rdd.take(1)[0][0]
print schema

(u'vendor_id', u'pickup_datetime', u'dropoff_datetime', u'passenger_count', u'trip_distance', u'pickup_longitude', u'pickup_latitude', u'rate_code', u'store_and_fwd_flag', u'dropoff_longitude', u'dropoff_latitude', u'payment_type', u'fare_amount', u'surcharge', u'mta_tax', u'tip_amount', u'tolls_amount', u'total_amount')


#### Helper Data Cleaning Functions

In [41]:
def fetch_indices(schema):
    # Takes a list of column names (strings) as a parameter and returns a tuple of the indices of the pickup datetime,
    # pickup latitude, and pickup longitude columns
    indices = [-1,-1,-1]
    for idx in xrange(len(schema)):
        col_name = schema[idx]
        if "pickup" in col_name:
            if "datetime" in col_name:
                indices[0] = idx
            elif "latitude" in col_name:
                indices[1] = idx
            elif "longitude" in col_name:
                indices[2] = idx
    return tuple(indices)       
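
As a quick check, the same lookup can be written with `enumerate` and run against the header printed above. This is a sketch of an equivalent formulation (the name `fetch_indices_sketch` is hypothetical); it also strips whitespace and lowercases column names, on the assumption that headers in other files may differ slightly in spacing or capitalization.

```python
def fetch_indices_sketch(schema):
    """Return (datetime_idx, latitude_idx, longitude_idx) for the pickup columns, -1 if missing."""
    indices = [-1, -1, -1]
    for idx, raw_name in enumerate(schema):
        col_name = raw_name.strip().lower()
        if "pickup" not in col_name:
            continue
        if "datetime" in col_name:
            indices[0] = idx
        elif "latitude" in col_name:
            indices[1] = idx
        elif "longitude" in col_name:
            indices[2] = idx
    return tuple(indices)

# Header as printed in the schema output above
header = ("vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count",
          "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code",
          "store_and_fwd_flag", "dropoff_longitude", "dropoff_latitude", "payment_type",
          "fare_amount", "surcharge", "mta_tax", "tip_amount", "tolls_amount", "total_amount")
pickup_indices = fetch_indices_sketch(header)
```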

In [42]:
import time
from datetime import date
def date_extractor(date_str):
    # Takes a datetime string of the form 'YYYY-MM-DD HH:MM:SS' as a parameter
    # and extracts and returns a tuple of the day of the week (1 through 7 where Monday == 1) and hour (1 through 24)
    
    # Split date string into list of date, time
    d = date_str.split()
    # Parse year, month, day
    date_list = d[0].split('-')
    d_obj = date(int(date_list[0]),int(date_list[1]),int(date_list[2]))
    day_of_week = d_obj.isoweekday()
    # Get hour number
    hour = int(d[1].split(':')[0]) + 1
    return (day_of_week, hour)    
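
An alternative sketch parses the whole timestamp with `datetime.strptime` instead of splitting the string by hand. It assumes the `YYYY-MM-DD HH:MM:SS` layout; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime

def date_extractor_sketch(date_str):
    """Return (day of week, hour): Monday == 1 ... Sunday == 7, hour in 1..24."""
    parsed = datetime.strptime(date_str.strip(), "%Y-%m-%d %H:%M:%S")
    # datetime.hour is 0..23, so add 1 to match the 1..24 convention used above
    return (parsed.isoweekday(), parsed.hour + 1)

example = date_extractor_sketch("2014-01-09 20:45:25")
```

`strptime` raises `ValueError` on malformed input, which a caller could catch in order to drop bad rows rather than abort the job.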

In [43]:
import geohash
def data_cleaner(row, indices, precision=7):
    # takes a row (tuple of column values) and the column indices returned by fetch_indices
    # as parameters and returns a tuple of the form:
    # (day of the week, hour, geotag)
    # indices = (1, 6, 5)
    #deal with header
    #assert len(row_with_idx) == 2, "row_with_idx is len %r" % len(row_with_idx)
    #if row_with_idx[1] == 0:
    #    return (-1,-1,0)
    #else:
    #    row = row_with_idx[0]
    #assert len(row) > 6, "row is len %r" % len(row)
    #extract day of the week and hour
    date_str = row[indices[0]]
    clean_date = date_extractor(date_str)
    #assert len(clean_date) == 2, "clean date is len %r" % len(clean_date)
    #get geo hash
    latitude = float(row[indices[1]])
    longitude = float(row[indices[2]])
    
    location = geohash.encode(latitude,longitude, precision = precision)
    # I was having trouble importing geohash on AWS, so for now we round to 3 decimal places to discretize the data
    # At New York's location,
    # .001 degree latitude is approximately 100 meters
    # .001 degree longitude is less than 100 meters
    # location = (round(latitude,3), round(longitude,3))

    return (clean_date[0], clean_date[1], location)
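
The rounding figures quoted in the comments can be checked with a short calculation. This sketch assumes a spherical Earth of radius 6371 km and a latitude of 40.7 degrees for New York; both values are approximations introduced here, not taken from the notebook.

```python
import math

EARTH_RADIUS_M = 6371000.0   # mean Earth radius (spherical approximation)
NYC_LATITUDE_DEG = 40.7      # approximate latitude of New York City

def meters_per_step(step_deg, latitude_deg):
    """Return (north-south, east-west) distance in meters covered by step_deg degrees."""
    meters_per_degree = EARTH_RADIUS_M * math.pi / 180.0
    north_south = step_deg * meters_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks by cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

lat_m, lon_m = meters_per_step(0.001, NYC_LATITUDE_DEG)
```

Under these assumptions, a 0.001 degree step comes out at a little over 110 m north-south and somewhat under 90 m east-west, consistent with the rough figures in the comments.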

#### Get the relevant indices

In [44]:
indices = fetch_indices(schema)
assert (-1 not in indices)
assert indices == (1,6,5)
print indices

(1, 6, 5)


#### Get rid of header row and clean the data

In [45]:
taxi_rdd = taxi_rdd.filter(lambda (row,idx): idx > 1).map(lambda (row,idx): row)

In [46]:
precision = 7
taxi_rdd = taxi_rdd.map(lambda row: data_cleaner(row, indices, precision = precision))\
                   .map(lambda row: (row,1))\
                   .reduceByKey(lambda a,b: a + b)

In [47]:
# taxi_rdd.cache()
taxi_rdd.count()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23.0 (TID 131, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 2355, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 2355, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 317, in func
    return f(iterator)
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 1780, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/shuffle.py", line 266, in mergeValues
    for k, v in iterator:
  File "<ipython-input-46-874e2e5b14f9>", line 2, in <lambda>
  File "<ipython-input-43-f107cfe2d4bf>", line 21, in data_cleaner
  File "geohash.py", line 79, in encode
    raise Exception("invalid latitude.")
Exception: invalid latitude.

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 2355, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 2355, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 317, in func
    return f(iterator)
  File "/usr/local/opt/apache-spark/libexec/python/pyspark/rdd.py", line 1780, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/shuffle.py", line 266, in mergeValues
    for k, v in iterator:
  File "<ipython-input-46-874e2e5b14f9>", line 2, in <lambda>
  File "<ipython-input-43-f107cfe2d4bf>", line 21, in data_cleaner
  File "geohash.py", line 79, in encode
    raise Exception("invalid latitude.")
Exception: invalid latitude.

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	... 1 more


In [None]:
taxi_rdd.take(5)

In [None]:
def summer(val):
    # takes a tuple (with 2 elements) val as a parameter and returns the sum of the two elements if both are not None
    # Otherwise returns the element that is not None
    if val[0] is None:
        assert val[1] is not None
        return val[1]
    elif val[1] is None:
        assert val[0] is not None
        return val[0]
    
    return val[0]+val[1]
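
The effect of `fullOuterJoin` followed by `summer` can be illustrated on plain Python dictionaries. This is only a local emulation with invented counts, not Spark code: keys present on one side only arrive paired with `None`, and the summing function keeps the count that exists. The name `summer_sketch` is hypothetical.

```python
def summer_sketch(val):
    """Sum a (left, right) pair, treating a None on either side as 'no count'."""
    left, right = val
    if left is None:
        return right
    if right is None:
        return left
    return left + right

# Invented per-file counts keyed by (day of week, hour, location)
counts_a = {(1, 9, "loc_x"): 3, (2, 24, "loc_y"): 1}
counts_b = {(1, 9, "loc_x"): 4, (5, 18, "loc_z"): 2}

# Emulate a full outer join: every key from either side, with None where a side lacks it
all_keys = set(counts_a) | set(counts_b)
joined = {key: (counts_a.get(key), counts_b.get(key)) for key in all_keys}
merged = {key: summer_sketch(pair) for key, pair in joined.items()}
```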
    

#### Combine Yellow Cab Data from all datafiles

In [None]:

failed_to_read = []
for i in xrange(1,len(yCabUrls)):
    filename = yCabUrls[i]
    # Use base_path so the loop reads from the same location as the first file
    raw_rdd = sc.textFile(base_path+filename)
    raw_rdd.cache()
    temp_rdd = raw_rdd.map(lambda line: tuple(line.split(','))).zipWithIndex()
    schema = temp_rdd.take(1)[0][0]
    indices = fetch_indices(schema)
    # Make sure fetch indices found all the columns
    if (-1 in indices) or (indices != (1,6,5)):
        failed_to_read.append(base_path+filename)
    else:
        # Get rid of header row and clean the data
        temp_rdd = temp_rdd.filter(lambda (row,idx): idx > 1).map(lambda (row,idx): row)
        # Clean data and reduce data down to ((day of week, hour, location), number of pickups) tuples
        # data_cleaner requires the column indices, so it is wrapped in a lambda
        temp_rdd = temp_rdd.map(lambda row: data_cleaner(row, indices, precision = precision))\
                    .map(lambda row: (row,1))\
                    .reduceByKey(lambda a,b: a+b)
        #Add rows to whole dataset
        # this join gives us (key, (count from taxi, count from temp))

        taxi_rdd = taxi_rdd.fullOuterJoin(temp_rdd).mapValues(summer)
        taxi_rdd.cache()
        print taxi_rdd.count()

    # Unpersist the RDD that was actually cached (the raw text file)
    raw_rdd.unpersist()
    print "Read "+str(i)+" of "+str(len(yCabUrls)-1)+ " files"

In [None]:
#save the RDD as a single file
# this RDD has key = (day of week, hour, location), value = number of pickups
taxi_rdd.repartition(1).saveAsTextFile("s3n://sdaultontestbucket/summed_rdd")
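
`saveAsTextFile` writes one line per element using its string representation, so each output line should look like a Python tuple literal. Below is a sketch of how such a line could be parsed back when the file is reloaded, assuming that layout; the helper name, the sample line, and the geohash value are invented for illustration.

```python
import ast

def parse_saved_line(line):
    """Parse a line such as "((1, 9, 'dr5ru7k'), 42)" back into ((day, hour, location), count)."""
    # literal_eval only evaluates Python literals, so it is safer than eval on file contents
    key, count = ast.literal_eval(line.strip())
    return (tuple(key), int(count))

parsed = parse_saved_line("((1, 9, 'dr5ru7k'), 42)\n")
```

When reloading, this could be used as `sc.textFile(path).map(parse_saved_line)`, where `path` is whatever location the RDD was saved to.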