# NYC Taxi Dataset Project - Setup

## Overall Steps

**Step 0:** Prerequisites

**Step 1:** Set up the Spark cluster in AWS (do this from the shell/Git Bash)

**Step 2:** Sanity check to ensure Spark & S3 are set up properly

**Step 3:** Upload data files into S3 (you can skip this and use my S3 location)

**Step 4:** Check cluster performance on dataset

At the end, I have also added some other **useful notes**.

Note: Step 1 is based on the CS109 [instructions](https://piazza.com/class/icf0cypdc3243c?cid=1369), with modifications to optimize performance for this project.

### Step 0: Prerequisites

1. You need the files CS109.pem and credentials.csv. If you followed the CS109 instructions (for Lab 8 or HW5), you will already have these files.

2. Create a new directory on your machine for the project and copy the following files into it:
    
    a) CS109.pem
    
    b) credentials.csv
    
    c) Setup Project.ipynb (this notebook)
    
    d) myConfig.json
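
A quick way to confirm the project directory is complete before moving on is sketched below. This check is not part of the original setup; the helper name `missing_files` is an assumption made for illustration, and the file names are the four listed above.

```python
import os

# Files that Step 0 asks you to place in the project directory.
REQUIRED_FILES = [
    "CS109.pem",
    "credentials.csv",
    "Setup Project.ipynb",
    "myConfig.json",
]


def missing_files(project_dir, required=REQUIRED_FILES):
    """Return the required files that are not present in project_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(project_dir, name))]


if __name__ == "__main__":
    missing = missing_files(os.getcwd())
    if missing:
        print("Missing files: " + ", ".join(missing))
    else:
        print("All prerequisite files found.")
```

Run it from the project directory; an empty result means all four files are in place.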

### Step 1: Set up the Spark cluster in AWS & perform sanity check



**Note: Run all the items in Step 1 from unix shell/Git bash (not from the Jupyter notebook)**

#### Step 1a) Create the cluster in AWS: run the following command in unix shell/Git bash (you may change the instance type/count)

**Enhancements incorporated into the script**: 
1. Set up Spark to use the maximum available resources (the myConfig.json file has the instructions)
2. Download the admin application Ganglia

#### Step 1c) Wait for the cluster to be ready: AWS web console has to show "WAITING"
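
Instead of refreshing the web console, the cluster state could also be polled from a script. The sketch below rests on assumptions that the notebook does not state: that the cluster is an Amazon EMR cluster, that the AWS CLI is installed and configured on your machine, and that you know the cluster ID (the value shown in the usage note is a placeholder). The helper names `emr_cluster_state` and `wait_until_waiting` are illustrative.

```python
import subprocess
import time


def emr_cluster_state(cluster_id):
    """Ask the AWS CLI for the current state of an EMR cluster."""
    out = subprocess.check_output([
        "aws", "emr", "describe-cluster",
        "--cluster-id", cluster_id,
        "--query", "Cluster.Status.State",
        "--output", "text",
    ])
    return out.decode("utf-8").strip()


def wait_until_waiting(cluster_id, get_state=emr_cluster_state,
                       poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll until the cluster reports WAITING; return True on success.

    Returns False if the cluster terminates or max_polls is exhausted.
    """
    for _ in range(max_polls):
        state = get_state(cluster_id)
        if state == "WAITING":
            return True
        if state.startswith("TERMINAT"):
            # Covers TERMINATING, TERMINATED and TERMINATED_WITH_ERRORS.
            return False
        sleep(poll_seconds)
    return False
```

A call such as `wait_until_waiting("j-XXXXXXXXXXXXX")` would block until the cluster is ready or the polling budget runs out. The `get_state` and `sleep` parameters exist only so the loop can be exercised without touching AWS.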

#### Step 1d)  Get the cluster master's IP:

#### Step 1e) Run the script to configure Spark 

#### Step 1f) Create an SSH tunnel to the AWS box and connect to the cluster. This command assumes your SSH key is in the same directory you are invoking the SSH command from. At the end of this step you will be in a terminal session on the cluster's master node.

#### Step 1g) Open your browser and go to http://localhost:8989

Note: The notebook you open will **already** have the spark context set up for you.

### Step 2: Sanity check to ensure Spark & S3 are set up properly

#### Step 2a) Upload this Jupyter Notebook using the console from http://localhost:8989

Note: all the steps in Step 2 are to be executed from the Jupyter Notebook itself

#### Step 2b) Setup the SparkContext (automatically setup by YARN)

In [1]:
sc

<pyspark.context.SparkContext at 0x7f6b79cfce50>

In [2]:
#Just an informational message
sc.master

u'yarn-client'

#### Step 2c) Spark sanity check

In [3]:
import sys
rdd = sc.parallelize(xrange(10),10)
aa = rdd.map(lambda x: sys.version)
aa.cache()
aa.count()

10

#### Step 2d) S3 sanity check

In [4]:
#Get a Wiktionary page and store it in a local folder
!pwd
!mkdir test_s3
!wget http://en.wiktionary.org/wiki/awesome -P test_s3/ --trust-server-names

/home/hadoop
mkdir: cannot create directory ‘test_s3’: File exists
--2015-12-02 04:43:25--  http://en.wiktionary.org/wiki/awesome
Resolving en.wiktionary.org (en.wiktionary.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wiktionary.org (en.wiktionary.org)|208.80.154.224|:80... connected.
HTTP request sent, awaiting response... 301 TLS Redirect
Location: https://en.wiktionary.org/wiki/awesome [following]
--2015-12-02 04:43:26--  https://en.wiktionary.org/wiki/awesome
Connecting to en.wiktionary.org (en.wiktionary.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘test_s3/awesome.1’

awesome.1               [ <=>                  ]  65.39K  --.-KB/s   in 0.003s 

2015-12-02 04:43:26 (22.3 MB/s) - ‘test_s3/awesome.1’ saved [66964]



##### Create an S3 bucket: provide a unique name below, replacing "sdaultonbucket1"

In [5]:
# Create the bucket: remember to replace "sdaultonbucket1" with a unique bucket name
!aws s3 mb s3://sdaultonbucket1
    
# Add the downloaded file to the test bucket in S3: remember to replace "sdaultonbucket1" with a unique bucket name
!aws s3 cp test_s3/awesome s3://sdaultonbucket1/

make_bucket: s3://sdaultonbucket1/
upload: test_s3/awesome to s3://sdaultonbucket1/awesome


##### Test if you are able to lookup the S3 file from Spark

In [6]:
testS3RDD = sc.textFile("s3://sdaultonbucket1/awesome")
testS3RDD.count()

443

##### Congrats! You now have a working Spark cluster that can connect to S3!

### Step 3: Upload data files into S3 

Note: You **can skip** this step, as I have uploaded the files to my S3 location **s3://testsetu/nyc/**


In [7]:
!pwd
!mkdir datafiles

/home/hadoop


#### Download the data files into local folder

**Note**: We are downloading the data from **2013 onwards only** - though data is available from 2009

**Data source**: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

In [9]:
#Setup the S3 bucket for storing the nyc data
!aws s3 mb s3://sdaultonbucket1/nyc
    
#Use the ephemeral/tmp space on the EC2 node
!mkdir datafiles
!sudo mount /dev/xvdb datafiles
!mkdir datafiles/nyc

#for green cabs
!mkdir datafiles/nycg

make_bucket: s3://sdaultonbucket1/nyc/
mount: /dev/xvdb is already mounted or /home/hadoop/datafiles busy
       /dev/xvdb is already mounted on /mnt
       /dev/xvdb is already mounted on /media/ephemeral0
       /dev/xvdb is already mounted on /home/hadoop/datafiles
mkdir: cannot create directory ‘datafiles/nyc’: File exists
mkdir: cannot create directory ‘datafiles/nycg’: File exists


In [10]:
#Setup the variables

baseUrl = "http://storage.googleapis.com/tlc-trip-data/"
#Yellow/green cab filename prefix
yCabFNPrefix = "/yellow_tripdata_"
gCabFNPrefix = "/green_tripdata_"

#Availability of the data set by month & year
yDict = {}
gDict = {}

#availability for Yellow cab
yDict[2015] = range(1,7) #available till Jun 2015
yDict[2014] = range(1,13)
yDict[2013] = range(1,13)

#availability for Green cab
gDict[2015] = range(1,7) #available till Jun 2015
gDict[2014] = range(1,13)
gDict[2013] = range(8,13) #available only from August 2013

In [11]:
#  Yellow cab data file name list
# file name is of format:  yellow_tripdata_2015-01.csv
yCabUrls = []
yCabFilenames = []
for year, monthList in yDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = baseUrl+yearStr+yCabFNPrefix+yearStr+'-'+monthStr+".csv"
        yCabUrls.append(url)
        yCabFilenames.append(yCabFNPrefix+yearStr+'-'+monthStr+".csv")

#  green cab data file name list
gCabUrls = []
gCabFilenames = []
for year, monthList in gDict.iteritems():
    yearStr = str(year)
    for month in monthList:
        monthStr = str(month)
        if len(monthStr) == 1:
            monthStr = "0"+monthStr    
        url = baseUrl+yearStr+gCabFNPrefix+yearStr+'-'+monthStr+".csv"
        gCabFilenames.append(gCabFNPrefix+yearStr+'-'+monthStr+".csv")
        gCabUrls.append(url)
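
The two loops above differ only in the cab color, so they could be folded into one helper. The following is a sketch, not the notebook's own code: `build_urls` is a hypothetical name, and unlike `yCabFilenames` and `gCabFilenames` the file names it returns carry no leading slash. It runs under both Python 2 and Python 3.

```python
BASE_URL = "http://storage.googleapis.com/tlc-trip-data/"


def build_urls(availability, color, base_url=BASE_URL):
    """Return (url, filename) pairs for every available year and month.

    availability maps a year to an iterable of month numbers;
    color is "yellow" or "green".
    """
    pairs = []
    for year in sorted(availability):
        for month in availability[year]:
            # %02d zero-pads the month, e.g. 1 -> "01"
            filename = "%s_tripdata_%d-%02d.csv" % (color, year, month)
            pairs.append(("%s%d/%s" % (base_url, year, filename), filename))
    return pairs


# Same availability as gDict above.
green = {2013: range(8, 13), 2014: range(1, 13), 2015: range(1, 7)}
green_pairs = build_urls(green, "green")
print(green_pairs[0][0])  # http://storage.googleapis.com/tlc-trip-data/2013/green_tripdata_2013-08.csv
print(len(green_pairs))   # 23
```

Sorting the years makes the download order deterministic, which plain dictionary iteration does not guarantee in Python 2.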

In [12]:
#Download the yellow cab files
for url in yCabUrls:
    !wget $url -P datafiles/nyc --trust-server-names

--2015-12-02 04:44:24--  http://storage.googleapis.com/tlc-trip-data/2013/yellow_tripdata_2013-01.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.22.128, 2607:f8b0:400d:c09::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.22.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2472351469 (2.3G) [text/csv]
Saving to: ‘datafiles/nyc/yellow_tripdata_2013-01.csv’


2015-12-02 04:44:45 (118 MB/s) - ‘datafiles/nyc/yellow_tripdata_2013-01.csv’ saved [2472351469/2472351469]

--2015-12-02 04:44:45--  http://storage.googleapis.com/tlc-trip-data/2013/yellow_tripdata_2013-02.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.22.128, 2607:f8b0:400d:c09::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.22.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2344381323 (2.2G) [text/csv]
Saving to: ‘datafiles/nyc/yellow_tripdata_2013-02.csv’


2015-12-02 04:45:04 

In [13]:
#Disk space of the Yellow Cab files
!du -mh datafiles/nyc

65G	datafiles/nyc


In [14]:
def preprocess_data(cabFilenames, isYellow):
    """
    Function that takes a list of filenames (strings) and a boolean as parameters.
    Removes the header from each file and verifies the schema of the data.
    """
    # Dictionary where key = filename, value = (schema, bool==True if there is a blank line after header)
    file_schemas = {}
    prefix = 'datafiles/nycg/'
    if isYellow:
        prefix = 'datafiles/nyc/'
        
    for filename in cabFilenames:
        # Fetch schema
        with open(prefix+filename,'r') as in_fp:
            #read first two lines
            lines = [in_fp.readline() for i in xrange(2)]

        # record the header fields and whether a blank line follows the header
        file_schemas[filename] = (tuple(lines[0].split(',')), lines[1]=='\r\n')
    
    # verify all files have the necessary columns in the same position
    for (schema,blank) in file_schemas.values():
        assert 'ickup' in schema[1]
        assert 'atetime' in schema[1]
        assert 'ickup' in schema[5]
        assert 'ongitude' in schema[5]
        assert 'ickup' in schema[6]
        assert 'atitude' in schema[6]
    print "Schema:", file_schemas[filename][0]
    
    # Remove header and blank line from file
    for filename in cabFilenames:
        print "Writing to %r" % filename 
        with open(prefix+filename,'r') as in_fp:
            #read whole file
            lines = in_fp.readlines()

        with open(prefix+filename,'w') as out_fp:

            # check if there is a blank line after the header
            if file_schemas[filename][1]:
                out_fp.writelines(lines[2:])
            else:
                out_fp.writelines(lines[1:])
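
`preprocess_data` loads each file completely into memory with `readlines()`, which is heavy for monthly files of more than 2 GB. A streaming alternative is sketched below as one possible approach, not the notebook's method. The name `strip_header` is hypothetical, the schema assertions are left out, and it treats any whitespace-only second line as blank (the original tests for exactly `'\r\n'`).

```python
import os
import shutil
import tempfile


def strip_header(path):
    """Remove the header line (and a blank line right after it, if any).

    The file is streamed through a temporary file in the same directory,
    so memory use stays constant regardless of file size.
    Returns the header fields as a tuple.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as src:
        header = src.readline()
        second = src.readline()
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as dst:
            # Keep the second line unless it is blank.
            if second.strip():
                dst.write(second)
            # Copy the rest of the file in chunks.
            shutil.copyfileobj(src, dst)
    # Replace the original only after the copy succeeded.
    # os.replace exists on Python 3 only; os.rename is the Python 2 fallback.
    getattr(os, "replace", os.rename)(tmp_path, path)
    return tuple(header.decode("utf-8").strip().split(","))
```

Calling `strip_header('datafiles/nyc/yellow_tripdata_2013-01.csv')` would rewrite that file in place and return its column names. The temporary copy needs free disk space roughly equal to the size of the file being processed.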

In [15]:
#Preprocess Yellow Cab files -- check schema
preprocess_data(yCabFilenames, True)

Schema: ('VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount\r\n')
Writing to '/yellow_tripdata_2013-01.csv'
Writing to '/yellow_tripdata_2013-02.csv'
Writing to '/yellow_tripdata_2013-03.csv'
Writing to '/yellow_tripdata_2013-04.csv'
Writing to '/yellow_tripdata_2013-05.csv'
Writing to '/yellow_tripdata_2013-06.csv'
Writing to '/yellow_tripdata_2013-07.csv'
Writing to '/yellow_tripdata_2013-08.csv'
Writing to '/yellow_tripdata_2013-09.csv'
Writing to '/yellow_tripdata_2013-10.csv'
Writing to '/yellow_tripdata_2013-11.csv'
Writing to '/yellow_tripdata_2013-12.csv'
Writing to '/yellow_tripdata_2014-01.csv'
Writing to '/yellow_tripdata_2014-02.csv'
Writing to '/yellow_tripdata_2014-03.csv'
Writing to '/yellow_tr

In [16]:
#add to s3
!aws s3 sync datafiles/nyc/ s3://sdaultonbucket1/nyc/

upload: datafiles/nyc/yellow_tripdata_2013-01.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-01.csv
upload: datafiles/nyc/yellow_tripdata_2013-02.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-02.csv
upload: datafiles/nyc/yellow_tripdata_2013-03.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-03.csv
upload: datafiles/nyc/yellow_tripdata_2013-04.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-04.csv
upload: datafiles/nyc/yellow_tripdata_2013-05.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-05.csv
upload: datafiles/nyc/yellow_tripdata_2013-06.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-06.csv
upload: datafiles/nyc/yellow_tripdata_2013-07.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-07.csv
upload: datafiles/nyc/yellow_tripdata_2013-08.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-08.csv
upload: datafiles/nyc/yellow_tripdata_2013-09.csv to s3://sdaultonbucket1/nyc/yellow_tripdata_2013-09.csv
upload: datafiles/nyc/yellow_tripdata_2013-10.

In [17]:
#free up space
!rm datafiles/nyc/*

In [18]:
#Download the green cab files
for url in gCabUrls:
    !wget $url -P datafiles/nycg --trust-server-names

--2015-12-02 05:33:12--  http://storage.googleapis.com/tlc-trip-data/2013/green_tripdata_2013-08.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.207.128, 2607:f8b0:400d:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.207.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1147696 (1.1M) [text/csv]
Saving to: ‘datafiles/nycg/green_tripdata_2013-08.csv’


2015-12-02 05:33:13 (17.2 MB/s) - ‘datafiles/nycg/green_tripdata_2013-08.csv’ saved [1147696/1147696]

--2015-12-02 05:33:13--  http://storage.googleapis.com/tlc-trip-data/2013/green_tripdata_2013-09.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.207.128, 2607:f8b0:400d:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.207.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7685009 (7.3M) [text/csv]
Saving to: ‘datafiles/nycg/green_tripdata_2013-09.csv’


2015-12-02 05:33:13 (49.6

In [19]:
#Disk space of the Green Cab files
!du -mh datafiles/nycg

4.0G	datafiles/nycg


In [20]:
# Preprocess Green Cab files
preprocess_data(gCabFilenames, False)

Schema: ('VendorID', 'lpep_pickup_datetime', 'Lpep_dropoff_datetime', 'Store_and_fwd_flag', 'RateCodeID', 'Pickup_longitude', 'Pickup_latitude', 'Dropoff_longitude', 'Dropoff_latitude', 'Passenger_count', 'Trip_distance', 'Fare_amount', 'Extra', 'MTA_tax', 'Tip_amount', 'Tolls_amount', 'Ehail_fee', 'improvement_surcharge', 'Total_amount', 'Payment_type', 'Trip_type \r\n')
Writing to '/green_tripdata_2013-08.csv'
Writing to '/green_tripdata_2013-09.csv'
Writing to '/green_tripdata_2013-10.csv'
Writing to '/green_tripdata_2013-11.csv'
Writing to '/green_tripdata_2013-12.csv'
Writing to '/green_tripdata_2014-01.csv'
Writing to '/green_tripdata_2014-02.csv'
Writing to '/green_tripdata_2014-03.csv'
Writing to '/green_tripdata_2014-04.csv'
Writing to '/green_tripdata_2014-05.csv'
Writing to '/green_tripdata_2014-06.csv'
Writing to '/green_tripdata_2014-07.csv'
Writing to '/green_tripdata_2014-08.csv'
Writing to '/green_tripdata_2014-09.csv'
Writing to '/green_tripdata_2014-10.csv'
Writing to

In [21]:
!aws s3 sync datafiles/nycg/ s3://sdaultonbucket1/nycg/

upload: datafiles/nycg/green_tripdata_2013-08.csv to s3://sdaultonbucket1/nycg/green_tripdata_2013-08.csv
upload: datafiles/nycg/green_tripdata_2013-09.csv to s3://sdaultonbucket1/nycg/green_tripdata_2013-09.csv
upload: datafiles/nycg/green_tripdata_2013-10.csv to s3://sdaultonbucket1/nycg/green_tripdata_2013-10.csv
upload: datafiles/nycg/green_tripdata_2013-11.csv to s3://sdaultonbucket1/nycg/green_tripdata_2013-11.csv
upload: datafiles/nycg/green_tripdata_2013-12.csv to s3://sdaultonbucket1/nycg/green_tripdata_2013-12.csv
upload: datafiles/nycg/green_tripdata_2014-01.csv to s3://sdaultonbucket1/nycg/green_tripdata_2014-01.csv
upload: datafiles/nycg/green_tripdata_2014-02.csv to s3://sdaultonbucket1/nycg/green_tripdata_2014-02.csv
upload: datafiles/nycg/green_tripdata_2014-03.csv to s3://sdaultonbucket1/nycg/green_tripdata_2014-03.csv
upload: datafiles/nycg/green_tripdata_2014-04.csv to s3://sdaultonbucket1/nycg/green_tripdata_2014-04.csv
upload: datafiles/nycg/green_tripdata_2014-05.

In [22]:
#free up space
!rm datafiles/nycg/*

### Step 4: Check cluster performance

In [23]:
myRDD = sc.textFile("s3://sdaultonbucket1/nyc/yellow_tripdata_2015-02.csv")
%time myRDD.cache()
%time myRDD.count()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.87 ms
CPU times: user 0 ns, sys: 12 ms, total: 12 ms
Wall time: 14.6 s


12450521

In [25]:
myRDD.is_cached

True
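
`cache()` only marks the RDD for caching, which is why it returns in milliseconds above; the first `count()` is what reads the file from S3 and fills the cache. Timing a second `count()` would show the benefit of caching. A minimal timing helper is sketched below; `timed` is a hypothetical name, and the Spark calls are left as comments because they assume a live SparkContext and the `myRDD` defined above.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# On the cluster, with myRDD defined as above (assumes a live SparkContext):
#   first, t_first = timed(myRDD.count)    # reads from S3 and fills the cache
#   second, t_second = timed(myRDD.count)  # expected to be served from the cache
#   print(t_first, t_second)
```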

### Other useful notes

**1. Enable the web admin interface** from the AWS console (follow the steps it gives). Note: when you open the SSH connection as instructed, the terminal might not show anything (no status message, etc.); this is fine. The SSH command (as per the instructions) is:  ssh -i CS109.pem $DNS_NAME -ND 8157

**2. Admin UIs**
    
    a) To get to the Spark Jobs Admin Console: go to the Hadoop Resource Manager UI (from the AWS console) and click on the "Application master" link (it will be one of the items in the list of running applications).
    
    b) Spark history server: http://<domain>:18080/
    
    c) For CPU/Memory performance on each node use the Ganglia UI (link from the AWS console)
    