### This notebook was run on a Google Cloud cluster because the dataset `nyc_data.csv` is very large

<h1>PROBLEM 1</h1>

Write an Apache Spark program that <b>uses <span style="color:green">map</span> and <span style="color:green">reduceByKey</span> to compute the average processing time by borough for 311 complaints.</b> You can read the data from the file nyc_data.csv. Each row represents one NYC 311 processing case; column 6 contains the borough and column 7 the processing time for that case (counting columns from 0). <b>You must use reduceByKey!</b>

You need to run your program on a cluster on the cloud. 

NOTES:

1. The Scala method <span style="color:blue">contains</span> works somewhat like Python's <code>in</code> operator. See the example below.

2. Only include data rows where the borough is one of BRONX, MANHATTAN, QUEENS, STATEN ISLAND, BROOKLYN

3. You also need to account for any missing data in the processing times (you can ignore missing data in borough because you'll be filtering boroughs anyway)



In [226]:
// PROBLEM 1 SOLUTION HERE
val result = sc.textFile("gs://zhangzy_cloud_analytics/data/nyc_data.csv")       // read data from Bucket
    .mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}   // remove header
    .map(l=>l.split(",")).map(l=>(l(6),l(7)))                                    // extract only Boroughs and Times
    .filter(l=>(l._1.contains("BRONX")||l._1.contains("MANHATTAN")||             // filter Boroughs
                l._1.contains("QUEENS")||l._1.contains("STATEN ISLAND")||
                l._1.contains("BROOKLYN")))
    .filter(l=>try {
            l._2.toDouble
            true
        } catch {
            case e: Exception => false
        }
    )                                                                            // drop rows whose Time is not a number
    .map(l=>(l._1,l._2.toDouble))                                                // convert Time values into Double
    .map(l=>{if (l._1.contains("BRONX")) ("BRONX",l._2)                          // normalize Borough labels:
           else if (l._1.contains("MANHATTAN")) ("MANHATTAN",l._2)               // a String containing a borough name
           else if (l._1.contains("QUEENS")) ("QUEENS",l._2)                     // becomes exactly that name,
           else if (l._1.contains("STATEN ISLAND")) ("STATEN ISLAND",l._2)       // e.g. "05 BRONX" ==> "BRONX"
           else ("BROOKLYN",l._2)})                                              
    .mapValues(value=>(value,1.0))                                                      // count each entry
    .reduceByKey{case ((timeA,countA),(timeB,countB)) => (timeA+timeB,countA+countB)}   // sum up the Times and Counts
    .mapValues{case (sum , count) => sum/count}                                         // calculate the average time
                                                                                        // by Borough
result.collect

result = MapPartitionsRDD[297] at mapValues at <console>:51


[(STATEN ISLAND,4.7309033311163775), (BROOKLYN,6.19202093132757), (MANHATTAN,6.720885143835739), (QUEENS,5.064014618641129), (BRONX,5.959183187781855)]

In [227]:
//Run the cell below and make sure the result (Out[[n]]:  ) is visible when you submit the file
//This returns the url of the master and will confirm that you ran the code on the cluster
//For example, my return value is:
// Some(http://cluster-01cb-m.c.cloud-class-spring2020.internal:4040)

sc.uiWebUrl

Some(http://cluster-ieor4526-m.c.ieor-4526-cloud-analytics.internal:4041)

## Step by step:

In [212]:
val raw_data = sc.textFile("gs://zhangzy_cloud_analytics/data/nyc_data.csv")

raw_data = gs://zhangzy_cloud_analytics/data/nyc_data.csv MapPartitionsRDD[273] at textFile at <console>:27


gs://zhangzy_cloud_analytics/data/nyc_data.csv MapPartitionsRDD[273] at textFile at <console>:27

In [213]:
raw_data.partitions.length   // default partition number = 2

2

In [214]:
val raw_data_header_removed = raw_data.mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}
val r1 = raw_data.count
val r2 = raw_data_header_removed.count    // only 1 row is dropped: the header row in the first partition

raw_data_header_removed = MapPartitionsRDD[274] at mapPartitionsWithIndex at <console>:31
r1 = 1521028
r2 = 1521027


1521027
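
The partition-index trick can be mimicked on plain Scala collections, which may make it easier to see why exactly one row disappears. This is only a sketch: it assumes the header is the first line of partition 0, and the `HeaderDrop` object, the `dropHeader` name, and the shortened sample rows are hypothetical, not part of the notebook.

```scala
object HeaderDrop {
  // Stand-in for mapPartitionsWithIndex: each inner Seq plays the role of one partition.
  def dropHeader(partitions: Seq[Seq[String]]): Seq[Seq[String]] =
    partitions.zipWithIndex.map { case (rows, idx) =>
      if (idx == 0) rows.drop(1) else rows   // only partition 0 starts with the header line
    }

  // Made-up, shortened rows split across two "partitions"
  val partitions: Seq[Seq[String]] = Seq(
    Seq("record number,Unique Key,Borough,processing_time", "0,32305299,BROOKLYN,0.08"),
    Seq("1,32310343,BRONX,0.13", "2,32309107,07 BRONX,20.39")
  )
}
```

Calling `HeaderDrop.dropHeader(HeaderDrop.partitions)` leaves the second partition untouched, which mirrors the drop from 1521028 to 1521027 rows seen above.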

In [215]:
raw_data.first()    // header that was removed

record number,Unique Key,Created Date,Closed Date,Agency,Descriptor,Borough,processing_time

In [216]:
raw_data_header_removed.take(5).foreach(println)

0,32305299,2016-01-01 00:00:09,2016-01-01 01:57:32,NYPD,Loud Music/Party,BROOKLYN,0.0815162037037037
5,323056566,2016-01-01 00:00:09,2016-01-01 01:57:32,NYPD,Loud Music/Party,BROOKLYN,NaN5,323056566,2016-01-01 00:00:09,2016-01-01 01:57:32,NYPD,Loud Music/Party,BROOKLYN,NaN5,323056566,2016-01-01 00:00:09,2016-01-01 01:57:32,NYPD,Loud Music/Party,BROOKLYN,NaN
1,32310343,2016-01-01 00:00:40,2016-01-01 03:12:53,NYPD,Loud Music/Party,BRONX,0.1334837962962963
2,32309107,2016-01-01 00:01:09,2016-01-21 09:20:55,HPD,NO LIGHTING,07 BRONX,20.388726851851853
3,32308578,2016-01-01 00:02:59,2016-01-01 23:35:50,NYPD,Loud Music/Party,Unspecified,0.9811458333333334
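
One detail worth checking when filtering with `toDouble`: in Scala, `"NaN".toDouble` does not throw, it returns `Double.NaN`, so a try/catch around `toDouble` alone would let a literal `NaN` through. Whether such values actually occur in nyc_data.csv is not assumed here. A minimal sketch of a stricter parser follows; `TimeParsing` and `parseTime` are hypothetical names.

```scala
import scala.util.Try

object TimeParsing {
  // Some(time) only for finite numbers; None for blanks, text, NaN or Infinity.
  def parseTime(raw: String): Option[Double] =
    Try(raw.trim.toDouble).toOption
      .filter(t => !java.lang.Double.isNaN(t) && !java.lang.Double.isInfinite(t))
}
```

In a pipeline of `(borough, timeString)` pairs, something like `flatMap { case (b, t) => TimeParsing.parseTime(t).map(b -> _) }` could then stand in for the separate filter and conversion steps.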


In [217]:
val borough_time_map = raw_data_header_removed         // keep valid boroughs and numeric times only
    .map(l=>l.split(","))                              // split each CSV row into fields
    .map(l=>(l(6),l(7)))                               // keep (Borough, processing_time)
    .filter(l=>(l._1.contains("BRONX")||l._1.contains("MANHATTAN")||   // keep the five boroughs
                l._1.contains("QUEENS")||l._1.contains("STATEN ISLAND")||
                l._1.contains("BROOKLYN")))
    .filter(l=>try {                                   // drop rows whose time is not a number
            l._2.toDouble
            true
        } catch {
            case e: Exception => false
        }
    )
    .map(l=>(l._1,l._2.toDouble))                      // convert the time to Double
    .map(l=>{if (l._1.contains("BRONX")) ("BRONX",l._2)                // normalize labels such as
           else if (l._1.contains("MANHATTAN")) ("MANHATTAN",l._2)     // "07 BRONX" to "BRONX"
           else if (l._1.contains("QUEENS")) ("QUEENS",l._2)
           else if (l._1.contains("STATEN ISLAND")) ("STATEN ISLAND",l._2)
           else ("BROOKLYN",l._2)})

borough_time_map = MapPartitionsRDD[280] at map at <console>:45


MapPartitionsRDD[280] at map at <console>:45
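
The borough filter and the if/else chain test the same five names, so a single lookup that returns an `Option` could cover both. This is a sketch of that alternative, not the approach used in the cells above; `BoroughNormalize` and `normalize` are hypothetical names, and the list keeps the same order as the if/else chain.

```scala
object BoroughNormalize {
  // The five boroughs the problem statement asks to keep
  val boroughs: Seq[String] =
    Seq("BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND", "BROOKLYN")

  // First borough name found inside the raw label, or None if there is no match.
  def normalize(raw: String): Option[String] =
    boroughs.find(b => raw.contains(b))
}
```

With this helper, `"07 BRONX"` maps to `Some("BRONX")` and `"Unspecified"` maps to `None`, so rows without a recognized borough drop out naturally.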

In [219]:
borough_time_map.take(10).foreach(println)

(BROOKLYN,0.0815162037037037)
(BRONX,0.1334837962962963)
(BRONX,20.388726851851853)
(BRONX,7.048576388888889)
(QUEENS,0.1400810185185185)
(BROOKLYN,0.11086805555555555)
(MANHATTAN,0.016967592592592593)
(MANHATTAN,0.1597222222222222)
(BRONX,2.996585648148148)
(BROOKLYN,0.06299768518518518)


In [220]:
borough_time_map.countByKey

Map(STATEN ISLAND -> 64361, QUEENS -> 318541, MANHATTAN -> 315748, BROOKLYN -> 429754, BRONX -> 275485)

In [221]:
val count_by_borough = borough_time_map.mapValues(value=>(value,1.0))

count_by_borough = MapPartitionsRDD[283] at mapValues at <console>:33


MapPartitionsRDD[283] at mapValues at <console>:33

In [222]:
count_by_borough.take(10).foreach(println)

(BROOKLYN,(0.0815162037037037,1.0))
(BRONX,(0.1334837962962963,1.0))
(BRONX,(20.388726851851853,1.0))
(BRONX,(7.048576388888889,1.0))
(QUEENS,(0.1400810185185185,1.0))
(BROOKLYN,(0.11086805555555555,1.0))
(MANHATTAN,(0.016967592592592593,1.0))
(MANHATTAN,(0.1597222222222222,1.0))
(BRONX,(2.996585648148148,1.0))
(BROOKLYN,(0.06299768518518518,1.0))


In [223]:
val sum_by_borough = count_by_borough.reduceByKey{case ((timeA,countA),(timeB,countB)) => (timeA+timeB,countA+countB)}

sum_by_borough = ShuffledRDD[284] at reduceByKey at <console>:35


ShuffledRDD[284] at reduceByKey at <console>:35
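
The reason for carrying `(sum, count)` pairs is that `reduceByKey` needs a function that is associative and commutative: adding sums and adding counts qualifies, while averaging two averages does not. The sketch below replays the same logic on a local `Seq`, with `groupBy` standing in for the shuffle; `AverageByKey`, `combine`, and `averageByKey` are hypothetical names.

```scala
object AverageByKey {
  // Same combining function as the reduceByKey step: add sums, add counts.
  def combine(a: (Double, Double), b: (Double, Double)): (Double, Double) =
    (a._1 + b._1, a._2 + b._2)

  // Local stand-in for mapValues + reduceByKey + mapValues on an RDD.
  def averageByKey(pairs: Seq[(String, Double)]): Map[String, Double] =
    pairs
      .map { case (k, v) => (k, (v, 1.0)) }   // attach a count of 1 to every value
      .groupBy(_._1)                          // gather all entries that share a key
      .map { case (k, kvs) =>
        val (sum, count) = kvs.map(_._2).reduce(combine)
        k -> (sum / count)                    // average = total time / number of cases
      }
}
```

For example, `AverageByKey.averageByKey(Seq(("BRONX", 1.0), ("BRONX", 3.0), ("QUEENS", 5.0)))` averages the two BRONX values together and leaves QUEENS with its single value.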

In [224]:
val averagetime_by_borough = sum_by_borough.mapValues{case (sum , count) => sum/count}

averagetime_by_borough = MapPartitionsRDD[285] at mapValues at <console>:37


MapPartitionsRDD[285] at mapValues at <console>:37

In [225]:
averagetime_by_borough.collect

[(STATEN ISLAND,4.7309033311163775), (BROOKLYN,6.19202093132757), (MANHATTAN,6.720885143835739), (QUEENS,5.064014618641129), (BRONX,5.959183187781855)]

## Example for note 1 (`contains`):

In [2]:
val y = Array(5,7,9)
println(y.contains(7))
println(y.contains(11))

true
false


y: Array[Int] = Array(5, 7, 9)
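
In the solution, `contains` is called on a String rather than an Array, where it tests for a substring instead of an element. A small sketch of the difference; the `ContainsDemo` object is only an illustrative wrapper, and the sample labels are taken from the rows printed earlier.

```scala
object ContainsDemo {
  val onArray: Boolean  = Array(5, 7, 9).contains(7)        // element membership
  val onString: Boolean = "07 BRONX".contains("BRONX")      // substring match
  val noMatch: Boolean  = "Unspecified".contains("BRONX")   // no borough name inside
}
```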
