# ![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)
**Exploratory Analysis of Deerfoot Trail Commute Times**
#### This lab will build on the techniques covered in the Spark tutorial to develop a simple application to compute some stats on commute times on Deerfoot Trail.  We will use the commute times and accidents data collected for Deerfoot Trail for the period September 2013 to April 2014.
#### ** During this lab we will cover: **
#### *Part 1:* Creating a base RDD and pair RDDs
#### *Part 2:* Counting with pair RDDs
#### *Part 3:* Finding mean values
#### *Part 4:* Compute basic stats about the Deerfoot Trail data
#### Note that, for reference, you can look up the details of the relevant methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)

### ** Setting Up Spark **

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://dlcdn.apache.org/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
!tar -xvf spark-3.3.3-bin-hadoop3.tgz
!pip install findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.3-bin-hadoop3"

import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

sc = spark.sparkContext

--2023-11-05 10:34:34--  https://dlcdn.apache.org/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 299426263 (286M) [application/x-gzip]
Saving to: ‘spark-3.3.3-bin-hadoop3.tgz’


2023-11-05 10:34:35 (265 MB/s) - ‘spark-3.3.3-bin-hadoop3.tgz’ saved [299426263/299426263]

spark-3.3.3-bin-hadoop3/
spark-3.3.3-bin-hadoop3/LICENSE
spark-3.3.3-bin-hadoop3/NOTICE
spark-3.3.3-bin-hadoop3/R/
spark-3.3.3-bin-hadoop3/R/lib/
spark-3.3.3-bin-hadoop3/R/lib/SparkR/
spark-3.3.3-bin-hadoop3/R/lib/SparkR/DESCRIPTION
spark-3.3.3-bin-hadoop3/R/lib/SparkR/INDEX
spark-3.3.3-bin-hadoop3/R/lib/SparkR/Meta/
spark-3.3.3-bin-hadoop3/R/lib/SparkR/Meta/Rd.rds
spark-3.3.3-bin-hadoop3/R/lib/SparkR/Meta/features.rds
spark-3.3.3-bin-hadoop3/R/lib/SparkR/Meta/hsearch.rds
spark-3.3.3-bin-hadoop3/R/lib/SparkR/

### ** Part 1: Creating a base RDD and pair RDDs **
#### In this part of the lab, we will explore creating a base RDD with `parallelize` and using pair RDDs to count days.

#### ** (1a) Create a base RDD **
#### We'll start by generating a base RDD by using a Python list and the `sc.parallelize` method.  Then we'll print out the type of the base RDD.

In [2]:
daysList = ['sunday', 'monday', 'tuesday', 'tuesday', 'friday']
daysRDD = sc.parallelize(daysList, 4)
# Print out the type of daysRDD
print(type(daysRDD))

<class 'pyspark.rdd.RDD'>


#### ** (1b) Pluralize and test **
#### Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word.  The print statement is a test of the function.


In [3]:
def makePlural(word):
    """Adds an 's' to `word`.

    Note:
        This is a simple function that only adds an 's'.  No attempt is made to follow proper
        pluralization rules.

    Args:
        word (str): A string.

    Returns:
        str: A string with 's' added to it.
    """
    return word + 's'


print(makePlural('sunday'))

sundays


#### ** (1c) Apply `makePlural` to the base RDD **
#### Now pass each item in the base RDD into a [map()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) transformation that applies the `makePlural()` function to each element. And then call the [collect()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) action to see the transformed RDD.

In [4]:
pluralRDD = daysRDD.map(makePlural)
print(pluralRDD.collect())

['sundays', 'mondays', 'tuesdays', 'tuesdays', 'fridays']


#### ** (1d) Pass a `lambda` function to `map` **
#### Let's create the same RDD using a `lambda` function.

In [5]:
pluralLambdaRDD = daysRDD.map(lambda x: x + 's')
print(pluralLambdaRDD.collect())

['sundays', 'mondays', 'tuesdays', 'tuesdays', 'fridays']


#### ** (1e) Transform the newly created RDD **
#### Now use `map()` and a `lambda` function to return the first character in each word.  We'll `collect` this result directly into a variable.

In [6]:
pluralFirstChars = (pluralRDD.
                   map(lambda x: x[0]).
                   collect())
print(pluralFirstChars)

['s', 'm', 't', 't', 'f']


#### ** (1f) Pair RDDs **
#### We often need to work with pair RDDs.  A pair RDD is an RDD where each element is a pair tuple `(k, v)` where `k` is the key and `v` is the value. In this example, we will create a pair consisting of `('<day>', 1)` for each day in the RDD.
#### We can create the pair RDD using the `map()` transformation with a `lambda` function.

In [7]:
dayPairs = daysRDD.map(lambda x: (x,1))
print(dayPairs.collect())

[('sunday', 1), ('monday', 1), ('tuesday', 1), ('tuesday', 1), ('friday', 1)]


### ** Part 2: Counting with pair RDDs **

##### Now, let's count the number of times a particular day appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.
##### A naive approach would be to `collect()` all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations.
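
##### For contrast, the naive driver-side count described above can be sketched in plain Python.  This is only an illustration: `collected_days` stands in for the result of `daysRDD.collect()`, and the variable names are made up for this example.

```python
from collections import Counter

# Stand-in for daysRDD.collect(): every element is pulled back to the driver.
collected_days = ['sunday', 'monday', 'tuesday', 'tuesday', 'friday']

# All counting happens in one Python process, with the whole list in memory.
driver_counts = Counter(collected_days)
print(driver_counts['tuesday'])  # 2
```

##### This works for five elements, but both the memory use and the counting time are tied to a single machine, which is why the rest of this part uses data parallel operations.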

#### ** (2a) `groupByKey()` approach **
##### An approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions. There are two problems with using `groupByKey()`:
  + ##### The operation requires a lot of data movement to move all the values into the appropriate partitions.
  + ##### The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.
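
##### The second problem can be seen in a small plain-Python model of grouping.  This is a sketch of the idea only, not Spark's implementation; `group_by_key` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python model of grouping: every value for a key is kept in a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

pairs = [('sunday', 1), ('monday', 1), ('tuesday', 1), ('tuesday', 1), ('friday', 1)]
grouped = group_by_key(pairs)

# The list for a key grows with the number of occurrences of that key.
print(grouped['tuesday'])  # [1, 1]
```

##### Every value is stored before anything is summed, so a very frequent key produces a very long list.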

##### Use `groupByKey()` to generate a pair RDD of type `('day', iterator)`.

In [8]:
# Note that groupByKey requires no parameters
daysGrouped = dayPairs.groupByKey()
for key, value in daysGrouped.collect():
    print('{0}: {1}'.format(key, list(value)))

monday: [1]
tuesday: [1, 1]
friday: [1]
sunday: [1]


##### ** (2b) Use `groupByKey()` to obtain the counts **
##### Using the `groupByKey()` transformation creates an RDD containing 4 elements, each of which is a pair of a day and a Python iterator.
##### Now sum the iterator using a `map()` transformation.  The result should be a pair RDD consisting of (day, count) pairs.

In [9]:
dayCountsGrouped = daysGrouped.map(lambda x: (x[0], sum(x[1])))
print(dayCountsGrouped.collect())

[('monday', 1), ('tuesday', 2), ('friday', 1), ('sunday', 1)]


##### ** (2c) Counting using `reduceByKey` **
##### A better approach is to start from the pair RDD and then use the [reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets.
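
##### The two-stage behaviour described above can be modelled in plain Python.  This is a rough sketch under stated assumptions, not Spark's actual implementation: `reduce_by_key` is a hypothetical helper, and the way the pairs are split across partitions is invented for the example.

```python
def reduce_by_key(partitions, func):
    """Plain-Python model: combine per key inside each partition, then merge."""
    # Stage 1: local combine, producing one small dict per partition.
    local_results = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = func(local[key], value) if key in local else value
        local_results.append(local)

    # Stage 2: merge the per-partition results with the same function.
    merged = {}
    for local in local_results:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical split of the five (day, 1) pairs across three partitions.
partitions = [
    [('sunday', 1)],
    [('monday', 1), ('tuesday', 1)],
    [('tuesday', 1), ('friday', 1)],
]
day_counts = reduce_by_key(partitions, lambda x, y: x + y)
print(sorted(day_counts.items()))
# [('friday', 1), ('monday', 1), ('sunday', 1), ('tuesday', 2)]
```

##### Only one running value per key leaves each partition, so far less data has to move than when every individual value is shipped and stored in a list.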

In [10]:
# Note that reduceByKey takes in a function that accepts two values and returns a single value
dayCounts = dayPairs.reduceByKey(lambda x,y: x+y)
print(dayCounts.collect())

[('monday', 1), ('tuesday', 2), ('friday', 1), ('sunday', 1)]


##### ** (2d) All together **
##### The expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement.

In [11]:
dayCountsCollected = (daysRDD.
                      map(lambda x: (x,1)).
                      reduceByKey(lambda x,y: x+y).
                      collect())
print(dayCountsCollected)

[('monday', 1), ('tuesday', 2), ('friday', 1), ('sunday', 1)]


### ** Part 3: Finding unique days and a mean value **

##### ** (3a) Unique days **
##### Calculate the number of unique days in `daysRDD`.  You can use other RDDs that you have already created to make this easier.

In [12]:
uniqueDays = daysRDD.distinct().count()
print(uniqueDays)

4


##### ** (3b) Mean using `reduce` **
##### Find the mean number of occurrences per unique day in `dayCounts`.
##### Use a `reduce()` action to sum the counts in `dayCounts` and then divide by the number of unique days.  First `map()` the pair RDD `dayCounts`, which consists of (key, value) pairs, to an RDD of values.

In [13]:
from operator import add
totalCount = (dayCounts
              .map(lambda x: x[1])
              .reduce(add))
average = totalCount / float(uniqueDays)
print(totalCount)
print(round(average, 2))

5
1.25


### ** Part 4: Compute Deerfoot Trail stats **

##### In this section we will apply some of the above concepts towards analyzing commute time and accidents data collected for Deerfoot Trail.

##### ** (4a) Loading the data **
##### We will first load the data.  The data was collected in the period September 2013 to April 2014.  It was obtained by querying Google Maps for commute times and Twitter for accident reports.  Although this data set is very small, because we are using parallel computation via Spark the functions we develop will scale for larger data sets.  To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We will use `take(15)` to print 15 lines from this file.

In [14]:
fileName = '/content/deerfoot.csv'

deerfootRDD = (sc.textFile(fileName, 8))
'\n'.join(deerfootRDD.zipWithIndex().map(lambda x: str(x[1]) + ': ' + str(x[0])).take(15))

'0: 21/09/2013,Saturday,34,34,34,34,35,34,35,36,38,36,36,35,35,35,35,35,36,34,34,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2\n1: 22/09/2013,Sunday,34,34,34,34,34,34,34,35,35,35,34,35,34,35,34,34,34,34,34,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3\n2: 23/09/2013,Monday,35,36,41,43,45,41,36,35,35,35,37,40,43,46,43,37,34,34,35,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,2\n3: 24/09/2013,Tuesday,35,36,40,44,52,41,38,36,36,36,37,40,44,47,42,39,34,35,35,0,0,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,4,1,1,0,5\n4: 25/09/2013,Wednesday,35,36,40,39,39,37,36,35,36,37,37,40,44,45,41,38,35,35,35,0,0,0,0,0,0,0,0,0,0,0,0,2,2,1,0,0,0,0,0,0,1,4,0,0,5\n5: 26/09/2013,Thursday,34,36,50,56,49,37,37,35,36,36,39,56,59,46,42,38,35,34,35,0,0,1,1,1,0,1,0,0,0,0,2,1,5,1,0,0,0,0,0,3,4,5,1,0,13\n6: 27/09/2013,Friday,34,35,37,37,36,35,36,36,36,38,40,43,47,48,42,38,35,35,35,0,0,0,2,0,0,0,0,0,1,2,0,0,0,0,0,1,0,0,0,0,1,2,0,0,6\n7: 28/09/2013,Saturday,34,34,34,34,34,34,35,35,35,35,35,35,35,49,44,36,34

##### ** (4b) Extracting fields relevant to the analysis **
##### We will extract only those fields that are useful for the rest of this lab.  Specifically, we are interested in field 2 (day), field 7 (commute time at 8 AM), and field 14 (commute time at 4 PM), counting fields from 1; these are indices 1, 6, and 13 of the zero-based Python list produced by `split(',')`.  We consider only these 2 times since they best represent the morning and afternoon rush traffic.  Write a function `extractFields` that takes as input each record of `deerfootRDD` and produces a record for another RDD that contains only these 3 fields.

In [15]:
def extractFields(deerfootRDDRecord):
    """Creates a record consisting of day, 8 AM commute time, and 4 PM commute time.

    Args:
        deerfootRDDRecord (str): A comma-separated string containing all fields of one record.

    Returns:
        tuple: (day, 8 AM commute time, 4 PM commute time), with each field as a string.
    """
    fieldsList = deerfootRDDRecord.split(',')
    return (fieldsList[1],fieldsList[6],fieldsList[13])

print(extractFields(deerfootRDD.take(1)[0]))

('Saturday', '35', '35')
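
##### Note that the commute times above are still strings.  As a possible variation, the sketch below converts them to integers during extraction and rejects records that are too short.  `extract_fields_typed` is a hypothetical name, the length check is an assumption about how malformed lines might be handled, and `sample` is the first record of the file shortened to its first 21 fields.

```python
def extract_fields_typed(record):
    """Return (day, 8 AM time, 4 PM time) with the times converted to int."""
    fields = record.split(',')
    # Assumption: a usable record has at least 14 fields (indices 0 to 13).
    if len(fields) < 14:
        raise ValueError('expected at least 14 fields, got {0}'.format(len(fields)))
    return (fields[1], int(fields[6]), int(fields[13]))

sample = '21/09/2013,Saturday,34,34,34,34,35,34,35,36,38,36,36,35,35,35,35,35,36,34,34'
print(extract_fields_typed(sample))  # ('Saturday', 35, 35)
```

##### Converting early means later steps can sum or compare the times directly.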


##### ** (4c) Obtaining extracted RDD **
##### Transform the `deerfootRDD` so that we get a resulting `deerfootPeakRDD` that only has peak hour commute times.

In [16]:
deerfootPeakRDD = deerfootRDD.map(extractFields)

print(deerfootPeakRDD.take(1))

[('Saturday', '35', '35')]


##### ** (4d) Obtaining stats - counting number of occurrences of each day of the week **
##### Start with `deerfootPeakRDD`.  Create a pair RDD `deerfootDayPairRDD` that contains records where the day is the key and 1 is the value. Apply another transformation on `deerfootDayPairRDD` to get a `deerfootDayCounts` RDD.

In [17]:
deerfootDayPairRDD = deerfootPeakRDD.map(lambda x: (x[0],1))
deerfootDayCounts = deerfootDayPairRDD.reduceByKey(lambda x,y:x+y)

deerfootDayCountsList = deerfootDayCounts.collect()
print(deerfootDayCountsList)
deerfootDayCountsDict = dict(deerfootDayCountsList)
print(deerfootDayCountsDict.get('Friday'))

[('Thursday', 29), ('Monday', 29), ('Friday', 28), ('Wednesday', 29), ('Saturday', 29), ('Sunday', 29), ('Tuesday', 29)]
28


##### ** (4e) Filtering out Saturdays and Sundays **
##### As we can see from the previous result, there is almost an equal number of days of each type in the data set, which suggests that there is no big gap in the data collection.  Let's say we are now only interested in commute time stats for Monday to Friday.  Write a function called `filterSatSun` that filters out records for Saturdays and Sundays in `deerfootPeakRDD`.  Apply this transformation on `deerfootPeakRDD` to obtain an RDD called `deerfootPeakMFRDD`.

In [18]:
def filterSatSun(deerfootPeakRDDRecord):
    """Ignores "Saturday" and "Sunday" records.

    Args:
        deerfootPeakRDDRecord (tuple): (day, 8 AM commute time, 4 PM commute time).

    Returns:
        bool: False if day is "Saturday" or "Sunday", True otherwise.
    """
    # Keep a record only when its day is not a weekend day
    return deerfootPeakRDDRecord[0] not in ("Saturday", "Sunday")

deerfootPeakMFRDD = deerfootPeakRDD.filter(filterSatSun)
print(deerfootPeakMFRDD.take(5))

[('Monday', '45', '40'), ('Tuesday', '52', '40'), ('Wednesday', '39', '40'), ('Thursday', '49', '56'), ('Friday', '36', '43')]


##### ** (4f) Computing average commute times for each day of the week **
##### We will now compute the average commute time for each day of the week for both 8 AM and 4 PM. To do this, first create a pair RDD called `deerfootPeakAMRDD` where each record has the day as the key and the 8 AM commute time as the value.  Apply one or more appropriate transformations to compute the average.  Repeat the process for the evening rush hour.  You can use the previously computed `deerfootDayCountsDict` in the average calculation.

In [19]:
deerfootPeakAMRDD = deerfootPeakMFRDD.map(lambda s: (s[0],int(s[1])))
deerfootPeakAMreduceByDay = deerfootPeakAMRDD.reduceByKey(lambda x,y:x+y).collect()

amAverages = list()

for item in deerfootPeakAMreduceByDay:
    avg = item[1]/float(deerfootDayCountsDict.get(item[0]))
    amAverages.append((item[0],avg))

deerfootPeakPMRDD = deerfootPeakMFRDD.map(lambda s: (s[0],int(s[2])))
deerfootPeakPMreduceByDay = deerfootPeakPMRDD.reduceByKey(lambda x,y:x+y).collect()

pmAverages = list()

for item in deerfootPeakPMreduceByDay:
    avg = item[1]/float(deerfootDayCountsDict.get(item[0]))
    pmAverages.append((item[0],avg))

print(amAverages)
print(pmAverages)

[('Thursday', 41.10344827586207), ('Monday', 42.44827586206897), ('Friday', 38.57142857142857), ('Wednesday', 44.206896551724135), ('Tuesday', 44.48275862068966)]
[('Thursday', 41.37931034482759), ('Monday', 40.310344827586206), ('Friday', 43.0), ('Wednesday', 40.93103448275862), ('Tuesday', 41.44827586206897)]
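
##### An alternative to looking up counts in a separate dictionary is to carry a `(sum, count)` pair per key and divide at the end.  The plain-Python sketch below shows the idea; `average_by_key` is a hypothetical helper, and the sample values are made up for illustration rather than taken from `deerfoot.csv`.  In PySpark the same shape could be expressed with `mapValues()` and `reduceByKey()`, though that variant is not shown or tested here.

```python
def average_by_key(pairs):
    """Average the values for each key by accumulating (sum, count) pairs."""
    totals = {}
    for key, value in pairs:
        running_sum, running_count = totals.get(key, (0, 0))
        totals[key] = (running_sum + value, running_count + 1)
    # Divide once per key after all values have been accumulated.
    return {key: s / float(c) for key, (s, c) in totals.items()}

# Made-up (day, commute time) pairs, not real Deerfoot Trail data.
sample_pairs = [('Monday', 40), ('Monday', 50), ('Tuesday', 44)]
averages = average_by_key(sample_pairs)
print(averages['Monday'])   # 45.0
print(averages['Tuesday'])  # 44.0
```

##### Because the count travels with the sum, the averages stay correct even if the records are filtered differently from the RDD that was used to build the day counts.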


##### ** (4g) Computing maximum morning rush hour commute times for each day of the week **
##### For 8 AM, find the maximum commute time for each weekday (Monday to Friday).

In [20]:
deerfootPeakAMMaxreduceByDay = deerfootPeakAMRDD.reduceByKey(lambda x,y:max(x,y)).collect()

for item in deerfootPeakAMMaxreduceByDay:
    print(item)


('Thursday', 57)
('Monday', 64)
('Friday', 57)
('Wednesday', 61)
('Tuesday', 87)
