# Developing with RDDs: Parallelism

In this lab, you will take your first steps with Spark RDDs in a Jupyter Notebook.

## Objectives

1. Create a `SparkSession` and use its `SparkContext` to bootstrap `RDD`s.
2. Use the `RDD`s to explore parallelism & its benefits when processing data.

## Prerequisites

This lab assumes that the student is familiar with the course environment, in particular, Jupyter Notebook.

### Consider parallelism

Whenever we process big data, we need to consider the degree to which we'll be able to use parallelism.  The general idea is that if we can break our job down into smaller and smaller chunks that can execute in parallel, we should perform better.  In this lab, we're going to test that theory by methodically increasing the degree of parallelism when performing a word count calculation to get the most frequently used word in some text file.
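
The idea of breaking a job into chunks and combining the partial results can be sketched in plain Python, without Spark. The helper names (`count_chunk`, `merge_counts`, `parallel_word_count`) and the sample lines are hypothetical, and a thread pool merely stands in for Spark's executors; it illustrates the structure of the computation and is not expected to speed up CPU-bound Python code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_chunk(chunk):
    """Count the words in one chunk (a list of lines)."""
    counter = Counter()
    for line in chunk:
        counter.update(line.lower().split())
    return counter


def merge_counts(left, right):
    """Combine two partial counts into one."""
    return left + right


def parallel_word_count(text_lines, num_chunks):
    """Split the lines into num_chunks slices, count each, then merge."""
    chunks = [text_lines[i::num_chunks] for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    return reduce(merge_counts, partial_counts, Counter())


sample_lines = ["the cat sat", "the dog ran", "a cat ran", "the end"]
result = parallel_word_count(sample_lines, num_chunks=2)
print(result.most_common(1))
```

The final count does not depend on how many chunks are used; only the amount of work that can proceed at the same time changes.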

### Get a SparkSession

To work with Spark, we first need to get our hands on a `SparkSession`.
The `SparkSession` class is the entry point into all functionality in Spark; the RDD API is reached through its `sparkContext` attribute.

> Note: as of Spark 2.0, `SparkSession` replaced `SQLContext`. However, we could still use `SQLContext`, as it is being kept for backward compatibility.

We'll use `SparkSession.builder` to create a `SparkSession`. The builder lets you define your application name and set various parameters in the Spark config, although there is no need to do so for our simple example.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark Parallelism").getOrCreate()


### Read external file into RDD

You'll be using the file `sherlock-holmes.txt`. Retrieve it from the container folder `/home/jovyan/Resources`, which is mounted from a directory on the host machine; see the instructions for setting up the Docker container that runs Jupyter Notebook.

Read the text into an RDD.

Make sure that the file was read correctly by printing the RDD element count (here the number of lines in the file).

In [None]:
file = "/home/jovyan/Resources/sherlock-holmes.txt"

lines = spark.sparkContext.textFile(file)

print(lines.count())

### Count the words

Before we get to controlling the degree of parallelism, let's make sure we're able to count the words and find the one with the maximum count.

Split the lines into individual words and convert them to lower case. For each word, make a tuple whose first element is the word itself and whose second element is the initial count 1. Finally, reduce by key to obtain an RDD of unique words with their total counts.


In [None]:
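import re
from operator import add

# Split each line on runs of spaces/punctuation, lower-case the tokens, and
# drop the empty strings that re.split yields at line edges and on blank lines.
words = (lines.flatMap(lambda line: re.split(r'[ \],.:;?!\-@#()\\*"/]+', line))
              .map(lambda w: w.lower())
              .filter(lambda w: w != ''))

# Pair every word with an initial count of 1, then sum the counts per word.
# (The name `dict` is avoided because it would shadow the built-in type.)
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(add)

counts.take(10)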
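
To see what the split does to a single line, here is a plain-Python sketch that needs no Spark. It uses the same delimiter class as the cell above, written as a raw string; the `tokenize` helper and the sample sentence are hypothetical. Note that `re.split` returns an empty string when a line starts or ends with a delimiter (and for a blank line), so dropping empty tokens keeps them from being counted as a word.

```python
import re

# Same delimiter class as the lab's split, written as a raw string.
DELIMITERS = re.compile(r'[ \],.:;?!\-@#()\\*"/]+')


def tokenize(line):
    """Split a line on delimiter runs, lower-case, and drop empty tokens."""
    return [token.lower() for token in DELIMITERS.split(line) if token]


raw = DELIMITERS.split("Holmes, said I.")
tokens = tokenize("Holmes, said I.")
print(raw)     # ['Holmes', 'said', 'I', '']
print(tokens)  # ['holmes', 'said', 'i']
```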

### Find the word with the maximum count


First, let's define a function that takes two tuples (each a word and its count) as arguments and returns the one with the bigger count.

Use the function to reduce the RDD to obtain the word with the biggest count.


In [None]:
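def getMax(r, c):
    # Return whichever (word, count) tuple has the larger count.
    return r if r[1] > c[1] else c

# The result is not named `max`, which would shadow the built-in function.
mostFrequent = counts.reduce(getMax)

print(mostFrequent)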
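
The same pairwise reduction can be tried locally with `functools.reduce`. This is only a sketch on a hypothetical list of `(word, count)` tuples; the name `get_max` mirrors the lab's function. Spark's `reduce` expects a commutative and associative function because partial results from different partitions may be combined in any order, which the second call imitates by reversing the input.

```python
from functools import reduce


def get_max(r, c):
    """Return whichever (word, count) tuple has the larger count."""
    return r if r[1] > c[1] else c


sample_counts = [("the", 5), ("holmes", 3), ("a", 4)]

largest = reduce(get_max, sample_counts)
largest_reversed = reduce(get_max, list(reversed(sample_counts)))

print(largest)           # ('the', 5)
print(largest_reversed)  # ('the', 5)
```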

### Control the degree of parallelism

Until now we have been doing a fairly straightforward word count that returns the most frequently used word.

Now we want to run the procedure in a loop, increasing the number of partitions on each iteration and thereby the level of parallelism available to the job.

To do this, repartition the original RDD of lines (cache it first with `persist()` so that the file is not re-read on every iteration): `repartition(partitionCount)` returns a new RDD with `partitionCount` partitions.

For any RDD, `getNumPartitions()` returns the number of partitions.


In [None]:
for partitionCount in range(2, 9, 2):

    rept = lines.repartition(partitionCount)
    print(rept.getNumPartitions())


After repartitioning, run the word count procedure on the repartitioned RDD rather than on the original one.

We want to capture the execution time in milliseconds for each iteration.
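
One way to capture a duration in milliseconds is sketched below. The `time_ms` helper is hypothetical, and the summation only stands in for the real work. Because Spark transformations are lazy, the callable you time must end in an action such as `reduce` or `count`; otherwise nothing is actually executed inside the timed region.

```python
from datetime import timedelta
from time import perf_counter


def time_ms(action):
    """Run a zero-argument callable; return (result, elapsed milliseconds)."""
    start = perf_counter()
    result = action()
    elapsed_ms = (perf_counter() - start) * 1000
    return result, elapsed_ms


result, elapsed_ms = time_ms(lambda: sum(range(100_000)))
print(result, round(elapsed_ms, 3))

# With datetime differences, use total_seconds(): .microseconds holds only
# the sub-second component of the interval.
interval = timedelta(seconds=2, microseconds=500_000)
print(interval.microseconds / 1000)     # 500.0
print(interval.total_seconds() * 1000)  # 2500.0
```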


### Conclusion

What do you notice about the timings? Do they confirm or refute our hypothesis that more parallelism generally means better performance? If they confirm it, great! If not, what other factors do you suppose are influencing the outcome? Discuss with your instructor and classmates!


## Your Solution

In [None]:
# TODO






















## Suggested Solution

In [None]:
from operator import add
from datetime import datetime
from pyspark.sql import SparkSession

import re

# get the word with bigger count
def getMax(r, c):
  if (r[1] > c[1]):
    return r
  else:
    return c

spark = SparkSession.builder.appName("Python Spark Parallelism").getOrCreate()

file = "/home/jovyan/Resources/sherlock-holmes.txt"

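# read lines and cache them; count() forces the cache to be populated
# so that the first timing does not include reading the file
lines = spark.sparkContext.textFile(file).persist()
lines.count()

minCores = 2
numCores = 8

# iterate over even partition counts from minCores to numCores
for partitionCount in range(minCores, numCores+1, 2):

    rept = lines.repartition(partitionCount)

    dt1 = datetime.now()
    # run the word count on the repartitioned RDD, not on the original lines
    words = (rept.flatMap(lambda line: re.split(r'[ \],.:;?!\-@#()\\*"/]+', line))
                 .map(lambda w: w.lower())
                 .filter(lambda w: w != ''))
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(add)
    mostFrequent = counts.reduce(getMax)
    dt2 = datetime.now()

    # total_seconds() covers the whole interval; .microseconds is only its sub-second part
    elapsedMs = round((dt2 - dt1).total_seconds() * 1000)
    print(str(mostFrequent) + " partition count " + str(partitionCount) + " time " + str(elapsedMs))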
