# Developing with RDDs: Parallelism

In this lab, you will take your first steps with RDDs in Spark, working in Jupyter Notebook.

## Objectives

1. Create the `SparkContext` to bootstrap `RDD`s.
2. Use the `RDD`s to explore parallelism & its benefits when processing data.

## Prerequisites

This lab assumes that the student is familiar with the course environment, in particular, Jupyter Notebook.

### Consider parallelism

Whenever we process big data, we need to consider the degree to which we can use parallelism. The general idea is that if we can break a job down into smaller chunks that execute in parallel, it should finish faster. In this lab, we test that theory by methodically increasing the degree of parallelism while performing a word count to find the most frequently used word in a text file.
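The split-and-merge idea behind this hypothesis can be sketched in plain Python, without Spark. The helpers `count_words` and `split_into_chunks` are illustrative names, not Spark APIs, and the chunks are processed one after another here. The point is only that partial counts computed independently merge into the same result as a single pass.

```python
from collections import Counter

def count_words(lines):
    """Count words in one chunk of lines (the work a single task would do)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks roughly equal, contiguous slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

lines = ["the cat sat", "the dog ran", "a cat ran", "the end"]

# Each chunk could be processed independently (here, sequentially for clarity).
partials = [count_words(chunk) for chunk in split_into_chunks(lines, 2)]

# Merging the partial counts gives the same answer as counting everything at once.
merged = sum(partials, Counter())
print(merged.most_common(1))  # [('the', 3)]
```

Splitting and merging are themselves work, which is one reason more chunks do not automatically mean a faster job.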

### Create Spark Context

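The Spark context (`SparkContext`) establishes a connection to a Spark execution environment.
For example, it is used to create RDDs by reading files or by parallelizing lists of objects.
Create it first, as follows, after importing the `pyspark` library. The master URL `local[*]` runs Spark locally with as many worker threads as there are logical cores on the machine.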

```
import pyspark
sc = pyspark.SparkContext('local[*]')
```

Create the context only once per notebook; calling the `SparkContext` constructor again while a context is already active raises an error.

### Check if Spark Context works

Create an RDD consisting of the numbers from 0 to 999, then take a random sample of 5 numbers.
The first argument to `takeSample` is `withReplacement`; passing `False` samples without replacement.

```
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```

You should see output consisting of 5 random numbers from the initial range (your numbers will differ).

```
[492, 777, 372, 844, 131]
```

### Read external file into RDD

You'll be using the file `sherlock-holmes.txt`. Retrieve it from the container folder (`/data`) that is mounted from a host machine directory; see the instructions on setting up the Docker container to run Jupyter Notebook.

```
file = "/data/sherlock-holmes.txt"
```

Read the text into an RDD and cache it using `persist`. Caching greatly improves performance when an RDD is used more than once, as it will be in this exercise; without caching, the file would be read and parsed each time the RDD is used.

Make sure that the file was read correctly by printing the RDD element count (here the number of lines in the file).

```
text = sc.textFile(file).persist()
print(text.count())
```

### Count the words

Before we get to controlling the degree of parallelism, let's make sure we're able to count the words and find the one with the maximum count.

Split the lines into individual words, then map each word to a tuple whose first element is the word itself and whose second element is the initial count 1. Finally, reduce by key to obtain an RDD of unique words with their total counts.

```
from operator import add

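# split every line into words and flatten the result into one RDD of words
words = text.flatMap(lambda line : line.split(' '))
# pair each word with an initial count of 1 (named `pairs` to avoid shadowing the built-in `dict`)
pairs = words.map(lambda w : (w, 1))
# sum the counts per word
counts = pairs.reduceByKey(add)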

counts.take(10)
```
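To make the three steps concrete, here is a plain-Python sketch of what `flatMap`, `map`, and `reduceByKey` compute on a tiny input. It is an analogy rather than Spark code, and the sample `lines` list is made up for illustration. It also shows a side effect of splitting on a single space: double spaces and empty lines produce empty-string "words", which is why `''` can show up with a large count in the real results.

```python
from operator import add

lines = ["the  cat", "", "the dog"]  # double space and an empty line on purpose

# flatMap: one line in, zero or more words out, flattened into a single list.
words = [w for line in lines for w in line.split(' ')]

# map: pair every word with an initial count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: combine the counts of identical words with `add`.
counts = {}
for word, n in pairs:
    counts[word] = add(counts[word], n) if word in counts else n

print(counts)  # {'the': 2, '': 2, 'cat': 1, 'dog': 1}
```

Calling `line.split()` with no argument splits on any run of whitespace and never yields empty strings, so it is an option if you want to exclude `''` from the counts.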

### Find the word with the maximum count

First, define a function that takes two (word, count) tuples as arguments and returns the one with the bigger count.

```
def getMax(r, c):
  if (r[1] > c[1]):
    return r
  else:
    return c
```

Use the function to reduce the RDD to obtain the word with the biggest count.

```
# named `top` to avoid shadowing the built-in `max`
top = counts.reduce(getMax)

print(top)
```
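The pairwise reduction can be tried on an ordinary list with `functools.reduce`, which folds a two-argument function over a sequence in the same spirit as `RDD.reduce`. This is a local sketch with made-up counts; `get_max` mirrors the lab's `getMax`.

```python
from functools import reduce

def get_max(r, c):
    """Return whichever (word, count) tuple has the larger count."""
    return r if r[1] > c[1] else c

counts = [("the", 5), ("of", 3), ("holmes", 4)]

# reduce folds the list pairwise: get_max(get_max(a, b), c)
top = reduce(get_max, counts)

# The built-in max with a key function expresses the same idea in one call.
top_builtin = max(counts, key=lambda pair: pair[1])

print(top, top_builtin)  # ('the', 5) ('the', 5)
```

The two approaches can differ when counts tie: `get_max` keeps the later tuple, while the built-in `max` keeps the first one it sees.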

### Control the degree of parallelism

Until now we have been doing a fairly straightforward word count, returning the most frequently used word.

Now we want to execute the procedure in a loop, each time increasing the number of partitions participating in the execution and thus the level of parallelism.

This is accomplished by repartitioning the original RDD of cached lines of text, that is, transforming it into another RDD with `partitionCount` partitions.

For any RDD the number of partitions is given by `getNumPartitions`.

```
for partitionCount in range(2, 9, 2):

    rept = text.repartition(partitionCount)
    print(rept.getNumPartitions())
```
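As a rough mental model of what changing the partition count does, the sketch below deals a small list of items round-robin into a given number of partitions. The `repartition` function here is a hypothetical plain-Python helper, not Spark's implementation, which also shuffles data between executors.

```python
def repartition(items, num_partitions):
    """Hypothetical helper: deal items round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        partitions[i % num_partitions].append(item)
    return partitions

lines = [f"line {i}" for i in range(10)]

for partition_count in range(2, 9, 2):
    parts = repartition(lines, partition_count)
    # Print the partition count and how many items landed in each partition.
    print(len(parts), [len(p) for p in parts])
```

With only 10 items, 8 partitions hold one or two items each, so per-partition overhead starts to outweigh the work itself. In PySpark, `rept.glom().map(len).collect()` should give a comparable per-partition size listing for a real RDD.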

After repartitioning, run the word counting procedure on the repartitioned RDD rather than on the original one.

We want to capture the execution time in milliseconds for each iteration. Your output should look similar to the following:

```
('the', 71744) partition count 2 time 460
('the', 71744) partition count 4 time 156
('the', 71744) partition count 6 time 296
('the', 71744) partition count 8 time 263
```
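When measuring elapsed milliseconds with `datetime`, note that a `timedelta`'s `microseconds` attribute holds only the sub-second part of the interval, while `total_seconds()` covers all of it. The sketch below shows the difference and one alternative, `time.perf_counter()`; the `sum(range(...))` call is just a stand-in for the Spark job being timed.

```python
from datetime import timedelta
import time

# A timedelta of 1.5 seconds: .microseconds holds only the fractional part.
delta = timedelta(seconds=1, microseconds=500_000)
print(delta.microseconds / 1000)      # 500.0  (the whole second is lost)
print(delta.total_seconds() * 1000)   # 1500.0 (full elapsed milliseconds)

# time.perf_counter() is a monotonic clock intended for measuring intervals.
start = time.perf_counter()
sum(range(100_000))                   # stand-in for the work being timed
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"elapsed: {elapsed_ms:.1f} ms")
```

Because Spark transformations are lazy, the timer has to surround an action (such as `reduce`) for the measurement to include the actual computation.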

### Conclusion

What do you notice about our timings? Do they confirm or refute our hypothesis that more parallelism generally means better performance? If they confirm it, great! If not, what other factors do you suppose are influencing our outcome? Discuss with your instructor and classmates!


In [1]:
import pyspark

sc = pyspark.SparkContext('local[*]')


In [3]:
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

[823, 365, 170, 496, 368]

In [3]:
file = "/data/sherlock-holmes.txt"
text = sc.textFile(file).persist()

print(text.count())

128457


In [4]:
from operator import add

words = text.flatMap(lambda line : line.split(' '))
pairs = words.map(lambda w : (w, 1))
counts = pairs.reduceByKey(add)

counts.take(10)

[('The', 6139),
 ('Project', 205),
 ('EBook', 5),
 ('of', 39169),
 ('Sir', 30),
 ('Arthur', 18),
 ('Conan', 3),
 ('in', 19515),
 ('series', 88),
 ('', 69285)]

In [5]:
def getMax(r, c):
  if (r[1] > c[1]):
    return r
  else:
    return c

top = counts.reduce(getMax)

print(top)

('the', 71744)


In [6]:
for partitionCount in range(2, 9, 2):

    rept = text.repartition(partitionCount)
    print(rept.getNumPartitions())

2
4
6
8


In [None]:
# TODO

In [8]:
from operator import add
from datetime import datetime

# get the word with bigger count
def getMax(r, c):
  if (r[1] > c[1]):
    return r
  else:
    return c

file = "/data/sherlock-holmes.txt"

# read text and cache
text = sc.textFile(file).persist()

minCores = 2
numCores = 8

# iterate over even number of partitions from minCores to numCores
for partitionCount in range(minCores, numCores+1, 2):

    rept = text.repartition(partitionCount)

    dt1 = datetime.now()
    words = rept.flatMap(lambda line : line.split(' '))
    pairs = words.map(lambda w : (w, 1))
    counts = pairs.reduceByKey(add)
    top = counts.reduce(getMax)
    dt2 = datetime.now()
    # total_seconds() covers the whole interval; .microseconds alone would drop whole seconds
    print(str(top) + " partition count " + str(partitionCount) + " time " + str(round((dt2 - dt1).total_seconds() * 1000)))


('the', 71744) partition count 2 time 469
('the', 71744) partition count 4 time 434
('the', 71744) partition count 6 time 482
('the', 71744) partition count 8 time 465
