# Developing with RDDs:  Parallelism

In this lab, you will take your first steps with RDDs in Spark, working in the Spark REPL inside a Jupyter Notebook.


## Objectives

1. Use the `SparkContext` to bootstrap `RDD`s.
2. Use the `RDD`s to explore parallelism & its benefits when processing data.

## Instructions

### Consider parallelism

Whenever we process big data, we need to consider the degree to which we'll be able to use parallelism. The general idea is that if we can break our job down into smaller chunks that can execute in parallel, the job should finish sooner. In this lab, we're going to test that theory by methodically increasing the degree of parallelism of a word count calculation that finds the most frequently used word in a text file.


### Identify directory containing source data

Check with your course instructor to get the directory containing the source data for this lab.  It's probably at `../Resources` relative to this notebook. 


### Add code to control the degree of parallelism

Below is the Scala code that serves as your starting point. We will experiment with this algorithm in the execution cell, but we've kept a copy here as markdown so that you always have the original to refer back to.


```scala
// NOTE: set the value srcDir to a string containing the absolute path of
// the location of the source data directory, lesson-5xx-resources!
val srcDir = "../Resources"
val minCores = 2 // if your system has 2+ cores
val nCores = 8 // depends on your system
val partitions = 2 to nCores by 2
val files = Array("byu-corpus.txt")
val delimiters = Array(' ',',','.',':',';','?','!','-','@','#','(',')')
files foreach( file => {
  val original = sc.textFile(srcDir + "/" + file);
  println("File: " + file);
  partitions foreach(n => {
    println("Partitions: " + n)

    // TODO: repartition the RDD to use n cores
    val partitioned = ???;

    val start = System.currentTimeMillis

    // TODO: find the word used the most in the file using RDD called "partitioned"
    val result = ???

    val time = System.currentTimeMillis - start
    println("Took: " + time + " ms")
  })
})
```
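
The `nCores` value above is hard-coded to 8, with a comment that it depends on your system. As a minimal sketch, assuming you want the JVM's view of the machine, `Runtime.getRuntime.availableProcessors` reports the number of logical processors, which could be used to derive the partition counts to test. The name `available` is introduced here for illustration and is not part of the lab code.

```scala
// Logical processors visible to the JVM (may exceed the number of
// physical cores on hyper-threaded CPUs).
val available = Runtime.getRuntime.availableProcessors

val minCores = 2
// Never go below minCores, even on a single-core machine.
val nCores = math.max(minCores, available)

// Same stepping as the lab code: 2, 4, 6, ... up to nCores.
val partitions = minCores to nCores by 2

println(s"Available processors: $available")
println(s"Partition counts to test: ${partitions.mkString(", ")}")
```

Testing partition counts beyond the number of available processors is still possible, but the extra partitions cannot all run at the same time.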


## Execution cell
This is where you can experiment with the algorithm. Read on to find out what we want you to do with it.

```scala
// NOTE: set the value srcDir to a string containing the absolute path of
// the location of the source data directory, lesson-5xx-resources!
val srcDir = "../Resources"
val minCores = 2 // if your system has 2+ cores
val nCores = 8 // depends on your system
val partitions = 2 to nCores by 2
val files = Array("byu-corpus.txt")
val delimiters = Array(' ',',','.',':',';','?','!','-','@','#','(',')')
files foreach( file => {
  val original = sc.textFile(srcDir + "/" + file);
  println("File: " + file);
  partitions foreach(n => {
    println("Partitions: " + n)

    // TODO: repartition the RDD to use n cores
    val partitioned = ???

    val start = System.currentTimeMillis

    // TODO: find the word used the most in the file using RDD called "partitioned"
    val result = ???

    val time = System.currentTimeMillis - start
    println("Took: " + time + " ms")
  })
})
```

Notice the line beginning with `val original = sc.textFile(srcDir + "/" + file);`. It gives us an `RDD` with a default number of partitions. The partition count limits how many tasks can work on the `RDD` at the same time, so it directly affects the degree of parallelism.

Next, look at the line following the first `TODO` comment; it begins with `val partitioned = ` and is left unimplemented. Notice that this code sits inside a `foreach` loop whose variable `n` is the number of partitions under test.

Replace `???` with a call to `original.repartition(n)`, which causes Spark to create a new `RDD` with the given number of partitions. In this standalone example, that effectively determines how many cores we'll be executing on, up to the number of cores available. Your line should look like the following.

``` scala
val partitioned = original.repartition(n);
```
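
To check what `repartition` does before timing anything, the sketch below builds a small `RDD` with a known partition count and inspects it with `getNumPartitions`. It assumes Spark is on the classpath. `SparkContext.getOrCreate` returns the notebook's existing context when there is one, so the `local[4]` master and the names `ctx`, `numbers`, and `wider` are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context if present, otherwise start a local one.
val conf = new SparkConf().setMaster("local[4]").setAppName("partition-check")
val ctx = SparkContext.getOrCreate(conf)

// Build an RDD with an explicit partition count of 2.
val numbers = ctx.parallelize(1 to 1000, 2)
println(numbers.getNumPartitions) // 2

// repartition returns a new RDD; the data is shuffled into 4 partitions.
val wider = numbers.repartition(4)
println(wider.getNumPartitions)   // 4

// The source RDD is unchanged.
println(numbers.getNumPartitions) // 2
```

Because `repartition` returns a new `RDD` and leaves its source untouched, the lab code can reuse `original` in every pass of the loop.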

### Add the word count `RDD` code

Now that we are iterating on the parallelism of our `RDD`, we need to actually do something that's going to leverage it.  This lab will use the quintessential big data problem known as "word count" to find the word used the most in a given file of text.

Look for the second `TODO` comment. Below it, you'll see the line beginning with `val result = `. Replace that line with our word count code, given below:


``` scala
val result = partitioned
    .flatMap(_.split(delimiters))
    .filter(_.trim.length > 0)
    .map(x => (x, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, false)
    .first;
```


As you can see, we're doing a fairly straightforward word count and returning the most frequently used word. The difference is that each iteration of the loop repeats the calculation with a higher degree of parallelism.
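
To see what each stage of the word count pipeline contributes, here is a rough equivalent on plain Scala collections that runs without Spark. The sample `lines` are made up for illustration, and `groupBy` stands in for `reduceByKey`, which ordinary collections do not have.

```scala
val delimiters = Array(' ', ',', '.', ':', ';', '?', '!', '-', '@', '#', '(', ')')

// Hypothetical sample input standing in for the lines of the corpus file.
val lines = Seq("the cat, the hat", "the end.")

// Splitting on every delimiter leaves an empty string wherever two
// delimiters are adjacent (", " here), hence the filter step.
val tokens = lines.flatMap(_.split(delimiters))
val words = tokens.filter(_.trim.length > 0)

// Count occurrences per word; reduceByKey(_ + _) plays this role on an RDD.
val counts = words.groupBy(identity).map { case (word, hits) => (word, hits.size) }

// Counterpart of sortBy(_._2, false).first: the pair with the highest count.
val top = counts.maxBy(_._2)

println(top) // (the,3)
```

On an `RDD`, the same steps run once per partition, and `reduceByKey` combines the partial counts across partitions.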


### Execute and observe timings

We're ready to roll now. Execute the Spark script by running the execution cell. You should see output similar to the following, although your timings will differ.

``` text
minCores: Int = 2
nCores: Int = 8
partitions: scala.collection.immutable.Range = Range(2, 4, 6, 8)
files: Array[String] = Array(byu-corpus.txt)
delimiters: Array[Char] = Array( , ,, ., :, ;, ?, !, -, @, #, (, ))
File: byu-corpus.txt
Partitions: 2
Took: 5848 ms
Partitions: 4
Took: 3601 ms
Partitions: 6
Took: 3580 ms
Partitions: 8
Took: 2979 ms
```


### Conclusion

What do you notice about our timings? Do they confirm or refute our hypothesis that more parallelism generally means better performance? If they confirm it, great! If not, what other factors do you suppose are influencing our outcome? Discuss with your instructor and classmates!
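
One factor worth testing: Spark transformations such as `repartition` are lazy, so the timed region in the lab code also covers reading the file and shuffling it into `n` partitions, not just the word count. The sketch below shows a small timing helper, named `timed` here purely for illustration, demonstrated on a plain Scala workload so that it runs without Spark.

```scala
// Runs `body` once, prints the elapsed wall-clock time, and returns the result.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime
  val result = body
  val elapsedMs = (System.nanoTime - start) / 1000000
  println(s"$label took: $elapsedMs ms")
  result
}

// Stand-in workload: sum of squares over a local range.
val total = timed("sum of squares") {
  (1L to 1000000L).map(x => x * x).sum
}
println(total)
```

In the lab loop, you could call `partitioned.cache()` and `partitioned.count()` before the timed region and wrap only the word count in `timed`, which would exclude the file read and the shuffle from the measurement. This is a suggestion to experiment with, not part of the original exercise.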
