## Developing Spark Applications: PairRDDFunctions

In this lab, you will take your first steps with PairRDDFunctions in a Jupyter Notebook.

### Objectives

Use the SparkContext to bootstrap RDDs of Tuples in order to use PairRDDFunctions. Use the PairRDDFunctions to more easily calculate values in the RDD.

### Consider Tuples

Scala's support for arbitrary pairs (and triplets, quadruplets, and so on) of data is reflected in Spark. If an RDD's elements are pairs (two-element Tuples), then, through Spark implicits, that RDD gets enriched with PairRDDFunctions, which provide convenient means of performing calculations on the pairs.
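
The enrichment mechanism can be sketched in plain Scala, without Spark. In the minimal sketch below, `PairEnrichment`, `PairSeqOps` and `valuesFor` are hypothetical names invented for illustration (they are not Spark API). The implicit class adds a method that is available only on sequences of pairs, much as Spark's implicit conversion makes PairRDDFunctions available only on RDDs of pairs.

```scala
// Hypothetical stand-in for Spark's enrichment (not Spark API):
// an implicit class that adds a method to any Seq whose elements are pairs.
object PairEnrichment {
  implicit class PairSeqOps[K, V](val self: Seq[(K, V)]) {
    // Collect the values of every pair whose key equals `key`.
    def valuesFor(key: K): Seq[V] = self.collect { case (k, v) if k == key => v }
  }
}
import PairEnrichment._

val samplePairs = Seq(("if", 1), ("and", 2), ("if", 3))

// A pair's elements are accessed positionally with _1 and _2.
val firstKey = samplePairs.head._1
val firstValue = samplePairs.head._2

// valuesFor is available here only because the element type is a pair.
val ifValues = samplePairs.valuesFor("if")
```

Calling `valuesFor` on a `Seq[String]` would not compile, which mirrors how `reduceByKey` is not available on an RDD of plain strings.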

### Read external file into RDD

You'll be using the file sherlock-holmes.txt. Retrieve it from the container folder (/home/jovyan/Resources), which is mounted to a host machine directory; see the instructions about setting up the Docker container to run Jupyter Notebook.

Read the text into an RDD.

Make sure that the file was read correctly by printing the RDD element count (here the number of lines in the file).

In [None]:
val file = "/home/jovyan/Resources/sherlock-holmes.txt"
val lines = sc.textFile(file)

print(lines.count)

### Get an RDD of Words

Our goal is to count how many ifs, ands & buts there are in the file, so if we could somehow get a Map of word counts keyed on the word, we'd be golden.

The first thing we need to do is to create an RDD that contains the words in the file. We already have the file loaded into an RDD called `lines` where each element is a line of text. Your next step is to split each line with the provided delimiters so that you end up with a flat collection of words in the file. Whenever we want to flatten a collection of collections, we can use flatMap to do so.

We also want to remove all empty words by using the `filter` function, which keeps only the words whose length is greater than zero. Note that the RDD transformations can be chained together, as you can see below.


In [None]:
val delimiters = Array(' ',',','.',':',';','?','!','-','@','#','(',')','*','_','/','"','"')

val words = lines.flatMap(_.split(delimiters)).filter(_.length > 0)

words.take(10).foreach(println)
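
To see why `flatMap` is used here rather than `map`, the following minimal sketch applies the same steps to a plain Scala `Seq` instead of an RDD. The sample lines and the shortened delimiter list (`sampleLines`, `seps`) are made up for illustration; the collection methods behave analogously to their RDD counterparts for this purpose.

```scala
// Illustrative sample data, not taken from the actual file.
val sampleLines = Seq("To Sherlock Holmes,", "she is always--the woman.")
val seps = Array(' ', ',', '.', '-')

// map keeps one collection of words per line: a collection of collections.
val nested = sampleLines.map(_.split(seps).toSeq)

// flatMap flattens the per-line results into a single collection of words.
// Consecutive delimiters (such as "--") produce empty strings, which filter removes.
val flat = sampleLines.flatMap(_.split(seps)).filter(_.length > 0)

flat.foreach(println)
```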

### We'll consider two equivalent ways to accomplish our goal

Both use PairRDDFunctions, which operate on data organized as key-value pairs. These functions typically aggregate the values for each key.

### <font color=blue>Use reduceByKey</font>

One of the most useful PairRDDFunctions is `reduceByKey`, which transforms a pair RDD into another pair RDD, where the values for each key are aggregated using the supplied reduce function.

In our case, the pair's key is the word, and the value is the count. The initial value for each word is 1. 

The reduce function is a simple addition as we'll be adding the counts of the same words. The values will be aggregated, finally resulting in the total count for each distinct word.

Let's transform our `words` RDD into an initial `pairs` RDD and, while we're at it, convert the words to lower case for more accurate counts.

Next we'll apply the `reduceByKey` function, yielding the total word counts.


In [None]:
val pairs = words.map(word => (word.toLowerCase, 1))

val counts = pairs.reduceByKey(_ + _)

counts.take(10).foreach(println)
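
Conceptually, `reduceByKey` groups the pairs by key and then folds each group's values with the reduce function. The sketch below reproduces that result on a plain Scala `Seq` with made-up sample data (`samplePairs` is an illustrative name). It ignores partitioning and shuffling, so it shows only what is computed, not how Spark computes it.

```scala
// Illustrative sample data: every word starts with a count of 1.
val samplePairs = Seq(("if", 1), ("and", 1), ("if", 1), ("but", 1), ("and", 1), ("if", 1))

// Local analogue of pairs.reduceByKey(_ + _):
// group the pairs by key, then reduce each group's values with addition.
val sampleCounts: Map[String, Int] =
  samplePairs
    .groupBy(_._1)
    .map { case (word, group) => word -> group.map(_._2).reduce(_ + _) }

sampleCounts.foreach(println)
```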

#### Convert RDD into Map

In [None]:
val map = counts.collectAsMap

val ifs  = map("if")
val ands = map("and")
val buts = map("but")

println(ifs)
println(ands)
println(buts)
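
Looking up a key with `map("word")` throws a `NoSuchElementException` if the word never occurs in the text. A minimal sketch of safer lookups follows, using a small hand-made Map (`sampleMap` is an illustrative stand-in for the result of `collectAsMap`):

```scala
// Illustrative stand-in for the Map returned by collectAsMap.
val sampleMap: scala.collection.Map[String, Int] = Map("if" -> 3, "and" -> 2)

// getOrElse supplies a default count for words that are absent.
val safeIfs = sampleMap.getOrElse("if", 0)
val safeButs = sampleMap.getOrElse("but", 0)

// get returns an Option, making the absence explicit.
val maybeButs: Option[Int] = sampleMap.get("but")

println(safeIfs)
println(safeButs)
println(maybeButs)
```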


### <font color=blue>Use countByKey</font>

It just so happens that PairRDDFunctions has a method called `countByKey`, which returns a Map of counts of the occurrence of each key among the pairs. 

This function also requires a pair RDD. However, since it counts the occurrences of the keys it doesn't really matter what the values are (even the value type is irrelevant). Therefore we'll be using None as opposed to 1 as the initial value of the pairs.

Note that, unlike with `reduceByKey`, there is no need to collect the RDD values into a Map, as `countByKey` returns a Map to the driver directly.

In [None]:
val pairs = words.map(word => (word.toLowerCase, None))

val counts = pairs.countByKey()

val ifs  = counts("if")
val ands = counts("and")
val buts = counts("but")

println(ifs)
println(ands)
println(buts)
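
The following minimal sketch shows, on a plain Scala `Seq` with made-up data (`keyedSample` is an illustrative name), what `countByKey` computes: the number of pairs per key, with the values ignored entirely. As with `countByKey`, the counts here are of type `Long`.

```scala
// Illustrative sample data: the values are irrelevant, so None is used.
val keyedSample = Seq(("if", None), ("and", None), ("if", None))

// Local analogue of pairs.countByKey(): count how many pairs share each key.
val occurrences: Map[String, Long] =
  keyedSample
    .groupBy(_._1)
    .map { case (key, group) => key -> group.size.toLong }

println(occurrences("if"))
println(occurrences("and"))
```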


### Other interesting things about Sherlock Holmes

#### Find the word with the highest count

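We'll use the `reduce` function with a function that, given two pair elements, returns the one with the higher count (note how we access the individual elements of a pair with `_1` and `_2`). This works whether `counts` is the RDD from `reduceByKey` or the Map from `countByKey`, since both provide `reduce`.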



In [None]:
val maxCount = counts.reduce((r, c) => (if (r._2 > c._2) r else c))
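
The same selection can be tried on a small local collection. The sketch below uses made-up counts (`countSample` is an illustrative name) and compares the `reduce` approach with `maxBy`, which is available on Scala collections and expresses the same intent more directly.

```scala
// Illustrative sample of (word, count) pairs.
val countSample = Seq(("the", 5), ("holmes", 2), ("and", 3))

// Keep whichever of the two pairs has the higher count.
val viaReduce = countSample.reduce((r, c) => if (r._2 > c._2) r else c)

// Equivalent for this data on a local collection: pick the pair with the largest count.
val viaMaxBy = countSample.maxBy(_._2)

println(viaReduce)
println(viaMaxBy)
```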

#### Find the longest word

We'll use the `reduce` function again; this time it returns, out of two pair elements, the one with the longer key.


In [None]:
val maxWord = counts.reduce((r, c) => (if (r._1.length > c._1.length) r else c))
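
A single `reduce` returns only one pair, even when several words share the maximum length. As a minimal sketch on a local collection with made-up data (`lengthSample` is an illustrative name), all of the longest words can be found by first computing the maximum length and then filtering on it:

```scala
// Illustrative sample of (word, count) pairs; two words tie for the longest.
val lengthSample = Seq(("holmes", 2), ("watson", 4), ("the", 9))

// First find the maximum key length...
val maxLen = lengthSample.map(_._1.length).max

// ...then keep every word of that length.
val longestWords = lengthSample.filter(_._1.length == maxLen).map(_._1)

longestWords.foreach(println)
```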

### Bonus Assignment

#### Find all anagrams among words

Can you write an algorithm that finds all the anagrams (a word, phrase, or name formed by rearranging the letters of another, such as `cinema`, formed from `iceman`) in the text?

## Your Solution

In [None]:
// TODO













### Conclusion

In this lab, we saw how an RDD of pairs gains, via an implicit conversion, additional methods from PairRDDFunctions that make calculating certain values easier.

#### Suggested Solution

The code from this discussion, written as one continuous block (without the commentary).

In [None]:
val file = "/home/jovyan/Resources/sherlock-holmes.txt"
val lines = sc.textFile(file)

val delimiters = Array(' ',',','.',':',';','?','!','-','@','#','(',')','*','_','/','"','"')
val words = lines.flatMap(_.split(delimiters)).filter(_.length > 0)

val pairs = words.map(word => (word.toLowerCase, 1))
val counts = pairs.reduceByKey(_ + _)
val map = counts.collectAsMap

val ifs  = map("if")
val ands = map("and")
val buts = map("but")

println(ifs)
println(ands)
println(buts)

### Solution for Bonus Assignment

In [None]:
val file = "/home/jovyan/Resources/sherlock-holmes.txt"
val lines = sc.textFile(file)

val delimiters = Array(' ',',','.',':',';','?','!','-','@','#','(',')','*','_','/','\'','"','"')

// We're interested in anagrams longer than 6 letters

val words = lines.flatMap(_.split(delimiters)).filter(_.length > 6).map(word => word.toLowerCase).distinct

// we're creating a pair RDD where the key is the original word sorted by letters 
// and the initial value is a singleton list with the word as an element
// any two words sharing the same sorted forms are anagrams

val pairs = words.map(word => (word.toList.sorted.mkString, List(word)))

// the reduce function concatenates lists of words (::: is the list concatenation operator)
// map drops the keys as they're no longer needed
// filter keeps only those lists that are longer than 1 (true anagrams)

val angrs = pairs.reduceByKey(_ ::: _).map(_._2).filter(_.size > 1)

angrs.collect.foreach(println(_))
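
The key idea of the solution, that two words are anagrams exactly when their sorted letters match, can be checked on a small local collection. This is a minimal sketch with made-up words; `sampleWords`, `signature` and `anagramGroups` are illustrative names.

```scala
// Illustrative sample words, already lower-cased and distinct.
val sampleWords = Seq("cinema", "iceman", "holmes", "anemic", "watson")

// The signature of a word is its letters in sorted order.
val signature = (word: String) => word.toList.sorted.mkString

// Group words by signature and keep only groups with more than one member.
val anagramGroups = sampleWords
  .groupBy(signature)
  .values
  .filter(_.size > 1)
  .toList

anagramGroups.foreach(println)
```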
