# Chapter 6: Advanced Spark Programming (Scala)

In this Notebook, we will review the following advanced concepts of Spark:

    * Accumulators
    * Broadcast Variables
    * Partition-Based Functions
    * Numeric RDD Operations

## Accumulators

Accumulators are useful for counters that are updated by tasks running on different partitions and read back on the driver. In the following example, we count the total number of occurrences of the number `5`.

In [1]:
val acc5 = sc.accumulator(0)

acc5 = 0




0

In [2]:
val numRdd = sc.parallelize(List(1,2,5,5,3,6,9,6,5,9))

numRdd = ParallelCollectionRDD[0] at parallelize at <console>:27


ParallelCollectionRDD[0] at parallelize at <console>:27

In [3]:
numRdd.glom().collect()

[[1, 2], [5, 5, 3], [6, 9], [6, 5, 9]]

In [4]:
/**
Increments the accumulator for number 5
**/

def count5(num: Int): Unit = {
    if(num == 5) {
        acc5 += 1
    }
}

count5: (num: Int)Unit


In [5]:
val numRdd = sc.parallelize(List(1,2,5,5,3,6,9,6,5,9))
numRdd.map(count5).collect()

numRdd = ParallelCollectionRDD[2] at parallelize at <console>:33


[(), (), (), (), (), (), (), (), (), ()]

In [6]:
acc5

3
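`sc.accumulator` has been deprecated since Spark 2.0 in favor of the `AccumulatorV2` API. In addition, Spark only guarantees that each task's update is applied exactly once when the update happens inside an action; updates made inside a transformation such as `map()` may be applied more than once if a task is re-executed. The following is a minimal sketch of the same count using `longAccumulator` and `foreach()`. The names `ctx`, `fives` and `nums` are hypothetical, and the local master setting is an assumption that is ignored when a context already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context if there is one; otherwise start a local one (assumption)
val ctx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("accumulator-sketch"))

// Named Long counter based on the AccumulatorV2 API
val fives = ctx.longAccumulator("fives")
val nums = ctx.parallelize(List(1, 2, 5, 5, 3, 6, 9, 6, 5, 9))

// foreach is an action, so each task's updates are applied exactly once
nums.foreach(n => if (n == 5) fives.add(1))

// The accumulated value is read on the driver
val totalFives = fives.value
```

Using `foreach()` also avoids building the RDD of `()` values that `map(count5).collect()` returns.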

## Broadcast Variables

Broadcast variables are useful when we want to use the same small, read-only object in all partitions. A common example is a HashMap: if the HashMap is going to be used in all the executors to map the values of some keys, a broadcast variable is very convenient.

Let's create a numeric key-value RDD, where each value is the square of the key.

In [7]:
val rddValue = sc.parallelize(List(1,2,3,4,5))
val rddKeyValue = rddValue.map(x => (x, x*x))

rddValue = ParallelCollectionRDD[4] at parallelize at <console>:27
rddKeyValue = MapPartitionsRDD[5] at map at <console>:28


MapPartitionsRDD[5] at map at <console>:28

We can now look up the value of the key 5.

In [8]:
rddKeyValue.collect()

[(1,1), (2,4), (3,9), (4,16), (5,25)]

In [9]:
rddKeyValue.lookup(5)

WrappedArray(25)

Now we create a numeric RDD of keys that stands in for a big dataset; its values range from 1 to 5.

In [10]:
val rddBigKeys = sc.parallelize(List(1,4,3,5,2,3,1,5,3,2,1))

rddBigKeys = ParallelCollectionRDD[8] at parallelize at <console>:27


ParallelCollectionRDD[8] at parallelize at <console>:27

Next, we find the value that corresponds to each of these keys. We first convert the previous key-value RDD to a `Map` (using the function `collectAsMap()`) and then broadcast this variable.

In [11]:
val dictMap = rddKeyValue.collectAsMap()

dictMap = Map(2 -> 4, 5 -> 25, 4 -> 16, 1 -> 1, 3 -> 9)


Map(2 -> 4, 5 -> 25, 4 -> 16, 1 -> 1, 3 -> 9)

In [12]:
dictMap.get(5)

Some(25)

In [13]:
val mapBroad = sc.broadcast(dictMap)

mapBroad = Broadcast(5)


Broadcast(5)

In [14]:
mapBroad.value.get(5)

Some(25)

In [15]:
val rddBigValues = rddBigKeys.map(x => (x, mapBroad.value.get(x)))

rddBigValues = MapPartitionsRDD[9] at map at <console>:36


MapPartitionsRDD[9] at map at <console>:36

In [16]:
rddBigValues.collect()

[(1,Some(1)), (4,Some(16)), (3,Some(9)), (5,Some(25)), (2,Some(4)), (3,Some(9)), (1,Some(1)), (5,Some(25)), (3,Some(9)), (2,Some(4)), (1,Some(1))]
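The values above are wrapped in `Some(...)` because `get()` on a `Map` returns an `Option`. The sketch below, written in plain Scala against a local map that stands in for `mapBroad.value`, shows three ways of handling the lookup. The names `squares`, `keys`, `wrapped`, `unwrapped` and `onlyKnown`, the extra key 7 and the default value -1 are assumptions made for illustration.

```scala
// Local stand-in for mapBroad.value (the same key -> square pairs)
val squares: Map[Int, Int] = Map(1 -> 1, 2 -> 4, 3 -> 9, 4 -> 16, 5 -> 25)
val keys = List(1, 4, 3, 7) // 7 is deliberately missing from the map

// get wraps each hit in Some and each miss in None
val wrapped = keys.map(k => (k, squares.get(k)))

// getOrElse unwraps the value and falls back to a default (-1 here) on a miss
val unwrapped = keys.map(k => (k, squares.getOrElse(k, -1)))

// flatMap over the Option drops the missing keys entirely
val onlyKnown = keys.flatMap(k => squares.get(k).map(v => (k, v)))
```

The same expressions can be used inside `rddBigKeys.map(...)` or `rddBigKeys.flatMap(...)` with `mapBroad.value` in place of `squares`.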

## Partition-Based Functions

Here we will see some special functions that work directly on the partitions of an RDD:

    * mapPartitions()
    * mapPartitionsWithIndex()
    * foreachPartition()

`mapPartitions()`: applies a function once per partition of an RDD. Signature: Input --> Iterator over the elements of the partition, Output --> Iterator

Let's use our numeric RDD with several partitions.

In [17]:
numRdd.glom().collect()

[[1, 2], [5, 5, 3], [6, 9], [6, 5, 9]]

In [18]:
numRdd.countByValue()

Map(5 -> 3, 1 -> 1, 6 -> 2, 9 -> 2, 2 -> 1, 3 -> 1)



We want to calculate the average of the numbers in each partition. For that, we create a function called `averagePartition()` and use the function `mapPartitions()`.

In [19]:
/**
Calculates the average of the numbers in one partition

@param nums: iterator over the numbers of the partition
@return: iterator with a single element, the average of those numbers
**/

def averagePartition(nums: Iterator[Int]): Iterator[Float] = {
    
    var sum = 0
    var count = 0
    
    for(num <- nums){
        sum = sum + num
        count = count + 1
    }
    
    List(sum.toFloat/count).iterator
}

averagePartition: (nums: Iterator[Int])Iterator[Float]


In [20]:
numRdd.mapPartitions(averagePartition).glom().collect()

[[1.5], [4.3333335], [7.5], [6.6666665]]
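If a partition is empty, `averagePartition()` divides `0.0` by `0` and returns `NaN`. The sketch below is one possible variant that returns an empty iterator in that case. It is plain Scala, so the partitions are simulated with the lists printed by `glom()` plus one empty list; the names `safeAveragePartition`, `simulatedPartitions` and `averages` are hypothetical.

```scala
// Hypothetical variant of averagePartition that skips empty partitions
def safeAveragePartition(nums: Iterator[Int]): Iterator[Float] = {
  // Single pass over the iterator, tracking (sum, count)
  val (sum, count) = nums.foldLeft((0, 0)) { case ((s, c), n) => (s + n, c + 1) }
  if (count == 0) Iterator.empty else Iterator(sum.toFloat / count)
}

// The partitions printed by glom(), plus one empty partition
val simulatedPartitions =
  List(List(1, 2), List(5, 5, 3), List(6, 9), List(6, 5, 9), List.empty[Int])

// Each non-empty partition contributes one average; the empty one contributes nothing
val averages = simulatedPartitions.flatMap(p => safeAveragePartition(p.iterator))
```

The function has the same signature as `averagePartition()`, so it can be passed to `numRdd.mapPartitions(...)` in the same way.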

Now we want to do the same, but for each output number, we want to indicate its original partition. To do that, we create another function called `averagePartitionIndex()` and use the function `mapPartitionsWithIndex()`.

In [21]:
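/**
Calculates the average of the numbers in one partition, together with the partition index

@param index: index of the partition
@param nums: iterator over the numbers of the partition
@return: iterator with two elements, the partition index (as a Float) and the average
**/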

def averagePartitionIndex(index:Int, nums: Iterator[Int]): Iterator[Float] = {
    
    var sum = 0
    var count = 0
    
    for(num <- nums){
        sum = sum + num
        count = count + 1
    }
    
    List(index, sum.toFloat/count).iterator
}

averagePartitionIndex: (index: Int, nums: Iterator[Int])Iterator[Float]


In [22]:
numRdd.mapPartitionsWithIndex(averagePartitionIndex).glom().collect()

[[0.0, 1.5], [1.0, 4.3333335], [2.0, 7.5], [3.0, 6.6666665]]
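Because `averagePartitionIndex()` returns an `Iterator[Float]`, the partition index is widened to a `Float` (`0.0`, `1.0`, ...) and emitted as a separate element. One alternative, sketched below under the assumption that a pair per partition is the desired output, is to return an `Iterator[(Int, Float)]`. The code is plain Scala and simulates the partitions with the lists printed by `glom()`; the names `averageWithIndex`, `parts` and `indexed` are hypothetical.

```scala
// Hypothetical variant that returns one (partitionIndex, average) pair per partition
def averageWithIndex(index: Int, nums: Iterator[Int]): Iterator[(Int, Float)] = {
  val values = nums.toList // materializes the partition; acceptable for a small example
  if (values.isEmpty) Iterator.empty
  else Iterator((index, values.sum.toFloat / values.size))
}

// The partitions printed by glom(), paired with their position
val parts = List(List(1, 2), List(5, 5, 3), List(6, 9), List(6, 5, 9))
val indexed = parts.zipWithIndex.flatMap { case (p, i) => averageWithIndex(i, p.iterator) }
```

The function takes the same two arguments as `averagePartitionIndex()`, so it can be passed to `numRdd.mapPartitionsWithIndex(...)` directly.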

Another interesting function is `foreachPartition()`. It is useful for operations that should run once per partition, such as establishing a connection to an external database. Let's do an easy example using the `numRdd`.

In [23]:
import sys.process._
import java.io._

In [24]:
"rm -f ../data/scala_logger.txt".!
"touch ../data/scala_logger.txt".!

numRdd.foreachPartition(partition => {
    val fw = new FileWriter("../data/scala_logger.txt", true)
    fw.write("Connecting to Database \n")
    fw.close()
})

In [25]:
"cat ../data/scala_logger.txt".!

Connecting to Database 
Connecting to Database 
Connecting to Database 
Connecting to Database 


0
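The log contains one line per partition, not one per element, which is the point of `foreachPartition()`: the setup cost is paid once and then reused for every element of the partition. The sketch below illustrates that pattern in plain Scala, with a `StringWriter` standing in for the database connection and two hard-coded lists standing in for partitions. The names `writePartition`, `out` and `logLines`, and the `insert` messages, are hypothetical.

```scala
import java.io.StringWriter

// Hypothetical per-partition handler: set up once, reuse for every element
def writePartition(partition: Iterator[Int], out: StringWriter): Unit = {
  out.write("Connecting to Database\n")             // one-time setup for the partition
  partition.foreach(n => out.write(s"insert $n\n")) // reuses the same "connection"
}

val out = new StringWriter()

// Two simulated partitions with five elements in total
List(List(1, 2), List(5, 5, 3)).foreach(p => writePartition(p.iterator, out))

val logLines = out.toString.split("\n").toList
```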

## Numeric RDD Operations

Finally, we are going to explore some built-in numerical operations already included in the RDD API. In particular, we are going to explore the following methods:

    * count()
    * mean()
    * sum()
    * max()
    * min()
    * variance()
    * sampleVariance()
    * stdev()
    * sampleStdev()
    * stats()

`count()`: count the number of elements in an RDD

In [26]:
numRdd.count()

10

`sum()`: sum of the elements of an RDD

In [27]:
numRdd.sum()

51.0

`mean()`: mean of the elements of an RDD, that is, `sum()` divided by `count()` (here 51.0 / 10 = 5.1)

`max()`: maximum value of the elements of an RDD

In [28]:
numRdd.max()

9

`min()`: minimum value of the elements of an RDD

In [29]:
numRdd.min()

1

`variance()`: population variance of the elements of an RDD (divides by N)

In [30]:
numRdd.variance()

6.29

`sampleVariance()`: sample variance of the elements of an RDD (divides by N - 1 instead of N to correct for bias)

In [31]:
numRdd.sampleVariance()

6.988888888888889

`stdev()`: population standard deviation of the elements of an RDD (square root of `variance()`)

In [32]:
numRdd.stdev()

2.507987240796891

`sampleStdev()`: sample standard deviation of the elements of an RDD (square root of `sampleVariance()`)

In [33]:
numRdd.sampleStdev()

2.64365067451978
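The difference between the population and the sample versions is only the divisor. As a cross-check, the four values above can be reproduced in plain Scala from the same ten numbers. This is a sketch for illustration, not how Spark computes them internally; the names `xs`, `n`, `avg`, `squaredDevs`, `popVariance`, `smpVariance`, `popStdev` and `smpStdev` are hypothetical.

```scala
// The same ten numbers as numRdd, as Doubles
val xs = List(1, 2, 5, 5, 3, 6, 9, 6, 5, 9).map(_.toDouble)
val n = xs.size
val avg = xs.sum / n

// Sum of squared deviations from the mean
val squaredDevs = xs.map(x => math.pow(x - avg, 2)).sum

val popVariance = squaredDevs / n       // divides by N, like variance()
val smpVariance = squaredDevs / (n - 1) // divides by N - 1, like sampleVariance()
val popStdev = math.sqrt(popVariance)   // like stdev()
val smpStdev = math.sqrt(smpVariance)   // like sampleStdev()
```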

`stats()`: main statistics (count, mean, stdev, max and min) of a numeric RDD

In [34]:
numRdd.stats()

(count: 10, mean: 5.100000, stdev: 2.507987, max: 9.000000, min: 1.000000)