# Chapter 6: Advanced Spark Programming (Python)

In this notebook, we review the following advanced Spark concepts:

    * Accumulators
    * Broadcast Variables
    * Partition-Based Functions
    * Numeric RDD Operations

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Advanced-Spark-Programming").master("local[*]").getOrCreate()
sc = spark.sparkContext

## Accumulators

Accumulators are shared variables that tasks can only add to and whose value is read on the driver, which makes them useful as counters shared between different partitions. In the following example, we count the total number of occurrences of the number `5`.

In [2]:
acc_5 = sc.accumulator(0)

In [3]:
num_rdd = sc.parallelize([1,2,5,5,3,6,9,6,5,9])

In [4]:
num_rdd.glom().collect()

[[1, 2], [5, 5], [3, 6], [9, 6, 5, 9]]

In [5]:
def count_5(num):
    """
    Increments the accumulator for number 5
    """
    global acc_5
    if num == 5:
        acc_5 += 1

In [6]:
num_rdd.map(count_5).collect()

[None, None, None, None, None, None, None, None, None, None]

In [7]:
acc_5

Accumulator<id=0, value=3>
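One way to picture the result above: each task adds to its own local copy of the accumulator, and the driver merges the per-task updates. The following plain-Python sketch models that idea with the partition layout shown by `glom()`. It does not use Spark, and the names `count_fives`, `partial_counts` and `merged` are illustrative assumptions, not part of the Spark API.

```python
# Plain-Python model of an accumulator (no Spark involved).
# The partition layout is copied from the glom() output above.
partitions = [[1, 2], [5, 5], [3, 6], [9, 6, 5, 9]]

def count_fives(partition):
    """Local count that a single task would add to its accumulator copy."""
    return sum(1 for num in partition if num == 5)

# Each "task" produces a partial count for its own partition...
partial_counts = [count_fives(p) for p in partitions]

# ...and the "driver" merges the partial counts with addition.
merged = sum(partial_counts)

print(partial_counts)  # [0, 2, 0, 1]
print(merged)          # 3
```

Note that Spark only guarantees that each task's update is applied exactly once when the accumulator is updated inside an action such as `foreach()`. Updates made inside a transformation such as `map()` may be applied again if a task is re-executed.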

## Broadcast Variables

Broadcast variables are useful when the same read-only object is needed in all partitions and the object is small enough to fit in the memory of each executor. A very common case is a dictionary: if every executor is going to use it to map keys to values, a broadcast variable is very convenient.

In [8]:
from random import randint

Let's create a numeric key-value RDD, where each value is the square of its key.

In [9]:
rdd_key = sc.parallelize([1,2,3,4,5])
rdd_key_value = rdd_key.map(lambda x: (x, x**2))

We can now look up the value of the key 5.

In [10]:
rdd_key_value.lookup(5)

[25]

Now we are going to create a larger numeric RDD, whose values range from 1 to 5.

In [11]:
rdd_big_keys = sc.parallelize([randint(1,5) for _ in range(20)])

Next, we will find the value that corresponds to each of these keys using the previous key-value RDD: we convert it to a dictionary (using the function `collectAsMap()`) and then broadcast that dictionary.

In [12]:
dict_map = rdd_key_value.collectAsMap()

In [13]:
dict_broad = sc.broadcast(dict_map)

In [14]:
rdd_big_values = rdd_big_keys.map(lambda key: (key, dict_broad.value.get(key)))

In [15]:
rdd_big_values.collect()

[(4, 16),
 (4, 16),
 (2, 4),
 (1, 1),
 (5, 25),
 (4, 16),
 (5, 25),
 (5, 25),
 (4, 16),
 (1, 1),
 (3, 9),
 (2, 4),
 (3, 9),
 (4, 16),
 (1, 1),
 (3, 9),
 (1, 1),
 (3, 9),
 (4, 16),
 (1, 1)]
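The lookup logic can be checked locally before running it on a cluster. The sketch below is plain Python: `SimpleNamespace` is only a stand-in that mimics the `.value` attribute of a broadcast variable, and `lookup_square` and `fake_broadcast` are hypothetical names introduced for illustration. It also shows what happens for a key that is missing from the dictionary.

```python
from types import SimpleNamespace

# Stand-in for the object returned by sc.broadcast(): only `.value` is modelled.
squares = {key: key ** 2 for key in range(1, 6)}
fake_broadcast = SimpleNamespace(value=squares)

def lookup_square(key, broadcast, default=None):
    """Return (key, value), reading the table through `.value` at call time."""
    return (key, broadcast.value.get(key, default))

keys = [4, 2, 5, 7]  # 7 is deliberately missing from the table
pairs = [lookup_square(k, fake_broadcast) for k in keys]

print(pairs)  # [(4, 16), (2, 4), (5, 25), (7, None)]
```

With Spark, the same function could be applied as `rdd_big_keys.map(lambda key: lookup_square(key, dict_broad))`. Reading `.value` inside the function, rather than on the driver, is what lets the executors use the broadcast copy.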

## Partition-Based Functions

Here we will see some special functions that work directly on the partitions of an RDD:

    * mapPartitions()
    * mapPartitionsWithIndex()
    * foreachPartition()

`mapPartitions()`: applies a function once to each partition of an RDD. Signature of the function: Input --> Iterator over the elements of the partition, Output --> Iterator

Let's use our numeric RDD with several partitions.

In [16]:
num_rdd.glom().collect()

[[1, 2], [5, 5], [3, 6], [9, 6, 5, 9]]

In [17]:
num_rdd.countByValue()

defaultdict(int, {1: 1, 2: 1, 5: 3, 3: 1, 6: 2, 9: 2})

We want to calculate the average of the numbers in each partition. For that, we create a function called `average_partition()` and pass it to `mapPartitions()`.

In [18]:
def average_partition(nums):
    """
    Calculates the average of the numbers in a partition
    
    :input nums: iterator over the numbers of the partition
    :return: iterator with a single element, the average (float)
    """
    
    sum_count = [0, 0]
    for num in nums:
        sum_count[0] += num
        sum_count[1] += 1
        
    return iter([sum_count[0] / sum_count[1]])

In [19]:
num_rdd.mapPartitions(average_partition).glom().collect()

[[1.5], [5.0], [4.5], [7.25]]
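Because a partition function is just a Python function from an iterator to an iterator, it can be tested without Spark by feeding it plain iterators. The sketch below does that with a hypothetical variant, `safe_average_partition`, which is not part of the original notebook: it is written as a generator and yields nothing for an empty partition, where the version above would raise a `ZeroDivisionError`.

```python
def safe_average_partition(nums):
    """Yield the average of an iterator of numbers, or nothing if it is empty."""
    total = 0
    count = 0
    for num in nums:
        total += num
        count += 1
    if count:  # skip empty partitions instead of dividing by zero
        yield total / count

# Feed the function plain iterators, one per simulated partition.
partitions = [[1, 2], [5, 5], [3, 6], [9, 6, 5, 9], []]
averages = [list(safe_average_partition(iter(p))) for p in partitions]

print(averages)  # [[1.5], [5.0], [4.5], [7.25], []]
```

Empty partitions can appear, for example, when an RDD has more partitions than elements, so the guard is a reasonable precaution for real data.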

In [20]:
num_rdd.glom().collect()

[[1, 2], [5, 5], [3, 6], [9, 6, 5, 9]]

Now we want to do the same, but for each output number we also want to indicate its original partition. To do that, we create another function called `average_partition_index()` and pass it to `mapPartitionsWithIndex()`.

In [21]:
def average_partition_index(index, nums):
    """
    Calculates the average of the numbers in a partition, together
    with the index of the original partition
    
    :input index: index of the current partition
    :input nums: iterator over the numbers of the partition
    :return: iterator with a single tuple that contains the index of
    the original partition and the average of its numbers
    """
    
    sum_count = [0, 0]
    for num in nums:
        sum_count[0] += num
        sum_count[1] += 1
        
    # Wrap the tuple in a list so that the iterator yields one (index, average)
    # pair instead of yielding the index and the average as separate elements
    return iter([(index, sum_count[0] / sum_count[1])])

In [22]:
num_rdd.mapPartitionsWithIndex(average_partition_index).glom().collect()

[[(0, 1.5)], [(1, 5.0)], [(2, 4.5)], [(3, 7.25)]]

Another interesting function is `foreachPartition()`. It is useful for operations that should run once per partition rather than once per element, for example establishing a connection to an external database. Let's do an easy example using `num_rdd`.

In [23]:
import os
if os.path.exists("../data/python_logger.txt"):
    os.remove("../data/python_logger.txt")

def connect_data_base(partition):
    """
    Function that fakes the connection to a database. For each 
    new connection, the phrase "Connecting to Database" is written
    to the file "../data/python_logger.txt"
    
    """
    with open("../data/python_logger.txt", "a") as text_file:
        text_file.write("Connecting to Database\n")
        
num_rdd.foreachPartition(connect_data_base)

As we can see now, the phrase "Connecting to Database" has been written 4 times, once for each partition.

In [24]:
!cat "../data/python_logger.txt"

Connecting to Database
Connecting to Database
Connecting to Database
Connecting to Database
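The usual reason for opening the connection per partition is to reuse it for every record in that partition. The following plain-Python sketch illustrates the pattern without Spark. `FakeConnection` and `save_partition` are hypothetical names, and the connection class only records the calls it receives.

```python
class FakeConnection:
    """Hypothetical stand-in for a database client; it only records calls."""

    def __init__(self, log):
        self.log = log
        self.log.append("open")

    def send(self, record):
        self.log.append(f"send {record}")

    def close(self):
        self.log.append("close")


def save_partition(records, log):
    """Open one connection, send every record of the partition, then close."""
    conn = FakeConnection(log)
    try:
        for record in records:
            conn.send(record)
    finally:
        conn.close()  # released even if a send fails


log = []
for partition in [[1, 2], [5, 5]]:
    save_partition(iter(partition), log)

print(log)
# ['open', 'send 1', 'send 2', 'close', 'open', 'send 5', 'send 5', 'close']
```

In Spark, the function passed to `foreachPartition()` takes only the iterator, and it runs in the worker processes, so a driver-side list like `log` would not be updated. That is why the notebook example writes to a file instead.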


## Numeric RDD Operations

Finally, we are going to explore some built-in numerical operations already included in the RDD API. In particular, we are going to explore the following methods:

    * count()
    * mean()
    * sum()
    * max()
    * min()
    * variance()
    * sampleVariance()
    * stdev()
    * sampleStdev()
    * stats()

`count()`: count the number of elements in an RDD

In [25]:
num_rdd.count()

10

`mean()`: mean of the elements of an RDD

In [26]:
num_rdd.mean()

5.1

`sum()`: sum of the elements of an RDD

In [27]:
num_rdd.sum()

51

`max()`: maximum value of the elements of an RDD

In [28]:
num_rdd.max()

9

`min()`: minimum value of the elements of an RDD

In [29]:
num_rdd.min()

1

`variance()`: population variance of the elements of an RDD (divides by N)

In [30]:
num_rdd.variance()

6.290000000000001

`sampleVariance()`: sample variance of the elements of an RDD (divides by N-1 instead of N, which corrects the bias of the estimate)

In [31]:
num_rdd.sampleVariance()

6.98888888888889

`stdev()`: population standard deviation of the elements of an RDD (square root of `variance()`)

In [32]:
num_rdd.stdev()

2.5079872407968908

`sampleStdev()`: sample standard deviation of the elements of an RDD (square root of `sampleVariance()`)

In [33]:
num_rdd.sampleStdev()

2.6436506745197805

`stats()`: main statistics (count, mean, stdev, max and min) of a numeric RDD, computed in a single pass

In [34]:
num_rdd.stats()

(count: 10, mean: 5.1, stdev: 2.5079872408, max: 9.0, min: 1.0)
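As a cross-check of the difference between the population and sample versions, the same numbers can be fed to Python's standard `statistics` module. This is only a local sanity check on the ten values of `num_rdd`, assuming they are small enough to hold in a list. It does not use Spark.

```python
import statistics

# Same values as num_rdd, held in a plain list.
nums = [1, 2, 5, 5, 3, 6, 9, 6, 5, 9]

population_variance = statistics.pvariance(nums)  # divides by N
sample_variance = statistics.variance(nums)       # divides by N - 1
population_stdev = statistics.pstdev(nums)        # sqrt of population variance
sample_stdev = statistics.stdev(nums)             # sqrt of sample variance

print(len(nums), sum(nums), statistics.mean(nums))               # 10 51 5.1
print(round(population_variance, 4), round(sample_variance, 4))  # 6.29 6.9889
print(round(population_stdev, 4), round(sample_stdev, 4))        # 2.508 2.6437
```

The population figures match `variance()` and `stdev()` above, and the sample figures match `sampleVariance()` and `sampleStdev()`.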