# Chapter 5: Effective Transformations

In this chapter, we will explore advanced features of RDDs in order to perform effective transformations.

## Minimizing Object Creation

Minimizing the number of objects that our program creates is an effective way to optimize a computation, because fewer allocations mean less work for the garbage collector. We will go through some examples that implement the same functionality but create a different number of objects during execution. As initial data, we use the report cards that different instructors have written for different pandas.

In [1]:
val recordCars = sc.parallelize(Array(("instructor1", "this is a happy panda"),
                                      ("instructor1", "this is a very happy panda"),
                                      ("instructor2", "good"),
                                      ("instructor2", "happy")))

recordCars = ParallelCollectionRDD[0] at parallelize at <console>:27


ParallelCollectionRDD[0] at parallelize at <console>:27

Over that data, we want to calculate, for each instructor: the length of the longest word, the number of mentions of "happy", and the average number of words per report. To do so, we are going to use different aggregator classes, which differ in the number of objects they create.

Option 1: Generating new objects without reuse:

In [2]:
// case class ReportCardMetrics

defined class ReportCardMetrics


In [3]:
// class MetricsCalculator

defined class MetricsCalculator


In [4]:
// recordCars

[(instructor1,ReportCardMetrics(5,2,5.5)), (instructor2,ReportCardMetrics(5,1,1.0))]
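
The cells above leave `ReportCardMetrics` and `MetricsCalculator` as stubs. A minimal sketch of what they might look like follows. The field names, the four running totals, and the method names `sequenceOp` and `compOp` are assumptions made for illustration, not the notebook's actual definitions. Every step returns a brand-new `MetricsCalculator`, which is the allocation pattern the later options try to avoid.

```scala
// Result type: longest word length, "happy" mentions, average words per report.
// Field names are assumed; the notebook output only shows the three values.
case class ReportCardMetrics(longestWord: Int, happyMentions: Int, averageWords: Double)

// Immutable aggregator (hypothetical sketch): every step allocates a new instance.
class MetricsCalculator(
    val totalWords: Int,
    val longestWord: Int,
    val happyMentions: Int,
    val numberReportCards: Int) extends Serializable {

  // Fold one report card text into the running totals.
  def sequenceOp(reportCardContent: String): MetricsCalculator = {
    val words = reportCardContent.split(" ")
    new MetricsCalculator(
      totalWords + words.length,
      words.foldLeft(longestWord)((longest, w) => math.max(longest, w.length)),
      happyMentions + words.count(_.equalsIgnoreCase("happy")),
      numberReportCards + 1)
  }

  // Merge the partial results computed on two partitions.
  def compOp(other: MetricsCalculator): MetricsCalculator =
    new MetricsCalculator(
      totalWords + other.totalWords,
      math.max(longestWord, other.longestWord),
      happyMentions + other.happyMentions,
      numberReportCards + other.numberReportCards)

  def toReportCardMetrics: ReportCardMetrics =
    ReportCardMetrics(longestWord, happyMentions, totalWords.toDouble / numberReportCards)
}

// Local check on instructor1's two reports, without Spark.
val instructor1Reports = Seq("this is a happy panda", "this is a very happy panda")
val instructor1Metrics = instructor1Reports
  .foldLeft(new MetricsCalculator(0, 0, 0, 0))((acc, text) => acc.sequenceOp(text))
  .toReportCardMetrics
```

On the notebook's RDD, a class of this shape could be plugged in with `recordCars.aggregateByKey(new MetricsCalculator(0, 0, 0, 0))((acc, text) => acc.sequenceOp(text), (x, y) => x.compOp(y)).mapValues(_.toReportCardMetrics)`.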

Option 2: Reusing objects:

In [5]:
// class MetricsCalculatorReuseObjects

defined class MetricsCalculatorReuseObjects


In [6]:
// recordCars.aggregateByKey

[(instructor1,ReportCardMetrics(5,2,5.5)), (instructor2,ReportCardMetrics(5,1,1.0))]
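
The object-reusing variant is also stubbed out in the notebook. One possible shape, sketched here under the same naming assumptions as before, keeps the totals in mutable fields and returns `this` from each step, so no new aggregator is allocated per record. This relies on the documented behavior of `aggregateByKey`, whose functions are allowed to modify and return their first argument.

```scala
// Same assumed result type as in the first option.
case class ReportCardMetrics(longestWord: Int, happyMentions: Int, averageWords: Double)

// Mutable aggregator (hypothetical sketch): updates itself in place and returns `this`.
class MetricsCalculatorReuseObjects(
    var totalWords: Int,
    var longestWord: Int,
    var happyMentions: Int,
    var numberReportCards: Int) extends Serializable {

  // Fold one report card text into this instance.
  def sequenceOp(reportCardContent: String): this.type = {
    val words = reportCardContent.split(" ")
    totalWords += words.length
    longestWord = words.foldLeft(longestWord)((longest, w) => math.max(longest, w.length))
    happyMentions += words.count(_.equalsIgnoreCase("happy"))
    numberReportCards += 1
    this
  }

  // Merge another partial result into this instance.
  def compOp(other: MetricsCalculatorReuseObjects): this.type = {
    totalWords += other.totalWords
    longestWord = math.max(longestWord, other.longestWord)
    happyMentions += other.happyMentions
    numberReportCards += other.numberReportCards
    this
  }

  def toReportCardMetrics: ReportCardMetrics =
    ReportCardMetrics(longestWord, happyMentions, totalWords.toDouble / numberReportCards)
}

// Local check on instructor2's two reports, without Spark.
val zero = new MetricsCalculatorReuseObjects(0, 0, 0, 0)
val folded = Seq("good", "happy").foldLeft(zero)((acc, text) => acc.sequenceOp(text))
```

After the fold, `folded` and `zero` are the same instance, which is the point of the technique: the intermediate results never leave the original object.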

Option 3: Using arrays:

In [7]:
// class MetricsCalculatorArrays

defined class MetricsCalculatorArrays


In [8]:
// recordCars

Name: Compile Error
Message: <console>:31: error: value seqenceOp is not a member of Array[Int]
                                seqOp = ((reportCardMetrics, reportCardText) => reportCardMetrics.seqenceOp(reportCardText)),
                                                                                                  ^
<console>:32: error: value compOp is not a member of Array[Int]
                                combOp = ((x, y) => x.compOp(y))).map(x => (x._1, x._2.toReportCardMetrics)).collect()
                                                      ^

StackTrace: 
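
The compile error above arises because `seqenceOp` and `compOp` are called as methods on the accumulator, which is a plain `Array[Int]` and has no such members. One way around this, sketched below, is to keep the logic in a standalone helper and pass the array in as an argument. The helper name `ArrayMetrics` and the index constants are hypothetical, and this is not necessarily how the notebook's `MetricsCalculatorArrays` was meant to work.

```scala
// Same assumed result type as in the previous options.
case class ReportCardMetrics(longestWord: Int, happyMentions: Int, averageWords: Double)

// Hypothetical helper: the accumulator is a bare Array[Int] with four slots.
object ArrayMetrics extends Serializable {
  val totalWordsIndex = 0
  val longestWordIndex = 1
  val happyMentionsIndex = 2
  val numberReportCardsIndex = 3

  // Fold one report card text into the array, in place.
  def sequenceOp(metrics: Array[Int], reportCardContent: String): Array[Int] = {
    val words = reportCardContent.split(" ")
    metrics(totalWordsIndex) += words.length
    metrics(longestWordIndex) =
      words.foldLeft(metrics(longestWordIndex))((longest, w) => math.max(longest, w.length))
    metrics(happyMentionsIndex) += words.count(_.equalsIgnoreCase("happy"))
    metrics(numberReportCardsIndex) += 1
    metrics
  }

  // Merge the second array into the first, in place.
  def compOp(x: Array[Int], y: Array[Int]): Array[Int] = {
    x(totalWordsIndex) += y(totalWordsIndex)
    x(longestWordIndex) = math.max(x(longestWordIndex), y(longestWordIndex))
    x(happyMentionsIndex) += y(happyMentionsIndex)
    x(numberReportCardsIndex) += y(numberReportCardsIndex)
    x
  }

  def toReportCardMetrics(metrics: Array[Int]): ReportCardMetrics =
    ReportCardMetrics(
      metrics(longestWordIndex),
      metrics(happyMentionsIndex),
      metrics(totalWordsIndex).toDouble / metrics(numberReportCardsIndex))
}

// Local check: aggregate instructor1's reports as if they sat on two partitions.
val left = ArrayMetrics.sequenceOp(Array(0, 0, 0, 0), "this is a happy panda")
val right = ArrayMetrics.sequenceOp(Array(0, 0, 0, 0), "this is a very happy panda")
val merged = ArrayMetrics.toReportCardMetrics(ArrayMetrics.compOp(left, right))
```

With such a helper, the `aggregateByKey` call would use `seqOp = (acc, text) => ArrayMetrics.sequenceOp(acc, text)` and `combOp = (x, y) => ArrayMetrics.compOp(x, y)` with a zero value of `Array(0, 0, 0, 0)`.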

## Set Operations

In this section, we include some examples of set operations that can be done in Spark, highlighting some peculiarities.

Subtract example:

In [9]:
val rddA = sc.parallelize(Array(1,2,3,4,4,4,4))
val rddB = sc.parallelize(Array(3,4))
val subtraction = rddA.subtract(rddB)

rddA = ParallelCollectionRDD[5] at parallelize at <console>:27
rddB = ParallelCollectionRDD[6] at parallelize at <console>:28
subtraction = MapPartitionsRDD[10] at subtract at <console>:29


MapPartitionsRDD[10] at subtract at <console>:29

In [10]:
subtraction.collect()

[1, 2]

In [11]:
assert(subtraction.count() < rddA.count() - rddB.count())

Intersection example:

In [12]:
val intersection = rddA.intersection(rddB)
intersection.collect()

intersection = MapPartitionsRDD[16] at intersection at <console>:30


[3, 4]

In [13]:
val union = rddA.union(rddB)
union.collect()

union = UnionRDD[17] at union at <console>:30


[1, 2, 3, 4, 4, 4, 4, 3, 4]

In [14]:
assert(!rddA.collect().sorted.sameElements(union.collect().sorted))
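
The outputs above show the peculiarities: `subtract` removes every occurrence of a matching element, `intersection` returns each common element once, and `union` keeps duplicates. If mathematical set semantics are wanted for a union, one option is to follow it with `distinct()`. The sketch below assumes Spark is on the classpath and either reuses the notebook's context or starts a local one; the application name is arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context if there is one, otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("set-operations-sketch"))

val rddA = sc.parallelize(Array(1, 2, 3, 4, 4, 4, 4))
val rddB = sc.parallelize(Array(3, 4))

// union only concatenates the partitions of both RDDs, so duplicates survive.
val unionWithDuplicates = rddA.union(rddB)

// distinct() introduces a shuffle, but leaves each value exactly once.
val setUnion = unionWithDuplicates.distinct().collect().sorted
```

The trade-off is cost: `union` alone needs no shuffle, whereas `distinct()` does, so it is worth adding only when duplicates actually matter.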

## Reducing Setup Overhead

In this section, we will look at ways to reduce per-partition set-up and configuration overhead in Spark. We will see some examples with broadcast variables and accumulators (both are shared variables).

In [15]:
// case class Panda

defined class Panda


In [16]:
val pandasRDD = sc.parallelize(Seq(Panda(1, 11000, "giant", false, Array(0.2, 0.8)),
                                   Panda(2, 11000, "small", false, Array(0.3, 0.1)),
                                   Panda(3, 13000, "small", true, Array(0.9, 0.7)),
                                   Panda(4, 13000, "medium", false, Array(0.5, 0.4)),
                                   Panda(5, 18000, "medium", true, Array(0.7, 0.1)),
                                   Panda(6, 18000, "giant", true, Array(0.1, 0.7)),
                                   Panda(7, 18000, "small", true, Array(0.3, 0.9))))

pandasRDD = ParallelCollectionRDD[18] at parallelize at <console>:29


ParallelCollectionRDD[18] at parallelize at <console>:29

### Broadcast Variables

In [17]:
val invalidPandasIds = Array(2, 7)

invalidPandasIds = Array(2, 7)


[2, 7]

In [18]:
val invalidPandasBcst = sc.broadcast(invalidPandasIds)

invalidPandasBcst = Broadcast(16)


Broadcast(16)

In [19]:
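// Keep only the pandas whose id is not in the broadcast list of invalid ids
pandasRDD.filter(x => !invalidPandasBcst.value.contains(x.id)).collect().foreach(println)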

Panda(1,11000,giant,false,[D@40243752)
Panda(3,13000,small,true,[D@3484e074)
Panda(4,13000,medium,false,[D@5f1bc7b8)
Panda(5,18000,medium,true,[D@e68f765)
Panda(6,18000,giant,true,[D@2b5e21d5)


In [20]:
pandasRDD.getNumPartitions

8

In [21]:
// object LazyPrng extends Serializable{
    
    
// }

// val bcastprng = sc
// pandasRDD

Panda(4,13000,medium,false,[D@5594ea0)
Panda(7,18000,small,true,[D@6fa70d55)


defined object LazyPrng
bcastprng = Broadcast(18)


Broadcast(18)
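
The `LazyPrng` cell is stubbed out. A minimal sketch of the idea, assuming the object wraps a `scala.util.Random` in a field named `r` (a hypothetical name), is shown below. Marking the field `@transient lazy` means the generator is not serialized along with the broadcast; it is created on first access in each JVM that uses it. The sampling rate of one in ten and the input range are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object LazyPrng extends Serializable {
  // Not shipped over the wire: built lazily, once per JVM, on first use.
  @transient lazy val r = new Random()
}

// Reuse an existing context if there is one, otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("lazy-prng-sketch"))

val bcastprng = sc.broadcast(LazyPrng)

// Keep roughly one element in ten; the exact result differs from run to run.
val sampled = sc.parallelize(1 to 1000).filter(_ => bcastprng.value.r.nextInt(10) == 0)
val sampledCount = sampled.count()
```

Because the filter is random, the selected elements change between runs, which is consistent with the notebook output above showing only a couple of pandas.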

### Accumulators

Built-in accumulator example:

In [22]:
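val accFuzzyNess = sc.doubleAccumulator("fuzzyNess")
// Add each panda's first attribute to the accumulator as a side effect of the map
val transformed = pandasRDD.map(x => {accFuzzyNess.add(x.attributes(0)); (x.id, x.zip)})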
transformed.count()
println("AccuFuzzyNess: " + accFuzzyNess)

AccuFuzzyNess: DoubleAccumulator(id: 450, name: Some(fuzzyNess), value: 3.0000000000000004)


accFuzzyNess = DoubleAccumulator(id: 450, name: Some(fuzzyNess), value: 3.0000000000000004)
transformed = MapPartitionsRDD[21] at map at <console>:32


MapPartitionsRDD[21] at map at <console>:32

Custom accumulator:

In [23]:
// import org.apache.spark.util.AccumulatorV2
// class MaxDoubleAccumulator extends AccumulatorV2

defined class MaxDoubleAccumulator
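
The body of `MaxDoubleAccumulator` is not shown. A sketch that is consistent with the reported value type (`Some(0.9)`) is given below: it extends `AccumulatorV2[Double, Option[Double]]` and implements the six abstract members of that class. The private field name and the sample values are assumptions; the values are the first attributes of the seven pandas defined earlier.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

// Hypothetical implementation: tracks the maximum of all added values.
class MaxDoubleAccumulator extends AccumulatorV2[Double, Option[Double]] {
  private var currentMax: Option[Double] = None

  override def isZero: Boolean = currentMax.isEmpty

  override def copy(): MaxDoubleAccumulator = {
    val newAcc = new MaxDoubleAccumulator()
    newAcc.currentMax = currentMax
    newAcc
  }

  override def reset(): Unit = currentMax = None

  // Keep the larger of the current maximum and the new value.
  override def add(v: Double): Unit =
    currentMax = Some(currentMax.fold(v)(m => math.max(m, v)))

  // Fold the other accumulator's maximum (if any) into this one.
  override def merge(other: AccumulatorV2[Double, Option[Double]]): Unit =
    other.value.foreach(add)

  override def value: Option[Double] = currentMax
}

// Reuse an existing context if there is one, otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("max-accumulator-sketch"))

val maxAcc = new MaxDoubleAccumulator()
sc.register(maxAcc, "max first attribute")

// foreach is an action, so the updates run on the executors and are merged on the driver.
sc.parallelize(Seq(0.2, 0.3, 0.9, 0.5, 0.7, 0.1, 0.3)).foreach(v => maxAcc.add(v))
```

Using `Option[Double]` rather than a sentinel such as `Double.MinValue` lets `isZero` distinguish "nothing added yet" from a genuine value.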


In [24]:
val acc = new MaxDoubleAccumulator()
sc.register(acc, "My accumulator")
val transformed = pandasRDD.repartition(1).collect().foreach(x => {acc.add(x.attributes(0).toDouble); (x.id, x.zip)})
acc

acc = MaxDoubleAccumulator(id: 476, name: Some(My accumulator), value: Some(0.9))


transformed: Unit = ()


MaxDoubleAccumulator(id: 476, name: Some(My accumulator), value: Some(0.9))

## Reusing RDDs

One good way to optimize Spark computations is to reuse RDDs when needed. In this section, we will see some examples where this approach could be a good option.

### Cases of Reuse

#### Iterative Computations

In [25]:
import org.apache.spark.rdd.RDD

In [26]:
val validationSet: RDD[(Double, Int)] = sc.parallelize(Array((2.0, 1), (5.0, 2), (3.0, 2), (7.0, 4)))

validationSet = ParallelCollectionRDD[26] at parallelize at <console>:29


ParallelCollectionRDD[26] at parallelize at <console>:29

In [27]:
import scala.math

val testSet: Array[RDD[(Double, Int)]] = Array(validationSet.mapValues(_ + 1),
                                               validationSet.mapValues(_ + 2),
                                               validationSet)
validationSet.persist()
val errors = testSet.map(rdd => {
    val errorCount = rdd.join(validationSet).values.map(x => (math.pow(x._1 - x._2, 2), 1))
    .reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    
    math.pow(errorCount._1.toDouble/errorCount._2, 0.5)
})

testSet = Array(MapPartitionsRDD[27] at mapValues at <console>:33, MapPartitionsRDD[28] at mapValues at <console>:34, ParallelCollectionRDD[26] at parallelize at <console>:29)
errors = Array(1.0, 2.0, 0.0)


[1.0, 2.0, 0.0]

#### Multiple Actions on the Same RDD

In [28]:
val rddA: RDD[(Int, String)] = sc.parallelize(Array((1, "a"), (4, "d"), (3, "c"), (2, "b")))
val sorted = rddA.sortByKey()
// sorted
val count = sorted.count() // Action 1 on sorted
val sample = count/2.0
val sampled = sorted.take(sample.toInt) // Action 2 on sorted

rddA = ParallelCollectionRDD[44] at parallelize at <console>:32
sorted = ShuffledRDD[47] at sortByKey at <console>:33
count = 4
sample = 2.0
sampled = Array((1,a), (2,b))


[(1,a), (2,b)]
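
In the cell above, the line between the sort and the first action is stubbed out. A plausible candidate is a call to `persist`, so that the second action reads the sorted data from the cache instead of repeating the sort. The sketch below assumes that reading; the storage level and the final `unpersist` are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuse an existing context if there is one, otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("persist-sketch"))

val pairs = sc.parallelize(Array((1, "a"), (4, "d"), (3, "c"), (2, "b")))
val sortedPairs = pairs.sortByKey()

// Mark the sorted RDD for in-memory caching; it is materialized by the first action.
sortedPairs.persist(StorageLevel.MEMORY_ONLY)

val total = sortedPairs.count()                        // Action 1: computes and caches
val firstHalf = sortedPairs.take((total / 2).toInt)    // Action 2: served from the cache

// Release the cached blocks once they are no longer needed.
sortedPairs.unpersist()
```

`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; other levels such as `MEMORY_AND_DISK` trade speed for resilience to memory pressure.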

### Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files

Checkpoint example:

In [29]:
val rddA: RDD[(Int, String)] = sc.parallelize(Array((1, "a"), (4, "d"), (3, "c"), (2, "b")))
val sorted = rddA.sortByKey()
// sc
// sorted
val count = sorted.count() // Action 1 on sorted
val sample = count/2.0
val sampled = sorted.take(sample.toInt) // Action 2 on sorted

rddA = ParallelCollectionRDD[48] at parallelize at <console>:35
sorted = ShuffledRDD[51] at sortByKey at <console>:36
count = 4
sample = 2.0
sampled = Array((1,a), (2,b))


[(1,a), (2,b)]
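
The two stubbed lines in the checkpoint cell most likely set a checkpoint directory on the context and mark `sorted` for checkpointing. A sketch under that assumption follows. The temporary directory is a hypothetical location chosen so the example runs locally; on a cluster the directory should be on reliable shared storage such as HDFS.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse an existing context if there is one, otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("checkpoint-sketch"))

// Hypothetical checkpoint location: a fresh temporary directory.
sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

val checkpointed = sc.parallelize(Array((1, "a"), (4, "d"), (3, "c"), (2, "b"))).sortByKey()

// checkpoint() only marks the RDD; the data is written when the first action finishes.
checkpointed.checkpoint()

val rowCount = checkpointed.count()                       // Action 1: triggers the checkpoint
val firstRows = checkpointed.take((rowCount / 2).toInt)   // Action 2: reads checkpointed data
```

Unlike `persist`, a checkpoint truncates the RDD's lineage and survives the loss of executors. Since writing the checkpoint recomputes the RDD, it is common to persist it first.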