# Chapter 6: Working with Key/Value Data

In this notebook, we look at some advanced concepts for running computations on key/value RDDs. We solve the Goldilocks example in several different ways, focusing on performance considerations.

## The Goldilocks Example

In this example, the data holds several metrics for a handful of pandas.

In [1]:
case class Panda(Happiness: Double, Niceness: Double, Softness: Double, 
                 Sweetness: Double)

defined class Panda


In [2]:
val df = spark.createDataFrame(Seq(Panda(15.0, 0.25, 2467.0, 0.0),
                                   Panda(2.0, 1000, 35.4, 0.0),
                                   Panda(10.0, 2.0, 50.0, 0.0),
                                   Panda(3.0, 8.5, 0.2, 98.0))) 

df = [Happiness: double, Niceness: double ... 2 more fields]


[Happiness: double, Niceness: double ... 2 more fields]

In [3]:
df.show()

+---------+--------+--------+---------+
|Happiness|Niceness|Softness|Sweetness|
+---------+--------+--------+---------+
|     15.0|    0.25|  2467.0|      0.0|
|      2.0|  1000.0|    35.4|      0.0|
|     10.0|     2.0|    50.0|      0.0|
|      3.0|     8.5|     0.2|     98.0|
+---------+--------+--------+---------+



The objective is to design an algorithm that takes an arbitrary list of integers n1...nk and returns, for each n in the list, the nth best element in each column. In particular, we are going to retrieve the 2nd and 4th elements.
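
To make the target concrete, here is a minimal sketch on plain Scala collections rather than Spark. It assumes ranks are 1-based positions in ascending order, which is the convention the results in the following sections follow. The names `rows`, `ranks`, and `nthPerColumn` are hypothetical and used only for illustration.

```scala
// Local (non-Spark) sketch: the rows mirror the four pandas shown above.
val rows = Seq(
  Seq(15.0, 0.25, 2467.0, 0.0),
  Seq(2.0, 1000.0, 35.4, 0.0),
  Seq(10.0, 2.0, 50.0, 0.0),
  Seq(3.0, 8.5, 0.2, 98.0))

// Assumption: ranks are 1-based positions in ascending order.
val ranks = Seq(2, 4)

// Transpose rows into columns, sort each column, and pick the requested ranks.
val nthPerColumn: Map[Int, Seq[Double]] =
  rows.transpose.zipWithIndex.map { case (column, idx) =>
    val sortedColumn = column.sorted
    idx -> ranks.map(r => sortedColumn(r - 1))
  }.toMap

nthPerColumn.toSeq.sortBy(_._1).foreach(println)
```

This local version only works because the whole dataset fits in memory; the Spark versions in this notebook aim for the same answer when it does not.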

## Goldilocks Version 0: Iterative Solution

In [4]:
val rankIndexs = Array(2, 4)
var resultV0 = Map[Int, Iterable[Double]]()

for (idx <- 1 to df.schema.length) {
    
    val colData = df.rdd.map(row => row.getDouble(idx - 1))
    val sortedData = colData.sortBy(x => x).zipWithIndex()
    val ranksOnly = sortedData.filter(x => rankIndexs.contains(x._2 + 1)).map(_._1)
    
    resultV0 += (idx -> ranksOnly.collect())
    
}

resultV0.foreach(println)

(1,WrappedArray(3.0, 15.0))
(2,WrappedArray(2.0, 1000.0))
(3,WrappedArray(35.4, 2467.0))
(4,WrappedArray(0.0, 98.0))


rankIndexs = Array(2, 4)
resultV0 = Map(1 -> WrappedArray(3.0, 15.0), 2 -> WrappedArray(2.0, 1000.0), 3 -> WrappedArray(35.4, 2467.0), 4 -> WrappedArray(0.0, 98.0))


Map(1 -> WrappedArray(3.0, 15.0), 2 -> WrappedArray(2.0, 1000.0), 3 -> WrappedArray(35.4, 2467.0), 4 -> WrappedArray(0.0, 98.0))

## Goldilocks Version 1: GroupByKey Solution

In [5]:
val rankIndexs = Array(2, 4)
val rowLength = df.schema.length
val pairRDD = df.rdd.flatMap(row => Range(0, rowLength).map(idx => (idx, row.getDouble(idx))))
val resultV1 = pairRDD.groupByKey().map(x => (x._1, x._2.toArray.sorted.zipWithIndex
                                             .filter(y => rankIndexs.contains(y._2 + 1)).map(_._1))).collectAsMap()

rankIndexs = Array(2, 4)
rowLength = 4
pairRDD = MapPartitionsRDD[40] at flatMap at <console>:35
resultV1 = Map(2 -> Array(35.4, 2467.0), 1 -> Array(2.0, 1000.0), 3 -> Array(0.0, 98.0), 0 -> Array(3.0, 15.0))


Map(2 -> [D@33416fef, 1 -> [D@234553d2, 3 -> [D@4a72e0a9, 0 -> [D@7bd123f1)

In [6]:
resultV1.toSeq.sortBy(_._1).foreach({case (key, value) => println(key + "-> " + value.mkString(", "))})

0-> 3.0, 15.0
1-> 2.0, 1000.0
2-> 35.4, 2467.0
3-> 0.0, 98.0


## Partitioners and Key/Value Data

### Preserving Partitioning Information Across Transformations

In [17]:
import org.apache.spark.HashPartitioner
val partitioner = new HashPartitioner(4)

partitioner = org.apache.spark.HashPartitioner@4


org.apache.spark.HashPartitioner@4

In [18]:
val rddA = sc.parallelize(Array(1,2,3,4)).map(x => (x, x*1.5)).partitionBy(partitioner)

rddA = ShuffledRDD[61] at partitionBy at <console>:32


ShuffledRDD[61] at partitionBy at <console>:32

In [19]:
rddA.partitioner

Some(org.apache.spark.HashPartitioner@4)

In [20]:
rddA.map(x => x).partitioner

None

In [21]:
rddA.mapValues(x => x).partitioner

Some(org.apache.spark.HashPartitioner@4)
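
The outputs above show `map` dropping the partitioner while `mapValues` keeps it. `map` could change the keys, so Spark cannot assume the records still sit in the right partitions, whereas `mapValues` cannot touch the keys. As a rough illustration of why the key matters, the sketch below mimics in plain Scala how hash partitioning assigns a key to a partition (the key's `hashCode` modulo the partition count, made non-negative). `partitionFor` is a hypothetical helper, not part of Spark's API, and it assumes non-null keys.

```scala
// Hypothetical helper mimicking hash partitioning:
// non-negative hashCode modulo numPartitions (assumes key is not null).
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val keys = Seq(1, 2, 3, 4)

// With the original keys, each record has a well-defined home partition.
val before = keys.map(k => k -> partitionFor(k, 4))

// If a map() rewrote the keys (here: k => k + 1), the home partitions change,
// so a partitioner recorded before the map would no longer describe the data.
val after = keys.map(k => (k + 1) -> partitionFor(k + 1, 4))

println(before) // List((1,1), (2,2), (3,3), (4,0))
println(after)  // List((2,2), (3,3), (4,0), (5,1))
```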

### Co-located RDDs

In [24]:
val a = sc.parallelize(Array(1,2,3,4)).map(x => (x, x*1.5))
val b = sc.parallelize(Array(2,4,7,9)).map(x => (x, x*1.5))

val rddA = a.partitionBy(partitioner)
rddA.cache()
val rddB = b.partitionBy(partitioner)
rddB.cache()
val rddC = rddA.cogroup(rddB)
rddC.count()

a = MapPartitionsRDD[65] at map at <console>:36
b = MapPartitionsRDD[67] at map at <console>:37
rddA = ShuffledRDD[68] at partitionBy at <console>:39
rddB = ShuffledRDD[69] at partitionBy at <console>:41
rddC = MapPartitionsRDD[71] at cogroup at <console>:43


6

### RDDs Co-partitioned but not Co-located

In [25]:
val a = sc.parallelize(Array(1,2,3,4)).map(x => (x, x*1.5))
val b = sc.parallelize(Array(2,4,7,9)).map(x => (x, x*1.5))

val rddA = a.partitionBy(partitioner)
rddA.cache()
val rddB = b.partitionBy(partitioner)
rddB.cache()
val rddC = rddA.cogroup(rddB)
rddA.count()
rddB.count()
rddC.count()

a = MapPartitionsRDD[73] at map at <console>:38
b = MapPartitionsRDD[75] at map at <console>:39
rddA = ShuffledRDD[76] at partitionBy at <console>:41
rddB = ShuffledRDD[77] at partitionBy at <console>:43
rddC = MapPartitionsRDD[79] at cogroup at <console>:45


6

## Goldilocks Version 2: Secondary Sort

In [50]:
import org.apache.spark.rdd.RDD

In [39]:
val df = spark.createDataFrame(Seq(Panda(15.0, 0.25, 2467.0, 0.0),
                                   Panda(2.0, 1000, 35.4, 0.0),
                                   Panda(10.0, 2.0, 50.0, 0.0),
                                   Panda(3.0, 8.5, 0.2, 98.0))) 

df = [Happiness: double, Niceness: double ... 2 more fields]


[Happiness: double, Niceness: double ... 2 more fields]

In [40]:
df.rdd.getNumPartitions

4

In [34]:
import org.apache.spark.Partitioner
import scala.math

class ColumnIndexPartition(override val numPartitions: Int) extends Partitioner {
    
    require(numPartitions >=0, s"Number of partitions $numPartitions cannot be negative")
    
    override def getPartition(key: Any) = {
        
        val k = key.asInstanceOf[(Int, Double)]
        
        math.abs(k._1) % numPartitions
    }
}

defined class ColumnIndexPartition


In [45]:
val pairRDD = df.rdd.flatMap(row => Range(0, rowLength).map(idx => (idx, row.getDouble(idx)))).map((_, 1))
val partitioner = new ColumnIndexPartition(4)
val sorted = pairRDD.repartitionAndSortWithinPartitions(partitioner)

pairRDD = MapPartitionsRDD[95] at map at <console>:45
partitioner = ColumnIndexPartition@35214ad6
sorted = ShuffledRDD[96] at repartitionAndSortWithinPartitions at <console>:47


ShuffledRDD[96] at repartitionAndSortWithinPartitions at <console>:47

In [48]:
sorted.collect()

[((0,2.0),1), ((0,3.0),1), ((0,10.0),1), ((0,15.0),1), ((1,0.25),1), ((1,2.0),1), ((1,8.5),1), ((1,1000.0),1), ((2,0.2),1), ((2,35.4),1), ((2,50.0),1), ((2,2467.0),1), ((3,0.0),1), ((3,0.0),1), ((3,0.0),1), ((3,98.0),1)]

In [62]:
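val targetRanks = Array(2, 4)
val filterForTargetIndex: RDD[(Int, Double)] = sorted.mapPartitions(iter => {
    // Each partition arrives sorted by (column index, value), so one pass suffices.
    var currentColumnIndex = -1
    var runningTotal = 0
    iter.filter({ case ((colIndex, _), _) =>
        if (colIndex != currentColumnIndex) {
            // New column: restart the rank counter.
            currentColumnIndex = colIndex
            runningTotal = 1
        } else {
            runningTotal += 1
        }
        // Keep only the elements whose rank within the column was requested.
        targetRanks.contains(runningTotal)
    })
}.map(_._1))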

targetRanks = Array(2, 4)
filterForTargetIndex = MapPartitionsRDD[99] at mapPartitions at <console>:50


MapPartitionsRDD[99] at mapPartitions at <console>:50
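
The filter above relies on each partition arriving sorted by (column index, value), so a single pass with a running counter is enough. The following is a minimal local sketch of the same idea on a plain Scala iterator, using a few hand-written, pre-sorted pairs as assumed input. `sortedPairs`, `wanted`, and `kept` are illustrative names only.

```scala
// Assumed input: (columnIndex, value) pairs already sorted by column, then value.
val sortedPairs = Seq((0, 2.0), (0, 3.0), (0, 10.0), (0, 15.0),
                      (1, 0.25), (1, 2.0), (1, 8.5), (1, 1000.0))
val wanted = Set(2, 4) // 1-based ranks to keep

var currentColumn = -1
var rankInColumn = 0

// Single pass: restart the counter whenever the column index changes.
val kept = sortedPairs.iterator.filter { case (col, _) =>
  if (col != currentColumn) { currentColumn = col; rankInColumn = 1 }
  else rankInColumn += 1
  wanted.contains(rankInColumn)
}.toList

println(kept) // List((0,3.0), (0,15.0), (1,2.0), (1,1000.0))
```

The counter is only meaningful if a column never spans two partitions. `ColumnIndexPartition` ensures this by partitioning on the column index alone.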

In [109]:
var resultV2 = Map[Int, Iterable[Double]]()

resultV2 = Map()


Map()

In [110]:
resultV2 += 1 -> Seq(1)

In [111]:
resultV2.get(1) match {
    case Some(value) => {resultV2 += 1 -> {value ++ Seq(2.0)}}
}

It would fail on the following input: None
       resultV2.get(1) match {
                   ^


In [112]:
resultV2.get(1)

Some(List(1.0, 2.0))

In [105]:
List(1.0) ++ Seq(2.0)

List(1.0, 2.0)

In [113]:
filterForTargetIndex.collect()

[(0,3.0), (0,15.0), (1,2.0), (1,1000.0), (2,35.4), (2,2467.0), (3,0.0), (3,98.0)]

In [119]:
var resultV2 = Map[Int, Iterable[Double]]()
filterForTargetIndex.collect().foreach(x => {
    resultV2.get(x._1) match {
        case Some(value) => {resultV2 += x._1 -> {value ++ Seq(x._2.toDouble)}}
        case None => {resultV2 += x._1 -> Seq(x._2.toDouble)}  
    }
})

resultV2 = Map(0 -> List(3.0, 15.0), 1 -> List(2.0, 1000.0), 2 -> List(35.4, 2467.0), 3 -> List(0.0, 98.0))


Map(0 -> List(3.0, 15.0), 1 -> List(2.0, 1000.0), 2 -> List(35.4, 2467.0), 3 -> List(0.0, 98.0))

In [121]:
resultV2.toSeq.sortBy(_._1).foreach({case (key, value) => println(key + "-> " + value.mkString(", "))})

0-> 3.0, 15.0
1-> 2.0, 1000.0
2-> 35.4, 2467.0
3-> 0.0, 98.0
