# Chapter 4: Working with Key/Value Pairs (Scala)

In this notebook, we study the operations that can be performed on key/value (pair) RDDs.

## Creating Pair RDDs

Using `map()`

In [1]:
import scala.math.pow

In [2]:
// val numericRdd = 
// val pairRdd = 

numericRdd = ParallelCollectionRDD[0] at parallelize at <console>:28
pairRdd = MapPartitionsRDD[1] at map at <console>:29


MapPartitionsRDD[1] at map at <console>:29

In [3]:
// println("Pair RDD from map(): " + pairRdd)

Pair RDD from map(): (1,1), (4,16), (2,4), (4,16), (1,1), (3,9), (3,9)
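One possible way to complete the two cells above is sketched below. It assumes `sc` is the SparkContext provided by the notebook kernel, and it infers the input list and the squaring function from the recorded output, so treat both as assumptions rather than the original solution.

```scala
import scala.math.pow

// Assumption: `sc` is the SparkContext created by the notebook kernel.
// The input list is inferred from the recorded output above.
val numericRdd = sc.parallelize(List(1, 4, 2, 4, 1, 3, 3))

// Turn each number x into the pair (x, x^2): the key is x, the value its square.
val pairRdd = numericRdd.map(x => (x, pow(x, 2).toInt))

// collect() brings the pairs to the driver so they can be printed.
println("Pair RDD from map(): " + pairRdd.collect().mkString(", "))
```

Any function that returns a two-element tuple works here: Spark treats an `RDD[(K, V)]` as a pair RDD and makes the key/value operations available on it.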


## Transformations on One Pair RDD

In addition to the RDD transformations explained in Chapter 3, the following transformations are specific to key/value RDDs:

    * reduceByKey()
    * mapValues()
    * groupByKey()
    * combineByKey()
    * flatMapValues()
    * keys()
    * values()
    * sortByKey()

`reduceByKey()` --> merges the values of each key with an associative and commutative function; `mapValues()` --> applies a function to every value and leaves the keys unchanged.

Sum values using reduceByKey()

In [4]:
// val sumValues = 
println("Sum values using reduceByKey(): " + sumValues.collect().mkString(", "))

Sum values using reduceByKey(): (4,32), (1,2), (2,4), (3,18)


sumValues = ShuffledRDD[2] at reduceByKey at <console>:31


ShuffledRDD[2] at reduceByKey at <console>:31
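A minimal sketch of the sum-by-key cell, assuming `sc` is the notebook's SparkContext. The `pairRdd` line only recreates the data shown in the notebook output so that the snippet runs on its own.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

// Add up all values that share the same key.
val sumValues = pairRdd.reduceByKey(_ + _)

// The order of the printed pairs depends on the partitioning and may vary.
println("Sum values using reduceByKey(): " + sumValues.collect().mkString(", "))
```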

Average calculated using reduceByKey()

In [5]:
// val avgRedByKey = 
println("Average by key using reduceByKey(): " + avgRedByKey.collect().mkString(", "))

Average by key using reduceByKey(): (4,16), (1,1), (2,4), (3,9)


avgRedByKey = MapPartitionsRDD[5] at mapValues at <console>:31


MapPartitionsRDD[5] at mapValues at <console>:31
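The average above can be obtained by carrying a (sum, count) pair through `reduceByKey()`. The sketch below is one way to do it, assuming `sc` is the notebook's SparkContext. The integer division is an assumption based on the integer results in the recorded output.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

val avgRedByKey = pairRdd
  .mapValues(v => (v, 1))                              // value -> (sum, count)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // add sums and counts per key
  .mapValues { case (sum, count) => sum / count }      // integer division

println("Average by key using reduceByKey(): " + avgRedByKey.collect().mkString(", "))
```

Using `sum.toDouble / count` in the last step would give a fractional average instead.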

Wordcount using reduceByKey()

In [6]:
// val lines = 
// val words = 
print("Word count using reduceByKey(): " + words.take(10).mkString(", "))

Word count using reduceByKey(): (package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (general,3)

lines = ../data/README.md MapPartitionsRDD[7] at textFile at <console>:28
words = ShuffledRDD[10] at reduceByKey at <console>:29


ShuffledRDD[10] at reduceByKey at <console>:29
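A word count along these lines could look as follows. This is a sketch: `wordCount` is a hypothetical helper introduced only for illustration, splitting on single spaces is inferred from the tokens in the recorded output, and `sc` is assumed to be the notebook's SparkContext. The file path is the one shown in the output above.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of the original notebook):
// split each line on spaces, emit (word, 1), then sum the ones per word.
def wordCount(lines: RDD[String]): RDD[(String, Int)] =
  lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

// Assumption: `sc` is the notebook's SparkContext and the file exists at this path.
val lines = sc.textFile("../data/README.md")
val words = wordCount(lines)

print("Word count using reduceByKey(): " + words.take(10).mkString(", "))
```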

`groupByKey()` --> groups the values of an RDD by key

In [7]:
// val groupedValues = 
println("Grouped RDD using groupByKey(): " + groupedValues.collect().mkString(", "))

Grouped RDD using groupByKey(): (4,CompactBuffer(16, 16)), (1,CompactBuffer(1, 1)), (2,CompactBuffer(4)), (3,CompactBuffer(9, 9))


groupedValues = ShuffledRDD[11] at groupByKey at <console>:31


ShuffledRDD[11] at groupByKey at <console>:31
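A minimal sketch of the grouping cell, assuming `sc` is the notebook's SparkContext and recreating the data shown for `pairRdd`:

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

// Collect all values of a key into one Iterable: RDD[(Int, Iterable[Int])].
val groupedValues = pairRdd.groupByKey()

println("Grouped RDD using groupByKey(): " + groupedValues.collect().mkString(", "))
```

When the groups are only needed to compute an aggregate such as a sum, `reduceByKey()` is usually the better choice because it combines values on each partition before shuffling.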

`combineByKey()` --> combines the values of an RDD by key using three user-supplied functions. Here we use it to calculate the average per key of a key/value RDD.

In [8]:
pairRdd.collect()

[(1,1), (4,16), (2,4), (4,16), (1,1), (3,9), (3,9)]

In [9]:
// val sumKeyValues = 

// val avgComByKey = 

print("Average by key using combineByKey(): " + avgComByKey.collect().mkString(", "))

Average by key using combineByKey(): (4,16.0), (1,1.0), (2,4.0), (3,9.0)

sumKeyValues = ShuffledRDD[12] at combineByKey at <console>:31
avgComByKey = MapPartitionsRDD[13] at mapValues at <console>:35


MapPartitionsRDD[13] at mapValues at <console>:35
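The average per key can be sketched with `combineByKey()` as shown below, assuming `sc` is the notebook's SparkContext. The three functions build a (sum, count) accumulator: the first creates it from the first value seen for a key, the second folds another value into it, and the third merges accumulators from different partitions.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

val sumKeyValues = pairRdd.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners
)

// Divide each sum by its count; toDouble avoids integer division.
val avgComByKey = sumKeyValues.mapValues { case (sum, count) => sum.toDouble / count }

print("Average by key using combineByKey(): " + avgComByKey.collect().mkString(", "))
```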

`flatMapValues()` --> applies a function that returns a collection to each value and emits one pair per resulting element, keeping the original key

In [10]:
// println("RDD using flatMapValues(): " + pairRdd.....take(10).mkString(", "))

RDD using flatMapValues(): (1,1), (4,1), (4,2), (4,3), (4,4), (4,5), (4,6), (4,7), (4,8), (4,9)
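One way to reproduce the output above is to expand each value `v` into the range `1 to v`. That function is an assumption inferred from the recorded output, and `sc` is assumed to be the notebook's SparkContext.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

// Each pair (k, v) becomes (k, 1), (k, 2), ..., (k, v).
val expanded = pairRdd.flatMapValues(v => 1 to v)

println("RDD using flatMapValues(): " + expanded.take(10).mkString(", "))
```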


`keys()` --> get the keys of a key/value RDD

In [11]:
// println("Get keys from key/pair RDD using keys(): " + pairRdd.....collect().mkString(", "))

Get keys from key/pair RDD using keys(): 1, 4, 2, 4, 1, 3, 3


`values()` --> get the values of a key/value RDD

In [12]:
// println("Get values from key/pair RDD using values(): " + pairRdd.....collect().mkString(", "))

Get values from key/pair RDD using values(): 1, 16, 4, 16, 1, 9, 9
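The two cells above can be sketched as follows, assuming `sc` is the notebook's SparkContext. In the Scala API, `keys` and `values` are declared without a parameter list, so they are written without parentheses.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

// keys and values take no parentheses in Scala.
val onlyKeys = pairRdd.keys      // RDD[Int] holding the first element of each pair
val onlyValues = pairRdd.values  // RDD[Int] holding the second element of each pair

println("Keys: " + onlyKeys.collect().mkString(", "))
println("Values: " + onlyValues.collect().mkString(", "))
```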


`sortByKey()` --> sorts the pairs of an RDD by key

In [13]:
val rddSort = sc.parallelize(List((4, (8, 2)), (1, (3, 1, 9))))

rddSort = ParallelCollectionRDD[17] at parallelize at <console>:28


ParallelCollectionRDD[17] at parallelize at <console>:28

In [14]:
// println("Get RDD sorted by keys using sortByKey(): " + rddSort.....collect().mkString(", "))

Get RDD sorted by keys using sortByKey(): (1,(3,1,9)), (4,(8,2))
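A short sketch of the sorting cell, assuming `sc` is the notebook's SparkContext. `sortByKey()` sorts in ascending key order by default; passing `ascending = false` reverses it.

```scala
// Assumption: `sc` is the notebook's SparkContext. Same data as rddSort above.
val rddSort = sc.parallelize(List((4, (8, 2)), (1, (3, 1, 9))))

val sortedAsc = rddSort.sortByKey()                    // keys in ascending order
val sortedDesc = rddSort.sortByKey(ascending = false)  // keys in descending order

println("Get RDD sorted by keys using sortByKey(): " + sortedAsc.collect().mkString(", "))
```

Only the keys take part in the ordering; the values are carried along unchanged.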


## Transformations on Two Pair RDDs

This section covers the transformations that combine two key/value RDDs.

In [15]:
val pairRdd1 = sc.parallelize(List((3, 'A'), (2, 'J'), (5, 'K')))
val pairRdd2 = sc.parallelize(List((5, 'Z'), (3, 'W'), (7, 'B')))

pairRdd1 = ParallelCollectionRDD[21] at parallelize at <console>:28
pairRdd2 = ParallelCollectionRDD[22] at parallelize at <console>:29


ParallelCollectionRDD[22] at parallelize at <console>:29

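`subtractByKey()` --> removes from the first RDD every pair whose key is present in the second RDD. Note that the output recorded below was produced with `subtract()` (see `at subtract` in the lineage), which compares whole (key, value) pairs and therefore removes nothing here; `subtractByKey()` would keep only `(2,J)`.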

In [16]:
// val subtractRdd = 
println("RDD from subtractByKey(): " + subtractRdd.collect().mkString(", "))

RDD from subtractByKey(): (3,A), (2,J), (5,K)


subtractRdd = MapPartitionsRDD[26] at subtract at <console>:31


MapPartitionsRDD[26] at subtract at <console>:31
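The difference between `subtract()` and `subtractByKey()` can be checked with the sketch below, which assumes `sc` is the notebook's SparkContext and reuses the two RDDs defined above.

```scala
// Assumption: `sc` is the notebook's SparkContext. Same data as pairRdd1 and pairRdd2.
val pairRdd1 = sc.parallelize(List((3, 'A'), (2, 'J'), (5, 'K')))
val pairRdd2 = sc.parallelize(List((5, 'Z'), (3, 'W'), (7, 'B')))

// subtract() compares whole pairs: no pair of pairRdd1 occurs in pairRdd2,
// so all three pairs remain.
val bySubtract = pairRdd1.subtract(pairRdd2)

// subtractByKey() compares keys only: keys 3 and 5 occur in pairRdd2,
// so only the pair with key 2 remains.
val subtractRdd = pairRdd1.subtractByKey(pairRdd2)

println("RDD from subtractByKey(): " + subtractRdd.collect().mkString(", "))
```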

`join()` --> inner-joins two RDDs by key

In [17]:
// val joinRdd = 
println("RDD from join(): " + joinRdd.collect().mkString(", "))

RDD from join(): (5,(K,Z)), (3,(A,W))


joinRdd = MapPartitionsRDD[29] at join at <console>:31


MapPartitionsRDD[29] at join at <console>:31
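A minimal sketch of the inner join, assuming `sc` is the notebook's SparkContext. Only keys present in both RDDs (3 and 5 here) appear in the result, each with a tuple holding the left and right values.

```scala
// Assumption: `sc` is the notebook's SparkContext. Same data as pairRdd1 and pairRdd2.
val pairRdd1 = sc.parallelize(List((3, 'A'), (2, 'J'), (5, 'K')))
val pairRdd2 = sc.parallelize(List((5, 'Z'), (3, 'W'), (7, 'B')))

// Result type: RDD[(Int, (Char, Char))]
val joinRdd = pairRdd1.join(pairRdd2)

println("RDD from join(): " + joinRdd.collect().mkString(", "))
```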

`leftOuterJoin()` --> left-outer-joins two RDDs by key; every key of the first RDD is kept and missing right-hand values appear as `None`

In [18]:
// val leftOuterJoinRdd = 
println("RDD from leftOuterJoin(): " + leftOuterJoinRdd.collect().mkString(", "))

RDD from leftOuterJoin(): (5,(K,Some(Z))), (2,(J,None)), (3,(A,Some(W)))


leftOuterJoinRdd = MapPartitionsRDD[32] at leftOuterJoin at <console>:31


MapPartitionsRDD[32] at leftOuterJoin at <console>:31

`rightOuterJoin()` --> right-outer-joins two RDDs by key; every key of the second RDD is kept and missing left-hand values appear as `None`

In [19]:
// val rightOuterJoinRdd = 
println("RDD from rightOuterJoin(): " + rightOuterJoinRdd.collect().mkString(", "))

RDD from rightOuterJoin(): (5,(Some(K),Z)), (3,(Some(A),W)), (7,(None,B))


rightOuterJoinRdd = MapPartitionsRDD[35] at rightOuterJoin at <console>:31


MapPartitionsRDD[35] at rightOuterJoin at <console>:31
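Both outer joins can be sketched as follows, assuming `sc` is the notebook's SparkContext. The side that may be missing is wrapped in an `Option`, so downstream code has to handle `None`; the `'-'` placeholder below is a hypothetical default chosen only for illustration.

```scala
// Assumption: `sc` is the notebook's SparkContext. Same data as pairRdd1 and pairRdd2.
val pairRdd1 = sc.parallelize(List((3, 'A'), (2, 'J'), (5, 'K')))
val pairRdd2 = sc.parallelize(List((5, 'Z'), (3, 'W'), (7, 'B')))

// Keeps every key of pairRdd1: RDD[(Int, (Char, Option[Char]))]
val leftOuterJoinRdd = pairRdd1.leftOuterJoin(pairRdd2)

// Keeps every key of pairRdd2: RDD[(Int, (Option[Char], Char))]
val rightOuterJoinRdd = pairRdd1.rightOuterJoin(pairRdd2)

// Replace a missing right-hand value with a placeholder (hypothetical default).
val leftFilled = leftOuterJoinRdd.mapValues { case (l, r) => (l, r.getOrElse('-')) }

println("RDD from leftOuterJoin(): " + leftOuterJoinRdd.collect().mkString(", "))
println("RDD from rightOuterJoin(): " + rightOuterJoinRdd.collect().mkString(", "))
```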

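`cogroup()` --> groups the values of two RDDs by key; each key present in either RDD is paired with the collection of its values from each RDD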

In [20]:
// val cogroupedRdd = 
println("RDD from cogroup(): " + cogroupedRdd.collect().mkString(", "))

RDD from cogroup(): (5,(CompactBuffer(K),CompactBuffer(Z))), (2,(CompactBuffer(J),CompactBuffer())), (3,(CompactBuffer(A),CompactBuffer(W))), (7,(CompactBuffer(),CompactBuffer(B)))


cogroupedRdd = MapPartitionsRDD[37] at cogroup at <console>:31


MapPartitionsRDD[37] at cogroup at <console>:31
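A minimal sketch of the cogroup cell, assuming `sc` is the notebook's SparkContext. A key that is missing from one of the RDDs gets an empty collection on that side.

```scala
// Assumption: `sc` is the notebook's SparkContext. Same data as pairRdd1 and pairRdd2.
val pairRdd1 = sc.parallelize(List((3, 'A'), (2, 'J'), (5, 'K')))
val pairRdd2 = sc.parallelize(List((5, 'Z'), (3, 'W'), (7, 'B')))

// Result type: RDD[(Int, (Iterable[Char], Iterable[Char]))]
val cogroupedRdd = pairRdd1.cogroup(pairRdd2)

println("RDD from cogroup(): " + cogroupedRdd.collect().mkString(", "))
```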

## Actions Available on Pair RDDs

In addition to the actions explained in Chapter 3, we can perform the following additional actions on key/value RDDs.

`countByKey()` --> counts the number of elements for each key and returns the result to the driver as a map

In [21]:
// println("countByKey(): " + pairRdd)

countByKey(): Map(4 -> 2, 1 -> 2, 2 -> 1, 3 -> 2)


`collectAsMap()` --> collects the key/value RDD to the driver as a Map; when a key occurs more than once, only one of its values is kept.

In [22]:
// println("collectAsMap(): " + pairRdd)

collectAsMap(): Map(2 -> 4, 4 -> 16, 1 -> 1, 3 -> 9)


`lookup()` --> returns all the values associated with a given key

In [23]:
// println("lookup(4): " + pairRdd)

lookup(4): WrappedArray(16, 16)
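The three actions above can be sketched together as follows, assuming `sc` is the notebook's SparkContext. All of them return plain Scala collections to the driver, so they are best used on small results.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

val counts = pairRdd.countByKey()   // Map from key to number of elements (Long)
val asMap = pairRdd.collectAsMap()  // Map with one value per key
val found = pairRdd.lookup(4)       // Seq with every value stored under key 4

println("countByKey(): " + counts)
println("collectAsMap(): " + asMap)
println("lookup(4): " + found)
```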


## Partitions

Finally, we discuss two additional operations that control the partitioning of key/value RDDs.

`repartition(n)` --> redistributes the RDD into n partitions, which triggers a shuffle

In [24]:
println("Repartition of an RDD: ")
// pairRdd

Repartition of an RDD: 


[[(1,1), (2,4), (4,16), (3,9)], [(4,16), (1,1), (3,9)]]
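One way to inspect the result of `repartition()` is `glom()`, which turns each partition into an array. The sketch assumes `sc` is the notebook's SparkContext; which pairs end up in which partition is not guaranteed and may differ from the output above.

```scala
// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

// Shuffle the data into exactly two partitions.
val repartitioned = pairRdd.repartition(2)

// glom() yields one Array per partition, so collect() returns Array[Array[(Int, Int)]].
val partitions = repartitioned.glom().collect()

println("Repartition of an RDD: ")
println(partitions.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]"))
```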

`partitionBy()` --> partitions the RDD by key using the given `Partitioner`

In [25]:
import org.apache.spark.HashPartitioner

In [26]:
val myPartitioner = new HashPartitioner(2)

myPartitioner = org.apache.spark.HashPartitioner@2


org.apache.spark.HashPartitioner@2

In [27]:
print("Custom partitioning using partitionBy()")
// pairRdd

Custom partitioning using partitionBy()

[[(4,16), (2,4), (4,16)], [(1,1), (1,1), (3,9), (3,9)]]
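The custom partitioning above can be sketched as follows, assuming `sc` is the notebook's SparkContext. A `HashPartitioner` with two partitions assigns each pair to the partition given by its key's hash code modulo 2, so the even keys (2 and 4) land in partition 0 and the odd keys (1 and 3) in partition 1.

```scala
import org.apache.spark.HashPartitioner

// Assumption: `sc` is the notebook's SparkContext; the data mirrors pairRdd.
val pairRdd = sc.parallelize(List((1, 1), (4, 16), (2, 4), (4, 16), (1, 1), (3, 9), (3, 9)))

val myPartitioner = new HashPartitioner(2)

// Place every pair in the partition chosen by the partitioner for its key.
val partitioned = pairRdd.partitionBy(myPartitioner)

// One Array per partition, in partition order.
val parts = partitioned.glom().collect()

print("Custom partitioning using partitionBy()")
println(parts.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]"))
```

Because all pairs with the same key now live in the same partition, later key-based operations on `partitioned` (for example `reduceByKey()` with the same partitioner) can avoid another shuffle.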