# Working with Key/Value Pairs

Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL (extract, transform, and load) to get our data into a key/value format.

## Motivation

Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.

## Creating pair RDDs

In [1]:
# creating a pair RDD using the first word as the key in Python
lines = sc.textFile("/Applications/spark-1.6.0-bin-hadoop2.6/README.md") # Create an RDD called lines
pairs = lines.map(lambda x: (x.split(" ")[0], x))

When creating a pair RDD from an in-memory collection in Scala and Python, we only need to call SparkContext.parallelize() on a collection of pairs.
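To make the shape of these pairs concrete, here is a small plain-Python sketch (no Spark needed) that keys each line by its first word, mirroring the map() call above. The sample `lines` list is invented for illustration; a list like `pairs` is the kind of collection that could then be handed to SparkContext.parallelize().

```python
# Plain-Python sketch; the sample lines are made up for this example.
lines = ["Apache Spark is fast", "Spark runs on Hadoop", "Apache Mesos too"]

# Same keying logic as the map() above: the first word becomes the key,
# and the whole line is kept as the value.
pairs = [(line.split(" ")[0], line) for line in lines]

for key, value in pairs:
    print("%s -> %s" % (key, value))
```

Each element is an ordinary two-element tuple, which is all Spark needs in Python to treat the collection as key/value data.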

## Transformation on Pair RDDs

Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements.

In [10]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

In [11]:
# Transformations on one pair RDD

In [13]:
rdd.reduceByKey(lambda x, y: x + y).collect()

[(1, 2), (3, 10)]

In [18]:
rdd.groupByKey().collect()

[(1, <pyspark.resultiterable.ResultIterable at 0x108a39c10>),
 (3, <pyspark.resultiterable.ResultIterable at 0x108a39850>)]

In [21]:
# Iterate over the grouped values of the second key to see what the ResultIterable holds
for x in rdd.groupByKey().collect()[1][1]:
    print(x)

4
6


In [22]:
# Apply a function to each value of a pair RDD without changing the key
rdd.mapValues(lambda x: x + 1).collect()

[(1, 3), (3, 5), (3, 7)]

Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization.

In [27]:
rdd.flatMapValues(lambda x: xrange(x-1, 6)).collect()

[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (3, 3), (3, 4), (3, 5), (3, 5)]
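The expansion can be easier to follow outside Spark. The sketch below emulates flatMapValues() on a plain list; `flat_map_values` is a hypothetical helper written for this illustration, not part of the PySpark API.

```python
def flat_map_values(pairs, f):
    """Hypothetical helper: emulate flatMapValues() on a plain list of pairs.

    f is applied to each value; every element it yields is paired
    with the original key.
    """
    return [(key, out) for key, value in pairs for out in f(value)]


pairs = [(1, 2), (3, 4), (3, 6)]

# Same function as in the cell above, using range() instead of xrange().
expanded = flat_map_values(pairs, lambda x: range(x - 1, 6))
print(expanded)

# Tokenization use case: one (key, word) entry per word in the value.
docs = [("doc1", "to be or"), ("doc2", "not to be")]
tokens = flat_map_values(docs, lambda text: text.split())
print(tokens)
```

The value 6 contributes only `(3, 5)` because `range(5, 6)` has a single element, which explains the repeated pair at the end of the output above.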

In [29]:
rdd.keys().collect()

[1, 3, 3]

In [31]:
rdd.values().collect()

[2, 4, 6]

In [32]:
rdd.sortByKey().collect()

[(1, 2), (3, 4), (3, 6)]

In [33]:
# Transformations on two pair RDDs

In [38]:
other = sc.parallelize({(3, 9)})

In [39]:
rdd.subtractByKey(other).collect()

[(1, 2)]

In [40]:
# perform inner join between two RDDs
rdd.join(other).collect()

[(3, (4, 9)), (3, (6, 9))]

In [41]:
rdd.rightOuterJoin(other).collect()

[(3, (4, 9)), (3, (6, 9))]

In [43]:
rdd.leftOuterJoin(other).collect()

[(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
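The three join flavors differ only in what happens to keys that are missing on one side. The following plain-Python sketch reproduces the results above on ordinary lists; the helper names are hypothetical, and the ordering shown here is not something Spark guarantees.

```python
from collections import defaultdict


def group_values(pairs):
    """Collect the values of each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def inner_join(left, right):
    """Keep only keys present on both sides."""
    right_by_key = group_values(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


def left_outer_join(left, right):
    """Keep every left key; a missing right value becomes None."""
    right_by_key = group_values(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [None])]


def right_outer_join(left, right):
    """Keep every right key; a missing left value becomes None."""
    left_by_key = group_values(left)
    return [(k, (lv, rv)) for k, rv in right for lv in left_by_key.get(k, [None])]


left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9)]

print(inner_join(left, right))
print(right_outer_join(left, right))
print(left_outer_join(left, right))
```

Key 1 appears only in the left-outer result because it has no partner in `right`.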

In [45]:
# Group data from both RDDs sharing the same key
rdd.cogroup(other).collect()

[(1,
  (<pyspark.resultiterable.ResultIterable at 0x108a03950>,
   <pyspark.resultiterable.ResultIterable at 0x108a39bd0>)),
 (3,
  (<pyspark.resultiterable.ResultIterable at 0x108a39e90>,
   <pyspark.resultiterable.ResultIterable at 0x108a39b90>))]
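The ResultIterable objects hide the grouped values. As a rough picture of what they contain, this plain-Python sketch emulates cogroup() on lists; the `cogroup` function here is a hypothetical stand-in, not the Spark implementation.

```python
def cogroup(left, right):
    """Hypothetical emulation of cogroup() on plain lists of pairs.

    For every key found on either side, return the values from the
    left collection and the values from the right collection.
    """
    keys = sorted(set(k for k, _ in left) | set(k for k, _ in right))
    return [
        (k, ([v for kk, v in left if kk == k],
             [v for kk, v in right if kk == k]))
        for k in keys
    ]


left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9)]

grouped = cogroup(left, right)
print(grouped)
```

With a real RDD, wrapping each ResultIterable in `list()` after collect() should reveal the same kind of contents.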

We can take our pair RDD and filter out lines that are 20 characters or longer, keeping only the shorter ones.

In [46]:
# simple filter on second element in Python
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)

## Aggregations

In [1]:
rdd = sc.parallelize({("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)})

In [3]:
# Per-key (sum, count) with reduceByKey() and mapValues(); dividing sum by count gives the per-key average
rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1])).collect()

[('pink', (7, 2)), ('panda', (1, 2)), ('pirate', (3, 1))]

In [4]:
# Word Count
rdd = sc.textFile("/Users/sergulaydore/Downloads/nyc16_bigdata1-master-3/week2b_thursday/shakespeare.txt")
words = rdd.flatMap(lambda x: x.split(" "))
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y)

In [7]:
result.take(10)

[(u'', 506198),
 (u'fawn', 11),
 (u'Fame,', 3),
 (u'mustachio', 1),
 (u'protested,', 1),
 (u'sending.', 3),
 (u'offendeth', 1),
 (u'instant;', 1),
 (u'scold', 4),
 (u'Sergeant.', 1)]
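The very large count for the empty string `u''` is most likely a side effect of `split(" ")`, which yields empty tokens for consecutive spaces and blank lines. The plain-Python sketch below shows the difference on a made-up sample; calling `split()` with no argument is one possible way to avoid counting empty tokens.

```python
from collections import Counter

# Invented sample text with a double space and a blank line.
text_lines = ["to be  or not to be", "", "that is the question"]

# split(" ") keeps an empty string for each extra space and for blank lines.
naive = Counter(w for line in text_lines for w in line.split(" "))

# split() with no argument splits on runs of whitespace and drops empties.
clean = Counter(w for line in text_lines for w in line.split())

print(naive[""])
print(clean[""])
print(clean["to"])
```

In the Spark version, the same change would go inside the flatMap() lambda.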

In [44]:
# Per-key average using combineByKey()
nums = sc.parallelize({(1, 2), (3, 4), (3, 6), (3, 7)})
sumCount = nums.combineByKey((lambda x: (x, 1)),
                            (lambda x, y: (x[0] + y, x[1] + 1)),
                            (lambda x, y: (x[0]+y[0], x[1]+y[1])))

In [39]:
sumCount.collect()

[(1, (2, 1)), (3, (17, 3))]

In [46]:
r = sumCount.mapValues(lambda xy: (xy[0]/xy[1])).collectAsMap()

In [49]:
print(r)

{1: 2, 3: 5}


There are many options for combining our data by key. Most of them are implemented on top of combineByKey() but provide a simpler interface. In any case, using one of the specialized aggregation functions in Spark can be much faster than the naive approach of grouping our data and then reducing it.
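To see how the three functions passed to combineByKey() cooperate, here is a simplified plain-Python emulation. The `combine_by_key` helper and the split of the data into two partitions are assumptions made for this sketch; real Spark decides the partitioning itself.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simplified emulation of combineByKey() over a list of partitions."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                # Key already seen in this partition: fold the value in.
                acc[key] = merge_value(acc[key], value)
            else:
                # First time the key appears in this partition.
                acc[key] = create_combiner(value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, combiner in acc.items():
            if key in merged:
                # Same key came from several partitions: merge the accumulators.
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Assumed layout: the four pairs are spread over two partitions.
partitions = [[(1, 2), (3, 4)], [(3, 6), (3, 7)]]

sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
averages = {k: s / float(c) for k, (s, c) in sum_count.items()}

print(sum_count)
print(averages)
```

The explicit `float()` avoids integer division; the `{1: 2, 3: 5}` result shown earlier reflects Python 2 truncating 17/3 to 5.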

### Tuning the level of parallelism

Every RDD has a fixed number of partitions that determine the degree of parallelism to use when executing operations on the RDD. When performing aggregations or grouping operations, we can ask Spark to use a specific number of partitions. Spark will always try to infer a sensible default value based on the size of your cluster, but in some cases you will want to tune the level of parallelism for better performance.

In [50]:
# reduceByKey() with custom parallelism
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x+y) # Default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x+y, 10) # Custom parallelism

PythonRDD[112] at RDD at PythonRDD.scala:43

## Grouping Data

If our data is already keyed in the way we want, groupByKey() will group our data using the key in our RDD. On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]].
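As a rough mental model, grouping by key behaves like building a dictionary of lists. The sketch below is plain Python with a hypothetical `group_by_key` helper; it also shows that reducing the grouped values gives the same sums as the earlier reduceByKey() call.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Hypothetical emulation of groupByKey(): key -> list of its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [(1, 2), (3, 4), (3, 6)]

grouped = group_by_key(pairs)
print(grouped)

# Grouping and then summing matches what reduceByKey() computed earlier.
sums = {key: sum(values) for key, values in grouped.items()}
print(sums)
```

In Spark, the reduceByKey() route is usually preferable for this kind of aggregation because it avoids materializing every value of a key before combining them.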

In addition to grouping data from a single RDD, we can group data sharing the same key from multiple RDDs using a function called cogroup().

### Joins

### Sorting Data

In [54]:
# Sorting our RDD by converting the integers to strings and using the string comparison functions
# sortByKey() returns a new RDD; nums itself is left unchanged
nums.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))

PythonRDD[120] at RDD at PythonRDD.scala:43

In [57]:
nums.collect()

[(1, 2), (3, 7), (3, 4), (3, 6)]
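The effect of a string keyfunc only shows up once keys have different numbers of digits. This plain-Python sketch uses `sorted()` on an invented list to contrast numeric and string ordering; like sortByKey(), `sorted()` returns a new collection and leaves the input as it was.

```python
# Invented sample with a two-digit key to make the difference visible.
pairs = [(10, "a"), (3, "b"), (1, "c")]

# Numeric ordering of the keys.
numeric = sorted(pairs, key=lambda kv: kv[0])

# String ordering: "1" < "10" < "3" because comparison is character by character.
as_text = sorted(pairs, key=lambda kv: str(kv[0]))

print(numeric)
print(as_text)
print(pairs)  # the original list is unchanged
```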

## Actions available on Pair RDDs

In [58]:
rdd = sc.parallelize({(1, 2), (3, 4), (3, 6)})

In [59]:
# count the number of elements for each key
rdd.countByKey()

defaultdict(<type 'int'>, {1: 1, 3: 2})

In [60]:
# collect the result as a map to provide easy lookup
rdd.collectAsMap()

{1: 2, 3: 6}

In [61]:
# return all values associated with the provided key
rdd.lookup(3)

[4, 6]
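For intuition, the three actions above can be approximated on a plain list. This is only a sketch of their behavior, not how Spark computes them; in particular, which value survives for a duplicated key in a map depends on element order, and here the last pair wins.

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): how many elements carry each key.
count_by_key = Counter(key for key, _ in pairs)

# collectAsMap(): a dict keeps a single value per key, so (3, 4) is overwritten.
as_map = dict(pairs)

# lookup(3): every value stored under key 3.
lookup_3 = [value for key, value in pairs if key == 3]

print(dict(count_by_key))
print(as_map)
print(lookup_3)
```

This is why collectAsMap() is a lossy choice when keys repeat, while lookup() still returns all values.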

## Data Partitioning

In a distributed program, communication is very expensive, so laying out data to minimize network traffic can greatly improve performance.

Partitioning will not be helpful in all applications - for example, if a given RDD is scanned only once, there is no point in partitioning it in advance. It is useful only when a dataset is reused multiple times in key-oriented operations such as joins.
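One way to picture why a shared partitioning helps joins: if two datasets place keys with the same hash rule, matching keys end up at the same partition index and can be combined without moving data. The sketch below is a simplified plain-Python model; `hash_partition`, the sample data, and the use of Python's built-in `hash()` are all assumptions for illustration and do not reproduce Spark's partitioner.

```python
def hash_partition(pairs, num_partitions):
    """Hypothetical hash partitioner: place each pair by hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


# Invented datasets that share integer keys.
user_data = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
events = [(3, "click"), (1, "view"), (3, "buy")]

user_parts = hash_partition(user_data, 2)
event_parts = hash_partition(events, 2)

# Matching keys sit at the same index in both layouts, so a join could
# proceed partition by partition.
for index in range(2):
    print(index, user_parts[index], event_parts[index])
```

The benefit only pays off when the partitioned dataset is reused, which matches the caveat above about RDDs that are scanned once.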