# Advanced RDD functions

For processing (key, value) RDDs, there are a few functions that simplify common flow patterns. 

## MapValues

In contrast to the map() transformation, **mapValues( f( v -> v' ) )** can only be used on RDDs of (key, value) pairs. It applies a function f to map every value v to v', leaves the key untouched, and results in (key, v') pairs.

In [None]:
numbers = sc.parallelize([('John', 3), ('Peter', 8), ('Peter', 10), ('Mike', 4)])

In [None]:
numbers.mapValues(lambda x: x + 1).collect()
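
To make the semantics concrete, the following is a minimal plain-Python sketch of what mapValues computes, with no Spark involved. The helper name `map_values` is hypothetical and only serves as an illustration; it is not how Spark implements the transformation.

```python
# Plain-Python model of mapValues; `map_values` is a hypothetical helper, not a Spark API.
def map_values(pairs, f):
    """Apply f to the value of every (key, value) pair and keep the key as it is."""
    return [(key, f(value)) for key, value in pairs]

local_numbers = [('John', 3), ('Peter', 8), ('Peter', 10), ('Mike', 4)]
incremented = map_values(local_numbers, lambda x: x + 1)
print(incremented)
# [('John', 4), ('Peter', 9), ('Peter', 11), ('Mike', 5)]
```

The keys, including the duplicate key 'Peter', pass through unchanged; only the values are transformed.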

## Flattening

Mapping an RDD of N elements always results in an RDD of exactly N elements, since every element is transformed into exactly one value. When we want to map an element to zero or more elements, the usual approach starts by mapping every element to a list, which can contain zero or more elements. An RDD of lists can then be **flatten**ed to an RDD of the elements that are contained in the lists, using a **flatten** operation.

Consider the case where we want to process all the words in a text file. Initially, the *textFile()* method creates an element for every line. Some lines are empty (contain no words), but most lines contain multiple words. To transform the lines to words, we can first map each line to a list of words using Python's **split()** function, and then flatten the result to obtain an RDD of words.

Let us first consider a small in-memory example.

In [None]:
lines = sc.parallelize(["This is line 1", "", 
                        "This is line two", "Three", 
                        "This is the last line"])
lists = lines.map(lambda x: x.split())
print(lists.count(), lists.collect())

The RDD `lists` contains 5 elements: 5 lists of 0 or more words. When you try to flatten the RDD, you will notice that Spark actually has no **flatten** transformation, but offers a **flatMap()** transformation instead. Presumably, since flattening is a relatively expensive operation and combining a flatten with a map is relatively inexpensive, the absence of a flatten transformation may make programmers more conscious about always combining a map with a flatten. In any case, we can use **flatMap(lambda x:x)**, with a so-called identity function, to just flatten.

In [None]:
words = lists.flatMap(lambda x:x)
print(words.count(), words.collect())

For educational purposes, it is good to see how the lines are mapped to lists and then how the RDD is flattened. However, this flow is commonly written in one operation.

In [None]:
words2 = lines.flatMap(lambda x: x.split())
print(words2.count(), words2.collect())
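
The combined map-and-flatten step can also be modelled in plain Python, which may help to see what flatMap does to each element. This is only a sketch under the assumption that an in-memory list stands in for the RDD; the helper `flat_map` is hypothetical and not part of Spark.

```python
from itertools import chain

# Plain-Python model of flatMap; `flat_map` is a hypothetical helper, not a Spark API.
def flat_map(elements, f):
    """Map every element to an iterable with f, then concatenate those iterables."""
    return list(chain.from_iterable(f(element) for element in elements))

local_lines = ["This is line 1", "",
               "This is line two", "Three",
               "This is the last line"]
local_words = flat_map(local_lines, lambda x: x.split())
print(len(local_words), local_words)
# 14 ['This', 'is', 'line', '1', 'This', 'is', 'line', 'two', 'Three', 'This', 'is', 'the', 'last', 'line']
```

The empty line contributes an empty list and therefore no words, so 5 lines turn into 4 + 0 + 4 + 1 + 5 = 14 words.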

It is important to realize that flattening only removes **one nested level**. To show what happens, we force a scenario in which the elements of an RDD are lists that contain both words and nested lists.

In [None]:
foolists = lines.map(lambda x: ['Foo', x.split()])
print(foolists.count(), foolists.collect())

Observe carefully: the outermost [] represents the RDD. Within the RDD, the first element is `['Foo', ['This', 'is', 'line', '1']]`, which contains a word and a list. Flattening removes the list surrounding each element, so that every element yields two new elements; for the first element these are `'Foo'` and `['This', 'is', 'line', '1']`. The empty list that is embedded in another list does not simply vanish, but remains as an element of its own. Therefore the result contains 10 elements, 2 elements for every line. Note that mixtures of lists and non-lists are rarely used in Spark, because they make processing more complex.

In [None]:
foowords = foolists.flatMap(lambda x:x)
print(foowords.count(), foowords.collect())
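
The one-level behaviour can be checked outside Spark as well. The sketch below assumes plain Python lists in place of RDDs and uses `itertools.chain.from_iterable` as a stand-in for flattening with the identity function; the `local_` names are only illustrative.

```python
from itertools import chain

local_lines = ["This is line 1", "",
               "This is line two", "Three",
               "This is the last line"]

# Every element is a list holding a word and a nested list of words.
local_foolists = [['Foo', line.split()] for line in local_lines]

# Flatten exactly one level: the outer list of each element is removed,
# the nested lists (including the empty one) stay intact.
flattened_once = list(chain.from_iterable(local_foolists))
print(len(flattened_once))   # 10
print(flattened_once[:4])    # ['Foo', ['This', 'is', 'line', '1'], 'Foo', []]
```

The empty list produced by the empty line is still present as the fourth element, which is why the count is 10 rather than 9.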

## flatMapValues

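There is also a combined transformation **flatMapValues(f)**, which is equivalent to a `mapValues(f)` followed by flattening the resulting values: every item in the list that f returns becomes a separate (key, value) pair. Flattening the values of (key, value) pairs in this way causes the mapped values to remain bound to their original keys. Note that this differs from a `flatMap(i)` where i is the identity function, which would split every (key, list) pair into its key and its list.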

In [None]:
kv = sc.parallelize([(1, "This is line 1"), 
                     (2, ""), 
                     (3, "This is line two"),
                     (4, "Three"), 
                     (5, "This is the last line") ])
# Each value is split into words; every word stays paired with its original key.
keyed_words = kv.flatMapValues(lambda x: x.split())
print(keyed_words.count(), keyed_words.collect())
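
As a rough plain-Python model of this transformation, the sketch below pairs every item that f produces with the key of the pair it came from. The helper `flat_map_values` is hypothetical and operates on an ordinary list instead of an RDD; it illustrates the semantics rather than Spark's implementation.

```python
# Plain-Python model of flatMapValues; `flat_map_values` is a hypothetical helper, not a Spark API.
def flat_map_values(pairs, f):
    """Map every value to an iterable with f and emit one (key, item) pair per item."""
    return [(key, item) for key, value in pairs for item in f(value)]

local_kv = [(1, "This is line 1"),
            (2, ""),
            (3, "This is line two"),
            (4, "Three"),
            (5, "This is the last line")]
local_pairs = flat_map_values(local_kv, lambda x: x.split())
print(len(local_pairs))   # 14
print(local_pairs[:5])
# [(1, 'This'), (1, 'is'), (1, 'line'), (1, '1'), (3, 'This')]
```

Key 2 does not appear in the result, because its value splits into an empty list and so yields no pairs.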