# mapValues():
            Syntax: RDD.mapValues(<function>)
The mapValues() transformation passes each value in a key/value pair RDD through a function
(a named or anonymous function specified by the <function> argument) without changing the
keys. Like its generalized equivalent map(), mapValues() outputs one element for each input
element.
    
The original RDD’s partitioning is not affected.

# flatMapValues():
            Syntax: RDD.flatMapValues(<function>)
The flatMapValues() transformation passes each value in a key/value pair RDD through a function without changing the keys and flattens the results. It works like flatMap(),
which we looked at earlier, returning zero to many output elements per input element, each paired with the original key.

Much as with mapValues(), with flatMapValues() the original RDD’s partitioning is retained.
The easiest way to contrast mapValues() and flatMapValues() is to look at a practical example.
Consider a text file containing a city and a pipe-delimited list of temperatures, as shown here:

Hayward,71|69|71|71|72

Baumholder,46|42|40|37|39

Alexandria,50|48|51|53|44

Melbourne,88|101|85|77|74

Listing 4.29 simulates the loading of this data into an RDD and uses mapValues() to create a list
of key/value pair tuples containing the city and a list of temperatures for the city. It then applies
flatMapValues() with the same split function to the same RDD to create tuples containing
the city and a single number, one tuple for each temperature recorded for the city.

A simple way to describe this is that mapValues() creates one element per city containing the
city name and a list of five temperatures for the city, whereas flatMapValues() flattens the lists
to create five elements per city with the city name and a temperature value.
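
The contrast can also be modeled without Spark. The following is a minimal plain-Python sketch, not PySpark itself: the helpers `map_values` and `flat_map_values` are hypothetical stand-ins that mimic what the two transformations do to a list of key/value pairs, and the shortened input rows are made up for brevity.

```python
# Plain-Python model of mapValues() and flatMapValues(); no Spark required.
# map_values and flat_map_values are hypothetical helpers, not PySpark APIs.

def map_values(pairs, func):
    """Apply func to each value; emit exactly one pair per input pair."""
    return [(key, func(value)) for key, value in pairs]

def flat_map_values(pairs, func):
    """Apply func to each value; emit one pair per item that func returns."""
    return [(key, item) for key, value in pairs for item in func(value)]

pairs = [("Hayward", "71|69|71"), ("Baumholder", "46|42")]
split_temps = lambda value: [int(t) for t in value.split("|")]

mapped = map_values(pairs, split_temps)
flattened = flat_map_values(pairs, split_temps)

print(mapped)     # [('Hayward', [71, 69, 71]), ('Baumholder', [46, 42])]
print(flattened)  # [('Hayward', 71), ('Hayward', 69), ('Hayward', 71), ('Baumholder', 46), ('Baumholder', 42)]
```

The mapped result has as many elements as the input (two), whereas the flattened result has one element per temperature (five).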

In [1]:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)

In [2]:
locwtemps = sc.parallelize(
        ['Hayward,71|69|71|71|72',
        'Baumholder,46|42|40|37|39',
        'Alexandria,50|48|51|53|44',
        'Melbourne,88|101|85|77|74']
)
kvpairs = locwtemps.map(lambda x: x.split(','))
kvpairs.take(4)
# returns :
# [['Hayward', '71|69|71|71|72'],
# ['Baumholder', '46|42|40|37|39'],
# ['Alexandria', '50|48|51|53|44'],
# ['Melbourne', '88|101|85|77|74']]
locwtemplist = kvpairs.mapValues(lambda x: x.split('|')).mapValues(lambda x: [int(s) for s in x])
# The first mapValues() splits each value into a list of strings:
# [('Hayward', ['71', '69', '71', '71', '72']),
#  ('Baumholder', ['46', '42', '40', '37', '39']),
#  ('Alexandria', ['50', '48', '51', '53', '44']),
#  ('Melbourne', ['88', '101', '85', '77', '74'])]
# The second mapValues() converts each list of strings into a list of integers.
locwtemplist.take(4)

[('Hayward', [71, 69, 71, 71, 72]),
 ('Baumholder', [46, 42, 40, 37, 39]),
 ('Alexandria', [50, 48, 51, 53, 44]),
 ('Melbourne', [88, 101, 85, 77, 74])]

In [5]:
locwtemps = kvpairs.flatMapValues(lambda x: x.split('|')).map(lambda x: (x[0], int(x[1])))

locwtemps.take(8)

[('Hayward', 71),
 ('Hayward', 69),
 ('Hayward', 71),
 ('Hayward', 71),
 ('Hayward', 72),
 ('Baumholder', 46),
 ('Baumholder', 42),
 ('Baumholder', 40)]

Grouping, aggregation, and sorting operations are functionally analogous to their more generalized forms discussed earlier in this chapter (groupBy() and sortBy()), again with the difference
being that these functions operate specifically on RDDs composed of key/value pairs.

<h3>Be Cautious of the Repartitioning and Shuffling Effects of Some Functions</h3>
Be aware that some functions, such as groupByKey() and reduceByKey(), may result in
a repartitioning or require shuffling. Shuffling is a relatively expensive operation because it
requires the movement of data between Spark Executors, often located on different Worker
nodes. These operations are often necessary and unavoidable, but in some cases, by understanding
the planning and execution of an RDD’s lineage, you may be able to optimize these
operations. We discuss partitioning in more detail in Chapter 5.

# groupByKey():
                Syntax: RDD.groupByKey(numPartitions=None, partitionFunc=<hash_fn>)
The groupByKey() transformation groups the values for each key in a key/value pair RDD into a
single sequence.

The numPartitions argument specifies how many partitions the resulting RDD has. This is not
the number of key groups: a single partition can hold the groups for many keys. Keys are assigned
to partitions using the partitionFunc argument, which defaults to Spark’s built-in hash partitioner.
If numPartitions is None, which is the default, then the configured
system default number of partitions is used (spark.default.parallelism).
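
To illustrate how numPartitions and partitionFunc interact, here is a minimal plain-Python sketch. It is an assumption-laden model, not Spark's implementation: `group_by_key` and `toy_hash` are hypothetical names, and `toy_hash` is a deterministic stand-in for Spark's own hash function.

```python
# Plain-Python model of how grouped keys are placed into partitions.
# group_by_key and toy_hash are hypothetical; Spark uses its own hash function.
from collections import defaultdict

def toy_hash(key):
    """Deterministic stand-in for a partition function."""
    return sum(ord(ch) for ch in key)

def group_by_key(pairs, num_partitions, partition_func=toy_hash):
    """Return a list of partitions, each mapping key -> list of values."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        index = partition_func(key) % num_partitions  # target partition
        partitions[index][key].append(value)
    return [dict(p) for p in partitions]

pairs = [("Hayward", 71), ("Baumholder", 46), ("Hayward", 69),
         ("Alexandria", 50), ("Melbourne", 88), ("Baumholder", 42)]

result = group_by_key(pairs, num_partitions=2)
num_keys = sum(len(p) for p in result)
print(len(result), num_keys)  # 2 4
```

Two partitions hold four key groups between them, and all values for a given key always land in the same partition because the partition index depends only on the key.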

Consider the output from Listing 4.29. If you want to calculate the average temperature by city,
you first need to group all the values together by their city and then compute the averages.
Listing 4.30 shows how to use groupByKey() to accomplish this.

In [8]:
# continued from Listing 4.29
grouped = locwtemps.groupByKey()
grouped.take(4)
#Result:
# [
#  ('Hayward', <pyspark.resultiterable.ResultIterable at 0x7f602435c190>),
#  ('Baumholder', <pyspark.resultiterable.ResultIterable at 0x7f602435c0d0>),
#  ('Alexandria', <pyspark.resultiterable.ResultIterable at 0x7f602435c490>),
#  ('Melbourne', <pyspark.resultiterable.ResultIterable at 0x7f602435c1d0>)
# ]
avgtemps = grouped.mapValues(lambda x: sum(x)/len(x))
avgtemps.collect()

[('Hayward', 70.8),
 ('Baumholder', 40.8),
 ('Alexandria', 49.2),
 ('Melbourne', 85.0)]

Notice that groupByKey() returns a ResultIterable object for the grouped values. An iterable
object in Python is an object that you can loop over. Many built-in Python functions accept iterables
as input, such as sum(); a ResultIterable also reports its length, which is why len() works on it here.
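
As a small aside, not every iterable supports len(). The sketch below uses only built-in Python types (no Spark objects) to show the difference between a sized iterable, which works with both sum() and len(), and a plain iterator, which does not have a length.

```python
# Sized iterables work with both sum() and len(); plain iterators do not.
temps = [71, 69, 71, 71, 72]      # a list is iterable and has a length
average = sum(temps) / len(temps)
print(average)                    # 70.8

temps_iter = iter(temps)          # a plain iterator has no __len__
try:
    len(temps_iter)
    has_len = True
except TypeError:
    has_len = False
print(has_len)                    # False
```

If a function you pass to mapValues() needs the length of a grouped value and you are unsure whether the object is sized, converting it with list() first is a safe option.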
<h3>Consider Using reduceByKey() or foldByKey() Instead of groupByKey()</h3>
If you group values for the purposes of aggregation, such as by using a sum() or count() for
each key, then using reduceByKey() or foldByKey() provides much better performance in
many cases. This is because the results of the aggregation function are combined before the
shuffle, resulting in a reduced amount of data being shuffled.

# reduceByKey():
            Syntax: RDD.reduceByKey(function, numPartitions=None, partitionFunc=<hash_fn>)
The reduceByKey() transformation merges the values for each key by using an associative and
commutative function. The reduceByKey() method is called on a dataset of key/value pairs and returns a dataset of
key/value pairs, aggregating values for each key. The function takes two values and returns one, and is expressed as follows:

$$v_n, v_{n+1} \Rightarrow v_{\text{result}}$$
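
The two-values-in, one-value-out form is the same shape that Python's functools.reduce() expects. The following sketch uses it on one city's readings from the sample data; the variable names are illustrative only.

```python
# A reduce function takes two values and returns one value of the same type.
from functools import reduce

add = lambda v_n, v_next: v_n + v_next

hayward = [71, 69, 71, 71, 72]
total = reduce(add, hayward)
print(total)  # 354

# Associativity: regrouping the values does not change the result.
regrouped = add(reduce(add, hayward[:2]), reduce(add, hayward[2:]))
print(regrouped == total)  # True
```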

The numPartitions and partitionFunc arguments behave exactly the same as in the groupByKey()
function. The numPartitions value is effectively the number of reduce tasks to execute,
and you can increase this to obtain a higher degree of parallelism. The numPartitions value
also affects the number of files produced with saveAsTextFile() or other file-producing Spark
actions. For instance, numPartitions=2 produces two output files when the RDD is saved to disk.

Listing 4.31 takes the same input key/value pairs and produces the same results (average temperatures
per city) as the previous groupByKey() example, but using the reduceByKey() function
instead. This method is preferred for reasons we will discuss shortly.

In [12]:
# continued from Listing 4.29
temptups = locwtemps.mapValues(lambda x: (x, 1))
# creates tuples (city, (temp, 1))
temptups.collect()
# RETURNS:
# [
#  ('Hayward', (71, 1)),
#  ('Hayward', (69, 1)),
#  ('Hayward', (71, 1)),
#  ('Hayward', (71, 1)),
#  ('Hayward', (72, 1)),
#  ('Baumholder', (46, 1)),
#  ('Baumholder', (42, 1)),
#  ('Baumholder', (40, 1)),
#  ('Baumholder', (37, 1)),
#  ('Baumholder', (39, 1)),
#  ('Alexandria', (50, 1)),
#  ('Alexandria', (48, 1)),
#  ('Alexandria', (51, 1)),
#  ('Alexandria', (53, 1)),
#  ('Alexandria', (44, 1)),
#  ('Melbourne', (88, 1)),
#  ('Melbourne', (101, 1)),
#  ('Melbourne', (85, 1)),
#  ('Melbourne', (77, 1)),
#  ('Melbourne', (74, 1))
# ]
inputstoavg = temptups.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
inputstoavg.collect()
# RETURNS:
# [
#  ('Hayward', (354, 5)),
#  ('Baumholder', (204, 5)),
#  ('Alexandria', (246, 5)),
#  ('Melbourne', (425, 5))
# ]
# each value is now a (sum of temperatures, number of readings) tuple for the city
averages = inputstoavg.map(lambda x: (x[0], x[1][0]/x[1][1]))
# divides the sum of temperatures by key by the number of readings
averages.take(4)

[('Hayward', 70.8),
 ('Baumholder', 40.8),
 ('Alexandria', 49.2),
 ('Melbourne', 85.0)]

Averaging is not an associative operation; you can get around this by creating tuples containing
the sum total of values for each key and the count for each key—operations that are associative
and commutative—and then computing the average as a final step, as shown in Listing 4.31.
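
A short plain-Python check makes the problem concrete. The split of Hayward's readings into two partitions below is made up for illustration; the point is only that averaging partial averages gives a different answer from merging (sum, count) tuples.

```python
# Why averages cannot be merged directly, but (sum, count) tuples can.
part_a = [71, 69, 71]   # Hayward readings that happen to sit in one partition
part_b = [71, 72]       # the remaining Hayward readings in another partition

mean = lambda values: sum(values) / len(values)

# Wrong: averaging the per-partition averages weights both partitions equally.
avg_of_avgs = mean([mean(part_a), mean(part_b)])

# Right: merge (sum, count) tuples, then divide once at the end.
merge = lambda x, y: (x[0] + y[0], x[1] + y[1])
total, count = merge((sum(part_a), len(part_a)), (sum(part_b), len(part_b)))
true_avg = total / count

print(round(avg_of_avgs, 2), true_avg)  # 70.92 70.8
```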


Note that reduceByKey() is efficient because it combines values locally on each Executor before
the combined partial results are sent to the remote Executor or Executors running the final reduce stage.
Sending these partial results is a shuffle operation.

Because the same associative and commutative function is run on the local Executor or Worker
and again on a remote Executor or Executors, you can think of this, taking a sum function as an
example, as adding a list of sums as opposed to summing a bigger list of individual values. Because
there is less data sent in the shuffle phase, reduceByKey() using a sum function generally
performs better than groupByKey() followed by a sum() function.
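
The "list of sums" idea can be sketched in plain Python by counting how many records would cross the shuffle boundary in each approach. This is a simplified model under stated assumptions: the two-partition layout is invented, and `combine_locally` is a hypothetical helper standing in for the local combine step.

```python
# Counts how many records would cross the shuffle with and without
# local combining. Plain-Python model; the partition layout is made up.
partitions = [
    [("Hayward", 71), ("Hayward", 69), ("Melbourne", 88), ("Hayward", 71)],
    [("Melbourne", 101), ("Hayward", 72), ("Melbourne", 85)],
]

def combine_locally(partition, func):
    """Merge values per key inside one partition (the local combine)."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

add = lambda x, y: x + y

# groupByKey-style: every record is shuffled as is.
shuffled_without_combine = sum(len(p) for p in partitions)

# reduceByKey-style: one partial sum per key per partition is shuffled.
partials = [combine_locally(p, add) for p in partitions]
shuffled_with_combine = sum(len(p) for p in partials)

# Final reduce: add the partial sums for each key.
totals = {}
for partial in partials:
    for key, value in partial.items():
        totals[key] = add(totals[key], value) if key in totals else value

print(shuffled_without_combine, shuffled_with_combine)  # 7 4
print(totals)  # {'Hayward': 283, 'Melbourne': 274}
```

The saving grows with the number of records per key in each partition; with only one record per key per partition, the two approaches would shuffle the same number of records.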

# foldByKey():
               Syntax: RDD.foldByKey(zeroValue, function, numPartitions=None, partitionFunc=<hash_fn>)
The foldByKey() transformation is functionally similar to the fold() action discussed in the
previous section. However, foldByKey() is a transformation that works with predefined key/value
pair elements (see Listing 4.32). Both foldByKey() and fold() take a zeroValue argument
of the same type as the values. It is the starting value for the aggregation (for foldByKey(), the
starting value for each key), so it should be neutral for the function, such as 0 for addition. It is
also what fold() returns if the RDD is empty.

The function supplied is in the generalized aggregate function form:

$$v_n, v_{n+1} \Rightarrow v_{\text{result}}$$

This is the same generalization used by the reduceByKey() transformation.
The numPartitions and the partitionFunc arguments have the same effect as they do with the
groupByKey() and reduceByKey() transformations.
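
The choice of zeroValue matters when the function is not addition. The sketch below models the fold for a single key with functools.reduce(); it is a plain-Python illustration, and the below-zero readings are hypothetical values that are not part of the sample data.

```python
# Plain-Python model of folding one key's values with a zero value.
from functools import reduce

keep_max = lambda x, y: x if x > y else y

winter_temps = [-12, -7, -20]   # hypothetical readings, all below zero

# 0 is not neutral for max: it beats every negative reading.
wrong = reduce(keep_max, winter_temps, 0)

# Negative infinity is neutral for max: any real reading replaces it.
right = reduce(keep_max, winter_temps, float("-inf"))

print(wrong, right)  # 0 -7
```

A zeroValue of 0 is adequate for a maximum over data whose readings are all positive, but a value such as float("-inf") is the safer choice in general.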

In [14]:
#continued from Listing 4.29
locwtemps.collect()
# RETURNS:
# [
#  ('Hayward', 71),
#  ('Hayward', 69),
#  ('Hayward', 71),
#  ('Hayward', 71),
#  ('Hayward', 72),
#  ('Baumholder', 46),
#  ('Baumholder', 42),
#  ('Baumholder', 40),
#  ('Baumholder', 37),
#  ('Baumholder', 39),
#  ('Alexandria', 50),
#  ('Alexandria', 48),
#  ('Alexandria', 51),
#  ('Alexandria', 53),
#  ('Alexandria', 44),
#  ('Melbourne', 88),
#  ('Melbourne', 101),
#  ('Melbourne', 85),
#  ('Melbourne', 77),
#  ('Melbourne', 74)
# ]
maxbycity = locwtemps.foldByKey(0, lambda x, y: x if x > y else y)
maxbycity.collect()

[('Hayward', 72), ('Baumholder', 46), ('Alexandria', 53), ('Melbourne', 101)]

# sortByKey():
        Syntax: RDD.sortByKey(ascending=True, numPartitions=None, keyfunc=function)
The sortByKey() transformation sorts a key/value pair RDD by the predefined key. The sort
order is dependent on the underlying key object type, where numeric types are sorted numerically
and so on. The difference between sortBy(), discussed earlier, and sortByKey() is that
sortBy() requires you to identify the key by which to sort, whereas sortByKey() is aware of the
key already.

Keys are sorted in the order provided by the ascending argument, which defaults to True. The
numPartitions argument specifies how many resultant partitions to output using a range partitioning
function. The keyfunc argument is an optional parameter to use if you want to derive a
key from passing the predefined key through another function, as in this example:

keyfunc=lambda k: k.lower()
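
The effect of such a key function can be previewed with Python's built-in sorted(), which accepts a comparable `key` argument. This is an analogy in plain Python rather than a Spark call, and the mixed-case city names are invented for the illustration.

```python
# Plain-Python analogue of sortByKey(keyfunc=...): sorted() with a key function.
pairs = [("melbourne", 88), ("Hayward", 71), ("alexandria", 50), ("Baumholder", 46)]

# Default string ordering puts uppercase letters before lowercase ones.
default_order = [city for city, _ in sorted(pairs, key=lambda kv: kv[0])]

# Lowercasing the key first gives a case-insensitive ordering.
keyfunc = lambda k: k.lower()
folded_order = [city for city, _ in sorted(pairs, key=lambda kv: keyfunc(kv[0]))]

print(default_order)  # ['Baumholder', 'Hayward', 'alexandria', 'melbourne']
print(folded_order)   # ['alexandria', 'Baumholder', 'Hayward', 'melbourne']
```

The key function only affects the comparison; the keys stored in the result are unchanged.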

Listing 4.33 shows the use of the sortByKey() transformation. The first example shows a simple
sort based on the key: a string representing the city name, sorted alphabetically. In the second
example, the keys and values are inverted to make the temperature the key, and then sortByKey()
is used to list the temperatures in descending numeric order, with the highest temperatures first.

In [12]:
# continued from Listing 4.29
sortedbykey = locwtemps.sortByKey()
sortedbykey.take(4)
# returns:
# [('Alexandria', 50), ('Alexandria', 48), ('Alexandria', 51), ('Alexandria', 53)]
sortedbyval = locwtemps.map(lambda x: (x[1],x[0])) \
.sortByKey(ascending=False)
sortedbyval.take(4)

[(101, 'Melbourne'), (88, 'Melbourne'), (85, 'Melbourne'), (77, 'Melbourne')]