<h2>map vs mapPartitions</h2>
<ul>
<li><strong>map</strong> produces exactly one output element per input element, so it never changes the number of elements in an RDD, while <strong>mapPartitions</strong> may return more or fewer elements than it receives.</li>
<li>The method <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.map">map</a> returns a new distributed dataset formed by passing each <em>element</em> of the source through a function <i>func</i>. <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitions">mapPartitions</a> is similar to map, but runs separately on each <em>partition (block)</em> of the RDD, so <i>func</i> must be of type Iterator&lt;T&gt; =&gt; Iterator&lt;U&gt; when running on an RDD of type T.</li>
<li><strong>map</strong> applies the function once per element, while <strong>mapPartitions</strong> applies it once per partition.</li>
</ul>
<p><strong><em>Example Scenario</em></strong>: if a particular RDD partition holds 100K elements, <strong>map</strong> calls the mapping function 100K times. Conversely, <strong>mapPartitions</strong> calls the function only once, passing in an iterator over all 100K records and getting all the results back from that single call. This can yield a performance gain, especially when the function does something expensive on every call that it would not need to repeat if all the elements were passed in at once (as with <strong>mapPartitions</strong>).</p>
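<p>The scenario above can be sketched without Spark. This is a plain-Python model, not PySpark code: the nested lists stand in for partitions, and <code>expensive_setup</code>, <code>per_element</code> and <code>per_partition</code> are hypothetical names introduced only for illustration. It counts how often the costly setup runs under each style.</p>

```python
# Plain-Python model: each inner list stands in for one RDD partition.
partitions = [[1, 2], [3, 4, 5]]

setup_calls = {"per_element": 0, "per_partition": 0}

def expensive_setup(kind):
    # Stand-in for costly work such as opening a connection (hypothetical).
    setup_calls[kind] += 1
    return lambda x: x * 10

def per_element(x):
    # map-style: the function runs once per element, so the setup repeats.
    transform = expensive_setup("per_element")
    return transform(x)

def per_partition(iterator):
    # mapPartitions-style: the function runs once per partition and
    # receives an iterator over all of that partition's elements.
    transform = expensive_setup("per_partition")
    for x in iterator:
        yield transform(x)

map_result = [per_element(x) for part in partitions for x in part]
map_partitions_result = [
    y for part in partitions for y in per_partition(iter(part))
]
```

<p>Both styles produce the same values here; the difference is that the setup runs once per element in the first case and once per partition in the second.</p>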

In [1]:
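rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # 5 elements in 2 partitions
# map emits exactly one output per input, so 2 becomes None instead of disappearing
rdd.map(lambda a: a if a != 2 else None).collect()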

[1, None, 3, 4, 5]

In [2]:
def filter_out_2(partition):
    print("partition:")
    for element in partition:
        print(element)
        if element != 2:
            yield element
rdd.mapPartitions(filter_out_2).collect()

[1, 3, 4, 5]

<i>Shell output:</i><br/>partition:
3
4
5
partition:
1
2

<h2>mapValues(f)</h2>
<p>Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.</p>

In [3]:
rdd = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
rdd.mapValues(f).collect()

[('a', 3), ('b', 1)]
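<p>As a rough illustration of what "without changing the keys" means, here is a plain-Python model of the same operation on a list of pairs. <code>map_values</code> is a hypothetical helper, not part of PySpark, and the model says nothing about partitioning.</p>

```python
# Plain-Python model of mapValues on a list of (key, value) pairs.
def map_values(pairs, f):
    # Only the value goes through f; the key is passed along untouched.
    return [(key, f(value)) for key, value in pairs]

pairs = [("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])]
counts = map_values(pairs, len)

# The same result with a map-style function, which has to rebuild the pair itself.
counts_via_map = [(kv[0], len(kv[1])) for kv in pairs]
```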

<h2>groupByKey</h2>
<p>When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. </p>
<p><b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using <em>reduceByKey</em> or <em>aggregateByKey</em> will yield much better performance.
<br/>
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional <em>numTasks</em> argument (named <em>numPartitions</em> in PySpark) to set a different number of tasks.</p>

In [4]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.groupByKey().mapValues(len).collect()

[('b', 1), ('a', 2)]

In [5]:
rdd.groupByKey().mapValues(list).collect()

[('b', [1]), ('a', [1, 1])]

The same per-key count with <b>reduceByKey</b> (every value here is 1, so the sum equals the count):

In [6]:
from operator import add
rdd.reduceByKey(add).collect()

[('b', 1), ('a', 2)]
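<p>The performance note above comes down to how much data has to move between partitions. The following plain-Python sketch is a simplified model, not Spark's implementation: the nested lists stand in for partitions, the data set is made up for illustration, and <code>group_then_sum</code> and <code>reduce_by_key</code> are hypothetical helpers. It assumes that the reduce-style path merges values inside each partition before the shuffle, while the group-style path ships every record.</p>

```python
from collections import defaultdict
from operator import add

# Each inner list stands in for one partition of a (key, value) RDD.
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("a", 1)]]

def group_then_sum(partitions):
    # groupByKey-style: every record crosses the shuffle, then gets summed.
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}, len(shuffled)

def reduce_by_key(partitions, func):
    # reduceByKey-style: merge values per key inside each partition first,
    # so at most one record per key and partition crosses the shuffle.
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

grouped_result, grouped_shuffled = group_then_sum(partitions)
reduced_result, reduced_shuffled = reduce_by_key(partitions, add)
```

<p>In this model both paths give the same sums, but the reduce-style path moves 3 records instead of 4; the gap grows with the number of repeated keys per partition.</p>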

<h2>reduceByKey</h2>
<p>When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function <i>func</i>, which must be of type (V,V) =&gt; V. As with <em>groupByKey</em>, the number of reduce tasks is configurable through an optional second argument.</p>

In [7]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.reduceByKey(add).collect()

[('b', 1), ('a', 2)]

<h2>aggregate(zeroValue, seqOp, combOp)</h2>
<p>Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value.”

The functions op(t1, t2) are allowed to modify t1 and return it as their result value to avoid object allocation; however, they should not modify t2.

The first function (seqOp) can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U and one operation for merging two Us.</p>
<p>The documentation gives the following example:</p>

In [8]:
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)

(10, 4)

To make the behavior easier to follow, the code is modified as follows:

In [11]:
def seqOp(x,y):
    if x ==(0,0):
        print("partition-----")
    print("seq_x:",x)
    print("seq_y:",y)
    z = (x[0] + y,x[1] + 1)
    print("seq_z:",z)
    return z
def combOp(x,y):
    print("comb_x",x)
    print("comb_y",y)
    z = (x[0] + y[0],x[1] + y[1])
    print("comb_z:",z)
    return z

In [12]:
sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)

comb_x (0, 0)
comb_y (3, 2)
comb_z: (3, 2)
comb_x (3, 2)
comb_y (7, 2)
comb_z: (10, 4)


(10, 4)

<i>Shell output:</i><br/>
partition-----<br/>
seq_x: (0, 0)<br/>
seq_y: 3<br/>
seq_z: (3, 1)<br/>
seq_x: (3, 1)<br/>
seq_y: 4<br/>
seq_z: (7, 2)<br/>
partition-----<br/>
seq_x: (0, 0)<br/>
seq_y: 1<br/>
seq_z: (1, 1)<br/>
seq_x: (1, 1)<br/>
seq_y: 2<br/>
seq_z: (3, 2)<br/>

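<p>The output above shows that the four elements were split into two partitions, one holding 1 and 2 and the other holding 3 and 4. In the partition holding 3 and 4, zeroValue (0, 0) and the first element 3 are passed to seqOp, which returns (3, 1); then (3, 1) and the second element 4 are passed to seqOp, which returns (7, 2). The partition holding 1 and 2 is processed the same way: zeroValue (0, 0) and the first element 1 give (1, 1), then (1, 1) and the second element 2 give (3, 2). Finally, combOp folds the partition results into zeroValue: (0, 0) and (3, 2), the result for the partition holding 1 and 2, give (3, 2); then (3, 2) and (7, 2), the result for the other partition, give the final result (10, 4).</p><p>To summarize <strong>aggregate</strong>:<br/>
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U<br/>
aggregate is used to aggregate the elements of an RDD. seqOp first folds the elements of type T in each partition into a value of type U, and combOp then merges the per-partition values of type U into a single U. Note that both seqOp and combOp use zeroValue, whose type is U.<br/>Within a partition, the initial value and the first element are passed to seqOp, then that result and the second element are passed to seqOp, and so on until the last element. Every other partition is processed the same way. Finally, the results of all partitions are merged with combOp (zeroValue is combined with the first partition result, that result is combined with the next partition result, and so on), and the final result is returned.</p>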
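<p>The two phases described above can be modelled in a few lines of plain Python. This is a sketch for reasoning about the semantics, not Spark's implementation: <code>model_aggregate</code> is a hypothetical helper, and the nested list stands in for the two partitions seen in the output above.</p>

```python
from functools import reduce

def model_aggregate(partitions, zero_value, seq_op, comb_op):
    # Phase 1: fold each partition's elements into zero_value with seq_op.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Phase 2: fold the per-partition results into zero_value with comb_op.
    return reduce(comb_op, partials, zero_value)

def seq_op(acc, x):
    # acc is a (running sum, running count) pair of type U; x is an element of type T.
    return (acc[0] + x, acc[1] + 1)

def comb_op(a, b):
    # Merge two (sum, count) pairs.
    return (a[0] + b[0], a[1] + b[1])

total, count = model_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
mean = total / count
```

<p>Carrying a (sum, count) pair is what lets a single pass produce the mean, which neither the sum nor the count could give alone.</p>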
<p>Here is another example:</p>

In [14]:
rdd = sc.parallelize((1,2,3,4,5,6),2)
def seq(a,b):
    print ('seqOp:'+str(a)+"\t"+str(b))
    return min(a,b)
def comb(a,b):
    print ('comOp'+str(a)+"\t"+str(b))
    return a+b
rdd.aggregate(3,seq,comb)

comOp3	1
comOp4	3


7

Shell output:<br/>
seqOp:3	1<br/>
seqOp:1	2<br/>
seqOp:1	3<br/>
seqOp:3	4<br/>
seqOp:3	5<br/>
seqOp:3	6<br/>
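<p>The result 7 depends on the zero value 3 being used once in each partition and once more when the partition results are combined. The plain-Python sketch below models that behavior (it is not Spark code; <code>model_aggregate</code> and <code>split</code> are hypothetical helpers, and the even split into partitions is an assumption) to show how a zero value that is not neutral makes the answer vary with the number of partitions.</p>

```python
from functools import reduce
from operator import add

def model_aggregate(partitions, zero_value, seq_op, comb_op):
    # zero_value seeds every partition fold and the final combine.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

def split(data, n):
    # Assumes len(data) is divisible by n; yields n equal contiguous chunks.
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

data = [1, 2, 3, 4, 5, 6]

# Same data, same functions, same zero value; only the partition count changes.
results = {n: model_aggregate(split(data, n), 3, min, add) for n in (1, 2, 3)}

# With 2 partitions: min(3, 1, 2, 3) = 1 and min(3, 4, 5, 6) = 3,
# then 3 + 1 + 3 = 7, matching the output above.
```

<p>In this model the result is 4, 7 or 10 for one, two or three partitions, which is why the zero value is normally chosen to be neutral for both functions.</p>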