<h2>map vs mapPartitions</h2>
<ul>
<li><strong>map</strong> will not change the number of elements in an RDD, while <strong>mapPartitions</strong> might very well do so.</li>
<li><a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.map">map</a> returns a new distributed dataset formed by passing each <em>element</em> of the source through a function <i>func</i>. <a href="https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitions">mapPartitions</a> is similar to map, but runs separately on each <em>partition (block)</em> of the RDD, so <i>func</i> must be of type Iterator&lt;T&gt; =&gt; Iterator&lt;U&gt; when running on an RDD of type T. In PySpark this means <i>func</i> receives an iterator over the elements of one partition and returns (or yields) the output elements.</li>
<li><strong>map</strong> applies the function to each element individually, while <strong>mapPartitions</strong> applies it once to each partition as a whole.</li>
</ul>
<p><strong><em>Example Scenario</em></strong>: if we have 100K elements in a particular RDD partition, <strong>map</strong> calls the mapping function 100K times. With <strong>mapPartitions</strong>, the function is called only once for that partition: it receives all 100K records and returns all results from that single call. This can improve performance, especially when the function does something expensive on every call that it would only need to do once if all the elements were passed in together, as they are with <strong>mapPartitions</strong>.</p>
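<p>The call-count difference described above can be sketched without Spark. This is a plain-Python emulation, not PySpark itself: each inner list stands in for one partition, and <code>expensive_setup</code>, <code>per_element</code>, and <code>per_partition</code> are hypothetical names introduced only for illustration.</p>

```python
# Plain-Python emulation (no Spark needed): each inner list plays one partition.
partitions = [[1, 2], [3, 4, 5]]

# Counts how often the "expensive" step runs under each style.
setup_calls = {"map": 0, "mapPartitions": 0}

def expensive_setup(kind):
    """Hypothetical costly step; here it only counts calls and returns a function."""
    setup_calls[kind] += 1
    return lambda x: x * 10

def per_element(x):
    # map-style function: the setup is repeated for every element.
    transform = expensive_setup("map")
    return transform(x)

def per_partition(iterator):
    # mapPartitions-style function: the setup runs once, then elements are streamed.
    transform = expensive_setup("mapPartitions")
    for x in iterator:
        yield transform(x)

map_result = [per_element(x) for part in partitions for x in part]
map_partitions_result = [y for part in partitions for y in per_partition(iter(part))]

print(map_result)             # [10, 20, 30, 40, 50]
print(map_partitions_result)  # [10, 20, 30, 40, 50]
print(setup_calls)            # {'map': 5, 'mapPartitions': 2}
```

<p>With a real RDD, the same two functions would be passed as <code>rdd.map(per_element)</code> and <code>rdd.mapPartitions(per_partition)</code>. The shared counter is only meaningful in this local emulation, because Spark ships copies of closure variables to the executors.</p>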

In [3]:
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
rdd.map(lambda a: a if a != 2 else None).collect()

[1, None, 3, 4, 5]

In [14]:
def filter_out_2(partition):
    print("partition:")
    for element in partition:
        print(element)
        if element != 2:
            yield element
rdd.mapPartitions(filter_out_2).collect()

[1, 3, 4, 5]

<i>Shell output:</i><br/>partition:
3
4
5
partition:
1
2

<h2>mapValues(f)</h2>
<p>Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.</p>

In [15]:
rdd = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x):
    return len(x)
rdd.mapValues(f).collect()

[('a', 3), ('b', 1)]
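<p>As a rough mental model, mapValues behaves like a map that rewrites only the second item of each pair. The sketch below is plain Python rather than PySpark, and <code>map_values</code> is a hypothetical helper name used only here.</p>

```python
pairs = [("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])]

def map_values(f, key_value_pairs):
    """Local stand-in for RDD.mapValues: apply f to each value, keep each key as-is."""
    return [(key, f(value)) for key, value in key_value_pairs]

result = map_values(len, pairs)
print(result)  # [('a', 3), ('b', 1)]
```

<p>In Spark, <code>rdd.map(lambda kv: (kv[0], f(kv[1])))</code> produces the same elements, but Spark cannot tell that the keys are unchanged, so by default the result no longer carries the parent's partitioner. mapValues keeps it.</p>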

<h2>groupByKey</h2>
<p>When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. </p>
<p><b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using <em>reduceByKey</em> or <em>aggregateByKey</em> will yield much better performance.
<br/>
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional <em>numTasks</em> argument (named <em>numPartitions</em> in PySpark) to set a different number of tasks.</p>
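<p>One way to see why reduceByKey tends to be cheaper for aggregations: it can combine values inside each partition before the shuffle, whereas groupByKey has to move every record. The following is a simplified plain-Python model of that idea under assumed data, not Spark's actual shuffle implementation; all function names are hypothetical.</p>

```python
from operator import add

# Each inner list plays the role of one partition of a (K, V) RDD.
partitions = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1)]]

def records_shuffled_by_group(parts):
    # groupByKey-style: every (key, value) record crosses the shuffle.
    return [kv for part in parts for kv in part]

def records_shuffled_by_reduce(parts, func):
    # reduceByKey-style: values are pre-combined per key inside each partition,
    # so at most one record per key leaves each partition.
    shuffled = []
    for part in parts:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())
    return shuffled

def merge(records, func):
    # Final per-key merge on the receiving side of the shuffle.
    out = {}
    for key, value in records:
        out[key] = func(out[key], value) if key in out else value
    return out

grouped_records = records_shuffled_by_group(partitions)
reduced_records = records_shuffled_by_reduce(partitions, add)

print(len(grouped_records), len(reduced_records))  # 5 4
print(merge(grouped_records, add))                 # {'a': 3, 'b': 2}
print(merge(reduced_records, add))                 # {'a': 3, 'b': 2}
```

<p>Both paths give the same totals. The saving grows with the number of repeated keys per partition, which this tiny data set only hints at.</p>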

In [17]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.groupByKey().mapValues(len).collect()

[('b', 1), ('a', 2)]

In [18]:
rdd.groupByKey().mapValues(list).collect()

[('b', [1]), ('a', [1, 1])]

The same per-key totals can be computed with <b>reduceByKey</b>:

In [23]:
from operator import add
rdd.reduceByKey(add).collect()

[('b', 1), ('a', 2)]

<h2>reduceByKey</h2>
<p>When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function <i>func</i>, which must be of type (V,V) =&gt; V. As with <em>groupByKey</em>, the number of reduce tasks is configurable through an optional second argument.</p>

In [26]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.reduceByKey(add).collect()

[('b', 1), ('a', 2)]
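<p>Because <i>func</i> must take two values of type V and return a V, an aggregate such as an average cannot be reduced directly. A common workaround is to carry (sum, count) pairs as the value type. The sketch below models this in plain Python with assumed sample data; <code>reduce_by_key</code> and <code>add_sum_count</code> are hypothetical names, not Spark APIs.</p>

```python
pairs = [("a", 4), ("b", 10), ("a", 6), ("a", 2)]

def reduce_by_key(func, key_value_pairs):
    """Local stand-in for RDD.reduceByKey: fold the values of each key with func."""
    out = {}
    for key, value in key_value_pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

def add_sum_count(left, right):
    # (V, V) => V with V = (sum, count): inputs and output share one shape.
    return (left[0] + right[0], left[1] + right[1])

# Turn each value into a (sum, count) pair, reduce, then divide.
sum_counts = reduce_by_key(add_sum_count, [(k, (v, 1)) for k, v in pairs])
averages = {k: total / count for k, (total, count) in sum_counts.items()}

print(sum_counts)  # {'a': (12, 3), 'b': (10, 1)}
print(averages)    # {'a': 4.0, 'b': 10.0}
```

<p>On a real pair RDD the same idea would read <code>rdd.mapValues(lambda v: (v, 1)).reduceByKey(add_sum_count).mapValues(lambda sc: sc[0] / sc[1])</code>.</p>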