# Partitioning and shuffling

Before we start, execute the following cell to create a Spark context:

In [2]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='local[*]', appName="Spark course Partitioning and Shuffling")

Create an RDD containing the numbers from 1 to 9999 (the upper bound of `sc.range` is exclusive).

In [2]:
rdd = sc.range(1, 10000)

How many partitions does it have?

In [3]:
rdd.getNumPartitions()

4

How many elements does each partition have? (Use `mapPartitions` to create an RDD called `partSizes`, then call `collect` on it.)

In [10]:
def mapfunc(part):
    # Count by iterating: a partition iterator does not generally support len()
    n = sum(1 for _ in part)
    print("mappartitions")
    return [n]
partSizes = rdd.mapPartitions(mapfunc)
partSizes.collect()

[2499, 2500, 2500, 2500]
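The first partition is one element short because 9999 does not divide evenly by 4. The following is a minimal pure-Python sketch of one way such a split can be computed. It assumes a floor-based boundary rule, which happens to reproduce the sizes above; the helper name `slice_sizes` is hypothetical and this is not presented as Spark's actual code.

```python
def slice_sizes(start, end, num_slices):
    """Sizes of the slices when range(start, end) is cut at floor(i * size / num_slices)."""
    size = end - start
    # Boundary i sits at floor(i * size / num_slices) elements past the start.
    bounds = [start + (i * size) // num_slices for i in range(num_slices + 1)]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

sizes_a = slice_sizes(1, 10000, 4)   # same bounds as sc.range(1, 10000)
sizes_b = slice_sizes(0, 10000, 4)   # same bounds as sc.range(0, 10000)
print(sizes_a)
print(sizes_b)
```

Under this assumed rule, a range of 10000 elements splits into four equal slices, while 9999 elements leave one slice with 2499.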

Map the `partSizes` RDD to tuples using `lambda x: (x, x*x)`.

In [11]:
rdd2 = partSizes.map(lambda x: (x, x*x))

Will this RDD cause a shuffle? (Find out by examining RDD's lineage.)

In [12]:
print(str(rdd2.toDebugString()).replace('\\n', '\n'))

b'(4) PythonRDD[9] at RDD at PythonRDD.scala:50 []
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:185 []'


Reduce this RDD by key by summing up values and call the resulting RDD `reducedRdd`.

In [20]:
reducedRdd = rdd2.reduceByKey(lambda v1, v2: v1+v2)

Will this last RDD cause a shuffle?

In [21]:
print(str(reducedRdd.toDebugString()).replace('\\n', '\n'))

b'(4) PythonRDD[23] at RDD at PythonRDD.scala:50 []
 |  MapPartitionsRDD[22] at mapPartitions at PythonRDD.scala:130 []
 |  ShuffledRDD[21] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[20] at reduceByKey at <ipython-input-20-8eb780c4b668>:1 []
    |  PythonRDD[19] at reduceByKey at <ipython-input-20-8eb780c4b668>:1 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:185 []'
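To make the `ShuffledRDD` step in this lineage more concrete, here is a rough pure-Python model of what a reduce-by-key has to do: combine values locally, route each key to a target partition, then merge what arrives there. The function `toy_reduce_by_key` is hypothetical, and using the built-in `hash` modulo the partition count is an assumption made for illustration, not a description of Spark's internals.

```python
from collections import defaultdict

def toy_reduce_by_key(partitions, func, num_partitions):
    # Map side: combine values that share a key within each input partition.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Shuffle: route every (key, partial value) to the bucket chosen by the key's hash.
    buckets = defaultdict(list)
    for local in combined:
        for key, value in local.items():
            buckets[hash(key) % num_partitions].append((key, value))

    # Reduce side: merge the partial values that arrived in each bucket.
    output = []
    for index in range(num_partitions):
        merged = {}
        for key, value in buckets[index]:
            merged[key] = func(merged[key], value) if key in merged else value
        output.append(sorted(merged.items()))
    return output

part_sizes = [2499, 2500, 2500, 2500]
pairs = [[(x, x * x)] for x in part_sizes]   # one tuple per input partition
shuffled = toy_reduce_by_key(pairs, lambda v1, v2: v1 + v2, 4)
result = dict(kv for part in shuffled for kv in part)
print(shuffled)
```

The three tuples with key 2500 start out in different partitions, so they can only be summed after they have been moved to the same place. In this model the data ends up with two distinct keys, and two of the four output partitions stay empty.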


## Checkpointing

Set Spark's checkpoint directory to `/home/spark/checkpointDir`. (The checkpoint directory must be set before any RDD can be checkpointed.)

In [25]:
sc.setCheckpointDir('/home/spark/checkpointDir')

Recreate the `reducedRdd` but give it a different name (copy the line from the previous cell where you did this).

In [34]:
reducedRdd2 = rdd2.reduceByKey(lambda v1, v2: v1+v2)

Examine its lineage.

In [35]:
print(str(reducedRdd2.toDebugString()).replace('\\n', '\n'))

b'(4) PythonRDD[37] at RDD at PythonRDD.scala:50 []
 |  MapPartitionsRDD[36] at mapPartitions at PythonRDD.scala:130 []
 |  ShuffledRDD[35] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[34] at reduceByKey at <ipython-input-34-6ff29a7a73ef>:1 []
    |  PythonRDD[33] at reduceByKey at <ipython-input-34-6ff29a7a73ef>:1 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:185 []'


Checkpoint this RDD and examine its lineage again.

In [36]:
reducedRdd2.checkpoint()
print(str(reducedRdd2.toDebugString()).replace('\\n', '\n'))

b'(4) PythonRDD[37] at RDD at PythonRDD.scala:50 []
 |  MapPartitionsRDD[36] at mapPartitions at PythonRDD.scala:130 []
 |  ShuffledRDD[35] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[34] at reduceByKey at <ipython-input-34-6ff29a7a73ef>:1 []
    |  PythonRDD[33] at reduceByKey at <ipython-input-34-6ff29a7a73ef>:1 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:185 []'


Now "materialize" the RDD by calling an action on it (such as `count`) and reexamine the lineage.

In [37]:
reducedRdd2.count()

2

In [38]:
print(str(reducedRdd2.toDebugString()).replace('\\n', '\n'))

b'(4) PythonRDD[37] at RDD at PythonRDD.scala:50 []
 |  ReliableCheckpointRDD[39] at count at <ipython-input-37-ebf95e46e539>:1 []'


You had to create a new `reducedRdd` because `checkpoint` must be called before any job has been executed on an RDD. The previous RDD was already materialized, so checkpointing it would have had no effect.
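The sequence observed above (lineage unchanged right after `checkpoint`, truncated only after the first action) can be imitated with a small toy model. `ToyRDD` is a hypothetical class written for this illustration; it only mimics the observable behaviour and makes no claim about how Spark implements checkpointing.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a name, a compute function, and an optional parent."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self.compute = compute      # function: parent's data (or None) -> list
        self.parent = parent
        self.marked = False         # set by checkpoint()
        self.saved = None           # filled in by the first action after marking

    def checkpoint(self):
        # Only records the request; nothing is saved yet.
        self.marked = True

    def lineage(self):
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def _data(self):
        if self.saved is not None:
            return self.saved
        parent_data = self.parent._data() if self.parent else None
        return self.compute(parent_data)

    def count(self):
        data = self._data()
        if self.marked and self.saved is None:
            # First action after checkpoint(): keep the data, drop the old lineage.
            self.saved = data
            self.parent = ToyRDD("checkpoint", lambda _: data)
        return len(data)

source = ToyRDD("source", lambda _: [2499, 2500, 2500, 2500])
reduced = ToyRDD("reduced", lambda rows: sorted(set(rows)), parent=source)

lineage_before = reduced.lineage()
reduced.checkpoint()
lineage_marked = reduced.lineage()   # still the full lineage
n = reduced.count()                  # the action triggers the save
lineage_after = reduced.lineage()    # now points at the saved data
print(lineage_before, lineage_marked, n, lineage_after)
```

In the toy model, as in the cells above, marking alone changes nothing; the lineage is replaced only once an action has run.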

## Caching

Create an RDD containing a few thousand numbers and map its partitions (`mapPartitions`) with a function that returns the number of elements in each partition and also prints that number.

In [11]:
rdd = sc.range(0, 10000)
def mapfunc(itr):
    # Count by iterating: a partition iterator does not generally support len()
    n = sum(1 for _ in itr)
    print(n)
    return [n]
rdd2 = rdd.mapPartitions(mapfunc)

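Now "materialize" this RDD by calling an action on it (such as `collect` or `count`). Execute that cell several times and watch the output of the Jupyter server: the `print` calls run inside Spark's Python workers, so their output appears there rather than in the notebook.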

In [18]:
rdd2.collect()

[2500, 2500, 2500, 2500]

Now cache this RDD.

In [14]:
rdd2.cache()

PythonRDD[8] at collect at <ipython-input-12-83517eaf6d43>:1

And call the same action several times again. Do you see the difference?

If you no longer need the cached RDD, `unpersist` it explicitly: cached partitions keep occupying memory, which can increase garbage collection.
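The effect of `cache` and `unpersist` on repeated actions can be sketched with a toy model that counts how often the underlying computation runs. `ToyLazy` is a hypothetical class for illustration only; it imitates the behaviour you should observe in the exercise and is not Spark's implementation.

```python
class ToyLazy:
    """Toy lazy dataset that counts how many times its compute function runs."""

    def __init__(self, compute):
        self.compute = compute
        self.use_cache = False
        self.stored = None
        self.runs = 0

    def cache(self):
        # Only marks the dataset; data is stored by the next action.
        self.use_cache = True
        return self

    def unpersist(self):
        # Forget the stored data and stop caching.
        self.use_cache = False
        self.stored = None
        return self

    def collect(self):
        if self.use_cache and self.stored is not None:
            return self.stored          # served from the cache, no recomputation
        self.runs += 1
        result = self.compute()
        if self.use_cache:
            self.stored = result
        return result

lazy = ToyLazy(lambda: [2500, 2500, 2500, 2500])

lazy.collect()
lazy.collect()
runs_uncached = lazy.runs                 # every action recomputed

lazy.cache()
lazy.collect()
lazy.collect()
lazy.collect()
runs_cached = lazy.runs - runs_uncached   # only the first action computed

lazy.unpersist()
lazy.collect()
runs_total = lazy.runs
print(runs_uncached, runs_cached, runs_total)
```

Without caching, each action reruns the computation, which is why the partition sizes are printed again on every execution of the cell. After caching, only the first action does the work.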