![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 06.02 - Tuning Partitioning: Partitioning Rules

## Background

In this exercise, you'll be exploring partitioning properties for RDDs created from different sources and transformations.

You'll be working with the `videos` table and a text file, `/root/data/video-ids.csv`, which contains a single column: the video IDs for different movies. Some of the video IDs are orphaned, i.e. they do not appear in the `killrvideo.videos` table.

***

## Directions

#### 1. Create an RDD from the text file. Hint: `map()` each row to a UUID via `java.util.UUID.fromString()`. What do you expect the partitioner and number of partitions to be? Why? Write Scala code to discover the actual values. Were they what you expected them to be? Why or why not?

In [1]:

```scala
val videoIds = sc.textFile("file:///root/data/video-ids.csv").map(id => java.util.UUID.fromString(id))

println(videoIds.partitioner)
println(videoIds.partitions.size)
```

```
None
2
```


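The partitioner is `None` because the data source is a text file, which carries no partitioning information, and `map()` does not assign one. The number of partitions is 2: `sc.textFile()` asks for at least `defaultMinPartitions` partitions, which is the smaller of `defaultParallelism` and 2, and `defaultParallelism` equals the number of cores here (2). That minimum exceeds the single partition the file's size alone would produce, because the text file holds less than 64 MB (one block) of data.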
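The partition count for a small file can be reasoned about with a short model. The sketch below is a simplified approximation of how a single uncompressed file is cut into input splits. `SplitEstimate` and `estimateSplits` are illustrative names, not Spark or Hadoop API, and the sketch assumes the 64 MB block size mentioned above plus a 10% slop allowance on the last split. It ignores multi-file inputs, compression, and configuration overrides.

```scala
// Hypothetical model of split computation for one uncompressed text file.
object SplitEstimate {
  // Assumption: the final split may be up to 10% larger than splitSize.
  val SplitSlop = 1.1

  def estimateSplits(fileBytes: Long,
                     minPartitions: Int,
                     blockBytes: Long = 64L * 1024 * 1024): Int = {
    require(fileBytes > 0, "the model only covers non-empty files")
    // Aim for minPartitions splits, but never let a split exceed one block.
    val goalSize  = fileBytes / math.max(minPartitions, 1)
    val splitSize = math.max(1L, math.min(goalSize, blockBytes))

    var remaining = fileBytes
    var splits    = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      splits += 1
      remaining -= splitSize
    }
    if (remaining > 0) splits += 1 // whatever is left becomes the last split
    splits
  }
}
```

Under this model, `SplitEstimate.estimateSplits(10000L, 2)` gives 2 splits while the same file with a minimum of 1 gives 1, which matches the reasoning that the requested minimum, not the file size, determines the count for files smaller than one block.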

#### 2. Create an RDD from the `killrvideo.videos` table, using `select()` to read only the video IDs. What do you expect the partitioner and number of partitions to be? Why? Write Scala code to discover the actual values. Were they what you expected them to be? Why or why not?

In [2]:

```scala
val killrVideoIds = sc.cassandraTable[java.util.UUID]("killr_video", "videos").select("video_id")

println(killrVideoIds.partitioner)
println(killrVideoIds.partitions.size)
```

```
None
1
```


The partitioner is `None` because the RDD is read from a C* source.
`partitions.size` is 1, which reflects the number of nodes in the C* cluster.

#### 3. Find the orphaned video IDs by using `subtract()`. What do you expect the partitioner and number of partitions to be? Why? Write Scala code to discover the actual values. Were they what you expected them to be? Why or why not?

In [3]:

```scala
val orphans = videoIds.subtract(killrVideoIds)

println(orphans.partitioner)
println(orphans.partitions.size)
```

```
None
2
```


The partitioner is still `None`. `subtract()` reuses the source RDD's partitioner when it has one, and both inputs here have a partitioner of `None`.
The number of partitions is 2 because `subtract()` keeps the source RDD's partition count, and the text-file RDD (`videoIds`) has 2 partitions.

#### 4. Turn your RDD into a keyed RDD via `keyBy()`, then sort it using `sortByKey()`. What do you expect the partitioner and number of partitions to be? Why? Write Scala code to discover the actual values. Were they what you expected them to be? Why or why not?

In [4]:

```scala
val sortedOrphans = orphans.keyBy(row => row).sortByKey()

println(sortedOrphans.partitioner)
println(sortedOrphans.partitions.size)
```

```
Some(org.apache.spark.RangePartitioner@daac8b75)
2
```


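The partitioner is a `RangePartitioner` because `sortByKey()` range-partitions the keys so that each partition holds a contiguous, ordered slice of the key space. The number of partitions is still 2 because `sortByKey()` defaults to the parent RDD's partition count.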
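To make the idea of range partitioning concrete, here is a minimal sketch in plain Scala. It is not Spark's implementation: Spark samples the data to choose its bounds, whereas this sketch sorts all keys, and `RangePartitionSketch`, `rangeBounds`, and `partitionFor` are hypothetical names introduced only for illustration.

```scala
// Illustrative model of range partitioning; not the Spark RangePartitioner.
object RangePartitionSketch {
  // Pick numPartitions - 1 upper bounds so that each range holds
  // roughly the same number of keys.
  def rangeBounds[K: Ordering](keys: Seq[K], numPartitions: Int): Vector[K] = {
    require(numPartitions >= 1 && keys.size >= numPartitions)
    val sorted = keys.sorted.toVector
    (1 until numPartitions)
      .map(i => sorted(i * sorted.size / numPartitions - 1))
      .toVector
  }

  // A key belongs to the first range whose upper bound is >= the key;
  // keys above every bound fall into the last partition.
  def partitionFor[K](key: K, bounds: Vector[K])(implicit ord: Ordering[K]): Int = {
    val idx = bounds.indexWhere(b => ord.lteq(key, b))
    if (idx == -1) bounds.size else idx
  }
}
```

For example, `RangePartitionSketch.rangeBounds(Seq(5, 1, 4, 2, 6, 3), 2)` yields the single bound `3`, so keys 1 to 3 map to partition 0 and keys 4 to 6 map to partition 1. Because every key in partition 0 precedes every key in partition 1, sorting each partition locally produces a globally sorted result.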

#### 5. Display the resulting orphaned video IDs.

In [5]:

```scala
sortedOrphans.collect.foreach(println)
```

```
(9056808b-ca65-1bfb-9957-3bea148dfdce,9056808b-ca65-1bfb-9957-3bea148dfdce)
(907df86e-2208-18a8-90aa-6d837c659f2f,907df86e-2208-18a8-90aa-6d837c659f2f)
(9646278f-14bd-11e5-88ea-8438355b7e3a,9646278f-14bd-11e5-88ea-8438355b7e3a)
(9db57288-e51c-1ff1-805d-c5f1e49c2c8b,9db57288-e51c-1ff1-805d-c5f1e49c2c8b)
(fe3c4045-6f37-1223-81be-250dc60cffc8,fe3c4045-6f37-1223-81be-250dc60cffc8)
(2645e79c-14bd-11e5-a456-8438355b7e3a,2645e79c-14bd-11e5-a456-8438355b7e3a)
(264601a3-14bd-11e5-8c2e-8438355b7e3a,264601a3-14bd-11e5-8c2e-8438355b7e3a)
(2646123a-14bd-11e5-b9db-8438355b7e3a,2646123a-14bd-11e5-b9db-8438355b7e3a)
(26461a70-14bd-11e5-ad08-8438355b7e3a,26461a70-14bd-11e5-ad08-8438355b7e3a)
(2e8ecb4f-e92b-139b-8183-4df0e2a817bb,2e8ecb4f-e92b-139b-8183-4df0e2a817bb)
```
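The output may look unsorted at first glance, since IDs starting with `9` and `f` precede those starting with `2`. This is consistent with `java.util.UUID.compareTo`, which compares the two 64-bit halves of a UUID as signed longs, so a UUID whose first hex digit is 8 or higher has a negative most significant half and sorts first. A minimal sketch using two IDs from the output (the object name `UuidOrder` is illustrative):

```scala
import java.util.UUID

object UuidOrder {
  val high = UUID.fromString("9056808b-ca65-1bfb-9957-3bea148dfdce")
  val low  = UUID.fromString("2645e79c-14bd-11e5-a456-8438355b7e3a")

  // Signed comparison, as performed by UUID.compareTo.
  val signedOrder: Int = high.compareTo(low)

  // Unsigned comparison of the most significant halves, for contrast.
  val unsignedOrder: Int =
    java.lang.Long.compareUnsigned(high.getMostSignificantBits, low.getMostSignificantBits)

  def main(args: Array[String]): Unit = {
    println(high.getMostSignificantBits < 0) // the leading hex digit 9 sets the sign bit
    println(signedOrder < 0)                 // "9056..." sorts before "2645..." when signed
    println(unsignedOrder > 0)               // and after it when compared as unsigned
  }
}
```

If a strictly lexical ordering of the IDs were needed, one option would be to key the RDD by the UUID's string form instead of the `UUID` object.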
