# Chapter 3: Programming with RDDs (Scala)

In this notebook, we study basic RDDs: how to create them and how to operate on them.

## Create RDD

Create an RDD from a list:

In [1]:
val numeric_rdd = sc.parallelize(1 to 10)

numeric_rdd = ParallelCollectionRDD[0] at parallelize at <console>:27


ParallelCollectionRDD[0] at parallelize at <console>:27

In [2]:
println("Numeric RDD (from list): " + numeric_rdd.collect().mkString(", "))

Numeric RDD (from list): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
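
`parallelize()` also accepts an optional number of partitions (`numSlices`). The following is a minimal sketch, assuming `sc` is the SparkContext provided by the notebook kernel; the name `partitioned_rdd` and the value 4 are illustrative only:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
// `partitioned_rdd` and the partition count 4 are illustrative choices.
val partitioned_rdd = sc.parallelize(1 to 10, numSlices = 4)

println("Number of partitions: " + partitioned_rdd.getNumPartitions)

// glom() groups the elements of each partition into one array,
// which makes the distribution of the data visible.
println("Partition sizes: " + partitioned_rdd.glom().map(_.length).collect().mkString(", "))
```

When `numSlices` is omitted, Spark falls back to the default parallelism of the context.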


Create an RDD from an external file:

In [3]:
val text_rdd = sc.textFile("../data/README.md")

text_rdd = ../data/README.md MapPartitionsRDD[2] at textFile at <console>:27


../data/README.md MapPartitionsRDD[2] at textFile at <console>:27

In [4]:
println("Text RDD (from external file): " + text_rdd.take(10).mkString(", "))

Text RDD (from external file): # Apache Spark, , Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., , <http://spark.apache.org/>


## RDD actions

    * collect()
    * take()
    * count(), countByValue()
    * takeOrdered(), takeSample()
    * reduce(), fold()
    * aggregate()

`collect()` --> return all the elements of the RDD to the driver, `take(n)` --> return the first n elements of the RDD to the driver.

In [5]:
val rdd1 = sc.parallelize(List(1,1,2,3,33,1,4,5,8,6))
val rdd2 = sc.parallelize(List(1,2,9,8))

rdd1 = ParallelCollectionRDD[3] at parallelize at <console>:27
rdd2 = ParallelCollectionRDD[4] at parallelize at <console>:28


ParallelCollectionRDD[4] at parallelize at <console>:28
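
The cell above only creates the RDDs. As a minimal sketch of the two actions just described (assuming the notebook's `sc`; the names `small_rdd`, `first_three` and `everything` are illustrative):

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val small_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))

// take(3) brings only the first 3 elements to the driver
val first_three = small_rdd.take(3)

// collect() brings every element to the driver, so it should be
// reserved for RDDs that are known to fit in driver memory
val everything = small_rdd.collect()

println("take(3): " + first_three.mkString(", "))
println("collect() size: " + everything.length)
```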

`count()` --> count the total number of elements of the RDD, `countByValue()` --> count the number of occurrences of each element of the RDD.

In [6]:
println("count(): " + rdd1.count())
println("countByValue(): " + rdd1.countByValue())

count(): 10
countByValue(): Map(5 -> 1, 1 -> 3, 6 -> 1, 33 -> 1, 2 -> 1, 3 -> 1, 8 -> 1, 4 -> 1)
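
Note that `countByValue()` returns an ordinary map on the driver rather than an RDD, so it can be queried directly. A small sketch, assuming the notebook's `sc` (the names `counted_rdd` and `counts` are illustrative):

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val counted_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))

// The result is a local Map from element to number of occurrences
val counts = counted_rdd.countByValue()

println("Occurrences of 1: " + counts(1))
// A value that never appears is simply absent from the map
println("Occurrences of 7: " + counts.getOrElse(7, 0L))
```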


`takeOrdered(n)` --> take the n smallest elements of the RDD in ascending order, `takeSample(withReplacement, n)` --> take n randomly chosen elements of the RDD.

In [7]:
println("takeOrdered(): " + rdd2.takeOrdered(3).mkString(", "))
println("takeSample(): " + rdd2.takeSample(false, 2).mkString(", "))

takeOrdered(): 1, 2, 8
takeSample(): 9, 8
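
Both actions have useful variations. The sketch below, assuming the notebook's `sc`, passes an explicit `Ordering` to `takeOrdered()` to get the largest elements, uses `top()` for the same purpose, and fixes the seed of `takeSample()` to make the sample repeatable. The variable names and the seed value 42 are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val sample_rdd = sc.parallelize(List(1, 2, 9, 8))

// Reversing the ordering yields the largest elements first
val largest = sample_rdd.takeOrdered(3)(Ordering[Int].reverse)

// top(n) returns the n largest elements in descending order
val top_three = sample_rdd.top(3)

// A fixed seed makes the random sample reproducible across runs
val seeded = sample_rdd.takeSample(withReplacement = false, num = 2, seed = 42L)

println("Largest three: " + largest.mkString(", "))
println("top(3): " + top_three.mkString(", "))
println("Seeded sample size: " + seeded.length)
```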


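`reduce()` --> reduce an RDD to a single value using an associative and commutative function, `fold()` --> same as `reduce()`, but with an initial "zero" value that is applied once per partition and once more when the partition results are merged.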

In [8]:
println("Sum of list using reduce(): " + rdd1.reduce(_ + _))
println("Sum of list using fold(): " + rdd1.fold(0)(_ + _))

Sum of list using reduce(): 64
Sum of list using fold(): 64
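
Because the zero value of `fold()` is used in every partition and again in the final merge, it should be the neutral element of the operation (0 for addition). The following sketch, assuming the notebook's `sc`, shows what happens otherwise; the names are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val fold_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))
val partitions = fold_rdd.getNumPartitions

// Neutral zero value: the plain sum of the elements
val neutral = fold_rdd.fold(0)(_ + _)

// Non-neutral zero value: 1 is added once per partition
// and once more when the partition results are merged
val shifted = fold_rdd.fold(1)(_ + _)

println("fold(0): " + neutral)
println("fold(1): " + shifted + " (sum + " + (partitions + 1) + ")")
```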


Calculating average using `reduce()`:

In [9]:
println("Average calculated using reduce(): " + rdd1.reduce(_ + _).toFloat/rdd1.count())

Average calculated using reduce(): 6.4


Calculating average using `aggregate()`:

In [10]:
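// The accumulator is a (sum, count) pair that starts at (0, 0)
val (sum_values, count) = rdd1.aggregate((0, 0))(
  // seqOp: add one element to the accumulator of its partition
  (acc, value) => (acc._1 + value, acc._2 + 1),
  // combOp: merge the accumulators of two partitions
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))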

val avg = sum_values.toFloat/count
println("Average calculated using aggregate(): " + avg)

Average calculated using aggregate(): 6.4


sum_values = 64
count = 10
avg = 6.4


6.4
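
For numeric RDDs, Spark also exposes `sum()` and `mean()` through an implicit conversion, which can serve as a cross-check of the two calculations above. This is a minimal sketch, assuming the notebook's `sc` and a Spark version where the conversion is picked up automatically (older releases may need `import org.apache.spark.SparkContext._`); the name `avg_rdd` is illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val avg_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))

// Both methods return a Double
val total = avg_rdd.sum()
val mean_value = avg_rdd.mean()

println("sum(): " + total)
println("mean(): " + mean_value)
```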

## Basic RDD Transformations

In this section, we explore the most common transformations that can be performed on RDDs:

    * map()
    * flatMap()
    * filter()
    * distinct()

`map()` --> to apply element-wise operations over the elements of an RDD

In [11]:
// x * x squares each element (in Scala, `^` is bitwise XOR, not exponentiation)
val rdd_map = numeric_rdd.map(x => x * x)
print("RDD obtained using map: " + rdd_map.collect().mkString(", "))

RDD obtained using map: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100

rdd_map = MapPartitionsRDD[10] at map at <console>:29


MapPartitionsRDD[10] at map at <console>:29
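
A common pitfall when squaring with `map()`: in Scala the `^` operator on integers is bitwise XOR, not exponentiation. The sketch below, assuming the notebook's `sc`, contrasts the two; the variable names are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val base_rdd = sc.parallelize(1 to 10)

// `_ ^ 2` flips the second-lowest bit of each number (bitwise XOR with 2)
val xor_rdd = base_rdd.map(_ ^ 2)

// Multiplying each element by itself gives the actual squares
val squares_rdd = base_rdd.map(x => x * x)

println("XOR with 2: " + xor_rdd.collect().mkString(", "))
println("Squares:    " + squares_rdd.collect().mkString(", "))
```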

`flatMap()` --> to apply an operation that returns a collection for each element of an RDD, flattening all the results into a single RDD

In [12]:
val rdd_flat_map = text_rdd.flatMap(_.split(" "))
println("RDD obtained using flatMap: " + rdd_flat_map.take(10).mkString(", "))

RDD obtained using flatMap: #, Apache, Spark, , Spark, is, a, fast, and, general


rdd_flat_map = MapPartitionsRDD[11] at flatMap at <console>:29


MapPartitionsRDD[11] at flatMap at <console>:29
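
The difference between `map()` and `flatMap()` is easiest to see on a tiny input. A minimal sketch, assuming the notebook's `sc`; the two-line sample data and the names are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val lines_rdd = sc.parallelize(Seq("a b", "c"))

// map() keeps one output element per input line: an array of words
val nested = lines_rdd.map(_.split(" "))

// flatMap() flattens those arrays: one output element per word
val flat = lines_rdd.flatMap(_.split(" "))

println("map count: " + nested.count())
println("flatMap count: " + flat.count())
println("flatMap elements: " + flat.collect().mkString(", "))
```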

`filter()` --> to filter the elements of an RDD according to a specific condition.

In [13]:
val lines_spark = text_rdd.map(_.split(" ")).filter(_.contains("Spark"))
println("Number of lines that contain the word 'Spark': " + lines_spark.count())

Number of lines that contain the word 'Spark': 16


lines_spark = MapPartitionsRDD[13] at filter at <console>:29


MapPartitionsRDD[13] at filter at <console>:29

In [14]:
val words_python = text_rdd.flatMap(_.split(" ")).filter(_.replace(",", "") == "Python")
println("Number of times that the word 'Python' appears: " + words_python.count())

Number of times that the word 'Python' appears: 4


words_python = MapPartitionsRDD[15] at filter at <console>:29


MapPartitionsRDD[15] at filter at <console>:29
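
Splitting a line before filtering matches whole tokens only, whereas calling `contains` on the line itself matches substrings. The following sketch, assuming the notebook's `sc`, shows the difference on three made-up lines (the sample data and names are illustrative, not taken from the README):

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val sample_lines = sc.parallelize(Seq("Apache Spark", "Spark, fast", "no match"))

// Token match: the array of words must contain exactly "Spark",
// so "Spark," (with the trailing comma) does not count
val token_match = sample_lines.map(_.split(" ")).filter(_.contains("Spark"))

// Substring match: any line with "Spark" somewhere inside it counts
val substring_match = sample_lines.filter(_.contains("Spark"))

println("Token matches: " + token_match.count())
println("Substring matches: " + substring_match.count())
```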

`distinct()` --> to get the distinct elements of an RDD.

In [15]:
print("RDD from distinct(): " + rdd1.distinct().collect().mkString(", "))

RDD from distinct(): 4, 8, 33, 1, 5, 6, 2, 3
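
The order of the elements returned after `distinct()` is not guaranteed, because the operation shuffles the data. When a stable order matters, the collected result can be sorted on the driver. A minimal sketch, assuming the notebook's `sc` (names are illustrative):

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val dup_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))

// Sort the (small) collected array locally to get a deterministic order
val unique_sorted = dup_rdd.distinct().collect().sorted

println("Sorted distinct elements: " + unique_sorted.mkString(", "))
```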

## Pseudo-Set Operations

Finally, we study the different pseudo-set operations that can be performed on two RDDs:

    * union()
    * subtract()
    * intersection()
    * cartesian()

`union()` --> to concatenate two RDDs (duplicates are kept)

In [16]:
val rdd_union = rdd1.union(rdd2)
println("RDD from union(): " + rdd_union.collect().mkString(", "))

RDD from union(): 1, 1, 2, 3, 33, 1, 4, 5, 8, 6, 1, 2, 9, 8


rdd_union = UnionRDD[19] at union at <console>:30


UnionRDD[19] at union at <console>:30
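
Since `union()` keeps duplicates, a set-style union needs an extra `distinct()`. A small sketch, assuming the notebook's `sc`; the names `left_rdd`, `right_rdd` and `set_union` are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val left_rdd = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))
val right_rdd = sc.parallelize(List(1, 2, 9, 8))

// union() alone simply concatenates: 10 + 4 elements
val concatenated = left_rdd.union(right_rdd)

// distinct() afterwards removes the repeated values
val set_union = concatenated.distinct().collect().sorted

println("union() size: " + concatenated.count())
println("Set union: " + set_union.mkString(", "))
```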

`subtract()` --> to remove from one RDD the elements that appear in another one

In [17]:
val rdd_subtract = rdd1.subtract(rdd2)
println("RDD from subtract(): " + rdd_subtract.collect().mkString(", "))

RDD from subtract(): 4, 33, 5, 6, 3


rdd_subtract = MapPartitionsRDD[23] at subtract at <console>:30


MapPartitionsRDD[23] at subtract at <console>:30
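
Unlike a true set difference, `subtract()` keeps duplicates of the elements that survive. The sketch below, assuming the notebook's `sc`, uses a small made-up input to show this; the names are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val with_dups = sc.parallelize(Seq(1, 1, 2, 3))
val to_remove = sc.parallelize(Seq(2))

// Every occurrence of 2 is removed; both occurrences of 1 remain
val remaining = with_dups.subtract(to_remove).collect().sorted

println("After subtract(): " + remaining.mkString(", "))
```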

`intersection()` --> to get the elements common to two RDDs (the result contains no duplicates)

In [18]:
val rdd_intersection = rdd1.intersection(rdd2)
println("RDD from intersection(): " + rdd_intersection.collect().mkString(", "))

RDD from intersection(): 8, 1, 2


rdd_intersection = MapPartitionsRDD[29] at intersection at <console>:30


MapPartitionsRDD[29] at intersection at <console>:30
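
`intersection()` removes duplicates from its result even when both inputs contain them. A minimal sketch, assuming the notebook's `sc`; the inputs and names are illustrative:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val first_rdd = sc.parallelize(Seq(1, 1, 2))
val second_rdd = sc.parallelize(Seq(1, 1, 3))

// 1 appears twice in both inputs but only once in the result
val common = first_rdd.intersection(second_rdd).collect()

println("Common elements: " + common.mkString(", "))
```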

`cartesian()` --> to get the Cartesian product of two RDDs, that is, every possible pair of elements

In [19]:
val rdd_cartesian = rdd1.cartesian(rdd2)
println("RDD from cartesian(): " + rdd_cartesian.collect().mkString(", "))

RDD from cartesian(): (1,1), (1,1), (1,2), (1,2), (1,9), (1,9), (1,8), (1,8), (2,1), (3,1), (33,1), (2,2), (3,2), (33,2), (2,9), (3,9), (33,9), (2,8), (3,8), (33,8), (1,1), (4,1), (1,2), (4,2), (1,9), (4,9), (1,8), (4,8), (5,1), (8,1), (6,1), (5,2), (8,2), (6,2), (5,9), (8,9), (6,9), (5,8), (8,8), (6,8)


rdd_cartesian = CartesianRDD[30] at cartesian at <console>:30


CartesianRDD[30] at cartesian at <console>:30
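
The size of a Cartesian product is the product of the sizes of its inputs, so it grows quickly. A short sketch, assuming the notebook's `sc` (the names `xs`, `ys` and `pairs` are illustrative):

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val xs = sc.parallelize(List(1, 1, 2, 3, 33, 1, 4, 5, 8, 6))
val ys = sc.parallelize(List(1, 2, 9, 8))

// Every element of xs is paired with every element of ys
val pairs = xs.cartesian(ys)

println("Pairs: " + pairs.count() + " = " + xs.count() + " * " + ys.count())
```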