<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Week 7: Resilient Distributed Datasets (Ch 12: RDDs)</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

#  About RDDs
Virtually all Spark code you run,
whether DataFrames or Datasets, compiles down to an RDD. 
In short, an RDD represents an immutable, partitioned collection of records that can be operated on in
parallel. Unlike DataFrames though, where each record is a structured row containing fields with a
known schema, in RDDs the records are just Java, Scala, or Python objects of the programmer’s
choosing.

RDDs give you complete control because every record in an RDD is just a Java or Python object.
You can store anything you want in these objects, in any format you want. This gives you great power,
but not without potential issues. Every manipulation and interaction between values must be defined
by hand, meaning that you must “reinvent the wheel” for whatever task you are trying to carry out.

Also, optimizations are going to require much more manual work, because Spark does not understand
the inner structure of your records as it does with the Structured APIs. For instance, Spark’s
Structured APIs automatically store data in an optimized, compressed binary format, so to achieve the
same space-efficiency and performance, you would need to implement a comparable format inside your
objects, along with the low-level operations to compute over it.

## Creating RDDs: Interoperating Between DataFrames and RDDs
One of the easiest ways to get RDDs is from an existing DataFrame or Dataset. Converting these to an
RDD is simple: just use the rdd method on any of these data types.

In [None]:
rd = spark.range(10).rdd

In [None]:
rd.take(2)

The result is an RDD of type Row. To operate on this data, you will need to convert each Row object
to the correct data type or extract values out of it, as shown in the example that follows:

In [None]:
rd = spark.range(10).toDF("id").rdd.map(lambda row: row[0])

In [None]:
rd.take(3)

You can use the same methodology to create a DataFrame or Dataset from an RDD. All you need to
do is call the toDF method on the RDD:

In [None]:
df = spark.range(10).rdd.toDF()

In [None]:
df.show()

## From a Local Collection
To create an RDD from a collection, you will need to use the parallelize method on a
SparkContext (within a SparkSession). This turns a single-node collection into a parallel collection.
When creating this parallel collection, you can also explicitly state the number of partitions into
which you would like to distribute the collection. In this case, we are creating two partitions:

In [None]:
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
    

In [None]:
words = spark.sparkContext.parallelize(myCollection, 2)

In [None]:
words.collect()
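
To make "two partitions" concrete, here is a minimal plain-Python sketch that cuts the same word list into two contiguous chunks. It is only a model: the helper `slice_into_partitions` is hypothetical, and the even, contiguous slicing is an assumption rather than a statement of how Spark lays out the data. With a live SparkSession, `words.glom().collect()` shows the actual contents of each partition.

```python
def slice_into_partitions(items, num_partitions):
    """Split a list into contiguous, near-equal chunks (hypothetical helper)."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Two chunks, mirroring parallelize(myCollection, 2)
partitions = slice_into_partitions(my_collection, 2)
for index, chunk in enumerate(partitions):
    print(index, chunk)
```

Each chunk can then be processed independently, which is what lets the operations in the rest of this notebook run in parallel.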

An additional feature is that you can then name this RDD to show up in the Spark UI according to a
given name:

In [None]:
words.setName("myWords")

In [None]:
words.name() # myWords

## filter
Filtering is equivalent to creating a SQL-like where clause. You look through the records in the
RDD and see which ones match some predicate function. This function just needs to return a Boolean
type to be used as a filter function. The input should be whatever your given record is. In this next
example, we filter the RDD to keep only the words that begin with the letter “S”:

In [None]:
def startsWithS(individual):
  return individual.startswith("S")

Now that we defined the function, let’s filter the data. This should feel quite familiar if you read
Chapter 11 because we simply use a function that operates record by record in the RDD. The function
is defined to work on each record in the RDD individually:

In [None]:
words.filter(lambda word: startsWithS(word)).collect()
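
If no Spark session is at hand, the same predicate can be checked on an ordinary list. The sketch below is a plain-Python model of the filter, not the RDD call itself; `words_local` and `starts_with_s` are illustrative names. It also shows that the predicate can be passed directly, so on the RDD `words.filter(startsWithS)` should behave the same as the lambda wrapper.

```python
words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

def starts_with_s(individual):
    # Predicate: returns a Boolean for a single record
    return individual.startswith("S")

# Keep only the records for which the predicate is True
kept = [word for word in words_local if starts_with_s(word)]

# The built-in filter accepts the function itself, no lambda needed
kept_direct = list(filter(starts_with_s, words_local))

print(kept)
```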

## map
Mapping is again the same operation that you can read about in Chapter 11. You specify a function
that returns the value that you want, given the correct input. You then apply that, record by record.
Let’s perform something similar to what we just did. In this example, we’ll map the current word to
the word, its starting letter, and whether the word begins with “S.”
Notice in this instance that we define our functions completely inline using the relevant lambda
syntax:

In [None]:
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))

In [None]:
words2.collect()

You can subsequently filter on this by selecting the relevant Boolean value in a new function:

In [None]:
words2.filter(lambda record: record[2]).take(5)
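
The key property of map is that it is one-to-one: ten words in, ten tuples out. The following plain-Python sketch models the map and the follow-up filter on a local list (the name `words_local` is illustrative, and no Spark session is assumed).

```python
words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# map: exactly one output record per input record
records = [(word, word[0], word.startswith("S")) for word in words_local]

# filter on the Boolean stored in position 2 of each tuple
s_records = [record for record in records if record[2]]

print(len(words_local), len(records))
print(s_records)
```

Note that `word[0]` would raise an IndexError on an empty string; that cannot happen with this particular collection, but it is worth guarding against on real data.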

## flatMap
flatMap provides a simple extension of the map function we just looked at. Sometimes, each current
row should return multiple rows, instead. For example, you might want to take your set of words and
flatMap it into a set of characters. Because each word has multiple characters, you should use
flatMap to expand it. flatMap requires that the output of the map function be an iterable that can be
expanded:

In [None]:
words.flatMap(lambda word: list(word)).take(8)
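
The difference between map and flatMap is easiest to see side by side. In this plain-Python sketch (a model, not the RDD API), the map step yields one list per word, and the flatten step chains those lists into a single stream of characters; `words_local` is an illustrative name.

```python
from itertools import chain, islice

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# map alone: one list of characters per word (still 10 records)
mapped = [list(word) for word in words_local]

# flatMap = map, then flatten the per-record iterables into one sequence
flattened = chain.from_iterable(mapped)

# Equivalent of take(8): the first eight characters
first_eight = list(islice(flattened, 8))
print(first_eight)
```

Because a Python string is itself an iterable of characters, the `list(word)` conversion is a matter of clarity rather than necessity.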

## sort
To sort an RDD you must use the sortBy method, and just like any other RDD operation, you do this
by specifying a function to extract a value from the objects in your RDDs and then sort based on that.
For instance, the following example sorts by word length from longest to shortest:

In [None]:
words.sortBy(lambda word: len(word) * -1).take(2)
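
Sorting by a key function works the same way as Python's built-in `sorted`, so the expected result can be sketched locally. Negating the length is one way to get a descending order; a descending flag is another (on the RDD, `sortBy` accepts an `ascending` parameter for this). The names below are illustrative, and the order of equal-length words on a cluster is something I would not rely on.

```python
words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Same key as the RDD example: negative length sorts longest first
longest_first = sorted(words_local, key=lambda word: len(word) * -1)
top_two = longest_first[:2]

# Alternative: keep the key positive and ask for descending order
top_two_alt = sorted(words_local, key=len, reverse=True)[:2]

print(top_two)
```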

## Random Splits
We can also randomly split an RDD into a list of RDDs by using the randomSplit method,
which accepts a list of weights and, optionally, a random seed:

In [None]:
fiftyFiftySplit = words.randomSplit([0.5, 0.5])

In [None]:
fiftyFiftySplit
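
Displaying `fiftyFiftySplit` shows only the RDD objects, not their contents. To build intuition for what a weighted random split does, here is a plain-Python sketch under a simple assumption: each record gets one random draw and lands in the split whose cumulative-weight interval contains that draw. The helper `random_split` is hypothetical and is not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(items, weights, seed=None):
    """Assign each item to exactly one split, chosen by weight (hypothetical helper)."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.0, 0.5, 1.0]
    bounds = [0.0] + list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        for index in range(len(weights)):
            if bounds[index] <= draw < bounds[index + 1]:
                splits[index].append(item)
                break
    return splits

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
first_half, second_half = random_split(words_local, [0.5, 0.5], seed=42)
print(len(first_half), len(second_half))
```

Two things follow from this model: the weights are probabilities, so a 50/50 split of ten records is not guaranteed to give five and five, and fixing the seed makes the split repeatable. On the RDD, calling `collect()` on each element of the returned list shows what ended up where.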

## reduce
You can use the reduce method to specify a function to “reduce” an RDD of any kind of value to one
value. For instance, given a set of numbers, you can reduce this to its sum by specifying a function that
takes as input two values and reduces them into one. 

In [None]:
spark.sparkContext.parallelize(range(1, 21)).reduce(lambda x, y: x + y) # 210

You can also use this to get something like the longest word in our set of words that we defined a
moment ago. The key is just to define the correct function:

In [None]:
def wordLengthReducer(leftWord, rightWord):
  if len(leftWord) > len(rightWord):
    return leftWord
  else:
    return rightWord

In [None]:
words.reduce(wordLengthReducer)
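
One subtlety: the collection contains two words of length 10, "Definitive" and "Processing", and this reducer returns the right-hand word on a tie. The reduce operation expects a commutative and associative function, and this one is not commutative on ties, so the answer can depend on the order in which partial results are combined. The plain-Python sketch below models that with two hand-written partitions; the partition contents and variable names are assumptions for illustration.

```python
from functools import reduce

def word_length_reducer(left_word, right_word):
    # Keep the longer word; on a tie, the right-hand word wins
    return left_word if len(left_word) > len(right_word) else right_word

# Assumed layout: the ten words split into two partitions of five
partition_a = ["Spark", "The", "Definitive", "Guide", ":"]
partition_b = ["Big", "Data", "Processing", "Made", "Simple"]

# Step 1: reduce each partition on its own
partial_a = reduce(word_length_reducer, partition_a)
partial_b = reduce(word_length_reducer, partition_b)

# Step 2: combine the partial results, in either order
a_then_b = word_length_reducer(partial_a, partial_b)
b_then_a = word_length_reducer(partial_b, partial_a)

print(a_then_b, b_then_a)
```

Either word is a correct "longest word". If a single stable answer matters, one option is to break ties explicitly, for example by comparing `(len(word), word)` tuples.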