<img src="uva_seal.png">  

## Resilient Distributed Datasets (RDDs)

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 27, 2023

---  


### SOURCES
Learning Spark 1st Ed., Chapter 3: Programming with RDDs   

### OBJECTIVES
-  Basics of RDDs including transformations and actions  
-  Discuss parallelization concepts  



### CONCEPTS

- RDD  
- Transformation  
- Action  
- lazy evaluation  
- SparkSession
- Directed acyclic graph (Lineage graph)
- Set operations  
- Pipelining or chaining  
- accumulator  
- `persist()` and `cache()`  
- `parallelize()`  
- `collect()` and `take()`  
- `map()`, `filter()`, `flatMap()`  
- `reduce()`, `fold()`, `aggregate()`  
- `count()`, `countByValue()`  
- `saveAsTextFile()`, `saveAsSequenceFile()`  

---

### RDD BASICS

An *RDD* (resilient distributed dataset) is a distributed collection of elements.  
It is the most basic abstraction in Spark and has been part of Spark since its beginning.

All work consists of:  
- RDD creation  
- RDD transformation  
- RDD action (e.g., compute a result)  

**When RDDs are Useful**
- For unstructured data like documents  
- Certain models and applications require them

---
**ASIDE**  

DataFrames are more useful for structured data (e.g., tabular).  
We study them later.  
Under the hood, DataFrames are built on top of RDDs: calling `.rdd` on a DataFrame returns an RDD of `Row` objects.

---

Spark “magically” handles distributing data and code across the cluster and parallelizing operations.

Spark is "lazy."  
It doesn’t actually do any work until it encounters an *action*, for example a `count()`.  
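
Lazy evaluation can be tried without a cluster. The sketch below is a plain-Python analogy, not Spark code: Python's built-in `map()` stands in for a transformation and `list()` stands in for an action. The helper `tag_and_double` and the `calls` list are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
calls = []  # records every time the "transformation" function actually runs

def tag_and_double(x):
    """Hypothetical helper: log the call, then double the value."""
    calls.append(x)
    return 2 * x

data = [1, 2, 3, 4]

# "Transformation": map() only builds a lazy iterator; no element is processed yet.
doubled = map(tag_and_double, data)
calls_before_action = len(calls)

# "Action": consuming the iterator forces the work to happen.
result = list(doubled)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

Nothing is recorded in `calls` until `list()` consumes the iterator, which mirrors how Spark defers work until it reaches an action.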

**Directed Acyclic Graph (DAG)** 

Spark creates a logical plan, or roadmap, of the steps in a job so that it can optimize performance.  
This plan is called a *directed acyclic graph* (DAG).  
Spark makes these optimizations without help from the user.  

The DAG defines the steps of the program.  
It consists of RDD lineages, which can be used to re-create an RDD in case of failure.
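
One rough way to picture lineage, assuming a toy model in plain Python rather than Spark's internals: keep the source data and the ordered list of transformations, and replay them whenever a result is lost. The names `lineage` and `replay` are hypothetical.

```python
# Toy model of lineage-based recovery (an illustration, not Spark's implementation).
source = [1, 2, 3, 4, 5, 6]

# The "lineage": an ordered record of the transformations applied to the source.
lineage = [
    ("map", lambda x: x * 10),
    ("filter", lambda x: x > 20),
]

def replay(data, steps):
    """Recompute a result by applying each recorded step in order."""
    for kind, fn in steps:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

result = replay(source, lineage)

# Simulate losing the computed result, then rebuild it from the lineage.
result = None
recovered = replay(source, lineage)
print(recovered)
```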


**Example DAG**

<img src="dag.png" size=100> 


When testing/debugging code, it can be helpful to call `count()` to force Spark to evaluate results.  
This gives a sense of what breaks and how long things take.  

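A *transformation* creates a new RDD from an existing one.  

An *action* returns a result that is not an RDD (for example, a number or a Python list) to the driver, or writes data to storage.  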

RDDs are created in two ways:   
1. Loading an external dataset (with `textFile()`, for example)
2. Distributing a collection of objects from the driver program (with `parallelize()`)  

For example:
```
nums = sc.parallelize([1,2,3,4])
```

The `SparkSession` is a single entry point for working with Spark.  
It was introduced in Spark 2.0 to unify and simplify the multiple contexts used previously (such as `SQLContext` and `HiveContext`).  
For working with RDDs, the `SparkContext` is required. It is available as the `sparkContext` attribute of the `SparkSession`.  
The example below illustrates their use. 

**Example of Transformation: Filter on text**

In [29]:
# import Spark Session from pyspark.sql library
from pyspark.sql import SparkSession

# create SparkSession entry point
spark = SparkSession.builder.getOrCreate()

# Spark Context is needed for working with RDDs. Extract it from Spark Session
sc = spark.sparkContext

# Read in a text file
lines = sc.textFile("README.txt")

# Filter the data by applying a lambda function
pythonLines = lines.filter(lambda line: "Python" in line)

# Collect the filtered data to the driver
py = pythonLines.collect()

# For each line of text, print the index and text
for i, p in enumerate(py):
    print('line: {} text: {}'.format(i,p))

line: 0 text: Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
line: 1 text: Interactive Python Shell
line: 2 text: Alternatively, if you prefer Python, you can use the Python shell:


### Useful Operations on RDDs

Store or “persist” an RDD by calling  

`RDD.persist()` 

`cache()` is the same as `persist()` with the default storage level
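
Persisting matters because an RDD that is not persisted is recomputed each time an action uses it. The sketch below is a plain-Python analogy, not Spark code: a generator plays the role of an unpersisted RDD and a materialized list plays the role of a persisted one. The names `expensive_square`, `squares`, and `stats` are made up for illustration.

```python
# Plain-Python analogy for persist()/cache() (not Spark code).
stats = {"calls": 0}  # counts how often the expensive step runs

def expensive_square(x):
    """Hypothetical costly transformation."""
    stats["calls"] += 1
    return x * x

data = [1, 2, 3]

def squares():
    """Lazily recompute the squares from the source data on every use."""
    return (expensive_square(x) for x in data)

# Without persisting: each "action" replays the whole computation.
n = sum(1 for _ in squares())
total = sum(squares())
calls_without_cache = stats["calls"]

# With persisting: materialize once, then reuse the stored result.
stats["calls"] = 0
cached = list(squares())
n_cached = len(cached)
total_cached = sum(cached)
calls_with_cache = stats["calls"]

print(calls_without_cache, calls_with_cache)
```

Two actions on the unpersisted data run the expensive step twice per element, while the persisted version runs it once per element.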

`collect()`  
Retrieve the entire RDD on the driver.  
Be careful with large RDDs, as the results need to fit in memory on a single machine!

`take()`  
Retrieve a small number of elements from the RDD (the user specifies how many).  
NOTE: values may NOT be in the order you expect

`first()`  
Retrieve first element from RDD  

`saveAsTextFile()`, `saveAsSequenceFile()`, `…`  
Save contents of RDD as a file. Different function call depending on file storage type.


### Some Basic Transformations 

`map()`  
Applies transform to each element in RDD  

`flatMap()`  
Applies a function that returns a list for each element, then flattens the results into a single collection (e.g., tokenize sentences into words)  

`filter()`  
Return new RDD with only records meeting condition

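`parallelize()`  
Distributes a local collection to the workers, creating an RDD. Strictly speaking, this is a `SparkContext` method that creates an RDD rather than a transformation of an existing RDD.  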
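
The difference between `map()`, `flatMap()`, and `filter()` can be seen on an ordinary Python list. This is a local sketch of the semantics only, with made-up sample sentences; on an RDD the same logic would run across partitions.

```python
from itertools import chain

sentences = ["spark is lazy", "rdds are distributed"]

# Like map(): exactly one output per input, so splitting gives a list of lists.
mapped = [s.split(" ") for s in sentences]

# Like flatMap(): the per-element lists are flattened into one collection.
flat_mapped = list(chain.from_iterable(s.split(" ") for s in sentences))

# Like filter(): keep only the elements that meet the condition.
filtered = [s for s in sentences if "lazy" in s]

print(mapped)
print(flat_mapped)
print(filtered)
```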


**Example of text processing with `map`**

Read in a text file. It happens to be pipe delimited.

In [8]:
pipe = sc.textFile("pipe_delim_data.txt")

In [9]:
pipe.take(2)

['10|105|-20|mmHg|4', '12|101|-55|mmol|5']

Parse the columns by splitting on pipe delimiter

In [10]:
pipe_clean = pipe.map(lambda x: x.split('|'))
pipe_clean.take(2)

[['10', '105', '-20', 'mmHg', '4'], ['12', '101', '-55', 'mmol', '5']]

Notice the split converts strings to lists.

Perhaps you want only a subset of the columns, like the first and third columns.    
The first `map` splits the strings, and the second `map` does the subsetting, placing data into tuples.  

In [11]:
# keep the first and third columns (indices 0 and 2)
pipe_clean_first_and_third = pipe.map(lambda x: x.split('|')) \
                                 .map(lambda x: (x[0], x[2]))

pipe_clean_first_and_third.take(3)

[('10', '-20'), ('12', '-55'), ('8', '-35.7')]

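To keep the first column together with the third through last columns, joined into one comma-separated string:

In [28]: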
pipe_clean_first_and_three_to_end = pipe.map(lambda x: x.split('|')) \
                           .map(lambda x: (x[0], ','.join(x[2:])))

pipe_clean_first_and_three_to_end.take(3)

[('10', '-20,mmHg,4'), ('12', '-55,mmol,5'), ('8', '-35.7,celsius,2')]

### Complexity of Transformations

Some transformations are simpler than others.

For example, `filter()` can operate independently on each data partition, and the results can be combined. This is sometimes called a *narrow transformation*.

Computing a median on the data, however, is more complex, since it requires ordering all the data. This means data would need to be shuffled across the cluster, which is an expensive operation.  This is sometimes called a *wide transformation*.

It is preferable to keep transformations as simple as possible.
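
To see why a median cannot be computed partition by partition, consider this plain-Python sketch. The nested lists stand in for partitions and the values are made up; it is an illustration of the idea, not Spark code.

```python
from statistics import median

# Three hypothetical partitions of one dataset.
partitions = [[7, 1, 9], [4, 8], [3, 6, 2, 5]]

# Narrow: filter each partition on its own, then simply combine the results.
filtered_parts = [[x for x in part if x % 2 == 0] for part in partitions]
evens = [x for part in filtered_parts for x in part]

# Wide: combining per-partition medians does NOT give the overall median.
median_of_medians = median(median(part) for part in partitions)

# The correct median needs all values brought together (a shuffle in Spark).
true_median = median(x for part in partitions for x in part)

print(evens, median_of_medians, true_median)
```

The filter gives the right answer from independent per-partition work, while the per-partition medians combine to 6.0 instead of the true median of 5.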

**Example of Distributing Data to Workers with `parallelize()`**

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1,2,3,4])

# Show this object is RDD data type
print(type(nums))

### Partitions

Partitions support splitting data into multiple pieces which can be computed in parallel.  
This can speed up jobs.

Each partition (block of data) lives on a single machine. 

Each worker node in a Spark cluster can hold one or more partitions.

When running locally, Spark by default splits the data into as many partitions as there are CPU cores on the system, or into the number specified when the `SparkSession` was created.
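
The sketch below mimics the idea of partitioned, parallel work in plain Python. It assumes a simple contiguous split and a thread pool, which is not necessarily how Spark assigns elements to partitions; `split_into_partitions` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut data into contiguous, near-equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

data = list(range(10))
parts = split_into_partitions(data, 3)

# Each partition is summed independently; the partial results are then combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, parts))
total = sum(partial_sums)

print(parts, partial_sums, total)
```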

**Example:** Run job locally with 1 partition

In [56]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName('partition_example.com') \
        .master("local[1]").getOrCreate()

23/08/27 18:23:05 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [57]:
# create RDD 
rdd1  = sc.parallelize((0,25,10,2,4,3,6,8,3,4,0,20,10,2,4,3,6,8,3,4,10,20,10,2,4,3,6,8,3,4))

# method to get number of partitions
print(rdd1.getNumPartitions())

1


It is possible to change the number of partitions:

In [54]:
# Change partitions to 10
new_rdd = rdd1.repartition(10)
print(new_rdd.getNumPartitions())

10


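Save the repartitioned data. Despite the `.txt` name, `saveAsTextFile()` writes a directory containing one part file per partition. Notice the number of part files matches the number of partitions, but some are empty.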

In [55]:
new_rdd.saveAsTextFile('repartitioned_data.txt')

                                                                                

Spark tries to be strategic in how it uses partitions. Later, we will learn how to tune the number of partitions manually.

#### Set Operations

In [None]:
list1 = sc.parallelize(['cat','dog','baby'])
list2 = sc.parallelize(['giraffe','baby'])

# take the union of two RDDs
list1.union(list2).collect()

Notice that `union()` does not filter duplicates: `'baby'` appears twice in the result.  

Also notice we can “chain” or “pipeline” commands in sequence  

Let’s get the distinct list from the union:

In [None]:
list1.union(list2).distinct().collect()

NOTE: `distinct()` is expensive as it requires shuffling all data over the network  

Shuffling: the process of redistributing data across partitions  
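
A shuffle can be pictured as routing every value to a target partition chosen from the value itself, so that equal values end up together. The plain-Python sketch below assumes a CRC32-based routing rule purely for illustration; it does not reproduce Spark's partitioner.

```python
import zlib

# Two hypothetical input partitions (the union from the example above).
partitions = [["cat", "dog", "baby"], ["giraffe", "baby"]]
num_out = 2

def target(value):
    """Pick an output partition from the value, so duplicates always meet."""
    return zlib.crc32(value.encode("utf-8")) % num_out

# The "shuffle": every value moves to the partition chosen by target().
shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for value in part:
        shuffled[target(value)].append(value)

# After the shuffle, duplicates can be dropped locally within each partition.
distinct_parts = [sorted(set(part)) for part in shuffled]
distinct_values = sorted(v for part in distinct_parts for v in part)

print(distinct_values)
```

Every element may have to move to another partition before duplicates can be removed, which is why `distinct()` is costly on a real cluster.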

#### Actions

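`reduce()`  
Combines the elements of the RDD, two at a time, into a single result of the same type. The function should be commutative and associative so that it gives the same answer however the data is partitioned.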

In [None]:
# build an RDD and sum the integers, two at a time

l1 = sc.parallelize([1,2,3,4])
# named `total` to avoid shadowing Python's built-in sum()
total = l1.reduce(lambda x, y: x + y)

In [None]:
print('sum: {}'.format(total))
print('l1 type: {}'.format(type(l1)))
print('sum type: {}'.format(type(total)))

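`fold()`  
Similar to `reduce()`, but takes a “zero value” that acts as the identity for the operation (for example, 0 for addition or 1 for multiplication)  

`aggregate()`  
Similar to `reduce()` and `fold()`, but the result can be a different type from the elements. It uses:  
1. an initial (zero) value 
2. a function that combines elements into the running result within each partition
3. a function that merges the results across partitions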
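
The semantics of `fold()` and `aggregate()` can be sketched in plain Python by treating nested lists as partitions. The functions below are simplified stand-ins written for this illustration, not the PySpark implementations, and the data is made up.

```python
from functools import reduce

# Two hypothetical partitions of the data [1, 2, 3, 4].
partitions = [[1, 2], [3, 4]]

def fold(parts, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    per_partition = [reduce(op, part, zero) for part in parts]
    return reduce(op, per_partition, zero)

def aggregate(parts, zero, seq_op, comb_op):
    """Build a result per partition with seq_op, then merge them with comb_op."""
    per_partition = [reduce(seq_op, part, zero) for part in parts]
    return reduce(comb_op, per_partition, zero)

# fold with the identity for addition gives the plain sum.
total = fold(partitions, 0, lambda x, y: x + y)

# aggregate can return a different type: here a (sum, count) tuple for a mean.
sum_count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # across partitions
)
mean = sum_count[0] / sum_count[1]

# A zero value that is not an identity is counted once per partition and once more at the merge.
skewed = fold(partitions, 1, lambda x, y: x + y)

print(total, sum_count, mean, skewed)
```

The last line shows why the zero value should be a true identity: with a zero of 1 and two partitions, this sketch returns 13 rather than 10.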

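`countByValue()`  
Counts how many times each distinct value occurs and returns the counts to the driver as a dictionary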

In [None]:
nums = sc.parallelize([1,2,3,3,4])
cv = nums.countByValue()

print('cv[1]: {}'.format(cv[1]))
print('cv[3]: {}'.format(cv[3]))
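
Conceptually, the counts can be computed within each partition and then merged. The sketch below shows that idea with `collections.Counter` on made-up partitions; it is a local analogy, not the PySpark implementation.

```python
from collections import Counter

# Hypothetical split of [1, 2, 3, 3, 4] into two partitions.
partitions = [[1, 2, 3], [3, 4]]

# Count within each partition, then merge the partial counts.
partial_counts = [Counter(part) for part in partitions]
cv = sum(partial_counts, Counter())

print('cv[1]: {}'.format(cv[1]))
print('cv[3]: {}'.format(cv[3]))
```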

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Use `reduce()` to compute the product of the odd numbers from 1 through 15.

2) Compute the intersection of these two RDDs, `collect()` the results, and print them.

In [None]:
rdd1 = sc.parallelize(['cat','dog','baby'])
rdd2 = sc.parallelize(['giraffe','baby','baby'])

3) Print the elements in `rdd1` that are NOT in `rdd2`. The `subtract()` function can be helpful.

**SOLUTIONS**

In [None]:
# 1)

# range(1, 16, 2) yields the odd numbers 1, 3, ..., 15
l1 = sc.parallelize(range(1, 16, 2))
product = l1.reduce(lambda x, y: x * y)
product

In [None]:
# 2) 

rdd1 = sc.parallelize(['cat','dog','baby'])
rdd2 = sc.parallelize(['giraffe','baby','baby'])
rdd1.intersection(rdd2).collect()

In [None]:
# 3)

rdd1.subtract(rdd2).collect()