# Abstracting Data with RDDs

## Introduction:

#### What are RDDs?

RDDs (__Resilient Distributed Datasets__) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. The RDD is the most fundamental dataset type in Apache Spark: operations on a Spark DataFrame are translated into a highly optimised execution of transformations and actions on RDDs.

The data is split into chunks (partitions), for example based on a key, and dispersed across the executor nodes. Because the chunks live on multiple nodes, functional calculations can run over the whole dataset quickly and in parallel. RDDs are also highly resilient: each RDD keeps a log of the execution steps that were applied to each chunk (its lineage), so a chunk lost to a node failure or an execution error can be recomputed quickly. Persisted chunks can additionally be replicated across executor nodes when a replicated storage level is chosen.
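The idea of recovering a lost chunk by replaying the logged execution steps can be sketched in plain Python. This is a toy illustration only, not Spark's implementation: the `ToyRDD` class and its method names are hypothetical and exist purely to show how a recorded list of steps makes recomputation possible.

```python
# Toy illustration only: ToyRDD is a hypothetical class, NOT part of Spark.
class ToyRDD:
    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # input chunks, never modified
        self.lineage = tuple(lineage)               # recorded per-element steps

    def map(self, func):
        # Lazy transformation: return a new ToyRDD with one more recorded step.
        return ToyRDD(self.source_partitions, self.lineage + (func,))

    def compute_partition(self, index):
        # Replay every recorded step, in order, on one source chunk.
        chunk = self.source_partitions[index]
        for func in self.lineage:
            chunk = [func(item) for item in chunk]
        return chunk


source = ([1, 2], [3, 4])
toy = ToyRDD(source).map(lambda x: x * 2).map(lambda x: x + 1)

# Compute both chunks, "lose" the second one, then rebuild it from the lineage.
results = [toy.compute_partition(i) for i in range(len(source))]
results[1] = None                      # simulate a lost chunk
results[1] = toy.compute_partition(1)  # recompute only the lost chunk
print(results)
```

Only the lost chunk is recomputed; the other chunk is left untouched, which is the property that makes this style of recovery cheap.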

__This notebook will then go through the basics of using PySpark__.

## 1 PySpark Machine Configuration:

Here, the session is limited to two processing cores per executor, which is set up by the following code.

In [1]:
%%configure
{
    "executorCores" : 2
}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
17,,pyspark,idle,,,


In [2]:
from pyspark.sql.types import *

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
18,,pyspark,idle,,,✔


SparkSession available as 'spark'.


## 2 Creating RDDs:

There are two ways to do this:
- 1) Use the "parallelize()" method on a collection, such as a list or an array of elements.
- 2) Reference one or more files that are located either locally or on an external source.

Here, method (1) creates a parallelised collection, which allows Spark to distribute the data to all the executor nodes and operate on it in parallel. For example, an operation done in parallel can be the "reduceByKey(add)" method, applied as "myRDD.reduceByKey(add)" with "add" imported from Python's "operator" module.
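As a rough sketch of what "reduceByKey(add)" computes, the plain-Python function below groups values by key and folds each group with `add`. The helper names `reduce_by_key_local` and `reduce_by_key_spark` and the sample data are hypothetical; the Spark variant assumes the notebook's existing SparkContext is passed in as `sc` and is not called here, so the sketch runs without a cluster.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def reduce_by_key_local(pairs, func):
    """Plain-Python stand-in for the result of RDD.reduceByKey(func)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold the values of each key into a single value.
    return {key: reduce(func, values) for key, values in grouped.items()}


def reduce_by_key_spark(sc, pairs):
    """Same computation on a cluster; `sc` is assumed to be a SparkContext."""
    return sc.parallelize(pairs).reduceByKey(add).collect()


# Hypothetical sample data with repeated keys.
scores = [('Mike', 19), ('June', 18), ('Mike', 1), ('June', 2)]
local_totals = reduce_by_key_local(scores, add)
print(local_totals)
```

On a cluster, Spark applies the same fold within each partition first and then combines the partial results, so the function passed in should be associative and commutative.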

The ".take()" method returns the first n elements of the RDD to the driver, where they are shown in the console. Please note that a more common approach in PySpark is to use the ".collect()" method, which returns every element. However, this may prove to be taxing on most systems if the dataset is huge. For inspecting data, the ".take()" method is the better choice as it is more efficient.

In [3]:
# Example of creating an RDD:
myRDD = sc.parallelize( [('Mike', 19), ('June', 18), 
                         ('Rachel', 16), ('Rob', 18),
                        ('Scott', 17)] )


In [4]:
# Inspect: to view what is inside an RDD.
myRDD.take(5)

[('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]

## 3 Reading Data From Files:

Here, data will be read from files. The datasets can be found in the "Datasets" folder that is available with this notebook download.

The Datasets are:
- 1) airport-codes.txt
- 2) departure_delays.csv

Source:
- https://openflights.org/data.html
- https://catalog.data.gov/dataset/airline-on-time-performance-and-causes-of-flight-delays-on-time-data

NOTE: Make sure your Current Working Directory is correct.

In [5]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)

In [6]:
# Load in the dataset: airport-codes.txt
myRDD = (
    sc.textFile(folder_pathway + '/airport-codes.txt',
                minPartitions=4,
                use_unicode=True
    ).map(lambda element: element.split("\t"))
)

#### What is happening here?

After the path to the file in the "Datasets" folder, two additional parameters are passed:
- "minPartitions" suggests the minimum number of partitions that make up the RDD. When it is not specified, the Spark engine determines the number itself, typically based on file size. The user can set this value for performance reasons.
- "use_unicode" ensures the lines are processed as Unicode strings.

Finally, the ".map()" function transforms the data from a list of strings into a list of lists. The transformation is defined by a lambda function, which uses Python's string "split()" method to split each line on the delimiter, here a tab ("\t").
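To see what the lambda does to each line, the sketch below applies it to two tab-delimited strings using Python's built-in `map`. The two sample lines mirror the first rows shown in the notebook output; the variable names are chosen only for illustration.

```python
# Two raw lines as they would appear in a tab-delimited file.
lines = [
    "City\tState\tCountry\tIATA",
    "Abbotsford\tBC\tCanada\tYXX",
]

# The same lambda that is passed to .map() on the RDD.
split_on_tab = lambda element: element.split("\t")

# Python's built-in map applies the function to every line,
# turning a list of strings into a list of lists.
rows = list(map(split_on_tab, lines))
print(rows)
```

The RDD version works the same way per element, except that Spark applies the function to each partition in parallel and only when an action such as ".take()" is called.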

In [7]:
# Inspect: 
myRDD.take(5)

[['City', 'State', 'Country', 'IATA'], ['Abbotsford', 'BC', 'Canada', 'YXX'], ['Aberdeen', 'SD', 'USA', 'ABR'], ['Abilene', 'TX', 'USA', 'ABI'], ['Akron', 'OH', 'USA', 'CAK']]

In [8]:
# Count the number of rows in the RDD:
myRDD.count()

527

In [9]:
# Determine the number of partitions that support the RDD:
myRDD.getNumPartitions()

4

#### Setting the number of partitions when creating an RDD:

The important aspect of setting the number of partitions on an RDD is that more partitions allow higher parallelism, which may improve query performance.
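The link between partitions and parallelism can be sketched without a cluster. In the toy example below, rows are dealt into four partitions and each partition is counted by its own worker thread before the partial counts are summed. The helper names and the round-robin split are assumptions made for illustration; Spark's own partitioning of a file works differently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists (toy partitioner)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def parallel_count(partitions):
    """Count each partition in its own worker, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(len, partitions))
    return sum(partial_counts)


rows = list(range(527))                      # stand-in for 527 rows of data
partitions = split_into_partitions(rows, 4)  # 4 partitions, as with minPartitions=4
partition_sizes = [len(p) for p in partitions]
total = parallel_count(partitions)
print(partition_sizes, total)
```

Each worker only touches its own partition, so adding partitions gives more units of work that can run at the same time, up to the number of cores available.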

## 3.1 Using a Larger Dataset:

Here, the dataset used will be the "departure_delays.csv" file.

In [10]:
# Load in the dataset: departure_delays.csv
myRDD = (
    sc.textFile(folder_pathway + '/departure_delays.csv').map(lambda element: element.split(","))  # assumes a comma-delimited CSV
)

In [11]:
# Count the number of rows in the RDD: this took about 2 secs.
myRDD.count()

1391579

In [12]:
# Determine the number of partitions that support the RDD:
myRDD.getNumPartitions()

1

#### Let's see how increasing the number of partitions can help the overall performance.

In [13]:
# Load in the dataset: departure_delays.csv
myRDD = (
    sc.textFile(folder_pathway + '/departure_delays.csv',
                minPartitions=8
    ).map(lambda element: element.split(","))  # assumes a comma-delimited CSV
)

In [14]:
# Count the number of rows in the RDD: slightly faster than 2 secs.
myRDD.count()

1391579

In [15]:
# Determine the number of partitions that support the RDD:
myRDD.getNumPartitions()

8
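The timings quoted in the cell comments above are rough observations. One way to measure them more carefully is a small timing helper such as the sketch below. The name `time_call` is hypothetical, and a stand-in workload is used so the block runs without a cluster; in the notebook the call would be something like `time_call(myRDD.count)`.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with e.g. time_call(myRDD.count) in the notebook.
result, elapsed = time_call(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f}s")
```

A single run can be noisy, so repeating the measurement a few times for each partition setting gives a fairer comparison.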

## 4