###### What is RDD (Resilient Distributed Dataset)?

An RDD (Resilient Distributed Dataset) is the fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects. Immutable means that once you create an RDD you cannot change it. The data in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

In other words, an RDD is a collection of objects similar to a Python list, with the difference that an RDD is computed by several processes spread across multiple physical servers (also called nodes) in a cluster, while a Python collection lives and is processed in a single process.

Additionally, RDDs abstract away the partitioning and distribution of the data so that computations can run in parallel on several nodes. When applying transformations to an RDD, we don’t have to worry about parallelism because PySpark provides it by default.

This PySpark RDD tutorial describes the basic operations available on RDDs, such as map(), filter(), persist(), and many more. It also explains Pair RDD functions, which operate on RDDs of key-value pairs, such as groupByKey() and join().

Note: RDDs can have a name and a unique identifier (id).

###### Creating RDD

RDDs are created primarily in two different ways:

   * parallelizing an existing collection and
   * referencing a dataset in an external storage system (```HDFS```, ```S3``` and many more). 

Before we look at examples, let’s first initialize a SparkSession using the builder pattern defined in the SparkSession class. While initializing, we need to provide the master and the application name, as shown below. In a real application, you would pass the master through spark-submit instead of hardcoding it in the Spark application.




In [3]:
from pyspark.sql import SparkSession
spark:SparkSession = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate()    



```master()``` – If you are running on a cluster, pass your master name as the argument to master(). Usually it is either yarn (Yet Another Resource Negotiator) or mesos, depending on your cluster setup.

  * Use local[x] when running locally, without a cluster manager. x should be an integer greater than 0; it is the number of worker threads Spark runs, which also determines the default parallelism (for example, how many partitions parallelize() creates). Ideally, x should be the number of CPU cores you have.

```appName()``` – Used to set your application name.

```getOrCreate()``` – Returns the existing SparkSession if there is one, and creates a new one otherwise.

Note: Creating a SparkSession object internally creates a SparkContext; only one SparkContext can be active per JVM.

###### Create RDD using sparkContext.parallelize()

You can create an RDD with the ```parallelize()``` function of SparkContext (sparkContext.parallelize()). This function takes an existing collection from your driver program and distributes it to form an RDD. It is the basic way to create an RDD and is used when you already have the data in memory, whether loaded from a file or from a database. It requires all the data to be present in the driver program before the RDD is created.

In [4]:
#Create RDD from parallelize    
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd=spark.sparkContext.parallelize(data)

For production applications, we mostly create RDDs from external storage systems such as HDFS, S3, HBase, etc. To keep this PySpark RDD tutorial simple, we create RDDs from files on the local system or from a Python list.

###### Create RDD using sparkContext.textFile()

Using the textFile() method we can read a text (.txt) file into an RDD, with each line of the file becoming one record.




In [5]:
#Create RDD from external Data source
#rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

###### Create RDD using sparkContext.wholeTextFiles()

The wholeTextFiles() function returns a PairRDD with the file path as the key and the file content as the value.


In [6]:
# Reads each file into the RDD as a single (path, content) record.
# rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

# Besides text files, we can also create an RDD from CSV, JSON, and other formats.

###### Create empty RDD using sparkContext.emptyRDD

Using the ```emptyRDD()``` method on sparkContext, we can create an RDD with no data. This method creates an empty RDD with no partitions.

In [7]:
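# Creates an empty RDD with no partitions.
# emptyRDD is a method and must be called with parentheses; a separate name
# keeps the `rdd` created with parallelize() available for later cells.
emptyRdd = spark.sparkContext.emptyRDD()
# The typed form emptyRDD[String] is Scala syntax and does not apply in PySpark.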

###### Creating empty RDD with partition

Sometimes we may need to write an empty RDD to files by partition. In this case, you should create an empty RDD with partitions.

In [8]:
#Create empty RDD with partition
rdd2 = spark.sparkContext.parallelize([],10) #This creates 10 partitions

###### RDD Parallelize

When we use the ```parallelize()```, ```textFile()```, or ```wholeTextFiles()``` methods of SparkContext to create an RDD, the data is automatically split into partitions based on resource availability. For parallelize(), the default number of partitions is the context's default parallelism; when running locally, that equals the thread count given in local[x] (all available cores with local[*]).

**getNumPartitions()** – This is an RDD function that returns the number of partitions our dataset is split into.

In [9]:
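# Rebuild the RDD from the `data` list defined earlier so this cell is self-contained
rdd = spark.sparkContext.parallelize(data)
print("initial partition count:"+str(rdd.getNumPartitions()))
# Outputs: initial partition count:1 (with master("local[1]") as configured above)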

**Set parallelize manually** – We can also set the number of partitions manually by passing it as the second parameter to these functions, for example ```sparkContext.parallelize([1,2,3,4,56,7,8,9,12,3], 10)```.

###### Repartition and Coalesce

Sometimes we may need to change the number of partitions of an RDD. PySpark provides two ways to do this. The first is the ```repartition()``` method, which shuffles data from all nodes (a full shuffle). The second is the ```coalesce()``` method, which reduces the number of partitions by merging existing ones and so moves as little data as possible. For example, if you have data in 4 partitions, ```coalesce(2)``` moves data from just 2 of them.

Both functions take the target number of partitions, as shown below. Note that repartition() is a very expensive operation because it shuffles data from all nodes in the cluster.

In [10]:
# Repartition an RDD built from the `data` list defined earlier
reparRdd = spark.sparkContext.parallelize(data).repartition(4)
print("re-partition count:"+str(reparRdd.getNumPartitions()))
# Outputs: re-partition count:4
# Note: repartition() and coalesce() return a new RDD; the original is unchanged.