# Spark RDD

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a __resilient distributed dataset (RDD)__, which is a __collection__ of elements __partitioned__ across the nodes of the __cluster__ that can be operated on in __parallel__. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

![](imgs/SparkFlowRDD.png)

The __first__ thing a Spark program must do is create a __SparkContext__ object, which tells Spark how to access a cluster. SparkContext is the entry point to all Spark functionality. When a Spark application runs, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside executors on the worker nodes.

In [None]:
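from pyspark import SparkContext
# "local[*]": run locally with as many worker threads as logical cores; "count app": application name
sc = SparkContext("local[*]", "count app")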

In [None]:
# RDD (Resilient Distributed Dataset): parallelize distributes a local collection across the cluster
rdd = sc.parallelize(range(10))

In [None]:
# default number of partitions parallelize uses when none is given
sc.defaultParallelism

In [None]:
# shows the RDD object itself, not the data it holds
rdd

An __RDD__ is a distributed collection. Plain Python cannot iterate over it or inspect its contents directly; only Spark (through the SparkContext) can operate on it. The following code won't work, because an RDD is not a Python iterable:

In [None]:
for i in rdd:
    print(i)

If we want to work on the data in an RDD with plain Python, we have to __collect__ it, that is, bring it to the local machine. Be careful! If the dataset is big, it may not fit in the driver's memory and can crash your driver program.

In [None]:
# brings all data to local machine
local_collection = rdd.collect()

In [None]:
print(type(local_collection))
local_collection

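Instead of collecting everything, you can run operations on the RDD through Spark. Actions such as `count`, `first`, and `take` return a result to the driver, while transformations such as `filter` define a new RDD: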

In [None]:
rdd.count()

In [None]:
rdd.first()

In [None]:
rdd.take(5)

In [None]:
# the 5 largest elements, in descending order
rdd.top(5)

In [None]:
# random sample of 3 elements, without replacement
rdd.takeSample(False, 3)

In [None]:
# chaining operations: filter (a transformation) followed by collect (an action)
rdd.filter(lambda x: x % 2 == 0).collect()
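
Transformations such as `filter` are lazy: Spark only records them, and no data is processed until an action such as `collect` asks for a result. The sketch below is an analogy in plain Python rather than Spark code. It uses the built-in lazy `filter` to show the same pattern; the names `is_even` and `evaluated` are made up for this illustration.

```python
evaluated = []  # records every element the predicate actually inspects

def is_even(x):
    evaluated.append(x)
    return x % 2 == 0

# Defining the filter inspects nothing yet (comparable to a Spark transformation).
lazy_evens = filter(is_even, range(10))
seen_before = len(evaluated)

# Materializing the result forces evaluation (comparable to calling collect()).
evens = list(lazy_evens)
seen_after = len(evaluated)

print(seen_before, seen_after, evens)
```

The predicate runs only once the result is requested, which is why chaining several transformations costs nothing until an action is called.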

In [None]:
# use reduce to compute the sum of all elements in the RDD
rdd.reduce(lambda x, y: x + y)

In [None]:
from operator import add
rdd.reduce(add)
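
The function passed to `reduce` should be commutative and associative, because Spark reduces each partition separately and then combines the partial results. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation; `split_into_partitions` and `partitioned_reduce` are hypothetical helpers, and the split into 3 partitions is an assumption made for illustration.

```python
from functools import reduce
from operator import add, sub

def split_into_partitions(data, num_partitions):
    """Hypothetical helper: split data into contiguous, roughly equal chunks."""
    data = list(data)
    size, rem = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

partitions = split_into_partitions(range(10), 3)

# Addition is commutative and associative: the partitioning does not matter.
total = partitioned_reduce(add, partitions)

# Subtraction is neither: the partitioned result differs from the sequential one.
difference = partitioned_reduce(sub, partitions)
sequential_difference = reduce(sub, range(10))

print(partitions)
print(total)
print(difference, sequential_difference)
```

With `add`, the result is the same however the data is split. With `sub`, it depends on the partition boundaries, so such a function would give unreliable results in a distributed reduce.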

In [None]:
animals = sc.parallelize(["cat", "python", "cat", "snake", "snake"])

In [None]:
animal_kv = animals.map(lambda x: (x, 1))

In [None]:
animal_kv.collect()

In [None]:
# sum the 1s per key to count how often each animal occurs
animal_kv \
  .reduceByKey(add) \
  .collect()
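
Conceptually, `reduceByKey` merges the values of each key with the given function, first within each partition and then across partitions. Below is a rough plain-Python sketch of that two-step process under stated assumptions: the split of the animal pairs into two partitions is invented for illustration, and `combine_by_key` is a hypothetical helper, not part of the PySpark API.

```python
from operator import add

def combine_by_key(func, pairs):
    """Fold (key, value) pairs into a dict, merging values of equal keys with func."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

# Assumed split of the (animal, 1) pairs into two partitions.
partition_a = [("cat", 1), ("python", 1), ("cat", 1)]
partition_b = [("snake", 1), ("snake", 1)]

# Step 1: combine values per key inside each partition.
local_results = [combine_by_key(add, part) for part in (partition_a, partition_b)]

# Step 2: merge the per-partition results into the final counts.
merged = combine_by_key(
    add, (item for local in local_results for item in local.items())
)

counts = sorted(merged.items())
print(counts)
```

Combining inside each partition first means only one partial value per key and partition has to be merged in the second step, rather than every individual pair.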