# Introduction to RDDs

1. RDD stands for Resilient Distributed Dataset.
2. The RDD API (Application Programming Interface) was released in 2011.
3. An RDD is an immutable, resilient, distributed collection of data elements that is partitioned across the nodes in your cluster.
4. RDD operations are grouped into Transformations and Actions, and they are executed in parallel.
5. RDDs are fault-tolerant.
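
To make the ideas of partitioning, transformations, and actions concrete, here is a minimal plain-Python model. It is only a sketch and does not use Spark: the helper `split_into_partitions` and the choice of 2 partitions are assumptions made for illustration.

```python
# Plain-Python model of a partitioned collection (not Spark itself).
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

words = ["Spark", "is", "too", "Awesome!"]
partitions = split_into_partitions(words, 2)  # hypothetical choice of 2 partitions

# A transformation builds a new collection; the original is never modified.
upper_partitions = [[w.upper() for w in part] for part in partitions]

# An action combines the partitions into a single result, here a count.
total = sum(len(part) for part in upper_partitions)

print(partitions)        # [['Spark', 'is'], ['too', 'Awesome!']]
print(upper_partitions)  # [['SPARK', 'IS'], ['TOO', 'AWESOME!']]
print(total)             # 4
```

In this model each inner list could be processed independently of the others, which is what lets the work run in parallel.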

# Why Learn RDDs?
1. You want low-level transformations and actions to control your datasets.
2. You want to handle unstructured data that cannot be handled by structured APIs such as DataFrames and Datasets.
3. You want to optimize your Spark applications using a low-level API.

# Data Preparation

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("TestingRDDS").getOrCreate()

In [3]:
word_list = "Spark make lide a lot easier and put men into good spirits, Spark is too Awesome!".split(" ")

In [4]:
type(word_list)

list

In [5]:
print(word_list)

['Spark', 'make', 'lide', 'a', 'lot', 'easier', 'and', 'put', 'men', 'into', 'good', 'spirits,', 'Spark', 'is', 'too', 'Awesome!']


In [6]:
word_rdd = spark.sparkContext.parallelize(word_list)

In [7]:
word_data = word_rdd.collect()

In [9]:
for word in word_data:
    print(word)

Spark
make
lide
a
lot
easier
and
put
men
into
good
spirits,
Spark
is
too
Awesome!


# Distinct and Filter Transformations

In [10]:
word_rdd.count()

16

In [11]:
word_rdd.distinct().count()

15
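
The two counts differ by one because a single word appears twice. A quick plain-Python check (a sketch that uses `collections.Counter` on the same sentence, without Spark) shows which word is repeated:

```python
from collections import Counter

sentence = "Spark make lide a lot easier and put men into good spirits, Spark is too Awesome!"
word_list = sentence.split(" ")

# Count how often each word occurs, then keep only the repeated ones.
counts = Counter(word_list)
duplicates = {word: n for word, n in counts.items() if n > 1}

print(len(word_list))  # 16 words in total
print(len(counts))     # 15 distinct words
print(duplicates)      # {'Spark': 2}
```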

In [13]:
word_data = word_rdd.collect()
for word in word_data:
    print(word)

Spark
make
lide
a
lot
easier
and
put
men
into
good
spirits,
Spark
is
too
Awesome!


In [14]:
words_unique_rdd = word_rdd.distinct()

In [15]:
for word in words_unique_rdd.collect():
    print(word)

put
good
a
lot
and
men
Awesome!
Spark
make
lide
into
is
easier
spirits,
too


In [16]:
def wordStartsWith(word, letter):
    return word.startswith(letter)

In [19]:
word_rdd.filter(lambda word: wordStartsWith(word,"S")).collect()

['Spark', 'Spark']

In [20]:
# Map each word to a tuple: (word, first letter, whether the word starts with "S")
words_trd_rdd = word_rdd.map(lambda word: (word, word[0],wordStartsWith(word, "S")))

In [21]:
for element in words_trd_rdd.collect():
    print(element)

('Spark', 'S', True)
('make', 'm', False)
('lide', 'l', False)
('a', 'a', False)
('lot', 'l', False)
('easier', 'e', False)
('and', 'a', False)
('put', 'p', False)
('men', 'm', False)
('into', 'i', False)
('good', 'g', False)
('spirits,', 's', False)
('Spark', 'S', True)
('is', 'i', False)
('too', 't', False)
('Awesome!', 'A', False)
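
Note that `'spirits,'` is marked `False`: `str.startswith` is case-sensitive, so a lowercase `s` does not match `"S"`. If a case-insensitive match is wanted, one option is a small helper like the one below. The name `starts_with_ignore_case` is hypothetical and not part of the notebook, and the sketch uses plain lists rather than an RDD.

```python
def starts_with_ignore_case(word, letter):
    # Hypothetical helper: compare lowercased forms so "S" also matches "s".
    return word.lower().startswith(letter.lower())

words = "Spark make lide a lot easier and put men into good spirits, Spark is too Awesome!".split(" ")

case_sensitive = [w for w in words if w.startswith("S")]
case_insensitive = [w for w in words if starts_with_ignore_case(w, "S")]

print(case_sensitive)    # ['Spark', 'Spark']
print(case_insensitive)  # ['Spark', 'spirits,', 'Spark']
```

The same helper could be passed to `word_rdd.filter` inside a lambda, in the same way as `wordStartsWith`.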


In [23]:
word_rdd.flatMap(lambda word: list(word)).take(10)

['S', 'p', 'a', 'r', 'k', 'm', 'a', 'k', 'e', 'l']
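
The difference between `map` and `flatMap` is easier to see side by side. The following plain-Python sketch (no Spark, with `itertools.chain` standing in for the flattening step) models what each would produce for the first three words:

```python
from itertools import chain

words = ["Spark", "make", "lide"]

# map-like: one output element per input element, so we get a list of lists.
mapped = [list(word) for word in words]

# flatMap-like: the per-element lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(list(word) for word in words))

print(mapped)           # [['S', 'p', 'a', 'r', 'k'], ['m', 'a', 'k', 'e'], ['l', 'i', 'd', 'e']]
print(flat_mapped[:10]) # ['S', 'p', 'a', 'r', 'k', 'm', 'a', 'k', 'e', 'l']
```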

# Sort By Key Transformation

In [24]:
countries_list = [("India",91),("USA",4),("Greece",13)]
countries_rdd = spark.sparkContext.parallelize(countries_list)

In [25]:
srtd_countries_list = countries_rdd.sortByKey().collect()

In [26]:
for country in srtd_countries_list:
    print(country)

('Greece', 13)
('India', 91)
('USA', 4)


In [27]:
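# Swap each (country, number) pair to (number, country) so the number becomes the key, then sort in descending order
srtd_countries_list = countries_rdd.map(lambda c: (c[1],c[0])).sortByKey(False).collect()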

In [28]:
for country in srtd_countries_list:
    print(country)

(91, 'India')
(13, 'Greece')
(4, 'USA')
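
For comparison, the two orderings above can be reproduced on the plain Python list with `sorted`. This is only a local sketch of what the sorts compute, not a replacement for the distributed version:

```python
countries_list = [("India", 91), ("USA", 4), ("Greece", 13)]

# Like sortByKey(): order by the first tuple element (the country name).
by_name = sorted(countries_list, key=lambda c: c[0])

# Like swapping and then sortByKey(False): order by the number, descending.
by_number_desc = sorted(countries_list, key=lambda c: c[1], reverse=True)

print(by_name)         # [('Greece', 13), ('India', 91), ('USA', 4)]
print(by_number_desc)  # [('India', 91), ('Greece', 13), ('USA', 4)]
```

PySpark RDDs also provide a `sortBy` method that takes a key function, which may make the swap unnecessary. It is not used in this notebook.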


# RDD Actions

In [29]:
num_list = [1,5,2,3,4]

In [31]:
result = spark.sparkContext.parallelize(num_list).reduce(lambda x, y:x+y)
print(result)

15


In [32]:
def sumList(x,y):
    print(x,y)
    return x + y

In [33]:
result = spark.sparkContext.parallelize(num_list).reduce(lambda x,y: sumList(x,y))
print(result)

1 5
6 2
8 7
15
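
The pair `8 7` in the trace shows that `reduce` does not simply walk the list from left to right: `7` is the sum of `3` and `4`, computed separately. One explanation consistent with this trace is that the data was split into four partitions, each partition was reduced on its own, and the partial results were then merged. The sketch below models that in plain Python. The partitioning `[[1], [5], [2], [3, 4]]` is an assumption, since the real split depends on how many partitions Spark creates on your machine.

```python
from functools import reduce

trace = []

def sum_list(x, y):
    trace.append((x, y))  # record each merge call instead of printing it
    return x + y

# Assumed partitioning (hypothetical; depends on the number of partitions).
partitions = [[1], [5], [2], [3, 4]]

# Step 1: each partition is reduced independently (by workers in Spark).
partials = [reduce(lambda a, b: a + b, part) for part in partitions]

# Step 2: the partial results are merged with the same function.
result = reduce(sum_list, partials)

print(partials)  # [1, 5, 2, 7]
for x, y in trace:
    print(x, y)  # 1 5, then 6 2, then 8 7
print(result)    # 15
```

Under this assumption, the call that adds `3` and `4` happens inside a worker, which would explain why no `3 4` line appears in the notebook output.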


In [34]:
def wordLengthReducer(leftword, rightword):
    if len(leftword) > len(rightword):
        return leftword
    else:
        return rightword

In [35]:
word_rdd.reduce(wordLengthReducer)

'Awesome!'
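
`'Awesome!'` and `'spirits,'` are both 8 characters long, and `wordLengthReducer` returns the right-hand word on a tie, so the answer depends on the order in which words are compared. Spark's `reduce` expects a commutative and associative function, so a tie-breaking rule that ignores order is safer. The plain-Python sketch below shows the effect. `deterministic_reducer` is a hypothetical alternative, not part of the notebook.

```python
from functools import reduce

def word_length_reducer(left_word, right_word):
    # Same rule as wordLengthReducer: on a tie, the right-hand word wins.
    if len(left_word) > len(right_word):
        return left_word
    return right_word

def deterministic_reducer(left_word, right_word):
    # Hypothetical variant: break length ties by comparing the words themselves.
    return max(left_word, right_word, key=lambda w: (len(w), w))

words = "Spark make lide a lot easier and put men into good spirits, Spark is too Awesome!".split(" ")
reversed_words = list(reversed(words))

print(reduce(word_length_reducer, words))             # Awesome!
print(reduce(word_length_reducer, reversed_words))    # spirits,
print(reduce(deterministic_reducer, words))           # spirits,
print(reduce(deterministic_reducer, reversed_words))  # spirits,
```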

In [36]:
word_rdd.first()

'Spark'

In [37]:
spark.sparkContext.parallelize(range(1,21)).max()

20

In [38]:
spark.sparkContext.parallelize(range(1,21)).min()

1