# Chapter 3: Resilient Distributed Datasets (RDDs)

This notebook covers the core concepts of RDDs, including their creation, transformations, actions, persistence, and a real-world example (Word Count).

## 1. Introduction to RDDs
RDDs are the fundamental data structure in Apache Spark. They are:
- Immutable: Once created, they cannot be changed; a transformation always produces a new RDD.
- Distributed: Data is partitioned across the nodes of a cluster.
- Fault-Tolerant: Spark can recompute lost partitions from an RDD's lineage if a node fails.

### 1.1 Creating RDDs

In [1]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("RDDExamples").getOrCreate()

# Creating an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
print("RDD Elements:", rdd.collect())

# Creating an RDD from an external file
file_rdd = spark.sparkContext.textFile("word.txt")
print("File RDD Sample:", file_rdd.take(5))

RDD Elements: [1, 2, 3, 4, 5]
File RDD Sample: ['Imran in Ottawa, a journey begun,', 'Where the Ottawa River gleams beneath the sun,', 'He walks the streets, with wonder in his eyes,', 'A city of dreams, beneath autumnal skies.', '']


## 2. Transformations
Transformations create new RDDs from existing ones. They are **lazy**, meaning Spark doesn’t execute them until an action is called.
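
To make the laziness concrete, here is a minimal plain-Python sketch. It is a toy model, not Spark's implementation: `LazyDataset` and its methods are hypothetical names that only mimic the shape of the RDD API. Transformations are recorded as pending steps, and only `collect()` runs them.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, func):
        # Lazy: return a new dataset with one more recorded step.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        # Lazy as well: nothing is evaluated here.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: apply every recorded step now.
        items = self._data
        for kind, func in self._steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


calls = []

def square(x):
    calls.append(x)  # record each invocation so we can see when work happens
    return x ** 2

pipeline = LazyDataset([1, 2, 3, 4, 5]).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # no transformation has run yet
result = pipeline.collect()       # the action triggers the work
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

In this model `square` is not called while the pipeline is being built; all five calls happen inside `collect()`.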

### 2.1 map() Transformation

In [2]:
# Applying a map() transformation
squared_rdd = rdd.map(lambda x: x ** 2)
print("Squared RDD:", squared_rdd.collect())

Squared RDD: [1, 4, 9, 16, 25]


### 2.2 flatMap() Transformation

In [3]:
# Using flatMap() to split lines into words
text_rdd = spark.sparkContext.parallelize(["Hello world", "Apache Spark is great"])
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
print("Words RDD:", words_rdd.collect())

Words RDD: ['Hello', 'world', 'Apache', 'Spark', 'is', 'great']
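
For contrast, `map()` with the same function would keep one list per input line, while `flatMap()` flattens those lists into a single sequence of words. The plain-Python sketch below mirrors that difference without needing a Spark session; the variable names are illustrative only.

```python
from itertools import chain

lines = ["Hello world", "Apache Spark is great"]

# Shape of map(lambda line: line.split(" ")): one list per line.
mapped = [line.split(" ") for line in lines]

# Shape of flatMap(lambda line: line.split(" ")): one flat list of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```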


### 2.3 filter() Transformation

In [4]:
# Filtering even numbers from an RDD
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print("Even Numbers:", even_rdd.collect())

Even Numbers: [2, 4]


## 3. Actions
Actions trigger the execution of pending transformations. They either return a result to the driver program or write data to external storage.

### 3.1 collect() Action

In [5]:
# Collecting all elements of an RDD
collected_data = rdd.collect()
print("Collected Data:", collected_data)

Collected Data: [1, 2, 3, 4, 5]


### 3.2 reduce() Action

In [6]:
# Summing all elements in an RDD
sum_result = rdd.reduce(lambda x, y: x + y)
print("Sum of Elements:", sum_result)

Sum of Elements: 15
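
Because `reduce()` combines elements within each partition and then merges the partial results, the function passed to it should be commutative and associative. The sketch below simulates that two-level reduction in plain Python. The partition layouts are made up for illustration, and `partitioned_reduce` is a hypothetical helper, not a Spark function.

```python
from functools import reduce

def partitioned_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# Two possible ways the same five elements could be split into partitions.
layout_a = [[1, 2], [3, 4, 5]]
layout_b = [[1, 2, 3], [4, 5]]

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

# Addition gives the same answer for both layouts.
sum_a = partitioned_reduce(layout_a, add)
sum_b = partitioned_reduce(layout_b, add)

# Subtraction is neither commutative nor associative, so the answer
# depends on how the data happened to be partitioned.
diff_a = partitioned_reduce(layout_a, sub)
diff_b = partitioned_reduce(layout_b, sub)

print(sum_a, sum_b, diff_a, diff_b)
```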


### 3.3 count() and countByValue()

In [7]:
# Counting elements in the RDD
count = rdd.count()
print("Number of Elements:", count)

# Counting occurrences of each value
count_by_value = rdd.countByValue()
print("Count by Value:", dict(count_by_value))

Number of Elements: 5
Count by Value: {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}


## 4. Persistence
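Persisting an RDD keeps its computed partitions in memory (or on disk) so that later actions reuse them instead of recomputing the RDD from its lineage. This pays off when the same RDD is used by more than one action. Note that `cache()` is itself lazy: the data is stored the first time an action, such as `count()` below, computes it.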

In [8]:
# Persisting an RDD in memory
cached_rdd = rdd.cache()
print("Cached RDD Count:", cached_rdd.count())

Cached RDD Count: 5
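
The effect of caching can be illustrated with a small plain-Python model. This is a sketch under simplifying assumptions, not how Spark stores partitions: `ToyRDD` and `make_compute` are hypothetical names, and the model only counts how often the underlying work is repeated across several actions.

```python
class ToyRDD:
    """Toy stand-in for an RDD whose contents come from a compute function."""

    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._stored = None

    def cache(self):
        # Only marks the dataset for storage; nothing is computed yet.
        self._persist = True
        return self

    def collect(self):
        if self._stored is not None:
            return self._stored      # reuse the stored result
        result = self._compute()
        if self._persist:
            self._stored = result    # keep it for later actions
        return result


runs = {"plain": 0, "cached": 0}

def make_compute(label):
    def compute():
        runs[label] += 1             # count how often the work is redone
        return [x ** 2 for x in range(1, 6)]
    return compute

plain = ToyRDD(make_compute("plain"))
cached = ToyRDD(make_compute("cached")).cache()

# Run three actions against each dataset.
for _ in range(3):
    plain.collect()
    cached.collect()

print(runs)
```

In this model the uncached dataset redoes its work on every action, while the cached one computes once and then serves the stored result.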


## 5. Real-World Example: Word Count

In [9]:
# Performing Word Count on a local text file
text_file = spark.sparkContext.textFile("word.txt")  # Make sure word.txt is in the working directory
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda x, y: x + y)

print("Word Counts:", word_counts.collect())


Word Counts: [('Imran', 10), ('begun,', 1), ('Where', 2), ('Ottawa', 1), ('River', 1), ('walks', 2), ('with', 6), ('his', 5), ('A', 9), ('of', 12), ('dreams,', 2), ('', 10), ('Hill,', 2), ('where', 2), ('history', 2), ('resides,', 1), ('Peace', 1), ('Tower', 1), ('stands', 2), ('pride,', 1), ('hears', 1), ('melody', 1), ('so', 2), ('clear,', 1), ('echoing', 1), ('ear.', 1), ('By', 1), ('Centennial', 1), ('and', 8), ('heartfelt', 1), ('take', 1), ('their', 1), ('embrace,', 1), ('bathed', 1), ('golden', 1), ('light.', 1), ('glide', 1), ('by,', 1), ('watches', 1), ('twirl,', 1), ("winter's", 1), ('icy', 1), ('seasons,', 1), ('Market,', 1), ("Imran's", 1), ('senses', 1), ('tastes', 1), ('flavors,', 1), ('cultures', 1), ('far', 1), ('journey,', 1), ('holds', 1), ('dear.', 1), ('explores', 1), ('curious', 1), ('mind,', 1), ('kind,', 1), ('From', 1), ('ancient', 1), ('to', 4), ('past,', 1), ('own', 1), ('Park,', 1), ('Hiking', 1), ('cease,', 1), ('step', 2), ('spirit', 1), ('awakes.', 1), ('A
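
The `('', 10)` entry in the output comes from `split(" ")`, which returns empty strings for blank lines and for consecutive spaces; punctuation, such as the comma in `'begun,'`, also stays attached to words. One possible cleanup, sketched here in plain Python, is a tokenizer that could be passed to `flatMap()` in place of the lambda. `tokenize` is a hypothetical helper, and lowercasing and stripping punctuation are assumptions about what should count as the same word.

```python
import re

def tokenize(line):
    """Lowercase a line and return its words, keeping internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", line.lower())

sample = "Imran in Ottawa,  a journey begun,"

naive = sample.split(" ")  # keeps punctuation and yields '' for the double space
clean = tokenize(sample)   # words only
blank = tokenize("")       # a blank line contributes no tokens

print(naive)
print(clean)
print(blank)
```

With such a helper, the pipeline would start with `text_file.flatMap(tokenize)` and continue with the same `map()` and `reduceByKey()` steps.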

## 6. Execution Plan
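Spark records the lineage of every RDD: the chain of parent RDDs and transformations it was built from. The scheduler uses this graph to split a job into stages at shuffle boundaries and to recompute lost partitions after a failure. `toDebugString()` prints the lineage, with indentation marking each shuffle boundary.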

In [10]:
# Viewing the lineage of an RDD (toDebugString() returns bytes in PySpark, so decode it)
print("RDD Lineage:", word_counts.toDebugString().decode("utf-8"))

RDD Lineage: (2) PythonRDD[18] at collect at /tmp/ipykernel_2743/3114491937.py:7 []
 |  MapPartitionsRDD[17] at mapPartitions at PythonRDD.scala:160 []
 |  ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[15] at reduceByKey at /tmp/ipykernel_2743/3114491937.py:5 []
    |  PythonRDD[14] at reduceByKey at /tmp/ipykernel_2743/3114491937.py:5 []
    |  word.txt MapPartitionsRDD[13] at textFile at NativeMethodAccessorImpl.java:0 []
    |  word.txt HadoopRDD[12] at textFile at NativeMethodAccessorImpl.java:0 []


## Chapter Summary
In this notebook, we covered:
- RDD creation
- Transformations (map, flatMap, filter)
- Actions (collect, reduce, count, countByValue)
- Persistence with cache()
- Real-world example: Word Count
- Execution plan and lineage