# Chapter 3: Resilient Distributed Datasets (RDDs)

This notebook covers the core concepts of RDDs, including their creation, transformations, actions, persistence, and a real-world example (Word Count).

## 1. Introduction to RDDs
RDDs are the fundamental data structure in Apache Spark. They are:
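- Immutable: Once created, an RDD cannot be changed; every transformation returns a new RDD.
- Distributed: Data is split into partitions that are spread across the nodes of a cluster.
- Fault-Tolerant: Spark records the lineage of transformations that produced each RDD and can recompute lost partitions after a failure.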
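
These three properties can be illustrated without a cluster. The following is a plain-Python toy model, not Spark code: `split_into_partitions`, `lineage`, and `compute_partition` are hypothetical names invented for this sketch, and Spark's real partitioning and recovery logic is considerably more involved.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal slices (toy stand-in for partitioning)."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

source = [1, 2, 3, 4, 5]
source_partitions = split_into_partitions(source, 2)

# The "lineage" is just the recorded list of functions applied to the source.
lineage = [lambda x: x ** 2, lambda x: x + 1]

def compute_partition(index):
    """Rebuild one derived partition from the source partition and the lineage."""
    values = source_partitions[index]
    for fn in lineage:
        values = [fn(x) for x in values]  # each step builds a new list; nothing is mutated
    return values

derived = [compute_partition(i) for i in range(len(source_partitions))]

# Simulate losing one partition, then recover it by replaying the lineage.
derived[1] = None
derived[1] = compute_partition(1)

print("Source partitions:", source_partitions)
print("Derived partitions:", derived)
```

Only the lost partition is recomputed, and the source data is never modified, which is why recovery by replaying recorded steps is possible in this model.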

### 1.1 Creating RDDs

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("RDDExamples").getOrCreate()

# Creating an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
print("RDD Elements:", rdd.collect())

# Creating an RDD from an external file
file_rdd = spark.sparkContext.textFile("word.txt")
print("File RDD Sample:", file_rdd.take(5))

## 2. Transformations
Transformations create new RDDs from existing ones. They are **lazy**, meaning Spark doesn’t execute them until an action is called.
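
Laziness can be seen in miniature with Python's built-in `map`, which also defers work until its result is consumed. This is only an analogy in plain Python, not Spark itself: `square` and `calls` are hypothetical names, and `list(...)` stands in for an action such as `collect()`.

```python
calls = []

def square(x):
    calls.append(x)  # record every time the function actually runs
    return x ** 2

# Like rdd.map(...): this only describes the work, nothing runs yet.
lazy_squares = map(square, [1, 2, 3, 4, 5])
ran_before_action = list(calls)  # snapshot taken before anything is consumed

# Like collect(): consuming the result forces the computation.
result = list(lazy_squares)

print("Ran before action:", ran_before_action)
print("Result:", result)
print("Ran after action:", calls)
```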

### 2.1 map() Transformation

In [None]:
# Applying a map() transformation
squared_rdd = rdd.map(lambda x: x ** 2)
print("Squared RDD:", squared_rdd.collect())

### 2.2 flatMap() Transformation

In [None]:
# Using flatMap() to split lines into words
text_rdd = spark.sparkContext.parallelize(["Hello world", "Apache Spark is great"])
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
print("Words RDD:", words_rdd.collect())

### 2.3 filter() Transformation

In [None]:
# Filtering even numbers from an RDD
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print("Even Numbers:", even_rdd.collect())

## 3. Actions
Actions trigger the execution of transformations. They either return a result to the driver program or write data to external storage.

### 3.1 collect() Action

In [None]:
# Collecting all elements into the driver's memory (use only on small RDDs)
collected_data = rdd.collect()
print("Collected Data:", collected_data)

### 3.2 reduce() Action

In [None]:
# Summing all elements in an RDD
sum_result = rdd.reduce(lambda x, y: x + y)
print("Sum of Elements:", sum_result)
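
The function passed to `reduce()` should be commutative and associative, because each partition is reduced on its own before the partial results are combined. The plain-Python sketch below models that with a hypothetical helper, `partitioned_reduce`, and two made-up partition layouts; it is not Spark code, and Spark does not guarantee any particular order when combining partial results.

```python
from functools import reduce

def partitioned_reduce(partitions, fn):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

# The same five numbers, split two different ways.
layout_a = [[1, 2], [3, 4, 5]]
layout_b = [[1, 2, 3], [4, 5]]

# Addition is commutative and associative: the layout does not matter.
sum_a = partitioned_reduce(layout_a, add)
sum_b = partitioned_reduce(layout_b, add)

# Subtraction is neither: the result depends on how the data was split.
diff_a = partitioned_reduce(layout_a, subtract)
diff_b = partitioned_reduce(layout_b, subtract)

print("Sums:", sum_a, sum_b)
print("Differences:", diff_a, diff_b)
```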

### 3.3 count() and countByValue()

In [None]:
# Counting elements in the RDD
count = rdd.count()
print("Number of Elements:", count)

# Counting occurrences of each value
count_by_value = rdd.countByValue()
print("Count by Value:", dict(count_by_value))

## 4. Persistence
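Persisting an RDD keeps its computed partitions available so that later actions reuse them instead of recomputing the whole lineage. This helps when the same RDD is used by more than one action.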

In [None]:
# Mark the RDD for in-memory caching; cache() is lazy, so the first action (count) materializes it
cached_rdd = rdd.cache()
print("Cached RDD Count:", cached_rdd.count())
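
To see why reuse matters, the toy model below counts how often a function runs when two "actions" each replay the pipeline, compared with running it once and reusing the stored result. It is plain Python under simplifying assumptions, not Spark: `expensive_square`, `run_lineage`, and the `stats` counter are hypothetical names.

```python
stats = {"calls": 0}

def expensive_square(x):
    stats["calls"] += 1  # count how often the work is actually done
    return x ** 2

data = [1, 2, 3, 4, 5]

def run_lineage():
    """Replay the whole pipeline from the source data."""
    return [expensive_square(x) for x in data]

# Without caching: each "action" replays the pipeline.
total = sum(run_lineage())
count = len(run_lineage())
calls_without_cache = stats["calls"]

# With caching: compute once, then reuse the stored result.
stats["calls"] = 0
cached = run_lineage()
total_cached = sum(cached)
count_cached = len(cached)
calls_with_cache = stats["calls"]

print("Calls without cache:", calls_without_cache)
print("Calls with cache:", calls_with_cache)
```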

## 5. Real-World Example: Word Count

In [None]:
# Performing Word Count on a local text file
text_file = spark.sparkContext.textFile("word.txt")  # Make sure word.txt is in the same directory
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda x, y: x + y)

print("Word Counts:", word_counts.collect())
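
One detail in the pipeline above is worth checking: `line.split(" ")` produces empty strings whenever a line has consecutive or trailing spaces, and those empty strings are then counted as a word. The plain-Python sketch below shows the difference on two made-up sample lines (the `lines` list is hypothetical, not the contents of `word.txt`), using `collections.Counter` in place of `map` and `reduceByKey`.

```python
from collections import Counter

# Hypothetical sample lines with a double space and a trailing space.
lines = ["Hello  world", "hello world "]

# Same splitting rule as the pipeline: split on a single space character.
single_space_counts = Counter(w for line in lines for w in line.split(" "))

# Splitting on any run of whitespace drops the empty strings.
whitespace_counts = Counter(w for line in lines for w in line.split())

print("split(' '):", dict(single_space_counts))
print("split():   ", dict(whitespace_counts))
```

If empty tokens show up in the Spark output, using `line.split()` in the `flatMap` lambda is one way to avoid them. Note that both versions treat `Hello` and `hello` as different words.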


## 6. Execution Plan
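Spark tracks each RDD's lineage, the chain of transformations that produced it, and uses that lineage to build the graph of stages it runs when an action is called. `toDebugString()` shows the lineage of an RDD.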

In [None]:
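# Viewing the lineage of an RDD (PySpark returns it as bytes, so decode before printing)
lineage = word_counts.toDebugString()
print("RDD Lineage:\n" + (lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage))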

## Chapter Summary
In this notebook, we covered:
- RDD creation
- Transformations (map, flatMap, filter)
- Actions (collect, reduce, count, countByValue)
- Persistence with cache()
- Real-world example: Word Count
- Execution plan and lineage