In [1]:
!pip install pyspark

from pyspark.sql import SparkSession

# Create a SparkSession, or reuse one that is already running
spark = SparkSession.builder.appName("RDD-Example").getOrCreate()

# Get the SparkContext, the entry point for the RDD API
sc = spark.sparkContext



# 📝 What is RDD?

**RDD** stands for **Resilient Distributed Dataset**.

It is the **core data structure of Apache Spark**: an **immutable, distributed collection of objects** that can be processed in parallel across a cluster.

👉 In simple words:

* Think of an RDD as a **giant list** that, instead of sitting on one machine, is **spread across multiple machines**.
* Spark applies functions to each piece of this distributed data in parallel, which makes processing fast, and it can rebuild lost pieces, which makes it fault-tolerant.

---

# 📝 Key Properties of RDD

* **Resilient** → Fault-tolerant (lost partitions are recomputed automatically from lineage, the recorded chain of transformations that produced them).
* **Distributed** → Data is split into partitions across cluster nodes.
* **Dataset** → Collection of elements (numbers, strings, objects, rows, etc.).
* **Immutable** → Once created, an RDD cannot be changed; transformations produce new RDDs instead.
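
The properties above can be pictured with a toy model. The sketch below is plain Python, not the Spark API: `ToyRDD`, its `recover` method, and the three-partition layout are all hypothetical names and assumptions made up for illustration. It shows data split into partitions, a transformation that returns a new object rather than modifying the old one, and a lost partition being rebuilt from its lineage.

```python
# Toy illustration only: plain Python, not the Spark API.
class ToyRDD:
    """A tiny stand-in for an RDD: partitions plus the recipe that built them."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = [tuple(p) for p in partitions]  # tuples: read-only data
        self.parent = parent   # lineage: which dataset this one came from
        self.func = func       # lineage: how each element was derived

    def map(self, func):
        # A transformation never modifies self; it returns a new ToyRDD.
        new_parts = [[func(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def recover(self, index):
        # Rebuild one lost partition by replaying the lineage on the parent.
        rebuilt = tuple(self.func(x) for x in self.parent.partitions[index])
        self.partitions[index] = rebuilt
        return rebuilt


base = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # three partitions (assumed)
squared = base.map(lambda x: x * x)

squared.partitions[1] = None            # simulate losing a partition
recovered = squared.recover(1)

print(recovered)         # (16, 25, 36)
print(base.partitions)   # [(1, 2, 3), (4, 5, 6), (7, 8, 9)], unchanged
```

Real Spark tracks lineage for whole chains of transformations and decides on its own when to recompute; this sketch only handles a single `map` step.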

In [3]:
# Create an RDD from a Python list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rdd = sc.parallelize(data)

# collect() brings every element back to the driver as a Python list
print("RDD elements:", rdd.collect())

RDD elements: [1, 2, 3, 4, 5, 6, 7, 8, 9]


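### PySpark RDD Transformations

Transformations are lazy: each one describes a new RDD, but Spark does no work until an action is called.

- **Map (Square Each Number)**  
  Applies a function to every element and returns a new RDD; here the function squares each number.

- **Filter (Keep Even Numbers)**  
  Returns a new RDD containing only the elements for which the function returns true; here, the even numbers.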
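
As a rough analogy for how transformations defer their work until an action runs, Python's built-in `map` and `filter` behave in a similar way: they build a pipeline and process nothing until the result is consumed. The sketch below is plain Python, not Spark, and the `calls` list is a hypothetical helper added only to count function invocations. Unlike the notebook cell, it chains the filter onto the squared values to keep the pipeline in one piece.

```python
# Plain-Python analogy only; Spark's scheduler is far more involved.
calls = []

def square(x):
    calls.append(x)          # record every time the function actually runs
    return x * x

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Building the pipeline runs nothing yet (like rdd.map / rdd.filter).
squared = map(square, data)
even_squares = filter(lambda x: x % 2 == 0, squared)
calls_before = len(calls)    # 0: no element has been processed

# Consuming the pipeline triggers the work (like an action such as collect()).
result = list(even_squares)
calls_after = len(calls)     # 9: every element was squared exactly once

print(calls_before, calls_after, result)   # 0 9 [4, 16, 36, 64]
```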


In [4]:
# Map: square each number
squared_rdd = rdd.map(lambda x: x * x)

# Filter: keep only even numbers
even_rdd = rdd.filter(lambda x: x % 2 == 0)

### PySpark RDD Actions

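- **Collect**  
  Returns all elements of the RDD to the driver as a list (used here to show the squared numbers and the even numbers). Because it pulls the whole dataset onto one machine, use it only when the result is small.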

- **Count**  
  Returns the total number of elements in the RDD.

- **Sum**  
  Calculates the sum of all elements in the RDD.

- **Max**  
  Finds the maximum value in the RDD.
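
Conceptually, actions like these can be computed in two steps: each partition produces a small partial result, and the partials are then combined into one value. The sketch below shows that idea in plain Python. The three-partition layout and all variable names are assumptions for illustration, and it does not claim to mirror Spark's internal implementation.

```python
# Conceptual sketch in plain Python; the partition layout is an assumption.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Step 1: each partition computes a small partial result independently.
partial_counts = [len(p) for p in partitions]
partial_sums = [sum(p) for p in partitions]
partial_maxes = [max(p) for p in partitions]

# Step 2: the partial results are combined into one final value.
total_count = sum(partial_counts)
total_sum = sum(partial_sums)
overall_max = max(partial_maxes)

print(total_count, total_sum, overall_max)   # 9 45 9
```

Only the small partial results need to travel between machines, which is why these actions stay cheap even when the dataset itself is large.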


In [5]:
print("Squared:", squared_rdd.collect())
print("Even:", even_rdd.collect())
print("Count:", rdd.count())
print("Sum:", rdd.sum())
print("Max:", rdd.max())

Squared: [1, 4, 9, 16, 25, 36, 49, 64, 81]
Even: [2, 4, 6, 8]
Count: 9
Sum: 45
Max: 9


In [7]:
# Sample text
text = ["hello world", "hello spark", "big data with spark"]

# Create an RDD with one element per line
text_rdd = sc.parallelize(text)

# Split each line into words and flatten the results into a single RDD of words
words = text_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a (word, 1) pair
word_pairs = words.map(lambda word: (word, 1))

# Reduce by key: add up the counts of pairs that share the same word,
# e.g. ('hello', 1) and ('hello', 1) become ('hello', 2)
word_count = word_pairs.reduceByKey(lambda a, b: a + b)
print("Word Count:", word_count.collect())

Word Count: [('hello', 2), ('world', 1), ('big', 1), ('with', 1), ('spark', 2), ('data', 1)]
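
To make the reduce-by-key step less of a black box, here is a plain-Python sketch of the general idea: counts are first combined within each partition, then merged across partitions key by key. It is not Spark code. The two-partition split and the names `combine_locally`, `local_counts`, and `merged_counts` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Conceptual sketch in plain Python; the two-partition split is an assumption.
partitions = [
    ["hello world", "hello spark"],
    ["big data with spark"],
]

def combine_locally(lines):
    """Local step: count words within a single partition."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split(" "):
            counts[word] += 1
    return counts

local_counts = [combine_locally(p) for p in partitions]

# Merge step: add the per-partition counts together, key by key.
merged_counts = defaultdict(int)
for counts in local_counts:
    for word, n in counts.items():
        merged_counts[word] += n

print(sorted(merged_counts.items()))
# [('big', 1), ('data', 1), ('hello', 2), ('spark', 2), ('with', 1), ('world', 1)]
```

The result is sorted here only to make the printout deterministic; the order of the pairs returned by `collect()` in the cell above depends on how the keys were distributed across partitions.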
