
## Introduction to RDDs in PySpark

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark. RDDs are fault-tolerant collections of elements that can be operated on in parallel. This section will introduce basic RDD operations, including creation, transformations, and actions.

### 1. Creating RDDs

#### From a Python List

You can create an RDD from a Python list using the `parallelize` method.

In [None]:
# Set the PySpark environment variables
import os
os.environ['SPARK_HOME'] = 'spark-3.4.3-bin-hadoop3'

In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession

In [None]:
# Initialize a Spark session
spark = SparkSession.builder.appName("RDD_Basics").getOrCreate()

In [None]:
# Get the SparkContext from the Spark session
sc = spark.sparkContext

In [None]:
# Create an RDD from a Python list
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numbers_rdd = sc.parallelize(numbers)

In [None]:
# Collect the RDD to view its contents
numbers_rdd.collect()

Out[23]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

### 2. Basic Transformations

Transformations create a new RDD from an existing one. They are "lazy," meaning they don't execute until an action is called.

#### Map

The `map` transformation applies a function to each element in the RDD and returns a new RDD.

In [None]:
# Square each number in the RDD
squared_numbers_rdd = numbers_rdd.map(lambda x: x ** 2)
squared_numbers_rdd

Out[24]: PythonRDD[31] at RDD at PythonRDD.scala:58

In [None]:
# Collect the RDD to view its contents
squared_numbers_rdd.collect()

Out[25]: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
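The `PythonRDD[...]` output above, shown instead of a list of numbers, is laziness at work: `map` only recorded what to do, and `collect` triggered the computation. As a rough plain-Python analogy (this is not Spark code), the built-in `map` is lazy too, and `list(...)` plays the role of an action. The `calls` list and the `square` helper are illustrative names that do not appear in the notebook.

```python
# Plain-Python analogy for lazy transformations (NOT Spark code).
calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x ** 2


numbers = [1, 2, 3, 4, 5]

# Like a transformation: builds a recipe, runs nothing yet.
lazy_squares = map(square, numbers)
calls_before_action = len(calls)

# Like an action: forces evaluation and returns concrete values.
result = list(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

The function is not called at all until the values are requested, which is the same reason a bug inside a Spark `map` function usually surfaces only when an action runs.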

#### Filter

The `filter` transformation creates a new RDD by selecting elements that satisfy a predicate.

In [None]:
# Keep only even numbers from the squared RDD
even_squared_numbers_rdd = squared_numbers_rdd.filter(lambda x: x % 2 == 0)
even_squared_numbers_rdd

Out[26]: PythonRDD[32] at RDD at PythonRDD.scala:58

In [None]:
# Collect the RDD to view its contents
even_squared_numbers_rdd.collect()

Out[27]: [4, 16, 36, 64, 100]

### 3. Basic Actions

Actions trigger the execution of transformations and return a value to the driver program.

#### Reduce

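The `reduce` action aggregates the elements of the RDD using a specified two-argument function. The function should be commutative and associative, because Spark applies it within each partition and then combines the partial results.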

In [None]:
# Sum all the even squared numbers
sum_even_squared_numbers = even_squared_numbers_rdd.reduce(lambda x, y: x + y)
sum_even_squared_numbers

Out[28]: 220
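To see why the function passed to `reduce` should be commutative and associative, here is a simplified single-machine model (plain Python, not Spark) that reduces each partition separately and then merges the partial results. The helper `reduce_in_partitions` and the partition layouts are assumptions made for illustration; real Spark decides partitioning and merge order itself.

```python
from functools import reduce


def reduce_in_partitions(func, partitions):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


values = [4, 16, 36, 64, 100]
one_partition = [values]
two_partitions = [[4, 16, 36], [64, 100]]  # hypothetical split


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


# Addition gives the same answer however the data is split.
sum_one = reduce_in_partitions(add, one_partition)
sum_two = reduce_in_partitions(add, two_partitions)

# Subtraction depends on the split, so it is unsafe for reduce.
diff_one = reduce_in_partitions(subtract, one_partition)
diff_two = reduce_in_partitions(subtract, two_partitions)

print(sum_one, sum_two, diff_one, diff_two)
```

Addition yields 220 in both layouts, while subtraction yields -212 with one partition and -12 with two.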

#### Collect

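The `collect` action returns every element of the RDD to the driver program as a Python list, so use it only when the result is small enough to fit in the driver's memory.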

In [None]:
# Collect the even squared numbers RDD to view its contents
even_squared_numbers_rdd.collect()

Out[29]: [4, 16, 36, 64, 100]
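When only a few elements are needed, `take(n)` is usually a lighter choice than `collect`. The difference can be sketched in plain Python (not Spark) with a generator: `itertools.islice` stops after `n` items, much as `take` avoids materialising the whole dataset. The `consumed` list and `number_stream` generator are illustrative names only.

```python
from itertools import islice

consumed = []  # tracks how many elements were actually produced


def number_stream():
    """Yield 1..1,000,000 lazily, recording each element that is produced."""
    for n in range(1, 1_000_001):
        consumed.append(n)
        yield n


# Analogous to rdd.take(5): stop as soon as five elements are available.
first_five = list(islice(number_stream(), 5))

print(first_five, len(consumed))
```

Only five elements are ever produced here, even though the stream could yield a million.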

#### Count

The `count` action returns the number of elements in the RDD.

In [None]:
# Count the number of elements in the RDD
count_even_squared_numbers = even_squared_numbers_rdd.count()
count_even_squared_numbers

Out[30]: 5

### 4. Working with Text Data

#### Reading a Text File

You can create an RDD from a text file using the `textFile` method.

In [None]:
# Read a text file into an RDD
text_rdd = sc.textFile("sample.txt")

In [None]:
# Show the first 5 lines of the text file
text_rdd.take(5)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-1917210368493309>:2
      1 # Show the first 5 lines of the text file
----> 2 text_rdd.take(5)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res = func(*args, **kwargs)
     49     logger.log_success(
     50         module_name, clas
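The traceback above is truncated, so its exact cause is not visible. One common reason for an error at the first action after `textFile` is that the input path does not exist, because `textFile` itself is lazy and does not read the file. A minimal sketch for creating a small input file and checking it before handing the path to Spark follows; the sample lines and the temporary directory are assumptions, since the notebook does not show the contents of `sample.txt`.

```python
import tempfile
from pathlib import Path

# Hypothetical sample content; the original notebook does not show the file.
sample_lines = ["spark makes big data simple", "rdds are the core of spark"]

# Write to a temporary directory so this sketch has no side effects.
work_dir = Path(tempfile.mkdtemp())
sample_path = work_dir / "sample.txt"
sample_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")

# Check the file before asking Spark to read it.
file_exists = sample_path.is_file()
first_lines = sample_path.read_text(encoding="utf-8").splitlines()[:5]

print(file_exists, first_lines)
```

The resulting path could then be passed as `sc.textFile(str(sample_path))`. On a cluster or hosted platform, a relative path may resolve against a distributed file system rather than the local working directory, so an explicit path is usually safer.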

#### FlatMap

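The `flatMap` transformation applies a function that returns a sequence for each element and flattens the results into a single RDD. Here it splits each line into words, producing one RDD of words rather than an RDD of lists.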

In [None]:
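# Split each line into words; split() with no argument splits on any
# whitespace and drops the empty strings that repeated spaces would produce
words_rdd = text_rdd.flatMap(lambda line: line.split())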



In [None]:
# Collect the RDD to view its contents
words_rdd.collect()
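The difference between `map` and `flatMap` is easiest to see side by side. The following plain-Python sketch (not Spark code) mimics both on two hypothetical lines of text, using `itertools.chain.from_iterable` as a stand-in for the flattening step.

```python
from itertools import chain

lines = ["hello spark", "hello rdd world"]  # hypothetical sample lines

# map-like: one output element per input element, so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```

With `map`, two lines produce two elements (each a list); with `flatMap`, the same two lines produce five elements (each a word).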



### 5. Key-Value Pair RDDs

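RDDs can also contain key-value pairs, represented in Python as two-element tuples. Pair RDDs support additional transformations that operate per key, such as `reduceByKey`, `groupByKey`, and `sortByKey`.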

#### Map to Pair RDD

You can create a pair RDD using the `map` transformation.

In [None]:
# Create a pair RDD with each word and the number 1
word_pairs_rdd = words_rdd.map(lambda word: (word, 1))



In [None]:
# Collect the RDD to view its contents
word_pairs_rdd.collect()



#### ReduceByKey

The `reduceByKey` transformation aggregates values by key.

In [None]:
# Count the occurrences of each word by summing the 1s for each key
word_counts_rdd = word_pairs_rdd.reduceByKey(lambda a, b: a + b)



In [None]:
# Collect the RDD to view its contents
word_counts_rdd.collect()
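Counting words this way relies on two details: each word is paired with 1, and the values for a key are merged by addition. A single-machine sketch in plain Python (not Spark) makes the logic concrete; `reduce_by_key` is a hypothetical helper written for this illustration, and the word list is made up.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a single-machine reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


words = ["hello", "spark", "hello", "rdd", "hello"]  # hypothetical input
pairs = [(word, 1) for word in words]

# Addition sums the 1s, giving the number of occurrences per word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

# Multiplication of 1s always stays 1, so it cannot count anything.
products = reduce_by_key(pairs, lambda a, b: a * b)

print(counts)
print(products)
```

The additive version reports three occurrences of `hello`, while the multiplicative version reports 1 for every word.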



### 6. Additional Transformations

#### GroupByKey

The `groupByKey` transformation groups values with the same key.

In [None]:
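# Group values by key; mapValues(list) converts each group from an
# iterable into a plain list so that collect() prints readable values
grouped_words_rdd = word_pairs_rdd.groupByKey().mapValues(list)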



In [None]:
# Collect the RDD to view its contents
grouped_words_rdd.collect()
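What `groupByKey` followed by `mapValues(list)` produces can be sketched in plain Python (not Spark) with a `defaultdict`. The pairs below are hypothetical sample data.

```python
from collections import defaultdict

pairs = [("hello", 1), ("spark", 1), ("hello", 1), ("rdd", 1), ("hello", 1)]

# Collect every value that shares a key into one list.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(grouped)

# Summing each group gives the same counts that reduceByKey would.
counts_from_groups = {key: sum(values) for key, values in grouped.items()}

print(grouped)
print(counts_from_groups)
```

Both routes give the same counts, but in Spark `reduceByKey` combines values within each partition before shuffling, so it usually moves less data than `groupByKey` followed by a sum.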



#### SortByKey

The `sortByKey` transformation sorts the RDD by key.

In [None]:
# Sort the word counts RDD by key (word)
sorted_word_counts_rdd = word_counts_rdd.sortByKey()



In [None]:
# Collect the RDD to view its contents
sorted_word_counts_rdd.collect()
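Sorting by key orders the words alphabetically, which is not always what a word count needs. The plain-Python sketch below (not Spark) contrasts ordering by key with ordering by count on hypothetical data. In PySpark, the second ordering would typically use `sortBy` with a key function and `ascending=False`.

```python
word_counts = [("spark", 1), ("hello", 3), ("rdd", 1)]  # hypothetical counts

# Like sortByKey(): order by the word itself.
by_key = sorted(word_counts, key=lambda kv: kv[0])

# Like sortBy(lambda kv: kv[1], ascending=False): most frequent words first.
by_count_desc = sorted(word_counts, key=lambda kv: kv[1], reverse=True)

print(by_key)
print(by_count_desc)
```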



### 7. Saving and Loading RDDs

#### Save As Text File

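You can save an RDD as text using the `saveAsTextFile` method. The path names an output directory, not a single file: Spark writes one part file per partition inside it, and the call fails if the directory already exists.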

In [None]:
# Save the word counts RDD to a text file
word_counts_rdd.saveAsTextFile("word_counts")



#### Load from Text File

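You can load the saved data by passing the output directory to `textFile`, which reads all of its part files. Each element comes back as a line of text (the string form of the original element), not as a tuple.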

In [None]:
# Load the saved RDD from the text file
loaded_word_counts_rdd = sc.textFile("word_counts")



In [None]:
# Collect the RDD to view its contents
loaded_word_counts_rdd.collect()
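Because the saved lines are the string form of each `(word, count)` tuple, the loaded RDD holds strings. One way to recover the tuples, sketched here in plain Python (not Spark) on hypothetical data, is `ast.literal_eval`, which parses Python literals without executing arbitrary code.

```python
import ast

# What each saved line looks like: the str() form of a (word, count) tuple.
original_pairs = [("hello", 3), ("spark", 1)]  # hypothetical word counts
saved_lines = [str(pair) for pair in original_pairs]

# Parse each line back into a tuple.
restored_pairs = [ast.literal_eval(line) for line in saved_lines]

print(saved_lines)
print(restored_pairs)
```

In the notebook, the same idea would be applied as `loaded_word_counts_rdd.map(ast.literal_eval)`, assuming every line is a valid Python literal.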



### Cleanup

Always stop the Spark session after completing your tasks.

In [None]:
# Stop the Spark session
spark.stop()



### Well Done! Great Job!