<a href="https://colab.research.google.com/github/suriarasai/BEAD2026/blob/main/colab/03c_PySpark_RDD_Practice_Exercises_Using_Functional_Programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark RDD Exercises: Functional Programming Concepts
The following 10 exercises use RDD operations to practice functional programming principles. Each exercise includes the problem, solution, and detailed explanation.

## Initial Setup

In [1]:
from pyspark import SparkContext, SparkConf
from functools import reduce
import math

# Initialize SparkContext
conf = SparkConf().setAppName("FunctionalProgrammingExercises")
sc = SparkContext.getOrCreate(conf)

### Exercise 1: Pure vs Impure Functions
Problem: Given an RDD of numbers [1, 2, 3, 4, 5], create two functions, one pure and one impure, that each multiply every number by a factor. Demonstrate why the impure function causes issues.

In [None]:
# Sample data
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# YOUR TASK:
# 1. Create an impure function that uses external state
# 2. Create a pure function that doesn't use external state
# 3. Apply both and observe the difference
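
One possible way to see the difference, sketched in plain Python with a local list standing in for `numbers_rdd`. The names `multiplier`, `impure_multiply`, and `pure_multiply` are illustrative assumptions, not part of the exercise. The impure version reads a global that can change between calls; on a cluster the problem is worse, because Spark ships a copy of the closure to each executor, so changes made to such variables inside tasks are not reliably visible on the driver.

```python
# Local stand-in for the RDD contents (assumption: no Spark needed here).
numbers = [1, 2, 3, 4, 5]

multiplier = 2  # external mutable state (hypothetical name)


def impure_multiply(x):
    # Impure: the result depends on a global that may change between calls.
    return x * multiplier


def pure_multiply(x, factor=2):
    # Pure: the result depends only on the arguments.
    return x * factor


impure_before = [impure_multiply(x) for x in numbers]
pure_before = [pure_multiply(x) for x in numbers]

multiplier = 10  # some other code mutates the global

impure_after = [impure_multiply(x) for x in numbers]
pure_after = [pure_multiply(x) for x in numbers]

print(impure_before, impure_after)  # same input, different results
print(pure_before, pure_after)      # same input, same results
```

With the notebook's `sc`, the pure version could be applied as `numbers_rdd.map(pure_multiply).collect()`.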

### Exercise 2: Map with Complex Pure Functions

Problem: Create an RDD of temperatures in Celsius [0, 10, 20, 30, 40]. Write a pure function that converts to Fahrenheit and categorizes the temperature.

In [None]:
celsius_rdd = sc.parallelize([0, 10, 20, 30, 40])

# YOUR TASK: Create a pure function that returns a tuple:
# (celsius, fahrenheit, category) where category is "cold", "mild", or "hot"
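
A minimal sketch of one such function, shown in plain Python. The category thresholds (below 15 °C is "cold", below 30 °C is "mild", otherwise "hot") are assumptions chosen for illustration; the exercise does not define them.

```python
def classify_temperature(celsius):
    """Return (celsius, fahrenheit, category) without touching external state."""
    fahrenheit = celsius * 9 / 5 + 32
    # Thresholds below are assumed for this sketch, not given by the exercise.
    if celsius < 15:
        category = "cold"
    elif celsius < 30:
        category = "mild"
    else:
        category = "hot"
    return (celsius, fahrenheit, category)


# Local list standing in for celsius_rdd.
readings = [0, 10, 20, 30, 40]
classified = [classify_temperature(c) for c in readings]
print(classified)
```

With the notebook's `sc`, the same function could be passed directly to `celsius_rdd.map(classify_temperature)`.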

### Exercise 3: Filter with Functional Predicates
Problem: Given an RDD of numbers 1-20, create pure predicate functions for filtering.

In [None]:
numbers_rdd = sc.parallelize(range(1, 21))

# YOUR TASK: Create pure predicate functions to filter:
# 1. Even numbers
# 2. Prime numbers
# 3. Numbers divisible by 3 or 5
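
The three predicates could be sketched as follows. Each returns a boolean computed only from its argument, so it can be tested locally with Python's built-in `filter` before being handed to `numbers_rdd.filter(...)`. The function names are illustrative.

```python
import math


def is_even(n):
    return n % 2 == 0


def is_prime(n):
    # Trial division up to the integer square root of n.
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))


def divisible_by_3_or_5(n):
    return n % 3 == 0 or n % 5 == 0


# Local range standing in for numbers_rdd.
numbers = range(1, 21)
evens = list(filter(is_even, numbers))
primes = list(filter(is_prime, numbers))
threes_or_fives = list(filter(divisible_by_3_or_5, numbers))
print(evens, primes, threes_or_fives)
```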

### Exercise 4: Reduce Operations without State
Problem: Calculate the product and sum of an RDD [1, 2, 3, 4, 5] using pure reduction functions.

In [None]:
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# YOUR TASK:
# 1. Calculate sum using reduce
# 2. Calculate product using reduce
# 3. Find max and min using reduce
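
A sketch of the reductions using `functools.reduce` on a local list. The same two-argument functions could be passed to `numbers_rdd.reduce(...)`; Spark expects the reducing function to be commutative and associative, which holds for all four used here.

```python
from functools import reduce
from operator import add, mul

# Local list standing in for numbers_rdd.
numbers = [1, 2, 3, 4, 5]


def larger(a, b):
    return a if a >= b else b


def smaller(a, b):
    return a if a <= b else b


total = reduce(add, numbers)        # 15
product = reduce(mul, numbers)      # 120
largest = reduce(larger, numbers)   # 5
smallest = reduce(smaller, numbers) # 1
print(total, product, largest, smallest)
```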

### Exercise 5: FlatMap for Functional Transformations
Problem: Given an RDD of sentences, use flatMap with pure functions to tokenize and process words.

In [None]:
sentences_rdd = sc.parallelize([
    "Hello world",
    "Functional programming rocks",
    "Pure functions scale"
])

# YOUR TASK:
# 1. Tokenize sentences into words
# 2. Create word pairs (word, length)
# 3. Generate n-grams
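
One way to approach the three tasks, sketched in plain Python. The helper `flat_map` is a hypothetical local stand-in for `RDD.flatMap`, and bigrams are used as the n-gram example; both are assumptions for illustration.

```python
from itertools import chain


def tokenize(sentence):
    return sentence.split()


def word_with_length(word):
    return (word, len(word))


def bigrams(sentence):
    # Pair each word with its successor within the same sentence.
    words = tokenize(sentence)
    return list(zip(words, words[1:]))


def flat_map(func, items):
    # Hypothetical local stand-in for RDD.flatMap: map, then flatten one level.
    return list(chain.from_iterable(map(func, items)))


sentences = [
    "Hello world",
    "Functional programming rocks",
    "Pure functions scale",
]

words = flat_map(tokenize, sentences)
word_lengths = [word_with_length(w) for w in words]
all_bigrams = flat_map(bigrams, sentences)
print(words)
print(word_lengths)
print(all_bigrams)
```

With the notebook's `sc`, the equivalent calls would be along the lines of `sentences_rdd.flatMap(tokenize).map(word_with_length)` and `sentences_rdd.flatMap(bigrams)`.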

### Exercise 6: Avoiding Shared Mutable State

Problem: Count occurrences of even and odd numbers WITHOUT using global counters.

In [None]:
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# YOUR TASK: Count even and odd numbers using:
# 1. Pure functions only
# 2. No global variables
# 3. Functional approaches
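
A possible approach, shown on a local list: map each number to a small `(even_count, odd_count)` tuple, then combine the tuples with a pure, associative function. No counter outside the functions is ever modified. The names `to_counts` and `add_counts` are illustrative.

```python
from functools import reduce


def to_counts(n):
    # (even_count, odd_count) contribution of a single number.
    return (1, 0) if n % 2 == 0 else (0, 1)


def add_counts(a, b):
    # Element-wise sum of two count tuples; returns a new tuple.
    return (a[0] + b[0], a[1] + b[1])


# Local list standing in for numbers_rdd.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even_count, odd_count = reduce(add_counts, map(to_counts, numbers))
print(even_count, odd_count)  # 5 5
```

With the notebook's `sc`, this corresponds to `numbers_rdd.map(to_counts).reduce(add_counts)`.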

### Exercise 7: Function Composition and Pipelining

Problem: Create a pipeline of pure functions to process text data.

In [None]:
text_rdd = sc.parallelize([
    "  HELLO world  ",
    "PYTHON programming",
    "spark FUNCTIONAL"
])

# YOUR TASK: Create a pipeline that:
# 1. Strips whitespace
# 2. Converts to lowercase
# 3. Reverses the string
# 4. Adds a prefix
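
A minimal sketch of such a pipeline. The `compose` helper and the prefix `"processed_"` are assumptions for illustration; the exercise does not specify either. `compose` applies its functions left to right, so the steps read in the same order as the task list.

```python
from functools import reduce


def compose(*funcs):
    # Left-to-right composition: compose(f, g)(x) == g(f(x)).
    return lambda value: reduce(lambda acc, f: f(acc), funcs, value)


def strip_whitespace(s):
    return s.strip()


def to_lower(s):
    return s.lower()


def reverse(s):
    return s[::-1]


def add_prefix(prefix):
    # Returns a new function that prepends the given prefix.
    return lambda s: prefix + s


pipeline = compose(strip_whitespace, to_lower, reverse, add_prefix("processed_"))

# Local list standing in for text_rdd.
texts = ["  HELLO world  ", "PYTHON programming", "spark FUNCTIONAL"]
results = [pipeline(t) for t in texts]
print(results)
```

With the notebook's `sc`, the composed function could be applied in a single step as `text_rdd.map(pipeline)`, or as four chained `map` calls.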

### Exercise 8: Using Accumulators Correctly

Problem: Track statistics about data processing using accumulators instead of global variables.

In [None]:
data_rdd = sc.parallelize([10, -5, 20, -3, 15, 0, -10, 25])

# YOUR TASK: Track:
# 1. Count of positive numbers
# 2. Count of negative numbers
# 3. Count of zeros
# Without using global variables

### Exercise 9: Handling Complex Data with Pure Functions

Problem: Process an RDD of dictionaries representing transactions using pure functions.

In [None]:
transactions_rdd = sc.parallelize([
    {"id": 1, "amount": 100, "type": "credit"},
    {"id": 2, "amount": 50, "type": "debit"},
    {"id": 3, "amount": 200, "type": "credit"},
    {"id": 4, "amount": 75, "type": "debit"}
])

# YOUR TASK: Using pure functions:
# 1. Calculate net balance
# 2. Find largest transaction
# 3. Group by type and sum
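
One way the three tasks might be solved, sketched on a local list. It assumes that credits add to the balance and debits subtract from it, which the exercise does not state explicitly. None of the functions modify the transaction dictionaries they receive.

```python
from functools import reduce


def signed_amount(tx):
    # Assumption: credits are positive, debits are negative.
    return tx["amount"] if tx["type"] == "credit" else -tx["amount"]


def larger_transaction(a, b):
    return a if a["amount"] >= b["amount"] else b


def add_to_totals(totals, tx):
    # Builds a new dict instead of mutating the accumulator.
    return {**totals, tx["type"]: totals.get(tx["type"], 0) + tx["amount"]}


# Local list standing in for transactions_rdd.
transactions = [
    {"id": 1, "amount": 100, "type": "credit"},
    {"id": 2, "amount": 50, "type": "debit"},
    {"id": 3, "amount": 200, "type": "credit"},
    {"id": 4, "amount": 75, "type": "debit"},
]

net_balance = sum(map(signed_amount, transactions))
largest = reduce(larger_transaction, transactions)
totals_by_type = reduce(add_to_totals, transactions, {})
print(net_balance, largest["id"], totals_by_type)
```

With the notebook's `sc`, comparable calls would be `transactions_rdd.map(signed_amount).sum()`, `transactions_rdd.reduce(larger_transaction)`, and a `map` to `(type, amount)` pairs followed by `reduceByKey`.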

### Exercise 10: Combining Multiple RDDs Functionally

Problem: Join and process two RDDs using pure functions.

In [None]:
products_rdd = sc.parallelize([
    (1, "Laptop"),
    (2, "Mouse"),
    (3, "Keyboard")
])

sales_rdd = sc.parallelize([
    (1, 1200),
    (2, 25),
    (3, 75),
    (1, 1200),
    (2, 25)
])

# YOUR TASK: Using pure functions:
# 1. Calculate total sales per product
# 2. Join with product names
# 3. Apply discount calculation
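
A sketch of the three steps on local lists. The helper `reduce_by_key` is a hypothetical stand-in for `RDD.reduceByKey`, and the 10% discount rate is an assumption; the exercise does not give one.

```python
from operator import add


def reduce_by_key(func, pairs):
    # Hypothetical local stand-in for RDD.reduceByKey.
    # The dict is local to this call, so callers see no side effects.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def apply_discount(total, rate=0.10):
    # Assumed flat discount rate; rounded to cents.
    return round(total * (1 - rate), 2)


# Local lists standing in for products_rdd and sales_rdd.
products = [(1, "Laptop"), (2, "Mouse"), (3, "Keyboard")]
sales = [(1, 1200), (2, 25), (3, 75), (1, 1200), (2, 25)]

totals = reduce_by_key(add, sales)
# Inner join on product id, shaped like (id, (name, total)).
joined = [(pid, (name, totals[pid])) for pid, name in products if pid in totals]
discounted = [(name, total, apply_discount(total)) for _, (name, total) in joined]
print(discounted)
```

With the notebook's `sc`, the first two steps correspond to `products_rdd.join(sales_rdd.reduceByKey(add))`, which yields pairs of the form `(id, (name, total))`.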