<a href="https://colab.research.google.com/github/suriarasai/BEAD2026/blob/main/colab/03c_PySpark_RDD_Practice_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark RDD Exercises: Functional Programming Concepts
These 10 exercises use RDD operations to teach functional programming principles. Each exercise includes the problem, a solution, and a detailed explanation.

## Initial Setup

In [1]:
from pyspark import SparkContext, SparkConf
from functools import reduce
import math

# Initialize SparkContext
conf = SparkConf().setAppName("FunctionalProgrammingExercises")
sc = SparkContext.getOrCreate(conf)

### Exercise 1: Pure vs Impure Functions
The impure function modifies global state (multiplier) and appends to a list (results_list). In distributed computing, each worker gets its own copy of these variables, so changes made on a worker never reach the driver: multiplier still prints as 2 and results_list stays empty. The pure function takes everything it needs as parameters and returns a result based solely on its inputs, making it safe for parallel execution.

In [4]:
# Sample data
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# IMPURE FUNCTION (BAD)
multiplier = 2  # External state
results_list = []  # External mutable state

def impure_multiply(x):
    global multiplier
    multiplier += 1  # Modifying external state - DANGEROUS!
    results_list.append(x)  # Side effect - WON'T WORK in distributed setting
    return x * multiplier

# PURE FUNCTION (GOOD)
def pure_multiply(x, factor=2):
    """Output depends only on input - no side effects"""
    return x * factor

# Apply impure function (problematic)
impure_result = numbers_rdd.map(impure_multiply).collect()
print(f"Impure result: {impure_result}")
print(f"Multiplier after (unreliable): {multiplier}")
print(f"Results list (empty/incomplete): {results_list}")

# Reset for comparison
multiplier = 2


# Apply pure function (reliable)
pure_result = numbers_rdd.map(lambda x: pure_multiply(x, 2)).collect()
print(f"Pure result: {pure_result}")

Impure result: [3, 8, 9, 16, 25]
Multiplier after (unreliable): 2
Results list (empty/incomplete): []
Pure result: [2, 4, 6, 8, 10]
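
The impure output above can be reproduced without Spark. The following is a minimal plain-Python sketch, assuming the five numbers were split into two partitions, [1, 2] and [3, 4, 5] (an assumption that is consistent with the printed result, not something the notebook states). The names `run_task` and `driver_state` are hypothetical and exist only for this illustration: each simulated task works on its own copy of the driver's variables, which is roughly what happens when a closure is shipped to a worker.

```python
import copy

def run_task(partition, state):
    """Simulate one task: it receives a private copy of the driver's state."""
    local_state = copy.deepcopy(state)  # stand-in for closure serialization
    results = []
    for x in partition:
        local_state["multiplier"] += 1   # mutation stays inside this task
        local_state["seen"].append(x)    # never reaches the driver
        results.append(x * local_state["multiplier"])
    return results

driver_state = {"multiplier": 2, "seen": []}
partitions = [[1, 2], [3, 4, 5]]  # assumed split, for illustration only

impure_results = [y for part in partitions for y in run_task(part, driver_state)]
print(impure_results)  # [3, 8, 9, 16, 25]
print(driver_state)    # {'multiplier': 2, 'seen': []}
```

Under this model the impure result depends on how the data happens to be partitioned, while the pure version returns [2, 4, 6, 8, 10] for any split.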


### Exercise 2: Map with Complex Pure Functions

The pure function encapsulates all logic internally, making it testable, reusable, and guaranteed to produce the same output for the same input. This predictability is crucial in distributed systems.

In [6]:
celsius_rdd = sc.parallelize([0, 10, 20, 30, 40])

def temperature_analysis(celsius):
    """
    Pure function: No external dependencies, deterministic output
    """
    fahrenheit = (celsius * 9/5) + 32

    # Categorization logic encapsulated within function
    if fahrenheit < 50:
        category = "cold"
    elif fahrenheit < 77:
        category = "mild"
    else:
        category = "hot"

    return (celsius, fahrenheit, category)

# Apply the pure function
result = celsius_rdd.map(temperature_analysis).collect()
for item in result:
    print(f"Celsius: {item[0]}°C, Fahrenheit: {item[1]}°F, Category: {item[2]}")

# Demonstrate determinism - running again gives same results
result2 = celsius_rdd.map(temperature_analysis).collect()
print(f"\nResults are identical: {result == result2}")

Celsius: 0°C, Fahrenheit: 32.0°F, Category: cold
Celsius: 10°C, Fahrenheit: 50.0°F, Category: mild
Celsius: 20°C, Fahrenheit: 68.0°F, Category: mild
Celsius: 30°C, Fahrenheit: 86.0°F, Category: hot
Celsius: 40°C, Fahrenheit: 104.0°F, Category: hot

Results are identical: True
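
Because the function is pure, it can be checked without a SparkContext at all. The sketch below repeats the function from the exercise so that it runs on its own, and probes the category boundaries with plain assertions. The chosen test values (9, 10, 24, 25 and -40) are illustrative picks, not part of the original exercise.

```python
def temperature_analysis(celsius):
    """Same logic as the exercise: convert to Fahrenheit and categorize."""
    fahrenheit = (celsius * 9/5) + 32
    if fahrenheit < 50:
        category = "cold"
    elif fahrenheit < 77:
        category = "mild"
    else:
        category = "hot"
    return (celsius, fahrenheit, category)

# Boundary checks: 50°F is the first "mild" value, 77°F the first "hot" one
assert temperature_analysis(10) == (10, 50.0, "mild")
assert temperature_analysis(25) == (25, 77.0, "hot")
assert temperature_analysis(-40)[1] == -40.0  # the two scales meet at -40

boundary_categories = [temperature_analysis(c)[2] for c in (9, 10, 24, 25)]
print(boundary_categories)  # ['cold', 'mild', 'mild', 'hot']
```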


### Exercise 3: Filter with Functional Predicates
Pure predicate functions return boolean values based solely on input. They can be composed, tested independently, and reused across different RDDs without side effects.

In [7]:
numbers_rdd = sc.parallelize(range(1, 21))

# Pure predicate functions
def is_even(n):
    """Pure predicate: checks if number is even"""
    return n % 2 == 0

def is_prime(n):
    """Pure predicate: checks if number is prime"""
    if n < 2:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

def is_divisible_by_3_or_5(n):
    """Pure predicate: checks divisibility"""
    return n % 3 == 0 or n % 5 == 0

# Apply filters
even_numbers = numbers_rdd.filter(is_even).collect()
prime_numbers = numbers_rdd.filter(is_prime).collect()
div_3_or_5 = numbers_rdd.filter(is_divisible_by_3_or_5).collect()

print(f"Even numbers: {even_numbers}")
print(f"Prime numbers: {prime_numbers}")
print(f"Divisible by 3 or 5: {div_3_or_5}")

# Demonstrate function composition
def compose_predicates(pred1, pred2):
    """Higher-order function to compose predicates"""
    return lambda x: pred1(x) and pred2(x)

# Even AND prime (only 2)
even_and_prime = compose_predicates(is_even, is_prime)
result = numbers_rdd.filter(even_and_prime).collect()
print(f"Even and prime: {result}")

Even numbers: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
Prime numbers: [2, 3, 5, 7, 11, 13, 17, 19]
Divisible by 3 or 5: [3, 5, 6, 9, 10, 12, 15, 18, 20]
Even and prime: [2]
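
The two-predicate composition above generalizes naturally. As a hedged sketch, the helpers `negate`, `all_of` and `any_of` below are hypothetical names introduced here (they are not Spark or standard-library functions). They build new predicates from existing ones, and Python's built-in `filter` stands in for `RDD.filter` so the example runs without Spark.

```python
def is_even(n):
    return n % 2 == 0

def is_divisible_by_3_or_5(n):
    return n % 3 == 0 or n % 5 == 0

def negate(pred):
    """Return a predicate that is true exactly when pred is false."""
    return lambda x: not pred(x)

def all_of(*preds):
    """Return a predicate that is true when every given predicate is true."""
    return lambda x: all(p(x) for p in preds)

def any_of(*preds):
    """Return a predicate that is true when at least one predicate is true."""
    return lambda x: any(p(x) for p in preds)

odd_and_div = all_of(negate(is_even), is_divisible_by_3_or_5)
even_or_div = any_of(is_even, is_divisible_by_3_or_5)

numbers = range(1, 21)
odd_div_result = list(filter(odd_and_div, numbers))
even_or_div_result = list(filter(even_or_div, numbers))
print(odd_div_result)  # [3, 5, 9, 15]
```

The combined predicates are ordinary one-argument functions, so they could be passed to `numbers_rdd.filter(...)` in the same way as `even_and_prime`.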


### Exercise 4: Reduce Operations without State
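Reduce operations require pure functions that are both associative and commutative. Spark reduces each partition separately and then merges the partial results, so the grouping and order of operations can vary: (a op b) op c must equal a op (b op c), and a op b must equal b op a, for results to be consistent.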

In [8]:
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])

# Pure reduction functions
def add(x, y):
    """Pure function: adds two numbers"""
    return x + y

def multiply(x, y):
    """Pure function: multiplies two numbers"""
    return x * y

def find_max(x, y):
    """Pure function: returns maximum"""
    return x if x > y else y

def find_min(x, y):
    """Pure function: returns minimum"""
    return x if x < y else y

# Apply reductions
sum_result = numbers_rdd.reduce(add)
product_result = numbers_rdd.reduce(multiply)
max_result = numbers_rdd.reduce(find_max)
min_result = numbers_rdd.reduce(find_min)

print(f"Sum: {sum_result}")
print(f"Product: {product_result}")
print(f"Max: {max_result}")
print(f"Min: {min_result}")

# Alternative: Using lambda functions (also pure)
sum_lambda = numbers_rdd.reduce(lambda x, y: x + y)
product_lambda = numbers_rdd.reduce(lambda x, y: x * y)

print(f"\nUsing lambdas - Sum: {sum_lambda}, Product: {product_lambda}")

# Demonstrate associativity importance
def non_associative_op(x, y):
    """Non-associative operation - problematic for reduce"""
    return (x - y) * 2

# This gives unpredictable results in distributed setting!
# result = numbers_rdd.reduce(non_associative_op)  # Don't rely on this!

Sum: 15
Product: 120
Max: 5
Min: 1

Using lambdas - Sum: 15, Product: 120
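
The effect of a non-associative operator can be seen without a cluster. The sketch below is a simplified model, not Spark's actual scheduler: the hypothetical helper `simulate_reduce` splits the data into contiguous chunks, reduces each chunk, then reduces the partial results. The chunking rule is an assumption made for illustration.

```python
from functools import reduce

def add(x, y):
    return x + y

def non_associative_op(x, y):
    return (x - y) * 2

def simulate_reduce(op, data, num_partitions):
    """Reduce each chunk separately, then combine the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

data = [1, 2, 3, 4, 5]

sum_one = simulate_reduce(add, data, 1)
sum_two = simulate_reduce(add, data, 2)
bad_one = simulate_reduce(non_associative_op, data, 1)
bad_two = simulate_reduce(non_associative_op, data, 2)

print(sum_one, sum_two)  # 15 15
print(bad_one, bad_two)  # -66 -16
```

Addition gives 15 for either split, while the non-associative operator yields a different value as soon as the data is grouped differently.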


### Exercise 5: FlatMap for Functional Transformations
FlatMap applies a function that returns an iterable, then flattens the results. Using pure functions ensures each sentence is processed independently without side effects.

In [9]:
sentences_rdd = sc.parallelize([
    "Hello world",
    "Functional programming rocks",
    "Pure functions scale"
])

def tokenize(sentence):
    """Pure function: splits sentence into words"""
    return sentence.lower().split()

def word_with_length(sentence):
    """Pure function: returns list of (word, length) tuples"""
    return [(word, len(word)) for word in sentence.lower().split()]

def generate_bigrams(sentence):
    """Pure function: generates 2-grams from sentence"""
    words = sentence.lower().split()
    return [f"{words[i]}_{words[i+1]}"
            for i in range(len(words)-1)] if len(words) > 1 else []

# Apply flatMap operations
words = sentences_rdd.flatMap(tokenize).collect()
print(f"All words: {words}")

word_lengths = sentences_rdd.flatMap(word_with_length).collect()
print(f"Words with lengths: {word_lengths}")

bigrams = sentences_rdd.flatMap(generate_bigrams).collect()
print(f"Bigrams: {bigrams}")

# Chaining operations functionally
word_count = (sentences_rdd
    .flatMap(tokenize)
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .collect())
print(f"Word counts: {word_count}")

All words: ['hello', 'world', 'functional', 'programming', 'rocks', 'pure', 'functions', 'scale']
Words with lengths: [('hello', 5), ('world', 5), ('functional', 10), ('programming', 11), ('rocks', 5), ('pure', 4), ('functions', 9), ('scale', 5)]
Bigrams: ['hello_world', 'functional_programming', 'programming_rocks', 'pure_functions', 'functions_scale']
Word counts: [('hello', 1), ('world', 1), ('functional', 1), ('programming', 1), ('rocks', 1), ('pure', 1), ('functions', 1), ('scale', 1)]
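
To make the "flatten" step concrete, here is a plain-Python sketch contrasting the two behaviours. A list comprehension plays the role of `map`, and `itertools.chain.from_iterable` plays the role of the flattening that `flatMap` performs. This is an analogy for illustration, not how Spark implements it.

```python
from itertools import chain

def tokenize(sentence):
    """Pure function: splits sentence into lowercase words."""
    return sentence.lower().split()

sentences = ["Hello world", "Pure functions scale"]

# map-like: one output element (a list) per input sentence
mapped = [tokenize(s) for s in sentences]

# flatMap-like: the per-sentence lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(tokenize(s) for s in sentences))

print(mapped)       # [['hello', 'world'], ['pure', 'functions', 'scale']]
print(flat_mapped)  # ['hello', 'world', 'pure', 'functions', 'scale']
```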


### Exercise 6: Avoiding Shared Mutable State

Instead of maintaining mutable state, we use pure functions that return values representing counts. These can be safely combined using reduce operations.

In [11]:
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Using Filters and Count

def is_even(x):
    return x % 2 == 0

def is_odd(x):
    return x % 2 != 0

even_count = numbers_rdd.filter(is_even).count()
odd_count = numbers_rdd.filter(is_odd).count()

print(f"Approach 1 - Even: {even_count}, Odd: {odd_count}")

# Using map and reduce
def classify_number(x):
    """Pure function: returns (even_count, odd_count) tuple"""
    if x % 2 == 0:
        return (1, 0)  # even
    else:
        return (0, 1)  # odd

def combine_counts(count1, count2):
    """Pure function: combines two count tuples"""
    return (count1[0] + count2[0], count1[1] + count2[1])

result = numbers_rdd.map(classify_number).reduce(combine_counts)
print(f"Approach 2 - Even: {result[0]}, Odd: {result[1]}")

# Using mapPartitions
def partition_even_odd(iterator):
    """Pure function: counts even and odd numbers within one partition"""
    even = []
    odd = []
    for num in iterator:
        if num % 2 == 0:
            even.append(num)
        else:
            odd.append(num)
    return [(len(even), len(odd))]

result = numbers_rdd.mapPartitions(partition_even_odd).reduce(combine_counts)
print(f"Approach 3 - Even: {result[0]}, Odd: {result[1]}")

Approach 1 - Even: 5, Odd: 5
Approach 2 - Even: 5, Odd: 5
Approach 3 - Even: 5, Odd: 5
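
The idea behind all three approaches is that each partition produces a value that can be merged with others in any grouping. As a hedged plain-Python sketch of that idea, `collections.Counter` is used below as the mergeable value. The two partition layouts are made up for illustration, and `count_parity` is a hypothetical helper name.

```python
from collections import Counter
from functools import reduce

def count_parity(partition):
    """Pure function: returns a Counter of 'even'/'odd' for one partition."""
    return Counter("even" if n % 2 == 0 else "odd" for n in partition)

def merge(c1, c2):
    """Pure function: combines two Counters into a new one."""
    return c1 + c2

layout_a = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
layout_b = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

total_a = reduce(merge, map(count_parity, layout_a))
total_b = reduce(merge, map(count_parity, layout_b))

print(total_a["even"], total_a["odd"])  # 5 5
print(total_a == total_b)               # True
```

No shared counter is ever mutated: every step returns a new value, so the result does not depend on how the data is split.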


### Exercise 7: Function Composition and Pipelining

Function composition allows building complex transformations from simple, pure functions. Each function in the pipeline is independent and testable.

In [12]:
text_rdd = sc.parallelize([
    "  HELLO world  ",
    "PYTHON programming",
    "spark FUNCTIONAL"
])

# Individual pure functions
def strip_whitespace(text):
    """Pure: removes leading/trailing whitespace"""
    return text.strip()

def to_lowercase(text):
    """Pure: converts to lowercase"""
    return text.lower()

def reverse_string(text):
    """Pure: reverses string"""
    return text[::-1]

def add_prefix(prefix):
    """Pure: returns a function that adds prefix"""
    return lambda text: f"{prefix}_{text}"

# Function composition helper
def compose(*functions):
    """Compose multiple functions into one"""
    def inner(arg):
        result = arg
        for func in functions:
            result = func(result)
        return result
    return inner

# Create pipeline
pipeline = compose(
    strip_whitespace,
    to_lowercase,
    reverse_string,
    add_prefix("processed")
)

# Apply pipeline
result = text_rdd.map(pipeline).collect()
for item in result:
    print(item)

# Alternative: Manual chaining (also functional)
result2 = (text_rdd
    .map(strip_whitespace)
    .map(to_lowercase)
    .map(reverse_string)
    .map(add_prefix("processed"))
    .collect())

print(f"\nResults identical: {result == result2}")

processed_dlrow olleh
processed_gnimmargorp nohtyp
processed_lanoitcnuf kraps

Results identical: True
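
The `compose` helper applies its functions left to right, so the order in which they are listed matters. The sketch below shows an equivalent formulation with `functools.reduce` (named `pipe` here, a hypothetical name) and demonstrates how swapping two steps changes the result. Built-in `str.strip` and `str.lower` are used in place of the wrapper functions purely for brevity.

```python
from functools import reduce

def pipe(*functions):
    """Left-to-right composition: pipe(f, g)(x) == g(f(x))."""
    return lambda arg: reduce(lambda acc, func: func(acc), functions, arg)

def reverse_string(text):
    return text[::-1]

def add_prefix(prefix):
    return lambda text: f"{prefix}_{text}"

# Same steps, different order for the last two
prefix_last = pipe(str.strip, str.lower, reverse_string, add_prefix("processed"))
prefix_first = pipe(str.strip, str.lower, add_prefix("processed"), reverse_string)

print(prefix_last("  HELLO world  "))   # processed_dlrow olleh
print(prefix_first("  HELLO world  "))  # dlrow olleh_dessecorp
```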


### Exercise 8: Using Accumulators Correctly

While accumulators provide a way to track metrics, pure functional approaches using map and reduce are often cleaner and more reliable. Accumulators should be used sparingly and only when necessary.

In [13]:
data_rdd = sc.parallelize([10, -5, 20, -3, 15, 0, -10, 25])

# Create accumulators (Spark's way of handling distributed state)
positive_acc = sc.accumulator(0)
negative_acc = sc.accumulator(0)
zero_acc = sc.accumulator(0)

def process_and_count(x):
    """
    Transformation with a deliberate side effect:
    counts are tracked through accumulators
    """
    # Accumulator updates reach the driver (plain globals do not), but inside
    # a transformation they may be applied more than once if a task is re-run
    if x > 0:
        positive_acc.add(1)
    elif x < 0:
        negative_acc.add(1)
    else:
        zero_acc.add(1)

    # Pure transformation
    return x * 2

# Process data
transformed = data_rdd.map(process_and_count).collect()

print(f"Transformed data: {transformed}")
print(f"Positive numbers: {positive_acc.value}")
print(f"Negative numbers: {negative_acc.value}")
print(f"Zeros: {zero_acc.value}")

# BETTER APPROACH: Pure functional without accumulators
def categorize(x):
    """Pure function: categorizes number"""
    if x > 0:
        return ("positive", 1)
    elif x < 0:
        return ("negative", 1)
    else:
        return ("zero", 1)

# Count by category using pure functions
counts = (data_rdd
    .map(categorize)
    .reduceByKey(lambda a, b: a + b)
    .collectAsMap())

print(f"\nPure functional approach: {counts}")

Transformed data: [20, -10, 40, -6, 30, 0, -20, 50]
Positive numbers: 4
Negative numbers: 3
Zeros: 1

Pure functional approach: {'positive': 4, 'negative': 3, 'zero': 1}
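
One reason to prefer the pure approach is that side effects inside a transformation run again whenever the transformation is re-executed, for example on a task retry or a second action on an uncached RDD. The plain-Python sketch below models this with a hypothetical `CountingAccumulator` class and a `run_stage` helper. Both are stand-ins invented for illustration and are not Spark APIs.

```python
class CountingAccumulator:
    """Hypothetical stand-in for an accumulator: just an add-only counter."""
    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount

def run_stage(data, func):
    """Simulate executing a map stage once over all the data."""
    return [func(x) for x in data]

positive_acc = CountingAccumulator()

def double_and_count(x):
    if x > 0:
        positive_acc.add(1)  # side effect: repeated on every execution
    return x * 2

data = [10, -5, 20, -3, 15, 0, -10, 25]

first_run = run_stage(data, double_and_count)
second_run = run_stage(data, double_and_count)  # simulated re-execution

pure_positive_count = sum(1 for x in data if x > 0)

print(first_run == second_run)  # True: the transformation itself is stable
print(positive_acc.value)       # 8: the side effect was counted twice
print(pure_positive_count)      # 4
```

The returned values are identical across runs, but the counter has drifted. A count computed as a return value, as in the reduceByKey version above, does not have this problem.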


### Exercise 9: Handling Complex Data with Pure Functions

Pure functions can work with complex data structures. Each function transforms data without modifying the original, maintaining immutability.

In [14]:
transactions_rdd = sc.parallelize([
    {"id": 1, "amount": 100, "type": "credit"},
    {"id": 2, "amount": 50, "type": "debit"},
    {"id": 3, "amount": 200, "type": "credit"},
    {"id": 4, "amount": 75, "type": "debit"}
])

def calculate_signed_amount(transaction):
    """Pure function: returns signed amount based on type"""
    amount = transaction["amount"]
    if transaction["type"] == "credit":
        return amount
    else:
        return -amount

def extract_amount(transaction):
    """Pure function: extracts amount"""
    return transaction["amount"]

def transaction_to_type_amount(transaction):
    """Pure function: returns (type, amount) tuple"""
    return (transaction["type"], transaction["amount"])

def max_transaction(t1, t2):
    """Pure function: returns transaction with larger amount"""
    return t1 if t1["amount"] > t2["amount"] else t2

# 1. Calculate net balance
net_balance = transactions_rdd.map(calculate_signed_amount).reduce(lambda a, b: a + b)
print(f"Net balance: {net_balance}")

# 2. Find largest transaction
largest = transactions_rdd.reduce(max_transaction)
print(f"Largest transaction: {largest}")

# 3. Group by type and sum
type_sums = (transactions_rdd
    .map(transaction_to_type_amount)
    .reduceByKey(lambda a, b: a + b)
    .collectAsMap())
print(f"Sum by type: {type_sums}")

# Bonus: Create summary using pure functions
def create_summary(transaction):
    """Pure function: creates transaction summary"""
    signed = calculate_signed_amount(transaction)
    return {
        "id": transaction["id"],
        "signed_amount": signed,
        "is_credit": transaction["type"] == "credit"
    }

summaries = transactions_rdd.map(create_summary).collect()
for summary in summaries:
    print(f"Transaction {summary['id']}: {summary['signed_amount']} (Credit: {summary['is_credit']})")

Net balance: 175
Largest transaction: {'id': 3, 'amount': 200, 'type': 'credit'}
Sum by type: {'debit': 125, 'credit': 300}
Transaction 1: 100 (Credit: True)
Transaction 2: -50 (Credit: False)
Transaction 3: 200 (Credit: True)
Transaction 4: -75 (Credit: False)
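
When a transformation needs to add a field to a record, the immutable style is to return a new dictionary rather than update the input. A minimal sketch follows. The function name `with_signed_amount` is hypothetical, and the sample record mirrors transaction 2 from the exercise.

```python
def with_signed_amount(transaction):
    """Pure function: returns a new dict, leaving the input untouched."""
    sign = 1 if transaction["type"] == "credit" else -1
    # {**transaction, ...} copies the existing keys into a fresh dict
    return {**transaction, "signed_amount": sign * transaction["amount"]}

original = {"id": 2, "amount": 50, "type": "debit"}
enriched = with_signed_amount(original)

print(enriched)  # {'id': 2, 'amount': 50, 'type': 'debit', 'signed_amount': -50}
print(original)  # {'id': 2, 'amount': 50, 'type': 'debit'}
```

Because the original record is unchanged, the same function could be applied again, or alongside other functions, without one computation affecting another.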


### Exercise 10: Combining Multiple RDDs Functionally

Joining RDDs and applying transformations using pure functions ensures that operations are deterministic and can be parallelized safely. The functional pipeline approach makes the data flow clear and maintainable.

In [15]:
products_rdd = sc.parallelize([
    (1, "Laptop"),
    (2, "Mouse"),
    (3, "Keyboard")
])

sales_rdd = sc.parallelize([
    (1, 1200),
    (2, 25),
    (3, 75),
    (1, 1200),
    (2, 25)
])

def add_amounts(amount1, amount2):
    """Pure function: adds two amounts"""
    return amount1 + amount2

def apply_discount(amount, discount_rate=0.1):
    """Pure function: applies discount"""
    return amount * (1 - discount_rate)

def format_result(product_sales):
    """Pure function: formats the result"""
    product_id, (name, total) = product_sales
    return {
        "id": product_id,
        "name": name,
        "total_sales": total,
        "after_discount": apply_discount(total)
    }

# Step 1: Calculate total sales per product
sales_totals = sales_rdd.reduceByKey(add_amounts)
print("Sales totals:")
for item in sales_totals.collect():
    print(f"  Product {item[0]}: ${item[1]}")

# Step 2: Join with product names
joined = products_rdd.join(sales_totals)
print("\nJoined data:")
for item in joined.collect():
    print(f"  {item}")

# Step 3: Apply transformations and format
final_results = joined.map(format_result).collect()
print("\nFinal results with discount:")
for result in final_results:
    print(f"  {result['name']}: ${result['total_sales']} (${result['after_discount']} after discount)")

# Alternative: Functional pipeline approach
def create_sales_pipeline(products, sales):
    """Builds the full pipeline from pure steps; collect() at the end is the only action"""
    return (sales
        .reduceByKey(add_amounts)
        .join(products)
        .map(lambda x: (x[0], x[1][1], x[1][0]))  # (id, name, total)
        .map(lambda x: (x[1], x[2], apply_discount(x[2])))
        .collect())

# Reset RDDs for pipeline
products_rdd = sc.parallelize([(1, "Laptop"), (2, "Mouse"), (3, "Keyboard")])
sales_rdd = sc.parallelize([(1, 1200), (2, 25), (3, 75), (1, 1200), (2, 25)])

pipeline_result = create_sales_pipeline(products_rdd, sales_rdd)
print("\nPipeline results:")
for name, total, discounted in pipeline_result:
    print(f"  {name}: ${total} -> ${discounted}")

Sales totals:
  Product 2: $50
  Product 1: $2400
  Product 3: $75

Joined data:
  (1, ('Laptop', 2400))
  (2, ('Mouse', 50))
  (3, ('Keyboard', 75))

Final results with discount:
  Laptop: $2400 ($2160.0 after discount)
  Mouse: $50 ($45.0 after discount)
  Keyboard: $75 ($67.5 after discount)

Pipeline results:
  Laptop: $2400 -> $2160.0
  Mouse: $50 -> $45.0
  Keyboard: $75 -> $67.5
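
`join` on pair RDDs is an inner join, so a key that appears on only one side is dropped from the result. The plain-Python sketch below models the reduceByKey and join steps with dictionaries to show this. The helpers `reduce_by_key` and `inner_join` are hypothetical stand-ins, and product 4 ("Monitor") is an invented entry with no sales, added only to illustrate the point.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group values by key, then reduce each group with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

def inner_join(left, right):
    """Keep only keys present on both sides, pairing their values."""
    return {key: (left[key], right[key]) for key in left if key in right}

products = {1: "Laptop", 2: "Mouse", 3: "Keyboard", 4: "Monitor"}  # 4 is invented
sales = [(1, 1200), (2, 25), (3, 75), (1, 1200), (2, 25)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = inner_join(products, totals)

print(totals)       # {1: 2400, 2: 50, 3: 75}
print(joined)       # {1: ('Laptop', 2400), 2: ('Mouse', 50), 3: ('Keyboard', 75)}
print(4 in joined)  # False
```

If products without sales should be kept, PySpark's `leftOuterJoin` on `products_rdd` would be the operation to look at instead.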


## Key Takeaways

1. Pure functions are essential for distributed computing - they guarantee consistent results regardless of execution order or location.
2. Avoid global state - It won't work correctly in distributed settings and makes code unpredictable.
3. Immutability - RDD transformations create new RDDs rather than modifying existing ones.
4. Function composition - Build complex operations from simple, testable pure functions.
5. Use Spark's mechanisms - When you need state-like behavior, use accumulators or aggregations, not global variables.
6. Think in transformations - Express operations as chains of pure transformations rather than imperative steps.

Functional programming in PySpark isn't just about style - it's about writing correct, scalable, and maintainable distributed programs.