### Spark Session & Context

- **SparkSession**: The entry point for working with DataFrames and SQL in PySpark.  
- **SparkContext**: The core connection to the Spark cluster, used for creating RDDs; here it is obtained from the session via `spark.sparkContext`.  


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD-Exercises-Set2").getOrCreate()
sc = spark.sparkContext


# -------------------- 1. Numbers Practice --------------------


### RDD Operations Example

- **Divisible by 3**  
  Filters numbers that are multiples of 3.  

- **Doubled Values**  
  Maps each number to its double.  

- **Count Greater Than 10**  
  Counts how many numbers are greater than 10.  


In [2]:
nums = sc.parallelize(range(1, 16))
div_by_3 = nums.filter(lambda x: x % 3 == 0).collect()
doubled = nums.map(lambda x: x * 2).collect()
count_gt_10 = nums.filter(lambda x: x > 10).count()

print("Divisible by 3:", div_by_3)
print("Doubled:", doubled)
print("Count > 10:", count_gt_10)

Divisible by 3: [3, 6, 9, 12, 15]
Doubled: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
Count > 10: 5
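
In Spark, `filter` and `map` are lazy transformations, and nothing is computed until an action such as `collect` or `count` runs. The sketch below is only a plain-Python analogy (no Spark involved): Python's built-in `filter` is also lazy, so a predicate that records its calls shows when the work actually happens. The names `calls`, `is_multiple_of_3`, and `nums_local` are made up for this illustration.

```python
# Plain-Python analogy (no Spark required): the built-in filter is lazy too.
calls = []  # records every element the predicate actually inspected


def is_multiple_of_3(x):
    calls.append(x)
    return x % 3 == 0


nums_local = range(1, 16)

lazy = filter(is_multiple_of_3, nums_local)  # builds the iterator, runs nothing
evaluated_before = len(calls)

div_by_3_local = list(lazy)  # consuming the iterator plays the role of collect()
evaluated_after = len(calls)

print("Predicate calls before consuming:", evaluated_before)
print("Predicate calls after consuming:", evaluated_after)
print("Divisible by 3:", div_by_3_local)
```

The analogy stops at evaluation timing: Spark additionally records the chain of transformations so it can distribute the work and recompute lost partitions.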


# -------------------- 2. String Processing --------------------


### RDD Operations with Fruits

- **Distinct Fruits**  
  Removes duplicates and returns unique fruit names.  

- **Fruit Counts**  
  Counts how many times each fruit appears using `map` and `reduceByKey`.  

- **Longest Fruit**  
  Finds the fruit name with the maximum length.  


In [3]:
fruits = sc.parallelize(["apple", "banana", "grape", "banana", "apple", "mango"])
distinct_fruits = fruits.distinct().collect()
fruit_counts = fruits.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
longest_fruit = fruits.reduce(lambda a, b: a if len(a) > len(b) else b)

print("Distinct Fruits:", distinct_fruits)
print("Fruit Counts:", fruit_counts)
print("Longest Fruit:", longest_fruit)

Distinct Fruits: ['apple', 'banana', 'grape', 'mango']
Fruit Counts: [('apple', 2), ('banana', 2), ('grape', 1), ('mango', 1)]
Longest Fruit: banana
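
Spark's `reduce` expects a commutative and associative function. The lambda used for the longest fruit keeps the right-hand item when two names have the same length, so on tied data the winner would depend on the order in which elements are combined. The dataset above has a single longest name, so its result is unaffected. The following plain-Python sketch (using `functools.reduce`, not Spark) shows the tie behaviour and one possible order-independent variant; `longest_stable` is a hypothetical helper, not part of the notebook.

```python
from functools import reduce

fruits_local = ["apple", "banana", "grape", "banana", "apple", "mango"]

# Same lambda as the notebook: on equal lengths it keeps the right-hand item.
keep_right_on_tie = lambda a, b: a if len(a) > len(b) else b

# Three 5-letter words: the winner changes with the combination order.
tie_forward = reduce(keep_right_on_tie, ["apple", "grape", "mango"])
tie_reversed = reduce(keep_right_on_tie, ["mango", "grape", "apple"])


def longest_stable(a, b):
    # Compare (length, word) so that ties resolve to the alphabetically last word,
    # regardless of the order in which pairs are combined.
    return a if (len(a), a) >= (len(b), b) else b


stable_forward = reduce(longest_stable, ["apple", "grape", "mango"])
stable_reversed = reduce(longest_stable, ["mango", "grape", "apple"])
longest_local = reduce(longest_stable, fruits_local)

print("Tie, forward order:", tie_forward)
print("Tie, reversed order:", tie_reversed)
print("Stable variant:", stable_forward, stable_reversed)
print("Longest fruit:", longest_local)
```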


# -------------------- 3. Sentence Split --------------------


### RDD Operations with Sentences

- **Split into Words**  
  Uses `flatMap` to split each sentence into words, then `map` to convert each word to lowercase.  

- **Unique Words**  
  Finds distinct words across all sentences.  

- **Unique Word Count**  
  Counts the total number of unique words.  


In [4]:
sentences = sc.parallelize([
    "spark makes big data easy",
    "rdd is the core of spark",
    "python with spark"
])
words = sentences.flatMap(lambda s: s.split(" ")).map(lambda w: w.lower())
unique_words = words.distinct().collect()
unique_word_count = words.distinct().count()

print("Unique Words:", unique_words)
print("Unique Word Count:", unique_word_count)

Unique Words: ['big', 'easy', 'rdd', 'core', 'of', 'python', 'with', 'spark', 'makes', 'data', 'is', 'the']
Unique Word Count: 12
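
The difference between `map` and `flatMap` is easy to miss: `map` yields one output per input, so splitting with `map` would give one list per sentence, while `flatMap` concatenates those lists into a single sequence of words. A minimal plain-Python sketch of that difference, using `itertools.chain` as a stand-in for the flattening step (this models the semantics only and does not call Spark):

```python
from itertools import chain

sentences_local = [
    "spark makes big data easy",
    "rdd is the core of spark",
    "python with spark",
]

# map-like: one output per input, so each sentence becomes its own list of words.
mapped = [s.split(" ") for s in sentences_local]

# flatMap-like: the per-sentence lists are concatenated into one flat sequence.
flat = list(chain.from_iterable(s.split(" ") for s in sentences_local))

# distinct-like: a set keeps one copy of each lowercase word.
unique_local = {w.lower() for w in flat}

print("map-like result has", len(mapped), "items (one list per sentence)")
print("flatMap-like result has", len(flat), "items (one per word)")
print("Unique words:", len(unique_local))
```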


# -------------------- 4. Pair RDD Operations --------------------


### RDD Operations with Student Marks

- **Total Marks**  
  Adds up marks for each student using `reduceByKey`.  

- **Average Marks**  
  Calculates the average per student by pairing each mark with a count of 1, summing marks and counts per key, then dividing the total by the count.  

- **Highest Marks**  
  Finds the student with the maximum single score.  


In [5]:
marks = sc.parallelize([
    ("Rahul", 85), ("Priya", 92), ("Aman", 78), ("Rahul", 90), ("Priya", 88)
])
total_marks = marks.reduceByKey(lambda a, b: a + b).collect()
avg_marks = marks.mapValues(lambda x: (x, 1)) \
                 .reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1])) \
                 .mapValues(lambda x: x[0]/x[1]).collect()
highest = marks.reduce(lambda a, b: a if a[1] > b[1] else b)

print("Total Marks:", total_marks)
print("Average Marks:", avg_marks)
print("Highest Marks:", highest)


Total Marks: [('Rahul', 175), ('Priya', 180), ('Aman', 78)]
Average Marks: [('Rahul', 87.5), ('Priya', 90.0), ('Aman', 78.0)]
Highest Marks: ('Priya', 92)
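
The average is computed through `(sum, count)` pairs because averaging is not associative: combining two partial averages directly would weight them incorrectly, whereas sums and counts can be merged in any order. The sketch below models the `mapValues` and `reduceByKey` steps with a plain dictionary (no Spark involved); `combine` and `acc` are illustrative names.

```python
marks_local = [
    ("Rahul", 85), ("Priya", 92), ("Aman", 78), ("Rahul", 90), ("Priya", 88)
]


def combine(a, b):
    # Merge two (sum, count) pairs; this is the function passed to reduceByKey.
    return (a[0] + b[0], a[1] + b[1])


acc = {}
for name, mark in marks_local:
    pair = (mark, 1)  # mapValues step: each mark becomes (mark, 1)
    acc[name] = combine(acc[name], pair) if name in acc else pair

# Final mapValues step: divide the total by the count for each student.
avg_local = {name: total / count for name, (total, count) in acc.items()}

print("Sum and count per student:", acc)
print("Average per student:", avg_local)
```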


# -------------------- 5. Reduce & Aggregate --------------------


### RDD Operations with Numbers

- **Sum**  
  Adds all numbers using `reduce`.  

- **Product**  
  Multiplies all numbers using `reduce`.  

- **Average**  
  Divides the total sum by the count of numbers.  


In [6]:
nums2 = sc.parallelize([5, 10, 15, 20, 25])
sum_val = nums2.reduce(lambda a, b: a + b)
product_val = nums2.reduce(lambda a, b: a * b)
count_val = nums2.count()
avg_val = sum_val / count_val

print("Sum:", sum_val)
print("Product:", product_val)
print("Average:", avg_val)


Sum: 75
Product: 375000
Average: 15.0
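
The cell above obtains the average with two passes (`reduce` for the sum, `count` for the size). PySpark's `RDD.aggregate(zeroValue, seqOp, combOp)` can produce both in one pass: `seqOp` folds elements into an accumulator within each partition, and `combOp` merges the per-partition accumulators. The plain-Python sketch below models that flow without Spark; the split into two partitions is an assumption made for illustration, since Spark decides the real partitioning.

```python
from functools import reduce

# Hypothetical partitioning of the five numbers; Spark chooses the real split.
partitions = [[5, 10], [15, 20, 25]]

zero = (0, 0)  # accumulator: (running sum, running count)


def seq_op(acc, x):
    # Fold one element into a partition-local accumulator.
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Merge the accumulators of two partitions.
    return (a[0] + b[0], a[1] + b[1])


per_partition = [reduce(seq_op, part, zero) for part in partitions]
total, count = reduce(comb_op, per_partition, zero)
avg_local = total / count

print("Per-partition (sum, count):", per_partition)
print("Sum:", total, "Count:", count, "Average:", avg_local)
```

With the notebook's RDD, the corresponding call would be `nums2.aggregate((0, 0), seq_op, comb_op)`.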


# -------------------- 6. Word Length Analysis --------------------


### RDD Operations with Words

- **Word Lengths**  
  Maps each word to a tuple of (word, length).  

- **Longest Word**  
  Finds the word with the maximum length using `reduce`.  

- **Average Length**  
  Calculates the average length of all words.


In [7]:
words2 = sc.parallelize(["data", "engineering", "spark", "rdd", "pyspark", "analytics"])
word_lengths = words2.map(lambda w: (w, len(w))).collect()
longest_word = words2.reduce(lambda a, b: a if len(a) > len(b) else b)
avg_length = words2.map(lambda w: len(w)).sum() / words2.count()

print("Word Lengths:", word_lengths)
print("Longest Word:", longest_word)
print("Average Length:", avg_length)


Word Lengths: [('data', 4), ('engineering', 11), ('spark', 5), ('rdd', 3), ('pyspark', 7), ('analytics', 9)]
Longest Word: engineering
Average Length: 6.5


# -------------------- 7. Joins --------------------


### RDD Joins

- **Inner Join**  
  Returns only the keys present in both RDDs, pairing each student with the matching course.  

- **Left Outer Join**  
  Returns every student; the course is `None` when the key has no match in `courses`.  

- **Right Outer Join**  
  Returns every course; the student is `None` when the key has no match in `students`.


In [8]:
students = sc.parallelize([(1, "Rahul"), (2, "Priya"), (3, "Aman")])
courses = sc.parallelize([(1, "Python"), (2, "Spark"), (4, "Databases")])

inner_join = students.join(courses).collect()
left_join = students.leftOuterJoin(courses).collect()
right_join = students.rightOuterJoin(courses).collect()

print("Inner Join:", inner_join)
print("Left Outer Join:", left_join)
print("Right Outer Join:", right_join)


Inner Join: [(1, ('Rahul', 'Python')), (2, ('Priya', 'Spark'))]
Left Outer Join: [(1, ('Rahul', 'Python')), (2, ('Priya', 'Spark')), (3, ('Aman', None))]
Right Outer Join: [(4, (None, 'Databases')), (1, ('Rahul', 'Python')), (2, ('Priya', 'Spark'))]
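
The three join types differ only in which keys survive. The plain-Python model below makes that explicit with set operations on the keys (no Spark involved). `join_model` is a hypothetical helper written for this illustration, and it assumes each key appears at most once per side; a real Spark join emits one pair for every combination of matching values when keys repeat.

```python
students_local = [(1, "Rahul"), (2, "Priya"), (3, "Aman")]
courses_local = [(1, "Python"), (2, "Spark"), (4, "Databases")]


def join_model(left, right, keep_left=False, keep_right=False):
    """Model a keyed join; assumes keys are unique within each input."""
    left_d, right_d = dict(left), dict(right)
    keys = set(left_d) & set(right_d)  # inner join: keys present on both sides
    if keep_left:
        keys |= set(left_d)  # left outer: also keep unmatched left keys
    if keep_right:
        keys |= set(right_d)  # right outer: also keep unmatched right keys
    # dict.get returns None for the missing side, mirroring Spark's output.
    return sorted((k, (left_d.get(k), right_d.get(k))) for k in keys)


inner_local = join_model(students_local, courses_local)
left_local = join_model(students_local, courses_local, keep_left=True)
right_local = join_model(students_local, courses_local, keep_right=True)

print("Inner:", inner_local)
print("Left outer:", left_local)
print("Right outer:", right_local)
```

The model sorts by key for readable output; Spark makes no ordering promise for joined results, which is why the right outer join above lists key 4 first.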


# -------------------- 8. Mini Real-World --------------------


### RDD Operations with Orders

- **Total per Customer**  
  Sums all order amounts for each customer using `reduceByKey`.  

- **Customer with Maximum Spend**  
  Finds the customer who spent the most.  

- **Total Revenue**  
  Calculates the sum of all orders.  


In [9]:
orders = sc.parallelize([(1, 200), (2, 500), (3, 300), (1, 150), (2, 250)])

# Build the per-customer totals once and reuse them for both results.
totals = orders.reduceByKey(lambda a, b: a + b)
total_per_customer = totals.collect()
max_customer = totals.reduce(lambda a, b: a if a[1] > b[1] else b)
total_revenue = orders.map(lambda x: x[1]).sum()

print("Total per Customer:", total_per_customer)
print("Customer with Max Spend:", max_customer)
print("Total Revenue:", total_revenue)


Total per Customer: [(2, 750), (1, 350), (3, 300)]
Customer with Max Spend: (2, 750)
Total Revenue: 1400
