# Perform a simple data transformation: filtering even numbers from a given list using a PySpark RDD

In [1]:
sc

 # Dataset Overview
Total Records: 50 students
Columns: 7 → id, name, age, gender, math, science, english

No missing values

# Demographics
Age: 18 – 25 years (average ≈ 21.5)

Gender: 29 Female, 21 Male

# Academic Performance
Math:
Range: 40 – 100
Mean: 68.9
Std. Dev.: 17.6 (high variation)

Science:
Range: 44 – 99
Mean: 70.2
Std. Dev.: 14.6 (moderate variation)

English:
Range: 42 – 100
Mean: 69.4
Std. Dev.: 18.7 (highest variation)

# Key Insights
Science is the strongest subject on average.
English has the most variation in performance.
Students perform differently across subjects (not uniform).

In [2]:
import random

In [4]:
# Generate 100 random integers between 1 and 1000 (inclusive)
random_numbers = [random.randint(1, 1000) for _ in range(100)]

print("Original List:")
print(random_numbers)

Original List:
[405, 570, 780, 796, 176, 851, 879, 729, 809, 606, 759, 421, 790, 625, 937, 95, 264, 201, 981, 73, 58, 153, 555, 830, 827, 540, 454, 415, 710, 53, 10, 920, 465, 801, 73, 517, 659, 658, 446, 745, 722, 51, 24, 732, 910, 640, 568, 319, 204, 571, 459, 422, 465, 667, 989, 128, 576, 554, 766, 25, 245, 57, 439, 55, 949, 351, 254, 935, 924, 866, 972, 193, 6, 392, 53, 784, 470, 569, 788, 18, 957, 563, 161, 115, 248, 995, 903, 394, 380, 118, 985, 911, 142, 458, 222, 207, 191, 714, 439, 540]


In [5]:
# Distribute the local Python list as an RDD
numbers_rdd = sc.parallelize(random_numbers)


In [6]:
# filter() is a lazy transformation: it only defines the new RDD
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [7]:
# collect() is an action: it runs the job and returns the results to the driver
even_numbers = even_numbers_rdd.collect()

print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[456, 264, 194, 440, 138, 344, 956, 628, 752, 580, 446, 238, 522, 546, 214, 790, 996, 160, 256, 152, 374, 322, 470, 334, 512, 856, 146, 230, 818, 402, 738, 316, 218, 898, 694, 742, 422, 758, 496, 330, 480, 892, 532, 380, 444, 896, 504, 994, 708, 358, 664, 390, 324, 938]


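# Summary
This notebook demonstrates a simple data transformation with PySpark RDDs: filtering the even numbers out of a list of 100 random integers.

It focuses on RDD operations (transformations and actions), which are the building blocks for processing large datasets in Spark.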

# Operations Performed
# 1. Setup

Imported Python's random module.

Used the SparkContext (sc) already available in the notebook session to work with RDDs.

Generated the sample data in memory: a list of 100 random integers between 1 and 1000.

# 2. RDD Creation
The list was converted into an RDD using sc.parallelize(). For file-based data, sc.textFile() would create an RDD of lines instead.

# 3. Transformations
Operations that define a new RDD but do not execute immediately (lazy evaluation):

map() → apply function to each element.

filter() → filter elements based on condition.

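flatMap() → map each element to zero or more elements and flatten the results.

distinct() → remove duplicates.

union() / intersection() → combine two RDDs, or keep only the elements common to both.

groupByKey() / reduceByKey() → group or aggregate values by key (on key-value RDDs).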

# 4. Actions
Operations that trigger execution and return results:

collect() → return all elements to the driver (use with care on large RDDs).

count() → count the records.

first() → return the first element.

take(n) → return the first n elements.

reduce() → aggregate all elements with a binary function.

# 5. Data Transformation Examples
Converting strings to key-value pairs.

Filtering based on conditions (e.g., ages > 20).

Aggregating numbers (sum, average, min, max).

Word count (common beginner example).



# 6. Output & Verification
Displaying transformed data with .collect().

Checking counts, sums, or sample records.