### Filter even numbers from a list using a PySpark RDD

In [1]:
sc


## 📊 Dataset Overview

* Total Records: **50 students**
* Columns: **7** → `id`, `name`, `age`, `gender`, `math`, `science`, `english`
* **No missing values**

### 👥 Demographics

* Age: **18 – 25 years** (average ≈ 21.5)
* Gender: **29 Female**, **21 Male**

### 📚 Academic Performance

* **Math:**

  * Range: **40 – 100**
  * Mean: **68.9**
  * Std. Dev.: **17.6** (high variation)

* **Science:**

  * Range: **44 – 99**
  * Mean: **70.2**
  * Std. Dev.: **14.6** (moderate variation)

* **English:**

  * Range: **42 – 100**
  * Mean: **69.4**
  * Std. Dev.: **18.7** (highest variation)

### Key Insights

* **Science** is the strongest subject on average.
* **English** has the most variation in performance.
* Students perform differently across subjects (not uniform).




In [2]:
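# Uncomment the import and Step 1 when running outside a session that already provides `sc`.
# from pyspark import SparkContext
import random

# Step 1: Initialize SparkContext (skipped here: `sc` already exists, see In [1])
# sc = SparkContext("local", "EvenNumberFilter")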

In [3]:
# Step 2: Generate 100 random integers between 1 and 1000
random_numbers = [random.randint(1, 1000) for _ in range(100)]

print("Original List:")
print(random_numbers)

Original List:
[114, 313, 175, 101, 10, 308, 294, 505, 31, 135, 944, 201, 13, 122, 944, 225, 764, 694, 367, 425, 577, 533, 485, 262, 791, 911, 671, 832, 493, 283, 9, 723, 137, 77, 988, 590, 296, 267, 778, 981, 393, 359, 953, 986, 607, 781, 52, 187, 664, 109, 630, 36, 499, 570, 99, 342, 127, 643, 390, 788, 742, 363, 911, 70, 451, 793, 541, 450, 232, 981, 666, 553, 887, 404, 732, 787, 146, 65, 623, 798, 523, 168, 24, 801, 351, 2, 454, 25, 334, 540, 865, 792, 789, 164, 74, 411, 686, 914, 114, 422]


In [4]:
# Step 3: Parallelize the list into an RDD
numbers_rdd = sc.parallelize(random_numbers)

In [5]:
# Step 4: Filter only even numbers
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [6]:
# Step 5: Collect results
even_numbers = even_numbers_rdd.collect()

print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[114, 10, 308, 294, 944, 122, 944, 764, 694, 262, 832, 988, 590, 296, 778, 986, 52, 664, 630, 36, 570, 342, 390, 788, 742, 70, 450, 232, 666, 404, 732, 146, 798, 168, 24, 2, 454, 334, 540, 792, 164, 74, 686, 914, 114, 422]


In [8]:
# Stop SparkContext
# sc.stop()

## Summary
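This notebook demonstrates a simple data transformation with PySpark RDDs: it parallelizes a list of 100 random integers, keeps the even values with a `filter()` transformation, and retrieves the result with the `collect()` action.

The sections below summarize the steps taken and list the most common RDD transformations and actions for reference.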

## ⚙️ Operations Performed
### 1. Setup
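* Imported `random`; the PySpark import is left commented out because the notebook session already provides `sc`.

* Reused the existing SparkContext (`sc`) to work with RDDs.

* Generated the sample data in Python: 100 random integers between 1 and 1000.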

### 2. RDD Creation
* Converted the Python list into an RDD with `sc.parallelize()`. (`sc.textFile()` is the equivalent entry point for text files; this notebook does not use it.)

### 3. Transformations
Operations that define a new RDD but do not execute until an action runs (lazy evaluation). This notebook uses only `filter()`; the others are listed for reference:

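* `map()` → apply a function to each element.

* `filter()` → keep only the elements that satisfy a condition.

* `flatMap()` → apply a function that returns zero or more items per element and flatten the results.

* `distinct()` → remove duplicates.

* `union()` / `intersection()` → combine two RDDs: `union()` keeps every element of both (duplicates included), `intersection()` keeps only the elements common to both.

* `groupByKey()` / `reduceByKey()` → on key-value RDDs, group the values of each key or merge them with a function.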
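
The lazy behaviour of transformations can be modelled without a cluster. The sketch below is plain Python, not Spark: generator functions stand in for `map()` and `filter()`, and `list()` plays the role of an action such as `collect()`. The names `lazy_map`, `lazy_filter`, and `calls` are hypothetical and exist only for this illustration.

```python
# Plain-Python model of lazy evaluation; an analogy, not Spark's API.
calls = []  # records every time the predicate or mapper actually runs


def lazy_filter(predicate, items):
    """Stand-in for rdd.filter(predicate): yields matches only when consumed."""
    for item in items:
        calls.append(("filter", item))
        if predicate(item):
            yield item


def lazy_map(func, items):
    """Stand-in for rdd.map(func): applies func only when consumed."""
    for item in items:
        calls.append(("map", item))
        yield func(item)


numbers = [1, 2, 3, 4, 5, 6]

# Building the pipeline runs nothing yet, like chaining transformations.
pipeline = lazy_map(lambda x: x * 10, lazy_filter(lambda x: x % 2 == 0, numbers))
calls_before_action = len(calls)

# Consuming the pipeline is the "action"; only now does the work happen.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action)  # 0
print(result)               # [20, 40, 60]
```

The counter stays at zero until the pipeline is consumed, which mirrors why `numbers_rdd.filter(...)` returns immediately while `collect()` is the call that does the work.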

### 4. Actions
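Operations that trigger execution and return results to the driver. This notebook uses only `collect()`:

* `collect()` → return all elements as a list; suitable only when the result fits in driver memory.

* `count()` → return the number of elements.

* `first()` → return the first element.

* `take(n)` → return the first n elements.

* `reduce()` → aggregate the elements with a binary function, which should be commutative and associative.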
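
To make the return values concrete, here is a minimal plain-Python sketch of what each action hands back, using the first five even numbers from the run above. It does not start Spark; the trailing comments name the PySpark call each line mirrors, and `evens_rdd` is a hypothetical RDD holding the same five values.

```python
from functools import reduce
from itertools import islice

# Plain-Python equivalents of the listed actions on a small sample.
evens = [114, 10, 308, 294, 944]

collected = list(evens)                    # evens_rdd.collect()
count = len(evens)                         # evens_rdd.count()
first = evens[0]                           # evens_rdd.first()
first_three = list(islice(evens, 3))       # evens_rdd.take(3)
total = reduce(lambda a, b: a + b, evens)  # evens_rdd.reduce(lambda a, b: a + b)

print(count, first, first_three, total)
```

Addition is used for `reduce` because it is commutative and associative, so the result does not depend on how the elements would be split across partitions.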

### 5. Data Transformation Examples
* Converting strings to key-value pairs.

* Filtering based on conditions (e.g., ages > 20).

* Aggregating numbers (sum, average, min, max).

* Word count (common beginner example).
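
The word count mentioned above combines three of the listed transformations. The sketch below reproduces the logic in plain Python so that it runs without a Spark session; the sample `lines` are hypothetical. In PySpark the same pipeline would be written roughly as `sc.parallelize(lines).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.

```python
from collections import defaultdict

# Hypothetical sample text, used only for illustration.
lines = ["spark makes rdds", "rdds are lazy", "spark is fast"]

# flatMap step: split every line into words and flatten into one list.
words = [word for line in lines for word in line.split()]

# map step: turn each word into a (key, value) pair.
pairs = [(word, 1) for word in words]

# reduceByKey step: merge the values of each key with addition.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
print(word_counts)
```

The intermediate `pairs` list is the "strings to key-value pairs" conversion from the first bullet; the per-key addition is what `reduceByKey()` performs for each key.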

### 6. Output & Verification
* Displaying transformed data with .collect().

* Checking counts, sums, or sample records.
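
One way to verify a filter result is to recompute it locally and compare. The sketch below does this for the first ten values of the original list printed earlier; `check_even_filter` is a hypothetical helper, and `collected` stands in for what `even_numbers_rdd.collect()` would return on that sample. It assumes the collected list keeps the input order, as the output shown in this notebook does.

```python
def check_even_filter(original, collected):
    """Return True if `collected` is exactly the even values of `original`, in order."""
    expected = [x for x in original if x % 2 == 0]
    odd_count = sum(1 for x in original if x % 2 != 0)
    return (
        collected == expected                              # same elements, same order
        and all(x % 2 == 0 for x in collected)             # nothing odd slipped through
        and len(collected) + odd_count == len(original)    # every input is accounted for
    )


# First ten values of the original list from this run.
original = [114, 313, 175, 101, 10, 308, 294, 505, 31, 135]
# Stand-in for even_numbers_rdd.collect() on that sample.
collected = [114, 10, 308, 294]

is_valid = check_even_filter(original, collected)
print(is_valid)
```

A local recomputation like this is only practical for small inputs; on larger data, comparing `count()` values or a `take(n)` sample is the cheaper check.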