Perform a simple data transformation, such as filtering the even numbers from a given list, using a PySpark RDD

In [1]:
import os

os.environ["JAVA_HOME"] = r"C:\Progra~1\Java\jdk1.8"
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.5.7-bin-hadoop3-scala2.13"
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PYSPARK_PYTHON"] = r"C:\Users\sathw\anaconda3\envs\pyspark310\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\sathw\anaconda3\envs\pyspark310\python.exe"

os.environ["PATH"] += ";" + os.path.join(os.environ["SPARK_HOME"], "bin")
os.environ["PATH"] += ";" + os.path.join(os.environ["HADOOP_HOME"], "bin")

# Initialize findspark
import findspark
findspark.init(os.environ["SPARK_HOME"])



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Jupyter-PySpark") \
    .master("local[*]") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.pyspark.python", os.environ["PYSPARK_PYTHON"]) \
    .config("spark.pyspark.driver.python", os.environ["PYSPARK_DRIVER_PYTHON"]) \
    .getOrCreate()

sc = spark.sparkContext



In [3]:
sc

📊 Dataset Overview

    --Total Records: 50 students

    --Columns: 7 → id, name, age, gender, math, science, english

    --No missing values


👥 Demographics

    --Age: 18 – 25 years (average ≈ 21.5)

    --Gender: 29 Female, 21 Male


📚 Academic Performance

**Math

    --Range: 40 – 100

    --Mean: 68.9

    --Std. Dev.: 17.6 (high variation)


**Science

    --Range: 44 – 99

    --Mean: 70.2

    --Std. Dev.: 14.6 (moderate variation)


**English

    --Range: 42 – 100

    --Mean: 69.4

    --Std. Dev.: 18.7 (highest variation)


🔑 Key Insights

    --Science is the strongest subject on average.

    --English has the most variation in performance.

    --Students perform differently across subjects (not uniform).

In [4]:
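# Step 1: Reuse the SparkContext `sc` obtained from the SparkSession above.
# A standalone script would create its own context instead:
# from pyspark import SparkContext
# sc = SparkContext("local", "EvenNumberFilter")
import random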

In [5]:
# Step 2: Generate 100 random integers between 1 and 1000
random_numbers = [random.randint(1, 1000) for _ in range(100)]
print("Original List:")
print(random_numbers)

Original List:
[367, 990, 227, 786, 874, 155, 467, 785, 887, 78, 884, 899, 61, 934, 580, 948, 606, 188, 442, 39, 581, 857, 71, 613, 722, 952, 852, 971, 345, 537, 814, 123, 766, 549, 165, 794, 12, 10, 302, 43, 942, 286, 129, 360, 690, 161, 524, 887, 573, 823, 741, 450, 195, 732, 955, 45, 747, 428, 682, 156, 932, 490, 312, 310, 487, 130, 32, 478, 133, 469, 353, 369, 99, 344, 542, 944, 781, 985, 116, 658, 58, 2, 956, 121, 240, 275, 33, 462, 821, 586, 176, 572, 887, 94, 880, 582, 254, 104, 300, 396]
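
Because Step 2 draws unseeded random numbers, the list above changes on every run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the exercise. The seed value 42 and the names rng and rng_again are illustrative choices, not part of the original notebook.

```python
import random

# Assumption: 42 is an arbitrary seed picked for illustration.
rng = random.Random(42)

# Same bounds and length as Step 2: 100 integers in [1, 1000].
random_numbers = [rng.randint(1, 1000) for _ in range(100)]

# A second generator with the same seed reproduces the same list.
rng_again = random.Random(42)
random_numbers_again = [rng_again.randint(1, 1000) for _ in range(100)]

print(random_numbers == random_numbers_again)  # True
```

Using a dedicated random.Random instance keeps the seed local to this cell instead of changing the global random state.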


In [6]:
# Step 3: Parallelize the list into an RDD
numbers_rdd = sc.parallelize(random_numbers)

In [7]:
# Step 4: Filter only even numbers
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [8]:
# Step 5: Collect results
even_numbers = even_numbers_rdd.collect()
print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[990, 786, 874, 78, 884, 934, 580, 948, 606, 188, 442, 722, 952, 852, 814, 766, 794, 12, 10, 302, 942, 286, 360, 690, 524, 450, 732, 428, 682, 156, 932, 490, 312, 310, 130, 32, 478, 344, 542, 944, 116, 658, 58, 2, 956, 240, 462, 586, 176, 572, 94, 880, 582, 254, 104, 300, 396]
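
The collected result can be cross-checked on the driver without Spark. The sketch below compares a filtered list against a plain list comprehension over the original list. The helper name check_even_filter is hypothetical, and the sample values are the first ten numbers of the original list above together with the even values among them. It assumes the collected result keeps the original element order, which is what the output above shows.

```python
def check_even_filter(original, filtered):
    """Return True if `filtered` holds exactly the even values of `original`, in order."""
    expected = [x for x in original if x % 2 == 0]
    return filtered == expected


# First ten values of the original list above, and the even values among them.
sample_original = [367, 990, 227, 786, 874, 155, 467, 785, 887, 78]
sample_filtered = [990, 786, 874, 78]

print(check_even_filter(sample_original, sample_filtered))  # True
```

In the notebook itself the same check would be check_even_filter(random_numbers, even_numbers).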


Summary

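This notebook demonstrates a basic data transformation with PySpark RDDs: keeping only the even numbers from a list of 100 random integers.

It applies one transformation, filter(), and one action, collect(). The sections below also list other common RDD operations for reference.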

⚙️ Operations Performed


**1. Setup

    --Imported PySpark libraries.

    --Created a SparkSession and obtained its SparkContext (sc) to work with RDDs.

    --Generated sample data: 100 random integers between 1 and 1000.



**2. RDD Creation

    --Converted the Python list into an RDD with sc.parallelize(). (sc.textFile() is the counterpart for reading data from files.)

   

**3. Transformations
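Operations that define a new RDD but do not execute immediately (lazy evaluation). This notebook uses only filter(); the others are common related operations:

    --map() → apply a function to each element.

    --filter() → keep only the elements that satisfy a condition.

    --flatMap() → map each element to zero or more outputs and flatten the result.

    --distinct() → remove duplicates.

    --union() / intersection() → combine two RDDs (all elements / only shared elements).

    --groupByKey() / reduceByKey() → group or aggregate the values of each key in a key-value RDD.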

   

**4. Actions
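Operations that trigger execution and return results to the driver. This notebook uses only collect(); the others are common related operations:

    --collect() → return all elements as a Python list.

    --count() → count the elements.

    --first() → return the first element.

    --take(n) → return the first n elements.

    --reduce() → aggregate all elements with a two-argument function.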

**5. Data Transformation Examples

    --Converting strings to key-value pairs.

    --Filtering based on conditions (e.g., ages > 20).

    --Aggregating numbers (sum, average, min, max).

    --Word count (common beginner example).

**6. Output & Verification

    --Displaying transformed data with .collect().

    --Checking counts, sums, or sample records.

In [9]:
# Stop the SparkContext when finished (uncomment to release its resources)
# sc.stop()