### Filtering even numbers from a list with a PySpark RDD

In [1]:
sc

**Dataset Summary:**

* **Number of entries:** 50
* **Total features:** 7
* The dataset contains student demographic information (**id, name, age, gender**) and academic performance data (**math, science, english**).

| Feature Name | Description                                                                                                         | Data Type       | Example Value |
| ------------ | ------------------------------------------------------------------------------------------------------------------- | --------------- | ------------- |
| **id**       | Unique identifier for each student. Helps distinguish records.                                                      | Integer         | 1             |
| **name**     | Name of the student. Serves as a label but is not useful for statistical analysis.                                  | String (Object) | Alice         |
| **age**      | Age of the student in years. Useful for demographic insights and performance trends.                                | Integer         | 20            |
| **gender**   | Gender of the student, typically denoted as ‘M’ (Male) or ‘F’ (Female). Allows gender-based performance comparison. | String (Object) | F             |
| **math**     | Marks obtained by the student in Mathematics. Reflects proficiency in numerical and problem-solving skills.         | Integer         | 66            |
| **science**  | Marks obtained in Science. Indicates understanding of scientific concepts and application.                          | Integer         | 92            |
| **english**  | Marks obtained in English. Measures language comprehension, grammar, and writing ability.                           | Integer         | 44            |

In [3]:
import random

# Step 1: Initialize SparkContext
# The notebook session already provides `sc` (see cell 1), so the lines below stay commented out.
# from pyspark import SparkContext
# sc = SparkContext("local", "EvenNumberFilter")

In [4]:
# Step 2: Generate 100 random integers between 1 and 1000
random_numbers = [random.randint(1, 1000) for _ in range(100)]

print("Original List:")
print(random_numbers)

Original List:
[143, 686, 654, 592, 210, 355, 87, 440, 252, 747, 654, 397, 406, 884, 106, 360, 515, 169, 804, 163, 581, 518, 351, 534, 530, 683, 277, 33, 335, 592, 579, 9, 410, 81, 986, 76, 191, 172, 764, 974, 384, 884, 708, 120, 483, 654, 524, 506, 6, 367, 305, 973, 70, 780, 257, 761, 582, 529, 802, 19, 848, 690, 156, 653, 211, 171, 720, 322, 763, 249, 941, 665, 153, 683, 352, 309, 80, 256, 626, 851, 57, 689, 366, 877, 169, 504, 607, 938, 982, 242, 618, 315, 704, 550, 465, 103, 301, 323, 738, 610]
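
Because the list comes from `random.randint`, the output above changes on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable here; the helper name `make_numbers` and the seed value are illustrative and not part of the original notebook:

```python
import random

def make_numbers(n=100, low=1, high=1000, seed=42):
    """Return n pseudo-random integers in [low, high], reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; leaves the global random state untouched
    return [rng.randint(low, high) for _ in range(n)]

random_numbers = make_numbers()
print(len(random_numbers))  # 100
```

Using a private `random.Random` instance rather than `random.seed()` keeps the seeding local to this helper.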


In [5]:
# Step 3: Parallelize the list into an RDD
numbers_rdd = sc.parallelize(random_numbers)

In [6]:
# Step 4: Filter only even numbers
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [7]:
# Step 5: Collect results
even_numbers = even_numbers_rdd.collect()

print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[686, 654, 592, 210, 440, 252, 654, 406, 884, 106, 360, 804, 518, 534, 530, 592, 410, 986, 76, 172, 764, 974, 384, 884, 708, 120, 654, 524, 506, 6, 70, 780, 582, 802, 848, 690, 156, 720, 322, 352, 80, 256, 626, 366, 504, 938, 982, 242, 618, 704, 550, 738, 610]
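
The collected result can be cross-checked against a plain Python filter on the driver. The sketch below uses the first ten values from the original list printed earlier; `evens_locally` is a hypothetical helper name. It relies on `filter()` followed by `collect()` returning elements in their original order for a parallelized list, which matches the output shown above.

```python
def evens_locally(numbers):
    """Reference implementation: keep even numbers, preserving input order."""
    return [x for x in numbers if x % 2 == 0]

# First ten values of the original list printed in the notebook
sample = [143, 686, 654, 592, 210, 355, 87, 440, 252, 747]
expected = evens_locally(sample)
print(expected)  # [686, 654, 592, 210, 440, 252]

# In the notebook itself, the equivalent check would be:
# assert even_numbers == evens_locally(random_numbers)
```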


In [8]:
# Stop SparkContext (uncomment only if this notebook created `sc` itself)
# sc.stop()

### Conclusion

This notebook demonstrated a basic data-processing task in PySpark. We generated a list of 100 random integers between 1 and 1000, then converted that Python list into a Resilient Distributed Dataset (RDD), the core data abstraction in Spark, using the `sc.parallelize()` method.

The key operation was the `filter()` transformation, which keeps only the elements that satisfy the even-number condition (`x % 2 == 0`). Transformations are lazy: `filter()` only records the operation, and Spark evaluates it across the RDD's partitions once an action runs. Here that action was `collect()`, which retrieved the filtered data from the workers and returned it to the driver program as a local list, which was then printed.
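
`collect()` copies every filtered element to the driver, which is fine for 100 numbers but can overwhelm the driver for large RDDs. A minimal sketch of lighter alternatives, assuming `sc` is an active SparkContext as in this notebook; `is_even` and `preview_evens` are illustrative names that do not appear in the original cells:

```python
def is_even(x):
    """Predicate used by filter(); equivalent to the notebook's lambda."""
    return x % 2 == 0

def preview_evens(sc, numbers, n=5):
    """Count the even numbers and fetch only the first n of them.

    `sc` is assumed to be an active SparkContext.
    """
    evens = sc.parallelize(numbers).filter(is_even)  # lazy: no job runs yet
    evens.cache()            # keep the filtered data so the second action does not recompute it
    total = evens.count()    # action: triggers the computation
    first = evens.take(n)    # action: returns at most n elements to the driver
    return total, first

# Usage in the notebook (not executed here):
# total, first = preview_evens(sc, random_numbers)
```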

This example illustrates the basic workflow of a Spark application:
1.  Create an RDD from a data source.
2.  Apply one or more transformations to the RDD.
3.  Execute an action to trigger the computation and obtain a result.
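
The three steps above can be sketched as one function that chains two transformations before a single action. This is an illustration under assumptions rather than part of the original notebook: `sc` is taken to be an active SparkContext, and `keep_even`, `halve`, and `sum_of_halved_evens` are hypothetical names.

```python
def keep_even(x):
    """Predicate for filter(): True for even integers."""
    return x % 2 == 0

def halve(x):
    """Function for map(): integer-halve a value."""
    return x // 2

def sum_of_halved_evens(sc, numbers):
    """Create an RDD, chain filter() and map(), then reduce with the sum() action."""
    rdd = sc.parallelize(numbers)               # 1. create an RDD from a local list
    halved = rdd.filter(keep_even).map(halve)   # 2. chained transformations (still lazy)
    return halved.sum()                         # 3. action: runs the job, returns a number

# Usage in the notebook (not executed here):
# result = sum_of_halved_evens(sc, random_numbers)
```

Named functions are used instead of lambdas only so that each step can be checked on its own; the lambda form from the notebook behaves the same way.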

The same three steps carry over to larger datasets and longer chains of transformations, which makes this example a useful starting point for distributed data manipulation with PySpark.