### Implement and demonstrate dataset sampling using the `sample()` and `takeSample()` methods in PySpark (DataFrames)

In [1]:
sc

**Dataset Summary:**

* **Number of entries:** 50
* **Total features:** 7
* The dataset contains student demographic information (**id, name, age, gender**) and academic performance data (**math, science, english**).

| Feature Name | Description                                                                                                         | Data Type       | Example Value |
| ------------ | ------------------------------------------------------------------------------------------------------------------- | --------------- | ------------- |
| **id**       | Unique identifier for each student. Helps distinguish records.                                                      | Integer         | 1             |
| **name**     | Name of the student. Serves as a label but is not useful for statistical analysis.                                  | String          | Alice         |
| **age**      | Age of the student in years. Useful for demographic insights and performance trends.                                | Integer         | 20            |
| **gender**   | Gender of the student, typically denoted as ‘M’ (Male) or ‘F’ (Female). Allows gender-based performance comparison. | String          | F             |
| **math**     | Marks obtained by the student in Mathematics. Reflects proficiency in numerical and problem-solving skills.         | Integer         | 66            |
| **science**  | Marks obtained in Science. Indicates understanding of scientific concepts and application.                          | Integer         | 92            |
| **english**  | Marks obtained in English. Measures language comprehension, grammar, and writing ability.                           | Integer         | 44            |

In [2]:
from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("SamplingExample").getOrCreate()

In [3]:
# Step 2: Read CSV file into DataFrame
df = spark.read.csv("students.csv", header=True, inferSchema=True)

In [4]:
# === Sampling Demonstration (within 7 operations) ===

# 1. View first 5 rows
print("=== First 5 rows of dataset ===")
df.show(5)

=== First 5 rows of dataset ===
+---+-------+---+------+----+-------+-------+
| id|   name|age|gender|math|science|english|
+---+-------+---+------+----+-------+-------+
|  1|  Alice| 20|     F|  66|     92|     44|
|  2|    Bob| 20|     M|  82|     52|     77|
|  3|Charlie| 22|     F|  43|     57|     76|
|  4|  David| 19|     M|  95|     69|     46|
|  5|    Eva| 19|     F|  62|     44|     96|
+---+-------+---+------+----+-------+-------+
only showing top 5 rows


In [5]:
# 2. Print schema
print("=== Schema of dataset ===")
df.printSchema()

=== Schema of dataset ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- math: integer (nullable = true)
 |-- science: integer (nullable = true)
 |-- english: integer (nullable = true)



In [6]:
# 3. Random sample without replacement (about 30% of the data).
#    fraction is a per-row probability, so the sample size is approximate, not exact.
print("=== Sample (30% without replacement) ===")
df.sample(withReplacement=False, fraction=0.3, seed=42).show(10)

=== Sample (30% without replacement) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  4| David| 19|     M|  95|     69|     46|
|  8| Henry| 21|     F|  53|     82|     60|
| 17|Quincy| 18|     M|  65|     79|     54|
| 19|   Sam| 18|     F|  76|     70|     65|
| 27| Aaron| 25|     F|  81|     99|     44|
| 28| Bella| 19|     F|  54|     76|     76|
| 32| Fiona| 22|     F|  48|     96|     48|
| 37|  Kyle| 21|     M|  57|     86|     92|
| 39|  Matt| 25|     M|  64|     71|    100|
| 41| Oscar| 20|     M|  87|     72|     81|
+---+------+---+------+----+-------+-------+
only showing top 10 rows
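
Why the size of a `sample()` result is only approximate can be seen in a small pure-Python model. This is a conceptual sketch, not Spark's implementation: the helper `bernoulli_sample` is hypothetical, and it assumes each row is kept independently with probability `fraction`, which matches how the Spark documentation describes `fraction` for sampling without replacement.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Stand-in for the 50 student ids; no Spark session is needed for this model.
ids = list(range(1, 51))

# The same fraction yields different sample sizes for different seeds.
sizes_by_seed = {seed: len(bernoulli_sample(ids, 0.3, seed)) for seed in range(5)}
print(sizes_by_seed)

# The same seed reproduces the same sample.
print(bernoulli_sample(ids, 0.3, 42) == bernoulli_sample(ids, 0.3, 42))
```

The sizes cluster around 15 (30% of 50) without being pinned to it, and no id can appear twice because each row is visited once.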


In [7]:
# 4. Random sample with replacement (fraction=0.2).
#    Here fraction is the expected number of times each row is drawn, so a row may repeat.
print("=== Sample (20% with replacement) ===")
df.sample(withReplacement=True, fraction=0.2, seed=42).show(10)

=== Sample (20% with replacement) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  6| Frank| 22|     F|  70|     78|     94|
|  7| Grace| 24|     F|  67|     66|     93|
| 14|Nathan| 23|     F|  71|     66|     60|
| 17|Quincy| 18|     M|  65|     79|     54|
| 21|   Uma| 19|     F|  89|     70|     76|
| 22|Victor| 22|     M|  96|     75|     56|
| 31| Ethan| 24|     M|  53|     57|     45|
| 32| Fiona| 22|     F|  48|     96|     48|
| 35|   Ian| 21|     F|  72|     75|     70|
| 38| Laura| 23|     M|  84|     73|     56|
+---+------+---+------+----+-------+-------+
only showing top 10 rows
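
With replacement, `fraction` is the expected number of times each row is chosen, so duplicates are possible even though none happen to appear in the ten rows shown. One way to model this is a per-row Poisson draw. The sketch below is a conceptual illustration under that assumption, not Spark's code; `poisson_draw` and `sample_with_replacement` are hypothetical helpers.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_with_replacement(rows, fraction, seed):
    """Emit each row a Poisson(fraction)-distributed number of times."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled


# A larger stand-in dataset makes the expected size (fraction * n) easier to see.
rows = list(range(10_000))
sampled = sample_with_replacement(rows, 0.2, seed=42)
duplicates = len(sampled) - len(set(sampled))

print("sampled rows:", len(sampled))
print("rows drawn more than once account for", duplicates, "extra entries")
```

Under this model a `fraction` above 1.0 is meaningful as well: each row would then be expected to appear more than once on average.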


In [8]:
# 5. Take a random sample of 5 rows using takeSample (without replacement)
print("=== takeSample: 5 rows (without replacement) ===")
sampled_rows = df.rdd.takeSample(False, 5, seed=42)
for row in sampled_rows:
    print(row)

=== takeSample: 5 rows (without replacement) ===
Row(id=35, name='Ian', age=21, gender='F', math=72, science=75, english=70)
Row(id=26, name='Zoey', age=18, gender='M', math=42, science=48, english=42)
Row(id=17, name='Quincy', age=18, gender='M', math=65, science=79, english=54)
Row(id=43, name='Quinn', age=18, gender='F', math=56, science=60, english=87)
Row(id=38, name='Laura', age=23, gender='M', math=84, science=73, english=56)


In [9]:
# 6. Take a random sample of 5 rows using takeSample (with replacement)
print("=== takeSample: 5 rows (with replacement) ===")
sampled_rows_wr = df.rdd.takeSample(True, 5, seed=42)
for row in sampled_rows_wr:
    print(row)

=== takeSample: 5 rows (with replacement) ===
Row(id=47, name='Umar', age=21, gender='F', math=75, science=80, english=59)
Row(id=17, name='Quincy', age=18, gender='M', math=65, science=79, english=54)
Row(id=10, name='Jack', age=19, gender='F', math=44, science=59, english=60)
Row(id=38, name='Laura', age=23, gender='M', math=84, science=73, english=56)
Row(id=23, name='Wendy', age=24, gender='M', math=57, science=83, english=81)


In [10]:
# 7. Count total rows (to compare with sampled data size)
print("Total rows in dataset:", df.count())

Total rows in dataset: 50


In [11]:
# Stop the Spark session when finished (uncomment to release its resources)
# spark.stop()

### Conclusion

This notebook demonstrated two ways to sample a dataset in PySpark: the `sample()` transformation and the `takeSample()` action. Applying both to a 50-row student dataset shows how they differ and when each is appropriate.

The key takeaways are:
* **`sample()` as a Transformation**: We used `df.sample()` to create a new, smaller **DataFrame** containing roughly the requested fraction of the original rows. The `fraction` argument is an expected proportion rather than an exact quota, so the number of rows returned is approximate. This method is **lazy**: nothing is computed until an action such as `show()` or `count()` runs, so it fits into a larger Spark pipeline where the sampled data undergoes further distributed processing. It is well suited to creating subsets for development, testing, or running models at a smaller scale.

* **`takeSample()` as an Action**: We used `df.rdd.takeSample()` to retrieve a fixed number of random records directly into a **local Python list** on the driver node. It is an RDD method, which is why the DataFrame is first accessed through `.rdd`. This method is an **action**, meaning it executes immediately. Because every sampled row is collected on the driver, it is best suited to inspecting a small, exact number of random rows or feeding a small sample into a local library for analysis.

* **Reproducibility**: Every example passed `seed=42`. Fixing the seed makes the sampling **reproducible** when the same data is read with the same partitioning, which is essential for consistent testing and debugging.

Understanding the difference between the `sample()` transformation (for creating distributed subsets of approximate size) and the `takeSample()` action (for pulling a fixed-size local sample) lets developers choose the appropriate tool for their data exploration and analysis needs.