# Implement and demonstrate dataset sampling in PySpark DataFrames using the sample() and takeSample() methods

**Dataset Description:** The dataset contains records of 50 students with the following attributes:

| Attribute | Description |
|-----------|-------------|
| id | Unique identifier for each student |
| name | Student’s name |
| age | Student’s age in years |
| gender | Gender (M for male, F for female) |
| math | Marks obtained in Mathematics (0–100) |
| science | Marks obtained in Science (0–100) |
| english | Marks obtained in English (0–100) |

**Dataset Overview**
- Total Records: 50 students
- Columns: 7 (id, name, age, gender, math, science, english)
- No missing values

**Demographics**
- Age: 18 – 25 years (average ≈ 21.5)
- Gender: 29 Female, 21 Male

**Academic Performance**
1. Math:
   - Range: 40 – 100
   - Mean: 68.9
   - Std. Dev.: 17.6 (high variation)
2. Science:
   - Range: 44 – 99
   - Mean: 70.2
   - Std. Dev.: 14.6 (moderate variation)
3. English:
   - Range: 42 – 100
   - Mean: 69.4
   - Std. Dev.: 18.7 (highest variation)

**Key Insights**
- Science is the strongest subject on average.
- English has the most variation in performance.
- Students perform differently across subjects (not uniform).

In [1]:
sc

In [3]:
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.appName("SamplingExample").getOrCreate()

In [4]:
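# Load the CSV: header=True takes column names from the first row, inferSchema=True infers column types
df = spark.read.csv("students.csv", header=True, inferSchema=True)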

In [5]:
print("=== First 5 rows of dataset ===")
df.show(5)

=== First 5 rows of dataset ===
+---+-------+---+------+----+-------+-------+
| id|   name|age|gender|math|science|english|
+---+-------+---+------+----+-------+-------+
|  1|  Alice| 20|     F|  66|     92|     44|
|  2|    Bob| 20|     M|  82|     52|     77|
|  3|Charlie| 22|     F|  43|     57|     76|
|  4|  David| 19|     M|  95|     69|     46|
|  5|    Eva| 19|     F|  62|     44|     96|
+---+-------+---+------+----+-------+-------+
only showing top 5 rows


In [6]:
print("=== Schema of dataset ===")
df.printSchema()

=== Schema of dataset ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- math: integer (nullable = true)
 |-- science: integer (nullable = true)
 |-- english: integer (nullable = true)



In [7]:
print("=== Sample (30% without replacement) ===")
# fraction is a per-row probability, so roughly (not exactly) 30% of rows are returned
df.sample(withReplacement=False, fraction=0.3, seed=42).show(10)

=== Sample (30% without replacement) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  4| David| 19|     M|  95|     69|     46|
|  8| Henry| 21|     F|  53|     82|     60|
| 17|Quincy| 18|     M|  65|     79|     54|
| 19|   Sam| 18|     F|  76|     70|     65|
| 27| Aaron| 25|     F|  81|     99|     44|
| 28| Bella| 19|     F|  54|     76|     76|
| 32| Fiona| 22|     F|  48|     96|     48|
| 37|  Kyle| 21|     M|  57|     86|     92|
| 39|  Matt| 25|     M|  64|     71|    100|
| 41| Oscar| 20|     M|  87|     72|     81|
+---+------+---+------+----+-------+-------+
only showing top 10 rows
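The fraction passed to `sample()` is not an exact size. Without replacement, each row is kept independently with probability `fraction`, so the number of rows returned varies around `fraction × total`. The pure-Python sketch below models that behaviour on a stand-in list of ids. It is not Spark's implementation, and `bernoulli_sample` is a hypothetical helper name used only for illustration.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Stand-in for the 50 student ids (1..50); not the real DataFrame.
student_ids = list(range(1, 51))

sample_a = bernoulli_sample(student_ids, fraction=0.3, seed=42)
sample_b = bernoulli_sample(student_ids, fraction=0.3, seed=42)

# The size is close to 15 on average, but it is not guaranteed to be 15.
print(len(sample_a), "rows selected")
# Same seed and same input order give the same sample.
print(sample_a == sample_b)
```

Because every row is tested once, no row can appear twice and the original order is preserved, which matches the ascending ids in the output above.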


In [8]:
print("=== Sample (20% with replacement) ===")
# With replacement, the same row can be selected more than once
df.sample(withReplacement=True, fraction=0.2, seed=42).show(10)

=== Sample (20% with replacement) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  6| Frank| 22|     F|  70|     78|     94|
|  7| Grace| 24|     F|  67|     66|     93|
| 14|Nathan| 23|     F|  71|     66|     60|
| 17|Quincy| 18|     M|  65|     79|     54|
| 21|   Uma| 19|     F|  89|     70|     76|
| 22|Victor| 22|     M|  96|     75|     56|
| 31| Ethan| 24|     M|  53|     57|     45|
| 32| Fiona| 22|     F|  48|     96|     48|
| 35|   Ian| 21|     F|  72|     75|     70|
| 38| Laura| 23|     M|  84|     73|     56|
+---+------+---+------+----+-------+-------+
only showing top 10 rows
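With replacement, `fraction` is better read as the expected number of times each row is selected. A common way to model this, assumed here rather than taken from the source, is to draw a Poisson-distributed copy count for every row. The sketch below is pure Python, and `poisson_draw` and `sample_with_replacement` are hypothetical helpers, not PySpark functions.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw from Poisson(mean) using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(rows, fraction, seed=None):
    """Repeat each row k times, where k is drawn from Poisson(fraction)."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled


# Stand-in for the 50 student ids (1..50); not the real DataFrame.
student_ids = list(range(1, 51))
sampled_ids = sample_with_replacement(student_ids, fraction=0.2, seed=42)

duplicates = len(sampled_ids) - len(set(sampled_ids))
print(len(sampled_ids), "rows drawn,", duplicates, "of them repeats")
```

At a fraction as low as 0.2, repeats are possible but uncommon under this model, which is consistent with the ten rows shown above containing no repeated id.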


In [9]:
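print("=== takeSample: 5 rows (without replacement) ===")
# takeSample() is an RDD action, so call it through df.rdd; it returns a list of exactly 5 Row objects
sampled_rows = df.rdd.takeSample(False, 5, seed=42)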
for row in sampled_rows:
    print(row)

=== takeSample: 5 rows (without replacement) ===
Row(id=35, name='Ian', age=21, gender='F', math=72, science=75, english=70)
Row(id=26, name='Zoey', age=18, gender='M', math=42, science=48, english=42)
Row(id=17, name='Quincy', age=18, gender='M', math=65, science=79, english=54)
Row(id=43, name='Quinn', age=18, gender='F', math=56, science=60, english=87)
Row(id=38, name='Laura', age=23, gender='M', math=84, science=73, english=56)


In [10]:
print("=== takeSample: 5 rows (with replacement) ===")
# Same call with withReplacement=True: a row may appear more than once in the result
sampled_rows_wr = df.rdd.takeSample(True, 5, seed=42)
for row in sampled_rows_wr:
    print(row)

=== takeSample: 5 rows (with replacement) ===
Row(id=47, name='Umar', age=21, gender='F', math=75, science=80, english=59)
Row(id=17, name='Quincy', age=18, gender='M', math=65, science=79, english=54)
Row(id=10, name='Jack', age=19, gender='F', math=44, science=59, english=60)
Row(id=38, name='Laura', age=23, gender='M', math=84, science=73, english=56)
Row(id=23, name='Wendy', age=24, gender='M', math=57, science=83, english=81)
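Unlike `sample()`, `takeSample()` returns a fixed number of rows as a plain Python list. The two modes can be modelled with the standard library's `random.sample` (no repeats) and `random.choices` (repeats allowed). This is a sketch of the idea on a stand-in list of ids, not a description of how Spark selects the rows.

```python
import random

# Stand-in for the 50 student ids (1..50); not the real DataFrame.
student_ids = list(range(1, 51))
rng = random.Random(42)

# Without replacement: exactly 5 distinct ids.
without_replacement = rng.sample(student_ids, k=5)

# With replacement: exactly 5 draws, and the same id may be drawn again.
with_replacement = rng.choices(student_ids, k=5)

print(without_replacement)
print(with_replacement)
```

In both modes the result size is exactly the number requested, which is the main practical difference from fraction-based sampling.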


In [11]:
# Count total rows (to compare with the sampled data size)
print("Total rows in dataset:", df.count())

Total rows in dataset: 50
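The total of 50 rows gives a baseline for how many rows a fraction-based sample should contain. Assuming each row is selected independently with probability `fraction` (a modelling assumption for the without-replacement case, not something stated in the source), the expected size and its spread can be worked out directly. `expected_sample_size` is a hypothetical helper.

```python
import math

total_rows = 50  # value reported by df.count()


def expected_sample_size(total, fraction):
    """Mean and standard deviation of the row count under independent per-row selection."""
    mean = total * fraction
    std_dev = math.sqrt(total * fraction * (1 - fraction))
    return mean, std_dev


mean_30, std_30 = expected_sample_size(total_rows, 0.3)
print(f"30% sample: about {mean_30:.0f} rows, give or take {std_30:.1f}")
# prints "30% sample: about 15 rows, give or take 3.2"
```

A 30% sample of this dataset that returns, say, 12 or 18 rows is therefore within the normal range and does not indicate an error.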


### **Conclusion of the Experiment (PySpark DataFrame Sampling)**

* The experiment demonstrated **dataset sampling on a PySpark DataFrame** using the DataFrame's `sample()` method and the `takeSample()` method of its underlying RDD (`df.rdd`).

* **Key operations performed:**

  1. **Viewed initial data:** displayed the first 5 rows and examined the schema.
  2. **Random sampling without replacement:** selected approximately 30% of the rows using `sample(withReplacement=False, fraction=0.3)`.
  3. **Random sampling with replacement:** selected approximately 20% of the rows using `sample(withReplacement=True, fraction=0.2)`.
  4. **Row-based sampling with `takeSample()` (without replacement):** selected exactly 5 distinct random rows via `df.rdd.takeSample(False, 5)`.
  5. **Row-based sampling with `takeSample()` (with replacement):** selected exactly 5 random rows, allowing duplicates, via `df.rdd.takeSample(True, 5)`.
  6. **Total dataset size:** confirmed total rows as 50 for reference.

* **Findings:**

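  * **`sample()` method** is a DataFrame transformation that performs fraction-based sampling, with or without replacement. The fraction acts as a per-row probability, so the size of the result is approximate rather than exact.
  * **`takeSample()` method** is an RDD action (called here as `df.rdd.takeSample`) that returns a **fixed number of rows** as a list of `Row` objects on the driver, again with or without replacement. It is best suited to small samples.
  * Sampling methods provide **flexible ways to create smaller subsets** of large datasets for testing, analysis, or debugging.
  * Using a **seed** makes the sampled data reproducible, provided the input data and its partitioning stay the same.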

* **Learning Outcome:**

  * Gained practical experience with **PySpark DataFrame sampling techniques**, an important skill for **data exploration, testing, and handling large datasets**.
  * Learned the difference between **fraction-based sampling (`sample`)** and **fixed-size row sampling (`takeSample`)**.