
## The `sample()` Method Deep Dive

### Method Signature

```python
DataFrame.sample(withReplacement=False, fraction=None, seed=None)
```

### Parameters Explained

#### `withReplacement` (Boolean)

- **False** (default): Each row appears at most once in the sample
- **True**: A row can appear more than once in the sample
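To make the with-replacement case concrete, here is a minimal plain-Python sketch. It rests on an assumption: that with-replacement sampling gives each row a Poisson-distributed number of copies whose mean is `fraction`. The helpers `poisson_draw` and `sample_with_replacement` are hypothetical names for illustration, not Spark code.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson(mean) count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(rows, fraction, seed):
    """Assumed model: repeat each row a Poisson(fraction) number of times."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled

rows = list(range(10_000))
with_repl = sample_with_replacement(rows, fraction=0.5, seed=42)

print(f"Rows drawn:    {len(with_repl)}")       # close to 10,000 * 0.5
print(f"Distinct rows: {len(set(with_repl))}")  # smaller, because of repeats
```

Under this model the drawn count exceeds the distinct count, which is the visible difference from sampling without replacement.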

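#### `fraction` (Float, usually between 0 and 1)

- **Key Point**: This is NOT an exact percentage of rows to return
- **Reality**: Without replacement, it is the probability that each row will be selected, so it must lie in [0.0, 1.0]
- **With replacement**: It is the expected number of times each row is selected, so values above 1 are allowed
- **Example**: `fraction=0.3` means each row has a 30% chance of selection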

#### `seed` (Integer, optional)

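- Controls randomness for reproducible results
- Same seed on the same input (same data and partitioning) = same sample every time
- Different seeds = different samples, usually with different sizes
- If omitted, a random seed is chosen on each call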

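### Alternative for Fixed Sample Size

`DataFrame.sample()` does not accept a `num` argument. For an exact row count, use `RDD.takeSample()`, which collects the sampled rows to the driver and returns them as a list:

```python
RDD.takeSample(withReplacement, num, seed=None)
```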

---

## Probabilistic vs Deterministic Sampling

### Probabilistic Sampling (fraction parameter)

**How it works:**

- Each row undergoes an independent random test
- If the random number drawn for the row is less than `fraction`, the row is included
- Result size varies from seed to seed
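The per-row test described above can be sketched in plain Python. This is a stand-in for the idea, not Spark's implementation; `bernoulli_sample` is a hypothetical helper, and Python's `random` module will not reproduce the rows Spark picks for the same seed.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(14))
sample_a = bernoulli_sample(rows, 0.3, seed=42)
sample_b = bernoulli_sample(rows, 0.3, seed=42)
sample_c = bernoulli_sample(rows, 0.3, seed=43)

print(f"seed=42: {len(sample_a)} rows -> {sample_a}")
print(f"seed=42 again gives the same rows: {sample_a == sample_b}")
print(f"seed=43: {len(sample_c)} rows -> {sample_c}")
```

No line of the function mentions a target row count, which is why the size of the result is an outcome of the draws rather than an input.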

**Example:**

```python
# 14 rows, fraction=0.3
# Expected: 14 × 0.3 = 4.2 rows
# Actual: Could be 0-14 rows (typically 2-6)

df = spark.createDataFrame([(i,) for i in range(14)], ["id"])
sample1 = df.sample(fraction=0.3, seed=42)
sample2 = df.sample(fraction=0.3, seed=43)

print(f"Sample 1 count: {sample1.count()}")  # Might be 3
print(f"Sample 2 count: {sample2.count()}")  # Might be 5
```

**Statistical Distribution:**

- Without replacement, the sample size follows a binomial distribution: B(n, p)
- n = total rows, p = fraction
- Mean = n × p
- Variance = n × p × (1-p)
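The mean and variance formulas can be checked empirically. The sketch below simulates the per-row test many times in plain Python (no Spark involved; `sample_size` is a hypothetical helper) and compares the observed statistics with n × p and n × p × (1-p).

```python
import random
import statistics

def sample_size(n, p, rng):
    """Number of rows kept when each of n rows passes with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

n, p = 14, 0.3
rng = random.Random(0)
sizes = [sample_size(n, p, rng) for _ in range(20_000)]

empirical_mean = statistics.mean(sizes)
empirical_var = statistics.pvariance(sizes)

print(f"Mean:     theory {n * p:.2f}, simulated {empirical_mean:.2f}")
print(f"Variance: theory {n * p * (1 - p):.2f}, simulated {empirical_var:.2f}")
print(f"Smallest and largest sample seen: {min(sizes)}, {max(sizes)}")
```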

### Deterministic Sampling

**How it works:**

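- Returns exactly the requested number of rows (when the DataFrame has at least that many)
- Selection is still random; only the count is fixed
- More predictable for downstream processing
- Not available through `DataFrame.sample()`; use `RDD.takeSample()` or `orderBy(rand()).limit(n)`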

**Example:**

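```python
# DataFrame.sample() has no `num` parameter; RDD.takeSample() returns
# exactly `num` rows as a list collected to the driver
df = spark.createDataFrame([(i,) for i in range(14)], ["id"])
rows = df.rdd.takeSample(withReplacement=False, num=4, seed=42)
sample = spark.createDataFrame(rows, df.schema)
print(f"Sample count: {sample.count()}")  # Always 4
```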

---

## Common Misconceptions

### Misconception: "fraction=0.3 means exactly 30% of rows"

**Reality:** Each row has a 30% probability of selection, so the sample holds about 30% of rows, not exactly 30%

### Misconception: "Results should be exactly predictable"

**Reality:** Probabilistic sampling introduces natural variation

### Misconception: "Sampling is always uniform across partitions"

**Reality:** Rows are tested independently, so the number of rows each partition contributes varies around its expected share
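A small simulation illustrates the per-partition variation. It assumes each partition is sampled with its own seeded generator; deriving that seed as `base_seed + index` is a choice made for this sketch, and `sample_partition` is a hypothetical helper rather than a Spark API.

```python
import random

def sample_partition(partition, fraction, seed):
    """Apply the per-row test to one partition with its own generator."""
    rng = random.Random(seed)
    return [row for row in partition if rng.random() < fraction]

rows = list(range(1000))
# Four equally sized "partitions" of 250 rows each
partitions = [rows[i:i + 250] for i in range(0, len(rows), 250)]

base_seed = 42  # illustrative value
per_partition = [
    sample_partition(part, 0.3, base_seed + index)
    for index, part in enumerate(partitions)
]
counts = [len(sampled) for sampled in per_partition]

print(f"Expected per partition: {250 * 0.3:.0f}")
print(f"Actual per partition:   {counts}")
print(f"Total sampled:          {sum(counts)}")
```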

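### Misconception: "Larger datasets give exact percentages"

**Reality:** Larger datasets converge closer to the expected fraction (Law of Large Numbers), but the count is still random. The relative deviation shrinks roughly as 1/√n, while the absolute deviation in row count grows roughly as √n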
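The effect of dataset size on the realized fraction can be seen in a plain-Python simulation. For a binomial sample size, the standard deviation of the realized fraction is √(p(1-p)/n); the sketch prints that value next to one simulated draw per dataset size. `realized_fraction` is a hypothetical helper, and no Spark is involved.

```python
import math
import random

def realized_fraction(n, p, rng):
    """Fraction of n rows kept when each passes with probability p."""
    kept = sum(1 for _ in range(n) if rng.random() < p)
    return kept / n

p = 0.3
rng = random.Random(7)
results = {}
for n in (100, 10_000, 1_000_000):
    observed = realized_fraction(n, p, rng)
    std_dev = math.sqrt(p * (1 - p) / n)  # theoretical spread of the fraction
    results[n] = (observed, std_dev)
    print(f"n={n:>9,}: observed fraction {observed:.4f}, "
          f"theoretical std dev {std_dev:.5f}")
```

The theoretical spread falls by a factor of 10 for every 100-fold increase in rows, so the realized fraction tightens around 0.3 without ever being guaranteed to equal it.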

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SamplingDemo").getOrCreate()

# Create sample dataset
data = [(i, f"name_{i}", i % 3) for i in range(1000)]
df = spark.createDataFrame(data, ["id", "name", "category"])

# Probabilistic sampling
sample_30_percent = df.sample(fraction=0.3, seed=42)
print(f"Expected ~300, got: {sample_30_percent.count()}")
```

```text
Expected ~300, got: 307
```


```python
df = spark.createDataFrame([(i,) for i in range(14)], ["id"])
sample1 = df.sample(fraction=0.3, seed=42)
sample2 = df.sample(fraction=0.3, seed=43)

print(f"Sample 1 count: {sample1.count()}")  # Might be 3
print(f"Sample 2 count: {sample2.count()}")  # Might be 5
```

```text
Sample 1 count: 3
Sample 2 count: 4
```


```python
rdd_sample = df.rdd.takeSample(withReplacement=False, num=7, seed=42)
sample_df = spark.createDataFrame(rdd_sample, df.schema)
print(f"sample_df  count: {sample_df.count()}")
```

```text
sample_df  count: 7
```


```python
from pyspark.sql.functions import rand
# Shuffle and take top N
sample_df = df.orderBy(rand(seed=42)).limit(4)
sample_df.show()
```

```text
+---+
| id|
+---+
| 10|
|  9|
|  3|
|  1|
+---+
```



```python
rand(seed=66)
```

```text
Column<'rand(66)'>
```
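The shuffle-and-take-N pattern behind `orderBy(rand(seed=...)).limit(n)` can be sketched in plain Python: attach a random key to every row, sort by that key, and keep the first N rows. `shuffle_and_take` is a hypothetical helper; Python's generator will not reproduce the rows Spark returns for the same seed.

```python
import random

def shuffle_and_take(rows, n, seed):
    """Give every row a random sort key, order by it, keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:n]]

rows = list(range(14))
picked = shuffle_and_take(rows, 4, seed=42)

print(f"Picked {len(picked)} rows: {picked}")
```

The count is always exact (or the full dataset when it has fewer than N rows), but every row must be keyed and sorted first, so this approach costs more than a single sampling pass on large data.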