# Sampling and Sampling Distributions 🌾📊

## Introduction: From Fields to Decisions

Imagine you're an agricultural consultant managing 10,000 wheat fields across a region. You need to estimate the average yield to help farmers plan their marketing strategy. **You cannot physically measure all 10,000 fields** - it would be too expensive and time-consuming!

**The Solution**: Take a **sample** of fields (say, 50), measure them carefully, and use that information to make conclusions about all 10,000 fields.

This is the essence of **statistical inference**: using sample data to draw conclusions about populations.

### Why This Matters for Machine Learning 🎯

- **Cross-validation** repeatedly resamples your dataset into training and validation subsets
- **Train/test splits** create samples for model evaluation
- Understanding sampling variability helps you interpret model performance differences
- Sample size determines how reliable your model evaluation is: small test sets give noisy scores

**Key Question**: If we take different samples, we'll get different estimates. How do we quantify this uncertainty?

---

## Learning Objectives 🎯

By the end of this notebook, you will:

1. ✅ Understand the distinction between **population** and **sample**
2. ✅ Learn different **sampling methods** (random, stratified, systematic)
3. ✅ Grasp the concept of **sampling distributions** ⭐⭐
4. ✅ Calculate and interpret **standard error** (SE = σ/√n)
5. ✅ Understand **sampling variability** and its implications
6. ✅ Connect sampling concepts to **cross-validation in ML** ⭐

⭐⭐ = Most critical concept

---

Let's begin! 🚀

In [None]:
# 📦 Setup: Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Set style for beautiful plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Setup complete!")
print("📊 Ready to explore sampling and distributions")

---

## 1. Population vs Sample 🌍

### Theory: The Foundation of Inference

**Population**: The complete set of all individuals or observations we're interested in
- Has true parameters: μ (population mean), σ² (population variance)
- Usually **unknown** and impossible to measure completely

**Sample**: A subset of the population that we actually observe
- Has sample statistics: x̄ (sample mean), s² (sample variance)
- We use these to **estimate** population parameters

### Mathematical Notation:

$$
\begin{align}
\text{Population Mean: } & \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \\
\text{Sample Mean: } & \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\text{Population Variance: } & \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \\
\text{Sample Variance: } & s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
\end{align}
$$

**Note**: Sample variance uses (n-1) for unbiased estimation (Bessel's correction)
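
A quick way to see why the divisor is n-1: the sketch below draws many small samples from a normal population and averages the two competing variance estimates. The population parameters (mean 5.2, standard deviation 0.8) mirror the wheat example but are assumptions here, and the names `rng`, `var_biased`, and `var_unbiased` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
true_var = 0.8 ** 2              # variance of the assumed population
n = 5                            # deliberately small so the bias is visible

# 100,000 independent samples of size n, one per row
samples = rng.normal(loc=5.2, scale=0.8, size=(100_000, n))

var_biased = samples.var(axis=1, ddof=0).mean()    # divides by n
var_unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n-1

print(f"True variance:            {true_var:.3f}")
print(f"Average with divisor n:   {var_biased:.3f}")
print(f"Average with divisor n-1: {var_unbiased:.3f}")
```

Dividing by n underestimates σ² by a factor of (n-1)/n on average. Note that NumPy's `var` and `std` default to `ddof=0`, while pandas defaults to `ddof=1`, so it is worth setting `ddof` explicitly when working with sample data.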

---

In [None]:
# 🌾 Create a population of 10,000 wheat fields
# True population: wheat yield ~ Normal(μ=5.2 tons/hectare, σ=0.8)

population_size = 10000
population_mean = 5.2  # tons/hectare (true parameter μ)
population_std = 0.8   # tons/hectare (true parameter σ)

# Generate the complete population
population_yields = np.random.normal(population_mean, population_std, population_size)

print("🌍 POPULATION (All 10,000 fields)")
print("=" * 50)
print(f"True Population Mean (μ): {population_mean} tons/hectare")
print(f"True Population Std Dev (σ): {population_std} tons/hectare")
print(f"Actual Mean from simulation: {population_yields.mean():.3f} tons/hectare")
print(f"Actual Std Dev from simulation: {population_yields.std(ddof=0):.3f} tons/hectare")
print(f"\n💡 In reality, we would NEVER know these true population parameters!")

In [None]:
# 📊 Visualization 1: Population with sample overlays

# Take 3 different samples of size n=50
sample_size = 50
sample1 = np.random.choice(population_yields, size=sample_size, replace=False)
sample2 = np.random.choice(population_yields, size=sample_size, replace=False)
sample3 = np.random.choice(population_yields, size=sample_size, replace=False)

# Create visualization
plt.figure(figsize=(12, 7))

# Plot population distribution
plt.hist(population_yields, bins=50, alpha=0.3, color='gray', 
         label=f'Population (N={population_size})', density=True, edgecolor='black')

# Plot three sample distributions
plt.hist(sample1, bins=15, alpha=0.5, color='red',
         label=f'Sample 1 (n={sample_size}, x̄={sample1.mean():.2f})', density=True)
plt.hist(sample2, bins=15, alpha=0.5, color='blue',
         label=f'Sample 2 (n={sample_size}, x̄={sample2.mean():.2f})', density=True)
plt.hist(sample3, bins=15, alpha=0.5, color='green',
         label=f'Sample 3 (n={sample_size}, x̄={sample3.mean():.2f})', density=True)

# Mark the true population mean
plt.axvline(population_mean, color='black', linestyle='--', linewidth=2, 
            label=f'True μ = {population_mean}')

plt.xlabel('Wheat Yield (tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Population vs Samples: Different samples give different estimates! 🌾',
          fontsize=14, fontweight='bold')
plt.legend(loc='upper left', fontsize=10)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Observation:")
print("   - Each sample gives a slightly different mean (x̄)")
print("   - All samples cluster around the true population mean (μ)")
print("   - This variability is called SAMPLING VARIABILITY")

---

## 2. Sampling Methods 🎲

Not all samples are created equal! The **method** you use to select your sample affects the quality of your inference.

### Common Sampling Methods:

1. **Simple Random Sampling (SRS)** 🎯
   - Every element has equal probability of selection
   - Like drawing names from a hat
   - ✅ Works well for homogeneous populations
   - ⚠️ May under-represent small subgroups by chance

2. **Stratified Sampling** 📊
   - Divide population into strata (groups), then sample from each
   - Example: Sample from each soil type separately
   - ✅ Ensures representation of all subgroups
   - ✅ Often more efficient than SRS

3. **Systematic Sampling** 📏
   - Select every kth element (e.g., every 10th field in a row)
   - ✅ Simple to implement
   - ⚠️ Beware of periodic patterns in data

4. **Cluster Sampling** 🗺️
   - Divide population into clusters, randomly select entire clusters
   - Example: Randomly select entire farms, measure all fields in selected farms
   - ✅ Cost-effective for geographically dispersed populations
   - ⚠️ Usually less precise than SRS for the same number of fields, because fields within a cluster tend to be similar

### ML Connection:
- **Train/test split** = Simple random sampling (when rows are shuffled)
- **Stratified K-fold CV** = Stratified sampling (preserves class proportions in every fold)
- **Time series CV** = Ordered rather than random splits: train on the past, test on the future
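
The claim that stratified sampling is often more efficient than SRS can be checked by repeating both designs many times and comparing how widely the estimates spread. This is a minimal sketch on a synthetic three-stratum population; the stratum sizes, means, and standard deviations are assumptions chosen for illustration, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical strata as (size, mean yield, std dev); all values are assumed
strata_spec = [(4000, 5.5, 0.6), (5000, 5.2, 0.7), (1000, 4.5, 1.0)]
strata = [rng.normal(mu, sd, size) for size, mu, sd in strata_spec]
population = np.concatenate(strata)

n = 200        # total sample size for both designs
n_reps = 4000  # number of repeated samples per design

# Proportional allocation: each stratum gets its population share of n
alloc = [round(n * len(s) / len(population)) for s in strata]

srs_means = np.empty(n_reps)
strat_means = np.empty(n_reps)
for r in range(n_reps):
    # Simple random sample from the whole population
    srs_means[r] = rng.choice(population, size=n, replace=False).mean()
    # Stratified sample: with proportional allocation, the pooled mean
    # equals the usual weighted stratified estimate
    parts = [rng.choice(s, size=k, replace=False) for s, k in zip(strata, alloc)]
    strat_means[r] = np.concatenate(parts).mean()

se_srs = srs_means.std(ddof=1)
se_strat = strat_means.std(ddof=1)
print(f"Spread of SRS estimates:        {se_srs:.4f}")
print(f"Spread of stratified estimates: {se_strat:.4f}")
```

The gain comes from removing between-stratum differences from the sampling error, so it grows as the stratum means move further apart and vanishes when all strata look alike.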

---

In [None]:
# 🌾 Create a more complex population with different soil types (strata)
# 3 soil types: Clay (40%), Loam (50%), Sand (10%)

# Generate stratified population
n_clay = 4000
n_loam = 5000
n_sand = 1000

# Different yield distributions for each soil type
clay_yields = np.random.normal(5.5, 0.6, n_clay)  # Higher mean, lower variability
loam_yields = np.random.normal(5.2, 0.7, n_loam)  # Medium mean
sand_yields = np.random.normal(4.5, 1.0, n_sand)  # Lower mean, higher variability

# Combine into a DataFrame
population_df = pd.DataFrame({
    'yield': np.concatenate([clay_yields, loam_yields, sand_yields]),
    'soil_type': ['Clay']*n_clay + ['Loam']*n_loam + ['Sand']*n_sand,
    'field_id': range(10000)
})

print("🌍 Population Composition:")
print("=" * 50)
print(population_df.groupby('soil_type')['yield'].agg(['count', 'mean', 'std']))
print(f"\nOverall Population Mean: {population_df['yield'].mean():.3f} tons/hectare")

In [None]:
# 🎲 Implement different sampling methods

sample_size = 200

# 1. Simple Random Sampling
srs_sample = population_df.sample(n=sample_size, random_state=42)

# 2. Stratified Sampling (proportional allocation: same sampling fraction in every stratum)
# GroupBy.sample keeps the 'soil_type' column, which the plots below rely on
stratified_sample = population_df.groupby('soil_type', group_keys=False).sample(
    frac=sample_size / len(population_df), random_state=42
)

# 3. Systematic Sampling (every 50th field)
k = len(population_df) // sample_size
start = np.random.randint(0, k)
systematic_indices = range(start, len(population_df), k)
systematic_sample = population_df.iloc[list(systematic_indices)[:sample_size]]

# 4. Cluster Sampling (select 20 random "farms" of 10 fields each)
population_df['farm_id'] = population_df['field_id'] // 10  # Create farm clusters
selected_farms = np.random.choice(population_df['farm_id'].unique(), size=20, replace=False)
cluster_sample = population_df[population_df['farm_id'].isin(selected_farms)]

# Compare estimates
true_mean = population_df['yield'].mean()

print("\n📊 Sampling Method Comparison:")
print("=" * 60)
print(f"True Population Mean: {true_mean:.3f} tons/hectare")
print(f"Simple Random Sampling:  {srs_sample['yield'].mean():.3f} (error: {abs(srs_sample['yield'].mean() - true_mean):.3f})")
print(f"Stratified Sampling:     {stratified_sample['yield'].mean():.3f} (error: {abs(stratified_sample['yield'].mean() - true_mean):.3f})")
print(f"Systematic Sampling:     {systematic_sample['yield'].mean():.3f} (error: {abs(systematic_sample['yield'].mean() - true_mean):.3f})")
print(f"Cluster Sampling:        {cluster_sample['yield'].mean():.3f} (error: {abs(cluster_sample['yield'].mean() - true_mean):.3f})")
print("\n💡 Stratified sampling tends to be the most precise on average, but one draw per method cannot prove it!")

In [None]:
# 📊 Visualization 2: Comparing sampling methods

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Comparison of Sampling Methods 🎲', fontsize=16, fontweight='bold', y=1.00)

# Helper function to plot samples
def plot_sample_by_soil(ax, sample_df, title, true_mean):
    soil_colors = {'Clay': '#8B4513', 'Loam': '#D2691E', 'Sand': '#F4A460'}
    
    for soil in ['Clay', 'Loam', 'Sand']:
        soil_data = sample_df[sample_df['soil_type'] == soil]['yield']
        ax.hist(soil_data, bins=15, alpha=0.6, label=f'{soil} (n={len(soil_data)})',
                color=soil_colors[soil], edgecolor='black')
    
    sample_mean = sample_df['yield'].mean()
    ax.axvline(true_mean, color='black', linestyle='--', linewidth=2, label=f'True μ={true_mean:.2f}')
    ax.axvline(sample_mean, color='red', linestyle='-', linewidth=2, label=f'Sample x̄={sample_mean:.2f}')
    
    ax.set_xlabel('Yield (tons/hectare)', fontsize=10)
    ax.set_ylabel('Frequency', fontsize=10)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

# Plot each sampling method
plot_sample_by_soil(axes[0, 0], srs_sample, '1. Simple Random Sampling', true_mean)
plot_sample_by_soil(axes[0, 1], stratified_sample, '2. Stratified Sampling ✅', true_mean)
plot_sample_by_soil(axes[1, 0], systematic_sample, '3. Systematic Sampling', true_mean)
plot_sample_by_soil(axes[1, 1], cluster_sample, '4. Cluster Sampling', true_mean)

plt.tight_layout()
plt.show()

print("\n💡 Notice:")
print("   - Stratified sampling best represents all soil types")
print("   - Cluster sampling has more variability (fewer unique locations)")
print("   - This affects the accuracy of our population estimate!")

---

## 3. Sampling Distribution ⭐⭐

### The Most Important Concept in Statistical Inference!

**Sampling Distribution**: The distribution of a sample statistic (like x̄) across many possible samples

**Key Idea**:
- If we take ONE sample → we get ONE estimate (x̄)
- If we take MANY samples → we get MANY estimates (x̄₁, x̄₂, x̄₃, ...)
- The distribution of these estimates is the **sampling distribution**

### Standard Error (SE):

The standard deviation of the sampling distribution is called the **standard error**:

$$
SE = \frac{\sigma}{\sqrt{n}}
$$

Where:
- σ = population standard deviation
- n = sample size

**Key Insight**: Standard error shrinks in proportion to 1/√n, not 1/n!
- To cut SE in half, you need 4× the sample size
- To cut SE to 1/10 of its value, you need 100× the sample size
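
To make the square-root rule concrete, the snippet below evaluates SE = σ/√n for a few sample sizes. It assumes σ = 0.8 (the value used for the simulated wheat population), and `standard_error` is a hypothetical helper written only for this example.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean, SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 0.8    # assumed population standard deviation
base_n = 50    # reference sample size
base_se = standard_error(sigma, base_n)

# Compare SE at 1x, 2x, 4x, and 100x the reference sample size
for factor in (1, 2, 4, 100):
    n = base_n * factor
    se = standard_error(sigma, n)
    print(f"n = {n:>5}: SE = {se:.4f}  ({se / base_se:.2f} of the n = {base_n} value)")
```

Doubling n only reduces SE by about 29%; quadrupling n is what halves it.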

---

In [None]:
# 🎲 Simulate the sampling distribution
# Take 1000 different samples, calculate mean for each

n_simulations = 1000
sample_size = 50
sample_means = []

# Simulate taking many samples
for i in range(n_simulations):
    sample = np.random.choice(population_yields, size=sample_size, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

# Calculate theoretical vs empirical standard error
theoretical_se = population_std / np.sqrt(sample_size)
empirical_se = sample_means.std()

print("🎯 Sampling Distribution of Sample Means:")
print("=" * 60)
print(f"True Population Mean (μ): {population_mean:.3f} tons/hectare")
print(f"Mean of sample means: {sample_means.mean():.3f} tons/hectare")
print(f"\nTheoretical SE = σ/√n = {population_std:.3f}/√{sample_size} = {theoretical_se:.3f}")
print(f"Empirical SE (from simulation): {empirical_se:.3f}")
print(f"\n💡 The sample means cluster around the true μ with spread = SE!")

In [None]:
# 📊 Visualization 3: The sampling distribution

plt.figure(figsize=(12, 7))

# Plot the sampling distribution
plt.hist(sample_means, bins=40, alpha=0.7, color='steelblue', 
         edgecolor='black', density=True, label=f'{n_simulations} sample means')

# Overlay theoretical normal distribution
x = np.linspace(sample_means.min(), sample_means.max(), 100)
plt.plot(x, stats.norm.pdf(x, population_mean, theoretical_se), 
         'r-', linewidth=2, label=f'Theoretical: N(μ={population_mean}, SE={theoretical_se:.3f})')

# Mark the true population mean
plt.axvline(population_mean, color='black', linestyle='--', linewidth=2, 
            label=f'True μ = {population_mean}')

# Mark ±1 SE and ±2 SE
plt.axvline(population_mean - theoretical_se, color='green', linestyle=':', linewidth=1.5, alpha=0.7)
plt.axvline(population_mean + theoretical_se, color='green', linestyle=':', linewidth=1.5, alpha=0.7,
            label='±1 SE (~68% of sample means)')
plt.axvline(population_mean - 2*theoretical_se, color='orange', linestyle=':', linewidth=1.5, alpha=0.7)
plt.axvline(population_mean + 2*theoretical_se, color='orange', linestyle=':', linewidth=1.5, alpha=0.7,
            label='±2 SE (~95% of sample means)')

plt.xlabel('Sample Mean x̄ (tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'Sampling Distribution: Distribution of {n_simulations} Sample Means (n={sample_size}) 📊',
          fontsize=14, fontweight='bold')
plt.legend(loc='upper left', fontsize=10)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   - Sample means follow an approximately NORMAL distribution (we'll see why in the next notebook!)")
print("   - Distribution is centered at the true μ")
print("   - Spread is determined by SE = σ/√n")
print("   - About 95% of sample means fall within ±2 SE of true μ")

In [None]:
# 📊 Effect of sample size on sampling distribution
# Compare n = 5, 10, 25, 50, 100, 200

sample_sizes = [5, 10, 25, 50, 100, 200]
n_sims = 1000

sampling_distributions = {}

for n in sample_sizes:
    means = []
    for _ in range(n_sims):
        sample = np.random.choice(population_yields, size=n, replace=False)
        means.append(sample.mean())
    sampling_distributions[n] = np.array(means)

# Calculate theoretical SEs
theoretical_ses = {n: population_std / np.sqrt(n) for n in sample_sizes}

print("📏 Effect of Sample Size on Standard Error:")
print("=" * 60)
print(f"{'Sample Size (n)':<15} {'Theoretical SE':<20} {'Empirical SE':<20}")
print("-" * 60)
for n in sample_sizes:
    theoretical = theoretical_ses[n]
    empirical = sampling_distributions[n].std()
    print(f"{n:<15} {theoretical:<20.4f} {empirical:<20.4f}")

print("\n💡 Notice: SE shrinks like 1/√n, so doubling n doesn't halve SE (it only cuts it by about 29%)!")

In [None]:
# 📊 Visualization 4: Effect of sample size (6-panel comparison)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
fig.suptitle('Effect of Sample Size on Sampling Distribution 📏', fontsize=16, fontweight='bold')

for idx, n in enumerate(sample_sizes):
    ax = axes[idx]
    
    # Plot sampling distribution
    ax.hist(sampling_distributions[n], bins=30, alpha=0.7, color='steelblue', 
            edgecolor='black', density=True)
    
    # Overlay theoretical normal
    x = np.linspace(sampling_distributions[n].min(), 
                    sampling_distributions[n].max(), 100)
    ax.plot(x, stats.norm.pdf(x, population_mean, theoretical_ses[n]), 
            'r-', linewidth=2)
    
    # Mark true mean
    ax.axvline(population_mean, color='black', linestyle='--', linewidth=1.5)
    
    # Add text box with SE
    textstr = f'n = {n}\nSE = {theoretical_ses[n]:.3f}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
    ax.text(0.70, 0.95, textstr, transform=ax.transAxes, fontsize=10,
            verticalalignment='top', bbox=props)
    
    ax.set_xlabel('Sample Mean', fontsize=10)
    ax.set_ylabel('Density', fontsize=10)
    ax.set_title(f'Sample Size n = {n}', fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Observation:")
print("   - Larger n → Narrower sampling distribution → More precise estimates!")
print("   - Distribution gets tighter around true μ as n increases")
print("   - This is why larger samples give more reliable results!")

---

## 4. Sampling Variability and Standard Error 📊

**Sampling Variability**: The fact that different samples give different estimates

**Standard Error (SE)**: Quantifies the typical amount of sampling variability

### Why This Matters:

When you report x̄ = 5.15 tons/hectare, you should also report the **uncertainty**:
- "x̄ = 5.15 ± 0.11" (mean ± SE)
- This acknowledges that a different sample would give a different estimate

### Practical Interpretation:

- **Small SE**: Sample estimate is likely close to the population parameter (precise)
- **Large SE**: High uncertainty in the estimate (a larger sample would help)
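
In practice σ is unknown, so the SE is usually estimated from the sample itself by plugging in the sample standard deviation: SE ≈ s/√n. The sketch below does this for one simulated sample; the population parameters and the seed are assumptions made only so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for 50 measured fields (assumed: normal with mean 5.2, sd 0.8)
sample = rng.normal(loc=5.2, scale=0.8, size=50)

x_bar = sample.mean()
s = sample.std(ddof=1)               # sample std dev (Bessel's correction)
se_hat = s / np.sqrt(len(sample))    # estimated standard error of the mean

print(f"Estimated mean yield: {x_bar:.2f} ± {se_hat:.2f} tons/hectare (mean ± SE)")
```

`scipy.stats.sem` computes the same quantity and also defaults to `ddof=1`.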

---

In [None]:
# 🎲 Demonstrate sampling variability
# Take 20 samples, show the range of estimates

n_samples = 20
sample_size = 50
estimates = []

for i in range(n_samples):
    sample = np.random.choice(population_yields, size=sample_size, replace=False)
    estimates.append(sample.mean())

estimates = np.array(estimates)

# Calculate SE
se = population_std / np.sqrt(sample_size)

print("🎯 Sampling Variability Demonstration:")
print("=" * 60)
print(f"True Population Mean (μ): {population_mean:.3f} tons/hectare")
print(f"\n{n_samples} Sample Estimates:")
print("-" * 60)
for i, est in enumerate(estimates, 1):
    print(f"Sample {i:2d}: x̄ = {est:.3f} tons/hectare (error: {abs(est - population_mean):.3f})")

print(f"\nRange of estimates: [{estimates.min():.3f}, {estimates.max():.3f}]")
print(f"Standard deviation of estimates: {estimates.std():.3f}")
print(f"Theoretical SE: {se:.3f}")
print(f"\n💡 Different samples → different estimates! SE quantifies this variability.")

In [None]:
# 📊 Visualization 5: Range of estimates with ±2SE bands

plt.figure(figsize=(12, 6))

# Plot each estimate as a point
plt.scatter(range(1, n_samples+1), estimates, s=100, alpha=0.7, 
            color='steelblue', edgecolors='black', linewidths=1.5, zorder=3)

# Draw lines from each point to the true mean
for i, est in enumerate(estimates, 1):
    plt.plot([i, i], [population_mean, est], 'gray', alpha=0.3, linewidth=1)

# Mark the true population mean
plt.axhline(population_mean, color='black', linestyle='--', linewidth=2, 
            label=f'True μ = {population_mean:.2f}', zorder=2)

# Draw ±2 SE bands (95% of estimates should fall here)
plt.axhline(population_mean + 2*se, color='red', linestyle=':', linewidth=1.5,
            alpha=0.7, label='μ ± 2SE', zorder=1)
plt.axhline(population_mean - 2*se, color='red', linestyle=':', linewidth=1.5, 
            alpha=0.7, zorder=1)
plt.fill_between(range(1, n_samples+1), population_mean - 2*se, 
                 population_mean + 2*se, alpha=0.1, color='red')

# Count how many fall within ±2SE
within_2se = np.sum((estimates >= population_mean - 2*se) & 
                    (estimates <= population_mean + 2*se))

plt.xlabel('Sample Number', fontsize=12)
plt.ylabel('Sample Mean (tons/hectare)', fontsize=12)
plt.title(f'Sampling Variability: {n_samples} Different Samples (n={sample_size}) 🎲',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='y')
plt.xlim(0, n_samples+1)

# Add annotation
plt.text(n_samples*0.5, population_mean + 2.5*se, 
         f'{within_2se}/{n_samples} estimates within ±2SE (expected: ~{int(0.95*n_samples)})',
         fontsize=11, ha='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   - Each dot is the mean from one sample")
print("   - Most estimates cluster around the true μ")
print(f"   - About 95% fall within ±2SE = ±{2*se:.3f} of true μ")
print("   - This is the foundation of confidence intervals!")

---

## 5. Machine Learning Connection ⭐

### Cross-Validation is Repeated Sampling!

When you perform **k-fold cross-validation**:
1. You split your data into k subsets
2. Train on k-1 subsets, test on 1 subset
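3. Repeat k times → get k different accuracy scores

**This is a form of repeated sampling!** Each train/test split is a different "sample" from your data. The folds share most of their training rows, so the k scores are related rather than fully independent.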

### Key Insights:

1. **Different splits → different scores** (sampling variability!)
2. **Report mean ± SE** of CV scores (not just mean)
3. **Larger training and test sets** → less variable performance estimates (SE shrinks roughly like 1/√n)
4. **Understand variability** → know when performance differences are meaningful

### Why This Matters:

- Model A: 85% ± 2% accuracy
- Model B: 84% ± 5% accuracy

Is Model A better? The 1-point gap is smaller than either model's SE, and Model B's larger SE means **higher uncertainty**. The difference might not be real!
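
A minimal sketch of reporting k-fold results together with their spread, using scikit-learn's `cross_val_score` on a synthetic dataset from `make_classification`. The dataset and its settings are placeholders, not the agricultural data used elsewhere in this notebook. Because the k training sets overlap, the fold scores are not independent, so the spread is a rough guide rather than an exact standard error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data (assumed settings)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
# cv=5 with a classifier uses stratified 5-fold splitting
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

mean_score = scores.mean()
sd_score = scores.std(ddof=1)               # spread across the 5 folds
se_mean = sd_score / np.sqrt(len(scores))   # naive SE of the mean (ignores fold overlap)

print("Fold accuracies:", np.round(scores, 3))
print(f"CV accuracy: {mean_score:.3f} ± {sd_score:.3f} (mean ± std across folds)")
print(f"Naive SE of the mean: {se_mean:.3f}")
```

Whichever spread you report, say which one it is: the standard deviation across folds and the standard error of the mean differ by a factor of √k.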

---

In [None]:
# 🤖 ML Example: Train/test split variability
# Create a classification dataset: Predict if yield > 5.0 tons/hectare

# Generate features and labels
np.random.seed(42)
X = np.column_stack([
    np.random.normal(7.0, 1.5, 1000),  # soil_nitrogen
    np.random.normal(6.5, 0.8, 1000),  # soil_pH
    np.random.normal(150, 30, 1000),   # rainfall_mm
])

# Create target: high yield if conditions are good
yield_score = (0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.005 * X[:, 2] + 
               np.random.normal(0, 0.5, 1000))
y = (yield_score > 5.0).astype(int)

print("🌾 Agricultural Classification Problem:")
print("=" * 60)
print("Features: soil_nitrogen, soil_pH, rainfall_mm")
print(f"Target: high_yield (1 if yield > 5.0 tons/hectare, else 0)")
print(f"Dataset size: {len(X)} observations")
print(f"Class distribution: {np.sum(y)} high yield, {len(y) - np.sum(y)} low yield")
print(f"\n💡 We'll train a logistic regression model with different train/test splits")

In [None]:
# 🎲 Demonstrate train/test split variability
# Train the same model 30 times with different random splits

n_splits = 30
test_size = 0.2
accuracy_scores = []

for seed in range(n_splits):
    # Different random split each time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    
    # Train model
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

accuracy_scores = np.array(accuracy_scores)

# Calculate statistics
mean_accuracy = accuracy_scores.mean()
se_accuracy = accuracy_scores.std()  # Std dev across splits: empirical SE of a single-split accuracy

print("🤖 Model Performance Variability:")
print("=" * 60)
print(f"Number of different train/test splits: {n_splits}")
print(f"Training set size: {len(X_train)} observations")
print(f"Test set size: {len(X_test)} observations")
print(f"\nAccuracy across {n_splits} splits:")
print(f"  Mean: {mean_accuracy:.4f}")
print(f"  Std Dev (SE): {se_accuracy:.4f}")
print(f"  Min: {accuracy_scores.min():.4f}")
print(f"  Max: {accuracy_scores.max():.4f}")
print(f"  Range: {accuracy_scores.max() - accuracy_scores.min():.4f}")
print(f"\n💡 Report as: Accuracy = {mean_accuracy:.2%} ± {se_accuracy:.2%}")
print("   This acknowledges the variability due to different train/test splits!")

In [None]:
# 📊 Visualization 6: Distribution of accuracy scores

plt.figure(figsize=(12, 6))

# Box plot
plt.subplot(1, 2, 1)
plt.boxplot(accuracy_scores, vert=True, widths=0.5, patch_artist=True,
            boxprops=dict(facecolor='lightblue', alpha=0.7),
            medianprops=dict(color='red', linewidth=2),
            whiskerprops=dict(linewidth=1.5),
            capprops=dict(linewidth=1.5))
plt.ylabel('Accuracy', fontsize=12)
plt.title('Box Plot of Accuracy Scores', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.xticks([1], [f'{n_splits} splits'])

# Add text annotations
plt.text(1.35, mean_accuracy, f'Mean = {mean_accuracy:.4f}\nSE = {se_accuracy:.4f}',
         fontsize=10, va='center',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# Histogram
plt.subplot(1, 2, 2)
plt.hist(accuracy_scores, bins=15, alpha=0.7, color='steelblue', 
         edgecolor='black', density=True)

# Overlay normal distribution
x = np.linspace(accuracy_scores.min(), accuracy_scores.max(), 100)
plt.plot(x, stats.norm.pdf(x, mean_accuracy, se_accuracy), 
         'r-', linewidth=2, label='Normal fit')

plt.axvline(mean_accuracy, color='black', linestyle='--', linewidth=2, 
            label=f'Mean = {mean_accuracy:.4f}')
plt.axvline(mean_accuracy - 2*se_accuracy, color='green', linestyle=':', 
            linewidth=1.5, alpha=0.7)
plt.axvline(mean_accuracy + 2*se_accuracy, color='green', linestyle=':', 
            linewidth=1.5, alpha=0.7, label='±2 SE')

plt.xlabel('Accuracy', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Distribution of Accuracy Scores', fontsize=12, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

plt.suptitle('ML Model Performance Variability Due to Train/Test Split 🤖',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("   - Model performance VARIES depending on which data points are in train vs test")
print("   - This is sampling variability in action!")
print("   - Always report uncertainty (SE or confidence intervals)")
print("   - Cross-validation provides multiple samples → better estimate of true performance")

---

## 6. Wrap-Up: Agricultural Applications 🌾

### Where Sampling Matters in Agriculture:

1. **Soil Testing** 🧪
   - Cannot test every location in a field
   - Take strategic samples (grid, stratified by zone)
   - Estimate mean nutrient levels with confidence intervals

2. **Yield Estimation** 📊
   - Cannot harvest and weigh entire fields before harvest
   - Sample plots to estimate total yield
   - SE tells you reliability of estimate

3. **Quality Control** ✅
   - Cannot inspect every fruit/grain
   - Sample batches to estimate defect rate
   - Determine appropriate sample size for desired precision

4. **Field Trials** 🧪
   - Test new varieties or treatments
   - Each field is a sample from all possible fields
   - Understand variability in treatment effects

5. **ML Model Evaluation** 🤖
   - Each train/test split is a sample
   - Cross-validation creates multiple samples
   - Report performance with uncertainty bounds
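
Choosing a sample size for a desired precision follows from rearranging SE = σ/√n into n = (σ/SE)². The helper below is a sketch: `required_sample_size` is a hypothetical name, and σ has to come from a pilot study or prior data because it is not known in advance.

```python
import math

def required_sample_size(sigma: float, target_se: float) -> int:
    """Sample size needed so that sigma / sqrt(n) is at most target_se."""
    if sigma <= 0 or target_se <= 0:
        raise ValueError("sigma and target_se must be positive")
    # Rearranged SE formula, rounded up to a whole number of units
    return math.ceil((sigma / target_se) ** 2)

# Example: yields with an assumed sigma of 0.8 tons/hectare
for target in (0.2, 0.1, 0.05):
    n_needed = required_sample_size(0.8, target)
    print(f"Target SE = {target:.2f} -> sample at least {n_needed} fields")
```

Halving the target SE roughly quadruples the number of fields required, which is the square-root rule seen from the planning side.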

---

## Key Takeaways 🎯

### Core Concepts:

1. ✅ **Population vs Sample**:
   - Population has parameters (μ, σ), usually unknown
   - Sample has statistics (x̄, s), used to estimate parameters

2. ✅ **Sampling Methods**:
   - Simple random: Equal probability for all
   - Stratified: Sample from each group (often best for heterogeneous populations)
   - Systematic: Every kth element
   - Cluster: Sample entire groups

3. ✅ **Sampling Distribution** ⭐⭐:
   - Distribution of a sample statistic across many samples
   - Centered at the true population parameter (for unbiased estimators such as x̄)
   - Spread measured by standard error (SE)

4. ✅ **Standard Error**:
   - SE = σ/√n
   - Quantifies typical sampling variability
   - Shrinks in proportion to 1/√n (not 1/n!)

5. ✅ **Sampling Variability**:
   - Different samples → different estimates
   - About 95% of sample means fall within ±2 SE of the true mean
   - Foundation of confidence intervals

6. ✅ **ML Connection** ⭐:
   - Cross-validation is repeated sampling
   - Train/test splits create sampling variability
   - Always report model performance with SE or CI
   - Larger training sets → lower performance variance

### Critical Formula:

$$
\boxed{SE = \frac{\sigma}{\sqrt{n}}}
$$

This single formula governs how uncertainty decreases with sample size!
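
For reference, the formula follows in two lines when the n observations are independent draws with common variance σ² (sampling with replacement, or from a population much larger than the sample):

$$
\begin{align}
\operatorname{Var}(\bar{x}) &= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \\
SE &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
\end{align}
$$

Independence is what lets the variance of the sum split into a sum of variances; correlated observations (such as fields within one farm) break this step.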

---

## Next Steps 🚀

**Coming Up Next: Central Limit Theorem** ⭐⭐

You've seen that sampling distributions look approximately normal. But why?

In the next notebook, we'll discover the **Central Limit Theorem**, arguably the most important theorem in statistics:

- Why sample means are approximately normal once n is large enough
- Works for almost ANY population distribution (it only needs a finite variance)!
- Foundation of much of statistical inference
- **Why ensemble methods work in ML**

**Questions to think about**:
1. What if the population is highly skewed? Will sample means still be normal?
2. How large does n need to be for normality?
3. What does this have to do with bootstrap and bagging?

All will be answered in **`02_central_limit_theorem.ipynb`**!

---

**Great work! You've mastered the foundations of sampling and sampling distributions.** 🌾📊✨