# Bootstrap Methods: Modern Computer-Intensive Inference 🎒

## Introduction: Inference Without Formulas!

In previous notebooks, we learned classical inference methods:
- **Point estimation**: Calculate x̄, s²
- **Confidence intervals**: x̄ ± t* × (s/√n)
- **Requirements**: Know formulas, make distributional assumptions

**But what if**:
- You want a CI for a median, a correlation, or some other complex statistic?
- No formula exists for the SE?
- The distributional assumptions don't hold?
- You want a modern, flexible approach?

### The Solution: Bootstrap! 🎯

**Key Idea**: Use the sample itself to estimate sampling variability

1. Treat your sample as a "mini-population"
2. Resample from it WITH replacement (many times)
3. Calculate statistic for each resample
4. Use distribution of bootstrap statistics for inference

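**No SE formula needed, and far fewer distributional assumptions!** What remains is the assumption that the sample is a representative, independent draw from the population.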

### ML Connection 🤖

Bootstrap is the foundation of many modern ML techniques:

- **Bootstrap Aggregating (Bagging)**: Train models on bootstrap samples, average predictions
- **Random Forests** = Bagging + Decision Trees + random feature selection ⭐⭐
- **Bootstrap evaluation**: Robust model performance estimates
- **Feature importance**: Quantify uncertainty in importance scores

Understanding bootstrap helps you understand why ensemble methods work!

---

## Learning Objectives 🎯

By the end of this notebook, you will:

1. ✅ Understand **bootstrap resampling** philosophy ⭐
2. ✅ Implement bootstrap from scratch for any statistic
3. ✅ Calculate **bootstrap confidence intervals**
4. ✅ Apply bootstrap to complex statistics (correlation, difference, etc.)
5. ✅ Connect to ML: **Bagging and Random Forests** ⭐⭐⭐
6. ✅ Understand when bootstrap works and when to be cautious

⭐⭐⭐ = Most critical ML connection

---

Let's bootstrap! 🚀

In [None]:
# 📦 Setup: Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Set style for beautiful plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Setup complete!")
print("🎒 Ready to learn bootstrap methods")

---

## 1. Bootstrap Philosophy ⭐

### The Core Idea:

**Problem**: We have ONE sample, but we want to know about the sampling distribution

**Classical Approach**: Use formulas and assumptions (CLT, normal distribution, etc.)

**Bootstrap Approach**: 
1. Treat sample as if it were the population
2. Draw many new samples FROM this sample (with replacement!)
3. Calculate statistic for each new sample
4. The distribution of these statistics approximates the sampling distribution

### Why "With Replacement"?

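- We need to mimic the original sampling process
- Original sample: drew n observations from the population (effectively ∞ size)
- Bootstrap sample: draw n observations from the sample (finite, size n)
- Drawing all n values **without** replacement would only reshuffle the sample, so every resample would give the same statistic; drawing **with replacement** is what produces variability!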
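
To see why replacement matters, here is a small sketch (the five values and all variable names are illustrative assumptions). Resampling all n values without replacement only permutes them, so the mean never changes, while resampling with replacement gives a different mean each time. The last lines compute the expected share of distinct original observations in one bootstrap sample, 1 − (1 − 1/n)ⁿ.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([5.1, 5.3, 4.9, 5.2, 5.4])  # illustrative values
n = len(sample)

# Without replacement: every "resample" is a permutation, so the mean is fixed
means_without = np.array([
    rng.choice(sample, size=n, replace=False).mean() for _ in range(1000)
])

# With replacement: values can repeat or drop out, so the mean varies
means_with = np.array([
    rng.choice(sample, size=n, replace=True).mean() for _ in range(1000)
])

print(f"Std of means without replacement: {means_without.std():.6f}")
print(f"Std of means with replacement:    {means_with.std():.6f}")

# Expected fraction of distinct original observations in one bootstrap sample
expected_unique_fraction = 1 - (1 - 1 / n) ** n
print(f"Expected unique fraction for n={n}: {expected_unique_fraction:.3f}")
```

As n grows, the unique fraction approaches 1 − 1/e ≈ 0.632, so roughly a third of the observations are left out of any given bootstrap sample.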

### Example:

Original sample: [5.1, 5.3, 4.9, 5.2, 5.4]

Bootstrap samples (with replacement):
- [5.1, 5.3, 5.3, 5.2, 4.9] (5.3 appears twice!)
- [5.4, 5.1, 5.1, 5.1, 5.2] (5.1 appears three times!)
- [4.9, 5.2, 5.4, 5.3, 5.2] (different order, 5.2 twice)

Calculate mean for each → bootstrap sampling distribution!
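
As a quick check of the arithmetic, the snippet below computes the mean of the original five values and of the three hand-written bootstrap samples listed above (the variable names are illustrative). The means work out to 5.16, 5.18, and 5.20, scattered around the original mean of 5.18.

```python
import numpy as np

original = np.array([5.1, 5.3, 4.9, 5.2, 5.4])

# The three bootstrap samples from the example, one per row
resamples = np.array([
    [5.1, 5.3, 5.3, 5.2, 4.9],
    [5.4, 5.1, 5.1, 5.1, 5.2],
    [4.9, 5.2, 5.4, 5.3, 5.2],
])

original_mean = original.mean()
resample_means = resamples.mean(axis=1)  # one mean per bootstrap sample

print(f"Original mean: {original_mean:.2f}")
print("Bootstrap means:", [f"{m:.2f}" for m in resample_means])
```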

---

In [None]:
# 🌾 Simple bootstrap example

# Original sample: wheat yields from 20 fields
np.random.seed(42)
original_sample = np.array([5.1, 5.3, 4.9, 5.2, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1,
                            5.2, 5.4, 5.0, 4.9, 5.3, 5.1, 5.2, 5.3, 5.0, 5.1])

n = len(original_sample)

# Take a few bootstrap samples
print("🎒 Bootstrap Resampling Example:")
print("=" * 60)
print(f"Original sample (n={n}):")
print(f"  {original_sample}")
print(f"  Mean: {original_sample.mean():.3f}")
print(f"\nBootstrap samples (each with n={n}, sampled WITH replacement):")
print("-" * 60)

for i in range(5):
    # Resample with replacement
    bootstrap_sample = np.random.choice(original_sample, size=n, replace=True)
    
    # Show unique values to demonstrate repetition
    unique, counts = np.unique(bootstrap_sample, return_counts=True)
    
    print(f"\nBootstrap sample {i+1}:")
    print(f"  {bootstrap_sample.round(1)}")
    print(f"  Mean: {bootstrap_sample.mean():.3f}")
    
    # Show which values were repeated
    repeated = unique[counts > 1]
    if len(repeated) > 0:
        print(f"  Repeated values: {repeated} (that's OK! That's the point!)")

print("\n💡 Key Observations:")
print("   - Each bootstrap sample has same size as original (n=20)")
print("   - Values can repeat (sampled WITH replacement)")
print("   - Each bootstrap sample gives different mean")
print("   - This variability mimics sampling variability!")

In [None]:
# 📊 Visualization 1: Original vs bootstrap sample

# Take one bootstrap sample
np.random.seed(42)
bootstrap_sample = np.random.choice(original_sample, size=n, replace=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Original sample
unique_orig, counts_orig = np.unique(original_sample, return_counts=True)
axes[0].bar(unique_orig, counts_orig, width=0.05, alpha=0.7, 
            color='steelblue', edgecolor='black')
axes[0].set_xlabel('Yield (tons/hectare)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Original Sample (n=20)', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axvline(original_sample.mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean = {original_sample.mean():.2f}')
axes[0].legend(fontsize=10)

# Right: Bootstrap sample
unique_boot, counts_boot = np.unique(bootstrap_sample, return_counts=True)
axes[1].bar(unique_boot, counts_boot, width=0.05, alpha=0.7, 
            color='orange', edgecolor='black')
axes[1].set_xlabel('Yield (tons/hectare)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].set_title('Bootstrap Sample (n=20, with replacement)', 
                  fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].axvline(bootstrap_sample.mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean = {bootstrap_sample.mean():.2f}')
axes[1].legend(fontsize=10)

# Highlight repeated values
for val, count in zip(unique_boot, counts_boot):
    if count > 1:
        axes[1].text(val, count, f'{count}×', ha='center', va='bottom',
                    fontsize=9, fontweight='bold', color='red')

plt.suptitle('Bootstrap Resampling: WITH Replacement 🎒',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Notice:")
print("   - Some values appear more than once in bootstrap sample (red numbers)")
print("   - Some original values might not appear at all")
print("   - This creates variability → mimics sampling from population!")

---

## 2. Bootstrap Algorithm 🔧

### Step-by-Step Procedure:

**Input**: 
- Original sample: x₁, x₂, ..., xₙ
- Statistic of interest: θ (mean, median, correlation, etc.)
- Number of bootstrap samples: B (typically 1000-10000)

**Algorithm**:
```
FOR b = 1 to B:
    1. Draw n observations from original sample WITH replacement
       → bootstrap sample: x*₁, x*₂, ..., x*ₙ
    
    2. Calculate statistic on bootstrap sample
       → θ*_b = f(x*₁, x*₂, ..., x*ₙ)
    
    3. Store θ*_b

OUTPUT: Bootstrap distribution {θ*₁, θ*₂, ..., θ*_B}
```

### Uses:

1. **Bootstrap SE**: SE = standard deviation of {θ*₁, ..., θ*_B}
2. **Bootstrap CI**: Use percentiles of the bootstrap distribution
3. **Bias estimation**: Bias = mean(θ*) - θ̂, where θ̂ is the statistic computed on the original sample
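
The three uses above can be sketched in a few lines. This is a minimal illustration, assuming a small made-up sample and the median as the statistic; `boot_stats`, `theta_hat`, and the other names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([5.1, 5.3, 4.9, 5.2, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1])  # made-up data
n, B = len(sample), 2000

theta_hat = np.median(sample)  # statistic on the original sample

# Draw all B resamples at once: each row holds n indices sampled with replacement
idx = rng.integers(0, n, size=(B, n))
boot_stats = np.median(sample[idx], axis=1)

boot_se = boot_stats.std(ddof=1)                              # 1. bootstrap SE
ci_lower, ci_upper = np.percentile(boot_stats, [2.5, 97.5])   # 2. percentile CI
boot_bias = boot_stats.mean() - theta_hat                     # 3. bias estimate

print(f"Estimate: {theta_hat:.3f}")
print(f"Bootstrap SE: {boot_se:.4f}")
print(f"95% percentile CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"Estimated bias: {boot_bias:.4f}")
```

Drawing an index matrix instead of looping is a design choice that only works when the statistic accepts an `axis` argument; for arbitrary functions a plain loop is the safer option.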

---

In [None]:
# 🔧 Implement bootstrap from scratch

def bootstrap(data, statistic_func, n_bootstrap=1000, seed=42):
    """
    Perform bootstrap resampling.
    
    Parameters:
    -----------
    data : array-like
        Original sample data
    statistic_func : function
        Function to calculate statistic (e.g., np.mean, np.median)
    n_bootstrap : int
        Number of bootstrap samples
    seed : int
        Random seed for reproducibility
    
    Returns:
    --------
    bootstrap_statistics : array
        Bootstrap distribution of the statistic
    """
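    # Use a local generator so the global NumPy random state is left untouched
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    bootstrap_statistics = []
    
    for _ in range(n_bootstrap):
        # Resample with replacement
        bootstrap_sample = rng.choice(data, size=n, replace=True)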
        
        # Calculate statistic
        stat = statistic_func(bootstrap_sample)
        bootstrap_statistics.append(stat)
    
    return np.array(bootstrap_statistics)

print("🔧 Bootstrap Implementation:")
print("=" * 60)
print("Function: bootstrap(data, statistic_func, n_bootstrap=1000)")
print("\nAlgorithm:")
print("  1. For each of B bootstrap iterations:")
print("     a. Resample n observations WITH replacement")
print("     b. Calculate statistic on bootstrap sample")
print("     c. Store result")
print("  2. Return array of B bootstrap statistics")
print("\n✓ Bootstrap function ready to use!")

In [None]:
# 🎯 Apply bootstrap to estimate sampling distribution of mean

# Bootstrap for the mean
B = 1000
bootstrap_means = bootstrap(original_sample, np.mean, n_bootstrap=B)

# Calculate statistics
original_mean = original_sample.mean()
bootstrap_mean = bootstrap_means.mean()
bootstrap_se = bootstrap_means.std()

# Compare with theoretical SE
theoretical_se = original_sample.std(ddof=1) / np.sqrt(len(original_sample))

print("🎯 Bootstrap Distribution of the Mean:")
print("=" * 60)
print(f"Original sample size: n = {len(original_sample)}")
print(f"Number of bootstrap samples: B = {B}")
print(f"\nORIGINAL SAMPLE:")
print(f"  Sample mean: {original_mean:.4f}")
print(f"\nBOOTSTRAP DISTRIBUTION:")
print(f"  Mean of bootstrap means: {bootstrap_mean:.4f}")
print(f"  Bootstrap SE: {bootstrap_se:.4f}")
print(f"\nCOMPARISON:")
print(f"  Theoretical SE = s/√n: {theoretical_se:.4f}")
print(f"  Bootstrap SE: {bootstrap_se:.4f}")
print(f"  Difference: {abs(bootstrap_se - theoretical_se):.4f}")
print(f"\n💡 Bootstrap SE is very close to theoretical SE!")
print(f"   And we didn't need the SE formula to get it!")

In [None]:
# 📊 Visualization 2: Bootstrap distribution of the mean

plt.figure(figsize=(12, 6))

# Histogram of bootstrap means
plt.hist(bootstrap_means, bins=40, alpha=0.7, color='steelblue', 
         edgecolor='black', density=True, label=f'{B} Bootstrap Means')

# Overlay normal distribution (theoretical)
x = np.linspace(bootstrap_means.min(), bootstrap_means.max(), 100)
plt.plot(x, stats.norm.pdf(x, original_mean, theoretical_se), 'r-', 
         linewidth=2, label=f'Theoretical: N({original_mean:.2f}, {theoretical_se:.3f})')

# Mark original sample mean
plt.axvline(original_mean, color='black', linestyle='--', linewidth=2, 
            label=f'Original mean = {original_mean:.3f}')

# Mark bootstrap mean
plt.axvline(bootstrap_mean, color='green', linestyle=':', linewidth=2, 
            alpha=0.7, label=f'Bootstrap mean = {bootstrap_mean:.3f}')

plt.xlabel('Sample Mean (tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'Bootstrap Sampling Distribution (B={B}) 🎒',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

# Add text box
textstr = f'Bootstrap SE: {bootstrap_se:.4f}\nTheoretical SE: {theoretical_se:.4f}'
props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes, fontsize=11,
         verticalalignment='top', bbox=props)

plt.tight_layout()
plt.show()

print("\n💡 Bootstrap Magic:")
print(f"   - From ONE sample, we created {B} 'pseudo-samples'")
print("   - Their spread and shape approximate the true sampling distribution")
print("   - No SE formula or normality assumption needed!")

---

## 3. Bootstrap Standard Error ⭐

### The Power: Works for ANY Statistic!

**Classical approach**: Need formula for SE
- SE for mean: σ/√n ✓
- SE for median: ??? (complicated!)
- SE for correlation: ??? (very complicated!)
- SE for custom statistic: ??? (might not exist!)

**Bootstrap approach**: Same algorithm for everything!

$$
SE_{bootstrap} = \text{Standard Deviation of Bootstrap Statistics}
$$

### Procedure:

1. Calculate bootstrap distribution
2. SE = standard deviation of bootstrap distribution
3. Done! (No formula needed)
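
One practical question the procedure leaves open is how large B should be. The sketch below (made-up data; `bootstrap_se` is a hypothetical helper, not a library function) repeats the bootstrap with 20 different seeds and compares how much the SE estimate itself fluctuates for a small and a large B.

```python
import numpy as np

def bootstrap_se(data, stat_func, n_boot, seed):
    """Bootstrap SE of stat_func(data) from n_boot resamples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per resample
    stats_b = np.array([stat_func(data[row]) for row in idx])
    return stats_b.std(ddof=1)

data = np.random.default_rng(1).normal(5.1, 0.2, size=30)  # made-up yields

spread = {}
for n_boot in (50, 2000):
    # Same data, different seeds: the variation is pure Monte Carlo noise
    se_runs = np.array([bootstrap_se(data, np.median, n_boot, seed)
                        for seed in range(20)])
    spread[n_boot] = se_runs.std(ddof=1)
    print(f"B={n_boot:>4}: SE estimates from {se_runs.min():.4f} to "
          f"{se_runs.max():.4f} (spread {spread[n_boot]:.4f})")
```

The run-to-run spread shrinks roughly like 1/√B, which is one reason values of B in the thousands are a common default.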

---

In [None]:
# 🔬 Bootstrap SE for multiple statistics

B = 2000

# Different statistics
statistics = {
    'Mean': np.mean,
    'Median': np.median,
    'Std Dev': lambda x: np.std(x, ddof=1),
    '75th Percentile': lambda x: np.percentile(x, 75),
    'Coef of Variation': lambda x: np.std(x, ddof=1) / np.mean(x)
}

results = {}

print("🔬 Bootstrap Standard Error for Various Statistics:")
print("=" * 60)
print(f"Sample size: n = {len(original_sample)}")
print(f"Bootstrap samples: B = {B}")
print(f"\n{'Statistic':<20} {'Estimate':<12} {'Bootstrap SE':<15}")
print("-" * 60)

for name, func in statistics.items():
    # Original estimate
    original_stat = func(original_sample)
    
    # Bootstrap distribution
    bootstrap_dist = bootstrap(original_sample, func, n_bootstrap=B)
    
    # Bootstrap SE
    boot_se = bootstrap_dist.std()
    
    results[name] = {
        'estimate': original_stat,
        'bootstrap_se': boot_se,
        'distribution': bootstrap_dist
    }
    
    print(f"{name:<20} {original_stat:<12.4f} {boot_se:<15.4f}")

print("\n💡 Key Insight:")
print("   - Same bootstrap algorithm works for ALL statistics!")
print("   - No formulas needed")
print("   - Works even when no theoretical formula exists")

In [None]:
# 📊 Visualization 3: Bootstrap distributions for multiple statistics

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

stats_to_plot = ['Mean', 'Median', '75th Percentile', 'Coef of Variation']
colors = ['steelblue', 'orange', 'green', 'purple']

for idx, (stat_name, color) in enumerate(zip(stats_to_plot, colors)):
    ax = axes[idx]
    
    dist = results[stat_name]['distribution']
    estimate = results[stat_name]['estimate']
    se = results[stat_name]['bootstrap_se']
    
    # Histogram
    ax.hist(dist, bins=30, alpha=0.7, color=color, edgecolor='black', density=True)
    
    # Mark original estimate
    ax.axvline(estimate, color='red', linestyle='--', linewidth=2, 
               label=f'Estimate = {estimate:.3f}')
    
    # Mark ±1 SE
    ax.axvline(estimate - se, color='green', linestyle=':', linewidth=1.5, alpha=0.7)
    ax.axvline(estimate + se, color='green', linestyle=':', linewidth=1.5, alpha=0.7,
               label='±1 SE')
    
    ax.set_xlabel(f'{stat_name}', fontsize=10)
    ax.set_ylabel('Density', fontsize=10)
    ax.set_title(f'{stat_name}\n(SE = {se:.4f})', fontsize=11, fontweight='bold')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.suptitle('Bootstrap Distributions for Multiple Statistics 📊',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Flexibility of Bootstrap:")
print("   - Works for symmetric statistics (mean)")
print("   - Works for skewed statistics (median, percentiles)")
print("   - Works for complex statistics (coefficient of variation)")
print("   - All with the SAME algorithm!")

---

## 4. Bootstrap Confidence Intervals 📊

### Percentile Method (Most Common)

**Idea**: Use the percentiles of the bootstrap distribution

**95% CI**:
$$
[\text{2.5th percentile}, \text{97.5th percentile}]
$$

**Algorithm**:
1. Generate bootstrap distribution {θ*₁, θ*₂, ..., θ*_B}
2. Sort the bootstrap statistics
3. For 95% CI:
   - Lower bound = 2.5th percentile
   - Upper bound = 97.5th percentile

### Advantages:

- ✅ Works for ANY statistic
- ✅ No normality assumption (though coverage can fall below the nominal level when n is very small)
- ✅ Handles skewed distributions naturally
- ✅ Simple to implement

### Example:

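For B=1000 bootstrap samples, the 95% CI runs from roughly the 25th to the 975th of the sorted bootstrap statistics (the exact values depend on how the percentile function interpolates).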
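
To make the ordered-values description concrete, this sketch sorts B = 1000 values and compares the 25th and 975th ordered values with `np.percentile`, which by default interpolates linearly between neighbouring order statistics. The normal draws are only a stand-in for a real bootstrap distribution, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 1000
boot_stats = rng.normal(5.14, 0.04, size=B)  # stand-in for a bootstrap distribution

sorted_stats = np.sort(boot_stats)
lower_by_rank = sorted_stats[24]    # 25th ordered value (0-based index 24)
upper_by_rank = sorted_stats[974]   # 975th ordered value (0-based index 974)

# np.percentile places the 2.5th percentile between the 25th and 26th ordered
# values, and the 97.5th percentile between the 975th and 976th
lower_pct, upper_pct = np.percentile(boot_stats, [2.5, 97.5])

print(f"By rank:       [{lower_by_rank:.4f}, {upper_by_rank:.4f}]")
print(f"np.percentile: [{lower_pct:.4f}, {upper_pct:.4f}]")
```

The two intervals differ only by the interpolation between adjacent sorted values, which is negligible for B in the thousands.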

---

In [None]:
# 📊 Calculate bootstrap confidence intervals

def bootstrap_ci(data, statistic_func, n_bootstrap=2000, confidence=0.95, seed=42):
    """
    Calculate bootstrap confidence interval using percentile method.
    """
    # Generate bootstrap distribution
    boot_dist = bootstrap(data, statistic_func, n_bootstrap, seed)
    
    # Calculate percentiles
    alpha = 1 - confidence
    lower_percentile = (alpha / 2) * 100
    upper_percentile = (1 - alpha / 2) * 100
    
    ci_lower = np.percentile(boot_dist, lower_percentile)
    ci_upper = np.percentile(boot_dist, upper_percentile)
    
    return ci_lower, ci_upper, boot_dist

# Calculate 95% CI for mean
ci_lower, ci_upper, boot_dist = bootstrap_ci(original_sample, np.mean)

print("📊 Bootstrap Confidence Interval (Percentile Method):")
print("=" * 60)
print(f"Statistic: Sample mean")
print(f"Bootstrap samples: B = {len(boot_dist)}")
print(f"Confidence level: 95%")
print(f"\nPOINT ESTIMATE:")
print(f"  Sample mean: {original_sample.mean():.4f}")
print(f"\nBOOTSTRAP 95% CI:")
print(f"  Lower bound (2.5th percentile): {ci_lower:.4f}")
print(f"  Upper bound (97.5th percentile): {ci_upper:.4f}")
print(f"  CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  Width: {ci_upper - ci_lower:.4f}")

# Compare with classical t-based CI
n = len(original_sample)
s = original_sample.std(ddof=1)
t_star = stats.t.ppf(0.975, n-1)
classical_ci = (original_sample.mean() - t_star * s / np.sqrt(n),
                original_sample.mean() + t_star * s / np.sqrt(n))

print(f"\nCOMPARISON WITH CLASSICAL CI:")
print(f"  Classical t-CI: [{classical_ci[0]:.4f}, {classical_ci[1]:.4f}]")
print(f"  Bootstrap CI:   [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"\n💡 Bootstrap CI is similar to classical CI for the mean")
print(f"   But bootstrap works for ANY statistic!")

In [None]:
# 📊 Visualization 4: Bootstrap distribution with CI

plt.figure(figsize=(12, 6))

# Histogram
plt.hist(boot_dist, bins=40, alpha=0.7, color='steelblue', 
         edgecolor='black', density=True, label='Bootstrap Distribution')

# Shade the 95% CI region (a second density histogram of the subset would be
# re-binned and re-normalised, so it would not line up with the first one)
plt.axvspan(ci_lower, ci_upper, alpha=0.2, color='green', label='95% CI Region')

# Mark the bounds
plt.axvline(ci_lower, color='green', linestyle='--', linewidth=2, 
            label=f'Lower: {ci_lower:.3f}')
plt.axvline(ci_upper, color='green', linestyle='--', linewidth=2, 
            label=f'Upper: {ci_upper:.3f}')

# Mark original estimate
plt.axvline(original_sample.mean(), color='red', linestyle='-', linewidth=2,
            label=f'Estimate: {original_sample.mean():.3f}')

plt.xlabel('Sample Mean (tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bootstrap Distribution with 95% Confidence Interval 📊',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=10, loc='upper left')
plt.grid(True, alpha=0.3)

# Add annotations
plt.annotate('2.5% of\nbootstrap\nsamples', xy=(ci_lower - 0.03, 1.5), 
             xytext=(ci_lower - 0.15, 2.5),
             arrowprops=dict(arrowstyle='->', lw=1.5),
             fontsize=9, ha='center')

plt.annotate('2.5% of\nbootstrap\nsamples', xy=(ci_upper + 0.03, 1.5), 
             xytext=(ci_upper + 0.15, 2.5),
             arrowprops=dict(arrowstyle='->', lw=1.5),
             fontsize=9, ha='center')

plt.tight_layout()
plt.show()

print("\n💡 Percentile Method:")
print("   - Green region contains middle 95% of bootstrap statistics")
print("   - This is our 95% confidence interval")
print("   - No assumptions about normality needed!")

---

## 5. Bootstrap for Complex Statistics ⭐

### The Real Power of Bootstrap

Bootstrap shines when dealing with complex statistics where:
- No formula exists for SE
- Distribution is unknown or complicated
- Classical methods don't apply

### Examples:

1. **Correlation coefficient**:
   - Classical SE formula is complicated
   - Bootstrap: Just calculate correlation for each bootstrap sample!

2. **Difference in means** (two groups):
   - Classical: t-test formula
   - Bootstrap: Resample each group, calculate difference

3. **Custom metrics**:
   - Your own metric for which no theory exists
   - Bootstrap gives you SE and CI automatically!
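
For statistics computed on paired data, such as a correlation, the rows must be resampled together. The sketch below uses synthetic data (the data-generating numbers and all names are assumptions for illustration) to contrast resampling pairs with resampling each column independently, which breaks the link between the two variables.

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 50, 1000
x = rng.normal(7.0, 1.5, n)
y = 3.0 + 0.3 * x + rng.normal(0, 0.5, n)
r_sample = np.corrcoef(x, y)[0, 1]

paired_r = np.empty(B)
independent_r = np.empty(B)
for b in range(B):
    # Correct: one set of row indices keeps each (x, y) pair intact
    idx = rng.integers(0, n, size=n)
    paired_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]

    # Incorrect for correlation: separate indices destroy the pairing
    idx_x = rng.integers(0, n, size=n)
    idx_y = rng.integers(0, n, size=n)
    independent_r[b] = np.corrcoef(x[idx_x], y[idx_y])[0, 1]

print(f"Sample correlation:            {r_sample:.3f}")
print(f"Mean r, pairs resampled:       {paired_r.mean():.3f}")
print(f"Mean r, columns independently: {independent_r.mean():.3f}")
```

With independent resampling the bootstrap correlations centre on zero, so any SE or CI built from them would describe the wrong quantity.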

---

In [None]:
# 🔬 Bootstrap CI for correlation

# Generate correlated data: soil nitrogen vs wheat yield
np.random.seed(42)
n = 50
soil_nitrogen = np.random.normal(7.0, 1.5, n)
wheat_yield = 3.0 + 0.3 * soil_nitrogen + np.random.normal(0, 0.5, n)

# Combine into pairs
data_pairs = np.column_stack([soil_nitrogen, wheat_yield])

# Define correlation function
def correlation(data):
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

# Original correlation
original_corr = correlation(data_pairs)

# Bootstrap for correlation
B = 2000
bootstrap_corrs = []

for _ in range(B):
    # Resample PAIRS (important!)
    indices = np.random.choice(n, size=n, replace=True)
    bootstrap_sample = data_pairs[indices]
    bootstrap_corrs.append(correlation(bootstrap_sample))

bootstrap_corrs = np.array(bootstrap_corrs)

# Calculate CI
corr_ci_lower = np.percentile(bootstrap_corrs, 2.5)
corr_ci_upper = np.percentile(bootstrap_corrs, 97.5)
corr_se = bootstrap_corrs.std()

print("🔬 Bootstrap CI for Correlation Coefficient:")
print("=" * 60)
print(f"Variables: Soil nitrogen vs Wheat yield")
print(f"Sample size: n = {n} fields")
print(f"Bootstrap samples: B = {B}")
print(f"\nRESULTS:")
print(f"  Sample correlation: r = {original_corr:.4f}")
print(f"  Bootstrap SE: {corr_se:.4f}")
print(f"  95% Bootstrap CI: [{corr_ci_lower:.4f}, {corr_ci_upper:.4f}]")
print(f"\n💡 Interpretation:")
print(f"   We are 95% confident the true correlation between")
print(f"   soil nitrogen and wheat yield is between")
print(f"   {corr_ci_lower:.2f} and {corr_ci_upper:.2f}")

In [None]:
# 📊 Visualization 5: Scatter plot with bootstrap correlation distribution

fig = plt.figure(figsize=(14, 6))

# Main scatter plot
ax_main = plt.subplot(1, 2, 1)
ax_main.scatter(soil_nitrogen, wheat_yield, alpha=0.6, s=80, 
                edgecolors='black', linewidths=0.5)

# Add regression line
z = np.polyfit(soil_nitrogen, wheat_yield, 1)
p = np.poly1d(z)
x_line = np.linspace(soil_nitrogen.min(), soil_nitrogen.max(), 100)
ax_main.plot(x_line, p(x_line), 'r-', linewidth=2, alpha=0.7)

ax_main.set_xlabel('Soil Nitrogen (kg/ha)', fontsize=11)
ax_main.set_ylabel('Wheat Yield (tons/hectare)', fontsize=11)
ax_main.set_title(f'Soil Nitrogen vs Yield\n(r = {original_corr:.3f})', 
                  fontsize=12, fontweight='bold')
ax_main.grid(True, alpha=0.3)

# Bootstrap distribution
ax_boot = plt.subplot(1, 2, 2)
ax_boot.hist(bootstrap_corrs, bins=40, alpha=0.7, color='steelblue', 
             edgecolor='black', density=True)

# Shade CI (axvspan keeps the shading aligned with the histogram above)
ax_boot.axvspan(corr_ci_lower, corr_ci_upper, alpha=0.2, color='green')

ax_boot.axvline(original_corr, color='red', linestyle='--', linewidth=2,
                label=f'r = {original_corr:.3f}')
ax_boot.axvline(corr_ci_lower, color='green', linestyle=':', linewidth=2,
                label=f'95% CI: [{corr_ci_lower:.3f}, {corr_ci_upper:.3f}]')
ax_boot.axvline(corr_ci_upper, color='green', linestyle=':', linewidth=2)

ax_boot.set_xlabel('Correlation Coefficient', fontsize=11)
ax_boot.set_ylabel('Density', fontsize=11)
ax_boot.set_title(f'Bootstrap Distribution\n(B={B})', 
                  fontsize=12, fontweight='bold')
ax_boot.legend(fontsize=9)
ax_boot.grid(True, alpha=0.3)

plt.suptitle('Bootstrap CI for Correlation Coefficient 🔬',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Bootstrap for Correlation:")
print("   - No complicated formulas needed")
print("   - Does not rely on bivariate normality")
print("   - CI shows uncertainty in correlation estimate")

In [None]:
# 🌾 Bootstrap for difference in means (two treatments)

# Two fertilizer treatments
np.random.seed(42)
treatment_A = np.random.normal(5.2, 0.7, 30)  # Standard fertilizer
treatment_B = np.random.normal(5.6, 0.8, 35)  # New fertilizer

# Observed difference
observed_diff = treatment_B.mean() - treatment_A.mean()

# Bootstrap for difference
B = 2000
bootstrap_diffs = []

for _ in range(B):
    # Resample each group independently
    boot_A = np.random.choice(treatment_A, size=len(treatment_A), replace=True)
    boot_B = np.random.choice(treatment_B, size=len(treatment_B), replace=True)
    
    # Calculate difference
    diff = boot_B.mean() - boot_A.mean()
    bootstrap_diffs.append(diff)

bootstrap_diffs = np.array(bootstrap_diffs)

# Calculate CI
diff_ci_lower = np.percentile(bootstrap_diffs, 2.5)
diff_ci_upper = np.percentile(bootstrap_diffs, 97.5)
diff_se = bootstrap_diffs.std()

print("🌾 Bootstrap CI for Difference in Means (Two Treatments):")
print("=" * 60)
print(f"Treatment A (standard): n = {len(treatment_A)}, mean = {treatment_A.mean():.3f}")
print(f"Treatment B (new):      n = {len(treatment_B)}, mean = {treatment_B.mean():.3f}")
print(f"\nRESULTS:")
print(f"  Observed difference (B - A): {observed_diff:.4f} tons/hectare")
print(f"  Bootstrap SE: {diff_se:.4f}")
print(f"  95% Bootstrap CI: [{diff_ci_lower:.4f}, {diff_ci_upper:.4f}]")
print(f"\n💡 Interpretation:")
if diff_ci_lower > 0:
    print(f"   Treatment B is significantly better than A (CI lies entirely above 0)")
    print(f"   Improvement: {observed_diff:.2f} tons/hectare")
elif diff_ci_upper < 0:
    print(f"   Treatment A is significantly better than B (CI lies entirely below 0)")
else:
    print(f"   Cannot conclude either treatment is better (CI includes 0)")

In [None]:
# 📊 Visualization 6: Difference distribution

plt.figure(figsize=(12, 6))

# Histogram
plt.hist(bootstrap_diffs, bins=40, alpha=0.7, color='steelblue', 
         edgecolor='black', density=True, label='Bootstrap Distribution')

# Shade CI (axvspan keeps the shading aligned with the histogram above)
plt.axvspan(diff_ci_lower, diff_ci_upper, alpha=0.2, color='green', label='95% CI')

# Mark observed difference
plt.axvline(observed_diff, color='red', linestyle='--', linewidth=2,
            label=f'Observed Diff = {observed_diff:.3f}')

# Mark zero (no difference)
plt.axvline(0, color='black', linestyle='-', linewidth=2, alpha=0.5,
            label='No difference (0)')

# Mark CI bounds
plt.axvline(diff_ci_lower, color='green', linestyle=':', linewidth=2)
plt.axvline(diff_ci_upper, color='green', linestyle=':', linewidth=2)

plt.xlabel('Difference in Mean Yield (B - A, tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bootstrap Distribution of Treatment Difference 🌾',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

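# Add text (handle a CI entirely above 0, entirely below 0, or spanning 0)
if diff_ci_lower > 0:
    color = 'lightgreen'
    msg = 'CI excludes 0\nTreatment B\nis better!'
elif diff_ci_upper < 0:
    color = 'lightcoral'
    msg = 'CI excludes 0\nTreatment A\nis better!'
else:
    color = 'lightyellow'
    msg = 'CI includes 0\nNo clear\nwinner'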

textstr = f'95% CI:\n[{diff_ci_lower:.3f}, {diff_ci_upper:.3f}]\n\n{msg}'
props = dict(boxstyle='round', facecolor=color, alpha=0.8)
plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes, fontsize=10,
         verticalalignment='top', bbox=props)

plt.tight_layout()
plt.show()

print("\n💡 Using Bootstrap for Treatment Comparison:")
print("   - If CI excludes 0: Significant difference")
print("   - If CI includes 0: No significant difference")
print("   - More flexible than t-test (no normality assumption)")

---

## 6. When to Use Bootstrap ⚖️

### Bootstrap is Great When:

✅ **Complex statistics**: No simple formula for SE (correlation, percentiles, ratios)

✅ **Non-normal data**: Don't want to assume normality

✅ **Small-to-moderate samples**: n ≥ 20-30 typically works well

✅ **Flexible inference**: Want to avoid restrictive assumptions

✅ **Quick implementation**: Same algorithm for any statistic

### Use Caution When:

⚠️ **Very small samples**: n < 20 may not provide stable estimates

⚠️ **Estimating extremes**: Max, min, extreme percentiles (a resample can never contain values beyond the observed range)

⚠️ **Heavy dependence**: Time series with strong autocorrelation (need a block bootstrap)

⚠️ **Computational cost**: Very large datasets + many iterations can be slow
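
The caution about extremes can be checked directly. In this sketch (synthetic data; names are illustrative), the bootstrap distribution of the sample maximum piles up on the observed maximum: a resample contains the largest observation with probability 1 − (1 − 1/n)ⁿ, and it can never exceed it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 20, 5000
sample = rng.normal(5.1, 0.2, n)  # synthetic yields

# B resamples at once, then the maximum of each resample
idx = rng.integers(0, n, size=(B, n))
boot_max = sample[idx].max(axis=1)

share_at_max = np.mean(boot_max == sample.max())
expected_share = 1 - (1 - 1 / n) ** n

print(f"Share of resamples whose max equals the sample max: {share_at_max:.3f}")
print(f"Expected share 1 - (1 - 1/n)^n:                     {expected_share:.3f}")
print(f"Largest bootstrap max: {boot_max.max():.3f} (sample max: {sample.max():.3f})")
```

A distribution with most of its mass on a single point says little about how far the population maximum might lie above the observed one, so percentile intervals for extremes should not be trusted.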

### Bootstrap vs Classical:

| Aspect | Classical | Bootstrap |
|--------|-----------|----------|
| **Assumptions** | Normality, known distribution | Minimal (representative, independent sample) |
| **Formulas** | Need specific formula for each | Same algorithm for all |
| **Flexibility** | Limited | High |
| **Computation** | Fast (formula) | Slower (resampling) |
| **Small samples** | May be better | May be unstable |
| **Complex statistics** | Often difficult/impossible | Easy |
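
For the dependent-data case mentioned above, a moving block bootstrap resamples contiguous blocks instead of single observations. The following is a rough sketch on a simulated AR(1) series; the block length of 25, the AR coefficient of 0.8, and the helper name `block_bootstrap_means` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, phi = 500, 0.8

# Simulate an AR(1) series: x_t = phi * x_{t-1} + noise
series = np.empty(n)
series[0] = rng.normal()
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.normal()

def block_bootstrap_means(x, block_len, n_boot, rng):
    """Means of moving-block bootstrap resamples (hypothetical helper)."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    offsets = np.arange(block_len)
    means = np.empty(n_boot)
    for b in range(n_boot):
        # Pick random block starts, glue the blocks together, trim to length n
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets).ravel()[:n]
        means[b] = x[idx].mean()
    return means

B = 1000
# Ordinary bootstrap: treats observations as independent
iid_means = series[rng.integers(0, n, size=(B, n))].mean(axis=1)
block_means = block_bootstrap_means(series, block_len=25, n_boot=B, rng=rng)

iid_se = iid_means.std(ddof=1)
block_se = block_means.std(ddof=1)
print(f"SE of the mean, ordinary bootstrap: {iid_se:.4f}")
print(f"SE of the mean, block bootstrap:    {block_se:.4f}")
```

With positive autocorrelation the ordinary bootstrap understates the SE because it ignores the dependence between neighbouring observations; the block version preserves that dependence within each block.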

---

## 7. Machine Learning Connection ⭐⭐⭐

### Bootstrap → Bagging → Random Forests

**This is where statistical inference meets modern ML!**

#### 1. Bootstrap Aggregating (Bagging) 🎒

**Idea**: Reduce variance by averaging predictions from multiple models

**Algorithm**:
1. FOR b = 1 to B:
   - Generate bootstrap sample from training data
   - Train model on bootstrap sample → model_b
2. Final prediction = AVERAGE of all model predictions (or a majority vote for classification)

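**Why it works**: Averaging reduces variance!
- Individual models have high variance
- Averaging B models with variance σ² and pairwise correlation ρ gives variance ρσ² + (1 − ρ)σ²/B, which is smaller than σ² whenever the models are not perfectly correlated
- Bootstrap provides the diversity that keeps ρ below 1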
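
The variance reduction can be made precise. If each of B models has prediction variance σ² and any two models have correlation ρ, their average has variance ρσ² + (1 − ρ)σ²/B. The simulation below checks this under an assumed equicorrelated normal setup (σ = 1; the values of ρ and B are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(5)
rho, B, n_trials = 0.3, 50, 20000

# Equicorrelated "predictions": a shared component plus model-specific noise.
# Each column has variance 1 and any two columns have correlation rho.
shared = rng.normal(size=(n_trials, 1))
specific = rng.normal(size=(n_trials, B))
predictions = np.sqrt(rho) * shared + np.sqrt(1 - rho) * specific

empirical_var = predictions.mean(axis=1).var()
theoretical_var = rho + (1 - rho) / B

print(f"Variance of a single model:       {predictions[:, 0].var():.3f}")
print(f"Variance of the average (sim):    {empirical_var:.3f}")
print(f"Variance of the average (theory): {theoretical_var:.3f}")
```

As B grows the second term vanishes but the first does not, which is why random forests also decorrelate the trees through random feature selection.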

#### 2. Random Forests 🌲

**Random Forest = Bagging + Decision Trees + Random Feature Selection**

1. Bootstrap samples (like bagging)
2. Train decision tree on each sample
3. At each split: consider only random subset of features
4. Average predictions from all trees

**Result**: One of the most powerful ML algorithms!

#### 3. Why Ensemble Methods Work:

- **Unstable models** (high variance): Decision trees, neural nets
- **Bootstrap creates diversity**: Different training sets
- **Averaging reduces variance**: the less correlated the models, the larger the reduction
- **Better generalization**: More robust predictions

---

In [None]:
# 🤖 Bagging demonstration

# Generate synthetic agricultural data
np.random.seed(42)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=1, noise=15, random_state=42)
X = X.ravel()

# Train single decision tree (unstable, high variance)
single_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
single_tree.fit(X.reshape(-1, 1), y)

# Bagging: Train multiple trees on bootstrap samples
n_trees = 100
bagged_predictions = []
individual_trees = []

# Evaluation grid for visualization (built once, outside the loop)
X_test = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

for i in range(n_trees):
    # Bootstrap sample
    indices = np.random.choice(n_samples, size=n_samples, replace=True)
    X_boot = X[indices]
    y_boot = y[indices]
    
    # Train tree
    tree = DecisionTreeRegressor(max_depth=5, random_state=i)
    tree.fit(X_boot.reshape(-1, 1), y_boot)
    individual_trees.append(tree)
    
    # Predictions on the shared grid
    bagged_predictions.append(tree.predict(X_test))

bagged_predictions = np.array(bagged_predictions)
ensemble_prediction = bagged_predictions.mean(axis=0)

print("🤖 Bootstrap Aggregating (Bagging) Demonstration:")
print("=" * 60)
print(f"Training data: n = {n_samples} agricultural observations")
print(f"Model: Decision Tree (max_depth=5)")
print(f"Number of bagged models: {n_trees}")
print(f"\nAPPROACH:")
print(f"  1. Generate {n_trees} bootstrap samples from training data")
print(f"  2. Train decision tree on each bootstrap sample")
print(f"  3. Final prediction = AVERAGE of all {n_trees} tree predictions")
print(f"\n💡 Random Forests are built on exactly this procedure,")
print(f"   plus random feature selection at each split")

In [None]:
# 📊 Visualization 7: Bagging reduces variance

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

X_test = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

# Left: Individual bootstrap predictions
axes[0].scatter(X, y, alpha=0.4, s=50, color='gray', label='Training Data')

# Plot 20 individual tree predictions
for i in range(20):
    axes[0].plot(X_test, bagged_predictions[i], 'b-', alpha=0.15, linewidth=1)

# Highlight one
axes[0].plot(X_test, bagged_predictions[0], 'b-', alpha=0.5, linewidth=2,
             label='Individual Tree Predictions')

axes[0].set_xlabel('Feature', fontsize=11)
axes[0].set_ylabel('Target', fontsize=11)
axes[0].set_title('Individual Bootstrap Trees\n(High Variance)', 
                  fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Right: Ensemble prediction
axes[1].scatter(X, y, alpha=0.4, s=50, color='gray', label='Training Data')

# Show a few individual predictions for reference
for i in range(10):
    axes[1].plot(X_test, bagged_predictions[i], 'b-', alpha=0.1, linewidth=1)

# Ensemble prediction (averaged)
axes[1].plot(X_test, ensemble_prediction, 'r-', linewidth=3, 
             label=f'Bagged Ensemble (avg of {n_trees} trees)', zorder=5)

axes[1].set_xlabel('Feature', fontsize=11)
axes[1].set_ylabel('Target', fontsize=11)
axes[1].set_title(f'Bagged Ensemble (Average of {n_trees} Trees)\n(Low Variance!)', 
                  fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.suptitle('Bootstrap Aggregating: Variance Reduction Through Averaging 🎯', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Key ML Insights:")
print("   LEFT: Individual trees are 'wiggly' (high variance)")
print("   RIGHT: Averaged prediction is smoother (lower variance)")
print("\n🎯 This is why Random Forests work so well:")
print("   1. Bootstrap creates diverse training sets")
print("   2. Each tree learns different patterns")
print("   3. Averaging the trees reduces variance")
print("   4. Final model generalizes better")
print("\n✨ Bootstrap → Bagging → Random Forests!")
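
To see how different the bootstrap training sets really are, the sketch below counts the share of distinct original observations in each bootstrap sample. The seed and the number of repetitions (`n_boot`) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100          # same sample size as the bagging demo
n_boot = 2000    # number of bootstrap samples to inspect (arbitrary)

unique_fractions = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)               # n indices drawn WITH replacement
    unique_fractions[b] = np.unique(idx).size / n  # share of distinct observations

# Probability that a given observation appears at least once in a bootstrap sample
expected = 1 - (1 - 1 / n) ** n

print(f"Mean fraction of distinct observations: {unique_fractions.mean():.3f}")
print(f"Theoretical value 1 - (1 - 1/n)^n:      {expected:.3f}")
```

Each tree therefore sees only about two thirds of the original observations (some of them repeated). The rest are left out of that tree's training set, which is one source of the diversity between trees.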

---

## Key Takeaways 🎯

### Bootstrap Methods:

1. ✅ **Core Idea** ⭐:
   - Treat sample as mini-population
   - Resample WITH replacement (B times)
   - Use bootstrap distribution for inference

2. ✅ **Bootstrap SE**:
   - SE = Standard deviation of bootstrap statistics
   - Works for ANY statistic (no formula needed!)
   - Mean, median, correlation, custom metrics: all the same algorithm

3. ✅ **Bootstrap CI** (Percentile Method):
   - 95% CI = [2.5th percentile, 97.5th percentile] of the bootstrap statistics
   - No parametric distributional assumptions (the sample must still be representative of the population)
   - Handles skewed distributions naturally

4. ✅ **Advantages**:
   - Minimal assumptions
   - Works for complex statistics
   - Same algorithm for everything
   - Flexible and modern

5. ✅ **ML Connection** ⭐⭐⭐:
   - **Bootstrap → Bagging**: Train on bootstrap samples, average predictions
   - **Random Forests** = Bootstrap + Trees + Random features
   - **Variance reduction**: Averaging many trees lowers prediction variance
   - **Ensemble methods**: Understanding bootstrap → understanding why they work

### The Bootstrap-to-ML Pipeline:

```
Bootstrap Resampling
    ↓
Multiple diverse training sets
    ↓
Train model on each (Bootstrap Aggregating = Bagging)
    ↓
Average predictions (averaging reduces variance)
    ↓
Better generalization!
    ↓
Random Forests and other bagged ensembles
```

### Critical Formulas:

$$
\boxed{SE_{bootstrap} = SD(\{\theta^*_1, \theta^*_2, \ldots, \theta^*_B\})}
$$

$$
\boxed{95\%\ \text{CI} = [P_{2.5}, P_{97.5}] \text{ (percentile method)}}
$$
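
The two formulas above can be sketched in a few lines of NumPy. The helper name `bootstrap_se_ci`, its defaults, and the exponential sample are hypothetical choices for illustration, not code or data from earlier in the notebook.

```python
import numpy as np

def bootstrap_se_ci(data, statistic, n_boot=5000, alpha=0.05, seed=0):
    """Return (SE, (lower, upper)) using the percentile bootstrap."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]  # resample WITH replacement
        boot_stats[b] = statistic(resample)
    se = boot_stats.std(ddof=1)                      # SD of the bootstrap statistics
    lower, upper = np.percentile(
        boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return se, (lower, upper)

# Hypothetical skewed sample (not data from the notebook)
sample = np.random.default_rng(1).exponential(scale=10.0, size=80)

se, (lo, hi) = bootstrap_se_ci(sample, np.median)
print(f"Sample median: {np.median(sample):.2f}")
print(f"Bootstrap SE:  {se:.2f}")
print(f"95% CI:        [{lo:.2f}, {hi:.2f}]")
```

Passing a different function as `statistic` (for example `np.mean` or a custom metric) reuses the same algorithm unchanged, which is the point of the method.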

---

## Wrap-Up: The Power of Bootstrap 🎒

### What We've Learned:

Bootstrap is a **revolutionary approach to statistical inference**:

1. **Flexibility**: Works for a wide range of statistics and distributions (be cautious with extremes such as the maximum, and with very small samples)
2. **Simplicity**: Same algorithm every time
3. **Power**: Handles situations where classical methods fail
4. **Modern**: Foundation of many ML techniques

### From Statistics to Machine Learning:

Understanding bootstrap helps you understand:
- **Why Random Forests work** (bootstrap + trees)
- **Why ensemble methods are powerful** (averaging reduces variance)
- **How to evaluate models robustly** (bootstrap validation)
- **How to quantify uncertainty in ML** (bootstrap CIs for performance)
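
As a minimal sketch of the last point, a percentile bootstrap can put an interval around a held-out accuracy score. The labels and predictions below are simulated stand-ins (a hypothetical classifier that is right about 85% of the time), not results from any model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated held-out labels and predictions (illustrative only)
n_test = 200
y_true = rng.integers(0, 2, size=n_test)
is_correct = rng.random(n_test) < 0.85
y_pred = np.where(is_correct, y_true, 1 - y_true)

def accuracy(t, p):
    """Fraction of predictions that match the true labels."""
    return np.mean(t == p)

n_boot = 5000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    # Resample test cases, keeping each (true, predicted) pair together
    idx = rng.integers(0, n_test, size=n_test)
    boot_acc[b] = accuracy(y_true[idx], y_pred[idx])

point = accuracy(y_true, y_pred)
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])

print(f"Accuracy: {point:.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

This interval reflects only the variability from having a finite test set. It does not capture how much the model itself would change if it were retrained on different data.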

### Next Module: Hypothesis Testing 🔬

In the next module (06-Hypothesis-Testing), we'll learn:
- Making decisions from data
- p-values and significance testing
- Type I and Type II errors
- ML connection: A/B testing, model comparison

But that's for the next module!

---

### You've Completed Statistical Inference Phase 1! 🎉

**You now understand**:
1. ✅ Sampling and sampling distributions
2. ✅ Central Limit Theorem (why inference works)
3. ✅ Point estimation and MLE (training is estimation!)
4. ✅ Confidence intervals (quantifying uncertainty)
5. ✅ Bootstrap methods (modern flexible inference)

**And you see how it all connects to ML**:
- Cross-validation = Sampling
- Ensemble averaging = Variance of a mean (the idea behind standard errors and the CLT)
- Training = MLE
- Model performance CIs = Confidence intervals
- **Random Forests = Bootstrap + Trees**

**Statistical inference IS the foundation of machine learning!** 🎯✨🌾

---

**Excellent work completing Phase 1 (Fundamentals)!**

**Next**: Phase 2 (From Scratch), Phase 3 (With SciPy), Phase 4 (Agricultural Applications)