# Point Estimation: From Samples to Parameters 🎯

## Introduction: Making Our Best Guess

In the previous notebooks, we learned about:
- **Sampling**: Taking subsets from populations
- **Sampling distributions**: How sample statistics vary
- **Central Limit Theorem**: Why sample means are approximately normal for large samples

But we haven't answered the fundamental question: **How do we actually estimate population parameters from our sample?**

### The Setup:

- 🌍 **Population**: Has unknown parameter θ (could be μ, σ², p, etc.)
- 📊 **Sample**: We observe data x₁, x₂, ..., xₙ
- 🎯 **Goal**: Estimate θ using our sample

### Real Example:

You sample 50 wheat fields and measure their yields. The average is 5.15 tons/hectare.

**Questions**:
- Is 5.15 our best estimate of the true population mean?
- How do we know it's a "good" estimate?
- Are there better ways to estimate the mean?

This is **point estimation**!

### ML Connection 🤖

**Training an ML model IS parameter estimation!**
- Linear regression: Estimating slope and intercept
- Logistic regression: Estimating coefficients
- Neural networks: Estimating millions of weights

**Maximum Likelihood Estimation (MLE)** is the foundation of most ML training algorithms!

---

## Learning Objectives 🎯

By the end of this notebook, you will:

1. ✅ Understand **point estimation** concept and terminology
2. ✅ Learn properties of good estimators (unbiased, consistent, efficient)
3. ✅ Master **Maximum Likelihood Estimation (MLE)** ⭐⭐
4. ✅ Understand **Method of Moments** estimation
5. ✅ Connect estimation to **ML model training** ⭐⭐
6. ✅ Implement MLE from scratch for common distributions

⭐⭐ = Most critical concept

---

Let's learn how to estimate! 🚀

In [None]:
# 📦 Setup: Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.optimize import minimize

# Set style for beautiful plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Setup complete!")
print("🎯 Ready to learn point estimation")

---

## 1. Point Estimation Basics 📊

### Key Terminology:

**Estimator (θ̂)**: A rule/formula for calculating an estimate from sample data
- Example: Sample mean X̄ = (1/n)Σxᵢ is an estimator
- An estimator is a **random variable** (changes with different samples)

**Estimate**: The actual numerical value from a specific sample
- Example: x̄ = 5.15 tons/hectare is an estimate
- An estimate is a **fixed number**

**Parameter (θ)**: The true population value we're trying to estimate
- Example: μ = true population mean (unknown)

### Common Estimators:

| Parameter | Estimator | Formula |
|-----------|-----------|----------|
| Population mean μ | Sample mean X̄ | (1/n)Σxᵢ |
| Population variance σ² | Sample variance s² | (1/(n-1))Σ(xᵢ - x̄)² |
| Population proportion p | Sample proportion p̂ | (# successes)/n |

### Example:

```python
# Sample data
yields = [5.1, 5.3, 4.9, 5.2, 5.4]

# Estimator: Sample mean
estimate = np.mean(yields)  # 5.18
```

- **Estimator**: np.mean() function (the rule)
- **Estimate**: 5.18 (the result)
- **Parameter**: μ (unknown true mean)

---

In [None]:
# 🌾 Example: Estimate population mean and variance from a sample

# True population (unknown to us in practice)
true_mean = 5.2
true_std = 0.8
true_variance = true_std ** 2

# Generate population
population = np.random.normal(true_mean, true_std, 100000)

# Take a sample (what we actually observe)
sample_size = 50
sample = np.random.choice(population, size=sample_size, replace=False)

# Calculate estimates
mean_estimate = sample.mean()
variance_estimate = sample.var(ddof=1)  # ddof=1 for unbiased estimate
std_estimate = np.sqrt(variance_estimate)

print("🎯 Point Estimation Example:")
print("=" * 60)
print("TRUE POPULATION PARAMETERS (unknown in practice):")
print(f"  μ (mean) = {true_mean:.3f} tons/hectare")
print(f"  σ² (variance) = {true_variance:.3f}")
print(f"  σ (std dev) = {true_std:.3f}")
print(f"\nSAMPLE DATA (n={sample_size}):")
print(f"  Observed yields: {sample[:5].round(2)}... (showing first 5)")
print(f"\nPOINT ESTIMATES:")
print(f"  μ̂ (estimated mean) = {mean_estimate:.3f} tons/hectare")
print(f"  σ̂² (estimated variance) = {variance_estimate:.3f}")
print(f"  σ̂ (estimated std dev) = {std_estimate:.3f}")
print(f"\nESTIMATION ERRORS:")
print(f"  Mean error: {abs(mean_estimate - true_mean):.3f}")
print(f"  Variance error: {abs(variance_estimate - true_variance):.3f}")
print(f"\n💡 Our estimates are close but not perfect (sampling variability!)")

In [None]:
# 📊 Visualization 1: Point estimate on distribution

plt.figure(figsize=(12, 6))

# Plot population distribution
x = np.linspace(population.min(), population.max(), 100)
plt.plot(x, stats.norm.pdf(x, true_mean, true_std), 'b-', linewidth=3, 
         alpha=0.5, label='True Population Distribution')

# Plot sample histogram
plt.hist(sample, bins=15, alpha=0.6, color='steelblue', edgecolor='black', 
         density=True, label=f'Sample (n={sample_size})')

# Mark true parameter
plt.axvline(true_mean, color='blue', linestyle='--', linewidth=2, 
            label=f'True μ = {true_mean:.2f}')

# Mark estimate
plt.axvline(mean_estimate, color='red', linestyle='-', linewidth=2, 
            label=f'Estimate μ̂ = {mean_estimate:.2f}')

plt.xlabel('Wheat Yield (tons/hectare)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Point Estimation: Sample Mean as Estimate of Population Mean 🎯', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 The sample mean (red line) is our point estimate of the true mean (blue line)")
print("   It's close, but not exact due to sampling variability!")

---

## 2. Properties of Good Estimators ✅

Not all estimators are created equal! We want estimators that have desirable properties:

### 1. Unbiased ⭐

**Definition**: An estimator θ̂ is unbiased if E[θ̂] = θ

In words: On average across many samples, the estimator equals the true parameter

$$
\text{Bias} = E[\hat{\theta}] - \theta
$$

**Examples**:
- Sample mean X̄ is unbiased for μ: E[X̄] = μ ✓
- Sample variance s² (with ddof=1) is unbiased for σ² ✓
- Sample variance (with ddof=0) is biased! ✗
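
The bias of the ddof=0 variance can be worked out directly. The short derivation below is a sketch that assumes i.i.d. observations with mean $\mu$ and finite variance $\sigma^2$; it starts from the identity

$$
\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2
$$

Taking expectations, and using $E[(x_i - \mu)^2] = \sigma^2$ and $E[(\bar{x} - \mu)^2] = \sigma^2 / n$:

$$
E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{1}{n}\left(n\sigma^2 - \sigma^2\right) = \frac{n-1}{n}\,\sigma^2
$$

So the ddof=0 estimator has bias $-\sigma^2/n$, and dividing by $n-1$ instead of $n$ removes it exactly. For n = 50 the shrink factor is 49/50 = 0.98, which is why the bias is small but visible in simulations.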

### 2. Consistent ⭐

**Definition**: As n → ∞, θ̂ → θ (the estimator converges in probability to the true value)

In words: With more data, the estimator gets arbitrarily close to the truth

### 3. Efficient ⭐

**Definition**: Among unbiased estimators, has the smallest variance

In words: Most precise estimator (tightest sampling distribution)
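
To make efficiency concrete, the sketch below compares two unbiased estimators of the centre of a normal distribution: the sample mean and the sample median. It is an illustration only; the seed, the sample size of 50, and the 10,000 repetitions are arbitrary assumptions, and the variable names are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed for this sketch
n_obs, n_sims = 50, 10_000               # illustrative sizes (assumptions)

# Each row is one simulated sample from N(5.2, 0.8^2)
sims = rng.normal(5.2, 0.8, size=(n_sims, n_obs))

# Two unbiased estimators of the centre of a normal distribution
est_mean = sims.mean(axis=1)
est_median = np.median(sims, axis=1)

# Variance of each estimator across the simulated samples
var_mean = est_mean.var()
var_median = est_median.var()

print(f"Variance of sample mean:   {var_mean:.5f}")
print(f"Variance of sample median: {var_median:.5f}")
print(f"Ratio (median / mean):     {var_median / var_mean:.2f}")
```

For normal data the sample mean is the more efficient of the two: the variance ratio approaches π/2 ≈ 1.57 as n grows, so the median needs roughly 57% more data to match the mean's precision.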

### Mean Squared Error (MSE):

Combines bias and variance:

$$
MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Variance}
$$

**Goal**: Minimize MSE (trade-off between bias and variance)
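
The decomposition can be checked numerically. The sketch below is illustrative rather than part of the original analysis: it assumes normal data with σ² = 0.64 (matching the wheat example), an arbitrary small sample size of 10, and 20,000 simulated samples, and compares the ddof=0 and ddof=1 variance estimators.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed for this sketch
true_var = 0.8 ** 2                      # sigma^2 from the wheat example
n_obs, n_sims = 10, 20_000               # illustrative sizes (assumptions)

# Each row is one simulated sample of size n_obs
sims = rng.normal(5.2, 0.8, size=(n_sims, n_obs))

mse_results = {}
for ddof in (0, 1):
    estimates = sims.var(axis=1, ddof=ddof)
    bias = estimates.mean() - true_var
    variance = estimates.var()           # spread of the estimator itself
    mse = np.mean((estimates - true_var) ** 2)
    mse_results[ddof] = (bias, variance, mse)
    print(f"ddof={ddof}: bias={bias:+.4f}, variance={variance:.4f}, "
          f"bias^2+variance={bias**2 + variance:.4f}, MSE={mse:.4f}")
```

The last two printed columns agree for each estimator, which is the decomposition at work. For normal data the biased ddof=0 estimator typically ends up with the smaller MSE, a reminder that unbiasedness and minimum MSE are different goals.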

---

In [None]:
# 🔬 Demonstrate unbiasedness
# Take many samples, show that average of estimates equals true parameter

n_simulations = 1000
sample_size = 50

# Store estimates from each sample
mean_estimates = []
var_estimates_biased = []  # ddof=0 (biased)
var_estimates_unbiased = []  # ddof=1 (unbiased)

for _ in range(n_simulations):
    sample = np.random.choice(population, size=sample_size, replace=False)
    mean_estimates.append(sample.mean())
    var_estimates_biased.append(sample.var(ddof=0))
    var_estimates_unbiased.append(sample.var(ddof=1))

mean_estimates = np.array(mean_estimates)
var_estimates_biased = np.array(var_estimates_biased)
var_estimates_unbiased = np.array(var_estimates_unbiased)

print("🔬 Demonstrating Unbiasedness:")
print("=" * 60)
print(f"Simulation: {n_simulations} samples of size n={sample_size}")
print(f"\n1. SAMPLE MEAN (estimator for μ):")
print(f"   True μ = {true_mean:.4f}")
print(f"   Average of {n_simulations} estimates = {mean_estimates.mean():.4f}")
print(f"   Bias = {mean_estimates.mean() - true_mean:.4f} ✓ UNBIASED")

print(f"\n2. SAMPLE VARIANCE with ddof=0 (biased):")
print(f"   True σ² = {true_variance:.4f}")
print(f"   Average of {n_simulations} estimates = {var_estimates_biased.mean():.4f}")
print(f"   Bias = {var_estimates_biased.mean() - true_variance:.4f} ✗ BIASED (underestimates)")

print(f"\n3. SAMPLE VARIANCE with ddof=1 (unbiased):")
print(f"   True σ² = {true_variance:.4f}")
print(f"   Average of {n_simulations} estimates = {var_estimates_unbiased.mean():.4f}")
print(f"   Bias = {var_estimates_unbiased.mean() - true_variance:.4f} ✓ UNBIASED")

print(f"\n💡 Unbiased estimator: E[θ̂] = θ (average equals true value)")
print(f"   This is why we use ddof=1 for sample variance!")

In [None]:
# 📊 Visualization 2: Sampling distributions showing unbiasedness

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Mean estimates (unbiased)
axes[0].hist(mean_estimates, bins=40, alpha=0.7, color='green', 
             edgecolor='black', density=True)
axes[0].axvline(true_mean, color='black', linestyle='--', linewidth=2, 
                label=f'True μ = {true_mean:.2f}')
axes[0].axvline(mean_estimates.mean(), color='red', linestyle='-', linewidth=2,
                label=f'E[μ̂] = {mean_estimates.mean():.2f}')
axes[0].set_xlabel('Estimate of μ', fontsize=11)
axes[0].set_ylabel('Density', fontsize=11)
axes[0].set_title('Sample Mean: UNBIASED ✓', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].text(0.55, 0.95, 'Centered at\ntrue value!',
             transform=axes[0].transAxes, fontsize=10,
             verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# Right: Variance estimates (biased vs unbiased)
axes[1].hist(var_estimates_biased, bins=40, alpha=0.5, color='red', 
             edgecolor='black', density=True, label=f'ddof=0 (biased)')
axes[1].hist(var_estimates_unbiased, bins=40, alpha=0.5, color='green', 
             edgecolor='black', density=True, label=f'ddof=1 (unbiased)')
axes[1].axvline(true_variance, color='black', linestyle='--', linewidth=2, 
                label=f'True σ² = {true_variance:.2f}')
axes[1].axvline(var_estimates_biased.mean(), color='darkred', linestyle=':', 
                linewidth=1.5, alpha=0.7)
axes[1].axvline(var_estimates_unbiased.mean(), color='darkgreen', linestyle=':', 
                linewidth=1.5, alpha=0.7)
axes[1].set_xlabel('Estimate of σ²', fontsize=11)
axes[1].set_ylabel('Density', fontsize=11)
axes[1].set_title('Sample Variance: Effect of ddof', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

plt.suptitle('Unbiasedness: Does E[θ̂] = θ? 🎯', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Key Observation:")
print("   - Green distribution (ddof=1) is centered at true σ² ✓")
print("   - Red distribution (ddof=0) is shifted left (underestimates) ✗")
print("   - Always use ddof=1 for unbiased variance estimation!")

In [None]:
# 📏 Demonstrate consistency
# Show estimates get closer to truth as n increases

sample_sizes = [10, 25, 50, 100, 200, 500]
n_sims = 500

results = {}

for n in sample_sizes:
    estimates = []
    for _ in range(n_sims):
        sample = np.random.choice(population, size=n, replace=False)
        estimates.append(sample.mean())
    results[n] = np.array(estimates)

print("📏 Demonstrating Consistency:")
print("=" * 60)
print(f"True μ = {true_mean:.4f}")
print(f"\n{'Sample Size':<12} {'Mean of Estimates':<20} {'Std of Estimates':<20}")
print("-" * 60)

for n in sample_sizes:
    mean_est = results[n].mean()
    std_est = results[n].std()
    print(f"{n:<12} {mean_est:<20.4f} {std_est:<20.4f}")

print("\n💡 Notice:")
print("   - As n increases, estimates cluster tighter around true value")
print("   - Standard deviation decreases (proportional to 1/√n)")
print("   - This is CONSISTENCY: estimator → true value as n → ∞")

In [None]:
# 📊 Visualization 3: Consistency (distributions get narrower)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()
fig.suptitle('Consistency: Estimates Converge as n Increases 📏', 
             fontsize=16, fontweight='bold')

for idx, n in enumerate(sample_sizes):
    ax = axes[idx]
    estimates = results[n]
    
    # Histogram
    ax.hist(estimates, bins=30, alpha=0.7, color='steelblue', 
            edgecolor='black', density=True)
    
    # Mark true value
    ax.axvline(true_mean, color='red', linestyle='--', linewidth=2, 
               label=f'True μ = {true_mean:.2f}')
    
    # Mark mean of estimates
    ax.axvline(estimates.mean(), color='green', linestyle='-', linewidth=1.5,
               alpha=0.7, label=f'E[μ̂] = {estimates.mean():.2f}')
    
    # Statistics box
    textstr = f'n = {n}\nStd = {estimates.std():.3f}\nRange = {estimates.max()-estimates.min():.3f}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
    ax.text(0.65, 0.95, textstr, transform=ax.transAxes, fontsize=9,
            verticalalignment='top', bbox=props)
    
    ax.set_xlabel('Estimate', fontsize=10)
    ax.set_ylabel('Density', fontsize=10)
    ax.set_title(f'Sample Size n = {n}', fontsize=11, fontweight='bold')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)
    
    # Same x-axis for comparison
    ax.set_xlim(4.6, 5.8)

plt.tight_layout()
plt.show()

print("\n💡 Consistency Visualized:")
print("   - Small n: Wide distribution (high variability)")
print("   - Large n: Narrow distribution (low variability)")
print("   - All centered at true value (unbiased + consistent)")

---

## 3. Maximum Likelihood Estimation (MLE) ⭐⭐⭐

### The Gold Standard of Estimation

**Idea**: Choose the parameter value that makes the observed data most likely

### Likelihood Function:

For independent observations x = (x₁, x₂, ..., xₙ) and parameter θ:

$$
L(\theta | x) = \prod_{i=1}^{n} f(x_i | \theta)
$$

Where f(xᵢ | θ) is the probability density/mass function

### Log-Likelihood:

For computational reasons, we maximize the log-likelihood:

$$
\ell(\theta | x) = \ln L(\theta | x) = \sum_{i=1}^{n} \ln f(x_i | \theta)
$$
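
The "computational reasons" are easy to see in floating point. The sketch below is illustrative: it assumes a hypothetical sample of 2,000 draws from N(5.2, 0.8²) and evaluates the likelihood both as a raw product and as a sum of logs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)                    # arbitrary seed for this sketch
big_sample = rng.normal(5.2, 0.8, size=2000)      # hypothetical large sample

# Every density value is below 0.5 here, so the product shrinks extremely fast
densities = stats.norm.pdf(big_sample, loc=5.2, scale=0.8)
raw_likelihood = np.prod(densities)

# The same information on the log scale stays comfortably representable
log_lik_sum = np.sum(stats.norm.logpdf(big_sample, loc=5.2, scale=0.8))

print(f"Product of densities: {raw_likelihood}")
print(f"Sum of log-densities: {log_lik_sum:.2f}")
```

The product underflows to zero in double precision, so it cannot be compared across parameter values, while the log-likelihood remains a finite number. Because the logarithm is monotonic, both have their maximum at the same θ.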

### MLE Principle:

$$
\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta | x) = \arg\max_{\theta} \ell(\theta | x)
$$
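
As a worked instance of the principle, consider i.i.d. data from an exponential distribution with rate $\lambda$, so that $f(x | \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$. The choice of this distribution is only for illustration; the steps are the usual ones of writing the log-likelihood, differentiating, and solving.

$$
\ell(\lambda | x) = \sum_{i=1}^{n} \ln\left(\lambda e^{-\lambda x_i}\right) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i
$$

$$
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\quad\Rightarrow\quad
\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}
$$

The second derivative, $-n/\lambda^2$, is negative for every $\lambda > 0$, so this stationary point is a maximum.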

### Why MLE is Great:

Under standard regularity conditions:

1. ✅ **Consistent**: θ̂ₘₗₑ → θ as n → ∞
2. ✅ **Asymptotically efficient**: Attains the lowest possible variance (the Cramér-Rao bound) as n grows
3. ✅ **Asymptotically normal**: θ̂ₘₗₑ ~ N(θ, ...) approximately, for large n
4. ✅ **Invariant**: If θ̂ is MLE for θ, then g(θ̂) is MLE for g(θ)
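
When no closed form is available, the MLE can be found numerically. The sketch below is a minimal illustration, assuming a hypothetical normal sample and using `scipy.optimize.minimize` on the negative log-likelihood. The function name `normal_nll` and the starting point are assumptions made for this example. Optimizing over log σ keeps σ positive, and by the invariance property the exponential of the fitted log σ is the MLE of σ.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(3)               # arbitrary seed for this sketch
obs = rng.normal(5.2, 0.8, size=50)          # hypothetical sample

def normal_nll(params, x):
    """Negative log-likelihood of N(mu, sigma^2), parametrized by (mu, log sigma)."""
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

start = np.array([obs.mean() + 1.0, 0.0])    # deliberately off-target starting point
result = minimize(normal_nll, start, args=(obs,), method="BFGS")

# Invariance: exp(MLE of log sigma) is the MLE of sigma
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form MLEs for comparison
mu_closed = obs.mean()
sigma_closed = obs.std(ddof=0)

print(f"Numerical MLE:   mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"Closed-form MLE: mu = {mu_closed:.4f}, sigma = {sigma_closed:.4f}")
```

The two rows should agree to several decimal places, which is a useful sanity check before applying the same numerical recipe to models without a closed-form solution.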

### ML Connection:

**Training is often MLE!**
- Squared-error loss (linear regression) → Negative log-likelihood under Gaussian noise
- Cross-entropy loss → Negative log-likelihood
- Most ML training = finding MLE!

---

In [None]:
# 🎯 MLE Example 1: Normal Distribution
# Given data, find μ and σ that maximize likelihood

# Sample data
np.random.seed(42)
true_mu = 5.2
true_sigma = 0.8
n = 50
data = np.random.normal(true_mu, true_sigma, n)

# MLE for normal distribution (closed-form solution)
mu_mle = data.mean()
sigma_mle = np.sqrt(((data - mu_mle)**2).sum() / n)  # MLE uses n, not n-1!

print("🎯 MLE for Normal Distribution:")
print("=" * 60)
print(f"Sample size: n = {n}")
print(f"\nTRUE PARAMETERS:")
print(f"  μ = {true_mu}")
print(f"  σ = {true_sigma}")
print(f"\nMAXIMUM LIKELIHOOD ESTIMATES:")
print(f"  μ̂_MLE = {mu_mle:.4f}")
print(f"  σ̂_MLE = {sigma_mle:.4f}")
print(f"\n💡 MLE Formulas for Normal Distribution:")
print(f"   μ̂_MLE = (1/n)Σxᵢ = sample mean")
print(f"   σ̂_MLE = √[(1/n)Σ(xᵢ-μ̂)²] (note: uses n, not n-1!)")
print(f"\n⚠️ Note: σ̂²_MLE is slightly biased (underestimates σ² on average, most noticeably for small n)")
print(f"         Use n-1 (ddof=1) for an unbiased estimate of σ²")

In [None]:
# 📊 Visualization 4: Likelihood function (finding the maximum)

# Compute likelihood for different values of μ (fixing σ)
mu_values = np.linspace(4.5, 6.0, 100)
log_likelihoods = []

for mu in mu_values:
    # Log-likelihood: sum of log probabilities
    log_lik = np.sum(stats.norm.logpdf(data, loc=mu, scale=true_sigma))
    log_likelihoods.append(log_lik)

log_likelihoods = np.array(log_likelihoods)

# Find maximum
max_idx = np.argmax(log_likelihoods)
mu_at_max = mu_values[max_idx]

plt.figure(figsize=(12, 6))

# Plot log-likelihood curve
plt.plot(mu_values, log_likelihoods, 'b-', linewidth=2, 
         label='Log-Likelihood ℓ(μ)')

# Mark the maximum
plt.scatter([mu_at_max], [log_likelihoods[max_idx]], s=200, c='red', 
            marker='*', zorder=5, edgecolors='black', linewidths=1.5,
            label=f'MLE: μ̂ = {mu_at_max:.3f}')

# Mark true value
plt.axvline(true_mu, color='green', linestyle='--', linewidth=2, alpha=0.7,
            label=f'True μ = {true_mu}')

# Mark sample mean
plt.axvline(mu_mle, color='orange', linestyle=':', linewidth=2, alpha=0.7,
            label=f'Sample mean = {mu_mle:.3f}')

plt.xlabel('Parameter μ (mean)', fontsize=12)
plt.ylabel('Log-Likelihood ℓ(μ)', fontsize=12)
plt.title('Maximum Likelihood Estimation: Finding μ that Maximizes Likelihood 🎯', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

# Add annotation
plt.annotate('Maximum!', xy=(mu_at_max, log_likelihoods[max_idx]), 
             xytext=(mu_at_max+0.3, log_likelihoods[max_idx]+5),
             arrowprops=dict(arrowstyle='->', lw=2, color='red'),
             fontsize=12, fontweight='bold', color='red')

plt.tight_layout()
plt.show()

print("\n💡 MLE Principle:")
print("   - Try different parameter values")
print("   - For each, calculate: How likely is our data given this parameter?")
print("   - Choose parameter that makes data MOST likely (maximum!)")
print(f"   - For normal distribution, this equals the sample mean!")

In [None]:
# üéØ MLE Example 2: Exponential Distribution
# Time between pest occurrences ~ Exponential(Œª)

# Generate data
np.random.seed(42)
true_lambda = 0.5  # rate: 0.5 events per day
n = 100
pest_times = np.random.exponential(scale=1/true_lambda, size=n)

# MLE for exponential: λ̂ = 1/mean(x)
lambda_mle = 1 / pest_times.mean()

print("🐛 MLE for Exponential Distribution (Pest Times):")
print("=" * 60)
print(f"Sample size: n = {n} observations")
print(f"\nTRUE PARAMETER:")
print(f"  λ = {true_lambda} events/day")
print(f"  Mean time between events = {1/true_lambda} days")
print(f"\nSAMPLE DATA:")
print(f"  Observed mean time = {pest_times.mean():.3f} days")
print(f"\nMAXIMUM LIKELIHOOD ESTIMATE:")
print(f"  λ̂_MLE = {lambda_mle:.4f} events/day")
print(f"  Estimated mean time = {1/lambda_mle:.3f} days")
print(f"\n💡 MLE Formula for Exponential:")
print(f"   λ̂_MLE = 1 / sample_mean")
print(f"   Simple and intuitive!")

In [None]:
# 📊 Visualization 5: Fitted distribution with MLE parameters

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Exponential data with fitted distributions
axes[0].hist(pest_times, bins=25, alpha=0.6, color='steelblue', 
             edgecolor='black', density=True, label='Observed Data')

x = np.linspace(0, pest_times.max(), 100)

# True distribution
axes[0].plot(x, stats.expon.pdf(x, scale=1/true_lambda), 'g--', 
             linewidth=2, label=f'True (λ={true_lambda})')

# MLE fitted distribution
axes[0].plot(x, stats.expon.pdf(x, scale=1/lambda_mle), 'r-', 
             linewidth=2, label=f'MLE Fit (λ̂={lambda_mle:.3f})')

axes[0].set_xlabel('Days Between Pest Events', fontsize=11)
axes[0].set_ylabel('Density', fontsize=11)
axes[0].set_title('MLE Fitted Distribution 🎯', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Right: Likelihood function for λ
lambda_values = np.linspace(0.2, 0.8, 100)
log_likelihoods = []

for lam in lambda_values:
    log_lik = np.sum(stats.expon.logpdf(pest_times, scale=1/lam))
    log_likelihoods.append(log_lik)

log_likelihoods = np.array(log_likelihoods)

# Log-likelihood evaluated exactly at the closed-form MLE (not at the nearest grid point)
log_lik_at_mle = np.sum(stats.expon.logpdf(pest_times, scale=1/lambda_mle))

axes[1].plot(lambda_values, log_likelihoods, 'b-', linewidth=2)
axes[1].scatter([lambda_mle], [log_lik_at_mle], 
                s=200, c='red', marker='*', zorder=5, edgecolors='black', 
                linewidths=1.5, label=f'MLE: λ̂={lambda_mle:.3f}')
axes[1].axvline(true_lambda, color='green', linestyle='--', linewidth=2, 
                alpha=0.7, label=f'True λ={true_lambda}')
axes[1].set_xlabel('Parameter λ (rate)', fontsize=11)
axes[1].set_ylabel('Log-Likelihood', fontsize=11)
axes[1].set_title('Likelihood Function', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.suptitle('MLE for Exponential Distribution 🐛', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 MLE gives us the 'best fit' distribution for our data!")

---

## 4. Method of Moments (MoM) 📊

### Alternative Estimation Method

**Idea**: Match sample moments to population moments

### Moments:

- **1st moment**: Mean E[X] = μ
- **2nd moment**: E[X²]
- **kth moment**: E[Xᵏ]

### Method:

1. Express population moments in terms of parameters
2. Set sample moments equal to population moments
3. Solve for parameters

### Example (Normal Distribution):

Population moments:
- E[X] = μ
- Var[X] = E[X²] - (E[X])² = σ²

Sample moments:
- (1/n)Σxᵢ
- (1/n)Σxᵢ² - ((1/n)Σxᵢ)²

Set equal and solve:
- μ̂ = (1/n)Σxᵢ
- σ̂² = (1/n)Σ(xᵢ - μ̂)²

### MoM vs MLE:

- **MoM**: Simple, intuitive, easy to compute
- **MLE**: More efficient, better properties, but sometimes harder to compute
- For many distributions (like normal), they give the same estimates!
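
The two methods do not always agree. A standard textbook case where they differ is Uniform(0, θ); the sketch below uses it as an illustration, with a hypothetical true θ of 10 and an arbitrary sample size of 30. Matching the first moment E[X] = θ/2 gives the MoM estimate 2x̄, while the likelihood θ⁻ⁿ (for θ ≥ max xᵢ) is largest at the smallest admissible θ, the sample maximum.

```python
import numpy as np

rng = np.random.default_rng(4)               # arbitrary seed for this sketch
theta_true = 10.0                            # hypothetical upper bound
u = rng.uniform(0.0, theta_true, size=30)    # hypothetical sample

# Method of Moments: set sample mean equal to E[X] = theta / 2
theta_mom = 2.0 * u.mean()

# Maximum likelihood: the smallest theta that is compatible with every observation
theta_mle = u.max()

print(f"True theta:   {theta_true:.3f}")
print(f"MoM estimate: {theta_mom:.3f}")
print(f"MLE estimate: {theta_mle:.3f}")
print(f"MoM estimate below largest observation? {theta_mom < u.max()}")
```

Each estimator has a visible weakness here: the MoM estimate can fall below the largest observation, which is logically impossible for θ, while the MLE can never exceed the true θ and is therefore biased downward.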

---

In [None]:
# 📊 Method of Moments vs MLE comparison

# Use the same normal data from before
# (len(data) rather than n, because n was reassigned in the exponential example)
print("📊 Method of Moments vs Maximum Likelihood:")
print("=" * 60)
print(f"Sample size: n = {len(data)}")
print(f"\n1. METHOD OF MOMENTS (MoM):")
print(f"   μ̂_MoM = (1/n)Σxᵢ = {data.mean():.4f}")
print(f"   σ̂²_MoM = (1/n)Σ(xᵢ-μ̂)² = {((data - data.mean())**2).mean():.4f}")

print(f"\n2. MAXIMUM LIKELIHOOD (MLE):")
print(f"   μ̂_MLE = {mu_mle:.4f}")
print(f"   σ̂²_MLE = {sigma_mle**2:.4f}")

print(f"\n3. UNBIASED ESTIMATES:")
var_unbiased = data.var(ddof=1)
print(f"   μ̂ = {data.mean():.4f} (same as MLE and MoM)")
print(f"   σ̂²_unbiased = {var_unbiased:.4f} (using n-1)")

print(f"\n💡 For Normal Distribution:")
print(f"   - MoM and MLE give same μ̂ estimate")
print(f"   - MoM and MLE give same σ̂ estimate (both use n)")
print(f"   - For unbiased σ̂², use n-1 instead of n")
print(f"   - In general (beyond the normal case), MLE is asymptotically at least as efficient as MoM")

---

## 5. Machine Learning Connection ⭐⭐⭐

### Training is Parameter Estimation!

When you train an ML model, you're doing **parameter estimation** (usually MLE):

#### 1. Linear Regression = MLE

**Model**: y = β₀ + β₁x + ε, where ε ~ N(0, σ²)

**MLE**: Minimize sum of squared errors (equivalent to maximizing likelihood!)

$$
\hat{\beta}_{MLE} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
$$
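
The equivalence follows from writing out the Gaussian log-likelihood of the model above. This is a short sketch of the standard argument, assuming independent errors with a common variance $\sigma^2$:

$$
\ell(\beta, \sigma^2 | x, y) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
$$

The first term does not involve $\beta$, and the second is the sum of squared errors multiplied by the negative constant $-1/(2\sigma^2)$. Maximizing $\ell$ over $\beta$ is therefore the same as minimizing the sum of squared errors, whatever the value of $\sigma^2$.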

#### 2. Logistic Regression = MLE

**Model**: P(y=1|x) = σ(β₀ + β₁x), where σ here denotes the sigmoid function σ(z) = 1/(1 + e⁻ᶻ), not the standard deviation

**MLE**: Maximize log-likelihood (cross-entropy loss!)

$$
\hat{\beta}_{MLE} = \arg\max_{\beta} \sum_{i=1}^{n} [y_i \log p_i + (1-y_i) \log(1-p_i)]
$$
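
The sketch below fits a one-feature logistic regression by directly minimizing the negative of the log-likelihood above, then confirms that the minimized value equals the summed cross-entropy. It is a minimal illustration under stated assumptions: the data are simulated, the "true" coefficients (0.5, 1.5) are hypothetical, and `logistic_nll` is a helper name chosen for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)                           # arbitrary seed for this sketch
n_obs = 300
x_feat = rng.normal(0.0, 1.0, n_obs)                     # hypothetical standardized feature
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x_feat)))     # hypothetical true coefficients
labels = (rng.random(n_obs) < p_true).astype(int)

def logistic_nll(beta, x, y):
    """Bernoulli negative log-likelihood, i.e. the summed cross-entropy loss."""
    z = beta[0] + beta[1] * x
    # -[y*log(p) + (1-y)*log(1-p)] simplifies to log(1 + e^z) - y*z;
    # logaddexp(0, z) computes log(1 + e^z) without overflow
    return np.sum(np.logaddexp(0.0, z) - y * z)

fit = minimize(logistic_nll, x0=np.zeros(2), args=(x_feat, labels), method="BFGS")
beta_hat = fit.x

# Cross-entropy written out explicitly from the fitted probabilities
p_hat = 1.0 / (1.0 + np.exp(-(beta_hat[0] + beta_hat[1] * x_feat)))
cross_entropy = -np.sum(labels * np.log(p_hat) + (1 - labels) * np.log(1 - p_hat))

print(f"Fitted coefficients: beta0 = {beta_hat[0]:.3f}, beta1 = {beta_hat[1]:.3f}")
print(f"Minimized negative log-likelihood: {fit.fun:.4f}")
print(f"Cross-entropy at the fit:          {cross_entropy:.4f}")
```

The last two numbers coincide because they are the same quantity written two ways, which is the sense in which minimizing cross-entropy is maximum likelihood estimation.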

#### 3. Neural Networks = MLE

**Cross-entropy loss** = Negative log-likelihood

**Training** = Finding MLE of network parameters!

### Key Insight:

**Most of the time you train a model, you're doing MLE!**

- Loss function = Negative log-likelihood
- Training = Maximizing likelihood (minimizing negative log-likelihood)
- Learned parameters = MLE estimates
- Adding a regularization penalty (L1/L2) gives a penalized version of MLE rather than the plain MLE

---

In [None]:
# 🤖 ML Example: Logistic Regression as MLE

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Generate classification data
np.random.seed(42)
n = 200

# Feature: soil nitrogen level
X = np.random.normal(7.0, 2.0, n).reshape(-1, 1)

# Target: high yield (1) or low yield (0)
# Probability depends on nitrogen
prob_high_yield = 1 / (1 + np.exp(-0.8 * (X.ravel() - 7.0)))
y = (np.random.random(n) < prob_high_yield).astype(int)

# Train logistic regression (MLE!)
# Note: sklearn applies L2 regularization by default (C=1.0), which is penalized MLE.
# A very large C makes the penalty negligible, so the fit approximates the plain MLE.
model = LogisticRegression(C=1e9, max_iter=1000, random_state=42)
model.fit(X, y)

# Get MLE parameters
beta0_mle = model.intercept_[0]
beta1_mle = model.coef_[0][0]

# Calculate log-likelihood
y_pred_proba = model.predict_proba(X)[:, 1]
log_likelihood = -log_loss(y, y_pred_proba, normalize=False)

print("🤖 Logistic Regression as MLE:")
print("=" * 60)
print(f"Sample size: n = {n}")
print(f"\nPROBLEM: Predict high yield based on soil nitrogen")
print(f"\nLOGISTIC MODEL:")
print(f"  P(high yield | nitrogen) = σ(β₀ + β₁ × nitrogen)")
print(f"\nMLE ESTIMATES (from sklearn):")
print(f"  β̂₀ = {beta0_mle:.4f}")
print(f"  β̂₁ = {beta1_mle:.4f}")
print(f"\nLog-Likelihood at MLE: {log_likelihood:.2f}")
print(f"\n💡 Key Insight:")
print(f"   - With the penalty effectively off, model.fit() finds parameters that MAXIMIZE likelihood")
print(f"   - This is exactly MLE!")
print(f"   - Cross-entropy loss = Negative log-likelihood")
print(f"   - Training = MLE parameter estimation")

In [None]:
# 📊 Visualization 6: Likelihood surface for logistic regression

# Create grid of parameter values
beta0_range = np.linspace(beta0_mle - 2, beta0_mle + 2, 50)
beta1_range = np.linspace(beta1_mle - 1, beta1_mle + 1, 50)
Beta0, Beta1 = np.meshgrid(beta0_range, beta1_range)

# Calculate log-likelihood for each combination
LogLik = np.zeros_like(Beta0)

for i in range(Beta0.shape[0]):
    for j in range(Beta0.shape[1]):
        b0, b1 = Beta0[i, j], Beta1[i, j]
        # Calculate predicted probabilities
        logits = b0 + b1 * X.ravel()
        probs = 1 / (1 + np.exp(-logits))
        # Log-likelihood
        log_lik = np.sum(y * np.log(probs + 1e-10) + (1 - y) * np.log(1 - probs + 1e-10))
        LogLik[i, j] = log_lik

# Create plot
plt.figure(figsize=(12, 8))

# Contour plot
contour = plt.contourf(Beta0, Beta1, LogLik, levels=20, cmap='viridis', alpha=0.8)
plt.colorbar(contour, label='Log-Likelihood')

# Mark the MLE
plt.scatter([beta0_mle], [beta1_mle], s=300, c='red', marker='*', 
            edgecolors='white', linewidths=2, zorder=5,
            label=f'MLE: (β̂₀={beta0_mle:.2f}, β̂₁={beta1_mle:.2f})')

plt.xlabel('Intercept β₀', fontsize=12)
plt.ylabel('Coefficient β₁', fontsize=12)
plt.title('Likelihood Surface: Training Finds the Peak (MLE)! 🏔️', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3, color='white')

# Add annotation
plt.annotate('Maximum\nLikelihood!', xy=(beta0_mle, beta1_mle), 
             xytext=(beta0_mle-1, beta1_mle+0.5),
             arrowprops=dict(arrowstyle='->', lw=2, color='white'),
             fontsize=12, fontweight='bold', color='white',
             bbox=dict(boxstyle='round', facecolor='red', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n💡 ML Training Visualized:")
print("   - Each point (β₀, β₁) gives a different model")
print("   - Color shows log-likelihood (brighter = more likely)")
print("   - Training algorithm searches this space")
print("   - Finds peak (maximum likelihood) = MLE")
print("   - This is what model.fit() does!")

---

## Key Takeaways 🎯

### Point Estimation:

1. ✅ **Estimator vs Estimate**:
   - Estimator: Rule/formula (random variable)
   - Estimate: Specific numerical result (fixed number)
   - Parameter: True population value (unknown)

2. ✅ **Good Estimator Properties**:
   - **Unbiased**: E[θ̂] = θ (correct on average)
   - **Consistent**: θ̂ → θ as n → ∞ (converges to truth)
   - **Efficient**: Lowest variance among unbiased estimators

3. ✅ **Maximum Likelihood Estimation (MLE)** ⭐⭐:
   - Choose θ that makes data most likely
   - Maximize L(θ|x) or log-likelihood ℓ(θ|x)
   - Gold standard: consistent, asymptotically efficient, asymptotically normal

4. ✅ **Method of Moments (MoM)**:
   - Match sample moments to population moments
   - Simple and intuitive
   - Often gives same result as MLE

5. ✅ **ML Connection** ⭐⭐⭐:
   - **Training is parameter estimation!**
   - Most ML training = MLE
   - Loss functions = Negative log-likelihood
   - Linear regression = MLE with normal errors
   - Logistic regression = MLE with Bernoulli outcomes
   - Neural networks with cross-entropy loss = MLE

### Critical Formula:

$$
\boxed{\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta | x) = \arg\max_{\theta} \sum_{i=1}^{n} \ln f(x_i | \theta)}
$$

**This is the foundation of machine learning training!**

---

## Next Steps 🚀

**Coming Up Next: Confidence Intervals** ⭐⭐

We've learned to make **point estimates**, but every estimate has uncertainty!

**Question**: If μ̂ = 5.15, how confident are we? Is the true μ between 5.0 and 5.3?

In the next notebook, we'll learn:
- **Confidence Intervals**: Quantify uncertainty in estimates
- **Correct interpretation**: What does "95% confidence" really mean?
- **Construction**: Using t-distribution for unknown σ
- **ML Application**: Reporting model performance with uncertainty ⭐

**Example**: 
- Point estimate: "Model accuracy = 0.85"
- With CI: "Model accuracy = 0.85 ± 0.03 (95% CI: [0.82, 0.88])"

The second statement is much more informative!

See you in **`04_confidence_intervals.ipynb`**!

---

**Excellent work! You now understand the foundations of parameter estimation and how ML training works!** 🎯✨🌾