<link rel="stylesheet" href="https://unpkg.com/sakura.css/css/sakura-dark.css" media="screen" />
<link rel="stylesheet" href="/notebooks/sakura-vader.css" media="screen" />

<a href="/">
    <h1>🏠 Back</h1>
</a>

# Understanding Bias in Estimators

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tanduong/tanduong.github.io/blob/main/notebooks/bias.ipynb)

Bias in statistics measures the difference between an estimator's expected value and the true parameter value it aims to estimate. It's a key concept for assessing the accuracy of statistical methods. In this post, we'll explore the concept of bias, provide examples of biased and unbiased estimators, and use Python to illustrate these ideas through simulations.

---

## Definition

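The bias of an estimator $\hat{\theta}_m$ of a parameter $\theta$ is defined as:
$$
\text{bias}(\hat{\theta}_m) = E(\hat{\theta}_m) - \theta
$$

Here $\hat{\theta}_m$ is computed from a sample of $m$ observations (the examples below write the sample size as $n$), and the expectation is taken over repeated samples. An estimator is unbiased when its bias is zero.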
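
The definition can be approximated numerically by averaging an estimator over many independent samples. The following is a minimal sketch, not a definitive implementation: the helper names `estimate_bias` and `normal_sampler`, the trial count, and the normal distribution parameters are all assumptions chosen for illustration.

```python
import numpy as np

def estimate_bias(estimator, sampler, true_value, sample_size,
                  num_trials=20_000, seed=0):
    """Monte Carlo approximation of E[estimator] - true_value.

    `estimator` maps a 1-D array to a number; `sampler(rng, n)` returns
    n fresh observations. Both are hypothetical helpers for this sketch.
    """
    rng = np.random.default_rng(seed)
    estimates = np.array(
        [estimator(sampler(rng, sample_size)) for _ in range(num_trials)]
    )
    return estimates.mean() - true_value

def normal_sampler(rng, n):
    # Assumed distribution for the demo: N(50, 5^2)
    return rng.normal(loc=50.0, scale=5.0, size=n)

bias_of_mean = estimate_bias(np.mean, normal_sampler,
                             true_value=50.0, sample_size=30)
print(f"Estimated bias of the sample mean: {bias_of_mean:.4f}")
```

The result is itself a random quantity, so it will be close to the true bias rather than exactly equal to it. Increasing `num_trials` narrows the gap.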

The examples below work through the bias of several common estimators.

## Examples

### Example 1: Sample Mean as an Estimator of Population Mean
Consider the sample mean $\bar{X}$ as an estimator of the population mean $\mu$.

**Estimator:**
$$
\hat{\theta}_m = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i
$$

**Bias:**
If each $X_i$ has mean $\mu$, linearity of expectation gives $E(\bar{X}) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \mu$, so the sample mean is unbiased:
$$
\text{bias}(\bar{X}) = E(\bar{X}) - \mu = \mu - \mu = 0
$$

### Example 2: Sample Variance as an Estimator of Population Variance
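Consider the uncorrected sample variance $S^2$, which divides by $n$, as an estimator of the population variance $\sigma^2$. (Many texts reserve the symbol $S^2$ for the version that divides by $n-1$.)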

**Estimator:**
$$
\hat{\theta}_m = S^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2
$$

**Bias:**
This estimator is biased. For independent observations with common variance $\sigma^2$, it underestimates $\sigma^2$ on average:
$$
\text{bias}(S^2) = E(S^2) - \sigma^2 = \left( \frac{n-1}{n} \sigma^2 \right) - \sigma^2 = -\frac{\sigma^2}{n}
$$
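
The value of $E(S^2)$ used above can be derived in a few steps. This sketch assumes the $X_i$ are independent with common mean $\mu$ and variance $\sigma^2$, so that $\text{Var}(\bar{X}) = \sigma^2 / n$:
$$
\begin{aligned}
\sum_{i=1}^n (X_i - \bar{X})^2 &= \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \\
E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2 \\
E(S^2) &= \frac{1}{n}(n-1)\sigma^2 = \frac{n-1}{n}\sigma^2
\end{aligned}
$$

Multiplying $S^2$ by $\frac{n}{n-1}$, which is the same as dividing the sum of squares by $n-1$, removes the bias. This adjustment is known as Bessel's correction.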

### Example 3: Estimating Population Proportion
Consider the sample proportion $\hat{p}$ as an estimator of the population proportion $p$.

**Estimator:**
$$
\hat{\theta}_m = \hat{p} = \frac{X}{n}
$$

where $X$ is the number of successes in $n$ trials.

**Bias:**
If $X \sim \text{Binomial}(n, p)$, then $E(X) = np$ and $E(\hat{p}) = E(X)/n = p$, so the sample proportion is unbiased:
$$
\text{bias}(\hat{p}) = E(\hat{p}) - p = p - p = 0
$$
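
A quick simulation can check this result. The sketch below assumes binomial trials with an illustrative $p = 0.3$ and $n = 50$. These values and the variable names are not from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

p_true = 0.3               # assumed true proportion
n_trials = 50              # trials per experiment (n)
num_experiments = 100_000  # number of repeated experiments

# X = number of successes in each experiment
successes = rng.binomial(n=n_trials, p=p_true, size=num_experiments)

# Sample proportion for each experiment
p_hat = successes / n_trials

# Average estimate minus the true value approximates the bias
proportion_bias = p_hat.mean() - p_true
print(f"Estimated bias of the sample proportion: {proportion_bias:.5f}")
```

The estimated bias should be very close to zero and differ from it only by simulation noise.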

### Example 4: Estimating a Population Median with the Sample Median
Consider the sample median $\tilde{X}$ as an estimator of the population median $\eta$.

**Estimator:**
$$
\hat{\theta}_m = \tilde{X}
$$

**Bias:**
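The bias of the sample median depends on the underlying distribution. For a distribution that is symmetric about $\eta$ and has a finite mean, the sampling distribution of the median is also symmetric about $\eta$, so the sample median is unbiased: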
$$
\text{bias}(\tilde{X}) = E(\tilde{X}) - \eta = 0
$$

For asymmetric distributions such as the exponential, however, the sample median is generally biased in finite samples.
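
The contrast between symmetric and asymmetric distributions can be seen in a small simulation. This is a sketch under assumed settings: a standard normal distribution for the symmetric case, an exponential distribution with rate 1 (true median $\ln 2$) for the asymmetric case, and a deliberately small sample size of 5 so the bias is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                 # small sample size, chosen to make the bias visible
num_trials = 200_000  # number of repeated samples

# Symmetric case: standard normal, true median 0
normal_samples = rng.normal(loc=0.0, scale=1.0, size=(num_trials, n))
normal_median_bias = np.median(normal_samples, axis=1).mean() - 0.0

# Asymmetric case: exponential with rate 1, true median ln(2)
exp_samples = rng.exponential(scale=1.0, size=(num_trials, n))
exp_true_median = np.log(2.0)
exp_median_bias = np.median(exp_samples, axis=1).mean() - exp_true_median

print(f"Normal: estimated bias of sample median      = {normal_median_bias:.4f}")
print(f"Exponential: estimated bias of sample median = {exp_median_bias:.4f}")
```

For the exponential case with $n = 5$, the expected sample median is $\frac{1}{5} + \frac{1}{4} + \frac{1}{3} \approx 0.783$, compared with a true median of $\ln 2 \approx 0.693$, so the estimated bias should come out positive. The bias shrinks as the sample size grows.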

These examples illustrate how the bias of an estimator can vary depending on the estimator and the parameter being estimated.

## Illustrating Bias with Python

Let's write some Python code to illustrate the bias of different estimators. We'll use simulations to demonstrate the bias of the sample mean and the sample variance.


```python
import numpy as np

# Parameters
np.random.seed(0)
population_mean = 50
population_variance = 25
sample_size = 30
num_samples = 1000

# Generate a finite population of 10,000 values
population_data = np.random.normal(loc=population_mean, scale=np.sqrt(population_variance), size=10000)

# Draw one sample without replacement and return its mean and variance
def sample_statistics(data, sample_size):
    sample = np.random.choice(data, sample_size, replace=False)
    sample_mean = np.mean(sample)
    sample_variance = np.var(sample, ddof=1)  # ddof=1 divides by n - 1 (Bessel's correction)
    return sample_mean, sample_variance

# Collect statistics from many samples
sample_means = []
sample_variances = []

for _ in range(num_samples):
    mean, variance = sample_statistics(population_data, sample_size)
    sample_means.append(mean)
    sample_variances.append(variance)

# Estimate bias as the average estimate minus the nominal parameter
mean_bias = np.mean(sample_means) - population_mean
variance_bias = np.mean(sample_variances) - population_variance

print(f"True Population Mean: {population_mean}")
print(f"Average Sample Mean: {np.mean(sample_means)}")
print(f"Bias of Sample Mean: {mean_bias}")

print(f"True Population Variance: {population_variance}")
print(f"Average Sample Variance: {np.mean(sample_variances)}")
print(f"Bias of Sample Variance: {variance_bias}")
```

```
True Population Mean: 50
Average Sample Mean: 49.872643091732456
Bias of Sample Mean: -0.12735690826754364
True Population Variance: 25
Average Sample Variance: 24.311463697256524
Bias of Sample Variance: -0.6885363027434757
```


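This code draws a finite population of 10,000 values from a normal distribution with mean 50 and variance 25, then repeatedly samples 30 values from it without replacement. For each of the 1,000 samples it computes the sample mean and sample variance, and it estimates each bias as the average of those statistics minus the nominal parameter.

The average sample mean differs from 50 by about -0.13, which is small relative to the population standard deviation of 5. The variance result needs more care. The code calls `np.var` with `ddof=1`, which divides by $n-1$ and is therefore the unbiased (Bessel-corrected) estimator, not the $\frac{1}{n}$ estimator from Example 2. The gap of about -0.69 therefore does not demonstrate the bias of that estimator. Two effects can produce gaps like these: Monte Carlo error from using only 1,000 samples, and the fact that the samples come from a finite population whose own mean and variance differ slightly from the nominal 50 and 25. To observe the theoretical bias of $-\sigma^2/n = -25/30 \approx -0.83$, the variance would have to be computed with `ddof=0`.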
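
To compare the two variance estimators directly, the sketch below computes both `ddof=0` and `ddof=1` on the same samples. It draws each sample straight from the normal distribution rather than from a finite population, so that $\sigma^2 = 25$ is the exact target. The variable names and the trial count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 25.0          # true variance
n = 30                 # sample size
num_trials = 100_000   # number of repeated samples

# Each row is one independent sample of size n
samples = rng.normal(loc=50.0, scale=np.sqrt(sigma2), size=(num_trials, n))

# ddof=0 divides by n (the estimator from Example 2); ddof=1 divides by n - 1
var_uncorrected = samples.var(axis=1, ddof=0)
var_corrected = samples.var(axis=1, ddof=1)

bias_uncorrected = var_uncorrected.mean() - sigma2
bias_corrected = var_corrected.mean() - sigma2
theoretical_bias = -sigma2 / n

print(f"Theoretical bias of 1/n estimator: {theoretical_bias:.4f}")
print(f"Estimated bias with ddof=0:        {bias_uncorrected:.4f}")
print(f"Estimated bias with ddof=1:        {bias_corrected:.4f}")
```

With enough trials, the `ddof=0` estimate should land near $-\sigma^2/n$ and the `ddof=1` estimate near zero.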