# Homework 3.2 — Solutions

*Homework is designed to both test your knowledge and challenge you to apply familiar concepts to new applications. Answer clearly and completely. You are welcome and encouraged to work in groups so long as your work is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Q1. The Central Limit Theorem with a Normal Population

The wait times (in minutes) at a restaurant follow a normal distribution with mean ($\mu$) = 12 minutes and standard deviation ($\sigma$) = 2.5 minutes. In this question you will explore what happens to the distribution of **sample means** as the sample size increases.

a) Take 1,000 samples of size $n = 5$ from this distribution. Compute the mean of each sample and plot a histogram of the 1,000 sample means.

```
sample_means_5 = [np.mean(np.random.normal(loc=12, scale=2.5, size=5)) for _ in range(1000)]
```

In [None]:
sample_means_5 = [np.mean(np.random.normal(loc=12, scale=2.5, size=5)) for _ in range(1000)]
sns.histplot(sample_means_5, bins=30)
plt.title('Sampling Distribution of the Mean (n=5)')
plt.xlabel('Sample Mean')
plt.show()

b) Repeat part (a) for sample sizes $n = 30$ and $n = 100$. Plot all three histograms. Describe how the distribution of sample means changes as $n$ increases.

In [None]:
sample_means_30 = [np.mean(np.random.normal(loc=12, scale=2.5, size=30)) for _ in range(1000)]
sample_means_100 = [np.mean(np.random.normal(loc=12, scale=2.5, size=100)) for _ in range(1000)]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.histplot(sample_means_5, bins=30, ax=axes[0])
axes[0].set_title('n = 5')
axes[0].set_xlim(9, 15)

sns.histplot(sample_means_30, bins=30, ax=axes[1])
axes[1].set_title('n = 30')
axes[1].set_xlim(9, 15)

sns.histplot(sample_means_100, bins=30, ax=axes[2])
axes[2].set_title('n = 100')
axes[2].set_xlim(9, 15)

plt.tight_layout()
plt.show()

As $n$ increases, the distribution of sample means stays centered around $\mu = 12$ but becomes increasingly narrow. All three are approximately bell-shaped (since the population is already normal), but the spread shrinks noticeably from $n = 5$ to $n = 100$.

c) Compute the standard deviation of the sample means for each sample size ($n = 5$, $n = 30$, $n = 100$).

In [None]:
print(f'SD of sample means (n=5):   {np.std(sample_means_5):.4f}')
print(f'SD of sample means (n=30):  {np.std(sample_means_30):.4f}')
print(f'SD of sample means (n=100): {np.std(sample_means_100):.4f}')

d) Compute the theoretical standard error $\sigma / \sqrt{n}$ for each sample size. Compare these values to the standard deviations you computed in part (c).

In [None]:
sigma = 2.5
for n in [5, 30, 100]:
    se = sigma / np.sqrt(n)
    print(f'Theoretical SE (n={n}): {se:.4f}')

The simulated standard deviations are very close to the theoretical standard errors: $2.5/\sqrt{5} \approx 1.118$, $2.5/\sqrt{30} \approx 0.456$, $2.5/\sqrt{100} = 0.250$. Small differences are due to simulation randomness.

e) In your own words, why does the spread of the sampling distribution decrease as $n$ increases?

With a larger sample, extreme values in one direction are more likely to be offset by values in the other direction. The individual observations still vary, but averaging cancels out much of that variation. Mathematically, the variance of the mean is $\sigma^2 / n$, so each additional observation reduces the variance of the mean. The standard error therefore shrinks in proportion to $1/\sqrt{n}$: to cut the standard error in half, you need four times as many observations.
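The four-times rule can be checked numerically. The sketch below is only an illustration, not part of the assigned solution: it assumes a seeded `np.random.default_rng` generator (the cells above use the legacy `np.random` functions) and a hypothetical helper named `sd_of_sample_means`, and it compares $n = 25$ with $n = 100$ for the Q1 population.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)

def sd_of_sample_means(n, reps=20000, mu=12, sigma=2.5):
    """Hypothetical helper: SD of `reps` sample means, each from a sample of size n."""
    samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))  # one sample per row
    return samples.mean(axis=1).std()

sd_25 = sd_of_sample_means(25)    # theory: 2.5 / sqrt(25)  = 0.50
sd_100 = sd_of_sample_means(100)  # theory: 2.5 / sqrt(100) = 0.25
ratio = sd_25 / sd_100            # theory: 2, since n was multiplied by 4

print(f'SD (n=25): {sd_25:.4f}  SD (n=100): {sd_100:.4f}  ratio: {ratio:.3f}')
```

The ratio should land near 2, with small deviations from simulation randomness.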

#### Q2. The Central Limit Theorem with a Skewed Population

Now consider a different scenario. Wait times at a busy food truck follow an **exponential distribution** with a mean of 5 minutes. This distribution is skewed right — most waits are short, but some are very long. Use `np.random.exponential(scale=5, size=n)` to draw from this population.

a) Draw 10,000 observations from this population and plot the histogram. Describe the shape — is it normal?

In [None]:
population_exp = np.random.exponential(scale=5, size=10000)
sns.histplot(population_exp, bins=50)
plt.title('Exponential Population (mean=5)')
plt.xlabel('Wait Time (minutes)')
plt.show()

The distribution is clearly not normal. It is heavily skewed right — most observations are clustered near zero, with a long tail stretching to the right. The peak is at the far left and the distribution decays exponentially.

b) Take 1,000 samples of size $n = 5$ from this population. Compute the mean of each sample and plot a histogram of the sample means.

In [None]:
sample_means_exp_5 = [np.mean(np.random.exponential(scale=5, size=5)) for _ in range(1000)]
sns.histplot(sample_means_exp_5, bins=30)
plt.title('Sampling Distribution of the Mean (n=5, Exponential)')
plt.xlabel('Sample Mean')
plt.show()

c) Repeat part (b) for $n = 30$. Plot the histogram and describe how the shape changed compared to $n = 5$.

In [None]:
sample_means_exp_30 = [np.mean(np.random.exponential(scale=5, size=30)) for _ in range(1000)]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(sample_means_exp_5, bins=30, ax=axes[0])
axes[0].set_title('n = 5')

sns.histplot(sample_means_exp_30, bins=30, ax=axes[1])
axes[1].set_title('n = 30')

plt.tight_layout()
plt.show()

At $n = 5$, the distribution of sample means is still noticeably skewed right, though less so than the population itself. By $n = 30$, the distribution looks approximately bell-shaped and roughly symmetric around the mean of 5. The CLT is already working well even though the population is far from normal.
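The visual impression of shrinking skew can be backed up with a number. The following is a rough sketch under stated assumptions: `sample_skewness` is a hypothetical helper implementing the moment-based skewness (the mean of cubed z-scores), the generator is a seeded `np.random.default_rng`, and 20,000 repetitions are used instead of 1,000 to reduce noise. For an exponential population, whose skewness is 2, the skewness of the sample mean is $2/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator (an assumption, for reproducibility)

def sample_skewness(x):
    """Hypothetical helper: moment-based skewness, the mean of cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

skews = {}
for n in [5, 30]:
    # 20,000 sample means, each from an exponential sample of size n
    means = rng.exponential(scale=5, size=(20000, n)).mean(axis=1)
    skews[n] = sample_skewness(means)
    print(f'n={n:>2}: skewness = {skews[n]:.3f}  (theory 2/sqrt(n) = {2 / np.sqrt(n):.3f})')
```

A skewness near 0 indicates symmetry, so the drop from roughly 0.9 at $n = 5$ to roughly 0.4 at $n = 30$ is consistent with the histograms: much more symmetric, though a slight right skew remains.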

d) For both $n = 5$ and $n = 30$, compare the standard deviation of the sample means to the theoretical standard error $\sigma / \sqrt{n}$. For the exponential distribution, the population standard deviation equals the mean, so $\sigma = 5$.

In [None]:
sigma_exp = 5

print(f'SD of sample means (n=5):  {np.std(sample_means_exp_5):.4f}  |  Theoretical SE: {sigma_exp / np.sqrt(5):.4f}')
print(f'SD of sample means (n=30): {np.std(sample_means_exp_30):.4f}  |  Theoretical SE: {sigma_exp / np.sqrt(30):.4f}')

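The simulated standard deviations are close to the theoretical values: $5/\sqrt{5} \approx 2.236$ and $5/\sqrt{30} \approx 0.913$. The standard error formula $\sigma/\sqrt{n}$ works even when the population is not normal: it depends only on the draws being independent with a finite variance, so it holds at every sample size, while the CLT separately describes the shape.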

e) In your own words, what does the Central Limit Theorem guarantee about the shape of the sampling distribution, regardless of the population shape?

The CLT guarantees that, for independent draws from any population with a finite variance, the distribution of sample means approaches a normal distribution as the sample size $n$ increases, regardless of the population's shape. The sampling distribution is centered on the population mean $\mu$ with a standard deviation of $\sigma/\sqrt{n}$. So even if the population is skewed, bimodal, or uniform, the distribution of $\bar{x}$ will be approximately $N(\mu, \sigma/\sqrt{n})$ (mean, standard deviation) for sufficiently large $n$. How large is large enough depends on how far the population is from normal.
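One way to see what "approximately normal" means in practice is to compare tail fractions with what a normal distribution predicts. The sketch below is an illustration only: it assumes a seeded `np.random.default_rng` generator, uses the Q2 exponential population, and adds $n = 200$, a sample size that is not part of the assignment. If $\bar{x}$ were exactly normal, about 2.5% of sample means would fall more than 1.96 standard errors above $\mu$ and about 2.5% more than 1.96 standard errors below it.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded generator (an assumption, for reproducibility)
mu, sigma = 5.0, 5.0            # Q2 exponential population: mean and SD are both 5

upper, lower = {}, {}
for n in [5, 30, 200]:
    # 20,000 sample means, each from a sample of size n
    means = rng.exponential(scale=5, size=(20000, n)).mean(axis=1)
    se = sigma / np.sqrt(n)
    upper[n] = np.mean(means > mu + 1.96 * se)  # right-tail fraction
    lower[n] = np.mean(means < mu - 1.96 * se)  # left-tail fraction
    print(f'n={n:>3}: above = {upper[n]:.4f}  below = {lower[n]:.4f}  (normal: 0.0250 each)')
```

At $n = 5$ the tails are lopsided (far too much mass on the right, almost none on the left), and both fractions move toward 0.025 as $n$ grows.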

#### Q3. The Central Limit Theorem with a Bimodal Population

A coffee shop has two types of customers: those who order just a drink (mean spending $\mu_1 = 4$ dollars, $\sigma_1 = 1$) and those who order a drink plus food (mean spending $\mu_2 = 12$ dollars, $\sigma_2 = 1.5$). About half of customers fall into each group. The function below draws `n` observations from this bimodal population.

```
def bimodal_sample(n):
    group = np.random.choice([0, 1], size=n)
    return np.where(group == 0, np.random.normal(4, 1, n), np.random.normal(12, 1.5, n))
```

In [None]:
def bimodal_sample(n):
    group = np.random.choice([0, 1], size=n)
    return np.where(group == 0, np.random.normal(4, 1, n), np.random.normal(12, 1.5, n))

a) Draw 10,000 observations from this population and plot the histogram. Describe the shape — how many peaks does it have?

In [None]:
population_bimodal = bimodal_sample(10000)
sns.histplot(population_bimodal, bins=50)
plt.title('Bimodal Population')
plt.xlabel('Customer Spending ($)')
plt.show()

The distribution has two clear peaks: one around \$4 (drink-only customers) and one around \$12 (drink-plus-food customers). There is a valley in the middle around \$8. This is clearly not a normal distribution.

b) Take 1,000 samples of size $n = 5$ from this population. Compute the mean of each sample and plot a histogram of the sample means.

In [None]:
sample_means_bi_5 = [np.mean(bimodal_sample(5)) for _ in range(1000)]
sns.histplot(sample_means_bi_5, bins=30)
plt.title('Sampling Distribution of the Mean (n=5, Bimodal)')
plt.xlabel('Sample Mean')
plt.show()

c) Repeat part (b) for $n = 30$. Plot the histogram and describe how the shape changed compared to the population and to $n = 5$.

In [None]:
sample_means_bi_30 = [np.mean(bimodal_sample(30)) for _ in range(1000)]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(sample_means_bi_5, bins=30, ax=axes[0])
axes[0].set_title('n = 5')

sns.histplot(sample_means_bi_30, bins=30, ax=axes[1])
axes[1].set_title('n = 30')

plt.tight_layout()
plt.show()

At $n = 5$, the sampling distribution is wide and may look lumpy or flat rather than bell-shaped, because the mean of a small sample depends heavily on how many of its five customers came from each group. By $n = 30$, the peaks have merged into a single bell-shaped distribution centered around the overall mean of $8$. The CLT has turned a two-humped population into a single normal-looking sampling distribution.
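The irregular shape at $n = 5$ can be understood by conditioning on how many of the five customers ordered food. This is a small worked calculation rather than a simulation, and the variable names are illustrative: with $k$ drink-plus-food customers (a Binomial(5, 0.5) count), the sample mean is centered at $(4(5-k) + 12k)/5$, so the sampling distribution is a blend of six clusters spaced 1.6 dollars apart.

```python
from math import comb, sqrt

n = 5
rows = []
for k in range(n + 1):  # k = number of drink-plus-food customers in the sample
    prob = comb(n, k) / 2 ** n                              # Binomial(5, 0.5) probability
    cond_mean = (4 * (n - k) + 12 * k) / n                  # center of the sample mean given k
    cond_sd = sqrt((n - k) * 1.0 ** 2 + k * 1.5 ** 2) / n   # SD of the sample mean given k
    rows.append((k, prob, cond_mean, cond_sd))
    print(f'k={k}: P={prob:.3f}  mean of x-bar={cond_mean:.1f}  SD of x-bar={cond_sd:.2f}')
```

The clusters are wide enough to overlap, which is why a histogram of 1,000 sample means can look bumpy or flat instead of showing six clean peaks.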

d) Compute the standard deviation of your 10,000 population observations from part (a). Use this as $\sigma$. Then compare the standard deviation of your sample means (for $n = 5$ and $n = 30$) to the theoretical standard error $\sigma / \sqrt{n}$.

In [None]:
sigma_bi = np.std(population_bimodal)
print(f'Population SD (from 10,000 draws): {sigma_bi:.4f}')
print()
print(f'SD of sample means (n=5):  {np.std(sample_means_bi_5):.4f}  |  Theoretical SE: {sigma_bi / np.sqrt(5):.4f}')
print(f'SD of sample means (n=30): {np.std(sample_means_bi_30):.4f}  |  Theoretical SE: {sigma_bi / np.sqrt(30):.4f}')

The population standard deviation is approximately $4.2$ (driven largely by the $8$-dollar gap between the two group means). The simulated standard deviations of the sample means are close to the theoretical standard errors $\sigma/\sqrt{5} \approx 1.9$ and $\sigma/\sqrt{30} \approx 0.77$. The formula works here too.
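The population standard deviation can also be computed exactly from the mixture's parameters instead of estimated from 10,000 draws. The sketch below applies the law of total variance (average within-group variance plus the variance of the group means); the variable names are illustrative and the 50/50 weights come from the problem statement.

```python
import numpy as np

weights = np.array([0.5, 0.5])       # half of customers in each group
group_means = np.array([4.0, 12.0])
group_sds = np.array([1.0, 1.5])

overall_mean = np.sum(weights * group_means)                    # 8
within = np.sum(weights * group_sds ** 2)                       # average within-group variance
between = np.sum(weights * (group_means - overall_mean) ** 2)   # variance of the group means
sigma_exact = np.sqrt(within + between)

print(f'Overall mean: {overall_mean:.1f}')
print(f'Within: {within:.3f}  Between: {between:.3f}  Exact SD: {sigma_exact:.4f}')
for n in [5, 30]:
    print(f'Exact SE (n={n}): {sigma_exact / np.sqrt(n):.4f}')
```

The between-group term (16) dwarfs the within-group term (1.625), which confirms that the gap between the two group means drives the spread. The exact SD is $\sqrt{17.625} \approx 4.198$.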

e) Across all three questions, you started with a normal population, a skewed population, and a bimodal population. What happened to the sampling distribution of the mean in each case as $n$ increased? What does this tell you about the generality of the Central Limit Theorem?

In all three cases, the sampling distribution of the mean became more bell-shaped and tighter as $n$ increased:

- **Normal population (Q1):** The sampling distribution was already normal at every sample size, but the spread shrank with larger $n$.
- **Skewed population (Q2):** The skew in the sampling distribution diminished as $n$ grew, and by $n = 30$ it looked approximately normal.
- **Bimodal population (Q3):** The two peaks merged into a single bell curve as $n$ grew, and by $n = 30$ the bimodal shape had disappeared entirely.

In every case, the standard deviation of the sample means closely matched the theoretical standard error $\sigma/\sqrt{n}$. This demonstrates the generality of the CLT: whatever the population looks like (symmetric, skewed, or bimodal), as long as its variance is finite, the distribution of $\bar{x}$ converges to a normal distribution centered on $\mu$ with standard deviation $\sigma/\sqrt{n}$ as the sample size increases.
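The comparison across all three populations can be collected into one loop. This is a compact sketch, not the assigned solution: it assumes a seeded `np.random.default_rng` generator, a vectorized stand-in for `bimodal_sample` named `bimodal`, 10,000 repetitions per setting, and the exact bimodal standard deviation $\sqrt{17.625}$, which follows from $(1^2 + 1.5^2)/2 + 4^2$.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded generator (an assumption, for reproducibility)

def bimodal(size):
    """Vectorized stand-in for the Q3 mixture: half N(4, 1), half N(12, 1.5)."""
    group = rng.integers(0, 2, size=size)  # 0 = drink only, 1 = drink plus food
    return np.where(group == 0, rng.normal(4, 1, size), rng.normal(12, 1.5, size))

# name -> (function drawing an array of the given shape, population SD)
populations = {
    'normal': (lambda size: rng.normal(12, 2.5, size), 2.5),
    'exponential': (lambda size: rng.exponential(5, size), 5.0),
    'bimodal': (bimodal, np.sqrt(17.625)),
}

results = {}
for name, (draw, sigma) in populations.items():
    for n in [5, 30, 100]:
        means = draw((10000, n)).mean(axis=1)  # 10,000 sample means of size n
        results[(name, n)] = (means.std(), sigma / np.sqrt(n))
        sim, theory = results[(name, n)]
        print(f'{name:<12} n={n:>3}: simulated SD = {sim:.4f}  theoretical SE = {theory:.4f}')
```

Each simulated value should sit within a few percent of its theoretical standard error, whatever the population's shape.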