# Homework 3.3 — Solutions

*Homework is designed to both test your knowledge and challenge you to apply familiar concepts to new applications. Answer clearly and completely. You are welcomed and encouraged to work in groups so long as your work is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Context

A university runs a campus wellness program designed to reduce student anxiety. Anxiety is measured on a 0–21 scale (similar to the GAD-7) for the same students before and after the program. For each student, we compute the paired difference:

$$d_i = \text{after}_i - \text{before}_i$$

A negative $d_i$ means the student's anxiety decreased. We want to know: did the program work?

For the simulation questions below, assume the population of paired differences has mean $\mu_d = -3$ and standard deviation $\sigma_d = 5$.

---

#### Q1. Confidence Intervals with Known $\sigma$

Suppose we know $\sigma_d = 5$ and we sample $n = 25$ students.

a) Compute the standard error of $\bar{d}$. Then compute the probability that $\bar{d}$ falls within 1 SE of $\mu_d$, and within 1.96 SE of $\mu_d$. Use `stats.norm.cdf()`.

In [None]:
mu_d = -3
sigma_d = 5
n = 25

se = sigma_d / np.sqrt(n)
print(f'Standard error: {se:.4f}')

# Probability within 1 SE
p_1se = stats.norm.cdf(mu_d + se, mu_d, se) - stats.norm.cdf(mu_d - se, mu_d, se)
print(f'P(d_bar within 1 SE of mu_d): {p_1se:.4f}')

# Probability within 1.96 SE
p_196se = stats.norm.cdf(mu_d + 1.96*se, mu_d, se) - stats.norm.cdf(mu_d - 1.96*se, mu_d, se)
print(f'P(d_bar within 1.96 SE of mu_d): {p_196se:.4f}')

b) Construct a 95% confidence interval centered on $\mu_d$. Then simulate 1,000 samples of size $n = 25$ from this population. For each sample, compute $\bar{d}$. What fraction of the 1,000 sample means fall inside the confidence interval?

In [None]:
# 95% CI centered on mu_d
lower = mu_d - 1.96 * se
upper = mu_d + 1.96 * se
print(f'95% CI centered on mu_d: [{lower:.4f}, {upper:.4f}]')

# Simulate 1,000 samples
count_inside = 0
sample_means = []

for _ in range(1000):
    sample = np.random.normal(mu_d, sigma_d, n)
    d_bar = np.mean(sample)
    sample_means.append(d_bar)
    if lower <= d_bar <= upper:
        count_inside += 1

print(f'Fraction of d_bars inside CI: {count_inside / 1000:.4f}')

c) Now flip the perspective. For each of your 1,000 simulated samples, construct a 95% confidence interval centered on $\bar{d}$ (not on $\mu_d$). What fraction of these intervals contain the true $\mu_d$?

In [None]:
# Reuse the 1,000 sample means from part (b); center each CI on d_bar instead of mu_d
count_contains_mu = 0

for d_bar in sample_means:
    ci_lower = d_bar - 1.96 * se
    ci_upper = d_bar + 1.96 * se
    if ci_lower <= mu_d <= ci_upper:
        count_contains_mu += 1

print(f'Fraction of CIs containing mu_d: {count_contains_mu / 1000:.4f}')

d) Compare your answers from (b) and (c). Why are these the same?

The fractions are identical, not just close (both approximately 0.95), because parts (b) and (c) use the same 1,000 sample means and check the same condition. $\bar{d}$ lies within 1.96 SE of $\mu_d$ exactly when $\mu_d$ lies within 1.96 SE of $\bar{d}$: both statements say $|\bar{d} - \mu_d| \leq 1.96 \cdot SE$. Distance is symmetric, so it does not matter which of the two points the interval is centered on.
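The symmetry can be checked directly. The sketch below is self-contained: it redraws its own samples with an assumed seed (the seed and the variable names are illustrative, not part of the assignment), evaluates both conditions on the same sample means, and compares them sample by sample.

```python
import numpy as np

# Illustrative setup: the seed and variable names are assumptions for this sketch
rng = np.random.default_rng(0)
mu_d, sigma_d, n = -3, 5, 25
se = sigma_d / np.sqrt(n)

# 1,000 sample means, one per simulated sample of size n
d_bars = rng.normal(mu_d, sigma_d, size=(1000, n)).mean(axis=1)

# (b) Is d_bar inside the interval centered on mu_d?
inside_b = (mu_d - 1.96 * se <= d_bars) & (d_bars <= mu_d + 1.96 * se)

# (c) Is mu_d inside the interval centered on d_bar?
inside_c = (d_bars - 1.96 * se <= mu_d) & (mu_d <= d_bars + 1.96 * se)

print(f'Fraction (b): {inside_b.mean():.4f}')
print(f'Fraction (c): {inside_c.mean():.4f}')
print(f'Same verdict for every sample: {np.array_equal(inside_b, inside_c)}')
```

Because the two conditions are the same inequality rearranged, the two fractions match for any seed, not only on average.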

---

#### Q2. The t-Distribution

In practice we don't know $\sigma_d$. We estimate it with the sample standard deviation $S$. This introduces extra uncertainty. Use $n = 15$ for this question (a smaller sample makes the difference more visible).

a) Simulate 1,000 samples of size $n = 15$ from the same population ($\mu_d = -3$, $\sigma_d = 5$). For each sample, compute $\bar{d}$ and $S$. Construct a 90% confidence interval using the **normal distribution**: $\bar{d} \pm 1.645 \cdot S / \sqrt{n}$. Count how many of the 1,000 intervals contain $\mu_d$.

```
z_crit = stats.norm.ppf(0.95)
```

In [None]:
n2 = 15
z_crit = stats.norm.ppf(0.95)

count_normal = 0
samples_q2 = []

for _ in range(1000):
    sample = np.random.normal(mu_d, sigma_d, n2)
    d_bar = np.mean(sample)
    S = np.std(sample, ddof=1)
    samples_q2.append((d_bar, S))
    
    se_hat = S / np.sqrt(n2)
    ci_lower = d_bar - z_crit * se_hat
    ci_upper = d_bar + z_crit * se_hat
    if ci_lower <= mu_d <= ci_upper:
        count_normal += 1

print(f'Coverage using normal: {count_normal / 1000:.4f}')

b) Repeat part (a), but now use the **t-distribution** critical value with $df = n - 1 = 14$: $\bar{d} \pm t_{crit} \cdot S / \sqrt{n}$. Count how many of the 1,000 intervals contain $\mu_d$.

```
t_crit = stats.t.ppf(0.95, df=14)
```

In [None]:
t_crit = stats.t.ppf(0.95, df=n2 - 1)

count_t = 0

for d_bar, S in samples_q2:
    se_hat = S / np.sqrt(n2)
    ci_lower = d_bar - t_crit * se_hat
    ci_upper = d_bar + t_crit * se_hat
    if ci_lower <= mu_d <= ci_upper:
        count_t += 1

print(f'Coverage using t:      {count_t / 1000:.4f}')

c) Which method is closer to the target coverage of 90%? In your own words, why does using the normal distribution with $S$ produce intervals that are too narrow?

The t-distribution method is closer to 90% coverage. The normal method typically achieves only about 87–88%.

This happens because $S$ varies from sample to sample. Sometimes $S$ underestimates $\sigma$, making the interval too narrow — these intervals miss $\mu_d$ more often than they should. The normal distribution doesn't account for this extra source of uncertainty. The t-distribution has heavier tails, which widens the intervals just enough to compensate for the randomness in $S$. With small samples (like $n = 15$), $S$ is an imprecise estimate of $\sigma$, so the correction matters most.
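The shortfall can also be computed exactly rather than simulated. When the data are normal, $(\bar{d} - \mu_d) / (S / \sqrt{n})$ follows a t-distribution with $n - 1$ degrees of freedom, so the true coverage of $\bar{d} \pm z_{crit} \cdot S / \sqrt{n}$ is the t-probability of landing within $\pm z_{crit}$. A minimal sketch (variable names are illustrative):

```python
from scipy import stats

n = 15
df = n - 1
z_crit = stats.norm.ppf(0.95)   # about 1.645
t_crit = stats.t.ppf(0.95, df)  # larger than z_crit for any finite df

# True coverage = P(-crit <= T <= crit), where T ~ t(df)
coverage_z = 1 - 2 * stats.t.sf(z_crit, df)
coverage_t = 1 - 2 * stats.t.sf(t_crit, df)

print(f'z critical value: {z_crit:.3f}, t critical value: {t_crit:.3f}')
print(f'Exact coverage with z critical value: {coverage_z:.4f}')
print(f'Exact coverage with t critical value: {coverage_t:.4f}')
```

The simulated coverage in parts (a) and (b) should scatter around these two values, with run-to-run noise of roughly one percentage point for 1,000 samples.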

---

#### Q3. Hypothesis Testing

Now let's test whether the wellness program actually works.

a) A university runs the program on $n = 25$ students and finds $\bar{d} = -2.8$ and $S = 4.6$. State the null and alternative hypotheses for testing whether the program has any effect on anxiety.

- $H_0: \mu_d = 0$ — the program has no effect on anxiety scores.
- $H_1: \mu_d \neq 0$ — the program has some effect on anxiety scores (could be an increase or decrease).

b) Compute the t-statistic and the two-sided p-value.

```
t_stat = (d_bar - 0) / (S / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n-1)
```

In [None]:
d_bar = -2.8
S = 4.6
n3 = 25

t_stat = (d_bar - 0) / (S / np.sqrt(n3))
p_value = 2 * stats.t.sf(abs(t_stat), df=n3 - 1)

print(f't-statistic: {t_stat:.4f}')
print(f'p-value:     {p_value:.4f}')

c) At a significance level of $\alpha = 0.05$, do you reject $H_0$? Interpret your conclusion in the context of the wellness program.

The t-statistic is approximately $-3.04$ and the p-value is approximately $0.0056$. Since $p < 0.05$, we reject $H_0$.

Interpretation: the data provides strong evidence that the wellness program has an effect on anxiety. The observed average drop of 2.8 points is unlikely to have occurred by chance alone if the program had no effect. We conclude that the program is associated with a statistically significant reduction in anxiety scores.
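A 95% confidence interval gives the same verdict and also shows the plausible size of the effect. This sketch reuses the summary statistics from part (a) and the t critical value approach from Q2; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

d_bar, S, n = -2.8, 4.6, 25
se_hat = S / np.sqrt(n)

# Two-sided 95% interval: 2.5% in each tail, df = n - 1
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_lower = d_bar - t_crit * se_hat
ci_upper = d_bar + t_crit * se_hat

print(f'95% CI for mu_d: [{ci_lower:.2f}, {ci_upper:.2f}]')
print(f'Interval excludes 0: {not (ci_lower <= 0 <= ci_upper)}')
```

A 95% t-interval excludes 0 exactly when the two-sided t-test rejects $H_0$ at $\alpha = 0.05$, so the two approaches always agree.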

d) A large national study of a similar program surveys $n = 2{,}000$ students and finds $\bar{d} = -0.3$ and $S = 5$. Compute the t-statistic and p-value. Is this result statistically significant at $\alpha = 0.05$?

In [None]:
d_bar_2 = -0.3
S_2 = 5
n_2 = 2000

t_stat_2 = (d_bar_2 - 0) / (S_2 / np.sqrt(n_2))
p_value_2 = 2 * stats.t.sf(abs(t_stat_2), df=n_2 - 1)

print(f't-statistic: {t_stat_2:.4f}')
print(f'p-value:     {p_value_2:.4f}')

The t-statistic is approximately $-2.68$ and the p-value is approximately $0.0074$. Since $p < 0.05$, this result is statistically significant — we reject $H_0$.

e) A clinician considers a drop of at least 2 points on the anxiety scale to be meaningful. The first study found a 2.8-point drop. The second found a 0.3-point drop. Which result is *practically* significant? What does this tell you about the difference between statistical significance and practical significance?

Only the first study ($\bar{d} = -2.8$) is practically significant — the drop exceeds the clinician's threshold of 2 points. The second study ($\bar{d} = -0.3$) is statistically significant but not practically significant — a 0.3-point change on a 21-point scale is too small to matter clinically.

This illustrates a key distinction:

- **Statistical significance** tells you the effect is probably not zero. With a large enough sample ($n = 2{,}000$), even a tiny difference can produce a small p-value because the standard error shrinks.
- **Practical significance** tells you the effect is large enough to matter. A 0.3-point drop in anxiety is probably real (the true effect is unlikely to be exactly zero) but too small to matter in practice.

You should always consider the *size* of the effect alongside the p-value. A statistically significant result with a tiny effect size doesn't mean the intervention is worth implementing.
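To see how sample size alone drives the p-value, the sketch below holds the second study's summary statistics fixed ($\bar{d} = -0.3$, $S = 5$) and varies $n$. The sample sizes other than 2,000 are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

d_bar, S = -0.3, 5
sample_sizes = [25, 100, 500, 2000]  # only n = 2000 comes from the study
p_values = []

for n in sample_sizes:
    se_hat = S / np.sqrt(n)                    # shrinks as n grows
    t_stat = d_bar / se_hat
    p = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value
    p_values.append(p)
    print(f'n = {n:>4}: SE = {se_hat:.3f}, t = {t_stat:.2f}, p = {p:.4f}')
```

The estimated effect is the same in every row; only the standard error changes, and that is enough to move the result from non-significant to significant.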

---

#### Q4. Interpreting p-Values

A researcher studying the effect of a new teaching method on exam scores conducts a hypothesis test and reports a p-value of 0.032.

a) In your own words, what does this p-value tell us?

If the null hypothesis were true (i.e., the new teaching method has no effect on exam scores), there is a 3.2% probability of observing data as extreme as or more extreme than what was observed. In other words, a result this far from the null would happen about 3 times in 100 by chance alone.
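This definition can be checked by simulation. The study's design is not given, so the sketch assumes a one-sample t-test with a hypothetical $n = 30$ and picks the t-statistic whose two-sided p-value is exactly 0.032. It then generates many data sets for which $H_0$ is true and counts how often the t-statistic is at least that extreme.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # assumed seed, for reproducibility
n = 30                          # hypothetical sample size
df = n - 1

# t-statistic whose two-sided p-value equals 0.032 (0.016 in each tail)
t_obs = stats.t.isf(0.016, df)

# 20,000 samples drawn with H0 true (mean 0; the scale does not matter)
samples = rng.normal(0, 1, size=(20_000, n))
t_null = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

frac_extreme = np.mean(np.abs(t_null) >= t_obs)
print(f'Observed |t|: {t_obs:.3f}')
print(f'Fraction of null samples at least as extreme: {frac_extreme:.4f}')
```

The printed fraction should be close to 0.032, which is what the p-value claims: a statement about the data under $H_0$, not about $H_0$ itself.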

b) If the researcher used a significance level of $\alpha = 0.05$, would they reject the null hypothesis? What if they used $\alpha = 0.01$?

At $\alpha = 0.05$: yes, reject $H_0$. The p-value (0.032) is less than 0.05, so the result is statistically significant at the 5% level.

At $\alpha = 0.01$: no, do not reject $H_0$. The p-value (0.032) is greater than 0.01, so the result is not significant at the 1% level. The same data can lead to different conclusions depending on how strict our threshold is.

c) For each of the following statements, say whether it is **true or false** and explain your reasoning.

1. A p-value of 0.10 means there is a 10% chance that the null hypothesis is true.
2. If we fail to reject the null hypothesis, we have proven that it is true.
3. A small p-value indicates that our observed result would be unlikely if the null hypothesis were true.

1. **False.** The p-value is not the probability that $H_0$ is true. In the frequentist framework used here, the truth is fixed: $\mu$ either equals the null value or it doesn't, so no probability is attached to $H_0$. The p-value is the probability of seeing data at least this extreme *assuming* $H_0$ is true. It measures how surprising the data is under the null, not how likely the null is.

2. **False.** Failing to reject $H_0$ means the data is consistent with the null — it doesn't prove the null is true. There may simply not be enough data to detect a real effect. Absence of evidence is not evidence of absence.

3. **True.** This is the correct interpretation. A small p-value means that, if the null hypothesis were true, data as extreme as what we observed would rarely occur. This makes us doubt the null.
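Statement 2 can be illustrated by simulation. The sketch below assumes a hypothetical scenario in which $H_0$ is false (true mean $-1$, standard deviation 5) but the sample is small ($n = 10$); these values and the seed are illustrative only. It counts how often a two-sided t-test at $\alpha = 0.05$ rejects $H_0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # assumed seed
true_mean, sigma, n = -1, 5, 10  # hypothetical: a real but modest effect
alpha = 0.05

# Each row is one simulated study
samples = rng.normal(true_mean, sigma, size=(5000, n))

# One-sample t-test of H0: mu = 0, applied to every row
result = stats.ttest_1samp(samples, popmean=0, axis=1)
reject_rate = np.mean(result.pvalue < alpha)

print(f'Fraction of studies that reject H0: {reject_rate:.3f}')
print(f'Fraction that fail to reject a false H0: {1 - reject_rate:.3f}')
```

Most of these simulated studies fail to reject $H_0$ even though it is false by construction, which is why a non-significant result cannot be read as proof of no effect.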