# Homework 3.4 — Solutions

*Homework is designed to both test your knowledge and challenge you to apply familiar concepts to new applications. Answer clearly and completely. You are welcome and encouraged to work in groups, so long as the work you submit is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Context

In HW 3.3, we tested whether a campus wellness program reduced student anxiety using confidence intervals and hypothesis tests. Now we'll see that what we did was actually a special case of a more general framework: the **general linear model** (GLM).

We'll use the same paired differences from the wellness program. Here is a sample of $n = 20$ students' anxiety score changes ($d_i = \text{after} - \text{before}$):

In [None]:
d = np.array([-5.2, -1.3, -4.8, 0.7, -3.1, -6.4, -2.0, -0.5, -4.1, -3.7,
              -1.8, -5.5, -2.9, 1.2, -3.6, -4.3, -0.9, -2.7, -6.1, -3.0])

---

#### Q1. The Intercept-Only Model and MSE

The simplest GLM predicts every observation with the same constant: $\hat{y}_i = b$. The **mean squared error** measures how wrong the prediction is:

$$MSE(b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - b)^2$$

a) Write a function that takes an array of data and a value of $b$ and returns the MSE. Compute the MSE for $b = 0$, $b = -2$, and $b = -5$.

```
def mse(data, b):
    return np.mean((data - b)**2)
```

In [None]:
def mse(data, b):
    return np.mean((data - b)**2)

print(f'MSE(b=0):  {mse(d, 0):.4f}')
print(f'MSE(b=-2): {mse(d, -2):.4f}')
print(f'MSE(b=-5): {mse(d, -5):.4f}')

b) Compute the MSE for a range of $b$ values from $-8$ to $2$ and plot MSE as a function of $b$. At what value of $b$ does the MSE reach its minimum?

```
b_values = np.linspace(-8, 2, 200)
```

In [None]:
b_values = np.linspace(-8, 2, 200)
mse_values = [mse(d, b) for b in b_values]

plt.plot(b_values, mse_values)
plt.xlabel('b')
plt.ylabel('MSE')
plt.title('MSE as a function of b')

# Mark the minimum
best_b = b_values[np.argmin(mse_values)]
plt.axvline(best_b, color='red', linestyle='--', label=f'min at b = {best_b:.2f}')
plt.legend()
plt.show()

print(f'b that minimizes MSE: {best_b:.4f}')

c) Compute the sample mean $\bar{d}$ of the data. Compare it to the $b$ that minimizes MSE. What do you notice?

In [None]:
d_bar = np.mean(d)
print(f'Sample mean: {d_bar:.4f}')
print(f'b that minimizes MSE: {best_b:.4f}')

They are the same, up to the resolution of our grid: `b_values` has a spacing of about 0.05, so `best_b` can differ from $\bar{d}$ by up to about 0.025. The sample mean is the value of $b$ that minimizes the MSE. When we compute a sample mean, we are fitting the intercept-only model.
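The grid search only locates the minimum approximately. A short calculus sketch, using nothing beyond the MSE definition from this question, shows where the exact minimum sits:

$$\frac{d}{db}MSE(b) = -\frac{2}{n}\sum_{i=1}^{n}(y_i - b) = 0 \quad\Rightarrow\quad b = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$$

The second derivative is $2 > 0$, so this critical point is a minimum rather than a maximum.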

d) Compute the MSE at $b = \bar{d}$ and compare it to the variance of the data (`np.var(d)`). What is the relationship?

In [None]:
print(f'MSE at b = d_bar: {mse(d, d_bar):.4f}')
print(f'Variance of d:    {np.var(d):.4f}')

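They are equal. The MSE of the best-fitting horizontal line equals the variance of the data. Variance is the average squared distance from the mean, which is exactly what the MSE measures when $b = \bar{d}$. Note that `np.var` divides by $n$ by default (`ddof=0`), which is why the match is exact; the sample variance $S^2$ used in later questions divides by $n - 1$ (`ddof=1`) and is slightly larger.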
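The same relationship extends to any $b$: the MSE splits into the variance plus the squared distance between $b$ and the mean, $MSE(b) = \text{Var}(d) + (b - \bar{d})^2$. The sketch below checks this numerically for the three values from part (a); the names `lhs` and `rhs` are illustrative only.

```python
import numpy as np

d = np.array([-5.2, -1.3, -4.8, 0.7, -3.1, -6.4, -2.0, -0.5, -4.1, -3.7,
              -1.8, -5.5, -2.9, 1.2, -3.6, -4.3, -0.9, -2.7, -6.1, -3.0])

def mse(data, b):
    return np.mean((data - b) ** 2)

d_bar = np.mean(d)
var_d = np.var(d)  # default ddof=0, i.e. divides by n

# Check MSE(b) = Var(d) + (b - d_bar)^2 for several candidate values of b
for b in [0, -2, -5]:
    lhs = mse(d, b)
    rhs = var_d + (b - d_bar) ** 2
    print(f'b = {b:>2}: MSE = {lhs:.4f}, Var + (b - mean)^2 = {rhs:.4f}')
```

Because the second term is never negative and vanishes only at $b = \bar{d}$, the decomposition also explains why the parabola in part (b) bottoms out at the sample mean.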

e) In your own words, explain what it means to say "the sample mean is the parameter of the simplest linear model" and "the variance is the MSE of the best horizontal line through the data."

The intercept-only model predicts every observation with a single number $b$. Among all possible values of $b$, the one that makes the smallest average squared error is the sample mean. So computing a mean isn't just a summary — it's finding the optimal parameter of a model.

The variance, in turn, measures how well that best model fits. It's the average squared gap between each observation and the best constant prediction. If the variance is large, even the best horizontal line is quite wrong — the data is spread out. If the variance is small, the mean is a good prediction for most observations.

---

#### Q2. Sampling Distribution of the Model Parameter

When we fit the intercept-only model to a sample, the estimated $b$ equals the sample mean. Different samples give different $b$ values. Assume the population has $\mu_d = -3$ and $\sigma_d = 5$.

a) Simulate 1,000 samples of size $n = 20$. For each sample, fit the intercept-only model (i.e., compute the sample mean). Collect the 1,000 estimated $b$ values and plot their histogram.

In [None]:
mu_d = -3
sigma_d = 5
n = 20

b_values_sim = [np.mean(np.random.normal(mu_d, sigma_d, n)) for _ in range(1000)]

sns.histplot(b_values_sim, bins=30)
plt.xlabel('Estimated b')
plt.title('Sampling Distribution of b (intercept-only model)')
plt.show()

b) Compute the mean and standard deviation of your 1,000 $b$ values. Compare the mean to $\mu_d$ and the standard deviation to the theoretical standard error $\sigma_d / \sqrt{n}$.

In [None]:
print(f'Mean of b values:     {np.mean(b_values_sim):.4f}  (mu_d = {mu_d})')
print(f'SD of b values:       {np.std(b_values_sim):.4f}  (sigma/sqrt(n) = {sigma_d / np.sqrt(n):.4f})')

The mean of the simulated $b$ values is close to $\mu_d = -3$, and the standard deviation is close to the theoretical SE of $5/\sqrt{20} \approx 1.118$. The estimated intercept is an unbiased estimator of the population mean, and its variability follows the standard error formula.
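The standard error formula itself can be derived in one line. As a sketch, assuming the $d_i$ are independent draws with common variance $\sigma_d^2$:

$$\operatorname{Var}(\bar{d}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} d_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(d_i) = \frac{\sigma_d^2}{n} \quad\Rightarrow\quad SE(\bar{d}) = \frac{\sigma_d}{\sqrt{n}}$$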

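c) For each of your 1,000 samples, also compute $S$ and then the t-statistic, centered at the true mean: $t = (b - \mu_d) / (S / \sqrt{n})$. Plot the histogram of the 1,000 t-statistics and overlay a t-distribution with $df = 19$.

```
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.t.pdf(x, df=19))
```

In [None]:
t_stats = []
for _ in range(1000):
    sample = np.random.normal(mu_d, sigma_d, n)
    b = np.mean(sample)
    S = np.std(sample, ddof=1)  # sample SD (divides by n - 1)
    # Center at the true mean: b / (S / sqrt(n)) alone would be centered near
    # mu_d / (sigma_d / sqrt(n)), about -2.68, and would not match the t(19) curve.
    t = (b - mu_d) / (S / np.sqrt(n))
    t_stats.append(t)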

sns.histplot(t_stats, bins=30, stat='density', label='Simulated t-statistics')
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.t.pdf(x, df=19), 'r', label='t-distribution (df=19)')
plt.xlabel('t-statistic')
plt.title('Distribution of t-statistics from intercept-only model')
plt.legend()
plt.show()

d) In your own words, why does the standardized $b$ (the t-statistic) follow a t-distribution across samples rather than a normal distribution?

The sample mean $b = \bar{d}$ itself is normal across samples (exactly so here, because the population is normal, and approximately so in general by the CLT). If we standardized it with the true $\sigma_d/\sqrt{n}$, the result would be standard normal. But we standardize with $S/\sqrt{n}$, a quantity that itself varies from sample to sample. This extra randomness in the denominator gives the t-statistic heavier tails than a normal, and the t-distribution accounts for this. With larger samples, $S$ becomes more stable, the extra variability shrinks, and the t-distribution converges to the normal.
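The convergence to the normal can be seen directly in the critical values. The sketch below compares the two-sided 95% cutoff of the t-distribution with the normal cutoff for a few illustrative degrees of freedom (the particular `df` values are arbitrary choices, not part of the assignment):

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% cutoff for the standard normal

# Two-sided 95% cutoffs for t-distributions with increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df=df) for df in [4, 19, 99, 999]}

for df, t_crit in t_crits.items():
    print(f'df = {df:>3}: t cutoff = {t_crit:.4f}  (normal cutoff = {z_crit:.4f})')
```

The t cutoff is always larger than the normal cutoff, reflecting the heavier tails, and the gap shrinks as the degrees of freedom grow.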

---

#### Q3. Hypothesis Testing as a Model

In HW 3.3, we tested $H_0: \mu_d = 0$ by computing $t = \bar{d} / (S / \sqrt{n})$. Now let's see the same test expressed in regression language.

a) Using our sample `d`, compute the sample mean, sample standard deviation, standard error, t-statistic (testing whether the mean is zero), and p-value.

In [None]:
d_bar = np.mean(d)
S = np.std(d, ddof=1)
n = len(d)
se = S / np.sqrt(n)
t_stat = d_bar / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n-1)

print(f'Sample mean (b): {d_bar:.4f}')
print(f'Sample SD (S):   {S:.4f}')
print(f'Standard error:  {se:.4f}')
print(f't-statistic:     {t_stat:.4f}')
print(f'p-value:         {p_value:.6f}')

b) Now fit the intercept-only model using `scipy`: `stats.ttest_1samp(d, 0)`. This tests $H_0: \mu = 0$. Compare the t-statistic and p-value to what you computed in part (a).

In [None]:
result = stats.ttest_1samp(d, 0)
print(f't-statistic: {result.statistic:.4f}')
print(f'p-value:     {result.pvalue:.6f}')

The t-statistic and p-value are identical to part (a). The one-sample t-test and the intercept-only model test are the same thing — the software is doing exactly what we did by hand.
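To make the regression side of this equivalence concrete, here is a minimal sketch that fits the intercept-only model by least squares, using a design matrix with a single column of ones, and then builds the t-test from the residuals. It uses `np.linalg.lstsq` as one possible fitting routine; the variable names (`X`, `b_hat`, `se_b`, and so on) are illustrative.

```python
import numpy as np
from scipy import stats

d = np.array([-5.2, -1.3, -4.8, 0.7, -3.1, -6.4, -2.0, -0.5, -4.1, -3.7,
              -1.8, -5.5, -2.9, 1.2, -3.6, -4.3, -0.9, -2.7, -6.1, -3.0])
n = len(d)

# Design matrix for the intercept-only model: one column of ones
X = np.ones((n, 1))
coef, *_ = np.linalg.lstsq(X, d, rcond=None)
b_hat = coef[0]

# Residual standard deviation with n - 1 degrees of freedom (one fitted parameter)
residuals = d - X @ coef
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 1))

# Standard error, t-statistic, and two-sided p-value for H0: intercept = 0
se_b = sigma_hat / np.sqrt(n)
t_b = b_hat / se_b
p_b = 2 * stats.t.sf(abs(t_b), df=n - 1)

result = stats.ttest_1samp(d, 0)
print(f'least squares: b = {b_hat:.4f}, t = {t_b:.4f}, p = {p_b:.6f}')
print(f'ttest_1samp:   t = {result.statistic:.4f}, p = {result.pvalue:.6f}')
```

The fitted intercept is the sample mean, and the residual standard deviation is the sample standard deviation $S$, so the two routes give the same numbers.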

c) Now suppose a skeptic claims the program reduces anxiety by exactly 2 points ($\mu_d = -2$). To test $H_0: \mu_d = -2$, create a shifted dataset by subtracting $-2$ from each observation: `d_shifted = d - (-2)`. Then run `stats.ttest_1samp(d_shifted, 0)`. What are the t-statistic and p-value? Do you reject this null at $\alpha = 0.05$?

In [None]:
d_shifted = d - (-2)
result_shifted = stats.ttest_1samp(d_shifted, 0)
print(f't-statistic: {result_shifted.statistic:.4f}')
print(f'p-value:     {result_shifted.pvalue:.6f}')

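The t-statistic is about $-2.094$, just beyond the two-sided critical value $t_{0.975,\,19} \approx 2.093$, so the p-value lands marginally below 0.05. Strictly, we reject $H_0: \mu_d = -2$ at $\alpha = 0.05$, but only barely: the observed mean of $-3$ is about 2.09 standard errors from $-2$, which puts the result right at the threshold. It is best read as borderline evidence against the skeptic's claim rather than a decisive rejection.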
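The decision can be cross-checked against a 95% confidence interval: $H_0: \mu_d = -2$ is rejected at $\alpha = 0.05$ exactly when $-2$ falls outside the interval. Because this dataset puts $-2$ very close to the edge of the interval, the sketch below prints the endpoints with several decimals (variable names are illustrative):

```python
import numpy as np
from scipy import stats

d = np.array([-5.2, -1.3, -4.8, 0.7, -3.1, -6.4, -2.0, -0.5, -4.1, -3.7,
              -1.8, -5.5, -2.9, 1.2, -3.6, -4.3, -0.9, -2.7, -6.1, -3.0])
n = len(d)

d_bar = np.mean(d)
se = np.std(d, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% cutoff

# 95% confidence interval for the mean difference
ci_low = d_bar - t_crit * se
ci_high = d_bar + t_crit * se

p_value = stats.ttest_1samp(d, -2).pvalue
inside = bool(ci_low <= -2 <= ci_high)

print(f'95% CI: ({ci_low:.5f}, {ci_high:.5f})')
print(f'-2 inside the interval: {inside}')
print(f'p-value for H0: mu_d = -2: {p_value:.5f}')
```

The interval and the test always agree, since both compare the same distance $|\bar{d} - (-2)|$ against the same margin $t_{0.975,\,19} \cdot SE$.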

d) Verify that the shifting trick works: run `stats.ttest_1samp(d, -2)` (testing the original data against $-2$ directly). Compare to your result in part (c).

In [None]:
result_direct = stats.ttest_1samp(d, -2)
print(f't-statistic: {result_direct.statistic:.4f}')
print(f'p-value:     {result_direct.pvalue:.6f}')

The results are identical. Shifting the data and testing against zero is mathematically equivalent to testing the original data against the hypothesized value directly. The shifting trick works because subtracting a constant from every observation shifts the mean by that constant but doesn't change the standard deviation or sample size.
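The equivalence can be written out in one line. As a sketch, let $c$ be the hypothesized value and $d_i' = d_i - c$ the shifted data. Then $\bar{d}' = \bar{d} - c$, and $S' = S$ because every deviation $d_i' - \bar{d}' = d_i - \bar{d}$ is unchanged, so

$$t' = \frac{\bar{d}' - 0}{S'/\sqrt{n}} = \frac{\bar{d} - c}{S/\sqrt{n}}$$

which is the t-statistic for testing the original data against $c$ directly.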

e) In your own words, explain the connection between "computing a sample mean and testing whether it equals zero" and "fitting a linear model and testing whether the intercept equals zero." Why is this connection useful for Part 4?

Computing a sample mean *is* fitting a linear model — specifically, the intercept-only model $\hat{y} = b$, where the best $b$ is $\bar{y}$. Testing whether the mean equals zero is the same as testing $H_0: \beta_0 = 0$ in this model. The t-statistic, standard error, and p-value are identical either way.

This connection is useful because it places what we've been doing in a much more general framework. In Part 4, we add a real predictor variable: $\hat{y} = mx + b$. The slope $m$ gets its own sampling distribution, standard error, and p-value — just like $b$ did here. Testing whether $m = 0$ asks whether the predictor helps explain the outcome. The mechanics are the same (t-statistic = estimate / SE, compare to t-distribution), just applied to a richer model.
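As a preview of those mechanics, the sketch below fits a one-predictor model with `scipy.stats.linregress` and rebuilds the slope's p-value by hand. The data are simulated purely for illustration (the predictor `x`, the outcome `y`, and the coefficients used to generate them are hypothetical and unrelated to the wellness program):

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical data: not from the wellness program
rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
x = rng.uniform(0, 10, size=30)
y = 0.3 * x - 3 + rng.normal(0, 2, size=30)

fit = stats.linregress(x, y)

# Same recipe as for the intercept: t = estimate / SE, compared to a t-distribution.
# With two fitted parameters (slope and intercept), the degrees of freedom are n - 2.
t_slope = fit.slope / fit.stderr
p_slope = 2 * stats.t.sf(abs(t_slope), df=len(x) - 2)

print(f'slope m = {fit.slope:.4f}, SE = {fit.stderr:.4f}')
print(f't = {t_slope:.4f}, p (by hand) = {p_slope:.6f}, p (linregress) = {fit.pvalue:.6f}')
```

The only change from the intercept-only case is the degrees of freedom, which drop from $n - 1$ to $n - 2$ because the model now estimates two parameters.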