# Homework 3.4

*Homework is designed both to test your knowledge and to challenge you to apply familiar concepts to new applications. Answer clearly and completely. You are welcome and encouraged to work in groups, so long as the work you submit is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

```
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
```

### Context

In HW 3.3, we tested whether a campus wellness program reduced student anxiety using confidence intervals and hypothesis tests. Now we'll see that what we did was actually a special case of a more general framework: the **general linear model** (GLM).

We'll use the same paired differences from the wellness program. Here is a sample of $n = 20$ students' anxiety score changes ($d_i = \text{after} - \text{before}$):

```
d = np.array([-5.2, -1.3, -4.8, 0.7, -3.1, -6.4, -2.0, -0.5, -4.1, -3.7,
              -1.8, -5.5, -2.9, 1.2, -3.6, -4.3, -0.9, -2.7, -6.1, -3.0])
```

---

#### Q1. The Intercept-Only Model and MSE

The simplest GLM predicts every observation with the same constant: $\hat{y}_i = b$. The **mean squared error** measures how wrong the prediction is:

$$MSE(b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - b)^2$$

a) Write a function that takes an array of data and a value of $b$ and returns the MSE. Compute the MSE for $b = 0$, $b = -2$, and $b = -5$.

```
def mse(data, b):
    return np.mean((data - b)**2)
```
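Before applying the function to `d`, it can help to check it against a case small enough to verify by hand. The sketch below uses a hypothetical three-value array named `toy` (not part of the assignment data) and restates `mse` so that it runs on its own.

```python
import numpy as np

def mse(data, b):
    # Mean of the squared deviations from the constant prediction b
    return np.mean((data - b) ** 2)

# Hypothetical toy data, chosen so the arithmetic is easy to check by hand
toy = np.array([1.0, 2.0, 3.0])

# At b = 2 the squared errors are 1, 0, 1, so the MSE is 2/3
print(mse(toy, 2.0))

# At b = 0 the squared errors are 1, 4, 9, so the MSE is 14/3
print(mse(toy, 0.0))
```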

b) Compute the MSE for a range of $b$ values from $-8$ to $2$ and plot MSE as a function of $b$. At what value of $b$ does the MSE reach its minimum?

```
b_values = np.linspace(-8, 2, 200)
```
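One possible way to sweep the grid is to evaluate the MSE at every candidate and pick the smallest with `np.argmin`. This is only a sketch on a hypothetical array `toy`; the names `mse_values`, `i_best`, and `b_best` are illustrative, not required.

```python
import numpy as np

def mse(data, b):
    # Mean of the squared deviations from the constant prediction b
    return np.mean((data - b) ** 2)

# Hypothetical toy data standing in for the homework sample
toy = np.array([-4.0, -1.0, -1.0])

# Evaluate the MSE at each candidate value of b
b_values = np.linspace(-8, 2, 200)
mse_values = np.array([mse(toy, b) for b in b_values])

# Index and location of the smallest MSE on the grid
i_best = np.argmin(mse_values)
b_best = b_values[i_best]
print(b_best, mse_values[i_best])
```

Because the search runs over a finite grid, `b_best` locates the minimum only to within the grid spacing (here $10/199 \approx 0.05$). The same `b_values` and `mse_values` arrays can be passed straight to `plt.plot`.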

c) Compute the sample mean $\bar{d}$ of the data. Compare it to the $b$ that minimizes MSE. What do you notice?

d) Compute the MSE at $b = \bar{d}$ and compare it to the variance of the data (`np.var(d)`). What is the relationship?
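A note on `np.var`: by default it uses `ddof=0` and divides by $n$, while `ddof=1` divides by $n - 1$. The short sketch below, on a hypothetical array `toy`, shows that the two conventions give different numbers, which matters when you compare a variance to an MSE.

```python
import numpy as np

# Hypothetical toy data with mean 2.5; squared deviations sum to 5
toy = np.array([1.0, 2.0, 3.0, 4.0])

# Default ddof=0 divides the sum of squared deviations by n = 4
var_n = np.var(toy)

# ddof=1 divides by n - 1 = 3 instead
var_n_minus_1 = np.var(toy, ddof=1)

print(var_n, var_n_minus_1)
```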

e) In your own words, explain what it means to say "the sample mean is the parameter of the simplest linear model" and "the variance is the MSE of the best horizontal line through the data."

---

#### Q2. Sampling Distribution of the Model Parameter

When we fit the intercept-only model to a sample, the estimated $b$ equals the sample mean. Different samples give different $b$ values. Assume the population has $\mu_d = -3$ and $\sigma_d = 5$.

a) Simulate 1,000 samples of size $n = 20$. For each sample, fit the intercept-only model (i.e., compute the sample mean). Collect the 1,000 estimated $b$ values and plot their histogram.
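A minimal simulation sketch follows. It assumes the population is normally distributed, which the problem does not state explicitly; the seed and the names `rng`, `samples`, and `b_hats` are arbitrary choices for illustration.

```python
import numpy as np

# Assumption: a normal population with the stated mean and standard deviation
mu_d, sigma_d = -3.0, 5.0
n, n_sims = 20, 1000

# The seed is arbitrary; it only makes the run reproducible
rng = np.random.default_rng(0)

# Each row is one simulated sample of n paired differences
samples = rng.normal(loc=mu_d, scale=sigma_d, size=(n_sims, n))

# Fitting the intercept-only model to a sample means taking its mean
b_hats = samples.mean(axis=1)
print(b_hats.shape)
```

Drawing all samples as one two-dimensional array avoids an explicit Python loop; `plt.hist(b_hats)` would then produce the histogram.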

b) Compute the mean and standard deviation of your 1,000 $b$ values. Compare the mean to $\mu_d$ and the standard deviation to the theoretical standard error $\sigma_d / \sqrt{n}$.

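c) For each of your 1,000 samples, also compute the sample standard deviation $S$ and then the t-statistic centered on the population mean: $t = (b - \mu_d) / (S / \sqrt{n})$. Centering matters here because $\mu_d = -3 \neq 0$; without it, the statistics would not be centered at zero. Plot the histogram of the 1,000 t-statistics (with `density=True` so the scales match) and overlay a t-distribution with $df = n - 1 = 19$.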

```
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.t.pdf(x, df=19))
```
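The per-sample standard deviation and t-statistic can be computed in vectorized form, as in the sketch below. It again assumes a normal population and an arbitrary seed, and it centers each sample mean on $\mu_d$ so that the resulting statistics are comparable with a central t-distribution.

```python
import numpy as np

# Assumption: a normal population with the stated mean and standard deviation
mu_d, sigma_d = -3.0, 5.0
n, n_sims = 20, 1000

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
samples = rng.normal(loc=mu_d, scale=sigma_d, size=(n_sims, n))

# Sample mean and sample standard deviation (ddof=1) of each row
b_hats = samples.mean(axis=1)
S = samples.std(axis=1, ddof=1)

# Standardize each mean by its own estimated standard error
t_stats = (b_hats - mu_d) / (S / np.sqrt(n))
print(t_stats[:5])
```

Passing `density=True` to `plt.hist(t_stats, ...)` puts the histogram on the same scale as `stats.t.pdf`.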

d) In your own words, why does the t-statistic follow a t-distribution rather than a standard normal distribution, even though $b$ itself is (approximately) normally distributed across samples?

---

#### Q3. Hypothesis Testing as a Model

In HW 3.3, we tested $H_0: \mu_d = 0$ by computing $t = \bar{d} / (S / \sqrt{n})$. Now let's see the same test expressed in regression language.

a) Using our sample `d`, compute the sample mean, sample standard deviation, standard error, t-statistic (testing whether the mean is zero), and p-value.
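The sequence of calculations can be sketched on a hypothetical array `x` (not the homework data). The two-sided p-value uses `stats.t.sf`, the survival function of the t-distribution, with $n - 1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, not the homework sample
x = np.array([-2.0, 0.5, -1.5, -3.0, 1.0, -2.5])
n = len(x)

# Sample mean and sample standard deviation (ddof=1 divides by n - 1)
x_bar = x.mean()
s = x.std(ddof=1)

# Standard error of the mean and t-statistic for H0: mu = 0
se = s / np.sqrt(n)
t_stat = x_bar / se

# Two-sided p-value: probability in both tails beyond |t|
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)
```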

b) Now fit the intercept-only model using `scipy`: `stats.ttest_1samp(d, 0)`. This tests $H_0: \mu = 0$. Compare the t-statistic and p-value to what you computed in part (a).
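To see the regression side more literally, the intercept-only model can also be fit by ordinary least squares with a design matrix that is a single column of ones. The sketch below uses `np.linalg.lstsq` on a hypothetical array `y`; the names `X`, `b_hat`, and `se_b` are illustrative.

```python
import numpy as np

# Hypothetical toy data, not the homework sample
y = np.array([-2.0, 0.5, -1.5, -3.0, 1.0, -2.5])
n = len(y)

# Design matrix for an intercept-only model: one column of ones
X = np.ones((n, 1))

# Least-squares fit; coef[0] is the estimated intercept b
coef, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
b_hat = coef[0]

# Residual variance with n - 1 degrees of freedom (one parameter estimated)
resid = y - X @ coef
s2 = np.sum(resid ** 2) / (n - 1)

# Standard error of the intercept and t-statistic for H0: b = 0
se_b = np.sqrt(s2 / n)
t_stat = b_hat / se_b
print(b_hat, se_b, t_stat)
```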

c) Now suppose a skeptic claims the program reduces anxiety by exactly 2 points ($\mu_d = -2$). To test $H_0: \mu_d = -2$, create a shifted dataset by subtracting $-2$ from each observation: `d_shifted = d - (-2)`. Then run `stats.ttest_1samp(d_shifted, 0)`. What are the t-statistic and p-value? Do you reject this null at $\alpha = 0.05$?

d) Verify that the shifting trick works: run `stats.ttest_1samp(d, -2)` (testing the original data against $-2$ directly). Compare to your result in part (c).
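The reason shifting works is that subtracting a constant moves the sample mean by that constant but leaves the sample standard deviation unchanged. A small check on a hypothetical array `x` with a hypothetical null value `mu0` illustrates this.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data and null value, not the homework sample
x = np.array([-2.0, 0.5, -1.5, -3.0, 1.0, -2.5])
mu0 = -1.0

# Shift the data so that the null hypothesis becomes "mean equals zero"
x_shifted = x - mu0

# Shifting changes the mean by mu0 but not the spread
print(x.mean() - x_shifted.mean())
print(x.std(ddof=1), x_shifted.std(ddof=1))

# Test the shifted data against 0 and the original data against mu0
res_shifted = stats.ttest_1samp(x_shifted, 0)
res_direct = stats.ttest_1samp(x, mu0)
print(res_shifted.statistic, res_direct.statistic)
```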

e) In your own words, explain the connection between "computing a sample mean and testing whether it equals zero" and "fitting a linear model and testing whether the intercept equals zero." Why is this connection useful for Part 4?