## 14. Linear Regression

**Regression** is a method for studying the relationship between a **response variable** $Y$ and a **covariate** $X$.  The covariate is also called a **predictor variable** or **feature**.  Later we will generalize and allow for more than one covariate.  The data are of the form

$$ (Y_1, X_1), \dots, (Y_n, X_n) $$

One way to summarize the relationship between $X$ and $Y$ is through the **regression function**

$$ r(x) = \mathbb{E}(Y | X = x) = \int y f(y | x) dy $$

Most of this chapter is concerned with estimating the regression function.

### 14.1 Simple Linear Regression

The simplest version of regression is when $X_i$ is simple (a scalar, not a vector) and $r(x)$ is assumed to be linear:

$$r(x) = \beta_0 + \beta_1 x$$

This model is called the **simple linear regression model**.  Let $\epsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$.  Then:

$$
\begin{align}
\mathbb{E}(\epsilon_i | X_i) &= \mathbb{E}(Y_i - (\beta_0 + \beta_1 X_i) | X_i)\\
&= \mathbb{E}(Y_i | X_i) - (\beta_0 + \beta_1 X_i)\\
&= r(X_i) - (\beta_0 + \beta_1 X_i)\\
&= 0
\end{align}
$$

Let $\sigma^2(x) = \mathbb{V}(\epsilon_i | X_i = x)$.  We will make the further simplifying assumption that $\sigma^2(x) = \sigma^2$ does not depend on $x$.

**The Linear Regression Model**

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where $\mathbb{E}(\epsilon_i | X_i) = 0$ and $\mathbb{V}(\epsilon_i | X_i) = \sigma^2$.

The unknown parameters in the model are the intercept $\beta_0$, the slope $\beta_1$, and the variance $\sigma^2$.  Let $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimates of $\beta_0$ and $\beta_1$.  The **fitted line** is defined to be

$$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$$

The **predicted values** or **fitted values** are $\hat{Y}_i = \hat{r}(X_i)$ and the **residuals** are defined to be

$$\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$$

The **residual sum of squares** or RSS is defined by

$$ \text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2$$

The quantity RSS measures how well the fitted line fits the data.

The **least squares estimates** are the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $\text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2$.
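To make the definition concrete, here is a minimal sketch that evaluates the RSS of two candidate lines on a small made-up dataset. The data values and the function name `rss` are assumptions chosen for illustration; they do not come from the text.

```python
def rss(beta0, beta1, xs, ys):
    """Residual sum of squares of the line beta0 + beta1 * x on the data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

rss_a = rss(0.0, 2.0, xs, ys)  # candidate line y = 2x
rss_b = rss(1.0, 1.5, xs, ys)  # candidate line y = 1 + 1.5x

# The candidate with the smaller RSS fits these points more closely.
print(rss_a, rss_b)
```

On these points the first candidate has the much smaller RSS; the least squares estimates are the pair $(\hat{\beta}_0, \hat{\beta}_1)$ for which no other line does better.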

**Theorem 14.4**.  The least squares estimates are given by

$$
\begin{align}
\hat{\beta}_1 &= \frac{\sum_{i=1}^n (X_i - \overline{X}_n) (Y_i - \overline{Y}_n)}{\sum_{i=1}^n (X_i - \overline{X}_n)^2}\\
\hat{\beta}_0 &= \overline{Y}_n - \hat{\beta}_1 \overline{X}_n
\end{align}
$$
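The closed-form expressions translate directly into code. The following is a minimal sketch in plain Python; the function name `least_squares` and the five data points are hypothetical and serve only to illustrate the formulas.

```python
def least_squares(xs, ys):
    """Return (beta0_hat, beta1_hat) using the closed-form formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimate.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = least_squares(xs, ys)
print(beta0_hat, beta1_hat)
```

The slope formula requires that the $X_i$ are not all equal; otherwise the denominator is zero and the sketch raises `ZeroDivisionError`.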

An unbiased estimate of $\sigma^2$ is

$$
\hat{\sigma}^2 = \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2
$$
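As a sketch of how this estimate is computed in practice, the code below fits the line, forms the residuals, and divides the RSS by $n - 2$. The function name `sigma2_unbiased` and the data are assumptions made for illustration.

```python
def sigma2_unbiased(xs, ys):
    """Estimate sigma^2 as RSS / (n - 2) after a least squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    # Residuals of the fitted line.
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    rss = sum(e ** 2 for e in residuals)
    return rss / (n - 2)

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sigma2_hat = sigma2_unbiased(xs, ys)
print(sigma2_hat)
```

The divisor $n - 2$ means at least three observations are needed for the estimate to be defined.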

### 14.2 Least Squares and Maximum Likelihood

Suppose we add the assumption that $\epsilon_i | X_i \sim N(0, \sigma^2)$, that is,

$$Y_i | X_i \sim N(\mu_i, \sigma^2)$$

where $\mu_i = \beta_0 + \beta_1 X_i$.  The likelihood function is

$$
\begin{align}
\prod_{i=1}^n f(X_i, Y_i) &= \prod_{i=1}^n f_X(X_i) f_{Y|X}(Y_i | X_i)\\
&= \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y|X}(Y_i | X_i) \\
&= \mathcal{L}_1 \times \mathcal{L}_2
\end{align}
$$

where $\mathcal{L}_1 = \prod_{i=1}^n f_X(X_i)$ and $\mathcal{L}_2 = \prod_{i=1}^n f_{Y|X}(Y_i | X_i)$.

The term $\mathcal{L}_1$ does not involve the parameters $\beta_0$ and $\beta_1$.  We shall focus on the second term $\mathcal{L}_2$ which is called the **conditional likelihood**, given by

$$\mathcal{L}_2 \equiv \mathcal{L}(\beta_0, \beta_1, \sigma)
= \prod_{i=1}^n f_{Y|X}(Y_i | X_i)
\propto \sigma^{-n} \exp \left\{ - \frac{1}{2 \sigma^2} \sum_i (Y_i - \mu_i)^2 \right\}
$$

The conditional log-likelihood is

$$\ell(\beta_0, \beta_1, \sigma) = -n \log \sigma - \frac{1}{2 \sigma^2} \sum_{i=1}^n \left(Y_i - (\beta_0 + \beta_1 X_i) \right)^2$$

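To find the MLE of $(\beta_0, \beta_1)$ we maximize the conditional log-likelihood.  For any fixed $\sigma$, the first term does not depend on $(\beta_0, \beta_1)$ and the second term is a negative multiple of the sum of squared errors, so maximizing over $(\beta_0, \beta_1)$ is the same as minimizing the RSS.  Therefore, we have shown the following: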

**Theorem 14.7**.  Under the assumption of Normality, the least squares estimator is also the maximum likelihood estimator.
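The equivalence can be checked numerically. The sketch below searches a small grid of candidate $(\beta_0, \beta_1)$ values and confirms that the grid point maximizing the conditional log-likelihood is the one minimizing the RSS. The data, the grid, and the fixed value of `sigma` are all arbitrary assumptions for illustration.

```python
import math

def rss(beta0, beta1, xs, ys):
    """Sum of squared errors of the line beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

def cond_log_lik(beta0, beta1, sigma, xs, ys):
    """Conditional log-likelihood under Normal errors, up to a constant."""
    n = len(xs)
    return -n * math.log(sigma) - rss(beta0, beta1, xs, ys) / (2 * sigma ** 2)

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 0.5  # arbitrary fixed value; the argmax over beta does not depend on it

# Candidate intercepts in [-0.5, 0.5] and slopes in [1.5, 2.5].
grid = [(i * 0.05, 1.5 + j * 0.01) for i in range(-10, 11) for j in range(101)]

best_by_rss = min(grid, key=lambda b: rss(b[0], b[1], xs, ys))
best_by_lik = max(grid, key=lambda b: cond_log_lik(b[0], b[1], sigma, xs, ys))
print(best_by_rss, best_by_lik)
```

A grid search is used here only to keep the check transparent; in practice the closed-form least squares formulas give the maximizer directly.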

We can also maximize $\ell(\beta_0, \beta_1, \sigma)$ over $\sigma$ yielding the MLE

$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\epsilon}_i^2 $$

This estimator is similar to, but not identical to, the unbiased estimator.  Common practice is to use the unbiased estimator.
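The maximization over $\sigma$ can be carried out explicitly. As a brief sketch, hold $(\beta_0, \beta_1)$ at the least squares estimates and differentiate the conditional log-likelihood with respect to $\sigma$. The subscripts "MLE" and "unbiased" are labels introduced here only to tell the two estimators apart.

$$
\begin{align}
\frac{\partial \ell}{\partial \sigma} &= -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n \hat{\epsilon}_i^2 = 0 \\
\Rightarrow \quad \hat{\sigma}^2_{\text{MLE}} &= \frac{1}{n} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{n - 2}{n} \, \hat{\sigma}^2_{\text{unbiased}}
\end{align}
$$

The two estimators therefore differ only by the factor $(n - 2)/n$, which tends to 1 as $n$ grows.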

### 14.3 Properties of the Least Squares Estimators

**Theorem 14.8**.  Let $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$ denote the least squares estimators.  Then,

$$
\mathbb{E}(\hat{\beta} | X^n) = \begin{pmatrix}\beta_0 \\ \beta_1 \end{pmatrix}
$$

$$
\mathbb{V}(\hat{\beta} | X^n) = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} 
\frac{1}{n} \sum_{i=1}^n X_i^2 & -\overline{X}_n \\
-\overline{X}_n & 1
\end{pmatrix}
$$

where $s_X^2 = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)^2$.

The estimated standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained by taking the square roots of the corresponding diagonal terms of $\mathbb{V}(\hat{\beta} | X^n)$ and inserting the estimate $\hat{\sigma}$ for $\sigma$.  Thus,

$$
\begin{align}
\hat{\text{se}}(\hat{\beta}_0) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n}}\\
\hat{\text{se}}(\hat{\beta}_1) &= \frac{\hat{\sigma}}{s_X \sqrt{n}}
\end{align}
$$
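The standard error formulas can be evaluated as follows. This is a minimal sketch; the function name `standard_errors` and the data are hypothetical, and $\hat{\sigma}$ is taken to be the square root of the unbiased variance estimate.

```python
import math

def standard_errors(xs, ys):
    """Return (se(beta0_hat), se(beta1_hat)) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(rss / (n - 2))  # unbiased variance estimate
    s_x = math.sqrt(s_xx / n)             # s_X as defined with divisor n
    se_beta1 = sigma_hat / (s_x * math.sqrt(n))
    se_beta0 = se_beta1 * math.sqrt(sum(x * x for x in xs) / n)
    return se_beta0, se_beta1

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

se_beta0, se_beta1 = standard_errors(xs, ys)
print(se_beta0, se_beta1)
```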

We should write $\hat{\text{se}}(\hat{\beta}_0 | X^n)$ and $\hat{\text{se}}(\hat{\beta}_1 | X^n)$ but we will use the shorter notation $\hat{\text{se}}(\hat{\beta}_0)$ and $\hat{\text{se}}(\hat{\beta}_1)$.

**Theorem 14.9**. Under appropriate conditions we have:

1. (Consistency) $\hat{\beta}_0 \xrightarrow{\text{P}} \beta_0$ and $\hat{\beta}_1 \xrightarrow{\text{P}} \beta_1$

2. (Asymptotic Normality):

$$
\frac{\hat{\beta}_0 - \beta_0}{\hat{\text{se}}(\hat{\beta}_0)} \leadsto N(0, 1)
\quad \text{and} \quad
\frac{\hat{\beta}_1 - \beta_1}{\hat{\text{se}}(\hat{\beta}_1)} \leadsto N(0, 1)
$$

3. Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$ are

$$
\hat{\beta}_0 \pm z_{\alpha/2} \hat{\text{se}}(\hat{\beta}_0)
\quad \text{and} \quad
\hat{\beta}_1 \pm z_{\alpha/2} \hat{\text{se}}(\hat{\beta}_1)
$$

The Wald test for $H_0 : \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ is: reject $H_0$ if $|W| > z_{\alpha / 2}$ where $W = \hat{\beta}_1 / \hat{\text{se}}(\hat{\beta}_1)$.
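A sketch of the confidence interval and the Wald test for the slope is given below. It uses `statistics.NormalDist` from the Python standard library for the Normal quantile $z_{\alpha/2}$; the function name `wald_summary` and the data are assumptions. With only five points the asymptotic approximation is crude, so the numbers are purely illustrative.

```python
import math
from statistics import NormalDist

def wald_summary(xs, ys, alpha=0.05):
    """Approximate CI for the slope and Wald test of H0: beta1 = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(rss / (n - 2))
    se_beta1 = sigma_hat / math.sqrt(s_xx)  # equals sigma_hat / (s_X * sqrt(n))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 Normal quantile
    w = beta1 / se_beta1
    return {
        "beta1": beta1,
        "ci": (beta1 - z * se_beta1, beta1 + z * se_beta1),
        "W": w,
        "reject": abs(w) > z,
    }

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

result = wald_summary(xs, ys)
print(result)
```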

### 14.4 Prediction

Suppose we have estimated a regression model $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ from data $(X_1, Y_1), \dots, (X_n, Y_n)$.  We observe the value $X_* = x_*$ of the covariate for a new subject and we want to predict the outcome $Y_*$.  An estimate of $Y_*$ is

$$ \hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 X_*$$

Using the formula for the variance of the sum of two random variables,

$$ \mathbb{V}(\hat{Y}_*) = \mathbb{V}(\hat{\beta}_0 + \hat{\beta}_1 x_*) = \mathbb{V}(\hat{\beta}_0) + x_*^2 \mathbb{V}(\hat{\beta}_1) + 2 x_* \text{Cov}(\hat{\beta}_0, \hat{\beta}_1) $$

Theorem 14.8 gives the formulas for all terms in this equation.  The estimated standard error $\hat{\text{se}}(\hat{Y}_*)$ is the square root of this variance, with $\hat{\sigma}^2$ in place of $\sigma^2$.  However, **the confidence interval for $Y_*$ is not of the usual form** $\hat{Y}_* \pm z_{\alpha/2} \hat{\text{se}}(\hat{Y}_*)$, because $Y_*$ is itself a random variable with its own variance $\sigma^2$ rather than a fixed parameter.  The appendix explains why in detail.  The correct form is given in the following theorem.  We call the interval a **prediction interval**.
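As a sketch of how the entries of Theorem 14.8 combine, write $x_*$ for the observed covariate value and substitute the variances and the covariance of $\hat{\beta}_0$ and $\hat{\beta}_1$ into the variance of $\hat{Y}_*$:

$$
\begin{align}
\mathbb{V}(\hat{Y}_* | X^n) &= \mathbb{V}(\hat{\beta}_0 | X^n) + x_*^2 \, \mathbb{V}(\hat{\beta}_1 | X^n) + 2 x_* \, \text{Cov}(\hat{\beta}_0, \hat{\beta}_1 | X^n) \\
&= \frac{\sigma^2}{n s_X^2} \left( \frac{1}{n} \sum_{i=1}^n X_i^2 + x_*^2 - 2 x_* \overline{X}_n \right) \\
&= \frac{\sigma^2}{n s_X^2} \cdot \frac{1}{n} \sum_{i=1}^n (X_i - x_*)^2 \\
&= \sigma^2 \, \frac{\sum_{i=1}^n (X_i - x_*)^2}{n \sum_{i=1}^n (X_i - \overline{X}_n)^2}
\end{align}
$$

The last step uses $n s_X^2 = \sum_{i=1}^n (X_i - \overline{X}_n)^2$.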

**Theorem 14.11 (Prediction Interval)**.  Let

$$
\begin{align}
\hat{\xi}_n^2 &= \hat{\text{se}}^2(\hat{Y}_*) + \hat{\sigma}^2 \\
&= \hat{\sigma}^2 \left(\frac{\sum_{i=1}^n (X_i - X_*)^2}{n \sum_{i=1}^n (X_i - \overline{X}_n)^2} + 1 \right)
\end{align}
$$

An approximate $1 - \alpha$ prediction interval for $Y_*$ is

$$ \hat{Y}_* \pm z_{\alpha/2} \hat{\xi}_n$$
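The prediction interval can be computed as in the following sketch, which evaluates $\hat{\xi}_n^2$ from its closed form. The function name `prediction_interval`, the data, and the new covariate value `x_star=6.0` are hypothetical; $\hat{\sigma}^2$ is taken to be the unbiased estimate.

```python
import math
from statistics import NormalDist

def prediction_interval(xs, ys, x_star, alpha=0.05):
    """Return (y_hat_star, (lower, upper)) for a new observation at x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)
    # xi^2 = estimated variance of the fitted value plus the noise variance.
    ratio = sum((x - x_star) ** 2 for x in xs) / (n * s_xx)
    xi2_hat = sigma2_hat * (ratio + 1.0)
    y_hat_star = beta0 + beta1 * x_star
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(xi2_hat)
    return y_hat_star, (y_hat_star - half_width, y_hat_star + half_width)

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

y_hat_star, (lower, upper) = prediction_interval(xs, ys, x_star=6.0)
print(y_hat_star, lower, upper)
```

Because of the extra $\hat{\sigma}^2$ term, the interval does not shrink to zero width as $n$ grows; its width is bounded below by roughly $2 z_{\alpha/2} \hat{\sigma}$.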