## 14. Linear Regression

**Regression** is a method for studying the relationship between a **response variable** $Y$ and a **covariate** $X$.  The covariate is also called a **predictor variable** or **feature**.  Later we will generalize and allow for more than one covariate.  The data are of the form

$$ (Y_1, X_1), \dots, (Y_n, X_n) $$

One way to summarize the relationship between $X$ and $Y$ is through the **regression function**

$$ r(x) = \mathbb{E}(Y | X = x) = \int y f(y | x) dy $$

Most of this chapter is concerned with estimating the regression function.

### 14.1 Simple Linear Regression

The simplest version of regression is when $X_i$ is simple (a scalar, not a vector) and $r(x)$ is assumed to be linear:

$$r(x) = \beta_0 + \beta_1 x$$

This model is called the **simple linear regression model**.  Let $\epsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$.  Then:

$$
\begin{align}
\mathbb{E}(\epsilon_i | X_i) &= \mathbb{E}(Y_i - (\beta_0 + \beta_1 X_i) | X_i)\\
&= \mathbb{E}(Y_i | X_i) - (\beta_0 + \beta_1 X_i)\\
&= r(X_i) - (\beta_0 + \beta_1 X_i)\\
&= 0
\end{align}
$$

Let $\sigma^2(x) = \mathbb{V}(\epsilon_i | X_i = x)$.  We will make the further simplifying assumption that $\sigma^2(x) = \sigma^2$ does not depend on $x$.

**The Linear Regression Model**

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where $\mathbb{E}(\epsilon_i | X_i) = 0$ and $\mathbb{V}(\epsilon_i | X_i) = \sigma^2$.

The unknown parameters in the model are the intercept $\beta_0$, the slope $\beta_1$ and the variance $\sigma^2$.  Let $\hat{\beta}_0$ and $\hat{\beta}_1$ denote the estimates of $\beta_0$ and $\beta_1$.  The **fitted line** is defined to be

$$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$$

The **predicted values** or **fitted values** are $\hat{Y}_i = \hat{r}(X_i)$ and the **residuals** are defined to be

$$\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$$

The **residual sum of squares** or RSS is defined by

$$ \text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2$$

The quantity RSS measures how well the fitted line fits the data.

The **least squares estimates** are the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $\text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2$.

**Theorem 14.4**.  The least squares estimates are given by

$$
\begin{align}
\hat{\beta}_1 &= \frac{\sum_{i=1}^n (X_i - \overline{X}_n) (Y_i - \overline{Y}_n)}{\sum_{i=1}^n (X_i - \overline{X}_n)^2}\\
\hat{\beta}_0 &= \overline{Y}_n - \hat{\beta}_1 \overline{X}_n
\end{align}
$$

An unbiased estimate of $\sigma^2$ is

$$
\hat{\sigma}^2 = \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2
$$
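
The formulas of Theorem 14.4 translate almost line by line into code.  The following is a minimal sketch using NumPy; the function name `simple_least_squares` and the toy data are made up for illustration and are not part of the text.

```python
import numpy as np

def simple_least_squares(x, y):
    """Return (beta0_hat, beta1_hat, sigma2_hat) for the simple linear model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: centered cross-products over the centered sum of squares.
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    residuals = y - (beta0 + beta1 * x)
    # Unbiased variance estimate: RSS divided by n - 2.
    sigma2 = np.sum(residuals ** 2) / (n - 2)
    return beta0, beta1, sigma2

# Hypothetical toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0_hat, beta1_hat, sigma2_hat = simple_least_squares(x, y)
print(beta0_hat, beta1_hat, sigma2_hat)
```

The slope and intercept should agree with any general-purpose least squares routine, for example `np.polyfit(x, y, 1)`.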

### 14.2 Least Squares and Maximum Likelihood

Suppose we add the assumption that $\epsilon_i | X_i \sim N(0, \sigma^2)$, that is,

$$Y_i | X_i \sim N(\mu_i, \sigma^2)$$

where $\mu_i = \beta_0 + \beta_1 X_i$.  The likelihood function is

$$
\begin{align}
\prod_{i=1}^n f(X_i, Y_i) &= \prod_{i=1}^n f_X(X_i) f_{Y|X}(Y_i | X_i)\\
&= \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y|X}(Y_i | X_i) \\
&= \mathcal{L}_1 \times \mathcal{L}_2
\end{align}
$$

where $\mathcal{L}_1 = \prod_{i=1}^n f_X(X_i)$ and $\mathcal{L}_2 = \prod_{i=1}^n f_{Y|X}(Y_i | X_i)$.

The term $\mathcal{L}_1$ does not involve the parameters $\beta_0$ and $\beta_1$.  We shall focus on the second term $\mathcal{L}_2$ which is called the **conditional likelihood**, given by

$$\mathcal{L}_2 \equiv \mathcal{L}(\beta_0, \beta_1, \sigma)
= \prod_{i=1}^n f_{Y|X}(Y_i | X_i)
\propto \sigma^{-n} \exp \left\{ - \frac{1}{2 \sigma^2} \sum_i (Y_i - \mu_i)^2 \right\}
$$

The conditional log-likelihood is

$$\ell(\beta_0, \beta_1, \sigma) = -n \log \sigma - \frac{1}{2 \sigma^2} \sum_{i=1}^n \left(Y_i - (\beta_0 + \beta_1 X_i) \right)^2$$

To find the MLE of $(\beta_0, \beta_1)$ we maximize the conditional log likelihood. We can see from the equation above that this is the same as minimizing the RSS.  Therefore, we have shown the following:

**Theorem 14.7**.  Under the assumption of Normality, the least squares estimator is also the maximum likelihood estimator.

We can also maximize $\ell(\beta_0, \beta_1, \sigma)$ over $\sigma$ yielding the MLE

$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\epsilon}_i^2 $$

This estimator is similar to, but not identical to, the unbiased estimator.  Common practice is to use the unbiased estimator.

### 14.3 Properties of the Least Squares Estimators

**Theorem 14.8**.  Let $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$ denote the least squares estimators.  Then,

$$
\mathbb{E}(\hat{\beta} | X^n) = \begin{pmatrix}\beta_0 \\ \beta_1 \end{pmatrix}
$$

$$
\mathbb{V}(\hat{\beta} | X^n) = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} 
\frac{1}{n} \sum_{i=1}^n X_i^2 & -\overline{X}_n \\
-\overline{X}_n & 1
\end{pmatrix}
$$

where $s_X^2 = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)^2$.

The estimated standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained by taking the square roots of the corresponding diagonal terms of $\mathbb{V}(\hat{\beta} | X^n)$ and inserting the estimate $\hat{\sigma}$ for $\sigma$.  Thus,

$$
\begin{align}
\hat{\text{se}}(\hat{\beta}_0) &= \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n}}\\
\hat{\text{se}}(\hat{\beta}_1) &= \frac{\hat{\sigma}}{s_X \sqrt{n}}
\end{align}
$$

We should write $\hat{\text{se}}(\hat{\beta}_0 | X^n)$ and $\hat{\text{se}}(\hat{\beta}_1 | X^n)$ but we will use the shorter notation $\hat{\text{se}}(\hat{\beta}_0)$ and $\hat{\text{se}}(\hat{\beta}_1)$.

**Theorem 14.9**. Under appropriate conditions we have:

1. (Consistency) $\hat{\beta}_0 \xrightarrow{\text{P}} \beta_0$ and $\hat{\beta}_1 \xrightarrow{\text{P}} \beta_1$

2. (Asymptotic Normality):

$$
\frac{\hat{\beta}_0 - \beta_0}{\hat{\text{se}}(\hat{\beta}_0)} \leadsto N(0, 1)
\quad \text{and} \quad
\frac{\hat{\beta}_1 - \beta_1}{\hat{\text{se}}(\hat{\beta}_1)} \leadsto N(0, 1)
$$

3. Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$ are

$$
\hat{\beta}_0 \pm z_{\alpha/2} \hat{\text{se}}(\hat{\beta}_0)
\quad \text{and} \quad
\hat{\beta}_1 \pm z_{\alpha/2} \hat{\text{se}}(\hat{\beta}_1)
$$

The Wald test for $H_0 : \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ is: reject $H_0$ if $|W| > z_{\alpha / 2}$ where $W = \hat{\beta}_1 / \hat{\text{se}}(\hat{\beta}_1)$.
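
As an illustration, the standard errors, the normal-based confidence intervals and the Wald test can be computed as follows.  This is a sketch under the assumptions of this section; the function name, the dictionary keys and the toy data are hypothetical, and the normal quantile comes from the standard library's `statistics.NormalDist`.

```python
import numpy as np
from statistics import NormalDist

def simple_regression_inference(x, y, alpha=0.05):
    """Estimates, standard errors, normal-based intervals and Wald test for the slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar = x.mean()
    beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
    beta0 = y.mean() - beta1 * x_bar
    resid = y - (beta0 + beta1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
    s_x = np.sqrt(np.mean((x - x_bar) ** 2))   # s_X uses the 1/n convention
    se1 = sigma_hat / (s_x * np.sqrt(n))
    se0 = se1 * np.sqrt(np.mean(x ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)    # upper alpha/2 quantile of N(0, 1)
    wald = beta1 / se1
    return {
        "beta0": float(beta0), "beta1": float(beta1),
        "se0": float(se0), "se1": float(se1),
        "ci0": (float(beta0 - z * se0), float(beta0 + z * se0)),
        "ci1": (float(beta1 - z * se1), float(beta1 + z * se1)),
        "wald": float(wald),
        "reject": bool(abs(wald) > z),         # two-sided test of beta1 = 0
    }

# Hypothetical toy data, made up for illustration.
x = np.arange(1.0, 9.0)
y = np.array([3.2, 4.9, 7.1, 8.8, 11.3, 12.9, 15.2, 16.8])
result = simple_regression_inference(x, y)
print(result)
```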

### 14.4 Prediction

Suppose we have estimated a regression model $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ from data $(X_1, Y_1), \dots, (X_n, Y_n)$.  We observe the value $X_* = x_*$ of the covariate for a new subject and we want to predict the outcome $Y_*$.  An estimate of $Y_*$ is

$$ \hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 X_*$$

Using the formula for the variance of the sum of two random variables,

$$ \mathbb{V}(\hat{Y}_*) = \mathbb{V}(\hat{\beta}_0 + \hat{\beta}_1 x_*) = \mathbb{V}(\hat{\beta}_0) + x_*^2 \mathbb{V}(\hat{\beta}_1) + 2 x_* \text{Cov}(\hat{\beta}_0, \hat{\beta}_1) $$

Theorem 14.8 gives the formulas for all terms in this equation.  The estimated standard error $\hat{\text{se}}(\hat{Y}_*)$ is the square root of this variance, with $\hat{\sigma}^2$ in place of $\sigma^2$.  However, **the confidence interval for $Y_*$ is not of the usual form** $\hat{Y}_* \pm z_{\alpha/2} \hat{\text{se}}(\hat{Y}_*)$.  The appendix explains why.  The correct form is given in the following theorem.  We call the interval a **prediction interval**.

**Theorem 14.11 (Prediction Interval)**.  Let

$$
\begin{align}
\hat{\xi}_n^2 &= \hat{\text{se}}^2(\hat{Y}_*) + \hat{\sigma}^2 \\
&= \hat{\sigma}^2 \left(\frac{\sum_{i=1}^n (X_i - X_*)^2}{n \sum_{i=1}^n (X_i - \overline{X})^2} + 1 \right)
\end{align}
$$

An approximate $1 - \alpha$ prediction interval for $Y_*$ is

$$ \hat{Y}_* \pm z_{\alpha/2} \hat{\xi}_n$$
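
A minimal sketch of the prediction interval of Theorem 14.11 follows.  The function name `prediction_interval`, the toy data and the choice $x_* = 4.5$ are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(x, y, x_star, alpha=0.05):
    """Return (y_star_hat, (lower, upper)) for a new observation at x_star."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    sigma2_hat = np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2)
    y_star_hat = beta0 + beta1 * x_star
    # xi_hat^2 = se_hat^2(Y_hat_*) + sigma_hat^2, written as in Theorem 14.11.
    xi2_hat = sigma2_hat * (np.sum((x - x_star) ** 2) / (n * sxx) + 1.0)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(xi2_hat)
    return float(y_star_hat), (float(y_star_hat - half_width),
                               float(y_star_hat + half_width))

# Hypothetical toy data, made up for illustration.
x = np.arange(1.0, 9.0)
y = np.array([3.2, 4.9, 7.1, 8.8, 11.3, 12.9, 15.2, 16.8])
x_star = 4.5
y_hat, (lo, hi) = prediction_interval(x, y, x_star)
print(y_hat, lo, hi)
```

Note that the interval never shrinks to a point as $n$ grows: the $+1$ term keeps the width of order $\hat{\sigma}$.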

### 14.5 Multiple Regression

Now suppose we have $k$ covariates $X_1, \dots, X_k$.  The data are of the form

$$(Y_1, X_1), \dots, (Y_n, X_n)$$

where

$$ X_i = (X_{i1}, \dots, X_{ik}) $$

Here, $X_i$ is the vector of $k$ covariates for the $i$-th observation.  The linear regression model is

$$ Y_i = \sum_{j=1}^k \beta_j X_{ij} + \epsilon_i $$

for $i = 1, \dots, n$ where $\mathbb{E}(\epsilon_i | X_{i1}, \dots, X_{ik}) = 0$.  Usually we want to include an intercept in the model which we can do by setting $X_{i1} = 1$ for $i = 1, \dots, n$.  At this point it will become more convenient to express the model in matrix notation.  The outcomes will be denoted by

$$ Y = \begin{pmatrix}
Y_1 \\
Y_2 \\
\vdots \\
Y_n
\end{pmatrix}
$$

and the covariates will be denoted by

$$ X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{pmatrix}
$$

Each row is one observation; the columns correspond to the $k$ covariates.  Thus, $X$ is an $n \times k$ matrix.  Let

$$
\beta = \begin{pmatrix}
\beta_1 \\
\vdots \\
\beta_k
\end{pmatrix}
\quad \text{and} \quad
\epsilon = \begin{pmatrix}
\epsilon_1 \\
\vdots \\
\epsilon_n
\end{pmatrix}
$$

Then we can write the linear regression model as

$$ Y = X \beta + \epsilon $$

**Theorem 14.13**. Assuming that the $k \times k$ matrix $X^TX$ is invertible, the least squares estimate is

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$

The estimated regression function is

$$ \hat{r}(x) = \sum_{j=1}^k \hat{\beta}_j x_j$$

The variance-covariance matrix of $\hat{\beta}$ is

$$ \mathbb{V}(\hat{\beta} | X^n) = \sigma^2 (X^T X)^{-1} $$

An unbiased estimate of $\sigma^2$ is

$$ \hat{\sigma}^2 = \left( \frac{1}{n - k} \right) \sum_{i=1}^n \hat{\epsilon}_i^2 $$

where $\hat{\epsilon} = Y - X \hat{\beta}$ is the vector of residuals.  An approximate $1 - \alpha$ confidence interval for $\beta_j$ is

$$ \hat{\beta}_j \pm z_{\alpha/2} \hat{\text{se}}(\hat{\beta}_j) $$

where $\hat{\text{se}}^2(\hat{\beta}_j)$ is the $j$-th diagonal element of the matrix $\hat{\sigma}^2 (X^T X)^{-1}$.
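
The matrix formulas of Theorem 14.13 can be sketched as below.  The function name `ols` and the simulated data (sample size, true coefficients, noise level, seed) are assumptions chosen for illustration; the code solves the normal equations with `np.linalg.solve` rather than forming $(X^T X)^{-1} X^T Y$ explicitly.

```python
import numpy as np

def ols(X, y):
    """Least squares estimate, estimated standard errors and sigma^2 estimate."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)      # solves (X^T X) beta = X^T y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)          # unbiased estimate of sigma^2
    cov_hat = sigma2_hat * np.linalg.inv(XtX)     # estimated V(beta_hat | X)
    se_hat = np.sqrt(np.diag(cov_hat))
    return beta_hat, se_hat, sigma2_hat

# Simulated data; the first column of ones provides the intercept.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])            # hypothetical true coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, se_hat, sigma2_hat = ols(X, y)
print(beta_hat, se_hat, sigma2_hat)
```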

### 14.6 Model Selection

We may have data on many covariates but we may not want to include all of them in the model.  A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler).  Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases.  Too few covariates yields high bias; too many covariates yields high variance.  Good predictions result from achieving a good balance between bias and variance.

In model selection there are two problems: assigning a score to each model which measures, in some sense, how good the model is, and searching through all models to find the model with the best score.

Let $S \subset \{1, \dots, k\}$ and let $\mathcal{X}_S = \{ X_j : j \in S \}$ denote a subset of the covariates.  Let $\beta_S$ denote the coefficients of the corresponding set of covariates and let $\hat{\beta}_S$ denote the least squares estimate of $\beta_S$.  Let $X_S$ denote the $X$ matrix for this subset of covariates, and let $\hat{r}_S(x)$ denote the estimated regression function.  The predicted values from model $S$ are denoted by $\hat{Y}_i(S) = \hat{r}_S(X_i)$.

The **prediction risk** is defined to be

$$R(S) = \sum_{i=1}^n \mathbb{E} (\hat{Y}_i(S) - Y_i^*)^2 $$

where $Y_i^*$ denotes the value of the future observation of $Y_i$ at covariate value $X_i$.  Our goal is to choose $S$ to make $R(S)$ small.

The **training error** is defined to be

$$\hat{R}_\text{tr}(S) = \sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2 $$

This estimate is very biased and under-estimates $R(S)$.

**Theorem 14.15**.  The training error is a downward biased estimate of the prediction risk:

$$ \mathbb{E}(\hat{R}_\text{tr}(S)) < R(S) $$

In fact,

$$\text{bias}(\hat{R}_\text{tr}(S)) = \mathbb{E}(\hat{R}_\text{tr}(S)) - R(S) = -2 \sum_{i=1}^n \text{Cov}(\hat{Y}_i, Y_i)$$

The reason for the bias is that the data are being used twice: to estimate the parameters and to estimate the risk.  When fitting a model with many variables, the covariance $\text{Cov}(\hat{Y}_i, Y_i)$ will be large and the bias of the training error gets worse.

In summary, the training error is a poor estimate of risk.  Here are some better estimates.

**Mallows' $C_p$ statistic** is defined by

$$\hat{R}(S) = \hat{R}_\text{tr}(S) + 2 |S| \hat{\sigma}^2$$

where $|S|$ denotes the number of terms in $S$ and $\hat{\sigma}^2$ is the estimate of $\sigma^2$ obtained from the full model (with all covariates).  Think of the $C_p$ statistic as lack of fit plus complexity penalty.

A related method for estimating risk is **AIC (Akaike Information Criterion)**.  The idea is to choose $S$ to maximize

$$ \ell_S - |S|$$

where $\ell_S$ is the log-likelihood of the model evaluated at the MLE.  In linear regression with Normal errors, maximizing AIC is equivalent to minimizing Mallows' $C_p$; see exercise 8.

*Some texts use a slightly different definition of AIC which involves multiplying this definition by 2 or -2.  This has no effect on which model is selected.*

Yet another method for estimating risk is **leave-one-out cross-validation**.  In this case, the risk estimator is

$$\hat{R}_\text{CV}(S) = \sum_{i=1}^n (Y_i - \hat{Y}_{(i)})^2 $$

where $\hat{Y}_{(i)}$ is the prediction for $Y_i$ obtained by fitting the model with $Y_i$ omitted.  It can be shown that

$$\hat{R}_\text{CV}(S) = \sum_{i=1}^n \left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2 $$

where $U_{ii}(S)$ is the $i$-th diagonal element of the matrix

$$U(S) = X_S (X_S^T X_S)^{-1} X_S^T$$

Thus one need not actually drop each observation and re-fit the model.
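
The shortcut can be checked numerically against the brute-force procedure.  The sketch below uses simulated data and hypothetical function names; it computes the leave-one-out score once from the diagonal of $U$ and once by actually refitting the model $n$ times.

```python
import numpy as np

def loocv_shortcut(X, y):
    """Leave-one-out score from a single fit, using the diagonal of U."""
    U = X @ np.linalg.solve(X.T @ X, X.T)        # U = X (X^T X)^{-1} X^T
    y_hat = U @ y
    return float(np.sum(((y - y_hat) / (1.0 - np.diag(U))) ** 2))

def loocv_brute_force(X, y):
    """Leave-one-out score obtained by refitting with each observation dropped."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        total += (y[i] - X[i] @ beta_i) ** 2
    return float(total)

# Simulated data, made up for illustration.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

shortcut = loocv_shortcut(X, y)
brute = loocv_brute_force(X, y)
print(shortcut, brute)
```

The two numbers should agree up to floating-point error, while the shortcut needs only one fit.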

A generalization is **k-fold cross-validation**.  Here we divide the data into $k$ groups; often people take $k = 10$.  We omit one group of data and fit the models on the remaining data.  We use the fitted model to predict the data in the group that was omitted.  We then estimate the risk by $\sum_i (Y_i - \hat{Y}_i)^2$ where the sum is over the data points in the omitted group.  This process is repeated for each of the $k$ groups and the resulting risk estimates are averaged.
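
A sketch of this procedure for least squares is given below.  The function name `k_fold_risk`, the random assignment of observations to groups and the simulated data are assumptions; the argument is called `n_folds` to avoid a clash with $k$, the number of covariates.

```python
import numpy as np

def k_fold_risk(X, y, n_folds=10, seed=0):
    """Average over folds of the squared prediction error on the held-out group."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)   # random groups
    fold_risks = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Fit on the remaining data only.
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        # Sum of squared errors over the omitted group.
        fold_risks.append(np.sum((y[held_out] - X[held_out] @ beta) ** 2))
    return float(np.mean(fold_risks))

# Simulated data, made up for illustration.
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

risk_10 = k_fold_risk(X, y, n_folds=10)
print(risk_10)
```

With `n_folds` equal to the sample size, each group holds a single observation and the procedure reduces to leave-one-out cross-validation (up to the averaging factor $1/n$).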

For linear regression, Mallows' $C_p$ and cross-validation often yield essentially the same results so one might as well use Mallows' method.  In some of the more complex problems we will discuss later, cross-validation will be more useful.

Another scoring method is **BIC (Bayesian Information Criterion)**.  Here we choose a model to maximize

$$ \text{BIC}(S) = \ell_S - \frac{|S|}{2} \log n $$

The BIC score has a Bayesian interpretation.  Let $\mathcal{S} = \{ S_1, \dots, S_m \}$ denote a set of models.  Suppose we assign the uniform prior $\mathbb{P}(S_j) = 1 / m$ over the models.  Also assume we put a smooth prior on the parameters within each model.  It can be shown that the posterior probability for a model is approximately

$$ \mathbb{P}(S_j | \text{data}) \approx \frac{e^{\text{BIC}(S_j)}}{\sum_r e^{\text{BIC}(S_r)}}$$

so choosing the model with highest BIC is like choosing the model with highest posterior probability.

The BIC score also has an information-theoretical interpretation in terms of something called minimum description length.

The BIC score is identical to AIC (and hence closely related to Mallows' $C_p$) except that it puts a more severe penalty on complexity.  It thus leads one to choose a smaller model than the other methods.

If there are $k$ covariates then there are $2^k$ possible models. We need to search through all of those models, assign a score to each one, and choose the model with the best score.  When $k$ is large, this is infeasible; in that case, we need to search over a subset of all the models.  Two common methods are **forward and backward stepwise regression**.

In forward stepwise regression, we start with no covariates in the model, and keep adding variables one at a time that lead to the best score.  In backward stepwise regression, we start with the biggest model (all covariates) and drop one variable at a time.
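
Forward stepwise regression can be sketched as follows, here scored with Mallows' $C_p$.  The function names, the stopping rule (stop as soon as no single addition improves the score) and the simulated data are assumptions; since the chapter treats the intercept as a column of ones, the empty model predicts 0 and has $\text{RSS} = \sum_i Y_i^2$.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on the given columns."""
    if not cols:
        return float(y @ y)                       # empty model predicts 0
    Xs = X[:, cols]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
    r = y - Xs @ beta
    return float(r @ r)

def forward_stepwise_cp(X, y):
    """Greedy forward search; returns (selected columns, C_p score)."""
    n, k = X.shape
    sigma2_full = rss(X, y, list(range(k))) / (n - k)   # from the full model

    def cp(cols):
        return rss(X, y, cols) + 2 * len(cols) * sigma2_full

    selected = []
    best = cp(selected)
    while len(selected) < k:
        candidates = [j for j in range(k) if j not in selected]
        scores = {j: cp(selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:                # no addition improves C_p
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# Simulated data: columns 0-2 matter, columns 3-5 are pure noise.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

selected, score = forward_stepwise_cp(X, y)
print(selected, score)
```

The search fits at most $k(k+1)/2$ models instead of $2^k$, at the price of possibly missing the best-scoring subset.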

### 14.7 The Lasso

This method, due to Tibshirani, is called the **Lasso**. Assume that all covariates have been rescaled to have the same variance.  Consider estimating $\beta = (\beta_1, \dots, \beta_k)$ by minimizing the loss function

$$ \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^k | \beta_j |$$

where $\lambda > 0$.  The idea is to minimize the sums of squares but there is a penalty that gets large if any $\beta_j$ gets large.  It can be shown that some of the $\beta_j$'s will be 0.  We interpret this as having the $j$-th covariate omitted from the model; thus we are doing estimation and model selection simultaneously.

We need to choose a value of $\lambda$.  We can do this by estimating the prediction risk $R(\lambda)$ as a function of $\lambda$ and choosing $\lambda$ to minimize it.  For example, we can estimate the risk using leave-one-out cross-validation.
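
The text does not say how the Lasso objective is minimized.  One common approach, used here purely as an illustration, is cyclic coordinate descent: each $\beta_j$ is updated in turn by soft-thresholding, holding the others fixed.  With the objective written as $\sum_i (Y_i - \hat{Y}_i)^2 + \lambda \sum_j |\beta_j|$, the threshold is $\lambda / 2$.  The function names, the fixed number of sweeps and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_objective(X, y, beta, lam):
    return float(np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta)))

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum (y - X beta)^2 + lam * sum |beta_j| by cyclic updates."""
    n, k = X.shape
    beta = np.zeros(k)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(k):
            # Partial residual: remove every covariate's contribution except j's.
            partial = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta

# Simulated data; covariates are rescaled to a common variance and y is centered.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

lam = 60.0                                        # hypothetical penalty level
beta_lasso = lasso_coordinate_descent(X, y, lam)
print(beta_lasso)
```

Setting `lam = 0` recovers ordinary least squares, and a large enough `lam` sets every coefficient to exactly 0.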

### 14.8 Technical Appendix

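The prediction interval has a different form from the other confidence intervals we have seen because the quantity of interest $Y_*$ is not a fixed parameter: it is equal to a parameter plus a random variable.  Write $\theta = \beta_0 + \beta_1 x_*$ and $\hat{\theta} = \hat{\beta}_0 + \hat{\beta}_1 x_*$, so that $\hat{Y}_* = \hat{\theta}$ while $Y_* = \theta + \epsilon$, and let $s^2 = \mathbb{V}(\hat{\theta})$.  The usual interval $\hat{\theta} \pm z_{\alpha/2} s$ accounts for the variability of $\hat{\theta}$ but ignores the variance $\sigma^2$ of $\epsilon$, so it covers $Y_*$ too rarely.

We can fix this by defining: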

$$ \xi_n^2 = \mathbb{V}(\hat{Y}_*) + \sigma^2 = \left[\frac{\sum_i (x_i - x_*)^2}{n \sum_i (x_i - \overline{x})^2} + 1\right] \sigma^2$$

In practice, we substitute $\hat{\sigma}$ for $\sigma$ and we denote the resulting quantity by $\hat{\xi}_n$.  Now,

$$
\begin{align}
\mathbb{P}(\hat{Y}_* - z_{\alpha/2} \hat{\xi}_n < Y_* < \hat{Y}_* + z_{\alpha/2} \hat{\xi}_n) &=
\mathbb{P}\left(-z_{\alpha/2} < \frac{\hat{Y}_* - Y_*}{\hat{\xi}_n} < z_{\alpha/2} \right)\\
&= \mathbb{P}\left(-z_{\alpha/2} < \frac{\hat{\theta} - \theta - \epsilon}{\hat{\xi}_n} < z_{\alpha/2} \right) \\
&\approx \mathbb{P}\left(-z_{\alpha/2} < \frac{N(0, s^2 + \sigma^2)}{\hat{\xi}_n} < z_{\alpha/2} \right)  \\
&\approx \mathbb{P}\left(-z_{\alpha/2} < \frac{N(0, s^2 + \sigma^2)}{\xi_n} < z_{\alpha/2} \right)  \\
&= \mathbb{P}(-z_{\alpha/2} < N(0, 1) < z_{\alpha/2}) \\
&= 1 - \alpha
\end{align}
$$
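
A small simulation makes the difference concrete.  The sketch below, with hypothetical parameter values and function name, estimates how often the prediction interval $\hat{Y}_* \pm z_{\alpha/2} \hat{\xi}_n$ and the naive interval $\hat{Y}_* \pm z_{\alpha/2} \hat{\text{se}}(\hat{Y}_*)$ contain a fresh observation $Y_*$.

```python
import numpy as np
from statistics import NormalDist

def coverage(n=50, x_star=0.5, alpha=0.05, n_sims=2000, seed=0):
    """Monte Carlo coverage of the prediction interval and of the naive interval."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    beta0, beta1, sigma = 1.0, 2.0, 1.0          # hypothetical true parameters
    hits_pred = 0
    hits_naive = 0
    for _ in range(n_sims):
        x = rng.uniform(0.0, 1.0, size=n)
        y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)
        sxx = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
        b0 = y.mean() - b1 * x.mean()
        s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
        se2 = s2 * np.sum((x - x_star) ** 2) / (n * sxx)   # estimated V(Y_hat_*)
        y_hat = b0 + b1 * x_star
        y_new = beta0 + beta1 * x_star + rng.normal(scale=sigma)
        hits_naive += int(abs(y_new - y_hat) < z * np.sqrt(se2))
        hits_pred += int(abs(y_new - y_hat) < z * np.sqrt(se2 + s2))
    return hits_pred / n_sims, hits_naive / n_sims

cov_pred, cov_naive = coverage()
print(cov_pred, cov_naive)
```

Under these assumptions the prediction interval should cover close to the nominal 95% of the time, while the naive interval covers far less often.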

### 14.9 Exercises

**Exercise 14.9.1**.  Prove Theorem 14.4:

The least square estimates are given by

$$
\begin{align}
\hat{\beta}_1 &= \frac{\sum_{i=1}^n (X_i - \overline{X}_n) (Y_i - \overline{Y}_n)}{\sum_{i=1}^n (X_i - \overline{X}_n)^2}\\
\hat{\beta}_0 &= \overline{Y}_n - \hat{\beta}_1 \overline{X}_n
\end{align}
$$

An unbiased estimate of $\sigma^2$ is

$$
\hat{\sigma}^2 = \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2
$$

**Solution**.  We can obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ by minimizing the RSS -- by taking the partial derivatives with respect to $\beta_0$ and $\beta_1$:

$$\text{RSS} = \sum_i \hat{\epsilon}_i^2 = \sum_i (Y_i - (\beta_0 + \beta_1 X_i))^2$$

Differentiating RSS with respect to $\beta_0$:

$$\frac{d}{d \beta_0}\text{RSS} = \sum_i \frac{d}{d \beta_0} (Y_i - (\beta_0 + \beta_1 X_i))^2
= \sum_i 2 (\beta_0 - (Y_i - \beta_1 X_i))$$

Making this derivative equal to 0 at $\hat{\beta}_0$, $\hat{\beta}_1$ gives:

$$
\begin{align}
0 &= \sum_i 2 (\hat{\beta}_0 - (Y_i - \hat{\beta}_1 X_i))\\
n \hat{\beta}_0 &= \sum_i Y_i - \hat{\beta}_1 \sum_i X_i\\
\hat{\beta}_0 &= \overline{Y}_n - \hat{\beta}_1 \overline{X}_n
\end{align}
$$

Substituting $\overline{Y}_n - \beta_1 \overline{X}_n$ for $\beta_0$ and differentiating with respect to $\beta_1$:

$$\frac{d}{d \beta_1}\text{RSS} = \sum_i \frac{d}{d \beta_1} (Y_i - (\beta_0 + \beta_1 X_i))^2
= \sum_i \frac{d}{d \beta_1} (Y_i - \overline{Y}_n - \beta_1 (X_i - \overline{X}_n))^2
= \sum_i -2 (X_i - \overline{X}_n) (Y_i - \overline{Y}_n - \beta_1 (X_i - \overline{X}_n))$$

Making this derivative equal to 0 at $\hat{\beta}_1$ gives:

$$
\begin{align}
0 &= \sum_i (X_i - \overline{X}_n) (Y_i - \overline{Y}_n - \hat{\beta}_1 (X_i - \overline{X}_n)) \\
0 &= \hat{\beta}_1 \sum_i (\overline{X}_n - X_i)^2 + \sum_i (\overline{X}_n - X_i)(Y_i - \overline{Y}_n) \\
\hat{\beta}_1 &= \frac{\sum_i (X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sum_i (X_i - \overline{X}_n)^2}
\end{align}
$$

For the unbiased estimate, let's adapt a more general proof from Greene (2003), restricted to $k = 2$ dimensions, where the first dimension is set to all ones to represent the intercept, and the second dimension represents the one-dimensional covariates $X_i$.

The vector of least squares residuals is

$$ \hat{\epsilon} = \begin{pmatrix}
\hat{\epsilon}_1 \\
\hat{\epsilon}_2 \\
\vdots \\
\hat{\epsilon}_n
\end{pmatrix} = \begin{pmatrix}
Y_1 - (\hat{\beta}_0 \cdot 1 + \hat{\beta}_1 X_1) \\
Y_2 - (\hat{\beta}_0 \cdot 1 + \hat{\beta}_1 X_2) \\
\vdots \\
Y_n - (\hat{\beta}_0 \cdot 1 + \hat{\beta}_1 X_n)
\end{pmatrix} = 
y - X \hat{\beta}
$$

where

$$
y = \begin{pmatrix}
Y_1 \\
Y_2 \\
\vdots \\
Y_n
\end{pmatrix}
, \quad
X = \begin{pmatrix}
1 & X_1 \\
1 & X_2 \\
\vdots & \vdots \\
1 & X_n
\end{pmatrix},
\quad \text{and} \quad
\hat{\beta} = \begin{pmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1
\end{pmatrix}
$$

The least squares solution can be written as:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Substituting this into the definition of $\hat{\epsilon}$, we get

$$ \hat{\epsilon} = y - X (X^T X)^{-1} X^T y = (I - X (X^T X)^{-1} X^T) y = M y$$

where $M = I - X (X^T X)^{-1} X^T$ is known as the **residual maker** matrix.

Note that $M$ is symmetric, that is, $M^T = M$:

$$
\begin{align}
M^T &= (I - X (X^T X)^{-1} X^T)^T  \\
&= I^T - (X (X^T X)^{-1} X^T)^T \\
&= I - (X^T)^T ((X^T X)^{-1})^T X^T \\
&= I - X ((X^T X)^T)^{-1} X^T \\
&= I -  X (X^T X)^{-1} X^T \\
&= M
\end{align}
$$

Note also that $M$ is idempotent, that is, $M^2 = M$:

$$
\begin{align}
M^2 &= (I - X (X^T X)^{-1} X^T) (I - X (X^T X)^{-1} X^T) \\
&= I - X (X^T X)^{-1} X^T - X (X^T X)^{-1} X^T + X \left( (X^T X)^{-1} X^T X \right) (X^T X)^{-1} X^T \\
&= I - X (X^T X)^{-1} X^T - X (X^T X)^{-1} X^T + X (X^T X)^{-1} X^T \\
&= I - X (X^T X)^{-1} X^T \\
&= M
\end{align}
$$

Now, $MX = X - X (X^T X)^{-1} X^T X = 0$.  Intuitively, regressing each column of $X$ on $X$ itself gives a perfect fit, so all residuals are zero.  So:

$$\hat{\epsilon} = M y = M( X \beta + \epsilon) = M\epsilon$$

where $\epsilon = y - X\beta$ is the vector of true (population) errors.

We can then write the residual sum of squares in terms of $\epsilon$:

$$ \hat{\epsilon}^T \hat{\epsilon} = \epsilon^T M^T M \epsilon = \epsilon^T M^2 \epsilon = \epsilon^T M \epsilon$$

Taking the expectation conditional on $X$ on both sides,

$$\mathbb{E}(\hat{\epsilon}^T \hat{\epsilon} | X) = \mathbb{E}(\epsilon^T M \epsilon | X)$$

But $\epsilon^T M \epsilon$ is a scalar ($1 \times 1$ matrix), so it is equal to its trace -- and we can use the cyclic permutation property of the trace:

$$ \mathbb{E}(\epsilon^T M \epsilon | X) = \mathbb{E}(\text{tr}(\epsilon^T M \epsilon) | X) = \mathbb{E}(\text{tr}(M \epsilon \epsilon^T) | X)$$

Since $M$ is a function of $X$, we can take it out of the conditional expectation, and since the errors are uncorrelated with common variance, $\mathbb{E}(\epsilon \epsilon^T | X) = \sigma^2 I_n$:

$$\mathbb{E}(\text{tr}(M \epsilon \epsilon^T) | X) = \text{tr}(\mathbb{E}(M \epsilon \epsilon^T | X))
= \text{tr}(M \mathbb{E}(\epsilon \epsilon^T | X))
= \text{tr}(M \sigma^2 I_n)
= \sigma^2 \text{tr}(M)
$$

Finally, we can compute the trace of $M$:

$$
\begin{align}
\text{tr}(M) &= \text{tr}(I_n - X(X^T X)^{-1}X^T)\\
&= \text{tr}(I_n) - \text{tr}(X(X^T X)^{-1}X^T) \\
&= \text{tr}(I_n) - \text{tr}((X^T X)^{-1}X^T X) \\
&= \text{tr}(I_n) - \text{tr}(I_k)  \\
&= n - k
\end{align}
$$

Therefore, the unbiased estimator is

$$\hat{\sigma}^2 = \frac{\hat{\epsilon}^T \hat{\epsilon}}{n - k}$$

or, for our case where $k = 2$,

$$\hat{\sigma}^2 = \left( \frac{1}{n - 2} \right) \sum_{i=1}^n \hat{\epsilon}_i^2$$
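
The properties of $M$ used in this proof are easy to check numerically.  The following sketch builds $M$ for a small simulated design (the sample size and seed are arbitrary choices) and verifies symmetry, idempotence, $MX = 0$ and $\text{tr}(M) = n - k$.

```python
import numpy as np

# Small simulated design with an intercept column and one covariate (k = 2).
rng = np.random.default_rng(0)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Residual maker matrix M = I - X (X^T X)^{-1} X^T.
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

symmetric = np.allclose(M, M.T)          # M^T = M
idempotent = np.allclose(M @ M, M)       # M^2 = M
annihilates_X = np.allclose(M @ X, 0.0)  # M X = 0
trace_M = np.trace(M)                    # should equal n - k up to rounding

print(symmetric, idempotent, annihilates_X, trace_M)
```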

Reference:  Greene, William H. Econometric analysis. Pearson Education India, 2003.  Chapter 4, pages 61-62.

**Exercise 14.9.2**.  Prove the formulas for the standard errors in Theorem 14.8.  You should regard the $X_i$'s as fixed constants.

$$
\mathbb{E}(\hat{\beta} | X^n) = \begin{pmatrix}\beta_0 \\ \beta_1 \end{pmatrix}
\quad \text{and} \quad
\mathbb{V}(\hat{\beta} | X^n) = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} 
\frac{1}{n} \sum_{i=1}^n X_i^2 & -\overline{X}_n \\
-\overline{X}_n & 1
\end{pmatrix}
$$

$$s_X^2 = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)^2$$

$$
\hat{\text{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n}}
\quad \text{and} \quad
\hat{\text{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X \sqrt{n}}
$$

**Solution**.  

The formulas follow immediately from the diagonal elements of $\mathbb{V}(\hat{\beta} | X^n)$ in Theorem 14.8, substituting $\hat{\sigma}$ for $\sigma$:

$$
\hat{\text{se}}(\hat{\beta}_0)^2 = \frac{\hat{\sigma}^2}{n s_X^2} \frac{\sum_{i=1}^n X_i^2}{n}
\quad \text{and} \quad
\hat{\text{se}}(\hat{\beta}_1)^2 = \frac{\hat{\sigma}^2}{n s_X^2} \cdot 1$$

Results follow by taking the square root.

We will also prove the variance matrix result itself, using the notation and proof from Exercise 14.9.1 (again, adapting from Greene):

$$
\hat{\beta} = (X^T X)^{-1}X^T y = (X^T X)^{-1}X^T (X \beta + \epsilon) = \beta + (X^T X)^{-1}X^T \epsilon
$$

Taking the variance conditional on $X$,

$$
\begin{align}
\mathbb{V}(\hat{\beta} | X) &= \mathbb{V}(\hat{\beta} - \beta | X) \\
&= \mathbb{E}((\hat{\beta} - \beta)(\hat{\beta} - \beta)^T | X)  \\
&= \mathbb{E}((X^T X)^{-1}X^T \epsilon\epsilon^T X(X^T X)^{-1} | X) \\
&= (X^T X)^{-1}X^T \mathbb{E}(\epsilon\epsilon^T | X) X(X^T X)^{-1} \\
&= (X^T X)^{-1}X^T \sigma^2 I X(X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1} 
\end{align}
$$

But we have:

$$ X^T X = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
X_1 & X_2 & \cdots & X_n
\end{pmatrix} \begin{pmatrix}
1 & X_1 \\
1 & X_2 \\
\vdots & \vdots \\
1 & X_n
\end{pmatrix} = 
n \begin{pmatrix}
1 & \overline{X}_n \\
\overline{X}_n & \frac{1}{n}\sum_{i=1}^n X_i^2
\end{pmatrix}
$$

so we can verify its inverse is

$$(X^T X)^{-1} = \frac{1}{n s_X^2} \begin{pmatrix}
\frac{1}{n} \sum_{i=1}^n X_i^2 & - \overline{X}_n \\
-\overline{X}_n & 1
\end{pmatrix}$$

and so the result follows.
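
As a sanity check, the closed-form inverse can be compared with a numerical inverse.  The covariate values below are made up for illustration.

```python
import numpy as np

# Hypothetical covariate values, made up for illustration.
x = np.array([0.5, 1.5, 2.0, 3.5, 4.0, 6.0])
n = x.size
X = np.column_stack([np.ones(n), x])

s2_x = np.mean((x - x.mean()) ** 2)      # s_X^2 with the 1/n convention
closed_form = (1.0 / (n * s2_x)) * np.array([
    [np.mean(x ** 2), -x.mean()],
    [-x.mean(),       1.0],
])
numeric = np.linalg.inv(X.T @ X)

print(np.allclose(closed_form, numeric))
```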

Reference:  Greene, William H. Econometric analysis. Pearson Education India, 2003.  Chapter 4, page 59.

**Exercise 14.9.3**.  Consider the **regression through the origin** model:

$$Y_i = \beta X_i + \epsilon_i$$

Find the least squares estimate for $\beta$.  Find the standard error of the estimate.  Find conditions that guarantee that the estimate is consistent.

**Solution**.  Once more adopting notation from the solution of 14.9.1,  let

$$ y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},
\quad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}$$

and $\beta$ is a scalar (or a $1 \times 1$ matrix).

The least squares estimate is, again,

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

which simplifies in this one-dimensional case to:

$$\hat{\beta} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}$$

The unbiased estimator for $\sigma^2$ is, with $k = 1$,

$$\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^n \hat{\epsilon}_i^2$$

and the variance of $\hat{\beta}$ conditional on $X$ is:

$$\mathbb{V}(\hat{\beta} | X) = \sigma^2 (X^T X)^{-1} = \frac{\sigma^2}{\sum_{i=1}^n X_i^2}$$

so the standard error of the estimate is

$$\hat{\text{se}}(\hat{\beta}) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n X_i^2}}$$

These, of course, assume that $X^T X$ is invertible, that is, that $\sum_{i=1}^n X_i^2 > 0$.  This fails only when every $X_i$ is 0, in which case $\hat{\beta}$ has no effect on the predictions and the system is underdetermined.

Finally, note that $\hat{\beta}$ is also the MLE for regression through the origin under Normal errors.  If each error $\epsilon_i = Y_i - \beta X_i$ is drawn from a normal distribution $N(0, \sigma^2)$, the log-likelihood for a given parameter $\beta$ is

$$\ell_n(\beta) = -\frac{n}{2} \log \sigma^2 - \frac{1}{2 \sigma^2} \sum_{i=1}^n (Y_i - \beta X_i)^2 + C$$

and so maximizing it is equivalent to minimizing $\sum_{i=1}^n (Y_i - \beta X_i)^2$, which is what the least squares procedure does.

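Since the MLE is consistent under the usual regularity conditions, the least squares estimate is also consistent.  More directly, $\hat{\beta}$ is unbiased with conditional variance $\sigma^2 / \sum_{i=1}^n X_i^2$, so by Chebyshev's inequality it is consistent whenever $\sum_{i=1}^n X_i^2 \to \infty$ as $n \to \infty$.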
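
The estimate and its standard error for regression through the origin can be sketched as follows.  The function name `origin_regression` and the toy data are hypothetical.

```python
import numpy as np

def origin_regression(x, y):
    """Least squares slope and estimated standard error for Y = beta * X + eps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sum_x2 = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sum_x2
    resid = y - beta_hat * x
    sigma2_hat = np.sum(resid ** 2) / (n - 1)    # k = 1 parameter
    se_hat = np.sqrt(sigma2_hat / sum_x2)
    return float(beta_hat), float(se_hat)

# Hypothetical toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 6.2, 8.8, 12.1, 15.0])
beta_hat, se_hat = origin_regression(x, y)
print(beta_hat, se_hat)
```

The standard error shrinks as $\sum_i X_i^2$ grows, which is the condition for consistency noted above.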