# MLE for linear regression

In this section we will see that the maximum likelihood estimates of the parameters for linear regression *are* the least squares estimates.

Suppose there is a quantitative variable $y$ that you want to predict or model, along with a set of $p$ features $x_1, x_2, \dots, x_p$.  The multiple linear regression model regressing $y$ on $x_1, \dots, x_p$ is:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon = \vec{x} \cdot \vec{\beta} + \epsilon,
$$

where $\beta_0, \dots, \beta_p \in \mathbb{R}$ are constants, $\vec{\beta}$ is the $(p+1)$-vector of the $\beta_i$ in numerical order, 

$$
\vec{x} = \left(1, x_1, x_2, \dots, x_p \right)^\top,
$$

and $\epsilon \sim N(0,\sigma^2)$ is an error term independent of $\vec{x}$.  Note that we have "padded" $\vec{x}$ with an initial one to capture the constant term.

Suppose that we have $n$ independent observations $(\vec{x}_i, y_i)$.  Then the likelihood function is:

$$
L(\vec{\beta}, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_i - \vec{x}_i^\top \vec{\beta})^2}{2\sigma^2} \right)
$$

Thus the negative log likelihood is

$$
\ell(\vec{\beta}, \sigma) = n\log(\sqrt{2\pi}) + n\log(\sigma) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \vec{x}_i^\top \vec{\beta})^2
$$

Package the $\vec{x}_i$ and $y_i$ into an $n \times (p+1)$ matrix $X$ and an $n$-vector $\vec{y}$:

$$
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1p}\\
1 & x_{21} & x_{22} & \dots & x_{2p}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{n1} & x_{n2} & \dots & x_{np}
\end{bmatrix}
\qquad
\vec{y} = 
\begin{bmatrix}
y_1\\y_2\\ \vdots \\ y_n
\end{bmatrix}
$$

Then we can rewrite the negative log likelihood, which we will treat as our loss function, as

$$
\ell(\vec{\beta}, \sigma) = n\log(\sqrt{2\pi}) + n\log(\sigma) + \frac{1}{2\sigma^2}|\vec{y} - X \vec{\beta}|^2
$$
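As a quick sanity check, here is a minimal NumPy sketch of the matrix form of the negative log likelihood, compared against summing the log of each Gaussian density one observation at a time.  The function name `neg_log_likelihood`, the synthetic data, and the "true" parameter values are all made up for illustration.

```python
import numpy as np

def neg_log_likelihood(beta, sigma, X, y):
    """Negative log likelihood of the Gaussian linear model, in matrix form."""
    n = len(y)
    residual = y - X @ beta
    return (n * np.log(np.sqrt(2 * np.pi))
            + n * np.log(sigma)
            + residual @ residual / (2 * sigma**2))

# Synthetic data (assumed for illustration): n = 50 observations, p = 2 features.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # padded with a column of ones
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.7
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Cross-check: take minus the sum of the logs of the individual Gaussian densities.
densities = (np.exp(-(y - X @ beta_true) ** 2 / (2 * sigma_true**2))
             / np.sqrt(2 * np.pi * sigma_true**2))
nll_termwise = -np.sum(np.log(densities))
nll_matrix = neg_log_likelihood(beta_true, sigma_true, X, y)

print(nll_matrix, nll_termwise)  # the two values should agree up to rounding
```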

We want to *minimize* this with respect to $\vec{\beta}$ and $\sigma$.

For any fixed $\sigma > 0$, the loss is increasing in $|\vec{y} - X \vec{\beta}|^2$, and $\vec{\beta}$ appears nowhere else in the loss.  So we can first minimize this quantity over $\vec{\beta}$ and then find the minimizing value of $\sigma$.

This shows us that the maximum likelihood estimate of the parameters is the least squares estimate!  This will not be true of all statistical learning techniques, but it is true of linear regression.

I want to point out the crucial role played by the form of the Gaussian in making this happen! The Gaussian is an exponentiated quadratic.  In any likelihood calculation we will be taking a product over independent events.  The exponential function converts sums in the domain into products in the codomain, so a product of Gaussians is the exponential of a sum of quadratics.  Taking the logarithm then leaves us with a quadratic form as part of our loss function.

## Two ways to minimize $|\vec{y} - X \vec{\beta}|^2$

We will show how to minimize $|\vec{y} - X \vec{\beta}|^2$ using both linear algebra and multivariable differential calculus. 

While the linear algebra approach is much cleaner and provides more insight in this case, the differential calculus approach is more general and will be applicable to many other machine learning algorithms we meet later on in the bootcamp.

### The linear algebra approach

We are looking for 

$$
\operatorname*{argmin}_{\vec{\beta} \in \mathbb{R}^{p+1}} |\vec{y} - X \vec{\beta}|^2
$$

In other words, we have a vector $\vec{y} \in \mathbb{R}^n$ and we are looking for the point of the subspace $\operatorname{Im}(X) \subset \mathbb{R}^n$ which is closest to it.

So, more geometrically, we want to find $\hat{\beta}$ so that $X\hat{\beta}$ is the orthogonal projection of $\vec{y}$ onto the subspace $\operatorname{Im}(X)$.


<center>
<img src="math_hour_assets/least_squares.png" width="400" >
</center>


Since the "vector of residuals" $\vec{y} - X \hat{\beta}$ is orthogonal to $\operatorname{Im}(X)$, we have

$$
\begin{align*}
& X^\top (\vec{y} - X \hat{\beta}) = 0\\
& X^\top X \hat{\beta} = X^\top \vec{y}\\
& \hat{\beta} = (X^\top X)^{-1} X^\top \vec{y}
\end{align*}
$$
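Here is a small NumPy sketch of the normal equations on synthetic data (the data and coefficient values are made up for illustration).  It solves $X^\top X \hat{\beta} = X^\top \vec{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, compares the answer with `np.linalg.lstsq`, and checks that the residual vector is orthogonal to every column of $X$.

```python
import numpy as np

# Synthetic data (assumed for illustration): n = 100 observations, p = 3 features.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X beta = X^T y without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's built-in least squares solver should give the same coefficients.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual vector should be orthogonal to Im(X), i.e. X^T (y - X beta_hat) = 0.
residual = y - X @ beta_hat
orthogonality_check = X.T @ residual

print(beta_hat)
print(beta_lstsq)
print(orthogonality_check)  # entries should be zero up to rounding error
```

In practice a solver like `lstsq` (or a QR factorization) is usually preferred numerically over the normal equations, but the normal equations match the derivation above most directly.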

Note that this assumes that $X^\top X$ is invertible.  This will be true as long as the columns of $X$ are linearly independent, which is certainly a reasonable assumption for real data.  An example of a linear dependency would be if you had a feature variable for temperature in both Fahrenheit and Celsius.  Then one column would be a linear combination of the other column and the column $\vec{1}$.
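The temperature example can be sketched numerically.  The snippet below builds a made-up design matrix with a column of ones, a Celsius column, and a Fahrenheit column, and uses `np.linalg.matrix_rank` to show that the three columns only span a two dimensional space, so $X^\top X$ is singular.

```python
import numpy as np

# Made-up temperature readings, recorded in both Celsius and Fahrenheit.
rng = np.random.default_rng(2)
n = 30
celsius = rng.uniform(-10, 35, size=n)
fahrenheit = 1.8 * celsius + 32  # exact linear function of the other two columns

X = np.column_stack([np.ones(n), celsius, fahrenheit])

# Three columns, but the third is 32 * (ones) + 1.8 * (celsius).
rank = np.linalg.matrix_rank(X)
dependency = fahrenheit - (32 * X[:, 0] + 1.8 * X[:, 1])

print(X.shape[1], rank)           # number of columns versus numerical rank
print(np.abs(dependency).max())   # zero: the dependency is exact
```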

### The differential calculus approach

Multivariable calculus courses often only address low dimensional calculus.  We have a function of $p+1$ variables to minimize here, and that might be a little intimidating.

If you want to skip the derivation you could have this [super awesome matrix calculus calculator](https://www.matrixcalculus.org/matrixCalculus) do it for you.  However, I do think it is useful to understand what is really going on here from first principles.

In my view, the "best way" to understand the gradient of a function $f : \mathbb{R}^{p+1} \to \mathbb{R}$ like we have here is to understand what the gradient *does*.

The gradient $\nabla f \big|_{\vec{\beta}}$ is a vector that tells us how the output of $f$ responds to changing the input from $\vec{\beta}$ to $\vec{\beta} + \vec{h}$:

$$
f(\vec{\beta} + \vec{h}) = f(\vec{\beta}) + \nabla f \big|_{\vec{\beta}} \cdot \vec{h} + o(|\vec{h}|)
$$

We can use this to compute the gradient of $f(\vec{\beta}) = |\vec{y} - X \vec{\beta}|^2$ as follows:

$$
\begin{align*}
f(\vec{\beta} + \vec{h}) 
&= \left| \vec{y} - X(\vec{\beta} + \vec{h})\right|^2\\
&= (\vec{y} - X\vec{\beta} - X \vec{h})\cdot(\vec{y} - X\vec{\beta} - X \vec{h})\\
&= |\vec{y} - X\vec{\beta}|^2  - 2 (\vec{y} - X\vec{\beta}) \cdot (X\vec{h}) + \left| X \vec{h} \right|^2
\end{align*}
$$

Now there are two slightly tricky next steps.

One is to rewrite $2 (\vec{y} - X\vec{\beta}) \cdot (X\vec{h})$ as $2 X^\top (\vec{y} - X\vec{\beta}) \cdot \vec{h}$.

This is just a general fact about transposes (one might say it is their raison d'être):  $\vec{u} \cdot X\vec{v} = (X^\top \vec{u}) \cdot \vec{v}$.  The key to understanding this is to realize that for vectors $\vec{w}^\top \vec{v} = \vec{w} \cdot \vec{v}$, and also that for matrices $(AB)^\top = B^\top A^\top$.  So

$$
\begin{align*}
\vec{u} \cdot X\vec{v} 
&=(X\vec{v})^\top \vec{u}\\
&=\vec{v}^\top X^\top \vec{u}\\
&=(X^\top \vec{u}) \cdot \vec{v}
\end{align*}
$$

The other, only slightly harder, fact is that $|X\vec{h}|^2$ is $o(|\vec{h}|)$, since $|X\vec{h}|^2 \leq C^2 |\vec{h}|^2$, where $C$ is the [operator norm](https://en.wikipedia.org/wiki/Operator_norm) of $X$, which in this case is the largest singular value of $X$.
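Both facts are easy to check numerically.  The sketch below uses a random matrix and random vectors (all made up for illustration) to verify the transpose identity, computes the operator norm as `np.linalg.norm(X, 2)`, and confirms that $|X\vec{h}|^2 / |\vec{h}|$ shrinks as $\vec{h}$ shrinks, which is what it means for $|X\vec{h}|^2$ to be negligible compared to $|\vec{h}|$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))   # arbitrary matrix, assumed for illustration
u = rng.normal(size=20)
v = rng.normal(size=4)

# Transpose identity: u . (X v) = (X^T u) . v
lhs = u @ (X @ v)
rhs = (X.T @ u) @ v

# Operator norm of X: the matrix 2-norm, which equals the largest singular value.
C = np.linalg.norm(X, 2)
largest_singular_value = np.linalg.svd(X, compute_uv=False)[0]

# Shrink h along a fixed direction and track |X h|^2 / |h|.
direction = rng.normal(size=4)
ratios = []
bound_holds = []
for scale in [1.0, 1e-2, 1e-4]:
    h = scale * direction
    Xh_sq = np.linalg.norm(X @ h) ** 2
    h_norm = np.linalg.norm(h)
    ratios.append(Xh_sq / h_norm)
    bound_holds.append(Xh_sq <= C**2 * h_norm**2 * (1 + 1e-12))

print(lhs, rhs)
print(C, largest_singular_value)
print(ratios)       # shrinks proportionally to the scale of h
print(bound_holds)
```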

Putting this together we have

$$
f(\vec{\beta} + \vec{h}) = f(\vec{\beta}) - 2 X^\top (\vec{y} - X\vec{\beta}) \cdot \vec{h}  + o(|\vec{h}|)
$$

Thus, by definition of the gradient, we have

$$
\nabla f \big|_{\vec{\beta}} = -2 X^\top (\vec{y} - X\vec{\beta})
$$
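A standard way to gain confidence in a gradient formula like this one is to compare it with finite differences.  The following sketch does that on made-up data; the helper names `f` and `grad_f` and the step size `eps` are illustrative choices, not anything canonical.

```python
import numpy as np

def f(beta, X, y):
    """Squared length of the residual vector, |y - X beta|^2."""
    r = y - X @ beta
    return r @ r

def grad_f(beta, X, y):
    """Gradient from the derivation above: -2 X^T (y - X beta)."""
    return -2 * X.T @ (y - X @ beta)

# Made-up data and an arbitrary point beta at which to check the gradient.
rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (f(beta + eps * e, X, y) - f(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(p + 1)
])
analytic = grad_f(beta, X, y)

print(numeric)
print(analytic)
print(np.max(np.abs(numeric - analytic)))  # should be tiny
```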

Setting this gradient equal to zero we get 

$$
\begin{align*}
X^\top (\vec{y} - X\hat{\beta}) &=0 \\
X^\top \vec{y} - X^\top X \hat{\beta} &=0\\
X^\top X \hat{\beta} &= X^\top \vec{y}\\
\hat{\beta} &= (X^\top X)^{-1} X^\top \vec{y}
\end{align*}
$$

We can also compute that the [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix) of $f$ is

$$
H = 2X^\top X
$$

which is positive definite as long as the columns of $X$ are linearly independent.  So $f$ is convex in $\vec{\beta}$, and the stationary point we found above is indeed the global minimum.
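As a rough numerical illustration (on a made-up design matrix with linearly independent columns), we can check that all eigenvalues of $H = 2X^\top X$ are positive, and that $f$ satisfies the midpoint convexity inequality for a pair of arbitrary points.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # made-up design matrix
y = rng.normal(size=n)

# Hessian of f(beta) = |y - X beta|^2; it does not depend on beta or y.
H = 2 * X.T @ X
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is for symmetric matrices

def f(beta):
    r = y - X @ beta
    return r @ r

# Midpoint convexity: f((a + b) / 2) <= (f(a) + f(b)) / 2 for any a, b.
a = rng.normal(size=p + 1)
b = rng.normal(size=p + 1)
midpoint_value = f((a + b) / 2)
average_value = (f(a) + f(b)) / 2

print(eigenvalues)                    # all strictly positive here
print(midpoint_value, average_value)
```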

We will see that many classical machine learning algorithms have convex loss functions!

### Returning to the estimate of $\sigma$

Recall that our negative log likelihood was 

$$
\ell(\vec{\beta}, \sigma) = n\log(\sqrt{2\pi}) + n\log(\sigma) + \frac{1}{2\sigma^2}|\vec{y} - X \vec{\beta}|^2
$$

We have already found $\hat{\beta}$ which minimizes $|\vec{y} - X \vec{\beta}|^2$.  Let's define $\textrm{RSS} = |\vec{y} - X \hat{\beta}|^2$ to be the "residual sum of squares" for notational convenience.

Now we are trying to minimize

$$
\ell(\sigma) = n\log(\sqrt{2\pi}) + n\log(\sigma) + \frac{\textrm{RSS}}{2\sigma^2}
$$

This is a single variable calculus problem.

$$
\begin{align*}
\ell'(\hat{\sigma}) &= 0\\
\frac{n}{\hat{\sigma}} - \frac{\textrm{RSS}}{\hat{\sigma}^3} &= 0\\
n\hat{\sigma}^2 &= \textrm{RSS}\\
\hat{\sigma}^2 &= \frac{1}{n} \textrm{RSS}\\
\hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
\end{align*}
$$

where $\hat{y}_i$ is the $i^{th}$ entry of $X \hat{\beta}$ (i.e. the prediction of the $i^{th}$ target by the fitted model).

This maximum likelihood estimate of the variance is a *biased estimator*.  An unbiased estimator would use $n - (p+1)$ in the denominator instead of $n$.  You could guess this form of the denominator by thinking hard enough about the orthogonal projection picture we included above:  $\operatorname{Im}(X)$ is $(p+1)$-dimensional (when the columns of $X$ are linearly independent), and the residual vector lies in its orthogonal complement, which is $(n - (p+1))$-dimensional.
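The bias is easy to see in a simulation.  The sketch below fixes a made-up design matrix and true parameters, generates many independent response vectors from the model, and averages both variance estimates over the trials.  The sample sizes and parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 3
sigma_true = 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # fixed design matrix
beta_true = np.array([1.0, -1.0, 0.5, 2.0])

# Each column of Y is one simulated response vector y = X beta + epsilon.
n_trials = 20000
noise = rng.normal(scale=sigma_true, size=(n, n_trials))
Y = (X @ beta_true)[:, None] + noise

# Fit all trials at once by solving the normal equations with a matrix right-hand side.
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - X @ beta_hats) ** 2, axis=0)

mle_estimates = rss / n                    # maximum likelihood estimate of sigma^2
unbiased_estimates = rss / (n - (p + 1))   # degrees-of-freedom corrected estimate

print(sigma_true**2)
print(mle_estimates.mean())       # noticeably below sigma^2
print(unbiased_estimates.mean())  # close to sigma^2
```

With these settings the average of the maximum likelihood estimates should land near $\sigma^2 \cdot \frac{n - (p+1)}{n}$ rather than $\sigma^2$.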

### Covariance of $\hat{\beta}$

In week 5 we will be interested in confidence intervals for the linear regression parameters.  

We have 

$$
\begin{align*}
\hat{\beta} 
&= (X^\top X)^{-1}X^\top \vec{y}\\
&= (X^\top X)^{-1}X^\top (X \beta + \vec{\epsilon})\\
&= \beta + (X^\top X)^{-1}X^\top \vec{\epsilon} 
\end{align*}
$$

Since $\vec{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and $\hat{\beta}$ is an affine function of $\vec{\epsilon}$ (treating $X$ as fixed), $\hat{\beta}$ also follows a multivariate normal distribution.  We can compute its covariance matrix as follows:


$$
\begin{align*}
\operatorname{Cov}(\hat{\beta}) 
&= \operatorname{Cov}((X^\top X)^{-1}X^\top \vec{\epsilon})\\
&= (X^\top X)^{-1}X^\top\operatorname{Cov}(\vec{\epsilon})((X^\top X)^{-1}X^\top)^\top\\
&= (X^\top X)^{-1}X^\top \sigma^2 I X (X^\top X)^{-1}\\
&= \sigma^2 (X^\top X)^{-1}
\end{align*}
$$

Thus we have $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X^\top X)^{-1})$.  

For each individual parameter $\beta_i$ we then have

$$
\frac{\hat{\beta}_{i}-\beta_{i}}{\sigma\sqrt{(X^\top X)^{-1}_{ii}}} \sim N(0,1)
$$
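We can check the covariance formula with a Monte Carlo sketch: hold a made-up design matrix fixed, simulate many response vectors, refit each time, and compare the empirical covariance of the fitted coefficients with $\sigma^2 (X^\top X)^{-1}$.  All numerical values below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 2
sigma_true = 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # fixed design matrix
beta_true = np.array([2.0, -1.0, 0.5])

# Simulate many response vectors (one per column) and fit each one.
n_trials = 50000
noise = rng.normal(scale=sigma_true, size=(n, n_trials))
Y = (X @ beta_true)[:, None] + noise
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)  # shape (p + 1, n_trials)

# np.cov treats each row as a variable and each column as an observation.
empirical_cov = np.cov(beta_hats)
XtX_inv = np.linalg.inv(X.T @ X)
theoretical_cov = sigma_true**2 * XtX_inv

# Standardize one coefficient as in the last display; it should look like N(0, 1).
i = 1
z = (beta_hats[i] - beta_true[i]) / (sigma_true * np.sqrt(XtX_inv[i, i]))

print(empirical_cov)
print(theoretical_cov)
print(z.mean(), z.std())  # roughly 0 and 1
```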