## Simple Linear Regression (Matrix form)

The Simple Linear Regression (SLR) model in scalar form is written as

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $$

This can be written out for each observation in the data:

\begin{align}
y_1 & = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} + \epsilon_1 \\
y_2 & = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} + \epsilon_2 \\
\vdots & \qquad \vdots \\
y_n & = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} + \epsilon_n
\end{align}

Here observation $i$ runs over $1, \dots, n$ and predictor $j$ over $1, \dots, p$, so $x_{ij}$ is the value of predictor $j$ for observation $i$.

The same SLR model can be represented in matrix form

\begin{align}
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix}
&=
\begin{bmatrix}
    \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} \\
    \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} \\
    \vdots \\
    \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np}
\end{bmatrix} + 
\begin{bmatrix}
    \epsilon_1 \\
    \epsilon_2 \\
    \vdots \\
    \epsilon_n
\end{bmatrix}
\end{align}

which factors into the product of a matrix and a coefficient vector, plus the error vector:

\begin{align}
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix}
&=
\begin{bmatrix}
    1 & x_{11} & x_{12} & \cdots & x_{1p} \\
    1 & x_{21} & x_{22} & \cdots & x_{2p} \\
    & & \vdots & & \\
    1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\begin{bmatrix}
    \beta_0 \\
    \beta_1 \\
    \vdots \\
    \beta_p
\end{bmatrix} + 
\begin{bmatrix}
    \epsilon_1 \\
    \epsilon_2 \\
    \vdots \\
    \epsilon_n
\end{bmatrix}
\end{align}

or simply as

$$ \textbf{y} = X\mathbf{\beta} + \mathbf{\epsilon} $$
where
* $X$ is called the design matrix.
* $\mathbf{\beta}$ is the vector of coefficients.
* $\mathbf{\epsilon}$ is the error vector.
* $\textbf{y}$ is the response or target vector.
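The correspondence between the scalar and matrix forms can be checked numerically. The following is a minimal sketch in NumPy, assuming a small made-up data set; the feature values, the coefficients in `beta`, and the noise level are illustrative choices, not values from the text.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, p = 2 predictors.
features = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])
n, p = features.shape

# Design matrix: the leading column of ones carries the intercept beta_0.
X = np.column_stack([np.ones(n), features])

# Assumed coefficients (beta_0, beta_1, beta_2), chosen only for illustration.
beta = np.array([1.0, 2.0, -0.5])

# Simulated errors with an assumed standard deviation of 0.1.
rng = np.random.default_rng(0)
epsilon = rng.normal(loc=0.0, scale=0.1, size=n)

# Matrix form of the model: y = X beta + epsilon.
y = X @ beta + epsilon

# Row i of X @ beta equals the scalar form beta_0 + beta_1 x_i1 + beta_2 x_i2.
scalar_form = beta[0] + beta[1] * features[:, 0] + beta[2] * features[:, 1]
print(np.allclose(X @ beta, scalar_form))
```

The design matrix has $p + 1$ columns because the intercept is treated as the coefficient of a constant predictor equal to 1.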

### Distributional Assumptions in Matrix Form

$$\mathbf{\epsilon} \sim \mathcal{N}(\textbf{0}, \Sigma)$$

where $\Sigma$ is the covariance matrix of the errors.

In ordinary least squares (OLS), the errors are assumed to be uncorrelated and to have the same variance for every observation, so $\Sigma = \sigma^2I$ and the error distribution can be rewritten as

$$\mathbf{\epsilon} \sim \mathcal{N}(\textbf{0}, \sigma^2I)$$

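Because $X\mathbf{\beta}$ is a fixed (non-random) vector, the target $\textbf{y}$ has the same covariance as the errors, with its mean shifted to $X\mathbf{\beta}$: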

$$\textbf{y} \sim \mathcal{N}(X\mathbf{\beta}, \sigma^2I)$$

Written out, the covariance matrix of the error vector $\mathbf{\epsilon}$ is

\begin{align}
\text{Cov}(\mathbf{\epsilon}) = \text{Cov}
\begin{bmatrix}
    \epsilon_1 \\
    \epsilon_2 \\
    \vdots \\
    \epsilon_n
\end{bmatrix} = \sigma^2I = 
\begin{bmatrix}
    \sigma^2 & 0 & \cdots & 0 \\
    0 & \sigma^2 & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sigma^2
\end{bmatrix}
\end{align}

Similarly, the covariance matrix of the target $\textbf{y}$ is

\begin{align}
\text{Cov}(\textbf{y}) = \text{Cov}
\begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
\end{bmatrix} = \sigma^2I
\end{align}
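The structure $\sigma^2I$ can be illustrated by simulation. The sketch below assumes $n = 3$, $\sigma = 2$, and an arbitrary fixed mean vector standing in for $X\mathbf{\beta}$; all of these are illustrative choices. It draws many error vectors and compares their sample covariance with $\sigma^2I$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3            # length of each error vector (assumed)
sigma = 2.0      # assumed error standard deviation, so sigma^2 = 4
draws = 200_000  # number of simulated error vectors

# Each row is one draw of (epsilon_1, ..., epsilon_n) with
# independent N(0, sigma^2) entries.
errors = rng.normal(loc=0.0, scale=sigma, size=(draws, n))

# Sample covariance across draws; columns are treated as variables.
cov_errors = np.cov(errors, rowvar=False)

# Adding a fixed mean vector (a stand-in for X @ beta) shifts y
# but leaves its covariance unchanged.
mean_y = np.array([1.0, -2.0, 0.5])
y = mean_y + errors
cov_y = np.cov(y, rowvar=False)

print(np.round(cov_errors, 2))  # approximately sigma^2 on the diagonal, 0 elsewhere
```

The off-diagonal entries are near zero because the errors of different observations are uncorrelated, and the diagonal entries are near $\sigma^2$ because every observation shares the same variance.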

### Parameter Estimation

Rearranging the SLR model equation gives the residuals as

$$\mathbf{\epsilon} = \textbf{y} - X\mathbf{\beta}$$

We want to minimize the sum of squared residuals.

$$ \text{minimize} \quad \sum \epsilon_i^2 = [\epsilon_1 \; \epsilon_2 \; \cdots \; \epsilon_n] \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n\end{bmatrix} = \mathbf{\epsilon}^T\mathbf{\epsilon} $$


or 
$$ \text{minimize} \quad \mathbf{\epsilon}^T\mathbf{\epsilon} = (\textbf{y} - X\mathbf{\beta})^T(\textbf{y} - X\mathbf{\beta})$$

To find the $\mathbf{\beta}$ that minimizes this expression, set its derivative with respect to $\mathbf{\beta}$ equal to the zero vector, i.e.

$$ \frac{d}{d\beta}(\mathbf{\epsilon}^T\mathbf{\epsilon}) = \frac{d}{d\beta}(\textbf{y} - X\mathbf{\beta})^T(\textbf{y} - X\mathbf{\beta}) = \textbf{0} $$

$$ -2X^T(\textbf{y} - X\mathbf{\beta}) = \textbf{0} $$
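The gradient in the previous line can be checked by expanding the quadratic form. This short worked derivation relies on two standard matrix-calculus identities, taken here as given since the text does not derive them: $\frac{d}{d\beta}(a^T\mathbf{\beta}) = a$ and $\frac{d}{d\beta}(\mathbf{\beta}^TA\mathbf{\beta}) = 2A\mathbf{\beta}$ for symmetric $A$.

\begin{align}
(\textbf{y} - X\mathbf{\beta})^T(\textbf{y} - X\mathbf{\beta}) &= \textbf{y}^T\textbf{y} - 2\mathbf{\beta}^TX^T\textbf{y} + \mathbf{\beta}^TX^TX\mathbf{\beta} \\
\frac{d}{d\beta}\left(\textbf{y}^T\textbf{y} - 2\mathbf{\beta}^TX^T\textbf{y} + \mathbf{\beta}^TX^TX\mathbf{\beta}\right) &= -2X^T\textbf{y} + 2X^TX\mathbf{\beta} \\
&= -2X^T(\textbf{y} - X\mathbf{\beta})
\end{align}

Setting this to the zero vector and dividing by $-2$ gives the equation that follows.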

$$ X^T\textbf{y} = X^TX\mathbf{\beta} $$

or

$$ X^T\textbf{y} = (X^TX)\mathbf{\beta} $$

Left-multiplying both sides by $(X^TX)^{-1}$ (which exists when $X$ has full column rank), we get

$$ (X^TX)^{-1}X^T\textbf{y} = (X^TX)^{-1}(X^TX)\mathbf{\beta} $$

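Therefore, the least-squares estimate, written $\hat{\mathbf{\beta}}$ to distinguish it from the true coefficient vector $\mathbf{\beta}$, is

$$ \hat{\mathbf{\beta}} = (X^TX)^{-1}X^T\textbf{y} $$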
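The closed-form solution can be tried on simulated data. The following is a minimal sketch, assuming synthetic features and made-up coefficients in `beta_true`. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly (a common numerical choice, not something the derivation requires) and cross-checks the result against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: n = 50 observations, p = 2 predictors (illustrative sizes).
n, p = 50, 2
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])   # design matrix with intercept

beta_true = np.array([1.0, 2.0, -0.5])        # assumed coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimizer, the gradient -2 X^T (y - X beta) is numerically zero.
gradient = -2 * X.T @ (y - X @ beta_hat)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Both routes give the same coefficients up to floating-point error, and the estimate lands close to the assumed `beta_true` because the simulated noise is small.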

### Hat Matrix

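The fitted values are obtained by plugging the least-squares estimate $\hat{\mathbf{\beta}} = (X^TX)^{-1}X^T\textbf{y}$ into the model:

$$ \hat{\textbf{y}} = X\hat{\mathbf{\beta}} $$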

$$ \hat{\textbf{y}} = X(X^TX)^{-1}X^T\textbf{y} $$

$$ \hat{\textbf{y}} = H\textbf{y} $$

where $H = X(X^TX)^{-1}X^T$. We call this the "hat matrix" because it turns $\textbf{y}$ into $\hat{\textbf{y}}$.

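We can now express the residual vector $\hat{\mathbf{\epsilon}} = \textbf{y} - \hat{\textbf{y}}$, the observed counterpart of the error $\mathbf{\epsilon}$, in terms of the hat matrix as

\begin{align}
\hat{\mathbf{\epsilon}} &= \textbf{y} - \hat{\textbf{y}} \\
                        &= \textbf{y} - H\textbf{y} \\
                        &= (I - H)\textbf{y}
\end{align}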

Notice that the matrices $H$ and $(I - H)$ have two special properties. They are
* Symmetric: $H = H^T$ and $(I - H)^T = (I - H)$.
* Idempotent: $H^2 = H$ and $(I - H)^2 = (I - H)$.
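Both properties are easy to confirm numerically. The sketch below assumes a small random design matrix and response vector, with sizes and seed chosen arbitrarily. It builds $H$ and $I - H$ and checks symmetry and idempotence up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small random example: n = 8 observations, p = 2 predictors (illustrative).
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, computed with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H          # the matrix I - H

y_hat = H @ y              # fitted values
residuals = M @ y          # residuals y - y_hat

print("H symmetric:   ", np.allclose(H, H.T))
print("H idempotent:  ", np.allclose(H @ H, H))
print("I-H symmetric: ", np.allclose(M, M.T))
print("I-H idempotent:", np.allclose(M @ M, M))
```

Idempotence reflects the fact that $H$ is a projection: applying it to already-fitted values leaves them unchanged.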

### Estimated Covariance Matrix of $\beta$

* The estimate $\hat{\mathbf{\beta}} = (X^TX)^{-1}X^T\textbf{y}$ is a linear combination of the elements of **y**.
* These estimates are normally distributed if **y** is normally distributed.

#### Useful theorem

Suppose $U \sim \mathcal{N}(\mu, \Sigma)$, a multivariate normal vector, and $V = c + DU$, a linear
transformation of $U$ where $c$ is a vector and $D$ is a matrix. Then $V \sim \mathcal{N}(c + D\mu, D\Sigma D^T)$.
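The theorem can be illustrated by simulation. In the sketch below, the values of `mu`, `Sigma`, `c`, and `D` are arbitrary assumptions picked for illustration. It transforms many draws of $U$ and compares the sample mean and covariance of $V$ with $c + D\mu$ and $D\Sigma D^T$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed parameters of U (2-dimensional) and of the linear map V = c + D U.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
c = np.array([0.0, 1.0, 2.0])
D = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [2.0, -1.0]])

# Draw U ~ N(mu, Sigma); each row is one draw.
draws = 200_000
U = rng.multivariate_normal(mu, Sigma, size=draws)

# Apply V = c + D U to every draw (row-wise, hence the transpose).
V = c + U @ D.T

# Mean and covariance predicted by the theorem.
mean_theory = c + D @ mu
cov_theory = D @ Sigma @ D.T

# Sample counterparts.
mean_sample = V.mean(axis=0)
cov_sample = np.cov(V, rowvar=False)

print(np.round(mean_sample, 2), mean_theory)
print(np.round(cov_sample, 2))
print(cov_theory)
```

The simulation only checks the first two moments; the fact that $V$ is itself normal is the part of the theorem taken on trust here.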

Comparing this to the SLR model, where $\sigma_{\epsilon}^2$ denotes the scalar error variance $\sigma^2$, we have

$$U = \textbf{y} \sim \mathcal{N}(X\mathbf{\beta}, \sigma_{\epsilon}^2I) \quad \text{and} \quad V = \hat{\mathbf{\beta}} = [(X^TX)^{-1}X^T]\textbf{y}$$
$$D = (X^TX)^{-1}X^T$$
$$\mu = X\mathbf{\beta} \quad \text{and} \quad \Sigma = \sigma_{\epsilon}^2I$$
$$c = \textbf{0}$$

The theorem above tells us that the estimate $\hat{\mathbf{\beta}}$ is normally distributed with

\begin{align}
\text{mean} &= (X^TX)^{-1}X^TX\mathbf{\beta} \\
            &= (X^TX)^{-1}(X^TX)\mathbf{\beta} \\
            &= \mathbf{\beta}
\end{align}

\begin{align}
\text{Cov} &= ((X^TX)^{-1}X^T)\sigma_{\epsilon}^2I((X^TX)^{-1}X^T)^T \\
           &= \sigma_{\epsilon}^2((X^TX)^{-1}X^T)I((X^TX)^{-1}X^T)^T \\
           &= \sigma_{\epsilon}^2(X^TX)^{-1}X^T X ((X^TX)^{-1})^T \\
           &= \sigma_{\epsilon}^2(X^TX)^{-1}(X^TX) ((X^TX)^{-1}) \\
           &= \sigma_{\epsilon}^2(X^TX)^{-1}
\end{align}

using the fact that both $X^TX$ and its inverse are symmetric, so $((X^TX)^{-1})^T = (X^TX)^{-1}$.

Hence,

$$ \hat{\mathbf{\beta}} \sim \mathcal{N}(\mathbf{\beta}, \sigma_{\epsilon}^2(X^TX)^{-1}) $$

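Therefore, the standard deviation (standard error) of the $j$-th estimate $\hat{\beta}_j$ is the square root of the corresponding diagonal entry of this covariance matrix: $\sqrt{\sigma_{\epsilon}^2\,[(X^TX)^{-1}]_{jj}}$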
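The covariance formula can be checked with a small Monte Carlo experiment. The sketch below assumes a fixed random design matrix, a known error standard deviation `sigma`, and made-up true coefficients; in practice $\sigma_{\epsilon}^2$ is unknown and has to be estimated. It refits the model on many simulated responses and compares the spread of the estimates with $\sigma_{\epsilon}^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p = 30, 2
sigma = 0.5                               # assumed (known) error standard deviation
beta_true = np.array([1.0, 2.0, -0.5])    # assumed true coefficients

# Fixed design matrix with an intercept column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Theoretical covariance sigma^2 (X^T X)^{-1} and per-coefficient standard errors.
cov_beta_hat = sigma**2 * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(cov_beta_hat))

# Monte Carlo: redraw the errors many times, refit, and compare.
replications = 20_000
noise = rng.normal(scale=sigma, size=(n, replications))
Y = (X @ beta_true)[:, None] + noise              # one simulated y per column
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)     # shape (p + 1, replications)

empirical_mean = beta_hats.mean(axis=1)
empirical_cov = np.cov(beta_hats)                 # rows are the coefficients

print("standard errors (theory):   ", np.round(std_errors, 4))
print("standard errors (simulated):", np.round(np.sqrt(np.diag(empirical_cov)), 4))
```

The standard errors come from the diagonal of the covariance matrix only; the off-diagonal entries describe how the estimates of different coefficients vary together.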