# <a id='toc1_'></a>[Linear Regression](#toc0_)

There are several formulations of linear regression. These include

- [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares)

- [Weighted Least Squares](https://en.wikipedia.org/wiki/Weighted_least_squares)

- [Generalized Least Squares](https://en.wikipedia.org/wiki/Generalized_least_squares)

- [Iteratively Reweighted Least Squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) 

- [Instrumental Variables Regression](https://en.wikipedia.org/wiki/Instrumental_variables_estimation) 

- [Total Least Squares](https://en.wikipedia.org/wiki/Total_least_squares) 

- Linear Template Fit 

- Percentage Least Squares

- [Constrained Least Squares](https://en.wikipedia.org/wiki/Constrained_least_squares)

We will concentrate on Ordinary Least Squares, which is the most commonly used.

## Ordinary Least Squares

We have $ n $ observed data points $ (x_1, y_1), \ (x_2, y_2), \ \cdots, \ (x_n, y_n) $, where $ x $ is the predictor variable and $ y $ is the response variable. These can also be collected into vectors $ \textbf{x} $ and $ \textbf{y} $.

We wish to fit a linear model through this data that satisfies

$ y_i = \beta_0 + \beta_1 x_i + e_i \ \ \ \ $ where $ \ 0 \lt i \le n $

so we have $ n $ equations

$ y_1 = \beta_0 + \beta_1 x_1 + e_1 \ \ \ \ \\ $ 

$ y_2 = \beta_0 + \beta_1 x_2 + e_2 \ \ \ \ \\ $ 

$ \vdots \\ $

$ y_n = \beta_0 + \beta_1 x_n + e_n \ \ \ \ $ 

Here $ \beta_0 $ (the intercept) and $ \beta_1 $ (the slope) are the model parameters, and $ e_i $ is the error term, defined as $ e_i = y_i - (\beta_0 + \beta_1 x_i) $.

In matrix notation:

$ \textbf{y} = \textbf{x} \boldsymbol{\beta} + \textbf{e} $

where $ \textbf{x} $ now denotes the $ n \times 2 $ design matrix, whose column of ones multiplies the intercept $ \beta_0 $:

$
\textbf{y} =
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
\ \ \ \ \ \
\textbf{x} =
\begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{bmatrix}
\ \ \ \ \ \
\boldsymbol{\beta} =
\begin{bmatrix}
\beta_0 \\
\beta_1
\end{bmatrix}
\ \ \ \ \ \
\textbf{e} =
\begin{bmatrix}
e_1 \\
e_2 \\
\vdots \\
e_n
\end{bmatrix}
$

or, written out row by row,

$
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
    =
\begin{bmatrix}
\beta_0 + \beta_1 x_1 \\
\beta_0 + \beta_1 x_2 \\
\vdots \\
\beta_0 + \beta_1 x_n
\end{bmatrix}
    +
\begin{bmatrix}
e_1 \\
e_2 \\
\vdots \\
e_n
\end{bmatrix}
$
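The matrix form above can be sketched in NumPy. The sample data and the candidate parameter values below are hypothetical, chosen only for illustration; the names `x_obs`, `y_obs` and `X` are not from the text.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])

# Design matrix: a column of ones (multiplying beta_0)
# next to the predictor column (multiplying beta_1).
X = np.column_stack([np.ones_like(x_obs), x_obs])

# Any candidate parameter vector [beta_0, beta_1] gives
# fitted values X @ beta and errors e = y - X @ beta.
beta = np.array([1.0, 2.0])
fitted = X @ beta
errors = y_obs - fitted

print(X)
print(errors)
```

Each row of `X` is $ [1 \ \ x_i] $, so `X @ beta` evaluates $ \beta_0 + \beta_1 x_i $ for every data point at once.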

### Mean Squared Error

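At each data point, the error of prediction is given by

$ e_i(\boldsymbol{\beta}) = y_i - \textbf{x}_i \boldsymbol{\beta} \ \ \ \ \ $ for $ \ \ 0 < i \le n $

where $ \textbf{x}_i = \begin{bmatrix} 1 & x_i \end{bmatrix} $ is the $ i $-th row of $ \textbf{x} $, or in matrix form

$ \textbf{e}(\boldsymbol{\beta}) = \textbf{y} - \textbf{x}\boldsymbol{\beta} $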

We define the mean squared error as follows:

$ MSE(\beta) = \dfrac{1}{n} \sum \limits_{i=1}^{n} e_i^2 (\beta) $

or in matrix notation

$ MSE(\beta) = \dfrac{1}{n} \textbf{e}^T \textbf{e} $

$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =  \dfrac{1}{n} (\textbf{y} - \textbf{x}\boldsymbol{\beta})^T (\textbf{y} - \textbf{x}\boldsymbol{\beta}) $

$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =  \dfrac{1}{n} (\textbf{y}^T - \boldsymbol{\beta}^T \textbf{x}^T) (\textbf{y} - \textbf{x}\boldsymbol{\beta}) $

$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =  \dfrac{1}{n} (\textbf{y}^T \textbf{y} - \textbf{y}^T \textbf{x} \boldsymbol{\beta} - \boldsymbol{\beta}^T \textbf{x}^T \textbf{y} + \boldsymbol{\beta}^T \textbf{x}^T \textbf{x}\boldsymbol{\beta}) $

<br>

Now $ \textbf{y}^T \textbf{x} \boldsymbol{\beta} $ is a scalar, so it equals its own transpose: $ \textbf{y}^T \textbf{x} \boldsymbol{\beta} = (\textbf{y}^T \textbf{x} \boldsymbol{\beta})^T = \boldsymbol{\beta}^T \textbf{x}^T \textbf{y} $

$ \therefore MSE(\boldsymbol{\beta}) = \dfrac{1}{n} (\textbf{y}^T \textbf{y} - 2\boldsymbol{\beta}^T \textbf{x}^T \textbf{y} + \boldsymbol{\beta}^T \textbf{x}^T \textbf{x}\boldsymbol{\beta}) $
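As a quick numerical check of this expansion, the sketch below evaluates the MSE both directly from the error vector and from the expanded quadratic form. The data and the candidate `beta` are hypothetical, and the function names `mse_direct` and `mse_expanded` are introduced here only for illustration.

```python
import numpy as np

def mse_direct(X, y, beta):
    """MSE computed as (1/n) e^T e with e = y - X beta."""
    e = y - X @ beta
    return (e @ e) / len(y)

def mse_expanded(X, y, beta):
    """MSE computed from (1/n)(y^T y - 2 beta^T X^T y + beta^T X^T X beta)."""
    n = len(y)
    return (y @ y - 2 * beta @ X.T @ y + beta @ X.T @ X @ beta) / n

# Hypothetical sample data, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])
X = np.column_stack([np.ones_like(x_obs), x_obs])
beta = np.array([1.0, 2.0])

print(mse_direct(X, y_obs, beta))
print(mse_expanded(X, y_obs, beta))
```

The two values should agree up to floating-point rounding.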

### Least Squares

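The method of least squares is a parameter estimation method that minimizes the sum of the squared residuals, where a residual is the difference between an observed value and the value fitted by the model. Minimizing this sum is equivalent to minimizing $ MSE(\boldsymbol{\beta}) $, since the two differ only by the constant factor $ 1/n $.

First, we find the gradient of the mean squared error with respect to $ \boldsymbol{\beta} $, using the identities $ \nabla (\boldsymbol{\beta}^T \textbf{a}) = \textbf{a} $ and $ \nabla (\boldsymbol{\beta}^T \textbf{A} \boldsymbol{\beta}) = 2\textbf{A}\boldsymbol{\beta} $ for a symmetric matrix $ \textbf{A} $: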

$ \nabla MSE(\boldsymbol{\beta}) = \dfrac{1}{n} \Big(\nabla (\textbf{y}^T \textbf{y}) - 2\nabla (\boldsymbol{\beta}^T \textbf{x}^T \textbf{y}) + \nabla (\boldsymbol{\beta}^T \textbf{x}^T \textbf{x}\boldsymbol{\beta})\Big) $

$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \dfrac{1}{n} (0 - 2 \textbf{x}^T \textbf{y} + 2 \textbf{x}^T \textbf{x}\boldsymbol{\beta}) $

$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \dfrac{2}{n} (\textbf{x}^T \textbf{x}\boldsymbol{\beta} - \textbf{x}^T \textbf{y}) $
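The gradient formula can be checked against a central finite-difference approximation. This is a minimal sketch on hypothetical data; the helper `numerical_gradient` and the step size `h` are assumptions made for the illustration.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error (1/n) e^T e with e = y - X beta."""
    e = y - X @ beta
    return (e @ e) / len(y)

def mse_gradient(X, y, beta):
    """Analytic gradient (2/n)(X^T X beta - X^T y)."""
    return (2.0 / len(y)) * (X.T @ X @ beta - X.T @ y)

def numerical_gradient(f, beta, h=1e-6):
    """Central finite-difference approximation of the gradient of f at beta."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * h)
    return grad

# Hypothetical sample data, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])
X = np.column_stack([np.ones_like(x_obs), x_obs])
beta = np.array([1.0, 2.0])

analytic = mse_gradient(X, y_obs, beta)
numeric = numerical_gradient(lambda b: mse(X, y_obs, b), beta)
print(analytic)
print(numeric)
```

Because the MSE is quadratic in $ \boldsymbol{\beta} $, the central difference should match the analytic gradient up to rounding error.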

To find the minimum, set this equal to zero

$ \textbf{x}^T \textbf{x}\boldsymbol{\beta} - \textbf{x}^T \textbf{y} = 0 $

Solving this equation, known as the normal equation, gives the least-squares estimates $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $.

Rearranging, and assuming $ \textbf{x}^T \textbf{x} $ is invertible, we get

$ \hat{\boldsymbol{\beta}} = (\textbf{x}^T \textbf{x})^{-1}  \textbf{x}^T \textbf{y} $
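In code, the normal equation is usually solved as a linear system rather than by forming the inverse explicitly. The sketch below does this with `np.linalg.solve` and compares the result with NumPy's least-squares routine `np.linalg.lstsq`; the data are hypothetical.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])
X = np.column_stack([np.ones_like(x_obs), x_obs])

# Solve (X^T X) beta = X^T y for beta without computing the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_obs)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y_obs, rcond=None)

print(beta_hat)    # [intercept, slope]
print(beta_lstsq)
```

Both approaches should return the same intercept and slope up to rounding; `lstsq` is generally the more robust choice when the columns of the design matrix are nearly collinear.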

Multiply and divide by $ n $ to introduce a normalizing factor of $ 1/n $ in each term

$ \hat{\boldsymbol{\beta}} = \dfrac{n}{n}\Big(\textbf{x}^T \textbf{x}\Big)^{-1}  \textbf{x}^T \textbf{y} = \Big(\dfrac{1}{n}\textbf{x}^T \textbf{x}\Big)^{-1} \Big(\dfrac{1}{n}\textbf{x}^T \textbf{y}\Big)$

Now the second term 

$ 
\dfrac{1}{n}\textbf{x}^T \textbf{y} 
    = \dfrac{1}{n}
\begin{bmatrix}
  1 & 1 & \cdots & 1 \\
  x_1 & x_2 & \cdots & x_n 
\end{bmatrix}
\begin{bmatrix}
  y_1 \\
  y_2 \\
  \vdots \\
  y_n
\end{bmatrix}
    =
\dfrac{1}{n}
\begin{bmatrix}
  y_1 + y_2 + \cdots + y_n \\
  x_1 y_1 + x_2 y_2 + \cdots + x_n y_n 
\end{bmatrix}
    =
\begin{bmatrix}
  \frac{1}{n}\sum_{i=1}^{n} y_i \\
  \frac{1}{n}\sum_{i=1}^{n} x_i y_i  
\end{bmatrix}
    =
\begin{bmatrix}
  \bar{y} \\
  \overline{xy}  
\end{bmatrix}
$ 

where an overbar denotes the sample mean, for example $ \overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i $.

Similarly for the first term

$ 
\dfrac{1}{n}\textbf{x}^T \textbf{x} 
    = \dfrac{1}{n}
\begin{bmatrix}
  1 & 1 & \cdots & 1 \\
  x_1 & x_2 & \cdots & x_n 
\end{bmatrix}
\begin{bmatrix}
  1 & x_1 \\
  1 & x_2 \\
  \vdots & \vdots \\
  1 & x_n 
\end{bmatrix}
    =
\dfrac{1}{n}
\begin{bmatrix}
  1 + 1 + \cdots + 1 & x_1 + x_2 + \cdots + x_n \\
  x_1 + x_2 + \cdots + x_n & x_1^2 + x_2^2 + \cdots + x_n^2
\end{bmatrix}
    =
\begin{bmatrix}
  1 & \bar{x} \\
  \bar{x} & \overline{x^2} 
\end{bmatrix}
$ 

$
\Rightarrow \Big(\dfrac{1}{n}\textbf{x}^T \textbf{x}\Big)^{-1}
    =
\dfrac{1}{\overline{x^2} - \bar{x}^2}
\begin{bmatrix}
  \overline{x^2} & -\bar{x} \\
  -\bar{x} & 1  
\end{bmatrix}
$
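The closed-form inverse above can be verified numerically. In this sketch the predictor values are hypothetical, and the names `moment_matrix` and `closed_form_inverse` are introduced only for the illustration.

```python
import numpy as np

# Hypothetical predictor values, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x_obs), x_obs])
n = len(x_obs)

# (1/n) X^T X, which should equal [[1, mean(x)], [mean(x), mean(x^2)]].
moment_matrix = X.T @ X / n

x_bar = x_obs.mean()
x2_bar = (x_obs ** 2).mean()

# Closed-form 2x2 inverse: adjugate divided by the determinant mean(x^2) - mean(x)^2.
closed_form_inverse = np.array([[x2_bar, -x_bar],
                                [-x_bar, 1.0]]) / (x2_bar - x_bar ** 2)

print(moment_matrix)
print(closed_form_inverse @ moment_matrix)  # approximately the identity
```

The determinant $ \overline{x^2} - \bar{x}^2 $ is zero only when every $ x_i $ is identical, which is exactly the case in which no unique line can be fitted.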

Multiply the first and second terms

$
\hat{\boldsymbol{\beta}} =
\Big(\dfrac{1}{n}\textbf{x}^T \textbf{x}\Big)^{-1} \Big(\dfrac{1}{n}\textbf{x}^T \textbf{y}\Big)
    = 
\dfrac{1}{\overline{x^2} - \bar{x}^2}
\begin{bmatrix}
  \overline{x^2} & -\bar{x} \\
  -\bar{x} & 1  
\end{bmatrix}
\begin{bmatrix}
  \bar{y} \\
  \overline{xy}  
\end{bmatrix}
    =
\dfrac{1}{\overline{x^2} - \bar{x}^2}
\begin{bmatrix}
  \overline{x^2} \, \bar{y} - \bar{x} \, \overline{xy} \\
  \overline{xy}-\bar{x}\bar{y}
\end{bmatrix}
$

Now the variance is given by $ s_x^2 = \overline{x^2} - \bar{x}^2 $
 
and the covariance is given by $ c_{xy} = \overline{xy} - \bar{x}\bar{y} $

$
\hat{\boldsymbol{\beta}}
    =
\dfrac{1}{s_x^2}
\begin{bmatrix}
  (s_x^2 + \bar{x}^2)\bar{y} - \bar{x}(c_{xy} + \bar{x}\bar{y}) \\
  c_{xy}
\end{bmatrix}
    =
\dfrac{1}{s_x^2}
\begin{bmatrix}
  s_x^2 \bar{y} + \bar{x}^2 \bar{y} - \bar{x}c_{xy} - \bar{x}^2\bar{y} \\
  c_{xy}
\end{bmatrix}
    =
\begin{bmatrix}
  \bar{y} - \dfrac{\bar{x}c_{xy}}{s_x^2} \\
  \dfrac{c_{xy}}{s_x^2}
\end{bmatrix} 
$

$
\therefore \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}
$

and

$
\hat{\beta_1} = \dfrac{\overline{xy} -\bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2}
$
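These closed-form expressions translate directly into a few lines of NumPy. The sketch below uses hypothetical data and a helper named `ols_simple` (not from the text), and compares the result with `np.polyfit`, which returns the coefficients of a degree-1 fit in the order slope, intercept.

```python
import numpy as np

def ols_simple(x, y):
    """Simple linear regression from sample means.

    Returns (beta_0, beta_1) using
    beta_1 = (mean(xy) - mean(x) mean(y)) / (mean(x^2) - mean(x)^2)
    beta_0 = mean(y) - beta_1 * mean(x)
    """
    x_bar, y_bar = x.mean(), y.mean()
    s_x2 = (x ** 2).mean() - x_bar ** 2    # variance of x (1/n convention)
    c_xy = (x * y).mean() - x_bar * y_bar  # covariance of x and y (1/n convention)
    beta_1 = c_xy / s_x2
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

# Hypothetical sample data, chosen only for illustration.
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])

beta_0, beta_1 = ols_simple(x_obs, y_obs)
slope, intercept = np.polyfit(x_obs, y_obs, 1)

print(beta_0, beta_1)
print(intercept, slope)
```

For this data, $ \bar{x} = 2.5 $, $ \bar{y} = 6 $, $ s_x^2 = 1.25 $ and $ c_{xy} = 2.425 $, giving $ \hat{\beta}_1 = 1.94 $ and $ \hat{\beta}_0 = 1.15 $.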