# Linear Regression Theory
(by Tevfik Aytekin)

Parts of these notes are inspired by Andrew Ng's ML course notes.

### Preliminaries
Assume we are given a data set of $m$ examples $D = ((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}))$

where $x^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \mathbb{R}$. 
 

### Linear Regression

Hypothesis (model): 
\begin{equation}
h_\theta(x) = \theta_0x_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n
\end{equation}

where $\theta \in \mathbb{R}^{n+1}$ is the parameter vector and $x_0=1$. This model assumes that the output is a linear function of the inputs. 

Cost function:
\begin{equation}
J(\theta) = \frac{1}{2} \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}) )^2 
\end{equation}

The objective is to find the value of $\theta$ that minimizes the cost.
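As a minimal sketch of the two definitions above, the hypothesis and the cost can be written with NumPy. The function names `predict` and `cost` and the two-example data set are made up for illustration; the matrix `X` is assumed to already contain the constant column $x_0 = 1$.

```python
import numpy as np

def predict(theta, X):
    """h_theta(x) for every row of X; X already contains the x_0 = 1 column."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    residuals = y - predict(theta, X)
    return 0.5 * np.sum(residuals ** 2)

# Two made-up examples with one feature each (plus the constant x_0 = 1).
X = np.array([[1.0, 2.0],
              [1.0, 3.0]])
y = np.array([5.0, 7.0])

print(cost(np.array([1.0, 2.0]), X, y))  # perfect fit: 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # 0.5 * (25 + 49) = 37.0
```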

#### An Example
Suppose that we have the following toy dataset.

| Rooms       | Area        |  Price |
| ----------- | ----------- |--------|
| 4      | 120       | 100000|
| 3   | 110        | 90000|
| 5   | 210  | 150000|


Linear model: 
$$
h_\theta(x) = \theta_0 + \theta_1 rooms + \theta_2 area
$$

System of linear equations

$ \theta_0 + 4\theta_1 + 120\theta_2 = 100000$ 

$ \theta_0 + 3\theta_1 + 110\theta_2 = 90000$ 

$ \theta_0 + 5\theta_1 + 210\theta_2 = 150000$ 

Find the $\theta$ values that minimize the error in the above set of linear equations.
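The toy system can be solved numerically. The sketch below uses NumPy's least-squares routine `np.linalg.lstsq`, which is one possible choice and not something these notes prescribe. Because this particular data set has three examples and three parameters, the system happens to have an exact solution, so the minimum error is zero; with more examples than parameters an exact solution generally does not exist and only the least-squares fit remains.

```python
import numpy as np

# Design matrix for the toy data: columns are x_0 = 1, rooms, area.
X = np.array([[1.0, 4.0, 120.0],
              [1.0, 3.0, 110.0],
              [1.0, 5.0, 210.0]])
y = np.array([100000.0, 90000.0, 150000.0])

# Least-squares solution of X @ theta = y.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)      # approximately [20000, 5000, 500]
print(X @ theta)  # reproduces the three prices up to rounding error
```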


### Probabilistic Interpretation

Assumption:
\begin{equation}
\begin{aligned}
y^{(i)}&= \theta_0x^{(i)}_0+\theta_1x^{(i)}_1+\theta_2x^{(i)}_2+...+\theta_nx^{(i)}_n + \epsilon^{(i)} \\
& = \theta^Tx^{(i)} + \epsilon^{(i)}
\end{aligned}
\end{equation}

That is, the output is a linear function of the input variables plus some noise $\epsilon^{(i)}$, where $\epsilon^{(i)} \sim \mathcal{N}(0,\,\sigma^{2})$. The probability density function of $\epsilon^{(i)}$ can therefore be written as:
$$
p(\epsilon^{(i)}) = \frac{1}{ \sqrt{2\pi\sigma^2 }}\exp\left(\frac{-(\epsilon^{(i)})^2}{2\sigma ^2 }\right)
$$

Since $\epsilon^{(i)} = y^{(i)}-\theta^Tx^{(i)}$, this implies
$$
p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{ \sqrt{2\pi\sigma^2 }}\exp\left(\frac{-(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma ^2 }\right)
$$

We can understand the above formula as follows: it is the probability density of observing $y^{(i)}$ given $x^{(i)}$, and it equals the density of the normal distribution $\mathcal{N}(0,\,\sigma^{2})$ evaluated at $y^{(i)}-\theta^Tx^{(i)}$.

Here is the next step: what is the probability of seeing all $m$ outputs, namely $y^{(1)}, y^{(2)}, ..., y^{(m)}$, given the corresponding inputs $x^{(1)}, x^{(2)}, ..., x^{(m)}$? Assuming that the $y^{(i)}$'s are independent of each other given the inputs (independence assumption), this probability can be written as:

\begin{equation}
\begin{split}
L(\theta)& =\prod_{i=1}^m p(y^{(i)} | x^{(i)}; \theta) \\
& = \prod_{i=1}^m \frac{1}{ \sqrt{2\pi\sigma^2 }}\exp\left(\frac{-(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma ^2 }\right)
\end{split}
\end{equation}

$L(\theta)$ is known as the likelihood function, that is, it is the probability of seeing the data as a function of the parameters.

The next question is which value of $\theta$ makes the observed data most likely (this is known as maximum likelihood estimation). We now have an optimization problem: find the value of $\theta$ that maximizes $L(\theta)$.

A common trick is to maximize the log of the likelihood, which is easier to work with. Since log is a strictly increasing function, the maximizing $\theta$ is the same.
\begin{equation}
\begin{split}
\log L(\theta) & = \log\prod_{i=1}^m \frac{1}{ \sqrt{2\pi\sigma^2 }}\exp\left(\frac{-(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma ^2 }\right) \\
& = \sum_{i=1}^m \log\left[\frac{1}{ \sqrt{2\pi\sigma^2 }}\exp\left(\frac{-(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma ^2 }\right)\right] \\
& = m\log\frac{1}{ \sqrt{2\pi\sigma^2 }}-\frac{1}{\sigma^2}\frac{1}{2}\sum_{i=1}^m (y^{(i)} - \theta^Tx^{(i)})^2 
\end{split}
\end{equation}

As can be seen above, the first term does not depend on $\theta$ and $\sigma^2 > 0$, so maximizing $\log L(\theta)$ is equivalent to minimizing 
$$
\frac{1}{2}\sum_{i=1}^m (y^{(i)} - \theta^Tx^{(i)})^2 
$$
which is the cost function we defined at the beginning. 
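The relation between the log-likelihood and the cost can be checked numerically. The following is a sketch on synthetic data; the data, the value of `sigma`, and the two candidate parameter vectors are arbitrary choices made for illustration. It verifies that $\log L(\theta) = m\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{\sigma^2}J(\theta)$, so a lower cost always corresponds to a higher likelihood.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """Sum over examples of log N(y_i; theta^T x_i, sigma^2)."""
    residuals = y - X @ theta
    log_densities = (-0.5 * np.log(2 * np.pi * sigma**2)
                     - residuals**2 / (2 * sigma**2))
    return np.sum(log_densities)

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    return 0.5 * np.sum((y - X @ theta) ** 2)

# Synthetic data generated from the assumed model y = theta^T x + noise.
rng = np.random.default_rng(0)
m, sigma = 50, 0.5
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=m)

theta_a = np.array([1.0, 2.0])  # the generating parameters
theta_b = np.array([0.0, 0.0])  # an arbitrary alternative

# The theta-independent term of the log-likelihood.
const = m * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2))
for theta in (theta_a, theta_b):
    lhs = log_likelihood(theta, X, y, sigma)
    rhs = const - cost(theta, X, y) / sigma**2
    print(np.isclose(lhs, rhs))
```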


### Batch Gradient Descent
<blockquote>
<b>Algorithm</b>: Batch Gradient Descent <br>
repeat <br>
&nbsp;&nbsp;&nbsp;&nbsp; $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) $ &nbsp;&nbsp;&nbsp;&nbsp; (for every $j$) <br>
until convergence
</blockquote>
    
Below is the derivative of the cost function for a data set where there is a single example ($x, y$).

\begin{equation} \label{eq1}
\begin{split}
\frac{\partial}{\partial \theta_j}J(\theta) & =  \frac{\partial}{\partial \theta_j}\frac{1}{2}(y-h_\theta(x))^2 \\
 & =2\frac{1}{2}(y-h_\theta(x)) \frac{\partial}{\partial \theta_j} (y-h_\theta(x))\\
 & =(y-h_\theta(x)) \frac{\partial}{\partial \theta_j}\left(y- \sum_{i=0}^{n}\theta_ix_{i}\right)\\
  & =-(y-h_\theta(x)) x_{j}
\end{split}
\end{equation}

For $m$ examples:
\begin{equation}
\frac{\partial}{\partial \theta_j}J(\theta) = -\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)})) x^{(i)}_{j}
\end{equation}
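A quick way to gain confidence in a derived gradient is to compare it with a finite-difference approximation. The sketch below does this on random data; the helper names `gradient` and `numerical_gradient`, the step size `eps`, and the data are assumptions made for illustration.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    return 0.5 * np.sum((y - X @ theta) ** 2)

def gradient(theta, X, y):
    """Analytic gradient: dJ/dtheta_j = -sum_i (y_i - h(x_i)) * x_ij."""
    return -X.T @ (y - X @ theta)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Random data and a random parameter vector, used only for the comparison.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.normal(size=20)
theta = rng.normal(size=3)

print(np.allclose(gradient(theta, X, y),
                  numerical_gradient(theta, X, y), atol=1e-4))
```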

So the gradient descent algorithm becomes:

<blockquote>
<b>Algorithm</b>: Batch Gradient Descent <br>
repeat<br>
&nbsp;&nbsp;&nbsp;&nbsp; $\theta_j := \theta_j + \alpha \sum\limits_{i=1}^m(y^{(i)}-h_\theta(x^{(i)})) x^{(i)}_{j} $ &nbsp;&nbsp;&nbsp;&nbsp; (for every $j$) <br>
until convergence
</blockquote>

    
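$\alpha$ is called the learning rate; it controls the magnitude of the updates. Note that all $\theta_j$'s must be updated simultaneously, that is, every partial derivative is computed with the old $\theta$ before any component is overwritten. 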
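The batch update can be sketched in vectorized form, where `X.T @ residuals` computes the sum over examples for every $j$ at once, so the simultaneous update comes for free. This is an illustrative sketch under some assumptions: it runs for a fixed number of iterations instead of testing for convergence, the learning rate is hand-picked for this data, and the data are noise-free so the correct answer is known.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.005, n_iters=2000):
    """Repeat theta := theta + alpha * sum_i (y_i - h(x_i)) * x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ theta                  # uses the old theta for all j
        theta = theta + alpha * (X.T @ residuals)  # simultaneous update
    return theta

# Noise-free synthetic data generated with theta = (3, 2).
rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(-1.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x1])
y = 3.0 + 2.0 * x1

theta = batch_gradient_descent(X, y)
print(theta)  # close to [3, 2]
```

Because the update sums over all $m$ examples, a workable learning rate shrinks as the data set grows; too large a value makes the iterates diverge.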

### Stochastic Gradient Descent

<blockquote>
<b>Algorithm</b>: Stochastic Gradient Descent <br>
repeat <br>
&nbsp;&nbsp;&nbsp;&nbsp; shuffle the data <br>
&nbsp;&nbsp;&nbsp;&nbsp; for $i = 1$ to $m$ do <br>
&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;  $\theta_j := \theta_j +\alpha(y^{(i)}-h_\theta(x^{(i)})) x^{(i)}_{j} $  &nbsp;&nbsp;&nbsp;&nbsp;   (for every $j$) <br>
until convergence
</blockquote>
    
    
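Unlike the batch version, stochastic gradient descent updates the parameters after seeing every individual example. On large data sets it often makes progress much faster than the batch version, because it does not need to scan all $m$ examples before each update. With a fixed learning rate, however, the parameters may keep oscillating around the minimum instead of converging exactly.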
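The stochastic version can be sketched as follows. As before, this is an illustration under stated assumptions: a fixed number of passes over the data replaces the convergence test, the learning rate and seed are arbitrary, and the data are noise-free so that the iterates can settle on the exact parameters.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=100, seed=0):
    """Update theta after every single example, in shuffled order."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(m):               # shuffle the data
            error = y[i] - X[i] @ theta            # y_i - h(x_i)
            theta = theta + alpha * error * X[i]   # update every theta_j at once
    return theta

# Noise-free synthetic data generated with theta = (3, 2).
rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(-1.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x1])
y = 3.0 + 2.0 * x1

theta = stochastic_gradient_descent(X, y)
print(theta)  # close to [3, 2]
```

With noisy outputs the same loop would hover near the least-squares solution; decreasing `alpha` over time is a common remedy.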


### Closed Form Solution

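Let $X \in \mathbb{R}^{m \times (n+1)}$ be the matrix whose $i$-th row is $(x^{(i)})^T$ (including $x^{(i)}_0 = 1$), and let $y \in \mathbb{R}^m$ be the vector of outputs. Using this notation we can write the cost function
\begin{equation}
J(\theta) = \frac{1}{2} \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))^2 
\end{equation}
as follows:
\begin{equation}
J(\theta) = \frac{1}{2}(y - X\theta)^T(y-X\theta)
\end{equation}

To find the value of $\theta$ that minimizes the cost function, we set the gradient with respect to $\theta$ to zero and solve for $\theta$. The constant factor $\frac{1}{2}$ does not change the minimizer, so it is dropped below.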

\begin{equation} \label{eq2}
\begin{split}
 \nabla_\theta (y - X\theta)^T(y-X\theta) & = 0 \\
  -2X^T(y-X\theta) & = 0 \\
-2X^Ty+2X^TX\theta & = 0\\
X^TX\theta & = X^Ty \\
(X^TX)^{-1}X^TX\theta & = (X^TX)^{-1}X^Ty \\
I\theta & = (X^TX)^{-1}X^Ty \\
\theta & = (X^TX)^{-1}X^Ty
\end{split}
\end{equation}

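Note that this solution requires $X^TX$ to be invertible, and that inverting this $(n+1)\times(n+1)$ matrix takes $O(n^3)$ time with standard methods, so the closed form becomes expensive when the number of features is large.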
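In code, the closed form is usually evaluated by solving the linear system $X^TX\theta = X^Ty$ instead of forming the inverse explicitly. The sketch below does this with `np.linalg.solve` on synthetic data and compares the result with NumPy's least-squares routine; the data and the generating parameters are assumptions made for illustration.

```python
import numpy as np

# Synthetic data: m examples, two features plus the constant x_0 = 1.
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Solve the normal equations X^T X theta = X^T y without forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares routine.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta, theta_ref))
print(theta)  # near the generating parameters, up to the added noise
```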


### Regularized Linear Regression
#### Ridge Regression 

Cost function: <br>
\begin{equation}
J(\theta) = \displaystyle \frac{1}{2m} \left[\sum\limits_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))^2 + \lambda\sum\limits_{j=1}^n\theta_j^2\right] 
\end{equation}

<b>Algorithm:</b> Gradient Descent for Ridge Regression
<blockquote>
repeat<br>
&nbsp;&nbsp;&nbsp;&nbsp;  $\theta_0 := \theta_0 + \alpha \frac{1}{m}  \sum\limits_{i=1}^m (y^{(i)}-h_\theta(x^{(i)}))x^{(i)}_{0}$ <br>
&nbsp;&nbsp;&nbsp;&nbsp;  $\theta_j := \theta_j + \alpha \left[ \frac{1}{m}  \sum\limits_{i=1}^m (y^{(i)}-h_\theta(x^{(i)}))x^{(i)}_{j} - \frac{\lambda}{m}\theta_j \right]$ &nbsp;&nbsp;&nbsp;&nbsp;  $(j = 1,2,3, ..., n)$ <br>
until convergence
</blockquote>

<b>Closed form solution:</b><br>
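\begin{equation}
\theta = (X^TX+\lambda \hat{I})^{-1}X^Ty
\end{equation}

where $\hat{I}$ is the $(n+1)\times(n+1)$ identity matrix with its top-left entry set to $0$, so that $\theta_0$ is not penalized, consistent with the cost function above.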
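Both ridge solutions can be sketched and compared on synthetic data. One assumption is worth stating loudly: since the penalty in the cost function starts at $j = 1$, the sketch leaves $\theta_0$ unpenalized in both routines, which in the closed form means zeroing the first diagonal entry of the matrix that multiplies $\lambda$. The function names, the data, the value of `lam`, and the fixed iteration count are illustrative choices.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I_hat) theta = X^T y, with I_hat[0, 0] = 0."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # assumption: the intercept theta_0 is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def ridge_gradient_descent(X, y, lam, alpha=0.1, n_iters=2000):
    """Gradient descent on the ridge cost with the 1/(2m) scaling."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ theta
        shrink = (lam / m) * theta
        shrink[0] = 0.0  # no shrinkage on theta_0
        theta = theta + alpha * (X.T @ residuals / m - shrink)
    return theta

# Synthetic data: m examples, two features plus the constant x_0 = 1.
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=m)

lam = 10.0
theta_closed = ridge_closed_form(X, y, lam)
theta_gd = ridge_gradient_descent(X, y, lam)
print(np.allclose(theta_closed, theta_gd, atol=1e-8))
```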
