# Linear Regression
**Goal**: Given labeled data $(x^1,y^1),...,(x^N,y^N)$ with $x^i\in\mathbb{R}$ and $y^i\in\mathbb{R}$ for $i=1,...,N$, predict $y^i$ given $x^i$. The superscript $i$ indexes the examples; it is not an exponent.

### Assumptions
* We assume that our data is roughly linear.
* That is, there exists some line that approximates the data.

### Formulation
* $z$ is the pre-activation value
    * $x^i$ is the input
    * $w$ is the weight
    * $b$ is the bias
* $a$ is the post-activation value
    * $\hat y^i$ is our prediction

Note: $\hat y^i=a=z$, since linear regression applies no nonlinearity (the activation is the identity).
#### Model: $z=x^iw+b$
**Loss Function**: 

$$
\begin{align}
L(w,b)&=\frac{1}{2N}\sum_{i=1}^{N}(\hat y^i-y^i)^2\\
&= \frac{1}{2N}\sum_{i=1}^{N}(x^iw+b-y^i)^2
\end{align}
$$
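
The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `predict` and `loss` and the toy data are assumptions made for the example and do not come from the notes.

```python
def predict(x, w, b):
    """Linear model: y_hat = x * w + b."""
    return x * w + b


def loss(xs, ys, w, b):
    """Half mean squared error: (1 / 2N) * sum((y_hat - y)^2)."""
    n = len(xs)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_loss = loss(xs, ys, w=2.0, b=1.0)  # the generating line fits exactly
zero_loss = loss(xs, ys, w=0.0, b=0.0)     # predicting 0 for every input
```

On this toy data the generating line gives a loss of exactly zero, while any other choice of $w$ and $b$ gives a positive value.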

## **Task:** Optimize the Model Using Gradient Descent
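* Find the gradient: $\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w}\\ \frac{\partial L}{\partial b}\end{bmatrix}$
* To simplify, first consider the loss on a single example $(x^i,y^i)$; averaging over all $N$ examples then gives the full gradient.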
$$ \begin{align}
L(w,b;x^i,y^i)&=\frac{1}{2}(x^iw+b-y^i)^2\\
\frac{\partial}{\partial w}[L(w,b;x^i,y^i)]&=\frac{1}{2}\frac{\partial}{\partial w}[(x^iw+b-y^i)^2]\\
&=\frac{1}{2}\cdot 2(x^iw+b-y^i)x^i\\
&=(x^iw+b-y^i)x^i\\
&=(\hat y^i-y^i)x^i\\
\frac{\partial}{\partial b}[L(w,b;x^i,y^i)]&=\frac{1}{2}\cdot 2(x^iw+b-y^i)\cdot 1\\
&=\hat y^i-y^i\\
\frac{\partial L}{\partial w}&=\frac{1}{N}\sum_{i=1}^N(\hat y^i-y^i)x^i\\
\frac{\partial L}{\partial b}&=\frac{1}{N}\sum_{i=1}^N(\hat y^i-y^i)\\
\nabla L &= \begin{bmatrix}
\frac{1}{N}\sum_{i=1}^N(\hat y^i-y^i)x^i\\
\frac{1}{N}\sum_{i=1}^N(\hat y^i-y^i)
\end{bmatrix}
\end{align}$$
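
One way to sanity-check the derived gradient is to compare it against central finite differences of the loss. The sketch below assumes plain Python lists for the data; the function names `gradient` and `numerical_gradient`, the step `eps`, and the toy values are all hypothetical choices for illustration.

```python
def loss(xs, ys, w, b):
    """Half mean squared error: (1 / 2N) * sum((x*w + b - y)^2)."""
    n = len(xs)
    return sum((x * w + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradient(xs, ys, w, b):
    """Return (dL/dw, dL/db) from the closed-form expressions."""
    n = len(xs)
    errors = [x * w + b - y for x, y in zip(xs, ys)]  # y_hat^i - y^i
    dw = sum(e * x for e, x in zip(errors, xs)) / n
    db = sum(errors) / n
    return dw, db


def numerical_gradient(xs, ys, w, b, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    dw = (loss(xs, ys, w + eps, b) - loss(xs, ys, w - eps, b)) / (2 * eps)
    db = (loss(xs, ys, w, b + eps) - loss(xs, ys, w, b - eps)) / (2 * eps)
    return dw, db


# Toy data (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

analytic = gradient(xs, ys, w=0.5, b=-1.0)
numeric = numerical_gradient(xs, ys, w=0.5, b=-1.0)
```

If the derivation is right, `analytic` and `numeric` should agree up to floating-point error.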

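### Full Gradient Descent:
With learning rate $\alpha$, each step updates both parameters using the gradient over all $N$ examples: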
$$
\begin{bmatrix} w \\ b \end{bmatrix} \gets \begin{bmatrix} w \\ b \end{bmatrix} - \alpha \begin{bmatrix} \frac{1}{N} \sum_{i=1}^N( \hat y^i-y^i)x^i \\ \frac{1}{N}\sum_{i=1}^N(\hat y^i-y^i)\end{bmatrix}
$$
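
The update rule can be sketched as a loop that recomputes the gradient over the whole dataset at every step. The zero initialization, the values of `alpha` and `num_steps`, and the toy data are assumptions picked so that this small example converges; they are not prescribed by the notes.

```python
def gradient_descent(xs, ys, alpha=0.1, num_steps=2000):
    """Full-batch gradient descent for the model y_hat = x * w + b."""
    w, b = 0.0, 0.0  # assumed starting point
    n = len(xs)
    for _ in range(num_steps):
        errors = [x * w + b - y for x, y in zip(xs, ys)]  # y_hat^i - y^i
        dw = sum(e * x for e, x in zip(errors, xs)) / n
        db = sum(errors) / n
        # Update both parameters from the same gradient evaluation.
        w, b = w - alpha * dw, b - alpha * db
    return w, b


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
```

On this data the loop should recover values close to $w=2$ and $b=1$. A learning rate that is too large makes the iterates diverge, so `alpha` usually needs tuning for other datasets.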

### Stochastic Gradient Descent
* for _ in $1,2,...,num\_epochs$ do

    * for $i$ in $1,2,...,N$ do

        * $\hat y^i \gets x^iw+b$

        * $w \gets w-\alpha(\hat y^i-y^i)x^i$

        * $b\gets b-\alpha(\hat y^i-y^i)$
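
The loop above translates almost line for line into Python. In this sketch the prediction error is computed once per example and reused for both updates; the starting point, the default `alpha` and `num_epochs`, and the toy data are illustrative assumptions. Examples are visited in their given order, as in the pseudocode, although in practice the order is often shuffled each epoch.

```python
def sgd(xs, ys, alpha=0.05, num_epochs=1000):
    """Stochastic gradient descent: one parameter update per example."""
    w, b = 0.0, 0.0  # assumed starting point
    for _ in range(num_epochs):
        for x, y in zip(xs, ys):
            error = x * w + b - y  # y_hat^i - y^i, computed before updating
            w -= alpha * error * x
            b -= alpha * error
    return w, b


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = sgd(xs, ys)
```

Because this toy data lies exactly on a line, the per-example updates settle near $w=2$ and $b=1$. With noisy data and a fixed $\alpha$, the parameters typically keep fluctuating around the minimizer instead of settling on it.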