# Regression Analysis

Regression analysis finds the relationship between a dependent variable and one or more independent variables. There are two ways to approach it: a deterministic model and a probabilistic model.

----

### Deterministic Model
- A deterministic model finds a function $ f(x) $ whose output $ \hat{y} $ is as close as possible to the dependent variable $ y $.
- The linear regression model is as follows.

$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n = w_0 + w^Tx $$

- $ w_0 $, $ ... $, $ w_n $ are called `coefficients` or `parameters`.
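As a minimal sketch of the formula above, the prediction can be computed either as an explicit sum or as a dot product. The weights and input values here are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical weights and one input vector, chosen only for illustration.
w0 = 1.0                          # intercept w_0
w = np.array([2.0, -0.5, 0.25])   # weights w_1..w_n
x = np.array([1.0, 4.0, 8.0])     # features x_1..x_n

# Explicit sum: w_0 + w_1*x_1 + ... + w_n*x_n
y_hat_sum = w0 + sum(wi * xi for wi, xi in zip(w, x))

# Vector form: w_0 + w^T x
y_hat_dot = w0 + w @ x

print(y_hat_sum, y_hat_dot)
```

Both forms give the same value, which is why the compact $ w_0 + w^Tx $ notation is used from here on.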

----

### OLS(Ordinary Least Squares)
- OLS is a deterministic model that minimizes the RSS (Residual Sum of Squares).
- The residual vector $ e $ and the RSS are as follows.

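- Here $ X $ is the data matrix with one sample per row and a leading column of ones, so that $ w_0 $ is absorbed into $ w $ and $ \hat{y} = Xw $.

$$ e = y-\hat{y} = y-Xw $$

$$ RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 = \sum_{i=1}^{n}(y_i-x_i^Tw)^2 $$

$$ L = RSS = e^Te $$

$$ = (y-Xw)^T(y-Xw) $$

$$ = y^Ty - 2y^TXw + w^TX^TXw $$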

- To find the $ w $ that minimizes the loss, take the gradient of $ L $ with respect to $ w $.

$$ \frac{dL}{dw} = -2X^Ty + 2X^TXw$$

- The optimality condition is that this gradient equals zero.

$$ \frac{dL}{dw} = 0 $$

$$ X^Ty = X^TXw$$

- So, **`if` $ X^TX $ `is invertible`**, the solution is as follows.

$$ w^* = (X^TX)^{-1}X^Ty $$
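The closed-form solution can be checked numerically. The following is a minimal sketch on hypothetical noise-free toy data (the names `features` and `true_w` are assumptions for illustration); it solves the linear system $ X^TXw = X^Ty $ instead of forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2*x1 - 3*x2 exactly (no noise).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
true_w = np.array([1.0, 2.0, -3.0])          # [w_0, w_1, w_2]

# Prepend a column of ones so w_0 is handled like any other weight.
X = np.column_stack([np.ones(len(features)), features])
y = X @ true_w

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_star)
print(w_lstsq)
```

Because the data contain no noise, both solvers should recover `true_w` up to floating-point error.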

----

### MLE(Maximum likelihood estimator)

Maximum likelihood estimation (MLE) is a technique used for estimating the parameters of a given distribution, using some observed data.

- The likelihood is written as $ L(parameters | data ) $.
- It is slightly different from a probability, $ P(data | parameters) $: the likelihood treats the data as fixed and the parameters as the unknowns.
- For example, if a population is known to follow a “normal distribution” but the “mean” and “variance” are unknown, MLE can be used to estimate them using a limited sample of the population. MLE does that by finding the particular values of the parameters (mean and variance) for which the resulting model is most likely to have generated the observed data.
- The PDF of the Gaussian distribution is: $ (\frac{1}{\sqrt{2\pi\sigma^2}} * e^{-\frac{(x-\mu)^2}{2\sigma^2}}) $
- Another example
    - we have 3 data points: 2, 2.5, 3
    - let's suppose that the values come from a normal (Gaussian) distribution
    - so we have some data from a model
    - but we don't know its parameters
    - in this situation, we can use MLE
----
### Calculating Likelihood

The likelihood is also computed from the PDF: assuming the data points are independent, it is the product of the PDF values of all data points (their joint density).

$$ L(parameters | data) = \prod_{i=1}^{m} f(data_i | parameters) $$

- Now, assume that $ \mu=2, \sigma^2=1 $.
- Then the likelihood under the model $ N(\mu=2, \sigma^2=1) $ is:

$$ L(\mu=2, \sigma^2=1 | x = 2, 2.5, 3) = PDF(x=2) * PDF(x=2.5) * PDF(x=3) $$

$$ L(2, 1 | x = 2, 2.5, 3) = f(x=2|2,1) * f(x=2.5|2,1) * f(x=3|2,1) $$

$$ L(2, 1 | x = 2, 2.5, 3) = (\frac{1}{\sqrt{2\pi1^2}} * e^{-\frac{(2-2)^2}{2*1^2}}) * (\frac{1}{\sqrt{2\pi1^2}} * e^{-\frac{(2.5-2)^2}{2*1^2}}) * (\frac{1}{\sqrt{2\pi1^2}} * e^{-\frac{(3-2)^2}{2*1^2}}) $$

$$ L(2, 1 | x = 2, 2.5, 3) \approx 0.399 * 0.352 * 0.242 \approx 0.0340 $$
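The same calculation can be sketched in a few lines of Python. The helper name `gaussian_pdf` is an assumption introduced for this example; it simply implements the Gaussian PDF given earlier. With unrounded densities the product comes out at roughly 0.034.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

data = [2.0, 2.5, 3.0]
mu, sigma2 = 2.0, 1.0

# One density per data point, then their product is the likelihood.
densities = [gaussian_pdf(x, mu, sigma2) for x in data]
likelihood = math.prod(densities)

print([round(d, 4) for d in densities])
print(round(likelihood, 4))
```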

- How can we find the values of $ \mu $ and $ \sigma^2 $ that maximize the likelihood? Leave them as unknowns:

$$ L(\mu, \sigma^2 | x = 2, 2.5, 3) = \prod_{i=1}^{3} f(x_i | \mu, \sigma^2) $$

$$ = (\frac{1}{\sqrt{2\pi\sigma^2}} * e^{-\frac{(2-\mu)^2}{2\sigma^2}}) * (\frac{1}{\sqrt{2\pi\sigma^2}} * e^{-\frac{(2.5-\mu)^2}{2\sigma^2}}) * (\frac{1}{\sqrt{2\pi\sigma^2}} * e^{-\frac{(3-\mu)^2}{2\sigma^2}})$$

$$ = (\frac{1}{\sqrt{2\pi\sigma^2}})^3 * e^{-\frac{(2-\mu)^2}{2\sigma^2}} * e^{-\frac{(2.5-\mu)^2}{2\sigma^2}} * e^{-\frac{(3-\mu)^2}{2\sigma^2}} $$

- Taking the natural log turns the product into a sum.

$$ ln(L(\mu, \sigma^2 | x = 2, 2.5, 3)) $$

$$ = -3ln(\sqrt{2\pi\sigma^2}) - {\frac{(2-\mu)^2}{2\sigma^2}} - {\frac{(2.5-\mu)^2}{2\sigma^2}} - {\frac{(3-\mu)^2}{2\sigma^2}} $$

$$ = -3ln(\sqrt{2\pi\sigma^2}) - {\frac{19.25 - 15\mu + 3\mu^2}{2\sigma^2}} $$

- Now we can find the ML estimator for $ \mu $. The log-likelihood is a concave quadratic function of $ \mu $, so its maximum is at the stationary point, where the partial derivative with respect to $ \mu $ is zero.

$$ \frac{\partial(ln(L(\mu, \sigma^2 | x)))}{\partial\mu} = \frac{15 - 6\mu}{2\sigma^2} = 0 $$

$$ \mu = 2.5 $$

- Now we know the MLE of $\mu$. In the same way, we can get the MLE of $\sigma^2$ by setting the partial derivative with respect to $\sigma^2$ to zero.
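For a Gaussian, setting both partial derivatives to zero gives the sample mean for $ \mu $ and the average squared deviation $ \frac{1}{n}\sum_i (x_i-\mu)^2 $ for $ \sigma^2 $. The sketch below applies these closed forms to the three data points and compares the log-likelihood with the earlier guess $ (\mu=2, \sigma^2=1) $; the function name `log_likelihood` is an assumption for this example.

```python
import math

data = [2.0, 2.5, 3.0]
n = len(data)

def log_likelihood(mu, sigma2):
    """Gaussian log-likelihood of `data` under N(mu, sigma2)."""
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

# Closed-form maximizers: sample mean and (1/n) * sum of squared deviations.
mu_mle = sum(data) / n
sigma2_mle = sum((x - mu_mle) ** 2 for x in data) / n

print(mu_mle, sigma2_mle)
print(log_likelihood(mu_mle, sigma2_mle), log_likelihood(2.0, 1.0))
```

The log-likelihood at the estimated parameters should be higher than at the initial guess, and nudging either parameter away from its estimate should lower it.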

### Linear Regression with MLE

In linear regression with MLE, we need to assume that `y` follows a normal distribution.

> “The dependent variable (y values) is assumed to be in a normal distribution”

$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n $$

- Our ML estimate is $\hat{y}$. So from $f(x)$, we get a set of values that serve as means.
- The following graph shows this concept: the label of each data point follows a normal distribution with its own mean and variance.

![MLE of regression](img/MLE.png)

- So, $ \hat{y} \sim N(WX,0) $: given $ X $, the prediction is deterministic and has no variance.
- We also make the following assumptions.

> - "Residuals (= errors, $y-\hat{y}$) are normally distributed"
> - Mean of residuals is 0
> - Variance of residuals is $\sigma^2$

- So, the residuals follow this distribution.

$$ \epsilon \sim N(0, \sigma^2) $$

- Now we have two ideas. One is $ \hat{y} \sim N(WX,0) $ and the other is $ \epsilon \sim N(0, \sigma^2) $. From these, the following facts hold.

> - $E(y) = E(\hat{y} + \epsilon)$
> - $E(y) = E(\hat{y}) + E(\epsilon)$
> - $E(y) = WX + 0 = WX$
> - $Variance(y) = Variance(\hat{y} + \epsilon)$
> - $Variance(y) = Variance(\hat{y}) + Variance(\epsilon)$
> - $Variance(y) = 0 + \sigma^2$

- So, the result is

$$ y \sim N(WX, \sigma^2) $$
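A quick simulation can make this result concrete. The sketch below fixes one hypothetical input row `x`, hypothetical weights `w`, and an assumed noise variance `sigma2`, then draws many values of $ y = \hat{y} + \epsilon $ and compares the sample mean and variance with the deterministic prediction and $ \sigma^2 $.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: one fixed input row, fixed weights, assumed noise variance.
x = np.array([1.0, 2.0, -1.0])   # first entry 1.0 plays the role of the intercept term
w = np.array([0.5, 1.5, 2.0])
sigma2 = 0.25

y_hat = x @ w                    # deterministic part, so its variance is 0
eps = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=200_000)
y = y_hat + eps                  # many draws of y for the same x

print(y_hat, y.mean(), y.var())
```

The sample mean should land close to `y_hat` and the sample variance close to `sigma2`, up to sampling error.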

- Now we can define the likelihood of regression.

$$ L(W, \sigma^2 | X, y) = \prod_{i=1}^{n} f(y_i | Wx_i, \sigma^2) $$

$$ = (\frac{1}{\sqrt{2\pi\sigma^2}})^n * e^{-\frac{\sum_{i=1}^{n}(y_i-Wx_i)^2}{2\sigma^2}} $$

$$ = (\frac{1}{\sqrt{2\pi\sigma^2}})^n * e^{-\frac{(y-WX)^T(y-WX)}{2\sigma^2}} $$

- Taking the log gives the log-likelihood. Since $ \sigma^2 $ does not depend on $ W $, maximizing it over $ W $ means minimizing $ (y-WX)^T(y-WX) $, which is exactly the RSS minimized by OLS.

![MLE of regression](../img/MLE_math.png)
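Written out, the Gaussian log-likelihood is $ ln(L) = -\frac{n}{2}ln(2\pi\sigma^2) - \frac{(y-WX)^T(y-WX)}{2\sigma^2} $, so only the RSS term depends on the weights. The following sketch checks this numerically on hypothetical noisy data: the OLS weights should have a higher log-likelihood than any randomly perturbed weights. The data, the fixed `sigma2`, and the helper names `rss` and `log_likelihood` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data from a linear model with an intercept column.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_w = np.array([0.5, -1.0, 2.0])
sigma2 = 0.3
y = X @ true_w + rng.normal(scale=np.sqrt(sigma2), size=n)

def rss(w):
    """Residual sum of squares for weights w."""
    residual = y - X @ w
    return residual @ residual

def log_likelihood(w, sigma2):
    # ln L = -(n/2) * ln(2*pi*sigma2) - RSS / (2*sigma2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss(w) / (2 * sigma2)

# OLS estimate: the minimizer of RSS.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other weight vector has a larger RSS, hence a smaller log-likelihood.
trials = w_ols + rng.normal(scale=0.1, size=(1000, 3))
best_trial_ll = max(log_likelihood(w, sigma2) for w in trials)

print(log_likelihood(w_ols, sigma2), best_trial_ll)
```

This is a spot check over random perturbations rather than a proof, but it illustrates that the likelihood ranking of weight vectors is the reverse of their RSS ranking.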

#### Gradient Descent
The other way to estimate $W$ is iterative optimization. Gradient Descent is the basic first-order iterative optimization algorithm.

- Gradient Descent uses the partial derivatives of the cost function.
- The basic update rule is below ($\alpha$ is the learning rate).
- It is important for the cost function to be convex, so that the minimum found is the global one.

$$ \Theta = \Theta - \alpha * \frac{\partial L}{\partial \Theta} $$
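As a minimal sketch of this update rule for linear regression, the loop below uses the mean squared error (the RSS divided by the number of samples, which has the same minimizer) as the cost. The data, the learning rate `alpha = 0.1`, and the iteration count are assumed values chosen for this toy problem, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy linear data with an intercept column.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

alpha = 0.1                 # learning rate (assumed value)
theta = np.zeros(3)         # initial guess

for _ in range(1000):
    # Gradient of the mean squared error L = (1/n) * ||y - X theta||^2
    grad = (-2.0 / n) * X.T @ (y - X @ theta)
    theta = theta - alpha * grad

# Closed-form least-squares solution for comparison.
theta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)
print(theta_closed_form)
```

Because this cost is convex, the iterates should approach the closed-form least-squares solution; a learning rate that is too large would make the updates diverge instead.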

#### References
- https://medium.com/quick-code/maximum-likelihood-estimation-for-regression-65f9c99f815d
- https://en.wikipedia.org/wiki/Gradient_descent 