# Linear Regression

## Applications

* Linear regression analysis can help a builder predict how many houses it will sell in the coming months and at what price.

![title](../Images/house_price_pred.png)


* An organisation can use linear regression to estimate how much to pay a new employee based on their years of experience.

![title](../Images/salary_pred.png)


## Problem formulation

We are given an $n\times p$ matrix and an $n\times1$ matrix. Let's denote them $X$ and $Y$ respectively.

$X$ may be seen as a matrix of row-vectors $\mathbf{x}_{i}$ or of $n$-dimensional column-vectors $X_{j}$, which are known as features, regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables.  

![title](../Images/X.svg?sanitize=true)

$\mathbf{y}$ is a vector of observed values $y_{i}\ (i=1,\ldots ,n)$ of the variable called the target, regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable.

![title](../Images/Y.svg?sanitize=true)


###### So we need to find some $f: X \to Y$ that makes "good" predictions on our dataset.
<br>
This function $f: X \to Y$ is called the model.
<br>
Let's assume that $f$ is a linear function of $X$:

### <center>$f(x_{i}):=\hat{y}_i=x_{i}\beta$</center>

where
<br>
$$
\beta = \begin{bmatrix}
 \beta_{0} & \beta_{1} & \cdots & \beta_{p}
 \end{bmatrix}^{T}
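$$

is the vector of regression coefficients. The intercept $\beta_{0}$ pairs with a constant entry of 1 in each $x_{i}$, so when an intercept is used $X$ carries an additional column of ones.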
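As a minimal sketch of the linear model in NumPy: the helper names `add_intercept` and `predict` and the toy numbers are assumptions made for illustration, not part of the original material. The intercept is handled here by prepending a column of ones to the features.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so that beta[0] acts as the intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, beta):
    """Return y_hat_i = x_i @ beta for every row x_i of X."""
    return np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)

# Toy data (made up for illustration): 3 observations, 1 feature.
X_raw = np.array([[1.0], [2.0], [3.0]])
beta = np.array([0.5, 2.0])  # [intercept, slope]

y_hat = predict(add_intercept(X_raw), beta)
print(y_hat)
```

Each prediction is the dot product of one row of the design matrix with $\beta$, so the whole prediction vector is a single matrix-vector product.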

Assumption 0.
<br>
Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. 

## Loss function
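**Mean Squared Error** (MSE) is the mean of the squared errors. The extra factor $\frac{1}{2}$ used here is a common convention: it cancels the 2 produced by differentiation and does not change the minimizer.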

$$L(Y,X,\beta) = \frac {1}{2n}\sum_{i=1}^n(\hat{y}_i - y_i)^2$$
![title](../Images/mse_plot.png)


Up to the factor $\frac{1}{2n}$, this is the squared Euclidean distance between the vectors $Y$ (target) and $\hat{Y}$ (predictions).
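The loss can be sketched in a few lines of NumPy. The function name `mse_loss` and the toy data are assumptions for illustration; the design matrix here already contains a column of ones.

```python
import numpy as np

def mse_loss(X, y, beta):
    """Loss L = 1/(2n) * sum_i (x_i @ beta - y_i)^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    residuals = X @ np.asarray(beta, dtype=float) - y
    # residuals @ residuals is the squared Euclidean norm of the error vector.
    return residuals @ residuals / (2 * len(y))

# Toy data (made up for illustration).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
beta = np.array([0.0, 1.0])

loss = mse_loss(X, y, beta)
print(loss)
```

With these numbers the predictions are 1, 2, 3, so only the last residual is nonzero and the loss equals $\frac{1}{2\cdot 3}\cdot 1^2$.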

For the connection with Maximum Likelihood Estimation, see Bishop's [book](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf).

### <center> Our goal is to minimize $L(Y,X,\beta)$ with respect to $\beta$ </center>


![title](../Images/optimal_beta.svg?sanitize=true)


## Analytical Solution 
![title](../Images/Loss_norm.svg?sanitize=true)
<br>
![title](../Images/loss_der.svg?sanitize=true)
<br>
![title](../Images/optimal_beta_sol.svg?sanitize=true)


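###### Is $X^{T}X$ always invertible?


No. $X^{T}X$ is invertible only when the columns of $X$ are linearly independent, which requires $n \geq p$ and no perfect multicollinearity among the features.

Assumption 1. $X$ must have full column rank $p$.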
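A minimal sketch of the analytical solution, assuming it is the usual normal-equation form $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$. The function name `fit_normal_equation` and the toy data are hypothetical. The code checks Assumption 1 first and solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) beta = X^T y, assuming X has full column rank."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if np.linalg.matrix_rank(X) < X.shape[1]:
        raise ValueError("X does not have full column rank; X^T X is singular")
    # Solving the system is numerically preferable to computing an inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (made up): y = 1 + 2 * x exactly, first column is the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = 1.0 + 2.0 * X[:, 1]
beta_hat = fit_normal_equation(X, y)

# Appending a copy of a feature scaled by 2 breaks Assumption 1.
X_bad = np.hstack([X, 2.0 * X[:, [1]]])
rank_bad = np.linalg.matrix_rank(X_bad)
print(beta_hat, rank_bad)
```

The matrix `X_bad` has three columns but only two of them are linearly independent, which is the perfect multicollinearity case that makes $X^{T}X$ singular.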

## Gradient Descent

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum, one takes steps proportional to the negative of the gradient (or an approximate gradient) of the function at the current point. If one instead takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, $-\nabla F(\mathbf{a})$. It follows that, if
<br>
<center> $\mathbf {a} _{n+1}=\mathbf {a} _{n}-\gamma \nabla F(\mathbf {a} _{n})$ </center>

for a small enough step size $\gamma > 0$, then $F(\mathbf{a}_{n}) \geq F(\mathbf{a}_{n+1})$.

![title](../Images/gradient_descent_homo.gif)


![title](../Images/gradient_desc.gif)


![title](../Images/lrning_rate.png)


#### Calculate $\nabla{L}$


\begin{equation}
\nabla{L} = \frac{\partial L}{\partial \beta}
= \frac{\partial}{\partial \beta}\left[\frac {1}{2n}\sum_{i=1}^n(\hat{y}_i - y_i)^2\right]
= \frac{1}{n}\sum_{i=1}^n(x_{i}\beta - y_i)\frac{\partial (x_{i}\beta - y_i)}{\partial \beta}
= \frac{1}{n}\sum_{i=1}^n(x_{i}\beta - y_i)x_{i}^{T}
= \frac{1}{n}X^{T}(X\beta - Y)
\end{equation}

Here $x_{i}$ is a row vector, so $x_{i}^{T}$ is a column vector with the same shape as $\beta$.
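The gradient formula can be sanity-checked numerically. The sketch below uses the matrix form $\frac{1}{n}X^{T}(X\beta - Y)$ and compares it with central finite differences of the loss; the function names and the random test data are assumptions for illustration.

```python
import numpy as np

def mse_gradient(X, y, beta):
    """Gradient of 1/(2n) * ||X beta - y||^2, i.e. (1/n) X^T (X beta - y)."""
    n = X.shape[0]
    return X.T @ (X @ beta - y) / n

def numerical_gradient(X, y, beta, eps=1e-6):
    """Central finite differences of the loss, used only as a sanity check."""
    def loss(b):
        return np.sum((X @ b - y) ** 2) / (2 * X.shape[0])

    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (loss(beta + step) - loss(beta - step)) / (2 * eps)
    return grad

# Random data (made up) just to exercise both functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)

analytic = mse_gradient(X, y, beta)
numeric = numerical_gradient(X, y, beta)
print(np.max(np.abs(analytic - numeric)))
```

Because the loss is quadratic in $\beta$, the two gradients should agree up to floating-point rounding.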

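so our gradient step with learning rate $\alpha$ is
<br>

\begin{equation}
\beta^{(k+1)} = \beta^{(k)} - \alpha\nabla{L}(\beta^{(k)}) =\beta^{(k)} - \alpha \frac{1}{n}\sum_{i=1}^n(x_{i}\beta^{(k)} - y_i)x_{i}^{T}
\end{equation}

where the superscript $k$ is the iteration index and $n$ is the number of observations.

Learning stops when, for a given tolerance $\delta$,
\begin{equation}
 \left\lVert  {\beta^{(k+1)} - \beta^{(k)}} \right\rVert  \leq \delta
\end{equation}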
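Putting the update rule and the stopping criterion together gives the following minimal sketch. The function name `gradient_descent`, the default values of `alpha`, `delta` and `max_iter`, the zero initialization, and the toy data are all assumptions chosen for illustration, not prescribed settings.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, delta=1e-8, max_iter=10_000):
    """Minimize 1/(2n) * ||X beta - y||^2 by batch gradient descent.

    Stops when ||beta_next - beta|| <= delta or after max_iter steps.
    Returns the final beta and the number of steps taken.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n, p = X.shape
    beta = np.zeros(p)  # assumed starting point
    for k in range(max_iter):
        grad = X.T @ (X @ beta - y) / n
        beta_next = beta - alpha * grad
        if np.linalg.norm(beta_next - beta) <= delta:
            return beta_next, k + 1
        beta = beta_next
    return beta, max_iter

# Toy data (made up): y = 1 + 2 * x exactly, first column is the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = 1.0 + 2.0 * X[:, 1]

beta_gd, n_iter = gradient_descent(X, y)
print(beta_gd, n_iter)
```

On this toy problem the iterates approach the coefficients $(1, 2)$ that generated the data. A learning rate that is too large makes the iterates diverge, while one that is too small makes convergence slow, so `alpha` usually needs tuning per dataset.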

## References

* https://www.quora.com/What-are-some-real-world-applications-of-simple-linear-regression
* https://analyticstraining.com/popular-applications-of-linear-regression-for-businesses/
* http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf 
* https://en.wikipedia.org/wiki/Linear_regression#Assumptions
* https://math.stackexchange.com/questions/1956541/mean-square-error-using-linear-regression-gives-a-convex-function
* https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
* https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/