# Ridge Regression

*Linear regression using least squares loss with an $L_2$ regularization penalty*

---
* [Implementation in Python](../pymlalgo/regression/ridge_regression.py)
* [Demo](../demo/ridge_regression_demo.ipynb)

---

### Features and Labels Shapes
Each training example $x_i$ is composed of $d$ features and can therefore be represented as a vector of shape $d \times 1$. The matrix of all training examples, $\bf{X}$, has shape $d \times n$, where $n$ is the number of training examples and $d$ is the number of features. Each training example is thus a column of the matrix $\bf{X}$.

Each label $y_i$ is a scalar and there are $n$ labels. The vector of labels, $\bf{Y}$, has shape $n \times 1$.

After training, each feature is assigned a weight. The weight vector $\beta$ has shape $d \times 1$.
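The shape conventions above can be illustrated with a short NumPy sketch. The sizes `d = 3` and `n = 5` and the random data are arbitrary values chosen for illustration, not taken from the linked implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 5                       # illustrative sizes: 3 features, 5 examples
X = rng.normal(size=(d, n))       # each column is one training example x_i
Y = rng.normal(size=(n, 1))       # one scalar label per training example
beta = np.zeros((d, 1))           # one weight per feature

x_0 = X[:, [0]]                   # first training example, kept as a (d, 1) column
predictions = X.T @ beta          # shape (n, 1), the same shape as Y

print(X.shape, Y.shape, beta.shape, x_0.shape, predictions.shape)
```

Indexing with `X[:, [0]]` rather than `X[:, 0]` keeps the example as a $d \times 1$ column instead of a flat array of length $d$.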

### Loss and Cost Function
The loss for a single training example is given as:
$$\mathcal{L}(\hat{y}_i, y_i) = (y_i - \hat{y}_i)^2 = (y_i - x_i^T\beta)^2$$

The cost function is given as:
$$F(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda ||\beta||_2^2$$  

where $\lambda$ is the regularization coefficient (a hyperparameter).

The goal is to find the value of $\beta$ for which $F(\beta)$ is minimum.
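As a minimal sketch, the cost function can be evaluated directly from its definition. The function name `ridge_cost` and the toy data are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def ridge_cost(beta, X, Y, lam):
    """F(beta) = (1/n) * sum_i (y_i - x_i^T beta)^2 + lam * ||beta||_2^2."""
    n = X.shape[1]                       # X has shape (d, n)
    residuals = Y - X.T @ beta           # shape (n, 1)
    return float(np.sum(residuals ** 2) / n + lam * np.sum(beta ** 2))

# Toy data with d = 1 feature and n = 3 examples.
X = np.array([[1.0, 2.0, 3.0]])
Y = np.array([[2.0], [4.0], [6.0]])
beta = np.array([[1.0]])

# Residuals are 1, 2, 3, so the cost is 14/3 + 0.5 * 1.
cost = ridge_cost(beta, X, Y, lam=0.5)
print(cost)
```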

### Gradient Derivation

For $d = 1$ and $n = 1$:  
$$F(\beta) = \frac{1}{1}\sum_{i=1}^{1}(y_i - x_i^T\beta)^2 + \lambda ||\beta||_2^2$$
$$F(\beta) = (y - x\beta)^2 + \lambda \beta^2$$ 
Taking the derivative with respect to $\beta$:
$$F^{'}(\beta) = 2(y - x\beta)(-x) + 2\lambda \beta$$
$$F^{'}(\beta) = -2x(y - x\beta) + 2\lambda \beta$$
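The scalar derivative can be sanity-checked against a central finite difference. This is only an illustrative check; the values of `x`, `y`, `lam`, and `beta` below are arbitrary.

```python
def f(beta, x, y, lam):
    # Scalar ridge objective for d = 1 and n = 1.
    return (y - x * beta) ** 2 + lam * beta ** 2

def f_prime(beta, x, y, lam):
    # Derivative obtained above: -2x(y - x*beta) + 2*lam*beta.
    return -2 * x * (y - x * beta) + 2 * lam * beta

x, y, lam, beta = 2.0, 3.0, 0.1, 0.5
eps = 1e-6

# Central difference approximation of the derivative at beta.
numeric = (f(beta + eps, x, y, lam) - f(beta - eps, x, y, lam)) / (2 * eps)
analytic = f_prime(beta, x, y, lam)

print(numeric, analytic)
```

The two values should agree up to floating-point rounding, since the objective is quadratic in $\beta$.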

For $d > 1$ and $n > 1$:  
$$F(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda ||\beta||_2^2$$

The entrywise $L_2$ norm (also called the Frobenius norm) of a matrix $A$ with dimensions $m \times n$ can be written as:    
$$||A||_2 = \left(\sum_{i=1}^m \sum_{j=1}^n |A_{ij}|^2\right)^{\frac{1}{2}}$$

For any column vector $v$ of shape $d \times 1$, the product $v^Tv$ equals the squared $L_2$ norm. For example, with $d = 4$:
$$ v =  \left[ \begin{matrix} 
1 \\ 2 \\ 3 \\ 4 
\end{matrix} \right]$$
$$v^Tv = \left[ \begin{matrix} 
1 & 2 & 3 & 4 
\end{matrix} \right] \left[ \begin{matrix} 
1 \\ 2 \\ 3 \\ 4 
\end{matrix} \right] = \sum_{i=1}^d v_i^2 = ||v||_2^2 = 30$$


Since $\bf{Y} - \bf{X}^T\beta$ is a column vector of shape $n \times 1$, the sum of squared residuals can be written as a squared $L_2$ norm:
$$\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 = ||\bf{Y} - \bf{X}^T\beta||_2^2$$

$$F(\beta) = \frac{1}{n}||\bf{Y} - \bf{X}^T\beta||_2^2 + \lambda ||\beta||_2^2$$
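The equivalence between the sum of squared residuals, the inner product $r^Tr$, and the squared norm can be checked numerically. This is a small sketch on random data; the variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.normal(size=(d, n))
Y = rng.normal(size=(n, 1))
beta = rng.normal(size=(d, 1))

r = Y - X.T @ beta                       # residual vector, shape (n, 1)

# Explicit sum over training examples, as in the original cost function.
sum_form = sum((Y[i, 0] - X[:, i] @ beta[:, 0]) ** 2 for i in range(n))
inner_form = (r.T @ r).item()            # r^T r
norm_form = np.linalg.norm(r) ** 2       # ||r||_2^2

print(sum_form, inner_form, norm_form)
```

All three printed values should match up to floating-point rounding.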

Write $F(\beta) = \frac{1}{n}g(\beta) + h(\beta)$, where $g(\beta) = ||\bf{Y} - \bf{X}^T\beta||_2^2$ and $h(\beta) = \lambda ||\beta||_2^2$.

$$\frac{\partial}{\partial \beta} h(\beta) = \frac{\partial}{\partial \beta} \lambda ||\beta||_2^2 = 2\lambda \beta$$  

  
$$\frac{\partial}{\partial \beta} g(\beta) = \frac{\partial}{\partial \beta} ||\bf{Y} - \bf{X}^T\beta||_2^2 = \frac{\partial}{\partial \beta} (\bf{Y} - \bf{X}^T\beta)^T(\bf{Y} - \bf{X}^T\beta)$$

Expanding $g(\beta)$ and differentiating the terms individually:
$$g(\beta) = \bf{Y}^T\bf{Y} - \bf{Y}^T\bf{X}^T\beta - (\bf{X}^T\beta)^T\bf{Y} + (\bf{X}^T\beta)^T(\bf{X}^T\beta)$$
$$=\bf{Y}^T\bf{Y} - (\bf{X}\bf{Y})^T\beta - \beta^T\bf{X}\bf{Y} + \beta^T\bf{X}\bf{X}^T\beta$$
Now,
$$\frac{\partial}{\partial{\beta}}\bf{Y}^T\bf{Y} =0$$  

Using the identity $\frac{\partial}{\partial{X}} A^TX = \frac{\partial}{\partial{X}} X^TA = A$ (where $A$ and $X$ here denote generic column vectors):

$$\frac{\partial}{\partial{\beta}}(-(\bf{X}\bf{Y})^T\beta) = -\bf{XY}$$
$$\frac{\partial}{\partial{\beta}}(-(\beta^T\bf{X}\bf{Y})) = -\bf{XY}$$

Using the identity $\frac{\partial}{\partial{X}} X^TAX = (A + A^T)X$ (for a generic column vector $X$ and square matrix $A$):
$$\frac{\partial}{\partial{\beta}}\beta^T\bf{X}\bf{X}^T\beta = (\bf{XX}^T + (\bf{XX}^T)^T)\beta$$
$$=(\bf{XX}^T + \bf{XX}^T)\beta$$
$$=2\bf{XX}^T\beta$$

Collecting all four terms and dividing by $n$ (the $\frac{1}{n}$ factor in the first term of $F(\beta)$, which averages the losses):
$$\frac{1}{n}(0 -\bf{XY} - \bf{XY} + 2\bf{XX}^T\beta)$$
$$=-\frac{2}{n}(\bf{XY} - \bf{XX}^T\beta)$$
$$=-\frac{2\bf{X}}{n}(\bf{Y} - \bf{X}^T\beta)$$

Adding the gradient of the regularization term $h(\beta)$ gives the full gradient:  
$$\nabla F(\beta) = -\frac{2\bf{X}}{n}(\bf{Y} - \bf{X}^T\beta)  + 2 \lambda \beta$$
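The final gradient expression can be verified numerically by comparing it with central finite differences of the cost. The sketch below assumes the $d \times n$ layout of $\bf{X}$ used throughout; the names `ridge_cost` and `ridge_gradient` are hypothetical and not necessarily those used in the linked implementation.

```python
import numpy as np

def ridge_cost(beta, X, Y, lam):
    # F(beta) = (1/n) ||Y - X^T beta||^2 + lam ||beta||^2
    n = X.shape[1]
    r = Y - X.T @ beta
    return (r.T @ r).item() / n + lam * (beta.T @ beta).item()

def ridge_gradient(beta, X, Y, lam):
    # grad F(beta) = -(2/n) X (Y - X^T beta) + 2 lam beta, shape (d, 1)
    n = X.shape[1]
    return -2.0 / n * X @ (Y - X.T @ beta) + 2.0 * lam * beta

rng = np.random.default_rng(2)
d, n, lam = 3, 8, 0.1
X = rng.normal(size=(d, n))
Y = rng.normal(size=(n, 1))
beta = rng.normal(size=(d, 1))

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(d):
    step = np.zeros_like(beta)
    step[j, 0] = eps
    numeric[j, 0] = (ridge_cost(beta + step, X, Y, lam)
                     - ridge_cost(beta - step, X, Y, lam)) / (2 * eps)

analytic = ridge_gradient(beta, X, Y, lam)
max_abs_diff = float(np.max(np.abs(numeric - analytic)))
print(max_abs_diff)
```

A very small `max_abs_diff` indicates that the derived gradient matches the cost function.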


### Accuracy using R Squared
Once the weights have been calculated using gradient descent, the predictions can be made using
$$\hat{\bf{Y}} = \bf{X}^T\beta$$
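A minimal gradient descent sketch that produces $\beta$ and then the predictions is shown below. The fixed step size, the iteration count, the zero initialization, and the synthetic data are all assumptions made for illustration; the linked implementation may choose these differently. The closed-form solution, obtained by setting the gradient to zero, is included only as a cross-check.

```python
import numpy as np

def ridge_gradient(beta, X, Y, lam):
    # grad F(beta) = -(2/n) X (Y - X^T beta) + 2 lam beta
    n = X.shape[1]
    return -2.0 / n * X @ (Y - X.T @ beta) + 2.0 * lam * beta

def fit_ridge(X, Y, lam, step_size=0.05, n_iter=2000):
    # Plain gradient descent with a fixed step size (an illustrative choice).
    beta = np.zeros((X.shape[0], 1))
    for _ in range(n_iter):
        beta = beta - step_size * ridge_gradient(beta, X, Y, lam)
    return beta

# Synthetic data: labels are a noisy linear function of two features.
rng = np.random.default_rng(3)
d, n, lam = 2, 50, 0.1
X = rng.normal(size=(d, n))
true_beta = np.array([[1.5], [-2.0]])
Y = X.T @ true_beta + 0.01 * rng.normal(size=(n, 1))

beta = fit_ridge(X, Y, lam)
Y_hat = X.T @ beta                       # predictions, shape (n, 1)

# Cross-check: gradient = 0 gives (X X^T / n + lam I) beta = X Y / n.
beta_closed = np.linalg.solve(X @ X.T / n + lam * np.eye(d), X @ Y / n)
print(beta.ravel(), beta_closed.ravel())
```

Because of the penalty, the fitted weights are shrunk toward zero relative to `true_beta`; larger values of `lam` shrink them further.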

The accuracy of the model is measured by $R^2$ (R squared), also known as the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination).
It is calculated as:
$$R^2 = 1-\frac{\text{total squared error}}{\text{total variation}} = 1 -\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2 }$$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ is the mean of the labels.

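The maximum value of R squared is 1, which corresponds to perfect predictions, and higher values denote higher accuracy. A value of 0 means the model does no better than always predicting the mean of the labels, and the value can be negative when the model does worse than that.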
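The R squared formula can be sketched in a few lines of NumPy. The function name `r_squared` and the toy labels are hypothetical; the three calls show a perfect fit, a mean-only prediction, and a prediction worse than the mean.

```python
import numpy as np

def r_squared(Y, Y_hat):
    # 1 - (total squared error) / (total variation around the mean label)
    squared_error = np.sum((Y_hat - Y) ** 2)
    total_variation = np.sum((Y - np.mean(Y)) ** 2)
    return float(1.0 - squared_error / total_variation)

Y = np.array([[1.0], [2.0], [3.0], [4.0]])

perfect = r_squared(Y, Y.copy())                      # 1.0
mean_only = r_squared(Y, np.full_like(Y, Y.mean()))   # 0.0
reversed_fit = r_squared(Y, Y[::-1].copy())           # -3.0

print(perfect, mean_only, reversed_fit)
```

Note that the formula is undefined when all labels are identical, since the total variation is then zero; this sketch does not guard against that case.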