

# Ridge Regression

*Linear regression using least squares loss with an $L_2$ regularization penalty*

---
* [Implementation in Python](../pymlalgo/regression/ridge_regression.py)
* [Demo](../demo/ridge_regression_demo.ipynb)

---

### Symbols and Conventions
Refer to [Symbols and Conventions](symbols_and_conventions.ipynb) for details. In summary:
* $n$ is the number of training examples
* $d$ is the number of features in each training example (a.k.a the dimension of the training example)
* $X$ is the features matrix of shape $n \times d$
* $Y$ is the labels matrix of shape $n \times 1$
* $W$ is the weight matrix of shape $d \times 1$

### Loss and Cost Function
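The loss for a single training example, where $x_i$ is the $i$-th row of $X$ and $\hat{y}_i = x_i W$ is the prediction, is given as:
$$\mathcal{L}(\hat{y}_i, y_i) = (y_i - x_i W)^2$$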

The cost function is given as:
$$F(W) = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i W)^2 + \lambda ||W||_2^2$$  

where $\lambda$ is the regularization coefficient (a hyperparameter).

The goal is to find the value of $W$ that minimizes $F(W)$.
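
The cost function above can be evaluated directly with NumPy. The following is a minimal sketch, not the linked implementation: the function name `ridge_cost`, the parameter name `lam`, and the sample values are assumptions chosen for illustration.

```python
import numpy as np


def ridge_cost(X, Y, W, lam):
    """Evaluate F(W) = (1/n) * sum((y_i - x_i W)^2) + lam * ||W||_2^2.

    X has shape (n, d), Y has shape (n, 1) and W has shape (d, 1).
    """
    n = X.shape[0]
    residuals = Y - X @ W                 # shape (n, 1)
    mean_squared_error = np.sum(residuals ** 2) / n
    penalty = lam * np.sum(W ** 2)        # lam * ||W||_2^2
    return mean_squared_error + penalty


# Tiny hand-checkable example (illustrative values only): n = 2, d = 2
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0], [2.0]])
W = np.array([[1.0], [1.0]])

# Residuals are [0, 1], so the mean squared error is 0.5;
# ||W||_2^2 = 2, so the penalty is 0.5 * 2 = 1.0
cost = ridge_cost(X, Y, W, lam=0.5)
print(cost)
```

With these values the two parts of the cost are easy to verify by hand, which makes the effect of $\lambda$ on the total visible.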

### Gradient Derivation

For $d = 1$ and $n = 1$:  
$$F(W) = \frac{1}{1}\sum_{i=1}^{1}(y_i - x_iW)^2 + \lambda ||W||_2^2$$
$$F(W) = (y - xW)^2 + \lambda W^2$$ 
Taking the derivative with respect to $W$:
$$F^{'}(W) = 2(y - xW)(-x) + 2\lambda W$$
$$F^{'}(W) = -2x(y - xW) + 2\lambda W$$
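
As a sanity check, the scalar derivative can be compared against a central finite difference. This is a sketch with assumed sample values (`x`, `y`, `lam`, `w`) and hypothetical function names; it only illustrates that the two agree.

```python
def cost(w, x, y, lam):
    """Scalar ridge cost for d = 1 and n = 1: (y - x*w)^2 + lam * w^2."""
    return (y - x * w) ** 2 + lam * w ** 2


def cost_derivative(w, x, y, lam):
    """Analytic derivative: -2x(y - x*w) + 2 * lam * w."""
    return -2 * x * (y - x * w) + 2 * lam * w


# Illustrative values only
x, y, lam, w = 2.0, 3.0, 0.1, 0.5

# Central finite difference approximation of the derivative at w
eps = 1e-6
numeric = (cost(w + eps, x, y, lam) - cost(w - eps, x, y, lam)) / (2 * eps)
analytic = cost_derivative(w, x, y, lam)

print(analytic, numeric)
```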

For $d > 1$ and $n > 1$:  
$$F(W) = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_iW)^2 + \lambda ||W||_2^2$$

The entrywise $L_2$ norm (also called the Frobenius norm) of a matrix $A$ with dimensions $p \times q$ can be written as:    
$$||A||_2 = \left(\sum_{i=1}^p \sum_{j=1}^q |A_{ij}|^2\right)^{\frac{1}{2}}$$

For any column vector $v$ of shape $d \times 1$, the product $v^Tv$ equals the squared $L_2$ norm of $v$. For example, with $d = 4$:
$$ v =  \left[ \begin{matrix} 
1 \\ 2 \\ 3 \\ 4 
\end{matrix} \right]$$
$$v^Tv = \left[ \begin{matrix} 
1 & 2 & 3 & 4 
\end{matrix} \right] \left[ \begin{matrix} 
1 \\ 2 \\ 3 \\ 4 
\end{matrix} \right] = \sum_{i=1}^d v_i^2 = ||v||_2^2$$


Since $Y - XW$ has shape $n \times 1$, the sum of squared residuals can be written as a squared $L_2$ norm:
$$\sum_{i=1}^{n}(y_i - x_iW)^2 = ||Y - XW||_2^2$$

$$F(W) = \frac{1}{n}||Y - XW||_2^2 + \lambda ||W||_2^2$$
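
The equivalence between the explicit sum, the inner product and the squared norm can be checked numerically. The sketch below uses randomly generated data (an assumption for illustration); `np.linalg.norm` with default arguments returns the Frobenius norm for a 2-D array.

```python
import numpy as np

# The example vector from the text: v^T v = 1 + 4 + 9 + 16
v = np.array([[1.0], [2.0], [3.0], [4.0]])
v_inner = (v.T @ v).item()

# Random data with n = 5 examples and d = 3 features (illustrative only)
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, 1))
W = rng.normal(size=(d, 1))

residuals = Y - X @ W                                  # shape (n, 1)

# Three ways to compute the same quantity
sum_of_squares = sum((Y[i, 0] - X[i] @ W[:, 0]) ** 2 for i in range(n))
inner_product = (residuals.T @ residuals).item()       # (Y - XW)^T (Y - XW)
norm_squared = np.linalg.norm(residuals) ** 2          # ||Y - XW||_2^2

print(v_inner, sum_of_squares, inner_product, norm_squared)
```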

Let $g(W) = ||Y - XW||_2^2$ and $h(W) = \lambda ||W||_2^2$, so that $F(W) = \frac{1}{n}g(W) + h(W)$. The two terms can be differentiated separately.

$$\frac{\partial}{\partial W} h(W) = \frac{\partial}{\partial W} \lambda ||W||_2^2 = 2\lambda W$$  

  
$$\frac{\partial}{\partial W} g(W) = \frac{\partial}{\partial W} ||Y - XW||_2^2 = \frac{\partial}{\partial W} (Y - XW)^T(Y - XW)$$

Expanding $g(W)$ and differentiating each term individually:
$$g(W) = Y^TY - Y^TXW - (XW)^TY + (XW)^T(XW)$$
$$=Y^TY - (X^TY)^TW - W^T(X^TY) + W^TX^TXW$$
Now,
$$\frac{\partial}{\partial{W}}Y^TY =0$$  

Using the identity $\frac{\partial}{\partial{X}} A^TX = \frac{\partial}{\partial{X}} X^TA = A$ (here $A$ and $X$ denote generic column vectors of the same shape, not the features matrix):

$$\frac{\partial}{\partial{W}}(-(X^TY)^TW) = -X^TY$$
$$\frac{\partial}{\partial{W}}(-W^T(X^TY)) = -X^TY$$

Using the identity $\frac{\partial}{\partial{X}} X^TAX = (A + A^T)X$ (for a generic square matrix $A$ and column vector $X$):
$$\frac{\partial}{\partial{W}}W^TX^TXW = (X^TX + (X^TX)^T)W$$
$$=(X^TX + X^TX)W$$
$$=2 X^TXW$$

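Collecting all four terms and dividing by $n$ (to average the losses over the training examples) gives the gradient of the first term of $F(W)$:
$$\frac{\partial}{\partial W}\left(\frac{1}{n}||Y - XW||_2^2\right) = \frac{1}{n}(0 -X^TY - X^TY + 2X^TXW)$$
$$=-\frac{2X^T}{n}(Y - XW)$$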

Adding the gradients of the two terms gives the gradient of the cost function:  
$$\nabla F(W) = -\frac{2X^T}{n}(Y - XW)  + 2 \lambda W$$
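
The final gradient translates almost literally into NumPy, and it can be verified against a finite difference approximation of the cost. This is a sketch under stated assumptions: the function names, the random data and the value of `lam` are illustrative and are not taken from the linked implementation.

```python
import numpy as np


def ridge_cost(X, Y, W, lam):
    """F(W) = (1/n) * ||Y - XW||_2^2 + lam * ||W||_2^2."""
    n = X.shape[0]
    return np.sum((Y - X @ W) ** 2) / n + lam * np.sum(W ** 2)


def ridge_gradient(X, Y, W, lam):
    """Gradient of F(W): -(2/n) * X^T (Y - XW) + 2 * lam * W, shape (d, 1)."""
    n = X.shape[0]
    residuals = Y - X @ W
    return (-2.0 / n) * (X.T @ residuals) + 2.0 * lam * W


# Random data (illustrative only): n = 6 examples, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Y = rng.normal(size=(6, 1))
W = rng.normal(size=(3, 1))
lam = 0.1

analytic = ridge_gradient(X, Y, W, lam)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    step = np.zeros_like(W)
    step[j, 0] = eps
    numeric[j, 0] = (
        ridge_cost(X, Y, W + step, lam) - ridge_cost(X, Y, W - step, lam)
    ) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

A small `max_abs_diff` indicates that the derived expression matches the numerical slope of the cost in every coordinate.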


### Accuracy using R Squared
Once the weights have been calculated using gradient descent, the predictions can be made using
$$\hat{Y} = XW$$
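
A minimal gradient descent loop, followed by the prediction step, might look like the sketch below. The fixed learning rate, the iteration count, the zero initialization and the synthetic data are all assumptions made for illustration; the linked implementation may choose the step size and stopping rule differently.

```python
import numpy as np


def ridge_gradient(X, Y, W, lam):
    """Gradient of F(W): -(2/n) * X^T (Y - XW) + 2 * lam * W."""
    n = X.shape[0]
    return (-2.0 / n) * (X.T @ (Y - X @ W)) + 2.0 * lam * W


def fit_ridge(X, Y, lam, learning_rate=0.1, n_iter=2000):
    """Plain gradient descent with a fixed step size (assumed values)."""
    W = np.zeros((X.shape[1], 1))         # start from all-zero weights
    for _ in range(n_iter):
        W = W - learning_rate * ridge_gradient(X, Y, W, lam)
    return W


# Synthetic data (illustrative only): n = 50 examples, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_W = np.array([[1.5], [-2.0], [0.5]])
Y = X @ true_W + 0.1 * rng.normal(size=(50, 1))

lam = 0.01
W = fit_ridge(X, Y, lam)

# Predictions: Y_hat = X W
Y_hat = X @ W

# At a minimum of F(W) the gradient should be close to zero
grad_norm = np.linalg.norm(ridge_gradient(X, Y, W, lam))
print(W.ravel(), grad_norm)
```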

The accuracy of the model is measured by $R^2$ (R squared), also known as the [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination).
It is calculated as:
$$R^2 = 1-\frac{\text{total squared error}}{\text{total variation}} = 1 -\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{\sum_{i=1}^n (y_i - \bar{Y})^2 }$$
where $\bar{Y}$ is the mean of the labels.

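The maximum value of $R^2$ is 1, which corresponds to perfect predictions, and higher values denote higher accuracy. A model that always predicts $\bar{Y}$ scores 0, and a model that does worse than that scores below 0.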
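
The R squared formula can be computed in a few lines. The function name `r_squared` and the sample labels and predictions below are assumptions for illustration only.

```python
import numpy as np


def r_squared(Y, Y_hat):
    """R^2 = 1 - sum((y_hat_i - y_i)^2) / sum((y_i - mean(Y))^2)."""
    total_squared_error = np.sum((Y_hat - Y) ** 2)
    total_variation = np.sum((Y - np.mean(Y)) ** 2)
    return 1.0 - total_squared_error / total_variation


# Illustrative labels and predictions
Y = np.array([[1.0], [2.0], [3.0], [4.0]])
Y_hat = np.array([[1.1], [1.9], [3.2], [3.8]])

# Squared errors sum to 0.1 and the total variation is 5.0
score = r_squared(Y, Y_hat)

# Two reference points: perfect predictions and always predicting the mean
perfect = r_squared(Y, Y)
mean_only = r_squared(Y, np.full_like(Y, Y.mean()))

print(score, perfect, mean_only)
```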