# L2 Regularised Logistic Regression

*Logistic Regression using maximum likelihood with $L_2$ regularization penalty*

---
* [Implementation in Python](../pymlalgo/regression/ridge_regression.py)
* [Demo](../demo/ridge_regression_demo.ipynb)

---

### Features and Labels Shapes
Each training example $x_i$ is composed of $d$ features, so it can be represented as a vector of shape $d \times 1$. The matrix of all training examples, $\bf{X}$, has shape $d \times n$, where $n$ is the number of training examples and $d$ is the number of features. Each training example is therefore a column of $\bf{X}$.

Each label $y_i$ is a scalar, and there are $n$ labels, one for each training example. The vector of labels, $\bf{Y}$, has shape $n \times 1$.

After training, a weight is assigned to each feature. The weight vector $\beta$ has shape $d \times 1$.
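The shape conventions above can be sketched with NumPy. This is only an illustration: the sizes `d = 3` and `n = 5` and the random data are assumptions, not values from the implementation.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
d, n = 3, 5

rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))            # each column is one training example x_i
Y = rng.choice([-1, 1], size=(n, 1))   # one label per training example
beta = np.zeros((d, 1))                # one weight per feature

x_0 = X[:, [0]]                        # first training example, shape (d, 1)
scores = X.T @ beta                    # x_i^T beta for every example, shape (n, 1)

print(X.shape, Y.shape, beta.shape, x_0.shape, scores.shape)
```

Indexing with `X[:, [0]]` rather than `X[:, 0]` keeps the example as a $d \times 1$ column instead of a flat array.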

### Cost Function

The cost function for logistic regression is given as:
$$F(\beta) = \frac{1}{n} \sum_{i=1}^n log(1 + exp(-y_i x_{i}^T \beta)) + \lambda ||\beta||_2^2$$

where $\lambda$ is the regularization coefficient (a hyperparameter) and each label $y_i \in \{-1, 1\}$.

The goal is to find the value of $\beta$ for which $F(\beta)$ is minimum.
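As a minimal sketch, the cost function can be evaluated with NumPy as follows. The function name `cost` and the argument name `lam` are assumptions made for this example and are not taken from the linked implementation. It assumes `X` has shape $d \times n$ and `Y` has shape $n \times 1$ with labels in $\{-1, 1\}$.

```python
import numpy as np

def cost(beta, X, Y, lam):
    """Evaluate F(beta) for X of shape (d, n) and Y of shape (n, 1)."""
    n = X.shape[1]
    margins = Y * (X.T @ beta)          # y_i * x_i^T beta, shape (n, 1)
    # np.logaddexp(0, -m) equals log(1 + exp(-m)) and avoids overflow
    # when -m is large.
    loss = np.sum(np.logaddexp(0.0, -margins)) / n
    return loss + lam * np.sum(beta ** 2)
```

A quick sanity check: at $\beta = 0$ every term in the sum is $log(2)$ and the penalty vanishes, so `cost` should return $log(2)$ for any data set.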

### Gradient Derivation
To find the gradient, we differentiate the cost function with respect to $\beta$:
$$F(\beta) = \frac{1}{n} \sum_{i=1}^n log(1 + exp(-y_i x_{i}^T \beta)) + \lambda ||\beta||_2^2$$

The gradient is calculated as,

$$\frac{\partial}{\partial \beta}F(\beta) = \frac{\partial}{\partial \beta}(\frac{1}{n} \sum_{i=1}^n log(1 + exp(-y_i x_{i}^T \beta)) + \lambda ||\beta||_2^2)$$

By the linearity of differentiation, each term can be differentiated separately. We start with the log term for a single training example:

 

$$\frac{\partial}{\partial \beta} log(1 + exp(-y_i x_{i}^T \beta))$$

As a reminder, $y_i \in \{-1, 1\}$ is a scalar and $x_i$ is a column vector of shape $d \times 1$.

 

Using the chain rule, we write the derivative as

$$\frac{1}{1 + exp(-y_i x_{i}^T \beta)} \cdot exp(-y_i x_{i}^T \beta) \cdot \frac{\partial}{\partial \beta}(-y_i x_{i}^T \beta)$$

For the last factor, we use the following identity, which holds for any two vectors $a$ and $v$ of the same shape (here $a = -y_i x_i$ and $v = \beta$):

$$\frac{\partial}{\partial v} a^Tv = \frac{\partial}{\partial v} v^Ta = a$$

and rewrite the derivative as

$$\frac{1}{1 + exp(-y_i x_{i}^T \beta)} \cdot exp(-y_i x_{i}^T \beta) \cdot (-y_i x_{i})$$

$$=-y_i x_{i}\frac{exp(-y_i x_{i}^T \beta)}{1 + exp(-y_i x_{i}^T \beta)}$$

Multiplying the numerator and the denominator by $exp(y_i x_{i}^T \beta)$ gives

$$=-y_i x_{i}\frac{1}{1 + exp(y_i x_{i}^T \beta)}$$
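One way to gain confidence in this result is to compare it against a central finite-difference approximation. The sketch below does that for a single example. The helper names `example_loss` and `example_grad`, the random data, and the step size `eps` are all assumptions for illustration, and flat 1-D arrays are used instead of $d \times 1$ columns for brevity.

```python
import numpy as np

def example_loss(beta, x, y):
    """log(1 + exp(-y * x^T beta)) for one training example."""
    return np.log1p(np.exp(-y * (x @ beta)))

def example_grad(beta, x, y):
    """Closed-form derivative: -y * x / (1 + exp(y * x^T beta))."""
    return -y * x / (1.0 + np.exp(y * (x @ beta)))

rng = np.random.default_rng(1)
x = rng.normal(size=4)       # hypothetical example with d = 4 features
beta = rng.normal(size=4)
y = -1

eps = 1e-6
# Central difference along each coordinate direction e.
numeric = np.array([
    (example_loss(beta + eps * e, x, y) - example_loss(beta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(4)
])
max_abs_diff = np.max(np.abs(numeric - example_grad(beta, x, y)))
print(max_abs_diff)
```

The printed difference should be close to zero, limited only by the finite-difference and floating-point error.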

The cost function itself comes from maximum likelihood: the quantity to minimize is the negative log-likelihood of the labels. To write the likelihood, logistic regression starts from the odds of the positive class,

$$\frac{P(y_i = 1 | x_i, \beta)}{P(y_i = -1 | x_i, \beta)}$$

and models the log of the odds as a linear function of the features:

$$log(\frac{P(y_i = 1 | x_i, \beta)}{P(y_i = -1 | x_i, \beta)}) =  x_i^T\beta$$

where the LHS is the log-odds and the RHS is the linear predictor.

Since the events $y_i = 1$ and $y_i = -1$ are mutually exclusive and exhaustive, $P(y_i = -1 | x_i, \beta) = 1 - P(y_i = 1 | x_i, \beta)$, so we can also write it in the logit form.

$$log(\frac{P(y_i = 1 | x_i, \beta)}{1 - P(y_i = 1 | x_i, \beta)}) =  x_i^T\beta$$

 

Solving for the probability, we can write it in the form

$$P(y_i = 1 | x_i, \beta) = \frac{1}{1  + exp(-x_i^T\beta)}$$

The function $g(z) = \frac{1}{1 + e^{-z}}$ is called the expit (logistic sigmoid) function. Thus we can write

$$P(y_i = 1 | x_i, \beta) = g(x_i^T\beta)$$

$$P(y_i = -1 | x_i, \beta) = 1 - g(x_i^T\beta) = g(-x_i^T\beta)$$

Here we used the identity $1 - g(z) = g(-z)$.

Since $y_i \in \{-1, 1\}$, the two equations combine into a single expression:

$$P(y_i | x_i, \beta) = g(y_ix_i^T\beta) = \frac{1}{1 + exp(-y_ix_i^T\beta)}$$
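The probability model can be checked numerically with a few lines of plain Python. The helper names `expit` and `prob` and the value `score = 0.7` are assumptions for this sketch. SciPy also provides this function as `scipy.special.expit`.

```python
import math

def expit(z):
    """Logistic sigmoid g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def prob(y, score):
    """P(y | x, beta) for y in {-1, 1}, where score = x^T beta."""
    return expit(y * score)

score = 0.7                      # hypothetical value of x^T beta
p_pos = prob(1, score)           # P(y = 1 | x, beta)
p_neg = prob(-1, score)          # P(y = -1 | x, beta)

print(p_pos + p_neg)             # the two probabilities sum to 1
print(math.log(p_pos / p_neg))   # the log-odds recover the score
```

The two printed values illustrate that the single expression $g(y_i x_i^T\beta)$ defines a valid distribution over the two labels and that its log-odds equal $x_i^T\beta$.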

 

Substituting this result into the derivative of the log term, we have

$$\frac{\partial}{\partial \beta} log(1 + exp(-y_i x_{i}^T \beta)) = -y_i x_{i}\frac{1}{1 + exp(y_i x_{i}^T \beta)}$$

$$=-y_i x_{i}g(-y_ix_i^T\beta)$$

$$=-y_i x_{i}(1 - P(y_i | x_i, \beta))$$

 

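Thus, adding the derivative of the regularization term, $\frac{\partial}{\partial\beta} \lambda ||\beta||_2^2 = 2\lambda\beta$, we get

$$\frac{\partial}{\partial\beta}F(\beta) = \frac{1}{n} \sum_{i=1}^n -y_i x_{i}(1 - p_i) + 2\lambda\beta$$

Here, $p_i = P(y_i | x_i,\beta)$.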

The final step is to convert it to matrix form. We define a diagonal matrix $P$ of shape $n \times n$ as

$$P = I - diag[p_1, p_2, \ldots, p_n] = diag[1-p_1, 1-p_2, \ldots, 1-p_n]$$

Thus, in matrix form, we can write

$$\frac{\partial}{\partial\beta}F(\beta) = -\frac{1}{n} XPY + 2\lambda\beta$$

The order of the factors matters for the shapes to agree: $PY$ is an $n \times 1$ vector whose $i$-th entry is $(1 - p_i) y_i$, so $XPY = \sum_{i=1}^n x_i (1 - p_i) y_i$ is a $d \times 1$ vector, the same shape as $\beta$.
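The matrix form $-\frac{1}{n} XPY + 2\lambda\beta$ can be sketched in NumPy and compared against a finite-difference approximation of the cost. The function names `cost` and `gradient`, the argument name `lam`, and the random test data are assumptions made for this example and do not come from the linked implementation.

```python
import numpy as np

def cost(beta, X, Y, lam):
    """F(beta) for X of shape (d, n) and Y of shape (n, 1), labels in {-1, 1}."""
    n = X.shape[1]
    return np.sum(np.logaddexp(0.0, -Y * (X.T @ beta))) / n + lam * np.sum(beta ** 2)

def gradient(beta, X, Y, lam):
    """Gradient of F in matrix form: -(1/n) X P Y + 2 * lam * beta."""
    n = X.shape[1]
    p = 1.0 / (1.0 + np.exp(-Y * (X.T @ beta)))   # p_i, shape (n, 1)
    P = np.diag(1.0 - p.ravel())                  # diag[1 - p_1, ..., 1 - p_n]
    return -(X @ P @ Y) / n + 2.0 * lam * beta    # shape (d, 1)

# Hypothetical random problem used only to check the formula.
rng = np.random.default_rng(2)
d, n, lam = 3, 8, 0.1
X = rng.normal(size=(d, n))
Y = rng.choice([-1.0, 1.0], size=(n, 1))
beta = rng.normal(size=(d, 1))

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
numeric = np.zeros((d, 1))
for j in range(d):
    step = np.zeros((d, 1))
    step[j] = eps
    numeric[j] = (cost(beta + step, X, Y, lam) - cost(beta - step, X, Y, lam)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - gradient(beta, X, Y, lam)))
print(max_abs_diff)
```

Building the full $n \times n$ matrix `P` mirrors the formula but is wasteful for large $n$. Since $P$ is diagonal, the same product can be computed as `X @ (Y * (1.0 - p))` without forming the matrix.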