## Regularised Regression

### Intro

- Regularised regression is a similar idea to the usual OLS regression, except that we constrain the estimated parameters in some way

- While this notebook will look specifically at regularised *regression*, do note that regularisation as a concept can be, and has been, applied in many other domains (e.g. tree-based boosting, neural networks etc.)

- In this notebook, we look specifically at 2 forms of regularisation: L1 regularisation (also known as LASSO), and L2 regularisation (also known as ridge regression)

### Theory

#### Constrained Optimisation and KKT Conditions

- This section will do 2 things: 
    - We will look at the generalised theory of a constrained OLS regression
    - In doing so, we will also show logically why any optimal solution must meet the 4 Karush-Kuhn-Tucker (KKT) conditions

- In normal OLS, we assume the following functional form, where $\varepsilon$ is an error term

$$\begin{aligned}
    y &= X\beta + \varepsilon
\end{aligned}$$

- We wish to find estimates of $\beta$ that minimise the following loss function

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2
\end{aligned}$$

- In the case of regularisation, we wish to minimise the same loss, **BUT** with some constraint. Let's express this constraint as some function of the model parameters $\beta$ being less than some constant $t$

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 \qquad \text{s.t. } f(\beta) \le t
\end{aligned}$$

- To incorporate this constraint into the optimisation process, we add a term to the loss that increases when the constraint is violated. As we can see below, whenever $f(\beta) > t$, we incur an additional loss, which pushes $f(\beta)$ (and hence $\beta$) back towards the feasible region during the optimisation

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 1: Primal Feasibility} \\ 
    \text{For a solution $\hat{\beta}$ to be valid, it must respect the constraint $f(\beta) \le t$, or $f(\beta) - t \le 0$}
\end{array}$$

- However, note one issue here: when $f(\beta) < t$ (i.e. an interior point), $(f(\beta) - t) < 0$. This means that we can obtain a lower $\cal L$ simply by reducing $f(\beta)$, which may pull $\beta$ away from the point that minimises the squared error

- So ideally
    - $f(\beta) - t$ should be "switched on" when $f(\beta) > t$ to increase the loss of violating the condition
    - But it should be "switched off" when $f(\beta) < t$ to avoid adjusting $\beta$ away from the optimum $\hat{\beta}$ that minimises the MSE term

- To allow for such an interaction, we introduce a term $\lambda$ into $\cal L$ such that

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + \lambda (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 2: Complementary Slackness} \\ 
    \text{For a solution $\hat{\beta}$ to be valid, it must be true that either (i) $\lambda = 0$ when $f(\hat{\beta}) < t$} \\
    \text{Or (ii) $f(\hat{\beta}) = t$, because by primal feasibility, we never have $f(\hat{\beta}) > t$} \\ 
    \text{This means that $\lambda \cdot (f(\hat{\beta}) - t) = 0$}
\end{array}$$

- By the formulation of the loss function below, the constraint term must contribute positively to the loss when the constraint is violated. That is, we must always be **adding** to the loss if $(f(\beta) > t)$

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + \lambda (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 3: Dual Feasibility} \\ 
    \text{When $f(\beta) > t$, the condition is violated} \\
    \text{In such a case, $\cal L$ must increase} \\
    \text{For $\cal L$ to increase, $\lambda (f(\beta) - t) \ge 0$} \\
    \text{Since $f(\beta) > t$, $f(\beta) - t > 0$} \\
    \text{Therefore, $\lambda \ge 0$ must also be true; a negative $\lambda$ would lower the loss when the constraint is violated} \\
\end{array}$$

- Finally, when we find a solution $\hat{\beta}$, how do we know it is optimal?

- Recall that $\hat{\beta}$ must minimise the loss $\cal L$, and to solve for $\hat{\beta}$, we set the gradient of $\cal L$ with respect to $\beta$ (the first order condition) to 0

$$\begin{aligned}
    \nabla_\beta \cal L_{\lambda}(\beta) &= \nabla_\beta \left \| y - X\beta \right \|^2 + \lambda \nabla_\beta f(\beta) \\ 
    &= 0
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 4: Stationarity} \\ 
    \text{At optimal solution $\hat{\beta}$, any change in $\beta$ should not change the loss (i.e. gradient of loss relative to beta must be 0)} 
\end{array}$$
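
- The four conditions can also be checked numerically. The sketch below is only an illustration under assumptions that are not part of the text above: it takes $f(\beta) = \left\| \beta \right\|^2_2$, uses synthetic data, and the helper names (`f`, `penalised_solution`, `kkt_report`) are hypothetical. It looks at one case where the constraint is inactive ($\lambda = 0$, generous $t$) and one where it is active ($\lambda > 0$, with $t$ set equal to $f(\hat{\beta})$)

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)


def f(beta):
    """Constraint function, assumed here to be the squared L2 norm."""
    return beta @ beta


def penalised_solution(X, y, lam):
    """Minimiser of ||y - X b||^2 + lam * ||b||^2 (plain OLS when lam = 0)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def kkt_report(X, y, beta, lam, t, tol=1e-8):
    """Evaluate the four KKT conditions at a candidate pair (beta, lam)."""
    # Gradient of ||y - X b||^2 + lam * (f(b) - t) with respect to b
    grad = -2 * X.T @ (y - X @ beta) + lam * 2 * beta
    return {
        "primal_feasibility": bool(f(beta) - t <= tol),
        "complementary_slackness": bool(abs(lam * (f(beta) - t)) <= tol),
        "dual_feasibility": bool(lam >= 0),
        "stationarity": bool(np.allclose(grad, 0, atol=tol)),
    }


# Case 1: generous budget t, so the OLS solution is feasible and lam = 0
beta_ols = penalised_solution(X, y, lam=0.0)
inactive = kkt_report(X, y, beta_ols, lam=0.0, t=f(beta_ols) + 1.0)

# Case 2: lam > 0, with t chosen so the constraint holds with equality
lam = 5.0
beta_ridge = penalised_solution(X, y, lam)
active = kkt_report(X, y, beta_ridge, lam=lam, t=f(beta_ridge))

print(inactive)
print(active)
```

- In both cases all four entries should come out as `True`. Re-running `kkt_report` on `beta_ols` with the tighter budget `t=f(beta_ridge)` should fail primal feasibility, since the unconstrained solution lies outside that budget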

#### Lagrangian Duality: Equivalence of Regularisation and Constrained Optimisation

- If you look closely at the formulation of the constrained optimisation problem above, and compare it to the regularisation formulation, you will notice some differences in their functional forms

$$\begin{aligned}
    \text{Regularisation:}& \qquad \argmin_{\beta} (y - X\beta)^2 + \lambda f(\beta) \\
    \text{Constrained Optimisation:}& \qquad \argmin_{\beta} (y - X\beta)^2 + \lambda (f(\beta) - t)
\end{aligned}$$

- It may be surprising to learn that, for a given $\lambda$, both forms give the same optimal value $\hat{\beta}$

- Why?
    - Because the only additional term in the constrained optimisation is $-\lambda t$, which is a constant adjustment to the regularisation formulation
    - Since $\beta$ does not appear in the additional term, it disappears in the gradient, and so doesn't affect $\hat{\beta}$

- This equivalence between the penalised and constrained forms is a consequence of **Lagrangian Duality**
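
- A quick numerical check of this argument, as a minimal sketch: on a toy one-dimensional problem with $f(\beta) = \beta^2$, we evaluate both objectives on a grid of candidate $\beta$ values and compare where each attains its minimum. The data and the values of `lam` and `t` are arbitrary assumptions chosen for illustration

```python
import numpy as np

# Toy one-dimensional problem (all values are illustrative assumptions)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])
lam, t = 2.0, 1.5

betas = np.linspace(-5, 5, 10001)  # grid of candidate beta values
# Squared error for every candidate beta (rows: observations, cols: betas)
sq_err = ((y[:, None] - x[:, None] * betas) ** 2).sum(axis=0)

regularised = sq_err + lam * betas**2        # ... + lambda * f(beta)
constrained = sq_err + lam * (betas**2 - t)  # ... + lambda * (f(beta) - t)

beta_reg = betas[regularised.argmin()]
beta_con = betas[constrained.argmin()]
gap = regularised - constrained  # difference between the two objectives

print(beta_reg, beta_con)
print(np.allclose(gap, lam * t))
```

- The two curves differ by the constant `lam * t` at every grid point, so they are vertical shifts of one another and share the same minimiser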

### L2 Regularisation

- We established 2 points in the theory section above 
    1. The loss function for a constrained regression is $\cal L = (y - X \hat{\beta})^2 + \lambda (f(\beta) - t)$
    2. We can drop $t$ from the loss function (by Lagrangian Duality) and minimise $\cal L = (y - X \hat{\beta})^2 + \lambda f(\beta)$ instead, which gives us the same optimal value $\hat{\beta}$

- In L2 regularisation $f(\beta) = \left\| \beta \right\|^2_2$, or the squared L2 norm of $\beta$ 

$$\begin{aligned}
    \left\| \beta \right\|^2_2 &= \sum_i (\beta_i)^2
\end{aligned}$$

- Therefore, our objective function is:

$$\begin{aligned}
    \argmin_\beta (y - X \hat{\beta})^2 + \lambda \left\| \beta \right\|^2_2
\end{aligned}$$

- Happily for us, it turns out that L2 regularisation actually has a closed form solution!

$$\begin{aligned}
    \cal L(\beta, \lambda) &= (y - X \hat{\beta})^2 + \lambda \left\| \beta \right\|^2_2 \\
    &= (y - X\hat{\beta})^T (y - X\hat{\beta}) + \lambda \beta^T \beta \\ 
    &= y^Ty - \hat{\beta}^T X^T y - y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta} + \lambda \beta^T \beta \\ \\
    \cal L'(\beta, \lambda) &= -2 X^T y + 2 X^T X \hat{\beta} + 2 \lambda \hat{\beta} \\
    &= 0 \\ \\
    \therefore 2 X^T y &= 2 X^T X \hat{\beta} + 2 \lambda \hat{\beta} \\
    X^T y &= X^T X \hat{\beta} + \lambda \hat{\beta} \\
    X^T y &= (X^T X + \lambda I) \hat{\beta} \\
    \hat{\beta} &= (X^T X + \lambda I)^{-1} X^T y
\end{aligned}$$

- Let's now see this in action!

#### Implementation

In [44]:
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
import numpy as np

In [45]:
X, y = make_regression(n_features=3, n_informative=3, random_state=123)
LAMBDA = 1

In [49]:
skl_ridge = Ridge(alpha=LAMBDA, fit_intercept=False)
skl_ridge.fit(X, y)

In [50]:
my_ridge_coefs = np.linalg.inv((X.T @ X) + (LAMBDA * np.identity(X.shape[1]))) @ X.T @ y

In [51]:
my_ridge_coefs, skl_ridge.coef_

(array([77.03403159, 22.95061111, 74.1699865 ]),
 array([77.03403159, 22.95061111, 74.1699865 ]))
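
- The closed form can also be evaluated without forming the inverse explicitly, by solving the linear system $(X^T X + \lambda I)\hat{\beta} = X^T y$. The sketch below does this on synthetic data (an assumption, generated with NumPy rather than `make_regression`) and tracks $\left\| \hat{\beta} \right\|^2_2$ as $\lambda$ grows, to show the penalty shrinking the coefficients. The helper name `ridge_closed_form` is hypothetical

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = X @ np.array([70.0, 20.0, 75.0]) + rng.normal(size=100)


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept)."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coefs = np.array([ridge_closed_form(X, y, lam) for lam in lambdas])
sq_norms = (coefs ** 2).sum(axis=1)  # squared L2 norm of each solution

for lam, beta, sq_norm in zip(lambdas, coefs, sq_norms):
    print(f"lambda={lam:>7.1f}  beta={np.round(beta, 3)}  ||beta||^2={sq_norm:.3f}")
```

- With $\lambda = 0$ this reduces to OLS, and the squared norm should fall as $\lambda$ increases, which is the constraint $f(\beta) \le t$ tightening as the penalty gets heavier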

### L1 Regularisation

- L1 regularisation is more complex than the L2 case

- For one, it has no closed form solution in general, so we need to solve it with numerical optimisation methods

- We will implement 4 different numerical optimisation approaches here, just to show it is possible. Please refer to the `Optimisation Algorithms` section for more details
    - Coordinate Descent
    - Least Angle Regression (LARS) - Effective in high dimensional settings
    - Subgradient Methods 
    - Proximal Gradient Descent / ISTA / FISTA

In [None]:
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

In [40]:
X, y = make_regression(n_features=3, n_informative=3, random_state=123)
LAMBDA = 1

In [52]:
skl_lasso = Lasso(alpha=LAMBDA, fit_intercept=False)
skl_lasso.fit(X, y)
skl_lasso.coef_

array([76.74000644, 22.05547475, 74.01791603])
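
- As one possible way to fill in the coordinate descent approach, here is a minimal sketch (not a reference implementation). It assumes the objective documented for scikit-learn's `Lasso`, namely $\frac{1}{2n} \left\| y - X\beta \right\|^2_2 + \alpha \left\| \beta \right\|_1$ with no intercept, so that `alpha` has the same meaning as in the cell above. The function names and the synthetic data are hypothetical. Each pass updates one coefficient at a time by soft-thresholding, holding the others fixed

```python
import numpy as np


def soft_threshold(z, thresh):
    """Shrink scalar z towards zero by thresh; return 0 if |z| <= thresh."""
    return np.sign(z) * max(abs(z) - thresh, 0.0)


def lasso_coordinate_descent(X, y, alpha, max_iter=1000, tol=1e-10):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n  # (1/n) * x_j^T x_j per column
    residual = y - X @ beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # Partial residual: add back coordinate j's current contribution
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual / n
            new_beta_j = soft_threshold(rho, alpha) / col_scale[j]
            residual -= X[:, j] * new_beta_j
            max_change = max(max_change, abs(new_beta_j - beta[j]))
            beta[j] = new_beta_j
        if max_change < tol:  # stop once a full pass barely moves beta
            break
    return beta


# Synthetic data with one irrelevant feature (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)
alpha = 0.5

beta_cd = lasso_coordinate_descent(X, y, alpha)
# Optimality check: x_j^T r / n equals alpha * sign(beta_j) on non-zero
# coefficients and is at most alpha in absolute value on zero coefficients
corr = X.T @ (y - X @ beta_cd) / X.shape[0]
print(beta_cd)
print(corr)
```

- Passing the notebook's own `X`, `y` and `alpha=LAMBDA` to `lasso_coordinate_descent` should give coefficients close to `skl_lasso.coef_`, up to solver tolerance, since both target the same objective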

In [None]:
'''

FILL IN YOUR LASSO NUMERICAL OPTIMISATIONS HERE!!!

'''
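
- A second option from the list above is proximal gradient descent (ISTA). The sketch below is again only an illustration under the same assumptions as scikit-learn's documented `Lasso` objective, $\frac{1}{2n} \left\| y - X\beta \right\|^2_2 + \alpha \left\| \beta \right\|_1$ with no intercept. The function names, the fixed iteration count and the synthetic data are hypothetical choices. Each iteration takes a gradient step on the smooth squared-error term, then applies soft-thresholding to handle the L1 term

```python
import numpy as np


def soft_threshold(z, thresh):
    """Elementwise shrinkage towards zero by thresh."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def lasso_ista(X, y, alpha, n_iter=2000):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 with ISTA."""
    n, p = X.shape
    # Lipschitz constant of the smooth part's gradient: the largest
    # eigenvalue of X^T X / n. A step size of 1 / L guarantees descent.
    lipschitz = np.linalg.eigvalsh(X.T @ X / n).max()
    step = 1.0 / lipschitz
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n  # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * alpha)
    return beta


# Synthetic data with one irrelevant feature (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)
alpha = 0.5

beta_ista = lasso_ista(X, y, alpha)
corr = X.T @ (y - X @ beta_ista) / X.shape[0]  # used for an optimality check
print(beta_ista)
print(corr)
```

- FISTA adds a momentum step on top of this same update. It is left out here to keep the sketch short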