## Regularised Regression

### Intro

- Regularised regression is a similar idea to the usual OLS regression, except that we constrain the estimated parameters in some way

- While this notebook will look specifically at regularised *regression*, do note that regularisation as a concept can be, and has been, applied in many other domains (e.g. tree-based boosting, neural net building etc.)

- In this notebook, we look specifically at 2 forms of regularisation: L1 regularisation (also known as LASSO), and L2 regularisation (also known as ridge regression)

### Theory

#### Why must KKT Conditions be met for optimal $\hat{\beta}$?

- This section will do 2 things: 
    - We will look at the generalised theory of a constrained OLS regression
    - In doing so, we will also show logically why any optimal solution must meet the 4 Karush-Kuhn-Tucker (KKT) conditions

- In normal OLS, we assume the following functional form, where $\varepsilon$ is a zero-mean error term

$$\begin{aligned}
    y &= X\beta + \varepsilon
\end{aligned}$$

- We wish to find the estimate $\hat{\beta}$ of $\beta$ that minimises the following loss function

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2
\end{aligned}$$
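
As a quick numerical illustration of this loss, the sketch below computes the squared-error loss for a candidate $\beta$ and finds its unconstrained minimiser with `np.linalg.lstsq`. The simulated data (`X`, `y`, `true_beta`, the seed) and the helper name `ols_loss` are assumptions made up for this example, and the loss is read as the sum of squared residuals.

```python
import numpy as np

def ols_loss(X, y, beta):
    """Sum of squared residuals ||y - X beta||^2."""
    residual = y - X @ beta
    return float(residual @ residual)

# Simulated data (illustrative assumption, not from the notebook)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

# Unconstrained least-squares minimiser
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the loss at the minimiser; it should be numerically zero
grad_at_ols = -2.0 * X.T @ (y - X @ beta_ols)

print("loss at OLS estimate:", ols_loss(X, y, beta_ols))
print("loss at a shifted beta:", ols_loss(X, y, beta_ols + 0.1))
```

Any shift away from `beta_ols` should give a larger loss, which is the behaviour the constraint will later trade off against.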

- In the case of regularisation, we wish to minimise the same loss, **BUT** with some constraint. Let's express this constraint as some function of the model parameters $\beta$ being no greater than some constant $t$

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 \qquad \text{s.t. } f(\beta) \le t
\end{aligned}$$

- To incorporate this constraint into the optimisation process, we simply introduce an additional term in the loss that increases when the constraint is violated. As we can see below, whenever $f(\beta) > t$, we get an additional positive loss term, which pushes $f(\beta)$ back down during the optimisation

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 1: Primal Feasibility} \\ 
    \text{For a solution $\hat{\beta}$ to be valid, it must respect the constraint $f(\beta) \le t$, or $f(\beta) - t \le 0$}
\end{array}$$

- However, note one issue here: when $f(\beta) < t$ (i.e. an interior point), $(f(\beta) - t) < 0$. This means that we can end up with a lower $\cal L$ simply by reducing $f(\beta)$ further, even though the constraint is already satisfied, which may shift $\beta$ away from its true optimum point

- So ideally
    - $f(\beta) - t$ should be "switched on" when $f(\beta) > t$ to increase the loss of violating the condition
    - But it should be "switched off" when $f(\beta) < t$ to avoid adjusting $\beta$ away from the optimum $\hat{\beta}$ that minimises the MSE term
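
The issue above can be seen numerically on a one-parameter toy problem. In this sketch the squared-error term $(\beta - 1)^2$, the constraint function $f(\beta) = \beta^2$ and the bound $t = 4$ are all assumptions chosen for illustration; the unconstrained optimum $\beta = 1$ is an interior point, yet the naive penalty still drags the minimiser away from it.

```python
import numpy as np

def mse_term(beta):
    """Toy squared-error term with its minimum at beta = 1 (assumed)."""
    return (beta - 1.0) ** 2

def naive_loss(beta, t=4.0):
    """Squared error plus the un-switched penalty f(beta) - t, with f(beta) = beta^2."""
    return mse_term(beta) + (beta ** 2 - t)

# Grid search over candidate beta values
grid = np.linspace(-3.0, 3.0, 6001)
beta_mse = grid[np.argmin(mse_term(grid))]
beta_naive = grid[np.argmin(naive_loss(grid))]

print("minimiser of the squared-error term:", beta_mse)
print("minimiser of the naive penalised loss:", beta_naive)
```

Both candidates satisfy $\beta^2 \le 4$, so the constraint is never binding here, but the naive loss still prefers a smaller $\beta$ than the squared-error term alone.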

- To allow for such an interaction, we introduce a multiplier $\lambda$ into $\cal L$ such that

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + \lambda (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 2: Complementary Slackness} \\ 
    \text{For a solution $\hat{\beta}$ to be valid, it must be true that either (i) $\lambda = 0$ when $f(\hat{\beta}) < t$} \\
    \text{Or (ii) $f(\hat{\beta}) = t$, because by primal feasibility, we never have $f(\hat{\beta}) > t$} \\ 
    \text{This means that $\lambda \cdot (f(\hat{\beta}) - t) = 0$}
\end{array}$$
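
A minimal sketch of complementary slackness on a one-parameter toy problem follows. The squared-error term $(\beta - 1)^2$, the constraint $f(\beta) = \beta^2 \le t$ and the helper name `solve_toy` are assumptions for illustration. On the boundary, $\lambda$ is recovered from the first-order condition $2(\beta - 1) + 2\lambda\beta = 0$, which corresponds to the stationarity condition discussed further down.

```python
import math

def solve_toy(t):
    """Minimise (beta - 1)^2 subject to beta^2 <= t, for t > 0 (toy problem).

    Returns the constrained minimiser and its multiplier lambda.
    """
    beta_unconstrained = 1.0
    if beta_unconstrained ** 2 <= t:
        # Interior case: the constraint is slack, so lambda is switched off
        return beta_unconstrained, 0.0
    # Boundary case: the constraint binds, so beta^2 = t
    beta = math.sqrt(t)
    lam = 1.0 / beta - 1.0  # from 2(beta - 1) + 2*lam*beta = 0
    return beta, lam

for t in (4.0, 0.25):
    beta, lam = solve_toy(t)
    slack = beta ** 2 - t
    print(f"t={t}: beta={beta}, lambda={lam}, f(beta)-t={slack}, product={lam * slack}")
```

In the interior case the slack is negative and $\lambda = 0$; in the binding case $\lambda > 0$ and the slack is zero. Either way the product vanishes.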

- By the formulation of the loss function below, the constraint term must contribute positively to the loss when the constraint is violated. That is, we must always be **adding** to the loss if $(f(\beta) > t)$

$$\begin{aligned}
    \cal L &= (y - X\hat{\beta})^2 + \lambda (f(\beta) - t)
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 3: Dual Feasibility} \\ 
    \text{When $f(\beta) > t$, the condition is violated} \\
    \text{In such a case, $\cal L$ must increase} \\
    \text{For $\cal L$ to increase, $\lambda (f(\beta) - t) \ge 0$} \\
    \text{Since $f(\beta) > t$, $f(\beta) - t > 0$} \\
    \text{Therefore, $\lambda \ge 0$ must also be true, else we'd be lowering the loss when violating the constraint}
\end{array}$$
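
To see why the sign of $\lambda$ matters, the sketch below reuses a one-parameter toy loss $(\beta - 1)^2 + \lambda(\beta^2 - t)$, where the squared-error term and $f(\beta) = \beta^2$ are assumptions for illustration. Setting the derivative to zero gives the minimiser $\beta = 1/(1 + \lambda)$ for $\lambda > -1$, and $t$ drops out because it is a constant.

```python
def penalised_minimiser(lam):
    """Minimiser of (beta - 1)^2 + lam * (beta^2 - t), valid for lam > -1."""
    return 1.0 / (1.0 + lam)

def f(beta):
    """Assumed constraint function f(beta) = beta^2."""
    return beta ** 2

beta_pos = penalised_minimiser(0.5)    # lambda > 0: shrinks beta towards 0
beta_zero = penalised_minimiser(0.0)   # lambda = 0: plain minimiser, beta = 1
beta_neg = penalised_minimiser(-0.5)   # lambda < 0: pushes beta outwards

for name, b in [("lambda=0.5", beta_pos), ("lambda=0", beta_zero), ("lambda=-0.5", beta_neg)]:
    print(f"{name}: beta={b:.4f}, f(beta)={f(b):.4f}")
```

A negative $\lambda$ makes $f(\beta)$ larger than it would be with no penalty at all, so the loss would reward moving towards a violation of the constraint, which is the opposite of what the penalty is for.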

- Finally, when we find a solution $\hat{\beta}$, how do we know it is optimal?

- Recall that $\hat{\beta}$ must minimise the loss $\cal L$, and to solve for $\hat{\beta}$, we set the gradient of $\cal L$ with respect to $\beta$ to 0 (the first-order condition)

$$\begin{aligned}
    \nabla_\beta \cal L_{\lambda}(\beta) &= \nabla_\beta \left \| y - X\beta \right \|^2 + \lambda \nabla_\beta f(\beta) \\ 
    &= 0
\end{aligned}$$

$$\begin{array}{c}
    \textbf{KKT 4: Stationarity} \\ 
    \text{At the optimal solution $\hat{\beta}$, an infinitesimal change in $\beta$ should not change the loss to first order (i.e. the gradient of the loss with respect to $\beta$ must be 0)}
\end{array}$$
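
The four conditions can be checked numerically. The sketch below is one possible illustration, not a definitive implementation: it assumes the constraint function is $f(\beta) = \|\beta\|^2$, uses simulated data, and finds $\lambda$ by bisection. The helper names `ridge_beta`, `solve_constrained` and `check_kkt` are hypothetical.

```python
import numpy as np

def ridge_beta(X, y, lam):
    """Solve the stationarity equation (X'X + lam*I) beta = X'y for a given lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def solve_constrained(X, y, t, n_iter=200):
    """Minimise ||y - X beta||^2 subject to ||beta||^2 <= t, for t > 0."""
    beta = ridge_beta(X, y, 0.0)
    if beta @ beta <= t:              # OLS solution is already feasible
        return beta, 0.0
    lo, hi = 0.0, 1.0
    while True:                       # grow hi until beta(hi) is feasible
        b = ridge_beta(X, y, hi)
        if b @ b <= t:
            break
        hi *= 2.0
    for _ in range(n_iter):           # bisection on lam; hi always stays feasible
        mid = 0.5 * (lo + hi)
        b = ridge_beta(X, y, mid)
        if b @ b > t:
            lo = mid
        else:
            hi = mid
    return ridge_beta(X, y, hi), hi

def check_kkt(X, y, beta, lam, t, atol=1e-6):
    """Evaluate the four KKT conditions up to a numerical tolerance."""
    slack = beta @ beta - t
    grad = -2.0 * X.T @ (y - X @ beta) + 2.0 * lam * beta
    return {
        "primal_feasibility": bool(slack <= atol),
        "complementary_slackness": bool(abs(lam * slack) <= atol),
        "dual_feasibility": bool(lam >= 0.0),
        "stationarity": bool(np.allclose(grad, 0.0, atol=atol)),
    }

# Simulated data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

beta_bind, lam_bind = solve_constrained(X, y, t=1.0)      # constraint binds
beta_slack, lam_slack = solve_constrained(X, y, t=100.0)  # constraint is slack
print(lam_bind, check_kkt(X, y, beta_bind, lam_bind, 1.0))
print(lam_slack, check_kkt(X, y, beta_slack, lam_slack, 100.0))
```

With the tight bound the multiplier comes out positive and the solution sits on the boundary; with the loose bound the OLS solution is returned unchanged with $\lambda = 0$.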

#### Derivation of L2 Regularisation

- As we established in the section above, it must be true that 