# Regularization

In any kind of model training, overfitting is one of the major challenges. Overfitting is when the model follows the training data extremely well but performs poorly on unseen data. This occurs when the model training overcompensates for the training error while estimating the trainable parameters, often leading to estimates that are overly complex.

An example of this is fitting a high-degree polynomial to linearly distributed data. Even though a high-degree polynomial can fit the given data perfectly, it will tend to have a high error on the next data point. One way to think about overfitting is that we have a model that is too complex for the problem, and this complexity allows the model to fit random noise.
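
The polynomial example can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a benchmark: the line $y = 2x + 1$, the noise level, and the helper name `poly_mse` are all assumptions made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy straight line, y = 2x + 1 + noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = 2.0 * x_test + 1.0 + rng.normal(scale=0.3, size=x_test.size)

def poly_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

linear_train, linear_test = poly_mse(1)
# 10 points and degree 9: the polynomial can pass through every training point.
high_train, high_test = poly_mse(9)

print(f"degree 1: train MSE = {linear_train:.4f}, test MSE = {linear_test:.4f}")
print(f"degree 9: train MSE = {high_train:.2e}, test MSE = {high_test:.4f}")
```

With 10 training points, the degree-9 polynomial has enough freedom to interpolate the data, so its training error is essentially zero. Its error on fresh points is typically much larger than that of the straight-line fit, because it has also fitted the noise.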

## Overfitting in Linear Regression

In linear regression, overfitting can occur for several reasons, including:

* **High dimensionality:**
If the number of predictors, *p*, is large relative to the number of observations, *n*, the model can always find coefficients that explain the training data extremely well, even if many predictors are irrelevant. In extreme cases, if $p \ge n$, the model can fit the data perfectly (zero training error), but will generalize poorly.
* **Multicollinearity:**
Highly correlated features can make the regression coefficients unstable, letting the model latch onto random fluctuations in the data.
* **Noise fitting:**
The model tries to minimize squared error, so it may assign large weights to certain predictors just to reduce error from random noise.
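
The high-dimensionality case is easy to reproduce. The sketch below assumes synthetic Gaussian predictors and a response that is unrelated to them, so anything the model "learns" is noise. The sizes `n = 20` and `p = 30` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 30  # more predictors than observations

X = rng.normal(size=(n, p))  # predictors are pure noise
y = rng.normal(size=n)       # response is unrelated to X

# Least-squares fit; with p > n, lstsq returns the minimum-norm solution.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((X @ beta - y) ** 2)

# Fresh data drawn the same way: there is nothing real to predict.
X_new = rng.normal(size=(1000, p))
y_new = rng.normal(size=1000)
test_mse = np.mean((X_new @ beta - y_new) ** 2)

print(f"train MSE = {train_mse:.2e}")
print(f"test MSE  = {test_mse:.3f}")
```

The training error is zero up to rounding even though no predictor carries any information about the response, while the error on new data stays large.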

## How to address Overfitting?

To keep the model from becoming overly complex and to improve its ability to generalize, we apply a technique known as ***regularization***. In simple terms, regularization is introducing an additional constraint during training. This discourages the model from relying too heavily on any one feature or from allowing coefficients to grow excessively large.

Regularization techniques can be broadly categorized into two forms:
* **Shrinking:** These methods work by directly penalizing large coefficient values, forcing the model parameters to remain small and stable. This “shrinkage” reduces variance and guards against overfitting.

* **Bayesian prior:** These methods take a probabilistic view, where regularization naturally arises from placing prior distributions on parameters.

### $\ell_2$ Regularization

One way to constrain the model parameters is to penalize their norm. The simplest choice is the squared $\ell_2$-norm ($\lVert\beta\rVert_2^2 = \beta^T\beta$). With this penalty, the cost function for linear regression in matrix form becomes

$$ \min_{\beta} \quad (y - X\beta)^T(y - X\beta) + \lambda \beta^T\beta $$

where $\lambda \ge 0$ is the regularization parameter, which controls the strength of the regularization.

Let us expand this to get our full objective function

$$ \mathit{J}(\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta + \lambda \beta^T\beta $$

Taking its derivative with respect to $\beta$

$$ \frac{\partial \mathit{J}}{\partial \beta} = - 2X^Ty + 2X^TX\beta + 2\lambda \beta $$

Equating it to zero and solving for $\beta$

\begin{equation}
\begin{aligned}
    - 2X^Ty + 2X^TX\beta + 2\lambda \beta &= 0 \\
    X^TX\beta + \lambda \beta &= X^Ty \\
    (X^TX + \lambda \mathit{I})\beta &= X^Ty \\
    \beta &= (X^TX + \lambda \mathit{I})^{-1}X^Ty
\end{aligned}
\end{equation}

Adding the squared $\ell_2$-norm of $\beta$ as a penalty is also called **Ridge Regression**, and the normal equation for it is

$$ \hat\beta_{Ridge} = (X^TX + \lambda \mathit{I})^{-1}X^Ty $$

The ridge penalty shrinks all coefficients toward zero, which damps the coefficients of predictors that mostly capture noise. The coefficients generally become small but not exactly zero.
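
The ridge normal equation translates almost directly into NumPy. The following is a minimal sketch under a few assumptions: there is no intercept term (or the data are already centered), the helper name `ridge_fit` is hypothetical, and the synthetic data exist only to show the effect of $\lambda$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    p = X.shape[1]
    # Solving the linear system avoids forming the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (assumed for illustration): two of five predictors are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_beta + rng.normal(scale=0.5, size=50)

for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||beta||_2 = {np.linalg.norm(beta):.3f}")
```

Setting `lam = 0.0` recovers ordinary least squares, and the norm of the estimate decreases as `lam` grows. Using `np.linalg.solve` rather than `np.linalg.inv` is a numerical design choice, not a change to the formula.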

### $\ell_1$ Regularization

Similar to the $\ell_2$ penalty, we can also add an $\ell_1$ penalty on the parameters. This approach is called **LASSO (Least Absolute Shrinkage and Selection Operator) Regression**. The factor of $\frac{1}{2}$ on the squared error below is a common convention that keeps the optimality condition free of stray constants; it only rescales $\lambda$.

$$ \min_{\beta} \quad \frac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda \lVert\beta\rVert_1 $$

However, this cannot be solved by simply taking a derivative because the $\ell_1$-norm is not differentiable at zero. Instead, we use a **subgradient**. For each coordinate, the set of subgradients of $|\beta_i|$ is

\begin{align*}
    \partial|\beta_i| =
    \left\{
    \begin{array}{ll}
      \{1\},&  \beta_i\gt0 \\
      \{-1\},&  \beta_i\lt0 \\
      [-1, 1],& \beta_i=0
    \end{array}
    \right.
\end{align*}

So the optimality condition for LASSO becomes

\begin{align*}
    - X^Ty + X^TX\beta + \lambda s &= 0 \\
    X^T(y - X\beta) &= \lambda s
\end{align*}

where $s \in \partial \lVert\beta\rVert_1$ is a subgradient vector of the $\ell_1$-norm:
* $s_i = \text{sign}(\beta_i)$ if $\beta_i \neq 0$
* $s_i \in [-1, 1]$ if $\beta_i = 0$

Because of this, there is in general no closed-form solution. However, a solution can be found using techniques such as:
* Coordinate Descent (most common for tabular data)
* LARS-Lasso (efficient for small-to-medium datasets)
* Proximal gradient methods (for large-scale problems)
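
As an illustration of the first method in the list, here is a minimal sketch of cyclic coordinate descent. It assumes the objective $\frac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$, no intercept, and no all-zero columns in $X$. The function names, the fixed iteration count, and the synthetic data are hypothetical choices for the example, not taken from any library.

```python
import numpy as np

def soft_threshold(z, t):
    """Return sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            # x_j^T (partial residual), where the partial residual
            # adds back the current contribution of coordinate j.
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_beta_j = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            residual += X[:, j] * (beta[j] - new_beta_j)
            beta[j] = new_beta_j
    return beta

# Synthetic data (assumed for illustration): only two predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
true_beta = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

lam = 30.0
beta = lasso_coordinate_descent(X, y, lam)
print(np.round(beta, 3))
```

Each coordinate update has a closed form, the soft-thresholding operator, which maps any value with magnitude at most `lam` to exactly zero. A production implementation would add a convergence check instead of a fixed number of sweeps.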

Due to the "kink" of $|\beta_i|$ at 0, the solution can set the coefficients of noise predictors to exactly zero. LASSO therefore performs automatic feature selection.

### Tikhonov Regularization

Earlier we used the squared $\ell_2$-norm of $\beta$ as the penalty. More generally, we can penalize the squared $\ell_2$-norm of $L\beta$ ($\lVert L\beta\rVert_2^2$), where *L* is a regularization matrix. This is called **Tikhonov Regression**.

$$ \min_{\beta} \quad (y - X\beta)^T(y - X\beta) + \lambda (L\beta)^T(L\beta) $$

Solving this gives the normal equation

$$ \hat\beta_{Tikhonov} = (X^TX + \lambda L^TL)^{-1}X^Ty $$

In this, if we substitute $L = I$, we get Ridge regression. So we can say that Ridge regression is a special case of Tikhonov regression.

**What benefit does this give over Ridge?**

If we want to impose a certain structure on the parameters, Ridge cannot express it. In Tikhonov, we can capture this structure in the form of the regularization matrix $L$.

For example, with time series data, if we know that the coefficients on successive lags vary smoothly, the regularization matrix could be a difference operator that enforces smoothness between neighboring coefficients.
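
The smoothness example can be sketched with a first-difference matrix as $L$. This is an illustration under stated assumptions: the helper names `first_difference_matrix` and `tikhonov_fit` are hypothetical, the smoothly decaying "lag weights" are invented, and $X^TX$ is assumed to be invertible (here $n > p$), since $L^TL$ alone is singular for a difference operator.

```python
import numpy as np

def first_difference_matrix(p):
    """(p-1) x p matrix D with (D @ b)[i] = b[i+1] - b[i]."""
    return np.diff(np.eye(p), axis=0)

def tikhonov_fit(X, y, lam, L):
    """Solve (X^T X + lam * L^T L) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)

# Synthetic data (assumed for illustration): coefficients decay smoothly.
rng = np.random.default_rng(0)
n, p = 80, 20
X = rng.normal(size=(n, p))
true_beta = np.exp(-np.arange(p) / 5.0)
y = X @ true_beta + rng.normal(scale=1.0, size=n)

D = first_difference_matrix(p)
beta_ols = tikhonov_fit(X, y, 0.0, D)      # lam = 0 is ordinary least squares
beta_smooth = tikhonov_fit(X, y, 50.0, D)  # penalize jumps between neighbors

# Roughness is the size of the differences between neighboring coefficients.
rough_ols = np.linalg.norm(D @ beta_ols)
rough_smooth = np.linalg.norm(D @ beta_smooth)
print(f"roughness without penalty: {rough_ols:.3f}")
print(f"roughness with penalty:    {rough_smooth:.3f}")
```

Passing `np.eye(p)` as `L` gives the ridge estimate, while the difference matrix leaves the overall level of the coefficients unpenalized and only discourages jumps between neighbors.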

In summary,
* ***Ridge:*** Shrinks coefficients directly (no structure).
* ***Tikhonov:*** Lets you encode what kind of structure you want — smoothness in time, continuity in space, or stable curvature — by choosing the penalty operator $L$.