## Norms 

*by Eric Bridgeford*

Norms essentially are just a way to associate a single nonnegative scalar with a vector, measuring its size or, equivalently, its distance from the origin. All of you are familiar with the absolute value $|\cdot|$ and the Euclidean distance $||\cdot||$ between two points, both of which are just applications of norms.
 
### p-Norm

Suppose we are given some input $x \in \mathbb{R}^n$, where $\mathbb{R}^n$ is the $n$-dimensional space in which our vector $x$ lives. Then for any $p \geq 1$, our $p$-norm gives a scalar value:

\begin{align*}
    ||x||_p = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}}
\end{align*}

<img src="https://qph.ec.quoracdn.net/main-qimg-21cabd2502cb3194b6f9a9b0aab686be">
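The formula translates almost directly into code. The following is a minimal sketch using NumPy; the helper name `p_norm` and the example vector are chosen here for illustration, and the result is compared against NumPy's own `np.linalg.norm`.

```python
import numpy as np

def p_norm(x, p):
    """p-norm of a vector, following the summation formula above (assumes p >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])   # illustrative vector
l1 = p_norm(x, 1)           # |3| + |-4| = 7
l2 = p_norm(x, 2)           # sqrt(3^2 + 4^2) = 5

# NumPy's built-in norm takes the same p through its `ord` argument
matches_numpy = bool(
    np.isclose(l1, np.linalg.norm(x, ord=1))
    and np.isclose(l2, np.linalg.norm(x, ord=2))
)
```

The same vector gets a different length under each choice of $p$, which is why the unit "circles" in the plot have different shapes.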

### L-1 Norm

The $L-1$ norm is just the sum of the absolute values of the components (weights) of a point in vector space.

\begin{align*}
    ||x||_1 = \sum_{i=1}^n |x_i|
\end{align*}

The absolute value is an example of an application of the $L-1$ norm: it is exactly the $L-1$ norm of a point in one dimension ($n = 1$).

When plotting the $L-1$ norm, you will see the central square in the above plot. What this says is simply that every point $(x, y)$ on the square satisfies $|x| + |y| = 1$.

### L-2 Norm

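The $L-2$ norm is the $p$-norm with $p = 2$: rather than summing absolute values as the $L-1$ norm does, it sums the squared components and then takes the square root.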

\begin{align*}
    ||x||_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{\frac{1}{2}}
\end{align*}

The Euclidean distance is an example of an application of the $L-2$ norm: the distance between two points $a$ and $b$ is $||a - b||_2$.

The $L-2$ norm has the convenience of also being equal to the square root of the following dot product:

\begin{align*}
    ||x||_2 &= \sqrt{\langle x, x \rangle} = \sqrt{x^Tx}
\end{align*}

When plotting the $L-2$ norm, you will see the central circle in the above plot. Similarly, this just says that every point on the circle satisfies $\sqrt{x^2 + y^2} = 1$, where $x, y$ are the first and second dimensions of a data point.
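As a quick numerical check of the $L-2$ identities, here is a small sketch with NumPy. The vectors `x`, `a`, and `b` are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])

# L-2 norm from the summation formula
norm_from_sum = np.sqrt(np.sum(np.abs(x) ** 2))
# L-2 norm as the square root of the dot product x^T x
norm_from_dot = np.sqrt(x @ x)

# Euclidean distance between two points is the L-2 norm of their difference
a = np.array([1.0, 1.0])
b = np.array([4.0, 5.0])
distance = np.linalg.norm(a - b)   # default ord for vectors is the 2-norm
```

Both routes give the same value for `x`, and the distance between `a` and `b` is the length of the vector `a - b`.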


## Regularization

As Prof. Arora introduced, every machine learning algorithm has a bias-variance tradeoff. That is, whenever we fit a model, unless we have PERFECT training data containing every possible relevant feature that could impact our solution and every possible case (which is virtually impossible in a real world setting), we will always fall on some side of having a model that has higher bias and lower variance, or lower bias and higher variance. 

### High Bias

An underfit model shows higher bias. The hypothesis class is much simpler; that is, it might have very few features, it might only include a few different possible solutions, etc. In the extreme case, our model makes nearly the same predictions no matter which dataset we fit it to (basically, it is so stupidly simple that our input choice doesn't matter). On the plus side, however, this means that noise that may be present in our data will not really impact our solution that much. For instance, if we have data with normally distributed noise, we COULD fit every individual point, but a model that allows for higher bias will average the noise out across the dataset and only view the non-noise information as "important".

### High Variance

An overfit model shows high variance. With a flexible enough model, we could always get perfect training accuracy on ANY given dataset. Awesome, right? Wrong. Consider a naive solution: simply provide a model that maps each training input to its corresponding output. Obviously, this is a terrible solution: it will be perfectly non-robust (it will have no clue what to do with any input it hasn't seen before). An overfit, high variance model will shift radically in response to changes in the data we feed it; it is incredibly sensitive to noise and outliers, and has no ability to effectively generalize from the input data.
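The naive solution can be written down in a few lines. This is only a sketch: the class name `MemorizingModel` and the toy data are hypothetical, made up here to illustrate the point.

```python
class MemorizingModel:
    """Hypothetical lookup-table 'model': stores every training pair verbatim."""

    def fit(self, inputs, outputs):
        # Map each training input directly to its corresponding output
        self.table = {tuple(x): y for x, y in zip(inputs, outputs)}
        return self

    def predict(self, x):
        # Returns None for any input that was not seen during training
        return self.table.get(tuple(x))

# Toy training data (made up for illustration)
train_X = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5)]
train_y = [1, 0, 1]

model = MemorizingModel().fit(train_X, train_y)
train_accuracy = sum(
    model.predict(x) == y for x, y in zip(train_X, train_y)
) / len(train_y)

# A point very close to a training input, but not identical to it
unseen_prediction = model.predict((0.1, 1.0))
```

Training accuracy is perfect, yet the model has nothing to say about a point that differs only slightly from one it memorized.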

### Combatting the Bias-Variance Tradeoff

How can we avoid a model with too much bias or too much variance? One way is regularization. Regularizing is the process of penalizing a model for using too many weights with large coefficients, which limits how closely it can chase every detail of the data during fitting. Suppose we are given any model with parameter set $\theta$, input $X$, output $Y$, and loss function $L(\theta | X, Y)$, where $\theta_i \in \theta$ is a particular vector of parameter weights, $X = \left\{x_i\right\}_{i=1}^N \in \mathbb{R}^{d \times N}$ are our $d$-dimensional inputs, and $Y = \left\{y_i\right\}_{i=1}^N \in \mathbb{R}^N$ are our outputs. Our optimization problem then becomes:

\begin{align*}
    \hat{\theta} = \textrm{argmin}_\theta \left[L(\theta | X, Y) + \lambda f(\theta)\right]
\end{align*}

Here we penalize our weights $\theta$ with the function $f(\theta)$, and $\lambda \geq 0$ controls how strongly the penalty counts against the loss.
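The penalized objective can be sketched in code as follows. Everything specific here is an assumption made for illustration: the loss is taken to be the mean squared error of a linear model, the two penalties are the $L-1$ norm and the squared $L-2$ norm from the first section, and the function names and toy data are made up.

```python
import numpy as np

def squared_error_loss(theta, X, Y):
    """Assumed choice of L(theta | X, Y): mean squared error of a linear model.
    X has shape (d, N), so the N predictions are theta^T X."""
    residuals = theta @ X - Y
    return np.mean(residuals ** 2)

def regularized_objective(theta, X, Y, lam, penalty):
    """L(theta | X, Y) + lambda * f(theta) for a given penalty function f."""
    return squared_error_loss(theta, X, Y) + lam * penalty(theta)

def l1_penalty(theta):
    return np.sum(np.abs(theta))   # ||theta||_1

def squared_l2_penalty(theta):
    return np.sum(theta ** 2)      # ||theta||_2^2

# Toy data: d = 2 features, N = 3 samples
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
Y = np.array([1.0, 2.0, 3.0])
theta = np.array([2.0, -1.0])

plain_loss = squared_error_loss(theta, X, Y)
with_l1 = regularized_objective(theta, X, Y, lam=0.1, penalty=l1_penalty)
with_l2 = regularized_objective(theta, X, Y, lam=0.1, penalty=squared_l2_penalty)
```

Setting `lam=0` recovers the unpenalized loss, and increasing it makes large weights more expensive relative to fitting error.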

### L-1 Regularization

This approach is known as $L-1$, or lasso, regularization. It is called the $L-1$ regularizer because it regularizes by using the $L-1$ norm described above. We can train an $L-1$ regularized algorithm by considering the following:

\begin{align*}
    \hat{\theta} = \textrm{argmin}_\theta \left[L(\theta | X, Y) + \lambda ||\theta||_1\right]
\end{align*}

While we might intuitively expect a "better" norm to perform better (where "better" = more complex), the lasso regularizer is often one of the best choices for navigating the bias-variance tradeoff. Consider the space of parameter choices, with each level set in the figure below representing parameter choices that give identical error. Our optimal unregularized solution (that is, the hypothesis that best fits the data) sits at the center of the level sets, where the loss is minimal, and each level set radiating outwards from it has more and more loss, as long as our loss function is convex.



### L-2 Regularization

This approach is known as $L-2$, or ridge, regularization. It is called the $L-2$ regularizer because it regularizes by using the $L-2$ norm described above. We can train an $L-2$ regularized algorithm by considering the following:

\begin{align*}
    \hat{\theta} = \textrm{argmin}_\theta \left[L(\theta | X, Y) + \frac{\lambda}{2} \left|\left|\theta\right|\right|_2^2\right]
\end{align*}

Here we include the factor of $\frac{1}{2}$ on $\lambda$ for convenience when taking a derivative (as this term is squared, the $\frac{1}{2}$ will cancel out upon differentiation). This penalty has the feature that it punishes larger values much more than smaller values, as it squares them. On the downside, it rarely drives weights exactly to zero, so the model tends to keep a large number of parameters, which tends to favor variance.
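To see the cancellation explicitly, here is the derivative of the penalty term with respect to a single weight. This short worked step assumes $\theta$ is treated as one vector, with $\theta_j$ denoting its $j$-th entry:

\begin{align*}
    \frac{\partial}{\partial \theta_j}\left[\frac{\lambda}{2} ||\theta||_2^2\right] = \frac{\lambda}{2} \frac{\partial}{\partial \theta_j} \sum_{i} \theta_i^2 = \frac{\lambda}{2} \cdot 2\theta_j = \lambda \theta_j
\end{align*}

So the penalty contributes $\lambda \theta$ to the gradient, pulling each weight toward zero in proportion to its size.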

### Visual Representation

The shaded region represents the "valid" hypotheses under our regularization function. A point where the shaded region touches a level set of the loss represents a particular choice of parameters that the regularizer allows, together with its loss. If we think about this behavior, it seems intuitive that the "lasso" is most likely to pick solutions that lie on a "point" of our square, as these clearly stick out the most. As these points have one feature with maximal weight and the other features with zero weight, this means that our $L-1$ regularizer has self-selected which parameters "matter most". Conversely, the ridge regularization method is very likely to keep all of the parameters with small but nonzero weights, which may overfit our given dataset.

<img src="./img/lasso.png">
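The sparsity contrast can also be checked numerically in a simplified setting. The sketch below assumes the loss is $\frac{1}{2}||\theta - z||_2^2$ for some unregularized solution $z$; this setup is an assumption made here for illustration, not something taken from the figure. Under that assumption, the $L-1$ penalized minimizer is the soft-thresholding of $z$, while the $L-2$ penalized minimizer (with the $\frac{\lambda}{2}$ convention) simply rescales $z$. The function names and the values in `z` are made up.

```python
import numpy as np

def lasso_solution(z, lam):
    """Minimizer of 0.5 * ||theta - z||^2 + lam * ||theta||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_solution(z, lam):
    """Minimizer of 0.5 * ||theta - z||^2 + (lam / 2) * ||theta||_2^2 (uniform shrinkage)."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.5, 0.2, -2.0])   # illustrative unregularized weights
lam = 1.0

theta_lasso = lasso_solution(z, lam)
theta_ridge = ridge_solution(z, lam)

# Count how many weights each penalty sets exactly to zero
n_zero_lasso = int(np.sum(theta_lasso == 0))
n_zero_ridge = int(np.sum(theta_ridge == 0))
```

In this example the lasso zeroes out the two small weights and keeps the two large ones, while ridge shrinks every weight but leaves all of them nonzero.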