# The Shape of Regularization

You know you need to regularize your models to avoid overfitting, but what effect does your choice of regularization method have on a model's parameters? In this tutorial we'll answer that question with a simple experiment comparing parameters in a simple multilayer perceptron (MLP) trained to classify the digits dataset from `scikit-learn` while subject to parameter-value based regularization. The regularization methods we'll specifically focus on are the norm-based methods, which you may recognize by their $L_n$ nomenclature. Depending on $n$, that is, the _order_ of the norm used to regularize a model, you may see very different characteristics in the resulting parameter values. Questions about the effects of regularization on neural network models are common in interviews for data science or machine learning roles (I've personally asked them of ~several candidates), so if you need a financial incentive to continue reading, there's that. Of course your life is likely to be much more enjoyable if you are instead driven by the pleasure of figuring things out and building useful tools, but all of that is up to you. Without further ado, let's get into it.

Let's start by visualizing and defining the norms $L_0, L_1, L_2, L_3, \dots, L_\infty$.

In general, we can describe the $L_n$ norm as 

$$
L_n = \left(\sum |x|^n\right)^{\frac{1}{n}} \qquad (1)
$$



Although we can't raise the sum in **Eq.** 1 to the $\frac{1}{n}$ power when $n=0$, we can take the limit of each term $|x|^n$ as $n \rightarrow 0$ to find that the contribution to the $L_0$ norm is 1.0 for all non-zero scalars, _i.e._ when taking the $L_0$ norm of a vector we'll get the number of non-zero elements in the vector. The animation below visualizes this process.

<img src="./assets/l0_limit.gif">

As the animation shows for very small values of $n$, the contribution to the $L_0$ norm is 1 for any non-zero value and 0 otherwise.

<img src="./assets/l0.png">

If we inspect the figure above we see that there's no slope to the $L_0$ norm: it's totally flat at a value of 1.0 everywhere except 0.0, where there is a discontinuity. That's not very useful for machine learning with gradient descent, but you can use the $L_0$ norm as a regularization term in algorithms that don't use gradients, such as evolutionary computation, or to compare tensors of parameters to one another.
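To make the limit concrete, here is a minimal NumPy sketch that evaluates $|x|^n$ for shrinking $n$ and compares the summed result with a direct count of non-zero entries. The helper name `l0_count` and the sample values are made up for illustration.

```python
import numpy as np

def l0_count(x):
    # hypothetical helper: the L0 "norm" as a count of non-zero entries
    return int(np.count_nonzero(x))

# made-up example vector with three non-zero entries
x = np.array([0.0, -0.5, 2.0, 0.0, 1e-3])

for n in [1.0, 1e-1, 1e-2, 1e-4]:
    # per-element contribution |x|**n; exact zeros stay at 0.0 for any n > 0
    contributions = np.abs(x) ** n
    print("n = {:.0e}: sum = {:.4f}".format(n, np.sum(contributions)))

print("non-zero count:", l0_count(x))
```

As $n$ shrinks, the printed sums should drift toward the non-zero count, even for the tiny entry `1e-3`.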

The $L_1$ norm will probably look more familiar and useful. 

<img src="./assets/norm1.png">

Visual inspection of the $L_1$ plot reveals a line with a slope of -1.0 before crossing the y-axis and 1.0 afterward. This is actually a very useful observation! The gradient of the $L_1$ norm with respect to any parameter with a non-zero value will be either 1.0 or -1.0, regardless of the magnitude of said parameter. This means that $L_1$ regularization won't be satisfied until parameter values are 0.0, so any non-zero parameter values had better be contributing meaningfully to minimizing the overall loss function. In addition to reducing overfitting, $L_1$ regularization encourages sparsity, which can be a desirable characteristic for neural network models for, _e.g._, model compression purposes.
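The constant-magnitude gradient is easy to check numerically. The sketch below is an illustration only: `l1_penalty_grad`, the weights, and the step size are all assumptions, and `np.sign` returning 0 at exactly 0.0 is just one valid choice of subgradient. It takes a single gradient-descent step on the $L_1$ penalty alone.

```python
import numpy as np

def l1_penalty_grad(w):
    # (sub)gradient of sum(|w|): -1, 0, or +1 per element
    return np.sign(w)

# made-up weights spanning large and small magnitudes
w = np.array([-3.0, -0.01, 0.0, 0.02, 5.0])
print(l1_penalty_grad(w))

# one gradient-descent step on the L1 penalty alone: every non-zero weight
# moves toward zero by the same fixed amount, however large or small it is
step_size = 0.01
w_new = w - step_size * l1_penalty_grad(w)
print(w_new)
```

The small weight at -0.01 lands on zero after a single step, while the large weights barely change in relative terms. That is the sparsity pressure described above.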

Regularizing with higher order norms is markedly different, and this is readily apparent in the plots for $L_2$ and $L_3$ norms. 

<img src="./assets/norm2.png">

Compared to the sharp "V" of $L_1$, $L_2$ and $L_3$ demonstrate an increasingly flattened curve around 0.0. As you may intuit from the shape of the curve, this corresponds to low gradient values around x=0. The gradient of $|x|^n$ with respect to a parameter is $n|x|^{n-1}$ times the sign of $x$: a straight line with a slope of 2 for $L_2$ and a quadratic curve for $L_3$, so in both cases the gradient shrinks as the parameter approaches 0.0. Instead of encouraging parameters to take a value of exactly 0.0, norms of higher order will encourage small parameter values. The higher the order, the more emphasis the regularization function puts on penalizing large parameter values. In practice norms with order higher than $L_2$ are very rarely used.

<img src="./assets/norm3.png">
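To see how the gradient changes with the order of the norm, the following sketch compares the analytic gradient of the per-parameter penalty $|x|^n$, namely $n|x|^{n-1}\,\mathrm{sign}(x)$, against a central finite difference. The function names, sample points, and `eps` are illustrative assumptions.

```python
import numpy as np

def penalty(x, n):
    # per-parameter penalty |x|**n (no summing, no 1/n root)
    return np.abs(x) ** n

def penalty_grad(x, n):
    # analytic gradient of |x|**n for n >= 1, valid away from x = 0
    return n * np.abs(x) ** (n - 1) * np.sign(x)

# made-up sample points on both sides of zero
x = np.array([-1.0, -0.1, 0.1, 1.0])
eps = 1e-6

for n in [1, 2, 3]:
    # central finite difference as an independent check
    numeric = (penalty(x + eps, n) - penalty(x - eps, n)) / (2 * eps)
    print("n = {}: analytic {} numeric {}".format(
        n, penalty_grad(x, n), np.round(numeric, 4)))
```

For $n=1$ the gradient magnitude is the same at every sample point, while for $n=2$ and $n=3$ it shrinks as $x$ approaches zero, and faster at higher order.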

An interesting thing happens as we approach the $L_\infty$ norm, typically pronounced as "L sup" which is short for "supremum norm." The $L_\infty$ function returns the maximum absolute parameter value.

$$
L_\infty = \max(|x|)
$$

<img src="./assets/norm_sup.png">

We can visualize how higher order norms begin to converge toward $L_\infty$:

<img src="./assets/ln_norms.gif">
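The same convergence can be checked with a few numbers. This sketch applies the full vector norm from **Eq.** 1, including the $\frac{1}{n}$ root, to a made-up vector for increasing $n$. The helper name `ln_norm` and the values are assumptions for illustration.

```python
import numpy as np

def ln_norm(x, n):
    # full vector norm from Eq. 1: sum over elements, then the 1/n root
    return np.sum(np.abs(x) ** n) ** (1.0 / n)

# made-up vector whose largest absolute value is 0.9
x = np.array([0.3, -0.9, 0.5, 0.1])

for n in [1, 2, 4, 16, 64]:
    print("n = {:3d}: {:.4f}".format(n, ln_norm(x, n)))

print("max |x|:", np.max(np.abs(x)))
```

As $n$ grows, the largest element dominates the sum, so the printed norm should approach the maximum absolute value from above.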

# Example

In the previous section we looked at plots for various $L_n$ norms used in machine learning, using our observations to discuss how these regularization methods might affect the parameters of a model during training. Now it's time to see exactly what happens.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
my_cmap = plt.get_cmap("viridis")

In [None]:
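def norm(x, n, axis=0):
    # expects an array of shape (1, num_points) and returns the nth-order
    # penalty |x|**n for each value (i.e. no summing across values, no 1/n root)
    # note numpy also has a norm function in the `linalg` module
    
    my_norm = (np.abs(x)**n)
    if n:
        # summing over the singleton first axis only drops that axis
        my_norm = np.sum(my_norm, axis=axis)
    else:
        # |x|**0 is 1.0 everywhere, so mark the grid point(s) closest to zero
        # with 0.0 to show the discontinuity at x = 0 in the plot
        my_norm[np.abs(x)==np.min(np.abs(x))] = 0.0
        my_norm = my_norm[0]
        
    return my_norm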

def sup_norm(x):
    
    x_max = np.max(np.abs(x))
    my_norm = np.zeros_like(x)
    my_norm[np.abs(x)==x_max] = x_max
    return my_norm[0]

x = np.linspace(-.20, .20, 128)[np.newaxis,:]

l0 = norm(x, 0.0)
l1 = norm(x, 1)
l2 = norm(x, 2)
l3 = norm(x, 3)

l_sup = sup_norm(x)


my_fontsize=32
with plt.xkcd(scale=0.50, length=128.0):
    
    plt.figure(figsize=(8,4))
    plt.plot(x[0], l0, ".", color=my_cmap(0.0), lw=5)
    plt.title("$L_0$", fontsize = my_fontsize)

    plt.figure(figsize=(8,4))
    plt.plot(x[0], l1, color=my_cmap(0.25), lw=5)
    plt.title("$L_1$", fontsize = my_fontsize)

    plt.figure(figsize=(8,4))
    plt.plot(x[0], l2, color=my_cmap(0.5), lw=5)
    plt.title("$L_2$", fontsize = my_fontsize)
  
    plt.figure(figsize=(8,4))
    plt.viridis()
    plt.plot(x[0], l3, color=my_cmap(0.75), lw=5)
    plt.title("$L_3$", fontsize = my_fontsize)
    
    plt.figure(figsize=(8,6))
    plt.plot(x[0], l_sup, color=my_cmap(1.0), lw=5)
    plt.title(r"$L_\infty$", fontsize = my_fontsize)
    plt.show()

In [None]:
# visualize taking the limit as we approach the L0 norm

orders = np.linspace(2e-2, 1e-17, 32)
x = np.linspace(-1.0,1.0,128)[np.newaxis,:]

cc = 0
for n in orders:
    
    with plt.xkcd(scale=0.6, length=128.0):
        
        fig = plt.figure(figsize=(8,6))
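        # scale the color by the largest order in this sweep
        # (max_norm_order is not defined until a later cell)
        plt.plot(x[0], norm(x, n), color=my_cmap(n/np.max(orders)), lw=5)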
        plt.title("$L_n$ for n = {:.2e}".format(n), fontsize=my_fontsize)
        plt.axis((-1.0, 1.0, 0.9, 1.1))
        plt.savefig("./assets/l0_limit{}.png".format(cc))
        cc += 1
        
        plt.close(fig)

In [None]:
# visualize L0 through L_n norms

max_norm_order = 256

x = np.linspace(-1.0,1.0,128)[np.newaxis,:]

for n in range(max_norm_order):
    
    with plt.xkcd(scale=0.3, length=128.0):
        
        fig = plt.figure(figsize=(8,6))
        line_type = "." if n == 0 else "-"
        plt.plot(x[0], norm(x, n), line_type, color=my_cmap(n/max_norm_order), lw=5)
        plt.title("$L_{{{}}}$".format(n), fontsize=my_fontsize)
        plt.savefig("./assets/norm{}.png".format(n))
        plt.close(fig)

In [None]:
import sklearn
import sklearn.datasets as datasets
import autograd
from autograd import numpy as np
from autograd import grad
import numpy.random as npr

# load and prep data
[x, y] = datasets.load_digits(return_X_y=True)

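my_seed = 42

# shuffle inputs and labels with one shared permutation so they stay aligned
npr.seed(my_seed)
shuffled_indices = npr.permutation(x.shape[0])
x, y = x[shuffled_indices], y[shuffled_indices]

# convert target labels to one-hot encoding (after shuffling, so rows match x)
y_one_hot = np.zeros((y.shape[0],10))

for dd in range(y.shape[0]):
    y_one_hot[dd, y[dd]] = 1.0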

# separate training, test and validation data
test_size = int(0.2 * x.shape[0])

train_x, train_y = x[:-2*test_size], y_one_hot[:-2*test_size]
val_x, val_y = x[-2*test_size:-test_size], y_one_hot[-2*test_size:-test_size]
test_x, test_y = x[-test_size:], y_one_hot[-test_size:]



def softmax(x):
    
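    # normalize over the class axis so each sample gets its own distribution
    x = x - np.max(x, axis=-1, keepdims=True)
    x = np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)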
    
    return x

def cross_entropy(prediction, y):
    
    return -np.sum( y * np.log(prediction) \
                   + (1-y) * np.log(1-prediction))

def forward(x, wx2h, wh2y):
    
    x = np.matmul(x, wx2h)
    # arctan activation (not a relu)
    x = np.arctan(x)
    
    x = np.matmul(x, wh2y)
    x = softmax(x)
    
    return x

def get_loss(x, wx2h, wh2y, y, l_n=[0.,0.,0.,0.], l_sup=0):
    
    prediction = forward(x, wx2h, wh2y)
    
    loss = cross_entropy(prediction, y)
    
    # add regularization
    ## l0 (count of non-zero weights; its gradient is zero almost everywhere)
    loss += l_n[0] * (np.sum(np.sign(np.abs(wx2h))) \
                      + np.sum(np.sign(np.abs(wh2y))))
    ## l1
    loss += l_n[1] * (np.sum(np.abs(wx2h)) \
                      + np.sum(np.abs(wh2y)))
    ## l2
    loss += l_n[2] * (np.sum(np.abs(wx2h)**2) \
                      + np.sum(np.abs(wh2y)**2))
    ## l3
    loss += l_n[3] * (np.sum(np.abs(wx2h)**3) \
                      + np.sum(np.abs(wh2y)**3))
    ## l_sup (largest absolute weight across both layers)
    loss += l_sup * np.maximum(np.max(np.abs(wx2h)), np.max(np.abs(wh2y)))
    
    
    return loss

get_grad = grad(get_loss, argnum=[1,2])

def get_accuracy(x, wx2h, wh2y, y):
    
    prediction = forward(x, wx2h, wh2y)
    
    # fraction of samples whose predicted class matches the one-hot label
    accuracy = np.mean(np.argmax(prediction, axis=1) == np.argmax(y, axis=1))
    
    return accuracy

In [None]:
# define an MLP with 1 hidden layer, no biases
dim_x = x.shape[1]
dim_y = 10
dim_h = 64
learning_rate = 1e-3

wx2h = npr.randn(dim_x, dim_h) / np.sqrt(dim_x * dim_h)
wh2y = npr.randn(dim_h, dim_y) / np.sqrt(dim_h * dim_y)


In [None]:
num_steps = 1000
disp_every = 100
learning_rate = 3e-4

for step in range(num_steps):
    [dwx2h, dwh2y] = get_grad(train_x, wx2h, wh2y, train_y)

    wx2h -= learning_rate * dwx2h
    wh2y -= learning_rate * dwh2y
    
    if step % disp_every == 0.0:
    
        train_loss = get_loss(train_x, wx2h, wh2y, train_y) / train_x.shape[0]
        val_loss = get_loss(val_x, wx2h, wh2y, val_y) / val_x.shape[0]
        train_accuracy = get_accuracy(train_x, wx2h, wh2y, train_y)
        val_accuracy = get_accuracy(val_x, wx2h, wh2y, val_y)
        
        print("training step {}\n".format(step))
        print("  training loss: {:.3e}".format(train_loss))
        print("    training accuracy: {:.3f}\n".format(train_accuracy))
        print("  validation loss: {:.3e}".format(val_loss))
        print("    validation accuracy: {:.3f}".format(val_accuracy))


In [None]:
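# compare the arctan activation used in forward() with tanh
xx = np.linspace(-9, 9., 100)
plt.plot(xx, np.arctan(xx), label="arctan")
plt.plot(xx, np.tanh(xx), label="tanh")
plt.legend()
plt.show()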
