# The Shape of Regularization

## Introduction

You know you need to regularize your models to avoid overfitting, but what effect does your choice of regularization method have on a model's parameters? In this tutorial we'll answer that question with a simple experiment comparing the parameters of a small multilayer perceptron (MLP) trained to classify the digits dataset from `scikit-learn` while subject to parameter-value based regularization. The regularization methods we'll focus on are the norm-based methods, which you may recognize by their $L_n$ nomenclature. Depending on $n$, that is, the _order_ of the norm used to regularize a model, you may see very different characteristics in the resulting parameter values. Questions about the effects of regularization on neural network models are common in interviews (I've personally asked them of several candidates interviewing for data science or machine learning roles), so if you need a financial incentive to continue reading, there's that. Of course your life is likely to be much more enjoyable if you are instead driven by the pleasure of figuring things out and building useful tools, but all of that is up to you. Without further ado, let's get into it.

Let's start by visualizing and defining the norms $L_0, L_1, L_2, L_3, \dots, L_\infty$.

In general, we can describe the $L_n$ norm as 

$$
L_n = \left(\sum |x|^n\right)^{\frac{1}{n}} \qquad (1)
$$
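As a minimal sketch of **Eq.** 1, assuming plain NumPy and a hypothetical helper named `ln_norm` (not part of the notebook code), the norm can be computed directly and compared against `np.linalg.norm`:

```python
import numpy as np

def ln_norm(x, n):
    """Order-n norm of a vector following Eq. 1 (valid for n > 0)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** n) ** (1.0 / n)

# illustrative vector, chosen only so the L1 and L2 values are easy to check by hand
v = np.array([3.0, -4.0, 0.0])

for order in (1, 2, 3):
    # np.linalg.norm with ord=order computes the same vector norm
    print(order, ln_norm(v, order), np.linalg.norm(v, ord=order))
```

For this vector the $L_1$ norm is 7 and the $L_2$ norm is 5, which makes the effect of the order easy to see.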



Although we can't raise the sum in **Eq.** 1 to the $\frac{1}{n}$ power when $n=0$, we can take the limit of $|x|^n$ as $n \rightarrow 0$ to find that every non-zero scalar contributes 1.0 to the $L_0$ norm, _i.e._ when taking the $L_0$ norm of a vector we'll get the number of non-zero elements in the vector. The animation below visualizes this process.

<img src="./assets/l0_limit.gif">

As we begin to see for very small values of $n$, the contribution to the $L_0$ norm for any non-zero value is 1 and 0 otherwise.
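The limit can also be checked numerically. The snippet below is a small sketch with made-up values for `w`; it only assumes NumPy, and shows each non-zero entry's contribution approaching 1.0 as the order shrinks.

```python
import numpy as np

# hypothetical parameter vector containing two exact zeros
w = np.array([0.0, 1e-3, -0.5, 2.0, 0.0])

for n in (1.0, 1e-1, 1e-2, 1e-4):
    # |w|**n tends to 1.0 for every non-zero entry as n shrinks,
    # while the zero entries stay at 0.0 for any n > 0
    print(n, np.round(np.abs(w) ** n, 3))

# the limiting value is simply the number of non-zero entries
l0 = np.count_nonzero(w)
print("L0:", l0)
```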

<img src="./assets/l0.png">

If we inspect the figure above we see that there's no slope to the $L_0$ norm: it's totally flat at a value of 1.0 everywhere except 0.0, where there is a discontinuity. That's not very useful for machine learning with gradient descent, but you can use the $L_0$ norm as a regularization term in algorithms that don't use gradients, such as evolutionary computation, or to compare tensors of parameters to one another.

The $L_1$ norm will probably look more familiar and useful. 

<img src="./assets/norm1.png">

Visual inspection of the $L_1$ plot reveals a line with a slope of -1.0 before crossing the y-axis and 1.0 afterward. This is actually a very useful observation! The gradient of the $L_1$ norm with respect to any parameter with a non-zero value will be either 1.0 or -1.0, regardless of the magnitude of said parameter. This means that $L_1$ regularization won't be satisfied until parameter values are 0.0, so any non-zero parameter values had better be contributing meaningfully to minimizing the overall loss function. In addition to reducing overfitting, $L_1$ regularization encourages sparsity, which can be a desirable characteristic for neural network models, _e.g._ for model compression purposes.
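To make the constant-magnitude gradient concrete, here is a short sketch assuming NumPy only; `l1_penalty` and `l1_grad` are hypothetical helper names, and the values in `w` are arbitrary.

```python
import numpy as np

def l1_penalty(w):
    # sum of absolute parameter values
    return np.sum(np.abs(w))

def l1_grad(w):
    # (sub)gradient of sum(|w|): -1 or +1 for non-zero entries, 0 at exactly 0
    return np.sign(w)

# parameters spanning four orders of magnitude
w = np.array([-2.0, -1e-3, 0.5, 10.0])

# every non-zero entry gets a push of the same size toward zero
print(l1_grad(w))
```

Because the push is the same size for small and large parameters alike, small parameters are driven to zero unless the data loss pushes back.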

Regularizing with higher order norms is markedly different, and this is readily apparent in the plots for $L_2$ and $L_3$ norms. 

<img src="./assets/norm2.png">

Compared to the sharp "V" of $L_1$, $L_2$ and $L_3$ demonstrate an increasingly flattened curve around 0.0. As you may intuit from the shape of the curve, this corresponds to low gradient values around x=0. For the $L_2$ penalty $|x|^2$ the gradient is a straight line with a slope of 2; more generally, the gradient of $|x|^n$ has magnitude $n|x|^{n-1}$, so it vanishes near zero and grows quickly for large values. Instead of encouraging parameters to take a value of 0.0, norms of higher order will encourage small parameter values. The higher the order, the more emphasis the regularization function puts on penalizing large parameter values. In practice, norms with order higher than $L_2$ are very rarely used.
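The gradient of a per-parameter penalty $|x|^n$ is $n|x|^{n-1}\,\mathrm{sign}(x)$. The sketch below (NumPy only, with a hypothetical helper `penalty_grad` and arbitrary sample values) evaluates it for a few orders so the difference between small and large parameters is visible.

```python
import numpy as np

def penalty_grad(w, n):
    # derivative of |w|**n with respect to w, for integer n >= 1
    return n * np.abs(w) ** (n - 1) * np.sign(w)

# sample parameter values from small to large magnitude
w = np.array([-1.0, -0.1, 0.01, 0.1, 1.0])

for n in (1, 2, 3):
    # n = 1 gives +/-1 everywhere; higher orders shrink the gradient near zero
    print(n, penalty_grad(w, n))
```

For order 1 every entry has magnitude 1, while for orders 2 and 3 the entries near zero receive almost no gradient.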

<img src="./assets/norm3.png">

An interesting thing happens as we approach the $L_\infty$ norm, typically pronounced "L sup," which is short for "supremum norm." The $L_\infty$ function returns the maximum absolute parameter value.

$$
L_\infty = \max(|x|)
$$

<img src="./assets/norm_sup.png">

We can visualize how higher order norms begin to converge toward $L_\infty$:

<img src="./assets/ln_norms.gif">
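The same convergence can be seen numerically. This is a small sketch with an arbitrary example vector `v`, assuming NumPy only:

```python
import numpy as np

# arbitrary example vector; its largest magnitude is 1.5
v = np.array([0.3, -1.5, 0.7, 1.2])

for n in (1, 2, 4, 16, 64, 256):
    # the largest-magnitude entry dominates the sum as n grows
    print(n, np.sum(np.abs(v) ** n) ** (1.0 / n))

# the supremum norm; np.linalg.norm(v, ord=np.inf) returns the same value
sup = np.max(np.abs(v))
print("sup:", sup)
```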

## Experiment 1

In the previous section we looked at plots for various $L_n$ norms used in machine learning, using our observations to discuss how these regularization methods might affect the parameters of a model during training. Now it's time to see what actually happens when we use different parameter norms for regularization. To this end we'll train a small MLP with one hidden layer on the `scikit-learn` digits dataset. The first question is, and it's a test of the experimental setup as much as anything, does adding $L_n$ regularization reduce overfitting? 

I wasn't able to get a really dramatic "swoosh" example of overfitting as an un-regularized baseline, even with an exaggeratedly wide hidden layer of 256 units. Nonetheless, the gap between training and validation accuracy with no regularization was 2.9%.

```
# After training for 50,000 epochs with no regularization, dim_h = 256

min/max/mean magnitude and number of zeros:
  0.000e+00/1.015e-01/8.837e-03 and 10932.0
training step 49999

  training loss: 4.720e-01
    training accuracy: 0.960

  validation loss: 4.503e-01
    validation accuracy: 0.931
```

<img src="./assets/progress_lNone.png">

All regularization methods did produce a slightly narrower gap between training and validation performance, but the difference is not so stark as to provide a basis for confident conclusions (though it should guide our thinking when setting up future experiments). A good reason for caution is that $L_0$ regularization narrowed the training-validation gap from 2.9% to 2.2% in the first experiment, and I have no explanation for why, because the penalty has a discontinuity at x=0.0 and is absolutely flat everywhere else. This shouldn't produce any sort of gradient that could regularize model parameters, but there may be numerical approximations used in [`autograd`](https://github.com/HIPS/autograd), which I used as a lightweight automatic differentiation package for these experiments. The same goes for regularization with the $L_\infty$ norm.

It's also worth noting that I did tinker with the experiment for a while, and my expectations likely influenced the numbers. As I write this I haven't touched the test set yet; we'll save that for the very end.

The best performance came with $L_3$ regularization, which achieved a validation accuracy of 94.1% after training (and 96.0% on the training set). The other techniques performed slightly worse, but these differences are not large enough to draw much from. As expected, $L_1$ regularization produced parameters with the most 0.0 values, defined as any weight with an absolute value below a small cutoff (`1e-2` in the code for this experiment), which we could use as something like a pruning threshold. The differences between the training plots are not really visually discernible, so I've left those out except for $L_1$ and $L_3$ regularization.
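Counting near-zero weights takes only a couple of lines. The sketch below assumes NumPy; `count_near_zero` is a hypothetical helper, and the random `demo_weights` are stand-ins with the same shapes as the 64-256-10 MLP, not the trained parameters.

```python
import numpy as np

def count_near_zero(weight_arrays, threshold):
    # flatten every weight array and count entries whose magnitude is below the cutoff
    magnitudes = np.abs(np.concatenate([np.ravel(w) for w in weight_arrays]))
    return int(np.sum(magnitudes < threshold))

rng = np.random.default_rng(0)
# hypothetical stand-in weights, NOT the trained parameters from the experiment
demo_weights = [rng.normal(scale=0.01, size=(64, 256)),
                rng.normal(scale=0.01, size=(256, 10))]

total = 64 * 256 + 256 * 10
print(count_near_zero(demo_weights, threshold=1e-2), "of", total, "weights are prunable")
```

The cutoff is passed explicitly so the same helper can be reused with whatever pruning threshold you settle on.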

```
# After training for 50,000 epochs with L0 regularization, dim_h = 256
min/max/mean magnitude and number of zeros:
  0.000e+00/1.146e-01/8.734e-03 and 11071.0
training step 49999

  training loss: 4.723e-01
    training accuracy: 0.961

  validation loss: 4.502e-01
    validation accuracy: 0.939
```


<!-- <img src="./assets/progress_l0.png"> -->

```
# After training for 50,000 epochs with L1 regularization, dim_h = 256

min/max/mean magnitude and number of zeros:
  0.000e+00/1.252e-01/8.085e-03 and 11782.0
training step 49999

  training loss: 4.729e-01
    training accuracy: 0.958

  validation loss: 4.510e-01
    validation accuracy: 0.935
```

<img src="./assets/progress_l1.png"> 


```
# After training for 50,000 epochs with L2 regularization, dim_h = 256

min/max/mean magnitude and number of zeros:
  0.000e+00/1.093e-01/8.811e-03 and 10980.0
training step 49999

  training loss: 4.714e-01
    training accuracy: 0.958

  validation loss: 4.498e-01
    validation accuracy: 0.935
```


<!-- <img src="./assets/progress_l2.png"> -->


```
# After training for 50,000 epochs with L3 regularization, dim_h = 256

min/max/mean magnitude and number of zeros:
  0.000e+00/1.102e-01/8.705e-03 and 11093.0
training step 49999

  training loss: 4.721e-01
    training accuracy: 0.960

  validation loss: 4.500e-01
    validation accuracy: 0.941
```

<img src="./assets/progress_l3.png">
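To compare the runs at a glance, the training-validation gaps can be tabulated from the summaries printed above. This sketch only copies the reported accuracies into a dictionary and subtracts them:

```python
# (training accuracy, validation accuracy) copied from the run summaries above
results = {
    "none": (0.960, 0.931),
    "L0": (0.961, 0.939),
    "L1": (0.958, 0.935),
    "L2": (0.958, 0.935),
    "L3": (0.960, 0.941),
}

# gap between training and validation accuracy for each run
gaps = {name: round(train - val, 3) for name, (train, val) in results.items()}

for name, gap in gaps.items():
    print("{:>4}: gap = {:.1%}".format(name, gap))
```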

The results of the first experiment were more or less in line with what I expected, but the differences were small enough to warrant a second look. In experiment 2 we'll turn it up to 11 and use an MLP that is way too deep for the size and number of samples in our dataset. This time we'll use an MLP with over a million parameters, about 1000X the number of samples and over 10000X the number of elements in each sample. Overall, the model has about 10X more parameters than there are values in the entire `sklearn` digits dataset. Statistical learning purists may want to shake their heads vigorously before reading the next part.
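For a sense of scale, here is a rough parameter count for a bias-free deep MLP. The helper `count_mlp_params` is hypothetical, and the hidden width of 512 is an assumed value picked only because it lands above a million parameters at a depth of 7; the digits dataset itself has 1797 samples of 64 values each.

```python
def count_mlp_params(dim_x, dim_h, dim_y, depth):
    # bias-free MLP with `depth` weight matrices:
    # one input matrix, depth - 2 square hidden matrices, one output matrix
    return dim_x * dim_h + (depth - 2) * dim_h * dim_h + dim_h * dim_y

# shape of the scikit-learn digits data
num_samples, num_features = 1797, 64
dataset_values = num_samples * num_features

# dim_h=512 and depth=7 are assumptions for illustration only
num_params = count_mlp_params(dim_x=64, dim_h=512, dim_y=10, depth=7)

print("parameters:", num_params)
print("parameters per sample:", round(num_params / num_samples))
print("parameters per dataset value:", round(num_params / dataset_values, 1))
```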

## Experiment 2: Turning it Up to 11







# Code Section 1: Introductory Figures

In [None]:
import numpy as np
import matplotlib.pyplot as plt
my_cmap = plt.get_cmap("viridis")

In [None]:


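def norm(x, n, axis=0):
    # expects a scalar or vector and returns the per-element contribution
    # |x|**n to the nth-order norm (i.e. no summing over elements, no 1/n root)
    # note numpy also has a norm function in the `linalg` module
    
    abs_x = np.abs(x)
    if n:
        # x has a singleton leading axis, so this sum only drops that axis
        my_norm = np.sum(abs_x, axis=axis)**n 
    else:
        # |x|**0 is 1.0 everywhere; zero out the sample(s) closest to x=0
        # to stand in for the value of the L0 norm at exactly 0.0
        my_norm = abs_x**n
        my_norm[abs_x==np.min(abs_x)] = 0.0
        my_norm = my_norm[0]
        
    return my_norm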

def sup_norm(x):
    
    x_max = np.max(np.abs(x))
    my_norm = np.zeros_like(x)
    my_norm[np.abs(x)==x_max] = x_max
    return my_norm[0]

x = np.linspace(-.20, .20, 128)[np.newaxis,:]

l0 = norm(x, 0.0)
l1 = norm(x, 1)
l2 = norm(x, 2)
l3 = norm(x, 3)

l_sup = sup_norm(x)


my_fontsize=32
with plt.xkcd(scale=0.50, length=128.0):
    
    plt.figure(figsize=(8,4))
    plt.plot(x[0], l0, ".", color=my_cmap(0.0), lw=5)
    plt.title("$L_0$", fontsize = my_fontsize)

    plt.figure(figsize=(8,4))
    plt.plot(x[0], l1, color=my_cmap(0.25), lw=5)
    plt.title("$L_1$", fontsize = my_fontsize)

    plt.figure(figsize=(8,4))
    plt.plot(x[0], l2, color=my_cmap(0.5), lw=5)
    plt.title("$L_2$", fontsize = my_fontsize)
  
    plt.figure(figsize=(8,4))
    plt.viridis()
    plt.plot(x[0], l3, color=my_cmap(0.75), lw=5)
    plt.title("$L_3$", fontsize = my_fontsize)
    
    plt.figure(figsize=(8,6))
    plt.plot(x[0], l_sup, color=my_cmap(1.0), lw=5)
    plt.title(r"$L_\infty$", fontsize = my_fontsize)
    plt.show()

In [None]:
# visualize L0 through L_n norms

max_norm_order = 256

x = np.linspace(-1.0,1.0,128)[np.newaxis,:]

for n in range(max_norm_order):
    
    with plt.xkcd(scale=0.3, length=128.0):
        
        fig = plt.figure(figsize=(8,6))
        line_type = "." if n == 0 else "-"
        plt.plot(x[0], norm(x, n), line_type, color=my_cmap(n/max_norm_order), lw=5)
        plt.title("$L${}".format(n), fontsize=my_fontsize)
        plt.savefig("./assets/norm{}.png".format(n))
        plt.close(fig)

In [None]:
# visualize taking the limit as we approach the L0 norm

orders = np.linspace(2e-2, 1e-17, 32)
x = np.linspace(-1.0,1.0,128)[np.newaxis,:]

cc = 0
for n in orders:
    
    with plt.xkcd(scale=0.6, length=128.0):
        
        fig = plt.figure(figsize=(8,6))
        plt.plot(x[0], norm(x, n), color=my_cmap(n/max_norm_order), lw=5)
        plt.title("$L_n$ for n = {:.2e}".format(n), fontsize=my_fontsize)
        plt.axis((-1.0, 1.0, 0.9, 1.1))
        plt.savefig("./assets/l0_limit{}.png".format(cc))
        cc += 1
        
        plt.close(fig)

# Code Section 2: Experiment 1

In [None]:
import sklearn
import sklearn.datasets as datasets
import autograd
from autograd import numpy as np
from autograd import grad
import numpy.random as npr

def softmax(x):
    
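    # subtract the max for numerical stability, then normalize over the
    # class axis so each sample's probabilities sum to 1
    x = x - np.max(x)
    x = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)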
    
    return x

def cross_entropy(prediction, y):
    
    return -np.mean( y * np.log(prediction) \
                   + (1-y) * np.log(1-prediction))

def forward(x, wx2h, wh2y):
    
    x = np.matmul(x, wx2h)
    # arctan activation
    x = np.arctan(x)
    
    x = np.matmul(x, wh2y)
    x = softmax(x)
    
    return x

def get_loss(x, wx2h, wh2y, y, l_n=[0.,0.,0.,0.], l_sup=0):
    
    prediction = forward(x, wx2h, wh2y)
    
    loss = cross_entropy(prediction, y)
    
    # add regularization
    ## l0
    loss += l_n[0] * (np.sum(np.sign(np.abs(wx2h))) \
                      + np.sum(np.sign(np.abs(wh2y))))
    ## l1
    loss += l_n[1] * (np.sum(np.abs(wx2h)) \
                      + np.sum(np.abs(wh2y)))
    ## l2
    loss += l_n[2] * (np.sum(np.abs(wx2h)**2) \
                      + np.sum(np.abs(wh2y)**2))
    ## l3
    loss += l_n[3] * (np.sum(np.abs(wx2h)**3) \
                      + np.sum(np.abs(wh2y)**3))
    ## l_sup
    loss += l_sup * np.maximum(np.max(np.abs(wx2h)), np.max(np.abs(wh2y)))
    
    
    return loss

get_grad = grad(get_loss, argnum=[1,2])

def get_accuracy(x, wx2h, wh2y, y):
    
    prediction = forward(x, wx2h, wh2y)
    
    accuracy = np.sum(1.*np.argmax(prediction, axis=1) == np.argmax(y, axis=1)) / np.sum(y)
    
    return accuracy

In [None]:
# load and prep data
my_seed = 1337

[x, y] = datasets.load_digits(return_X_y=True)

npr.seed(my_seed)
npr.shuffle(x)

npr.seed(my_seed)
npr.shuffle(y)

# convert target labels to one-hot encoding
y_one_hot = np.zeros((y.shape[0],10))

for dd in range(y.shape[0]):
    y_one_hot[dd, y[dd]] = 1.0
    
# separate training, test and validation data
test_size = int(0.3 * x.shape[0])

train_x, train_y = x[:-2*test_size], y_one_hot[:-2*test_size]
val_x, val_y = x[-2*test_size:-test_size], y_one_hot[-2*test_size:-test_size]
test_x, test_y = x[-test_size:], y_one_hot[-test_size:]


In [None]:
# define an MLP with 1 hidden layer, no biases
dim_x = x.shape[1]
dim_y = 10
dim_h = 64
learning_rate = 1e-3

def get_weights(dim_x=64, dim_h=4, dim_y=10):
    
    wx2h = npr.randn(dim_x, dim_h) / np.sqrt(dim_x * dim_h)
    wh2y = npr.randn(dim_h, dim_y) / np.sqrt(dim_h * dim_y)
    
    return wx2h, wh2y

wx2h, wh2y = get_weights()

restore_weights = False

In [None]:
num_steps = 50000
disp_every = 500
learning_rate = 3e-4

if restore_weights:
    weight_index = 0
else:
    list_wx2h = []
    list_wh2y = []
    
for ln in [[0,0,0,0.], [1e-4,0.,0.,0.], [0.,1e-4,0.,0.],\
          [0.,0.,1e-4,0.], [0.,0.,0.,1e-4]]:
    
    if restore_weights:
        wx2h = list_wx2h[weight_index]
        wh2y = list_wh2y[weight_index]
        weight_index += 1
    else:
        wx2h, wh2y = get_weights(dim_h=256)
        
    train_losses = []
    val_losses = []
    train_accuracies = []
    val_accuracies = []
    epochs = []
    
    for step in range(num_steps):
        
        [dwx2h, dwh2y] = get_grad(train_x, wx2h, wh2y, train_y, l_n=ln)

        wx2h -= learning_rate * dwx2h
        wh2y -= learning_rate * dwh2y

        if step % disp_every == 0.0:
            train_pred = forward(train_x, wx2h, wh2y)
            train_loss = cross_entropy(train_pred, train_y)
            
            val_pred = forward(val_x, wx2h, wh2y)
            val_loss = cross_entropy(val_pred, val_y)
            
            train_accuracy = get_accuracy(train_x, wx2h, wh2y, train_y)
            val_accuracy = get_accuracy(val_x, wx2h, wh2y, val_y)
                        
            if(0):
                train_loss = get_loss(train_x, wx2h, wh2y, train_y) #/ train_x.shape[0]
                val_loss = get_loss(val_x, wx2h, wh2y, val_y) #/ val_x.shape[0]
                
            if(0): # reach unreachable code below for progress reports 
                print("training step {}\n".format(step))
                print("  training loss: {:.3e}".format(train_loss))
                print("    training accuracy: {:.3f}\n".format(train_accuracy))
                print("  validation loss: {:.3e}".format(val_loss))
                print("    validation accuracy: {:.3f}".format(val_accuracy))
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            train_accuracies.append(train_accuracy)
            val_accuracies.append(val_accuracy)
            epochs.append(step)
    
    list_wx2h.append(wx2h)
    list_wh2y.append(wh2y)
    
    l_type = np.argmax(ln) if np.sum(ln) else "None"
    
    all_params = np.abs(np.append(wx2h.ravel(), wh2y.ravel()))
    all_params[all_params < 1e-2] = 0.0
    my_mean = np.mean(all_params)
    my_min = np.min(all_params)
    my_max = np.max(all_params)
    
    num_zeros = all_params.shape[0] - np.sum(np.sign(np.abs(all_params)))
    
    print("min/max/mean magnitude and number of zeros:")
    print("  {:.3e}/{:.3e}/{:.3e} and {}"\
          .format(my_min, my_max, my_mean, num_zeros))
    
    print("training step {}\n".format(step))
    print("  training loss: {:.3e}".format(train_loss))
    print("    training accuracy: {:.3f}\n".format(train_accuracy))
    print("  validation loss: {:.3e}".format(val_loss))
    print("    validation accuracy: {:.3f}".format(val_accuracy))

    plt.figure(figsize=(12,6))
    plt.subplot(121)
    plt.imshow(wx2h, \
             vmin=-.20, vmax=.20)
    plt.title("wx2h for L{}".format(l_type), fontsize=32)
    plt.colorbar()
    plt.subplot(122)
    plt.imshow(wh2y, \
             vmin=-.20, vmax=.20)
    plt.title("wh2y for L{}".format(l_type), fontsize=32)
    plt.colorbar()
    plt.tight_layout()
    plt.savefig("./assets/weights_L{}.png".format(l_type))
    
    
    with plt.xkcd(scale=0.3, length=128):
        plt.figure(figsize=(12,6))
        plt.subplot(121)
        plt.plot(epochs, train_losses, color=my_cmap(0.0), lw=5, label="training loss")
        plt.plot(epochs, val_losses, color=my_cmap(0.5), lw=5, label="validation loss")
        plt.axis((0, num_steps, 0.2, 0.80))
        plt.xlabel("epoch")
        plt.ylabel("loss")
        plt.title("losses for L{}".format(l_type), fontsize=32)
        plt.legend()
        
        plt.subplot(122)
        plt.plot(epochs, train_accuracies, color=my_cmap(0.75), lw=5, label="training acc.")
        plt.plot(epochs, val_accuracies, color=my_cmap(1.0), lw=5, label="validation acc.")
        plt.axis((0, num_steps, 0.0, 1.0))
        plt.xlabel("epoch")
        plt.ylabel("accuracy")
        plt.legend()
        plt.title("L{} accuracy".format(l_type), fontsize=32)
        
        plt.tight_layout()
        plt.savefig("./assets/progress_l{}.png".format(l_type))
        plt.show()
        

restore_weights = True

# Code Section 3: Experiment 2

In [None]:
# redefine new functions for get_weights, forward and get_loss to define a "deep" MLP* and include l_sup regularization in the main reg. list
#
# * As per an offhand quip from Geoffrey Hinton in his previously available Coursera course on neural networks,\
# deep learning begins at 7 layers

In [None]:

def get_deep_weights(dim_x=64, dim_h=4, dim_y=10, depth=7):
    
    weights = []
    weights.append(npr.randn(dim_x, dim_h) / np.sqrt(dim_x * dim_h))
    
    for ww in range(1,depth-1):
        weights.append(npr.randn(dim_h, dim_h) / np.sqrt(dim_h**2))
    
    weights.append(npr.randn(dim_h, dim_y))
        
    return weights

def deep_forward(x, weights, dropout=0.0):
    
    for dd in range(len(weights)):
        if dd:
            x = np.arctan(x)
        
        x = np.matmul(x, weights[dd])
        if dropout:
            # multiply by a random keep-mask rather than assigning in place,
            # and rescale so the expected activation is unchanged
            keep_mask = npr.random(x.shape) >= dropout
            x = x * keep_mask / (1 - dropout)
        
    x = softmax(x)
    
    return x

def get_all_params(weights):
    
    # flatten each weight matrix exactly once and join them into one vector
    all_params = np.concatenate([ww.ravel() for ww in weights])
    
    return all_params

def get_deep_loss(x, weights, y, l_n=[0.,0.,0.,0.,0.], dropout=0.0):
    
    prediction = deep_forward(x, weights, dropout)
    
    loss = cross_entropy(prediction, y)
    
    # add regularization
    all_params = get_all_params(weights) 
    ## l0
    loss += l_n[0] * (np.mean(np.sign(np.abs(all_params))))
    ## l1
    loss += l_n[1] * (np.mean(np.abs(all_params)))
    ## l2
    loss += l_n[2] * (np.mean(np.abs(all_params)**2))
    
    ## l3
    loss += l_n[3] * (np.mean(np.abs(all_params)**3))
    ## l_sup
    loss += l_n[-1] * np.max(np.abs(all_params))
    
    return loss

def get_deep_accuracy(x, weights, y):
    
    prediction = deep_forward(x, weights)
    
    accuracy = np.sum(1.*np.argmax(prediction, axis=1) == np.argmax(y, axis=1)) / np.sum(y)
    
    return accuracy

get_deep_grad = grad(get_deep_loss, argnum=1)
restore_weights = False

In [None]:
num_steps = 50000
disp_every = 50
learning_rate = 3e-2
reg_scale = 1e-1

if restore_weights:
    weight_index = 0
else:
    list_weights = []
    
for ln in [[0,0,0,0.,0.], [reg_scale,0.,0.,0.,0.], [0.,reg_scale,0.,0.,0.],\
          [0.,0.,reg_scale,0.,0.], [0.,0.,0.,reg_scale,0.], [0.,0.,0.,0.,reg_scale], None]:
    
    if ln is None:
        ln = [0.,0.,0.,0.,0.]
        dropout_rate = 0.25
    else:
        dropout_rate = 0.0
        
        
    weights = get_deep_weights(dim_h=16)
        
    train_losses = []
    val_losses = []
    train_accuracies = []
    val_accuracies = []
    epochs = []
    
    for step in range(num_steps):
        
        if step % disp_every == 0.0:
            train_pred = deep_forward(train_x, weights)
            train_loss = cross_entropy(train_pred, train_y)
            
            val_pred = deep_forward(val_x, weights)
            val_loss = cross_entropy(val_pred, val_y)
            
            train_accuracy = get_deep_accuracy(train_x, weights, train_y)
            val_accuracy = get_deep_accuracy(val_x, weights, val_y)
                          
            if(0): # reach unreachable code below for progress reports 
                print("training step {}\n".format(step))
                print("  training loss: {:.3e}".format(train_loss))
                print("    training accuracy: {:.3f}\n".format(train_accuracy))
                print("  validation loss: {:.3e}".format(val_loss))
                print("    validation accuracy: {:.3f}".format(val_accuracy))
                
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            train_accuracies.append(train_accuracy)
            val_accuracies.append(val_accuracy)
            epochs.append(step)
            
        grads = get_deep_grad(train_x, weights, train_y, l_n=ln, dropout=dropout_rate)

        for ii in range(len(weights)):
            weights[ii] -= learning_rate * grads[ii]
            

    
    list_weights.append(weights)
    
    l_type = "L" + str(np.argmax(ln)) if np.sum(ln) else "noreg"
    if l_type == "L4":
        l_type = "sup"
    elif dropout_rate:
        l_type = "dropout"
    
    all_params = np.abs(get_all_params(weights))
    
    all_params[all_params < 1e-3] = 0.0
    my_mean = np.mean(all_params)
    my_min = np.min(all_params)
    my_max = np.max(all_params)
    
    num_zeros = all_params.shape[0] - np.sum(np.sign(np.abs(all_params)))
    
    print("min/max/mean magnitude and number of zeros:")
    print("  {:.3e}/{:.3e}/{:.3e} and {}"\
          .format(my_min, my_max, my_mean, num_zeros))
    
    print("training step {}\n".format(step))
    print("  training loss: {:.3e}".format(train_loss))
    print("    training accuracy: {:.3f}\n".format(train_accuracy))
    print("  validation loss: {:.3e}".format(val_loss))
    print("    validation accuracy: {:.3f}".format(val_accuracy))

        
    with plt.xkcd(scale=0.3, length=128):
        plt.figure(figsize=(12,6))
        plt.subplot(121)
        plt.plot(epochs, train_losses, color=my_cmap(0.0), lw=5, label="training loss")
        plt.plot(epochs, val_losses, color=my_cmap(0.5), lw=5, label="validation loss")
        plt.axis((0, num_steps, 0.2, 0.80))
        plt.xlabel("epoch")
        plt.ylabel("loss")
        plt.title("losses with {}".format(l_type), fontsize=32)
        plt.legend()
        
        plt.subplot(122)
        plt.plot(epochs, train_accuracies, color=my_cmap(0.75), lw=5, label="training acc.")
        plt.plot(epochs, val_accuracies, color=my_cmap(1.0), lw=5, label="validation acc.")
        plt.axis((0, num_steps, 0.0, 1.0))
        plt.xlabel("epoch")
        plt.ylabel("accuracy")
        plt.legend()
        plt.title("accuracy with {}".format(l_type), fontsize=32)
        
        plt.tight_layout()
        plt.savefig("./assets/exp2_progress_{}.png".format(l_type))
        plt.show()
        

restore_weights = True