In [1]:
# loading libraries for data manipulation
import numpy as np
import pandas as pd

# loading libraries for data visualization
import matplotlib.pyplot as plt
from plotnine import *

# import tensorflow and keras packages
import tensorflow as tf
from tensorflow import keras

import warnings
warnings.filterwarnings('ignore')

We will load the MNIST data again. This time, we will flatten and rescale the images to the range [0, 1] and one-hot encode the labels, so that each digit becomes a 10-element vector.

In [2]:
# Load MNIST data from keras.datasets
(X_train_mnist, y_train_mnist), (X_test_mnist, y_test_mnist) = keras.datasets.mnist.load_data()

X_train_mnist = X_train_mnist.reshape(-1, 28*28).astype('float32') / 255.0
X_test_mnist = X_test_mnist.reshape(-1, 28*28).astype('float32') / 255.0

# Convert y labels to one-hot encoded vectors
y_train_mnist = keras.utils.to_categorical(y_train_mnist, num_classes=10)
y_test_mnist = keras.utils.to_categorical(y_test_mnist, num_classes=10)
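
To make the one-hot step concrete, here is a minimal NumPy sketch of the kind of array that to_categorical returns. The labels array is a made-up example, not MNIST data, and indexing an identity matrix is just one simple way to reproduce the encoding.

```python
import numpy as np

# Hypothetical integer labels (not taken from MNIST)
labels = np.array([3, 0, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for class i
one_hot = np.eye(10, dtype="float32")[labels]

print(one_hot.shape)            # (3, 10)
print(one_hot.argmax(axis=1))   # recovers the original labels: [3 0 9]
```

Each row contains a single 1 in the column of its class, which is the format that the categorical_crossentropy loss expects.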



Regularization techniques help a deep neural network generalize to unseen data, in particular by reducing overfitting. Some techniques we can employ:
- Penalization
- Dropout
- Batch Normalization
- Early Stopping

Let's first build a deep neural network to classify digits without any regularization.

In [None]:
model = ...

model.compile(...)

history_no_reg = model.fit(...)

Let's introduce L2 weight regularization. It adds a penalty proportional to the sum of squared weights to the loss. Under plain gradient descent this acts as weight decay: it keeps the weights small so that no single neuron dominates the predictions.
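
The weight-decay view can be checked with a few lines of NumPy. This is a sketch under stated assumptions: the values of lam and lr and the weight vector w are made up, and the data loss is ignored so that only the effect of the penalty is visible.

```python
import numpy as np

lam = 0.01   # hypothetical L2 strength (lambda)
lr = 0.1     # hypothetical learning rate

w = np.array([2.0, -1.0, 0.5])   # toy weights, not from a trained model

# L2 penalty that gets added to the data loss: lam * sum(w^2)
penalty = lam * np.sum(w ** 2)

# The gradient of the penalty alone is 2 * lam * w, so one plain
# gradient step multiplies every weight by (1 - 2 * lr * lam).
w_next = w - lr * (2 * lam * w)

print(penalty)
print(w_next / w)   # each ratio is approximately 1 - 2 * lr * lam = 0.998
```

The shrink factor is the same for every weight, which is why the penalty is described as a decay. With adaptive optimizers such as Adam, the L2 penalty and weight decay are no longer exactly equivalent.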

In [None]:
model = ...

model.compile(...)

history_l2_reg = model.fit(...)

Let's also introduce Dropout. During training, Dropout sets the output of each neuron to zero with probability p, chosen at random on every step. At inference time, all neurons are kept.
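
A rough NumPy sketch of the mechanism follows. It uses the "inverted dropout" variant, where surviving activations are rescaled by 1 / (1 - p) during training so that nothing needs to change at inference. The drop probability and the toy activations are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                             # hypothetical drop probability
activations = np.ones((4, 1000))    # toy layer output, not MNIST

# Training: zero each unit with probability p, rescale survivors by 1 / (1 - p)
keep_mask = rng.random(activations.shape) >= p
dropped = activations * keep_mask / (1.0 - p)

# Inference: dropout is a no-op, every unit is kept
inference_out = activations

print(keep_mask.mean())   # fraction of units kept, close to 1 - p
print(dropped.mean())     # close to 1.0, the mean of the original activations
```

The rescaling keeps the expected activation unchanged, so the next layer sees inputs of a similar magnitude in training and inference.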

In [None]:
model = ...

model.compile(...)

history_l2_drop_reg = model.fit(...)
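
One possible way to combine the two techniques in Keras is sketched below. It reuses the 784-256-256-10 architecture that appears later in this notebook. The names model_sketch, l2_strength and drop_rate, and their values, are assumptions for illustration rather than the settings used to produce the results here.

```python
from tensorflow import keras

l2_strength = 1e-4   # assumed value, tune for your data
drop_rate = 0.3      # assumed value, tune for your data

model_sketch = keras.Sequential([
    keras.Input(shape=(784,)),
    # kernel_regularizer adds l2_strength * sum(w^2) to the loss for this layer
    keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(l2_strength)),
    keras.layers.Dropout(drop_rate),   # active only during training
    keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(l2_strength)),
    keras.layers.Dropout(drop_rate),
    keras.layers.Dense(10, activation="softmax"),  # output layer
])

model_sketch.compile(loss="categorical_crossentropy",
                     optimizer="adam",
                     metrics=["accuracy"])
```

Neither the regularizer nor the Dropout layers add trainable parameters, so the parameter count matches the unregularized network of the same shape.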

Now, we can compare the models to see whether regularization improves generalization.

In [None]:

model_history = {"None":history_no_reg.history,
                 "L2":history_l2_reg.history,
                 "L2/Dropout":history_l2_drop_reg.history}

# Training Loss
plt.figure(figsize=(10, 4))
for name in model_history:
    plt.plot(model_history[name]['loss'], label=f'{name}')
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

# Training Accuracy
plt.figure(figsize=(10, 4))
for name in model_history:
    plt.plot(model_history[name]['accuracy'], label=f'{name}')
plt.title("Training Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Validation Accuracy
plt.figure(figsize=(10, 4))
for name in model_history:
    plt.plot(model_history[name]['val_accuracy'], label=f'{name}')
plt.title("Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

The plots show that, with regularization, training and validation accuracies end up closer together. Note that the regularized models have not yet converged and could benefit from more training epochs.

Another option is Early Stopping. This mechanism stops training once the validation loss stops improving, which helps prevent overfitting. We can use the keras.callbacks.EarlyStopping callback.
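
The stopping rule itself is simple and can be sketched in plain Python. The helper early_stopping_epoch and the list of validation losses are hypothetical. The sketch only mirrors the idea of a patience counter and is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # new best validation loss, reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]
print(early_stopping_epoch(val_losses, patience=2))   # 4
print(early_stopping_epoch(val_losses, patience=3))   # None
```

With patience=2 the run stops at epoch 4 and never reaches the better loss at epoch 5, which shows the trade-off in choosing patience. In the Keras callback, the corresponding arguments are monitor and patience.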

In [None]:
model = keras.Sequential([
        keras.layers.Dense(256,activation='relu',input_shape=(784,)),
        keras.layers.Dense(256,activation='relu'),
        keras.layers.Dense(10,activation='softmax') # output layer
    ])

model.compile(loss="categorical_crossentropy",
              optimizer="adam", #keras.optimizers.SGD(0.01)
              metrics=["accuracy"])

early_stopping = ...

# store the result under its own name so the earlier history_no_reg is not overwritten
history_early_stop = model.fit(X_train_mnist,y_train_mnist,
                               epochs=20,verbose=1,batch_size=128,
                               validation_data=(X_test_mnist,y_test_mnist),
                               callbacks=[early_stopping])

Our last regularization technique (not including data augmentation) is Batch Normalization. It normalizes the activations of each layer to mean 0 and standard deviation 1 per mini-batch, which stabilizes and speeds up training. Batch Normalization can be applied before the activation function: normalize the weighted sum, then apply the activation function on top. It can also be applied after the activation function, but be careful, as ReLU can produce a lot of zeros.
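
The per-mini-batch computation can be sketched in NumPy as follows. The batch z is random toy data, and eps, gamma and beta are set to illustrative starting values. The sketch covers only the training-time forward pass, placed before a ReLU activation.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # toy weighted sums: 128 samples, 4 nodes

eps = 1e-3           # small constant for numerical stability (assumed value)
gamma = np.ones(4)   # learned scale, shown at an initial value of 1
beta = np.zeros(4)   # learned shift, shown at an initial value of 0

# Normalize each node using the statistics of the current mini-batch
mean = z.mean(axis=0)
var = z.var(axis=0)
z_hat = (z - mean) / np.sqrt(var + eps)

# Scale and shift, then apply the activation on top
out = gamma * z_hat + beta
a = np.maximum(out, 0.0)   # ReLU

print(out.mean(axis=0).round(3))   # approximately 0 for every node
print(out.std(axis=0).round(3))    # approximately 1 for every node
```

At inference time the batch statistics are replaced by running averages collected during training, so a single sample can be normalized on its own.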

In [None]:
model = ...

model.compile(...)

history_batch_norm = model.fit(...)

In [None]:
# batch normalization adds additional parameters for the networks to learn
model.summary()

The network now has to learn two new parameters (scale and shift) per node. It also keeps a running mean and variance per node, but these are only tracked, not learned.
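
The split between learned and tracked parameters can be verified on a single layer. The sketch below assumes a layer width of 256 nodes, chosen to match the hidden layers used earlier. The variable names are illustrative.

```python
from tensorflow import keras

# Build a standalone Batch Normalization layer for inputs with 256 nodes
bn = keras.layers.BatchNormalization()
bn.build((None, 256))

# Trainable: scale (gamma) and shift (beta); non-trainable: running mean and variance
trainable = sum(int(w.shape[0]) for w in bn.trainable_weights)
non_trainable = sum(int(w.shape[0]) for w in bn.non_trainable_weights)

print(trainable, non_trainable)   # 512 512
```

In model.summary(), these show up as four parameters per node for each Batch Normalization layer, half of them listed under non-trainable params.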

A typical layout for a feed-forward neural network is:
1. Define Input Layer
2. Add N Hidden Layers - for each, add L2 regularization, Activation Function, Batch Normalization (before/after activation)
3. Add Dropout between layers (after activation)
4. Define Output Layer with task-dependent activation function (softmax for multiclass classification, sigmoid for binary classification, none/linear for regression)
5. Define Optimizer, Loss Function
6. Define Early Stopping parameters
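
The layout above can be sketched as one Keras model. The helper build_ffnn, its default values, and the names full_model and early_stopping_sketch are hypothetical. Batch Normalization is placed before the activation here, which is only one of the two options mentioned earlier.

```python
from tensorflow import keras

def build_ffnn(input_dim=784, hidden_units=(256, 256), num_classes=10,
               l2_strength=1e-4, drop_rate=0.3):
    """Hypothetical helper that follows the six-step layout; all defaults are assumed."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))                 # 1. input layer
    for units in hidden_units:                                  # 2. hidden layers
        model.add(keras.layers.Dense(
            units,
            kernel_regularizer=keras.regularizers.l2(l2_strength)))
        model.add(keras.layers.BatchNormalization())            #    normalize the weighted sum
        model.add(keras.layers.Activation("relu"))              #    then apply the activation
        model.add(keras.layers.Dropout(drop_rate))              # 3. dropout after activation
    model.add(keras.layers.Dense(num_classes,
                                 activation="softmax"))         # 4. output layer
    model.compile(loss="categorical_crossentropy",              # 5. loss and optimizer
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

full_model = build_ffnn()

# 6. early stopping parameters (patience value is an assumption)
early_stopping_sketch = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
```

To train, pass callbacks=[early_stopping_sketch] together with validation_data to full_model.fit. Monitoring a held-out validation split rather than the test set keeps the final test accuracy an unbiased estimate.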