In [None]:
from google.colab import drive
drive.mount('/content/drive')

# In today's lab, we'll cover two topics that will be useful in your assignment: Overfitting and Convolutional layers

## Overfitting

We'll see a few ways of reducing overfitting, including:
* Reducing the capacity of the network.
* Adding weight regularization.
* Incorporating dropout.

In many of the examples we've seen so far, the performance of our model on the validation data peaked after a few epochs and would then start degrading, even though the performance on the training data continued to improve. This is known as _overfitting_ to the training data. Overfitting happens in every single machine learning problem, and learning how to deal with it is essential to mastering machine learning.

To start with, we'll continue to work with the IMDB movie review dataset to investigate overfitting. Look back at last week's lab if you can't remember what any of the following code is doing, or ask questions!


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import keras
import os

from tensorflow.keras.datasets import imdb
print(keras.__version__)

from tensorflow.keras import layers


In [None]:
# Load the dataset
(train_data, train_labels), _ = imdb.load_data(num_words=10000)

In [None]:
# Define function to vectorise dataset
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Vectorise training data
train_data = vectorize_sequences(train_data)
print(len(train_data))

The training set contains 25,000 reviews. The IMDB dataset is balanced: half of the reviews are labelled positive and half negative.
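
The class balance can be checked directly by counting the labels. The sketch below uses a small hypothetical label array (`toy_labels`) as a stand-in so that it runs without downloading the dataset; in the notebook you would pass `train_labels` instead.

```python
import numpy as np

# Hypothetical stand-in for train_labels (0 = negative, 1 = positive).
toy_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# np.bincount counts how many times each non-negative integer appears.
class_counts = np.bincount(toy_labels, minlength=2)
class_fractions = class_counts / class_counts.sum()

print("Count per class:", class_counts)
print("Fraction per class:", class_fractions)
```

A balanced dataset gives a fraction close to 0.5 for each class.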

In [None]:
# Create original model architecture
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
    ])

# Compile model
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Train data
history_original = model.fit(
    train_data,
    train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.4
    )

Now let’s try to replace it with this smaller model.

In [None]:
# Version of the model with lower capacity
model = keras.Sequential([
    layers.Dense(4, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1, activation="sigmoid")
    ])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history_smaller_model = model.fit(
    train_data, train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.4)

In [None]:
# Compare validation losses of original model and smaller model

val_loss_original = history_original.history["val_loss"]
val_loss_smaller_model = history_smaller_model.history["val_loss"]
epochs = range(1, 21)
plt.plot(epochs, val_loss_original, "b--",
         label="Validation loss of original model")
plt.plot(epochs, val_loss_smaller_model, "b-",
         label="Validation loss of smaller model")
plt.title("Comparison of validation losses of the original and smaller models")
plt.xlabel("Epochs")
plt.ylabel("Validation loss")
plt.legend();

As you can see, the smaller model starts overfitting later than the reference model (after six epochs rather than four), and its performance degrades more slowly once it starts overfitting.
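
One way to make "starts overfitting later" concrete is to find the epoch at which the validation loss is lowest. The sketch below uses made-up loss values (`toy_val_loss_original` and `toy_val_loss_smaller` are hypothetical, not results from the models above), and `best_epoch` is a helper name introduced here. In the notebook you would pass `val_loss_original` and `val_loss_smaller_model` instead.

```python
import numpy as np

def best_epoch(val_loss):
    """Return the 1-based epoch at which the validation loss is lowest."""
    return int(np.argmin(val_loss)) + 1

# Made-up validation losses, for illustration only.
toy_val_loss_original = [0.45, 0.35, 0.30, 0.28, 0.31, 0.36, 0.42]
toy_val_loss_smaller = [0.55, 0.45, 0.38, 0.33, 0.30, 0.29, 0.31]

print("Original model is best at epoch", best_epoch(toy_val_loss_original))
print("Smaller model is best at epoch", best_epoch(toy_val_loss_smaller))
```

Validation loss is noisy, so treat the exact epoch as approximate rather than as a sharp boundary.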

We will now add a model that has far more capacity than the problem warrants. While it is standard to work with models that are significantly overparameterized for what they’re trying to learn, there can definitely be such a thing as too much memorization capacity. You’ll know your model is too large if it
starts overfitting right away and if its validation loss curve looks choppy with high variance (although choppy validation metrics could also be a symptom of using an unreliable validation process, such as a validation split that’s too small).

In [None]:
# Version of the model with higher capacity
model = keras.Sequential([
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid")
    ])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history_larger_model = model.fit(
    train_data,
    train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.4)

In [None]:
# Compare validation losses of original model and larger model

val_loss_original = history_original.history["val_loss"]
val_loss_larger_model = history_larger_model.history["val_loss"]
epochs = range(1, 21)
plt.plot(epochs, val_loss_original, "b--",
         label="Validation loss of original model")
plt.plot(epochs, val_loss_larger_model, "b-",
         label="Validation loss of larger model")
plt.title("Comparison of validation losses of the original and larger models")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend();

The bigger model starts overfitting almost immediately, after just one epoch, and it overfits much more severely. Its validation loss is also noisier. It gets training loss near zero very quickly. The more capacity the model has, the more quickly it can model the
training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
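
Capacity can be quantified by counting trainable parameters. Each `Dense` layer has `inputs * units` weights plus `units` biases. The helper below (`dense_param_count` is a name introduced here for illustration) applies this to the three architectures above, assuming the 10,000-dimensional input produced by `vectorize_sequences`.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases for a stack of Dense layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # kernel + bias
        input_dim = units  # this layer's output feeds the next layer
    return total

for name, units in [("smaller", [4, 4, 1]),
                    ("original", [16, 16, 1]),
                    ("larger", [512, 512, 1])]:
    print(f"{name}: {dense_param_count(10000, units):,} parameters")
```

These totals should agree with what `model.summary()` reports once each model has been built.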

## Weight Regularization

Another way of avoiding overfitting is by using weight regularization. You may be familiar with the principle of Occam’s razor: given two explanations for something, the explanation most likely to be correct is the simplest one — the one that
makes fewer assumptions. This idea also applies to the models learned by neural networks: given some training data and a network architecture, multiple sets of weight values (multiple models) could explain the data. Simpler models are less likely to overfit
than complex ones.


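A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters, as we saw in the previous section).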
Thus, a common way to mitigate overfitting is to put constraints on the complexity of a model by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called *weight regularization*, and it’s
done by adding to the loss function of the model a cost associated with having large weights. This cost comes in two flavors:

1. **L1 regularization**: The cost added is proportional to the *absolute value of the weight coefficients* (the L1 norm of the weights).
2. **L2 regularization**: The cost added is proportional to the *square of the value of the weight coefficients* (the squared L2 norm of the weights). L2 regularization is also called *weight decay* in the context of neural networks.
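
To make the two penalties concrete, the sketch below computes them by hand with NumPy for a small made-up weight matrix. The names `toy_weights` and `factor` are illustrative; `factor` plays the role of the number passed to a Keras regularizer.

```python
import numpy as np

toy_weights = np.array([[0.5, -1.0],
                        [2.0, 0.0]])
factor = 0.01  # regularization strength (hypothetical value)

# L1: factor times the sum of absolute values.
l1_penalty = factor * np.sum(np.abs(toy_weights))
# L2: factor times the sum of squared values.
l2_penalty = factor * np.sum(np.square(toy_weights))

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

Because the values are squared, large weights dominate the L2 penalty: here the single weight 2.0 contributes 4.0 of the 5.25 total before scaling.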

In Keras, weight regularization is added by passing *weight regularizer instances* to layers as keyword arguments. Let’s add L2 weight regularization to our initial movie-review
classification model.

In [None]:
# Adding L2 weight regularization to the model
from tensorflow.keras import regularizers

model = keras.Sequential([
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.002), activation="relu"),
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.002), activation="relu"),
    layers.Dense(1, activation="sigmoid")
    ])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history_l2_reg = model.fit(
    train_data,
    train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.4)

In [None]:
# Compare validation losses of original model and L2-regularized model

val_loss_original = history_original.history["val_loss"]
val_loss_l2_reg = history_l2_reg.history["val_loss"]
epochs = range(1, 21)
plt.plot(epochs, val_loss_original, "b--",
         label="Validation loss of original model")
plt.plot(epochs, val_loss_l2_reg, "b-",
         label="Validation loss of L2-regularized model")
plt.title("Effect of L2 weight regularization on validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend();

We can see that the model with `L2 regularization` has become much more resistant to overfitting than the reference model, even though both models have the same number of parameters.

As an alternative to L2 regularization, you can use one of the following Keras weight regularizers.

In [None]:
# Different weight regularizers available in Keras
from tensorflow.keras import regularizers

regularizers.l1(0.001)      #L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)    #Simultaneous L1 and L2 regularization

In the L2-regularized model above, `l2(0.002)` means every coefficient in the weight matrix of the layer will add `0.002 * weight_coefficient_value ** 2` to the total loss of the model. Note that because this penalty is only added at training time, the loss for this model will be much higher at training than at test time.

**Note** that weight regularization is more typically used for *smaller deep learning models*. *Large deep learning models* tend to be so overparameterized that imposing constraints on weight values hasn’t much impact on model capacity and generalization. In
these cases, a different regularization technique is preferred: *dropout*.

## Dropout

*Dropout* is one of the most effective and most commonly used regularization techniques for neural networks. *Dropout*, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Let’s say a
given layer would normally return a vector `[0.2, 0.5, 1.3, 0.8, 1.1]` for a given input sample during training. After applying *dropout*, this vector will have a few zero entries distributed at random: for example, `[0, 0.5, 1.3, 0, 1.1]`.

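The *dropout rate* is the fraction of the features that are zeroed out; it’s usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer’s output values are scaled down by a factor equal to the fraction of units kept during training (`1 - rate`, which happens to equal the dropout rate when the rate is 0.5), to balance for the fact that more units are active than at training time. In practice, Keras applies the equivalent rescaling during training instead, scaling the surviving outputs up by `1 / (1 - rate)`, so that the layer does nothing at test time. The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren’t significant.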
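
The mechanics can be sketched in NumPy. This is an illustration, not Keras's internal code. It shows the common "inverted dropout" variant, in which the surviving activations are scaled up by `1 / (1 - rate)` during training so that nothing needs rescaling at test time. The function name `dropout_train` and the random seed are assumptions made for the example.

```python
import numpy as np

def dropout_train(layer_output, rate, rng):
    """Zero each entry with probability `rate`, then rescale the survivors."""
    keep_mask = rng.random(layer_output.shape) >= rate
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)
layer_output = np.array([0.2, 0.5, 1.3, 0.8, 1.1])

dropped = dropout_train(layer_output, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0 or double its original value

# Averaged over many random masks, the output matches the original values.
mean_output = np.mean(
    [dropout_train(layer_output, 0.5, rng) for _ in range(20000)], axis=0)
print(mean_output)
```

The averaging step shows why the rescaling matters: the expected value of each activation is the same with and without dropout, so the next layer sees inputs on a consistent scale.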

In Keras, you can introduce *dropout* in a model via the Dropout layer, *which is applied to the output of the layer right before it*. Let’s add two Dropout layers in the IMDB model to see how well they do at reducing overfitting.

In [None]:
# Adding dropout to the IMDB model

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid")
    ])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history_dropout = model.fit(
    train_data,
    train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.4)

In [None]:
# Compare validation losses of original model and model with dropout

val_loss_original = history_original.history["val_loss"]
val_loss_dropout = history_dropout.history["val_loss"]
epochs = range(1, 21)
plt.plot(epochs, val_loss_original, "b--",
         label="Validation loss of original model")
plt.plot(epochs, val_loss_dropout, "b-",
         label="Validation loss of dropout-regularized model")
plt.title("Effect of dropout on validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend();

Again, we notice a clear improvement over the original network.

To recap, these are the most common ways to maximize generalization and prevent overfitting in neural networks:

a) Get more training data, or better training data.

b) Develop better features.

c) Reduce the capacity of the model.

d) Add weight regularization (for smaller models).

e) Add dropout.

## Brief Introduction to Convnets (aka Convolutional Neural Networks, or CNNs)

As an introduction, let's take a quick look at a simple convnet example. It uses a convnet to classify MNIST digits, a task we performed in Chapter 2 using a densely connected network (our test accuracy then was around 97.8%). Convnets have been very successful at computer vision tasks. Even though the convnet we are about to build will be basic, its accuracy will be considerably better than that of the densely connected network.

A basic convnet is a stack of `Conv2D` and `MaxPooling2D` layers. We will build the model using the *Functional API*, which was introduced in Chapter 7.

Importantly, a convnet takes as input tensors of shape `(image_height, image_width, image_channels)` (not including the batch dimension). In this case, we'll configure the convnet to process inputs of size `(28, 28, 1)`, which is the format of MNIST images (`image_channels=1` because the images are grayscale; colour images have `image_channels=3`). We do this by passing the argument `shape=(28, 28, 1)` to `keras.Input`.

In [None]:
# Instantiating a small convnet
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Displaying model summary
model.summary()

You can see that the output of every `Conv2D` and `MaxPooling2D` layer is a rank-3 tensor of shape `(height, width, channels)`. The width and height dimensions tend to shrink as you go deeper in the model. The number of channels is controlled by the `filters` argument passed to the `Conv2D` layers (32, 64, or 128). After the last `Conv2D` layer, we end up with an output of shape `(3, 3, 128)`: a 3 × 3 feature map of 128 channels.
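
The shrinking spatial size follows simple arithmetic. With the default settings used above (no padding, stride 1), a convolution with a `k × k` kernel reduces each spatial dimension by `k - 1`, and a `2 × 2` max pooling halves it, rounding down. The helper names below are introduced for illustration.

```python
def conv_output_size(size, kernel_size):
    """Spatial size after an unpadded ('valid') convolution with stride 1."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool_size

size = 28
sizes = [size]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        size = conv_output_size(size, 3)
    else:
        size = pool_output_size(size, 2)
    sizes.append(size)

print(sizes)  # [28, 26, 13, 11, 5, 3]
```

The final value, 3, matches the `(3, 3, 128)` output shape of the last `Conv2D` layer.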

The next step is to feed this output into a densely connected
classifier like those you’re already familiar with: a stack of Dense layers. These classifiers process vectors, which are 1D, whereas the current output is a rank-3 tensor. To bridge the gap, we flatten the `3D` outputs to `1D` with a Flatten layer before adding the Dense layers. Finally, we do `10-way` classification, so our last layer has 10 outputs and a softmax activation.

We will now train the convnet on the MNIST digits, reusing a lot of the code from the `MNIST` example in Chapter 2. Because we’re doing `10-way` classification with a softmax output, we’ll use the `categorical crossentropy loss`, and because our labels are
integers, we’ll use the sparse version, `sparse_categorical_crossentropy`.

In [None]:
# Training the convnet on MNIST images
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(train_images, train_labels, epochs=5, batch_size=64)
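
The two crossentropy variants compute the same quantity and differ only in the label format they expect. The NumPy sketch below illustrates this with made-up predictions (`toy_probs`) and labels. It is a hand-rolled illustration, not the Keras implementation.

```python
import numpy as np

# Made-up softmax outputs for 3 samples and 4 classes (each row sums to 1).
toy_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.6, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
int_labels = np.array([0, 1, 3])         # integer labels
one_hot_labels = np.eye(4)[int_labels]   # the same labels, one-hot encoded

# "Sparse" version: pick out the probability of the true class by index.
sparse_loss = -np.log(toy_probs[np.arange(3), int_labels]).mean()
# "Categorical" version: weight the log-probabilities by the one-hot labels.
categorical_loss = -(one_hot_labels * np.log(toy_probs)).sum(axis=1).mean()

print(sparse_loss, categorical_loss)
```

Both numbers agree, which is why integer labels can be used directly with the sparse loss instead of being one-hot encoded first.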

The next step is to evaluate the model on the test data.

In [None]:
# Evaluating the convnet
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

This is much better than we achieved before!

If you get to this point, look back at the summary of the convolutional model and try to explain the number of parameters at each layer. Don't forget there are still bias terms!
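
As a starting point for this exercise, a `Conv2D` layer has one `kernel_height × kernel_width × input_channels` kernel per filter, plus one bias per filter. The helper below (a name introduced here, not a Keras function) applies that to the first layer only; the remaining layers are left for you to work out.

```python
def conv2d_param_count(kernel_size, input_channels, filters):
    """Weights plus biases for a Conv2D layer with a square kernel."""
    weights = kernel_size * kernel_size * input_channels * filters
    biases = filters
    return weights + biases

# First layer: 3x3 kernel, 1 input channel, 32 filters.
print(conv2d_param_count(kernel_size=3, input_channels=1, filters=32))  # 320
```

Pooling and flatten layers have no trainable parameters, so only the `Conv2D` and `Dense` layers contribute to the total.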

Whereas the densely connected model from Chapter 2 had a test accuracy of 97.8%, the basic convnet has a test accuracy of about 99.1% (your exact figure may vary slightly between runs): we decreased the relative error rate by about 60%.
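
The relative error reduction quoted above can be reproduced from the two accuracy figures. The variable names are introduced here for illustration.

```python
dense_accuracy = 0.978
convnet_accuracy = 0.991

dense_error = 1 - dense_accuracy       # about 2.2%
convnet_error = 1 - convnet_accuracy   # about 0.9%

# Fraction of the dense model's errors that the convnet removes.
relative_reduction = (dense_error - convnet_error) / dense_error
print(f"Relative error reduction: {relative_reduction:.0%}")
```

A gain of 1.3 percentage points in accuracy looks small, but it removes well over half of the remaining errors.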
