# Lab 3 Part 1: Convolutions in Keras
The code below will allow you to play with a single convolutional layer in Keras. Take a look at the documentation for the Conv2D layer, which is also where the original code came from.

In [3]:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

In [4]:
input_shape = (4, 28, 28, 3)

x = tf.random.normal(input_shape)

y = Conv2D(filters=2,
           kernel_size=(3, 3),
           strides=1,
           padding='valid',
           input_shape=input_shape[1:])(x)  # (height, width, channels); batch dimension excluded
y.shape

TensorShape([4, 26, 26, 2])

Here is a brief explanation of the above code:

<img src="Convolution_notebook-2.jpg" width=600 align="center">
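
The output shape can also be checked with the usual output-size arithmetic. The sketch below is plain Python and does not call Keras; the helper names `conv_output_size` and `conv_output_shape` are made up for illustration, and it assumes the same stride in both spatial directions and the `'valid'`/`'same'` padding rules.

```python
import math

def conv_output_size(n, f, s=1, padding='valid'):
    """Output length along one spatial dimension (hypothetical helper)."""
    if padding == 'valid':
        # no padding: the filter must fit entirely inside the input
        return (n - f) // s + 1
    if padding == 'same':
        # input is padded so that the output length is ceil(n / s)
        return math.ceil(n / s)
    raise ValueError("padding must be 'valid' or 'same'")

def conv_output_shape(input_shape, filters, kernel_size, strides=1, padding='valid'):
    """Output shape for a channels-last input (batch, height, width, channels)."""
    batch, h, w, _ = input_shape
    fh, fw = kernel_size
    return (batch,
            conv_output_size(h, fh, strides, padding),
            conv_output_size(w, fw, strides, padding),
            filters)

shape = conv_output_shape((4, 28, 28, 3), filters=2, kernel_size=(3, 3))
print(shape)  # (4, 26, 26, 2)
```

Note that the number of input channels does not appear in the output shape; the last dimension is set by the number of filters.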

# Exercises
In the code above, make changes to:

- input $h$
- input $w$
- input $n_c$
- number of filters
- kernel size (same as filter size)

For each change, calculate the dimensions of the output (y.shape) by hand, including drawing a diagram (as shown below).



<img src="Convolution_notebook-1.jpg" width=600 align="center">

## MNIST Revisited

Let's now revisit the MNIST data set. Since the data consists of 2-dimensional images of handwritten digits, we should be able to apply what we've learned about convolutions. In this section, we will create a convolutional neural network (CNN or convnet) for this data set.

In [None]:
from tensorflow.keras.datasets import mnist
import numpy as np

(train_data, train_labels), (test_data, test_labels) = mnist.load_data()

In [None]:
train_data[0].shape

In [None]:
train_labels[:10]

This time we are going to use a **validation set** to monitor our training progress. We can also use this validation set for *hyperparameter tuning*. Remember, using the validation set allows us to keep the *test set* to gauge how well our final model should do in the real world; that is, the final model only sees the test data once.

In [None]:
# Use the first 10,000 samples of our training data as our validation set
val_data = train_data[:10000]
val_labels = train_labels[:10000]

# Use the remainder of the original training data for actual training
partial_train_data = train_data[10000:]
partial_train_labels = train_labels[10000:]

In [None]:
# Scale the pixel values so they lie in the range of 0-1
partial_train_data = partial_train_data / 255.
val_data = val_data / 255.

test_data = test_data / 255.

Note that our data currently has 3 dimensions: `(samples, height, width)`.

In [None]:
print(partial_train_data.shape)
print(val_data.shape)
print(test_data.shape)

In [None]:
print(partial_train_labels.shape)
print(val_labels.shape)
print(test_labels.shape)

Our convolutional neural network will expect 4-dimensional data: `(batch_size, height, width, channels)`. Note that depending on how you decide to update the parameters of the network, `batch_size` could equal the total number of `samples` (as in *batch gradient descent*), it could equal a single sample (as in *stochastic gradient descent*), or it could be some number in between (as in *mini-batch gradient descent*).

We can use a NumPy function to add this dimension.

In [None]:
partial_train_data = np.expand_dims(partial_train_data, axis=3)
val_data = np.expand_dims(val_data, axis=3)
test_data = np.expand_dims(test_data, axis=3)

In [None]:
print(partial_train_data.shape)
print(val_data.shape)
print(test_data.shape)

Note how a fourth dimension was added to our data. This dimension corresponds to the number of channels in our input data. Here it is 1, since the images are all greyscale. It would be 3 if the images were RGB. Also note that the convention here is *channels last*, as opposed to *channels first*.
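
As a small illustration of the two conventions, the sketch below uses a dummy array of zeros (not the MNIST data) and shows where the channel axis ends up in each case. The variable names are chosen for this example only.

```python
import numpy as np

# five dummy greyscale "images" of size 28x28
images = np.zeros((5, 28, 28))

# channels last: (samples, height, width, channels)
channels_last = np.expand_dims(images, axis=-1)

# channels first: (samples, channels, height, width)
channels_first = np.expand_dims(images, axis=1)

print(channels_last.shape)   # (5, 28, 28, 1)
print(channels_first.shape)  # (5, 1, 28, 28)

# converting from channels first to channels last
converted = np.moveaxis(channels_first, 1, -1)
print(converted.shape)       # (5, 28, 28, 1)
```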

As in Lab 1, we need to convert our label data to the correct format.

In [None]:
from tensorflow.keras.utils import to_categorical

partial_train_labels = to_categorical(partial_train_labels)
val_labels = to_categorical(val_labels)
test_labels = to_categorical(test_labels)

In [None]:
print(partial_train_labels.shape)
print(val_labels.shape)
print(test_labels.shape)
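
For reference, the one-hot encoding produced by `to_categorical` can be sketched in plain NumPy. This is only an illustration of the idea, not how Keras implements it; the labels below are a made-up example and `num_classes` is assumed to be 10 (the digits 0-9).

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # made-up integer class labels
num_classes = 10                    # assumption: digits 0-9

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)  # (5, 10)
print(one_hot[0])     # a 1 in position 5, zeros elsewhere

# the original labels can be recovered with argmax
recovered = np.argmax(one_hot, axis=1)
```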

We will now import what we need to build our convolutional neural network. Since we are using Keras's sequential API, we need to import the `Sequential` class. The remaining 3 imports will help us build the layers of our CNN. `Conv2D` creates the convolutional layers we have been discussing in the lectures. `Flatten` is used to create a 1-dimensional vector so we can feed the output of our convolutional layers to the fully-connected layers. We used NumPy's `reshape` function to do this flattening in Lab 1. And the `Dense` layer is the same as what we used in Lab 1.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

We are going to use a slightly different approach to building our network than we did in Lab 1. Here we will directly add a *list of layers* to the `Sequential()` object. That is, we put all our layers inside square brackets `[...]` and put this inside the `Sequential( [...] )` object to create our model. In Lab 1 we used the `.add()` method to add individual layers to our `Sequential()` object that we initialized without any layers.

In [None]:
model = Sequential([
    Conv2D(filters=32,
           kernel_size=(3, 3),
           strides=1,
           padding='same',
           activation='relu',
           input_shape=(28, 28, 1)),
    Conv2D(filters=32,
           kernel_size=(3, 3),
           strides=2,
           padding='valid',
           activation='relu'),
    Conv2D(filters=64,
           kernel_size=(3, 3),
           strides=1,
           padding='same',
           activation='relu'),
    Conv2D(filters=64,
           kernel_size=(3, 3),
           strides=1,
           padding='valid',
           activation='relu'),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

It is often helpful to see the tensor shapes and number of parameters per layer. We can get this information by using the `.summary()` method.

In [None]:
model.summary()
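
The parameter counts reported by `.summary()` can be verified by hand. The sketch below is plain Python with two hypothetical helpers; it assumes every layer has a bias term (the Keras default) and uses the spatial sizes 28 → 28 → 13 → 13 → 11 that follow from the strides and padding in the model above.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter (hypothetical helper)."""
    kh, kw = kernel_size
    return (kh * kw * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit (hypothetical helper)."""
    return (in_units + 1) * out_units

counts = [
    conv2d_params((3, 3), 1, 32),     # 28x28x1  -> 28x28x32
    conv2d_params((3, 3), 32, 32),    # 28x28x32 -> 13x13x32
    conv2d_params((3, 3), 32, 64),    # 13x13x32 -> 13x13x64
    conv2d_params((3, 3), 64, 64),    # 13x13x64 -> 11x11x64
    dense_params(11 * 11 * 64, 128),  # Flatten gives 7744 values
    dense_params(128, 10),
]
total_params = sum(counts)

print(counts)        # [320, 9248, 18496, 36928, 991360, 1290]
print(total_params)  # 1057642
```

Most of the parameters sit in the first `Dense` layer, because it connects every one of the 7744 flattened values to each of its 128 units.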

We are still tackling the same type of problem (multi-class classification) so the same loss and metrics will work for us here. The optimizer `rmsprop` is the same as we used before and can be taken as the default method (or recipe) to try out for updating the model parameters.

In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

We now fit our model to the remaining training data (the original training data minus the validation data). You will now see that *loss* and *accuracy* get updated for each batch of images (here set to 256) but the *validation loss* and *validation accuracy* get updated after each *epoch*. Note that the *validation data* is not being used to train the model. Each batch of the training data is used to update the parameters and then, once we have gone through all of the samples in our training data (that is, all the samples in `partial_train_data`) the model is used to make predictions for the validation set. From those predictions the validation loss and accuracy are calculated.

Each epoch of training should take roughly 30-50s to complete, depending on your hardware.
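
To make the batch arithmetic concrete, the sketch below counts how many parameter updates happen in one epoch for different batch sizes. It assumes 50,000 training samples (the 60,000 MNIST training images minus the 10,000 held out for validation); `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of batches (and hence parameter updates) in one epoch."""
    return math.ceil(num_samples / batch_size)

num_samples = 50_000  # assumption: 60,000 training images minus 10,000 for validation

batch_gd = updates_per_epoch(num_samples, num_samples)  # batch gradient descent
sgd = updates_per_epoch(num_samples, 1)                 # stochastic gradient descent
mini_batch = updates_per_epoch(num_samples, 256)        # mini-batch, as used in this lab

print(batch_gd, sgd, mini_batch)  # 1 50000 196
```

The last batch of each epoch is smaller than 256, since 50,000 is not a multiple of 256.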

In [None]:
history = model.fit(partial_train_data,
                    partial_train_labels,
                    epochs=10,
                    batch_size=256,
                    validation_data=(val_data, val_labels),
                    verbose=1)

The values for the training loss and accuracy, as well as the validation loss and accuracy, are stored in the `history.history` dictionary, with one value per epoch. For example, you can look at the training loss values as follows:

In [None]:
history.history['loss']

We will now use this information to visualize the progress our network makes on the loss and accuracy as the number of epochs increases.

In [None]:
import matplotlib.pyplot as plt  # needed to create our plot

history_dict = history.history # the dictionary that has the information on loss and accuracy per epoch

loss_values = history_dict['loss']   # training loss
val_loss_values = history_dict['val_loss'] # validation loss

epochs = range(1, len(loss_values) + 1)  # epoch numbers 1..N for the x-axis

# code to plot the results
plt.plot(epochs, loss_values, 'b', label="Training Loss")
plt.plot(epochs, val_loss_values, 'r', label="Validation Loss")
plt.title("Training and Validation Loss")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
# As above, but this time we want to visualize the training and validation accuracy
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc_values, 'b', label="Training Accuracy")
plt.plot(epochs, val_acc_values, 'r', label="Validation Accuracy")
plt.title("Training and Validation Accuracy")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

## Exercise: Change the layers

Play around with the **number of filters** and the **filter size** in our model. Note the change in:
- number of parameters in the model
- training and validation losses and accuracies

### Exercise: Early Stopping

When you have a final model, train it until the validation loss stops decreasing. Beyond this point the model is no longer learning patterns that generalize and instead starts to memorize the training data; that is, it begins to overfit. Note the number of epochs at which this happens. One way to avoid this overfitting is called *early stopping*.

Try implementing early stopping for our model:
- use the validation loss plot to determine which epoch corresponds to when the model stops learning
    - if it so happens that the validation loss continues going down for all 10 epochs, then increase the number of epochs in the original code to 20
- use the complete training set (no validation set)
- scale this training set
- expand its dimensions to 4
- use the same model, and same optimizer, loss and metrics
- fit the model to the complete training set (no validation set)
- evaluate the trained model on the test data
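
As a complement to reading the epoch off the plot, the first step above could be sketched as picking the epoch with the lowest validation loss. The helper `best_epoch` and the loss values are made up for illustration; in the lab you would pass in `history.history['val_loss']` instead.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss (hypothetical helper)."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# made-up validation losses, one value per epoch
example_val_loss = [0.12, 0.08, 0.06, 0.05, 0.055, 0.06]

epochs_to_train = best_epoch(example_val_loss)
print(epochs_to_train)  # 4
```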


### Exercise: Early Stopping with Callbacks

Now try to implement early stopping using the Keras [callback](https://keras.io/api/callbacks/early_stopping/) functionality. In this case, you will need to use the validation data, because you want the early stopping to occur as a result of Keras monitoring the validation loss.
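
Before using the callback, it may help to see the kind of logic it automates. The sketch below is a simplified, plain-Python version of patience-based early stopping; it is not the Keras implementation, and the function name and loss values are made up. The idea of a `patience` setting mirrors the parameter of the same name on the Keras `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Simplified patience rule (hypothetical helper).

    Returns (epoch at which training would stop, epoch with the best loss),
    both 1-based.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # the loss kept improving often enough, so training ran to the end
    return len(val_losses), best_epoch

# made-up validation losses, one value per epoch
example_val_loss = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34]

stopped, best = early_stopping_epoch(example_val_loss, patience=2)
print(stopped, best)  # 5 3
```

In this made-up run, training stops at epoch 5 even though epoch 6 would have been better, which shows why the choice of patience matters.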