<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416-01/516-01, Fundamentals of AI/ML</h1>
    <h1>Fall 2025</h1>
    <h1>A CNN for the MNIST digit dataset</h1>
</div>

## Credit where credit is due

Much of the code in the last section of this notebook was taken from notebooks that accompany the first edition of Chollet's *Deep Learning with Python*.

# Preamble

In [None]:
import numpy as np
import tensorflow as tf

In [None]:
print(tf.__version__)

In [None]:
# Try to make the randomness repeatable.
np.random.seed(42)
tf.random.set_seed(54)

# Read and preprocess the data

We reshape the data to create a fake color channel.  We will later include a layer in our model to rescale the pixel values to the interval $[0, 1]$.

In [None]:
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape((60000, 28, 28, 1))
X_test = X_test.reshape((10000, 28, 28, 1))

On a mad whim, let's encode the class labels using one-hot encoding.  Here are the original labels:

In [None]:
print(y_train)
print(y_test)

Now we transmogrify them:

In [None]:
from tensorflow.keras.utils import to_categorical

y_train = to_categorical(y_train)
y_test  = to_categorical(y_test)

In [None]:
print(y_train)
print(y_test)

For each label we have created a vector of length 10 with one entry for each class.  All entries are $0$ except for a $1$ in the component representing the class.

For instance, the first image in the training set is a $5$, so there is a $1$ in component $5$ (counting components from $0$):

In [None]:
print(y_train[0])
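
As a rough sketch of what the encoding amounts to, the same vectors can be built in plain NumPy by picking rows of an identity matrix, and `argmax` recovers the integer labels. The small `labels` array here is made up for illustration; it is not read from the dataset.

```python
import numpy as np

# Hypothetical integer labels, standing in for a few entries of y_train.
labels = np.array([5, 0, 4, 1, 9])
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot[0])

# argmax along the class axis inverts the encoding.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Recovering integer labels this way is handy later when comparing predictions with the true classes.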

# Build a CNN

Our model starts with a rescaling layer.  Building preprocessing into the model ensures that it will be applied when the model is deployed.

We then alternate convolutional layers with pooling layers; this is a common architecture.

We end by flattening the feature maps and passing them to a dense layer that produces probability estimates for membership in each of the 10 classes.

## Pooling layers

**Pooling layers** downsample their input.  They do so by decomposing the input into blocks of a specified size and keeping only part of the information about each block.

[Max pooling layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D) retain only the maximum value for each block.

[Average pooling layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/AveragePooling2D) retain the average of the values in each block.

Keras also features [global max pooling layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool2D) and [global average pooling layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling2D), which reduce each channel to a single value.
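
To make the block decomposition concrete, here is a minimal NumPy sketch of $2 \times 2$ max and average pooling on a single channel. The $4 \times 4$ input is made up, and the reshape trick assumes the side lengths are divisible by the block size; it illustrates the idea and is not how Keras implements these layers.

```python
import numpy as np

# A made-up 4x4 single-channel input.
x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [2, 0, 7, 8]], dtype="float32")

# Split into non-overlapping 2x2 blocks.  The axes of `blocks` are
# (block_row, row_in_block, block_col, col_in_block).
blocks = x.reshape(2, 2, 2, 2)

max_pooled = blocks.max(axis=(1, 3))   # largest value in each block
avg_pooled = blocks.mean(axis=(1, 3))  # average value in each block
print(max_pooled)
print(avg_pooled)

# Global pooling treats the whole channel as one block.
global_max = x.max()
global_avg = x.mean()
```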

## The model

In our model each pooling layer operates on $2 \times 2$ blocks, so each one reduces its input to roughly 25% of its original size.  At the end we flatten the data and pass it through a dense layer, which produces probability estimates for membership in each of the 10 classes.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
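inputs = keras.Input(shape=(28, 28, 1))
# Rescale the pixel values from [0, 255] to [0, 1] inside the model.
x = layers.Rescaling(1.0 / 255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)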
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

Because the model is built with the functional API, the explicit input layer is listed first in the model summary.  Its role here is solely to make clear the shape of the inputs.

In [None]:
model.summary()
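
The spatial sizes and parameter counts reported in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model above (`"valid"` padding and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPooling2D`); the helper functions are hypothetical and written only for this check.

```python
def conv_out(size, kernel_size):
    # "valid" padding with stride 1: the kernel must fit inside the input.
    return size - kernel_size + 1

def pool_out(size, pool_size):
    # Non-overlapping blocks; a leftover partial block is dropped.
    return size // pool_size

def conv_params(kernel_size, in_channels, filters):
    # One kernel_size x kernel_size x in_channels kernel plus a bias per filter.
    return (kernel_size * kernel_size * in_channels + 1) * filters

size = 28
size = conv_out(size, 3)   # 26
size = pool_out(size, 2)   # 13
size = conv_out(size, 3)   # 11
size = pool_out(size, 2)   # 5
size = conv_out(size, 3)   # 3
flattened = size * size * 128

total_params = (conv_params(3, 1, 32)
                + conv_params(3, 32, 64)
                + conv_params(3, 64, 128)
                + (flattened + 1) * 10)   # dense layer: weights plus biases
print(size, flattened, total_params)
```

Note that the second pooling layer maps an $11 \times 11$ input to $5 \times 5$, so it keeps slightly less than 25% of the values.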

Because we are using one-hot encoding of the class labels we specify ```categorical_crossentropy``` as the loss function.

In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

We will also specify some callbacks for the training phase.  We use [```EarlyStopping```](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to halt training if 3 epochs (```patience=3```) elapse with no improvement in the validation accuracy (```monitor='val_accuracy'```).

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', 
    patience=3
)
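
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: the function `stopping_epoch` and the accuracy values are made up, and options of the real callback such as `min_delta` and `restore_best_weights` are ignored.

```python
def stopping_epoch(val_accuracies, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0  # epochs elapsed since the last improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: the best value is reached at epoch 3,
# and epochs 4, 5, and 6 fail to improve on it.
print(stopping_epoch([0.97, 0.98, 0.985, 0.984, 0.983, 0.985, 0.99]))  # 6
```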

We use [```ModelCheckpoint```](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to save the best model (as measured by validation accuracy) seen to date.  I have encountered a bug where no file is written unless ```save_weights_only=True``` is set; note that the call below does not set it.

In [None]:
checkpoint_path = '/tmp/checkpoints/mnist.keras'
save_best = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    monitor='val_accuracy',
    mode='max',
    verbose=1,
    save_best_only=True
)

We specify that 10% of the training data should be set aside as the validation set.

<div class="danger"></div>
Keras chooses this set from <a target="_blank" href="https://www.tensorflow.org/api_docs/python/tf/keras/Model?hl=de#fit">the last samples before any random shuffling</a>.  Be sure your training data has been randomly shuffled before using <code>validation_split</code>.

In [None]:
history = model.fit(X_train, y_train, 
              callbacks=[early_stopping, save_best],
              epochs=5, 
              batch_size=128,
              validation_split=0.1)
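
To illustrate the warning about ```validation_split```, here is a small NumPy sketch of shuffling features and labels together before a tail split. The toy arrays are made up, and the slice only mimics the "last samples" behavior described above; a 20% split is used so that the 10-sample example has a nonempty validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for X_train and y_train: 10 samples sorted by label.
X = np.arange(10).reshape(10, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Apply one permutation to both arrays so features and labels stay paired.
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

# Mimic a 20% validation split: the last 20% of the samples are held out.
n_val = int(len(X) * 0.2)
y_val_sorted = y[-n_val:]      # without shuffling: only class 1 is held out
y_val = y_shuffled[-n_val:]    # after shuffling: a random pair of samples
print(y_val_sorted, y_val)
```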

If we wish we can load the best model we saw in training.

In [None]:
model.load_weights(checkpoint_path)

In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")

It is always a good idea to examine the training history.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def plot_training_history(history):
    pd.DataFrame(history.history).plot(figsize=(10, 6))
    plt.grid(True)
    plt.gca().set_ylim(0, 1)
    plt.show() 

plot_training_history(history)

The attribute ```history.history``` is simply a dictionary.  You can use this fact to splice together multiple histories.

In [None]:
history.history
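
For example, if training is continued with a second call to `fit()`, the two dictionaries can be concatenated key by key. This is a minimal sketch: `splice_histories` is a hypothetical helper, and the numbers are made up to have the same shape as `history.history`.

```python
def splice_histories(*histories):
    """Concatenate per-epoch metric lists from several history dictionaries."""
    combined = {}
    for h in histories:
        for key, values in h.items():
            # Build new lists so the input dictionaries are left unchanged.
            combined.setdefault(key, []).extend(values)
    return combined

# Made-up values standing in for the histories of two training runs.
first = {"loss": [0.30, 0.10], "accuracy": [0.91, 0.97]}
second = {"loss": [0.07], "accuracy": [0.98]}

combined = splice_histories(first, second)
print(combined)
```

The combined dictionary can be passed to `pd.DataFrame` for plotting in the same way as a single history.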

# Visualizing what a CNN does

Keras makes it simple to look at the outputs of each layer of a NN.  Here we will look at the outputs of the convolutional and pooling layers, as they are 2-d objects.

The following image is part of the test set.

In [None]:
# The image to plot.
image = 560

In [None]:
import matplotlib.pyplot as plt

plt.imshow(X_test[image])
plt.show()

Now we will build a Keras model that computes the outputs of the pairs of convolutional layers and pooling layers of our CNN.  These are all the layers up to the flattening layer that feeds into the dense layers.

In [None]:
layer_names = []
for layer in model.layers[0:6]:
    print(layer.name)
    layer_names.append(layer.name)

In [None]:
from tensorflow.keras import models

# Extract the outputs of the pairs of convolutional and pooling layers.
layer_outputs = [layer.output for layer in model.layers[0:6]]
print(len(layer_outputs))

# Create a model that will return these outputs, given the model input.
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)
print(len(activation_model.layers))

Slicing by position is fragile, since it depends on exactly where each layer sits in the model.  The next cell rebuilds ```layer_outputs```, ```layer_names```, and ```activation_model``` by selecting the convolutional and pooling layers by type; this is the version used in the rest of the section.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

layer_outputs = []
layer_names = []
for layer in model.layers:
    if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D)):
        layer_outputs.append(layer.output)
        layer_names.append(layer.name)
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)

We use the new model's ```predict()``` method to run the test set through the new model.

In [None]:
# This will return a list of Numpy arrays, one array per
# activation layer.
activations = activation_model.predict(X_test)
print(len(activations))

Let's look at the output of the first convolutional layer.

In [None]:
first_layer_activation = activations[0]
print(first_layer_activation.shape)

Here is the result of applying the filter with index 4 (counting from zero) to the image:

In [None]:
import matplotlib.pyplot as plt

plt.matshow(first_layer_activation[image, :, :, 4], cmap='viridis')
plt.show()

Now let's look at all of the outputs.

In [None]:
images_per_row = 16

# Now let's display our feature maps.
for layer_name, layer_activation in zip(layer_names, activations):
    # This is the number of features in the feature map.
    n_features = layer_activation.shape[-1]

    # The activations have shape (n_images, size, size, n_features).
    size = layer_activation.shape[1]

    # Tile the activation channels in this matrix.
    n_cols = n_features // images_per_row
    display_grid = np.zeros((size * n_cols, images_per_row * size))

    # Tile each filter into a big horizontal grid.
    for col in range(n_cols):
        for row in range(images_per_row):
            # Copy so the post-processing does not modify the activations.
            channel_image = layer_activation[image,
                                             :, :,
                                             col * images_per_row + row].copy()
            # Post-process the feature to make it visually palatable.
            # Constant maps (filters that did not activate) are left as
            # they are to avoid dividing by a zero standard deviation.
            if channel_image.std() > 0:
                channel_image -= channel_image.mean()
                channel_image /= channel_image.std()
                channel_image *= 64
                channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')
            display_grid[col * size : (col + 1) * size,
                         row * size : (row + 1) * size] = channel_image

    # Display the grid.
    scale = 1. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')
    
plt.show()

The first convolutional layer seems to contain horizontal and diagonal edge detectors.

The second convolutional layer seems to further decompose the digit.

However, after we exit the third convolutional layer it is not clear what we are looking at (at least not to me!).

The blank tiles are interesting &ndash; these are filters that did not activate.  This means that the features those filters detect were not present in this particular image.
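
As a toy illustration of an edge detector and of a filter that does not activate, the sketch below applies a hand-picked $3 \times 3$ kernel followed by ReLU to two made-up images. The kernel is chosen by hand for illustration and is not one of the filters learned by the model; `correlate2d_valid` is a hypothetical helper.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image ("valid" padding), then apply ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

# Responds where the image is dark above and bright below.
horizontal_edge = np.array([[-1., -1., -1.],
                            [ 0.,  0.,  0.],
                            [ 1.,  1.,  1.]])

# Made-up 6x6 images: one with a horizontal edge, one completely flat.
edge_image = np.zeros((6, 6))
edge_image[3:, :] = 1.0
flat_image = np.ones((6, 6))

response = correlate2d_valid(edge_image, horizontal_edge)
blank = correlate2d_valid(flat_image, horizontal_edge)
print(response)
print(blank)
```

The flat image contains no horizontal edge, so its response is zero everywhere, which would show up as a blank tile in a plot like the ones above.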

# Softmax and cross-entropy

Recall that probabilities lie in the range $0$ to $1$, and for a particular instance the sum over all the
classes must be $1$.  How do we ensure our neural network outputs possess these properties?

Suppose $(z_{1}, \ldots, z_{m})$ are the values computed by the final layer before any activation is applied (often called the logits).   We can convert these into
probability estimates using the **softmax function**.  The softmax function is defined via
$$
  y_{k} = \frac{e^{z_{k}}}{\sum_{j=1}^{m} e^{z_{j}}}.
$$
This function is smooth and behaves like a probability.  Clearly we have $0 < y_{k} < 1$ for all $k$, and if we
sum over all the possible classes, we obtain $1$:
$$
  y_{1} + \cdots + y_{m}
  = \sum_{k=1}^{m} \frac{e^{z_{k}}}{\sum_{j=1}^{m} e^{z_{j}}}
  = \frac{\sum_{k=1}^{m} e^{z_{k}}}{\sum_{j=1}^{m} e^{z_{j}}} = 1.
$$
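
A minimal NumPy sketch of the softmax function follows. Subtracting the largest $z_{k}$ before exponentiating is a standard numerical safeguard that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio. The input values are made up.

```python
import numpy as np

def softmax(z):
    # Shift by the maximum to avoid overflow in exp; the factor
    # exp(-max) cancels between the numerator and the denominator.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())

# Adding a constant to every z_k does not change the probabilities.
y_shifted = softmax(z + 1000.0)
```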

Define
$$
  c_{ik} =
  \begin{cases}
    1 & \mbox{if training case $i$ belongs to class $k$,} \\
    0 & \mbox{otherwise.}
  \end{cases}
$$
Let $(y_{1}(x;W), \ldots, y_{m}(x; W))$ be the softmax probability estimates the neural network computes
given the input $x$ and model parameters $W$.  The likelihood associated with our training set is
$$
  \prod_{x^{(i)} \in {\mathcal T}} \prod_{k=1}^{m} y_{k}(x^{(i)}; W)^{c_{ik}}.
$$
Taking the logarithm we obtain the log-likelihood
$$
  \sum_{x^{(i)} \in {\mathcal T}} \sum_{k=1}^{m} c_{ik} \log y_{k}(x^{(i)}; W).
$$
For each $i$, the quantity
$$
  - \sum_{k=1}^{m} c_{ik} \log y_{k}(x^{(i)}; W)
$$
is called the **cross-entropy** of $c$ and $y$.  We see that maximizing the log-likelihood for our neural networks is equivalent to minimizing the cross-entropy.
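
The relationship between the cross-entropy and the log-likelihood can be checked numerically. In this sketch the one-hot targets `c` and the probability estimates `y` are made up rather than produced by a network.

```python
import numpy as np

def cross_entropy(c, y):
    # c: one-hot targets, y: predicted probabilities, both of shape (n, m).
    # Returns one cross-entropy value per training case.
    return -np.sum(c * np.log(y), axis=1)

c = np.array([[0., 0., 1.],
              [1., 0., 0.]])
y = np.array([[0.1, 0.2, 0.7],
              [0.5, 0.3, 0.2]])

per_case = cross_entropy(c, y)
log_likelihood = np.sum(c * np.log(y))
print(per_case, log_likelihood)
```

Only the probability assigned to the true class contributes for each case, and the cross-entropies sum to the negative of the log-likelihood.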

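Observe that since $0 < y_{k} < 1$ the cross-entropy is always positive.  When the targets $c_{ik}$ are general probabilities rather than one-hot vectors, its smallest possible value is the entropy of $c$, which need not be zero.  We can ensure that the minimum value is zero by subtracting this constant (using the convention $0 \log 0 = 0$):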
$$
  - \sum_{k=1}^{m} \left( c_{ik} \log y_{k}(x^{(i)}; W) - c_{ik} \log c_{ik} \right).
$$
This function is minimized when $y_{k}(x^{(i)}; W) = c_{ik}$ for every $k$, with minimum value 0, which makes it easier to monitor the progress of the optimization since we know where we would like ideally to end up.
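
A short numerical check of the shifted cross-entropy, using made-up soft targets so that the subtracted constant is not zero. The helper `shifted_cross_entropy` is hypothetical, and terms with $c_{ik} = 0$ are treated as $0 \log 0 = 0$.

```python
import numpy as np

def shifted_cross_entropy(c, y):
    # -sum_k (c_k log y_k - c_k log c_k), with 0 * log 0 taken to be 0.
    safe_c = np.where(c > 0, c, 1.0)   # log(1) = 0 wherever c is 0
    return -np.sum(c * np.log(y) - c * np.log(safe_c))

c_soft = np.array([0.7, 0.2, 0.1])   # made-up soft targets
y_off = np.array([0.5, 0.3, 0.2])    # made-up estimates that differ from c

print(shifted_cross_entropy(c_soft, y_off))    # positive
print(shifted_cross_entropy(c_soft, c_soft))   # zero when y equals c
```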