# CS-6580 Lecture 7 - Convolutional Neural Networks
**Dylan Zwick**

*Weber State University*

In this lecture we'll discuss the basic concepts behind a "convolutional neural network", or CNN, and how CNNs can be used to significantly improve performance when the data has certain types of structure. In particular, CNNs have been extremely useful when dealing with images, and are one of the most important machine learning tools in the field of computer vision.

The concept here is similar to an RNN in that it doesn't fundamentally alter the methodology of the neural network - it modifies the network's structure to address an underlying structure in the data. In the case of an RNN this structure was the sequential order within the data. In the case of a CNN it's the relative importance of data that is close together.

**The Problem With Images**

In our first assignment, we created a basic sequential neural network to learn how to classify hand-written digits - and it did quite well! However, while it's not a trivial problem, it's much, *much* simpler than most computer vision problems.

One reason for this is the size of the data for each observation. For the hand-written digits, we had a $28 \times 28$ grid, and each cell in the grid had a greyscale value between 0 and 255, which meant 784 inputs per observation. That's not a trivial number, but it's much smaller than the amount of data in a standard digital image. For a standard digital image something like a $200 \times 200$ grid is much more common, and instead of there being a single greyscale value, there are typically 3 (RGB) color values, which would mean each observation would have $200 \times 200 \times 3 = 120{,}000$ data points. That's a lot of data, and consequently a lot of weights to learn! With that many weights, convergence can be slow, and you need *a lot* of data if you want to avoid overfitting.
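To make the size difference concrete, here's a small sketch that counts the inputs per observation in both cases, along with the weights a single fully connected layer would need. The hidden layer width of 128 is an assumption picked purely for illustration; it isn't taken from the assignment.

```python
# Inputs per observation: a 28x28 greyscale digit vs. a 200x200 RGB image.
mnist_inputs = 28 * 28 * 1
photo_inputs = 200 * 200 * 3

# Hypothetical width of the first fully connected layer (illustrative only).
hidden_units = 128


def dense_weights(n_inputs, n_units):
    """Weights plus biases in one fully connected layer."""
    return n_inputs * n_units + n_units


mnist_weights = dense_weights(mnist_inputs, hidden_units)
photo_weights = dense_weights(photo_inputs, hidden_units)

print(mnist_inputs, photo_inputs)    # 784 120000
print(mnist_weights, photo_weights)  # 100480 15360128
```

Under that assumption, the first layer alone goes from roughly a hundred thousand weights to over fifteen million.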

So, what does a convolutional neural network do? It looks for patterns by applying "convolution filters" to the images, and then learns to classify the images based upon the information from these filters - which is usually a much smaller amount of information!

**Convolution Filters**

What is a convolution filter? Let's look at it in the context of an exceptionally easy image classification problem - differentiating Xs and Os. So, instead of trying to learn 10 digits, here we're just learning two characters. Further, let's suppose we've only got a $9 \times 9$ grid and that each pixel in the grid is a single bit, which means it's either a $0$ or a $1$. Here we could probably train just a simple perceptron (logistic regression) model and it would do a decent job.

<center>
    <img src = "https://lh3.googleusercontent.com/drive-viewer/AEYmBYTbYudLJAOAdM2O1vgTOSBRmWaGS-9EVuZyVH-k2RTuFqZg6YcwWXN-Sql8452dv87GFKpOLVvPeGWJQM8ylyX3xacY=s1600" width=1200>
</center>

However, this example highlights some important aspects about how we might use a convolutional neural network. For example, there are features in the data that we'd expect to see for an "X" that we would not expect to see for an "O". We wouldn't expect to see a hard left-to-right or right-to-left diagonal line on an "O", but we'd definitely expect to see one on an "X". Even more, we wouldn't expect to see a crossed line on an "O", but we'd absolutely expect to find one on an "X". Conversely, on an "O" we'd expect to see curves and vertical / horizontal lines.

The important thing here is that these features that are critical for distinguishing between an "X" and an "O" relate to pixels that are *near* each other. In other words, the interactions between inputs that are - in some sense - close to each other are important, and it would be nice if this were reflected within our model.

This is the idea behind a convolution filter, which is the foundation of a convolutional neural network. What a convolution filter does is take a pattern of interest (like the left-to-right line below) and move over the data, attempting to determine whether and where that pattern occurs. Loosely speaking, a "convolution" in mathematics is a way of combining two functions or signals, and for our purposes we can think of it as a measure of how well a pattern matches a region of the data. Convolutions come up *all the time* in things like signal processing.

<center>
    <img src = "https://lh3.googleusercontent.com/drive-viewer/AEYmBYRFqA4p2B7lTh5HKzEsYpyEFBubwK8gGLP7Y21ht69iSAsZlA9uhWC0Gw9cMGToJBviHyWf-1dnguCx0ZrKtZu3wMiIhg=s1600" width=1200>
</center>

The way it does this is by taking the "convolution" of the pattern with a segment of the data.

For example, we could format our data in such a way that a black pixel is represented by a 1, and a white pixel is represented by a -1. Then, we could do a pixel-wise multiplication of our pattern with a segment of our image, and then calculate the average of these products. This will give us a number between -1 and 1. A 1 would indicate perfect alignment with our pattern, while a -1 would indicate an exact misalignment.

<center>
    <img src = "https://lh3.googleusercontent.com/drive-viewer/AEYmBYRIyps5zVaWyp4XzetvVfDiNxbZjn9nSFudfu0x6Kc7rI_Au8ce44AeSW7anvUzR7p4EpoNWrnuT4Yv7rfW8dzUFx57zw=s1600" width=1200>
</center>
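Here's a minimal sketch of that calculation for a single segment, using NumPy. The $3 \times 3$ diagonal pattern and the three test segments are made up for illustration; they aren't the exact arrays from the figure.

```python
import numpy as np

# A 3x3 "left-to-right diagonal" pattern: +1 is a black pixel, -1 is white.
diagonal = np.array([[ 1, -1, -1],
                     [-1,  1, -1],
                     [-1, -1,  1]])


def match_score(pattern, patch):
    """Pixel-wise product of pattern and patch, averaged over all pixels."""
    return float(np.mean(pattern * patch))


same_patch = diagonal.copy()          # identical to the pattern
inverted_patch = -diagonal            # every pixel flipped
anti_diagonal = np.fliplr(diagonal)   # a diagonal running the other way

print(match_score(diagonal, same_patch))      # 1.0
print(match_score(diagonal, inverted_patch))  # -1.0
print(match_score(diagonal, anti_diagonal))   # 1/9, a weak match
```

The opposite diagonal still agrees with the pattern on five of the nine pixels (the center and the white background), which is why its score is slightly positive rather than negative.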

We then move through the image, calculating this convolution on various segments, and forming an array of convolutions.

<center>
    <img src = "https://lh3.googleusercontent.com/drive-viewer/AEYmBYTl6fIKcoL0yVtfsLrCRjxuTo7l16JZ5lf8Kjzn-NQ6EafRSN0itjNkan1T4BgLOSjhOxVMnsphILgesD65BUScxOEgIg=s1600" width=1200>
</center>
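The sliding step can be sketched in a few lines of NumPy. The `make_x` helper below draws a simplified $9 \times 9$ "X" (both diagonals running corner to corner), which is an assumption for illustration and not the exact image from the figures; `feature_map` is likewise a hypothetical helper name.

```python
import numpy as np


def make_x(size=9):
    """A size x size image of an 'X': +1 on both diagonals, -1 elsewhere."""
    image = -np.ones((size, size), dtype=int)
    idx = np.arange(size)
    image[idx, idx] = 1
    image[idx, size - 1 - idx] = 1
    return image


def feature_map(image, pattern):
    """Slide the pattern over every position where it fits and
    record the average pixel-wise product at each position."""
    k = pattern.shape[0]
    rows = image.shape[0] - k + 1
    cols = image.shape[1] - k + 1
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + k, c:c + k]
            scores[r, c] = np.mean(patch * pattern)
    return scores


diagonal = np.array([[ 1, -1, -1],
                     [-1,  1, -1],
                     [-1, -1,  1]])

scores = feature_map(make_x(), diagonal)
print(scores.shape)         # (7, 7)
print(np.round(scores, 2))  # 1.0 wherever the patch is exactly the diagonal
```

A $3 \times 3$ pattern fits in $9 - 3 + 1 = 7$ positions along each axis, so the resulting array is $7 \times 7$.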
    
This array can then inform us as to which segments of the image correspond with our pattern, and which do not. We then repeat this for the various patterns of interest for our problem.

<center>
    <img src = "https://lh3.googleusercontent.com/drive-viewer/AEYmBYSXUMm5nT9KBSBKxRmZ-M6jt40H9v18MQ9FcgiPiOAeTYw7ln6i3T525HxizO44dZkzvWB-1btsJKe77FlyWZQMCu2b=s1600" width=1200>
</center>

These convolutions are then sent to the *convolution layer* of our neural network. The convolution layer is the main building block of a CNN, and we note that it's a restricted layer - the nodes from the previous layer are only connected to a subset of nodes within the convolution layer! This can significantly decrease the number of weights in the neural network, which can decrease the variance and increase the training speed. Big win!
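As a rough sketch of how big that saving can be, compare a fully connected layer with a convolution layer that produces the same number of outputs. The sizes here are assumptions for illustration: a $28 \times 28$ single-channel input like the digits above, and 32 filters of size $3 \times 3$. Part of the saving comes from the restricted connections, and part comes from the fact that each filter reuses the same weights at every position in the image.

```python
# Illustrative sizes: a 28x28x1 input and 32 filters of size 3x3.
# A 3x3 filter fits in 26 positions along each axis, so the output is 26x26x32.
n_inputs = 28 * 28 * 1
n_outputs = 26 * 26 * 32

# Fully connected: every output has its own weight for every input, plus a bias.
dense_params = n_inputs * n_outputs + n_outputs

# Convolutional: each of the 32 filters has 3*3*1 shared weights plus one bias.
conv_params = (3 * 3 * 1 + 1) * 32

print(dense_params)  # 16981120
print(conv_params)   # 320
```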

A few important things to note about CNNs:

* Generally speaking the filters are *learned*, not prescribed as in the example above. This is one of the things that's so cool about machine learning - with enough data and the right setup, the program will actually learn which patterns are important and look for those.
* The approach is only applicable to certain types of data: specifically, data where there is a meaningful relationship between adjacent or nearby entries.

<center>
    <img src="https://lh3.googleusercontent.com/drive-viewer/AEYmBYSASlkMdeNMSFu6rSlwHPIEm1tCm_0T_O1BrnV9AfxRfBRa2FEpcRWkb_4tH0AZ31M1jImjsbjacxGzpCmYEkbsaWls=s1600" width=1200>
</center>

**Implementing CNNs in Keras**

We'll now see how we can use convolutional neural networks in Keras, and show how they can be used to improve performance even on our old favorite MNIST dataset. But first, we'll import our favorite libraries, including our new favorites.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras



Now, we'll grab the MNIST dataset from Keras.

In [14]:
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

In previous lectures, we've used a Sequential neural network model. Such models are easy to use, but their applicability is quite limited: they can only express models with a single input and a single output, applying one layer after the other in a sequential fashion. In practice, it's pretty common to encounter models with multiple inputs (say, an image and its metadata), multiple outputs (different things you want to predict about the data), or a nonlinear topology (the "shape" of the neural network).

In cases where a linear stack of layers is insufficient, models are built in Keras with the Functional API. This is what most Keras models in the "wild" use. Here's a convolutional neural network. We'll first talk about the Functional API aspects of it, and then the actual convolutional layers themselves.

In [16]:
inputs = keras.Input(shape=(28, 28, 1), name="Handwritten Digits")
x = keras.layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = keras.layers.MaxPooling2D(pool_size=2)(x)
x = keras.layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = keras.layers.MaxPooling2D(pool_size=2)(x)
x = keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = keras.layers.Flatten()(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

Let's go over this step by step. 

We started by declaring an *Input*. Note that we can give names to these inputs, but it's optional. The inputs object holds information about the shape and possibly the dtype of the data that the model will process.

In [19]:
inputs.shape

TensorShape([None, 28, 28, 1])

Here "None" is the batch dimension. It just means we haven't specified how many observations we'll be passing through the model at a time. That's fine.

In [21]:
inputs.dtype

tf.float32

Next, we create a layer and call it on the input, storing the result in "x". We then feed "x" through a sequence of new layers, each time calling the new layer on "x" and assigning the result back to "x".

Finally, we defined the nature of our output. We then instantiated our model by calling the *Model* constructor in Keras with the specified inputs and outputs. The outputs encode all the layers that create them. We can check out a summary of our model:

In [24]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Handwritten Digits (InputL  [(None, 28, 28, 1)]       0         
 ayer)                                                           
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 5, 5, 64)          0         
 g2D)                                                            
                                                             

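In total, we have a svelte 104,202 parameters. Do you know what these types of models are called? They're (somewhat tongue-in-cheek) called "non-parametric" models, even though, strictly speaking, a network with a fixed set of weights is a parametric model.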
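We can reproduce that total by hand. The sketch below uses the usual counting rules (one weight per filter entry per input channel plus one bias per filter for a convolution, and one weight per input plus one bias per unit for a dense layer). The function names and dictionary keys are just labels chosen for illustration, and the $3 \times 3 \times 128$ input to the dense layer is worked out from the layer sizes rather than read off the summary.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size^2 * in_channels weights plus one bias."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(n_inputs, n_units):
    """Each unit has one weight per input plus one bias."""
    return (n_inputs + 1) * n_units


layer_params = {
    "first conv (1 -> 32 channels)": conv2d_params(3, 1, 32),
    "second conv (32 -> 64 channels)": conv2d_params(3, 32, 64),
    "third conv (64 -> 128 channels)": conv2d_params(3, 64, 128),
    # The last feature map is 3x3x128, flattened to 1152 values.
    "dense (1152 -> 10)": dense_params(3 * 3 * 128, 10),
}

for name, count in layer_params.items():
    print(f"{name}: {count}")

total = sum(layer_params.values())
print(total)  # 104202
```

The pooling and flatten layers don't appear because they have no trainable parameters.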

Alright, now let's go through the various layers.

For the MNIST data, the first layer of the model takes a feature map of size $(28, 28, 1)$, and outputs a feature map of size $(26, 26, 32)$. The number $32$ here is specified by us, and is the number of filters that we learn for that first layer. The height and width shrink from 28 to 26 because a $3 \times 3$ window only fits in $28 - 3 + 1 = 26$ positions along each axis. So, what's produced from this first 2D convolution layer is a feature map of size $(26,26,32)$ - the 32 filters applied to each $3 \times 3$ window in the image. Note that we don't specify what these filters are - that's part of the learning process.
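The shape arithmetic is simple enough to sketch directly. This assumes the Keras defaults for `Conv2D`, namely a stride of 1 and no padding; `conv_output_size` is a hypothetical helper, not a Keras function.

```python
def conv_output_size(input_size, kernel_size):
    """Side length after a stride-1 convolution with no padding."""
    return input_size - kernel_size + 1


# A 28x28 image with a few different (square) kernel sizes.
sizes = {k: conv_output_size(28, k) for k in (3, 5, 7)}
print(sizes)  # {3: 26, 5: 24, 7: 22}

# With 32 filters of size 3x3, the output feature map has this shape.
side = conv_output_size(28, 3)
output_shape = (side, side, 32)
print(output_shape)  # (26, 26, 32)
```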

OK, so what about that *MaxPooling2D* layer? What's that? Well, it takes each non-overlapping $2 \times 2$ subgrid and outputs the maximum value observed there, separately for each feature. In other words, it aggressively downsamples our feature maps, halving their height and width (rounding down, which is why $11 \times 11$ becomes $5 \times 5$ in the summary above).
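Here's a minimal NumPy sketch of $2 \times 2$ max pooling on a single feature, assuming non-overlapping windows and dropping any leftover odd row or column. The `max_pool_2x2` helper and the example values are made up for illustration.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Max over non-overlapping 2x2 blocks; a trailing odd row/column is dropped."""
    rows, cols = feature_map.shape
    rows, cols = rows - rows % 2, cols - cols % 2
    trimmed = feature_map[:rows, :cols]
    # Axes after reshape: (block row, row in block, block col, col in block).
    blocks = trimmed.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 2]
#  [2 8]]

# An odd-sized map loses its last row and column: 11x11 -> 5x5.
print(max_pool_2x2(np.zeros((11, 11))).shape)  # (5, 5)
```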

Why would we want to do this? Well, let's take a look at what would happen if we didn't have these pooling layers:

In [29]:
inputs = keras.Input(shape=(28, 28, 1), name="Handwritten Digits")
x = keras.layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = keras.layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = keras.layers.Flatten()(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model_without_pooling = keras.Model(inputs=inputs, outputs=outputs)

We can get a summary of this model:

In [31]:
model_without_pooling.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Handwritten Digits (InputL  [(None, 28, 28, 1)]       0         
 ayer)                                                           
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 conv2d_4 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 conv2d_5 (Conv2D)           (None, 22, 22, 128)       73856     
                                                                 
 flatten_1 (Flatten)         (None, 61952)             0         
                                                                 
 dense_1 (Dense)             (None, 10)                619530    
                                                           

What are some issues with this model?

* We want to make sure that, after a few layers, the convolution maps from disparate regions of our image are interacting with each other. If we only had three Conv2D layers then the $3 \times 3$ windows in the third layer would only contain information coming from $7 \times 7$ windows in the initial input. The high-level patterns learned will still be very small with regard to the initial input, which may not be enough to learn what we need to learn. So, why not just put in enough convolution layers to fully shrink down our image? Well, that leads to our second point.

* If we only had 3 convolution layers, then the final feature map would flatten to 61,952 values, and the model would have 712,202 parameters in total, most of them (619,530) in the final dense layer. That's a lot! Unless you have a lot of data, that's way too many parameters, and you'll get intense overfitting. So, those pooling layers help decrease the overall size of your model, which makes it faster to train and more robust (less prone to overfitting). Note that if we added a bunch more convolution layers, we could end up with a ridiculously large number of parameters - much more than every pixel in our training data.
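The $7 \times 7$ figure in the first point can be checked with a small receptive field calculation, and the same calculation shows what the pooling layers buy us. This is a sketch using the standard recurrence for stacked layers; `receptive_field` is a hypothetical helper, and the layers are described only by their window size and stride.

```python
def receptive_field(layers):
    """Side length of the input region seen by one output value.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of the
        # current spacing between neighbouring values, measured in input pixels.
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


conv = (3, 1)  # 3x3 convolution, stride 1
pool = (2, 2)  # 2x2 max pooling, stride 2

without_pooling = receptive_field([conv, conv, conv])
with_pooling = receptive_field([conv, pool, conv, pool, conv])

print(without_pooling)  # 7
print(with_pooling)     # 18
```

So with the two pooling layers in place, each value in the final feature map draws on an $18 \times 18$ region of the $28 \times 28$ input, rather than a $7 \times 7$ one.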

Alright, let's see how these models actually perform:

In [34]:
model_without_pooling.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_without_pooling.fit(train_images, train_labels, epochs=5, batch_size = 64)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7f0341dc5c10>

That's a pretty great accuracy! On... the training data. Let's see how it does on the test data.

In [36]:
model_without_pooling.evaluate(test_images, test_labels)



[0.029887810349464417, 0.9904000163078308]

In [37]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size = 64)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7f03342271f0>

Similarly great accuracy! Let's check out how it does on the test data.

In [39]:
model.evaluate(test_images, test_labels)



[0.028450777754187584, 0.9908999800682068]

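Look at that! It does a little better on the test data (about 99.09% accuracy versus 99.04%), and it does so with roughly one-seventh as many parameters (104,202 versus 712,202), plus it trains a lot faster.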