# Introduction to Convolutional Neural Networks (CNNs)

In this notebook, we introduce Convolutional Neural Networks (CNNs) or convnets. We try to understand how convnets are used to solve perception tasks such as vision. Later we will discuss the use of convnets in Natural Language Processing, e.g., text and speech processing.

Previously we studied the fully-connected (or densely connected) Multi-Layer Perceptron (MLP). Let's motivate the need for convnets by comparing them with fully-connected MLPs.

## Motivation: On the Horizon of the Fully-Connected or Dense MLP Networks

In traditional feedforward neural networks, e.g., in MLPs, each neuron in the input layer is connected to every neuron in the next layer. We call this a fully-connected (FC) layer (or densely connected layer aka dense layer).

<img src="https://cse.unl.edu/~hasan/Pics/MLP_FC.png" width=600, height=300>

The distinctive aspect of FC MLPs is that they operate directly on the raw pixels. Consider the following example. A 28 x 28 pixel handwritten-digit image is flattened, creating a 784-dimensional vector. This vector is the input to the FC MLP. The FC network uses these raw pixel data for detecting global patterns in the pixel distribution in the images. For example, in the MNIST dataset all images are center positioned. Thus, images representing the digit "4" will have their pixels distributed around the center of the image in a similar fashion. This distribution pattern is a **global** property of the image dataset, i.e., all images representing "4" share the same property. When we flatten the image, the resulting input vector retains this global property. To identify this global pattern, FC networks need to operate on the entire set of pixels en masse.

<img src="https://cse.unl.edu/~hasan/Pics/MLP_FC_FlattenedInput.png" width=600, height=400>

However, this arrangement of FC networks suffers from two key limitations:
- Unable to Maintain Spatial Invariance
- Don't Scale Well With the Size of the Images

Let's discuss these limitations to motivate our discussion on convnets.

### FC Networks: Unable to Maintain Spatial Invariance

Consider the following figure. We have shifted the location of the digit "4" spatially along both vertical and horizontal axes. We want an FC network to recognize these translated images irrespective of the locations of the digits on the images. This property is known as **translation invariance**. The following demo shows that the accuracy of an FC MLP trained on the original MNIST data drops from 98% to 67% when we test it on shifted images.
https://github.com/rhasanbd/Artificial-Neural-Network-Investigation-of-Translation-Invariance-Property

<img src="https://cse.unl.edu/~hasan/Pics/MLP_MNIST_ShiftedImages.png" width=1000, height=800>

The reason FC networks are unable to achieve translation invariance is that they are only able to detect global patterns of the raw pixels. Consider the following figure. Observe that after flattening the images, the pixel patterns are different across the shifted images.



<img src="https://cse.unl.edu/~hasan/Pics/MLP_MNIST_Shifted_Flattened.png" width=400, height=200>
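The effect of a shift on the flattened input can be sketched with a toy example. The 5 x 5 image, the one-pixel-wide vertical stroke, and the helper names `make_image` and `flatten` are assumptions made for illustration; they are not taken from the MNIST demo.

```python
def make_image(stroke_col, size=5):
    """Toy size x size image with a vertical stroke in one column (1 = ink, 0 = background)."""
    return [[1 if c == stroke_col else 0 for c in range(size)] for _ in range(size)]

def flatten(image):
    """Concatenate the rows into one 1D list, which is what an FC network receives."""
    return [pixel for row in image for pixel in row]

centered = flatten(make_image(stroke_col=2))
shifted = flatten(make_image(stroke_col=3))  # same stroke, moved one pixel to the right

# Positions in the flattened vector where the ink lands.
centered_idx = [i for i, p in enumerate(centered) if p == 1]
shifted_idx = [i for i, p in enumerate(shifted) if p == 1]

# Number of input positions that are active in both versions.
overlap = len(set(centered_idx) & set(shifted_idx))

print(centered_idx)  # [2, 7, 12, 17, 22]
print(shifted_idx)   # [3, 8, 13, 18, 23]
print(overlap)       # 0
```

In this toy case a one-pixel shift moves every active input to a different position, so weights tuned to the centered stroke see none of the ink in the shifted one.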


### FC Networks: Don't Scale Well With the Size of the Images

In the previous MNIST demo, the images are grayscale, i.e., single-channel. After flattening, the length of the input vector was 784. Thus, each neuron in the hidden layer has 784 input weights. In the single hidden layer we used 200 neurons, which gives 784 x 200 = 156,800 weights in that layer alone.

<img src="https://cse.unl.edu/~hasan/Pics/MLP_FC_Scalability.png" width=600, height=400>

Now consider RGB images. Assume that the size of the images is 224 x 224 pixels. The length of the input vector, and hence the number of weights per hidden neuron, will increase to 224 x 224 x 3 = 150,528, which is huge! In addition, we will have to add multiple hidden layers with many neurons in each. Thus, the number of parameters and connections needed to discover patterns in these large images will be prohibitively large. Even if we could afford this expense, such a large FC network would be prone to overfitting.
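As a rough check of the arithmetic, the sketch below counts the parameters of the first FC layer for both input sizes. It assumes one bias per neuron and reuses the 200-neuron hidden layer from the MNIST demo for the RGB case; the helper name `dense_layer_params` is made up for this example.

```python
def dense_layer_params(n_inputs, n_neurons):
    """Weights (n_inputs per neuron) plus one bias per neuron in a fully-connected layer."""
    return n_inputs * n_neurons + n_neurons

mnist_inputs = 28 * 28 * 1   # flattened grayscale MNIST image
rgb_inputs = 224 * 224 * 3   # flattened 224 x 224 RGB image
hidden = 200                 # hidden-layer width assumed for both cases

mnist_params = dense_layer_params(mnist_inputs, hidden)
rgb_params = dense_layer_params(rgb_inputs, hidden)

print(mnist_inputs, mnist_params)  # 784 157000
print(rgb_inputs, rgb_params)      # 150528 30105800
```

Under these assumptions, the first layer alone grows from about 157 thousand to about 30 million parameters.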


- How do we resolve the two above-mentioned limitations of FC networks?

We use convnets!

# Convnets

First, we will show how the search for spatial invariance leads to the general convnet architecture. For now let's not worry about convnets. Just assume that we know about FC MLPs and how they detect **global** patterns or features. We want to create a new type of MLP architecture that is able to recognize a digit in an image even though its global position may change (i.e., it is no longer centered).


## Convnets: Spatial Invariance

To achieve this goal, we may replace the problem of global feature detection with the simpler problem of detecting **local features**. As an example, consider the following image. The **global** representation of the digit "4" consists of **local** representations of three symbols. These are local features. These three features can appear anywhere in the image to construct the global view of "4".


<img src="https://cse.unl.edu/~hasan/Pics/CNN_Image_SpatialStructure.png" width=600, height=600>

Thus, instead of trying to globally detect "4", we may focus on detecting the local features. We can do this as follows.
- Don't flatten the image.
- Focus on a small region on the image.
- Scan that region using an MLP to detect a local feature, e.g., horizontal stroke, vertical stroke.

In other words, we use an MLP with a small receptive field for scanning a small region of the image, as shown below.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Scanning_MLP.png" width=500, height=500>

In our simplified example, we wanted to detect three local features: a tilted stroke, a horizontal stroke, and a vertical stroke. Thus, we create three MLPs and use those to scan small regions of an image to detect the three local features, illustrated below. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Scanning_MLPs.png" width=600, height=600>

The MLPs will process the entire image successively scanning every small region so that if these features appear anywhere on the image, we will be able to detect those.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Scanning.png" width=500, height=500>

The use of multiple MLPs with **small receptive fields** enables us to achieve spatial invariance. Consider the following figure. Irrespective of the location of "4" in the following images, we can detect the tilted and horizontal strokes. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Spatial_Invariance.png" width=800, height=600>
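The scanning idea can be sketched without any neural network machinery. In the example below a fixed, hand-picked 1 x 3 template stands in for a trained MLP that detects a horizontal stroke; the template, the 6 x 6 toy images, and the helper names are all assumptions made for illustration.

```python
def scan(image, template):
    """Slide the template over every position and record the match score at each location."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * template[i][j]
                 for i in range(th) for j in range(tw))
             for c in range(iw - tw + 1)]
            for r in range(ih - th + 1)]

def place_stroke(top, left, size=6):
    """Blank size x size image with a 3-pixel horizontal stroke starting at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for j in range(3):
        image[top][left + j] = 1
    return image

horizontal = [[1, 1, 1]]  # hypothetical stand-in for a learned horizontal-stroke detector

def best_match(image):
    """Return the highest score and the first location where it occurs."""
    scores = scan(image, horizontal)
    best = max(max(row) for row in scores)
    location = next((r, c) for r, row in enumerate(scores)
                    for c, s in enumerate(row) if s == best)
    return best, location

print(best_match(place_stroke(1, 0)))  # (3, (1, 0))
print(best_match(place_stroke(4, 3)))  # (3, (4, 3))
```

The same detector gives the same peak response wherever the stroke appears; only the location of the peak changes.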

Once we are able to recognize the three low-level features, we can compose high-level features by employing more MLPs that operate on these low-level feature maps, as shown below.


MLPs 4 and 5 will scan the feature maps to create high-level features. Finally, we can employ another MLP to use the high-level features as input for detecting class-level information, such as whether the image represents "4".
<img src="https://cse.unl.edu/~hasan/Pics/CNN_Scanning_MLPs_HighLevelFeatures.png" width=500, height=500>

The above illustration suggests that we can design a new type of MLP architecture by using multiple MLPs.
- The architecture is hierarchical and is divided into multiple layers.
- In each layer we have a set of MLPs to detect the local features by scanning small spatial regions.
- Each layer learns features of increasing complexity by scanning the feature maps from the previous layer.
- The final layer combines the high-level representations to determine the class information.

This new architecture, as shown below, forms the basis of convnets.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_HierarchicalRepresentation.png" width=900, height=600>


Observe that both FC MLPs and convnets are based on layered architectures. However, they differ fundamentally in the tasks of the layers. While the goal of the FC MLP layers is to learn global patterns by using the entire input space, convnet layers learn local features. In the case of FC MLPs, if a pattern appears at a new location, the network has to learn it anew. Convnets don't have this problem, which makes them data efficient. Convnets require less data to learn representations, and thus they tend to generalize better.



## Convnets: Efficient Scalability


To understand how convnets scale well with the size of the images, consider the following figure. For the input layer of the FC MLP, the number of weights per neuron is equal to the length of the flattened input vector, i.e., the number of pixels. For a 3 x 3 image we need 9 input weights. On the other hand, in convnets, since we scan only a small region of the image, the number of weights is significantly smaller. In this example we scan 2 x 2 regions, for which we need only 4 weights.

The most important aspect of convnets is that even when the input image size is increased, the receptive field remains small (usually 5 x 5 or 3 x 3). As a consequence, the number of weights doesn't increase much. This makes convnets scale well with image size, which in turn reduces the risk of overfitting.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_FC_MLP.png" width=900, height=600>

Also notice that the number of connections in FC MLPs is significantly larger. Consider a 224 x 224 pixel image. The number of neurons in the input layer will be 50,176. Let's say that in the next layer we have 1000 neurons (which is much smaller than the size of the input). Then, the number of connections will be more than 50 million!

Convnets not only reduce the number of weights, they also share them. For example, in the figure below, all 2 x 2 input regions are scanned by the same 4 weights. These weights are shared across the entire image.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_WeightSharing.png" width=400, height=400>
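The following sketch contrasts the weights needed by one FC neuron with the weights of one shared 2 x 2 filter as the image grows. It assumes single-channel square images, a stride of 1, no padding, and no bias terms; the helper names are made up for this example.

```python
def fc_weights_per_neuron(height, width):
    """One weight per pixel of the flattened image."""
    return height * width

def conv_filter_weights(f_h, f_w):
    """One weight per cell of the filter, independent of the image size."""
    return f_h * f_w

def scan_positions(height, width, f_h, f_w):
    """Number of regions the filter visits at stride 1 with no padding."""
    return (height - f_h + 1) * (width - f_w + 1)

rows = []
for size in (3, 28, 224):
    rows.append((size,
                 fc_weights_per_neuron(size, size),
                 conv_filter_weights(2, 2),
                 scan_positions(size, size, 2, 2)))

for row in rows:
    print(row)
# (3, 9, 4, 4)
# (28, 784, 4, 729)
# (224, 50176, 4, 49729)
```

The filter visits many more regions as the image grows, but it reuses the same 4 weights at every one of them.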



## Convnets: Compositionality

So far, we have discussed two benefits of convnets: spatial invariance and scalability with respect to the input size.

Another benefit is compositionality. Each filter composes a local patch of lower-level features into a higher-level representation, similar to how we can compose a set of mathematical functions that build on the output of previous functions: $f(g(h(x)))$. This composition allows the convnet to learn richer features deeper in the network. For example, a convnet may build edges from pixels, shapes from edges, and then complex objects from shapes. This process happens automatically during training. The concept of building higher-level features from lower-level ones is exactly why convnets are so powerful in computer vision.





## Convnets are Designed by "God"

- How did we design the ingenious architecture of convnets for solving perception tasks?

Well, we didn't. We only emulated "God"! 


Convnets are designed based on the architecture of the visual cortex. The fundamental observation about the visual world is that it is **spatially hierarchical**. To create a complete representation of an image, the visual cortex creates successive layers of representations of that image.

This property of the visual cortex was first studied by David H. Hubel and Torsten Wiesel during 1958 and 1959. They performed painful experiments on cats and monkeys to understand the visual perception process of living creatures.

In their experimental setting, beams of light of different patterns were projected onto the retina of an anesthetized cat through its fully open (slitted) iris. The responses were then measured in the primary visual cortex, which is located at the back of the head. It is also known as the striate cortex or V1.

They showed that many cortical units or neurons in the visual cortex have a small local receptive field. They also showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations. This implies that two neurons may have the same receptive field but react to different line orientations.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_Hubel_Wiesel.png" width=900, height=600>


Hubel and Wiesel also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons. This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

A key limitation of the Hubel-Wiesel model is that it could not explain the position invariance of visual cortex. In 1980 Kunihiko Fukushima attempted to overcome this limitation by proposing the **Neocognitron** model for visual cortex. 

According to this model, the visual system consists of a hierarchy of modules. Each module comprises a layer of "S-cells" followed by a layer of "C-cells".
- S-cells respond to the signal in the previous layer.
- C-cells confirm the S-cells' response.

Only S-cells are "plastic" (i.e., learnable) and C-cells are fixed in their response. 

Each module in the neocognitron includes a layer of S-cells and a layer of C-cells. In each subsequent module, the planes of the S layers detect plane-specific patterns in the previous layer (C layer or retina).

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Neocognitron_Fukushima.png" width=900, height=600>

Each cell in a plane "looks" at a region of the plane's input that is slightly shifted relative to the regions seen by its adjacent cells. The planes of the C layers "refine" the response of the corresponding planes of the S layers. The deeper the layer, the larger the receptive field of each neuron. Cell planes get smaller with layer number.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Neocognitron_ReceptiveField.png" width=600, height=400>

The learning in Fukushima's Neocognitron model was based on a fully unsupervised technique. It was able to learn semantic labels automatically. 

The next key idea was to **add supervision to the learning** in the neocognitron model. Yann LeCun and his colleagues pioneered this direction, starting with convnets trained by backpropagation in 1989 and culminating in LeNet-5 in 1998, the best known of these early convnets!

LeCun used the Backpropagation algorithm for training his famous LeNet-5 convnet classifier. It was commercially successful for recognizing handwritten numbers on bank checks.

LeCun combined the FC or dense layers and sigmoid activation functions with two new building blocks: convolutional layers and pooling layers. Next we deconstruct the general convnet architecture. 


<img src="https://cse.unl.edu/~hasan/Pics/CNN_LeNet5.png" width=900, height=500>


## Convnets: General Architecture


For perceiving an input image, convnets create successive layers of representations of the image. Image sub-regions are scanned by small MLPs, which are known as convolutional **filters or kernels**. These filters scan across the image to detect local features. The scanning process is known as **convolution**. The output of the convolution of an input by a filter is known as a **feature map**. These feature maps are then scanned by filters in the successive layers.

Before using the feature maps of a layer for convolution they are passed through a nonlinear activation function in the activation layer. Usually the activation layers are not separately shown. They are included within the convolution layer.

The spatially convolved activated feature maps are usually downsampled using pooling layers. Historically pooling was used to reduce the spatial dimension of the feature maps as well as to add invariance to small translations (i.e., local invariance).
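A minimal sketch of max pooling with a 2 x 2 window and a stride of 2 shows both effects on a toy feature map. The 4 x 4 maps and the function name `max_pool_2x2` are assumptions made for illustration; a trailing odd row or column is simply dropped here.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2; a trailing odd row or column is dropped."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]

# A single strong activation in the top-right 2 x 2 window.
original = [[0, 0, 9, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

# The same activation moved by one pixel, still inside the same window.
shifted = [[0, 0, 0, 0],
           [0, 0, 0, 9],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

print(max_pool_2x2(original))  # [[0, 9], [0, 0]]
print(max_pool_2x2(shifted))   # [[0, 9], [0, 0]]
```

The 4 x 4 map shrinks to 2 x 2, and the small shift leaves the pooled output unchanged. A shift that crosses a window boundary would change the output, which is why the invariance is only local.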


The final layer of a convnet is the classification layer, which is usually a Softmax layer. Before passing the signal through the classification layer it needs to be flattened. The flattening layer is not separately shown, and is included within the Softmax layer.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Architecture_Cartoon.png" width=600, height=300>

Convnet architectures vary in the way the layers are stacked. It is mainly an engineering problem. Modern convnet architectures develop novel and useful stacking strategies to achieve both effectiveness and efficiency. We will discuss modern convnet architectures later. For now we will use the classical LeNet-5-style architecture to present the layers and fundamental building blocks of convnets.


## Convnets: Layer Types

The most fundamental layers of a convnet are:
- Convolutional (Conv)
- Activation (usually included with the Conv layer)
- Pooling (Pool)
- Fully-connected (FC)
- Classification or Softmax 

Stacking a series of these layers in a specific manner yields a convnet.


## Practical Convnets: Formalization

In practical convnets multi-dimensional input is used, e.g., an RGB image with 3 channels. Each input channel is scanned by its own 2D slice of the filter. Thus, to scan a 3-channel image a 3-channel filter is needed.


We now formalize the general convnet architecture. The 3D input with $n_k$ channels is represented by a 3D tensor $X$ of size $n_h \times n_w \times n_k$. To scan this 3D input, a single filter also has to be 3D, of size $f_h \times f_w \times n_k$.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_Architecture_FlowDiagram.png" width=800, height=600>

Each 3D filter computes a single 2D feature map, denoted by $Z$ (shown below). There are $f_k$ filters in total, which are represented by a 4D tensor of size $f_h \times f_w \times n_k \times f_k$. The $f_k$ filters produce $f_k$ 2D feature maps in the next layer. Thus the feature map tensor is 3D of size $\text{height} \times \text{width} \times f_k$. After activation there will be $f_k$ 2D maps, which are denoted by $Y$.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Architecture_FlowDiagram_1.png" width=700, height=500>
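The size of the 4D filter tensor gives the number of trainable weights in a conv layer. The sketch below counts them, assuming one additional bias per filter (the default behavior of the Keras `Conv2D` layer); the helper name `conv_layer_params` is made up for this example.

```python
def conv_layer_params(f_h, f_w, n_k, f_k):
    """Weights in the f_h x f_w x n_k x f_k filter tensor plus one bias per filter (assumed)."""
    return f_h * f_w * n_k * f_k + f_k

# 3 x 3 filters applied to inputs with 1, 32 and 64 channels.
first = conv_layer_params(3, 3, n_k=1, f_k=32)
second = conv_layer_params(3, 3, n_k=32, f_k=64)
third = conv_layer_params(3, 3, n_k=64, f_k=64)

print(first, second, third)  # 320 18496 36928
```

These three configurations correspond to the conv layers of the simple convnet built later in this notebook, and the counts agree with the `Param #` column of its model summary.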


The above illustration is distilled into the following 3D representation of the convnet architecture.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Architecture_Cartoon_3D.png" width=800, height=600>


## Convnets Quickstarter: TensorFlow

Before we delve deeper into the convolution and pooling processes in convnets, and into convnet training issues, let's get a quick taste of building a simple convnet architecture using TensorFlow.

We will stack a series of the fundamental convnet layers sequentially in our design:

- Conv layer (with activation): 32 filters of size 3 x 3
- Pool layer (max pooling)
- Conv layer (with activation): 64 filters of size 3 x 3
- Pool layer (max pooling)
- Conv layer (with activation): 64 filters of size 3 x 3
- Flatten layer
- FC layer: 64 neurons
- Softmax layer: 10 neurons


We will use this convnet to solve the MNIST 10-class classification problem.

First, let's construct our convnet architecture.

In [10]:
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten


model = Sequential(name='Simple_Convnet')
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1), 
                 padding='valid')) 
model.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu')) 
model.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=10, activation='softmax'))
model.summary()

Model: "Simple_Convnet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_6 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_2 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)             

## Convnet for MNIST Classification


The first conv layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32). It computes 32 filters over its input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.
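The spatial sizes in the model summary can be traced by hand. The sketch below assumes what the model code specifies: 3 x 3 filters with 'valid' padding and a stride of 1, and 2 x 2 max pooling with a stride of 2. The helper names `conv_valid` and `pool` are made up for this example.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool(size, window=2, stride=2):
    """Output size of pooling without padding; an incomplete last window is dropped."""
    return (size - window) // stride + 1

size = 28
trace = [size]
for layer in ("conv", "pool", "conv", "pool", "conv"):
    size = conv_valid(size) if layer == "conv" else pool(size)
    trace.append(size)

# The last conv layer has 64 filters, so flattening gives size * size * 64 values.
flattened = size * size * 64

print(trace)      # [28, 26, 13, 11, 5, 3]
print(flattened)  # 576
```

The trace reproduces the height and width of every output shape in the summary, and the flattened length matches the 576 inputs of the first dense layer.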

The Keras Conv2D layer is created by providing the number of filters (depth) and the dimension of each filter. For the first layer, we also specify the input shape by providing its height, width and depth (number of channels). A convolution works by sliding the 3 × 3 filters over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features. Each such 3D patch is then transformed into a 1D vector of shape (output_depth,) by multiplying it with the learned filter weights. All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). We will explain the convolution process and pooling in greater detail later. Now, let's train this simple convnet to classify the MNIST data.

In [6]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255  # scale the training images
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255  # scale the test images
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Compile the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5, batch_size=64)

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print("\nTest Accuracy: ", test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accuracy:  0.9912999868392944


## Convnets: Beyond "Black Art"

What just happened is nothing short of magic. We achieved a stunning 99% accuracy on the test data using fewer than 20 statements!

How exactly did the convnet achieve this?

We have a vague understanding of how convnets create hierarchical representations to determine the class-level information about an input. This conceptual vagueness is overshadowed by TensorFlow's convenient LEGO blocks. We didn't have to know much about convnets to create such a powerful model. The TensorFlow-based exercise is no more than a "black" art. But this shallow grasp is not very useful.

How do we unravel the "black" art of building convnets?

This notebook series tells that story!
