<a href="https://colab.research.google.com/github/wenxuan0923/My-notes/blob/master/DL_Cov2D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The basics of convnets
This note will mainly focus on **feature maps**, **convolution** and **max pooling**. The following questions will be answered along the way:

* What does a Conv2D layer do
* What does a MaxPooling2D layer do
* How to interpret the output shape
* How to implement a simple convolutional neural network

We will continue using the well-known MNIST dataset of handwritten digits for illustration.

In [0]:
import keras
from keras import models
from keras import layers
from keras.datasets import mnist
from keras.utils import to_categorical

In [2]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)
print(y_train.shape)

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
(60000, 28, 28)
(60000,)


Before diving into the CNN model, we need to process the data to make sure it is in the form that a convolutional neural network expects. To be more specific, we need to:

**1. Convert x_train and x_test into 4D tensors** 
of shape (samples, height, width, color_depth), which can be processed by Conv2D layers 

**2. Scale the data so that all values are in the [0, 1] interval** to make it easier for our model to converge

**3. Categorically encode the labels using one-hot encoding**, which can be easily done with the `keras.utils.to_categorical()` function, so that each target becomes a 10-dimensional vector consisting of 0s and a single 1

In [0]:
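def preprocess_input(x):
    # Add a channels axis of size 1 and scale pixel values from [0, 255] to [0, 1]
    desired_shape = (-1, 28, 28, 1)
    return x.reshape(desired_shape) / 255.0

def preprocess_output(y):
    # One-hot encode the integer labels (to_categorical was imported above)
    return to_categorical(y)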

In [9]:
x_train, x_test = map(preprocess_input, [x_train, x_test])
y_train, y_test = map(preprocess_output, [y_train, y_test])

print(x_train.shape)
print(y_train.shape)

(60000, 28, 28, 1)
(60000, 10)
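
To make step 3 more concrete, here is a minimal NumPy sketch of what one-hot encoding does to a handful of labels. The helper name `one_hot` is made up for illustration; in this note the actual work is done by `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([5, 0, 4])
print(encoded.shape)  # (3, 10)
print(encoded[0])     # 1.0 at index 5, 0.0 everywhere else
```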


## What does a Conv2D layer do?

When you look at most images, you will notice there is a lot of wasted space in each image. While there are 28*28=784 pixels, it will be interesting to see if there is a way to condense the image down to the important features. That is where convolutions come in.


So how does Conv2D extract useful information? Let's break down the concept little by little:
* Convolutions operate over 3D tensors called **feature maps** 
> In our case it is (28, 28, 1), which represents (height, width, input_depth) respectively

* The input_depth axis is also called **channels** axis
> * For an RGB image, it has 3 channels: Red, Green and Blue
> * For a black-and-white image, like the MNIST digits, the depth is 1: gray

* The Conv2D layer extracts 3D patches of size `(patch_window_height, patch_window_width, input_depth)` from its input feature map
> It slides a window of size `(patch_window_height, patch_window_width)` over the input feature map and stops at every possible location. Each of these locations defines a patch

* A transformation is applied to each of these 3D patches to convert it into a 1D vector of shape `(output_depth, )`
> The transformation is a dot product with a learned weight tensor (the **convolution kernel**) plus a bias, and it generates one value for each output_depth

* The output_depth can be any number, and its channels no longer represent colors; they stand for **filters** instead

* Each filter encodes a specific aspect of the input data by applying a different transformation

* All of these 1D vectors are then *spatially* reassembled into a 3D **output feature map** of shape (height, width, output_depth)


* The `patch_window_height`, `patch_window_width` and `output_depth` are defined in the layers as input arguments: <br>
 > `layers.Conv2D(output_depth, (patch_window_height, patch_window_width))`
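
The steps above can be sketched in plain NumPy. This is only an illustration under simplifying assumptions (no padding, stride 1, no activation, random weights instead of learned ones), not how Keras implements the layer; the function name `conv2d_valid` is made up for this note.

```python
import numpy as np

def conv2d_valid(feature_map, kernels, bias):
    """Naive convolution sketch: no padding, stride 1.

    feature_map: (height, width, input_depth)
    kernels:     (window_height, window_width, input_depth, output_depth)
    bias:        (output_depth,)
    """
    h, w, _ = feature_map.shape
    kh, kw, _, out_depth = kernels.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    output = np.zeros((out_h, out_w, out_depth))
    for i in range(out_h):
        for j in range(out_w):
            # One 3D patch per window location.
            patch = feature_map[i:i + kh, j:j + kw, :]
            # Each filter reduces the patch to one number, so the patch
            # becomes a 1D vector of length output_depth.
            output[i, j, :] = np.tensordot(patch, kernels, axes=3) + bias
    return output

rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))               # one MNIST-sized feature map
kernels = rng.standard_normal((3, 3, 1, 32))  # 32 filters with a 3x3 window
out = conv2d_valid(image, kernels, np.zeros(32))
print(out.shape)  # (26, 26, 32)
```

The nested loops are slow but keep the correspondence with the bullet points visible: one patch per location, one output value per filter.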

### Example 1: A naive example with `input_depth = 1` and `output_depth = 2`
Consider a naive example: say we have an **input feature map of shape (3, 3, 1)**, a grayscale image, and we define the Conv2D layer as `layers.Conv2D(2, (2, 2))`. By sliding a window of size (2, 2) over the input feature map and stopping at all possible locations, we extract **four** 3D patches, each of size (2, 2, 1). Exactly one value is generated for each patch at each output_depth, and reassembling these values spatially gives a 3D output feature map of shape (2, 2, 2).


<p align="center">
<img src = 'https://drive.google.com/uc?id=1asIzBttuFQQ_XmGddtuw5l6DlqH5Lyav'
height="440" style="vertical-align:middle"/>
</p>

**Code it up using keras:**

In [17]:
model = models.Sequential()
model.add(layers.Conv2D(2, (2, 2), input_shape=(3, 3, 1), activation='relu'))
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 2, 2, 2)           10        
Total params: 10
Trainable params: 10
Non-trainable params: 0
_________________________________________________________________


Note the output shape of the Conv2D layer is (None, 2, 2, 2), which is exactly what we expected. The leading `None` is the batch dimension, which can be any size.

### Example 2: A slightly more complicated example with `input_depth = 2` and `output_depth = 3`

Consider a slightly more complicated example: say we have an
**input feature map of shape (5, 5, 2)**, and we define the Conv2D layer as `layers.Conv2D(3, (3, 3))`. By sliding a window of size (3, 3) over the input feature map, we extract **nine** 3D patches, each of size (3, 3, 2). Because output_depth = 3, each patch is converted into a 1D vector of shape (3, ). Reassembling these vectors spatially gives a 3D output feature map of shape (3, 3, 3).


<table>
  <tr>
    <td><img src = 'https://drive.google.com/uc?id=1X8H0LrLtZjRZ-vWICH0e4DjbDF9PFO5a' height="330"/></td>
    <td><img src = 'https://drive.google.com/uc?id=1SVDdgtUrF-Hd9gjFKzsnyIlmD1u1nXsD' height="540"/></td>
  </tr>
</table>

**Code it up using keras:**

In [21]:
model = models.Sequential()
model.add(layers.Conv2D(3, (3, 3), input_shape=(5, 5, 2), activation='relu'))
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_8 (Conv2D)            (None, 3, 3, 3)           57        
Total params: 57
Trainable params: 57
Non-trainable params: 0
_________________________________________________________________
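
The `Param #` column in these summaries can be checked by hand: each filter holds one weight per entry of a patch, plus one bias. A small sketch of that arithmetic follows; the function name is hypothetical and not part of Keras.

```python
def conv2d_param_count(window_height, window_width, input_depth, output_depth):
    """Trainable parameters of a Conv2D layer that uses a bias term."""
    weights_per_filter = window_height * window_width * input_depth
    # One extra parameter per filter for its bias.
    return (weights_per_filter + 1) * output_depth

print(conv2d_param_count(2, 2, 1, 2))  # Example 1: 10
print(conv2d_param_count(3, 3, 2, 3))  # Example 2: 57
```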


### Example 3: Stacking multiple Conv2D layers


In [23]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_9 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 24, 24, 64)        18496     
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 22, 22, 64)        36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


From the examples above, we can see the output width and height may differ from the input width and height:
* Example 1: (3, 3) $\Rightarrow$ (2, 2)
* Example 2: (5, 5) $\Rightarrow$ (3, 3)
* Example 3: (28, 28) $\Rightarrow$ (26, 26) $\Rightarrow$ (24, 24) $\Rightarrow$ (22, 22) 


The feature map shrinks a little after going through each Conv2D layer because of the so-called **border effect**: a window of size (3, 3) can only be centered on positions that are at least one pixel away from the border. If you want an output feature map with the same spatial dimensions as the input, you can use the **padding** argument of the Conv2D layer. We will talk about it in another note. For now, we will just use the default setting `padding = "valid"`, which means no padding.
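
The shrinking follows a simple rule when there is no padding and the window moves one pixel at a time: the output size is the input size minus the window size, plus one. A short sketch that reproduces the three examples (the helper name is made up for illustration):

```python
def conv_output_size(input_size, window_size):
    """Output length along one axis for padding='valid' and stride 1."""
    return input_size - window_size + 1

print(conv_output_size(3, 2))  # Example 1: 2
print(conv_output_size(5, 3))  # Example 2: 3

size = 28
for _ in range(3):             # Example 3: three stacked 3x3 layers
    size = conv_output_size(size, 3)
    print(size)                # 26, then 24, then 22
```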

## What does a MaxPooling2D layer do?

When combined with a pooling method, the model we just built can become much more powerful.

Pooling is a way of compressing an image. The MaxPooling2D layer aggressively downsamples feature maps by extracting windows from the input feature maps and outputting the max value of each channel.
> **Meaning of downsample**: to reduce the resolution of a signal by keeping fewer samples. For an image, this means fewer pixels along the height and width

A quick and easy way to do this is to go over the image 4 pixels at a time, i.e. one 2*2 window at a time, pick the biggest value in each window and keep just that.

<p align="center">
<img src = 'https://drive.google.com/uc?id=1ow2wcWhZSGNYgezsE1InLgaXS4nslM5G'
height="420" style="vertical-align:middle"/>
</p>
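
The 2x2 pooling step described above can be sketched in NumPy as follows. This is a simplified illustration that assumes non-overlapping windows (stride 2) and no padding; `max_pool_2x2` is a made-up name, and the input values are arbitrary.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width, channels) array."""
    h, w, c = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing odd row/column, then group pixels into 2x2 blocks.
    trimmed = feature_map[:h2 * 2, :w2 * 2, :]
    blocks = trimmed.reshape(h2, 2, w2, 2, c)
    # Keep the largest value of each block, separately for each channel.
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 5],
              [7, 1, 9, 8],
              [0, 6, 3, 2]], dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[:, :, 0])
# [[4. 5.]
#  [7. 9.]]
```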



This method can be easily implemented by adding a `layers.MaxPooling2D((2, 2))` layer into our network. For every 4 pixels, only the biggest will survive.

In [31]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_15 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 13, 13, 32)        0         
Total params: 320
Trainable params: 320
Non-trainable params: 0
_________________________________________________________________


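Note: if you are wondering why the output shape changes from (26, 26) to (13, 13): each (2, 2) window is reduced to a single value and the windows do not overlap, so both the height and the width are halved.
> $26/2 = 13$, so each channel keeps $(26*26)/4 = 13*13 = 169$ values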
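
The same arithmetic as a small sketch, assuming the stride equals the pool size and no padding is used (the helper name is hypothetical). When the input size is odd, the last row and column do not fill a complete window and are dropped, which is why a later summary in this note goes from 11 to 5.

```python
def pool_output_size(input_size, pool_size=2):
    """Output length along one axis when stride == pool_size, no padding."""
    return input_size // pool_size

print(pool_output_size(26))  # 13
print(pool_output_size(11))  # 5
```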

We can then add another convolutional layer and another max-pooling layer, so that the network learns another set of convolutions on top of the existing one and then pools again to reduce the size. So, by the time the image reaches the flatten step and goes into the dense layers, it is already much smaller: it has been quartered, and then roughly quartered again. Its content has been greatly simplified, which helps mitigate overfitting.


In [33]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_16 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_17 (Conv2D)           (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_18 (Conv2D)           (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


To make the final prediction, we need to flatten the 3D feature map and feed it into dense layers, just like what we did in a regular deep neural net. MNIST is a classification problem with 10 classes, so the output layer has 10 units with softmax as the activation function.

In [34]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_19 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_20 (Conv2D)           (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_21 (Conv2D)           (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_2 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)              

Note that the `layers.Flatten()` layer flattens the 3D outputs of shape (3, 3, 64) to 1D: a vector with 576 numbers (3 * 3 * 64 = 576).
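
As a quick check of that arithmetic, the sketch below flattens an array of the same shape with NumPy and counts the parameters of the two dense layers that follow. The helper `dense_param_count` is made up for illustration; it assumes a fully connected layer with one bias per unit.

```python
import numpy as np

feature_map = np.zeros((3, 3, 64))  # shape of the last Conv2D output
flat = feature_map.reshape(-1)      # what flattening does to one sample
print(flat.shape)                   # (576,)

def dense_param_count(inputs, units):
    # Weight matrix of shape (inputs, units) plus one bias per unit.
    return inputs * units + units

print(dense_param_count(576, 64))   # 36928
print(dense_param_count(64, 10))    # 650
```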

Now you just need to configure the model using `model.compile` and fit it on the training data!


In [0]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [15]:
history = model.fit(x_train, y_train, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [35]:
history.history

{'accuracy': [0.87105, 0.92635, 0.95105, 0.9643833, 0.97211665],
 'loss': [0.4159620881001155,
  0.23982162843942642,
  0.15492084644238155,
  0.11187756851116816,
  0.08968276811937491]}
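
The `accuracy` values above are the fraction of samples whose highest-scoring class matches the label. A minimal NumPy sketch of that computation, using a few made-up predictions with 3 classes rather than real model output:

```python
import numpy as np

def categorical_accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the most probable class equals the label."""
    predicted = np.argmax(y_prob, axis=1)
    actual = np.argmax(y_true_one_hot, axis=1)
    return float(np.mean(predicted == actual))

# Hypothetical labels and softmax outputs for four samples.
y_true = np.eye(3)[[0, 2, 1, 1]]
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.5, 0.4, 0.1],
                   [0.2, 0.7, 0.1]])
print(categorical_accuracy(y_true, y_prob))  # 0.75
```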

Great! We have finished building our first convolutional neural net model.