# Neural Network: Convolutional Neural Network (CNN) with MNIST

## Introduction to Convolutional Neural Network

<img src="pic/cnnall.jpeg" width=800>

**Convolutional Neural Network (ConvNet/CNN)** is now the go-to model for most image-related problems, where it typically outperforms other approaches in accuracy by a wide margin. It has also been applied successfully to recommender systems, natural language processing, audio recognition, and much more. CNN is also computationally efficient. Its key advantage lies in its **feature learning** ability. That is, a CNN can capture relevant features from an image or video and detect patterns in part of the image rather than in the image as a whole. A regular fully-connected neural network, by contrast, can only look at the whole image at once.


Convolutional neural networks are very powerful in image analysis and recognition. You can, in fact, use a simple Artificial Neural Network (ANN) for image tasks such as image recognition. Then why do people nowadays usually switch to CNN for images?

Say that you have a picture with 100 x 100 pixels. Since each pixel consists of 3 values (R, G, B), this picture has $100 \times 100 \times 3 = 30,000$ dimensions. If the first hidden layer consists of 1,000 nodes, a fully-connected ANN needs 30 million weights for the first layer alone! CNN is therefore a great technique for images, since it reduces the number of parameters the network needs. It is basically a technique that utilizes prior knowledge about images to remove some of the weights in the fully-connected neural network.
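
The arithmetic above can be checked in a few lines. The sketch below also counts the weights of a convolutional layer for comparison; the choice of 64 filters of size 3 x 3 is an assumption made purely for illustration, not a figure from the text.

```python
# Weights of a fully-connected first layer on a 100 x 100 RGB image
height, width, channels = 100, 100, 3
hidden_nodes = 1000

input_dims = height * width * channels      # 30,000 input values
dense_weights = input_dims * hidden_nodes   # one weight per (input, node) pair

# Hypothetical convolutional layer (assumed: 64 filters of size 3 x 3)
n_filters, kernel = 64, 3
conv_weights = n_filters * kernel * kernel * channels  # reused at every position

print(f"input dimensions: {input_dims:,}")
print(f"dense weights:    {dense_weights:,}")
print(f"conv weights:     {conv_weights:,}")
```

Biases are left out of both counts to keep the comparison simple.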







---
## Why is CNN useful for image analysis?

CNN is extremely useful not only because of its efficiency but also because of the unique operations that help it detect certain patterns. It uses special convolution and pooling operations and performs parameter sharing. Because this keeps the number of parameters small, CNN models can run on a wide range of devices, which makes them broadly attractive.


Say the neuron below detects whether or not there is a beak.

<img src="pic/bird1.png" width=100>

The advantages of using CNN instead of ANN are as follows:


1. **Pattern learning**. There's no need to see the whole image for image recognition.

In order to know whether this image contains a wheel, you don't need to see the whole image, since from prior knowledge we know that wheels cannot be in the sky. In CNN, each neuron only performs a certain job, say detecting whether there is a beak. Therefore, some of the weights in the fully-connected network structure are useless. That is, **a neuron does not have to see the whole image to discover the pattern**. This property can be implemented via **Convolution Layer**.


2. **Weight sharing**. Detecting the same item at different positions requires only one neuron.

Another reason why we can remove some of the weights is that **even if the patterns appear in different regions, the neurons detecting them do almost the same thing (say detecting whether there is a beak), so they can use the same set of parameters**, which reduces the number of weights required. For example, the two images below show two birds with beaks in different positions. We don't have to train two neurons to detect these two beaks. Instead, we only need one. This property can be implemented via **Convolution Layer**.

<table><tr><td><img src="pic/bird1.png" width=100></td><td><img src="pic/bird2.png" width=100></td></tr></table>


3. **Subsampling**. Enlarging or reducing the size of the image doesn't matter.

One other reason is that **subsampling the pixels will not change the target**. Subsampling means shrinking the image, for example by keeping only every other row and column. Since the number of pixels decreases, you need fewer weights. This property can be implemented via **Max Pooling Layer**.
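
The weight-sharing idea from point 2 can be illustrated with a small NumPy sketch. The 3 x 3 "beak" pattern and the 8 x 8 images are hypothetical stand-ins, not data from the notebook; the point is only that one set of nine weights responds to the pattern wherever it appears.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy 3 x 3 "beak" pattern (hypothetical, for illustration only)
beak = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]], dtype=float)

# Two 8 x 8 images with the same pattern in different positions
img_a = np.zeros((8, 8))
img_a[1:4, 1:4] = beak
img_b = np.zeros((8, 8))
img_b[4:7, 3:6] = beak

resp_a = correlate2d_valid(img_a, beak)
resp_b = correlate2d_valid(img_b, beak)

# The same 9 weights give the strongest response at both locations
loc_a = tuple(int(v) for v in np.unravel_index(resp_a.argmax(), resp_a.shape))
loc_b = tuple(int(v) for v in np.unravel_index(resp_b.argmax(), resp_b.shape))
print(loc_a, resp_a.max())
print(loc_b, resp_b.max())
```

The peak response has the same value in both images and only its position differs, which is why a single filter is enough.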


## The process of Convolutional Neural Network

<img src="pic/cnnseq.jpeg" width=600>

We start with an input image. We perform **a series of convolution + pooling operations**, followed by a number of fully-connected layers. Note that the number of iterations (convolution + pooling) is not fixed. The ultimate goal of the iterations is to reduce the number of pixels. For example, the original hand-written digit is 28 x 28 pixels. After performing two sets of convolution + pooling as shown above, we shrink it to 4 x 4 pixels, which limits the number of input neurons of the fully-connected part.

In this example, since we are dealing with a classification problem, the output layer contains exactly as many nodes as there are classes to predict. Since we are performing multiclass classification, the activation function for the output layer should be the softmax function. For a regression problem, we can simply give the final layer a single node representing the predicted value.
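
The softmax function mentioned above turns the raw scores of the output nodes into class probabilities. Here is a minimal NumPy sketch; the three scores are made-up values for three hypothetical classes.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score gets the largest probability, and the predicted class is simply the index of the maximum.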

Knowing the entire process of CNN, let's now dive into each component.

### Convolution (unfinished)

The main building block of Convolutional Neural Network is the convolutional layer. Convolution is a mathematical operation to merge two sets of information. In our case the convolution is applied on the input data using a convolution filter to produce a feature map. There are a lot of terms being used so let’s define and visualize them.

### Important Elements within Convolutional Neural Network

**The input Image**

Images are made up of pixels. Each pixel value is a number between 0 and 255. As stated in the beginning, since each pixel of a color image consists of 3 values (R, G, B), a 100 x 100 picture has $100 \times 100 \times 3 = 30,000$ dimensions. If the first hidden layer consists of 1,000 nodes, you will have 30 million weights! It is almost impossible to train such a gigantic number of weights in a fully-connected neural network, let alone the fact that most of the weights are useless. This is why CNN is widely used for images.

**Feature Detector**

The feature detector is a small matrix, usually 3x3 (other sizes such as 5x5 or 7x7 are also used). The feature detector is also widely known as a **filter** or a **kernel**.

**Feature Map**

Say that you have a decent number of filters. A feature map is the output obtained by sliding a feature detector (filter) over the input image: at each position, the filter is multiplied element-wise with the patch of the image underneath it, and the products are summed. The feature map is also known as a **convolved feature** or an **activation map**. Without padding, this step also reduces the size of the image, which makes processing faster and easier. Some information in the image is indeed lost in this step, but most of the relevant patterns are captured by the feature detector.


Let’s say we have a 32x32x3 image and we use a filter of size 5x5x3 (note that the depth of the convolution filter matches the depth of the image, both being 3). Here we perform the convolution operation described above. The only difference is that this time the element-wise multiplication and sum are done in 3D instead of 2D, but the result is still a scalar. We slide the filter over the input and perform the convolution at every location, aggregating the results in a feature map. With suitable padding, this feature map is of size 32x32x1, shown as the red slice on the right. If we take another filter, denoted in green, and follow the same process, we produce another feature map of size 32x32x1. Therefore, the number of filters determines the depth of the output, that is, how many individual feature maps are stacked together to form a 3D one.
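
The 32x32x3 example can be reproduced with a naive NumPy loop. This is a sketch for clarity rather than speed; the random image, the random filter values, and the zero padding of 2 pixels (which keeps the 32x32 size) are assumptions for illustration.

```python
import numpy as np

def conv_layer(image, filters, pad):
    """Apply each filter to a zero-padded H x W x C image (stride 1).

    image:   (H, W, C)
    filters: (K, f, f, C), K filters whose depth C matches the image
    returns: (H_out, W_out, K), one 2D feature map per filter
    """
    n_filters, f = filters.shape[0], filters.shape[1]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zeros by default
    out_h = padded.shape[0] - f + 1
    out_w = padded.shape[1] - f + 1
    out = np.zeros((out_h, out_w, n_filters))
    for k in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                # 3D element-wise product, summed to a single scalar
                out[i, j, k] = np.sum(padded[i:i + f, j:j + f, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
filters = rng.standard_normal((2, 5, 5, 3))   # e.g. the "red" and "green" filters

feature_maps = conv_layer(image, filters, pad=2)
print(feature_maps.shape)   # (32, 32, 2)
```

Two filters give an output of depth 2; adding more filters only changes the last dimension.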

<img src="pic/conv.png" width=450>


**Stride**

**Stride** is the step size by which the filter moves along the input image matrix. The default value for stride is 1. If we want less overlap between neighboring filter positions, or a smaller feature map, we can use bigger strides. Stride is a hyperparameter that must be set before the model is compiled. Choosing it requires some domain knowledge as well as trial and error.

**Padding**

Another technique used in CNN is called **padding**. Padding is commonly used to preserve the size of the feature maps; otherwise they would shrink at each layer, which may not be desirable. Without padding, the feature map is smaller than the input. The picture below demonstrates that fact. If we want to maintain the same dimensionality, we can use padding to surround the input with extra values. We can pad either with **zeros or with the values on the edge**.

With proper padding, the height and width of the feature map are the same as those of the input (both 32x32), and only the depth changes.
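
The effect of stride and padding on the feature-map size follows the standard formula $\lfloor (n + 2p - f) / s \rfloor + 1$ for input size $n$, filter size $f$, padding $p$ and stride $s$. The sketch below applies it to the 32x32 example. The last part traces a 28 x 28 digit down to 4 x 4, assuming 5 x 5 convolutions without padding and 2 x 2 pooling, which is one configuration consistent with those sizes.

```python
def output_size(n, f, stride=1, pad=0):
    """Output length along one axis: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

# 32 x 32 input with a 5 x 5 filter, as in the example above
no_pad = output_size(32, 5)                     # shrinks to 28
same_pad = output_size(32, 5, pad=2)            # stays at 32
strided = output_size(32, 5, stride=2, pad=2)   # drops to 16

# Assumed chain for a 28 x 28 digit: 5 x 5 convolutions, 2 x 2 pooling
size = 28
for f, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    size = output_size(size, f, stride)

print(no_pad, same_pad, strided, size)   # 28 32 16 4
```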

### Process of Convolution

<img src="pic/convo.png" width=600>

On the left side is the input image to the convolution layer. In the middle is the convolution filter. This is called a 3x3 convolution; the size of the filter is a parameter you can set. On the right side is the feature map produced by merging the input image and the convolution filter. We perform the convolution operation by sliding this filter over the input. The step size of this slide is controlled by the parameter **stride**. At every location, we multiply the filter and the image patch element-wise and sum the result. This sum goes into the feature map, which is the output of the Convolution Layer. Below is a wonderful gif from [A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53) that shows the process of convolution.

<img src="pic/conv.gif">

Note that since we have padded the original picture, the dimensionality of the feature map remains the same.

---


The image below demonstrates why the process of convolution is actually a neural network with fewer connected weights.

<img src="pic/convo2.png" width=600>

When we move the filter frame to the right, the same filter is used, which means that the two neurons share the same weights, making the total number of weights even smaller.
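
To make this equivalence concrete, the sketch below writes a small convolution as an ordinary fully-connected layer whose weight matrix is mostly zeros and reuses the same nine values in every row. The 4 x 4 image and the kernel values 1 to 9 are arbitrary choices for illustration, and the helper assumes a square kernel, stride 1 and no padding.

```python
import numpy as np

def conv_as_dense_matrix(kernel, in_h, in_w):
    """Weight matrix of the fully-connected layer that is equivalent to a
    stride-1, unpadded convolution with a square `kernel`."""
    f = kernel.shape[0]
    out_h, out_w = in_h - f + 1, in_w - f + 1
    W = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j                      # one output neuron per filter position
            for u in range(f):
                for v in range(f):
                    col = (i + u) * in_w + (j + v)   # the input pixel it connects to
                    W[row, col] = kernel[u, v]       # the same weights in every row
    return W

kernel = np.arange(1.0, 10.0).reshape(3, 3)
image = np.arange(16.0).reshape(4, 4)
W = conv_as_dense_matrix(kernel, 4, 4)

dense_out = W @ image.ravel()
direct_out = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel) for j in range(2)]
                       for i in range(2)]).ravel()

print(W.shape)                      # (4, 16): 64 possible connections
print(int(np.count_nonzero(W)))     # 36 connections actually used
print(len(np.unique(W[W != 0])))    # 9 distinct weights to learn
print(np.allclose(dense_out, direct_out))
```

Of the 64 weights a fully-connected layer would have, only 36 connections remain, and weight sharing brings the number of learnable values down to 9.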


### Max Pooling

What Max Pooling really does is **subsampling**. It is done by applying a max filter to subregions of the initial representation. See the picture below.

<img src="pic/maxpool.png" width=400>


Max Pooling takes the maximum value within each subregion of the input. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.) and reduce its dimensionality. This lowers the computation required in later layers and, because the following layers receive fewer inputs, the number of parameters to learn. It also helps reduce overfitting by providing an abstracted form of the representation.

Note that in fact there are two types of pooling: **Max Pooling** and **Average Pooling**. In CNN, Max Pooling dominates and is our go-to method for subsampling.
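
Both pooling types can be written in a few lines of NumPy. This is a minimal sketch that assumes non-overlapping windows (stride equal to the window size) and a single 2D feature map; the 4 x 4 input values are made up.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows (stride = size)."""
    h, w = x.shape
    h, w = h - h % size, w - w % size            # drop rows/columns that do not fit
    windows = x[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))             # any other mode: average pooling

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]], dtype=float)

max_pooled = pool2d(x, mode="max")        # rows: [6, 5] and [7, 9]
avg_pooled = pool2d(x, mode="average")    # rows: [3.5, 2.0] and [2.5, 6.0]
print(max_pooled)
print(avg_pooled)
```

Either way the 4 x 4 input becomes 2 x 2; the two methods differ only in how each window is summarized.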


Whenever you finish one loop of **Convolution** and **Max Pooling**, what you really get is a new image representing the original input. **The Convolutional Layer and the Max Pooling Layer together form the i-th layer of a Convolutional Neural Network**. Depending on the complexity of the images, the number of such layers may be increased to capture low-level details even further, but at the cost of more computational power.

Note that **Max Pooling is not a must-do step in Convolutional Neural Network**. For example, in the AlphaGo paper, the authors state that they use the structure of a convolutional neural network to detect patterns on the Go board. However, in this Go example, they do not use Max Pooling to subsample the board.

### Flatten

Flatten is the step that connects the result of Max Pooling to the fully-connected neural network. Once the pooled feature map is obtained through iterations of convolution and max pooling, the next step is to transform the entire pooled feature map into a single column so that it can be fed to the neural network for processing. The flattening process is shown in the image below.
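
In array terms, flattening is just a reshape. The sketch below assumes a hypothetical batch of 4 samples, each with 50 pooled feature maps of size 5 x 5 stored in channels-last order.

```python
import numpy as np

# Hypothetical pooled output: (batch, height, width, feature maps)
pooled = np.zeros((4, 5, 5, 50))

# Keep the batch axis and unroll everything else into one long vector
flat = pooled.reshape(pooled.shape[0], -1)
print(pooled.shape, "->", flat.shape)   # (4, 5, 5, 50) -> (4, 1250)
```

No values are changed or discarded; only their arrangement differs.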

<img src="pic/flatten.png" width=300>


### Fully-Connected Neural Network

After all this work, we can simply feed the flattened data into the fully-connected neural network (DNN).

<img src="pic/fc.jpeg" width=700>

Remember that a fully-connected neural network with enough hidden units can approximate a very wide range of non-linear relationships between input and output. Thus, adding a fully-connected layer is a good way to learn non-linear combinations of the high-level features represented by the output of the flatten layer.

## Quick Practice in Keras

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.datasets import fashion_mnist
from keras import backend

In [26]:
def load_data():
    """Load Fashion-MNIST, reshape it for Conv2D, and scale pixels to [0, 1]."""
    # input image dimensions
    img_rows, img_cols = 28, 28

    # the data, split between train and test sets
    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

    # add the single grayscale channel where the backend expects it
    if backend.image_data_format() == 'channels_first':
        X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
        X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
        X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)

    # convert class vectors to one-hot (binary class) matrices
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)

    # scale pixel values from [0, 255] to [0, 1]
    X_train = X_train.astype('float32') / 255
    X_test = X_test.astype('float32') / 255
    return (X_train, y_train), (X_test, y_test), input_shape

In [30]:
if __name__ == '__main__':
    # load training data and testing data
    (X_train, y_train), (X_test, y_test), input_shape = load_data()
    # define network structure
    model = Sequential()
    
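    # CNN: two rounds of convolution + max pooling, then flatten
    model.add(Conv2D(input_shape=input_shape, filters=25, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    # note: no activation is specified here, so this convolution is linear
    model.add(Conv2D(filters=50, kernel_size=(3,3)))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Flatten())
    
    # fully-connected classifier with a softmax over the 10 classes
    model.add(Dense(units=300, activation='relu'))
    model.add(Dense(units=10, activation='softmax'))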

    # set configurations
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(), metrics=['accuracy'])

    # train model
    history = model.fit(X_train, y_train, batch_size=256, epochs=20, validation_split=0.3)

    # evaluate the model and output the accuracy
    result_train = model.evaluate(X_train, y_train)
    result_test = model.evaluate(X_test, y_test)
    print('\n')
    print('----------Model Result----------')
    print('Train Acc:', result_train[1])
    print('Test Acc:', result_test[1])

Train on 42000 samples, validate on 18000 samples


----------Model Result----------
Train Acc: 0.955133318901062
Test Acc: 0.9077000021934509
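
As a sanity check on the architecture above, the feature-map sizes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for the layers used (no padding, stride 1, one bias per filter or unit, pooling that floors odd sizes); under those assumptions the total should agree with what `model.summary()` reports.

```python
def conv_params(n_filters, kernel, in_channels):
    """Weights plus one bias per filter."""
    return n_filters * (kernel * kernel * in_channels + 1)

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

size, channels = 28, 1                 # 28 x 28 grayscale input

conv1 = conv_params(25, 3, channels)   # Conv2D(25, 3x3)
size, channels = size - 3 + 1, 25      # no padding: 28 -> 26
size = size // 2                       # MaxPooling2D(2x2): 26 -> 13

conv2 = conv_params(50, 3, channels)   # Conv2D(50, 3x3)
size, channels = size - 3 + 1, 50      # 13 -> 11
size = size // 2                       # 11 -> 5 (the last row/column is dropped)

flat = size * size * channels          # Flatten: 5 * 5 * 50
dense1 = dense_params(flat, 300)
dense2 = dense_params(300, 10)
total = conv1 + conv2 + dense1 + dense2

print(flat, conv1, conv2, dense1, dense2, total)
```

Almost all of the parameters sit in the first dense layer, which shows how much the convolution and pooling stages save compared with connecting the raw pixels to everything.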


---
## What does a Convolutional Neural Network Learn?

**Before Flatten -- Convolution + Max Pooling**

So after training the model, we get a great result... then what? How can we interpret this neural network?

We all know that when we are training the neural network, we pass an input into the model. Via gradient descent, we find the set of weights and biases that minimize total loss. Now, how can we know what the filters are doing?

Say that the output of the k-th filter is an M x M matrix. Here we can define a quantity called the **Degree of Activation**. We define the degree of activation of the k-th filter to be
$$a^k = \sum \limits_{i=1}^M \sum \limits_{j=1}^M a_{ij}^k$$
where $a_{ij}^k$ is the inner product computed by the convolution layer at position $(i, j)$.

The trick we are going to use is **Gradient Ascent**: we look for the image that maximizes the degree of activation. That is, we want to find the image that gives the highest total sum after the inner products are calculated.
$$x^* = \arg\max_x a^k$$

We want to find the input $x^*$ that maximizes $a^k$, the degree of activation. This way we can find the image that best reflects what the filter is looking for, that is, **what the filter is detecting**. If we repeat this process several times, say for 12 filters, we get something like the following.

<img src="pic/cnnlearn.png" width=300>

What do these images mean? For example, the image in the bottom right corner tells us that **that particular filter is responsible for detecting that pattern, in this example diagonal stripes.** Therefore, if we input an image containing diagonal stripes, the output of this filter will be larger than that of filters that do not detect diagonal stripes.

From the discussion we can know that **the job for each filter is to detect a certain pattern in the image**.
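
The gradient-ascent idea can be sketched in NumPy for a single filter. Everything here is a simplified stand-in: the 3 x 3 diagonal filter is hand-made rather than learned, a ReLU is applied before summing, the gradient is written out by hand, and pixels are clipped to [0, 1]. A real experiment would instead differentiate through the trained model with the framework's automatic differentiation.

```python
import numpy as np

def filter_response(x, w):
    """Pre-activation map z and degree of activation a = sum(relu(z))."""
    f = w.shape[0]
    m = x.shape[0] - f + 1
    z = np.array([[np.sum(x[i:i + f, j:j + f] * w) for j in range(m)]
                  for i in range(m)])
    return z, np.maximum(z, 0).sum()

def activation_gradient(x, w):
    """Gradient of the degree of activation with respect to the input x."""
    f = w.shape[0]
    z, _ = filter_response(x, w)
    grad = np.zeros_like(x)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            if z[i, j] > 0:                  # ReLU passes gradient only where active
                grad[i:i + f, j:j + f] += w
    return grad

# Hypothetical hand-made filter that responds to diagonal strokes
w = np.array([[ 2., -1., -1.],
              [-1.,  2., -1.],
              [-1., -1.,  2.]])

rng = np.random.default_rng(0)
x = rng.random((12, 12))                     # start from a random image
_, a_start = filter_response(x, w)

for _ in range(100):
    # ascent step, then keep every pixel inside the valid range [0, 1]
    x = np.clip(x + 0.1 * activation_gradient(x, w), 0.0, 1.0)

_, a_end = filter_response(x, w)
print(a_start, a_end)
```

The degree of activation grows as the image is updated, and the resulting `x` can be plotted to see which pattern the filter prefers.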

So the discussion above tells us what the filters in the convolution and Max Pooling layers are doing. What about the fully-connected part after flattening?

**After Flatten -- Fully-Connected Neural Network**

By repeating the same process, finding the $x^*$ that maximizes the degree of activation, we can find out what each neuron in the fully-connected neural network responds to.

<img src="pic/cnnlearn2.png" width=250>

It's very different from what we have seen before. The first image, which demonstrates what the filters in Convolution + Max Pooling learn, shows some kind of **patterns**. However, the second image, which shows what the neurons in the fully-connected neural network learn, is a **full image**, even if you cannot recognize anything in it. This is because what you feed into the fully-connected neural network is not part of the image, but the entire image as a whole. Therefore, it doesn't only detect and learn small patterns. Instead, it learns the whole picture, or, you could say, a larger pattern.


Now say you want to see what the entire CNN learns. Let's denote the output of the i-th node of the output layer as $y^i$.
Say that you want to find an input that maximizes $y^i$, denoted $x^* = \arg\max_x y^i$. Below is the shocking result.

<img src="pic/cnnlearn3.png" width=250>

Each image is labeled with the class the network assigns to it. The top left image is recognized by the CNN as the number $0$, while it is arguably unrecognizable to human beings. Therefore, we can say that a neural network learns extremely differently from human beings. For more information about this phenomenon, check out this great video on YouTube: [Deep Neural Networks are Easily Fooled](https://www.youtube.com/watch?v=M2IebCN9Ht4).

So how can we overcome this issue? We all know that a digit can only fill a portion of the image. A number $2$ can never cover the entire image, so there must be pixels that are not white. We can see that in the previous picture, most of the pixels are white. What if we limit the amount of white? We can use regularization!

$$ x^* = \arg\max_x \left( y^i - \sum \limits_{m,n}|x_{m,n}| \right) $$

The term $\sum \limits_{m,n}|x_{m,n}|$, where $x_{m,n}$ is the pixel in row $m$ and column $n$, is the sum of the absolute values of all pixels, so it measures how much of the image is lit. Thus, in the equation above we want to find an input $x$ that maximizes the output $y^i$ while keeping the total pixel intensity small. In this case, most of the image should be left black. The result is shown on the right-hand side:

<table><tr><td><img src="pic/cnnlearn3.png" width=300></td><td><img src="pic/cnnlearn4.png" width=300></td></tr></table>

The left-hand side shows the original result. We can clearly see that with regularization, $x^*$ is much closer to digits that human beings can recognize.
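
The effect of the penalty term can be seen even on a toy problem. In the sketch below the network is replaced by a hypothetical linear scorer, $y = \sum W \cdot x$ with random weights `W`, and the penalty weight `lam` is an added knob for illustration (the formula above corresponds to `lam = 1`). It is a stand-in for the real CNN, not the procedure used to produce the pictures.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((28, 28))   # hypothetical stand-in for the class score's weights

def maximize(W, lam, steps=200, lr=0.05):
    """Projected gradient ascent on sum(W * x) - lam * sum(|x|), with x in [0, 1]."""
    x = np.full(W.shape, 0.5)
    for _ in range(steps):
        grad = W - lam * np.sign(x)             # score gradient minus L1 penalty gradient
        x = np.clip(x + lr * grad, 0.0, 1.0)    # keep pixels in the valid range
    return x

x_plain = maximize(W, lam=0.0)      # no regularization
x_sparse = maximize(W, lam=1.0)     # with the L1 penalty

lit_plain = int((x_plain > 0.5).sum())
lit_sparse = int((x_sparse > 0.5).sum())
print("lit pixels without penalty:", lit_plain)
print("lit pixels with penalty:   ", lit_sparse)
```

With the penalty, only the pixels that raise the score by more than they cost stay lit, so far fewer pixels are bright.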

---
## References

[cs231n](http://cs231n.github.io/)

[What is max pooling in convolutional neural networks?](https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks)

[Applied Deep Learning - Part 4: Convolutional Neural Networks](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2#a86a)

[What are the advantages of a convolutional neural network (CNN) compared to a simple neural network from the theoretical and practical perspective?](https://www.quora.com/What-are-the-advantages-of-a-convolutional-neural-network-CNN-compared-to-a-simple-neural-network-from-the-theoretical-and-practical-perspective)

[A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)