# The Problem

<img src="img/cnn_mnist.png">

The MNIST data above consists of a value from 0 to 255 for each pixel. An MLP can classify all the digits with pretty good results, because in most MNIST images the object to be recognized is in the middle of the image.

But what if the object to be recognized is not in the middle of the image? This is the weakness of an MLP. A 6 in the middle of the picture will be recognized, but a 6 in the left corner may not be.

One could use a lot of data with each digit in different locations, but this is not an efficient way to solve the problem.
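The weakness described above can be illustrated with a toy example. This is only a sketch: the 6x6 "image", the 2x2 blob standing in for a digit, and the idea of using the flattened centered image as a stand-in for learned MLP weights are all assumptions made for illustration.

```python
import numpy as np

# Toy 6x6 "image": a 2x2 bright blob in the center (stand-in for a digit).
centered = np.zeros((6, 6))
centered[2:4, 2:4] = 1.0

# The same blob moved to the top-left corner.
shifted = np.zeros((6, 6))
shifted[0:2, 0:2] = 1.0

# An MLP sees each image as a flat vector. Weights tuned to the centered
# blob are modelled here (hypothetically) by the flattened centered image.
weights = centered.flatten()

score_centered = float(weights @ centered.flatten())
score_shifted = float(weights @ shifted.flatten())

print(score_centered)  # 4.0
print(score_shifted)   # 0.0
```

The same object produces a completely different response once it moves, because each weight is tied to one fixed pixel position.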

# Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of neural network commonly used on image data. A CNN can be used to detect and recognize objects in an image. Broadly speaking, a CNN is not much different from an ordinary neural network: it consists of neurons that have weights, biases and activation functions.

So what is the difference? The architecture of a CNN is divided into 2 major parts: the **Feature Learning Layer** and the **Fully-Connected Layer**, or **Multi-Layer Perceptron**.

<img src="img/cnn_architecture.png">

# Feature Learning Layer

The term is used because this part "encodes" an image into features, in the form of numbers, that represent the image (feature extraction). The feature learning layer consists of two parts: the **Convolutional Layer** and the **Pooling Layer**. Some papers, however, do not use pooling.

# Convolutional Layer (Conv. Layer)

<img src="img/cnn_conv_layer.png">

The picture above is an RGB (Red, Green, Blue) **image measuring 32x32 pixels** which is actually a multidimensional array with a size of **32x32x3 (3 is the number of channels)**.

A convolutional layer consists of neurons arranged to form a filter with a length and a height (in pixels). For example, the first layer in the feature learning layer is usually a convolutional layer with a filter size of 5x5x3: **5 pixels long**, **5 pixels high** and **3 deep**, matching the number of channels of the image.

This filter, with its three channel slices, is shifted over all parts of the image. At every shift, a "dot" operation (element-wise multiplication followed by a sum) is carried out between the input and the values of the filter, producing an output commonly referred to as an activation map or feature map.
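The sliding "dot" operation can be sketched in plain NumPy. This is a minimal illustration, assuming stride 1, no padding and a single filter; the function name `conv2d_single_filter` and the all-ones dummy data are hypothetical, and a real layer would also add a bias and apply an activation function.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Slide one filter over an (H, W, C) image with stride 1 and no padding."""
    h, w, _ = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            # "Dot" operation: multiply element-wise, then sum over all
            # positions and channels to get a single number.
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.ones((32, 32, 3))   # dummy 32x32 RGB image
kernel = np.ones((5, 5, 3))    # one 5x5x3 filter
feature_map = conv2d_single_filter(image, kernel)

print(feature_map.shape)   # (28, 28)
print(feature_map[0, 0])   # 75.0
```

One filter yields one 2-D feature map; a layer with 16 filters would stack 16 such maps along the channel axis.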

<img src="img/cnn_conv_process.gif">

# Stride

Stride is a parameter that determines how far the filter shifts. **If the stride is 1, the convolution filter shifts 1 pixel at a time, horizontally and then vertically**. In the illustration above, the stride is 2.

The smaller the stride, the more detailed the information obtained from the input, but it requires more computation than a large stride. Note, however, that a small stride does not always give better performance.
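To make the effect of stride concrete, the sketch below lists where the filter starts along one axis. The helper `window_starts` and the 7-pixel input are assumptions chosen only for illustration.

```python
def window_starts(input_size, filter_size, stride):
    """Return the start index of every filter position along one axis."""
    return list(range(0, input_size - filter_size + 1, stride))

stride_1 = window_starts(7, 3, 1)
stride_2 = window_starts(7, 3, 2)

print(stride_1)  # [0, 1, 2, 3, 4]
print(stride_2)  # [0, 2, 4]
```

With stride 1 the filter visits 5 positions, with stride 2 only 3, so the larger stride needs fewer operations but samples the input more coarsely.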

# Padding

Padding or Zero Padding is a parameter that **determines the number of pixels (containing the value 0) to be added on each side of the input images**. This is used for the purpose of manipulating the output dimensions of convolution layer (Feature Map).

The purpose of using padding is:

- **Without padding, the output dimension of a convolution layer is always smaller than the input (except when a 1x1 filter with stride 1 is used)**. This output is reused as the input of the next convolution layer, so more and more information is lost.

    By using padding, the output dimensions can be kept the same as the input dimensions, or at least prevented from shrinking drastically. This makes it possible to stack more convolution layers, so that more features can be extracted.

- It improves the performance of the model, because the convolution filter will focus on the actual information that lies between the zero padding.

In the illustration above, **the actual input is 5x5**. If convolution is done with a **3x3 filter and stride 2**, **you get only a 2x2 feature map**. But **if 1 pixel of zero padding is added, the feature map is 3x3** (more information is kept).
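Zero padding itself is easy to reproduce with NumPy's `np.pad`. The sketch below pads a 5x5 input with 1 pixel of zeros on every side; the values 1 to 25 are arbitrary placeholders.

```python
import numpy as np

image = np.arange(1, 26).reshape(5, 5)   # 5x5 input with values 1..25

# Add 1 pixel of zeros on every side (P = 1).
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)    # (5, 5)
print(padded.shape)   # (7, 7)
print(padded[0])      # [0 0 0 0 0 0 0]
```

The original values sit untouched in the middle of the padded array, surrounded by a border of zeros.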

Here's the equation to calculate the dimensions of the feature map:

$output = \frac{W\ -\ N\ +\ 2P}{S} + 1$

- W = Length/Height of Input
- N = Filter Length/Height
- P = Zero Padding
- S = Stride
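The equation translates directly into a small helper. This is a sketch: the function name `conv_output_size` is hypothetical, and rounding the division down when it is not exact is an assumption (it matches how unpadded convolutions are usually computed).

```python
def conv_output_size(w, n, p, s):
    """Feature-map size along one axis: (W - N + 2P) / S + 1, rounded down."""
    return (w - n + 2 * p) // s + 1

no_padding = conv_output_size(5, 3, 0, 2)
with_padding = conv_output_size(5, 3, 1, 2)
rgb_example = conv_output_size(32, 5, 0, 1)

print(no_padding)    # 2
print(with_padding)  # 3
print(rgb_example)   # 28
```

The first two calls reproduce the 5x5 example with a 3x3 filter and stride 2; the third is the 32x32 image with a 5x5 filter and stride 1.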

# Pooling Layer

The pooling layer usually comes after a convolution layer. In principle, the pooling layer consists of a filter with a certain size and stride that shifts over the whole feature map.

The commonly used types are Max Pooling and Average Pooling. For example, with 2x2 Max Pooling and stride 2, the maximum value in the 2x2 pixel area is selected at each shift of the filter, while Average Pooling takes the average value.

<img src="img/cnn_pooling.png">

*The output dimension of the Pooling layer also uses the same formula as conv. layer.*
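Both pooling types can be sketched in a few lines of NumPy. The function `pool2d` and the 4x4 feature map below are made up for illustration; a framework layer would do the same per channel.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D feature map."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce_fn(feature_map[r:r + size, c:c + size])
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 7, 6, 8],
                 [4, 2, 3, 1],
                 [8, 6, 7, 5]], dtype=float)

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")

print(max_pooled)
# [[7. 8.]
#  [8. 7.]]
print(avg_pooled)
# [[4. 5.]
#  [5. 4.]]
```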

The purpose of the pooling layer is to reduce the dimensions of the feature map (downsampling). This speeds up computation, because fewer parameters need to be updated, and helps to reduce overfitting.

# Fully-Connected Layer (FC Layer)

The feature map produced by the feature extraction layer is **still in the form of a multidimensional array**. So **it needs to be "flattened", that is, reshaped into a vector**, before it can be used as input for the fully-connected layer.

The FC layer meant here is the MLP studied in part-4 and part-5. It has **several hidden layers, activation functions, an output layer and a loss function**.
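Flattening is just a reshape. In the sketch below, the 3x3x64 feature map is an example size chosen for illustration, filled with zeros as placeholder values.

```python
import numpy as np

feature_map = np.zeros((3, 3, 64))   # e.g. a 3x3 feature map with 64 channels
vector = feature_map.reshape(-1)     # flatten to a 1-D vector

print(vector.shape)   # (576,)
```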

Here is a detailed explanation of the backpropagation process in a CNN: [https://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/](https://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/)

# Let's Code

This time the Fashion MNIST data will be classified. Fashion MNIST is a dataset consisting of 10 fashion categories, as follows:

<img src="img/fashion_mnist.png">

+ T-Shirt/Top = 0
+ Trouser = 1
+ Pullover = 2
+ Dress = 3
+ Coat = 4
+ Sandal = 5
+ Shirt = 6
+ Sneaker = 7
+ Bag = 8
+ Ankle Boot = 9

Each category consists of 6,000 images for training and 1,000 images for testing. So the total for training data is 60,000 images and 10,000 for testing data.
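For reading predictions later, it can help to keep the category names in a list indexed by label. This is only a convenience sketch; `CLASS_NAMES` and `label_to_name` are hypothetical names, not part of the dataset or of Keras.

```python
# Category names in label order (index 0 to 9), following the list above.
CLASS_NAMES = [
    "T-Shirt/Top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot",
]

def label_to_name(label):
    """Map an integer label (0-9) to its category name."""
    return CLASS_NAMES[label]

print(label_to_name(7))  # Sneaker
```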

In [1]:
import numpy as np
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Activation, Dense, Conv2D, MaxPooling2D, ZeroPadding2D, Flatten
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from keras.datasets import fashion_mnist

(train_x, train_y), (test_x, test_y) = fashion_mnist.load_data()

train_x = train_x.astype('float32') / 255.
test_x = test_x.astype('float32') / 255.

train_x = np.reshape(train_x, (len(train_x), 28, 28, 1))
test_x = np.reshape(test_x, (len(test_x), 28, 28, 1))

train_y = to_categorical(train_y)
test_y = to_categorical(test_y)

Using TensorFlow backend.


This is not very different from the previous parts, but this time new layers named **Conv2D**, **MaxPooling2D**, **ZeroPadding2D** and **Flatten** are used. **TensorBoard** is also used to *visualize the training*.

The **Fashion MNIST** input is scaled from 0 - 255 to 0 - 1, as explained in part-6, and reshaped to 4-D because the framework requires input of the form *(batch_size, height, width, channels)* $\rightarrow$ (256, 28, 28, 1), where 256 is the batch size used for training.

The targets are also changed to **one-hot** vectors using the *to_categorical* function, as was done in part-5.
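What one-hot encoding produces can be shown with plain NumPy. This is a sketch of the idea only, not the Keras implementation; the helper `one_hot` and the three sample labels are made up for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([0, 9, 2])          # e.g. T-Shirt/Top, Ankle Boot, Pullover
encoded = one_hot(labels, num_classes=10)

print(encoded.shape)  # (3, 10)
print(encoded[1])     # [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```

Each row contains a single 1 at the index of its class, which matches the 10-unit softmax output of the model.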

# The Model

<img src="img/cnn_model.jpg">

The model architecture to be built is shown in the picture above. The feature map extracted from the input has a size of 3x3x64. After that, a Flatten layer changes the feature map into a 1-D vector that is used by the FC layer.

In [2]:
inputs = Input(shape=(28, 28, 1))
conv_layer = ZeroPadding2D(padding=(2,2))(inputs) 
conv_layer = Conv2D(16, (5, 5), strides=(3,3), activation='relu')(conv_layer) 
conv_layer = MaxPooling2D((2, 2))(conv_layer)
conv_layer = Conv2D(32, (3, 3), strides=(1,1), activation='relu')(conv_layer) 
conv_layer = ZeroPadding2D(padding=(1,1))(conv_layer) 
conv_layer = Conv2D(64, (3, 3), strides=(1,1), activation='relu')(conv_layer)

# Flatten the 3x3x64 feature map into a vector with 576 elements.
flatten = Flatten()(conv_layer) 

# Fully Connected Layer
fc_layer = Dense(256, activation='relu')(flatten)
fc_layer = Dense(64, activation='relu')(fc_layer)
outputs = Dense(10, activation='softmax')(fc_layer)

model = Model(inputs=inputs, outputs=outputs)

# Adam Optimizer and Cross Entropy Loss
adam = Adam(learning_rate=0.0001)
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])

# Print Model Summary
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 28, 28, 1)         0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 32, 32, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 10, 10, 16)        416       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 16)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 32)          4640      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 5, 5, 32)          0         
___________________________________________
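The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the layer settings from the model code (28x28 input, the paddings, filter sizes and strides shown there) and that MaxPooling2D uses a stride equal to its pool size by default; the helper names `out_size` and `conv_params` are hypothetical.

```python
def out_size(w, n, p, s):
    """Output size along one axis: (W - N + 2P) / S + 1, rounded down."""
    return (w - n + 2 * p) // s + 1

def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size = 28
size = out_size(size, 5, 2, 3)   # ZeroPadding2D(2) + Conv2D 5x5, stride 3 -> 10
size = out_size(size, 2, 0, 2)   # MaxPooling2D 2x2, stride 2             -> 5
size = out_size(size, 3, 0, 1)   # Conv2D 3x3, stride 1                   -> 3
size = out_size(size, 3, 1, 1)   # ZeroPadding2D(1) + Conv2D 3x3, stride 1 -> 3

flat_length = size * size * 64   # last Conv2D has 64 filters

print(size, flat_length)         # 3 576
print(conv_params(5, 1, 16))     # 416
print(conv_params(3, 16, 32))    # 4640
```

The results agree with the 576-element vector produced by the Flatten layer and with the parameter counts listed for `conv2d_1` and `conv2d_2`.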

# Training with TensorBoard Visualization

TensorBoard is used to visualize the training. The training **loss/accuracy** and **validation loss/accuracy** are saved and can be viewed as graphs, so we can see whether the model **underfits** or **overfits**, and **how well it performs**.

In [3]:
# Use TensorBoard
callbacks = TensorBoard(log_dir='./Graph')

# Train for 100 Epochs and use TensorBoard Callback
model.fit(train_x, train_y, batch_size=256, epochs=100, verbose=1, 
          validation_data=(test_x, test_y), callbacks=[callbacks])

# Save Weights
model.save_weights('7_conv_neural_network.h5')


Train on 60000 samples, validate on 10000 samples


Epoch 1/100
...
Epoch 100/100


After 100 epochs, the loss is 0.2236 for training and 0.3071 for validation, while the accuracy is 0.9186 (91.86%) for training and 0.8891 (88.91%) for validation.
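As a reminder of what the accuracy figure measures, the sketch below computes it from softmax outputs and one-hot targets. The helper `accuracy` and the four made-up samples with 3 classes are assumptions for illustration, not output from the model above.

```python
import numpy as np

def accuracy(probabilities, one_hot_targets):
    """Fraction of samples whose most probable class matches the target."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))

# Made-up softmax outputs for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 0]])

result = accuracy(probs, targets)
print(result)  # 0.75
```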

To open TensorBoard, run the following command:

`$ tensorboard --logdir=Graph/`

Here is the result shown in TensorBoard.

<img src="img/tf_board_cnn.png">