# CONVOLUTIONAL NEURAL NETWORK APPLICATION

# Introduction

In this section, we will use the well-known MNIST dataset to build two neural networks capable of classifying handwritten digits. The first network is a simple Multi-layer Perceptron (MLP) and the second is a Convolutional Neural Network (CNN from now on). In other words, our algorithm will say, with some associated error, which digit the presented input shows.

# What is Deep Learning?

Brief Theory: Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.

In Practice, defining the term "Deep": in this context, deep means that we are studying a Neural Network that has several hidden layers (more than one), no matter what type (convolutional, pooling, normalization, fully connected, etc.). The most interesting part is that some papers report that Deep Neural Networks with the right architectures and hyper-parameters achieve better results than shallow Neural Networks with the same computational budget (e.g. number of neurons or connections).

In Practice, defining "Learning": in the context of supervised learning, digit recognition in our case, learning consists of predicting a target from a given set of observations whose correct prediction (label) is already known. In our case, the target is the digit (0,1,2,3,4,5,6,7,8,9) and the observations are the intensity and relative position of the pixels. After some training, it is possible to generate a "function" that maps inputs (digit images) to desired outputs (type of digit). The only question is how well this mapping works. While trying to generate this "function", the training process continues until the model achieves a desired level of accuracy on the training data.

In [5]:
import tensorflow as tf
tf.__version__

'1.3.0'

# Using MLP

Classify MNIST using a simple model

In [1]:
## loading the data from tensorflow
from tensorflow.examples.tutorials.mnist import input_data



In [2]:
mnist = input_data.read_data_sets('MNIST_data', one_hot = True)

Extracting MNIST_data\train-images-idx3-ubyte.gz
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz


# Creating an interactive session

You have two basic options when using TensorFlow to run your code:

[Build graphs and run session] Do all the set-up and THEN execute a session to evaluate tensors and run operations (ops).

[Interactive session] Write your code and run it on the fly.

For this first part, we will use the interactive session, which is more suitable for environments like Jupyter notebooks.

In [9]:
sess = tf.InteractiveSession()

# Creating placeholders

It's a best practice to create placeholders before variable assignments when using TensorFlow. Here we'll create placeholders for inputs ("x") and outputs ("y_").

Placeholder 'x': represents the "space" allocated for the input, that is, the images.

   * Each input has 784 pixels arranged as a 28 (width) x 28 (height) matrix.
   * The 'shape' argument defines the tensor size by its dimensions.
   * 1st dimension = None. Indicates that the batch size can be of any size.
   * 2nd dimension = 784. Indicates the number of pixels in a single flattened MNIST image.

Placeholder 'y_': represents the final output, that is, the labels.

   * 10 possible classes (0,1,2,3,4,5,6,7,8,9)
   * The 'shape' argument defines the tensor size by its dimensions.
   * 1st dimension = None. Indicates that the batch size can be of any size.
   * 2nd dimension = 10. Indicates the number of targets/outcomes.
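
As a rough illustration of the shapes these placeholders expect, the NumPy sketch below flattens a batch of 28x28 images into rows of 784 values and one-hot encodes integer labels into rows of 10. The array contents and the helper name `one_hot` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# A made-up batch of 3 grayscale images, 28x28 pixels each.
images = np.zeros((3, 28, 28), dtype=np.float32)
labels = np.array([7, 0, 3])

x_batch = images.reshape(len(images), -1)   # shape (3, 784), fits placeholder x
y_batch = one_hot(labels)                   # shape (3, 10), fits placeholder y_
```

The leading dimension (3 here) is the batch size, which is why the placeholders declare it as None.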

In [11]:
x = tf.placeholder(tf.float32, shape = [None,784])
y_ = tf.placeholder(tf.float32, shape =[None, 10])

In [12]:
## weight tensor 
w = tf.Variable(tf.zeros([784, 10], tf.float32))

## bias tensor
b = tf.Variable(tf.zeros([10], tf.float32))

# Execute the assignment operation 


Earlier we declared the weights and biases with zero values, but TensorFlow variables hold no value until they are initialized inside a session. For this reason, an initializer op must be run for the variables you declared.
Please notice that we use the notation "sess.run" because we previously started an interactive session.

In [13]:
# run the variable initializer op using the interactive session
sess.run(tf.global_variables_initializer())


Adding the weights and bias to inputs

The only difference between our next operation and the illustration is that the code uses the mathematical convention for what the illustration shows. The tf.matmul operation performs a matrix multiplication between x (inputs) and w (weights), and then the code adds the biases b.

In [14]:
tf.matmul(x,w) +b

<tf.Tensor 'add:0' shape=(?, 10) dtype=float32>

In [15]:
y = tf.nn.softmax(tf.matmul(x, w) +b)

The logistic function is used for classification between two target classes (0/1). The softmax function is a generalization of the logistic function: it outputs a categorical probability distribution over multiple classes.
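
To make the softmax step concrete, here is a minimal NumPy sketch of the same computation, `softmax(x @ w + b)`. The names `x_demo`, `w_demo` and `b_demo` and the random inputs are assumptions for illustration; subtracting the row maximum is a common numerical-stability trick and is not something the notebook does explicitly.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x_demo = rng.random((4, 784))          # 4 made-up flattened images
w_demo = np.zeros((784, 10))           # zero weights, as in the notebook
b_demo = np.zeros(10)                  # zero biases, as in the notebook

probs = softmax(x_demo @ w_demo + b_demo)   # shape (4, 10), each row sums to 1
# With all-zero weights every class gets the same probability, 0.1.
```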

# Cost Function

The cost function measures the difference between the right answers (labels) and the outputs estimated by our network. Training adjusts the weights to minimize it.
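
As a sketch of what such a cost function computes, the NumPy snippet below evaluates cross-entropy for two made-up batches of predictions. The `eps` clipping is an assumption added here to keep the logarithm away from zero; the notebook's TensorFlow expression does not clip.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over the batch of -sum(y_true * log(y_pred)) per example.

    eps is an assumption added for this sketch: clipping avoids log(0).
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
uniform = np.full((2, 3), 1.0 / 3.0)

loss_confident = cross_entropy(y_true, confident)
loss_uniform = cross_entropy(y_true, uniform)   # equals log(3)
```

Predictions that put more probability on the true class give a lower loss, which is what the optimizer exploits.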

In [18]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

# Type of optimization: Gradient Descent

This is the part where you configure the optimizer for your Neural Network. There are several optimizers available; in our case we will use Gradient Descent, which is very well established.

In [21]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
print(train_step)

name: "GradientDescent_1"
op: "NoOp"
input: "^GradientDescent_1/update_Variable/ApplyGradientDescent"
input: "^GradientDescent_1/update_Variable_1/ApplyGradientDescent"



# Training batches

Train using minibatch Gradient Descent.

In practice, Batch Gradient Descent is not often used because it is too computationally expensive. The good part about this method is that you get the true gradient, but at the cost of processing the whole dataset at once. Because of this, Neural Networks are usually trained with minibatches.
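
The idea can be sketched without TensorFlow. The snippet below trains a small softmax classifier with minibatch gradient descent on made-up 2-D data; the data, the learning rate of 0.1 and the batch size of 50 are assumptions chosen for illustration, not values taken from the MNIST run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, well-separated data: 600 points, 2 features, 3 classes.
centers = np.array([[0.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])
labels = rng.integers(0, 3, size=600)
features = centers[labels] + rng.normal(size=(600, 2))
targets = np.eye(3)[labels]                 # one-hot labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(w, b):
    """Cross-entropy over the whole dataset."""
    p = softmax(features @ w + b)
    return float(np.mean(-np.sum(targets * np.log(p + 1e-12), axis=1)))

w = np.zeros((2, 3))
b = np.zeros(3)
initial_loss = loss(w, b)

learning_rate, batch_size = 0.1, 50
for step in range(200):
    # Each step uses a random minibatch instead of the full dataset.
    idx = rng.choice(len(features), size=batch_size, replace=False)
    xb, yb = features[idx], targets[idx]
    grad_logits = (softmax(xb @ w + b) - yb) / batch_size  # d(loss)/d(logits)
    w -= learning_rate * xb.T @ grad_logits
    b -= learning_rate * grad_logits.sum(axis=0)

final_loss = loss(w, b)
```

Each update only touches 50 examples, so the gradient is a noisy estimate of the true one, but the steps are much cheaper.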

In [22]:
## load 50 training examples for each training iteration
for i in range(1000):
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict = {x: batch[0], y_: batch[1]})

# Test

In [24]:
correct_predictions = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
acc = accuracy.eval(feed_dict = {x: mnist.test.images, y_: mnist.test.labels}) * 100
print( 'final accuracy for the simple ann model is: {} % '.format(acc))

final accuracy for the simple ann model is: 91.68999791145325 % 


In [25]:
sess.close()

# Evaluating the final result

Is the final result good?

Let's check the best algorithm available out there (10th june 2016):

Result: 0.21% error (99.79% accuracy)
Reference here

# How to improve the model

Several options follow:

 - Regularization of Neural Networks using DropConnect
 - Multi-column Deep Neural Networks for Image Classification
 - APAC: Augmented Pattern Classification with Neural Networks
 - Simple Deep Neural Network with Dropout

In the next part we are going to explore this option:

 - Simple Deep Neural Network with Dropout (more than 1 hidden layer)

# Deep learning applied on MNIST 

In the first part, we learned how to use a simple ANN to classify MNIST. Now we are going to expand our knowledge using a Deep Neural Network.

The architecture of our network is:

 - (Input) -> [batch_size, 28, 28, 1] >> Apply 32 filters of [5x5]
 - (Convolutional layer 1) -> [batch_size, 28, 28, 32]
 - (ReLU 1) -> [?, 28, 28, 32]
 - (Max pooling 1) -> [?, 14, 14, 32]
 - (Convolutional layer 2) -> [?, 14, 14, 64]
 - (ReLU 2) -> [?, 14, 14, 64]
 - (Max pooling 2) -> [?, 7, 7, 64]
 - [fully connected layer 3] -> [1x1024]
 - [ReLU 3] -> [1x1024]
 - [Drop out] -> [1x1024]
 - [fully connected layer 4] -> [1x10]

The next cells will explore this new architecture.
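
The spatial sizes in the list above can be checked with a few lines of arithmetic. With padding='SAME', TensorFlow produces an output of size ceil(input / stride) along each spatial dimension; the helper name `same_padding_output` below is made up for this sketch.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)

height = width = 28
shapes = []
# (layer name, stride, output channels); ReLU and dropout keep shapes unchanged.
for name, stride, channels in [("conv1", 1, 32), ("pool1", 2, 32),
                               ("conv2", 1, 64), ("pool2", 2, 64)]:
    height = same_padding_output(height, stride)
    width = same_padding_output(width, stride)
    shapes.append((name, height, width, channels))

flattened = height * width * 64   # input size of the first fully connected layer
```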

In [27]:
import tensorflow as tf

In [28]:
sess.close()

In [29]:
## start the interactive session 
sess = tf.InteractiveSession()

In [31]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot = True)

Extracting MNIST_data\train-images-idx3-ubyte.gz
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz


# initial parameters


In [32]:
width = 28 ## width of image in pixels
height = 28 ## height of image in pixels
flat = width* height  # number of pixels in one image 
class_output = 10 ## number of possible classification for the problem

# input and output 

In [33]:
x = tf.placeholder(tf.float32, shape = [None , flat])
y_ = tf.placeholder(tf.float32, shape = [None, class_output])

Converting images of the data set to tensors

The input image is 28 pixels by 28 pixels with 1 channel (grayscale). The first dimension is the batch size, which can be of any size (so we set it to -1 and let TensorFlow infer it). The second and third dimensions are height and width, and the last one is the number of image channels.

In [34]:
x_image = tf.reshape(x, [-1, 28,28,1])
x_image

<tf.Tensor 'Reshape:0' shape=(?, 28, 28, 1) dtype=float32>

# Convolutional Layer1

Defining the kernel weight and bias 

We define a kernel here. The size of the filter/kernel is 5x5; the number of input channels is 1 (grayscale); and we need 32 different feature maps (here, 32 feature maps means 32 different filters are applied to each image, so the output of the convolution layer will be 28x28x32). In this step, we create a filter/kernel tensor of shape [filter_height, filter_width, in_channels, out_channels].

In [38]:
w_conv1 = tf.Variable(tf.truncated_normal([5,5,1,32], stddev = 0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape = [32])) ##need 32 biases for 32 outputs

# Convolve with weight tensor and add biases.

Inputs

 - A tensor of shape [batch, in_height, in_width, in_channels]. x is of shape [batch_size, 28, 28, 1]
 - A filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]. W is of size [5, 5, 1, 32]
 - A stride, which is [1, 1, 1, 1]. The convolutional layer slides the "kernel window" across the input tensor. As the input tensor has 4 dimensions, [batch, height, width, channels], the convolution operates on a 2D window over the height and width dimensions. The strides determine how much the window shifts in each of the dimensions. As the first and last dimensions are related to batch and channels, we set their stride to 1. For the second and third dimensions we could set other values, e.g. [1, 2, 2, 1]

Process

 - Changes the filter to a 2-D matrix with shape [5*5*1, 32]
 - Extracts image patches from the input tensor to form a virtual tensor of shape [batch, 28, 28, 5*5*1].
 - For each patch, right-multiplies the filter matrix and the image patch vector.

Output

- A Tensor (a 2-D convolution) of shape (?, 28, 28, 32).

Notice: the output of the first convolution layer is 32 [28x28] images. Here 32 is considered as the volume/depth of the output image.
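
The three process steps above can be sketched in NumPy. This is an illustrative re-implementation under simplifying assumptions (stride 1, zero padding, odd kernel sizes), not TensorFlow's actual code; the function name `conv2d_same` and the demo arrays are made up.

```python
import numpy as np

def conv2d_same(images, kernels):
    """Stride-1, zero-padded ('SAME') convolution via patch extraction.

    images:  [batch, height, width, in_channels]
    kernels: [k_height, k_width, in_channels, out_channels], odd k sizes
    """
    batch, height, width, in_ch = images.shape
    kh, kw, _, out_ch = kernels.shape
    pad = ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0))
    padded = np.pad(images, pad, mode="constant")
    # Virtual patch tensor of shape [batch, height, width, kh*kw*in_ch].
    patches = np.empty((batch, height, width, kh * kw * in_ch))
    for i in range(height):
        for j in range(width):
            patches[:, i, j, :] = padded[:, i:i + kh, j:j + kw, :].reshape(batch, -1)
    # Filter flattened to [kh*kw*in_ch, out_ch], then one matrix product per patch.
    return patches @ kernels.reshape(-1, out_ch)

rng = np.random.default_rng(0)
demo_images = rng.random((2, 28, 28, 1))
demo_kernels = rng.normal(scale=0.1, size=(5, 5, 1, 32))
feature_maps = conv2d_same(demo_images, demo_kernels)   # shape (2, 28, 28, 32)
```

Like tf.nn.conv2d, this slides the kernel without flipping it, so each output value is the sum of an image patch multiplied element-wise by the kernel.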

In [43]:
convolve1 = tf.nn.conv2d(x_image, w_conv1, strides = [1,1,1,1], padding = 'SAME') + b_conv1
print(convolve1)

Tensor("add_3:0", shape=(?, 28, 28, 32), dtype=float32)


# Applying the RELU activation function 

In this step, we go through all outputs of the convolution layer, convolve1, and wherever a negative number occurs, we swap it out for a 0. This is the ReLU activation function.
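
In NumPy terms, the same rule is an element-wise maximum with zero. The input values below are made up.

```python
import numpy as np

def relu(values):
    """Element-wise max(0, value): negatives become 0, the rest pass through."""
    return np.maximum(values, 0.0)

activations = relu(np.array([[-2.0, 0.5], [3.0, -0.1]]))
# activations is [[0.0, 0.5], [3.0, 0.0]]
```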

In [42]:
h_conv1 = tf.nn.relu(convolve1)
print(h_conv1)

Tensor("Relu_1:0", shape=(?, 28, 28, 32), dtype=float32)


# Applying the max pooling

Max pooling is a form of non-linear down-sampling. It partitions the input image into a set of rectangles and then finds the maximum value in each region.

Let's use the tf.nn.max_pool function to perform max pooling. Kernel size: 2x2 (each 2x2 window produces one output pixel).
Strides: dictates the sliding behaviour of the kernel. In this case it moves 2 pixels every time, so the windows do not overlap. The input is a matrix of size 28x28x32, and the output is a matrix of size 14x14x32.
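
A minimal NumPy sketch of non-overlapping 2x2 max pooling follows. It assumes even height and width, so no padding is involved; the function name `max_pool_2x2` and the demo input are made up for illustration.

```python
import numpy as np

def max_pool_2x2(feature_maps):
    """2x2 max pooling with stride 2 on [batch, height, width, channels].

    Assumes even height and width, so no padding is needed.
    """
    batch, height, width, channels = feature_maps.shape
    # Split each spatial axis into (number of windows, position inside window).
    blocks = feature_maps.reshape(batch, height // 2, 2, width // 2, 2, channels)
    return blocks.max(axis=(2, 4))

demo = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(demo)            # shape (1, 2, 2, 1)
# pooled[0, :, :, 0] is [[5, 7], [13, 15]]

halved = max_pool_2x2(np.zeros((3, 28, 28, 32)))   # shape (3, 14, 14, 32)
```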

In [45]:
conv1 = tf.nn.max_pool(h_conv1, ksize= [1,2,2,1], strides= [1,2,2,1], padding='SAME') #max_pool_2x2
conv1

<tf.Tensor 'MaxPool_1:0' shape=(?, 14, 14, 32) dtype=float32>

# Convolutional layer2

Weights and Biases of kernels
We apply the convolution again in this layer. Let's look at the second layer kernel:

- Filter/kernel: 5x5 (25 pixels)
- Input channels: 32 (from the 1st Conv layer, we had 32 feature maps)
- 64 output feature maps

Notice: here, the input image is [14x14x32] and the filter is [5x5x32]. We use 64 filters of size [5x5x32], so the output of the convolutional layer is 64 convolved images, [14x14x64].

Notice: the result of applying one filter of size [5x5x32] to an image of size [14x14x32] is an image of size [14x14x1], that is, the convolution operates on the whole volume.

In [46]:
w_conv2 = tf.Variable(tf.truncated_normal([5,5, 32, 64], stddev = 0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape = [64])) #need 64 biases for 64 outputs

# Convolve image with weight tensor and add biases.

In [48]:
convolve2 = tf.nn.conv2d(conv1, w_conv2, strides = [1,1, 1, 1], padding = 'SAME') + b_conv2
convolve2

<tf.Tensor 'add_5:0' shape=(?, 14, 14, 64) dtype=float32>

Apply the activation Relu function 

In [50]:
h_conv2 = tf.nn.relu(convolve2)
h_conv2 

<tf.Tensor 'Relu_3:0' shape=(?, 14, 14, 64) dtype=float32>

Applying the max pooling 

In [52]:
conv2 = tf.nn.max_pool(h_conv2, ksize= [1,2,2,1], strides= [1,2,2,1], padding = 'SAME') #max_pool_2X2
conv2

<tf.Tensor 'MaxPool_3:0' shape=(?, 7, 7, 64) dtype=float32>

Second layer completed. So, what is the output of the second layer, layer2?

- It is 64 matrices of [7x7]

# Fully Connected Layer

You need a fully connected layer to use the softmax and produce the probabilities at the end. Fully connected layers take the high-level filtered images from the previous layer, that is, all 64 matrices, and convert them to a flat array.

So, each [7x7] matrix will be converted to a matrix of [49x1], and then all 64 matrices will be concatenated, which makes an array of size [3136x1]. We will connect it to another layer of size [1024x1]. So, the weight matrix between these 2 layers will be [3136x1024].
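
The sizes in this paragraph can be checked with a short NumPy sketch. The random arrays stand in for the real layer outputs and weights, and the names `pooled_maps`, `w_demo` and `b_demo` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pooled_maps = rng.random((2, 7, 7, 64))           # made-up output of the second layer

flat = pooled_maps.reshape(len(pooled_maps), -1)  # shape (2, 3136)
w_demo = rng.normal(scale=0.1, size=(7 * 7 * 64, 1024))
b_demo = np.full(1024, 0.1)

hidden = np.maximum(flat @ w_demo + b_demo, 0.0)  # matmul, bias, ReLU: shape (2, 1024)
num_parameters = w_demo.size + b_demo.size        # 3136 * 1024 + 1024
```

Most of the network's parameters sit in this one weight matrix, which is part of why dropout is applied right after this layer.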

# Flattening Second Layer 

In [55]:
layer2_matrix = tf.reshape(conv2, [-1,7*7*64])
layer2_matrix

<tf.Tensor 'Reshape_3:0' shape=(?, 3136) dtype=float32>

# Weights and Biases between layer 2 and 3

The input size is the size of one feature map from the last layer (7x7) multiplied by the number of feature maps (64); the layer has 1024 outputs, which feed the softmax layer.

In [57]:
w_fc1 = tf.Variable(tf.truncated_normal([7*7* 64, 1024], stddev = 0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape = [1024])) # need 1024 biases for 1024 outputs

matrix multiplication (applying weights and bias)

In [59]:
fc1 = tf.matmul(layer2_matrix, w_fc1) + b_fc1
fc1

<tf.Tensor 'add_7:0' shape=(?, 1024) dtype=float32>

Applying the Relu activation function

In [61]:
h_fc1 = tf.nn.relu(fc1)
h_fc1

<tf.Tensor 'Relu_5:0' shape=(?, 1024) dtype=float32>

Third layer completed

Dropout layer, optional phase for reducing overfitting

It is a phase where the network "forgets" some features. At each training step on a mini-batch, some units are switched off at random so that they do not interact with the network. That is, their weights cannot be updated, nor can they affect the learning of the other network nodes. This can be very useful for very large neural networks to prevent overfitting.
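
A minimal NumPy sketch of this mechanism is shown below. Like tf.nn.dropout, it scales the surviving units by 1 / keep_prob so that the expected activation stays the same; the function name `dropout` and the all-ones input are made up for illustration.

```python
import numpy as np

def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random(values.shape) < keep_prob
    return values * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 1024))
dropped = dropout(hidden, keep_prob=0.5, rng=rng)    # entries are 0.0 or 2.0
unchanged = dropout(hidden, keep_prob=1.0, rng=rng)  # keep_prob=1.0 disables dropout
```

This is why a keep_prob of 0.5 is fed during training and 1.0 during evaluation.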

In [63]:
keep_prob = tf.placeholder(tf.float32)
layer_drop = tf.nn.dropout(h_fc1, keep_prob)
layer_drop

<tf.Tensor 'dropout/mul:0' shape=(?, 1024) dtype=float32>

# Readout Layer (Softmax Layer)


Type: Softmax, Fully Connected Layer.

Weights and biases

In the last layer, the CNN takes the high-level filtered images and translates them into votes using softmax. Input channels: 1024 (neurons from the 3rd layer); 10 output features.

In [66]:
w_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev = 0.1))
b_fc2 = tf.Variable(tf.constant( 0.1, shape = [10])) # 10 possibilities for digits [0,1,2,3,4,5,6,7,8,9]
print(w_fc2, b_fc2)

<tf.Variable 'Variable_13:0' shape=(1024, 10) dtype=float32_ref> <tf.Variable 'Variable_14:0' shape=(10,) dtype=float32_ref>


Matrix multiplication (applying weights and biases)


In [67]:
fc = tf.matmul(layer_drop, w_fc2) + b_fc2

Applying the softmax activation function

Softmax allows us to interpret the outputs of fc as probabilities. So, y_CNN is a tensor of probabilities.

In [69]:
y_CNN = tf.nn.softmax(fc)
y_CNN

<tf.Tensor 'Softmax_2:0' shape=(?, 10) dtype=float32>

# Summary of the Deep Convolutional Neural Network

Now is the time to recall the structure of our network

0) Input - MNIST dataset

1) Convolutional and Max-Pooling

2) Convolutional and Max-Pooling

3) Fully Connected Layer

4) Processing - Dropout

5) Readout layer - Fully Connected

6) Outputs - Classified digits

# Define the functions and train the model

Define the loss function 

We need to compare our output, the y_CNN tensor, with the ground truth for the whole mini-batch. We can use cross entropy to see how badly our CNN is doing, that is, to measure the error at the softmax layer.

The following code shows a toy sample of cross-entropy for a mini-batch of size 2 whose items have been classified. You can run it (first change the cell type to code in the toolbar) to see how cross entropy changes.

In [77]:
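import numpy as np
# two predictions that each put probability 0.9 on the true class (rows sum to 1)
layer4_test = np.array([[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]])
y_test = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
np.mean(-np.sum(y_test * np.log(layer4_test), 1))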

0.10536051565782628

reduce_sum computes the sum of the elements of (y_ * tf.log(y_CNN)) across the second dimension of the tensor, and reduce_mean computes the mean of all elements in the resulting tensor.

In [79]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum((y_ *tf.log(y_CNN)), reduction_indices=[1]))
cross_entropy

<tf.Tensor 'Mean_4:0' shape=() dtype=float32>

# Define the Optimizer

We want to minimize the error of our network, which is measured by the cross_entropy metric. To do so, we have to compute the gradients of the loss (the cross-entropy) and apply them to the variables. This is done by an optimizer such as GradientDescent or Adagrad; the cell below uses Adadelta.

In [81]:
train_step = tf.train.AdadeltaOptimizer(1e-4).minimize(cross_entropy)
train_step

<tf.Operation 'Adadelta_1' type=NoOp>

Define Prediction

Do you want to know how many of the cases in a mini-batch have been classified correctly? Let's count them.

In [83]:
correct_predictions = tf.equal(tf.argmax(y_CNN, 1), tf.argmax(y_,1))
correct_predictions

<tf.Tensor 'Equal_3:0' shape=(?,) dtype=bool>

Define accuracy

It makes more sense to report accuracy as the average of correct cases.
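
The prediction and accuracy steps can be sketched in NumPy on a made-up batch of four examples with three classes: the predicted class is the index of the largest probability, and accuracy is the mean of the matches.

```python
import numpy as np

predicted_probs = np.array([[0.1, 0.7, 0.2],
                            [0.8, 0.1, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.2, 0.5, 0.3]])
true_one_hot = np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 1, 0],
                         [0, 1, 0]])

# Compare the most probable class with the labelled class, example by example.
correct = predicted_probs.argmax(axis=1) == true_one_hot.argmax(axis=1)
batch_accuracy = correct.astype(np.float32).mean()   # 3 of 4 correct -> 0.75
```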

In [86]:
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
accuracy

<tf.Tensor 'Mean_7:0' shape=() dtype=float32>

In [87]:
sess.run(tf.global_variables_initializer())

In [88]:
for i in range(1100):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, float(train_accuracy)))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

step 0, training accuracy 0.18
step 100, training accuracy 0.06
step 200, training accuracy 0.16
step 300, training accuracy 0.12
step 400, training accuracy 0.08
step 500, training accuracy 0.16
step 600, training accuracy 0.08
step 700, training accuracy 0.08
step 800, training accuracy 0.12
step 900, training accuracy 0.1
step 1000, training accuracy 0.12


# Evaluate the model

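Print the test accuracy. Note that the value reported below, 0.0956, is about what random guessing gives for 10 classes, and the training accuracy in the log above also stays near 0.1, so the network has not learned in this run. A likely cause is the optimizer setting: Adadelta with a learning rate of 1e-4 takes very small steps.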

In [89]:
print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

test accuracy 0.0956


# Visualization 

Do you want to look at all the filters?

In [91]:
kernels = sess.run(tf.reshape(tf.transpose(w_conv1, perm=[2, 3, 0,1]),[32,-1]))

In [92]:
# Plot the 32 first-layer kernels as 5x5 grayscale tiles on a 4x8 grid.
# Plain matplotlib is used here instead of the downloaded tile_raster_images helper,
# so the cell does not depend on wget or on an external file.
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (18.0, 18.0)
fig, axes = plt.subplots(4, 8)
for kernel, ax in zip(kernels, axes.ravel()):
    ax.imshow(kernel.reshape(5, 5), cmap='gray')
    ax.axis('off')