#  Convolutional Neural Networks

# What are Convolutional Neural Networks?

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects, and traffic signs, and they power vision in robots and self-driving cars.


A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) and then followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units. In this article we will discuss the architecture of a CNN and the back propagation algorithm to compute the gradient with respect to the parameters of the model in order to use gradient based optimization. 

> Artificial Intelligence has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous aspects of the field to make amazing things happen. One of many such areas is the domain of Computer Vision.

> The agenda for this field is to enable machines to view the world as humans do, perceive it in a similar manner, and even use the knowledge for a multitude of tasks such as Image & Video recognition, Image Analysis & Classification, Media Recreation, Recommendation Systems, Natural Language Processing, etc. The advancements in Computer Vision with Deep Learning have been constructed and perfected with time, primarily over one particular algorithm: a Convolutional Neural Network.

> A **Convolutional Neural Network (ConvNet/CNN)** is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms.

###  Architecture of a Convolutional Neural Network
<img src='images/cnn.jpeg'>

> A **ConvNet** is able to successfully capture the Spatial and Temporal dependencies in an image through the application of relevant filters. The architecture performs a better fitting to the image dataset due to the reduction in the number of parameters involved and reusability of weights. 

<img src='images/cnn2.jpeg' width='580px'>

<center> A CNN sequence to classify handwritten digits </center>

> Videos are collections of images, or frames, shown in sequence. At a low enough frame rate, a video is effectively just a series of still images.

> Images are made up of pixels, and each pixel is a combination of red, green, and blue (RGB) values. Each of the three channels ranges from 0 to 255, and every color is produced by some combination of these values.

# Image Processing

> In **image processing**, we convolve the images to extract features and then send those features to a neural network for training.

<img src='images/cnn-arch.jpg' width='640px'>

##  Convolution Layer  &  The process of Convolution

> **Convolution** here means extracting details, or features, from an image.

> For convolution, we define **filters** (also called **kernels**) of a certain shape and dimension and use them to extract features from images.

> At the very beginning we have no idea which features are present in the images, so we define some hard-coded filters, one for each kind of feature we expect to find:
   
   * **filter_1** is responsible for extracting the linear shapes.
   * **filter_2** is responsible for extracting the square shapes.
   * **filter_3** is responsible for extracting the circular shapes.

<img src='images/Filters.jpg' width='420px'>

> We apply these filters one by one to the image matrix, and each filter extracts its own features from the image.
  
> **filter_1** extracts all **linear shapes** in the image, **filter_2** extracts all **square shapes**, and **filter_3** extracts the **circular shapes**.

##  Conv Matrices

> The image matrix here has dimensions 5 (height) x 5 (breadth) x 1 (number of channels; an RGB image would have 3).

<img src='images/img-dim.png' width='420px'>

> Each **filter** is overlaid on the **image matrix**. We multiply the overlapping elements, sum the products, and collect the results in a **conv matrix**.

> The **filter** is then moved across the **image matrix** in steps of a fixed number of columns (or rows), known as the **stride**. The filter moves horizontally first and then vertically.

> **Stride** is the number of columns (or rows) by which the kernel is shifted across the image matrix at each step.
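The sliding-window arithmetic described above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function name `convolve2d`, the 5x5 `image`, and the 3x3 `kernel` values are assumptions chosen for the example, and no padding is applied. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of lists) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply overlapping elements and sum the products
            total = sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Illustrative 5x5 single-channel image and 3x3 filter (assumed values)
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel, stride=1)   # 3x3 conv matrix
strided_map = convolve2d(image, kernel, stride=2)   # 2x2 conv matrix
print(feature_map)
print(strided_map)
```

With `stride=1` the 5x5 input yields a 3x3 conv matrix; with `stride=2` the filter skips every other position and the output shrinks to 2x2.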

<img src='images/conv-matrix.jpg' width='480px'>

<img src='images/rgb-filters.gif'>

###  Overlapping filters produce Conv Matrices
<img src='https://d17h27t6h515a5.cloudfront.net/topher/2016/November/581a58be_convolution-schematic/convolution-schematic.gif'>
<img src='images/kernel-moves.png'>

> To extract multiple features from an image, we apply 'n' **filters** and get 'n' **conv matrices**.

<img src='images/conv-n.jpg' width='480px'>

> Successive positions of a **filter** overlap, so the same pixels take part in more than one calculation. We usually keep this overlap so that data considered at one position is also considered at the next.

> We can create and apply 'n' filters to extract various features from the image, each producing a different **conv matrix**. The matrices obtained by sliding filters over the image data are known as **conv matrices** or feature matrices.

> The data extracted by applying filters to the image matrix is not the same as the original image.

> The whole objective here in image processing is to extract features from images.

> At the very beginning, I have no idea what kinds of features are present in the image dataset or which filters to apply. So I start with some filters I created myself, and once I have the **conv matrices**, I can do a little data processing.

> After convolution we still have a large number of values, which is too many, so we pick representative values that summarize their neighbors. To do this, we perform **pooling** or **averaging** operations over the **conv matrices** and reduce the number of values; for example, **each 3x3 matrix in a 3x3x4 stack is reduced to 2x2**.

> A resolution of **1920x1080** means that an image or video frame has 1920 pixels arranged horizontally and 1080 pixels arranged vertically.

# Self - Learnable Kernels

> Before 1998, people used hard-coded filters, each responsible for extracting a single feature from the image. After 1998, self-learnable filters came into the picture.

> Researchers argued that hard-coded kernels are not required for convolution, because they cannot extract enough features to identify or classify some objects in an image correctly. Instead of defining a particular shape for each filter, they made the filters self-learnable. The system creates these self-learnable kernels itself, and only 2 parameters need to be tuned: the shape of the kernels and the number of kernels. This is the concept of 'self-learnable kernels', and all recent CNN models use it.

> With more filters, the network can train itself on more of the patterns available in the images.

> Self-learnable kernels start with random values, so at first there are differences between the **actual (expected) values** and the **predicted values**, and the two do not correlate. Once they correlate and are close to each other, we can say that the features we have extracted are related to our **expected values**.

> These random kernel values are treated as **weights**. If there is no correlation between the expected values and the predicted ones, the network performs backward propagation and adjusts the kernel values step by step, so that the predicted values move closer to the actual values.
 
> We calculate the **local gradients**, perform **local gradient analysis** (or **local gradient decomposition**), and based on that we learn the kernels.

> These **kernels** are data-dependent parameters: they change from one image dataset to another.
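To make the idea of self-learnable kernels concrete, here is a toy sketch of gradient descent recovering a small 1-D kernel from examples. Everything in it is an assumption made for illustration (the `signal` values, the hypothetical `target_kernel`, the learning rate, and the mean squared error loss). Real CNNs learn 2-D kernels through backpropagation across many layers.

```python
import random


def correlate1d(signal, kernel):
    """Slide a 1-D kernel over a 1-D signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


random.seed(0)
signal = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 1.0]
target_kernel = [1.0, -1.0]                 # hypothetical "true" filter
expected = correlate1d(signal, target_kernel)

# Start from random weights, as a self-learnable kernel would
kernel = [random.uniform(-1, 1) for _ in target_kernel]
learning_rate = 0.01

for step in range(2000):
    predicted = correlate1d(signal, kernel)
    errors = [p - e for p, e in zip(predicted, expected)]
    # Gradient of the mean squared error with respect to each kernel weight
    grads = [
        2 * sum(err * signal[i + j] for i, err in enumerate(errors)) / len(errors)
        for j in range(len(kernel))
    ]
    # Move each weight a small step against its gradient
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

print(kernel)
```

After training, the learned weights end up very close to the hypothetical target kernel, which is the sense in which predicted and expected values come to agree.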

#  Padding operation

> In **padding**, we add extra rows and columns all around the input matrix and fill them with zeros. The number of units 'n' to pad depends on the matrix size required in the next step.

   * **Padding with 1 unit** adds 1 line of zeros on every side of the input matrix.
   * **Padding with 2 units** adds 2 lines of zeros on every side of the input matrix.
   
> Padding does not change the meaning of the data or alter the original values in the matrix.
   Performing **convolution** with a stride of 1 over suitably padded data leaves the shape unchanged, so the output matrix has the same shape as the input.
   
> There may be significant features or information on the edges of the image. Without padding, edge pixels are covered by fewer filter positions than central pixels. With padding, the filter covers them more often, so those features are taken into account.
  
####  Why do we pad?

  *  To keep the shape and size of the output matrices the same as the input matrices, we use **padding**. 
  *  If significant features or information lie on the edges of the image and we don't want to lose them, we use **padding**.
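As a small sketch of both reasons, the snippet below zero-pads a matrix and computes the resulting convolution output size. The helper names `zero_pad` and `conv_output_size` and the 3x3 example matrix are assumptions for illustration; the size formula is the standard `(n + 2 * pad - kernel) // stride + 1`.

```python
def zero_pad(matrix, pad):
    """Return a copy of `matrix` with `pad` lines of zeros on every side."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top border
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))      # bottom border
    return padded


def conv_output_size(n, kernel, pad=0, stride=1):
    """Side length of the conv matrix for an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)

print(conv_output_size(5, 3, pad=0))  # output shrinks without padding
print(conv_output_size(5, 3, pad=1))  # 1 unit of padding keeps the size
```

For a 5x5 input and a 3x3 filter, one unit of padding keeps the output at 5x5, while no padding shrinks it to 3x3.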
  
 
<img src='images/stride2-convol.gif'>
<img src='images/padding.gif'>

## Pooling  Operations

> There is no point in considering the same outcome again and again, so it is fine to drop some of the values and keep a single representative one. This is what **pooling operations** do.
   
> In a **pooling operation**, we keep only some of the values so that the data is reduced to a lower resolution. Common pooling operations are **min pooling**, **max pooling**, and **average pooling**.

 * Min Pooling  ->   Picks the minimum value in the selection
 * Max Pooling  ->   Picks the maximum value in the selection
 * Avg pooling  ->   Returns the average of all values in the selection
 
> After the **pooling operation** we are left with the final features, and the amount of data is reduced.
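The three pooling variants listed above differ only in how each window is reduced, which a short sketch can show. The function name `pool2d` and the 4x4 `feature_map` values are assumptions for illustration; the window is 2x2 with a stride of 2.

```python
def pool2d(matrix, size=2, stride=2, mode="max"):
    """Reduce each size x size window of `matrix` to a single value."""
    reducers = {
        "max": max,
        "min": min,
        "avg": lambda values: sum(values) / len(values),
    }
    reduce_window = reducers[mode]
    out_h = (len(matrix) - size) // stride + 1
    out_w = (len(matrix[0]) - size) // stride + 1
    return [
        [
            reduce_window([
                matrix[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Illustrative 4x4 conv matrix (assumed values)
feature_map = [
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
]

max_pooled = pool2d(feature_map, mode="max")
min_pooled = pool2d(feature_map, mode="min")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled, min_pooled, avg_pooled)
```

Each call turns the 4x4 input into a 2x2 output, so the amount of data drops by a factor of four regardless of the reduction used.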

<img src='images/pooling.gif' width='480px'>

> Max pooling also acts as a noise suppressant: it discards noisy activations altogether, performing de-noising along with dimensionality reduction. Average pooling, on the other hand, only performs dimensionality reduction as its noise-suppressing mechanism. For this reason, max pooling often performs better than average pooling in practice.

<img src='images/max-pool.png' width='360px'>


 > We can perform **convolution**, **padding**, and **pooling** operations multiple times, in each layer of the CNN.

## Flattening Operation

> Once the image data has been convolved, we flatten the entire result.

> Flattening means that an (m x m x n) feature matrix is converted to an (m·m·n x 1) column vector, which is then sent into the **neural network**, i.e. the **fully connected (f-c)** layer.

> The **f-c layer** is responsible for learning the relationships between the features extracted by the **CNN layers**.
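A minimal sketch of flattening, assuming a hypothetical 2 x 2 feature volume with 3 channels stored as nested lists (the name `flatten` and the values are illustrative only):

```python
def flatten(volume):
    """Flatten a (height x width x channels) nested list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


# Hypothetical 2 x 2 x 3 feature volume
volume = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

flat = flatten(volume)
print(len(flat))  # 2 * 2 * 3 values
print(flat)
```

The flat vector has m·m·n entries, which is the input width the first fully connected layer must accept.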

<img src='images/flatten.png' width='420px'>

> **The main idea & objective behind a CNN network is to learn some patterns out of Image data using some kernels.**

# Architecture of CNN networks

> A **CNN** is built by combining 2 kinds of layers: **convolution layers** and **f-c layers**.

<img src='images/cnn-architecture.jpg' width='480px'>

> In the **convolution layers**, we perform convolution, padding, pooling, and flattening operations over the image data. These layers learn what to extract from the images; the result is then flattened and sent to the **f-c layer**.

> The **f-c layer** yields the final result. It learns the relationships between the extracted features and produces the outcome.

> Learning happens in both the **convolution** and **f-c** layers. The **convolution layers** learn to adjust the **kernels**, while the **f-c layer** learns the relationships in terms of weights.

## TensorFlow Convolution Layer

Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions to create your own convolutional layers.

The example code further below uses the `tf.nn.conv2d()` function to compute the convolution with `weight` as the filter and `[1, 2, 2, 1]` for the strides. TensorFlow uses a stride for each `input` dimension, `[batch, input_height, input_width, input_channels]`. We generally set the stride for `batch` and `input_channels` (i.e. the first and fourth elements in the strides array) to `1`.

You'll focus on changing `input_height` and `input_width` while leaving `batch` and `input_channels` at `1`. The `input_height` and `input_width` strides control how the filter moves over `input`. This example code uses a stride of 2 with a 5x5 filter over `input`.

The `tf.nn.bias_add()` function adds a 1-d bias to the last dimension in a matrix.
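The stride and the `padding` argument together determine the output height and width. The sketch below applies the output-size rules that TensorFlow documents for `'SAME'` and `'VALID'` padding. The helper name `output_size` and the 28-pixel input side are assumptions for illustration, not part of the TensorFlow API.

```python
import math


def output_size(input_size, filter_size, stride, padding):
    """Output height (or width) of a convolution along one dimension."""
    if padding == "SAME":
        # Input is zero-padded so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: the filter must fit entirely inside the input
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# Hypothetical 28-pixel input side, 5x5 filter, stride of 2
same_side = output_size(28, 5, 2, "SAME")
valid_side = output_size(28, 5, 2, "VALID")
print(same_side, valid_side)
```

With `'SAME'` padding the filter size does not affect the output size, which is why a stride of 2 simply halves the height and width.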

## TensorFlow Max Pooling
![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/582aac09_max-pooling/max-pooling.png)


The image above is an example of max pooling with a 2x2 filter and stride of 2. The four 2x2 colors represent each time the filter was applied to find the maximum value.

For example, `[[1, 0], [4, 6]]` becomes `6`, because `6` is the maximum value in this set. Similarly, `[[2, 3], [6, 8]]` becomes `8`.

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.

TensorFlow provides the `tf.nn.max_pool()` function to apply max pooling to your convolutional layers.

In [None]:
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

The `tf.nn.max_pool()` function performs max pooling with the `ksize` parameter as the size of the filter and the `strides` parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`). For both `ksize` and `strides`, the `batch` and `channel` dimensions are typically set to `1`.

## 1x1 Convolutions

#### [1x1 Convolutions Video](https://www.youtube.com/watch?v=Zmzgerm6SjA)

## Inception Module

#### [Inception Module Video](https://www.youtube.com/watch?v=SlTm03bEOxA)


## Convolutional Network in TensorFlow

It's time to walk through an example Convolutional Neural Network (CNN) in TensorFlow.

The structure of this network follows the classic structure of CNNs, which is a mix of convolutional layers and max pooling, followed by fully-connected layers.

The code we'll be looking at is similar to what we saw in the segment on Deep Neural Network in TensorFlow, except we've restructured the architecture of this network as a CNN.

Just like in that segment, here we'll study the line-by-line breakdown of the code. [Link to download the code and run it.](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58a61ca1_cnn/cnn.zip)

We've seen this section of code from previous lessons. Here we're importing the MNIST dataset and using a convenient TensorFlow function to batch, scale, and One-Hot encode the data.

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 1
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if you're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units

### Weights and Biases

In [None]:
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}

The above is an example of a convolution with a 3x3 filter and a stride of 1 being applied to data with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight, `[[1, 0, 1], [0, 1, 0], [1, 0, 1]]`, then a bias is added to create the convolved feature on the right. In this case, the bias is zero. In TensorFlow, this is all done using `tf.nn.conv2d()` and `tf.nn.bias_add()`.

In [None]:
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

The `tf.nn.conv2d()` function computes the convolution against weight `W` as shown above.

In TensorFlow, stride is an array of 4 elements; the first element in the stride array indicates the stride for batch and last element indicates stride for features. It's good practice to remove the batches or features you want to skip from the data set rather than use stride to skip them. You can always set the first and last element to 1 in stride in order to use all batches and features.

The middle two elements are the strides for height and width respectively. I've mentioned stride as one number because you usually have a square stride where `height = width`. When someone says they are using a stride of 3, they usually mean `tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])`.

To make life easier, the code is using `tf.nn.bias_add()` to add the bias. Using `tf.add()` doesn't work when the tensors aren't the same shape.

The above is an example of max pooling with a 2x2 filter and stride of 2. The left square is the input and the right square is the output. The four 2x2 colors in the input represent each time the filter was applied to create the max on the right side. For example, `[[1, 1], [5, 6]]` becomes `6` and `[[3, 2], [1, 2]]` becomes `3`.

In [None]:
def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')

The `tf.nn.max_pool()` function does exactly what you would expect: it performs max pooling with the `ksize` parameter as the size of the filter.

### Model

![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/581a64b7_arch/arch.png)

In the code below, we're creating 2 layers that each combine a convolution with max pooling, followed by a fully connected layer and an output layer. The transformation of each layer to new dimensions is shown in the comments. For example, the first layer shapes the images from 28x28x1 to 28x28x32 in the convolution step. The next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from `conv1` to `output`, producing 10 class predictions.
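Before reading the model code, it can help to trace the shapes and count the parameters by hand. The sketch below does this in plain Python using the filter shapes from the `weights` dictionary above. The helper names `conv_same` and `maxpool_same` are assumptions for illustration, and both model `'SAME'` padding.

```python
import math


def conv_same(shape, out_channels, stride=1):
    """Shape after a SAME-padded convolution."""
    h, w, _ = shape
    return (math.ceil(h / stride), math.ceil(w / stride), out_channels)


def maxpool_same(shape, k=2):
    """Shape after SAME-padded k x k max pooling with stride k."""
    h, w, c = shape
    return (math.ceil(h / k), math.ceil(w / k), c)


shape = (28, 28, 1)          # one MNIST image
trace = [shape]
for out_channels in (32, 64):
    shape = conv_same(shape, out_channels)
    trace.append(shape)
    shape = maxpool_same(shape)
    trace.append(shape)

flattened = shape[0] * shape[1] * shape[2]   # input width of 'wd1'

# Weights plus biases for each entry of the weights/biases dictionaries
param_count = (
    5 * 5 * 1 * 32 + 32          # wc1, bc1
    + 5 * 5 * 32 * 64 + 64       # wc2, bc2
    + flattened * 1024 + 1024    # wd1, bd1
    + 1024 * 10 + 10             # out
)
print(trace)
print(flattened, param_count)
```

Most of the parameters sit in the first fully connected layer, which is typical for this kind of architecture.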

In [None]:
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

In [None]:
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(\
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\
    .minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3} - '
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))

In [None]:
We can likely improve the accuracy further by increasing the number of epochs.

# Simple CNN Network for Dog_Cat classification

In [1]:
# Importing the required modules
import os
import numpy as np

from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

Using TensorFlow backend.


## Setting up Directories

In [10]:
# Setting Train and Validation Directories
base_dir = 'parent'
# Join only the sub-folder name; base_dir already contains 'parent'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')

# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')

# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')

# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')

## Creating a PipeLine for CNN Network

In [11]:
# Creating an object of sequential API
classifier = Sequential()

# Step 1 - Convolution
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))

# Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Step 3 - Flattening
classifier.add(Flatten())

# Step 4 - Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))

# Compiling the CNN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])



## Training of Model

In [16]:
# Normalizing the Image dataset

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
    
training_set = train_datagen.flow_from_directory('parent/train', target_size = (64, 64), batch_size = 32,
                                                 class_mode = 'binary')
    
test_set = test_datagen.flow_from_directory('parent/validation', target_size = (64, 64), batch_size = 32,
                                               class_mode = 'binary')
   
# Fitting the pipeline to data
model = classifier.fit_generator(training_set, steps_per_epoch = 100, epochs = 2, validation_data = test_set,
                               validation_steps = 5)

# Finally saving the trained CNN model
classifier.save("model.h5")
print("Saved model to disk")
    

Found 10003 images belonging to 2 classes.
Found 2022 images belonging to 2 classes.
Epoch 1/2
Epoch 2/2
Saved model to disk


##  Getting probabilities  for the prediction

In [7]:
# Defining a Function for Prediction

def predict(img):
    
    # Importing modules for predictions
    import numpy as np
    from keras.models import load_model
    from keras.preprocessing import image
    
    test_image = image.load_img(img, target_size = (64, 64))
    test_image = image.img_to_array(test_image)
    # Rescale to [0, 1] to match the 1./255 rescaling used during training
    test_image = test_image / 255.
    test_image = np.expand_dims(test_image, axis = 0)
    
    # Getting prediction from the trained model saved by the training cell
    model = load_model('model.h5')
    result = model.predict(test_image)
    #result = classifier.predict(test_image) 
        
    # The sigmoid output is a probability, so threshold it at 0.5.
    # flow_from_directory assigns class indices alphabetically: cats = 0, dogs = 1
    if result[0][0] >= 0.5:
        prediction = 'dog'
    else:
        prediction = 'cat'
    print(prediction)
        
# Driver code
# Taking Image input
img = input('Copy the path of Image: ')
predict(img)

Copy the path of Image: parent/validation/cats/cat.4545.jpg
dog


## Conclusion

 * We have used a simple CNN model to classify the Cat-Dog image dataset.
 * Adding more convolution layers, pooling, and padding may give better-optimized results.
 * Keras Tuner can be used to tune the hyper-parameters of this CNN.

#  LeNet, or Gradient-Based Learning Applied to Document Recognition

## Basic Introduction

> LeNet-5, introduced in the paper Gradient-Based Learning Applied to Document Recognition, is a very efficient convolutional neural network for handwritten character recognition tasks. Paper: <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf" target="_blank"><u>Gradient-Based Learning Applied to Document Recognition</u></a>

**Authors**: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner

**Published in**: Proceedings of the IEEE (1998)


## Structure of the LeNet network

> LeNet-5 is a small network that contains the basic modules of deep learning: convolutional layers, pooling layers, and fully connected layers. It is the basis of other deep learning models. Here we analyze LeNet-5 in depth.

![lenet](images/lenet-5.png)


> The LeNet-5 model has seven layers, not counting the input, and each layer contains trainable parameters. Each layer has multiple feature maps. Each feature map extracts one characteristic of its input by means of a convolution filter and consists of multiple neurons.

![lenet1](images/arch.jpg)


##  Detailed explanation of each layer parameters in the Lenet model

#### **INPUT Layer**

> The first layer is the data **INPUT layer**. The size of the input image is uniformly normalized to 32 * 32.

> Note: This layer does not count as part of the network structure of LeNet-5. Traditionally, the input layer is not considered one of the network's layers.


#### **C1 layer-convolutional layer**

>**Input picture**: 32 * 32

>**Convolution kernel size**: 5 * 5

>**Convolution kernel types**: 6

>**Output featuremap size**: 28 * 28 (32 - 5 + 1 = 28)

>**Number of neurons**: 28 * 28 * 6

>**Trainable parameters**: (5 * 5 + 1) * 6 = 156 (5 * 5 = 25 unit parameters and one bias parameter per filter, a total of 6 filters)

>**Number of connections**: (5 * 5 + 1) * 6 * 28 * 28 = 122304

**Detailed description:** 

>1. The first convolution operation is performed on the input image (using 6 convolution kernels of size 5 * 5) to obtain 6 C1 feature maps (6 feature maps of size 28 * 28, 32-5 + 1 = 28). 

>2. Let's take a look at how many parameters are needed. The size of the convolution kernel is 5 * 5, and there are 6 * (5 * 5 + 1) = 156 parameters in total, where +1 indicates that a kernel has a bias. 

>3. For the convolutional layer C1, each pixel in C1 is connected to 5 * 5 pixels and 1 bias in the input image, so there are 156 * 28 * 28 = 122304 connections in total. There are 122,304 connections, but we only need to learn 156 parameters, mainly through weight sharing.
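The C1 arithmetic above can be checked with a few lines of Python. The helper name `conv_layer_stats` is an assumption for illustration; it follows the same counting convention as the text (one bias per filter, and every output pixel connected to its 5 * 5 window plus the bias).

```python
def conv_layer_stats(input_size, kernel_size, in_maps, out_maps):
    """Return (output size, trainable parameters, connections) for a conv layer."""
    out_size = input_size - kernel_size + 1
    # Each filter has kernel_size^2 weights per input map plus one bias
    params = (kernel_size * kernel_size * in_maps + 1) * out_maps
    # Every output pixel uses all of its filter's weights and its bias
    connections = params * out_size * out_size
    return out_size, params, connections


c1_size, c1_params, c1_connections = conv_layer_stats(32, 5, 1, 6)
print(c1_size, c1_params, c1_connections)
```

The gap between connections and parameters is the effect of weight sharing: the same 156 values are reused at every position of the 28 * 28 output.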


### **S2 layer-pooling layer (downsampling layer)**

>**Input**: 28 * 28

>**Sampling area**: 2 * 2

>**Sampling method**: 4 inputs are added, multiplied by a trainable parameter, and a trainable offset is added. The result is passed through a sigmoid.

>**Sampling type**: 6

>**Output featureMap size**: 14 * 14 (28/2)

>**Number of neurons**: 14 * 14 * 6

>**Trainable parameters**: 2 * 6 = 12 (the weight of the sum + the offset)

>**Number of connections**: (2 * 2 + 1) * 6 * 14 * 14 = 5880

>The size of each feature map in S2 is 1/4 of the size of the feature map in C1.


**Detailed description:** 

> The first convolution is followed immediately by pooling. Pooling uses 2 * 2 windows and produces S2: 6 feature maps of size 14 * 14 (28/2 = 14). 

> Each unit in S2 takes the sum of the values in a 2 * 2 area of C1, multiplies it by a weight coefficient, adds an offset, and passes the result through a sigmoid. 
Each pooling kernel therefore has two trainable parameters, which gives 2x6 = 12 trainable parameters, but there are 5x14x14x6 = 5880 connections.
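The subsampling rule and its counts can be sketched as follows. The function names `subsample` and `subsample_stats`, and the weight and bias values used in the example call, are assumptions for illustration. The counting follows the convention used in the text (4 inputs plus 1 bias per output unit).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def subsample(feature_map, weight, bias, size=2):
    """Sum each size x size window, scale, shift, and apply a sigmoid."""
    output = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window_sum = sum(
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            )
            row.append(sigmoid(weight * window_sum + bias))
        output.append(row)
    return output


def subsample_stats(input_size, maps, size=2):
    """Return (output size, trainable parameters, connections)."""
    out_size = input_size // size
    params = 2 * maps                      # one weight and one offset per map
    connections = (size * size + 1) * maps * out_size * out_size
    return out_size, params, connections


# A 28 x 28 map of zeros stands in for one C1 feature map
c1_map = [[0.0] * 28 for _ in range(28)]
s2_map = subsample(c1_map, weight=1.0, bias=0.0)

s2_stats = subsample_stats(28, 6)
s4_stats = subsample_stats(10, 16)
print(len(s2_map), len(s2_map[0]), s2_stats, s4_stats)
```

The same helper reproduces the numbers quoted for the S4 layer, since S4 applies the same rule to the 16 feature maps of C3.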


###  **C3 layer - 2nd Convolutional layer**

>**Input**: all 6 or several feature map combinations in S2

>**Convolution kernel size**: 5 * 5

>**Convolution kernel type**: 16

>**Output featureMap size**: 10 * 10 (14-5 + 1) = 10

>Each feature map in C3 is connected to all 6 or to a subset of the feature maps in S2, so each feature map of this layer is a different combination of the feature maps extracted by the previous layer.

>One scheme is that the first 6 feature maps of C3 each take a subset of 3 adjacent feature maps in S2 as input. The next 6 feature maps each take a subset of 4 adjacent feature maps in S2 as input. The next 3 each take a subset of 4 non-adjacent feature maps as input. The last one takes all the feature maps in S2 as input.

>**The trainable parameters are**: 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 +1) = 1516

>**Number of connections**: 10 * 10 * 1516 = 151600

**Detailed description:** 

> The second convolution follows the first pooling. Its output is C3, 16 feature maps of size 10 x 10, and the convolution kernel size is again 5 * 5. S2 has only 6 feature maps of size 14 * 14, so how do we get 16 feature maps from 6? Each of the 16 feature maps is computed from a particular combination of the S2 feature maps, as follows:

> The first 6 feature maps of C3 (see the first red box in the figure below) are each connected to 3 adjacent feature maps of the S2 layer, the next 6 feature maps (the second red box) are each connected to 4 adjacent feature maps of the S2 layer, the next 3 feature maps are each connected to 4 non-adjacent feature maps of the S2 layer, and the last one is connected to all the feature maps of the S2 layer. The convolution kernel size is still 5 * 5, so there are 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 + 1) = 1516 parameters. Each output feature map is 10 * 10, so there are 10 * 10 * 1516 = 151600 connections.

![lenet1](images/c31.png)


The convolution structure between C3 and the first 3 feature maps in S2 is shown below:

![lenet1](images/c32.png)
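
The C3 parameter count can be reproduced from a connection table. The sketch below is illustrative: the helper `window` is hypothetical, and the exact subsets (in particular the three non-adjacent ones) are an assumption based on the grouping described above, so they should be checked against the figure. Only the subset sizes affect the totals.

```python
# Which S2 maps (indexed 0-5) feed each of the 16 C3 maps.
# The exact subsets are an assumption; only their sizes matter for the counts.
num_s2_maps = 6

def window(start, length):
    """Indices of `length` adjacent S2 maps starting at `start`, wrapping around."""
    return [(start + k) % num_s2_maps for k in range(length)]

connections = (
    [window(i, 3) for i in range(6)]               # 6 maps, 3 adjacent inputs each
    + [window(i, 4) for i in range(6)]             # 6 maps, 4 adjacent inputs each
    + [[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]]   # 3 maps, 4 non-adjacent inputs
    + [list(range(num_s2_maps))]                   # 1 map connected to all 6
)

# One 5 * 5 kernel per connected input map, plus one bias per C3 map.
c3_params = sum(len(inputs) * 5 * 5 + 1 for inputs in connections)
c3_connections = c3_params * 10 * 10   # each C3 map is 10 * 10

print(len(connections))   # 16
print(c3_params)          # 1516
print(c3_connections)     # 151600
```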


###  **S4 layer - Pooling layer (downsampling layer)**

>**Input**: 10 * 10

>**Sampling area**: 2 * 2

>**Sampling method**: the 4 inputs are summed, multiplied by a trainable weight, and added to a trainable bias; the result is passed through a sigmoid

>**Number of feature maps**: 16

>**Output feature map size**: 5 * 5 (10/2)

>**Number of neurons**: 5 * 5 * 16 = 400

>**Trainable parameters**: 2 * 16 = 32 (the weight of the sum + the offset)

>**Number of connections**: 16 * (2 * 2 + 1) * 5 * 5 = 2000

>The size of each feature map in S4 is 1/4 of the size of the feature map in C3

**Detailed description:**

> S4 is a pooling layer whose window size is still 2 * 2. The 16 feature maps of size 10 x 10 from the C3 layer are pooled in units of 2 x 2 to obtain 16 feature maps of size 5 x 5. This layer has 2 x 16 = 32 trainable parameters and (2 x 2 + 1) x 5 x 5 x 16 = 2000 connections.

> *The connection is similar to the S2 layer.*
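
Since S2 and S4 follow the same pattern, their counts can be computed by one small function. This is a sketch only; `subsampling_counts` is a hypothetical helper name, not something from the original text.

```python
def subsampling_counts(num_maps, output_size, window=2):
    """Return (trainable parameters, connections) for a LeNet-style pooling layer."""
    params = 2 * num_maps               # one weight + one bias per feature map
    per_output = window * window + 1    # 4 inputs + 1 bias per output unit
    connections = per_output * num_maps * output_size * output_size
    return params, connections

print(subsampling_counts(num_maps=6, output_size=14))   # (12, 5880)   -> S2
print(subsampling_counts(num_maps=16, output_size=5))   # (32, 2000)   -> S4
```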

###  **C5 layer - Convolution layer**

>**Input**: all 16 feature maps of the S4 layer (each C5 unit is connected to all of S4)

>**Convolution kernel size**: 5 * 5

>**Number of convolution kernels**: 120

>**Output feature map size**: 1 * 1 (5 - 5 + 1 = 1)

>**Trainable parameters / connection**: 120 * (16 * 5 * 5 + 1) = 48120

**Detailed description:**


> The C5 layer is a convolutional layer. Since each of the 16 feature maps of the S4 layer is 5 x 5, the same size as the convolution kernel, each feature map produced by the convolution is 1 x 1. There are 120 such outputs, each connected to all 16 feature maps of the previous layer. So there are (5 x 5 x 16 + 1) x 120 = 48120 parameters, and also 48120 connections. The network structure of the C5 layer is as follows:

![lenet1](images/c5.png)
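
Because the kernel covers the whole 5 x 5 input, C5 behaves like a fully connected layer. The NumPy sketch below illustrates this with random placeholder values; the array names are hypothetical, and "convolution" is taken in the deep-learning sense (no kernel flip).

```python
import numpy as np

rng = np.random.default_rng(0)
s4 = rng.standard_normal((5, 5, 16))             # placeholder S4 output
kernels = rng.standard_normal((5, 5, 16, 120))   # 120 kernels spanning all 16 maps
biases = rng.standard_normal(120)

# Convolution without padding: the kernel fits at exactly one position.
conv_out = np.einsum('hwc,hwck->k', s4, kernels) + biases

# The same computation written as a fully connected layer on the flattened input.
dense_out = s4.reshape(-1) @ kernels.reshape(-1, 120) + biases

print(conv_out.shape)                     # (120,)
print(np.allclose(conv_out, dense_out))   # True
print(kernels.size + biases.size)         # 48120
```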


### **F6 layer - Fully Connected layer**

>**Input**: the 120-dimensional output vector of C5

>**Calculation method**: compute the dot product between the input vector and the weight vector, add a bias, and pass the result through the sigmoid function.

>**Trainable parameters**: 84 * (120 + 1) = 10164

**Detailed description:**

> F6 is a fully connected layer with 84 nodes, corresponding to a 7 x 12 bitmap in which -1 means white and 1 means black, so the black-and-white pattern of each symbol's bitmap corresponds to a code. The number of trainable parameters and the number of connections for this layer are both (120 + 1) x 84 = 10164. The ASCII encoding diagram is as follows:

![lenet1](images/f61.png)

> The connection method of the F6 layer is as follows:

![lenet1](images/f62.png)
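
The F6 computation described above (dot product, bias, sigmoid) can be sketched as follows. The values are random placeholders and the array names are hypothetical; the sketch only shows the shapes involved and the parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)
c5_out = rng.standard_normal(120)          # placeholder C5 output vector
weights = rng.standard_normal((84, 120))   # one weight vector per F6 node
biases = rng.standard_normal(84)           # one bias per F6 node

# Dot product with each weight vector, plus bias, passed through a sigmoid.
f6_out = 1.0 / (1.0 + np.exp(-(weights @ c5_out + biases)))

print(f6_out.shape)                 # (84,)
print(weights.size + biases.size)   # 10164
```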

### **Output layer - Fully connected layer**

> The output layer is also a fully connected layer, with a total of 10 nodes representing the digits 0 to 9. The node whose output is closest to 0 determines the result: if node i has the smallest value, the network recognizes the digit i. The nodes are radial basis function (RBF) units. Assuming x is the input from the previous layer and y is the output of the RBF, the RBF output is computed as:

![lenet1](images/81.png)

> In the formula above, the weights w_ij are determined by the bitmap encoding of i, where i ranges from 0 to 9 and j ranges from 0 to 7 * 12 - 1. The closer the RBF output y_i is to 0, the closer the input is to the ASCII encoding bitmap of i, and the more likely it is that the network's recognition result is the character i. This layer has 84 x 10 = 840 parameters and connections.

![lenet1](images/82.png)
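
A small sketch of the RBF output follows. It assumes that the formula in the figure is the squared Euclidean distance y_i = sum over j of (x_j - w_ij)^2, which is how LeNet-5's RBF units are usually described. The codes below are made-up placeholder patterns rather than the real 7 x 12 bitmaps, and `rbf_output` is a hypothetical name.

```python
import numpy as np

def rbf_output(x, w):
    """Squared Euclidean distance between x (84,) and each row of w (10, 84)."""
    return ((x - w) ** 2).sum(axis=1)

# Placeholder codes with entries in {-1, +1}, one distinct pattern per digit.
codes = -np.ones((10, 84))
for digit in range(10):
    codes[digit, digit::10] = 1.0

x = codes[3].copy()        # an F6 output that exactly matches the code for digit 3
y = rbf_output(x, codes)

print(y[3])                # 0.0
print(int(np.argmin(y)))   # 3
```

The predicted digit is the index of the smallest output, which matches the rule that an output close to 0 means a close match.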



**Summary of LeNet model**


* LeNet-5 is a very efficient convolutional neural network for handwritten character recognition.
* Convolutional neural networks can make good use of the structural information of images.
* The convolutional layers have few parameters, a consequence of their two main characteristics: local connections and shared weights.





In [None]:
       ###################    ###############   Lenet Model Pipeline   ###############    #################

 =>  Building the Model Pipeline by using Sequential API

       >>>  model = Sequential()

 =>  Apply 6 convolution kernels of size 5 * 5 to get 6 feature maps. LeNet-5 takes a 32 * 32 input, so each
    feature map has size 32 - 5 + 1 = 28, and the neuron count per map drops from 32 * 32 = 1024 to 28 * 28 = 784.
    MNIST images are 28 * 28, so padding='same' is used here to keep the 28 * 28 output size.
    Parameters between input layer and C1 layer: 6 * (5 * 5 + 1) = 156

      >>>  model.add(Conv2D(6, kernel_size=(5, 5), padding='same', activation='relu', input_shape=(28, 28, 1)))

 =>  The input of this layer is the output of the first layer, which is a 28 * 28 * 6 node matrix.
 =>  The size of the filter used in this layer is 2 * 2, and the stride is 2 in both height and width, so the 
       output matrix size of this layer is 14 * 14 * 6.

     >>>  model.add(MaxPooling2D(pool_size=(2, 2)))

 =>  The input matrix size of this layer is 14 * 14 * 6, the filter size used is 5 * 5, and the depth is 16.
      This layer does not use zero padding, and the stride is 1, so the
     output matrix size of this layer is 10 * 10 * 16. Because Keras connects every filter to all 6 input maps,
     this layer has 5 * 5 * 6 * 16 + 16 = 2416 parameters (rather than the 1516 of the partially connected C3).

     >>>  model.add(Conv2D(16, kernel_size=(5, 5), activation='relu'))

 =>  The input matrix size of this layer is 10 * 10 * 16. The size of the filter used in this layer is 2 * 2, and 
      the stride is 2 in both height and width, so the output matrix size of this layer is 5 * 5 * 16.

     >>>  model.add(MaxPooling2D(pool_size=(2, 2)))

 =>  The input matrix size of this layer is 5 * 5 * 16. This layer is called a convolution layer in the LeNet-5
       paper, but because the filter size is also 5 * 5, it is no different from a fully connected layer.
     If the nodes in the 5 * 5 * 16 matrix are flattened into a vector, then this layer is the same as a fully 
         connected layer. 
            
 =>  The number of output nodes in this layer is 120, with a total of 5 * 5 * 16 * 120 + 120 = 48120 parameters.

     >>>  model.add(Flatten())
 
 =>  Adding a Dense Layer to the network 
    
     >>>  model.add(Dense(120, activation='relu'))

 =>  The number of input nodes in this layer is 120 and the number of output nodes is 84. The total number of
         parameters is 120 * 84 + 84 = 10164 (w + b).

     >>>  model.add(Dense(84, activation='relu'))

 =>  The number of input nodes in this layer is 84 and the number of output nodes is 10. The total number of
         parameters is 84 * 10 + 10 = 850.

     >>>  model.add(Dense(10, activation='softmax'))

 =>  Compiling the model
 
     >>> model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(),
                       metrics=['accuracy'])

 =>  Training of LeNet model

     >>> model.fit(x_train, y_train, batch_size=128, epochs=20, verbose=1, validation_data=(x_test, y_test))
    
 =>  Getting Accuracy score from the trained LeNet model.

     >>> score = model.evaluate(x_test, y_test)
    
     >>> print ('Test Loss:', score[0], 'Test accuracy:', score[1], sep=' ')
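
The layer sizes and parameter counts quoted in the walkthrough above can be sanity-checked without Keras. The following is a rough sketch: `conv_out` is a hypothetical helper, and the parameter formulas assume that every kernel sees all input channels, as Keras layers do.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

c1 = conv_out(32, kernel=5)              # LeNet-5 input is 32 * 32
s2 = conv_out(c1, kernel=2, stride=2)
c3 = conv_out(s2, kernel=5)
s4 = conv_out(c3, kernel=2, stride=2)
c5 = conv_out(s4, kernel=5)
print(c1, s2, c3, s4, c5)                # 28 14 10 5 1

# A 28 * 28 image zero-padded by 2 pixels per side also gives a 28 * 28 map.
print(conv_out(28, kernel=5, padding=2)) # 28

# Keras-style parameter counts (weights + biases) per trainable layer.
params = {
    'C1': 6 * (5 * 5 * 1 + 1),
    'C3': 16 * (5 * 5 * 6 + 1),
    'C5': 120 * (5 * 5 * 16 + 1),
    'F6': 84 * (120 + 1),
    'Output': 10 * (84 + 1),
}
print(params['C3'], params['C5'])        # 2416 48120
print(sum(params.values()))              # 61706
```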



# MNIST Digits recognition or Classification with LeNet model

In [None]:
# Importing the required modules
import keras
from keras.datasets import mnist
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dense, Flatten
from keras.models import Sequential

# Loading the dataset and perform splitting
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Performing reshaping operation (adds the channel axis expected by Conv2D)
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

# Normalization of dataset
x_train = x_train / 255
x_test = x_test / 255

# One Hot Encoding
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)


# Building the Model Pipeline or architecture
model = Sequential()
# Apply 6 convolution kernels of size 5 * 5 to get 6 feature maps. LeNet-5 takes a 32 * 32 input, so each feature map has size 32 - 5 + 1 = 28,
# and the neuron count per map drops from 32 * 32 = 1024 to 28 * 28 = 784.
# MNIST images are 28 * 28, so padding='same' is used here to keep the 28 * 28 output size.
# Parameters between input layer and C1 layer: 6 * (5 * 5 + 1) = 156
model.add(Conv2D(6, kernel_size=(5, 5), padding='same', activation='relu', input_shape=(28, 28, 1)))
# The input of this layer is the output of the first layer, which is a 28 * 28 * 6 node matrix.
# The size of the filter used in this layer is 2 * 2, and the stride is 2 in both height and width, so the output matrix size of this layer is 14 * 14 * 6.
model.add(MaxPooling2D(pool_size=(2, 2)))
# The input matrix size of this layer is 14 * 14 * 6, the filter size used is 5 * 5, and the depth is 16. This layer does not use zero padding, and the stride is 1.
# The output matrix size of this layer is 10 * 10 * 16. This layer has 5 * 5 * 6 * 16 + 16 = 2416 parameters, since Keras connects every filter to all 6 input maps.
model.add(Conv2D(16, kernel_size=(5, 5), activation='relu'))
# The input matrix size of this layer is 10 * 10 * 16. The size of the filter used in this layer is 2 * 2, and the stride is 2 in both height and width, so the output matrix size of this layer is 5 * 5 * 16.
model.add(MaxPooling2D(pool_size=(2, 2)))
# The input matrix size of this layer is 5 * 5 * 16. This layer is called a convolution layer in the LeNet-5 paper, but because the filter size is also 5 * 5,
# it is no different from a fully connected layer. If the nodes in the 5 * 5 * 16 matrix are flattened into a vector, then this layer is the same as a fully connected layer.
# The number of output nodes in this layer is 120, with a total of 5 * 5 * 16 * 120 + 120 = 48120 parameters.
model.add(Flatten())
model.add(Dense(120, activation='relu'))
# The number of input nodes in this layer is 120 and the number of output nodes is 84. The total number of parameters is 120 * 84 + 84 = 10164 (w + b)
model.add(Dense(84, activation='relu'))
# The number of input nodes in this layer is 84 and the number of output nodes is 10. The total number of parameters is 84 * 10 + 10 = 850
model.add(Dense(10, activation='softmax'))

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

# Training of LeNet model on MNIST dataset
model.fit(x_train, y_train, batch_size=128, epochs=20, verbose=1, validation_data=(x_test, y_test))

# Getting the Accuracy score
score = model.evaluate(x_test, y_test)

print('Test Loss:', score[0])
print('Test accuracy:', score[1])