#  Convolutional Neural Networks

# What are Convolutional Neural Networks?

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, as well as powering vision in robots and self-driving cars.


A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) and then followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units. In this article we will discuss the architecture of a CNN and the back propagation algorithm to compute the gradient with respect to the parameters of the model in order to use gradient based optimization. 

> Artificial Intelligence has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous aspects of the field to make amazing things happen. One of many such areas is the domain of Computer Vision.

> The agenda for this field is to enable machines to view the world as humans do, perceive it in a similar manner and even use the knowledge for a multitude of tasks such as Image & Video recognition, Image Analysis & Classification, Media Recreation, Recommendation Systems, Natural Language Processing, etc. The advancements in Computer Vision with Deep Learning have been constructed and perfected with time, primarily over one particular algorithm: a Convolutional Neural Network.

> A **Convolutional Neural Network (ConvNet/CNN)** is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms.

###  Architecture of a Convolutional Neural Network
<img src='images/cnn.jpeg'>

> A **ConvNet** is able to successfully capture the Spatial and Temporal dependencies in an image through the application of relevant filters. The architecture performs a better fitting to the image dataset due to the reduction in the number of parameters involved and reusability of weights. 

<img src='images/cnn2.jpeg' width='580px'>

<center> A CNN sequence to classify handwritten digits </center>

> A video is a collection of multiple images, or frames. At a low enough frame rate (FPS), a video is effectively just a sequence of still images.

> Images are made up of pixels, and each pixel is a combination of red, green and blue (RGB) values. Each RGB value ranges from 0 to 255, and every color is produced by some combination of the three.

# Image Processing

> In **Image Processing**, we convolve the images to extract features from them and then send those features to a Neural Network for training.

<img src='images/cnn-arch.jpg' width='640px'>

##  Convolution Layer  &  The process of Convolution

> **Convolution** means extracting details, or features, from images.

> For convolution, we define some **Filters or Kernels** of a certain shape and dimension and then use them to extract features from images.

> At the very beginning, we have no idea which features are present in the images, so we define some hard-coded filters, one for each kind of feature we expect to find.
   
   * **filter_1** is responsible for extracting the Linear shapes.
   * **filter_2** is responsible for extracting the Square shapes
   * **filter_3** is responsible for extracting the Circular shapes

<img src='images/Filters.jpg' width='420px'>

> We apply these filters one by one to the image matrix, and each filter extracts its features from the image.

> **filter_1** extracts all **Linear shapes** inside the image, **filter_2** extracts all **Square shapes** and **filter_3** extracts the **Circular shapes**.

##  Conv Matrices

> In this example, the image matrix has dimensions 5 (Height) x 5 (Breadth) x 1 (Number of channels or color-spaces; a grayscale image has 1 channel, an RGB image has 3).

<img src='images/img-dim.png' width='420px'>

> Each **Filter** is overlapped on the **Image matrix**; we multiply the corresponding overlapped items and sum the products, and the results form the **Conv Matrix**.

> After each overlap, the **Filter** is moved across the **Image Matrix** by a fixed number of columns, known as the **Stride**. Filters move horizontally first and then vertically.

> **Stride** is the number of columns (or rows) by which the kernel is shifted over the Image matrix between overlaps.
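
The overlap, multiply, sum and stride steps described above can be sketched in plain Python. This is a minimal illustration, not how TensorFlow implements it: the function name `conv2d_valid` and the sample 5x5 image and 3x3 kernel values are assumptions chosen for the example, no padding is applied, and (as in most deep learning libraries) the kernel is not flipped.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the conv matrix."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply the overlapped items and sum the products.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[top + di][left + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 single-channel image and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

stride1 = conv2d_valid(image, kernel, stride=1)  # 3x3 conv matrix
stride2 = conv2d_valid(image, kernel, stride=2)  # 2x2 conv matrix
print(stride1)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(stride2)  # [[4, 4], [2, 4]]
```

A larger stride visits fewer positions, so the conv matrix shrinks from 3x3 to 2x2 here.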

<img src='images/conv-matrix.jpg' width='480px'>

<img src='images/rgb-filters.gif'>

###  Overlapping of filters results in Conv Matrices
<img src='https://d17h27t6h515a5.cloudfront.net/topher/2016/November/581a58be_convolution-schematic/convolution-schematic.gif'>
<img src='images/kernel-moves.png'>

> To extract multiple features from an image, we apply 'n' **Filters** and get 'n' **Conv matrices**.

<img src='images/conv-n.jpg' width='480px'>

> Successive **Filter** positions overlap, so the same data takes part in more than one calculation. We always try to keep this overlap, so that data considered at one position is considered again at the next.

> We can create and keep applying 'n' filters, which extract various features from the image and give us 'n' different **Conv Matrices**. The matrices obtained by overlapping filters over image data are known as **Conv matrices** or Feature Matrices.

> The data extracted by applying the filters to the image matrix is not the same as the original image.

> The whole idea and objective here in Image Processing is to extract features from images.

> At the very beginning, I have no idea what kind of features are present in the image dataset or what kind of filters I can apply. So I start with some filters that I created myself, and once I have the **Conv Matrices**, I can do a little data processing.

> After Convolution, we get a huge number of pixels, which is too much, so we try to take a few representative pixels that stand for all of them. Therefore, we perform **Pooling** or **Averaging** operations over the **conv matrices** and reduce the number of pixels, e.g. each 3x3 matrix in a **3x3x4** stack is reduced to a **2x2 matrix**.

> A resolution of **1920x1080p** for images and videos means that 1920 pixels are arranged horizontally and 1080 pixels vertically inside each image.

# Self - Learnable Kernels

> Before 1998, people used hard-coded filters, and each hard-coded filter was responsible for extracting a single feature from the image. After 1998, self-learnable filters came into the picture.

> Researchers argued that hard-coded kernels are not required for convolution, because they cannot extract the features needed to correctly identify or classify some of the objects inside an image. So they stopped defining a particular shape for the filters and made them self-learnable instead. The system creates these self-learnable kernels by itself and requires only 2 parameters to be tuned: the shape of the kernels and the number of kernels. This is how the concept of 'Self-learnable Kernels' arose, and all recent CNN models use it.

> With a larger number of filters, the network can train itself on a larger number of patterns available in the images.

> Self-learnable kernels start with random values, so at first there is a difference between the **Actual (expected) values** and the **Predicted values**, and the two do not correlate with each other. Once they correlate and are close to each other, we can say that the features we have extracted are related to our **Expected values**.

> These random kernel values are treated as **weights**. If there is no correlation between the expected values and the predicted ones, the network goes through Backward Propagation and adjusts the kernel values so that the predicted values move closer to the actual values.

> We calculate the **Local Gradients**, then perform **Local Gradient Analysis** or **Local Gradient Decomposition**, and based on that we learn the kernels.

> These **Kernels** are learned parameters, so their values change from one image dataset to another.
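
As a rough illustration of a kernel that starts random and is adjusted by gradients, the sketch below learns a 1-D kernel with plain gradient descent on a mean squared error. Everything in it is an assumption made for the example (the signal, the "true" kernel to recover, the learning rate and step count); a real CNN learns 2-D kernels through backpropagation across many layers.

```python
import random


def correlate1d(signal, kernel):
    """Slide `kernel` over `signal` (stride 1, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


# Hypothetical data: the expected values come from a kernel we pretend not to know.
signal = [0.0, 1.0, 0.5, -1.0, 0.25, 0.75, -0.5, 1.0]
true_kernel = [1.0, 0.0, -1.0]
expected = correlate1d(signal, true_kernel)

# The learnable kernel starts with random weights.
rng = random.Random(0)
kernel = [rng.uniform(-1.0, 1.0) for _ in true_kernel]
learning_rate = 0.1

for step in range(1000):
    predicted = correlate1d(signal, kernel)
    errors = [p - e for p, e in zip(predicted, expected)]
    # Gradient of the mean squared error with respect to each kernel weight.
    grads = [2.0 / len(errors) * sum(err * signal[i + j]
                                     for i, err in enumerate(errors))
             for j in range(len(kernel))]
    # Move each weight a small step against its gradient.
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

final_loss = sum((p - e) ** 2
                 for p, e in zip(correlate1d(signal, kernel), expected)) / len(expected)
print([round(w, 4) for w in kernel], final_loss)
```

After training, the learned weights should sit very close to `true_kernel`, which is the sense in which predicted and expected values come to agree.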

#  Padding operation

> In **Padding**, we add extra rows and columns all around the input matrix and fill them with zeros. The number of units 'n' to pad with depends on the matrix size required in the next step.

   * **Padding with 1 unit** pads 1 line of data all around the input matrix.
   * **Padding with 2 units** pads 2 lines of data all around the input matrix.

> Padding does not change the meaning of the data or add any information to the original matrix. With suitable padding, performing **Convolution** over the padded data leaves the shape of the output the same as the shape of the input.

> There may be significant features or information on the edges of an image. Padding lets the filter cover those edge pixels as often as the interior ones, so those significant features are taken into account.
  
####  Why do we do the Padding operation?

  *  To keep the shape & size of the input and output matrices the same, we use **Padding**.
  *  If significant features or information are present on the edges of an image and we don't want to lose them, we use **Padding**.
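
A small sketch of zero padding and its effect on output size, in plain Python. The helper names `zero_pad` and `conv_output_size` are made up for this example; the size formula `(input + 2 * pad - kernel) // stride + 1` is the standard one for a convolution with explicit padding.

```python
def zero_pad(matrix, pad):
    """Surround `matrix` with `pad` rows and columns of zeros on every side."""
    width = len(matrix[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in matrix]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom


def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Output length along one dimension of a convolution."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(matrix, 1)
for row in padded:
    print(row)

# A 3x3 filter over a 5x5 input: no padding shrinks it, 1 unit of padding keeps 5x5.
unpadded_size = conv_output_size(5, 3, stride=1, pad=0)  # 3
padded_size = conv_output_size(5, 3, stride=1, pad=1)    # 5
```

With 1 unit of padding, the 3x3 filter can be centered on every original pixel, including those on the edges, which is why the output keeps the input's size.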
  
 
<img src='images/stride2-convol.gif'>
<img src='images/padding.gif'>

## Pooling  Operations

> There is no point in considering the same outcome again and again, so it is completely fine to drop some of the features and keep a single representative one. This is what we do in **Pooling operations**.

> In a **Pooling operation**, we keep only some of the pixels, so that the data is reduced to a lower resolution. The common pooling operations are **Minimum Pooling**, **Max Pooling** and **Average Pooling**.

 * Min Pooling  ->   Picks the minimum value out of the selection
 * Max Pooling  ->   Picks the maximum value out of the selection
 * Avg Pooling  ->   Returns the average of all values in the selection

> After the **Pooling operation**, we are left with the final features, and the amount of data is reduced.
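
The three pooling variants differ only in how each window is reduced to one value. The sketch below shows this with a hypothetical `pool2d` helper and an arbitrary 4x4 feature map; it assumes a 2x2 window with stride 2 and no padding.

```python
def pool2d(matrix, size=2, stride=2, reduce_fn=max):
    """Reduce each `size` x `size` window of `matrix` to a single value."""
    pooled = []
    for top in range(0, len(matrix) - size + 1, stride):
        row = []
        for left in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Arbitrary 4x4 conv matrix used only for illustration.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

max_pooled = pool2d(feature_map, reduce_fn=max)   # [[6, 8], [3, 4]]
min_pooled = pool2d(feature_map, reduce_fn=min)   # [[1, 2], [1, 0]]
avg_pooled = pool2d(feature_map, reduce_fn=mean)  # [[3.25, 5.25], [2.0, 2.0]]
```

In every case the 4x4 input becomes a 2x2 output, so the amount of data drops by a factor of four.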

<img src='images/pooling.gif' width='480px'>

> Max Pooling also performs as a Noise Suppressant. It discards the noisy activations altogether and also performs de-noising along with dimensionality reduction. On the other hand, Average Pooling simply performs dimensionality reduction as a noise suppressing mechanism. For this reason, Max Pooling is often preferred over Average Pooling in practice.

<img src='images/max-pool.png' width='360px'>


> We can perform the **Convolution**, **Padding** and **Pooling** operations multiple times, in each and every layer of the CNN.

## Flattening Operation

> Once the image data has been convolved, we flatten the entire result.

> Flattening means that the (m x m x n) feature matrix is converted to a column vector of m x m x n values, and this flattened data is then sent into the **Neural network**, i.e. the **Fully connected (f-c)** layer.

> The **f-c layer** is responsible for learning the relationships between the features that we extracted in the **CNN layer**.
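
Flattening can be shown with a tiny nested list. The `flatten` helper and the 2 x 2 x 3 sample volume (height x width x channels) are assumptions for this example; TensorFlow code would typically use a reshape instead.

```python
def flatten(volume):
    """Turn a height x width x channels nested list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


# Hypothetical 2 x 2 x 3 feature volume.
volume = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

flat = flatten(volume)
print(len(flat))  # 12, i.e. 2 * 2 * 3
print(flat)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

The number of values is unchanged; only the spatial arrangement is dropped so that a fully connected layer can consume the data.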

<img src='images/flatten.png' width='420px'>

> **The main idea & objective behind a CNN network is to learn some patterns out of Image data using some kernels.**

# Architecture of CNN networks

> A **CNN** is designed by combining 2 different kinds of layers: the **Convolution layer** and the **f-c layer**.

<img src='images/cnn-architecture.jpg' width='480px'>

> In the **Convolution layer**, we perform convolution, padding, pooling and flattening operations over the image dataset. It learns what to extract from the images, then flattens the entire data and sends it to the **f-c layer**.

> The **f-c layer** yields the final result. It learns the relationships between the extracted features and finally produces the outcome.

> Learning happens in both the **Convolution & f-c** layers. The **Convolution layer** learns to adjust the **kernels**, while the **f-c layer** learns to adjust the relations in terms of weights.

## TensorFlow Convolution Layer

Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions to create your own convolutional layers.

In [None]:
import tensorflow as tf

# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)

The code above uses the `tf.nn.conv2d()` function to compute the convolution with `weight` as the filter and `[1, 2, 2, 1]` for the strides. TensorFlow uses a stride for each `input` dimension, `[batch, input_height, input_width, input_channels]`. We generally set the stride for `batch` and `input_channels` (i.e. the first and fourth element in the strides array) to `1`.

You'll focus on changing `input_height` and `input_width` while setting batch and `input_channels` to `1`. The `input_height` and `input_width` strides are for striding the filter over `input`. This example code uses a stride of 2 with 5x5 filter over `input`.

The `tf.nn.bias_add()` function adds a 1-d bias to the last dimension in a matrix.
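
To see what shape the layer above produces, the output size can be worked out by hand. The sketch below assumes TensorFlow's documented rules for the two padding modes: `'SAME'` gives `ceil(input / stride)` and `'VALID'` gives `ceil((input - filter + 1) / stride)`. The helper name `conv_output_shape` is hypothetical.

```python
import math


def conv_output_shape(height, width, filter_h, filter_w, stride, depth, padding):
    """Return (out_height, out_width, out_depth) for one convolution layer."""
    if padding == 'SAME':
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == 'VALID':
        out_h = math.ceil((height - filter_h + 1) / stride)
        out_w = math.ceil((width - filter_w + 1) / stride)
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    return (out_h, out_w, depth)


# Values from the example: 10x10 image, 5x5 filter, stride 2, output depth 64.
same_shape = conv_output_shape(10, 10, 5, 5, stride=2, depth=64, padding='SAME')
valid_shape = conv_output_shape(10, 10, 5, 5, stride=2, depth=64, padding='VALID')
print(same_shape)   # (5, 5, 64)
print(valid_shape)  # (3, 3, 64)
```

With `padding='SAME'` and a stride of 2, each 10x10x3 image becomes a 5x5x64 volume; the batch dimension is unaffected.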

## TensorFlow Max Pooling
![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/582aac09_max-pooling/max-pooling.png)


The image above is an example of max pooling with a 2x2 filter and stride of 2. The four 2x2 colors represent each time the filter was applied to find the maximum value.

For example, `[[1, 0], [4, 6]]` becomes `6`, because `6` is the maximum value in this set. Similarly, `[[2, 3], [6, 8]]` becomes `8`.

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.

TensorFlow provides the `tf.nn.max_pool()` function to apply max pooling to your convolutional layers.

In [None]:
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

The `tf.nn.max_pool()` function performs max pooling with the `ksize` parameter as the size of the filter and the `strides` parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`). For both `ksize` and `strides`, the `batch` and `channel` dimensions are typically set to `1`.

## 1x1 Convolutions

#### [1x1 Convolutions Video](https://www.youtube.com/watch?v=Zmzgerm6SjA)

## Inception Module

#### [Inception Module Video](https://www.youtube.com/watch?v=SlTm03bEOxA)


## Convolutional Network in TensorFlow

It's time to walk through an example Convolutional Neural Network (CNN) in TensorFlow.

The structure of this network follows the classic structure of CNNs, which is a mix of convolutional layers and max pooling, followed by fully-connected layers.

The code we'll be looking at is similar to what we saw in the segment on Deep Neural Network in TensorFlow, except we've restructured the architecture of this network as a CNN.

Just like in that segment, here we'll study the line-by-line breakdown of the code. [Link to download the code and run it.](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58a61ca1_cnn/cnn.zip)

We've seen this section of code from previous lessons. Here we're importing the MNIST dataset and using a convenient TensorFlow function to batch, scale, and One-Hot encode the data.

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 1
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if you're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units

### Weights and Biases

In [None]:
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}

The convolution schematic shown earlier (in the section on Conv matrices) is an example of a convolution with a 3x3 filter and a stride of 1 being applied to data with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight, `[[1, 0, 1], [0, 1, 0], [1, 0, 1]]`, then a bias is added to create the convolved feature on the right. In this case, the bias is zero. In TensorFlow, this is all done using `tf.nn.conv2d()` and `tf.nn.bias_add()`.

In [None]:
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

The `tf.nn.conv2d()` function computes the convolution against weight `W` as shown above.

In TensorFlow, stride is an array of 4 elements; the first element in the stride array indicates the stride for batch and last element indicates stride for features. It's good practice to remove the batches or features you want to skip from the data set rather than use stride to skip them. You can always set the first and last element to 1 in stride in order to use all batches and features.

The middle two elements are the strides for height and width respectively. I've mentioned stride as one number because you usually have a square stride where `height = width`. When someone says they are using a stride of 3, they usually mean `tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])`.

To make life easier, the code is using `tf.nn.bias_add()` to add the bias. It is a special case of `tf.add()` in which the bias is restricted to a 1-D tensor that is added along the last dimension of the input.
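
What adding a 1-D bias along the last dimension means can be sketched without TensorFlow. The `bias_add` function below is a plain-Python stand-in written for this example, not the library implementation, and it handles a single height x width x channels sample rather than a batch.

```python
def bias_add(feature_maps, bias):
    """Add bias[c] to every value in channel c of a height x width x channels list."""
    return [[[value + bias[c] for c, value in enumerate(pixel)]
             for pixel in row]
            for row in feature_maps]


# Hypothetical 2 x 2 x 2 convolution output and one bias per channel.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
bias = [10.0, 100.0]

biased = bias_add(feature_maps, bias)
print(biased)
# [[[11.0, 102.0], [13.0, 104.0]], [[15.0, 106.0], [17.0, 108.0]]]
```

Each channel gets its own bias value, which is why the bias vectors in the network have the same length as the output depth of their layer.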

The above is an example of max pooling with a 2x2 filter and stride of 2. The left square is the input and the right square is the output. The four 2x2 colors in the input represent each time the filter was applied to create the max on the right side. For example, `[[1, 1], [5, 6]]` becomes `6` and `[[3, 2], [1, 2]]` becomes `3`.

In [None]:
def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')

The `tf.nn.max_pool()` function does exactly what you would expect: it performs max pooling with the `ksize` parameter as the size of the filter.

### Model

![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/581a64b7_arch/arch.png)

In the code below, we're creating two layers that each apply a convolution followed by max pooling, followed by a fully connected layer and an output layer. The transformation of each layer to new dimensions is shown in the comments. For example, the first layer shapes the images from 28x28x1 to 28x28x32 in the convolution step. The next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from `conv1` to `out`, producing 10 class predictions.
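
The dimension changes described above can be traced with a few lines of arithmetic. This sketch assumes `'SAME'` padding everywhere (so each step yields `ceil(size / stride)`), a stride of 1 for the convolutions and a stride of 2 for the pooling steps, matching the helper functions in this section; the name `same_padding_size` is made up for the example.

```python
import math


def same_padding_size(size, stride):
    """Output length along one dimension when padding='SAME'."""
    return math.ceil(size / stride)


height = width = 28  # MNIST image size
depth = 1            # single grayscale channel
shapes = [(height, width, depth)]

for out_depth in (32, 64):
    # Convolution with stride 1: spatial size unchanged, depth becomes out_depth.
    height, width = same_padding_size(height, 1), same_padding_size(width, 1)
    depth = out_depth
    shapes.append((height, width, depth))
    # Max pooling with k=2: spatial size halves, depth unchanged.
    height, width = same_padding_size(height, 2), same_padding_size(width, 2)
    shapes.append((height, width, depth))

flat_size = height * width * depth
print(shapes)     # [(28, 28, 1), (28, 28, 32), (14, 14, 32), (14, 14, 64), (7, 7, 64)]
print(flat_size)  # 3136, the 7*7*64 input size of the 'wd1' weights
```

The final 7x7x64 volume is what gets reshaped into a vector before the fully connected layer.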

In [None]:
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

In [None]:
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3} - '
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))

We can further improve the accuracy by increasing the number of epochs.

# Simple CNN Network for Dog_Cat classification

## Setting up Directories

## Creating a PipeLine for CNN Network

## Training of Model

##  Getting probabilities  for the prediction

## Conclusion

 * We used a simple CNN model to classify the Cat-Dog image dataset.
 * Adding more convolution layers, along with pooling and padding, may give better-optimized results.
 * Keras Tuner can be used to tune the hyperparameters of this CNN network.