- What is convolution
- What is pooling
- What is subsampling

Convolution is a mathematical operation that combines two functions to describe the overlap between them. Convolution takes two functions and “slides” one of them over the other, multiplying the function values at each point where they overlap, and adding up the products to create a new function.

Formally, the convolution of $f$ and $g$ is an integral that measures the overlap between $f(\tau)$ and a reversed copy of $g$ as that copy is shifted by $t$:
$$
(f*g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)d\tau
$$
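
For sampled signals the integral becomes a sum, $(f*g)[n] = \sum_k f[k]\,g[n-k]$. The snippet below is a minimal pure-Python sketch of that discrete form; the function name `convolve_1d` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum_k f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, f_k in enumerate(f):
        for m, g_m in enumerate(g):
            # f[k] * g[m] contributes to output index n = k + m,
            # which is the same as pairing f[k] with g[n - k].
            out[k + m] += f_k * g_m
    return out


signal = [1.0, 2.0, 3.0]   # made-up sample values
kernel = [0.0, 1.0, 0.5]   # made-up sample values
result = convolve_1d(signal, kernel)
print(result)  # [0.0, 1.0, 2.5, 4.0, 1.5]
```

Each output entry adds up every product whose indices sum to the same position, which is the "slide, multiply, and add" description from the text.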

<img src="images/convolution_animated.gif" width="500" />

In image processing, convolutional filtering can be used to implement algorithms such as edge detection, image sharpening, and image blurring.

This is done by selecting the appropriate kernel (convolution matrix).
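
As a rough illustration of how the kernel choice determines the effect, the sketch below slides a 3×3 Sobel kernel over a tiny grayscale image that contains one vertical edge. The helper name `filter_2d` and the image values are assumptions made for this example. Like most image-processing and deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def filter_2d(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the aligned window and sum the products.
            acc = 0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Made-up 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Sobel kernel that responds to horizontal changes in intensity.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

edges = filter_2d(image, sobel_x)
print(edges)  # [[0, 4, 4], [0, 4, 4]]
```

The response is zero over the flat dark region and non-zero wherever the window straddles the edge. Swapping in a different kernel (for example, one with equal positive weights) would blur the image instead.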

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/image_convolution_animated.gif" width="700" />
</div>

Convolution plays a key role in convolutional neural networks (CNNs). CNNs are a type of deep network commonly used to analyze images. They eliminate the need for manual feature extraction, which is why they work very well for complex problems such as image classification and medical image analysis. CNNs are also effective for non-image data such as audio, time series, and other signals.

More specifically, loosely inspired by biological neural networks, a CNN can pick out local regions of an image and recognize distinctive features that stay recognizable, to some extent, under transformations such as shifting, scaling, and rotation. A typical CNN consists of the following layers: a convolutional layer, a ReLU layer, a pooling layer, and a fully connected (FC) layer. The convolutional layer extracts the main features by applying filters to its input. The pooling layer reduces the size of the feature maps, which makes later layers cheaper to compute. Compared with other image recognition tools, a CNN learns its patterns from training data, so there is no need to hand-write detailed mathematical rules for the computer to apply. Instead, the network forms its own mapping from images to outputs, which later detection steps can use.

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/general_convolutional_neural_network.png" width="800" />
</div>

A fully connected neural network consists of a series of fully connected layers
that connect every neuron in one layer to every neuron in the next layer. The
main problem with fully connected neural networks is that the number of weights
required is very large for certain types of data. For example, a 224×224×3 image has 150,528 input values, so every neuron in the first hidden layer needs 150,528 weights, and the count grows
quickly for bigger images. You can imagine how computationally intensive
things become once images reach 8K resolution
(7680×4320): training such a network would require a lot of time and
resources.

However, for image data, the same patterns can occur in different places. Hence
we can train many smaller detectors, each capable of sliding across an image, to take
advantage of the repeating patterns. Because each detector reuses the same weights at every position, this reduces the
number of weights that need to be trained.
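
To make the difference concrete, the sketch below counts trainable parameters for one fully connected layer and one convolutional layer on a 224×224×3 input. The hidden-layer width of 1000 and the choice of 64 filters of size 3×3 are hypothetical values picked only for this comparison, and the helper names are assumptions.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter for a square-kernel conv layer."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


n_inputs = 224 * 224 * 3   # 150,528 input values
hidden_units = 1000        # hypothetical hidden-layer width
n_filters = 64             # hypothetical number of 3x3 filters

dense_count = dense_params(n_inputs, hidden_units)
conv_count = conv_params(in_channels=3, out_channels=n_filters, kernel_size=3)

print(f"{n_inputs:,}")     # 150,528
print(f"{dense_count:,}")  # 150,529,000
print(f"{conv_count:,}")   # 1,792
```

The convolutional layer's parameter count does not depend on the image size at all, because the same small filters are reused at every position.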

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/pattern_detection_in_cnn.png" width="800" />
</div>

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/beak_detector.png" width="600" />
</div>

A convolutional neural network is a neural network with some convolutional
layers (and some other layers). A convolutional layer has a number of filters that
perform the convolution operation.

The convolution operation is very similar to image processing
filters such as the Sobel filter and the Gaussian filter. The kernel slides across an image, multiplies its weights element-wise with the aligned pixels, and sums the products.
A bias value is then added to the result.

There are three hyperparameters that determine the size of the output feature map:

- Stride (S) is the step size used each time we slide the filter. When the stride is 1, we move the filters one pixel at a time. When the stride is 2 (strides of 3 or more are rare in practice), the filters jump 2 pixels at a time as we slide them around. A larger stride produces spatially smaller output volumes.

- Padding (P): The input is padded with a border whose width is given by this value. Most commonly, the border is filled with zeros (zero-padding). In neural network frameworks (Caffe, TensorFlow, PyTorch, MXNet), the size of this zero-padding is a hyperparameter. It can also be used to control the spatial size of the output volumes.

- Depth (D): The depth of the output volume is a hyperparameter too. It corresponds to the number of filters we use for a convolution layer.

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/convolution_operation.png" width="600" />
</div>

Given $W$ as the width of the input and $F$ as the width of the filter, with $P$ and $S$ as
the padding and stride respectively, the output width is $\frac{W + 2P - F}{S} + 1$. In general, setting $P = \frac{F - 1}{2}$ (for odd $F$)
when the stride is $S = 1$ ensures that the input volume and output volume have
the same spatial size.
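
The output-width formula can be checked with a small helper. This is a minimal sketch: the function name `conv_output_size` is an assumption, and it rounds down with integer division when the stride does not divide evenly, which is the convention PyTorch's `nn.Conv2d` follows.

```python
def conv_output_size(w, f, p=0, s=1):
    """Output width for input width w, filter width f, padding p, stride s."""
    return (w + 2 * p - f) // s + 1


# 7-wide input, 4-wide filter, no padding, stride 1.
print(conv_output_size(7, 4))             # 4

# "Same" padding: P = (F - 1) / 2 with stride 1 keeps the width unchanged.
print(conv_output_size(224, 3, p=1, s=1))  # 224

# A stride of 2 roughly halves the width.
print(conv_output_size(7, 3, p=0, s=2))    # 3
```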

For an input of 7×7×3 and an output depth of 2, we will have 6 kernels as shown
below: 3 for the first output channel and another 3 for the second. The
outputs of the three kernels in each filter are summed, together with the bias, to generate one value of the output feature map.
In this example, the output from each kernel of filter W1 is as
follows:
<pre>
Output of Kernel 1 = 1
Output of Kernel 2 = −2 
Output of Kernel 3 = 2 
Output of Filter W1 = Output of Kernel 1 + Output of Kernel 2 + Output of Kernel 3 + bias 
    = 1 − 2 + 2 + 0 = 1.
</pre>

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/convolution_example.png" width="800" />
</div>

In [None]:
import torch
import torch.nn as nn

In [71]:
torch.manual_seed(42)
c = nn.Conv2d(3, 2, kernel_size=4)
c.weight.shape

torch.Size([2, 3, 4, 4])

In [72]:
x = torch.randn(1, 3, 7, 7)
c(x)

tensor([[[[ 0.9037,  0.5184,  0.0768,  0.9506],
          [ 0.3873,  0.5276,  0.4582,  0.3262],
          [-0.0666,  0.2944, -0.2848,  0.7257],
          [ 0.5757, -0.0773,  0.4156,  0.3511]],

         [[-1.2760, -0.1103, -0.0280,  0.4482],
          [-0.8313,  0.2466, -0.5139, -0.5076],
          [ 0.8714,  0.2920, -0.3182, -0.0161],
          [-0.2489, -0.6562, -0.3168, -0.4677]]]],
       grad_fn=<ConvolutionBackward0>)

In [75]:
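# Recompute output channel 0 by hand: for each 4x4 window, multiply every
# input channel with its kernel, sum the products, and add the bias.
for i in range(4):
    for j in range(4):
        value = (
            (x[0, 0][i:4+i, j:4+j] * c.weight[0, 0]).sum() +
            (x[0, 1][i:4+i, j:4+j] * c.weight[0, 1]).sum() +
            (x[0, 2][i:4+i, j:4+j] * c.weight[0, 2]).sum() +
            c.bias[0]
        ).item()
        # Format outside the f-string expression so this also runs before Python 3.12.
        print(f"{value:.4f}".rjust(8), end='\t')
    print()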

  0.9037	  0.5184	  0.0768	  0.9506	
  0.3873	  0.5276	  0.4582	  0.3262	
 -0.0666	  0.2944	 -0.2848	  0.7257	
  0.5757	 -0.0773	  0.4156	  0.3511	


In [80]:
# Same computation for both output channels (one filter per channel).
for out in range(2):
    for i in range(4):
        for j in range(4):
            value = (
                (x[0, 0][i:4+i, j:4+j] * c.weight[out, 0]).sum() +
                (x[0, 1][i:4+i, j:4+j] * c.weight[out, 1]).sum() +
                (x[0, 2][i:4+i, j:4+j] * c.weight[out, 2]).sum() +
                c.bias[out]
            ).item()
            print(f"{value:.4f}".rjust(8), end='\t')
        print()
    print('-'*60)

  0.9037	  0.5184	  0.0768	  0.9506	
  0.3873	  0.5276	  0.4582	  0.3262	
 -0.0666	  0.2944	 -0.2848	  0.7257	
  0.5757	 -0.0773	  0.4156	  0.3511	
------------------------------------------------------------
 -1.2760	 -0.1103	 -0.0280	  0.4482	
 -0.8313	  0.2466	 -0.5139	 -0.5076	
  0.8714	  0.2920	 -0.3182	 -0.0161	
 -0.2489	 -0.6562	 -0.3168	 -0.4677	
------------------------------------------------------------


**Pooling**

A pooling layer (also called a subsampling or downsampling layer) is used in CNNs to reduce the spatial dimensions (width and height) of the input feature maps while retaining the most important information. It slides a two-dimensional window over each channel of a feature map and summarizes the features within the region covered by the window.
A pooling layer helps reduce computation time and
gradually builds up invariance to small spatial shifts and changes in configuration. For image understanding,
pooling helps extract more semantic meaning. A max pooling layer
returns the maximum value within each window, while an average pooling layer returns the mean of the values in each window.
The example below illustrates the outputs of a max pooling and an average
pooling operation respectively, given a kernel of size 2 and stride 2.

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/pooling.png" width="500" />
</div>
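
The two pooling operations can also be sketched in plain Python. The helper name `pool_2d` is an assumption, and the feature-map values are made up for illustration rather than taken from the figure.

```python
def pool_2d(feature_map, size=2, stride=2, mode="max"):
    """Summarize each size x size window with its max or its mean."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values covered by the window at (i, j).
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


# Made-up 4x4 feature map (one channel).
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 0, 1],
    [3, 1, 2, 5],
]

max_pooled = pool_2d(feature_map, mode="max")
avg_pooled = pool_2d(feature_map, mode="avg")
print(max_pooled)  # [[7, 8], [4, 5]]
print(avg_pooled)  # [[4.0, 5.0], [2.5, 2.0]]
```

With a 2×2 window and stride 2, the 4×4 input shrinks to 2×2, so each spatial dimension is halved.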

**Flattening**

Adding a fully connected layer is a (usually) cheap way of learning non-linear
combinations of the high-level features represented by the output of the convolutional layers. The fully connected layer learns a possibly non-linear function in
that feature space.

By flattening the pooled feature maps into a column vector, we convert them
into a suitable form for a multilayer perceptron. The flattened output is fed
to a feed-forward neural network, and backpropagation is applied at every iteration
of training. Over a series of epochs, the model learns to distinguish between
dominating and certain low-level features in images and classifies them using
softmax classification.

<div style="display:flex;align-items:center;justify-content:center;">
    <img src="images/flattening.png" width="800" />
</div>
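
The flattening and softmax steps described above can be sketched as follows. The helper names, the pooled values, and the class scores are all hypothetical. In a real network the scores would come from the final fully connected layer rather than being written by hand.

```python
import math


def flatten(volume):
    """Flatten nested lists indexed [channel][row][col] into one flat list."""
    return [v for channel in volume for row in channel for v in row]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up pooled output: 2 channels, each 2x2.
pooled = [
    [[7, 8], [4, 5]],
    [[1, 0], [2, 3]],
]
flat = flatten(pooled)
print(flat)  # [7, 8, 4, 5, 1, 0, 2, 3]

# Hypothetical raw scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```

Softmax preserves the ordering of the scores, so the class with the largest raw score also receives the largest probability.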

## Resources:

- https://medium.com/@siddheshb008/understanding-convolution-neural-networks-a30211e12a06
- https://medium.com/@siddheshb008/understanding-convolutional-neural-networks-part-2-98694dd47923