<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/02_convolutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Convolutions

A convolution is "a mathematical operation on two functions (`f` and `g`) that produces a third function (`f * g`) expressing how the shape of one is modified by the other."

In image processing, a convolution matrix is also called a kernel or filter. 

Typical image processing operations, such as blurring, sharpening, and edge detection, are
accomplished by performing a convolution between a kernel and an image.

## Setup

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    # write the downloaded file and close the handle explicitly
    with open('config.py', 'wb') as f:
        f.write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter5()
# This is needed to render the plots in this chapter
from plots.chapter5 import *

Downloading files from GitHub repo to Colab...
Finished!


In [3]:
import random
import numpy as np
from PIL import Image

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, Dataset
from torchvision.transforms import Compose, Normalize

from data_generation.image_classification import generate_dataset
from helpers import index_splitter, make_balanced_sampler
from stepbystep.v1 import StepByStep

## Filter / Kernel

Usually, filters are small square matrices. The convolution itself is performed by
repeatedly applying the filter to different regions of the image.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv1.png?raw=1)

The gray region in the figure is where the filter is being applied. It is called the
receptive field, drawing an analogy to the way human vision works.

Let’s try a concrete example to make this clearer.

In [4]:
single = np.array([
  [ # batch dim
      [ # channel dim
          [5, 0, 8, 7, 8, 1], # height and width dim
          [1, 9, 5, 0, 7, 7],
          [6, 0, 2, 4, 6, 6],
          [9, 7, 6, 6, 8, 4],
          [8, 3, 8, 5, 1, 3],
          [7, 2, 7, 0, 1, 0]
      ]
  ]
])
single.shape

(1, 1, 6, 6)

In [5]:
identity = np.array([
    [
        [
            [0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]
        ]
    ]
])
identity.shape

(1, 1, 3, 3)

## Convolving

A convolution performs an element-wise multiplication between the
two, region and filter, and adds everything up.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv2.png?raw=1)

In [6]:
region = single[:, :, 0:3, 0:3]
filtered_region = region * identity
total = filtered_region.sum()
total

9

Doing a convolution produces an image with a
reduced size.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv3.png?raw=1)

## Moving Around

Next, we move the region one step to the right; that is, we change the receptive
field and apply the filter again.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/stride1.png?raw=1)

In code, it means we’re changing the slice of the input image:

In [7]:
new_region = single[:, :, 0:3, (0+1):(3+1)]

But the operation remains the same: First, an element-wise multiplication, and
then adding up the elements of the resulting matrix.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv5.png?raw=1)

In [8]:
new_filtered_region = new_region * identity
new_total = new_filtered_region.sum()
new_total

5

Great! We have a second pixel value to add to our resulting image.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv6.png?raw=1)

We can keep moving the gray region to the right until we can’t move it anymore.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv7.png?raw=1)

The fourth step to the right will actually place the region partially outside the
input image.

In [9]:
last_horizontal_region = single[:, :, 0:3, (0+4):(3+4)]

The selected region does not match the shape of the filter anymore. 

So, if we try to
perform the element-wise multiplication, it fails:

In [10]:
try:
  last_horizontal_region * identity
except Exception as exp:
  print(exp)

operands could not be broadcast together with shapes (1,1,3,2) (1,1,3,3) 


## Shape

Next, we go back to the left side and move down one step. If we repeat the
operation, covering all valid regions, we’ll end up with a resulting image that is
smaller (on the right).

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/conv8.png?raw=1)
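To make "covering all valid regions" concrete, here is a minimal NumPy sketch that slides the filter over every valid position with a step of one. The names `image_2d`, `identity_2d`, and `convolve_valid` are illustrative only (they are not part of the notebook's helpers), and the batch and channel dimensions are dropped for brevity.

```python
import numpy as np

# Same 6x6 image and 3x3 identity filter as above, without batch/channel dims
image_2d = np.array([
    [5, 0, 8, 7, 8, 1],
    [1, 9, 5, 0, 7, 7],
    [6, 0, 2, 4, 6, 6],
    [9, 7, 6, 6, 8, 4],
    [8, 3, 8, 5, 1, 3],
    [7, 2, 7, 0, 1, 0],
])
identity_2d = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
])

def convolve_valid(img, kernel):
    """Apply the kernel to every valid region (step of one, no padding)."""
    h_i, w_i = img.shape
    h_f, w_f = kernel.shape
    out = np.zeros((h_i - h_f + 1, w_i - w_f + 1), dtype=img.dtype)
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            # current receptive field, same shape as the kernel
            region = img[row:row + h_f, col:col + w_f]
            # element-wise multiplication, then add everything up
            out[row, col] = (region * kernel).sum()
    return out

result = convolve_valid(image_2d, identity_2d)
print(result.shape)  # (4, 4)
print(result)
```

With the identity filter, each output pixel is simply the center of its receptive field, so the result is the inner 4x4 part of the input.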

How much smaller is it going to be?

It depends on the size of the filter.

>The larger the filter, the smaller the resulting image.

Since applying a filter always produces a single value, the reduction is equal to the
filter size minus one.

$$
\Large
(h_i, w_i) * (h_f, w_f) = (h_i - (h_f - 1), w_i - (w_f - 1))
$$

If we assume the filter is a square matrix of size f, we can simplify the expression
above to:

$$
\Large
(h_i, w_i) * f = (h_i - f + 1, w_i - f + 1)
$$

But what if I’d like to keep the image size? Is that possible?

Sure it is! Padding comes to our rescue in this case.

## Convolving in PyTorch

Now that we know how a convolution works, let’s try it out using PyTorch.

In [11]:
# convert our image and filter to tensors
image = torch.as_tensor(single).float()
kernel = torch.as_tensor(identity).float()

Just like the activation functions, convolutions come in two
flavors: functional and module. 

There is a fundamental difference between the
two, though: The functional convolution takes the kernel / filter as an argument
while the module has (learnable) weights to represent the kernel / filter.

Let’s use the functional convolution, `F.conv2d()`, to apply the identity filter to our
input image.

In [12]:
convolved = F.conv2d(image, kernel, stride=1)
convolved

tensor([[[[9., 5., 0., 7.],
          [0., 2., 4., 6.],
          [7., 6., 6., 8.],
          [3., 8., 5., 1.]]]])

Now, let’s turn our attention to PyTorch’s convolution module, `nn.Conv2d`.

In [13]:
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1)
conv(image)

tensor([[[[-3.1584, -3.6587, -3.3084, -1.9776],
          [-1.6495, -0.7789, -3.8846, -2.4516],
          [-6.4975, -4.3493, -3.1753, -3.6608],
          [-5.0802, -1.0271, -2.6073, -1.0769]]]],
       grad_fn=<ConvolutionBackward0>)

These results are gibberish now because the convolutional module randomly initializes the weights representing
the kernel / filter.

That’s the whole point of the convolutional module: It will learn
the kernel / filter on its own.
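The relationship between the two flavors can be checked directly: the module holds a weight and a bias, and the functional form produces the same output when given those tensors explicitly. This is a small sketch using a random input; `dummy_image` and `conv_layer` are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
dummy_image = torch.randn(1, 1, 6, 6)  # hypothetical input in (N, C, H, W) format

conv_layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1)

# The module stores its own (randomly initialized, learnable) kernel and bias...
module_out = conv_layer(dummy_image)
# ...while the functional form needs them passed in as arguments
functional_out = F.conv2d(dummy_image, conv_layer.weight, conv_layer.bias, stride=1)

print(torch.allclose(module_out, functional_out))
```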

In traditional computer vision, people would develop different
filters for different purposes: blurring, sharpening, edge
detection, and so on.

Can we tell it to learn multiple filters at once?

In [14]:
conv_multiple = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3, stride=1)
conv_multiple(image)

tensor([[[[ 2.5559,  1.6035,  0.2515, -1.0625],
          [-1.8400, -2.9293,  2.6575,  1.8039],
          [ 1.3645,  3.5655,  1.4484,  0.7340],
          [-0.8595,  1.0977, -0.7592, -1.5410]],

         [[ 3.4871,  3.6961,  3.2884,  5.1495],
          [ 4.2876,  5.7939,  5.4350,  3.7492],
          [ 7.8516,  5.2071,  3.2165,  4.2438],
          [ 5.1253,  4.9019,  4.3800,  1.6612]]]],
       grad_fn=<ConvolutionBackward0>)

In [15]:
conv_multiple.weight

Parameter containing:
tensor([[[[-0.2008, -0.3166,  0.2790],
          [ 0.0693,  0.1972,  0.1449],
          [-0.2383,  0.0989,  0.0931]]],


        [[[ 0.2917, -0.1087, -0.0920],
          [ 0.2168,  0.2777, -0.2537],
          [ 0.0901,  0.2887,  0.2582]]]], requires_grad=True)

We can also force a convolutional module to use a particular filter by setting its weights.

In [16]:
with torch.no_grad():
  conv.weight[0] = kernel
  conv.bias[0] = 0

conv(image)

tensor([[[[9., 5., 0., 7.],
          [0., 2., 4., 6.],
          [7., 6., 6., 8.],
          [3., 8., 5., 1.]]]], grad_fn=<ConvolutionBackward0>)

In [17]:
conv.weight

Parameter containing:
tensor([[[[0., 0., 0.],
          [0., 1., 0.],
          [0., 0., 0.]]]], requires_grad=True)

Setting the weights to get specific filters is at the heart of
transfer learning. 

Someone else trained a model, and that model
learned lots of useful filters, so we don’t have to learn them
again. We can set the corresponding weights and go from there.
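As a toy illustration of "setting the corresponding weights", the sketch below copies the parameters of one convolutional layer into another using `state_dict()` and `load_state_dict()`. It assumes a single layer standing in for a pretrained model (`pretrained_conv` is a hypothetical name; here its weights are just random), whereas real transfer learning typically loads the weights of a whole network.

```python
import torch
import torch.nn as nn

torch.manual_seed(13)
# Stand-in for a layer whose filters were learned by someone else
pretrained_conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)
# A freshly initialized layer with the same configuration
new_conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)

# Copy weights and biases from one layer to the other
new_conv.load_state_dict(pretrained_conv.state_dict())

# Optionally freeze the copied filters so training does not change them
for param in new_conv.parameters():
    param.requires_grad_(False)

dummy_image = torch.randn(1, 1, 6, 6)
same_output = torch.allclose(new_conv(dummy_image), pretrained_conv(dummy_image))
print(same_output)
```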

## Striding

Let’s try a stride of two for a change and see what happens to the resulting image.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/strider2.png?raw=1)

The resulting image, after the only four valid operations, looks like this.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/strider3.png?raw=1)

Also, notice that using a larger stride made the shape of the resulting image even smaller.

>The larger the stride, the smaller the resulting image.

Once again, it makes sense: If we are skipping pixels in the input image, there are
fewer regions of interest to apply the filter to. 

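We can extend our previous formula
to include the stride size (s). The floor (rounding down) covers strides that do not evenly divide the input:

$$
\Large
(h_i, w_i) * f = \left(\left\lfloor\frac{h_i - f}{s}\right\rfloor + 1, \left\lfloor\frac{w_i - f}{s}\right\rfloor + 1\right)
$$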

Let’s use
PyTorch’s functional convolution to double-check the results.

In [18]:
convolved_stride2 = F.conv2d(image, kernel, stride=2)
convolved_stride2

tensor([[[[9., 0.],
          [7., 6.]]]])

Cool, it works!
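Another way to look at striding (without padding) is that a stride of two keeps every second row and column of the stride-one result. The sketch below checks this for our image and the identity filter; `image_t` and `kernel_t` are names introduced just for this self-contained example.

```python
import torch
import torch.nn.functional as F

image_t = torch.tensor([
    [5, 0, 8, 7, 8, 1],
    [1, 9, 5, 0, 7, 7],
    [6, 0, 2, 4, 6, 6],
    [9, 7, 6, 6, 8, 4],
    [8, 3, 8, 5, 1, 3],
    [7, 2, 7, 0, 1, 0],
]).float().view(1, 1, 6, 6)
kernel_t = torch.tensor([
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]).float().view(1, 1, 3, 3)

stride1 = F.conv2d(image_t, kernel_t, stride=1)  # shape (1, 1, 4, 4)
stride2 = F.conv2d(image_t, kernel_t, stride=2)  # shape (1, 1, 2, 2)

# Skipping every other position of the stride-one output gives the same pixels
subsampled = stride1[:, :, ::2, ::2]
print(torch.equal(stride2, subsampled))
```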

## Padding

So far, the operations we have performed have been shrinking the images. What
about restoring them to their original glory, I mean, size?

Padding means stuffing. We need to stuff the original image so it can sustain the "attack" on its size.

How do I stuff an image?

Simply add zeros around it.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/padding1.png?raw=1)

By adding columns and rows of zeros around it, we expand the
input image such that the gray region starts centered in the actual top left corner
of the input image. 

This simple trick can be used to preserve the original size of the
image.

In code, as usual, PyTorch gives us two options: functional (`F.pad()`) and module (`nn.ConstantPad2d`).

Let’s start with the module version this time:

In [19]:
constant_padder = nn.ConstantPad2d(padding=1, value=0)
constant_padder(image)

tensor([[[[0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 5., 0., 8., 7., 8., 1., 0.],
          [0., 1., 9., 5., 0., 7., 7., 0.],
          [0., 6., 0., 2., 4., 6., 6., 0.],
          [0., 9., 7., 6., 6., 8., 4., 0.],
          [0., 8., 3., 8., 5., 1., 3., 0.],
          [0., 7., 2., 7., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.]]]])

One can also do asymmetric padding by specifying a tuple in the padding
argument representing (left, right, top, bottom). 

So, if we were to stuff our
image on the left and right sides only, the argument would go like this: `(1, 1, 0, 0)`.

We can achieve the same result using the functional padding:

In [20]:
padded = F.pad(image, pad=(1, 1, 1, 1), mode="constant", value=0)
padded

tensor([[[[0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 5., 0., 8., 7., 8., 1., 0.],
          [0., 1., 9., 5., 0., 7., 7., 0.],
          [0., 6., 0., 2., 4., 6., 6., 0.],
          [0., 9., 7., 6., 6., 8., 4., 0.],
          [0., 8., 3., 8., 5., 1., 3., 0.],
          [0., 7., 2., 7., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.]]]])

Here, there is another argument, `mode`, which was
set to `"constant"` to match the module version above.

What are the other available modes?

There are three other modes: `replicate`, `reflect`, and `circular`.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/paddings.png?raw=1)

In replication padding, the padded pixels have the same value as the closest real
pixel.

In PyTorch, one can use the functional form `F.pad()` with `mode="replicate"`, or use
the module version `nn.ReplicationPad2d`:

In [21]:
replication_padder = nn.ReplicationPad2d(padding=1)
replication_padder(image)

tensor([[[[5., 5., 0., 8., 7., 8., 1., 1.],
          [5., 5., 0., 8., 7., 8., 1., 1.],
          [1., 1., 9., 5., 0., 7., 7., 7.],
          [6., 6., 0., 2., 4., 6., 6., 6.],
          [9., 9., 7., 6., 6., 8., 4., 4.],
          [8., 8., 3., 8., 5., 1., 3., 3.],
          [7., 7., 2., 7., 0., 1., 0., 0.],
          [7., 7., 2., 7., 0., 1., 0., 0.]]]])

In [22]:
padded = F.pad(image, pad=(1, 1, 1, 1), mode="replicate")
padded

tensor([[[[5., 5., 0., 8., 7., 8., 1., 1.],
          [5., 5., 0., 8., 7., 8., 1., 1.],
          [1., 1., 9., 5., 0., 7., 7., 7.],
          [6., 6., 0., 2., 4., 6., 6., 6.],
          [9., 9., 7., 6., 6., 8., 4., 4.],
          [8., 8., 3., 8., 5., 1., 3., 3.],
          [7., 7., 2., 7., 0., 1., 0., 0.],
          [7., 7., 2., 7., 0., 1., 0., 0.]]]])

In reflection padding, it gets a bit trickier. It is like the outer columns and rows are
used as axes for the reflection.

In PyTorch, you can use the functional form `F.pad()` with `mode="reflect"`, or use
the module version `nn.ReflectionPad2d`:

In [23]:
reflection_padder = nn.ReflectionPad2d(padding=1)
reflection_padder(image)

tensor([[[[9., 1., 9., 5., 0., 7., 7., 7.],
          [0., 5., 0., 8., 7., 8., 1., 8.],
          [9., 1., 9., 5., 0., 7., 7., 7.],
          [0., 6., 0., 2., 4., 6., 6., 6.],
          [7., 9., 7., 6., 6., 8., 4., 8.],
          [3., 8., 3., 8., 5., 1., 3., 1.],
          [2., 7., 2., 7., 0., 1., 0., 1.],
          [3., 8., 3., 8., 5., 1., 3., 1.]]]])

In [24]:
padded = F.pad(image, pad=(1, 1, 1, 1), mode="reflect")
padded

tensor([[[[9., 1., 9., 5., 0., 7., 7., 7.],
          [0., 5., 0., 8., 7., 8., 1., 8.],
          [9., 1., 9., 5., 0., 7., 7., 7.],
          [0., 6., 0., 2., 4., 6., 6., 6.],
          [7., 9., 7., 6., 6., 8., 4., 8.],
          [3., 8., 3., 8., 5., 1., 3., 1.],
          [2., 7., 2., 7., 0., 1., 0., 1.],
          [3., 8., 3., 8., 5., 1., 3., 1.]]]])

In circular padding, the left-most (right-most) column gets copied as the right (left) padded column.

Similarly, the top-most
(bottom-most) row gets copied as the bottom (top) padded row. The corners
receive the values of the diametrically opposed corner.

In PyTorch, you can use the functional form `F.pad()` with `mode="circular"`. Older
PyTorch releases have no module version of circular padding; more recent ones also offer `nn.CircularPad2d`.

In [25]:
padded = F.pad(image, pad=(1, 1, 1, 1), mode="circular")
padded

tensor([[[[0., 7., 2., 7., 0., 1., 0., 7.],
          [1., 5., 0., 8., 7., 8., 1., 5.],
          [7., 1., 9., 5., 0., 7., 7., 1.],
          [6., 6., 0., 2., 4., 6., 6., 6.],
          [4., 9., 7., 6., 6., 8., 4., 9.],
          [3., 8., 3., 8., 5., 1., 3., 8.],
          [0., 7., 2., 7., 0., 1., 0., 7.],
          [1., 5., 0., 8., 7., 8., 1., 5.]]]])
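For readers more familiar with NumPy, the four padding modes have direct counterparts in `np.pad`: `"constant"`, `"edge"` (replicate), `"reflect"`, and `"wrap"` (circular). The sketch below applies them to a 2D copy of our image (`image_2d` is a name introduced here) and prints the top padded row of each, which can be compared with the PyTorch outputs above.

```python
import numpy as np

image_2d = np.array([
    [5, 0, 8, 7, 8, 1],
    [1, 9, 5, 0, 7, 7],
    [6, 0, 2, 4, 6, 6],
    [9, 7, 6, 6, 8, 4],
    [8, 3, 8, 5, 1, 3],
    [7, 2, 7, 0, 1, 0],
])

# One row/column of padding on every side, using each mode
padded_versions = {
    "constant": np.pad(image_2d, pad_width=1, mode="constant", constant_values=0),
    "replicate": np.pad(image_2d, pad_width=1, mode="edge"),
    "reflect": np.pad(image_2d, pad_width=1, mode="reflect"),
    "circular": np.pad(image_2d, pad_width=1, mode="wrap"),
}

for name, padded_img in padded_versions.items():
    # every padded image is 8x8; show only the top padded row
    print(name, padded_img.shape, padded_img[0])
```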

By padding an image, it is possible to get resulting images with the same shape as
input images, or even larger, should you choose to stuff more and more rows and
columns into the input image. 

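Assuming we’re doing symmetrical padding of size p,
the resulting shape is given by the formula below (rounding down whenever the stride does not divide evenly):

$$
\Large
(h_i, w_i) * f = \left(\left\lfloor\frac{(h_i + 2p) - f}{s}\right\rfloor + 1, \left\lfloor\frac{(w_i + 2p) - f}{s}\right\rfloor + 1\right)
$$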

We’re basically extending the original dimensions by 2p pixels each.
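The shape arithmetic can be wrapped in a tiny helper. This is a sketch, not a PyTorch function: `conv_output_size` is a hypothetical name, it handles one dimension at a time, and it uses the floor-division form of the formula (the form that matches PyTorch's behavior when the dilation is one).

```python
def conv_output_size(size, f, p=0, s=1):
    """Output length along one dimension for an input of length `size`,
    a filter of size `f`, symmetric padding `p`, and stride `s`."""
    return (size + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4: filter only, as in the first convolution
print(conv_output_size(6, 3, s=2))   # 2: stride of two
print(conv_output_size(6, 3, p=1))   # 6: padding of one preserves the size
```

For a 3x3 filter and a stride of one, a padding of one is exactly what is needed to keep the original size.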

## A REAL Filter

Let’s try an edge detector filter from traditional
computer vision for a change:

In [26]:
edge_matrix = np.array([
  [[
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0]
  ]]
])

kernel_edge = torch.as_tensor(edge_matrix).float()
kernel_edge.shape

torch.Size([1, 1, 3, 3])

And let’s apply it to a different region of our (padded) input image.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/padding2.png?raw=1)

As you can see, filters, other than the identity one, do not simply copy the value at
the center. 

The element-wise multiplication finally means something.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/padding3.png?raw=1)

Let’s apply this filter to our image.

In [27]:
padded = F.pad(image, pad=(1, 1, 1, 1), mode="constant", value=0)
conv_padded = F.conv2d(padded, kernel_edge, stride=1)
conv_padded

tensor([[[[-19.,  22., -20., -12., -17.,  11.],
          [ 16., -30.,  -1.,  23.,  -7., -14.],
          [-14.,  24.,   7.,  -2.,   1.,  -7.],
          [-15., -10.,  -1.,  -1., -15.,   1.],
          [-13.,  13., -11.,  -5.,  13.,  -7.],
          [-18.,   9., -18.,  13.,  -3.,   4.]]]])
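For zero padding specifically, the explicit `F.pad()` step is optional: `F.conv2d()` (like `nn.Conv2d`) takes a `padding` argument that adds zeros on every side before convolving. The sketch below compares both routes for the edge filter; `image_t` and `edge_kernel` are names introduced for this self-contained example.

```python
import torch
import torch.nn.functional as F

image_t = torch.tensor([
    [5, 0, 8, 7, 8, 1],
    [1, 9, 5, 0, 7, 7],
    [6, 0, 2, 4, 6, 6],
    [9, 7, 6, 6, 8, 4],
    [8, 3, 8, 5, 1, 3],
    [7, 2, 7, 0, 1, 0],
]).float().view(1, 1, 6, 6)
edge_kernel = torch.tensor([
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]).float().view(1, 1, 3, 3)

# Route 1: pad explicitly, then convolve
explicit = F.conv2d(
    F.pad(image_t, pad=(1, 1, 1, 1), mode="constant", value=0),
    edge_kernel,
    stride=1,
)
# Route 2: let the convolution add the zero padding itself
builtin = F.conv2d(image_t, edge_kernel, stride=1, padding=1)

print(builtin.shape)                     # torch.Size([1, 1, 6, 6])
print(torch.allclose(explicit, builtin))
```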

## Pooling

Pooling is different from the former operations: It splits the image into tiny chunks, performs an operation on each chunk (that yields a single value), and puts the resulting values together as the resulting image.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/pooling1.png?raw=1)

Our input image is split into nine chunks, and we perform a simple max operation
(hence, max pooling) on each chunk (really, it is just taking the largest value in each
chunk). Then, these values are put together, in order, to produce a smaller
resulting image.

>The larger the pooling kernel, the smaller the resulting image.

In PyTorch, as usual, we have both forms: `F.max_pool2d()` and `nn.MaxPool2d`. 

Let’s
use the functional form to replicate the max pooling.

In [28]:
pooled = F.max_pool2d(conv_padded, kernel_size=2)
pooled

tensor([[[[22., 23., 11.],
          [24.,  7.,  1.],
          [13., 13., 13.]]]])

And then let’s use the module version to illustrate the large four-by-four pooling.

In [29]:
maxpool4 = nn.MaxPool2d(kernel_size=4)
pooled4 = maxpool4(conv_padded)
pooled4

tensor([[[[24.]]]])

Besides max pooling, average pooling is also fairly common. As the name
suggests, it will output the average pixel value for each chunk. 

In PyTorch, we have
`F.avg_pool2d()` and `nn.AvgPool2d`.

In [30]:
pooled = F.avg_pool2d(conv_padded, kernel_size=2)
pooled

tensor([[[[-2.7500, -2.5000, -6.7500],
          [-3.7500,  0.7500, -5.0000],
          [-2.2500, -5.2500,  1.7500]]]])

In [31]:
avgpool4 = nn.AvgPool2d(kernel_size=4)
pooled4 = avgpool4(conv_padded)
pooled4

tensor([[[[-2.0625]]]])
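The "split into chunks, reduce each chunk" idea can be reproduced with a plain NumPy reshape. This is only a sketch for non-overlapping chunks whose size divides the image evenly; `conv_padded_2d` is a 2D copy of the edge-filter output shown earlier, typed in by hand.

```python
import numpy as np

conv_padded_2d = np.array([
    [-19,  22, -20, -12, -17,  11],
    [ 16, -30,  -1,  23,  -7, -14],
    [-14,  24,   7,  -2,   1,  -7],
    [-15, -10,  -1,  -1, -15,   1],
    [-13,  13, -11,  -5,  13,  -7],
    [-18,   9, -18,  13,  -3,   4],
])

k = 2  # pooling kernel size
h, w = conv_padded_2d.shape
# Axes become (chunk row, row inside chunk, chunk column, column inside chunk)
chunks = conv_padded_2d.reshape(h // k, k, w // k, k)

max_pooled = chunks.max(axis=(1, 3))   # largest value in each 2x2 chunk
avg_pooled = chunks.mean(axis=(1, 3))  # average value in each 2x2 chunk
print(max_pooled)
print(avg_pooled)
```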

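Can I use a stride of a different size?

Sure! By default, the pooling stride equals the kernel size, so the chunks do not overlap. Setting `stride=1` makes the chunks overlap, and the resulting image shrinks by only one pixel in each dimension.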

In [32]:
pooled = F.max_pool2d(conv_padded, kernel_size=2, stride=1)
pooled

tensor([[[[22., 22., 23., 23., 11.],
          [24., 24., 23., 23.,  1.],
          [24., 24.,  7.,  1.,  1.],
          [13., 13., -1., 13., 13.],
          [13., 13., 13., 13., 13.]]]])

## Flattening

Flattening simply collapses a tensor while preserving the first
dimension, so we keep the number of data points and merge all other
dimensions into one.

In [33]:
flattened = nn.Flatten()(pooled)
flattened

tensor([[22., 22., 23., 23., 11., 24., 24., 23., 23.,  1., 24., 24.,  7.,  1.,
          1., 13., 13., -1., 13., 13., 13., 13., 13., 13., 13.]])

We can also accomplish
the same thing using `view()`.

In [34]:
pooled.view(1, -1)

tensor([[22., 22., 23., 23., 11., 24., 24., 23., 23.,  1., 24., 24.,  7.,  1.,
          1., 13., 13., -1., 13., 13., 13., 13., 13., 13., 13.]])
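With more than one data point, the effect of preserving the first dimension is easier to see. The sketch below uses a made-up batch of two three-channel 4x4 "images" and compares `nn.Flatten()`, `torch.flatten()` with `start_dim=1`, and `view()`.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 2 data points, 3 channels, 4x4 pixels each
batch = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).view(2, 3, 4, 4)

flat_module = nn.Flatten()(batch)              # module form
flat_func = torch.flatten(batch, start_dim=1)  # functional form
flat_view = batch.view(batch.size(0), -1)      # manual reshape

print(flat_module.shape)  # torch.Size([2, 48])
print(torch.equal(flat_module, flat_func), torch.equal(flat_module, flat_view))
```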

## Typical Architecture

A typical architecture uses a sequence of one or more typical convolutional
blocks, with each block consisting of three operations:

1. Convolution
2. Activation function
3. Pooling

As images go through these operations, they will shrink in size.

After the sequence of blocks, the image gets flattened: Hopefully, at this stage,
there is no loss of information occurring by considering each value in the flattened
tensor a feature on its own.

If you think of it, what those typical convolutional blocks do is
akin to pre-processing images and converting them into
features.

### LeNet-5

LeNet-5 is a seven-level convolutional neural network developed by Yann LeCun in
1998 to recognize hand-written digits in 28x28 pixel images: the famous MNIST
dataset!

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/images/architecture_lenet.png?raw=1)

Adapting LeNet-5 to today’s standards, it could be implemented like this:

In [None]:
lenet = nn.Sequential()

###### Featurizer ##########
# Block 1: 1@28x28 -> 6@28x28 -> 6@14x14
lenet.add_module("C1", nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2))
lenet.add_module("func1", nn.ReLU())
lenet.add_module("S2", nn.MaxPool2d(kernel_size=2))

# Block 2: 6@14x14 -> 16@10x10 -> 16@5x5
lenet.add_module("C3", nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5))
lenet.add_module("func2", nn.ReLU())
lenet.add_module("S4", nn.MaxPool2d(kernel_size=2))

# Block 3: 16@5x5 -> 120@1x1
lenet.add_module("C5", nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5))
lenet.add_module("func3", nn.ReLU())

# Flattening
lenet.add_module("flatten", nn.Flatten())

###### Classification ##########
# Hidden Layer
lenet.add_module("F6", nn.Linear(in_features=120, out_features=84))
lenet.add_module("func4", nn.ReLU())
# Output Layer
lenet.add_module("OUTPUT", nn.Linear(in_features=84, out_features=10))

LeNet-5 used three convolutional blocks, although the last one does not have a
max pooling, because the convolution already produces a single pixel.

Then, these 120 values (or features) are flattened and fed to a typical hidden layer
with 84 units. 

The last step is, obviously, the output layer, which produces ten
logits to be used for digit classification (from 0 to 9, there are ten classes).
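A quick way to sanity-check the shapes annotated in the block comments is to push a dummy image through the layers one at a time. The sketch below restates the model compactly (as `lenet_check`, a name used only here), assuming six output channels in the first convolution as the block comments indicate, and records the output shape after each layer.

```python
import torch
import torch.nn as nn

lenet_check = nn.Sequential(
    # Block 1: 1@28x28 -> 6@28x28 -> 6@14x14
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(kernel_size=2),
    # Block 2: 6@14x14 -> 16@10x10 -> 16@5x5
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(kernel_size=2),
    # Block 3: 16@5x5 -> 120@1x1
    nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(),
    nn.Flatten(),
    # Classifier: 120 -> 84 -> 10
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

x = torch.randn(1, 1, 28, 28)  # one dummy single-channel 28x28 image
shapes = []
with torch.no_grad():
    for layer in lenet_check:
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))

for name, shape in shapes:
    print(f"{name:10s} {shape}")
```

The final shape, `(1, 10)`, corresponds to the ten logits, one per digit class.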

## Multiclass Classification