# Convolution arithmetic

A technical report on convolution arithmetic in the context of deep learning.

The code and the images of this tutorial are free to use as regulated by the 
licence and subject to proper attribution:

* \[1\] Vincent Dumoulin, Francesco Visin - [A guide to convolution arithmetic
  for deep learning](https://arxiv.org/abs/1603.07285)
  ([BibTeX](https://gist.github.com/fvisin/165ca9935392fa9600a6c94664a01214))

## Convolution animations

_N.B.: Blue maps are inputs, and cyan maps are outputs._

<table style="width:100%; table-layout:fixed;">
  <tr>
    <td><img width="150px" src="gif/no_padding_no_strides.gif"></td>
    <td><img width="150px" src="gif/arbitrary_padding_no_strides.gif"></td>
    <td><img width="150px" src="gif/same_padding_no_strides.gif"></td>
    <td><img width="150px" src="gif/full_padding_no_strides.gif"></td>
  </tr>
  <tr>
    <td>No padding, no strides</td>
    <td>Arbitrary padding, no strides</td>
    <td>Half padding, no strides</td>
    <td>Full padding, no strides</td>
  </tr>
  <tr>
    <td><img width="150px" src="gif/no_padding_strides.gif"></td>
    <td><img width="150px" src="gif/padding_strides.gif"></td>
    <td><img width="150px" src="gif/padding_strides_odd.gif"></td>
    <td></td>
  </tr>
  <tr>
    <td>No padding, strides</td>
    <td>Padding, strides</td>
    <td>Padding, strides (odd)</td>
    <td></td>
  </tr>
</table>

## Filters
Unlike in the animations above, the input to a Conv2d layer usually has several channels.

When defined, the Conv2d layer needs a specification of the number of input and output channels.
Along with these, the size of the kernel $(k_h, k_w)$, size of the padding $(p_h, p_w)$, and the stride $(s_h, s_w)$ need to be specified. 

Using these, the layer creates a set of $N$ filters, where $N$ is the number of output channels specified; each filter spans all $c$ channels of the input. For a square kernel of side $k$, with one bias term per filter, the total number of parameters that the layer can learn is therefore:
$$k \times k \times c \times N + N$$
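
As a quick check of the count above, the following minimal sketch evaluates it in plain Python. The helper name `conv2d_param_count` and its `bias` flag are illustrative assumptions, not part of any library; $c$ is taken to be the number of input channels.

```python
def conv2d_param_count(k, c, n, bias=True):
    """Learnable parameters of a square-kernel convolution layer.

    k: kernel side, c: input channels, n: number of filters (output channels).
    Hypothetical helper, for illustration only.
    """
    weights = k * k * c * n      # one k x k slice per input channel, per filter
    biases = n if bias else 0    # one bias term per filter
    return weights + biases


# 7x7 kernel, 3 input channels, 64 filters, with bias
print(conv2d_param_count(7, 3, 64))               # 9472
# 3x3 kernel, 64 -> 64 channels, no bias
print(conv2d_param_count(3, 64, 64, bias=False))  # 36864
```

With `bias=False` the `+ N` term drops out, which is the case for the ResNet convolutions listed later in this document.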

|<img width="600px" src="gif/convfilter.png">|
|:--:| 
|Visualization of filters and the input/output data dimensions. Source: https://convviz.netlify.app/|


In action this looks like<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1):

<table style="width:100%; table-layout:fixed;text-align: center; margin-left: auto; margin-right: auto;">
  <tr>
    <td><img width="800px" src="gif/convolution-animation-3x3-kernel.gif"></td>
    <td><img width="800px" src="gif/convolution-animation-3x3-kernel-same-padding.gif"></td>
  </tr>
  <tr>
    <td>No Padding AKA "Valid"</td>
    <td>[1,1,1,1] Padding AKA "Same"</td>
  </tr>
</table>

As seen, each filter acts on all channels of the input and produces one channel of the output.

<a name="cite_note-1"></a>1. [^](#cite_ref-1) https://animatedai.github.io
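
To make the channel bookkeeping in the animations concrete, here is a deliberately naive sketch of a multi-channel convolution on nested Python lists. It assumes no padding and a stride of 1, and, like most deep learning frameworks, it does not flip the kernel. The function name `conv2d_naive` and the toy data are assumptions made for this example.

```python
def conv2d_naive(x, filters, bias):
    """Valid (no padding), stride-1 convolution on nested lists.

    x:       input of shape [c][h][w]
    filters: weights of shape [N][c][k][k]
    bias:    list of N floats
    Returns an output of shape [N][h - k + 1][w - k + 1].
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    out = []
    for f, b in zip(filters, bias):          # one output channel per filter
        channel = []
        for i in range(h - k + 1):
            row = []
            for j in range(w - k + 1):
                acc = b
                for ch in range(c):          # each filter spans all input channels
                    for di in range(k):
                        for dj in range(k):
                            acc += f[ch][di][dj] * x[ch][i + di][j + dj]
                row.append(acc)
            channel.append(row)
        out.append(channel)
    return out


# 2-channel 4x4 input of ones, 3 filters of ones with a 3x3 kernel, zero bias
x = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
y = conv2d_naive(x, filters, bias=[0.0, 0.0, 0.0])
print(len(y), len(y[0]), len(y[0][0]))  # 3 2 2
print(y[0][0][0])                        # 18.0 = 3 * 3 * 2
```

The output has as many channels as there are filters, regardless of how many channels the input has.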

Assuming square input images with $h = w = l$, the following formulae give the output dimensions in terms of the side $l$:

&emsp; Convolutional Layer:
$$l_\text{out} = \left\lfloor \frac{l_\text{in} - k + 2 \times p}{s}\right\rfloor + 1$$
$$c_\text{out} = N$$

&emsp; Pooling Layer:
$$l_\text{out} = \left\lfloor \frac{l_\text{in} - k}{s}\right\rfloor + 1$$
$$c_\text{out} = c_\text{in}$$

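&emsp; &emsp; where $k$ is the side of the kernel/filter, $p$ is the padding added on each side of the input image, $s$ is the stride, and $N$ is the number of filters. The padding and the kernel need not be symmetric, but we assume they are. If the pooling layer is padded as well, as in the `MaxPool2d(kernel_size=3, stride=2, padding=1)` layer used below, the $2 \times p$ term is added to its numerator just as in the convolutional formula.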
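
As a worked instance of the convolutional formula, take $k = 7$, $p = 3$, $s = 2$ and an input of side $l_\text{in} = 100$ (values chosen here to match the first ResNet convolution listed below):

$$l_\text{out} = \left\lfloor \frac{100 - 7 + 2 \times 3}{2}\right\rfloor + 1 = \left\lfloor 49.5 \right\rfloor + 1 = 50$$

Similarly, a $3 \times 3$ convolution with $p = 1$ and $s = 2$ applied to an input of side $25$ gives:

$$l_\text{out} = \left\lfloor \frac{25 - 3 + 2 \times 1}{2}\right\rfloor + 1 = 12 + 1 = 13$$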

In [None]:
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)


In [None]:
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2))
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2))
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2))
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)

|<img width="300px" src="gif/shortcut.png">|<img width="400px" src="gif/downsample.png">|
|:--:|:--:|
|A shortcut connection in `forward` method|How `downsample` works|
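
The `downsample` branch exists so that the shortcut tensor has the same shape as the main-path output before the two are added. The following is a minimal sketch of that shape check, using the hyperparameters of the first `BasicBlock` of `layer2` printed above. The helper `conv_out` and the variable names are assumptions introduced for this example.

```python
def conv_out(l, k, s, p=0):
    """Output side of a square convolution: floor((l - k + 2p) / s) + 1."""
    return (l - k + 2 * p) // s + 1


l_in = 25  # side entering layer2 for a 100x100 input image

# Main path: 3x3 / stride 2 / padding 1, then 3x3 / stride 1 / padding 1
main = conv_out(conv_out(l_in, k=3, s=2, p=1), k=3, s=1, p=1)
main_channels = 128

# Shortcut path: the 1x1 / stride 2 `downsample` convolution, no padding
shortcut = conv_out(l_in, k=1, s=2)
shortcut_channels = 128

# Both paths must agree before the element-wise addition
print((main, main_channels), (shortcut, shortcut_channels))  # (13, 128) (13, 128)
```

Because the shortcut runs in parallel with the main path rather than after it, it does not appear as an extra step when the shapes are traced sequentially.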


In [1]:
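def Conv2d(in_channels, out_channels, kernel_size, stride, padding=(0, 0), bias=True):
    # Shape-only stand-in for a convolution layer; `bias` does not affect the shape.
    # Assuming h = w, so only the side length l is tracked.
    def conv(l, channels):
        if channels != in_channels:
            raise ValueError(f"Channels do not match: expected {in_channels}, got {channels}")
        return (l - kernel_size[0] + 2 * padding[0]) // stride[0] + 1, out_channels
    return conv

def MaxPool2d(kernel_size, stride, padding):
    # Pooling changes the side length but keeps the channel count.
    def maxpool(l, channels):
        return (l - kernel_size + 2 * padding) // stride + 1, channels
    return maxpool

def AvgPool2d(output_size):
    # Stand-in for adaptive average pooling: the output size is fixed regardless of l.
    def avgpool(l, channels):
        return output_size, channels
    return avgpool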

In [2]:
conv1 = Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
maxpool = MaxPool2d(kernel_size=3, stride=2, padding=1)
layer1conv1 = Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer1conv2 = Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer1conv3 = Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer1conv4 = Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer2conv1 = Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
layer2conv2 = Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
# layer2conv3 = Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
layer2conv4 = Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer2conv5 = Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer3conv1 = Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
layer3conv2 = Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
# layer3conv3 = Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
layer3conv4 = Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer3conv5 = Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer4conv1 = Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
layer4conv2 = Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
# layer4conv3 = Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
layer4conv4 = Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
layer4conv5 = Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
avgpool = AvgPool2d(output_size=(1, 1))

In [4]:
side = 100; channels = 3
side, channels = conv1(side, channels); print(f"Conv1 - dimensions: {side}x{side}, channels: {channels}")
side, channels = maxpool(side, channels); print(f"Maxpool - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer1conv1(side, channels); print(f"Layer 1 Conv 1 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer1conv2(side, channels); print(f"Layer 1 Conv 2 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer1conv3(side, channels); print(f"Layer 1 Conv 3 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer1conv4(side, channels); print(f"Layer 1 Conv 4 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer2conv1(side, channels); print(f"Layer 2 Conv 1 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer2conv2(side, channels); print(f"Layer 2 Conv 2 - dimensions: {side}x{side}, channels: {channels}")
# side, channels = layer2conv3(side, channels); print(f"dimensions: {side}x{side}, channels: {channels}")
side, channels = layer2conv4(side, channels); print(f"Layer 2 Conv 4 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer2conv5(side, channels); print(f"Layer 2 Conv 5 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer3conv1(side, channels); print(f"Layer 3 Conv 1 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer3conv2(side, channels); print(f"Layer 3 Conv 2 - dimensions: {side}x{side}, channels: {channels}")
# side, channels = layer3conv3(side, channels); print(f"dimensions: {side}x{side}, channels: {channels}")
side, channels = layer3conv4(side, channels); print(f"Layer 3 Conv 4 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer3conv5(side, channels); print(f"Layer 3 Conv 5 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer4conv1(side, channels); print(f"Layer 4 Conv 1 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer4conv2(side, channels); print(f"Layer 4 Conv 2 - dimensions: {side}x{side}, channels: {channels}")
# side, channels = layer4conv3(side, channels); print(f"dimensions: {side}x{side}, channels: {channels}")
side, channels = layer4conv4(side, channels); print(f"Layer 4 Conv 4 - dimensions: {side}x{side}, channels: {channels}")
side, channels = layer4conv5(side, channels); print(f"Layer 4 Conv 5 - dimensions: {side}x{side}, channels: {channels}")
side, channels = avgpool(side, channels); print(f"AvgPool - side: {side}, channels: {channels}")

Conv1 - dimensions: 50x50, channels: 64
Maxpool - dimensions: 25x25, channels: 64
Layer 1 Conv 1 - dimensions: 25x25, channels: 64
Layer 1 Conv 2 - dimensions: 25x25, channels: 64
Layer 1 Conv 3 - dimensions: 25x25, channels: 64
Layer 1 Conv 4 - dimensions: 25x25, channels: 64
Layer 2 Conv 1 - dimensions: 13x13, channels: 128
Layer 2 Conv 2 - dimensions: 13x13, channels: 128
Layer 2 Conv 4 - dimensions: 13x13, channels: 128
Layer 2 Conv 5 - dimensions: 13x13, channels: 128
Layer 3 Conv 1 - dimensions: 7x7, channels: 256
Layer 3 Conv 2 - dimensions: 7x7, channels: 256
Layer 3 Conv 4 - dimensions: 7x7, channels: 256
Layer 3 Conv 5 - dimensions: 7x7, channels: 256
Layer 4 Conv 1 - dimensions: 4x4, channels: 512
Layer 4 Conv 2 - dimensions: 4x4, channels: 512
Layer 4 Conv 4 - dimensions: 4x4, channels: 512
Layer 4 Conv 5 - dimensions: 4x4, channels: 512
AvgPool - side: (1, 1), channels: 512
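
The trace above can also be written as a loop over a table of layer settings, which makes it easy to try other input sizes. This is a condensed sketch rather than the notebook's own code: it keeps one entry per size-changing stage instead of one per convolution, the `STAGES` table and the `trace` function are names introduced here, and the final adaptive average pooling is left out.

```python
# (name, kernel, stride, padding, out_channels); None keeps the channel count.
# Only the first convolution of each layer is listed: the remaining 3x3,
# stride-1, padding-1 convolutions leave the shape unchanged.
STAGES = [
    ("conv1",   7, 2, 3, 64),
    ("maxpool", 3, 2, 1, None),
    ("layer1",  3, 1, 1, 64),
    ("layer2",  3, 2, 1, 128),
    ("layer3",  3, 2, 1, 256),
    ("layer4",  3, 2, 1, 512),
]


def trace(side, channels, stages=STAGES):
    """Return [(name, side, channels), ...] after each stage."""
    shapes = []
    for name, k, s, p, out_channels in stages:
        side = (side - k + 2 * p) // s + 1
        channels = channels if out_channels is None else out_channels
        shapes.append((name, side, channels))
    return shapes


for name, side, channels in trace(100, 3):
    print(f"{name}: {side}x{side}, channels: {channels}")
# last line printed: layer4: 4x4, channels: 512
```

Calling `trace(200, 1)` instead traces a single-channel 200x200 input through the same stages.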



<img width="800px" src="gif/image_sizes.png">

Visualization<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2) of data flowing through ResNet.

<a name="cite_note-2"></a>2. [^](#cite_ref-2) https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8

In [1]:
import torch as T
from torchvision.models import resnet18
from torchinfo import summary
model = resnet18()
model.conv1 = T.nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
model.fc = T.nn.Linear(in_features=512, out_features=1, bias=True)
summary(model, input_size=(256, 1, 200, 200))


Layer (type:depth-idx)                   Output Shape              Param #
ResNet                                   [256, 1]                  --
├─Conv2d: 1-1                            [256, 64, 100, 100]       3,136
├─BatchNorm2d: 1-2                       [256, 64, 100, 100]       128
├─ReLU: 1-3                              [256, 64, 100, 100]       --
├─MaxPool2d: 1-4                         [256, 64, 50, 50]         --
├─Sequential: 1-5                        [256, 64, 50, 50]         --
│    └─BasicBlock: 2-1                   [256, 64, 50, 50]         --
│    │    └─Conv2d: 3-1                  [256, 64, 50, 50]         36,864
│    │    └─BatchNorm2d: 3-2             [256, 64, 50, 50]         128
│    │    └─ReLU: 3-3                    [256, 64, 50, 50]         --
│    │    └─Conv2d: 3-4                  [256, 64, 50, 50]         36,864
│    │    └─BatchNorm2d: 3-5             [256, 64, 50, 50]         128
│    │    └─ReLU: 3-6                    [256, 64, 50, 50]         --
│