# Computer Vision (911.908)

## <font color='crimson'>Convolutions & ConvNets</font>

**Changelog**:
- *Nov. 2022*: adaptations to PyTorch 1.13 + minor fixes
- *Nov. 2024*: adaptations to PyTorch 2.5.1

---

## Contents

- [Motivation](#Motivation)
- [Conventions & Terminology](#Conventions-and-Terminology)
- [1D convolution](#Convolution-in-1D)
- [2D convolution](#Convolution-in-2D)
- [Convolution in neural networks](#Convolution-in-neural-networks)
- [Grouped convolution](#Grouped-convolution)
- [Resources](#Resources)

## Motivation

Let's say we have input signals of large dimension, but with a very distinct inherent structure, e.g., images or acoustic signals.

If we handled these signals simply as "vectors", we would immediately run into problems. Just think of a relatively small $256 \times 256$ RGB image. Mapping such an input, when considered as a vector, to, say, $\mathbb{R}^{1000}$ with a linear layer would require a matrix of size

$$1000 \times (256\cdot 256 \cdot 3) = 1000 \times 196608$$

This already requires about **750 MiB** (roughly 786 MB) of memory, assuming 32-bit floating point numbers (and this would just be the memory footprint of the first layer!).
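The arithmetic behind this number can be checked in a few lines. The sketch below assumes 4 bytes per 32-bit float; the variable names are purely illustrative.

```python
# Memory footprint of a dense 1000 x (256*256*3) weight matrix in float32
n_in = 256 * 256 * 3          # length of the vectorized RGB image
n_out = 1000                  # output dimension
n_params = n_out * n_in       # number of entries in the weight matrix
n_bytes = n_params * 4        # 4 bytes per 32-bit float

print(n_in)                   # 196608
print(n_params)               # 196608000
print(n_bytes / 2**20)        # 750.0 (MiB)
```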

Possibly even more concerning is the fact that processing such signals (e.g., images) should have some degree of invariance (e.g., w.r.t. translation). Put differently, a *representation meaningful at a certain location can / should be used everywhere*.

In [1]:
import torch
import torch.nn as nn
import numpy as np

---

## Conventions and Terminology

As we know, everything in PyTorch is handled in *batches* of size `N`. A batch of size `N=1` would mean only a single input. In the context of deep learning with PyTorch, the batch size is the first dimension of a tensor.    

For convolution operations, we call the second dimension the **number of channels** and any following dimensions the **channel dimensions** (their extents give the *channel size* used below).

A tensor of size `1 x 1 x 10` would mean

- batch size: `1`
- \#channels: `1`
- channel size: `10`

A tensor of size `1 x 10 x 1` would mean

- batch size: `1`
- \#channels: `10`
- channel size: `1`

A tensor of size `100 x 3 x 32 x 32` would mean

- batch size: `100`
- \#channels: `3`
- channel size: `32 x 32`

A typical example of such a tensor would be 100 RGB images (i.e., 3 channels) of width and height of 32 pixels.
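As a quick check of this convention, the following sketch reads the three quantities off a tensor's shape. The variable names (`batch_size`, `n_channels`, `channel_size`) are illustrative and not part of PyTorch.

```python
import torch

x = torch.zeros(100, 3, 32, 32)     # 100 RGB images of 32 x 32 pixels

batch_size = x.shape[0]             # first dimension
n_channels = x.shape[1]             # second dimension
channel_size = tuple(x.shape[2:])   # all remaining dimensions

print(batch_size, n_channels, channel_size)  # 100 3 (32, 32)
```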


<div class="alert alert-block alert-info">
We will see that convolution layers in neural networks will take input tensors of this form and output tensors of this form.
</div>

---

## Convolution in 1D

Say we have a signal $\mathbf{x}$, written here as a row vector with $W$ elements, i.e.,

$$\mathbf{x} = [x_1,\ldots,x_W]$$

and a **convolution kernel** $\mathbf{u}$ with $w$ elements, i.e.,

$$\mathbf{u} = [u_1,\ldots,u_w]$$

Then, *convolving* $\mathbf{x}$ with $\mathbf{u}$ means

$$[\mathbf{x} * \mathbf{u}]_i = \sum_{j=1}^w x_{i-1+j} u_j$$

where $[\mathbf{x} * \mathbf{u}]_i$ denotes the output of the convolution operation at the $i$-th position.

**Example**

$$\mathbf{x} = [1,2,3,4,5,6,7,8,9,10]$$

and

$$\mathbf{u} = [1,2,3]$$

(i.e., $W=10$ and $w=3$). For the first output position ($i=1$), we get

$$[\mathbf{x} * \mathbf{u}]_1 = \sum_{j=1}^3 x_{1-1+j} u_j = x_1u_1 + x_2u_2 + x_3u_3 = 14$$

where we started indexing by $1$. This is different from convolution in signal processing, since we visit both signal and kernel elements in *increasing* index order, i.e., the kernel is not flipped. Strictly speaking, this operation is a *cross-correlation*.
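To make the formula above concrete, here is a minimal NumPy sketch (0-based indexing; `conv1d_naive` is a hypothetical helper name). It also compares the result against `np.correlate` and against `np.convolve` with a flipped kernel, which illustrates the difference to signal-processing convolution.

```python
import numpy as np

def conv1d_naive(x, u):
    """Apply the formula above (0-based): out[i] = sum_j x[i + j] * u[j]."""
    W, w = len(x), len(u)
    return np.array([sum(x[i + j] * u[j] for j in range(w))
                     for i in range(W - w + 1)])

x = np.arange(1, 11)
u = np.array([1, 2, 3])

out = conv1d_naive(x, u)
corr = np.correlate(x, u, mode='valid')            # cross-correlation
conv_flipped = np.convolve(x, u[::-1], mode='valid')  # convolution with the flipped kernel

print(out)           # [14 20 26 32 38 44 50 56]
print(corr)          # same values
print(conv_flipped)  # same values
```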

**Illustration**

<img src="1Dconv.svg" style="width: 400px;"/>


**Parameters**

The parameters of this convolution operation are the <font color='blue'>values of the convolution kernel</font>. In the previous example, we thus have 3 parameters (as we did not include bias). If we do include bias, the number of parameters would be 4.

### PyTorch implementation

In [2]:
# create toy input [1,2,3,4,...,10] 
x = torch.tensor(list(np.arange(1, 11)), dtype=torch.float32)

# view the input as a 1x1x10 tensor, i.e., batch-size 1, 1x10 inputs
x = x.view(1, 1, 10)

# 1D convolution
m = nn.Conv1d(in_channels=1,   # one input channel
              out_channels=1,  # one output channel
              kernel_size=3,   # use a kernel size of 3
              stride=1,        # move the kernel along by steps of 1
              padding=0,       # do not pad the input
              bias=False)      # do not include bias (i.e., the b in Ax+b)

# directly set the convolution kernel parameters (for demonstration)
m.weight.data = torch.tensor([1, 2, 3], dtype=torch.float32).view(1, 1, 3)

# forward through the 1D conv. operation
y = m(x)

print(x.numpy())
print(y.detach().numpy())

# Note that the formula from above also works when we 
# start indexing by 0
print('Check (pos 0): (1*1 + 2*2 + 3*3) = 14')
print('Check (pos 1): (1*2 + 2*3 + 3*4) = 20')

[[[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]]]
[[[14. 20. 26. 32. 38. 44. 50. 56.]]]
Check (pos 0): (1*1 + 2*2 + 3*3) = 14
Check (pos 1): (1*2 + 2*3 + 3*4) = 20


In [3]:
m.weight

Parameter containing:
tensor([[[1., 2., 3.]]], requires_grad=True)

It is important to note that a linear layer (with a suitably structured weight matrix) could implement the *same* operation here, but we would need to be careful when we compute gradient updates, as we would have to make sure that the *zero* entries of the weight matrix stay zero (can be done via masking, for instance) and that the repeated kernel entries stay identical.

**Padding**

By default, *padding* is set to zero, so the output (in the previous example) is of size `1 x 1 x 8`; in other words, a kernel of size 3 with a stride of 1 can be applied 8 times to an input of length 10.
Next, let's pad with *one zero entry* on both sides and still use a stride of 1; in this setting, we **preserve the input dimension**.

In [4]:
x = torch.tensor(list(np.arange(1,11)), dtype=torch.float32).view(1,1,10)
m = nn.Conv1d(in_channels=1, 
              out_channels=1, 
              kernel_size=3, 
              stride=1, 
              padding=1, #!!!!! 
              bias=False) 
y = m(x)
print(y.size())

torch.Size([1, 1, 10])
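The output sizes seen so far follow from a simple counting argument. The sketch below uses a hypothetical helper, `conv_out_len`, and assumes no dilation: it counts how often a kernel of length `w` fits into a (padded) input of length `W`.

```python
def conv_out_len(W, w, stride=1, padding=0):
    """Output length of a 1D convolution (no dilation)."""
    return (W + 2 * padding - w) // stride + 1

print(conv_out_len(10, 3))             # 8   (no padding)
print(conv_out_len(10, 3, padding=1))  # 10  (input length preserved)
print(conv_out_len(10, 3, stride=2))   # 4   (kernel moved in steps of 2)
```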


**Multiple convolution kernels**

We do not have to use just one kernel; we can use as many as we want:

In [5]:
x = torch.tensor(list(np.arange(1, 11)), dtype=torch.float32).view(1, 1, 10)

m = nn.Conv1d(in_channels=1, 
              out_channels=5, 
              kernel_size=3, 
              stride=1, 
              padding=1, 
              bias=False)
y = m(x)
print(y.size())

torch.Size([1, 5, 10])


In [6]:
m.weight.shape

torch.Size([5, 1, 3])

In the previous example, we not only convolve with one kernel but with **5 (different) kernels**. Hence, we obtain five output channels of size 10. The parameter tensor of the 1D convolution layer from the last example is:

In [7]:
print(m.weight.data.size())

torch.Size([5, 1, 3])


That is, 5 kernels of size $1 \times 3$.
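One way to convince yourself that the five output channels are independent of each other: convolving the input with the `k`-th kernel alone should reproduce the `k`-th output channel. A minimal sketch, using `torch.nn.functional.conv1d` for the single-kernel case (the list name `matches` is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.arange(1, 11, dtype=torch.float32).view(1, 1, 10)
m = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=3, padding=1, bias=False)

matches = []
with torch.no_grad():
    y = m(x)                                     # 1 x 5 x 10
    for k in range(5):
        kernel_k = m.weight[k:k + 1]             # k-th kernel, shape 1 x 1 x 3
        y_k = F.conv1d(x, kernel_k, padding=1)   # 1 x 1 x 10
        matches.append(torch.allclose(y_k, y[:, k:k + 1]))

print(y.shape)   # torch.Size([1, 5, 10])
print(matches)
```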

**Stride**

Let's not shift the kernel by one position at a time, but by two. This effectively *downsamples* the output.

In [8]:
x = torch.tensor(list(np.arange(1, 11)), dtype=torch.float32).view(1, 1, 10)

m = nn.Conv1d(in_channels=1, 
              out_channels=1,
              kernel_size=3, 
              stride=2, 
              padding=0, 
              bias=False)
y = m(x)
print(y)
print(y.size())

# check the number of parameters; striding does not change it
n_params = 0
for p in m.parameters(): 
    n_params += p.numel()
print('#Parameters: ', n_params)

tensor([[[-0.0832, -0.2018, -0.3204, -0.4390]]],
       grad_fn=<ConvolutionBackward0>)
torch.Size([1, 1, 4])
#Parameters:  3
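The downsampling effect can be made explicit with a fixed kernel: a stride of 2 gives every second output of the stride-1 convolution. A minimal sketch, using the kernel `[1, 2, 3]` from the earlier example instead of random weights:

```python
import torch
import torch.nn.functional as F

x = torch.arange(1, 11, dtype=torch.float32).view(1, 1, 10)
u = torch.tensor([1., 2., 3.]).view(1, 1, 3)

y_s1 = F.conv1d(x, u, stride=1)   # 8 outputs
y_s2 = F.conv1d(x, u, stride=2)   # 4 outputs

print(y_s1.flatten().tolist())    # [14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0]
print(y_s2.flatten().tolist())    # [14.0, 26.0, 38.0, 50.0]
print(torch.equal(y_s2, y_s1[..., ::2]))  # True
```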



<div class="alert alert-block alert-info">
    <b>Bias.</b> In all previous examples, we disabled the bias term. By default, it is enabled, which adds one parameter per output channel (here: one additional parameter).
</div>


In [9]:
x = torch.tensor(list(np.arange(1, 11)), dtype=torch.float32).view(1, 1, 10)
m = nn.Conv1d(1, 1, 3, stride=2, padding=0, bias=True)
y = m(x)
print(y.size())

n_params = 0
for p in m.parameters(): 
    n_params += p.numel()
print('#Parameters: ', n_params)

torch.Size([1, 1, 4])
#Parameters:  4


<div class="alert alert-block alert-info">
Also note that at all positions where we apply the kernel, we use the same kernel parameters. We call this <b>weight sharing</b>.
</div>


### Convolution as a linear operation

In fact, it's fairly easy to see that a convolution operation (without bias in our example) can be implemented via a linear/fully-connected layer. In the following visualization, gray dots (in the bottom figure) indicate zero values.

<img src="WeightSharing.svg" style="width: 450px;"/>

In [10]:
# input is a 1x5 tensor
x_inp = torch.tensor([[1.,2.,3.,4.,5.]])

# 1D convolution layer, 1 input channel, 1 output channel, kernel size 3, no bias
layer = nn.Conv1d(1,1,3,bias=False)
layer.weight.data = torch.tensor([[[1.,2.,3.]]])
print('Forward pass through a PyTorch 1D convolution layer')
print(layer(x_inp.unsqueeze(0)).detach().squeeze().numpy())
print()

# let's implement the same mapping using a linear layer using the following 
# weight matrix:
A = torch.tensor([
    [1.,2.,3.,0.,0.],
    [0.,1.,2.,3.,0.],
    [0.,0.,1.,2.,3.]
])

lin_layer = nn.Linear(5, 3, bias=False)
lin_layer.weight.data = A
print('Forward pass through a corresponding PyTorch linear layer')
print(lin_layer(x_inp).detach().squeeze().numpy())

Forward pass through a PyTorch 1D convolution layer
[14. 20. 26.]

Forward pass through a corresponding PyTorch linear layer
[14. 20. 26.]
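The weight matrix above was written out by hand. For other input lengths, it can be built programmatically; the sketch below uses a hypothetical helper, `conv_matrix`, and checks the result against `F.conv1d`. Building the same matrix from a kernel of ones gives a 0/1 mask marking the entries that are allowed to be non-zero.

```python
import torch
import torch.nn.functional as F

def conv_matrix(u, W):
    """Banded matrix A such that A @ x is the 1D convolution of x with u (stride 1, no padding)."""
    w = u.numel()
    A = torch.zeros(W - w + 1, W)
    for i in range(W - w + 1):
        A[i, i:i + w] = u          # same kernel in every row, shifted by one position
    return A

u = torch.tensor([1., 2., 3.])
x = torch.arange(1., 11.)          # [1, 2, ..., 10]

A = conv_matrix(u, x.numel())
y_lin = A @ x
y_conv = F.conv1d(x.view(1, 1, -1), u.view(1, 1, -1)).flatten()

mask = conv_matrix(torch.ones(3), x.numel())   # 1 where A may be non-zero

print(A.shape)                          # torch.Size([8, 10])
print(torch.allclose(y_lin, y_conv))    # True
print(int(mask.sum()))                  # 24 non-zero positions, but only 3 distinct parameters
```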


---

## Convolution in 2D

We will not discuss 2D convolution formally, as the specification of the convolution operation becomes tedious. The principle is quite simple and remains the same as in the 1D case. 

In the following example, we have an input tensor of size `W x H` (width times height) and `C` channels. The convolution kernel has spatial size `w x h` and also `C` channels.

Convolution with this kernel (using `stride=1` and *no* padding) gives a tensor of size `1 x (W-w+1) x (H-h+1)`. If we use 2 kernels, we get, as output, a tensor of size `2 x (W-w+1) x (H-h+1)`, etc.

<img src="2Dconv.svg" style="width: 350px;"/>
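Although we skip the formal definition, the principle can be sketched in a few lines: slide the kernel over all spatial positions and, at each position, sum the elementwise products over all channels. The helper `conv2d_naive` below is a hypothetical, slow reference implementation (stride 1, no padding, single kernel), compared against `F.conv2d`. Note that PyTorch orders the spatial dimensions as height, then width.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(x, k):
    """x: (C, H, W) input, k: (C, h, w) kernel -> (H-h+1, W-w+1) output."""
    C, H, W = x.shape
    _, h, w = k.shape
    out = torch.zeros(H - h + 1, W - w + 1)
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            # multiply the C x h x w patch with the kernel and sum everything up
            out[i, j] = (x[:, i:i + h, j:j + w] * k).sum()
    return out

torch.manual_seed(0)
x = torch.randn(3, 8, 8)     # C=3 channels, 8 x 8 pixels
k = torch.randn(3, 3, 3)     # one kernel with C=3 channels of size 3 x 3

out = conv2d_naive(x, k)
ref = F.conv2d(x.unsqueeze(0), k.unsqueeze(0))[0, 0]

print(out.shape)                            # torch.Size([6, 6])
print(torch.allclose(out, ref, atol=1e-5))
```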

**Parameters**: The number of parameters in the first case (one kernel) is `w*h*C` (or `w*h*C+1` if we include the bias, as there is one bias value per kernel). We see that **weight sharing** allows us to efficiently process fairly large inputs with relatively few parameters.

Also note that in case we use `K` kernels, the parameters of the convolution layer are stored in a `K x C x w x h` tensor. If bias is included, we also have `K` additional parameters.



In [11]:
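x = torch.randn(16,3,32,32)                               # 16 RGB images of size 32 x 32
c1 = nn.Conv2d(3,10,kernel_size=3,bias=False)             # 10 kernels, each with 3 channels of size 3 x 3
out1 = c1(x)                                              # -> 16 x 10 x 30 x 30
c2 = nn.Conv2d(10,100,stride=1,kernel_size=7,bias=False)  # 100 kernels, each with 10 channels of size 7 x 7
out2 = c2(out1)                                           # -> 16 x 100 x 24 x 24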





In [12]:
K = 3 # 3 kernels / output channels
m_wo_bias = nn.Conv2d(1, K, 3, stride=2, padding=0, bias=False) # 3x3 kernels
m_w_bias = nn.Conv2d(1, K, 3, stride=2, padding=0, bias=True)   # 3x3 kernels

for (name, p) in m_wo_bias.named_parameters():
    print('Parameter {} : #parameters = {}'.format(name, p.numel()))
for (name, p) in m_w_bias.named_parameters():
    print('Parameter {} : #parameters = {}'.format(name, p.numel()))

Parameter weight : #parameters = 27
Parameter weight : #parameters = 27
Parameter bias : #parameters = 3


**Terminology**: When processing inputs with convolution layers having $K$ kernels, we produce $K$ outputs (i.e., the `out_channels`). We sometimes also say that we produce $K$ **feature maps**. This makes sense if we consider, e.g., a $5 \times 5$ kernel as identifying relevant features; applying this kernel to the full input then produces a **feature map** (just think about an edge detection filter, for example).

### Practical examples

In [13]:
x = torch.randn(10, 6, 32, 32)
m = nn.Conv2d(6, 1, 3, stride=1, padding=0, bias=True)
print(m(x).size())

n_params = 0
for p in m.parameters(): 
    n_params += p.numel()
print('#Parameters: ', n_params)

torch.Size([10, 1, 30, 30])
#Parameters:  55


First, we note that the input tensor is of size `10 x 6 x 32 x 32`, i.e., 

- batch size: `10`
- #channels: `6`
- channel size: `32 x 32`

Our kernel will have size `1 x 6 x 3 x 3`, so `6` channels of size `3 x 3`; the number of channels corresponds to the channels of the input tensor.

**Convolution vs. a linear layer** (aka fully-connected)

In [14]:
x_vec = x.view(10,-1) # vectorize the input, i.e., flatten each sample to 6*32*32 = 6144 values
o_vec = m(x).view(10,-1)

print('x_vec', x_vec.size())
print('o_vec', o_vec.size())

x_vec torch.Size([10, 6144])
o_vec torch.Size([10, 900])


To map the input tensor to the same output (as with our convolution), we would need a mapping

$$f: \mathbb{R}^{6144} \to \mathbb{R}^{900}$$

i.e., a matrix $\mathbf{W}$ of size `6144 x 900` having `5529600` parameters! (our conv. layer only has 54 parameters, or 55 with bias).
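The two parameter counts can be reproduced with plain arithmetic; the variable names in this sketch are illustrative.

```python
# Fully-connected mapping from the vectorized input to the vectorized output
n_in = 6 * 32 * 32            # 6144 input values per sample
n_out = 1 * 30 * 30           # 900 output values per sample
n_linear = n_in * n_out       # entries of the weight matrix

# Convolution layer: one kernel with 6 channels of size 3 x 3, plus one bias value
n_conv = 1 * 6 * 3 * 3 + 1

print(n_linear)               # 5529600
print(n_conv)                 # 55
```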

---

## Convolution in neural networks

In [15]:
import torch.nn.functional as F

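class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # positional arguments: in_channels, out_channels, kernel_size, stride, padding
        self.cl0 = nn.Conv2d(3, 10, 3, 2, 0)
        self.cl1 = nn.Conv2d(10, 20, 3, 2, 0)
        self.lin = nn.Linear(980,100)  # 980 = 20*7*7 (flattened output of cl1 for 32 x 32 inputs)
        
    def forward(self, x): # x is input
        x = F.relu(self.cl0(x))
        x = F.relu(self.cl1(x))
        x = self.lin(x.view(-1,20*7*7))  # flatten to batch-size x 980
        return x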

In [16]:
net = Net()

x = torch.randn(16,3,32,32)
out = net(x)
print(out.size())

torch.Size([16, 100])
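The input size of the linear layer (`980`) depends on the spatial size of the input. A minimal sketch that traces the intermediate shapes for `32 x 32` inputs, using two stand-alone layers configured like `cl0` and `cl1` (the names `h0`, `h1`, and `flat` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

cl0 = nn.Conv2d(3, 10, kernel_size=3, stride=2, padding=0)
cl1 = nn.Conv2d(10, 20, kernel_size=3, stride=2, padding=0)

x = torch.randn(16, 3, 32, 32)
with torch.no_grad():
    h0 = F.relu(cl0(x))              # (32 - 3) // 2 + 1 = 15
    h1 = F.relu(cl1(h0))             # (15 - 3) // 2 + 1 = 7
    flat = h1.flatten(start_dim=1)   # 20 * 7 * 7 = 980 features per sample

print(h0.shape)    # torch.Size([16, 10, 15, 15])
print(h1.shape)    # torch.Size([16, 20, 7, 7])
print(flat.shape)  # torch.Size([16, 980])
```

With a different input resolution, the flattened size changes, and the linear layer would have to be adapted accordingly.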


---

## Grouped convolution

In some applications, it can be necessary to process a multi-channel (e.g., `C` channels) input in a more refined way. 

Conventionally, a convolution kernel for a `C`-channel input would also have `C` channels, i.e., it operates over all channels of the input.

**Example 1**

*Say we want to handle each channel of the input separately*. Assume our input is of size `B x 10 x 32 x 32` and we want to have `50` output channels. We then need `50` kernels in total, each, however, only of size `(10/10) x 3 x 3`, i.e., with a single channel. To achieve this, we need ten `groups`.

Now, the first `50/10=5` kernels are applied on the first channel, producing `5` output channels, the next `5` kernels are applied to the second channel and so on. As we have `10` input channels, we get a total of `50` output channels.

We also see that the number of input channels, as well as the number of output channels needs to be divisible by the number of groups.

**Example 2**

Let's change the number of groups from our previous example to `2` (instead of `10` as before) and keep everything else the same.

We again need `50` kernels, but now of size `(10/2) x 3 x 3`. 
There will be `25` kernels in **group 1** and `25` kernels in **group 2**. Group 1 processes the first half of the input channels, group 2 the second half. As the outputs are concatenated (along the channel dimension), we obtain `50` output channels.



<div class="alert alert-block alert-info">
In principle, you can interpret grouped convolution as first splitting the input into #groups (along the channels) and then processing each group by its own convolution layer with an appropriate number of filters and an appropriate filter size.
</div>
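This interpretation can be checked numerically. The sketch below (for `groups=2`, as in Example 2) splits the input and the weight tensor of a grouped `nn.Conv2d` into two halves, convolves each half separately with `F.conv2d`, and concatenates the results; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 10, 32, 32)
m = nn.Conv2d(10, 50, kernel_size=3, groups=2, bias=False)  # weight: 50 x 5 x 3 x 3

with torch.no_grad():
    x_groups = x.chunk(2, dim=1)           # two inputs with 5 channels each
    w_groups = m.weight.chunk(2, dim=0)    # two sets of 25 kernels each
    parts = [F.conv2d(xg, wg) for xg, wg in zip(x_groups, w_groups)]
    y_manual = torch.cat(parts, dim=1)     # concatenate along the channel dimension
    y_grouped = m(x)

print(y_manual.shape)                                  # torch.Size([1, 50, 30, 30])
print(torch.allclose(y_manual, y_grouped, atol=1e-5))

# Parameter count: grouped vs. conventional convolution
print(m.weight.numel())                                # 2250 = 50 * 5 * 3 * 3
print(50 * 10 * 3 * 3)                                 # 4500 without groups
```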




In [17]:
x = torch.randn(1,10,32,32)
m = nn.Conv2d(10,50,3,1,groups=10, bias=False)
print(m.weight.data.size())
print(m(x).size())

torch.Size([50, 1, 3, 3])
torch.Size([1, 50, 30, 30])


In [18]:
x = torch.randn(1,10,32,32)
m = nn.Conv2d(10,50,3,1,groups=2, bias=False)
print(m.weight.data.size())
print(m(x).size())

torch.Size([50, 5, 3, 3])
torch.Size([1, 50, 30, 30])


---

## Resources

A great resource for understanding convolution is the convolution arithmetic tutorial by Dumoulin & Visin, which can be found [here](https://github.com/vdumoulin/conv_arithmetic).