# Convolution Arithmetic

In this lecture, we discuss a little bit of convolution arithmetic, in particular, transposed convolutions. More details can be found in Dumoulin and Visin's [A guide to convolution arithmetic for deep learning](https://arxiv.org/pdf/1603.07285.pdf).


| Symbol        | Meaning                                          |
| ------------- |:------------------------------------------------:|
| $s$           | Stride size (e.g., $s=1$)                        |
| $k$           | Kernel size (i.e., $k \times k$)                 |
| $p$           | Padding size (e.g., $p=1$ on all sides)          |
| $i$           | Spatial size of **input**  (e.g., $i \times i$)  |
| $o$           | Spatial size of **output**  (e.g., $o \times o$) |

In [2]:
import torch
import torch.nn as nn
import numpy as np

We start by defining some random $i \times i$ input and convolve this with a $k \times k$ kernel using a stride of $s$ and no padding (i.e., $p=0$). *In all examples, we assume the same width and height of an input* (i.e., $i$ specifies both width and height).

**Note**: **Relationship X** refers to the relationship with that number in [A guide to convolution arithmetic for deep learning](https://arxiv.org/pdf/1603.07285.pdf).

### Convolution - NO PADDING, unit strides (i.e., $s=1$)

Here's what we expect (**Relationship 1**)

$$o = (i-k)+1$$ 

In [18]:
i = 4    # width=height=i=4
s = 1    # stride=1
k = 3    # kernel size 3x3
p = 0    # NO PADDING

In [19]:
# Input
x = torch.randn(1,1,i,i)
# 2D conv. layer
cl = nn.Conv2d(1,1,k,s)

print('Output size:                    ', cl(x).size())
print('Expected (spatial) output size: ', (i-k+1))


Output size:                     torch.Size([1, 1, 2, 2])
Expected (spatial) output size:  2


### Convolution -  Zero padding, unit strides

Next, we check what happens in case of padding with zeros, in particular, padding by $1$ pixel on each side. We expect (**Relationship 2**)

$$ o = (i-k)+2p+1$$

This is, in fact, easy to see as we are effectively working with inputs of size $i+2p$; this relationship obviously holds for $p>1$ as well.

In [25]:
i = 32   # width=height=i=32
s = 1    # stride=1
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side (i.e., bottom,top,right,left)

x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,padding=p)

print('Output size:                    ', cl(x).size())
print('Expected (spatial) output size: ', (i-k)+2*p+1)

Output size:                     torch.Size([1, 1, 32, 32])
Expected (spatial) output size:  32


### Convolution - NO PADDING, non-unit strides

We expect (**Relationship 5**)

$$ o = \lfloor (i-k)/s \rfloor +1$$

In [22]:
i = 5    # width=height=i=5
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 0    # NO PADDING

x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,padding=p)

print('Output size:                    ', cl(x).size())
print('Expected (spatial) output size: ', int(np.floor((i-k)/s))+1)

Output size:                     torch.Size([1, 1, 2, 2])
Expected (spatial) output size:  2


### Convolution - Zero padding, non-unit strides

We expect (**Relationship 6**), as we use inputs of effective size $i+2p$ (see **Relationship 2** from above), 

$$ o = \lfloor (i+2p-k)/s \rfloor +1$$

In [6]:
i = 5    # width=height=i=5
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side

x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,padding=p)

print('Output size:                    ', cl(x).size())
print('Expected (spatial) output size: ', int(np.floor((i+2*p-k)/s))+1)

Output size:                     torch.Size([1, 1, 3, 3])
Expected (spatial) output size:  3
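
Relationships 1, 2 and 5 are all special cases of Relationship 6 (set $p=0$ and/or $s=1$). As a minimal sketch, the formula can be wrapped in a small function and compared against `nn.Conv2d` for the configurations used so far. The helper name `conv_output_size` is an assumption made for illustration; it is not part of PyTorch.

```python
import torch
import torch.nn as nn


def conv_output_size(i, k, s=1, p=0):
    """Relationship 6: o = floor((i + 2p - k) / s) + 1 (hypothetical helper)."""
    return (i + 2 * p - k) // s + 1


# (i, k, s, p) configurations taken from the cells above
configs = [(4, 3, 1, 0), (32, 3, 1, 1), (5, 3, 2, 0), (5, 3, 2, 1), (6, 3, 2, 1)]

results = []
for i, k, s, p in configs:
    x = torch.randn(1, 1, i, i)
    o_torch = nn.Conv2d(1, 1, k, s, padding=p)(x).shape[-1]
    o_formula = conv_output_size(i, k, s, p)
    results.append((o_torch, o_formula))
    print(f"i={i}, k={k}, s={s}, p={p}: PyTorch {o_torch}, formula {o_formula}")
```

Integer floor division (`//`) replaces `np.floor` here, which keeps the result an `int` without any casting.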


Next, we look at an example where the same convolution produces the same output size for two different input sizes, which can only happen when $s>1$. This leads to an ambiguity when considering the corresponding operation in the opposite direction (i.e., the *transposed convolution* below).

In [7]:
i = 5    # INPUT0: width=height=i=5
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side

x = torch.randn(1,1,i,i)
cl0 = nn.Conv2d(1,1,k,s,padding=p)
print(cl0(x).size())

i = 6    # INPUT1: width=height=i=6
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side

x = torch.randn(1,1,i,i)
cl1 = nn.Conv2d(1,1,k,s,padding=p)
print(cl1(x).size())

torch.Size([1, 1, 3, 3])
torch.Size([1, 1, 3, 3])
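
The ambiguity can be made explicit by inverting Relationship 6: because of the floor, up to $s$ consecutive input sizes share one output size. The following sketch lists them; both function names are hypothetical helpers introduced only for this illustration.

```python
def conv_output_size(i, k, s=1, p=0):
    """Relationship 6: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


def input_sizes_for_output(o, k, s=1, p=0):
    """All input sizes i that a (k, s, p) convolution maps to output size o."""
    smallest = (o - 1) * s + k - 2 * p
    return [i for i in range(smallest, smallest + s)
            if conv_output_size(i, k, s, p) == o]


sizes_stride2 = input_sizes_for_output(3, k=3, s=2, p=1)
sizes_stride1 = input_sizes_for_output(3, k=3, s=1, p=1)

print(sizes_stride2)  # the i=5 and i=6 case from the cell above
print(sizes_stride1)  # unit stride: exactly one input size
```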


# Transposed (/Fractionally-strided) convolution

To understand *transposed convolutions*, it is a good exercise to study the standard convolution operation in greater detail, in particular as a linear operation that takes the vectorized data as input and multiplies it with a suitably designed linear operator (here $W$, called `C` in the code below).

In the following example, we set the input and weight of the 2D convolution (stride of 1, no padding, $3 \times 3$ kernel) by hand to develop a little bit of intuition of what is happening.

In [8]:
# Input data, i=4
x = torch.tensor([
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12],
    [13,14,15,16]], dtype=torch.float32)

# Weights for our kxk, k=3, convolution kernel
w = torch.tensor([
    [1,2,3],
    [4,5,6],
    [7,8,9]], dtype=torch.float32)

# Add dimensions for PyTorch compatibility
w = w.unsqueeze(0).unsqueeze(0)
x = x.unsqueeze(0).unsqueeze(0)

s = 1  # stride = 1
k = 3  # kernel = 3x3

cl4 = nn.Conv2d(1,1,k,s,bias=False)
cl4.weight = nn.Parameter(w)
out = cl4(x)
print('Output of the 2D conv. operation:\n', out)

Output of the 2D conv. operation:
 tensor([[[[348., 393.],
          [528., 573.]]]], grad_fn=<MkldnnConvolutionBackward>)


In [27]:
# Let's perform the same operation by hand, see Sec. 4.1 in
# https://arxiv.org/pdf/1603.07285.pdf

# This is our linear operator (W in the text), stored as C with shape (16x4)
C = np.array([
    [1,2,3,0,4,5,6,0,7,8,9,0,0,0,0,0],
    [0,1,2,3,0,4,5,6,0,7,8,9,0,0,0,0],
    [0,0,0,0,1,2,3,0,4,5,6,0,7,8,9,0],
    [0,0,0,0,0,1,2,3,0,4,5,6,0,7,8,9],
]).T

# Our 4x4 input (will be vectorized to (1,16))
X = np.array([
    [1,2,3,4],
    [5,6,7,8,],
    [9,10,11,12],
    [13,14,15,16]])

# We compute y = x' * C, i.e., (1,16) x (16,4) = (1,4) -> reshape to 2x2
print('Convolution operation (by hand):\n', 
      np.dot(X.reshape((1,16)),C).reshape(2,2))

Convolution operation (by hand):
 [[348 393]
 [528 573]]
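
Writing the matrix out by hand only scales to tiny examples. A minimal sketch of how the same matrix could be generated for any input size $i$ and kernel size $k$ (stride 1, no padding) is given below; `conv_matrix` is a hypothetical helper, not a NumPy or PyTorch function.

```python
import numpy as np


def conv_matrix(w, i):
    """Build the (i*i, o*o) matrix C such that x.reshape(1, -1) @ C equals a
    stride-1, unpadded 2D cross-correlation of an i x i input with kernel w."""
    k = w.shape[0]
    o = i - k + 1
    C = np.zeros((i * i, o * o), dtype=w.dtype)
    for r in range(o):              # output row
        for c in range(o):          # output column
            for u in range(k):      # kernel row
                for v in range(k):  # kernel column
                    # input pixel (r+u, c+v) contributes w[u, v] to output (r, c)
                    C[(r + u) * i + (c + v), r * o + c] = w[u, v]
    return C


w = np.arange(1, 10).reshape(3, 3)   # same kernel as above
X = np.arange(1, 17).reshape(4, 4)   # same input as above
C = conv_matrix(w, i=4)
y = (X.reshape(1, 16) @ C).reshape(2, 2)

print(C.T)   # rows match the four rows written out by hand
print(y)
```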


### More details

Importantly, during backpropagation (through the convolution layer), the error signal (computed by the loss) is simply multiplied by $W^\top$ (i.e., the transpose of the matrix $W$), as the partial derivative of $x^\top W$ w.r.t. the input $x$ is $W^\top$. With the 2D convolution from above, we have basically computed

$$ y = x^\top W $$

resulting in a $1 \times 4$ output $y$ which we reshaped into a $2 \times 2$ matrix. Going the opposite 
direction (i.e., trying to reverse the operation), we see that 

$$ y W^\top$$

gives a $1 \times 16$ output that can be reshaped into a $4 \times 4$ matrix, i.e., our *original input shape* (only the shape is recovered, not the original values). This effectively explains the name *transposed* convolution. Let's look at some more interesting cases next.
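
The backpropagation argument can be checked numerically. The sketch below assumes a single input and output channel and uses a random tensor `g` as a stand-in for the error signal; it compares the gradient from autograd, the product with the transposed matrix, and `F.conv_transpose2d` with the same kernel.

```python
import torch
import torch.nn.functional as F

k, i = 3, 4
o = i - k + 1
w = torch.arange(1., 10.).reshape(1, 1, k, k)
x = torch.arange(1., 17.).reshape(1, 1, i, i).requires_grad_()

# Matrix form of the convolution (same layout as the hand-written matrix above)
C = torch.zeros(i * i, o * o)
for r in range(o):
    for c in range(o):
        for u in range(k):
            for v in range(k):
                C[(r + u) * i + (c + v), r * o + c] = w[0, 0, u, v]

y = F.conv2d(x, w)        # forward pass, shape (1, 1, 2, 2)
g = torch.randn_like(y)   # stand-in for the error signal dL/dy

# Three ways to send g back to the input
grad_autograd, = torch.autograd.grad(y, x, grad_outputs=g)
grad_matrix = (g.reshape(1, -1) @ C.T).reshape(1, 1, i, i)
grad_transposed = F.conv_transpose2d(g, w)

print(torch.allclose(grad_autograd, grad_matrix))
print(torch.allclose(grad_autograd, grad_transposed))
```

All three results should agree up to floating-point error, which is the sense in which the transposed convolution is the backward pass of the convolution.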

### Transposed Convolution - NO PADDING, unit strides

In [10]:
i = 4    # width=height=i=4
s = 1    # stride=1
k = 3    # kernel size 3x3
p = 0    # NO PADDING

x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,p)
out = cl(x)
print(out.size())

# We could implement a transposed convolution in that case by
# simply padding the input by 2 zeros on each side, using a 3x3
# kernel and a stride of 1 to get a 4x4 output, as
#
# o = (i-k)+2p+1 = (2-3)+2*2+1 = 4
tcl = nn.Conv2d(1,1,k,s,2)
print(tcl(out).size())

torch.Size([1, 1, 2, 2])
torch.Size([1, 1, 4, 4])


Let's implement the same using PyTorch's `nn.ConvTranspose2d` layer.

In [11]:
i = 4    # width=height=i=4
s = 1    # stride=1
k = 3    # kernel size 3x3
p = 0    # NO PADDING

tcl = nn.ConvTranspose2d(1,1,k,s,p)
print(tcl(out).size())

torch.Size([1, 1, 4, 4])
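
The two layers above are initialized with independent random weights, so only their output *shapes* agree. As a sketch under the assumption of a single channel and one shared kernel, the values also agree once the kernel of the padded convolution is flipped along both spatial axes (PyTorch's `conv2d` computes a cross-correlation):

```python
import torch
import torch.nn.functional as F

k = 3
y = torch.randn(1, 1, 2, 2)   # plays the role of `out` above
w = torch.randn(1, 1, k, k)   # one kernel shared by both operations

# Transposed convolution, stride 1, no padding
t = F.conv_transpose2d(y, w)

# Ordinary convolution on the input padded by k-1 zeros per side,
# using the spatially flipped kernel
c = F.conv2d(y, torch.flip(w, dims=(2, 3)), padding=k - 1)

print(t.shape, c.shape)
print(torch.allclose(t, c, atol=1e-6))
```

With more than one channel, the input and output channel axes of the weight tensor would have to be swapped as well; that case is not covered by this sketch.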


**Remark**: Relationship 8 in Sec. 4.3 would indicate setting `p=k-1` (so `p=3-1=2`). However, according to the PyTorch documentation, the `padding` argument of `nn.ConvTranspose2d` results in `kernel_size - 1 - padding` zeros of implicit padding on each side of the input (for `dilation=1`), so setting `p=0` is the appropriate choice here.
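
With this convention, the output size of `nn.ConvTranspose2d` for `dilation=1` is $o' = (i'-1)s - 2p + k + a$, where $i'$ is the input size of the transposed convolution and $a$ is the `output_padding` argument. The sketch below checks this against PyTorch for a few configurations; `conv_transpose_output_size` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def conv_transpose_output_size(i, k, s=1, p=0, a=0):
    """o' = (i' - 1) * s - 2p + k + a for dilation=1 (a is output_padding)."""
    return (i - 1) * s - 2 * p + k + a


# (input size, k, s, p, output_padding)
configs = [(2, 3, 1, 0, 0), (4, 3, 1, 1, 0), (2, 3, 2, 0, 0),
           (5, 3, 2, 1, 0), (5, 3, 2, 1, 1)]

checks = []
for i, k, s, p, a in configs:
    layer = nn.ConvTranspose2d(1, 1, k, s, p, output_padding=a)
    o_torch = layer(torch.randn(1, 1, i, i)).shape[-1]
    o_formula = conv_transpose_output_size(i, k, s, p, a)
    checks.append(o_torch == o_formula)
    print(f"i'={i}, k={k}, s={s}, p={p}, a={a}: PyTorch {o_torch}, formula {o_formula}")
```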

### Transposed Convolution - Zero padding, unit strides

So, the transposed convolution corresponding to a convolution with no padding and unit strides relies on zero padding the input by $k-1$ points. Similarly, if padding was used in the original convolution, this would imply a transposed convolution with *less* padding.


In [12]:
i = 4    # width=height=i=4
s = 1    # stride=1
k = 3    # kernel size 3x3
p = 1    # pad by 1 zero pixel on each side

x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,p)
out = cl(x)
print(out.size())

tcl = nn.ConvTranspose2d(1,1,k,s,p) # here we now have k-1-p with p=1, effectively
print(tcl(out).size())

torch.Size([1, 1, 4, 4])
torch.Size([1, 1, 4, 4])


### Transposed Convolution - NO PADDING, non-unit strides

Here, we assume that $i-k$ is a multiple of $s$; in other words, the valid convolution (operating at stride $s$) covers all input pixels. Otherwise, multiple input shapes are mapped to the same output shape. Just consider $i \in \{5,6\}, k=3, s=2$: inputs of size $5 \times 5$ and $6 \times 6$ are both mapped to a $2 \times 2$ output (we have seen the same effect earlier, with $p=1$). We resolve this ambiguity later ...

In [29]:
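i = 6    # width=height=i=6
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 0    # NO PADDING

# Let's check whether i-k is a multiple of s:
# i=5: 5-3 = 2 -> OK
# i=7: 7-3 = 4 -> OK
# i=6: 6-3 = 3, and 3 mod 2 = 1 -> NOT a multiple of s, so the
#      transposed convolution below does not recover the 6x6 input shape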

# 2D CONV
x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,p)
out = cl(x)
print(out.size())

# corresponding 2D CONV TRANSP.
tcl = nn.ConvTranspose2d(1,1,k,s,p)
print(tcl(out).size())

torch.Size([1, 1, 2, 2])
torch.Size([1, 1, 5, 5])
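
The alternative name *fractionally-strided* convolution comes from an equivalent view: insert $s-1$ zeros between neighbouring input elements and then run a unit-stride convolution. The following is a sketch of that equivalence, assuming a single channel, no padding in the original convolution, and one shared kernel:

```python
import torch
import torch.nn.functional as F

k, s = 3, 2
y = torch.randn(1, 1, 2, 2)   # plays the role of `out` above
w = torch.randn(1, 1, k, k)   # one kernel shared by both operations

# Transposed convolution with stride s
t = F.conv_transpose2d(y, w, stride=s)

# Step 1: insert s-1 zeros between neighbouring input elements
n = y.shape[-1]
z = torch.zeros(1, 1, (n - 1) * s + 1, (n - 1) * s + 1)
z[..., ::s, ::s] = y

# Step 2: unit-stride convolution with k-1 zeros of padding per side
# and the spatially flipped kernel
c = F.conv2d(z, torch.flip(w, dims=(2, 3)), padding=k - 1)

print(t.shape, c.shape)
print(torch.allclose(t, c, atol=1e-6))
```

Both routes produce a $5 \times 5$ result here, matching the shape printed above.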


### Transposed Convolution - Zero padding, non-unit strides

Here, we assume that $i+2p-k$ is a multiple of $s$ (otherwise, we have the same problem as mentioned above).

In [29]:
i = 9    # width=height=i=9
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side

# for i = 9,11,..., i+2*p-k is a multiple of s
# 9  + 2 - 3  = 8
# 11 + 2 - 3  = 10

# 2D CONV
x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,p)
out = cl(x)
print(out.size())

# corresponding 2D CONV TRANSP.
tcl = nn.ConvTranspose2d(1,1,k,s,p)
print(tcl(out).size())

torch.Size([1, 1, 5, 5])
torch.Size([1, 1, 9, 9])


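To resolve the ambiguity mentioned above, we can use the `output_padding` ($a$) parameter. This parameter, computed as 

$$ a = (i+2p-k) \bmod s$$

resolves the ambiguity: in the formulation of the referenced guide, $a$ zeros are added to the bottom and right edges of the input of the transposed convolution; in PyTorch, `output_padding` adds $a$ to one side of each spatial dimension of the output shape.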

In [31]:
i = 10   # width=height=i=10
s = 2    # stride=2
k = 3    # kernel size 3x3
p = 1    # pad by one zero pixel on each side

# i + 2*p - k = 10 + 2 - 3 = 9 is NOT a multiple of s

# 2D CONV
x = torch.randn(1,1,i,i)
cl = nn.Conv2d(1,1,k,s,p)
out = cl(x)
print(out.size())

# Compute a
a = np.mod(i+2*p-k,s)
print('Output padding a={}'.format(a))

# corresponding 2D CONV TRANSP.
tcl = nn.ConvTranspose2d(1,1,k,s,p,output_padding=a)
print(tcl(out).size())

torch.Size([1, 1, 5, 5])
Output padding a=1
torch.Size([1, 1, 10, 10])
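
To see the effect over a range of input sizes, the sketch below repeats the round trip for $i = 7, \dots, 12$ with and without `output_padding`. The range and the variable names are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

s, k, p = 2, 3, 1

recovered = []
for i in range(7, 13):
    x = torch.randn(1, 1, i, i)
    out = nn.Conv2d(1, 1, k, s, p)(x)

    a = (i + 2 * p - k) % s   # output padding from the formula above
    plain = nn.ConvTranspose2d(1, 1, k, s, p)(out).shape[-1]
    padded = nn.ConvTranspose2d(1, 1, k, s, p, output_padding=a)(out).shape[-1]

    recovered.append(padded == i)
    print(f"i={i:2d}  conv -> {out.shape[-1]}  a={a}  "
          f"transposed -> {plain} (no output_padding), {padded} (output_padding=a)")
```

Without `output_padding`, the even input sizes come back one pixel too small; with it, every input size in the range is recovered.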
