# Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a special type of neural network for making predictions over data that has a grid-like topology.

## Input
The input to a CNN is a volume of width $W_1$, height $H_1$, and depth $D_1$. The width and height refer to the resolution of the input, e.g. the number of pixels along each axis if the input is an image, while the depth refers to the number of features in each grid cell. In the case of an image, the depth is typically $3$, containing the RGB data.

### Example
The input volume below has width $W=10$, height $H=10$, and depth $D=3$.

In [1]:
import numpy as np
W, H, D = 10, 10, 3
input_volume = np.ones([W, H, D])

## Kernel
Convolutions are applied over an input volume and a kernel, $w$. In Convolutional Neural Networks, kernels are typically square, i.e. they have the same width and height. The value for the width and height of a kernel is given by the receptive field, $F$. The receptive field indicates the number of elements in width and height that the kernel covers when it is placed over the volume it is convolved with. The depth of each kernel is **always** equal to the depth of the input volume.

### Example
In the example below, the kernel `kernel` has a receptive field $F=2$ and a depth equal to the depth of `input_volume`.

In [2]:
# The kernel has width 2, height 2, and depth 3
F = 2
kernel = np.ones([F, F, D])

## Convolution
The fundamental operation of a convolutional layer is the convolution. Given an input and a kernel, the convolution produces a scalar value by applying the kernel to the input.

Although there are different types of convolutional operations, the most common one is computed by summing the elements that result from the element-wise multiplication of the input and the kernel.

### Example
Given two 2D matrices of equal size, the convolution can be computed as shown below using NumPy.

In [3]:
x = np.array([[1, 2], [2, 4]])
kernel = np.array([[1, 1], [2, 2]])
C = np.sum(x * kernel)

## Tensors

In Deep Learning, the input is a volume with width, $W$, height, $H$, and depth, $D$. While the width and height represent the grid-like topology, the depth dimension represents data over the grid. In the case of images, the depth contains each of the color channels, R, G, and B, of input images.

Because of the multidimensional nature of input volumes, Deep Learning libraries use tensors to represent them.

### Example

Below, `x` is extended to an input volume with $D=3$. The depth of the kernel `kernel` should match the depth of the volume `x`.

In [4]:
x = np.array([ [[1, 2], [2, 4]], [[1, 2], [2, 4]], [[1, 2], [2, 4]] ])
kernel = np.array([ [[1, 1], [2, 2]], [[1, 1], [2, 2]], [[1, 1], [2, 2]] ])
C = np.sum(x * kernel)

The same operation can be performed with `numpy.einsum`.

In [5]:
C2 = np.einsum('ijk,ijk->', x, kernel)
assert C == C2

In the example above, the first parameter to the `numpy.einsum` function encodes the sum and the element-wise multiplication. The first _ijk_ term refers to the dimensions of the tensor `x`: _i_ is the depth, _j_ is the height, and _k_ is the width. The second _ijk_ term refers to the depth, height, and width of the tensor `kernel`. By using the same characters for both tensors, we tell `numpy.einsum` to multiply them element-wise. By not adding any character after _->_, we tell `numpy.einsum` to sum all the elements that result from the element-wise multiplication and return a scalar value.
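As a further illustration of the subscript notation, the sketch below keeps the depth index _i_ after _->_, so that the sum runs over height and width only and one value per depth slice is returned. It is not part of the original cells; the names `x_demo`, `kernel_demo`, `per_channel`, and `total` are made up for this illustration.

```python
import numpy as np

# Hypothetical demo tensors with layout (depth, height, width)
x_demo = np.array([[[1, 2], [2, 4]]] * 3)
kernel_demo = np.array([[[1, 1], [2, 2]]] * 3)

# Keeping 'i' on the output side sums over j (height) and k (width) only
per_channel = np.einsum('ijk,ijk->i', x_demo, kernel_demo)

# Dropping every index sums over all three axes and gives a scalar
total = np.einsum('ijk,ijk->', x_demo, kernel_demo)

assert per_channel.shape == (3,)
assert per_channel.sum() == total
```

Summing the per-depth values recovers the scalar obtained when no index is kept.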

## Simplification

From now on, for simplification we regard the width, $W$, of the input volume to be the same as its height, $H$, and refer to both of them as width, $W$.

## Convolutional layer

The output of a convolutional layer is not a scalar value but an output volume produced by sliding and convolving kernels over the input. The width of the output volume that is produced from convolving a kernel with receptive field, $F$, with the input volume with width, $W$, is $W - F + 1$.

In Convolutional Neural Networks, while the input has large values for $W$ and $H$, the receptive field, $F$, of the kernels is comparatively small in size.

### Example I

Provided the input volume, `x`, and the kernel, `kernel`, the output of the convolutional layer can be computed using NumPy as shown below. 

In [6]:
W, D, F = 10, 2, 2
x = np.random.rand(D, W, W)
kernel = np.random.rand(D, F, F)

W_out = W - F  + 1
x_out = np.ones((1, W_out, W_out))
for row in range(W_out):
    for col in range(W_out):
        x_out[0, row, col] = np.einsum('ijk,ijk->', x[:,row:row+F,col:col+F], kernel)

### Example II
A convolutional layer convolves the volume over a collection of $K$ filters, where each filter has the same receptive field, $F$. The depth of each filter is the same depth, $D$, as the input volume. The example below is a demonstration of a convolutional layer with $K=2$.

In [7]:
W, D, F, K = 10, 2, 2, 2
x = np.random.rand(D, W, W)
kernels = np.random.rand(K, D, F, F)

W_out = W - F  + 1
x_out = np.ones((K, W_out, W_out))
for row in range(W_out):
    for col in range(W_out):
        x_out[:, row, col] = np.einsum('ijk,hijk->h', x[:,row:row+F,col:col+F], kernels)

## Zero-padding

Sliding a kernel over an input volume to produce an output has the effects listed below.
- Stacking convolutional layers leads to volumes that become smaller and smaller.
- The closer an element of the input volume is to the borders, the fewer elements of the output volume it influences.
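The second effect can be made visible by counting, for every input element, how many kernel placements cover it. The sketch below assumes a small input with $W=5$, a kernel with $F=3$, stride 1, and no padding; the array name `coverage` is chosen only for this illustration.

```python
import numpy as np

W, F = 5, 3          # small illustrative sizes (assumed)
W_out = W - F + 1

# coverage[r, c] counts how many output elements input element (r, c) contributes to
coverage = np.zeros((W, W), dtype=int)
for row in range(W_out):
    for col in range(W_out):
        coverage[row:row+F, col:col+F] += 1

print(coverage)
```

In this setting a corner element contributes to a single output element, whereas the central element contributes to nine.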

A strategy to gain control over these effects is to enlarge the input volume by adding zeros around its borders. This technique is called zero-padding. Applying zero-padding before the convolution can result in an output volume that preserves the original size of the input volume.

The amount of applied zero-padding is represented by $P$. Zero-padding with $P$ increases the width, $W$, of the input volume by $2P$. Sliding the kernel over the zero-padded input volume results in an output volume of width $W + 2P - F + 1$.


### Example

Consider an input volume, `x`, with $W=10$ and $D=1$, and a kernel with receptive field $F=5$. Zero-padding the input volume with $P=2$ results in an output volume of the same size as the input volume.


In [8]:
W, D, F = 10, 1, 5
x = np.random.rand(D, W, W)
kernel = np.random.rand(D, F, F)

P = 2
W_out = W - F + 2*P + 1

x_out = np.zeros((1, W_out, W_out))  # a single kernel produces an output of depth 1
assert x_out.shape == x.shape
x = np.pad(x,((0,0),(P,P),(P,P)))

for row in range(W_out):
    for col in range(W_out):
        x_out[:, row, col] = np.einsum('ijk,ijk->',x[:,row:row+F,col:col+F], kernel)

## Stride
The previous convolutional layer demonstrations apply the kernel over all valid locations in the input volume, i.e. over all locations where the kernel lies entirely inside the (possibly zero-padded) input volume.

It is possible, however, to systematically apply the kernel over a subset of all valid locations by moving it more than one cell at a time. The stride, $S$, defines the number of cells the kernel moves per slide, so $S=1$ visits every valid location and $S>1$ skips $S-1$ locations between consecutive placements.

Given an input volume with width, $W$, and a kernel with receptive field, $F$, convolving the input volume with the kernel using zero-padding, $P$, and stride, $S$, results in an output volume of width $\frac{W - F + 2P}{S}+1$.

Selecting the values for the receptive field, $F$, zero-padding, $P$, and stride, $S$, is tricky, as not all combinations fit over the input volume. Only combinations for which $\frac{W - F + 2P}{S}+1$ is a natural number are valid.
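The validity condition can be checked mechanically. The helper below is a minimal sketch: the function name `conv_output_width` and its error message are assumptions made for this illustration, and it only evaluates the output-width formula given above.

```python
def conv_output_width(W, F, P=0, S=1):
    """Return (W - F + 2P) / S + 1, or raise if the result is not a natural number."""
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError(f"W={W}, F={F}, P={P}, S={S} is not a valid combination")
    return span // S + 1

print(conv_output_width(10, 2))            # no padding, stride 1
print(conv_output_width(10, 5, P=2))       # zero-padding preserves the width
print(conv_output_width(10, 4, P=2, S=2))  # strided convolution
```

Raising an error instead of rounding makes an ill-fitting combination visible immediately, rather than silently ignoring part of the input.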

Striding is a technique to downsample the input volume. The larger the value of $S$, the more aggressive the downsampling. Commonly, $S=2$ or $S=3$.

### Example

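Given an input volume with width $W=10$, the cell below uses stride $S=2$, receptive field $F=5$, and zero-padding $P=2$. For these values $\frac{W - F + 2P}{S}+1 = 5.5$ is not a natural number, so the combination is _not valid_: the code rounds the output width down to $5$, and the kernel never covers the last padded row and column.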

In [9]:
D, W, F = 3, 10, 5
x = np.random.rand(D, W, W)
kernel = np.random.rand(D, F, F)
S = 2
P = 2

# Floor division keeps only the positions where the kernel fits entirely
W_out = (W - F + 2*P) // S + 1
x = np.pad(x, ((0, 0), (P, P), (P, P)))
# A single kernel produces an output volume of depth 1
x_out = np.zeros((1, W_out, W_out))

for row_out in range(W_out):
    for col_out in range(W_out):
        row_in, col_in = row_out * S, col_out * S  # the kernel moves S cells per step
        x_out[0, row_out, col_out] = np.einsum('ijk,ijk->', x[:, row_in:row_in+F, col_in:col_in+F], kernel)

## Kernels as shared parameters
So far convolutions and kernels have been introduced without explaining their meaning, purpose, and origin.

### Origin
The origin of convolutions can be traced back to filtering techniques for image processing. Given an input image $x$, filters modify the original image to extract feature information that is valuable for further image processing stages. A common approach to image filtering is to take a window of adjacent cells around each pixel and feed it to a function that produces a scalar value for that pixel. If the size of the window is given by the receptive field, $F$, and the selected pixels from that window are passed through a function that is a convolution, then we have something that resembles the operation previously described in the context of Deep Learning.

### Purpose
The objective of convolving an input volume with a kernel using a sliding window is to extract information, i.e. features, that could be relevant for further processing stages. While in traditional image processing techniques kernels are hand-engineered, the intention of Convolutional Neural Networks is to learn those kernels. The values of a kernel are thus the weights to learn during gradient descent.

### Meaning of kernels as neurons
Consider an input, $x$, and a kernel, $w$, that have the same width, height, and depth. If the convolution, $z = x * w$, is given by the sum of the element-wise multiplication of $x$ and $w$, then $z$ is the weighted sum of the elements of $x$, with the weights given by $w$. This operation is identical to the weighted sum computed between the output of an input layer and the weights of a neuron in a hidden or output layer of a Feed Forward Neural Network. A kernel can then be regarded as a neuron, and the number of kernels, $K$, in a Convolutional Layer is the number of neurons in that layer.
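The equivalence with a neuron's weighted sum can be checked numerically. The sketch below uses random values and illustrative sizes ($D=3$, $F=2$); the names `patch`, `z_conv`, and `z_neuron` are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 3, 2
patch = rng.random((D, F, F))  # input x, same shape as the kernel
w = rng.random((D, F, F))      # kernel

# Convolution as defined above: sum of the element-wise product
z_conv = np.sum(patch * w)

# Weighted sum of a neuron (without bias) over the flattened input
z_neuron = np.dot(patch.ravel(), w.ravel())

assert np.isclose(z_conv, z_neuron)
```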

### Infinite prior
In Convolutional Neural Networks, the input volume is significantly larger than the $K$ kernels in a Convolutional Layer. Given that kernels contain the weights to be learned, if they are slid across the input, each convolution between a region of the input volume and a kernel can be considered an operation between an input layer and a neuron. If the whole input volume is regarded as the input layer, the weights of the neuron for the parts of the input volume that lie outside the receptive field, $F$, of the kernel are considered to be always zero.

Ian Goodfellow refers to this as the infinitely strong prior.
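This view can be illustrated by building the equivalent fully connected neuron explicitly. The sketch below assumes a single kernel placed at an arbitrary location (`row`, `col`) of a small input; the name `dense_w` and all sizes are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, F = 1, 6, 3
x = rng.random((D, W, W))
kernel = rng.random((D, F, F))

row, col = 2, 1  # arbitrary location of the kernel over the input (assumed)

# Convolution of the kernel with the patch underneath it
z_conv = np.sum(x[:, row:row+F, col:col+F] * kernel)

# Equivalent fully connected neuron: one weight per input element,
# zero everywhere except inside the receptive field
dense_w = np.zeros_like(x)
dense_w[:, row:row+F, col:col+F] = kernel
z_dense = np.dot(x.ravel(), dense_w.ravel())

assert np.isclose(z_conv, z_dense)
```

Of the $D \cdot W \cdot W = 36$ weights of the dense neuron, only the $D \cdot F \cdot F = 9$ inside the receptive field can be non-zero.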

### Shared weights
A Convolutional Layer applies the same kernel across the whole input volume, at the locations determined by the stride, $S$, the zero-padding, $P$, and the receptive field, $F$. Since each of these convolutions between a section of the input volume and a kernel is a scalar value containing a weighted sum, we can say either that the same neuron, i.e. the kernel, is being applied to different input patches, or that we are using a different neuron per patch and that all of these neurons share the same weights. This is why several authors state that kernels contain shared weights to be learned during training, i.e. instead of assigning a unique weight to each _input_ (_input_ here is a section of the whole input volume), weights are shared across the _inputs_.
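A rough count shows what sharing saves. The sketch below compares the number of weights of a layer with shared kernels against a hypothetical layer that learns a separate kernel for every patch; the sizes and the variable names `shared` and `unshared` are assumptions for this illustration, with stride 1 and no zero-padding.

```python
W, D, F, K = 10, 3, 2, 4  # illustrative sizes (assumed)
W_out = W - F + 1         # stride 1, no zero-padding

# Shared weights: each of the K kernels is reused at every location
shared = K * D * F * F

# Unshared alternative: a separate kernel for each of the W_out * W_out patches
unshared = K * D * F * F * W_out * W_out

print(shared, unshared)
```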

## Bias

If each kernel is regarded as a neuron, then each neuron is missing the bias parameter. By adding a bias parameter after the convolution, we can modify our interpretation of a kernel, not as a neuron, but as the weights of a neuron. Biases are shared in the same way as weights.

### Example
Applying $K$ kernels at one location of the input volume results in a vector of size $K$, the depth of the output volume. For each element of that vector, i.e. for each of the $K$ kernels, there is one bias.

In [10]:
W, D, F, K = 10, 3, 2, 4
x = np.random.rand(D, W, W)
kernels = np.random.rand(K, D, F, F)
biases = np.random.rand(K)

W_out = W-F+1
z = np.zeros((K, W_out, W_out))

for row in range(W_out):
    for col in range(W_out):
        z[:,row, col] = np.einsum('ijk,fijk->f', x[:,row:row+F, col:col+F], kernels) + biases

## Applying a non-linearity

Just as in the Multi Layer Perceptron, the weighted sum plus the bias is passed through a non-linear function.

### Example
Below, ReLU is applied to the output volume `z`.

In [11]:
out = z * (z>0)

Returning to the stride example in cell 9: if we use $F=4$ instead of $F=5$, then $\frac{W - F + 2P}{S}+1 = 6$ and the combination of $W$, $F$, $P$, and $S$ becomes valid.

## Lessons learned
Kernels are typically square, i.e. they have the same width and height. The value for the width and height is given by the receptive field, $F$. **The depth of each kernel is equal to the depth of the input volume.**

### NumPy
Given two 2D matrices, $A$ and $B$, `numpy.dot(A, B)`, `numpy.matmul(A, B)`, and `A @ B` produce the same output, the matrix product of $A$ and $B$.


In [12]:
A = np.array([[1, 2],[3, 4]])
B = np.array([[1, 2],[3, 8]])
C1 = np.dot(A, B)
C2 = np.matmul(A, B)
C3 = A @ B
assert np.all(C1==C2)
assert np.all(C1==C3)

`numpy.einsum` is a powerful function that can be applied to convolve volumes and kernels.
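As one possible illustration, the explicit loops over output locations can be replaced by a single `numpy.einsum` call over all sliding windows. The sketch below assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available; the names `windows`, `z_loop`, and `z_vec` are chosen only for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
W, D, F, K = 10, 3, 2, 4
x = rng.random((D, W, W))
kernels = rng.random((K, D, F, F))
W_out = W - F + 1

# Reference: explicit loops, one einsum per output location
z_loop = np.zeros((K, W_out, W_out))
for row in range(W_out):
    for col in range(W_out):
        z_loop[:, row, col] = np.einsum('ijk,fijk->f', x[:, row:row+F, col:col+F], kernels)

# All F x F windows at once: shape (D, W_out, W_out, F, F)
windows = sliding_window_view(x, (F, F), axis=(1, 2))

# d: depth, r/c: output row/column, i/j: position inside the window, f: kernel
z_vec = np.einsum('drcij,fdij->frc', windows, kernels)

assert np.allclose(z_loop, z_vec)
```

The loop version is easier to read, while the windowed version avoids Python-level iteration over the output locations.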