# **Notes on CNNs**

_(to be updated with code later)_

This guide uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow.

```python
# TensorFlow and tf.keras
import tensorflow as tf

# Helper libraries
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

# print(tf.__version__)
```

___
**Convolutional networks** are neural networks that use convolution in place of general matrix multiplication in at least one layer. This operation is usually followed by another, called **pooling**.

Here is the example they use to introduce convolution: we are tracking the position of a spaceship with a sensor. The output of the sensor is denoted $x(t)$, which we interpret as giving the position at time $t$. 

Suppose that the sensor is noisy -- one way to get a better estimate of the position of the ship is to average together several recent measurements. 

Let $w$ denote a probability density function over the age of a measurement, so that $w(a)$ is the weight given to a measurement of age $a$. In the integral below, the measurement taken at time $a$ has age $t - a$. Averaging the measurements $x$ against $w$ gives us a smoother measurement denoted $s \equiv s(t)$, defined by

$$
s(t) := \int x(a) w(t-a) da 
$$

That is, $s(t) = (x * w)(t)$. They mention that $w$ should be supported on the non-negative real axis (it should vanish for negative ages) to avoid looking into the future. In the integral, this enforces the constraint $a \leq t$.


___
In the above example, $x$ is usually referred to as the **input** and $w$ is called the **kernel**. They remark that, in their example, it might be more realistic to assume that the sensor provides measurements at regularly spaced times, in which case the kernel $w$ is interpreted as a probability (mass) function, i.e. $\sum_{k \in \mathbb{Z}} w(k) = 1$.

Thus, if we assume $x$ and $w$ are only defined at integer values of $t$, the discrete convolution is defined as
$$
s(t) = (x * w) (t) := \sum_{ a \in \mathbb{Z} } x(a) w(t-a) 
$$
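
The discrete sum above can be checked numerically. The following is a minimal NumPy sketch, assuming a signal of finite length and a kernel supported on ages $0, 1, 2$; the function name `discrete_convolution` and the sample values are made up for illustration.

```python
import numpy as np

def discrete_convolution(x, w):
    """Compute s(t) = sum_a x(a) w(t - a) for finitely supported x and w."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):  # w is zero outside its support
                s[t] += x[a] * w[t - a]
    return s

# Hypothetical noisy position readings at integer times.
x = np.array([0.0, 1.2, 1.9, 3.1, 3.9, 5.2])
# Kernel over measurement ages 0, 1, 2: non-negative weights summing to 1.
w = np.array([0.5, 0.3, 0.2])

s = discrete_convolution(x, w)

# NumPy's built-in "full" convolution computes the same sum.
assert np.allclose(s, np.convolve(x, w))
```

Because `w` vanishes for negative ages, `s[t]` only depends on `x[a]` with `a <= t`, which is the "no looking into the future" constraint.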


___
In ML applications, the input is usually a multidimensional array of data, which we refer to as a **tensor**. Because these arrays are stored on a computer, the kernel must have finite support, so the infinite sums above become finite sums. Importantly, if our input is a tensor, the convolution operation is multi-dimensional.

For example, if the input is an image stored as a matrix, a CNN extracts the "features" of the image by examining it locally, one kernel-sized patch at a time.


___
If $(i,j)$ are the coordinates of a pixel in an image (say it is grayscale, so that the image is determined completely by assigning a value to each pixel), let $I$ denote this image function. We convolve it with a two-dimensional kernel $K$:
$$
S(i,j) = (I * K) (i,j) = \sum_{m,\,n} I(m,n) K(i - m, j - n) 
$$

___
Convolution is commutative, and the following equivalent expression is "more straightforward" to implement, "because there is less variation in the range of valid values of $m$ and $n$":
$$
S(i,j) = (K * I) (i,j) = \sum_{m,\,n} I(i-m,j-n) K(m,n)
$$
This also feels a lot more like a "weighted average" of the image.

___
People in ML have run with this second expression. Since there is no need to preserve the commutativity of the convolution, they drop the kernel flip and use the **cross-correlation** instead. Writing $\star$ to distinguish it from convolution, it is defined as follows
$$
S(i,j) = (I \star K) (i,j) = \sum_{m,\,n} I( i +m, j +n ) K(m,n) 
$$
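
To make the difference between convolution and cross-correlation concrete, here is a small NumPy sketch. It assumes "valid" output positions only (the kernel never hangs over the image boundary); the helper names `cross_correlate2d` and `convolve2d` are illustrative, not library functions.

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_{m, n} I(i + m, j + n) K(m, n), valid positions only."""
    kh, kw = K.shape
    out_h, out_w = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    """Convolution equals cross-correlation with the kernel flipped on both
    axes (up to a shift of the output index)."""
    return cross_correlate2d(I, K[::-1, ::-1])

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4 x 4 "image"
K = np.array([[1.0, 2.0], [3.0, 4.0]])         # asymmetric 2 x 2 kernel

corr = cross_correlate2d(I, K)
conv = convolve2d(I, K)
```

For a kernel that is symmetric under the flip the two results coincide; for an asymmetric kernel such as the one above they differ. When the kernel is learned, the flip only changes which learned weight ends up in which position.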


Discrete convolution (with finite-dimensional kernel and input) can be viewed as multiplication by a matrix.

Importantly, this matrix is constrained so that several entries are equal to other entries. For example, for single-variable convolution each row of the matrix is equal to the row above shifted by one element. Such a matrix is called a Toeplitz matrix; when the shift wraps around at the boundary, it is a special kind of Toeplitz matrix called a circulant matrix.

Suppose we want to consider the convolution of two discrete (one dimensional) signals 
$$
x = 
[
\begin{matrix}
x_0 & x_1 & x_2 & x_3 
\end{matrix}
]
$$
and
$$
w =
[
\begin{matrix}
w_0 & w_1 & w_2 & w_3 
\end{matrix}
]
$$
where $x$ is a 4-dimensional input vector and $w$ is a 4-dimensional weight vector. We can view each as a function on a four-point space, with indices taken modulo $4$, so that the convolution is circular (periodic).

To express the convolution of these functions in terms of matrix multiplication, we define the circulant matrix $W$ associated to the vector $w$ by $W_{t,a} := w_{(t-a) \bmod 4}$, that is,
$$
W
  :=
  \left[
      \begin{matrix}
        w_0 & w_3 & w_2 & w_1 \\
        w_1 & w_0 & w_3 & w_2 \\
        w_2 & w_1 & w_0 & w_3 \\
        w_3 & w_2 & w_1 & w_0
      \end{matrix}
  \right]
$$
Each row is the row above shifted one step to the right, wrapping around. With this definition, we have that
$$
(x*w) = W x 
$$
To see this, just write out the matrix multiplication. Let us use $a$ and $t$ because it is suggestive of the time-series input in the $1$-dimensional case:
$$
[Wx]_t = 
  \sum_{a = 0}^{3} 
    W_{t,a} x_a   
$$ 
which is
$$
[Wx]_t = \sum_{a = 0}^3 w_{(t - a) \bmod 4} \, x_a 
$$
This is the discrete convolution $\sum_a x(a) w(t-a)$ from before, with $t - a$ taken modulo $4$.
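
As a sanity check, here is a short NumPy sketch that builds a circulant matrix with entries $W_{t,a} = w_{(t-a) \bmod n}$ (the convention for which $Wx$ is the circular convolution $\sum_a x_a \, w_{(t-a) \bmod n}$) and compares $Wx$ with a direct computation. The helper names and the sample values are illustrative assumptions.

```python
import numpy as np

def circulant_from_kernel(w):
    """Build W with W[t, a] = w[(t - a) mod n]."""
    n = len(w)
    return np.array([[w[(t - a) % n] for a in range(n)] for t in range(n)])

def circular_convolution(x, w):
    """s[t] = sum_a x[a] * w[(t - a) mod n], computed directly."""
    n = len(x)
    return np.array([sum(x[a] * w[(t - a) % n] for a in range(n))
                     for t in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25, 0.15, 0.1])

W = circulant_from_kernel(w)

# Matrix multiplication agrees with the direct circular convolution.
assert np.allclose(W @ x, circular_convolution(x, w))

# The same result via the FFT: circular convolution is a pointwise
# product in the Fourier domain.
assert np.allclose(W @ x, np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w))))
```

Only four distinct numbers appear in the sixteen entries of `W`, which is the matrix-level picture of parameter sharing.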



___
Now we examine what convolution looks like for two-dimensional input data. This is more complicated, and requires a bit of setup: 

Let $x$ be a two-dimensional input, represented as an $n \times n$ matrix, and let $K$ denote an $m \times m$ kernel. Concretely, suppose that $x$ is $3 \times 3$ and $K$ is $2 \times 2$, and write the entries as
$$
x = 
\left(
\begin{matrix}
x_1 & x_2 & x_3 \\
x_4 & x_5 & x_6 \\
x_7 & x_8 & x_9 
\end{matrix}
\right)
$$
and 
$$
K
=
\left(
\begin{matrix}
K_1 & K_2 \\
K_3 & K_4 
\end{matrix}
\right)
$$
To encode the convolution of these two objects as a matrix multiplication, flatten $x$ row by row into a vector of length $9$ and unroll $K$ into the $4 \times 9$ matrix
$$
\left(
  \begin{matrix}
  K_1 & K_2 & 0 & K_3 & K_4 & 0 & 0 & 0 & 0 \\
  0 & K_1 & K_2 & 0 & K_3 & K_4 & 0 & 0 & 0 \\
  0 & 0 & 0 & K_1 & K_2 & 0 & K_3 & K_4 & 0 \\
  0 & 0 & 0 & 0 & K_1 & K_2 & 0 & K_3 & K_4 
  \end{matrix}
\right)
$$


___
We can apply this matrix to $x$ after it has been reshaped into a vector of length $9$. The product is a vector of length four, each entry of which is the sum of the entrywise products of the $2 \times 2$ kernel with one of the four $2 \times 2$ blocks of $x$. Note that the kernel is not flipped here, so strictly speaking this is the cross-correlation form.
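
The unrolling can be sketched in NumPy as follows. This is an illustrative construction, assuming row-major flattening of $x$ and no kernel flip; `unrolled_kernel_matrix` is a made-up helper name and the numeric values stand in for $x_1, \dots, x_9$ and $K_1, \dots, K_4$.

```python
import numpy as np

def unrolled_kernel_matrix(K, n):
    """Unroll an m x m kernel into a matrix acting on a flattened n x n input."""
    m = K.shape[0]
    out = n - m + 1                       # the output is out x out
    M = np.zeros((out * out, n * n))
    for i in range(out):
        for j in range(out):
            for p in range(m):
                for q in range(m):
                    # Output entry (i, j) reads input pixel (i + p, j + q).
                    M[i * out + j, (i + p) * n + (j + q)] = K[p, q]
    return M

x = np.arange(1.0, 10.0).reshape(3, 3)    # stands in for x_1, ..., x_9
K = np.array([[1.0, 2.0], [3.0, 4.0]])    # stands in for K_1, ..., K_4

M = unrolled_kernel_matrix(K, n=3)        # shape (4, 9)
y = M @ x.reshape(-1)                     # row-major flattening of x

# Direct computation: one entry per 2 x 2 block of x.
direct = np.array([np.sum(x[i:i + 2, j:j + 2] * K)
                   for i in range(2) for j in range(2)])
assert np.allclose(y, direct)
```

The first row of `M` is `[K_1, K_2, 0, K_3, K_4, 0, 0, 0, 0]`, matching the first row of the matrix displayed above.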








___
In addition to the repeated entries in the matrices described above, there is also sparsity: most entries are zero. The repeated entries are exactly **parameter sharing**. Sparsity and parameter sharing are two of the three main motivations for CNNs.

The third motivation is **equivariant representations**. A function $f$ is **equivariant** to a function $g$ if $f(g(x)) = g(f(x))$. The convolution "matrix" applied to the flattened input is equivariant with respect to translations. In some sense, the convolution performs a coarse-graining of the image, with block size determined by the kernel; the spacing between blocks (the stride) is another parameter to pay attention to.

Equivariance is not invariance: all it means is that if we translate an image modulo its boundary and then apply the convolution matrix, we get the same result as if we had first convolved and then translated.
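
Translation equivariance can be checked numerically. The sketch below is a one-dimensional illustration, assuming periodic boundary conditions and the cross-correlation form; the function name, kernel, and shift are arbitrary choices.

```python
import numpy as np

def circular_cross_correlate(x, w):
    """y[t] = sum_a x[(t + a) mod n] * w[a], with periodic boundary."""
    n = len(x)
    return np.array([sum(x[(t + a) % n] * w[a] for a in range(len(w)))
                     for t in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=8)              # arbitrary input signal
w = np.array([0.25, 0.5, 0.25])     # arbitrary kernel
shift = 3

# Translate, then filter ...
shifted_then_filtered = circular_cross_correlate(np.roll(x, shift), w)
# ... gives the same result as filter, then translate.
filtered_then_shifted = np.roll(circular_cross_correlate(x, w), shift)

assert np.allclose(shifted_then_filtered, filtered_then_shifted)
```

The output of the shifted input is generally not equal to the output of the original input; it is the original output, shifted. That is the distinction between equivariance and invariance.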



Convolution is not naturally equivariant to other transformations, such as changes in scale or rotations of an image. 

A convolution layer in a network typically consists of three stages: 

* applying the convolution matrix to the input to get the pre-activation
* applying an activation function to the pre-activation to get the activation (this is sometimes called the detector stage)
* applying a pooling function to modify the output of the layer further
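
The three stages map onto `tf.keras` layers roughly as in the following sketch. The input size, the number of filters, and the choice of ReLU and max-pooling are arbitrary assumptions made for illustration, not something fixed by the notes above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),        # hypothetical grayscale input
    layers.Conv2D(8, (3, 3)),                 # stage 1: convolution (pre-activation)
    layers.Activation("relu"),                # stage 2: detector stage
    layers.MaxPooling2D(pool_size=(2, 2)),    # stage 3: pooling
])

print(model.output_shape)
```

With the default `"valid"` padding, the $3 \times 3$ convolution reduces each spatial dimension from 28 to 26, and the $2 \times 2$ pooling halves it to 13. In practice the first two stages are often merged by passing `activation="relu"` to `Conv2D`.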

A **pooling function** replaces the output of the net at a given location with a summary statistic of the nearby outputs.

It seems that the pooling layer gives us another chance to coarse-grain the data. One such operation is "max-pooling", which reports the maximum value within a rectangular block of outputs.
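
Here is a minimal one-dimensional max-pooling sketch in NumPy, assuming non-overlapping blocks and a periodic shift of the input; the function name and the sample values are made up for illustration.

```python
import numpy as np

def max_pool1d(x, size):
    """Non-overlapping max-pooling: one maximum per block of `size` entries."""
    return x.reshape(-1, size).max(axis=1)

# Hypothetical detector-stage outputs: one strong response per block of three.
x = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0])

pooled = max_pool1d(x, 3)
pooled_shift1 = max_pool1d(np.roll(x, 1), 3)   # input translated by one step
pooled_shift2 = max_pool1d(np.roll(x, 2), 3)   # input translated by two steps

print(pooled, pooled_shift1, pooled_shift2)
```

In this example a shift by one step moves every response but keeps each one inside its block, so the pooled values are unchanged. A shift by two steps pushes responses across block boundaries and the pooled values change, which is why the invariance is only approximate and only holds for small translations.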

___
"In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change...

...Invariance to local translation can be a very useful property if we care more about *whether* some feature is present than exactly where it is."
___
To understand this better: "pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to."

Moreover, "pooling is essential for handling inputs of varying size": the pooling regions can be scaled with the image, for instance as a grid that matches its aspect ratio, so that every picture contributes the same number of summary statistics.

___
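One way to picture pooling for inputs of varying size is to pool over a fixed grid of regions whose size scales with the image. The sketch below assumes a $2 \times 2$ grid (one maximum per quadrant) purely for illustration; `quadrant_max_pool` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def quadrant_max_pool(image):
    """Return a 2 x 2 array holding the maximum of each image quadrant."""
    h, w = image.shape
    rows = [slice(0, h // 2), slice(h // 2, h)]
    cols = [slice(0, w // 2), slice(w // 2, w)]
    return np.array([[image[r, c].max() for c in cols] for r in rows])

small = np.arange(12.0).reshape(3, 4)     # toy 3 x 4 image
large = np.arange(70.0).reshape(7, 10)    # toy 7 x 10 image

# Both images yield the same number of summary statistics.
assert quadrant_max_pool(small).shape == (2, 2)
assert quadrant_max_pool(large).shape == (2, 2)
```

Whatever the input size (as long as each side has at least two pixels), the next layer receives four numbers.
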
# Implementation

___

## Keras API: Convolution layer types

### `Conv1D` 

A one-dimensional convolution. _"This layer creates a convolution kernel that is convolved with the layer input over a single spatial or temporal dimension to produce a tensor of outputs."_ 

### `Conv2D` 

A two-dimensional convolution, e.g. a spatial convolution over images.

### `Conv3D` 

A spatial convolution over volumes.

### `SeparableConv1D`

_"Depthwise separable 1D convolution. This layer performs a depthwise convolution that acts separately on channels, followed by a pointwise convolution that mixes channels."_

### `SeparableConv2D` 

_"Depthwise separable 2D convolution. Separable convolutions consist of first performing a depthwise spatial convolution (which acts on each input channel separately) followed by a pointwise convolution which mixes the resulting output channels. The `depth_multiplier` argument controls how many output channels are generated per input channel in the depthwise step."_

_"Intuitively, separable convolutions can be understood as a way to factorize a convolution kernel into two smaller kernels, or as an extreme version of an Inception block"_
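
One way to see the factorization is to compare parameter counts. The following sketch assumes an input with 8 channels, 16 output filters, and a $3 \times 3$ kernel; all of these numbers are arbitrary choices for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 32, 32, 8))   # hypothetical batch: one 32 x 32 image, 8 channels

standard = layers.Conv2D(filters=16, kernel_size=3)
separable = layers.SeparableConv2D(filters=16, kernel_size=3)

# Calling a layer once on an input creates its weights.
standard(x)
separable(x)

# Standard: 3*3*8*16 kernel weights + 16 biases.
print(standard.count_params())
# Separable: 3*3*8 depthwise weights + 8*16 pointwise weights + 16 biases.
print(separable.count_params())
```

The separable layer uses far fewer parameters here because the spatial filtering (depthwise step) and the channel mixing (pointwise step) are parametrized separately rather than jointly.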

___

**_Q_** : What is an inception block?
___



### `DepthwiseConv2D`

_"Depthwise separable 2D convolution. Consists of performing just the first step in a depthwise spatial convolution (which acts on each input channel separately)."_

### `Conv2DTranspose`

_"Transposed convolution layer (sometimes called Deconvolution). The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said transpose."_
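
The shape relationship described in the quote can be illustrated with a strided convolution followed by a transposed convolution. This is a sketch with arbitrary sizes; note that the transposed layer restores the *shape* of the input, not its values.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 16, 16, 3))   # hypothetical input batch

down = layers.Conv2D(filters=4, kernel_size=3, strides=2, padding="same")
up = layers.Conv2DTranspose(filters=3, kernel_size=3, strides=2, padding="same")

y = down(x)       # strided convolution: spatial size is halved
x_like = up(y)    # transposed convolution: spatial size is restored

print(y.shape, x_like.shape)
```

Here `up` has its own independently initialized weights, so it is not an inverse of `down`; it only goes "in the opposite direction" in terms of shapes and connectivity.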

### `Conv3DTranspose`