# CNN

Used in `computer vision`, object detection, style transfer...

Challenge: images have high resolution, so the input layer of a NN has an extremely high dimension. For a **fully connected NN**, the matrix of weights then becomes extremely large.   

Solution: convolution operation

### Convolution operation

Consider edge detection in a grayscale image.  
Given an $N\times N$ matrix of the image, construct a __kernel__ or __filter__ of size $F\times F$, where $F<N$, and `convolve` it with the original matrix via the __convolution operation__ '$*$'. The output of the convolution operation is a $P\times P$ matrix with $P = N - F + 1$, so $P \le N$. 

The first element of this matrix is computed by overlaying the $F\times F$ matrix with the equally sized region of the $N\times N$ matrix in the upper left corner and performing **element-wise** multiplication and **summing** the result.

Next, shift the filter one element to the right and repeat the computation to get the second number in the $P\times P$ matrix. Continue across the row and then down the rows until the whole image is covered. 

In Python, convolution is available in TensorFlow as `tf.nn.conv2d`. 
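The sliding-window procedure described above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not how `tf.nn.conv2d` is implemented; the function name `conv2d_valid` and the toy arrays are made up for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, step of one pixel)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_h, j:j + f_w]
            # Element-wise product of window and filter, then sum.
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
kernel = np.ones((3, 3))                          # toy 3x3 filter
result = conv2d_valid(image, kernel)
print(result.shape)  # (2, 2), i.e. P = N - F + 1 = 4 - 3 + 1
```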

A _filter_ that detects vertical edges, i.e., bright pixels on the left and dark pixels on the right, could be: 

$$
\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1 \\
\end{bmatrix}
$$

Convolving it with an image that contains such an edge gives high (lighter) values in the region where this _edge structure_ is present. 
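A small numeric check of this claim, as a sketch: the $6\times 6$ test image (bright left half, dark right half) and the helper `conv2d_valid` are assumptions chosen for illustration, and the helper only handles square inputs.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Square image and square filter assumed; no padding, step of one pixel.
    f = kernel.shape[0]
    p = image.shape[0] - f + 1
    return np.array([[np.sum(image[i:i + f, j:j + f] * kernel)
                      for j in range(p)] for i in range(p)])

# 6x6 image: bright (10) left half, dark (0) right half.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
vertical = np.array([[1, 0, -1]] * 3, dtype=float)

edges = conv2d_valid(image, vertical)
print(edges[0])  # [ 0. 30. 30.  0.]
```

The two middle columns of the $4\times 4$ output are high because only there does the filter straddle the bright-to-dark transition.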

### Types of edge detections

For horizontal edges, the filter would be 

$$
\begin{bmatrix}
1 & 1 & 1 \\
0 & 0 & 0 \\
-1 & -1 & -1 \\
\end{bmatrix}
$$

High values in the resulting matrix indicate the presence of an edge. The sign encodes the direction of the transition (bright-to-dark vs. dark-to-bright); if the direction is not important, take the absolute value. 
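To illustrate the sign of the response, here is a sketch with two made-up $6\times 6$ images that contain the same horizontal edge with opposite transitions. The helper `conv2d_valid` is an assumption for the example, not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    f = kernel.shape[0]
    p = image.shape[0] - f + 1
    return np.array([[np.sum(image[i:i + f, j:j + f] * kernel)
                      for j in range(p)] for i in range(p)])

horizontal = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

bright_top = np.vstack([np.full((3, 6), 10.0), np.zeros((3, 6))])
dark_top = bright_top[::-1]  # same edge, opposite transition

pos = conv2d_valid(bright_top, horizontal)
neg = conv2d_valid(dark_top, horizontal)
print(pos[:, 0])  # [ 0. 30. 30.  0.]
print(neg[:, 0])  # [  0. -30. -30.   0.]
```

Taking `np.abs` of either result gives the same edge map, which is what one wants when the orientation of the transition does not matter.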


#### Vertical edge detection examples:

**Sobel filter** 

$$
\begin{bmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1 \\
\end{bmatrix}
$$

**Scharr filter**

$$
\begin{bmatrix}
3 & 0 & -3 \\
10 & 0 & -10 \\
3 & 0 & -3 \\
\end{bmatrix}
$$

These filters do not have to be designed by hand: the filter entries can be treated as parameters of a NN and learned by back-prop.

### Padding

Convolving $6\times 6$ matrix with $3\times 3$ filter results in a $4\times 4$ matrix. 

Convolving $n\times n$ matrix with $f\times f$ filter results in a $(n-f+1)\times (n-f+1)$ matrix. 

> Every time the convolution operation is applied, the image shrinks. 

Corner pixels are used in only one window position, while middle pixels are used multiple times, so information near the border is under-used. 

`Padding` -- adding an additional border of pixels (typically zeros) around the image to preserve the original size after convolving with the filter.  

Padding can also be done with two or more pixels. 

- `Valid convolution` - no padding
- `Same convolution` - the output size is equal to the input size. The padding size is then set by the filter size as $p = (f-1)/2$. 

Usually $f$ is an odd number, which makes $p$ an integer (symmetric padding) and gives the filter a central position.  
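The difference between valid and same convolution can be checked numerically. The sketch below uses `np.pad` to add the zero border; the helper `conv2d` and the random test data are assumptions for illustration.

```python
import numpy as np

def conv2d(image, kernel, pad=0):
    """Convolution with step 1 and `pad` rings of zeros around the image."""
    padded = np.pad(image, pad, mode="constant", constant_values=0)
    f = kernel.shape[0]
    p = padded.shape[0] - f + 1
    out = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            out[i, j] = np.sum(padded[i:i + f, j:j + f] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

valid = conv2d(image, kernel)                                 # p = 0
same = conv2d(image, kernel, pad=(kernel.shape[0] - 1) // 2)  # p = (f-1)/2
print(valid.shape, same.shape)  # (4, 4) (6, 6)
```

The interior of the same-convolution output coincides with the valid output; padding only adds the border values.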

### Strided Convolution

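Here the filter moves not by one pixel but by $s$ pixels at a time (with the same element-wise product and sum). This gives a smaller output after the convolution. The output size is 
$$
\left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right)
$$
where $n$ is the image size, $p$ is the padding, $f$ is the filter size and $s$ is the stride. The floor means that a window which would extend beyond the (padded) image is skipped. 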
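The output-size formula and the strided sliding can be sketched as follows. The names `conv_output_size` and `conv2d_strided` are hypothetical helpers, and integer division `//` plays the role of the floor that handles sizes which do not divide evenly.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

def conv2d_strided(image, kernel, stride):
    f = kernel.shape[0]
    out_n = conv_output_size(image.shape[0], f, s=stride)
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            r, c = i * stride, j * stride  # top-left corner of the window
            out[i, j] = np.sum(image[r:r + f, c:c + f] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
out = conv2d_strided(image, np.ones((3, 3)), stride=2)
print(out.shape)                         # (3, 3)
print(conv_output_size(6, 3))            # 4  (valid convolution)
print(conv_output_size(6, 3, p=1))       # 6  (same convolution)
```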

#### _Cross correlation vs. convolution_

Convolution in math: the filter is first mirrored along both axes (flipped horizontally and vertically, which is not the same as transposing) before being applied to the original matrix. This gives the operation properties that are used in __signal processing__, such as associativity: 

$$
(A*B)*C = A*(B*C)
$$

But for deep NNs this is not that important, because the filter values are learned anyway. 

What we do here is, strictly speaking, _cross-correlation_, but in deep learning it is commonly called __convolution__.
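A short sketch of the difference, under the assumption that the un-flipped sliding window is called `cross_correlate` and the flipped version `true_convolve` (both names are made up here). The associativity check is done in 1D with `np.convolve` for brevity.

```python
import numpy as np

def cross_correlate(image, kernel):
    # What CNN layers compute: slide the filter as-is.
    f = kernel.shape[0]
    p = image.shape[0] - f + 1
    return np.array([[np.sum(image[i:i + f, j:j + f] * kernel)
                      for j in range(p)] for i in range(p)])

def true_convolve(image, kernel):
    # Mathematical convolution: flip the filter along both axes first.
    return cross_correlate(image, np.flip(kernel))

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
vertical = np.array([[1, 0, -1]] * 3, dtype=float)

# Flipping the vertical-edge filter only negates it,
# so the two operations differ by a sign for this filter.
same_up_to_sign = np.allclose(true_convolve(image, vertical),
                              -cross_correlate(image, vertical))

# Associativity (A*B)*C = A*(B*C) of true convolution, checked in 1D.
a, b, c = rng.normal(size=5), rng.normal(size=3), rng.normal(size=3)
associative = np.allclose(np.convolve(np.convolve(a, b), c),
                          np.convolve(a, np.convolve(b, c)))
print(same_up_to_sign, associative)  # True True
```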



### Convolution Over Volume 

For colored images, the $3$rd dimension is the number of channels. 

The filter should have __the same__ number of channels. 

The output is computed by __element-wise__ multiplication of the filter with the image window across all channels, followed by summing everything to get a single output pixel value. 

To detect edges only in the red channel, for example, the filter would have the standard edge filter in the red channel and all zeros in the other channels. 

In this way, features can be detected in individual channels or across all of them. 
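The red-channel example can be sketched as below. The channel order (index 0 = red), the test image, and the helper `conv_volume` are assumptions made for the illustration.

```python
import numpy as np

def conv_volume(image, kernel):
    """image: (n, n, n_c), kernel: (f, f, n_c) -> output (n-f+1, n-f+1)."""
    assert image.shape[2] == kernel.shape[2], "channel counts must match"
    f = kernel.shape[0]
    p = image.shape[0] - f + 1
    out = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            # Multiply across all channels at once, then sum to one number.
            out[i, j] = np.sum(image[i:i + f, j:j + f, :] * kernel)
    return out

edge = np.array([[1, 0, -1]] * 3, dtype=float)
red_only = np.zeros((3, 3, 3))
red_only[:, :, 0] = edge          # edge filter for red, zeros elsewhere

image = np.zeros((6, 6, 3))
image[:, :3, 0] = 10.0            # vertical edge in the red channel
image[:3, :, 1] = 5.0             # horizontal edge in green, ignored by the filter

out = conv_volume(image, red_only)
print(out.shape)  # (4, 4)
print(out[0])     # [ 0. 30. 30.  0.]
```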

### Multiple filters

Applying multiple filters to the same image, we get a different output for each filter and __stack__ the results into a new multi-dimensional output. 

$$
(n\times n\times n_c)*(f\times f\times n_c)\rightarrow ((n - f + 1)\times (n - f + 1) \times n_c')
$$

where $n_c$ is the number of channels in the image and $n_c'$ is the number of filters.
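Stacking the per-filter outputs can be sketched with `np.stack`. Storing the filters as one array of shape `(f, f, n_c, n_filters)` is an assumption of this example, as are the helper names.

```python
import numpy as np

def conv_volume(image, kernel):
    f = kernel.shape[0]
    p = image.shape[0] - f + 1
    return np.array([[np.sum(image[i:i + f, j:j + f, :] * kernel)
                      for j in range(p)] for i in range(p)])

def conv_multi(image, filters):
    """filters: (f, f, n_c, n_filters) -> output (n-f+1, n-f+1, n_filters)."""
    maps = [conv_volume(image, filters[..., k])
            for k in range(filters.shape[-1])]
    return np.stack(maps, axis=-1)  # one output channel per filter

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))
filters = rng.normal(size=(3, 3, 3, 2))

out = conv_multi(image, filters)
print(out.shape)  # (4, 4, 2)
```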

### One Layer of a Convolutional Network

One layer of a CNN applies multiple ($n_f$) filters to a given image with dimensions $n\times n \times n_c$, adds a bias value to each filter output, applies a non-linear function, e.g., ReLU, and then stacks the outputs together, with the third dimension being the number of filters $n_f$.  

$z^{[1]} = w^{[1]} * a^{[0]} + b^{[1]}$, with $a^{[1]} = g(z^{[1]})$

where $w^{[1]} * a^{[0]}$ is the convolution operation applied for each filter. 
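A forward pass through one such layer might look like the sketch below. It assumes NumPy 1.20 or newer (for `sliding_window_view`), weights stored as `(f, f, n_c, n_filters)`, ReLU as $g$, and no padding or stride; the function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer_forward(a_prev, W, b):
    """a_prev: (n, n, n_c), W: (f, f, n_c, n_f), b: (1, 1, n_f)."""
    f = W.shape[0]
    # windows[i, j, c, u, v] == a_prev[i + u, j + v, c]
    windows = sliding_window_view(a_prev, (f, f), axis=(0, 1))
    # Convolution for every filter k, then one bias per filter.
    z = np.einsum("ijcuv,uvck->ijk", windows, W) + b
    return np.maximum(z, 0)  # ReLU

rng = np.random.default_rng(0)
a0 = rng.normal(size=(6, 6, 3))
W1 = rng.normal(size=(3, 3, 3, 2))
b1 = rng.normal(size=(1, 1, 2))

a1 = conv_layer_forward(a0, W1, b1)
print(a1.shape)  # (4, 4, 2)
```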

##### Example: 
Consider $10$ filters that are $3\times3\times3$ in one layer of a NN. How many parameters does the layer have? 

Answer: $3\times 3\times 3 = 27$ weights plus $1$ bias times $10$ filters. So it is $280$ parameters. 
Note that this value is **independent** of the image size.  
Hence, this model is less prone to _overfitting_.  
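The count from the example can be written as a one-line function. The name `conv_layer_params` is made up; note that the image size is not among its arguments.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    # (weights per filter + one bias) times the number of filters
    return (f * f * n_c_prev + 1) * n_filters

print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # 280
```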

### Summary

- $f^{[l]}$ - filter size in layer $l$
- $p^{[l]}$ - padding size in layer $l$
- $s^{[l]}$ - stride size in layer $l$
- $n_c^{[l]}$ - number of filters in layer $l$

Input for layer $l$ is: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$, with separate height and width (if different). 

Output is $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$ with $n^{[l]} = \Big\lfloor \frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \Big\rfloor + 1$,
applied to height and width in the same way.  

**Note** output image depth is equal to the number of filters. 

Size of each filter for layer $l$: $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}$ (with the number of channels being the same as in the image)

Activations: $a^{[l]}\rightarrow n_H^{[l]}\times n_W^{[l]}\times n_c^{[l]}$.  
For vectorized implementation $A^{[l]} \rightarrow m\times n_H^{[l]}\times n_W^{[l]}\times n_c^{[l]}$

Weights: $f^{[l]}\times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$, with the last being the number of filters in the layer $l$.  

Bias: $n_c^{[l]}$ values, stored as $(1,1,1,n_c^{[l]})$. 
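The shape bookkeeping of this summary can be collected in one helper. This is a sketch; `conv_layer_shapes` and the example hyperparameters are assumptions, and the weight shape follows the rule that each filter has as many channels as the layer input, $n_c^{[l-1]}$.

```python
def conv_layer_shapes(input_shape, f, p, s, n_filters, m=1):
    """input_shape: (n_H_prev, n_W_prev, n_c_prev); m: number of examples."""
    n_h_prev, n_w_prev, n_c_prev = input_shape
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "weights": (f, f, n_c_prev, n_filters),
        "bias": (1, 1, 1, n_filters),
        "a": (n_h, n_w, n_filters),
        "A": (m, n_h, n_w, n_filters),
    }

shapes = conv_layer_shapes((64, 48, 3), f=5, p=2, s=1, n_filters=8, m=32)
for name, shape in shapes.items():
    print(name, shape)
```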


### Simple Convolutional Network Example

Consider an image $39\times 39\times 3$.  
Here $n_H^{[0]}=n_W^{[0]}=39$ and $n_c^{[0]}=3$.  
Assume that we use $10$ _filters_ with $f^{[1]}=3$, $s^{[1]}=1$ and $p^{[1]}=0$.  
The result is $37\times 37\times 10$, where the last number is the result of applying 10 filters.  
The dimensions of the _activation_ in the first layer are then $n_H^{[1]}=n_W^{[1]}=37$ and $n_c^{[1]}=10$ (number of filters). 

Next, assume that the second layer has 20 _filters_ with
$f^{[2]}=5$, $s^{[2]}=2$ and $p^{[2]}=0$.  
The output is then $17\times 17\times 20$, with 20 being the number of filters.  

At the end, we flatten the result into a vector and feed it into a logistic or softmax layer to get the prediction.  

Hyperparameters are difficult to select. 

It is common that the size, $H\times W$, decreases with depth while the number of channels increases. 
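The sizes in this example can be re-derived with a few lines of arithmetic. The layer list below just restates the hyperparameters from the text; `out_size` is a hypothetical helper.

```python
def out_size(n, f, p, s):
    return (n + 2 * p - f) // s + 1

n, n_c = 39, 3
layers = [
    dict(f=3, s=1, p=0, n_filters=10),
    dict(f=5, s=2, p=0, n_filters=20),
]

shapes = [(n, n, n_c)]
for layer in layers:
    n = out_size(n, layer["f"], layer["p"], layer["s"])
    n_c = layer["n_filters"]
    shapes.append((n, n, n_c))

flat = n * n * n_c  # length of the flattened vector
print(shapes)  # [(39, 39, 3), (37, 37, 10), (17, 17, 20)]
print(flat)    # 5780
```

The trace shows the typical trend: height and width go down ($39 \to 37 \to 17$) while the number of channels goes up ($3 \to 10 \to 20$).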

**Types of layers in CNN**: 
- Convolution (CONV)
- Pooling (POOL)
- Fully connected (FC)

### Pooling layers

- Reduce the size of the representation, speed up computation, and make the detected features more robust. 

#### Max pooling: 
Split the image into regions of size $f\times f$ (moving with stride $s$); the output has one value per region, equal to the $max$ over that region of the image.  

This layer highlights a feature wherever it exists: a high value in a region is preserved in the output. 

**Note** once $f$ and $s$ are fixed, there are no parameters to learn for this layer (it is essentially an empty filter).

So, essentially, it is like a standard convolution, but instead of multiplying and summing, we take an ''empty filter'', move it over the image and take the maximum value _inside this window_ as the output pixel.

For each channel, max pooling is done independently. 
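A minimal max-pooling sketch in NumPy, assuming no padding and an input of shape `(n_H, n_W, n_c)`; the function name and test array are made up for illustration.

```python
import numpy as np

def max_pool(a, f=2, s=2):
    """Max over each f x f window, separately for every channel."""
    n_h, n_w, n_c = a.shape
    out_h, out_w = (n_h - f) // s + 1, (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = a[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = window.max(axis=(0, 1))  # channels stay separate
    return out

a = np.arange(32, dtype=float).reshape(4, 4, 2)
pooled = max_pool(a)
print(pooled.shape)     # (2, 2, 2)
print(pooled[0, 0])     # [10. 11.]
```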

#### Average pooling

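Same as max pooling, but takes the average over each window instead of the maximum. It is sometimes used very deep in the NN.

Overall:  
Hyperparameters: $f$ - filter size, $s$ - stride. A common choice is $f=2$, $s=2$, which shrinks the height and width by a factor of $2$.  

But there are __no parameters__ to learn with backprop.  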
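For the common $f=2$, $s=2$ case the windows do not overlap, so average pooling can be sketched with a reshape instead of loops. This assumes even height and width; `avg_pool_2x2` is a hypothetical name. The function contains no weights, only the fixed window size.

```python
import numpy as np

def avg_pool_2x2(a):
    """a: (n_H, n_W, n_c) with even n_H and n_W -> (n_H/2, n_W/2, n_c)."""
    n_h, n_w, n_c = a.shape
    # Split rows and columns into blocks of 2, then average inside each block.
    blocks = a.reshape(n_h // 2, 2, n_w // 2, 2, n_c)
    return blocks.mean(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = avg_pool_2x2(a)
print(pooled[:, :, 0])
# [[ 2.5  4.5]
#  [10.5 12.5]]
```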




### CNN example

Image recognition from colored images

$[32\times 32\times 3]\rightarrow[28\times28\times6]\rightarrow[14\times 14\times6]$
after Convolution and Pool layers.  

Number of layers: by convention, only layers that have parameters are counted.  
A pooling layer does not have parameters and is generally treated as a part of the 'conv' layer. 

$[14\times 14\times6]\rightarrow[10\times10\times16]\rightarrow[5\times5\times16]$

after another 'conv' plus 'pool' combo.  
Recall that max pooling with $f=2$, $s=2$ halves the $H$ and $W$ of the image.  

Then we flatten the result after two layers into a $400\times1$ vector ($5\cdot 5\cdot 16 = 400$).  
Consider a next layer with $120$ units. This is a `fully-connected` layer: a standard layer with all connections. 
Then comes another layer with $84$ units and then the final output layer with, e.g., a softmax activation function.  

It is common to have the following pattern:  
CONV-POOL-CONV-POOL-FC-FC-FC-SOFTMAX 

Activation size generally __slowly__ decreases with NN depth. 
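The activation sizes and parameter counts of this example can be tabulated with a short script. Several values here are assumptions rather than facts from the notes: $f=5$ for both CONV layers (inferred from $32\to28$ and $14\to10$ with $s=1$, $p=0$), and 16 filters in the second CONV layer, which is the count that makes the flattened vector $5\cdot5\cdot16=400$ long. The output layer is left out because its size is not given.

```python
def conv(shape, f, n_filters, s=1, p=0):
    h, w, c = shape
    out = ((h + 2 * p - f) // s + 1, (w + 2 * p - f) // s + 1, n_filters)
    return out, (f * f * c + 1) * n_filters  # weights + biases

def pool(shape, f=2, s=2):
    h, w, c = shape
    return ((h - f) // s + 1, (w - f) // s + 1, c), 0  # nothing to learn

def fc(n_in, n_out):
    return n_out, n_in * n_out + n_out

shape = (32, 32, 3)
rows = [("input", shape, 0)]
shape, n = conv(shape, f=5, n_filters=6)
rows.append(("CONV1", shape, n))
shape, n = pool(shape)
rows.append(("POOL1", shape, n))
shape, n = conv(shape, f=5, n_filters=16)  # 16 filters: assumption, see above
rows.append(("CONV2", shape, n))
shape, n = pool(shape)
rows.append(("POOL2", shape, n))

flat = shape[0] * shape[1] * shape[2]
units, n = fc(flat, 120)
rows.append(("FC3", (units,), n))
units, n = fc(units, 84)
rows.append(("FC4", (units,), n))

for name, shp, n_params in rows:
    print(f"{name:6s} {str(shp):14s} params={n_params}")
```

Under these assumptions most of the parameters sit in the fully connected layers, while the CONV layers have comparatively few.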


### Why Convolution

Reasons for Conv
- Sparsity of connections
For FC layers, if the input is, e.g., an image, the number of connections, i.e., parameters, is enormous. CONV decreases the number of connections (each output depends only on a small window of the input) and, as such, the number of parameters.  

- Parameter sharing
Parameter sharing allows different parts of the image to use the same low-level 'feature' detector. A given image might contain the same feature multiple times, and there is no need to learn it separately for every position.  
(translation invariance can also be utilized)
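To put numbers on both points, the sketch below compares a fully connected mapping from a $32\times32\times3$ input to a $28\times28\times6$ output with a CONV layer that produces the same output shape. The filter size $f=5$ is an assumption (it is the size that gives $32\to28$ without padding at stride 1).

```python
n_in = 32 * 32 * 3     # 3072 input values
n_out = 28 * 28 * 6    # 4704 output values

# Fully connected: every output is connected to every input, plus biases.
fc_params = n_in * n_out + n_out

# CONV with six 5x5x3 filters: weights are shared across all positions.
conv_params = (5 * 5 * 3 + 1) * 6

# Sparsity: inputs that a single CONV output value depends on.
inputs_per_output = 5 * 5 * 3

print(fc_params)          # 14455392
print(conv_params)        # 456
print(inputs_per_output)  # 75 (instead of 3072 for FC)
```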

The cost function is generally computed on the output of the final FC (e.g., softmax) layers. 



