# Convolutional neural network

## Edge detection

- ex. 6x6 image * (convolve) 3x3 filter = 4x4 image
- vertical edge detection: use filters such as

$\left(\begin{array}{ccc}
  1 & 0 & -1\\
  1 & 0 & -1 \\
  1 & 0 & -1
\end{array}\right)$

$\left(\begin{array}{ccc}
  {1} & 0 & {-1}\\
  2  & 0 & -2 \\
  {1} & 0 & {-1}
\end{array}\right)$ (Sobel filter)

$\left(\begin{array}{ccc}
  {3} & 0 & {-3}\\
  10  & 0 & -10 \\
  {3} & 0 & {-3}
\end{array}\right)$ (Scharr filter)

- horizontal edge detection: use filters such as

$\left(\begin{array}{ccc}
  {1} & 1 & {1}\\
  0  & 0 & 0 \\
  {-1} & -1 & {-1}
\end{array}\right)$

- $n$ x $n * f$ x $f = (n-f+1)$ x $(n-f+1)$
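
The size rule above can be checked with a minimal NumPy sketch. It assumes stride 1, no padding, and the deep-learning convention of "convolution" as cross-correlation (the filter is not flipped); `conv2d_valid` is a hypothetical helper name, and the test image is made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 6x6 image: bright left half (10), dark right half (0) -> one vertical edge
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Sobel filter for vertical edges
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]], dtype=float)

edges = conv2d_valid(image, sobel)
print(edges.shape)  # (4, 4)
print(edges[0])     # values 0, 40, 40, 0: the edge shows up in the middle columns
```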

## Padding

- avoids shrinking output and throwing away information from edges
- $n$ x $n * f$ x $f = (n+2p-f+1)$ x $(n+2p-f+1)$
- "valid": no padding
- "same": pad so that output size is the same as the input size
    - $p = \dfrac{f-1}{2}$ ($f$ is usually odd)

## Strided convolutions

- with stride $s$: $n$ x $n * f$ x $f = (\lfloor\dfrac{n+2p-f}{s}+1\rfloor)$ x $(\lfloor\dfrac{n+2p-f}{s}+1\rfloor)$
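
As a quick check of the size formulas for padding and stride, here is a small sketch. It assumes the general form with stride $s$ and a floor when the filter does not fit evenly; `conv_output_size` and `same_padding` are hypothetical helper names.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input (stride 1, odd f)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                     # 4  ("valid")
print(conv_output_size(6, 3, p=same_padding(3)))  # 6  ("same")
print(conv_output_size(7, 3, s=2))                # 3  (strided)
```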

## Convolution over volume

- 6x6x3 volume (height x width x number of channels) * 3x3x3 volume (height x width x number of channels) = 4x4 (27 numbers are multiplied and summed up 16 times to produce 4x4)
- number of channels must match between input and filter
- multiple filters
    - 6x6x3 * 3x3x3 (two of these: one for vertical edge and the other for horizontal edge) = 4x4x2
    - $n$ x $n$ x $n_{c} * f$ x $f$ x $n_{c} = (n-f+1)$ x $(n-f+1)$ x $n_{c}^{'}$ where $n_{c}^{'}$ = number of filters
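
A minimal NumPy sketch of convolution over a volume with several filters, assuming stride 1, no padding, and random values in place of real image data; `conv_volume` is a hypothetical helper name.

```python
import numpy as np

def conv_volume(x, filters):
    """x: (n_h, n_w, n_c); filters: (f, f, n_c, n_filters). No padding, stride 1."""
    n_h, n_w, n_c = x.shape
    f, _, n_c_filter, n_filters = filters.shape
    assert n_c == n_c_filter, "number of channels must match"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + f, j:j + f, :]  # (f, f, n_c)
            # multiply and sum f*f*n_c numbers, once per filter
            out[i, j, :] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))           # 6x6x3 input
filters = rng.standard_normal((3, 3, 3, 2))  # two 3x3x3 filters
out = conv_volume(x, filters)
print(out.shape)  # (4, 4, 2)
```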
    
## One layer of a convolutional network

- if layer $l$ is a convolutional layer
    - $f^{[l]}$ = filter size
    - $p^{[l]}$ = padding
    - $s^{[l]}$ = stride
    - $n_{c}^{[l]}$ = number of filters
    - input: $n_{H}^{[l-1]}$ x $n_{W}^{[l-1]}$ x $n_{c}^{[l-1]}$
    - output: $n_{H}^{[l]}$ x $n_{W}^{[l]}$ x $n_{c}^{[l]}$
    - $n^{[l]} = \lfloor\dfrac{n^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}} + 1\rfloor$
    - each filter: $f^{[l]}$ x $f^{[l]}$ x $n_{c}^{[l-1]}$
    - activation: $a^{[l]}$ => $n_{H}^{[l]}$ x $n_{W}^{[l]}$ x $n_{c}^{[l]}$ or $A^{[l]}$ => $m$ x $n_{H}^{[l]}$ x $n_{W}^{[l]}$ x $n_{c}^{[l]}$
    - weights: $f^{[l]}$ x $f^{[l]}$ x $n_{c}^{[l-1]}$ x $n_{c}^{[l]}$
    - bias: $n_{c}^{[l]}$ (represented as $(1,1,1,n_{c}^{[l]})$)
    
## Pooling layers

- max pooling: take max number in each region
- $f$: filter size
- $s$: stride
- no parameters to learn
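
A minimal sketch of max pooling on a single channel, assuming no padding; `max_pool` is a hypothetical helper name and the input values are made up. With several channels, the same operation would be applied to each channel independently.

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Take the max of each f x f region, moving s steps at a time."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

pooled = max_pool(x, f=2, s=2)
print(pooled)  # 2x2 result with values 9, 2 / 6, 3
```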

## LeNet-5

<table>
    <tr>
        <th></th>
        <th>Activation shape</th>
        <th>Activation size</th>
        <th>Number of parameters</th>
    </tr>
    <tr>
        <th>Input</th>
        <td>(32,32,3)</td>
        <td>3072</td>
        <td>0</td>
    </tr>
    <tr>
        <th>CONV1(f=5,s=1)</th>
        <td>(28,28,6)</td>
        <td>4704</td>
        <td>456</td>
    </tr>
    <tr>
        <th>POOL1(f=2,s=2)</th>
        <td>(14,14,6)</td>
        <td>1176</td>
        <td>0</td>
    </tr>
    <tr>
        <th>CONV2(f=5,s=1)</th>
        <td>(10,10,16)</td>
        <td>1600</td>
        <td>2416</td>
    </tr>
    <tr>
        <th>POOL2(f=2,s=2)</th>
        <td>(5,5,16)</td>
        <td>400</td>
        <td>0</td>
    </tr>
    <tr>
        <th>FC3</th>
        <td>(120,1)</td>
        <td>120</td>
        <td>48120</td>
    </tr>
    <tr>
        <th>FC4</th>
        <td>(84,1)</td>
        <td>84</td>
        <td>10164</td>
    </tr>
    <tr>
        <th>Softmax</th>
        <td>(10,1)</td>
        <td>10</td>
        <td>850</td>
    </tr>
</table> 

- about 60K parameters
- width and height go down but number of channels goes up
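
The parameter total can be recomputed from the layer shapes. This sketch assumes each convolutional filter has $f \cdot f \cdot n_{c}^{[l-1]}$ weights plus one bias, and each fully connected layer has one weight per input-output pair plus one bias per output; `conv_params` and `fc_params` are hypothetical helper names.

```python
def conv_params(f, n_c_prev, n_c):
    """(f*f*n_c_prev weights + 1 bias) per filter, n_c filters."""
    return (f * f * n_c_prev + 1) * n_c

def fc_params(n_in, n_out):
    """n_in*n_out weights + n_out biases."""
    return n_in * n_out + n_out

layers = {
    "CONV1": conv_params(5, 3, 6),       # 3 input channels, 6 filters
    "CONV2": conv_params(5, 6, 16),      # 6 input channels, 16 filters
    "FC3": fc_params(5 * 5 * 16, 120),   # flattened 400 -> 120
    "FC4": fc_params(120, 84),
    "Softmax": fc_params(84, 10),
}
total = sum(layers.values())
print(total)  # 62006, i.e. about 60K
```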

## Why convolutions

- parameter sharing: a feature detector (such as vertical edge detector) that is useful in one part of the image is probably useful in another part of the image
- sparsity of connections: in each layer, each output value depends only on a small number of inputs

## AlexNet

<table>
    <tr>
        <th></th>
        <th>Activation shape</th>
    </tr>
    <tr>
        <th>Input</th>
        <td>(227,227,3)</td>
    </tr>
    <tr>
        <th>CONV(f=11,s=4)</th>
        <td>(55,55,96)</td>
    </tr>
    <tr>
        <th>POOL(f=3,s=2)</th>
        <td>(27,27,96)</td>
    </tr>
    <tr>
        <th>CONV(f=5,same)</th>
        <td>(27,27,256)</td>
    </tr>
    <tr>
        <th>POOL(f=3,s=2)</th>
        <td>(13,13,256)</td>
    </tr>
    <tr>
        <th>CONV(f=3,same)</th>
        <td>(13,13,384)</td>
    </tr>
    <tr>
        <th>CONV(f=3,same)</th>
        <td>(13,13,384)</td>
    </tr>
    <tr>
        <th>CONV(f=3,same)</th>
        <td>(13,13,256)</td>
    </tr>
    <tr>
        <th>POOL(f=3,s=2)</th>
        <td>(6,6,256)</td>
    </tr>
    <tr>
        <th>FC</th>
        <td>(4096,1)</td>
    </tr>
    <tr>
        <th>FC</th>
        <td>(4096,1)</td>
    </tr>
    <tr>
        <th>Softmax</th>
        <td>(1000,1)</td>
    </tr>
</table> 

- about 60M parameters
- uses ReLU activations

## VGG-16

- all CONVs are f=3,s=1,same
- all POOLs are f=2,s=2

<table>
    <tr>
        <th></th>
        <th>Activation shape</th>
    </tr>
    <tr>
        <th>Input</th>
        <td>(224,224,3)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(224,224,64)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(224,224,64)</td>
    </tr>
    <tr>
        <th>POOL</th>
        <td>(112,112,64)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(112,112,128)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(112,112,128)</td>
    </tr>
    <tr>
        <th>POOL</th>
        <td>(56,56,128)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(56,56,256)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(56,56,256)</td>
    </tr>
    <tr>
        <th>CONV</th>
        <td>(56,56,256)</td>
    </tr>
    <tr>
        <th>There are many more</th>
        <td>skipped</td>
    </tr>
    
</table> 

- about 138M parameters

## ResNets 

- typically, to go from $a^{[l]}$ to $a^{[l+2]}$
    - $z^{[l+1]} = w^{[l+1]}a^{[l]} + b^{[l+1]}$
    - $a^{[l+1]} = g(z^{[l+1]})$
    - $z^{[l+2]} = w^{[l+2]}a^{[l+1]} + b^{[l+2]}$
    - $a^{[l+2]} = g(z^{[l+2]})$
- add a shortcut (skip connection) such that $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$, which carries $a^{[l]}$ past the intermediate layers
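
A minimal NumPy sketch of one residual block, assuming fully connected layers, ReLU for $g$, and matching dimensions so that $a^{[l]}$ can be added to $z^{[l+2]}$; `residual_block` is a hypothetical helper name.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, w1, b1, w2, b2):
    """Two layers plus a shortcut that adds a_l before the last activation."""
    z1 = w1 @ a_l + b1
    a1 = relu(z1)
    z2 = w2 @ a1 + b2
    return relu(z2 + a_l)  # shortcut: a^{[l]} is added to z^{[l+2]}

a_l = np.array([1.0, 2.0, 0.5])  # non-negative, as after a ReLU
zeros_w = np.zeros((3, 3))
zeros_b = np.zeros(3)

# With all-zero weights and biases the block reduces to the identity on a_l
a_l2 = residual_block(a_l, zeros_w, zeros_b, zeros_w, zeros_b)
print(a_l2)  # same values as a_l
```

The zero-weight case hints at why the shortcut helps: the block can fall back to passing its input through unchanged.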

## 1x1 convolution

- ex. 6x6x32 * 1x1x32 = 6x6xnumber_of_filters
- ex. 28x28x192 * 1x1x192 (32 filters) = 28x28x32 (this shrinks $n_{c}$ while keeping height and width)
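
A 1x1 convolution applies the same small fully connected layer to the channel vector at every pixel. A minimal NumPy sketch, assuming random values and no bias or activation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))  # input volume
w = rng.standard_normal((192, 32))      # 32 filters, each of shape 1x1x192

# For every (h, w) position, mix the 192 input channels into 32 output channels
out = np.einsum("hwc,ck->hwk", x, w)
print(out.shape)  # (28, 28, 32): height and width unchanged, channels reduced
```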

## Inception network

- with input 28x28x192, apply 3 different convolutions and 1 max-pooling
    - 28x28x192 * 1x1 = 28x28x64
    - 28x28x192 * 3x3 (same) = 28x28x128
    - 28x28x192 * 5x5 (same) = 28x28x32
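    - 28x28x192 * POOL (same, s=1) = 28x28x192, then a 1x1 convolution (32 filters) = 28x28x32
- concatenate the four outputs along the channel dimension (64 + 128 + 32 + 32): total output is 28x28x256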

Computation reduction example:

- 28x28x192 * 5x5x192 (32 filters, same) = 28x28x32. This requires about 120M multiplications (28x28x32 outputs, each needing 5x5x192 multiplications)
- 28x28x192 * 1x1x192 (16 filters) = 28x28x16, then 28x28x16 * 5x5x16 (32 filters, same) = 28x28x32. This requires about 12.4M multiplications (2.4M + 10.0M)
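
The two multiplication counts can be reproduced with a few lines of arithmetic, assuming one multiplication per filter weight per output value; `conv_mults` is a hypothetical helper name.

```python
def conv_mults(out_h, out_w, n_filters, f, n_c_in):
    """Each of the out_h*out_w*n_filters outputs needs f*f*n_c_in multiplications."""
    return out_h * out_w * n_filters * f * f * n_c_in

direct = conv_mults(28, 28, 32, f=5, n_c_in=192)
bottleneck = (conv_mults(28, 28, 16, f=1, n_c_in=192)    # 1x1 reduction to 16 channels
              + conv_mults(28, 28, 32, f=5, n_c_in=16))  # 5x5 on the reduced volume

print(direct)      # 120422400 (about 120M)
print(bottleneck)  # 12443648  (about 12.4M)
```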

## Object localization

- target label $y = 
\begin{bmatrix}
    p_{c} \\ 
    b_{x} \\ 
    b_{y} \\
    b_{h} \\
    b_{w} \\
    c_{1} \\
    c_{2} \\
    c_{3} \\
\end{bmatrix}$
 
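- $p_{c}$: is there an object? (1 if yes, 0 if background only)
- $b_{x}$: x-coordinate of the bounding box midpoint
- $b_{y}$: y-coordinate of the bounding box midpoint
- $b_{h}$: height of the bounding box
- $b_{w}$: width of the bounding box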
- $c_{1}$: class #1
- $c_{2}$: class #2
- $c_{3}$: class #3
- $L(\hat{y}, y) = (\hat{y}_{1} - y_{1})^{2} + (\hat{y}_{2} - y_{2})^{2} + \dots + (\hat{y}_{8} - y_{8})^{2}$ if $y_{1} = 1$ 
- $L(\hat{y}, y) = (\hat{y}_{1} - y_{1})^{2}$ if $y_{1} = 0$
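
The piecewise loss above can be sketched as follows, assuming the 8-component label order $[p_{c}, b_{x}, b_{y}, b_{h}, b_{w}, c_{1}, c_{2}, c_{3}]$; `localization_loss` is a hypothetical helper name and the example values are made up.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared error on all components if an object is present, else on p_c only."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)

y_obj = [1, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]  # object of class 2
y_bg = [0, 0, 0, 0, 0, 0, 0, 0]           # no object: other entries are ignored
y_hat = [0.5, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]

print(localization_loss(y_hat, y_obj))  # 0.25 (only p_c is off, by 0.5)
print(localization_loss(y_hat, y_bg))   # 0.25 (only p_c is compared)
```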

## Sliding windows

- slide a fixed-size window across the image and run the image recognition algorithm on each crop, reporting a detection wherever it finds the object. when the whole image has been covered, change the window size and slide it across the image again

## Convolution implementation of sliding windows

- 14x14x3 $\xrightarrow{5x5}$ 10x10x16 $\xrightarrow{2x2, POOL}$ 5x5x16 $\xrightarrow{5x5, FC}$ 1x1x400 $\xrightarrow{1x1, FC}$ 1x1x400 $\xrightarrow{1x1, FC}$ 1x1x4
- 16x16x3 $\xrightarrow{5x5}$ 12x12x16 $\xrightarrow{2x2, POOL}$ 6x6x16 $\xrightarrow{5x5, FC}$ 2x2x400 $\xrightarrow{1x1, FC}$ 2x2x400 $\xrightarrow{1x1, FC}$ 2x2x4
- 28x28x3 $\xrightarrow{5x5}$ 24x24x16 $\xrightarrow{2x2, POOL}$ 12x12x16 $\xrightarrow{5x5, FC}$ 8x8x400 $\xrightarrow{1x1, FC}$ 8x8x400 $\xrightarrow{1x1, FC}$ 8x8x4

## Bounding box predictions

- split the image into multiple grid cells
- assign object to grid cell by the midpoint of the object

## Intersection over union

- measure of the overlap between two bounding boxes
- size of intersection / size of union of the two bounding boxes
- correct if IoU $\ge 0.5$
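
A minimal sketch of IoU, assuming axis-aligned boxes given as corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`. This box format is an assumption for illustration, not something the notes specify.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: intersection 1, union 4 + 4 - 1
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
```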

## Non-max suppression

- cleans up multiple detections of same object
- discard all boxes with $p_{c} \le 0.6$
- while there are any remaining boxes
    - pick the box with the largest $p_{c}$, output that as a prediction
    - discard any remaining box with IoU $\ge 0.5$ with the box output in the previous step
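
The loop above can be sketched in plain Python. It assumes a single class, boxes given as `(x1, y1, x2, y2)` corners, and the thresholds 0.6 and 0.5 from the steps above; `non_max_suppression` and `iou` are hypothetical helper names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Return indices of the boxes kept as predictions."""
    # discard boxes with p_c <= score_thresh, sort the rest by p_c (highest first)
    candidates = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                        key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)  # largest remaining p_c
        keep.append(best)
        # discard remaining boxes that overlap the chosen box too much
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

With several classes, one common approach is to run this once per class.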
    
## Anchor boxes

- overlapping objects? create multiple (anchor) boxes
- target label $y = 
\begin{bmatrix}
    p_{c} \\ 
    b_{x} \\ 
    b_{y} \\
    b_{h} \\
    b_{w} \\
    c_{1} \\
    c_{2} \\
    c_{3} \\
    p_{c} \\ 
    b_{x} \\ 
    b_{y} \\
    b_{h} \\
    b_{w} \\
    c_{1} \\
    c_{2} \\
    c_{3} \\
\end{bmatrix}$
- each object in a training image is assigned to the grid cell that contains the object's midpoint and, within that grid cell, to the anchor box with the highest IoU

## YOLO algorithm

- ex. training label: 3x3 (grid cells) x 2 (number of anchors) x 8 (5 + number of classes)
- output: 3x3x16

## Face verification

- input image
- output whether the input image is that of the claimed person

## Face recognition

- has database of $K$ persons
- get an input image
- output ID if the image is any of the $K$ persons (or not recognized)

## One-shot learning

- learning from one example to recognize the person again
- "similarity" function d(img1, img2) = degree of difference between images

## Siamese network

- run different images on the same network and compare encodings $f(x^{(i)})$
- if $x^{(i)}, x^{(j)}$ are the same person, $||f(x^{(i)}) - f(x^{(j)})||^{2}$ is small
- if $x^{(i)}, x^{(j)}$ are different persons, $||f(x^{(i)}) - f(x^{(j)})||^{2}$ is large

## Triplet loss

- anchor, positive, negative
- want $||f(A)-f(P)||^{2} \le ||f(A)-f(N)||^{2} - \alpha$ (the margin $\alpha$ rules out the trivial solution where all encodings are equal, e.g. all zeros)
- $L(A,P,N) = \max(||f(A)-f(P)||^{2} - ||f(A)-f(N)||^{2} + \alpha, 0)$
- $J = \displaystyle\sum_{i=1}^{m}L(A^{(i)}, P^{(i)}, N^{(i)})$
- during training, if A,P,N are chosen randomly, $d(A,P) + \alpha \le d(A,N)$ is easily satisfied
    - choose triplets that are "hard" to train on $d(A,P) \approx d(A,N)$
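
A minimal NumPy sketch of the triplet loss on precomputed encodings, assuming a margin of $\alpha = 0.2$ (an arbitrary choice for illustration); `triplet_loss` is a hypothetical helper name.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0), per triplet."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=-1)
    d_an = np.sum((f_a - f_n) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + alpha, 0.0)

f_a = np.array([0.0, 0.0])
f_p = np.array([0.0, 0.0])
f_n = np.array([1.0, 0.0])

print(triplet_loss(f_a, f_p, f_n))  # 0.0: negative is farther than positive by more than alpha
print(triplet_loss(f_a, f_p, f_a))  # 0.2: identical encodings are still penalized by the margin
```

Passing arrays of shape `(m, d)` gives one loss per triplet, and summing them gives $J$.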
    
## Learning the similarity function

- $\hat{y} = \sigma\left(\displaystyle\sum_{k=1}^{128}w_{k}|f(x^{(i)})_{k} - f(x^{(j)})_{k}| + b\right)$

## Neural style transfer

- content image (C), style image (S), generated image (G)
- cost function $J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$
- initialize $G$ randomly
- use gradient descent to minimize $J(G)$
    - $G = G - \dfrac{\partial}{\partial G}J(G)$
    
## Content cost function

- say you use hidden layer $l$ to compute content cost
- use a pre-trained ConvNet (e.g. VGG network)
- let $a^{[l](C)}$ and $a^{[l](G)}$ be the activation of layer $l$ on the images
- if $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content
    - $J_{content}(C,G) = \dfrac{1}{2}||a^{[l](C)} - a^{[l](G)}||^{2}$
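
A minimal NumPy sketch of the content cost, assuming the layer-$l$ activations have already been extracted from a pre-trained network (random arrays stand in for them here); `content_cost` is a hypothetical helper name.

```python
import numpy as np

def content_cost(a_c, a_g):
    """0.5 * squared L2 distance between the activations of C and G."""
    return 0.5 * float(np.sum((a_c - a_g) ** 2))

rng = np.random.default_rng(0)
a_c = rng.standard_normal((4, 4, 3))  # stand-in for a^{[l](C)}
a_g = a_c + 1.0                       # every one of the 48 activations is off by 1

print(content_cost(a_c, a_c))  # 0.0
print(content_cost(a_c, a_g))  # approximately 24 (0.5 * 48)
```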
    
## Style cost function

- style matrix
    - let $a_{i,j,k}^{[l]}$ = activation at $(i,j,k)$ (height, width, channel)
    - $G^{[l]}$ is $n_{c}^{[l]}$ x $n_{c}^{[l]}$
    - $G_{kk'}^{[l]} = \displaystyle\sum_{i=1}^{n_{H}^{[l]}}\displaystyle\sum_{j=1}^{n_{W}^{[l]}}a_{ijk}^{[l]}a_{ijk'}^{[l]}$ (do this for both style and generated)
    - $J_{style}^{[l]}(S,G) = ||G^{[l](S)} - G^{[l](G)}||^{2}_{F} = \displaystyle\sum_{k}\displaystyle\sum_{k'}(G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)})^{2}$ 
    - $J_{style}(S,G) = \displaystyle\sum_{l}\lambda^{[l]}J_{style}^{[l]}(S,G)$
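
A minimal NumPy sketch of the style matrix and the per-layer style cost, assuming activations of shape $(n_{H}, n_{W}, n_{c})$ and no normalization constant, as in the formulas above; `gram_matrix` and `layer_style_cost` are hypothetical helper names, and random arrays stand in for real activations.

```python
import numpy as np

def gram_matrix(a):
    """a: (n_h, n_w, n_c) -> G: (n_c, n_c) with G[k, k'] = sum_ij a[i,j,k] * a[i,j,k']."""
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)  # one row per spatial position
    return flat.T @ flat

def layer_style_cost(a_s, a_g):
    """Squared Frobenius norm of the difference between the two style matrices."""
    diff = gram_matrix(a_s) - gram_matrix(a_g)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
a_s = rng.standard_normal((4, 4, 3))  # stand-in for style activations
a_g = rng.standard_normal((4, 4, 3))  # stand-in for generated activations

print(gram_matrix(a_s).shape)      # (3, 3)
print(layer_style_cost(a_s, a_s))  # 0.0
print(layer_style_cost(a_s, a_g))  # some positive number
```

The total style cost would then be a weighted sum of `layer_style_cost` over the chosen layers, with weights $\lambda^{[l]}$.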