# Convolutional Neural Networks

---

## Local Receptive Fields:
- Instead of a 1D input vector, the input to a convolutional net is a 2D grid (e.g. image pixels)
- Instead of being fully connected as in an FCNN, each hidden neuron in a CNN is connected only to a small, localized region of the input
- A 5x5 region is connected to one hidden neuron with 25 inputs, each with a learned weight, plus one bias for the neuron
- A 28x28 input layer with a 5x5 **local receptive field** and a *stride length* of 1 results in a 24x24 hidden layer (28 - 5 + 1 = 24)
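
The sliding of a local receptive field over the input can be sketched in NumPy. This is a minimal, loop-based illustration rather than an efficient implementation; the function name `conv2d_valid` and the random input and kernel are assumptions made for the example.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0, stride=1):
    """Slide one kernel over a 2D image without padding.

    Every output neuron sees one f x f local receptive field and
    reuses the same kernel weights and the same bias.
    """
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out_h = (n_h - f_h) // stride + 1
    out_w = (n_w - f_w) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + f_h, c:c + f_w]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias    # weighted input of one neuron
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))           # stand-in for a 28x28 grayscale image
kernel = rng.standard_normal((5, 5))   # 25 weights
hidden = conv2d_valid(image, kernel, bias=0.1)
print(hidden.shape)  # (24, 24)
```

An activation function would normally be applied to `hidden` afterwards; it is left out here to keep the sketch focused on the receptive field.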

## Shared Weights and Biases:
- The same weights and bias are used for each of the hidden neurons connected to the **local receptive fields**
- All the neurons in the hidden layer detect the same feature at different locations
- The shared weights and bias define a **kernel** or **filter**
- A map from the input to the hidden layer is referred to as a *feature map*
- Each *feature map* recognizes one feature (at different locations in the image); to recognize several different features, a convolutional layer consists of several different *feature maps*
- Sharing weights and biases greatly reduces the number of parameters in a convolutional network. Each *feature map* with a 5x5 kernel has 25 shared weights and 1 bias, so only 26 parameters, far fewer than in an FCNN layer of comparable size.
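
To make the saving concrete, the parameter counts can be compared with a few lines of Python. The helper names below are hypothetical, and the fully connected comparison (28x28 inputs wired to 24x24 hidden neurons) is an assumed layer size chosen to match the example above.

```python
def conv_params(f, n_feature_maps, in_channels=1):
    """Parameters of a conv layer: one f x f kernel plus one bias per feature map."""
    return n_feature_maps * (f * f * in_channels + 1)

def dense_params(n_inputs, n_outputs):
    """Parameters of a fully connected layer: one weight per connection plus biases."""
    return n_inputs * n_outputs + n_outputs

print(conv_params(5, 1))                 # 26
print(conv_params(5, 3))                 # 78
print(dense_params(28 * 28, 24 * 24))    # 452160
```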

## Pooling Layers:
- Condense and simplify the output of the convolutional hidden layer, which reduces the number of activations passed on (and so the number of parameters needed in later layers)
- A 2x2 *max-pooling* layer outputs the maximum activation in each 2×2 input region of the convolutional hidden layer
- **Pooling Layers** have no learnable parameters, only *hyperparameters*: filter size and stride
- Output Shape after applying **Pooling Layer**:
$$ \lfloor \frac{n+2p-f}{s}+1 \rfloor $$
- A 24x24 convolutional hidden layer followed by a 2x2 **pooling layer** with a **stride** of 2 results in 12x12 neurons, since (24 - 2)/2 + 1 = 12
- *Max-pooling* is applied to each feature map separately: a 3x24x24 convolutional hidden layer results in a 3x12x12 max-pooling layer. The depth (number of channels) stays the same
- Other Pooling methods include *average-pooling* and *L2 pooling* where the square root of the sum of the squares of the activations in the 2×2 region is taken
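
The pooling variants listed above can be sketched as one small NumPy function. The function name `pool2d`, the `mode` strings, and the channel-first layout are assumptions for this example. With a 2x2 window, a stride of 2 turns 24x24 into 12x12, whereas a stride of 1 would give 23x23.

```python
import numpy as np

def pool2d(feature_maps, f=2, stride=2, mode="max"):
    """Pool each feature map separately; input shape is (channels, height, width)."""
    c, n_h, n_w = feature_maps.shape
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    out = np.empty((c, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, s = i * stride, j * stride
            window = feature_maps[:, r:r + f, s:s + f]
            if mode == "max":
                out[:, i, j] = window.max(axis=(1, 2))
            elif mode == "average":
                out[:, i, j] = window.mean(axis=(1, 2))
            elif mode == "l2":
                # square root of the sum of squared activations in the window
                out[:, i, j] = np.sqrt((window ** 2).sum(axis=(1, 2)))
            else:
                raise ValueError(f"unknown pooling mode: {mode}")
    return out

rng = np.random.default_rng(0)
maps = rng.random((3, 24, 24))
print(pool2d(maps).shape)             # (3, 12, 12)
print(pool2d(maps, stride=1).shape)   # (3, 23, 23)
```

Note that the function has nothing to train: `f`, `stride`, and `mode` are fixed choices.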

## Example CNN:
- **28x28 input layer (corresponding to image pixels)**
    - Convolutional Layer with 5×5 *local receptive field* and 3 feature maps: 3x5x5
        - **3×24×24 Hidden feature neurons layer**
            - *Max-Pooling* Layer with 2×2 regions across 3 feature maps: 3x2x2
                -  **3×12×12 Hidden feature neurons**
                    - FCNN, connects every neuron from the 3x12x12 *max-pooled* layer to the 100 neurons hidden layer...
- The convolutional and pooling layers learn local spatial structures, while the fully-connected layers learn  more abstract information from across the entire image
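
A forward pass through this example network can be sketched with NumPy to confirm the shapes at each stage. The random weights, the ReLU activation, and the variable names are assumptions made for illustration; the notes do not fix an activation function for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                      # 28x28 input layer

# Convolutional layer: 3 feature maps, each with a 5x5 kernel and one bias
kernels = rng.standard_normal((3, 5, 5)) * 0.1
conv_bias = np.zeros(3)
windows = sliding_window_view(image, (5, 5))      # (24, 24, 5, 5) receptive fields
conv = relu(np.einsum("ijkl,mkl->mij", windows, kernels)
            + conv_bias[:, None, None])           # (3, 24, 24)

# 2x2 max-pooling with stride 2, applied to each feature map separately
pooled = conv.reshape(3, 12, 2, 12, 2).max(axis=(2, 4))   # (3, 12, 12)

# Fully connected layer from all 3*12*12 pooled neurons to 100 hidden neurons
W = rng.standard_normal((100, 3 * 12 * 12)) * 0.01
b = np.zeros(100)
hidden = relu(W @ pooled.reshape(-1) + b)         # (100,)

print(conv.shape, pooled.shape, hidden.shape)
```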

---

## Padding:
- Hidden Layer output after applying a $f \times f$ *feature map* to $n \times n$ *Input*: $(n-f+1) \times (n-f+1)$
- Apply **Padding** to avoid shrinking the output and to make fuller use of the input information at the *edges*
- With Padding the *Output Shape* is: $(n+2p-f+1) \times (n+2p-f+1)$
- **Valid** Convolution: No padding, output shape is smaller than the input shape
- **Same** Convolution: Padding chosen so that the output shape equals the input shape
    - Choose an odd filter size $f$
    - Choice for padding $p$ is: $p = \frac{f-1}{2}$
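
The padding rule for a *same* convolution can be checked with a short sketch. The helper name `same_padding` is hypothetical, zero padding is assumed, and the stride is taken to be 1.

```python
import numpy as np

def same_padding(f):
    """Padding p = (f - 1) / 2 that keeps the output size equal to the input size."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2

n, f = 6, 3
p = same_padding(f)
padded = np.pad(np.ones((n, n)), p, mode="constant")   # zero padding on every side
out = padded.shape[0] - f + 1                          # (n + 2p - f + 1)
print(p, padded.shape, out)  # 1 (8, 8) 6
```

An even `f` would need unequal padding on the two sides, which is why odd filter sizes are the usual choice.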

## Stride:
- Formula for Output Shape when using $n \times n$ Input, $f \times f$ Filter, $p$ Padding, and $s$ Stride: $$\lfloor \frac{n+2p-f}{s}+1 \rfloor \times \lfloor\frac{n+2p-f}{s}+1 \rfloor$$
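
The output-shape formula translates directly into integer arithmetic. This is a small sketch with a hypothetical helper name; floor division `//` plays the role of the floor brackets.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(28, 5))            # 24  (no padding, stride 1)
print(conv_output_size(24, 2, s=2))       # 12  (2x2 pooling, stride 2)
print(conv_output_size(7, 3, s=2))        # 3
print(conv_output_size(6, 3, s=2))        # 2   (the floor discards the partial window)
print(conv_output_size(6, 3, p=1, s=1))   # 6   (same convolution)
```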

## Convolution Over Volume (RGB Images):
- Makes it possible to detect features in a single channel or in all channels at the same time, depending on the parameters in the *Filter* channels
- Input Shape of $n_h \times n_w \times n_c$ with Filter $f_h \times f_w \times f_c$, where the number of channels $f_c$ in the Filter has to match the number of channels $n_c$ in the Input
- The Output of each Filter is a *flat* (2D) matrix, with the results of the Convolution over all channels of the Volume added together
- With $n_c'$ Filters (no padding, stride of 1), the flat outputs are stacked into $n_c'$ output channels:
$$(n_h \times n_w \times n_c) * (f_h \times f_w \times f_c) \rightarrow (n_h-f_h+1) \times (n_w-f_w+1) \times n_c', \quad f_c = n_c $$
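
A sketch of convolution over a volume in NumPy, assuming no padding, a stride of 1, and a channels-last layout. The function name `conv_volume` and the filter layout `(n_filters, f_h, f_w, n_c)` are choices made for this example. Each filter spans all input channels and produces one flat output channel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_volume(x, filters):
    """x: (n_h, n_w, n_c); filters: (n_filters, f_h, f_w, n_c)."""
    n_filters, f_h, f_w, f_c = filters.shape
    if f_c != x.shape[2]:
        raise ValueError("filter channels must match input channels")
    # windows: (out_h, out_w, 1, f_h, f_w, n_c); drop the singleton channel axis
    windows = sliding_window_view(x, (f_h, f_w, f_c))[:, :, 0]
    # multiply each window by each filter and sum over height, width, and channels
    return np.einsum("ijklm,nklm->ijn", windows, filters)

rng = np.random.default_rng(0)
x = rng.random((6, 6, 3))                    # e.g. a tiny RGB image
filters = rng.standard_normal((2, 3, 3, 3))  # two 3x3x3 filters
out = conv_volume(x, filters)
print(out.shape)  # (4, 4, 2)
```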

---

## Convolution Layer:
$$ a^{[l]} = g(z^{[l]}) $$
where:
$$ z^{[l]} = w^{[l]} * a^{[l-1]} + b^{[l]} $$
where $*$ denotes the convolution operation and: <br>
<div style="text-align: center">
    $a^{[l-1]}$ is <em>Activation</em> of Input <br>
    $w^{[l]}$ is <em>Weights</em> of the Filter Map <br>
    $b^{[l]}$ is <em>Bias</em> of the Output
</div>

## Summary:
- **Input**: $n_h^{[l-1]} \times n_w^{[l-1]} \times n_c^{[l-1]}$
- **Filter Size**: $f^{[l]}$
    - $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$, where $n_c^{[l-1]}$ matches the number of channels in the Input
- **Padding**: $p^{[l]}$
- **Stride**: $s^{[l]}$
- **Output**: $n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]}$
<br><br>
where
$$ n_h^{[l]} = \lfloor \frac{n_h^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}} + 1 \rfloor$$
$$ n_w^{[l]} = \lfloor \frac{n_w^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}} + 1 \rfloor$$
<div style="text-align: center"> and $n_c^{[l]}$ is number of filters </div>
<br>
- **Activation**: $a^{[l]} \rightarrow n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]}$
- **Weights**: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$, where $n_c^{[l]}$ is the number of filters in Layer $l$
- **Bias**: $n_c^{[l]}$, same as number of filters in Layer $l$
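
The shapes in this summary can be collected in one helper. This is a sketch with hypothetical names, and the 39x39x3 input with ten 3x3 filters is an assumed example, not a layer from these notes.

```python
def conv_layer_summary(n_h_prev, n_w_prev, n_c_prev, f, n_filters, p=0, s=1):
    """Shapes and parameter count of one convolution layer l."""
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "activation": (n_h, n_w, n_filters),       # a^[l]
        "weights": (f, f, n_c_prev, n_filters),    # one f x f x n_c_prev filter per output channel
        "bias": (n_filters,),                      # one bias per filter
        "parameters": f * f * n_c_prev * n_filters + n_filters,
    }

summary = conv_layer_summary(39, 39, 3, f=3, n_filters=10)
print(summary["activation"])   # (37, 37, 10)
print(summary["weights"])      # (3, 3, 3, 10)
print(summary["parameters"])   # 280
```

The parameter count depends only on the filter size and channel counts, not on the height or width of the input.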

---

# Network Architecture:

## Classic Networks:
- LeNet-5: 
    - Convolution/Average Pooling Layers that reduce the size of input at each level
- AlexNet:
    - Much bigger than *LeNet-5* Network
    - First Conv Layer uses 96 Filters of size 11x11 with a Stride of 4, reducing the dimensions from 227x227x3 to 55x55x96
    - Uses Max Pooling between stages, and **same** padding in its later Convolution Layers
    - Uses **ReLU** as an activation function, and **Softmax** as an Output Layer
- VGG-16:
    - Uses simpler Network with similar Layers:
        - Conv Filters = 3x3 with stride = 1 and padding = same
        - Max-Pool = 2x2 with stride = 2
    - Convolution layer with *same* padding keeps the dimensions the same, while the Pooling layers reduce the dimension at each Level
    - Uniform Architecture where the *height* and *width* dimensions are halved by each Pooling layer, while the number of channels roughly doubles from one group of Convolution layers to the next (64, 128, 256, 512)
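
The dimension changes in these networks follow from the output-shape formula. The sketch below assumes the commonly published configurations, which are not all spelled out in these notes: 11x11 filters in AlexNet's first layer, and for VGG-16 a 224x224 input with five groups of 3x3 *same* convolutions, each followed by a 2x2 max-pool with stride 2.

```python
def out_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 along one dimension."""
    return (n + 2 * p - f) // s + 1

# AlexNet, first conv layer: 227x227x3 input, 96 filters of 11x11, stride 4
alexnet_first = (out_size(227, 11, s=4), out_size(227, 11, s=4), 96)
print(alexnet_first)  # (55, 55, 96)

# VGG-16 conv groups: (number of 3x3 same convolutions, number of filters)
vgg_groups = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
size, vgg_shapes = 224, []
for n_convs, channels in vgg_groups:
    for _ in range(n_convs):
        size = out_size(size, 3, p=1, s=1)   # same convolution keeps height/width
    size = out_size(size, 2, s=2)            # 2x2 max-pool with stride 2 halves them
    vgg_shapes.append((size, size, channels))
print(vgg_shapes)
# [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
```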

## ResNets: Residual Networks
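- Add a shortcut (skip connection) that carries $a^{[l]}$ past layer $l+1$ and adds it in before the activation of layer $l+2$
- The block can easily learn the *identity function*: when $w^{[l+2]}$ and $b^{[l+2]}$ go to zero, $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$ (for ReLU, since $a^{[l]} \geq 0$)
    - $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
    - $a^{[l+2]} = g(w^{[l+2]} \cdot a^{[l+1]} + b^{[l+2]} + a^{[l]})$
- Making the Network deeper doesn't have to hurt performance: as $w^{[l+2]}$ and $b^{[l+2]}$ approach 0, $a^{[l+2]}$ is just $a^{[l]}$, so the extra layers can fall back to passing their input through unchanged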
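
The identity behaviour of a residual block can be checked numerically. This is a minimal sketch that assumes fully connected layers (instead of convolutions) and a ReLU activation for $g$; the function and variable names are made up for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a shortcut: a[l+2] = g(W2 a[l+1] + b2 + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)       # a[l+1]
    z_l2 = W2 @ a_l1 + b2            # z[l+2]
    return relu(z_l2 + a_l)          # shortcut adds a[l] before the activation

rng = np.random.default_rng(0)
a_l = relu(rng.standard_normal(4))   # non-negative, as if produced by an earlier ReLU
W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)

# With the second layer's weights and bias at zero, the block is the identity
W2, b2 = np.zeros((4, 4)), np.zeros(4)
a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_l2, a_l))  # True
```

The shortcut requires $a^{[l]}$ and $z^{[l+2]}$ to have the same shape, which is why the sketch uses square weight matrices.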


## Inception Networks:
- Use several *types* of layers (Convolution or Pooling with same padding) and combine (stack) the output into a single $n_h \times n_w \times n_c$ Output
    - Can use 1x1 Convolutions to reduce the number of channels while keeping the height and width dimensions the same, using $1 \times 1 \times n_c$ filters. Placing a 1x1 Convolution before a larger $n \times n$ Convolution with *same* padding also reduces the computational cost compared to applying the $n \times n$ filters directly
    - Can use $n \times n$ filters with **same** padding to produce an Output of the same shape
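
A 1x1 convolution is a matrix multiplication over the channel axis at every pixel, which makes both the channel reduction and the cost saving easy to sketch. The sizes below (a 28x28x192 input, a 16-channel bottleneck, 32 filters of 5x5) and all names are illustrative assumptions, not values from these notes.

```python
import numpy as np

def conv1x1(x, weights, bias):
    """x: (n_h, n_w, n_c); weights: (n_c, n_out); bias: (n_out,)."""
    return x @ weights + bias   # mixes channels, leaves height and width untouched

def conv_multiplications(n_h, n_w, n_c_in, f, n_filters):
    """Multiplications for an f x f same convolution: output positions x filter volume."""
    return n_h * n_w * n_filters * f * f * n_c_in

rng = np.random.default_rng(0)
x = rng.random((28, 28, 192))

# Two 1x1 branches stacked along the channel axis, as in an inception-style module
branch_a = conv1x1(x, rng.standard_normal((192, 64)), np.zeros(64))
branch_b = conv1x1(x, rng.standard_normal((192, 32)), np.zeros(32))
stacked = np.concatenate([branch_a, branch_b], axis=-1)
print(stacked.shape)  # (28, 28, 96)

# Cost of a direct 5x5 convolution vs. a 1x1 bottleneck followed by the 5x5
direct = conv_multiplications(28, 28, 192, 5, 32)
bottleneck = (conv_multiplications(28, 28, 192, 1, 16)
              + conv_multiplications(28, 28, 16, 5, 32))
print(direct, bottleneck)  # 120422400 12443648
```

Stacking only works because every branch keeps the same height and width, which is what *same* padding guarantees for the larger filters.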

---