# Conv Nets
MatMul complexity in a fully connected network scales with input size, i.e. it becomes expensive and uses a lot of parameters for large inputs such as images

Architecture - the neurons of each layer are arranged in 3 dimensions
* Height
* Width
* Depth

## Conv Layer
Parameters: Learnable filters
* Conv layer has many filters

### A Filter 
3 dimensions [H, W, K]
* H: Height
* W: Width
* K: Channels, ex RGB for images (K=3)

Forward pass of network
* Each filter slides (convolves) across height and width of input
* Produces 2D activation map
    * Learn filters that activate when they see edges, or similar features
* Stack the 2D activation maps of all filters along the depth dimension $\rightarrow$ 3D output volume
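
The forward pass described above can be sketched with plain loops. This is a minimal, unoptimized illustration rather than how a real framework implements it; the function name `conv_forward`, the `(num_filters, M, N, K)` filter layout, and the `stride`/`pad` arguments are assumptions made for this example.

```python
import numpy as np

def conv_forward(x, filters, biases, stride=1, pad=0):
    """Naive convolution forward pass (illustrative sketch).

    x:       input volume, shape (H, W, K)
    filters: shape (num_filters, M, N, K)  (layout assumed for this sketch)
    biases:  shape (num_filters,)
    """
    H, W, K = x.shape
    num_filters, M, N, Kf = filters.shape
    assert K == Kf, "filter depth must match input depth"

    # Zero-pad height and width only, never the depth axis.
    x_padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))

    out_h = (H - M + 2 * pad) // stride + 1
    out_w = (W - N + 2 * pad) // stride + 1
    out = np.zeros((out_h, out_w, num_filters))

    for f in range(num_filters):          # one 2D activation map per filter
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                patch = x_padded[r:r + M, c:c + N, :]
                # Dot product between the filter and the local input region.
                out[i, j, f] = np.sum(patch * filters[f]) + biases[f]
    return out  # activation maps stacked along the last (depth) axis
```

For a `(32, 32, 3)` input and four `5x5x3` filters with stride 1 and no padding, the returned volume has shape `(28, 28, 4)`.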

### Local connectivity and Receptive Fields
Each neuron is connected to only a small region of the input volume. That region is called the neuron's receptive field. It is local along height and width, but always spans the full depth of the input.


#### Example 1
Input vol [32x32x3] = [H, W, K]
* Filter size [5x5] = [M, N]
* Each neuron is connected to a [M, N, K] = [5x5x3] region of the volume
    * Number of weights: $5 \cdot 5 \cdot 3 = 75$ + a bias parameter (76)


#### Example 2
Input vol [16x16x20] = [H, W, K]
* Filter size [3x3] = [M, N]
* Each neuron is connected to a [M, N, K] = [3x3x20] region of the volume
    * Number of weights: $3 \cdot 3 \cdot 20 = 180$ + a bias parameter (181)
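
The arithmetic in both examples can be checked with a small helper. The name `params_per_neuron` is made up for this sketch.

```python
def params_per_neuron(m, n, k, bias=True):
    """Weights for an [M, N, K] receptive field, plus one optional bias."""
    return m * n * k + (1 if bias else 0)

print(params_per_neuron(5, 5, 3))    # Example 1: 75 weights + 1 bias = 76
print(params_per_neuron(3, 3, 20))   # Example 2: 180 weights + 1 bias = 181
```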

### Spatial Arrangement
How many neurons per layer - and how are they arranged?

Output size is determined by three parameters:
* Depth ~ Number of filters 
* Stride ~ How many pixels the filter moves per step; a stride of 1 means no positions are skipped. Higher stride values mean a smaller output size
* Zero-Padding ~ Pads the border of the input with zeros, which lets us control the output volume size, e.g. so that it matches the input size.

Let the following be all the parameters
* $W$: Input volume size (along one spatial dimension, i.e. width or height)
* $F$: Receptive field size of neurons
* $S$: Stride
* $P$: Amount of zero padding used

Output size is then $$size = \frac{W-F+2P}{S} + 1$$

#### Example
Input size 7x7, Filter size 3x3, S = 1, No padding

$$size = \frac{7 - 3 + 2\cdot 0}{1} + 1 = 4 + 1 = 5 \text{ i.e.} \ 5\times 5$$ 

Input size 7x7, Filter size 3x3, S = 2, No padding

$$size = \frac{7 - 3 + 2\cdot 0}{2} + 1 = 2 + 1 = 3 \text{ i.e.} \ 3\times 3$$ 
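
The output size formula can be turned into a small helper to check such examples. This is a sketch: the name `conv_output_size` is made up, and raising an error when the stride does not divide $W - F + 2P$ evenly is a choice made here (some libraries round down instead).

```python
def conv_output_size(w, f, s=1, p=0):
    """Output size along one spatial dimension: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        # The filter does not fit an integer number of times across the input.
        raise ValueError(f"stride {s} does not evenly divide W - F + 2P = {span}")
    return span // s + 1

print(conv_output_size(7, 3, s=1, p=0))  # 5
print(conv_output_size(7, 3, s=2, p=0))  # 3
```

With `w=7, f=3, s=3, p=0` the helper raises, since $(7 - 3) / 3$ is not an integer.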



### Parameter Sharing
Idea: If one set of weights is good at identifying some feature at some region in the input, it is probably also useful at other regions, so it's a good idea to reuse those same parameters for the neurons at all other positions of the same filter
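
To get a feel for the savings, the sketch below counts parameters for a single activation map with and without sharing. It assumes the setup of Example 1 (32x32x3 input, 5x5 filter) with stride 1 and no padding, which gives a 28x28 map; the function name and arguments are made up for illustration.

```python
def activation_map_params(out_h, out_w, m, n, k, shared=True):
    """Parameter count for one activation map of size out_h x out_w."""
    per_neuron = m * n * k + 1  # weights + bias
    if shared:
        # Every neuron in the map reuses the same filter.
        return per_neuron
    # Hypothetical alternative: every neuron has its own weights and bias.
    return out_h * out_w * per_neuron

print(activation_map_params(28, 28, 5, 5, 3, shared=True))   # 76
print(activation_map_params(28, 28, 5, 5, 3, shared=False))  # 59584
```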

## Pooling Layer
Reduce the spatial size of the representation, to reduce the number of parameters used down the line in the network.

Max pooling is the most common variant, typically with a 2x2 window and a stride of 2. Each window keeps only the largest of its 4 values, so this discards 75% of the activations.

Concretely:
* Input ~ Volume of size $W_1 \times H_1 \times D_1$
* Hyperparams
    * Spatial extent $F$, often $2$ (a $2\times 2$ window)
    * Stride $S$, often $2$
* Output ~ Volume of size $W_2 \times H_2 \times D_2$
    * $W_2 = \frac{W_1 - F}{S} +1$
    * $H_2 = \frac{H_1 - F}{S} +1$
    * $D_2 = D_1$
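
A minimal NumPy sketch of max pooling following the sizes above. The function name `max_pool` and the `(W, H, D)` axis order are assumptions for this example; each depth slice is pooled independently, which is why $D_2 = D_1$.

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling over a volume of shape (W1, H1, D1) (illustrative sketch)."""
    w1, h1, d1 = x.shape
    w2 = (w1 - f) // stride + 1
    h2 = (h1 - f) // stride + 1
    out = np.empty((w2, h2, d1), dtype=x.dtype)
    for i in range(w2):
        for j in range(h2):
            r, c = i * stride, j * stride
            # Max over the f x f spatial window, separately for each depth slice.
            out[i, j, :] = x[r:r + f, c:c + f, :].max(axis=(0, 1))
    return out

x = np.arange(16).reshape(4, 4, 1)
print(max_pool(x)[:, :, 0])  # [[ 5  7]
                             #  [13 15]]
```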