## CNN Building Blocks
CNNs are built by stacking a sequence of layers, where each layer is responsible for a given
task. The most common layer types are:
- Convolutional (CONV)
- Activation (ACT or RELU, where we use the name of the actual activation function)
- Pooling (POOL)
- Fully-connected (FC)
- Batch normalization (BN)
- Dropout (DO)

A CNN can be described with a simple text diagram:
- INPUT => CONV => RELU => FC => SOFTMAX

### Convolutional layer
- The CONV layer is the core building block of a Convolutional Neural Network.
It consists of a set of K learnable filters (i.e., “kernels”), where each filter has a width and a
height and is nearly always square.
- For inputs to the CNN, the depth is the number of channels in the image (i.e., a depth of three
when working with RGB images, one for each channel). For volumes deeper in the network, the
depth will be the number of filters applied in the previous layer.

Figure 11.6:
Left: At each convolutional layer in a CNN, there are K kernels applied to the input
volume. Middle: Each of the K kernels is convolved with the input volume. Right: Each kernel
produces a 2D output, called an activation map.

Figure 11.7: After obtaining the K activation maps, they are stacked together to form the input
volume to the next layer in the network.


- Note: Every entry in the output volume is thus an output of a neuron that “looks” at only a small
region of the input. In this manner, the network “learns” filters that activate when they see a specific
type of feature at a given spatial location in the input volume. In lower layers of the network, filters
may activate when they see edge-like or corner-like regions.
Then, in the deeper layers of the network, filters may activate in the presence of high-level
features, such as parts of the face, the paw of a dog, the hood of a car, etc.


As a volume moves deeper into the network, its spatial dimensions are reduced to a smaller size, but its depth becomes larger, because more filters are used deeper in the network.

When working with images, it’s often impractical to connect neurons in the current
volume to all neurons in the previous volume – there are simply too many connections and too
many weights, making it impossible to train deep networks on images with large spatial dimensions.
Instead, when utilizing CNNs, we choose to connect each neuron to only a local region of the input
volume – we call the size of this local region the "receptive field" (or simply, the variable F) of the
neuron.


There are three parameters that control the size of an output volume: the depth, the stride, and the zero-padding size.
- Depth: The depth of an output volume controls the number of neurons (i.e., filters) in the CONV layer that
connect to a local region of the input volume.

- Stride:
    - Convolution can be described as “sliding” a small matrix across a large matrix, stopping at each coordinate, computing an element-wise multiplication and sum, then storing the output. In a CNN, at each stop we create a new depth column around the local region of the image, where we convolve each of the K filters with the region and store the output in a 3D volume.
    - The stride S is the step size of that slide. With S > 1 we skip pixels between stops, which reduces the spatial dimensions of the output volume relative to the input volume.
    - Example: with S = 2 we skip two pixels at a time (two pixels along the x-axis and two pixels along the y-axis), thus producing a smaller output volume.

- Zero-padding: We “pad” the borders of the input volume with zeros so that the output of a
convolution retains the original spatial size.
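
The sliding, stride, and zero-padding ideas above can be sketched in NumPy. This is a minimal single-channel, single-filter illustration, not how a deep learning library implements it; the function name `conv2d_single` and the toy 5x5 input are our own assumptions. Like most CNN libraries, the sketch does not flip the kernel (strictly speaking, it computes a cross-correlation).

```python
import numpy as np

def conv2d_single(image, kernel, stride=1, pad=0):
    """Naive 2D convolution of one 2D image with one square F x F kernel."""
    f = kernel.shape[0]
    # Zero-padding: surround the image with `pad` rows/columns of zeros.
    padded = np.pad(image, pad, mode="constant", constant_values=0)
    h, w = padded.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Stride: move `stride` pixels between consecutive stops.
            y, x = i * stride, j * stride
            region = padded[y:y + f, x:x + f]
            # Element-wise multiplication followed by a sum.
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3))                          # toy 3x3 filter

same = conv2d_single(image, kernel, stride=1, pad=1)     # shape (5, 5)
strided = conv2d_single(image, kernel, stride=2, pad=1)  # shape (3, 3)
no_pad = conv2d_single(image, kernel, stride=1, pad=0)   # shape (3, 3)
```

With P = 1 and S = 1 the 5x5 input keeps its spatial size, while S = 2 or P = 0 shrinks it to 3x3.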

Putting all these parameters together, we can compute the size of an output volume as a function
of the input volume size (W, assuming the input images are square, which they nearly always are),
the receptive field size F, the stride S, and the amount of zero-padding P.
- To construct a valid CONV layer, we need to ensure the following expression evaluates to an integer:
    ((W - F + 2P)/S) + 1

To summarize, the CONV layer accepts an input volume of size W_input x H_input x D_input (it’s common to see W_input = H_input).
- It requires four parameters:
    1. The number of filters K (which controls the depth of the output volume).
    2. The receptive field size F (the size of the K kernels used for convolution, which are nearly always square, yielding an F x F kernel).
    3. The stride S.
    4. The amount of zero-padding P.
- The output of the CONV layer is then W_output x H_output x D_output, where:
    - W_output = ((W_input - F + 2P)/S) + 1
    - H_output = ((H_input - F + 2P)/S) + 1
    - D_output = K
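
The output-size formulas above translate directly into a small helper. This is a sketch under our own naming (`conv_output_shape` is hypothetical, and the example layer sizes are illustrative values, not taken from the text). It raises an error when the expression is not an integer, i.e., when the CONV layer is not valid.

```python
def conv_output_shape(w_in, h_in, k, f, s, p):
    """Return (W_output, H_output, D_output) for a CONV layer.

    k: number of filters, f: receptive field size,
    s: stride, p: amount of zero-padding.
    """
    w_num = w_in - f + 2 * p
    h_num = h_in - f + 2 * p
    # A valid layer needs (W - F + 2P) / S to be an integer.
    if w_num % s != 0 or h_num % s != 0:
        raise ValueError("(W - F + 2P) / S is not an integer for these parameters")
    return (w_num // s + 1, h_num // s + 1, k)

print(conv_output_shape(32, 32, k=16, f=3, s=1, p=1))    # (32, 32, 16)
print(conv_output_shape(227, 227, k=96, f=11, s=4, p=0)) # (55, 55, 96)
```

For example, a 224x224 input with F = 11, S = 4, P = 0 is rejected, because (224 - 11)/4 is not an integer.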
    

### Activation Layers
After each CONV layer in a CNN, we apply a nonlinear activation function, such as ReLU or ELU.

Activation layers are not technically “layers” (because no parameters/weights are
learned inside an activation layer).

An activation layer accepts an input volume of size W_input x H_input x D_input and then applies the
given activation function.

Figure 11.9: An example of an input volume going through a ReLU activation, max(0, x)

Since the activation function is applied in an element-wise manner, the output of an activation layer always has the same dimensions as its input: W_input = W_output, H_input = H_output, D_input = D_output.
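
As a quick illustration, here is a minimal NumPy sketch of a ReLU activation applied to a small volume. The toy values are made up for the example; the point is only that the output shape matches the input shape.

```python
import numpy as np

def relu(volume):
    """Element-wise ReLU: max(0, x) for every entry of the volume."""
    return np.maximum(0, volume)

# Toy 2 x 2 x 2 volume with a mix of negative and positive values.
volume = np.array([[[-1.0, 2.0], [3.0, -4.0]],
                   [[0.5, -0.5], [-2.0, 1.0]]])
activated = relu(volume)

print(volume.shape, activated.shape)  # (2, 2, 2) (2, 2, 2)
```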


### Pooling Layers
There are two methods to reduce the size of an input volume: CONV layers with a stride > 1 and POOL layers.

It is common to insert POOL layers in-between consecutive CONV layers in a CNN architecture:
- INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC

The primary function of the POOL layer is to reduce the spatial size (i.e., width
and height) of the input volume.
- Doing this allows us to reduce the number of parameters and the amount of computation in the network. Pooling also helps us control overfitting.
- The most common type of POOL layer is max pooling, which is typically done in the middle of the CNN. The other common type is average pooling, which is normally used as the final layer of the network.

Figure 11.10: Left: Our input 4x4 volume. Right: Applying 2x2 max pooling with a stride of
S = 1. Bottom: Applying 2x2 max pooling with S = 2 – this dramatically reduces the spatial
dimensions of our input.

POOL layers accept an input volume of size W_input x H_input x D_input.
They then require two parameters:
- The receptive field size F (also called the “pool size”).
- The stride S.

Applying the POOL operation yields an output volume of size W_output x H_output x D_output, where:
- W_output = ((W_input - F)/S) + 1
- H_output = ((H_input - F)/S) + 1
- D_output = D_input
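
The pooling formulas can be checked with a naive max-pooling sketch in NumPy. The function name `max_pool2d` and the toy 4x4 input are assumptions for illustration, and the array is indexed as (height, width, depth). Each depth slice is pooled independently, which is why the depth is unchanged.

```python
import numpy as np

def max_pool2d(volume, f=2, s=2):
    """Naive max pooling over an (H, W, D) volume with pool size f and stride s."""
    h, w, d = volume.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.zeros((out_h, out_w, d), dtype=volume.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i * s:i * s + f, j * s:j * s + f, :]
            # Take the maximum over the spatial window, separately per depth slice.
            out[i, j, :] = window.max(axis=(0, 1))
    return out

volume = np.arange(16, dtype=float).reshape(4, 4, 1)  # toy 4x4 volume, depth 1

pooled_s1 = max_pool2d(volume, f=2, s=1)  # shape (3, 3, 1)
pooled_s2 = max_pool2d(volume, f=2, s=2)  # shape (2, 2, 1)
```

With F = 2, the 4x4 input becomes 3x3 at S = 1 and 2x2 at S = 2, matching ((4 - 2)/S) + 1.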

### Fully-connected Layers
Neurons in FC layers are fully-connected to all activations in the previous layer, as is the standard for
feedforward neural networks.

FC layers are always placed at the end of the network. It’s common to use one or two FC layers prior to applying the softmax classifier, as the following
(simplified) architecture demonstrates:

INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC => FC

Here we apply two fully-connected layers before our (implied) softmax classifier which will
compute our final output probabilities for each class.
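
A rough NumPy sketch of the tail end of such a network: flatten the last volume, pass it through two FC layers, and apply softmax. All sizes (a 4x4x8 volume, 32 hidden units, 3 classes), the random weights, and the ReLU between the FC layers are assumptions made for the example, not values from the text.

```python
import numpy as np

def fully_connected(x, weights, bias):
    """Every output neuron is connected to every input activation."""
    return x @ weights + bias

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
volume = rng.standard_normal((4, 4, 8))  # hypothetical output of the last POOL layer
x = volume.reshape(-1)                   # flatten to a vector of 4 * 4 * 8 = 128 values

w1, b1 = rng.standard_normal((128, 32)) * 0.1, np.zeros(32)  # first FC layer
w2, b2 = rng.standard_normal((32, 3)) * 0.1, np.zeros(3)     # second FC layer (3 classes)

hidden = np.maximum(0, fully_connected(x, w1, b1))  # FC => RELU
probs = softmax(fully_connected(hidden, w2, b2))    # FC => SOFTMAX
```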

### Batch Normalization
Batch normalization (BN) is used to normalize the activations of a given input volume before
passing it into the next layer in the network.

It is extremely effective at reducing the number of epochs it takes to train a neural network.

Applying batch normalization to our network architectures can help prevent overfitting and allows us to obtain significantly higher
classification accuracy in fewer epochs compared to the same network architecture without batch
normalization.

The biggest drawback of batch normalization is that it can actually slow down the wall time it
takes to train your network (even though you’ll need fewer epochs to obtain reasonable accuracy)
by 2-3x due to the computation of per-batch statistics and normalization.

In practice, placing the BN after the RELU often yields slightly higher accuracy and lower loss:

INPUT => CONV => RELU => BN ...
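
To make the per-batch statistics concrete, here is a minimal sketch of the training-time batch normalization computation in NumPy. The scale and shift parameters `gamma` and `beta` and the small constant `eps` are the standard ingredients of batch normalization but are not described in the text above, so treat them as assumptions; the running averages used at test time are omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) array using statistics of the current batch."""
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
activations = rng.standard_normal((64, 10)) * 5.0 + 3.0  # toy batch: 64 samples, 10 features
out = batch_norm_train(activations, gamma=np.ones(10), beta=np.zeros(10))
```

The extra mean, variance, and normalization steps on every mini-batch are the source of the additional wall time mentioned above.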

### Dropout

Dropout is a form of regularization that aims to prevent overfitting,
improving testing accuracy, perhaps at the expense of training accuracy.

For each mini-batch in our training set, dropout layers randomly disconnect, with probability p,
inputs from the preceding layer to the next layer in the network architecture.

Randomly dropping connections ensures that no single node in the network is responsible
for “activating” when presented with a given pattern. Instead, dropout ensures there are
multiple, redundant nodes that will activate when presented with similar inputs – this in
turn helps our model to generalize.

It is most common to place dropout layers with small probabilities p in-between FC layers of an architecture
where the final FC layer is assumed to be our softmax classifier:
... CONV => RELU => POOL => FC => DO => FC => DO => FC
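
A minimal NumPy sketch of a dropout layer follows. It uses the common "inverted dropout" variant, which rescales the kept activations by 1 / (1 - p) during training so that nothing needs to change at test time; that rescaling is an assumption on our part and is not described in the text above.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Randomly zero each input with probability p (training only)."""
    if not training or p == 0.0:
        return x  # dropout is disabled at test time
    mask = rng.random(x.shape) >= p  # True (keep) with probability 1 - p
    return x * mask / (1.0 - p)      # inverted dropout: rescale the kept activations

rng = np.random.default_rng(0)
activations = np.ones(10_000)  # toy activations from an FC layer
dropped = dropout(activations, p=0.5, rng=rng)
unchanged = dropout(activations, p=0.5, rng=rng, training=False)
```

A new random mask is drawn on every call, so a different set of connections is dropped for each mini-batch.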


## Layer Patterns
The most common form of CNN architecture is to stack a few CONV and RELU layers, following them with a POOL operation.
We repeat this sequence until the volume width and height are small, at which point we apply one or more FC layers.
Therefore, we can derive the most common CNN architecture pattern:

INPUT => [[CONV => RELU]*N => POOL?]*M => [FC => RELU]*K => FC
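
In this pattern, `*` means repetition and `?` marks an optional layer. As a small illustration, the following sketch expands the pattern into a concrete text diagram for given N, M, and K; the helper name `build_pattern` and its `pool` flag are our own.

```python
def build_pattern(n, m, k, pool=True):
    """Expand INPUT => [[CONV => RELU]*N => POOL?]*M => [FC => RELU]*K => FC."""
    conv_block = ["CONV", "RELU"] * n + (["POOL"] if pool else [])
    layers = ["INPUT"] + conv_block * m + ["FC", "RELU"] * k + ["FC"]
    return " => ".join(layers)

print(build_pattern(n=1, m=2, k=1))
# INPUT => CONV => RELU => POOL => CONV => RELU => POOL => FC => RELU => FC
```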

- Common rules of thumb when constructing your own CNNs:
    - Input layer should be square (e.g., 32x32).
    - Input layer should also be divisible by two multiple times, so that the volume can be repeatedly downsampled by POOL layers.
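
A quick way to check the second rule of thumb is to count how many times an input size can be halved evenly. This is only an illustrative helper with a name of our own choosing.

```python
def times_divisible_by_two(size):
    """Count how many times `size` can be halved while staying an integer."""
    count = 0
    while size > 1 and size % 2 == 0:
        size //= 2
        count += 1
    return count

print(times_divisible_by_two(32))   # 5
print(times_divisible_by_two(227))  # 0
```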
