## Different Famous Networks :
 1. **Classic networks:**
  * LeNet-5
  * AlexNet
  * VGG
 2. **ResNet**
 
 
# Classic Networks : 
## 1. LeNet-5 :
 1. Typical architectural flow for LeNet-5 :
<img src='./imgs/LeNet.png' height="100%" width="100%">
 2. It was mostly used for handwritten digit classification, i.e. grayscale images of the digits 0-9.
 3. Back then, people preferred _Average Pooling_ over _Max Pooling_. The number of parameters was also very small (~60k) compared to modern networks (often millions).
 4. We can see the general trend in the network: $n_H, n_W$ decrease as we go deeper, while the number of channels $n_C$ increases. Even modern implementations follow the pattern **Conv -> Pool -> Conv -> Pool -> FC -> FC -> O/P**. For the output, people today would use a softmax function (in this case with 10 o/ps).
 5. Back then people rarely used padding, unlike now. Also, to save computation, the filters did not span all $n_C$ input channels the way f x f x $n_C$ filters do today; different filters looked at different subsets of the channels.
 6. Research Paper : [LeCun et al., 1998. Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-99.pdf) (Useful: focus on sections 2 and 3.)
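
The shrinking $n_H, n_W$, the growing $n_C$ and the ~60k parameter figure can be checked with a little arithmetic. The sketch below assumes the commonly quoted LeNet-5 layer sizes (32x32x1 input, 5x5 conv filters with stride 1 and no padding, 2x2 pooling with stride 2, then FC layers of 120 and 84 units). It also assumes modern-style filters that span all input channels, which is a simplification of the original paper. The names `conv_out`, `layers` and `fc_sizes` are illustrative only.

```python
def conv_out(n, f, p=0, s=1):
    """Output height/width of a conv or pool layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# (name, filter size f, stride s, number of filters); None marks a pooling layer
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]

n, n_c = 32, 1            # assumed input: 32 x 32 x 1 grayscale image
shapes = [(n, n, n_c)]
params = 0
for name, f, s, n_filters in layers:
    if n_filters is not None:
        # f*f*n_c weights plus one bias for each filter
        params += (f * f * n_c + 1) * n_filters
        n_c = n_filters
    n = conv_out(n, f, s=s)
    shapes.append((n, n, n_c))

# Fully connected part: flattened volume -> 120 -> 84 -> 10 outputs
fc_sizes = [n * n * n_c, 120, 84, 10]
for fan_in, fan_out in zip(fc_sizes, fc_sizes[1:]):
    params += fan_in * fan_out + fan_out

print(shapes)  # [(32, 32, 1), (28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
print(params)  # 61706
```

Under these assumptions the count lands at roughly 60k, and almost all of it sits in the first fully connected layer.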
 
## 2. AlexNet :
 1. Typical Architectural flow for AlexNet :
<img src='./imgs/AlexNet.png' height="100%" width="100%">
 2. It is very similar to LeNet-5, but much bigger, with about 60 million parameters.
 3. It used ReLU activations.
 4. GPUs were still slow at the time, so the network was trained across multiple GPUs.
 5. It used Local Response Normalisation (LRN). For example, given a 13x13x256 volume, LRN takes one spatial position, say (5,4), and normalises the activations at that position across the 256 channels. Researchers later found that this doesn't help much.
 6. Research Paper : [Krizhevsky et al., 2012. ImageNet Classification with deep convolutional neural networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) (easy to read and understand)
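
As a rough illustration of LRN, here is a minimal NumPy sketch. It follows the formula from the AlexNet paper, where each activation is divided by a term built from the squared activations of the `n` neighbouring channels at the same spatial position. The constants `k=2`, `n=5`, `alpha=1e-4`, `beta=0.75` are the ones reported in that paper. The function name `local_response_norm` and the `(n_H, n_W, n_C)` layout are assumptions made for this example.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalise each activation by its neighbouring channels.

    a has shape (n_H, n_W, n_C); the output has the same shape.
    """
    n_c = a.shape[-1]
    squared = a ** 2
    out = np.empty_like(a, dtype=float)
    half = n // 2
    for i in range(n_c):
        lo = max(0, i - half)
        hi = min(n_c - 1, i + half)
        # Sum of squares over the channel window, for every (h, w) position at once
        denom = (k + alpha * squared[..., lo:hi + 1].sum(axis=-1)) ** beta
        out[..., i] = a[..., i] / denom
    return out

a = np.ones((13, 13, 256))
b = local_response_norm(a)
print(b.shape)  # (13, 13, 256)
```

Note that the paper's version normalises over a small window of adjacent channels rather than all 256 at once; the window size `n` controls this.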
 
## 3. VGG-16 :
 1. Typical Architectural design for VGG-16 :
<img src='./imgs/VGG-16.png' height="100%" width="100%">
 2. Layer Contains :
  * CONV = 3x3 filter, s=1, padding = 'same'
  * MAX-POOL = 2x2, s=2
 3. It is a very simple, uniform architecture built from just these conv and pooling layers, yet it has a very large number of parameters (~138 million).
 4. Each pooling layer halves the height and width of the volume, while each new group of conv layers roughly doubles the number of channels (until it reaches 512). The number of conv layers per group also grows in the later groups.
 5. Output was again 1000 different labels identified via Softmax function.
 6. Research Paper : [Simonyan & Zisserman, 2015. Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556)
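
The ~138 million figure can be reproduced with a short count. The sketch below assumes the standard VGG-16 configuration from the paper (224x224x3 input, conv groups of 64, 128, 256, 512 and 512 channels, then FC layers of 4096, 4096 and 1000 units). The variable names are illustrative only.

```python
# Numbers are conv output channels (3x3, s=1, 'same'); "P" is a 2x2 max-pool with s=2
cfg = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]

n, n_c = 224, 3   # assumed input: 224 x 224 x 3 image
params = 0
for layer in cfg:
    if layer == "P":
        n //= 2                                 # pooling halves n_H and n_W
    else:
        params += (3 * 3 * n_c + 1) * layer     # weights + one bias per filter
        n_c = layer                             # 'same' padding keeps n_H, n_W

final_shape = (n, n, n_c)

# Fully connected part: flattened volume -> 4096 -> 4096 -> 1000 (softmax)
fc_sizes = [n * n * n_c, 4096, 4096, 1000]
for fan_in, fan_out in zip(fc_sizes, fc_sizes[1:]):
    params += fan_in * fan_out + fan_out

print(final_shape)  # (7, 7, 512)
print(params)       # 138357544
```

Most of the parameters come from the first fully connected layer, which alone accounts for over 100 million of them.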
 
# ResNets (Residual Network) : 
## Idea :
 1. Very deep neural networks can suffer from vanishing/exploding gradients, which makes training very difficult.
 2. Hence, **the idea in ResNet is to take an activation, say $a^{[l]}$, skip over 1 or 2 layers, and add it to the linear output of a later layer just before the non-linearity is applied, giving $a^{[l+2]}$ or $a^{[l+3]}$ (depending on how many layers you skip). This is called a _'shortcut'_ or _'skip connection'_**.
 3. Architectural understanding :-
 <img src='./imgs/resnet_1.png' height="30%" width="30%">
 \begin{align}
 z^{[l+1]} &= W^{[l+1]}a^{[l]}+b^{[l+1]} \\
 a^{[l+1]} &= g(z^{[l+1]}) \\
 z^{[l+2]} &= W^{[l+2]}a^{[l+1]}+b^{[l+2]} \\
 a^{[l+2]} &= g(z^{[l+2]}+a^{[l]}) \\
 \end{align}
 4. Thus for a bigger NN :
 <img src='./imgs/resnet_2.png' height="75%" width="75%">
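 As we make a plain NN deeper, the training error should in theory only decrease. In practice, because of the vanishing/exploding gradient problem, the optimisation algorithm has a harder time training a very deep plain network, so the training error starts to go up again. Residual blocks make it possible to train much deeper networks without this degradation.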
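
The four residual block equations above translate almost line by line into NumPy. This is a minimal sketch for fully connected layers, assuming $g$ is ReLU and that both layers keep the same width so the addition is valid. The names `relu` and `residual_block` are made up for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block; a_l has shape (n, 1) and both layers keep width n."""
    z1 = W1 @ a_l + b1        # z^{[l+1]}
    a1 = relu(z1)             # a^{[l+1]}
    z2 = W2 @ a1 + b2         # z^{[l+2]}
    return relu(z2 + a_l)     # a^{[l+2]}: shortcut is added before the non-linearity

rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal((n, 1)))
W1, W2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b1, b2 = np.zeros((n, 1)), np.zeros((n, 1))

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2.shape)  # (4, 1)
```

The only difference from a plain two-layer block is the `+ a_l` inside the final `relu`.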

## Why it works ?
 * Say, x -> Big NN -> $a^{[l]}$
 * Also say, x -> Big NN -> $a^{[l]}$ -> ResNet -> $a^{[l+2]}$
  * $a^{[l+2]} = g(z^{[l+2]}+a^{[l]}) = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})$
  * If, say, regularisation shrinks the weights so that $W^{[l+2]} = 0, b^{[l+2]} = 0$, then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$. The last step holds when $g$ is ReLU, because $a^{[l]}$ is itself the output of a ReLU and so is non-negative. Thus it's easy for a residual block to learn the **identity function**.
  * Hence, adding a residual block doesn't hurt performance. But we also want it to help, and it can: the layers inside the block may learn something useful, which does better than the identity function alone.
  * It is not a bad idea to add a few residual blocks in the middle or at the end of a network to help build a deeper NN.
  * It is also important to note that the **dimensions of $z^{[l+2]}$ must match those of $a^{[l]}$ for the addition to work, which is why padding='same' is used**.
   * If they don't match, say $z^{[l+2]}$ has dimension (256,1) and $a^{[l]}$ has dimension (128,1), introduce a matrix $W_s$ of dimension (256,128) and multiply it with $a^{[l]}$, so that $W_s a^{[l]}$ has dimension (256,1) and can be added to $z^{[l+2]}$.
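
Both points, the identity behaviour and the $W_s$ projection, can be checked numerically. The sketch below assumes $g$ is ReLU and uses random stand-ins for the activations. All variable names are illustrative, and $W_s$ is random here purely to show the shapes.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

rng = np.random.default_rng(0)

# a^{[l]} comes out of a ReLU, so it is non-negative
a_l = relu(rng.standard_normal((128, 1)))
a_l1 = relu(rng.standard_normal((128, 1)))   # stand-in for a^{[l+1]}

# Case 1: W^{[l+2]} = 0 and b^{[l+2]} = 0, so the block reduces to the identity
W2 = np.zeros((128, 128))
b2 = np.zeros((128, 1))
a_l2 = relu(W2 @ a_l1 + b2 + a_l)
is_identity = np.allclose(a_l2, a_l)
print(is_identity)  # True

# Case 2: z^{[l+2]} is (256, 1) but a^{[l]} is (128, 1), so project with W_s
z_l2 = rng.standard_normal((256, 1))         # stand-in for z^{[l+2]}
W_s = rng.standard_normal((256, 128))        # random here; only the shape matters
a_l2_proj = relu(z_l2 + W_s @ a_l)
print(a_l2_proj.shape)  # (256, 1)
```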
   
# 1 x 1 Convolution :
## What does it do ? 
 1. Consider the following :
<img src='./imgs/1x1_conv1.png' height="60%" width="60%">

  * The operation is pretty straightforward: with a single-channel input, the one number in the filter (2 in our case) multiplies every element of the 6x6x1 matrix, and the o/p is a matrix with the same $n_H$ and $n_W$.
  * For an input with, say, 32 channels, the filter is 1x1x32. At each spatial position it multiplies the 32 input values element-wise with its 32 weights and sums the products into a single number.
  * This 1x1 conv is also termed **Network in Network** and is used in the **Inception Network**.
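
A 1x1 convolution is just a dot product over the channel axis at every spatial position, so it can be sketched as a single matrix product in NumPy. The function name `conv_1x1` and the `(n_H, n_W, n_C)` layout are assumptions made for this example, and the non-linearity that would normally follow is left out.

```python
import numpy as np

def conv_1x1(x, W):
    """x: (n_H, n_W, n_C_in); W: (n_C_in, n_C_out). Returns (n_H, n_W, n_C_out)."""
    # Matrix product over the last axis: one dot product per pixel and per filter
    return x @ W

# Single channel: the 1x1 filter holding the number 2 scales every element
x1 = np.arange(36, dtype=float).reshape(6, 6, 1)
out1 = conv_1x1(x1, np.array([[2.0]]))
print(out1.shape)  # (6, 6, 1)

# 32 channels, one 1x1x32 filter: each pixel's 32 values collapse to one number
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 32))
W = rng.standard_normal((32, 1))
out = conv_1x1(x, W)
print(out.shape)   # (6, 6, 1)

# Several filters change the number of channels while keeping n_H and n_W
out8 = conv_1x1(x, rng.standard_normal((32, 8)))
print(out8.shape)  # (6, 6, 8)
```

With several filters, a 1x1 convolution becomes a cheap way to shrink or grow $n_C$ without touching $n_H$ and $n_W$.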