## Different Famous Networks :
 1. **Classic networks:**
  * LeNet - 5
  * AlexNet
  * VGG
 2. **ResNet**
 
 
# Classic Networks : 
## 1. LeNet - 5 :
 1. Typical Architectural flow for the LeNet - 5 :
<img src='./imgs/LeNet.png' height="100%" width="100%">
 2. It was used mostly for handwritten digit classification, i.e. grayscale images of the digits 0-9.
 3. At the time, _Average Pooling_ was preferred over _Max Pooling_, and the network was tiny by today's standards: about 60k parameters, compared with the millions in modern networks.
 4. The general trend still holds in modern networks: $n_H, n_W$ decrease as we go deeper, while the number of channels $n_C$ increases. The layer pattern **Conv -> Pool -> Conv -> Pool -> FC -> FC -> O/P** is also still common; for the output, a modern implementation would use a softmax function (in this case with 10 outputs).
 5. Padding was rarely used back then, and to save computation each filter looked at only some of the input channels (effectively fxf) rather than spanning all of them (fxfx$n_C$) as is standard now.
 6. Research Paper : [LeCun et al., 1998. Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-99.pdf) (useful: focus on sections 2 and 3)
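
The shrinking $n_H, n_W$ and growing $n_C$ can be traced with the usual output-size formula $\lfloor (n + 2p - f)/s \rfloor + 1$. The sketch below assumes the standard LeNet-5 layer sizes (32x32x1 input, 5x5 convolutions with 6 and 16 filters, 2x2 average pooling with stride 2, no padding); these come from the paper, not from the notes above, and `out_size` and `trace_shapes` are hypothetical helper names.

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# (layer name, filter size, stride, output channels); None keeps the channel count (pooling)
lenet5_layers = [
    ("conv1", 5, 1, 6),
    ("avgpool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("avgpool2", 2, 2, None),
]

def trace_shapes(n=32, n_c=1, layers=lenet5_layers):
    """Return the (n_H, n_W, n_C) volume after each layer, starting with the input."""
    shapes = [(n, n, n_c)]
    for _name, f, s, out_c in layers:
        n = out_size(n, f, s)
        n_c = out_c if out_c is not None else n_c
        shapes.append((n, n, n_c))
    return shapes

shapes = trace_shapes()
for shape in shapes:
    print(shape)
```

Under these assumptions the last volume is 5x5x16, i.e. 400 values that are flattened and fed to the FC layers.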
 
## 2. AlexNet :
 1. Typical Architectural flow for AlexNet :
<img src='./imgs/AlexNet.png' height="100%" width="100%">
 2. It is very similar to LeNet - 5, but much bigger: about 60 million parameters.
 3. It used ReLU activations.
 4. GPUs were still comparatively slow at the time, so training was split across multiple GPUs.
 5. It used LRN (Local Response Normalisation). For example, given a 13x13x256 volume, LRN takes each spatial position, say (5,4), and normalises the activations at that position across the channel dimension. Researchers later found that it doesn't help much, so it is rarely used today.
 6. Research Paper : [Krizhevsky et al.,2012. ImageNet Classification with deep convolutional neural networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), easy to read and understand.
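
A minimal NumPy sketch of LRN may make the idea concrete. It assumes the formulation and default constants reported in the AlexNet paper (a window of `n=5` neighbouring channels, `k=2`, `alpha=1e-4`, `beta=0.75`); the function name `local_response_norm` is hypothetical.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Divide each activation by a term built from n neighbouring channels.

    a: array of shape (H, W, C). Defaults are the values assumed from the AlexNet paper.
    """
    H, W, C = a.shape
    squared = a ** 2
    out = np.empty_like(a, dtype=float)
    for c in range(C):
        lo = max(0, c - n // 2)          # window is clipped at the channel borders
        hi = min(C - 1, c + n // 2)
        denom = (k + alpha * squared[:, :, lo:hi + 1].sum(axis=2)) ** beta
        out[:, :, c] = a[:, :, c] / denom
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((13, 13, 256))
b = local_response_norm(a)
print(b.shape)  # (13, 13, 256)
```

The output has the same shape as the input; only the scale of each activation changes.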
 
## 3. VGG - 16
 1. Typical Architectural design for VGG-16 :
<img src='./imgs/VGG-16.png' height="100%" width="100%">
 2. Every layer uses the same settings :
  * CONV = 3x3 filter, s=1, padding = 'same'
  * MAX-POOL = 2x2, s=2
 3. The architecture is very simple and uniform, with just Conv layers and pooling layers, but it has a very large number of parameters (~138 million).
 4. Each Pool layer halves the height and width of the volume, while each subsequent group of Conv layers roughly doubles the number of channels; the later groups also stack more Conv layers.
 5. The output is again one of 1000 labels, produced by a softmax function.
 6. Research Paper : [Simonyan & Zisserman 2015. Very deep convolutional network for large-scale image recognition](https://arxiv.org/abs/1409.1556)
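
The ~138 million figure can be checked by counting weights and biases layer by layer. This sketch assumes the VGG-16 configuration from the paper (224x224x3 input, channel widths 64/128/256/512/512, and FC layers of 4096, 4096 and 1000 units), which is not spelled out in the notes above; `VGG16_CFG` and `count_vgg16_params` are hypothetical names.

```python
# Assumed VGG-16 configuration: numbers are conv output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def count_vgg16_params(n=224, in_c=3, fc_sizes=(4096, 4096, 1000)):
    """Return (total parameter count, volume entering the first FC layer)."""
    total = 0
    for item in VGG16_CFG:
        if item == "M":
            n //= 2                               # pooling halves height and width
        else:
            total += 3 * 3 * in_c * item + item   # 3x3 weights plus one bias per filter
            in_c = item                           # 'same' padding leaves n unchanged
    fan_in = n * n * in_c                         # flattened final volume
    for units in fc_sizes:
        total += fan_in * units + units
        fan_in = units
    return total, (n, n, in_c)

total, final_volume = count_vgg16_params()
print(final_volume)   # (7, 7, 512)
print(f"{total:,}")   # 138,357,544
```

Most of the parameters sit in the first FC layer (7x7x512 inputs times 4096 units), not in the convolutions.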
 
# ResNets (Residual Network) : 
## Idea :
 1. A very deep neural network may suffer from vanishing/exploding gradients, which makes training very difficult.
 2. Hence, **the idea in ResNet is to take the activation $a^{[l]}$, skip one or two layers, and add it to the linear output $z^{[l+2]}$ (or $z^{[l+3]}$, depending on how many layers you skip) just before the non-linearity that produces $a^{[l+2]}$ (or $a^{[l+3]}$). This is also termed a _'shortcut'_ or _'skip connection'_**.
 3. Architectural understanding :-
 <img src='./imgs/resnet_1.png' height="30%" width="30%">
 \begin{align}
 z^{[l+1]} &= W^{[l+1]}a^{[l]}+b^{[l+1]} \\
 a^{[l+1]} &= g(z^{[l+1]}) \\
 z^{[l+2]} &= W^{[l+2]}a^{[l+1]}+b^{[l+2]} \\
 a^{[l+2]} &= g(z^{[l+2]}+a^{[l]}) \\
 \end{align}
 4. Thus for a bigger NN :
 <img src='./imgs/resnet_2.png' height="75%" width="75%">
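 In theory, making a NN deeper should only decrease the training error. In practice, for a plain network without skip connections, the training error starts to rise again beyond a certain depth, because exploding/vanishing gradients make it harder for the optimisation algorithm to train the network. Residual blocks are meant to counter this.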
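
The four equations of the residual block translate almost line by line into NumPy. This is a minimal sketch for fully connected layers, assuming $g$ is ReLU and that all layers have the same width so the addition is well defined; the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) with g = ReLU."""
    z1 = W1 @ a_l + b1        # z^{[l+1]}
    a1 = relu(z1)             # a^{[l+1]}
    z2 = W2 @ a1 + b2         # z^{[l+2]}
    return relu(z2 + a_l)     # shortcut is added before the non-linearity

rng = np.random.default_rng(0)
n = 4                                        # layer width, kept constant so shapes match
a_l = relu(rng.standard_normal((n, 1)))      # a^{[l]} is itself a ReLU output
W1, W2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b1, b2 = np.zeros((n, 1)), np.zeros((n, 1))

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2.shape)  # (4, 1)
```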

## Why it works ?
 * Say, x -> Big NN -> $a^{[l]}$
 * Now add a residual block: x -> Big NN -> $a^{[l]}$ -> Residual block -> $a^{[l+2]}$
  * $a^{[l+2]} = g(z^{[l+2]}+a^{[l]}) = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})$
  * If, say because of regularisation, $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$, then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$. The last step holds when $g$ is ReLU, because $a^{[l]} \geq 0$ is itself the output of a ReLU. Thus it's easy for a residual block to learn the **identity function**.
  * Hence, adding a residual block won't hurt performance. But we also want it to help, and it can: starting from the identity, the block may learn something useful on top of it.
  * It is therefore not a bad idea to add a few residual blocks in the middle or at the end of a network to help build a deeper NN.
  * Also it is important to note here that the **dimensions of $z^{[l+2]}$ must be the same as those of $a^{[l]}$ to perform the addition, which is why padding='same' convolutions are used**.
   * If they are not the same, e.g. $z^{[l+2]}$ has dimension (256,1) and $a^{[l]}$ has dimension (128,1), then multiply $a^{[l]}$ by a matrix $W_s$ of dimension (256,128) to get a vector of dimension (256,1) that can be added.
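
Both points, the identity behaviour with zero weights and the $W_s$ projection, can be checked numerically. This is a sketch under the assumption that $g$ is ReLU; the optional `W_s` argument and the variable names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2, W_s=None):
    """Residual block with an optional projection W_s on the shortcut path."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    shortcut = a_l if W_s is None else W_s @ a_l
    return relu(z2 + shortcut)

rng = np.random.default_rng(1)
a_l = relu(rng.standard_normal((128, 1)))    # non-negative, like any ReLU output

# Case 1: W^{[l+2]} = 0 and b^{[l+2]} = 0, so the block reduces to the identity.
W1, b1 = rng.standard_normal((128, 128)), rng.standard_normal((128, 1))
W2, b2 = np.zeros((128, 128)), np.zeros((128, 1))
identity_out = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(identity_out, a_l))  # True

# Case 2: z^{[l+2]} is (256, 1) but a^{[l]} is (128, 1), so W_s (256, 128) projects it.
W2_wide, b2_wide = rng.standard_normal((256, 128)), np.zeros((256, 1))
W_s = rng.standard_normal((256, 128))
projected_out = residual_block(a_l, W1, b1, W2_wide, b2_wide, W_s=W_s)
print(projected_out.shape)  # (256, 1)
```

Here `W_s` is random only for illustration; in a real network it would be either learned or a fixed matrix such as zero padding.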
   
# 1 x 1 Convolution :
## What does it do ? 
 1. Consider the following :
<img src='./imgs/1x1_conv1.png' height="60%" width="60%">

  * For a single-channel input the operation is straightforward: the single number in the filter (2 in our case) multiplies every element of the 6x6x1 input, and the output is a matrix with the same $n_H$ and $n_W$.
  * For an input with, say, 32 channels, a 1x1x32 filter takes the 32 numbers at one spatial position, multiplies them element-wise with its 32 weights, and sums the products into a single number. Repeating this at every position, and for every filter, gives an output of size $n_H \times n_W \times$ (number of filters).
  * This 1x1 conv is also termed as **Network in Network** and is useful in **Inception Network**.
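
Seen this way, a 1x1 convolution is just a dot product over the channel dimension at every pixel. The sketch below shows the linear part only, with a hypothetical helper `conv1x1` and a weight layout of `(n_C_in, n_C_out)` chosen for convenience.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution. x: (n_H, n_W, n_C_in), w: (n_C_in, n_C_out)."""
    return np.einsum("hwc,co->hwo", x, w)   # per-pixel dot product over channels

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 32))

w_single = rng.standard_normal((32, 1))     # one 1x1x32 filter
y_single = conv1x1(x, w_single)
print(y_single.shape)  # (6, 6, 1)

w_many = rng.standard_normal((32, 8))       # eight filters change the channel count
y_many = conv1x1(x, w_many)
print(y_many.shape)    # (6, 6, 8)
```

The spatial size never changes; only the number of channels does, which is what makes 1x1 convolutions useful for shrinking or expanding $n_C$.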

# Inception Layer:
## Motivation
  * Should you choose a 1 x 1 filter, a 3 x 3, a 5 x 5, or a pooling layer? **Why not do them all?**
  <p><img src='./imgs/inception_1.png' height="40%" width="40%" /></p>

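  * In this case the input is (28x28x192) and the output is (28x28x256): every branch is applied to the same input, using 'same' padding so the spatial size is preserved, and the results are concatenated along the channel dimension. All options are thus applied together, and the model learns by itself which is best.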
  * **Problem:** 
    * Well it comes with a computational cost
    * Let's consider an example with just 5 x 5 conv:
    <p><img src='./imgs/inception_2.png' height="30%" width="30%"/></p>
    
    * Each output value requires 5x5x192 multiplications, and there are 28x28x32 output values, so in total we need 28x28x32 x 5x5x192 ≈ 120M multiplications.
  * **Solution:**
    * The calculations can be drastically reduced by first applying a 1x1 convolution that shrinks the number of channels (a "bottleneck" layer)
    <p><img src='./imgs/inception_3.png' height="40%" width="40%"/></p>
    
    * This reduces the number of multiplications from about 120M to about 12M, roughly a 10x saving.
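
The two totals can be reproduced with a few lines of arithmetic. The sketch assumes the 1x1 bottleneck reduces the 192 channels to 16 before the 5x5 convolution; that width is an assumption here, and `conv_mults` is a hypothetical helper.

```python
def conv_mults(n_h, n_w, f, c_in, c_out):
    """Multiplications for a 'same' conv: one f*f*c_in dot product per output value."""
    return n_h * n_w * c_out * f * f * c_in

# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct = conv_mults(28, 28, f=5, c_in=192, c_out=32)

bottleneck_c = 16  # assumed width of the 1x1 bottleneck layer
reduce_cost = conv_mults(28, 28, f=1, c_in=192, c_out=bottleneck_c)
conv5_cost = conv_mults(28, 28, f=5, c_in=bottleneck_c, c_out=32)
with_bottleneck = reduce_cost + conv5_cost

print(f"{direct:,}")            # 120,422,400
print(f"{with_bottleneck:,}")   # 12,443,648
```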

## Case Studies:
  * Let's redraw a single inception module:
    <p><img src='./imgs/inception_4.png' height="40%" width="40%"/></p>
  
  * Regularisation effect on the inception layer:
    * Some intermediate inception blocks have side branches made of FC layers followed by a softmax layer.
    * These branches help ensure that the features computed by the hidden units, even at intermediate layers, are not too bad at predicting the output.
    * This appears to have a regularising effect on the inception network and helps prevent it from overfitting.
    * The network was developed at Google and is named 'GoogLeNet'.
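
The concatenation step of a single inception module is easy to sketch in terms of shapes. The branch outputs below are placeholders, and the per-branch channel counts (64, 128, 32, 32) are assumed for illustration; they were chosen to add up to the 256 output channels mentioned earlier.

```python
import numpy as np

n_h, n_w = 28, 28
# Placeholder branch outputs; each branch keeps the 28x28 spatial size ('same' padding).
branches = {
    "1x1": np.zeros((n_h, n_w, 64)),
    "1x1 -> 3x3": np.zeros((n_h, n_w, 128)),
    "1x1 -> 5x5": np.zeros((n_h, n_w, 32)),
    "maxpool -> 1x1": np.zeros((n_h, n_w, 32)),
}
# Stack the branch outputs along the channel axis.
module_out = np.concatenate(list(branches.values()), axis=-1)
print(module_out.shape)  # (28, 28, 256)
```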

# MobileNets
## Motivation:
  * Low computational cost at deployment
  * Useful for mobile and embedded vision applications
  * Key idea : Normal vs depthwise-separable convolution
## Comparison of Normal Convolution vs Depthwise-separable convolution
  <p><img src='./imgs/mbn_1.png' height="40%" width="40%"/></p>

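  * Thus we see that a normal convolution needs 2160 multiplications in this example: (filter parameters) x (output positions) x (number of filters) = 3x3x3 x 4x4 x 5.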
  * **Depthwise-separable convolution**
    * It consists of 2 steps i.e. `Depthwise Convolution` and then `Pointwise Convolution`
    * **Depthwise Convolution**
      * Rather than filters that span all input channels, it uses one (3 x 3) filter per input channel, each applied only to its own channel, and results in $(n_{out} \times n_{out} \times 3)$.
      <p><img src='./imgs/mbn_2.png' height="40%" width="40%"/></p>
    * **Pointwise Convolution**
      * The output of the depthwise convolution is then convolved with $1 \times 1 \times n_{c}$ filters, in this case 5 of them, which gives 5 output feature maps.
      <p><img src='./imgs/mbn_3.png' height="40%" width="40%"/></p>

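  * Thus in total we performed 672 multiplications (432 depthwise + 240 pointwise), which is about 31% (672/2160) of the cost of the normal convolution.
  * In the paper (Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications), the cost ratio is given in general as $\frac{1}{n_{c}'}+\frac{1}{f^{2}}$, where $n_{c}'$ is the number of output channels and $f$ is the filter size. Here that is $\frac{1}{5}+\frac{1}{9} \approx 0.31$.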
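
The numbers above can be reproduced directly. The sketch uses the example's values (3x3 filters, 3 input channels, a 4x4 output and 5 output channels, with the 4x4 output size inferred from the 2160 total); the three cost helpers are hypothetical names.

```python
def normal_conv_cost(f, n_c, n_out, n_c_out):
    """(filter parameters) x (output positions) x (number of filters)."""
    return (f * f * n_c) * (n_out * n_out) * n_c_out

def depthwise_cost(f, n_c, n_out):
    """One f x f filter per input channel."""
    return (f * f) * (n_out * n_out) * n_c

def pointwise_cost(n_c, n_out, n_c_out):
    """n_c_out filters of size 1 x 1 x n_c."""
    return n_c * (n_out * n_out) * n_c_out

f, n_c, n_out, n_c_out = 3, 3, 4, 5
normal = normal_conv_cost(f, n_c, n_out, n_c_out)
separable = depthwise_cost(f, n_c, n_out) + pointwise_cost(n_c, n_out, n_c_out)
ratio = separable / normal

print(normal, separable)                  # 2160 672
print(round(ratio, 3))                    # 0.311
print(round(1 / n_c_out + 1 / f**2, 3))   # 0.311
```

The measured ratio and the closed-form expression agree, since both reduce to 14/45 for these values.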

## MobileNets V2:
  * In addition to the depthwise and pointwise convolutions, v2 adds an expansion step in front, renames the pointwise convolution to projection, and adds a residual connection around the block.
  <p><img src='./imgs/mbn_4.png' height="40%" width="40%"/></p>
  
  * The `expansion` step increases the number of channels, which leads to **learning of richer functions**.
  * The `projection` step shrinks the number of channels back down, which reduces the amount of memory needed to store the activations.
  * Together this allows the neural network to learn richer and more complex functions, while keeping the size of the activations passed from layer to layer relatively small.
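
A rough NumPy sketch of one such block (expansion -> depthwise -> projection, plus the residual connection) is given below. Several details are assumptions beyond the notes: an expansion factor of 6, ReLU6 after the first two steps, no activation after the projection, stride 1, and no batch normalisation. All function names are hypothetical.

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def pointwise(x, w):
    """1x1 convolution. x: (H, W, C_in), w: (C_in, C_out)."""
    return np.einsum("hwc,co->hwo", x, w)

def depthwise3x3(x, k):
    """3x3 depthwise convolution, 'same' padding. k: (3, 3, C), one filter per channel."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # zero-pad height and width only
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * k[i, j, :]
    return out

def bottleneck_block(x, w_expand, k_depth, w_project):
    h = relu6(pointwise(x, w_expand))      # expansion: C -> t*C channels
    h = relu6(depthwise3x3(h, k_depth))    # depthwise: spatial filtering per channel
    h = pointwise(h, w_project)            # projection: t*C -> C channels, no activation
    return x + h                           # residual connection

rng = np.random.default_rng(0)
C, t = 3, 6                                # t = 6 is an assumed expansion factor
x = rng.standard_normal((8, 8, C))
w_expand = rng.standard_normal((C, t * C))
k_depth = rng.standard_normal((3, 3, t * C))
w_project = rng.standard_normal((t * C, C))

y = bottleneck_block(x, w_expand, k_depth, w_project)
print(y.shape)  # (8, 8, 3)
```

Only the small C-channel tensors `x` and `y` cross block boundaries; the wider t*C-channel tensor exists only inside the block.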

# Practical implementation tips
## Transfer Learning:
  * At times we might not have enough data to train a model from scratch. In such scenarios it's a good idea to reuse weights that someone else in the community has already obtained after rigorous work.
  * Depending on the amount of training data we have, we can decide how many layers of the model to freeze (`trainable` = false) and how many to train (`trainable` = true).
  <p><img src='./imgs/tl.png' height="40%" width="40%"/></p>

  * In the first case, when we have very little data, we can freeze all the pretrained layers and train only the softmax layer's parameters.
  * In the second case, when we have a relatively large dataset, we can freeze some of the initial layers and train the later layers.
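
The two cases can be sketched without any deep learning framework. Here a "model" is just a list of stand-in layer objects with a `trainable` flag (Keras layers expose an attribute of the same name); the layer names and the helper `freeze_first_k` are hypothetical.

```python
from types import SimpleNamespace

def freeze_first_k(layers, k):
    """Mark the first k layers as frozen and the rest as trainable."""
    for i, layer in enumerate(layers):
        layer.trainable = i >= k
    return layers

def make_model():
    """Stand-in for a pretrained model: layer objects with a `trainable` flag."""
    names = ["conv1", "conv2", "conv3", "fc1", "softmax"]
    return [SimpleNamespace(name=n, trainable=True) for n in names]

# Case 1: very little data -> freeze everything except the new softmax layer.
small_data_model = freeze_first_k(make_model(), k=4)
# Case 2: larger dataset -> freeze only the early layers, train the rest.
large_data_model = freeze_first_k(make_model(), k=2)

print([l.name for l in small_data_model if l.trainable])  # ['softmax']
print([l.name for l in large_data_model if l.trainable])  # ['conv3', 'fc1', 'softmax']
```

With a real framework the same loop would run over the model's actual layers, and the choice of `k` would depend on how much data is available.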

## Data Augmentation
  * One of the techniques used to improve performance on a task, especially when training data is limited.
  * Common techniques include:
    * Mirroring about the vertical axis
    * Random cropping (works well in practice)
    * Rotation (used less, perhaps because of complexity)
    * Shearing (used less, for the same reason)
    * Local warping (used less, for the same reason)
    * Color shifting: adding different values to the R, G and B channels, motivated by the fact that lighting might differ between images. One principled algorithm for choosing the shifts is the PCA color augmentation described in the AlexNet paper.
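
Three of the simpler techniques (mirroring, random cropping and color shifting) take only a few lines of NumPy each. This is a minimal sketch on a random 32x32 RGB image; the function names and the fixed offsets are hypothetical, and PCA color augmentation is not implemented here.

```python
import numpy as np

def mirror(img):
    """Flip left-right, i.e. mirror about the vertical axis. img: (H, W, 3)."""
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    H, W, _ = img.shape
    top = rng.integers(0, H - crop_h + 1)
    left = rng.integers(0, W - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_shift(img, offsets):
    """Add one offset per RGB channel and clip back to the valid 0-255 range."""
    shifted = img.astype(np.int16) + np.asarray(offsets, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

flipped = mirror(img)
cropped = random_crop(img, 24, 24, rng)
shifted = color_shift(img, offsets=(20, -20, 20))
print(flipped.shape, cropped.shape, shifted.shape)
```

In a training pipeline these would typically be applied on the fly with freshly sampled parameters for every image.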

