# Why look at case studies?
- It turns out that a neural network architecture that works well on one computer vision task often works well on other tasks too.
- Some of the effective classic networks are LeNet-5, AlexNet and VGG.
- ResNet (Residual Network) made it possible to train a very deep neural network with 152 layers.
- ![image.png](attachment:image.png)
- Even if we do not end up working on computer vision ourselves, many of the ideas from examples such as ResNet and the Inception network carry over to other areas.

# Classic Network

## LeNet - 5
- The goal of LeNet-5 was to recognize handwritten digits, so the input is an image of a digit.
- LeNet-5 was trained on grayscale images.
- Here is the LeNet-5 architecture, 
    - We start off with an image 32x32x1.
    - In the 1st step, we use a set of 6 5x5 filters with a stride = 1 and no padding. Because we use 6 filters, we end up with a 28x28x6 volume.
    - Then the network applies average pooling with a filter width = 2 and a stride = 2, which reduces the dimensions to a 14x14x6 volume.
    - Next, we apply another convolutional layer with 16 5x5 filters, so we end up with a 10x10x16 volume.
    - Then comes another pooling layer with pool width = 2 and stride = 2, so the volume becomes 5x5x16.
    - Now flatten the layer which becomes 400 (5x5x16). 
    - The next layer is then a fully connected layer that fully connects each of these 400 nodes with every one of 120 neurons. 
    - Then another fully connected layer with 84 neurons.
    - The final step takes these 84 features and feeds them to the output layer that predicts y_hat. y_hat takes 10 possible values, one for each of the digits from 0 to 9, so we use a softmax layer with a 10-way classification output.
    - ![image.png](attachment:image.png)
- As we go from left to right, the height and width tend to go down, whereas the number of channels tends to increase.
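
The layer sizes above can be checked with a small sketch. The helper names `output_size`, `lenet5_shapes` and `lenet5_param_count` are made up for illustration, and the parameter count assumes every filter sees all input channels and that each filter or neuron has one bias, which may differ from the original paper's exact layout.

```python
def output_size(n, f, stride=1, pad=0):
    """Height/width after a conv or pooling window: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1


def lenet5_shapes():
    """Walk the LeNet-5 layers listed above and record each (height, width, channels)."""
    n = 32
    shapes = [(n, n, 1)]                 # grayscale input
    n = output_size(n, f=5, stride=1)    # conv: 6 filters of 5x5
    shapes.append((n, n, 6))
    n = output_size(n, f=2, stride=2)    # average pooling
    shapes.append((n, n, 6))
    n = output_size(n, f=5, stride=1)    # conv: 16 filters of 5x5
    shapes.append((n, n, 16))
    n = output_size(n, f=2, stride=2)    # average pooling
    shapes.append((n, n, 16))
    return shapes


def lenet5_param_count():
    """Weights plus biases, assuming dense channel connectivity (an assumption)."""
    conv1 = 5 * 5 * 1 * 6 + 6
    conv2 = 5 * 5 * 6 * 16 + 16
    fc1 = 400 * 120 + 120
    fc2 = 120 * 84 + 84
    out = 84 * 10 + 10
    return conv1 + conv2 + fc1 + fc2 + out


print(lenet5_shapes())
print(lenet5_param_count())
```

Under these assumptions the count comes out at roughly 60 thousand parameters, most of them in the first fully connected layer.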

## AlexNet
- The architecture of AlexNet
    - AlexNet input starts with 227x227x3 images.
    - Then the first layer applies a set of 96 11x11 filters with a stride = 4, so the dimensions shrink to 55x55x96.
    - Then it applies max pooling with a 3x3 filter (f = 3) and a stride = 2. This reduces the volume to 27x27x96.
    - Then it performs a same convolution with 256 5x5 filters, so we end up with 27x27x256.
    - Then max pooling again with a 3x3 filter and stride = 2 reduces the volume to 13x13x256.
    - Then another same convolution with 384 3x3 filters, so it becomes 13x13x384.
    - Then 3x3 same convolution again ...
    - ![image-2.png](attachment:image-2.png)
- LeNet-5 has about 60,000 parameters, whereas AlexNet has about 60 million parameters.
- One aspect of this architecture that made it much better than LeNet was the use of the ReLU activation function.
- AlexNet had a relatively complicated architecture with a lot of hyperparameters.
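
As a rough check of the AlexNet dimensions listed above, the sketch below traces only the layers these notes spell out and stops where the notes trail off. The table format and the names `ALEXNET_FIRST_LAYERS` and `trace_shapes` are illustrative choices, and "same" padding is assumed to mean pad = (f - 1) / 2 with stride 1.

```python
def output_size(n, f, stride, pad):
    """floor((n + 2*pad - f) / stride) + 1, the usual conv/pool size formula."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, padding, output channels; None keeps the channel count)
ALEXNET_FIRST_LAYERS = [
    ("conv 11x11", 11, 4, 0, 96),
    ("max pool 3x3", 3, 2, 0, None),
    ("conv 5x5 same", 5, 1, 2, 256),
    ("max pool 3x3", 3, 2, 0, None),
    ("conv 3x3 same", 3, 1, 1, 384),
]


def trace_shapes(n, channels, layers):
    """Return the (height, width, channels) volume after each layer, input first."""
    trace = [(n, n, channels)]
    for _name, f, stride, pad, out_channels in layers:
        n = output_size(n, f, stride, pad)
        if out_channels is not None:
            channels = out_channels
        trace.append((n, n, channels))
    return trace


for shape in trace_shapes(227, 3, ALEXNET_FIRST_LAYERS):
    print(shape)
```

Pooling layers leave the channel count untouched, which is why their entry in the table uses `None`.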

## VGG-16
- A remarkable thing about the VGG-16 net is that, instead of having so many hyperparameters, it uses a much simpler network that focuses on conv-layers with 3x3 filters, a stride of 1 and always same padding, and makes all max pooling layers 2x2 with a stride = 2.
- One nice thing about the VGG network is that it really simplified neural network architectures.
- The architecture :
    - We start with an image 224x224x3
    - The 1st 2 layers are convolutions with 3x3 filters, each using 64 filters. So we end up with a 224x224x64 volume because we are using same convolutions.
    - Then a pooling layer reduces the volume to 112x112x64.
    - Then come 2 more conv-layers with 128 filters, giving 112x112x128.
    - Then a pooling layer reduces it to 56x56x128, and so on.
    - ![image-3.png](attachment:image-3.png)
- The 16 in VGG-16 refers to the fact that the network has 16 layers that have weights.
- This network has about 138 million parameters.
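
The "16 layers" and "about 138 million" figures can be reproduced with a short count. The notes only describe the first two conv stages, so the rest of the layout below (the 256 and 512 filter stages and the 4096, 4096, 1000 fully connected layers) is the commonly cited VGG-16 configuration and should be read as an assumption here, as should the names `VGG16_CONFIG`, `FC_SIZES` and `vgg16_param_count`.

```python
# Output channels per 3x3 same conv layer; "M" marks a 2x2 max pool with stride 2.
# Assumed standard VGG-16 layout; only the first two stages appear in the notes.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # assumed fully connected layer widths


def vgg16_param_count(input_size=224, in_channels=3):
    """Return (total parameters, number of layers that have weights)."""
    total = 0
    weight_layers = 0
    n, channels = input_size, in_channels
    for item in VGG16_CONFIG:
        if item == "M":
            n //= 2                                    # pooling halves height and width
        else:
            total += 3 * 3 * channels * item + item    # weights + one bias per filter
            channels = item
            weight_layers += 1
    features = n * n * channels                        # flattened volume fed to the first FC layer
    for units in FC_SIZES:
        total += features * units + units
        features = units
        weight_layers += 1
    return total, weight_layers


total, weight_layers = vgg16_param_count()
print(f"{total:,} parameters in {weight_layers} weight layers")
```

Most of the parameters sit in the first fully connected layer, not in the convolutions.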

# ResNets
- Skip connections allow us to take the activation from one layer and feed it directly to another layer much deeper in the neural network. Using this, we'll build ResNets, which enable us to train very, very deep networks.
- ResNets are built out of something called a residual block.
## Residual block
- Rather than needing to follow the main path, the information from a[l] can now follow a shortcut to go much deeper into the neural network.
- The addition of a[l] at the end of the main path is what makes this a residual block.
- The way we build a ResNet is by taking many of these residual blocks and stacking them together to form a deep network.
- ![image.png](attachment:image.png)
- If we use a standard optimization algorithm to train a plain network, without the extra residual or skip connections, we find that as we increase the number of layers the training error tends to decrease for a while and then goes back up.
- With a ResNet, even as the number of layers grows, the training error can keep going down, even if we train a network with over a hundred layers.
- Taking these intermediate activations and allowing them to go much deeper in the neural network helps with the vanishing and exploding gradient problems and allows us to train much deeper neural networks without an appreciable loss in performance, although at some point the improvement may flatten out.
- ![image-2.png](attachment:image-2.png)
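
A minimal sketch of the forward pass through one residual block, assuming two fully connected layers with ReLU activations and equal input and output sizes so that a[l] can be added directly. The function names and the tiny example weights are made up for illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(W, a, b):
    """Compute W @ a + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer main path plus a skip connection that carries a[l] forward."""
    a_next = relu(linear(W1, a_l, b1))                       # a[l+1] = g(z[l+1])
    z_out = linear(W2, a_next, b2)                           # z[l+2]
    return relu([z + skip for z, skip in zip(z_out, a_l)])   # a[l+2] = g(z[l+2] + a[l])


a_l = [1.0, 2.0]
W1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
W2, b2 = [[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0]
print(residual_block(a_l, W1, b1, W2, b2))
```

The only difference from a plain two-layer stack is the `+ skip` term inside the final ReLU.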

# Why ResNets work
- If we make a network deeper, it can hurt our ability to train the network to do well on the training set and that's why sometimes we don't want a network that is too deep.
- The identity function is easy for a residual block to learn: it's easy to get a[l+2] = a[l] because of the skip connection. This means that adding these 2 layers doesn't really hurt the neural network's ability to do as well as the simpler network without them, because it's quite easy to learn the identity function and just copy a[l] to a[l+2] despite the addition of these 2 layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn't hurt performance.
- Our goal is not just to avoid hurting performance but to help it. We can imagine that if these hidden units actually learn something useful, then we can do even better than learning the identity function.
- The main reason the residual network works is that it's so easy for these extra layers to learn the identity function that we're kind of guaranteed that adding them doesn't hurt performance.
- ![image.png](attachment:image.png)
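
The identity argument can be written out in one line, assuming the activation $g$ is ReLU and that $a^{[l]}$ and $z^{[l+2]}$ have the same dimensions. The residual block computes

$$
a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right) = g\left(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}\right)
$$

If the weights and bias of the last layer shrink to zero, for example under weight decay, so that $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$, this reduces to

$$
a^{[l+2]} = g\left(a^{[l]}\right) = a^{[l]}
$$

The last equality holds because $a^{[l]}$ is itself the output of a ReLU, so all its entries are non-negative and ReLU leaves them unchanged.
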
## ResNet on Images
- ![image-2.png](attachment:image-2.png)

# Networks in Networks and 1x1 convolutions
- ![image.png](attachment:image.png)
- Let's say we have a 28x28x192 volume, 
    - If we want to shrink the height and width, we can use a pooling layer. But what if the number of channels has gotten too big and we want to shrink that?
    - What we can do is use 32 filters that are 1x1. Technically each filter has dimension 1x1x192, because the number of channels in a filter has to match the number of channels in the input volume.
    - With 32 filters, the output of the process is a 28x28x32 volume.
    - Pooling layers are used just to shrink nh and nw, the height and width of the volume, whereas a 1x1 convolution lets us shrink the number of channels.
- ![image-2.png](attachment:image-2.png)
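
A 1x1 convolution can be sketched as a small fully connected layer applied independently at every pixel position. The pure-Python version below is illustrative only: the names `conv1x1` and `shape` are made up, and applying a ReLU after the weighted sum is an assumption about the layer.

```python
def conv1x1(volume, filters, biases):
    """Apply C_out 1x1 filters to an H x W x C_in volume given as nested lists.

    `filters` has shape C_out x C_in; the result has shape H x W x C_out.
    A ReLU is applied to each output value (assumed non-linearity).
    """
    out = []
    for row in volume:
        out_row = []
        for pixel in row:  # pixel is the length-C_in vector at one (h, w) position
            out_row.append([
                max(0.0, sum(w * x for w, x in zip(f, pixel)) + b)
                for f, b in zip(filters, biases)
            ])
        out.append(out_row)
    return out


def shape(volume):
    """Return (height, width, channels) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Shrink a 28x28x192 volume to 28x28x32 with 32 filters of size 1x1x192.
volume = [[[1.0] * 192 for _ in range(28)] for _ in range(28)]
filters = [[0.01] * 192 for _ in range(32)]
biases = [0.0] * 32
print(shape(volume), "->", shape(conv1x1(volume, filters, biases)))
```

Height and width are untouched; only the channel dimension changes, from the filters' input depth to the number of filters.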

![image.png](attachment:image.png)

# Inception Network Motivation
- When designing a layer for a ConvNet, we might have to pick: do we want a 1x1 filter, or 3x3, or 5x5, or do we want a pooling layer? What the inception network says is, why not do them all? This makes the network architecture more complicated, but it also works remarkably well.
- The basic idea is that instead of needing to pick one of those filter sizes or pooling and committing to it, we can do them all, concatenate all the outputs and let the network learn whatever parameters it wants to use and whatever combinations of these filter sizes it wants.
- ![image.png](attachment:image.png)
## The Problem of computational cost
- ![image-2.png](attachment:image-2.png)
## Using 1x1 convolution
- Use a 1x1 convolution to reduce the volume to 16 channels instead of 192 channels, and then run our 5x5 convolution on this much smaller volume to give us our final output. Notice that the input and output dimensions are still the same.
- ![image-3.png](attachment:image-3.png)
- To summarize, if we are building a layer of a neural network and we don't want to have to decide whether we want a 1x1, 3x3 or 5x5 convolution or a pooling layer, the inception module lets us say let's do them all and concatenate the results. Then we run into the problem of computational cost. What we saw here is how using a 1x1 convolution, we can create a bottleneck layer and thereby reduce the computational cost significantly.
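
The saving from the bottleneck can be estimated by counting multiplications. The sketch below assumes a 28x28x192 input, same padding, and 32 output filters for the 5x5 convolution; the 32 is an assumption, since the notes only state the 192 and 16 channel counts at this point. `conv_multiplications` is a made-up helper name.

```python
def conv_multiplications(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for a conv layer: one f x f x in_channels dot product per output value."""
    return out_h * out_w * n_filters * f * f * in_channels


# Direct 5x5 same convolution: 28x28x192 -> 28x28x32
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv up to 32 channels
reduce_step = conv_multiplications(28, 28, 16, 1, 192)
expand_step = conv_multiplications(28, 28, 32, 5, 16)
bottleneck = reduce_step + expand_step

print(f"direct:     {direct:,}")
print(f"bottleneck: {bottleneck:,}")
print(f"ratio:      {bottleneck / direct:.3f}")
```

With these sizes the bottleneck version needs roughly a tenth of the multiplications, while the input and output volumes stay the same.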

# Inception Network
- The inception network puts a lot of these inception modules together.
- ![image.png](attachment:image.png)
- ![image-2.png](attachment:image-2.png)

# MobileNet
- MobileNet is another foundational convolutional neural network architecture used for computer vision. Using MobileNets allows us to build and deploy networks that work even in low-compute environments, such as a mobile phone.
## Motivation for MobileNets
- If we want our neural network to run on a device with a less powerful CPU or GPU at deployment, there's another neural network architecture, called MobileNet, that could perform much better.
- ![image.png](attachment:image.png)
## Normal Convolution
- In the normal convolution, we may have an input image nxnxn_c, e.g. 6x6x3
    - We want to convolve it with a filter that is fxfxn_c, e.g. 3x3x3
    - ![image-2.png](attachment:image-2.png)
## Depthwise separable convolution
- The depthwise separable convolution has 2 steps : A depthwise convolution followed by pointwise convolution. It is these 2 steps which together make up the depthwise separable convolution.
- ![image-3.png](attachment:image-3.png)
- We have an input 6x6x3 (nxnxn_c)
    - The filters in the depthwise convolution are fxf, not fxfxn_c, and the number of filters is n_c.
    - The way that we will compute the 4x4x3 output is that we apply one of each of these filters to one of each of these input channels. 
    - We need to take the 4x4x3 intermediate value and carry out one more step.
    - ![image-4.png](attachment:image-4.png)
    - We need to take the 4x4x3 set of values (n_out x n_out x n_c, where n_out is the output height and width) and apply a pointwise convolution in order to get the output we want, which will be 4x4x5.
    - ![image-5.png](attachment:image-5.png)
- ![image-6.png](attachment:image-6.png)
- ![image-7.png](attachment:image-7.png)
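
The cost difference between the two approaches can be sketched by counting multiplications for the 6x6x3 example above, with 3x3 filters, a 4x4 output and 5 output channels. The function names are illustrative, and the counts ignore additions and biases.

```python
def normal_conv_cost(f, n_c, n_out, n_filters):
    """Each of the n_out x n_out x n_filters outputs needs f * f * n_c multiplications."""
    return f * f * n_c * n_out * n_out * n_filters


def depthwise_separable_cost(f, n_c, n_out, n_filters):
    """Depthwise step (one f x f filter per channel) plus pointwise step (1 x 1 x n_c filters)."""
    depthwise = f * f * n_out * n_out * n_c
    pointwise = n_c * n_out * n_out * n_filters
    return depthwise + pointwise


normal = normal_conv_cost(f=3, n_c=3, n_out=4, n_filters=5)
separable = depthwise_separable_cost(f=3, n_c=3, n_out=4, n_filters=5)
print(normal, separable, round(separable / normal, 3))
```

The ratio works out to 1 / n_filters + 1 / f², so the saving grows as the number of output channels increases.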

# MobileNet Architecture
- The idea of MobileNet is that everywhere we previously used an expensive convolutional operation, we can now instead use a much less expensive depthwise separable convolutional operation, comprising the depthwise convolution operation and the pointwise convolution operation.
- MobileNet v1 had a specific architecture in which it used this block 13 times. It would use a depthwise separable convolutional operation to generate outputs and then stack 13 of these layers in order to go from the original raw input image to finally making a classification prediction.
- The last few layers of the neural network are the usual pooling layer, followed by a fully connected layer, followed by a softmax in order to make a classification prediction.
- This turns out to perform well while being much less computationally expensive than earlier algorithms that used a normal convolutional operation.
- ![image.png](attachment:image.png)
- Details of MobileNet v2 block : 
    - Given an input nxnx3, the MobileNet v2 bottleneck block will pass that input via the residual connection directly to the output, just like in the ResNet.
    - Then in the main non-residual connection part, we'll 1st apply an expansion operator, which means we'll apply 1x1xn_c filters.
    - ![image-2.png](attachment:image-2.png)

# EfficientNet
- MobileNet v1 and v2 gave us a way to implement a neural network that is more computationally efficient. But is there a way to tune MobileNet, or some other architecture, to our specific device?
- Maybe we're implementing a computer vision algorithm for different brands of mobile phones with different amounts of compute resources, or for different edge devices. If we have a little more computation, maybe we can use a slightly bigger neural network and get a bit more accuracy; if we are more computationally constrained, maybe we want a slightly smaller and faster network at the cost of a bit of accuracy. How can we automatically scale neural networks up or down for a particular device?
- EfficientNet gives us a way to do so. The 3 things we could do to scale things up or down are, 
    - We could use a higher resolution image
        - ![image.png](attachment:image.png)
    - We could make the network much deeper
        - ![image-2.png](attachment:image-2.png)
    - We could make the layers wider
        - ![image-3.png](attachment:image-3.png)
- The question is, given a particular computational budget, what is a good choice of r (resolution), d (depth) and w (width)?
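
To get a feel for why the choice matters, the toy model below estimates how the multiplication count of a plain stack of conv layers changes when r, d and w are used as multipliers on resolution, depth and width. This is only a back-of-the-envelope sketch and not EfficientNet's actual scaling procedure; the function names and the baseline sizes (224 pixels, 10 layers, 32 channels) are arbitrary assumptions.

```python
def stack_multiplications(resolution, depth, width, f=3):
    """Multiplications for `depth` same-padded f x f conv layers with `width` channels in and out."""
    return depth * resolution * resolution * f * f * width * width


def relative_cost(r, d, w, base_resolution=224, base_depth=10, base_width=32):
    """Cost of the scaled stack divided by the cost of the baseline stack."""
    base = stack_multiplications(base_resolution, base_depth, base_width)
    scaled = stack_multiplications(
        round(base_resolution * r), round(base_depth * d), round(base_width * w)
    )
    return scaled / base


print(relative_cost(r=2, d=1, w=1))  # doubling resolution
print(relative_cost(r=1, d=2, w=1))  # doubling depth
print(relative_cost(r=1, d=1, w=2))  # doubling width
```

In this toy model the cost grows linearly with depth but quadratically with resolution and width, so the same budget can be spent in quite different ways.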