# Deep Residual Learning for Image Recognition

[Deep residual Nets](https://arxiv.org/pdf/1512.03385.pdf) is one of the most influential papers of modern Deep Learning. We'll cover the important concepts and the intuition behind the idea. The reader can refer to the original paper for detailed information on the training procedure and the improvements on different benchmark tasks and datasets.

<i><u>Overview of the notebook </u></i>

1. Background Problem Statement
2. Residual learning
3. How Residual learning affects Gradient flow
4. Residual Network Architecture
5. Results of Residual Learning
6. References

## 1. Background Problem Statement

- Deep NNs are able to create multiple levels of rich features (low/mid/high-level) from image data
- `The richness of features depends directly on the depth of the layer`, which suggests that stacking more layers should yield more detailed features and, in turn, greater accuracy.
- SOTA models of that time used *16-30 layers*
- *Authors question : Is learning better networks as easy as stacking more layers?*
- One bottleneck until that point was the vanishing gradient problem. `Batch Normalization (along with normalized initialization) helped in solving the problem, which allowed deeper networks to converge`
- But `deeper nets faced a degradation problem once they converged, i.e. as depth increased, accuracy saturated and then degraded`
- This was not overfitting, as the training error of the deeper network was also higher, as you can see in the image below

![Deeper nets Training difficulty](images/problemStatement.png)

<center>
    Reference : <a href="https://arxiv.org/pdf/1512.03385.pdf">Deep residual Nets</a>
</center>

#### Simple Experiment

- Consider two networks, a shallow one and a deeper one, where the deeper one contains the same layers as the shallow one with extra layers stacked on top
- In the ideal scenario, `we'd expect the deeper network to produce the same, if not lower, error than the shallower one by mapping the extra layers to the identity function`. The expected solution would look something like :

![Constructed Deeper net solution](images/solutionConstruction.png)

- But solutions found by solvers were worse than the shallow networks suggesting `deeper networks were harder to train`
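
The construction argument can be checked with a toy sketch in plain Python. The names `shallow_net` and `deeper_net_by_construction` and the weight values are hypothetical and chosen only for illustration; the point is that a deeper network whose extra layers are set to identity reproduces the shallow network exactly, so its error cannot be higher.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical "shallow" network: a single linear layer followed by ReLU.
SHALLOW_W = [[0.5, -1.0], [2.0, 0.25]]


def shallow_net(x):
    return relu(linear(SHALLOW_W, x))


def deeper_net_by_construction(x, extra_layers=3):
    """Shallow net plus extra layers whose weights are set to identity."""
    h = shallow_net(x)
    eye = identity_matrix(len(h))
    for _ in range(extra_layers):
        # h is already non-negative (it is a ReLU output), so
        # ReLU(I @ h) == h and each extra layer is an identity mapping.
        h = relu(linear(eye, h))
    return h


x = [1.0, 2.0]
print(shallow_net(x))                 # [0.0, 2.5]
print(deeper_net_by_construction(x))  # [0.0, 2.5]
```

The degradation problem is that gradient-based solvers do not find this solution (or a better one) on their own, even though it exists.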

## 2. Residual learning

- The authors propose a `residual learning` approach where the `networks try to learn the residual of the target function` rather than the original function itself, i.e. assuming *x* to be the input to the stacked layers and **H(x)** to be the original function to learn, we let the layers fit **F(x) = H(x) - x** instead. It can be described as follows : 

![Residual block](images/residualBlock.png)

The layers are made to learn the non-linear mapping ***F(x)*** while an `extra skip-connection or shortcut connection` helps us get back to the original function *H(x)*

### Hypothesis for residual learning
- Ideally, as universal function approximators, the deeper layers should have been able to learn the identity mapping. Their failure to do so suggests there exists a `learning degradation` problem, so we try to precondition them by adding a reference identity mapping
- Authors hypothesize that `it is easier to learn a residual function than an unreferenced non-linear function`
- ***Authors propose that if identity mappings were optimal, the solvers might just drive the weights of deeper layers to zero***
- ***It is assumed that the optimal function is close to an identity mapping, which is unlikely to hold exactly in all cases***, but the layer responses still support the above assumption
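
The "drive the weights to zero" argument can be illustrated with a minimal sketch in plain Python. The helper names are hypothetical, and the ReLU that the paper applies after the addition is left out to keep the identity visible (a simplifying assumption). With all weights at zero, a plain block outputs zero and loses the input, while a residual block returns the input unchanged.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(x, w1, w2):
    """Two weight layers that must learn H(x) directly."""
    return linear(w2, relu(linear(w1, x)))


def residual_block(x, w1, w2):
    """Same two layers learn F(x); the shortcut adds x back: H(x) = F(x) + x."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + a for f, a in zip(fx, x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -0.5]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]  -> input is lost
print(residual_block(x, zeros, zeros))  # [1.5, -0.5] -> identity mapping
```

So for a residual block, "do nothing" corresponds to small weights, which is an easy target for the solver, whereas a plain block has to fit the identity through its non-linear layers.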

### Identity mapping by shortcuts

Equation for the residual block is as follows : 

![residualBlockEquation](images/residualBlockEquation.png)


where **F(x,W)** *is the building block that tries to learn the residual function*, essentially a series of weight layers with non-linear activation functions. **W_s x** *indicates the shortcut connection* for the case where the sizes of *x* and *F* don't match, in which case we add `projection matrices`. 
- The sizes can be matched either by *identity mapping with zero padding* or by *projection matrices (learned parameters)*  
- If the feature map sizes are the same, then we can use an **identity shortcut (element-wise addition) which is parameter-free and has negligible additional time complexity**. 
- The authors point out that this is crucial for a fair comparison of plain vs residual networks, since both then have the same number of parameters, depth and width
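
As a rough sketch of the two cases above, the function below computes `F(x, W) + x` when the sizes match and `F(x, W) + W_s x` when a projection is supplied. It uses small dense layers in plain Python instead of convolutions, and the names `residual_unit`, `w1`, `w2` and `w_s` are assumptions made for this example.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_unit(x, w1, w2, w_s=None):
    """y = F(x, W) + x, or y = F(x, W) + W_s x when a projection is given."""
    fx = matvec(w2, relu(matvec(w1, x)))              # residual branch F(x, W)
    shortcut = x if w_s is None else matvec(w_s, x)   # identity or projection
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual branch sizes must match")
    # Element-wise addition, followed by the second non-linearity.
    return relu([f + s for f, s in zip(fx, shortcut)])


x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]

# Same input/output size: parameter-free identity shortcut.
w2_same = [[1.0, 1.0], [0.0, 1.0]]
print(residual_unit(x, w1, w2_same))            # [4.0, 4.0]

# Output size grows from 2 to 3: a projection W_s matches the sizes.
w2_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_s = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(residual_unit(x, w1, w2_up, w_s))         # [2.0, 4.0, 3.0]
```

Calling `residual_unit(x, w1, w2_up)` without `w_s` raises a `ValueError`, which mirrors why a shortcut needs extra handling whenever the dimensions change.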

## 3. How Residual learning affects Gradient flow

Consider two sample networks as follows:

<table>
<tr><h4>demo_Plain_Network  (without skip-connection)</h4></tr>   

<tr>
<td> <img src="images/demoPlainNet.png" width="800" height="300"> </td>
</tr>
</table>

<table>
<tr><h4>demo_Residual_Network (with skip-connection)</h4></tr>   
<tr>
<td> <img src="images/demoResidualNet.png" style="width:150%"> </td>
</tr>
</table>

- Both demoPlainNet and demoResidualNet are identical except for the additional skip-connection pathways
- A sample batch of input  from [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset is fed through both the networks
- The magnitude of gradients is observed across different layers of both networks
- The ratio of magnitude of the gradients for the conv1 block of 1st Layer of Residual Net w.r.t Plain Net looks like below :
<img src="images/Layer1_0_conv1_weight_gradient.png" style="width:60%">
- As you can see, many filters in the residual network have a **gradient ratio > 1**, i.e. larger gradients than the corresponding filters in the plain network
- ***Rationale : The skip-connection pathway is not scaled by the magnitude of the intermediate layers and thus allows gradients to flow much earlier in the network and with much greater effect***
- The following images show the gradient flow in a sample plain network and in the same network with an identity skip-connection

<table>
<tr>
<td> <img src="images/normalConnectionMath.jpg" style="width:90%"> </td>
<td> <img src="images/skipConnection_normalConnection_comparison.jpg"  style="width:90%"> </td>
</tr>
</table>
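
The rationale can also be written as a short derivation. This is a sketch for a single block with an identity shortcut, ignoring the ReLU after the addition (a simplifying assumption); the symbol $\mathcal{L}$ for the loss is introduced here only for illustration.

$$
\text{Plain: } y = F(x) \;\Rightarrow\; \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
$$

$$
\text{Residual: } y = x + F(x) \;\Rightarrow\; \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right) = \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
$$

The first term in the residual case reaches *x* without being scaled by the weights of the intermediate layers, so the gradient does not vanish even when $\partial F / \partial x$ is small.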

- The code for the above visualization comparing Plain vs Skip-connection network on FashionMNIST dataset can be found at this [kaggle notebook](https://www.kaggle.com/suryajrrafl/plain-vs-residualnet)
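
Independently of that notebook, the effect can be reproduced with a toy scalar sketch in plain Python. Each "layer" is a single multiplication by a weight, which is a strong simplification, and the function name `input_gradient` and the weight value `0.1` are assumptions chosen for illustration.

```python
def input_gradient(weights, residual):
    """Backpropagate d(output)/d(input) through a chain of scalar layers.

    Plain layer:    y = w * x      ->  dy/dx = w
    Residual layer: y = x + w * x  ->  dy/dx = 1 + w
    """
    grad = 1.0
    for w in reversed(weights):      # chain rule, from the last layer to the first
        local = (1.0 + w) if residual else w
        grad *= local
    return grad


weights = [0.1] * 10                 # ten layers with small weights
plain_grad = input_gradient(weights, residual=False)
res_grad = input_gradient(weights, residual=True)

print(f"plain    : {plain_grad:.3e}")
print(f"residual : {res_grad:.3e}")
print(f"ratio    : {res_grad / plain_grad:.3e}")
```

In the plain chain the gradient shrinks by a factor of `w` at every layer, while in the residual chain the `1` contributed by the shortcut keeps it from collapsing, which is consistent with the gradient ratios greater than 1 observed above.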

## 4. Residual Network Architecture

- The architecture itself is inspired by *VGGNet* and uses a series of **Conv -> Batch Norm -> ReLU Activation** blocks. The authors state that the following principles were followed when designing the architecture
    1. For the same output feature map size, the layers have the same number of filters
    2. If the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer (*downsampling is done by using conv layer with stride of 2*)
    
    
- The feature map size reduces while the number of channels increases as we go deeper into the network
- Finally, we apply global average pooling followed by a fully connected layer whose size matches the number of output classes
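
The second design rule can be verified with a little arithmetic. The sketch below assumes 3×3 convolutions and the stage sizes of the ImageNet models (a 56×56 feature map with 64 filters in the first residual stage); the helper names `conv_macs` and `stage_plan` are hypothetical.

```python
def conv_macs(feature_size, in_ch, out_ch, kernel=3):
    """Multiply-accumulates of one conv layer with a square output feature map."""
    return feature_size * feature_size * in_ch * out_ch * kernel * kernel


def stage_plan(first_size=56, first_filters=64, num_stages=4):
    """Halve the feature map and double the filters from one stage to the next."""
    plan = []
    size, filters = first_size, first_filters
    for _ in range(num_stages):
        plan.append((size, filters, conv_macs(size, filters, filters)))
        size //= 2       # downsampling with a stride-2 convolution
        filters *= 2     # doubled to preserve the time complexity per layer
    return plan


for size, filters, macs in stage_plan():
    print(f"{size:>2}x{size:<2}  filters={filters:<4} MACs per layer={macs}")
```

Halving the height and width divides the cost by 4, and doubling both input and output channels multiplies it by 4, so every row reports the same number of multiply-accumulates per layer.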

<center>
    <h4> Comparison between 34-layer Plain and Residual networks </h4>
</center>


![plain_vs_resnet](images/plain_vs_resnet.jpeg)

- The solid shortcut connections indicate identity mappings
- The dotted shortcut connections indicate shortcuts that increase dimensions (projection matrix approach)

<center>
    <h4> Resnet Family Table</h4>
</center> 


![resnetFamilyTable](images/resnetFamilyTable.png)



### Identity vs. Projection Shortcuts
- The parameter-free identity shortcuts help with training. For the shortcuts themselves, the authors compare 3 options : 
    - Option A : zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter free
    - Option B : projection shortcuts are used for increasing dimensions, and other shortcuts are identity
    - Option C : All shortcuts are projections.

- In terms of accuracy, `Option A < Option B < Option C`. B is slightly better than A because the zero-padded dimensions in A have no residual learning. C is marginally better than B due to the extra parameters of its projection shortcuts, but the differences are small, so `the authors avoid Option C in order to keep memory / time complexity and model size down`
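
To make the difference between the options concrete, here is a minimal sketch that looks at the channel vector of a single spatial position. It ignores the stride-2 spatial subsampling that accompanies a dimension increase, and the function names are hypothetical.

```python
def zero_pad_shortcut(x, out_channels):
    """Option A: keep x and append zeros for the new channels (no parameters)."""
    return list(x) + [0.0] * (out_channels - len(x))


def projection_shortcut(x, w_s):
    """Options B/C: learned projection W_s x (a 1x1 conv at one spatial position)."""
    return [sum(w * a for w, a in zip(row, x)) for row in w_s]


def shortcut_params(in_ch, out_ch, use_projection):
    """Learned weights in one shortcut (biases ignored)."""
    return in_ch * out_ch if use_projection else 0


x = [0.3, -1.2]                                    # 2 input channels
print(zero_pad_shortcut(x, 4))                     # [0.3, -1.2, 0.0, 0.0]

w_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
print(len(projection_shortcut(x, w_s)))            # 4

print(shortcut_params(64, 128, use_projection=False))  # 0
print(shortcut_params(64, 128, use_projection=True))   # 8192
```

With Option A the padded channels carry only zeros through the shortcut, so the block output in those channels is just `F(x)` with no reference to the input, which is the "no residual learning" remark above.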

### Basic Blocks
- Basic block containing 2 layers of conv-batchnorm-relu modules
- Used in resnet18 and resnet34 architectures

![BasicBlock](images/basicBlock.png)


### BottleNeck Blocks
- Bottleneck block containing 3 layers of conv-batchnorm-relu modules
- The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions.
- Used in resnet50, resnet101 and resnet152 architectures

![BottleNeckBlock](images/bottleNeckBlock.png)


- **NOTE** : The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.
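
A quick parameter count illustrates both block designs and the note above. The sketch counts convolution weights only (biases and batch norm parameters are ignored) and assumes a 64-channel basic block and a 256-channel bottleneck block reduced to 64 channels, as in the block diagrams; the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights of one conv layer (biases and batch norm ignored)."""
    return in_ch * out_ch * kernel * kernel


def basic_block_params(channels):
    """Two 3x3 convolutions at constant width."""
    return 2 * conv_params(channels, channels, 3)


def bottleneck_block_params(channels, reduced):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 restore."""
    return (conv_params(channels, reduced, 1)
            + conv_params(reduced, reduced, 3)
            + conv_params(reduced, channels, 1))


basic = basic_block_params(64)
bottleneck = bottleneck_block_params(256, 64)
projection = conv_params(256, 256, 1)   # 1x1 projection between the 256-d ends

print(basic)                    # 73728
print(bottleneck)               # 69632
print(bottleneck + projection)  # 135168
```

The bottleneck block works on 4x wider features for a similar number of weights, and adding a projection shortcut between its two 256-dimensional ends nearly doubles the block, which is why identity shortcuts matter most for this design.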

## 5. Results of Residual Learning

The residual networks showed a significant improvement over their plain counterparts and, more importantly, provided a way to train very deep architectures.

<center>
    <h5> Plain vs Residual Networks Top-1 Error (%) on ImageNet validation</h5>
</center> 

![plain_vs_resnet_results_table](images/plain_vs_resnet_results_table.png)


<u> <i> Observations </i></u>
- Residual networks are better than Plain networks for both the 18-layer and 34-layer architectures
- The 34-layer Plain network has a higher error than the 18-layer Plain network
- The 34-layer Residual network has a lower error than the 18-layer Residual network
- The 18-layer Residual network has nearly the same error as the 18-layer Plain network (only marginally lower)

<center>
    <h5> Plain vs Residual Networks Training Validation curves </h5>
</center> 

![plain_vs_resnet_Training curves](images/plain_vs_resnet_results_curves_plot.png)


### Inference
- One can see that the training error of the 34-layer Plain network is higher than that of the 18-layer Plain network, demonstrating the degradation problem.
- The trend is reversed for the Residual networks, where the 34-layer network has lower training and validation error than the 18-layer network
- The 18-layer Residual network, though similar in accuracy to the 18-layer Plain network, ***converges faster***

### Analysis of Layer Responses

![layerResponses](images/layerResponses.png)

- For ResNets, this analysis of the layer responses reveals the response strength of the residual functions.
- It is evident from the above image that ResNets generally have smaller responses than their plain counterparts. 
- These results support the authors' basic motivation that the residual functions might be generally closer to zero than the non-residual functions.
- The deeper ResNets also have smaller response magnitudes, as evidenced by the comparison among ResNet-20, 56, and 110. 
- When there are more layers, an individual layer of a ResNet tends to modify the signal less.


## 6. References

1. [ResNet function classes](https://d2l.ai/chapter_convolutional-modern/resnet.html)
2. [viso.ai blog post](https://viso.ai/deep-learning/resnet-residual-neural-network/)
3. [cv-tricks blog post](https://cv-tricks.com/keras/understand-implement-resnets/)
4. [Great Learning blog post](https://www.mygreatlearning.com/blog/resnet/)
5. [Understanding and visualizing ResNets](https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8)