____________________________________

# Deep Residual Learning for Image Recognition

https://arxiv.org/pdf/1512.03385.pdf

Paper summary by: vijay Mariappan
_____________________
*Deep residual nets swept all detection and classification competitions in ILSVRC & COCO 2015.* The network the authors used had 152 layers, an impressive 8 times deeper than a comparable VGG network.

<img src='ILSVRC.png'>
_________________

### Claims:

> **Deeper 'plain' neural networks are more difficult to train.**

<img src='train_error.png'>

The plot above shows the training and test error of 20-layer and 56-layer 'plain' neural networks. The surprising part is not that the test error of the 56-layer network is larger than that of the 20-layer network, but that its training error is also larger.

Shouldn't a larger network overfit the data and therefore reach a lower training error than a smaller one? Or is the problem caused by the notorious vanishing/exploding gradients? The authors claim that this behaviour is an optimization problem: deeper models are harder to optimize. They construct an example to show this.

### An experiment


<img src='shallower.png'>

In the above experiment:

 * The deeper model is constructed by copying the original shallow network and adding 'extra layers' that are set to the identity (these layers output their input unchanged). So this should produce a training error no higher than that of its shallower counterpart, right?
 * In practice, no: the training error is higher, and current solvers (such as SGD or its momentum-based variants) are unable to find a solution that is as good.
 * The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.
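
The construction in the first bullet can be sketched in NumPy. This is a minimal illustration, not the authors' code: the dense ReLU layers (instead of convolutions), the layer sizes, and the names `shallow_forward` and `deeper_forward` are all assumptions made for brevity. It only shows that a deeper network with identity 'extra layers' computes exactly the same function as the shallow one, so a solution with equal training error exists by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical shallow network: two dense ReLU layers (sizes are arbitrary).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def shallow_forward(x):
    return relu(relu(x @ W1) @ W2)

# Deeper network: the same two layers plus extra layers set to the identity.
# The shallow output is already non-negative (post-ReLU), so relu(h @ I) == h
# and each extra layer passes its input through unchanged.
extra_layers = [np.eye(4) for _ in range(3)]

def deeper_forward(x):
    h = shallow_forward(x)
    for W in extra_layers:
        h = relu(h @ W)
    return h

x = rng.normal(size=(5, 8))
max_diff = np.max(np.abs(deeper_forward(x) - shallow_forward(x)))
print("max difference between deep and shallow outputs:", max_diff)
```

The degradation problem is that a solver starting from random weights does not find this solution (or a better one) on its own.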
 ______________________

## Addressing the above degradation problem using deep residual framework
* Hypothesis: It is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
* Let's say the desired underlying mapping is $H(x)$; the authors let the stacked non-linear layers fit another mapping: $F(x) := H(x) - x$. The original mapping is recast into $F(x) + x$.
* The formulation of $F(x) + x$ can be realized by feedforward neural networks with “shortcut connections”.
* So the solvers need to learn the residual function, i.e. $F(x) = H(x) - x$.
* With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
* In real cases, it is unlikely that identity mappings are optimal, but the reformulation may help to precondition the problem.

<img src='resnet_new.png'>

Note: The desired mapping is still $H(x)$, but the solvers need to learn only $F(x)$ as the shortcut connections don't require additional parameters.
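
A minimal NumPy sketch of the $F(x) + x$ formulation, under stated assumptions: `residual_block` is a hypothetical name, $F$ is taken to be two dense layers with a ReLU in between (the paper uses convolutions), and the ReLU that usually follows the addition is left out so the identity case is exact. It illustrates that zero weights give the identity mapping and that small weights give a small correction to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Compute F(x) + x, where F is two dense layers with a ReLU in between."""
    f = relu(x @ W1) @ W2   # residual function F(x)
    return f + x            # shortcut connection: adds x, no extra parameters

d = 6
x = rng.normal(size=(3, d))

# With all weights at zero, F(x) = 0 and the block is exactly the identity.
zeros = np.zeros((d, d))
y_identity = residual_block(x, zeros, zeros)

# With small weights, the block output is a small perturbation of x.
W1 = 0.01 * rng.normal(size=(d, d))
W2 = 0.01 * rng.normal(size=(d, d))
y_small = residual_block(x, W1, W2)

print("identity recovered:", np.array_equal(y_identity, x))
print("max deviation with small weights:", np.max(np.abs(y_small - x)))
```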



## Let's summarize up to this point
* The authors think that the current solvers and architecture have a hard time learning the identity function for certain layers of the net, and may have an easier time learning zero mappings for those same layers. 
* Ultimately, this motivates their setup for learning small residuals and adding them to the input, rather than just transforming the whole input directly.
* Each subsequent layer is only responsible for, in effect, fine tuning the output from a previous layer by just adding a learned "residual" to the input. This differs from a more traditional approach where each layer has to generate the whole desired output.
____________________________

## Network Design

<img src='plainvsRes_v2.png'>

**Full ResNet architecture:**
* Stack residual blocks

* Every residual block has two 3x3 conv layers

* Periodically, double the number of filters and downsample spatially using stride 2.
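
A rough sketch of how the feature-map shape evolves under these rules. The stage layout assumed here (3, 4, 6, 3 blocks with 64, 128, 256, 512 filters, after a stride-2 7x7 convolution and a stride-2 pooling layer) is the 34-layer configuration as I understand it from the paper; the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(input_size=224, blocks_per_stage=(3, 4, 6, 3), base_filters=64):
    """Return (stage name, spatial size, channels, number of blocks) tuples."""
    size = input_size // 2          # 7x7 conv, stride 2
    size = size // 2                # pooling, stride 2
    channels = base_filters
    shapes = []
    for i, n_blocks in enumerate(blocks_per_stage):
        if i > 0:
            size //= 2              # first block of the stage uses stride 2
            channels *= 2           # ...and doubles the number of filters
        shapes.append(("stage%d" % (i + 1), size, channels, n_blocks))
    return shapes

for name, size, channels, n_blocks in trace_shapes():
    print(name, "%dx%dx%d" % (size, size, channels), "blocks:", n_blocks)

# Weighted layers: 1 stem conv + 2 convs per block + 1 final fully connected layer.
n_layers = 1 + 2 * sum(n for _, _, _, n in trace_shapes()) + 1
print("weighted layers:", n_layers)
```

Halving the spatial size while doubling the filters keeps the per-layer cost roughly constant across stages.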

<img src='experiments.png'>

<img src='practical-design.png'>

## Computing the complexity of the above

### On the left side:
  * The input is `32*32*64` and the padding is SAME.
  * For the 1st conv (3x3, 64): the number of input channels is 64, so for each output pixel the computation would be `3*3*64`. Since the output is `32*32*64`, the number of multiplications is `32*32*64*3*3*64`.
  * For the 2nd conv (3x3, 64): it is the same as above, since the input and output channels are both 64: `32*32*64*3*3*64`.
  
  **Total = 75M**
  
### On the right side:
  * The input is `32*32*256` and the padding is SAME
  * For the 1st conv (1x1, 64): the number of input channels is 256, so for each output pixel the computation would be `1*1*256`. Since the output is `32*32*64`, the total number of multiplications is `32*32*64*1*1*256`.
  * For the 2nd conv (3x3, 64): it is `32*32*64*3*3*64`, as calculated previously.
  * For the 3rd conv (1x1, 256): the number of input channels is 64, so for each output pixel the computation would be `1*1*64`. Since the output is `32*32*256`, the total number of multiplications is `32*32*256*1*1*64`.
  
  **Total = 71M**
  
If the left-side design (two 3x3 convs) were instead given the `32*32*256` input used on the right side, counting `3*3*256` multiplications per output pixel, the total computation would be `32*32*64*3*3*256*2` ≈ 300M, very expensive!
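
The arithmetic above can be checked with a few lines of Python. This is only a sketch: `conv_mults` is a hypothetical helper that counts multiplications for a stride-1, SAME-padded convolution and ignores biases and additions.

```python
def conv_mults(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1, SAME-padded kernel x kernel convolution."""
    return height * width * out_channels * kernel * kernel * in_channels

H = W = 32

# Left: two 3x3 convs, 64 -> 64 -> 64 channels.
left = conv_mults(H, W, 64, 64, 3) + conv_mults(H, W, 64, 64, 3)

# Right (bottleneck): 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256).
right = (conv_mults(H, W, 256, 64, 1)
         + conv_mults(H, W, 64, 64, 3)
         + conv_mults(H, W, 64, 256, 1))

# The last expression from the text: 3x3 convs with 256 input channels
# and 64 output channels, counted twice.
wide = 2 * conv_mults(H, W, 256, 64, 3)

print("left: %.1fM" % (left / 1e6))
print("right: %.1fM" % (right / 1e6))
print("wide: %.1fM" % (wide / 1e6))
```

The exact counts are 75,497,472 (left), 71,303,168 (right) and 301,989,888 (wide), which match the rounded totals quoted above.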

## Commentary
One benefit of skip connections is the ease with which gradients flow through them. Let's calculate this for one block:

<img src='gradients.png'>

Now if you stack several of these blocks, the gradient takes the form $\frac{\partial E}{\partial y}\,(1 + F'_{1})(1 + F'_{2})\cdots(1 + F'_{N})$, whereas without the skip connections it would be $\frac{\partial E}{\partial y}\,F'_{1}F'_{2}\cdots F'_{N}$. Each $F'_{i}$ depends on the product of the block's weights (roughly $W_1 W_2$), which are generally initialized in $[-1, 1]$, so in most cases $|F'_{i}| < 1$. The plain product then shrinks towards zero as $N$ grows (vanishing gradients), while the extra $1$ in each factor of the skip-connection version keeps the product from collapsing in the same way.
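
A toy numeric illustration of the two products, treating each $F'_{i}$ as a scalar. The values used (50 blocks, each with $F'_{i} = 0.1$) are assumptions chosen only to make the effect visible, and `chain_products` is a hypothetical helper.

```python
def chain_products(local_grads):
    """Return (plain, residual) products for a list of per-block derivatives F'_i."""
    plain, residual = 1.0, 1.0
    for g in local_grads:
        plain *= g            # without skip connection: factor F'_i
        residual *= 1.0 + g   # with skip connection: factor (1 + F'_i)
    return plain, residual

# Assumed toy values: 50 blocks, each with F'_i = 0.1.
plain, residual = chain_products([0.1] * 50)
print("plain product:", plain)
print("residual product:", residual)
```

The plain product is vanishingly small while the residual product stays well above zero. Note that the residual product can also grow large, and that factors with $F'_{i}$ close to $-1$ would still shrink it, so this is an illustration rather than a guarantee.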

## Implementation
* A `224×224` crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted.
* Batch normalization (BN) is applied right after each convolution and before the activation.
* ReLU is the activation function used.
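
The crop-and-flip step from the first bullet could look roughly like the sketch below. The function name `random_crop_flip`, the 256x256 source size, the zero mean image, and subtracting the mean before cropping are all assumptions made for illustration; the paper's full pipeline has further steps that are not covered here.

```python
import numpy as np

def random_crop_flip(image, mean_image, crop=224, rng=None):
    """Randomly flip and crop an HxWxC image after subtracting a per-pixel mean."""
    rng = np.random.default_rng() if rng is None else rng
    image = image.astype(np.float32) - mean_image   # per-pixel mean subtraction
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                    # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)              # random top-left corner
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

# Hypothetical 256x256 RGB image and a zero mean image, for illustration only.
image = np.random.default_rng(0).integers(0, 256, size=(256, 256, 3))
mean_image = np.zeros((256, 256, 3), dtype=np.float32)
patch = random_crop_flip(image, mean_image, rng=np.random.default_rng(1))
print(patch.shape)
```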

<img src='block.png'>

## Experiments in TensorFlow

In [55]:
import tensorflow as tf
import numpy as np
import sys
print(tf.__version__)
print(sys.version)

1.8.0
3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]


## A simple experiment to check that gradients flow better in a ResNet model than in a plain vanilla model

* Create two models: one with skip connections and one without.
* BN is removed for simplicity.
* The block is repeated 50 times to construct a deep model.
* Measure the gradient flow in each architecture by computing the gradient of the output with respect to the input.
* The ResNet should have a higher L2 norm of the gradient than the plain network.

In [56]:
# Input used is 224x224
input = tf.placeholder(tf.float32,(1, 224, 224, 3))

# If skip_connection is False, the block is a plain (non-residual) block.
def resnet_block(input_layer, output_channel, first_block=False, skip_connection=True):

    if first_block:
        # Stem: 7x7 conv with stride 2, followed by 2x2 average pooling
        conv = tf.layers.conv2d(input_layer, output_channel, (7, 7), strides=(2, 2), padding='SAME')
        output = tf.layers.average_pooling2d(conv, (2, 2), (2, 2), padding='VALID')

    else:
        # First 3x3 conv followed by a single ReLU (batch norm removed for simplicity)
        conv = tf.layers.conv2d(input_layer, output_channel, (3, 3), strides=(1, 1), padding='SAME')
        act = tf.nn.relu(conv)

        # Second 3x3 conv; its ReLU is applied after the optional shortcut addition
        conv = tf.layers.conv2d(act, output_channel, (3, 3), strides=(1, 1), padding='SAME')

        if skip_connection:
            output = tf.nn.relu(conv + input_layer)
        else:
            output = tf.nn.relu(conv)

    return output



In [57]:
# Stack 50 blocks for each network (the first block is the stem; the rest are residual or plain blocks)
input_layer_res = input
input_layer_plain = input
is_first = True
for _ in range(50):
    output_resnet = resnet_block(input_layer_res, 64, first_block=is_first)
    output_plain = resnet_block(input_layer_plain, 64, first_block=is_first,skip_connection=False)
    is_first = False
    input_layer_res = output_resnet
    input_layer_plain = output_plain
t_grad_resnet = tf.gradients(output_resnet, input)[0]
t_grad_plain = tf.gradients(output_plain, input)[0]

In [59]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Evaluate the gradients on 10 random inputs; only the last result is kept.
    for i in range(10):
        g_resnet, g_plain = sess.run([t_grad_resnet, t_grad_plain], {input: np.random.normal(size=(1,224,224,3))})

In [63]:
from numpy import linalg as LA
print('Resnet case l2 norm:',LA.norm(g_resnet))
print('Plain Vanilla, l2 norm:',LA.norm(g_plain))

Resnet case l2 norm: 407905.56
Plain Vanilla, l2 norm: 2.4008885e-13


**The ResNet gradient norm is many orders of magnitude larger than that of the plain network, so the ResNet indeed has better gradient flow.**
