# Deep Residual Networks

[Kaiming He](http://kaiminghe.com/index.html)

06/19/2016

[Tutorial page](http://kaiminghe.com/icml16tutorial/index.html)

__Abstract__

*Deeper neural networks are more difficult to train. Beyond a certain depth, traditional deeper networks start to show severe underfitting caused by optimization difficulties. This tutorial will describe the recently developed residual learning framework, which eases the training of networks that are substantially deeper than those used previously. These residual networks are easier to converge, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with depth of up to 152 layers --- 8x deeper than VGG nets but still having lower complexity. These deep residual networks are the foundations of our 1st-place winning entries in all five main tracks in ImageNet and COCO 2015 competitions, which cover image classification, object detection, and semantic segmentation.*

*In this tutorial we will further look into the propagation formulations of residual networks. Our latest work reveals that when the residual networks have identity mappings as skip connections and inter-block activations, the forward and backward signals can be directly propagated from one block to any other block. This leads us to promising results of 1001-layer residual networks. Our work suggests that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning.*

__Papers__

*[Deep Residual Learning for Image Recognition](http://arxiv.org/abs/1512.03385)*

*[Identity Mappings in Deep Residual Networks](http://arxiv.org/abs/1603.05027)*

*Watch a condensed version of this lecture [here](https://www.youtube.com/watch?v=1PGLj-uKT1w&feature=youtu.be).*

*ResNet* (100 - 1000 layers)

- Deepest ImageNet networks in 2014: about 20 layers
- Winning network in 2015 (ResNet): 152 layers

### Training deep residual networks

Given a set of images, and a set of 3 - 5 classification labels:

- Naïve approach:
  - SVM
  - Pixels as input

- More sophisticated, traditional image recognition:
  - Edge detection
  - Color normalization
  

Think of the traditional pipeline as a 4 - 5 layer process. A traditional image classification pipeline generally requires some domain expertise; deep learning typically requires less.

If we want to train networks with more than, say, 10 or 30 layers, we sometimes need skip connections. Furthermore, if we need over 100 layers (up to 1000 layers!), we need to design the skips carefully.

## Say we have a linear activation function:

__Backing up to a single layer__:

- Generally computed as a matrix multiplication
- If the activation is simply linear (with a single layer), the variance of the output is proportional to the variance of the input
    - With more layers, the variance of the output is the variance of the input multiplied by a product of per-layer factors (one factor for each layer in between)
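
As a sketch of the statement above, assume layer $l$ computes $y_l = W_l x_l$ with $n_l$ inputs, and that the weights are zero-mean and independent of the (zero-mean) inputs. The symbols $n_l$, $W_l$, $w_l$ and $L$ are introduced here for illustration and are not part of the notes. Under those assumptions, a single layer gives

$$\mathrm{Var}[y_l] = n_l \, \mathrm{Var}[w_l] \, \mathrm{Var}[x_l]$$

and stacking $L$ linear layers gives

$$\mathrm{Var}[y_L] = \mathrm{Var}[x_1] \prod_{l=1}^{L} n_l \, \mathrm{Var}[w_l]$$

so each layer contributes one factor $n_l \, \mathrm{Var}[w_l]$ to the product.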

__Controlling the variance of the network:__

- If the variance of each layer is slightly smaller than an ideal value, your gradient will diminish
- If the variance of each layer is slightly larger than an ideal value, your gradient will explode
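
To see why "slightly" smaller or larger matters, here is a minimal sketch that compounds a per-layer variance factor over many layers. The factors 0.9 and 1.1 and the depth of 100 are hypothetical values chosen for illustration, and `compounded_scale` is not a function from any library.

```python
def compounded_scale(per_layer_factor: float, depth: int) -> float:
    """Overall scale after `depth` layers that each multiply the
    signal variance by `per_layer_factor`."""
    return per_layer_factor ** depth


if __name__ == "__main__":
    # Hypothetical factors: slightly too small, ideal, slightly too large.
    for factor in (0.9, 1.0, 1.1):
        scale = compounded_scale(factor, depth=100)
        print(f"factor={factor}: scale after 100 layers = {scale:.3e}")
```

A factor only 10% away from the ideal value changes the overall scale by several orders of magnitude at this depth.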

__Initialization__

- Need healthy forward and backward propagation
  - The number of input nodes multiplied by the variance of the weights should equal 1
  - If you have a healthy forward propagation signal, you'll typically have a healthy backpropagation signal
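
The criterion in the list above can be sketched as follows, assuming zero-mean Gaussian weights (the notes do not name a distribution). The helper names and the layer size of 256 are hypothetical.

```python
import math
import random


def linear_init_std(fan_in):
    """Weight standard deviation that satisfies fan_in * Var[w] = 1."""
    return math.sqrt(1.0 / fan_in)


def sample_weights(fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in matrix of zero-mean Gaussian weights."""
    rng = random.Random(seed)
    std = linear_init_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_variance(weights):
    """Population variance of all entries of a weight matrix."""
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)


if __name__ == "__main__":
    fan_in = 256  # hypothetical layer width
    weights = sample_weights(fan_in, fan_out=256)
    # The product should come out close to 1.
    print(fan_in * empirical_variance(weights))
```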
  

## Now say we are using ReLU activation

__Initialization__:

- The number of inputs in each layer multiplied by the variance of the weights should *equal 2* instead of 1, because ReLU zeroes out roughly half of the activations.
- If you do something wrong in any layer, the result is amplified exponentially throughout the forward propagation
- A better initialization will help improve convergence
- The deeper the network, the more important the initialization criteria
  - A poor initialization can destroy the network's ability to converge
- [Batch normalization paper](http://jmlr.org/proceedings/papers/v37/ioffe15.pdf) says we should normalize the inputs of each layer, not only the input to the network
  - Critical for faster training of deep neural nets
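
The ReLU initialization rule in the list above can be checked with a small simulation. This is a sketch under stated assumptions: each layer is a bias-free `y = W relu(x)` with Gaussian weights, and the width of 128 and depth of 8 are arbitrary. The function name `relu_forward_gain` and its `gain` parameter (the product of the number of inputs and the weight variance) are hypothetical.

```python
import math
import random


def relu_forward_gain(width, depth, gain, seed=0):
    """Ratio of output to input mean-square activation in a random ReLU net.

    Each layer computes y = W relu(x) with W_ij ~ N(0, gain / width),
    so `gain` equals (number of inputs) * Var[w].
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_power = sum(v * v for v in x) / width
    std = math.sqrt(gain / width)
    for _ in range(depth):
        activated = [max(v, 0.0) for v in x]
        # One freshly sampled weight row per output unit.
        x = [sum(rng.gauss(0.0, std) * a for a in activated) for _ in range(width)]
    return (sum(v * v for v in x) / width) / input_power


if __name__ == "__main__":
    print("gain = 2:", relu_forward_gain(128, 8, gain=2.0))
    print("gain = 1:", relu_forward_gain(128, 8, gain=1.0))
```

With `gain = 1` every layer roughly halves the signal power, so after 8 layers the output is smaller by a factor of about 2^8 than with `gain = 2`.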

__Batch normalization__:

$$\hat{x}_i = \frac{x_{i}-\mu}{\sigma}$$

- µ and σ are the mean and standard deviation of the mini-batch
  - In training mode, σ and µ contribute to the back-propagation
- Easy to make errors in batch normalization
- Greatly accelerates the training of deep neural networks
- Improves accuracy
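
A minimal sketch of the normalization step for one mini-batch of scalar activations. The small constant `eps` is an assumption added for numerical safety and does not appear in the formula above; the learned scale and shift used in practice are also left out.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize a mini-batch of scalars to roughly zero mean, unit variance."""
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    sigma = math.sqrt(var + eps)  # eps is an assumed guard against division by zero
    return [(x - mu) / sigma for x in batch]


if __name__ == "__main__":
    print(batch_norm([1.0, 2.0, 3.0, 4.0]))
```

Because `mu` and `sigma` are computed from the batch itself, every output depends on every input in the batch, which is why both statistics take part in back-propagation during training.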

## Going deeper

*Is learning better networks as simple as stacking more layers?*

Simply stacking does not improve our performance:

![Image](img/stacking.png)

![Image](img/stacking2.png)

*This motivates the development of deep residual networks (DRN)*

__Rather than fit *H(x)* directly, fit the residual *F(x)*, where *H(x) = F(x) + x*__. When *F(x)* is close to zero, the block approaches an identity function.

If our network has many, many layers, we can expect each layer to do less than a layer in a shallow network. In this case, our layers should stay closer to an identity mapping.
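
The residual formulation can be sketched in a few lines. The two residual branches below are hypothetical stand-ins for learned layers; they only illustrate that a branch producing zero turns the block into an identity mapping.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where `residual_fn` plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Hypothetical branch that has learned to output zero."""
    return [0.0 for _ in x]


def small_residual(x):
    """Hypothetical branch that applies a small correction."""
    return [0.1 * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    print(residual_block(x, zero_residual))   # same values as x
    print(residual_block(x, small_residual))  # x plus a small correction
```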

Before deep learning became prevalent, the most successful hand-crafted features encoded residual vectors with respect to a "dictionary" (residual representations).

### Network design

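Each layer is:
  - 3x3 conv (almost all layers)
  - when the spatial size is halved (/ 2), the number of filters is doubled
  - simple design, __just deep__
  - skip connections each bypass a small group of layers
    - sometimes three
    - the three-layer variant uses a bottleneck structure:
      - alternating layers compress and then restore the feature space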
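
To illustrate why a bottleneck is attractive, the sketch below counts convolution weights for a plain two-layer block and for a three-layer bottleneck. The channel sizes (256 at full width, 64 inside the bottleneck) are assumptions chosen for illustration, not values stated in these notes, and biases and batch-norm parameters are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def plain_block_params(channels):
    """Two stacked 3x3 convolutions at full width."""
    return 2 * conv_params(3, channels, channels)


def bottleneck_block_params(channels, reduced):
    """1x1 reduce, 3x3 at the reduced width, then 1x1 restore."""
    return (conv_params(1, channels, reduced)
            + conv_params(3, reduced, reduced)
            + conv_params(1, reduced, channels))


if __name__ == "__main__":
    print("plain:     ", plain_block_params(256))
    print("bottleneck:", bottleneck_block_params(256, 64))
```

Under these assumed sizes, the bottleneck block has three layers instead of two but needs far fewer weights.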
  
Furthermore:
  - No max pooling (almost)
  - No hidden fc
  - No dropout
  - Only one fully connected layer
  
__Training__
  - Trained from *scratch*
  - Use batch normalization
  - Standard hyper-parameters
  
![Image](img/resnet1.png)

*__ResNet loss on ImageNet is nearly half that of its second-place competitor__*

Caveats of deep learning:
  - Deeper networks are harder to optimize
    - Even if the solution exists, the solver may not find it
  - Generalizability
    - How well will the net generalize to test/validation data?
    - Increasing the width can improve training error, but can hurt test performance
    
How does DRN solve these?
  - Improves the optimization
  - Does __not__ address generalizability
    - Although optimization improvements allow for deeper, thinner networks

### Going from 100 layers to 1000 layers

__On identity mappings for optimization__
  - shortcut mapping: h = identity
  - very smooth forward propagation
      - response of any layer is equal to the response of any shallower layer + the summation of a set of residual functions
      - the feature of any deeper layer is an additive outcome
  - very smooth back propagation
      - the gradient of any shallower layer is an additive outcome of the deeper layers
      - unlikely for this gradient to diminish
          - the presence of a multiplicative outcome directly causes the diminishing gradient problem
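
The two "additive" statements above can be written out as follows, assuming every block computes $x_{l+1} = x_l + F(x_l)$ with an identity shortcut and nothing applied after the addition. Here $x_l$ is the input to block $l$, $L$ indexes any deeper block, and $E$ is the loss; this notation is an assumption introduced for this sketch.

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$

$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$$

The constant term $1$ carries the gradient from block $L$ straight back to block $l$ without passing through any weight layer, which is why it is unlikely to vanish.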
          
          
__Experiments (all performed on ResNet with 100 layers)__

Results of the shortcut connection:
  - What if a shortcut ≠ identity?
      - The shortcut was scaled by a constant 0.5
          - Much higher error, roughly 3-4 times greater
              - The multiplicative effect impeded gradient propagation
      - 1x1 conv layer followed by sigmoid activation
          - Error was about 2.5 times greater
          - Kept the bottleneck 3x3 conv structure
      - Made the shortcut function a 1x1 conv layer
          - Still higher error
      - In all of these cases the shortcut is impeded by some kind of multiplication. If the shortcut mapping is multiplicative, the directly propagated signal can decay. For instance, if the shortcut scales the input by a constant, the signal and its gradient are multiplied by that constant once per block, so they vanish or explode exponentially with depth depending on whether the constant is below or above 1.
      - Gate bias (for the sigmoid-gated shortcut) initialized at 0
          - If you initialize the bias to be a very negative number, the gate on the shortcut becomes close to 1, which will exhibit similar results to the identity-skip formulation
  - What if the shortcut = identity or ReLU (special gate: either multiplies signal by 1 or 0)?
      - 1st design: weight layer, batch norm, 3x3 conv, batch norm and elementwise addition
      - 2nd design: same, but move the last batch norm step to after the elementwise addition
          - weight layer, batch norm, 3x3 conv, elementwise addition, batch norm
      - 3rd design: move elementwise addition before the weight layer
          - elementwise addition, weight layer, batch norm, 3x3 conv, batch norm
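
To make the non-identity shortcut experiments above more concrete, here is a rough sketch of how a multiplicative shortcut affects the directly propagated signal. It assumes the shortcut is scaled by `1 - sigmoid(bias)` and ignores the input-dependent part of the gate; the depth of 50 blocks and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def shortcut_scale(gate_bias):
    """Assumed factor on the shortcut: 1 - sigmoid(bias)."""
    return 1.0 - sigmoid(gate_bias)


def direct_path_factor(per_block_scale, num_blocks):
    """Factor picked up by a signal crossing `num_blocks` scaled shortcuts."""
    return per_block_scale ** num_blocks


if __name__ == "__main__":
    for bias in (0.0, -6.0):
        scale = shortcut_scale(bias)
        total = direct_path_factor(scale, num_blocks=50)
        print(f"bias={bias}: per-block scale={scale:.4f}, after 50 blocks={total:.3e}")
    print("identity:", direct_path_factor(1.0, num_blocks=50))
```

A bias of 0 behaves like constant scaling by 0.5, so the direct path all but disappears after 50 blocks, while a very negative bias keeps the per-block scale near 1 and approaches the identity shortcut.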
          
With the original design, increasing the depth from 100 to 200 layers degraded performance. With the improved arrangement of activations, going even deeper improved performance dramatically. See [the code](https://github.com/KaimingHe/resnet-1k-layers).

### Future work

- Representation
    - Study of tradeoff between depth and width
        - Shallower, wider networks may offer better performance (but are difficult to converge)
- Generalization
    - Dropout
    - MaxOut
- Optimization

### Applications

*Features matter*

- Deeper features are more transferable to other recognition tasks

__CNN pipe__

![Image](img/cnn.png)

__Acknowledgements__

The concept of ResNet is not limited to computer vision; it is suitable for other tasks, such as:

- Speech recognition
- Image generation
- NLP

__Takeaways__:

- Deeper is better
- Features matter
- Fast R-CNN is simply amazing