# Notable Architectures - II


In this notebook, we study the **ResNet** architecture and some of its variants.
- ResNet (2015)
- Pre-Activation ResNet (2016)
- Wide ResNet (2016)
- ResNeXt (Aggregated Residual Transformations for Deep Neural Networks) (2016)

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D, AveragePooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization

## ResNet (2015)

The modern convnet architectures that we have studied so far share a common principle.
- Deeper networks improve classification performance.

We have seen that deep networks are created by stacking computational layers (e.g., VGG block, inception block), which facilitate learning more expressive features. Thus, the key takeaway message is that network depth is the main factor towards creating powerful architectures.

However, the effort to build deeper networks soon hit a roadblock: deeper networks proved hard to train well. As the network depth increases, training accuracy saturates and then degrades rapidly. Researchers investigating this training performance **degradation** problem found that it is not caused by overfitting. Instead, it is a training optimization problem that arises from the increased depth of the network.

Let's illustrate the issue. We use two plain "VGG-like" networks: one deep (34 layers) and one shallow (18 layers). The 34-layer plain network is shown below.

The convolutional layers mostly have 3×3 filters. When the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. Downsampling is performed directly by the Conv layers using stride of 2. The network ends with a global average pooling layer and a 1000-way FC layer with softmax. 


<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_34LayerPlainNetwork.png" width=800, height=600>
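
The rule "halve the feature map, double the filters" can be checked with a little arithmetic. The sketch below is only illustrative: the helper `conv_macs` and the two stage sizes (56 × 56 with 64 filters, 28 × 28 with 128 filters) are assumptions chosen for the example, not values read from the figure.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size=3):
    """Multiply-adds of one Conv layer producing a height x width output map."""
    return height * width * in_channels * out_channels * kernel_size ** 2

# Hypothetical consecutive stages: 56x56 maps with 64 filters, then 28x28 with 128.
macs_stage_a = conv_macs(56, 56, 64, 64)
macs_stage_b = conv_macs(28, 28, 128, 128)
print(macs_stage_a, macs_stage_b)
```

Halving the height and width divides the number of output positions by 4, while doubling both the input and output channels multiplies the work per position by 4, so the two factors cancel.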

Compared with the VGG-19 architecture, this 34-layer plain network is significantly deeper. Yet the plain model, albeit deeper, has fewer filters and lower complexity than VGG-19. The 34-layer plain network has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_VGG19.png" width=700, height=600>

Our goal is to study the training performance of the 34-layer plain network and compare its performance with a shallower version of it (an 18-layer plain network that follows the same architecture). Both the training and validation errors of these two networks are shown below. Thin curves denote training error, and bold curves denote validation error.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_18vs34LayerPlainNetwork.png" width=500, height=400>

We see that the 34-layer plain network has higher training error throughout the whole training procedure. This is surprising because the solution space of the 18-layer plain network is a subspace of that of the 34-layer one. This illustration indicates the training performance degradation problem is a consequence of the network depth.

One plausible explanation for non-optimal training could be vanishing gradients. However, the plain networks are trained with batch normalization, which ensures non-zero variance in the forward signal and healthy norms in the backward gradients. Thus, the degradation problem is unlikely to be caused by vanishing gradients.

The explanation that a group of researchers led by Kaiming He at Microsoft Research proposed in 2015 was that deep plain nets may have exponentially low convergence rates, which slows the reduction of the training error.


### Training Deeper Models: An Optimization Problem
He et al. **hypothesized** that this problem is an optimization problem, i.e., deeper models are harder to
optimize. The deeper model should be able to perform at least as well as the shallower model. Then, how is it possible that a deeper model, extended from its shallower version, performs poorly? Let's address this question.

A neural network is a function approximator. Say that we have a shallow convnet architecture that can approximate a class of functions, denoted by $F_{\text{shallow}}$. If the true function $f_{\text{true}} \in F_{\text{shallow}}$, then theoretically it is possible to learn its approximation by finding suitable weights. To improve this approximation, we could design a deeper convnet with more expressive power that can learn a larger class of functions, denoted by $F_{\text{deep}}$. A reasonable expectation is that the deep model should give us a better approximation of $f_{\text{true}}$.

However, we might never converge to a better approximation if $F_{\text{shallow}} \nsubseteq F_{\text{deep}}$. In fact, our solution could be worse.

- Then, how do we guarantee that the deeper network is at least as good as its shallow version?

We need to ensure that $F_{\text{shallow}} \subseteq F_{\text{deep}}$. In other words, if we could train the newly added layer in the deep network into an identity function $f(x) = x$, then the new model would be as effective as the original model. As the new model may find a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.

How do we make the network learn an identity function?

Well, we could simply add the input x to the output of the network. In other words, we could add a short-cut from input to output, or add a **skip connection**.


Let's say that the goal of the network is to learn a target function $f(x)$. The network on the left of the figure below learns the mapping $f(x)$ directly (the part within the dotted-line box). On the right, the part within the dotted-line box only needs to learn the **residual mapping** $f(x) - x$, because the skip connection adds $x$ back. This makes an identity mapping $f(x) = x$ easier to learn: the learning process only needs to push the weights and biases of the upper weight layer (e.g., a fully-connected or Conv layer) within the dotted-line box to zero. This micro-network is known as the **residual block**. With residual blocks, inputs can forward propagate faster through the residual connections across layers.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualLearning.png" width=700, height=600>
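
The zero-weights argument can be tried out on a toy block. This is a minimal pure-Python sketch, not the notebook's Keras implementation: the helper names are made up for the example, the branch is two small dense layers instead of Conv layers, and the final ReLU after the addition is left out so that negative inputs pass through unchanged.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def residual_block(x, w1, w2):
    """Compute x + g(x), where g(x) = w2 @ relu(w1 @ x) is the residual branch."""
    branch = matvec(w2, relu(matvec(w1, x)))
    return [xi + bi for xi, bi in zip(x, branch)]

x = [1.5, -2.0, 0.25]
zeros = [[0.0] * 3 for _ in range(3)]

# With all branch weights at zero, the block reduces to the identity mapping.
y = residual_block(x, zeros, zeros)
print(y)
```

A plain block with the same zero weights would output all zeros instead, so it would have to learn a non-trivial weight setting just to pass its input through.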

A useful aspect of the residual learning is that if we add many skip connections, the network can start making progress even if several layers have not started learning yet (figure below). Due to the skip connections, the signal can easily make its way across the whole network. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_Backpropagation.png" width=700, height=600>
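
To make the point about untrained layers concrete, here is a toy comparison of a plain stack and a residual stack. It is a sketch under strong simplifying assumptions: each "layer" is a single scalar weight with no activation, and the function names and weight values are invented for the illustration.

```python
def plain_stack(x, weights):
    """Plain network: each layer replaces the signal, x <- w * x."""
    for w in weights:
        x = w * x
    return x

def residual_stack(x, weights):
    """Residual network: each block adds to the signal, x <- x + w * x."""
    for w in weights:
        x = x + w * x
    return x

def derivative(fn, x, weights, eps=1e-6):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps, weights) - fn(x - eps, weights)) / (2 * eps)

# Ten layers; the middle six have zero weights ("not learning yet").
weights = [0.5, 0.5] + [0.0] * 6 + [0.5, 0.5]
grad_plain = derivative(plain_stack, 1.0, weights)
grad_residual = derivative(residual_stack, 1.0, weights)
print(grad_plain, grad_residual)
```

In the plain stack a single zero-weight layer blocks both the signal and its gradient, whereas in the residual stack the zero-weight blocks simply act as identities and the remaining blocks still receive a gradient.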



## Very Deep Models using Residual Learning

Kaiming He et al. used residual blocks to avoid the training degradation problem. They designed deeper networks by stacking a set of residual blocks, which resulted in a new architecture named **ResNet**. They won the ILSVRC 2015 challenge with a 3.57% top-5 error rate. The winning variant used an **extremely deep ResNet composed of 152 layers** (other variants had 34, 50, and 101 layers).

## Residual Block Architecture

The following figure shows the architecture of a typical residual block. We use a special notation for the presentation. For example, the notation "$f_k, 3 × 3 + 1(S)$" indicates that the layer uses $f_k$ filters of size 3 × 3 with stride 1 and SAME padding.

The filters used in the Conv layers of a residual block are always 3 × 3, a choice inspired by the VGG design. Below we describe the residual block architecture.

- The residual block has two 3 × 3 convolutional layers with the same number of filters or output channels $f_k$. 

- The first Conv layer uses either a stride of 1 or 2 with SAME padding. 

- The Conv layers use stride of 1 everywhere except for the starting Conv layer in a stack of residual blocks with the same $f_k$ where a stride of 2 is used.

- Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. 

- Then, a skip connection bypasses the two convolution operations and adds the input directly before the final ReLU activation function.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualBlockArchitecture_0.png" width=400, height=300>


### Special Note on the Skip Connection

When the residual blocks are stacked sequentially in a ResNet, the number of feature maps ($f_k$) is doubled every few residual blocks, at the same time as their height and width are halved (using a Conv layer with stride 2), as illustrated below. 


<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualBlockArchitecture_1.png" width=500, height=300>


When this happens, the inputs cannot be added directly to the outputs of the residual block because they don't have the same shape. To solve this problem, the inputs are passed through a 1 × 1 Conv layer with stride 2 and the right number of output feature maps. The following figure shows the 1 × 1 Conv operation on the skip channel. The stride of 2 halves the height and width of the input map, so the skip output matches the shape of the output of the two Conv layers on the main path.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualBlockArchitecture_2.png" width=500, height=300>

Thus, every time we change the number of channels and halve the size of the input, we need to introduce an additional 1 × 1 Conv layer on the skip channel to transform the input into the desired shape for the addition operation. This is illustrated below.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualBlockArchitecture_3.png" width=800, height=600>
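
The shape bookkeeping behind this can be traced without building a model. The helper below is a hypothetical stand-in that only applies the SAME-padding rule (output size = ceil(input size / stride)); the 56 × 56 × 64 input and the 128 filters are example values.

```python
import math

def same_conv_output_shape(input_shape, filters, strides):
    """Output (height, width, channels) of a Conv layer with SAME padding.

    With SAME padding the spatial size depends only on the stride:
    out = ceil(in / stride), regardless of the kernel size.
    """
    height, width, _ = input_shape
    return (math.ceil(height / strides), math.ceil(width / strides), filters)

block_input = (56, 56, 64)

# Main path: 3x3 Conv with stride 2, then 3x3 Conv with stride 1.
main = same_conv_output_shape(block_input, filters=128, strides=2)
main = same_conv_output_shape(main, filters=128, strides=1)

# Skip path: a single 1x1 Conv with stride 2.
skip = same_conv_output_shape(block_input, filters=128, strides=2)

print(block_input, main, skip)
```

Without the 1 × 1 Conv layer, the skip path would still carry the block input shape, which differs from the main path output in all three dimensions, so the element-wise addition would fail.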


## Residual Block: TensorFlow Implementation

In [4]:
class Residual_Block(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
        
        '''
        Add two Conv layers
            - After the first Conv layer perform batch normalization (BN) and ReLU activation
            - After the second Conv layer, perform only BN

        Don't use bias neurons in the Conv layers (set the use_bias to False).
        Because the Conv layer is followed by a BN layer that adds a bias.
        The BN "shift" parameter shifts the output of the layer (thus acts like a bias). 
        '''
        self.main_layers = [
            # The stride of the first Conv layer is specified by the designer: could be 1 or 2
            keras.layers.Conv2D(filters, kernel_size=3, strides=strides, padding="SAME", use_bias=False),
            keras.layers.BatchNormalization(),
            self.activation,
            # The stride of the second Conv layer is always 1
            keras.layers.Conv2D(filters, kernel_size=3, strides=1, padding="SAME", use_bias=False),
            keras.layers.BatchNormalization()]
        self.skip_layers = []
        # If the stride of the first Conv layer is 2,
        #    then add a 1 x 1 Conv layer on the skip channel, followed by batch normalization
        if strides > 1:
            self.skip_layers = [
                keras.layers.Conv2D(filters, kernel_size=1, strides=strides, padding="SAME", use_bias=False),
                keras.layers.BatchNormalization()]
                
    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

## ResNet Model Architecture

The ResNet architecture is surprisingly simple: it starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual blocks.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_Architecture.png" width=800, height=600>

The first two layers of ResNet are the same as those of the GoogLeNet: the 7 x 7 Conv layer with 64 output channels and a stride of 2 is followed by the 3 x 3 maximum pooling layer with a stride of 2. The difference is the batch normalization layer added after each Conv layer in ResNet. 

GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules made up of residual blocks, each of which uses several residual blocks with the same number of output channels. 

The number of channels in the first module is the same as the number of input channels. Since a maximum pooling layer with a stride of 2 has already been used, it is not necessary to reduce the height and width. In the first residual block for each of the subsequent modules, the number of channels is doubled compared with that of the previous module, and the height and width are halved. 


### ResNet-18 Architecture

Below we implement the ResNet-18 architecture. 

It has 8 residual blocks (16 Conv layers, excluding the 1 x 1 Conv layer) + input Conv layer + output FC layer. Thus a total of 18 layers with learnable parameters (excluding batch normalization parameters). 

This model is commonly known as ResNet-18. 


### ResNet-18 Model: TensorFlow Implementation


In [5]:
# ResNet-18 Architecture
# 8 residual blocks (16 Conv layers) + input Conv layer + output FC layer 
#     = 18 layers with learnable parameters (excluding batch normalization parameters)
def resNet(input_shape, num_of_output_classes):
    
    resnet = keras.models.Sequential(name='ResNet-18')
    
    '''
    Set the use_bias to False because the following BN layer adds a bias. 
    The BN "shift" parameter shifts the output of the layer (thus acts like a bias).
    '''
    resnet.add(Conv2D(64, (7, 7), strides=2, padding='same', 
                     input_shape=input_shape, use_bias=False))
    resnet.add(keras.layers.BatchNormalization())
    resnet.add(keras.layers.Activation("relu"))
    resnet.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="SAME"))

    # Add the residual blocks
    prev_filters = 64
    for filters in [64] * 2 + [128] * 2 + [256] * 2 + [512] * 2:
        strides = 1 if filters == prev_filters else 2
        resnet.add(Residual_Block(filters, strides=strides))
        prev_filters = filters
        
    # Perform global average pooling
    resnet.add(keras.layers.GlobalAvgPool2D())
    resnet.add(keras.layers.Flatten())
    resnet.add(keras.layers.Dense(units=num_of_output_classes, activation="softmax"))
    
    return resnet

# Create a ResNet network and display layer information
resnet = resNet((224, 224, 3), 10)
resnet.summary()

Model: "ResNet-18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 112, 112, 64)      9408      
_________________________________________________________________
batch_normalization (BatchNo (None, 112, 112, 64)      256       
_________________________________________________________________
activation (Activation)      (None, 112, 112, 64)      0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 56, 56, 64)        0         
_________________________________________________________________
residual__block (Residual_Bl (None, 56, 56, 64)        74240     
_________________________________________________________________
residual__block_1 (Residual_ (None, 56, 56, 64)        74240     
_________________________________________________________________
residual__block_2 (Residual_ (None, 28, 28, 128)       23
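
The parameter counts printed in the first rows of the summary above can be cross-checked by hand. This sketch uses two small helper functions introduced only for the check; it assumes Conv layers without bias and counts four parameters per channel for each batch normalization layer (scale, shift, moving mean, moving variance).

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights of a Conv layer without bias (use_bias=False)."""
    return kernel_size * kernel_size * in_channels * out_channels

def batch_norm_params(channels):
    """Keras BatchNormalization: gamma, beta, moving mean, moving variance."""
    return 4 * channels

stem_conv = conv_params(7, 3, 64)
stem_bn = batch_norm_params(64)
# A stride-1 residual block with 64 filters: two 3x3 Conv layers, each with BN.
block_64 = 2 * (conv_params(3, 64, 64) + batch_norm_params(64))
print(stem_conv, stem_bn, block_64)
```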


## Deeper ResNet


For ImageNet classification, deeper ResNet architectures with 34, 50, 101, and 152 layers were used. The following figure shows the ResNet architecture specifications for ImageNet, including the shallow ResNet-18 architecture. The ResNet building blocks are shown in brackets with the number of blocks stacked.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_Architectures_ImageNet.png" width=800, height=600>


### Flexibility of the ResNet Architecture

We can see that by configuring different numbers of channels and residual blocks in the ResNet module (each is represented by a square bracket), we can create different ResNet models. Although the main architecture of ResNet is similar to that of GoogLeNet, ResNet's structure is simpler and easier to modify.

- All these factors have resulted in the rapid and widespread use of ResNet.
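
This configurability can be written down as data. The sketch below assumes the blocks-per-stage values from Table 1 of the original ResNet paper (which the figure above appears to reproduce); the dictionary and function names are illustrative. "basic" blocks have two Conv layers, while the "bottleneck" blocks used by the deeper models (described in the next section) have three.

```python
# Blocks per stage, as listed in Table 1 of He et al. (2015).
# "basic" blocks have 2 Conv layers; "bottleneck" blocks have 3.
RESNET_CONFIGS = {
    18: ("basic", [2, 2, 2, 2]),
    34: ("basic", [3, 4, 6, 3]),
    50: ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def count_layers(block_type, blocks_per_stage):
    """Weighted layers: stem Conv + Conv layers in all blocks + final FC layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1

for depth, (block_type, blocks) in RESNET_CONFIGS.items():
    print(depth, count_layers(block_type, blocks))
```

As in the ResNet-18 count earlier, the 1 × 1 Conv layers on the skip connections are not included.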




### Residual Block for Deeper ResNet Models


Deeper ResNets use a slightly different design for the residual blocks. They use "bottleneck" layers, which are 1 x 1 Conv layers, to improve computational efficiency. To understand the design of the residual block for deeper ResNets, consider the following figure.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNet_ResidualBlock_DeepModel_1.png" width=600, height=400>

The figure on the right shows the new residual block. It has three Conv layers. As an example, consider an input of height and width 28 x 28 with 256 channels.

- First Conv layer: it's a bottleneck 1 x 1 Conv layer. It reduces the number of input channels to 64.
- Second Conv layer: it's a scanning 3 x 3 Conv layer that works on the reduced 64 channels.
- Third Conv layer: it's a 1 x 1 Conv layer that increases the number of channels back to 256, so that the number of input and output channels remains the same.
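
A rough weight count shows why the bottleneck design is cheaper. The comparison below is a sketch: `conv_weights` is a helper defined for the example, biases and batch normalization are ignored, and the "two 3 x 3 Conv layers at 256 channels" alternative is a hypothetical baseline rather than a block taken from the figure.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a Conv layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

# Bottleneck block on a 256-channel input: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256).
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# Hypothetical alternative: two 3x3 Conv layers applied directly at 256 channels.
two_full_width_convs = 2 * conv_weights(3, 256, 256)

print(bottleneck, two_full_width_convs, round(two_full_width_convs / bottleneck, 1))
```

Because every weight is applied at each of the 28 x 28 positions, the multiply-add count shrinks by the same factor as the weight count.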


### ResNet Training: Practical Considerations

We should use the following parameter settings, recommended by He et al. in their ResNet paper, to train a ResNet model in practice.

- Perform batch normalization after every Conv layer
- He/Xavier weight initialization 
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout 
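
Some of these settings map onto Keras roughly as follows. This is a configuration sketch, not the authors' training script: `patience=5` and the use of the validation loss as the plateau signal are assumptions, the commented-out `X_train`, `y_train`, `X_valid`, `y_valid` names are placeholders, and weight decay is left out because the way to apply it depends on the Keras version.

```python
from tensorflow import keras

# SGD with momentum 0.9 and an initial learning rate of 0.1.
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)

# Divide the learning rate by 10 when the validation loss stops improving.
# patience=5 is an assumed value; the list above does not specify one.
lr_schedule = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=5)

# He initialization is passed per layer, e.g.:
conv = keras.layers.Conv2D(64, 3, padding="same", use_bias=False,
                           kernel_initializer="he_normal")

# Hypothetical usage with the resNet() function defined earlier:
# model = resNet((224, 224, 3), 10)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256,
#           validation_data=(X_valid, y_valid), callbacks=[lr_schedule])
```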


## Variants of ResNet

Below we discuss some variants of the ResNet model.
- Pre-activation ResNet
- Wide ResNet
- ResNeXt


## Pre-Activation ResNet (2016)

- The main motivation behind the pre-activation ResNet is to ensure the smooth flow of signals such that deeper models can be trained easily and effectively.

https://arxiv.org/abs/1603.05027

In the original ResNet, the input map is convolved first, then passed through activation. Thus, this ResNet block architecture can be regarded as "post-activation" (figure a below).

A logical question might be: what happens if we perform pre-activation? That is, the maps are first passed through the batch normalization and ReLU activation layers, and then the activated maps are convolved (figure b below). This very question led to the design of a variant of ResNet known as the **Pre-Activation ResNet**.

This new design is effective for training very deep networks. The figure on the right below shows that a 1001-layer ResNet based on pre-activation has lower test error and is also easier to train.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_PreActivation_ResNet_1.png" width=800, height=600>

It was shown that the forward and backward signals can be directly propagated from one block to any other block when identity mappings are used both as the skip connections and as the after-addition activation. The following figure illustrates various placements of the activation. All these blocks consist of the same components; only their order differs. We see that the pre-activation block gives the best performance.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_PreActivation_ResNet_2.png" width=800, height=600>
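
The direct-propagation claim can be made concrete with a short derivation that follows the notation of the pre-activation paper linked above. The symbols are introduced here for the derivation only: $x_l$ is the input to block $l$, $\mathcal{F}$ is the residual branch with weights $W_l$, and $\mathcal{E}$ is the loss. If both the skip connection and the after-addition activation are identity mappings, each block computes

$$x_{l+1} = x_l + \mathcal{F}(x_l, W_l)$$

Applying this recursively from block $l$ up to any deeper block $L$ gives

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)$$

and, by the chain rule, the gradient reaching block $l$ is

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)$$

The additive term $1$ means that part of the gradient at $x_L$ reaches $x_l$ unchanged, without passing through any weight layer. A ReLU placed after the addition, as in the original block, would break this clean additive form.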

## Wide ResNet (2016)

The ResNet model introduced residual blocks with identity mappings with the goal of training deep networks. Deeper models based on residual blocks suffer from two issues.

- Increased number of parameters and computations.
- Diminishing feature reuse.

Let's illustrate the latter issue. The identity mappings in the residual blocks may become a weakness in deep networks. As the gradient flows through the network, nothing forces it to go through the residual block weights, so a block can avoid learning anything during training. It is therefore possible that only a few blocks learn useful representations, or that many blocks share very little information and contribute little to the final goal. This problem is known as **diminishing feature reuse**. One strategy to mitigate this problem is to randomly disable residual blocks during training. This method can be viewed as a special case of dropout, where each residual block has an identity scalar weight on which dropout is applied. The effectiveness of this approach suggests that deeper residual networks do suffer from diminishing feature reuse.

A new residual network architecture was designed based on the above observation. It was named **wide residual networks**. It was shown that widening the ResNet blocks can improve the performance of residual networks more effectively than increasing their depth.

https://arxiv.org/abs/1605.07146

It was argued that residuals are the important factor to improve performance, not depth. It was demonstrated that:
- 50-layer wide ResNet outperforms 152-layer original ResNet
- Increasing width instead of depth is more computationally efficient (parallelizable)

### Wide Residual Block Architecture

The following figure shows the bottleneck architecture of the original ResNet and of the wide ResNet. The wider residual blocks use $F \times k$ filters instead of $F$ filters in each layer of the residual block. Here $k$ denotes the widening factor, which multiplies the number of features in the convolutional layers.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_Wide_ResNet.png" width=600, height=400>
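
The cost of widening can be estimated with a quick count. The sketch below assumes a 3 × 3 Conv layer whose input and output channels are both widened by $k$; the base width of 64 and the helper names are illustrative choices, not values from the paper.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a Conv layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def widened_conv_weights(base_filters, k, kernel_size=3):
    """3x3 Conv whose input and output both have base_filters * k channels."""
    width = base_filters * k
    return conv_weights(kernel_size, width, width)

base = widened_conv_weights(64, k=1)
for k in (1, 2, 4):
    print(k, widened_conv_weights(64, k), widened_conv_weights(64, k) // base)
```

Under this assumption the weights (and multiply-adds) of such a layer grow with $k^2$, so widening is not free; the argument for it is that wide layers parallelize well.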

## ResNet vs Wide ResNet: Depth vs Width

The following table shows the performance comparison between the original ResNet and the Wide ResNet. We see that with a widening factor of 2.0, the resulting 50-layer Wide ResNet "WRN-50-2-bottleneck" outperforms ResNet-152. The wide ResNet has about three times fewer layers and is significantly faster.

Also notice that WRN-50-2-bottleneck is only slightly worse than, and almost 2x faster than, the best-performing pre-activation ResNet-200, although it has slightly more parameters.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Wide_ResNet_Performance.png" width=600, height=500>

## ResNeXt (2016)

The main idea of the ResNeXt (Aggregated Residual Transformations for Deep Neural Networks) architecture (authored by Kaiming He's group) is to combine ResNet with the split-transform-merge strategy of Inception. A module in the network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation.

https://arxiv.org/abs/1611.05431

The following figure compares the architecture of a ResNet block (left) with that of a ResNeXt block (right). The number of parallel paths in ResNeXt, known as its cardinality, is 32. Here a layer is shown as (# in channels, filter size, # out channels).

- The key aspect to note is that both blocks have roughly the same complexity. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNeXt_1.png" width=600, height=500>
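
The "roughly the same complexity" claim can be checked by counting weights. This sketch assumes the channel sizes of the 32 x 4d template: a ResNet bottleneck of 256 → 64 → 64 → 256 channels, and 32 ResNeXt paths of 256 → 4 → 4 → 256 channels each. The helper `conv_weights` is defined only for this example, and biases and batch normalization are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a Conv layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

# ResNet bottleneck block: 256 -> 64 (1x1), 64 -> 64 (3x3), 64 -> 256 (1x1).
resnet_block = (conv_weights(1, 256, 64)
                + conv_weights(3, 64, 64)
                + conv_weights(1, 64, 256))

# ResNeXt block: 32 parallel paths, each 256 -> 4 (1x1), 4 -> 4 (3x3), 4 -> 256 (1x1).
cardinality, path_width = 32, 4
resnext_block = cardinality * (conv_weights(1, 256, path_width)
                               + conv_weights(3, path_width, path_width)
                               + conv_weights(1, path_width, 256))

print(resnet_block, resnext_block)
```

Both blocks come out at about 70k weights, so any accuracy difference between them is not explained by a larger parameter budget.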

The ResNeXt "Network-in-Neuron" design induces a **boosting**-like effect by combining a set of simpler classifiers into a more powerful one.


The following table shows the results on the 1000-class ImageNet classification task using 50-layer and 101-layer residual networks. The ResNeXt blocks use cardinality = 32 (# parallel paths) and bottleneck width = 4d (filters), denoted as ResNeXt-50 (32 x 4d). We see that the ResNeXt models outperform the vanilla ResNet models without increasing model complexity.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_ResNeXt_2.png" width=1000, height=800>