
# Notable Convnet Architectures-III

In this notebook, we study the following notable convnet architectures.

- Inception-v4 (2016)
- Xception (2016)
- DenseNet (2017)
- Ensemble Method: Deep Feature Fusion (2016)
- Squeeze-and-Excitation Network (SENet) (2017)

Finally, we conclude our discussion on the notable convnet architectures by comparing the model performance against computational complexity (number of operations) and model complexity (number of parameters).

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D, AveragePooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization

## Marriage between GoogLeNet & ResNet: Inception-v4 (2016)

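By combining the ideas of GoogLeNet and ResNet, a new family of architectures was proposed in 2016. The paper introduced Inception-v4 together with residual variants of the Inception network (Inception-ResNet-v1 and Inception-ResNet-v2). An ensemble of three Inception-ResNet-v2 models and one Inception-v4 model achieved a top-5 error rate of 3.08% on the ImageNet classification test set.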

https://arxiv.org/pdf/1602.07261.pdf

The marriage was blissful indeed!



## Xception (2016)

Another happy marriage between GoogLeNet and ResNet was observed in 2016 when François Chollet proposed a variant of the GoogLeNet architecture. It is known as Xception (stands for Extreme Inception) https://arxiv.org/pdf/1610.02357.pdf.



### Motivation

First, let's understand how a Conv layer was conceived prior to the advent of Inception-based architectures. In general, a Conv layer was designed to learn feature maps from an input 3D map with two spatial dimensions (height and width) and a channel dimension. Thus, a single Conv filter is tasked with simultaneously mapping spatial correlations and cross-channel correlations. 

The Inception block, first used in GoogLeNet, changed this paradigm of learning. It formulated a prior that says 
- Spatial correlations and cross-channel correlations are sufficiently decoupled.

Thus, according to the Inception philosophy, it is preferable not to map them jointly. Consider the following inception block, also known as the canonical inception block from Inception-v3 network.

<img src="http://engineering.unl.edu/images/uploads/CNN_Inception-v3.png" width=700, height=500>

It has factored out the spatial and cross-channel scanning. It first looks at cross-channel correlations via a set of 1 x 1 convolutions, mapping the input data into 3 or 4 separate spaces that are smaller than the original input space, and then maps all correlations in these smaller 3D spaces, via regular 3 x 3 or 5 x 5 convolutions.

Now consider a simplified version of an Inception block (left figure below) that only uses one size of convolution (e.g., 3 x 3) and does not include an average pooling layer. This Inception block can be reformulated as a large 1 x 1 convolution followed by spatial convolutions that would operate on nonoverlapping segments of the output channels (figure on the right). 

<img src="http://engineering.unl.edu/images/uploads/CNN_Inception_Simplified.png" width=800, height=600>
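The reformulation above can be checked numerically. The following is a minimal numpy sketch, assuming a toy input and three parallel 1 x 1 branches (all sizes and variable names are illustrative and not taken from the paper). It shows that concatenating the outputs of several 1 x 1 convolutions gives the same result as one large 1 x 1 convolution whose output channels are split into nonoverlapping segments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input feature map: height 5, width 5, 8 channels (illustrative sizes)
x = rng.normal(size=(5, 5, 8))

# Three parallel 1 x 1 convolutions, each mapping 8 channels to 4 channels.
# A 1 x 1 convolution is a matrix product applied at every spatial position.
branch_weights = [rng.normal(size=(8, 4)) for _ in range(3)]
branch_outputs = [x @ w for w in branch_weights]

# One large 1 x 1 convolution whose weight matrix stacks the three branches
big_weight = np.concatenate(branch_weights, axis=1)   # shape (8, 12)
big_output = x @ big_weight                           # shape (5, 5, 12)

# Splitting the 12 output channels into 3 nonoverlapping segments
# recovers the three branch outputs
segments = np.split(big_output, 3, axis=-1)
same = all(np.allclose(s, b) for s, b in zip(segments, branch_outputs))
print(big_output.shape, same)
```

Each segment would then be fed to its own spatial convolution, which is the arrangement shown in the right-hand figure.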



This observation led to the creation of an "extreme" version of the Inception block, as shown below. Chollet made a strong prior:
- Cross-channel correlations and spatial correlations can be mapped **completely** separately.

Because this hypothesis is a stronger version of the hypothesis underlying the Inception architecture, the architecture is named Xception, which stands for "Extreme Inception".

In this arrangement, we would first use a 1 x 1 convolution to map cross-channel correlations, and then separately map the spatial correlations of every output channel. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Inception_ExtremeVersion.png" width=600, height=400>

Let's explain this further using the figure below. We first perform a pointwise (1 x 1) convolution and then a depthwise convolution, which applies one spatial filter to each channel independently. This type of convolution layer is known as a **depthwise separable convolution layer** (or separable convolution layer for short). Common implementations, such as Keras' `SeparableConv2D`, apply the two steps in the opposite order (depthwise first, then pointwise); the Xception paper argues that this difference is unimportant because the layers are used in a stacked setting. While a regular Conv layer uses filters that try to simultaneously capture spatial patterns (e.g., an oval) and cross-channel patterns (e.g., mouth + nose + eyes = face), a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately. These separable convolution layers had been used before in some convnet architectures, but they were not as central as in the Xception architecture. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Extreme_Inception.png" width=700, height=500>
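To make the two steps concrete, here is a minimal numpy sketch of a pointwise convolution followed by a depthwise convolution. It assumes stride 1, SAME zero padding, an odd kernel size, and no bias or nonlinearity; the function names `pointwise_conv` and `depthwise_conv` are hypothetical helpers written for this illustration, not library calls.

```python
import numpy as np

def pointwise_conv(x, w):
    """1 x 1 convolution: mixes channels at every position.
    x: (H, W, C_in), w: (C_in, C_out)."""
    return x @ w

def depthwise_conv(x, k):
    """Per-channel spatial convolution (stride 1, SAME zero padding).
    x: (H, W, C), k: (kh, kw, C) holds one spatial filter per channel."""
    kh, kw, _ = k.shape
    h, w, _ = x.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            # Each channel is multiplied only by its own filter tap k[i, j, c],
            # so no information is mixed across channels in this step
            out += padded[i:i + h, j:j + w, :] * k[i, j, :]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))         # toy input with 3 channels
w_point = rng.normal(size=(3, 8))      # cross-channel mapping: 3 -> 8 channels
k_depth = rng.normal(size=(3, 3, 8))   # one 3 x 3 spatial filter per channel

y = depthwise_conv(pointwise_conv(x, w_point), k_depth)
print(y.shape)
```

The pointwise step is the only place where channels interact, and the depthwise step is the only place where neighboring pixels interact, which is the separation that the Xception hypothesis relies on.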


## Xception Architecture

The Xception architecture is a linear stack of separable convolution layers with residual connections. Its specification is shown in the following figure.

The Xception architecture has 36 Conv layers forming the feature extraction base of the network. The 36 Conv layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules.

- The data first goes through the entry flow
- Then through the middle flow which is repeated eight times, and 
- Finally through the exit flow

Since separable convolutional layers only have one spatial filter per input channel, they are not used after layers that have too few channels, such as the input layer. For this reason, the Xception architecture starts with 2 regular convolutional layers, but then the rest of the architecture uses only separable convolutions (34 in all), plus a few max pooling layers and the usual final layers (a global average pooling layer and a dense output layer).

Note that all Convolution and SeparableConvolution layers are followed by batch normalization (not included in the figure). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion).


<img src="http://engineering.unl.edu/images/uploads/CNN_Xception_Architecture.png" width=1000, height=800>


Although the Xception architecture doesn't contain any Inception block, it is still considered as a variant of GoogLeNet. An inception block contains convolutional layers with 1 × 1 filters to look exclusively for cross-channel patterns. However, the convolutional layers that sit on top of them are regular convolutional layers that look both for spatial and cross-channel patterns. We can think of the Inception blocks as an intermediate between a regular convolutional layer (which considers spatial patterns and cross-channel patterns jointly) and a separable convolutional layer (which considers them separately). In practice, it seems that separable convolutional layers generally perform better.


The Xception model significantly outperformed Inception-v3 on a huge vision task (350 million images and 17,000 classes). Separable convolutional layers use fewer parameters, less memory, and fewer computations than regular convolutional layers, and in general they even perform better. Thus, they are a sensible default choice (except after layers with few channels).
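The parameter savings can be checked with a quick count. This is a small sketch that assumes a depth multiplier of 1 and ignores biases; the layer sizes (3 x 3 kernel, 128 input and 128 output channels) are illustrative values chosen for the example.

```python
def regular_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored):
    one k x k spatial filter per input channel plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 128
regular = regular_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(regular, separable, round(regular / separable, 1))  # 147456 17536 8.4
```

For this configuration the separable layer needs roughly 8 times fewer weights. The number of multiply-adds per output position shrinks by the same ratio, since every weight is used once per position.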


## DenseNet (2017)

The ResNet architecture has had a profound influence on the design of many subsequent successful convnet architectures. DenseNet is one notable example of an architecture derived from ResNet.

- What are the key characteristics of ResNet that created such a big ripple?


ResNet provided a simple strategy to create deeper networks. The main problem with very deep networks is that as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. ResNet circumvents this **information loss** problem by passing the signal from one layer to the next via identity connections. Thus one key idea that ResNet used is:

- Create short paths from early layers to later layers.



Another key characteristic of ResNet is that it decomposes the target function $f(x)$ into a simple linear term $x$ and a more complex nonlinear one $f(x) - x$. This is similar to decomposing a function using its Taylor series expansion. For example, around the point $x = 0$, a function $f(x)$ can be decomposed as follows:

$f(x) = f(0) + f^{'}(0)x + \frac{f^{''}(0)}{2!}x^2 + \frac{f^{'''}(0)}{3!}x^3 + ...$


Observe that $f(x)$ is decomposed into increasingly nonlinear (higher-order) terms, which is similar to what ResNet does. However, ResNet restricts the decomposition to two terms: a linear term and a nonlinear term.

- The logical next step could be to capture information beyond two terms. 

But how do we do this? In ResNet, we have one linear channel (skip connection) and one nonlinear channel (the Conv layers block). Then, how do we add more nonlinear signals?

One strategy would be to combine nonlinear signals from the previous layers. This is exactly what DenseNet pursued.

https://arxiv.org/abs/1608.06993

The DenseNet architecture connects all layers (with matching feature-map sizes) directly with each other, as shown in the figure below. The goal is to ensure maximum information flow between layers in the network. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. This makes it possible for DenseNet to learn complex features, as the function computed at each layer is a composition of many nonlinear functions.

<img src="http://engineering.unl.edu/images/uploads/CNN_DenseNet_1.png" width=600, height=400>

However, there is one notable difference in the manner in which functions are combined at each layer. Unlike ResNet, which adds the linear and nonlinear functions, DenseNet combines the functions by concatenation (denoted by [,]), as shown below.  

<img src="http://engineering.unl.edu/images/uploads/CNN_DenseNet_2.png" width=400, height=200>

This concatenation maps $x$ to its values after applying an increasingly complex sequence of functions:

$x \rightarrow [x, f_1(x), f_2([x, f_1(x)]), f_3([x, f_1(x), f_2([x, f_1(x)])]), ...]$
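The difference between adding and concatenating can be illustrated with a toy numpy sketch. Here `toy_layer` is a hypothetical stand-in for a nonlinear layer $f_i$ (a random 1 x 1 convolution followed by ReLU), and all sizes are illustrative assumptions rather than values from the DenseNet paper.

```python
import numpy as np

def toy_layer(x, out_channels, seed):
    """Stand-in for a nonlinear layer f_i: random 1 x 1 convolution + ReLU."""
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], out_channels))
    return np.maximum(x @ w, 0.0)

x = np.random.default_rng(0).normal(size=(4, 4, 16))  # H x W x C input

# ResNet-style: the layer output is added to its input,
# so the channel count stays the same
res = x + toy_layer(x, 16, seed=1)

# DenseNet-style: each layer sees everything computed so far and its
# output is appended, so earlier feature maps remain available unchanged
features = x
for i in range(3):
    f_i = toy_layer(features, 8, seed=i + 1)
    features = np.concatenate([features, f_i], axis=-1)

print(res.shape, features.shape)  # (4, 4, 16) (4, 4, 40)
```

After three layers the concatenated tensor holds the original 16 channels plus 3 x 8 new ones, and its first 16 channels are still exactly $x$.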


Finally, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation this is quite simple: rather than adding terms, we concatenate them. The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers, as shown below.


<img src="http://engineering.unl.edu/images/uploads/CNN_DenseNet_3.png" width=600, height=400>

## DenseNet Architecture


The main components of the DenseNet architecture are:
- Dense blocks: define how the inputs and outputs are concatenated
- Transition layers: control the number of channels so that it is not too large


## Dense Block: TensorFlow Implementation


A dense block consists of multiple Conv sub-blocks. Each Conv sub-block includes:
- Batch normalization
- Activation (ReLU)
- 3 x 3 convolution with stride 1 and SAME padding

Note that each Conv sub-block in the dense block uses the same number of output channels. Finally, in the forward propagation, we concatenate the input and output of each convolution sub-block on the channel dimension.

- Thus, a dense block increases the number of output channels.
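The channel growth inside a dense block can be tallied directly. The helper below is a hypothetical bookkeeping function written for this illustration, and the numbers (64 input channels, 4 sub-blocks, 32 output channels per sub-block) are example values.

```python
def dense_block_channels(in_channels, num_sub_blocks, channels_per_sub_block):
    """Channel count at the input of each Conv sub-block,
    followed by the channel count at the output of the dense block."""
    counts = [in_channels]
    for _ in range(num_sub_blocks):
        # Concatenation appends the sub-block output to its input
        counts.append(counts[-1] + channels_per_sub_block)
    return counts

print(dense_block_channels(64, 4, 32))  # [64, 96, 128, 160, 192]
```

Each sub-block adds a fixed number of channels, so the block output has `in_channels + num_sub_blocks * channels_per_sub_block` channels.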

In [2]:
'''
Define a Dense Block class. Each Conv sub-block is built with Keras' Sequential API.
The Dense_Block class takes two arguments:
- conv_block_number: number of Conv sub-blocks  
- num_of_channels: number of output channels of each Conv sub-block
'''
class Dense_Block(keras.layers.Layer):
    def __init__(self, conv_block_number, num_of_channels, **kwargs):
        super().__init__(**kwargs)
        # Each Conv sub-block (BN -> ReLU -> 3 x 3 Conv) is applied as a single unit
        self.conv_blocks = []
        for _ in range(conv_block_number):
            self.conv_blocks.append(keras.Sequential([
                keras.layers.BatchNormalization(),
                keras.layers.Activation("relu"),
                keras.layers.Conv2D(filters=num_of_channels, kernel_size=(3, 3),
                                    strides=1, padding="same")]))
        # Concatenation on the channel dimension (axis=3 is the depth)
        self.concat = keras.layers.Concatenate(axis=3)

    def call(self, inputs):
        for block in self.conv_blocks:
            Z = block(inputs)
            # Concatenate the input and output of the whole sub-block, so that
            # each sub-block adds num_of_channels channels
            inputs = self.concat([inputs, Z])

        return inputs

## Transition Layer: TensorFlow Implementation

Recall that each dense block increases the number of channels. Stacking several dense blocks would therefore create an excessively complex model.

One way to reduce this complexity is to decrease the number of channels as well as the height and width of the maps. This is exactly the purpose of the transition layer. 
- It reduces the number of channels by using a 1 × 1 convolutional layer.
- It halves the height and width by using an average pooling layer with a stride of 2, further reducing the complexity of the model.

In [3]:
class Transition_Block(keras.layers.Layer):
    def __init__(self, num_of_channels, **kwargs):
        super().__init__(**kwargs)
        self.main_layers = [keras.layers.BatchNormalization(),
                           keras.layers.Activation("relu"),
                           # Reduce the number of channels via 1 x 1 convolution
                           keras.layers.Conv2D(filters=num_of_channels, kernel_size=(1, 1), 
                                    strides=1, padding="valid"),
                           # Halve the height & width of the maps by average pooling
                           keras.layers.AveragePooling2D(pool_size=2, strides=2)]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
            
        return Z

## DenseNet Architecture: Implementation


The DenseNet architecture is very similar to the ResNet architecture.


Similar to ResNet, DenseNet first uses the same single convolutional layer and max pooling layer.

Then, similar to the four modules made up of residual blocks that ResNet uses, DenseNet uses four dense blocks. 

Similar to ResNet, we can set the number of convolutional layers used in each dense block. Here, we set it to 4, consistent with the ResNet-18 model. 

Furthermore, we set the number of channels (i.e., growth rate) for the convolutional layers in the dense block to 32, so each dense block adds 4 × 32 = 128 channels.

In ResNet, the height and width are reduced between each module by a residual block with a stride of 2. Here, we use the transition layer to halve the height and width and halve the number of channels.

Finally, similar to ResNet, a global pooling layer and an FC layer are connected at the end to produce the output.

In [4]:
# DenseNet model
def denseNet(input_shape, num_of_output_classes):
    
    densenet = keras.models.Sequential(name='DenseNet')
    densenet.add(Conv2D(64, (7, 7), strides=2, padding='same', input_shape=input_shape))
    densenet.add(BatchNormalization())
    densenet.add(Activation("relu"))
    densenet.add(MaxPooling2D(pool_size=3, strides=2, padding="same"))
    
    # `num_channels`: the current number of channels
    num_channels = 64
    growth_rate = 32
    num_convs_in_dense_blocks = [4, 4, 4, 4]
    
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        densenet.add(Dense_Block(num_convs, growth_rate))
        # This is the number of output channels of the dense block just added
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added between
        # the dense blocks
        if i != len(num_convs_in_dense_blocks) - 1:
            num_channels //= 2
            densenet.add(Transition_Block(num_channels))
            
    densenet.add(BatchNormalization())
    densenet.add(Activation("relu"))
    densenet.add(GlobalAveragePooling2D())
    densenet.add(Flatten())
    densenet.add(Dense(units=num_of_output_classes, activation="softmax"))
    
    return densenet

# Create a DenseNet network and display layer information
densenet = denseNet((224, 224, 3), 10)
densenet.summary()

Model: "DenseNet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 112, 112, 64)      9472      
_________________________________________________________________
batch_normalization (BatchNo (None, 112, 112, 64)      256       
_________________________________________________________________
activation (Activation)      (None, 112, 112, 64)      0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 56, 56, 64)        0         
_________________________________________________________________
dense__block (Dense_Block)   (None, 56, 56, 192)       130944    
_________________________________________________________________
transition__block (Transitio (None, 28, 28, 96)        19296     
_________________________________________________________________
dense__block_1 (Dense_Block) (None, 28, 28, 224)       168320    

## DenseNet: Observations


Superficially, DenseNets are quite similar to ResNets. The main difference is the way channels are added. In DenseNet we concatenate the channels. However, the implications of this seemingly small modification lead to substantially different behaviors of the two network architectures.

- Model compactness. 
As a direct consequence of the input concatenation, the feature-maps learned by any of the DenseNet layers can be accessed by all subsequent layers. This encourages **feature reuse throughout the network**, and leads to more compact models.


- Implicit Deep Supervision. 
One explanation for the improved accuracy of dense convolutional networks may be that individual layers receive additional supervision from the loss function through the shorter connections. One can interpret DenseNets to perform a kind of "deep supervision". The benefits of deep supervision have previously been shown in deeply-supervised nets, which have classifiers attached to every hidden layer, enforcing the intermediate layers to learn discriminative features.

DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers. However, the loss function and gradient of DenseNets are substantially less complicated, as the same loss function is shared between all layers.

## Ensemble Method: Deep Feature Fusion (2016)


The ILSVRC 2016 challenge winner was a group of researchers (their team's name was Soushen, meaning God of search) from the Third Research Institute of the Ministry of Public Security (Trimps) of China. The Trimps-Soushen team lowered the top-5 classification test error rate to 2.99%, making it the first team to achieve an error rate under 3%.

Led by Jie Shao, the Trimps-Soushen team used an **ensemble approach** for deep feature fusion. More specifically they used the following pretrained models to build their ensemble.

- Inception-v3
- Inception-v4
- Inception-ResNet-v2
- Pre-Activation ResNet-200
- Wide ResNet (WRN-68–2)

The classification performance for the top-10 difficult categories is shown below. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Ensemble_DeepFeatureFusion.png" width=800, height=600>


We see from the results that no single model dominates across all categories. Each model is strong at classifying some categories but weak at classifying others. The diversity of the models was beneficial for improving the accuracy. 

Although the ensemble technique achieved stunning performance, there was no new architecture or new technology. This group never published any papers or technical reports on their ensemble approach. They only shared their results in the ImageNet and COCO joint workshop at ECCV 2016.

## SENet (2017)


The Squeeze-and-Excitation Network (SENet) was the winning architecture in the ILSVRC 2017 challenge, which lowered the top-5 error rate down to 2.25%!

https://arxiv.org/abs/1709.01507

Unlike the previous design approaches, SENet focused on a different aspect of network design:

- How to improve feature maps (channels) in relation to other feature maps (channels)?

It developed a technique to improve the quality of feature maps by performing "feature recalibration". A new architectural unit, named the SE block, was proposed; it learns to adaptively reweight feature maps.

Let's understand this approach by using an example.

### Motivating Example for SE Block


Consider a simple scenario. Say that we want to recognize cat's faces from images. A reasonable assumption about a cat's face is that on an image representing a cat, its ears, nose, eyes and whiskers will appear together. These are the features of a cat's face. A Conv layer in a neural network will learn different feature detectors to identify these features. The feature maps (output channels) of a Conv layer might learn each of these features individually, for example, one feature map for ears, one for nose, one for eyes, etc. All these feature maps should have similar activation if the features appear in the input image (i.e., if the input image represents a cat's face).

<img src="http://engineering.unl.edu/images/uploads/CNN_SENet_Cat.png" width=300, height=300>

What if the feature maps identifying ears, eyes and nose have strong activation, but the map detecting the whiskers has only weak activation? Maybe the Conv layer was confused about the whiskers. Then, how do we help the Conv layer to increase the activation of the map that detects whiskers?

We could design a micro-network that will use the output feature maps of each Conv block and will learn which maps need to be "boosted". It will do so by analyzing the inter-relationship between the feature maps. For the above example, this micro-network will increase the activation of the whisker-detector map by finding the fact that this map is strongly related to the strongly activated ears-eyes-nose maps. In other words, this micro-network will perform **feature map recalibration**. This is the idea behind the SE block.


<img src="http://engineering.unl.edu/images/uploads/CNN_SENet_a.png" width=900, height=600>

An SE block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together. It learns the global information to selectively emphasize informative features and suppress less useful ones.



## SE Block Architecture


The SE block uses global information about the feature maps (via global avg. pooling layer) and two fully-connected (FC) layers to determine feature map weights. The architecture of an SE block is composed of just three layers, as shown below.

- A global average pooling layer
- A hidden FC layer using the ReLU activation function, and
- A FC output layer using the sigmoid activation function


<img src="http://engineering.unl.edu/images/uploads/CNN_SENet_b.png" width=500, height=400>

First, the feature maps of size $H \times W \times C$ computed by a Conv layer are passed through the global average pooling layer. This is the **squeeze** operation: it produces a channel descriptor by aggregating each feature map across its spatial dimensions ($H \times W$). If the input contains $C$ feature maps, it outputs $C$ numbers representing the overall level of response for each filter.


The next two layers perform the **excitation** operation. The hidden FC layer has significantly fewer than $C$ neurons: $r$ times fewer than the number of feature maps, where typically $r = 16$. Thus, the $C$ numbers get compressed into a small vector of $\frac{C}{r}$ dimensions. This is a low-dimensional vector representation (i.e., an embedding) of the distribution of channel-wise feature responses. This bottleneck step forces the SE block to learn a general representation of the feature combinations. 

The output FC layer takes the embedding and outputs a recalibration vector containing one number between 0 and 1 per feature map. The feature maps are then multiplied by this recalibration vector. As a result, irrelevant features (with a low recalibration score) get scaled down while relevant features (with a recalibration score close to 1) are left alone.
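The three layers of the SE block can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: a single input of shape $H \times W \times C$, randomly initialized (untrained) weights, and a hypothetical helper name `se_block`. It is not the reference implementation from the SENet paper.

```python
import numpy as np

def se_block(feature_maps, w1, b1, w2, b2):
    """Toy forward pass of an SE block.
    feature_maps: (H, W, C); w1: (C, C // r); w2: (C // r, C)."""
    # Global average pooling: one mean activation per feature map
    z = feature_maps.mean(axis=(0, 1))                    # shape (C,)
    # Hidden FC layer with ReLU: bottleneck of C // r units
    hidden = np.maximum(z @ w1 + b1, 0.0)                 # shape (C // r,)
    # Output FC layer with sigmoid: one score in (0, 1) per feature map
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))     # shape (C,)
    # Recalibration: every feature map is multiplied by its score
    return feature_maps * scale, scale

rng = np.random.default_rng(0)
C, r = 32, 16                        # r = 16 is the typical reduction ratio
x = rng.normal(size=(7, 7, C))
w1, b1 = rng.normal(size=(C, C // r)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C // r, C)), np.zeros(C)

y, scale = se_block(x, w1, b1, w2, b2)
print(y.shape, scale.shape)
```

The output has the same shape as the input, which is why the block can be attached to an existing Conv unit without changing the layers around it.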



## SE Block based Convnet

The SE block can be integrated into standard architectures by inserting it after the nonlinearity following each convolution. Below we show illustrations of how the SE block can be used to extend existing architectures such as Inception networks and ResNets, which are called SE-Inception and SE-ResNet, respectively.

<img src="http://engineering.unl.edu/images/uploads/CNN_SENet_c.png" width=800, height=600>

 
## SE Block based Convnets: Complexity

By comparing ResNet-50 with SE-ResNet-50, it was shown that there was only a 0.26% relative increase in GFLOPs (1 GFLOP = 1 billion floating-point operations). With this slight additional computational burden, SE-ResNet-50 approached the accuracy of the deeper ResNet-101 architecture.

The number of parameters in the SE-ResNet-50 model increased by 10%, which is mainly due to the two FC layers in each SE block.
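The size of this parameter increase can be estimated by hand. With a reduction ratio $r$, the two FC layers of an SE block attached to $C$ channels add about $2C^2/r$ weights (biases ignored). The sketch below assumes the standard ResNet-50 layout of four stages with 3, 4, 6 and 3 bottleneck blocks and 256, 512, 1024 and 2048 output channels; `se_extra_params` is a hypothetical helper written for this estimate.

```python
def se_extra_params(stage_channels, blocks_per_stage, r=16):
    """Weights added by the two FC layers of every SE block (biases ignored).
    Each block with C channels adds C * (C // r) + (C // r) * C = 2 * C^2 / r weights."""
    return sum(n * 2 * c * c // r
               for c, n in zip(stage_channels, blocks_per_stage))

# Assumed ResNet-50 layout: output channels and number of blocks per stage
extra = se_extra_params([256, 512, 1024, 2048], [3, 4, 6, 3])
print(extra)  # 2514944
```

About 2.5 million extra weights on top of the roughly 25 million parameters of ResNet-50 is consistent with the 10% figure. Most of the increase comes from the last stage, where the channel count is largest.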



# Comparison of the Notable Convnet Architectures: Bird's-Eye View


We compare the performance of the notable architectures that we have studied so far. We are particularly interested in analyzing model performance against computational complexity (number of operations) and model complexity (number of parameters). This comparison is drawn from the following paper.
https://arxiv.org/abs/1810.00736


We analyze model complexity by counting the total amount of learnable parameters. Specifically, we collect the size of the parameters in terms of MB for the considered models. This information is very useful for understanding the minimum amount of GPU memory required for each model.
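The conversion from a parameter count to a size in MB is simple arithmetic. The sketch below assumes 32-bit floating-point weights (4 bytes each) and 1 MB = $10^6$ bytes; the helper name and the 25-million-parameter model are hypothetical examples.

```python
def param_size_mb(num_params, bytes_per_param=4):
    """Storage needed for the weights alone, in MB (1 MB = 10**6 bytes).
    Assumes every parameter is stored with the same precision."""
    return num_params * bytes_per_param / 1e6

# Hypothetical model with 25 million parameters stored as float32
print(param_size_mb(25_000_000))  # 100.0
```

This covers only the weights, which is why it gives a minimum: activations, gradients and optimizer state need additional memory.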

We measure the computational cost of each model using floating-point operations (FLOPs), derived from the number of multiply-adds. Each multiply-add is counted as two FLOPs because, in many recent models, convolutions are bias-free and it makes sense to count the multiply and the add as separate FLOPs.
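As a worked example of this counting convention, the sketch below estimates the cost of a single bias-free convolutional layer. The helper `conv2d_flops` is hypothetical, and the layer used is the 7 x 7 convolution that opens the DenseNet model in this notebook (3 input channels, 64 filters, 112 x 112 output).

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of a bias-free k x k convolution, counting each
    multiply-add as two FLOPs."""
    # Every output value needs k * k * c_in multiply-adds
    mult_adds = h_out * w_out * c_out * k * k * c_in
    return 2 * mult_adds

flops = conv2d_flops(h_out=112, w_out=112, c_in=3, c_out=64, k=7)
print(flops, flops / 1e9)  # 236027904, about 0.24 GFLOPs
```

Summing this quantity over all layers gives the cost of one forward pass.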


## Accuracy Rate vs. Computational Complexity vs. Model Complexity

The following figure shows top-1 and top-5 validation accuracy vs. computational complexity and model complexity. The results are based on the ImageNet-1k validation dataset.

- The computational complexity is measured using the floating-point operations (FLOPs) required for a single forward pass.

- The model complexity is measured by counting the total number of learnable parameters. The size of each ball corresponds to the model complexity.


A quick observation reveals that NASNet-A-Large achieved the highest top-1 and top-5 accuracy, but it also has the highest computational complexity. 

If we are interested in architectures that perform well with low complexity, then **SE-ResNeXt-50(32 x 4d) stands out**. It is a modified ResNeXt architecture that uses SE blocks. The number of parallel paths in each ResNeXt block is 32 (i.e., cardinality = 32), and its bottleneck width is 4d. The SE-ResNeXt-50(32 x 4d) architecture achieves very high accuracy with a relatively low number of operations:
- Low computational complexity (i.e., lower than 5 GFLOPs) 
- A low level of model complexity, with approximately 27.6 M parameters (roughly 25 M for ResNeXt-50 plus the weights of the SE blocks)



The second-best low-complexity, high-performance model is **Xception**. It has slightly lower accuracy than SE-ResNeXt-50(32 x 4d), but it requires fewer parameters. Its number of GFLOPs is larger than that of SE-ResNeXt-50, though still lower than 9 GFLOPs.


One of the **most efficient models is GoogLeNet**. The closest is ResNet-18.


Two key observations from this comparison are:

- Overall, it seems that there is no relationship between computational complexity and recognition accuracy. For instance, SENet-154 needs about 3× the number of operations needed by SE-ResNeXt-101(32x4d) while having almost the same accuracy. 

- It seems that there is also no relationship between model complexity and recognition accuracy. For example, VGG-13 has a much higher level of model complexity (size of the ball) than ResNet-18 while having almost the same accuracy.


<img src="http://engineering.unl.edu/images/uploads/CNN_Comparison_AccuracyVsFLOPS.png" width=1200, height=1000>




### Accuracy-Rate vs. Learning Power


One issue with deep networks is that they are inefficient in the use of their full learning power (measured as the number of parameters with respect to the degrees of freedom).

Thus, we want to see how efficiently each model uses its parameters. The following figure shows the top-1 accuracy vs. top-1 accuracy density, i.e., top-1 accuracy divided by the number of parameters. The higher this value, the higher the efficiency. 
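The accuracy density is a simple ratio, sketched below. The two models and their numbers are hypothetical values made up to show the computation; they are not measurements from the paper.

```python
def accuracy_density(top1_accuracy_percent, num_params_millions):
    """Top-1 accuracy (in %) per million parameters."""
    return top1_accuracy_percent / num_params_millions

# Hypothetical models (illustrative numbers only)
small_model = accuracy_density(70.0, 5.0)     # 14.0
large_model = accuracy_density(80.0, 100.0)   # 0.8
print(small_model, large_model)
```

In this made-up example the larger model is more accurate, but it extracts far less accuracy from each parameter, which is the kind of trade-off the figure is meant to expose.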

From the figure we try to gain knowledge on the trade-off between accuracy and learning power. Following are our observations.

- Among the most efficient models, NASNet-A-Mobile and MobileNet-v2 are the two that provide much higher top-1 accuracy. 

- Among the models with the highest top-1 accuracy (i.e., higher than 80%), the following two use their parameters most efficiently: **Inception-v4 and SE-ResNeXt-101 (32x4d)**.



<img src="http://engineering.unl.edu/images/uploads/CNN_Comparison_AcuracyVsAccuracyDensity.png" width=800, height=600>