<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/CNN_Architectures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CNN architectures

In a previous [notebook](https://github.com/victorviro/Deep_learning_python/blob/master/Introduction_to_CNNs.ipynb) we introduced convolutional neural networks (CNNs).

In this notebook, we show some of the classical CNN architectures for classification tasks. Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 14-11). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

![](https://i.ibb.co/4P0Vy7n/Captura10.png)

**Note**: A very common pattern in convnet architectures is that the network starts with large images whose height and width gradually decrease with depth, while the number of channels generally increases.

Over the years, variants of this fundamental architecture have been developed. Let's see some of them.

### LeNet-5

The [LeNet-5](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for handwritten digit recognition ([MNIST](https://www.kaggle.com/c/digit-recognizer)). It is composed of the layers shown in the next table.

Layer | Type | Maps | Size | Kernel size | Stride | Activation
--- | --- | --- | --- | --- | --- | --- 
Out | Fully Connected| |10|||RBF
F6 | Fully Connected | |84|||tanh
C5 | Convolution | 120 |1 $\times$ 1|5 $\times$ 5|1|tanh
S4 | Avg Pooling | 16 |5 $\times$ 5|2 $\times$ 2|2|tanh
C3 | Convolution | 16 |10 $\times$ 10| 5 $\times$ 5|1|tanh
S2 | Avg Pooling | 6 |14 $\times$ 14|2 $\times$ 2|2|tanh
C1 | Convolution | 6 |28 $\times$ 28| 5 $\times$ 5|1|tanh
In | Input | 1 |32 $\times$ 32|||

![LeNet-5 architecture](https://i.ibb.co/zNSF62d/lenet-5-arquitecture.png)


This neural network was small by modern standards: it had about 60,000 parameters.

As you go deeper into the network (from left to right in the figure), the height and width tend to decrease, whereas the number of channels tends to increase.

There are a few extra details to be noted:

- MNIST images are $28 \times 28$ pixels, but they are zero-padded to $32 \times 32$ pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

- The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function. 

- Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). This sparse connectivity kept the number of connections low, which mattered because computers were much slower back then.

- A modern version of this neural network would use a softmax activation function in the last fully connected layer. LeNet-5 actually used a different output function for this layer (RBF units, as listed in the table), which is rarely used today.
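The shrinking sizes in the table follow from the usual output-size formula for layers without padding, $\lfloor (n - k) / s \rfloor + 1$, where $n$ is the input size, $k$ the kernel size and $s$ the stride. The following is a minimal sketch (the helper name `valid_output_size` is ours, not from the paper) that reproduces the Size column of the table:

```python
def valid_output_size(n, kernel, stride):
    """Output height/width of a convolution or pooling layer without padding."""
    return (n - kernel) // stride + 1

# (layer name, kernel size, stride) for LeNet-5, taken from the table above
layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2), ("C5", 5, 1)]

size = 32  # MNIST images zero-padded to 32 x 32
lenet_sizes = {}
for name, kernel, stride in layers:
    size = valid_output_size(size, kernel, stride)
    lenet_sizes[name] = size

print(lenet_sizes)
```

The sizes go from 28 after C1 down to 1 after C5, which is why C5 behaves like a fully connected layer even though it is defined as a convolution.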



### AlexNet

The [AlexNet CNN](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) architecture won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a 17% top-5 error rate while the second-best achieved only 26%! It was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The next table and figure present this architecture.

Layer | Type | Maps | Size | Kernel size | Stride | Padding | Activation
--- | --- | --- | --- | --- | --- | --- | --- 
Out | Fully Connected| |1000||||Softmax
F9 | Fully Connected | |4096||||ReLU
F8 | Fully Connected | |4096||||ReLU
C7 | Convolution | 256 |13 $\times$ 13|3 $\times$ 3|1|SAME|ReLU
C6 | Convolution | 384 |13 $\times$ 13|3 $\times$ 3|1|SAME|ReLU
C5 | Convolution | 384 |13 $\times$ 13|3 $\times$ 3|1|SAME|ReLU
S4 | Max Pooling | 256 |13 $\times$ 13|3 $\times$ 3|2|VALID|
C3 | Convolution | 256 |27 $\times$ 27| 5 $\times$ 5|1|SAME|ReLU
S2 | Max Pooling | 96 |27 $\times$ 27|3 $\times$ 3|2|VALID|
C1 | Convolution | 96 |55 $\times$ 55|11 $\times$ 11|4|VALID|ReLU
In | Input | 3 (RGB) |227 $\times$ 227||||


![AlexNet architecture](https://i.ibb.co/pbq8Fg9/alexnet-arquitecture.png)

To reduce overfitting, the authors used two regularization techniques: first, they applied dropout (see [notebook regularization](https://github.com/victorviro/Deep_learning_python/blob/master/Regularization_DNNs.ipynb)) with a 50% dropout rate during training to the outputs of layers F8 and F9. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called *local response normalization*. The most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps (such activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.
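One way to picture this normalization is to divide each activation by a term that grows with the squared activations of the neighboring feature maps at the same position: $b_i = a_i \left(k + \alpha \sum_j a_j^2\right)^{-\beta}$, where $j$ ranges over the feature maps within a given radius of map $i$. The sketch below applies this to the activations at a single spatial position. The function name and all default hyperparameter values are illustrative assumptions, not the settings used by the authors.

```python
def local_response_norm(activations, radius=2, k=1.0, alpha=1.0, beta=0.75):
    """Normalize each activation by the energy of neighboring feature maps.

    `activations` holds the ReLU outputs of all feature maps at ONE spatial
    position. The default hyperparameters are illustrative only.
    """
    n = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        lo, hi = max(0, i - radius), min(n - 1, i + radius)
        energy = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (k + alpha * energy) ** beta)
    return normalized

# Map 1 is strongly activated; maps 0 and 4 start with the same activation
out = local_response_norm([1.0, 10.0, 1.0, 0.0, 1.0], radius=1)
print(out[0] < out[4])  # map 0 sits next to the strong map, so it is inhibited more
```

Here the neuron next to the strongly activated map ends up with a smaller output than an equally active neuron farther away, which is the competitive effect described above.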

When the authors developed this architecture, GPUs were still relatively slow, so the network was trained with a fairly complex scheme that split it across two GPUs.

This neural network had a lot in common with LeNet-5, but it was much bigger (about 60 million parameters). It used similar building blocks, but it had many more hidden units and was trained on much more data (the ImageNet dataset). Another aspect that made it much better than LeNet-5 was the use of the ReLU activation function.

Even before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas, but many people think it was this paper that convinced much of the computer vision community to take a serious look at deep learning. Deep learning then went on to have a huge impact not just in computer vision but well beyond it.


#### Data augmentation

*Data augmentation* artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making this a regularization technique. The generated instances should be as realistic as possible: ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not. Moreover, simply adding white noise will not help; the modifications should be learnable (white noise is not).

For example, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set. This forces the model to be more tolerant to variations in the position, orientation, and size of the objects in the pictures. If you want the model to be more tolerant to different lighting conditions, you can similarly generate many images with various contrasts. In general, you can also flip the pictures horizontally (except for text and other objects whose meaning changes when mirrored). By combining these transformations you can greatly increase the size of your training set.
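As a toy illustration of the idea, the sketch below generates flipped and shifted variants of a tiny grayscale image stored as a list of pixel rows. The helper names are ours, and a real pipeline would use an image library and random transformation amounts. This only shows how one instance turns into several.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def shift_right(image, offset, fill=0):
    """Shift every row `offset` pixels to the right, padding with `fill`."""
    width = len(image[0])
    return [([fill] * offset + row)[:width] for row in image]

def augment(image, offsets=(1, 2)):
    """Return the original image plus one flipped and several shifted variants."""
    variants = [image, flip_horizontal(image)]
    for offset in offsets:
        variants.append(shift_right(image, offset))
    return variants

image = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment(image)
print(len(augmented))  # one training instance became four
```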

For a further explanation of data augmentation in computer vision, you can check this [video](https://www.youtube.com/watch?v=JI8saFjK84o&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=21).

### GoogLeNet/Inception

The [GoogLeNet](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html) architecture was developed by Christian Szegedy et al. from Google Research, and it won the ILSVRC 2014 challenge by pushing the top-5 error rate below 7%. This great performance came in large part from the fact that the network was much deeper than previous CNNs. This was made possible by sub-networks called *inception modules*, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of about 60 million).

Before going deeper into this architecture, let's look at an important idea used in the inception modules.



#### Network in network

We saw in the [notebook](https://github.com/victorviro/Deep_learning_python/blob/master/Introduction_to_CNNs.ipynb) introducing convolutional neural networks that a pooling layer essentially shrinks the spatial dimensions $n_h$ and $n_w$. A $1 \times 1$ convolution, instead, can shrink the number of channels: such layers act as *bottleneck layers* (meaning they reduce dimensionality) and therefore save on computational cost and on the number of parameters. Furthermore, a $1 \times 1$ convolution adds non-linearity to the network. Although these layers cannot capture spatial patterns, they can capture patterns along the depth dimension.

Let’s see the following example:

![Example of a 1×1 convolution](https://i.ibb.co/S70J564/1-by-1-convolution-png.png)

Here, the $1 \times 1$ convolution looks at each of the 36 ($6 \times 6$) positions, takes the element-wise product between the $32$ numbers at that position and the $32$ numbers in the filter, sums the products, and then applies a ReLU non-linearity to the result. If we apply $n$ such $1 \times 1$ convolutional filters, the output will be, in our example, $6 \times 6 \times n$.
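The operation can be sketched in a few lines of plain Python. This is a minimal illustration on nested lists (the function name `conv1x1` is ours and biases are omitted), not an efficient implementation. It shows that the height and width are preserved while the depth becomes the number of filters.

```python
def conv1x1(volume, filters):
    """Apply 1x1 filters followed by ReLU.

    volume:  nested list of shape [height][width][channels]
    filters: nested list of shape [n_filters][channels]
    returns: nested list of shape [height][width][n_filters]
    """
    output = []
    for row in volume:
        out_row = []
        for pixel in row:  # the channel vector at one spatial position
            out_row.append([
                max(0.0, sum(w * x for w, x in zip(f, pixel)))  # dot product + ReLU
                for f in filters
            ])
        output.append(out_row)
    return output

# A 2 x 2 input with 3 channels, reduced to 2 channels by two 1x1 filters
volume = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
          [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]]]
filters = [[1.0, 0.0, -1.0],   # channel 0 minus channel 2
           [0.5, 0.5, 0.5]]    # scaled sum of the channels
result = conv1x1(volume, filters)
print(len(result), len(result[0]), len(result[0][0]))  # height, width, new depth
```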




Even though the details of the architecture in the original [paper](https://arxiv.org/abs/1312.4400) are not widely used, the idea of a $1 \times 1$ convolution, sometimes called *network in network*, has been very influential in many other neural network architectures, including the inception module which we will see below.

For a further explanation of $1 \times 1$ convolution you can check this [video](https://www.youtube.com/watch?v=c1RBQzKsDCk&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=16).

#### Inception module

When designing a layer for a ConvNet we need to pick many things: the number of filters, the kernel size, the type of layer (convolution, pooling, etc.). What if we didn't have to choose, but could use them all? That's what the Inception network does: it applies all these options in parallel and stacks their outputs. Figure 14-13 shows the architecture of an inception module.

![Inception module](https://i.ibb.co/Y8D2jTS/inception-module.png)

The notation “$3 \times 3$ + 1(S)” means that the layer uses a $3 \times 3$ kernel, stride 1, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes ($1 \times 1$, $3 \times 3$, and $5 \times 5$), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max-pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers).

You may wonder why inception modules have convolutional layers with $1 \times 1$ kernels. As discussed in the network in network section, these layers serve three purposes:

- First, although they cannot capture spatial patterns, they can capture patterns along the depth dimension.

- Second, they are configured to output fewer feature maps than their inputs (shrink the number of channels), so they serve as bottleneck layers. This cuts the computational cost and the number of parameters, speeding up training, and improving generalization.

- Lastly, each pair of convolutional layers ([$1 \times 1$, $3 \times 3$] and [$1 \times 1$, $5 \times 5$]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.


In short, the whole inception module acts like a convolutional layer that is able to output feature maps capturing complex patterns at various scales.

**Note**: The number of convolutional kernels (filters) for each convolutional layer is a hyperparameter. Unfortunately, this means that you have six more hyperparameters to tweak for every inception layer you add.
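Since the outputs are concatenated along the depth dimension, the output depth of a module is simply the sum of the filter counts of the four layers feeding the depth concat. The two bottleneck layers only affect the cost inside their branches. A small sketch, where the argument names are ours and the example values are the filter counts reported for the first inception module in the GoogLeNet paper:

```python
def inception_output_depth(n_1x1, n_3x3_reduce, n_3x3,
                           n_5x5_reduce, n_5x5, n_pool_proj):
    """Depth of the concatenated output of one inception module.

    Only the four layers feeding the depth concat contribute; the two
    "reduce" bottlenecks are internal to their branches.
    """
    return n_1x1 + n_3x3 + n_5x5 + n_pool_proj

depth = inception_output_depth(64, 96, 128, 16, 32, 32)
print(depth)
```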



Let's see how the $1 \times 1$ convolution can reduce the computational cost. For example, consider the $5 \times 5$ convolution shown in the next figure:

![Cost of a 5×5 convolution](https://i.ibb.co/09jh68s/5-by-5-filter.png)

The total number of multiplications it computes is the number of output values times the number of multiplications per output value:
$(28 \times 28 \times 32) \times (5 \times 5 \times 192) \approx 120$ million!

Interposing a $1 \times 1$ convolution that first reduces the number of channels cuts the total computational cost by roughly a factor of 10.

![Cost with an interposed 1×1 convolution](https://i.ibb.co/5rCWtdg/1-by-1-cost.png)
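The arithmetic can be checked with a short script. The input and output shapes come from the example above. The bottleneck width of 16 channels is an assumption on our part (it is the value commonly used for this example), and the function name is ours.

```python
def conv_multiplications(out_h, out_w, out_channels, kernel, in_channels):
    """Multiplications needed by a convolution producing out_h x out_w x out_channels."""
    return out_h * out_w * out_channels * kernel * kernel * in_channels

# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck version (assumed 16 intermediate channels):
# 1x1 convolution 192 -> 16, then 5x5 convolution 16 -> 32
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct, bottleneck, round(direct / bottleneck, 1))
```

Under this assumption the cost drops from about 120 million to about 12.4 million multiplications.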

For a visual explanation of the inception module, you can check this [video](https://www.youtube.com/watch?v=C86ZXvgpejM&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=17).

#### GoogLeNet

Now let’s look at the architecture of the GoogLeNet CNN (see Figure 14-14). 

![GoogLeNet architecture](https://i.ibb.co/Tcgggd5/goog-Le-Net-arquitecture.png)

The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size. The architecture is so deep that it has to be represented in three columns, but GoogLeNet is actually one tall stack, including nine inception modules (the boxes with the spinning tops). The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 14-13). Note that all the convolutional layers use the ReLU activation function.

Let’s go through this network:

- The first two layers divide the image’s height and width by 4 (so its area is divided by 16), to reduce the computational load. The first layer uses a large kernel size, so that much of the information is still preserved.

- Then the local response normalization layer ensures that the previous layers learn a wide variety of features.

- Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.

- Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.

- Next, a max pooling layer halves the image height and width, again to speed up computations.

- Then comes the tall stack of nine inception modules, interleaved with a couple of max pooling layers to reduce dimensionality and speed up the net.

- Next, the global average pooling layer simply outputs the mean of each feature map: this drops any remaining spatial information, which is fine since there was not much spatial information left at that point. Indeed, GoogLeNet input images are typically expected to be $224 \times 224$ pixels, so after five downsampling steps (the stride-2 first convolution and four max pooling layers), each dividing the height and width by 2, the feature maps are down to $7 \times 7$. Moreover, it is a classification task, not localization, so it does not matter where the object is. Thanks to the dimensionality reduction brought by this layer, there is no need to have several fully connected layers at the top of the CNN (like in AlexNet), and this considerably reduces the number of parameters in the network and limits the risk of overfitting.

- The last layers are self-explanatory: dropout for regularization, then a fully connected layer with 1,000 units, since there are 1,000 classes, and a softmax activation function to output estimated class probabilities.

This diagram is slightly simplified: the original GoogLeNet architecture also included two auxiliary classifiers plugged on top of the third and sixth inception modules. They were both composed of one average pooling layer, one convolutional layer, two fully connected layers, and a softmax activation layer. During training, their loss (scaled down by 70%) was added to the overall loss. The goal was to fight the vanishing gradients problem and regularize the network. However, it was later shown that their effect was relatively minor.

For a visual explanation of the inception network you can check this [video](https://www.youtube.com/watch?v=KfV8CJh7hE0&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=18).

Several variants of the GoogLeNet architecture were later proposed by Google researchers, including Inception-v3 and Inception-v4, using slightly different inception modules and reaching even better performance.

### VGGNet

The runner-up in the ILSVRC 2014 challenge was [VGGNet](https://arxiv.org/abs/1409.1556), developed by K. Simonyan and A. Zisserman. It had a very simple and classical architecture, with 2 or 3 convolutional layers, a pooling layer, then again 2 or 3 convolutional layers, a pooling layer, and so on, plus a final dense network with 2 hidden layers and the output layer. The VGG16 variant has 16 layers with weights in total (13 convolutional and 3 fully connected), hence its name. Instead of having so many hyperparameters, it uses a much simpler design with only $3 \times 3$ filters (but many of them), SAME padding, and $2 \times 2$ max pooling layers with a stride of 2.

![VGG16 architecture](https://i.ibb.co/M8VR2cH/vgg16.png)

The relative uniformity of this network made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters (about 138 million). There is also VGG19, an even bigger version of this network.
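The parameter count can be reproduced from the layer layout. The sketch below assumes the standard VGG16 configuration from the paper (13 convolutional layers with $3 \times 3$ kernels, five $2 \times 2$ max pooling layers, and dense layers of 4096, 4096 and 1000 units on $224 \times 224$ RGB inputs). These specifics are not spelled out in this notebook, and the names are ours.

```python
# Output channels of the 13 convolutional layers ("M" marks a max pooling layer)
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]

def count_vgg16_parameters(input_size=224, input_channels=3, n_classes=1000):
    channels, size, total = input_channels, input_size, 0
    for item in VGG16_CONFIG:
        if item == "M":
            size //= 2  # 2x2 max pooling with stride 2 halves height and width
        else:
            total += 3 * 3 * channels * item + item  # 3x3 kernels + biases
            channels = item
    # Dense part: two hidden layers of 4096 units, then the output layer
    n_inputs = size * size * channels
    for n_units in (4096, 4096, n_classes):
        total += n_inputs * n_units + n_units
        n_inputs = n_units
    return total

vgg16_params = count_vgg16_parameters()
print(f"{vgg16_params:,}")
```

Most of the parameters sit in the first dense layer, which connects the flattened $7 \times 7 \times 512$ feature maps to 4096 units.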

### ResNet

The ILSVRC 2015 challenge was won using a [*Residual Network*](https://arxiv.org/abs/1512.03385) (or *ResNet*), developed by Kaiming He et al., which delivered an astounding top-5 error rate under 3.6%, using an extremely deep CNN composed of 152 layers. It confirmed the general trend: models are getting deeper and deeper, with fewer and fewer parameters. Very deep neural networks are difficult to train because of vanishing and exploding gradients (see [notebook](https://github.com/victorviro/Deep_learning_python/blob/master/Vanishing_Exploding_gradients_problem_DNNs.ipynb)). The key to being able to train such a deep network is to use *skip connections* (also called *shortcut connections*), which allow you to take the activation of one layer and feed it to another layer much deeper in the network. Skip connections are the building blocks of a Residual Network (ResNet). Let’s see why this is useful.

When training a neural network, the goal is to make it model a target function $h(\boldsymbol{x})$. If you add the input $\boldsymbol{x}$ to the output of the network (i.e., you add a skip connection), then the network will be forced to model $f(\boldsymbol{x}) = h(\boldsymbol{x}) - \boldsymbol{x}$ rather than $h(\boldsymbol{x})$. This is called *residual learning* (see Figure 14-15).

![Residual learning](https://i.ibb.co/SXjTmcz/residual-learning.png)

When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably.







A residual block is sketched as:

![Residual block](https://i.ibb.co/km0PC6s/residual-block.png)

which is computed as:

![Residual block computation](https://i.ibb.co/Jry9Q8M/residual-block-computation.png)

Therefore, with a residual block, in addition to following the regular, “main” path, we take $a^{[l]}$ and add it to the output of a later layer just before its ReLU non-linearity is applied. For a further explanation of residual blocks, you can check this [video](https://www.youtube.com/watch?v=ZILIbUvp5lk&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=14).
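A minimal sketch of this computation, using small fully connected layers instead of convolutions to keep it short (the function names and the toy sizes are ours), shows why a freshly initialized residual block with near-zero weights behaves like the identity function:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Fully connected layer without activation: weights[i] is the row for output i."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(a, w1, b1, w2, b2):
    """Two-layer residual block: output = ReLU(z2 + a), with z2 from the main path."""
    a1 = relu(dense(a, w1, b1))   # first layer of the main path
    z2 = dense(a1, w2, b2)        # second layer, before its activation
    return relu([z + skip for z, skip in zip(z2, a)])  # add the shortcut, then ReLU

# With all-zero weights the main path contributes nothing, so the block
# passes its (non-negative) input through unchanged.
a = [0.5, 2.0, 0.0]
zeros = [[0.0] * 3 for _ in range(3)]
out = residual_block(a, zeros, [0.0] * 3, zeros, [0.0] * 3)
print(out)
```

Without the shortcut, the same zero-weight block would output only zeros, so no signal would reach the following layers.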



Moreover, if you add many skip connections, the network can start making progress even if several layers have not started learning yet (see Figure 14-16).



![Regular deep neural network vs. deep residual network](https://i.ibb.co/ggY1Zdw/DNN-vs-DRN.png)

Thanks to skip connections, the signal can easily make its way across the whole network. The deep residual network can be seen as a stack of *residual units* or *residual blocks*, where each residual unit is a small neural network with a skip connection. In practice, a "plain" regular network becomes harder to train as it gets deeper, so beyond a certain depth its training error increases. A residual network, instead, can be made very deep while the training error keeps decreasing.

![Training error of plain networks vs. ResNets as depth grows](https://i.ibb.co/6JBKCZq/training-deep-network-resnets.png)

Now let’s look at ResNet’s architecture (see Figure 14-17).

![ResNet architecture](https://i.ibb.co/dPWNCCz/resnet-arquitecture.png)


It is actually surprisingly simple. It starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of two convolutional layers (and no pooling layer!), with Batch Normalization (BN) and ReLU activation, using $3 \times 3$ kernels and preserving spatial dimensions (stride 1, SAME padding).

Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens the inputs cannot be added directly to the outputs of the residual unit since they don’t have the same shape (for example, this problem affects the skip connection represented by the dashed arrow in Figure 14-17). To solve this problem, the inputs are passed through a $1 \times 1$ convolutional layer with stride 2 and the right number of output feature maps (see Figure 14-18).

![Skip connection when changing feature map size and depth](https://i.ibb.co/mC15HnN/skip-connection-change-feature-map.png)

ResNet-34 is the ResNet with 34 layers (only counting the convolutional layers and the fully connected layer) containing three residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.


ResNets deeper than that, such as ResNet-152, use slightly different residual units. Instead of two $3 \times 3$ convolutional layers with (say) 256 feature maps, they use three convolutional layers: first, a $1 \times 1$ convolutional layer with just 64 feature maps (4 times fewer), which acts as a bottleneck layer (as discussed already), then a $3 \times 3$ layer with 64 feature maps, and finally another $1 \times 1$ convolutional layer with 256 feature maps (4 times 64) that restores the original depth. ResNet-152 contains three such RUs that output 256 maps, then 8 RUs with 512 maps, a whopping 36 RUs with 1,024 maps, and finally 3 RUs with 2,048 maps.
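The layer counts in the names can be verified from the numbers of residual units given above. This is a small sketch (the function name is ours) that counts the first convolutional layer, the convolutions inside the residual units, and the final fully connected layer. The $1 \times 1$ convolutions on the skip connections are not counted.

```python
def resnet_depth(units_per_stage, convs_per_unit):
    """Count weighted layers: the first convolution, every convolution inside
    the residual units, and the final fully connected layer."""
    return 1 + convs_per_unit * sum(units_per_stage) + 1

resnet34 = resnet_depth([3, 4, 6, 3], convs_per_unit=2)     # two 3x3 layers per unit
resnet152 = resnet_depth([3, 8, 36, 3], convs_per_unit=3)   # bottleneck units
print(resnet34, resnet152)
```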

For a further explanation of the ResNet network, you can check this [video](https://www.youtube.com/watch?v=RYth6EbBUqM&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=15).

**Note**: Google’s [Inception-v4](https://arxiv.org/abs/1602.07261) architecture merged the ideas of GoogLeNet and ResNet and achieved close to 3% top-5 error rate on ImageNet classification.


### Xception

Another variant of the GoogLeNet architecture is also worth noting: [Xception](https://arxiv.org/abs/1610.02357) (which stands for Extreme Inception) was proposed in 2016 by François Chollet (the author of Keras), and it significantly outperformed Inception-v3 on a huge vision task (350 million images and 17,000 classes). Just like Inception-v4, it also merges the ideas of GoogLeNet and ResNet, but it replaces the inception modules with a special type of layer called a *depthwise separable convolution* (or *separable convolution* for short). These layers had been used before in some CNN architectures, but they were not as central as in the Xception architecture. While a regular convolutional layer uses filters that try to simultaneously capture spatial patterns (e.g., an oval) and cross-channel patterns (e.g., mouth + nose + eyes = face), a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately (see Figure 14-19). Thus, it is composed of two parts: the first part applies a single spatial filter for each input feature map, then the second part looks exclusively for cross-channel patterns: it is just a regular convolutional layer with $1 \times 1$ filters.

![Depthwise separable convolutional layer](https://i.ibb.co/hdxR0Ww/depthwise-separable-conv-layer.png)

Since separable convolutional layers only have one spatial filter per input channel, you should avoid using them after layers that have too few channels, such as the input layer (granted, that’s what Figure 14-19 represents, but it is just for illustration purposes). For this reason, the Xception architecture starts with 2 regular convolutional layers, but then the rest of the architecture uses only separable convolutions (34 in all), plus a few max-pooling layers and the usual final layers (a global average pooling layer, and a dense output layer).


You might wonder why Xception is considered a variant of GoogLeNet, since it contains no inception module at all? Well, as we discussed earlier, an Inception module contains convolutional layers with $1 \times 1$ filters: these look exclusively for cross-channel patterns. However, the convolution layers that sit on top of them are regular convolutional layers that look both for spatial and cross-channel patterns. So you can think of an Inception module as an intermediate between a regular convolutional layer (which considers spatial patterns and cross-channel patterns jointly) and a separable convolutional layer (which considers them separately). In practice, it seems that separable convolutions generally perform better.


**Note**: Separable convolutions use fewer parameters, less memory, and fewer computations than regular convolutional layers, and in general they even perform better, so you should consider using them by default (except after layers with few channels).
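The parameter savings are easy to quantify. The sketch below compares the number of weights (biases ignored) of a regular convolution and of a depthwise separable convolution. The function names and the layer sizes are illustrative choices of ours.

```python
def regular_conv_params(kernel, in_channels, out_channels):
    """Weights of a regular convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_params(kernel, in_channels, out_channels):
    """Depthwise part (one spatial filter per input channel) plus
    pointwise part (1x1 convolution across channels); biases ignored."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

regular = regular_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(regular, separable, round(regular / separable, 1))
```

For a $3 \times 3$ layer mapping 128 to 256 channels, the separable version needs roughly 9 times fewer weights.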

The ILSVRC 2016 challenge was won by the CUImage team from the Chinese University of Hong Kong. They used an ensemble of many different techniques, including a sophisticated object-detection system called [GBD-Net](https://arxiv.org/abs/1610.02579), to achieve a top-5 error rate below 3%. Although this result is unquestionably impressive, the complexity of the solution contrasted with the simplicity of ResNets. Moreover, one year later another fairly simple architecture performed even better, as we will see now.

### SENet

The winning architecture in the ILSVRC 2017 challenge was the [Squeeze-and-Excitation Network](https://arxiv.org/abs/1709.01507) (SENet). This architecture extends existing architectures such as inception networks or ResNets, and boosts their performance. The extended versions of inception networks and ResNet are called *SE-Inception* and *SE-ResNet* respectively. The boost comes from the fact that a SENet adds a small neural network, called a *SE Block*, to every unit in the original architecture (i.e., every inception module or every residual unit), as shown in Figure 14-20.

![SE-Inception module and SE-ResNet unit](https://i.ibb.co/1J4CHhz/se-inception-and-se-resnet.png)

A SE Block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together. It then uses this information to recalibrate the feature maps, as shown in Figure 14-21. 

![SE Block recalibrating feature maps](https://i.ibb.co/4sfbjWV/se-block.png)

For example, a SE Block may learn that mouths, noses, and eyes usually appear together in pictures: if you see a mouth and a nose, you should expect to see eyes as well. So if a SE Block sees a strong activation in the mouth and nose feature maps, but only mild activation in the eye feature map, it will boost the eye feature map (more accurately, it will reduce irrelevant feature maps). If the eyes were somewhat confused with something else, this feature map recalibration will help resolve the ambiguity.

A SE Block is composed of just 3 layers: a global average pooling layer, a hidden dense layer using the ReLU activation function, and a dense output layer using the sigmoid activation function (see Figure 14-22):

![SE Block architecture](https://i.ibb.co/3FTLHcy/se-block-arquitecture.png)

As earlier, the global average pooling layer computes the mean activation for each feature map: for example, if its input contains 256 feature maps, it will output 256 numbers representing the overall level of response for each filter. The next layer is where the “squeeze” happens: this layer has far fewer than 256 neurons, typically 16 times fewer than the number of feature maps (e.g., 16 neurons), so the 256 numbers get compressed into a small vector (e.g., 16-dimensional). This is a low-dimensional vector representation (i.e., an embedding) of the distribution of feature responses. This bottleneck step forces the SE Block to learn a general representation of the feature combinations. Finally, the output layer takes the embedding and outputs a recalibration vector containing one number per feature map (e.g., 256), each between 0 and 1. The feature maps are then multiplied by this recalibration vector, so irrelevant features (with a low recalibration score) get scaled down while relevant features (with a recalibration score close to 1) are left alone.

#### Understanding SENet

CNNs use their convolutional filters to extract hierarchical information from images. Lower layers find simple pieces of context like edges or high frequencies, while upper layers can detect faces, text, or other complex shapes. They extract whatever is necessary to solve a task efficiently. All of this works by fusing the spatial and channel information of an image. The different filters first find spatial features in each input channel and then add up the information across channels to produce each output channel.

A regular network weights each of its channels equally when creating the output feature maps. SENets change this by adding a content-aware mechanism that weights each channel adaptively. In its most basic form, this could mean adding a single learnable parameter to each channel, i.e., a linear scaling factor that expresses how relevant the channel is.

However, the authors push it a little further. First, they get a global understanding of each channel by squeezing each feature map to a single numeric value. This results in a vector of size $n$, where $n$ is equal to the number of convolutional channels. Afterward, this vector is fed through a two-layer neural network, which outputs a vector of the same size. These $n$ values can now be used as weights on the original feature maps, scaling each channel based on its importance.

Let’s jump straight into implementing a SE-block to understand how it works.



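```python
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Multiply, Reshape

def se_block(in_block, ch, ratio=16):
    x = GlobalAveragePooling2D()(in_block)        # squeeze: one mean per channel
    x = Dense(ch // ratio, activation='relu')(x)  # bottleneck layer
    x = Dense(ch, activation='sigmoid')(x)        # one gate in (0, 1) per channel
    x = Reshape((1, 1, ch))(x)                    # broadcast over height and width
    return Multiply()([in_block, x])
```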
1. The function receives a convolutional block (a feature-map tensor) and its number of channels.

2. Global average pooling squeezes each channel to a single numeric value.

3. A fully connected layer followed by a ReLU adds the necessary nonlinearity. It also reduces the number of channels by the given ratio.

4. A second fully connected layer followed by a sigmoid activation gives each channel a smooth gating value between 0 and 1.

5. Finally, each feature map of the convolutional block is weighted by the output of this side network.

These five steps add very little computational cost (less than 1%; the SENet paper reports about 0.26% more FLOPs for SE-ResNet-50 than for ResNet-50) and can be added to most CNN architectures.
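
A rough back-of-the-envelope count shows why the overhead is so small. The sketch below compares the operations of one SE block with those of a single $3 \times 3$ convolution on the same feature maps. The sizes ($14 \times 14$ maps, 256 channels) and the helper names are assumptions chosen for illustration, biases and activations are ignored, and a real network contains several convolutions per SE block, so the true ratio depends on the architecture.

```python
def conv3x3_macs(h, w, c_in, c_out):
    # Multiply-accumulates of a 3x3 convolution with 'same' padding and stride 1.
    return h * w * 9 * c_in * c_out

def se_block_ops(h, w, c, ratio=16):
    pooling = h * w * c            # additions for global average pooling
    dense = 2 * c * (c // ratio)   # the two fully connected layers
    scaling = h * w * c            # one multiplication per activation
    return pooling + dense + scaling

# Hypothetical layer: 14x14 feature maps with 256 channels.
h, w, c = 14, 14, 256
conv = conv3x3_macs(h, w, c, c)
se = se_block_ops(h, w, c)
print(conv, se, f"{100 * se / conv:.3f}%")  # 115605504 108544 0.094%
```

The two dense layers are cheap because they operate on a vector of length `c`, not on the full feature maps.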

![SE-Inception and SE-ResNet modules](https://i.ibb.co/xKZS4BL/Se-Inception-and-SE-Res-Net.png)


### DenseNet

Densely Connected Convolutional Networks ([DenseNets](https://arxiv.org/abs/1608.06993)) are the next step in the effort to keep increasing the depth of convolutional networks.

In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features:

![Standard ConvNet](https://miro.medium.com/max/770/1*RfyAoe6Wlv4aLip2Y5Aw-Q.png)

In ResNet, identity mappings are added to the output of each module by element-wise addition, which helps gradients propagate. A ResNet can be viewed as an algorithm with a state that is passed from one ResNet module to the next:

![ResNet with identity mappings and element-wise addition](https://miro.medium.com/max/770/1*4wx7szWCBse9-7eemGQJSw.png)

In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers using concatenation. Each layer receives the *collective knowledge* of all preceding layers:

![DenseNet with concatenation of feature maps from all preceding layers](https://miro.medium.com/max/770/1*rmHdoPjGUjRek6ozH7altw.png)
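
The difference between the two ways of combining features can be seen directly in the tensor shapes. In this small NumPy sketch, the random arrays `f_res` and `f_dense` are hypothetical stand-ins for the output of a layer; the sizes are arbitrary. Addition keeps the number of channels fixed and mixes old and new features, while concatenation keeps the earlier feature maps unchanged and appends the new ones.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8, 8, 16))        # incoming feature maps, 16 channels

# ResNet style: the layer output must match x in shape and is added to it.
f_res = rng.normal(size=x.shape)          # stand-in for the residual branch
res_out = x + f_res

# DenseNet style: the layer produces a few new maps that are appended to x.
f_dense = rng.normal(size=(1, 8, 8, 4))   # stand-in for 4 new feature maps
dense_out = np.concatenate([x, f_dense], axis=-1)

print(res_out.shape, dense_out.shape)     # (1, 8, 8, 16) (1, 8, 8, 20)
```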





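Because every layer reuses the feature maps of all preceding layers instead of relearning them, each layer can be narrow. As a result, DenseNet achieves higher computational and parameter efficiency. The following figure shows the concept of concatenation during forward propagation:

![Concatenation during forward propagation in DenseNet](https://miro.medium.com/max/660/1*9ysRPSExk0KvXR0AhNnlAA.gif)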



**Growth rate**: Since each layer receives feature maps from all preceding layers, the network can be thinner and more compact, i.e., each layer can have fewer channels. The growth rate $k$ is the number of channels that each layer adds. A relatively small growth rate is enough to obtain state-of-the-art results. One can view the feature maps as the global state of the network. Each layer adds $k$ feature maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state.

![Dense block with growth rate k](https://miro.medium.com/max/669/1*P7tcHlzxm9Afg0Cejftv0g.png)
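
The effect of the growth rate on layer widths is simple arithmetic: if the block input has $k_0$ channels, layer $\ell$ of the block receives $k_0 + k(\ell - 1)$ feature maps. The short sketch below tabulates this. The function name and the values ($k_0 = 64$, $k = 32$, six layers) are assumptions picked for illustration.

```python
def dense_block_input_channels(k0, k, num_layers):
    # Layer l (1-indexed) sees the block input plus k maps from each earlier layer.
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]

inputs = dense_block_input_channels(k0=64, k=32, num_layers=6)
print(inputs)            # [64, 96, 128, 160, 192, 224]
print(inputs[-1] + 32)   # channels leaving the block: 256
```

Each layer still outputs only $k$ maps, so the input width grows linearly while the per-layer output stays small.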

**Composite function**: The authors define a composite function of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU), and a $3 \times 3$ convolution (Conv).

![Composite function: BN, ReLU, and 3x3 Conv](https://miro.medium.com/max/660/1*IwvJGTxBAcb1H5tSJR6Lng.gif)
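
As a rough illustration of the BN, ReLU, Conv ordering, here is a NumPy sketch of the composite function. It makes several simplifying assumptions: batch normalization is reduced to per-channel standardization without learned scale and shift or running statistics, the convolution uses zero padding and stride 1, the weights are random, and all function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Simplified BN: standardize each channel over batch and spatial axes.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv3x3_same(x, w):
    # x: (N, H, W, C_in), w: (3, 3, C_in, C_out); zero padding keeps H and W.
    n, h, wd, _ = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1), (0, 0)))
    out = np.zeros((n, h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Shifted window times the (C_in, C_out) slice of the kernel.
            out += xp[:, i:i + h, j:j + wd, :] @ w[i, j]
    return out

def composite(x, w):
    # BN -> ReLU -> 3x3 Conv, in that order.
    return conv3x3_same(np.maximum(batch_norm(x), 0.0), w)

rng = np.random.default_rng(0)
k = 12                                   # illustrative growth rate
x = rng.normal(size=(2, 8, 8, 24))
w = rng.normal(size=(3, 3, 24, k)) * 0.1
new_maps = composite(x, w)
print(new_maps.shape)                    # (2, 8, 8, 12)
```

Whatever the number of input channels, the function returns exactly `k` new feature maps with the same spatial size.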

**Dense blocks**: The concatenation operation is not viable when the size of the feature maps changes. To facilitate down-sampling in the architecture, the authors divide the network into multiple densely connected *dense blocks*. Feature map sizes are the same within a dense block, so the maps can be concatenated easily. The layers between blocks are called *transition layers*, and they perform convolution and pooling. In the authors' experiments, the transition layers consist of a batch normalization layer and a $1 \times 1$ convolutional layer followed by a $2 \times 2$ average pooling layer.

![Dense blocks separated by transition layers](https://miro.medium.com/max/770/1*BJM5Ht9D5HcP5CFpu8bn7g.png)
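
A transition layer can be sketched in a few lines of NumPy. This is only an illustration under assumptions: batch normalization is simplified to per-channel standardization, the $1 \times 1$ convolution is written as a matrix product over the channel axis, the pooling assumes even height and width, and the names and sizes are hypothetical.

```python
import numpy as np

def avg_pool_2x2(x):
    # Non-overlapping 2x2 average pooling; assumes even H and W.
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def transition(x, weights, eps=1e-5):
    # Simplified BN (no learned scale/shift), then 1x1 conv, then pooling.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x @ weights                      # 1x1 conv: per-pixel map over channels
    return avg_pool_2x2(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 36))       # output of a dense block
weights = rng.normal(size=(36, 18)) * 0.1
out = transition(x, weights)
print(out.shape)                         # (2, 4, 4, 18)
```

The spatial size is halved, so the feature maps of the next dense block again share one size and can be concatenated.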

At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached.

**Bottleneck layers**: Although each layer only produces $k$ output feature maps, it typically has many more inputs. A $1 \times 1$ convolution is introduced as a *bottleneck* layer before each $3 \times 3$ convolution to reduce the number of input feature maps, and thus to improve computational efficiency.

![Bottleneck layer: 1x1 convolution before the 3x3 convolution](https://miro.medium.com/max/770/1*dniz8zK2ClBY96ol7YGnJw.png)
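
A quick operation count illustrates when the bottleneck pays off. The sketch below counts multiply-accumulates per spatial position for a layer with $m$ input maps, with and without the $1 \times 1$ bottleneck. It assumes the bottleneck outputs $4k$ feature maps, which is the width used in the DenseNet paper but is not stated above; the function names and the value $k = 32$ are illustrative.

```python
def direct_macs(m, k):
    # 3x3 convolution applied directly to all m input maps.
    return 9 * m * k

def bottleneck_macs(m, k, width=4):
    # 1x1 convolution down to width*k maps, then 3x3 convolution to k maps.
    inner = width * k
    return m * inner + 9 * inner * k

k = 32
for m in (64, 256, 512, 1024):
    print(m, direct_macs(m, k), bottleneck_macs(m, k))
# 64 18432 45056
# 256 73728 69632
# 512 147456 102400
# 1024 294912 167936
```

Under these assumptions the bottleneck only saves work once the number of inputs exceeds $7.2k$, which is the typical situation deep inside a dense block.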

**Further Compression**: To further improve model compactness, the authors reduce the number of feature maps at transition layers. If a dense block contains $m$ feature maps, the following transition layer generates $\theta m$ output feature maps, where $\theta$ is called the *compression* factor ($0 < \theta \le 1$).
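
To see how growth and compression interact, the sketch below traces the number of channels through a network with four dense blocks. The configuration (6, 12, 24, and 16 layers per block, $k = 32$, 64 initial channels, $\theta = 0.5$) corresponds to DenseNet-121 as described in the paper; it is not given in this notebook and is used here only as an assumed example. The function name is hypothetical.

```python
import math

def channel_trace(block_layers, k, init_channels, theta):
    """Channels after the stem, after each dense block, and after each transition."""
    channels = init_channels
    trace = [channels]
    for i, num_layers in enumerate(block_layers):
        channels += num_layers * k                    # each layer adds k maps
        trace.append(channels)
        if i < len(block_layers) - 1:                 # no transition after the last block
            channels = math.floor(theta * channels)   # compression
            trace.append(channels)
    return trace

trace = channel_trace((6, 12, 24, 16), k=32, init_channels=64, theta=0.5)
print(trace)  # [64, 256, 128, 512, 256, 1024, 512, 1024]
```

With $\theta = 1$ the transition layers would keep all feature maps and the channel count would grow much faster.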

### Comparison of model complexity and accuracy

Lastly, let's look at a figure that compares the different architectures in terms of model complexity and accuracy:

![](https://i.ibb.co/d4GCFbV/comparison-cnn-architectures.png)

# References

- [LeNet-5 paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf), [AlexNet paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), [GoogLeNet/Inception paper](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html), [VGGNet paper](https://arxiv.org/abs/1409.1556), [ResNet paper](https://arxiv.org/abs/1512.03385), [Xception paper](https://arxiv.org/abs/1610.02357), [SENet paper](https://arxiv.org/abs/1709.01507), [DenseNet paper](https://arxiv.org/abs/1608.06993)

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

- [deeplearning.ai CNN course](https://www.youtube.com/watch?v=ArPaAX_PhIs&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF)

- [Stanford CNN lecture](https://youtu.be/DAOcjicFr1Y?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)

- [Medium: CNN](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)

- [Medium: GoogLeNet (Inception v1)](https://medium.com/coinmonks/paper-review-of-googlenet-inception-v1-winner-of-ilsvlc-2014-image-classification-c2b3565a64e7)

- [Towards Data Science: SENet](https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7)

- [Towards Data Science: DenseNet](https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803)

