
# Notable Convnet Architectures - I

The design of modern convnet architectures derives from scientific intuition, mathematical insights, and uncompromising engineering efforts (a.k.a trial and error). However, one should not indulge into clueless engineering exercises just because of easy access to the LEGO playground of TensorFlow or PyTorch like libraries.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_LEGO.png" width=500, height=400>


While this playground is luring, performing uneducated trial and error using TensorFlow or PyTorch like libraries won't get us far towards designing better architectures. At best we will be able to implement an existing architecture, albeit without a scientific understanding. No matter whether we intend to implement simple architectures like LeNet-5 or very deep and complex architectures, our design choices must be guided by scientific and engineering principles. The best way to acquire this acumen is to understand the successful designs as well as poor designs that stalled advancement. 

In the current and next two notebooks we discuss some of the notable convnet architectures that were proposed between 2012 and 2017. All of these architectures were the dominant architectures for some period of time. While the practical significance of some architectures dwindled over time, some of them still holds a shining spot inspiring novel architectures.


## How did we select the notable architectures?

Most of these architectures were submitted by academics and industry leaders to the **ImageNet Large Scale Visual Recognition Challenge (ILSVRC)**. The ILSVRC is an annual computer vision competition developed upon a subset of a publicly available computer vision dataset called ImageNet. Some of the architectures that we discuss were  winners or runners-up in the ImageNet competition. This competetion is regarded as a measure of progress on supervised learning in computer vision since 2010. We study the winning entries as well as other notable architectures to develop an understanding of the scientific and engineering efforts behind the success of deep convnets.


## ImageNet Dataset and Annual Competition

ImageNet is a large dataset of annotated photographs intended for computer vision research. It was developed by a group of researchers led by Fei-Fei Li. The ImageNet challenge employed a subset 1 million RGB images (it's a subset from 14 million images) to train convnets. The size of the images are large with varying dimensions. Usually the images are resized to 256 x 256. There are 1000 distinct categories of objects. Some of the classes are really subtle, e.g., it includes 120 breeds of dog.

There are broadly three "objects classification" tasks given in the ILSVRC competetion:

- Image classification: Predict the classes of objects present in an image.

- Single-object localization: Image classification + draw a bounding box around one example of each object present.

- Object detection: Image classification + draw a bounding box around each object present.

In our discussion we will focus on the architectures that performed well on the **image classification** task. To evaluate the performance of a model, we will use top-5 error rate, which is a measure of the number of test images for which the model’s top five predictions did not include the correct answer. 


In the ImageNet annual competition, the top-5 error rate for image classification fell from over 26% to less than 2.3% in just six years. At the beginning of the ImageNet competition, i.e., during 2010 and 2011, the top-5 classification error was above 26%. The breakthrough moment came in 2012 with the AlexNet architecture, which reduced the error down to 16.4%. Thus our story will begin from 2012. Note that our story ends in 2017. Because the annual ImageNet competition is no longer held after 2017. It was moved to Kaggle.


## A Delayed Triumph of Deep Convnets

- Why did it take so long (1998 - 2012) for the Convnets to take off? 

The LeNet-5 architecture was designed in 1998. It was a 5-layer convnet, whereas AlexNet has 8 layers. Since then, there is a trend to create deep convnets. From 2015 onwards typical layer size increased to 152. Why did it take more than a decade to create the first "deep" convnet?

Four developments contributed to the avalanche that started in 2012.

- Hardware
- Large dataset
- Algorithm
- Engineering tricks

The neural network accelerator technologies from 1990s were not powerful enough to train multi-channel deep convnets. The datasets were small. Algorithimic improvements such as weight initialization techniques, activation functions, SGD optimizers, regularization techniques were late coming. That's why during the era to 1990s and 2000s, machine learning models such as Support Vector Machine outperformed neural networks.

Prior 2012 the main trend in computer vision was to **manually engineer features**. Instead of learning features, they were handcrafted. Machine learning scientists were invested in designing novel feature functions. These functions would create useful representations of the raw data, which then was fed to a classifier (e.g., SVM) for decision making. The algorithmic development for automatically learning representations (features) took off since 2012. Thus, the journey of modern convnets could be considered as the evolution of the algorithmic and engineering efforts for **creating powerful representations**.

It was observed that deeper convnets were able to learn complex representations. They use many convolution layers to learn expressive representations in a hierarchical fashion. Depth was the key. But deep networks that are created by stacking many conv layers make the learning infeasible due to the sheer size of the networks. Thus, much of the effort was invested designing novel techniques for reducing complexity of the networks. While some of these inventions were mathematically derived, most of these are accomplished via engineering tricks.

We will discuss both the science and engineering of designing effective and efficient Convnets.
 

## Notable Architectures

We study the following architectures that are winners or runner-ups in the ImageNet challenge. 
- AlexNet (2012 winner)
- ZFNet (2013 winner)
- VGGNet (2014 runner-up)
- GoogLeNet (Inception-v1) (2014 winner)
- ResNet (2015 winner)
- Ensemble (2016 winner)
- SENet (2017 winner)


<img src="https://cse.unl.edu/~hasan/Pics/CNN_ILSVRC_Winners.png" width=800, height=800>


In addition to this, we present other notable architectures that either inspired some of the above architectures, or their powerful variants.

- Network in Network (NiN)
- Variants of Inception: Inception-v2, v-3 and v-4
- Variants of ResNet: Pre-Activation ResNet, ResNeXt, Wide ResNet
- DenseNet


## This Notebook

In this notebook, we study the following notable convnet architectures.
- AlexNet (2012)
- ZFNet (2013)
- VGGNet (2014)
- Network in Network (NiN) (2013)
- GoogLeNet (Inception-v1) (2014)



## AlexNet (2012)


The AlexNet architecture provided the first empirical evidence that deep convnets can achieve good results. It was slightly deeper than LeNet-5.


AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a top-5 error rate of 16.4%, while the second best achieved only 25.8%! 

http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf



The AlexNet follows the same **design philosophy** of LeNet, which is its architecture can be partitioned into two parts: 
- the first consisting mostly of convolutional and pooling layers and 
- the second consisting of fully-connected layers.


However, it has some significant differences with LeNet. AlexNet is:
- Deeper (8 layers): 5 Conv layers, 2 FC hidden layers and one FC output layer.
- Larger: larger filter size in the first two Conv layers: 11 x 11 and 5 x 5, respectively.
- Stack Conv layers directly on top of one another, instead of stacking a Pool layer on top of each Conv layer.
- Use ReLU as the activation function instead of sigmoidal functions such as logistic sigmoid or tanh.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_AlexNet.png" width=400, height=200>

Let's discuss some aspects of the AlexNet architecture.

### Larger Filter Size in the First Conv Layer

AlexNet uses 11 x 11 filters in the first Conv layer because of the large size of the ImageNet images (largest image is 4288 x 2848, and average image size is  469 x 387). The ImageNet images are resized to 224 x 224, which is significantly larger than the 28 x 28 MNIST images. Thus objects in the ImageNet data spans across more pixels. To match this, a large Conv filter was required.

### Increased Expressive Capacity
AlexNet uses significantly more filters in each layer as compared to LeNet, which increased its expressiveness.

### Ensures Gradient Flow During Training via Backpropagation
Alexnet uses ReLU activation function, which helps the training in deep networks. 


### Combat Overfitting using Novel Regularization Techniques
AlexNet uses two novel regularization techniques for reducing overfitting (model's complexity).
- Dropout layer: AlexNet applied 50% dropout at the two FC hidden layers.

- Image augmentation: Data was augmented by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.


### Improve Generalization through Novel Normalization

AlexNet applies a novel normalization step between the first two Conv layer, right after the ReLU activation. It's called the called **local response normalization (LRN)**. It is used to activate the most competetive neurons. It ensures that the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps. This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization. 


## AlexNet: TensorFlow Implementation

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D, AveragePooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization

In [2]:
# Create an empty sequential model
model = Sequential(name='AlexNet')
 
# 1st Conv Layer: larger size of the filters, many filters, larger stride
model.add(Conv2D(filters=96, kernel_size=(11, 11), strides=4, activation='relu', 
                 input_shape=(227, 227, 3), padding='valid'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))

# 2nd Conv Layer: decreased size of the filters, increased filters, stride 1
# Use SAME padding (p = 2) for consistent height and width across the input and output
model.add(Conv2D(filters=256, kernel_size=(5, 5), strides=1, activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))


# 3rd, 4th, 5th Conv Layers: decreased size of the filters, increased filters, stride 1
# Use SAME padding (p = 2) for consistent height and width across the input and output
# No pooling is done until the 5th Conv layer
model.add(Conv2D(filters=384, kernel_size=(3, 3), strides=1, activation='relu', padding='same'))
model.add(Conv2D(filters=384, kernel_size=(3, 3), strides=1, activation='relu', padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3, 3), strides=1, activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))

# Flatten the convnet output to feed it with fully connected layers
model.add(Flatten())

# FC 1: Fully-Connected Layer with larger number of neurons; add dropout
model.add(Dense(units=4096, activation='relu'))
model.add(Dropout(0.5))
 
# FC 2: Fully-Connected Layer with larger number of neurons; add dropout
model.add(Dense(units=4096, activation='relu'))
model.add(Dropout(0.5))
  
# Output layer with softmax activation
model.add(Dense(units=10, activation='softmax'))
  
# Display the model summary
model.summary()

Model: "AlexNet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 55, 55, 96)        34944     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 27, 27, 96)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 27, 27, 256)       614656    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 256)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 13, 13, 384)       885120    
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 13, 13, 384)       1327488   
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 13, 13, 256)       8849

## AlexNet: Memory & Number of Parameters

Following table shows the total amount of memory required by a single image to perform forward calculation, as well as the total number of parameters (excluding bias).


<img src="https://cse.unl.edu/~hasan/Pics/CNN_AlexNet_Calculation.png" width=800, height=800>


## AlexNet: Why it took so long to appear?
Observe that AlexNet's implementation is similar to that of LeNet except a few extra lines of code. However, it took the researchers to put together the key ideas that made this magic possible.

Two main resons that delayed the invention of AlexNet type architecture are:
- Lack of large dataset
- Lack of efficient computational tools



## AlexNet's Offshoot ZF Net (2013)

Are the hyperparameters (number of filters, filter size, stride, etc.) of AlexNet optimal? Is it possible to tune the hyperparameters further to improve its performance on ImageNet?

This strategy was adopted by Matthew Zeiler and Rob Fergus. They created a variant of AlexNet and named it as ZF Net. 

https://arxiv.org/abs/1311.2901

<img src="https://cse.unl.edu/~hasan/Pics/CNN_ZFNet.png" width=800, height=800>


Notable changes in the ZFNet architecture are:
- Conv 1: change from (11 x 11 stride 4) to (7 x 7 stride 2)
- Conv 3,4,5: instead of 384, 384, 256 filters, it uses 512, 1024, 512 filters


ZFNet won the 2013 ILSVRC challenge by reducing the ImageNet top-5 classification error rate from 16.4% (AlexNet) to 11.7%.



## AlexNet: Limitation in Design Thinking

However, AlexNet-type architecture had a short life. Its main limitation is that it did not provide a **general template** to guide subsequent researchers in designing new networks. We will see that later the trend was to develop "block"-based patterns to design new architectures. It was a shift in design thinking. Instead of approaching the architecture from designing neurons and layers, there was a growing tendency towards abstraction. Thus effort was put to build repeating patterns using blocks of layers. 

## Repeated Block-based Architectures

The year 2014 presented some repeated block-based yet fundamentally different architectures. Two such notable architectures are:

- GoogleNet (won the 2014 ILSVRC)
- VGG (runner-up in the 2014 ILSVRC)

We discuss VGG first as it is closer to AlexNet than the GoogleNet is. Also GoogleNet is based on another architecture called Network in Networks (NiN), which we will present before discussing GoogleNet.



## VGGNet (2014)

VGG Network or VGGNet was developed by Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) research lab at the Oxford University. 
https://arxiv.org/abs/1409.1556

Their main contribution is:

- Created a deeper architecture by extending the classical architecture of LeNet

The depth of the network was increased by adding more Conv layers. To ensure that the number of parameters don't explode, they used small 3 x 3 filters throughout the network. This is a departure from the LeNet and AlexNet architectures. It was a common practice to use larger size filters in the first few layers. VGGNet was radical enough to deploy same-size filters everywhere.



Similar to AlexNet, the architecture of VGGNet can be partitioned into two parts:

- the first consisting mostly of convolutional and pooling layers and
- the second consisting of fully-connected layers.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_VGG.png" width=600, height=400>

But the main difference in its design is that it used repeating blocks of Conv and Pool layers to create the network. The network design follows these two steps.

- Step 1: design the VGGNet block
- Step 2: design the network using VGGNet blocks



### Step 1: VGGNet Block

The VGG block consists of a sequence of Conv layers, followed by a max pooling layer for spatial downsampling. The distinctive property of the Conv layers is that all of them used only small 3 x 3 filters. Unlike AlexNet, instead of using large filters (11 x 11 or 5 x 5) in a Conv layer, the VGG net used a stack of small filters based Conv layers.

To retain the input height and width, VGG net 3 × 3 filters used the SAME padding (i.e., they are padded with p=1). The Conv layer stack is followed by 2 × 2 max pooling with stride of 2, which reduces the spatial dimension (resolution) by half after every block. 



### Step 2: VGGNet using the Blocks 


The VGG net is designed by connecting the VGG blocks sequentially. The fully-connected (FC) part of the VGG network is identical to AlexNet.


## Why VGGNet is Effective

The VGG Net improved the classical Convnet architecture of LeCun (as well as AlexNet) by substantially increasing the depth. Multiple Conv layers increase the depth of the network to **learn more complex features**. For increasing the depth, it used smaller 3 x 3 filters. As discussed earlier, the parameters of two 3 x 3 Conv layers are less than the parameters of one 5 x 5 Conv layer. It also increases the number of nonlinear transformation of the signal to ensure that more complex patterns are learned. Also, a stack of two 3 x 3 Conv (stride 1) layers has **same effective receptive field** as one 5 x 5 conv layer.


## VGG-16: TensorFlow Implementation

We implement the VGG-16 architecture that has 16 total number of Conv and FC layers. A slightly more effective VGG architecture is VGG-19 that has 19 layers of Conv and FC layers.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_VGG_16_19_New.png" width=400, height=300>


VGG-16 architecture consists of 12 Conv layers (distributed in 5 blocks) and 4 FC layers.

- The first three blocks have two Conv layers each.
- The latter tho blocks contain three Conv layers each.

The first block has 64 output channels and each subsequent block doubles the number of output channels, until that number reaches 512. 

First we define a class to implement the VGG block, followed by the VGG-16 network.

In [3]:
'''
Define the VGG block class using Keras' Sequential API 
The VGG_Block class takes two arguments:
- conv_block_number: number of convolutional layers 
- num_of_channels: number of output channels 
'''
class VGG_Block(keras.layers.Layer):
    def __init__(self, conv_block_number, num_of_channels, **kwargs):
        super().__init__(**kwargs)
        self.conv_layers = [] 
        for _ in range(conv_block_number):
            self.conv_layers.append(keras.layers.Conv2D(filters=num_of_channels, kernel_size=(3, 3), strides=1,
                                activation='relu', padding='same'))
        
        self.pool_layer = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid')

    def call(self, inputs):
        Z = inputs
        for layer in self.conv_layers:
            Z = layer(Z)
            
        Z = self.pool_layer(Z)
        return Z


## VGG-16 Network

In [4]:
def vgg_16(conv_blocks, input_shape):
    
    vgg_net = Sequential(name='VGG-16')
    
    vgg_net.add(keras.Input(shape=input_shape))
    
    # Conv part
    for (conv_block_number, num_of_channels) in conv_blocks:
            vgg_net.add(VGG_Block(conv_block_number, num_of_channels))
    
    # Flatten the convnet output to feed it with fully connected layers
    vgg_net.add(Flatten())
    
    # FC part
    vgg_net.add(Dense(units=4096, activation='relu'))
    vgg_net.add(Dropout(0.5))
    vgg_net.add(Dense(units=4096, activation='relu'))
    vgg_net.add(Dropout(0.5))
    vgg_net.add(Dense(units=10, activation='softmax'))
    
    return vgg_net


'''
The layer_info variable consists of a list of tuples (one per block).
Each tuple contains two values: 
- number of convolutional layers
- the number of output channels
These two arguments are used to call the vgg_block function
'''
layer_info = ((2, 64), (2, 128), (2, 256), (3, 512), (3, 512))
vgg_net = vgg_16(layer_info, (224, 224, 3))
vgg_net.summary()

Model: "VGG-16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
vgg__block (VGG_Block)       (None, 112, 112, 64)      38720     
_________________________________________________________________
vgg__block_1 (VGG_Block)     (None, 56, 56, 128)       221440    
_________________________________________________________________
vgg__block_2 (VGG_Block)     (None, 28, 28, 256)       885248    
_________________________________________________________________
vgg__block_3 (VGG_Block)     (None, 14, 14, 512)       5899776   
_________________________________________________________________
vgg__block_4 (VGG_Block)     (None, 7, 7, 512)         7079424   
_________________________________________________________________
flatten_1 (Flatten)          (None, 25088)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 4096)              10276

## VGGNet: Memory & Number of Parameters

Following table shows the total amount of memory required by a single image to perform forward calculation, as well as the total number of parameters (excluding bias) in a VGG-16 network.

Notice the most memory is used in the first two Conv layers (3.2 M + 3.2 M) and most parameters are used by the first FC layer.


The VGGNet architecture is **computationally heavier** than AlexNet.
- Slower in terms of computation: due 138 million parameters, computation is expensive.
- Needs more memory: requires 96 MB of memory for forward calculation of a single image. 





<img src="https://cse.unl.edu/~hasan/Pics/CNN_VGG_Calculation.png" width=800, height=800>




## Why VGGNet is Inefficient

Due to its large number of channels in the Conv layers, VGGNet is not an efficient architecture. Although a single VGG Net achieved 7% top-5 error rate  in the ILSVRC challenge (and 6.8% by using an ensemble of two networks), it's implementation is not feasible on a moderately sized GPU.

## Network in Network (NiN) 2013

The NiN architecture was proposed in 2013, shortly after AlexNet, by Min Lin, Qiang Chen, and Shuicheng Yan from the National University of Singapore. Yet it is fundamentally different from AlexNet.

https://arxiv.org/abs/1312.4400

Consider the architecture of LeNet, alexNet and VGGNet. They perform mainly two taks.
- Use Conv and Pool layer to extract features by exploiting spatial structure of the input map
- Use FC layers to process the representations

The NiN architecture **removes the FC layer from the end of the network entirely**, and used them on the Conv layers. Sounds strange! It employed micro neural networks with complex structures to abstract the data within the receptive field. Let's see how it works. 


Consider the following figure. The left figure represents typical convolution operation in AlexNet or VGGNet. A single feature map for channel $k$ at the location $i, j$ is computed by the colvolution and ReLU activation as follows.

$f_{i, j, k} = max(w_k^{T}x_{i, j}, 0)$


<img src="https://cse.unl.edu/~hasan/Pics/CNN_NiN_1.png" width=800, height=800>

It was hypothesized that representations that achieved good abstraction are generally highly nonlinear functions of the input data. Thus the question was how to achieve more per-pixel nonlinearity?

In NiN, the idea was to scan each pixel location by using a 1 x 1 filter that has narrow enough receiptive field to focus on just one local patch or pixel. The 1 x 1 filter is nothing but a MLP that takes input from a single pixel.

The MLP **scans every pixel along its depth (across channels)**. Thus instead of leveraging the spatial structure of the pixels, it uses the information of a pixel along its depth. This is how it was possible to leverage cross-channel information to achieve per-pixel nonlinearity. 

The MLP is added on top of the Conv layer, as shown in the right figure below. Say that there are $n$ hidden layers in the MLP. Each layer uses ReLU activation. The MLP layer acts independently on each pixel location. 


$f_{i, j, k_{1}}^{1} = max(w_{k_{1}}^{1} x_{i, j} + b_{k_{1}}, 0)$

...

$f_{i, j, k_n}^{n} = max(w_{k_{n}}^{n}f_{i, j}^{n-1} + b_{k_{n}}, 0)$


We may consider each element in the spatial dimension (height and width) as equivalent to an example and a channel as equivalent to a feature. Since the weights across each spatial location are tied, we could think of MLP as a 1 × 1 convolutional layer. Thus we create a micro network with a larger filter followed by 1 x 1 filter. It allows complex and learnable interactions of cross channel information.


## Global Average Pooling Instead of FC Layers


In the VGGNet we have seen that a significantly large number of parameters are used by the FC layers, which makes it prone to overfitting. NiN architecture **avoids the overfitting issue** by replacing the output FC layer by the global average pooling layer.

The last micro network (Mlpconv) layer generates one feature map for each corresponding category of the classification task. Instead of adding FC layers on top of these feature maps, we take the average of each feature map. Then, the resulting vector is fed directly into the Softmax layer.

There are at least three advantages of the global average pooling at the final layers.

- Since the global average pooling layer doesn't have any learnable parameters, it avoids the problem of overfitting.

- The global average pooling layer is more native to the convolution structure by enforcing correspondence between feature maps and categories.

- Finally, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_NiN_3.png" width=800, height=800>

The overall architecture of NiN is shown below.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_NiN_2.png" width=800, height=800>


## NiN Block vs VGGNet Block

It is helpful to understand the main structural differences between VGGNet and NiN, and their blocks, which is illustrated below. The NiN block consists of one convolutional layer followed by two 1 × 1 convolutional layers that act as per-pixel fully-connected layers with ReLU activations. The convolution window shape of the first layer is typically set by the user. The subsequent window shapes are fixed to 1 × 1.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_NiN_Comparison.png" width=900, height=600>


## NiN Block: TensorFlow Implementation

We define a class to implement the NiN block, followed by the NiN network. We use both the sequential API and the functional API of Keras to design two classes for the NiN block.

In [5]:
# Define the NiN block class using Keras' Sequential API 
class NiN_Block_SequentialAPI(keras.layers.Layer):
    def __init__(self, num_of_channels, kernel_size, strides, padding, **kwargs):
        super().__init__(**kwargs)
        self.conv_layers = [keras.layers.Conv2D(filters=num_of_channels, kernel_size=kernel_size, 
                                                        strides=strides, activation='relu', padding=padding),
                           keras.layers.Conv2D(filters=num_of_channels, kernel_size=1, 
                                                        strides=1, activation='relu', padding='valid'),
                           keras.layers.Conv2D(filters=num_of_channels, kernel_size=1, 
                                                        strides=1, activation='relu', padding='valid')] 
    def call(self, inputs):
        Z = inputs
        for layer in self.conv_layers:
            Z = layer(Z)
            
        return Z
    
# Define the NiN block class using Keras' Functional API 
class NiN_Block(keras.layers.Layer):

    def __init__(self, num_of_channels, kernel_size, strides, padding, **kwargs):
        super().__init__(**kwargs)
        self.conv_a = keras.layers.Conv2D(filters=num_of_channels, kernel_size=kernel_size, 
                                                        strides=strides, activation='relu', padding=padding)
        self.conv_b= keras.layers.Conv2D(filters=num_of_channels, kernel_size=1, 
                                                        strides=1, activation='relu', padding='valid')
        self.conv_c = keras.layers.Conv2D(filters=num_of_channels, kernel_size=1, 
                                                        strides=1, activation='relu', padding='valid')

    def call(self, inputs):
        Z = self.conv_a(inputs)
        Z = self.conv_b(Z) 
        Z = self.conv_c(Z) 
        
        return Z

## NiN Network: TensorFlow Implementation

There are some similarities between AlexNet and the NiN models. 

- The NiN architecture uses larger Conv layers with filters 11 x 11, 5 x 5, and 3 x 3.
- The corresponding numbers of output channels are the same as in AlexNet. 
- Each NiN block is followed by a maximum pooling layer with a stride of 2 and a window shape of 3 × 3.


However, unlike AlexNet, NiN does not have any FC layers at the output. Instead, NiN uses a NiN block with a number of output channels equal to the number of output classes, followed by a global average pooling layer, yielding a vector of logits. 

In [6]:
def nin(input_shape, num_of_output_classes):
    
    nin_net = Sequential(name='NiN')
    
    nin_net.add(keras.Input(shape=input_shape))
    
    nin_net.add(NiN_Block(num_of_channels=96, kernel_size=(11, 11), strides=4, padding='valid'))
    nin_net.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))
    
    nin_net.add(NiN_Block(num_of_channels=256, kernel_size=(5, 5), strides=1, padding='same'))
    nin_net.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))
    
    
    nin_net.add(NiN_Block(num_of_channels=384, kernel_size=(3, 3), strides=1, padding='same'))
    nin_net.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='valid'))
    vgg_net.add(Dropout(0.5))
    
    
    # The final NiN block reduces the number of channels to the desired number of outputs
    nin_net.add(NiN_Block(num_of_channels=num_of_output_classes, kernel_size=(3, 3), strides=1, padding='same'))
    
    # The global average pooling layer automatically sets the window shape
    # to the height and width of the input
    nin_net.add(GlobalAveragePooling2D())
    
    
    # Transform the four-dimensional output into two-dimensional output with a shape of (batch size, 10)
    nin_net.add(Flatten())
    
    
    return nin_net



# Create a NiN network and display layer information
nin_net = nin((224, 224, 3), 10)
nin_net.summary()

Model: "NiN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
ni_n__block (NiN_Block)      (None, 54, 54, 96)        53568     
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 26, 26, 96)        0         
_________________________________________________________________
ni_n__block_1 (NiN_Block)    (None, 26, 26, 256)       746240    
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 12, 12, 256)       0         
_________________________________________________________________
ni_n__block_2 (NiN_Block)    (None, 12, 12, 384)       1180800   
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 5, 5, 384)         0         
_________________________________________________________________
ni_n__block_3 (NiN_Block)    (None, 5, 5, 10)          34790   

## NiN Network: Practical Considerations & Design Inspiration

Compared to 138 million parameters of VGG, the NiN network has only 2 million trainable parameters. This is a significant improvement in terms of efficiency. 

NiN is important to the Deep Learning community as an inspiration for designing subsequent convnet architectures by using 1 x 1 filters. It opened a new vein of convolution across the depth (channels).

## GoogLeNet (2014)

Let's see what motivated the invention of GoogLeNet. Consider the following three images. Let's say that we want to recognize the face. But the size of the faces has a large variation in these three images. Then, how do we choose the optimal size of the filter? 

- Larger filter: it would help to capture information that is distributed across the entire image.
- Smaller filter: it would help if the information is locally concentrated.

We may use variously-size filters in the layers of a convnet and build a deeper network. But there will be two issues.

- Deeper networks with large number of parameters are prone to overfitting.
- A naive stack of a large number of Conv layers will be computationally expensive.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Inception_Faces.png" width=900, height=800>


### In Search of an Efficient Solution

One idea is to create a network that is:
- Wider: stack variously-size filters parallely in a single computational block
- Deeper: stack many such blocks

We only need to find an efficient way of implementing this idea. This is exactly what a group of researchers at Google led by Christian Szegedy did. The result was their GoogLeNet architecture!

https://arxiv.org/abs/1409.4842


### What is So Special about the GoogLeNet Architecture?

So far we have seen how the Conv layers and Max pool layers leverage **spatial information** to extract meaningful features. The NiN architecture advanced the feature extraction process by offering a technique to leverage **depth or channel** information. It used 1 x 1 Conv layers for extracting cross-channel information by adding per pixel nonlinearity.


Szegedy's group explored the possibility of combining all of these feature extraction techniques. They combined larger Conv layers and Max Pool layer to scan the input map across spatial dimension, as well as 1 x 1 Conv layers to scan along the channel direction. A single input map is scanned by variously-size filters and Max Pool filter in parallel. Then, the output obtained from the parallel scanning paths are concatenated to create a single output. This block containing parallel combination of filters was named as the **inception** block. These blocks were stacked to create a new architecture, which was named as the GoogLeNet.

Szegedy et al. argued the need to design deeper networks for creating more expressive features. But the deeper the network gets, the more expensive it becomes both in terms of the space (number of parameters) and time (number of floating-point operations or FLOPS). Szegedy's group proposed an ingenuous solution to keep the number of GoogLeNet parameters was significantly smaller as well as reduce the FLOPS substantially. For example, AlexNet required 60 million parameters, whereas GoogLeNet used 6 million parameters.  

The sheer depth and extracting more expressive features led GoogLeNet to win the ILSVRC 2014 challenge by reducing the top-5 error rate down to 6.7%.

First, we discuss the inception block. Our goal is address the following two aspects of GoogLeNet.
- Extratcting Expressive Features
- Reducing the Memory and FLOPS

### Inception Block: Conv layer on steroids

The name of the Inception block in GoogLeNet is inspired by a quote from Christopher Nolan's 2010 movie "Inception".

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Inception_Movie.png" width=500, height=300>

Let's discuss the architecture of the inception block, as shown below. We use a special notation for the presentation. For example, the notation "3 × 3 + 1(S)" indicates that the layer uses a 3 × 3 filter, stride 1, and "same" padding. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Inception_Original_1.png" width=600, height=400>

At first glance the inception architecture is not an easy grasp. So, let's break it down into a simpler architecture. Remember the goal of the inception block is to combine filters of various size and max pooling operation. Thus consider a naive implemention of the inception block, shown below.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Naive_Inception.png" width=600, height=400>

This inception block copies the input signal and sends it through four different paths. The four paths use variously-size filters for extracting information from different spatial sizes.

- Path 1: A Conv layer with a 1 x 1 filter.

- Path 2: A Conv layer with a 3 x 3 filter.

- Path 3: A Conv layer with a 5 x 5 filter.

- Path 4: A Max Pool layer. 

Let's try to understand the operation of this naive inception block by using a simple example. Say that the size of the input map is 28 x 28 x 256, i.e., there are 256 input channels. Also, let's say the we have following number of filters on the first 3 paths:

- Path 1: 128 1 x 1 filters.

- Path 2: 192 3 x 3 filters.

- Path 3: 96 5 x 5 filters.

We want to compute the output size. Following are the sizes of the output at the end of each path:

- Path 1: 28 x 28 x 128

- Path 2: 28 x 28 x 192

- Path 3: 28 x 28 x 96

- Path 4: 28 x 28 x 256

Since we used stride 1 and SAME padding, the input height and width does not change at the output. Only the number of output channels change.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Naive_Inception_InputOutput.png" width=600, height=400>

The size of the concatenated output is 28 x 28 x (128 + 192 + 96 + 256) = 28 x 28 x 672. I.e., the block outputs 672 filters.

This naive inception block did everything that we initilly planned, i.e., scan the input with variously-size filters. Thus it should be able to extract more expressive features. But the main problem with this block is that it is computationally very expensive. Let's compute the number of FLOPS using this block. We show the calculation of FLOPS for the first 3 paths. The last path doesn't do any floating-point multiplications.

- Path 1: 28 x 28 x 128 x 1 x 1 x 256
- Path 2: 28 x 28 x 192 x 3 x 3 x 256
- Path 3: 28 x 28 x 96 x 5 x 5 x 256

        -- Total FLOPS by the block is 854 million!


<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Naive_Inception_Calculation.png" width=600, height=400>


### Improve Computational Efficiency of the Inception Block

The main contributing factor to the increased cost of computation is the large number of filters. When multiple inception blocks are stacked in the GoogLeNet, which we will see soon, these blocks are joined with Max Pool layers. Thus, the number of output channels doesn't reduce.

Google researchers resolved this issue by using an idea from the NiN architecture. They used 1 x 1 Conv layers to reduce the number of channels. This layer creates fewer feature maps than their inputs, as shown below. So, it serves as a **bottleneck layer**. It means that it reduces dimensionality, which cuts the computational cost and the number of parameters, speeding up training and improving generalization.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Bottleneck.png" width=600, height=400>

In the GoogLeNet inception block, three bottleneck layers (1 x 1 Conv layer) are inserted to reduce the dimension of the filters, as shown below.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Inception_Original.png" width=600, height=400>

The four paths of this inception block are

- Path 1: A Conv layer with a 1 x 1 filter.

- Path 2: A bottleneck layer followed by a Conv layer with 3 x 3 filter.

- Path 3: A bottleneck layer followed by a Conv layer with 5 x 5 filter.

- Path 4: A Max Pool layer followed by a bottleneck layer. 


The 2nd and 3rd paths use a pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]). This acts like a single powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.


Every single layer uses a stride of 1 and "same" padding (even the max pooling layer), such that the input height and width are maintained at the output. Only the number of channels (feature maps) would vary at the output of four paths.

As a result, it is possible to stack the feature maps from all four top Conv layers by concatenating all the outputs along the depth dimension in the final depth concatenation layer. The combination of the filters make it possible to explore the input maps in varying ranges. That's why details at different extents can be recognized efficiently by different filters. Also different amounts of parameters (filters) can be allocated for different ranges. 

- But how does this inception block reduce the complexity as compared to the naive design?


We show the calculation of FLOPS for the paths, as illustrated in the following figure. 

- Path 1: 
28 x 28 x 128 x 1 x 1 x 256
- Path 2: 28 x 28 x 192 x 3 x 3 x 256
28 x 28 x 64 x 1 x 1 x 256
28 x 28 x 192 x 3 x 3 x 64

- Path 3: 28 x 28 x 96 x 5 x 5 x 256
28 x 28 x 64 x 1 x 1 x 256
28 x 28 x 96 x 5 x 5 x 64

- Path 4:
28 x 28 x 64 x 1 x 1 x 256



        -- Total FLOPS by the block is 358 million, which is significantly smaller than 854 million FLOPS in the naive design.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_Inception_Original_Calculation.png" width=800, height=600>


### Hyperparameter
The hyperparameter in the inception block that could be tuned is the number of output channels per layer. So we have **6 more hyperparameters** to tweak for every inception layer we add.


















## Inception Block: TensorFlow Inplementation

In [7]:
# Define the Inception block class using Keras' Functional API 
class Inception_Block(keras.layers.Layer):

    def __init__(self, ch1, ch2, ch3, ch4, **kwargs):
        super().__init__(**kwargs)
        
        # Path 1 is a single 1 x 1 Conv layer
        self.p1_1 = keras.layers.Conv2D(filters=ch1, kernel_size=(1, 1), 
                                                        strides=1, activation='relu', padding='valid')
        
        # Path 2 is a 1 x 1 convolutional layer followed by a 3 x 3 Conv layer
        self.p2_1 = keras.layers.Conv2D(filters=ch2[0], kernel_size=(1, 1), 
                                                        strides=1, activation='relu', padding='valid')
        self.p2_2 = keras.layers.Conv2D(filters=ch2[1], kernel_size=(3, 3), 
                                                        strides=1, activation='relu', padding='same')
        
        # Path 3 is a 1 x 1 convolutional layer followed by a 5 x 5 Conv layer
        self.p3_1 = keras.layers.Conv2D(filters=ch3[0], kernel_size=(1, 1), 
                                                        strides=1, activation='relu', padding='valid')
        self.p3_2 = keras.layers.Conv2D(filters=ch3[1], kernel_size=(5, 5), 
                                                        strides=1, activation='relu', padding='same')
        
        # Path 4 is a 3 x 3 maximum pooling layer followed by a 1 x 1 Conv layer
        self.p4_1 = keras.layers.MaxPooling2D(pool_size=(3, 3), strides=1, padding='same')
        self.p4_2 = keras.layers.Conv2D(filters=ch4, kernel_size=(1, 1), 
                                                        strides=1, activation='relu', padding='valid')
        

    def call(self, inputs):  
        p1 = self.p1_1(inputs)
        p2 = self.p2_2(self.p2_1(inputs))
        p3 = self.p3_2(self.p3_1(inputs))
        p4 = self.p4_2(self.p4_1(inputs))
        
        # Concatenate the outputs on the channel dimension
        # We use tf.concat() operation, with axis=3 (this axis is the depth)
        concat = tf.keras.layers.Concatenate(axis=3)([p1, p2, p3, p4])
        return concat

## GoogLeNet Model Architecture


The GoogLeNet uses a stack of a total of 9 inception blocks, which are followed by a global average pooling layer to generate its estimates. Similar to NiN model, the global average pooling layer in GoogLeNet is used to avoid a stack of FC output layers. Max pooling is used after a few inception blocks to reduce dimensionality.

In the following illustration, which is a simplified version of the original GoogLeNet architecture, the number of feature maps output by each Conv layer and each Pool layer is shown before the kernel size. The 9 inception blocks are represented by the boxes with the spinning tops. The 6 numbers in the inception blocks represent the number of feature maps output by each Conv layer in the module. All the Conv layers use the ReLU activation function.

The total number of layers with learnable parameters is 22. Each inception block contains two layers, plus three Conv layers at the beginning and a FC layer at the output.

<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet.png" width=800, height=600>

Let’s go through this network:

- The first two layers divide the image’s height and width by 4 (so its area is divided by 16), to reduce the computational load. 

- The first layer uses a large kernel size so that much of the information is preserved.

- Then the local response normalization layer ensures that the previous layers learn a wide variety of features.

- Two Conv layers follow, where the first acts like a bottleneck layer. Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.

- Next, a max pooling layer reduces the image height and width by 2, again to speed up computations.

- Then comes the tall stack of nine inception modules, interleaved with a couple max pooling layers to reduce dimensionality and speed up the net.

        -- The ratio of the number of channels assigned in the inception block is obtained through a large number of experiments on the ImageNet dataset.

- Next, the global average pooling layer outputs the mean of each feature map: this drops any remaining spatial information, which is fine because there was not much spatial information left at that point. 

- Indeed, GoogLeNet input images are typically expected to be 224 × 224 pixels, so after 5 max pooling layers, each dividing the height and width by 2, the feature maps are down to 7 × 7. Moreover, it is a classification task, not localization, so it does not matter where the object is. 

- Due to the dimensionality reduction brought by this layer, there is no need to have several FC layers at the top of the convnet (like in AlexNet), and this considerably reduces the number of parameters in the network and limits the risk of overfitting.

- The last layers are self-explanatory: dropout for regularization, then a fully connected layer with 1,000 units (since there are 1,000 classes) and a softmax activation function to output estimated class probabilities.

## The Original GoogLeNet Architecture

The original GoogLeNet architecture included two auxiliary classifiers plugged on top of the third and sixth inception modules. They were both composed of one average pooling layer, one convolutional layer, two fully connected layers, and a softmax activation layer. During training, their loss (scaled down by 70%) was added to the overall loss. The goal was to fight the vanishing gradients problem and regularize the network. However, it was later shown (in Inception-v3) that their effect was relatively minor.


<img src="https://cse.unl.edu/~hasan/Pics/CNN_GoogleNet_2.png" width=1000, height=600>

In [8]:
def googLeNet(input_shape, num_of_output_classes):
    
    
    b1 = Sequential()    
    b1.add(Conv2D(filters=64, kernel_size=(7, 7), strides=2, activation='relu', padding='same'))
    b1.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='same'))
    
    
    b2 = Sequential()    
    b2.add(Conv2D(filters=64, kernel_size=(1, 1), strides=1, activation='relu', padding='valid'))
    b2.add(Conv2D(filters=192, kernel_size=(3, 3), strides=1, activation='relu', padding='same'))
    b2.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='same'))   
    
    b3 = Sequential()
    b3.add(Inception_Block(64, (96, 128), (16, 32), 32))
    b3.add(Inception_Block(128, (128, 192), (32, 96), 64))
    b3.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='same'))
    
    b4 = Sequential()
    b4.add(Inception_Block(192, (96, 208), (16, 48), 64))
    b4.add(Inception_Block(160, (112, 224), (24, 64), 64))
    b4.add(Inception_Block(128, (128, 256), (24, 64), 64))
    b4.add(Inception_Block(112, (144, 288), (32, 64), 64))
    b4.add(Inception_Block(256, (160, 320), (32, 128), 128))
    b4.add(MaxPooling2D(pool_size=(3, 3), strides=2, padding='same'))
    
    b5 = Sequential()
    b5.add(Inception_Block(256, (160, 320), (32, 128), 128))
    b5.add(Inception_Block(384, (192, 384), (48, 128), 128))
    b5.add(GlobalAveragePooling2D())
    
    google_net = Sequential(name='GoogLeNet')
    google_net.add(keras.Input(shape=input_shape))
    google_net.add(b1)
    google_net.add(b2)
    google_net.add(b3)
    google_net.add(b4)
    google_net.add(b5)
    google_net.add(Dense(units=num_of_output_classes, activation='softmax'))
    
    return google_net

# Create a GoogLeNet network and display layer information
google_net = googLeNet((96, 96, 3), 10)
google_net.summary()

Model: "GoogLeNet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential (Sequential)      (None, 24, 24, 64)        9472      
_________________________________________________________________
sequential_1 (Sequential)    (None, 12, 12, 192)       114944    
_________________________________________________________________
sequential_2 (Sequential)    (None, 6, 6, 480)         552432    
_________________________________________________________________
sequential_3 (Sequential)    (None, 3, 3, 832)         2809168   
_________________________________________________________________
sequential_4 (Sequential)    (None, 1024)              2487536   
_________________________________________________________________
dense_6 (Dense)              (None, 10)                10250     
Total params: 5,983,802
Trainable params: 5,983,802
Non-trainable params: 0
_______________________________________________

## GoogLeNet: Limitations

The Inception block-based GoogLeNet architecture is complex and requires careful engineering. To boost up its performance (effectiveness and efficiency), Google researchers employed a lot of tricks. The complexity of its Inception architecture makes it more difficult to make changes to the GoogLeNet architecture. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost. The original paper does not clearly describe the contributing factors that lead to the various design decisions of the GoogLeNet architecture. For example, if we want to increase the capacity of the Inception blocks, we cannot just increase the number of filter bank sizes. Because it will lead to a significant increase in both computational cost and number of parameters. This makes it much harder to adapt it to new use-cases while maintaining its efficiency. 

## Inception-vXXX

To overcome the scaling problem of the GoogLeNet architecture, which is dubbed as Inception-v1, Google researchers kept on modifying the Inception architecture.

The popular versions of the Inception-based networks are as follows.
- Inception-v1 (GoogLeNet)
- Inception-v2 and Inception-v3 (2015): presented in the same paper
- Inception-v4 and Inception-ResNet (2016): presented in the same paper

Each version is an iterative improvement over the previous one. Below we discuss the key changes made in Inception-v2 and v3. The Inception-v4 and Inception-ResNet improvements will be discussed later, after we present the ResNet architecture.

### What were the key design considerations for imroving the Inception-based networks?

The computational gains of the GoogLeNet architecture is mostly attributed to dimension reduction of the filters. This can be viewed as a special case of **factorizing convolutions in a computationally efficient manner**. For example, consider the case of a 1 x 1 Conv layer followed by a 3 x 3 Conv layer. In a vision network, it is expected that the outputs of nearby activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

The next logical step would be to explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. The benefit of suitable factorization is that it would lead to disentangled parameters and therefore with faster training. This gain in compute and memory can be leveraged to increase the filter-bank sizes of the network.


The original Inception block was modified in the following three manners for scaling up convolution networks in efficient ways, as shown in the figure below. 



- Inception Block a: For improving computational speed, 5 x 5 convolutions were factored into two 3 x 3 convolution operations. Convolutions with larger spatial filters (e.g. 5 × 5 or 7 × 7) tend to be disproportionally expensive in terms of computation. 

- Inception Block b: A cheaper factorization technique was used. For example, a 3 x 3 convolution is equivalent to first performing a 1 x 3 convolution, and then performing a 3 x 1 convolution on its output. This method is 33% more cheaper than the single 3 x 3 convolution. Thus, n x n convolutions were factored to a combination of 1 x n and n x 1 convolutions. 

- Inception Block c: For improving the representational bottleneck, the Inception block was made wider instead of deeper by expanding the filter banks. Reducing the dimensions too much may cause loss of information, known as a "representational bottleneck". The dimensionality of the maps provides a rough estimate of information content. Although information content cannot be assessed merely by the dimensionality of the representation as it discards important factors like correlation structure. If the module was made deeper instead, there would be excessive reduction in dimensions, and hence loss of information. It was argued that we should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. 


<img src="https://cse.unl.edu/~hasan/Pics/CNN_InceptionBlocks_v1_2_3.png" width=1000, height=600>


## Inception-v2 (2015)

The above three Inception blocks were used to design the Inception-v2 architecture, as shown below.

https://arxiv.org/pdf/1512.00567v3.pdf

The output size of each module is the input size of the next one. Variations in reduction technique were used to reduce the grid sizes between the Inception blocks whenever applicable. Some convolutions are marked with 0-padding, which is used to maintain the grid size. 0-padding is also used inside those Inception blocks that do not reduce the grid size. All other layers do not use padding. 

<img src="https://cse.unl.edu/~hasan/Pics/CNN_Inceptionv2_3_Architecture.png" width=600, height=500>


## Inception-v3 (2015)

The GoogLeNet introduced the notion of auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combating the vanishing gradient problem in very deep networks. However, it was found that auxiliary classifiers did not result in improved convergence early in the training. The training progression of network with and without side head looks virtually identical before both models reach high accuracy.

It was argued that the auxilary classifiers functioned as regularizes, especially if they had BatchNorm or Dropout operations. Thus, in Inception-v3, batch normalization was added in the auxillary Cclassifiers.

https://arxiv.org/pdf/1602.07261.pdf

In addition to this, following changes were incorporated in the Inception-v3 design:
- All of the above upgrades described for Inception-v2
- RMSProp Optimizer
- Factorized 7 x 7 convolutions
- Label Smoothing: It's a type of regularizing component added to the loss formula that prevents the network from becoming too confident about a class. It prevents overfitting.

