## Inception

The Inception V1 architecture was an important milestone in the development of CNN classifiers. Before it, the common way to improve deep neural networks was to stack more layers and hope for better performance. The main hallmark of this architecture is its improved utilization of the computing resources inside the network, achieved through a carefully crafted design that allows the depth and width of the network to increase while keeping the computational budget constant. Details can be found at https://arxiv.org/abs/1409.4842v1


One of the most straightforward ways of improving the performance of deep neural networks is to increase their size. This includes increasing both the depth (the number of levels of the network) and the width (the number of units at each level). However, blindly increasing the depth and width often leads to overfitting because of the growth in the number of parameters. It also increases the computational resources needed to train and deploy the network.
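
To make the cost of scaling concrete, here is a minimal sketch that counts the weights of a plain stack of convolutions. The helper names `conv_params` and `stack_params` and the chosen widths and depths are illustrative assumptions, not taken from the paper. It counts only bias-free convolution weights.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a bias-free 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size


def stack_params(width: int, depth: int, kernel_size: int = 3) -> int:
    """Total weights of `depth` chained convolutions, each mapping width -> width."""
    return depth * conv_params(width, width, kernel_size)


# Hypothetical baseline: 4 layers of 64 channels with 3x3 kernels.
base = stack_params(width=64, depth=4)
# Doubling the width doubles both input and output channels of every layer.
wider = stack_params(width=128, depth=4)
# Doubling the depth only doubles the number of layers.
deeper = stack_params(width=64, depth=8)

print(f"base:   {base}")
print(f"wider:  {wider} ({wider / base:.1f}x)")
print(f"deeper: {deeper} ({deeper / base:.1f}x)")
```

Under these assumptions, doubling the depth doubles the parameter count, while doubling the width quadruples it, since each layer's weights scale with the product of its input and output channels.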

The Inception network was heavily engineered and used many tricks to push performance, in terms of both speed and accuracy. It has gone through several stages of improvement: Inception_v1, Inception_v2, Inception_v3, Inception_ResNet, etc.

Here we will consider the main idea behind Inception_V1, the first Inception architecture that started it all. 

Prior deep learning architectures typically stacked convolutional filters sequentially, i.e., each layer applies a set of convolutional filters of the same size and passes its output on to the next layer. The kernel size of the filters at each layer depended on the architecture.

However, how do we know that we have chosen the right kernel size at each layer?
Intuitively, the salient information in an image can vary greatly in how much of the image it occupies. A larger kernel is preferred for information that is distributed more globally, and a smaller kernel for information that is distributed more locally. When we are forced to choose a single kernel size, the resulting architecture may not be optimal. This is the problem that Inception v1 tried to solve.

This was done using the Inception module. The idea is to have wider layers that allow multiple filter sizes at the same level.


The naive implementation of the Inception module performs convolutions on the input using 3 different kernel sizes (1x1, 3x3, 5x5). Additionally, max pooling is also performed. The outputs are concatenated, and sent into the next inception module. 

In [1]:
import torch
from torch import nn

In [2]:
class NaiveInceptionModule(nn.Module):
    def __init__(self, in_channels, num_features=64):
        super().__init__()
        # 1x1 branch
        self.branch1x1 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, num_features, kernel_size=1, bias=False),
                        nn.BatchNorm2d(num_features, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # 3x3 branch
        self.branch3x3 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, num_features, kernel_size=3, padding=1, bias=False),
                        nn.BatchNorm2d(num_features, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # 5x5 branch
        self.branch5x5 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, num_features, kernel_size=5, padding=2, bias=False),
                        nn.BatchNorm2d(num_features, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # Pooling
        self.pool = torch.nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        
    def forward(self, x):
        conv1x1 = self.branch1x1(x)
        conv3x3 = self.branch3x3(x)
        conv5x5 = self.branch5x5(x)
        pool_out = self.pool(x)
        out = torch.cat([conv1x1, conv3x3, conv5x5, pool_out], 1)
        return out
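
The concatenation along the channel dimension only works if every branch produces the same spatial size. The padding values in the module above (1 for the 3x3 convolution and the pooling, 2 for the 5x5 convolution) are what make that happen. The sketch below checks this with the standard output-size formula for stride-1, undilated windows. The helper name `conv_output_size` is a hypothetical one introduced for illustration.

```python
def conv_output_size(size: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Spatial output size of a convolution or pooling window (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


input_size = 28

# (kernel_size, padding) for each branch of the naive module.
branches = {
    "1x1 conv": (1, 0),
    "3x3 conv": (3, 1),
    "5x5 conv": (5, 2),
    "3x3 max pool": (3, 1),
}

sizes = {name: conv_output_size(input_size, k, p) for name, (k, p) in branches.items()}
for name, size in sizes.items():
    print(f"{name}: {input_size} -> {size}")

# Without padding, the 5x5 branch would shrink the feature map,
# and its output could no longer be concatenated with the others.
unpadded_5x5 = conv_output_size(input_size, kernel_size=5, padding=0)
print(f"5x5 conv without padding: {input_size} -> {unpadded_5x5}")
```

With stride 1, a padding of `(kernel_size - 1) // 2` preserves the spatial size for odd kernel sizes, which is the choice made in each branch.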
        

This naive Inception block has a major flaw. Even a modest number of 5x5 filters is expensive when applied to an input with many channels, because the parameter count of a convolution grows with the product of input channels, output channels, and kernel area.
Adding the pooling branch makes matters worse. Pooling leaves the number of channels unchanged, so its output has as many channels as the previous stage. Concatenating the output of the pooling layer with the outputs of the convolutional layers therefore inevitably increases the number of output features from one module to the next.

In [3]:
naive_inception_module = NaiveInceptionModule(in_channels=256)

x = torch.rand((1, 256, 28, 28))
out = naive_inception_module(x)

# Note that the output has more channels than the input (448 vs. 256).
assert out.shape == torch.Size([1, 64+64+64+256, 28, 28])

num_params = sum(p.numel() for p in naive_inception_module.parameters() if p.requires_grad)
print(f"Number of parameters: {num_params}")

Number of parameters: 573824
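
To see where these parameters come from, the count can be reproduced by hand, branch by branch. This is a minimal sketch with hypothetical helper names (`conv_params`, `bn_params`). It assumes bias-free convolutions followed by batch normalization with a learnable scale and shift per channel, as in the module above.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a bias-free 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size


def bn_params(num_channels: int) -> int:
    """Learnable scale and shift of batch norm (running statistics are buffers, not parameters)."""
    return 2 * num_channels


in_channels, num_features = 256, 64

# One conv + batch norm per kernel size; the pooling branch has no parameters.
breakdown = {
    f"{k}x{k}": conv_params(in_channels, num_features, k) + bn_params(num_features)
    for k in (1, 3, 5)
}
total = sum(breakdown.values())

for name, count in breakdown.items():
    print(f"{name} branch: {count} ({count / total:.1%})")
print(f"total: {total}")

# The pooling branch passes all input channels through unchanged.
out_channels = 3 * num_features + in_channels
print(f"output channels: {out_channels}")
```

The total matches the count printed above, and the 5x5 branch alone accounts for roughly 71% of it.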


How can we fix this problem?
We add more 1x1 convolutional layers!!

We use extra 1x1 convolutional layers to reduce the number of input channels before the 3x3 and 5x5 filters. This may seem counterintuitive, but 1x1 convolutions are much cheaper than 3x3 and 5x5 convolutions, and reducing the number of input channels drastically reduces the parameter count of the 3x3 and 5x5 convolutions that follow. A 1x1 convolution is also applied after pooling, which reduces the number of channels the pooling branch contributes to the output.
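
A quick back-of-the-envelope comparison shows why the extra layer pays off. The sketch below uses the channel counts that appear as defaults in the module defined next (256 input channels, 96 and 16 reduction channels, 128 and 32 output channels) and counts only convolution weights, ignoring batch norm. The helper name `conv_params` is a hypothetical one.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a bias-free 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size


in_channels = 256

# 3x3 branch: 256 -> 128 directly, versus 256 -> 96 (1x1) -> 128 (3x3).
direct_3x3 = conv_params(in_channels, 128, 3)
reduced_3x3 = conv_params(in_channels, 96, 1) + conv_params(96, 128, 3)

# 5x5 branch: 256 -> 32 directly, versus 256 -> 16 (1x1) -> 32 (5x5).
direct_5x5 = conv_params(in_channels, 32, 5)
reduced_5x5 = conv_params(in_channels, 16, 1) + conv_params(16, 32, 5)

print(f"3x3 branch: {direct_3x3} -> {reduced_3x3} ({direct_3x3 / reduced_3x3:.1f}x fewer)")
print(f"5x5 branch: {direct_5x5} -> {reduced_5x5} ({direct_5x5 / reduced_5x5:.1f}x fewer)")
```

The saving is largest for the 5x5 branch, where the reduction to 16 channels shrinks the expensive convolution by far more than the added 1x1 layer costs.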

In [4]:
class Inceptionv1Module(nn.Module):
    def __init__(self, in_channels, num_1x1=64, 
                 reduce_3x3=96, num_3x3=128, 
                 reduce_5x5=16, num_5x5=32,
                 pool_proj=32):
        super().__init__()
        # 1x1 branch
        self.branch1x1 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, num_1x1, kernel_size=1, bias=False),
                        nn.BatchNorm2d(num_1x1, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # 3x3 branch
        # 1x1 conv
        self.branch3x3_1 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, reduce_3x3, kernel_size=1, bias=False),
                        nn.BatchNorm2d(reduce_3x3, eps=0.001),
                        nn.ReLU(inplace=True))
        # 3x3 conv
        self.branch3x3_2 = torch.nn.Sequential(
                        nn.Conv2d(reduce_3x3, num_3x3, kernel_size=3, padding=1, bias=False),
                        nn.BatchNorm2d(num_3x3, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # 5x5 branch
        # 1x1 conv
        self.branch5x5_1 = torch.nn.Sequential(
                        nn.Conv2d(in_channels, reduce_5x5, kernel_size=1, bias=False),
                        nn.BatchNorm2d(reduce_5x5, eps=0.001),
                        nn.ReLU(inplace=True))
        # 5x5 conv
        self.branch5x5_2 = torch.nn.Sequential(
                        nn.Conv2d(reduce_5x5, num_5x5, kernel_size=5, padding=2, bias=False),
                        nn.BatchNorm2d(num_5x5, eps=0.001),
                        nn.ReLU(inplace=True))
        
        # Pooling
        self.pool = torch.nn.Sequential(
                        torch.nn.MaxPool2d(kernel_size=3, stride=1, padding=1), # Pool
                        nn.Conv2d(in_channels, pool_proj, kernel_size=1, bias=False),
                        nn.BatchNorm2d(pool_proj, eps=0.001),
                        nn.ReLU(inplace=True))
                        
        
    def forward(self, x):
        conv1x1 = self.branch1x1(x)
        conv3x3 = self.branch3x3_2(self.branch3x3_1(x))
        conv5x5 = self.branch5x5_2(self.branch5x5_1(x))
        pool_out = self.pool(x)
        out = torch.cat([conv1x1, conv3x3, conv5x5, pool_out], 1)
        return out

In [5]:
inception_v1_module = Inceptionv1Module(in_channels=256)

x = torch.rand((1, 256, 28, 28))
out = inception_v1_module(x)

assert out.shape == torch.Size([1, 64+128+32+32, 28, 28])

num_params = sum(p.numel() for p in inception_v1_module.parameters() if p.requires_grad)
# Notice how we have drastically reduced the number of parameters.
print(f"Number of parameters: {num_params}")



Number of parameters: 177376


Using the dimension-reduced Inception module, a neural network architecture was built, popularly known as GoogLeNet.

GoogLeNet has 9 such Inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers). It uses global average pooling after the last Inception module. With such a deep network, there is always the problem of vanishing gradients. To prevent the middle part of the network from “dying out”, the paper introduced two auxiliary classifiers. Each is attached to the output of one of the intermediate Inception modules, produces softmax class predictions, and contributes an auxiliary loss computed against the ground truth. The total loss is a weighted sum of the auxiliary losses and the real loss.
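
The weighted sum can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code. The random tensors stand in for the class scores of the main classifier and the two auxiliary heads, `total_loss` is a hypothetical helper, and the weight of 0.3 is the discount the paper reports for the auxiliary losses. `F.cross_entropy` applies log-softmax internally, so it takes raw scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_classes = 4, 1000

# Stand-ins for the raw class scores (logits) of the three classifiers.
main_logits = torch.randn(batch_size, num_classes, requires_grad=True)
aux1_logits = torch.randn(batch_size, num_classes, requires_grad=True)
aux2_logits = torch.randn(batch_size, num_classes, requires_grad=True)
targets = torch.randint(0, num_classes, (batch_size,))

AUX_WEIGHT = 0.3  # assumption: discount weight for auxiliary losses, as reported in the paper


def total_loss(main_logits, aux_logits_list, targets, aux_weight=AUX_WEIGHT):
    """Main cross-entropy loss plus down-weighted auxiliary cross-entropy losses."""
    main = F.cross_entropy(main_logits, targets)
    aux = sum(F.cross_entropy(aux_logits, targets) for aux_logits in aux_logits_list)
    return main + aux_weight * aux


loss = total_loss(main_logits, [aux1_logits, aux2_logits], targets)
loss.backward()  # gradients flow to all three sets of scores

print(f"total loss: {loss.item():.4f}")
```

In a real network the auxiliary heads inject gradient directly into the middle layers during training. At inference time they are discarded and only the main classifier is used.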