Link to paper: [Going deeper with convolutions/Inception/GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf)

Andrew Ng's video: [Inception Network Motivation](https://www.youtube.com/watch?v=C86ZXvgpejM)


GoogLeNet is one incarnation of the Inception architecture. It won ILSVRC 2014, with VGGNet slightly behind. Most of the authors were from Google, which may be why the design emphasizes low computational cost without compromising quality.

The Inception architecture answers two questions:

### 1. Which size convolutional kernels are best?

There has been a lot of experimentation with kernel size. Popular networks have used kernels ranging from 11x11 down to 1x1. Kernels of different sizes learn different features from the input image or input activation maps. Instead of committing to one kernel size, the Inception network applies a combination of kernel sizes in parallel and lets the network learn which ones are useful. Using such a combination improves our chance of finding good weights for tasks such as classification and recognition.

![Combination of Filters](assets/combination_of_filters.png)

The idea here is to use a combination of 1x1, 3x3 and 5x5 filters. At that time, the most popular and successful networks used max-pooling layers, and it was believed that max-pooling was necessary to do well on ILSVRC-type problems. [Striving for Simplicity: The All Convolutional Net](https://arxiv.org/abs/1412.6806) later showed that **max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy.** Since the original architecture uses a max-pooling layer, we follow the same convention.

The output activation map is the concatenation of the outputs of the 1x1, 3x3 and 5x5 filters and a max-pool branch. **Same** padding is used so that height and width stay the same after every filter. If we apply 64 1x1 filters to a 28x28x192 input, we get an output activation map of size 28x28x64 (shown in green); if we apply 128 3x3 filters, we get an output of size 28x28x128. Stacking the outputs of all four branches along the channel dimension gives an activation map of size 28x28x256, which is passed to the next layer.
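
The concatenation can be checked with a few lines of PyTorch. This is only a minimal sketch of the naive block, not the final implementation. The 32 filters for the 5x5 branch come from the cost calculation in the next section; the 1x1 convolution with 32 filters after max-pooling is an assumption, chosen so that the channels sum to 256.

```python
import torch
import torch.nn as nn

# Input activation map: batch of 1, 192 channels, 28x28 spatial size.
x = torch.randn(1, 192, 28, 28)

# "Same" padding keeps height and width at 28 in every branch.
branch_1x1 = nn.Conv2d(192, 64, kernel_size=1)
branch_3x3 = nn.Conv2d(192, 128, kernel_size=3, padding=1)
branch_5x5 = nn.Conv2d(192, 32, kernel_size=5, padding=2)
# Assumption: max-pool followed by a 1x1 convolution that outputs 32 channels.
branch_pool = nn.Sequential(
    nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
    nn.Conv2d(192, 32, kernel_size=1),
)

branches = (branch_1x1, branch_3x3, branch_5x5, branch_pool)
outputs = [branch(x) for branch in branches]

# Concatenate along the channel dimension: 64 + 128 + 32 + 32 = 256.
y = torch.cat(outputs, dim=1)
print(y.shape)
```

Every branch keeps the 28x28 spatial size, so the only dimension that grows is the channel dimension.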

**The answer to which kernel size is best: use a combination of kernels and let the optimization algorithm learn what works best.**

### 2. How will you make your network less resource hungry?
One big problem with such a combination of kernels is that it is very resource hungry. In the figure above, the 5x5 convolution produces an output of size 28x28x32, so 32 kernels of size 5x5x192 are used. The total number of multiplications required is 28x28x32x5x5x192 = 120,422,400 (more than 120 million). Without dimension reduction, the computational cost would blow up within the next few stages.

So we use 1x1 convolutions as a means of reducing the computational cost before the heavy 3x3 and 5x5 convolutions.
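
To see how much the 1x1 reduction saves, the multiplication counts can be compared directly. The sketch below assumes a reduction to 16 channels before the 5x5 convolution, which is the value used for the 5x5 path in the first `type_3` entry of `GoogLeNet_config` later in this document; the helper name `conv_multiplications` is made up for this example.

```python
def conv_multiplications(height, width, out_channels, kernel_size, in_channels):
    """Multiplications for a stride-1 convolution with 'same' padding."""
    return height * width * out_channels * kernel_size * kernel_size * in_channels


# Naive: 5x5 convolution applied directly to the 192-channel input.
naive = conv_multiplications(28, 28, 32, 5, 192)

# With reduction: 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduce_1x1 = conv_multiplications(28, 28, 16, 1, 192)
conv_5x5 = conv_multiplications(28, 28, 32, 5, 16)
reduced = reduce_1x1 + conv_5x5

print(naive)    # 120422400
print(reduced)  # 12443648
```

Under this assumption the 5x5 path needs about 12.4 million multiplications instead of about 120 million, roughly a tenth of the naive cost.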

![inception_naive_vs_dimension_reduction](assets/inception_naive_vs_dimension_reduction.png)




Such Inception blocks are repeated a number of times to define a network architecture. We can see the repetition of Inception blocks in GoogLeNet:

### GoogLeNet
GoogLeNet is the most successful incarnation of the Inception architecture.

![GoogLeNet](assets/googlenet.png)

#### Why auxiliary classifiers at the intermediate layers?
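These auxiliary classifiers take the output of some hidden layers and try to make a prediction from it. Their losses are added to the main loss, so their gradients are propagated back together with the main gradient. This makes sure that the features learned at the hidden layers are useful for the final task and helps gradients reach the early layers.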
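
As an illustration, here is a minimal sketch of one auxiliary head and of how its loss could be combined with the main loss. The layer sizes (5x5 average pooling with stride 3, a 1x1 convolution with 128 filters, a 1024-unit fully connected layer, 70% dropout) and the 0.3 loss weight follow the description in the paper. The class name `AuxClassifier`, the random tensors and the stand-in `main_logits` are assumptions made up for this example, and the head is not part of the implementation below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxClassifier(nn.Module):
    """Hypothetical auxiliary head attached to a 14x14 hidden activation map."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = F.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)


# Stand-ins: a hidden activation map (e.g. 512 channels at 14x14) and targets.
hidden = torch.randn(2, 512, 14, 14)
targets = torch.tensor([3, 7])
main_logits = torch.randn(2, 10, requires_grad=True)  # placeholder for the main output

aux_head = AuxClassifier(in_channels=512, num_classes=10)
aux_logits = aux_head(hidden)

# The auxiliary loss is added to the main loss with a small weight.
loss = F.cross_entropy(main_logits, targets) + 0.3 * F.cross_entropy(aux_logits, targets)
loss.backward()
```

At inference time the auxiliary heads are discarded and only the main classifier is used.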



**Compact GoogLeNet**

![Compact GoogLeNet](assets/compact_googlenet.png)

**GoogLeNet System Architecture**
![GoogLeNet_system_architecture](assets/GoogLeNet_system_architecture.png)

### GoogLeNet in PyTorch

As we have seen, the basic building block of GoogLeNet is the Inception block, which lets us use a combination of kernels. We will start by designing the Inception block and then use it to define the whole GoogLeNet architecture.

![Inception block](assets/inception_block.png)

There are four parallel paths, and the outputs of all paths are concatenated along the channel dimension at the end.

In [124]:
import torch
import torch.nn as nn


class InceptionBlock(nn.Module):
    """Inception block with dimension reduction: four parallel paths, concatenated."""

    def __init__(self, config):
        # config = [in_channels, c1_out, (c2_reduce, c2_out), (c3_reduce, c3_out), c4_out]
        super().__init__()
        self.in_channels = config[0]
        self.c1_out = config[1]
        self.c2_out = config[2]
        self.c3_out = config[3]
        self.c4_out = config[4]

        # Path 1: 1x1 convolution.
        self.path1 = nn.Conv2d(self.in_channels, self.c1_out, kernel_size=1)

        # Path 2: 1x1 reduction followed by a 3x3 convolution.
        self.path2_1 = nn.Conv2d(self.in_channels, self.c2_out[0], kernel_size=1)
        self.path2_2 = nn.Conv2d(self.c2_out[0], self.c2_out[1], kernel_size=3, padding=1)

        # Path 3: 1x1 reduction followed by a 5x5 convolution.
        self.path3_1 = nn.Conv2d(self.in_channels, self.c3_out[0], kernel_size=1)
        self.path3_2 = nn.Conv2d(self.c3_out[0], self.c3_out[1], kernel_size=5, padding=2)

        # Path 4: 3x3 max-pooling (stride 1) followed by a 1x1 projection.
        self.path4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.path4_2 = nn.Conv2d(self.in_channels, self.c4_out, kernel_size=1)

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        path1 = self.relu(self.path1(x))
        path2 = self.relu(self.path2_2(self.relu(self.path2_1(x))))
        path3 = self.relu(self.path3_2(self.relu(self.path3_1(x))))
        path4 = self.relu(self.path4_2(self.path4_1(x)))

        # All paths keep height and width, so concatenate along the channel dimension.
        return torch.cat((path1, path2, path3, path4), dim=1)

In [125]:
output_size = 10

class GoogLeNet(nn.Module):
    """GoogLeNet without auxiliary classifiers, for 224x224 RGB inputs.

    GoogLeNet_config (defined in the next cell) must exist before this class
    is instantiated.
    """

    def __init__(self, output_size):
        super().__init__()

        self.output_size = output_size

        # Stem: 7x7 convolution, then a 1x1 and a 3x3 convolution.
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=64,
                               kernel_size=7, stride=2, padding=3)

        self.conv2_1 = nn.Conv2d(in_channels=64, out_channels=64,
                                 kernel_size=1)
        self.conv2_2 = nn.Conv2d(in_channels=64, out_channels=192,
                                 kernel_size=3, padding=1)

        # Three stages with 2, 5 and 2 Inception blocks.
        self.inception3 = self.get_inception_layers("type_3", 2)
        self.inception4 = self.get_inception_layers("type_4", 5)
        self.inception5 = self.get_inception_layers("type_5", 2)

        # Max-pooling halves height and width; it is reused between stages.
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

        # Global average pooling over the final 7x7 activation map.
        self.avgpool = nn.AvgPool2d(kernel_size=7, stride=1, padding=0)

        self.dropout = nn.Dropout2d(p=0.4)

        self.linear = nn.Linear(1024, output_size)

    def get_inception_layers(self, block_type, num_inception_blocks):
        # Build one stage by stacking the configured Inception blocks.
        layers = list()

        for i in range(num_inception_blocks):
            layers.append(InceptionBlock(GoogLeNet_config[block_type][i]))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.relu(self.conv1(x))       # 224x224 -> 112x112
        x = self.maxpool(x)                # 112x112 -> 56x56

        x = self.relu(self.conv2_2(self.relu(self.conv2_1(x))))
        x = self.maxpool(x)                # 56x56 -> 28x28

        x = self.inception3(x)
        x = self.maxpool(x)                # 28x28 -> 14x14

        x = self.inception4(x)
        x = self.maxpool(x)                # 14x14 -> 7x7

        x = self.inception5(x)
        x = self.avgpool(x)                # 7x7 -> 1x1
        x = self.dropout(x)
        x = x.view(x.shape[0], -1)         # flatten to (batch, 1024)
        x = self.linear(x)

        return x

In [126]:
# Each entry: [in_channels, 1x1 out, (3x3 reduce, 3x3 out), (5x5 reduce, 5x5 out), pool projection out]
GoogLeNet_config = {
    "type_3": [
        [192, 64, (96, 128), (16, 32), 32],
        [256, 128, (128, 192), (32, 96), 64]
    ],

    "type_4": [
        [480, 192, (96, 208), (16, 48), 64],
        [512, 160, (112, 224), (24, 64), 64],
        [512, 128, (128, 256), (24, 64), 64],
        [512, 112, (144, 288), (32, 64), 64],
        [528, 256, (160, 320), (32, 128), 128]
    ],

    "type_5": [
        [832, 256, (160, 320), (32, 128), 128],
        [832, 384, (192, 384), (48, 128), 128]
    ]
}
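
A quick way to sanity-check such a configuration is to verify that the channels produced by each block match the `in_channels` of the next one. The sketch below copies the entries into a flat list in network order; the names `blocks` and `out_channels` are made up for this example. Max-pooling between stages does not change the channel count, so the chain can be checked across stage boundaries as well.

```python
# Entries copied from GoogLeNet_config, in network order (3a, 3b, 4a..4e, 5a, 5b).
blocks = [
    [192, 64, (96, 128), (16, 32), 32],
    [256, 128, (128, 192), (32, 96), 64],
    [480, 192, (96, 208), (16, 48), 64],
    [512, 160, (112, 224), (24, 64), 64],
    [512, 128, (128, 256), (24, 64), 64],
    [512, 112, (144, 288), (32, 64), 64],
    [528, 256, (160, 320), (32, 128), 128],
    [832, 256, (160, 320), (32, 128), 128],
    [832, 384, (192, 384), (48, 128), 128],
]


def out_channels(block):
    """Channels after concatenating the four paths of one Inception block."""
    _, c1, c2, c3, c4 = block
    return c1 + c2[1] + c3[1] + c4


# Each block must produce exactly the channels the next block expects.
for current, following in zip(blocks, blocks[1:]):
    assert out_channels(current) == following[0], (current, following)

final_channels = out_channels(blocks[-1])
print(final_channels)  # 1024
```

The final value of 1024 is the input size of the last `nn.Linear` layer in the model.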


In [127]:
my_googlenet = GoogLeNet(10)
my_googlenet

GoogLeNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
  (conv2_1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
  (conv2_2): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (inception3): Sequential(
    (0): InceptionBlock(
      (path1): Conv2d(192, 64, kernel_size=(1, 1), stride=(1, 1))
      (path2_1): Conv2d(192, 96, kernel_size=(1, 1), stride=(1, 1))
      (path2_2): Conv2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (path3_1): Conv2d(192, 16, kernel_size=(1, 1), stride=(1, 1))
      (path3_2): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (path4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (path4_2): Conv2d(192, 32, kernel_size=(1, 1), stride=(1, 1))
      (relu): ReLU(inplace=True)
    )
    (1): InceptionBlock(
      (path1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
      (path2_1): Conv2d(256, 128, kernel_size=(1, 1), str

In [128]:
import torch

# A random batch with one 224x224 RGB image, the input size this model expects.
X = torch.rand(1, 3, 224, 224)
output = my_googlenet(X)

print(output.shape)


torch.Size([1, 10])


The output has shape `[1, 10]`, which is the required output size: one score per class for a single input image.