# Implementing GoogLeNet and training it on CIFAR-10

## [Link to my Youtube Video Explaining this whole Notebook](https://www.youtube.com/watch?v=CZNYrkdDrmQ&list=PLxqBkZuBynVRyOJs4RWmB_fKlOVe5S8CR&index=12)

[![Imgur](https://imgur.com/a5lSG5y.png)](https://youtu.be/AIK6Gi3NUhI)


### Original Problem

Salient parts in the image can have extremely large variation in size. For instance, an image with a cat can be either of the following, as shown below. The area occupied by the cat is different in each image.

![Imgur](https://imgur.com/uAWiQaC.png)

- Because of this huge variation in the location of the information, choosing the right kernel size for the convolution operation becomes tough. A larger kernel is preferred for information that is distributed more globally, and a smaller kernel is preferred for information that is distributed more locally.

- Very deep networks are prone to overfitting. It is also hard to pass gradient updates through the entire network.

- Naively stacking large convolution operations is computationally expensive.

The Solution:

Why not have filters with multiple sizes operate on the same level? The network essentially would get a bit “wider” rather than “deeper”. The authors designed the inception module to reflect the same.

The below image is the “naive” inception module. It performs convolution on an input, with 3 different sizes of filters (1x1, 3x3, 5x5). Additionally, max pooling is also performed. The outputs are concatenated and sent to the next inception module.

![Imgur](https://imgur.com/x5pl2QB.png)

However, deep neural networks are computationally expensive. To make the module cheaper, the authors limit the number of input channels by adding an extra 1x1 convolution before the 3x3 and 5x5 convolutions. Though adding an extra operation may seem counterintuitive, 1x1 convolutions are far cheaper than 5x5 convolutions, and the reduced number of input channels also helps. Note, however, that in the pooling branch the 1x1 convolution is placed after the max pooling layer, rather than before it.


![Imgur](https://imgur.com/zqrJ2jo.png)

Using the dimension reduced inception module, a neural network architecture was built. This was popularly known as GoogLeNet (Inception v1). The architecture is shown below:


### Proposed Architectural Details

The paper proposes a new type of architecture, GoogLeNet or Inception v1. It is basically a convolutional neural network (CNN) that is 27 layers deep when the pooling layers are counted. Below is the model summary:


![Imgur](https://imgur.com/8Br0HLk.png)

Notice in the above image that there is a layer called inception layer. This is actually the main idea behind the paper’s approach.

![Imgur](https://imgur.com/B281dyr.png)


### The Inception layer is a combination of parallel layers (namely, a 1×1 convolutional layer, a 3×3 convolutional layer, and a 5×5 convolutional layer) with their output filter banks concatenated into a single output vector that forms the input of the next stage.


===========================================================================

### Inception Module:

The inception module differs from previous architectures such as AlexNet and ZF-Net. In those architectures, each layer uses a single fixed convolution size.

In theory you can have as many filter sizes as possible, but the Inception Architecture is restricted to filter sizes 1 × 1, 3 × 3 and 5 × 5. The small filters help capture the local details and features whereas spread out features of higher abstraction are captured by the larger filters. A 3 × 3 max pooling is also added to the Inception architecture, because, why not? Historically, it has been found that pooling layers make the network work better, so might as well add them!

#### In the Inception module, 1×1, 3×3 and 5×5 convolutions and a 3×3 max pooling are performed in parallel on the same input, and their outputs are stacked together to generate the final output. The idea is that convolution filters of different sizes handle objects at multiple scales better.


===========================================================================

![Imgur](https://imgur.com/U7HLYco.png)

### The orange box is the stem, which has some preliminary convolutions. The purple boxes are auxiliary classifiers. The wide parts are the inception modules.


GoogLeNet has 9 such inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers). It uses global average pooling at the end of the last inception module.
Needless to say, it is a pretty deep classifier. As with any very deep network, it is subject to the vanishing gradient problem.


The network is 22 layers deep. The initial layers are simple convolution layers.

After that there are multiple blocks of inception modules with layers of max pooling following some of the blocks. **The spatial dimensions get affected by these max pooling layers only.**
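
As a rough check of how the spatial size shrinks, the sketch below traces one side length through the stem convolution and the four max pooling layers. It assumes the usual GoogLeNet settings (a 7×7 stride-2 stem convolution with padding 3, and unpadded 3×3 stride-2 max pools whose output size is rounded up); the function names are illustrative, and every other layer is assumed to preserve height and width.

```python
import math


def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=3, stride=2):
    """Output side length of an unpadded max pool with ceiling rounding."""
    return math.ceil((size - kernel) / stride) + 1


def trace_spatial_size(size):
    """Side length after the stem convolution and each of the four max pools."""
    sizes = [size]
    # Stem convolution: 7x7, stride 2, padding 3 (assumed settings).
    sizes.append(conv_out(sizes[-1], kernel=7, stride=2, padding=3))
    # Four 3x3 stride-2 max pools placed between the blocks.
    for _ in range(4):
        sizes.append(pool_out(sizes[-1]))
    return sizes


print(trace_spatial_size(224))  # [224, 112, 56, 28, 14, 7]
```

After the stride-2 stem convolution, only the max pools change the side length, which is how a 224 × 224 input reaches the final average pooling as a 7 × 7 map.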

Another interesting change to the architecture is replacing the second-to-last fully-connected layer with an average pooling layer. This layer spatially averages the feature map, converting a 7 × 7 × 1024 input to 1 × 1 × 1024. Doing this not only reduces the computation and the number of parameters in the following fully-connected layer by a factor of 49 (7 × 7), but also improves the accuracy of the model, improving top-1 accuracy by 0.6%.

This average pooling layer is finally followed by a normal fully-connected layer with 1000 neurons (and 1024 × 1000 parameters), for the 1000 ImageNet classes.
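
The factor of 49 can be checked with a few lines of arithmetic. This is only a sketch: it counts the weights (no biases) of a 1000-way fully-connected layer, and the flattened variant is a hypothetical alternative used for comparison, not a layer from the paper.

```python
def fc_weights(in_features, out_features):
    """Number of weights in a fully-connected layer, ignoring biases."""
    return in_features * out_features


# Hypothetical: classifier applied directly to the flattened 7x7x1024 map.
flattened = fc_weights(7 * 7 * 1024, 1000)
# Classifier applied after global average pooling (1x1x1024).
pooled = fc_weights(1024, 1000)

print(flattened, pooled, flattened // pooled)  # 50176000 1024000 49
```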

### Two auxiliary classifiers

From the Original Paper

![Imgur](https://imgur.com/DlXDcFO.png)

Auxiliary classifiers are a type of architectural component that seeks to improve the convergence of very deep networks. They are classifier heads attached to layers before the end of the network. The motivation is to push useful gradients to the lower layers to make them immediately useful, and to improve convergence during training by combating the vanishing gradient problem.

To prevent the middle part of the network from “dying out”, the authors introduced two auxiliary classifiers (The purple boxes in the image). They essentially applied softmax to the outputs of two of the inception modules, and computed an auxiliary loss over the same labels.


So, the two auxiliary classifier layers are connected to the output of Inception (4a) and Inception (4d) layers.

The architectural details of the auxiliary classifiers are as follows:


• An average pooling layer with 5x5 filter size and stride 3, resulting in a
4x4x512 output for the (4a), and 4x4x528 for the (4d) stage.

• A 1x1 convolution with 128 filters for dimension reduction and rectified linear (ReLU) activation.

• A fully connected layer with 1024 units and rectified linear activation.

• A dropout layer with 70% ratio of dropped outputs.

• A linear layer with softmax loss as the classifier (predicting the same
1000 classes as the main classifier, but removed at inference time)
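
The shapes in this list can be reproduced with a short sketch. It assumes the 14×14 spatial size that the (4a) and (4d) outputs have for a 224×224 input; the helper names are made up for illustration.

```python
def avg_pool_out(size, kernel=5, stride=3):
    """Output side length of an unpadded average pool."""
    return (size - kernel) // stride + 1


def aux_head_shapes(spatial, channels, reduce_filters=128):
    """Shapes (H, W, C) after pooling and after the 1x1 reduction, plus FC input size."""
    side = avg_pool_out(spatial)
    return {
        "pooled": (side, side, channels),
        "reduced": (side, side, reduce_filters),
        "fc_inputs": side * side * reduce_filters,
    }


print(aux_head_shapes(14, 512))  # head attached to Inception (4a)
print(aux_head_shapes(14, 528))  # head attached to Inception (4d)
```

Both heads therefore feed 4 × 4 × 128 = 2048 values into the 1024-unit fully-connected layer.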


### Loss Function

The total loss function is a weighted sum of the auxiliary loss and the real loss. Weight value used in the paper was 0.3 for each auxiliary loss.


### The total loss used by the inception net during training.

### total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
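
A minimal sketch of this weighted sum, using a hypothetical helper name and made-up loss values. It is written for plain floats here; the same expression applies to loss tensors.

```python
def total_loss(real_loss, aux_losses, aux_weight=0.3):
    """Main loss plus each auxiliary loss scaled by aux_weight."""
    return real_loss + aux_weight * sum(aux_losses)


# Made-up values: 1.0 + 0.3 * 2.0 + 0.3 * 3.0
loss = total_loss(1.0, [2.0, 3.0])
print(round(loss, 6))  # 2.5
```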


What’s novel in Inception v1?

Instead of stacking convolutional layers, we stack modules or blocks, within which are convolutional layers.

===========================================================================

### Google Inception model: why are there 3 softmax layers?

When creating deeper networks, there arises a problem coined as the "vanishing gradients" problem.

Intuitively, you can think about this problem like this: gradients carry less and less information the deeper we go into the network, which is of course a major concern, since we tune the network's parameters (weights) based solely on the gradients, using the "back-prop" algorithm.

How did the developers of GoogLeNet handle this problem? They recognized that it's not only the features of the final layers that carry all the discriminatory information: intermediate features are also capable of discriminating different labels; and, most importantly, their values are more "reliable" since they are extracted from earlier layers in which the gradients carry more information. Building on this intuition, they added "auxiliary classifiers" in two intermediate layers. This is the reason for the "early escape" loss layers in the middle of the network.


The total loss is then a combination of these three loss layers.


### total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2


I quote from the original article:


These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
from matplotlib import pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms, datasets
from torchsummary import summary

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## Building the initial Convolutional Block

In [3]:
class ConvBlock(nn.Module):
    
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ConvBlock, self).__init__()
        
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(out_channels)
        self.activation = nn.ReLU()
        
    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.activation(x)
        return x

## Building the Inception Block

### “#3×3 reduce” and “#5×5 reduce”

From Paper - In the above “#3 × 3 reduce” and “#5 × 5 reduce” stands for the number of 1 × 1 filters in the reduction layer used before the 3 × 3 and 5 × 5 convolutions. One can see the number of 1 × 1 filters in the projection layer after the built-in max-pooling in the “pool proj” column. All these reduction/ projection layers use rectified linear (ReLU) activation. 

In [4]:
class Inception(nn.Module):
    
    def __init__(self, in_channels, num1x1, num3x3_reduce, num3x3, num5x5_reduce, num5x5, pool_proj):
        super(Inception, self).__init__()
        
        # Four parallel branches, each with its own number of output channels.
        # Note: within Inception the individual blocks run in parallel on the
        # same input, NOT sequentially.
        self.block1 = nn.Sequential(
            ConvBlock(in_channels, num1x1, kernel_size=1, stride=1, padding=0)
        )
        
        self.block2 = nn.Sequential(
            ConvBlock(in_channels, num3x3_reduce, kernel_size=1, stride=1, padding=0),
            ConvBlock(num3x3_reduce, num3x3, kernel_size=3, stride=1, padding=1)
        )
        
        self.block3 = nn.Sequential(
            ConvBlock(in_channels, num5x5_reduce, kernel_size=1, stride=1, padding=0),
            ConvBlock(num5x5_reduce, num5x5, kernel_size=5, stride=1, padding=2)
        )
        
        self.block4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1, ceil_mode=True),
            ConvBlock(in_channels, pool_proj, kernel_size=1, stride=1, padding=0)
        )
        
    def forward(self, x):
        # Note the different way this forward function 
        # calculates the output.
        block1 = self.block1(x)
        block2 = self.block2(x)
        block3 = self.block3(x)
        block4 = self.block4(x)
        
        return torch.cat([block1, block2, block3, block4], 1)

In [5]:
class Auxiliary(nn.Module):
    
    def __init__(self, in_channels, num_classes):
        super(Auxiliary, self).__init__()
        
        self.pool = nn.AdaptiveAvgPool2d((4,4))
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1, stride=1, padding=0)
        self.activation = nn.ReLU()
        self.fc1 = nn.Linear(2048, 1024)
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)
    
    def forward(self, x):
        out = self.pool(x)
        
        out = self.conv(out)
        out = self.activation(out)
    
        out = torch.flatten(out, 1)
        
        out = self.fc1(out)
        out = self.activation(out)
        out = self.dropout(out)
        
        out = self.fc2(out)
        
        return out

In [6]:
class GoogLeNet(nn.Module):
    
    def __init__(self, num_classes = 10):
        super(GoogLeNet, self).__init__()
      
        self.conv1 = ConvBlock(3, 64, kernel_size=7, stride=2, padding=3)
        self.pool1 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        self.conv2 = ConvBlock(64, 64, kernel_size=1, stride=1, padding=0)
        self.conv3 = ConvBlock(64, 192, kernel_size=3, stride=1, padding=1)
        self.pool3 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        
        self.inception3A = Inception(in_channels=192, num1x1=64, num3x3_reduce=96, num3x3=128, num5x5_reduce=16, num5x5=32, pool_proj=32)
        self.inception3B = Inception(in_channels=256, num1x1=128, num3x3_reduce=128, num3x3=192, num5x5_reduce=32, num5x5=96, pool_proj=64)
        self.pool4 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        
        self.inception4A = Inception(in_channels=480, num1x1=192, num3x3_reduce=96, num3x3=208, num5x5_reduce=16, num5x5=48, pool_proj=64)
        self.inception4B = Inception(in_channels=512, num1x1=160, num3x3_reduce=112, num3x3=224, num5x5_reduce=24, num5x5=64, pool_proj=64)
        self.inception4C = Inception(in_channels=512, num1x1=128, num3x3_reduce=128, num3x3=256, num5x5_reduce=24, num5x5=64, pool_proj=64)
        self.inception4D = Inception(in_channels=512, num1x1=112, num3x3_reduce=144, num3x3=288, num5x5_reduce=32, num5x5=64, pool_proj=64)
        self.inception4E = Inception(in_channels=528, num1x1=256, num3x3_reduce=160, num3x3=320, num5x5_reduce=32, num5x5=128, pool_proj=128)
        self.pool5 = nn.MaxPool2d(3, stride=2, padding=0, ceil_mode=True)
        
        self.inception5A = Inception(in_channels=832, num1x1=256, num3x3_reduce=160, num3x3=320, num5x5_reduce=32, num5x5=128, pool_proj=128)
        self.inception5B = Inception(in_channels=832, num1x1=384, num3x3_reduce=192, num3x3=384, num5x5_reduce=48, num5x5=128, pool_proj=128)
        self.pool6 = nn.AdaptiveAvgPool2d((1,1))
        
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)
        
        self.aux4A = Auxiliary(512, num_classes) 
        self.aux4D = Auxiliary(528, num_classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.pool1(out)
        out = self.conv2(out)
        out = self.conv3(out)
        out = self.pool3(out)
        out = self.inception3A(out)
        out = self.inception3B(out)
        out = self.pool4(out)
        out = self.inception4A(out)
  
        aux1 = self.aux4A(out)
        
        out = self.inception4B(out)
        out = self.inception4C(out)
        out = self.inception4D(out)
  
        aux2 = self.aux4D(out)
        
        out = self.inception4E(out)
        out = self.pool5(out)
        out = self.inception5A(out)
        out = self.inception5B(out)
        out = self.pool6(out)
        out = torch.flatten(out,1)
        out = self.dropout(out)
        out = self.fc(out)
        
        return out, aux1, aux2
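
Since each Inception module concatenates its four branches along the channel dimension, its output width is the sum of the four branch widths, and that sum must equal the `in_channels` of the next module. The sketch below checks this bookkeeping for the nine modules, using the numbers passed to `Inception` above; `MODULES`, `out_channels`, and `check_chain` are illustrative names, not part of the model.

```python
# (name, in_channels, num1x1, num3x3, num5x5, pool_proj) for each module
MODULES = [
    ("3a", 192, 64, 128, 32, 32),
    ("3b", 256, 128, 192, 96, 64),
    ("4a", 480, 192, 208, 48, 64),
    ("4b", 512, 160, 224, 64, 64),
    ("4c", 512, 128, 256, 64, 64),
    ("4d", 512, 112, 288, 64, 64),
    ("4e", 528, 256, 320, 128, 128),
    ("5a", 832, 256, 320, 128, 128),
    ("5b", 832, 384, 384, 128, 128),
]


def out_channels(num1x1, num3x3, num5x5, pool_proj):
    """Channels after concatenating the four parallel branches."""
    return num1x1 + num3x3 + num5x5 + pool_proj


def check_chain(modules):
    """Return each module's output width; raise if it mismatches the next in_channels."""
    widths = []
    for index, (name, _in_channels, *branches) in enumerate(modules):
        width = out_channels(*branches)
        if index + 1 < len(modules):
            expected = modules[index + 1][1]
            if width != expected:
                raise ValueError(f"{name} outputs {width} channels, next expects {expected}")
        widths.append(width)
    return widths


print(check_chain(MODULES))  # [256, 480, 512, 512, 512, 528, 832, 832, 1024]
```

The widths 512 and 528 after (4a) and (4d) are the input sizes of the two auxiliary heads, and the final 1024 matches the input of the last linear layer.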
        

In [7]:
model = GoogLeNet()

In [8]:
model.to(device)
summary(model, (3, 96, 96))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 64, 48, 48]           9,472
       BatchNorm2d-2           [-1, 64, 48, 48]             128
              ReLU-3           [-1, 64, 48, 48]               0
         ConvBlock-4           [-1, 64, 48, 48]               0
         MaxPool2d-5           [-1, 64, 24, 24]               0
            Conv2d-6           [-1, 64, 24, 24]           4,160
       BatchNorm2d-7           [-1, 64, 24, 24]             128
              ReLU-8           [-1, 64, 24, 24]               0
         ConvBlock-9           [-1, 64, 24, 24]               0
           Conv2d-10          [-1, 192, 24, 24]         110,784
      BatchNorm2d-11          [-1, 192, 24, 24]             384
             ReLU-12          [-1, 192, 24, 24]               0
        ConvBlock-13          [-1, 192, 24, 24]               0
        MaxPool2d-14          [-1, 192,

## Loading CIFAR-10

In [9]:
def cifar_dataloader():
    
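    # Resize CIFAR-10's 32x32 images to 96x96 so they match the (3, 96, 96) input
    # passed to summary() and the image shape in the recorded output of this cell.
    transform = transforms.Compose([transforms.Resize((96, 96)),
                                    transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])])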
            
    # Input Data in Local Machine
    # train_dataset = datasets.CIFAR10('../input_data', train=True, download=True, transform=transform)
    # test_dataset = datasets.CIFAR10('../input_data', train=False, download=True, transform=transform)
    
    # Input Data in Google Drive
    train_dataset = datasets.CIFAR10('/content/drive/MyDrive/All_Datasets/CIFAR10', train=True, download=True, transform=transform)
    test_dataset = datasets.CIFAR10('/content/drive/MyDrive/All_Datasets/CIFAR10', train=False, download=True, transform=transform)

    # Split dataset into training set and validation set.
    train_dataset, val_dataset = random_split(train_dataset, (45000, 5000))
    
    print("Image shape of a random sample image : {}".format(train_dataset[0][0].numpy().shape), end = '\n\n')
    
    print("Training Set:   {} images".format(len(train_dataset)))
    print("Validation Set:   {} images".format(len(val_dataset)))
    print("Test Set:       {} images".format(len(test_dataset)))
    
    BATCH_SIZE = 128

    # Generate dataloader
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True)
    
    return train_loader, val_loader, test_loader

In [10]:
train_loader, val_loader, test_loader = cifar_dataloader()

Image Shape: (3, 96, 96)

Training Set:   45000 samples
Validation Set:   5000 samples
Test Set:       10000 samples


In [11]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

## Training the model

In [12]:
def train_model(model, train_loader, val_loader, criterion, optimizer):
    EPOCHS = 15
    train_samples_num = 45000
    val_samples_num = 5000
    train_epoch_loss_history, val_epoch_loss_history = [], []
    
    for epoch in range(EPOCHS):

        train_running_loss = 0
        correct_train = 0
        
        model.train().to(device)
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            # For every mini-batch, explicitly reset the gradients to zero
            # before backpropagation, because PyTorch accumulates them by default.
            optimizer.zero_grad()
            
            # Start the forward pass
            prediction0, aux_pred_1, aux_pred_2 = model(inputs)
            
            # Compute the loss.
            real_loss = criterion(prediction0, labels)
            aux_loss_1 = criterion(aux_pred_1, labels)
            aux_loss_2 = criterion(aux_pred_2, labels)
            
            loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
            
            # Backward pass: backpropagate, then update the weights with step().
            loss.backward()
            optimizer.step()
            
            # Update the running corrects 
            _, predicted = torch.max(prediction0.data, 1)
            
            correct_train += (predicted == labels).float().sum().item()
            
            # Accumulate the batch loss. criterion returns the mean loss over the batch,
            # so multiplying by the batch size, inputs.shape[0], "un-averages" it and
            # gives the summed loss over the images in this batch.
            train_running_loss += loss.item() * inputs.shape[0]


        train_epoch_loss = train_running_loss / train_samples_num
        
        train_epoch_loss_history.append(train_epoch_loss)
        
        train_acc =  correct_train / train_samples_num

        val_loss = 0
        correct_val = 0
          
        model.eval().to(device)
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Forward pass.
                prediction0, aux_pred_1, aux_pred_2 = model(inputs)

                # Compute the loss.
                real_loss = criterion(prediction0, labels)
                aux_loss_1 = criterion(aux_pred_1, labels)
                aux_loss_2 = criterion(aux_pred_2, labels)

                loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2

                # Compute validation accuracy.
                _, predicted = torch.max(prediction0, 1)
                correct_val += (predicted == labels).float().sum().item()

                # Accumulate the batch loss, weighted by the batch size.
                val_loss += loss.item() * inputs.shape[0]

        val_loss /= val_samples_num
        val_epoch_loss_history.append(val_loss)
        val_acc = correct_val / val_samples_num
        
        info = "[For Epoch {}/{}]: train-loss = {:0.5f} | train-acc = {:0.3f} | val-loss = {:0.5f} | val-acc = {:0.3f}"
        
        print(info.format(epoch+1, EPOCHS, train_epoch_loss, train_acc, val_loss, val_acc))
        
        torch.save(model.state_dict(), '/content/sample_data/checkpoint{}'.format(epoch + 1)) 
                      
    torch.save(model.state_dict(), '/content/sample_data/googlenet_model')  
        
    return train_epoch_loss_history, val_epoch_loss_history
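
The reason for multiplying each batch loss by the batch size is that the last batch can be smaller than the others (45000 samples with a batch size of 128 leave a final batch of 72). A small sketch with made-up loss values, using a hypothetical helper, shows the difference from simply averaging the batch means:

```python
def epoch_mean_loss(batch_means, batch_sizes):
    """Per-sample mean loss over an epoch, from per-batch mean losses."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


batch_means = [1.0, 2.0]  # made-up mean losses of two batches
batch_sizes = [128, 72]   # a full batch and a smaller final batch

weighted = epoch_mean_loss(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)
print(round(weighted, 2), naive)  # 1.36 1.5
```

The naive average gives the small final batch the same weight as a full one, while the size-weighted sum divided by the number of samples does not.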

In [None]:
# train_epoch_loss_history, val_epoch_loss_history = train_model()
train_epoch_loss_history, val_epoch_loss_history = train_model(model, train_loader, val_loader, criterion, optimizer)



## Evaluating model

In [None]:
model = GoogLeNet()
# Load the weights saved at the end of train_model().
model.load_state_dict(torch.load('/content/sample_data/googlenet_model', map_location=device))

In [None]:
num_test_samples = 10000
correct = 0 

model.eval().to(device)

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Make predictions.
        prediction, _, _ = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / num_test_samples

print('Test accuracy: {}'.format(test_accuracy))

## Bonus Points - Understanding 1 x 1 Convolutions

The 1x1 convolution, sometimes referred to as “Network in Network”, was introduced in 2013 in the paper of that name by Lin et al.

A 1x1 convolution computes, at every pixel position, a weighted sum of the values across all input channels. Each 1x1 filter therefore produces one output channel with the same height and width as the input, so convolving an H x W x C input with n filters of size 1x1 creates an output with the dimensions H x W x n (where ‘n’ is the number of filters).

Although a 1x1 filter does not learn any spatial patterns that occur within the image, it does learn patterns across the depth (cross-channel) of the input image. Therefore, not only do 1x1 convolution filters provide a method for dimension reduction, but they also provide the additional benefit of enabling the network to learn more.

The input channels are reduced by the 1x1 convolution, creating output with a reduced number of channels. This part of the Inception network is the bottleneck layer.
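
To make the channel-mixing view concrete, here is a dependency-free sketch of a 1x1 convolution on nested lists (no bias, no activation). The function name and the tiny 2 x 2 x 3 input are invented for illustration; in the notebook the same operation is done by `nn.Conv2d` with `kernel_size=1`.

```python
def conv1x1(image, filters):
    """Apply 1x1 filters to an H x W x C image given as nested lists.

    Each filter is a list of C weights; the result has shape H x W x len(filters).
    """
    return [
        [
            [sum(w * v for w, v in zip(f, pixel)) for f in filters]
            for pixel in row
        ]
        for row in image
    ]


image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 2]],
]  # height 2, width 2, 3 channels
filters = [
    [1, 0, 0],  # copies the first channel
    [1, 1, 1],  # sums all three channels
]

print(conv1x1(image, filters))  # [[[1, 6], [4, 15]], [[7, 24], [0, 3]]]
```

The height and width stay at 2 x 2 while the channel count drops from 3 to 2, one output channel per filter.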


### Example of how 1x1 Conv reduces dimension / number of parameters For Inception Module in GoogleNet

First WITHOUT using 1x1 Conv

![Imgur](https://imgur.com/JQIKIk2.png)


In the above diagram, the 28x28x192 input is convolved with 32 filters of size 5x5, each spanning all 192 input channels.

We get an output of size 28x28x32.

How did we get an output of 28x28x32? The output size after convolving an input image with a filter follows from two formulas:

result image (height) = ((original image height + 2 * padding value - filter size (height)) / stride value) + 1
result image (width) = ((original image width + 2 * padding value - filter size (width)) / stride value) + 1

Here,

Original image height = 28
Original image width = 28
Padding value = 2
Filter size = 5
Stride value = 1

result image (height) = (28 + 2*2 - 5)/1 + 1 = 28
result image (width) = (28 + 2*2 - 5)/1 + 1 = 28

As we used 32 filters of 5x5, the output has 32 channels

So finally, we get the output size of 28x28x32
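
The formula is easy to wrap in a helper. The sketch below uses illustrative names and assumes square inputs and kernels; it also shows the padding that keeps the size unchanged for an odd kernel at stride 1.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Side length of a convolution's output (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Padding that preserves the size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


for kernel in (1, 3, 5):
    pad = same_padding(kernel)
    print(kernel, pad, conv_output_size(28, kernel, stride=1, padding=pad))
# 1 0 28
# 3 1 28
# 5 2 28
```

Without padding, the same 5x5 filter would shrink the 28x28 input to 24x24, so the three convolution branches could no longer be concatenated.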

So, when we apply 32 filters of size 5x5 to the 28x28x192 input, the number of operations to be performed is (28x28x32) x (5x5x192) = 120,422,400, or about 120 million operations.


Now using 1x1 Conv


![Imgur](https://imgur.com/qAFxZcZ.png)


In the above image, we first apply 16 filters of 1x1 and then 32 filters of 5x5.

Here we need just 12.4 million operations,


((28x28x16x1x1x192)+(28x28x32x5x5x16)), to complete the same task.
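
Both operation counts can be reproduced by counting multiplications: one per output value, per kernel position, per input channel. This is a sketch with an illustrative helper name; it ignores additions, biases, and activations.

```python
def conv_multiplications(out_h, out_w, out_channels, kernel, in_channels):
    """Multiplications needed by a kernel x kernel convolution layer."""
    return out_h * out_w * out_channels * kernel * kernel * in_channels


# Direct 5x5 convolution: 192 input channels -> 32 output channels.
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 reduction to 16 channels, then 5x5 on those 16 channels.
reduce_1x1 = conv_multiplications(28, 28, 16, 1, 192)
conv_5x5 = conv_multiplications(28, 28, 32, 5, 16)
bottleneck = reduce_1x1 + conv_5x5

print(direct)                         # 120422400
print(bottleneck)                     # 12443648
print(round(direct / bottleneck, 1))  # 9.7
```

Note that the 5x5 convolution in the bottleneck sees only 16 input channels, which is where almost all of the saving comes from.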

### To summarize the reasons to use 1x1 Convolution in Inception Module

A 3×3 filter cannot capture a feature that spans a 5×5 window, and a 5×5 filter has a hard time modeling what a 3×3 filter captures. So we try to combine patterns at different scales. Because this involves so much computation before all the information is joined, we do not want the intermediate feature maps to be very large, hence we apply a 1×1 convolution first to reduce the dimension.


The whole point of using 1x1 convolutions is to reduce the dimension along the channel axis while keeping the other dimensions the same, without losing much useful information and without having to learn many new parameters.


This, "Pointwise Convolution", aka 1x1 Convolution (kernel size=1, stride=1), is currently used in the vast majority of well-known CNN architectures.