<a href="https://colab.research.google.com/github/Nisag/EVA4/blob/master/Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Session 4 Assignment:
1. 99.4% validation accuracy
2. Less than 20k parameters 
3. Less than 20 epochs
4. No fully connected layer

Things covered so far:
* Number of layers in a network
* Receptive field
* 3x3 convolutions
* MaxPooling: position and distance from prediction
* Number of kernels
* Image normalization
* 1x1 convolutions
* Transition layers: concept and position
* SoftMax
* Batch Normalization: distance from prediction
* Dropout: when to introduce it
* When to stop convolutions and move on to a larger kernel or an alternative
* Performance: how to tell, comparatively and very early, that a network is not doing well
* Batch size and its effects
* Learning rate

To see how to add the techniques covered in this session, refer to this code: https://www.kaggle.com/enwei26/mnist-digits-pytorch-cnn-99






## Import libraries

In [0]:
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

## Model definition

In [0]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 8, 3) # input - 28*28*1, output - 26*26*8
        self.bn1 = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, 3) # input - 26*26*8, output - 24*24*16
        self.bn2 = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, 3) # input - 24*24*16, output - 22*22*32
        self.bn3 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2) # input - 22*22*32, output - 11*11*32
        self.conv4 = nn.Conv2d(32, 8, 1) # input - 11*11*32, output - 11*11*8
        self.bn4 = nn.BatchNorm2d(8)
        self.conv5 = nn.Conv2d(8, 16, 3) # input - 11*11*8, output - 9*9*16
        self.bn5 = nn.BatchNorm2d(16)
        self.conv6 = nn.Conv2d(16, 32, 3) # input - 9*9*16, output - 7*7*32
        self.bn6 = nn.BatchNorm2d(32)
        self.avgpool = nn.AdaptiveAvgPool2d(1) # input 7*7*32 output - 1*1*32
        self.conv7 = nn.Conv2d(32, 10, 1) # input - 1*1*32, output - 1*1*10

    def forward(self, x):
        x = F.relu(self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x))))))
        x = self.pool1(F.relu(self.bn3(self.conv3(x))))
        x = F.relu(self.bn5(self.conv5(F.relu(self.bn4(self.conv4(x))))))
        x = F.relu(self.bn6(self.conv6(x)))
        x = self.avgpool(x)
        x = self.conv7(x)
        x = x.view(-1, 10)
        return F.log_softmax(x, dim=1)  # explicit dim: normalize over the 10 class scores
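
The shape comments in `Net` follow from the usual size formula for an unpadded convolution or pooling layer, `out = (in + 2*padding - kernel) // stride + 1`. The sketch below re-derives them in plain Python and also tracks the receptive field. The helper names `conv_out` and `trace` and the `LAYERS` table are illustrative and not part of the notebook.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride) for each spatial layer of Net, in forward order.
# The 1x1 convolution (conv4) is listed to show that it leaves the size unchanged.
LAYERS = [
    ("conv1", 3, 1),
    ("conv2", 3, 1),
    ("conv3", 3, 1),
    ("pool1", 2, 2),
    ("conv4", 1, 1),
    ("conv5", 3, 1),
    ("conv6", 3, 1),
]

def trace(size=28):
    """Return (layer name, output size, receptive field) for each layer."""
    rf, jump = 1, 1  # receptive field and distance between adjacent outputs, in input pixels
    rows = []
    for name, kernel, stride in LAYERS:
        size = conv_out(size, kernel, stride)
        rf += (kernel - 1) * jump
        jump *= stride
        rows.append((name, size, rf))
    return rows

rows = trace()
for name, size, rf in rows:
    print(f"{name}: {size}x{size}, receptive field {rf}")
```

Under these assumptions the last convolution produces a 7x7 map, which `AdaptiveAvgPool2d(1)` then averages down to 1x1 before the final 1x1 convolution.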

## Model summary

In [0]:
!pip install torchsummary
from torchsummary import summary
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = Net().to(device)
summary(model, input_size=(1, 28, 28))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1            [-1, 8, 26, 26]              80
       BatchNorm2d-2            [-1, 8, 26, 26]              16
            Conv2d-3           [-1, 16, 24, 24]           1,168
       BatchNorm2d-4           [-1, 16, 24, 24]              32
            Conv2d-5           [-1, 32, 22, 22]           4,640
       BatchNorm2d-6           [-1, 32, 22, 22]              64
         MaxPool2d-7           [-1, 32, 11, 11]               0
            Conv2d-8            [-1, 8, 11, 11]             264
       BatchNorm2d-9            [-1, 8, 11, 11]              16
           Conv2d-10             [-1, 16, 9, 9]           1,168
      BatchNorm2d-11             [-1, 16, 9, 9]              32
           Conv2d-12             [-1, 32, 7, 7]           4,640
      BatchNorm2d-13             [-1, 32, 7, 7]              64
AdaptiveAvgPool2d-14             [-1, 3



## Prepare and load the dataset

In [0]:
torch.manual_seed(1)
batch_size = 32 # changed from 128 

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True,
                   download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Processing...
Done!
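
`transforms.ToTensor()` scales pixel values to the range [0, 1], and `transforms.Normalize((0.1307,), (0.3081,))` then applies `(x - mean) / std` to every pixel. A minimal sketch of that arithmetic in plain Python, with an illustrative helper name `normalize` that is not part of the notebook:

```python
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081  # the constants passed to Normalize above

def normalize(pixel, mean=MNIST_MEAN, std=MNIST_STD):
    """Apply the per-pixel transform (x - mean) / std."""
    return (pixel - mean) / std

black = normalize(0.0)  # darkest possible pixel after ToTensor
white = normalize(1.0)  # brightest possible pixel after ToTensor
print(f"black -> {black:.4f}, white -> {white:.4f}")
```

The same transform is applied to both the training and the test loader, so the network sees inputs on the same scale in both phases.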




## Model training and validation

In [0]:
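from tqdm import tqdm

def train(model, device, train_loader, optimizer, epoch):
    """Run one training epoch, showing the latest batch loss in the progress bar."""
    model.train()  # training mode: batch norm uses per-batch statistics
    pbar = tqdm(train_loader)
    for batch_idx, (data, target) in enumerate(pbar):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # clear gradients left over from the previous batch
        output = model(data)  # log-probabilities, shape (batch, 10)
        loss = F.nll_loss(output, target)  # NLL on the log-softmax output
        loss.backward()
        optimizer.step()
        pbar.set_description(desc=f'loss={loss.item()} batch_id={batch_idx}')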


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.1f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
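
Both functions pair the model's `F.log_softmax` output with `F.nll_loss`. For a single sample, that loss is the negative log-probability assigned to the correct class. The sketch below works through this by hand in plain Python. The helper names and the example logits are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-class problem
log_probs = log_softmax(logits)
loss = nll(log_probs, target=0)
prediction = max(range(len(log_probs)), key=log_probs.__getitem__)
print(f"loss={loss:.4f}, predicted class={prediction}")
```

The prediction is the index of the largest log-probability, which is what `output.argmax(dim=1, keepdim=True)` computes per row in `test`.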

## Optimizer definition and running the model

In [0]:
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(1, 20):  # epochs 1..19, i.e. 19 epochs, under the 20-epoch limit
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)

loss=0.11956486105918884 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 87.16it/s]
Test set: Average loss: 0.0935, Accuracy: 9723/10000 (97.2%)

loss=0.009263277053833008 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 89.17it/s]
Test set: Average loss: 0.0469, Accuracy: 9858/10000 (98.6%)

loss=0.0057594627141952515 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 87.83it/s]
Test set: Average loss: 0.0370, Accuracy: 9889/10000 (98.9%)

loss=0.03366020321846008 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.37it/s]
Test set: Average loss: 0.0360, Accuracy: 9889/10000 (98.9%)

loss=0.002917453646659851 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 89.20it/s]
Test set: Average loss: 0.0273, Accuracy: 9907/10000 (99.1%)

loss=0.02572450041770935 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.56it/s]
Test set: Average loss: 0.0275, Accuracy: 9913/10000 (99.1%)

loss=0.00375249981880188 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 88.65it/s]
Test set: Average loss: 0.0272, Accuracy: 9913/10000 (99.1%)

loss=0.027218297123908997 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.35it/s]
Test set: Average loss: 0.0242, Accuracy: 9923/10000 (99.2%)

loss=0.001628786325454712 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.73it/s]
Test set: Average loss: 0.0252, Accuracy: 9917/10000 (99.2%)

loss=0.024715274572372437 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 88.82it/s]
Test set: Average loss: 0.0243, Accuracy: 9929/10000 (99.3%)

loss=0.06545257568359375 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 88.87it/s]
Test set: Average loss: 0.0240, Accuracy: 9927/10000 (99.3%)

loss=0.01260325312614441 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 89.23it/s]
Test set: Average loss: 0.0251, Accuracy: 9917/10000 (99.2%)

loss=0.00449785590171814 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 90.36it/s]
Test set: Average loss: 0.0255, Accuracy: 9924/10000 (99.2%)

loss=0.006565719842910767 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 90.02it/s]
Test set: Average loss: 0.0237, Accuracy: 9934/10000 (99.3%)

loss=0.002406895160675049 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.94it/s]
Test set: Average loss: 0.0203, Accuracy: 9937/10000 (99.4%)

loss=0.0003854036331176758 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.38it/s]
Test set: Average loss: 0.0215, Accuracy: 9935/10000 (99.3%)

loss=0.002652466297149658 batch_id=1874: 100%|██████████| 1875/1875 [00:21<00:00, 88.79it/s]
Test set: Average loss: 0.0234, Accuracy: 9926/10000 (99.3%)

loss=0.0019318163394927979 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.39it/s]
Test set: Average loss: 0.0214, Accuracy: 9934/10000 (99.3%)

loss=0.010960191488265991 batch_id=1874: 100%|██████████| 1875/1875 [00:20<00:00, 89.95it/s]
Test set: Average loss: 0.0200, Accuracy: 9938/10000 (99.4%)
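
The `1875` in every progress bar and the 19 result lines both follow from the settings used above. A quick check in plain Python, assuming the standard 60,000-image MNIST training split and the default `drop_last=False` of `DataLoader`:

```python
import math

train_samples = 60000  # size of the MNIST training split
batch_size = 32        # value set in the data-loading cell

batches_per_epoch = math.ceil(train_samples / batch_size)
epochs = len(range(1, 20))  # range(1, 20) stops before 20
optimizer_steps = batches_per_epoch * epochs

print(batches_per_epoch, epochs, optimizer_steps)
```

Each batch triggers one `optimizer.step()`, so the smaller batch size (32 instead of 128) means four times as many weight updates per epoch.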



## S4-Solution Quiz:
1. What is the final validation accuracy of your model? Mention in percentage (without % sign)
 
 99.4

2. How many parameters does your network have? (Mention without a comma or any special character).
 
 12514

3. Have you used DropOut?
 
 False

4. Have you used BatchNormalization?
 
 True

5. Have you used a Fully Connected Layer? 
 
 False

6. Have you used 1x1 kernels?
  
 True
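
The parameter count given in answer 2 can be re-derived from the layer definitions in `Net`: each convolution contributes `k * k * c_in * c_out` weights plus `c_out` biases, and each `BatchNorm2d` contributes a learnable scale and shift per channel. A minimal sketch in plain Python, where the helper names and the two layer tables are illustrative:

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out

def bn_params(channels):
    """Learnable scale and shift of a BatchNorm2d layer."""
    return 2 * channels

# (in_channels, out_channels, kernel_size) for conv1..conv7 in Net
CONVS = [(1, 8, 3), (8, 16, 3), (16, 32, 3), (32, 8, 1),
         (8, 16, 3), (16, 32, 3), (32, 10, 1)]
# channel counts for bn1..bn6 in Net
BATCH_NORMS = [8, 16, 32, 8, 16, 32]

total = (sum(conv_params(*c) for c in CONVS)
         + sum(bn_params(c) for c in BATCH_NORMS))
print(total)
```

The pooling layers add nothing, since `MaxPool2d` and `AdaptiveAvgPool2d` have no learnable parameters.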

## Quiz 4:
1. When you read "Those circles are "temporary" values that will be stored. Once you train the model, lines are what all matters!" in the notes, what is the meaning of temporary?

 * Circles represent the calculated neuron value, or the channel's pixel value. These values are temporary as they change with every image and are dumped out of memory after every inference.
 * Circles represent the values calculated by multiplying the input with the weights (represented by the lines). Since the inputs change, the products of inputs and weights change too. Hence they are temporary.

2. When you read "Those circles are "temporary" values that will be stored. Once you train the model, lines are what all matters!" in the notes, what is the meaning of "lines are what all matter"?

 * Lines represent the weights, and we train the model to arrive at the correct weights. Hence in the end it is those lines that matter.
 * Lines are what matter, as they not only represent the weights we want to train, they also represent how "dense" our connections are. The more lines, the denser the network. "Denseness" has direct implications for the model type.

3. When you read "Exactly, that's the point." what was meant by it?

 * That a 1D pattern created by converting a 2D pattern has lost its spatial meaning.
 * Converting a 2D pattern into a 1D pattern throws away the "spatial information". And without spatial information it wouldn't be ideal to train a "vision" DNN.

4. In the image below (don't consider biases):

 * The input size is 13d
 * The output size is 10d 
 * Total weights used are 130 
 * If we connect all the input circles to the output circles (right part of the image), we will end up drawing 130 lines.
 * The weight matrix is 13x10 

5. In the image below (don't consider biases):
 
 * The hidden layer has 100 weights
 * The target output is shown as a one-hot vector
 * A total of 7380 weights are used

6. In this image:
 * If we flatten both input and output, we would need an FC layer with 225 weights
 * If we draw lines to show the connections, we will end up drawing 91 lines **(doubt)**

7. In the image below, the 3 blue boxes represent 3 FC layers (the first two have 4096 neurons each) (don't consider biases):
 
 A total of 123633664 parameters are used in the FC layers
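
Without biases, a fully connected layer from `m` inputs to `n` outputs has `m * n` weights, and a stack of such layers sums those products. The image is not reproduced here, so the sketch below assumes a VGG-16-style head (a flattened 7x7x512 feature map feeding 4096, 4096 and 1000 neurons). That layout is an assumption, but it does reproduce the total stated above. The helper name `fc_weights` is illustrative.

```python
def fc_weights(sizes):
    """Weights in a stack of fully connected layers, ignoring biases."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

# Assumed layout: flattened 7x7x512 input -> 4096 -> 4096 -> 1000
vgg_like_head = fc_weights([7 * 7 * 512, 4096, 4096, 1000])

# The 13-input, 10-output layer from question 4
small_layer = fc_weights([13, 10])

print(vgg_like_head, small_layer)
```

Most of that total comes from the first layer alone, which is one reason this assignment avoids fully connected layers.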

8. It is a good idea to use ReLU as the activation function on the logits fed to softmax
 
 No! Are you kidding! Never! 
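
One way to see why: ReLU maps every negative logit to 0, so any classes with negative scores become indistinguishable before softmax ever sees them. A small sketch in plain Python with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu(values):
    return [max(0.0, v) for v in values]

logits = [-3.0, -1.0, -0.5]  # hypothetical case: all class scores are negative
plain = softmax(logits)
clipped = softmax(relu(logits))

print("softmax(logits):      ", [round(p, 3) for p in plain])
print("softmax(relu(logits)):", [round(p, 3) for p in clipped])
```

Without ReLU the third class is clearly preferred. With ReLU all three classes receive the same probability, and no gradient flows back through the clipped scores.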

9. Why is Softmax not a probability, but a likelihood? **(doubt)**
 * Because it is the measure of the features it has actually found!
 * Because not everything that sums to 1 is a probability.

10. Assume that we are using negative log-likelihood. Then in the image below:
 
 1.41058

11. In the BatchNormalization notes, you read "indirectly you have sort of already used it!". What do you think it means? 
 
 When we applied normalization to our images, that was very similar to what we do in batch normalization 
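
The similarity can be made concrete. Image normalization subtracts a fixed dataset mean and divides by a fixed standard deviation, while batch normalization does the same per channel with statistics computed from the current batch, then applies a learnable scale and shift. A minimal sketch for one channel in plain Python, with the learnable scale and shift left out and an illustrative helper name:

```python
import math

def standardize(values, eps=1e-5):
    """Normalize values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [0.5, 2.0, -1.0, 3.5]  # hypothetical activations of one channel in a batch
normalized = standardize(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(f"mean={out_mean:.6f}, var={out_var:.6f}")
```

The small `eps` keeps the division stable when a channel has almost no variance; `1e-5` matches the default of `nn.BatchNorm2d`.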

12. Select all that are true (context: dropout):
 
 * It is not recommended to use Dropout before the last prediction layer
 * In DropOut, we need to divide the input to a layer by 2 if dropout of 0.5 was used while training it.
 * DropOut is applied only during training. During test/validation, it is automatically removed.
 * If we actually have used dropout of 0.5 before the final layer, the training accuracy of a very well trained model will not cross 50% (assume it was hotdog-NotHotdog problem)
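
The scaling and train/test behaviour mentioned in these options can be sketched in plain Python. This is the "inverted" variant used by PyTorch's `nn.Dropout`: surviving activations are scaled by `1 / (1 - p)` during training, and the layer is the identity in evaluation mode. The "divide by 2" option appears to describe the other formulation, where the scaling is applied at test time instead. The function name and the `rng` parameter are illustrative.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: pass values through unchanged
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)
print("train:", train_out)
print("eval: ", eval_out)
```

Because of the rescaling, the expected value of each activation is the same in training and evaluation, so no extra correction is needed at test time.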










### That's all Folks!