## Session 2 Assignment:
1. Read the file carefully.
2. Add comments to all the cells, explaining exactly what each cell does (for your own good)!
3. In the cell where the main model is defined:
   1. Write the receptive field of each layer as a comment.
   2. Write the input channel dimensions.
4. Run each cell one by one.
5. Experiment.
6. Once you are done with your experiments, attempt the S2 Solution Quiz. You will have 45 minutes to answer questions about this code. You will also run the code once or twice within these 45 minutes.
7. Read the S2 Solution Quiz carefully before attempting it.

## Import libraries

In [None]:
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

## Model definition

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1) # input - 28*28*1, output - 28*28*32, RF - 3*3
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1) # input - 28*28*32, output - 28*28*64, RF - 5*5
        self.pool1 = nn.MaxPool2d(2, 2) # input - 28*28*64, output - 14*14*64, RF - 6*6
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1) # input - 14*14*64, output - 14*14*128, RF - 10*10
        self.conv4 = nn.Conv2d(128, 256, 3, padding=1) # input - 14*14*128, output - 14*14*256, RF - 14*14
        self.pool2 = nn.MaxPool2d(2, 2) # input - 14*14*256, output - 7*7*256, RF - 16*16
        self.conv5 = nn.Conv2d(256, 512, 3) # input - 7*7*256, output - 5*5*512, RF - 24*24
        self.conv6 = nn.Conv2d(512, 1024, 3) # input - 5*5*512, output - 3*3*1024, RF - 32*32
        self.conv7 = nn.Conv2d(1024, 10, 3) # input - 3*3*1024, output - 1*1*10, RF - 40*40

    def forward(self, x):
        x = self.pool1(F.relu(self.conv2(F.relu(self.conv1(x)))))
        x = self.pool2(F.relu(self.conv4(F.relu(self.conv3(x)))))
        x = F.relu(self.conv6(F.relu(self.conv5(x))))
        x = self.conv7(x) # Removed last RELU to get 95+ accuracy in first epoch
        x = x.view(-1, 10)
        return F.log_softmax(x, dim=1)  # dim=1 is the class dimension; omitting dim is deprecated
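
The receptive-field values for `Net` can be cross-checked with the standard recurrence `r_out = r_in + (k - 1) * jump_in`, where `jump` is the product of all earlier strides. The sketch below is plain Python and assumes only the kernel sizes and strides visible in the model definition; the `layers` list and the `receptive_fields` helper are illustrative names, not part of the notebook. Note that a common rule of thumb that doubles the receptive field at every max-pooling layer gives different numbers from this recurrence.

```python
# (name, kernel_size, stride) for each layer of Net, in forward order.
layers = [
    ("conv1", 3, 1),
    ("conv2", 3, 1),
    ("pool1", 2, 2),
    ("conv3", 3, 1),
    ("conv4", 3, 1),
    ("pool2", 2, 2),
    ("conv5", 3, 1),
    ("conv6", 3, 1),
    ("conv7", 3, 1),
]


def receptive_fields(layers):
    """Return [(name, rf)] using r_out = r_in + (k - 1) * jump_in."""
    rf, jump = 1, 1  # a single input pixel sees itself; no stride applied yet
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride             # later layers step over `jump` input pixels
        result.append((name, rf))
    return result


for name, rf in receptive_fields(layers):
    print(f"{name}: {rf}x{rf}")
```

Padding does not appear in the recurrence: it changes the output size, not how many input pixels each output pixel can see.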

## Model summary 

In [None]:
!pip install torchsummary
from torchsummary import summary
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = Net().to(device)
summary(model, input_size=(1, 28, 28))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 32, 28, 28]             320
            Conv2d-2           [-1, 64, 28, 28]          18,496
         MaxPool2d-3           [-1, 64, 14, 14]               0
            Conv2d-4          [-1, 128, 14, 14]          73,856
            Conv2d-5          [-1, 256, 14, 14]         295,168
         MaxPool2d-6            [-1, 256, 7, 7]               0
            Conv2d-7            [-1, 512, 5, 5]       1,180,160
            Conv2d-8           [-1, 1024, 3, 3]       4,719,616
            Conv2d-9             [-1, 10, 1, 1]          92,170
Total params: 6,379,786
Trainable params: 6,379,786
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 1.51
Params size (MB): 24.34
Estimated Total Size (MB): 25.85
-------------------------------------
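
The `Param #` column above can be reproduced by hand: a convolution layer has `k * k * in_channels * out_channels` weights plus one bias per output channel. The following is a small plain-Python check, assuming the layer shapes from the `Net` definition and 4-byte floats for the size estimate; `convs` and `conv_params` are illustrative names.

```python
# (in_channels, out_channels, kernel_size) for conv1..conv7 in Net.
convs = [
    (1, 32, 3), (32, 64, 3), (64, 128, 3), (128, 256, 3),
    (256, 512, 3), (512, 1024, 3), (1024, 10, 3),
]


def conv_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters in one k x k convolution layer."""
    weights = k * k * c_in * c_out
    return weights + (c_out if bias else 0)


per_layer = [conv_params(*c) for c in convs]
total = sum(per_layer)
size_mb = total * 4 / 1024 ** 2  # assumes 32-bit (4-byte) parameters

print(per_layer)
print(total)              # 6379786
print(round(size_mb, 2))  # 24.34
```

The max-pooling layers contribute no parameters, which is why they show `0` in the table.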



## Prepare and load the dataset

In [None]:
torch.manual_seed(1)
batch_size = 128

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.1307,), (0.3081,))
                    ])),
    batch_size=batch_size, shuffle=True, **kwargs)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Processing...
Done!




## Model training and validation

In [None]:
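from tqdm import tqdm

def train(model, device, train_loader, optimizer, epoch):
    model.train()  # switch the model to training mode
    pbar = tqdm(train_loader)  # wrap the loader to show a progress bar
    for batch_idx, (data, target) in enumerate(pbar):
        data, target = data.to(device), target.to(device)  # move the batch to the CPU/GPU
        optimizer.zero_grad()  # clear gradients left over from the previous step
        output = model(data)  # forward pass: log-probabilities for each class
        loss = F.nll_loss(output, target)  # negative log-likelihood loss
        loss.backward()  # backpropagate to compute gradients
        optimizer.step()  # update the weights
        pbar.set_description(desc=f'loss={loss.item()} batch_id={batch_idx}')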


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
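
Both `train` and `test` feed the `log_softmax` output of the model into `F.nll_loss`, which together amount to the usual cross-entropy loss. As a rough illustration of that arithmetic for a single sample, here is a plain-Python sketch; the functions `log_softmax` and `nll_loss` below are simplified stand-ins written for this example, not the PyTorch implementations, and the logits are made-up values.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]  # hypothetical raw scores for 3 classes
log_probs = log_softmax(logits)

# The exponentiated log-probabilities form a valid distribution.
print(sum(math.exp(lp) for lp in log_probs))

# The loss is small when the target has the largest score, large otherwise.
print(nll_loss(log_probs, 0), nll_loss(log_probs, 2))

# The prediction is the index of the largest log-probability (argmax).
print(max(range(len(log_probs)), key=lambda i: log_probs[i]))
```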

## Optimizer definition and running the model

In [None]:
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(1, 2):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)

loss=0.03553972393274307 batch_id=468: 100%|██████████| 469/469 [00:38<00:00, 12.06it/s]



Test set: Average loss: 0.0655, Accuracy: 9796/10000 (98%)



## S2 - Solution Quiz:
1. What is torch?

 An open source machine learning framework that accelerates the path from research prototyping to production deployment.

2. What is the purpose of adding padding = 1?
 
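 Padding of 1 adds one pixel on every side of the input (2 extra pixels along each of x and y), so that a 3x3 convolution keeps the spatial size unchanged, e.g. 28x28 stays 28x28.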
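
The effect of padding on the output size follows the usual formula `out = (in + 2 * padding - kernel) // stride + 1` (dilation of 1 assumed). A minimal plain-Python sketch, with `conv_out_size` as an illustrative helper name:

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


print(conv_out_size(28, 3, padding=1))  # 28: padding=1 preserves the size
print(conv_out_size(28, 3))             # 26: without padding we lose 2 pixels
print(conv_out_size(7, 3))              # 5: as in conv5 of Net
```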

3. What is that -1 in output shape when we call summary(model,input_size = (1,28,28))?
 
 * It refers to the batch size, which is not fixed when the summary is generated.
 * It refers to the dimension "outside" of what is given in `input_size`.

4. What is CUDA?
 
 CUDA is a parallel computing platform and application programming interface model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit for general purpose processing - an approach termed GPGPU.

5. What is a Tensor?
 
 * A tensor is a container that can hold data in N dimensions.
 * A tensor is NOT a matrix: matrices are specifically 2D, whereas tensors can be N-dimensional.
 * A tensor is an algebraic object that describes a linear mapping from one set of algebraic objects to another.

6. What is 0.1307 and 0.3081 in transforms.Normalize?

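 These are the mean and standard deviation of the MNIST training-set pixel values (the training split only, not the complete dataset), computed after `ToTensor()` has scaled the pixels to [0, 1].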
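
What `transforms.Normalize((0.1307,), (0.3081,))` does to each pixel is simply `(x - mean) / std`. The sketch below reproduces that arithmetic in plain Python for a few pixel values; `normalize` is an illustrative helper, not the torchvision function.

```python
MEAN, STD = 0.1307, 0.3081  # values passed to transforms.Normalize above


def normalize(pixel, mean=MEAN, std=STD):
    """Per-pixel normalization: shift by the mean, scale by the std."""
    return (pixel - mean) / std


print(round(normalize(0.0), 4))  # -0.4242 (black background pixel)
print(round(normalize(1.0), 4))  # 2.8215 (fully white pixel)
print(normalize(MEAN))           # 0.0 (a pixel at the dataset mean)
```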

7. What is the use of torch.no_grad()?
   
   * To perform inference, but without training.
   * To make sure test data does not "leak" into the model.
   * To perform inference without gradient calculation.

8. What is wrong with this model? Generally in 1 epoch we should be able to get 95%+, but here we do not? 
 
 Not attempted.
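 (Answer: the ReLU applied after the last convolution layer, which clamps negative class scores to zero right before `log_softmax`.)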

9. Only 1 change is required in this model such that it gets up to 97% within 1 epoch! What is that 1 change? 
 
 Not attempted.
 (Answer: remove the ReLU after the last convolution layer.)



## Quiz 2:
1. If we perform convolution with a kernel of size 3x3 on 47x49, the output size would be?
 
 45x47

2. Which of these are true, w.r.t. what we discussed in Session 2?

   * We always use a kernel with size 3x3.
   * We always use kernels with stride of 1.
   * We add as many layers as required to reach full image/object size.

3. How many 3x3 layers do we need to add to reach a receptive field of 21x21?

 10

4. Let us assume we have an image of size 100x100. What is the minimum number of convolution layers we need to add, such that:

   1. You cannot use max-pooling without convolving twice or more.
   2. The output is at least 2-3 convolution layers away from max-pooling.
   3. You can stop at either 2x2 or 1x1, depending on how you have used your layers.
   4. We always "do not consider" the last row and column of an odd-resolution channel while performing max-pooling.
   5. "Do not" count the max-pooling layers.
 
 10

 | Input size | Operation |
 |---|---|
 | 100x100 | 3x3 (conv) |
 | 98x98 | 3x3 (conv) |
 | 96x96 | 2x2 (maxpool) |
 | 48x48 | 3x3 (conv) |
 | 46x46 | 3x3 (conv) |
 | 44x44 | 2x2 (maxpool) |
 | 22x22 | 3x3 (conv) |
 | 20x20 | 3x3 (conv) |
 | 18x18 | 2x2 (maxpool) |
 | 9x9 | 3x3 (conv) |
 | 7x7 | 3x3 (conv) |
 | 5x5 | 3x3 (conv) |
 | 3x3 | 3x3 (conv) |
 | 1x1 | output |
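
The layer plan for the 100x100 image can be replayed in a few lines of Python. This is only a sketch under the quiz's stated assumptions (3x3 convolutions without padding, 2x2 max-pooling that drops the last row and column of odd sizes); the `trace` helper and the `plan` string encoding are made up for this example.

```python
def trace(size, plan):
    """Follow the spatial size through a plan of 'c' (3x3 conv) and 'p' (2x2 max-pool)."""
    sizes = [size]
    for op in plan:
        if op == "c":
            size -= 2    # unpadded 3x3 convolution removes a 1-pixel border
        else:
            size //= 2   # floor division ignores an odd last row/column
        sizes.append(size)
    return sizes


plan = "ccpccpccpcccc"  # two convs before each pool, four convs after the last pool
sizes = trace(100, plan)

print(sizes)            # [100, 98, 96, 48, 46, 44, 22, 20, 18, 9, 7, 5, 3, 1]
print(plan.count("c"))  # 10 convolution layers, pooling not counted
```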

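5. If the input has 128 channels, how many kernels do we need to add?

 The number of kernels does not depend on the number of input channels. Each kernel must have 128 channels to match the input, but how many kernels we add is a design choice that sets the number of output channels.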

6. Consider the following layers:

 ... | 49x49x256 | convolved with 512 kernels of size 3x3 | ...

 What is the total number of kernel parameters we just added?

 1179648 (3 x 3 x 256 x 512, not counting biases)

7. Consider this network

 | Input | Kernels |
 |---|---|
 | 400x400x3 | 32x(3x3x3) |
 | 398x398x32 | 64x(3x3x32) |
 | 396x396x64 | 128x(3x3x64) |
 | 394x394x128 | 256x(3x3x128) |
 | 392x392x256 | 512x(3x3x256) |
 | 390x390x512 | 1024x(3x3x512) |

 MaxPooling(2x2)...

 Assume this network is trained and we are doing inference on an image. Before we hit the max-pooling layer, how many channels of size more than 350x350 are there in the GPU RAM?

 2019
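
The count can be checked by listing every set of channels that exists before the max-pooling layer, including the 3 input channels and the output of the last convolution. The sketch below assumes unpadded 3x3 convolutions, as in the network above; the variable names are illustrative.

```python
out_channels = [32, 64, 128, 256, 512, 1024]  # kernels per convolution layer

size = 400
maps = [(size, 3)]  # the 400x400 RGB input contributes 3 channels
for channels in out_channels:
    size -= 2       # each unpadded 3x3 convolution shrinks the side by 2
    maps.append((size, channels))

# Count only channels whose spatial size is larger than 350x350.
total = sum(channels for side, channels in maps if side > 350)

print(maps[-1])  # (388, 1024): output of the last convolution
print(total)     # 2019
```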

8. What are a few advantages of using MaxPooling?
 
 * Reduction in Channel Size
 * Slight Rotational Invariance
 * Slight Translational Invariance

9. If we start with an image of 400x400 color, and during a model we use MaxPooling 4 times, reducing the image size to 400>200>100>50 (we used convs with padding, so convs did not reduce the image size), have we lost 4 times the information we started with? At 50x50 we have 1000 channels.
 
 No. Convolution and pooling operations do lose some information, but more importantly, they are "filtering" the information. We do not need the full information at the last layer, just the most important part. We are also scaling along the Z axis (from 3 to 1000 channels), and it is in this increase along the Z axis that we store the supposedly "lost" information.
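
A quick count of the stored values makes the point about the Z axis concrete. This is only back-of-the-envelope arithmetic using the sizes given in the question (a 400x400 color image in, 50x50 with 1000 channels out); it says nothing about which information is actually kept.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels volume."""
    return height * width * channels


input_values = num_values(400, 400, 3)     # the original color image
output_values = num_values(50, 50, 1000)   # the feature volume at 50x50

print(input_values)   # 480000
print(output_values)  # 2500000
print(output_values > input_values)  # True: more values despite the smaller spatial size
```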






### That's all Folks!