# Convolutional Neural Network
Adapt the CNN example for MNIST digit classification from Notebook 3A. Feel free to experiment with the model architecture and see how training time and performance change, but to begin, try the following:

Image ->
convolution (32 3x3 filters) -> nonlinearity (ReLU) ->
convolution (32 3x3 filters) -> nonlinearity (ReLU) -> (2x2 max pool) ->
convolution (64 3x3 filters) -> nonlinearity (ReLU) ->
convolution (64 3x3 filters) -> nonlinearity (ReLU) -> (2x2 max pool) ->
flatten -> fully connected (256 hidden units) -> nonlinearity (ReLU) ->
fully connected (10 output units) -> softmax

## Dimensions calculation
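See [Understanding and Calculating the Number of Parameters in Convolution Neural Networks (CNNs)](https://towardsdatascience.com/understanding-and-calculating-the-number-of-parameters-in-convolution-neural-networks-cnns-fc88790d530d) and [Determine the Number of Trainable Parameters in a Neural Network](https://aldozaimi.wordpress.com/2020/02/13/determine-the-number-of-trainable-parameters-in-a-neural-network/).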

In [2]:
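# Dimensions calculator: output width of a conv or pooling layer
def w_out(w_in, k, p, s=1):
    # w_in: input width, k: kernel size, p: padding, s: stride
    return (w_in - k + 2 * p) // s + 1  # floor division keeps the result an integer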
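
As a quick check of the formula, the sketch below traces a 28-pixel-wide MNIST image through two 3x3 convolutions with padding 1 and two 2x2 max pools, which is where the `7*7*64` input size of the first fully connected layer comes from. It assumes the pooling stride equals the kernel size (the `F.max_pool2d` default); `out_width` is a hypothetical stand-alone copy of the calculator so that the block runs on its own.

```python
# Sketch: trace the spatial width through the network (assumes square inputs).
def out_width(w_in, k, p=0, s=1):
    # Same formula as the calculator above, using floor division
    return (w_in - k + 2 * p) // s + 1

w = 28                          # MNIST images are 28x28
w = out_width(w, k=3, p=1)      # conv, 3x3, padding 1 -> 28
w = out_width(w, k=2, s=2)      # 2x2 max pool         -> 14
w = out_width(w, k=3, p=1)      # conv, 3x3, padding 1 -> 14
w = out_width(w, k=2, s=2)      # 2x2 max pool         -> 7
flat_features = w * w * 64      # 64 channels after the last conv layer
print(w, flat_features)         # 7 3136
```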

In [1]:
## Convolutional model
import torch.nn as nn
import torch.nn.functional as F  # provides F.relu and F.max_pool2d used in forward()

class MNIST_CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(7*7*64, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        # conv layer 1
        x = self.conv1(x)
        x = F.relu(x)
        x = F.max_pool2d(x, kernel_size=2)
        
        # conv layer 2
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, kernel_size=2)
        
        # fc layer 1
        x = x.view(-1, 7*7*64)
        x = self.fc1(x)
        x = F.relu(x)
        
        # fc layer 2
        x = self.fc2(x)
        return x
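
The class above uses two convolutional layers. For comparison, a minimal sketch of the four-convolution architecture suggested at the top of this notebook might look like the following. The class name `MNIST_CNN4` and the choice of `padding=1` are assumptions (the suggested architecture does not specify padding). The final softmax is left out because `nn.CrossEntropyLoss` applies log-softmax to the logits internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNIST_CNN4(nn.Module):  # hypothetical name, not from the original notebook
    def __init__(self):
        super().__init__()
        # padding=1 is an assumption; it keeps the 28x28 size through each conv
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(7 * 7 * 64, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), kernel_size=2)  # 28 -> 14
        x = F.relu(self.conv3(x))
        x = F.max_pool2d(F.relu(self.conv4(x)), kernel_size=2)  # 14 -> 7
        x = x.view(-1, 7 * 7 * 64)                              # flatten
        x = F.relu(self.fc1(x))
        return self.fc2(x)                                      # raw logits

# Shape check with a dummy batch of two blank images
logits = MNIST_CNN4()(torch.zeros(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```

To try it in the training cell, it should be enough to swap `MNIST_CNN()` for `MNIST_CNN4()` at the line marked "change here".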

Here I only use three [epochs](https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-epoch-in-machine-learning#:~:text=An%20epoch%20is%20when%20all,dataset%20takes%20around%20an%20algorithm.).

> An epoch is when all the training data is used at once and is defined as the total number of iterations of all the training data in one cycle for training the machine learning model. 
Another way to define an epoch is the number of passes a training dataset takes around an algorithm.
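
To make the definition concrete, the sketch below works out how many minibatches and optimizer steps three epochs correspond to, assuming the 60,000-image MNIST training set and the batch size of 100 used in this notebook. The variable names are illustrative only.

```python
# Sketch: how epochs, batches, and optimizer steps relate for this setup.
import math

n_train = 60000     # size of the MNIST training set
batch_size = 100    # batch size passed to the DataLoader
n_epochs = 3        # number of full passes over the training data

batches_per_epoch = math.ceil(n_train / batch_size)  # one optimizer step per batch
total_steps = batches_per_epoch * n_epochs
print(batches_per_epoch, total_steps)  # 600 1800
```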

In [5]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.notebook import tqdm, trange

# Load the data
mnist_train = datasets.MNIST(root="./datasets", train=True, transform=transforms.ToTensor(), download=True)
mnist_test = datasets.MNIST(root="./datasets", train=False, transform=transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=100, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=100, shuffle=False)

## Training
# Instantiate model  
model = MNIST_CNN()  # <---- change here

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # <---- change here


# Iterate through train set minibatches
for epoch in trange(3): 
    for images, labels in tqdm(train_loader):
        # Zero out the gradients
        optimizer.zero_grad()

        # Forward pass
        x = images  # <---- change here 
        y = model(x)
        loss = criterion(y, labels)
        # Backward pass
        loss.backward()
        optimizer.step()

## Testing
correct = 0
total = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images  # <---- change here 
        y = model(x)

        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct/total))

Test accuracy: 0.9873999953269958


## Short answer
1. How does the CNN compare in accuracy with yesterday's logistic regression and MLP models? How about training time?

In [7]:
old_accuracy = 0.914
new_accuracy = correct/total
accuracy_gain = new_accuracy - old_accuracy  # absolute difference, not a ratio
print("Accuracy increases by {} percentage points. However, training is a bit slower".format(accuracy_gain*100))

Accuracy increases by 7.340002059936523 percentage points. However, training is a bit slower


2. How many trainable parameters are there in the CNN you built for this assignment?

### CONV Layer
This is where the CNN learns, so the layer has weight matrices. To count the learnable parameters, multiply the filter width $w$, the filter height $h$, and the number of filters (*channels*) in the previous layer $d$, add $1$ for the bias term of each filter, and multiply the result by the number of filters (*channels*) in the current layer $k$. The number of parameters in a CONV layer is therefore $((w \cdot h \cdot d) + 1) \cdot k$. In words: **((filter width × filter height × number of filters in the previous layer) + 1) × number of filters**, where "number of filters" refers to the current layer.

### Fully connected
For the fully connected layers, the number of trainable parameters can be computed by $(n + 1) \times m$, where $n$ is the number of input units and $m$ is the number of output units. The $+1$ term in the equation takes into account the bias terms.

In [8]:
def learn_conv(w,h,d,k):
    return ((w*h*d)+1)*k
def learn_full(n,m):
    return (n+1)*m

In [12]:
c1 = learn_conv(3,3,1,32)
c2 = learn_conv(3,3,32,64)
f1 = learn_full(7*7*64,256)
f2 = learn_full(256,10)
total_parameters = c1+c2+f1+f2
old_parameters = 105681
parameters_ratio = total_parameters/old_parameters
print("It has {} parameters, which is {} times as many as the multilayer perceptron".format(total_parameters,parameters_ratio))

It has 824458 parameters, which is 7.8013834085597225 times as many as the multilayer perceptron
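
As a cross-check on the hand calculation, PyTorch can count the parameters directly. This sketch rebuilds four layers with the same shapes as `MNIST_CNN` (the `layers` container is only for illustration) and sums `numel()` over the trainable tensors. With the actual model instance, `model.parameters()` could be used the same way.

```python
import torch.nn as nn

# Same layer shapes as MNIST_CNN; the container exists only for counting
layers = nn.ModuleList([
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (3*3*1 + 1) * 32   = 320
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (3*3*32 + 1) * 64  = 18496
    nn.Linear(7 * 7 * 64, 256),                   # (3136 + 1) * 256   = 803072
    nn.Linear(256, 10),                           # (256 + 1) * 10     = 2570
])

n_params = sum(p.numel() for p in layers.parameters() if p.requires_grad)
print(n_params)  # 824458
```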


3\. When would you use a CNN versus a logistic regression model or an MLP?

Use MLPs For:

* Tabular datasets
* Classification prediction problems
* Regression prediction problems

They are very flexible and can be used generally to **learn a mapping from inputs to outputs**.

This flexibility allows them to be applied to other types of data. For example, the pixels of an image can be reduced down to one long row of data and fed into an MLP. The words of a document can also be reduced to one long row of data and fed to an MLP. Even the lag observations for a time series prediction problem can be reduced to a long row of data and fed to an MLP.
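
For images, "one long row of data" simply means flattening each image before the first linear layer. The sketch below shows this on a dummy MNIST-shaped batch; the hidden size of 128 is an arbitrary assumption for illustration, not the MLP from the earlier notebook.

```python
import torch
import torch.nn as nn

images = torch.zeros(100, 1, 28, 28)      # dummy batch shaped like MNIST
rows = images.view(images.size(0), -1)    # each image becomes one row of 784 pixels

# Hypothetical small MLP; the hidden size 128 is arbitrary
mlp = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
logits = mlp(rows)
print(rows.shape, logits.shape)  # torch.Size([100, 784]) torch.Size([100, 10])
```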

As such, if your data is in a form other than a tabular dataset, such as an image, document, or time series, I would recommend at least testing an MLP on your problem. The results can be used as a baseline point of comparison to confirm that other models that may appear better suited add value.

Use CNNs For:

* Image data
* Classification prediction problems
* Regression prediction problems

More generally, CNNs work well with data that has a spatial relationship.

The CNN input is **traditionally two-dimensional, a field or matrix**, but can also be changed to be one-dimensional, allowing it to develop an internal representation of a one-dimensional sequence.

This allows the CNN to be used more generally on other types of data that have a spatial relationship. For example, there is an ordered relationship between the words in a text document, and there is an ordered relationship between the time steps of a time series.
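
A one-dimensional convolution in PyTorch might be sketched as follows. The batch of four single-channel series of length 50 and the choice of 8 filters are assumptions made purely for illustration; `nn.Conv1d` expects input of shape (batch, channels, length).

```python
import torch
import torch.nn as nn

# Hypothetical input: 4 univariate time series, each 50 time steps long
series = torch.zeros(4, 1, 50)

# 8 filters of width 3 slide along the time axis; padding=1 preserves the length
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv(series)
print(features.shape)  # torch.Size([4, 8, 50])
```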

Although not specifically developed for non-image data, CNNs achieve state-of-the-art results on problems such as document classification used in sentiment analysis and related problems.

Use RNNs For:

* Text data
* Speech data
* Classification prediction problems
* Regression prediction problems
* Generative models

Recurrent neural networks are **not appropriate for tabular datasets** as you would see in a CSV file or spreadsheet. They are also not appropriate for image data input.
[source](https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/)