# MNIST Digit Recognizer with CNN

MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms.

The tradition holds here: I'm using MNIST as my hello world to computer vision. I'm going to try to explain some pieces of digit recognition, but since this is my first attempt ever, take everything with a grain of salt, because I'm not an expert in anything I'm talking about here.

We are going to split our notebook into three phases. In the **Preparation Phase**, we will prepare everything to train our model. In the **Training Phase**, we will train the model. And lastly, in the **Testing Phase**, we will test our model with test data and save the results.


# 1) Preparation Phase

In this phase, we will preprocess our data and make some visualizations to understand it better.

# Environment Setup

First, let's turn off the warnings and see what files we have.

In [None]:
# Don't show warnings for the sake of humanity.
import warnings
warnings.filterwarnings("ignore")

# Print files.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

So here we're going to use `train.csv` to train our model and then test it on `test.csv`. The significant difference between them is that, since this is a competition, labels are not available in `test.csv`. Our job is to develop our best model to recognize the digits in the test data as accurately as possible.

So let's visualize our dataset to see what we are talking about, but first, we need to import our libraries.

# Import Libraries

In this notebook, we are using PyTorch and its components.

In [None]:
# Common Libraries
import math
import pickle
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualization
import seaborn as sns

# PyTorch
import torch, torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import make_grid
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Read Data

We are using [Pandas](https://pandas.pydata.org/) to read the CSV files. The `read_csv` function returns a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) here. If you are new to Pandas, be sure to check the [getting started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) documentation.

In [None]:
train_dataframe = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_dataframe = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

# Visualize Raw Data

Let's visualize the raw data. In both train and test data, every row holds the flattened pixels of one image, from **pixel0** to **pixel783**, making 784 pixels (`28x28`). Unlike the test data, our train data also has the label of each image as its first column. During preprocessing, we will separate that label column from the pixel data.

We are also printing the shape of our dataset. We have **42000** training images and **28000** test images.

In [None]:
print(train_dataframe.shape)
train_dataframe.head()

In [None]:
print(test_dataframe.shape)
test_dataframe.head()

# Preprocess Data

We will do a couple of things here: separate the labels from the training data, convert everything to NumPy arrays, and reshape each image to `28x28`. Lastly, we will rescale the intensities from `0 ~ 255` to `0 ~ 1`.

Rescaling intensities into the range `0 ~ 1` helps our model converge faster. Note that PyTorch's `ToTensor` only rescales automatically when it receives `uint8` arrays or PIL images. Since we convert our data to `float32` here, `ToTensor` will return it without scaling, so we rescale it ourselves.

In [None]:
# Settings
img_size = 28

# Alias for our dataframes.
train_df = train_dataframe
test_df = test_dataframe

# Get numbers.
n_train = len(train_df)
n_test = len(test_df)
n_pixels = len(train_df.columns) - 1
n_class = len(set(train_df['label']))

# Separate data and label.
train_df_labels = train_df["label"]
train_df = train_df.drop(labels=["label"], axis=1)

# Convert to numpy.
X_train = train_df.to_numpy(dtype=np.float32).reshape((-1, img_size, img_size))
Y_train = train_df_labels.to_numpy(dtype=np.uint8)
X_test = test_df.to_numpy(dtype=np.float32).reshape((-1, img_size, img_size))

# Normalize intensities
X_train /= 255.
X_test /= 255.

# Print meta
print(f"Lengths: Train={n_train}, Test={n_test}")
print(f"Pixels={n_pixels}, Classes={n_class}")

# Plot the Histogram of the Classes

A histogram makes it easier to see the distribution of data. As you can see, our data has a balanced distribution of classes, which is a good thing. Keep in mind that in the real world, most of the data you collect will not be like this.

"Why is balanced data good?" you may ask. Think about it this way: what would happen if you showed the model one dog image for every ten cat images? The model would probably be biased towards predicting most images as cats, because most of the training time would be spent optimizing the parameters for cats. For this reason, it is important that the classes in the data are balanced.

In [None]:
# Visualize number of digits classes
value_counts = train_df_labels.value_counts()

plt.rcParams['figure.figsize'] = (8, 5)
plt.bar(value_counts.index, value_counts)
plt.xticks(np.arange(n_class))
plt.xlabel('Class', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.grid('on', axis='y')

print(value_counts.sort_index())

# Show Some Samples

Let's see some samples from our train data. To understand exactly what the code does, you may want to look at explanations of utilities like [unsqueeze](https://stackoverflow.com/a/65831759/1294887) or [reshape](https://numpy.org/devdocs/reference/generated/numpy.reshape.html). We choose eight random images from the train data and show them along with their labels as the title.

In [None]:
# Helper util to show given images as grid.
def show_grid(images, title):
    tensor = torch.Tensor(images).unsqueeze(1)
    grid = make_grid(tensor, nrow=8)
    
    plt.rcParams['figure.figsize'] = (16, 2)
    plt.imshow(grid.numpy().transpose((1,2,0)))
    plt.axis('off')
    plt.title(title);
    
# Take random selection from images.
selection = np.random.randint(n_train, size=8)
images = X_train[selection]
show_grid(images, list(Y_train[selection]))

# 2) Training Phase

In this phase, we first define all the needed pieces to train our model and then, train our model.


# DataSet & DataLoader

First we define our dataset. PyTorch provides two data primitives: `DataLoader` and `Dataset` that allow you to use pre-loaded datasets as well as your own data. **Dataset** stores the samples and their corresponding labels, and **DataLoader** wraps an iterable around the Dataset to enable easy access to the samples. ([read more](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)) DataLoader helps us to load data in batches. It also can shuffle data for you. 

In [None]:
class MNISTDataset(Dataset):
    # Initialization
    def __init__(self, X, Y=None, transform=None):
        self.x = X
        self.y = Y
        self.transform = transform
        self.n_samples = self.x.shape[0]
        
        if self.y is not None:
            self.y = torch.from_numpy(self.y).type(torch.LongTensor)
        
    # Indexing: dataset[index]
    def __getitem__(self, index):
        sample = self.x[index]
        
        if self.transform is not None:
            sample = self.transform(sample)
        
        if self.y is not None:
            sample = sample, self.y[index]
                
        return sample
    
    # Length: len(dataset)
    def __len__(self):
        return self.n_samples

Here we are applying some transformations to our data before using it. Transformations can be used to do a lot of things. PyTorch offers us some [common ones](https://pytorch.org/vision/stable/transforms.html), and we can also create our own if we need them.

Data augmentation is a set of techniques to artificially increase the amount of data by generating new data points from existing data. This includes making small changes to data or using deep learning models to generate new data points. For example, for images you can do this by padding, random rotating, re-scaling, vertical and horizontal flipping, translation (image is moved along X, Y direction), cropping, zooming, darkening & brightening/color modification, grayscaling, changing contrast, adding noise, or random erasing. ([read more](https://research.aimultiple.com/data-augmentation/))

There are two main methods for augmenting data. The first method is augmenting the data before training the model, as done in this [notebook](https://www.kaggle.com/code/cdeotte/25-million-images-0-99757-mnist/notebook). This way, you will have more data, and in each epoch, your model will train with this augmented data. In other words, if you have 500 images and you augment them to 1000 by applying random transformations, your model will see all of these 1000 images in each epoch. The second method does not enlarge the dataset beforehand; instead, it feeds augmented data into the model during training, as I did here. This second method is called **on the fly data augmentation**. It differs from the first method in that the images our model sees in each epoch are random and are derived on the fly by our DataLoader using our list of transformations.
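To make the second method more concrete, here is a minimal NumPy sketch of on the fly augmentation. The names `random_shift` and `AugmentedImages` are hypothetical and made up for this illustration; the notebook itself relies on torchvision transforms instead. The point is only that the stored data never grows, while every access returns a freshly perturbed copy.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift a 2D image by a random offset, filling the exposed border with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Part of the source image that stays inside the frame after shifting.
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted

class AugmentedImages:
    """Hypothetical dataset: stores the originals, augments on every access."""
    def __init__(self, images, max_shift=2, seed=0):
        self.images = images
        self.max_shift = max_shift
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        return random_shift(self.images[index], self.max_shift, self.rng)

# One 5x5 toy image with a single bright pixel in the center.
toy_image = np.zeros((5, 5), dtype=np.float32)
toy_image[2, 2] = 1.0
dataset = AugmentedImages(toy_image[None], max_shift=1)
sample = dataset[0]  # a randomly shifted copy; the stored image is untouched
```

The dataset still holds a single image, yet repeated calls to `dataset[0]` can return differently shifted versions of it, which is what happens across epochs.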

We have two sets of transformers: one for the train data and the other for the test data. For the train data, we will also do on the fly data augmentation via the transformers RandomRotation and RandomAffine. But for the test data, we will not do this. We want to train our model with various augmented images to increase diversity and decrease the risk of overfitting, but this is not necessary for testing. We are also shuffling the train data.

We are using the [ToTensor](https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html#torchvision.transforms.ToTensor) transformer to convert our image, which is currently a [NumPy array](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), into a [PyTorch Tensor](https://pytorch.org/docs/stable/tensors.html). The [RandomRotation](https://pytorch.org/vision/stable/generated/torchvision.transforms.RandomRotation.html#torchvision.transforms.RandomRotation) transformer rotates the image by a random angle, in either direction, up to the given maximum. It also fills the area outside the rotated image with 0 by default. [RandomAffine](https://pytorch.org/vision/stable/generated/torchvision.transforms.RandomAffine.html#torchvision.transforms.RandomAffine) transforms the image while keeping the center invariant. We are using it to shift the image positionally.

Lastly, we standardize ([z-score normalization](https://www.statology.org/z-score-normalization)) the image by applying the [Normalize](https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html#torchvision.transforms.Normalize) transformer. This helps our model converge faster, and outliers in our data no longer influence the model fit as much as they otherwise might. (Also see [Standardization vs Normalization](https://www.statology.org/standardization-vs-normalization/) and [How to use Normalize](https://stackoverflow.com/a/65679179/1294887) and [What are the numbers in torch.transforms.Normalize and how to select them?](https://stackoverflow.com/questions/65467621/what-are-the-numbers-in-torch-transforms-normalize-and-how-to-select-them)).
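As a small illustration of what standardization does, here is a sketch with NumPy on made-up random "images". The array `toy_pixels` and the names `toy_mean` and `toy_std` are assumptions for this example only, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 fake 28x28 images with intensities in [0, 1).
toy_pixels = rng.random((100, 28, 28)).astype(np.float32)

# One global mean and standard deviation over all pixels of all images.
toy_mean, toy_std = toy_pixels.mean(), toy_pixels.std()

# Z-score: subtract the mean, divide by the standard deviation.
standardized = (toy_pixels - toy_mean) / toy_std
```

After this step, `standardized` has a mean close to 0 and a standard deviation close to 1, which is the same arithmetic `Normalize((mean,), (std,))` applies per channel.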

In [None]:
# Hyperparameters
n_batchs = 64

# Decide sizes for validation and training data sets.
n_val = math.floor(n_train * 0.1)
n_train = n_train - n_val

# Find out the global mean and std of our images to use in standardization.
X = np.concatenate([X_train, X_test])
mean, std = X.mean(), X.std()

# Transformers of train data.
train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomRotation(20),
    transforms.RandomAffine(0, translate=(0.1, 0.1)),
    transforms.Normalize((mean,), (std,)),
])

# Transformers of test data.
test_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((mean,), (std,)),
])

# Split data into training and validation.
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=n_val, shuffle=True)

# Create datasets.
train_dataset = MNISTDataset(X_train, Y_train, train_transforms)
val_dataset = MNISTDataset(X_val, Y_val, test_transforms)
test_dataset = MNISTDataset(X_test, transform=test_transforms)

# Create loaders.
train_loader = DataLoader(dataset=train_dataset, batch_size=n_batchs, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=n_batchs, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=n_batchs, shuffle=False)

Another thing to consider here is that we are splitting our data into three; **train**, **validation**, and **test**. Our model will train with `train` data, and we'll check how training goes with `validation` data. If training accuracy improves but validation accuracy does not, our model is [overfitting](https://www.v7labs.com/blog/overfitting). If this is the case, we will stop training. Lastly, our model will not see `test` data while training; we'll use it as real-world data and check the success of our model with it afterwards.
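The "stop training when validation stops improving" idea can be sketched as a tiny helper. This is a hypothetical function for illustration, assuming we keep a list of validation accuracies with one entry per epoch; the name `should_stop` and the `patience` parameter are not part of the notebook.

```python
def should_stop(val_accuracies, patience=3):
    """Return True when the best validation accuracy is at least `patience` epochs old."""
    if len(val_accuracies) <= patience:
        return False
    best_epoch = val_accuracies.index(max(val_accuracies))
    epochs_since_best = len(val_accuracies) - 1 - best_epoch
    return epochs_since_best >= patience

# Validation accuracy peaked at the second epoch and has not improved since.
history = [0.90, 0.95, 0.94, 0.94, 0.93]
stop_now = should_stop(history, patience=3)  # True
```

A small `patience` stops early and may miss late improvements; a larger one tolerates noisy validation curves at the cost of extra epochs.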

# Network Architecture

Now we come to the part where we create our [network architecture](https://stats.stackexchange.com/a/291482). The thing is, I did not come up with the architecture below; I took it from [this](https://www.kaggle.com/code/juiyangchang/cnn-with-pytorch-0-995-accuracy/notebook) notebook since I don't know the finer points of creating architectures yet. Notice that this network has two main layer groups.
    
<div style="width:100%; text-align:center; padding:10px 0 20px">
    <img align=middle src="https://i.imgur.com/tAjRW2M.jpg" />
</div>

In Machine Learning, the process of learning from data consists of several steps. Some of the early steps are [feature selection](https://en.wikipedia.org/wiki/Feature_selection) and [feature extraction](https://en.wikipedia.org/wiki/Feature_extraction). In feature selection, we get rid of the features that are not so helpful for our goal. For example, if we are trying to predict the incomes of individuals, their weight or height info probably won't help us. In feature extraction, we transform our data so that, while the number of features is reduced, we still keep as much info as possible. For example, converting an image to grayscale is one kind of reduction. This data reduction helps build the model with less machine effort and speeds up the learning and generalization steps in the machine learning process. (See: [Feature Extraction in Image Processing](https://www.mygreatlearning.com/blog/feature-extraction-in-image-processing/)).

The thing with Deep Learning is that it does feature selection and feature extraction by itself in contrast to other Machine Learning approaches. This also helps us with assumptions and biases we may have with data. Doing these steps with wrong assumptions or biases may do more harm than good to the learning process. We may get rid of some necessary features because we can't see the relations. This is one of the reasons that Deep Learning is being preferred.

<div style="width:100%; text-align:center; padding:10px 0 20px">
    <img align=middle src="https://i.imgur.com/32vxBrO.png" />
</div>

Feature selection and extraction could be made with convolutional layers for image classification. This is why it is our first part of the network. Then, when we have our features, we need to figure out what the combination of features means; this is our classification layer. The classification layer will take features from the feature extraction layer and will learn to classify them for us to come up with the final result.

**Feature extraction layers** are the main building blocks of a CNN. Each one contains a set of filters (or kernels) whose parameters are learned during training. The filters are usually smaller than the actual image. Each filter convolves with the image and creates an activation map.

These words sound cool but are hard to understand if you are new to convolutional layers. I'll give you a rough idea of the concept of convolutional layers here, but will not explain it in depth. Instead, I'll cite a few sources that are sure to explain it better than I can. (Recommended resources: [Stanford](https://www.youtube.com/watch?v=bNb2fEVKeEo), [MIT](https://www.youtube.com/watch?v=AjtX1N_VT9E), [Towards Data Science](https://towardsdatascience.com/basics-of-the-classic-cnn-a3dce1225add), [Machine Learning Mastery](https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/)). With a bit of research, you can also find many resources on the Internet that explain this topic quite well, both intuitively and theoretically.

<div style="width:100%; text-align:center; padding:10px 0 20px">
    <img align=middle src="https://i.imgur.com/h5qCaIp.png" />
</div>

In image processing, there are things that are called **kernels**; when applied to an image, they will transform it in a desirable way. For example, they may find edges of an image for you. If you are curious, you can search for *Sobel Filter* or *Canny Edge Detection*, or take a look at my image processing lecture notebooks where I implemented the sobel filter and the canny edge detection. ([Sobel Filter Notebook](https://github.com/ramesaliyev/image-processing-101/blob/master/03-%20edge%20detection/magnitude.ipynb), [Canny Edge Detection Notebook](https://github.com/ramesaliyev/image-processing-101/blob/master/03-%20edge%20detection/canny.ipynb)).

Since our images are just matrices of numbers, these "kernels" are also just matrices, pretty small compared to the images. You can see a demonstration above: on the left, we have our source, aka the image, and a 3x3 kernel to its right. The kernel slides over our image and multiplies its values with the values of the image patch it sits on. The sum of all these products then becomes one pixel of our output image (rightmost). This output image is called the **activation map**. Each convolutional layer can consist of **n** kernels, which will produce **n** activation maps. So this is basically how convolution works for images.
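The sliding-and-summing operation described above can be written in a few lines of NumPy. This is a minimal sketch with no padding and a stride of 1; `convolve2d` is a name chosen for this example. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the activation map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with the patch it currently sits on, then sum.
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

# A 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

# Sobel kernel that responds to left-to-right intensity changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

activation_map = convolve2d(image, sobel_x)
# Each row is [4, 4, 0]: strong response on the edge, none on the flat bright area.
```

The 5x5 input shrinks to a 3x3 activation map because no padding is used.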

If we know what we are looking for in the picture, we can develop and use kernels to reveal these parts. But as our problems become more complex, it becomes harder to find these kernels ourselves. This is where CNN comes into play. Simply put, what a CNN does is automatically find the kernels that should be used according to the meaning we're trying to take out of the image (or data in general). The training of CNN can be summarized as finding these kernels by the model. These kernels are also called filters in this context.

So now we know the purpose of our feature extraction layers. Our data (images) have some features that will help us achieve our goal if used. But we don't know these features. We want our model to learn them from our data. The feature extraction layers extract the features that capture the most distinguishable aspects of our inputs (images).

Once we have our most distinguishable features, we need to classify them somehow so that they point to our output classes. So we ask, when we have this set of features, what do they mean? You may expect a digit to be 8 when it has two circles. But what if it has one circle? Is it 6, 9, or 0? What if it does not contain any circle at all? So this is the job of the **classifier layer**; it'll take a look at the features and learn to find out our digits from them.

In [None]:
class Net(nn.Module):    
    def __init__(self, n_class, img_size):
        super(Net, self).__init__()
          
        out_filter_size = img_size // (2*2)    
        
        # Feature Extraction Layer Group
        self.features = nn.Sequential(
            # Convolutional Layer 1
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            
            # Convolutional Layer 2
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Convolutional Layer 3
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            
            # Convolutional Layer 4
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
          
        # Classification Layer Group
        self.classifier = nn.Sequential(
            # Fully Connected Layer 1
            nn.Dropout(p = 0.5),
            nn.Linear(64 * (out_filter_size**2), 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            
            # Fully Connected Layer 2
            nn.Dropout(p = 0.5),
            nn.Linear(512, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            
            # Fully Connected Layer 3
            nn.Dropout(p = 0.5),
            nn.Linear(512, n_class),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        
        return x 

As you can see, on the forward pass, we first feed the input into the feature extraction layers and then into the classifier. Our feature extraction layer group has 4 convolutional layers, and our classifier layer group has 3 fully connected layers.

<div style="width:100%; text-align:center; padding:10px 0 20px">
    <img align=middle src="https://i.imgur.com/5eiVHN9.png" />
</div>

Let's now examine this model step by step and see what each layer does. (To develop better intuition about how these layers work, check great visualizations in [tensorspace.org](https://tensorspace.org/), especially visualization of the [lenet](https://tensorspace.org/html/playground/lenet.html) would be helpful. Also [this notebook](https://www.kaggle.com/code/rprkh15/deep-learning-visualizing-cnn-layers) has some great visualizations.)

<hr/>

## Conv2d

We already know what a convolution layer does. Since we are operating on images, our convolutional layers are 2-dimensional. Here we define 4 convolutional layers; their outputs (activation maps) will be processed by other layers, and we will soon see what those do. The parameters of [Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) are:

- `in_channels` The number of channels in the input image. Since our images are grayscale, our first conv2d layer accepts 1 channel. The subsequent conv2d layers accept outputs (activation maps) from earlier conv2d layers. So channels for the next ones are filter counts of the previous ones. 
- `out_channels` Number of channels produced by the convolution layer, aka kernel/filter count.
- `kernel_size` Size of the convolving kernel. With bigger kernels, more neighbours of each pixel contribute to the result.
- `stride` We said that the kernel slides over the image; stride is the amount of movement between applications of the filter to the input image, and it is almost always symmetrical in the height and width dimensions.
- `padding` Padding added to all four sides of the input. Padding is the number of pixels added around an image when it is being processed by the kernel of a CNN. Why? Let's say you have a 3x3 kernel. When you apply it to an image, the kernel cannot be centered on the 1px border of the image, so the output shrinks. To solve this, we pad all four sides of the image with zeros (black).

See [A Gentle Introduction to Padding and Stride for Convolutional Neural Networks](https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/) for more visual explanations about stride and padding.
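To see how `kernel_size`, `stride`, and `padding` interact, the sketch below applies the standard output-size formula, `(size + 2 * padding - kernel_size) // stride + 1`, to the layers of the network above. The helper name `conv_output_size` is made up for this example, and the formula assumes a dilation of 1.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # conv 1 -> 28
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # conv 2 -> 28
size = conv_output_size(size, kernel_size=2, stride=2)             # max pool -> 14
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # conv 3 -> 14
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # conv 4 -> 14
size = conv_output_size(size, kernel_size=2, stride=2)             # max pool -> 7

# 64 activation maps of 7x7 are flattened before the first Linear layer.
flattened = 64 * size * size  # 3136
```

This matches `out_filter_size = img_size // (2*2)` in the network: the padded 3x3 convolutions keep the size, and each of the two pooling layers halves it.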

But why do we have 4 sets of convolutional layers? The first convolutional layer takes our image as its input and produces low-level features like edges, lines, etc. The second convolutional layer takes the output of the first convolutional layer and produces more complex features like semi-circles, squares, etc. As we go through the network, our activation maps represent more and more complex features. ([read more](https://towardsdatascience.com/basics-of-the-classic-cnn-a3dce1225add))

Another benefit we get from convolutional layers is that they have fewer parameters, so they are faster to optimize. Consider the first layer: our image is 28x28, which means it has 784 pixels. If we were to use a fully connected layer here, every single neuron would need 784 weights. A 3x3 kernel, on the other hand, has only 9 weights. All 32 kernels of the first layer together have 288 weights (ignoring biases), which is about one-third of what even a single fully connected neuron would need.
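The arithmetic above can be checked with two small helpers. Both function names are hypothetical, and both count weights only, leaving biases out.

```python
def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Weights of a square-kernel 2D convolution (biases excluded)."""
    return in_channels * out_channels * kernel_size * kernel_size

def linear_weight_count(in_features, out_features):
    """Weights of a fully connected layer (biases excluded)."""
    return in_features * out_features

first_conv = conv2d_weight_count(1, 32, 3)       # 288
one_neuron = linear_weight_count(28 * 28, 1)     # 784
dense_32 = linear_weight_count(28 * 28, 32)      # 25088
second_conv = conv2d_weight_count(32, 32, 3)     # 9216
```

A fully connected layer with 32 neurons on the raw image would need 25,088 weights, compared to 288 for the first convolutional layer with its 32 kernels.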

Lastly, convolutional layers take relations between pixels into account. Every neighbour of the pixel - within the kernel size - contributes to the result, which means convolutional filters benefit from spatial context.

<hr/>

## BatchNorm

Batch normalization rescales a layer's outputs to have mean 0 and variance 1 across the batch. Scaling the outputs this way helps the network train faster. It also reduces problems due to poor parameter initialization. It is similar to the normalization/standardization of the inputs, except that here we apply it to the outputs (activations) of the layers inside the network on each forward pass. ([read more](https://towardsdatascience.com/what-is-batch-normalization-46058b4f583))

> Specifically, batch normalization normalizes the output of a previous layer by subtracting the batch mean and dividing by the batch standard deviation.

We are using `BatchNorm1d` to implement batch normalization on linear outputs and `BatchNorm2d` for 2D outputs, the filtered images from convolutional layers.
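The core of the operation quoted above fits in a few lines of NumPy. This is a simplified sketch of the training-time computation only: the learnable scale and shift parameters and the running statistics that PyTorch's batch norm layers also keep are left out, and `batch_norm` is a name chosen for this example.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) with the mean and variance of the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # eps avoids division by zero for features that are constant in the batch.
    return (x - mean) / np.sqrt(var + eps)

# A batch of 3 samples with 2 features on very different scales.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
normalized = batch_norm(batch)
```

After normalization, both columns have a mean of about 0 and a standard deviation of about 1, even though they started on very different scales.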

Benefits of batch normalization are:
- Train faster.
- Use higher learning rates.
- Parameter initialization is easier.
- Makes activation functions viable by regulating the inputs to them.
- Better results overall.
- It adds noise, which reduces overfitting through a regularization effect. So make sure to use less dropout when you apply batch normalization, as dropout itself adds noise.

<hr/>

## ReLU

The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance. ([read more](https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/))
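In code, ReLU and its derivative are one-liners. The sketch below uses NumPy, and the function names are chosen for this example.

```python
import numpy as np

def relu(x):
    """Pass positive values through unchanged, replace the rest with zero."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(x.dtype)

values = np.array([-2.0, 0.0, 3.0])
activated = relu(values)       # [0., 0., 3.]
gradients = relu_grad(values)  # [0., 0., 1.]
```

For positive inputs the derivative is exactly 1, so the gradient passes through the activation without being scaled down.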

Another benefit of the ReLU is it helps with the [Vanishing Gradients Problem](https://machinelearningmastery.com/how-to-fix-vanishing-gradients-using-the-rectified-linear-activation-function/) which can be caused by [Saturating Gradient Problem](https://datascience.stackexchange.com/questions/27665/what-is-saturating-gradient-problem).

> A problem with training networks with many layers (e.g. deep neural networks) is that the gradient diminishes dramatically as it is propagated backward through the network. The error may be so small by the time it reaches layers close to the input of the model that it may have very little effect. As such, this problem is referred to as the “**vanishing gradients**” problem.

> In fact, the error gradient can be unstable in deep neural networks and not only vanish, but also explode, where the gradient exponentially increases as it is propagated backward through the network. This is referred to as the “**exploding gradient**” problem.

<hr/>

## MaxPool2d

Pooling layers are used to reduce the size of images in a CNN and to compress the information down to a smaller scale. Max Pooling is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. ([read more](https://paperswithcode.com/method/max-pooling))

<div style="width:100%; text-align:center; padding:10px 0 30px">
    <img align=middle src="https://i.imgur.com/rnNpDeQ.png" />
</div>

Parameters of [MaxPool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) are:
- `kernel_size` The size of the window to take a max over. We are using 2x2 windows here, so we take the brightest pixel (the maximum value) from each 2x2 square of the activation map.
- `stride` This has the same meaning as the one in Conv2d.
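A 2x2 max pooling with a stride of 2, as used in this network, can be sketched with a NumPy reshape. The function name `max_pool_2x2` is made up for this example, and it handles a single 2D activation map.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D map by keeping the max of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # Drop an odd last row/column, then group pixels into 2x2 blocks.
    blocks = feature_map[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
pooled = max_pool_2x2(feature_map)
# [[6, 8],
#  [3, 4]]
```

The 4x4 map becomes 2x2, and each output value is the largest value of the corresponding 2x2 square.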

<hr/>

## Dropout

From the paper:
> In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units. To observe this effect directly, we look at the first level features learned by neural networks trained on visual tasks with and without dropout. ([read more](https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf))

So the idea is that when the whole network learns together, its parts also become dependent on each other. This dependence means that not every part of the network is doing its best: layers or neurons fix each other's mistakes. But they do this for a given set of data, the training data. So when they see data they have not seen before, the test data, all this dependence makes things worse. To prevent the parts of the network from becoming dependent on each other, we *turn off* some units in it, with some probability, in each training forward pass. This prevents components (layers, neurons etc.) from relying on the presence of other components and pushes each one to do its best.
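The *turning off* can be sketched as a random mask. This is a minimal NumPy version of so-called inverted dropout, where the surviving values are scaled by `1 / (1 - p)` during training so that the expected output stays the same; PyTorch's `nn.Dropout` applies the same scaling. The function name and the toy input are assumptions for this example.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each element with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x  # dropout is disabled in evaluation mode
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped = dropout(activations, p=0.5, rng=rng)
```

With `p=0.5`, roughly half of the values become 0 and the rest become 2, so the average stays close to 1. This is also why `model.eval()` matters: in evaluation mode nothing is dropped.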

<hr/>

## Linear

`Linear(n, m)` is a module that creates a single-layer feed-forward network with **n** inputs and **m** outputs. Mathematically, this module computes the linear equation **y = xAᵀ + b**, where x is the input, y is the output, A is the weight matrix, and b is the bias. This is where the name **Linear** comes from. ([read more](http://www.sharetechnote.com/html/Python_PyTorch_nn_Linear_01.html))
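As a sketch of what the layer computes, here is the same operation in NumPy with made-up numbers. The weight matrix has shape `(m, n)`, which is also how `nn.Linear(n, m)` stores its weights; the function name `linear` and the values are assumptions for this example.

```python
import numpy as np

def linear(x, weight, bias):
    """Compute y = x @ weight.T + bias for a batch of inputs."""
    return x @ weight.T + bias

x = np.array([[1.0, 2.0, 3.0]])        # 1 sample with n = 3 inputs
weight = np.array([[1.0, 0.0, -1.0],   # m = 2 outputs, n = 3 inputs
                   [0.5, 0.5, 0.5]])
bias = np.array([10.0, 20.0])

y = linear(x, weight, bias)  # [[8., 23.]]
```

Each output is a weighted sum of all inputs plus a bias, which is why this kind of layer is also called fully connected.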

<hr/>

# Training Loop for an Epoch

Here we have our `train` function, which will train the given model for an epoch. To understand details of this loop and basics of Deep Learning with PyTorch take a look at [Deep Learning with PyTorch Course](https://www.youtube.com/watch?v=c36lUUr864M). 

In [None]:
# Configurations.
log_interval = (n_train // 3) // n_batchs

# Train with an epoch.
def train(epoch, model, optimizer, criterion, lrate_scheduler, train_loader):
    # Set model mode to train.
    model.train()
    
    # Loop through each batch.
    for batch_idx, (data, target) in enumerate(train_loader):
        # Use GPU if available.
        if torch.cuda.is_available():
            data = data.cuda()
            target = target.cuda()
        
        # Forward pass.
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        
        # Backward pass.
        loss.backward()
        optimizer.step()
    
        # Log
        if (batch_idx+1) % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    
    # Decay learning rate
    lrate_scheduler.step()

# Evaluation Loop

Here in the evaluation loop, we calculate and print some statistics for given data by running it through our model. `y_true` is the correct classes of inputs, `y_pred` is the classes that our model predicted for given inputs. `loss` is the average loss for given data and `acc` is accuracy, the ratio of our correct predictions to all inputs.

In [None]:
def evaluate(model, data_loader, collect_results=False):
    # Set model mode to evaluation.
    model.eval()
    
    # Collect predicted and true classes.
    y_true = []
    y_pred = []
        
    # Keep loss, correct predictions and accuracy for epoch.
    loss = 0
    correct = 0
    acc = 0
    
    data_len = len(data_loader.dataset)
    
    # Don't calculate gradients.
    with torch.no_grad():
        # Loop through each batch.
        for data, target in data_loader:
            # Use GPU if available.
            if torch.cuda.is_available():
                data = data.cuda()
                target = target.cuda()
    
            # Forward pass.
            output = model(data)

            # Calculate loss (summed over the batch; averaged after the loop).
            loss += F.cross_entropy(output, target, reduction='sum').item()
            
            # Calculate correct predictions (as a plain Python int).
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

            # Collect results.
            if collect_results:
                y_true.append(target.tolist())
                y_pred.append(output.tolist())
            
        # Average loss.
        loss /= data_len
        acc = correct / data_len

        print('Average loss: {:.4f}, Accuracy: {}/{} ({:.3f}%)'.format(
            loss, correct, data_len, 100. * acc))
        
    return acc, loss, y_true, y_pred

# Early Stopping

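We want training to stop when training accuracy keeps increasing but validation accuracy does not, because that is a sign of overfitting. So let's implement a basic class to handle this. It watches a single value (here, validation accuracy) and signals a stop once that value has failed to improve for `threshold` consecutive epochs, rolling the model back to its best weights.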

In [None]:
import copy

class EarlyStopping:
    def __init__(self, model, threshold):
        self.model = model
        self.threshold = threshold
        self.counter = 0
        self.max_value = 0
        self.best_model = None
        
    # On call, check the threshold and return the decision.
    def __call__(self, value):
        if value > self.max_value:
            # Persist a copy of the current state if the value is better.
            # (state_dict() returns references to the live tensors, so
            # without a copy the snapshot would change as training continues.)
            self.best_model = copy.deepcopy(self.model.state_dict())
            self.max_value = value
            self.counter = 0
        else:
            # Otherwise, count an epoch without improvement.
            self.counter += 1

            # If the threshold is reached;
            if self.counter >= self.threshold:
                # Roll back to the best version.
                self.model.load_state_dict(self.best_model)
                # Signal early stopping.
                return True
        
        # Don't stop.
        return False
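
To see how the counter behaves, here is a small stand-alone trace of the same patience logic on a made-up sequence of validation accuracies. The `stopping_epoch` helper and the numbers are hypothetical and exist only for this illustration; they are not part of the notebook's pipeline.

```python
def stopping_epoch(values, threshold):
    """Return the index at which early stopping would trigger, or None."""
    best = 0
    counter = 0
    for epoch, value in enumerate(values):
        if value > best:
            # New best value: remember it and reset the counter.
            best = value
            counter = 0
        else:
            # No improvement: count it and stop once the threshold is reached.
            counter += 1
            if counter >= threshold:
                return epoch
    return None

# Hypothetical validation accuracies: 0.95 is first reached at epoch 2.
val_accs = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.96]

print(stopping_epoch(val_accs, threshold=3))
print(stopping_epoch(val_accs, threshold=4))
```

Note that matching the best value does not reset the counter, since the comparison is strict. With a threshold of 3 the run stops before reaching the later, better value of 0.96, which is the usual trade-off when choosing the threshold.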

# Training or Loading Model

Since training the model takes some time, I don't want to retrain it every time I make a change while organising this notebook. Because of this, I came up with a `load_model()` function, which tries to load the model and additional data from disk; if it can't, it calls the `train_model()` function to train our model for all the epochs and then saves the model and the additional data to disk. `load_model()` also has a `force` parameter, which skips the check for saved files and always trains the model. This is useful when submitting the notebook for a score.

In [None]:
# Hyperparameters
epochs = 50
learning_rate = 0.003
lrs_step_size = 7
lrs_gamma = 0.1

def train_model():
    # Accuracy/Loss Data
    accuracies = {'train': [], 'val': []}
    losses = {'train': [], 'val': []}

    # Initialize components.
    model = Net(n_class, img_size)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    lrate_scheduler = lr_scheduler.StepLR(optimizer, step_size=lrs_step_size, gamma=lrs_gamma)
    early_stop = EarlyStopping(model, 10)
    
    # Use GPU if available.
    if torch.cuda.is_available():
        model = model.cuda()
        criterion = criterion.cuda()

    print("Training:")
    print(f"Sample sizes: train={n_train} val={n_val} test={n_test}\n")

    # Train 
    for epoch in range(epochs):
        # Train for an epoch.
        train(epoch, model, optimizer, criterion, lrate_scheduler, train_loader)
        
        # Calculate epoch statistics for train and validation data.
        print()
        train_acc, train_loss, _, _ = evaluate(model, train_loader)
        val_acc, val_loss, _, _ = evaluate(model, val_loader)
        print()
        
        # Keep record of epoch statistics.
        accuracies['train'].append(train_acc)
        losses['train'].append(train_loss)
        accuracies['val'].append(val_acc)
        losses['val'].append(val_loss)

        if early_stop(val_acc):
            print(f"Early stopping at epoch: #{epoch}")
            break
    
    return model, {
        'accuracies':accuracies,
        'losses': losses
    }
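
With `StepLR`, the learning rate is multiplied by `gamma` every `step_size` epochs. Assuming the scheduler is stepped once per epoch, as `train` does, the schedule for the hyperparameters above can be sketched in plain Python. The `lr_at_epoch` helper is hypothetical and only mirrors the closed-form rule.

```python
def lr_at_epoch(epoch, base_lr=0.003, step_size=7, gamma=0.1):
    """Learning rate in effect during a given (0-based) epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 6, 7, 14, 21):
    print(epoch, lr_at_epoch(epoch))
```

By epoch 21 the rate has already dropped to about 3e-6, so later epochs change the weights very little.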

In [None]:
def load_model(trainer, force=False):
    # Paths of data.
    model_path = '/kaggle/working/model.pth'
    meta_path = '/kaggle/working/meta.pickle'
    
    model = None
    meta = None
    
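    # Try the saved files unless a retrain is forced.
    if not force:
        try:
            # Try to load model.
            model = torch.load(model_path)
            # Try to load meta.
            with open(meta_path, 'rb') as file:
                meta = pickle.load(file)
        except Exception:
            # Missing or unreadable files: fall back to training.
            model, meta = None, None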
        
    # Either files are missing or forced.
    if model is None:
        # Train.
        model, meta = trainer()
        # Save model.
        torch.save(model, model_path)
        # Save meta.
        with open(meta_path, 'wb') as file:
            pickle.dump(meta, file, pickle.HIGHEST_PROTOCOL)
        
    # Either way return the results.
    return model, meta

In [None]:
# Load or train the model.
model, meta = load_model(train_model, force=True)

# Get values from meta.
accuracies = meta['accuracies']
losses = meta['losses']

This concludes our training phase. Now let's check the results with some visualizations.

# 3) Prediction Phase

In this phase, we'll make some visualizations and save our predictions for the test data so we can submit them later.

# Plot Loss and Accuracy



In [None]:
train_accs = np.array(accuracies['train'])
train_losses = np.array(losses['train'])
val_accs = np.array(accuracies['val'])
val_losses = np.array(losses['val'])

length = train_accs.shape[0]
fig, ax = plt.subplots(2, 1, figsize=(15,10))

ax[0].plot(accuracies['train'], label="train")
ax[0].plot(accuracies['val'], label="val")
ax[0].legend()
ax[0].set_title('Accuracies');

ax[1].plot(losses['train'], label="train")
ax[1].plot(losses['val'], label="val")
ax[1].legend()
ax[1].set_title('Losses');

# Confusion Matrix

A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) allows us to see which classes our model confuses with one another. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

In [None]:
acc, loss, y_true, y_pred = evaluate(model, val_loader, collect_results=True)

y_true = np.concatenate(y_true).ravel()
y_pred = np.concatenate(y_pred).argmax(axis=1)
conf_matrix = confusion_matrix(y_true, y_pred)

df_cm = pd.DataFrame(conf_matrix)
plt.figure(figsize = (15,7))
sns.heatmap(df_cm, annot=True);
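
A tiny worked example of the row/column convention, using made-up labels for three classes. It builds the matrix by hand with NumPy so the layout is explicit; `sklearn.metrics.confusion_matrix` uses the same orientation (rows are true classes, columns are predicted classes). The names `example_true` and `example_pred` are illustrative only.

```python
import numpy as np

# Made-up labels for illustration only.
example_true = np.array([0, 0, 1, 1, 2, 2])
example_pred = np.array([0, 1, 1, 1, 2, 0])

n_classes = 3
matrix = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(example_true, example_pred):
    matrix[actual, predicted] += 1  # row = actual class, column = predicted class

print(matrix)
```

Correct predictions land on the diagonal, so the off-diagonal cells are the ones worth reading: here one `0` was mistaken for a `1` and one `2` for a `0`.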

# Save Test Predictions

In [None]:
def prediction(data_loader):
    # Set model mode to evaluation.
    model.eval()
    
    # Collect test predictions.
    test_pred = torch.LongTensor()
    
    # Don't calculate gradients.
    with torch.no_grad():
        # Loop through each batch.
        for data in data_loader:
            # Use GPU if available.
            if torch.cuda.is_available():
                data = data.cuda()
            
            # Forward pass.
            output = model(data)

            # Get predictions.
            pred = output.cpu().data.max(1, keepdim=True)[1]
            test_pred = torch.cat((test_pred, pred), dim=0)
        
    return test_pred.numpy()

In [None]:
# Get test predictions.
test_pred = prediction(test_loader)

# Create submission dataframe.
submission_df = pd.DataFrame()
submission_df['ImageId'] = list(range(1, len(test_pred) + 1))
submission_df['Label'] = test_pred

submission_df.to_csv('submission.csv', index=False)
submission_df.head(10)

# What more could be done?

For now, we have come to the end of this notebook. However, the list of things to try is of course not exhausted. For the moment I'll content myself with listing them, and I'll add them if I find the time.

- [Hyperparameter Optimization With Random Search and Grid Search](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/)
- [Cross Validation](https://towardsdatascience.com/what-is-cross-validation-60c01f9d9e75) 

