# Identifying handwritten digits (MNIST) using PyTorch

We will use the famous <b>MNIST Handwritten Digits Database</b> as our training dataset. It consists of 28px by 28px grayscale images of handwritten digits (0 - 9), along with labels for each image indicating which digit it represents. MNIST stands for <b>Modified National Institute of Standards and Technology.</b>

![image.png](attachment:image.png)

<font color='black'><h2 align = 'center' style = 'background:LightGray'> Quick Navigation </h2></font>
#### [1. Brief about PyTorch](#1)
#### [2. Working with images in PyTorch (using MNIST Dataset)](#2)
#### [3. Splitting a dataset into training, Validation and test sets](#3)
#### [4. Creating PyTorch models with custom logic by extending the nn.Module Class](#4)
#### [5. Interpreting model outputs as probabilities using softmax, and picking predicted labels](#5)
#### [6. Picking a good evaluation metric (accuracy) and loss function (cross entropy) for Classification problems](#6)
#### [7. Setting up a training loop that also evaluates the model using Validation set](#7)
#### [8. Testing the model manually on randomly picked examples](#8)
#### [9. Saving and loading the model checkpoints to avoid retraining from scratch](#9)
#### [10. References](#10)

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.


In [None]:
## Imports
import torch
import torchvision ## Contains some utilities for working with the image data
from torchvision.datasets import MNIST
import matplotlib.pyplot as plt
#%matplotlib inline
import torchvision.transforms as transforms
from torch.utils.data import random_split
from torch.utils.data import DataLoader
import torch.nn.functional as F

We import <b>torchvision</b>, which contains utility functions for working with image data. It also contains helper classes to automatically download and import famous datasets like MNIST.

The MNIST training set has 60,000 images which can be used to train the model. There is also an additional test set of 10,000 images, which can be created by passing <b>train = False</b> to the MNIST class.

### Loading the MNIST dataset

In [None]:
dataset = MNIST(root = 'data/', download = True)
print(len(dataset))

In [None]:
image, label = dataset[10]
plt.imshow(image, cmap = 'gray')
print('Label:', label)

These images are small in size, and recognizing the digits can sometimes be hard. The dataset returns each image as a PIL image object, which <b>PyTorch</b> models cannot work with directly. We need to convert the images into <b>tensors</b>. We can do this by specifying a <b>transform</b> while creating our dataset.

PyTorch datasets allow us to specify one or more transformation functions, which are applied to the images as they are loaded.

<b>torchvision.transforms</b> contains many such predefined functions, and we will use the <b>ToTensor</b> transform to convert images into PyTorch tensors.

### Loading the MNIST data with transformation applied while loading

In [None]:
## MNIST dataset(images and labels)
mnist_dataset = MNIST(root = 'data/', train = True, transform = transforms.ToTensor())
print(mnist_dataset)

In [None]:
image_tensor, label = mnist_dataset[0]
print(image_tensor.shape, label)

The image is now converted to a <b>1 X 28 X 28 tensor</b>. The first dimension keeps track of the color channels. Since images in the <b>MNIST dataset are grayscale</b>, there is just one channel. Other datasets have color images, in which case there are <b>3 channels (Red, Green, Blue).</b>

In [None]:
print(image_tensor[:,10:15,10:15])
print(torch.max(image_tensor), torch.min(image_tensor))

The values range from 0 to 1, with 0 representing black, 1 representing white, and the values in between representing different shades of grey. We can also plot the tensor as an image using plt.imshow.

In [None]:
## Plot the image of the tensor
plt.imshow(image_tensor[0,10:15,10:15],cmap = 'gray')

## Training and Validation Datasets

In [None]:
train_data, validation_data = random_split(mnist_dataset, [50000, 10000])
## Print the length of train and validation datasets
print("length of Train Datasets: ", len(train_data))
print("length of Validation Datasets: ", len(validation_data))

While building machine learning/deep learning models, it is common to split the dataset into 3 parts:

1. <b>Training set</b> - This part of the data is used to train the model, i.e. to compute the loss and adjust the weights of the model using gradient descent.


2. <b>Validation set</b> - This part of the dataset is used to evaluate the model during training, adjust the hyperparameters and pick the best version of the model.


3. <b>Test set</b> - This part of the dataset is used for a final check of the model's predictions on new, unseen data, to evaluate how well the model is performing.
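
As a small sketch of the split itself, the snippet below applies `random_split` to a stand-in dataset of 100 indices rather than to MNIST. The seeded `torch.Generator` is an addition that this notebook does not use; it is one way to make the split reproducible across runs.

```python
import torch
from torch.utils.data import random_split

## Stand-in dataset: 100 integer "samples" (an assumption for illustration)
toy_dataset = range(100)

## A seeded generator makes the random split repeatable
generator = torch.Generator().manual_seed(42)
toy_train, toy_val = random_split(toy_dataset, [80, 20], generator = generator)

print(len(toy_train), len(toy_val))

## The two subsets never share an index
overlap = set(toy_train.indices) & set(toy_val.indices)
print("Overlapping indices:", len(overlap))
```

Passing the same kind of generator to the MNIST split would give the same 50,000/10,000 partition on every run.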

In [None]:
batch_size = 128
train_loader = DataLoader(train_data, batch_size, shuffle = True)
val_loader = DataLoader(validation_data, batch_size, shuffle = False)

Here we use <b>DataLoaders</b> to load the data in batches, with a batch size of 128. We set <b>shuffle = True</b> for the training dataloader so that the batches generated in each epoch are different; this randomization helps the model generalize and speeds up training.

Since Validation dataloader is used only for evaluating the model, there is no need to shuffle the images.
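
To get a feel for what these loaders produce, the arithmetic below works out how many batches each one yields per epoch. It assumes the DataLoader default `drop_last = False`, so the final, smaller batch is kept.

```python
import math

train_size, val_size, batch_size = 50000, 10000, 128

train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

## Size of the final, partially filled batch in each loader
last_train_batch = train_size - (train_batches - 1) * batch_size
last_val_batch = val_size - (val_batches - 1) * batch_size

print("Training batches:", train_batches, "last batch size:", last_train_batch)
print("Validation batches:", val_batches, "last batch size:", last_val_batch)
```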

## Model

A <b>Logistic Regression</b> model is almost identical in structure to a linear regression model, i.e. there are weight and bias matrices, and the output is obtained using simple matrix operations (pred = x @ w.t() + b).

We can use <b>nn.Linear</b> to create the model instead of defining and initializing the matrices manually.

Since <b>nn.Linear</b> expects each training example to be a vector, each <b>1 X 28 X 28</b> image tensor needs to be flattened into a vector of size <b>784 (28 X 28)</b> before being passed into the model.

The output for each image is a vector of size 10, with each element signifying the probability of a particular target <b>label (i.e. 0 to 9)</b>. The predicted label for an image is simply the one with the highest probability.
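
The relationship between <b>nn.Linear</b> and the matrix formula can be checked with a short sketch. The random batch of 4 flattened "images" below is a stand-in for real data, and the variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(28 * 28, 10)
x = torch.rand(4, 28 * 28)  ## 4 fake flattened images

## Manual computation: pred = x @ w.t() + b
manual = x @ layer.weight.t() + layer.bias
builtin = layer(x)

same = torch.allclose(manual, builtin, atol = 1e-6)
print(builtin.shape, same)
```

Note that the weight matrix is stored with shape (10, 784), which is why it is transposed in the manual version.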

In [None]:
import torch.nn as nn

input_size = 28 * 28
num_classes = 10

## Logistic regression model
model = nn.Linear(input_size, num_classes)
print(model.weight.shape)
print(model.weight)
print(model.bias.shape)
print(model.bias)

In [None]:
for images, labels in train_loader:
    print(labels)
    print(images.shape)
    outputs = model(images)
    break

**Note** This leads to an error, because our input data does not have the right shape.
Our images are of the shape 1X28X28, but we need them to be vectors of size 784 i.e we need to flatten them out. We will use the <b>.reshape()</b> method of a tensor, which will allow us to efficiently view each image as a flat vector, without really changing the underlying data.

To include this additional functionality within model, we need to define a custom model, by extending the <b>nn.Module</b> class from PyTorch.

### Defining the Logistic Model

In [None]:
class MnistModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)
        
    def forward(self, xb):
        ## Flatten each 1 x 28 x 28 image into a vector of size 784
        xb = xb.reshape(-1, 784)
        out = self.linear(xb)
        return(out)

model = MnistModel()
print(model.linear.weight.shape, model.linear.bias.shape)
list(model.parameters())

Inside the __init__  constructor method, we instantiate the weights and biases using <b>nn.Linear</b>. Inside the <b>forward method</b>, which is invoked when we pass a batch of inputs to the model, we flatten out the input tensor, and then pass it into <b>self.linear</b>.

<b>xb.reshape(-1, 28 * 28)</b> indicates to PyTorch that we want a view of the <b>xb</b> tensor with two dimensions, where the length along the 2nd dimension is <b>28 * 28(i.e 784)</b>. One argument to <b>.reshape</b> can be set to <b>-1(in this case the first dimension)</b>, to let PyTorch figure it out automatically based on the shape of the original tensor.
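
A minimal sketch of this flattening step, using a batch of zeros in place of real images (the tensor names are illustrative):

```python
import torch

## A fake batch shaped like one MNIST batch: 128 images, 1 channel, 28 x 28 pixels
fake_batch = torch.zeros(128, 1, 28, 28)

## -1 lets PyTorch infer the first dimension (the batch size)
flat = fake_batch.reshape(-1, 28 * 28)
print(flat.shape)

## The same call works for a single image, giving a batch of one
single = torch.zeros(1, 28, 28).reshape(-1, 28 * 28)
print(single.shape)

## For a contiguous tensor like this one, reshape returns a view that shares storage
shares_storage = flat.data_ptr() == fake_batch.data_ptr()
print(shares_storage)
```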

Note that the model no longer has <b>.weight and .bias</b> attributes (as they are now inside the <b>.linear</b> attribute), but it does have a <b>.parameters</b> method which returns an iterator over the <b>weights and bias</b>, and can be used by a <b>PyTorch optimizer</b>.
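
As a rough illustration of what the optimizer will see, the sketch below lists the parameters of a stand-alone <b>nn.Linear</b> layer of the same size (used here instead of `MnistModel` so the snippet runs on its own) and counts them.

```python
import torch.nn as nn

linear = nn.Linear(28 * 28, 10)

## named_parameters yields (name, tensor) pairs: the weight matrix and the bias vector
for name, param in linear.named_parameters():
    print(name, tuple(param.shape))

## 784 * 10 weights plus 10 biases
total_params = sum(p.numel() for p in linear.parameters())
print("Total trainable parameters:", total_params)
```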

In [None]:
for images, labels in train_loader:
    outputs = model(images)
    break
    
print('outputs shape: ', outputs.shape)
print('Sample outputs: \n', outputs[:2].data)

For each of the 128 input images in a batch, we get 10 outputs, one for each class. We would like these outputs to represent probabilities, but for that the elements of each output row must lie between 0 and 1 and add up to 1, which is not the case here.

To convert the output rows into probabilities that lie between 0 and 1 and sum to 1, we use the <b>Softmax function</b>.

## What is Softmax function?

#### The softmax function is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1.

#### The softmax function is sometimes called the softargmax function, or multi-class logistic regression. This is because the softmax is a generalization of logistic regression that can be used for multi-class classification, and its formula is very similar to the sigmoid function which is used for logistic regression. The softmax function can be used in a classifier only when the classes are mutually exclusive.

#### Many multi-layer neural networks end in a penultimate layer which outputs real-valued scores that are not conveniently scaled and which may be difficult to work with. Here the softmax is very useful because it converts the scores to a normalized probability distribution, which can be displayed to a user or used as input to other systems. For this reason it is usual to append a softmax function as the final layer of the neural network.

![image.png](attachment:image.png)

## How is it calculated?

![image.png](attachment:image.png)

![image.png](attachment:image.png)
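
The calculation can also be sketched in plain Python: exponentiate each score, then divide by the sum of the exponentials. The helper name `softmax_list` and the example scores are illustrative, and subtracting the maximum score first is a common numerical-stability trick that does not change the result.

```python
import math

def softmax_list(scores):
    ## Subtracting the max avoids overflow in exp() without changing the output
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, -1.0]
toy_probs = softmax_list(toy_scores)

print(toy_probs)
print("Sum:", sum(toy_probs))
```

The largest score receives the largest probability, and the ordering of the scores is preserved.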

In [None]:
## Apply softmax for each output row
probs = F.softmax(outputs, dim = 1)

## checking sample probabilities
print("Sample probabilities:\n", probs[:2].data)

print("\n")
## Add up the probabilities of an output row
print("Sum: ", torch.sum(probs[0]).item())
max_probs, preds = torch.max(probs, dim = 1)
print("\n")
print(preds)
print("\n")
print(max_probs)

In [None]:
labels

## Evaluation Metric and Loss Function

Here we evaluate our model by finding the percentage of labels that were predicted correctly i.e. the <b>accuracy</b> of the predictions.

The <b>==</b> operator performs an element-wise comparison of two tensors with the same shape, and returns a tensor of the same shape, containing <b>False</b> for unequal elements and <b>True</b> for equal elements. Passing the result to <b>torch.sum</b> returns the number of labels that were predicted correctly. Finally, we divide by the total number of images to get the accuracy.
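
A tiny worked example of these three steps, using made-up predictions and labels for four images:

```python
import torch

toy_preds = torch.tensor([3, 1, 4, 1])
toy_labels = torch.tensor([3, 1, 4, 2])

## Element-wise comparison gives a boolean tensor
matches = toy_preds == toy_labels
print(matches)

## Summing the boolean tensor counts the correct predictions
num_correct = torch.sum(matches).item()
toy_accuracy = num_correct / len(toy_preds)
print(num_correct, toy_accuracy)
```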

In [None]:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim = 1)
    return(torch.tensor(torch.sum(preds == labels).item()/ len(preds)))

print("Accuracy: ",accuracy(outputs, labels))
print("\n")
loss_fn = F.cross_entropy
print("Loss Function: ",loss_fn)
print("\n")
## Loss for the current batch
loss = loss_fn(outputs, labels)
print(loss)

While accuracy is a great way to evaluate the model, it can't be used as a loss function for optimizing our model using gradient descent, for the following reasons:

- It is not a differentiable function. <b>torch.max</b> and <b>==</b> are both non-continuous and non-differentiable operations, so the accuracy cannot be used to compute gradients with respect to the weights and biases.
- It does not take into account the actual probabilities predicted by the model, so it can't provide sufficient feedback for incremental improvements.

For these reasons accuracy is a great evaluation metric for classification problems, but not a good loss function. A commonly used loss function for classification problems is the <b>Cross Entropy</b>.

## What is Cross-Entropy

<b>Cross-entropy</b> is commonly used to quantify the difference between two probability distributions. Usually the "True" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose that for a specific training instance the label is B (out of the possible labels A, B and C). The one-hot distribution for this training instance is therefore:

![image.png](attachment:image.png)

We can interpret the above "True" distribution to mean that the training instance has 0% probability of being Class A, 100% probability of Class B and 0% probability of being Class C.

Now suppose the machine learning algorithm predicts the following probability distribution:

![image.png](attachment:image.png)

To check how close this predicted distribution is to the true distribution, we use the <b>Cross-entropy loss function.</b>

![image.png](attachment:image.png)

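Here <b>p(x) = actual (true) probability</b> and <b>q(x) = output (predicted) probability</b>, so that for a one-hot true distribution only the predicted probability of the correct class contributes to the loss.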

The sum is over the three classes A, B and C. In this case the loss is <b>0.479.</b>

![image.png](attachment:image.png)
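
The arithmetic behind that number can be reproduced in a few lines. The predicted distribution lives in a figure, so the values used here (0.228, 0.619, 0.153) are an assumption taken from the Stack Overflow answer listed in the credits; the helper name `cross_entropy_example` is also illustrative.

```python
import math

def cross_entropy_example(true_dist, predicted_dist):
    ## Negative sum of true probability times natural log of predicted probability.
    ## Classes with a true probability of 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(q) for t, q in zip(true_dist, predicted_dist) if t > 0)

true_dist = [0.0, 1.0, 0.0]              ## one-hot: the label is B
predicted_dist = [0.228, 0.619, 0.153]   ## assumed values for classes A, B, C

example_loss = cross_entropy_example(true_dist, predicted_dist)
print(example_loss)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the correct class, so a more confident correct prediction gives a smaller loss.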

## Training the Model

Here is the pseudo-code which we will use to train the model:

![image.png](attachment:image.png)

In [None]:
class MnistModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)
    
    def forward(self, xb):
        xb = xb.reshape(-1, 784)
        out = self.linear(xb)
        return(out)
    
    def training_step(self, batch):
        images, labels = batch
        out = self(images) ## Generate predictions
        loss = F.cross_entropy(out, labels) ## Calculate the loss
        return(loss)
    
    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)
        return({'val_loss':loss, 'val_acc': acc})
    
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()
        return({'val_loss': epoch_loss.item(), 'val_acc' : epoch_acc.item()})
    
    def epoch_end(self, epoch,result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(epoch, result['val_loss'], result['val_acc']))
        
    
model = MnistModel()

In [None]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return(model.validation_epoch_end(outputs))

def fit(epochs, lr, model, train_loader, val_loader, opt_func = torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        
        ## Training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        
        ## Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return(history)

In [None]:
result0 = evaluate(model, val_loader)
result0

The initial accuracy is around 8%, close to the 10% one might expect from a randomly initialized model guessing among 10 classes.
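
One optional refinement, not used in this notebook: since evaluation only needs forward passes, the validation loop can be wrapped in `torch.no_grad()` so that PyTorch skips building the gradient graph. The sketch below shows the effect on a stand-in linear layer and a random batch (both hypothetical).

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(784, 10)
toy_batch = torch.rand(8, 784)

## Normal forward pass: the output is attached to the autograd graph
tracked = toy_model(toy_batch)

## Forward pass without gradient tracking, as one might do during evaluation
with torch.no_grad():
    untracked = toy_model(toy_batch)

print(tracked.requires_grad, untracked.requires_grad)
```

The predictions are the same either way; only the bookkeeping for backpropagation is skipped.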

We are ready to train the model. Let's train for 5 epochs

In [None]:
history1 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history2 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history3 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history4 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
## Replace these values with your result
history = [result0] + history1 + history2 + history3 + history4
accuracies = [result['val_acc'] for result in history]
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy Vs. No. of epochs')

## Testing with individual images

In [None]:
## Define the test dataset
test_dataset = MNIST(root = 'data/', train = False, transform = transforms.ToTensor())

In [None]:
img, label = test_dataset[0]
plt.imshow(img[0], cmap = 'gray')
print("shape: ", img.shape)
print('Label: ', label)

In [None]:
print(img.unsqueeze(0).shape)
print(img.shape)

In [None]:
def predict_image(img, model):
    xb = img.unsqueeze(0)
    yb = model(xb)
    _, preds = torch.max(yb, dim = 1)
    return(preds[0].item())

In [None]:
img, label = test_dataset[0]
plt.imshow(img[0], cmap = 'gray')
print('Label:', label, ', Predicted :', predict_image(img, model))

In [None]:
img, label = test_dataset[9]
plt.imshow(img[0], cmap = 'gray')
print("Label:", label, ',Predicted:', predict_image(img, model))

In [None]:
img, label = test_dataset[25]
plt.imshow(img[0], cmap = 'gray')
print("Label:", label, ',Predicted:', predict_image(img, model))

In [None]:
img, label = test_dataset[5000]
plt.imshow(img[0], cmap = 'gray')
print("Label:", label, ',Predicted:', predict_image(img, model))

In [None]:
test_loader = DataLoader(test_dataset, batch_size = 256)
result = evaluate(model, test_loader)
result

## Saving and loading the Model

In [None]:
torch.save(model.state_dict(), 'mnist-logistic.pth')

The <b>.state_dict</b> method returns an OrderedDict containing all the weights and bias matrices mapped to the right attributes of the model.
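
A minimal round-trip sketch of saving and restoring a state dict. To stay self-contained it uses a bare <b>nn.Linear</b> layer and an in-memory buffer instead of `MnistModel` and a `.pth` file; both substitutions are assumptions made for illustration.

```python
import io
import torch
import torch.nn as nn

source = nn.Linear(28 * 28, 10)

## Save the weights and bias to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

## A freshly created layer has different random weights until the state is loaded
restored = nn.Linear(28 * 28, 10)
restored.load_state_dict(torch.load(buffer))

keys = list(source.state_dict().keys())
weights_match = all(
    torch.equal(source.state_dict()[k], restored.state_dict()[k]) for k in keys
)
print(keys, weights_match)
```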

In [None]:
model.state_dict()

In [None]:
model2 = MnistModel()
model2.load_state_dict(torch.load('mnist-logistic.pth'))
model2.state_dict()

In [None]:
test_loader = DataLoader(test_dataset, batch_size = 256)
result = evaluate(model2, test_loader)
result

## Credits

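#### 1. [https://jovian.ai/aakashns/03-logistic-regression](https://jovian.ai/aakashns/03-logistic-regression)
#### 2. [https://deepai.org/machine-learning-glossary-and-terms/softmax-layer](https://deepai.org/machine-learning-glossary-and-terms/softmax-layer)
#### 3. [https://stackoverflow.com/questions/41990250/what-is-cross-entropy](https://stackoverflow.com/questions/41990250/what-is-cross-entropy)
#### 4. [https://en.wikipedia.org/wiki/MNIST_database](https://en.wikipedia.org/wiki/MNIST_database)
#### 5. [https://github.com/pytorch/pytorch](https://github.com/pytorch/pytorch)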