<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0; color:tomato' role="tab" aria-controls="home"><center>MNIST with Pytorch Logistic Regression</center></h2>


We use the MNIST training set (loaded below through `torchvision.datasets.MNIST`) as our training dataset. It consists of 28px by 28px grayscale images of handwritten digits (0 to 9), along with a label for each image indicating which digit it represents. Here are some sample images from the dataset:

![mnist-sample](https://i.imgur.com/CAYnuo1.jpg)

# Importing Libraries

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import torch
import torchvision
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("paper", font_scale = 1, rc={"grid.linewidth": 3})
pd.set_option('display.max_rows', 100, 'display.max_columns', 400)
from torch.utils.data import DataLoader,Dataset,ConcatDataset
from torchvision import transforms
import torch.optim as optim
from torchvision.datasets import MNIST
from torch.utils.data import random_split
import torch.nn as nn
import torch.nn.functional as F

In [None]:
dataset = MNIST(root='data/', download=True)

In [None]:
len(dataset)

The dataset has 60,000 images which can be used to train the model. There is also an additional test set of 10,000 images, which can be created by passing `train=False` to the `MNIST` class. We will use it as the test dataset.

In [None]:
test_dataset = MNIST(root='data/', train=False,transform=transforms.ToTensor())
len(test_dataset)

Let's look at a sample element from the training dataset.

In [None]:
dataset[0]

It's a pair, consisting of a 28x28 image and a label. The image is an object of the class `PIL.Image.Image`, which is a part of the Python imaging library [Pillow](https://pillow.readthedocs.io/en/stable/). We can view the image within Jupyter using [`matplotlib`](https://matplotlib.org/), the de-facto plotting and graphing library for data science in Python.

In [None]:
plt.figure(figsize=(20,8))
for i in range(10,18):
    image,label=dataset[i]
    plt.subplot(2,4,i-9)  # 2x4 grid, positions 1 to 8
    plt.imshow(image,cmap='gray')
    plt.title('Label:'+str(label),fontweight='bold',size=20)

PyTorch datasets allow us to specify one or more transformation functions which are applied to the images as they are loaded. `torchvision.transforms` contains many such predefined functions, and we'll use the `ToTensor` transform to convert images into PyTorch tensors.
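As a rough illustration of what `ToTensor` does to a grayscale image, the sketch below builds a synthetic 28x28 array of 8-bit pixel values and converts it by hand. The array `pixels` and the name `demo_tensor` are made up for this example, and the manual conversion is only an approximation of the transform for this simple case, not its actual implementation.

```python
import numpy as np
import torch

# Hypothetical stand-in for a grayscale MNIST image: 28x28 pixels, values 0..255
pixels = np.zeros((28, 28), dtype=np.uint8)
pixels[10:18, 10:18] = 255  # a white square on a black background

# Roughly what ToTensor does for such an image:
# scale the values to the range [0, 1] and add a leading channel dimension
demo_tensor = torch.from_numpy(pixels).float().div(255).unsqueeze(0)

print(demo_tensor.shape)  # torch.Size([1, 28, 28])
print(demo_tensor.min().item(), demo_tensor.max().item())  # 0.0 1.0
```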

In [None]:
# MNIST dataset (images and labels)
dataset = MNIST(root='data/', 
                train=True,
                transform=transforms.ToTensor())

In [None]:
img_tensor, label=dataset[0]
print(img_tensor.shape,label)

The image is now converted to a `1x28x28` tensor. The first dimension is used to keep track of the color channels. Since images in the MNIST dataset are grayscale, there's just one channel.

In [None]:
print(img_tensor[:,5:10,5:10])
print(torch.max(img_tensor), torch.min(img_tensor))

The values range from 0 to 1, with 0 representing black, 1 white and the values in between different shades of grey. We can also plot the tensor as an image using `plt.imshow`.

Let's visualise the colours of the 5x5 patch printed above.

In [None]:
plt.figure(figsize=(6,4))
plt.imshow(img_tensor[0,5:10,5:10],cmap='gray')
plt.title('0:Black,1:White',fontweight='bold',size=12)
plt.show()

In the MNIST dataset, there are `60,000` training images, and `10,000` test images. The test set is standardized so that different researchers can report the results of their models against the same set of images. 

Since there's no predefined validation set, we must manually split the `60,000` images into training and validation datasets. Let's set aside `10,000` randomly chosen images for validation.

In [None]:
train_df,val_df= random_split(dataset,[50000,10000])
print(len(train_df))
print(len(val_df))
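`random_split` draws a different split each time it runs. If a repeatable split is wanted, one option is to pass a seeded generator. The sketch below shows this on a small made-up dataset; `toy_dataset` and the seed value 42 are assumptions for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical toy dataset: 10 examples of 4 values each, with integer labels
toy_dataset = TensorDataset(torch.randn(10, 4), torch.arange(10))

# A seeded generator makes the split repeatable across runs
generator = torch.Generator().manual_seed(42)
toy_train, toy_val = random_split(toy_dataset, [8, 2], generator=generator)

print(len(toy_train), len(toy_val))  # 8 2

# The two subsets never share an example
overlap = set(toy_train.indices) & set(toy_val.indices)
print(len(overlap))  # 0
```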

We can now create data loaders to help us load the data in batches. We'll use a batch size of 128.

In [None]:
batch_size=128
train_loader=DataLoader(train_df,batch_size,shuffle=True)
val_loader=DataLoader(val_df,batch_size)

We set `shuffle=True` for the training dataloader so that the batches generated in each epoch are different. This randomization helps the model generalize and speeds up training. On the other hand, since the validation dataloader is used only for evaluating the model, there is no need to shuffle the images.
The number of batches is `ceil(50000/128) = 391` for `train_df` and `ceil(10000/128) = 79` for `val_df`; the last batch in each loader is smaller than 128.
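The batch counts can be checked with a few lines of arithmetic. This sketch assumes the default `DataLoader` behaviour of keeping the final, partial batch (`drop_last=False`).

```python
import math

batch_size = 128

# The last batch may be smaller than batch_size, so round up
train_batches = math.ceil(50000 / batch_size)
val_batches = math.ceil(10000 / batch_size)
print(train_batches, val_batches)  # 391 79

# Size of the final, partial training batch
last_train_batch = 50000 - (train_batches - 1) * batch_size
print(last_train_batch)  # 80
```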

* A **logistic regression** model is almost identical to a linear regression model i.e. there are weights and bias matrices, and the output is obtained using simple matrix operations (`pred = x @ w.t() + b`). 

* Just as we did with linear regression, we can use `nn.Linear` to create the model instead of defining and initializing the matrices manually.

* Since `nn.Linear` expects each training example to be a vector, each `1x28x28` image tensor needs to be flattened out into a vector of size 784 (`28*28`), before being passed into the model. 

* The output for each image is a vector of size 10, with each element of the vector signifying the probability of a particular target label (i.e. 0 to 9). The predicted label for an image is simply the one with the highest probability.
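The points above can be sketched on a small random batch: flatten the images, apply `x @ w.t() + b` by hand, and compare with what `nn.Linear` computes. The batch of 4 random "images" is a made-up stand-in for real MNIST data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(28 * 28, 10)

# Hypothetical batch of 4 random "images" shaped like MNIST tensors
images = torch.rand(4, 1, 28, 28)
x = images.reshape(-1, 28 * 28)  # flatten to shape (4, 784)

# The same computation written out by hand
manual = x @ layer.weight.t() + layer.bias
builtin = layer(x)

print(manual.shape)  # torch.Size([4, 10])
print(torch.allclose(manual, builtin, atol=1e-5))

# The predicted label is the index of the largest output in each row
pred_labels = manual.argmax(dim=1)
print(pred_labels.shape)  # torch.Size([4])
```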

In [None]:
input_size=28*28 #784
num_classes=10 #0-9
log_model = nn.Linear(in_features=input_size,out_features=num_classes) # Logistic regression model

In [None]:
#model weights 
print(log_model.weight)
print('\n')
print(log_model.weight.shape) # torch.Size([10, 784]): 7840 weights
print('\n')
print(log_model.bias) # 10 bias values, one per class

Total number of parameters: `7840 + 10 = 7850`
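The same total can be obtained by counting the elements of each parameter tensor. This is a small sketch on a freshly created layer of the same size, not on the trained model.

```python
import torch.nn as nn

layer = nn.Linear(in_features=28 * 28, out_features=10)

# List each parameter with its shape and number of elements
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.numel())
# weight (10, 784) 7840
# bias (10,) 10

total = sum(p.numel() for p in layer.parameters())
print(total)  # 7850
```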

In [None]:
class MnistModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear=nn.Linear(input_size,num_classes)
        
    def forward(self,x):
        # Calls to layer to make predictions
        # performs the actual computation, that is, it outputs a prediction, given the input x.
        x=x.reshape(-1,784)
        out=self.linear(x)
        return out

In [None]:
model=MnistModel()

Inside the `__init__` constructor method, we instantiate the weights and biases using `nn.Linear`. And inside the `forward` method, which is invoked when we pass a batch of inputs to the model, we flatten out the input tensor, and then pass it into `self.linear`.

`x.reshape(-1, 28*28)` indicates to PyTorch that we want a *view* of the `x` tensor with two dimensions, where the length along the 2nd dimension is 28\*28 (i.e. 784). One argument to `.reshape` can be set to `-1` (in this case the first dimension), to let PyTorch figure it out automatically based on the shape of the original tensor.
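A quick way to see how `-1` is resolved is to reshape tensors with different batch sizes. The sketch below uses zero-filled tensors as placeholders, and also shows that for a contiguous tensor like these the reshaped result shares memory with the original.

```python
import torch

batch = torch.zeros(128, 1, 28, 28)
flat = batch.reshape(-1, 28 * 28)
print(flat.shape)  # torch.Size([128, 784])

# The -1 dimension is inferred from the total number of elements
single = torch.zeros(1, 28, 28).reshape(-1, 784)
print(single.shape)  # torch.Size([1, 784])

# For a contiguous tensor, reshape returns a view: writing through
# the flattened tensor changes the original as well
flat[0, 0] = 1.0
print(batch[0, 0, 0, 0].item())  # 1.0
```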

Note that the model no longer has `.weight` and `.bias` attributes (as they are now inside the `.linear` attribute), but it does have a `.parameters` method which returns an iterator over the weights and bias, and can be used by a PyTorch optimizer.

In [None]:
#model weights 
print(model.linear.weight)
print('\n')
print(model.linear.weight.shape) # torch.Size([10, 784]): 7840 weights
print('\n')
print(model.linear.bias.shape) # torch.Size([10])
print('\n')
print(list(model.parameters())) # the weight and bias tensors

In [None]:
for images,label in train_loader:
    outputs=model(images)
    break
print('outputs.shape : ', outputs.shape)
print('Sample outputs :\n', outputs[:2].data)

For each of the 128 input images in the batch, we get 10 outputs, one for each class. As discussed earlier, we'd like these outputs to represent probabilities, but for that the elements of each output row must lie between 0 and 1 and add up to 1, which is clearly not the case here. 

To convert the output rows into probabilities, we use the softmax function, which has the following formula:

![softmax](https://i.imgur.com/EAh9jLN.png)

First we replace each element `yi` in an output row by `e^yi`, which makes all the elements positive, and then we divide each element by the sum of all elements to ensure that they add up to 1. 

While it's easy to implement the softmax function (you should try it!), we'll use the implementation that's provided within PyTorch, because it works well with multidimensional tensors (a list of output rows in our case).
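For reference, a hand-written softmax might look like the sketch below. The function name `manual_softmax` and the sample values are made up for illustration, and subtracting the row maximum is a common stability trick added here, not something the formula above requires.

```python
import torch
import torch.nn.functional as F

def manual_softmax(outputs):
    # Subtracting the row maximum does not change the result,
    # but keeps exp() from overflowing for large values
    shifted = outputs - outputs.max(dim=1, keepdim=True).values
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=1, keepdim=True)

logits = torch.tensor([[1.0, 2.0, 3.0],
                       [0.5, 0.5, 0.5]])
probs = manual_softmax(logits)

print(probs)
print(probs.sum(dim=1))  # each row adds up to 1 (within rounding)
print(torch.allclose(probs, F.softmax(logits, dim=1)))
```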

In [None]:
# Apply softmax for each output row
prob = F.softmax(outputs, dim=1)
# Look at sample probabilities
print("Sample probabilities:\n", prob[:2].data)
print('\n')
# Add up the probabilities of an output row
print("Sum: ", torch.sum(prob[0]).item())

Finally, we can determine the predicted label for each image by simply choosing the index of the element with the highest probability in each output row. This is done using `torch.max`, which returns the largest element and the index of the largest element along a particular dimension of a tensor.
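On a small hand-made tensor, the two return values of `torch.max` are easy to read off. The probabilities below are invented for the example.

```python
import torch

toy_probs = torch.tensor([[0.1, 0.7, 0.2],
                          [0.6, 0.3, 0.1]])

# dim=1 looks for the maximum within each row
toy_max, toy_pred = torch.max(toy_probs, dim=1)
print(toy_max)   # tensor([0.7000, 0.6000])
print(toy_pred)  # tensor([1, 0])
```

If only the indices are needed, `torch.argmax(toy_probs, dim=1)` gives the same labels.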

In [None]:
max_prob, predict = torch.max(prob, dim=1)
print(predict)
print(max_prob)

## Evaluation Metric and Loss Function

In [None]:
def accuracy(outputs,label):
    _, pred = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(pred==label).item()/len(pred))

In [None]:
accuracy(outputs,label)
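To see the metric on numbers that can be checked by eye, here is the same `accuracy` function applied to a made-up batch of four outputs, three of which point at the correct label.

```python
import torch

def accuracy(outputs, labels):
    # Index of the largest output in each row is the predicted label
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

toy_outputs = torch.tensor([[2.0, 0.1, 0.3],   # predicts 0
                            [0.2, 1.5, 0.1],   # predicts 1
                            [0.9, 0.2, 0.4],   # predicts 0
                            [0.1, 0.2, 3.0]])  # predicts 2
toy_labels = torch.tensor([0, 1, 2, 2])

toy_acc = accuracy(toy_outputs, toy_labels)
print(toy_acc)  # tensor(0.7500)
```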

A commonly used loss function for classification problems is the **cross entropy**, which has the following formula:

![cross-entropy](https://i.imgur.com/VDRDl1D.png)

While it looks complicated, it's actually quite simple:

* For each output row, pick the predicted probability for the correct label. E.g. if the predicted probabilities for an image are `[0.1, 0.3, 0.2, ...]` and the correct label is `1`, we pick the corresponding element `0.3` and ignore the rest.

* Then, take the [logarithm](https://en.wikipedia.org/wiki/Logarithm) of the picked probability. If the probability is high i.e. close to 1, then its logarithm is a very small negative value, close to 0. And if the probability is low (close to 0), then the logarithm is a very large negative value. We also multiply the result by -1, which results in a large positive value of the loss for poor predictions.

* Finally, take the average of the cross entropy across all the output rows to get the overall loss for a batch of data.

Unlike accuracy, cross-entropy is a continuous and differentiable function that also provides good feedback for incremental improvements in the model (a slightly higher probability for the correct label leads to a lower loss). This makes it a good choice for the loss function. 

As you might expect, PyTorch provides an efficient and tensor-friendly implementation of cross entropy as part of the `torch.nn.functional` package. Moreover, it also performs softmax internally, so we can directly pass in the outputs of the model without converting them into probabilities.
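The three steps described above can be written out by hand and compared with `F.cross_entropy`. The sample outputs and labels below are made up, and the manual version is only a sketch for checking the idea; it is not how PyTorch computes the loss internally.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# Step 1: convert to probabilities and pick the one for the correct label
probs = F.softmax(logits, dim=1)
picked = probs[torch.arange(len(labels)), labels]

# Steps 2 and 3: negative logarithm, then average over the batch
manual_loss = -torch.log(picked).mean()

# The built-in function takes the raw outputs, not the probabilities
builtin_loss = F.cross_entropy(logits, labels)

print(manual_loss.item(), builtin_loss.item())
```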

In [None]:
loss_fn=F.cross_entropy

In [None]:
# Loss for first batch of data
loss = loss_fn(outputs, label)
print(loss)
print(loss.item())

## Training the model

Now that we have defined the data loaders, model and loss function, we are ready to train the model (the optimizer is created inside the `fit` function defined below). The training process is identical to linear regression, with the addition of a "validation phase" to evaluate the model in each epoch. Here's what it looks like in pseudocode:

```
for epoch in range(num_epochs):
    # Training phase
    for batch in train_loader:
        # Generate predictions
        # Calculate loss
        # Compute gradients
        # Update weights
        # Reset gradients
    
    # Validation phase
    for batch in val_loader:
        # Generate predictions
        # Calculate loss
        # Calculate metrics (accuracy etc.)
    # Calculate average validation loss & metrics
    
    # Log epoch, loss & metrics for inspection
```

Some parts of the training loop are specific to the problem we're solving (e.g. loss function, metrics etc.) whereas others are generic and can be applied to any deep learning problem. Let's implement the problem-specific parts within our `MnistModel` class:

In [None]:
class MnistModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear=nn.Linear(input_size,num_classes)
        
    def forward(self,x):
        x=x.reshape(-1,784)
        out=self.linear(x)
        return out
    
    def training_step(self,batch):
        images,labels=batch
        out=self(images)  #Generate predictions
        loss=F.cross_entropy(out,labels) # calculating loss for cost/loss optimisation
        return loss
        
    def validation_step(self,batch):
        images,labels=batch
        out=self(images)
        loss=F.cross_entropy(out,labels) #loss
        acc=accuracy(out,labels) # accuracy
        return {'val_loss': loss, 'val_acc': acc} 
    
    def validation_epoch_end(self,outputs):
        batch_losses=[x['val_loss'] for x in outputs]
        epoch_loss= torch.stack(batch_losses).mean()  
        batch_accs=[x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}
    
    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(epoch, result['val_loss'], result['val_acc']))

model=MnistModel()

Now we'll define an `evaluate` function, which will perform the validation phase, and a `fit` function which will perform the entire training process.

In [None]:
def evaluate(model,val_loader):
    # Gradients are not needed for evaluation, so skip tracking them
    with torch.no_grad():
        outputs=[model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

In [None]:
def fit(epochs,lr,model,train_loader,val_loader,opt_func=torch.optim.Adam):
    history=[]
    optimizer=opt_func(model.parameters(),lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss=model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation Phase
        result=evaluate(model,val_loader)
        model.epoch_end(epoch,result)
        history.append(result)
    return history

In [None]:
result0=evaluate(model,val_loader)
result0

Before any training, the accuracy is around 10%, which is what one might expect from a randomly initialized model (since it has a 1 in 10 chance of getting a label right by guessing randomly). 

We are now ready to train the model. Let's train for 5 epochs and look at the results.

In [None]:
history1 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history2 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history3 = fit(5, 0.001, model, train_loader, val_loader)

In [None]:
history4 = fit(5, 0.001, model, train_loader, val_loader)

While the accuracy does continue to increase as we train for more epochs, the improvements get smaller with every epoch. This is easier to see using a line graph.

In [None]:
# Combine the initial result with the history returned by each call to fit
history = [result0] + history1 + history2 + history3 + history4
accuracies = [result['val_acc'] for result in history]
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. No. of epochs');

It's quite clear from the above picture that the model probably won't cross the accuracy threshold of 93% even after training for a very long time. One possible reason for this is that the learning rate might be too high. It's possible that the model's parameters are "bouncing" around the optimal set of parameters that have the lowest loss. You can try reducing the learning rate and training for a few more epochs to see if it helps.

The more likely reason is that **the model just isn't powerful enough**. If you remember our initial hypothesis, we have assumed that the output (in this case the class scores that softmax turns into probabilities) is a **linear function** of the input (pixel intensities), obtained by performing a matrix multiplication with the weights matrix and adding the bias. This is a fairly restrictive assumption, as there may not actually exist a linear relationship between the pixel intensities in an image and the digit it represents. While it works reasonably well for a simple dataset like MNIST, we need more sophisticated models that can capture non-linear relationships between image pixels and labels for complex tasks like recognizing everyday objects, animals etc. 


## Testing with individual images

While we have been tracking the overall accuracy of a model so far, it's also a good idea to look at the model's results on some sample images. Let's test out our model with some images from the predefined test dataset of 10,000 images, which we created earlier with the `ToTensor` transform.

In [None]:
img, label = test_dataset[0]
plt.imshow(img[0], cmap='gray')
print('Shape:', img.shape)
print('Label:', label)

In [None]:
img.unsqueeze(0).shape

Let's define a helper function `predict_image`, which returns the predicted label for a single image tensor.

In [None]:
def predict_image(img,model):
    x=img.unsqueeze(0)
    y=model(x)
    _, preds = torch.max(y, dim=1)
    return preds[0].item()

`img.unsqueeze` simply adds another dimension at the beginning of the 1x28x28 tensor, making it a 1x1x28x28 tensor, which the model views as a batch containing a single image.

Let's try it out with a few images.

In [None]:
img, label = test_dataset[0]
plt.imshow(img[0], cmap='gray')
print('Label:', label, ', Predicted:', predict_image(img, model))

In [None]:
img, label = test_dataset[10]
plt.imshow(img[0], cmap='gray')
print('Label:', label, ', Predicted:', predict_image(img, model))

In [None]:
img, label = test_dataset[200]
plt.imshow(img[0], cmap='gray')
print('Label:', label, ', Predicted:', predict_image(img, model))

In [None]:
img, label = test_dataset[1839]
plt.imshow(img[0], cmap='gray')
print('Label:', label, ', Predicted:', predict_image(img, model))

Identifying where our model performs poorly can help us improve the model, by collecting more training data, increasing/decreasing the complexity of the model, and changing the hyperparameters.

As a final step, let's also look at the overall loss and accuracy of the model on the test set.

In [None]:
test_loader = DataLoader(test_dataset, batch_size=256)
result = evaluate(model, test_loader)
result

## Saving and loading the model

In [None]:
torch.save(model.state_dict(), 'mnist-logistic.pth')

The `.state_dict` method returns an `OrderedDict` containing all the weights and bias matrices mapped to the right attributes of the model.

In [None]:
model.state_dict()

To load the model weights, we can instantiate a new object of the class `MnistModel`, and use the `.load_state_dict` method.

In [None]:
model2 = MnistModel()
model2.load_state_dict(torch.load('mnist-logistic.pth'))
model2.state_dict()

Just as a sanity check, let's verify that this model has the same loss and accuracy on the test set as before.

In [None]:
test_loader = DataLoader(test_dataset, batch_size=256)
result = evaluate(model2, test_loader)
result
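A more direct sanity check is to compare the saved and restored parameters tensor by tensor. The sketch below does this on a standalone `nn.Linear` layer and saves to an in-memory buffer instead of a file; both choices are assumptions made to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn

original = nn.Linear(28 * 28, 10)

# Save the state dict to an in-memory buffer (stands in for the .pth file)
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# A new layer starts with different random weights until the state is loaded
restored = nn.Linear(28 * 28, 10)
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(original.state_dict()[key], restored.state_dict()[key])
    for key in original.state_dict()
)
print(same)  # True
```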

