<div align='center'><font size="5" color='#353B47'>PyTorch: Introduction to CNN</font></div>
<div align='center'><font size="4" color="#353B47">on MNIST digit dataset</font></div>
<br>
<hr>

<img src="https://en.mlab.ai/sites/default/files/inline-images/handwritten_numbers.png">

The objective of this notebook is to build a PyTorch model that correctly classifies handwritten digits.

# <div id="summary">Summary</div>

**<font size="2"><a href="#chap1">1. Load libraries</a></font>**
**<br><font size="2"><a href="#chap2">2. EDA and preprocessing</a></font>**
**<br><font size="2"><a href="#chap3">3. CNN</a></font>**
**<br><font size="2"><a href="#chap4">4. Evaluation</a></font>**

# <div id="chap1">1. Load libraries</div>

In [None]:
# Remove warning messages
import warnings
warnings.filterwarnings('ignore')

import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import plotly
import plotly.graph_objects as go
%matplotlib inline

import os

from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.optim import lr_scheduler

In [None]:
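# Set seeds for the random, NumPy and PyTorch generators for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)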

In [None]:
PATH_TO_DATA = '../input/digit-recognizer/'

In [None]:
# Load train and test 
train = pd.read_csv(PATH_TO_DATA + 'train.csv', dtype = np.float32)
test = pd.read_csv(PATH_TO_DATA + 'test.csv', dtype = np.float32)

In [None]:
# First rows of train
train.head()

--------

**<font size="2"><a href="#summary">Back to summary</a></font>**

# <div id="chap2">2. EDA and preprocessing</div>

## <font color='blue'> 2.1 Class distribution</font>

In [None]:
def plot_distribution_classes(x_values, y_values):

    fig = go.Figure(data=[go.Bar(
                x=x_values, 
                y=y_values,
                text=y_values
    )])

    fig.update_layout(height=600, width=1200, title_text="Distribution of classes")
    fig.update_xaxes(type="category")

    fig.show()

In [None]:
x = np.sort(train.label.unique())
y = train.label.value_counts().sort_index()

plot_distribution_classes(x, y)

## <font color='blue'>2.2 Preprocessing</font>

In [None]:
def preprocessing(train, test, split_train_size = 0.2):
    
    # Split data into features(pixels) and labels(numbers from 0 to 9)
    targets = train.label.values
    features = train.drop(["label"], axis = 1).values
    
    # Normalization
    features = features/255.
    X_test = test.values/255.
    
    # Train/validation split: the training set keeps a (1 - split_train_size) share of the rows, the validation set gets the remaining split_train_size share.
    X_train, X_val, y_train, y_val = train_test_split(features,
                                                      targets,
                                                      test_size = split_train_size,
                                                      random_state = 42) 
    
    # Create feature and target tensors for the training set (CrossEntropyLoss expects targets of type long)
    X_train = torch.from_numpy(X_train)
    y_train = torch.from_numpy(y_train).type(torch.LongTensor) # data type is long

    # Create feature and target tensors for the validation set.
    X_val = torch.from_numpy(X_val)
    y_val = torch.from_numpy(y_val).type(torch.LongTensor) # data type is long
    
    # Create feature tensor for the test set.
    X_test = torch.from_numpy(X_test)
    
    return X_train, y_train, X_val, y_val, X_test

In [None]:
X_train, y_train, X_val, y_val, X_test = preprocessing(train, test)

print(f'Shape of training data: {X_train.shape}')
print(f'Shape of training labels: {y_train.shape}')
print(f'Shape of validation data: {X_val.shape}')
print(f'Shape of validation labels: {y_val.shape}')
print(f'Shape of testing data: {X_test.shape}')

In [None]:
# batch_size, epoch and iteration
BATCH_SIZE = 100
N_ITER = 2500
EPOCHS = 5
# The model will then be trained for 10 extra epochs to show how training can be resumed in PyTorch
EXTRA_EPOCHS = 10

# PyTorch train, validation and test datasets
train_tensor = torch.utils.data.TensorDataset(X_train, y_train)
val_tensor = torch.utils.data.TensorDataset(X_val, y_val)
test_tensor = torch.utils.data.TensorDataset(X_test)

# data loader
train_loader = torch.utils.data.DataLoader(train_tensor, 
                                           batch_size = BATCH_SIZE,
                                           shuffle = True)
val_loader = torch.utils.data.DataLoader(val_tensor, 
                                         batch_size = BATCH_SIZE, 
                                         shuffle = False)
test_loader = torch.utils.data.DataLoader(test_tensor, 
                                          batch_size = BATCH_SIZE,
                                          shuffle = False)
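
As a rough check of what these loaders imply, the number of batches per epoch is the dataset size divided by `BATCH_SIZE`, rounded up. The sketch below assumes 33,600 training rows, which is what an 80/20 split of a 42,000-row `train.csv` would give. That row count is an assumption, so substitute `len(train_tensor)` in the notebook. `batches_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of batches a DataLoader yields per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)

BATCH_SIZE = 100
EPOCHS = 5
N_TRAIN = 33600  # assumed: 80% of a 42,000-row train.csv

n_batches = batches_per_epoch(N_TRAIN, BATCH_SIZE)
total_iterations = n_batches * EPOCHS  # optimizer steps over the first training run
print(n_batches, total_iterations)
```

With these assumed sizes every batch is full, but a dataset size that is not a multiple of `BATCH_SIZE` ends each epoch with one smaller batch.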

## <font color='blue'>2.3 Display some examples</font>

In [None]:
def display_images(graph_indexes = np.arange(9)):
    
    plt.figure(figsize=(12,12))
    for graph_index in graph_indexes:
        
        # Draw a random index (randint includes both bounds, so stop at the last valid index)
        index = random.randint(0, X_train.shape[0] - 1)
        
        # Get corresponding label (.numpy to get value of a tensor)
        label = y_train[index].numpy()
        
        # define subplot
        plt.subplot(330 + 1 + graph_index)
        plt.title('Label: %s \n'%label,
                 fontsize=18)
        # plot raw pixel data (1d tensor that needs to be reshaped to 28x28)
        plt.imshow(X_train[index].view(28,28), cmap=plt.get_cmap('gray'))
        
    # the bottom of the subplots of the figure
    plt.subplots_adjust(bottom = 0.001)
    plt.subplots_adjust(top = 0.99)
    
    # show the figure
    plt.show()

In [None]:
display_images()

**<font size="2"><a href="#summary">Back to summary</a></font>**

-------

# <div id="chap3">3. CNN</div>

## <font color='blue'>3.1 What is a CNN ?</font>

A CNN is similar to a classic fully connected neural network (RegularNet): both are made of neurons with weights and biases, and both are trained with a loss function and an optimizer. In addition, CNNs use convolutional, pooling, and flatten layers. CNNs are mainly used for image tasks such as classification.

### CNN layers
* **Convolutional layer** 

This is the first layer, where features are extracted from the images. Because a pixel is mostly related to its neighbouring pixels, convolution preserves the spatial relationship between different parts of an image. Convolution slides a small filter (kernel) over the image and produces a smaller feature map without losing the relationship between pixels. For example, applying a 3x3 filter with a 1x1 stride (a shift of 1 pixel at each step) to a 5x5 image gives a 3x3 output: 9 values instead of 25, a 64% reduction.


* **Pooling layer**

When building CNNs, it is common to insert a pooling layer after each convolutional layer. Pooling reduces the spatial size of the representation, which lowers the number of parameters in the following layers and the computational cost. Pooling layers also **help with the overfitting problem**. A pooling layer slides a window of a chosen size over its input and keeps only the maximum, average, or sum of the values inside each window.


* **Flatten layer**

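Reshapes the feature maps of each sample into a one-dimensional vector so that they can be fed to the fully connected layers. It does not affect the batch size.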
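
The size arithmetic described for the convolutional and pooling layers can be sketched in plain Python. `conv_output_size` and `max_pool_2x2` are hypothetical helpers written for illustration only; the first applies the usual formula `(n + 2p - k) // s + 1`, the second is a naive max pooling over nested lists with made-up values.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Side length of the output of a convolution or pooling window."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# 5x5 image, 3x3 filter, stride 1: the example from the text
side = conv_output_size(input_size=5, kernel_size=3, stride=1)
reduction = 1 - (side * side) / (5 * 5)
print(side, f"{reduction:.0%}")

def max_pool_2x2(image):
    """Naive 2x2 max pooling with stride 2 on a list of lists."""
    return [[max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]

# Each 2x2 block of the 4x4 input is replaced by its maximum
pooled = max_pool_2x2([[1, 2, 5, 6],
                       [3, 4, 7, 8],
                       [9, 8, 1, 0],
                       [7, 6, 3, 2]])
print(pooled)
```

The same formula gives 24 for a 28x28 MNIST image and a 5x5 filter with no padding.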

## <font color='blue'>3.2 Network Structure</font>

To build a model with PyTorch, a class should be created. Its `__init__` method declares the layers that make up the architecture of the neural network, and its `forward` method chains them to define how an input flows through the network.

In [None]:
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        
        # convolution 1
        self.c1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=(5,5), stride=1, padding=0)
        self.relu1 = nn.ReLU()
        
        # maxpool 1
        self.maxpool1 = nn.MaxPool2d(kernel_size=(2,2))
        
        # dropout 1
        self.dropout1 = nn.Dropout(0.25)
        
        # convolution 2
        self.c2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), stride=1, padding=0)
        self.relu2 = nn.ReLU()
        
        # maxpool 2
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2,2))

        # dropout 2
        self.dropout2 = nn.Dropout(0.25)
        
        # linear 1
        self.fc1 = nn.Linear(32*5*5, 256)
        
        # dropout 3
        self.dropout3 = nn.Dropout(0.25)
        
        # linear 2
        self.fc2 = nn.Linear(256, 10)
        
    def forward(self, x):
        
        out = self.c1(x) # [BATCH_SIZE, 16, 24, 24]
        out = self.relu1(out) 
        out = self.maxpool1(out) # [BATCH_SIZE, 16, 12, 12]
        out = self.dropout1(out) 
        
        out = self.c2(out) # [BATCH_SIZE, 32, 10, 10]
        out = self.relu2(out) 
        out = self.maxpool2(out) # [BATCH_SIZE, 32, 5, 5]
        out = self.dropout2(out) 
        
        out = out.view(out.size(0), -1) # [BATCH_SIZE, 32*5*5=800]
        out = self.fc1(out) # [BATCH_SIZE, 256]
        out = self.dropout3(out)
        out = self.fc2(out) # [BATCH_SIZE, 10]
        
        return out
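
To get a feel for where the capacity of this network sits, the number of trainable parameters per layer can be worked out by hand. This is a sketch: `conv_params` and `linear_params` are hypothetical helpers, and the layer sizes are copied from the class above.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel of a square-kernel Conv2d."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights plus one bias per output unit of a Linear layer."""
    return in_features * out_features + out_features

flattened = 32 * 5 * 5  # channels * height * width after the second max pooling

param_counts = {
    "c1": conv_params(1, 16, 5),
    "c2": conv_params(16, 32, 3),
    "fc1": linear_params(flattened, 256),
    "fc2": linear_params(256, 10),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Almost all of the parameters belong to `fc1`. In the notebook, the total can be compared with `sum(p.numel() for p in model.parameters())`.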

In [None]:
# Create CNN
model = CNNModel()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)

# Cross Entropy Loss 
criterion = nn.CrossEntropyLoss()

# LR scheduler
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

# On GPU if possible
if torch.cuda.is_available():
    print("Model will be training on GPU")
    model = model.cuda()
    criterion = criterion.cuda()
else:
    print("Model will be training on CPU")
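
The step decay configured above halves the learning rate every 3 scheduler steps. A minimal sketch of the resulting values, assuming the scheduler is stepped once at the end of each epoch, starting from `lr=0.003` (`step_lr` is a hypothetical helper that mirrors the decay rule, not a PyTorch function):

```python
def step_lr(initial_lr, epoch, step_size=3, gamma=0.5):
    """Learning rate under step decay after `epoch` completed scheduler steps."""
    return initial_lr * gamma ** (epoch // step_size)

# 15 epochs in total: 5 initial epochs plus 10 extra epochs
schedule = [step_lr(0.003, epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.7f}")
```

Under this assumption the last three epochs run at one sixteenth of the initial learning rate.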

## <font color='blue'>3.3 Training and evaluation</font>

In [None]:
def fit(epoch):
    
    print("Training...")
    # Set model on training mode
    model.train()
    
    # Initialize train loss and train accuracy
    train_running_loss = 0.0
    train_running_correct = 0
    # Learning rate used during this epoch
    train_running_lr = optimizer.param_groups[0]['lr']
    
    for batch_idx, (data, target) in enumerate(train_loader):
        # Reshape flat pixels to [batch, channel, height, width]; -1 also handles a smaller last batch
        data = data.view(-1, 1, 28, 28)
        
        if torch.cuda.is_available():
            data = data.cuda()
            target = target.cuda()
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        
        train_running_loss += loss.item()
        _, preds = torch.max(output.data, 1)
        train_running_correct += (preds == target).sum().item()
        
        loss.backward()
        optimizer.step()
        
        if (batch_idx + 1)% 50 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                 epoch+1, 
                 (batch_idx + 1) * len(data), 
                 len(train_loader.dataset),
                 100. * (batch_idx + 1) / len(train_loader), 
                 loss.item())
                 )
    
    # Update the learning rate once per epoch, after the optimizer steps
    exp_lr_scheduler.step()
    
    # criterion returns the mean loss of a batch, so average over the number of batches
    train_loss = train_running_loss/len(train_loader)
    train_accuracy = 100. * train_running_correct/len(train_loader.dataset)    
    
    return train_loss, train_accuracy, train_running_lr
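
A note on the epoch loss: `nn.CrossEntropyLoss` averages over the batch by default (`reduction='mean'`), so each `loss.item()` is already a per-sample mean. The sketch below uses made-up batch losses to show how the accumulated value scales depending on what it is divided by.

```python
# Hypothetical per-batch mean losses, as returned by a mean-reduced criterion
batch_losses = [0.50, 0.30, 0.10]
batch_size = 100
n_samples = batch_size * len(batch_losses)

running_loss = sum(batch_losses)

# Dividing by the number of batches gives the mean loss per sample
per_batch_average = running_loss / len(batch_losses)
# Dividing by the number of samples gives a value batch_size times smaller
per_sample_division = running_loss / n_samples

print(per_batch_average, per_sample_division)
```

Both quantities follow the same trend across epochs, so the shape of a loss curve is unaffected, but only the first one is comparable with losses reported for other batch sizes.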

In [None]:
def validate(data_loader):
    
    print("Validating...")
    # Set model on evaluation mode (disables dropout)
    model.eval()
    # Per-batch predictions and probabilities, concatenated at the end
    val_preds, val_proba = [], []
    
    # Initialize validation loss and validation accuracy
    val_running_loss = 0.0
    val_running_correct = 0
    
    # Inference only: torch.no_grad() replaces the removed volatile=True flag
    with torch.no_grad():
        for data, target in data_loader:
            data = data.view(-1, 1, 28, 28)
            
            if torch.cuda.is_available():
                data = data.cuda()
                target = target.cuda()
            
            output = model(data)
            loss = criterion(output, target)
            
            val_running_loss += loss.item()
            pred = output.max(1, keepdim=True)[1]
            # Softmax over the class dimension turns logits into probabilities
            proba = F.softmax(output, dim=1)

            val_running_correct += pred.eq(target.view_as(pred)).sum().item()
            
            # Store predictions and probabilities for the confusion matrix and the "best" errors
            val_preds.append(pred.cpu())
            val_proba.append(proba.cpu())

    val_preds = torch.cat(val_preds, dim=0)
    val_proba = torch.cat(val_proba, dim=0)
    
    # criterion returns the mean loss of a batch, so average over the number of batches
    val_loss = val_running_loss/len(data_loader)
    val_accuracy = 100. * val_running_correct/len(data_loader.dataset) 
    
    return val_loss, val_accuracy, val_preds, val_proba

`volatile=True` was the recommended setting for pure inference, when you are sure that `.backward()` will never be called: it used the minimal amount of memory to evaluate the model and implied `requires_grad=False`. Since PyTorch 0.4 the flag has no effect and only raises a warning; the `torch.no_grad()` context manager is its replacement.
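
In current PyTorch versions, inference without gradient tracking is written with `torch.no_grad()`. A minimal sketch, using a small hypothetical `nn.Linear` layer in place of the CNN:

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; any nn.Module behaves the same way
model = nn.Linear(4, 2)
model.eval()
x = torch.randn(3, 4)

# Inside the context manager no computation graph is recorded
with torch.no_grad():
    out_inference = model(x)

# Outside of it, outputs track gradients through the model parameters
out_training = model(x)

print(out_inference.requires_grad, out_training.requires_grad)
```

The values are identical in both cases; only the bookkeeping needed for `.backward()` is skipped, which saves memory during validation and prediction.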

In [None]:
train_loss, train_accuracy = [], []
val_loss, val_accuracy = [], []
val_preds, val_proba = [], []
train_lr = []

for epoch in range(EPOCHS):
    
    print(f"Epoch {epoch+1} of {EPOCHS}\n")
    
    train_epoch_loss, train_epoch_accuracy, train_epoch_lr = fit(epoch)
    val_epoch_loss, val_epoch_accuracy, val_epoch_preds, val_epoch_proba = validate(val_loader)
    
    train_loss.append(train_epoch_loss)
    train_accuracy.append(train_epoch_accuracy)
    train_lr.append(train_epoch_lr)
    
    val_loss.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)
    val_preds.append(val_epoch_preds)
    val_proba.append(val_epoch_proba)
    
    print(f"Train Loss: {train_epoch_loss:.4f}, Train Acc: {train_epoch_accuracy:.2f}")
    print(f'Val Loss: {val_epoch_loss:.4f}, Val Acc: {val_epoch_accuracy:.2f}\n')

After these 5 epochs of training, the model can be saved with its latest weights and any information that describes how the CNN was trained (optimizer state, loss function...).

In [None]:
# save model checkpoint
torch.save({'epoch': EPOCHS,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': criterion,},
           './model.pth')

The model is now reloaded from this checkpoint. This technique is convenient when you cannot run all of the training epochs in a single session.

In [None]:
# load the model checkpoint
checkpoint = torch.load('./model.pth')

# load model weights state_dict
model.load_state_dict(checkpoint['model_state_dict'])
print('Previously trained model weights state_dict loaded...')

# load trained optimizer state_dict
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
print('Previously trained optimizer state_dict loaded...')
EPOCHS = checkpoint['epoch']

# load the criterion
criterion = checkpoint['loss']
print('Trained model loss function loaded...')
print(f"Previously trained for {EPOCHS} number of epochs...")

# train for more epochs
print(f"Train for {EXTRA_EPOCHS} more epochs...")

In [None]:
for epoch in range(EXTRA_EPOCHS):
    
    print(f"Epoch {epoch+1} of {EXTRA_EPOCHS}\n")
    
    train_epoch_loss, train_epoch_accuracy, train_epoch_lr = fit(epoch)
    val_epoch_loss, val_epoch_accuracy, val_epoch_preds, val_epoch_proba = validate(val_loader)
    
    train_loss.append(train_epoch_loss)
    train_accuracy.append(train_epoch_accuracy)
    train_lr.append(train_epoch_lr)
    
    val_loss.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)
    val_preds.append(val_epoch_preds)
    val_proba.append(val_epoch_proba)
    
    print(f"Train Loss: {train_epoch_loss:.4f}, Train Acc: {train_epoch_accuracy:.2f}")
    print(f'Val Loss: {val_epoch_loss:.4f}, Val Acc: {val_epoch_accuracy:.2f}\n')

In [None]:
# save model checkpoint with the total number of epochs trained so far
torch.save({'epoch': EPOCHS + EXTRA_EPOCHS,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': criterion}, 
           './model.pth')

This part could have been done more quickly by training for 15 epochs directly. However, I wanted to show that training can be split into two parts, and how to save and load a PyTorch model.

## <font color='blue'>3.4 History of CNN</font>

In [None]:
def plot_history():

    plt.figure(figsize = (20,15))
    
    plt.subplot(221)
    
    # summarize history for accuracy
    plt.plot(train_accuracy)
    plt.plot(val_accuracy)
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.grid()
    
    
    plt.subplot(222)
    # summarize history for loss
    plt.plot(train_loss)
    plt.plot(val_loss)
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.grid()
    
    plt.subplot(223)
    # summarize history for lr
    plt.plot(train_lr)
    plt.title('learning rate')
    plt.ylabel('lr')
    plt.xlabel('epoch')
    plt.grid()
    
    plt.show()

In [None]:
plot_history()

**<font size="2"><a href="#summary">Back to summary</a></font>**

-------

# <div id="chap4">4. Evaluation</div>

## <font color='blue'>4.1 Confusion Matrix</font>

In [None]:
def plot_confusion_matrix(confusion_matrix, 
                          cmap=plt.cm.Reds):
    
    classes = range(10)
    
    plt.figure(figsize=(8,8))
    plt.imshow(confusion_matrix, 
               interpolation='nearest', 
               cmap=cmap)
    plt.title('Confusion matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = confusion_matrix.max() / 2.
    for i, j in itertools.product(range(confusion_matrix.shape[0]), range(confusion_matrix.shape[1])):
        plt.text(j, i, confusion_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if confusion_matrix[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Get predictions of validation set from last epoch
# val_preds is a list with one tensor per epoch, so its last element belongs to the last epoch
y_pred_classes = val_preds[-1].cpu().numpy().ravel()

# compute the confusion matrix
cm = confusion_matrix(y_val, y_pred_classes) 

# plot the confusion matrix
plot_confusion_matrix(cm)
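
Beyond the plot, a confusion matrix can be summarised numerically. The sketch below uses a made-up 3-class matrix, not results from this notebook, with rows as true labels and columns as predictions, which is the layout returned by scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true labels, columns are predictions
cm_example = np.array([[50,  2,  1],
                       [ 3, 45,  4],
                       [ 0,  5, 40]])

# Correct predictions are on the diagonal; each row sums to the number of true samples of that class
per_class_recall = cm_example.diagonal() / cm_example.sum(axis=1)
overall_accuracy = cm_example.diagonal().sum() / cm_example.sum()

print(per_class_recall.round(3), round(overall_accuracy, 3))
```

Applied to `cm`, the same two lines would show which digits the model recognises least reliably.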

## <font color='blue'>4.2 Some examples of predicted images</font>

In [None]:
def display_predicted_images(graph_indexes = np.arange(9)):
    
    # plot first few images
    plt.figure(figsize=(12,12))
    
    for graph_index in graph_indexes:
        
        # randint includes both bounds, so stop at the last valid index
        index = random.randint(0, X_val.shape[0] - 1)
        
        # Get corresponding label
        predicted_label = y_pred_classes[index]
        true_label = y_val[index]
        
        # define subplot
        plt.subplot(330 + 1 + graph_index)
        plt.title('Predicted label: %s \n'%predicted_label+\
                  'True label %s \n'%true_label.item(),
                 fontsize=18)
        # plot raw pixel data
        plt.imshow(X_val[index].view(28,28), cmap=plt.get_cmap('gray'))
        
    plt.subplots_adjust(bottom = 0.001)  # the bottom of the subplots of the figure
    plt.subplots_adjust(top = 0.99)
    
    # show the figure
    plt.show()

In [None]:
display_predicted_images()

## <font color='blue'>4.3 "Best" errors</font>

In [None]:
# Retrieve validation probability predictions of the last epoch (last element of the list)
y_pred = val_proba[-1].cpu().numpy()

# Display errors 
errors = (y_pred_classes - y_val.cpu().numpy() != 0)

y_pred_classes_errors = y_pred_classes[errors]
y_pred_errors = y_pred[errors]
y_true_errors = y_val[errors]

X_val_errors = X_val[errors]

In [None]:
def display_top9_wrongly_predicted_images(list_of_indexes, graph_indexes = np.arange(9)):
    
    # plot first few images
    plt.figure(figsize=(12,12))
    
    for graph_index in graph_indexes:
        
        index = list_of_indexes[graph_index]
        
        # Get corresponding label
        predicted_label = y_pred_classes_errors[index]
        true_label = y_true_errors[index]
        
        
        # define subplot
        plt.subplot(330 + 1 + graph_index)
        plt.title('Predicted label: %s \n'%predicted_label+\
                  'True label %s \n'%true_label.item(),
                 fontsize=18)
        # plot raw pixel data
        plt.imshow(X_val_errors[index].view(28,28), cmap=plt.get_cmap('gray'))
        
    plt.subplots_adjust(bottom = 0.001)  # the bottom of the subplots of the figure
    plt.subplots_adjust(top = 0.99)
    
    # show the figure
    plt.show()

In [None]:
# Probabilities of the wrongly predicted numbers
y_pred_errors_prob = np.max(y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set (one value per row)
true_prob_errors = y_pred_errors[np.arange(len(y_pred_errors)), y_true_errors.numpy()]

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 9 errors 
most_important_errors = sorted_delta_errors[-9:]

# Show the top 9 errors
display_top9_wrongly_predicted_images(list_of_indexes = most_important_errors)
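
The ranking used above can be followed on a toy example. The probabilities and labels below are made up for illustration; each row is one misclassified image and each column one class.

```python
import numpy as np

# Made-up softmax outputs for 3 misclassified images over 4 classes
proba = np.array([[0.10, 0.70, 0.15, 0.05],
                  [0.40, 0.05, 0.45, 0.10],
                  [0.02, 0.03, 0.05, 0.90]])
true_labels = np.array([2, 0, 1])

# Confidence in the (wrong) predicted class
predicted_prob = proba.max(axis=1)
# Probability the model gave to the correct class, one value per row
true_prob = proba[np.arange(len(proba)), true_labels]
delta = predicted_prob - true_prob

# argsort is ascending, so the last indexes are the most confident mistakes
ranking = np.argsort(delta)
print(delta.round(2), ranking)
```

A large delta means the model was confident in the wrong digit and gave almost no probability to the right one, which is why those images are displayed as the "best" errors.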

# Submission

In [None]:
def prediction(data_loader):
    
    print("Inferring predictions...")
    # Set model on evaluation mode
    model.eval()
    test_pred = []
    
    # Inference only: no gradients are needed
    with torch.no_grad():
        for batch_idx, data in enumerate(data_loader):
            # TensorDataset yields 1-tuples, so the features are in data[0]
            data = data[0].view(-1, 1, 28, 28)
            
            if torch.cuda.is_available():
                data = data.cuda()
                
            output = model(data)
            
            pred = output.cpu().max(1, keepdim=True)[1]
            test_pred.append(pred)
    
    print("Completed")   
    return torch.cat(test_pred, dim=0)

In [None]:
# predict results
y_test_pred = prediction(test_loader)

# Associate max probability obs with label class
y_test_pred = y_test_pred.numpy().ravel()
y_test_pred = pd.Series(y_test_pred, name="Label")

# ImageId starts at 1 and covers every test row
submission = pd.concat([pd.Series(range(1, len(y_test_pred) + 1), name = "ImageId"), y_test_pred], axis = 1)

submission.to_csv("CNN_model_TPU_submission.csv", index = False)

**<font size="2"><a href="#summary">Back to summary</a></font>**

-------

# References

* https://debuggercafe.com/effective-model-saving-and-resuming-training-in-pytorch/
* https://pytorch.org/tutorials/

<hr>
<div align='center'><font size="3" color="#353B47">There is also a CNN implementation using TensorFlow/Keras.</font></div>
<div align='center'><a href="https://www.kaggle.com/bryanb/keras-cnn-for-mnist-digit-recognition-with-tpus">Keras: CNN for MNIST digit recognition with TPUs</a></div>
<br>
<div align='justify'><font color="#353B47" size="4">Thank you for taking the time to read this notebook. I hope it answered your questions or satisfied your curiosity, and that it was easy to follow. <u>Any constructive comments are welcome</u>. They help me progress and motivate me to share better quality content. I am above all a passionate person who tries to advance my knowledge but also that of others. If you liked it, feel free to <u>upvote and share my work.</u> </font></div>
<br>
<div align='center'><font color="#353B47" size="3">Thank you and may passion guide you.</font></div>