<a href="https://colab.research.google.com/github/umiralles/BNGoogleWorkSample/blob/master/gdo_voicebot/grammar_correction_service/grammar_checker_model/Grammar_Checker_Model_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Training the BERT uncased model into a grammatical error checker**

This notebook creates a custom language model by adding an extra layer to the pretrained BERT base model (uncased) and trains it to predict whether an input sentence is grammatically correct or incorrect. This is achieved in two main stages:

1.   **Pre-training:** the custom model is pre-trained using the English data of the [Lang-8 Corpus of Learner English](https://sites.google.com/site/naistlang8corpora/), a dataset compiled from a language exchange social network service. 
2.   **Training:** the custom model is trained using in-domain data from the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) consisting of sentences from 23 linguistic publications.

For **validation** we use a sample of 10% of CoLA's in-domain dataset, and for **testing** we use CoLA's out-of-domain dev set.

## **Pre-Processing**
Both datasets are already pre-processed when input to the model, using the methods described in our [Pre-Processing_Datasets](https://colab.research.google.com/drive/1is-XvliKcN4XXTK167TEeutVOvYB6Nqn?usp=sharing) file. The wording of the sentences is not modified during pre-processing, though punctuation apart from apostrophes is removed and sentences are made lowercase (as is the case for the GDO's speech-to-text system).

For the **CoLA** dataset, sentences labeled as 'incorrect' are oversampled to balance the distribution of the dataset. For the **Lang-8** dataset, a random sample of around 20,000 sentences (out of a possible 1,000,000) is used to reduce training time. Lang-8 is undersampled if the random 20,000 samples are unevenly distributed between the two labels. The actual pre-processed datasets we used for our trained model can be found [here](https://imperialcollegelondon.box.com/s/0s602z27cgzq6fxk5lhb8446li8030ta).
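The balancing step can be illustrated with a minimal sketch. The real pre-processing lives in the linked Pre-Processing_Datasets file; the function name `oversample_minority`, the `(label, sentence)` row layout and the fixed seed below are assumptions made only for this example.

```python
import random

def oversample_minority(rows, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size.

    `rows` is a list of (label, sentence) pairs with labels 0 or 1.
    Hypothetical helper, not the notebook's actual pre-processing code.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for label, sentence in rows:
        by_label[label].append((label, sentence))

    # The shorter list is the minority class, the longer one the majority
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        return list(rows)  # nothing to duplicate from

    # Draw duplicates (with replacement) to close the gap between the classes
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

rows = [(1, "a"), (1, "b"), (1, "c"), (0, "d")]
balanced = oversample_minority(rows)
print(sum(1 for label, _ in balanced if label == 0),
      sum(1 for label, _ in balanced if label == 1))
```

Undersampling works the other way round: rows are dropped from the majority class instead of being duplicated in the minority class.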

The **input dataset** for the model is in the format:
```
  Column     Description
 ------------------------------------------------------------------------------------------
    1        the acceptability judgment label (0 = unacceptable, 1 = acceptable).
    2        the lowercase sentence with no punctuation apart from apostrophes.
```
For example, a sample from 'cola-train.tsv' reads:
```
  1     john and the man went to the store
  0     i loved intensely the policeman with all my heart
  0     i'm sure we got any tickets
  1     the umpire called the game off
```
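As a minimal sketch of how a file in this format can be parsed: the notebook's own cells use pandas for this, whereas the example below uses only the standard-library `csv` module, and the inline string stands in for an uploaded file.

```python
import csv
import io

# Inline sample standing in for 'cola-train.tsv' (tab-separated, no header row)
sample_tsv = (
    "1\tjohn and the man went to the store\n"
    "0\ti loved intensely the policeman with all my heart\n"
    "0\ti'm sure we got any tickets\n"
    "1\tthe umpire called the game off\n"
)

labels, sentences = [], []
for label, sentence in csv.reader(io.StringIO(sample_tsv), delimiter="\t"):
    labels.append(int(label))    # column 1: acceptability label
    sentences.append(sentence)   # column 2: lowercase sentence

print(labels)
print(sentences[2])
```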

## **Hyperparameter Tuning**

The hyperparameters we chose to modify were the **learning rate** and **number of epochs**. Hyperparameter tuning was done using 10-fold cross-validation twice: first training with learning rates from 0.0001 to 0.1 and then between 0.0223 and 0.0334 since these were the best performing learning rates in general. We plotted metrics using the output of the file, as can be seen in our [Plot_Metrics](https://colab.research.google.com/drive/1Xa6VR26_FpDcx7boygx69mJe4MrAlifM?usp=sharing) file, and used these to decide on the number of epochs for each final model (after pre-training and after training).
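For readers unfamiliar with k-fold cross-validation, the splitting step can be sketched as below. This is an illustration under stated assumptions, not the code used for our tuning runs: the helper name `k_fold_indices` and the fixed seed are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every sample lands in exactly one validation fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(25, k=10)):
    print(fold, len(train_idx), len(val_idx))
```

Each candidate learning rate would be trained once per fold and its validation metrics averaged over the folds before comparing it with the other candidates.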

The current model we are using is trained with a **learning rate of 0.02353** with **14 Lang-8 epochs** and **10 CoLA epochs** of training. The test accuracy (tested using the methods below) is 0.566.

## **Using the Model**

As shown in the validation and testing stages, predictions can be put through a sigmoid function and rounded to produce a predicted label (0 = unacceptable/incorrect, 1 = acceptable/correct). There is an example of this [here](https://colab.research.google.com/drive/1AjT56COhhcABOLmSQFau_djdhWEPRwwb?usp=sharing).
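A minimal sketch of that conversion, using made-up logit values and a hypothetical helper name `logits_to_labels`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logits_to_labels(logits):
    """Map raw model outputs (logits) to 0/1 labels via sigmoid and rounding."""
    probabilities = sigmoid(np.asarray(logits, dtype=float))
    return np.round(probabilities).astype(int)

example_logits = np.array([-2.3, 0.4, 3.1, -0.1])  # illustrative values only
print(logits_to_labels(example_logits))  # [0 1 1 0]
```

Because the sigmoid of a positive number is above 0.5 and the sigmoid of a negative number is below it, this is equivalent to checking the sign of the logit. A logit of exactly 0 maps to 0.5, which `np.round` rounds to 0.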

In [None]:
##################################
#   Mount Drive to Save Models   #
##################################

from google.colab import drive
drive.mount('/content/drive')

# Folder path to where models will be saved
# Note: folders must already exist to save them there
folder_path = './drive/My Drive/Galileo/models/'

In [None]:
###################################
#         Upload datasets         #
###################################

# Upload all of:
#   lang-8-train.tsv
#   cola-train.tsv
#   cola-validate.tsv
#   cola-test.tsv
from google.colab import files
uploaded = files.upload()

In [None]:
###################################
#             Imports             #
###################################

!pip install transformers

import io 
import os
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer, BertModel
import torch.nn as nn
import pandas as pd
import numpy as np

In [None]:
###################################
#          Set Constants          #
###################################

## Changeable Constants ##
# Learning rate of the optimiser
LEARNING_RATE = 0.0223

# Batch size (for Google Colab this should be 32 so as to not run out of memory)
BATCH_SIZE = 32

## Unchangeable Constants ##
# Sentence acceptability labels
CORRECT = 1
INCORRECT = 0

## Find and set the current device ##
# Use graphics card if available, otherwise use CPU
if torch.cuda.is_available():
  device = torch.device("cuda")
  # Report which GPU is in use (only valid when CUDA is available)
  print(torch.cuda.get_device_name(0))
else:
  device = torch.device("cpu")

n_gpu = torch.cuda.device_count()

## **Model Evaluation Functions**

For evaluating our model, we first compute the confusion matrix:

```
            Predicted 0 | Predicted 1
  Actual 0 | [[  TN     ,     FP  ],
  Actual 1 |  [  FN     ,     TP  ]]

  Where:  TN = [0][0] True Negative
          FN = [1][0] False Negative
          FP = [0][1] False Positive
          TP = [1][1] True Positive
```

This is done by taking the predictions from the model and converting them to either a '1' or '0' using `np.round(sigmoid(y_pred))`. Sigmoid is used because our loss function is `BCEWithLogitsLoss` (which applies a sigmoid activation before the binary cross-entropy loss), so the model itself outputs raw logits.

The metrics we calculate using the confusion matrix are:

*   Recall
*   Precision
*   F1 Measure
*   Accuracy

Functions for all of which can be found below.
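As a worked example of how each metric is read off the confusion matrix, the sketch below uses made-up counts and hypothetical helper names (`class_precision`, `class_recall`, `class_f1`, `overall_accuracy`). For a class `c`, the true positives are the diagonal entry `[c][c]`, the false positives are the other entry in column `c`, and the false negatives are the other entry in row `c`.

```python
# Illustrative counts only: rows are actual labels, columns are predicted labels
#              Predicted 0, Predicted 1
confusion = [[40, 10],   # Actual 0: TN, FP
             [5, 45]]    # Actual 1: FN, TP

def class_precision(confusion, c):
    """Of everything predicted as class c, the fraction that really is class c."""
    return confusion[c][c] / (confusion[c][c] + confusion[1 - c][c])

def class_recall(confusion, c):
    """Of everything that really is class c, the fraction predicted as class c."""
    return confusion[c][c] / (confusion[c][c] + confusion[c][1 - c])

def class_f1(confusion, c):
    """Harmonic mean of precision and recall for class c."""
    p, r = class_precision(confusion, c), class_recall(confusion, c)
    return 2 * p * r / (p + r)

def overall_accuracy(confusion):
    """Fraction of all predictions that were right."""
    total = sum(sum(row) for row in confusion)
    return (confusion[0][0] + confusion[1][1]) / total

for c in (0, 1):
    print(c, class_precision(confusion, c), class_recall(confusion, c),
          class_f1(confusion, c))
print(overall_accuracy(confusion))
```

With these counts, class 1 has precision 45/55 and recall 45/50, class 0 has precision 40/45 and recall 40/50, and accuracy is 85/100.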

In [None]:
###################################
#      Evaluation Functions       #
###################################

def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Computes a confusion matrix of the format above based on y_pred (predictions)
#   and y_test (gold)
# Note: y_pred and y_test should be numpy arrays
def confusion_matrix(y_pred, y_test):
  # For each prediction, convert to a tag of 0 or 1
  y_pred_tag = np.round(sigmoid(y_pred)).astype('int32')
  y_test = y_test.astype('int32')

  # Since this is binary classification, the confusion matrix is (2, 2)
  # (the builtin int is used because the np.int alias was removed in NumPy 1.24)
  confusion = np.zeros((2, 2), dtype=int)

  # Count each class for each prediction
  for i in range(len(y_pred_tag)):
    confusion[y_test[i], y_pred_tag[i]] += 1

  return confusion

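# Calculates precision from a confusion matrix of the format above
# Note: class_label should be CORRECT or INCORRECT
def precision(confusion, class_label):
  # Predictions of class_label that were right (diagonal entry)
  true_pos = confusion[class_label][class_label]
  # Predictions of class_label whose actual label was the other class
  false_pos = confusion[1 - class_label][class_label]

  return true_pos / (true_pos + false_pos)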

# Calculates macro average precision from a confusion matrix 
#   of the format above
def marco_avg_precision(confusion):
  correct_precision = precision(confusion, CORRECT)
  incorrect_precision = precision(confusion, INCORRECT)

  return (correct_precision + incorrect_precision) / 2

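# Calculates recall from a confusion matrix of the format above
# Note: class_label should be CORRECT or INCORRECT
def recall(confusion, class_label):
  # Examples of class_label that were predicted as class_label (diagonal entry)
  true_pos = confusion[class_label][class_label]
  # Examples of class_label that were predicted as the other class
  false_neg = confusion[class_label][1 - class_label]

  return true_pos / (true_pos + false_neg)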

# Calculates macro average recall from a confusion matrix of the format above
def macro_avg_recall(confusion):
  correct_recall = recall(confusion, CORRECT)
  incorrect_recall = recall(confusion, INCORRECT)

  return (correct_recall + incorrect_recall) / 2

# Calculates f1 measure from a confusion matrix of the format above
# Note: class_label should be CORRECT or INCORRECT
def f_one_measure(confusion, class_label):
  total_precision = precision(confusion, class_label)
  total_recall = recall(confusion, class_label)

  return (2 * total_precision * total_recall) / (total_precision + total_recall)

# Calculates the average f1 measure from a confusion matrix of the format above
def avg_f_one_measure(confusion):
  correct_f_one = f_one_measure(confusion, CORRECT)
  incorrect_f_one = f_one_measure(confusion, INCORRECT)

  return (correct_f_one + incorrect_f_one) / 2

# Calculates the accuracy from a confusion matrix of the format above
def accuracy(confusion):
  true_pos = confusion[1][1]
  true_neg = confusion[0][0]

  total = sum([np.sum(row) for row in confusion])

  return (true_pos + true_neg) / total

## **Custom BERT Model Class**

We create our own custom grammar checker model by adding an extra layer to the pre-trained BERT base model (uncased due to the GDO's speech-to-text system being lowercase). We previously attempted using BertForSequenceClassification as a base, but this way we are able to output only one value as a prediction.

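We also modified the `state_dict()` and `load_state_dict()` functions to enable us to only save and load the weights of the final layer, vastly reducing the amount of storage a model takes up. Note that this stores only the linear layer: the BERT weights are reloaded from `bert-base-uncased` each time the model is constructed.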

In [None]:
##################################
#          Model Class           #
##################################
# with reference to https://stackoverflow.com/questions/64156202/add-dense-layer-on-top-of-huggingface-bert-model
# and documentation at https://huggingface.co/transformers/model_doc/bert.html#bertmodel

class CustomBERTModel(torch.nn.Module):
  def __init__(self):
    super(CustomBERTModel, self).__init__()
    self.bert = BertModel.from_pretrained('bert-base-uncased')
    ## New Layer
    self.linear = torch.nn.Linear(768, 1)

  # A forward pass through both the BERT model and linear layer
  def forward(self, input_ids):
    outputs = self.bert(input_ids, token_type_ids=None)

    # Gets the output of the last hidden layer of the BERT model
    last_hidden_states = outputs.last_hidden_state
    linear_output = self.linear(last_hidden_states[:,0,:])

    return linear_output

  # Modified state_dict to only save linear layer weights and bias
  def state_dict(self):
    return self.linear.state_dict()

  # Modified load_state_dict to only load linear layer weights and bias
  def load_state_dict(self, state_dict):
    self.linear.load_state_dict(state_dict)


## **Loading Models**

Before training, we create our model and load it to the current device (as defined in the 'Constants' section above).

If training is continuing, you may upload the model above and load it with the second cell of code.

In [None]:
# Load new custom model to the current device (GPU if available)
model = CustomBERTModel()
model.to(device)

In [None]:
# Load a model after uploading with the name:
#   bert-base-uncased-GDO-trained.pth
model = CustomBERTModel()
model.load_state_dict(torch.load('bert-base-uncased-GDO-trained.pth', map_location=device))
model.to(device)

## **Optimiser and Loss Function**

For the optimisation, we use PyTorch's Stochastic Gradient Descent optimiser. As the imported BERT model is already pre-trained, we found that faster-learning optimisers such as Adam overfit the data in just a couple of epochs.

We use the `BCEWithLogitsLoss` function (Binary Cross-Entropy with a sigmoid activation function) as our loss function, as this is widely used for binary classification.
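To make the loss concrete, the sketch below computes binary cross-entropy on a single logit in two ways: the textbook form (sigmoid first, then cross-entropy) and an algebraically equivalent form that avoids overflow for large logits. The function names are hypothetical, and this is an illustration of the formula rather than PyTorch's implementation.

```python
import math

def naive_bce(logit, target):
    """Sigmoid followed by binary cross-entropy (can lose precision for large |logit|)."""
    p = 1 / (1 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Equivalent stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Illustrative (logit, label) pairs
for logit, target in [(2.0, 1.0), (2.0, 0.0), (-1.5, 1.0), (0.0, 1.0)]:
    print(logit, target, naive_bce(logit, target), bce_with_logits(logit, target))
```

A confident correct prediction (large positive logit with label 1) gives a loss near 0, while a logit of 0 gives log 2 regardless of the label. Over a batch, the per-example losses are averaged.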

In [None]:
###################################
#       Optimiser and Loss        #
###################################

optimiser = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
criterion = nn.BCEWithLogitsLoss()

## **Data Tokenization**

For tokenizing we use the standard BERT base uncased tokenizer with the `do_lower_case` flag set to true (since this is the case for the GDO speech-to-text system).

In [None]:
##################################
#           Tokenizer            #
##################################

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case = True)

## **Pre-Training**

### **Preparing Datasets**
For **pre-training** we use the **Lang-8 train** dataset (in the uploaded `lang-8-train.tsv` file). We tokenize the input sentences using the tokenizer, map the resulting tokens to IDs and pad the outputs to the maximum length in the IDs sequence. When loading the training dataset into a DataLoader, we use a `RandomSampler` so the indexes will be shuffled for each epoch. 

For **validation** we use a 10% sample of the **CoLA in-domain train** dataset (in the uploaded `cola-validate.tsv` file) and process the data in the same way as for pre-training. When loading the validation dataset into a DataLoader, we use a `SequentialSampler` so the indexes will be in the same order for each epoch. 
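The padding step can be sketched in plain Python as follows. The token IDs are made up for illustration, and `pad_to_max_length` is a hypothetical helper; the notebook itself uses `pad_sequences`, whose default padding value is also 0.

```python
def pad_to_max_length(id_sequences, pad_id=0):
    """Right-pad every ID sequence with pad_id up to the longest sequence's length."""
    max_len = max(len(seq) for seq in id_sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in id_sequences]

# Made-up token IDs for three sentences of different lengths
id_sequences = [[11, 12, 13, 14], [21, 22], [31]]

for row in pad_to_max_length(id_sequences):
    print(row)
```

After padding, every row has the same length, so the list can be turned into a single rectangular tensor.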

In [None]:
###################################
#          Training Set           #
###################################

df = pd.read_csv(io.BytesIO(uploaded['lang-8-train.tsv']), delimiter='\t', header=None,
  names=['label', 'sentence'])

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(str(sent)) for sent in sentences]

# Convert tokens to IDs and find the length of the longest sentence
padded_sequence = \
  [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]
max_len = max([len(txt) for txt in padded_sequence])

# Pad our input tokens
input_ids = pad_sequences(padded_sequence, maxlen=max_len, dtype="long", 
                          truncating="post", padding="post")

# Create input and label matrices
train_inputs = torch.tensor(input_ids)
train_labels = torch.tensor(labels)

# Create iterator from formatted training data
train_data = TensorDataset(train_inputs, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = \
  DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

In [None]:
###################################
#         Validation Set          #
###################################

df = pd.read_csv(io.BytesIO(uploaded['cola-validate.tsv']), delimiter='\t', header=None,
  names=['label', 'sentence'])

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(str(sent)) for sent in sentences]

# Convert tokens to IDs and find the length of the longest sentence
padded_sequence = [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]
max_len = max([len(txt) for txt in padded_sequence])

# Pad our input tokens
input_ids = pad_sequences(padded_sequence, maxlen=max_len, dtype="long", 
                          truncating="post", padding="post")

# Create input and label matrices
validation_inputs = torch.tensor(input_ids)
validation_labels = torch.tensor(labels)

# Create iterator from formatted validation data
validation_data = \
  TensorDataset(validation_inputs, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = \
  DataLoader(validation_data, sampler=validation_sampler, batch_size=BATCH_SIZE)

### **Pre-Training Loop**

Similarly to the training loop, the pre-training loop has gone through several versions before this one.

*   In our first version, we set the **number of epochs to be 3** every time, only modifying the learning rate between models.
*   In our second version, we used the **validation accuracy** to decide when to finish training. We would store the best accuracy, and if it decreased we would revert to the previous iteration's model (the one with the best accuracy) and continue to training. Since this method only took accuracy into account, the performance of the models wasn't very reliable.
*   In the final version, we train with **up to 25 epochs**. After this, we use our [Plot_Metrics](https://colab.research.google.com/drive/1Xa6VR26_FpDcx7boygx69mJe4MrAlifM#scrollTo=VEDZ3rjp53ms) file to determine the best models based on our output metrics, load these and continue to training.
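The selection step in the final version can be sketched as follows. We made this choice by inspecting plots, so picking the maximum of a single metric is a simplification; the metric values are made up for illustration and `best_epoch` and `checkpoint_name` are hypothetical helpers. The file name pattern is the one the pre-training loop uses when saving each epoch.

```python
# Made-up validation metrics, one record per saved epoch (illustrative only)
history = [
    {"epoch": 1, "accuracy": 0.55, "f1": 0.48},
    {"epoch": 2, "accuracy": 0.61, "f1": 0.57},
    {"epoch": 3, "accuracy": 0.63, "f1": 0.54},
]

def best_epoch(history, metric):
    """Return the epoch number whose record has the highest value of `metric`."""
    return max(history, key=lambda record: record[metric])["epoch"]

def checkpoint_name(epoch):
    """Build the file name saved for a given pre-training epoch."""
    return "bert-base-uncased-GDO-{}-lang-8.pth".format(epoch)

chosen = best_epoch(history, metric="f1")
print(chosen, checkpoint_name(chosen))
```

Choosing by F1 rather than accuracy can select a different epoch, which is one reason we looked at several metrics together.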

We also previously included a way of plotting the loss per batch (based on the version used [here](https://github.com/sunilchomal/GECwBERT/blob/master/BERT_GED_Model.ipynb)), but with the final version loading already trained models this wasn't possible to include.

In [None]:
##################################
#       Pre-Training Loop        #
##################################

# Store the current epoch number
epochs = 0

# Iterate for up to 25 epochs
while epochs < 25:
  epochs += 1
  
  # Tracking variables (nb = number of, tr = training)
  tr_loss = 0
  nb_tr_steps = 0

  print("Processing Epoch Number: {}".format(epochs))
  

  ## Training ##

  # Set the model to training mode
  model.train()
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from the dataloader
    b_input_ids, b_labels = batch

    # Clear out the gradients
    optimiser.zero_grad()

    # Forward pass
    # Shape of outputs -> (batch_size, num_features)
    #                     (so in this case 'torch.Size([32, 1])')
    outputs = model(b_input_ids)

    # Make b_labels the same shape as outputs and convert to float
    #     (i.e. from 'torch.Size([32])' to 'torch.Size([32, 1])')
    b_labels = b_labels.unsqueeze(1)
    b_labels = b_labels.float()

    # Calculate loss
    loss = criterion(outputs, b_labels)

    # Backward pass
    loss.backward()

    # Update parameters and take a step
    optimiser.step()

    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_steps += 1

  # Average loss per batch for this epoch
  print(" Train loss: {}".format(tr_loss/nb_tr_steps))
    

  ## Validation ##

  # Put model in evaluation mode
  model.eval()

  # Initialise confusion matrix
  confusion = np.zeros((2, 2), dtype=int)

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_labels = batch
    
    # Don't compute or store gradients
    with torch.no_grad():
      # Forward pass, calculate logit predictions (predicted values)
      logits = model(b_input_ids)
      
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Calculate confusion matrix
    confusion += confusion_matrix(logits, label_ids)

  # Print calculated metrics for this epoch
  print(" Confusion Matrix:\n " + str(confusion))
  print("")
  print(" Validation Accuracy: {}".format(accuracy(confusion)))
  print("")
  print(" Validation Correct Recall: {}".format(recall(confusion, CORRECT)))
  print(" Validation Incorrect Recall: {}".format(recall(confusion, INCORRECT)))
  print(" Validation Total Recall: {}".format(macro_avg_recall(confusion)))
  print("")
  print(" Validation Correct Precision: {}".format(precision(confusion, CORRECT)))
  print(" Validation Incorrect Precision: {}".format(precision(confusion, INCORRECT)))
  print(" Validation Total Precision: {}".format(marco_avg_precision(confusion)))
  print("")
  print(" Validation Correct F1: {}".format(f_one_measure(confusion, CORRECT)))
  print(" Validation Incorrect F1: {}".format(f_one_measure(confusion, INCORRECT)))
  print(" Validation Total F1: {}".format(avg_f_one_measure(confusion)))
  print("")


  ## Saving the Model for this Epoch ##

  torch.save(model.state_dict(), 'bert-base-uncased-GDO-trained.pth')
  os.system('cp bert-base-uncased-GDO-trained.pth "{}/bert-base-uncased-GDO-{}-lang-8.pth"'.format(folder_path, str(epochs)))

## **Training**

### **Preparing Datasets**

For **training** we use a 90% sample of the **CoLA in-domain train** dataset (in the uploaded `cola-train.tsv` file). As for pre-training, we tokenize the input sentences using the tokenizer, map the resulting tokens to IDs and pad the outputs to the maximum length in the IDs sequence. When loading the training dataset into a DataLoader, we also use a `RandomSampler` so the indexes will be shuffled for each epoch. 

For **validation** we use the same set as for pre-training: a 10% sample of the **CoLA in-domain train** dataset (in the uploaded `cola-validate.tsv` file). The validation DataLoader created for pre-training is reused unchanged.

In [None]:
# Training Set
df = pd.read_csv(io.BytesIO(uploaded['cola-train.tsv']), delimiter='\t', 
                 header=None, names=['label', 'sentence'])

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values

tokenized_texts = [tokenizer.tokenize(str(sent)) for sent in sentences]

# Convert tokens to IDs and find the length of the longest sentence
padded_sequence = \
  [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]
max_len = max([len(txt) for txt in padded_sequence])

# Pad our input tokens
input_ids = pad_sequences(padded_sequence, maxlen=max_len, dtype="long", 
                          truncating="post", padding="post")

# Create input and label matrices
train_inputs = torch.tensor(input_ids)
train_labels = torch.tensor(labels)

# Create iterator from formatted training data
train_data = TensorDataset(train_inputs, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = \
  DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

### **Training Loop**

Similarly to the pre-training loop, the training loop has gone through several versions before this one.

*   In our first version, we set the **number of epochs to be 5** every time, only modifying the learning rate between models.
*   In our second version, like pre-training, we used the **validation accuracy** to decide when to finish training. We used the same method as for pre-training.
*   In our third version, we used a **set number of epochs for pre-training** and the **validation accuracy for training**, since our models would generally overfit to pre-training data.
*   In the final version, we take a model saved from pre-training, and train for a further **up to 50 epochs**. After this, we again use our [Plot_Metrics](https://colab.research.google.com/drive/1Xa6VR26_FpDcx7boygx69mJe4MrAlifM?usp=sharing) file to determine the final model based on the output metrics.

In [None]:
##################################
#         Training Loop          #
##################################

# Store the current epoch number
epochs = 0

# Iterate for up to 50 epochs
while epochs < 50:
  epochs += 1
  
  # Tracking variables (nb = number of, tr = training)
  tr_loss = 0
  nb_tr_steps = 0

  print("Processing Epoch Number: {}".format(epochs))
  

  ## Training ##

  # Set the model to training mode
  model.train()
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from the dataloader
    b_input_ids, b_labels = batch

    # Clear out the gradients
    optimiser.zero_grad()

    # Forward pass
    # Shape of outputs -> (batch_size, num_features)
    #                     (so in this case 'torch.Size([32, 1])')
    outputs = model(b_input_ids)

    # Make b_labels the same shape as outputs and convert to float
    #     (i.e. from 'torch.Size([32])' to 'torch.Size([32, 1])')
    b_labels = b_labels.unsqueeze(1)
    b_labels = b_labels.float()

    # Calculate loss
    loss = criterion(outputs, b_labels)

    # Backward pass
    loss.backward()

    # Update parameters and take a step
    optimiser.step()

    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_steps += 1

  # Average loss per batch for this epoch
  print(" Train loss: {}".format(tr_loss/nb_tr_steps))
    

  ## Validation ##

  # Put model in evaluation mode
  model.eval()

  # Initialise confusion matrix
  confusion = np.zeros((2, 2), dtype=int)

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_labels = batch
    
    # Don't compute or store gradients
    with torch.no_grad():
      # Forward pass, calculate logit predictions (predicted values)
      logits = model(b_input_ids)
      
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Calculate confusion matrix
    confusion += confusion_matrix(logits, label_ids)

  # Print calculated metrics for this epoch
  print(" Confusion Matrix:\n " + str(confusion))
  print("")
  print(" Validation Accuracy: {}".format(accuracy(confusion)))
  print("")
  print(" Validation Correct Recall: {}".format(recall(confusion, CORRECT)))
  print(" Validation Incorrect Recall: {}".format(recall(confusion, INCORRECT)))
  print(" Validation Total Recall: {}".format(macro_avg_recall(confusion)))
  print("")
  print(" Validation Correct Precision: {}".format(precision(confusion, CORRECT)))
  print(" Validation Incorrect Precision: {}".format(precision(confusion, INCORRECT)))
  print(" Validation Total Precision: {}".format(marco_avg_precision(confusion)))
  print("")
  print(" Validation Correct F1: {}".format(f_one_measure(confusion, CORRECT)))
  print(" Validation Incorrect F1: {}".format(f_one_measure(confusion, INCORRECT)))
  print(" Validation Total F1: {}".format(avg_f_one_measure(confusion)))
  print("")


  ## Saving the Model for this Epoch ##
  
  torch.save(model.state_dict(), 'bert-base-uncased-GDO-trained.pth')
  os.system('cp bert-base-uncased-GDO-trained.pth "{}/bert-base-uncased-GDO-{}-lang-8-cola.pth"'.format(folder_path, str(epochs)))

## **Testing**

### **Preparing Datasets**

For **testing** we use the **CoLA out-of-domain dev** dataset (in the uploaded `cola-test.tsv` file). As before, we tokenize the input sentences using the tokenizer, map the resulting tokens to IDs and pad the outputs to the maximum length in the IDs sequence. When loading the test dataset into a DataLoader, we use a `SequentialSampler` so the examples are always read in the same order.

In [None]:
df = pd.read_csv(io.BytesIO(uploaded['cola-test.tsv']), delimiter='\t', 
                 header=None, names=['label', 'sentence'])

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values

tokenized_texts = [tokenizer.tokenize(str(sent)) for sent in sentences]

# Convert tokens to IDs and find the length of the longest sentence
padded_sequence = [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]
max_len = max([len(txt) for txt in padded_sequence])

# Pad our input tokens
input_ids = pad_sequences(padded_sequence, maxlen=max_len, dtype="long", 
                          truncating="post", padding="post")

# Create input and label matrices
prediction_inputs = torch.tensor(input_ids)
prediction_labels = torch.tensor(labels)

# Create iterator from formatted test data
prediction_data = \
  TensorDataset(prediction_inputs, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = \
  DataLoader(prediction_data, sampler=prediction_sampler, batch_size=BATCH_SIZE)

### **Test Epoch**

This epoch is very similar to a validation epoch. However, rather than computing metrics for each batch separately, the predictions for every batch are collected in a list so that metrics can be calculated over the whole test set at once.
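The collect-then-flatten pattern can be sketched with NumPy as below. The per-batch arrays are made up and mimic the shapes produced by the model: logits of shape `(batch_size, 1)` and labels of shape `(batch_size,)`, with a smaller final batch.

```python
import numpy as np

# Made-up outputs for two batches (the last batch is smaller)
batch_logits = [np.array([[1.2], [-0.7]]), np.array([[0.3]])]
batch_labels = [np.array([1, 0]), np.array([1])]

# Join the batches end to end, then drop the trailing dimension of the logits
flat_logits = np.concatenate(batch_logits).ravel()
flat_labels = np.concatenate(batch_labels)

# A positive logit corresponds to a sigmoid output above 0.5, i.e. label 1
flat_predictions = (flat_logits > 0).astype(int)

print(flat_predictions, flat_labels)
```

Both results are one-dimensional arrays of equal length, so they can be compared element by element or passed to a metric function directly.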

In [None]:
#################################
#          Test Epoch           #
#################################

# Put model in evaluation mode
model.eval()

# Tracking variables
predictions, true_labels = [], []

# Predict
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)

  # Unpack the inputs from our dataloader
  b_input_ids, b_labels = batch

  # Don't compute or store gradients
  with torch.no_grad():
    # Forward pass
    logits = model(b_input_ids)
  
  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to("cpu").numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

# Flatten predictions and true_labels into one NumPy array each
#   (arrays rather than lists, since confusion_matrix calls .astype on them)
flat_logits = np.concatenate(predictions).flatten()
flat_predictions = np.round(sigmoid(flat_logits))
flat_true_labels = np.concatenate(true_labels)

### **Test Evaluation**

We evaluate the test predictions using both the same metrics as before (**recall**, **precision**, **F1 measure** and **accuracy**) and the Matthews correlation coefficient, based on the method [here](https://github.com/sunilchomal/GECwBERT/blob/master/BERT_GED_Model.ipynb).
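For reference, the Matthews correlation coefficient can also be computed directly from the four confusion matrix counts. The sketch below uses made-up counts and a hypothetical helper name; the notebook itself relies on scikit-learn's `matthews_corrcoef`.

```python
import math

def mcc_from_confusion(confusion):
    """Matthews correlation coefficient from a [[TN, FP], [FN, TP]] matrix."""
    tn, fp = confusion[0]
    fn, tp = confusion[1]

    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # If any row or column is empty the coefficient is undefined; 0.0 by convention
    if denominator == 0:
        return 0.0
    return numerator / denominator

confusion = [[40, 10],  # illustrative counts only
             [5, 45]]
print(mcc_from_confusion(confusion))
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and unlike accuracy it stays informative when the two classes are imbalanced.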

In [None]:
##################################
#         Print Metrics          #
##################################

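# Pass the raw logits and labels as NumPy arrays: confusion_matrix applies
#   the sigmoid and rounding itself, and calls .astype on both arguments
confusion = confusion_matrix(np.concatenate(predictions),
                             np.concatenate(true_labels))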

print(" Confusion Matrix:\n " + str(confusion))
print("")
print(" Test Accuracy: {}".format(accuracy(confusion)))
print("")
print(" Test Correct Recall: {}".format(recall(confusion, CORRECT)))
print(" Test Incorrect Recall: {}".format(recall(confusion, INCORRECT)))
print(" Test Total Recall: {}".format(macro_avg_recall(confusion)))
print("")
print(" Test Correct Precision: {}".format(precision(confusion, CORRECT)))
print(" Test Incorrect Precision: {}".format(precision(confusion, INCORRECT)))
print(" Test Total Precision: {}".format(marco_avg_precision(confusion)))
print("")
print(" Test Correct F1: {}".format(f_one_measure(confusion, CORRECT)))
print(" Test Incorrect F1: {}".format(f_one_measure(confusion, INCORRECT)))
print(" Test Total F1: {}".format(avg_f_one_measure(confusion)))

In [None]:
#################################################
#  Matthews Correlation Coefficient Evaluation  #
#################################################

# Evaluate each test batch using the Matthews correlation coefficient
from sklearn.metrics import matthews_corrcoef 

matthews_set = []

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i], 
                               np.round(sigmoid(predictions[i])))

  matthews_set.append(matthews)

matthews_corrcoef(flat_true_labels, flat_predictions)