
# BERT-Based Approach

In this tutorial, we apply a BERT-based approach to the paraphrase identification task. The approach borrows heavily from this [article](https://medium.com/mlearning-ai/nlp-day-26-semantic-similarity-with-bert-and-huggingface-transformers-ce76011d5a51), which applies BERT preprocessing techniques to a different dataset. We will use DistilBERT, a smaller variant of BERT, for both training and prediction. Because fine-tuning is computationally heavy, it is strongly recommended to use a Colab runtime or tier that provides a GPU, as processing will be much faster.

In [None]:
!pip install transformers

In [None]:
# Only PyTorch and the DistilBERT classes are used below; the unused TensorFlow/Keras imports are dropped
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader

In [None]:
import numpy as np

## Import Data

In [None]:
import pandas as pd
import csv

ROOT_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/ms_paraphrase'

data_path = f'{ROOT_PATH}/data'

train_df = pd.read_csv(f'{data_path}/train_df.csv')
test_df = pd.read_csv(f'{data_path}/test_df.csv')

test_df.head()

In [None]:
def format_df(df):
    s1_list = df['s1'].values
    s2_list = df['s2'].values
    labels = df['label'].values
    sentence_pairs = list(zip(s1_list, s2_list))
    return sentence_pairs, labels

One important note: our training loop also uses a validation set to monitor loss and accuracy after each epoch. We will split our test set in half and use one half as the validation set.

In [None]:
from sklearn.model_selection import train_test_split

val_df, test_df = train_test_split(test_df, test_size=0.5)

In [None]:
X, y = format_df(train_df) 

print(X[0])

## Preprocessing

For many NLP tasks, BERT takes three inputs derived from the natural language text:

- **The input ids:** The tokenizer splits the text into tokens and replaces each token with its numerical id from the model's vocabulary. Each sentence pair is concatenated into a single sequence, so we get one array of ids per pair.
- **The token type ids:** Since the pair is concatenated, BERT needs a way to know where the first sentence ends and the second begins. This input is an array of 0s for the tokens of the first sentence, followed by 1s for the tokens of the second sentence. (DistilBERT, the variant used below, does not take this input; we still extract it to show the full BERT input format.)
- **The mask ids:** All input vectors must have the same length, but sentence pairs vary in length. The attention mask tells BERT which positions hold real tokens: it consists of 1s for the tokens of the input text, followed by 0s for the padding that fills the rest of the vector.
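
To make these three arrays concrete, the following toy sketch builds them by hand for one sentence pair. The vocabulary, the `encode_pair` helper, and the whitespace tokenization are illustrative assumptions; the real tokenizer uses sub-word tokenization and a much larger vocabulary. Only the special ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) follow the uncased BERT vocabulary.

```python
# Toy vocabulary: the ids for ordinary words are made up for illustration.
PAD, CLS, SEP = 0, 101, 102
TOY_VOCAB = {"the": 5, "cat": 6, "sat": 7, "a": 8, "rested": 9}


def encode_pair(s1, s2, max_len):
    """Build input ids, token type ids, and attention mask for one pair."""
    ids1 = [TOY_VOCAB[w] for w in s1.lower().split()]
    ids2 = [TOY_VOCAB[w] for w in s2.lower().split()]

    # Layout: [CLS] sentence1 [SEP] sentence2 [SEP]
    input_ids = [CLS] + ids1 + [SEP] + ids2 + [SEP]
    # Segment 0 covers [CLS], sentence 1, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids1) + 2) + [1] * (len(ids2) + 1)
    attention_mask = [1] * len(input_ids)

    # Pad every array to max_len (this sketch assumes the pair fits).
    n_pad = max_len - len(input_ids)
    input_ids += [PAD] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, token_type_ids, attention_mask


ids, types, mask = encode_pair("The cat sat", "A cat rested", max_len=12)
print(ids)    # [101, 5, 6, 7, 102, 8, 6, 9, 102, 0, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Note that the sum of the attention mask equals the number of real (non-padding) positions, which is a quick sanity check on real tokenizer output as well.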

The tokenizer that we get from HuggingFace can produce all three of these arrays in a single call.

In [None]:
def get_inputs(sentence_pairs, tokenizer, max_len):
    encoded = tokenizer.batch_encode_plus(
        sentence_pairs, 
        add_special_tokens=True, 
        max_length = max_len,
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
        padding='max_length',
    )

    input_ids = np.array(encoded['input_ids'], dtype='int32')
    attention_masks = np.array(encoded['attention_mask'], dtype='int32')
    token_type_ids = np.array(encoded['token_type_ids'], dtype='int32')

    return input_ids, token_type_ids, attention_masks

In [None]:
TOKENIZER = DistilBertTokenizer.from_pretrained("distilbert-base-uncased", do_lower_case=True)
MAX_LEN = 150

In [None]:
input_ids, token_type_ids, attention_masks = get_inputs(X, TOKENIZER, MAX_LEN)

Take a minute to inspect each of these arrays and see how they relate to the input text. Recall that the sentence pair is concatenated before these outputs are produced.
- **Input Ids:** Each entry is the numeric id of the corresponding token. The sequence always starts with 101 (`[CLS]`), and each sentence ends with 102 (`[SEP]`); an id such as 1012 just before a 102 is the ordinary token for a period. By aligning `input_ids` with the original text, you can work out which id corresponds to which token (rare words may be split into several sub-word tokens). Once the text runs out, the rest of the vector is padded with 0s.

- **Token Type Ids:** 0s mark the first sentence and 1s mark the second. Once the second sentence runs out, the padding positions are filled with 0s.

- **Attention Masks:** 1s mark the real tokens of both sentences, followed by 0s for the padding positions.


In [None]:
print(X[0])

print('\nInput Ids:', input_ids[0])

print('\nToken type Ids:', token_type_ids[0])

print('\nAttention Mask:', attention_masks[0])

## Batches

BERT is trained on *batches* of input, so we split our input data into batches of a fixed size.

We have 4076 training instances. With batches of size 8, we get 4076/8 = 509.5, which rounds up to 510 batches; the last batch holds the remaining 4 instances.
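
The batch count can be checked with a few lines of arithmetic. This sketch assumes the default `DataLoader` behaviour of keeping a final, smaller batch (`drop_last=False`).

```python
import math

n_train = 4076   # number of training instances quoted above
batch_size = 8

# Round up, because the leftover instances form one extra batch.
num_batches = math.ceil(n_train / batch_size)
last_batch_size = n_train - (num_batches - 1) * batch_size

print(num_batches)      # 510
print(last_batch_size)  # 4
```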


In [None]:
BATCH_SIZE = 8

In [None]:
def create_batches(input_ids, token_type_ids, attention_masks, labels, batch_size):
    t_inputs = torch.tensor(input_ids)
    t_tts = torch.tensor(token_type_ids)
    t_masks = torch.tensor(attention_masks)
    t_labels = torch.tensor(labels)

    t_dataset = TensorDataset(t_inputs, t_tts, t_masks, t_labels)
    batches = DataLoader(t_dataset, shuffle=True, batch_size=batch_size)
    
    return batches

In [None]:
training_batches = create_batches(input_ids, token_type_ids, attention_masks, y, BATCH_SIZE)

In [None]:
len(training_batches)

## BERT Model

Now we can build the actual BERT model. This class is essentially a convenient wrapper around the DistilBERT model that we will be using from HuggingFace. It exposes a `train` function, into which we pass our training batches and validation batches. Once the model is trained, we can pass our test batches into the `predict` function to generate a set of predictions on the test set.

The DistilBERT model will be passed into the constructor (as inner_model) along with an optimizer and device specification. 
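
Inside the class, each row of logits is turned into a class label by taking the index of the largest value. A minimal standard-library sketch, using made-up logits and assuming label 1 means "paraphrase", suggests why a softmax step is optional here: softmax preserves the ordering of the values, so the index of the maximum does not change.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=lambda i: row[i])


# Made-up logits for three sentence pairs (columns: not-paraphrase, paraphrase).
toy_logits = [[1.2, -0.3], [-0.5, 2.1], [0.1, 0.4]]

preds_from_logits = [argmax(row) for row in toy_logits]
preds_from_probs = [argmax(softmax(row)) for row in toy_logits]

print(preds_from_logits)  # [0, 1, 1]
print(preds_from_probs)   # [0, 1, 1]
```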



In [None]:
class BertModel():
    def __init__(self, inner_model, device, optimizer):
        self.__inner_model = inner_model
        self.__device = device
        self.__optimizer = optimizer
        

    def train(self, train_loader, val_loader, epochs):
        for i in range(epochs):
            self._run_training_epoch(i, train_loader, val_loader)


    def predict(self, test_loader):
        y_pred = []
        y_true = []
        y_logits = []

        self.__inner_model.eval()

        for i, batch in enumerate(test_loader):
            _, next_preds, next_labels, next_logits = self._run_batch(i, batch, use_grad=False)
            y_pred.extend(next_preds.to('cpu').numpy())
            y_true.extend(next_labels.to('cpu').numpy())
            y_logits.extend(next_logits.to('cpu').numpy())

        return y_pred, y_true, y_logits


    def _run_training_epoch(self, epoch, train_loader, val_loader):
        print(f'\n\n --- Epoch {epoch+1}')

        # Set to train mode: enables dropout (gradient tracking is controlled by use_grad)
        self.__inner_model.train()
        train_loss, train_acc = self._run_training_step(train_loader, use_grad=True)

        # Set to eval mode
        self.__inner_model.eval()
        val_loss, val_acc = self._run_training_step(val_loader, use_grad=False)

        # Display results
        print(f'Epoch {epoch+1}: train_loss: {train_loss:.4f} train_acc: {train_acc:.4f} | val_loss: {val_loss:.4f} val_acc: {val_acc:.4f}')



    # Can be for either training or validation mode
    def _run_training_step(self, data_loader, use_grad):
        total_loss = 0
        total_acc  = 0

        for i, batch in enumerate(data_loader):
            next_loss, next_preds, next_labels, _ = self._run_batch(i, batch, use_grad=use_grad)
            total_loss += next_loss
            total_acc += self._get_acc(next_preds, next_labels)
        
        loss = total_loss / len(data_loader)
        acc = total_acc / len(data_loader)
        return loss, acc



    # Used for training, validation, and prediction
    def _run_batch(self, i, batch, use_grad):
        # Add batch to device
        batch = tuple(t.to(self.__device) for t in batch)

        # Unpack inputs from the dataloader
        b_inputs, b_tts, b_masks, b_labels = batch

        # Clear gradients
        self.__optimizer.zero_grad()

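        with torch.set_grad_enabled(use_grad):
            # token_type_ids (b_tts) are not passed: DistilBERT does not need them
            outputs = self.__inner_model(b_inputs,
                                         attention_mask=b_masks,
                                         labels=b_labels)
            # Read the fields by name rather than relying on the order of .values()
            loss, logits = outputs.loss, outputs.logits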
        
        preds = self._logits_to_preds(logits)

        if use_grad:
            loss.backward()
            self.__optimizer.step()
        
        if i%500==0:
            print(f'...batch {i}')

        return loss.item(), preds, b_labels, logits


    def _logits_to_preds(self, logits):
        _sm = torch.log_softmax(logits, dim=1)
        _argmax = _sm.argmax(dim=1)
        return _argmax


    def _get_acc(self, y_pred, y_true):
        _comp = (y_pred == y_true)
        _sum = _comp.sum().float()
        _size = float(y_true.size(0))
        return _sum / _size
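
One subtlety in `_run_training_step`: it averages per-batch accuracies, which weights a short final batch the same as a full one. The sketch below, using made-up counts of correct predictions, compares that average with the accuracy computed over all examples. With hundreds of batches the gap is typically tiny, but the two numbers are not identical in general.

```python
# Each tuple is (number correct, batch size); the values are made up.
toy_batches = [(8, 8), (6, 8), (1, 4)]

# Mean of per-batch accuracies, as in _run_training_step.
mean_of_batch_acc = sum(c / n for c, n in toy_batches) / len(toy_batches)

# Accuracy over all examples, weighting every example equally.
overall_acc = sum(c for c, _ in toy_batches) / sum(n for _, n in toy_batches)

print(round(mean_of_batch_acc, 4))  # 0.6667
print(round(overall_acc, 4))        # 0.75
```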
    

## Training

To use our BertModel class, we need to define the building blocks passed to its constructor: the inner DistilBERT model from HuggingFace, an optimizer, and the device.

For the device, we want to use a GPU (cuda:0) if one is available; this is strongly recommended in order to reduce training time. We will use a standard AdamW optimizer.

For the inner model, we use DistilBERT for Sequence Classification. We specify that we want to use distilbert-base-uncased, and that there are 2 labels (paraphrase or not-paraphrase).



In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
inner_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
inner_model.to(device)

In [None]:
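param_optimizer = list(inner_model.named_parameters())
# Parameters whose names contain these substrings are excluded from weight decay.
# DistilBERT names its layer-norm weights '...LayerNorm.weight' or '..._layer_norm.weight'.
no_decay = ['bias', 'LayerNorm.weight', 'layer_norm.weight']
optimizer_grouped_parameters = [
    # torch.optim.AdamW reads the key 'weight_decay' (not 'weight_decay_rate')
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
    'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
    'weight_decay': 0.0}
]

# AdamW with a learning rate of 2e-5; weight decay is set per parameter group above
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=2e-5)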
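
The grouping above is a substring match on parameter names. The sketch below runs the same kind of filter over a handful of hypothetical parameter names; the names, the `split_by_decay` helper, and the substring list are assumptions for illustration and are not read from the model.

```python
def split_by_decay(names, no_decay):
    """Partition names by whether any no_decay substring occurs in them."""
    decay = [n for n in names if not any(nd in n for nd in no_decay)]
    skip = [n for n in names if any(nd in n for nd in no_decay)]
    return decay, skip


# Hypothetical parameter names, for illustration only.
toy_names = [
    "layer.0.attention.q_lin.weight",
    "layer.0.attention.q_lin.bias",
    "layer.0.layer_norm.weight",
    "layer.0.layer_norm.bias",
]

decay, skip = split_by_decay(toy_names, no_decay=["bias", "layer_norm.weight"])
print(decay)  # ['layer.0.attention.q_lin.weight']
print(skip)   # the two biases and the layer-norm weight
```

Printing the two lists for the real model's `named_parameters()` is a quick way to confirm that the intended parameters land in each group.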


In [None]:
model = BertModel(inner_model, device, optimizer)

For training, recall that our training loop takes both a training set and a validation set. Our training batches are ready to go, so we now apply the same preprocessing to the validation set. We also prepare the test set, which will be needed for evaluation.

In [None]:
X_val, y_val = format_df(val_df)
i_val, t_val, m_val = get_inputs(X_val, TOKENIZER, MAX_LEN)
val_batches = create_batches(i_val, t_val, m_val, y_val, BATCH_SIZE) 

In [None]:
X_test, y_test = format_df(test_df)
i_test, t_test, m_test = get_inputs(X_test, TOKENIZER, MAX_LEN)
test_batches = create_batches(i_test, t_test, m_test, y_test, BATCH_SIZE) 

Now we are ready to train. This can take a while, especially if you are not using a GPU. We will train for 2 epochs.

In [None]:
model.train(training_batches, val_batches, 2)

## Test set and Evaluation

Now we can run the trained model against our test set and evaluate its performance.

In [None]:
y_pred, y_true, _ = model.predict(test_batches)

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_true, y_pred))
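
For reference, the per-class numbers in the report follow from simple counts. The sketch below recomputes precision, recall, and F1 for the positive class on made-up labels; the `prf` helper and the toy data are illustrative and are not output from this model.

```python
def prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only.
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 1, 0, 1]

toy_precision, toy_recall, toy_f1 = prf(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)  # 0.75 0.75 0.75
```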