# Natural Language Processing COMM061 Course Work


---



## Selected Experiments

*   data pre-processing techniques
*   text encoding/transformation into numerical vectors
*   choices of loss functions and optimisers
*   hyperparameter optimisation

<span style="color:red">Note:</span> Some parts of the code have been adapted from the second part of Lab 4 ([lab04-1b.ipynb](https://github.com/surrey-nlp/NLP-2024/blob/main/lab04/lab04-1b.ipynb)), so we do not explain them in detail here.

### **Experiment 1: Data Pre-processing Techniques**
In this experiment, we compare three preprocessing techniques: lowercasing, stemming, and N-grams.

Before we proceed, we need to install the necessary libraries. The versions used in Google Colab, where this model was trained, are listed in 'requirements.txt'. Versions are deliberately not pinned here, because the unpinned installs have been checked to be compatible with the IFH lab computers.

In [16]:

%pip install torch
%pip install torchtext
%pip install datasets
%pip install nltk
%pip install torchmetrics
%pip install pytorch-crf
%pip install torcheval
%pip install numpy
%pip install scikit-learn
%pip install matplotlib
%pip install seaborn



Note: you may need to restart the kernel to use updated packages.




Setting up the environment for PyTorch and torchtext, including selecting the device (GPU if available, otherwise CPU).

In [17]:

import torch
import torchtext

SEED = 1234
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

print("PyTorch Version: ", torch.__version__)
print("torchtext Version: ", torchtext.__version__)
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'}.")

PyTorch Version:  2.3.0+cpu
torchtext Version:  0.18.0+cpu
Using CPU.


Downloading the PLOD-CW dataset with the Hugging Face `datasets` library.

In [18]:
from datasets import load_dataset  # load_metric is unused and no longer exists in recent `datasets` releases
dataset = load_dataset("surrey-nlp/PLOD-CW")

PLOD-CW is already divided into train, validation, and test splits. Here we take each split separately, so that the model is trained, validated, and tested on different data, and we drop the `pos_tags` column, which is not used in this work.

In [19]:
train_dataset = dataset["train"].remove_columns("pos_tags")
val_dataset = dataset["validation"].remove_columns("pos_tags")
test_dataset = dataset["test"].remove_columns("pos_tags")

We use the GloVe vocabulary as a yardstick for the preprocessing techniques applied to the PLOD-CW tokens: for each technique, we count how many tokens are missing from the GloVe vocabulary and are therefore mapped to `<unk>`. (GloVe itself is compared with another word embedding in Experiment 2.) Based on these counts, we decide which techniques are useful.

<span style="color:red">Very Important Note:</span> The counts reported in this experiment were obtained with MAX_VOCAB_SIZE = 1000000, which loads the full GloVe 6B vocabulary. To reproduce them, set MAX_VOCAB_SIZE to 1000000 in the cell below. Afterwards, make sure to set it back to 25000 (or another value well below 1000000) and re-run the cell before training; otherwise, the vocabulary IDs can exceed the size of the embedding matrix, and the embedding layer will raise an indexing error during training.

In [20]:
from torchtext import vocab

MAX_VOCAB_SIZE = 25000
glove_vectors = vocab.GloVe(name='6B', dim=100, max_vectors=MAX_VOCAB_SIZE)
text_vocab = vocab.vocab(glove_vectors.stoi, min_freq=0, specials=("<unk>", "<pad>"), special_first=True)
text_vocab.set_default_index(text_vocab["<unk>"])


#### Lower Casing:
We use the functions provided by Hugging Face `datasets`, such as `Dataset.map()`, for dataset operations. More information about these functions can be found here: https://huggingface.co/docs/datasets/en/process

In [21]:
def lower_casing (dataset):
  dataset['tokens'] = list(map(str.lower, dataset['tokens']))
  return dataset
train_dataset_lower = train_dataset.map(lower_casing)
val_dataset_lower = val_dataset.map(lower_casing)
test_dataset_lower= test_dataset.map(lower_casing)

Now we count the tokens that are mapped to ID 0 (`<unk>`, i.e. tokens missing from the GloVe vocabulary) in both `train_dataset` and `train_dataset_lower`.

In [22]:
def zero_ids (dataset):
  count = 0
  for sentence in dataset['tokens']:
    for token in sentence:
      if text_vocab[token] == 0:
        count += 1
  return count

print ('count of zeros in train_dataset: ', zero_ids(train_dataset))
print ('count of zeros in train_dataset_lower: ', zero_ids(train_dataset_lower))

count of zeros in train_dataset:  10041
count of zeros in train_dataset_lower:  7030
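
The `zero_ids` check can be illustrated without downloading GloVe. The sketch below is a minimal pure-Python stand-in: `toy_vocab` is a hypothetical five-entry vocabulary playing the role of GloVe 6B (whose vocabulary is lowercase), and `count_unknown` is an illustrative helper that, like `zero_ids`, counts tokens falling back to the `<unk>` ID 0.

```python
# Hypothetical toy vocabulary standing in for GloVe 6B (lowercase entries only)
toy_vocab = {"<unk>": 0, "<pad>": 1, "relative": 2, "biological": 3, "effectiveness": 4}

def count_unknown(sentences, vocab):
    """Count tokens mapped to the <unk> ID 0, as zero_ids does."""
    return sum(1 for sentence in sentences for token in sentence
               if vocab.get(token, vocab["<unk>"]) == vocab["<unk>"])

sentences = [["Relative", "biological", "effectiveness", "(", "RBE", ")"]]
lowered = [[token.lower() for token in sentence] for sentence in sentences]

print(count_unknown(sentences, toy_vocab))  # 4: "Relative", "(", "RBE", ")"
print(count_unknown(lowered, toy_vocab))    # 3: "(", "rbe", ")"
```

Lowercasing recovers "Relative"; tokens that are missing from the toy vocabulary in any case, such as "rbe", stay unknown.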


#### Stemming:



Here we use NLTK's Porter stemmer together with the Hugging Face `map()` function to stem the tokens in PLOD-CW.


In [23]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def text_stemmer (dataset):
  dataset['tokens'] = list(map(stemmer.stem , dataset['tokens']))
  return dataset

train_dataset_stem = train_dataset.map(text_stemmer)
val_dataset_stem = val_dataset.map(text_stemmer)
test_dataset_stem = test_dataset.map(text_stemmer)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\OWNER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\OWNER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now we compare the number of out-of-vocabulary tokens (ID 0) in `train_dataset` and `train_dataset_stem`.

In [24]:
print ('count of zeros in train_dataset: ', zero_ids(train_dataset))
print ('count of zeros in train_dataset_stem: ', zero_ids(train_dataset_stem))

count of zeros in train_dataset:  10041
count of zeros in train_dataset_stem:  11652



Stemming increases the number of tokens that fall outside the GloVe vocabulary. To see why it can also remove useful information, consider two fragments from the training data: the long form "relative biological effectiveness" and the sentence "it is unclear how effective this switch will be".

The Porter stemmer reduces both "effectiveness" and "effective" to the common root "effect". If we use "effect" in place of "effectiveness" or "effective", we may lose valuable information, because "effectiveness" (adjective + '-ness', denoting a state, quality, or condition) is more likely than "effective" to be labelled I-LF (inside a long form).

Checking IDs for 'effective' and 'effectiveness' in GloVe, we observe that they are represented using different vectors:

In [25]:
print ('effective ID:' , text_vocab['effective'])
print ('effectiveness ID:' , text_vocab['effectiveness'])

effective ID: 2039
effectiveness ID: 8307


#### N-grams:
To see how an N-gram token would be represented, we first check the size of the GloVe feature space for the words 'general' and 'tasks'.




In [26]:
print ('Vectors for word general:' , glove_vectors['general'].shape)
print ('Vectors for word tasks:' , glove_vectors['tasks'].shape)

Vectors for word general: torch.Size([100])
Vectors for word tasks: torch.Size([100])


Both words have vectors of shape torch.Size([100]). To use the bigram 'general tasks' as a single token, we need to concatenate the two tensors, which gives a vector of shape torch.Size([200]).

In [27]:
concatenated_tensor = torch.cat((glove_vectors['general'], glove_vectors['tasks']), dim=0)
print (concatenated_tensor.shape)

torch.Size([200])
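
Extending this to a whole sentence, every pair of adjacent tokens becomes one bigram whose feature vector is the concatenation of the two word vectors. A minimal pure-Python sketch, using hypothetical 3-dimensional vectors instead of the 100-dimensional GloVe ones (`bigram_features` is an illustrative helper):

```python
# Hypothetical 3-dimensional vectors standing in for 100-dimensional GloVe vectors
toy_vectors = {
    "general": [0.1, 0.2, 0.3],
    "tasks": [0.4, 0.5, 0.6],
    "are": [0.7, 0.8, 0.9],
}

def bigram_features(tokens, vectors):
    """Return one feature vector per adjacent token pair, built by concatenation."""
    return {
        f"{left} {right}": vectors[left] + vectors[right]  # list concatenation
        for left, right in zip(tokens, tokens[1:])
    }

features = bigram_features(["general", "tasks", "are"], toy_vectors)
print(list(features))                  # ['general tasks', 'tasks are']
print(len(features["general tasks"]))  # 6, twice the single-word dimension
```

Because each bigram vector is twice as long as a single-word vector, bigram tokens cannot share the 100-dimensional embedding layer used for single words without further processing.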


The first experiment ends here, and from now on, we will build the model and the necessary dependencies for both the model and further experiments.

### **Model and dependencies**

In these experiments, we will compare two different word embedding systems: GloVe and FastText. Ultimately, we will decide which one to use based on their respective advantages and disadvantages.

##### GloVe:

Here we create a `glove_vectors` object with 100-dimensional word embeddings and a vocabulary size of 200,000. `text_vocab_glove` is then built to convert tokens into their corresponding IDs, which index the rows of the embedding layer of the neural network. Two special tokens are placed first: `<unk>` (ID 0), which is also set as the default index for out-of-vocabulary tokens, and `<pad>` (ID 1), which is used for padding. Because `glove_vectors.stoi` maps each word to its row index rather than to a frequency, we set `min_freq=0` so that every word is kept (with the default `min_freq=1`, the word stored at index 0 would be dropped). Before using GloVe on the dataset, we also create an object for FastText.

In [28]:
from torchtext import vocab
MAX_VOCAB_SIZE = 200_000
glove_vectors = vocab.GloVe(name='6B', dim = 100 , max_vectors=MAX_VOCAB_SIZE)
text_vocab_glove = vocab.vocab(glove_vectors.stoi, min_freq=0, specials=("<unk>", "<pad>"), special_first=True)
text_vocab_glove.set_default_index(text_vocab_glove["<unk>"])
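
To see why `min_freq=0` matters here, the sketch below approximates how a vocabulary is built from an ordered token-to-value mapping: tokens whose value is at least `min_freq` are kept, after the special tokens. `build_vocab` is a hypothetical helper, not torchtext's implementation, and `stoi_like` is a hypothetical stand-in for `glove_vectors.stoi`, whose values are row indices starting at 0 rather than frequencies.

```python
def build_vocab(ordered_values, min_freq, specials=("<unk>", "<pad>")):
    """Keep tokens whose value is >= min_freq, with the special tokens placed first."""
    kept = [token for token, value in ordered_values.items() if value >= min_freq]
    return {token: idx for idx, token in enumerate(list(specials) + kept)}

# Hypothetical word -> row-index mapping, like glove_vectors.stoi
stoi_like = {"the": 0, "of": 1, "to": 2}

print(build_vocab(stoi_like, min_freq=0))  # {'<unk>': 0, '<pad>': 1, 'the': 2, 'of': 3, 'to': 4}
print(build_vocab(stoi_like, min_freq=1))  # {'<unk>': 0, '<pad>': 1, 'of': 2, 'to': 3}
```

With `min_freq=1`, the word stored at row 0 disappears and every later ID shifts by one, so the IDs would no longer line up with the rows of the pretrained embedding matrix.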

##### FastText:
We create a `fasttext_vectors` object, similar to `glove_vectors`, with a vocabulary size of 200,000. However, since the FastText vectors provided by torchtext are not available in 100 dimensions, we reduce their dimensionality with PCA (principal component analysis), which projects the data onto the directions of greatest variance and so preserves most of its patterns and trends. After the reduction, we replace `fasttext_vectors.vectors` with the reduced vectors and set the object's `dim` attribute to 100, the new dimensionality.

In [29]:
from sklearn.decomposition import PCA

new_dim = 100
MAX_VOCAB_SIZE = 200_000
fasttext_vectors = vocab.FastText(language='en', max_vectors=MAX_VOCAB_SIZE)
vectors = fasttext_vectors.vectors.numpy()
pca = PCA(n_components= new_dim)
vectors_reduced = pca.fit_transform(vectors)
fasttext_vectors.vectors = torch.tensor(vectors_reduced, dtype=torch.float32)  # float32 to match the model weights
fasttext_vectors.dim = new_dim 

text_vocab_fasttext = vocab.vocab(fasttext_vectors.stoi, min_freq=0, specials=("<unk>", "<pad>"), special_first=True)
text_vocab_fasttext.set_default_index(text_vocab_fasttext["<unk>"])
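
For intuition about the PCA step, here is a minimal NumPy sketch: centre the data, take its singular value decomposition, and project onto the top components. scikit-learn's `PCA` follows the same idea, but this is a simplified illustration rather than its exact implementation. Random data stands in for the FastText vectors, assumed here to be 300-dimensional, and `pca_reduce` is a hypothetical helper.

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project the rows of `vectors` onto their top `n_components` principal directions."""
    centred = vectors - vectors.mean(axis=0)  # PCA operates on mean-centred data
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(1000, 300))  # stand-in for 300-d FastText vectors
reduced = pca_reduce(fake_vectors, 100)
print(reduced.shape)  # (1000, 100)
```

The first reduced dimension carries the most variance and each following one carries less, which is why keeping the first 100 components retains as much of the spread of the original vectors as any 100-dimensional linear projection can.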

We also create a function to initialize pretrained embeddings for use in the embedding layer of the neural network during the comparison.

In [30]:
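def pretrained_embedding (embedding_vectors):
   # Rows must follow the vocabulary order: <unk> (ID 0), <pad> (ID 1), then the pretrained words
   pretrained_embeddings = torch.cat([
    torch.empty(1, embedding_vectors.dim).normal_(),  # <unk>: random vector drawn from N(0, 1)
    torch.zeros(1, embedding_vectors.dim),            # <pad>: zero vector
    embedding_vectors.vectors])                       # one row per pretrained word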
   return pretrained_embeddings
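
The row order above is what keeps token IDs and vectors aligned: the vocabulary places `<unk>` and `<pad>` first, so the vector of a word stored at row i of the pretrained vectors ends up at row i + 2 of the embedding matrix. A pure-Python sketch of that layout, with a hypothetical two-word vocabulary and 2-dimensional vectors (`build_embedding_rows` is an illustrative helper):

```python
import random

def build_embedding_rows(word_vectors, dim, seed=0):
    """Stack <unk>, <pad>, and pretrained rows in the same order as the vocabulary."""
    rng = random.Random(seed)
    unk_row = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # ID 0: random vector for <unk>
    pad_row = [0.0] * dim                                 # ID 1: zero vector for <pad>
    return [unk_row, pad_row] + word_vectors              # IDs 2..: pretrained vectors

words = ["the", "of"]                      # hypothetical two-word vocabulary
word_vectors = [[0.5, 0.5], [-0.5, 0.25]]  # hypothetical 2-dimensional vectors
itos = ["<unk>", "<pad>"] + words          # specials first, as with special_first=True
rows = build_embedding_rows(word_vectors, dim=2)

print(itos.index("of"), rows[itos.index("of")])  # 3 [-0.5, 0.25]
```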

Before comparing these two systems, we need to create our neural network and its dependencies. First, we encode the labels as integers.

In [31]:

# Map each string tag to an integer ID; 4 is reserved for the padding label later on
LABEL2ID = {'B-O': 0, 'B-AC': 1, 'B-LF': 2, 'I-LF': 3}

def encoding_ner(dataset):
    # Raises KeyError on an unexpected tag instead of silently dropping it,
    # which would misalign tokens and labels
    dataset['ner_tags'] = [LABEL2ID[tag] for tag in dataset['ner_tags']]
    return dataset

train_dataset = train_dataset_lower.map(encoding_ner)
val_dataset = val_dataset_lower.map(encoding_ner)
test_dataset = test_dataset_lower.map(encoding_ner)



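We also compute class weights for the weighted cross-entropy loss used later. To do this, we collect all the labels in one list and pass it to `compute_class_weight` from `sklearn.utils.class_weight` with `class_weight='balanced'`, which gives rarer classes larger weights. A fifth weight (0.01) is then appended for the padding label 4, because the weight tensor must have one entry per output class; since the loss is created with `ignore_index=4`, this value does not affect the loss.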


In [32]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

def get_labels(dataset): 
   label_list = []
   for sentence in dataset['ner_tags']:
      for label in sentence:
         label_list.append(label)
   return label_list

train_dataset_lis =  get_labels(train_dataset)
val_dataset_lis =  get_labels(val_dataset)
test_dataset_lis =  get_labels(test_dataset)
final_list = train_dataset_lis + val_dataset_lis + test_dataset_lis

class_weights = compute_class_weight(class_weight ='balanced', classes= np.unique(final_list),y= final_list)
class_weights = np.append(class_weights , 0.01)
class_weights = torch.tensor(class_weights, dtype=torch.float)
print (class_weights)

tensor([0.3010, 4.3569, 7.0982, 3.2501, 0.0100])
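
scikit-learn documents the 'balanced' heuristic as `n_samples / (n_classes * np.bincount(y))`. The pure-Python sketch below recomputes it on hypothetical toy labels dominated by class 0 (B-O), which shows why the B-O weight printed above is below 1 while the rarer classes get larger weights. `balanced_class_weights` is an illustrative helper, not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count_c) for each class c, in sorted class order."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]

toy_labels = [0] * 8 + [1] * 2 + [2] * 1 + [3] * 1  # hypothetical, mostly B-O
weights = balanced_class_weights(toy_labels)
print(weights)  # [0.375, 1.5, 3.0, 3.0]
```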


Now we pair each sentence with its labels as (tokens, labels) tuples, so that the datasets can be converted into torchtext map-style datasets.

In [33]:
short_dataset = tuple(zip(train_dataset['tokens'], train_dataset['ner_tags']))
val_dataset = tuple(zip(val_dataset['tokens'], val_dataset['ner_tags']))
test_dataset = tuple(zip(test_dataset['tokens'], test_dataset['ner_tags']))

Now we use the `to_map_style_dataset` function from `torchtext.data.functional` to convert the datasets.

In [34]:
from torchtext.data.functional import to_map_style_dataset

train_data = to_map_style_dataset(short_dataset)
val_data = to_map_style_dataset(val_dataset)
test_data = to_map_style_dataset(test_dataset)



Here we create a module that computes the length of each sentence in a batch. These lengths are needed later by `pack_padded_sequence`, so that the LSTM skips the padding positions.

In [35]:
class ToLengths(torch.nn.Module):
    def forward(self, input):
        if isinstance(input[0], list):
            lengths = []
            for text in input:
                lengths.append(len(text))
            return lengths
        elif isinstance(input, list):
            return len(input)
        raise ValueError(f"Type {type(input)} is not supported.")

Next, we create the preprocessing pipeline with `torchtext.transforms`. No tokenizer (such as spaCy) is needed, since the PLOD-CW dataset is already tokenized. The labels are padded with the value 4, so that after padding they have the same length as their corresponding sentences.

In [36]:
import torchtext.transforms as T

text_transform = T.Sequential(
    T.VocabTransform(text_vocab),  # Convert tokens to vocab IDs
    T.ToTensor(padding_value=text_vocab["<pad>"]),  # Convert to tensor and pad
)

label_transform = T.Sequential(
    T.ToTensor(padding_value=4),  # Convert to tensor, padding labels with 4
)

lengths_transform = T.Sequential(
    ToLengths(),
    T.ToTensor(),
)
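
`T.ToTensor(padding_value=...)` pads every sequence in a batch on the right to the length of the longest one. The pure-Python sketch below shows what the pipeline produces for a hypothetical two-sentence batch; `pad_batch` is an illustrative helper, not a torchtext function, and the IDs are made up.

```python
PAD_ID = 1     # ID of "<pad>" in the vocabularies built above
PAD_LABEL = 4  # padding value used for the labels

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

token_ids = [[5, 9, 2], [7]]               # hypothetical vocabulary IDs
labels = [[0, 1, 0], [2]]                  # matching label IDs
lengths = [len(seq) for seq in token_ids]  # what ToLengths returns for a batch

print(pad_batch(token_ids, PAD_ID))  # [[5, 9, 2], [7, 1, 1]]
print(pad_batch(labels, PAD_LABEL))  # [[0, 1, 0], [2, 4, 4]]
print(lengths)                       # [3, 1]
```

The lengths `[3, 1]` are what `pack_padded_sequence` uses in the model to skip the padded positions.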



Finally, we define the `collate_batch` function, which the data loaders use to turn a list of examples into padded tensors.

In [37]:
BATCH_SIZE = 16

In [38]:

from torch.utils.data import DataLoader

def collate_batch(batch):
    texts,labels = zip(*batch)
    lengths = lengths_transform(list(texts))
    texts = text_transform(list(texts))
    labels = label_transform (list(labels))
    return labels.long().to(DEVICE), texts.to(DEVICE), lengths.cpu()

def _get_dataloader(data):
    return DataLoader(data,batch_size=BATCH_SIZE, shuffle= False, collate_fn=collate_batch)


Now we are ready to build the neural network. We define a class named `RNN` that inherits from `nn.Module` and combines a frozen pretrained embedding layer, a (bidirectional) LSTM created with `torch.nn.LSTM`, and a linear output layer that produces one score per class for every token. Packing the padded batch with `pack_padded_sequence` lets the LSTM ignore the padding positions. Wrapping `torch.nn.LSTM` in a custom module like this follows the style of the lab.

In [39]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        self.out_dim = output_dim
        self.num_directions = 2 if bidirectional else 1
        self.h_dim = hidden_dim
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True, padding_idx=pad_idx)
        self.rnn = nn.LSTM(pretrained_embeddings.shape[1],
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout,
                           batch_first=True)
        self.fc = nn.Linear(hidden_dim * self.num_directions, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text, lengths):
        embedded = self.dropout(self.embedding(text))
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        return self.fc(output)

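Setting the model hyperparameters. The pretrained embeddings are not set here, because they are passed in separately for each experiment. `OUTPUT_DIM` is 5: the four tag classes plus the padding label 4.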

In [40]:
HIDDEN_DIM = 256
OUTPUT_DIM = 5
N_LAYERS = 4
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = text_vocab["<pad>"]

We define the `categorical_accuracy` function to compute accuracy in each epoch. Unlike the lab, we do not use binary accuracy, because this task is entirely different: it has several label classes rather than two.

Padding positions appear in the labels (and therefore also receive predictions), so we mask them out with `ignore_index` to prevent them from affecting the accuracy. The output of the BiLSTM model is a tensor of unnormalised class scores (logits) with shape [batch_size, sequence_length, output_dim]. `torch.max()` returns the index of the highest-scoring class, which is the predicted label for that token. The same masking and argmax steps are reused in the metric and plotting code below, so we do not explain them again.

In [41]:

def categorical_accuracy(preds, labels,ignore_index = None):
    _, predicted = torch.max(preds, 1)
    if ignore_index is not None:
        mask = labels != ignore_index
        predicted = predicted[mask]
        labels = labels[mask]
    correct = (predicted == labels).float()
    accuracy = correct.sum() / len(correct)
    return accuracy
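
A pure-Python sketch of the same computation on hypothetical scores for four token positions, the last of which is padding: take the highest-scoring class at each position, drop positions whose gold label is the padding label 4, and average the matches. `masked_accuracy` is an illustrative helper mirroring `categorical_accuracy`.

```python
PAD_LABEL = 4

def masked_accuracy(scores, labels, ignore_index=PAD_LABEL):
    """Accuracy over the positions whose gold label is not `ignore_index`."""
    pairs = [
        (max(range(len(row)), key=row.__getitem__), gold)  # argmax over class scores
        for row, gold in zip(scores, labels)
        if gold != ignore_index
    ]
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

scores = [[2.0, 0.1, 0.3, 0.0, 0.0],  # argmax 0
          [0.1, 1.5, 0.2, 0.0, 0.0],  # argmax 1
          [0.9, 0.1, 0.2, 0.0, 0.0],  # argmax 0
          [0.0, 0.0, 0.0, 0.0, 3.0]]  # padding position, ignored
labels = [0, 1, 2, 4]
print(masked_accuracy(scores, labels))  # 2 of the 3 real tokens are correct
```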

We use `multiclass_f1_score` from torcheval to compute the weighted F1 score on the test data. Its default is `average="micro"`, so `average="weighted"` has to be passed explicitly.

In [42]:

from torcheval.metrics.functional import multiclass_f1_score

def f1_score(preds, labels,ignore_index = None):
    
    _, predicted = torch.max(preds, 1)
    if ignore_index is not None:
        mask = labels != ignore_index
        predicted = predicted[mask]
        labels = labels[mask]
    weighted_f1_score = multiclass_f1_score(predicted, labels, num_classes=5, average="weighted")  # default would be "micro"
   
    return weighted_f1_score
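
The choice of `average` matters. For single-label multiclass predictions, micro-averaged F1 equals plain accuracy, whereas weighted F1 averages the per-class F1 scores weighted by each class's gold-label support, so a missed rare class pulls the score down. A pure-Python sketch on hypothetical labels (the helper names are illustrative, not torcheval functions):

```python
def class_f1(preds, golds, cls):
    """F1 for one class: 2TP / (2TP + FP + FN)."""
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

def weighted_f1(preds, golds):
    """Per-class F1 averaged with weights proportional to gold-label support."""
    return sum(class_f1(preds, golds, c) * golds.count(c) for c in set(golds)) / len(golds)

def micro_f1(preds, golds):
    """Micro-averaged F1; for single-label multiclass data this equals accuracy."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

golds = [0, 0, 0, 0, 1, 2]  # hypothetical gold labels (1 is a rare class)
preds = [0, 0, 0, 0, 0, 2]  # the rare class 1 is never predicted

print(round(micro_f1(preds, golds), 3))     # 0.833
print(round(weighted_f1(preds, golds), 3))  # 0.759
```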



We create functions for training and evaluating the model. `CrossEntropyLoss` expects predictions of shape [N, num_classes] and targets of shape [N], so we flatten the batch and sequence dimensions of both predictions and labels before passing them to it. The CRF, in contrast, takes the unflattened tensors together with a mask that excludes the padding label 4. The CRF returns the log-likelihood of the gold tag sequence, which is negative, so we multiply it by -1 to obtain the loss (the negative log-likelihood) that is minimised.

In [None]:
from tqdm import tqdm

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()
    for batch in tqdm(iterator, desc="\tTraining"):
        optimizer.zero_grad()

        labels, texts, lengths = batch  
        predictions = model(texts, lengths)
        if isinstance(criterion, nn.CrossEntropyLoss):
            loss = criterion(predictions.view(-1, OUTPUT_DIM), labels.view(-1))
        elif isinstance(criterion, CRF):
            mask = labels != 4
            loss = criterion(predictions, labels , mask = mask , reduction='token_mean' ) * (-1)
        acc = categorical_accuracy(predictions.view(-1, OUTPUT_DIM), labels.view(-1),ignore_index=4)
        
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator) 

In [None]:
from tqdm import tqdm

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()
   
    with torch.no_grad():
        for batch in tqdm(iterator, desc="\tEvaluation"):
            labels, texts, lengths = batch  
            predictions = model(texts, lengths)
            if isinstance(criterion, nn.CrossEntropyLoss):
                loss = criterion(predictions.view(-1, OUTPUT_DIM), labels.view(-1))
            elif isinstance(criterion, CRF):
                mask = labels != 4
                loss = criterion(predictions, labels , mask = mask , reduction='token_mean' ) * (-1)
            acc = categorical_accuracy(predictions.view(-1, OUTPUT_DIM), labels.view(-1),ignore_index=4)
            epoch_loss += loss.item()
            epoch_acc += acc.item()    
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
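
Why the CRF output is negated: a linear-chain CRF scores a whole tag sequence (start score, emission and transition scores, end score) and normalises over all possible sequences, so its log-likelihood log p(y | x) is never positive. The sketch below computes this on hypothetical toy scores with two tags and three tokens. It enumerates every tag sequence instead of using the forward algorithm that libraries such as pytorch-crf use, so it only illustrates the sign and is not pytorch-crf's implementation; the function names are illustrative.

```python
import math
from itertools import product

def sequence_score(emissions, tags, start, trans, end):
    """Unnormalised score of one tag sequence under a linear-chain CRF."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += trans[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score + end[tags[-1]]

def log_likelihood(emissions, tags, start, trans, end):
    """log p(tags | emissions), normalising by brute-force enumeration of all sequences."""
    num_tags = len(start)
    scores = [sequence_score(emissions, list(seq), start, trans, end)
              for seq in product(range(num_tags), repeat=len(emissions))]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return sequence_score(emissions, tags, start, trans, end) - log_z

# Hypothetical scores: 3 tokens, 2 tags
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
start, end = [0.1, -0.1], [0.0, 0.2]
trans = [[0.5, -0.3], [0.2, 0.4]]

ll = log_likelihood(emissions, [0, 1, 0], start, trans, end)
loss = -ll  # the "* (-1)" step in train() and evaluate()
print(ll <= 0, loss >= 0)  # True True
```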


Here we add an `EarlyStopping` class that halts training when the validation loss has not improved on its best value for `patience` consecutive epochs. We set the default patience to 5, but it can be changed depending on the task. This code has been inspired by the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/callbacks/early_stopping.html#EarlyStopping).

In [None]:
class EarlyStopping:
    def __init__(self, patience=5, verbose=False):
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.val_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

    def __call__(self, val_loss, model):
        if self.best_score is None:
            self.best_score = val_loss
            self.save_checkpoint(val_loss, model)
        elif val_loss > self.best_score:
            self.counter += 1
            print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = val_loss
            self.save_checkpoint(val_loss, model)
            self.counter = 0

    def save_checkpoint(self, val_loss, model):
        if self.verbose:
            print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}).  Saving model ...')
        torch.save(model.state_dict(), 'checkpoint.pt')
        self.val_loss_min = val_loss
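
The patience logic can be traced on a list of validation losses without training a model. `early_stop_epoch` is a hypothetical pure-Python helper that mirrors `EarlyStopping.__call__` (a loss equal to or lower than the best resets the counter; a higher loss increments it) and returns the epoch at which training would stop.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best, counter = None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best is None or loss <= best:
            best, counter = loss, 0  # improvement (or tie): reset the counter
        else:
            counter += 1             # no improvement on the best loss so far
            if counter >= patience:
                return epoch
    return None

losses = [0.90, 0.70, 0.72, 0.69, 0.80, 0.81, 0.82]  # hypothetical validation losses
print(early_stop_epoch(losses, patience=3))  # 7
print(early_stop_epoch(losses, patience=5))  # None
```

Because the counter resets whenever the loss matches or beats the best value, only consecutive non-improving epochs count toward the patience.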

We add an `epoch_time` function that converts the duration of a training epoch into minutes and seconds.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We wrap the training and validation loop in `epoch_trainer()`, and the evaluation of the best saved model on the test set in `tester()`, so that they can be called when conducting each experiment.

In [None]:
def epoch_trainer(num_epochs, Early_stop):
    best_valid_loss = float('inf')
    print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'} for training.")
    epo_train_loss = []
    epo_valid_loss = []
    epo_train_acc = []
    epo_valid_acc = []
    for epoch in range(num_epochs):
        print(f'Epoch: {epoch+1:02}')
        start_time = time.time()

        train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

        valid_loss, valid_acc = evaluate(model, valid_dataloader, criterion)
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

        end_time = time.time()
        epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        print(f'\tEpoch Time: {epoch_mins}m {epoch_secs}s')

        # Keep the weights with the lowest validation loss; tester() reloads them
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'tut2-model.pt')

        # Record this epoch's metrics before a possible early stop, so none are lost
        epo_train_loss.append(train_loss)
        epo_valid_loss.append(valid_loss)
        epo_train_acc.append(train_acc)
        epo_valid_acc.append(valid_acc)

        if Early_stop is not None:
            Early_stop(valid_loss, model)
            if Early_stop.early_stop:
                print("Early stopping")
                break

    return epo_train_loss, epo_valid_loss, epo_train_acc, epo_valid_acc

In [None]:
def tester():
    model.load_state_dict(torch.load('tut2-model.pt'))
    test_loss, test_acc  = evaluate(model, test_dataloader, criterion)
    print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
   

The main function used for experimentation is defined below. In each experiment, we pass the relevant arguments to this function, except in the hyperparameter optimisation experiment: the model's hyperparameters are already defined globally and can be changed without an extra function argument. The `param_return` argument makes `experiment_conductor` return the per-epoch training and validation losses and accuracies, which we need to draw the related graphs. When the loss function is a CRF, each of its parameters is re-initialised with `torch.nn.init.uniform_()` to a random value drawn uniformly from [-1, 1] before training.

In [None]:

import torch.optim as optim
from torchcrf import CRF

def experiment_conductor(text_vocab, pretrained_embeddings, Loss_func, learn_rate=0.001, param_return=False, epochs=7, early_stoping=None):
    global train_dataloader, valid_dataloader, test_dataloader
    global model, criterion, optimizer
    global text_transform

    # Rebuild the text pipeline with the vocabulary passed in, so that token IDs
    # match the rows of the embedding matrix built from the same vectors
    text_transform = T.Sequential(
        T.VocabTransform(text_vocab),
        T.ToTensor(padding_value=text_vocab["<pad>"]),
    )

    train_dataloader = _get_dataloader(train_data)
    valid_dataloader = _get_dataloader(val_data)
    test_dataloader = _get_dataloader(test_data)

    pretrained_embeddings = pretrained_embedding(pretrained_embeddings)
    model = RNN(pretrained_embeddings, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX).to(DEVICE)

    criterion = Loss_func
    if isinstance(criterion, CRF):
        # Re-initialise the CRF transition scores uniformly in [-1, 1]
        for p in criterion.parameters():
            torch.nn.init.uniform_(p, -1, 1)
    criterion = criterion.to(DEVICE)

    # The CRF transition scores are trainable, so they are optimised together with
    # the model (for CrossEntropyLoss, criterion.parameters() is empty)
    params = list(model.parameters()) + list(criterion.parameters())
    optimizer = optim.Adam(params, lr=learn_rate)

    epo_t_loss, epo_v_loss, epo_t_acc, epo_v_acc = epoch_trainer(epochs, early_stoping)
    tester()
    if param_return:
        return epo_t_loss, epo_v_loss, epo_t_acc, epo_v_acc
        


#### **Visualizing**
We define some helper functions for drawing the related graphs, such as the confusion matrix and loss curves. The code in this part has been adapted from the [Matplotlib](https://matplotlib.org/stable/gallery/index.html) documentation.

Several metrics are also calculated in `conusion_matrix`. Its `draw` and `param_return` arguments let us obtain the test-set metrics without plotting the confusion matrix. All metrics are computed with the torcheval library (more information can be found in the [torcheval documentation](https://pytorch.org/torcheval/stable/torcheval.metrics.html)): precision and recall are reported for each class as well as a weighted average, while the F1 score is reported only as a weighted average.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def conusion_matrix(model, iterator, draw=True, param_return=False):
    from torchmetrics import ConfusionMatrix
    from torcheval.metrics.functional import multiclass_f1_score, multiclass_precision, multiclass_recall

    model.eval()
    labels_con, prediction_con = [], []
    with torch.no_grad():  # gradients are not needed for evaluation
        for labels, texts, lengths in iterator:
            predictions = model(texts, lengths)
            labels_con.append(labels.view(-1))
            prediction_con.append(predictions.view(-1, OUTPUT_DIM))
    labels_con = torch.cat(labels_con)
    prediction_con = torch.cat(prediction_con)

    # Drop padding positions (label 4) outside `if draw`, so the metrics also work with draw=False
    mask = labels_con != 4
    # Pick the best of the four real classes, since num_classes=4 is used below
    _, prediction = torch.max(prediction_con[mask][:, :4], 1)
    labels = labels_con[mask].long()

    if draw:
        conf = ConfusionMatrix(task="multiclass", num_classes=4).to(DEVICE)
        confusion_matrix = conf(prediction, labels).cpu().numpy()

        plt.figure(figsize=(8, 6))
        sns.set(font_scale=1.2)
        sns.heatmap(confusion_matrix, annot=True, cmap="Blues", fmt="d", cbar=False,
                    xticklabels=['B-O', 'B-AC', 'B-LF', 'I-LF'],
                    yticklabels=['B-O', 'B-AC', 'B-LF', 'I-LF'])
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.title('Confusion Matrix')
        plt.show()

    precision = multiclass_precision(prediction, labels, average=None, num_classes=4).tolist()
    precision_weighted = multiclass_precision(prediction, labels, average='weighted', num_classes=4).item()
    recall = multiclass_recall(prediction, labels, average=None, num_classes=4).tolist()
    recall_weighted = multiclass_recall(prediction, labels, average='weighted', num_classes=4).item()
    weighted_f1_score = multiclass_f1_score(prediction, labels, average='weighted', num_classes=4).item()

    print(f'\t Precision values for test (B-O: {precision[0]:.2f} | B-AC: {precision[1]:.2f} | B-LF: {precision[2]:.2f} | I-LF: {precision[3]:.2f} | Overall: {precision_weighted:.2f})')
    print(f'\t Recall values for test (B-O: {recall[0]:.2f} | B-AC: {recall[1]:.2f} | B-LF: {recall[2]:.2f} | I-LF: {recall[3]:.2f} | Overall: {recall_weighted:.2f})')
    print(f'\t Weighted F1 score: {weighted_f1_score:.2f}')

    if param_return:
        return precision_weighted, recall_weighted, weighted_f1_score

In [None]:
def loss_curve(train_loss , valid_loss):

    train_values = train_loss
    val_values = valid_loss

    plt.plot(train_values, label='Training Loss', color='blue')
    plt.plot(val_values, label='Validation Loss', color='red')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss Curves')
    plt.legend()
    plt.grid(True)

    
    plt.show()


In [None]:
def accuracy_curve (value_list , train_acc ,valid_acc , dim , label , epo):
    
    values = value_list 
    epochs = range(1, epo + 1)  
    train_accuracies = train_acc 
                   
    val_accuracies = valid_acc  
                    
    fig, axes = plt.subplots(dim[0], dim[1], figsize=(12, 8), sharex=True, sharey=True)

    for i, ax in enumerate(axes.flat):
        ax.plot(epochs, train_accuracies[i], label='Training Accuracy')
        ax.plot(epochs, val_accuracies[i], label='Validation Accuracy')
        ax.set_title(f'{label}: {values[i]}')
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Accuracy')
        ax.legend()
        ax.grid(True)

    plt.tight_layout()
    plt.show()

In [None]:
def Heatmap ( lr_rates , final_validation_loss , final_Train_loss , final_val_accuracy ,final_train_accuracy ):

    learning_rates = lr_rates
    final_validation_loss = final_validation_loss
    final_Train_loss = final_Train_loss 
    final_val_accuracy = final_val_accuracy  
    final_train_accuracy = final_train_accuracy

    metrics = np.array([final_validation_loss, final_Train_loss, final_val_accuracy, final_train_accuracy ])

    plt.figure(figsize=(8, 6))
    plt.imshow(metrics, cmap='viridis', interpolation='nearest')

    plt.xticks(np.arange(len(learning_rates)), learning_rates)
    # Tick labels follow the row order of `metrics` above
    plt.yticks(np.arange(4), ['Final Validation Loss', 'Final Train Loss', 'Final Validation Accuracy', 'Final Train Accuracy'])

    plt.colorbar(label='Metric Value')

    plt.title('Performance of Different Learning Rates')
    plt.xlabel('Learning Rate')
    plt.ylabel('Metric')

    plt.show()

In [None]:
def bar_chart(Param_values, final_train_accuracy, final_valid_accuracy, label):
    plt.figure(figsize=(8, 6))
    bar_width = 0.25
    index = range(len(Param_values))
    # Training bars sit at i; validation bars are shifted right by one bar width.
    plt.bar(index, final_train_accuracy, bar_width, label='Final Train Accuracy')
    plt.bar([i + bar_width for i in index], final_valid_accuracy, bar_width, label='Final Valid Accuracy')

    # Write each value just above its bar.
    for i in index:
        plt.text(i, final_train_accuracy[i], f'{final_train_accuracy[i]:.2f}', ha='center', va='bottom')
        plt.text(i + bar_width, final_valid_accuracy[i], f'{final_valid_accuracy[i]:.2f}', ha='center', va='bottom')

    plt.title(f'Final Train and Validation Accuracy for Different {label}')
    plt.xlabel(label)
    plt.ylabel('Accuracy')
    # Centre each tick label between its pair of bars.
    plt.xticks([i + bar_width / 2 for i in index], Param_values)
    plt.legend()

    # Zoom into the upper accuracy range; any bar below 0.7 will be cut off.
    plt.ylim(0.7, 1.0)

    plt.tight_layout()
    plt.show()

### **Experiment 2: Text Encoding/Transformation into Numerical Vectors**
In this experiment, we compare two pre-trained word embeddings, GloVe and FastText, and choose one based on their respective advantages and disadvantages. All other components, such as the loss function, optimiser, and hyperparameters, remain unchanged throughout this experiment.
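
Besides downstream accuracy, one concrete way to compare the two embeddings is vocabulary coverage: the share of training tokens that actually have a pre-trained vector. The helper `embedding_coverage` below is a hypothetical sketch, not part of the notebook. Assuming `glove_vectors` and `fasttext_vectors` are `torchtext.vocab.Vectors` objects, their `stoi` dictionaries could be passed as `embedding_vocab`.

```python
def embedding_coverage(tokens, embedding_vocab):
    """Return the fraction of token occurrences that have a pre-trained vector.

    tokens: iterable of (already pre-processed) tokens from the training data.
    embedding_vocab: any container supporting `in`, e.g. a set or a word->index dict.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0
    covered = sum(1 for tok in tokens if tok in embedding_vocab)
    return covered / len(tokens)

# Toy example: "lcr" has no vector in this made-up vocabulary.
print(embedding_coverage(["the", "pcr", "assay", "lcr"], {"the", "pcr", "assay"}))
```

Tokens without a vector typically fall back to an unknown or zero vector, so lower coverage on abbreviation-heavy text is one plausible reason for a difference between the two systems.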

#### GloVe:

In [None]:
experiment_conductor( text_vocab_glove , glove_vectors , nn.CrossEntropyLoss(weight= class_weights ,ignore_index = 4 ))


Visualizing the results using a confusion matrix for the test dataset.

In [None]:
conusion_matrix (model, test_dataloader)

#### FastText:

In [None]:
experiment_conductor( text_vocab_fasttext , fasttext_vectors , nn.CrossEntropyLoss(weight= class_weights ,ignore_index = 4))

Visualizing the results using a confusion matrix for the test dataset.

In [None]:
conusion_matrix (model, test_dataloader)

#### Preferred system: **GloVe**

### **Experiment 3: Choices of Loss Functions and Optimisers**

In this experiment, we compare two loss functions: cross-entropy loss and a conditional random field (CRF). We choose between them by comparing their training and validation loss curves.
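
To make the cross-entropy setting concrete, the NumPy sketch below mirrors the documented behaviour of `nn.CrossEntropyLoss(weight=..., ignore_index=...)` with the default `reduction='mean'`: positions whose target equals `ignore_index` are dropped, and the weighted sum is divided by the total weight of the kept targets rather than by their count. The function name `weighted_cross_entropy` is hypothetical, and treating index 4 as the padding tag is an assumption based on the `ignore_index=4` argument used in this notebook.

```python
import numpy as np

def weighted_cross_entropy(logits, targets, class_weights, ignore_index=4):
    """Mean-reduced, class-weighted cross-entropy that skips ignored targets.

    logits: (N, C) scores, e.g. predictions flattened with view(-1, OUTPUT_DIM)
    targets: (N,) integer tag ids
    class_weights: (C,) weight per class
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    weights = np.asarray(class_weights, dtype=float)
    # Drop padded positions entirely.
    keep = targets != ignore_index
    logits, targets = logits[keep], targets[keep]
    # Numerically stable log-softmax over the class dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    sample_weights = weights[targets]
    # Normalise by the summed weights of the kept targets.
    return float((sample_weights * nll).sum() / sample_weights.sum())
```

A CRF, by contrast, scores whole tag sequences, including transitions between adjacent tags, rather than each token independently.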

#### Cross Entropy Loss:

In [None]:
train_loss , valid_loss , _ , _ = experiment_conductor( text_vocab_glove , glove_vectors  , nn.CrossEntropyLoss(weight= class_weights ,ignore_index = 4) , param_return = True , epochs= 7)

Visualizing the results using loss curves.

In [None]:
loss_curve(train_loss , valid_loss)

#### CRF:

In [None]:
train_loss , valid_loss , _ , _ = experiment_conductor( text_vocab_glove , glove_vectors , CRF( num_tags = 5, batch_first= True) , param_return = True , epochs= 7)

Visualizing the results using loss curves.

In [None]:
loss_curve(train_loss , valid_loss)

#### Preferred system: **CRF**

### **Experiment 4: Hyperparameter Optimisation**

In this experiment, we try several values for each hyperparameter in turn and select the most suitable one.
The hyperparameters tested are as follows:
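
The loops in this experiment all follow the same pattern, and several of them pass the swept value to `experiment_conductor` through module-level globals (`DROPOUT`, `BATCH_SIZE`, `HIDDEN_DIM`, `N_LAYERS`), which keep their last value once a loop finishes. The hypothetical helper `sweep` below is a sketch of that pattern under the assumption that each run returns per-epoch accuracy lists; it reads the final epoch with `[-1]` instead of a fixed index, and its selection rule (highest final validation accuracy) is an assumption, since the notebook selects values by inspecting the plots.

```python
def sweep(param_values, run_experiment):
    """Run one experiment per value and collect final-epoch accuracies.

    run_experiment(value) must return (train_accs, valid_accs), two per-epoch lists.
    Returns (final_train, final_valid, best_value), where best_value has the
    highest final validation accuracy.
    """
    final_train, final_valid = [], []
    for value in param_values:
        train_accs, valid_accs = run_experiment(value)
        final_train.append(train_accs[-1])
        final_valid.append(valid_accs[-1])
    best_index = max(range(len(param_values)), key=lambda i: final_valid[i])
    return final_train, final_valid, param_values[best_index]

# Synthetic stand-in for a training run (not real results).
def fake_run(value):
    return [0.5, 0.9], [0.5, 1 - abs(value - 0.4)]

train_final, valid_final, best = sweep([0.4, 0.5, 0.6], fake_run)
print(best)
```

In the notebook, `run_experiment` could wrap a call to `experiment_conductor(..., param_return=True)`, which would also keep the swept value out of global state.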

#### Learning Rate:

In [None]:
learning_rates = [0.0001 , 0.001 , 0.01, 0.1]

train_accuracy = []
valid_accuracy = []

fin_train_losses=  []
fin_val_losses=  []
fin_train_acc = []
fin_val_acc = []
for lr in learning_rates:
    train_loss , val_loss , train_acc, valid_acc = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , learn_rate = lr  , param_return = True , epochs= 7 )
    train_accuracy.append(train_acc)
    valid_accuracy.append(valid_acc)
    fin_train_losses.append(train_loss[6])
    fin_val_losses.append(val_loss[6])
    fin_train_acc.append(train_acc[6])
    fin_val_acc.append(valid_acc[6])

Visualizing the results using accuracy curves.

In [None]:
accuracy_curve (learning_rates , train_accuracy ,valid_accuracy , dim = (2 , 2) , label= 'Learning Rate' , epo= 7)

Visualizing the results using a heatmap.

In [None]:
Heatmap ( learning_rates, fin_val_losses, fin_train_losses, fin_val_acc, fin_train_acc )


**Preferred Value for Learning Rate: 0.001**

#### Dropout Value:

In [None]:
DROP = [0.4 , 0.5 , 0.6 , 0.7]
train_accuracy = []
valid_accuracy = []
for drop_value in DROP:
    DROPOUT = drop_value
    _ , _ , train_acc, valid_acc =experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , learn_rate= 0.001)
    train_accuracy.append(train_acc [6])
    valid_accuracy.append(valid_acc[6])

Visualizing the results using bar charts.

In [None]:
bar_chart ( DROP ,train_accuracy , valid_accuracy , label = 'Dropout' )

**Preferred Value for Dropout: 0.4**

In [None]:
DROPOUT = 0.4

#### Batch Size:

In [None]:
BATCH = [ 16, 32, 64, 128]
train_accuracy = []
valid_accuracy = []
for batch_size in BATCH:
    BATCH_SIZE = batch_size
    print ('Batch size:' , BATCH_SIZE)
    _ , _ , train_acc, valid_acc = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , learn_rate= 0.001 )
    train_accuracy.append(train_acc [6])
    valid_accuracy.append(valid_acc[6])

Visualizing the results using bar charts.

In [None]:
bar_chart ( BATCH ,train_accuracy , valid_accuracy , label = 'Batch size' )

**Preferred Value for Batch Size: 16**

In [None]:
BATCH_SIZE = 16

#### Hidden Dimension:

In [None]:
hidden = [64 , 128 , 256 , 512]
train_accuracy = []
valid_accuracy = []
for hidden_value in hidden:
    HIDDEN_DIM  = hidden_value
    _ , _ , train_acc, valid_acc = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , learn_rate= 0.001)
    train_accuracy.append(train_acc [6])
    valid_accuracy.append(valid_acc[6])

Visualizing the results using bar charts.

In [None]:
bar_chart ( hidden ,train_accuracy , valid_accuracy , label = 'Hidden Dimension' )

**Preferred Value for Hidden Dimension: 256**

In [None]:
HIDDEN_DIM = 256

#### Number of Layers:

In [None]:
layer = [ 2,  4 , 6 , 8]
train_accuracy = []
valid_accuracy = []
for num_layer in layer:
    N_LAYERS = num_layer
    _ , _ , train_acc, valid_acc = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , learn_rate= 0.001)
    train_accuracy.append(train_acc [6])
    valid_accuracy.append(valid_acc[6])

Visualizing the results using bar charts.

In [None]:
bar_chart ( layer ,train_accuracy , valid_accuracy , label = 'Number of Layers' )

**Preferred Value for Number of Layers: 4**

In [None]:
N_LAYERS = 4

#### Number of Epochs:

In [None]:
train_loss , valid_loss , _ , _ = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , epochs= 50 )

Visualizing the results using loss curves.

In [None]:
loss_curve(train_loss , valid_loss)

##### Early Stopping:

In [None]:
train_loss , valid_loss , _ , _ = experiment_conductor( text_vocab_glove , glove_vectors ,CRF( num_tags = 5,batch_first= True) , param_return = True , epochs= 50 , early_stoping=EarlyStopping(patience=5, verbose=True))

Visualizing the results with early stopping using loss curves.

In [None]:
loss_curve(train_loss , valid_loss)

**Preferred Value for Number of Epochs: 17**
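
The `EarlyStopping(patience=5, verbose=True)` class used above is defined elsewhere in the notebook and is not shown in this section. As an illustration of the usual patience logic only, here is a minimal sketch with hypothetical names (`SimpleEarlyStopping`, `step`); it is not the notebook's class. Training stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
class SimpleEarlyStopping:
    """Patience-based early stopping on validation loss (illustrative sketch)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.counter = 0
        self.early_stop = False

    def step(self, val_loss, epoch):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop

# Made-up validation losses: the best value is at epoch 3, so with
# patience=5 training stops after epoch 8.
stopper = SimpleEarlyStopping(patience=5)
for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.72, 0.75, 0.71, 0.73, 0.74, 0.9], start=1):
    if stopper.step(loss, epoch):
        break
```

In practice the model weights are usually checkpointed whenever `best_loss` improves, so that the best model (rather than the last one) is kept.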

________________________________________________________________________________
### **User Input**

This section lets the model be tested on manually entered text. Because raw input must be tokenized first, we use spaCy, the same tokenizer on which the PLOD-CW dataset's tokenization is based.

In [None]:
from torchtext.data.utils import get_tokenizer

class SpacyTokenizer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

    def forward(self, input):
        if isinstance(input, list):
            tokens = []
            for text in input:
                tokens.append(self.tokenizer(text))
            return tokens
        elif isinstance(input, str):
            return self.tokenizer(input)
        raise ValueError(f"Type {type(input)} is not supported.")

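We also need new transform pipelines, because tokenization must now be part of them: one converts raw text into a padded tensor of token indices, and the other computes the sentence lengths.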

In [None]:
import torchtext.transforms as T

input_text_transform = T.Sequential(
    SpacyTokenizer(),  
    T.VocabTransform(text_vocab),  
    T.ToTensor(padding_value=text_vocab["<pad>"]), 
)


input_lengths_transform = T.Sequential(
    SpacyTokenizer(),
    ToLengths(),
    T.ToTensor(),
)
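
To show what the two pipelines compute together, here is a pure-Python sketch with a plain dictionary standing in for the torchtext vocabulary; `encode_batch` and the toy vocabulary are hypothetical. One assumption worth checking: mapping unseen tokens to `<unk>` only happens in torchtext if `set_default_index` was called on the vocabulary, and the vocabulary setup is not shown in this section.

```python
def encode_batch(sentences, vocab, unk_token="<unk>", pad_token="<pad>"):
    """Map tokenised sentences to padded id lists plus their true lengths.

    sentences: list of token lists (the output of the tokenizer step)
    vocab: dict from token to integer id, containing unk_token and pad_token
    """
    unk_id, pad_id = vocab[unk_token], vocab[pad_token]
    # Token -> id, falling back to <unk> for out-of-vocabulary tokens.
    ids = [[vocab.get(tok, unk_id) for tok in sent] for sent in sentences]
    lengths = [len(seq) for seq in ids]
    max_len = max(lengths, default=0)
    # Right-pad every sequence to the longest one in the batch.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in ids]
    return padded, lengths

toy_vocab = {"<unk>": 0, "<pad>": 1, "pcr": 2, "is": 3, "a": 4}
padded, lengths = encode_batch([["pcr", "is", "a", "method"], ["pcr"]], toy_vocab)
```

Here `padded` plays the role of `input_text_transform`'s output and `lengths` that of `input_lengths_transform`'s.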

Finally, we define the `label_predictor()` function, which returns its predictions as a pandas DataFrame with one row per token. If the result is not displayed in full, use `DataFrame.head(x)` with a suitable value of `x` to inspect the first `x` rows.

In [None]:
import pandas as pd

# Tag id -> label for the four PLOD-CW tags (id 4, used for padding, is not mapped).
id2label = {0: 'B-O', 1: 'B-AC', 2: 'B-LF', 3: 'I-LF'}

tokenizer = SpacyTokenizer()

def label_predictor(text):
    # Lowercase the input before encoding it.
    text_normal = text.lower()
    processed_sentence = input_text_transform([text_normal]).to(DEVICE)
    sentence_length = input_lengths_transform([text_normal]).cpu()

    # Inference only: disable dropout and gradient tracking.
    model.eval()
    with torch.no_grad():
        prediction = model(processed_sentence, sentence_length)

    # Pick the highest-scoring tag for each token independently.
    _, predicted_ids = torch.max(prediction.view(-1, OUTPUT_DIM), 1)

    tokens = tokenizer(text_normal)

    df = pd.DataFrame({'Token': tokens, 'Ner_tag': predicted_ids.tolist()})
    df['Ner_tag'] = df['Ner_tag'].map(id2label)

    return df
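
`label_predictor` chooses each token's tag independently with `torch.max`. Since the CRF was the preferred loss, decoding would more commonly use the CRF's Viterbi search, which also takes the learned tag-to-tag transition scores into account; pytorch-crf's `CRF` module stores comparable parameters (`start_transitions`, `transitions`, `end_transitions`) and provides `CRF.decode` for this. The NumPy function below is an illustrative sketch of the algorithm, not the library's implementation, and it assumes the model's output is the emission scores that were fed to the CRF during training. Because the CRF objects in this notebook are created inline in the `experiment_conductor` calls, wiring this in would require keeping a reference to the trained CRF.

```python
import numpy as np

def viterbi_decode(emissions, transitions, start_scores, end_scores):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:    (seq_len, num_tags) per-token tag scores from the model
    transitions:  (num_tags, num_tags), transitions[i, j] = score of tag i -> tag j
    start_scores: (num_tags,) score of starting with each tag
    end_scores:   (num_tags,) score of ending with each tag
    """
    emissions = np.asarray(emissions, dtype=float)
    transitions = np.asarray(transitions, dtype=float)
    # Best score of any path ending in each tag at position 0.
    score = np.asarray(start_scores, dtype=float) + emissions[0]
    backpointers = []
    for t in range(1, emissions.shape[0]):
        # candidate[i, j]: best path ending in tag i at t-1, then tag j at t.
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(candidate.argmax(axis=0))
        score = candidate.max(axis=0)
    score = score + np.asarray(end_scores, dtype=float)
    # Follow the back-pointers from the best final tag.
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path
```

With all transition scores at zero this reduces to per-token argmax; a strongly negative transition score can change the result, which is exactly the information greedy decoding ignores.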

In [None]:
sample_text = """Polymerase chain reaction (PCR), ligase chain reaction (LCR), and transcription-mediated amplification (TMA) are examples."""

result_1  = label_predictor(sample_text)

In [None]:
result_1

In [None]:
sample_text = """DNA replication (DNA amplification) can also be performed in vitro (artificially, outside a cell)."""

result_2 = label_predictor(sample_text)

In [None]:
result_2