# Lab 06 - BERT Finetuning

This notebook will take you through a practical scenario on how to **finetune** a **pre-trained** language model (BERT) on the task of **Abbreviation Detection** using a subset of the **PLOD dataset**.

In [74]:
# Install the necessary dependencies
%pip install datasets
%pip install transformers
%pip install spacy
%pip install torch
%pip install spacy-transformers
%pip install transformers[torch]
%pip install seqeval



Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### 1. Dataset

To access PLOD, we will use the HuggingFace Datasets library.
Hugging Face is a company and an open-source community that primarily focuses on natural language processing (NLP) technologies. 
** **

In [75]:
from datasets import load_dataset, load_metric
dataset = load_dataset("surrey-nlp/PLOD-CW")

### 2. Language Model (BERT)

They are best known for their development and maintenance of the "Transformers" library, which is a popular open-source library for state-of-the-art NLP architectures. We can use it to load our BERT model and its pre-trained weights. They also offer other pipeline compones like model-specific tokenizers.
** **

In [76]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=4)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 3. Dataset pre-processing
** **

We start out by separating the 3 subsets of PLOD.

*(Train, validation and test data)*

In [77]:
short_dataset = dataset["train"][:200]
val_dataset = dataset["validation"]
test_dataset = dataset["test"]

We then tokenize the train set using BERT's tokenizer, which will get our sentences into the form that the model is used to seeing during pre-training. In this case, the authors added two special tokens to their sentences:
-   [CLS]: The final embedding of this token is usually treated as the overall representation of the entire input.
-   [SEP]: A separator token that is used in task that require of multiple sentences to inform the model about their limits.

In [78]:
tokenized_input = tokenizer(short_dataset["tokens"], is_split_into_words=True)

# Example single sentence example.
for token in tokenized_input["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(token))
    break

['[CLS]', 'for', 'this', 'purpose', 'the', 'gothenburg', 'young', 'persons', 'empowerment', 'scale', '(', 'g', '##ype', '##s', ')', 'was', 'developed', '.', '[SEP]']


The PLOD dataset provides its labels in string form, we have to convert them to class indexes so that they can be understood by the model

In [79]:
label_encoding = {"B-O": 0, "B-AC": 1, "B-LF": 2, "I-LF": 3}

label_list = []
for sample in short_dataset["ner_tags"]:
    label_list.append([label_encoding[tag] for tag in sample])

val_label_list = []
for sample in val_dataset["ner_tags"]:
    val_label_list.append([label_encoding[tag] for tag in sample])

test_label_list = []
for sample in test_dataset["ner_tags"]:
    test_label_list.append([label_encoding[tag] for tag in sample])


As we added two special tokens to our input sentences, we have to align the labels to account for this. 

In [80]:
def tokenize_and_align_labels(short_dataset, list_name):
    tokenized_inputs = tokenizer(short_dataset["tokens"], truncation=True, is_split_into_words=True) ## For some models, you may need to set max_length to approximately 500.

    labels = []
    for i, label in enumerate(list_name):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [81]:
tokenized_datasets = tokenize_and_align_labels(short_dataset, label_list)
tokenized_val_datasets = tokenize_and_align_labels(val_dataset, val_label_list)
tokenized_test_datasets = tokenize_and_align_labels(test_dataset, test_label_list)
# print(tokenized_datasets)

In [82]:
# BERT's tokenizer returns the dataset in the form of a dictionary of lists (sentences). 
# we have to convert it into a list of dictionaries for training.
def turn_dict_to_list_of_dict(d):
    new_list = []

    for labels, inputs in zip(d["labels"], d["input_ids"]):
        entry = {"input_ids": inputs, "labels": labels}
        new_list.append(entry)

    return new_list

In [83]:
tokenised_train = turn_dict_to_list_of_dict(tokenized_datasets)
tokenised_val = turn_dict_to_list_of_dict(tokenized_val_datasets)
tokenised_test = turn_dict_to_list_of_dict(tokenized_test_datasets)

In order to improve training efficiency, we can parallelize it by feeding multiple sentences to BERT at once. Data collators are objects that will form a batch by using a list of dataset elements as input.

In [84]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)

### 4. Training and metrics
** **

In [85]:
import numpy as np

metric = load_metric("seqeval")
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [86]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Training arguments (feel free to play arround with these values)
model_name = "bert-base-uncased"
epochs = 6
batch_size = 4
learning_rate = 2e-5

args = TrainingArguments(
    f"BERT-finetuned-NER",
    # evaluation_strategy = "epoch", ## Instead of focusing on loss and accuracy, we will focus on the F1 score
    evaluation_strategy ='steps',
    eval_steps = 7000,
    save_total_limit = 3,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.001,
    save_steps=35000,
    metric_for_best_model = 'f1',
    load_best_model_at_end=True
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenised_train,
    eval_dataset=tokenised_val,
    data_collator = data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)


In [57]:
trainer.train()

100%|██████████| 150/150 [05:08<00:00,  2.05s/it]

{'train_runtime': 308.1618, 'train_samples_per_second': 1.947, 'train_steps_per_second': 0.487, 'train_loss': 0.06668521245320638, 'epoch': 6.0}





TrainOutput(global_step=150, training_loss=0.06668521245320638, metrics={'train_runtime': 308.1618, 'train_samples_per_second': 1.947, 'train_steps_per_second': 0.487, 'train_loss': 0.06668521245320638, 'epoch': 6.0})

### 5. Testing
** **

In [58]:
# Prepare the test data for evaluation in the same format as the training data

predictions, labels, _ = trainer.predict(tokenised_test)
predictions = np.argmax(predictions, axis=2)

# Remove the predictions for the [CLS] and [SEP] tokens 
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

# Compute multiple metrics on the test restuls
results = metric.compute(predictions=true_predictions, references=true_labels)
results

100%|██████████| 39/39 [00:15<00:00,  2.46it/s]


{'0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 0, 1, 0, 0, 0, 0, 0, 0, 2, 3, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0]': {'precision': 0.5760869565217391,
  'recall': 0.7940074906367042,
  'f1': 0.6677165354330709,
  'number': 267},
 '0, 0, 0, 0, 2, 3, 3, 3, 3, 0, 1, 0, 0, 0, 0]': {'precision': 0.5187861271676301,
  'recall': 0.6697761194029851,
  'f1': 0.5846905537459284,
  'number': 536},
 '0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 0, 1, 0, 0, 0, 0, 2, 3, 3, 0, 1, 0, 0, 0, 0, 0]': {'precision': 0.4789473684210526,
  'recall': 0.610738255033557,
  'f1': 0.5368731563421829,
  'number': 149},
 '1, 0, 2, 3, 3, 0]': {'precision': 0.4523809523809524,
  'recall': 0.5891472868217055,
  'f1': 0.5117845117845118,
  'number': 129},
 'overall_precision': 0.5204513399153737,
 'overall_recall': 0.6827012025901943,
 'overall_f1': 0.5906362545018007,
 'overall_accuracy': 0.9043752819124944}

### Challenge

Now that you can observe the F1 score for your fine-tuned model, you should try and change the name of the Transformer model to roberta-base and look at the performance. Further, before you start experimenting with Transformers for the coursework, I would suggest having a look at the [Huggingface MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). MTEB is a Massive Text Embedding Benchmark which allows you to check how each model is performing on the each subset of this benchmark data. 

HINT: Is there a sequence/token classification task among the tasks/dataset listed on the benchmark? If you specifically check for performance measures for the token classification, you may be able to get an indication about how it may perform compare to other models on the board. Check out for the rank of the ['SentenceBERT' model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) called all-mpnet-base-v2, find out where bert-base models rank, and where roberta ranks on this list; see if any other models can help you with the coursework (CW)   

However, I request you to look carefully at the model size for roberta-base/bert-base/all-mpnet-base-v2; and find a model from the leaderboard which is similar in terms of size so that you do not get out of memory error during the fine-tuning process. Remember that free tier GPUs on Colab offer only 16GB of GPU RAM, and any fine-tuning experiments for Transformers should be performed with GPU; CPU-based fine-tuning can be (very) slow. 