# Using the fine-tuned BERT model

In the previous notebook, we walked through the process of fine-tuning a pre-trained BERT model using masked language modeling, and saved the fine-tuned model for later use. 

In this notebook, we will continue our exploration of BERT by using the fine-tuned model to perform classification on a medical text dataset. Specifically, we will define a custom classification head on top of the fine-tuned BERT model using PyTorch, train the classification model, and evaluate it on a held-out validation set of medical text data. 

We will also compare the performance of our fine-tuned model to an out-of-the-box BERT model, which has not been fine-tuned on the medical text dataset. By the end of this notebook, you will have a solid understanding of how to use a fine-tuned BERT model for classification of medical text data, and how to compare its performance to an out-of-the-box BERT model.

## Loading the fine-tuned BERT model

To use the fine-tuned BERT model for classification, we can load the saved model from the previous notebook using the ``BertForMaskedLM`` class from the Hugging Face Transformers library. This class is the same pre-trained BERT model that we fine-tuned using masked language modeling in the previous notebook, and does not include a classification head.

To attach our own classification head to the fine-tuned BERT model, we define a custom classification head using PyTorch, which we will do later in this notebook. Once the classification head is defined, we combine it with the fine-tuned BERT model inside a custom ``nn.Module``. We can then move the combined model to the device we plan to use for training and evaluation, such as a GPU if available.
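As a minimal sketch of the device step mentioned above, the snippet below picks a GPU when PyTorch can see one and falls back to the CPU otherwise. The ``toy_model`` is a hypothetical stand-in for the combined model, not the model built in this notebook.

```python
import torch
from torch import nn

# Pick a GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# `toy_model` is a hypothetical stand-in for the combined BERT + head model.
toy_model = nn.Linear(768, 6).to(device)

# Inputs must live on the same device as the model's parameters.
batch = torch.rand(2, 768, device=device)
logits = toy_model(batch)
print(logits.shape, logits.device.type)
```

When training with the Hugging Face ``Trainer``, the trainer normally handles device placement itself, so an explicit move like this mainly matters for manual forward passes.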

> Load a pre-trained ``bert-base-uncased`` tokenizer to the ``tokenizer`` variable.
>
> Load the fine-tuned model from ``"../../model/bert-base-uncased-finetuned"`` to the ``model_finetuned`` variable.

In [1]:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model_finetuned = BertForMaskedLM.from_pretrained(
    "../../model/bert-base-uncased-finetuned"
)


> Load the processed data to ``df_train`` and ``df_validation``

In [2]:
import pandas as pd

df_train = pd.read_csv("../../data/train.csv", index_col=0)
df_validation = pd.read_csv("../../data/validation.csv", index_col=0)


## Defining the Dataset

Before training our model, we must first define a ``Dataset`` that provides the model with inputs and target class labels. Let's begin with the inputs. The model consumes token IDs, so we tokenize each transcription, truncating or padding every sequence to 512 tokens.

In [3]:
x_train = tokenizer(
    df_train["transcription"].values.tolist(),
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length",
)
x_validation = tokenizer(
    df_validation["transcription"].values.tolist(),
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length",
)
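To make the effect of ``truncation=True`` and ``padding="max_length"`` more concrete, here is a toy sketch in plain Python. The helper ``pad_or_truncate`` is hypothetical and is not the tokenizer's implementation: among other things, the real tokenizer keeps the special ``[CLS]`` and ``[SEP]`` tokens in place when truncating, which this sketch ignores.

```python
from typing import Dict, List

PAD_ID = 0  # matches "pad_token_id": 0 in the BERT config; used here for illustration


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Toy illustration: cut sequences that are too long, pad those that are too short."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_example = pad_or_truncate([101, 7592, 102], max_length=5)
long_example = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short_example)
print(long_example)
```

Every sequence ends up with the same length, which is what lets the encodings be stacked into a single tensor.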


Now for the target values. Let's check once more which classes we have in our data. They are the values of the ``"medical_specialty"`` column.

In [4]:
df_train["medical_specialty"].unique().tolist()


['Surgery',
 'Radiology',
 'Consult',
 'Cardiovascular',
 'Orthopedic',
 'General Medicine']

Because our labels are text, we need to map them to integers for our model. Let's create a label mapping.

In [5]:
class_to_label = {
    "Surgery": 0,
    "Radiology": 1,
    "Consult": 2,
    "Cardiovascular": 3,
    "Orthopedic": 4,
    "General Medicine": 5,
}
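As an alternative sketch, the same mapping can be built from the list of class names, together with an inverse mapping that is handy for turning predicted indices back into names. The names ``class_to_label_auto`` and ``label_to_class`` are ours and do not appear elsewhere in the notebook.

```python
# Class names copied from the output of the cell that lists the unique specialties.
class_names = [
    "Surgery",
    "Radiology",
    "Consult",
    "Cardiovascular",
    "Orthopedic",
    "General Medicine",
]

# Name -> index, in the order the names are listed.
class_to_label_auto = {name: index for index, name in enumerate(class_names)}
# Index -> name, for decoding predictions.
label_to_class = {index: name for name, index in class_to_label_auto.items()}

print(class_to_label_auto)
print(label_to_class[1])
```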


Let's then map the ``"medical_specialty"`` column of each dataset to integer labels.

In [6]:
y_train = df_train["medical_specialty"].map(class_to_label).values
y_validation = df_validation["medical_specialty"].map(class_to_label).values


Let's then define our dataset. We'll leave our target values as indices, since the [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) expects the labels as indices.

In [7]:
import torch


class MedicalTranscriptionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        x = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        x["labels"] = self.labels[idx]
        return x

    def __len__(self):
        return len(self.encodings.input_ids)
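As noted above, ``CrossEntropyLoss`` expects class indices as targets. The following sketch, using made-up logits, shows the call and checks it against a manual computation.

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

# Raw, unnormalized scores for 2 samples and 6 classes (made-up values).
logits = torch.tensor(
    [
        [2.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    ]
)
# Targets are class indices with an integer (long) dtype, not one-hot vectors.
targets = torch.tensor([0, 3])

loss = loss_fn(logits, targets)

# The same value by hand: mean of the negative log-softmax at each target index.
manual = -torch.log_softmax(logits, dim=1)[torch.arange(2), targets].mean()
print(loss.item(), manual.item())
```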


Let's recap a bit. The output of our dataset must conform to what the BERT model expects. This matters especially when using the Hugging Face ``Trainer`` to train the model. From the documentation of [BertForMaskedLM.forward](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#transformers.BertForMaskedLM.forward) we can see that the model expects _at least_ the following arguments:
 - ``input_ids`` of shape ``[batch, sequence_length]`` (required)
 - ``attention_mask`` of shape ``[batch, sequence_length]`` (optional)
 - ``token_type_ids`` of shape ``[batch, sequence_length]`` (optional)
 - ``labels`` of shape ``[batch, sequence_length]`` (optional)

For the last one, ``labels``, the ``BertForMaskedLM`` model expects per-token labels, but this time our labels are per-sentence. We therefore do not pass them on to ``BertForMaskedLM``; they are consumed by the separately defined classification head instead. For each batch of samples, the ``Trainer`` passes the batched values under each key as a separate keyword argument to the model's ``forward`` function. That is why ``labels`` still has to be included in our samples.
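The keyword-argument mechanism can be sketched in plain Python. Here ``toy_forward`` is a hypothetical stand-in for a model's ``forward`` function, and the batch values are made up.

```python
from typing import Any, Dict, Optional


def toy_forward(
    input_ids: Any,
    attention_mask: Optional[Any] = None,
    token_type_ids: Optional[Any] = None,
    labels: Optional[Any] = None,
) -> Dict[str, bool]:
    """Hypothetical stand-in for forward(): reports which optional inputs arrived."""
    return {
        "has_attention_mask": attention_mask is not None,
        "has_token_type_ids": token_type_ids is not None,
        "has_labels": labels is not None,
    }


# A batch shaped like the dataset's samples: one entry per key.
batch = {
    "input_ids": [[101, 2023, 102]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
    "labels": [2],
}

# Unpacking the dict turns each key into a keyword argument.
with_labels = toy_forward(**batch)

# Dropping "labels" mimics calling a model that should not receive them.
without_labels = toy_forward(**{k: v for k, v in batch.items() if k != "labels"})
print(with_labels, without_labels)
```

Because the keys become argument names, a key that the ``forward`` signature does not accept would raise a ``TypeError``.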

Let's then see what the ``MedicalTranscriptionDataset`` actually does.

In [8]:
dataset = MedicalTranscriptionDataset(encodings=x_train, labels=y_train)
sample = dataset[1]
print(f"Sample type: {type(sample)}")
print(f"Sample keys: {sample.keys()}")
sample


Sample type: <class 'dict'>
Sample keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


{'input_ids': tensor([  101, 11360, 27011,  2187,  3244,  6612,  2656,  2095,  2214,  2381,
         18549,  7749,  3653, 29021,  2260,  2322,  2384,  4531, 10514, 18098,
          3022, 26787,  5809,  7166,  5740,  6190,  2302,  2440, 14983,  7697,
          6578, 11917,  2128,  6494,  7542, 13472,  2012, 18981, 10536,  2186,
          8746,  3746,  1018,  1020,  3671,  1999, 27843, 13102,  3981,  5809,
          4942, 15782, 14289,  8017,  2483,  7166,  2239,  3671,  2146, 27947,
          7166,  2239,  2306, 12170,  6895, 23270,  2389, 14100, 23828,  4942,
         25148,  3370,  7166,  2239, 18323, 20368,  7941, 25641,  7166,  5740,
          6190, 26721, 17695, 23722,  2906,  4664,  7166,  2239,  7704, 13311,
          3143,  7697, 12532, 16778, 11231,  3012, 27947,  8133, 10109,  2186,
          8746,  3746,  1018,  1021,  2186,  9402,  3746,  2184,  2570,  2312,
          2940, 22818, 19583,  5994,  2471,  2972, 15219,  2431, 20368,  7941,
          2132,  2186,  9402,  3746,  2

Again, it is important to notice that the dataset's samples must conform to the format expected by our underlying BERT model. 

Let's finally try to feed a sample to the model to see what it does.

In [9]:
# Batchify
batched_sample = {}
for key, value in sample.items():
    batched_sample[key] = value.unsqueeze(0)

model_finetuned.to("cpu")

batched_sample.pop("labels")
sample_output = model_finetuned(**batched_sample, output_hidden_states=True)
print(f"Sample output keys: {sample_output.keys()}")
sample_output.hidden_states[-1][:, 0, :].shape


Sample output keys: odict_keys(['logits', 'hidden_states'])


torch.Size([1, 768])

## Defining the classification head

To perform classification on the fine-tuned BERT model, we need to attach a custom classification head on top of the pre-trained model. This classification head will take the output embeddings from the fine-tuned BERT model and map them to a set of output classes using a feedforward neural network.

To define the classification head, we can use PyTorch to create a new ``nn.Sequential`` module. The classification head should include at least one hidden layer with a ReLU activation function, as well as an output layer. The input size to the classification head should be the same as the output size of the fine-tuned BERT model, which can be obtained using the ``config.hidden_size`` attribute of the BERT configuration. The output size of the classification head should be equal to the number of output classes in our classification task. 

We will omit any further steps to the output of the final layer, as the [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) is happy to receive the raw predictions.

In [10]:
from torch import nn


class ClassificationHead(nn.Module):
    def __init__(self, embedding_dims: int, num_classes: int):
        super(ClassificationHead, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(embedding_dims, embedding_dims),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(embedding_dims, embedding_dims),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(embedding_dims, num_classes),
        )

    def forward(self, x):
        return self.model(x)


We defined a simple feedforward neural network with two hidden layers each followed by rectified linear activation and a dropout, followed by a linear layer with six output classes. Size and number of layers can be adjusted as needed, but this is a good starting point. Let's see the classification head in operation.

The ``sequence_length`` is the static length of input sequences. The ``embedding_dims`` is set to the size of the BERT model's output embeddings, which is 768 for the ``bert-base-uncased`` model. In the original BERT paper they mention the following:

_"The first token of every sequence is always a special classification token (``[CLS]``). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks."_

Consider a mini-batch of two tokenized sequences. The shape of the mini-batch tensor will be ``[2,512,768]``, where:
 - 2 corresponds to the number of sequences (i.e. the batch size)
 - 512 corresponds to the tokens in a sequence
 - 768 corresponds to the embedding dimensions for a token

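For such input, without any modifications the ``ClassificationHead`` would have to process all 512 token embeddings of each sequence. However, because the final hidden state of the ``[CLS]`` token serves as the aggregate representation of the whole sequence, the classifier can expect to receive a single token embedding for each sequence in an input batch, i.e. an input of shape ``[2,1,768]``, which we index down to ``[2,768]``.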


In [11]:
import torch

inputs = torch.rand(2, 1, 768)
cls_head = ClassificationHead(embedding_dims=inputs.shape[2], num_classes=6)
print(cls_head(inputs[:, 0, :]).shape)
cls_head(inputs[:, 0, :])

torch.Size([2, 6])


tensor([[-0.0197,  0.2046, -0.0770, -0.0518, -0.1724, -0.0071],
        [-0.0590,  0.1225, -0.0316,  0.0219, -0.0804, -0.0677]],
       grad_fn=<AddmmBackward0>)
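The cell above starts from a tensor that already holds a single token per sequence. As a sketch of how that slice would be taken from a full hidden state, the example below uses random values in place of real BERT outputs.

```python
import torch

# Made-up hidden states: 2 sequences, 512 tokens each, 768 dimensions per token.
last_hidden_state = torch.rand(2, 512, 768)

# Keep every batch entry, only token position 0 ([CLS]), and all embedding dims.
cls_hidden_state = last_hidden_state[:, 0, :]

# Slicing with 0:1 instead keeps the token axis, giving the [2, 1, 768] view.
cls_with_token_axis = last_hidden_state[:, 0:1, :]

print(cls_hidden_state.shape, cls_with_token_axis.shape)
```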

## Combining fine-tuned BERT with classification head

Now that the classification head is defined, we can combine it with the fine-tuned BERT model inside a custom ``nn.Module``. This combined model takes in the tokenized input text and outputs raw class scores (logits) for the output classes. We can then move the combined model to the device we plan to use for training and evaluation, such as a GPU if available.

Here, pay attention to the `forward` method and its arguments. As discussed earlier with the dataset definition, the arguments conform to sample-wise keys of the dict provided by our dataset. These, in turn, conform to the data format expected by the ``BertForMaskedLM``.

We also want to keep the BERT weights fixed and train only the classification head. That is achieved by setting ``requires_grad = False`` on every BERT parameter.

In [12]:
class BertWithSeparateClassificationHead(nn.Module):
    def __init__(self, bert_mlm: BertForMaskedLM, num_classes: int):
        super(BertWithSeparateClassificationHead, self).__init__()
        self.bert_mlm = bert_mlm
        for param in self.bert_mlm.parameters():
            param.requires_grad = False
        self.classifier = ClassificationHead(
            embedding_dims=self.bert_mlm.config.hidden_size, num_classes=num_classes
        )
        self.loss = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, token_type_ids, labels=None, **kwargs):
        outputs = self.bert_mlm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            output_hidden_states=True,
        )
        last_hidden_state = outputs.hidden_states[-1]
        cls_hidden_state = last_hidden_state[:, 0, :]
        logits = self.classifier(cls_hidden_state)
        if labels is not None:
            loss = self.loss(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}


This code defines a new class ``BertWithSeparateClassificationHead`` that inherits from ``nn.Module``. 

It uses a provided pre-trained BERT model and the previously defined ``ClassificationHead``. In the forward method, the inputs are first passed through the BERT model, and the embedding of the ``[CLS]`` token is extracted from the last hidden state. The ``[CLS]`` embedding is then passed through the ``ClassificationHead`` to get raw scores (logits) for the six output classes for each sequence in the input batch. No softmax is applied, since ``CrossEntropyLoss`` works on the raw logits. When ``labels`` are given, the loss is returned alongside the logits.
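The freezing step in ``__init__`` can be illustrated on a tiny model. In this sketch ``backbone`` and ``head`` are hypothetical stand-ins for BERT and the classification head. Only the parameters that still require gradients are counted as trainable.

```python
from torch import nn

# Hypothetical stand-ins: `backbone` plays the role of BERT, `head` of the classifier.
backbone = nn.Linear(16, 8)
head = nn.Linear(8, 3)

# Freeze the backbone so that no gradients are computed for its parameters.
for param in backbone.parameters():
    param.requires_grad = False

toy_composite = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in toy_composite.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_composite.parameters() if not p.requires_grad)
print(trainable, frozen)  # 27 136
```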

In [13]:
# Batchify
batched_sample = {}
for key, value in sample.items():
    batched_sample[key] = value.unsqueeze(0)

model_finetuned_with_classifier = BertWithSeparateClassificationHead(
    bert_mlm=model_finetuned, num_classes=6
)
model_finetuned_with_classifier.to("cpu")
y_pred = model_finetuned_with_classifier.forward(**batched_sample)
y_pred


{'loss': tensor(1.7533, grad_fn=<NllLossBackward0>),
 'logits': tensor([[-0.0703,  0.0388,  0.0642, -0.0541, -0.0276,  0.0426]],
        grad_fn=<AddmmBackward0>)}
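Since the model returns raw logits, probabilities have to be computed separately when they are needed. The sketch below applies a softmax to the logits printed above (rounded to four decimals, so results are approximate) and compares the cross-entropy for every possible target index with the loss of a uniform guess over six classes.

```python
import math

# Logits copied from the output above, rounded to four decimals as printed.
logits = [-0.0703, 0.0388, 0.0642, -0.0541, -0.0276, 0.0426]

# Softmax: exponentiate and normalize so the scores sum to one.
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross-entropy for each possible target index: -log(probability of that class).
losses = [-math.log(p) for p in probs]

# Loss of a classifier that assigns equal probability to all six classes.
chance_loss = math.log(6)

print([round(p, 3) for p in probs])
print([round(loss, 4) for loss in losses])
print(round(chance_loss, 4))
```

All candidate losses sit close to ln 6 ≈ 1.79, which is what we would expect from a freshly initialized head. With these rounded logits, target index 1 gives a loss of about 1.753, consistent with the printed 1.7533, although the notebook does not print the sample's label.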

## Training the classification model

To train the classification model for the medical specialties of the transcriptions, we will use the preprocessed data from the previous notebook, which includes a separate column for the medical specialty labels. We will use this column to obtain the labels for each transcription.

To set up the training process, we use the Hugging Face ``Trainer`` class, as in the preceding notebook with the MLM task. During training, the model will learn to map the preprocessed input text data to the correct output class (i.e., medical specialty) using a combination of the fine-tuned BERT model and the custom classification head. We can monitor the training progress using various performance metrics, such as accuracy and loss, and make adjustments to the hyperparameters as needed to improve performance.

We already know what to do from the fine-tuning process. Let's do it again, with the necessary modifications. We'll define a function that trains the composite model, so that we can reuse it for both our fine-tuned BERT and the pre-trained base BERT.

In [14]:
import os
from transformers import TrainingArguments, Trainer


def train_bert_with_cls_head(model_name: str, bert_mlm: BertForMaskedLM) -> str:
    trainer = Trainer(
        model=BertWithSeparateClassificationHead(bert_mlm=bert_mlm, num_classes=6),
        train_dataset=MedicalTranscriptionDataset(encodings=x_train, labels=y_train),
        args=TrainingArguments(
            output_dir="out",
            per_device_train_batch_size=24,
            num_train_epochs=8,
            dataloader_num_workers=8,
        ),
    )
    trainer.train()
    model_path = f"../../model/{model_name}/model.pt"
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    # We'll save like any PyTorch model, as the model is not a Hugging Face `PreTrainedModel`
    torch.save(trainer.model, model_path)
    return model_path


In [15]:
fine_model_path = train_bert_with_cls_head(
    model_name="classifier-fine", bert_mlm=model_finetuned
)


***** Running training *****
  Num examples = 2290
  Num Epochs = 8
  Instantaneous batch size per device = 24
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 1
  Total optimization steps = 768
  Number of trainable parameters = 1185798


Step,Training Loss
500,1.2583


Saving model checkpoint to out/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
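Two numbers in the training log can be checked by hand, which is a useful sanity check that only the classification head is being trained. The sketch assumes a single device and no gradient accumulation, as the log indicates, and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 2290  # "Num examples" from the log
batch_size = 24      # per_device_train_batch_size
num_epochs = 8       # num_train_epochs
hidden_size = 768    # BERT hidden size, also the width of the head's hidden layers
num_classes = 6

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Two hidden layers of size 768 followed by the output layer.
head_params = 2 * linear_params(hidden_size, hidden_size) + linear_params(
    hidden_size, num_classes
)
print(total_steps, head_params)  # 768 1185798
```

Both values match the log: 768 optimization steps and 1185798 trainable parameters, all of which belong to the classification head.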


## Evaluating the classification model

To evaluate the performance of the classification model, we will use the preprocessed validation data from the previous notebook, which includes a separate column for the medical specialty labels. We will use this column to obtain the labels for each transcription.

To evaluate the model, we use the ``Trainer.predict()`` method to obtain the model's predictions on the validation data, and compare them to the true labels. We can compute various performance metrics, such as accuracy, precision, recall, and F1 score, to assess the model's performance on the classification task.

Let's first define our metric computation function. It will receive as its input the prediction output of our model, see [Trainer.predict](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.predict) for more information.

In [16]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(preds):
    y_true = preds.label_ids
    y_pred = preds.predictions.argmax(-1)
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"
    )
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}



To make it easier to compare models later, let's also define a function to load and evaluate the model.

In [17]:
from transformers import Trainer

def load_and_evaluate_saved_model(model_path):
    torch.cuda.empty_cache()
    trainer = Trainer(model=torch.load(model_path))
    preds = trainer.predict(
        test_dataset=MedicalTranscriptionDataset(
            encodings=x_validation, labels=y_validation
        )
    )
    return compute_metrics(preds=preds)
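The function above loads a model that was saved as a whole pickled object, which requires the class definitions to be available when loading. A commonly used alternative is to save only the ``state_dict``. The sketch below shows that pattern on a hypothetical stand-in model and is not what this notebook does.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for the composite model.
toy_model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "weights.pt")
    # Save only the parameter tensors rather than pickling the whole object.
    torch.save(toy_model.state_dict(), weights_path)

    # Loading requires constructing the same architecture first.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(weights_path))

# Both models should now produce identical outputs.
x = torch.rand(3, 4)
same = torch.allclose(toy_model(x), restored(x))
print(same)
```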


Let's then evaluate the composite model with a fine-tuned BERT.

In [18]:
load_and_evaluate_saved_model(model_path=fine_model_path)


No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
  Num examples = 572
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.6083916083916084,
 'precision': 0.5142903687221869,
 'recall': 0.6083916083916084,
 'f1': 0.5182635361941162}
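Notice that recall equals accuracy in the output above. With ``average="weighted"`` this is expected: weighting each class's recall by its share of the true labels adds up to the overall fraction of correct predictions. The sketch below verifies this on made-up labels, in which one class is never predicted.

```python
from collections import Counter

# Made-up labels for illustration; class 2 is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

support = Counter(y_true)  # number of true samples per class
true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Per-class recall, then an average weighted by class support.
recall_per_class = {c: true_pos[c] / support[c] for c in support}
weighted_recall = sum(support[c] / n * recall_per_class[c] for c in support)

print(accuracy, weighted_recall)
```

The ``_warn_prf`` line in the output is most likely scikit-learn warning that at least one class received no predictions, in which case its precision is undefined and set to zero. That would also help explain why precision and F1 are lower than accuracy.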

### Comparing our fine-tuned BERT with separate head to pre-trained BERT and separate head

To compare the performance of our fine-tuned BERT model with separate head to a pre-trained BERT model with separate head, we load the original ``bert-base-uncased`` weights using the Hugging Face ``BertForMaskedLM`` class. We then join this non-fine-tuned model with our own classification head and train only the classification head, using the same training settings as for our fine-tuned model.

After training, we can evaluate the performance of both models on the same validation set and compare their performance metrics, such as accuracy, precision, recall, and F1 score.

By comparing the performance of the fine-tuned BERT model with separate head to the pre-trained BERT model with separate head, we can assess the impact of our fine-tuning process on the medical text classification task. Both models use the same custom classification head, so the comparison isolates the effect of the fine-tuning.

In [19]:
import torch

torch.cuda.empty_cache()
model_pretrained = BertForMaskedLM.from_pretrained("bert-base-uncased")
pre_model_path = train_bert_with_cls_head(
    model_name="classifier-pre", bert_mlm=model_pretrained
)

loading configuration file config.json from cache at /home/grimfada/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /home/grimfada/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/pytorch_model.bin
Some weights 

Step,Training Loss
500,1.3519


Saving model checkpoint to out/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


In [20]:
load_and_evaluate_saved_model(model_path=pre_model_path)


No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
  Num examples = 572
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.6031468531468531,
 'precision': 0.4926711462594029,
 'recall': 0.6031468531468531,
 'f1': 0.5013725085029777}
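To put the gap between the two accuracies in context, the sketch below converts them back to counts of correctly classified validation examples and computes a rough binomial standard error. The standard error assumes independent examples and is only a back-of-the-envelope estimate, not a formal significance test.

```python
import math

n = 572  # validation examples, from the prediction log
acc_fine = 0.6083916083916084  # fine-tuned BERT with separate head
acc_pre = 0.6031468531468531   # pre-trained BERT with separate head

# Number of correctly classified examples behind each accuracy.
correct_fine = round(acc_fine * n)
correct_pre = round(acc_pre * n)

# Rough binomial standard error of a single accuracy estimate.
std_err = math.sqrt(acc_fine * (1 - acc_fine) / n)

print(correct_fine, correct_pre, correct_fine - correct_pre)  # 348 345 3
print(round(std_err, 3))  # 0.02
```

The two models differ by 3 out of 572 examples, about 0.5 percentage points, while the rough standard error of each accuracy is about 2 percentage points.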

> What can be deduced from the results? Why?

## Conclusion

In this and the two preceding notebooks, we've gone through the process of preprocessing textual data for BERT-based NLP, fine-tuning a pre-trained model, and creating a custom classification head for the model.