## Fine-Tuning a BERT base model for NER on Custom Dataset
### Run it on Google Colab

The approach is adapted from Rohan Paul's tutorial on the same topic; his video and code are linked below. <br>
https://www.youtube.com/watch?v=dzyDHMycx_c&list=PLxqBkZuBynVQEvXfJpq3smfuKq3AiNW-N&index=19

https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models/blob/main/Other-Language_Models_BERT_related/YT_Fine_tuning_BERT_NER_v1.ipynb


### Dataset

#### https://huggingface.co/datasets/EMBO/BLURB

JNLPBA
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. The task was organized by GENIA Project based on the annotations of the GENIA Term corpus (version 3.02). Corpus format: **The JNLPBA corpus is distributed in IOB format, with each line containing a single token and its tag, separated by a tab character. Sentences are separated by blank lines.**
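
The corpus format described above can be read with a few lines of Python. This is a minimal sketch, assuming the tab-separated, blank-line-delimited layout quoted above; the function name `parse_iob` and the sample text are illustrative and are not part of the Hugging Face dataset loader.

```python
def parse_iob(text):
    """Parse IOB-formatted text into a list of (tokens, tags) pairs, one per sentence."""
    sentences = []
    tokens, tags = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # Each non-blank line holds one token and its tag, separated by a tab.
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:
        # Flush the last sentence if the text does not end with a blank line.
        sentences.append((tokens, tags))
    return sentences


# Illustrative sample in the assumed layout (two short sentences).
sample = "IL-2\tB-DNA\ngene\tI-DNA\nexpression\tO\n\nNF-kappa\tB-protein\nB\tI-protein\n"
parsed = parse_iob(sample)
print(parsed)
```

The `datasets` loader used in this notebook already returns the tokens and tags as lists, so this parsing is only needed when working with the raw corpus files.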

### Base Model
AutoModelForTokenClassification.from_pretrained("**bert-base-uncased**")

Ref : https://huggingface.co/course/chapter7/2

Also covered: **BertTokenizerFast** and the **Transformers pipeline aggregation strategies**.

In [1]:
!pip install transformers datasets tokenizers seqeval accelerate -U -q

### https://huggingface.co/course/chapter7/2

In [2]:
# For CUDA 10.0 which pytorch version ? https://pytorch.org/get-started/previous-versions/
# https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

True
1
0
Tesla T4


In [3]:
# Import libraries
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

### Load the bio-medical dataset and explore a bit

In [4]:
blurb = datasets.load_dataset("EMBO/BLURB", "JNLPBA")
blurb

The repository for EMBO/BLURB contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/EMBO/BLURB.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 18608
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1940
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 4261
    })
})

In [5]:
type(blurb)

datasets.dataset_dict.DatasetDict

In [6]:
type(blurb['train'])

In [7]:
blurb.shape

{'train': (18608, 3), 'validation': (1940, 3), 'test': (4261, 3)}

In [8]:
blurb["train"][5]

{'id': '5',
 'tokens': ['Our',
  'data',
  'suggest',
  'that',
  'lipoxygenase',
  'metabolites',
  'activate',
  'ROI',
  'formation',
  'which',
  'then',
  'induce',
  'IL-2',
  'expression',
  'via',
  'NF-kappa',
  'B',
  'activation',
  '.'],
 'ner_tags': [0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0]}

In [9]:
blurb["train"].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-protein', 'I-protein', 'B-cell_type', 'I-cell_type', 'B-cell_line', 'I-cell_line', 'B-DNA', 'I-DNA', 'B-RNA', 'I-RNA'], id=None), length=-1, id=None)}

In [10]:
blurb['train'].description

'The BioNLP / JNLPBA Shared Task 2004 involves the identification \n                    and classification of technical terms referring to concepts of interest to \n                    biologists in the domain of molecular biology. The task was organized by GENIA \n                    Project based on the annotations of the GENIA Term corpus (version 3.02). \n                    Corpus format: The JNLPBA corpus is distributed in IOB format, with each line \n                    containing a single token and its tag, separated by a tab character. \n                    Sentences are separated by blank lines.'

### Data Preparation for modeling

In [11]:
# Collect the NER labels available in the dataset
label_list = blurb["train"].features["ner_tags"].feature.names
label_list

['O',
 'B-protein',
 'I-protein',
 'B-cell_type',
 'I-cell_type',
 'B-cell_line',
 'I-cell_line',
 'B-DNA',
 'I-DNA',
 'B-RNA',
 'I-RNA']

In [12]:
# Load the BertTokenizerFast from the pre-trained BERT model.
# The tokenizer is the first component in the model pipeline -
# It tokenizes and converts the text to numerical ids to be fed to the model training/inferencing.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")




In [13]:
## Let's observe the effect of the tokenizer on a sample data item.

In [14]:
example_text = blurb['train'][5]
original_words = example_text["tokens"]

print("Original text words/tokens :\n", original_words)
print("\nNumber of original text words/tokens :\n", len(original_words))
print()

print("Original tags/labels :\n", example_text["ner_tags"])
print("\nNumber of original tags/labels :\n", len(example_text["ner_tags"]))
print()

tokenized_input = tokenizer(original_words, is_split_into_words=True)
print("Numeric Ids created by tokenizer:\n", tokenized_input["input_ids"])
print("\nNumber of numeric Ids created by tokenizer :\n", len(tokenized_input["input_ids"]))
print()

new_tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print("Tokens created by tokenizer:\n", new_tokens)
print("\nNumber of new tokens created by tokenizer :\n", len(new_tokens))
print()

word_ids = tokenized_input.word_ids()
print("Word Ids :\n", word_ids)
print("\nNumber of unique words :\n", len(set(word_ids)))

Original text words/tokens :
 ['Our', 'data', 'suggest', 'that', 'lipoxygenase', 'metabolites', 'activate', 'ROI', 'formation', 'which', 'then', 'induce', 'IL-2', 'expression', 'via', 'NF-kappa', 'B', 'activation', '.']

Number of original text words/tokens :
 19

Original tags/labels :
 [0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0]

Number of original tags/labels :
 19

Numeric Ids created by tokenizer:
 [101, 2256, 2951, 6592, 2008, 5423, 11636, 2100, 28835, 18804, 14956, 7616, 20544, 25223, 4195, 2029, 2059, 19653, 6335, 1011, 1016, 3670, 3081, 1050, 2546, 1011, 16000, 1038, 13791, 1012, 102]

Number of numeric Ids created by tokenizer :
 31

Tokens created by tokenizer:
 ['[CLS]', 'our', 'data', 'suggest', 'that', 'lip', '##ox', '##y', '##genase', 'meta', '##bol', '##ites', 'activate', 'roi', 'formation', 'which', 'then', 'induce', 'il', '-', '2', 'expression', 'via', 'n', '##f', '-', 'kappa', 'b', 'activation', '.', '[SEP]']

Number of new tokens created by tokenizer :

## Problem of sub-tokens

We see above that when we pass the original words to the tokenizer,
it creates new tokens by adding two special tokens, [CLS] and [SEP], and
by splitting some original words into subwords. For example, 'lipoxygenase' is split into 'lip', '##ox', '##y', '##genase'. The word ids in the tokenized input show word id 4 for all of these subwords, meaning they all belong to the same word.

The input ids/tokens returned by the tokenizer are longer than the lists of labels our dataset contains. In the above example, we have 19 words and 19 corresponding tags/labels in the original data, but 31 new tokens, each of which needs a label.

This mismatch is introduced by the pre-trained tokenizer, and we have to address it before we feed the tokens to the model during training.

We solve the issue as follows:
We set the labels of all special tokens to -100 and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word.
Why did we choose -100? In PyTorch, the cross-entropy loss class torch.nn.CrossEntropyLoss has a parameter called ignore_index whose default value is -100. Targets with this value are ignored when the loss is computed, so they contribute nothing to training.
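
The two labelling strategies described above can be sketched in plain Python, without a tokenizer. This is a simplified illustration that follows the same per-token logic as the notebook's `tokenize_and_align_labels`; the helper name `align_labels` and the toy `word_ids` / `word_labels` values are made up for the example.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level, using -100 for ignored positions."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) have no source word.
            aligned.append(-100)
        elif word_id != previous_word_id:
            # First token of a word always takes the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subword tokens: copy the label or mask it with -100.
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous_word_id = word_id
    return aligned


# Toy input: three words, the second one split into three subword tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 2]

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 0, 1, 1, 1, 2, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 0, 1, -100, -100, 2, -100]
```

With `label_all_tokens=True` every subword contributes to the loss, so long words weigh more; with `False` each word is counted once.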

In [15]:
## The function `tokenize_and_align_labels` below does two jobs:
# 1. It sets -100 as the label for the special tokens; subword tokens get either the same
#    label as their word or -100, depending on the flag 'label_all_tokens'.
# 2. It aligns the labels with the token ids using the strategy we picked.

def tokenize_and_align_labels(examples, label_all_tokens=True):
    """
    Function to tokenize and align labels with respect to the tokens.
    This function is specifically designed for Named Entity Recognition (NER) tasks
    where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A batch of examples containing the tokens and the corresponding NER tags.
                     Both are 2-d nested lists; each inner list holds the
                     tokens or ner_tags of one row/sentence.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.

    label_all_tokens (bool): A flag to indicate whether all tokens should have labels.
                             If False, only the first token of a word will have a label,
                             the other tokens (subwords) corresponding to the same word will
                             be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """

    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    # Recomputed labels for the tokens of all given examples(rows).
    labels = []
    # Word_ids for the tokens of all given examples(rows)
    word_ids_lst = []

    # Iterate over each example.
    for i, orig_label_lst in enumerate(examples["ner_tags"]):

        # word_ids() => Return a list mapping the tokens
        # to their actual word in the initial sentence.
        # It Returns a list indicating the word corresponding to each token.
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None

        # List for storing computed labels for each token in the example(row).
        label_ids = []

        # Special tokens such as `[CLS]` and `[SEP]` are mapped to None by word_ids().
        # We set their label to -100 so they are automatically ignored in the loss function.
        for word_idx in word_ids:

            if word_idx is None:
                # Set -100 as the label for these special tokens.
                label_ids.append(-100)

            elif word_idx != previous_word_idx:
                # First token of a new word: the regular case.
                # Use the label of the word it belongs to.
                label_ids.append(orig_label_lst[word_idx])

            else:
                # Subsequent sub-word tokens share the word_idx of the previous token.
                # Keep the word's label if label_all_tokens is True, otherwise use -100.
                label_ids.append(orig_label_lst[word_idx] if label_all_tokens else -100)


            previous_word_idx = word_idx

        # Append this row's label_ids to the master list.
        labels.append(label_ids)
        # Append this row's word_ids to the master list.
        word_ids_lst.append(word_ids)

    # Fill the main tokenized_inputs with extra information - labels and word_ids.
    tokenized_inputs["labels"] = labels
    tokenized_inputs["word_ids"] = word_ids_lst

    return tokenized_inputs

#### Let's apply the above method tokenize_and_align_labels() on a training sample and observe the results

In [16]:
blurb['train'][1:2]

{'id': ['1'],
 'tokens': [['IL-2',
   'gene',
   'expression',
   'and',
   'NF-kappa',
   'B',
   'activation',
   'through',
   'CD28',
   'requires',
   'reactive',
   'oxygen',
   'production',
   'by',
   '5-lipoxygenase',
   '.']],
 'ner_tags': [[7, 8, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]]}

In [18]:
q = tokenize_and_align_labels(blurb['train'][1:2])
q

{'input_ids': [[101, 6335, 1011, 1016, 4962, 3670, 1998, 1050, 2546, 1011, 16000, 1038, 13791, 2083, 3729, 22407, 5942, 22643, 7722, 2537, 2011, 1019, 1011, 5423, 11636, 2100, 28835, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 7, 7, 7, 8, 0, 0, 1, 1, 1, 1, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, -100]], 'word_ids': [[None, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 14, 14, 14, 14, 14, 14, 15, None]]}

### Before applying `tokenize_and_align_labels()`, the `tokenized_input` has 3 keys
- input_ids
- token_type_ids
- attention_mask

After applying `tokenize_and_align_labels()` we have two extra keys: `'labels'` and `'word_ids'`


In [19]:
# Print new_token, word_id and labels assigned for each new token.
for new_token, word_id, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["word_ids"][0], \
                                     q["labels"][0]):
    if word_id is None:
        print(f"{new_token:_<20}",f"{'None':_<20}", label)
    else:
        print(f"{new_token:_<20} {word_id:_<20} {label}")

[CLS]_______________ None________________ -100
il__________________ 0___________________ 7
-___________________ 0___________________ 7
2___________________ 0___________________ 7
gene________________ 1___________________ 8
expression__________ 2___________________ 0
and_________________ 3___________________ 0
n___________________ 4___________________ 1
##f_________________ 4___________________ 1
-___________________ 4___________________ 1
kappa_______________ 4___________________ 1
b___________________ 5___________________ 2
activation__________ 6___________________ 0
through_____________ 7___________________ 0
cd__________________ 8___________________ 1
##28________________ 8___________________ 1
requires____________ 9___________________ 0
reactive____________ 10__________________ 0
oxygen______________ 11__________________ 0
production__________ 12__________________ 0
by__________________ 13__________________ 0
5___________________ 14__________________ 1
-___________________ 14______

In [20]:
# Apply tokenize_and_align_labels() on the entire dataset and get the tokenized output
tokenized_datasets = blurb.map(tokenize_and_align_labels, batched=True)


In [21]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels', 'word_ids'],
        num_rows: 18608
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels', 'word_ids'],
        num_rows: 1940
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels', 'word_ids'],
        num_rows: 4261
    })
})

The dataset is now prepared for training. Next, we set up what is needed for model training and evaluation.

## Model training and evaluation

#### Load the pre-trained BERT Base model and examine it.

In [22]:
# We use AutoModelForTokenClassification as we
# are doing classification of tokens to different NER labels.
# We specify the number of labels to be the number of unique NER labels in our data.

model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", \
                                                        num_labels=len(label_list))


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
model = model.to('cuda')

In [24]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [25]:
# Look at the last layer of the model. It is a classifier that takes an input of length 768 and
# produces an output of length 11, which is the same as the length of label_list.

# This means that each token is converted into a feature vector of length 768 and then,
# at the last classification layer, the model outputs 11 values (logits) for each token.
# Each of the 11 logits is an unnormalized score for one of the 11 NER labels;
# a softmax would turn them into probabilities.
# The index of the highest value among the 11 logits (argmax) will
# be the predicted label for the token.

len(label_list)

11
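
To make the logits-to-label step concrete, here is a small pure-Python sketch. It assumes a toy token with only 4 labels and made-up logit values; `softmax` and `argmax` are helper names defined here, not library calls. It shows that the predicted index is the same whether we take the argmax of the raw logits or of the softmax probabilities.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for one token over 4 hypothetical labels.
logits = [3.0, 2.0, 4.0, 1.0]
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(argmax(logits), argmax(probs))  # same index from logits and from probabilities
```

Because softmax preserves the ordering of its inputs, the prediction can be read directly from the logits.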

#### Install and load the seqeval metric <br>

This metric compares two sequences, the predicted labels and the true labels, across a set of data and computes precision, recall, F1-score and accuracy.

In [26]:
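!pip install seqeval
# Note: `datasets.load_metric` is deprecated and has been removed in recent `datasets` releases;
# the replacement is `evaluate.load("seqeval")` from the separate `evaluate` package.
metric = datasets.load_metric("seqeval")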



The repository for seqeval contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/seqeval.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


In [None]:
#print(blurb['train'][2:5]['ner_tags'])

In [27]:
# Take a subset of train data and see how seqeval computes metric.

examples = blurb['train'][2:5]

print("True NER tag indices :\n",examples['ner_tags'])
print()

true_labels = [[label_list[lbl_idx] for lbl_idx in i_ner_tag_lst] \
               # For each example in the batch
               for i_ner_tag_lst in examples['ner_tags']]

print("True NER tag strings :\n", true_labels)
print()

# Compute the metric - just try giving only the true_labels for both predictions and references
metric.compute(predictions=true_labels, references=true_labels)

True NER tag indices :
 [[0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0], [0, 3, 4, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0, 1, 2, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0]]

True NER tag strings :
 [['O', 'O', 'O', 'B-protein', 'I-protein', 'I-protein', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-protein', 'O', 'B-protein', 'O', 'O', 'O', 'O', 'O'], ['O', 'B-cell_type', 'I-cell_type', 'I-cell_type', 'O', 'O', 'O', 'B-protein', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-protein', 'O', 'O', 'O', 'O', 'B-protein', 'I-protein', 'O', 'B-protein', 'I-protein', 'O', 'B-protein', 'O', 'O'], ['O', 'O', 'O', 'B-protein', 'O', 'O', 'O', 'O', 'O', 'O', 'B-protein', 'I-protein', 'I-protein', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-protein', 'I-protein', 'O', 'B-protein', 'O']]



{'cell_type': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'protein': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 12},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

Since we gave the same values for both predictions and references, we see a perfect 1.0 for precision,
recall, F1 and accuracy.

The seqeval package expects the predictions and the references as lists of lists (like true_labels above), with each inner list containing the label strings of a single example in our validation or test set.

To compute these metrics during model training, we need a function that takes the outputs of the model and converts them into the lists that seqeval expects.
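
The precision, recall and F1 reported above are computed over whole entities rather than individual tokens. The sketch below illustrates that idea in plain Python. It is a simplified approximation, not seqeval's actual implementation: the helper names `extract_entities` and `entity_prf` are made up, and an `I-` tag with no preceding `B-` is simply ignored here.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end_exclusive) spans from IOB tags."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        continues_span = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues_span:
            entities.add((current_type, start, i))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 over a batch of tag sequences."""
    true_entities, pred_entities = set(), set()
    for idx, (t, p) in enumerate(zip(true_tags, pred_tags)):
        true_entities |= {(idx, *e) for e in extract_entities(t)}
        pred_entities |= {(idx, *e) for e in extract_entities(p)}
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the prediction cuts the first protein mention one token short.
toy_true = [["O", "B-protein", "I-protein", "O", "B-protein"]]
toy_pred = [["O", "B-protein", "O", "O", "B-protein"]]

print(entity_prf(toy_true, toy_pred))  # (0.5, 0.5, 0.5)
```

In this toy case 4 of 5 tokens are tagged correctly, yet only one of the two entities matches exactly, which is why entity-level F1 is usually lower than token accuracy.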

In [28]:
# Example for reference to understand the compute_metrics code
#pred_logits = [[[1,2,3], [3,2,1]]]
#pred_logits = np.argmax(pred_logits, axis=2)
#pred_logits

In [29]:
# The following function does the trick.
# It identifies the predicted label of each token from the list of logits (argmax) given by the model.
# It also drops every position whose true label is -100 (special tokens, padding, and any
# subword tokens that were masked during alignment).

def compute_metrics(eval_preds, debug = False):
    """
    Function to compute the evaluation metrics for Named Entity Recognition (NER) tasks.
    The function computes precision, recall, F1 score and accuracy.

    Parameters:
    eval_preds (tuple): A tuple containing the predicted logits and the true labels.

    Returns:
    A dictionary containing the precision, recall, F1 score and accuracy.
    """
    # eval_preds is a tuple. A tuple containing the predicted logits and the true labels.

    # pred_logits - 3d list
    #                - Sequences in a batch , Tokens in a sequence, Label probs/logit in a token
    # true_label_indices - 2d list
    #                - Sequences in a batch, Tokens' true label in a sequence
    pred_logits, true_label_indices = eval_preds

    # pred_label_indices - 2d list
    #                - Sequences in a batch, Tokens' predicted label index in a sequence
    pred_label_indices = np.argmax(pred_logits, axis=2)

    if debug:
        print("True label indices for each token in each sequence")
        print(true_label_indices)
        print()
        print("Predicted label indices for each token in each sequence")
        print(pred_label_indices)
        print()
        print(list(zip(pred_label_indices, true_label_indices)))


    # Softmax preserves the ordering of the logits, so the argmax of the logits equals
    # the argmax of the probabilities and we don't need to apply the softmax.

    # We remove all the values where the label is -100
    predicted_labels = [
        # Get the predicted label in string from the label_list
        [label_list[pred_l_index]
        # Iterate for each token in the sequence
        for (pred_l_index, true_l_index) in zip(pred_index_lst, true_index_lst) \
         if true_l_index != -100]
        # Iterate for each sequence in the batch
        for pred_index_lst, true_index_lst in zip(pred_label_indices, true_label_indices)
    ]

    true_labels = [
       # Get the true label in string from the label_list
       [label_list[true_l_index]
       # Iterate for each token in the sequence
       for (pred_l_index, true_l_index) in zip(pred_index_lst, true_index_lst) if true_l_index != -100]
       # Iterate for each sequence in the batch
       for pred_index_lst, true_index_lst in zip(pred_label_indices, true_label_indices)
   ]

    if debug:
        print()
        print("True labels :\n", true_labels)
        print()
        print("Predicted labels :\n", predicted_labels)
        print()

    results = metric.compute(predictions=predicted_labels, references=true_labels)

    return {
       "precision": results["overall_precision"],
       "recall": results["overall_recall"],
       "f1": results["overall_f1"],
       "accuracy": results["overall_accuracy"],
       }

In [30]:
# We can test and understand the compute_metrics() method by calling it
# with a simple toy example below.

# Just for simplification, assume there are only 4 unique labels (0,1,2,3).
# So each token will have a logits/probabilities list of length 4, with each element
# in the list being the logit/probability of a particular label possible.
# The label with the highest logit value will be the predicted label for that token.


# pred_logits is a 3-d list.
# First dimension - Size of a batch of example sequences / rows.
# Second dimension - Number of tokens inside one example sequence.
# Third dimension - Number of possible labels (probabilities for each label) for one token
#                   inside the example.

pred_logits = [ # First dimension

                # Example sequence 1
                  [   # Second dimension

                      # Token 1
                      [ # Third dimension

                          #Each label's probabilities for the token
                          1, 2, 3, 4],

                      # Token 2
                      [   # Third dimension

                          #Each label's probabilities for the token
                          4, 3, 2, 1],

                      # Token 3
                      [   # Third dimension

                          #Each label's probabilities for the token
                          3, 2, 4, 1]


                  ],

                  # Example sequence 2
                  [
                       # Token 1
                       [#Each label's probabilities for the token
                        0, 2, 3, 4],

                       # Token 2
                       [#Each label's probabilities for the token
                        4, 5, 2, 1],

                       # Token 3
                       [   # Third dimension

                          #Each label's probabilities for the token
                          3, 2, 4, 1]

                  ]
              ]

# true_labels is a 2-d list.
# First dimension - Size of a batch of example sequences / rows.
# Second dimension - Number of tokens (true label for each token) inside one example sequence.

true_labels = [ # First dimension

           # Sequence 1
           [# Second dimension

              # True label for each token in the sequence
              3, 0, -100],

           # Sequence 2
           [
              # True label for each token in the sequence
              3, 1, -100]
         ]

compute_metrics((pred_logits, true_labels), True)

True label indices for each token in each sequence
[[3, 0, -100], [3, 1, -100]]

Predicted label indices for each token in each sequence
[[3 0 2]
 [3 1 2]]

[(array([3, 0, 2]), [3, 0, -100]), (array([3, 1, 2]), [3, 1, -100])]

True labels :
 [['B-cell_type', 'O'], ['B-cell_type', 'B-protein']]

Predicted labels :
 [['B-cell_type', 'O'], ['B-cell_type', 'B-protein']]



{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'accuracy': 1.0}

Now, we have the data prepared and metrics computing functionality ready,
we will go ahead straight to training the BERT model with these.

In [31]:
from transformers import TrainingArguments, Trainer
#import accelerate

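args = TrainingArguments(
    "test-ner-blurb",
    # Evaluate on the validation set at the end of every epoch.
    # (Newer transformers releases name this argument `eval_strategy`.)
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
)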

# To be able to build batches, data collators may apply some processing (like padding).
# Data collator that will dynamically pad the inputs received, as well as the labels.
# https://huggingface.co/docs/transformers/en/main_classes/data_collator
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
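
The dynamic padding done by the data collator can be pictured with a small pure-Python sketch. It is an illustration under stated assumptions, not the collator's real code: it pads on the right, uses pad token id 0 (matching `padding_idx=0` in the model printout) and pads labels with -100 so that padded positions are ignored by the loss. The function name `pad_batch` and the two toy features are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get attention 1, padding gets 0.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions are masked out with -100.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths (values are illustrative).
toy_features = [
    {"input_ids": [101, 2256, 2951, 102], "labels": [-100, 0, 0, -100]},
    {"input_ids": [101, 6335, 102], "labels": [-100, 1, -100]},
]

toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"])       # [[101, 2256, 2951, 102], [101, 6335, 102, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(toy_batch["labels"])          # [[-100, 0, 0, -100], [-100, 1, -100, -100]]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.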



In [32]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2738 | 0.225556 | 0.791778 | 0.818425 | 0.804881 | 0.924198 |
| 2 | 0.2047 | 0.219932 | 0.795836 | 0.830752 | 0.812919 | 0.927107 |
| 3 | 0.1614 | 0.21983 | 0.799632 | 0.846971 | 0.822621 | 0.92907 |
| 4 | 0.1326 | 0.2397 | 0.808202 | 0.840646 | 0.824105 | 0.929986 |
| 5 | 0.1061 | 0.254901 | 0.820255 | 0.834158 | 0.827148 | 0.931557 |
| 6 | 0.0916 | 0.27483 | 0.811771 | 0.829941 | 0.820755 | 0.929826 |
| 7 | 0.0744 | 0.284208 | 0.819293 | 0.831319 | 0.825263 | 0.930932 |
| 8 | 0.0603 | 0.3089 | 0.816741 | 0.831644 | 0.824125 | 0.93019 |
| 9 | 0.0528 | 0.334479 | 0.812234 | 0.834563 | 0.823247 | 0.929463 |
| 10 | 0.0462 | 0.336788 | 0.815448 | 0.831319 | 0.823307 | 0.929885 |


TrainOutput(global_step=11630, training_loss=0.1275965333907438, metrics={'train_runtime': 3506.5626, 'train_samples_per_second': 53.066, 'train_steps_per_second': 3.317, 'total_flos': 7458775690579872.0, 'train_loss': 0.1275965333907438, 'epoch': 10.0})
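
As a sanity check on the numbers above, the following sketch recomputes the total number of optimizer steps and picks out the best epochs from the logged validation metrics. It assumes a single GPU and no gradient accumulation (neither is changed in the `TrainingArguments` shown); the variable names are made up and the values are copied from the notebook output.

```python
import math

# Values copied from the notebook: training rows, per-device batch size, epochs.
train_rows, batch_size, epochs = 18608, 16, 10

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1163 11630

# (epoch, validation loss, F1) copied from the training log.
history = [
    (1, 0.225556, 0.804881), (2, 0.219932, 0.812919), (3, 0.21983, 0.822621),
    (4, 0.2397, 0.824105), (5, 0.254901, 0.827148), (6, 0.27483, 0.820755),
    (7, 0.284208, 0.825263), (8, 0.3089, 0.824125), (9, 0.334479, 0.823247),
    (10, 0.336788, 0.823307),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])
print("Lowest validation loss at epoch", best_by_loss[0])  # 3
print("Highest F1 at epoch", best_by_f1[0])                # 5
```

On these logged values the validation loss is lowest at epoch 3 while F1 peaks at epoch 5, which suggests that the later epochs mostly fit the training set further without improving validation quality.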

In [33]:
# This evaluates the model on 'eval_dataset' (set to the validation dataset in the trainer above).
trainer.evaluate()

{'eval_loss': 0.33678779006004333,
 'eval_precision': 0.8154482539177472,
 'eval_recall': 0.8313194388127484,
 'eval_f1': 0.8233073648702915,
 'eval_accuracy': 0.9298845225282876,
 'eval_runtime': 10.4328,
 'eval_samples_per_second': 185.952,
 'eval_steps_per_second': 11.694,
 'epoch': 10.0}

In [34]:
# Evaluate the model on test dataset
trainer.evaluate(eval_dataset=tokenized_datasets["test"])

{'eval_loss': 0.48505499958992004,
 'eval_precision': 0.7126924737372283,
 'eval_recall': 0.813719449578969,
 'eval_f1': 0.759862680040659,
 'eval_accuracy': 0.9065338877559128,
 'eval_runtime': 24.4545,
 'eval_samples_per_second': 174.242,
 'eval_steps_per_second': 10.918,
 'epoch': 10.0}
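
The reported F1 is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). A quick check against the validation and test numbers above; the values are copied from the evaluation output, and `f1_score` here is a local helper, not the `seqeval` function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the two trainer.evaluate() outputs.
validation_f1 = f1_score(0.8154482539177472, 0.8313194388127484)
test_f1 = f1_score(0.7126924737372283, 0.813719449578969)

print(f"validation F1 = {validation_f1:.4f}")
print(f"test F1       = {test_f1:.4f}")
```

The drop from about 0.82 on validation to about 0.76 on test comes mostly from precision (0.82 to 0.71); recall changes much less (0.83 to 0.81).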

## Save the Model and prepare "index to label string" mapping

In [35]:
model.save_pretrained("ner_model_blurb")

In [36]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}
label2id = {
    label: str(i) for i,label in enumerate(label_list)
}

print(id2label)
print(label2id)

{'0': 'O', '1': 'B-protein', '2': 'I-protein', '3': 'B-cell_type', '4': 'I-cell_type', '5': 'B-cell_line', '6': 'I-cell_line', '7': 'B-DNA', '8': 'I-DNA', '9': 'B-RNA', '10': 'I-RNA'}
{'O': '0', 'B-protein': '1', 'I-protein': '2', 'B-cell_type': '3', 'I-cell_type': '4', 'B-cell_line': '5', 'I-cell_line': '6', 'B-DNA': '7', 'I-DNA': '8', 'B-RNA': '9', 'I-RNA': '10'}


In [37]:
import json

# Read the saved config, add the label mappings, and write it back.
with open("ner_model_blurb/config.json") as f:
    config = json.load(f)

config["id2label"] = id2label
config["label2id"] = label2id

with open("ner_model_blurb/config.json", "w") as f:
    json.dump(config, f)
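
A quick sanity check on the two mappings can catch gaps or mismatches before the model is shared. A small sketch: `check_label_maps` is a hypothetical helper, and the label list is copied from the printed output above. Published Hugging Face configs usually store `label2id` values as integers rather than strings, so `label: i` may be the more conventional choice; the check below accepts either form.

```python
label_list = ["O", "B-protein", "I-protein", "B-cell_type", "I-cell_type",
              "B-cell_line", "I-cell_line", "B-DNA", "I-DNA", "B-RNA", "I-RNA"]

id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: str(i) for i, label in enumerate(label_list)}


def check_label_maps(id2label, label2id):
    """Return True if the maps are inverses and the ids run 0..n-1 without gaps."""
    ids = sorted(int(i) for i in id2label)
    contiguous = ids == list(range(len(ids)))
    same_size = len(id2label) == len(label2id)
    inverse = all(
        label in label2id and str(label2id[label]) == str(i)
        for i, label in id2label.items()
    )
    return contiguous and same_size and inverse


print(check_label_maps(id2label, label2id))
```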

## Load the saved model and test it with an example

In [38]:
fine_tuned_model = AutoModelForTokenClassification.from_pretrained("ner_model_blurb")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")



In [39]:
# Import Hugging Face transformers pipeline
from transformers import pipeline

In [40]:
# Make a transformers pipeline with "ner" head, our fine tuned model and the tokenizer
nlp = pipeline("ner", model=fine_tuned_model, tokenizer=tokenizer)

example = "Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action."
print(f"\n{example}\n")
ner_results = nlp(example)

# Raw results from model
print("\nRaw output without any aggregation:")
for i in ner_results:
    print(i)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.



Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action.


Raw output without any aggregation:
{'entity': 'B-protein', 'score': 0.99729985, 'index': 3, 'word': 'g', 'start': 10, 'end': 11}
{'entity': 'B-protein', 'score': 0.99739397, 'index': 4, 'word': '##lu', 'start': 11, 'end': 13}
{'entity': 'B-protein', 'score': 0.997474, 'index': 5, 'word': '##co', 'start': 13, 'end': 15}
{'entity': 'B-protein', 'score': 0.99756026, 'index': 6, 'word': '##cor', 'start': 15, 'end': 18}
{'entity': 'B-protein', 'score': 0.99683374, 'index': 7, 'word': '##tic', 'start': 18, 'end': 21}
{'entity': 'B-protein', 'score': 0.9979552, 'index': 8, 'word': '##oid', 'start': 21, 'end': 24}
{'entity': 'I-protein', 'score': 0.99784315, 'index': 9, 'word': 'receptors', 'start': 25, 'end': 34}
{'entity': 'B-cell_type', 'score': 0.9930347, 'index': 11, 'word': 'l', 'start': 38, 'end': 39}
{'entity': 'B-cell_type', 'score': 0.99458873, 'index': 12, 'word': '##ym', 'start': 39, 'end

In [41]:
nlp = pipeline("ner", model=fine_tuned_model, tokenizer=tokenizer, aggregation_strategy="first", device=0)
example = "Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action."
ner_results = nlp(example)
ner_results

[{'entity_group': 'protein',
  'score': 0.99757147,
  'word': 'glucocorticoid receptors',
  'start': 10,
  'end': 34},
 {'entity_group': 'cell_type',
  'score': 0.9930347,
  'word': 'lymphocytes',
  'start': 38,
  'end': 49}]

## Push the fine-tuned NER model to the Hugging Face Hub

In [42]:
from huggingface_hub import notebook_login
notebook_login()


In [43]:
fine_tuned_model.push_to_hub("rajaramsblr/My_FineTuneBert_Blurb")

CommitInfo(commit_url='https://huggingface.co/rajaramsblr/My_FineTuneBert_Blurb/commit/113df836d561a5bb57cae449d536a5a402cf55eb', commit_message='Upload BertForTokenClassification', commit_description='', oid='113df836d561a5bb57cae449d536a5a402cf55eb', pr_url=None, pr_revision=None, pr_num=None)

## Load the fine-tuned model from the Hugging Face Hub and test it

In [44]:
import datasets
blurb = datasets.load_dataset("EMBO/BLURB", "JNLPBA")
blurb

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 18608
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1940
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 4261
    })
})

In [None]:
#print(blurb['test'][1:3]['tokens'])

In [45]:
from transformers import pipeline
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model_fine_tuned="rajaramsblr/My_FineTuneBert_Blurb"

#### Try token aggregation strategies available in transformers NER pipeline

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py

`aggregation_strategy` (`str`, *optional*, defaults to `"none"`): The strategy to fuse (or not) tokens based on the model prediction.

- `"none"`: Will simply not do any aggregation and simply return raw results from the model.
- `"simple"`: Will attempt to group entities following the default schema. `(A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2) (E, B-TAG2)` will end up being `[{"word": ABC, "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]`. Notice that two consecutive B tags will end up as different entities. On word based languages, we might end up splitting words undesirably: imagine Microsoft being tagged as `[{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]`. Look for FIRST, MAX, AVERAGE for ways to mitigate that and disambiguate words (on languages that support that meaning, which is basically tokens separated by a space). These mitigations will only work on real words, "New york" might still be tagged with two different entities.
- `"first"`: (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot end up with different tags. Words will simply use the tag of the first token of the word when there is ambiguity.
- `"average"`: (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot end up with different tags. scores will be averaged first across tokens, and then the maximum label is applied.
- `"max"`: (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot end up with different tags. Word entity will simply be the token with the maximum score.
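
The "simple" rule described above (a `B-` tag always opens a new group, an `I-` tag of the same type extends the current one) can be sketched in plain Python. This is a simplified illustration, not the pipeline's code: `split_tag` and `group_simple` are hypothetical helpers, the sketch works only on each token's top label and score, and it assumes the group score is the mean of its token scores. The sample tokens are copied from the raw pipeline output earlier in this notebook.

```python
from statistics import mean


def split_tag(entity):
    """'B-protein' -> ('B', 'protein'); tags without a prefix are treated as 'I'."""
    if entity.startswith(("B-", "I-")):
        return entity[0], entity[2:]
    return "I", entity


def group_simple(tokens):
    """Group token predictions: 'B' opens a group, a matching 'I' extends it."""
    groups, current = [], []
    for tok in tokens:
        if current:
            prefix, tag = split_tag(tok["entity"])
            _, last_tag = split_tag(current[-1]["entity"])
            if tag == last_tag and prefix != "B":
                current.append(tok)  # same entity type, not a new 'B': extend
                continue
            groups.append(current)   # otherwise close the running group
        current = [tok]
    if current:
        groups.append(current)
    return [
        {
            "entity_group": split_tag(g[0]["entity"])[1],
            "score": mean(t["score"] for t in g),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]


# A few raw token predictions copied from the unaggregated output.
raw = [
    {"entity": "B-protein", "score": 0.99683374, "start": 18, "end": 21},    # ##tic
    {"entity": "B-protein", "score": 0.9979552, "start": 21, "end": 24},     # ##oid
    {"entity": "I-protein", "score": 0.99784315, "start": 25, "end": 34},    # receptors
    {"entity": "B-cell_type", "score": 0.9930347, "start": 38, "end": 39},   # l
]
grouped = group_simple(raw)
for g in grouped:
    print(g)
```

Because every sub-token of "glucocorticoid" was predicted as `B-protein`, this rule keeps the sub-tokens apart and only merges `##oid` with `receptors`, which is why the word-level strategies are the better fit for a WordPiece tokenizer.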


In [46]:

example = "Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action."
print(f"\n{example}\n")

# Raw results from model
print("\nRaw output without any aggregation:")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)
ner_results = nlp(example)
for i in ner_results:
    print(i)

# With simple aggregation
print("\nSimple Aggregation:")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, aggregation_strategy = "simple")
ner_results = nlp(example)
for i in ner_results:
    print(i)


print("\nFirst Aggregation:")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, aggregation_strategy = "first")
ner_results = nlp(example)
for i in ner_results:
    print(i)

print("\nAverage Aggregation:")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, aggregation_strategy = "average")
ner_results = nlp(example)
for i in ner_results:
    print(i)

print("\nMax Aggregation:")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, aggregation_strategy = "max")
ner_results = nlp(example)
for i in ner_results:
    print(i)

# On older transformers versions that lack aggregation_strategy, try grouped_entities=True instead:
# nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer, grouped_entities=True)




Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action.


Raw output without any aggregation:


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'entity': 'B-protein', 'score': 0.99729985, 'index': 3, 'word': 'g', 'start': 10, 'end': 11}
{'entity': 'B-protein', 'score': 0.99739397, 'index': 4, 'word': '##lu', 'start': 11, 'end': 13}
{'entity': 'B-protein', 'score': 0.997474, 'index': 5, 'word': '##co', 'start': 13, 'end': 15}
{'entity': 'B-protein', 'score': 0.99756026, 'index': 6, 'word': '##cor', 'start': 15, 'end': 18}
{'entity': 'B-protein', 'score': 0.99683374, 'index': 7, 'word': '##tic', 'start': 18, 'end': 21}
{'entity': 'B-protein', 'score': 0.9979552, 'index': 8, 'word': '##oid', 'start': 21, 'end': 24}
{'entity': 'I-protein', 'score': 0.99784315, 'index': 9, 'word': 'receptors', 'start': 25, 'end': 34}
{'entity': 'B-cell_type', 'score': 0.9930347, 'index': 11, 'word': 'l', 'start': 38, 'end': 39}
{'entity': 'B-cell_type', 'score': 0.99458873, 'index': 12, 'word': '##ym', 'start': 39, 'end': 41}
{'entity': 'B-cell_type', 'score': 0.9944101, 'index': 13, 'word': '##ph', 'start': 41, 'end': 43}
{'entity': 'B-cell_type'

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'entity_group': 'protein', 'score': 0.99729985, 'word': 'g', 'start': 10, 'end': 11}
{'entity_group': 'protein', 'score': 0.99739397, 'word': '##lu', 'start': 11, 'end': 13}
{'entity_group': 'protein', 'score': 0.997474, 'word': '##co', 'start': 13, 'end': 15}
{'entity_group': 'protein', 'score': 0.99756026, 'word': '##cor', 'start': 15, 'end': 18}
{'entity_group': 'protein', 'score': 0.99683374, 'word': '##tic', 'start': 18, 'end': 21}
{'entity_group': 'protein', 'score': 0.9978992, 'word': '##oid receptors', 'start': 21, 'end': 34}
{'entity_group': 'cell_type', 'score': 0.9930347, 'word': 'l', 'start': 38, 'end': 39}
{'entity_group': 'cell_type', 'score': 0.99458873, 'word': '##ym', 'start': 39, 'end': 41}
{'entity_group': 'cell_type', 'score': 0.9944101, 'word': '##ph', 'start': 41, 'end': 43}
{'entity_group': 'cell_type', 'score': 0.9952114, 'word': '##ocytes', 'start': 43, 'end': 49}

First Aggregation:


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'entity_group': 'protein', 'score': 0.99757147, 'word': 'glucocorticoid receptors', 'start': 10, 'end': 34}
{'entity_group': 'cell_type', 'score': 0.9930347, 'word': 'lymphocytes', 'start': 38, 'end': 49}

Average Aggregation:


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'entity_group': 'protein', 'score': 0.9976313, 'word': 'glucocorticoid receptors', 'start': 10, 'end': 34}
{'entity_group': 'cell_type', 'score': 0.9943112, 'word': 'lymphocytes', 'start': 38, 'end': 49}

Max Aggregation:
{'entity_group': 'protein', 'score': 0.9978992, 'word': 'glucocorticoid receptors', 'start': 10, 'end': 34}
{'entity_group': 'cell_type', 'score': 0.9952114, 'word': 'lymphocytes', 'start': 38, 'end': 49}
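
The three word-level strategies return the same spans here and differ only in the score, because they reduce the sub-token scores of a word in different ways before the words are grouped. A rough sketch that recomputes the `protein` scores above from the raw token scores: `word_score` and `entity_score` are hypothetical helpers, and the sketch uses only each token's top-label score, whereas the pipeline works on full score vectors (the two should agree in this example because every sub-token of a word predicts the same label).

```python
from statistics import mean

# Top-label scores of the raw tokens, copied from the unaggregated output.
glucocorticoid = [0.99729985, 0.99739397, 0.997474, 0.99756026, 0.99683374, 0.9979552]
receptors = [0.99784315]


def word_score(token_scores, strategy):
    """Reduce the sub-token scores of one word to a single word score."""
    if strategy == "first":
        return token_scores[0]
    if strategy == "max":
        return max(token_scores)
    if strategy == "average":
        return mean(token_scores)
    raise ValueError(f"unknown strategy: {strategy}")


def entity_score(words, strategy):
    """Average the word scores of all words grouped into one entity."""
    return mean(word_score(w, strategy) for w in words)


scores = {
    strategy: entity_score([glucocorticoid, receptors], strategy)
    for strategy in ("first", "average", "max")
}
for strategy, score in scores.items():
    print(f"{strategy:8s} {score:.7f}")
```

Up to float32 rounding, the three values line up with the `protein` scores printed for the first, average and max strategies.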


In [47]:
example =  'The study demonstrated a decreased level of \
            glucocorticoid receptors (GR) in peripheral blood lymphocytes from \
            hypercholesterolemic subjects, and an elevated level in patients with \
            acute myocardial infarction.'

ner_results = nlp(example)
for i in ner_results:
    print(i)

{'entity_group': 'protein', 'score': 0.9959862, 'word': 'glucocorticoid receptors', 'start': 56, 'end': 80}
{'entity_group': 'protein', 'score': 0.9883597, 'word': 'gr', 'start': 82, 'end': 84}
{'entity_group': 'cell_type', 'score': 0.971181, 'word': 'peripheral blood lymphocytes', 'start': 89, 'end': 117}
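
The `start` and `end` values above are character offsets into the input string, and a backslash continuation inside a string literal keeps the next line's indentation as part of the text. A short sketch of the effect on a shortened version of the sentence; `clean` is an alternative spelling added for illustration (not the original cell) that uses implicit string concatenation so no indentation leaks into the text.

```python
# Backslash continuation: the 12 leading spaces of the next line are part of the string.
example = 'The study demonstrated a decreased level of \
            glucocorticoid receptors (GR) in peripheral blood lymphocytes'

# Implicit concatenation inside parentheses keeps the text free of extra spaces.
clean = (
    "The study demonstrated a decreased level of "
    "glucocorticoid receptors (GR) in peripheral blood lymphocytes"
)

print(example.find("glucocorticoid"), clean.find("glucocorticoid"))
print(repr(example[56:80]))
```

With the backslash form the entity starts at offset 56, as in the pipeline output; with the concatenated form it would start at 44, so offsets are only comparable when the input text is built the same way.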


In [48]:
# Check whether the fine-tuned model can do NER on a generic, non-biomedical example.
# It does not recognise persons, locations, etc.: it was fine-tuned only on the JNLPBA labels.
# ignore_labels=[] makes the pipeline return the "O" spans as well.

example = 'Bill lives in Canada and gave lymphocytes'

nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer,
               aggregation_strategy="first", ignore_labels=[])
ner_results = nlp(example)
for i in ner_results:
    print(i)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'entity_group': 'O', 'score': 0.9935884, 'word': 'bill lives in canada and gave', 'start': 0, 'end': 29}
{'entity_group': 'cell_type', 'score': 0.72998524, 'word': 'lymphocytes', 'start': 30, 'end': 41}
