# **Medical Entity Recognition with Pretrained Transformers**

# Project Description:
In this notebook I aim to explore the capabilities of pretrained transformer models, with a particular focus on BERT (Bidirectional Encoder Representations from Transformers), for the task of identifying medical entities within textual data.

Medical entity recognition is a critical component of various healthcare applications, including clinical decision support systems, electronic health record analysis, and biomedical research.



## Data

The [NCBI Disease](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) corpus consists of 793 PubMed abstracts annotated with 6,892 disease mentions. It serves as a resource for researchers and developers working on disease recognition and information extraction from biomedical literature.

The NCBI Disease corpus offers opportunities for training and evaluating models in the field of natural language processing and machine learning. It enables tasks such as disease recognition, named entity recognition, entity linking, and information extraction from scientific articles.

By leveraging the NCBI Disease corpus, researchers and developers can advance the state-of-the-art in biomedical text mining, contribute to the development of clinical decision support systems, and facilitate the discovery of novel insights and knowledge in the field of medicine.

![The NCBI Disease Corpus](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/annotationprocess.png)


In [1]:
from datasets import load_dataset
dataset = load_dataset('ncbi_disease')



The NCBI Disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.


### Corpus Characteristics

- 793 PubMed abstracts
- 6,892 disease mentions
- 790 unique disease concepts, drawn from:
  - Medical Subject Headings (MeSH®)
  - Online Mendelian Inheritance in Man (OMIM®)
- 91% of the mentions map to a single disease concept
- Divided into training, development, and test sets

### Corpus Annotation

- Fourteen annotators
- Two annotators per document (randomly paired)
- Three annotation phases
- Checked for corpus-wide consistency of annotations

The abstracts are split into sentences, which have already been tokenized for us. There are 5,433 sentences in the training data, 924 in the validation data, and 941 in the test data.

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})

In [19]:
splits = list(dataset.keys())
num_examples = [dataset[s].num_rows for s in splits]

for split, num in zip(splits, num_examples):
    print(f"{split}: {num} examples")

total_examples = sum(num_examples)
print("Total examples:", total_examples)


train: 5433 examples
validation: 924 examples
test: 941 examples
Total examples: 7298


The first training example is the sentence 'Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor.' The phrase 'adenomatous polyposis coli tumour' has been labeled as a disease.

In [3]:
dataset["train"][0] #first example

{'id': '0',
 'tokens': ['Identification',
  'of',
  'APC2',
  ',',
  'a',
  'homologue',
  'of',
  'the',
  'adenomatous',
  'polyposis',
  'coli',
  'tumour',
  'suppressor',
  '.'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]}
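
To make the link between `ner_tags` and the labeled phrase concrete, here is a minimal sketch that decodes the tags of this first example into mention strings. The index-to-label mapping (`0` = `O`, `1` = `B-disease`, `2` = `I-disease`) and the helper name `extract_mentions` are assumptions for illustration; the mapping is consistent with the example above but should be checked against the dataset's features.

```python
# Assumed mapping from tag index to BIO label (not read from the dataset here).
ID2LABEL = {0: "O", 1: "B-disease", 2: "I-disease"}

def extract_mentions(tokens, tags):
    """Collect the token spans that the BIO tags mark as disease mentions."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        label = ID2LABEL[tag]
        if label.startswith("B-"):
            # a B- tag always opens a new mention
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label.startswith("I-") and current:
            # an I- tag extends the mention that is currently open
            current.append(token)
        else:
            # O (or a stray I- with no open mention) closes any open mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["Identification", "of", "APC2", ",", "a", "homologue", "of", "the",
          "adenomatous", "polyposis", "coli", "tumour", "suppressor", "."]
tags = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]

print(extract_mentions(tokens, tags))  # ['adenomatous polyposis coli tumour']
```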

## Preprocessing

I choose one of the available PubMedBERT models: BERT models pretrained on PubMed abstracts and, in this case, also on full-text articles. I start by loading the tokenizer that was used to pretrain this model, because the texts must be tokenized in exactly the same way as during pretraining.

In [4]:
from transformers import AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(MODEL)


Next I use this tokenizer to tokenize the texts. Every sentence in the corpus is already a list of words, so I tell the tokenizer that the text has been split into words (`is_split_into_words=True`). I also pad and truncate the texts: sentences longer than 256 tokens are truncated, and all sentences in a split are padded to the length of the longest remaining one.

In [6]:
train_texts = [item["tokens"] for item in dataset["train"]]
dev_texts = [item["tokens"] for item in dataset["validation"]]
test_texts = [item["tokens"] for item in dataset["test"]]

train_texts_encoded = tokenizer(train_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)
dev_texts_encoded = tokenizer(dev_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)
test_texts_encoded = tokenizer(test_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)
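
The effect of `padding=True` together with `truncation=True, max_length=...` can be sketched in plain Python. This is only an illustration on made-up token ids, not the tokenizer's implementation: the function name `pad_and_truncate`, the `pad_id` value, and the toy ids are assumptions, and the naive slice below does not keep the closing `[SEP]` token the way the real tokenizer does.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad all lists to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should not attend to
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# hypothetical token ids for three sentences of different lengths
toy_batch = [[2, 11, 12, 3], [2, 21, 22, 23, 24, 25, 3], [2, 31, 3]]
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=6)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every row ends up with the same length (here 6), which is what lets the sentences be stacked into a single tensor.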

This gives three batches of `Encoding`s, one per split. Each `Encoding` contains the information the model needs: the ids of the tokens, their type ids, and their attention mask.

In [7]:
train_texts_encoded[0]

Encoding(num_tokens=138, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [20]:
train_texts_encoded[0].tokens[:20]

['[CLS]',
 'identification',
 'of',
 'apc',
 '##2',
 ',',
 'a',
 'homologue',
 'of',
 'the',
 'adenomatous',
 'polyposis',
 'coli',
 'tumour',
 'suppressor',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

To keep the vocabulary size manageable, words that are not in the vocabulary are split into known subword parts. For example, `apc2` is split into `apc` and `##2`, where `##` marks a continuation of the previous token.
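
The splitting behaviour can be illustrated with a greedy longest-match-first sketch in the style of WordPiece. This is a simplified stand-in, not the tokenizer's actual code: the function name `wordpiece` and the four-entry `toy_vocab` are assumptions made up for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # no known piece fits at this position: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"apc", "##2", "of", "the"}  # hypothetical miniature vocabulary
print(wordpiece("apc2", toy_vocab))  # ['apc', '##2']
print(wordpiece("of", toy_vocab))    # ['of']
```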

## Preprocessing of labels

Because the new tokens differ from the original tokens in the corpus, I can't just train the model on the original labels: I need to align the labels with the new tokens. Luckily the tokenizer also provides a list of offsets for every new token, from which I can easily identify the tokens that do not start an original word.

For example, the offsets of the first training sentence tell us that `apc2` has been split up into two tokens, one for the first three characters of the word (indices 0 to, but not including, 3) and one for the last character of the word (indices 3 to, but not including, 4). 


I can also identify special tokens, such as `[CLS]` and `[PAD]`, by their offset pair `(0, 0)`.

In [9]:
train_texts_encoded[0].offsets[:20]

[(0, 0),
 (0, 14),
 (0, 2),
 (0, 3),
 (3, 4),
 (0, 1),
 (0, 1),
 (0, 9),
 (0, 2),
 (0, 3),
 (0, 11),
 (0, 9),
 (0, 4),
 (0, 6),
 (0, 10),
 (0, 1),
 (0, 0),
 (0, 0),
 (0, 0),
 (0, 0)]
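
The two rules just described (special tokens have offsets `(0, 0)`, and continuation tokens start at a non-zero position within their word) can be written down as a small classifier. A minimal sketch on the first few offsets shown above; the function name `token_kinds` and the three category names are made up for illustration.

```python
def token_kinds(offsets):
    """Classify each token by its (start, end) offset within its original word."""
    kinds = []
    for start, end in offsets:
        if start == 0 and end == 0:
            kinds.append("special")        # [CLS], [SEP], [PAD]
        elif start == 0:
            kinds.append("word_start")     # first subword of a word
        else:
            kinds.append("continuation")   # later subword, such as ##2
    return kinds

# offsets for: [CLS] identification of apc ##2 , ... [SEP]
offsets = [(0, 0), (0, 14), (0, 2), (0, 3), (3, 4), (0, 1), (0, 0)]
print(token_kinds(offsets))
```

Only the `word_start` positions correspond one-to-one with the original words, so these are the positions that can receive the original labels.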

There are only three labels in the corpus — `O`, `B-disease` and `I-disease` — which have already been mapped to their index.

In [11]:
all_labels = list(set([label for item in dataset["train"] for label in item["ner_tags"]]))
all_labels

[0, 1, 2]

For each sentence, I first create a numpy array filled with the label `-100`, a special label in the `transformers` library that will be ignored during training. Then I copy the original labels to the tokens at the start of every word. This means the remaining tokens of the word will still have the label `-100`. 

In [21]:
import numpy as np

def map_entities_to_tokens(items, encodings):
    """Align word-level NER tags with subword tokens; unlabeled positions get -100."""
    labels = [item["ner_tags"] for item in items]
    offsets = [encoding.offsets for encoding in encodings]
    encoded_labels = []

    for doc_labels, doc_offset in zip(labels, offsets):
        # start with -100 everywhere, so special tokens and word continuations are ignored
        doc_enc_labels = np.full(len(doc_offset), -100, dtype=int)
        arr_offset = np.array(doc_offset)

        # the first token of a word has start offset 0 and a non-zero end offset
        word_starts = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        # slice the labels in case truncation dropped words at the end of the sentence
        doc_enc_labels[word_starts] = doc_labels[:word_starts.sum()]
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels

train_labels = map_entities_to_tokens(dataset["train"], train_texts_encoded.encodings)
dev_labels = map_entities_to_tokens(dataset["validation"], dev_texts_encoded.encodings)
test_labels = map_entities_to_tokens(dataset["test"], test_texts_encoded.encodings)

In [22]:
list(zip(train_texts_encoded[0].tokens[:20], train_labels[0][:20]))

[('[CLS]', -100),
 ('identification', 0),
 ('of', 0),
 ('apc', 0),
 ('##2', -100),
 (',', 0),
 ('a', 0),
 ('homologue', 0),
 ('of', 0),
 ('the', 0),
 ('adenomatous', 1),
 ('polyposis', 2),
 ('coli', 2),
 ('tumour', 2),
 ('suppressor', 0),
 ('.', 0),
 ('[SEP]', -100),
 ('[PAD]', -100),
 ('[PAD]', -100),
 ('[PAD]', -100)]
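
The same `-100` mask is useful in the other direction: keeping only the positions whose aligned label is not `-100` turns token-level predictions back into one prediction per original word. A minimal sketch, assuming a hypothetical list of predicted label ids (`token_preds`) for the first 20 tokens shown above; the function name is also made up.

```python
def word_level_predictions(token_preds, aligned_labels):
    """Keep the prediction at each position that carries a real (non -100) label."""
    return [pred for pred, label in zip(token_preds, aligned_labels) if label != -100]

# aligned labels for the first 20 tokens of the first training sentence
aligned_labels = [-100, 0, 0, 0, -100, 0, 0, 0, 0, 0,
                  1, 2, 2, 2, 0, 0, -100, -100, -100, -100]
# hypothetical model output: one predicted label id per token, including padding
token_preds = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               1, 2, 2, 2, 0, 0, 0, 0, 0, 0]

word_preds = word_level_predictions(token_preds, aligned_labels)
print(len(word_preds), word_preds)
```

The result has 14 entries, one per word of the original sentence, so it can be compared directly with the original `ner_tags`.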

## Setting up the dataset

`NERDataset` returns, for every item, all the information in the encodings as a dictionary, and adds an extra key with the labels.

Because it subclasses `torch.utils.data.Dataset`, it works with the data-loading utilities in `torch.utils.data` and with the `Trainer`, so a model can be trained and evaluated on the NER data efficiently.


In [23]:
import torch

class NERDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # get item by index
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    # get len of elements
    def __len__(self):
        return len(self.labels)
    

train_dataset = NERDataset(train_texts_encoded, train_labels)
dev_dataset = NERDataset(dev_texts_encoded, dev_labels)
test_dataset = NERDataset(test_texts_encoded, test_labels)


# Evaluation of the results:
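I compute accuracy over all labels except `-100`. In named entity recognition, accuracy tends to be very high, because most tokens are not part of an entity mention. I therefore also compute precision, recall and F-score on the subset of tokens where either the gold label or the predicted label is an entity label (that is, not `0`). Note that with `average='micro'` over all three classes, precision, recall and F-score are equal by construction: each is the share of tokens in this subset that were predicted correctly. The score is still a much better measure of the model's success at identifying entities than plain accuracy.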

In [25]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    flat_labels, flat_preds = [], []
    flat_ent_labels, flat_ent_preds = [], []
    
    for label_row, pred_row in zip(labels, preds):
        for label, pred_label in zip(label_row, pred_row):
            # ignore -100 labels
            if label != -100:
                flat_labels.append(label)
                flat_preds.append(pred_label)
                if label != 0 or pred_label != 0:
                    flat_ent_labels.append(label)
                    flat_ent_preds.append(pred_label)
                    
        
    precision, recall, f1, _ = precision_recall_fscore_support(flat_ent_labels, flat_ent_preds, average='micro')
    acc = accuracy_score(flat_labels, flat_preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
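
If separate precision and recall values are wanted, one option is to micro-average over the entity classes only. The sketch below is a pure-Python illustration of that alternative on made-up labels, not the metric used in this notebook; the function names and the toy `gold` and `pred` lists are assumptions. It also computes the subset accuracy that the notebook's micro-average reduces to, for comparison.

```python
def entity_prf(gold, pred, entity_labels=(1, 2)):
    """Token-level precision, recall and F1, micro-averaged over entity classes only."""
    tp = sum(1 for g, p in zip(gold, pred) if p in entity_labels and g == p)
    n_pred = sum(1 for p in pred if p in entity_labels)  # true + false positives
    n_gold = sum(1 for g in gold if g in entity_labels)  # true positives + false negatives
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def subset_accuracy(gold, pred):
    """Share of correct tokens among those where gold or prediction is an entity label."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g != 0 or p != 0]
    return sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0

# hypothetical labels: the model misses one I-disease token and predicts no false positives
gold = [0, 1, 2, 2, 0, 0, 1, 0]
pred = [0, 1, 2, 0, 0, 0, 1, 0]

precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, round(f1, 4))
print(subset_accuracy(gold, pred))
```

On this toy input precision is 1.0 and recall is 0.75, while the subset accuracy is a single number (0.75) that cannot tell the two apart.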

## Training the model

This is easy to do with the `Trainer` class of the `transformers` package.

In [26]:
import torch

if torch.cuda.is_available():
    print("CUDA is available.")
else:
    print("CUDA is not available.")


CUDA is available.


In [28]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [30]:
from transformers import Trainer, TrainingArguments, AutoModelForTokenClassification, BertForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(all_labels))
model.to(device)

training_args = TrainingArguments(
    output_dir='./results_NER',          
    num_train_epochs=4,              
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=int(len(train_dataset)/8),  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,
    save_total_limit=10,
    load_best_model_at_end=True,
    no_cuda=False
)

trainer = Trainer(
    model=model,                        
    args=training_args,                 
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,         
    eval_dataset=dev_dataset,            
)

trainer.train()


Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForToken


| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|------|---------------|-----------------|----------|----|-----------|--------|
| 200 | No log | 0.066694 | 0.97843 | 0.760093 | 0.760093 | 0.760093 |
| 400 | No log | 0.046011 | 0.985523 | 0.832529 | 0.832529 | 0.832529 |
| 600 | 0.175600 | 0.062417 | 0.983479 | 0.796191 | 0.796191 | 0.796191 |
| 800 | 0.175600 | 0.045596 | 0.986524 | 0.837607 | 0.837607 | 0.837607 |
| 1000 | 0.026600 | 0.041772 | 0.987734 | 0.853073 | 0.853073 | 0.853073 |
| 1200 | 0.026600 | 0.05261 | 0.987317 | 0.84943 | 0.84943 | 0.84943 |
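
Since `load_best_model_at_end=True` is set without a `metric_for_best_model`, my understanding is that the `Trainer` compares checkpoints by validation loss. Under that assumption, the checkpoint that is reloaded can be read off the log above with a few lines of Python (the variable names are made up for illustration):

```python
# (step, validation loss) pairs copied from the training log
eval_log = [
    (200, 0.066694),
    (400, 0.046011),
    (600, 0.062417),
    (800, 0.045596),
    (1000, 0.041772),
    (1200, 0.05261),
]

# the best checkpoint is the one with the lowest validation loss
best_step, best_loss = min(eval_log, key=lambda row: row[1])
print(best_step, best_loss)  # 1000 0.041772
```

Step 1000 also has the highest accuracy and F1 in the log, so the choice would be the same under those metrics.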


TrainOutput(global_step=1360, training_loss=0.07668719852671904, metrics={'train_runtime': 518.1322, 'train_samples_per_second': 41.943, 'train_steps_per_second': 2.625, 'total_flos': 1530547341755472.0, 'train_loss': 0.07668719852671904, 'epoch': 4.0})
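
The training arguments set `warmup_steps=int(len(train_dataset)/8)`, which is 679 for 5,433 training sentences, while the run finished after 1,360 steps. The sketch below shows what a linear warmup followed by a linear decay looks like with these numbers. It assumes the library defaults that the notebook does not override (a linear scheduler and a peak learning rate of 5e-5); the function name `linear_warmup_lr` is made up for illustration.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # then decay linearly from base_lr to 0 at the last step
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5          # assumed default, not set explicitly in the notebook
WARMUP = int(5433 / 8)  # same expression as in the training arguments
TOTAL = 1360            # global_step reported by the training output

for step in (0, 200, WARMUP, 1000, TOTAL):
    print(step, linear_warmup_lr(step, BASE_LR, WARMUP, TOTAL))
```

Under these assumptions the learning rate peaks at step 679, so roughly half of the run is spent warming up.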

## Evaluating the results

Finally, I evaluate the model on the test dataset.

In [31]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.04833297058939934,
 'eval_accuracy': 0.9847328244274809,
 'eval_f1': 0.8363954505686789,
 'eval_precision': 0.8363954505686789,
 'eval_recall': 0.8363954505686789,
 'eval_runtime': 6.0943,
 'eval_samples_per_second': 154.406,
 'eval_steps_per_second': 9.681,
 'epoch': 4.0}

Based on these metrics, the fine-tuned PubMedBERT model performs well on the NCBI Disease corpus. The key observations:

- Evaluation loss: 0.048, the average loss of the model's predictions on the test set. Lower is better.

- Evaluation accuracy: 0.985, the proportion of labeled tokens (the first subword of each word) that were predicted correctly. Because most tokens carry the label `O`, this figure is dominated by non-entity tokens.

- Evaluation F1, precision and recall: all 0.836. The three values coincide because `compute_metrics` micro-averages over all classes, so each of them is the share of correctly predicted tokens among those where the gold or the predicted label is an entity label. They therefore do not distinguish false positives from false negatives.

- Evaluation samples per second: 154.406 samples were processed per second during evaluation.

In short, the model labels about 84% of the entity-related tokens in the test set correctly, which suggests it has learned to identify disease mentions effectively.

For inference, I load the model and combine it with the tokenizer in an `ner` pipeline, which makes it easy to label new texts and inspect the results.

In [41]:
from transformers import pipeline

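# save the fine-tuned model to the output directory, then reload it from there
trainer.save_model("./results_NER")
model = AutoModelForTokenClassification.from_pretrained("./results_NER")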
nlp = pipeline("ner", tokenizer=tokenizer, model=model)


In [19]:
print(dataset["test"][1])
nlp(dataset["test"][1]["tokens"])

{'id': '1', 'ner_tags': [1, 2, 2, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Ataxia', '-', 'telangiectasia', '(', 'A', '-', 'T', ')', 'is', 'a', 'recessive', 'multi', '-', 'system', 'disorder', 'caused', 'by', 'mutations', 'in', 'the', 'ATM', 'gene', 'at', '11q22', '-', 'q23', '(', 'ref', '.', '3', ')', '.']}


[[{'word': 'ataxia',
   'score': 0.9266947507858276,
   'entity': 'LABEL_1',
   'index': 1,
   'start': 0,
   'end': 6}],
 [{'word': '-',
   'score': 0.9998806715011597,
   'entity': 'LABEL_0',
   'index': 1,
   'start': 0,
   'end': 1}],
 [{'word': 'telangiect',
   'score': 0.9920430779457092,
   'entity': 'LABEL_1',
   'index': 1,
   'start': 0,
   'end': 10},
  {'word': '##asia',
   'score': 0.9963495135307312,
   'entity': 'LABEL_2',
   'index': 2,
   'start': 10,
   'end': 14}],
 [{'word': '(',
   'score': 0.9998924732208252,
   'entity': 'LABEL_0',
   'index': 1,
   'start': 0,
   'end': 1}],
 [{'word': 'a',
   'score': 0.9997237324714661,
   'entity': 'LABEL_0',
   'index': 1,
   'start': 0,
   'end': 1}],
 [{'word': '-',
   'score': 0.9998806715011597,
   'entity': 'LABEL_0',
   'index': 1,
   'start': 0,
   'end': 1}],
 [{'word': 't',
   'score': 0.9996972680091858,
   'entity': 'LABEL_0',
   'index': 1,
   'start': 0,
   'end': 1}],
 [{'word': ')',
   'score': 0.9998168945312
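
Because the pipeline output is nested per input token and split into subwords, it can be easier to read after the WordPiece fragments are joined back into words. A minimal sketch on the first three entries of the output above, keeping only the `word` and `entity` fields; the function name `merge_subwords` and the choice to keep the label of the first fragment are assumptions for illustration.

```python
def merge_subwords(pieces):
    """Join WordPiece fragments and keep the label predicted for the first one."""
    word = "".join(
        piece["word"][2:] if piece["word"].startswith("##") else piece["word"]
        for piece in pieces
    )
    return {"word": word, "entity": pieces[0]["entity"]}

# first three entries of the pipeline output, reduced to word and entity
pipeline_output = [
    [{"word": "ataxia", "entity": "LABEL_1"}],
    [{"word": "-", "entity": "LABEL_0"}],
    [{"word": "telangiect", "entity": "LABEL_1"},
     {"word": "##asia", "entity": "LABEL_2"}],
]

merged = [merge_subwords(pieces) for pieces in pipeline_output]
for item in merged:
    print(item["word"], item["entity"])
```

Taking the label of the first fragment mirrors the training setup, where only the first subword of each word carries a label.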

# Conclusion:

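In the visible part of the output, the model labels `ataxia` and `telangiect` as `LABEL_1` (`B-disease`) and `##asia` as `LABEL_2` (`I-disease`), while the hyphens, the opening parenthesis and the abbreviation parts `a` and `t` receive `LABEL_0` (`O`). The pipeline was given a list of tokens, and the nested output suggests that it labeled each token as a separate input without its sentence context. This would explain why `telangiectasia` starts a new mention here and why `A - T` is not recognized, although the gold annotation marks both `Ataxia - telangiectasia` and `A - T` as single mentions (tags `1, 2, 2`). Even without context, the model assigns disease labels to the disease terms `ataxia` and `telangiectasia`.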