
## 1. Data preparation

In [7]:
# https://figshare.com/articles/dataset/MACCROBAT2018/9764942
# https://brat.nlplab.org/standoff.html
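
The `.ann` files follow the BRAT standoff format linked above: each text-bound annotation is a tab-separated line holding an ID that starts with `T`, a `TYPE START END` field, and the covered text, where `END` is the offset of the first character after the span. The sketch below parses such lines. The function name `parse_text_bound` and the sample text and annotations are made up for illustration and are not taken from MACCROBAT2018.

```python
from typing import Dict, List


def parse_text_bound(ann_text: str) -> List[Dict[str, object]]:
    """Parse contiguous text-bound ("T") annotations from BRAT standoff text."""
    tags = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip events, relations, attributes, notes
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line
        ann_id, type_and_span, covered = parts[0], parts[1], parts[2]
        fields = type_and_span.split(" ")
        if len(fields) != 3 or ";" in type_and_span:
            continue  # discontinuous span such as "0 4;6 7"
        label, start, end = fields
        tags.append({
            "id": ann_id,
            "label": label,
            "start": int(start),
            "end": int(end),  # exclusive end offset
            "text": covered,
        })
    return tags


# Invented example document and annotations
sample_text = "CASE: A 28-year-old man presented with fever."
sample_ann = (
    "T1\tAge 8 19\t28-year-old\n"
    "T2\tSex 20 23\tman\n"
    "T3\tSign_symptom 39 44\tfever\n"
    "E1\tClinical_event:T4\n"
    "T5\tDisease_disorder 0 4;6 7\tCASE A\n"
)

tags = parse_text_bound(sample_ann)
for tag in tags:
    # Because END is exclusive, a plain slice recovers the covered text
    print(tag["label"], repr(sample_text[tag["start"]:tag["end"]]))
```

Skipping discontinuous spans mirrors what the notebook's own parser ends up doing, since `int("4;6")` raises an error there.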

In [8]:
!mkdir -p MACCROBAT2018


In [9]:
!unzip /content/MACCROBAT2018.zip -d ./MACCROBAT2018

Archive:  /content/MACCROBAT2018.zip
  inflating: ./MACCROBAT2018/15939911.ann  
  inflating: ./MACCROBAT2018/15939911.txt  
  inflating: ./MACCROBAT2018/16778410.ann  
  inflating: ./MACCROBAT2018/16778410.txt  
  inflating: ./MACCROBAT2018/17803823.ann  
  inflating: ./MACCROBAT2018/17803823.txt  
  inflating: ./MACCROBAT2018/18236639.ann  
  inflating: ./MACCROBAT2018/18236639.txt  
  inflating: ./MACCROBAT2018/18258107.ann  
  inflating: ./MACCROBAT2018/18258107.txt  
  inflating: ./MACCROBAT2018/18416479.ann  
  inflating: ./MACCROBAT2018/18416479.txt  
  inflating: ./MACCROBAT2018/18561524.ann  
  inflating: ./MACCROBAT2018/18561524.txt  
  inflating: ./MACCROBAT2018/18666334.ann  
  inflating: ./MACCROBAT2018/18666334.txt  
  inflating: ./MACCROBAT2018/18787726.ann  
  inflating: ./MACCROBAT2018/18787726.txt  
  inflating: ./MACCROBAT2018/18815636.ann  
  inflating: ./MACCROBAT2018/18815636.txt  
  inflating: ./MACCROBAT2018/19009665.ann  
  inflating: ./MACCROBAT2018/19009665.t

In [10]:
!pip -q install evaluate accelerate -U

In [11]:
ls

[0m[01;34mMACCROBAT2018[0m/  MACCROBAT2018.zip  [01;34msample_data[0m/


In [12]:
import os
from typing import List, Dict, Tuple

dataset_folder = "./MACCROBAT2018"
file_ids = [f.split(".")[0] for f in os.listdir(dataset_folder) if f.endswith('.txt')]

# Create .txt file name for input
text_files = [f+".txt" for f in file_ids]
# Create .ann file name for label
anno_files = [f+".ann" for f in file_ids]

num_samples = len(file_ids)
texts: List[str] = []
for i in range(num_samples):
    file_path = os.path.join(dataset_folder, text_files[i])
    with open(file_path, 'r', encoding='utf-8') as f:
        texts.append(f.read())

print(texts[3])
print(len(texts))

A 19-year-old man presented at the emergency department, 12 h after insertion of a high pressure container with tanning spray into his rectum.
A plain abdominal radiograph (Figure 1) showed the container in the rectosigmoid region.
There were no signs of perforation.
A flexible sigmoidoscopy was performed under conscious sedation.
The object was located just above the rectosigmoid junction.
The container could not be extracted by bimanual manipulation.
An attempt to remove the object with conventional endoscopic instruments, such as polypectomy snares, was unsuccessful.
The sigmoidoscope could be passed alongside the foreign body to its proximal end.
A guide wire was left behind with the sigmoidoscope removed.
Subsequently, a 40 mm pneumatic dilatation balloon (Rigiflex®, Boston Scientific), normally used in achalasia patients, was inserted over the guide wire and inflated just above the container (Figure 2).
For safety purposes, the sigmoidoscope was reintroduced alongside the cathete

In [13]:
import os
from typing import List, Dict, Tuple

class Preprocessing_Maccrobat:
    def __init__(self, dataset_folder, tokenizer):
        # Collect the base names of all .txt files in dataset_folder;
        # f.split(".")[0] drops the extension
        self.file_ids = [f.split(".")[0] for f in os.listdir(dataset_folder) if f.endswith('.txt')]

        # Create .txt file name for input
        self.text_files = [f+".txt" for f in self.file_ids]
        # Create .ann file name for label
        self.anno_files = [f+".ann" for f in self.file_ids]

        # Number of documents
        self.num_samples = len(self.file_ids)

        # Read every text file into self.texts (one string per document)
        self.texts: List[str] = []
        for i in range(self.num_samples):
            file_path = os.path.join(dataset_folder, self.text_files[i])
            with open(file_path, 'r') as f:
                self.texts.append(f.read())

        # Read every annotation file into self.tags (one list of tag dicts per document)
        self.tags: List[List[Dict[str, str]]] = []
        for i in range(self.num_samples):
            file_path = os.path.join(dataset_folder, self.anno_files[i])
            with open(file_path, 'r') as f:
                # Keep only text-bound annotations (IDs starting with "T") and split
                # each line on tabs: ID, "TYPE START END", covered text
                text_bound_ann = [t.split("\t") for t in f.read().split("\n") if t.startswith("T")]
                # Tags collected for this document
                text_bound_lst = []
                for text_b in text_bound_ann:
                    # Split "TYPE START END" into its parts
                    label = text_b[1].split(" ")
                    try:
                        # Keep only annotations whose offsets are plain integers;
                        # discontinuous spans such as "10 15;20 25" fail here and are skipped
                        _ = int(label[1])
                        _ = int(label[2])
                    except (ValueError, IndexError):
                        continue
                    # Store the covered text, the entity type, and the start/end character offsets
                    tag = {
                        "text": text_b[-1],
                        "label": label[0],
                        "start": label[1],
                        "end": label[2]
                    }
                    # For example: 28-year-old, Age, 8, 19
                    text_bound_lst.append(tag)

                # Append this document's tags
                self.tags.append(text_bound_lst)
        # Setting tokenizer
        self.tokenizer = tokenizer

    # Convert every document into two parallel lists: tokens and their BIO labels
    def process(self) -> Tuple[List[List[str]], List[List[str]]]:
        input_texts = []
        input_labels = []

        # Iterate over the documents, fetching each full text and its tags
        for idx in range(self.num_samples):
            full_text = self.texts[idx]
            tags = self.tags[idx]

            # Character offsets covered by each tag, e.g. [[4, 5, 6]]
            label_offset = []
            # The same offsets flattened into one list, e.g. [4, 5, 6]
            continuous_label_offset = []
            for tag in tags:
                # BRAT end offsets are exclusive, so end + 1 also covers the
                # character right after the span (usually a space)
                offset = list(range(int(tag["start"]), int(tag["end"])+1))
                label_offset.append(offset)
                continuous_label_offset.extend(offset)

            # All character offsets of the document
            all_offset = list(range(len(full_text)))
            # Offsets not covered by any tag; a set keeps the membership test fast
            covered = set(continuous_label_offset)
            zero_offset = [offset for offset in all_offset if offset not in covered]
            # Group the uncovered offsets into runs of consecutive values
            zero_offset = Preprocessing_Maccrobat.find_continuous_ranges(zero_offset)

            self.tokens = []
            self.labels = []
            self._merge_offset(full_text, tags, zero_offset, label_offset)
            assert len(self.tokens) == len(self.labels), "Lengths of tokens and labels are not equal"

            input_texts.append(self.tokens)
            input_labels.append(self.labels)

        return input_texts, input_labels

    # Walk the unlabelled and labelled ranges in text order and tokenize each one
    def _merge_offset(self, full_text, tags, zero_offset, label_offset):
        # zero_offset = [[0,1,2,3],[7,8]]
        # label_offset = [[4,5,6]]

        i = j = 0
        while i < len(zero_offset) and j < len(label_offset):
            if zero_offset[i][0] < label_offset[j][0]:
                self._add_zero(full_text, zero_offset, i)
                i += 1
            else:
                self._add_label(full_text, label_offset, j, tags)
                j += 1

        while i < len(zero_offset):
            self._add_zero(full_text, zero_offset, i)
            i += 1

        while j < len(label_offset):
            self._add_label(full_text, label_offset, j, tags)
            j += 1

    # Tokenize an unlabelled span and label every token "O"
    def _add_zero(self, full_text, offset, index):
        start, *_, end = offset[index] if len(offset[index]) > 1 else (offset[index][0], offset[index][0]+1)
        text = full_text[start:end]
        text_tokens = self.tokenizer.tokenize(text)

        self.tokens.extend(text_tokens)
        self.labels.extend(
            ["O"]*len(text_tokens)
        )

    # Tokenize a labelled span: the first token gets B-<label>, the rest I-<label>
    def _add_label(self, full_text, offset, index, tags):
        start, *_, end = offset[index] if len(offset[index]) > 1 else (offset[index][0], offset[index][0]+1)
        text = full_text[start:end]
        text_tokens = self.tokenizer.tokenize(text)

        self.tokens.extend(text_tokens)
        self.labels.extend(
            [f"B-{tags[index]['label']}"] + [f"I-{tags[index]['label']}"]*(len(text_tokens)-1)
        )

    # Build a label -> id dictionary from a list of label sequences
    @staticmethod
    def build_label2id(tokens: List[List[str]]):
        # Start with an empty dictionary
        label2id = {}
        # Next unused id
        id_counter = 0
        # Flatten the list of lists and visit every label in order
        for token in [token for sublist in tokens for token in sublist]:
            # Assign the next id to each label seen for the first time
            if token not in label2id:
                label2id[token] = id_counter
                id_counter += 1
        return label2id

    @staticmethod
    def find_continuous_ranges(data: List[int]):
        if not data:
            return []
        ranges = []
        start = data[0]
        prev = data[0]
        for number in data[1:]:
            if number != prev + 1:
                ranges.append(list(range(start, prev + 1)))
                start = number
            prev = number
        ranges.append(list(range(start, prev + 1)))
        return ranges
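
To make the offset-to-BIO conversion above easier to follow, here is a compact sketch of the same idea on a toy sentence. It is not the class's implementation: the helper name `spans_to_bio` is hypothetical, `str.split` stands in for the WordPiece tokenizer, span ends are treated as exclusive, and the spans are assumed not to overlap.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> Tuple[List[str], List[str]]:
    """Turn (start, end, label) character spans into tokens and BIO labels.

    Assumes exclusive end offsets and non-overlapping spans.
    """
    tokens: List[str] = []
    labels: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Unlabelled text between the previous span and this one
        gap = text[cursor:start].split()
        tokens += gap
        labels += ["O"] * len(gap)
        # The entity itself: first token is B-, the remaining ones are I-
        pieces = text[start:end].split()
        if pieces:
            tokens += pieces
            labels += [f"B-{label}"] + [f"I-{label}"] * (len(pieces) - 1)
        cursor = end
    # Whatever follows the last span
    tail = text[cursor:].split()
    tokens += tail
    labels += ["O"] * len(tail)
    return tokens, labels


# Invented example sentence and spans
toy_text = "A 28-year-old man had high fever."
toy_spans = [(22, 32, "Sign_symptom"), (2, 13, "Age"), (14, 17, "Sex")]

toy_tokens, toy_labels = spans_to_bio(toy_text, toy_spans)
for token, label in zip(toy_tokens, toy_labels):
    print(f"{token}\t{label}")
```

Sorting the spans by start offset before walking the text keeps the token order stable even when the annotation file lists entities out of order.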



In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")

dataset_folder = "./MACCROBAT2018"

Maccrobat_builder = Preprocessing_Maccrobat(dataset_folder, tokenizer)
input_texts, input_labels = Maccrobat_builder.process()

label2id = Preprocessing_Maccrobat.build_label2id(input_labels)
id2label = {v: k for k, v in label2id.items()}


## 2. Dataset loader

In [15]:
from sklearn.model_selection import train_test_split


inputs_train, inputs_val, labels_train, labels_val = train_test_split(
    input_texts,
    input_labels,
    test_size=0.2,
    random_state=42
)

In [16]:
import torch
from torch.utils.data import Dataset

MAX_LEN = 512

class NER_Dataset(Dataset):
    def __init__(self, input_texts, input_labels, tokenizer, label2id, max_len=MAX_LEN):
        super().__init__()
        self.tokens = input_texts
        self.labels = input_labels
        self.tokenizer = tokenizer
        self.label2id = label2id
        self.max_len = max_len

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        input_token = self.tokens[idx]
        label_token = [self.label2id[label] for label in self.labels[idx]]

        input_token = self.tokenizer.convert_tokens_to_ids(input_token)
        attention_mask = [1] * len(input_token)

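        input_ids = self.pad_and_truncate(input_token, pad_id=self.tokenizer.pad_token_id)
        # Labels are padded with id 0, which is also a real label id
        # (the first label seen by build_label2id)
        labels = self.pad_and_truncate(label_token, pad_id=0)
        attention_mask = self.pad_and_truncate(attention_mask, pad_id=0)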

        return {
            "input_ids": torch.as_tensor(input_ids),
            "labels": torch.as_tensor(labels),
            "attention_mask": torch.as_tensor(attention_mask)
            }

    def pad_and_truncate(self, inputs: List[int], pad_id: int):
        if len(inputs) < self.max_len:
            padded_inputs = inputs + [pad_id] * (self.max_len - len(inputs))
        else:
            padded_inputs = inputs[:self.max_len]
        return padded_inputs

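    # Renamed from label2id: the instance attribute self.label2id (the dict set in
    # __init__) shadowed a method of the same name, making it uncallable
    def encode_labels(self, labels: List[str]):
        return [self.label2id[label] for label in labels]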
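
The dataset above pads labels with id 0, which is also a real class id. A common alternative is to pad labels with -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so padded positions contribute nothing to the loss. The sketch below shows only the padding step with plain lists; `pad_or_truncate` and `IGNORE_INDEX` are illustrative names and the ids are arbitrary. Adopting this would also require the metric code to mask on -100 instead of 0.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_or_truncate(values: List[int], max_len: int, pad_value: int) -> List[int]:
    """Right-pad with pad_value, or cut off, so the result has exactly max_len items."""
    padding = [pad_value] * max(0, max_len - len(values))
    return (values + padding)[:max_len]


# Arbitrary example ids for one short sequence
example_input_ids = [11, 22, 33]
example_label_ids = [0, 5, 0]

padded_inputs = pad_or_truncate(example_input_ids, max_len=5, pad_value=0)
padded_labels = pad_or_truncate(example_label_ids, max_len=5, pad_value=IGNORE_INDEX)
padded_mask = pad_or_truncate([1] * len(example_input_ids), max_len=5, pad_value=0)

print(padded_inputs)
print(padded_labels)
print(padded_mask)
```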

In [17]:
train_set = NER_Dataset(inputs_train, labels_train, tokenizer, label2id)
val_set = NER_Dataset(inputs_val, labels_val, tokenizer, label2id)

## 3. Model

In [18]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
    "d4data/biomedical-ner-all",
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True
)
model




Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at d4data/biomedical-ner-all and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([84]) in the checkpoint and torch.Size([83]) in the model instantiated
- classifier.weight: found shape torch.Size([84, 768]) in the checkpoint and torch.Size([83, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
    

## 4. Training

In [19]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

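def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Label id 0 doubles as the padding value in NER_Dataset, so this mask drops
    # padded positions along with every token whose true label has id 0
    mask = labels != 0
    # Pick the highest-scoring class for each token
    predictions = np.argmax(predictions, axis=-1)
    return accuracy.compute(predictions=predictions[mask], references=labels[mask])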
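
If labels were padded with a dedicated ignore value such as -100 rather than 0, the mask could exclude padding without also excluding a real class. The sketch below illustrates that variant on plain nested lists so it runs without NumPy or `evaluate`; `masked_accuracy` is a hypothetical helper and the ids are invented.

```python
from typing import List


def masked_accuracy(pred_ids: List[List[int]], label_ids: List[List[int]],
                    ignore_index: int = -100) -> float:
    """Token accuracy over all positions whose label is not ignore_index."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(pred_ids, label_ids):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # padded position, not scored
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Invented predictions and labels; -100 marks padding
example_preds = [[0, 2, 1, 0], [3, 3, 0, 0]]
example_labels = [[0, 2, 2, -100], [3, 0, -100, -100]]

score = masked_accuracy(example_preds, example_labels)
print(score)
```

Here tokens with label id 0 still count toward the score; only the padded positions are skipped.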

In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="out_dir",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    # num_train_epochs=20,
    num_train_epochs=2,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    optim="adamw_torch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=val_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.780154        | 0.262471 |
| 2     | No log        | 1.393557        | 0.42563  |


TrainOutput(global_step=20, training_loss=2.065944290161133, metrics={'train_runtime': 31.3117, 'train_samples_per_second': 10.22, 'train_steps_per_second': 0.639, 'total_flos': 41870224588800.0, 'train_loss': 2.065944290161133, 'epoch': 2.0})

## 5. Inference

In [21]:
test_sentence = """A 48 year - old female presented with vaginal bleeding and abnormal Pap smears .
Upon diagnosis of invasive non - keratinizing SCC of the cervix ,
she underwent a radical hysterectomy with salpingo - oophorectomy
which demonstrated positive spread to the pelvic lymph nodes and the parametrium .
Pathological examination revealed that the tumour also extensively involved the lower uterine segment .
"""
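# Whitespace tokens are looked up directly in the vocabulary; any token that is
# not a vocabulary entry is mapped to the unknown-token id
input = torch.as_tensor([tokenizer.convert_tokens_to_ids(test_sentence.split())])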

In [22]:
# Move the input to whichever device the model is on (GPU or CPU)
input = input.to(model.device)

In [23]:
# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(input)
outputs.logits.shape

torch.Size([1, 63, 83])

In [24]:
_, preds = torch.max(outputs.logits, -1)
preds = preds[0].cpu().numpy()

In [25]:
for token, pred in zip(test_sentence.split(), preds):
    print(f"{token}\t{id2label[pred]}")

A	O
48	I-Age
year	I-Age
-	I-Age
old	I-Age
female	I-Age
presented	O
with	O
vaginal	O
bleeding	B-Sign_symptom
and	O
abnormal	O
Pap	O
smears	O
.	O
Upon	O
diagnosis	O
of	O
invasive	O
non	O
-	O
keratinizing	O
SCC	O
of	O
the	O
cervix	O
,	O
she	O
underwent	O
a	O
radical	O
hysterectomy	O
with	O
salpingo	O
-	O
oophorectomy	O
which	O
demonstrated	O
positive	O
spread	O
to	O
the	O
pelvic	O
lymph	O
nodes	O
and	O
the	O
parametrium	O
.	O
Pathological	O
examination	O
revealed	O
that	O
the	O
tumour	O
also	O
extensively	O
involved	O
the	O
lower	O
uterine	I-Biological_structure
segment	O
.	O


In [26]:
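# preds was computed from the whitespace-split input above, so pairing it with
# word pieces shifts the labels as soon as a word splits into several pieces
for token, pred in zip(tokenizer.tokenize(test_sentence), preds):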
    print(f"{token}\t{id2label[pred]}")

a	O
48	I-Age
year	I-Age
-	I-Age
old	I-Age
female	I-Age
presented	O
with	O
va	O
##ginal	B-Sign_symptom
bleeding	O
and	O
abnormal	O
pa	O
##p	O
sm	O
##ears	O
.	O
upon	O
diagnosis	O
of	O
invasive	O
non	O
-	O
ke	O
##rat	O
##ini	O
##zing	O
sc	O
##c	O
of	O
the	O
ce	O
##r	O
##vi	O
##x	O
,	O
she	O
underwent	O
a	O
radical	O
h	O
##yst	O
##ere	O
##ct	O
##omy	O
with	O
sal	O
##ping	O
##o	O
-	O
o	O
##op	O
##hore	O
##ct	O
##omy	O
which	O
demonstrated	O
positive	O
spread	O
to	I-Biological_structure
the	O
pe	O
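
When predictions are made per word piece, one way to report them per word is to glue the `##` continuation pieces back together and keep the label of each word's first piece. The sketch below shows that merging step on its own. `merge_wordpieces` is a hypothetical helper, and the labels in the example are invented rather than produced by the model above.

```python
from typing import List, Tuple


def merge_wordpieces(pieces: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words: List[str] = []
    word_labels: List[str] = []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            # Continuation piece: append to the current word, drop its label
            words[-1] += piece[2:]
        else:
            # First piece of a new word: start the word and keep its label
            words.append(piece)
            word_labels.append(label)
    return words, word_labels


# Invented per-piece labels for a fragment of the test sentence
example_pieces = ["with", "va", "##ginal", "bleeding", "and", "pa", "##p", "sm", "##ears"]
example_piece_labels = ["O", "B-Sign_symptom", "I-Sign_symptom", "I-Sign_symptom",
                        "O", "O", "O", "O", "O"]

merged_words, merged_labels = merge_wordpieces(example_pieces, example_piece_labels)
for word, label in zip(merged_words, merged_labels):
    print(f"{word}\t{label}")
```

This only lines up if the labels were predicted from the same word-piece sequence that is being merged.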
