# NER with BioBERT: Disease Identification in Text

## Set Up

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

Successfully installed datasets-2.8.0 evaluate-0.4.0 huggingface-hub-0.11.1 multiprocess-0.70.14 responses-0.18.0 sentencepiece-0.1.97 tokenizers-0.13.2 transformers-4.25.1 urllib3-1.26.14 xxhash-3.2.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.15.0


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Data Download

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset('EMBO/BLURB', 'NCBI-disease-IOB')

Dataset blurb downloaded and prepared to /root/.cache/huggingface/datasets/EMBO___blurb/NCBI-disease-IOB/1.0.0/c9736b8ffc197d4eb4f0b33fdea18902cede876fba559bbdb3dca05abf0042bc. Subsequent calls will reuse this data.



In [None]:
#train, validation and test datasets
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5425
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})

In [None]:
#first sentence and the corresponding NER tags

print(raw_datasets["train"][0]["tokens"])
print(raw_datasets["train"][0]["ner_tags"])
print(len(raw_datasets["train"][0]["ner_tags"]))

['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
14


In [None]:
#tag id and tag names
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-Disease', 'I-Disease'], id=None), length=-1, id=None)

In [None]:
#store the tag names as label_names
label_names = ner_feature.feature.names
label_names

['O', 'B-Disease', 'I-Disease']



*   O indicates the token doesn’t belong to a disease entity.
*   B- indicates the beginning of an entity.
*   I- indicates a token inside the same entity (e.g., in the first sentence above, “polyposis” continues the entity that “adenomatous” begins).
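
To make the scheme concrete, here is a minimal sketch that groups IOB tags back into entity spans for the first training sentence. The helper `extract_entities` is a hypothetical name introduced for illustration (it is not part of the notebook or of any library), and it assumes reasonably well-formed tags.

```python
# Hypothetical helper (not from the notebook): group IOB tags into entity spans.
def extract_entities(tokens, tags):
    """Return (entity_type, start, end_exclusive, text) for each B-/I- run."""
    entities = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # still inside the current entity
        if start is not None:
            # Any other tag closes the entity that was open.
            entities.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # A stray I- tag with no open entity is simply skipped in this sketch.
    if start is not None:
        entities.append((ent_type, start, len(tokens), " ".join(tokens[start:])))
    return entities


label_names = ["O", "B-Disease", "I-Disease"]
tokens = ["Identification", "of", "APC2", ",", "a", "homologue", "of", "the",
          "adenomatous", "polyposis", "coli", "tumour", "suppressor", "."]
tag_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]

entities = extract_entities(tokens, [label_names[t] for t in tag_ids])
print(entities)
# [('Disease', 8, 12, 'adenomatous polyposis coli tumour')]
```

The single B-Disease tag followed by three I-Disease tags yields one four-word entity rather than four separate ones, which is the distinction the B-/I- prefixes exist to encode.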







In [None]:
#first sentence and the corresponding NER tags, printed as two aligned rows
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Identification of APC2 , a homologue of the adenomatous polyposis coli      tumour    suppressor . 
O              O  O    O O O         O  O   B-Disease   I-Disease I-Disease I-Disease O          O 


In [None]:
#download the BioBERT tokenizer from the Hugging Face Hub
from transformers import AutoTokenizer

model_checkpoint = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


## Data Preprocessing

### Tokenization

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())
print(len(inputs.tokens()))

['[CLS]', 'I', '##dent', '##ification', 'of', 'AP', '##C', '##2', ',', 'a', 'ho', '##mo', '##logue', 'of', 'the', 'ad', '##eno', '##mat', '##ous', 'p', '##oly', '##po', '##sis', 'co', '##li', 't', '##umour', 'suppress', '##or', '.', '[SEP]']
31


The tokenizer added the special tokens used by the model ([CLS] at the beginning and [SEP] at the end) and split most of the words into subword tokens. This introduces a mismatch between our inputs and the labels: the list of labels has only 14 elements, whereas our input now has 31 tokens. Accounting for the special tokens is easy (they are always at the beginning and the end), but we also need to make sure each label is aligned with the tokens of its word.

In [None]:
print(inputs.word_ids())

[None, 0, 0, 0, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, None]
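
For this particular sentence, the word IDs can be reproduced from the `##` continuation prefix alone. The sketch below is only an approximation for illustration: `approximate_word_ids` is a hypothetical helper, and the real fast tokenizer tracks the source word of every token directly rather than inferring it from the prefix. A pre-split word that contains punctuation may be broken into pieces without `##`, which this heuristic would count as separate words.

```python
# Tokens printed above for the first training sentence.
tokens = ['[CLS]', 'I', '##dent', '##ification', 'of', 'AP', '##C', '##2', ',',
          'a', 'ho', '##mo', '##logue', 'of', 'the', 'ad', '##eno', '##mat',
          '##ous', 'p', '##oly', '##po', '##sis', 'co', '##li', 't', '##umour',
          'suppress', '##or', '.', '[SEP]']

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def approximate_word_ids(tokens):
    """Hypothetical helper: infer word indices from WordPiece '##' prefixes."""
    word_ids = []
    current = -1
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            word_ids.append(None)      # special tokens belong to no word
        elif tok.startswith("##"):
            word_ids.append(current)   # continuation of the current word
        else:
            current += 1               # a new word starts here
            word_ids.append(current)
    return word_ids


word_ids = approximate_word_ids(tokens)
print(word_ids)
print(len(word_ids))
```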


The word IDs printed above give, for each token, the index of the word it came from (None for special tokens). We can use them to expand our label list to match the tokens:

*   Special tokens get a label of -100, because -100 is the index that the loss function we will use (cross entropy) ignores by default.
*   The first token of each word gets the label of that word.
*   The remaining tokens of a word get the same label, except that B- is replaced with I- (since those tokens do not begin the entity).
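
The first rule relies on the loss skipping positions labelled -100. As a rough illustration in plain Python, a token-level cross entropy that ignores those positions could look like the sketch below. The notebook itself relies on the model's built-in loss; `masked_cross_entropy` is a hypothetical name, and returning 0.0 when every position is ignored is a choice made for this sketch only.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and padding contribute nothing
        m = max(row)
        # log-sum-exp, shifted by the row maximum for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


# Four token positions, three classes (O, B-Disease, I-Disease).
logits = [[9.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [0.0, 0.0, 9.0]]
labels = [-100, 0, 1, -100]

loss = masked_cross_entropy(logits, labels)
loss_inner_only = masked_cross_entropy(logits[1:3], labels[1:3])
print(loss == loss_inner_only)  # the -100 positions did not change the loss
```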




### Aligning labels with tokens

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, -100]


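The function assigned -100 to the two special tokens at the beginning and the end, and repeated each word's label for every token the word was split into. For example, "adenomatous" (B-Disease) was split into four tokens labelled 1, 2, 2, 2: only the first token keeps B-Disease, and the rest become I-Disease.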

In [None]:
#function to preprocess all instances in a batch
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
#preprocess the whole dataset using the map function
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5425
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 924
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 941
    })
})

### Padding

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
print(batch["labels"])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    1,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    0,    0,    0, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100],
        [-100,    0,    1,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0, -100]])


In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, -100]
[-100, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


The first set of labels has been padded to the length of the second one using -100s.
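
The label padding can be pictured with a small plain-Python sketch. This is an illustration under simplifying assumptions, not the collator's implementation: `pad_labels` is a hypothetical helper, it always pads on the right, and it handles only the labels, whereas the collator also pads `input_ids` and `attention_mask` and returns tensors.

```python
LABEL_PAD_ID = -100  # same value the notebook uses for ignored positions


def pad_labels(label_lists, pad_id=LABEL_PAD_ID):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in label_lists]


# Label lists with the same contents as the two training examples printed above.
first = [-100] + [0] * 14 + [1] + [2] * 11 + [0] * 3 + [-100]   # 31 labels
second = [-100, 0, 1] + [2] * 15 + [0] * 46 + [-100]            # 65 labels

padded = pad_labels([first, second])
print([len(row) for row in padded])   # [65, 65]
print(padded[0][31:] == [-100] * 34)  # True: the first row gained 34 padding labels
```

Because the padding value is -100, the padded positions are ignored by the loss in the same way as the special tokens.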

## Evaluation Metric

In [None]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16179 sha256=46c549a89077ea067c687dec8d9020db0d1ecba6d44e1053cd492b3fa8ab1895
  Stored in directory: /root/.cache/pip/wheels/ad/5c/ba/05fa33fa5855777b7d686e843ec07452f22a66a138e290e732
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
import evaluate

metric = evaluate.load("seqeval")


For more details on seqeval:

https://github.com/chakki-works/seqeval
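
seqeval scores whole entities rather than individual tokens. The sketch below is a simplified illustration of that idea, not seqeval's implementation: `entity_spans` and `entity_prf` are hypothetical helpers, and their handling of malformed sequences (for example an I- tag with no preceding B-) may differ from what the library does.

```python
def entity_spans(tags):
    """Collect (type, start, end_exclusive) spans from an IOB tag list."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # still inside the open entity
        if start is not None:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(references, predictions):
    """Entity-level precision, recall and F1: a span counts only on exact match."""
    true_spans, pred_spans = set(), set()
    for idx, (ref, pred) in enumerate(zip(references, predictions)):
        true_spans |= {(idx,) + span for span in entity_spans(ref)}
        pred_spans |= {(idx,) + span for span in entity_spans(pred)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


references = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O"]]
predictions = [["O", "B-Disease", "O", "O"], ["B-Disease", "O"]]

precision, recall, f1 = entity_prf(references, predictions)
token_accuracy = sum(
    r == p for ref, pred in zip(references, predictions) for r, p in zip(ref, pred)
) / sum(len(ref) for ref in references)
print(precision, recall, f1)  # 0.5 0.5 0.5
print(token_accuracy)         # five of six tokens match
```

In this toy example a single wrong token truncates one entity, so entity-level F1 drops to 0.5 while token accuracy stays high. This is why accuracy and F1 can diverge noticeably on NER tasks.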


In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
print(id2label)
print(label2id)

{0: 'O', 1: 'B-Disease', 2: 'I-Disease'}
{'O': 0, 'B-Disease': 1, 'I-Disease': 2}


In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.config.num_labels

3

## Trainer API

We use early stopping (a patience of 4 evaluations, tracking validation F1) and weight decay (0.01) to limit overfitting.

In [None]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    "biobert-finetuned-ner",
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=16,   
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    warmup_steps=500, 
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model = 'f1',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 4)]
)

trainer.train()

***** Running training *****
  Num examples = 5425
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3400
  Number of trainable parameters = 107721987


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.062781 | 0.712085 | 0.763659 | 0.736971 | 0.979833 |
| 2 | 0.241700 | 0.045743 | 0.795139 | 0.872935 | 0.832223 | 0.984782 |
| 3 | 0.030900 | 0.054587 | 0.78972 | 0.858958 | 0.822885 | 0.985186 |
| 4 | 0.030900 | 0.057012 | 0.841787 | 0.885642 | 0.863158 | 0.986431 |
| 5 | 0.009700 | 0.059964 | 0.825829 | 0.885642 | 0.85469 | 0.986275 |
| 6 | 0.004800 | 0.072157 | 0.828467 | 0.865311 | 0.846489 | 0.985373 |
| 7 | 0.004800 | 0.079563 | 0.836342 | 0.8831 | 0.859085 | 0.986058 |
| 8 | 0.001700 | 0.082427 | 0.836759 | 0.879288 | 0.857497 | 0.985715 |


***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Saving model checkpoint to biobert-finetuned-ner/checkpoint-340
Configuration saved in biobert-finetuned-ner/checkpoint-340/config.json
Model weights saved in biobert-finetuned-ner/checkpoint-340/pytorch_model.bin
tokenizer config file saved in biobert-finetuned-ner/checkpoint-340/tokenizer_config.json
Special tokens file saved in biobert-finetuned-ner/checkpoint-340/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Saving model checkpoint to biobert-finetuned-ner/checkpoint-680
Configuration saved in biobert-finetuned-ner/checkpoint-680/config.json
Model weights saved in biobert-finetuned-ner/checkpoint-680/pytorch_model.bin
tokenizer config file saved in biobert-finetuned-ner/checkpoint-680/tokenizer_config.json
Special tokens file saved in biobert-finetuned-ner/checkpoint-680/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Sa

TrainOutput(global_step=2720, training_loss=0.05321214084020432, metrics={'train_runtime': 733.4072, 'train_samples_per_second': 73.97, 'train_steps_per_second': 4.636, 'total_flos': 1602483193384740.0, 'train_loss': 0.05321214084020432, 'epoch': 8.0})
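
Training stopped after epoch 8 instead of running all 10 epochs. The following sketch replays the validation F1 scores from the log through a simplified patience rule to show why, and relates epochs to checkpoint step numbers. It is an illustration only: `simulate_early_stopping` is a hypothetical helper, not the logic of `EarlyStoppingCallback` itself, which has additional options.

```python
import math

# Validation F1 per epoch, copied from the training log above.
val_f1 = [0.736971, 0.832223, 0.822885, 0.863158,
          0.85469, 0.846489, 0.859085, 0.857497]


def simulate_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed; higher scores are better."""
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # patience exhausted
    return best_epoch, len(scores)


best_epoch, stop_epoch = simulate_early_stopping(val_f1, patience=4)

# 5425 training examples at batch size 16; the last batch is partial.
steps_per_epoch = math.ceil(5425 / 16)
best_checkpoint_step = best_epoch * steps_per_epoch

print(best_epoch, stop_epoch)   # 4 8
print(steps_per_epoch)          # 340
print(best_checkpoint_step)     # 1360
```

Epoch 4 has the best validation F1, and the four following epochs do not improve on it, so the run ends at step 8 × 340 = 2720. The best epoch corresponds to step 1360.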

## Test Set Evaluation

In [None]:
logits, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(logits, axis=-1)

# Remove ignored index (special tokens) and convert to labels
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

***** Running Prediction *****
  Num examples = 941
  Batch size = 64


{'Disease': {'precision': 0.831081081081081,
  'recall': 0.896875,
  'f1': 0.8627254509018035,
  'number': 960},
 'overall_precision': 0.831081081081081,
 'overall_recall': 0.896875,
 'overall_f1': 0.8627254509018035,
 'overall_accuracy': 0.9822615737942705}

## Inference

In [None]:
import pandas as pd
import torch

def tag_sentence(text: str):
    # convert the text to a tokenized sequence on the same device as the model
    inputs = tokenizer(text, truncation=True, return_tensors="pt").to(model.device)
    # run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**inputs)
    # convert the logits to probabilities with softmax
    probs = outputs[0][0].softmax(1)
    # pair each token with the tag that has the highest probability
    word_tags = [(tokenizer.decode(inputs['input_ids'][0][i].item()), id2label[tagid.item()])
                 for i, tagid in enumerate(probs.argmax(axis=1))]

    return pd.DataFrame(word_tags, columns=['word', 'tag'])

In [None]:
text = """Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia ."""

print(tag_sentence(text))

         word        tag
0       [CLS]          O
1           C          O
2    ##luster          O
3       ##ing          O
4          of          O
5        miss          O
6      ##ense          O
7   mutations          O
8          in          O
9         the          O
10         at  B-Disease
11       ##ax  I-Disease
12       ##ia  I-Disease
13          -  I-Disease
14         te  I-Disease
15     ##lang  I-Disease
16       ##ie  I-Disease
17       ##ct  I-Disease
18      ##asi  I-Disease
19        ##a  I-Disease
20       gene          O
21         in          O
22          a          O
23          s  B-Disease
24  ##poradic  I-Disease
25          T  I-Disease
26          -  I-Disease
27       cell  I-Disease
28         le  I-Disease
29      ##uka  I-Disease
30     ##emia  I-Disease
31          .          O
32      [SEP]          O


In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "./biobert-finetuned-ner/checkpoint-1360"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

In [None]:
token_classifier("Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia.")

[{'entity_group': 'Disease',
  'score': 0.9995025,
  'word': 'ataxia - telangiectasia',
  'start': 40,
  'end': 63},
 {'entity_group': 'Disease',
  'score': 0.99272454,
  'word': 'sporadic T - cell leukaemia',
  'start': 74,
  'end': 101}]
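
The pipeline output groups subword tokens into whole entities with character offsets. The sketch below shows one simplified way to get from the per-token tags printed earlier to the same two entities. It is an illustration, not the pipeline's implementation: `group_entities` is a hypothetical helper, it assumes a single entity type, and it locates offsets with `str.find`, which only works here because each entity text occurs once in the sentence.

```python
text = ("Clustering of missense mutations in the ataxia - telangiectasia gene "
        "in a sporadic T - cell leukaemia.")

# (token, tag) pairs copied from the tag_sentence table above.
O, B, I = "O", "B-Disease", "I-Disease"
tagged_tokens = [
    ("[CLS]", O), ("C", O), ("##luster", O), ("##ing", O), ("of", O),
    ("miss", O), ("##ense", O), ("mutations", O), ("in", O), ("the", O),
    ("at", B), ("##ax", I), ("##ia", I), ("-", I), ("te", I), ("##lang", I),
    ("##ie", I), ("##ct", I), ("##asi", I), ("##a", I), ("gene", O),
    ("in", O), ("a", O), ("s", B), ("##poradic", I), ("T", I), ("-", I),
    ("cell", I), ("le", I), ("##uka", I), ("##emia", I), (".", O), ("[SEP]", O),
]


def group_entities(tagged_tokens):
    """Hypothetical helper: merge B-/I- tagged WordPiece tokens into entities."""
    entities, pieces, current_type = [], [], None
    for token, tag in tagged_tokens:
        if tag == "O" or (tag.startswith("B-") and pieces):
            if pieces:  # close the entity that was open
                entities.append((current_type, "".join(pieces)))
                pieces = []
            if tag == "O":
                continue
        current_type = tag.split("-", 1)[1]
        if token.startswith("##") and pieces:
            pieces.append(token[2:])                      # glue the subword on
        else:
            pieces.append((" " if pieces else "") + token)  # new word in the entity
    if pieces:
        entities.append((current_type, "".join(pieces)))
    return entities


results = []
for entity_type, word in group_entities(tagged_tokens):
    start = text.find(word)
    results.append({"entity_group": entity_type, "word": word,
                    "start": start, "end": start + len(word)})
print(results)
```

The words and offsets match the pipeline output above; the scores are omitted because this sketch works from tags only.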

## Gradio Application


In [None]:
! pip install gradio

In [None]:
import gradio as gr

examples = [
    "Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia.",
    "Ataxia - telangiectasia ( A - T ) is a recessive multi - system disorder caused by mutations in the ATM gene at 11q22 - q23 ( ref . 3 ).",
    "The risk of cancer , especially lymphoid neoplasias , is substantially elevated in A - T patients and has long been associated with chromosomal instability.",
    "These clustered in the region corresponding to the kinase domain , which is highly conserved in ATM - related proteins in mouse , yeast and Drosophila.",
    "Constitutional RB1 - gene mutations in patients with isolated unilateral retinoblastoma .",
    "The evidence of a significant proportion of loss - of - function mutations and a complete absence of the normal copy of ATM in the majority of mutated tumours establishes somatic inactivation of this gene in the pathogenesis of sporadic T - PLL and suggests that ATM acts as a tumour suppressor.",
]

def ner(text):
    output = token_classifier(text)
    # gr.HighlightedText expects an "entity" key, so rename "entity_group"
    for hmap in output:
        hmap['entity'] = hmap.pop('entity_group')
    return {"text": text, "entities": output}

demo = gr.Interface(ner,
             gr.Textbox(placeholder="Enter sentence here..."), 
             gr.HighlightedText(),
             examples=examples,
             allow_flagging = 'never')

demo.launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



## Push Final Model to Hub

In [7]:
! pip install huggingface_hub transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
from transformers import AutoModelForTokenClassification,AutoTokenizer

model_checkpoint = "./biobert-finetuned-ner/checkpoint-1360"
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [9]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [None]:
model.push_to_hub("biobert-finetuned-ner")

In [None]:
tokenizer.push_to_hub("biobert-finetuned-ner")