# NER with BERT Base: Disease Identification in Text

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

## Set Up

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

## Data Download

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset('EMBO/BLURB', 'NCBI-disease-IOB')


In [None]:
#train, validation and test datasets
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5425
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})

In [None]:
#first sentence and the corresponding NER tags

print(raw_datasets["train"][0]["tokens"])
print(raw_datasets["train"][0]["ner_tags"])
print(len(raw_datasets["train"][0]["ner_tags"]))

['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
14


In [None]:
#tag id and tag names
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-Disease', 'I-Disease'], id=None), length=-1, id=None)

In [None]:
#store tag names to label names
label_names = ner_feature.feature.names
label_names

['O', 'B-Disease', 'I-Disease']



*   O indicates that the token is not part of a disease entity.
*   B- indicates the first token of an entity.
*   I- indicates a token inside an entity that has already begun (e.g., “polyposis” continues the disease entity that begins at “adenomatous” in the sentence above).
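
The tagging scheme above can be read back into entity spans. The following is a minimal sketch, not part of the original notebook: the helper name `iob_to_spans` is hypothetical, and the tokens and tag IDs are copied from the first training example printed earlier.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


label_names = ["O", "B-Disease", "I-Disease"]
tokens = ["Identification", "of", "APC2", ",", "a", "homologue", "of", "the",
          "adenomatous", "polyposis", "coli", "tumour", "suppressor", "."]
ner_tags = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]

spans = iob_to_spans(tokens, [label_names[i] for i in ner_tags])
print(spans)
```

For this sentence the only span is the four-word disease mention that starts at "adenomatous".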







In [None]:
#first sentence and the corresponding NER tags (in a better way)
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Identification of APC2 , a homologue of the adenomatous polyposis coli      tumour    suppressor . 
O              O  O    O O O         O  O   B-Disease   I-Disease I-Disease I-Disease O          O 


In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


## Data Preprocessing

### Tokenization

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())
print(len(inputs.tokens()))

['[CLS]', 'I', '##dent', '##ification', 'of', 'AP', '##C', '##2', ',', 'a', 'ho', '##mo', '##logue', 'of', 'the', 'ad', '##eno', '##mat', '##ous', 'p', '##oly', '##po', '##sis', 'co', '##li', 't', '##umour', 'suppress', '##or', '.', '[SEP]']
31


The tokenizer added the special tokens used by the model ([CLS] at the beginning and [SEP] at the end) and split most of the words into sub-word tokens. This introduces a mismatch between our inputs and the labels: the list of labels has only 14 elements, whereas our input now has 31 tokens. Accounting for the special tokens is easy (we know they are at the beginning and the end), but we also need to align each label with every token of the word it belongs to.

In [None]:
print(inputs.word_ids())

[None, 0, 0, 0, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, None]
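
To see what the word IDs encode, the tokens can be grouped by the word they came from. This is a small illustrative sketch using the token and word-ID lists printed above, hard-coded so that it runs without the tokenizer; the variable names are hypothetical.

```python
tokens = ["[CLS]", "I", "##dent", "##ification", "of", "AP", "##C", "##2", ",", "a",
          "ho", "##mo", "##logue", "of", "the", "ad", "##eno", "##mat", "##ous",
          "p", "##oly", "##po", "##sis", "co", "##li", "t", "##umour",
          "suppress", "##or", ".", "[SEP]"]
word_ids = [None, 0, 0, 0, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8,
            9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, None]

# Map each original word index to the sub-word tokens it was split into.
pieces_by_word = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue  # [CLS] and [SEP] belong to no word
    pieces_by_word.setdefault(word_id, []).append(token)

for word_id, pieces in pieces_by_word.items():
    print(word_id, pieces)
```

The 31 tokens collapse back to the 14 original words, which is exactly the mapping needed to carry the 14 labels over to the tokens.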


We can then expand our label list to match the tokens, using three rules:


*   Special tokens get a label of -100, because -100 is the index that the loss function we will use (cross-entropy) ignores by default.
*   The first token of each word gets the label of that word.
*   Any later token of the same word also gets the word’s label, except that a B- label is replaced with the matching I- label, since that token does not begin the entity.
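
The effect of the -100 label can be illustrated with a hand-written cross-entropy. This is a pure-Python sketch of the idea, not the loss implementation the model actually uses; the function name and the toy logits are made up for illustration.

```python
import math

IGNORE_INDEX = -100


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean per-token cross-entropy, skipping positions labelled ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special or padding token: contributes nothing
        # Negative log-softmax of the true class, computed stably.
        m = max(token_logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in token_logits))
        total += log_sum_exp - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Four tokens, three classes (O, B-Disease, I-Disease); toy numbers.
logits = [[5.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [9.0, 9.0, 9.0]]
labels = [-100, 0, 1, -100]

loss_with_specials = cross_entropy(logits, labels)
loss_without_specials = cross_entropy(logits[1:3], labels[1:3])
print(loss_with_specials, loss_without_specials)
```

Both calls give the same loss, because whatever the model predicts at the positions labelled -100 is never counted.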




### Aligning labels with tokens

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
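            # If the label is B-XXX we change it to I-XXX. This relies on the
            # label order ['O', 'B-Disease', 'I-Disease'], where every B- id is
            # odd and the matching I- id is the next integer.
            if label % 2 == 1:
                label += 1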
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, -100]


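The function added -100 for the two special tokens at the beginning and the end, and repeated each word's label for every sub-word token the word was split into. For example, the B-Disease word “adenomatous” became four tokens labelled 1, 2, 2, 2: only the first token keeps B-Disease and the rest become I-Disease.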
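
The `label % 2 == 1` test in `align_labels_with_tokens` works because B-Disease has id 1 and I-Disease has id 2. If the label list were ordered differently, one possible variant is to derive the B-to-I mapping from the label names instead. The sketch below is an assumption-labelled alternative, not the notebook's method; `build_b_to_i` and `align_labels` are hypothetical names, and the word IDs are hard-coded from the output above.

```python
def build_b_to_i(label_names):
    """Map the id of each B-XXX label to the id of its I-XXX counterpart."""
    name_to_id = {name: i for i, name in enumerate(label_names)}
    return {
        i: name_to_id["I-" + name[2:]]
        for i, name in enumerate(label_names)
        if name.startswith("B-") and "I-" + name[2:] in name_to_id
    }


def align_labels(labels, word_ids, b_to_i):
    """Expand word-level labels to token level using the same three rules."""
    new_labels, previous_word = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)  # special token
        elif word_id != previous_word:
            new_labels.append(labels[word_id])  # first token of a word
        else:
            # Continuation token: B-XXX becomes I-XXX, other labels are kept.
            new_labels.append(b_to_i.get(labels[word_id], labels[word_id]))
        previous_word = word_id
    return new_labels


label_names = ["O", "B-Disease", "I-Disease"]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
word_ids = [None, 0, 0, 0, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8,
            9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, None]

aligned = align_labels(labels, word_ids, build_b_to_i(label_names))
print(aligned)
```

On this example the variant produces the same 31 labels as the original function.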

In [None]:
#function for data preprocessing for all instances
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
#preprocessing the whole dataset using the map function
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5425
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 924
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 941
    })
})

### Padding

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
print(batch["labels"])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    1,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    0,    0,    0, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100],
        [-100,    0,    1,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0, -100]])


In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, -100]
[-100, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


The first set of labels has been padded to the length of the second one using -100s.
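
What the collator does to the labels can be sketched in a few lines of plain Python. This only illustrates the label padding (the real collator also pads the input IDs and attention masks and returns tensors); `pad_labels` is a hypothetical helper, and the two label lists are rebuilt compactly from the output printed above.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]


# Labels of the first two training examples (31 and 65 tokens long).
first = [-100] + [0] * 14 + [1] + [2] * 11 + [0] * 3 + [-100]
second = [-100, 0, 1] + [2] * 15 + [0] * 46 + [-100]

padded = pad_labels([first, second])
print([len(row) for row in padded])
```

Because the padding value is -100, the padded positions are ignored by the loss in the same way as the special tokens.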

## Evaluation Metric

In [None]:
!pip install seqeval

In [None]:
import evaluate

metric = evaluate.load("seqeval")


For more details on seqeval:

https://github.com/chakki-works/seqeval
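
seqeval scores whole entities rather than individual tokens. The sketch below is a simplified re-implementation of that idea for intuition only; it is not seqeval's code, and its handling of malformed tag sequences may differ. The function names and the toy sentence are made up.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_scores(references, predictions):
    """Entity-level precision, recall and F1 with exact span matching."""
    tp = n_pred = n_true = 0
    for ref, pred in zip(references, predictions):
        true_ents, pred_ents = extract_entities(ref), extract_entities(pred)
        tp += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


references = [["O", "B-Disease", "I-Disease", "I-Disease", "O"]]
predictions = [["O", "B-Disease", "I-Disease", "O", "O"]]

scores = entity_scores(references, predictions)
token_accuracy = sum(r == p for r, p in zip(references[0], predictions[0])) / len(references[0])
print(scores, token_accuracy)
```

In this toy case four of five tokens are correct, yet the predicted span ends one token early, so it does not count as a match. This is why token accuracy is usually much higher than entity-level F1.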


In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
print(id2label)
print(label2id)

{0: 'O', 1: 'B-Disease', 2: 'I-Disease'}
{'O': 0, 'B-Disease': 1, 'I-Disease': 2}


In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [None]:
model.config.num_labels

3

## Trainer API

In [None]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    "bert-finetuned-ner",
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=4)],
)

trainer.train()

***** Running training *****
  Num examples = 5425
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3400
  Number of trainable parameters = 107721987
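
The step count in the log follows from the dataset size and batch size. The sketch below reproduces that arithmetic and also shows how the learning rate would evolve, assuming the Trainer's default linear schedule with warmup, a single device, and no gradient accumulation. `linear_schedule_lr` is an illustrative helper, not a library function.

```python
import math

num_examples = 5425
batch_size = 16
num_epochs = 10
warmup_steps = 500
peak_lr = 2e-5

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_schedule_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps)
for step in (0, 250, 500, 1950, 3400):
    print(step, linear_schedule_lr(step))
```

With 500 warmup steps out of 3400, the learning rate only reaches 2e-5 partway through the second epoch.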


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.065195 | 0.645678 | 0.768742 | 0.701856 | 0.977624 |
| 2 | 0.234200 | 0.054081 | 0.761092 | 0.850064 | 0.803121 | 0.98307 |
| 3 | 0.039900 | 0.062464 | 0.767045 | 0.857687 | 0.809838 | 0.982323 |
| 4 | 0.039900 | 0.062493 | 0.800234 | 0.870394 | 0.833841 | 0.985155 |
| 5 | 0.014800 | 0.086323 | 0.768445 | 0.860229 | 0.811751 | 0.980642 |
| 6 | 0.006600 | 0.087209 | 0.797018 | 0.8831 | 0.837854 | 0.98279 |
| 7 | 0.006600 | 0.093556 | 0.81764 | 0.871665 | 0.843788 | 0.983723 |
| 8 | 0.002500 | 0.099796 | 0.81861 | 0.8831 | 0.849633 | 0.984035 |
| 9 | 0.001200 | 0.106368 | 0.809917 | 0.871665 | 0.839657 | 0.984128 |
| 10 | 0.001200 | 0.108929 | 0.813899 | 0.878018 | 0.844743 | 0.984159 |


***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Saving model checkpoint to bert-finetuned-ner/checkpoint-340
Configuration saved in bert-finetuned-ner/checkpoint-340/config.json
Model weights saved in bert-finetuned-ner/checkpoint-340/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-340/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-340/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Saving model checkpoint to bert-finetuned-ner/checkpoint-680
Configuration saved in bert-finetuned-ner/checkpoint-680/config.json
Model weights saved in bert-finetuned-ner/checkpoint-680/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-680/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-680/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 924
  Batch size = 64
Saving model checkpoint to bert-

TrainOutput(global_step=3400, training_loss=0.04410528949078391, metrics={'train_runtime': 935.8047, 'train_samples_per_second': 57.971, 'train_steps_per_second': 3.633, 'total_flos': 2005515139559454.0, 'train_loss': 0.04410528949078391, 'epoch': 10.0})
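
The per-epoch F1 values reported during training determine both which checkpoint is best and whether early stopping would have fired. The following sketch replays them with a simplified patience counter; it approximates the callback's behaviour under the assumption of a zero improvement threshold and is not the library's implementation. The 340 steps per epoch come from 3400 total steps over 10 epochs.

```python
f1_by_epoch = [0.701856, 0.803121, 0.809838, 0.833841, 0.811751,
               0.837854, 0.843788, 0.849633, 0.839657, 0.844743]
steps_per_epoch = 340
patience = 4

best_f1, best_epoch = float("-inf"), None
bad_epochs, stopped_at = 0, None
for epoch, f1 in enumerate(f1_by_epoch, start=1):
    if f1 > best_f1:
        # New best validation F1: remember it and reset the counter.
        best_f1, best_epoch, bad_epochs = f1, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            stopped_at = epoch
            break

best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_f1, best_checkpoint, stopped_at)
```

F1 never fails to improve four evaluations in a row, so training runs all 10 epochs, and the best model is the one saved at the end of epoch 8.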

## Test Set Evaluation

In [None]:
logits, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(logits, axis=-1)

# Remove ignored index (special tokens) and convert to labels
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

***** Running Prediction *****
  Num examples = 941
  Batch size = 64


{'Disease': {'precision': 0.8174757281553398,
  'recall': 0.8770833333333333,
  'f1': 0.8462311557788944,
  'number': 960},
 'overall_precision': 0.8174757281553398,
 'overall_recall': 0.8770833333333333,
 'overall_f1': 0.8462311557788944,
 'overall_accuracy': 0.9799649462105645}
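
The reported F1 is the harmonic mean of precision and recall, and the entity counts behind the ratios can be back-calculated from the 960 gold entities. The counts below are derived from the reported values, not printed by the notebook, so treat them as implied rather than observed.

```python
precision = 0.8174757281553398
recall = 0.8770833333333333
num_gold = 960  # 'number' in the results: gold Disease entities in the test set

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Implied counts: recall = TP / gold, precision = TP / predicted.
true_positives = round(recall * num_gold)
num_predicted = round(true_positives / precision)

print(f1, true_positives, num_predicted)
```

Under these assumptions the model predicted roughly 1030 entities, of which about 842 exactly matched a gold entity.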

## Inference

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint. checkpoint-2720 is the end of epoch 8,
# which had the best validation F1 in the training run above.
model_checkpoint = "./bert-finetuned-ner/checkpoint-2720"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

In [None]:
token_classifier("Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia.")

[{'entity_group': 'Disease',
  'score': 0.9999463,
  'word': 'ataxia - telangiectasia',
  'start': 40,
  'end': 63},
 {'entity_group': 'Disease',
  'score': 0.9987177,
  'word': 'T - cell leukaemia',
  'start': 83,
  'end': 101}]
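
The `start` and `end` values are character offsets into the input string, so each entity can be recovered by slicing. The sketch below hard-codes the pipeline output shown above so that it runs without the model; the bracketed annotation format is just an illustrative choice.

```python
text = ("Clustering of missense mutations in the ataxia - telangiectasia gene "
        "in a sporadic T - cell leukaemia.")
entities = [
    {"entity_group": "Disease", "score": 0.9999463,
     "word": "ataxia - telangiectasia", "start": 40, "end": 63},
    {"entity_group": "Disease", "score": 0.9987177,
     "word": "T - cell leukaemia", "start": 83, "end": 101},
]

# Slice the original text with the reported character offsets.
spans = [text[e["start"]:e["end"]] for e in entities]
print(spans)

# Insert markers from right to left so earlier offsets stay valid.
annotated = text
for e in sorted(entities, key=lambda e: e["start"], reverse=True):
    annotated = (annotated[:e["start"]] + "[" + annotated[e["start"]:e["end"]]
                 + "](" + e["entity_group"] + ")" + annotated[e["end"]:])
print(annotated)
```

Slicing by offset returns the text exactly as it appears in the input, which is safer than relying on `word` when the tokenizer changes spacing or casing.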