<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/mastering-transformers/06-fine-tuning-language-models-for-token-classification/01_ner_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##NER Fine-tuning

In this notebook, we fine-tune BERT for token classification, using NER as the worked example. The same approach applies to other token classification problems such as POS tagging, and the QA problem can likewise be thought of as start/stop token classification.

##Setup

In [None]:
!pip install transformers datasets tokenizers
!pip install seqeval

In [2]:
import datasets 
from transformers import BertTokenizerFast
from transformers import AutoModelForTokenClassification 
from transformers import TrainingArguments, Trainer 
from transformers import DataCollatorForTokenClassification
from transformers import pipeline

import numpy as np 
import pandas as pd
import json

##Loading dataset

In [None]:
conll2003 = datasets.load_dataset("conll2003")

In [4]:
# double-check the dataset by accessing the train samples
conll2003["train"][0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}

In [5]:
# we will use only NER tags
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
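The integer ids in `ner_tags` index into the `names` list shown above. A minimal sketch of decoding the first training sample follows; the label names and the sample are copied by hand from the outputs above so that it runs without downloading the dataset, and the helper name `decode_tags` is ours rather than part of `datasets`.

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training sample, copied from the output of conll2003["train"][0].
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

def decode_tags(tag_ids, names):
  """Map integer class ids to their string labels (hypothetical helper)."""
  return [names[i] for i in tag_ids]

decoded = decode_tags(ner_tags, label_names)
for token, tag in zip(tokens, decoded):
  print(f"{token:<10} {tag}")
```

Here `EU` decodes to `B-ORG`, and `German` and `British` decode to `B-MISC`. With the loaded dataset, `conll2003["train"].features["ner_tags"].feature.int2str(...)` should give the same mapping.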

##Load tokenizer

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [7]:
# let's see how the tokenizer handles a sentence that is already split on whitespace
token_dict = tokenizer(["Oh", "this", "sentence", "is", "tokenized", "and", "splitted", "by", "spaces"], is_split_into_words=True)

In [8]:
pd.DataFrame.from_dict({"input_ids": token_dict["input_ids"], 
                        "token_type_ids": token_dict["token_type_ids"],
                        "attention_mask": token_dict["attention_mask"]}).T

                   0     1     2     3     4     5     6     7     8     9    10    11    12
input_ids        101  2821  2023  6251  2003 19204  3550  1998  3975  3064  2011  7258   102
token_type_ids     0     0     0     0     0     0     0     0     0     0     0     0     0
attention_mask     1     1     1     1     1     1     1     1     1     1     1     1     1


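The data must be preprocessed before training. The tokenizer may split a word into several subword tokens and adds special tokens such as `[CLS]` and `[SEP]`, so the word-level NER tags have to be aligned with the tokenized input. The following function does this, and we map it over the entire dataset: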

In [9]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
  tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
  labels = []

  for i, label in enumerate(examples["ner_tags"]):
    word_ids = tokenized_inputs.word_ids(batch_index=i)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
      if word_idx is None:
        label_ids.append(-100)
      elif word_idx != previous_word_idx:
        label_ids.append(label[word_idx])
      else:
        label_ids.append(label[word_idx] if label_all_tokens else -100)
      previous_word_idx = word_idx
    labels.append(label_ids)
  tokenized_inputs["labels"] = labels
  return tokenized_inputs
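To make the alignment rule easier to follow, here is a minimal sketch of the inner loop for a single sentence. It does not call the tokenizer: the `word_ids` list is written by hand under an assumed tokenization, and the helper name `align_labels` is ours.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
  """Same alignment rule as tokenize_and_align_labels, for one sentence."""
  label_ids = []
  previous_word_idx = None
  for word_idx in word_ids:
    if word_idx is None:
      # special tokens ([CLS], [SEP]) are ignored by the loss
      label_ids.append(-100)
    elif word_idx != previous_word_idx:
      # first subword of a word takes the word's label
      label_ids.append(word_labels[word_idx])
    else:
      # later subwords either repeat the label or are ignored
      label_ids.append(word_labels[word_idx] if label_all_tokens else -100)
    previous_word_idx = word_idx
  return label_ids

# Words: ["Werner", "Zwingmann", "said"] with tags B-PER (1), I-PER (2), O (0).
# Assumed tokenization: [CLS] werner z ##wing ##mann said [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]

all_pieces = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_pieces)
print(first_only)
```

With `label_all_tokens=True` every subword of `Zwingmann` carries the label 2; with `label_all_tokens=False` only the first subword does, and the remaining ones are set to -100 so that they do not contribute to the loss.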

In [10]:
# let's test it on a single sample
q = tokenize_and_align_labels(conll2003["train"][4:5])

pd.DataFrame.from_dict({"input_ids": q["input_ids"], 
                        "token_type_ids": q["token_type_ids"],
                        "attention_mask": q["attention_mask"],
                        "labels": q["labels"]}).T

                                                                0
input_ids       [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647...
token_type_ids  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
attention_mask  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
labels          [-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, ...


In [11]:
# let's make it readable
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
  print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___

In [None]:
# let's apply it on the entire dataset
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

##Fine-tuning model

In [None]:
# load the BERT model
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

In [14]:
# prepare the trainer and training parameters
training_args = TrainingArguments("test-ner",
                                  evaluation_strategy="epoch",
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  num_train_epochs=3,
                                  weight_decay=0.01)

In [15]:
# prepare the data collator, which pads the inputs and labels of each batch to the longest sequence in that batch
data_collator = DataCollatorForTokenClassification(tokenizer)
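The padding idea behind the collator can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it assumes right-side padding, BERT's pad token id 0, and -100 as the label padding value. The function name `pad_batch` and the token ids are made up for the example, and the real collator also converts the batch to tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
  """Pad every feature to the longest sequence in this batch (hypothetical helper)."""
  max_len = max(len(f["input_ids"]) for f in features)
  batch = {"input_ids": [], "attention_mask": [], "labels": []}
  for f in features:
    n_pad = max_len - len(f["input_ids"])
    batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
    # padded positions are masked out of attention
    batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    # padded positions get -100 so the loss ignores them
    batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
  return batch

# Two toy examples of different lengths (token ids are arbitrary).
features = [
  {"input_ids": [101, 2001, 2002, 2003, 102], "attention_mask": [1, 1, 1, 1, 1], "labels": [-100, 3, 0, 0, -100]},
  {"input_ids": [101, 2004, 102], "attention_mask": [1, 1, 1], "labels": [-100, 5, -100]},
]

padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```

Because padding is done per batch rather than to a global maximum length, short batches stay short, which is where the memory and speed savings come from.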

In [None]:
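# load the sequence evaluation metric for NER
# (recent releases of `datasets` drop load_metric; there, use evaluate.load("seqeval") from the separate `evaluate` package)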
metric = datasets.load_metric("seqeval")

In [17]:
# let's see how the metric works
example = conll2003["train"][0]

In [18]:
label_list = conll2003["train"].features["ner_tags"].feature.names
labels = [label_list[i] for i in example["ner_tags"]]

metric.compute(predictions=[labels], references=[labels])

{'MISC': {'f1': 1.0, 'number': 2, 'precision': 1.0, 'recall': 1.0},
 'ORG': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 1.0,
 'overall_f1': 1.0,
 'overall_precision': 1.0,
 'overall_recall': 1.0}
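The output above counts entities (`'number': 2` for `MISC`, `'number': 1` for `ORG`) rather than tokens, because precision, recall, and F1 are computed over whole entity spans. The following is a simplified stand-in for that idea, not seqeval's actual implementation. The function names are ours, and treating an `I-` tag with no preceding `B-` as the start of a new entity is an assumption of this sketch.

```python
def extract_entities(tags):
  """Return a set of (type, start, end) spans from a list of BIO tags."""
  entities = set()
  start, etype = None, None
  # the extra "O" closes any entity that runs to the end of the sentence
  for i, tag in enumerate(list(tags) + ["O"]):
    prefix, _, label = tag.partition("-")
    # close the open span unless this tag continues it
    if start is not None and not (prefix == "I" and label == etype):
      entities.add((etype, start, i - 1))
      start, etype = None, None
    if prefix == "B" or (prefix == "I" and start is None):
      start, etype = i, label
  return entities

def span_f1(pred_tags, true_tags):
  """Entity-level precision, recall and F1 for one sentence."""
  pred, true = extract_entities(pred_tags), extract_entities(true_tags)
  correct = len(pred & true)
  precision = correct / len(pred) if pred else 0.0
  recall = correct / len(true) if true else 0.0
  f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
  return {"precision": precision, "recall": recall, "f1": f1}

true_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
# a hypothetical prediction that misses the entity "German"
pred_tags = ["B-ORG", "O", "O", "O", "O", "O", "B-MISC", "O", "O"]

scores = span_f1(pred_tags, true_tags)
print(sorted(extract_entities(true_tags)))
print(scores)
```

In this example the prediction finds two of the three reference entities, so precision is 1.0 and recall is 2/3, even though eight of the nine tokens are tagged correctly.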

In [21]:
# define the function that computes the metrics from the model's predictions
def compute_metrics(preds):
  predictions, labels = preds
  predictions = np.argmax(predictions, axis=2)

  true_predictions = [
     [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
     for prediction, label in zip(predictions, labels)                 
  ]
  true_labels = [
     [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
     for prediction, label in zip(predictions, labels)                 
  ]

  results = metric.compute(predictions=true_predictions, references=true_labels)

  return {
    "precision": results["overall_precision"],
    "recall": results["overall_recall"],
    "f1": results["overall_f1"],
    "accuracy": results["overall_accuracy"],
  }
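The filtering step in `compute_metrics` can be traced on toy data. The sketch below uses plain Python lists instead of NumPy arrays, so `argmax` here is a hand-written stand-in for `np.argmax`; the logits are made up, and `scores_for` and `decode` are hypothetical helpers.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def argmax(row):
  """Index of the largest score in a list (stand-in for np.argmax)."""
  return max(range(len(row)), key=row.__getitem__)

def scores_for(class_id, n=len(label_list)):
  """Toy logits: 1.0 for the chosen class, 0.0 elsewhere."""
  return [1.0 if i == class_id else 0.0 for i in range(n)]

def decode(logits, labels):
  """Drop positions labeled -100 and convert ids to label strings."""
  predictions = [[argmax(scores) for scores in seq] for seq in logits]
  true_predictions = [
    [label_list[p] for p, l in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
  ]
  true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
  return true_predictions, true_labels

# One sequence: [CLS] eu rejects [SEP]; the model wrongly tags "rejects" as B-MISC.
logits = [[scores_for(0), scores_for(3), scores_for(7), scores_for(0)]]
labels = [[-100, 3, 0, -100]]

true_predictions, true_labels = decode(logits, labels)
print(true_predictions)
print(true_labels)
```

The predictions at the `[CLS]` and `[SEP]` positions are discarded because their labels are -100, so only the two real tokens reach the metric.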

In [None]:
# the last steps are to create a trainer and train the model
trainer = Trainer(
  model, training_args, 
  train_dataset=tokenized_datasets["train"],
  eval_dataset=tokenized_datasets["validation"],
  data_collator=data_collator,
  tokenizer=tokenizer,
  compute_metrics=compute_metrics
)

# train the model
trainer.train()

In [23]:
# let's save the model and tokenizer after training
model.save_pretrained("ner_model")

Configuration saved in ner_model/config.json
Model weights saved in ner_model/pytorch_model.bin


In [24]:
tokenizer.save_pretrained("tokenizer")

tokenizer config file saved in tokenizer/tokenizer_config.json
Special tokens file saved in tokenizer/special_tokens_map.json


('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

To use the model with `pipeline`, the saved config file must map class ids to the label names in `label_list`; otherwise the predictions are reported with generic names such as `LABEL_0`. We therefore read the config file and set `label2id` and `id2label` according to `label_list`.

In [25]:
# let's define id-to-label and label-to-id
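# JSON object keys are strings, so id2label uses string keys; label2id maps each label to its integer id
id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}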

In [26]:
# let's load the config file and assign label2id and id2label correctly
with open("ner_model/config.json") as f:
  config = json.load(f)

In [27]:
config["id2label"] = id2label
config["label2id"] = label2id

In [28]:
with open("ner_model/config.json", "w") as f:
  json.dump(config, f)
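The reason `id2label` is written with string keys is that JSON object keys are always strings. A minimal sketch of the round trip, assuming integer ids on the Python side and using only the standard `json` module (nothing is read from or written to `ner_model/config.json` here):

```python
import json

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Mappings with integer ids on the Python side.
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

# Serialize and parse again, as happens when a config is saved and reloaded.
restored = json.loads(json.dumps({"id2label": id2label, "label2id": label2id}))

print(list(restored["id2label"].items())[:2])
print(list(restored["label2id"].items())[:2])
```

After the round trip the keys of `id2label` are strings such as `"3"`, while the values of `label2id` remain integers. As far as we know, `transformers` converts the `id2label` keys back to integers when it loads the config. An alternative that avoids editing the file by hand is to pass `id2label` and `label2id` as keyword arguments to `from_pretrained` when the model is first loaded, so that `save_pretrained` writes them out.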

Afterward, the saved model can be loaded and used with `pipeline`.

In [None]:
fine_tuned_model = AutoModelForTokenClassification.from_pretrained("ner_model")

In [30]:
ner_pipeline = pipeline("ner", model=fine_tuned_model, tokenizer=tokenizer)

In [31]:
example = "I live in New Delhi"

ner_results = ner_pipeline(example)
print(ner_results)

[{'entity': 'B-LOC', 'score': 0.99875283, 'index': 4, 'word': 'new', 'start': 10, 'end': 13}, {'entity': 'I-LOC', 'score': 0.99805224, 'index': 5, 'word': 'delhi', 'start': 14, 'end': 19}]


In [34]:
example = "I live in Mumbai"

ner_results = ner_pipeline(example)
print(ner_results)

[{'entity': 'B-LOC', 'score': 0.9977386, 'index': 4, 'word': 'mumbai', 'start': 10, 'end': 16}]
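The pipeline reports one entry per token, so `New Delhi` comes back as a `B-LOC` token followed by an `I-LOC` token. A minimal sketch of merging such tokens into entity spans using the `start` and `end` character offsets follows; the token dictionaries are copied from the `New Delhi` output above, and `group_entities` is a hypothetical helper rather than a `transformers` function.

```python
def group_entities(text, tokens):
  """Merge consecutive B-/I- tokens of the same type into one span."""
  groups = []
  for tok in tokens:
    prefix, _, label = tok["entity"].partition("-")
    if groups and prefix == "I" and groups[-1]["label"] == label:
      # continuation of the previous entity: extend its end offset
      groups[-1]["end"] = tok["end"]
    else:
      groups.append({"label": label, "start": tok["start"], "end": tok["end"]})
  for group in groups:
    # recover the original surface form, including casing, from the input text
    group["text"] = text[group["start"]:group["end"]]
  return groups

text = "I live in New Delhi"
# Token-level results copied from the pipeline output shown earlier.
tokens = [
  {"entity": "B-LOC", "score": 0.99875283, "index": 4, "word": "new", "start": 10, "end": 13},
  {"entity": "I-LOC", "score": 0.99805224, "index": 5, "word": "delhi", "start": 14, "end": 19},
]

groups = group_entities(text, tokens)
print(groups)
```

This yields a single `LOC` span covering characters 10 to 19, that is, `New Delhi` with its original casing. `pipeline` also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping inside the library, which is likely the better choice in practice.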
