<a href="https://colab.research.google.com/github/wothmag07/genai-bootcamp/blob/main/BERTFinetuning_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q bitsandbytes transformers datasets seqeval evaluate accelerate tokenizers

In [2]:
import datasets
from transformers import BertTokenizerFast, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
import evaluate
import numpy as np
from transformers import pipeline

seqeval = evaluate.load("seqeval")


In [3]:
dataset = datasets.load_dataset("eriktks/conll2003", trust_remote_code=True)

**CoNLL-2003 NER Dataset**

The CoNLL-2003 dataset is a widely used NER benchmark, released for the CoNLL-2003 shared task on language-independent named entity recognition. It covers four entity types:

* PER (Persons)
* LOC (Locations)
* ORG (Organizations)
* MISC (Miscellaneous entities)

**Dataset Format**

In the original CoNLL-2003 files, each token line has four columns:

* Word - The actual token in the sentence.
* POS Tag - The part-of-speech tag.
* Chunk Tag - The syntactic chunking label.
* NER Tag - The named entity label in IOB2 format.

**IOB2 Tagging Scheme**

The dataset follows the IOB2 tagging scheme:

* B-TYPE (Beginning) - Marks the first word of a named entity.
* I-TYPE (Inside) - Marks subsequent words of a named entity.
* O (Outside) - Indicates words that are not part of any named entity.

Each word is placed on a separate line, and sentences are separated by empty lines.

This dataset is commonly used for training NER models with deep learning and machine learning techniques, including LSTMs, CRFs, and Transformers (like BERT).
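To make the tagging scheme concrete, the sketch below groups IOB2 tags back into entity spans. The function name `iob2_to_spans` and the example sentence are illustrative assumptions, not part of the dataset or of any library. A stray `I-` tag with no matching `B-` is simply ignored here, which is one possible convention among several.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    open_type, open_start = None, None  # entity currently being built, if any

    def close(end):
        # Emit the open entity (if there is one) as a finished span
        if open_type is not None:
            spans.append((open_type, open_start, end, " ".join(tokens[open_start:end])))

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any previous one
            close(i)
            open_type, open_start = tag[2:], i
        elif tag.startswith("I-") and tag[2:] == open_type:
            # Continuation of the entity that is currently open
            continue
        else:
            # "O", or an I- tag that does not continue the open entity
            close(i)
            open_type, open_start = None, None
    close(len(tags))
    return spans


# Hypothetical sentence, tagged by hand for illustration
tokens = ["Angela", "Merkel", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

Two multi-word entities come back as single spans, which is exactly the grouping that the `B-`/`I-` distinction is meant to preserve.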

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [5]:
dataset['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [6]:
eg = dataset['train'][70]
eg

{'id': '70',
 'tokens': ['GOV',
  'LAW',
  'GERMAN',
  'HOME',
  'CTRY',
  '=',
  'TAX',
  'PROVS',
  'STANDARD'],
 'pos_tags': [22, 22, 22, 22, 22, 34, 21, 24, 38],
 'chunk_tags': [11, 12, 12, 12, 12, 21, 11, 12, 21],
 'ner_tags': [0, 0, 7, 0, 0, 0, 0, 0, 0]}

In [7]:
dataset['train'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

This dictionary is the feature schema of the CoNLL-2003 dataset. Each sample carries its tokens plus three parallel label sequences, for Part-of-Speech (POS) tagging, chunking, and Named Entity Recognition (NER). Every label is stored as an integer index into the `names` list of its `ClassLabel`. Let's break it down:

1. **id**
   * Type: `string`
   * A unique identifier for each data sample.
2. **tokens**
   * Type: `Sequence(Value(dtype='string'))`
   * The sequence of word tokens in the sentence.
3. **pos_tags** (Part-of-Speech Tags)
   * Type: `Sequence(ClassLabel(...))`
   * Each token has a corresponding POS tag from the Penn Treebank tagset, for example:
      * NN (noun, singular)
      * VB (verb, base form)
      * JJ (adjective)
      * IN (preposition or subordinating conjunction)
   * POS tags support syntactic and semantic analysis of sentences.
4. **chunk_tags** (Syntactic Chunking)
   * Type: `Sequence(ClassLabel(...))`
   * Marks phrase chunks (e.g., noun phrases, verb phrases) using Inside-Outside-Beginning (IOB) tagging:
      * B-NP (Beginning of a noun phrase)
      * I-NP (Inside a noun phrase)
      * B-VP (Beginning of a verb phrase)
      * I-VP (Inside a verb phrase)
      * O (Outside any chunk)
   * Chunk tags group words into meaningful phrases.
5. **ner_tags** (Named Entity Recognition)
   * Type: `Sequence(ClassLabel(...))`
   * Labels each token with an NER tag:
      * B-PER (Beginning of a person's name)
      * I-PER (Inside a person's name)
      * B-ORG (Beginning of an organization)
      * I-ORG (Inside an organization)
      * B-LOC (Beginning of a location)
      * I-LOC (Inside a location)
      * B-MISC (Beginning of a miscellaneous entity)
      * I-MISC (Inside a miscellaneous entity)
      * O (Outside any entity)
   * These are the labels the model is fine-tuned to predict.
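As a quick check of the schema, the sketch below decodes the integer `pos_tags` and `chunk_tags` of sample 70 into their string names. The name lists and tag ids are copied from the outputs above; the helper `decode` is a hypothetical name introduced only for this illustration. With the `datasets` library loaded, `ClassLabel.int2str` should give the same mapping.

```python
# Name lists copied from the `features` output above
POS_NAMES = ['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW',
             'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT',
             'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD',
             'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']
CHUNK_NAMES = ['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ',
               'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT',
               'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']

# Integer tags of training sample 70, as printed earlier
eg_pos_tags = [22, 22, 22, 22, 22, 34, 21, 24, 38]
eg_chunk_tags = [11, 12, 12, 12, 12, 21, 11, 12, 21]


def decode(tag_ids, names):
    """Map integer class ids to their string names."""
    return [names[i] for i in tag_ids]


pos_names = decode(eg_pos_tags, POS_NAMES)
chunk_names = decode(eg_chunk_tags, CHUNK_NAMES)
print(pos_names)
print(chunk_names)
```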

In [8]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
# Padding and truncation are applied per call (and by the data collator), not at load time
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
tokenized_ids = tokenizer(eg["tokens"],is_split_into_words=True)
print(tokenized_ids)
tokens = tokenizer.convert_ids_to_tokens(tokenized_ids["input_ids"])
print(tokens)

{'input_ids': [101, 18079, 2375, 2446, 2188, 14931, 2854, 1027, 4171, 4013, 15088, 3115, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'gov', 'law', 'german', 'home', 'ct', '##ry', '=', 'tax', 'pro', '##vs', 'standard', '[SEP]']


In [10]:
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
ner_pipeline("My name is Wolfgang and I live in Berlin")

Device set to use cuda:0


[{'entity': 'LABEL_4',
  'score': np.float32(0.14410286),
  'index': 1,
  'word': 'my',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_4',
  'score': np.float32(0.1356418),
  'index': 2,
  'word': 'name',
  'start': 3,
  'end': 7},
 {'entity': 'LABEL_6',
  'score': np.float32(0.14858478),
  'index': 3,
  'word': 'is',
  'start': 8,
  'end': 10},
 {'entity': 'LABEL_2',
  'score': np.float32(0.18200429),
  'index': 4,
  'word': 'wolfgang',
  'start': 11,
  'end': 19},
 {'entity': 'LABEL_8',
  'score': np.float32(0.18727687),
  'index': 5,
  'word': 'and',
  'start': 20,
  'end': 23},
 {'entity': 'LABEL_6',
  'score': np.float32(0.16109316),
  'index': 6,
  'word': 'i',
  'start': 24,
  'end': 25},
 {'entity': 'LABEL_0',
  'score': np.float32(0.15144825),
  'index': 7,
  'word': 'live',
  'start': 26,
  'end': 30},
 {'entity': 'LABEL_4',
  'score': np.float32(0.14949417),
  'index': 8,
  'word': 'in',
  'start': 31,
  'end': 33},
 {'entity': 'LABEL_0',
  'score': np.float32(0.1467645),
  '

In [11]:
def label_tokenize(samples, label_all_tokens=True):
    """
    Tokenizes a batch of samples and aligns the word-level NER labels
    with the resulting sub-tokens.
    """
    # Tokenize pre-split words; one word may become several word pieces
    tokenized_inputs = tokenizer(samples['tokens'], truncation=True, is_split_into_words=True)
    labels = []

    for i, label in enumerate(samples['ner_tags']):
        # word_ids maps each sub-token to the index of its source word (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        prev_word_index = None
        label_ids = []

        for word_index in word_ids:
            if word_index is None:
                # Special tokens such as [CLS] and [SEP] get -100 so the loss ignores them
                label_ids.append(-100)
            elif word_index != prev_word_index:
                # The first sub-token of a word takes the word's label
                label_ids.append(label[word_index])
            else:
                # Later sub-tokens either repeat the word's label or are masked with -100
                label_ids.append(label[word_index] if label_all_tokens else -100)
            prev_word_index = word_index
        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

In [12]:
q = label_tokenize(samples=dataset['train'][0:1])
q

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]]}

In [13]:
q["labels"][0]

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]

In [14]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
eu______________________________________ 3
rejects_________________________________ 0
german__________________________________ 7
call____________________________________ 0
to______________________________________ 0
boycott_________________________________ 0
british_________________________________ 7
lamb____________________________________ 0
._______________________________________ 0
[SEP]___________________________________ -100
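The first training sentence contains no word that splits into several word pieces, so it does not show what `label_all_tokens` changes. The sketch below replays the same alignment rule on sample 70, whose tokenization (shown earlier) splits `CTRY` into `ct`, `##ry` and `PROVS` into `pro`, `##vs`. It is a standalone illustration: `align_labels` is a hypothetical helper, and the `word_ids` list is written out by hand from the printed word pieces rather than taken from the tokenizer.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Spread word-level labels over sub-tokens, using -100 for ignored positions."""
    aligned, prev_word_index = [], None
    for word_index in word_ids:
        if word_index is None:
            aligned.append(-100)                      # special token
        elif word_index != prev_word_index:
            aligned.append(word_labels[word_index])   # first piece of a word
        else:
            # continuation piece: copy the label or mask it
            aligned.append(word_labels[word_index] if label_all_tokens else -100)
        prev_word_index = word_index
    return aligned


# Word pieces of sample 70, copied from the tokenizer output above
pieces = ['[CLS]', 'gov', 'law', 'german', 'home', 'ct', '##ry', '=',
          'tax', 'pro', '##vs', 'standard', '[SEP]']
# Assumption: word index of each piece, derived by hand from the list above
word_ids = [None, 0, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, None]
ner_tags = [0, 0, 7, 0, 0, 0, 0, 0, 0]

all_pieces = align_labels(word_ids, ner_tags, label_all_tokens=True)
first_piece_only = align_labels(word_ids, ner_tags, label_all_tokens=False)

for piece, a, b in zip(pieces, all_pieces, first_piece_only):
    print(f"{piece:_<12} {a:>5} {b:>5}")
```

With `label_all_tokens=False`, only `##ry` and `##vs` change, from the word's label to `-100`. One design point worth keeping in mind: with `label_all_tokens=True`, a continuation piece of a word tagged `B-X` is also tagged `B-X`, which an entity-level scorer may count as the start of a second entity.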


In [15]:
tokenized_ds = dataset.map(label_tokenize, batched=True)
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

Listing the labels

In [16]:
label_list=dataset["train"].features["ner_tags"].feature.names
label_id_dict = {i: label for i, label in enumerate(label_list)}
label_id_dict


{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}
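The pipeline output earlier reports generic names such as `LABEL_4` because the model config was created with `num_labels=9` but no label names. A minimal sketch of the two mappings that a Transformers config can carry is shown below; the label list is copied from the output above, and the commented-out `from_pretrained` call is only a suggestion for how the mappings might be passed in, not what this notebook ran.

```python
# Label names in class-id order, copied from the output above
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# id -> name and name -> id mappings
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}

print(id2label[7], label2id['B-MISC'])

# Possible usage (requires transformers; not executed here):
# model = AutoModelForTokenClassification.from_pretrained(
#     "bert-base-uncased",
#     num_labels=len(label_list),
#     id2label=id2label,
#     label2id=label2id,
# )
```

With the mappings stored in the config, the `ner` pipeline would be expected to print tag names such as `B-LOC` in place of `LABEL_5`.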

Let's consider our `eg` sample.

In [17]:
eg

{'id': '70',
 'tokens': ['GOV',
  'LAW',
  'GERMAN',
  'HOME',
  'CTRY',
  '=',
  'TAX',
  'PROVS',
  'STANDARD'],
 'pos_tags': [22, 22, 22, 22, 22, 34, 21, 24, 38],
 'chunk_tags': [11, 12, 12, 12, 12, 21, 11, 12, 21],
 'ner_tags': [0, 0, 7, 0, 0, 0, 0, 0, 0]}

In [18]:
labels = [label_list[i] for i in eg["ner_tags"]]
labels

['O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O']

In [19]:
# for param in model.parameters():
#   print(param)

In [20]:
def compute_metrics(p):
    # p is the (logits, label_ids) pair passed in by the Trainer
    logits, labels = p
    predictions = np.argmax(logits, axis=2)

    # Drop positions labelled -100 (special tokens, padding) and map ids to tag names
    true_predictions = [
        [label_id_dict[pred] for (pred, gold) in zip(prediction, label) if gold != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_id_dict[gold] for (pred, gold) in zip(prediction, label) if gold != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
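The masking step in `compute_metrics` can be traced on a tiny hand-made batch. The sketch below uses plain Python lists in place of NumPy arrays and a hypothetical three-tag label set, so every name and number in it is an assumption made for illustration. It covers only the decoding that happens before the tag sequences are handed to `seqeval`.

```python
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical 3-tag label set


def decode_batch(logits, labels, id2label):
    """Arg-max each token's logits and drop positions whose gold label is -100."""
    pred_tags, gold_tags = [], []
    for sent_logits, sent_labels in zip(logits, labels):
        pred_row, gold_row = [], []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == -100:
                continue  # special token or padding: excluded from scoring
            best = max(range(len(token_logits)), key=lambda k: token_logits[k])
            pred_row.append(id2label[best])
            gold_row.append(id2label[gold])
        pred_tags.append(pred_row)
        gold_tags.append(gold_row)
    return pred_tags, gold_tags


# One sentence, five positions, three scores per position (made-up values)
logits = [[
    [0.1, 0.2, 0.3],  # [CLS], masked out
    [0.2, 2.0, 0.1],  # arg-max 1 -> B-PER
    [0.1, 0.3, 1.5],  # arg-max 2 -> I-PER
    [1.2, 0.4, 0.2],  # arg-max 0 -> O
    [0.9, 0.1, 0.0],  # [SEP], masked out
]]
labels = [[-100, 1, 2, 1, -100]]

pred_tags, gold_tags = decode_batch(logits, labels, ID2LABEL)
print(pred_tags)
print(gold_tags)
```

The two returned lists have the same shape, one tag per scored token, which is the input format the `seqeval` metric expects.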

In [21]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

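args = TrainingArguments("valid-ner",
                         evaluation_strategy="epoch",
                         learning_rate=3e-4,  # well above the 2e-5 to 5e-5 range suggested in the BERT paper
                         per_device_train_batch_size=16,
                         per_device_eval_batch_size=16,
                         num_train_epochs=25,  # the BERT paper suggests 2 to 4 fine-tuning epochs
                         weight_decay=0.01,
                         report_to="none")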

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics = compute_metrics
)



In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2208 | 0.153866 | 0.783666 | 0.807249 | 0.795283 | 0.957663 |
| 2 | 0.1394 | 0.142965 | 0.817209 | 0.828728 | 0.822928 | 0.961348 |
| 3 | 0.1107 | 0.155932 | 0.811806 | 0.809263 | 0.810532 | 0.958235 |
| 4 | 0.0979 | 0.22621 | 0.750453 | 0.786889 | 0.768239 | 0.947496 |
| 5 | 0.0808 | 0.169302 | 0.821925 | 0.826155 | 0.824035 | 0.960808 |
| 6 | 0.5861 | 0.973507 | 0.0 | 0.0 | 0.0 | 0.789108 |
| 7 | 0.9258 | 0.976074 | 0.0 | 0.0 | 0.0 | 0.789108 |
| 8 | 0.9211 | 0.98792 | 0.0 | 0.0 | 0.0 | 0.789108 |
| 9 | 0.9184 | 0.986001 | 0.0 | 0.0 | 0.0 | 0.789108 |
| 10 | 0.9231 | 0.975173 | 0.0 | 0.0 | 0.0 | 0.789108 |


  _warn_prf(average, modifier, msg_start, len(result))
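From epoch 6 onward the log shows precision, recall, and F1 at 0.0 while accuracy stays fixed at 0.789108. That pattern is what one would expect if the model had collapsed to predicting `O` for every token, although the log alone does not prove it. A learning rate of 3e-4 is a plausible cause, since the BERT paper suggests 2e-5 to 5e-5 for fine-tuning. The sketch below collects a more conservative set of hyperparameters as a plain dictionary. The specific values are assumptions to try, not settings validated in this notebook, and the commented-out call shows one way they might be passed to `TrainingArguments`.

```python
# Candidate hyperparameters (assumed starting points, not tuned results)
suggested_args = {
    "learning_rate": 2e-5,               # within the range suggested in the BERT paper
    "num_train_epochs": 3,               # the run above peaked within its first five epochs
    "warmup_ratio": 0.1,                 # ramp the learning rate up over the first 10% of steps
    "weight_decay": 0.01,                # unchanged from the original run
    "per_device_train_batch_size": 16,   # unchanged
    "per_device_eval_batch_size": 16,    # unchanged
    "save_strategy": "epoch",            # must match the evaluation strategy for the next option
    "load_best_model_at_end": True,      # restore the best checkpoint after training
    "metric_for_best_model": "f1",       # key returned by compute_metrics
    "report_to": "none",                 # unchanged
}

for name, value in suggested_args.items():
    print(f"{name} = {value!r}")

# Possible usage (requires transformers; not executed here). Newer releases
# name the evaluation keyword `eval_strategy` instead of `evaluation_strategy`.
# args = TrainingArguments("valid-ner", evaluation_strategy="epoch", **suggested_args)
```

Keeping the best checkpoint matters here because the original run's best F1 (0.824035 at epoch 5) was followed directly by the collapse.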
