# [KerasNLP] Named Entity Recognition using RoBERTa

**Author:** [Usha Rengaraju](https://www.linkedin.com/in/usha-rengaraju-b570b7a2/)<br>
**Date created:** 2023/07/10<br>
**Last modified:** 2023/07/10<br>
**Description:** Named Entity Recognition using pretrained RoBERTa


## Overview

Named entity recognition (NER) is an NLP task that locates named entities in text, such as people, organizations, and locations, and assigns each one a category.

KerasNLP provides a variety of pretrained models. In this guide we build a complete NER pipeline on top of the pretrained RoBERTa backbone.
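
To make the task concrete, here is a minimal plain-Python sketch of how IOB tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity) turn into named entities. The helper `iob_to_entities` is an illustrative name and not part of KerasNLP; the example sentence and its tags are the first training record of the dataset used in this guide.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts: flush the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (outside): close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(iob_to_entities(tokens, tags))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```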


## Imports & setup

This tutorial requires KerasNLP to be installed:

```shell
pip install keras-nlp
```

We also install the Hugging Face `datasets` package and download the `conlleval.py` evaluation script, then import all required packages:

In [None]:
!pip3 install -q datasets
!wget https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py


In [None]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from datasets import load_dataset
from collections import Counter
from conlleval import evaluate
import keras_nlp

## Data loading

This guide uses the
[CoNLL 2003 dataset](https://huggingface.co/datasets/conll2003)
for demonstration purposes.

To get started, we download the dataset with the Hugging Face `datasets` library:

In [None]:
conll_data = load_dataset("conll2003")




In [None]:
def export_to_file(export_file_path, data):
    with open(export_file_path, "w") as f:
        for record in data:
            ner_tags = record["ner_tags"]
            tokens = record["tokens"]
            if len(tokens) > 0:
                f.write(
                    str(len(tokens))
                    + "\t"
                    + "\t".join(tokens)
                    + "\t"
                    + "\t".join(map(str, ner_tags))
                    + "\n"
                )


# Create the output directory; do not fail if it already exists.
os.makedirs("data", exist_ok=True)
export_to_file("./data/conll_train.txt", conll_data["train"])
export_to_file("./data/conll_val.txt", conll_data["validation"])

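Next, we generate the mapping from tag ids to entity labels. Index 0 is reserved for padding, so the dataset's own tag ids are shifted by one during preprocessing.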

In [None]:
def make_tag_lookup_table():
    iob_labels = ["B", "I"]
    ner_labels = ["PER", "ORG", "LOC", "MISC"]
    all_labels = [(label1, label2) for label2 in ner_labels for label1 in iob_labels]
    all_labels = ["-".join([a, b]) for a, b in all_labels]
    all_labels = ["[PAD]", "O"] + all_labels
    return dict(enumerate(all_labels))


mapping = make_tag_lookup_table()
print(mapping)

{0: '[PAD]', 1: 'O', 2: 'B-PER', 3: 'I-PER', 4: 'B-ORG', 5: 'I-ORG', 6: 'B-LOC', 7: 'I-LOC', 8: 'B-MISC', 9: 'I-MISC'}


In [None]:
all_tokens = sum(conll_data["train"]["tokens"], [])
all_tokens_array = np.array(list(map(str.lower, all_tokens)))

counter = Counter(all_tokens_array)
print(len(counter))

num_tags = len(mapping)
vocab_size = 20000
vocabulary = [token for token, count in counter.most_common(vocab_size - 2)]

lookup_layer = keras.layers.StringLookup(
    vocabulary=vocabulary
)

21009


In [None]:
train_data = tf.data.TextLineDataset("./data/conll_train.txt")
val_data = tf.data.TextLineDataset("./data/conll_val.txt")

In [None]:
print(list(train_data.take(1).as_numpy_iterator()))

[b'9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0']
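
Each exported line stores the token count, then the tokens, then the tag ids, all separated by tabs. As a plain-Python sketch of how such a line decodes (the helper `parse_record` and the list `ID_TO_TAG` are illustrative names, not part of the notebook; the `+ 1` shift mirrors the one applied in the preprocessing step):

```python
# Tag names in the same order as the `mapping` printed earlier.
ID_TO_TAG = ["[PAD]", "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def parse_record(line):
    """Split one exported line into tokens and shifted tag ids."""
    fields = line.split("\t")
    length = int(fields[0])
    tokens = fields[1 : length + 1]
    # Shift by one so that id 0 stays reserved for "[PAD]".
    tag_ids = [int(t) + 1 for t in fields[length + 1 :]]
    return tokens, tag_ids


line = "9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0"
tokens, tag_ids = parse_record(line)
for token, tag_id in zip(tokens, tag_ids):
    print(token, ID_TO_TAG[tag_id])
```

Under these assumptions, `EU` decodes to `B-ORG` and both `German` and `British` decode to `B-MISC`; every other token is `O`.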


## Preprocessing Dataset

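To tokenize the text we build a TensorFlow Text `FastWordpieceTokenizer` from the vocabulary of the KerasNLP `bert_base_en_uncased` preset, and then map each record to a pair of token ids and tag ids. Only sentences whose WordPiece token count equals the number of tags are kept, which is why the training and validation sets built below are smaller than the original splits. Note that these ids index BERT's WordPiece vocabulary, whereas the RoBERTa backbone used later was pretrained with its own byte-level BPE vocabulary.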
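
WordPiece can split a single word into several subword pieces, so the number of tokens can exceed the number of word-level tags. The sketch below illustrates this with a toy greedy longest-match tokenizer and a made-up vocabulary, and shows one common alternative to dropping such sentences: repeating each word's tag across its pieces. The functions `wordpiece` and `align_tags` and the vocabulary are assumptions for illustration; this is not what the notebook does, and it is not the `FastWordpieceTokenizer` implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def align_tags(words, tags, vocab):
    """Repeat each word-level tag for every subword piece of that word."""
    out_pieces, out_tags = [], []
    for word, tag in zip(words, tags):
        pieces = wordpiece(word, vocab)
        out_pieces.extend(pieces)
        # The first piece keeps the tag; later pieces of a "B-" word get "I-".
        continuation = "I-" + tag[2:] if tag.startswith("B-") else tag
        out_tags.extend([tag] + [continuation] * (len(pieces) - 1))
    return out_pieces, out_tags


# Toy vocabulary, made up for this illustration.
vocab = {"boy", "##cott", "british", "lamb", "to"}
words = ["to", "boycott", "british", "lamb"]
tags = ["O", "O", "B-MISC", "O"]
pieces, piece_tags = align_tags(words, tags, vocab)
print(pieces)      # ['to', 'boy', '##cott', 'british', 'lamb']
print(piece_tags)  # ['O', 'O', 'O', 'B-MISC', 'O']
```

Here four words become five pieces, so a strict length check would discard the sentence, while the aligned tags keep it usable.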


In [None]:
import tensorflow_text as tf_text
tok = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased", lowercase=True)
tokenizer = tf_text.FastWordpieceTokenizer(tok.vocabulary)

In [None]:

def map_record_to_training_data(record):
    # Each line is: <length> \t <token_1> ... <token_n> \t <tag_1> ... <tag_n>
    record = tf.strings.split(record, sep="\t")
    length = tf.strings.to_number(record[0], out_type=tf.int32)
    # Join the words back into one sentence and tokenize it into WordPiece ids.
    sentence = tf.strings.reduce_join(record[1 : length + 1], separator=" ")
    tokens = tokenizer.tokenize_with_offsets(sentence)[0]
    tags = record[length + 1 :]
    tags = tf.strings.to_number(tags, out_type=tf.int64)
    # Shift the tags by one so that index 0 stays reserved for "[PAD]".
    tags += 1
    return (tokens, tags)


batch_size = 32
# Sentences whose token count differs from their tag count are
# filtered out in the next cells.
train_dataset = train_data.map(map_record_to_training_data)
val_dataset = val_data.map(map_record_to_training_data)


In [None]:
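x_train = []
y_train = []
for x, y in train_dataset:
    # Keep only sentences whose token count equals their tag count,
    # so that tokens and tags line up one-to-one.
    if x.shape == y.shape:
        x_train.append(x)
        # One-hot encode the tags for the categorical cross-entropy loss.
        y_oh = []
        for tag in y:
            t = [0] * num_tags
            t[tag] = 1
            y_oh.append(t)
        y_train.append(y_oh)
len(x_train)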

5416

In [None]:
x_val = []
y_val = []
for x, y in val_dataset:
    # Same filtering and one-hot encoding as for the training set.
    if x.shape == y.shape:
        x_val.append(x)
        y_oh = []
        for tag in y:
            t = [0] * num_tags
            t[tag] = 1
            y_oh.append(t)
        y_val.append(y_oh)
len(x_val)

1205
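
The nested loops above build one one-hot row per tag. As a sketch of the same encoding in vectorized form, indexing an identity matrix with the tag ids gives the same rows in one step. The tag ids below are the shifted tags of the first training sentence; `num_tags = 10` matches the size of the printed mapping.

```python
import numpy as np

num_tags = 10  # size of the tag mapping, including "[PAD]"
tag_ids = np.array([4, 1, 8, 1, 1, 1, 8, 1, 1])  # shifted tags of the first sentence

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the tag ids one-hot encodes the whole sequence at once.
one_hot = np.eye(num_tags, dtype=np.int64)[tag_ids]

print(one_hot.shape)  # (9, 10)
print(one_hot[0])     # [0 0 0 0 1 0 0 0 0 0]
```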

## Model Building

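For this pipeline we define a custom loss, `CustomNonPaddingTokenLoss`, which is intended to ignore padding positions when averaging the categorical cross-entropy, and then create the NER model. The backbone of the model is the pretrained RoBERTa model from KerasNLP in its base configuration (`roberta_base_en`). On top of it, a small head of Dense layers classifies each token into an entity tag.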
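
The idea of a non-padding token loss can be sketched in NumPy, independently of Keras. The function name `masked_cross_entropy` and the toy arrays are assumptions for illustration. With one-hot labels, a position counts as padding when its true class is the padding id (0 in this guide's mapping).

```python
import numpy as np


def masked_cross_entropy(y_true, y_pred, pad_id=0, eps=1e-7):
    """Mean categorical cross-entropy over non-padding positions.

    y_true: one-hot labels, shape (seq_len, num_tags)
    y_pred: predicted probabilities, same shape
    """
    # Per-token loss: -log of the probability assigned to the true class.
    per_token = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1)
    # With one-hot labels, a position is padding when its true class is pad_id.
    mask = (np.argmax(y_true, axis=-1) != pad_id).astype(np.float64)
    return np.sum(per_token * mask) / np.sum(mask)


# Two real tokens followed by one padding position (3 classes, class 0 = padding).
y_true = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=np.float64)
y_pred = np.array([[0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.1, 0.1, 0.8]])

loss = masked_cross_entropy(y_true, y_pred)
expected = -(np.log(0.8) + np.log(0.5)) / 2  # the padding row does not contribute
print(np.isclose(loss, expected))  # True
```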

In [None]:
class CustomNonPaddingTokenLoss(keras.losses.Loss):
    def __init__(self, name="custom_ner_loss"):
        super().__init__(name=name)

    def call(self, y_true, y_pred):
        # Compute a per-token loss so that padding positions can be masked out.
        loss_fn = keras.losses.CategoricalCrossentropy(reduction="none")
        loss = loss_fn(y_true, y_pred)
        # y_true is one-hot encoded; class 0 is "[PAD]".
        mask = tf.cast(tf.argmax(y_true, axis=-1) > 0, dtype=tf.float32)
        loss = loss * mask
        return tf.reduce_sum(loss) / tf.reduce_sum(mask)


loss = CustomNonPaddingTokenLoss()

In [None]:
class NERModel(keras.Model):
    def __init__(self, num_tags, ff_dim=32):
        super().__init__()
        # Pretrained RoBERTa encoder from KerasNLP.
        self.transformer_block = keras_nlp.models.RobertaBackbone.from_preset(
            "roberta_base_en"
        )
        self.dropout1 = layers.Dropout(0.1)
        self.ff = layers.Dense(ff_dim, activation="relu")
        self.dropout2 = layers.Dropout(0.1)
        # Per-token probabilities over the tag set.
        self.ff_final = layers.Dense(num_tags, activation="softmax")

    def call(self, inputs, training=False):
        # `inputs` is a single unpadded sequence of token ids, so every
        # position is a real token and we add a batch dimension of 1.
        token_ids = tf.expand_dims(inputs, axis=0)
        padding_mask = tf.ones_like(token_ids)
        # The backbone takes a dict with "token_ids" and "padding_mask".
        x = self.transformer_block(
            {"token_ids": token_ids, "padding_mask": padding_mask}
        )
        x = self.dropout1(x, training=training)
        x = self.ff(x)
        x = self.dropout2(x, training=training)
        x = self.ff_final(x)
        return x


ner_model = NERModel(num_tags, ff_dim=64)

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=1e-4)
# Instantiate a loss function.
loss_fn = loss
train_acc_metric = keras.metrics.CategoricalAccuracy()
val_acc_metric = keras.metrics.CategoricalAccuracy()

In [None]:
import time

import numpy as np
from tqdm import tqdm


@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        # The model ends in a softmax, so these are probabilities, not logits.
        probs = ner_model(x, training=True)
        loss_value = loss_fn(y, probs)
    grads = tape.gradient(loss_value, ner_model.trainable_weights)
    optimizer.apply_gradients(zip(grads, ner_model.trainable_weights))
    train_acc_metric.update_state(y, probs)
    return loss_value


@tf.function
def test_step(x, y):
    val_probs = ner_model(x, training=False)
    val_acc_metric.update_state(y, val_probs)


train_acc_list = []
train_loss_list = []
epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()
    train_loss = []
    train_loss_batch = []
    # Each step processes a single sentence (batch size of 1).
    for step, (x_batch_train, y_batch_train) in tqdm(enumerate(zip(x_train, y_train))):
        loss_value = train_step(x_batch_train, tf.expand_dims(y_batch_train, axis=0))
        train_loss.append(float(loss_value))
        train_loss_batch.append(float(loss_value))
        if step % 1000 == 0:
            # Mean loss over the steps since the previous report.
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, np.mean(train_loss_batch))
            )
            train_loss_batch = []
            print("Seen so far: %d samples" % (step + 1))
    train_loss_list.append(np.mean(train_loss))
    train_acc = train_acc_metric.result()
    print("Training acc over epoch: %.4f" % (float(train_acc),))
    train_acc_list.append(float(train_acc))
    train_acc_metric.reset_state()
    print("Time taken: %.2fs" % (time.time() - start_time))


Start of epoch 0
Training loss (for one batch) at step 0: 1.3125
Seen so far: 1 samples
Training loss (for one batch) at step 1000: 1.0708
Seen so far: 1001 samples
Training loss (for one batch) at step 2000: 1.0809
Seen so far: 2001 samples
Training loss (for one batch) at step 3000: 1.0715
Seen so far: 3001 samples
Training loss (for one batch) at step 4000: 0.9544
Seen so far: 4001 samples
Training loss (for one batch) at step 5000: 1.0672
Seen so far: 5001 samples
Training acc over epoch: 0.8199
Time taken: 761.06s

Start of epoch 1
Training loss (for one batch) at step 0: 1.3068
Seen so far: 1 samples
Training loss (for one batch) at step 1000: 1.0671
Seen so far: 1001 samples
Training loss (for one batch) at step 2000: 1.0768
Seen so far: 2001 samples
Training loss (for one batch) at step 3000: 1.0724
Seen so far: 3001 samples
Training loss (for one batch) at step 4000: 0.9605
Seen so far: 4001 samples
Training loss (for one batch) at step 5000: 1.0558
Seen so far: 5001 samples
Training acc over epoch: 0.8199
Time taken: 295.22s
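
Token-level accuracy can look high on NER data because most tokens are tagged `O`. The `conlleval` script imported earlier reports entity-level scores for that reason. The sketch below is not that script. It is a simplified illustration of entity-level precision, recall, and F1, assuming gold and predicted entities are already available as sets of hypothetical `(label, start, end)` tuples.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 for sets of (label, start, end)."""
    true_positives = len(gold & pred)  # exact match on label and boundaries
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("ORG", 0, 1), ("MISC", 2, 3), ("MISC", 6, 7)}

# A model that tags every token as "O" predicts no entities at all.
print(entity_prf(gold, set()))  # (0.0, 0.0, 0.0)

# A model that finds two of the three entities and nothing else.
pred = {("ORG", 0, 1), ("MISC", 6, 7)}
precision, recall, f1 = entity_prf(gold, pred)
print(precision, round(recall, 3), round(f1, 3))  # 1.0 0.667 0.8
```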





In [None]:

txt = "eu rejects german call to boycott british lamb"
# Sample inference using the trained model
sample_input = tokenizer.tokenize_with_offsets(txt)[0]

output = ner_model.predict(sample_input)
prediction = np.argmax(output, axis=-1)[0]
prediction = [mapping[i] for i in prediction]

# Expected tags: eu -> B-ORG, german -> B-MISC, british -> B-MISC
print(sample_input)
print(prediction)
# Use `word` as the loop variable to avoid shadowing the `tok` tokenizer.
for word, pred in zip(txt.split(), prediction):
    print(word, pred)

tf.Tensor([ 7327 19164  2446  2655  2000 17757  2329 12559], shape=(8,), dtype=int64)
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
eu O
rejects O
german O
call O
to O
boycott O
british O
lamb O
