# Custom Named Entity Recognition (NER) using Tensorflow

(Original: Named Entity Recognition (NER) for CoNLL dataset with Tensorflow 2.2.0 [https://medium.com/analytics-vidhya/ner-tensorflow-2-2-0-9f10dcf5a0a] by Bhuvana Kundumani)
Reference: https://github.com/bhuvanakundumani/NER_tensorflow2.2.0

Named Entity Recognition (NER) identifies named entities such as ‘America’, ‘Emily’ and ‘London’, and categorizes them as PERSON, LOCATION, and so on. In spaCy, NER is implemented by the pipeline component 'ner'. This notebook details the steps for NER tagging of sentences from the CoNLL-2003 dataset using TensorFlow 2.2.0.

## Data

Data source: https://www.clips.uantwerpen.be/conll2003/ner/

The CoNLL-2003 dataset includes 1,393 English and 909 German news articles; only the English data is used here. Each data file contains four columns separated by a single space: the word, its part-of-speech tag, its syntactic chunk tag, and its named entity tag. Each word is on a separate line, and an empty line follows each sentence.

![CoNLL Data](images/ner1.png)

Only the named entity tags are needed. We extract each word together with its named entity tag into a list of pairs, one list per sentence, for example: [ [‘EU’, ‘B-ORG’], [‘rejects’, ‘O’], [‘German’, ‘B-MISC’], [‘call’, ‘O’], [‘to’, ‘O’], [‘boycott’, ‘O’], [‘British’, ‘B-MISC’], [‘lamb’, ‘O’], [‘.’, ‘O’] ].

In [5]:
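def split_text_label(filename):
  """Read a CoNLL-formatted file into a list of sentences of [word, label] pairs."""
  split_labeled_text = []
  sentence = []
  # The with-block closes the file even if parsing fails
  with open(filename) as f:
    for line in f:
      # A blank line or a -DOCSTART- marker ends the current sentence
      if line.strip() == "" or line.startswith('-DOCSTART'):
        if len(sentence) > 0:
          split_labeled_text.append(sentence)
          sentence = []
        continue
      splits = line.split(' ')
      # Keep the word (first column) and the named entity tag (last column)
      sentence.append([splits[0], splits[-1].rstrip("\n")])
  # Flush the last sentence if the file does not end with a blank line
  if len(sentence) > 0:
    split_labeled_text.append(sentence)
  return split_labeled_text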
split_train = split_text_label("data/conll_ner/train.txt")
split_valid = split_text_label("data/conll_ner/valid.txt")
split_test = split_text_label("data/conll_ner/test.txt")
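
To make the expected output concrete, here is a minimal, self-contained sketch of the same parsing logic applied to an in-memory sample. The sample text and the helper name `split_lines` are hypothetical stand-ins, not part of the original notebook, and the real data files may differ in detail.

```python
import io

# Hypothetical sample in the four-column layout: word, POS tag, chunk tag, NER tag
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def split_lines(lines):
    """Group an iterable of CoNLL lines into sentences of [word, tag] pairs."""
    sentences, sentence = [], []
    for line in lines:
        # Blank lines and -DOCSTART- markers separate sentences
        if line.strip() == "" or line.startswith("-DOCSTART"):
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        columns = line.split(" ")
        # First column is the word, last column is the named entity tag
        sentence.append([columns[0], columns[-1].strip()])
    if sentence:
        sentences.append(sentence)
    return sentences

parsed = split_lines(io.StringIO(SAMPLE))
print(len(parsed))
print(parsed[0][:2])
```

The second sentence has no trailing blank line, so it is only captured by the final flush after the loop.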

Build the vocabulary of all unique words and unique labels (named entities will be referred to as labels) across the three splits (train, valid and test). labelSet contains all the unique labels, i.e. the named entity tags. wordSet contains all the unique words, lowercased.

In [6]:
labelSet = set()
wordSet = set()
# words and labels
for data in [split_train, split_valid, split_test]:
  for labeled_text in data:
    for word, label in labeled_text:
      labelSet.add(label)
      wordSet.add(word.lower())

Associate a unique index with each word and label in the vocabulary. For words, index 0 is reserved for ‘PADDING_TOKEN’ and index 1 for ‘UNKNOWN_TOKEN’. For labels, sorting by length puts ‘O’ first, so ‘O’ gets index 0.

In [7]:
# Sort by length, then alphabetically, so that 'O' is assigned to 0 and the
# remaining labels get the same indices on every run
sorted_labels = sorted(labelSet, key=lambda label: (len(label), label))
# Create mapping for labels
label2Idx = {}
for label in sorted_labels:
  label2Idx[label] = len(label2Idx)
idx2Label = {v: k for k, v in label2Idx.items()}
# Create mapping for words; indices 0 and 1 are reserved
word2Idx = {}
word2Idx["PADDING_TOKEN"] = len(word2Idx)
word2Idx["UNKNOWN_TOKEN"] = len(word2Idx)
# Sorting keeps the word indices stable across runs
for word in sorted(wordSet):
  word2Idx[word] = len(word2Idx)
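
As a small illustration of the mapping, the sketch below builds indices for toy sets (stand-ins for labelSet and wordSet, not the real data; all `toy_` names and `encode` are hypothetical). It breaks ties alphabetically when sorting, which is one way to keep the indices identical from run to run, and it shows an unseen word falling back to ‘UNKNOWN_TOKEN’.

```python
# Toy label and word sets standing in for labelSet and wordSet
toy_labels = {"B-PER", "O", "I-PER", "B-MISC", "B-ORG"}
toy_words = {"eu", "rejects", "german", "call"}

# Sort by (length, text): 'O' is the only one-character tag, so it comes first
ordered = sorted(toy_labels, key=lambda label: (len(label), label))
toy_label2idx = {label: i for i, label in enumerate(ordered)}
toy_idx2label = {i: label for label, i in toy_label2idx.items()}

# Indices 0 and 1 are reserved for the two special tokens
toy_word2idx = {"PADDING_TOKEN": 0, "UNKNOWN_TOKEN": 1}
for word in sorted(toy_words):
    toy_word2idx[word] = len(toy_word2idx)

def encode(tokens):
    """Map tokens to indices, lowercasing and falling back to UNKNOWN_TOKEN."""
    unknown = toy_word2idx["UNKNOWN_TOKEN"]
    return [toy_word2idx.get(token.lower(), unknown) for token in tokens]

print(toy_label2idx)
print(encode(["EU", "rejects", "Brussels"]))
```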

Convert the words and labels in split_train, split_valid and split_test to their respective indices.

In [8]:
def createMatrices(data, word2Idx, label2Idx):
  sentences = []
  labels = []
  for split_labeled_text in data:
    wordIndices = []
    labelIndices = []
    for word, label in split_labeled_text:
      if word in word2Idx:
        wordIdx = word2Idx[word]
      elif word.lower() in word2Idx:
        wordIdx = word2Idx[word.lower()]
      else:
        wordIdx = word2Idx['UNKNOWN_TOKEN']
      wordIndices.append(wordIdx)
      labelIndices.append(label2Idx[label])
    sentences.append(wordIndices)
    labels.append(labelIndices)
  return sentences, labels
train_sentences, train_labels = createMatrices(split_train, word2Idx, label2Idx)
valid_sentences, valid_labels = createMatrices(split_valid, word2Idx, label2Idx)
test_sentences, test_labels = createMatrices(split_test, word2Idx, label2Idx)

The sentences are of different lengths, so we pad the sentences and the labels to make them of equal length. max_seq_len is taken as 128. Padding uses the value 0, which is ‘PADDING_TOKEN’ for words and ‘O’ for labels.

In [13]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences

def padding(sentences, labels, max_len, padding='post'):
  # Pad (or truncate) every sentence and label sequence to max_len
  padded_sentences = pad_sequences(sentences, maxlen=max_len, padding=padding)
  padded_labels = pad_sequences(labels, maxlen=max_len, padding=padding)
  return padded_sentences, padded_labels

# padding sentences and labels to max_length of 128
max_seq_len = 128

train_features, train_labels = padding(train_sentences, train_labels, max_seq_len, padding='post')
valid_features, valid_labels = padding(valid_sentences, valid_labels, max_seq_len, padding='post')
test_features, test_labels = padding(test_sentences, test_labels, max_seq_len, padding='post')
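
The effect of post-padding can be sketched in plain Python. The helper `pad_post` below is a hypothetical illustration, not the Keras implementation. It mimics one detail worth knowing: pad_sequences truncates from the front by default (truncating='pre'), so a sentence longer than max_seq_len would lose its first tokens rather than its last.

```python
def pad_post(seq, max_len, value=0, truncating="pre"):
    """Pure-Python sketch of post-padding one sequence to max_len."""
    if len(seq) > max_len:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-max_len:] if truncating == "pre" else seq[:max_len]
    # Fill the remaining positions on the right with the padding value
    return list(seq) + [value] * (max_len - len(seq))

short = [7, 3, 9]
long = [1, 2, 3, 4, 5, 6]
print(pad_post(short, 5))
print(pad_post(long, 4))
print(pad_post(long, 4, truncating="post"))
```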

We use pre-trained GloVe word embeddings.

Download glove.6B.zip from http://nlp.stanford.edu/data/glove.6B.zip and extract glove.6B.100d.txt into the embeddings folder.

For every word in our vocabulary, we look up its GloVe representation. embedding_matrix holds the GloVe vector of each vocabulary word in the row given by the word's index; words that have no GloVe vector keep an all-zero row.

In [14]:
import numpy as np

EMBEDDING_DIM = 100

# Loading glove embeddings
embeddings_index = {}
with open('embeddings/glove.6B.100d.txt', encoding="utf-8") as f:
  for line in f:
    values = line.strip().split(' ')
    word = values[0] # the first entry is the word
    # the remaining entries are the 100d vector representing the word
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
embedding_matrix = np.zeros((len(word2Idx), EMBEDDING_DIM))
# Word embeddings for the tokens
for word, i in word2Idx.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector
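
It can be useful to check how much of the vocabulary actually receives a pre-trained vector. The sketch below does this in plain Python on made-up data: the three-dimensional vectors, the toy vocabulary and all names in it are hypothetical, and only the file layout (a word followed by space-separated floats) matches the GloVe text format.

```python
import io

# Hypothetical 3-dimensional vectors in the GloVe text layout
FAKE_GLOVE = io.StringIO("the 0.1 0.2 0.3\neu 0.4 0.5 0.6\ncall -0.1 0.0 0.1\n")
toy_word2idx = {"PADDING_TOKEN": 0, "UNKNOWN_TOKEN": 1, "eu": 2, "call": 3, "blackburn": 4}
dim = 3

# Parse each line into word -> list of floats
vectors = {}
for line in FAKE_GLOVE:
    parts = line.strip().split(" ")
    vectors[parts[0]] = [float(x) for x in parts[1:]]

# One row per vocabulary index; rows stay zero when there is no vector
matrix = [[0.0] * dim for _ in range(len(toy_word2idx))]
missing = []
for word, i in toy_word2idx.items():
    if word in vectors:
        matrix[i] = vectors[word]
    else:
        missing.append(word)

coverage = 1 - len(missing) / len(toy_word2idx)
print(missing)
print(round(coverage, 2))
```

Here the two special tokens and one real word have no vector, so their rows remain zero.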

Use tf.data.Dataset.from_tensor_slices for batching and shuffling the dataset.

In [15]:
train_batch_size = 32
valid_batch_size = 64
test_batch_size = 64

train_dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels))
valid_dataset = tf.data.Dataset.from_tensor_slices((valid_features, valid_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_features, test_labels))

shuffled_train_dataset = train_dataset.shuffle(buffer_size=train_features.shape[0], reshuffle_each_iteration=True)

batched_train_dataset = shuffled_train_dataset.batch(train_batch_size, drop_remainder=True)
batched_valid_dataset = valid_dataset.batch(valid_batch_size, drop_remainder=True)
batched_test_dataset = test_dataset.batch(test_batch_size, drop_remainder=True)
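
Note that drop_remainder=True discards a final batch that is smaller than the batch size, so some validation and test sentences may never be evaluated. The arithmetic can be sketched as follows; `batch_counts` is a hypothetical helper and the numbers are toy values, not the real dataset sizes.

```python
def batch_counts(num_examples, batch_size, drop_remainder):
    """Return (number of batches, number of examples left out)."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder:
        # The partial batch is discarded together with its examples
        return full, leftover
    # Otherwise the partial batch is kept as one extra, smaller batch
    return full + (1 if leftover else 0), 0

print(batch_counts(100, 64, drop_remainder=True))
print(batch_counts(100, 64, drop_remainder=False))
print(batch_counts(128, 64, drop_remainder=True))
```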

## Model using Bidirectional LSTM

In [16]:
from tensorflow.keras import layers

class TFNer(tf.keras.Model):
    def __init__(self, max_seq_len, embed_input_dim, embed_output_dim, num_labels, weights):
        super(TFNer, self).__init__()
        # Frozen pre-trained embeddings; mask_zero=True masks the padding index 0
        self.embedding = layers.Embedding(input_dim=embed_input_dim,
                                          output_dim=embed_output_dim, weights=weights,
                                          input_length=max_seq_len, trainable=False, mask_zero=True)
        self.bilstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
        self.dense = layers.Dense(num_labels)

    def call(self, inputs):
        x = self.embedding(inputs) # batchsize, max_seq_len, embedding_output_dim
        x = self.bilstm(x) # batchsize, max_seq_len, hidden_dim_bilstm
        logits = self.dense(x) # batchsize, max_seq_len, num_labels
        return logits

The model applies a bidirectional LSTM to the embeddings, followed by a fully connected layer that maps each LSTM output to one logit per label.

In [17]:
# Number of distinct labels, used as the size of the output layer
num_labels = len(label2Idx)
model = TFNer(max_seq_len=max_seq_len, embed_input_dim=len(word2Idx), embed_output_dim=EMBEDDING_DIM, weights=[embedding_matrix], num_labels=num_labels)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [14]:
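from types import SimpleNamespace
from fastprogress.fastprogress import master_bar, progress_bar

# `args` is not defined anywhere in this notebook; this stand-in and its
# output directory name are placeholders
args = SimpleNamespace(output="model_output")

train_loss_metric = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
valid_loss_metric = tf.keras.metrics.Mean('valid_loss', dtype=tf.float32)

epochs = 10
epoch_bar = master_bar(range(epochs))
# Batches per epoch (drop_remainder=True drops the last partial batch)
train_pb_max_len = len(train_features) // train_batch_size
valid_pb_max_len = len(valid_features) // valid_batch_size

def train_step_fn(sentences_batch, labels_batch):
  with tf.GradientTape() as tape:
    logits = model(sentences_batch)
    loss = scce(labels_batch, logits)
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
  return loss, logits

def valid_step_fn(sentences_batch, labels_batch):
  logits = model(sentences_batch)
  loss = scce(labels_batch, logits)
  return loss, logits

for epoch in epoch_bar:
  for sentences_batch, labels_batch in progress_bar(batched_train_dataset, total=train_pb_max_len, parent=epoch_bar):
    loss, logits = train_step_fn(sentences_batch, labels_batch)
    train_loss_metric(loss)
  # Read the epoch average before resetting the metric
  train_loss = float(train_loss_metric.result())
  train_loss_metric.reset_states()
  for sentences_batch, labels_batch in progress_bar(batched_valid_dataset, total=valid_pb_max_len, parent=epoch_bar):
    loss, logits = valid_step_fn(sentences_batch, labels_batch)
    valid_loss_metric.update_state(loss)
  valid_loss = float(valid_loss_metric.result())
  valid_loss_metric.reset_states()
  print(f"epoch {epoch + 1}: training loss {train_loss:.4f}, valid loss {valid_loss:.4f}")

model.save_weights(f"{args.output}/model_weights", save_format='tf')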



## Evaluating model performance on test dataset

We use precision, recall and F1 score to evaluate the model, computed with the seqeval package. seqeval is a Python framework for sequence labelling evaluation. It can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labelling and so on. Its classification_report function builds a text report showing the main classification metrics.
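
Entity-level scoring counts a prediction as correct only when the entity type and both boundaries match exactly, which is stricter than token accuracy. The plain-Python sketch below illustrates the idea on one toy sentence. It is a simplified illustration, not seqeval's implementation: the helper names are hypothetical, and I- tags that do not continue a preceding span are simply ignored.

```python
def bio_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    # The extra "O" acts as a sentinel that closes a span ending at the last token
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not inside:
            spans.add((kind, start, i - 1))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

def span_prf(true_tags, pred_tags):
    """Precision, recall and F1 over exactly matching spans."""
    gold, pred = bio_spans(true_tags), bio_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

y_true = ["B-PER", "I-PER", "O", "B-LOC", "O"]
y_pred = ["B-PER", "O", "O", "B-LOC", "O"]
print(bio_spans(y_true))
print(span_prf(y_true, y_pred))
```

In this example four of the five tokens are tagged correctly, yet only one of the two entities is matched, because the predicted person span ends one token too early.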

In [16]:
import numpy as np
from fastprogress.fastprogress import progress_bar
from seqeval.metrics import classification_report

def idx_to_label(predictions, correct, idx2Label):
  # predictions and correct are lists of batches; each batch holds one row of indices per sentence
  label_pred = []
  for batch in predictions:
    for sentence in batch:
      label_pred.append([idx2Label[elem] for elem in sentence])
  label_correct = []
  if correct is not None:
    for batch in correct:
      for sentence in batch:
        label_correct.append([idx2Label[elem] for elem in sentence])
  return label_correct, label_pred

test_model = TFNer(max_seq_len=max_seq_len, embed_input_dim=len(word2Idx), embed_output_dim=EMBEDDING_DIM, weights=[embedding_matrix], num_labels=num_labels)
test_model.load_weights(f"{args.output}/model_weights")

# Number of test batches (drop_remainder=True drops the last partial batch)
test_pb_max_len = len(test_features) // test_batch_size

true_labels = []
pred_labels = []
for sentences_batch, labels_batch in progress_bar(batched_test_dataset, total=test_pb_max_len):
  logits = test_model(sentences_batch)
  temp1 = tf.nn.softmax(logits)
  preds = tf.argmax(temp1, axis=2)
  true_labels.append(np.asarray(labels_batch))
  pred_labels.append(np.asarray(preds))

label_correct, label_pred = idx_to_label(pred_labels, true_labels, idx2Label)

report = classification_report(label_correct, label_pred, digits=4)
print(report)
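
A side note on the prediction step: softmax preserves the ordering of the logits, so taking the argmax of the logits directly gives the same labels as taking it after softmax. The plain-Python sketch below checks this on made-up logits; the helper functions are hypothetical and stand in for the TensorFlow operations.

```python
import math

def softmax(row):
    """Numerically stable softmax of one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the list."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for two tokens and three labels
logits = [[2.0, -1.0, 0.5], [0.1, 0.3, 0.2]]
from_logits = [argmax(r) for r in logits]
from_probs = [argmax(softmax(r)) for r in logits]
print(from_logits, from_probs)
```

The softmax is still needed when the probabilities themselves are of interest, for example to inspect how confident a prediction is.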

Pass sufficient examples and a good number of iterations to the spaCy training loop. The loop below trains for 30 iterations.

In [17]:
from spacy.util import minibatch, compounding
import random

# nlp, other_pipes, TRAIN_DATA and optimizer are assumed to come from earlier
# spaCy setup cells that are not shown here; optimizer must be a spaCy
# optimizer, not the Keras Adam optimizer defined above
# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes):

  sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations
  for itn in range(30):
    # shuffle examples before training
    random.shuffle(TRAIN_DATA)
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=sizes)
    # dictionary to store losses
    losses = {}
    for batch in batches:
      texts, annotations = zip(*batch)
      # Calling update() over the iteration
      nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
      print("Losses", losses)

Losses {'ner': 6.135417460280541}
Losses {'ner': 18.055886029082423}
Losses {'ner': 26.98418468125451}
Losses {'ner': 29.93112833611222}
Losses {'ner': 42.66409005037997}
Losses {'ner': 47.60032568394229}
Losses {'ner': 52.556862453945804}
Losses {'ner': 57.99250314309007}
Losses {'ner': 63.61409512900268}
Losses {'ner': 65.15030186584116}
Losses {'ner': 72.25433349458224}
Losses {'ner': 77.04438918883169}
Losses {'ner': 83.47645027281031}
Losses {'ner': 90.64281290115103}
Losses {'ner': 1.9889164222981925}
Losses {'ner': 6.92639906005658}
Losses {'ner': 9.8967614361081}
Losses {'ner': 14.314272661293769}
Losses {'ner': 16.30601631543266}
Losses {'ner': 19.142050187208387}
Losses {'ner': 27.31261651880357}
Losses {'ner': 34.23241859108691}
Losses {'ner': 44.25706630379443}
Losses {'ner': 49.1747580840816}
Losses {'ner': 53.14143339348539}
Losses {'ner': 61.481861582634345}
Losses {'ner': 66.44031007089961}
Losses {'ner': 73.32213563857846}
Losses {'ner': 3.7872812948189676}
...

Losses {'ner': 15.070842868026716}
Losses {'ner': 17.41177810350598}
Losses {'ner': 20.929445350954097}
Losses {'ner': 25.880433935083452}
Losses {'ner': 32.75067881063372}
Losses {'ner': 37.85085251409017}
Losses {'ner': 40.69515348538933}
Losses {'ner': 46.009373735042196}
Losses {'ner': 53.27040446996633}
Losses {'ner': 57.52870198565014}
Losses {'ner': 65.76614764546557}
Losses {'ner': 4.269399568118388}
Losses {'ner': 9.889373769663507}
Losses {'ner': 13.241571735736215}
Losses {'ner': 15.619898533834203}
Losses {'ner': 24.24267604007764}
Losses {'ner': 24.254182549579127}
Losses {'ner': 28.367168055854563}
Losses {'ner': 31.062403062631347}
Losses {'ner': 36.04521516179011}
Losses {'ner': 37.667203489265376}
Losses {'ner': 40.24817592875843}
Losses {'ner': 47.3054819800127}
Losses {'ner': 50.601195993291185}
Losses {'ner': 53.497107728657284}
Losses {'ner': 3.7994599402516087}
Losses {'ner': 7.913474875472957}
Losses {'ner': 8.871293293435656}
Losses {'ner': 13.129610070431}
...
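
The batch sizes passed to minibatch come from compounding(1.0, 4.0, 1.001), which produces a series that starts at 1.0, is multiplied by 1.001 at each step, and is capped at 4.0. The generator below is a hypothetical re-creation of that behaviour for the case where the start is below the stop; it is not spaCy's source code.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

# First few batch-size values
first = list(islice(compounding_sketch(1.0, 4.0, 1.001), 3))
print(first)

# Number of steps taken before the cap of 4.0 is reached
steps = next(i for i, s in enumerate(compounding_sketch(1.0, 4.0, 1.001)) if s >= 4.0)
print(steps)
```

With such a small growth factor the series rises very slowly, so training uses small batches for a long time. The printed losses grow within each iteration because the losses dictionary accumulates over the batches of that iteration and is recreated at the start of the next one.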

## Testing Custom NER

In [21]:
# Testing the NER

test_text = "I ate Sushi yesterday. Maggi is a common fast food "
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent)

Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
Maggi


In [22]:
# Output directory
from pathlib import Path
output_dir=Path('/content/')

# Saving the model to the output directory
if not output_dir.exists():
  output_dir.mkdir()
nlp.meta['name'] = 'my_ner'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to \content


In [27]:
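import spacy

# Loading the model from the directory
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
# move_names is assumed to have been captured from the trained 'ner' pipe in an earlier cell
assert nlp2.get_pipe("ner").move_names == move_names
doc2 = nlp2(' Idli is an extremely famous south Indian dish')
# Test doc2.ents directly: an else attached to the for loop would run even when entities are found
if doc2.ents:
  for ent in doc2.ents:
    print(ent.label_, ent.text)
else:
  print("No Entities found!!")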

Loading from \content
No Entities found!!
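
One Python detail is worth keeping in mind when reporting entities: an else attached to a for loop runs whenever the loop finishes without a break, not only when there is nothing to iterate over. The sketch below contrasts the two ways of writing the report, using a made-up entity list (the label ‘FOOD’ and both function names are hypothetical).

```python
def report_for_else(entities):
    """Collect output lines using for/else; the else branch always runs here."""
    lines = []
    for label, text in entities:
        lines.append(f"{label} {text}")
    else:
        # Reached whenever the loop ends without break, even if entities exist
        lines.append("No Entities found!!")
    return lines

def report_if_else(entities):
    """Collect output lines, adding the message only when the list is empty."""
    lines = [f"{label} {text}" for label, text in entities]
    return lines if lines else ["No Entities found!!"]

found = [("FOOD", "Idli")]
print(report_for_else(found))
print(report_if_else(found))
print(report_if_else([]))
```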
