# Building a NER System

A simple approach to building an NER system is to maintain a large collection of person/organization/location names that are the most relevant to our company (e.g., names of all clients, cities in their addresses, etc.); this is typically referred to as a gazetteer.

Rule-based NER, which can be based on a compiled list of patterns based on word tokens and POS tags.

Train an ML model, which can predict the named entities in unseen text.
- Normal classifier: classify text word by word
- Sequence classifier: looking at the context in which it's being used. (For NER models)

Conditional Random Fields (CRFs), one of the popular sequence classifier training algorithms.

Typical training data for NER follow BIO notation:
- B indicates the beginning of an entity
- I inside an entity, indicates when entities comprise more than one word
- O other, indicates non-entities.

Example: 'Peter' gets tagged as a B-PER

# NER Training

In [9]:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from sklearn_crfsuite import CRF, metrics
from sklearn.metrics import make_scorer,confusion_matrix
from pprint import pprint
from sklearn.metrics import f1_score,classification_report
from sklearn.pipeline import Pipeline
import string

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Yasir Abdur
[nltk_data]     Rohman\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [10]:
# loading data
"""
Load the training/testing data. 
input: conll format data, but with only 2 tab separated colums - words and NEtags.
output: A list where each item is 2 lists.  sentence as a list of tokens, NER tags as a list for each token.
"""
def load__data_conll(file_path):
    myoutput,words,tags = [],[],[]
    fh = open(file_path)
    for line in fh:
        line = line.strip()
        if "\t" not in line:
            #Sentence ended.
            myoutput.append([words,tags])
            words,tags = [],[]
        else:
            word, tag = line.split("\t")
            words.append(word)
            tags.append(tag)
    fh.close()
    return myoutput

In [11]:
"""
Get features for all words in the sentence
Features:
- word context: a window of 2 words on either side of the current word, and current word.
- POS context: a window of 2 POS tags on either side of the current word, and current tag. 
input: sentence as a list of tokens.
output: list of dictionaries. each dict represents features for that word.
"""
def sent2feats(sentence):
    feats = []
    sen_tags = pos_tag(sentence) #This format is specific to this POS tagger!
    for i in range(0,len(sentence)):
        word = sentence[i]
        wordfeats = {}
       #word features: word, prev 2 words, next 2 words in the sentence.
        wordfeats['word'] = word
        if i == 0:
            wordfeats["prevWord"] = wordfeats["prevSecondWord"] = "<S>"
        elif i==1:
            wordfeats["prevWord"] = sentence[0]
            wordfeats["prevSecondWord"] = "</S>"
        else:
            wordfeats["prevWord"] = sentence[i-1]
            wordfeats["prevSecondWord"] = sentence[i-2]
        #next two words as features
        if i == len(sentence)-2:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = "</S>"
        elif i==len(sentence)-1:
            wordfeats["nextWord"] = "</S>"
            wordfeats["nextNextWord"] = "</S>"
        else:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = sentence[i+2]
        
        #POS tag features: current tag, previous and next 2 tags.
        wordfeats['tag'] = sen_tags[i][1]
        if i == 0:
            wordfeats["prevTag"] = wordfeats["prevSecondTag"] = "<S>"
        elif i == 1:
            wordfeats["prevTag"] = sen_tags[0][1]
            wordfeats["prevSecondTag"] = "</S>"
        else:
            wordfeats["prevTag"] = sen_tags[i - 1][1]

            wordfeats["prevSecondTag"] = sen_tags[i - 2][1]
            # next two words as features
        if i == len(sentence) - 2:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = "</S>"
        elif i == len(sentence) - 1:
            wordfeats["nextTag"] = "</S>"
            wordfeats["nextNextTag"] = "</S>"
        else:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = sen_tags[i + 2][1]
        #That is it! You can add whatever you want!
        feats.append(wordfeats)
    return feats

In [12]:
# Extracting features
#Extract features from the conll data, after loading it.
def get_feats_conll(conll_data):
    feats = []
    labels = []
    for sentence in conll_data:
        feats.append(sent2feats(sentence[0]))
        labels.append(sentence[1])
    return feats, labels

In [13]:
# training a model
#Train a sequence model
def train_seq(X_train,Y_train,X_dev,Y_dev):
   # crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50, all_possible_states=True)
    crf = CRF(algorithm='lbfgs', c1=0.1, c2=10, max_iterations=50)#, all_possible_states=True)
    #Just to fit on training data
    crf.fit(X_train, Y_train)
    labels = list(crf.classes_)
    #testing:
    y_pred = crf.predict(X_dev)
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
    print(metrics.flat_f1_score(Y_dev, y_pred,average='weighted', labels=labels))
    print(metrics.flat_classification_report(Y_dev, y_pred, labels=sorted_labels, digits=3))
    #print(metrics.sequence_accuracy_score(Y_dev, y_pred))
    get_confusion_matrix(Y_dev, y_pred,labels=sorted_labels)

In [14]:
def print_cm(cm, labels):
    print("\n")
    """pretty print for confusion matrixes"""
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print("    " + empty_cell, end=" ")
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        sum = 0
        for j in range(len(labels)):
            cell = "%{0}.0f".format(columnwidth) % cm[i, j]
            sum =  sum + int(cell)
            print(cell, end=" ")
        print(sum) #Prints the total number of instances per cat at the end.
        
#python-crfsuite does not have a confusion matrix function, 
#so writing it using sklearn's confusion matrix and print_cm from github
def get_confusion_matrix(y_true,y_pred,labels):
    trues,preds = [], []
    for yseq_true, yseq_pred in zip(y_true, y_pred):
        trues.extend(yseq_true)
        preds.extend(yseq_pred)
    print_cm(confusion_matrix(trues,preds,labels),labels)

In [15]:
train_path = 'Data/conlldata/train.txt'
test_path = 'Data/conlldata/test.txt'
conll_train = load__data_conll(train_path)
conll_dev = load__data_conll(test_path)

print("Training a Sequence classification model with CRF")
feats, labels = get_feats_conll(conll_train)
devfeats, devlabels = get_feats_conll(conll_dev)
train_seq(feats, labels, devfeats, devlabels)
print("Done with sequence model")

Training a Sequence classification model with CRF
0.9255103670420659




              precision    recall  f1-score   support

           O      0.973     0.981     0.977     38323
       B-LOC      0.694     0.765     0.728      1668
       I-LOC      0.738     0.482     0.584       257
      B-MISC      0.648     0.309     0.419       702
      I-MISC      0.626     0.505     0.559       216
       B-ORG      0.670     0.561     0.611      1661
       I-ORG      0.551     0.704     0.618       835
       B-PER      0.773     0.766     0.769      1617
       I-PER      0.819     0.886     0.851      1156

    accuracy                          0.928     46435
   macro avg      0.721     0.662     0.679     46435
weighted avg      0.926     0.928     0.926     46435



                O  B-LOC  I-LOC B-MISC I-MISC  B-ORG  I-ORG  B-PER  I-PER 
         O  37579    118      3     22     32    193    224     88     64 38323
     B-LOC    143   1276      1     36      1     95     14     98      4 1668
     I-LOC     32      6    124      1      5      0     52

In [19]:
conll_train[0]

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']]

In [20]:
feats[0]

[{'word': 'EU',
  'prevWord': '<S>',
  'prevSecondWord': '<S>',
  'nextWord': 'rejects',
  'nextNextWord': 'German',
  'tag': 'NNP',
  'prevTag': '<S>',
  'prevSecondTag': '<S>',
  'nextTag': 'VBZ',
  'nextNextTag': 'JJ'},
 {'word': 'rejects',
  'prevWord': 'EU',
  'prevSecondWord': '</S>',
  'nextWord': 'German',
  'nextNextWord': 'call',
  'tag': 'VBZ',
  'prevTag': 'NNP',
  'prevSecondTag': '</S>',
  'nextTag': 'JJ',
  'nextNextTag': 'NN'},
 {'word': 'German',
  'prevWord': 'rejects',
  'prevSecondWord': 'EU',
  'nextWord': 'call',
  'nextNextWord': 'to',
  'tag': 'JJ',
  'prevTag': 'VBZ',
  'prevSecondTag': 'NNP',
  'nextTag': 'NN',
  'nextNextTag': 'TO'},
 {'word': 'call',
  'prevWord': 'German',
  'prevSecondWord': 'rejects',
  'nextWord': 'to',
  'nextNextWord': 'boycott',
  'tag': 'NN',
  'prevTag': 'JJ',
  'prevSecondTag': 'VBZ',
  'nextTag': 'TO',
  'nextNextTag': 'VB'},
 {'word': 'to',
  'prevWord': 'call',
  'prevSecondWord': 'German',
  'nextWord': 'boycott',
  'nextNextWo

In real-world scenarios, using the trained model by itself won’t be sufficient, as the data keeps changing and new entities keep getting added, and there will also be some domain-specific entities or patterns that were not seen in generic training datasets. Hence, most NER systems deployed in real-world scenarios use a combination of ML models, gazetteers, and some pattern matching–based heuristics to improve their performance 

# NER Using an Existing Library (Spacy)

In [21]:
# Google Colab
# https://colab.research.google.com/drive/1z1hHpd8emVHUhth5hnp-fj0UXLsmWlPZ?usp=sharing

# NER Using Active Learning

The best approach to NER when we want customized solutions but don’t want to train everything from scratch is to start with an off-the-shelf product and either augment it with customized heuristics for our problem domain (using tools such as RegexNER or EntityRuler) and/or use active learning using tools like Prodigy.

TIPS: Start with a pre-trained NER model and enhance it with heuristics, active learning, or both.

In [22]:
# Google Colab NER BERT
# https://colab.research.google.com/drive/1z1hHpd8emVHUhth5hnp-fj0UXLsmWlPZ?usp=sharing

# Practical Advice
- NER is very sensitive to the format of its input. It’s more accurate with well-formatted plain text than with, say, a PDF document from which plain text needs to be extracted first. One approach is to do custom post-processing of PDFs to extract blobs of text, then run NER on the blobs.
- NER is also very sensitive to the accuracy of the prior steps in its processing pipeline: sentence splitting, tokenization, and POS tagging. So, some amount of pre-processing may be necessary before passing a piece of text into an NER model to extract entities.

TIPS: If you’re working with documents, such as reports, etc., pre-process them to extract text blobs, then run NER on them.