<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-5-information-extraction/3_named_entity_recognition_using_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition using BERT

Consider a scenario where the user asks a search query—“Where was Albert Einstein born?”—using Google search.

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-5.png?raw=1' width='800'/>

To be able to show “Ulm, Germany” for this query, the search engine needs to decipher that Albert Einstein is a person before going on to look for a place of birth. This is an example of NER in action in a real-world application.

**NER refers to the information extraction (IE) task of identifying the entities in a document. Entities are typically names of persons, locations, and organizations, as well as other specialized strings, such as money expressions, dates, products, names/numbers of laws or articles, and so on. NER is an important step in the pipeline of several NLP applications involving information extraction.**

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-6.png?raw=1' width='800'/>

As seen in the figure, for a given text, NER is expected to identify person names, locations, dates, and other entities. Different categories of entities identified here are some of the ones commonly used in NER system development.

**NER is a prerequisite for being able to do other IE tasks, such as relation extraction or event extraction**.

NER is also useful in other applications like machine translation, as names
need not necessarily be translated while translating a sentence. So, clearly, there’s a range of scenarios in NLP projects where NER is a major component. It’s one of the common tasks you’re likely to encounter in NLP projects in industry.

## Setup

In [None]:
%tensorflow_version 1.x
!pip install pytorch-pretrained-bert==0.4.0
!pip install seqeval==0.0.12

In [2]:
# importing packages for string processing, dataframe handling, array manipulations, etc.
import string
import pandas as pd
import numpy as np
from tqdm import tqdm, trange

# importing all the pytorch packages
import torch
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertForTokenClassification, BertAdam

# importing additional packages to aid preprocessing of data
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# importing packages to calculate the f1_score of our model
from seqeval.metrics import f1_score

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


Using TensorFlow backend.


## Building an NER System

A simple approach to building an NER system is to maintain a large collection of person/organization/location names that are the most relevant to our company (e.g., names of all clients, cities in their addresses, etc.); this is typically referred to as a gazetteer. To check whether a given word is a named entity or not, just do a lookup in the gazetteer. If a large number of entities present in our data are covered by a gazetteer, then it’s a great way to start, especially when we don’t have an existing NER system available.
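A gazetteer lookup can be sketched in a few lines of plain Python. The entries, the `GAZETTEER` dictionary, and the `gazetteer_tag` helper below are hypothetical names made up for illustration; a real gazetteer would be loaded from your own client or location lists.

```python
# Hypothetical gazetteer: entity type -> set of known names (lowercased).
GAZETTEER = {
    "PER": {"albert einstein", "peter such"},
    "LOC": {"ulm", "germany", "headingley"},
    "ORG": {"essex", "yorkshire"},
}

# Longest entry length in tokens, so we know how far ahead to look.
MAX_LEN = max(len(name.split()) for names in GAZETTEER.values() for name in names)


def gazetteer_tag(tokens):
    """Return (start, end, type) spans found by greedy longest-match lookup."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so that multi-word names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            for ent_type, names in GAZETTEER.items():
                if candidate in names:
                    match = (i, i + n, ent_type)
                    break
            if match:
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


tokens = "Where was Albert Einstein born ?".split()
print(gazetteer_tag(tokens))
```

The lookup is exact, so it misses spelling variants and cannot tell apart names that are ambiguous between types; that limitation is what motivates the approaches described next.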

An approach that goes beyond a lookup table is rule-based NER, which can be based on a compiled list of patterns based on word tokens and POS tags.

For example, a pattern “NNP was born,” where “NNP” is the POS tag for a proper noun, indicates that the word that was tagged “NNP” refers to a person. Such rules can be programmed to cover as many cases as possible to build a rule-based NER system. 

The following tools provide functionality to implement your own rule-based NER:

1. **[Stanford NLP’s RegexNER](https://nlp.stanford.edu/software/regexner.html)**
2. **[spaCy’s EntityRuler](https://spacy.io/usage/rule-based-matching#entityruler)**
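The “NNP was born” pattern described above can be sketched without any library, assuming the sentence has already been POS-tagged. The `apply_born_rule` helper is a hypothetical name, and extending the match leftwards over a run of proper nouns is an added assumption to cover multi-word names; it is not how RegexNER or EntityRuler express rules.

```python
def apply_born_rule(tagged_tokens):
    """Label proper nouns directly followed by 'was born' as PER; everything else as O.

    tagged_tokens: list of (word, pos_tag) pairs.
    """
    labels = ["O"] * len(tagged_tokens)
    for i in range(len(tagged_tokens) - 2):
        _, pos = tagged_tokens[i]
        next_two = [w.lower() for w, _ in tagged_tokens[i + 1:i + 3]]
        if pos == "NNP" and next_two == ["was", "born"]:
            # Assumption: walk left over consecutive NNP tokens to catch full names.
            j = i
            while j >= 0 and tagged_tokens[j][1] == "NNP":
                labels[j] = "PER"
                j -= 1
    return labels


tagged = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Ulm", "NNP"), (".", ".")]
for (word, _), label in zip(tagged, apply_born_rule(tagged)):
    print(word, label)
```

Note that “Ulm” stays unlabeled: a single rule only covers the cases it was written for, so a practical rule-based system needs many such patterns.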

A more practical approach to NER is to train an ML model that can predict the named entities in unseen text. For each word, a decision has to be made whether or not that word is an entity and, if it is, what type of entity it is. In many ways, this is very similar to a classification problem.

**The only difference here is that NER is a “sequence labeling” problem.**

The typical classifiers predict labels for texts independent of their surrounding context. Consider a classifier that classifies sentences in a movie review into positive/negative/neutral categories based on their sentiment. This classifier does not (usually) take into account the sentiment of previous (or subsequent) sentences when classifying the current sentence.

**In a sequence classifier, such context is important. A common use case for sequence labeling is POS tagging, where we need information about the parts of speech of surrounding words to estimate the part of speech of the current word. NER is traditionally modeled as a sequence classification problem, where the entity prediction for the current word also depends on the context.**

For example, if the previous word was a person name, there’s a higher probability that the current word is also a person name if it’s a noun (e.g., first and last names).

To illustrate the difference between a normal classifier and a sequence classifier, consider the following sentence: “Washington is a rainy state.” When a normal classifier sees this sentence and has to classify it word by word, it has to make a decision as to whether Washington refers to a person (e.g., George Washington) or the State of Washington without looking at the surrounding words. It’s possible to classify the word “Washington” in this particular sentence as a location only after looking at the context in which it’s being used. It’s for this reason that sequence classifiers are used
for training NER models.

**Conditional random fields (CRFs) are one of the most popular algorithms for training sequence classifiers.**

Recent advances in NER research either drop or augment the kind of feature engineering used in the CRF example in this notebook with neural network models. [NCRF++](https://github.com/jiesutd/NCRFpp) is a library that can be used to train your own NER system using different neural network architectures. This notebook uses the BERT model to train an NER system on the same dataset.

NCRF++ is a PyTorch-based framework with flexible choices of input features and output structures. The design of neural sequence labeling models with NCRF++ is fully configurable through a configuration file, so no coding is required. NCRF++ can be regarded as a neural network version of CRF++, a well-known statistical CRF framework.

To perform sequence classification, we need data in a format that allows us to model the context. Typical training data for NER looks like the example below, which is a sentence from the CoNLL-03 dataset.


<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-7.png?raw=1' width='800'/>

The labels in the figure follow what’s known as BIO notation: B marks the beginning of an entity; I (inside) marks a token that continues an entity spanning more than one word; and O (other) marks non-entity tokens. Peter Such is a name with two words in the example shown above.

Thus, “Peter” gets tagged as a B-PER, and “Such” gets tagged as an I-PER to indicate that Such is a part of the entity from the previous word. The remaining entities in this example, Essex, Yorkshire, and Headingley, are
all one-word entities. So, we only see B-ORG and B-LOC as their tags. Once we
obtain a dataset of sentences annotated in this form and we have a sequence classifier algorithm, how should we train an NER system?
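As a brief aside before answering that question, BIO tags can be turned back into entity spans with a few lines of code. The sketch below uses a made-up sentence built from the names mentioned above (it is not the exact sentence in the figure), and `bio_to_spans` is a hypothetical helper name.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such stray I- tokens are simply dropped in this sketch).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Peter", "Such", "played", "for", "Essex", "against",
          "Yorkshire", "at", "Headingley", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-ORG", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

The two-word name comes back as a single PER span, while the one-word entities each produce a span from their B- tag alone.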

The steps are the same as those for the text classifiers:
1. Load the dataset
2. Extract the features
3. Train the classifier
4. Evaluate it on a test set


### Loading the data

Loading the dataset is straightforward. This particular dataset is also already split into a train/dev/test set. So, we’ll train the model using the training set.

In [3]:
%%shell

wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conlldata/test.txt
wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conlldata/train.txt

mkdir conlldata
mv *.txt conlldata

--2020-12-31 09:47:52--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conlldata/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376236 (367K) [text/plain]
Saving to: ‘test.txt’


2020-12-31 09:47:52 (15.5 MB/s) - ‘test.txt’ saved [376236/376236]

--2020-12-31 09:47:52--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conlldata/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1655711 (1.6M) [text/plain]
Saving to: ‘train.txt’


2020-12-31 09:47:53 (24.1 MB/s) -



In [4]:
def load__data_conll(file_path):
  """
  Load the training/testing data.
  input: CoNLL-format data, but with only 2 tab-separated columns: words and NE tags.
  output: a list where each item is 2 lists: the sentence as a list of tokens, and the NER tag of each token.
  """
  myoutput, words, tags = [], [], []
  # the context manager closes the file even if parsing fails midway
  with open(file_path) as fh:
    for line in fh:
      line = line.strip()
      if "\t" not in line:
        # Sentence ended.
        myoutput.append([words, tags])
        words, tags = [], []
      else:
        word, tag = line.split("\t")
        words.append(word)
        tags.append(tag)
  return myoutput
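To make the expected file format concrete, here is a minimal sketch that parses an in-memory sample in the same two-column layout. `parse_conll_lines` is a hypothetical variant of the loader, not the notebook's function: as an added assumption, it skips repeated blank lines and keeps the last sentence even when the input has no trailing blank line.

```python
def parse_conll_lines(lines):
    """Parse an iterable of 'word<TAB>tag' lines into [[words, tags], ...]."""
    sentences, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if "\t" in line:
            word, tag = line.split("\t")
            words.append(word)
            tags.append(tag)
        elif words:
            # A line without a tab closes the current (non-empty) sentence.
            sentences.append([words, tags])
            words, tags = [], []
    if words:
        # Keep the final sentence even without a trailing blank line.
        sentences.append([words, tags])
    return sentences


# Made-up sample in the two-column format: one token per line, blank line between sentences.
sample = "Peter\tB-PER\nSuch\tI-PER\nbowled\tO\n\nEssex\tB-ORG\nwon\tO"
for words, tags in parse_conll_lines(sample.splitlines()):
    print(list(zip(words, tags)))
```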

### Extract the features

Let’s look at an example using handcrafted features this time. What features seem intuitively relevant for this task? To identify names of people or places, for example, patterns such as whether the word starts with an uppercase character or whether it’s preceded or succeeded by a verb/noun, etc., can be used as starting points to train an NER model.

The following function extracts the previous and next words, along with their POS tags, for each word in a given sentence.

In [5]:
# assumed to be NLTK's tagger (its model must be fetched once with nltk.download)
from nltk.tag import pos_tag

"""
Get features for all words in the sentence.
Features:
- word context: the current word and a window of 2 words on either side of it.
- POS context: the current POS tag and a window of 2 POS tags on either side of it.
input: sentence as a list of tokens.
output: list of dictionaries. Each dict represents the features for one word.
"""
def sent2features(sentence):
  features = []
  sent_tags = pos_tag(sentence)   # list of (word, tag) pairs. This format is specific to this POS tagger!
  for i in range(0, len(sentence)):
    word = sentence[i]
    wordfeatures = {}
    # word features: word, prev 2 words, next 2 words in the sentence.
    wordfeatures["word"] = word

    if i == 0:
      wordfeatures["prevWord"] = wordfeatures["prevSecondWord"] = "<S>"
    elif i == 1:
      wordfeatures["prevWord"] = sentence[0]
      wordfeatures["prevSecondWord"] = "<S>"   # start-of-sentence padding
    else:
      wordfeatures["prevWord"] = sentence[i - 1]
      wordfeatures["prevSecondWord"] = sentence[i - 2]

    # next two words as features
    if i == len(sentence) - 2:
      wordfeatures["nextWord"] = sentence[i + 1]
      wordfeatures["nextNextWord"] = "</S>"
    elif i == len(sentence) - 1:
      wordfeatures["nextWord"] = "</S>"
      wordfeatures["nextNextWord"] = "</S>"
    else:
      wordfeatures["nextWord"] = sentence[i + 1]
      wordfeatures["nextNextWord"] = sentence[i + 2]

    # POS tag features: current tag, previous and next 2 tags.
    wordfeatures["tag"] = sent_tags[i][1]
    if i == 0:
      wordfeatures["prevTag"] = wordfeatures["prevSecondTag"] = "<S>"
    elif i == 1:
      wordfeatures["prevTag"] = sent_tags[0][1]
      wordfeatures["prevSecondTag"] = "<S>"   # start-of-sentence padding
    else:
      wordfeatures["prevTag"] = sent_tags[i -1][1]
      wordfeatures["prevSecondTag"] = sent_tags[i - 2][1]

    # next two tags as features
    if i == len(sentence) - 2:
      wordfeatures["nextTag"] = sent_tags[i + 1][1]
      wordfeatures["nextNextTag"] = "</S>"
    elif i == len(sentence) - 1:
      wordfeatures["nextTag"] = "</S>"
      wordfeatures["nextNextTag"] = "</S>"
    else:
      wordfeatures["nextTag"] = sent_tags[i + 1][1]
      wordfeatures["nextNextTag"] = sent_tags[i + 2][1]
    
    # That is it! You can add whatever you want!
    features.append(wordfeatures)
  return features
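The feature function above covers word and POS context but not the capitalization cue mentioned earlier. A minimal sketch of such orthographic features follows; `shape_features` and its feature names are hypothetical and are not part of the notebook's code.

```python
def shape_features(word):
    """Orthographic features for a single token (all names here are illustrative)."""
    return {
        "isCapitalized": word[:1].isupper(),  # starts with an uppercase character
        "isAllCaps": word.isupper(),          # e.g. acronyms such as "EU"
        "isDigit": word.isdigit(),            # e.g. years such as "1996"
        "hasHyphen": "-" in word,
        "suffix3": word[-3:].lower(),         # last three characters
    }


for word in ["Washington", "EU", "1996", "rainy"]:
    print(word, shape_features(word))
```

Inside the loop of a function like `sent2features`, these could be merged into the per-word dictionary with `wordfeatures.update(shape_features(word))`.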

In [7]:
# preprocess the data by calling the functions
train_path = 'conlldata/train.txt'
test_path = 'conlldata/test.txt'

conll_train = load__data_conll(train_path)
conll_test = load__data_conll(test_path)

### Pre-process the text data according to BERT

BERT needs us to pre-process the data in a particular way.
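One part of that pre-processing is that BERT works on subword (WordPiece) tokens and wraps each sequence in `[CLS]` and `[SEP]` markers, so word-level NER tags have to be carried over to the pieces. The sketch below illustrates one common convention (repeating a word's tag on each of its pieces); it is not necessarily what this notebook does. `toy_wordpiece` is a made-up stand-in that chops words into fixed-size chunks so the example runs without downloading a vocabulary; in practice the pieces would come from the `BertTokenizer` imported above.

```python
def toy_wordpiece(word, max_piece_len=4):
    """Stand-in for a WordPiece tokenizer: chop a word into fixed-size pieces.

    Continuation pieces get the '##' prefix, as in BERT vocabularies.
    """
    if not word:
        return []
    pieces = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, tags, tokenize=toy_wordpiece):
    """Tokenize word by word and repeat each word's tag for all of its pieces."""
    # Assumption: the special tokens are labeled "O" here for simplicity.
    pieces, piece_tags = ["[CLS]"], ["O"]
    for word, tag in zip(words, tags):
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        piece_tags.extend([tag] * len(word_pieces))
    pieces.append("[SEP]")
    piece_tags.append("O")
    return pieces, piece_tags


words = ["Headingley", "is", "in", "Leeds"]
tags = ["B-LOC", "O", "O", "B-LOC"]
pieces, piece_tags = align_labels(words, tags)
for piece, tag in zip(pieces, piece_tags):
    print(piece, tag)
```

Whatever convention is chosen, the token and tag sequences must stay the same length, since the model predicts one label per input piece.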

Let’s take the raw data from the .txt files.

In [8]:
df_train = pd.read_csv("conlldata/train.txt", engine="python", delimiter="\t", header=None, encoding='utf-8', error_bad_lines=False)
df_test = pd.read_csv("conlldata/test.txt", engine="python", delimiter="\t", header=None, encoding='utf-8', error_bad_lines=False)

Skipping line 23407: unexpected end of data


In [10]:
# stack the train and test dataframes row-wise
df = pd.concat([df_train, df_test], ignore_index=True)

# we will be using this to make a set of all unique labels
label = list(df[1].values)
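A list of labels like this is typically reduced to a set of unique tags and mapped to integer ids, since the model works with class indices rather than strings. The sketch below shows one way to do that; `build_tag_index` and the short `labels` list are hypothetical stand-ins for the notebook's own variables.

```python
def build_tag_index(labels):
    """Map each unique tag to an integer id and back (sorted for reproducibility)."""
    tag_values = sorted(set(labels))
    tag2idx = {tag: idx for idx, tag in enumerate(tag_values)}
    idx2tag = {idx: tag for tag, idx in tag2idx.items()}
    return tag2idx, idx2tag


# Illustrative stand-in for the full label column of the dataframe.
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]
tag2idx, idx2tag = build_tag_index(labels)

encoded = [tag2idx[tag] for tag in labels]
decoded = [idx2tag[idx] for idx in encoded]
print(tag2idx)
print(encoded)
```

Sorting the tags keeps the mapping stable across runs, which matters when a saved model is later used to decode predictions.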

In [11]:
# calculating the size
np.array(conll_train).shape

  


(14041, 2)

In [12]:
np.array(conll_test).shape

(3453, 2)

We need to join all the tokens into a single sentence. We will use the untokenize function in token_utils from [this GitHub repo](https://github.com/commonsense/metanl/blob/master/metanl/token_utils.py).
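To give a rough idea of what joining tokens involves, here is a much simplified stand-in. It is not the metanl implementation: `simple_untokenize` is a hypothetical helper that joins tokens with spaces and then removes the space before common punctuation and English clitics.

```python
import re


def simple_untokenize(tokens):
    """Join tokens into a sentence, attaching punctuation and clitics to the previous word."""
    text = " ".join(tokens)
    # "Ulm , Germany ." -> "Ulm, Germany."
    text = re.sub(r"\s+([.,!?;:%])", r"\1", text)
    # "Einstein 's" -> "Einstein's", "do n't" -> "don't"
    text = re.sub(r"\s+('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


print(simple_untokenize(["Einstein", "was", "born", "in", "Ulm", ",", "Germany", "."]))
```

Quotes, brackets, and currency symbols need extra rules that this sketch leaves out.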

In [None]:
if __name__ == "__main__":
  train_path = "conlldata/train.txt"
  test_path = "conlldata/test.txt"

  # 1. Load the dataset
  conll_train = load__data_conll(train_path)
  conll_dev = load__data_conll(test_path)

  # 2. Extract the features
  print("Training a Sequence classification model with CRF")
  features, labels = get_features_conll(conll_train)
  devfeatures, devlabels = get_features_conll(conll_dev)
  #print(features.shape, labels.shape)
  #print(devfeatures.shape, devlabels.shape)

  # 3. Train the classifier
  train_sequence(features, labels, devfeatures, devlabels)
  print("Done with sequence model")

Training a Sequence classification model with CRF
0.9255103670420659
              precision    recall  f1-score   support

           O      0.973     0.981     0.977     38323
       B-LOC      0.694     0.765     0.728      1668
       I-LOC      0.738     0.482     0.584       257
      B-MISC      0.648     0.309     0.419       702
      I-MISC      0.626     0.505     0.559       216
       B-ORG      0.670     0.561     0.611      1661
       I-ORG      0.551     0.704     0.618       835
       B-PER      0.773     0.766     0.769      1617
       I-PER      0.819     0.886     0.851      1156

    accuracy                          0.928     46435
   macro avg      0.721     0.662     0.679     46435
weighted avg      0.926     0.928     0.926     46435



                O  B-LOC  I-LOC B-MISC I-MISC  B-ORG  I-ORG  B-PER  I-PER 
         O  37579    118      3     22     32    193    224     88     64 38323
     B-LOC    143   1276      1     36      1     95     14     98   

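Training this CRF model gave an F1 score of about 0.93 (0.9255, in line with the weighted average in the report) on the development data, loaded here from test.txt, which is a very good score! Keep in mind that this average is dominated by the frequent O tag: the macro-averaged F1 across all tags is 0.679, and B-MISC reaches only 0.419.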
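The gap between the averages in the classification report above can be checked by hand. The sketch below recomputes them from the rounded per-tag F1 scores and supports in the report, so the results match the report only up to rounding. The entity-only average is an extra view added here for illustration; it is not something the notebook computes.

```python
# Per-tag (F1, support) pairs copied from the classification report above.
report = {
    "O": (0.977, 38323),
    "B-LOC": (0.728, 1668),
    "I-LOC": (0.584, 257),
    "B-MISC": (0.419, 702),
    "I-MISC": (0.559, 216),
    "B-ORG": (0.611, 1661),
    "I-ORG": (0.618, 835),
    "B-PER": (0.769, 1617),
    "I-PER": (0.851, 1156),
}

total_support = sum(n for _, n in report.values())

# Macro average: every tag counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every tag counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

# Macro average over the entity tags only, i.e. excluding "O" (illustrative extra view).
entity_f1s = [f1 for tag, (f1, _) in report.items() if tag != "O"]
entity_macro_f1 = sum(entity_f1s) / len(entity_f1s)

print(f"total support:        {total_support}")
print(f"macro F1:             {macro_f1:.3f}")
print(f"weighted F1:          {weighted_f1:.3f}")
print(f"entity-only macro F1: {entity_macro_f1:.3f}")
```

Because the O tag accounts for most of the 46,435 tokens, the weighted score sits near 0.93 while the unweighted averages are far lower, so the per-tag rows are the better guide to how well entities are actually recognized.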

Here, we showed some of the most commonly used features in learning an NER system and used a popular training method and a publicly available dataset.

Clearly, there’s a lot to be done in terms of tuning the model and developing
(even) better features; this example only serves to illustrate one way of developing an NER model quickly using one particular library. [MITIE](https://github.com/mit-nlp/MITIE) is another such library to train NER systems.