# Overview

This Jupyter notebook demonstrates training and serving custom named entity recognition (NER) models, which are used to identify named entities such as locations, times, and people in text documents. NER is used in a number of business applications such as powering recommender systems, simplifying customer support, and optimizing internal search engines. 

The notebook is broken down into the following three sections:
   * NER packages: An overview of the language, license, and methodology of production-ready NER packages
   * NER with SpaCy: Code examples for training and serving a custom NER model in SpaCy
   * NER with TensorFlow: Code examples for creating, training, and serving a custom deep-learning NER model with TensorFlow

# 1. Named Entity Recognition Research & Examples



## Named entity recognition packages

NER can be implemented with either statistical or rule-based methods, both of which require a large amount of labeled training data and are typically trained in a fully or semi-supervised manner. Statistical approaches to NER include Hidden Markov Models, Maximum Entropy, and Conditional Random Fields, as well as deep learning approaches with Recurrent Neural Networks, such as Seq2Seq. All of these processes involve sentence inputs and annotated sentence outputs. Many of these processes also involve additional feature engineering, providing as input summary statistics of the sentences.

Many of the production-ready NER packages, such as GATE, OpenNLP, and DBpedia Spotlight, are written in Java and served as Docker containers. SpaCy is perhaps the most frequently used NER package in Python.


| Name | Language   | License | Method        |
|------|------------|--------|----------------|
| [GATE](https://github.com/GateNLP/gateplugin-Java) | Java       |   GPLv3     |  Hidden Markov |
|[OpenNLP](https://opennlp.apache.org)| Java      |     Apache 2.0   | Maximum Entropy / Rule-based |
|[DBPedia](https://opennlp.apache.org)| Java      |    Apache 2.0    | Rule-based |
|[SpaCy](https://spacy.io)| Python/Cython      |    MIT  | Convolutional NN |

## NER Methods

NER may be implemented with a variety of statistical and rule-based methods with varying amounts of feature engineering. All production-ready NER methods are at least semi-supervised, though unsupervised approaches are an emerging research topic.

#### Supervised statistical

Supervised statistical approaches to NER typically use either Hidden Markov Models (HMM), Maximum Entropy (ME), or Conditional Random Fields (CRF). OpenNLP's statistical NER relies on ME. GATE relies on HMM.

Typical features engineered for NER include orthography, n-grams, lexicons, suffixes and prefixes, unsupervised cluster features, and trigger words for named entities (such as "river" or "lake"). These features are generated algorithmically in a rule-based manner.
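As a rough illustration of this kind of rule-based feature generation, the sketch below builds a feature dictionary for a single token. The function name `token_features`, the feature names, and the trigger-word set are hypothetical choices for this example and are not taken from any of the packages listed above.

```python
def token_features(tokens, i, trigger_words=frozenset({"river", "lake"})):
    """Build a small, hand-crafted feature dict for tokens[i] (illustrative only)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        # orthographic features
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        # prefix and suffix features
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # context (bigram-style) features
        "prev_word": prev_word,
        "next_word": next_word,
        # trigger-word feature: does a neighbouring word hint at an entity?
        "prev_is_trigger": prev_word in trigger_words,
    }


tokens = "We camped beside Lake Tahoe in 2019".split()
feats = token_features(tokens, 4)  # features for "Tahoe"
```

A statistical tagger such as a CRF would typically consume one such dictionary per token, together with the token's label, during training.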

    
#### Supervised rule-based

OpenNLP contains rule-based (as well as statistical) NER. The rule-based approach relies on a series of regular expression matches. The feature generation seems to be done with a beam search to determine the word context.

DBpedia Spotlight performs NER with substring matching using the Aho-Corasick algorithm. The approach uses only tokenization, with no other feature engineering. It proceeds in two steps. The first step generates all candidate annotations that form known labels; it is rule-based in that it identifies nouns, prepositions, capitalized words, and known entities, and it relies on OpenNLP under the hood. The second step selects the best of the proposed candidates: each candidate is scored by annotation probability using a version of tf-idf in which article links and anchor texts take the place of documents and terms.
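To make the substring-matching step more concrete, here is a minimal pure-Python sketch of the Aho-Corasick idea: a trie of known surface forms with failure links, scanned once over the text. It is not DBpedia Spotlight's implementation; the function names `build_automaton` and `find_all` and the example surface forms are assumptions made for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # inherit patterns that end at the failure target
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start, end, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, i + 1, pat))
    return matches


surface_forms = ["steven spielberg", "spielberg", "al pacino"]
automaton = build_automaton(surface_forms)
candidates = find_all("films by steven spielberg", automaton)
```

Overlapping candidates (here both the full name and the surname) are all reported, which matches the description of a first step that over-generates and a second step that selects the best candidates.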


#### Supervised deep learning

**[SpaCy](https://spacy.io),** which is one of the most popular productionized NER environments, **uses residual convolutional neural networks (CNN) and incremental parsing with Bloom embeddings for NER.** See [this](https://www.youtube.com/watch?v=sqDHBH9IjRU) YouTube explanation from the developers for more detail. To summarize the algorithm, 1D convolutional filters are applied over the input text to predict how the upcoming words may change the current entity tags. Upcoming words may either shift (change the entity), reduce (make the entity more granular), or output the entity. The input sequence is embedded with Bloom embeddings, which model the characters, prefix, suffix, and part of speech of each word. Residual blocks are used for the CNNs, and the filter sizes are chosen with beam search.
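The hashing idea behind Bloom embeddings can be sketched in a few lines: each key is hashed several times into a small table, and the selected rows are summed. This is only a toy illustration under assumed settings, not SpaCy's implementation; `bloom_rows`, `make_table`, `embed`, the table size, and the number of hashes are all hypothetical.

```python
import hashlib
import random


def bloom_rows(key, n_rows, n_hashes=4):
    """Map a string key to n_hashes row indices using differently seeded hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.md5("{}:{}".format(seed, key).encode("utf-8")).digest()
        rows.append(int.from_bytes(digest[:8], "big") % n_rows)
    return rows


def make_table(n_rows, width, seed=0):
    """Create a small table of randomly initialised rows (trainable in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)]


def embed(key, table):
    """Embed a key as the sum of the table rows selected by its hashes."""
    vec = [0.0] * len(table[0])
    for row in bloom_rows(key, len(table)):
        for j, value in enumerate(table[row]):
            vec[j] += value
    return vec


table = make_table(n_rows=1000, width=8)
vector = embed("spielberg", table)
```

Because several rows are summed, two keys that collide in one row are still likely to differ in another, which is what lets the table stay much smaller than the vocabulary.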

Recurrent neural network (RNN) approaches to NER also exist, typically comprising long short-term memory (LSTM) networks at either the word or character level, relying on word or character embeddings, respectively (e.g. word2vec, GloVe, fastText).
    
## NER Datasets

Although there are a number of datasets for NER in other languages, here we will focus on English datasets. Many of the NER datasets are domain-specific (e.g. Twitter, biomedical, advertising, news). A few standard NER datasets are described below to show the range of domain applications of NER.
   * [i2b2](https://www.i2b2.org/NLP/DataSets/) - Medications, treatments, diseases, and risk factors
   * [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/) - English and German news articles annotated with location, organization, person, and miscellaneous
    
    
## NER Evaluation metrics

NER is most commonly evaluated with precision, recall, and F1-score. F1-score can either be relaxed or strict, with the latter requiring the character offsets to match exactly. 
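The difference between strict and relaxed scoring can be shown with a small sketch. Strict matching follows the definition above (label and character offsets must agree exactly). The relaxed variant here is an assumption for illustration: it counts a prediction as correct when the label matches and the spans overlap. The helper names `prf`, `strict_matches`, and `relaxed_matches` are hypothetical.

```python
def prf(n_correct, n_pred, n_true):
    """Precision, recall, and F1 from raw counts."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def strict_matches(pred, true):
    """Count predictions whose (start, end, label) matches a true entity exactly."""
    return len(set(pred) & set(true))


def relaxed_matches(pred, true):
    """Count predictions with the right label that overlap a not-yet-matched true entity."""
    unmatched = list(true)
    count = 0
    for p_start, p_end, p_label in pred:
        for candidate in unmatched:
            t_start, t_end, t_label = candidate
            if p_label == t_label and p_start < t_end and t_start < p_end:
                unmatched.remove(candidate)
                count += 1
                break
    return count


# "show me films with drew barrymore from the 1980s"
true = [(19, 33, "ACTOR"), (43, 48, "YEAR")]
pred = [(24, 33, "ACTOR"), (43, 48, "YEAR")]  # only "barrymore" was tagged as ACTOR

strict = prf(strict_matches(pred, true), len(pred), len(true))
relaxed = prf(relaxed_matches(pred, true), len(pred), len(true))
```

In this example the truncated actor span is penalized under strict scoring but accepted under relaxed scoring, so the strict F1-score is 0.5 while the relaxed F1-score is 1.0.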
    
# 2. Named entity recognition with SpaCy

This section demonstrates named entity recognition with SpaCy, an open-source NLP library written in Python and Cython. It offers pre-trained models for multi-language NER and allows developers to train and deploy custom NER models on domain-specific corpora. SpaCy models are designed to be production-ready.

SpaCy's pretrained models are trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus, and support the identification of the [following entities](https://spacy.io/models/en): 


| Type | Description |
|------|-------------|
| PERSON | People, including fictional |
| NORP | Nationalities or religious or political groups |
| FAC | Buildings, airports, highways, bridges, etc |
| ORG | Companies, agencies, institutions, etc |
| GPE | Countries, cities, states |
| LOC | Non-GPE locations, mountain ranges, bodies of water |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc |
| WORK_OF_ART | Titles of books, songs, etc |
| LAW | Named documents made into laws |
| LANGUAGE | Any named language |
| DATE | Absolute or relative dates or periods |
| TIME | Times smaller than a day |
| PERCENT | Percentage, including “%” |
| MONEY | Monetary values, including unit |
| QUANTITY | Measurements, as of weight or distance |
| ORDINAL | “first”, “second”, etc |
| CARDINAL | Numerals that do not fall under another type |

While these pretrained models are often sufficient for general applications, we will consider a domain-specific application of NER on the [MIT Movies corpus](https://groups.csail.mit.edu/sls/downloads/movie/), which contains 10,000 queries about various aspects of movies, with the following entity labels:

| Type | Example |
|------|---------|
| ACTOR | Matt Damon |
| YEAR | 1980s |
| TITLE | Pulp Fiction |
| GENRE | science fiction |
| DIRECTOR | George Lucas |
| SONG | Aerosmith |
| PLOT | Flying cars |
| REVIEW | must see |
| CHARACTER | Queen Elizabeth |
| RATING | PG-13 |
| RATINGS_AVERAGE | best rated |
| TRAILER | preview |

As these tables show, the pretrained SpaCy models would not be sufficient to identify entities to help answer a question such as "did george clooney make a science fiction movie in the 1980s?" While the pre-trained model may identify the presence of `PERSON`, `DATE`, and `PRODUCT`, a custom model should be able to detect `ACTOR`, `GENRE`, and `YEAR`. In the following sections, we will compare the results of applying a pre-trained and a custom-trained model to the MIT movies corpus.

### Install and import required packages

In [1]:
import sys
!{sys.executable} -m pip install spacy # !{sys.executable} ensures package installation in conda env



In [2]:
import spacy
import random
import time
import numpy as np
from spacy.util import minibatch, compounding

### Load and transform data

Create the data directory if it doesn't exist

In [3]:
from os import path, mkdir
if not path.isdir("data/"):
    mkdir("data/")
if not path.isdir("models/"):
    mkdir("models/")

Download the test and training datasets from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL)

In [4]:
!curl https://groups.csail.mit.edu/sls/downloads/movie/engtest.bio -o data/test.txt
!curl https://groups.csail.mit.edu/sls/downloads/movie/engtrain.bio -o data/train.txt



<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig1.png" width="700" align = "left"/>

SpaCy requires training data in the format `TRAIN_DATA = [(sentence, {'entities': [(start, end, label)]}), ...]`, where `start` and `end` are character offsets into the sentence. The `load_data_spacy` function parses and transforms the input data into this format.
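A quick way to sanity-check this format is to slice the sentence with each `(start, end)` pair and confirm that the slice is the entity text. The sketch below does this for one sentence and annotation taken from the training data shown later in this notebook; the helper name `entity_texts` is hypothetical.

```python
# One training example in the (sentence, {'entities': [...]}) format
example = (
    "show me films with drew barrymore from the 1980s",
    {"entities": [(19, 33, "ACTOR"), (43, 48, "YEAR")]},
)


def entity_texts(sentence, annotations):
    """Return (text, label) for each annotated span, using character offsets."""
    return [(sentence[start:end], label)
            for start, end, label in annotations["entities"]]


spans = entity_texts(*example)
# spans == [("drew barrymore", "ACTOR"), ("1980s", "YEAR")]
```

The `end` offset is exclusive, as in Python slicing, so an entity that closes the sentence ends at `len(sentence)`.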

In [5]:
def load_data_spacy(file_path):
    ''' Converts BIO-tagged data of the form:
    label \t word \n label \t word \n \n label \t word
    to: [sentence, {'entities' : [(start, end, label), (start, end, label)]}]
    '''
    file = open(file_path, 'r')
    training_data, entities, sentence, unique_labels = [], [], [], []
    current_annotation = None
    end = 0 # initialize counter to keep track of start and end characters
    for line in file:
        line = line.strip("\n").split("\t")
        # lines with len > 1 are words
        if len(line) > 1:
            label = line[0][2:]     # the .txt is formatted: tag \t word; tag[2:] is the entity label ('' for "O")
            label_type = line[0][0] # beginning of annotation - "B", inside - "I", outside - "O"
            word = line[1]
            sentence.append(word)
            end += (len(word) + 1)  # length of the word + trailing space

            if label_type != 'I' and current_annotation:  # if at the end of an annotation
                entities.append((start, end - 2 - len(word), current_annotation))  # append the annotation
                current_annotation = None                 # reset the annotation
            if label_type == 'B':                         # if beginning new annotation
                start = end - len(word) - 1  # start annotation at beginning of word
                current_annotation = label   # remember the label of the open annotation
            if label_type == 'I':            # if the annotation is multi-word
                current_annotation = label   # keep the annotation open

            if label and label not in unique_labels:  # skip the empty label produced by "O" tags
                unique_labels.append(label)
 
        # lines with len == 1 are breaks between sentences
        if len(line) == 1: 
            if current_annotation:
                entities.append((start, end - 1, current_annotation))
            sentence = " ".join(sentence)
            training_data.append([sentence, {'entities' : entities}])
            # reset the counters and temporary lists
            end = 0            
            entities, sentence = [], []
            current_annotation = None
    file.close()
    return training_data, unique_labels            
            
TRAIN_DATA, LABELS = load_data_spacy("data/train.txt")

### Data overview

Sample sentences from the training data, which contains queries about movie information

In [6]:
[x[0] for x in TRAIN_DATA[1:10]]

['show me films with drew barrymore from the 1980s',
 'what movies starred both al pacino and robert deniro',
 'find me all of the movies that starred harold ramis and bill murray',
 'find me a movie with a quote about baseball in it',
 'what movies have mississippi in the title',
 'show me science fiction films directed by steven spielberg',
 'do you have any thrillers directed by sofia coppola',
 'what leonard cohen songs have been used in a movie',
 'show me films elvis films set in hawaii']

<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig2.png" align = "left" width="600"/>

Sample labeled annotations for the training data

In [7]:
[x[1] for x in TRAIN_DATA[1:10]]

[{'entities': [(19, 33, 'ACTOR'), (43, 48, 'YEAR')]},
 {'entities': [(25, 34, 'ACTOR'), (39, 52, 'ACTOR')]},
 {'entities': [(39, 51, 'ACTOR'), (56, 67, 'ACTOR')]},
 {'entities': []},
 {'entities': [(17, 28, 'TITLE')]},
 {'entities': [(8, 29, 'GENRE'), (42, 58, 'DIRECTOR')]},
 {'entities': [(16, 25, 'GENRE'), (38, 51, 'DIRECTOR')]},
 {'entities': [(5, 24, 'SONG')]},
 {'entities': [(14, 19, 'ACTOR'), (26, 39, 'PLOT')]}]

<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig3.png" align = "left" width="500"/>

### Test pre-trained NER Model

First, download the pre-trained model with a subprocess call.

In [8]:
!{sys.executable} -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


The pretrained model fails to identify any genres, plots, actors, directors, characters, movie titles, or ratings present in the movie queries. Interestingly, it also fails to identify persons, works of art, and products! Clearly, the pretrained model does not fit this domain application, so we will train our own model from scratch.


In [9]:
from spacy import displacy
import warnings
warnings.filterwarnings("ignore")
nlp = spacy.load('en')
TEST_DATA, _ = load_data_spacy("data/test.txt")

test_sentences = [x[0] for x in TEST_DATA[0:15]] # extract the sentences from [sentence, entity]
for x in test_sentences:
    doc = nlp(x)
    displacy.render(doc, jupyter = True, style = "ent")
warnings.filterwarnings("default")

<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig4.png" align = "left" width="500"/>

### Train and save custom NER model

In [10]:
# A simple decorator to log function processing time
def timer(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        print("Completed in {} seconds".format(int(te - ts)))
        return result
    return timed

# Data must be of the form (sentence, {entities: [start, end, label]})
@timer
def train_spacy(train_data, labels, iterations, dropout = 0.2, display_freq = 1):
    ''' Train a spacy NER model, which can be queried against with test data
    
    train_data : training data in the format of (sentence, {entities: [(start, end, label)]})
    labels : a list of unique annotations
    iterations : number of training iterations
    dropout : dropout proportion for training
    display_freq : number of epochs between logging losses to console
    '''
    nlp = spacy.blank('en')
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    else:
        ner = nlp.get_pipe('ner')  # reuse the existing component so `ner` is always defined
    
    # Add entity labels to the NER pipeline
    for i in labels:
        ner.add_label(i)

    # Disable other pipelines in SpaCy to only train NER
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        nlp.vocab.vectors.name = 'spacy_model' # without this, spaCy throws an "unnamed" error
        optimizer = nlp.begin_training()
        for itr in range(iterations):
            random.shuffle(train_data) # shuffle the training data before each iteration
            losses = {}
            batches = minibatch(train_data, size = compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(           
                    texts,
                    annotations, 
                    drop = dropout,   
                    sgd = optimizer,
                    losses = losses)
            if itr % display_freq == 0:
                print("Iteration {} Loss: {}".format(itr + 1, losses))
    return nlp

# Train (and save) the NER model
ner = train_spacy(TRAIN_DATA, LABELS,6)
ner.to_disk("models/spacy_example")



Iteration 1 Loss: {'ner': 19326.44994279179}
Iteration 2 Loss: {'ner': 12763.510975522777}
Iteration 3 Loss: {'ner': 10998.607198777125}
Iteration 4 Loss: {'ner': 9898.289678454643}
Iteration 5 Loss: {'ner': 9015.409992271101}
Iteration 6 Loss: {'ner': 8368.51944459541}
Completed in 2319 seconds




<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig5.png" align = "left" width="400"/>
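The `compounding(4., 32., 1.001)` argument passed to `minibatch` in `train_spacy` produces a batch-size schedule that starts at 4 and grows by a factor of 1.001 per batch, capped at 32. The sketch below re-implements that idea in plain Python so it runs without SpaCy. The names `compounding_sizes` and `minibatches` are hypothetical stand-ins, and they may differ in detail from `spacy.util.compounding` and `spacy.util.minibatch`.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def minibatches(items, sizes):
    """Split items into consecutive batches whose lengths follow the sizes generator."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            break
        yield batch


# An exaggerated growth rate makes the schedule easy to see on ten items
batches = list(minibatches(range(10), compounding_sizes(2.0, 4.0, 2.0)))
batch_lengths = [len(b) for b in batches]

# With the notebook's settings the size grows very slowly
first_sizes = [int(s) for s in islice(compounding_sizes(4.0, 32.0, 1.001), 3)]
```

Small early batches give noisier but more frequent updates at the start of training, while larger later batches give smoother gradient estimates.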

### Test model on new sentences

To test the model on new sentences, the `load_model` function reloads the trained model weights, and `load_data_spacy` loads and transforms the test data. SpaCy's `displacy` visualizer renders the predictions for the first 15 test sentences. As the results show, the model has learned good representations of the entities, although a few errors remain. Some of these may be mitigated with more training (the loss was still decreasing after 6 iterations), while others may require additional pre-processing, such as fixing spelling mistakes.

In [11]:
from spacy import displacy

def load_model(model_path):
    ''' Loads a pre-trained model for prediction on new test sentences
    
    model_path : directory of model saved by spacy.to_disk
    '''
    nlp = spacy.blank('en') 
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    ner = nlp.from_disk(model_path)
    return ner

ner = load_model("models/spacy_example")

TEST_DATA, _ = load_data_spacy("data/test.txt")

test_sentences = [x[0] for x in TEST_DATA[0:15]] # extract the sentences from [sentence, entity]
for x in test_sentences:
    doc = ner(x)
    displacy.render(doc, jupyter = True, style = "ent")



<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig6.png" align = "left" width="500"/>

### Evaluation Metrics

Model performance is assessed on the entirety of the test dataset (2,443 sentences) based on the following metrics and their definitions.

   * Precision: true positives / (true positives + false positives)
   * Recall: true positives / (true positives + false negatives)
   * F1-score: harmonic mean of precision and recall

In [12]:
def calc_precision(pred, true):        
    precision = len([x for x in pred if x in true]) / (len(pred) + 1e-20) # true positives / total pred
    return precision

def calc_recall(pred, true):
    recall = len([x for x in true if x in pred]) / (len(true) + 1e-20)    # true positives / total test
    return recall

def calc_f1(precision, recall):
    f1 = 2 * ((precision * recall) / (precision + recall + 1e-20))
    return f1
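The list-membership test used above counts every predicted label that appears anywhere in the true labels, so a sentence with predictions `['ACTOR', 'ACTOR']` and a single true `'ACTOR'` gets a precision of 1.0. As an alternative sketch (not the method used in this notebook), labels can be matched as multisets with `collections.Counter` and the counts pooled over all sentences (micro-averaging). The names `label_counts` and `micro_prf` are hypothetical.

```python
from collections import Counter


def label_counts(pred_labels, true_labels):
    """Return (true positives, number predicted, number true) using multiset overlap."""
    overlap = Counter(pred_labels) & Counter(true_labels)
    return sum(overlap.values()), len(pred_labels), len(true_labels)


def micro_prf(pairs):
    """Pool counts over (pred_labels, true_labels) pairs, then compute P, R, and F1."""
    tp = n_pred = n_true = 0
    for pred_labels, true_labels in pairs:
        correct, predicted, actual = label_counts(pred_labels, true_labels)
        tp += correct
        n_pred += predicted
        n_true += actual
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pairs = [
    (["ACTOR", "ACTOR"], ["ACTOR"]),  # one spurious ACTOR prediction
    ([], []),                         # sentence with no entities
]
precision, recall, f1 = micro_prf(pairs)
```

Pooling the counts also means that sentences without any entities no longer contribute a score of zero to the average.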

In [13]:
from itertools import chain

# run the predictions on each sentence in the test dataset, and return the spacy object
preds = [ner(x[0]) for x in TEST_DATA]

precisions, recalls, f1s = [], [], [] 

# iterate over predictions and test data and calculate precision, recall, and F1-score
for pred, true in zip(preds, TEST_DATA): 
    true = [x[2] for x in list(chain.from_iterable(true[1].values()))] # gold labels: x[2] of each (start, end, label) in true[1]
    pred = [i.label_ for i in pred.ents] # predicted labels: i.label_ of each entity span in pred.ents
    precision = calc_precision(pred, true) # arguments follow the (pred, true) order of the definitions
    precisions.append(precision)
    recall = calc_recall(pred, true)
    recalls.append(recall)
    f1s.append(calc_f1(precision, recall))
    
print("Precision: {} \nRecall: {} \nF1-score: {}".format(np.around(np.mean(precisions), 3),
                                                         np.around(np.mean(recalls), 3),
                                                         np.around(np.mean(f1s), 3)))


Precision: 0.877 
Recall: 0.865 
F1-score: 0.863


<img src="https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.3/kfp-components/notebooks/entity_extraction/assets/fig7.png" align = "left" width="200"/>