In [225]:
# Import libraries
import spacy
import pandas as pd
import numpy as np

# Introduction

spaCy is an open-source library for NLP. It is widely used because of its flexible and advanced features. Before diving into how NER is implemented in spaCy, let's start by understanding what a Named Entity Recognizer is.

Named Entity Recognition (NER) is a standard NLP task that identifies the entities discussed in a text document. A Named Entity Recognizer is a model that performs this task. It should be able to identify named entities like 'New Zealand', 'John', 'Auckland', etc., and categorize them as PERSON, LOCATION, and so on. It is a very useful tool and helps in information retrieval. In spaCy, NER is implemented by the pipeline component ner. Most of the models have it in their processing pipeline by default.

First we load a spaCy model and check whether the ner component is in its pipeline.

In [226]:
# Load a spaCy model and check whether it has ner (the pre-trained Named Entity Recognizer component)
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

As shown above, ner is in the pipeline, so we can proceed to categorize our entities with the baseline, pre-trained model.

# Custom NER model

As we saw, spaCy has a built-in pipeline component, ner, for NER. Although it performs well, it is not always accurate for your text. Sometimes a word can be categorized as a PERSON or an ORG depending on the context. Also, the category we want may not be built into spaCy.

Let's have a look at how the default NER performs on survey responses.

In [227]:
# Import the survey file (read_csv already returns a DataFrame)
survey_responses = pd.read_csv("Survey response sample data.csv")

In [228]:
text = []
label = []
for i in range(0,len(survey_responses)):
    default_entities = nlp(survey_responses.iloc[i,1])
    for ent in default_entities.ents:
        text.append(ent.text)
        label.append(ent.label_)

In [229]:
Output = pd.DataFrame({'Text':text,'Label':label})
Output

Index,Text,Label
0,"Gentlemen, Outlander",WORK_OF_ART
1,London,GPE
2,Dublin,GPE
3,Neon,ORG
4,Undoing,ORG
5,CB Strike,PERSON
6,British,NORP


As shown in the table above, the default NER gets some of the entities wrong. For instance, it categorizes CB Strike as a PERSON and Undoing as an ORG. Also, many more movie or show titles, such as Fear the walking dead, Supernatural and others, have not been placed in the WORK_OF_ART category. In a case like this, we need to update and train the NER model to fit the context and requirements. The next section shows how to do it.

# Updating the Named Entity Recognizer

In the previous section, we saw why we need to update and train the NER. Now let's go ahead and see how to do it.

Our task is to make sure the NER recognizes movie and show titles under a dedicated label (Movie_Show_Titles in the training data below). To enable this, we need to provide training examples from which the NER can learn to label future samples. To do this, let's take an existing pre-trained spaCy model and update it with new examples. First, load a pre-existing spaCy model with a built-in ner component. Then get the Named Entity Recognizer using the get_pipe() method.

In [230]:
#load pre-existing spacy model
import spacy
nlp = spacy.load('en_core_web_sm')

#getting the pipeline component
ner = nlp.get_pipe("ner")


In order to update the pre-trained model with new examples, we will have to provide many examples to meaningfully improve the system.

# Format the training examples

spaCy accepts training data as a list of tuples.

Each tuple should contain the text and a dictionary. The dictionary should hold the start and end character indices of each named entity in the text, together with the category or label of that entity.

For example, each item has the form (text, {'entities': [(start, end, label), ...]}).
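To make the format concrete, here is a minimal sketch of a single training tuple. The sentence and the `find_span` helper are made up for illustration and are not part of spaCy; the helper only uses `str.find` to work out the character offsets, so the example runs without any model.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


# Hypothetical survey response, not taken from the real data
text = "I watched Outlander and Vikings last week"

# One training item: the raw text plus a dictionary of entity spans
example = (
    text,
    {"entities": [
        find_span(text, "Outlander", "Movie_Show_Titles"),
        find_span(text, "Vikings", "Movie_Show_Titles"),
    ]},
)

# Slicing the text with each offset pair should give back the title
for start, end, label in example[1]["entities"]:
    print(start, end, label, repr(text[start:end]))
```

Slicing the text with each (start, end) pair is a quick way to confirm that an offset really covers the title it is meant to cover.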

However, creating a training set manually is exhausting and time-consuming. I am using the spacy-ner-annotator tool (https://github.com/ManivannanMurugavel/spacy-ner-annotator) to create the training set.

Now we need to add these labels to the ner component using its add_label() method. The code below defines the training data and then adds the labels:

In [231]:
#import the converter script
import convert_spacy_train_data

In [272]:
# Training format: each entity is (start_char, end_char, label) with start < end
train_data = [
    ('Fear the walking dead,Supernatural (huge fan and sad it has finished),The Gentlemen, Outlander',
     {'entities': [(85, 94, 'Movie_Show_Titles'), (70, 83, 'Movie_Show_Titles'), (22, 34, 'Movie_Show_Titles'), (0, 21, 'Movie_Show_Titles')]}),
    ('Miss scarlet and the duke,knifes out,Dublin murders',
     {'entities': [(37, 51, 'Movie_Show_Titles'), (26, 36, 'Movie_Show_Titles'), (0, 25, 'Movie_Show_Titles')]}),
    ('A lot!-good doctor-gangs of London�\xa0- the gentleman-ma�\xa0-spies in disguise�\xa0',
     {'entities': [(55, 73, 'Movie_Show_Titles'), (51, 53, 'Movie_Show_Titles'), (37, 50, 'Movie_Show_Titles'), (19, 34, 'Movie_Show_Titles'), (7, 18, 'Movie_Show_Titles')]}),
    ('The Undoing,Game of thrones,Outlander, Vikings,CB Strike (and most all British dramas) Westworld',
     {'entities': [(47, 56, 'Movie_Show_Titles'), (39, 46, 'Movie_Show_Titles'), (28, 37, 'Movie_Show_Titles'), (12, 27, 'Movie_Show_Titles'), (0, 11, 'Movie_Show_Titles')]}),
]
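Because hand-made or tool-generated offsets are easy to get wrong, it can help to sanity-check them before training. The sketch below is one possible check, not a spaCy feature: `check_offsets` is a hypothetical helper, and the sample item is invented so that it contains one deliberately reversed span.

```python
def check_offsets(data):
    """Return (text, span, problem) entries for spans that look wrong."""
    problems = []
    for text, annotations in data:
        for start, end, label in annotations.get("entities", []):
            span = (start, end, label)
            if start >= end:
                problems.append((text, span, "start is not before end"))
            elif end > len(text):
                problems.append((text, span, "end is past the end of the text"))
            elif text[start:end] != text[start:end].strip():
                problems.append((text, span, "span has leading or trailing whitespace"))
    return problems


# Invented sample: the second span is written as (end, start) on purpose
sample = [
    ("Outlander, Vikings",
     {"entities": [(0, 9, "Movie_Show_Titles"), (18, 11, "Movie_Show_Titles")]}),
]

for text, span, problem in check_offsets(sample):
    print(span, "->", problem)
```

Running the same helper over `train_data` before the training step would list any spans that need a second look.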

In [273]:
# Add every label that appears in the training data to the ner component
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

Now it is time to train the NER on these examples. Before we train, remember that apart from ner, the model has other pipeline components. These components should not be affected by training. Hence, we disable the other pipeline components with the nlp.disable_pipes() method.

In [274]:
# Disable pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
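To see what ends up in `unaffected_pipes`, the same list comprehension can be run on a plain list. In this sketch the pipe names are hard-coded from the `nlp.pipe_names` output shown earlier (an assumption that the same model is loaded), so it runs without spaCy.

```python
# Pipe names copied from the nlp.pipe_names output earlier in the notebook
pipe_names = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Components that should stay active during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# Everything else will be disabled while ner is updated
unaffected_pipes = [pipe for pipe in pipe_names if pipe not in pipe_exceptions]
print(unaffected_pipes)
```

The two trf_ names do not appear in this pipeline, so they have no effect here; only ner is left enabled.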

Now train the model with unaffected_pipes disabled.

# Train the NER model

In [275]:
# Import requirements
import random
from spacy.util import minibatch
from spacy.training.example import Example


# TRAINING THE MODEL
with nlp.disable_pipes(unaffected_pipes):

    # Training for 30 iterations
    for iteration in range(30):

        # Shuffle the examples before every iteration
        random.shuffle(train_data)
        losses = {}
        # Batch up the examples using spaCy's minibatch (batches of 2)
        for batch in minibatch(train_data, size=2):
            for text, annotations in batch:
                # Create an Example from the raw text and its annotations
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                # Update the model; losses accumulates within the iteration
                nlp.update([example], losses=losses, drop=0.5)
            # Print the running loss after each batch
            print("Losses", losses)

Losses {'ner': 0.0006535927592013421}
Losses {'ner': 1.2728909784336284}
Losses {'ner': 19.234741273561404}
Losses {'ner': 19.234819016233278}
Losses {'ner': 5.920956446019839}
Losses {'ner': 5.92096343075455}
Losses {'ner': 2.961780950514288}
Losses {'ner': 2.9617810814022745}
Losses {'ner': 1.3283611132283022e-09}
Losses {'ner': 3.793948932436468}
Losses {'ner': 0.0007439994851964275}
Losses {'ner': 1.354505801537168}
Losses {'ner': 0.0003145630991667814}
Losses {'ner': 1.3460260736468397}
Losses {'ner': 3.322381851598655e-07}
Losses {'ner': 0.9070113753810878}
Losses {'ner': 0.4309020247381171}
Losses {'ner': 0.43092953154817254}
Losses {'ner': 1.694827023568131}
Losses {'ner': 1.6981200102628475}
Losses {'ner': 0.27591652550804635}
Losses {'ner': 0.27660825126305166}
Losses {'ner': 7.222739803670015e-07}
Losses {'ner': 0.002118568362661769}
Losses {'ner': 0.0004506372694007022}
Losses {'ner': 0.4953172382733111}
Losses {'ner': 4.389336957745914e-09}
Losses {'ner': 1.1029082841880111}
Losses {'ner': 1.1357048622013453}
Losses {'ner': 1.1357056658310527}
Losses {'ner': 0.002487787426997685}
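One detail helps when reading this log: the losses dictionary is created once per iteration and passed to every update call in that iteration, so each printed value is a running total rather than the loss of a single batch. The sketch below imitates that bookkeeping in plain Python. The `chunks` helper is a hypothetical stand-in for batching and the loss values are invented; nothing here calls spaCy.

```python
from itertools import islice


def chunks(items, size):
    """Yield consecutive lists of at most `size` items (a stand-in for batching)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Invented per-example loss values, purely for illustration
fake_losses = [0.5, 0.25, 1.0, 0.25]

running = {"ner": 0.0}   # reset once per iteration, like `losses = {}`
printed = []
for batch in chunks(fake_losses, 2):
    for value in batch:
        running["ner"] += value   # each update adds to the same entry
    printed.append(running["ner"])
    print("Losses", running)
```

With four examples and a batch size of 2 there are two batches per iteration, which is why the log tends to come in pairs where the second value is at least as large as the first.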

# Predict on new texts the model has not seen

Training of our NER is now complete. We can test whether it works as we expected. If the results fall short of our expectations, we will need to add more training examples and train again.
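One way to make "up to our expectations" measurable is to compare the predicted (text, label) pairs for a response against a hand-checked list. The sketch below computes precision and recall over such pairs. The `precision_recall` helper and both lists are hypothetical and only illustrate the idea; they are not results from this notebook.

```python
def precision_recall(predicted, expected):
    """Compare two collections of (text, label) pairs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical hand-checked entities for one response
expected = [
    ("Outlander", "Movie_Show_Titles"),
    ("Vikings", "Movie_Show_Titles"),
    ("Westworld", "Movie_Show_Titles"),
]

# Hypothetical model output for the same response
predicted = [
    ("Vikings", "Movie_Show_Titles"),
    ("British", "NORP"),
]

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall suggests the model is missing titles and needs more examples of them; low precision suggests it is labelling things that are not titles.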


In [276]:
# Testing the model: run the survey responses through the updated pipeline
text = []
label = []
for i in range(0,len(survey_responses)):
    updated_entities = nlp(survey_responses.iloc[i,1])
    for ent in updated_entities.ents:
        text.append(ent.text)
        label.append(ent.label_)


In [284]:
doc = nlp(survey_responses.iloc[4,1])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])


Entities [('Vikings', 'Movie_Show_Titles')]


As shown above, the model has extracted few