In [14]:
# Import libraries
import spacy
import pandas as pd
import numpy as np

# Introduction

spaCy is an open-source library for NLP. It is widely used because of its flexible and advanced features. Before diving into how NER is implemented in spaCy, let's start by understanding what a Named Entity Recognizer is.

Named Entity Recognition is a standard NLP task that identifies the entities discussed in a text document. A Named Entity Recognizer is a model that performs this task. It should be able to identify named entities like 'New Zealand', 'John', 'Auckland', etc. and categorize them as PERSON, LOCATION, and so on. It is a very useful tool for information retrieval. In spaCy, NER is implemented by the pipeline component ner. Most of the models have it in their processing pipeline by default.

First, we load a spaCy model and check whether the ner component is in its pipeline.

In [9]:
# Load a spaCy model and check whether its pipeline includes ner (the pre-trained Named Entity Recognizer)
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

As shown above, ner is in the pipeline, so we can proceed to categorize our entities with the baseline, pre-trained model.

# Custom NER model

As we saw, spaCy has a built-in pipeline component, ner, for NER. Although it performs well, it is not always accurate for your text. Sometimes a word can be categorized as a PERSON or an ORG depending on the context. Also, the category we want may not be built into spaCy.

Let's have a look at how the default NER performs on survey responses.

In [18]:
# Load the survey file (pd.read_csv already returns a DataFrame)
survey_responses = pd.read_csv("Survey response sample data.csv")

In [69]:
# Run the default pipeline on each response (second column) and collect its entities
text = []
label = []
for response in survey_responses.iloc[:, 1]:
    default_entities = nlp(response)
    for ent in default_entities.ents:
        text.append(ent.text)
        label.append(ent.label_)

In [70]:
Output = pd.DataFrame({'Text':text,'Label':label})
Output

Unnamed: 0,Text,Label
0,"Gentlemen, Outlander",WORK_OF_ART
1,London,GPE
2,Dublin,GPE
3,Neon,ORG
4,Undoing,ORG
5,CB Strike,PERSON
6,British,NORP


As the table above shows, the default NER labels some entities incorrectly. For instance, it categorizes CB Strike as a PERSON and Undoing as an ORG. In addition, many more movie and show titles, such as Fear the Walking Dead and Supernatural, have not been placed in the WORK_OF_ART category. In cases like this, we need to update and train the NER model to suit the context and requirements. The next section shows how to do this.
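The two kinds of error described here (wrong label versus not detected at all) can be separated with a few lines of plain Python. This is only a sketch: the `predicted` rows are copied from the output table, and `expected_titles` is an assumed list built from the titles named in the paragraph, not a full ground truth for the survey.

```python
# Rows copied from the output table: (entity text, predicted label)
predicted = [
    ("Gentlemen, Outlander", "WORK_OF_ART"),
    ("London", "GPE"),
    ("Dublin", "GPE"),
    ("Neon", "ORG"),
    ("Undoing", "ORG"),
    ("CB Strike", "PERSON"),
    ("British", "NORP"),
]

# Assumption: titles named in the text, treated as expected WORK_OF_ART entities
expected_titles = ["Fear the Walking Dead", "Supernatural", "Undoing", "CB Strike"]

predicted_labels = dict(predicted)

# Titles the model returned, but with a label other than WORK_OF_ART
mislabelled = {
    title: predicted_labels[title]
    for title in expected_titles
    if title in predicted_labels and predicted_labels[title] != "WORK_OF_ART"
}

# Titles the model did not return as an entity at all
missed = [title for title in expected_titles if title not in predicted_labels]

print("Mislabelled:", mislabelled)
print("Missed:", missed)
```

Both groups motivate additional training examples: mislabelled titles need corrected labels, and missed titles need to be annotated in the first place.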

# Updating the Named Entity Recognizer

In the previous section, we saw why we need to update and train the NER. Now let's go ahead and see how to do it.

Our task is to make sure the NER recognizes the movie titles as MOVIE. To enable this, we need to provide training examples from which the NER can learn to handle future samples. To do this, let's take an existing pre-trained spaCy model and update it with new examples. First, let's load a pre-existing spaCy model with a built-in ner component. Then we get the Named Entity Recognizer using the get_pipe() method.

In [None]:
#load pre-existing spacy model
import spacy
nlp = spacy.load('en_core_web_sm')

#getting the pipeline component
ner = nlp.get_pipe("ner")


To meaningfully improve the pre-trained model, we will have to provide it with many new examples.

# Format the training examples

spaCy accepts training data as a list of tuples.

Each tuple should contain the text and a dictionary. The dictionary should hold the start and end character indices of each named entity in the text, along with the category or label of that entity.

For example.
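A minimal sketch of this format, using a made-up sentence. The sentence and the `make_example` helper are illustrative assumptions, not part of the survey data or of spaCy; the helper only computes the character offsets so they do not have to be counted by hand.

```python
def make_example(text, spans):
    """Build one (text, annotations) tuple from (substring, label) pairs."""
    entities = []
    for substring, label in spans:
        start = text.find(substring)   # character offset where the entity starts
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        end = start + len(substring)   # end offset is exclusive
        entities.append((start, end, label))
    return (text, {"entities": entities})


# Hypothetical sentence containing two show titles
example = make_example(
    "I watched Westworld and Vikings last week",
    [("Westworld", "WORK_OF_ART"), ("Vikings", "WORK_OF_ART")],
)

# The training set is simply a list of such tuples
train_examples = [example]
print(train_examples)
```

Note that `str.find` returns only the first occurrence, so a text that repeats the same title would need more careful handling.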

However, creating a training set manually is exhausting and time-consuming. I am using the spacy-ner-annotator tool (https://github.com/ManivannanMurugavel/spacy-ner-annotator) to create the training set.

In [91]:
#Training 
import convert_spacy_train_data

In [84]:
#Training format
with open('train.txt') as f:
    train_data = f.readlines()
train_data

["[('Content_name Fear the Walking Dead Supernatural The Gentlemen Outlander The Good Doctor Gangs of London Ma Spies in Disguise Miss Scarlet and the Duke Knives Out Dublin Murders Vikings Casper The Undoing Game of Thrones C.B. Strike Westworld', {'entities': [(232, 241, 'WOKR_OF_ART'), (220, 231, 'WOKR_OF_ART'), (205, 219, 'WOKR_OF_ART'), (192, 203, 'WOKR_OF_ART'), (185, 191, 'WOKR_OF_ART'), (177, 184, 'WOKR_OF_ART'), (162, 176, 'WOKR_OF_ART'), (151, 161, 'WOKR_OF_ART'), (125, 150, 'WOKR_OF_ART'), (107, 124, 'WOKR_OF_ART'), (104, 106, 'WOKR_OF_ART'), (88, 103, 'WOKR_OF_ART'), (72, 87, 'WOKR_OF_ART'), (62, 71, 'WOKR_OF_ART'), (48, 61, 'WOKR_OF_ART'), (35, 47, 'WOKR_OF_ART'), (13, 34, 'WOKR_OF_ART')]})]"]
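Because the file is read with readlines(), each element of train_data is a string containing a Python literal, not a list of tuples. A minimal sketch of parsing such a line and checking that each span covers the intended title is given here. It assumes a shortened `sample_line` that reuses the first two titles and offsets from the output above (label spelled exactly as in the file), and `ast.literal_eval` is one standard-library way to parse it, not necessarily the approach the annotator tool intends.

```python
import ast

# Shortened stand-in for one line of train.txt (assumption: same layout as the real file)
sample_line = (
    "[('Content_name Fear the Walking Dead Supernatural', "
    "{'entities': [(35, 47, 'WOKR_OF_ART'), (13, 34, 'WOKR_OF_ART')]})]\n"
)

# Turn the string into a real list of (text, annotations) tuples
parsed = ast.literal_eval(sample_line.strip())

# Slice the text with each (start, end) pair to see what the span actually covers
for text, annotations in parsed:
    for start, end, label in annotations["entities"]:
        print(repr(text[start:end]), label)

spans = [text[s:e] for text, ann in parsed for s, e, _ in ann["entities"]]
```

Printing the sliced spans is a quick way to catch off-by-one offsets or misspelled labels before training.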

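Now we need to register these labels with the NER component using the ner.add_label() method. Note that the annotations above spell the label as WOKR_OF_ART, so that exact string is what gets registered; correct it in the training file if the built-in WORK_OF_ART label is intended. The code below adds the labels: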

In [85]:
#Adding labels to the ner
import ast

# Each item of train_data is a string holding a list of (text, annotations) tuples, so parse it first
for line in train_data:
    for _, annotations in ast.literal_eval(line.strip()):
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])