In [225]:
#Import libraries
import numpy 
import spacy
import pandas as pd
import numpy as np

# Introduction

spaCy is a free and open-source NLP library. It is widely used due to its advanced and versatile characteristics. Before we get into how NER works in spaCy, it's important to understand what a Named Entity Recognizer is.

A standard NLP task is named entity recognition, which can recognize entities discussed in a written document. A model that can perform this task is a Named Entity Recognizer. It should be able to recognize named entities such as "New Zealand," "John," "Auckland," and so on, and classify them as PERSON, LOCATION, and so on. It is a highly handy tool that supports in the extraction of information. The pipeline component ner implements NER in sapCy. By default, most models have it in their processing pipeline.


First, we'll instal the packages and check to see if the ner model is included.

In [226]:
#load a spacy model and check if it has ner (pre trained model for Named Entity Recognizer)
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

As shown above, NER is in the pipeline and we can proceed to categorize our entities based on the baseline mode, or already a pre trained model.

As we've seen, spaCy features a built-in NER pipeline. It does a good job, although it isn't always 100% correct for our text. Depending on the context, a word can be classified as a PERSON or an ORG. Also, the category we require might not be available in spaCy.

Let's have a look at how the NER handles survey responses by default.

In [227]:
#import the survey file
survey_responses = pd.read_csv("Survey response sample data.csv")
survey_responses = pd.DataFrame(survey_responses)

In [228]:
text = []
label = []
for i in range(0,len(survey_responses)):
    default_entities = nlp(survey_responses.iloc[i,1])
    for ent in default_entities.ents:
        text.append(ent.text)
        label.append(ent.label_)

In [229]:
Output = pd.DataFrame({'Text':text,'Label':label})
Output

Unnamed: 0,Text,Label
0,"Gentlemen, Outlander",WORK_OF_ART
1,London,GPE
2,Dublin,GPE
3,Neon,ORG
4,Undoing,ORG
5,CB Strike,PERSON
6,British,NORP


As shown in the table above, the default NER recognizes some of the entities incorrectly. CB Strike, for example, is classified as a PERSON, while Undoing is classified as an ORG. There are also a lot more movie or show titles that haven't been categorized in the WORK OF ART category, such as Fear of the Walking Dead, Supernatural, and others. In such a circumstance, we will need to update and train the NER model in accordance with the context and requirements. How to do it is demonstrated in the following section.

# Updating the Recognizer for Named Entities

We saw why we need to update and train the NER in the previous section. Let's get started and figure out how to do it.

It is our responsibility to ensure that the NER identifies the movie titles as MOVIE. To do so, we'll need to provide training examples that the NER can use to learn for future samples. Let's do this by updating an existing pre-trained spaCy model with newer samples. Let's start by loading a pre-built spaCY model with a ner component. Then use the get_pipe() method to get the Named Entity Recognizer.

In [230]:
#load pre-existing spacy model
import spacy
nlp = spacy.load('en_core_web_sm')

#getting the pipeline component
ner = nlp.get_pipe("ner")


To substantially improve the system, we will need to submit a large number of instances in order to update the pre-trained model with new examples.

# Format the training examples


Training data is accepted as a list of tuples by spaCy.

The text and dictionary should be included in each tuple. The start and end indices of the identified entity in the text, as well as the category or label of the named entity, should be stored in the dictionary.

As an example.

However, creating a trining set manually is exhausting and consumes a lot of time. I am using a tool developed by https://github.com/ManivannanMurugavel/spacy-ner-annotator to create a training set. 

Now we must add these labels to pipeline's ner.add_label() method. The following code demonstrates this:

In [231]:
#import the converter script
import convert_spacy_train_data

In [272]:
#training format
train_data = [('Fear the walking dead,Supernatural (huge fan and sad it has finished),The Gentlemen, Outlander', {'entities': [(85, 94, 'Movie_Show_Titles'), (70, 83, 'Movie_Show_Titles'), (22, 34, 'Movie_Show_Titles'), (0, 21, 'Movie_Show_Titles')]}), ('Miss scarlet and the duke,knifes out,Dublin murders', {'entities': [(51, 37, 'Movie_Show_Titles'), (36, 26, 'Movie_Show_Titles'), (25, 0, 'Movie_Show_Titles')]}), ('A lot!-good doctor-gangs of London�\xa0- the gentleman-ma�\xa0-spies in disguise�\xa0', {'entities': [(73, 55, 'Movie_Show_Titles'), (53, 51, 'Movie_Show_Titles'), (50, 37, 'Movie_Show_Titles'), (34, 19, 'Movie_Show_Titles'), (18, 7, 'Movie_Show_Titles')]}), ('The Undoing,Game of thrones,Outlander, Vikings,CB Strike (and most all British dramas) Westworld', {'entities': [(56, 47, 'Movie_Show_Titles'), (39, 46, 'Movie_Show_Titles'), (37, 28, 'Movie_Show_Titles'), (27, 12, 'Movie_Show_Titles'), (11, 0, 'Movie_Show_Titles')]})]

In [273]:
#Adding labels to the ner
for _, __annotations__ in train_data:
    for ent in __annotations__.get('entities'):
        ner.add_label(ent[2])

It's now time to put the NER through its paces with these examples. However, remember from part ner that the model contains additional pipeline components before we train. These elements should not be harmed during training. As a result, we use the nlp.disable_pipes() method to disable the other pipeline components.

In [274]:
# Disable pipeline components you dont need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

Unaffected pipes should now be disabled while training the model.

# Train the NER model

In [275]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example


# TRAINING THE MODEL
with nlp.disable_pipes(unaffected_pipes):

  # Training for 30 iterations
  for iteration in range(30):

    # shuufling examples  before every iteration
    random.shuffle(train_data)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
    for batch in spacy.util.minibatch(train_data, size=2):
        for text, annotations in batch:
        # create Example
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
        # Update the model
            nlp.update([example], losses=losses, drop=0.5)
        print("Losses", losses)

Losses {'ner': 0.0006535927592013421}
Losses {'ner': 1.2728909784336284}
Losses {'ner': 19.234741273561404}




Losses {'ner': 19.234819016233278}
Losses {'ner': 5.920956446019839}
Losses {'ner': 5.92096343075455}
Losses {'ner': 2.961780950514288}
Losses {'ner': 2.9617810814022745}
Losses {'ner': 1.3283611132283022e-09}
Losses {'ner': 3.793948932436468}
Losses {'ner': 0.0007439994851964275}
Losses {'ner': 1.354505801537168}
Losses {'ner': 0.0003145630991667814}
Losses {'ner': 1.3460260736468397}
Losses {'ner': 3.322381851598655e-07}
Losses {'ner': 0.9070113753810878}
Losses {'ner': 0.4309020247381171}
Losses {'ner': 0.43092953154817254}
Losses {'ner': 1.694827023568131}
Losses {'ner': 1.6981200102628475}
Losses {'ner': 0.27591652550804635}
Losses {'ner': 0.27660825126305166}
Losses {'ner': 7.222739803670015e-07}
Losses {'ner': 0.002118568362661769}
Losses {'ner': 0.0004506372694007022}
Losses {'ner': 0.4953172382733111}
Losses {'ner': 4.389336957745914e-09}
Losses {'ner': 1.1029082841880111}
Losses {'ner': 1.1357048622013453}
Losses {'ner': 1.1357056658310527}
Losses {'ner': 0.002487787426997685

# Predict on new texts the model has not seen

Our NER's training is now complete. We can now see if the NER is functioning as planned. We'll need more examples and try again if it doesn't meet our expectations.


In [292]:
# Testing the model
text = list()
label = []
for i in range(0,len(survey_responses)):
    default_entities = nlp(survey_responses.iloc[i,1])
    for ent in default_entities.ents:
        text.append(ent.text)
        label.append(ent.label_)


In [295]:
Output = pd.DataFrame({'Text':text,'Label':label})
Output

Unnamed: 0,Text,Label
0,Fear the walking dead,Movie_Show_Titles
1,Supernatural,Movie_Show_Titles
2,Outlander,Movie_Show_Titles
3,Vikings,Movie_Show_Titles
4,Sometimes,Movie_Show_Titles
5,Vikings,Movie_Show_Titles


The output above demonstrates that the model is not bad and has accurately spotted the titles, but not all of them. As a result, we must train the model with more examples in order to extract more data from the survey.