# Custom Named Entity Recognition (NER) using spaCy

(Original: [How to Train spaCy to Autodetect New Entities (NER)](https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/) by Shrivarsheni)

Named Entity Recognition (NER) identifies named entities such as ‘America’, ‘Emily’, and ‘London’, and categorizes them as PERSON, LOCATION, and so on. In spaCy, NER is implemented by the pipeline component 'ner'.

In [1]:
# Load a spaCy model and check if it has ner
import spacy
nlp=spacy.load('en_core_web_sm')

nlp.pipe_names

['tagger', 'parser', 'ner']

## Default NER

Testing default 'ner'

In [5]:
article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
Flipkart – Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 
Snapdeal – Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 

Grofers – One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes."""

doc=nlp(article_text)
for ent in doc.ents:
  print(ent.text,ent.label_)

India GPE
Flipkart PERSON
2007 DATE
Flipkart PERSON
Indian NORP
Amazon ORG
Walmart LOC
one CARDINAL
US GPE
Amazon ORG
daily DATE
2010 DATE
2011 DATE
more than 3 CARDINAL
India GPE
over 30 million CARDINAL
over 125,000 CARDINAL
Indian NORP
recent years DATE
Freecharge PERSON
Unicommerce GPE
One CARDINAL
2013 DATE
the last 5 years DATE
daily DATE
India GPE
Indian NORP


Notice that 'Flipkart' is tagged as PERSON instead of ORG, and 'Snapdeal' is not tagged at all. The default model needs to be updated.

Our task is to make sure the NER recognizes companies as ORG and not as PERSON, places the unidentified products under PRODUCT, and so on.

To enable this, you need to provide training examples from which the NER can learn to handle future samples.

To do this, let’s use an existing pre-trained spaCy model and update it with new examples.

First, let’s load a pre-existing spaCy model with a built-in ner component. Then, get the named entity recognizer using the get_pipe() method.

In [6]:
# Load pre-existing spacy model
import spacy
nlp=spacy.load('en_core_web_sm')

# Getting the pipeline component
ner=nlp.get_pipe("ner")

To update a pretrained model meaningfully, you need to provide many new examples: a few hundred is a good start, although more is better.

## Training Data Prep

Format: a list of tuples, where each tuple contains the text and a dictionary. The dictionary holds the start and end character indices of each named entity in the text, along with its category or label. The end index is exclusive, so text[start:end] is exactly the entity.

For example, ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})
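Counting character offsets by hand is error-prone, so it can help to derive them from the entity text instead. The helper below is a minimal sketch and not part of spaCy: `make_example` is a hypothetical name, and it assumes that each phrase occurs in the sentence and that its first occurrence is the one you mean.

```python
def make_example(text, phrases):
    """Build one (text, {"entities": [...]}) tuple from (phrase, label) pairs.

    Hypothetical helper for illustration; it uses the first occurrence of each phrase.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        # start is inclusive, end is exclusive: text[start:end] == phrase
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


example = make_example("Walmart is a leading e-commerce company", [("Walmart", "ORG")])
print(example)
```

Sentences in which the same word appears more than once would need a more careful lookup than `str.find`.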

In [8]:
# training data: (text, {"entities": [(start, end, label)]}) with end exclusive
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("Flipkart started it's journey from zero", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
              ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
              ]

# Adding labels to the `ner`

for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])
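Because the annotations are plain character offsets, a quick way to review them is to slice each text with its offsets and read what comes out. The snippet below is a small sketch of that check; `show_spans` is a hypothetical helper name, and `sample` reuses two rows in the same format as TRAIN_DATA. A span that prints as a word fragment points to an offset error, which the trainer may ignore or reject depending on the spaCy version.

```python
def show_spans(train_data):
    """Return the substring that each (start, end, label) annotation covers."""
    return [
        (text[start:end], label)
        for text, annotations in train_data
        for start, end, label in annotations["entities"]
    ]


sample = [
    ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
]
print(show_spans(sample))  # [('Walmart', 'ORG'), ('BMW', 'PRODUCT')]
```

Calling `show_spans(TRAIN_DATA)` applies the same check to the full list.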

But before you train, remember that apart from ner, the model has other pipeline components. These components should not be affected by training.

So, disable the other pipeline components with the nlp.disable_pipes() method.

In [9]:
# Collect the pipeline components that should stay unchanged during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
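To make the filter concrete, here is the same list comprehension applied to the pipe names printed at the top of this notebook. The variable names `example_pipe_names` and `would_disable` are made up for this illustration so that they do not overwrite anything used later.

```python
# Pipe names as printed for en_core_web_sm in the first cell
example_pipe_names = ["tagger", "parser", "ner"]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# Everything that is not in the exception list gets disabled during training
would_disable = [pipe for pipe in example_pipe_names if pipe not in pipe_exceptions]
print(would_disable)  # ['tagger', 'parser']
```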

## Training the NER model

(a) To train an ner model, the training loop has to pass over the examples for a sufficient number of iterations. Training for only 5 or 6 iterations may not be effective.

(b) Before every iteration, it is good practice to shuffle the examples randomly with the random.shuffle() function.

This ensures the model does not make generalizations based on the order of the examples.

(c) The training data is usually passed in batches.

In each iteration, the ner component is updated through nlp.update().

For each word, update() makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.
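The shuffling and batching in steps (b) and (c) can be illustrated without spaCy. The sketch below uses two simplified stand-ins, `compounding_sizes` and `batches_of`, which are hypothetical names written for this example; they are assumed to mirror the behaviour of `compounding` and `minibatch` from `spacy.util` (a batch size that grows by a constant factor up to a cap, and consecutive batches cut to that size).

```python
import itertools
import random


def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound


def batches_of(items, sizes):
    """Split items into consecutive batches, taking each batch size from `sizes`."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


examples = [f"example {i}" for i in range(19)]  # same count as TRAIN_DATA
random.shuffle(examples)  # a new order for every pass over the data
batch_lengths = [len(b) for b in batches_of(examples, compounding_sizes(4.0, 32.0, 1.001))]
print(batch_lengths)  # [4, 4, 4, 4, 3]
```

With a growth factor of 1.001 the batch size stays at 4 for a long time, so one pass over 19 examples results in five calls to the update step.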

In [10]:
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):

  # Training for 30 iterations
  for iteration in range(30):

    # shuffling examples before every iteration
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
      texts, annotations = zip(*batch)
      # spaCy 2.x signature: update(texts, annotations, ...)
      nlp.update(
        texts,  # batch of texts
        annotations,  # batch of annotations
        drop=0.5,  # dropout - make it harder to memorise data
        losses=losses,
      )
      # losses accumulate over the batches of one iteration
      print("Losses", losses)

Losses {'ner': 5.671148613560945}
Losses {'ner': 6.244043747428805}
Losses {'ner': 9.126212551142089}
Losses {'ner': 14.664301947474087}
Losses {'ner': 15.751911189801831}
Losses {'ner': 6.258584780152887}
Losses {'ner': 6.3051013308531765}
Losses {'ner': 8.96836465296201}
Losses {'ner': 9.972942887255158}
Losses {'ner': 15.103250343748186}
Losses {'ner': 3.4161520173074678}
Losses {'ner': 8.004657640978166}
Losses {'ner': 8.137368077315386}
Losses {'ner': 8.14868353762131}
Losses {'ner': 9.987013692619627}
Losses {'ner': 0.4492399188457057}
Losses {'ner': 4.926055098764664}
Losses {'ner': 8.422810546170012}
Losses {'ner': 11.918632652472212}
Losses {'ner': 15.097795510526254}
Losses {'ner': 0.9270217642939542}
Losses {'ner': 3.145356021896646}
Losses {'ner': 6.297815726981867}
Losses {'ner': 9.724805513806643}
Losses {'ner': 11.612874856617282}
Losses {'ner': 3.389610820672715}
Losses {'ner': 3.4108191625882682}
Losses {'ner': 7.383644360548715}
Losses {'ner': 7.3865686765020655}
...

## Testing

If predictions are not up to expectations, include more training examples and try again.

In [12]:
# Testing the model
doc = nlp("I was driving a Alto which I bought from Flipkart")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('Alto', 'PRODUCT'), ('Flipkart', 'ORG')]
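To decide whether predictions are "up to expectations" on more than a single sentence, one option is to compare the predicted (text, label) pairs with the pairs you expected. The function below is a minimal sketch of such a comparison; `score_entities` is a hypothetical helper, not a spaCy API, and it treats a prediction as correct only when both text and label match.

```python
def score_entities(predicted, expected):
    """Return (precision, recall) for two collections of (text, label) pairs."""
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Prediction copied from the output above; `expected` is what we hoped to see
predicted = [("Alto", "PRODUCT"), ("Flipkart", "ORG")]
expected = [("Alto", "PRODUCT"), ("Flipkart", "ORG")]
print(score_entities(predicted, expected))  # (1.0, 1.0)
```

A low recall suggests missing entity types in the training data, while a low precision suggests mislabeled or ambiguous examples.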


You can save the model to a directory of your choice with the to_disk() method.

After saving, you can load the model from that directory at any time by passing the directory path to the spacy.load() function.

In [13]:
# Save the  model to directory
output_dir = Path('/content/')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to \content


In [14]:
# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Fridge can be ordered in FlipKart" )
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Loading from \content
Entities [('Fridge', 'PRODUCT'), ('FlipKart', 'ORG')]


## Adding new labels

Adding a new label follows the same procedure: prepare training data that uses the label, then add the label to ner with the add_label() method. Next, call the resume_training() function, which returns an optimizer.

In [16]:
# New label to add
LABEL = "FOOD"

# Training examples in the required format
TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}),
              ("China's noodles are very famous", {"entities": [(8, 15, "FOOD")]}),
              ("Shrimps are famous in China too", {"entities": [(0, 7, "FOOD")]}),
              ("Lasagna is another classic of Italy", {"entities": [(0, 7, "FOOD")]}),
              ("Sushi is extremely famous and expensive Japanese dish", {"entities": [(0, 5, "FOOD")]}),
              ("Unagi is a famous seafood of Japan", {"entities": [(0, 5, "FOOD")]}),
              ("Tempura , Soba are other famous dishes of Japan", {"entities": [(0, 7, "FOOD")]}),
              ("Udon is a healthy type of noodles", {"entities": [(0, 4, "FOOD")]}),
              ("Chocolate soufflé is extremely famous french cuisine", {"entities": [(0, 17, "FOOD")]}),
              ("Flamiche is french pastry", {"entities": [(0, 8, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0, 7, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0, 7, "FOOD")]}),
              ("Frenchfries are considered too oily", {"entities": [(0, 11, "FOOD")]})
           ]

# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
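When the training data is meant to teach a single new label, tallying the labels before training is a cheap way to catch a mislabeled row. The sketch below shows the idea on a tiny made-up list, `sample_data`, in which the second row carries a deliberately wrong label for illustration.

```python
from collections import Counter

# Two rows in the same format as TRAIN_DATA; the second label is wrong on purpose
sample_data = [
    ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
    ("Pasta is an italian recipe", {"entities": [(0, 5, "ORG")]}),
]

label_counts = Counter(
    label
    for _, annotations in sample_data
    for _, _, label in annotations["entities"]
)
print(label_counts)
```

Any label other than the one you intend to add stands out immediately in the counts; the same expression works on `TRAIN_DATA`.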

Pass sufficient examples and train for a good number of iterations (30 in the code below).

In [17]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

  sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
  for itn in range(30):
    # shuffle examples before training
    random.shuffle(TRAIN_DATA)
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=sizes)
    # dictionary to store losses
    losses = {}
    for batch in batches:
      texts, annotations = zip(*batch)
      # Calling update() over the iteration
      nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
      print("Losses", losses)

Losses {'ner': 6.135417460280541}
Losses {'ner': 18.055886029082423}
Losses {'ner': 26.98418468125451}
Losses {'ner': 29.93112833611222}
Losses {'ner': 42.66409005037997}
Losses {'ner': 47.60032568394229}
Losses {'ner': 52.556862453945804}
Losses {'ner': 57.99250314309007}
Losses {'ner': 63.61409512900268}
Losses {'ner': 65.15030186584116}
Losses {'ner': 72.25433349458224}
Losses {'ner': 77.04438918883169}
Losses {'ner': 83.47645027281031}
Losses {'ner': 90.64281290115103}
Losses {'ner': 1.9889164222981925}
Losses {'ner': 6.92639906005658}
Losses {'ner': 9.8967614361081}
Losses {'ner': 14.314272661293769}
Losses {'ner': 16.30601631543266}
Losses {'ner': 19.142050187208387}
Losses {'ner': 27.31261651880357}
Losses {'ner': 34.23241859108691}
Losses {'ner': 44.25706630379443}
Losses {'ner': 49.1747580840816}
Losses {'ner': 53.14143339348539}
Losses {'ner': 61.481861582634345}
Losses {'ner': 66.44031007089961}
Losses {'ner': 73.32213563857846}
Losses {'ner': 3.7872812948189676}
...
Losses {'ner': 15.070842868026716}
Losses {'ner': 17.41177810350598}
Losses {'ner': 20.929445350954097}
Losses {'ner': 25.880433935083452}
Losses {'ner': 32.75067881063372}
Losses {'ner': 37.85085251409017}
Losses {'ner': 40.69515348538933}
Losses {'ner': 46.009373735042196}
Losses {'ner': 53.27040446996633}
Losses {'ner': 57.52870198565014}
Losses {'ner': 65.76614764546557}
Losses {'ner': 4.269399568118388}
Losses {'ner': 9.889373769663507}
Losses {'ner': 13.241571735736215}
Losses {'ner': 15.619898533834203}
Losses {'ner': 24.24267604007764}
Losses {'ner': 24.254182549579127}
Losses {'ner': 28.367168055854563}
Losses {'ner': 31.062403062631347}
Losses {'ner': 36.04521516179011}
Losses {'ner': 37.667203489265376}
Losses {'ner': 40.24817592875843}
Losses {'ner': 47.3054819800127}
Losses {'ner': 50.601195993291185}
Losses {'ner': 53.497107728657284}
Losses {'ner': 3.7994599402516087}
Losses {'ner': 7.913474875472957}
Losses {'ner': 8.871293293435656}
Losses {'ner': 13.129610070431}
...

## Testing Custom NER

In [21]:
# Testing the NER

test_text = "I ate Sushi yesterday. Maggi is a common fast food "
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent)

Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
Maggi


In [22]:
# Output directory
from pathlib import Path
output_dir=Path('/content/')

# Saving the model to the output directory
if not output_dir.exists():
  output_dir.mkdir()
nlp.meta['name'] = 'my_ner'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to \content


In [27]:
# Loading the model from the directory
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
assert nlp2.get_pipe("ner").move_names == move_names
doc2 = nlp2(' Idli is an extremely famous south Indian dish')
# A for/else would always run the else branch here, so test for entities explicitly
if doc2.ents:
  for ent in doc2.ents:
    print(ent.label_, ent.text)
else:
  print("No Entities found!!")

Loading from \content
No Entities found!!
