<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fclassification/applications/classification/ner_tagging/NER%20tagging%20with%20Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER using spaCy and Custom Training

### Imports

In [1]:
import random
import spacy

from spacy import displacy
from spacy.util import minibatch, compounding

Load the small English model.

In [71]:
nlp = spacy.load("en_core_web_sm")

The model comes with three pipeline components:

- tagger (Parts-of-Speech)
- parser (Dependency Parsing)
- ner (Named Entity Recognition)

The focus of this notebook is on `ner`.

In [73]:
nlp.pipe_names

['tagger', 'parser', 'ner']

### spaCy NER

spaCy provides NER support for the following entity types. Learn more about them [here](https://spacy.io/api/annotation#named-entities).

|NER      | Description                 |
|---------|-----------------------------|
|PERSON|	People, including fictional.|
|NORP|	Nationalities or religious or political groups.|
|FAC|	Buildings, airports, highways, bridges, etc.|
|ORG|	Companies, agencies, institutions, etc.|
|GPE|	Countries, cities, states.|
|LOC|	Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|	Objects, vehicles, foods, etc. (Not services.)|
|EVENT|	Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|	Titles of books, songs, etc.|
|LAW|	Named documents made into laws.|
|LANGUAGE|	Any named language.|
|DATE|	Absolute or relative dates or periods.|
|TIME|	Times smaller than a day.|
|PERCENT|	Percentage, including "%".|
|MONEY|	Monetary values, including unit.|
|QUANTITY|	Measurements, as of weight or distance.|
|ORDINAL|	“first”, “second”, etc.|
|CARDINAL|	Numerals that do not fall under another type.|


In [74]:
doc = nlp("Australia wants to force Facebook and Google to pay media companies for news")

In [75]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Australia 0 9 GPE
Facebook and Google 25 44 ORG
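
The `start_char` and `end_char` values printed above are character offsets into the original string, with the end offset exclusive. A minimal sketch in plain Python (no spaCy needed; the `spans` list is copied by hand from the output above) showing that slicing the text with these offsets recovers each entity:

```python
text = "Australia wants to force Facebook and Google to pay media companies for news"

# (start_char, end_char, label) triples copied from the printed output
spans = [(0, 9, "GPE"), (25, 44, "ORG")]

# end_char is exclusive, so a plain slice recovers the entity text
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(repr(entity_text), label)
```

The same offset convention is used for the training annotations later in the notebook.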


Visualization using [displacy](https://explosion.ai/demos/displacy-ent)

In [76]:
displacy.render(doc, style="ent", jupyter=True)

In [79]:
doc = nlp("A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor")
displacy.render(doc, style="ent", jupyter=True)

In [85]:
doc = nlp("I am working at Microsoft from 27/07/2017")
displacy.render(doc, style="ent", jupyter=True)

### Custom data

In [86]:
doc = nlp("I do not have money to pay my credit card account")

In [87]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The pretrained model does not predict any entities for this sentence.

In [88]:
doc = nlp("what is the process to open a new saving account")

In [89]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

### Custom Training

Let's create two new entity types, **`ACTIVITY`** and **`SERVICE`**, for a specific domain (banking).

There are many tools available for creating training data for an NER model. A few of them are:

- [prodigy](https://prodi.gy/)
- [doccano](https://github.com/doccano/doccano)
- [inception](https://inception-project.github.io/)
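
Annotation tools usually export labelled spans in their own format, which then has to be converted into the `(text, {"entities": [...]})` tuples used for training. A minimal sketch, assuming a JSONL export where each line carries a `text` field and a `label` field holding `[start, end, label]` triples. These field names are an assumption and vary between tools and versions, and `jsonl_to_spacy` is a hypothetical helper, not part of spaCy:

```python
import json


def jsonl_to_spacy(lines):
    """Convert JSONL annotation records into spaCy v2 style training tuples.

    Assumes each record looks like {"text": ..., "label": [[start, end, label], ...]}.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        entities = [(start, end, label) for start, end, label in record["label"]]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hypothetical one-line export for illustration
sample_lines = [
    '{"text": "How do I open a new loan account", '
    '"label": [[9, 13, "ACTIVITY"], [20, 32, "SERVICE"]]}'
]

converted = jsonl_to_spacy(sample_lines)
print(converted)
```

Check the export format of your own tool before relying on this layout.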

In [90]:
train = [
    ("Money transfer from my checking account is not working", {"entities": [(6, 14, "ACTIVITY"), (23, 39, "SERVICE")]}),
    ("I want to check balance in my savings account", {"entities": [(16, 23, "ACTIVITY"), (30, 45, "SERVICE")]}),
    ("I suspect a fraud in my credit card account", {"entities": [(12, 17, "ACTIVITY"), (24, 35, "SERVICE")]}),
    ("I am here for opening a new savings account", {"entities": [(14, 21, "ACTIVITY"), (28, 43, "SERVICE")]}),
    ("Your mortgage is in delinquent status", {"entities": [(20, 30, "ACTIVITY"), (5, 13, "SERVICE")]}),
    ("Your credit card is in past due status", {"entities": [(23, 31, "ACTIVITY"), (5, 16, "SERVICE")]}),
    ("My loan account is still not approved and funded", {"entities": [(25, 37, "ACTIVITY"), (3, 15, "SERVICE"), (42, 48, "ACTIVITY")]}),
    ("How do I open a new loan account", {"entities": [(9, 13, "ACTIVITY"), (20, 32, "SERVICE")]}),
    ("what are the charges on Investment account", {"entities": [(13, 20, "ACTIVITY"), (24, 42, "SERVICE")]}),
    ("Can you explain late charges on my credit card", {"entities": [(21, 28, "ACTIVITY"), (35, 46, "SERVICE")]}),
    ("I want to open a new loan account", {"entities": [(10, 14, "ACTIVITY"), (21, 33, "SERVICE")]}),
    ("Can you help updating payment on my credit card", {"entities": [(22, 29, "ACTIVITY"), (36, 47, "SERVICE")]}),
    ("When is the payment due date on my card", {"entities": [(12, 19, "ACTIVITY"), (35, 39, "SERVICE")]})
]
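
Hand-written character offsets are easy to get wrong by one, and a span that does not line up with token boundaries cannot be used properly during training. A minimal sketch of a sanity check in plain Python: `find_span_problems` is a hypothetical helper, and its word-boundary test is only an approximation, since spaCy's tokenizer does not split on whitespace alone.

```python
def find_span_problems(examples):
    """Return (text, span_text, label, problem) tuples for suspicious entity spans."""
    problems = []
    for text, annotations in examples:
        for start, end, label in annotations["entities"]:
            span = text[start:end]
            if start < 0 or end > len(text) or start >= end:
                problem = "offsets out of range"
            elif span != span.strip():
                problem = "leading or trailing whitespace"
            elif (start > 0 and text[start - 1].isalnum()) or (
                end < len(text) and text[end].isalnum()
            ):
                problem = "span cuts through a word"
            else:
                continue  # span looks fine
            problems.append((text, span, label, problem))
    return problems


# Hypothetical examples: the first is well-formed, the other two are deliberately broken
examples = [
    ("I want to check balance in my savings account",
     {"entities": [(16, 23, "ACTIVITY"), (30, 45, "SERVICE")]}),
    ("Please transfer money", {"entities": [(7, 14, "ACTIVITY")]}),  # yields "transfe"
    ("Your loan is approved", {"entities": [(5, 10, "SERVICE")]}),   # yields "loan "
]

problems = find_span_problems(examples)
for text, span, label, problem in problems:
    print(f"{label} {span!r} in {text!r}: {problem}")
```

Running the same check over `train` before training is a cheap way to catch annotation slips.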

In [91]:
# get the ner pipeline
ner = nlp.get_pipe("ner")

In [92]:
# add the labels to ner pipeline
for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [93]:
# disable other pipelines, since we are only training NER
disable_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

Training

In [94]:
with nlp.disable_pipes(*disable_pipes):
    # continue training from the pretrained weights and get an optimizer
    optimizer = nlp.resume_training()
    
    # run for 100 iterations
    for iteration in range(100):
        # randomly shuffle the data
        random.shuffle(train)
        losses = {}

        # create minibatches for training
        batches = minibatch(train, size=compounding(1.0, 4.0, 1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            nlp.update(
                text,
                annotation,
                drop=0.5,
                losses=losses,
                sgd=optimizer
            )
        print(f"Losses: {losses}")

Losses: {'ner': 101.85570331102822}
Losses: {'ner': 106.18786608839305}
Losses: {'ner': 104.9596626463599}
Losses: {'ner': 108.4543282064842}
Losses: {'ner': 95.47085125884041}
Losses: {'ner': 95.4647522950545}
Losses: {'ner': 90.36361821950413}
Losses: {'ner': 94.79046963702422}
Losses: {'ner': 87.54523835342843}
Losses: {'ner': 88.80624604038894}
Losses: {'ner': 85.0540884686634}
Losses: {'ner': 88.80510039441288}
Losses: {'ner': 87.75526690855622}
Losses: {'ner': 83.14760943097644}
Losses: {'ner': 86.08151522581466}
Losses: {'ner': 87.79587774234824}
Losses: {'ner': 88.09533671662211}
Losses: {'ner': 80.36609265883453}
Losses: {'ner': 83.99004088307265}
Losses: {'ner': 83.48203520294919}
Losses: {'ner': 80.71927871492517}
Losses: {'ner': 83.37758408839}
Losses: {'ner': 87.51363011309877}
Losses: {'ner': 84.31884038745193}
Losses: {'ner': 90.0542846408207}
Losses: {'ner': 83.50976573883031}
Losses: {'ner': 81.33262253063731}
Losses: {'ner': 86.20648693013936}
Losses: {'ner': 87.14824
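
The training loop builds its batches with `minibatch(train, size=compounding(1.0, 4.0, 1.001))`. To see which batch sizes this schedule yields, here is a pure-Python sketch of how these two utilities are understood to behave in spaCy v2. `compounding_sketch` and `minibatch_sketch` are hypothetical re-implementations written for illustration, not the library code, so treat the result as an assumption about the library's behavior.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def minibatch_sketch(items, size):
    """Group items into batches whose sizes are drawn from the `size` iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


examples = list(range(13))  # stand-in for the 13 training examples

# A fresh schedule is created for every pass over the data, as in the loop above
batch_sizes = [
    len(batch)
    for batch in minibatch_sketch(examples, compounding_sketch(1.0, 4.0, 1.001))
]
print(batch_sizes)

# Number of steps before the truncated batch size first reaches 2
steps_to_two = next(
    i for i, s in enumerate(compounding_sketch(1.0, 4.0, 1.001)) if int(s) >= 2
)
print(steps_to_two)
```

Under these assumptions every batch holds a single example, because the schedule restarts at 1.0 on each iteration and needs several hundred steps before the truncated size reaches 2. Starting the schedule at a larger value would give larger batches.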

In [95]:
for text, entities in train:
    doc = nlp(text)
    print(f"Text: {text} | entities: {entities}")
    print(f"\tActual: {[(text[ent[0]: ent[1]], ent[2]) for ent in entities['entities']]}")
    print(f"\tPredicted: {[(ent.text, ent.label_) for ent in doc.ents]}")

Text: Your credit card is in past due status | entities: {'entities': [(23, 31, 'ACTIVITY'), (5, 16, 'SERVICE')]}
	Actual: [('past due', 'ACTIVITY'), ('credit card', 'SERVICE')]
	Predicted: [('credit card', 'SERVICE')]
Text: How do I open a new loan account | entities: {'entities': [(9, 13, 'ACTIVITY'), (20, 32, 'SERVICE')]}
	Actual: [('open', 'ACTIVITY'), ('loan account', 'SERVICE')]
	Predicted: [('open', 'ACTIVITY'), ('loan account', 'SERVICE')]
Text: When is the payment due date on my card | entities: {'entities': [(12, 19, 'ACTIVITY'), (35, 39, 'SERVICE')]}
	Actual: [('payment', 'ACTIVITY'), ('card', 'SERVICE')]
	Predicted: [('payment', 'ACTIVITY')]
Text: I want to check balance in my savings account | entities: {'entities': [(16, 23, 'ACTIVITY'), (30, 45, 'SERVICE')]}
	Actual: [('balance', 'ACTIVITY'), ('savings account', 'SERVICE')]
	Predicted: [('balance', 'ACTIVITY'), ('savings account', 'SERVICE')]
Text: I am here for opening a new savings account | entities: {'entities': [(14, 21,

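As the results show, the model recovers most of the annotated entities, though not all of them, even with this small amount of training data. Note that these predictions are on the training sentences themselves, so they say little about how well the model generalizes.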
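
To put a number on the comparison printed above, the actual and predicted entities can be scored with precision, recall and F1. A minimal sketch in plain Python, using only the four complete results visible in the output above; `entity_scores` is a hypothetical helper, and it compares `(text, label)` pairs rather than character offsets, which is a simplification:

```python
def entity_scores(results):
    """Compute micro-averaged precision, recall and F1 over (gold, predicted) pairs."""
    true_pos = n_gold = n_pred = 0
    for gold, predicted in results:
        gold_set, pred_set = set(gold), set(predicted)
        true_pos += len(gold_set & pred_set)
        n_gold += len(gold_set)
        n_pred += len(pred_set)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (actual, predicted) pairs copied from the four complete results printed above
results = [
    ([("past due", "ACTIVITY"), ("credit card", "SERVICE")],
     [("credit card", "SERVICE")]),
    ([("open", "ACTIVITY"), ("loan account", "SERVICE")],
     [("open", "ACTIVITY"), ("loan account", "SERVICE")]),
    ([("payment", "ACTIVITY"), ("card", "SERVICE")],
     [("payment", "ACTIVITY")]),
    ([("balance", "ACTIVITY"), ("savings account", "SERVICE")],
     [("balance", "ACTIVITY"), ("savings account", "SERVICE")]),
]

precision, recall, f1 = entity_scores(results)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.75 f1=0.86
```

On these four sentences everything the model predicts is correct, but it misses two of the eight annotated entities.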

In [96]:
# visualize using displacy
for text, _ in train:
    doc = nlp(text)
    displacy.render(doc, style="ent", jupyter=True)

Let's see how it predicts on unseen data

In [97]:
doc = nlp("My credit card payment will be delayed")
displacy.render(doc, style="ent", jupyter=True)

In [98]:
doc = nlp("what are the charges on credit card late payment in Bank of America")
displacy.render(doc, style="ent", jupyter=True)

In [99]:
doc = nlp("Australia wants to force Facebook and Google to pay media companies for news")
displacy.render(doc, style="ent", jupyter=True)



As we can see, the model no longer predicts the entities it recognized before the custom training, such as `GPE` and `ORG`. This is known as [catastrophic forgetting](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting). To fix this, the training data needs to include examples of the original entity types alongside the new ones.
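
One way to assemble such data is pseudo-rehearsal: label some extra texts with the original model's own predictions and mix them into the new training set. A minimal sketch in plain Python. `build_rehearsal_data` and `stub_predict` are hypothetical names; in practice the predictions would come from a separately loaded, unmodified copy of the pretrained model, which is an assumption here and replaced by a lookup table so the sketch runs without spaCy.

```python
import random


def build_rehearsal_data(new_examples, raw_texts, predict_entities, seed=0):
    """Mix new-domain examples with texts labelled by the original model."""
    rehearsal = [(text, {"entities": predict_entities(text)}) for text in raw_texts]
    mixed = list(new_examples) + rehearsal
    random.Random(seed).shuffle(mixed)  # interleave old and new entity types
    return mixed


# Stand-in for the original model: predictions copied from the notebook's earlier output.
# With spaCy this could be something like
# [(e.start_char, e.end_char, e.label_) for e in original_nlp(text).ents]
ORIGINAL_PREDICTIONS = {
    "Australia wants to force Facebook and Google to pay media companies for news":
        [(0, 9, "GPE"), (25, 44, "ORG")],
}


def stub_predict(text):
    return ORIGINAL_PREDICTIONS.get(text, [])


new_examples = [
    ("How do I open a new loan account",
     {"entities": [(9, 13, "ACTIVITY"), (20, 32, "SERVICE")]}),
]

mixed = build_rehearsal_data(new_examples, list(ORIGINAL_PREDICTIONS), stub_predict)
for text, annotations in mixed:
    print(text, annotations)
```

The mixed list has the same shape as `train`, so it could be passed through the same training loop. How many rehearsal texts to add relative to the new examples is a tuning decision.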