## Customizing the spaCy Pipeline - Identifying Custom / New Entities

spaCy models make it easy to identify entities in text, a task known as Named Entity Recognition (NER). In my experience the medium and large models are the better choice for NER, because they identify more entities than the small model.

Below is how to load a model and list the entities it identifies.



In [95]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

In [97]:
text = "When Ghosn took over Nissan in the late ‘90s, the Japanese automaker was in a state of shambles. The company had been struggling with its finances. The debt burden was spiralling out of control. Innovation had taken a backseat and they were ceding market share to the likes of Ford, General Motors and Toyota. They needed a massive handout if they had any hopes of surviving, let alone competing. While there were a few suitors initially, most dropped out after baulking at the $16 billion debt burden Nissan was carrying with it. Until the french auto company, Renault finally decided to set up."

doc = nlp(text)
print("-----------------------------")
for ent in doc.ents:
    print(ent.text, ent.label_)

print("-----------------------------")
displacy.render(doc, style="ent", jupyter=True)

-----------------------------
Ghosn PERSON
Nissan ORG
the late ‘90s DATE
Japanese NORP
Ford ORG
General Motors ORG
Toyota ORG
$16 billion MONEY
Nissan ORG
Renault ORG
-----------------------------


<br>

These entities are identified because the model was trained on text that most likely contained these names already. But does the model work equally well on entities it has probably never seen?

To find out, I am going to use some text that is specific to India and Indian startup companies.

<br>


In [98]:
text = "On Wednesday, we offered our two cents on the biggest  Food-Tech deal in India. So if you haven’t heard yet, Zomato acquired UberEats India for a whopping $350 Mn. This deal makes perfect sense for Uber. After all, they were suffering huge losses and had very little to show for all the money they were pouring in."
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

<br>
Zomato and Uber are identified as persons, but they are in fact organizations (startups). How do we make sure the model labels them as the entities we are looking for?

This is done by adding a custom component to the pipeline that marks these names as entities. How do we do it?

The cell below shows how:

In [102]:
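from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Match exact phrases against the pipeline's shared vocabulary
matcher = PhraseMatcher(nlp.vocab)

startups = ['Zomato', 'UberEats India', 'Uber']
# Convert each name into a Doc so it can be used as a phrase pattern
patterns = list(nlp.pipe(startups))
matcher.add('ORG', None, *patterns)

# Define the custom component
def startup_component(doc):
    # Assigning to doc.ents replaces any entities already set on the doc
    doc.ents = [Span(doc, start, end, label='ORG')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline, after the built-in NER step
nlp.add_pipe(startup_component, after='ner')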

In [103]:
doc = nlp("On Wednesday, we offered our two cents on the biggest  Food-Tech deal in India. So if you haven’t heard yet, Zomato acquired UberEats India for a whopping $350 Mn. This deal makes perfect sense for Uber. After all, they were suffering huge losses and had very little to show for all the money they were pouring in.")
print([(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, style='ent', jupyter=True)

[('Zomato', 'ORG'), ('UberEats India', 'ORG'), ('Uber', 'ORG')]
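
The cell above passes the function object straight to `nlp.add_pipe`, which is the spaCy v2 calling convention. In spaCy v3, a component is first registered under a name and then added by that name. The following is a minimal sketch of that variant, assuming spaCy v3. It uses a blank English pipeline (tokenizer only, no `ner` step) to avoid a model download; with a loaded model you would pass `after='ner'` or `before='ner'` exactly as in the notebook.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Assumption: spaCy v3. A blank pipeline has a tokenizer but no "ner" step.
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
startups = ["Zomato", "UberEats India", "Uber"]
# v3 signature: a key plus a list of Doc patterns
matcher.add("ORG", [nlp.make_doc(name) for name in startups])

# Register the function under a name so the pipeline can refer to it
@Language.component("startup_component")
def startup_component(doc):
    doc.ents = [Span(doc, start, end, label="ORG")
                for match_id, start, end in matcher(doc)]
    return doc

# Components are added by their registered name, not as function objects
nlp.add_pipe("startup_component")

doc = nlp("Zomato acquired UberEats India. This deal makes perfect sense for Uber.")
print([(ent.text, ent.label_) for ent in doc.ents])
```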


I added a new step to the spaCy pipeline to identify these named entities, and placed it after the built-in "ner" step. Because the component assigns to `doc.ents`, it replaces every entity the default NER step had found, which is why only the three startups remain. The solution is to run the custom step before the built-in NER step, so that NER keeps the entities already set and adds its own around them.

Before adding it again, you have to remove the previously added custom step from the pipeline.

In [104]:
nlp.remove_pipe('startup_component')

('startup_component', <function __main__.startup_component(doc)>)

In [105]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

matcher = PhraseMatcher(nlp.vocab)

startups = ['Zomato', 'UberEats India', 'Uber']
pattern = list(nlp.pipe(startups))
matcher.add('ORG', None, *pattern)

# Define the custom component
def startup_component(doc):
    doc.ents = [Span(doc, start, end, label='ORG')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline, this time before the built-in NER step
nlp.add_pipe(startup_component, before='ner')

In [107]:
doc = nlp("On Wednesday, we offered our two cents on the biggest  Food-Tech deal in India. So if you haven’t heard yet, Zomato acquired UberEats India for a whopping $350 Mn. This deal makes perfect sense for Uber. After all, they were suffering huge losses and had very little to show for all the money they were pouring in.")
print([(ent.text, ent.label_) for ent in doc.ents])
print("-----------------------------------------")
displacy.render(doc, style='ent', jupyter=True)

[('Wednesday', 'DATE'), ('two cents', 'MONEY'), ('Food-Tech', 'ORG'), ('India', 'GPE'), ('Zomato', 'ORG'), ('UberEats India', 'ORG'), ('350', 'MONEY'), ('Uber', 'ORG')]
-----------------------------------------
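
Reordering the pipeline is one fix. Another option, sketched here under stated assumptions, is to leave the component after NER and merge its spans into the existing `doc.ents` instead of overwriting them. `merge_startup_ents` is a hypothetical name. The blank pipeline and the hand-set `DATE` and `PERSON` entities only stand in for a real NER step. Overlaps are resolved with `spacy.util.filter_spans`, which prefers the (first) longest span.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# A blank English pipeline (tokenizer only) avoids a model download.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(name) for name in ["Zomato", "UberEats India", "Uber"]]
matcher.add("ORG", patterns)  # list-style signature (spaCy v2.2.2+ and v3)

def merge_startup_ents(doc):
    """Add matched startups to doc.ents instead of replacing what is there."""
    matched = [Span(doc, start, end, label="ORG")
               for _, start, end in matcher(doc)]
    # The matcher's spans are listed first so they win over an existing
    # entity that covers exactly the same tokens.
    doc.ents = filter_spans(matched + list(doc.ents))
    return doc

doc = nlp("On Wednesday, Zomato acquired UberEats India.")
# Stand-ins for what a statistical NER step might have set earlier:
# a correct DATE and a mislabelled PERSON.
doc.ents = [Span(doc, 1, 2, label="DATE"), Span(doc, 3, 4, label="PERSON")]
doc = merge_startup_ents(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Here Wednesday keeps its DATE label, while Zomato ends up as ORG because the matcher's span is listed ahead of the stand-in PERSON span.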


<br>

Let's say we want to label these new entities as "Startups" instead of "ORG". How do we do it?


In [108]:
nlp.remove_pipe('startup_component')

('startup_component', <function __main__.startup_component(doc)>)

In [109]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

matcher = PhraseMatcher(nlp.vocab)

startups = ['Zomato', 'UberEats India', 'Uber']
pattern = list(nlp.pipe(startups))
matcher.add('Startups', None, *pattern)

# Define the custom component
def startup_component(doc):
    doc.ents = [Span(doc, start, end, label='Startups')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline
nlp.add_pipe(startup_component, before='ner')

In [112]:
doc = nlp("On Wednesday, we offered our two cents on the biggest  Food-Tech deal in India. So if you haven’t heard yet, Zomato acquired UberEats India for a whopping $350 Mn. This deal makes perfect sense for Uber. After all, they were suffering huge losses and had very little to show for all the money they were pouring in.")

print([(ent.text, ent.label_) for ent in doc.ents])
print("-----------------------------------------")

displacy.render(doc, style='ent', jupyter=True)

[('Wednesday', 'DATE'), ('two cents', 'MONEY'), ('Food-Tech', 'ORG'), ('India', 'GPE'), ('Zomato', 'Startups'), ('UberEats India', 'Startups'), ('Uber', 'Startups')]
-----------------------------------------
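
A custom label such as `Startups` has no predefined displaCy color, so it falls back to the default highlight. The following sketch shows one way to give it its own color through the `options` argument of `displacy.render`. The blank pipeline, the hand-set spans and the color value are assumptions made for illustration. The key is written in upper case because displaCy looks labels up in upper case.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline and hand-set entities stand in for the notebook's pipeline.
nlp = spacy.blank("en")
doc = nlp("Zomato acquired UberEats India.")
doc.ents = [Span(doc, 0, 1, label="Startups"),
            Span(doc, 2, 4, label="Startups")]

# Map the (upper-cased) label to a background color of your choice.
options = {"colors": {"STARTUPS": "#ffd966"}}

# jupyter=False returns the markup as a string; inside a notebook,
# pass jupyter=True to display it inline instead.
html = displacy.render(doc, style="ent", options=options, jupyter=False)
print(html[:200])
```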


In this notebook we looked at how to add new entity types to the spaCy pipeline. This is an easy and powerful way to extend it.

Thinking about where this can be applied is what makes these tools and technologies more powerful. Below are some use cases I can think of:

- Journalists analyzing specific topics such as Indian startups or Indian cricket.
- A medical devices company that wants its service engineers' notes analyzed to add business value.
- An investment firm that wants to analyze the annual reports of a coal mining company.

And the list goes on.