In [1]:
## Reference article: http://spacy.pythonhumanities.com/02_01_entityruler.html

The Python library spaCy offers a few different methods for performing rules-based NER. One such method is via its EntityRuler.

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. A factory in spaCy is a set of classes and functions preloaded in spaCy that perform set tasks. In the case of the EntityRuler, the factory allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.
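As a minimal sketch of that create, instruct, and use cycle, the snippet below builds an EntityRuler on a blank English pipeline (so no trained model is needed). The example sentence and the second, token-based pattern are illustrative assumptions and not part of the referenced article.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Create the EntityRuler from its factory name and add it to the pipeline.
ruler = nlp.add_pipe("entity_ruler")

# Instructions are patterns: an exact phrase (string) or a list of token descriptions.
ruler.add_patterns([
    {"label": "GPE", "pattern": "Goa"},
    {"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "delhi"}]},
])

doc = nlp("I flew from Goa to New Delhi.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

The string pattern matches the exact text, while the token-based pattern matches on lowercased tokens and so would also catch "new delhi".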

Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe. I have spoken briefly about pipes in past notebooks, but it is worth addressing them in more detail here.

A pipe is an individual component of a pipeline. A pipeline’s purpose is to take input data, perform some operations on it, and then output the results, either as new data or as extracted metadata. In spaCy, different pipes perform different tasks: the tokenizer splits the text into individual tokens, the parser parses the text, and the NER identifies entities and labels them accordingly. All of this data is stored in the Doc object, as we saw in Notebook 01_01 of this series.

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive. Sometimes this sequence is essential, meaning later pipes depend on earlier pipes. At other times, this sequence is not essential, meaning later pipes can function without earlier pipes. It is important to keep this in mind as you create custom spaCy models (or any pipeline for that matter).
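The effect of ordering can be illustrated with a toy pipeline written in plain Python. This is only a sketch of the idea and not how spaCy is implemented: `toy_ner`, `toy_ruler`, and `run` are hypothetical names, and both pipes are assumed to leave already-labelled entities untouched.

```python
# Toy model of a sequential pipeline; this is NOT spaCy's implementation.
# Each "pipe" receives a dict of annotations and returns it, possibly updated.

def toy_ner(doc):
    # Stand-in for a statistical NER that mislabels "Goa" as PERSON.
    guesses = {"Goa": "PERSON", "India": "GPE"}
    for word in doc["words"]:
        if word in guesses and word not in doc["ents"]:
            doc["ents"][word] = guesses[word]
    return doc

def toy_ruler(doc):
    # Rule-based pipe that leaves existing entities alone.
    if "Goa" in doc["words"] and "Goa" not in doc["ents"]:
        doc["ents"]["Goa"] = "GPE"
    return doc

def run(pipes, words):
    # Apply each pipe in order to the same document.
    doc = {"words": words, "ents": {}}
    for pipe in pipes:
        doc = pipe(doc)
    return doc["ents"]

words = ["Goa", "is", "in", "India"]
ner_first = run([toy_ner, toy_ruler], words)
ruler_first = run([toy_ruler, toy_ner], words)
print(ner_first)    # {'Goa': 'PERSON', 'India': 'GPE'}
print(ruler_first)  # {'Goa': 'GPE', 'India': 'GPE'}
```

The same two pipes produce different results depending only on their order, because whichever pipe runs first claims the entity.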

In this notebook, we will be looking closely at the EntityRuler as a component of a spaCy model’s pipeline. Off-the-shelf spaCy models come preloaded with an NER model; they do not, however, come with an EntityRuler. In order to incorporate an EntityRuler into a spaCy model, it must be created as a new pipe, given instructions, and then added to the model. Once this is complete, the user can save that new model with the EntityRuler to the disk.
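The final step, saving a model with its EntityRuler to disk, might look like the sketch below. It assumes a blank pipeline and a temporary directory purely so the example is self-contained; with a real project you would save to a directory of your choosing.

```python
import tempfile

import spacy

# Build a small pipeline containing only an EntityRuler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Goa"}])

# Save the whole pipeline (including the ruler's patterns) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

doc = reloaded("I live in Goa.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(reloaded.pipe_names)
print(ents)
```

The reloaded pipeline carries the patterns with it, so they do not need to be added again.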

## Entity Ruler Demonstration

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = "I live in the state of Goa in India. It has a great diversity and culture."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Goa PERSON
India GPE


In [3]:
## Goa is a state, so it should be GPE, yet the model labelled it PERSON.
## Let's correct that.
## First we need to create a new model.

import spacy
nlp = spacy.load("en_core_web_sm")

text = "I live in the state of Goa which is in India. It has a great diversity and culture."

ruler = nlp.add_pipe("entity_ruler")


## patterns are a list of dictionaries
patterns = [
    {"label":"GPE", "pattern":"Goa"}
    ]

ruler.add_patterns(patterns)

doc2 = nlp(text)

for ent in doc2.ents:
    print(ent.text, ent.label_)


## No change: Goa is still labelled PERSON.
## As the output of the next cell shows, ner comes before entity_ruler in the pipeline,
## so Goa has already been tagged by ner by the time entity_ruler runs.
## We therefore need entity_ruler to come before ner.

Goa PERSON
India GPE


In [4]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [5]:
## So we create a new model

import spacy
nlp2 = spacy.load("en_core_web_sm")

text = "I live in the state of Goa which is in India. It has a great diversity and culture."

## main step: before='ner'
ruler = nlp2.add_pipe("entity_ruler", before='ner')


## patterns are a list of dictionaries
patterns = [
    {"label":"GPE", "pattern":"Goa"}
    ]

ruler.add_patterns(patterns)

doc2 = nlp2(text)

for ent in doc2.ents:
    print(ent.text, ent.label_)

## Goa is now classified as GPE

Goa GPE
India GPE
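
Placing the ruler before ner is one way to fix the label. Another option that may be worth knowing is the EntityRuler's `overwrite_ents` setting, which lets a ruler placed later in the pipeline replace overlapping entities. The sketch below is an assumption-laden illustration rather than the article's method: to stay runnable without a trained model, a second ruler (hypothetically named `fake_ner`) stands in for the statistical ner pipe and deliberately mislabels "Goa".

```python
import spacy

nlp = spacy.blank("en")

# Stand-in for the statistical "ner" pipe (hypothetical name "fake_ner"):
# it deliberately labels "Goa" as PERSON so no trained model is required.
fake_ner = nlp.add_pipe("entity_ruler", name="fake_ner")
fake_ner.add_patterns([{"label": "PERSON", "pattern": "Goa"}])

# A ruler placed AFTER the first pipe, allowed to overwrite overlapping entities.
corrector = nlp.add_pipe(
    "entity_ruler", name="corrector", config={"overwrite_ents": True}
)
corrector.add_patterns([{"label": "GPE", "pattern": "Goa"}])

doc = nlp("I live in the state of Goa which is in India.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With a real model, the equivalent would presumably be adding the ruler after ner with the same `config`; which approach is preferable depends on whether the rules should take precedence over the model everywhere they match.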
