# spaCy's EntityRuler

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/02_01_entityruler.html

## Named Entity Recognition (NER)

NER is an NLP technique that aims to identify named entities in text and classify them into predefined categories. Named entities are specific words or phrases that represent names of people, organizations, locations, dates, quantities, monetary values, and other similar entities.

The goal of NER is to extract these named entities from unstructured text and label each one with its category. By identifying named entities, NER helps in understanding the structure and semantics of text, enabling downstream applications such as information retrieval, question answering, information extraction, sentiment analysis, and more.

NER typically involves the following steps:
1. **Tokenization**: The input text is divided into individual words or tokens.
2. **Part-of-Speech Tagging (POS)**: Each token is assigned a part-of-speech tag that denotes its grammatical category (e.g., noun, verb, adjective).
3. **Named Entity Recognition**: This is the core step of NER. Here, the tokens are analyzed to determine whether they represent named entities and, if so, which category they belong to (e.g., person, organization, location). This can be done using rule-based approaches, machine learning algorithms such as Conditional Random Fields or Recurrent Neural Networks, or a combination of both.
4. **Entity Classification**: After identifying the named entities, they are further classified into predefined categories based on the context and domain. For example, an organization entity can be classified as a company, government agency, or educational institution.
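
The steps above can be sketched with a toy rule-based tagger. This is only an illustration in plain Python, not how spaCy works internally: the `GAZETTEER` dictionary and the helper names `tokenize` and `tag_entities` are hypothetical, and the part-of-speech step is skipped for brevity.

```python
import re

# Hypothetical toy gazetteer: surface form -> entity category.
GAZETTEER = {
    "Poland": "GPE",
    "Warszawa": "GPE",
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag_entities(tokens):
    """Return (token, label) pairs for tokens found in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

tokens = tokenize("The capital of Poland is Warszawa.")
entities = tag_entities(tokens)
print(tokens)
print(entities)
```

A lookup like this only finds exact, single-token matches, which is why real systems combine rules with statistical models.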

NER systems are trained on annotated datasets where human annotators label the named entities in the text. These annotated datasets serve as training data for machine learning algorithms to learn patterns and features that can help identify and classify named entities accurately.

NER has broad applications in various domains. For example:
+ **Information Extraction**: NER can extract specific information such as names, dates, and locations from text.
+ **Question Answering**: NER helps in understanding and extracting relevant information to answer questions.
+ **Document Summarization**: NER can identify important entities to generate concise summaries.
+ **Recommendation Systems**: NER can identify entities to provide personalized recommendations.

Overall, NER plays a crucial role in text understanding and information extraction by identifying and categorizing named entities, enabling more advanced analysis and processing of text data.

## EntityRuler

spaCy provides several methods for rules-based NER. One method is through its **EntityRuler**.

The EntityRuler is a spaCy **factory** that allows creating patterns with corresponding labels. A factory in spaCy is a set of classes and functions preloaded in spaCy for specific tasks. For the EntityRuler, the factory enables creating an EntityRuler, defining patterns and labels, and using them to find and label entities.

Once the EntityRuler is created and instructed, it can be added to the spaCy pipeline as a new **pipe**.
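
As a minimal sketch of that workflow, the snippet below creates an EntityRuler from its factory name, gives it two patterns, and runs it. It assumes spaCy 3.x and uses a blank English pipeline (tokenizer only) so that no pretrained model is required; the sample sentence and patterns are illustrative.

```python
import spacy

# A blank English pipeline contains only a tokenizer.
nlp = spacy.blank("en")

# Create the EntityRuler through its factory name and add it as a pipe.
ruler = nlp.add_pipe("entity_ruler")

# Each pattern pairs a label with an exact string or a list of token descriptions.
ruler.add_patterns([
    {"label": "GPE", "pattern": "Warszawa"},
    {"label": "GPE", "pattern": [{"LOWER": "poland"}]},
])

doc = nlp("The capital of Poland is Warszawa.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)
print(entities)
```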

A pipe is a component of a pipeline that takes input data, performs operations, and outputs new data or extracts metadata. In spaCy, different pipes handle different tasks. The **tokenizer** splits the text into individual tokens, the **parser** parses the text, and the **ner** pipe identifies entities and assigns labels. The processed data is stored in the Doc object.

Pipelines are sequential, meaning earlier components affect the input received by later components. This sequence can be crucial, as later pipes may depend on earlier ones. However, some pipes can function independently of earlier ones. Remember this when creating custom spaCy models or any pipeline.
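
To make the ordering concrete, here is a small sketch (assuming spaCy 3.x) that adds the same two built-in components to blank pipelines in different orders. The `sentencizer` is used only as a convenient second component; the variable names are illustrative.

```python
import spacy

# By default, add_pipe appends the new component at the end of the pipeline.
nlp_default = spacy.blank("en")
nlp_default.add_pipe("sentencizer")
nlp_default.add_pipe("entity_ruler")
default_order = list(nlp_default.pipe_names)

# The before/after arguments place a component relative to an existing one.
nlp_explicit = spacy.blank("en")
nlp_explicit.add_pipe("sentencizer")
nlp_explicit.add_pipe("entity_ruler", before="sentencizer")
explicit_order = list(nlp_explicit.pipe_names)

print(default_order)
print(explicit_order)
```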

[<ins>Full spaCy EntityRuler documentation</ins>](https://spacy.io/api/entityruler)

## Examples

We will try to identify Warszawa (Warsaw).

In [7]:
import spacy

# Build upon the spaCy sm
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The capital of Poland is Warszawa. Warszawa was demolished during World War 2."

# Doc object
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(f"Entity text: {ent.text:<10}, entity label: {ent.label_}")

Entity text: Poland    , entity label: GPE
Entity text: Warszawa  , entity label: PERSON
Entity text: Warszawa  , entity label: PERSON
Entity text: World War 2, entity label: EVENT


The code above shows how spaCy's small model handles "Warszawa": it labels both occurrences as PERSON. This result suggests the model fails to generalize to unfamiliar data, potentially because it had little or no exposure to the word "Warszawa" during training.

This issue is common in NLP when dealing with specific domains. Off-the-shelf models often struggle in domains where they haven't been trained on domain-specific texts. However, we can address this by using spaCy's EntityRuler or by training a new model.

Now we will try to correctly identify "Warszawa" using the EntityRuler.

In [9]:
import spacy

# Build upon the spaCy sm
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The capital of Poland is Warszawa. Warszawa was demolished during World War 2."
# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [
    {"label": "GPE", "pattern": "Warszawa"}
]

ruler.add_patterns(patterns)

# Doc object
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(f"Entity text: {ent.text:<10}, entity label: {ent.label_}")

Entity text: Poland    , entity label: GPE
Entity text: Warszawa  , entity label: PERSON
Entity text: Warszawa  , entity label: PERSON
Entity text: World War 2, entity label: EVENT


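Our method gave us the same output. That's because when we create the EntityRuler and add it to the model's pipeline, spaCy adds the new pipe to the end of the pipeline by default. The **ner** pipe has therefore already labeled "Warszawa" before the EntityRuler runs, and by default the EntityRuler does not overwrite existing entities.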

In [10]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

For our EntityRuler to have primacy, we have to place it before the **ner** pipe.

In [13]:
import spacy

# Build upon the spaCy sm
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The capital of Poland is Warszawa. Warszawa was demolished during World War 2."
# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler", before="ner")

# List of Entities and Patterns
patterns = [
    {"label": "GPE", "pattern": "Warszawa"}
]

ruler.add_patterns(patterns)

# Doc object
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(f"Entity text: {ent.text:<10}, entity label: {ent.label_}")

Entity text: Poland    , entity label: GPE
Entity text: Warszawa  , entity label: GPE
Entity text: Warszawa  , entity label: GPE
Entity text: World War 2, entity label: EVENT
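
Placing the ruler before **ner** is one option. Another, sketched below, is to leave the ruler at the end and set its `overwrite_ents` config option so that it replaces overlapping entities. This sketch assumes spaCy 3.x. To keep it runnable without a pretrained model, a first EntityRuler named `stand_in_ner` (a hypothetical name) plays the role of the statistical **ner** pipe and deliberately mislabels "Warszawa".

```python
import spacy

text = "The capital of Poland is Warszawa."

def run(overwrite):
    """Return the entities found when the last ruler may or may not overwrite."""
    nlp = spacy.blank("en")
    # Stand-in for the statistical ner pipe: labels Warszawa as PERSON first.
    first = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    first.add_patterns([{"label": "PERSON", "pattern": "Warszawa"}])
    # Our ruler is added last, which is the default position.
    ruler = nlp.add_pipe(
        "entity_ruler", name="custom_ruler", config={"overwrite_ents": overwrite}
    )
    ruler.add_patterns([{"label": "GPE", "pattern": "Warszawa"}])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

without_overwrite = run(False)
with_overwrite = run(True)
print(without_overwrite)
print(with_overwrite)
```

With `overwrite_ents` left at its default, the earlier label survives; with it enabled, the later ruler wins.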


## Complex Rules and Variance to the EntityRuler

Labels in NER can have specific variations that follow distinct patterns. One such example is a phone number, which can have different formats. In the United States, the standard format is (xxx)-xxx-xxxx, but variations like xxx-xxx-xxxx or xxxxxxxxxx are also common. If a US phone number is provided to someone outside the US, it may be represented as +1(xxx)-xxx-xxxx.

When working within a United States domain, we can utilize **regular expressions** (RegEx) with the pattern matcher in spaCy to capture these different instances.
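
Before moving to spaCy patterns, the formats listed above can be captured with a plain regular expression from Python's standard `re` module. This is a minimal sketch: the exact expression, the optional `+1` prefix handling, and the helper name `is_us_phone` are assumptions made for illustration.

```python
import re

# One expression for the formats named above: an optional "+1" prefix, an area
# code with or without parentheses, and optional hyphens between the groups.
PHONE_RE = re.compile(r"(?:\+1\s?)?(?:\(\d{3}\)[\s-]?|\d{3}-?)\d{3}-?\d{4}")

def is_us_phone(candidate):
    """Return True if the whole string is one of the supported formats."""
    return PHONE_RE.fullmatch(candidate) is not None

samples = ["(555)-555-5555", "555-555-5555", "5555555555", "+1(555)-555-5555", "555-5555"]
for sample in samples:
    print(sample, is_us_phone(sample))

# findall returns whole matches because every group is non-capturing.
found = PHONE_RE.findall("Call (555) 555-5555 or 555-555-5555.")
print(found)
```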

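The spaCy EntityRuler allows the user to incorporate complex rules and variations, including RegEx patterns, by specifying them in the patterns argument. Each pattern is a dictionary with a label and either an exact string or a list of token descriptions, where each description matches one token by attributes such as `ORTH` (exact text) or `SHAPE` (character shape, e.g. `ddd` for three digits).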

In [14]:
import spacy

text = "This is a sample number (555) 555-5555."

# Start from a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")

# Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
     {"ORTH": "-"}, {"SHAPE": "dddd"}]}
]
# Add patterns to ruler
ruler.add_patterns(patterns)

# Doc object
doc = nlp(text)

# extract entities
for ent in doc.ents:
    print(f"Entity text: {ent.text:<10}, entity label: {ent.label_}")

Entity text: (555) 555-5555, entity label: PHONE_NUMBER
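
The same ruler can hold one pattern per format. The sketch below, assuming spaCy 3.x and its default English tokenizer, adds patterns for the hyphenated and the unseparated ten-digit variants. The last pattern uses the `REGEX` operator on the token text, because `SHAPE` alone cannot express an exact digit count for long digit runs. The sample sentences are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # (555) 555-5555
    {"label": "PHONE_NUMBER", "pattern": [
        {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
        {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]},
    # 555-555-5555 (the tokenizer splits the hyphens between digits)
    {"label": "PHONE_NUMBER", "pattern": [
        {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"},
        {"ORTH": "-"}, {"SHAPE": "dddd"}]},
    # 5555555555 as a single token of exactly ten digits
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{10}$"}}]},
])

texts = [
    "Call (555) 555-5555 today",
    "Call 555-555-5555 today",
    "Call 5555555555 today",
]
found = [[ent.text for ent in nlp(t).ents] for t in texts]
for text, ents in zip(texts, found):
    print(f"{text!r}: {ents}")
```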
