<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy 2` `1`

This is lesson `1` of 3 in the educational series on `spaCy and NLP`. This notebook is intended `to teach the spaCy EntityRuler and the basics of Rules-Based NLP`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* A basic understanding of spaCy (see notebooks 1-3)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Loading data with Pandas
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Learn about the spaCy EntityRuler and how to apply it
2. Learn about spaCy Patterns
3. Learn about spaCy rules-based pipelines
```
___

# Required Python Libraries
* [spaCy](https://spacy.io/) for building and running the NLP pipeline, including the EntityRuler.
* [Pandas](https://pandas.pydata.org/) for manipulating and cleaning data.

## Install Required Libraries

In [18]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!pip install pandas
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg


In [None]:
import pandas as pd
from spacy import displacy
import spacy

# Required Data

We will be using a CSV file that contains a list of characters in Lord of the Rings (see link below).


## Download Required Data

In [None]:
### Grab files with Pandas' read CSV
url = "https://raw.githubusercontent.com/juandes/lotr-names-classification/master/characters_data.csv"
df = pd.read_csv(url)
df


# Introduction

In this notebook, we will be looking at the basic rules-based components available to you in spaCy. We will look briefly at the `Matcher` and `PhraseMatcher`, and then do a deep dive into the `EntityRuler`. This will set us up so that in the next notebook, we can look at the `SpanRuler`. All of this will lay the groundwork for understanding how to design and implement custom components in spaCy in the final notebook from this week.

We are starting this week with a single problem that we will solve over the next 2 weeks. We want to create from scratch a spaCy pipeline specifically designed to work with texts from Lord of the Rings. We want it to correctly identify Middle Earth people and places. We also want it to identify domain-specific entity types, such as named weapons (for example, Orcrist and Glamdring). Because we are not great at coming up with names, we are going to simply call this HobbitspaCy. The goal is to make this available to others on HuggingFace, an open-source repository and framework for machine learning models, datasets, and spaCy pipelines (among many other things). Think of it as a GitHub specifically designed for machine learning.

![HobbitspaCy Image](../images/hobbitspacy.png)


# What is Rules-Based NLP?

Rules-based natural language processing is the process by which we design explicit rules for performing specific tasks. Rules-based NLP should be viewed as distinctly different from machine learning NLP, in which statistical models learn to recognize features of the data from training examples (we will meet this later).

A good way to think about rules-based NLP is to think about a list. Imagine you have a list of characters in a book that you want to find. Would you need to train a machine learning model to recognize all those characters? Probably not. If it is a popular book, such as Lord of the Rings, character lists already exist. You can find them on numerous sites as HTML or even on GitHub as CSV files. Will these lists have mistakes? Yes, most likely. Often when we work with datasets that others have curated, there will be mistakes or things we don't like. Nevertheless, these available datasets often provide a good starting point.

Now, imagine the named entity recognition model in a spaCy pipeline. Let's see how well it performs on a fairly simple sentence describing the realm of Gondor in Middle Earth.

We will be going over the code in more depth later in this notebook. For now, focus on the output.

In [16]:
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the Doc object
doc = nlp(text)

displacy.render(doc, style="ent")

I would argue that Gondor should be a place, either a GPE or LOC. Why has it not been correctly identified? There are a few different reasons for this. First, Gondor likely did not appear in the training data for the English small model or, if it did, it was quite rare. The training data for that model likely consisted of more real-world places. Second, the small model does not ship with static word vectors, meaning it cannot make good predictions on out-of-scope data, such as Gondor. We use the term out-of-scope to refer to data that is far removed from the core of the training data.

Let's see how everything works with the medium model.

In [17]:
#Build upon the spaCy Medium Model
nlp = spacy.load("en_core_web_md")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the Doc object
doc = nlp(text)

displacy.render(doc, style="ent")

We have identical results. And what about the large model?

In [20]:
#Build upon the spaCy Large Model
nlp = spacy.load("en_core_web_lg")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the Doc object
doc = nlp(text)

displacy.render(doc, style="ent")

Notice that the large model is better. It knows that `Rings` is not an `ORG`. Nevertheless, if we wanted to run this model across Lord of the Rings, we can likely expect mixed results, even with the large model.

Let's take a look at the benefits of a transformer model (not downloaded in this notebook).

In [22]:
#Build upon the spaCy Transformer Model
nlp = spacy.load("en_core_web_trf")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the Doc object
doc = nlp(text)

displacy.render(doc, style="ent")

Here we see marked improvement. We have correctly grabbed Gondor as a `GPE` and Lord of the Rings as a `WORK_OF_ART`. Why is that? Because the transformer model is able to make better predictions based on the ways in which it learns from training data and creates vectors. We will learn more about this next week. Without running the transformer model over all of Lord of the Rings, I would venture to guess that we still would not have perfect results.

In this notebook, we will begin to correct that by designing a custom HobbitspaCy pipeline from scratch. It will have a rules-based EntityRuler, custom components, and even a machine learning NER.

For now, let's dive into the rules-based components available to us in spaCy.

# spaCy Rules-Based Components

Why should you consider working with rules-based components inside spaCy rather than outside of it? As we will see, working with everything inside a spaCy pipeline means that you can construct complex rules not only from lists of entities, but also from grammatical features of the text.

Inside of spaCy, we have several rules-based components that we can use. In this notebook, we will be looking closely at the EntityRuler, but first let's look at an overview of the others.

## Tokenizer

In the previous notebooks, we have met the tokenizer. It is one of the built-in rules-based components that you can customize. Because the tokenizer breaks up the input string into individual words, or "tokens", you have a lot of flexibility as to how the data is processed. I am in the middle of designing a spaCy pipeline for DNA sequences. We needed the tokenizer to take a continuous string made up of 4 letters and break it into individual KMERs, or windowed segments. Here is an example of what that custom tokenizer looks like:


```python
import numpy as np
from spacy.tokens import Doc, Token

# Register the custom attribute that stores each k-mer's numerical value
if not Token.has_extension("numerical_value"):
    Token.set_extension("numerical_value", default=None)


class KMerTokenizer:
    name = "kmer_tokenizer"

    def __init__(self, vocab=None, kmer_size=None, window_size=None):
        self.vocab = vocab
        self.kmer_size = kmer_size
        self.window_size = window_size
        self.letter_dict = {"A": 1, "C": 2, "G": 3, "T": 4}
        self._calc = np.int64(4) ** self.kmer_size  # Calculate this once

    def __call__(self, dna):
        # Drop unknown bases (N) before building k-mers
        dna = dna.replace("N", '')
        kmers = []
        end = len(dna) - self.kmer_size + 1
        kmer_values = np.zeros(end, dtype=np.int64)

        # Compute the numerical value of the first k-mer from scratch
        kmer_value = np.int64(0)
        for i in range(self.kmer_size):
            kmer_value = kmer_value * np.int64(4) + np.int64(self.letter_dict[dna[i]])

        kmer_values[0] = kmer_value

        # Slide the window one base at a time, updating the value incrementally
        for i in range(1, end):
            kmer_value = kmer_value * np.int64(4) - np.int64(self.letter_dict[dna[i-1]]) * self._calc + np.int64(self.letter_dict[dna[i+self.kmer_size-1]])
            kmer_values[i] = kmer_value
            kmers.append(dna[i:i+self.kmer_size])

        spaces = [True] * len(kmers)

        # Build the Doc directly from the k-mers instead of whitespace tokens
        doc = Doc(self.vocab, words=kmers, spaces=spaces)

        for idx, token in enumerate(doc):
            token._.numerical_value = kmer_values[idx]
        return doc
```

I won't cover all the code here. Rather, I want to emphasize that these are custom rules that handle the conversion of a continuous text into individual tokens (words) at every pre-designed KMER size, so it would take an input that looks like this:

```python
"GGCATGCATGGCAGGCATGCATGGCA"
```

and return a list of tokens that looks like this:

```python
['GCATGCA', 'CATGCAT', 'ATGCATG', 'TGCATGG', 'GCATGGC', 'CATGGCA', 'ATGGCAG', 'TGGCAGG', 'GGCAGGC', 'GCAGGCA', 'CAGGCAT', 'AGGCATG', 'GGCATGC', 'GCATGCA']
```

This demonstrates the power and customizability of spaCy.


## Matchers

We have three Matchers in spaCy: [Matcher](https://spacy.io/api/matcher), [PhraseMatcher](https://spacy.io/api/phrasematcher), and [DependencyMatcher](https://spacy.io/api/dependencymatcher). These allow you to define patterns at the token, phrase, or dependency level to flag specific things in a text. Unlike the results of the EntityRuler and NER, their matches are not added to `doc.ents`.
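
As a minimal sketch of how two of these Matchers behave, the example below runs a `Matcher` and a `PhraseMatcher` over a short sentence. It uses a blank English pipeline so that no trained model is needed. The sentence, the rule names `BAGGINS` and `PLACE`, and the variable names are assumptions made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# A blank pipeline only contains the tokenizer, which is all we need here
nlp = spacy.blank("en")
doc = nlp("Frodo Baggins met Gandalf in Rivendell.")

# Token-level pattern: a title-cased token followed by "baggins" in any casing
matcher = Matcher(nlp.vocab)
matcher.add("BAGGINS", [[{"IS_TITLE": True}, {"LOWER": "baggins"}]])
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]

# Phrase-level pattern: an exact phrase, supplied as a Doc object
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("PLACE", [nlp.make_doc("Rivendell")])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(token_matches)
print(phrase_matches)
print(doc.ents)  # the matchers return spans but leave doc.ents untouched
```

Both matchers return `(match_id, start, end)` tuples that you slice out of the `Doc` yourself, which is why `doc.ents` is still empty afterwards.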

## Other Rules-Based Components

In these notebooks, we are interested in three different types of components: the EntityRuler (rules-based NER), the SpanRuler (which can handle overlapping entities), and custom components. We will meet each of these in turn.

# EntityRuler

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. A factory in spaCy is a set of classes and functions preloaded in spaCy that perform set tasks. In the case of the EntityRuler, the factory at hand allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.

Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe. I have spoken in the past notebooks briefly about pipes, but perhaps it is good to address them in more detail here.

A pipe is a component of a pipeline. A pipeline’s purpose is to take input data, perform some sort of operations on that input data, and then output the results either as new data or as extracted metadata. In the case of spaCy, there are a few different pipes that perform different tasks. The tokenizer splits the text into individual tokens, the parser analyzes the grammatical structure of the text, and the NER identifies entities and labels them accordingly. All of this data is stored in the Doc object, as we saw in Notebook 01_01 of this series.

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive. Sometimes this sequence is essential, meaning later pipes depend on earlier pipes. At other times, this sequence is not essential, meaning later pipes can function without earlier pipes. It is important to keep this in mind as you create custom spaCy models (or any pipeline for that matter).
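
The ordering of pipes can be inspected directly. The sketch below is only an illustration: it starts from a blank English pipeline and adds two built-in rules-based pipes, the `sentencizer` and the `entity_ruler`, using the `before` argument of `nlp.add_pipe` to control where the second one lands. The variable name `pipe_order` is made up for this example.

```python
import spacy

# A blank pipeline starts with no pipes (the tokenizer is not listed as a pipe)
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Pipes are appended to the end of the pipeline by default
nlp.add_pipe("sentencizer")

# The "before" argument places a new pipe ahead of an existing one
nlp.add_pipe("entity_ruler", before="sentencizer")

pipe_order = nlp.pipe_names
print(pipe_order)
```

When a text is processed, the pipes run in exactly the order shown in `nlp.pipe_names`.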

In this notebook, we will be looking closely at the EntityRuler as a component of a spaCy model’s pipeline. Off-the-shelf spaCy models come preloaded with an NER model; they do not, however, come with an EntityRuler. In order to incorporate an EntityRuler into a spaCy model, it must be created as a new pipe, given instructions, and then added to the model. Once this is complete, the user can save that new model with the EntityRuler to the disk.
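
Saving and reloading might look something like the sketch below. It assumes a blank English pipeline with a single pattern, and it writes to a temporary directory so that nothing is left behind; the folder name `hobbit_pipeline` is a hypothetical name chosen for this example.

```python
import tempfile
from pathlib import Path

import spacy

# Build a small pipeline that contains only an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "REALM", "pattern": "Gondor"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    # "hobbit_pipeline" is a hypothetical folder name for this illustration
    model_path = Path(tmp_dir) / "hobbit_pipeline"
    nlp.to_disk(model_path)            # writes the config and the patterns
    reloaded = spacy.load(model_path)  # loads the pipeline back from disk

doc = reloaded("Gondor is a realm in Lord of the Rings.")
saved_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(reloaded.pipe_names)
print(saved_entities)
```

In practice you would save to a permanent folder rather than a temporary one, so the pipeline can be loaded again in a later session.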

The full documentation of spaCy EntityRuler can be found here: https://spacy.io/api/entityruler .

This notebook will synthesize this documentation for non-specialists and provide some examples of it in action.

## Demonstration of the EntityRuler in Action

In the code below, we will introduce a new pipe into spaCy’s off-the-shelf small English model. The purpose of this EntityRuler will be to correctly identify realms in Middle Earth, such as Gondor. First, let's see how the small model performs on its own.

In [4]:
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Rings ORG


Depending on which version of spaCy and the `en_core_web_sm` model you are using, the results may vary.

The output from the code above demonstrates spaCy’s small model's inability to identify Gondor, which is a realm (and city) in Middle Earth. It also flagged `Rings` as `ORG`. This is incorrect.

This is a common problem in NLP for specific domains. Off-the-shelf models will often fail in the domains in which we wish to deploy them because they have not been trained on domain-specific texts. We can resolve this, however, either via spaCy’s EntityRuler or via training a new model. In this notebook, we will focus on the EntityRuler.

Before we dive into how we can use the EntityRuler, let's see a result of applying it in this specific scenario.

In [6]:
#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "REALM", "pattern": "Gondor"}
            ]

ruler.add_patterns(patterns)


doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Gondor REALM
Rings ORG


# Working with the EntityRuler

Let's dive into how the code above actually works by taking a look into what the EntityRuler is and how it works.

The EntityRuler, as noted above, is a rules-based component available to us in spaCy natively. It relies on an input of a list of patterns. These patterns will always have two components: `pattern`, or the thing you wish to label, and `label`, or the NER label you wish to assign to that pattern. Under the hood, spaCy will automatically flag anything that matches a pattern and assign it that label.

## Gazetteer 

In its simplest form, the patterns can function as a `gazetteer`. In NLP, a gazetteer is a dictionary of terms that are mapped (or assigned) to a specific category. A good way to think about this in the example above would be to think of all the realms in Middle Earth. Can we realistically create a list of these? Yes. I've begun compiling that list from various sources online. It is available under the directory `lotr` in the file `realm.txt`. Let's take a look at that list.

In [8]:
with open('../lotr/realm.txt', 'r') as f:
    realms = f.read().splitlines()
print(realms)
print(len(realms))

['Anfalas', 'Angmar', 'Anórien', 'Arnor', 'Arvernien', 'Beleriand', 'Belfalas', 'Brethil', 'Calenardhon', 'Dagorlad', 'Drúwaith Iaur', 'Eldamar', 'Falas', 'Gondor', 'Isengard', 'Khand', 'Rhovanion', 'Lebennin', 'Lindórinand', 'Lonely Mountain', 'Lossarnach', 'Lothlórien', 'Mordor', 'Moria', 'Númenor', 'Rivendell', 'Rohan', 'South Gondor', 'Valinor']
29


Here is a list of 29 realms in Middle Earth. We can use lists like this to supply a spaCy EntityRuler with a list of patterns to identify. We can then assign a label to each of these, like we did with Gondor in the example above. This label can be whatever we want. Let's make it `REALM` so that we can add some specificity to how we conceptualize places in Middle Earth.
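
One way to turn a list like this into EntityRuler patterns is sketched below. To keep the example self-contained, it uses a short hand-typed sample of the realms instead of reading `realm.txt`, and it runs on a blank English pipeline. The sample sentence and the variable names are assumptions made up for this illustration.

```python
import spacy

# A small hand-typed sample of the realms list (not the full file)
sample_realms = ["Gondor", "Mordor", "Rivendell", "South Gondor"]

# One pattern dictionary per realm, all sharing the same label
patterns = [{"label": "REALM", "pattern": realm} for realm in sample_realms]

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("Frodo left Rivendell for Mordor, far from South Gondor.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Because both `Gondor` and `South Gondor` are in the list, two patterns overlap at the end of the sentence. The EntityRuler keeps the longer match, so `South Gondor` is labeled as a single entity.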


## Patterns

What should a pattern look like? The EntityRuler takes a list of patterns as an input. Each pattern should be a dictionary that is structured like this:

```python
{"pattern": <PATTERN>, "label": <LABEL_NAME>}
```

Here, you would supply your own pattern for `<PATTERN>` and your own label for `<LABEL_NAME>`. In the example of Gondor above, a pattern would look like this:

```python
{"label": "REALM", "pattern": "Gondor"}
```

Notice that I have changed the order of the dictionary. This is perfectly fine because the values in a dictionary are looked up by key, so the order of the keys does not matter. In this example, we see a very simple pattern where `Gondor` is the pattern that we are looking for and `REALM` is our label. Let's look at the code above again now.

In [9]:
#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "Gondor is a realm in Lord of the Rings."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "REALM", "pattern": "Gondor"}
            ]

ruler.add_patterns(patterns)


doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Gondor REALM
Rings ORG


Now that we can see this code, let's break down precisely what is happening, step-by-step.

Here, we are loading up our spaCy small model, as we have for the past 3 days.

```python
# Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")
```

Here we are defining a sample text that we want to process.

```python
# Sample text
text = "Gondor is a realm in Lord of the Rings."
```

This line adds an "EntityRuler" pipeline component to the `nlp` pipeline. Notice that we are specifying which pipe we want to add. The EntityRuler's name here is `entity_ruler`. It is very important to spell this correctly. We are creating the ruler as an object so that we can easily access it via the variable `ruler` when we add our patterns.

```python
# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")
```
Here we define a list of patterns that the `EntityRuler` should recognize as named entities. In this case, we are defining a single pattern: whenever the `EntityRuler` sees the word "Gondor", it should label it as a "REALM". Notice that we are still putting patterns in a list even though we have only 1 pattern. The EntityRuler always expects a list, so make sure you structure your data in this way.

```python
# List of Entities and Patterns
patterns = [
                {"label": "REALM", "pattern": "Gondor"}
            ]
```

This line adds the patterns we defined to the `EntityRuler`. After this line executes, the `EntityRuler` will label "Gondor" as a "REALM" whenever it processes text.

```python
ruler.add_patterns(patterns)
```

With our pipeline created, we can now process our text and create the `doc` container.

```python
doc = nlp(text)
```

Finally, this code block iterates over the named entities that were recognized in the processed text. For each entity, it prints the text of the entity and its label. In our case, it prints "Gondor REALM" because "Gondor" is labeled as a "REALM" by the `EntityRuler`, followed by "Rings ORG", which still comes from the small model's NER.

```python
# Extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)
```


In [10]:
from spacy import displacy

In [11]:
displacy.render(doc, style="ent")

# First Exercise

Try to create a list of patterns to identify all the people in this text. Unlike before, we will be working with a blank spaCy model. Instead of using `spacy.load()`, we will use `spacy.blank()`. This will take one argument, the language we want to use. We will use `en`, which gives us a pipeline containing only the English tokenizer.

Come up with whatever labels you like and try to flag any entity you like. Want a pipeline to only find people? Do that. Want to find people and places? Do that. Want to label people according to nuanced labels like `HOBBIT`, `AINUR`, or `ELF`? Do that. Have fun and, most importantly, make mistakes!

One thing I would like to see is for you to correct the patterns so that Frodo is grabbed together with his family name rather than in isolation.

In [15]:
nlp = spacy.blank("en")

ruler = nlp.add_pipe("entity_ruler")

patterns = [

    {"pattern": "Frodo", "label": "PERSON"}
]

ruler.add_patterns(patterns)

text = "Frodo Baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

# Complex Patterns

The largest benefit of using spaCy is that you can leverage linguistic features to create complex patterns. Let's look at a few fun examples. Imagine we wanted to flag all proper nouns so that we could then manually verify them in our data. Part-of-speech tagging is generally far more accurate here than the NER model. The reason is that the tagger can learn the features of a proper noun and predict accurately on out-of-scope data more easily than the named entity recognition model, which needs to learn a lot more about the language to decide whether something is an entity and which label it should receive.

In [29]:
nlp = spacy.load("en_core_web_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [

    {"pattern": [
            {"POS": "PROPN"}
    ],
    "label": "PROPER_NOUN"}
]

ruler.add_patterns(patterns)

text = "Frodo Baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

Notice that some of our proper nouns span multiple words. Because "pattern" points to a list, we can treat each index in the list as an individual token. This means we can flag multi-word proper nouns by specifying a sequence of two tokens, each of which is a proper noun.

In [37]:
nlp = spacy.load("en_core_web_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [

    {"pattern": [
            {"POS": "PROPN"},
            {"POS": "PROPN"}
    ],
    "label": "MWT_PROPER_NOUN"}
]

ruler.add_patterns(patterns)

text = "Frodo Baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

Why is constructing rules like this useful? It means you can keep a list quite small. Imagine a lot of your characters have the same last name. You know for a fact that anytime a proper noun is followed by the name "Baggins", it is always (or nearly always) going to be a HOBBIT. You can take this rule and translate it into a pattern like so:

In [38]:
nlp = spacy.load("en_core_web_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [

    {"pattern": [
            {"POS": "PROPN"},
            {"TEXT": "Baggins"}
    ],
    "label": "HOBBIT"}
]

ruler.add_patterns(patterns)

text = "Frodo Baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

What if your data is messy and Baggins is sometimes lowercased because you are working with Reddit data?

In [39]:
nlp = spacy.load("en_core_web_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [

    {"pattern": [
            {"POS": "PROPN"},
            {"TEXT": "Baggins"}
    ],
    "label": "HOBBIT"}
]

ruler.add_patterns(patterns)

text = "Frodo baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

Notice that this does not work. Instead, we can create a pattern that looks for any occurrence of a proper noun followed by a token whose lowercased form is `baggins`.

In [40]:
nlp = spacy.load("en_core_web_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [

    {"pattern": [
            {"POS": "PROPN"},
            {"LOWER": "baggins"}
    ],
    "label": "HOBBIT"}
]

ruler.add_patterns(patterns)

text = "Frodo baggins is a Hobbit in Middle Earth. Gandalf is a wizard. They met in Rivendell, the home of Elrond."

doc = nlp(text)

displacy.render(doc, style="ent")

We will be learning how to build more elaborate patterns like this over the course of the week, and we will learn precisely how to apply them to Lord of the Rings.