<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy 2` `2`

This is lesson `1` of 3 in the educational series on `spaCy and NLP`. This notebook is intended `to teach the spaCy EntityRuler and the basics of Rules-Based NLP`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* A basic understanding of spaCy (see notebooks 1-3)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Loading data with Pandas
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Use the attributes available in spaCy patterns
2. Use the SpanRuler
```
___

In [3]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!pip install pandas
!pip install en-hobbit

In [1]:
import pandas as pd
from spacy import displacy
import spacy



# Required Data

We will be using a CSV file that contains a list of characters in Lord of the Rings (see link below).


## Download Required Data

In [2]:
### Grab files with Pandas' read CSV
url = "https://raw.githubusercontent.com/juandes/lotr-names-classification/master/characters_data.csv"
df = pd.read_csv(url)
df

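| Unnamed: 0 | name | race |
| --- | --- | --- |
| 0 | Aragorn II | Man |
| 1 | Arwen | Elf |
| 2 | Elrond | Elf |
| 3 | Celebrían | Elf |
| 4 | Elrohir | Elf |
| ... | ... | ... |
| 822 | Brodda | Man |
| 823 | Annael | Elf |
| 824 | Gelmir | Elf |
| 825 | Arminas | Elf |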


# Introduction

In this notebook, we will begin looking more closely at the patterns we can cultivate. We will also continue our work with Hobbit spaCy. The first version of this pipeline is now available [here](https://github.com/wjbmattingly/hobbit-spacy).

![HobbitspaCy Image](../images/hobbitspacy.png)

The pipeline can be installed with pip:

```bash
pip install en-hobbit
```

Once installed, it can be loaded up like any other spaCy model.

In [4]:
nlp = spacy.load("en_hobbit")

Let's see how it performs with some real Tolkien data.

In [5]:
with open("../data/lotr.txt", "r") as f:
    text = f.read()
text[:500]

'     Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the yellow leaves was glimmering, and the woven nets of gossamer twinkled on every bush. Sam walked beside him, saying nothing. but sniffing the air, and looking every now and again with wonder in his eyes at the great heights in the East. The snow wa'

In [7]:
doc = nlp(text[:2000])
displacy.render(doc, style="ent")

We can make this look a bit nicer by adding some custom colors to our output. Let's see what this looks like with these changes.

In [8]:
colors = {
    'HOBBIT': "#ADD8E6",   # Light blue
    'WIZARD': "#FFC0CB",   # Pink
    'REALM': "#FFFFE0",    # Light yellow
    'MAN': "#E6E6FA",      # Lavender
    'DWARF': "#98FB98",    # Pale green
    'ELF': "#FFE4B5",      # Moccasin
    'AINUR': "#FFDAB9"     # Peachpuff
}

options = {"ents": ['HOBBIT', 'WIZARD', 'REALM', 'MAN', 'DWARF', 'ELF', 'AINUR'], "colors": colors}
print(doc.spans["ruler"])
displacy.render(doc, style="ent", options=options)

[]


That looks better. Today, we will build off the basics from last class and learn how to create a pipeline that can be saved and perform these exact tasks. We will also learn the benefits and limitations of working with open-source datasets, such as a list of characters available on GitHub.

# Open Source Datasets

Real-world open-source datasets are a double-edged sword. While they provide a great starting point, they are rarely precisely what you need. The reason is that datasets are curated by humans who either make mistakes or tailor the dataset to a specific use case. This means that when you obtain an open-source dataset, you should never use it in its original form without considering your own use case.

Let's dive in and take a look at a real open-source dataset available to us on GitHub that is a list of characters from Lord of the Rings. We can use Pandas to grab the data. If you want to learn about Pandas, the TAP Institute has many great resources available to you. We will not use Pandas in this notebook except to grab the data.

In [9]:
### Grab files with Pandas' read CSV
url = "https://raw.githubusercontent.com/juandes/lotr-names-classification/master/characters_data.csv"
df = pd.read_csv(url)
df

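| Unnamed: 0 | name | race |
| --- | --- | --- |
| 0 | Aragorn II | Man |
| 1 | Arwen | Elf |
| 2 | Elrond | Elf |
| 3 | Celebrían | Elf |
| 4 | Elrohir | Elf |
| ... | ... | ... |
| 822 | Brodda | Man |
| 823 | Annael | Elf |
| 824 | Gelmir | Elf |
| 825 | Arminas | Elf |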


Our data is in CSV, or comma-separated value, format. This tabular data has two fields, name and race. Remember, in Pandas `.name` is also a built-in attribute: on a Series (including a single row) it holds the object's name or index label, so attribute access can be ambiguous. Always use `['name']` to select the name field, rather than `.name`.
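
To make the difference between `.name` and `['name']` concrete, here is a minimal sketch. The two-row `toy` DataFrame is a hypothetical stand-in for the real CSV, used only for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset
toy = pd.DataFrame({"name": ["Aragorn II", "Arwen"], "race": ["Man", "Elf"]})

# Select the first row; a single row is a pandas Series
first_row = toy.iloc[0]

# Attribute access returns the Series' own name, which for a row is its index label
label_via_attribute = first_row.name

# Bracket access returns the value stored in the "name" column
value_via_brackets = first_row["name"]

print(label_via_attribute)
print(value_via_brackets)
```

Bracket access always refers to the column, which is why it is the safer habit when a column shares its name with an existing attribute.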

The names are various characters in LOTR while the race field is their race. We have five races in this dataset: Man, Elf, Ainur, Dwarf, and Hobbit. For our purposes, this will work.

In [10]:
df.race.unique()

array(['Man', 'Elf', 'Ainur', 'Dwarf', 'Hobbit'], dtype=object)

This list is a gazetteer, or a list of entities that can be mapped to specific labels. In our case, we want to map an individual person to a specific entity label. We have two options here. One is to stick with the original spaCy set of entity labels and assign the label of `PERSON` to each of these people. This would be useful because we can then immediately identify all people in a given portion of a Tolkien text.

Another option is to use the dataset's original naming convention and have each race be its own label. This is a tough decision to make and either can be justified depending on the goals of a project. This is, however, one of the key limitations of approaching this as a hard classification problem where each token can only have a single label. In spaCy, we have two ways to resolve this issue: we can either use custom attributes (which function like metadata) for each token or we can use a SpanRuler. We will meet the SpanRuler later in this notebook and the custom attributes in the next notebook.

For now, let's presume that we want to use each race as a label. How can we convert this dataset into something that we can use via spaCy?

# Data Manipulation

One of the fundamental skills you will develop while working in NLP is data manipulation. Data manipulation is the process by which we change our data to fit a specific need. Data, even curated datasets, are often messy and unreliable. Even if the dataset is perfect, you often still need to manipulate it to get it into the format you need.

Let's take this CSV file and make it useable. Because I am presuming no knowledge of Pandas, we will work with the data as two separate lists: `names` and `races`.

In [11]:
names = df["name"].tolist()
races = df["race"].tolist()

In [12]:
print(len(names))
print(len(races))

827
827


In [15]:
print(names[:2])
print(races[:2])

['Aragorn II', 'Arwen']
['Man', 'Elf']


Our goal is to get this data into the format that spaCy expects. Remember, a pattern looks like this:

```python
{"pattern": <pattern>, "label": <label>}
```

In [16]:
patterns = []

for name, race in zip(names, races):
    patterns.append({"pattern": name, "label": race})
patterns[:2]

[{'pattern': 'Aragorn II', 'label': 'Man'},
 {'pattern': 'Arwen', 'label': 'Elf'}]

In [17]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
nlp.pipe_names

['entity_ruler']

In [19]:
doc = nlp(text[:2500])
displacy.render(doc, style="ent", options=options)

## First Exercise: Identify the Source of the Problem

In [None]:
for name in names:
    <CHECK TO SEE IF FRODO IS HERE>
        print(name)
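
One possible way to fill in the placeholder above is a simple substring test. This is only a sketch: `sample_names` is a hypothetical stand-in for the `names` list built from the CSV, so the real result may differ.

```python
# Hypothetical sample standing in for the `names` list built from the CSV
sample_names = ["Aragorn II", "Frodo Baggins", "Arwen", "Bilbo Baggins"]

# Substring test: keep any entry that mentions "Frodo" anywhere in the string
frodo_entries = [name for name in sample_names if "Frodo" in name]

print(frodo_entries)
```

If a character is stored only under a full name such as "Frodo Baggins", a string pattern built from that entry matches the full name but not the first name on its own.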

# Pattern Attributes

There are many ways that we can modify our patterns that give us a great deal of flexibility in how we construct rules in spaCy to match a specific sequence of tokens. Here is a complete list, all of which can be found in the official spaCy docs [here](https://spacy.io/usage/rule-based-matching).


## Token Attributes

| Attribute | Description |
| --- | --- |
| ORTH | The exact verbatim text of a token. str |
| TEXT | The exact verbatim text of a token. str |
| NORM | The normalized form of the token text. str |
| LOWER | The lowercase form of the token text. str |
| LENGTH | The length of the token text. int |
| IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. bool |
| IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase. bool |
| IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, stop word. bool |
| IS_SENT_START | Token is start of sentence. bool |
| LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email. bool |
| SPACY | Token has a trailing space. bool |
| POS, TAG, MORPH, DEP, LEMMA, SHAPE | The token’s simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications. str |
| ENT_TYPE | The token’s entity label. str |
| _ | Properties in custom extension attributes. Dict[str, Any] |
| OP | Operator or quantifier to determine how often to match a token pattern. str |


## Extended Pattern Syntax and Attributes

| Attribute | Description |
| --- | --- |
| IN | Attribute value is member of a list. Any |
| NOT_IN | Attribute value is not member of a list. Any |
| IS_SUBSET | Attribute value (for MORPH or custom list attributes) is a subset of a list. Any |
| IS_SUPERSET | Attribute value (for MORPH or custom list attributes) is a superset of a list. Any |
| INTERSECTS | Attribute value (for MORPH or custom list attributes) has a non-empty intersection with a list. Any |
| ==, >=, <=, >, < | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. Union[int, float] |

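As a small illustration of the extended syntax, the sketch below uses `IN` together with `LOWER` and `IS_TITLE` in a spaCy `Matcher`. The sentence and the `KINSHIP` label are made up for this example, and a blank English pipeline is assumed so that no trained model is required.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for these attributes
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "son" or "daughter" (any casing), then "of", then a title-cased token
pattern = [
    {"LOWER": {"IN": ["son", "daughter"]}},
    {"LOWER": "of"},
    {"IS_TITLE": True},
]
matcher.add("KINSHIP", [pattern])  # hypothetical label for illustration

doc = nlp("Gimli, Son of Glóin, and Arwen, daughter of Elrond, were present.")

# Sort matches by start position so the order is predictable
kinship = [doc[start:end].text
           for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
print(kinship)
```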
## Operators and Quantifiers (Keys for OP)

| OP | Description |
| --- | --- |
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match zero or more times. |
| {n} | Require the pattern to match exactly n times. |
| {n,m} | Require the pattern to match at least n but not more than m times. |
| {n,} | Require the pattern to match at least n times. |
| {,m} | Require the pattern to match at most m times. |

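To see an operator in action, here is a minimal sketch that makes a surname optional with `?`. The sentence and the `HOBBIT` label are chosen for illustration only. The `greedy="LONGEST"` argument to `Matcher.add` is used here so that, where matches overlap, only the longest one is kept.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "Frodo" is required; "Baggins" may appear zero or one time
pattern = [{"TEXT": "Frodo"}, {"TEXT": "Baggins", "OP": "?"}]

# Keep only the longest of any overlapping matches
matcher.add("HOBBIT", [pattern], greedy="LONGEST")

doc = nlp("Frodo Baggins met Frodo.")

# Sort matches by start position so the order is predictable
matched = [doc[start:end].text
           for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
print(matched)
```

Without the `greedy` argument, the shorter overlapping match would also be expected in the results, because the matcher reports every way the optional token can be satisfied.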


# Using Pattern Attributes

We can leverage these attributes at the token level to create rules that flag all the ways in which a name may appear, e.g. Baggins by itself, Frodo by itself, etc. One way to do this is to iterate over each token in each name and add the operator `*`, which allows the token to match zero or more times, so it may or may not be there. If one or more of the tokens in the sequence are found, the sequence is flagged as a match. By default, it will prefer the longest sequence. This means Frodo Baggins will receive priority over Frodo or Baggins individually. Let's implement this solution as a pattern.

This block of code starts by initializing an empty list `patterns2`. This is so that we don't have to replace `patterns` above.

```python
patterns2 = []
```

The `for` loop iterates over two lists, `names` and `races`, at the same time using the `zip()` function. For each pair of elements (referred to as `name` and `race`), it executes the body of the loop. You can learn more about `zip()` [here](https://www.youtube.com/watch?v=Ek49cGiAOwo).

```python
for name, race in zip(names, races):
```

Within this loop, another empty list `token_patterns` is initialized. We will append our patterns at the token level here.

```python
    token_patterns = []
```

Then, another for loop splits each `name` into tokens (words) using the `split()` function and for each token, it appends a dictionary to the `token_patterns` list. This dictionary has two keys - "TEXT", whose value is the token, and "OP", whose value is "\*". The "\*" operator used here in the "OP" key implies that the pattern can match zero or more times. The "TEXT" key is used to specify the exact text of the token to be matched. We will see one of the limitations of this approach below.

```python
    for token in name.split():
        token_patterns.append({"TEXT": token, "OP": "*"})
```

After this inner loop finishes, another dictionary is appended to the `patterns2` list. This dictionary has two keys - "pattern", whose value is the list `token_patterns`, and "label", whose value is `race`.

```python
    patterns2.append({"pattern": token_patterns, "label": race})
```

Finally, we examine the first 2 patterns.

```python
patterns2[:2]
```


In [22]:
patterns2 = []

for name, race in zip(names, races):
    token_patterns = []
    for token in name.split():
        token_patterns.append({"TEXT": token, "OP": "*"})
    patterns2.append({"pattern": token_patterns, "label": race})
patterns2[:2]

[{'pattern': [{'TEXT': 'Aragorn', 'OP': '*'}, {'TEXT': 'II', 'OP': '*'}],
  'label': 'Man'},
 {'pattern': [{'TEXT': 'Arwen', 'OP': '*'}], 'label': 'Elf'}]

As we can see in the example below, this is not a great solution. What has gone wrong?

In [23]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns2)
nlp.pipe_names
doc = nlp(text[:2500])
displacy.render(doc, style="ent", options=options)

## Second Exercise: Isolate the Problem

This will be our second exercise. Identify why this is occurring, specifically all the occurrences of `the` and `I`.

In [None]:
for name, race in zip(names, races):
    <ADD A CONDITION TO ISOLATE WHAT MAY BE THE ISSUE>
        print(name)
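
One way to approach this exercise is to look for names that contain very short or lowercase tokens, since each token becomes its own optional match. The sketch below assumes a hypothetical `sample` list in place of the real `names` and `races`, so the flagged entries are illustrative only.

```python
# Hypothetical (name, race) pairs standing in for zip(names, races)
sample = [
    ("Aragorn II", "Man"),
    ("Durin the Deathless", "Dwarf"),
    ("Arwen", "Elf"),
]

suspects = []
for name, race in sample:
    # Flag tokens that are lowercase-initial or at most two characters long
    risky_tokens = [tok for tok in name.split()
                    if not tok[0].isupper() or len(tok) <= 2]
    if risky_tokens:
        suspects.append((name, risky_tokens))

print(suspects)
```

A token such as `the` in a name will match every `the` in the text once it is given the `*` operator, which is consistent with the stray highlights seen above.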

# Further Cleaning

In [26]:
patterns3 = []

for name, race in zip(names, races):
    token_patterns = []
    for token in name.split():
        # Keep only capitalized tokens longer than two characters
        if token[0].isupper() and len(token) > 2:
            token_patterns.append({"TEXT": token, "OP": "*"})
    # Append once per name, skipping names with no usable tokens
    if token_patterns:
        patterns3.append({"pattern": token_patterns, "label": race})
patterns3[:2]

[{'pattern': [{'TEXT': 'Aragorn', 'OP': '*'}], 'label': 'Man'},
 {'pattern': [{'TEXT': 'Arwen', 'OP': '*'}], 'label': 'Elf'}]

In [27]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns3)
nlp.pipe_names
doc = nlp(text[:2500])
displacy.render(doc, style="ent", options=options)

One of the biggest limitations of this approach, however, is that we cannot clearly see that each of these individuals is also a PERSON. To flag each as a person, a user would need a downstream step that checks whether each `ent.label_` in `doc.ents` is one of our five categories. How do we make it so that a user does not have to do this? How can we make it so that they only have to check to see if someone is a `PERSON` once? This is where the SpanRuler comes into focus.
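
The downstream check described above might look like the following sketch. The three string patterns and the sentence are invented for illustration, and `RACE_LABELS` simply repeats the five race values found in the dataset.

```python
import spacy

# The five race labels from the dataset
RACE_LABELS = {"Man", "Elf", "Ainur", "Dwarf", "Hobbit"}

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only; the real patterns come from the CSV
ruler.add_patterns([
    {"label": "Hobbit", "pattern": "Frodo"},
    {"label": "Elf", "pattern": "Elrond"},
    {"label": "REALM", "pattern": "Rivendell"},
])

doc = nlp("Frodo met Elrond in Rivendell.")

# Every consumer has to repeat this membership test to find people
people_from_ents = [ent.text for ent in doc.ents if ent.label_ in RACE_LABELS]
print(people_from_ents)
```

Each user of the pipeline would need to know the full list of race labels, which is the burden the SpanRuler approach removes.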

# Introduction to SpanRuler

In spaCy, we can assign an individual token multiple labels via the SpanRuler. The SpanRuler stores data inside `doc.spans`, which behaves like a dictionary. You can have as many SpanRulers as you like in a pipeline, and each can store its data under a different key. By default, the SpanRuler stores data in `doc.spans["ruler"]`. The nice thing about the SpanRuler is that we can construct patterns in precisely the same way. In the code below, we are going to add an extra label (PERSON) to each individual in our dataset.
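
As a sketch of how two SpanRulers can write to different keys, the example below passes a `spans_key` value in each component's config. The component names (`race_ruler`, `person_ruler`), the key names (`races`, `people`), and the patterns are hypothetical choices made for this illustration, and a spaCy version that ships the `span_ruler` component is assumed.

```python
import spacy

nlp = spacy.blank("en")

# First SpanRuler writes to doc.spans["races"]
race_ruler = nlp.add_pipe(
    "span_ruler", name="race_ruler", config={"spans_key": "races"}
)
race_ruler.add_patterns([{"label": "ELF", "pattern": "Elrond"}])

# Second SpanRuler writes to doc.spans["people"]
person_ruler = nlp.add_pipe(
    "span_ruler", name="person_ruler", config={"spans_key": "people"}
)
person_ruler.add_patterns([{"label": "PERSON", "pattern": "Elrond"}])

doc = nlp("Elrond spoke to Frodo.")

race_spans = [(span.text, span.label_) for span in doc.spans["races"]]
person_spans = [(span.text, span.label_) for span in doc.spans["people"]]
print(race_spans)
print(person_spans)
```

The rest of this notebook keeps things simpler and uses a single SpanRuler with the default `ruler` key.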

In [47]:
span_patterns = []

for name, race in zip(names, races):
    token_patterns = []
    for token in name.split():
        if token[0].isupper() and len(token) > 2:
            token_patterns.append({"TEXT": token, "OP": "*"})
    # Append once per name: one pattern for the race and one for PERSON
    if token_patterns:
        span_patterns.append({"pattern": token_patterns, "label": race.upper()})
        span_patterns.append({"pattern": token_patterns, "label": "PERSON"})
span_patterns[:2]

[{'pattern': [{'TEXT': 'Aragorn', 'OP': '*'}], 'label': 'MAN'},
 {'pattern': [{'TEXT': 'Aragorn', 'OP': '*'}], 'label': 'PERSON'}]

Let's try to visualize that data again.

In [56]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(patterns3)
nlp.pipe_names
doc = nlp(text[:2500])
displacy.render(doc, style="ent", options=options)



Uh oh! It's blank! Why is that? Two things. First, `displacy` expects the style `ent` to find its data in `doc.ents`. Because this is a SpanRuler, we don't have anything in `doc.ents`; our data is stored in `doc.spans["ruler"]` instead, so we need to change the keyword argument `style` to `span`. Second, we need to tell displaCy which key in `doc.spans` to read. We will, therefore, add `spans_key` to our `options` dictionary with the value `ruler`, the name of the key in `doc.spans` under which our spans sit.

In [57]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(span_patterns)
nlp.pipe_names
doc = nlp(text[:2500])
options["spans_key"] = "ruler"
displacy.render(doc, style="span", options=options)

As we can see, we can now loop through our data and grab an individual if they are a PERSON or a specific race, such as ELF. Let's see how that might work. Let's say I wanted to grab all elves in this text.

In [58]:
elves = []
for span in doc.spans["ruler"]:
    if span.label_ == "ELF":
        elves.append(span.text)
print(elves)

['Elrond', 'Elrond', 'Glorfindel', 'Elrond', 'Glorfindel']


Likewise, what if I wanted to grab all people regardless of race?

In [59]:
people = []
for span in doc.spans["ruler"]:
    if span.label_ == "PERSON":
        people.append(span.text)
print(people)

['Frodo', 'Gandalf', 'Bilbo', 'Bilbo', 'Frodo', 'Gandalf', 'Elrond', 'Gandalf', 'Bilbo', 'Frodo', 'Bilbo', 'Gandalf', 'Frodo', 'Frodo', 'Elrond', 'Frodo', 'Glorfindel', 'Glóin', 'Elrond', 'Frodo', 'Frodo', 'Drogo', 'Frodo', 'Glóin', 'Gimli', 'Glorfindel']


# Structuring an EntityRuler and a SpanRuler

The nice thing about spaCy is that we can structure complex pipelines that can inherit from earlier pipes. This means we can have an EntityRuler and a SpanRuler in the same pipeline. This can be very powerful. Imagine we wanted a SpanRuler that needed to use EntityType (Label) for a rule. We could have an EntityRuler assign labels and then use that data in a SpanRuler.

Imagine we wanted to find all places where Tolkien uses the construction `X the son/daughter of Y`. In order to find all possible ways this is constructed, we could construct elaborate rules with all the variant ways names may appear or proper nouns may be used (reliant upon a good Tagger). Instead, we could have one pipe assign EntityType to all people and then simply look for any place where there is a construction of PERSON son/daughter of PERSON. Let's see how this would look as a spaCy pattern.

In [61]:
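# Unique race labels assigned by the EntityRuler (e.g. "Man", "Elf")
race_labels = sorted(set(races))

relationship_pattern = [
    {"pattern": [
        {"ENT_TYPE": {"IN": race_labels}},
        {"TEXT": {"IN": ["son", "daughter"]}},
        {"TEXT": "of"},
        {"ENT_TYPE": {"IN": race_labels}},
    ], "label": "RELATION"}
]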


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns3)

ruler2 = nlp.add_pipe("span_ruler")
ruler2.add_patterns(relationship_pattern)
print(nlp.pipe_names)
doc = nlp(text[:2500])
options["spans_key"] = "ruler"
displacy.render(doc, style="span", options=options)

['entity_ruler', 'span_ruler']


In [62]:
doc.spans

{'ruler': [Frodo son of Drogo]}

Notice that we were able to grab `Frodo son of Drogo` with a single rule.