<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy 2` `3`

This is lesson `1` of 3 in the educational series on `spaCy and NLP`. This notebook is intended `to teach the spaCy EntityRuler and the basics of Rules-Based NLP`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 

**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* A basic understanding of spaCy (see notebooks 1-3)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Loading data with Pandas
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Learn about custom components
```
___

In [63]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!pip install pandas
!python -m spacy download en_core_web_sm
!pip install en-hobbit


In [15]:
import pandas as pd
from spacy import displacy
import spacy
from collections import Counter
from spacy.language import Language
from spacy.tokens import Doc, Span
import re

# Required Data

We will be using a CSV file that contains a list of characters in Lord of the Rings (see link below).


## Download Required Data

In [3]:
### Grab files with Pandas' read CSV
url = "https://raw.githubusercontent.com/juandes/lotr-names-classification/master/characters_data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,name,race
0,Aragorn II,Man
1,Arwen,Elf
2,Elrond,Elf
3,Celebrían,Elf
4,Elrohir,Elf
...,...,...
822,Brodda,Man
823,Annael,Elf
824,Gelmir,Elf
825,Arminas,Elf


# Introduction

In this notebook, we will once again be working with Hobbit spaCy. We will seek to modify the behavior of this pipeline through custom components and custom attributes. We will also learn how to save an nlp pipeline to disk and load it from disk.

Let's once again load up our pipeline.

In [5]:
nlp = spacy.load("en_hobbit")

And once again let's grab our text, the Council of Elrond.

In [6]:
with open("../data/lotr.txt", "r", encoding="utf-8") as f:
    text = f.read()
text[:500]

'     Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the yellow leaves was glimmering, and the woven nets of gossamer twinkled on every bush. Sam walked beside him, saying nothing. but sniffing the air, and looking every now and again with wonder in his eyes at the great heights in the East. The snow wa'

In [8]:
doc = nlp(text[:2500])

Once again, let's add some color to the display and render our data.

In [52]:
colors = {
    'HOBBIT': "#C3B7FF",  
    'REALM': "#E6EC9C",    
    'MAN': "#9DECDF",    
    'DWARF': "#88CCEE",    
    'ELF': "#B79292",      
    'AINUR': "#FFFB8B"     
}

options = {"ents": ['HOBBIT', 'REALM', 'MAN', 'DWARF', 'ELF', 'AINUR'], "colors": colors}
print(doc.spans["ruler"])
displacy.render(doc, style="ent", options=options)


[Frodo son of Drogo]


# Using a spaCy Output

Once we run a spaCy pipeline over data, we can take the output and apply it in some meaningful way. Let's try to create a function that counts each of the entities. One way to do this is to loop over each entity and store its text as a key in a dictionary, setting the value to 1 the first time it is found and then incrementing that value by 1 each time it is found again. Let's see how this might look in code.

In [20]:
def count_entities(doc):
    ent_counts = {}
    for ent in doc.ents:
        if ent.text in ent_counts:
            ent_counts[ent.text] += 1
        else:
            ent_counts[ent.text] = 1
    print(ent_counts)
count_entities(doc)

{'Frodo': 9, 'Gandalf': 4, 'Bilbo': 4, 'Rivendell': 1, 'Elrond': 3, 'Glorfindel': 2, 'Glóin': 2, 'Drogo': 1, 'Gimli': 1}


This is a lot of code, however, and we can achieve the same result more easily with `Counter`, a class in Python's standard-library module `collections`. `Counter()` takes a list (or any other iterable) and returns a dictionary-like object with the same structure as the one we built by hand. The nice thing about this approach is that we do not have to check whether a key already exists before incrementing it. It's a cleaner implementation and more polished code.

In [26]:
def count_entities2(doc):
    entities = []
    for ent in doc.ents:
        entities.append(ent.text)
    ent_counts = Counter(entities)
    print(ent_counts)
count_entities2(doc)

Counter({'Frodo': 9, 'Gandalf': 4, 'Bilbo': 4, 'Elrond': 3, 'Glorfindel': 2, 'Glóin': 2, 'Rivendell': 1, 'Drogo': 1, 'Gimli': 1})
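
Because a `Counter` behaves like a dictionary, it also brings a few conveniences that a plain dictionary lacks. The short sketch below uses a hand-written list of names (an illustrative stand-in, not pipeline output) to show two of them: `most_common()` for ranking, and a count of zero for items that never occurred.

```python
from collections import Counter

# Hand-written stand-in for a list of entity texts (not real pipeline output)
names = ["Frodo", "Gandalf", "Frodo", "Bilbo", "Frodo", "Gandalf"]

name_counts = Counter(names)

# The two most frequent items, as (item, count) pairs
top_two = name_counts.most_common(2)
print(top_two)  # [('Frodo', 3), ('Gandalf', 2)]

# A missing key returns 0 instead of raising a KeyError
print(name_counts["Sauron"])  # 0
```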


## First Exercise (10 minutes):

In the code block below, modify the code so that you count the number of times each entity label appears in the Doc container. If you finish early, try to create a function to do something else entirely.

In [29]:
# You will need to remove `pass`
def label_counter(doc):
    pass


label_counter(doc)

# Applying our Function Inside of a spaCy Pipeline

In the above example, we are using the output of a spaCy pipeline to do some downstream task outside of spaCy. What if we had to do this action in a lot of different Python scripts or notebooks? What if our users would need to do this same thing? Wouldn't it be nice to provide ourselves and users with this precise data each and every single time the pipeline runs?

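To do this, we can leverage what are known as custom components or factories. A custom component is a plain function registered with `@Language.component`, while a factory, registered with `@Language.factory`, builds a component that needs its own settings or state and is often written as a class. We will just be using components for now.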

To create a custom component, we need to use a decorator (`@`). The construction looks like this:

```python
@Language.component(<NAME OF COMPONENT>)
```

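Decorators are a more advanced Python feature and beyond the scope of this notebook. At a very basic level, a decorator takes the function or class defined directly beneath it and wraps, modifies, or registers it. Here, `@Language.component` registers our function with spaCy under the name we give it, so that spaCy can find the function when we add it to a pipeline.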
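
To make the idea of a registering decorator less abstract, here is a minimal sketch in plain Python. It is a simplified analogy, not spaCy's actual implementation: `REGISTRY` and `register` are hypothetical names invented for this illustration.

```python
# Hypothetical registry: maps a name to a function (not part of spaCy)
REGISTRY = {}

def register(name):
    """Return a decorator that stores the decorated function under `name`."""
    def decorator(func):
        REGISTRY[name] = func
        return func  # the function itself is returned unchanged
    return decorator

@register("shout")
def shout(text):
    return text.upper()

# The function can now be looked up by its registered name
result = REGISTRY["shout"]("frodo")
print(result)  # FRODO
```

In the same spirit, registering a function with `@Language.component("counter_component")` lets us later refer to it by the string `"counter_component"`.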

The decorator will sit immediately above the line where you declare a function. This decorator allows us to pass this function into the spaCy pipeline so that every time we pass a new text to the pipeline, this function will run automatically. It can be positioned anywhere we like in the pipeline. It will always take one argument: `doc` which is the `doc` container object that is being passed through the pipeline. It also must return the `doc` container. If the function does not return the `doc` container, the pipeline breaks because the `doc` container is not passed to the next pipe.

Inside the function, everything can function precisely the same as above.

In total, our code will look like this:

```python
@Language.component("counter_component")
def counter_component(doc):
    entities = []
    for ent in doc.ents:
        entities.append(ent.text)
    counts = Counter(entities)
    print(counts)
    return doc
```

In [35]:

@Language.component("counter_component")
def counter_component(doc):
    entities = []
    for ent in doc.ents:
        entities.append(ent.text)
    counts = Counter(entities)
    print(counts)
    return doc

nlp = spacy.load("en_hobbit")
nlp.add_pipe("counter_component")

doc = nlp(text[:2500])

Counter({'Frodo': 9, 'Gandalf': 4, 'Bilbo': 4, 'Elrond': 3, 'Glorfindel': 2, 'Glóin': 2, 'Rivendell': 1, 'Drogo': 1, 'Gimli': 1})
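
By default, `nlp.add_pipe()` places a component at the end of the pipeline. As a small sketch of positioning, the example below uses a blank English pipeline and a hypothetical component called `token_reporter` (not part of this lesson's pipeline), and places it ahead of an EntityRuler with the `before` argument. The `demo_` variable names are chosen only to avoid overwriting `nlp` and `doc`.

```python
import spacy
from spacy.language import Language

# Hypothetical component used only for this illustration
@Language.component("token_reporter")
def token_reporter(doc):
    print(f"token_reporter received a doc with {len(doc)} tokens")
    return doc  # always hand the doc on to the next pipe

demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"pattern": "Frodo", "label": "HOBBIT"}])

# Place the custom component before the ruler instead of at the end
demo_nlp.add_pipe("token_reporter", before="entity_ruler")

# pipe_names lists the components in the order they will run
print(demo_nlp.pipe_names)  # ['token_reporter', 'entity_ruler']

demo_doc = demo_nlp("Frodo woke early.")
```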


# Assigning Custom Attributes

What makes spaCy useful is that we can also store data inside of the Doc container. These can be entirely custom attributes. A good way to think about a spaCy attribute is to think about it as metadata. We can store attributes at the Doc-level, Token-level, or Span-level. For now, we will only work with the Doc-level.

When we store custom attributes, we can access the data by using `._.<ATTRIBUTE NAME>` on the container. If we are working with data in the Doc container that is stored at `ent_counter`, we could therefore grab that data with the following command: `doc._.ent_counter`.

A custom extension does not exist until we register it in our script. We can do this with the following line:


```python
Doc.set_extension("ent_counter", default=Counter(), force=True)
```

Here, we call the `.set_extension()` method on the `Doc` class. In doing so, we customize spaCy's `Doc` class by adding an attribute that does not currently exist, one that we can assign data to in the middle of a pipeline. The keyword argument `default` specifies the attribute's default value. Since this attribute will contain a counter, it seems fitting for the default to be an empty counter. I like to pass `force=True` in my notebooks because it overwrites any earlier attribute of the same name; without it, re-running the cell raises an error because the extension already exists. When you are prototyping a pipeline, this can be helpful.

Notice in the code snippet below, we use the precise same function, but instead of printing off the counter, we are using this line:

```python
doc._.ent_counter = counts
```

Here, we are setting the `doc._.ent_counter` attribute to our counter, thus storing the data in the Doc container.

In [41]:
Doc.set_extension("ent_counter", default=Counter(), force=True)

@Language.component("counter_component")
def counter_component(doc):
    entities = []
    for ent in doc.ents:
        entities.append(ent.text)
    counts = Counter(entities)
    doc._.ent_counter = counts
    return doc

nlp = spacy.load("en_hobbit")
nlp.add_pipe("counter_component")

doc = nlp(text[:2500])

Now, we will not see the result printed each time. Instead, we can grab the data when necessary. This is a cleaner implementation and more useful because the data is now available to us for downstream tasks.

Let's grab the data and print it off.

In [42]:
print(doc._.ent_counter)

Counter({'Frodo': 9, 'Gandalf': 4, 'Bilbo': 4, 'Elrond': 3, 'Glorfindel': 2, 'Glóin': 2, 'Rivendell': 1, 'Drogo': 1, 'Gimli': 1})
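
The same mechanics work on any pipeline. As a minimal sketch, the example below registers a hypothetical Doc-level attribute called `word_counter` on a blank English pipeline, reads its default value, and then overwrites it. `Doc.has_extension()` checks whether an attribute has been registered.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# Hypothetical attribute name, used only for this illustration
Doc.set_extension("word_counter", default=Counter(), force=True)
print(Doc.has_extension("word_counter"))  # True

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Frodo met Gandalf. Frodo smiled.")

# Before anything is assigned, the attribute holds its default value
print(demo_doc._.word_counter)  # Counter()

# Assign a new value: counts of alphabetic tokens only
demo_doc._.word_counter = Counter(t.text for t in demo_doc if t.is_alpha)
print(demo_doc._.word_counter["Frodo"])  # 2
```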


# Second Exercise (15 Minutes)

In the code snippet below, create a custom attribute to count the number of times specific labels appear in the output from the entity_ruler.

In [55]:
# 1. create your custom attribute here. Don't forget to set it!


# 2. Use the decorator to the Language class here

def counter_component(doc):
        
    # 3. add the bits of the function that grab all the labels and count them

    # 4. set the extension

    # 5. don't forget to return the doc!



nlp = spacy.load("en_hobbit")


# 6. add the component to your pipeline


doc = nlp(text[:2500])

# 7. print off the results here

# Components that Identify Multiword Tokens with Regex

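What if we wanted to write a rule that used Regular Expressions (RegEx) and worked across multiple tokens? We have two choices: convert that RegEx into a spaCy pattern sequence (sometimes challenging) or insert that RegEx into the pipeline as a special component. In the code snippet below, we have an example of how to do just that. Let's break down what's happening.

To achieve this, the component follows a few steps: it runs the RegEx over the raw text (`doc.text`), converts each match's character offsets into a token-based span with `doc.char_span()`, gives that span the label `MILITARY`, and finally adds it to the existing entities in `doc.ents`.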

In [20]:
@Language.component("military_finder")
def military_finder(doc):
    # Raw string so that \. \s and \w reach the regex engine as written
    military_pattern = r"(?:Captain|Cpt\.|Major|Maj)\s(?:[A-Z]\w+\s)+[A-Z]\w+"
    entities = list(doc.ents)
    for match in re.finditer(military_pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, alignment_mode="expand")
        if span is not None:
            entities.append(Span(doc, span.start, span.end, label="MILITARY"))
    doc.ents = entities
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("entity_ruler")
nlp.add_pipe("military_finder", before="entity_ruler")

doc = nlp("Captain John Luc Picard commands the Enterprise.")
displacy.render(doc, style="ent")
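
The key step in this component is that `re.finditer()` works on the raw string and reports character offsets, which `doc.char_span()` then maps back onto tokens. The standard-library sketch below isolates the RegEx half, using the same pattern, so that the offsets are easy to inspect. The "Major Tom" sentence is an invented example.

```python
import re

military_pattern = r"(?:Captain|Cpt\.|Major|Maj)\s(?:[A-Z]\w+\s)+[A-Z]\w+"

text = "Captain John Luc Picard commands the Enterprise."

# Each match reports (start, end) character offsets into the raw string
spans = [match.span() for match in re.finditer(military_pattern, text)]
print(spans)  # [(0, 23)]

start, end = spans[0]
print(text[start:end])  # Captain John Luc Picard

# The pattern requires at least two capitalized words after the title,
# so a title followed by a single name is not matched.
single_name = re.search(military_pattern, "Major Tom sings.")
print(single_name)  # None
```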



# Saving a spaCy Pipeline to Disk and Stacking Rulers (Live Coding)

In [None]:
pattern1 = {"pattern": "Captain John Luc Picard", "label": "PERSON"}
pattern2 = {"pattern": "Enterprise", "label": "VESSEL"}

nlp = spacy.blank("en")
ruler1 = nlp.add_pipe("entity_ruler")
ruler1.add_patterns([pattern1])

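# A second EntityRuler needs its own name; two pipes cannot share the default name "entity_ruler"
ruler2 = nlp.add_pipe("entity_ruler", name="entity_ruler_2")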
ruler2.add_patterns([pattern2])

doc = nlp("Captain John Luc Picard commands the Enterprise.")
displacy.render(doc, style="ent")
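
For the saving half of this section, a minimal sketch might look like the following. It assumes spaCy 3 and writes to a temporary directory so that nothing is left behind; the folder name `trek_pipeline` is arbitrary. `nlp.to_disk()` writes the pipeline's configuration and data to a folder, and `spacy.load()` accepts a path to such a folder as well as a package name.

```python
import tempfile
from pathlib import Path

import spacy

demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"pattern": "Enterprise", "label": "VESSEL"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = Path(tmp_dir) / "trek_pipeline"  # arbitrary folder name
    demo_nlp.to_disk(pipeline_path)           # save the config, vocab and patterns
    reloaded_nlp = spacy.load(pipeline_path)  # load the pipeline back from disk

reloaded_doc = reloaded_nlp("Captain John Luc Picard commands the Enterprise.")
print([(ent.text, ent.label_) for ent in reloaded_doc.ents])
# [('Enterprise', 'VESSEL')]
```

Note that a pipeline containing custom components registered with `@Language.component` can only be loaded in a session where that same registration code has already run.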