<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy 3` `1`

This is lesson `1` of 3 in the educational series on `spaCy and NLP`. This notebook is intended `to teach the spaCy EntityRuler and the basics of Rules-Based NLP`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* A basic understanding of spaCy (see notebooks 1-3)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Loading data with Pandas
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Learn about the basics of supervised learning and the machine learning components in spaCy
```
___

In [1]:
# ### Install Libraries ###

# # Using !pip installs
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg


# # Using %%bash magic with apt-get and yes prompt

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_co

In [21]:
import pandas as pd
from spacy import displacy
import spacy
from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split
import srsly

In [3]:
with open("../data/lotr.txt", "r") as f:
    text = f.read().strip()
text[:250]

'Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the ye'

In [4]:
nlp = spacy.load("en_core_web_lg")



# How to Choose Labels?

When training a machine learning model, it is important to understand that this will be a trial-and-error process. This is true at every level of training, because any time you create a model to do something, you are creating something unique. Oftentimes you are working with training data that you have cultivated yourself and that has never been used to train another model. In order to figure out the right training data, the right model architecture, and the right labels, you must perform a series of tests.

One thing that frequently does not get enough consideration is the choice of labels. Remember, the labels are the ways in which you want to classify your documents. If your labels are difficult for you to explain and differentiate to another human, this is a good indication that your label distinctions overlap conceptually, and it will lead to two major issues. First, your annotators (or even you) will have a hard time labeling the data consistently. Second, the model will likely struggle to learn the distinctions that you want it to identify.

When creating labels, it is best to have labels that are clearly and conceptually distinct from one another. This does not mean that they cannot be part of a similar, larger category. In a current project, we are seeking to classify different types of places and how they appear in Holocaust oral testimonies. Our different types of places are, however, clear and distinct. Some are `ENVIRONMENTAL_FEATURE`, such as rivers and forests, while others are `POPULATED_PLACE`, such as a city or a ghetto, and others are labels such as `INTERIOR` to indicate a place that is inside another location. We have many other types of place labels, but each is distinct, with few labels overlapping.

It is equally important to consider the ethics behind your labels. Just because you can train a machine learning model to do something does not mean you should. A good example of this is a project on violence in 20th-century South Africa. We were interested in understanding victim-perpetrator relationships in oral testimonies. In order to do this, we needed to be able to identify the victim in a text and then identify the perpetrator. There are ways to do this via machine learning. However, were we to train a model that could label VICTIM and PERPETRATOR as distinct entities, it might be right a certain percentage of the time. But what about the times it's wrong? What if this model was given to the public to use? What if it made a wrong prediction and that output was not verified and instead used in a negative way? These are the questions that you should ask when cultivating labels.

When constructing labels, therefore, consider these aspects.

# Using spaCy to Cultivate Training Data

When creating training data, it is important to use annotation software. There are many available. I personally use Prodigy, which comes from the creators of spaCy. It has a higher cost than its competitors, but it is far superior since it is designed to work specifically with spaCy. It makes the process of annotation-training seamless. It also has a very good research license that you can apply for.

In this part of the notebook, I will demonstrate a trick: using an EntityRuler (or SpanRuler) to assist in quickly cultivating a dataset. The goal of this process is not to train a perfect model, but rather a model that is good enough to then help in the annotation process in Prodigy.


In [137]:
with open("../data/lotr.txt", "r") as f:
    text = f.read().strip()
text[:250]

'Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the ye'

In [6]:
nlp = spacy.load("en_hobbit", disable="span_ruler")

In [35]:
doc = nlp(text[:1000])
displacy.render(doc, style="ent")

# Creating some Non-Annotated Training Data

In order to understand how to take data from JSON format (a format often outputted by annotation software like Doccano or Label Studio), you need to be familiar with how to manually create a Doc container. In this section, we will cover this as well as some methodological considerations for merging our labels into two labels: PERSON and REALM.

To do this, we will need a spaCy pipeline that can generate our sentences. We will then use the Hobbit spaCy pipeline to annotate our data. First, let's make our sentencizer pipeline with the `en_core_web_sm` model and disable the NER pipe so that it runs more quickly.

In [8]:
nlp2 = spacy.load('en_core_web_sm', disable="ner")



Now that we have our sentence pipeline, we can create a doc object called `doc2`. We will use this document only to iterate over the sentences.

In [None]:
doc2 = nlp2(text)
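To make the idea of iterating over `doc.sents` concrete, here is a minimal sketch. It swaps in a blank English pipeline with spaCy's rule-based `sentencizer` instead of `en_core_web_sm`, purely so that it runs without a downloaded model; the names `nlp_sketch` and `sample` and the sample text are made up for illustration.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer,
# used here only so the example needs no downloaded model.
nlp_sketch = spacy.blank("en")
nlp_sketch.add_pipe("sentencizer")

sample = "Frodo woke early. He walked along the terraces."

# doc.sents yields one Span per sentence; .text gives the raw string.
sentences = [sent.text for sent in nlp_sketch(sample).sents]
print(sentences)
```

The sentencizer splits only on punctuation, so on real prose the parser-based segmentation from `en_core_web_sm` used in this notebook may well produce different boundaries.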

At this stage, we have our `doc2` object that contains `doc2.sents` (the sentences that we will use for our training data). To convert each sentence into annotated data, we will pass `sent.text` to our Hobbit spaCy pipeline.

First, we will initialize an empty list to store the training data.

```python
training_data = []
```

Next, we will iterate over each sentence (`sent`) in the given document (`doc2`).

```python
for sent in doc2.sents:
```

Within this loop, we will initialize an empty list to store the entities found in the current sentence.
```python
    ents = []
```

Within this loop, we will also create a spaCy `Doc` object by processing the text of the current sentence with the spaCy model (`nlp`). This will allow us to access information about the named entities in the sentence.

```python
    doc = nlp(sent.text)
```

Once we have our doc container, we can iterate over the named entities (`ent`) found in the current sentence. If the entity label is one of the specified labels ("HOBBIT", "DWARF", "MAN", "AINUR", "ELF"), it is classified as a "PERSON". Otherwise, it is classified as a "REALM". Here, we are interested in merging all races into a single label of PERSON. The goal here is to make the problem of NER easier to solve. It is easier for the model to learn the features of PERSON than 5 distinct races, especially when working with minimal training data.

```python
    for ent in doc.ents:
        if ent.label_ in ["HOBBIT", "DWARF", "MAN", "AINUR", "ELF"]:
            ents.append({"start": ent.start_char, "end": ent.end_char, "label": "PERSON"})
        else:
            ents.append({"start": ent.start_char, "end": ent.end_char, "label": "REALM"})
```
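The same mapping can be written as a small helper, which makes it easier to test the merging rule on its own. This is only a sketch: `PERSON_LABELS` and `merge_label` are hypothetical names, and, as in the loop above, every label that is not one of the five races falls through to REALM.

```python
# Hypothetical helper mirroring the if/else in the loop.
PERSON_LABELS = {"HOBBIT", "DWARF", "MAN", "AINUR", "ELF"}


def merge_label(label):
    """Collapse a fine-grained label into PERSON or REALM."""
    return "PERSON" if label in PERSON_LABELS else "REALM"


print(merge_label("ELF"))       # one of the five race labels
print(merge_label("MOUNTAIN"))  # an invented non-race label, for illustration
```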

I also know that the pipeline misses some true positives, namely Sam and Strider, so I want to exclude any sentence that contains these names. This line states that a sentence is used in the training data only if neither "Sam" nor "Strider" appears in its text.

```python
    if "Sam" not in sent.text and "Strider" not in sent.text:
```

If there are any entities in the current sentence, add it to the training data. Additionally, if the name "Arwen" is in the current sentence, print the sentence. This check is there to illustrate that Arwen does not appear anywhere in our training data: if no sentence is printed, the name never occurred. This will be important down below.

```python
        if ents:
            if "Arwen" in sent.text:
                print(sent)
            training_data.append({"text": sent.text, "ents": ents})
```

---

In [140]:
training_data = []

for sent in doc2.sents:
    ents = []
    doc = nlp(sent.text)
    for ent in doc.ents:
        if ent.label_ in ["HOBBIT", "DWARF", "MAN", "AINUR", "ELF"]:
            ents.append({"start": ent.start_char, "end": ent.end_char, "label": "PERSON"})
        else:
            ents.append({"start": ent.start_char, "end": ent.end_char, "label": "REALM"})
    if "Sam" not in sent.text and "Strider" not in sent.text:
        if ents:
            if "Arwen" in sent.text:
                print(sent)
            training_data.append({"text": sent.text, "ents": ents})
print(len(training_data))

396


In this example, we are only grabbing training data that has entities present. We are ignoring the other sentences. Let's take a look at our first example.

In [77]:
training_data[0]

{'text': 'Next day Frodo woke early, feeling refreshed and well.',
 'ents': [{'start': 9, 'end': 14, 'label': 'PERSON'}]}
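A quick sanity check on an example like this is to slice the text with the stored offsets and confirm that the slice is the entity you expect. The sketch below hard-codes the example shown above rather than reading `training_data`, so it can be run on its own; `example` and `surface` are names introduced here.

```python
example = {
    "text": "Next day Frodo woke early, feeling refreshed and well.",
    "ents": [{"start": 9, "end": 14, "label": "PERSON"}],
}

# Character offsets are end-exclusive, just like Python slices.
for ent in example["ents"]:
    surface = example["text"][ent["start"]:ent["end"]]
    print(surface, ent["label"])
```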

Now that we have our data, let's go ahead and create a train/validation split using the same sklearn function as in the previous notebook.

In [141]:
train, valid = train_test_split(training_data, test_size=0.20, random_state=42)

print(len(train), len(valid))

316 80


# Converting JSON to .spacy Format

In order to train a spaCy model in spaCy 3.x, there are a few steps that must be done. First, we must convert our JSON data into the .spacy format. We will do this with a custom function. This function is a modification of the one provided by spaCy.


First, we need to create the function. It will be called `json2spacy`; it takes training data in JSON format, converts it to spaCy's binary format, and saves it to a file. The function takes three arguments: `training_data`, `annotation_key`, and `output_file`. The annotation key is the key in each dictionary where the annotations sit; in our case, this is `ents`. The `output_file` is the .spacy file to which you wish to dump the data.

```python
def json2spacy(training_data, annotation_key, output_file):
```

The first thing this function does is create a blank English model using spaCy. This will be used to process the text and create `Doc` objects.

```python
    nlp = spacy.blank("en")
```

Next, we initialize a `DocBin` object. `DocBin` is a container class in spaCy used to efficiently collect multiple `Doc` objects, which can be saved to disk in binary format.

```python
    db = DocBin()
```

Now, we can begin to iterate over each sample in the `training_data`. A sample contains the text and its corresponding annotations.

```python
    for sample in training_data:
```

Extract the text from the current sample.

```python
        text = sample["text"]
```

Extract the annotations using the provided `annotation_key` (e.g., "ents").

```python
        annotations = sample[annotation_key]
```

Create a `Doc` object by processing the text with the blank English model.

```python
        doc = nlp(text)
```

Initialize an empty list to store the entity spans.

```python
        ents = []
```

Iterate over the annotations, creating a span for each one and adding it to the `ents` list.

```python
        for annotation in annotations:
            start = annotation["start"]
            end = annotation["end"]
            label = annotation["label"]
            span = doc.char_span(start, end, label=label)
            ents.append(span)
```

Assign the collected spans to `doc.ents`, then add the `doc` to the `DocBin`.

```python
        doc.ents = ents
        db.add(doc)
```

Save the `DocBin` to disk using the provided `output_file` path.

```python
    db.to_disk(output_file)
```

Call the `json2spacy` function twice, once for training data (`train`) and once for validation data (`valid`), specifying the output file paths.

```python
json2spacy(train, "ents", "../data/train.spacy")
json2spacy(valid, "ents", "../data/valid.spacy")
```
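To check that annotations survive the trip through a `DocBin`, one option is to serialize and deserialize in memory. The sketch below is a miniature built on assumptions rather than part of the notebook's pipeline: it uses `DocBin.to_bytes` and `DocBin.from_bytes` instead of writing a file, and the sentence, offsets, and variable names are invented.

```python
import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")
doc = nlp_blank("Frodo went to Rivendell.")
# Offsets 0-5 cover "Frodo"; char_span turns them into a token-level Span.
doc.ents = [doc.char_span(0, 5, label="PERSON")]

db = DocBin()
db.add(doc)

# Round-trip through bytes, then rebuild the Doc objects with the same vocab.
restored_bin = DocBin().from_bytes(db.to_bytes())
restored = list(restored_bin.get_docs(nlp_blank.vocab))
restored_ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(restored_ents)
```

The same `get_docs` call can be used after `DocBin().from_disk(path)` if you want to inspect a saved `.spacy` file.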


In [142]:

def json2spacy(training_data, annotation_key, output_file):
    nlp = spacy.blank("en")
    db = DocBin()
    for sample in training_data:
        text = sample["text"]
        annotations = sample[annotation_key]
        doc = nlp(text)
        ents = []
        for annotation in annotations:
            start = annotation["start"]
            end = annotation["end"]
            label = annotation["label"]
            span = doc.char_span(start, end, label=label)
            ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_file)
json2spacy(train, "ents", "../data/train.spacy")
json2spacy(valid,  "ents", "../data/valid.spacy")
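One thing to be aware of with `doc.char_span`: with its default strict alignment it returns `None` when the character offsets do not line up with token boundaries, and a `None` in the list would cause the `doc.ents` assignment to raise an error. The sketch below shows one possible guard, which is simply to skip such spans. The sentence and offsets are invented, and whether to skip, log, or repair misaligned annotations is a judgment call for your own data.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("Frodo went to Rivendell.")

good = doc.char_span(0, 5, label="PERSON")  # exactly covers the token "Frodo"
bad = doc.char_span(0, 3, label="PERSON")   # "Fro" cuts through a token

# Keep only the spans that aligned with token boundaries.
spans = [span for span in (good, bad) if span is not None]
doc.ents = spans
kept = [(ent.text, ent.label_) for ent in doc.ents]
print(kept)
```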

# Training without Vectors

In [131]:
!python -m spacy init fill-config ../data/base_config.cfg ../data/config.cfg

✔ Auto-filled config with all values
✔ Saved config
../data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [132]:
!python -m spacy train ../data/config.cfg --output ../models/output

ℹ Saving to output directory: ../models/output
ℹ Using CPU

[2023-08-04 06:28:28,806] [INFO] Set up nlp object from config
[2023-08-04 06:28:28,815] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-08-04 06:28:28,819] [INFO] Created vocabulary
[2023-08-04 06:28:28,819] [INFO] Finished initializing nlp object
[2023-08-04 06:28:29,030] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     24.20    0.00    0.00    0.00    0.00
  2     200         20.54    969.02   93.02   90.91   95.24    0.93
  5     400          0.04      0.05   93.95   91.82   96.19    0.94
  9     600          0.00      0.00   93.46   91.74   95.24    0.93
 15     800          0.00      0.00   93.46   91

## Metrics

In machine learning, we have multiple ways to convey accuracy. Let's look at 3 types right now: precision, recall, and f-score.

### TP, FP, TN, FN

When judging whether our model's predictions are accurate, we have four ways to classify a token.

First, we have True Positive (TP). This is a token that has a specific label, and the model predicted that label.

Next, we have False Positive (FP). This is a label that the model assigned to a token incorrectly.

Next, we have True Negative (TN). This is a token that has no label, and the model correctly did not assign one.

Finally, we have False Negative (FN). This is a token that has a label, but the model missed it.

### 1. Precision

Precision is a measure of how many of the identified entities are correctly classified. In the context of our model, it would represent the proportion of correctly identified PERSON and REALM entities out of all the entities identified as either PERSON or REALM.

Here,
- True Positives (TP): Entities correctly identified as PERSON or REALM.
- False Positives (FP): Entities incorrectly identified as PERSON or REALM (e.g., identifying a mountain as a REALM when it is not labeled as such in the ground truth).

### 2. Recall

Recall, on the other hand, is a measure of how many of the actual entities are identified by the model. It represents the proportion of correctly identified PERSON and REALM entities out of all the true PERSON and REALM entities in the text.


Here,
- False Negatives (FN): Entities that are truly PERSON or REALM but were not identified as such by the model (e.g., missing a character's name and not labeling it as PERSON).
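Expressed as arithmetic, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from raw counts; the function names and the counts are hypothetical and are not taken from this model's evaluation.

```python
def precision(tp, fp):
    """Share of predicted entities that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of true entities that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented counts: 100 correct predictions, 10 spurious ones, 5 missed entities.
tp, fp, fn = 100, 10, 5
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.4f} recall={r:.4f}")
```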

### Balancing Precision and Recall with the F1-Score

In practice, there may be a trade-off between precision and recall. Improving precision might decrease recall, and vice versa. A common way to balance these two measures is to use the F1 score, which is the harmonic mean of precision and recall.
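As a worked example, F1 is conventionally defined as the harmonic mean of precision and recall, 2PR / (P + R). The sketch below uses a hypothetical helper, `f1_score`, and plugs in the ENTS_P and ENTS_R values from the step-200 row of the first training log above (90.91 and 95.24).

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# ENTS_P and ENTS_R from the step-200 row of the training log.
ents_p, ents_r = 90.91, 95.24
harmonic = f1_score(ents_p, ents_r)
arithmetic = (ents_p + ents_r) / 2  # simple average, for comparison only
print(round(harmonic, 2), round(arithmetic, 2))
```

The harmonic mean rounds to 93.02, which matches the ENTS_F column in that row; the simple average of the two values would come out slightly higher.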


### When to use Which?

When designing models, it is sometimes useful to favor precision over recall. In the real world, precision is often favored for things like spam detection. You do not want to accidentally flag something as spam that is not. To err on the side of caution, you aim for high precision, which means that everything detected as spam likely is spam, while accepting that some cases of spam will be missed. That's okay because it means the user does not miss the email about an important meeting, but they may still have to delete a few annoying emails.

On the other side of this, we have recall. A good way to think about this in the real world is with cancer screening. A machine learning model would be better if it had high recall at the cost of precision. This is because missing a cancer diagnosis is far more serious than falsely identifying cancer.

## Epochs

An epoch refers to one complete pass through the entire training dataset. During each epoch, the model's weights are updated to minimize the loss function, which is a measure of the discrepancy between the predicted labels and the actual labels.

## Batch Size

Sometimes it is difficult to fit all the training data into memory so we pass the data to the model in batches. An epoch is complete only when all batches have been passed to the model during the training process.
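The relationship between epochs and batches can be sketched in plain Python. This is a generic illustration rather than spaCy's own batching logic: `minibatches` is a hypothetical helper, the batch size of 32 is an arbitrary choice, and 316 is simply the size of our training split.

```python
def minibatches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


examples = list(range(316))  # stand-ins for the 316 training sentences
batch_size = 32              # arbitrary, for illustration only

for epoch in range(2):
    batches = list(minibatches(examples, batch_size))
    # One epoch is complete once every batch has been seen.
    print(f"epoch {epoch}: {len(batches)} batches, last has {len(batches[-1])} items")
```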

In [133]:
ml_hobbit = spacy.load("../models/output/model-best")

In [138]:
doc = ml_hobbit(text[:1000])
displacy.render(doc, style="ent")

In [139]:
new_text ="Arwen went to the realm of Moria."
doc = ml_hobbit(new_text)
displacy.render(doc, style="ent")

# Training with Vectors

Training in spaCy 3 is almost exclusively done in the command line. Because we are learning in JupyterLab, we will put `!` at the start of each command to indicate that it should be run as a command line prompt.

In [68]:
!python -m spacy init fill-config ../data/base_config_vec.cfg ../data/config_vec.cfg

✔ Auto-filled config with all values
✔ Saved config
../data/config_vec.cfg
You can now add your data and train your pipeline:
python -m spacy train config_vec.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [90]:
!python -m spacy train ../data/config_vec.cfg --output ../models_vec/output

ℹ Saving to output directory: ../models/output
ℹ Using CPU

[2023-08-04 01:55:51,232] [INFO] Set up nlp object from config
[2023-08-04 01:55:51,240] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-08-04 01:55:51,243] [INFO] Created vocabulary
[2023-08-04 01:55:52,782] [INFO] Added vectors: en_core_web_lg
[2023-08-04 01:55:52,782] [INFO] Finished initializing nlp object
[2023-08-04 01:55:53,189] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     24.20    0.00    0.00    0.00    0.00
  2     200          6.53    697.55   96.19   96.19   96.19    0.96
  5     400          5.05     26.21   95.24   95.24   95.24    0.95
^C


# Using the Model

We can now use this model by loading it as we would any other model. It is saved to disk in `../models_vec/output/model-best`.

In [91]:
ml_hobbit = spacy.load("../models_vec/output/model-best")

In [119]:
doc = ml_hobbit(text[:1000])
displacy.render(doc, style="ent")

In [121]:
new_text ="Arwen went to the realm of Moriaa?."
doc = ml_hobbit(new_text)
displacy.render(doc, style="ent")

In [143]:
from collections import Counter

label_counts = Counter()
for item in training_data:
    for ent in item['ents']:
        label_counts[ent['label']] += 1

print(label_counts)


Counter({'PERSON': 442, 'REALM': 105})
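
Counts like these are easier to compare as proportions. The sketch below hard-codes the counter printed above so that it runs on its own; `total` and `shares` are names introduced here for illustration.

```python
from collections import Counter

# The counts printed by the cell above, copied in by hand.
label_counts = Counter({"PERSON": 442, "REALM": 105})
total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Roughly four out of five annotated entities are PERSON, so the model sees far fewer REALM examples; that imbalance is worth keeping in mind when reading the scores.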
