<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy 3` `1`

This is lesson `1` of 3 in the educational series on `spaCy and NLP`. This notebook is intended `to teach the basics of supervised learning and the machine learning components in spaCy`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 


**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* A basic understanding of spaCy (see notebooks 1-3)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Loading data with Pandas
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Learn about the basics of supervised learning and the machine learning components in spaCy
```
___

In [1]:
# ### Install Libraries ###

# # Using !pip installs
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg


# # Using %%bash magic with apt-get and yes prompt

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_co

In [2]:
import pandas as pd
from spacy import displacy
import spacy



In [3]:
with open("../data/lotr.txt", "r") as f:
    text = f.read().strip()
text[:250]

'Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the ye'

In [4]:
nlp = spacy.load("en_core_web_lg")

In [5]:
doc = nlp(text[:2500])

In [6]:
displacy.render(doc, style="ent")

Let's now take a look at our first token, `Next`.


# Supervised Learning

Supervised learning is fundamentally different from unsupervised learning. In supervised learning, we know the labels of our data. Our goal is to use that labeled data to teach a computer system to recognize the key features of our data that make it correspond to specific labels.


In a supervised learning system, our data would look like this:

SPORTS - The Boston Celtics won the championship.
SPORTS - The Dallas Cowboys lost in overtime.
SPORTS - Basketball is a sport enjoyed worldwide.
POLITICS - The Senator from Florida voted against the bill.
POLITICS - The Congressperson from New Hampshire is leaving office.
POLITICS - A bill is a type of legal document.

Here we have our data clearly labeled. This is precisely how spaCy's available pipelines work. They were trained on thousands of example texts that had annotations. Annotations are labels that we assign to specific tokens, or sequences of tokens. If we are training an NER system, our annotations will be labels such as PERSON, GPE, and LOC.
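As a minimal sketch of how labeled data like this might be held in Python, the cell below splits each line into a label and a text and counts the examples per label. The names `labeled_lines` and `parse_line` are illustrative only and are not part of spaCy.

```python
from collections import Counter

# Illustrative labeled data, copied from the examples above.
labeled_lines = [
    "SPORTS - The Boston Celtics won the championship.",
    "SPORTS - The Dallas Cowboys lost in overtime.",
    "SPORTS - Basketball is a sport enjoyed worldwide.",
    "POLITICS - The Senator from Florida voted against the bill.",
    "POLITICS - The Congressperson from New Hampshire is leaving office.",
    "POLITICS - A bill is a type of legal document.",
]


def parse_line(line):
    """Split 'LABEL - text' into a (label, text) pair (hypothetical helper)."""
    label, text = line.split(" - ", 1)
    return label, text


dataset = [parse_line(line) for line in labeled_lines]

# Count how many examples we have for each label.
label_counts = Counter(label for label, _ in dataset)
print(label_counts)
```

Checking the label counts early is a quick way to spot a dataset that is heavily skewed towards one label.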

# Annotations

In order to understand what annotations are more deeply, let's take a look at a concrete example.

In [104]:
doc = nlp("New York is a state.")
displacy.render(doc, style="ent")

As we can see in the example above, New York is identified as a GPE (geopolitical entity). Let's see what this doc looks like as an annotation. To convert it to an annotation, we can use `doc.to_json()`.

In [105]:
doc.to_json()

{'text': 'New York is a state.',
 'ents': [{'start': 0, 'end': 8, 'label': 'GPE'}],
 'sents': [{'start': 0, 'end': 20}],
 'tokens': [{'id': 0,
   'start': 0,
   'end': 3,
   'tag': 'NNP',
   'pos': 'PROPN',
   'morph': 'Number=Sing',
   'lemma': 'New',
   'dep': 'compound',
   'head': 1},
  {'id': 1,
   'start': 4,
   'end': 8,
   'tag': 'NNP',
   'pos': 'PROPN',
   'morph': 'Number=Sing',
   'lemma': 'York',
   'dep': 'nsubj',
   'head': 2},
  {'id': 2,
   'start': 9,
   'end': 11,
   'tag': 'VBZ',
   'pos': 'AUX',
   'morph': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin',
   'lemma': 'be',
   'dep': 'ROOT',
   'head': 2},
  {'id': 3,
   'start': 12,
   'end': 13,
   'tag': 'DT',
   'pos': 'DET',
   'morph': 'Definite=Ind|PronType=Art',
   'lemma': 'a',
   'dep': 'det',
   'head': 4},
  {'id': 4,
   'start': 14,
   'end': 19,
   'tag': 'NN',
   'pos': 'NOUN',
   'morph': 'Number=Sing',
   'lemma': 'state',
   'dep': 'attr',
   'head': 2},
  {'id': 5,
   'start': 19,
   'end'

We have a lot of data here, so let's focus on just two parts of it: the text and the ents.

In [106]:
text = doc.to_json()["text"]
ents = doc.to_json()["ents"]

print(text)
print(ents)

New York is a state.
[{'start': 0, 'end': 8, 'label': 'GPE'}]


Here we get a close example of what a machine learning annotation looks like for supervised learning. We have a string, a specific text, with labeled annotations. In our case, we have the entities labeled. Notice that each entity (we only have one) is a dictionary with three keys: start, end, and label. The start and end are the character offsets where the entity begins and ends in the text. The label is the appropriate label for the thing that falls within that span of characters. In our case, that span is `New York`.
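To make the character offsets concrete, here is a small sketch that uses plain Python slicing to recover the entity text from the annotation printed above. It re-declares `text` and `ents` by hand so that it runs on its own.

```python
# The text and entity annotation printed above.
text = "New York is a state."
ents = [{"start": 0, "end": 8, "label": "GPE"}]

for ent in ents:
    # start and end are character offsets; end is exclusive, as in Python slicing.
    span_text = text[ent["start"]:ent["end"]]
    print(span_text, ent["label"])
```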

Later, we will be learning how to cultivate annotations like this with our own data.

## Model Bias

spaCy provides numerous open-source pipelines and models for many languages. These models are, however, biased. In order to understand what this means and why it is important, let's take a look at a simple example.

In [111]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("John went to the store.")

Our text is simple: `John went to the store.` The only entity in it is `John`, who should be identified as `PERSON`.

In [24]:
displacy.render(doc, style="ent")

As we can see, the spaCy model has identified this correctly for us. Why is that? The reason lies in the data that the model was trained on. The model was trained on a lot of English texts, and John is a very common name in English. Therefore, the model, as one would expect, does well at predicting that John is a person. It is likely able to do this because it has effectively memorized that John functions as a name.

Let's see how it performs with a South African name.

In [113]:
doc = nlp("Zuri went to the store.")
displacy.render(doc, style="ent")

As we can see, the model has mislabeled Zuri as a GPE rather than a PERSON. This is an expected error for a few reasons. First, the model is biased towards Western English. When we speak about model bias, we are speaking about a model's tendency to fare better on specific types of data or to lean towards certain conclusions. All models are biased because all datasets are biased. Without constraints, models can very easily replicate the biases of humans. In our case, `en_core_web_sm` is biased towards Western names because it saw more Western names in the training data.

In some instances, however, we may want to bias models intentionally. Imagine we had a language model that was good, but could not work well with data relevant to the Holocaust. In this example, we may want to fine-tune the model to Holocaust data. Fine-tuning models is a way that we can bias a model to a particular subject, or domain, or task. These instances require consideration for the intended use of the model and the application of it. Biasing a model towards a specific type of data will greatly improve the results when it runs over similar data, but when applied to more general use cases, it may perform more poorly. If we train a model to work specifically on Holocaust data, therefore, we may have a model that performs well when identifying places in eastern Europe, but may do poorly at classifying Denver as a city.

# spaCy Machine Learning Pipes

The spaCy library has numerous machine learning components available. Most of them come pre-built in the spaCy pipelines. They are trained on open-source datasets and benchmarked with each new version of spaCy. Let's dive into the components that are available to us. When we train a machine learning model in spaCy, spaCy expects our data to be structured in a very specific format and then converted into binary files. Let's first take a look at the expected data format for each component.


## Named Entity Recognition (NER)

Named Entity Recognition (NER) identifies and classifies named entities within the text, such as names of people, organizations, locations, and more. In the sentence `New York is a state.`, "New York" is recognized as a geopolitical entity (GPE), and this information can be utilized in many applications like information retrieval, summarization, and more.

A great trick to identify the format spaCy expects is to convert your Doc container to JSON with `doc.to_json()`. This will show you the output of a spaCy pipeline with each component's data stored under its own key.

In [25]:
doc = nlp("New York is a state.")
json_doc = doc.to_json()
print(json_doc['ents'])

[{'start': 0, 'end': 8, 'label': 'GPE'}]


This format gives us the start character, end character, and label of our entity. In our case, this is the text `New York`, which has the label `GPE`. We will work closely with training an NER model in the next notebook, so we will explore this format in more detail there. In spaCy 3.x, we work with training data as serialized `.spacy` files. These are serialized Doc containers that are saved to disk. However, before we can get the data into the `.spacy` format, we must first create the Doc containers, and to do that it is important to structure your training data in a consistent manner. The above format is the one you will frequently see for NER because it is how we used to structure training data for spaCy 2. When spaCy upgraded to spaCy 3, many users continued to structure their JSON training data in the same format.

Notice that our training data does not have an annotation for every token in the sentence. Instead, the NER training data is a sequence of dictionaries that correspond strictly to the spans that align with specific entities. Training data for other spaCy components will look fundamentally different.
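As a rough sketch of how an annotation in this shape could be turned back into a Doc container, the cell below uses a blank English pipeline (tokenizer only, so no trained model is needed) and `doc.char_span()`. The `annotation` dictionary is written by hand for illustration.

```python
import spacy

# A blank English pipeline only tokenizes; no trained model is required.
nlp = spacy.blank("en")

# Annotation in the same shape as the text and ents shown above.
annotation = {
    "text": "New York is a state.",
    "ents": [{"start": 0, "end": 8, "label": "GPE"}],
}

doc = nlp(annotation["text"])
spans = []
for ent in annotation["ents"]:
    # char_span returns None when the offsets do not line up with token boundaries.
    span = doc.char_span(ent["start"], ent["end"], label=ent["label"])
    if span is not None:
        spans.append(span)
doc.ents = spans

print([(e.text, e.label_) for e in doc.ents])
```

Skipping spans that come back as `None` keeps the sketch short; in a real dataset those misaligned annotations are worth logging and inspecting.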

## Part-of-Speech (POS) Tagging

Another machine learning model that you can train is the part-of-speech tagger. Part-of-Speech tagging assigns each token a POS tag, such as noun, verb, adjective, etc. These tags provide insights into the grammatical roles and structure of the sentence. Understanding POS is vital for many NLP tasks like syntactic parsing, text-to-speech conversion, and language translation.

Unlike the NER training data, our part-of-speech training data is a sequence of tokens, each of which has a fine-grained tag (`token.tag_`) and a coarse part-of-speech (`token.pos_`). Because every token has a part-of-speech, our training data will look like a sequence of tags, one per token. When the spaCy part-of-speech tagger trains, it tries to predict the correct part-of-speech for each token. We can capture this sequence in the following way:

In [29]:
tags = []
for token in doc:
    tags.append(token.pos_)
print(tags)

['PROPN', 'PROPN', 'AUX', 'DET', 'NOUN', 'PUNCT']


## SpanCat

Another trainable component is the SpanCat (span categorizer) component in spaCy. SpanCat's training data structure looks much like the NER structure. Unlike the NER structure, however, we reference our spans as `spans`, not `ents`. In the `spans` dictionary, these are stored under the key `sc` by default; `sc` stands for SpanCat. Unlike the NER component (and like the SpanRuler), a SpanCat model is able to assign multiple labels to the same token. This makes it particularly well suited to classification tasks that may have nested or overlapping spans.

Imagine if you wanted to capture first name, last name, and PERSON as 3 labels for the following text:

```Joe Smith went to the store```

We would want `Joe` to have the label of first name, `Smith` to have the label of last name, and `Joe Smith` to have the label of PERSON. To achieve this, we cannot use the NER model, because its entities cannot overlap. Instead, we must use the SpanCat model.
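The following sketch shows what these three overlapping spans could look like when stored under the `sc` key of `doc.spans`. It uses a blank English pipeline, and the label names `FIRST_NAME` and `LAST_NAME` are illustrative choices rather than labels from any existing model.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline is enough here because we only need tokenization.
nlp = spacy.blank("en")
doc = nlp("Joe Smith went to the store")

# Token indices: Joe = 0, Smith = 1. Span end indices are exclusive.
doc.spans["sc"] = [
    Span(doc, 0, 1, label="FIRST_NAME"),
    Span(doc, 1, 2, label="LAST_NAME"),
    Span(doc, 0, 2, label="PERSON"),
]

for span in doc.spans["sc"]:
    print(span.text, span.label_)
```

Note that the token `Joe` belongs to two spans at once here, which is exactly what `doc.ents` does not allow.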

## TextCat

Up until now, we have spoken strictly about classifying tokens (POS tagger) or spans (NER or SpanCat), but what if we wanted to classify entire texts? This is where the TextCat component comes into play. The TextCat model allows us to assign categories, or labels, to input texts. With the TextCat component we can perform text classification. A good way to think about the TextCat model is to consider the example of topic modeling from the previous lesson. In topic modeling we did not know the labels for our data. Instead, we wanted to find clusters within our corpus. Text classification is the exact opposite of this. In text classification we know the labels of our training data and we are seeking to train a TextCat model to learn the features of what makes a specific category a specific category. To do this, it will learn from the training data and ultimately be able to classify unseen documents.
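spaCy stores text categories on a Doc as a dictionary that maps each label to a score (`doc.cats`). As a sketch of what training examples for text classification might look like in that shape, the cell below builds such dictionaries in plain Python. The label set and the `make_cats` helper are assumptions made for illustration.

```python
# Illustrative label set; a real project would define its own.
LABELS = ["SPORTS", "POLITICS"]


def make_cats(true_label):
    """Build a label -> score dict with 1.0 for the true label (hypothetical helper)."""
    return {label: 1.0 if label == true_label else 0.0 for label in LABELS}


# Each example pairs a text with its category scores.
examples = [
    ("The Boston Celtics won the championship.", make_cats("SPORTS")),
    ("The Senator from Florida voted against the bill.", make_cats("POLITICS")),
]

for text, cats in examples:
    print(cats, text)
```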

# How to Choose Labels?

When training a machine learning model, it is important to understand that this will be a trial-and-error process. This is true at all levels of training a model because any time you create a model to do something, you are creating something unique. Often you are working with training data that you have cultivated yourself and that has not been used to train another model before. In order to figure out the right training data, the right model architecture, and the right labels, you must perform a series of tests.

One thing that frequently does not get enough consideration is the labels you choose. Remember, the labels are the ways in which you want to classify your documents. When choosing labels, it is important to remember that if your labels are difficult for you to explain and differentiate to another human, this is a good indication that your label distinctions overlap conceptually. This will lead to two major issues. First, it will mean that your annotators (or even you) will have a hard time labeling the data consistently. Second, it will mean that the model will likely struggle to identify the distinctions that you want it to identify.

When creating labels, it is best to have labels that are clearly and conceptually distinct from one another. This does not mean that they cannot be part of a similar, larger category. In a current project, we are seeking to classify different types of places and how they appear in Holocaust oral testimonies. Our different types of places, however, are clear and distinct. Some are `ENVIRONMENTAL_FEATURE`, such as rivers and forests, while others are `POPULATED_PLACE`, such as a city or a ghetto, and others are labels such as `INTERIOR` to indicate a place that is inside another location. We are using many other types of place labels, but each is distinct, with few labels overlapping.

It is equally important to consider the ethics behind your labels. Just because you can train a machine learning model to do something does not mean you should. A good example of this is a project rooted in violence in 20th century South Africa. We were interested in understanding victim-perpetrator relationships in oral testimonies. In order to do this, we needed to be able to identify the victim in a text and then identify the perpetrator. There are ways to do this via machine learning. However, were we to train a model that could label VICTIM and PERPETRATOR as distinct entities, it might be right a certain percentage of the time. But what about the times it's wrong? What if this model was given to the public to use? What if it made a wrong prediction and that output was not verified and instead used in a negative way? These are the questions that you should ask when cultivating labels.

When constructing labels, therefore, consider these aspects.

# Using spaCy to Cultivate Training Data

When creating training data, it is important to use annotation software. There are many available. I personally use Prodigy, which comes from the creators of spaCy. It has a higher cost than its competitors, but it is far superior since it is designed to work specifically with spaCy. It makes the process of annotation-training seamless. It also has a very good research license that you can apply for.

In this part of the notebook, however, I will demonstrate a trick: using an EntityRuler (or SpanRuler) to assist in quickly cultivating a dataset. The goal of this process is not to train a perfect model, but rather a model that is good enough to then help in the annotation process in Prodigy.


In [34]:
with open("../data/lotr.txt", "r") as f:
    text = f.read().strip()
text[:250]

'Next day Frodo woke early, feeling refreshed and well. He walked along the terraces above the loud-flowing Bruinen and watched the pale, cool sun rise above the far mountains, and shine down. Slanting through the thin silver mist; the dew upon the ye'

In [62]:
nlp = spacy.load("en_hobbit", disable=["span_ruler"])

  


In [35]:
doc = nlp(text[:1000])
displacy.render(doc, style="ent")

In [45]:
doc = nlp(text)

In [59]:
nlp2 = spacy.load('en_core_web_sm', disable=["ner"])

In [60]:
doc2 = nlp2(text)

In [74]:
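training_data = []

# nlp2 (en_core_web_sm without NER) supplies the sentence boundaries;
# nlp (the en_hobbit pipeline) then annotates each sentence on its own.
for sent in doc2.sents:
    doc = nlp(sent.text)
    training_data.append(doc)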

In [64]:
len(training_data)

1111

In [69]:
training_data[0].to_json()

{'text': 'Next day Frodo woke early, feeling refreshed and well.',
 'ents': [{'start': 9, 'end': 14, 'label': 'HOBBIT'}],
 'tokens': [{'id': 0, 'start': 0, 'end': 4},
  {'id': 1, 'start': 5, 'end': 8},
  {'id': 2, 'start': 9, 'end': 14},
  {'id': 3, 'start': 15, 'end': 19},
  {'id': 4, 'start': 20, 'end': 25},
  {'id': 5, 'start': 25, 'end': 26},
  {'id': 6, 'start': 27, 'end': 34},
  {'id': 7, 'start': 35, 'end': 44},
  {'id': 8, 'start': 45, 'end': 48},
  {'id': 9, 'start': 49, 'end': 53},
  {'id': 10, 'start': 53, 'end': 54}]}

# Converting Docs to Serialized .spacy

In order to train a spaCy model in spaCy 3.x, there are a few steps that must be done. First, we must convert our Doc containers into a serialized binary file and save it to disk. We also need to split our data into a training set and a validation set. To do these tasks, we can import `DocBin` from spaCy and `train_test_split` from sklearn. Sklearn (scikit-learn) is a machine learning library that handles a lot of tasks for us automatically.

In [73]:
from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split


To save Doc containers to disk, we need to create a `DocBin` object. This will hold multiple Doc containers for us. Let's create one for our training data.

In [80]:
train_db = DocBin()

And one for our validation data.

In [81]:
valid_db = DocBin()

Next, we need to split our data into training and validation sets. These will be used during the training process.

In [75]:
train, valid = train_test_split(training_data, test_size=0.20, random_state=42)

Let's take a look at the first document in each set and at each set's length.

In [76]:
train[0]

The world has changed much since I last was on the westward roads.
     

In [77]:
valid[0]

I do not know; but I trace here a copy of it, lest it fade beyond recall.

In [78]:
len(train)

888

In [79]:
len(valid)

223
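The two lengths above follow from the `test_size=0.20` argument. As a worked sketch of the arithmetic (assuming, as scikit-learn does for a fractional `test_size`, that the validation count is rounded up and the remainder goes to training):

```python
import math

n_docs = 1111      # len(training_data) reported above
test_size = 0.20   # the value passed to train_test_split

# Assumption: for a float test_size, scikit-learn rounds the validation
# count up and gives the remaining documents to the training set.
n_valid = math.ceil(n_docs * test_size)
n_train = n_docs - n_valid

print(n_train, n_valid)
```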

Now, let's add each Doc container in the training and validation sets to its respective `DocBin` object. We can do this with `.add()`, which adds one Doc container at a time.

In [82]:
for t in train:
    train_db.add(t)

In [83]:
for v in valid:
    valid_db.add(v)

Now, we can save each to disk. I am saving both files in the `data` folder of this repository as `train.spacy` and `valid.spacy`. We will need to specify where these files are located when we train our spaCy model.

In [85]:
train_db.to_disk("../data/train.spacy")

In [86]:
valid_db.to_disk("../data/valid.spacy")
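Before training, it can be reassuring to confirm that a `.spacy` file really contains what we expect. The sketch below is a self-contained round trip: it builds one annotated Doc with a blank English pipeline, writes it to a temporary file, and reads it back. The file name `example.spacy` and the single hand-made Doc are illustrative; for the real files you would point `from_disk()` at `../data/train.spacy` instead.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Build one small annotated Doc by hand (blank pipeline, tokenizer only).
nlp = spacy.blank("en")
doc = nlp("Next day Frodo woke early, feeling refreshed and well.")
doc.ents = [doc.char_span(9, 14, label="HOBBIT")]

db = DocBin()
db.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.spacy"  # hypothetical file name
    db.to_disk(path)
    loaded = DocBin().from_disk(path)

# get_docs needs a vocab to rebuild the Doc containers.
docs = list(loaded.get_docs(nlp.vocab))
print(len(docs), [(e.text, e.label_) for e in docs[0].ents])
```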

# Training

Training in spaCy 3 is almost exclusively done from the command line. Because we are working in JupyterLab, we will put `!` at the start of each command to indicate that the cell should be run as a command-line prompt.

In [88]:
!python -m spacy init fill-config ../data/base_config.cfg ../data/config.cfg

✔ Auto-filled config with all values
✔ Saved config
../data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [89]:
!python -m spacy train ../data/config.cfg --output ../models/output

✔ Created output directory: ../models/output
ℹ Saving to output directory: ../models/output
ℹ Using CPU
[2023-08-02 07:09:21,784] [INFO] Set up nlp object from config
[2023-08-02 07:09:21,793] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-08-02 07:09:21,796] [INFO] Created vocabulary
[2023-08-02 07:09:21,796] [INFO] Finished initializing nlp object
[2023-08-02 07:09:22,175] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     42.87    0.00    0.00    0.00    0.00
  1     200         29.27   1152.09   87.56   87.13   88.00    0.88
  2     400         32.02     91.11   92.68   90.48   95.00    0.93
  4     600         23.90     27.10   95.57   94.17   97.0
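In the training log, `ENTS_P`, `ENTS_R`, and `ENTS_F` are the precision, recall, and F-score for entities on the validation data. As a small sketch of how these relate, the cell below recomputes the F-score as the harmonic mean of precision and recall for two rows copied from the log. Because the logged values are already rounded, the recomputed numbers may differ slightly in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (ENTS_P, ENTS_R, ENTS_F) copied from the first two non-zero rows of the log above.
rows = [(87.13, 88.00, 87.56), (90.48, 95.00, 92.68)]

for p, r, reported_f in rows:
    print(round(f_score(p, r), 2), reported_f)
```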

# Using the Model

We can now use this model by loading it as we would any other model. The best-performing version is saved to disk in `../models/output/model-best`.

In [93]:
ml_hobbit = spacy.load("../models/output/model-best")

  


In [97]:
doc = ml_hobbit(text[:1000])
displacy.render(doc, style="ent")

In [107]:
new_text = "The Lord of the Rings is about a few hobbits, like Frodo and Pippin who go to Mordor to destroy the ring Sauron."
doc = ml_hobbit(new_text)
displacy.render(doc, style="ent")