<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `Multilingual NER` `3`

This is lesson `3` of 3 in the educational series on `TOPIC`. This notebook is intended `to teach XXX and introduce the concepts of XXXX`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner` / `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

These should be general skills but can mention a particular library
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Data cleaning with `Pandas`
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Explain the basics of machine learning
2. Explain the basics of transformer models
3. Explain how machine learning is applied to NER
4. Train a machine learning NER model in spaCy 3
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* spacy

## Install Required Libraries

In [1]:
### Install Libraries ###

# Using !pip installs
!pip install spacy

# Using %%bash magic with apt-get and yes prompt



# Introduction

```
In this lesson, we are going to begin learning about machine learning in general, which is a branch of artificial intelligence. More specifically, we are going to discuss deep learning, which is itself a field of machine learning.

Unlike a traditional computer system in which the human writes the rules to perform a specific task, in machine learning, we use statistics to model a problem. The output is a system where we use statistics to write the rules for us. This will make more sense as we progress through some concrete examples of machine learning.


```

Before we begin our journey into machine learning, let's start with a fun thinking exercise. I have found it is best to do this as a true story. Take a look at the image below.

<img src="https://i.pinimg.com/originals/68/c8/a0/68c8a0eb4c4ce56e4d54e9df98dfa802.jpg"
     alt="Wire-mesh penguin Christmas decoration" />
     
If I were to ask anyone what this is an image of, we would likely all have the same answer. "Penguin!" Of course it is. But that is not quite what happened when my neighbor's three-year-old child came to my parents' front yard, where this very decoration sits every Christmas. To keep this child anonymous, we will simply call him Timmy. Young Timmy did not say "penguin" as we all did. Instead, he said "duck!"

Was Timmy wrong? From our vantage point, yes. He was horribly wrong. Is this his fault? Of course not. He was three. Now, if a grown adult, say a professor of English, came over to the house and made the same comment, we might look at her funny and wonder what she was talking about. We might look behind the penguin to see if a duck from the nearby pond had gotten lost. Once we realized that she was clearly making a statement about the penguin, we would likely fault her for making an incorrect comment about what that wire-mesh animal truly was.

In this scenario, we have a machine learning problem. Young Timmy is a child. He has limited **experiences** in this world. He has a pet duck, a few dogs, and a few cats. He knows what dogs and cats are and he knows what ducks are because of this. In machine learning terms, Timmy's experiences constitute seeing images in his environment and learning what **label** to assign to those things. His parents, like all parents, likely pointed at the dog before he could remember and said, "dog... dog... dog..." and likewise with cats and ducks. Let's presume Timmy only knows those three animals. Nothing else.

What happened in this scenario? Well, Timmy, who was able to classify only three animals out of the millions that exist, met a new, out-of-scope animal: something he had never encountered. He was being asked, in machine learning terms, to generalize on unseen data. Unfortunately for Timmy, that unseen data was unfair. It was something that did not fit into any of the three labels he knew. What did Timmy do in this scenario? He did what any good machine learning model would do. He gave the best answer he could. He said "duck".

Honestly, Timmy did a great job. He showed the ability to understand the salient features of a duck. **Features** in machine learning terms are the aspects of an item that are important. The pieces of it that make it what it is. In Timmy's case, he likely saw the wings, the bill, and the two feet and came to a quick conclusion. "Duck!" These three features clearly make a duck (or a penguin) not a dog or cat. And for these reasons, he gave the label of "duck!".

What did Timmy's parents do in this scenario? Like all great parents, they used the experience to correct him. In machine learning terms, showing the learner the right label for an example is the heart of **supervised learning**. They bent down and pointed at the penguin and said "no... penguin... penguin... penguin...". Young Timmy grimaced, looked confused, and moved on with his evening. The nearby cookies had already arrested his attention.

Will Timmy call this specific thing a penguin next time he sees it? Possibly. Will he be able to identify a true penguin in the real world? Maybe not. Real penguins look different and Timmy has only had one experience with penguins. He will need many more before he can confidently identify them consistently.

At the end of the day, this is how machine learning works. It is the replication of this process in a computer system through statistics and mathematics.

## Supervised Learning

Supervised learning is the process by which a system learns from a set of inputs that have known labels. In order to train properly, the input data is divided into three categories: training data, validation data, and testing data. There is no set percentage to assign to each of these categories. A good rule of thumb, however, is to save 20% of all annotated data for testing and then divide the remaining 80% into training and validation data at an 80/20 ratio.
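
The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration rather than a required recipe: the function name `split_annotated_data`, the fixed seed, and the placeholder samples are all assumptions made for the example.

```python
import random

def split_annotated_data(samples, test_fraction=0.2, valid_fraction=0.2, seed=42):
    """Shuffle the samples, hold out a test set, then split the rest into train/validation."""
    shuffled = list(samples)               # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    n_valid = round(len(remaining) * valid_fraction)
    valid = remaining[:n_valid]
    train = remaining[n_valid:]
    return train, valid, test

# Hypothetical placeholder samples standing in for annotated texts.
samples = [f"annotated text {i}" for i in range(100)]
train, valid, test = split_annotated_data(samples)
print(len(train), len(valid), len(test))  # 64 16 20
```

With 100 annotated samples, 20 are held back for testing and the remaining 80 are divided into 64 for training and 16 for validation.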

The first two, training data and validation data, are given to the system that is trying to learn. It uses the training data to hone a statistical model via predetermined algorithms. It does this by making guesses about what the proper labels are. It then checks its accuracy against the labels provided and makes adjustments accordingly.

Once it is finished viewing and guessing across all the training data, the first epoch, or iteration over the data, is finished. At this stage, the model then tests its accuracy against the validation data. These are left out of the training process and give the system a sense of its overall performance.

Because the validation data is left out of the training process, it can be used for mid-training testing (or validation) of the model's accuracy. The training data is then randomized and given back to the system for a set number of epochs. Again, there is no standard for the number of epochs, but a good rule of thumb is to start at 10 and study the results.
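
To make the cycle of guessing, adjusting, and validating concrete, here is a toy sketch of a training loop. It is not how spaCy trains a model: the one-number "model", the update rule, and the data (numbers labeled 1 when greater than 5) are all assumptions chosen to keep the example small.

```python
import random

def predict(weight, bias, x):
    """Guess label 1 if the weighted input is above zero, otherwise 0."""
    return 1 if weight * x + bias > 0 else 0

def accuracy(weight, bias, data):
    """Share of examples whose guessed label matches the known label."""
    return sum(predict(weight, bias, x) == y for x, y in data) / len(data)

def train_toy_model(train_data, valid_data, epochs=10, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    train_data = list(train_data)    # copy so shuffling does not alter the caller's list
    weight, bias = 0.0, 0.0
    validation_scores = []
    for epoch in range(epochs):
        rng.shuffle(train_data)      # randomize the order of the training data each epoch
        for x, y in train_data:
            error = y - predict(weight, bias, x)   # 0 when the guess was right
            weight += learning_rate * error * x    # adjust only after a wrong guess
            bias += learning_rate * error
        # the validation data is never used for adjustments, only for checking
        validation_scores.append(accuracy(weight, bias, valid_data))
    return weight, bias, validation_scores

# Hypothetical labeled data: the label is 1 when the number is greater than 5.
labeled = [(x, int(x > 5)) for x in range(11)]
train_data, valid_data = labeled[::2], labeled[1::2]
weight, bias, validation_scores = train_toy_model(train_data, valid_data)
print(validation_scores)
```

Each entry in `validation_scores` is the accuracy on the validation data after one epoch, which is the kind of figure you would study when deciding how many epochs to use.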

Once the model repeats this process for the set number of epochs, it is finished training. The model’s accuracy can then be tested against the testing dataset to see how well it performs. The reason to keep the testing data separate from the validation data is that, despite not being included in training, some of the validation data seeps into the training process. Because the testing data is well annotated and has played no part in training, the researcher can get an accurate sense of how well the model performs.

With that first model saved, it is common practice to adjust the parameters of the model multiple times to try to create a more accurate model. All models will be tested against the same testing data.

At this stage, depending on the results, more training data may need to be obtained, another test may be called for, or the researcher can begin deploying the model on unseen data and examine the results. Unseen data will be data that does not have annotations.

# Word Vectors

Word vectors, or word embeddings, take a one-dimensional bag of words and give it multidimensional meaning by representing each word in a higher-dimensional space. This is achieved through machine learning and can be done easily with Python libraries such as FastText.

The goal of word vectors is to achieve numerical understanding of language so that a computer can perform more complex tasks on that corpus. Let’s consider the example above. How do we get a computer to understand 2 and 6 are synonyms or mean something similar? One option you might be thinking is to simply give the computer a synonym dictionary. It can look up synonyms and then know what words mean. This approach, on the surface, makes perfect sense, but in practice synonyms are not really the same thing as meaning.

Word vectors have a preset number of dimensions. These dimensions are honed via machine learning. Models take into account how often words appear alongside one another across a corpus and which other words appear in similar contexts. This allows the computer to determine the similarity of words numerically. It then needs to represent these relationships numerically. It does this through the vector: an ordered list of floats (decimal numbers), one for each dimension. The number of dimensions is the number of floats in the vector.

Below is a pretrained model’s output of word vectors for Holocaust documents. This is how the word “know” looks in vectors:

know -0.19911548 -0.27387282 0.04241912 -0.58703226 0.16149549 -0.08585547 -0.10403373 -0.112367705 -0.28902963 -0.42949626 0.051096343 -0.04708015 -0.051914077 -0.010533272 -0.23334776 0.031974062 -0.015784053 -0.21945408 0.07359381 0.04936823 -0.15373217 -0.18460844 -0.055799782 -0.057939123 0.14816307 -0.46049833 0.16128318 0.190906 -0.29180774 -0.08877125 0.23563664 -0.036557104 -0.23812544 0.21938106 -0.2781296 0.5112853 0.049084224 0.14876273 0.20611146 -0.04535578 -0.35051352 -0.26381743 0.20824358 0.29732847 -0.013382204 -0.19970295 -0.34890386 -0.16214448 -0.23497184 0.1656344 0.15815939 0.012848561 -0.22887675 -0.21618247 0.13367777 0.1028471 0.25068823 -0.13625076 -0.11771541 0.4857257 0.102198474 0.06380113 -0.22328818 -0.05281015 0.0059655504 0.095453635 0.39693353 -0.066147 -0.1920163 0.5153346 0.24972811 -0.0076305643 -0.05530072 -0.24668717 -0.074051596 0.29288396 -0.0849124 0.37786478 0.2398532 -0.10374063 0.5445305 -0.41955113 0.39866814 -0.23992492 -0.15373677 0.34488577 -0.07166888 -0.48001364 0.0660652 0.061260436 0.32197484 -0.12741785 0.024006622 -0.07915035 -0.04467735 -0.2387938 -0.07527494 0.07079664 0.074456714 0.17877163 -0.002122373 -0.16164272 0.12381973 -0.5908519 0.5827627 -0.38076186 0.095964395 0.020342976 -0.5244792 0.24467848 -0.12481717 0.2869162 -0.34473857 -0.19579992 -0.18069582 0.015281798 -0.18330036 -0.08794056 0.015334953 -0.5609912 0.17393902 0.04283724 -0.07696586 0.2040299 0.34686008 0.31219167 0.14669564 -0.26249585 -0.42771882 0.5381632 -0.123247474 -0.29142144 -0.29963812 -0.32800657 -0.10684048 -0.08594837 0.19670585 0.13474767 0.18349588 -0.4734125 0.15554792 -0.21062694 -0.14191462 -0.12800062 0.2053445 -0.05258381 0.10878109 0.56381494 0.22724482 -0.17778987 -0.061046753 0.10789692 -0.015310492 0.16563527 -0.31812978 -0.1478078 0.4323269 -0.2543924 -0.25956103 0.38653126 0.5080214 -0.18796602 -0.10318089 0.023921987 -0.14618908 0.22923793 0.37690258 0.13323267 -0.34325415 -0.048353776 -0.30283198 -0.2839813 
-0.2627738 -0.07422618 -0.31940162 0.38072023 0.56700015 -0.023362642 -0.3786432 0.084006436 0.0729958 0.09483505 -0.2665334 0.12699558 -0.37927982 -0.39073908 0.0063185897 -0.34464878 -0.24011964 0.09303968 -0.15488827 -0.018486138 0.3560308 -0.26005003 0.089302294 0.116130605 0.07684872 -0.085253105 -0.28178927 -0.17346472 -0.20008522 0.004347025 0.34192443 0.017453942 0.06926512 -0.15926014 -0.018554512 0.18478563 -0.040194467 0.38450953 0.4104423 -0.016453728 0.013374495 -0.011256633 0.09106963 0.20074937 0.17310189 -0.12467103 0.16330549 -0.0009963055 0.12181527 -0.05295286 -0.0059491103 -0.04697837 0.38616535 -0.21074814 -0.32234505 0.47269863 0.27924335 0.13548143 -0.2677968 0.03536313 0.3248672 0.2062973 0.29093853 0.1844036 -0.43359983 0.025519002 -0.06319317 -0.2427806 -0.22732906 0.08803728 -0.041860744 -0.151291 0.3400458 -0.29143015 0.25334117 0.06265491 0.26399022 -0.20121849 0.22156847 -0.50599706 0.069224015 0.52325517 -0.34115726 -0.105219565 -0.37346402 -0.02126528 0.09619415 0.017722093 -0.3621799 -0.109912336 0.021542747 -0.13361925 0.2087667 -0.08780184 0.09494446 -0.25047818 -0.07924239 0.21750642 0.2621652 -0.52888566 0.081884995 -0.20485449 0.18029206 -0.5623824 -0.03897387 0.3213515 0.057455678 -0.26524526 0.14741589 0.1257589 0.04708992 0.026751317 -0.014696863 -0.11038961 0.004459205 -0.01394376 0.091146186 -0.15486309 0.20662159 -0.0987916 -0.07740813 0.009704136 0.28866896 0.3916269 0.35061485 0.31678385 0.43233085 0.44510433

For these vectors, I used the industry-standard of 300 dimensions. We see each of these dimensions represented by each of the floats, separated by whitespace. As the model passes over the corpus it is being trained on, it hones these numbers and changes them for each word. Over multiple epochs, or generations, it gains a clearer sense of the similarity of words, or at least words that are used in similar contexts.

Once a word vector model is trained, we can do similarity matches very quickly and very reliably. I work primarily with documents on the Holocaust and human rights abuses. For this reason, I will use a word vector model that I have trained on Holocaust documents. Consider the term "concentration camp". Let’s now use these word vectors to find the 10 most similar words to concentration camp.

In [13]:
[
    ('extermination_camp', 0.5768706798553467),
    ('camp', 0.5369070172309875),
    ('Flossenbiirg', 0.5099129676818848),
    ('Sachsenhausen', 0.5068483948707581),
    ('Auschwitz', 0.48929861187934875),
    ('Dachau', 0.4765608310699463),
    ('concen', 0.4753464460372925),
    ('Majdanek', 0.4740387797355652),
    ('Sered', 0.47086501121520996),
    ('Buchenwald', 0.4692303538322449)
]

These are the items that are most similar to concentration camp in our word vectors. The tuple has two indices. Index 0 is the word and index 1 is the similarity, represented as a float.

Extermination camp is not a direct synonym, as it carries a distinction in what happened to prisoners (i.e., execution); the two terms are nevertheless very similar. Seeing this as the most similar word is a sign that the word vectors are well-aligned. Camp is expected, as it is a single word with a similar meaning in context to concentration camp. The remainder of the list are proper nouns, all of which were concentration camps, with one exception: “concen”. This is clearly a result of poor cleaning. Concen is not a word but, most likely, a fragment of the hyphenated “concen-tration”. The fact that it appears here is also a good sign that our word vectors have aligned well enough to place such typos in nearby vector space.
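
Similarity scores like the ones above are commonly cosine similarities between vectors; that is an assumption here, since it depends on the library used. The sketch below shows how such a ranking could be computed by hand. The three-dimensional `toy_vectors` are invented for illustration and the helper names are hypothetical; real vectors would have 300 floats each.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=10):
    """Return (word, similarity) tuples for the topn words closest to the given word."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "camp": [0.9, 0.1, 0.0],
    "subcamp": [0.8, 0.2, 0.1],
    "village": [0.1, 0.9, 0.2],
}
for other, score in most_similar("camp", toy_vectors, topn=2):
    print(other, round(score, 3))
```

As in the output above, each result is a tuple with the word at index 0 and the similarity at index 1, sorted from most to least similar.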

Let’s do something similar with Auschwitz.

In [14]:
[
    ('Auschwitz_Birkenau', 0.6649479866027832),
    ('Birkenau', 0.5385118126869202),
    ('subcamp', 0.5343026518821716),
    ('camp', 0.533636748790741),
    ('III', 0.5323576927185059),
    ('stutthof', 0.518073320388794),
    ('Ravensbriick', 0.5084848403930664),
    ('Berlitzer', 0.5083401203155518),
    ('Malchow', 0.5051567554473877),
    ('Oswiecim', 0.5016494393348694)
]

As we can see, the words closest to Auschwitz are places associated with Auschwitz, such as Birkenau, subcamps (of which Auschwitz had many), other concentration camps (such as Ravensbriick), and the location of the Auschwitz memorial, Oswiecim.

In other words, we have words closely associated with Auschwitz in particular.

# Training Sets for NER

One of the nice things about spaCy, besides the fact that it scales very well (meaning it can perform well on small data and big data), is that it is easy to customize and lets you perform advanced machine learning methods with little to no knowledge of machine learning. Understanding the basics of ML, however, as discussed earlier in this notebook, is helpful because it will allow you to understand how to cultivate a good training set and why certain methods may fail or struggle. In truth, you will develop a sense of what works and what doesn’t work in ML NER simply by doing it.

Earlier in this notebook, I mentioned that data for training a machine learning model exists in three forms: training data, validation data, and testing data. All this data takes the same form: a list in which each item contains a text (a sentence, a paragraph, or an entire document). The length of this text will depend on what you are hoping to achieve via ML NER, and it will affect the training process. For now, however, let us ignore that. The only other component the training data requires is a list of the entities in that text with their start position, end position, and label. During the training process, these annotations allow the convolutional neural network (the architecture behind spaCy’s machine learning training process) to learn from the data and correctly identify the entities you are training.

## What does a spaCy Training Set Look Like?

SpaCy requires that your training data be in a very specific form:

TRAIN_DATA = [ (TEXT AS A STRING, {"entities": [(START, END, LABEL)]}) ]

Note that TRAIN_DATA is capitalized. Python variable names are normally lowercase, with a few exceptions. TRAIN_DATA is one of these exceptions: PEP 8, the Python style guide, reserves all-capital names for constants, and in nearly every book and tutorial you will see TRAIN_DATA written this way. It is, of course, not necessary, but it is always good practice to be as Pythonic as possible in your code so that others will be able to more easily read your code. Any machine learning practitioner will expect to see TRAIN_DATA as such.
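
To make the format concrete, here is a small hand-written example using the two sample sentences that appear later in this notebook. The loop is only a sanity check added for illustration: slicing each text with its start and end offsets should give back the entity text.

```python
# A hand-written example in the required format; offsets are character positions.
TRAIN_DATA = [
    ("Treblinka is a small village in Poland.", {"entities": [(0, 9, "GPE")]}),
    ("Wikipedia notes that Treblinka is not large.", {"entities": [(21, 30, "GPE")]}),
]

for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        # Slicing with the offsets should reproduce the entity text exactly.
        print(repr(text[start:end]), label)
```

Both lines of output show `'Treblinka' GPE`, which confirms that the end offset is exclusive, just like a Python slice.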

Getting the training data into this format is very difficult by hand. A researcher would have to count the characters to assign the start and end of each entity. Even if you use Python's built-in string functions to get the start and end characters, you will run into another problem: spaCy’s training process expects every start and end to line up with the boundaries of its own tokens, and offsets found with string functions do not always do so. During training, spaCy will drop the annotations that don’t align with the start and end of a token. Fortunately, there are solutions built into spaCy via the EntityRuler to aid you in this process.
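
The alignment problem can be illustrated without spaCy at all. The sketch below uses a simple regular expression as a stand-in for a tokenizer (spaCy's real tokenizer is more sophisticated, so this is an assumption made for illustration) and checks whether a pair of offsets falls exactly on token boundaries. The helper names are hypothetical.

```python
import re

def token_boundaries(text):
    """Return the start offsets and end offsets of simple tokens (words and punctuation)."""
    starts, ends = set(), set()
    for match in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends

def aligns_with_tokens(text, start, end):
    """True only if start begins a token and end closes a token."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends

text = "Wikipedia notes that Treblinka is not large."
print(aligns_with_tokens(text, 21, 30))  # True: covers exactly "Treblinka"
print(aligns_with_tokens(text, 21, 31))  # False: the end includes the following space
print(aligns_with_tokens(text, 20, 30))  # False: the start includes the preceding space
```

An off-by-one offset like the second or third case is easy to produce when counting by hand. In spaCy, `doc.char_span` returns `None` for spans that do not line up with tokens, which is the case the conversion function later in this notebook checks for.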

If you are interested in manual annotation, I highly encourage you to explore the paid software from Explosion AI, Prodigy (https://prodi.gy/). I am in no way being paid to promote that product. It is expensive, but if you need to do a lot of annotations (for images, text, video, or even audio), then Prodigy is the tool for you. It has a nice user-interface and because it is developed by the same team who gives us spaCy, it can fit seamlessly into a spaCy workflow. You can explore the Prodigy demo here: https://prodi.gy/demo.

## Creating a Training Set

In the code below, we will make a spaCy machine learning training set via the EntityRuler. In other words, we will use a rules-based method to automatically generate a basic training set. Will this training set have mistakes? Possibly. That’s why it is a good idea to look at the training set and manually verify it. By doing it in this manner, however, you can vastly speed up prototyping to see if the custom entity you want to train is potentially viable. In machine learning, there are rarely concrete solutions for domain-specific problems. If there were, people wouldn’t need specialists. Experimentation is often the name of the game in machine learning and it is no different with NER machine learning.

We are going to create a blank English model because we will only use this model temporarily and don’t need the other components. This model will only have an EntityRuler, which we will use to generate the training set. Recall from our last notebook that the spaCy small model could not identify Treblinka correctly as a location. In the code below, we will build a very small training set from the sentences in our sample text. I want to be clear: this training data is nowhere near enough to train a model. The process scales very well, however.

Here is the same code we saw before, but with a slightly different text. Note the output. It has correctly identified Treblinka as a GPE.

In [15]:
#Import the requisite library
import spacy

#Create a blank English model
nlp = spacy.blank("en")


#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)

doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Treblinka GPE
Treblinka GPE


Now, we are going to modify this code slightly so that we can generate a slightly different output, one with the start and end of the text.

In [16]:
#Import the requisite library
import spacy

#Create a blank English model
nlp = spacy.blank("en")

#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)

doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.start_char, ent.end_char, ent.label_)


Treblinka 0 9 GPE
Treblinka 61 70 GPE


Notice now that our output has 0, 9 and 61, 70 for the start and end, respectively, of each entity. With this data, we can now begin to generate the output we wish. However, let’s first take the input text and break it down into sentences, so that we have two separate sets of training data.

In [19]:
#Import the requisite library
import spacy

#Create a blank English model with a sentencizer to split the text into sentences
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

#Create a blank list for appending later.
corpus = []

doc = nlp(text)

#use the sentencizer to get the sentences.
for sent in doc.sents:
    corpus.append(sent.text)

print (corpus)


#Create a second blank English model for the EntityRuler
nlp = spacy.blank("en")

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)



#iterate over the sentences
for sentence in corpus:
    doc = nlp(sentence)

    #extract entities
    for ent in doc.ents:
        print (ent.text, ent.start_char, ent.end_char, ent.label_)


['Treblinka is a small village in Poland.', 'Wikipedia notes that Treblinka is not large.']
Treblinka 0 9 GPE
Treblinka 21 30 GPE


Notice now we have a different output with different starts and endings. Now, we can once again modify our code to get it into the format we want:

TRAIN_DATA = [ (TEXT AS A STRING, {"entities": [(START, END, LABEL)]}) ]

In [23]:
#Import the requisite library
import spacy

#Create a blank English model with a sentencizer to split the text into sentences
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

corpus = []

doc = nlp(text)
for sent in doc.sents:
    corpus.append(sent.text)

#Create a second blank English model for the EntityRuler
nlp = spacy.blank("en")

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)


TRAIN_DATA = []

#iterate over the corpus again
for sentence in corpus:
    doc = nlp(sentence)
    
    #the entities sit inside the dictionary at index 1 of each training example, so start with an empty list
    entities = []
    
    #extract entities
    for ent in doc.ents:

        #appending to entities in the correct format
        entities.append([ent.start_char, ent.end_char, ent.label_])
        
    TRAIN_DATA.append([sentence, {"entities": entities}])

for data in TRAIN_DATA:
    print (data)

['Treblinka is a small village in Poland.', {'entities': [[0, 9, 'GPE']]}]
['Wikipedia notes that Treblinka is not large.', {'entities': [[21, 30, 'GPE']]}]


## How to Convert the Training Data to spaCy Binary Files

In [25]:
import warnings
import spacy
from spacy.tokens import DocBin
from pathlib import Path

def convert(lang: str, TRAIN_DATA, output_path: Path):
    nlp = spacy.blank(lang)
    db = DocBin()
    for text, annot in TRAIN_DATA:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                warnings.warn(msg)
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_path)

In [26]:
convert("en", TRAIN_DATA, "../data/train.spacy")
convert("en", TRAIN_DATA, "../data/valid.spacy")

## What is the spaCy config.cfg File and How do I create it?

Now that we have our training data ready, it’s time to start preparing our model. In spaCy 3, we have a lot of control over the neural network architecture and hyperparameters of our model. This all takes place in the new config.cfg file. This config file is given to spaCy during the training process so that it knows what to train and how. In order to create the config.cfg file, we first need to create a base_config.cfg file. To do that, we can use spaCy’s handy GUI, found [here](https://spacy.io/usage/training) (scroll down a bit).

For our purposes, select “English” (the language we are training), “ner” only (the component we are training), “CPU” (GPU is a bit more complex), and “efficiency” (quicker to train and smaller because there are no word vectors). Copy the output from the GUI and save it in your directory as “base_config.cfg”. We will make only two minor changes to this base_config.cfg file: we will set the paths for train and dev (found in the first section, paths) to the locations of our train.spacy and valid.spacy files.
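
The two changes are easiest to make by hand in a text editor, but the sketch below shows what they amount to. The excerpt is an assumption about what the paths section of a freshly copied base_config.cfg looks like, and Python's standard `configparser` is used only as a stand-in to illustrate the edit; spaCy reads the file with its own config system.

```python
import configparser

# Assumed excerpt of a base_config.cfg; the real file contains many more sections.
base_config_excerpt = """
[paths]
train = null
dev = null
"""

config = configparser.ConfigParser(interpolation=None)
config.read_string(base_config_excerpt)

# Point both entries at the binary files (the paths used in this notebook).
config["paths"]["train"] = '"../data/train.spacy"'
config["paths"]["dev"] = '"../data/valid.spacy"'

for key, value in config["paths"].items():
    print(f"{key} = {value}")
```

The printed lines show how the two entries should read once you have edited the file.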

Now that the base_config file is set up correctly, it’s time to convert it to a config.cfg file. To do that, we need to execute a terminal command. Fortunately, we can do that here in Jupyter Notebook. I have placed my base_config file in the subfolder data. By running the command below, spaCy reformats the base_config into a properly formatted config.cfg file.

In [27]:
!python -m spacy init fill-config ../data/base_config.cfg ../data/config.cfg

✔ Auto-filled config with all values
✔ Saved config
../data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## How to Train a spaCy 3 Model from the config.cfg File

In [30]:
from spacy.cli.train import train

In [35]:
train("../data/config.cfg",
      overrides={"paths.train": "../data/train.spacy",
                 "paths.dev": "../data/valid.spacy"},
    output_path="../models/sample")

✔ Created output directory: ../models/sample
ℹ Saving to output directory: ../models/sample
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00      7.83   25.00   14.29  100.00    0.25
200     200          0.20     98.13  100.00  100.00  100.00    1.00
400     400          0.00      0.00  100.00  100.00  100.00    1.00
600     600          0.00      0.00  100.00  100.00  100.00    1.00
800     800          0.00      0.00  100.00  100.00  100.00    1.00
1000    1000          0.00      0.00  100.00  100.00  100.00    1.00
1200    1200          0.00      0.00  100.00  100.00  100.00    1.00
1400    1400          0.00      0.00  100.00  100.00  100.00    1.00
1600    1600          0.00      0.00  100.0
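
In the table above, ENTS_P, ENTS_R, and ENTS_F are the precision, recall, and F-score of the entities predicted on the validation data. The sketch below shows how the three relate. The counts passed in are an assumption: spaCy does not print them, but two correct entities and twelve spurious ones would reproduce the first row, given that the validation file in this notebook holds two entities.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (the F-score)."""
    predicted = true_positives + false_positives   # everything the model labeled
    actual = true_positives + false_negatives      # everything it should have labeled
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts consistent with the first row of the table:
# 2 correct entities, 12 spurious entities, 0 missed entities.
p, r, f = precision_recall_f1(2, 12, 0)
print(round(p * 100, 2), round(r * 100, 2), round(f * 100, 2))  # 14.29 100.0 25.0
```

A high recall with a low precision, as in that first row, means the model found every real entity but also labeled many things that were not entities.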

In [37]:
trained_nlp = spacy.load("../models/sample/model-best")
text = "The village of Treblinka is located in Poland."
doc = trained_nlp(text)

for ent in doc.ents:
    print (ent.text, ent.label_)

Treblinka GPE


Note that we gave the machine learning NER model a new sentence and it correctly identified Treblinka as a “GPE”. But we should not get too excited. Minor alterations to this text result in a missed entity.

In [38]:
text = "Mark, from New York, said that he wants to go to Treblinkaa to speak to the locals."
doc = trained_nlp(text)

for ent in doc.ents:
    print (ent.text, ent.label_)
if len(doc.ents) == 0:
    print ("No entities found.")

No entities found.


Why does our model now fail? Because we have trained a machine learning model, not an EntityRuler. It knows that Treblinka is a GPE, but it has only learned to identify it if it is spelled correctly. This is a bad model. Machine learning NER models improve as we feed them more training data. Most importantly, however, they improve as we feed them more varied training data. A good rule of thumb is to start with 200 training samples and then make adjustments going forward. You may need to gather more varied training data or you may need to reconsider your labels. Another possibility is that you need to fine-tune your hyperparameters in the config.cfg file. We will be covering these problems and solutions throughout the remainder of this textbook. By now, though, you should have a good sense of how the training process works in spaCy 3. The material discussed in this notebook is by far the most challenging so far. Take your time here and get to know this process well before moving forward.

# Transformer Models

If your texts regularly vary in spelling and form, there are solutions to this problem. For better models in these scenarios, you should consider using transformer models. These models are more robust and are trained to guess a missing word in a text. This results in a deeper understanding of the language. Transformer models also learn to recognize sub-word components of a word and store them as sub-word embeddings. This means that transformer models can also handle out-of-vocabulary (OOV) words or variant spellings that they have never seen before.
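
To give a feel for sub-word components, here is a toy sketch of greedy longest-match splitting, in the style of the WordPiece tokenizers used by BERT-like models. The vocabulary is invented for illustration and the function name is hypothetical; real transformer vocabularies are learned from large corpora and contain tens of thousands of pieces.

```python
def subword_tokenize(word, vocabulary):
    """Greedy longest-match-first split of a word into known sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark pieces that continue a word
            if candidate in vocabulary:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return ["[UNK]"]                   # no known piece fits this word
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, invented for illustration only.
toy_vocabulary = {"Treblinka", "Tre", "##blink", "##a", "##ka"}
print(subword_tokenize("Treblinka", toy_vocabulary))   # ['Treblinka']
print(subword_tokenize("Treblinkaa", toy_vocabulary))  # ['Treblinka', '##a']
```

The misspelled "Treblinkaa" from the earlier example is not thrown away as an unknown word. Most of it is still recognized as a familiar piece, which is one reason a transformer model can be more forgiving of variant spellings.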