# Text Analysis with Ancient and Medieval Languages<br><br>Day 03:<br>Solving an NER Problem from Beginning to End

<center>Dr. William Mattingly<br>
TAP Institute with JSTOR</center>

In [1]:
!pip install spacy



You should consider upgrading via the 'c:\users\wma22\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.


## Structure of Today

We have been given a task to develop a named entity recognition (NER) model for Latin. The problem: no NER model exists.

In this notebook we will be solving this problem from beginning to end. In this scenario, no training data exists. No lemmatizer exists. No POS tagger exists. And no NER model exists. In other words, we need to start from the ground up. Fortunately, we have access to two different sources of information: raw text (TheLatinLibrary) and some structured tabular data about Latin names (Wikipedia) and some tabular data about places (Columbia's Orblata). With these limited resources, we will solve our problem, or at least have a base-line model for improvement in the future.

Our workflow moving forward will be something like this:<br>
1) Acquire data for constructing a list of Latin personal names, places, and groups.<br>
2) Find words that appear in both lists and remove them to insure no false positives are matched, such as Romani which can be a PERSON or GROUP.<br>
3) Decline all words in each list and store all these variant forms in separate files.<br>
4) Create an EntityRuler in spaCy, or a rules-based approach with these declined forms<br>
5) Use that EntityRuler to pass over a text. In our case, Caesar's Gallic War and use the output of the EntityRuler to generate a training set for a machine learning NER model.<br>
6) Train base-line NER model<br>
7) Use that base-line NER model to then help automate the cultivation of a domain-expert validated training set.<br>

## Part One: Grabbing the Data

To keep this notebook a little cleaner, I have provided two other notebooks in this repo: wiki_scrape.ipynb and decline_latin.ipynb. We will be turning to these now and discussing how to webscrape and decline Latin words in Python. The need to decline Latin words stems from the fact that Latin is highly inflected and in this scenario, we do not have a lemmatizer, meaning we cannot simply find the roots of the word prior to passing the data to our NER model. Further, we cannot simply extract all proper nouns from a text, because we do not have a POS tagger.

## Part Two: Eliminating Potential False Positives

Discussion outside of notebook

## Part Three: Declining

See the notebook decline_latin.ipynb

## Part Four: Create an EntityRuler in spaCy

In [2]:
import spacy
from spacy.pipeline import EntityRuler
import json
import glob

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


In [3]:
def load_data(file):
    with open (file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return (data)

def write_data(file, data):
    with open (file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


In [4]:
#Gernating rules
def generate_ruler(patterns, name):
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    ruler.to_disk(f"models/{name}_ent_ruler/entity_ruler/patterns.jsonl") 
    nlp.to_disk(f"models/{name}_ent_ruler")

In [5]:
def create_training_data(file, type):
    data = load_data(file)
    patterns = []
    for item in data:
        pattern = {
                    "label": type,
                    "pattern": item
                    }
        patterns.append(pattern)
    return (patterns)

In [6]:
def test_ent_ruler(ruler, corpus):
    nlp = spacy.load(ruler)
    with open (corpus, "r", encoding="utf-8") as f:
        corpus = f.read()
    with open ("temp/results.txt", "w", encoding="utf-8") as f:
        doc = nlp(corpus)
        for ent in doc.ents:
            f.write(f"{ent.text}, {ent.label_}\n")

In [7]:
person_patterns = create_training_data("latin_data/all_names_declined.json", "PERSON")
groups_patterns = create_training_data("latin_data/groups_declined.json", "GROUP")
places_patterns = create_training_data("latin_data/places_declined.json", "LOCATION")

In [8]:
all_patterns = person_patterns+groups_patterns+places_patterns

In [9]:
generate_ruler(all_patterns, "latin_loc_per_group")

In [10]:
test_ent_ruler("models/latin_loc_per_group_ent_ruler", "latin_data/corpus.txt")

## Part Five: Use the EntityRuler to Generate Training Data

In [11]:
def create_training_set(corpus, ent_ruler_model, output_file, prodigy=False):
    nlp=spacy.load(ent_ruler_model)
    TRAIN_DATA = []
    with open (corpus, "r", encoding="utf-8") as f:
        data = f.read()
        segments = data.split("\n")
        for segment in segments:
            segment = segment.strip()
            doc = nlp(segment)
            entities = []
            for ent in doc.ents:
                if prodigy==True:
                    entities.append({"start":ent.start_char, "end": ent.end_char,  "label": ent.label_, "text": ent.text})
                    pass
                else:
                    entities.append((ent.start_char, ent.end_char, ent.label_))
            if len(entities) > 0:
                if prodigy==True:
                    TRAIN_DATA.append({"text": segment, "spans": entities})
                else:
                    TRAIN_DATA.append([segment, {"entities": entities}])
    print (len(TRAIN_DATA))
    with open (output_file, "w", encoding="utf-8") as f:
        json.dump(TRAIN_DATA, f, indent=4)



In [12]:
create_training_set("latin_data/corpus.txt", "models/latin_loc_per_group_ent_ruler", "training_data/training_set_spacy.json", prodigy=False)

388


In [13]:
from spacy.tokens import DocBin

In [14]:
all_docs = load_data("training_data/training_set_spacy.json")

In [15]:
print (all_docs[2])

['[3] His rebus adducti et auctoritate Orgetorigis permoti constituerunt ea quae ad proficiscendum pertinerent comparare, iumentorum et carrorum quam maximum numerum coemere, sementes quam maximas facere, ut in itinere copia frumenti suppeteret, cum proximis civitatibus pacem et amicitiam confirmare. Ad eas res conficiendas biennium sibi satis esse duxerunt; in tertium annum profectionem lege confirmant. Ad eas res conficiendas Orgetorix deligitur. Is sibi legationem ad civitates suscipit. In eo itinere persuadet Castico, Catamantaloedis filio, Sequano, cuius pater regnum in Sequanis multos annos obtinuerat et a senatu populi Romani amicus appellatus erat, ut regnum in civitate sua occuparet, quod pater ante habuerit; itemque Dumnorigi Haeduo, fratri Diviciaci, qui eo tempore principatum in civitate obtinebat ac maxime plebi acceptus erat, ut idem conaretur persuadet eique filiam suam in matrimonium dat. Perfacile factu esse illis probat conata perficere, propterea quod ipse suae civit

In [16]:
train_docs = all_docs[:200]

In [17]:
valid_docs = all_docs[200:]

## Part Six: Train the Base-Line NER Model

In [18]:
train_db = DocBin()
from tqdm import tqdm
nlp = spacy.blank("en")
for text, annot in tqdm(train_docs):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            pass
        else:
            ents.append(span)
    doc.ents = ents
    train_db.add(doc)

100%|███████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 620.32it/s]


In [19]:
valid_db = DocBin()
from tqdm import tqdm
nlp = spacy.blank("en")
for text, annot in tqdm(valid_docs):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            pass
        else:
            ents.append(span)
    doc.ents = ents
    train_db.add(doc)

100%|███████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 587.24it/s]


In [20]:
train_db.to_disk("./training_data/train_hs.spacy")

In [21]:
valid_db.to_disk("./training_data/valid_hs.spacy")

In [22]:
!python -m spacy init fill-config base_config.cfg config.cfg

[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


2021-07-02 07:10:26.041561: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll


In [23]:
!python -m spacy train config.cfg --output ./output

[i] Using CPU
[1m
[+] Initialized pipeline
[1m
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     81.36    0.00    0.00    0.00    0.00
  0     200         53.84   2049.80    0.00    0.00    0.00    0.00
  1     400         68.93    486.92    0.00    0.00    0.00    0.00
  1     600         82.93    208.14    0.00    0.00    0.00    0.00
  2     800         81.99    168.52    0.00    0.00    0.00    0.00
  2    1000         84.50     95.88    0.00    0.00    0.00    0.00
  3    1200        104.29     93.00    0.00    0.00    0.00    0.00
  4    1400         81.09     67.93    0.00    0.00    0.00    0.00
  5    1600         87.12     48.55    0.00    0.00    0.00    0.00
[+] Saved pipeline to output directory
output\model-last


2021-07-02 07:10:30.351400: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
[2021-07-02 07:10:33,248] [INFO] Set up nlp object from config
[2021-07-02 07:10:33,254] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-07-02 07:10:33,257] [INFO] Created vocabulary
[2021-07-02 07:10:33,257] [INFO] Finished initializing nlp object
[2021-07-02 07:10:34,376] [INFO] Initialized pipeline components: ['tok2vec', 'ner']


In [24]:
nlp = spacy.load("output/model-best")

In [25]:
with open ("latin_data/livy_01.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [26]:
doc = nlp(text)

In [27]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Paphlagonia GROUP
Pylaemene GROUP
Troiam GROUP
Troiano GROUP
Veneti GROUP
Macedoniam GROUP
Laurentem GROUP
Troiani GROUP
Aboriginesque PERSON
Latinum LOCATION
Aeneam GROUP
Veneris GROUP
Ascanium GROUP
Aborigines GROUP
Aborigines GROUP
Rutulique GROUP
Etruscorum PERSON
Rutulis GROUP
Latinos GROUP
Aborigines GROUP
Aeneam GROUP
Aeneas GROUP
Siculum GROUP
Iovem GROUP
Latina PERSON
Albam GROUP
Etruscis LOCATION
Etruscis LOCATION
Silvius GROUP
Aeneam GROUP
Silvium GROUP
Silvium GROUP
Capys GROUP
Capeto PERSON
Albulae GROUP
Silvius GROUP
Amulium GROUP
Numitori GROUP
— PERSON
— PERSON
Lupercal GROUP
Pallantium GROUP
Arcadia LOCATION
Lycaeum PERSON
Romani GROUP
Numitoris GROUP
Numitori GROUP
Remus PERSON
Numitori GROUP
Remus PERSON
Numitori GROUP
Remumque GROUP
Albanorum GROUP
Albam GROUP
Remus PERSON
Aventinum PERSON
Priori GROUP
Remus PERSON
Herculi GROUP
Evandro LOCATION
Peloponneso GROUP
Italiam LOCATION
Herculi GROUP
Pinariis GROUP
Rebus GROUP
Etruscis LOCATION
Neptuno GROUP
Romanam PERSON