<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `NLP with spaCy` `3`

This is lesson `3` of 3 in the educational series on `Natural Language Processing (NLP)`. This notebook is intended `to teach the basics of NLP and the spaCy library`.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial`

**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand how to solve an NLP problem, specifically information extraction
2. Understand how to find data
3. Understand how to structure data
4. Understand how to develop a spaCy Pipeline
```
**Research Pipeline:**
```
N/A
```
___

# Required Python Libraries
* [spaCy](https://spacy.io/) for performing [Natural Language Processing (NLP)](https://docs.constellate.org/key-terms/#nlp).

## Install Required Libraries

In [1]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!python -m spacy download en_core_web_sm





In [1]:
### Import Libraries ###
import spacy

# Introductory Material: The spaCy EntityRuler

The Python library spaCy offers a few different methods for performing rules-based NER. One such method is via its EntityRuler.

The **EntityRuler** is a spaCy factory that allows one to create a set of patterns with corresponding labels. A **factory** in spaCy is a set of classes and functions preloaded in spaCy that perform set tasks. In the case of the EntityRuler, the factory at hand allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.

Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe. I have spoken briefly about pipes in past notebooks, but it is worth addressing them in more detail here.

A **pipe** is a component of a **pipeline**. A pipeline's purpose is to take input data, perform some sort of operations on that input data, and then output the results of those operations either as new data or as extracted metadata. In the case of spaCy, there are a few different pipes that perform different tasks. The tokenizer tokenizes the text into individual tokens; the parser parses the text; and the NER identifies entities and labels them accordingly. All of this data is stored in the Doc object as we saw in Notebook 01_01 of this series.

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive. Sometimes this sequence is essential, meaning later pipes depend on earlier pipes. At other times, this sequence is not essential, meaning later pipes can function without earlier pipes. It is important to keep this in mind as you create custom spaCy models (or any pipeline for that matter).
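The ordering of pipes can be inspected directly. The following is a minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`) rather than the trained model so that no download is needed. The `sentencizer` is used here only as a stand-in for a second pipe; it is not part of this lesson's workflow.

```python
import spacy

# A blank English pipeline has a tokenizer but no other pipes yet
nlp = spacy.blank("en")
print(nlp.pipe_names)

# By default, add_pipe appends the new component to the end of the pipeline
nlp.add_pipe("sentencizer")

# The before argument inserts the new component ahead of an existing one
nlp.add_pipe("entity_ruler", before="sentencizer")

# pipe_names lists the components in the order in which they run
print(nlp.pipe_names)
```

Printing `nlp.pipe_names` after each change is a quick way to confirm where a new pipe landed before relying on its output.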

In this notebook, we will be looking closely at the EntityRuler as a component of a spaCy model's pipeline. Off-the-shelf spaCy models come preloaded with an NER model; they do not, however, come with an EntityRuler. In order to incorporate an EntityRuler into a spaCy model, it must be created as a new pipe, given instructions, and then added to the model. Once this is complete, the user can save that new model with the EntityRuler to the disk.
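Saving and reloading can be sketched as follows. This is an illustration under assumptions, not the workflow used later in this notebook: it starts from a blank English pipeline instead of `en_core_web_sm`, and it writes to a temporary directory so that nothing is left behind on disk.

```python
import tempfile

import spacy

# Build a small pipeline that contains only an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Treblinka"}])

# Save the pipeline to a (temporary) directory, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

# The reloaded pipeline carries the ruler and its patterns with it
doc = reloaded("The village of Treblinka is in Poland.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

In practice you would pass a permanent folder name to `to_disk()` and later hand that same path to `spacy.load()`.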

The full documentation of spaCy EntityRuler can be found here: https://spacy.io/api/entityruler .

This notebook will synthesize this documentation for non-specialists and provide some examples of it in action.

## Demonstration of EntityRuler in Action

In the code below, we will introduce a new pipe into spaCy's off-the-shelf small English model. The purpose of this EntityRuler will be to identify small villages in Poland correctly.

In [4]:
#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Poland GPE


*Depending on the version of model you are using, some results may vary.*

The output above demonstrates the small spaCy model's failure to identify Treblinka, which is a small village in Poland. As the sample text indicates, it was also an extermination camp during WWII. In the run shown here, the model missed Treblinka in both sentences; with other versions of the model, the first mention is tagged as LOC (location) and the second is missed entirely. Both results are either imprecise or wrong. I would have accepted ORG for the second sentence, as spaCy's model does not know how to classify an extermination camp, but what these results demonstrate is the model's failure to generalize to unseen data. The reason? There are a few, but I suspect the model never encountered the word Treblinka.

This is a common problem in NLP for specific domains. Off-the-shelf models will often fail in the domains in which we wish to deploy them because they have not been trained on domain-specific texts. We can resolve this, however, either via spaCy's EntityRuler or via training a new model. As we will see over the next few notebooks, we can use spaCy's EntityRuler to easily achieve both.

For now, let's first remedy the issue by giving the model instructions for correctly identifying Treblinka. For simplicity, we will use spaCy's GPE label. In a later notebook, we will teach a model to correctly identify Treblinka in the latter context as a concentration camp.

In [5]:
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)


doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Treblinka GPE
Poland GPE
Treblinka GPE


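If you executed the code above and got the same output, then you did everything correctly: the EntityRuler labeled both mentions of Treblinka. This method is fragile, however. Why? The answer comes back to the concept of pipelines. We created and added the EntityRuler to the spaCy model's pipeline, but by default, spaCy adds a new pipe to the end of the pipeline, after the "ner" pipe. In that position the ruler does not, by default, overwrite entities that the NER has already labeled, so it only worked here because the NER left Treblinka untagged. In order to visualize the pipeline, let's use spaCy's analyze_pipes().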

In [6]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

This can be a bit difficult to read at first, but what it shows us is the order in which our pipes are set up and a few other key pieces of information about each pipe. If we locate "ner", we notice that "entity_ruler" sits after it.

In order for our EntityRuler to have primacy, we have to place it before the "ner" pipe, as the example below shows in this line:

`ruler = nlp.add_pipe("entity_ruler", before="ner")`

In [7]:
#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

#Create the EntityRuler and place it before the NER
ruler = nlp.add_pipe("entity_ruler", before="ner")

#List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)


doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Treblinka GPE
Poland GPE
Treblinka GPE


Notice now that our EntityRuler is functioning before the "ner" pipe and is, therefore, prefinding entities and labeling them before the NER gets to them. Because it comes earlier in the pipeline, its metadata holds primacy over the later "ner" pipe.
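Primacy can also be sketched without the trained model. In the example below, two EntityRulers on a blank English pipeline compete for the same token; the first ruler stands in for an earlier pipe (such as the NER) that has already assigned a label. The function name `label_with_two_rulers` and the pipe names are hypothetical. The `overwrite_ents` setting is the EntityRuler option that controls whether a later ruler may replace labels that are already present.

```python
import spacy

text = "The village of Treblinka is in Poland."

def label_with_two_rulers(overwrite):
    """Run two competing rulers and return the final entity labels."""
    nlp = spacy.blank("en")

    # The first ruler stands in for an earlier pipe that assigns LOC
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "LOC", "pattern": "Treblinka"}])

    # The second ruler runs later and proposes GPE for the same token
    second = nlp.add_pipe(
        "entity_ruler",
        name="second_ruler",
        config={"overwrite_ents": overwrite},
    )
    second.add_patterns([{"label": "GPE", "pattern": "Treblinka"}])

    return [(ent.text, ent.label_) for ent in nlp(text).ents]

default_result = label_with_two_rulers(overwrite=False)
overwrite_result = label_with_two_rulers(overwrite=True)
print(default_result)
print(overwrite_result)
```

With the default setting the earlier label should survive; with `overwrite_ents` enabled the later ruler should win. Placing the ruler before the "ner" pipe, as above, is the other way to give it primacy.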

## Introducing Complex Rules and Variance to the EntityRuler (Advanced)

In some instances, labels may have a set type of variance that follows a distinct pattern or set of patterns. One such example (included in the spaCy documentation) is phone numbers. In the United States, phone numbers have a few forms. The standard formal method is (xxx)-xxx-xxxx, but it is not uncommon to see xxx-xxx-xxxx or xxxxxxxxxx. If the owner of the phone number is giving that same number to someone outside the US, then it may appear as +1(xxx)-xxx-xxxx.

If you are working within a United States domain, you can pass RegEx formulas to the pattern matcher to grab all of these instances.

The spaCy EntityRuler also allows the user to introduce a variety of complex rules and variances (via, among other things, RegEx) by passing the rules to the pattern. There are many arguments that one can pass to the patterns. For a complete list, see: https://spacy.io/usage/rule-based-matching . To experiment with how these work, I recommend using the spaCy Matcher demo: https://explosion.ai/demos/matcher .

In the example below we work with one example from the spaCy documentation in which we extract a phone number from a text. This same task can be done via RegEx as well.

In [8]:
#Sample text
text = "This is a sample number (555) 555-5555."

#Start from a blank English pipeline (no trained components are needed here)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
            ]
#add patterns to ruler
ruler.add_patterns(patterns)



#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

(555) 555-5555 PHONE_NUMBER
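For comparison, here is a sketch of the plain RegEx route mentioned above, using only Python's built-in `re` module. The expression is one possible formulation written for the four US forms listed earlier; it is an assumption for illustration, is not taken from the spaCy documentation, and has not been tuned for other formats.

```python
import re

# Optional +1 prefix, an area code with or without parentheses,
# then 3 digits and 4 digits with optional hyphen or space separators
phone_pattern = re.compile(
    r"(?:\+1\s?)?(?:\(\d{3}\)|\d{3})[-\s]?\d{3}[-\s]?\d{4}"
)

samples = [
    "This is a sample number (555) 555-5555.",
    "Call 555-555-5555 today.",
    "Call 5555555555 today.",
    "From abroad, dial +1(555)-555-5555.",
]

# Collect the first match found in each sample sentence
matches = [phone_pattern.search(sample).group(0) for sample in samples]
for match in matches:
    print(match)
```

The RegEx works on the raw string, whereas the EntityRuler pattern works on tokens and stores its result in `doc.ents`, which is what lets it sit alongside the other pipes.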


# Part One: The Problem

Imagine you are a librarian or archivist at an R1 Institution. A researcher at the university has asked if it would be possible to grab all the letters written by Abigail Adams in the Founders Online database.

The researcher would also like for you to keep the metadata intact so that the researcher can understand to whom Abigail Adams wrote and when. The researcher wants to also know whom Abigail Adams references within these letters. The purpose of this project is to understand the epistolary network of Abigail Adams. All of this will be used to understand more broadly Abigail Adams' social network.

Finally, the researcher has some gender-based questions about the data and wants to be able to identify and extract specific gendered words (which they have provided in data/gen_ref_ files).

Your job is to create a heuristic, or rules-based, pipeline with spaCy to solve this problem.

This is a real-world problem that requires several different Python skills to solve programmatically. Since this tutorial is designed around NLP, we will be focusing on the NLP portions of this workflow. Nevertheless, I will detail in Part Two how to gather the requisite data. I will not, however, explain the code in depth, as it involves web scraping beyond the scope of this tutorial.

# Part Two: Gathering the Data

In [50]:
import pandas as pd
import ast
import requests
from bs4 import BeautifulSoup
import json
from spacy import displacy
import re

## Analyze Founders Online Data

In [3]:
df = pd.read_json("../data/founders-online-metadata.json")

In [4]:
df

Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to
0,November 18th. 1755.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1755-11-18,1755-11-18
1,[November 1755],https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1755-11-18,1755-11-18
2,January the 14th. 1756.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-14,1756-01-14
3,15.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-15,1756-01-15
4,16 Fryday.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-16,1756-01-16
...,...,...,...,...,...,...,...
185311,"To Thomas Jefferson from John Barnes, 3 March ...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Barnes, John]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185312,"To Thomas Jefferson from John Benson, 3 March ...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Benson, John]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185313,"To Thomas Jefferson from William Matthews, 3 M...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Matthews, William]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185314,"To Thomas Jefferson from Thomas Moore, 3 March...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Moore, Thomas]","[Jefferson, Thomas]",1809-03-03,1809-03-03


In [5]:
# So we do not have to use Pandas syntax in this tutorial
links = df.permalink.tolist()
authors =  df.authors.tolist()
recipients = df.recipients.tolist()
date_from = df["date-from"].tolist()
date_to = df["date-to"].tolist()
print (len(authors))

185316


In [7]:
final_data = []
for l, a, r, dfrom, dto in zip(links, authors, recipients, date_from, date_to):
    if "Adams, Abigail" in a:
        final_data.append((l, a, r, dfrom, dto))
print (len(final_data))

1238


In [6]:
#Only execute this cell if you want to make 1238 calls to the Founders Online server.
def scrape_founders():
    texts = []
    for link, author, recipient, date_from, date_to in final_data:
        #must have a recipient, i.e. a letter
        if recipient != []:

            #calls the server
            s = requests.get(link)

            #convert the call request into a soup object to parse HTML
            soup = BeautifulSoup(s.content, "html.parser")

            #grabs the text
            text = soup.find("div", {"class": "innerdiv docbody"})

            #skip pages where the expected container is missing
            if text is None:
                continue

            #removes the footnotes from the text
            for i in text.find_all("a"):
                if 'class' in i.attrs:
                    if "ptr" in i.attrs['class']:
                        i.decompose()
            #get some clean text from the p tags            
            text = [p.text.strip() for p in text.find_all("p")]

            #bring the text together
            text = "\n".join(text)
            data = {"link": link, "author": author, "recipient": recipient, "date_from": date_from, "date_to": date_to, "text": text}
            texts.append(data)
    print (len(texts))
    with open ("../data/adams_abigail_letters.json", "w") as f:
        json.dump(texts, f, indent=4)
    return texts
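The cleaning steps inside `scrape_founders()` can be tried offline without calling the server. The sketch below runs the same footnote removal on a small, made-up HTML snippet; the snippet is an assumption about the page layout based only on the class names used in the function above, not a copy of a real Founders Online page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure the scraper expects
html = (
    '<div class="innerdiv docbody">'
    '<p>Dear Cousin<a class="ptr" href="#n1">1</a></p>'
    "<p> Tis no small pleasure. </p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", {"class": "innerdiv docbody"})

# Remove footnote markers, i.e. <a> tags that carry the "ptr" class
for anchor in body.find_all("a", class_="ptr"):
    anchor.decompose()

# Collect the cleaned paragraph text and join it with newlines
paragraphs = [p.get_text().strip() for p in body.find_all("p")]
text = "\n".join(paragraphs)
print(text)
```

Testing on a small snippet like this is a reasonable way to check the parsing logic before making over a thousand requests.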

# Part Three: NER Pipeline

## Load in our Data

In [8]:
with open ("../data/adams_abigail_letters.json", "r") as f:
    texts = json.load(f)

In [9]:
texts[0]

{'link': 'https://founders.archives.gov/documents/Adams/04-01-02-0005',
 'author': ['Adams, Abigail'],
 'recipient': ['Smith, Isaac Jr.'],
 'date_from': '1763-03-16',
 'date_to': '1763-03-16',
 'text': 'Weymouth March 16 1763\nDear Cousin\nTis no small pleasure to me, to hear of the great proficioncy you have made in the French tongue, A Tongue Sweet, and harmonious, a Tongue, useful to Merchants, to Statesmen; to Divines, and especially to Lawyers and Travellers; who by the help of it, may traverse the whole Globe; for in this respect, the French language is pretty much now, what I have heard the Latin formerly was, a universal tongue.\nBy the favor of my Father I have had the pleasure of seeing your Copy of Mrs. Wheelwrights Letter, to her Nephew, and having some small acquaintance with the French tongue, have attempted a translation; of it, which I here send, for your perusal and correction.\nI am sensible that I am but ill qualified for such an undertaking, it being a maxim with me

First, we want to test out an existing model to see how it is performing.

## Using the Researcher's Lists

In [67]:
def make_patterns(file, label, lower_case=False):
    temp_patterns = []
    with open (file, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
    for item in data:
        if not lower_case:
            temp_patterns.append({"pattern": item.strip(), "label": label})
        else:
            temp_patterns.append({"pattern": [{"lemma": item.lower()}], "label": label})
    return temp_patterns

In [68]:
neuter_patterns = make_patterns("../data/gen_ref_neuter.txt", "REF_NEUTER", lower_case=True)
print (len(neuter_patterns))
male_patterns = make_patterns("../data/gen_ref_male.txt", "REF_MALE", lower_case=True)
print (len(male_patterns))
female_patterns = make_patterns("../data/gen_ref_female.txt", "REF_FEMALE", lower_case=True)
print (len(female_patterns))

20
53
23


In [69]:
neuter_patterns[:10]

[{'pattern': [{'lemma': 'people'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'person'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'patriots'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'enemies'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'servant'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'friend'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'sex'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'folk'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'cousin'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'family'}], 'label': 'REF_NEUTER'}]
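To make the two pattern shapes concrete, the sketch below repeats `make_patterns()` and runs it on a small temporary file. The file name and its two entries are hypothetical stand-ins for the researcher's lists.

```python
import os
import tempfile

def make_patterns(file, label, lower_case=False):
    temp_patterns = []
    with open(file, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
    for item in data:
        if not lower_case:
            # phrase pattern: matches the exact string
            temp_patterns.append({"pattern": item.strip(), "label": label})
        else:
            # token pattern: matches any token whose lemma equals the entry
            temp_patterns.append({"pattern": [{"lemma": item.lower()}], "label": label})
    return temp_patterns

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hypothetical_terms.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Friend\nCousin\n")
    phrase_patterns = make_patterns(path, "REF_NEUTER")
    token_patterns = make_patterns(path, "REF_NEUTER", lower_case=True)

print(phrase_patterns)
print(token_patterns)
```

Because the token patterns compare against each token's lemma, entries that are themselves inflected forms (for instance the plural 'patriots' in the output above) will probably only match if the lemmatizer assigns that exact string as a lemma. It may be worth checking such entries against the model's output.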

In [75]:
#load a model and disable the NER component
nlp_gen = spacy.load("en_core_web_sm", disable=['ner'])
ruler = nlp_gen.add_pipe("entity_ruler")
ruler.add_patterns(neuter_patterns+male_patterns+female_patterns)

In [77]:
doc_gen = nlp_gen(texts[0]['text'])
displacy.render(doc_gen, style='ent')

## Finding Proper Nouns that are People

In [78]:
nlp_sm = spacy.load("en_core_web_sm")
doc_sm = nlp_sm(texts[0]['text'])

In [79]:
displacy.render(doc_sm, style="ent")

In [22]:
def model_people(texts):
    potential_people = []
    for text in texts[:10]:
        doc = nlp_sm(text['text'])
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                if ent.text not in potential_people:
                    potential_people.append(ent.text)
    return potential_people
potential_people = model_people(texts)
potential_people.sort()
len(potential_people)

35

In [23]:
potential_people

['A. Smith',
 'Abll',
 'Adams',
 'Adams returnd',
 'Ayers',
 'Ayres',
 'Bed',
 'Betsy',
 'Conclude',
 'Conscience',
 'Cranch',
 'Daughter Betsy',
 'Diana',
 'Eyers',
 'Fabrick',
 'Girl',
 'Girls Letter',
 'Humble Servant',
 'Lisps',
 'Lysander',
 'Mamma',
 'Marcia',
 'Nabby Smith',
 'Nights',
 'Perkins',
 'Small',
 'Sol',
 'Squemish',
 'Thoughts',
 'Tis Bed time',
 'Tom',
 'Vomit',
 'Wheelwrights Letter',
 'eaquil',
 'mine.)—Exit Rattle']

## Finding Noun Chunks

Now that we have a good sense of the strengths and weaknesses of an existing pipeline, we can start to create our own custom pipeline tailored to the data at hand.

In [81]:
chunks = []
for text in texts[:10]:
    doc = nlp_sm(text["text"])
    for chunk in doc.noun_chunks:
        parts = chunk.text.split()
        #keep multi-word chunks in which every word is capitalized
        if chunk.text[0].isupper() and len(parts) > 1 and all(part[0].isupper() for part in parts):
            if chunk.text not in chunks and "Your" not in chunk.text:
                chunks.append(chunk.text)
chunks.sort()
len(chunks)

26

In [82]:
chunks[:10]

['A Tongue Sweet',
 'Deacon Palmers Children',
 'Dear Cousin',
 'Dear Unkle',
 'Doctor Perkins',
 'Dr. Perkins',
 'Humane Nature',
 'Lord M',
 'Mr. Adams',
 'Mr. Ayers']
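The capitalization heuristic used in the cell above can be isolated as a small function and tested on plain strings, without loading a model. The name `is_capitalized_phrase` and its `excluded` parameter are hypothetical; the sample strings come from the output above and from the first letter.

```python
def is_capitalized_phrase(text, excluded=("Your",)):
    """Return True for multi-word phrases in which every word is capitalized."""
    parts = text.split()
    if len(parts) < 2:
        return False
    # drop phrases that contain an excluded word such as "Your"
    if any(word in text for word in excluded):
        return False
    return all(part[0].isupper() for part in parts)

samples = ["Doctor Perkins", "Dear Cousin", "the French tongue", "Perkins", "Your Letter"]
results = {sample: is_capitalized_phrase(sample) for sample in samples}
print(results)
```

Note that "Dear Cousin" passes the test even though it is not a name, which is one reason the chunk list still needs manual review.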

## Making a Gazetteer

In [26]:
texts[0]

{'link': 'https://founders.archives.gov/documents/Adams/04-01-02-0005',
 'author': ['Adams, Abigail'],
 'recipient': ['Smith, Isaac Jr.'],
 'date_from': '1763-03-16',
 'date_to': '1763-03-16',
 'text': 'Weymouth March 16 1763\nDear Cousin\nTis no small pleasure to me, to hear of the great proficioncy you have made in the French tongue, A Tongue Sweet, and harmonious, a Tongue, useful to Merchants, to Statesmen; to Divines, and especially to Lawyers and Travellers; who by the help of it, may traverse the whole Globe; for in this respect, the French language is pretty much now, what I have heard the Latin formerly was, a universal tongue.\nBy the favor of my Father I have had the pleasure of seeing your Copy of Mrs. Wheelwrights Letter, to her Nephew, and having some small acquaintance with the French tongue, have attempted a translation; of it, which I here send, for your perusal and correction.\nI am sensible that I am but ill qualified for such an undertaking, it being a maxim with me

In [30]:
names = []
for text in texts:
    for author in text['author']:
        if author not in names:
            names.append(author)
    for recipient in text['recipient']:
        if recipient not in names:
            names.append(recipient)
len(names)

94

In [32]:
names[:10]

['Adams, Abigail',
 'Smith, Isaac Jr.',
 'Adams, John',
 'Tufts, Cotton',
 'Green, Hannah Storer',
 'Cranch, Mary Smith',
 'Warren, Mercy Otis',
 'Tudor, William',
 'Macaulay, Catharine Sawbridge',
 'Warren, Joseph']

In [64]:
reconst_names = []
for name in names:
    if ", " in name:
        name = name.replace(" Jr.", "").replace(" Sr.", "")
        name = re.sub(r"\([^()]*\)", "", name)
        lastname, other = name.split(", ")
        final_name = f"{other} {lastname}"
        reconst_names.append(final_name)
        if lastname not in reconst_names:
            reconst_names.append(lastname)
        parts = other.split()
        for p in parts:
            if p not in reconst_names:
                if "." not in p:
                    reconst_names.append(p)
            
reconst_names.sort()
len(reconst_names)

218

In [66]:
reconst_names[:10]

['Abigail',
 'Abigail  Adams',
 'Abigail Adams',
 'Abigail Adams Smith',
 'Abigail Bromfield Rogers',
 'Adams',
 'Alice',
 'Alice Lee Shippen',
 'Anna',
 'Anna Greenleaf Cranch']
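The list above contains 'Abigail  Adams' with two spaces, which suggests that stripping a suffix or a parenthetical can leave stray whitespace behind. The sketch below shows one possible way to normalize a single name; `reorder_name` is a hypothetical helper, and the third sample name is invented for illustration.

```python
import re

def reorder_name(name):
    """Turn 'Last, First' into 'First Last', collapsing stray whitespace."""
    if ", " not in name:
        return None
    name = name.replace(" Jr.", "").replace(" Sr.", "")
    # remove parenthetical notes
    name = re.sub(r"\([^()]*\)", "", name)
    # collapse runs of whitespace left behind by the removals
    name = " ".join(name.split())
    lastname, other = name.split(", ", 1)
    return f"{other} {lastname}"

samples = ["Smith, Isaac Jr.", "Cranch, Mary Smith", "Doe, Jane (hypothetical) Quincy"]
reordered = [reorder_name(sample) for sample in samples]
print(reordered)
```

Whether this normalization is desirable depends on the letters themselves, since the patterns only match strings that appear in the text exactly as written.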

In [89]:
name_patterns = []
for name in reconst_names:
    name_patterns.append({"pattern": name, "label": "PERSON-RULER"})

# Part Four: Bringing the Pipeline Together

In [86]:
neuter_patterns = make_patterns("../data/gen_ref_neuter.txt", "REF_NEUTER", lower_case=True)
male_patterns = make_patterns("../data/gen_ref_male.txt", "REF_MALE", lower_case=True)
female_patterns = make_patterns("../data/gen_ref_female.txt", "REF_FEMALE", lower_case=True)
gender_patterns = neuter_patterns+male_patterns+female_patterns
gender_patterns[0]

{'pattern': [{'lemma': 'people'}], 'label': 'REF_NEUTER'}

In [94]:
main_nlp = spacy.load("en_core_web_sm")
gender_ruler = main_nlp.add_pipe("entity_ruler", name="gender_ruler", before="ner")
gender_ruler.add_patterns(gender_patterns)

name_ruler = main_nlp.add_pipe("entity_ruler", name="name_ruler", before="ner")
name_ruler.add_patterns(name_patterns)

In [102]:
doc = main_nlp(texts[0]['text'])
displacy.render(doc, style="ent")

### Finding Names with Salutations

In [100]:
import spacy
from spacy.util import filter_spans
from spacy.tokens import Span
from spacy.language import Language
import re

In [106]:
with open ("../data/salutations.txt", "r") as f:
    sals = f.read().splitlines()
#join the salutations into a single RegEx alternation (entries separated by "|")
sals = "|".join(sals)
salutation_names_pattern = f"({sals}"+r")(\.)* [A-Z]\w+"

@Language.component("salutation_person")
def salutation_person(doc):
    person_ents = []
    original_ents = list(doc.ents)
    for match in re.finditer(salutation_names_pattern, doc.text):
        start, end = match.span()
        #char_span returns None when the match does not align with token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            person_ents.append((span.start, span.end, span.text))
    for start, end, name in person_ents:
        per_ent = Span(doc, start, end, label="PERSON-SALUTATION")
        original_ents.append(per_ent)
    #filter_spans removes overlapping spans, preferring the longest
    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return doc

main_nlp = spacy.load("en_core_web_sm")
gender_ruler = main_nlp.add_pipe("entity_ruler", name="gender_ruler", before="ner")
gender_ruler.add_patterns(gender_patterns)

name_ruler = main_nlp.add_pipe("entity_ruler", name="name_ruler", before="ner")
name_ruler.add_patterns(name_patterns)
main_nlp.add_pipe("salutation_person", before="gender_ruler")

<function __main__.salutation_person(doc)>

In [107]:
doc = main_nlp(texts[0]['text'])
displacy.render(doc, style="ent")
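The logic of the `salutation_person` component can be traced on a tiny, hand-built Doc, which avoids loading a model or the data files. In this sketch the salutation list is a hypothetical stand-in for `../data/salutations.txt`, the sentence is invented, and one entity is set by hand to imitate a label assigned by an earlier pipe.

```python
import re

from spacy.tokens import Doc, Span
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Hypothetical salutations; the real list is read from a file in the cell above
sals = "|".join(["Mr", "Mrs", "Dr"])
pattern = f"({sals}" + r")(\.)* [A-Z]\w+"

# Hand-built Doc; each word is followed by a single space by default
doc = Doc(Vocab(), words=["Dear", "Mr.", "Adams", "and", "Dr.", "Perkins"])

# Imitate an earlier pipe that labeled only the surname
doc.ents = [Span(doc, 2, 3, label="PERSON-RULER")]

# Character offsets that cut a token in half do not yield a span
misaligned = doc.char_span(5, 12)

# Convert each RegEx match from character offsets to a token span
spans = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end, label="PERSON-SALUTATION")
    if span is not None:
        spans.append(span)

# Resolve overlaps, preferring the longest span, then assign the entities
doc.ents = filter_spans(spans)
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

Here the longer "Mr. Adams" span replaces the shorter "Adams" span because `filter_spans` prefers the longest span when candidates overlap.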