<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `NLP with spaCy` `3`

This is lesson `3` of 3 in the educational series on `Natural Language Processing (NLP)`. This notebook is intended `to teach the basics of NLP and the spaCy library`.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial`

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`

`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`

`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand how to solve an NLP problem, specifically information extraction
2. Understand how to find data
3. Understand how to structure data
4. Understand how to develop a spaCy Pipeline
```
**Research Pipeline:**
```
N/A
```
___

# Required Python Libraries
* [spaCy](https://spacy.io/) for performing [Natural Language Processing (NLP)](https://docs.constellate.org/key-terms/#nlp).

## Install Required Libraries

In [1]:
### Install Libraries ###

# Using !pip installs
!pip install spacy
!python -m spacy download en_core_web_sm

In [1]:
### Import Libraries ###
import spacy

# Introduction

```
Introduce the lesson topic. Answer questions such as:
* Why is it useful? 
* Why should we learn it? 
* Who might use it? 
* Where has it been used by scholars/industry?
* What do we need to do it?
* What subjects are included in the notebooks?
* What is not in this notebook? Where should we look for it?
```

# Part One: The Problem

Imagine you are a librarian or archivist at an R1 Institution. A researcher at the university has asked if it would be possible to grab all the letters written by Abigail Adams in the Founders Online database.

The researcher would also like for you to keep the metadata intact so that the researcher can understand to whom Abigail Adams wrote and when. The researcher wants to also know whom Abigail Adams references within these letters. The purpose of this project is to understand the epistolary network of Abigail Adams. All of this will be used to understand more broadly Abigail Adams' social network.

Finally, the researcher has some gender-based questions about the data and wants to be able to identify and extract specific gendered words (which they have provided in data/gen_ref_ files).

Your job is to create a heuristic, or rules-based, pipeline with spaCy to solve this problem.

This is a real-world problem that requires several different Python skills to solve programmatically. Since this tutorial is designed around NLP, we will focus on the NLP portions of this workflow. Nevertheless, in Part Two I will show how to gather the requisite data. I will not, however, explain that code in depth, as it involves web scraping, which is beyond the scope of this tutorial.

# Part Two: Gathering the Data

In [13]:
import pandas as pd
import ast
import requests
from bs4 import BeautifulSoup
import json
from spacy import displacy

## Analyze Founders Online Data

In [3]:
df = pd.read_json("../data/founders-online-metadata.json")

In [4]:
df

Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to
0,November 18th. 1755.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1755-11-18,1755-11-18
1,[November 1755],https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1755-11-18,1755-11-18
2,January the 14th. 1756.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-14,1756-01-14
3,15.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-15,1756-01-15
4,16 Fryday.,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[],1756-01-16,1756-01-16
...,...,...,...,...,...,...,...
185311,"To Thomas Jefferson from John Barnes, 3 March ...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Barnes, John]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185312,"To Thomas Jefferson from John Benson, 3 March ...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Benson, John]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185313,"To Thomas Jefferson from William Matthews, 3 M...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Matthews, William]","[Jefferson, Thomas]",1809-03-03,1809-03-03
185314,"To Thomas Jefferson from Thomas Moore, 3 March...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Moore, Thomas]","[Jefferson, Thomas]",1809-03-03,1809-03-03


In [5]:
# So we do not have to use Pandas syntax in this tutorial
links = df.permalink.tolist()
authors =  df.authors.tolist()
recipients = df.recipients.tolist()
date_from = df["date-from"].tolist()
date_to = df["date-to"].tolist()
print (len(authors))

185316


In [7]:
final_data = []
for l, a, r, dfrom, dto in zip(links, authors, recipients, date_from, date_to):
    if "Adams, Abigail" in a:
        final_data.append((l, a, r, dfrom, dto))
print (len(final_data))

1238
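The filter above can also be written as a list comprehension. The sketch below uses a few made-up records (the `records` list and its example.org links are assumptions for illustration, not real Founders Online rows) to show that `in` checks for an exact match inside each author list.

```python
# Hypothetical records in the same (link, authors, recipients) shape used above.
records = [
    ("https://example.org/1", ["Adams, John"], []),
    ("https://example.org/2", ["Adams, Abigail"], ["Adams, John"]),
    ("https://example.org/3", ["Adams, Abigail", "Adams, John"], ["Jefferson, Thomas"]),
]

# Keep only the records whose author list contains the exact name string.
abigail_records = [rec for rec in records if "Adams, Abigail" in rec[1]]
print(len(abigail_records))
```

Because the test runs against a list of names, a co-authored document is kept as long as one of its authors matches exactly.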


In [6]:
#Only execute this cell if you want to make 1238 calls to the Founders Online server.
def scrape_founders():
    texts = []
    for link, author, recipient, date_from, date_to in final_data:
        #must have a recipient, i.e. a letter
        if recipient != []:

            #calls the server
            s = requests.get(link)

            #convert the call request into a soup object to parse HTML

            soup = BeautifulSoup(s.content, "html.parser")

            #grabs the text
            text = soup.find("div", {"class": "innerdiv docbody"})

            #removes the footnotes from the text
            for i in text.find_all("a"):
                if 'class' in i.attrs:
                    if "ptr" in i.attrs['class']:
                        i.decompose()
            #get some clean text from the p tags            
            text = [p.text.strip() for p in text.find_all("p")]

            #bring the text together
            text = "\n".join(text)
            data = {"link": link, "author": author, "recipient": recipient, "date_from": date_from, "date_to": date_to, "text": text}
            texts.append(data)
    print (len(texts))
    with open ("../data/adams_abigail_letters.json", "w") as f:
        json.dump(texts, f, indent=4)
    return texts
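Since the scraping cell makes over a thousand requests, it can help to guard it so the requests only happen when no saved file exists yet. The helper below is a minimal sketch of that idea; `load_or_build` is a hypothetical name of my own and not part of the original notebook.

```python
import json
from pathlib import Path


def load_or_build(path, build_fn):
    """Return the JSON saved at path if it exists; otherwise call build_fn and save its result."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    # No cached copy yet: do the expensive work once and store it.
    data = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data
```

With this in place, something like `load_or_build("../data/adams_abigail_letters.json", scrape_founders)` would only contact the server when the file is missing. Note that `scrape_founders` already writes the same file itself, so the second write here would be redundant in that case.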

# Part Three: NER Pipeline

## Load in our Data

In [8]:
with open ("../data/adams_abigail_letters.json", "r") as f:
    texts = json.load(f)

In [9]:
texts[0]

{'link': 'https://founders.archives.gov/documents/Adams/04-01-02-0005',
 'author': ['Adams, Abigail'],
 'recipient': ['Smith, Isaac Jr.'],
 'date_from': '1763-03-16',
 'date_to': '1763-03-16',
 'text': 'Weymouth March 16 1763\nDear Cousin\nTis no small pleasure to me, to hear of the great proficioncy you have made in the French tongue, A Tongue Sweet, and harmonious, a Tongue, useful to Merchants, to Statesmen; to Divines, and especially to Lawyers and Travellers; who by the help of it, may traverse the whole Globe; for in this respect, the French language is pretty much now, what I have heard the Latin formerly was, a universal tongue.\nBy the favor of my Father I have had the pleasure of seeing your Copy of Mrs. Wheelwrights Letter, to her Nephew, and having some small acquaintance with the French tongue, have attempted a translation; of it, which I here send, for your perusal and correction.\nI am sensible that I am but ill qualified for such an undertaking, it being a maxim with me

First, we want to test an existing model to see how it performs.

## Using the Researcher's Lists

In [316]:
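def make_patterns(file, label, lower_case=False):
    # Read one term per line and turn each term into an EntityRuler pattern.
    temp_patterns = []
    with open(file, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
    for item in data:
        if not lower_case:
            # phrase pattern: match the exact string
            temp_patterns.append({"pattern": item.strip(), "label": label})
        else:
            # token pattern: match any token whose lemma equals the lowercased term
            temp_patterns.append({"pattern": [{"lemma": item.lower()}], "label": label})
    return temp_patterns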

In [317]:
neuter_patterns = make_patterns("../data/gen_ref_neuter.txt", "REF_NEUTER", lower_case=True)
print (len(neuter_patterns))
male_patterns = make_patterns("../data/gen_ref_male.txt", "REF_MALE", lower_case=True)
print (len(male_patterns))
female_patterns = make_patterns("../data/gen_ref_female.txt", "REF_FEMALE", lower_case=True)
print (len(female_patterns))

20
53
23


In [318]:
neuter_patterns[:10]

[{'pattern': [{'lemma': 'people'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'person'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'patriots'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'enemies'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'servant'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'friend'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'sex'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'folk'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'cousin'}], 'label': 'REF_NEUTER'},
 {'pattern': [{'lemma': 'family'}], 'label': 'REF_NEUTER'}]
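To see how pattern dictionaries like these behave once they reach an EntityRuler, here is a small sketch on a blank pipeline. It is not the notebook's pipeline: the `_demo` names are my own, and it uses the `LOWER` token attribute instead of `lemma` so that it runs without a lemmatizer or a downloaded model.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical model to download.
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    # Phrase pattern: matches this exact string, case-sensitively.
    {"pattern": "Sons of Liberty", "label": "GOVERNMENT"},
    # Token pattern: matches one token whose lowercase form is "friend".
    {"pattern": [{"LOWER": "friend"}], "label": "REF_NEUTER"},
])

doc_demo = nlp_demo("My Friend wrote to the Sons of Liberty.")
found = [(ent.text, ent.label_) for ent in doc_demo.ents]
print(found)
```

One thing worth checking in the real lists: a plural entry such as 'patriots' may never match as a lemma, because a lemmatizer would typically reduce the token to 'patriot' before the comparison.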

## Finding Proper Nouns that are People

In [11]:
potential_people = []
nlp = spacy.load("en_core_web_sm")
doc = nlp(texts[0]['text'])

In [14]:
displacy.render(doc, style="ent")

In [20]:
def model_people(texts):
    potential_people = []
    for text in texts[:10]:
        doc = nlp(text['text'].lower())
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                if ent.text not in potential_people:
                    potential_people.append(ent.text)
    return potential_people
potential_people = model_people(texts)
potential_people.sort()
len(potential_people)

18

In [21]:
potential_people

['a. smith',
 'a. smith\np s',
 'abll',
 'adams',
 'adams returnd',
 'ayers',
 'ayres',
 'betsy',
 'cranch',
 'diana',
 'eaquil',
 'eyers',
 'lord',
 'nabby smith\n',
 'tom',
 'tom hope',
 'tom shall',
 'wheelwrights letter']
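Several of these candidates carry stray line breaks or duplicate one another ('a. smith' and 'a. smith\np s', for example). A minimal cleanup sketch, assuming we only want the first line of each candidate, could look like this; `clean_candidates` is a hypothetical helper, not part of the notebook.

```python
def clean_candidates(candidates):
    """Trim whitespace, drop anything after a line break, and remove duplicates."""
    cleaned = set()
    for name in candidates:
        first_line = name.strip().split("\n")[0].strip()
        if first_line:
            cleaned.add(first_line)
    return sorted(cleaned)


sample = ["a. smith", "a. smith\np s", "nabby smith\n", "tom"]
print(clean_candidates(sample))
```

This only tidies the strings. It does nothing about false positives such as 'tom shall' or 'wheelwrights letter', which is part of why a custom pipeline is worth building.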

## Finding Noun Chunks

Now that we have a good sense of the strengths and weaknesses of an existing pipeline, we can start to create our own custom pipeline tailored to the data at hand.

In [172]:
chunks = []
nlp = spacy.load("en_core_web_sm")
for i, text in enumerate(texts):
    doc = nlp(text["text"])
    for chunk in doc.noun_chunks:
        if chunk.text[0].isupper() and len(chunk.text.split()) > 1:
            hit = True
            for part in chunk.text.split():
                if part[0].isupper():
                    pass
                else:
                    hit=False
            if hit == True:
                if chunk.text not in chunks and "Your" not in chunk.text:
                    chunks.append(chunk.text)
chunks.sort()

In [174]:
len(chunks)

3461
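The nested loop above is testing one thing: does the chunk contain at least two words, all of them capitalized? As a sketch, that test can be pulled into a small function using `all()`. The name `is_capitalized_phrase` is my own, and the function is intended to be roughly equivalent to the loop rather than an exact replacement.

```python
def is_capitalized_phrase(chunk_text):
    """True if the chunk has at least two words and every word starts with an uppercase letter."""
    parts = chunk_text.split()
    return len(parts) > 1 and all(part[0].isupper() for part in parts)


for example in ["Mrs. Wheelwrights Letter", "the French tongue", "Adams"]:
    print(example, is_capitalized_phrase(example))
```

Keeping the rule in one named function makes it easier to adjust later, for instance to allow lowercase particles such as "de" or "of" inside a name.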

In [176]:
with open ("../data/noun_chunks.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(chunks))

## Making a Gazetteer

In [8]:
authors[0]

['Adams, John']

In [9]:
recipients[0]

[]

In [179]:
names = []
for a, r in zip(authors, recipients):
    for author in a:
        if author not in names:
            names.append(author)
    for recipient in r:
        if recipient not in names:
            names.append(recipient)
len(names)

19335
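Checking `if author not in names` against a growing list gets slower as the list grows. A set does the same de-duplication with much cheaper lookups. The sketch below uses made-up `authors_demo` and `recipients_demo` lists (assumptions for illustration) in the same list-of-lists shape as the real columns.

```python
def unique_names(*name_lists):
    """Flatten lists of name lists and return the unique names, sorted."""
    seen = set()
    for name_list in name_lists:
        for names_in_doc in name_list:
            seen.update(names_in_doc)
    return sorted(seen)


authors_demo = [["Adams, John"], ["Adams, Abigail"], ["Adams, John"]]
recipients_demo = [[], ["Adams, John"], ["Jefferson, Thomas"]]
print(unique_names(authors_demo, recipients_demo))
```

The result is already sorted, so a separate `names.sort()` step would not be needed with this approach.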

In [230]:
names.sort()

In [139]:
names[:20]

['',
 'Abadie, —— d’',
 'Abbema, Balthasar Elias',
 'Abbot, David',
 'Abbott, James',
 'Abbott, Thomas Jefferson',
 'Abbés Arnoux and Chalut',
 'Abell, Jesse',
 'Abercrombie, James',
 'Abercromby, John',
 'Abernethie, John',
 'Abishai, Thomas',
 'Aborn, Henry',
 'Aborn, Thomas',
 'Abraham Runnels & Son',
 'Abraham, Francis',
 'AcModery, John Jay',
 'Academy of Natural Sciences of Philadelphia',
 'Acarq, Jean-Pierre d’',
 'Acary de La Rivière, Henry-Dominique d’']

In [231]:
with open ('../data/fo_names.txt', 'w', encoding="utf-8") as f:
    f.write("\n".join(names))

In [178]:
with open ('../data/fo_names.txt', 'r', encoding="utf-8") as f:
    names = f.read().splitlines()

In [319]:
with open("../data/government_indicators.txt", "r") as f:
    gov_indicators = f.read().splitlines()
    print (gov_indicators)
with open("../data/places.txt", "r") as f:
    places = f.read().splitlines()
patterns = []
potential = []
for name in names:
    name = name.replace("“", "\"").replace("”", "\"").replace("’", "\'")
    if "(business)" in name or " firm)" in name:
        patterns.append({"pattern": name.split("(")[0].strip(), "label": "ORG"})
    elif ", town of" in name:
        patterns.append({"pattern": name.split(", ")[0].strip(), "label": "TOWN"})
    elif "(newspaper)" in name:
        patterns.append({"pattern": name.split("(")[0].strip(), "label": "NEWSPAPER"})
    elif "Pseudonym" in name:
        name = name.split("Pseudonym:")[1].strip()
        patterns.append({"pattern": name, "label": "PSEUDONYM"})
    elif "(pseudonym)" in name:
        patterns.append({"pattern": name.split("(")[0].strip(), "label": "PSEUDONYM"})
    elif "Church" in name:
        patterns.append({"pattern": name.strip(), "label": "RELIGIOUS_ORG"})
    elif "Indians" in name:
        name = name.split(" Indians")[0]
        name = name.replace(", and ", ", ").replace(" and ", ", ")
        parts = name.split(",")
        for p in parts:
            p = p.strip()
            # add each tribe name separately, skipping the bare qualifier "Southern"
            if p and p != "Southern":
                patterns.append({"pattern": p, "label": "NATIVE_AMERICAN"})
    elif "(of" in name:
        patterns.append({"pattern": name.split("(of")[0], "label": "PERSON"})
    elif "\"" in name or "——" in name:
        pass
    elif any(gov in name for gov in gov_indicators):
        patterns.append({"pattern": name.strip(), "label": "GOVERNMENT"})
    elif any(place in name for place in places):
        if "Boston" in name:
            print (name)
        patterns.append({"pattern": name.strip(), "label": "LOC"})
    else:
        if name.count(", ") == 1:
            surname, first = name.split(", ", 1)
            name = f"{first} {surname}"
            name = name.strip()
            name = name.replace("d' ", "d'")
            if name[:2] != "A " and "?" not in name and len(name) < 50 and "Representatives" not in name:
                patterns.append({"pattern": name, "label": "PERSON"})
                if len(name.split()) < 4:
                    for part in name.split():
                        if part[0].isupper() and len(part) > 2 and part.isalpha():
                            if part not in potential:
                                potential.append(part)
                                
                            
print (len(patterns))
print (len(potential))
potential.sort()

['Representatives', 'Senate', 'Health', 'General', 'Officials', 'Congressional', 'Delegation', 'Superintendent', 'Justices', 'Court', 'Virginia', 'County', 'State', 'Prince', 'Committee', 'Commission', 'Citizens', 'Post', 'Academy', 'Sons of Liberty', 'Patriot', 'Society', 'Selectmen', 'Meeting', 'Young Men', 'Assessors', 'Collectors', 'Navy', 'Army']
Boston
16675
7558


In [320]:
patterns = patterns+neuter_patterns+female_patterns+male_patterns
len(patterns)

16771

### Single Names

In [70]:
with open("../data/potential_names.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(potential))

In [321]:
with open("../data/potential_names_manual.txt", "r", encoding="utf-8") as f:
    potential_names = f.read().splitlines()
    for p in potential_names:
        patterns.append({"pattern": p, "label": "PERSON"})

### Native American Tribes

In [103]:
s = requests.get("https://www.legendsofamerica.com/na-tribelist/")
soup = BeautifulSoup(s.content, "html.parser")
content = soup.find("div", {"class": "entry-content"})

In [215]:
tribes = []
for div in content.find_all("div")[1:]:
    tribal_names = div.text.split("\n")
    for t in tribal_names:
        t = t.strip()
        if len(t) > 1:
            if "/" in t:
                for i in t.split("/"):
                    tribes.append(i)
            else:        
                tribes.append(t)
print (len(tribes))
tribes.sort()

828


In [216]:
with open ("../data/native_american_tribes.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tribes))

In [322]:
with open ("../data/native_american_tribes_manual.txt", "r", encoding="utf-8") as f:
    tribes = f.read().splitlines()
    for t in tribes:
        patterns.append({"pattern": t, "label": "NATIVE_AMERICAN"})

In [222]:
tribes[:10]

['Ababco',
 'Abenaki',
 'Aberginian',
 'Abihka',
 'Abittibi',
 'Abnakii',
 'Absaroka',
 'Absaroka',
 'Absentee Shawnee',
 'Accohanoc']
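The list still contains repeats ('Absaroka' appears twice above). Duplicate patterns are harmless but add clutter, so here is a minimal sketch for removing them while keeping the original order; `dedupe_keep_order` is a hypothetical helper of my own.

```python
def dedupe_keep_order(items):
    """Remove blank and repeated entries while keeping the first occurrence of each."""
    return list(dict.fromkeys(item.strip() for item in items if item.strip()))


print(dedupe_keep_order(["Absaroka", "Absaroka", " Abenaki ", ""]))
```

This relies on the fact that Python dictionaries preserve insertion order, so the first time a name appears decides its position.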

### Bringing the Patterns Together

In [336]:
print (len(patterns))
# nlp = spacy.blank("en")
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner', config={"overwrite_ents": True})
ruler.add_patterns(patterns)

25130


### Finding Names with Salutations

In [129]:
import spacy
from spacy.util import filter_spans
from spacy.tokens import Span
from spacy.language import Language
import re

In [337]:
with open("../data/salutations.txt", "r") as f:
    sals = f.read().splitlines()
sals = "|".join(sals)
# a salutation, optional period(s), then one or more capitalized words
salutation_names_pattern = f"({sals}"+r")(\.)* [A-Z]\w+( [A-Z]\w+)*"

@Language.component("salutation_person")
def salutation_person(doc):
    person_ents = []
    original_ents = list(doc.ents)
    for match in re.finditer(salutation_names_pattern, doc.text):
        start, end = match.span()
        # char_span returns None when the match does not align with token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            person_ents.append(Span(doc, span.start, span.end, label="PERSON"))
    # filter_spans resolves overlaps, preferring the longest span
    doc.ents = filter_spans(original_ents + person_ents)
    return doc
nlp.add_pipe("salutation_person", before="ner")

<function __main__.salutation_person(doc)>
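It is easier to judge a regular expression like this one outside of spaCy first. The sketch below rebuilds the same pattern from a short stand-in list (`sals_demo` is an assumption, since the contents of salutations.txt are not shown here) and runs it over an invented sentence.

```python
import re

# Assumed stand-in for the contents of salutations.txt.
sals_demo = ["Mr", "Mrs", "Dr"]
# re.escape guards against salutations that contain regex metacharacters.
demo_pattern = "(" + "|".join(re.escape(s) for s in sals_demo) + r")(\.)* [A-Z]\w+( [A-Z]\w+)*"

sentence = "I saw Mr. Cranch and Mrs Adams at Dr. Tufts house."
salutation_hits = [m.group(0) for m in re.finditer(demo_pattern, sentence)]
print(salutation_hits)
```

The period after the salutation is optional, so both "Mr. Cranch" and "Mrs Adams" are found, which suits the inconsistent punctuation of eighteenth-century letters.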

### Finding Names with 1-3 Initials and Last Name

In [338]:
# one or more initials (periods optional), then a capitalized last name
initial_pattern = r"[A-Z](\.)*( [A-Z](\.)*)*( [A-Z](\.)*)* [A-Z]\w+"
@Language.component("initial_person")
def initial_person(doc):
    person_ents = []
    original_ents = list(doc.ents)
    for match in re.finditer(initial_pattern, doc.text):
        start, end = match.span()
        # char_span returns None when the match does not align with token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            person_ents.append(Span(doc, span.start, span.end, label="PERSON_INITIALS"))
    # filter_spans resolves overlaps, preferring the longest span
    doc.ents = filter_spans(original_ents + person_ents)
    return doc
nlp.add_pipe("initial_person", before="entity_ruler")

<function __main__.initial_person(doc)>
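Because the periods in the initials pattern are optional, any single capital letter followed by a capitalized word will match, including phrases like "A Tongue" from the first letter. The sketch below compares the original pattern with a stricter variant that requires a period after each initial. The `strict` pattern is my own suggestion, not the notebook's, and the sample sentence is invented.

```python
import re

# The pattern used above, and a stricter hypothetical alternative.
original = r"[A-Z](\.)*( [A-Z](\.)*)*( [A-Z](\.)*)* [A-Z]\w+"
strict = r"(?:[A-Z]\. ){1,3}[A-Z]\w+"

sample_text = "Letters from J. Q. Adams, and A Tongue Sweet."
original_hits = [m.group(0) for m in re.finditer(original, sample_text)]
strict_hits = [m.group(0) for m in re.finditer(strict, sample_text)]
print(original_hits)
print(strict_hits)
```

The stricter version trades recall for precision: it would miss initials written without periods, so which one is preferable depends on how the letters in your corpus are transcribed.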

In [46]:
from spacy import displacy

In [339]:
text = texts[1200]['text']
text = text.replace("mr ", "Mr. ").replace("mrs ", "Mrs. ").replace("dr ", "Dr. ")
doc = nlp(text)
displacy.render(doc, style="ent")
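Chained `replace` calls work on raw substrings, so they can also touch the end of a longer word and they miss forms that already carry a period. A regular expression with word boundaries is one alternative. This is a sketch under my own assumptions (`normalize_salutations` and the three salutations are hypothetical choices), not the notebook's method.

```python
import re

# Lowercase salutation as a whole word, with an optional period, followed by whitespace.
SALUTATION_RE = re.compile(r"\b(mr|mrs|dr)\b\.?(?=\s)")


def normalize_salutations(text):
    """Capitalize lowercase mr/mrs/dr and make sure each is followed by a single period."""
    return SALUTATION_RE.sub(lambda m: m.group(1).capitalize() + ".", text)


print(normalize_salutations("mr Cranch and mrs. Adams saw dr Tufts"))
```

Normalizing the salutations before calling `nlp` gives the salutation component a more consistent surface form to match against.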

# Part Four: Applying the Pipeline

In [313]:
connections = []
for i, text in enumerate(texts[:100]):
    authors = []
    recipients = []
    for name in text["author"]:
        if name.count(", ") == 1:
            surname, first = name.split(", ", 1)
            name = f"{first} {surname}"
            authors.append(name)
    for name in text['recipient']:
        if name.count(", ") == 1:
            surname, first = name.split(", ", 1)
            name = f"{first} {surname}"
            recipients.append(name)
    doc = nlp(text["text"])
    people_referenced = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people_referenced.append(ent.text)
    for author in authors:
        for person in recipients:
            connections.append((author, person, "Letter_To"))
    # for author in authors:
    #     for person in people_referenced:
    #         connections.append((author, person, "Letter_Reference"))
    # for recipient in recipients:
    #     for person in people_referenced:
    #         connections.append((recipient, person, "Letter_Reference"))
            
print (len(connections))

102


In [310]:
connections[:10]

[('Abigail Adams', 'Isaac Jr. Smith', 'Letter_To'),
 ('Abigail Adams', 'Mrs. Wheelwrights Letter', 'Letter_Reference'),
 ('Abigail Adams', 'Smith', 'Letter_Reference'),
 ('Isaac Jr. Smith', 'Mrs. Wheelwrights Letter', 'Letter_Reference'),
 ('Isaac Jr. Smith', 'Smith', 'Letter_Reference'),
 ('Abigail Adams', 'John Adams', 'Letter_To'),
 ('Abigail Adams', 'John Adams', 'Letter_To'),
 ('Abigail Adams', 'Cotton Tufts', 'Letter_To'),
 ('Abigail Adams', 'Mr. Eyers', 'Letter_Reference'),
 ('Abigail Adams', 'Ayers', 'Letter_Reference')]

In [302]:
import pandas as pd

In [314]:
df = pd.DataFrame(connections, columns=["source", "target", "relationship"])
df

Unnamed: 0,source,target,relationship
0,Abigail Adams,Isaac Jr. Smith,Letter_To
1,Abigail Adams,John Adams,Letter_To
2,Abigail Adams,John Adams,Letter_To
3,Abigail Adams,Cotton Tufts,Letter_To
4,Abigail Adams,John Adams,Letter_To
...,...,...,...
97,Abigail Adams,John Adams,Letter_To
98,Abigail Adams,John Adams,Letter_To
99,Abigail Adams,John Adams,Letter_To
100,Abigail Adams,John Adams,Letter_To
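Many rows in this table repeat the same pair of correspondents. For network analysis it is common to collapse repeated edges into one edge with a weight. A minimal sketch with `collections.Counter` follows; `edges_demo` is a made-up sample in the same (source, target, relationship) shape as `connections`.

```python
from collections import Counter

# Hypothetical edges in the same shape as the `connections` list.
edges_demo = [
    ("Abigail Adams", "John Adams", "Letter_To"),
    ("Abigail Adams", "John Adams", "Letter_To"),
    ("Abigail Adams", "Cotton Tufts", "Letter_To"),
]

# Count identical edges, then unpack each (edge, count) pair into one weighted row.
edge_weights = Counter(edges_demo)
weighted_edges = [(s, t, rel, n) for (s, t, rel), n in edge_weights.most_common()]
print(weighted_edges)
```

The weighted rows could then be passed to `pd.DataFrame` with an extra "weight" column before exporting to CSV.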


In [315]:
df.to_csv("../data/abigail_adams_network_strict.csv", index=False)

# Exercises

`I know we covered a lot in this notebook and the best way to understand its contents in depth is to apply it to your own domain, or area of expertise. I encourage you to select a text (or texts) that you use in your own research and try to apply the methods covered in this notebook to those particular texts. I would highly encourage you to do this before moving on to the next notebook.`
