# NLP and entity extraction

## Summary

Use the SpaCy NLP library to identify named entities in a short text, then write code to parse the output.

## Details

The text in question is one we've already seen (in the structured data exercise). To review: it's a brief pamphlet, *Growler's Income Tax* (1864), by the prolific mid-nineteenth-century writer T.S. Arthur. It's a defense of the then-new income tax, instituted in 1861 to fund the Union's war effort. As you'll see, the text is pretty straightforward, but it's kind of nifty (or infuriating, I guess) how closely its arguments about taxation resemble those you might hear today. Go ahead, download the [plain-text copy](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps4/growler.txt) and read it now, if you haven't already. It's short. (Note that you could also work with the [XML version](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps3/growler.xml) from the previous exercise and convert that to plain text as shown in the [solution to that exercise](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps3/Parsing%20structured%20data%20solution.ipynb).)

Anyway, tax policy isn't really the point. Your task is to identify the named entities in the text algorithmically and to extract them for further processing. To do this, you'll use the SpaCy NLP library. SpaCy isn't included with the default Anaconda distribution, but you can install it via the Anaconda Navigator GUI or from the command line (or from a console within JupyterLab) by typing:

```
conda install spacy
```

and then installing at least the basic trained model:

```
python -m spacy download en_core_web_sm
```

SpaCy is a deep-learning-based NLP package. But the underlying details aren't super important for our purposes; we're interested in how to use it and in how it performs on our text document. You can see the [full usage instructions](https://spacy.io/usage) at the SpaCy site; there's also some starter code below.

Once you've processed the document, your task is to write code that reads the processed data and builds a list of unique entities in the output, each entity's type, and a count of how many times each entity occurs.

Your program should print a summary of the entity data. Your output (with made-up data) should look roughly like this:

```
Entity      Type      Count
------      ----      -----
Boston      Location      2
John Smith  Person        1
```
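
Columns line up more reliably when each field is padded to a fixed width than when fields are separated by tabs, since entity names vary a lot in length. Here is a minimal sketch of that approach using f-string format specifiers. The function name `format_table`, the column widths, and the sample rows are all made up for illustration.

```python
def format_table(rows, widths=(16, 12, 5)):
    """Return aligned text lines for (entity, type, count) rows."""
    w_ent, w_type, w_count = widths
    lines = [
        f"{'Entity':<{w_ent}}{'Type':<{w_type}}{'Count':>{w_count}}",
        f"{'------':<{w_ent}}{'----':<{w_type}}{'-----':>{w_count}}",
    ]
    for entity, etype, count in rows:
        # Left-align the text columns, right-align the number
        lines.append(f"{entity:<{w_ent}}{etype:<{w_type}}{count:>{w_count}}")
    return lines

# Made-up rows, matching the sample output above
rows = [('Boston', 'Location', 2), ('John Smith', 'Person', 1)]
for line in format_table(rows):
    print(line)
```

Entities longer than the first column width will still push the other columns to the right, so you may want to set the widths from the longest entity in your data.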

## Alternative processing

If you're up for a modest challenge, you might try using NLTK's (slower, lower-performing) named entity chunker. But that's strictly optional.

## Submit

Submit your code and output as a single Jupyter notebook via Sakai.

## Consider

A few things to think about before class:

* How well or poorly do the entities extracted from the text square with your sense of what the text is about, whom it involves, and where it occurs (or with what areas it’s concerned)?
* How accurate is the NER process?
* How might you try to improve NER accuracy?
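
On the last question, one cheap step is to normalize entity strings before counting, so that surface variants of one name are tallied together. The sketch below is only an illustration: the `normalize` helper and its rules (collapse whitespace, drop trailing punctuation and a trailing possessive, ignore case) are assumptions of mine, not part of SpaCy, and the input pairs are made up.

```python
import re
from collections import Counter

def normalize(entity):
    """Return a comparison key for an entity string (hypothetical rules)."""
    key = ' '.join(entity.split())      # collapse newlines and runs of spaces
    key = key.rstrip('.,;:')            # drop trailing punctuation
    key = re.sub(r"['’]s$", '', key)    # drop a trailing possessive
    return key.casefold()               # ignore case differences

# Made-up (entity, tag) pairs standing in for SpaCy output
pairs = [('Growler', 'PERSON'), ('GROWLER\n', 'PERSON'),
         ("Growler's", 'PERSON'), ('Philadelphia,', 'GPE')]

# Count on the normalized key, keeping the tag as part of the key
norm_counts = Counter((normalize(text), tag) for text, tag in pairs)
for (key, tag), n in norm_counts.most_common():
    print(f'{key}\t{tag}\t{n}')
```

Because the tag stays in the key, the same name tagged two different ways is still counted separately. Whether to merge across tags is a separate decision.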

## Starter code

Here's a bit of code to get you going. Note that most of these examples involve placeholder filenames and will require changes to run on real data.

In [12]:
# Read the content of a text file from disk
with open('growler.txt', 'r') as f:
    txt = f.read()

In [13]:
# NLP with SpaCy
import spacy
import en_core_web_sm # Remember to download this model before importing!

txt = """George Washington was the first president of the United States. He was born in Virginia in 1732 and died in December of 1799."""

nlp = en_core_web_sm.load()
doc = nlp(txt)
print([(ent.text, ent.label_) for ent in doc.ents])

[('George Washington', 'PERSON'), ('first', 'ORDINAL'), ('the United States', 'GPE'), ('Virginia', 'GPE'), ('1732', 'DATE'), ('December of 1799', 'DATE')]


In [2]:
# Count things
from collections import defaultdict

data = [
    ('thing 1', 'a'), 
    ('thing 2', 'b'), 
    ('thing 3', 'b'), 
    ('thing 1', 'a'), 
    ('thing 3', 'b')
]

counts = defaultdict(dict)

# Tally as counts[tag][entity]; the first sighting raises KeyError, so start at 1
for entity, tag in data:
    try:
        counts[tag][entity] += 1
    except KeyError:
        counts[tag][entity] = 1

# Print sorted counts
print('tag\tentity\tcount')
for tag in counts:
    for entity in sorted(counts[tag], key=counts[tag].get, reverse=True):
        print(f'{tag}\t{entity}\t{counts[tag][entity]}')

tag	entity	count
a	thing 1	2
b	thing 3	2
b	thing 2	1
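
The same tally can be written without `try`/`except` by keying a `collections.Counter` on `(tag, entity)` pairs. This is a sketch of an alternative, not a replacement for the nested-dictionary pattern above. The variable name `pair_counts` is mine, and the data is the same made-up list.

```python
from collections import Counter

data = [
    ('thing 1', 'a'),
    ('thing 2', 'b'),
    ('thing 3', 'b'),
    ('thing 1', 'a'),
    ('thing 3', 'b'),
]

# Count each (tag, entity) pair; unseen keys start at zero automatically
pair_counts = Counter((tag, entity) for entity, tag in data)

# Sort by tag, then by descending count within each tag
print('tag\tentity\tcount')
for (tag, entity), n in sorted(pair_counts.items(),
                               key=lambda kv: (kv[0][0], -kv[1])):
    print(f'{tag}\t{entity}\t{n}')
```

A flat `Counter` is convenient when you want one overall ranking. The nested dictionary is handier when you mostly work one tag at a time.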


## Your code here

Use the examples above to read the document from disk, process it with SpaCy, and produce a list of counted entities by type.

In [5]:
# Read the pamphlet, then run the SpaCy pipeline loaded above
with open('growler.txt', 'r') as f:
    txt = f.read()
doc = nlp(txt)
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Tally occurrences as counts[tag][entity]
counts = defaultdict(dict)
for entity, tag in entities:
    # Turn line breaks into spaces so wrapped names don't run together
    entity = entity.replace('\n', ' ').strip()
    if len(entity) > 0:
        try:
            counts[tag][entity] += 1
        except KeyError:
            counts[tag][entity] = 1

# Print each tag's entities, most frequent first
print('tag\t\tentity\t\tcount')
print('---\t\t------\t\t-----')
for tag in counts:
    for entity in sorted(counts[tag], key=counts[tag].get, reverse=True):
        print(f'{tag}\t\t{entity}\t\t{counts[tag][entity]}')
    print()

tag		entity		count
---		------		-----
ORG		GROWLER		1
ORG		ESQ		1
ORG		District of Pennsylvania.		1
ORG		Rec'd		1
ORG		Government		1
ORG		State		1
ORG		Confederation		1
ORG		the Union		1

GPE		T.S.		1
GPE		Philadelphia		1
GPE		Pennsylvania		1
GPE		Alabama		1
GPE		Florida		1
GPE		Delaware		1
GPE		England		1
GPE		France		1
GPE		Carlisle		1
GPE		Gettysburg		1

PERSON		Growler		13
PERSON		Collector		2
PERSON		Lee		2
PERSON		Stand		1
PERSON		Pistols		1
PERSON		RICHARD GROWLER		1
PERSON		JOHN M. RILEY		1
PERSON		I.		1
PERSON		G.		1
PERSON		Collector Riley's		1

WORK_OF_ART		War Tax		1

DATE		Sept.,		1
DATE		1863		1
DATE		the year 1862		1
DATE		many years		1
DATE		annual		1
DATE		Yesterday		1
DATE		next year		1

ORDINAL		4th		1

MONEY		forty-three dollars		4
MONEY		twenty-one cents		4
MONEY		fifty-eight dollars		3
MONEY		Fifty-eight dollars		2
MONEY		43,21		1
MONEY		Just forty-three dollars		1
MONEY		two hundred dollars		1
MONEY		nearly two thousand dollars		1
MONEY		A million dollars		1
MONE

In [14]:
# Pandas alternative
import pandas as pd
import string
import numpy as np

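# Characters to blank out: ASCII punctuation plus the newline
punct = frozenset(string.punctuation + '\n')

def strip_punct(phrase):
    """Replace punctuation with spaces; return NaN if nothing is left."""
    new = ''
    for char in phrase:
        if char not in punct:
            new += char
        else:
            new += ' '
    new = new.strip()
    if len(new) > 0:
        return new
    else:
        return np.nan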
    
df = pd.DataFrame.from_records(
    entities,
    columns=['entity', 'tag']
)

df['entity'] = df['entity'].apply(strip_punct)

g = df.groupby(['tag', 'entity']).size()

print("Most frequently-occurring entities\n")
with pd.option_context('display.max_rows', 999):
    display(g.sort_values(ascending=False).head(10))

Most frequently-occurring entities



tag       entity             
PERSON    Growler                13
MONEY     twenty one cents        4
          forty three dollars     4
          fifty eight dollars     3
CARDINAL  more than half          2
PERSON    Collector               2
CARDINAL  six                     2
MONEY     Fifty eight dollars     2
PERSON    Lee                     2
CARDINAL  half                    2
dtype: int64

In [10]:
print("Top entities by type\n")
# Group by a plain column name (not a one-item list) so each key is a string
for tag, group in df.groupby('tag'):
    print(tag)
    print(group.groupby('entity').size().sort_values(ascending=False).head(3))
    print()

Top entities by type

CARDINAL
entity
six               2
more than half    2
half              2
dtype: int64

DATE
entity
the year 1862    1
next year        1
many years       1
dtype: int64

EVENT
entity
Nearer ten millions of dollars    1
dtype: int64

GPE
entity
T S             1
Philadelphia    1
Pennsylvania    1
dtype: int64

LOC
entity
South           1
Rappahannock    1
dtype: int64

MONEY
entity
twenty one cents       4
forty three dollars    4
fifty eight dollars    3
dtype: int64

NORP
entity
Southern    1
American    1
dtype: int64

ORDINAL
entity
4th    1
dtype: int64

ORG
entity
the Union    1
State        1
Rec d        1
dtype: int64

PERSON
entity
Growler      13
Lee           2
Collector     2
dtype: int64

QUANTITY
entity
over two hundred bushels    1
one half or two thirds      1
dtype: int64

WORK_OF_ART
entity
War Tax    1
dtype: int64



In [63]:
%%time
# The American corpus
import glob
import numpy as np
import os
import string

punct = frozenset(list(string.punctuation + '\n' + '“'))

def strip_punct(phrase):
    new = ''
    for char in phrase:
        if char not in punct:
            new += char
        else:
            new += ' '
    new = new.strip()
    if len(new) > 0:
        return new
    else:
        return np.nan
    
nlp = en_core_web_sm.load()
# Raise SpaCy's default 1,000,000-character limit so the longest novels fit
nlp.max_length = 2000000

dfs = []

for file in glob.glob(os.path.join('..', '..', 'data', 'texts', 'A-*.txt')):
    # Drop the extension with splitext; str.strip('.txt') strips characters, not a suffix
    name = os.path.splitext(os.path.basename(file))[0]
    print(os.path.basename(file))
    with open(file, 'r') as f:
        doc = nlp(f.read())
        entities = [(ent.text, ent.label_, name) for ent in doc.ents]
        df = pd.DataFrame.from_records(
            entities,
            columns=['entity', 'tag', 'doc']
        )
        df['entity'] = df['entity'].apply(strip_punct)
        dfs.append(df)

results = pd.concat(dfs)

A-Alcott-Little_Women-1868-F.txt
A-Cather-Antonia-1918-F.txt
A-Chesnutt-Marrow-1901-M.txt
A-Crane-Maggie-1893-M.txt
A-Dreiser-Sister_Carrie-1900-M.txt
A-Hawthorne-Scarlet_Letter-1850-M.txt
A-Howells-Silas_Lapham-1885-M.txt
A-James-Golden_Bowl-1904-M.txt
A-London-Call_Wild-1903-M.txt
A-Jewett-Pointed_Firs-1896-F.txt
A-Melville-Moby_Dick-1851-M.txt
A-Norris-Pit-1903-M.txt
A-Twain-Huck_Finn-1885-M.txt
A-Wharton-Age_Innocence-1920-F.txt
A-Chopin-Awakening-1899-F.txt
A-Gilman-Herland-1915-F.txt
A-Harper-Iola_Leroy-1892-F.txt
A-Stowe-Uncle_Tom-1852-F.txt
A-Freeman-Pembroke-1894-F.txt
A-Davis-Life_Iron_mills-1861-F.txt
CPU times: user 11min 26s, sys: 1min 46s, total: 13min 12s
Wall time: 5min 43s


In [64]:
results.head()

      entity     tag                           doc
0  Christmas    DATE  A-Alcott-Little_Women-1868-F
1  Christmas    DATE  A-Alcott-Little_Women-1868-F
2         Jo  PERSON  A-Alcott-Little_Women-1868-F
3        Meg     ORG  A-Alcott-Little_Women-1868-F
4        Amy  PERSON  A-Alcott-Little_Women-1868-F


In [65]:
# Top three strings per tag across the whole corpus
for tag, group in results.groupby('tag'):
    print(tag)
    print(group.groupby('entity').size().sort_values(ascending=False).head(3))
    print()

CARDINAL
entity
one     2963
two     1523
half     628
dtype: int64

DATE
entity
May        286
the day    195
March      166
dtype: int64

EVENT
entity
Hugo                        10
the Battle of the Street     5
New Year s                   4
dtype: int64

FAC
entity
Eaton Square    46
Broadway        42
Fifth Avenue    39
dtype: int64

GPE
entity
Charlotte    1088
Drouet        384
New York      273
dtype: int64

LANGUAGE
entity
English      149
Hurstwood     42
French        19
dtype: int64

LAW
entity
the Countess Olenska    15
Miranda                  4
CHAPTER 1                2
dtype: int64

LOC
entity
South          98
New England    70
Hurstwood      70
dtype: int64

MONEY
entity
a cent        24
ten cents     23
every cent    13
dtype: int64

NORP
entity
Hurstwood    221
American     156
Christian    132
dtype: int64

ORDINAL
entity
first     1635
second     273
third      111
dtype: int64

ORG
entity
Meg     480
Ahab    399
Eva     197
dtype: int64

PERCENT
entity
one    2

In [66]:
# Top 20 strings tagged GPE across the corpus
gpe = results.loc[results.tag == 'GPE']
print(gpe.groupby('entity').size().sort_values(ascending=False).head(20))

entity
Charlotte    1088
Drouet        384
New York      273
Antonia       266
Mas’r         222
Rebecca       199
Chicago       159
Iola          133
Sylvia        124
St  Clare     114
Starbuck      110
Pit           104
London        101
Hurstwood      92
Paris          91
Jadwin         89
Jimmie         88
Boston         82
Thou           79
America        78
dtype: int64
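
The GPE list above mixes real places (New York, Chicago, London) with character names (Charlotte, Drouet, Hurstwood), and Hurstwood also shows up under LANGUAGE, LOC, and NORP in the per-tag output. One possible cleanup, sketched here with made-up counts, is to give every occurrence of a string the tag it receives most often. The function name `majority_tags` is hypothetical, and the heuristic is an assumption of mine, not something SpaCy does.

```python
from collections import Counter, defaultdict

def majority_tags(pairs):
    """Map each entity string to the tag it was given most often."""
    tag_counts = defaultdict(Counter)
    for entity, tag in pairs:
        tag_counts[entity][tag] += 1
    return {entity: c.most_common(1)[0][0]
            for entity, c in tag_counts.items()}

# Made-up (entity, tag) pairs, not real corpus output
pairs = ([('Hurstwood', 'PERSON')] * 5
         + [('Hurstwood', 'GPE')] * 2
         + [('Chicago', 'GPE')] * 3)

# Relabel every occurrence with its string's majority tag, then recount
best = majority_tags(pairs)
relabeled = Counter((entity, best[entity]) for entity, _ in pairs)
print(relabeled)
```

A majority vote only helps when the model is right more often than wrong for a given string; a string that is mislabeled most of the time would stay mislabeled. It also assumes a string always names the same thing, which fails for a name shared by a person and a place.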
