# NLP and entity extraction

## Summary

Use the SpaCy NLP library to identify named entities in a short text, then write code to parse the output.

## Details

The text in question is one we've already seen (in the structured data exercise). To review: it's a brief pamphlet, *Growler's Income Tax* (1864), by the prolific mid-nineteenth-century writer T.S. Arthur. It's a defense of the then-new income tax, instituted in 1861 to fund the Union's war effort. As you'll see, the text is pretty straightforward, but it's kind of nifty (or infuriating, I guess) how similar its arguments about taxation are to those you might hear today. Go ahead, download the [plain-text copy](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps4/growler.txt) and read it now, if you haven't already. It's short. (Note that you could also work with the [XML version](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps3/growler.xml) from the previous exercise and convert that to plain text as shown in the [solution to that exercise](https://github.com/wilkens-teaching/dhgrad2019/blob/master/exercises/ps3/Parsing%20structured%20data%20solution.ipynb).)

Anyway, tax policy isn't really the point. Your task is to algorithmically identify the named entities in the text and to extract them for further processing. To do this, you'll use the SpaCy NLP library. SpaCy isn't included with the default Anaconda distribution, but you can install it via the Anaconda Navigator GUI or from the command line (or from a console within JupyterLab) by typing:

```
conda install spacy
```

and then installing at least the basic trained model:

```
python -m spacy download en_core_web_sm
```

SpaCy is a deep-learning-based NLP package. But the underlying details aren't super important for our purposes; we're interested in how to use it and in how it performs on our text document. You can see the [full usage instructions](https://spacy.io/usage) at the SpaCy site; there's also some starter code below.

Once you've processed the document, your task is to write code that reads the processed data and builds a list of unique entities in the output, each entity's type, and a count of how many times each entity occurs.

Your program should print a summary of the entity data. Your output (with made-up data) should look roughly like this:

```
Entity      Type      Count
------      ----      -----
Boston      Location      2
John Smith  Person        1
```
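
One way to get columns that line up like the sample is to pad each cell to the width of the longest value in its column. The sketch below uses the same made-up rows as the sample; the `format_table` name and the two-space gutter are illustrative choices, not requirements of the assignment.

```python
# Made-up (entity, type, count) rows, matching the sample output above
rows = [
    ("Boston", "Location", 2),
    ("John Smith", "Person", 1),
]

def format_table(rows):
    """Return a list of lines: header, rule, then one line per row."""
    headers = ("Entity", "Type", "Count")
    # Each text column is as wide as its longest cell (header included)
    w_ent = max([len(headers[0])] + [len(r[0]) for r in rows])
    w_typ = max([len(headers[1])] + [len(r[1]) for r in rows])
    w_cnt = len(headers[2])
    rule = tuple("-" * len(h) for h in headers)
    lines = [
        f"{headers[0]:<{w_ent}}  {headers[1]:<{w_typ}}  {headers[2]}",
        f"{rule[0]:<{w_ent}}  {rule[1]:<{w_typ}}  {rule[2]}",
    ]
    for entity, etype, count in rows:
        # Left-align the text columns, right-align the count
        lines.append(f"{entity:<{w_ent}}  {etype:<{w_typ}}  {count:>{w_cnt}}")
    return lines

print("\n".join(format_table(rows)))
```

Tab-separated output, as in the starter code below, is also perfectly acceptable; padding just keeps long entity names from pushing the columns out of line.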

## Alternative processing

If you're up for a modest challenge, you might try using NLTK's (slower, lower-performing) named entity chunker. But that's strictly optional. You might also try running the SpaCy pipeline over the full set of 40 texts in the class corpus and seeing what you get.
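
If you do try the full corpus, the looping and tallying can be kept separate from the NLP step. Here's a rough sketch, assuming the texts sit in one directory as `.txt` files; the `count_corpus_entities` function, the `toy_extract` stand-in, and the file names are all hypothetical and exist only to make the example runnable without a trained model.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_corpus_entities(corpus_dir, extract):
    """Tally (entity, type) pairs across every .txt file in corpus_dir.

    `extract` is any callable that takes a string and returns an
    iterable of (entity_text, entity_type) pairs.
    """
    totals = Counter()
    for path in sorted(Path(corpus_dir).glob('*.txt')):
        text = path.read_text(encoding='utf-8')
        totals.update(extract(text))
    return totals

def toy_extract(text):
    # Stand-in for a real NER step: treats capitalized words as entities
    return [(word, 'TOY') for word in text.split() if word.istitle()]

# Demonstrate on two throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.txt').write_text('Boston is north of Hartford', encoding='utf-8')
    Path(tmp, 'b.txt').write_text('trains leave Boston daily', encoding='utf-8')
    totals = count_corpus_entities(tmp, toy_extract)

print(totals.most_common())
```

With SpaCy loaded as `nlp`, you would swap the stand-in for something like `lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents]`.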

## Submit

Submit your code and output as a single Jupyter notebook via Sakai.

## Consider

A few things to think about before class:

* How well or poorly do the entities extracted from the text square with your sense of what the text is about, whom it involves, and where it occurs (or with what areas it’s concerned)?
* How accurate is the NER process?
* How might you try to improve NER accuracy?

## Starter code

Here's a bit of code to get you going. Note that most of these examples involve placeholder filenames and will require changes to run on real data.

In [None]:
# Read the content of a text file from disk
with open('filename.txt', 'r', encoding='utf-8') as f:
    txt = f.read()  # read() returns the whole file as a single string

In [None]:
# NLP with SpaCy
import spacy
import en_core_web_sm # Remember to download this model before importing!

txt = """George Washington was the first president of the United States. He was born in Virginia in 1732 and died in December of 1799."""

nlp = en_core_web_sm.load()
doc = nlp(txt)
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
# Count things
from collections import defaultdict

data = [
    ('thing 1', 'a'), 
    ('thing 2', 'b'), 
    ('thing 3', 'b'), 
    ('thing 1', 'a'), 
    ('thing 3', 'b')
]

counts = defaultdict(dict)

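for entity, tag in data:
    try:
        # Increment the count if we've already seen this entity under this tag
        counts[tag][entity] += 1
    except KeyError:
        # First sighting: start the count at 1
        counts[tag][entity] = 1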

# Print sorted counts
print('tag\tentity\tcount')
for tag in counts:
    for entity in sorted(counts[tag], key=counts[tag].get, reverse=True):
        print(f'{tag}\t{entity}\t{counts[tag][entity]}')
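
A more compact alternative to the nested dictionary is to count the (entity, tag) pairs directly with `collections.Counter`. This is just a sketch on the same made-up data; the `pairs` and `pair_counts` names are placeholders.

```python
from collections import Counter

# Made-up (entity, tag) pairs, the same shape as the list above
pairs = [
    ('thing 1', 'a'),
    ('thing 2', 'b'),
    ('thing 3', 'b'),
    ('thing 1', 'a'),
    ('thing 3', 'b'),
]

# Counter tallies identical (entity, tag) tuples directly
pair_counts = Counter(pairs)

# most_common() yields ((entity, tag), count), most frequent first
print('entity\ttag\tcount')
for (entity, tag), n in pair_counts.most_common():
    print(f'{entity}\t{tag}\t{n}')
```

Since SpaCy's `(ent.text, ent.label_)` pairs have the same shape, a list of them can be passed to `Counter` in the same way. Note that this treats the same string under two different tags as two separate entries.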

## Your code here

Use the examples above to read the document from disk, process it with SpaCy, and produce a list of counted entities by type.

## Discussion

A few words on the quality of the output and what might help to improve it.