# People & Places
## Named Entity Recognition

In this lesson, we will learn how to extract named entities (names of people, places, groups, institutions, etc.) from text files and then analyze the results.

For example, one of our goals is to create a map of all countries mentioned in the State of the Union corpus. 

<div class="alert alert-success" role="alert"><p style="color:green">What steps do you anticipate we will have to take in order to accomplish this project? List the steps in the markdown below:</p></div>

## I. Setup

1. Install **spaCy**

From the command line or terminal (see https://spacy.io/usage for instructions).

2. In Python, import the necessary packages for today's lesson:

In [1]:
import spacy
import collections
import pandas as pd
from spacy.lang.en.examples import sentences
from spacy import displacy   #for visualizing word types and relationships

3. Load the necessary language model. For documentation on **spaCy**'s four available English models, see: https://spacy.io/models/en. Note that the reported accuracy scores for the small, medium, and large models are all roughly similar, so it makes sense to use the small model. The transformer (`trf`) model does score a little better, but given its size, we will stick with the small model.

For more on spaCy's other models, see: https://spacy.io/models. Note: many of these models are trained on 21st-century online new media and websites like Wikipedia. Thus, the accuracy of their NLP methods will decline as the texts to which you apply them diverge from the texts the models were trained on.

In [2]:
nlp = spacy.load("en_core_web_sm")

## II. Apply NER to sample sentences

4. Let's start by experimenting with some preloaded sample sentences:

In [3]:
print(sentences)

['Apple is looking at buying U.K. startup for $1 billion', 'Autonomous cars shift insurance liability toward manufacturers', 'San Francisco considers banning sidewalk delivery robots', 'London is a big city in the United Kingdom.', 'Where are you?', 'Who is the president of France?', 'What is the capital of the United States?', 'When was Barack Obama born?']


<div class="alert alert-success" role="alert"><p style="color:green">5. Before we use spaCy's Named Entity Recognizer, can you identify all examples of the three principal types of named entities (person names, place names, and group / organization names) found in the above sentences? Type your answer below.</p></div>

6. Before we get to NER, let's identify [**parts of speech**](https://spacy.io/usage/linguistic-features#pos-tagging) (often abbreviated as "POS") and **dependencies** (i.e. who/what is doing what action to whom (or what thing)):

In [4]:
#tokenizes the text & then classifies each token in a variety of categories (pos, ner, dependencies, etc.)
doc = nlp(sentences[0]) 
print(doc.text)  #prints the whole sentence
for token in doc:
    #for each token, prints the token, its POS tag, and dependency tag
    print(token.text, token.pos_, token.dep_)

Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


7. We can also identify:
+ stop words (words that don't reveal much about content)
+ the lemmatized version of words
+ type of token (*alpha*: is it all letters, or does it include punctuation, numbers, and special symbols)

In [5]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


8. Below, we extract named entities from this same sentence. [Click here for more on spaCy's NER functions](https://spacy.io/usage/linguistic-features#named-entities).

In [6]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
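
The two numbers printed for each entity are character offsets into the original string, with the end offset exclusive, so slicing the text with them should return the entity itself. A minimal sketch, with the sentence and offsets copied from the output above and hard-coded so that it runs without loading a model:

```python
# Sentence and (start_char, end_char, label) triples copied from the output above.
text = "Apple is looking at buying U.K. startup for $1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing the string with each pair of offsets recovers the entity text.
recovered = [(text[start:end], label) for start, end, label in spans]
for ent_text, label in recovered:
    print(ent_text, label)
```

These offsets are useful later if you want to show an entity in context, for example by printing a few characters on either side of the span.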


9. As you can see, spaCy's NER identifies more than the basic people, places and organizations. To identify what these labels mean, you can:
+ look up the "Label Scheme" documentation on the [model page](https://spacy.io/models/en)
+ view the [full glossary](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py) of terms
+ get info for one term using `spacy.explain()`, *see below*

*I also provide additional information in the next markdown cell below.*

In [7]:
spacy.explain("GPE")

'Countries, cities, states'

Basic named entity recognizers commonly identify the following types of entities:

```
place names
person names
group names
miscellaneous / other entities
```

**spaCy**'s NER identifies a wider range of entities.

Examine the list of entity types identified by spaCy below (from the [spaCy glossary](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py)):

```
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including "%".
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
```
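
If you only care about the four basic categories, one option is to collapse spaCy's finer labels into them. The sketch below is an illustration, not part of spaCy: the `COARSE` dictionary and `coarse_label` helper are hypothetical names, and treating `FAC` as a place and `NORP` as a group is a judgment call you may want to change.

```python
# Hypothetical mapping from spaCy labels to the four basic categories.
# Treating FAC as a place and NORP as a group is an assumption, not a spaCy rule.
COARSE = {
    "PERSON": "person",
    "GPE": "place", "LOC": "place", "FAC": "place",
    "ORG": "group", "NORP": "group",
}

def coarse_label(label):
    """Return the basic category for a spaCy label, or 'other' if unmapped."""
    return COARSE.get(label, "other")

labels = ["ORG", "GPE", "MONEY", "PERSON"]
coarse = [coarse_label(label) for label in labels]
print(coarse)
```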




10. Now, let's apply spaCy's NER to the full list of sample sentences. Do you see any errors?

In [8]:
doc = nlp('. '.join(sentences)) 
print(doc.text)  
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple is looking at buying U.K. startup for $1 billion. Autonomous cars shift insurance liability toward manufacturers. San Francisco considers banning sidewalk delivery robots. London is a big city in the United Kingdom.. Where are you?. Who is the president of France?. What is the capital of the United States?. When was Barack Obama born?
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
San Francisco 120 133 GPE
London 178 184 GPE
the United Kingdom 202 220 GPE
France 263 269 GPE
the United States 295 312 GPE
Barack Obama 324 336 PERSON


## III. Visualize spaCy tagging using displacy

11. We can use spaCy's **displacy** module to visualize some of these linguistic labels. For more, see: https://spacy.io/usage/visualizers.

In [9]:
displacy.render(doc, style = "ent")

In [10]:
displacy.render(doc, style = "dep")

<div class="alert alert-success" role="alert"><p style="color:green">12. In the cells below, write your own series of sentences (maybe 4-6), read them into spaCy (using the **nlp** function), extract named entities from these sentences, and then display the results using **displacy**. (If time allows, you can also examine POS and other tags.) Try to make it challenging for spaCy. What does it do well? Where does it fail?</p></div>

## Training your own NER model

As with machine learning in general, these models work better when they are trained on the same type of data they will be applied to. So, if you are analyzing medical records, legal trial transcripts, or historical documents, a model trained on Wikipedia or news articles will likely not perform as well.

**spaCy** allows you to train your own model or modify existing models. This is beyond the scope of this lesson, but to learn more see their [Training Pipelines & Models](https://spacy.io/usage/training) documentation.

## IV. Apply NER to a full text

### Extracting entities from Biden's 2023 State of the Union address

<div class="alert alert-success" role="alert"><p style="color:green">13. Given what we've learned so far about NER, in the markdown cell below, brainstorm some different research questions you could use spaCy's NER to help you answer (given the list of entities it could help you analyze) about the State of the Union addresses that we have been studying.</p></div>



<div class="alert alert-success" role="alert"><p style="color:green">13b. Next, brainstorm and discuss the following:</p>
<ul>
    <li style="color:green">What texts or types of texts are you interested in analyzing?</li>
    <li style="color:green">How could you apply NER to these texts? What questions could you answer?</li>
</ul>
</div>

### IVa. Reading in SOTU addresses as a dataframe

14. In the last lesson, we learned about some pre-processing tasks that are essential to many types of more advanced NLP methods and to answer many common text analysis questions. These tasks include:
+ counting words in each text
+ tokenizing texts
+ lower-casing tokens
+ removing stopwords 

In this case, we have a dataset of State of the Union (SOTU) addresses already pre-processed for us. We saved the dataset as a ".tsv" to indicate it is a tab-separated-values file rather than a "csv" or comma-separated-values file. Thus, on import, we need to indicate that the separator (a.k.a. "delimiter") is a tab ("\t"):

In [11]:
#read in the .tsv as a dataframe
sotudf = pd.read_csv("sotudf.tsv", encoding="utf-8", sep="\t", index_col=0)
sotudf = sotudf.sort_values(by = ['year']) #you can probably guess what this does
sotudf.tail()  #outputs last 5 rows in dataset

Unnamed: 0,year,pres,numtoks,tokens,fulltext,ltoks,ltoks_ns
212,2019,Trump,5774,"['Madam', 'Speaker', 'Mr', 'Vice', 'President'...","Madam Speaker, Mr. Vice President, Members of ...","['madam', 'speaker', 'mr', 'vice', 'president'...","['madam', 'speaker', 'mr', 'vice', 'president'..."
213,2020,Trump,6472,"['Thank', 'you', 'very', 'much', 'Thank', 'you...",Thank you very much. Thank you. Thank you very...,"['thank', 'you', 'very', 'much', 'thank', 'you...","['thank', 'much', 'thank', 'thank', 'much', 'm..."
12,2021,Biden,8346,"['Thank', 'you', 'Thank', 'you', 'Thank', 'you...",Thank you. Thank you. Thank you. Good to be ba...,"['thank', 'you', 'thank', 'you', 'thank', 'you...","['thank', 'thank', 'thank', 'good', 'back', 'm..."
13,2022,Biden,8122,"['Thank', 'you', 'all', 'very', 'very', 'much'...","Thank you all very, very much. Thank you, plea...","['thank', 'you', 'all', 'very', 'very', 'much'...","['thank', 'much', 'thank', 'please', 'thank', ..."
14,2023,Biden,9534,"['Mr', 'Speaker', 'Thank', 'you', 'You', 'can'...","Mr. Speaker. Thank you. You can smile, it's OK...","['mr', 'speaker', 'thank', 'you', 'you', 'can'...","['mr', 'speaker', 'thank', 'smile', 'ok', 'tha..."
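
The token columns are displayed as bracketed lists. When a dataframe like this is read back from a text file, such columns usually come back as plain strings rather than Python lists (an assumption about this particular file that is worth checking with `type(...)`). If that is the case, `ast.literal_eval` from the standard library can convert them. A minimal sketch on a shortened, made-up cell value:

```python
import ast

# A shortened stand-in for one cell of the 'tokens' column (hypothetical sample).
cell = "['Madam', 'Speaker', 'Mr', 'Vice', 'President']"

print(type(cell).__name__)       # the cell is a string, not a list
tokens = ast.literal_eval(cell)  # safely parse the list literal
print(tokens[:2], len(tokens))
```

Assuming every cell is a valid list literal, something like `sotudf['tokens'].apply(ast.literal_eval)` would convert the whole column.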


<p style="color:green">15. Can you identify what is contained in each column or field in this dataset?</p>

16. Next, we want to look at the full text of the most recent SOTU address. We can do that by calling the row which contains the address (Biden 2023) and the "fulltext" column.

In [12]:
#biden23text = sotudf.iloc[-1, 4]
biden23text = sotudf[sotudf['year'] == 2023]['fulltext'].item()
biden23text[:400]

"Mr. Speaker. Thank you. You can smile, it's OK. Thank you, thank you, thank you. Thank you. Please.\n\nMr. Speaker. Madam Vice President. Our first lady and second gentleman. Good to see you guys up there. Members of Congress.\n\nAnd by the way, Chief Justice, I may need a court order. She gets to go to the game tomorrow, next week; I have to stay home. We got to work something out here.\n\nMembers of t"

In [13]:
doc = nlp(biden23text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Speaker 4 11 PERSON
Speaker 105 112 PERSON
first 140 145 ORDINAL
second 155 161 ORDINAL
Congress 215 223 ORG
tomorrow 310 318 DATE
next week 320 329 DATE
the Supreme Court 495 512 ORG
Americans 536 545 NORP
tonight 566 573 TIME
the 118th Congress 592 610 ORG
House 638 643 ORG
Kevin McCarthy 645 659 PERSON
House 799 804 ORG
Democrats 805 814 NORP
first 820 825 ORDINAL
African American 826 842 NORP
Hakeem Jeffries 871 886 ORG
the United States 1003 1020 GPE
Senate 1021 1027 ORG
Mitch McConnell 1029 1044 PERSON
Mitch 1061 1066 PERSON
Chuck Schumer 1092 1105 PERSON
Senate 1143 1149 ORG
Leader 1247 1253 PERSON
the House of Representatives 1454 1482 ORG
Nancy Pelosi 1484 1496 PERSON
America 1519 1526 GPE
Two years ago 1814 1827 DATE
tonight 1868 1875 TIME
12 million 1943 1953 CARDINAL
two years 1986 1995 DATE
four years 2030 2040 DATE
American 2077 2085 NORP
Two years ago 2095 2108 DATE
Covid 2110 2115 PERSON
today 2201 2206 DATE
Covid 2208 2213 PERSON
two years ago 2249 2262 DATE
the Civil 

17. We can store entity information in a list (denoted by "[]") of tuples (denoted by "()").

In [14]:
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents[:10])

[('Speaker', 'PERSON', ''), ('Speaker', 'PERSON', ''), ('first', 'ORDINAL', ''), ('second', 'ORDINAL', ''), ('Congress', 'ORG', ''), ('tomorrow', 'DATE', ''), ('next week', 'DATE', ''), ('the Supreme Court', 'ORG', ''), ('Americans', 'NORP', ''), ('tonight', 'TIME', '')]
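
Each tuple holds the entity text, its label, and a knowledge-base id (empty in the output above), in that order. The sketch below shows three ways to read such tuples: by position, by unpacking, and through a `namedtuple`. The `Entity` name is hypothetical, and the sample tuples are copied from the output above.

```python
from collections import namedtuple

# A few tuples copied from the output above: (text, label, kb_id).
ents = [("Speaker", "PERSON", ""), ("first", "ORDINAL", ""), ("Congress", "ORG", "")]

# Positional access: ent[0] is the text, ent[1] is the label.
print(ents[2][0], ents[2][1])

# Unpacking gives each position a readable name.
for text, label, kb_id in ents:
    print(text, label)

# A namedtuple (Entity is a hypothetical name) allows attribute access instead.
Entity = namedtuple("Entity", ["text", "label", "kb_id"])
named = [Entity(*e) for e in ents]
print(named[2].label)
```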


18. We can iterate through this list of entities from our SOTU address and then save select types of named entities. 

In [15]:
person_names = []
for ent in ents:
    if ent[1] == "PERSON":
        person_names.append(ent[0])
person_names[:10]

#list comprehension to produce the same results in one line of code:
#person_names = [ent[0] for ent in ents if ent[1] == "PERSON"]


['Speaker',
 'Speaker',
 'Kevin McCarthy',
 'Mitch McConnell',
 'Mitch',
 'Chuck Schumer',
 'Leader',
 'Nancy Pelosi',
 'Covid',
 'Covid']
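
If you expect to look at several entity types, an alternative to filtering one label at a time is to group all entities by label in a single pass. A minimal sketch using `collections.defaultdict`, with a handful of tuples copied from the output above (the variable name `by_label` is just for illustration):

```python
from collections import defaultdict

# A few (text, label, kb_id) tuples copied from the output above.
ents = [
    ("Speaker", "PERSON", ""), ("Congress", "ORG", ""),
    ("Kevin McCarthy", "PERSON", ""), ("Americans", "NORP", ""),
    ("the Supreme Court", "ORG", ""),
]

# Build {label: [entity texts]} in a single pass over the list.
by_label = defaultdict(list)
for text, label, _kb_id in ents:
    by_label[label].append(text)

print(by_label["PERSON"])
print(sorted(by_label))
```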

<div class="alert alert-success" role="alert"><p style="color:green">19. Re-run the code above, but this time choosing another entity type to extract.</p></div>

20. Note that to get place names, you need to request more than one entity type. Try running the code below:

In [16]:
place_names = [(ent[0], ent[1]) for ent in ents if ent[1] in ['GPE', 'LOC']]
place_names

[('the United States', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('Europe', 'LOC'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('Ukraine', 'GPE'),
 ('Ukraine', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('Columbus', 'GPE'),
 ('Ohio', 'GPE'),
 ('The United States of America', 'GPE'),
 ('Boston', 'GPE'),
 ('Atlanta', 'GPE'),
 ('Portland', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('Kentucky', 'GPE'),
 ('the Ohio River', 'LOC'),
 ('the Ohio River', 'LOC'),
 ('Cincinnati', 'GPE'),
 ('Ohio River', 'LOC'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('earth', 'LOC'),
 ('Arizona', 'GPE'),
 ('New Mexico', 'GPE'),
 ('Missouri', 'GPE'),
 ('Puerto Rico', 'GPE'),
 ('Florida', 'GPE'),
 ('Idaho', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('America', 'GPE'),
 ('the United States of

Do you notice any mislabeled place names above? If so, they may be **false positives** (not named entities at all) or misidentified NEs (e.g. a place name labeled as a person). If you want to assess the accuracy of an NER model, however, you should also examine the number of **false negatives** (named entities not recognized by the model).
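
False positives and false negatives are usually summarized as precision and recall. The sketch below assumes you have hand-checked a text and written down the place names it really contains; both sets here are made up for illustration and are not results from the address.

```python
# Hypothetical example: names the model returned vs. a hand-checked list.
predicted = {"America", "Ukraine", "Europe", "earth"}
gold = {"America", "Ukraine", "Europe", "Kentucky", "Ohio"}

true_pos = predicted & gold    # correctly recognized
false_pos = predicted - gold   # returned by the model but not in the checked list
false_neg = gold - predicted   # in the checked list but missed by the model

precision = len(true_pos) / len(predicted)  # share of returned names that are right
recall = len(true_pos) / len(gold)          # share of real names that were found
print(sorted(false_pos), sorted(false_neg))
print(round(precision, 2), round(recall, 2))
```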

21. To analyze the results, it is often helpful to create a frequency list. Python's **collections** library makes that fairly easy.

In [17]:
pnfreqs = collections.Counter(place_names)
pnfreqs

Counter({('the United States', 'GPE'): 2,
         ('America', 'GPE'): 35,
         ('Europe', 'LOC'): 3,
         ('Ukraine', 'GPE'): 4,
         ('Columbus', 'GPE'): 1,
         ('Ohio', 'GPE'): 1,
         ('The United States of America', 'GPE'): 1,
         ('Boston', 'GPE'): 1,
         ('Atlanta', 'GPE'): 1,
         ('Portland', 'GPE'): 1,
         ('Kentucky', 'GPE'): 1,
         ('the Ohio River', 'LOC'): 2,
         ('Cincinnati', 'GPE'): 1,
         ('Ohio River', 'LOC'): 1,
         ('earth', 'LOC'): 1,
         ('Arizona', 'GPE'): 1,
         ('New Mexico', 'GPE'): 1,
         ('Missouri', 'GPE'): 1,
         ('Puerto Rico', 'GPE'): 1,
         ('Florida', 'GPE'): 1,
         ('Idaho', 'GPE'): 1,
         ('the United States of America', 'GPE'): 2,
         ('Memphis', 'GPE'): 1,
         ('Cuba', 'GPE'): 1,
         ('Haiti', 'GPE'): 1,
         ('Nicaragua', 'GPE'): 1,
         ('Venezuela', 'GPE'): 1,
         ("the People's Republic of China", 'GPE'): 1,
         ('Chi

In [18]:
pnfreqs.most_common(5)

[(('America', 'GPE'), 35),
 (('Ukraine', 'GPE'), 4),
 (('China', 'GPE'), 4),
 (('Europe', 'LOC'), 3),
 (('the United States', 'GPE'), 2)]

22. A common issue with NER is that many named entities are referred to by multiple names ("America", "The United States", "The United States of America", and "USA"/"U.S.A." or "Samuel L. Jackson", "Samuel Jackson", "Sam L. Jackson", "Sam Jackson", etc.), have different names in different languages ("London" / "Londres", "Rome" / "Roma"), or can sometimes be referred to by their title instead of their name ("The President", "The Captain"). 

There is no easy solution to this. Usually the preferred way to deal with multiple aliases is to create a separate dictionary with a unique identifier (e.g. "S66M23001A"), the standardized form of the name, and a list of potential aliases. Fortunately, there are some existing Python packages that help with well-known entities like countries. 
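
As a rough illustration of the alias-dictionary idea, the sketch below merges counts for names that refer to the same entity. The identifiers (`C001`, `C002`), the `ALIASES` table, and the alias lists are all hypothetical; the counts are copied from the frequency list above.

```python
from collections import Counter

# Hypothetical alias table: identifier -> standard name and known aliases.
ALIASES = {
    "C001": {"name": "United States",
             "aliases": ["America", "the United States",
                         "the United States of America", "USA", "U.S.A."]},
    "C002": {"name": "China",
             "aliases": ["China", "the People's Republic of China"]},
}

# Invert the table so each lower-cased alias points to its identifier.
lookup = {alias.lower(): ident
          for ident, rec in ALIASES.items() for alias in rec["aliases"]}

# Counts copied from the frequency list above.
raw = Counter({"America": 35, "the United States": 2,
               "The United States of America": 1,
               "the United States of America": 2, "Ukraine": 4})

merged = Counter()
for name, n in raw.items():
    ident = lookup.get(name.lower())
    # Names without a match in the table keep their original form.
    merged[ALIASES[ident]["name"] if ident else name] += n

print(merged.most_common())
```

Lower-casing the aliases lets "The United States of America" and "the United States of America" match the same entry; a real project would likely need a larger table and more careful normalization.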

In the following notebook "NLP2_NER2_MappingCountries.ipynb" we will get some practice doing just that.