# Week 9: Named Entity Recognition

This week we are going to look at named entity recognition in the fiction genre. In doing so, we will introduce the spaCy library (https://spacy.io/), which provides a number of very fast, state-of-the-art tools for carrying out NLP tasks including part-of-speech tagging, dependency parsing and named entity recognition.



In [1]:
#preliminary imports

from google.colab import drive
#mount google drive
drive.mount('/content/drive/')
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')

import pandas as pd
import operator

Mounted at /content/drive/


## Project Gutenberg

The [Project Gutenberg electronic text archive](http://www.gutenberg.org/) contains around 25,000 free electronic books.

A small selection is made available through the NLTK. For the full list, run the following cell.

In [3]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [4]:
from nltk.corpus import gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

We can get the raw text of any of the novels using the `gutenberg.raw(fileid)` method. This returns a string.

In [5]:
emma=gutenberg.raw('austen-emma.txt')
emma[:500]

"[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died t"

Now, we carry out a little bit of cleaning of the text.  Check you understand what each line in the `clean_text()` function does.

In [6]:
import re
def clean_text(astring):
    #replace newlines with space
    newstring=re.sub(r"\n"," ",astring)
    #remove title and chapter headings
    #(raw strings stop Python treating \[, \S and \s as string escapes)
    newstring=re.sub(r"\[[^\]]*\]"," ",newstring)
    newstring=re.sub(r"VOLUME \S+"," ",newstring)
    newstring=re.sub(r"CHAPTER \S+"," ",newstring)
    #collapse runs of whitespace into a single space
    newstring=re.sub(r"\s\s+"," ",newstring)
    #trim leading and trailing whitespace
    return newstring.strip()

clean_emma=clean_text(emma)
print(len(emma))
print(len(clean_emma))
clean_emma[:500]

887071
880067


"Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct "
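
To see what each substitution in `clean_text()` contributes, it can help to apply them one at a time to a short string. The sketch below does this on an invented sample laid out like the NLTK Gutenberg files; the sample text and the `step1` to `step5` names are assumptions made for illustration only.

```python
import re

# Invented sample text in the same layout as the NLTK Gutenberg files
sample = ("[A Toy Novel by A. N. Author 1900]\n\nVOLUME I\n\nCHAPTER I\n\n\n"
          "It was a fine\nday.  Very fine.\n\nCHAPTER II\n\nThe end.")

step1 = re.sub(r"\n", " ", sample)            # newlines become spaces
step2 = re.sub(r"\[[^\]]*\]", " ", step1)     # drop the bracketed title
step3 = re.sub(r"VOLUME \S+", " ", step2)     # drop "VOLUME I", "VOLUME II", ...
step4 = re.sub(r"CHAPTER \S+", " ", step3)    # drop "CHAPTER I", "CHAPTER II", ...
step5 = re.sub(r"\s\s+", " ", step4).strip()  # collapse whitespace runs and trim

print(step5)
```

Printing the intermediate strings (`step1`, `step2`, and so on) shows exactly which characters each pattern removes.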

## spaCy

If working at home, you may need to install spaCy and download a set of English models at the command line:

```
pip install spacy
python -m spacy download en_core_web_sm
```

In the lab, or once you have done this at home, you should then be able to set up a spaCy processing pipeline as follows. If you are working on Colab, then this should work automatically.

In [8]:
import spacy
#nlp=spacy.load('en')
nlp=spacy.load('en_core_web_sm')
type(nlp)

spacy.lang.en.English

Now we can run any text string through the language processing pipeline stored in `nlp`.
This next cell might take a few minutes to run, since it carries out all of the spaCy NLP functionality on the input text. It will return a spaCy `Doc` object which contains the text plus various annotations. See the spaCy documentation at https://spacy.io/api/doc

In [9]:
nlp_emma=nlp(clean_emma)

In [10]:
type(nlp_emma)

spacy.tokens.doc.Doc

For example, we can now iterate over sentences in the document.

In [11]:
for s in nlp_emma.sents:
    print(s)
    break

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.


We can iterate over tokens in sentences and find out the labels added by spaCy to each token.

In [12]:
emma_sents=list(nlp_emma.sents)

In [13]:
print(emma_sents[0])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.


In [14]:
def display_sent(asent):
    headings=["token","lower","lemma","pos","NER"]
    info=[]
    for t in asent:
        info.append([t.text,t.lower_,t.lemma_,t.pos_,t.ent_type_])
    return(pd.DataFrame(info,columns=headings))
        
display_sent(emma_sents[3])

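|   | token | lower | lemma | pos | NER |
|---|-------|-------|-------|-----|-----|
| 0 | Sixteen | sixteen | sixteen | NUM | DATE |
| 1 | years | years | year | NOUN | DATE |
| 2 | had | had | have | AUX | |
| 3 | Miss | miss | Miss | PROPN | |
| 4 | Taylor | taylor | Taylor | PROPN | PERSON |
| 5 | been | been | be | AUX | |
| 6 | in | in | in | ADP | |
| 7 | Mr. | mr. | Mr. | PROPN | |
| 8 | Woodhouse | woodhouse | Woodhouse | PROPN | PERSON |
| 9 | 's | 's | 's | PART | |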


### Exercise 1.1
Run the `display_sent()` function on each of the first ten sentences of Emma (as stored in `emma_sents`).
* What errors do you see in the named entity recognition?
* Can you see any patterns in the words, lemmas or part-of-speech tags which might be used to improve the named entity recognition on these sentences?


### Exercise 1.2
Write a function `make_tag_lists()` which takes a list of sentences as input and which returns 3 lists:
1. tokens
2. POS tags
3. Named Entity tags

These lists should be the same length (189191, if applied to all of the sentences in `nlp_emma`) and maintain the order of the text, i.e., position i in each list should refer to the same token.

In [15]:
def make_tag_lists(sents):
    tokens=[]
    pos_tags=[]
    ner_tags=[]
    for sent in sents:
        for t in sent:
            tokens.append(t.text)
            pos_tags.append(t.pos_)
            ner_tags.append(t.ent_type_)
    return tokens,pos_tags,ner_tags
toks,pos,ner=make_tag_lists(emma_sents)

In [16]:
print(len(toks),len(pos),len(ner))

189191 189191 189191


### Exercise 1.3
Write a function `extract_entities` which takes a list of tokens, a list of tags and a tag-type and returns a dictionary of all of the **chunks** which have the given tag-type; together with their frequency in the text.

You can assume that two consecutive tokens with the same tag are part of the same chunk.

Test your code and you should get the following output (for the given input):

<img src=output-13.png>

This tells us that "Anne Cox" is tagged twice as a named entity of type "PERSON" in the text.  How many occurrences of "Miss Woodhouse" tagged as a "PERSON" are there?
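
The assumption that consecutive tokens with the same tag form one chunk can be tried out on a toy example before running anything on the whole novel. The sketch below is one possible way to do the grouping, using `itertools.groupby`; the helper name `chunk_by_tag` and the toy tokens and tags are invented for illustration and are not part of the lab code.

```python
from itertools import groupby

def chunk_by_tag(tokens, tags):
    """Group consecutive tokens that share a tag into (chunk_text, tag) pairs."""
    chunks = []
    position = 0
    for tag, group in groupby(tags):
        length = len(list(group))  # number of consecutive tokens with this tag
        chunks.append((" ".join(tokens[position:position + length]), tag))
        position += length
    return chunks

# Invented tokens and tags (the empty string means "no entity", as in spaCy)
toy_tokens = ["Anne", "Cox", "met", "Emma", "in", "Box", "Hill"]
toy_tags = ["PERSON", "PERSON", "", "PERSON", "", "GPE", "GPE"]

toy_chunks = chunk_by_tag(toy_tokens, toy_tags)
print(toy_chunks)
```

Note that this groups untagged tokens too, so a frequency dictionary for one tag type would still need to filter on the second element of each pair.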

In [17]:
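def extract_entities(tokenlist,taglist,tagtype):
    entities={}
    inentity=False
    for token,tag in zip(tokenlist,taglist):
        if tag==tagtype:
            if inentity:
                #extend the current chunk
                entity+=" "+token
            else:
                #start a new chunk
                entity=token
                inentity=True
        elif inentity:
            #the chunk has ended, so count it
            entities[entity]=entities.get(entity,0)+1
            inentity=False
    #count a chunk that runs up to the very last token
    if inentity:
        entities[entity]=entities.get(entity,0)+1
    return entities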

In [18]:
extract_entities(toks,ner,"PERSON")

{'Abominable scoundrel!"--': 1,
 'Aladdin': 1,
 'Anna Weston': 1,
 'Anne Cox': 2,
 'Arthur!--How': 1,
 'Astley': 4,
 'Augusta': 2,
 'Augusta Hawkins': 1,
 'Aye': 12,
 'Baly - craig': 1,
 'Bates': 142,
 'Bates.--He': 1,
 'Bella': 3,
 'Bickerton': 2,
 'Bird': 1,
 'Bought': 1,
 'Box Hill': 13,
 'Box Hill--': 1,
 'Bragge': 7,
 'Bristol': 2,
 'Broadwood': 2,
 "Brown's": 1,
 'Campbell': 51,
 'Captain Weston': 1,
 'Catherine': 1,
 'Churchill': 69,
 'Churchill--': 1,
 'Churchills': 1,
 'Clara Partridge': 1,
 'Cole': 51,
 'Cole--': 1,
 'Coles': 8,
 'Cox': 4,
 'Coxe': 1,
 'Coxes': 3,
 'Cromer': 2,
 'Dear Harriet': 1,
 'Dear Jane': 1,
 'Dearer': 1,
 'Dixon': 39,
 'Donwell': 6,
 'Donwell Abbey': 1,
 'Donwell Lane': 1,
 'E.': 11,
 'Easter': 1,
 'Elizabeth': 5,
 'Elizabeth Martin': 1,
 "Elizabeth Martin 's": 2,
 'Elton': 377,
 'Elton _': 1,
 'Elton!--': 1,
 'Elton!--`Jane Fairfax': 1,
 'Elton!--no': 1,
 "Elton's": 2,
 'Elton--': 1,
 'Elton.--': 1,
 'Elton?--': 1,
 'Eltons': 2,
 'Emma': 168,
 'Emma W

### Exercise 1.4
Use your code to find 
* the 20 most commonly referred to people in Emma
* the 20 most commonly referred to places in Emma

In [19]:
people=extract_entities(toks,ner,"PERSON")
top_people=sorted(people.items(),key=operator.itemgetter(1),reverse=True)[:100]
print(top_people)

[('Harriet', 425), ('Weston', 412), ('Elton', 377), ('Knightley', 283), ('Jane', 182), ('Emma', 168), ('Woodhouse', 145), ('Bates', 142), ('Frank Churchill', 129), ('Fairfax', 112), ('Miss Woodhouse', 96), ('Jane Fairfax', 85), ('Perry', 75), ('Churchill', 69), ('Isabella', 65), ('Goddard', 58), ('Frank', 56), ('John Knightley', 54), ('Cole', 51), ('Campbell', 51), ('Martin', 49), ('Taylor', 45), ('Smith', 42), ('Dixon', 39), ('John', 26), ('Robert Martin', 24), ('Harriet Smith', 22), ('Henry', 20), ('James', 18), ('Hartfield', 17), ("Frank Churchill 's", 14), ("Jane Fairfax 's", 14), ('Suckling', 14), ('Box Hill', 13), ('Nash', 12), ('Aye', 12), ('Patty', 12), ('William Larkins', 12), ('E.', 11), ("Miss Woodhouse 's", 10), ('Wingfield', 9), ('Enscombe', 8), ('Coles', 8), ('Smallridge', 8), ('papa', 7), ('Bragge', 7), ('Yorkshire', 6), ('Miss Smith', 6), ('Knightleys', 6), ('Miss Hawkins', 6), ('Donwell', 6), ('Selina', 6), ('Elizabeth', 5), ('George', 5), ('Surry', 5), ('Weymouth', 5)

In [20]:
places=extract_entities(toks,ner,"GPE")
top_places=sorted(places.items(),key=operator.itemgetter(1),reverse=True)[:20]
print(top_places)

[('Hartfield', 131), ('London', 44), ('Enscombe', 15), ('Ireland', 14), ('Richmond', 13), ('Highbury', 8), ('Kingston', 8), ('England', 8), ('Fairfax', 7), ('Harriet', 5), ('Maple Grove', 5), ('Surry', 4), ('Woodhouses', 3), ('Emma Woodhouse', 2), ('thing?--why', 2), ('Weymouth', 2), ('Serle', 2), ('Selina', 2), ('us', 2), ('Birmingham', 2)]


In [21]:
places=extract_entities(toks,ner,"LOC")
top_places=sorted(places.items(),key=operator.itemgetter(1),reverse=True)[:20]
print(top_places)

[('South End', 5), ('earth', 4), ('Eltons', 4), ('Bateses', 2), ('Maple Grove', 2), ('Windsor', 2), ('Harriet', 1), ('Miss', 1), ('can;--', 1), ('preparation;--', 1), ('Aunt Emma', 1), ('it!--It', 1), ('Emma', 1), ('stopped.--The', 1), ('extent.-- Harriet', 1), ('Box Hill', 1), ('the least.--Poor', 1)]
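
As an alternative to `sorted()` with `operator.itemgetter`, the ranking in the cells above could also be produced with `collections.Counter`. This is only a sketch: `toy_counts` is a hand-picked subset of the PERSON counts printed earlier, standing in for the full dictionary returned by `extract_entities`.

```python
from collections import Counter

# A few of the PERSON counts from the output above
toy_counts = {"Aye": 12, "Harriet": 425, "Elton": 377, "Weston": 412}

# most_common(n) returns the n highest-count items, largest first
top_three = Counter(toy_counts).most_common(3)
print(top_three)
```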


In [22]:
set(ner)

{'',
 'CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

### Exercise 1.5
Look at the lists of people and places generated.  Assuming no knowledge of the characters and plot of Emma, what errors can you see?

## Extensions

Code one or more of the following extensions.  In all cases, compare the lists of most frequently occurring named entities generated with the original ones.

### Expanding NER Chunks
* if the word immediately before or after a named entity chunk is POS-tagged as a PROPN, assume that this word is also part of the named entity chunk

For example, where the token "Miss" has pos-tag "PROPN" and is immediately followed by a token labelled with "PERSON", then it should also be labelled with "PERSON". 
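
A minimal sketch of this idea is given below, working on the parallel POS and NER tag lists from Exercise 1.2. The function name `expand_entity_tags` is hypothetical, and one design choice goes slightly beyond the description above: a run of several untagged PROPN tokens next to an entity is absorbed as a whole, not only the single adjacent word. The toy tags are the ones shown in the `display_sent` output earlier.

```python
def expand_entity_tags(pos_tags, ner_tags):
    """Return a copy of ner_tags where untagged PROPN tokens next to an entity inherit its tag."""
    expanded = list(ner_tags)
    n = len(expanded)
    # Forward pass: an untagged PROPN directly after an entity token joins that entity
    for i in range(1, n):
        if expanded[i] == "" and pos_tags[i] == "PROPN" and expanded[i - 1] != "":
            expanded[i] = expanded[i - 1]
    # Backward pass: an untagged PROPN directly before an entity token joins that entity
    for i in range(n - 2, -1, -1):
        if expanded[i] == "" and pos_tags[i] == "PROPN" and expanded[i + 1] != "":
            expanded[i] = expanded[i + 1]
    return expanded

# Tags for: Sixteen years had Miss Taylor been in Mr. Woodhouse 's
toy_pos = ["NUM", "NOUN", "AUX", "PROPN", "PROPN", "AUX", "ADP", "PROPN", "PROPN", "PART"]
toy_ner = ["DATE", "DATE", "", "", "PERSON", "", "", "", "PERSON", ""]

new_ner = expand_entity_tags(toy_pos, toy_ner)
print(new_ner)
```

Here "Miss" and "Mr." pick up the PERSON tag of the token that follows them, so the chunks become "Miss Taylor" and "Mr. Woodhouse".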

### Relabelling NER Chunks
* if a named entity occurs more frequently elsewhere in the text as a different type, assume that it has been mis-labelled here

For example, all 9 occurrences of "Jane Fairfax" labelled as "GPE" could be relabelled as "PERSON".
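
One possible way to implement this, sketched below, is to collect the frequency dictionaries for several tag types and move every entity to the type under which it occurs most often. The function name `relabel_by_majority` and the input layout (tag type mapped to an `extract_entities`-style dictionary) are assumptions for illustration; the toy counts are taken from the outputs above.

```python
from collections import defaultdict

def relabel_by_majority(counts_by_type):
    """counts_by_type maps tag type -> {entity text: frequency}.
    Returns the same layout with each entity moved to its most frequent type."""
    by_entity = defaultdict(dict)  # entity text -> {tag type: frequency}
    for tagtype, counts in counts_by_type.items():
        for entity, freq in counts.items():
            by_entity[entity][tagtype] = freq
    relabelled = defaultdict(dict)
    for entity, type_counts in by_entity.items():
        # On a tie, max() keeps whichever type was seen first
        best_type = max(type_counts, key=type_counts.get)
        relabelled[best_type][entity] = sum(type_counts.values())
    return dict(relabelled)

# A few counts from the PERSON, GPE and LOC outputs above
toy_counts_by_type = {
    "PERSON": {"Harriet": 425, "Hartfield": 17},
    "GPE": {"Hartfield": 131, "Harriet": 5},
    "LOC": {"Harriet": 1},
}

merged = relabel_by_majority(toy_counts_by_type)
print(merged)
```

A majority vote like this will also propagate systematic errors, so it is worth comparing the new top-20 lists against the original ones.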

### Linking NEs
* find candidates for named entity linking.  

For example, "Churchill" and "Frank Churchill" and "Frank" might all refer to the same person.
However, you should proceed with care.  Anyone who knows the story well would tell you that "Knightley" and "John Knightley" do not refer to the same character (they are brothers).  As a further extension, give your linking functionality access to a list of known characters e.g., from https://www.janeausten.org/emma/cast-of-characters.asp
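
As a starting point, candidates can be generated by pairing each name with any longer name that contains all of its words. The sketch below does only that; `linking_candidates` is a hypothetical helper, and the output is a list of candidates to inspect, not a decision that the names co-refer.

```python
def linking_candidates(names):
    """Map each name to the longer names that contain all of its words."""
    candidates = {}
    for short in names:
        short_words = set(short.split())
        # "<" on sets is a proper-subset test, so a name never matches itself
        matches = [longer for longer in names
                   if short_words < set(longer.split())]
        if matches:
            candidates[short] = sorted(matches)
    return candidates

toy_names = ["Churchill", "Frank Churchill", "Frank",
             "Knightley", "John Knightley", "Harriet"]

links = linking_candidates(toy_names)
print(links)
```

Notice that "Knightley" is paired with "John Knightley" here, which is exactly the kind of false candidate that a list of known characters could help to filter out.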

### Co-occurring NEs
* find NEs that tend to co-occur together.

Can you find pairs of named entities which often occur together (or even better, occur more often together than one would expect if named entities occur independently)?  You could consider pairs of people or alternatively co-occurrences of people and places.
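
One common way to measure "more often than expected under independence" is pointwise mutual information (PMI), the log of the observed co-occurrence probability divided by the product of the individual probabilities. The sketch below assumes you have already built one set of entity names per sentence (or per paragraph); the function name `cooccurrence_pmi` and the toy data are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(entity_sets):
    """entity_sets: one set of entity names per sentence (or other window).
    Returns {(a, b): PMI} for every pair that co-occurs at least once."""
    n = len(entity_sets)
    single = Counter()  # number of windows containing each entity
    pair = Counter()    # number of windows containing both entities
    for entities in entity_sets:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }

# Invented windows: each set holds the entities found in one sentence
toy_windows = [{"Emma", "Harriet"}, {"Emma", "Harriet"}, {"Emma"}, {"Weston"}]

pmi = cooccurrence_pmi(toy_windows)
print(pmi)
```

A positive score means the pair occurs together more often than independence would predict. PMI tends to favour rare pairs, so a minimum count threshold is usually sensible on real data.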

### NEs over Time
* record the position in the text of each named entity occurrence
* make a plot showing how the number of occurrences of a given named entity varies with position in the text

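If you store each text position in `list_of_indices`, you could use
`pd.Series(np.histogram(list_of_indices, bins=num_bins)[0])` to help you with this (after `import numpy as np`). `np.histogram` returns a pair of arrays, the counts and the bin edges, so `[0]` selects the counts.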
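
To make the binning step concrete, here is a dependency-free sketch of the counting that a histogram performs. The function `position_histogram` is a hypothetical helper, and unlike NumPy's default it spreads the bins over the whole text (from 0 to `total_length`) instead of from the smallest to the largest recorded position.

```python
def position_histogram(positions, total_length, num_bins):
    """Count how many positions fall into each of num_bins equal-width slices of the text."""
    counts = [0] * num_bins
    for p in positions:
        # min() guards against an index equal to total_length landing outside the last bin
        counts[min(p * num_bins // total_length, num_bins - 1)] += 1
    return counts

# Invented token positions of one entity in a 100-token text
toy_positions = [0, 3, 4, 50, 98, 99]

toy_hist = position_histogram(toy_positions, total_length=100, num_bins=4)
print(toy_hist)
```

The resulting list can be plotted directly, for example by wrapping it in a `pd.Series` and calling `.plot()`.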
