Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [Named Entity Recognition with NLTK and SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da) by Susan Li.

# SEQUENCE LABELING

## Part Of Speech Tagging and Named Entity Recognition using NLTK or spaCy

Part Of Speech (POS) tagging and Named Entity Recognition (NER) are two of the best-known examples of sequence labeling tasks.

### POS Tagging

POS tagging consists of assigning to each word its morpho-syntactic category.

NLTK includes a [POS tagger](https://www.nltk.org/api/nltk.tag.html) that we can use. We can check the tagset used by the tagger as follows:

In [1]:
import nltk

# The tagset documentation is a separate resource that must be downloaded once.
# If a LookupError names a different resource (newer NLTK releases may ask for
# 'tagsets_json'), download the name given in the error message instead.
nltk.download('tagsets')

nltk.help.upenn_tagset()


To use the POS tagger, we first need to tokenize the text. Try out NLTK's [*pos_tag*](https://www.nltk.org/api/nltk.tag.html) with the following text, and analyse the POS tags you get:

In [2]:
from nltk import word_tokenize
from nltk import pos_tag

text = """European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices."""

# your code here
pos_tag(word_tokenize(text)) 

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS'),
 ('.', '.')]

#### Training a POS tagger in NLTK

NLTK also allows you to train simple tagging models based on n-grams, where the model takes into account the tags assigned to the *n-1* words preceding the target word.

Let's try that using [Floresta Sintá(c)tica](https://www.linguateca.pt/Floresta/), a Portuguese corpus annotated with POS tags (we will follow [this tutorial](https://www.nltk.org/howto/portuguese_en.html)):

In [4]:
nltk.download('floresta')

[nltk_data] Downloading package floresta to
[nltk_data]     C:\Users\ineso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\floresta.zip.


True

In [5]:
from nltk.corpus import floresta

print(len(floresta.sents()))
print(floresta.sents())
print(floresta.tagged_sents())

9266
[['Um', 'revivalismo', 'refrescante'], ['O', '7_e_Meio', 'é', 'um', 'ex-libris', 'de', 'a', 'noite', 'algarvia', '.'], ...]
[[('Um', '>N+art'), ('revivalismo', 'H+n'), ('refrescante', 'N<+adj')], [('O', '>N+art'), ('7_e_Meio', 'H+prop'), ('é', 'P+v-fin'), ('um', '>N+art'), ('ex-libris', 'H+n'), ('de', 'H+prp'), ('a', '>N+art'), ('noite', 'H+n'), ('algarvia', 'N<+adj'), ('.', '.')], ...]


The tags consist of some syntactic information, followed by a plus sign, followed by a conventional part-of-speech tag. We need to strip off the material before the plus sign:

In [6]:
def simplify_tag(t):
    if "+" in t:
        return t[t.index("+")+1:]
    else:
        return t

tsents = [[(w.lower(),simplify_tag(t)) for (w,t) in sent] for sent in floresta.tagged_sents()]
tsents

[[('um', 'art'), ('revivalismo', 'n'), ('refrescante', 'adj')],
 [('o', 'art'),
  ('7_e_meio', 'prop'),
  ('é', 'v-fin'),
  ('um', 'art'),
  ('ex-libris', 'n'),
  ('de', 'prp'),
  ('a', 'art'),
  ('noite', 'n'),
  ('algarvia', 'adj'),
  ('.', '.')],
 [('é', 'v-fin'),
  ('uma', 'num'),
  ('de', 'prp'),
  ('as', 'art'),
  ('mais', 'adv'),
  ('antigas', 'adj'),
  ('discotecas', 'n'),
  ('de', 'prp'),
  ('o', 'art'),
  ('algarve', 'prop'),
  (',', ','),
  ('situada', 'v-pcp'),
  ('em', 'prp'),
  ('albufeira', 'prop'),
  (',', ','),
  ('que', 'pron-indp'),
  ('continua', 'v-fin'),
  ('a', 'prp'),
  ('manter', 'v-inf'),
  ('os', 'art'),
  ('traços', 'n'),
  ('decorativos', 'adj'),
  ('e', 'conj-c'),
  ('as', 'art'),
  ('clientelas', 'n'),
  ('de', 'prp'),
  ('sempre', 'adv'),
  ('.', '.')],
 [('é', 'v-fin'),
  ('um_pouco', 'adv'),
  ('a', 'art'),
  ('versão', 'n'),
  ('de', 'prp'),
  ('uma', 'art'),
  ('espécie', 'n'),
  ('de', 'prp'),
  ('«', '«'),
  ('outro', 'pron-det'),
  ('lado', 'n'),


Now we can split our data into a train and a test set. Let's keep 100 sentences in the test set.

In [7]:
train = tsents[100:]
test = tsents[:100]

Let's see how we do with a unigram tagger, which simply assigns each token the tag it received most often in the training set.

In [8]:
tagger1 = nltk.UnigramTagger(train)

We can check how the tagger performs on the test set by using the *evaluate* method, which gives us the model's accuracy (recent NLTK releases deprecate *evaluate* in favor of an equivalent *accuracy* method).

In [9]:
tagger1.evaluate(test)

0.8511016346837242

Try tagging a user-generated sentence. Don't forget to tokenize it and lower-case the obtained tokens, following what we have done with the corpus above. To tag a list of tokens, you can invoke the *tag* method on the tagger.

In [13]:
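# your code here
text = input("Enter a sentence: ")

# lower-case the tokens, since the tagger was trained on lower-cased words
tokens = [token.lower() for token in word_tokenize(text)]
tagger1.tag(tokens)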

[('o', 'art'),
 ('meu', 'pron-det'),
 ('nome', 'n'),
 ('é', 'v-fin'),
 ('ines', None)]
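
The `None` tag for *ines* above shows that a unigram tagger has no answer for words it never saw during training. One possible remedy, sketched below on a tiny hand-made training set, is to back off to NLTK's `DefaultTagger`. The toy sentences and the choice of `'n'` as the fallback tag are illustrative assumptions, not part of the Floresta setup.

```python
import nltk

# Toy training data (hypothetical, for illustration only)
toy_train = [
    [('o', 'art'), ('meu', 'pron-det'), ('nome', 'n'), ('é', 'v-fin'), ('ana', 'prop')],
    [('o', 'art'), ('gato', 'n'), ('dorme', 'v-fin')],
]

# A unigram tagger on its own returns None for unseen words
plain = nltk.UnigramTagger(toy_train)

# Backing off to a default tag guarantees that every token gets some tag
# ('n' is an assumed fallback; any tag from the tagset could be used)
default = nltk.DefaultTagger('n')
with_default = nltk.UnigramTagger(toy_train, backoff=default)

tokens = ['o', 'meu', 'nome', 'é', 'ines']
plain_tags = plain.tag(tokens)
fallback_tags = with_default.tag(tokens)
print(plain_tags)
print(fallback_tags)
```

Guessing "noun" for every unknown word is crude, but it keeps later taggers in a backoff chain from receiving `None` as context.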

We can now try out a bigram model for POS tagging, which will take into account the tag assigned to the previous word. If the combination of previous tag and target word hasn't been seen in the training set, however, the model will fail to tag the target word, even if the word itself did appear in the training set; the missing tag then also leaves the following word without a usable context. For that reason, it is convenient to back off to the previous unigram tagger: if we know nothing about the tag of the previous word, we can still use the most likely tag for the target word.

In [14]:
tagger2 = nltk.BigramTagger(train, backoff=tagger1, verbose=True)

[Trained Unigram tagger: size=2003, backoff=73.09%, pruning=95.91%]


The *verbose* flag outputs some information, namely the amount of backoff used.

Check the performance of this tagger, and compare it with the performance of a bigram tagger with no backoff strategy.

In [15]:
# your code here
tagger2.evaluate(test)

0.8731343283582089
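
The comparison with a bigram tagger that has no backoff is not shown above. The sketch below illustrates the effect on a toy corpus rather than on Floresta: the sentences are made up, and `token_accuracy` is a hypothetical helper written here to avoid depending on a particular NLTK version's evaluation method.

```python
import nltk

# Toy corpus (hypothetical sentences, for illustration only)
toy_train = [
    [('o', 'art'), ('gato', 'n'), ('dorme', 'v-fin')],
    [('a', 'art'), ('casa', 'n'), ('é', 'v-fin'), ('grande', 'adj')],
]
# "grande" never follows a noun in the training data
toy_test = [[('o', 'art'), ('gato', 'n'), ('grande', 'adj'), ('dorme', 'v-fin')]]

def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        for (_, predicted), (_, gold) in zip(tagger.tag(words), sent):
            correct += predicted == gold
            total += 1
    return correct / total

unigram = nltk.UnigramTagger(toy_train)
bigram_only = nltk.BigramTagger(toy_train)                      # no backoff
bigram_backoff = nltk.BigramTagger(toy_train, backoff=unigram)  # backs off to unigram

bigram_only_tags = bigram_only.tag(['o', 'gato', 'grande', 'dorme'])
acc_only = token_accuracy(bigram_only, toy_test)
acc_backoff = token_accuracy(bigram_backoff, toy_test)
print(bigram_only_tags)
print(acc_only, acc_backoff)
```

In this toy setting the bigram-only tagger leaves *grande* untagged, and the word after it as well, because its context now contains a `None` tag. The same two taggers can be built from `train` and scored on `test` to answer the exercise on the real corpus.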

Build a trigram tagger with backoff to the bigram tagger and check its performance.

In [16]:
# your code here
tagger3 = nltk.TrigramTagger(train, backoff=tagger2, verbose=True)
tagger3.evaluate(test)

[Trained Unigram tagger: size=1638, backoff=80.09%, pruning=97.80%]


0.8749111584932481

### Named Entity Recognition

NER consists of detecting named entities in the text, which can correspond to several different categories, such as person names, organizations, dates, and so on.

#### Chunking

Our first attempt to detect names in English may consist of chunking certain parts of the text that correspond to a pattern of POS tags. For that, we define a noun phrase (NP) pattern consisting of (i) an optional *determiner*, followed by (ii) zero or more *adjectives*, followed by (iii) a *noun*.

We can use NLTK's *RegexpParser* and supply it with an appropriate regular expression.

In [17]:
# creating a chunk parser
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(pattern)

With our chunk parser, we can parse our sentence's POS-tagged list of tokens.

In [18]:
from nltk import word_tokenize
from nltk import pos_tag

text = """European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices."""
pos_tokens = pos_tag(word_tokenize(text))

# generating a parse tree
cs = cp.parse(pos_tokens)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS
  ./.)
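
The pattern above only accepts singular common nouns (`NN`), so *European authorities* and *Google* are not chunked, and *mobile phone market* is split in two. As a possible variation (an assumption, not part of the original exercise), the sketch below widens the noun slot to any tag starting with `NN` and allows noun sequences. The tagged tokens are copied by hand from the tagger output shown earlier, so no model download is needed.

```python
import nltk

# POS-tagged tokens copied from the tagger output shown earlier
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'), ('$', '$'),
          ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP')]

# <NN.*> matches NN, NNS, NNP and NNPS; the trailing + allows noun sequences
wide_pattern = 'NP: {<DT>?<JJ>*<NN.*>+}'
wide_cp = nltk.RegexpParser(wide_pattern)
wide_tree = wide_cp.parse(tagged)

# collect the words of each NP chunk, in sentence order
chunks = [' '.join(word for word, tag in subtree.leaves())
          for subtree in wide_tree.subtrees()
          if subtree.label() == 'NP']
print(chunks)
```

Such a pattern finds more candidate names, but it still cannot tell a named entity from an ordinary noun phrase like *a record*.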


A more appealing way of visualizing the result is to simply show the obtained parse tree, with *S* (for sentence) at the first level. In the run recorded here, the Ghostscript executable that NLTK needed to draw the tree was not installed, so the notebook fell back to the textual representation of the tree:

In [21]:
cs

Tree('S', [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'), ('Google', 'NNP'), Tree('NP', [('a', 'DT'), ('record', 'NN')]), ('$', '$'), ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP'), ('for', 'IN'), ('abusing', 'VBG'), ('its', 'PRP$'), Tree('NP', [('power', 'NN')]), ('in', 'IN'), Tree('NP', [('the', 'DT'), ('mobile', 'JJ'), ('phone', 'NN')]), Tree('NP', [('market', 'NN')]), ('and', 'CC'), ('ordered', 'VBD'), Tree('NP', [('the', 'DT'), ('company', 'NN')]), ('to', 'TO'), ('alter', 'VB'), ('its', 'PRP$'), ('practices', 'NNS'), ('.', '.')])

Based on the obtained chunks, we can generate IOB tags for each of the elements in the sentence. For each chunk, we will get a **B**egin tag for its first token, optionally followed by **I**nside tags for subsequent tokens in the chunk. Tokens that belong to no chunk get an **O**utside tag.

In [22]:
# generating IOB tags for the tree: one token per line, each with its POS tag and its chunk tag
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O'),
 ('.', '.', 'O')]
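
The IOB encoding carries the same information as the chunk tree. A minimal sketch of the reverse direction, using `conlltags2tree` on a few triples copied from the output above, could look like this:

```python
from nltk.chunk import conlltags2tree, tree2conlltags

# (word, POS tag, IOB chunk tag) triples, copied from the output above
iob_triples = [('its', 'PRP$', 'O'), ('power', 'NN', 'B-NP'), ('in', 'IN', 'O'),
               ('the', 'DT', 'B-NP'), ('mobile', 'JJ', 'I-NP'), ('phone', 'NN', 'I-NP'),
               ('market', 'NN', 'B-NP')]

# rebuild the chunk tree from the IOB tags
rebuilt = conlltags2tree(iob_triples)
print(rebuilt)

# converting back yields the same triples, so the two views are equivalent
round_trip = tree2conlltags(rebuilt)
print(round_trip == iob_triples)
```

Note how the `B-NP` tag on *market* is what separates it from the preceding chunk *the mobile phone*; with only inside/outside tags, the two adjacent chunks could not be told apart.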


NLTK provides a classifier that has already been trained to recognize named entities: [*ne_chunk*](https://www.nltk.org/book/ch07.html#duck_typing_index_term).

In [27]:
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ineso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [29]:
from nltk.chunk import ne_chunk

ne_tree = ne_chunk(pos_tokens)
print(ne_tree)
#ne_tree

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS
  ./.)


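It's far from perfect, is it? *Google* is labeled as a PERSON, *European* as a GPE (geo-political entity), and neither the monetary amount nor the date is recognized.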
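
To work with the recognized entities rather than read them off the printed tree, one can walk the tree and keep only the labeled subtrees. The sketch below does this on a hand-built copy of the first few nodes of the tree above, so that it runs without downloading the chunker models; `tree_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built copy of the first few nodes of the tree printed above
sample_tree = Tree('S', [
    Tree('GPE', [('European', 'JJ')]),
    ('authorities', 'NNS'),
    ('fined', 'VBD'),
    Tree('PERSON', [('Google', 'NNP')]),
    ('a', 'DT'),
    ('record', 'NN'),
])

def tree_entities(tree):
    """Return (entity text, entity label) pairs for the labeled chunks of a tree."""
    return [(' '.join(word for word, tag in child.leaves()), child.label())
            for child in tree
            if isinstance(child, Tree)]

entities = tree_entities(sample_tree)
print(entities)
```

Calling `tree_entities(ne_tree)` on the tree produced by `ne_chunk` should work the same way, since that tree has the same shape.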

### spaCy

spaCy includes several [language processing pipelines](https://spacy.io/usage/processing-pipelines) that perform several NLP tasks in a single pass. We can use one of the available [trained pipelines](https://spacy.io/models).

In [30]:
import spacy

# the pipeline must be installed first, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

#### Entity level

SpaCy’s named entity recognition has been trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
We can directly obtain the entities identified by spaCy:

In [31]:
from pprint import pprint

doc = nlp("""European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices.""")

pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


#### Token level
We can also get the BIO encoding for the identified entities:

In [32]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (
, 'O', ''),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (
, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', ''),
 (., 'O', '')]
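
The token-level view can be turned back into entity spans by grouping each `B` token with the `I` tokens that follow it. The function below is a minimal sketch of that grouping in plain Python; `bio_to_spans` is a hypothetical helper, and the triples are copied from the first tokens of the output above (with strings standing in for spaCy tokens).

```python
def bio_to_spans(triples):
    """Group (text, iob, type) triples into (entity text, type) spans."""
    spans = []
    words, label = [], None
    for text, iob, ent_type in triples:
        if iob == 'I' and words:
            words.append(text)          # continue the open entity
            continue
        if words:                       # any other tag closes the open entity
            spans.append((' '.join(words), label))
            words, label = [], None
        if iob in ('B', 'I'):           # start a new entity (a stray 'I' is treated as 'B')
            words, label = [text], ent_type
    if words:                           # close an entity that ends the sequence
        spans.append((' '.join(words), label))
    return spans

triples = [('European', 'B', 'NORP'), ('authorities', 'O', ''), ('fined', 'O', ''),
           ('Google', 'B', 'ORG'), ('a', 'O', ''), ('record', 'O', ''),
           ('$', 'B', 'MONEY'), ('5.1', 'I', 'MONEY'), ('billion', 'I', 'MONEY'),
           ('on', 'O', ''), ('Wednesday', 'B', 'DATE')]
spans = bio_to_spans(triples)
print(spans)
```

Because the sketch joins tokens with single spaces, the amount comes out as `$ 5.1 billion`; spaCy's own `doc.ents` preserves the original spacing.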


#### NER from a document
Let's use spaCy to do NER on an actual web document:

In [33]:
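from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a web page and return its visible text as a single string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')  # requires the html5lib package
    # drop elements whose content is not part of the article text
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))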

url = 'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news'
clean_text = url_to_string(url)
article = nlp(clean_text)
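
Some of the entity strings in the results further down, such as `SectionsSEARCHSkip`, suggest that text from neighboring HTML elements ends up glued together. A likely cause is that `get_text()` joins text fragments without any separator. The sketch below shows the difference on a tiny hand-written page (hypothetical markup, parsed with Python's built-in `html.parser` instead of `html5lib`):

```python
from bs4 import BeautifulSoup

# Tiny hand-written page (hypothetical), parsed with Python's built-in parser
html = ("<div><span>Sections</span><span>SEARCH</span>"
        "<script>var x = 1;</script><p>Skip to content</p></div>")
soup = BeautifulSoup(html, 'html.parser')
for element in soup(['script', 'style']):
    element.extract()

glued = soup.get_text()                   # no separator between elements
spaced = soup.get_text(' ', strip=True)   # one space between text fragments
print(repr(glued))
print(repr(spaced))
```

Passing a separator in `url_to_string` would probably give spaCy cleaner input, but it would also change the entity counts reported below.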

How many entities were extracted from the document?

In [36]:
# your code here
len(article.ents)

164

How many instances are there for each entity type?

In [41]:
# your code here
instances = {}
for ent in article.ents:
    if ent.text not in instances:
        instances[ent.text] = 0
    instances[ent.text] += 1
instances

{'F.B.I.': 19,
 'Peter Strzok': 4,
 'Texts': 3,
 'SectionsSEARCHSkip': 1,
 'inToday': 1,
 'PaperPolitics|F.B.I.': 1,
 'byContinue': 1,
 'storyF.B.I.': 1,
 'storyAs': 1,
 '10': 1,
 'each month': 1,
 'appPeter Strzok': 1,
 'Trump': 11,
 'T.J. Kirkpatrick': 1,
 'The New York': 1,
 'Adam Goldman': 1,
 'Michael S. SchmidtAug': 1,
 '13': 1,
 '2018WASHINGTON': 1,
 'Hillary Clinton': 2,
 'Russia': 6,
 'Strzok': 28,
 'Monday': 2,
 '2016': 3,
 'Lisa Page': 1,
 '20 years': 1,
 'the early months': 1,
 'last summer': 1,
 'Robert S. Mueller III': 1,
 'Twitter': 2,
 'June': 1,
 'the bureau’s Office of Professional Responsibility': 1,
 '60 days': 1,
 'House': 1,
 'July': 1,
 'David Bowdich': 1,
 'the Office of Professional Responsibility': 1,
 'Bowdich': 1,
 'Christopher A. Wray': 1,
 'Aitan Goelman': 1,
 'Special Agent Strzok': 1,
 'Wray': 1,
 'Congress': 2,
 'Goelman': 2,
 'Americans': 1,
 'Page': 3,
 'one': 2,
 'Michael E. Horowitz': 1,
 'Clinton': 5,
 'just weeks': 1,
 'Horowitz': 3,
 'Hundreds': 

Which are the most mentioned entities?

In [45]:
# your code here
entity = max(instances, key=instances.get)
print(entity, instances[entity])

Strzok 28
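
The dictionary built above is keyed by entity text, so it counts mentions of each entity string. Counting instances per entity type would group by `label_` instead, and `max` returns only a single winner. A possible alternative for both questions, sketched with `collections.Counter` on stand-in objects (the `fake_ents` data is made up; with the real document one would iterate over `article.ents`):

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for spaCy entity spans (hypothetical data)
fake_ents = [SimpleNamespace(text=text, label_=label) for text, label in [
    ('Strzok', 'PERSON'), ('F.B.I.', 'ORG'), ('Strzok', 'PERSON'),
    ('Trump', 'PERSON'), ('Monday', 'DATE'), ('F.B.I.', 'ORG'), ('Strzok', 'PERSON'),
]]

# instances per entity type
type_counts = Counter(ent.label_ for ent in fake_ents)

# most frequently mentioned entity strings, with their counts
mention_counts = Counter(ent.text for ent in fake_ents)
top_two = mention_counts.most_common(2)

print(type_counts)
print(top_two)
```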


Checking out a specific sentence:

In [46]:
sentences = [x for x in article.sents]
a_sentence = sentences[20]
a_sentence

victory traces back to June, when Mr. Strzok’s conduct was laid out in a wide-ranging inspector general

Getting the BIO encoding for the sentence:

In [47]:
[(x, x.pos_, x.ent_iob_, x.ent_type_) for x in a_sentence]

[(victory, 'NOUN', 'O', ''),
 (traces, 'VERB', 'O', ''),
 (back, 'ADV', 'O', ''),
 (to, 'ADP', 'O', ''),
 (June, 'PROPN', 'B', 'DATE'),
 (,, 'PUNCT', 'O', ''),
 (when, 'ADV', 'O', ''),
 (Mr., 'PROPN', 'O', ''),
 (Strzok, 'PROPN', 'B', 'PERSON'),
 (’s, 'PART', 'O', ''),
 (conduct, 'NOUN', 'O', ''),
 (was, 'AUX', 'O', ''),
 (laid, 'VERB', 'O', ''),
 (out, 'ADP', 'O', ''),
 (in, 'ADP', 'O', ''),
 (a, 'DET', 'O', ''),
 (wide, 'ADV', 'O', ''),
 (-, 'PUNCT', 'O', ''),
 (ranging, 'VERB', 'O', ''),
 (inspector, 'NOUN', 'O', ''),
 (general, 'NOUN', 'O', '')]

We can simply output the mentioned entities and their categories:

In [48]:
dict([(str(x), x.label_) for x in a_sentence.ents])

{'June': 'DATE', 'Strzok': 'PERSON'}

We can also use spaCy's [visualizers](https://spacy.io/usage/visualizers) to better show the output of the NER model:

In [49]:
from spacy import displacy

displacy.render(a_sentence, jupyter=True, style='ent')

The displaCy visualizer also gets us POS information and dependency parsing:

In [50]:
displacy.render(a_sentence, style='dep', jupyter=True, options={'distance': 120})

Extracting entities for the full document:

In [51]:
for sent in sentences:
    displacy.render(sent, jupyter=True, style='ent')



#### NER for other languages

Try out other spaCy [pipelines](https://spacy.io/models) for other languages!

In [None]:
# your code here
