Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [Named Entity Recognition with NLTK and SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da) by Susan Li.

# SEQUENCE LABELING

## Part Of Speech Tagging and Named Entity Recognition using NLTK or spaCy

Part Of Speech (POS) tagging and Named Entity Recognition (NER) are two of the best-known examples of sequence labeling tasks.

### POS Tagging

POS tagging consists of assigning each word its morpho-syntactic category (noun, verb, adjective, and so on).

NLTK includes a [POS tagger](https://www.nltk.org/api/nltk.tag.html) that we can use. We can check the tagset used by the tagger as follows:

In [2]:
import nltk

nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

To use the POS tagger, we first need to tokenize the text. Try out NLTK's [*pos_tag*](https://www.nltk.org/api/nltk.tag.html) with the following text, and analyse the POS tags you get:

In [1]:
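from nltk import word_tokenize
from nltk import pos_tag

# If this cell raises a LookupError, a required NLTK resource is missing.
# Download the resource named in the error message once, then re-run, e.g.:
#   import nltk; nltk.download('averaged_perceptron_tagger')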

text = """European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices."""

tokens = word_tokenize(text)
pos_tag(tokens)

LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - '/home/xico2001pt/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


#### Training a POS tagger in NLTK

NLTK also allows you to train simple tagging models based on n-grams, where the model takes into account the tags assigned to the *n-1* words preceding the target word.

Let's try that using [Floresta Sintá(c)tica](https://www.linguateca.pt/Floresta/), a Portuguese corpus annotated with POS tags (we will follow [this tutorial](https://www.nltk.org/howto/portuguese_en.html)):

In [None]:
from nltk.corpus import floresta

print(len(floresta.sents()))
print(floresta.sents())
print(floresta.tagged_sents())

The tags consist of some syntactic information, followed by a plus sign, followed by a conventional part-of-speech tag. We need to strip off the material before the plus sign:

In [None]:
def simplify_tag(t):
    if "+" in t:
        return t[t.index("+")+1:]
    else:
        return t

tsents = [[(w.lower(),simplify_tag(t)) for (w,t) in sent] for sent in floresta.tagged_sents()]
tsents
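
To see what the stripping does, here is a small self-contained check. The tag strings below are illustrative examples of the "syntactic information + POS" format; they are assumptions for this sketch, not values read from the corpus at run time.

```python
def simplify_tag(t):
    # keep only what follows the first plus sign, if there is one
    if "+" in t:
        return t[t.index("+")+1:]
    else:
        return t

# illustrative tags in the "syntactic info + POS" format (assumed examples)
sample_tags = [">N+art", "H+n", "N<+adj", "art"]
simplified = [simplify_tag(t) for t in sample_tags]
print(simplified)
```

Tags without a plus sign are returned unchanged, so the function is safe to apply to every tag in the corpus.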

Now we can split our data into a train and a test set. Let's keep 100 sentences in the test set.

In [None]:
train = tsents[100:]
test = tsents[:100]

Let's see how we do with a unigram tagger, which simply assigns to each token the tag it most frequently carries in the training set.

In [None]:
tagger1 = nltk.UnigramTagger(train)
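
To make the idea behind a unigram tagger concrete, the following is a minimal pure-Python sketch of the same principle. It is not NLTK's implementation: the helper names `train_unigram` and `tag_unigram` are hypothetical, and the training sentences are made up for illustration rather than taken from Floresta.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens):
    """Tag each token independently; words never seen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# toy training data (hypothetical, for illustration only)
toy_train = [
    [("o", "art"), ("gato", "n"), ("dorme", "v-fin")],
    [("o", "art"), ("jantar", "n")],
    [("vamos", "v-fin"), ("jantar", "v-inf")],
    [("o", "art"), ("jantar", "n"), ("acabou", "v-fin")],
]
toy_model = train_unigram(toy_train)
toy_tagged = tag_unigram(toy_model, ["o", "jantar", "queimou"])
print(toy_tagged)
```

In this toy data "jantar" appears twice as a noun and once as an infinitive, so the sketch always tags it as a noun, whatever the context. The unseen word "queimou" gets no tag at all.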

We can check how the tagger performs on the test set by using the *evaluate* method, which gives us the model's accuracy. In recent NLTK releases *evaluate* is deprecated in favour of the equivalent *accuracy* method.

In [None]:
tagger1.evaluate(test)
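
Accuracy here is simply the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch of that computation follows; the helper name `tagging_accuracy` and the toy sentences are assumptions for illustration, and this is not NLTK's own code.

```python
def tagging_accuracy(predicted_sents, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for pred, gold in zip(predicted_sents, gold_sents):
        for (_, pred_tag), (_, gold_tag) in zip(pred, gold):
            if pred_tag == gold_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0

# toy data (hypothetical): the last token was left untagged by the model
gold = [[("o", "art"), ("gato", "n"), ("dorme", "v-fin")]]
pred = [[("o", "art"), ("gato", "n"), ("dorme", None)]]
toy_accuracy = tagging_accuracy(pred, gold)
print(toy_accuracy)
```

Note that a token left untagged (`None`) counts as an error, which is why unseen words lower the score of a tagger without backoff.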

Try tagging a user-generated sentence. Don't forget to tokenize it and lower-case the obtained tokens, following what we have done with the corpus above. To tag a list of tokens, you can invoke the *tag* method on the tagger.

In [None]:
# your code here


We can now try out a bigram model for POS tagging, which also takes into account the tag assigned to the previous word. If the combination of previous tag and target word was never seen in the training set, however, the model fails to tag the target word, even if that word did appear in the training set. For that reason, it is convenient to back off to the previous unigram tagger: when the bigram context is unknown, we can still use the most likely tag for the target word.

In [None]:
tagger2 = nltk.BigramTagger(train, backoff=tagger1, verbose=True)

The *verbose* flag outputs some information, namely the amount of backoff used.
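
The backoff mechanism can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not NLTK's implementation: the names `train_taggers`, `tag_with_backoff` and `START` are hypothetical, and the training sentences are made up.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the beginning of a sentence

def most_frequent(counts):
    """Keep only the most frequent tag for each context."""
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def train_taggers(tagged_sents):
    """Return (unigram, bigram) lookup tables built from tagged sentences."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_frequent(uni), most_frequent(bi)

def tag_with_backoff(unigram, bigram, tokens):
    """Try the (previous tag, word) context first, then the word alone."""
    tagged, prev_tag = [], START
    for tok in tokens:
        tag = bigram.get((prev_tag, tok))
        if tag is None:
            tag = unigram.get(tok)  # back off to the unigram table
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged

# toy training data (hypothetical, for illustration only)
toy_train = [
    [("o", "art"), ("jantar", "n")],
    [("o", "art"), ("jantar", "n"), ("acabou", "v-fin")],
    [("vamos", "v-fin"), ("jantar", "v-inf")],
]
unigram, bigram = train_taggers(toy_train)
in_context = tag_with_backoff(unigram, bigram, ["vamos", "jantar"])
with_backoff = tag_with_backoff(unigram, bigram, ["jantar", "acabou"])
no_backoff = tag_with_backoff({}, bigram, ["jantar", "acabou"])  # empty unigram table
print(in_context, with_backoff, no_backoff, sep="\n")
```

The first call shows the benefit of context: after a finite verb, "jantar" is tagged as an infinitive rather than as its most frequent tag. The last call passes an empty unigram table to mimic a bigram tagger with no backoff: once the first word goes untagged, the following context is unknown too, and the failure propagates through the sentence.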

Check the performance of this tagger, and compare it with the performance of a bigram tagger with no backoff strategy.

In [None]:
# your code here


Build a trigram tagger with backoff to the bigram tagger and check its performance.

In [None]:
# your code here


### Named Entity Recognition

NER consists of detecting named entities in the text, which can correspond to several different categories, such as person names, organizations, dates, and so on.

#### Chunking

Our first attempt to detect names in English may consist of chunking parts of the text that match a pattern of POS tags. For that, we define a noun phrase (NP) pattern consisting of (i) an optional *determiner* (`DT`), optionally followed by (ii) any number of *adjectives* (`JJ`), followed by (iii) a *noun* (`NN`).

We can use NLTK's *RegexpParser* and supply it with an appropriate regular expression.

In [None]:
# creating a chunk parser
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(pattern)
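
Conceptually, such a pattern is a regular expression over the sequence of POS tags rather than over characters. The sketch below illustrates that idea with the standard `re` module only. It is an assumption-laden simplification and not how `RegexpParser` is implemented: the helper name `chunk_np` is hypothetical, and the tagged tokens are hand-written, not the output of `pos_tag`.

```python
import re

def chunk_np(tagged_tokens):
    """Return (start, end) token spans whose tags match <DT>?<JJ>*<NN>."""
    # encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # map the character offset of each tag boundary to a token index
    offsets, pos = {}, 0
    for i, (_, tag) in enumerate(tagged_tokens):
        offsets[pos] = i
        pos += len(tag) + 2  # +2 for the angle brackets
    offsets[pos] = len(tagged_tokens)
    spans = []
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        spans.append((offsets[match.start()], offsets[match.end()]))
    return spans

# hand-tagged tokens (hypothetical, for illustration only)
toy_tagged = [("the", "DT"), ("mobile", "JJ"), ("phone", "NN"), ("market", "NN"),
              ("and", "CC"), ("the", "DT"), ("company", "NN")]
np_spans = chunk_np(toy_tagged)
print([[word for word, _ in toy_tagged[s:e]] for s, e in np_spans])
```

Because the pattern ends in exactly one `<NN>`, "market" becomes a chunk of its own instead of joining "the mobile phone". Likewise, `<NN>` does not match `NNS` or `NNP`, so plural and proper nouns are left out by this pattern.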

With our chunk parser, we can parse our sentence's POS-tagged list of tokens.

In [None]:
from nltk import word_tokenize
from nltk import pos_tag

text = """European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices."""
pos_tokens = pos_tag(word_tokenize(text))

# generating a parse tree
cs = cp.parse(pos_tokens)
print(cs)

A more appealing way of visualizing the result is to simply show the obtained parse tree, with *S* (for sentence) at the first level:

In [None]:
cs

Based on the obtained chunks, we can generate IOB tags for each of the elements in the sentence. For each chunk, we will get a **B**egin tag for its first token, optionally followed by **I**nside tags for subsequent tokens in the chunk. Tokens that do not belong to any chunk get an **O**utside tag.

In [None]:
# generating IOB tags for the tree: one (token, POS tag, chunk tag) triple per token
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
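
The IOB scheme itself is easy to reproduce by hand. The following minimal sketch turns chunk spans into IOB tags; the helper name `spans_to_iob`, the tokens and the spans are assumptions chosen for illustration, and unlike `tree2conlltags` it returns (token, tag) pairs without the POS tag.

```python
def spans_to_iob(tokens, spans, label="NP"):
    """Assign B-/I- tags inside each (start, end) span and O elsewhere."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))

# hypothetical tokens and chunk spans (end index is exclusive)
toy_tokens = ["the", "mobile", "phone", "market", "and", "the", "company"]
toy_spans = [(0, 3), (3, 4), (5, 7)]
toy_iob = spans_to_iob(toy_tokens, toy_spans)
print(toy_iob)
```

The **B** tag is what keeps adjacent chunks apart: "phone" ends one chunk and "market" starts the next, which would be indistinguishable if only inside and outside tags were used.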

NLTK provides a classifier that has already been trained to recognize named entities: [*ne_chunk*](https://www.nltk.org/book/ch07.html#duck_typing_index_term).

In [None]:
from nltk.chunk import ne_chunk

ne_tree = ne_chunk(pos_tokens)
print(ne_tree)
ne_tree

The result is far from perfect, isn't it?

### spaCy

spaCy includes [language processing pipelines](https://spacy.io/usage/processing-pipelines) that carry out several NLP tasks in a single pass. We can use one of the available [trained pipelines](https://spacy.io/models).

In [None]:
import spacy

# the pipeline must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

#### Entity level

spaCy's English named entity recognizer has been trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
We can directly obtain the entities identified by spaCy:

In [None]:
from pprint import pprint

doc = nlp("""European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market and 
ordered the company to alter its practices.""")

pprint([(ent.text, ent.label_) for ent in doc.ents])

#### Token level
We can also get the BIO encoding (the same scheme as the IOB tags used above) for the identified entities:

In [None]:
pprint([(token, token.ent_iob_, token.ent_type_) for token in doc])
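
Going in the opposite direction, token-level BIO tags can be grouped back into entity mentions. The sketch below shows one way to do that in plain Python. The helper name `bio_to_entities` is hypothetical, and the tagged tokens are hand-written for illustration; they are not actual output of the spaCy model.

```python
def bio_to_entities(tagged_tokens):
    """Group (text, iob, entity_type) triples into (mention, type) pairs."""
    entities, current = [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "B":
            if current:
                entities.append(current)
            current = ([text], ent_type)  # start a new entity
        elif iob == "I" and current:
            current[0].append(text)  # extend the open entity
        else:
            if current:
                entities.append(current)
            current = None  # outside any entity
    if current:
        entities.append(current)
    return [(" ".join(words), label) for words, label in entities]

# hand-written token tags (hypothetical, not model output)
toy_tokens = [
    ("fined", "O", ""), ("Google", "B", "ORG"), ("a", "O", ""),
    ("record", "O", ""), ("$", "B", "MONEY"), ("5.1", "I", "MONEY"),
    ("billion", "I", "MONEY"), ("on", "O", ""), ("Wednesday", "B", "DATE"),
]
toy_entities = bio_to_entities(toy_tokens)
print(toy_entities)
```

Joining tokens with a space is a simplification that ignores the original whitespace; spaCy keeps the exact text of each mention in `doc.ents`.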

#### NER from a document
Let's use spaCy to do NER on an actual web document:

In [None]:
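from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a web page and return its visible text as a single string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(res.text, 'html5lib')  # requires the html5lib package
    # drop elements whose content is not part of the article text
    for element in soup(["script", "style", "aside"]):
        element.extract()
    # collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))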

url = 'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news'
clean_text = url_to_string(url)
article = nlp(clean_text)

How many entities were extracted from the document?

In [None]:
# your code here


How many instances are there for each entity type?

In [None]:
# your code here


Which are the most mentioned entities?

In [None]:
# your code here


Checking out a specific sentence:

In [None]:
sentences = list(article.sents)
a_sentence = sentences[20]
a_sentence

Getting the BIO encoding for the sentence:

In [None]:
[(x, x.pos_, x.ent_iob_, x.ent_type_) for x in a_sentence]

We can simply output the mentioned entities and their categories:

In [None]:
dict([(str(x), x.label_) for x in a_sentence.ents])

We can also use spaCy's [visualizers](https://spacy.io/usage/visualizers) to better show the output of the NER model:

In [None]:
from spacy import displacy

displacy.render(a_sentence, jupyter=True, style='ent')

The displaCy visualizer can also show POS information and the dependency parse:

In [None]:
displacy.render(a_sentence, style='dep', jupyter = True, options = {'distance': 120})

Visualizing the entities in the full document, sentence by sentence:

In [None]:
for sent in sentences:
    displacy.render(sent, jupyter=True, style='ent')

#### NER for other languages

Try out other spaCy [pipelines](https://spacy.io/models) for other languages!

In [None]:
# your code here
