In [4]:
import nltk
from collections import Counter

PoS_dict = {"the":"determiner", "can": "modal", "fly":"verb"}

tag_set = ['.', 'NUM', 'VERB', 'DET', 'ADV', 'CONJ', 'PRT', 'PRON', 'ADP', 'X', 'NOUN', 'ADJ']
corpus = nltk.corpus.brown.tagged_sents(tagset='brown')

unigram_tag_counts = Counter()  # counts(tag), including the None boundary symbol
bigram_tag_counts = Counter()   # counts(previous tag, tag)
word_tag_counts = Counter()     # counts(tag, word)
word_tag_dict = {}              # word -> list of tags seen with it
for sent in corpus:
    tokens, tags = zip(*sent)
    # pad with None to mark the sentence start and end
    padded_tags = [None] + list(tags) + [None]
    for index, tag in enumerate(padded_tags):
        unigram_tag_counts[tag] += 1
        if index > 0:
            bigram_tag_counts[(padded_tags[index-1], tag)] += 1
            # padded_tags is shifted by one position relative to tokens
            if index < len(padded_tags) - 1:
                word = tokens[index-1]
                word_tag_counts[(tag, word)] += 1
                if word not in word_tag_dict:
                    word_tag_dict[word] = [tag]
                elif tag not in word_tag_dict[word]:
                    word_tag_dict[word].append(tag)

In [5]:
def bigram_LM(tags_y):
    """Probability of a tag sequence under a bigram model over tags."""
    # pad with None to mark the sentence start and end
    padded_tags = [None]+list(tags_y)+[None]
    tag_bigrams = []
    for index in range(len(padded_tags)-1):
        tag_bigrams.append((padded_tags[index],padded_tags[index+1]))
    print(tag_bigrams)
    prob_x = 1.0
    for bg in tag_bigrams:
        if bg[0] is None:
            # None is counted at both ends of every sentence,
            # so normalise by the number of sentences instead
            prob_bg = bigram_tag_counts[bg]/len(corpus)
        else:
            prob_bg = bigram_tag_counts[bg]/unigram_tag_counts[bg[0]]
        prob_x = prob_x*prob_bg
        print(str(bg)+":"+str(prob_bg))
    return prob_x

<center>
<h2>Part of speech tagging<br> with hidden Markov models</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### Parts of speech (PoS)

Word labels according to their syntactic function in a sentence:

| The       | results  | appear | in          | today| 's         |news |
| --------- |----------| -------| ------------|----- |------------|-----|
| determiner       | noun     | verb   | preposition | noun | possessive | noun|

What could they be useful for?

- language modelling
- syntactic parsing
- named entity recognition
- question answering

### Kinds of PoS tags

Open class:
- nouns
    - proper nouns
    - common nouns
- verbs
- adjectives
- adverbs

Closed class: determiners, prepositions, conjunctions, etc.

### PoS definitions

Most research uses the [Penn Treebank PoS tag set](https://www.clips.uantwerpen.be/pages/mbsp-tags).

It includes 45 tags, making distinctions between
  - verbs in present vs past tense
  - nouns in singular vs plural number
  - etc.

Most distinctions are inspired by English. Recent work has focused on the [Universal PoS tag set](http://universaldependencies.org/u/pos/):
- 17 coarse tags: one noun, one verb, etc.
- developed considering 22 languages

### Problem setup

The training data consists of word sequences paired with label sequences:

\begin{align}
D_{train} & = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\} \\
\mathbf{x}^m & = [x_1,... x_N]\\
\mathbf{y}^m & = [y_1,... y_N]
\end{align}

for example:
\begin{align}
(\mathbf{x},\mathbf{y})=(&[I,studied,in,Sheffield],\\
&[Pronoun,Verb,Preposition,ProperNoun])
\end{align}

Learn a model that predicts the best label sequence:
\begin{align}
\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} score(\mathbf{x},\mathbf{y})
\end{align}

### Could we use a dictionary?

In [14]:
print(PoS_dict)

{'the': 'determiner', 'can': 'modal', 'fly': 'verb'}


Yes, but the same word can have different tags in different contexts.

I | can | fly
---|----|----
pronoun | modal | verb

vs:


I | can | fly
---|----|----
pronoun | verb | noun

*can* is not an exception: 11.5% of the words in the Brown corpus have more than one tag.
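
A figure like this can be computed by collecting the set of tags observed for each word. The sketch below is a minimal illustration on hypothetical toy data, not the Brown corpus. The source does not say whether the 11.5% counts distinct words or running tokens; this sketch assumes distinct words, and the function name `ambiguous_word_fraction` is made up for the example.

```python
from collections import defaultdict

def ambiguous_word_fraction(tagged_sents):
    """Fraction of distinct words that occur with more than one tag."""
    tags_per_word = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            tags_per_word[word].add(tag)
    ambiguous = sum(1 for tags in tags_per_word.values() if len(tags) > 1)
    return ambiguous / len(tags_per_word)

# Hypothetical toy data, not taken from the Brown corpus
toy_sents = [
    [("I", "PPSS"), ("can", "MD"), ("fly", "VB")],
    [("I", "PPSS"), ("can", "VB"), ("fish", "NN")],
]
# Only "can" out of the 4 distinct words has more than one tag
print(ambiguous_word_fraction(toy_sents))
```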

### Can we use a Markov model?

Our training data should be able to tell us that tagging

I | can | fly
---|----|----
pronoun | modal | noun

is unlikely. Why?

Take a bigram language model and replace the words $\mathbf{x}$ with tags $\mathbf{y}$:

$$P(\mathbf{y}) = \prod_{n=1}^N P(y_n| y_{n-1})$$

### Can we use a Markov model?

In [7]:
print(corpus[0])

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]


In [8]:
bigram_LM(['PPSS','MD','NN']) # PPSS is the personal pronoun

[(None, 'PPSS'), ('PPSS', 'MD'), ('MD', 'NN'), ('NN', None)]
(None, 'PPSS'):0.05486571328915242
('PPSS', 'MD'):0.15229676858426314
('MD', 'NN'):0.0017697691255731639
('NN', None):0.00044598937495900833


6.595274031841351e-09

In [9]:
bigram_LM(['PPSS','VB','NN'])

[(None, 'PPSS'), ('PPSS', 'VB'), ('VB', 'NN'), ('NN', None)]
(None, 'PPSS'):0.05486571328915242
('PPSS', 'VB'):0.23061875090566586
('VB', 'NN'):0.04342148220698661
('NN', None):0.00044598937495900833


2.450331267007358e-07

### What about the words?

What we have is:

$$P(\mathbf{y}) = \prod_{n=1}^N P(y_n| y_{n-1})$$

and we want to use it to compute:

$$\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} P(\mathbf{y})$$

What will we get?

The same $N$-tag long sequence for any sentence!

<h3>Hidden Markov model</h3>

<a href="http://www.slideshare.net/priberam/introducing-priberam-labs-machine-learning-and-natural-language-processing"><img width="500" src="images/hmm_pos_crop.png"></a>

PoS tags are hidden states emitting words. Assumptions:
- 1st order Markov among the PoS tags
- Each word only depends on its PoS tag

### Derivation

$\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} P(\mathbf{y}|\mathbf{x})\quad$ (Bayes rule)

$\require{cancel} \mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \frac{P(\mathbf{x}|\mathbf{y})P(\mathbf{y})}{\cancel{P(\mathbf{x})}} \quad$ (words are constant)

$\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} P(\mathbf{x}|\mathbf{y})P(\mathbf{y}) \quad$ (1st order Markov, each word depends only on its tag)

$\mathbf{\hat y} \approx \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \prod_{n=1}^NP(x_n|y_n)P(y_n|y_{n-1}) $
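
As a worked instance of the last line, for a three-word sentence such as *I can fly* the product expands as follows, assuming $y_0$ denotes the sentence-start symbol (`None` in the notebook's code):

\begin{align}
\prod_{n=1}^{3} P(x_n|y_n)P(y_n|y_{n-1}) = \; & P(x_1|y_1)P(y_1|y_0) \\
\times \; & P(x_2|y_2)P(y_2|y_1) \\
\times \; & P(x_3|y_3)P(y_3|y_2)
\end{align}

The notebook's `bigram_LM` function additionally multiplies in an end-of-sentence transition $P(None|y_3)$.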

### HMM training
				
Maximum likelihood estimation (i.e. counts!):
				
$ P(y_n|y_{n-1}) = \frac{counts(y_{n-1},y_n)}{counts(y_{n-1})} \quad$  (transition probabilities)
				
$ P(x_n|y_n) = \frac{counts(x_n,y_n)}{counts(y_n)} \quad$  (emission probabilities)
				
We can easily read them off a labeled corpus.
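
The two count ratios above can be sketched as a small training function. This is a minimal illustration on hypothetical toy data, not the notebook's own code: the name `train_hmm` and the returned `transition` and `emission` helpers are assumptions made for the example, `None` stands for the sentence boundary, and unseen conditioning tags simply get probability 0.

```python
from collections import Counter

def train_hmm(tagged_sents):
    """Count-based (maximum likelihood) estimates for a first-order HMM."""
    tag_counts = Counter()         # counts(tag); None counts sentence starts
    transition_counts = Counter()  # counts(previous tag, tag)
    emission_counts = Counter()    # counts(tag, word)
    for sent in tagged_sents:
        prev = None  # None marks the sentence boundary
        tag_counts[None] += 1
        for word, tag in sent:
            tag_counts[tag] += 1
            transition_counts[(prev, tag)] += 1
            emission_counts[(tag, word)] += 1
            prev = tag
        transition_counts[(prev, None)] += 1  # end-of-sentence transition

    def transition(prev, tag):
        total = tag_counts[prev]
        return transition_counts[(prev, tag)] / total if total else 0.0

    def emission(tag, word):
        total = tag_counts[tag]
        return emission_counts[(tag, word)] / total if total else 0.0

    return transition, emission

# Hypothetical toy data, not taken from the Brown corpus
toy_sents = [
    [("I", "PPSS"), ("can", "MD"), ("fly", "VB")],
    [("I", "PPSS"), ("can", "VB"), ("fish", "NN")],
]
transition, emission = train_hmm(toy_sents)
print(transition("PPSS", "MD"))  # 1 of the 2 PPSS tags is followed by MD
print(emission("VB", "can"))     # 1 of the 2 VB tags emits "can"
```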

### In action

In [10]:
x = ['I','can','fly']
y = ['PPSS','MD','NN']

In [11]:
emission_product = 1.0
for i in range(len(x)):
    prob = word_tag_counts[(y[i],x[i])]/unigram_tag_counts[y[i]] 
    emission_product *= prob
    print(prob)   

0.37161280973771915
0.13876598825516853
9.182134190332525e-05


In [12]:
bigram_LM(y)*emission_product

[(None, 'PPSS'), ('PPSS', 'MD'), ('MD', 'NN'), ('NN', None)]
(None, 'PPSS'):0.05486571328915242
('PPSS', 'MD'):0.15229676858426314
('MD', 'NN'):0.0017697691255731639
('NN', None):0.00044598937495900833


3.122843277930901e-14

### Decoding/Inference

So we have everything we need to decode/infer the most likely tag sequence for a sentence:

$\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \prod_{n=1}^NP(x_n|y_n)P(y_n|y_{n-1}) \quad$

Just enumerate all possible tag sequences?

No! We would need to evaluate $|\cal Y|^{N}$ sequences!
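
To make the blow-up concrete, here is a minimal sketch of exhaustive decoding. The names `brute_force_decode` and `toy_score` are hypothetical, and the scoring function is a stand-in rather than the HMM probability; the point is only how many candidates have to be scored.

```python
from itertools import product

def brute_force_decode(words, tags, score):
    """Score every tag sequence of length len(words) and return the best."""
    best_seq, best_score = None, float("-inf")
    n_evaluated = 0
    for candidate in product(tags, repeat=len(words)):
        n_evaluated += 1
        s = score(words, candidate)
        if s > best_score:
            best_seq, best_score = candidate, s
    return best_seq, n_evaluated

def toy_score(words, candidate):
    # Hypothetical scoring function: number of words given their preferred tag
    preferred = {"I": "PPSS", "can": "MD", "fly": "VB"}
    return sum(preferred[w] == t for w, t in zip(words, candidate))

best, n = brute_force_decode(["I", "can", "fly"],
                             ["PPSS", "MD", "VB", "NN"], toy_score)
# 4 tags and 3 words give 4**3 = 64 candidate sequences
print(best, n)
```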

### Viterbi 
				
<a href="http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/ambiguity/ambiguity-2012.html"><img height="400" src="images/viterbi.gif"></a>

- dynamic programming: store and re-use calculations
- possible due to independence assumptions
- keep track of the highest probability to reach each PoS tag for each word and how we got there

### Viterbi data structures

tag set $\cal Y$, sentence $\mathbf{x}=[x_1,... x_N]$

Viterbi score matrix $V^{|{\cal Y}|\times N}$
- each cell contains the highest probability of any tag sequence for the first $n$ words that ends with tag $y$
- 1st order Markov: only depends on the  previous tag $y^{\prime}$
- i.e. $V[y,n] = \max_{y^{\prime}\in \cal Y} V[y^{\prime}, n-1] \times P(y|y^{\prime}) \times P(x_n|y)$

Backpointer matrix $backptr^{|{\cal Y}|\times N}$:
- instead of the max score, keep the previous tag that got it
- $argmax$ instead of $max$
- i.e.: $backptr[y,n] = \mathop{\arg\max}_{y^{\prime}\in \cal Y} V[y^{\prime}, n-1] \times P(y|y^{\prime}) \times P(x_n|y)$

### Viterbi diagram

![](images/jurafsky_5_18_viterbi.jpg)

<h3>Viterbi algorithm</h3>
<p style="border:3px; width: 1100px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; word\; sequence\; \mathbf{x}=[x_1,...,x_N],\\
& emission \; probs \; P(x|y), transition \; probs \; P(y_n|y_{n-1})\\
& set\; matrix \; V^{|{\cal Y}|\times N} = 1\\
& \mathbf{for} \; n = 1 \; \mathbf{to} \; N \; \mathbf{do}\\
& \quad \mathbf{for} \; y \in \cal Y \; \mathbf{do}\\
& \quad \quad V[y,n] = \max_{y^{\prime}\in \cal Y} V[y^{\prime}, n-1] \times P(y|y^{\prime}) \times P(x_n|y)\\
& \quad \quad backptr[y,n] = \mathop{\arg \max}\limits_{y^{\prime}\in \cal Y} V[y^{\prime}, n-1] \times P(y|y^{\prime})\times P(x_n|y)\\
& backptr[None,N+1] = \mathop{\arg \max}\limits_{y^{\prime}\in \cal Y} V[y^{\prime}, N] \times P(None|y^{\prime})\\
\end{align}
</p>

Break the large $\arg\max$ into smaller ones, left-to-right (**dynamic programming**)
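
The pseudocode above could be sketched in Python roughly as follows. This is one possible implementation under stated assumptions, not the course's reference code: `transition(prev, tag)` and `emission(tag, word)` are assumed to be functions returning probabilities, `None` is used as the sentence start/end symbol, and the first word is handled explicitly with the transition from `None`. The toy probabilities are hypothetical.

```python
def viterbi(words, tags, transition, emission):
    """Most likely tag sequence under a first-order HMM."""
    if not words:
        return []
    # V[n][y]: highest probability of any tag sequence for words[:n+1] ending in y
    V = [{y: transition(None, y) * emission(y, words[0]) for y in tags}]
    backptr = [{}]
    for n in range(1, len(words)):
        V.append({})
        backptr.append({})
        for y in tags:
            # the emission term does not depend on the previous tag,
            # so the argmax only needs the score so far and the transition
            best_prev = max(tags, key=lambda yp: V[n - 1][yp] * transition(yp, y))
            V[n][y] = V[n - 1][best_prev] * transition(best_prev, y) * emission(y, words[n])
            backptr[n][y] = best_prev
    # End-of-sentence transition, as in the last line of the pseudocode
    last = max(tags, key=lambda y: V[-1][y] * transition(y, None))
    # Follow the backpointers from right to left
    path = [last]
    for n in range(len(words) - 1, 0, -1):
        path.append(backptr[n][path[-1]])
    return list(reversed(path))

# Hypothetical toy probabilities, not estimated from the Brown corpus
trans = {(None, "PPSS"): 1.0, ("PPSS", "MD"): 0.6, ("PPSS", "VB"): 0.4,
         ("MD", "VB"): 0.9, ("MD", "NN"): 0.1, ("VB", "NN"): 0.7,
         ("VB", "VB"): 0.3, ("VB", None): 0.5, ("NN", None): 0.5}
emit = {("PPSS", "I"): 1.0, ("MD", "can"): 0.8, ("VB", "can"): 0.1,
        ("VB", "fly"): 0.5, ("NN", "fly"): 0.5}

def transition(prev, tag):
    return trans.get((prev, tag), 0.0)

def emission(tag, word):
    return emit.get((tag, word), 0.0)

result = viterbi(["I", "can", "fly"], ["PPSS", "MD", "VB", "NN"], transition, emission)
print(result)
```

Each cell looks at $|\cal Y|$ previous tags, so the work is on the order of $N \times |\cal Y|^2$ rather than $|\cal Y|^N$.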

### Some more points

Higher order HMMs:
- longer contexts, more expensive inference
- benefits are usually small

Smoothing:
- what happens when we have unseen word-tag or tag-tag combinations?
- everything we learned in the language modeling lecture!
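
As one illustration, add-alpha smoothing can be applied to the emission counts. This is only a sketch of a single technique chosen as an assumption; the function name `smoothed_emission`, the toy counts and the vocabulary size are all hypothetical.

```python
from collections import Counter

def smoothed_emission(emission_counts, tag_counts, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | tag); alpha=1 gives add-one smoothing."""
    def prob(tag, word):
        numerator = emission_counts[(tag, word)] + alpha
        denominator = tag_counts[tag] + alpha * vocab_size
        return numerator / denominator
    return prob

# Hypothetical toy counts; Counter returns 0 for unseen pairs
emission_counts = Counter({("NN", "fly"): 1, ("NN", "fish"): 3})
tag_counts = Counter({"NN": 4})
p = smoothed_emission(emission_counts, tag_counts, vocab_size=4)
print(p("NN", "fly"))    # (1 + 1) / (4 + 4)
print(p("NN", "plane"))  # unseen pair still gets (0 + 1) / (4 + 4)
```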

### Bibliography

- Michael Collins's [notes](http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf)
- Jurafsky and Martin's [chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf) from the new edition
- Graham Neubig's [slides](http://www.phontron.com/slides/nlp-programming-en-04-hmm.pdf)

### Coming up next

The perceptron and text classification!