In [44]:
from collections import Counter

def word_tag_features(sentence):
    """Count word_tag pairs, e.g. 'studied_VBD', in a list of (word, tag) tuples."""
    word_tags = Counter()
    for word, tag in sentence:
        word_tags[word + "_" + tag] += 1
    return word_tags

def tag_tag_features(sentence):
    """Count bigrams of consecutive tags; the first tag is paired with 'None'."""
    tag_tags = Counter()
    prev_tag = str(None)  # placeholder for the start of the sentence
    for _, tag in sentence:
        tag_tags[prev_tag + "_" + tag] += 1
        prev_tag = tag
    return tag_tags

<center>
<h2>Part of speech tagging<br> with the structured perceptron</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### Parts of speech (PoS)

Word labels according to their syntactic function in a sentence:

| The       | results  | appear | in          | today| 's         |news |
| --------- |----------| -------| ------------|----- |------------|-----|
| determiner       | noun     | verb   | preposition | noun | possessive | noun|

And how to use an HMM for this:

<a href="http://www.slideshare.net/priberam/introducing-priberam-labs-machine-learning-and-natural-language-processing"><img width="500" src="images/hmm_pos_crop.png"></a>

Problems with the HMM?

### In this lecture

Hidden Markov models are very useful, but:
- they generate words and labels, we just want labels
- no overlapping features (e.g. word bigrams)
- no subword features (e.g. suffixes)

**Structured perceptron**
- extend the binary perceptron to label sequences
- inexact inference with beam search

### Problem setup

Training data is word sequences with label sequences:

\begin{align}
D_{train} & = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\} \\
\mathbf{x}^m & = [x_1,... x_N]\\
\mathbf{y}^m & = [y_1,... y_N]
\end{align}

for example:
\begin{align}
(\mathbf{x},\mathbf{y})=(&[I,studied,in,Sheffield],\\
&[Pronoun,Verb,Preposition,ProperNoun])
\end{align}

Learn a model that predicts the best label sequence:
\begin{align}
\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} score(\mathbf{x},\mathbf{y})
\end{align}

### Could we use the perceptron?

**Yes!**

Decompose the prediction for the whole sentence $\mathbf{x}= [x_1,... x_N]$:
$$
\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} score(\mathbf{x},\mathbf{y})
$$

into one prediction per word $x_n$:
$$
\hat y_n = \mathop{\arg \max}\limits_{y \in \cal Y} score(x_n, y) =  \mathop{\arg \max}\limits_{y \in \cal Y} \mathbf{w}^y \cdot \phi(x_n)
$$
$$

- break sentences into words each labeled by its PoS
- each word becomes a (tiny) bag-of-words/features
- learn one set of weights per tag $\mathbf{w}^y$
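
A minimal sketch of this per-word prediction, assuming sparse feature vectors stored as `Counter`s. The `word_features` template and the hand-set weights are illustrative assumptions, not part of the lecture code:

```python
from collections import Counter

def word_features(word):
    """phi(x_n): a tiny bag of features for one word (illustrative template)."""
    return Counter({"word=" + word: 1})

def predict_tag(weights, word):
    """Return the tag y that maximises w^y . phi(x_n)."""
    phi = word_features(word)

    def score(tag):
        return sum(weights[tag][f] * v for f, v in phi.items())

    return max(weights, key=score)

# one weight vector per tag; values are hand-set purely for illustration
weights = {
    "PRP": Counter({"word=I": 1.0}),
    "VBD": Counter({"word=studied": 1.0}),
    "IN": Counter({"word=in": 1.0}),
}
predicted = [predict_tag(weights, w) for w in ["I", "studied", "in"]]
print(predicted)
```

Each word is tagged independently here, so the prediction for one word cannot depend on its neighbours or on their tags.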

Anything missing?

### How to add context

Instead of:
$\hat y_n = \mathop{\arg \max}\limits_{y \in \cal Y} \mathbf{w}^y\phi(x_n)$

Try this:
$\hat y_n = \mathop{\arg \max}\limits_{y \in \cal Y} \mathbf{w}^y\phi(\mathbf{x},n)$

$\phi(\mathbf{x},n)$: extract features about word $n$ in sentence $\mathbf{x}$. Ideas?

### Example

In [1]:
sentence = [("I", "PRP"), ("studied", "VBD"), ("in", "IN"), ("Sheffield", "NNP"), ("in", "IN"), ("the", "DT"), ("Diamond","NNP")]
n = 1

In [6]:
# current word
sentence[n][0]

'studied'

In [9]:
# previous+current word
sentence[n-1][0]+"_"+sentence[n][0]

'I_studied'

In [11]:
# 3-letter suffix
sentence[n][0][-3:]

'ied'

These features should be useful in predicting "VBD"
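
The three features above could be collected into a single $\phi(\mathbf{x},n)$ function along these lines. This is only a sketch: the feature name prefixes and the `<s>` padding for the first word are assumptions.

```python
from collections import Counter

def context_features(words, n):
    """phi(x, n): features of word n that may look at the whole sentence."""
    feats = Counter()
    word = words[n]
    # "<s>" stands in for the missing previous word at n == 0 (assumption)
    prev_word = words[n - 1] if n > 0 else "<s>"
    feats["word=" + word] += 1
    feats["prev+word=" + prev_word + "_" + word] += 1
    feats["suffix3=" + word[-3:]] += 1
    return feats

words = ["I", "studied", "in", "Sheffield", "in", "the", "Diamond"]
feats = context_features(words, 1)
print(feats)
```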

### HMM vs perceptron

Both the perceptron and the HMM can:
- handle word-tag interactions (emission probabilities)

The perceptron can, but the HMM cannot, handle:
- sub-word features
- overlapping features (word n-grams)

The HMM can, but the per-word perceptron cannot:
- handle tag interactions (transition probabilities)

### How to add structure

Remember the alternative multiclass formulation:

$$\hat y = \mathop{\arg \max}\limits_{y \in \cal Y} (\mathbf{w} \cdot \phi(x,y))$$

Expand it to a giant linear classifier over $\mathbf{y} \in \cal Y^N$:

$$\mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \mathbf{w} \cdot \Phi(\mathbf{x},\mathbf{y})$$

$\Phi$ generates features capturing the compatibility between sentence $\mathbf{x}$ and tag sequence $\mathbf{y}$

This resembles a binary classification task: is the pair $(\mathbf{x},\mathbf{y})$ compatible or not? Which features would capture this?

### Example

In [2]:
sentence = [("I", "PRP"), ("studied", "VBD"), ("in", "IN"), ("Sheffield", "NNP"), ("in", "IN"), ("the", "DT"), ("Diamond","NNP")]

In [39]:
print(word_tag_features(sentence))

Counter({'in_IN': 2, 'Diamond_NNP': 1, 'the_DT': 1, 'studied_VBD': 1, 'I_PRP': 1, 'Sheffield_NNP': 1})


In [45]:
print(tag_tag_features(sentence))

Counter({'VBD_IN': 1, 'DT_NNP': 1, 'NNP_IN': 1, 'IN_NNP': 1, 'IN_DT': 1, 'PRP_VBD': 1, 'None_PRP': 1})


In [33]:
# sentence-level feature: does the tag sequence contain a past-tense verb?
tokens, tags = zip(*sentence)
has_verb = str('VBD' in set(tags))
print(has_verb)

True
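
One way to turn such feature functions into a single $\Phi(\mathbf{x},\mathbf{y})$ and score it with $\mathbf{w}$ is sketched below. The block re-implements the two feature functions compactly so that it runs on its own; the `wt:`/`tt:` prefixes and the weight values are assumptions for illustration.

```python
from collections import Counter

def word_tag_features(sentence):
    return Counter(word + "_" + tag for word, tag in sentence)

def tag_tag_features(sentence):
    tags = ["None"] + [tag for _, tag in sentence]
    return Counter(prev + "_" + cur for prev, cur in zip(tags, tags[1:]))

def global_features(sentence):
    """Phi(x, y): one sparse vector; prefixes keep the two templates apart."""
    feats = Counter()
    for name, count in word_tag_features(sentence).items():
        feats["wt:" + name] += count
    for name, count in tag_tag_features(sentence).items():
        feats["tt:" + name] += count
    return feats

def score(weights, sentence):
    """w . Phi(x, y) for a dict of feature weights."""
    return sum(weights.get(f, 0.0) * v for f, v in global_features(sentence).items())

sentence = [("I", "PRP"), ("studied", "VBD"), ("in", "IN"), ("Sheffield", "NNP"),
            ("in", "IN"), ("the", "DT"), ("Diamond", "NNP")]
weights = {"wt:in_IN": 0.5, "tt:PRP_VBD": 2.0}  # hand-set, illustrative only
total = score(weights, sentence)
print(total)
```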


### Binary perceptron training

<p style="font-size: 100%; border:3px; width: 90%; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}\\
& set\; \mathbf{w} = \mathbf{0} \\
& \mathbf{for} \; (\mathbf{x},y) \in D_{train} \; \mathbf{do}\\
& \quad predict  \; \hat y = sign(\mathbf{w}\cdot \phi(\mathbf{x}))\\
& \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
& \quad \quad update \; \mathbf{w} = \mathbf{w} + y\phi(\mathbf{x})\\
& \mathbf{return} \; \mathbf{w}
\end{align}
</p>

Learns the compatibility of $\mathbf{x}$ with the positive class
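
The binary algorithm can be sketched in Python as follows, assuming labels in $\{-1,+1\}$ and `Counter` feature vectors. Two details are assumptions rather than part of the pseudocode: a score of exactly 0 is treated as the negative class, and the data is passed over for several `epochs` instead of once. The toy data is invented.

```python
from collections import Counter

def dot(w, phi):
    return sum(w[f] * v for f, v in phi.items())

def predict(w, phi):
    # sign(w . phi); a score of 0 is mapped to -1 (assumption)
    return 1 if dot(w, phi) > 0 else -1

def train_binary_perceptron(data, epochs=5):
    """data: list of (phi, y) pairs with phi a Counter and y in {-1, +1}."""
    w = Counter()
    for _ in range(epochs):
        for phi, y in data:
            if predict(w, phi) != y:
                # mistake-driven update: w = w + y * phi(x)
                for f, v in phi.items():
                    w[f] += y * v
    return w

data = [
    (Counter({"good": 1}), 1),
    (Counter({"bad": 1}), -1),
    (Counter({"good": 1, "film": 1}), 1),
    (Counter({"bad": 1, "film": 1}), -1),
]
w = train_binary_perceptron(data)
print(w)
```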

<h3>Structured perceptron training</h3>
<p style="font-size:100%; border:3px; width: 900px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}\\
& set\; \mathbf{w} = \mathbf{0} \\
& \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad predict  \; \mathbf{\hat y} = \color{red}{\mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \mathbf{w} \cdot \Phi(\mathbf{x},\mathbf{y})}\\
& \quad \mathbf{if} \; \mathbf{\hat y} \neq \mathbf{y} \; \mathbf{then}\\
& \quad \quad update \; \mathbf{w} = \mathbf{w} + \color{red}{\Phi(\mathbf{x},\mathbf{y}) - \Phi(\mathbf{x},\mathbf{\hat y})}\\
& \mathbf{return} \; \mathbf{w}
\end{align}
</p>

Learns the compatibility of $\mathbf{x}$ with the correct labeling
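
A small, self-contained sketch of the structured version. To keep it short, the $\arg\max$ is computed by brute-force enumeration of $\cal Y^N$, which is only feasible for toy inputs. The feature templates, the `epochs` parameter and the two-sentence training set are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def global_features(words, tags):
    """Phi(x, y): word-tag and tag-tag counts (illustrative templates)."""
    feats = Counter()
    prev = "None"
    for word, tag in zip(words, tags):
        feats["wt:" + word + "_" + tag] += 1
        feats["tt:" + prev + "_" + tag] += 1
        prev = tag
    return feats

def score(w, words, tags):
    return sum(w[f] * v for f, v in global_features(words, tags).items())

def decode_brute_force(w, words, tagset):
    """Exact argmax by enumerating all |Y|^N tag sequences (tiny inputs only)."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: score(w, words, tags))

def train_structured_perceptron(data, tagset, epochs=5):
    w = Counter()
    for _ in range(epochs):
        for words, gold in data:
            predicted = decode_brute_force(w, words, tagset)
            if tuple(predicted) != tuple(gold):
                # w = w + Phi(x, y) - Phi(x, y_hat)
                w.update(global_features(words, gold))
                w.subtract(global_features(words, predicted))
    return w

tagset = ["PRP", "VBD", "NN"]
data = [(["I", "studied"], ["PRP", "VBD"]), (["I", "slept"], ["PRP", "VBD"])]
w = train_structured_perceptron(data, tagset)
print(decode_brute_force(w, ["I", "studied"], tagset))
```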

### Example

In [67]:
sentence_labeling1 = [("I", "PRP"), ("studied", "VBD"), ("in", "IN"), ("Sheffield", "NNP"), ("in", "IN"), ("the", "DT"), ("Diamond","NNP")]
feat_correct = tag_tag_features(sentence_labeling1)
print(feat_correct)

Counter({'VBD_IN': 1, 'DT_NNP': 1, 'NNP_IN': 1, 'IN_NNP': 1, 'IN_DT': 1, 'PRP_VBD': 1, 'None_PRP': 1})


In [68]:
sentence_labeling2 = [("I", "PRP"), ("studied", "NN"), ("in", "IN"), ("Sheffield", "NNP"), ("in", "IN"), ("the", "DT"), ("Diamond","NNP")]
feat_predicted = tag_tag_features(sentence_labeling2)
print(feat_predicted)

Counter({'DT_NNP': 1, 'PRP_NN': 1, 'NNP_IN': 1, 'IN_NNP': 1, 'IN_DT': 1, 'None_PRP': 1, 'NN_IN': 1})


In [69]:
feat_diff = Counter(feat_correct)
feat_diff.subtract(feat_predicted)
print(feat_diff)

Counter({'VBD_IN': 1, 'PRP_VBD': 1, 'DT_NNP': 0, 'IN_DT': 0, 'IN_NNP': 0, 'None_PRP': 0, 'NNP_IN': 0, 'PRP_NN': -1, 'NN_IN': -1})
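
To see what this update does, the sketch below adds the feature difference to an all-zero weight vector and rescores both labelings. The feature counts are copied from the outputs above; starting from zero weights is an assumption made to keep the arithmetic visible.

```python
from collections import Counter

feat_correct = Counter({'VBD_IN': 1, 'DT_NNP': 1, 'NNP_IN': 1, 'IN_NNP': 1,
                        'IN_DT': 1, 'PRP_VBD': 1, 'None_PRP': 1})
feat_predicted = Counter({'DT_NNP': 1, 'PRP_NN': 1, 'NNP_IN': 1, 'IN_NNP': 1,
                          'IN_DT': 1, 'None_PRP': 1, 'NN_IN': 1})

def dot(w, phi):
    return sum(w[f] * v for f, v in phi.items())

feat_diff = Counter(feat_correct)
feat_diff.subtract(feat_predicted)

weights = Counter()  # w = 0 before the update (assumption)
for f, v in feat_diff.items():
    weights[f] += v  # w = w + Phi(x, y) - Phi(x, y_hat)

score_correct = dot(weights, feat_correct)
score_predicted = dot(weights, feat_predicted)
print(score_correct, score_predicted)
```

Features shared by both labelings cancel out, so only the tag bigrams around the mislabeled word change weight, and the correct labeling now scores higher than the wrong one.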


### Decoding

The main difficulty in training is predicting (decoding):

$$ \mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \mathbf{w} \cdot \Phi(\mathbf{x},\mathbf{y})$$

Naively, this requires enumerating and scoring all $|\cal Y|^N$ label sequences $\mathbf{y} \in \cal Y^N$

Under a 1st order Markov assumption the features decompose over adjacent tags:
$$\Phi(\mathbf{x},\mathbf{y}) = \sum_{n=1}^N \phi(y_n, y_{n-1}, \mathbf{x}, n) $$
and we can decode exactly with Viterbi (summing the local scores $\mathbf{w}\cdot\phi$ instead of multiplying probabilities).
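
A compact sketch of Viterbi decoding with summed local scores, assuming a non-empty sentence and local features limited to the word-tag and tag-tag templates. The `local_score` helper, the feature names and the weights are illustrative assumptions.

```python
def local_score(w, prev_tag, tag, words, n):
    """w . phi(y_n, y_{n-1}, x, n) with word-tag and tag-tag features."""
    return (w.get("wt:" + words[n] + "_" + tag, 0.0)
            + w.get("tt:" + prev_tag + "_" + tag, 0.0))

def viterbi(w, words, tagset):
    # best[n][tag] = (score of the best prefix ending in tag at n, backpointer)
    best = [{tag: (local_score(w, "None", tag, words, 0), None) for tag in tagset}]
    for n in range(1, len(words)):
        column = {}
        for tag in tagset:
            prev = max(tagset, key=lambda p: best[n - 1][p][0]
                       + local_score(w, p, tag, words, n))
            column[tag] = (best[n - 1][prev][0]
                           + local_score(w, prev, tag, words, n), prev)
        best.append(column)
    # follow the backpointers from the best final tag
    tag = max(tagset, key=lambda t: best[-1][t][0])
    tags = [tag]
    for n in range(len(words) - 1, 0, -1):
        tag = best[n][tag][1]
        tags.append(tag)
    return tags[::-1]

w = {"wt:I_PRP": 1.0, "wt:studied_VBD": 1.0, "wt:in_IN": 1.0,
     "tt:PRP_VBD": 1.0, "tt:VBD_IN": 1.0}  # hand-set, illustrative only
decoded = viterbi(w, ["I", "studied", "in"], ["PRP", "VBD", "IN"])
print(decoded)
```

The cost is on the order of $N|\cal Y|^2$ local score evaluations rather than $|\cal Y|^N$ sequence scores.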

Which of the features we proposed earlier can't be used with 1st order Markov?

### Inexact decoding

Viterbi performs exact search (under the Markov assumption):

$$ \mathbf{\hat y} = \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \mathbf{w} \cdot \Phi(\mathbf{x},\mathbf{y})$$
by considering all options (efficiently, via dynamic programming).

Let's get faster by being inexact:

$$ \mathbf{\hat y} \approx \mathop{\arg \max}\limits_{\mathbf{y} \in \cal Y^N} \mathbf{w} \cdot \Phi(\mathbf{x},\mathbf{y})$$
by not scoring some label sequences.

### Beam search intuition

<a href="http://slideplayer.com/slide/8593664/"><img width="80%" src="images/beam_pos.jpg"></a>

### Beam search algorithm

<p style="font-size: 100%; border:3px; width: 1000px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; word\; sequence\; \mathbf{x}=[x_1,...,x_N], weights \; \mathbf{w}\\
& set\; beam \; B = \{(\mathbf{y_{temp}}=[START], score=0)\}, size \; k\\
& \mathbf{for} \; n = 1 \; \mathbf{to} \; N \; \mathbf{do}\\
& \quad B' = \{\}\\
& \quad \mathbf{for} \; b \in B \; \mathbf{do}\\
& \quad \quad \mathbf{for} \; y \in \cal Y \; \mathbf{do}\\
& \quad \quad \quad score = \mathbf{w}\cdot \Phi(\mathbf{x}, [b.\mathbf{y_{temp}}; y]) \\
& \quad \quad \quad B' = B' \cup \{([b.\mathbf{y_{temp}}; y], score)\} \\
& \quad B = \text{TOP-k}(B')\\
& \mathbf{return} \; \text{TOP-1}(B)
\end{align}
</p>
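
The pseudocode above might be written in Python roughly as follows. This is a sketch under assumptions: partial hypotheses are tuples of tags, `global_features` scores only the prefix covered so far, and the feature templates and weights are invented for illustration.

```python
from collections import Counter

def global_features(words, tags):
    """Phi for a (possibly partial) tag sequence over the first len(tags) words."""
    feats = Counter()
    prev = "START"
    for word, tag in zip(words, tags):
        feats["wt:" + word + "_" + tag] += 1
        feats["tt:" + prev + "_" + tag] += 1
        prev = tag
    return feats

def score(w, words, tags):
    return sum(w.get(f, 0.0) * v for f, v in global_features(words, tags).items())

def beam_search(w, words, tagset, k=2):
    beam = [((), 0.0)]  # list of (partial tag sequence, score)
    for _ in range(len(words)):
        candidates = []
        for tags, _ in beam:
            for tag in tagset:
                extended = tags + (tag,)
                candidates.append((extended, score(w, words, extended)))
        # TOP-k: keep only the k highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return list(beam[0][0])

w = {"wt:I_PRP": 1.0, "wt:studied_VBD": 1.0, "wt:in_IN": 1.0,
     "tt:PRP_VBD": 1.0, "tt:VBD_IN": 1.0}  # hand-set, illustrative only
words = ["I", "studied", "in"]
tagset = ["PRP", "VBD", "IN"]
result = beam_search(w, words, tagset, k=2)
print(result)
```

Because every hypothesis is rescored with the full $\Phi$, the feature function is free to look at the whole tag prefix, at the price of possibly pruning the best sequence early.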

### Beam Search

- If the beam size is 1, beam search reduces to greedy search
- Beam sizes below 10 often get close to exact search, while being much faster
- Hypotheses in the beam must be of the same length to be comparable
- Beam search is attractive when we need complex feature functions (i.e. when we want to avoid Markov assumptions)

### Bibliography
- Kai Zhao's [survey](https://www.gc.cuny.edu/CUNY_GC/media/Computer-Science/Student%20Presentations/Kai%20Zhao/Second_Exam_Survey_Kai_Zhao_12_11_2014.pdf)
- Kai Zhao's and Liang Huang's [tutorial slides](http://kaizhao.me/files/perc-tutorial-masc.pdf)
- Graham Neubig's [slides on beam search and Viterbi](http://www.phontron.com/slides/nlp-programming-en-13-search.pdf)

### Coming up next
The structured perceptron is great, but where have the probabilities gone?

Logistic regression, Conditional Random Fields and gradient descent next!