### Hidden Markov Models (HMM)

Hidden Markov Models can help us solve the problem of sequence labelling, e.g. given a sequence of words, finding the corresponding sequence of part-of-speech (POS) tags. HMMs are an extension of Markov chains. The states, which take values in a finite set of $N$ tags $\{t_1, t_2,...,t_N\}$ (e.g. $N$ different POS tags), are considered `hidden`. Given a sequence of these hidden states, e.g. $t_1,...,t_{i-1}$, we invoke the `Markov assumption` (the same assumption made by a bigram language model), i.e. the next state in the sequence depends only on the previous state, so $P(t_i|t_1,..,t_{i-1}) = P(t_i|t_{i-1})$. These are called `transition probabilities`. Since each state takes one of a finite number of values, the distribution over the next state given the previous one is categorical, and we can use `maximum likelihood estimation` to estimate the transition probabilities from a tagged training corpus:

$P(t_i|t_{i-1}) = \frac{count(t_{i-1},t_i)}{count(t_{i-1})}$
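As a rough illustration of the count ratio above, here is a minimal sketch that estimates transition probabilities from a tiny, made-up set of tag sequences. The corpus `toy_tag_sequences` and the helper `transition_prob` are hypothetical names introduced only for this example, and the sketch assumes that $count(t_{i-1})$ only counts positions where the tag has a successor.

```python
from collections import Counter

# Hypothetical toy corpus of tag sequences (not taken from the treebank).
toy_tag_sequences = [
    ["DT", "NN", "VB"],
    ["DT", "JJ", "NN"],
    ["NN", "VB", "DT", "NN"],
]

bigram_counts = Counter()   # count(t_{i-1}, t_i)
prev_counts = Counter()     # count(t_{i-1}), counted only where a successor exists
for seq in toy_tag_sequences:
    for prev, curr in zip(seq, seq[1:]):
        bigram_counts[(prev, curr)] += 1
        prev_counts[prev] += 1

def transition_prob(prev, curr):
    """MLE estimate of P(curr | prev); 0.0 if prev never had a successor."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / prev_counts[prev]

print(transition_prob("DT", "NN"))
print(transition_prob("DT", "JJ"))
```

In this toy corpus `DT` is followed by `NN` twice and by `JJ` once, so the two estimates are 2/3 and 1/3.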

Instead of observing the sequence of states directly, we observe a sequence of `observations` $O=\{w_1, .., w_T\}$, which are a different set of random variables. In the case of POS tagging, this is a sequence of $T$ words, where each $w_i$ is drawn from the same vocabulary $V$. Here we invoke the `output independence assumption`, according to which the $i$-th observation in the sequence depends only on the corresponding $i$-th hidden state (i.e. the $i$-th state generates the $i$-th observation), which means that $P(w_i|t_1, ..,t_T, w_1, ..,w_{i-1}, w_{i+1}, ..,w_T) = P(w_i|t_i)$. These are called `observation likelihoods` or `emission probabilities` and can also be estimated using MLE:

$P(w_i|t_i) = \frac{count(w_i,t_i)}{count(t_i)}$
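The emission estimate can be sketched the same way. The tagged sentences in `toy_tagged` and the helper `emission_prob` are hypothetical and only serve to make the count ratio concrete.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs.
toy_tagged = [
    [("the", "DT"), ("bill", "NN")],
    [("will", "MD"), ("bill", "VB")],
    [("the", "DT"), ("will", "NN")],
]

word_tag_pair_counts = Counter()   # count(w_i, t_i)
tag_totals = Counter()             # count(t_i)
for sentence in toy_tagged:
    for word, tag in sentence:
        word_tag_pair_counts[(word, tag)] += 1
        tag_totals[tag] += 1

def emission_prob(word, tag):
    """MLE estimate of P(word | tag); 0.0 for a tag that was never observed."""
    if tag_totals[tag] == 0:
        return 0.0
    return word_tag_pair_counts[(word, tag)] / tag_totals[tag]

print(emission_prob("bill", "NN"))
print(emission_prob("the", "DT"))
```

Here `NN` was seen once with "bill" and once with "will", so $P(\text{bill}|NN)$ comes out as 0.5, while `DT` only ever emitted "the".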



Our main goal is to `infer`/`decode` the most likely sequence of hidden states that could have generated the observed sequence of words:   

$\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } P(t_1,..,t_T|w_1,..,w_T) =  \text{argmax}_{t_{1:T}} \frac{P(w_1,..,w_T|t_1,..,t_T) P(t_1,..,t_T)}{P(w_1,..,w_T)} = \text{argmax}_{t_{1:T}} P(w_1,..,w_T|t_1,..,t_T) P(t_1,..,t_T)$

where we used Bayes' rule and dropped the denominator $P(w_1,..,w_T)$ because it does not depend on the tag sequence. Invoking the output independence assumption for the observations, we can write $P(w_1,..,w_T|t_1,..,t_T) = \prod_{i=1}^T P(w_i|t_i)$, and invoking the Markov assumption for the states, we can write $P(t_1,..,t_T) = \prod_{i=1}^T P(t_i|t_{i-1})$ (with $t_0$ denoting a special start-of-sequence state), which leads to the following:

$\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } \prod_{i=1}^T P(t_i|t_{i-1})  P(w_i|t_i)$

Instead of computing this product exhaustively for all $N^T$ possible tag sequences and then finding the maximum, we can note that the problem has optimal substructure: the most probable path that ends in a given state at position $i$ must extend a most probable path ending in some state at position $i-1$. We can therefore use `dynamic programming` to obtain the solution more efficiently; this is the `Viterbi algorithm`.
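To make the cost of the exhaustive approach concrete, the following sketch scores every possible state sequence of a tiny HMM. The two-state model (`toy_pi`, `toy_A`, `toy_B`) and the function `brute_force_decode` are made up for illustration and are not part of the example used later.

```python
from itertools import product

import numpy as np

# Hypothetical two-state HMM; all numbers are made up for illustration.
toy_pi = np.array([0.6, 0.4])            # P(state | <s>)
toy_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])           # toy_A[i, j] = P(state j | state i)
toy_B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])      # toy_B[j, o] = P(observation o | state j)
toy_obs = [0, 1, 2]                      # observed symbol indices

def brute_force_decode(pi, A, B, obs):
    """Score every possible state sequence and return the best one."""
    n_states = len(pi)
    best_path, best_score = None, -1.0
    for path in product(range(n_states), repeat=len(obs)):
        # initial transition and emission
        score = pi[path[0]] * B[path[0], obs[0]]
        # one transition and one emission per remaining position
        for prev, curr, o in zip(path, path[1:], obs[1:]):
            score *= A[prev, curr] * B[curr, o]
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

best_path, best_score = brute_force_decode(toy_pi, toy_A, toy_B, toy_obs)
print(best_path, best_score)
print("sequences scored:", len(toy_pi) ** len(toy_obs))
```

Even this tiny model requires scoring $2^3 = 8$ sequences. With seven tags and a five-word sentence the count is already $7^5 = 16807$, and it grows exponentially with sentence length.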

In the Viterbi algorithm, we set up a matrix whose columns represent the observations at each step and whose rows represent the possible hidden states. We define $v_t(j)$ as the cell in column $t$ and row $j$, which holds the probability of the HMM being in state $j$ after seeing the first $t$ observations and passing through the most probable sequence of preceding states, i.e. the probability of the most probable path reaching that cell. We can compute the value of each cell in column $t$, given that we've already computed the values in the preceding column $t-1$, using the following recurrence relation:

$v_t(j) = \max_{i=1}^{N} \text{ } v_{t-1}(i) P(t_j|t_i) P(w_t|t_j)$

Note that for each cell we choose the previous tag $t_i$ that gives the most probable extension of the best path ending in $t_i$ in the previous column. In addition, inside each cell, we also store a `backpointer` to that $t_i$.
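A single application of the recurrence can be written compactly with NumPy broadcasting. This is only a sketch on made-up numbers: `v_prev`, `emission_t`, and `toy_A` are hypothetical stand-ins for one column of the Viterbi matrix, the emission probabilities of the current word, and a transition matrix.

```python
import numpy as np

# Hypothetical two-state example; numbers are made up for illustration.
toy_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])         # toy_A[i, j] = P(state j | state i)
emission_t = np.array([0.4, 0.3])      # P(w_t | state j) for the current word
v_prev = np.array([0.30, 0.04])        # previous column v_{t-1}(i)

# scores[i, j] = v_{t-1}(i) * P(t_j | t_i)
scores = v_prev[:, None] * toy_A
backptr_t = scores.argmax(axis=0)      # best previous state i for each state j
v_t = scores.max(axis=0) * emission_t  # v_t(j)

print(v_t, backptr_t)
```

The emission term does not depend on $i$, so it can be multiplied in after taking the maximum over previous states.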

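(This is closely related to shortest-path algorithms such as Dijkstra's: if each edge of the trellis is weighted by the negative log of its probability, the most probable path is the shortest path.)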

To run this algorithm, we need to initialize all the cells in the first column, which hold the joint probability of each possible hidden state and the first observation. We can compute these using the distribution over initial hidden states:

$v_1(j) = P(t_j|<s>) P(w_1|t_j) = \pi_j P(w_1|t_j)$, where $<s>$ denotes a special start-of-sequence hidden state (like a start-of-sentence token) and $\pi_j = P(t_j|<s>)$ denotes the probability distribution over all possible starting states. Then, using the recurrence relation, we can fill out the remaining columns one by one. Finally, once we've computed the last column $v_T(j)$, we pick the cell with the largest probability, which is the final state along the optimal path, $\hat{t}_T = \text{argmax}_j v_T(j)$, and trace backward along the optimal path using the backpointers.


In [60]:
""" 
    A simple example from the Jurafsky textbook (section 8.4.6). We assume that the transition and emission probabilities have already been estimated from a tagged corpus.
"""

import numpy as np 


# seven different POS tags
tags = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]
tag_dict = {0: 'NNP', 1: 'MD', 2: 'VB', 3: 'JJ', 4: 'NN', 5: 'RB', 6: 'DT'}

# observation sequence
words = ["Janet", "will", "back", "the", "bill"]
word2idx = {"Janet": 0, "will": 1, "back": 2, "the": 3, "bill": 4}

# transition probabilities: A_ij = P(t_j|t_i)
A = np.array([
    [0.3777, 0.0110, 0.0009, 0.0084, 0.0584, 0.0090, 0.0025],
    [0.0008, 0.0002, 0.7968, 0.0005, 0.0008, 0.1698, 0.0041],
    [0.0322, 0.0005, 0.0050, 0.0837, 0.0615, 0.0514, 0.2231],
    [0.0366, 0.0004, 0.0001, 0.0733, 0.4509, 0.0036, 0.0036],
    [0.0096, 0.0176, 0.0014, 0.0086, 0.1216, 0.0177, 0.0068],
    [0.0068, 0.0102, 0.1011, 0.1012, 0.0120, 0.0728, 0.0479],
    [0.1147, 0.0021, 0.0002, 0.2157, 0.4744, 0.0102, 0.0017]
    ])

# initial state probabilities: Pi_j =  P(t_j|<s>)
pi = np.array([0.2767, 0.0006, 0.0031, 0.0453, 0.0449, 0.0510, 0.2026])

# emission probabilities B_jt = P(w_t|t_j)
B = np.array([
    [0.000032, 0, 0, 0.000048, 0],
    [0, 0.308431, 0, 0, 0],
    [0, 0.000028, 0.000672, 0, 0.000028],
    [0, 0, 0.000340, 0.000097, 0],
    [0, 0.000200, 0.000223, 0.000006, 0.002337],
    [0, 0, 0.010446, 0, 0],
    [0, 0, 0, 0.506099, 0]
    ])

#### Implementation of Viterbi Decoding 

In [72]:
def viterbi_tagger(A, B, pi, words, word2idx, tags):
    """Return the most probable tag sequence for `words` under the HMM (A, B, pi)."""

    # initialize the viterbi matrix and its first column
    T = len(words)
    N = len(tags)
    V = np.zeros(shape=(N, T))
    V[:,0] = B[:,word2idx[words[0]]] * pi
    # integer array: backptr[j,t] is the best previous state for state j at step t
    backptr = np.zeros(shape=(N, T), dtype=int)

    #np.set_printoptions(precision=2)
    # apply recurrence relation
    for t in range(1,T):
        for j in range(N):
            P_wt_tj = B[j,word2idx[words[t]]]
            max_i = 0
            max_v = 0
            for i in range(N):
                P_tj_ti = A[i,j]
                v_t_j = V[i,t-1] * P_tj_ti * P_wt_tj
                if v_t_j > max_v:
                    max_v = v_t_j
                    max_i = i
            V[j,t] = max_v
            backptr[j,t] = max_i


    # get the final state in the optimal path
    t_hat_T = int(np.argmax(V[:,-1]))
    t_hat = [t_hat_T]
    # trace back to get the rest of the optimal path
    t_hat_prev = t_hat_T
    for t in range(T-1, 0, -1):
        t_hat_prev = int(backptr[t_hat_prev, t])
        t_hat.append(t_hat_prev)
    # the path was collected back to front, so reverse it and map indices to tags
    t_hat = [tags[t] for t in reversed(t_hat)]
    #print(f"Word sequence; {words}")
    #print(f"Tag sequence: {t_hat}")
    #print(f"Max probability: {V[t_hat_T,-1]}")
    return t_hat

In [62]:
viterbi_tagger(A,B,pi,words,word2idx,tags)

Word sequence; ['Janet', 'will', 'back', 'the', 'bill']
Tag sequence: ['NNP', 'MD', 'VB', 'DT', 'NN']
Max probability: 2.013570710221386e-15


['NNP', 'MD', 'VB', 'DT', 'NN']
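The maximum probability above is already on the order of $10^{-15}$ for a five-word sentence, so for long sentences the product of probabilities may underflow. A common remedy is to work with log probabilities and replace products by sums. The function `viterbi_log` below is a hedged sketch of that variant, not the implementation used in this notebook; it takes observation indices rather than words and is demonstrated on a made-up two-state HMM.

```python
import numpy as np

def viterbi_log(A, B, pi, obs_idx):
    """Viterbi decoding in log space; returns (state indices, log-probability)."""
    with np.errstate(divide="ignore"):   # log(0) = -inf is intended here
        log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)
    N, T = A.shape[0], len(obs_idx)
    log_v = np.full((N, T), -np.inf)
    backptr = np.zeros((N, T), dtype=int)
    log_v[:, 0] = log_pi + log_B[:, obs_idx[0]]
    for t in range(1, T):
        # scores[i, j] = log v_{t-1}(i) + log P(t_j | t_i)
        scores = log_v[:, t - 1][:, None] + log_A
        backptr[:, t] = scores.argmax(axis=0)
        log_v[:, t] = scores.max(axis=0) + log_B[:, obs_idx[t]]
    # follow the backpointers from the best final state
    path = [int(log_v[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1], float(log_v[:, -1].max())

# Hypothetical two-state HMM; all numbers are made up for illustration.
toy_pi = np.array([0.6, 0.4])
toy_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
toy_B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])

toy_path, toy_log_prob = viterbi_log(toy_A, toy_B, toy_pi, [0, 1, 2])
print(toy_path, np.exp(toy_log_prob))
```

Because the logarithm is monotonic, the decoded path should match the one found with plain probabilities; only the stored scores change.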

#### Supervised training of Viterbi POS tagger on the Penn Treebank sample from NLTK

In [43]:
import nltk
from nltk.corpus import treebank
nltk.download('treebank')

[nltk_data] Downloading package treebank to /home/tanzid/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [65]:
# POS tagged corpus, 3914 tagged sentences
corpus = treebank.tagged_sents()
len(corpus)

3914

In [66]:
corpus[0]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [67]:
# let's get the vocabulary and tag set
# (the vocabulary is built from the whole corpus, so every test word has an index)
vocab = sorted({word for s in corpus for word, _ in s})
start_tag = "<s>"
tags = [start_tag] + sorted({tag for s in corpus for _, tag in s})

word2idx = {w:i for i,w in enumerate(vocab)}
tag2idx = {t:i for i,t in enumerate(tags)}

print(vocab)
print(tags)

['<s>', '#', '$', "''", ',', '-LRB-', '-NONE-', '-RRB-', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']


In [68]:
# create train test splits (keep only last ten sentences for test set)
corpus_train = corpus[:-10]
corpus_test = corpus[-10:]

#### Training involves maximum likelihood estimation (with add-k smoothing) of transition and emission probabilities

In [86]:
def viterbi_train(corpus_train, vocab, word2idx, tags, tag2idx, start_tag="<s>", k=0.1):
    # estimate the transition and emission probabilities from counts
    N = len(tags)
    V = len(vocab)

    # apply add-k smoothing to the counts (to avoid zero probabilities):
    # every cell starts at k instead of 0
    A = np.full(shape=(N,N), fill_value=k, dtype=float)
    B = np.full(shape=(N,V), fill_value=k, dtype=float)

    # store counts from training corpus
    for sentence in corpus_train:
        last_tag = start_tag
        for word,tag in sentence:
            A[tag2idx[last_tag], tag2idx[tag]] += 1
            B[tag2idx[tag], word2idx[word]] += 1
            last_tag = tag

    # keep a copy of the smoothed word-tag counts for the unigram baseline
    word_tag_counts = B.copy()

    # normalize the counts (each row sums to 1) to get probabilities
    A = A / A.sum(axis=1, keepdims=True)
    B = B / B.sum(axis=1, keepdims=True)
    # the initial state distribution is the transition row of the start tag
    pi = A[tag2idx[start_tag]]

    return pi, A, B, word_tag_counts
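To see what add-k smoothing does to a row of counts, here is a small worked sketch on a made-up count table. The array `raw_counts` is hypothetical and `k_demo` simply mirrors the value of `k` used in the training function.

```python
import numpy as np

k_demo = 0.1                               # same value as k in the training function
raw_counts = np.array([[3.0, 0.0, 1.0],    # hypothetical counts for one observed tag
                       [0.0, 0.0, 0.0]])   # a tag that never emitted anything

smoothed = raw_counts + k_demo
probs = smoothed / smoothed.sum(axis=1, keepdims=True)

print(probs)
```

The zero count in the first row becomes $0.1 / 4.3$ rather than 0, and the all-zero row turns into a uniform distribution. The same happens to the `<s>` row of the emission matrix, since no word is ever tagged `<s>`.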

In [87]:
pi, A, B, word_tag_counts = viterbi_train(corpus_train, vocab, word2idx, tags, tag2idx)

In [85]:
# now let's compute the predicted tags for the test sentences
num_correct = 0
num_total = 0
for s in corpus_test:
    words = [elem[0] for elem in s]
    gold_tags = [elem[1] for elem in s]
    pred_tags = viterbi_tagger(A, B, pi, words, word2idx, tags)
    num_correct += sum([gold_tags[i]==pred_tags[i] for i in range(len(words))])
    num_total += len(gold_tags)
    
    print(f"\nTest sentence: {words}")
    print(f"Actual POS tags   : {gold_tags}")
    print(f"Predicted POS tags: {pred_tags}")
    
print(f"\nAccuracy: {num_correct/num_total}")    


Test sentence: ['A', 'White', 'House', 'spokesman', 'said', 'last', 'week', 'that', 'the', 'president', 'is', 'considering', '*-1', 'declaring', 'that', 'the', 'Constitution', 'implicitly', 'gives', 'him', 'the', 'authority', 'for', 'a', 'line-item', 'veto', '*-2', 'to', 'provoke', 'a', 'test', 'case', '.']
Actual POS tags   : ['DT', 'NNP', 'NNP', 'NN', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'NN', 'VBZ', 'VBG', '-NONE-', 'VBG', 'IN', 'DT', 'NNP', 'RB', 'VBZ', 'PRP', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', '-NONE-', 'TO', 'VB', 'DT', 'NN', 'NN', '.']
Predicted POS tags: ['DT', 'NNP', 'NNP', 'NN', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'NN', 'VBZ', 'VBG', '-NONE-', 'VBG', 'IN', 'DT', 'NNP', 'NNP', 'VBZ', 'PRP', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', '-NONE-', 'TO', 'VB', 'DT', 'NN', 'NN', '.']

Test sentence: ['But', 'the', 'two', 'legal', 'experts', ',', 'responding', 'to', 'an', 'inquiry', 'by', 'Sen.', 'Edward', 'Kennedy', '-LRB-', 'D.', ',', 'Mass.', '-RRB-', ',', 'wrote', 'in', 'a', 'joint', 'letter', 't

#### Not bad! The Viterbi POS tagger achieves 92% accuracy on the ten held-out test sentences.

#### Now we can compare the Viterbi results to a simpler baseline tagger that always assigns to each word the tag it was most frequently observed with in the training data. This is called the `unigram tagger` because we're approximating $P(t_1,..,t_T) = \prod_{i=1}^T P(t_i)$

So instead of $\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } \prod_{i=1}^T P(t_i|t_{i-1})  P(w_i|t_i)$, we have the following

$\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } \prod_{i=1}^T P(t_i)  P(w_i|t_i) = \text{argmax}_{t_{1:T}} \prod_{i=1}^T  P(w_i, t_i)$

$\implies  \hat{t}_i  =  \text{argmax}_{t_i} P(w_i, t_i)$
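The last step says that maximizing $P(t_i) P(w_i|t_i)$ is the same as picking the tag with the largest joint count for the word. A quick numerical check on a made-up count table (`toy_counts` is hypothetical; rows are tags, columns are words):

```python
import numpy as np

# Hypothetical word-tag count table: rows are tags, columns are words.
toy_counts = np.array([[8.0, 1.0, 0.0],
                       [2.0, 5.0, 4.0]])
total = toy_counts.sum()

p_tag = toy_counts.sum(axis=1) / total                                 # P(t)
p_word_given_tag = toy_counts / toy_counts.sum(axis=1, keepdims=True)  # P(w | t)
joint = p_tag[:, None] * p_word_given_tag                              # P(w, t)

# The product collapses to count(w, t) / total, so the argmax over tags
# can be read directly off the raw counts.
best_tag_per_word = joint.argmax(axis=0)
print(np.allclose(joint, toy_counts / total))
print(best_tag_per_word, toy_counts.argmax(axis=0))
```

This is why a tagger of this kind can work directly on a table of word-tag counts without normalizing anything.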

In [96]:
def unigram_tagger(word_tag_counts, words, word2idx, tags):
    
    # for each word, pick the tag with the largest (smoothed) count for that word
    pred_tags = []
    for w in words:
        t_hat = tags[np.argmax(word_tag_counts[:,word2idx[w]])]
        pred_tags.append(t_hat)
    return pred_tags    

In [97]:
# evaluating the unigram tagger
num_correct = 0
num_total = 0
for s in corpus_test:
    words = [elem[0] for elem in s]
    gold_tags = [elem[1] for elem in s]
    pred_tags = unigram_tagger(word_tag_counts, words, word2idx, tags)
    num_correct += sum([gold_tags[i]==pred_tags[i] for i in range(len(words))])
    num_total += len(gold_tags)
    
    print(f"\nTest sentence: {words}")
    print(f"Actual POS tags   : {gold_tags}")
    print(f"Predicted POS tags: {pred_tags}")

    
print(f"\nAccuracy: {num_correct/num_total}")    


Test sentence: ['A', 'White', 'House', 'spokesman', 'said', 'last', 'week', 'that', 'the', 'president', 'is', 'considering', '*-1', 'declaring', 'that', 'the', 'Constitution', 'implicitly', 'gives', 'him', 'the', 'authority', 'for', 'a', 'line-item', 'veto', '*-2', 'to', 'provoke', 'a', 'test', 'case', '.']
Actual POS tags   : ['DT', 'NNP', 'NNP', 'NN', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'NN', 'VBZ', 'VBG', '-NONE-', 'VBG', 'IN', 'DT', 'NNP', 'RB', 'VBZ', 'PRP', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', '-NONE-', 'TO', 'VB', 'DT', 'NN', 'NN', '.']
Predicted POS tags: ['DT', 'NNP', 'NNP', 'NN', 'VBD', 'JJ', 'NN', 'IN', 'DT', 'NN', 'VBZ', 'VBG', '-NONE-', 'VBG', 'IN', 'DT', 'NNP', '<s>', 'VBZ', 'PRP', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', '-NONE-', 'TO', 'VB', 'DT', 'NN', 'NN', '.']

Test sentence: ['But', 'the', 'two', 'legal', 'experts', ',', 'responding', 'to', 'an', 'inquiry', 'by', 'Sen.', 'Edward', 'Kennedy', '-LRB-', 'D.', ',', 'Mass.', '-RRB-', ',', 'wrote', 'in', 'a', 'joint', 'letter', 't

#### The unigram tagger performs almost as well as Viterbi and achieves 90% accuracy, making it a very strong baseline. This is because most word types occur predominantly with a single POS tag, i.e. there is little ambiguity, so the probability of the dominant tag is very high and that of the other tags very low. Simply predicting the dominant tag for a word is usually good enough. Note that for a word with no training counts (such as 'implicitly' above), every tag has the same smoothed count, so `np.argmax` returns index 0 and the tagger predicts `<s>`.
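One possible way to handle words without training counts is to fall back to a fixed tag instead of whatever index `np.argmax` happens to return on a tie. The sketch below is an assumption-laden variant, not the tagger evaluated above: `unigram_tagger_with_fallback` and its `fallback_tag` parameter are hypothetical, and it expects raw (unsmoothed) counts so that an all-zero column identifies an unseen word.

```python
import numpy as np

def unigram_tagger_with_fallback(raw_counts, words, word2idx, tags, fallback_tag):
    """Most-frequent-tag baseline; words without counts get `fallback_tag`."""
    pred_tags = []
    for w in words:
        idx = word2idx.get(w)
        # unknown word, or a known word that never occurred in training
        if idx is None or raw_counts[:, idx].sum() == 0:
            pred_tags.append(fallback_tag)
        else:
            pred_tags.append(tags[int(raw_counts[:, idx].argmax())])
    return pred_tags

# Hypothetical toy data: rows are tags, columns are words.
toy_tags = ["<s>", "DT", "NN"]
toy_word2idx = {"the": 0, "bill": 1, "implicitly": 2}
toy_raw_counts = np.array([[0, 0, 0],
                           [5, 0, 0],
                           [0, 3, 0]])

toy_pred = unigram_tagger_with_fallback(
    toy_raw_counts, ["the", "bill", "implicitly", "veto"],
    toy_word2idx, toy_tags, fallback_tag="NN",
)
print(toy_pred)
```

Using `NN` as the fallback is only an illustrative choice; the most frequent tag in the training data would be a natural alternative.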