In this part, we wrote several HMM algorithms for NLP.

## Words and POS Tags

We used the Brown Corpus, downloaded through NLTK. The corpus provides POS tags for its words and punctuation, which we used to train our HMM.

In [1]:
from nltk.corpus import brown
fileids = brown.fileids()
tagged_words = brown.tagged_words()
print 'Number of files:', len(fileids)
print 'Number of words:', len(tagged_words)

Number of files: 500
Number of words: 1161192


The corpus contains 500 files. To validate our model, we need to split the data into training and testing parts. We randomly sampled 80 files as the testing part and used the remaining files as the training part.

In [5]:
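import random
random.seed(100)
fileids_testing = random.sample(fileids, 80)
# Build a new list: `fileids_training = fileids` would only alias `fileids`,
# so removing items would also shrink `fileids` each time the cell is run.
testing_set = set(fileids_testing)
fileids_training = [item for item in fileids if item not in testing_set]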

In [6]:
tagged_words_training = brown.tagged_words(fileids_training)
tagged_words_testing = brown.tagged_words(fileids_testing)
print 'Length of training words:', len(tagged_words_training)
print 'Length of testing words:', len(tagged_words_testing)

Length of training words: 787989
Length of testing words: 186742


In [19]:
from collections import Counter
words_training = [item[0] for item in tagged_words_training]
tags_training = [item[1] for item in tagged_words_training]
words_testing = [item[0] for item in tagged_words_testing]
tags_testing = [item[1] for item in tagged_words_testing]

In [16]:
words_training_count = Counter(words_training)
print 'Number of different words', len(words_training_count.keys())
words_training_count.most_common(20)

 Number of different words 45060


[(u'the', 42810),
 (u',', 39463),
 (u'.', 33499),
 (u'of', 24710),
 (u'and', 18907),
 (u'to', 17613),
 (u'a', 14796),
 (u'in', 13307),
 (u'that', 7109),
 (u'is', 7022),
 (u'was', 6564),
 (u'for', 5990),
 (u'``', 5751),
 (u"''", 5743),
 (u'The', 4961),
 (u'with', 4741),
 (u'it', 4595),
 (u'as', 4520),
 (u'be', 4477),
 (u'on', 4288)]

As expected, the most frequent tokens are stop words and punctuation.

In [17]:
tags_training_count = Counter(tags_training)
print 'Number of different tags', len(tags_training_count.keys())
tags_training_count.most_common(20)

Number of different tags 421


[(u'NN', 104440),
 (u'IN', 82070),
 (u'AT', 66785),
 (u'JJ', 44002),
 (u'.', 40966),
 (u',', 39323),
 (u'NNS', 37804),
 (u'CC', 25524),
 (u'RB', 24779),
 (u'NP', 23128),
 (u'VB', 23024),
 (u'VBN', 19809),
 (u'VBD', 17203),
 (u'CS', 15073),
 (u'VBG', 12086),
 (u'PPS', 11966),
 (u'PP$', 11081),
 (u'TO', 10301),
 (u'PPSS', 9351),
 (u'CD', 9197)]

There are 421 different tags in the training data. We need to check whether every tag in the testing data also occurs in the training data.
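The same check can be expressed as a set difference. The sketch below uses two invented tag lists (`train_tags` and `test_tags` are hypothetical names, not the corpus variables) and is written so that it runs under both Python 2 and Python 3.

```python
# Invented tag lists for illustration only.
train_tags = ['NN', 'AT', 'VB', 'NN']
test_tags = ['NN', 'FW-BE', 'AT', 'EX-HL']

# Tags that occur in the test list but never in the training list.
unseen_tags = sorted(set(test_tags) - set(train_tags))
```

Here `unseen_tags` is `['EX-HL', 'FW-BE']`.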

In [25]:
tags_testing_count = Counter(tags_testing)
print 'Number of different tags', len(tags_testing_count.keys())
new_tags = []
for key in tags_testing_count.keys():
    if key not in tags_training_count:
        new_tags.append(key)
        print 'key', key, 'is not in training tags.'

Number of different tags 291
key PPSS+BER-TL is not in training tags.
key EX-HL is not in training tags.
key CC-TL-HL is not in training tags.
key HV-HL is not in training tags.
key JJ$-TL is not in training tags.
key FW-*-TL is not in training tags.
key ,-TL is not in training tags.
key NR-TL-HL is not in training tags.
key *-TL is not in training tags.
key FW-PP$ is not in training tags.
key FW-BE is not in training tags.
key FW-CD-TL is not in training tags.
key WRB+IN is not in training tags.
key WRB+DOZ is not in training tags.
key HV-TL is not in training tags.
key FW-RB-TL is not in training tags.
key WRB+DO is not in training tags.
key HV+TO is not in training tags.
key FW-VBD is not in training tags.
key FW-JJT is not in training tags.
key FW-QL is not in training tags.
key JJ-TL-NC is not in training tags.
key FW-DT+BEZ is not in training tags.
key FW-AT-HL is not in training tags.
key PN+HVD is not in training tags.
key BEN-TL is not in training tags.


The testing data contains 26 tags that never occur in the training data. We removed the tokens carrying these tags so that every tag in the testing data also appears in the training data.

In [26]:
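# Filter (word, tag) pairs jointly so that words and tags stay aligned.
# list.remove(word) would delete the first matching word, which may not be
# the occurrence that carries the unseen tag.
new_tags_set = set(new_tags)
kept = [(word, tag) for word, tag in tagged_words_testing if tag not in new_tags_set]
words_testing = [word for word, tag in kept]
tags_testing = [tag for word, tag in kept]

print 'Length of Testing words:', len(words_testing)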

Length of Testing words: 186705


Only 37 word-tag pairs (186742 - 186705) were removed.

In [27]:
words_testing_count = Counter(words_testing)
tags_testing_count = Counter(tags_testing)

## Parameter Estimation

We consider a **bigram model** here. Define c(u, s) to be the number of times the sequence of two states (u, s) is seen in the training data, and c(s) to be the number of times the state s is seen in the corpus. Finally, define c(s - x) to be the number of times state s is seen paired with observation x in the corpus: for example, c(N - dog) is the number of times the word dog is seen paired with the tag N. The transition and emission probabilities are then estimated as:

*q(s|u) = c(u, s)/c(u)*

*e(x|s) = c(s - x)/c(s)*

We need to calculate the transition matrix *q* and the emission matrix *e* from the training corpus described above.
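Before turning to the corpus, the two estimates can be illustrated on a tiny hand-made example. This is a minimal sketch: the six-token sequence `toy` and the helper names `q` and `e` are invented for illustration and are not part of the corpus code. It counts c(u, s), c(s) and c(s - x) with `Counter` and divides as in the formulas above.

```python
from collections import Counter

# Toy tagged sequence (invented for illustration): (word, tag) pairs.
toy = [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ'),
       ('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')]
toy_tags = [tag for _, tag in toy]

c_tag = Counter(toy_tags)                        # c(s)
c_bigram = Counter(zip(toy_tags, toy_tags[1:]))  # c(u, s)
c_pair = Counter(toy)                            # c(s - x), keyed as (word, tag)

def q(s, u):
    """Transition estimate q(s|u) = c(u, s) / c(u)."""
    return float(c_bigram[(u, s)]) / c_tag[u]

def e(x, s):
    """Emission estimate e(x|s) = c(s - x) / c(s)."""
    return float(c_pair[(x, s)]) / c_tag[s]
```

With this data `q('NN', 'AT')` is 1.0 and `e('dog', 'NN')` is 0.5. `q('AT', 'VBZ')` is 0.5 rather than 1.0, because c(VBZ) also counts the final token, which has no successor.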

In [96]:
observations = list(words_training_count.keys())  # observations are the words
states = list(tags_training_count.keys())  # states are the tags
states_len = len(states)
obs_len = len(observations)

First, we counted the frequency of each tag and of each tag bigram in the training data.

In [97]:
unigram_freq = {}
for st in states:
    unigram_freq[st] = tags_training_count[st]

In [98]:
bigram_freq = dict()
for i in range(len(tags_training) - 1):
    u = tags_training[i]
    s = tags_training[i+1]
    bigram = (u, s)
    bigram_freq[bigram] = bigram_freq.get(bigram, 0) + 1

Then we calculated the probability of each tag.

In [99]:
unigram_p = {}
total_tags = len(tags_training)  # total number of tag tokens, not the number of distinct tags
for st in states:
    unigram_p[st] = float(tags_training_count[st]) / total_tags

Next, we calculated the probabilities of the tag bigrams.

In [100]:
bigram_p = {}
for (u, s), v in bigram_freq.items():
    # q(s|u) = c(u, s) / c(u)
    if unigram_freq[u] != 0:
        bigram_p[(u, s)] = float(v) / unigram_freq[u]
    else:
        bigram_p[(u, s)] = 0

In [72]:
sum([ v == 0 for v in bigram_p.values()])

0

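None of the stored bigram probabilities is zero, which is expected because `bigram_p` only contains bigrams that were observed in the training data. Next, we calculated the conditional probabilities of the words given the tags.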

In [101]:
tag_word_freq = Counter(tagged_words_training)
len(tag_word_freq)

53233

In [102]:
tag_word_p = {}
for (word, tag), v in tag_word_freq.items():
    # e(x|s) = c(s - x) / c(s); keys are (word, tag) pairs
    if unigram_freq[tag] != 0:
        tag_word_p[(word, tag)] = float(v) / unigram_freq[tag]
    else:
        tag_word_p[(word, tag)] = 0

We need to transform the dictionaries into matrices for the later algorithms.
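As a small illustration of this conversion, the sketch below fills a matrix from a dictionary keyed by (word, tag) and then checks that each tag's column sums to 1. All names and values (`toy_p`, `toy_words`, `toy_tags`, `word_index`, `tag_index`) are invented for the example; the lookup dictionaries are one possible way to find row and column positions quickly.

```python
import numpy as np

# Invented toy emission probabilities keyed as (word, tag).
toy_p = {('the', 'AT'): 1.0, ('dog', 'NN'): 0.5, ('cat', 'NN'): 0.5}
toy_words = ['the', 'dog', 'cat']
toy_tags = ['AT', 'NN']

# Look up row and column positions directly instead of searching the lists.
word_index = dict((w, i) for i, w in enumerate(toy_words))
tag_index = dict((t, j) for j, t in enumerate(toy_tags))

toy_emit = np.zeros((len(toy_words), len(toy_tags)))
for (word, tag), p in toy_p.items():
    toy_emit[word_index[word], tag_index[tag]] = p

# Each column holds e(x|s) for one tag s, so it should sum to 1.
column_sums = toy_emit.sum(axis=0)
```

Only the pairs present in the dictionary are visited, so the work grows with the number of observed pairs rather than with the full size of the matrix.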

In [103]:
import numpy as np
trans_p = np.zeros([states_len, states_len])#there are 421 unique tags
emit_p = np.zeros([obs_len, states_len])#there are 45060 unique words and punctuation
start_p = np.zeros(states_len)#probabilities of each tag

First, we filled in the probability of each tag, which serves as the start probability.

In [104]:
for i, st in enumerate(states):
    start_p[i] = unigram_p[st]

Next, we filled in the transition matrix.

In [114]:
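# trans_p[i, j] = q(states[j] | states[i]): rows index the previous tag, columns the next tag.
# Note: BigramViterbi below indexes trans_p[current, previous], i.e. the transpose of this layout.
for i, st in enumerate(states):
    for j, st2 in enumerate(states):
        bigram = (st, st2)
        if bigram in bigram_p:  # dictionary lookup instead of scanning a list of keys
            trans_p[i, j] = bigram_p[bigram]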

Finally, we filled in the emission matrix.

In [117]:
# Looping over every (word, tag) cell and searching a list of keys is far too slow
# (about 45060 x 421 list scans), so visit only the observed pairs instead.
obs_index = dict((w, i) for i, w in enumerate(observations))
state_index = dict((t, j) for j, t in enumerate(states))
for (word, tag), p in tag_word_p.items():
    emit_p[obs_index[word], state_index[tag]] = p

In [115]:
len(tag_word_p.keys())

53233

## Viterbi Algorithm

In [30]:
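import traceback

# Toy example: two hidden states and an observation sequence of length 3
states = ('Healthy', 'Fever')
observations = ('normal', 'cold', 'dizzy')
start_p = np.array([0.6, 0.4])  # start_p[state]
stop_p = np.array([1, 1])  # stop_p[state]; 1 means no end-of-sequence penalty
trans_p = np.array([[0.7, 0.4], [0.3, 0.6]])  # trans_p[current, previous]
emit_p = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])  # emit_p[observation, state]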

In [27]:
# Viterbi algorithm for the bigram HMM
# start_p: start probabilities, start_p[state]
# trans_p: transition probabilities, trans_p[current, previous]
# emit_p: emission probabilities; row i holds the emissions of the i-th observation
# Returns pi (best path probabilities) and bp (back-pointers); states are numbered from 1.
def BigramViterbi(start_p, trans_p, emit_p):
    try:
        y_len = trans_p.shape[0]
        x_len = emit_p.shape[0]
        pi = np.zeros([x_len+1, y_len+1])  # pi[i, j]: max probability of a path ending in state j at position i
        bp = np.zeros([x_len+1, y_len+1], dtype=int)  # back-pointers; integers so they can be used as indices
        # Base case: every path begins in the dummy start state 0
        pi[0, 0] = 1
        # Find the max probability
        for i in range(1, x_len+1):
            for j in range(1, y_len+1):
                if i == 1:  # first position: start probability times emission
                    pi[i, j] = start_p[j-1] * emit_p[i-1, j-1]
                else:
                    for l in range(y_len):
                        pi_new = pi[i-1, l+1] * trans_p[j-1, l] * emit_p[i-1, j-1]
                        if pi[i, j] < pi_new:
                            pi[i, j] = pi_new
                            bp[i, j] = l + 1
        return pi, bp
    except Exception:
        traceback.print_exc()  # print_exc() prints the traceback itself and returns None
        return None

# Recover the best final state and the state chain from pi and bp.
# state_list is returned in reverse order (last position first), with states numbered from 1.
def FindBestState(stop_p, pi, bp):
    try:
        x_len = pi.shape[0]
        y_len = pi.shape[1]
        max_prob = 0  # max probability
        max_st = 0  # index of the final state with the max probability
        state_list = []
        # Find the max probability
        for j in range(1, y_len):
            prob = pi[x_len-1, j] * stop_p[j-1]
            if prob > max_prob:
                max_prob = prob
                max_st = j
        state_list.append(max_st)
        # Follow the back-pointers to recover the earlier states
        pre_st = max_st
        for i in range(x_len-1, 0, -1):
            pre_st = int(bp[i, pre_st])
            if i > 1: state_list.append(pre_st)
        return max_prob, state_list
    except Exception:
        traceback.print_exc()
        return None

In [29]:
pi, bp = BigramViterbi(start_p, trans_p, emit_p)
pi, bp

(array([[ 1.     ,  0.     ,  0.     ],
        [ 0.     ,  0.3    ,  0.04   ],
        [ 0.     ,  0.084  ,  0.027  ],
        [ 0.     ,  0.00588,  0.01512]]), array([[ 0.,  0.,  0.],
        [ 0.,  0.,  0.],
        [ 0.,  1.,  1.],
        [ 0.,  1.,  1.]]))
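As a sanity check on the `pi` table printed above, the best path can also be found by scoring every state sequence directly. This is a minimal sketch, not part of the original notebook: `brute_force_best` is a hypothetical helper, and it assumes the same layout as the toy example (`trans_p[current, previous]`, one row of `emit_p` per position). It is only feasible for tiny examples, since the number of sequences grows exponentially with the length.

```python
import itertools
import numpy as np

# Toy tables copied from the example above.
start_p = np.array([0.6, 0.4])
trans_p = np.array([[0.7, 0.4], [0.3, 0.6]])             # trans_p[current, previous]
emit_p = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])  # emit_p[position, state]

def brute_force_best(start_p, trans_p, emit_p):
    """Score every state sequence and return (best probability, best sequence)."""
    n_obs, n_states = emit_p.shape
    best_prob, best_seq = 0.0, None
    for seq in itertools.product(range(n_states), repeat=n_obs):
        prob = start_p[seq[0]] * emit_p[0, seq[0]]
        for i in range(1, n_obs):
            prob *= trans_p[seq[i], seq[i - 1]] * emit_p[i, seq[i]]
        if prob > best_prob:
            best_prob, best_seq = prob, seq
    return best_prob, best_seq

best_prob, best_seq = brute_force_best(start_p, trans_p, emit_p)
```

The best sequence is `(0, 0, 1)`, that is Healthy, Healthy, Fever, with probability 0.01512, which matches the largest entry in the last row of `pi`.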

In [21]:
y_len = pi.shape[1]
print y_len
FindBestState(stop_p, pi, bp)

3


[2, 1.0, 1.0]

In [22]:
range(1,1)

[]

In [37]:
# Viterbi algorithm for the trigram HMM
# Input x_len: the number of words
# Input y_len: the number of tags (index 0 is reserved for the start symbol *)
# Input q: q[s, u, v] is the probability of seeing the tag s immediately after the bigram of tags (u, v)
# Input e: e[i, s] is the probability of the observation at position i paired with state s
def TrigramViterbi(x_len, y_len, q, e):
    # Check the input; use `is None` because `== None` on a NumPy array compares element-wise
    if x_len is None or y_len is None or q is None or e is None:
        return None
    # pi[i, j, l]: max probability of a tag sequence of length i ending in the tag bigram (j, l)
    pi = np.zeros([x_len+1, y_len+1, y_len+1])
    for i in range(x_len+1):
        for j in range(y_len+1):
            for l in range(y_len+1):
                # Base case: pi(0, *, *) = 1
                if i == 0 and j == 0 and l == 0:
                    pi[i, j, l] = 1
                if i == 1 and j == 0:
                    pi[i, j, l] = q[l, 0, 0] * e[i, l]
                if i == 2:
                    pi[i, j, l] = pi[i-1, 0, j] * q[l, 0, j] * e[i, l]
                # Find the max probability for each pi(i, j, l)
                if i > 2 and j > 0 and l > 0:
                    for m in range(y_len):
                        pi_new = pi[i-1, m+1, j] * q[l, m+1, j] * e[i, l]
                        # pi starts at 0, so keep the candidate when it is larger
                        if pi[i, j, l] < pi_new:
                            pi[i, j, l] = pi_new
    return pi


In [38]:
pi = TrigramViterbi(x_len, y_len, q, e)



In [91]:
# Viterbi algorithm for the bigram HMM, with the start probabilities folded into trans_p
# trans_p: transition probabilities, trans_p[current, previous]; column 0 holds the start probabilities
# emit_p: emission probabilities; row i holds the emissions of the i-th observation
def BigramViterbi(trans_p, emit_p):
    y_len = trans_p.shape[0]
    x_len = emit_p.shape[0]
    pi = np.zeros([x_len+1, y_len+1])  # pi[i, j]: max probability of a path ending in state j at position i
    bp = np.zeros([x_len+1, y_len+1], dtype=int)  # back-pointers; integers so they can be used as indices
    # Base case: every path begins in the dummy start state 0
    pi[0, 0] = 1
    # Find the max probability
    for i in range(1, x_len+1):
        for j in range(1, y_len+1):
            for l in range(y_len+1):
                pi_new = pi[i-1, l] * trans_p[j-1, l] * emit_p[i-1, j-1]
                if pi[i, j] < pi_new:
                    pi[i, j] = pi_new
                    bp[i, j] = l
    return pi, bp

In [None]:
states = ('Healthy', 'Fever')
observations = ('normal', 'cold', 'dizzy')
start_probability = np.array([0.6, 0.4])
# Column 0 of trans_p repeats the start probabilities; columns 1-2 are trans_p[current, previous]
trans_p = np.array([[0.6, 0.7, 0.4], [0.4, 0.3, 0.6]])
emit_p = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
emit_p
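The functions above read row i of `emit_p` as the emissions of the i-th position, so the observation sequence is implicit. A minimal sketch of the same bigram recursion with an explicit sequence of observation indices follows. `viterbi_path` and the `toy_*` names are hypothetical, the layout `trans_p[current, previous]` and `emit_p[observation, state]` is assumed as in the first toy example, and log probabilities are used here (a choice not made in the notebook) to avoid underflow on long inputs.

```python
import numpy as np

def viterbi_path(obs, start_p, trans_p, emit_p):
    """Return (log probability, state indices) of the best path for obs.

    Assumed layout: trans_p[current, previous], emit_p[observation, state].
    """
    with np.errstate(divide='ignore'):  # log(0) = -inf is intended
        log_start = np.log(start_p)
        log_trans = np.log(trans_p)
        log_emit = np.log(emit_p)
    n_states = len(start_p)
    score = np.zeros((len(obs), n_states))
    back = np.zeros((len(obs), n_states), dtype=int)
    score[0] = log_start + log_emit[obs[0]]
    for t in range(1, len(obs)):
        # cand[cur, prev]: score of reaching state cur at step t via state prev
        cand = log_trans + score[t - 1][np.newaxis, :]
        back[t] = cand.argmax(axis=1)
        score[t] = cand.max(axis=1) + log_emit[obs[t]]
    # Follow the back-pointers from the best final state, then reverse.
    path = [int(score[-1].argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return float(score[-1].max()), path

toy_start = np.array([0.6, 0.4])
toy_trans = np.array([[0.7, 0.4], [0.3, 0.6]])
toy_emit = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
log_prob, path = viterbi_path([0, 1, 2], toy_start, toy_trans, toy_emit)
```

For the observation indices `[0, 1, 2]` (normal, cold, dizzy) the path is `[0, 0, 1]` in chronological order, and `np.exp(log_prob)` is 0.01512.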