## Introduction
This tutorial introduces Hidden Markov Models, and specifically the Viterbi Algorithm, in the context of Part-of-Speech (POS) tagging for English sentences. I believe POS tagging fits in the data pre-processing part of the data science pipeline for text-based applications. After tokenization, stop-word removal and lemmatization, POS tagging assigns a tag to each token. Building on this, chunking and parse-tree construction may be carried out for further feature generation. These features can then be used in the analysis/machine learning part of the pipeline, for example to develop question answering systems or for sentiment analysis.

The following is a flow chart of the intermediate stages in Sentiment Analysis:
<img src="http://fotiad.is/assets/images/nlp/sentiwordnet-flowchart.png">
There are various approaches to Part-of-Speech tagging, such as Hidden Markov Models (HMMs), dynamic programming and unsupervised learning. HMMs are a probabilistic approach to Part-of-Speech tagging.

## Contents

This tutorial will provide a walkthrough of what Hidden Markov Models are and how they can be applied to POS tagging. It will guide you through the Viterbi algorithm and its manual as well as library-based implementation. Finally, it ends with some concluding results and thoughts on the topic.

- [Data](#Data)
- [Hidden Markov Models](#Hidden-Markov-Models)
- [POS tagging and its relation to Hidden Markov Models](#POS-tagging-and-its-relation-to-Hidden-Markov-Models)
- [Viterbi Algorithm](#Viterbi-Algorithm)
- [Using the in-built hmm module in nltk](#Using-the-in-built-hmm-module-in-nltk)
- [Shortcomings of HMM of nltk with no parameter specifications](#Shortcomings-of-above-HMM-of-nltk)
- [Concluding notes](#Concluding-notes)

## Loading Required Libraries

In [3]:
import nltk
from nltk import word_tokenize
from nltk.tag import hmm
import numpy as np

## Data

For the purpose of this tutorial I am using the Brown Corpus available at:

http://www.sls.hawaii.edu/bley-vroman/brown_corpus.html

It has about a million words from 15 different categories of text samples. This text will be used in parameter estimation of the model.

Natural Language Processing involves tagsets which define a mapping from different parts of speech to distinct symbols. The Penn Treebank tagset is the most widely used tagset.

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Thus, to create training data, I tagged the sentences using the Penn tagset. The Brown Corpus has a tagged version too, with its own 87-tag tagset. However, since the Penn tagset has become a standard, I used the Stanford POS Tagger to tag the file brown_nolines.txt from the website.
The tutorial I followed:

http://new.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/

The tagged corpus looks like:

<img src="https://raw.githubusercontent.com/apoorva-nitsure/PDS_Tutorial_Data/master/text_tagged.png">

All data files can be found at: https://github.com/apoorva-nitsure/PDS_Tutorial_Data

## Hidden Markov Models 

Hidden Markov Models (HMMs) are a variant of Markov models, which are used to model sequential or temporal data. In probability theory, a Markov model is a stochastic model of a randomly changing system in which future states depend only on the current state, not on the events that occurred before it (the Markov property).
An HMM is a Markov chain for which the state is only partially observable. Observations are related to the state of the system, but are typically insufficient to precisely determine the state.

Source: https://en.wikipedia.org/wiki/Markov_model

Basically, an HMM is a weighted finite automaton with states, transitions between states and weighted paths, where each state can output observations belonging to a domain.
It relies on two assumptions:

- Naive Bayes Assumption (also known as output independence)

The output of a particular state depends only on that state.

- Markov Assumption

The next state depends only on the current state.

A typical definition of an HMM would include the following:

1. Set of states 
: (The states in the model)

2. Output Alphabet
: (The allowed set of observables)

3. Transition Probabilities
: (The probability of moving from one state to another)

4. Emission Probabilities
: (The probability of getting an observable from a state)

5. Initial State Probabilities
: (The probability that the sequence will begin in a particular state)
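
To make the five components concrete, here is a minimal sketch of a toy HMM written as plain Python dictionaries. The two tags, the three words and every probability value are made up for illustration; none of them are estimated from a corpus.

```python
# Toy HMM: every number below is an illustrative assumption, not corpus data.
states = ["DT", "NN"]                      # 1. set of states (tags)
alphabet = ["the", "dog", "cat"]           # 2. output alphabet (words)

transition = {                             # 3. P(next_state | current_state)
    "DT": {"DT": 0.1, "NN": 0.9},
    "NN": {"DT": 0.4, "NN": 0.6},
}
emission = {                               # 4. P(word | state)
    "DT": {"the": 0.8, "dog": 0.1, "cat": 0.1},
    "NN": {"the": 0.1, "dog": 0.5, "cat": 0.4},
}
initial = {"DT": 0.7, "NN": 0.3}           # 5. P(first state)


def is_distribution(probs, tol=1e-9):
    """Check that a dict of probabilities sums to 1 (within a tolerance)."""
    return abs(sum(probs.values()) - 1.0) < tol


# The initial distribution and each row of the transition and emission
# tables must be a proper probability distribution.
all_valid = (
    is_distribution(initial)
    and all(is_distribution(transition[s]) for s in states)
    and all(is_distribution(emission[s]) for s in states)
)
print(all_valid)
```

Each row is conditioned on a state, which is why the rows (and not the columns) are the ones that have to sum to one.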


## POS tagging and its relation to Hidden Markov Models 

Text is sequential data, much like a time series, and can therefore be modelled using HMMs. In the real world we observe only the words of a sentence. The words can be mapped to the observables and the parts of speech to the hidden states. The algorithm to find the best state sequence is called the **Viterbi Algorithm**.

 
Before that, we need an HMM with all its parameters estimated. This can be achieved using training data, i.e. the Brown Corpus.


**Step 1**: *Preprocessing data*

This step converts all the words in the tagged file into a list of tuples, where each tuple holds a word and its tag.
Each line is converted to lowercase so that the same word is not counted differently due to a difference in case. Note that nltk.tag.str2tuple returns the tag in uppercase, so the tags themselves end up in uppercase. The Brown Corpus is clean data with all words separated by a space, hence a split on spaces is enough to tokenize the text.




In [4]:
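# Forming a list of (word,tag) of all the sentences
temp = []
with open('brown_tagged.txt') as ip_file:
    for sent in ip_file:
        sent = sent.strip('\n')
        sent = sent.lower()
        # str2tuple splits 'word/tag' on the last '/' and uppercases the tag.
        # extend (rather than reassign) so the tokens of every line are kept.
        temp.extend(nltk.tag.str2tuple(t) for t in sent.split())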

**Step 2**: *Getting the set of states, the output alphabet, transition probabilities, emission probabilities*

Since the states are tags and the observations are words, we build dictionaries in which each key is a particular word or tag and its value is the frequency of occurrence of that word or tag.

Transition probabilities are the probabilities of moving from one state to another, which for text means the probability of the next part of speech given the current part of speech. For example, a noun is very likely to follow an adjective, so that transition will have a high probability. The transitions dictionary stores next_tag|current_tag pairs from the available data as its keys, and the frequency of a particular tag following another tag as its values.
As the first tag has no previous tag, we model its predecessor as 'INIT', a default initial state.

The emission probabilities are the probabilities of getting particular observable values from different states. In POS tagging they correspond to the probability of seeing a particular word given a particular Part-of-Speech. For example, P(amazing|adjective) would be higher than P(amazing|verb). The word_plus_tag dictionary therefore has word|tag as its keys and the frequency of a word being tagged with a particular Part-of-Speech as its values.

P(next_tag|current_tag) = Frequency(next_tag follows current_tag) / Frequency(current_tag)


P(word|tag) = Frequency(word associated with that tag) / Frequency(tag)

tag_list is a list of all the tags.
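
As a worked example of the two formulas above, here is a small sketch that estimates them from five hand-tagged tokens. The tokens and the helper names `p_transition` and `p_emission` are hypothetical and only serve the illustration; the real estimates come from the tagged Brown Corpus.

```python
from collections import Counter

# Hypothetical tagged tokens; a real run would use the tagged Brown Corpus.
toy_tagged = [("the", "DT"), ("dog", "NN"), ("saw", "VBD"),
              ("the", "DT"), ("cat", "NN")]

tags = [tag for _, tag in toy_tagged]
tag_count = Counter(tags)
word_tag_count = Counter(toy_tagged)
# Count (current_tag, next_tag) pairs between neighbouring tokens.
transition_count = Counter(zip(tags, tags[1:]))


def p_transition(next_tag, current_tag):
    """P(next_tag | current_tag) = count(current -> next) / count(current)."""
    return transition_count[(current_tag, next_tag)] / tag_count[current_tag]


def p_emission(word, tag):
    """P(word | tag) = count(word tagged as tag) / count(tag)."""
    return word_tag_count[(word, tag)] / tag_count[tag]


print(p_transition("NN", "DT"))   # DT is followed by NN in 2 of its 2 occurrences
print(p_emission("dog", "NN"))    # 'dog' accounts for 1 of the 2 NN tokens
```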

In [5]:
word_frequency = {}
tag_frequency = {}
previous = 'INIT'
transitions = {}
word_plus_tag = {}
word_tag_list = []

for (word,tag) in temp:
    if word is not None and tag is not None:
        word_tag_list.append((word,tag))
        if word not in word_frequency.keys():
            word_frequency[word] = 0
        if tag not in tag_frequency.keys():
            tag_frequency[tag] = 0
        word_given_tag = word + '#|#' + tag
        if word_given_tag not in word_plus_tag.keys():
            word_plus_tag[word_given_tag] = 0
        t = tag + '#|#' + previous
        if t not in transitions.keys():
            transitions[t] = 0
        tag_frequency[tag] = tag_frequency[tag] + 1
        word_frequency[word] = word_frequency[word] + 1
        word_plus_tag[word_given_tag] = word_plus_tag[word_given_tag] + 1
        transitions[t] = transitions[t] + 1
        previous = tag

In [6]:
#To find transition and emission probabilities
emission_probabilities = {}
transition_probabilities = {}
for key,value in transitions.items():
    individual = key.split('#|#')
    if individual[1] != 'INIT':
        transition_probabilities[key] = transitions[key]/tag_frequency[individual[1]] 
for key,value in word_plus_tag.items():
    individual = key.split('#|#')
    emission_probabilities[key] = word_plus_tag[key]/tag_frequency[individual[1]] 
    
tag_list = list(tag_frequency.keys())

**Step 3**: *Getting the initial probabilities*

The initial probability of a state indicates how likely a sentence is to start with a particular part of speech. Most commonly sentences start with articles or nouns.

The procedure I have followed here is to create a new file from the tagged corpus with each sentence on a new line, and to count, for every tag, the number of sentences whose first word carries that tag.

P(tag|'INIT') = Frequency(Tag being the first tag in a sentence) / Total number of sentences

Tags that never occur as the first tag in a sentence are assigned a default probability.
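
The same calculation on a toy input can be sketched as follows. The three sentences and the tag list are hypothetical, and `toy_initial` is a name introduced only for this example.

```python
from collections import Counter

# Hypothetical tagged sentences, one per line, in word/TAG format.
toy_lines = [
    "the/DT dog/NN barked/VBD",
    "dogs/NNS bark/VBP",
    "the/DT cat/NN slept/VBD",
    "",
]
all_tags = ["DT", "NN", "NNS", "VBD", "VBP"]
default_prob = 0.0001  # same default value as used in this tutorial

first_tag_count = Counter()
sentence_count = 0
for line in toy_lines:
    tokens = line.split()
    if not tokens:
        continue  # skip blank lines
    sentence_count += 1
    # rpartition splits on the last '/', so a word containing '/' stays intact
    word, _, tag = tokens[0].rpartition("/")
    if word and tag:
        first_tag_count[tag] += 1

toy_initial = {}
for tag in all_tags:
    if tag in first_tag_count:
        toy_initial[tag] = first_tag_count[tag] / sentence_count
    else:
        toy_initial[tag] = default_prob  # tag never starts a sentence

print(toy_initial)
```

Two of the three sentences start with DT and one with NNS, so those tags get 2/3 and 1/3, while every other tag falls back to the default.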

In [7]:
default_prob = 0.0001
initial = {}
sent_count = 0
with open('brown_tagged_new.txt') as ip_file:
    for sent in ip_file:
        tokens = sent.strip('\n').split()
        if len(tokens) >= 1:
            sent_count = sent_count + 1
            if '/' in tokens[0]:
                # Split on the last '/' and uppercase the tag so that the keys
                # match the uppercase tags stored in tag_frequency
                first_tag = tokens[0].rsplit('/', 1)[1].upper()
                if first_tag != '':
                    if first_tag not in initial.keys():
                        initial[first_tag] = 0
                    initial[first_tag] = initial[first_tag] + 1

# Convert the counts into probabilities
for key,val in initial.items():
    initial[key] = initial[key] / sent_count
# Tags never seen at the start of a sentence get the default probability
for tag in tag_frequency.keys():
    if tag not in initial.keys():
        initial[tag] = default_prob

## Viterbi Algorithm

The Viterbi Algorithm is a dynamic programming based algorithm to find the most likely hidden state sequence having seen the observable data.

This algorithm generates a path $X=(x_{1},x_{2},\ldots ,x_{T})$, which is a sequence of states $x_{t}\in S=\{s_{1},s_{2},\dots ,s_{N}\}$ that most likely generates the observations $Y=(y_{1},y_{2},\ldots ,y_{T})$, where each $y_{t}$ is a symbol from the output alphabet.

Source: https://en.wikipedia.org/wiki/Viterbi_algorithm

The algorithm style that I have referred to is: https://www.cse.iitb.ac.in/~cs626-460-2012/, https://github.com/phirwe/CSCI5832-Natural-Language-Processing/blob/master/PA2/hirve-poorwa-assgn2.py

Here T = length of the observation sequence and N = the number of possible states.

Data Structures used:

Sequence_score array: Maintains the probability score of the best sequence ending in each tag at each word. Dimensions: (N * T)
Sequence_score[i, t] is the product of the probability of the t'th word being emitted from state i and the score of the most probable path through the first t-1 words whose next tag is i.

Back_pointer array: Used to recover the path with the highest probability. Back_pointer[i, t] stores the index of the previous tag on the best path that ends in tag i at word t. Dimensions: (N * T)

The Viterbi Algorithm consists of three steps (initialization, iteration and sequence identification), followed here by a note on smoothing:

***Initialization***

This step scores the possible starting states of the sequence. For every tag, the sequence score for the first word is P(word|tag) * P(tag|'INIT').

***Iteration***

In this step we compute the maximum possible sequence score for each word and tag combination. That is, for each word and tag we loop through all tags and find the tag tagprev which gives the maximum product P(tag|tagprev) * sequencescore[tagprev,word-1]. This tagprev is stored in the backpointer array. The final value of sequencescore[tag,word] is the product of P(word|tag) and that maximum, P(tag|tagprev) * sequencescore[tagprev,word-1].

This step finds the best sequence for every word and every possible tag. Thus for every word and tag we find the best possible sequence leading up to that tag and store its previous tag as a reference in the backpointer array.

***Sequence Identification***

Once all the sequence scores have been computed, we must recover the path that gives the maximum score. For the last word, the sequence score of each tag represents the best path ending in that tag, so we choose the tag with the maximum sequence score. The backpointer array stores, for each word and tag, the previous tag on the best path leading to it. Following the backpointer of the chosen tag therefore gives the tag of the previous word, whose own backpointer gives the tag before that, and so on until we reach the tag for the first word.

Now we are ready with our tag sequence! But wait, it's in reverse order, so reverse it to get the final answer!
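
The three steps can be traced by hand on a very small model. The following is a minimal pure-Python sketch, separate from the implementation used in this tutorial. The function `toy_viterbi`, its `floor` parameter (a default probability for missing entries) and all of the probability values are assumptions made up for the example.

```python
def toy_viterbi(words, states, initial, transition, emission, floor=1e-4):
    """Return the most likely state sequence for a list of words."""
    if not words:
        return []
    # Initialization: scores[t][s] is the best score of a path ending in s at t.
    scores = [{s: initial.get(s, floor) * emission[s].get(words[0], floor)
               for s in states}]
    back = [{}]  # back[t][s] is the previous state on that best path
    # Iteration: extend the best paths one word at a time.
    for t in range(1, len(words)):
        scores.append({})
        back.append({})
        for s in states:
            best_prev = max(
                states,
                key=lambda p: scores[t - 1][p] * transition[p].get(s, floor),
            )
            scores[t][s] = (scores[t - 1][best_prev]
                            * transition[best_prev].get(s, floor)
                            * emission[s].get(words[t], floor))
            back[t][s] = best_prev
    # Sequence identification: best final state, then follow the back pointers.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Illustrative parameters only; none of these values come from the Brown Corpus.
states = ["DT", "NN", "VB"]
initial = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
transition = {
    "DT": {"NN": 0.9, "VB": 0.1},
    "NN": {"NN": 0.3, "VB": 0.6, "DT": 0.1},
    "VB": {"DT": 0.6, "NN": 0.4},
}
emission = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"barks": 0.6},
}
print(toy_viterbi(["the", "dog", "barks"], states, initial, transition, emission))
```

With these numbers the best path is DT, NN, VB: the final score for VB (0.08748) beats NN (0.00729), and the back pointers then lead from VB to NN to DT.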

***Smoothing***

Not all words/transitions in the test data may be present in the training data. To handle unseen words/transitions, a default probability can be assigned. Here it is a parameter to my function.
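
The idea can be shown in isolation with a dictionary lookup that falls back to the default. This is only a sketch: `smoothed_probability` is a hypothetical helper, and the emission value for hello is invented.

```python
def smoothed_probability(table, key, smoothing_parameter=0.0001):
    """Return table[key] if the event was seen in training, else a small default."""
    return table.get(key, smoothing_parameter)


# Hypothetical emission table using the word#|#tag key format of this tutorial.
toy_emissions = {"hello#|#UH": 0.02}

seen = smoothed_probability(toy_emissions, "hello#|#UH")
unseen = smoothed_probability(toy_emissions, "hello#|#DT")
print(seen, unseen)
```

Without the fallback an unseen event would have probability zero, which would make the score of every path through it zero as well. With a constant default the values for a tag no longer sum to exactly one, which is acceptable here because the scores are only used to rank paths.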

In [8]:
def viterbi(sent,smoothing_parameter):
    """Return the most likely list of (word, tag) tuples for a sentence."""
    words = word_tokenize(sent.strip('\n').lower())
    T = len(words)
    N = len(tag_list)
    if T == 0:
        return []
    sequence_score = np.zeros((N, T))
    # -1 means "no previous tag"; it stays in place for the first word
    back_pointer = np.full((N, T), -1, dtype=int)
    #initialization step: score every tag as the tag of the first word
    for i in range(N):
        tag = tag_list[i]
        word_cond = words[0] + "#|#" + tag
        word_given_tag = emission_probabilities.get(word_cond, smoothing_parameter)
        # initial is keyed by tag, so look the tag up directly
        tag_state = initial.get(tag, smoothing_parameter)
        sequence_score[i, 0] = word_given_tag * tag_state
    #iteration step
    for t in range(1, T):
        for i in range(N):
            word_given_tag = words[t] + "#|#" + tag_list[i]
            conditional_prob = emission_probabilities.get(word_given_tag, smoothing_parameter)
            max_score = -np.inf
            max_index = -1
            for j in range(N):
                transition = tag_list[i] + "#|#" + tag_list[j]
                value = transition_probabilities.get(transition, smoothing_parameter)
                if max_score <= sequence_score[j, t - 1] * value:
                    max_score = sequence_score[j, t - 1] * value
                    max_index = j
            sequence_score[i, t] = max_score * conditional_prob
            back_pointer[i, t] = max_index
    #sequence identification: start at the best final tag and follow the
    #back pointers to the first word, then reverse the path
    all_tags = [int(np.argmax(sequence_score[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        all_tags.append(int(back_pointer[all_tags[-1], t]))
    all_tags.reverse()
    return [(word, tag_list[i]) for word, i in zip(words, all_tags)]

## Testing the algorithm with sentences

Let's test the algorithm with a sentence whose words mostly occur in the training data, i.e. the Brown Corpus.

In [10]:
ans = viterbi("The Fulton County Grand Jury said an investigation of Atlanta 's recent primary election produced `` no evidence '' of any irregularity .",0.0001)
print(ans)


[('the', 'DT'), ('fulton', 'NNP'), ('county', 'NNP'), ('grand', 'NNP'), ('jury', 'NNP'), ('said', 'VBD'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ('atlanta', 'NNP'), ("'s", 'POS'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('of', 'IN'), ('any', 'DT'), ('irregularity', 'NN'), ('.', '.')]


## Using the in-built hmm module in nltk

The hmm model in nltk can be run in both supervised and unsupervised fashion.
The supervised version assumes you have training data to train the model.
In this case we do and this is fed to the trainer.

The training data is a list of sentences, where each sentence is a list of (word,tag) tuples. Thus train_data is a list of lists. Here the whole tagged corpus stored in temp is passed as a single sequence.

Here I am training the HiddenMarkovModelTrainer with default parameters.

In [11]:
#Training data: the whole tagged corpus wrapped as a single sequence
train_data = [temp]
#Training to get parameter estimates
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
#HMM Properties
print(tagger)

<HiddenMarkovModelTagger 93 states and 54814 output symbols>


In [13]:
text = "The Fulton County Grand Jury said an investigation of Atlanta 's recent primary election produced `` no evidence '' of any irregularity ."
text = text.lower()
print(tagger.tag(text.split()))


[('the', 'DT'), ('fulton', 'NNP'), ('county', 'NNP'), ('grand', 'NNP'), ('jury', 'NNP'), ('said', 'VBD'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ('atlanta', 'NNP'), ("'s", 'POS'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('of', 'IN'), ('any', 'DT'), ('irregularity', 'NN'), ('.', '.')]


### Result Evaluation

Comparing the outcomes of the hmm module and the manual implementation of the Viterbi algorithm, we can see that both are the same. So when tested on data similar to the training data, both perform equally well.

## Shortcomings of above HMM of nltk

Now consider the following sentence.

"Hello World"

This sentence has words which occurred in the training data only a few times (hello occurred 10 times out of the million words).

### Using the trained tagger from nltk

In [14]:
t = "Hello World"
t = t.lower()

print(tagger.tag(t.split()))

[('hello', 'DT'), ('world', 'DT')]


### Using the self-implemented Viterbi algorithm 

In [15]:
print(viterbi("Hello World",0.0001))

[('hello', 'UH'), ('world', 'NN')]


### Result Evaluation

We see that hello is assigned the tag 'DT' by nltk's module, while my implementation assigns it the tag 'UH'.
DT means determiner, which is generally used for articles, whereas UH stands for interjection.
Hello is an interjection, so it is correctly classified by the manual implementation.

The module also tags world as 'DT'. The manual implementation with smoothing, however, produces NN for it, which stands for singular/mass nouns.

Here we can see that nltk's hmm tagger performs poorly because the sentence consists of unseen transitions and rare words, and no smoothing is applied by default, as seen from the source code: https://www.nltk.org/_modules/nltk/tag/hmm.html

The tagger is also case-sensitive, which is why the input is lowercased before tagging.

However, the manual implementation has smoothing to take care of unseen data, giving a better result.


## Concluding notes

This is an introductory tutorial on HMMs for POS tagging. It should help you get started with the basic concepts and the easiest method for its implementation.

The above implementation of the Viterbi algorithm is crude. Specifically, the training text is tokenized by splitting on spaces, which is a very simplistic methodology. Also, log-transforms of the probabilities can be used, so that long products of small probabilities become sums and do not underflow. There are many other ways to implement this algorithm. Instead of giving a numerical parameter for smoothing, assigning a default tag such as NNP is a common way to handle unseen words or unseen Part-of-Speech transitions.
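
To illustrate why the log-transform matters, here is a small sketch that compares a plain product with a sum of logarithms. The 100 steps of probability 0.0001 are a hypothetical worst case, chosen to match the default smoothing value used above.

```python
import math

# 100 hypothetical steps, each contributing a probability of 0.0001.
step_probabilities = [0.0001] * 100

product = 1.0
for p in step_probabilities:
    product *= p          # underflows to 0.0 long before the loop ends

# In log space the product becomes a sum, which stays well within range.
log_score = sum(math.log(p) for p in step_probabilities)

print(product)    # 0.0: every such path would look equally impossible
print(log_score)  # a finite negative number that can still be compared
```

Since the logarithm is monotonic, the path with the largest sum of log-probabilities is also the path with the largest product, so the maximization in the iteration step carries over unchanged.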

For more information you could visit:

- http://www.nltk.org/book/ch05.html
- https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
- https://web.stanford.edu/~jurafsky/slp3/10.pdf
- https://cs.nyu.edu/courses/spring12/CSCI-GA.2590-001/lecture4.pdf