# Part-of-speech Tagging with HMM
This is a tutorial on part-of-speech (POS) tagging, a core task in natural language processing. POS tags matter because they carry a lot of distributional information that helps us infer further properties of a sentence, such as its syntax.

**This tutorial focuses mainly on training an English POS tagger from a tagged data set using a Hidden Markov Model (HMM) and decoding with the Viterbi algorithm.**

## Index
- [What is a POS tag?](#What-is-a-POS-tag?)
- [Why are POS tags important?](#Why-are-POS-tags-important?)
- [Installing Libraries and Loading Data](#Installing-Libraries-and-Loading-Data)
- [How do we tag the parts-of-speech?](#How-do-we-tag-the-parts-of-speech?)
- [How well do modern POS taggers perform?](#How-well-do-modern-POS-taggers-perform?)
- [Other tasks using HMMs](#Other-tasks-using-HMMs)


### What is a POS tag?
One may think there is a fixed set of POS classes. In reality, there are multiple standards for part-of-speech classes. Below is the list of POS tags from [the Penn Treebank](https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu:80/~treebank/).

| **Tag** | **Description** (alphabetical list of part-of-speech tags used in the Penn Treebank Project) |
|----------|-------------|
| CC      | Coordinating conjunction|
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential there |
| FW | Foreign word |
| IN | Preposition or subordinating conjunction | 
| JJ | Adjective | 
| JJR | Adjective, comparative | 
| JJS | Adjective, superlative |
| LS | List item marker |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS	|Possessive ending|
| PRP |Personal pronoun|
| PRP\$ | Possessive pronoun|
| RB	|Adverb|
| RBR |Adverb, comparative|
| RBS |Adverb, superlative|
| RP |Particle|
| SYM| Symbol|
| TO |to|
| UH |Interjection|
| VB |Verb, base form|
| VBD |Verb, past tense|
| VBG |Verb, gerund or present participle|
| VBN |Verb, past participle|
| VBP |Verb, non-3rd person singular present|
| VBZ |Verb, 3rd person singular present|
| WDT |Wh-determiner|
| WP |Wh-pronoun|
| WP\$ |Possessive wh-pronoun|
| WRB |Wh-adverb|

As we can see, the classes used in the PTB are not the same as the eight categories we usually assume. For example, the familiar noun class is split into several classes, including NN, NNP, and NNPS. For computational purposes, part-of-speech classes are usually more fine-grained so that downstream tasks can be performed more accurately. For further information on POS, one can check out [this](https://web.stanford.edu/~jurafsky/slp3/10.pdf) PDF file.
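
To make the difference between fine-grained and coarse classes concrete, the sketch below collapses some Penn Treebank tags into broader classes by prefix. The helper name `coarse_class` and the class labels are hypothetical choices made for this illustration; they are not part of any standard.

```python
# Hypothetical helper: collapse fine-grained Penn Treebank tags into coarse classes.
COARSE_PREFIXES = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]

def coarse_class(tag):
    """Return a coarse class for a PTB tag, or "other" if no prefix matches."""
    for prefix, label in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return label
    return "other"

for tag in ["NN", "NNPS", "VBZ", "JJR", "RBS", "DT"]:
    print(tag, "->", coarse_class(tag))
```

Going in this direction loses information (NN and NNPS both become "noun"), which is one reason taggers are usually trained on the fine-grained tags.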

### Why are POS tags important?
In NLP, if one wants to parse a sentence for further analysis, or to generate a sentence, one needs the grammatical structure of that sentence. To get that, we need parts of speech. 
Take sentence parsing as an example. Many parsing algorithms (for example, Earley's algorithm) use a (probabilistic) [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar) to parse sentences. Since the grammar rules are largely based on parts of speech, POS tagging is a very important underlying task.<br>

### Installing Libraries and Loading Data
Libraries:
The tagger itself is built with standard-library modules such as `os` and `collections`. A few cells additionally use nltk (the Natural Language Toolkit), which can be installed together with its dependencies by following the [instructions](https://www.nltk.org/install.html). Some of the steps in this tutorial can also be done with functions provided by nltk, and the tutorial points out how to call them.

Data:
In order to perform POS tagging, we need training data, which means we need a human-annotated corpus. We will use a small sample of the Penn Treebank, downloaded from [Kaggle](https://www.kaggle.com/nltkdata/penn-tree-bank/version/5). The tagged source files are in the folder "tagged"; the first thing we need to do is concatenate them into one collection of tagged text. The following two blocks show how to clean and format the raw tagged text.

Side note: treebanks usually serve as a good source of annotated data. They are mostly corpora annotated by people, and they are rich in information because they contain reliable syntactic information about the texts. However, treebanks are not a gold standard: people make mistakes too. Also, since building a treebank is costly, many treebanks were annotated a long time ago. Still, they are relatively reliable.


In [1]:
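import os, nltk

# concatenate the 199 tagged files wsj_0001.pos ... wsj_0199.pos into one string
taggedRaw = ""
for i in range(1, 200):
    fileName = "wsj_" + str(i).zfill(4) + ".pos"
    with open(os.path.join("tagged", fileName), "r") as f:
        taggedRaw = taggedRaw + f.read()
# peek at the beginning of the data (skipping the first character)
print(taggedRaw.strip()[1:100])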

 Pierre/NNP Vinken/NNP ]
,/, 
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB 
[ the/DT board/NN ]
a


Now we have successfully loaded the files. As we can see, each word and its tag are separated by a "/". Let's parse that into a list of (word, tag) tuples.

In [2]:
# remove the chunk brackets "[" and "]"
taggedRaw = taggedRaw.replace("[", " ")
taggedRaw = taggedRaw.replace("]", " ")
splitted = taggedRaw.split(" ")
# drop empty or whitespace-only pieces; from observation we can see there are
# "====" separators in some files, so drop everything containing "=" as well
splitted = [x for x in splitted if x.strip() and "=" not in x]
print(splitted[0:5])
# turn each "word/TAG" string into a tuple
# alternatively, this can be done by nltk using nltk.tag.str2tuple function
splitted = [tuple(x.strip().split("/")) for x in splitted]
print("Processed: \n")
print(splitted[0:5])

['Pierre/NNP', 'Vinken/NNP', '\n,/,', '61/CD', 'years/NNS']
Processed: 

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]
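
The parsing step above splits on single spaces and on every "/". As a possible alternative, the sketch below splits on any whitespace and only at the last "/" of each piece, so that a word containing a slash stays intact. The function name `parse_tagged_text` and the sample string are made up for this illustration, and the sketch assumes brackets are always separated from words by whitespace, as in the excerpt printed earlier.

```python
def parse_tagged_text(raw):
    """Turn Penn Treebank style 'word/TAG' text into (word, tag) tuples."""
    pairs = []
    # split() without arguments splits on any whitespace, including newlines
    for chunk in raw.split():
        # skip chunk brackets and the "=====" separator lines
        if chunk in ("[", "]") or set(chunk) == {"="}:
            continue
        # split at the LAST slash so that words containing "/" stay intact
        word, slash, tag = chunk.rpartition("/")
        if slash and word and tag:
            pairs.append((word, tag))
    return pairs

sample = "[ Pierre/NNP Vinken/NNP ]\n,/, \n[ 61/CD years/NNS ]\n======\nold/JJ"
print(parse_tagged_text(sample))
```

With this variant every returned element is a well-formed pair, so later loops would not need to skip malformed entries.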


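Actually, nltk ships with tagged corpora one can use directly (a corpus has to be downloaded once, e.g. with `nltk.download('brown')`). Below are some tagged words from the Brown corpus; note that it uses its own tag set (for example `AT`) rather than the Penn Treebank tags. nltk also has a built-in string-to-tuple function.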

In [3]:
print(nltk.corpus.brown.tagged_words())
print(nltk.tag.str2tuple('Pierre/NNP'))

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
('Pierre', 'NNP')


### How do we tag the parts-of-speech?
There are many POS tagging techniques. We can use Hidden Markov Models or Maximum Entropy Markov Models, and these days there are also part-of-speech taggers based on neural networks. In this tutorial, we will focus on the bigram HMM, a simple but powerful model.


A [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. It lets us infer the hidden states from a sequence of observations. <BR> Here, we can use an HMM because we can treat the word classes as the hidden states and the words of the sentence as the observations.<BR>
Click [here](https://web.stanford.edu/~jurafsky/slp3/9.pdf) to read more about HMMs.

#### Train a Hidden Markov Model
An HMM must have these components:
![alt text](https://i.imgur.com/OSsES16.png "States info from the textbook")
(This image is from *Speech and Language Processing* by Daniel Jurafsky & James H. Martin.)


The observations are just the words of the text. What we need to compute is a transition probability matrix (A) and a set of emission probabilities (B). 

This tutorial breaks the computation into four steps. 
- count the observed transitions
- normalize the transition counts
- count the observed emissions
- normalize the emission counts


First, we need to compute the transition probability matrix. 
To do that, we first record the observations. We represent the counts with a matrix `transitionMatrix` such that `transitionMatrix[i][j]` is the number of times tag i is followed by tag j.

In [4]:
from collections import defaultdict

# transitionMatrix[prevTag][tag] = number of times prevTag is followed by tag
transitionMatrix = defaultdict(dict)
# keep track of the vocabulary, easier for smoothing
seen = set()
# mark the special symbols
UNKNOWN_SYMBOL = "UNK"
INIT_SYMBOL = "INIT"
FINAL_SYMBOL = "FINAL"

# the sequence starts in the special initial state
prevTag = INIT_SYMBOL

for pair in splitted:
    try:
        (token, tag) = pair
    except ValueError:
        # skip malformed entries that are not a (word, tag) pair
        continue
    # This smoothing trick is taken from David Bamman.
    # The first occurrence of every token is treated as the unknown symbol
    # (only the tag matters for the transition counts below).
    if token not in seen:
        seen.add(token)
        token = UNKNOWN_SYMBOL

    # initialize the inner layer of that tag
    if prevTag not in transitionMatrix:
        transitionMatrix[prevTag] = defaultdict(int)
    # increment the transition observation
    transitionMatrix[prevTag][tag] += 1

    # update the source tag
    prevTag = tag

# deal with the stop: the last tag is followed by the special final state
if prevTag not in transitionMatrix:
    transitionMatrix[prevTag] = defaultdict(int)
transitionMatrix[prevTag][FINAL_SYMBOL] += 1
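
Note that the loop above enters `INIT_SYMBOL` only once and records `FINAL_SYMBOL` only once, so the whole corpus is counted as a single long sequence. A possible variant, sketched below under the assumption that a sentence ends at every token tagged `.`, restarts from the initial state after each sentence. The function name `count_transitions_by_sentence` and its `end_tag` parameter are hypothetical, and the toy pairs are hand-written.

```python
from collections import defaultdict

def count_transitions_by_sentence(pairs, init="INIT", final="FINAL", end_tag="."):
    """Count tag-to-tag transitions, restarting at `init` after every `end_tag`."""
    counts = defaultdict(lambda: defaultdict(int))
    prev = init
    for token, tag in pairs:
        counts[prev][tag] += 1
        if tag == end_tag:
            # assumed sentence boundary: close this sentence and start a new one
            counts[tag][final] += 1
            prev = init
        else:
            prev = tag
    if prev != init:
        # the data ended in the middle of a sentence
        counts[prev][final] += 1
    return counts

toy_pairs = [("I", "PRP"), ("run", "VBP"), (".", "."),
             ("Dogs", "NNS"), ("bark", "VBP"), (".", ".")]
toy_counts = count_transitions_by_sentence(toy_pairs)
print(dict(toy_counts["INIT"]))  # {'PRP': 1, 'NNS': 1}
```

The rest of this tutorial keeps using `transitionMatrix` exactly as computed in the cell above.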

Now we have a 2D dictionary where `transitionMatrix[i][j]` is the number of times tag i is followed by tag j. The next step is to normalize each row to get `transitionProbabilities` such that $\forall i, \sum_{j} \text{transitionProbabilities}[i][j] = 1$.

In [5]:
import copy
transitionProbabilities = copy.deepcopy(transitionMatrix)
for prevTag in transitionMatrix:
    # total number of transitions observed out of prevTag
    totalCount = sum(transitionMatrix[prevTag].values())
    for tag in transitionMatrix[prevTag]:
        transitionProbabilities[prevTag][tag] = transitionMatrix[prevTag][tag] / totalCount
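
As a small sanity check of the normalization step, the sketch below applies the same row-wise division to a hand-written count table and verifies that every row sums to 1. The helper name `normalize_rows` and the toy counts are assumptions made for this example only.

```python
def normalize_rows(counts):
    """Turn a nested dict of counts into a nested dict of row-wise probabilities."""
    probabilities = {}
    for prev_tag, row in counts.items():
        total = sum(row.values())
        probabilities[prev_tag] = {tag: count / total for tag, count in row.items()}
    return probabilities

# hand-written toy counts: toy_counts[i][j] = times tag i was followed by tag j
toy_counts = {
    "DT": {"NN": 3, "JJ": 1},
    "JJ": {"NN": 2},
    "NN": {"VBZ": 1, "FINAL": 1},
}
toy_probabilities = normalize_rows(toy_counts)
print(toy_probabilities["DT"])  # {'NN': 0.75, 'JJ': 0.25}

# every row should sum to 1 (up to floating-point rounding)
for prev_tag, row in toy_probabilities.items():
    assert abs(sum(row.values()) - 1.0) < 1e-12
```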

Similarly, we need to compute the emission probabilities. We will also store them in a two-level dictionary, `emissionProbabilities`, such that `emissionProbabilities[i][j]` is the probability that token j is emitted given tag i.

First we count the observations into `emissionMatrix`. The structure of the code is very similar to the transition counting code.

In [6]:
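# emissionMatrix[tag][token] = number of times tag emits token
emissionMatrix = defaultdict(dict)
seen = set()
for pair in splitted:
    try:
        (token, tag) = pair
    except ValueError:
        # skip malformed entries that are not a (word, tag) pair
        continue
    # count the first occurrence of every token as the unknown symbol
    if token not in seen:
        seen.add(token)
        token = UNKNOWN_SYMBOL
    if tag not in emissionMatrix:
        emissionMatrix[tag] = defaultdict(int)
    emissionMatrix[tag][token] += 1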

Then we normalize the counts so that the emission probabilities are actual probabilities rather than raw counts.

In [7]:
emissionProbabilities = copy.deepcopy(emissionMatrix)
for tag in emissionMatrix:
    totalCount = 0
    for token in emissionMatrix[tag]:
        totalCount += emissionMatrix[tag][token]
    for token in emissionMatrix[tag]:
        emissionProbabilities[tag][token] = emissionMatrix[tag][token]/totalCount
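
To see what the unknown-word trick does to the emission counts, here is a small sketch on a hand-written toy corpus. The function name `count_emissions` and the toy pairs are assumptions for illustration; the counting rule is the one used in the cells above, where the first occurrence of every token is counted as `UNK`.

```python
from collections import defaultdict

def count_emissions(pairs, unknown="UNK"):
    """Count tag -> token emissions, treating each token's first occurrence as unknown."""
    counts = defaultdict(lambda: defaultdict(int))
    seen = set()
    for token, tag in pairs:
        if token not in seen:
            seen.add(token)
            token = unknown
        counts[tag][token] += 1
    return counts

toy_pairs = [("the", "DT"), ("dog", "NN"), ("the", "DT"), ("cat", "NN"), ("the", "DT")]
toy_emissions = count_emissions(toy_pairs)
print(dict(toy_emissions["DT"]))  # {'UNK': 1, 'the': 2}
print(dict(toy_emissions["NN"]))  # {'UNK': 2}
```

A word that occurs only once (here "dog" and "cat") contributes only to `UNK`, so every tag that introduces new words reserves some probability mass for words the model has never seen.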

Now we have computed everything needed for a Hidden Markov Model.
Instead of keeping the model in the global environment, one can also save it to an output file and read it back later.
Below is a quick check that makes it clearer what the matrices look like. The entries have the form (tag_i, tag_j, transition probability tag_i -> tag_j) and (tag_i, token_j, emission probability tag_i -> token_j).


In [8]:
trans = []

for prevtag in transitionProbabilities:
    for tag in transitionProbabilities[prevtag]:
        trans.append((prevtag, tag, transitionProbabilities[prevtag][tag]))
emi = []
for tag in emissionProbabilities:
    for token in emissionProbabilities[tag]:
        emi.append((tag, token, emissionProbabilities[tag][token]))

print("Transition: i, j, p i->j\n" + str(trans[0:5]))
print("\n")
print("Emission: i, j, p i->j\n" + str(emi[0:5]))

Transition: i, j, p i->j
[('INIT', 'NNP', 1.0), ('NNP', 'NNP', 0.38257173219978746), ('NNP', ',', 0.1532412327311371), ('NNP', 'CD', 0.020403825717321997), ('NNP', 'VBZ', 0.03645058448459086)]


Emission: i, j, p i->j
[('NNP', 'UNK', 0.25823591923485656), ('NNP', 'Vinken', 0.00010626992561105207), ('NNP', 'Kent', 0.0007438894792773645), ('NNP', 'New', 0.01689691817215728), ('NNP', 'Lorillard', 0.0003188097768331562)]
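
As mentioned before the check above, the model can be written to a file instead of being kept in memory. The sketch below is one possible way to do that with the standard-library `json` module. The function names `save_model` and `load_model`, the file name, and the toy probabilities are hypothetical; the nested dictionaries from this tutorial could be passed in the same way, since `defaultdict` is a `dict` subclass.

```python
import json
import os
import tempfile

def save_model(path, transition, emission):
    """Write both probability tables to one JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"transition": transition, "emission": emission}, f)

def load_model(path):
    """Read the two probability tables back as plain nested dicts."""
    with open(path, "r", encoding="utf-8") as f:
        model = json.load(f)
    return model["transition"], model["emission"]

# toy tables standing in for transitionProbabilities and emissionProbabilities
toy_transition = {"INIT": {"DT": 1.0}, "DT": {"NN": 1.0}}
toy_emission = {"DT": {"the": 0.5, "UNK": 0.5}, "NN": {"UNK": 1.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "hmm_model.json")
    save_model(model_path, toy_transition, toy_emission)
    loaded_transition, loaded_emission = load_model(model_path)

print(loaded_transition == toy_transition and loaded_emission == toy_emission)  # True
```

The loaded tables are plain dictionaries, so code that relies on `defaultdict` behavior would have to wrap them again.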


#### The Viterbi Algorithm

Now we have an HMM. Next we need to find the most likely sequence of hidden states for new observations, i.e. new raw text that we want to tag. Luckily, we do not need to invent an algorithm on the fly: we can use [the Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm). The core idea of the Viterbi algorithm is that when two paths reach the same state, we keep only the path with the higher probability. <br>The animation below illustrates this process.
We begin with a starting state, and the Hs and Fs are the hidden states we want to infer from the given observations. The probability on every arrow is the product of three things: 
1. the probability of the best state sequence so far 
2. the transition probability from the state on the arrow's left to the state on the right (i.e. the probability that we move to the target state from the previous state)
3. the emission probability (i.e. the probability that we get this observation given the state)<br>

As we traverse the observations, we can record these probabilities efficiently. Also, if we keep a pointer from each state to its best predecessor, we can recover the predicted states by walking from the end back to the beginning.
![alt text](https://upload.wikimedia.org/wikipedia/commons/7/73/Viterbi_animated_demo.gif "Animated demo of the Viterbi algorithm")
(This animation is from the Wikipedia page on the Viterbi algorithm.)
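
The three-factor product and the backpointers described above can be sketched on a tiny two-state HMM. Everything in this block is an assumption made for illustration: the function name `viterbi_toy`, the state names, and the toy probabilities are not taken from the tagger trained in this tutorial, and the sketch multiplies raw probabilities, which is only safe for very short sequences.

```python
def viterbi_toy(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a small HMM given as plain dictionaries."""
    # trellis[t][s]: probability of the best path that ends in state s at time t
    trellis = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointers = [{}]
    for t in range(1, len(observations)):
        trellis.append({})
        backpointers.append({})
        for current in states:
            # when several paths reach `current`, keep only the most probable one
            best_prev = max(states, key=lambda prev: trellis[t - 1][prev] * trans_p[prev][current])
            trellis[t][current] = (trellis[t - 1][best_prev]
                                   * trans_p[best_prev][current]
                                   * emit_p[current][observations[t]])
            backpointers[t][current] = best_prev
    # start from the best final state and follow the pointers backwards
    path = [max(states, key=lambda s: trellis[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    path.reverse()
    return path, trellis[-1][path[-1]]

# illustrative toy model with two hidden states and three possible observations
toy_states = ["Healthy", "Fever"]
toy_start = {"Healthy": 0.6, "Fever": 0.4}
toy_trans = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
             "Fever": {"Healthy": 0.4, "Fever": 0.6}}
toy_emit = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
            "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

best_path, best_prob = viterbi_toy(["normal", "cold", "dizzy"],
                                   toy_states, toy_start, toy_trans, toy_emit)
print(best_path)  # ['Healthy', 'Healthy', 'Fever']
```

Because `backpointers[t][s]` stores the best state at time `t - 1`, the backward walk always looks up the pointer that belongs to the position of the state it currently holds.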

To implement the algorithm, let's first transform the probabilities into log space, so that products of many small probabilities become sums and do not underflow.

In [9]:
import math
# store the HMM probabilities as log probabilities
logTransition = defaultdict(dict)
logEmission = defaultdict(dict)
for prevtag in transitionProbabilities:
    for tag in transitionProbabilities[prevtag]:
        logTransition[prevtag][tag] = math.log(transitionProbabilities[prevtag][tag])
for tag in emissionProbabilities:
    for token in emissionProbabilities[tag]:
        logEmission[tag][token] = math.log(emissionProbabilities[tag][token])
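
A quick way to see why log space helps is to multiply many small probabilities directly and compare that with summing their logarithms. The numbers below are arbitrary values chosen for this demonstration.

```python
import math

# 100 steps, each with a (made-up) probability of 1e-5
step_probabilities = [1e-5] * 100

product = 1.0
for p in step_probabilities:
    product *= p

log_sum = sum(math.log(p) for p in step_probabilities)

print(product)  # 0.0, the true value 1e-500 is too small for a float
print(log_sum)  # a finite negative number, about -1151.29
```

Since the logarithm is monotonic, comparing sums of log probabilities picks the same best path as comparing the products would, without the underflow.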


We also need to collect the set of states and the vocabulary, so that we can handle unseen words.

In [10]:
states = set()
vocabulary = set()
for prevTag in logTransition:
    states.add(prevTag)
states.add(FINAL_SYMBOL)
for tag in emissionProbabilities:
    vocabulary = vocabulary.union(emissionProbabilities[tag].keys())

To implement the algorithm, we need a trellis with one column per position in the observation sequence and one row per state. We also need backpointers that allow us to trace back the best path.

In [11]:
# This implementation is adapted and modified from an implementation of this algorithm by Noah Smith.
def viterbi(sentence):
    # instead of using split(), one can also use nltk.word_tokenize() if all dependencies are installed.
    # position 0 is a placeholder for the start; FINAL_SYMBOL marks the end
    observations = [" "] + sentence.split() + [FINAL_SYMBOL]
    trellis = defaultdict(dict)       # trellis[i][state] = best log probability so far
    backpointers = defaultdict(dict)  # backpointers[i][state] = best previous state
    # seed position 0 with every tag observed right after INIT_SYMBOL
    for state in transitionProbabilities[INIT_SYMBOL].keys():
        trellis[0][state] = 0
    # map tokens that are not in the vocabulary to the unknown symbol
    for i in range(1, len(observations)):
        if observations[i] not in vocabulary:
            observations[i] = UNKNOWN_SYMBOL

    # traverse to fill the trellis
    for i in range(1, len(observations) + 1):
        # iterate through possible current states
        for current in states:
            # iterate through possible prev states
            for previous in states:
                try:
                    if i == len(observations):   # special treatment for the final step: no emission
                        p = trellis[i-1][previous] + logTransition[previous][current]
                    else:
                        # the product of the three probabilities, in log space
                        p = trellis[i-1][previous] + logTransition[previous][current] + logEmission[current][observations[i]]
                except KeyError:
                    # impossible path: unreachable previous state,
                    # unseen transition, or unseen emission
                    continue
                # keep the better previous state
                if current not in trellis[i] or p > trellis[i][current]:
                    trellis[i][current] = p # Viterbi probability
                    backpointers[i][current] = previous # link the states
    # find the best of the last
    lastTag = None
    currentBest = 0
    for possibleTag in trellis[len(observations)]:
        p = trellis[len(observations)][possibleTag]
        if not lastTag or p > currentBest:
            currentBest = p
            lastTag = possibleTag

    # follow the backpointers from the end to the beginning
    if lastTag:
        res = []
        for i in range(1, len(observations)):
            res = [lastTag] + res
            lastTag = backpointers[len(observations) - i][lastTag]
        # drop the entry for the final step, which is not part of the sentence
        res.pop()
        return " ".join(res)

Let's try tagging three sentences.

In [12]:
print(viterbi("I am a good person ."))
print(viterbi("look at her !"))
print(viterbi("do you like it ?"))

PRP VBD DT JJ NN .
NN IN PRP .
VBP PRP VBP PRP .


The tags make a lot of sense, right? There are some mistakes, such as *look* being tagged as a noun and *am* as a past-tense verb (VBD), but most verbs are indeed tagged VB-, nouns are tagged NN-, *a* is a determiner, and so on (one can refer back to the list of classes if they are not familiar). Given that we trained the model on such a small data set, that is pretty good.<br>
Side note: nltk also provides an [HMM tagger](https://www.nltk.org/_modules/nltk/tag/hmm.html); you can explore it if you want to.

### How well do modern POS taggers perform?
As you can see, our bigram tagger, trained on such a small data set, does not perform badly. In fact, modern POS taggers perform very well: accuracy can exceed 95%, so POS tagging is generally perceived as a solved task. However, correctly determining the parts of speech is only a start. There is a long way to go.
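
Accuracy here means the fraction of tokens that receive the correct tag. A minimal sketch of that measure is shown below. The function name `tagging_accuracy` is hypothetical, the predicted tags are the output our tagger printed earlier for "I am a good person .", and the gold tags are hand-written reference tags assumed for this example.

```python
def tagging_accuracy(predicted_tags, gold_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for p, g in zip(predicted_tags, gold_tags) if p == g)
    return correct / len(gold_tags)

predicted = "PRP VBD DT JJ NN .".split()
gold = "PRP VBP DT JJ NN .".split()  # hand-written reference tags (assumption)
print(tagging_accuracy(predicted, gold))  # 5 of the 6 tags match
```

A single sentence says little about a tagger; a meaningful number would need a held-out portion of the tagged corpus that was not used for training.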

### Other tasks using HMMs

Hidden Markov Models are powerful and have many different applications, since many tasks involve inferring hidden states. For example, they can be useful in decryption, gene prediction, time series prediction, and so on. The HMM is a valuable and general model.