## Introduction
As its name suggests, the Hidden Markov Model assumes that a system is a Markov chain with "hidden" states. The system moves from one underlying state to another with fixed transition probabilities (conditioned on zero or more previous states) and, at each step, emits an observable with a fixed probability that depends only on the current hidden state. While the states themselves are often unobservable, the Hidden Markov Model gives data scientists a way to decode the underlying states from the observables they emit.

The Hidden Markov Model was first introduced by Ruslan L. Stratonovich in 1960. Since then, the method has been widely applied in many data science related fields such as computational finance, bioinformatics and time series analysis, as well as in more specific machine learning problems including speech recognition, handwriting recognition and machine translation.

This tutorial will focus on applying the model to more general data science problems.


## Tutorial content
In this tutorial, we will present the theory behind the Hidden Markov Model, guide you through coding the algorithms from scratch and introduce [hmmlearn](http://hmmlearn.readthedocs.io/en/latest/tutorial.html), which was once part of the scikit-learn library but is now released as a separate library on GitHub.

To make the ideas easier to follow, this tutorial uses a small dataset. Keep in mind, however, that real applications of the Hidden Markov Model are not limited to small datasets.

This tutorial covers the following sections:
* Markov Chain and First Order Markov Assumption
* From Markov Chain to Hidden Markov Model
* The Evaluation Problem, the Forward and Backward Algorithm
* The Decoding Problem and the Viterbi Algorithm
* The Learning Problem, the Baum-Welch Algorithm and the hmmlearn library

## Markov Chain and First Order Markov Assumption
A Markov Chain is a sequence of events in which the state of each event depends only on the state of the previous event in the sequence. When we model a sequence of events as a Markov Chain, we assume that the events obey the first order Markov assumption. Probabilistically, this assumption can be written as P(e<sub>n</sub>|e<sub>n-1</sub>, e<sub>n-2</sub>, ..., e<sub>1</sub>) = P(e<sub>n</sub>|e<sub>n-1</sub>).

For instance, let's define the event e<sub>n</sub> to be Alvin having lunch at Underground on day n. The event can take different states depending on the kind of food Alvin orders that day. Naming each state after the food ordered, the state of e<sub>n</sub> is an element of the set {Burger, Salad, None}. Ordering all events by time gives a sequence of N events e<sub>1</sub>, e<sub>2</sub>, ..., e<sub>n</sub>, ..., e<sub>N</sub>, which forms a time series. This alone does not make the series a Markov Chain. However, if we can reasonably believe that Alvin's decision about what to order depends only on his decision the day before, in other words, P(e<sub>n</sub>|e<sub>n-1</sub>, e<sub>n-2</sub>, ..., e<sub>1</sub>) = P(e<sub>n</sub>|e<sub>n-1</sub>), then we have a Markov Chain.

The following chart shows the conditional probability from one state to another.

<img src="files/chart1.tiff">
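
As a small illustration of the first order assumption, the sketch below samples a state sequence in which each day's state is drawn using only the previous day's state. It assumes the transition probabilities that this tutorial's code defines later on; the function name `sample_chain`, the starting state and the random seed are hypothetical choices made only for this example.

```python
import numpy as np

states = ['Burger', 'Salad', 'None']

# Transition probabilities: row = today's state, column = tomorrow's state
# (assumed to be the same values the tutorial defines later)
A = np.array([[0.10, 0.65, 0.25],
              [0.55, 0.35, 0.10],
              [0.45, 0.50, 0.05]])

def sample_chain(start_index, length, rng):
    """Sample `length` states; each draw depends only on the previous state."""
    path = [start_index]
    for _ in range(length - 1):
        # The distribution of the next state is the row of the current state
        path.append(int(rng.choice(len(states), p=A[path[-1]])))
    return [states[i] for i in path]

rng = np.random.default_rng(0)  # hypothetical seed, for repeatability only
chain = sample_chain(0, 7, rng)  # one simulated week starting from 'Burger'
print(chain)
```

Note that the sampler never looks further back than `path[-1]`, which is exactly what the first order Markov assumption states.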

## From Markov Chain to Hidden Markov Model
Continuing our discussion about Alvin, Alvin's suitemate Roy loves to stalk others. He would like to know what Alvin has at Underground every day but it will be too obvious to follow him to Underground and observe. As a result, Roy decides to keep track of the time Alvin spent at Underground by computing the time interval between the time Alvin leaves the suite for Underground and the time he comes back.

Roy categorizes his data with the labels {below 5 min, 5 to 20 min, above 20 min}. By gathering enough observables, he hopes to decode the observable sequence into the actual state sequence and so understand Alvin's diet.

In this case, the state of each event is unobservable for, or "hidden" from, Roy. But Roy still believes that the sequence is intrinsically a Markov Chain. Moreover, Roy believes that the value of the observable on day n, o<sub>n</sub>, depends only on e<sub>n</sub>.

With these conditions, we say that Roy models the system as a Hidden Markov Model.

The following is the emission probability p(o<sub>n</sub>|e<sub>n</sub>) we use for this tutorial.

| State | below 5 min | 5 to 20 min | above 20 min |
|---|---|---|---|
|Burger|0.05|0.40|0.55|
|Salad|0.08|0.66|0.26|
|None|0.80|0.15|0.05|
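
To show how an emission table like this one links an observation back to the hidden state, here is a minimal sketch that applies Bayes' rule to a single observation. It assumes Roy has no prior knowledge, so the prior over states is uniform; that prior and the variable names are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

states = ['Burger', 'Salad', 'None']
observations = ['below 5 min', '5 to 20 min', 'above 20 min']

# Emission probabilities p(o | e): row = hidden state, column = observation
B = np.array([[0.05, 0.40, 0.55],
              [0.08, 0.66, 0.26],
              [0.80, 0.15, 0.05]])

prior = np.full(len(states), 1.0 / len(states))  # assumption: uniform prior

o = observations.index('below 5 min')
joint = prior * B[:, o]           # p(e) * p(o | e) for each state e
posterior = joint / joint.sum()   # p(e | o) by Bayes' rule

for state, p in zip(states, posterior):
    print(f"{state}: {p:.3f}")
```

A single short visit already points strongly towards the state None. The Hidden Markov Model refines this kind of reasoning by also taking the transition probabilities between consecutive days into account.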

There are three important problems associated with the Hidden Markov Model, namely evaluation, decoding and learning. They are defined as follows:
* **The Evaluation Problem**: Given an HMM, its parameters (transition probabilities, emission probabilities) and a sequence of observations, evaluate the probability of the observations
* **The Decoding Problem**: Given an HMM, its parameters (transition probabilities, emission probabilities) and a sequence of observations, find the most likely state sequence
* **The Learning Problem**: Given an HMM with its set of possible states and a sequence of observations, adjust the model parameters to maximize the probability of the observations given the model

Before we start writing code to address these three problems, let's set up our dataset.

First is the set of all possible states and observations:

In [None]:
states = ['Burger', 'Salad', 'None']
observations = ['below 5 min', '5 to 20 min', 'above 20 min']

Then we define the transition and emission probability matrices, together with helper functions that return the transition probability and the emission probability:

In [None]:
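import numpy

# Rows: current state, columns: next state (order: Burger, Salad, None)
transition_matrix = numpy.array([[0.1, 0.65, 0.25],
                                 [0.55, 0.35, 0.1],
                                 [0.45, 0.50, 0.05]])

# Rows: state, columns: observation (order: below 5 min, 5 to 20 min, above 20 min)
emission_matrix = numpy.array([[0.05, 0.40, 0.55],
                               [0.08, 0.66, 0.26],
                               [0.80, 0.15, 0.05]])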

def state_num_index(state):
    return states.index(state)

def index_to_state(i):
    return states[i]

def observation_num_index(o):
    if o == 'below 5 min': return 0
    elif o == '5 to 20 min': return 1
    else:
        assert(o == 'above 20 min')
        return 2

def transition_prob(current_state, to_state, transition_matrix):
    return transition_matrix[state_num_index(current_state), state_num_index(to_state)]

def emission_prob(current_state, observation, emission_matrix):
    return emission_matrix[state_num_index(current_state), observation_num_index(observation)]

## The Evaluation Problem, the Forward and Backward Algorithm
With all the data ready, let's move on to the evaluation problem. Before explaining the concepts, we need to define some variables.
* **Observation Sequence**: From now onwards, we use O to represent the entire observation sequence. O = o<sub>1</sub> o<sub>2</sub> ... o<sub>T</sub>. o<sub>n</sub> refers to the observation corresponding to the nth state in the sequence.
* **State Sequence**: We use Q to represent the entire state sequence. Q = q<sub>1</sub> q<sub>2</sub> ... q<sub>T</sub>. q<sub>n</sub> refers to the nth state in the sequence.
* **HMM model**: We use &#955; to represent the HMM model. &#955; is defined by a collection of transition probabilities and emission probabilities. 
* **Transition Probability**: We use A for the transition probability matrix, where a<sub>qiqj</sub> refers to the transition probability from the state i to state j.
* **Emission Probability**: We use B for the collection of emission probabilities. b<sub>q</sub> refers to the emission probability for state q and b<sub>q</sub>(o) refers to the probability of having observation o at state q.
* **Number of possible states**: N
* **Length of the sequence**: T

The evaluation problem asks us, given an HMM, to evaluate the probability of the observations. In other words, we want to calculate the value of P(O|&#955;).

By applying the basic rules of probability, we know that P(O|&#955;) = &#8721;<sub>&#8704;Q</sub>P(O, Q|&#955;)=&#8721;<sub>&#8704;Q</sub>a<sub>q0q1</sub>b<sub>q1</sub>(o<sub>1</sub>) * a<sub>q1q2</sub>b<sub>q2</sub>(o<sub>2</sub>) * ... * a<sub>qT-1qT</sub>b<sub>qT</sub>(o<sub>T</sub>).

While the math is sound, computing that summation directly has complexity O(N<sup>T</sup>), where N is the number of possible states and T is the length of the sequence (the number of observations).
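
To make this cost concrete, here is a minimal sketch that evaluates the sum directly by enumerating every state sequence for a short observation sequence. It assumes a uniform distribution over the first state and reuses the transition and emission values of this tutorial; the names `brute_force_probability` and `pi` are illustrative and not part of any library. Observations are encoded as indices (0 = below 5 min, 1 = 5 to 20 min, 2 = above 20 min).

```python
import itertools
import numpy as np

A = np.array([[0.10, 0.65, 0.25],
              [0.55, 0.35, 0.10],
              [0.45, 0.50, 0.05]])
B = np.array([[0.05, 0.40, 0.55],
              [0.08, 0.66, 0.26],
              [0.80, 0.15, 0.05]])
pi = np.full(3, 1.0 / 3.0)  # assumption: uniform initial state distribution

def brute_force_probability(obs):
    """Sum P(O, Q | lambda) over all N**T state sequences Q."""
    n_states = A.shape[0]
    total = 0.0
    n_sequences = 0
    for seq in itertools.product(range(n_states), repeat=len(obs)):
        p = pi[seq[0]] * B[seq[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1], seq[t]] * B[seq[t], obs[t]]
        total += p
        n_sequences += 1
    return total, n_sequences

obs = [0, 0, 2, 0]  # a short sequence of four observations
prob, count = brute_force_probability(obs)
print(count, prob)
```

With only four observations the loop already visits 3<sup>4</sup> = 81 sequences, and every additional observation triples that number.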

To make the computation faster, we introduce the **forward algorithm**, which is essentially a dynamic programming approach to solve the evaluation problem.

Every dynamic programming solution defines a subproblem whose result is shared by multiple paths. In our case, the subproblem is to compute the value &#945;<sub>t</sub>(j), which equals P(o<sub>1</sub> o<sub>2</sub> ... o<sub>t</sub>, q<sub>t</sub> = S<sub>j</sub> | &#955;). In plain English, it is the probability that the observation sequence so far is o<sub>1</sub> o<sub>2</sub> ... o<sub>t</sub> and that the t-th state is S<sub>j</sub>, given the HMM.

The special value &#945;<sub>0</sub>(j) is set to 1 if S<sub>j</sub> is the start state and 0 otherwise.
All other values can be computed recursively by &#945;<sub>t</sub>(j) = [&#8721;<sup>N</sup><sub>i = 1</sub>&#945;<sub>t-1</sub>(i)a<sub>ij</sub>]b<sub>j</sub>(o<sub>t</sub>)

Finally, as follows naturally from the definition of &#945;, P(O|&#955;) = &#8721;<sup>N</sup><sub>j = 1</sub>&#945;<sub>T</sub>(j)

This is the entire forward algorithm. Each of the T time steps fills N cells, and each cell sums over N previous states, so the complexity of the algorithm is O(N<sup>2</sup>T).
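
The recursion can be sketched in a few lines of NumPy. This is a minimal illustration in plain probability space, assuming a uniform distribution `pi` over the first state in place of a dedicated start state; the function name `forward` and the index encoding of the observations (0 = below 5 min, 1 = 5 to 20 min, 2 = above 20 min) are choices made for this example.

```python
import math
import numpy as np

A = np.array([[0.10, 0.65, 0.25],
              [0.55, 0.35, 0.10],
              [0.45, 0.50, 0.05]])
B = np.array([[0.05, 0.40, 0.55],
              [0.08, 0.66, 0.26],
              [0.80, 0.15, 0.05]])
pi = np.full(3, 1.0 / 3.0)  # assumption: uniform initial state distribution

def forward(obs):
    """Return P(O | lambda) using the forward recursion."""
    # alpha_1(j) = pi_j * b_j(o_1)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
        alpha = (alpha @ A) * B[:, o]
    # P(O | lambda) = sum_j alpha_T(j)
    return float(alpha.sum())

obs = [0, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2]
p = forward(obs)
print(p, math.log(p))
```

Each time step costs one vector-matrix product, which is where the N<sup>2</sup> factor per observation comes from.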

Let's use this algorithm to evaluate the probability of the observation sequence defined below given the Alvin Roy Underground HMM.

In [None]:
O = ['below 5 min', 'below 5 min', 'above 20 min', 'below 5 min', \
     '5 to 20 min', 'above 20 min', 'below 5 min', '5 to 20 min', \
     'above 20 min', 'above 20 min', 'below 5 min', '5 to 20 min', \
     'above 20 min', 'above 20 min']
T = len(O)
N = len(states)

It is **important** to realize that Python floats have a limited range. For long sequences, the product of many small probabilities underflows, and the forward algorithm as written above would simply return 0. To avoid this, we convert all probabilities to log scale and carry out the computation in log space.

This means that multiplication becomes addition, and addition becomes logarithmic addition, which is implemented by the helper function below.

In [None]:
import math
def log_add(left, right):
    if right < left: return left + math.log1p(math.exp(right - left))
    elif left < right: return right + math.log1p(math.exp(left - right))
    else: return left + math.log1p(1)
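
The following sketch illustrates why log space is needed and what logarithmic addition computes. It is a standalone illustration: the probability value 0.05 and the count of 300 factors are arbitrary choices, and `numpy.logaddexp` is a NumPy function that computes log(exp(a) + exp(b)), used here as an independent reference for the same quantity.

```python
import math
import numpy as np

p = 0.05
product = 1.0
log_product = 0.0
for _ in range(300):
    product *= p                 # plain multiplication eventually underflows
    log_product += math.log(p)   # the same quantity tracked in log space

print(product)       # the float has underflowed
print(log_product)   # the log-space value is still perfectly representable

# Logarithmic addition: combine log(0.2) and log(0.3) into log(0.2 + 0.3)
log_sum = np.logaddexp(math.log(0.2), math.log(0.3))
print(log_sum, math.log(0.5))
```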

It is also useful to define the following wrapper for our helper function:

In [None]:
log_transition_matrix = numpy.log(transition_matrix)
log_emission_matrix = numpy.log(emission_matrix)

def log_transition_prob(current_state, to_state):
    return transition_prob(current_state, to_state, log_transition_matrix) 

def log_emission_prob(current_state, observation):
    return emission_prob(current_state, observation, log_emission_matrix)

Finally, we can implement the evaluate function, which evaluates the input observation sequence using the helper functions defined above:

In [None]:
def evaluate(O):
    T = len(O)
    N = len(states)
    # alphas[state][t] holds log(alpha_t(state)); column 0 is unused so that t is 1-indexed
    alphas = [[0.0 for _ in range(T + 1)] for _ in range(N)]

    # Initialize the prior. We assume the probability distribution of the initial state to be uniform
    for state_index, state in enumerate(states):
        alphas[state_index][1] = math.log(1.0 / N) + log_emission_prob(state, O[0])

    # Dynamically compute all the alpha values in the matrix alphas
    for t in range(2, T + 1):
        for current_state_index, current_state in enumerate(states):
            total = None
            for prev_state_index, prev_state in enumerate(states):
                term = log_transition_prob(prev_state, current_state) + alphas[prev_state_index][t - 1]
                total = term if total is None else log_add(total, term)
            # O[t - 1] because O is 0-indexed while t is 1-indexed
            alphas[current_state_index][t] = total + log_emission_prob(current_state, O[t - 1])

    # Log-sum the final column to get log P(O | lambda)
    rtn = alphas[0][T]
    for i in range(1, N):
        rtn = log_add(rtn, alphas[i][T])
    return rtn

In [None]:
evaluate(O)

In [None]:
math.exp(evaluate(O))

The function should return approximately -16.817 in log space, or about 4.97e-08.

Besides the forward algorithm, there is a backward algorithm that arrives at the same result. It is an important building block for the solution of the learning problem. See the [following link](http://pages.cs.wisc.edu/~matthewb/pages/notes/pdf/hmms/BackwardAlgorithm.pdf) to learn more.
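
As a rough sketch of the idea, the backward algorithm defines &#946;<sub>t</sub>(i) as the probability of the remaining observations o<sub>t+1</sub> ... o<sub>T</sub> given that the state at time t is S<sub>i</sub>, and fills the table from the end of the sequence towards the start. The code below is a minimal illustration under the same assumptions as before (uniform initial distribution `pi`, observations encoded as indices); the function names are hypothetical.

```python
import math
import numpy as np

A = np.array([[0.10, 0.65, 0.25],
              [0.55, 0.35, 0.10],
              [0.45, 0.50, 0.05]])
B = np.array([[0.05, 0.40, 0.55],
              [0.08, 0.66, 0.26],
              [0.80, 0.15, 0.05]])
pi = np.full(3, 1.0 / 3.0)  # assumption: uniform initial state distribution

def backward(obs):
    """Return P(O | lambda) using the backward recursion."""
    beta = np.ones(len(pi))                # beta_T(i) = 1
    for o in reversed(obs[1:]):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta = A @ (B[:, o] * beta)
    # P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return float(np.sum(pi * B[:, obs[0]] * beta))

def forward(obs):
    """Forward recursion, included only for comparison."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

obs = [0, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2]
p_backward = backward(obs)
p_forward = forward(obs)
print(p_backward, p_forward)
```

Both directions sum over the same set of state sequences, so the two printed values agree up to floating-point rounding.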

## The Decoding Problem and the Viterbi Algorithm
Moving on, given an HMM &#955;, we want to find the most likely underlying state sequence Q for the observation sequence O. The evaluation problem considered the probability of every possible state sequence; the decoding problem asks for the most probable one among them.

We can build a table similar to the one in the forward algorithm. The only difference is that, instead of summing each value in the previous column times the transition probability times the emission probability, we take the max. Intuitively, each cell holds the probability of the most probable state sequence up to the current time t that ends in the state the cell corresponds to.

This is in fact **the Viterbi Algorithm**.

To define it formally, we define the subproblem VP<sub>t</sub>(j) = max<sub>q0, ..., qt-1</sub> P(q<sub>0</sub> ... q<sub>t-1</sub>, o<sub>1</sub> o<sub>2</sub> ... o<sub>t</sub>, q<sub>t</sub> = S<sub>j</sub> | &#955;), which satisfies the recursion VP<sub>t</sub>(j) = max<sub>i = 1, ..., N</sub> VP<sub>t-1</sub>(i)a<sub>ij</sub>b<sub>j</sub>(o<sub>t</sub>)

Moreover, we keep track of the previous state chosen for each cell so that we can return the most probable state sequence at the end.

The following is the code for decoding. Read it alongside the evaluate function to see the similarities and differences:

In [None]:
def decode(O):
    T = len(O)
    N = len(states)
    # viterbis[state][t]: log probability of the best path ending in `state` at time t (t is 1-indexed)
    viterbis = [[0.0 for _ in range(T + 1)] for _ in range(N)]
    # prev_states[state][t]: the state at time t - 1 on that best path
    prev_states = [[None for _ in range(T + 1)] for _ in range(N)]

    # Initialize the prior. We assume the probability distribution of the initial state to be uniform
    for state_index, state in enumerate(states):
        viterbis[state_index][1] = math.log(1.0 / N) + log_emission_prob(state, O[0])

    # Dynamically fill in the viterbi matrix
    for t in range(2, T + 1):
        for current_state_index, current_state in enumerate(states):
            max_v = None
            prev_hop_state = None
            for prev_state_index, prev_state in enumerate(states):
                v = log_transition_prob(prev_state, current_state) + viterbis[prev_state_index][t - 1]
                if max_v is None or max_v < v:
                    max_v = v
                    prev_hop_state = prev_state
            # O[t - 1] because O is 0-indexed while t is 1-indexed
            viterbis[current_state_index][t] = max_v + log_emission_prob(current_state, O[t - 1])
            prev_states[current_state_index][t] = prev_hop_state

    # Find the most probable final state
    best_p = None
    best_final_state_index = -1
    for i in range(N):
        p = viterbis[i][T]
        if best_p is None or best_p < p:
            best_p = p
            best_final_state_index = i

    # Backtrace the entire path using the prev_states matrix
    path = [index_to_state(best_final_state_index)]
    state_index = best_final_state_index
    for t in range(T, 1, -1):
        prev_state = prev_states[state_index][t]
        path = [prev_state] + path
        state_index = state_num_index(prev_state)

    return (best_p, path)

In [None]:
(p, path) = decode(O)

In [None]:
print(path)

In [None]:
print(p)

The result is the sequence None, Salad, Burger, None, Salad, Burger, None, Burger, Salad, Burger, None, Salad, Burger, Salad, with p approximately -21.639 in log space.
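
One way to gain confidence in a Viterbi implementation is to compare it against an exhaustive search on a short sequence. The sketch below does this for five observations, using a compact NumPy formulation of the same recursion. It assumes a uniform initial distribution and encodes states and observations as indices (states: 0 = Burger, 1 = Salad, 2 = None); the function names `viterbi` and `brute_force` are hypothetical.

```python
import itertools
import numpy as np

states = ['Burger', 'Salad', 'None']
A = np.array([[0.10, 0.65, 0.25],
              [0.55, 0.35, 0.10],
              [0.45, 0.50, 0.05]])
B = np.array([[0.05, 0.40, 0.55],
              [0.08, 0.66, 0.26],
              [0.80, 0.15, 0.05]])
pi = np.full(3, 1.0 / 3.0)  # assumption: uniform initial state distribution
logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

def viterbi(obs):
    """Return (best log probability, best state index path)."""
    score = logpi + logB[:, obs[0]]
    back_pointers = []
    for o in obs[1:]:
        cand = score[:, None] + logA      # cand[i, j]: best path into i, then i -> j
        back_pointers.append(cand.argmax(axis=0))
        score = cand.max(axis=0) + logB[:, o]
    path = [int(score.argmax())]
    for pointers in reversed(back_pointers):
        path.append(int(pointers[path[-1]]))
    return float(score.max()), path[::-1]

def brute_force(obs):
    """Score every state sequence and keep the best one."""
    best_lp, best_seq = -np.inf, None
    for seq in itertools.product(range(len(states)), repeat=len(obs)):
        lp = logpi[seq[0]] + logB[seq[0], obs[0]]
        for t in range(1, len(obs)):
            lp += logA[seq[t - 1], seq[t]] + logB[seq[t], obs[t]]
        if lp > best_lp:
            best_lp, best_seq = float(lp), list(seq)
    return best_lp, best_seq

obs = [0, 0, 2, 0, 1]  # the first five observations, as indices
v_lp, v_path = viterbi(obs)
b_lp, b_path = brute_force(obs)
print([states[i] for i in v_path], v_lp)
print([states[i] for i in b_path], b_lp)
```

The exhaustive search is only feasible for very short sequences, but it checks both the maximization and the backtrace independently.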

## The Learning Problem, the Baum-Welch Algorithm and the hmmlearn library
Finally, it is time to solve the learning problem. The Baum-Welch Algorithm learns the transition and emission probabilities from the input observations, using the Forward and Backward Algorithms as building blocks. To learn more about the algorithm itself, [check here](https://people.cs.umass.edu/~mccallum/courses/inlp2004a/lect10-hmm2.pdf).
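
For readers curious about what one iteration involves, here is a minimal sketch of a single re-estimation step for one observation sequence. It works in plain probability space without scaling, which is only adequate for short sequences, and it uses this tutorial's matrices and a uniform initial distribution as the starting guess. All function and variable names are hypothetical, and this is an illustration of the idea rather than how any particular library implements it.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Return alpha[t, i] and beta[t, i] in plain probability space."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; also returns P(O | input parameters)."""
    obs = np.asarray(obs)
    alpha, beta = forward_backward(obs, pi, A, B)
    likelihood = alpha[-1].sum()
    # gamma[t, i] = P(q_t = i | O)
    gamma = alpha * beta / likelihood
    # xi[t, i, j] = P(q_t = i, q_{t+1} = j | O)
    emit_next = B[:, obs[1:]].T * beta[1:]
    xi = alpha[:-1, :, None] * A[None, :, :] * emit_next[:, None, :] / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood

pi0 = np.full(3, 1.0 / 3.0)  # assumption: uniform starting guess
A0 = np.array([[0.10, 0.65, 0.25], [0.55, 0.35, 0.10], [0.45, 0.50, 0.05]])
B0 = np.array([[0.05, 0.40, 0.55], [0.08, 0.66, 0.26], [0.80, 0.15, 0.05]])
obs = [0, 0, 1, 0, 1, 0, 0, 1, 2, 2, 1, 1, 0, 0, 2, 0, 1, 2, 2, 1, 2, 2, 0, 1]

pi1, A1, B1, lik0 = baum_welch_step(obs, pi0, A0, B0)
_, _, _, lik1 = baum_welch_step(obs, pi1, A1, B1)
print(lik0, lik1)  # likelihood before and after one update
```

Each step cannot decrease the likelihood of the observations, so repeating it until the improvement falls below a threshold gives a locally optimal set of parameters.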

Rather than building a complete implementation from scratch, we will show you how to use the hmmlearn library, which makes this much easier.

First of all, install the library (for example with `pip install hmmlearn`) and then import it:

In [None]:
from hmmlearn import hmm

In [None]:
O_train = ['below 5 min', 'below 5 min', '5 to 20 min', 'below 5 min', \
     '5 to 20 min', 'below 5 min', 'below 5 min', '5 to 20 min', \
     'above 20 min', 'above 20 min', '5 to 20 min', '5 to 20 min', \
     'below 5 min', 'below 5 min', 'above 20 min', 'below 5 min', \
     '5 to 20 min', 'above 20 min', 'above 20 min', '5 to 20 min', \
     'above 20 min', 'above 20 min', 'below 5 min', '5 to 20 min']

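# Convert the observations into a column of indexes;
# hmmlearn expects input of shape (n_samples, n_features)
O_train_index = numpy.array([observation_num_index(o) for o in O_train]).reshape(-1, 1)

# Initialize the model
# Note (depends on your installed version): recent hmmlearn releases provide
# hmm.CategoricalHMM for categorical symbols like these, while MultinomialHMM
# played that role in older releases such as 0.2.0
model = hmm.MultinomialHMM(n_components=N, n_iter=100)

# Train the model using the observations
model.fit(O_train_index)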

Now we can use the model trained to decode new observations using:

In [None]:
# Convert the test observations to a column of indexes, shape (n_samples, 1)
O_index = numpy.array([observation_num_index(o) for o in O]).reshape(-1, 1)

# Let's decode!
(logprob, state_sequence) = model.decode(O_index)
print(logprob)
print(state_sequence)

Of course, the library allows you to initialize transition probabilities, emission probabilities, prior probabilities, adjust the number of iterations and convergence threshold. Check [this link](http://hmmlearn.readthedocs.io/en/0.2.0/index.html) to learn more about the library.

## Further Reading
To learn more about HMM, check these resources:
* Chapter 17 [Machine Learning: A Probabilistic Perspective](https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020/ref=sr_1_1?ie=UTF8&qid=1522271059&sr=8-1&keywords=Machine+Learning%3A+A+Probabilistic+Perspective%2C+by+Kevin+P.+Murphy) by Kevin P. Murphy
* CMU 10601 [Course Website](http://www.cs.cmu.edu/~mgormley/courses/10601-s18/schedule.html)