### Hidden Markov Models (HMM)

Hidden Markov Models can help us solve the problem of sequence labelling, e.g. given a sequence of words, finding the corresponding sequence of part-of-speech (POS) tags. HMMs are an extension of Markov chains. The state (random) variables $T=\{t_1, t_2,...,t_N\}$ (e.g. $N$ different POS tags) are considered `hidden`. Given a sequence of these hidden states, e.g. $t_1,...,t_{i-1}$, we invoke the `markov assumption` (the same assumption a bigram LM makes), i.e. the next state in the sequence depends only on the previous state, so $P(t_i|t_1,..,t_{i-1}) = P(t_i|t_{i-1})$. These are also called `transition probabilities`. Since each conditional distribution $P(\cdot|t_{i-1})$ is categorical (the state takes one of a finite number of possible values), we can use `maximum likelihood estimation` to estimate the transition probabilities from a tagged training corpus as relative frequencies:

$P(t_i|t_{i-1}) = \frac{count(t_{i-1},t_i)}{count(t_{i-1})}$

Instead of observing a sequence of states directly, we observe a sequence of `observations` $O=\{w_1, .., w_T\}$, which are a different set of random variables. In the case of POS tagging, these are a sequence of $T$ words (from here on, $T$ denotes the sequence length), and each $w_i$ is drawn from the same vocabulary $V$. Here we invoke the `output independence assumption`, according to which the $i$th observation in the sequence depends only on the corresponding $i$th hidden state (i.e. the $i$th state generates the $i$th observation). This means that $P(w_i|t_1, ..t_T, w_1, ..,w_{i-1}, w_{i+1}, ..,w_T) = P(w_i|t_i)$. These are called `observation likelihoods`/`emission probabilities` and can also be estimated using MLE:

$P(w_i|t_i) = \frac{count(w_i,t_i)}{count(t_i)}$
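
As a minimal sketch of the two count-based estimates above, the snippet below computes transition and emission probabilities from a tiny hand-made tagged corpus. The corpus, the `estimate_hmm` helper and the `<s>` start pseudo-tag are illustrative assumptions, not part of the original notes. Note that the denominator for transitions counts a tag only when it appears as a left context.

```python
from collections import Counter

START = "<s>"  # assumed start-of-sequence pseudo-tag

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

def estimate_hmm(corpus):
    """Relative-frequency (MLE) estimates of transition and emission probabilities."""
    bigram_counts = Counter()    # count(t_{i-1}, t_i)
    context_counts = Counter()   # count(t_{i-1}), counted as a left context
    emission_counts = Counter()  # count(w_i, t_i)
    tag_counts = Counter()       # count(t_i)
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            bigram_counts[(prev, tag)] += 1
            context_counts[prev] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    transition = {(p, t): c / context_counts[p] for (p, t), c in bigram_counts.items()}
    emission = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transition, emission

transition, emission = estimate_hmm(corpus)
print(transition[("DET", "NOUN")])  # P(NOUN | DET)
print(emission[("NOUN", "dog")])    # P(dog | NOUN)
```

Pairs that never occur in the corpus are simply absent from the dictionaries, i.e. they get probability zero under plain MLE.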



Our main goal is to `infer`/`decode` the most likely sequence of hidden states that could have generated the observed sequence of words:   

$\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } P(t_1,..,t_T|w_1,..,w_T) =  \text{argmax}_{t_{1:T}} \frac{P(w_1,..,w_T|t_1,..,t_T) P(t_1,..,t_T)}{P(w_1,..,w_T)} = \text{argmax}_{t_{1:T}} P(w_1,..,w_T|t_1,..,t_T) P(t_1,..,t_T)$

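where we used Bayes' rule and dropped the denominator $P(w_1,..,w_T)$, since it does not depend on the tag sequence and so does not affect the argmax. Invoking the output independence assumption for the observations, we can write: $P(w_1,..,w_T|t_1,..,t_T) = \prod_{i=1}^T P(w_i|t_i)$, and invoking the Markov assumption for the states, we can write: $P(t_1,..,t_T) = \prod_{i=1}^T P(t_i|t_{i-1})$, where $t_0$ denotes a special start-of-sequence state. This leads to the following: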

$\hat{t}_{1:T} = \text{argmax}_{t_{1:T}} \text{ } \prod_{i=1}^T P(t_i|t_{i-1})  P(w_i|t_i)$
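
To make the objective above concrete, here is a small sketch that scores a tag sequence with the product of transition and emission probabilities and then finds the argmax by exhaustive enumeration. All probabilities, the two-tag set and the helper names (`joint_score`, `brute_force_decode`) are made-up assumptions for illustration; `pi` plays the role of the transition out of the start state.

```python
from itertools import product

# Hypothetical toy model with two tags.
tags = ["NOUN", "VERB"]
pi = {"NOUN": 0.7, "VERB": 0.3}  # P(tag | start of sequence)
trans = {
    ("NOUN", "NOUN"): 0.4, ("NOUN", "VERB"): 0.6,
    ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2,
}
emit = {
    ("NOUN", "dogs"): 0.6, ("NOUN", "bark"): 0.4,
    ("VERB", "dogs"): 0.1, ("VERB", "bark"): 0.9,
}

def joint_score(words, tag_seq):
    """Product over positions of P(t_i | t_{i-1}) * P(w_i | t_i)."""
    score = 1.0
    prev = None
    for word, tag in zip(words, tag_seq):
        p_trans = pi[tag] if prev is None else trans[(prev, tag)]
        score *= p_trans * emit.get((tag, word), 0.0)
        prev = tag
    return score

def brute_force_decode(words):
    """Score every one of the N**T tag sequences and keep the best."""
    return max(product(tags, repeat=len(words)),
               key=lambda seq: joint_score(words, seq))

words = ["dogs", "bark"]
best = brute_force_decode(words)
print(best, joint_score(words, best))
```

Enumeration is only feasible for tiny inputs, because the number of candidate sequences grows exponentially with the sentence length.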

Computing this product for every possible tag sequence and then taking the maximum would mean scoring $N^T$ sequences. Instead, we can note that the problem has optimal substructure: the best path that ends in a given tag at the $i$th word must extend the best path that ends in its preceding tag at the $(i-1)$th word. We can therefore use `dynamic programming` to obtain the solution in $O(N^2 T)$ time; this is known as the `Viterbi algorithm`.

In the Viterbi algorithm, we set up a matrix whose columns represent the observations at each step and whose rows represent the possible hidden states. We define $v_t(j)$ as the cell in column $t$ and row $j$: it holds the probability of the HMM being in state $j$ after seeing the first $t$ observations and passing through the most probable sequence of states for the first $t-1$ positions, i.e. the probability of the most probable path that reaches that cell. Given that we've already computed the values in the preceding column $t-1$, we can compute the value of each cell in column $t$ using the following recurrence relation:

$v_t(j) = \max_{i=1}^N \text{ } v_{t-1}(i) P(t_j|t_i) P(w_t|t_j)$

Here $i$ ranges over the tags in the previous column: for each cell, we choose the previous tag $t_i$ whose best path gives the most probable extension to $t_j$. In addition, inside each cell, we also store a `backpointer` to that $t_i$ so that the path can be recovered later.

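(Note that if we take negative logs, the products become sums, and Viterbi becomes a shortest-path computation over the matrix viewed as a layered graph. This is why it resembles shortest-path algorithms such as Dijkstra's, although here the columns already give the order in which cells are processed, so no priority queue is needed.)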
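
The shortest-path analogy can be checked on a single cell of the matrix. The sketch below, with made-up numbers and hypothetical variable names, compares picking the predecessor by maximum probability with picking it by minimum cost, where each cost is a negative log probability.

```python
import math

# Hypothetical values for one cell: column t-1 scores, transitions into the
# target tag, and the emission probability of the current word.
prev_v = {"NOUN": 0.42, "VERB": 0.03}
trans_to_target = {"NOUN": 0.6, "VERB": 0.2}
emission_prob = 0.9

def path_prob(prev_tag):
    """v_{t-1}(i) * P(t_j | t_i) * P(w_t | t_j) for one predecessor."""
    return prev_v[prev_tag] * trans_to_target[prev_tag] * emission_prob

def path_cost(prev_tag):
    """Same quantity as a sum of non-negative edge costs (negative logs)."""
    return (-math.log(prev_v[prev_tag])
            - math.log(trans_to_target[prev_tag])
            - math.log(emission_prob))

best_by_prob = max(prev_v, key=path_prob)
best_by_cost = min(prev_v, key=path_cost)
print(best_by_prob, best_by_cost)
print(path_prob(best_by_prob), math.exp(-path_cost(best_by_cost)))
```

Both criteria select the same predecessor because the logarithm is monotonic, so maximizing a product of probabilities is the same as minimizing the sum of their negative logs.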

To run this algorithm, we need to initialize all the cells in the first column, which hold the probability of starting in each possible hidden state and emitting the first observation. We can compute these using the distribution over initial hidden states:

$v_1(j) = P(t_j|<s>) P(w_1|t_j) = \pi_j P(w_1|t_j)$, where $<s>$ denotes a special start of sequence hidden state (like a start of sentence token) and $\pi_j = P(t_j|<s>)$ denotes the probability distribution over all possible starting states. Then, using the recurrence relation, we can fill out the remaining columns one by one. Finally, once we've computed the last column $v_T(j)$, we can pick the cell with the largest probability, which is the final state along the optimal path, $\hat{t}_T = \text{argmax}_j v_T(j)$, and trace backward along the optimal path using the backpointers.
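
Putting the initialization, recurrence and backtrace together, a minimal pure-Python sketch of the procedure might look as follows. The toy probabilities and the `viterbi` function signature are assumptions for illustration; the sketch assumes a non-empty input, treats unseen (tag, word) pairs as probability zero, and works in probability space to mirror the formulas, whereas a practical version would usually sum log probabilities.

```python
def viterbi(words, tags, pi, trans, emit):
    """Return the most probable tag path and its probability."""
    # First column: v_1(j) = pi_j * P(w_1 | t_j)
    v = [{tag: pi[tag] * emit.get((tag, words[0]), 0.0) for tag in tags}]
    backptr = [{}]
    for t in range(1, len(words)):
        v.append({})
        backptr.append({})
        for tag in tags:
            # Best predecessor: argmax_i v_{t-1}(i) * P(t_j | t_i)
            best_prev = max(tags, key=lambda p: v[t - 1][p] * trans[(p, tag)])
            v[t][tag] = (v[t - 1][best_prev] * trans[(best_prev, tag)]
                         * emit.get((tag, words[t]), 0.0))
            backptr[t][tag] = best_prev
    # Pick the best final state, then follow backpointers to the start.
    last = max(tags, key=lambda tag: v[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Hypothetical toy model with two tags.
tags = ["NOUN", "VERB"]
pi = {"NOUN": 0.7, "VERB": 0.3}
trans = {
    ("NOUN", "NOUN"): 0.4, ("NOUN", "VERB"): 0.6,
    ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2,
}
emit = {
    ("NOUN", "dogs"): 0.6, ("NOUN", "bark"): 0.4,
    ("VERB", "dogs"): 0.1, ("VERB", "bark"): 0.9,
}

path, prob = viterbi(["dogs", "bark", "dogs"], tags, pi, trans, emit)
print(path, prob)
```

Each column costs $N^2$ multiplications (every tag against every predecessor), so the whole table is filled in time proportional to $N^2 T$.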
