# Models with latent variables
![img](https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/HMMGraph.svg/2000px-HMMGraph.svg.png)

# Hidden Markov Models

![img](https://stathwang.github.io/images/hmm.png)

## Markov model
Given a set of states $S = \{s_1, ..., s_m\}$, we observe a series over time $z \in S^T$.

Assumptions of the Markov model:
* Limited horizon  
$P(z_t | z_{t-1}, z_{t-2}, ..., z_1) = P(z_t | z_{t-1})$

* Stationary process  
$P(z_{t} | z_{t-1}) = P(z_2 | z_1)$ for $t \in 2..T$

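State transition matrix $A \in \mathbb{R}^{(|S|+1) \times (|S|+1)}$, where  
$A_{ij}$ is the probability of a transition from state $s_i$ to state $s_j$. The extra row and column belong to a special initial state $s_0$: we fix $z_0 = s_0$, so $A_{0j}$ is the probability of starting in state $s_j$.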

We compute the probability of a particular sequence $z$ by the chain rule and the limited horizon assumption:  
$P(z_T, z_{T-1}, ..., z_1; A) = P(z_T | z_{T-1}, ..., z_1; A) P(z_{T-1} | z_{T-2}, ..., z_1; A) ... P(z_1 | z_0; A) =$  
$= P(z_T | z_{T-1}; A) P(z_{T-1} | z_{T-2}; A) ... P(z_2 | z_1; A) P(z_1 | z_0; A) =$  
$= \prod_{t=1}^T P(z_t | z_{t-1}; A) = \prod_{t=1}^T A_{z_{t-1}, z_t}$
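
The product above can be sketched in a few lines of Python. This is only an illustration: the names `init` and `trans` are hypothetical, `init[j]` plays the role of $A_{0,j}$ (the transition out of $s_0$), `trans[i][j]` plays the role of $A_{ij}$, and states are 0-based integers.

```python
def sequence_prob(states, init, trans):
    """Probability of a state sequence z_1..z_T under a first-order Markov chain.

    init[j]     ~ A_{0,j}: probability of starting in state j (assumed layout)
    trans[i][j] ~ A_{ij}:  probability of moving from state i to state j
    """
    p = init[states[0]]                      # P(z_1 | z_0)
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]                # P(z_t | z_{t-1})
    return p


# Made-up two-state chain for illustration.
init = [0.5, 0.5]
trans = [[0.9, 0.1],
         [0.2, 0.8]]
p = sequence_prob([0, 0, 1], init, trans)    # 0.5 * 0.9 * 0.1
```

For long sequences the product underflows, so in practice one would usually sum logarithms instead.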

Maximum likelihood parameter assignment

$l(A) = \log P(z; A) = \log \prod_{t=1}^T A_{z_{t-1}, z_t} = \sum_{t=1}^T \log A_{z_{t-1}, z_t} = \sum_{t=1}^T \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} [z_{t-1} = s_i \wedge z_t = s_j] \log A_{ij}$

We want:  
$l(A) \to \underset {A} {\max}$ subject to  
$\sum_{j=1}^{|S|} A_{ij} = 1$ for every $i$,  
$A_{ij} \geq 0$  

With Lagrange multipliers we get the following estimate:  
$\hat A_{ij} = \frac {\sum_{t=1}^T [z_{t-1} = s_i \wedge z_t = s_j]} {\sum_{t=1}^T [z_{t-1} = s_i]}$  
i.e. the fraction of transitions out of $s_i$ that went to $s_j$.
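
The estimate amounts to counting transitions and normalising each row. Below is a minimal sketch under a few assumptions: states are 0-based integers, only transitions between ordinary states are counted (the initial transition out of $s_0$ is left out), and the function name `estimate_transitions` is made up for this example.

```python
def estimate_transitions(states, n_states):
    """Count-based maximum likelihood estimate of the transition matrix."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, cur in zip(states, states[1:]):
        counts[prev][cur] += 1               # numerator: transitions i -> j

    A_hat = []
    for row in counts:
        total = sum(row)                     # denominator: transitions out of i
        if total == 0:
            # The estimate is undefined for a state that is never left;
            # falling back to a uniform row is an assumption of this sketch.
            A_hat.append([1.0 / n_states] * n_states)
        else:
            A_hat.append([c / total for c in row])
    return A_hat


A_hat = estimate_transitions([0, 0, 1, 1, 0, 1], n_states=2)
```

In this toy sequence state 0 is left three times (once to 0, twice to 1) and state 1 twice (once to each state), which gives the two rows of `A_hat`.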

## Hidden Markov Model

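In a hidden Markov model the states $z \in S^T$ are not observed. At each step we see an output $x_t \in V = \{v_1, ..., v_{|V|}\}$ that depends only on the current state: $P(x_t = v_k | z_t = s_j) = B_{jk}$, where $B$ is the emission matrix.

$P(x; A, B) = \sum_z P(x, z; A, B) = \sum_z P(x|z; A, B) P(z; A, B) =$  
$= \sum_z ( \prod_{t=1}^T P(x_t|z_t; B) ) ( \prod_{t=1}^T P(z_t|z_{t-1}; A) ) = \sum_z ( \prod_{t=1}^T B_{z_t, x_t} ) ( \prod_{t=1}^T A_{z_{t-1}, z_t} )$  
Evaluating this sum directly is exponential in $T$: it runs over all $|S|^T$ state sequences.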
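
To make the exponential cost concrete, here is a brute-force sketch that enumerates every state sequence. It is only usable for tiny examples. The names `init`, `trans` and `emit` are hypothetical: `init[j]` stands for $A_{0,j}$, `trans[i][j]` for $A_{ij}$, `emit[j][k]` for $B_{jk}$, and states and symbols are 0-based integers. The numbers are made up for illustration.

```python
from itertools import product


def brute_force_likelihood(obs, init, trans, emit):
    """P(x) computed by summing P(x, z) over all |S|^T state sequences."""
    n = len(init)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]
likelihood = brute_force_likelihood([0, 1, 2], init, trans, emit)
```

With two states and three observations the loop visits $2^3 = 8$ paths; every extra time step doubles that number.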

Fundamental questions for an HMM:
* What is the probability of the observed sequence?  
Given an HMM (A, B) and observations x, calculate the probability that the HMM generated x.
* What is the most likely series of states to generate the observations?  
Given an HMM (A, B) and observations x, calculate the most likely sequence of hidden states that produced x.  
* How can we learn A and B?  
Given some training observations x and the general structure of the HMM (number of hidden and visible states), determine the (A, B) that best fit the data.

### Forward algorithm 
for computing the probability of an observed sequence:

Define $\alpha_i(t) = P(x_1, x_2, ..., x_t, z_t=s_i; A, B)$ - the total probability of all observations up through time $t$ and of being in state $s_i$ at time $t$.
Then,  
$P(x; A, B) = P(x_1, ..., x_T; A, B) = \sum_{i=1}^{|S|} P(x_1, ..., x_T, z_T = s_i; A, B) = \sum_{i=1}^{|S|} \alpha_i(T)$

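We can compute $\alpha$ in $O(|S|^2 T)$ by dynamic programming ($|S|$ values per time step, each a sum over $|S|$ predecessors):  
$\alpha_0(0) = 1$, $\alpha_i(0) = 0$ for $i \geq 1$ (the chain always starts in $s_0$)  
$\alpha_j(t) = \sum_{i=0}^{|S|} \alpha_i(t-1) A_{ij} B_{j, x_t}$ for $t = 1..T$, $j = 1..|S|$, so that $\alpha_j(1) = A_{0, j} B_{j, x_1}$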
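
A minimal sketch of the forward recursion follows. As an assumption of this example, the start state is folded into a separate vector `init` (so `init[j]` stands for $A_{0,j}$ and the first column is `init[j] * emit[j][obs[0]]`), `trans[i][j]` stands for $A_{ij}$ and `emit[j][k]` for $B_{jk}$. All names and numbers are illustrative.

```python
def forward(obs, init, trans, emit):
    """Return alpha, where alpha[t][j] = P(x_1..x_{t+1}, z_{t+1} = j) (0-based t)."""
    n = len(init)
    # First time step: start in state j and emit the first observation.
    alpha = [[init[j] * emit[j][obs[0]] for j in range(n)]]
    for x in obs[1:]:
        prev = alpha[-1]
        # Sum over all predecessor states i, then emit x from state j.
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][x]
            for j in range(n)
        ])
    return alpha


init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]
alpha = forward([0, 1, 2], init, trans, emit)
likelihood = sum(alpha[-1])   # P(x) = sum_i alpha_i(T)
```

Each time step costs $|S|^2$ multiplications, so the whole table is filled in time proportional to $|S|^2 T$. For long sequences the values underflow; rescaling each column or working in log space is the usual remedy.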


### Backward algorithm

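Define $\beta_i(t) = P(x_{t+1}, ..., x_T | z_t=s_i; A, B)$ - the probability of observing the rest of the sequence after time step $t$, given that we are in state $s_i$ at time $t$.  
$\beta_i(T) = 1$ (nothing is left to observe)  
$\beta_i(t) = \sum_{j=1}^{|S|} A_{ij} B_{j, x_{t+1}} \beta_j(t+1)$  
Together with the forward quantities, $P(x; A, B) = \sum_{i=1}^{|S|} \alpha_i(t) \beta_i(t)$ for any $t$.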
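
The backward recursion can be sketched the same way. This example assumes there is no dedicated end state, so the recursion starts from a value of 1 at the last time step. The names `init`, `trans` and `emit` are hypothetical stand-ins for $A_{0,\cdot}$, $A$ and $B$, with 0-based states and symbols.

```python
def backward(obs, trans, emit):
    """Return beta, where beta[t][i] = P(x_{t+2}..x_T | z_{t+1} = i) (0-based t)."""
    n = len(trans)
    T = len(obs)
    beta = [[1.0] * n for _ in range(T)]     # last row stays at 1
    for t in range(T - 2, -1, -1):
        for i in range(n):
            # Move to state j, emit the next observation, then continue from j.
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n)
            )
    return beta


init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]
obs = [0, 1, 2]
beta = backward(obs, trans, emit)
# The same likelihood as the forward pass, recovered from the first time step.
likelihood = sum(init[i] * emit[i][obs[0]] * beta[0][i] for i in range(len(init)))
```

Getting the same likelihood from the backward table as from the forward table is a cheap sanity check for both implementations.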

### Viterbi algorithm  
for maximum likelihood state assignment. 

Given series of outputs $x \in V^T$:  

$\arg\max_z P(z|x; A, B) = \arg\max_z \frac {P(x, z; A, B)} {\sum_{z'} P(x, z'; A, B)} = \arg\max_z P(x, z; A, B)$  
(the denominator does not depend on $z$)

The naive approach enumerates all state sequences in $O(|S|^T)$.  

Let $\pi[j, s]$ be the maximum joint probability of the observations $x_1, ..., x_j$ and any state sequence ending in state $s$ at position $j$.

$\pi[1, s] = A_{0, s} B_{s, x_1}$

For $j = 2..T$:  
$\pi[j, s] = \max_{i \in \{1 .. |S|\}} \pi[j-1, i] A_{i, s} B_{s, x_j}$  
$bp[j, s] = \arg\max_{i \in \{1 .. |S|\}} \pi[j-1, i] A_{i, s} B_{s, x_j}$

Recover the whole sequence by following the backpointers:  
$\hat z_T = \arg\max_s \pi[T, s]$  
$\hat z_{j-1} = bp[j, \hat z_j]$ for $j = T..2$

Complexity $O(T |S|^2)$
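
The recursion and the backtracking step can be sketched as follows. As before, this is an illustration under assumptions: `init[s]` stands for $A_{0,s}$, `trans[i][s]` for $A_{i,s}$, `emit[s][k]` for $B_{s,k}$, and the numbers are made up.

```python
def viterbi(obs, init, trans, emit):
    """Return (most likely state path, its joint probability with obs)."""
    n = len(init)
    pi = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    bp = []                                  # bp[j-1][s]: best predecessor of s at step j
    for x in obs[1:]:
        prev = pi[-1]
        row, back = [], []
        for s in range(n):
            # The emission term is the same for every predecessor,
            # so it can be left out of the argmax.
            best = max(range(n), key=lambda i: prev[i] * trans[i][s])
            back.append(best)
            row.append(prev[best] * trans[best][s] * emit[s][x])
        pi.append(row)
        bp.append(back)

    # Backtrack from the best final state.
    last = max(range(n), key=lambda s: pi[-1][s])
    path = [last]
    for back in reversed(bp):
        path.append(back[path[-1]])
    path.reverse()
    return path, pi[-1][last]


init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]
path, prob = viterbi([0, 1, 2], init, trans, emit)
```

The structure mirrors the forward algorithm with the sum replaced by a max, which is why the two share the same $O(T |S|^2)$ cost.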

### EM for HMM

Repeat until convergence:

E-step:  

$Q(z) = p(z|x; A, B)$


M-step:

$A, B = \arg\max_{A, B} \sum_z Q(z) \log \frac {P(x, z; A, B)} {Q(z)}$  
s.t. $\sum_{j=1}^{|S|} A_{ij} = 1$, $A_{ij} \geq 0$  
$\sum_{k=1}^{|V|} B_{ik} = 1$, $B_{ik} \geq 0$


### Forward-backward algorithm (Baum-Welch)

Init A, B as random probability matrices with  
$A_{i0} = 0$, $B_{0k} = 0$ (no transitions back into $s_0$, and $s_0$ emits nothing)

Repeat until convergence:  
    
E-step: 
    
compute $\alpha_i(t)$ and $\beta_i(t)$ for $i = 1..|S|$, $t = 1..T$ with the forward and backward algorithms

Set $\gamma_t(i,j) = \alpha_i(t-1) A_{ij} B_{j, x_t} \beta_j(t)$ for $t = 1..T$ - proportional to the probability of being in state $s_i$ at time $t-1$ and in state $s_j$ at time $t$, given x. For $t = 1$ take $\alpha_0(0) = 1$ and $\alpha_i(0) = 0$ for $i \geq 1$, since the chain starts in $s_0$.  

M-step: 

Reestimate:

$A_{ij} = \frac { \sum_{t=1}^T \gamma_t(i,j) } { \sum_{j'=1}^{|S|} \sum_{t=1}^T \gamma_t(i,j') }$  
= {expected number of transitions from $s_i$ to $s_j$} / {expected number of transitions out of $s_i$}  
$B_{jk} = \frac { \sum_{i=0}^{|S|} \sum_{t=1}^T [x_t=v_k] \gamma_t(i,j) } { \sum_{i=0}^{|S|} \sum_{t=1}^T \gamma_t(i,j) }$  
= {expected number of times $v_k$ is observed in state $s_j$} / {expected number of times in state $s_j$}
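
One iteration of the procedure can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the start state is kept as a separate vector `init` (standing for row $A_{0,\cdot}$), the backward pass starts from 1 at the last step (no end state), there is a single training sequence, all parameters are strictly positive, and no rescaling is done, so it only suits short sequences. All function and variable names are hypothetical.

```python
def forward(obs, init, trans, emit):
    n = len(init)
    alpha = [[init[j] * emit[j][obs[0]] for j in range(n)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][x]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    n, T = len(trans), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def baum_welch_step(obs, init, trans, emit):
    """One E-step plus M-step; also returns P(x) under the input parameters."""
    n, m, T = len(init), len(emit[0]), len(obs)
    alpha = forward(obs, init, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])

    # E-step: occ[t][j] is the posterior probability of state j at step t,
    # trans_count[i][j] the expected number of i -> j transitions.
    occ = [[alpha[t][j] * beta[t][j] / likelihood for j in range(n)]
           for t in range(T)]
    trans_count = [[sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                        for t in range(T - 1)) / likelihood
                    for j in range(n)] for i in range(n)]

    # M-step: normalise the expected counts.
    new_init = occ[0][:]
    new_trans = [[trans_count[i][j] / sum(trans_count[i]) for j in range(n)]
                 for i in range(n)]
    new_emit = [[sum(occ[t][j] for t in range(T) if obs[t] == k)
                 / sum(occ[t][j] for t in range(T))
                 for k in range(m)] for j in range(n)]
    return new_init, new_trans, new_emit, likelihood


obs = [0, 1, 2, 2, 1, 0, 0, 2]
params = [[0.6, 0.4],
          [[0.7, 0.3], [0.4, 0.6]],
          [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]]
likelihoods = []
for _ in range(5):
    *params, lik = baum_welch_step(obs, *params)
    likelihoods.append(lik)
```

Because this is an instance of EM, the recorded likelihoods should never decrease from one iteration to the next, which makes a convenient correctness check. The result is still only a local optimum that depends on the initial parameters.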
