# Belief propagation for HMM

<img src = '../illustrations/hidden-markov-model.png' width=100%>


There are generic methods for belief propagation in any tree. However, as the Hidden Markov Model has a very specific structure, we can simplify the derivation of belief propagation rules.



## I. Evidence and local likelihood

<img src = '../illustrations/belief-propagation-in-hmm-i.png' width=100%>

Note that evidence in an HMM can be attached only to the observation nodes $Y_i$.
Moreover, the evidence is usually direct, i.e. $Y_i=y_i$, or missing altogether if we failed to record $Y_i$.

Thus we can define local likelihood vectors for observed variables $Y_i$: 

\begin{align*}
\lambda_i^*(x_i)=
\Pr[Y_i=y_i|X_i=x_i]=\delta[x_i, y_i] 
\end{align*}

and constant local likelihood vectors for unobserved variables $Y_i$:

\begin{align*}
\lambda_i^*(x_i)=
\Pr[Y_i\in\mathrm{supp}(Y_i)|X_i=x_i]=1\enspace.
\end{align*}
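
The two cases above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the text: hidden states and observation symbols are encoded as integer indices, `delta[x, y]` stores $\delta[x, y]$ as a matrix, a missing observation is represented by `None`, and the function name and the numbers in `delta` are hypothetical.

```python
import numpy as np


def local_likelihood(delta, y):
    """Return the local likelihood vector lambda*_i over the hidden states.

    delta[x, y] is assumed to hold Pr[Y_i = y | X_i = x].
    y is a symbol index, or None when the observation is missing.
    """
    delta = np.asarray(delta, dtype=float)
    if y is None:
        # Missing observation: constant vector of ones.
        return np.ones(delta.shape[0])
    # Direct evidence Y_i = y: the column of delta that belongs to y.
    return delta[:, y].copy()


# Hypothetical emission matrix with two hidden states and three symbols.
delta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

lam_observed = local_likelihood(delta, 2)
lam_missing = local_likelihood(delta, None)
```

Treating a missing observation as a vector of ones is convenient because later propagation rules can then use the same code path for both cases.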


## II. Prior propagation rule 

<img src = '../illustrations/belief-propagation-in-hmm-ii.png' width=100%>

### Previous observation $y_{i-1}$ is known 

As the upstream evidence for $X_i$ consists of observations of $y_1, \ldots, y_{i-1}$, we can express

\begin{align*}
\pi_{X_i}(x_i)&=\Pr[X_i=x_i|y_1,\ldots, y_{i-1}]\\
&=\sum_{x_{i-1}}\Pr[X_i=x_i, X_{i-1}=x_{i-1}|y_1,\ldots, y_{i-1}] \\
&=\sum_{x_{i-1}}\Pr[X_i=x_i| X_{i-1}=x_{i-1}, y_1,\ldots, y_{i-1}]\cdot \Pr[X_{i-1}=x_{i-1}| y_1,\ldots, y_{i-1}]\\
&=\sum_{x_{i-1}}\Pr[X_i=x_i| X_{i-1}=x_{i-1}]\cdot \Pr[X_{i-1}=x_{i-1}| y_1,\ldots, y_{i-1}]\enspace.
\end{align*}

For the second term we can apply the Bayes rule:

\begin{align*}
\Pr[X_{i-1}=x_{i-1}| y_1,\ldots, y_{i-1}] &= \frac{\Pr[y_{i-1}|X_{i-1}=x_{i-1}, y_1,\ldots,y_{i-2}]\cdot \Pr[X_{i-1}=x_{i-1}| y_1,\ldots,y_{i-2}]}{\Pr[y_{i-1}| y_1,\ldots, y_{i-2}]}\\
&= \frac{\Pr[y_{i-1}|X_{i-1}=x_{i-1}]\cdot \pi_{X_{i-1}}(x_{i-1})}{\Pr[y_{i-1}| y_1,\ldots, y_{i-2}]}\\
&=\frac{\lambda^*_{i-1}(x_{i-1}) \cdot \pi_{X_{i-1}}(x_{i-1})}{\Pr[y_{i-1}| y_1,\ldots, y_{i-2}]}\enspace.
\end{align*}

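As the observations $y_1, \ldots, y_{i-1}$ are fixed, the denominator does not depend on $x_{i-1}$ and acts as a normalising constant. Writing $\alpha[x_{i-1}, x_i]=\Pr[X_i=x_i|X_{i-1}=x_{i-1}]$ for the transition probability, we get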

\begin{align*}
\pi_{X_i}(x_i)\propto
\sum_{x_{i-1}}\alpha[x_{i-1}, x_i]\cdot \lambda^*_{i-1}(x_{i-1}) \cdot \pi_{X_{i-1}}(x_{i-1})\enspace.
\end{align*}



### Previous observation $y_{i-1}$ is missing

As the observation $y_{i-1}$ is missing, the upstream evidence consists of $y_{1}, \ldots, y_{i-2}$ and thus

\begin{align*}
\pi_{X_i}(x_i)&=\Pr[X_i=x_i|y_1,\ldots, y_{i-2}]\\
&=\sum_{x_{i-1}}\Pr[X_i=x_i, X_{i-1}=x_{i-1}|y_1,\ldots, y_{i-2}] \\
&=\sum_{x_{i-1}}\Pr[X_i=x_i| X_{i-1}=x_{i-1}, y_1,\ldots, y_{i-2}]\cdot \Pr[X_{i-1}=x_{i-1}| y_1,\ldots, y_{i-2}]\\
&=\sum_{x_{i-1}}\Pr[X_i=x_i| X_{i-1}=x_{i-1}]\cdot \Pr[X_{i-1}=x_{i-1}| y_1,\ldots, y_{i-2}]\\
&\propto \sum_{x_{i-1}}\alpha[x_{i-1},x_i]\cdot \pi_{X_{i-1}}(x_{i-1})\enspace. 
\end{align*}
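
Both prior propagation rules can be handled by one loop, because the rule for a missing $y_{i-1}$ is the rule for a known $y_{i-1}$ with $\lambda^*_{i-1}\equiv 1$. The sketch below assumes integer-coded states and symbols, matrices `alpha[a, b]` for $\alpha[a, b]$ and `delta[x, y]` for $\delta[x, y]$, `None` for a missing observation, and an initial prior `prior_x1` for $X_1$. The initial prior, the function name and all numbers are assumptions made for the example.

```python
import numpy as np


def propagate_prior(alpha, delta, ys, prior_x1):
    """Return an array whose i-th row is the prior pi_{X_i} (0-based rows).

    alpha[a, b] = Pr[X_i = b | X_{i-1} = a]
    delta[x, y] = Pr[Y_i = y | X_i = x]
    ys          = list of symbol indices, None marks a missing observation
    prior_x1    = assumed prior distribution of the first hidden state
    """
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    pi = np.asarray(prior_x1, dtype=float)
    priors = [pi]
    # The prior of X_i uses the observations up to y_{i-1} only,
    # so the last observation is never consumed here.
    for y in ys[:-1]:
        lam = np.ones(len(pi)) if y is None else delta[:, y]
        # Sum over x_{i-1} of alpha[x_{i-1}, x_i] * lambda*(x_{i-1}) * pi(x_{i-1}).
        pi = (lam * pi) @ alpha
        # Normalise; this fails if the evidence has probability zero.
        pi = pi / pi.sum()
        priors.append(pi)
    return np.array(priors)


# Hypothetical model: two hidden states, three symbols, y_2 is missing.
alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
delta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
priors = propagate_prior(alpha, delta, [0, None, 2], [0.5, 0.5])
```

Normalising after every step keeps each row a proper distribution, which is also what the $\propto$ sign in the derivation permits.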



## III. Likelihood propagation rule

<img src = '../illustrations/belief-propagation-in-hmm-iii.png' width=100%>


As the downstream evidence for $X_i$ is $y_i, \ldots, y_n$ we get

\begin{align*}
\lambda_{X_i}(x_i)&=\Pr[y_{i},\ldots, y_n|X_i=x_i]\\
&=\Pr[y_{i}|X_i=x_i]\cdot\Pr[y_{i+1},\ldots, y_n|X_i=x_i]\\
&=\lambda_{i}^*(x_i)\cdot\Pr[y_{i+1},\ldots, y_n|X_i=x_i]\enspace.
\end{align*}

Now note that

\begin{align*}
\Pr[y_{i+1},\ldots, y_n|X_i=x_i]&=\sum_{x_{i+1}}\Pr[y_{i+1},\ldots, y_n,X_{i+1}=x_{i+1}|X_i=x_i]\\
&=\sum_{x_{i+1}}\Pr[y_{i+1},\ldots, y_n|X_{i+1}=x_{i+1},X_i=x_i]\cdot \Pr[X_{i+1}=x_{i+1}|X_i=x_i]\\
&=\sum_{x_{i+1}}\Pr[y_{i+1},\ldots, y_n|X_{i+1}=x_{i+1}]\cdot \Pr[X_{i+1}=x_{i+1}|X_i=x_i]\\
&=\sum_{x_{i+1}}\lambda_{X_{i+1}}(x_{i+1})\cdot \alpha[x_i,x_{i+1}]
\end{align*}

and consequently

\begin{align*}
\lambda_{X_i}(x_i)=
\lambda_{i}^*(x_i)\cdot\sum_{x_{i+1}}\alpha[x_i,x_{i+1}]\cdot\lambda_{X_{i+1}}(x_{i+1})\enspace.
\end{align*}
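
The recursion runs backwards from the last timestep. A minimal sketch follows, again assuming integer-coded states and symbols, matrices `alpha` and `delta` for $\alpha$ and $\delta$, and `None` for a missing observation. For $i=n$ no observations remain after $y_n$, so the sum is taken to be $1$ and $\lambda_{X_n}=\lambda^*_n$; this boundary condition follows from the definition but is not spelled out in the text. The function name and numbers are hypothetical.

```python
import numpy as np


def propagate_likelihood(alpha, delta, ys):
    """Return an array whose i-th row is lambda_{X_i} (0-based rows).

    alpha[a, b] = Pr[X_{i+1} = b | X_i = a]
    delta[x, y] = Pr[Y_i = y | X_i = x]
    ys          = list of symbol indices, None marks a missing observation
    """
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n, k = len(ys), alpha.shape[0]
    lams = np.ones((n, k))
    # Likelihood of the observations after the current step; empty at i = n.
    future = np.ones(k)
    for i in range(n - 1, -1, -1):
        local = np.ones(k) if ys[i] is None else delta[:, ys[i]]
        lams[i] = local * future
        # Sum over the later state of alpha[earlier, later] * lambda(later),
        # which is the factor needed one step earlier.
        future = alpha @ lams[i]
    return lams


# Hypothetical model: two hidden states, three symbols, y_2 is missing.
alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
delta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
lams = propagate_likelihood(alpha, delta, [0, None, 2])
```

The values are left unnormalised so that they match the formula exactly. For long sequences they shrink towards zero, so a practical version would rescale each row or work with logarithms.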




## IV. Filtering and smoothing

Recall that filtering is a prediction of $X_i$ given the observations available up to the $i$-th timestep:

\begin{align*}
\Pr[X_i=x_i|y_1,\ldots, y_i]
&=\frac{\Pr[y_i|y_1,\ldots, y_{i-1}, X_i=x_i]\cdot \Pr[X_i=x_i|y_1,\ldots, y_{i-1}]}{\Pr[y_i|y_1,\ldots, y_{i-1}]}\\
&\propto \Pr[y_i|X_i=x_i]\cdot \Pr[X_i=x_i|y_1,\ldots, y_{i-1}]\\
&\propto \lambda_i^*(x_i)\cdot \pi_{X_{i}}(x_i)\enspace.
\end{align*}

In other words, we combine the prior and the local likelihood to get the posterior.
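
Filtering can be sketched as a single forward loop that alternates the combination step above with the prior propagation rule. The sketch assumes integer-coded states and symbols, matrices `alpha` and `delta` for $\alpha$ and $\delta$, `None` for a missing observation, and an assumed initial prior `prior_x1` for $X_1$. All names and numbers are hypothetical.

```python
import numpy as np


def filter_posteriors(alpha, delta, ys, prior_x1):
    """Return an array whose i-th row is Pr[X_i | y_1, ..., y_i] (0-based rows).

    alpha[a, b] = Pr[X_{i+1} = b | X_i = a]
    delta[x, y] = Pr[Y_i = y | X_i = x]
    ys          = list of symbol indices, None marks a missing observation
    prior_x1    = assumed prior distribution of the first hidden state
    """
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    pi = np.asarray(prior_x1, dtype=float)
    filtered = []
    for y in ys:
        lam = np.ones(len(pi)) if y is None else delta[:, y]
        # Posterior is proportional to local likelihood times prior.
        post = lam * pi
        post = post / post.sum()
        filtered.append(post)
        # Prior of the next hidden state: push the posterior through alpha.
        pi = post @ alpha
    return np.array(filtered)


# Hypothetical model: two hidden states, three symbols, y_2 is missing.
alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
delta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
filtered = filter_posteriors(alpha, delta, [0, None, 2], [0.5, 0.5])
```

When an observation is missing, the filtered posterior equals the prior for that timestep, since the local likelihood is constant.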


Recall that smoothing is a prediction of $X_i$ given all observations $y_1,\ldots, y_n$:

\begin{align*}
\Pr[X_i=x_i|y_1,\ldots, y_n]
&=\frac{\Pr[y_i,\ldots, y_n|y_1,\ldots, y_{i-1}, X_i=x_i]\cdot \Pr[X_i=x_i|y_1,\ldots, y_{i-1}]}{\Pr[y_i,\ldots, y_n|y_1,\ldots, y_{i-1}]}\\
&\propto \Pr[y_i,\ldots, y_{n}|X_i=x_i]\cdot \Pr[X_i=x_i|y_1,\ldots, y_{i-1}]\\
&\propto \Pr[y_i|X_i=x_i]\cdot\Pr[y_{i+1},\ldots, y_{n}|X_i=x_i]\cdot \Pr[X_i=x_i|y_1,\ldots, y_{i-1}]\\
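&\propto \lambda_{X_i}(x_i)\cdot \pi_{X_{i}}(x_i)\enspace,
\end{align*}

where the last step uses $\lambda_{X_i}(x_i)=\lambda_i^*(x_i)\cdot\Pr[y_{i+1},\ldots, y_{n}|X_i=x_i]$ from the likelihood propagation rule, so the local likelihood is already included in $\lambda_{X_i}$.

In other words, we combine the prior, the local likelihood and the likelihood of the remaining observations.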
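
Smoothing can be sketched as a forward pass for the priors followed by a backward pass for the likelihoods. By the definition in Section III, $\lambda_{X_i}(x_i)=\lambda_i^*(x_i)\cdot\Pr[y_{i+1},\ldots,y_n|X_i=x_i]$, so the code multiplies the prior, the local likelihood and the likelihood of the remaining observations exactly once each. The sketch assumes integer-coded states and symbols, matrices `alpha` and `delta` for $\alpha$ and $\delta$, `None` for a missing observation, and an assumed initial prior `prior_x1`. All names and numbers are hypothetical.

```python
import numpy as np


def smooth_posteriors(alpha, delta, ys, prior_x1):
    """Return an array whose i-th row is Pr[X_i | y_1, ..., y_n] (0-based rows).

    alpha[a, b] = Pr[X_{i+1} = b | X_i = a]
    delta[x, y] = Pr[Y_i = y | X_i = x]
    ys          = list of symbol indices, None marks a missing observation
    prior_x1    = assumed prior distribution of the first hidden state
    """
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n, k = len(ys), alpha.shape[0]
    # Local likelihood vectors; ones for missing observations.
    local = np.array([np.ones(k) if y is None else delta[:, y] for y in ys])

    # Forward pass: priors pi_{X_i}, normalised at every step.
    pi = np.empty((n, k))
    pi[0] = prior_x1
    for i in range(1, n):
        msg = (local[i - 1] * pi[i - 1]) @ alpha
        pi[i] = msg / msg.sum()

    # Backward pass: likelihood of the observations after step i,
    # rescaled at every step, which does not change the final posterior.
    future = np.ones((n, k))
    for i in range(n - 2, -1, -1):
        msg = alpha @ (local[i + 1] * future[i + 1])
        future[i] = msg / msg.sum()

    # local * future is lambda_{X_i} up to a constant factor.
    post = pi * local * future
    return post / post.sum(axis=1, keepdims=True)


# Hypothetical model: two hidden states, three symbols, y_2 is missing.
alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
delta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
prior_x1 = np.array([0.5, 0.5])
smoothed = smooth_posteriors(alpha, delta, [0, None, 2], prior_x1)
```

For the last timestep the smoothed and filtered posteriors coincide, because no observations remain after $y_n$.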




