# Hidden Markov Models with continuous state space

* Many physical systems can be modelled with [Markov chains with continuous state space](05_markov_chains_with_continous_state_space.ipynb).
* Unfortunately, the internal state of the system is not directly observable.
* By adding a model for the measurement procedure we get a Hidden Markov model.


In [1]:
%config IPCompleter.greedy=True

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn
import string

from pandas import Series
from pandas import DataFrame
from typing import List,Tuple

from pandas import Categorical
from pandas.api.types import CategoricalDtype

from tqdm import tnrange#, tqdm_notebook
from sklearn.linear_model import LogisticRegression
from plotnine import *

# Local imports
from common import *
from convenience import *

from scipy.stats import norm
from scipy.stats import binom
from scipy.stats import multivariate_normal


## I. Hidden Markov models with a real-valued state

Let us consider a simple Hidden Markov model

\begin{align*}
x_{i+1}&=ax_i+ w_i\\
y_i&=cx_i+v_i 
\end{align*}

where $y_i$ is the observable quantity and $x_i$ is the hidden system state.
* Process noise $w_i$, which absorbs unmodelled effects such as nonlinearity, is modelled by a normal distribution $\mathcal{N}(0, \rho_i)$.
* Measurement noise $v_i$ is modelled by a normal distribution $\mathcal{N}(0, \tau_i)$.
* Quantities $x_0, w_1, \ldots, w_n, v_1, \ldots, v_n$ are assumed to be independent.
* The initial state $x_0$ is fixed or distributed according to the normal distribution $\mathcal{N}(\mu_0, \sigma_0)$. 
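
To make the model concrete, the following minimal sketch samples one trajectory from it. The function name `simulate_hmm` and its arguments are illustrative and not part of the course code. It assumes that $\rho_i$, $\tau_i$ and $\sigma_0$ are standard deviations that do not change over time, and that observations start at $y_1$.

```python
import numpy as np

def simulate_hmm(n, a, c, rho, tau, mu0=0.0, sigma0=1.0, seed=None):
    """Sample x_0..x_n and y_1..y_n from the linear Gaussian model.

    Assumption: rho, tau and sigma0 are standard deviations and are
    the same for every time step i.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    y = np.full(n + 1, np.nan)  # y[0] stays NaN: x_0 is never observed
    x[0] = rng.normal(mu0, sigma0)
    for i in range(n):
        x[i + 1] = a * x[i] + rng.normal(0.0, rho)      # hidden state update
        y[i + 1] = c * x[i + 1] + rng.normal(0.0, tau)  # noisy measurement
    return x, y

x, y = simulate_hmm(n=5, a=0.9, c=1.0, rho=1.0, tau=0.1, seed=0)
```

Setting `sigma0=0.0` corresponds to a fixed initial state $x_0=\mu_0$.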

### Standard questions 


Similarly to Hidden Markov models with a discrete state space, we can ask the following questions:

* What is the distribution of $x_i$ given observations $y_1,\ldots, y_{i-1}$?

* What is the distribution of $x_i$ given observations $y_{i+1},\ldots, y_{n}$?

* What is the distribution of $x_i$ given observations $y_{1},\ldots, y_{n}$?

To answer these questions we need to reason about prior, filtering probability, likelihood and marginal posterior

\begin{align*}
\pi[x_i]&=p[x_i|y_1,\ldots, y_{i-1}]\enspace\\
f[x_i]&=p[x_i|y_1,\ldots, y_{i-1}, y_i]\\
\lambda[x_i]&=p[y_{i+1},\ldots, y_n|x_i]\enspace\\
p[x_i]&=p[x_i|y_1,\ldots, y_n]\propto \pi[x_i]\cdot p[y_i|x_i]\cdot \lambda[x_i] \enspace.
\end{align*}

and the corresponding belief propagation formulae. Similarly to Markov chains, these formulae can quickly spiral out of control for arbitrary error distributions. Fortunately, one can show that if all errors $w_i$ and $v_i$ come from a normal distribution, then priors, likelihoods and posteriors are also normal distributions, and thus we can find the corresponding parameters by using moment matching and other analytical matching techniques; see the tutorial [normal_distributions_and_belief_propagation.ipynb](./tutorials/normal_distributions_and_belief_propagation.ipynb).
The analogous result also holds for multivariate states $\boldsymbol{x}_i$ and observations $\boldsymbol{y}_i$ as long as the update rules are linear and the error terms $\boldsymbol{w}_i$ and $\boldsymbol{v}_i$ follow multivariate normal distributions. This closure property forms the technical backbone of Kalman filters.

## II. Conditional normal distribution as information fusion gate

Belief propagation in [Markov chains with normal distribution](05_markov_chains_with_continous_state_space.ipynb) was straightforward until we had to multiply two normal distributions to obtain marginal probability. 
For Hidden Markov Models, the same situation occurs in each belief propagation step.
Hence, we consider this step separately.


**Theorem.** Let $\xi_1\sim\mathcal{N}(\mu_1,\sigma_1)$ and $\xi_2\sim\mathcal{N}(\mu_2, \sigma_2)$ be independent random variables. Then the conditional distribution $p[\xi_1|\xi_1=\xi_2]$ is also a normal distribution $\mathcal{N}(\mu,\sigma)$ where

\begin{align*}
\mu&=\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_2^2+\sigma_1^2}\\
\sigma^2&=\frac{\sigma_2^2\sigma_1^2}{\sigma_2^2+\sigma_1^2}\enspace.\\
\end{align*}

**Proof.** Bayes formula allows us to express

\begin{align*}
p[\xi_1=x|\xi_1=\xi_2]&=\frac{p[\xi_1=\xi_2|\xi_1=x]\cdot p[\xi_1=x]}{p[\xi_1=\xi_2]}\\
\end{align*}

The denominator is a complex integral that does not depend on $x$, and the first term in the numerator simplifies because $\xi_1$ and $\xi_2$ are independent:

\begin{align*}
p[\xi_1=x|\xi_1=\xi_2]&\propto p[\xi_2=x]\cdot p[\xi_1=x]\enspace.
\end{align*}

By substituting the actual formulae of density functions we get

\begin{align*}
p[\xi_1=x|\xi_1=\xi_2]&\propto\exp\biggl(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\biggr)\cdot\exp\biggl(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\biggr)\\
&\propto\exp\biggl(-\frac{\sigma_2^2(x-\mu_1)^2+\sigma_1^2(x-\mu_2)^2}{2\sigma_2^2\sigma_1^2}\biggr)\\
&\propto\exp\biggl(-\frac{(\sigma_2^2+\sigma_1^2)x^2-2(\sigma_2^2\mu_1+\sigma_1^2\mu_2)x}{2\sigma_2^2\sigma_1^2}\biggr)\\
&\propto\exp\Biggl(-\frac{\sigma_2^2+\sigma_1^2}{2\sigma_2^2\sigma_1^2}\cdot\Bigl(x^2-2\cdot\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_2^2+\sigma_1^2}\cdot x\Bigr)\Biggr)\\
\end{align*}

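Completing the square in $x$ inside the brackets changes the expression only by a factor that does not depend on $x$. Hence, the result is proportional to the density function of a normal distribution with mean $\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_2^2+\sigma_1^2}$ and variance $\frac{\sigma_2^2\sigma_1^2}{\sigma_2^2+\sigma_1^2}$, as stated in the theorem.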
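
The univariate fusion rule is easy to check numerically. The sketch below uses a hypothetical helper `fuse_normals` (not part of the course code) that takes and returns standard deviations, and compares its output with the normalised product of the two densities on a grid.

```python
import numpy as np
from scipy.stats import norm

def fuse_normals(mu1, sigma1, mu2, sigma2):
    """Fuse N(mu1, sigma1) and N(mu2, sigma2); sigmas are standard deviations."""
    var1, var2 = sigma1**2, sigma2**2
    mu = (var2 * mu1 + var1 * mu2) / (var1 + var2)
    var = var1 * var2 / (var1 + var2)
    return mu, np.sqrt(var)

mu, sigma = fuse_normals(0.0, 2.0, 3.0, 1.0)

# Numerical check: normalise the product of the two densities on a fine grid
grid = np.linspace(-10.0, 10.0, 20001)
step = grid[1] - grid[0]
product = norm.pdf(grid, 0.0, 2.0) * norm.pdf(grid, 3.0, 1.0)
product /= product.sum() * step
assert np.allclose(product, norm.pdf(grid, mu, sigma), atol=1e-8)
```

Note that the fused mean lies closer to the mean of the more precise input, and the fused variance is smaller than either input variance.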





**Theorem.** Let $\boldsymbol{\xi}_1\sim\mathcal{N}(\boldsymbol{\mu}_1,\Sigma_1)$ and $\boldsymbol{\xi}_2\sim\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ be independent random variables with the same number of components. Then the conditional distribution $p[\boldsymbol{\xi}_1|\boldsymbol{\xi}_1=\boldsymbol{\xi}_2]$ is also a multivariate normal distribution $\mathcal{N}(\boldsymbol{\mu},\Sigma)$ where

\begin{align*}
\boldsymbol{\mu}&=\bigl(\Sigma_1^{-1}+\Sigma_2^{-1}\bigr)^{-1}\cdot\bigl(\Sigma_1^{-1}\boldsymbol{\mu}_1+\Sigma_2^{-1}\boldsymbol{\mu}_2\bigr)\\
\Sigma&=\bigl(\Sigma_1^{-1}+\Sigma_2^{-1}\bigr)^{-1}.\\
\end{align*}

**Proof.** Analogously to the univariate case we can express

\begin{align*}
p[\boldsymbol{\xi}_1=\boldsymbol{x}|\boldsymbol{\xi}_1=\boldsymbol{\xi}_2]&\propto p[\boldsymbol{\xi}_2=\boldsymbol{x}]\cdot p[\boldsymbol{\xi}_1=\boldsymbol{x}]\enspace.
\end{align*}

By substituting the actual formulae of density functions we get

\begin{align*}
p[\boldsymbol{\xi}_1=\boldsymbol{x}|\boldsymbol{\xi}_1=\boldsymbol{\xi}_2]
&\propto\exp\biggl(-\frac{1}{2}\cdot(\boldsymbol{x}-\boldsymbol{\mu}_1)^T\Sigma_1^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_1)\biggr)\cdot
\exp\biggl(-\frac{1}{2}\cdot(\boldsymbol{x}-\boldsymbol{\mu}_2)^T\Sigma_2^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_2)\biggr)\\
& \propto\exp\biggl(-\frac{1}{2}\cdot\boldsymbol{x}^T\bigl(\Sigma_1^{-1}+\Sigma_2^{-1}\bigr)\boldsymbol{x}
+\boldsymbol{x}^T\bigl(\Sigma_1^{-1}\boldsymbol{\mu}_1+\Sigma_2^{-1}\boldsymbol{\mu}_2\bigr)\biggr)\enspace.\\
\end{align*}

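The exponent is again a quadratic form in $\boldsymbol{x}$, and matching it against the exponent $-\frac{1}{2}\boldsymbol{x}^T\Sigma^{-1}\boldsymbol{x}+\boldsymbol{x}^T\Sigma^{-1}\boldsymbol{\mu}$ of a multivariate normal density provides
the formulae stated in the theorem. Note that for one-dimensional variables these formulae reduce to the ones derived for the univariate case.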
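
A minimal sketch of the multivariate fusion gate follows. The helper `fuse_mvn` is hypothetical, and it treats means as column vectors, so the fused mean is computed as $\Sigma\bigl(\Sigma_1^{-1}\boldsymbol{\mu}_1+\Sigma_2^{-1}\boldsymbol{\mu}_2\bigr)$. Explicit matrix inverses are used for readability only; a production implementation would more likely solve linear systems instead.

```python
import numpy as np

def fuse_mvn(mu1, Sigma1, mu2, Sigma2):
    """Fuse two multivariate normals given by mean vectors and covariance matrices."""
    P1 = np.linalg.inv(Sigma1)           # precision matrix of the first input
    P2 = np.linalg.inv(Sigma2)           # precision matrix of the second input
    Sigma = np.linalg.inv(P1 + P2)       # precisions add up
    mu = Sigma @ (P1 @ mu1 + P2 @ mu2)   # precision-weighted mean
    return mu, Sigma

mu1, Sigma1 = np.array([0.0, 0.0]), np.diag([1.0, 4.0])
mu2, Sigma2 = np.array([2.0, 2.0]), np.diag([1.0, 1.0])
mu, Sigma = fuse_mvn(mu1, Sigma1, mu2, Sigma2)
```

With diagonal covariance matrices, as in this example, each coordinate is fused independently by the univariate rule.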




## III. Filtering aka Kalman filter

<img src = 'illustrations/kalman-filter.png' width=100%>


Recall that our goal here is to compute $f[x_i]=p[x_i|y_1,\ldots, y_{i}]$ where we have two equations to satisfy

\begin{align*}
x_{i}&=ax_{i-1}+w_{i-1} \tag{E1}\\
y_{i}&=cx_{i}+v_{i}\tag{E2}
\end{align*}

Assume that the conditional distribution of $x_{i-1}$ given $y_{1},\ldots, y_{i-1}$ is a normal distribution $\mathcal{N}(\mu_*,\sigma_*)$.
Then the prior $\pi[x_i]=p[x_i|y_1,\ldots, y_{i-1}]$ is also a normal distribution $\mathcal{N}(\mu_1,\sigma_1)$, since

\begin{align*}
x_{i}&=ax_{i-1}+w_{i-1}
\end{align*}

with parameters

\begin{align*}
\mu_1&=a\mu_*\\
\sigma_1^2&=a^2\sigma_*^2+\rho_{i-1}^2\enspace.
\end{align*}

The second equation (E2) allows us to conclude that the likelihood $p[y_i|x_i]$ as a function of $x_i$ is a normal distribution by using analytical manipulation, which is equivalent to reversing the causal direction of (E2):

\begin{align*}
x_i=\frac{y_i-v_i}{c}=\frac{y_i}{c}-\frac{v_i}{c}\enspace.
\end{align*}

From this expression we can read out that the likelihood $p[y_i|x_i]$ as a function of $x_i$ is a normal distribution $\mathcal{N}(\mu_2,\sigma_2)$ with parameters

\begin{align*}
\mu_2&=\frac{y_i}{c}\\
\sigma_2^2&=\frac{\tau_i^2}{c^2}\enspace.
\end{align*}


Thus we have obtained two independent characterisations of $x_i$ which must give the same answer, i.e. we have the situation $\xi_1\sim\mathcal{N}(\mu_1,\sigma_1)$ and $\xi_2\sim\mathcal{N}(\mu_2,\sigma_2)$ with the condition $\xi_1=x_i=\xi_2$.
According to the information fusion theorem, we get that $x_i\sim\mathcal{N}(\mu,\sigma)$ with parameters

\begin{align*}
\mu&=\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_2^2+\sigma_1^2}\\
\sigma^2&=\frac{\sigma_2^2\sigma_1^2}{\sigma_2^2+\sigma_1^2}\enspace.\\
\end{align*}
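
The three stages above (prediction from (E1), likelihood from (E2), fusion) can be combined into a single update. The sketch below is one possible scalar implementation, assuming $c\neq 0$ and that all spread parameters are standard deviations; the function name `kalman_step` is illustrative. As a sanity check, the result is compared with the commonly used Kalman gain form of the same update.

```python
import math

def kalman_step(mu_prev, sigma_prev, y, a, c, rho, tau):
    """One filtering step: N(mu_prev, sigma_prev) for x_{i-1} -> (mu, sigma) for x_i."""
    # Prediction step from (E1): prior of x_i
    mu1 = a * mu_prev
    var1 = a**2 * sigma_prev**2 + rho**2
    # Likelihood of y_i viewed as a function of x_i, from (E2)
    mu2 = y / c
    var2 = tau**2 / c**2
    # Fusion gate
    mu = (var2 * mu1 + var1 * mu2) / (var1 + var2)
    var = var1 * var2 / (var1 + var2)
    return mu, math.sqrt(var)

mu, sigma = kalman_step(mu_prev=0.0, sigma_prev=1.0, y=1.2, a=0.9, c=1.0, rho=1.0, tau=0.1)

# Cross-check against the Kalman gain form of the update
var1 = 0.9**2 * 1.0**2 + 1.0**2
gain = var1 * 1.0 / (1.0**2 * var1 + 0.1**2)
assert math.isclose(mu, 0.0 + gain * (1.2 - 1.0 * 0.0))
assert math.isclose(sigma**2, (1.0 - gain * 1.0) * var1)
```

Because the measurement noise in this example is much smaller than the prior spread, the filtered mean ends up very close to the observation.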

## IV. Likelihood propagation and smoothing

<img src = 'illustrations/reverse-hidden-markov-model.png' width=100%>


Analogously to Markov chains we can show that the likelihood $\lambda[x_{i}]=p[y_{i+1},\ldots,y_n|x_{i}]$ is proportional to the prior of the reverse chain $\pi^*[x_{i}]=p[x_{i}|y_{i+1},\ldots,y_n]$. By reversing the direction

\begin{align*}
x_{i}&=\frac{x_{i+1}}{a}-\frac{w_i}{a}
\end{align*}

we can use the equations derived in the filtering phase to establish that $\pi^*[x_{i}]=p[x_{i}|y_{i+1},\ldots,y_n]$ is a normal distribution $\mathcal{N}(\mu_*, \sigma_*)$. 
The only difference is in the boundary condition. For an ordinary Hidden Markov Model we know the initial state $x_0$. For the reversed Hidden Markov Model, the final observation $y_n$ plays the role of the initial state, and thus the equations for $f^*[x_n]$ differ from the other equations.
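
As an illustration of the reversed update, the sketch below propagates a normal-shaped belief about $x_{i+1}$ one step backwards through $x_i=\frac{x_{i+1}}{a}-\frac{w_i}{a}$. It assumes $a\neq 0$, treats the belief about $x_{i+1}$ as independent of $w_i$, and uses standard deviations throughout. The name `reverse_predict` is hypothetical, and this is only one reading of the backward step, not a complete reverse filter.

```python
import math

def reverse_predict(mu_next, sigma_next, a, rho):
    """Map a belief N(mu_next, sigma_next) about x_{i+1} to a belief about x_i."""
    mu = mu_next / a                           # mean of x_{i+1}/a - w_i/a
    var = (sigma_next**2 + rho**2) / a**2      # variances add, then scale by 1/a^2
    return mu, math.sqrt(var)

mu_rev, sigma_rev = reverse_predict(mu_next=1.8, sigma_next=0.6, a=0.9, rho=0.8)
```

For $|a|<1$ the backward step inflates the variance by the factor $1/a^2$, in contrast to the forward prediction, which shrinks the contribution of the previous state.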

The final marginal posterior can be computed using the fusion gate twice to manipulate the product of three densities of different normal distributions.  
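
The following sketch illustrates the double use of the fusion gate on three normal-shaped factors. The three parameter pairs are made-up numbers standing in for the prior, the measurement likelihood and the likelihood of future observations. The helper `fuse` is hypothetical and works with standard deviations. The check confirms that the order of fusion does not matter and that the result agrees with summing precisions directly.

```python
import math

def fuse(mu1, sigma1, mu2, sigma2):
    """Univariate fusion gate; sigmas are standard deviations."""
    var1, var2 = sigma1**2, sigma2**2
    mu = (var2 * mu1 + var1 * mu2) / (var1 + var2)
    return mu, math.sqrt(var1 * var2 / (var1 + var2))

prior = (0.0, 2.0)        # illustrative stand-in for pi[x_i]
measurement = (1.0, 0.5)  # illustrative stand-in for p[y_i|x_i] as a function of x_i
future = (1.5, 1.0)       # illustrative stand-in for lambda[x_i]

# Fuse (prior, measurement) first, then add the future observations
mu, sigma = fuse(*fuse(*prior, *measurement), *future)
# Fuse (measurement, future) first, then add the prior
mu_alt, sigma_alt = fuse(*prior, *fuse(*measurement, *future))
assert math.isclose(mu, mu_alt) and math.isclose(sigma, sigma_alt)

# Equivalent precision form: precisions and precision-weighted means add up
precisions = [1.0 / s**2 for _, s in (prior, measurement, future)]
weighted = [m / s**2 for m, s in (prior, measurement, future)]
assert math.isclose(sigma**2, 1.0 / sum(precisions))
assert math.isclose(mu, sum(weighted) / sum(precisions))
```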

# Homework

## 6.1 Implementation of Kalman filter (<font color="red">3p</font>)

Generate 100-200 data points according to the model

\begin{align*}
x_{i+1}&=0.9x_i+w_{i},& w_i&\sim\mathcal{N}(0,1.0)\\
y_i&=x_i+v_i,& v_i&\sim\mathcal{N}(0,0.1)\enspace
\end{align*}

Implement a Kalman filter for the model and visualise the maximum a posteriori estimate and symmetric 90% credible intervals for the model (<font color="red">1p</font>). Implement a reverse Kalman filter and visualise the maximum a posteriori estimate and symmetric 90% credible intervals for the model (<font color="red">1p</font>). Implement a Kalman smoother and visualise the maximum a posteriori estimate and symmetric 90% credible intervals for the model (<font color="red">1p</font>). 

## 6.2 Sensor fusion with Kalman filter* (<font color="red">5p</font>)

[The Zurich Urban Micro Aerial Vehicle Dataset](http://rpg.ifi.uzh.ch/zurichmavdataset.html) consists of time synchronized aerial high-resolution images, GPS and IMU sensor data, ground-level street view images, and ground truth data. Based on this data you can evaluate the measurement errors for the GPS data.
As the GPS data contains velocity and the accelerometer data contains acceleration, you can build a high-dimensional linear model for the state of the drone. After that you can apply a Kalman filter and smoother to reconstruct the route.

You can use the package [filterpy](https://anaconda.org/conda-forge/filterpy) and its object [KalmanFilter](https://filterpy.readthedocs.io/en/latest/kalman/KalmanFilter.html) to do the heavy lifting. In particular, note that this implementation allows you to treat acceleration and speed variables as external control signals. Thus you do not have to learn how to predict them but can just use them as extra inputs.
