# Markov chains with continuous state space

* The dynamics of physical systems are particularly important in robotics.
* The dynamics of most physical systems are determined by differential equations.
* Differential equations are also used to describe other dynamically evolving systems.
* These equations are usually solved with numerical methods that approximate differentials.
* These approximations lead to linear update rules for the next state

  \begin{align*}
  \boldsymbol{x}_{i+1}=A\boldsymbol{x}_i+\boldsymbol{w}_i 
  \end{align*}

  where $\boldsymbol{x}_{i+1}$ is the actual outcome, $A\boldsymbol{x}_i$ is an approximation and $\boldsymbol{w}_i$ is an unknown error term.
* A numerical approximation method is good if the average error $\mathbf{E}(\boldsymbol{w}_i)$ is near zero.
* Otherwise the numerical approximation method has a systematic bias that should be removed.
* Such a system can be viewed as a Markov chain if we additionally assume that the errors $\boldsymbol{w}_i$ are independent.
* Note that the state space for $\boldsymbol{x}_i$ is the continuous vector space $\mathbb{R}^\ell$.
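
As an illustration of how a numerical method produces a linear update rule, the following minimal sketch applies the explicit Euler method to a damped pendulum linearised around its rest position. The small-angle model, the parameter values and the names `g_over_l`, `damping` and `h` are assumptions made only for this example.

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
g_over_l = 9.81   # gravity divided by pendulum length (1/s^2)
damping = 0.5     # friction coefficient (1/s)
h = 0.01          # step size of the Euler method (s)

# Continuous dynamics: d/dt [theta, omega] = M @ [theta, omega]
M = np.array([[0.0, 1.0],
              [-g_over_l, -damping]])

# Explicit Euler step: x_{i+1} = x_i + h * M @ x_i = A @ x_i
A = np.eye(2) + h * M

x = np.array([0.1, 0.0])   # initial angle 0.1 rad, zero angular velocity
for _ in range(100):
    x = A @ x              # deterministic part of the update rule

print(A)
print(x)
```

The difference between the true state and `A @ x` at each step plays the role of the error term $\boldsymbol{w}_i$.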


In [2]:
%config IPCompleter.greedy=True

In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn
import string

from pandas import Series
from pandas import DataFrame
from typing import List,Tuple

from pandas import Categorical
from pandas.api.types import CategoricalDtype

from tqdm import tnrange#, tqdm_notebook
from sklearn.linear_model import LogisticRegression
from plotnine import *

# Local imports
from common import *
from convenience import *

from scipy.stats import norm
from scipy.stats import binom
from scipy.stats import multivariate_normal

## I. Markov chains with a real-valued state


Let us consider a swinging pendulum and let $x_i$ be its height at the $i$-th iteration:
* We start by releasing the pendulum at the height $x_0$.
* Then it drops and rises again; $x_1$ is its highest position, after which it starts to drop again.

### System dynamics

Due to friction the pendulum loses energy. We model this through the following equation

\begin{align*}
x_{i+1}=a x_{i} + w_i 
\end{align*}

where 
* the coefficient $a$ determines how fast the system loses energy
* the error term $w_i$ captures the impact of other forces like wind.

We model the effect of other forces as follows:
* We assume that all error terms $w_i$ are independent.
* We assume that each error term $w_i$ is distributed according to $\mathcal{N}(0,\sigma_i)$, where the second parameter denotes the standard deviation.
* We assume that the initial position $x_0$ is fixed.

Then the description of the system is complete. It is a Markov chain.
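
A single run of this chain can be simulated directly from the update rule. The sketch below is a minimal illustration that assumes a constant standard deviation `sigma` for all error terms; the function name `simulate_chain` and the parameter values are hypothetical.

```python
import numpy as np

def simulate_chain(x0, a, sigma, n, rng):
    """Sample x_0, ..., x_n from x_{i+1} = a * x_i + w_i with w_i ~ N(0, sigma)."""
    x = np.empty(n + 1)
    x[0] = x0                      # fixed initial position
    for i in range(n):
        # independent normal error with standard deviation sigma
        x[i + 1] = a * x[i] + rng.normal(0.0, sigma)
    return x

rng = np.random.default_rng(0)
path = simulate_chain(x0=10.0, a=0.9, sigma=1.0, n=20, rng=rng)
print(path)
```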

### Standard questions

Similarly to Markov chains with a discrete state space, we can ask the following questions:
* What is the position of the pendulum at the $i$-th iteration if we know only the initial state $x_0$?

* What is the position of the pendulum at the $i$-th iteration if we know only the final state $x_n$?

* What is the position of the pendulum at the $i$-th iteration if we know  both states $x_0$ and $x_n$?


To answer the first question we need to compute the prior

\begin{align*}
\pi_{X_i}(x_i)=p[x_i|x_0]\enspace.
\end{align*}

To answer the second question we need to compute the likelihood

\begin{align*}
\lambda_{X_i}(x_i)=p[x_n|x_i]\enspace.
\end{align*}

To answer the third question we need to compute the marginal posterior

\begin{align*}
p_{X_i}(x_i)=p[x_i|x_0,x_n]\propto \pi_{X_i}(x_i)\cdot \lambda_{X_i}(x_i) \enspace.
\end{align*}

## II. Prior propagation

There are two ways to derive expressions for the prior function:

* direct application of [prior propagation rules](../03/belief_propagation_in_a_tree.ipynb) 
* use of linear expressions that separate the effect of state from random error components.  

Both approaches lead to the same end result, but the first approach is very technical and leads to many integrals.
Therefore, we pursue the second approach and iteratively express

\begin{align*}
x_{i}=\mu_{i}+\varepsilon_{i},\qquad \varepsilon_i\sim\mathcal{N}(0,\rho_{i})\enspace.
\end{align*}

The latter allows us to escape technicalities and use only the fact that a linear combination of independent normal random variables is also normally distributed.

### Closure under linear combinations

A linear combination $v=\alpha_1 u_1+\alpha_2 u_2+\cdots+\alpha_n u_n$ of independent univariate normal random variables

\begin{align*}
u_1&\sim\mathcal{N}(\mu_1,\sigma_1)\\
u_2&\sim\mathcal{N}(\mu_2,\sigma_2)\\
\cdots&\sim\cdots\\
u_n&\sim\mathcal{N}(\mu_n,\sigma_n)
\end{align*}

also has a normal distribution $\mathcal{N}(\mu, \sigma)$ with parameters

\begin{align*}
\mu&=\alpha_1\mu_1+\alpha_2\mu_2+\cdots+\alpha_n\mu_n\\
\sigma^2&=\alpha_1^2\sigma_1^2+\alpha_2^2\sigma_2^2+\cdots+\alpha_n^2\sigma_n^2\enspace.
\end{align*}

**Justification for parameter expressions.**
If $v$ has a normal distribution, we can find its parameters through moment matching.
Linearity of mathematical expectation gives

\begin{align*}
\mu
&=\mathbf{E}(\alpha_1 u_1+\alpha_2 u_2+\cdots+\alpha_n u_n)\\
&=\mathbf{E}(\alpha_1 u_1)+ \mathbf{E}(\alpha_2 u_2)+\cdots+\mathbf{E}(\alpha_n u_n)\\
&=\alpha_1\mu_1+\alpha_2\mu_2+\cdots+\alpha_n\mu_n\enspace.
\end{align*}

The independence assumption together with the properties of variance gives

\begin{align*}
\sigma^2
&=\mathbf{D}(\alpha_1 u_1+\alpha_2 u_2+\cdots+\alpha_n u_n)\\
&=\mathbf{D}(\alpha_1 u_1)+ \mathbf{D}(\alpha_2 u_2)+\cdots+\mathbf{D}(\alpha_n u_n)\\
&=\alpha_1^2\sigma_1^2+\alpha_2^2\sigma_2^2+\cdots+\alpha_n^2\sigma_n^2\enspace.
\end{align*}
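
The parameter expressions can be checked empirically. The following sketch draws independent normal samples, forms the linear combination and compares the sample moments with the formulas; the coefficients, means and standard deviations are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative values
alphas = np.array([2.0, -1.0, 0.5])
mus = np.array([1.0, 0.0, -3.0])
sigmas = np.array([1.0, 2.0, 0.5])   # standard deviations

# Each row holds one independent draw of (u_1, u_2, u_3)
u = rng.normal(mus, sigmas, size=(200_000, 3))
v = u @ alphas                        # linear combination per row

mu_theory = alphas @ mus
sigma_theory = np.sqrt(np.sum(alphas**2 * sigmas**2))

print(v.mean(), mu_theory)
print(v.std(), sigma_theory)
```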

### Base

By definition of the update rule we can write $x_1=a x_0 + w_0$ for $w_0\sim\mathcal{N}(0, \sigma_0)$ and thus $\mu_1=ax_0$ and $\rho_1=\sigma_0$.

### General induction step

Assume that $p[x_{i-1}|x_{0}]$ is a normal distribution $\mathcal{N}(\mu_{i-1},\rho_{i-1})$ and we can express 

\begin{align*}
x_{i-1}=\mu_{i-1}+\varepsilon_{i-1}, \qquad \varepsilon_{i-1}\sim\mathcal{N}(0,\rho_{i-1})\enspace.
\end{align*}

As $x_{i}=ax_{i-1}+w_{i-1}$ we get

\begin{align*}
x_{i}=a(\mu_{i-1}+\varepsilon_{i-1})+w_{i-1}=a\mu_{i-1}+a\varepsilon_{i-1}+w_{i-1}
\end{align*}

and thus we can define

\begin{align*}
\mu_{i}&=a \mu_{i-1}\\
\varepsilon_{i} &= a\varepsilon_{i-1} +w_{i-1}
\end{align*}

to preserve the induction hypothesis. Again, $\varepsilon_i$ is a linear combination of independent normal random variables and thus must have a normal distribution $\mathcal{N}(0, \rho_i)$.
Moment matching yields

\begin{align*}
\rho_i^2=a^2\rho_{i-1}^2+\sigma_{i-1}^2\enspace.
\end{align*}
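
The induction can be turned into a short forward recursion. The sketch below is one possible implementation, assuming that `sigmas[i]` holds the standard deviation of $w_i$ (the noise that enters $x_{i+1}$) and that $x_0$ is fixed, so the recursion starts from $\mu_0=x_0$ and $\rho_0=0$. The function name `prior_parameters` is hypothetical.

```python
import numpy as np

def prior_parameters(x0, a, sigmas):
    """Return arrays mu[0..n] and rho[0..n] for the prior p[x_i | x_0]."""
    n = len(sigmas)
    mu = np.empty(n + 1)
    rho = np.empty(n + 1)
    mu[0], rho[0] = x0, 0.0            # fixed initial state has no spread
    for i in range(1, n + 1):
        mu[i] = a * mu[i - 1]
        # variance of a * eps_{i-1} + w_{i-1} for independent terms
        rho[i] = np.sqrt(a**2 * rho[i - 1]**2 + sigmas[i - 1]**2)
    return mu, rho

mu, rho = prior_parameters(x0=10.0, a=0.9, sigmas=np.ones(5))
print(mu)
print(rho)
```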

### Practical example 

Let us consider the equation $x_{i+1}=0.9\cdot x_{i}+w_i$ for $w_i\sim\mathcal{N}(0, 1)$ and $x_0\sim\mathcal{N}(10, 1)$. Since $x_0$ is random here, the recursion starts from $\mu_0=10$ and $\rho_0=1$. Let us sample 1000 parallel runs and compare the theoretical prior distributions with the simulated ones.
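
A minimal sketch of this experiment is given below. It compares sample means and standard deviations with the theoretical parameters instead of plotting; the number of iterations `n_steps` and the random seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
a, sigma, n_steps, n_runs = 0.9, 1.0, 10, 1000

# Simulate all runs in parallel: rows are iterations, columns are runs
x = np.empty((n_steps + 1, n_runs))
x[0] = rng.normal(10.0, 1.0, size=n_runs)
for i in range(n_steps):
    x[i + 1] = a * x[i] + rng.normal(0.0, sigma, size=n_runs)

# Theoretical parameters, starting from mu_0 = 10 and rho_0 = 1
mu = np.empty(n_steps + 1)
rho = np.empty(n_steps + 1)
mu[0], rho[0] = 10.0, 1.0
for i in range(1, n_steps + 1):
    mu[i] = a * mu[i - 1]
    rho[i] = np.sqrt(a**2 * rho[i - 1]**2 + sigma**2)

for i in (1, 5, 10):
    print(i, mu[i], x[i].mean(), rho[i], x[i].std())
```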

## III. Likelihood propagation

There are two ways to derive expressions for the likelihood function:

* direct application of likelihood propagation rules 
* use of linear expression that separates the effect of state from random error components.  

Both approaches lead to the same end result, but the first approach is very technical and leads to many integrals.
Therefore, we pursue the second approach and iteratively express

\begin{align*}
x_{n}=\alpha_{i}x_{i}+\varepsilon_{i},\qquad \varepsilon_i\sim\mathcal{N}(0,\delta_{i})\enspace.
\end{align*}

Note that the mean value is expressed in terms of $x_i$ to make the induction step tractable. 

### Base

By definition $x_{n}=ax_{n-1}+ w_{n-1}$ for $w_{n-1}\sim\mathcal{N}(0,\sigma_{n-1})$ and thus trivially

\begin{align*}
x_{n}=\alpha_{n-1}x_{n-1}+\varepsilon_{n-1},\qquad \varepsilon_{n-1}\sim\mathcal{N}(0, \delta_{n-1})
\end{align*}

for $\alpha_{n-1}=a$ and $\delta_{n-1}=\sigma_{n-1}$.

### General induction step

Assume that $p[x_n|x_{i+1}]$ is a normal distribution $\mathcal{N}(\alpha_{i+1}x_{i+1},\delta_{i+1})$ and we can express 

\begin{align*}
x_n= \alpha_{i+1}x_{i+1}+\varepsilon_{i+1},\qquad \varepsilon_{i+1}\sim\mathcal{N}(0, \delta_{i+1})
\end{align*}

As $x_{i+1}=ax_i+w_i$ we can express

\begin{align*}
x_n 
= \alpha_{i+1}(ax_i+w_i)+\varepsilon_{i+1}
= \alpha_{i+1} ax_i + \alpha_{i+1}w_i +\varepsilon_{i+1}
\end{align*}

and thus we can define

\begin{align*}
\alpha_{i}&=\alpha_{i+1}a\\
\varepsilon_{i} &= \alpha_{i+1}w_i +\varepsilon_{i+1}\enspace
\end{align*}

to preserve the induction hypothesis. As $\varepsilon_i$ is a linear combination of independent normal random variables, it must have a normal distribution $\mathcal{N}(0, \delta_i)$.
Moment matching yields 

\begin{align*}
\delta_i^2=\alpha_{i+1}^2\sigma_i^2+ \delta_{i+1}^2\enspace.
\end{align*}

Therefore we get

\begin{align*}
p[x_n|x_{i}]
\propto\exp\biggl(-\frac{(x_n-\alpha_{i}x_{i})^2}{2\delta_{i}^2}\biggr)\enspace.
\end{align*}
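
The likelihood parameters can be computed with a backward recursion that mirrors the induction. The sketch below assumes that `sigmas[i]` is the standard deviation of $w_i$ and that the base case uses $\alpha_{n-1}=a$ together with the standard deviation of $w_{n-1}$. The function name `likelihood_parameters` is hypothetical.

```python
import numpy as np

def likelihood_parameters(a, sigmas):
    """Return alpha[i], delta[i] for i = 0..n-1 such that
    x_n = alpha[i] * x_i + eps_i with eps_i ~ N(0, delta[i])."""
    n = len(sigmas)
    alpha = np.empty(n)
    delta = np.empty(n)
    alpha[n - 1], delta[n - 1] = a, sigmas[n - 1]      # base case
    for i in range(n - 2, -1, -1):
        alpha[i] = alpha[i + 1] * a
        # variance of alpha_{i+1} * w_i + eps_{i+1} for independent terms
        delta[i] = np.sqrt(alpha[i + 1]**2 * sigmas[i]**2 + delta[i + 1]**2)
    return alpha, delta

alpha, delta = likelihood_parameters(a=0.9, sigmas=np.ones(5))
print(alpha)
print(delta)
```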


### Practical example  continued

Note that we can sample from the conditional distribution by reversing the chain. In general, chain reversal is not always tractable, but here the additive error model makes it straightforward:

\begin{align*}
x_i=\frac{x_{i+1}-w_i}{a},\qquad w_{i}\sim\mathcal{N}(0,1)
\end{align*}

Thus we can indeed simulate the likelihood when $x_n$ is fixed.
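
The reversed chain can be sketched as follows. Viewed as a function of $x_i$, the likelihood is proportional to a normal density with mean $x_n/\alpha_i$ and standard deviation $\delta_i/\alpha_i$, so the backward samples should match these values. The final state `x_n = 3.0`, the chain length and the number of runs are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(7)
a, sigma, n, n_runs = 0.9, 1.0, 5, 100_000
x_n = 3.0   # hypothetical fixed final state

# Run the chain backwards: x_i = (x_{i+1} - w_i) / a
x = np.full(n_runs, x_n)
samples = {}
for i in range(n - 1, -1, -1):
    x = (x - rng.normal(0.0, sigma, size=n_runs)) / a
    samples[i] = x

# Likelihood parameters from the backward recursion
alpha = {n - 1: a}
delta = {n - 1: sigma}
for i in range(n - 2, -1, -1):
    alpha[i] = alpha[i + 1] * a
    delta[i] = np.sqrt(alpha[i + 1]**2 * sigma**2 + delta[i + 1]**2)

i = 2
print(samples[i].mean(), x_n / alpha[i])
print(samples[i].std(), delta[i] / alpha[i])
```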


## IV. Marginal posterior

Note that we can indeed use Bayes formula to express

\begin{align*}
p[x_i| x_0, x_n]=\frac{p[x_n|x_i, x_0]\cdot p[x_i|x_0]}{p[x_n|x_0]}
\propto p[x_i|x_0]\cdot p[x_n|x_i]\enspace.
\end{align*}

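Here the Markov property gives $p[x_n|x_i,x_0]=p[x_n|x_i]$, and the denominator does not depend on $x_i$. As the prior and likelihood are normal distributions, we get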

\begin{align*}
p[x_i| x_0, x_n]
&\propto\exp\Biggl(-\frac{(x_i-\mu_i)^2}{2\rho_i^2}\Biggr)\cdot\exp\Biggl(-\frac{(x_n-\alpha_ix_i)^2}{2\delta_i^2}\Biggr)\\
&\propto\exp\Biggl(-\frac{\delta_i^2(x_i-\mu_i)^2+ \rho_i^2(x_n-\alpha_ix_i)^2}{2\rho_i^2\delta_i^2}\Biggr)\\
&\propto\exp\Biggl(-\frac{\delta_i^2x_i^2-2\delta_i^2\mu_ix_i+ \rho_i^2\alpha_i^2x_i^2-2\rho_i^2x_n\alpha_ix_i}{2\rho_i^2\delta_i^2}\Biggr)\\
&\propto\exp\Biggl(-\frac{(\delta_i^2+\rho_i^2\alpha_i^2)x_i^2
-2(\delta_i^2\mu_i+\rho_i^2x_n\alpha_i)x_i}{2\rho_i^2\delta_i^2}\Biggr)\enspace.
\end{align*}

Completing the square in $x_i$ shows that the marginal posterior of $x_i$ given $x_0$ and $x_n$ is indeed a normal distribution $\mathcal{N}(\mu, \sigma)$ with parameters

\begin{align*}
\mu&=\frac{\delta_i^2\mu_i+\rho_i^2x_n\alpha_i}{\delta_i^2+\rho_i^2\alpha_i^2}\\
\sigma^2&= \frac{\rho_i^2\delta_i^2}{\delta_i^2+\rho_i^2\alpha_i^2}\enspace.
\end{align*}

Engineers are usually interested only in the maximum a posteriori estimate, which is located at $\mu$, and thus they completely ignore the other equation.


### Practical example continued
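
The closed-form posterior parameters can be checked numerically by normalising the product of prior and likelihood on a grid. The sketch below uses hypothetical values for $\mu_i$, $\rho_i$, $\alpha_i$, $\delta_i$ and $x_n$; the function name `posterior_parameters` is likewise an assumption.

```python
import numpy as np

def posterior_parameters(mu_i, rho_i, alpha_i, delta_i, x_n):
    """Mean and standard deviation of p[x_i | x_0, x_n]."""
    denom = delta_i**2 + rho_i**2 * alpha_i**2
    mean = (delta_i**2 * mu_i + rho_i**2 * x_n * alpha_i) / denom
    std = np.sqrt(rho_i**2 * delta_i**2 / denom)
    return mean, std

# Hypothetical prior and likelihood parameters for a single state x_i
mu_i, rho_i, alpha_i, delta_i, x_n = 8.1, 1.3, 0.7, 1.6, 5.0
mean, std = posterior_parameters(mu_i, rho_i, alpha_i, delta_i, x_n)

# Numerical check: normalise prior * likelihood on a dense grid
grid = np.linspace(-10.0, 30.0, 40001)
unnorm = (np.exp(-(grid - mu_i)**2 / (2 * rho_i**2))
          * np.exp(-(x_n - alpha_i * grid)**2 / (2 * delta_i**2)))
weights = unnorm / unnorm.sum()
grid_mean = np.sum(weights * grid)
grid_std = np.sqrt(np.sum(weights * (grid - grid_mean)**2))

print(mean, grid_mean)
print(std, grid_std)
```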


## V.  Belief propagation formulae for higher dimensional state spaces

* We can use the same linear equations to derive belief propagation formulae.
* To make the notation consistent with standard treatments, we use the traditional notation for the update rule

\begin{align*}
\boldsymbol{x}_{i+1}= A\boldsymbol{x}_i+\boldsymbol{w}_i,\qquad \boldsymbol{w}_i\sim\mathcal{N}(\boldsymbol{0}, Q_i) \enspace.
\end{align*}

* <font color="red">To be completed</font>
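
As a hedged sketch of how the scalar prior recursion would carry over to this setting, the code below propagates a mean vector and a covariance matrix through one linear step. It relies on the standard facts that $\mathbf{E}(A\boldsymbol{x})=A\,\mathbf{E}(\boldsymbol{x})$ and that the covariance of $A\boldsymbol{x}+\boldsymbol{w}$ is $A\Sigma A^T+Q$ for independent $\boldsymbol{x}$ and $\boldsymbol{w}$. The function name `propagate_prior` and the matrices are hypothetical, and this is not the completed derivation of this section.

```python
import numpy as np

def propagate_prior(mean, cov, A, Q):
    """One step of x_{i+1} = A x_i + w_i with w_i ~ N(0, Q), w_i independent of x_i."""
    return A @ mean, A @ cov @ A.T + Q

# Hypothetical two-dimensional system
A = np.array([[1.0, 0.1],
              [-0.1, 0.9]])
Q = 0.01 * np.eye(2)

mean = np.array([1.0, 0.0])   # fixed initial state
cov = np.zeros((2, 2))        # no initial uncertainty
for _ in range(3):
    mean, cov = propagate_prior(mean, cov, A, Q)

print(mean)
print(cov)
```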