import numpy as np
from matplotlib import pyplot as plt

# Dynamic Expectation Maximisation

## By André van Schaik

### __[International Centre for Neuromorphic Systems](https://westernsydney.edu.au/icns)__        10/01/2019

I made this notebook on Dynamic Expectation Maximisation (DEM) following "Hierarchical Models in the Brain", by Karl Friston, PLoS Computational Biology 4(11): e1000211. [doi:10.1371/journal.pcbi.1000211](https://doi.org/10.1371/journal.pcbi.1000211). <span style="color:red">This is very much a (slow) work in progress and as yet unfinished. It's on GitHub for my own convenience and not ready for public consumption. Read at your own peril!</span>

### Dynamic model ###

We can write a dynamic input-state-output model as:

\begin{align*}
\tilde y &= \tilde g + \tilde z \\
D \tilde x &= \tilde f + \tilde w \tag{1}
\end{align*}

where the $\tilde a$ notation indicates variables and functions in generalised coordinates of motion $\tilde a = [a, a', a'', a''', ...]^T$. Here, $a'$ is the *value* of the derivative of $a$ with respect to time; in other words, it is a dimensionless number, even though the time derivative obviously has a unit of $[s^{-1}]$. One way of interpreting this is that $a' = \tau\,da/dt$ where $\tau = dt$ and all time is measured in units of $dt$, and similarly for the higher orders of motion.

$D$ is a block-matrix derivative operator whose first superdiagonal (the diagonal just above the leading diagonal) contains identity matrices. This operator simply shifts the vectors of generalised motion, so that $a[i]$ is replaced by $a[i+1]$.
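As a minimal sketch of this shift operator, the cell below builds $D$ with a Kronecker product and applies it to a small generalised state vector. The helper name `derivative_operator`, the truncation to three orders of motion, and the example values are assumptions made for illustration only.

```python
import numpy as np

def derivative_operator(n_orders, dim):
    """Block shift matrix D: identity blocks on the first superdiagonal."""
    return np.kron(np.eye(n_orders, k=1), np.eye(dim))

# Generalised coordinates of a 2-dimensional state, truncated at 3 orders:
# x_tilde stacks [x, x', x''] into a single vector (values are made up).
x_tilde = np.array([1.0, 2.0,   # x
                    3.0, 4.0,   # x'
                    5.0, 6.0])  # x''

D = derivative_operator(n_orders=3, dim=2)
Dx = D @ x_tilde  # every order is replaced by the next one up
print(Dx)
```

Because the representation is truncated, the highest order has nothing to shift in and is replaced by zeros.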

The predicted sensor response $\tilde g$ and motion $\tilde f$ of the hidden states $\tilde x$ in absence of random fluctuations are:

\begin{align*}
\begin{split}
g &= g(x, v) \\
g' &= g_x x' + g_v v' \\
g'' &= g_x x'' + g_v v'' \\
&\phantom{g=\,} \vdots \\
\end{split}
\:\:\:
\begin{split}
f &= f(x, v) \\
f' &= f_x x' + f_v v' \\
f'' &= f_x x'' + f_v v'' \\
&\phantom{f=\,} \vdots \\
\end{split}
\end{align*}

Here, $f$ and $g$ are continuous nonlinear functions and $\tilde v$ are known causes or inputs, which can also result from actions by the agent. The notation $a_b$ is shorthand for $\partial{a}/\partial{b}$. We assume that the observation noise $\tilde z$ follows a zero-mean Gaussian distribution $p(\tilde z) = \mathcal{N}(0, \tilde \Sigma^z)$ and similarly for the state noise $p(\tilde w) = \mathcal{N}(0, \tilde \Sigma^w)$. The input drive is also Gaussian but with a mean that can be different from zero: $p(\tilde v) = \mathcal{N}(\tilde \eta^v, \tilde C^v)$, where we use $\tilde C^v$ instead of $\tilde \Sigma^v$ to indicate this is a *prior* covariance, rather than a *conditional* covariance. We can then evaluate the joint density over observations $\tilde y$, hidden states $\tilde x$, and inputs $\tilde v$:

\begin{align*}
p(\tilde y, \tilde x, \tilde v \,|\, \theta, \lambda) &= p(\tilde y \,|\, \tilde x, \tilde v, \theta, \lambda) \; p(\tilde x \,|\, \tilde v, \theta, \lambda) \; p(\tilde v) \\
p(\tilde y \,|\, \tilde x, \tilde v, \theta, \lambda) &= \mathcal{N}(\tilde y : \tilde g, \tilde \Sigma(\lambda)^z) \\
p(\tilde x \,|\, \tilde v, \theta, \lambda) &= \mathcal{N}(D\tilde x : \tilde f, \tilde \Sigma(\lambda)^w) \\
p(\tilde v) &= \mathcal{N}(\tilde v : \eta^v, C^v)
\end{align*}

where $\theta$ contains the parameters describing $f$ and $g$, and $\lambda$ are hyperparameters which control the amplitude and smoothness of the random fluctuations. Here we have indicated explicitly which random variable is generated by each normal distribution. According to $(1)$, the random variable for state transition is $D\tilde x$, which therefore links different levels of motion.

Finally, we also assume Gaussian priors for the hyperparameters $\lambda$ and the parameters $\theta$:

\begin{align*}
p(\lambda) &= \mathcal{N}(\lambda : \eta^\lambda, C^\lambda)\\
p(\theta) &= \mathcal{N}(\theta : \eta^\theta, C^\theta)
\end{align*}

This allows us to write the directed Bayesian graph for the model:

<img src="DM.png" width="600">

### Hierarchical dynamic model ###

For a hierarchical dynamic model (HDM), we assume each higher level generates causes for the level below, so that the causes $v$ link levels, whereas hidden states $x$ link dynamics over time. Further it is assumed that the noise processes at each level $w^{(i)}$ and $z^{(i)}$ are conditionally independent. This leads to the following Bayesian directed graph:

<img src="HDM.png" width="600">

Here $\vartheta^{(i)} = [\theta^{(i)}, \lambda^{(i)}]$ and $u^{(i)} = [\tilde v^{(i)}, \tilde x^{(i)}]$.

### Model inversion ###

For model inversion, we try to estimate the parameters $\vartheta$ of a model $m$ given some observations $\tilde y$ by maximising the conditional density $p(\vartheta \,|\, \tilde y, m)$. However, this density is in general not directly calculable, as normalising it requires integrating the joint density over all possible values of $\vartheta$. *Variational Bayes* suggests a workaround: minimise the Kullback-Leibler divergence between what the agent believes the state of its environment is (encoded in a Recognition density $q(\vartheta)$) and the true Bayesian posterior.

\begin{align*}
D_{KL}(\: q(\vartheta) \; || \; p(\vartheta \,|\, \tilde y, m) \: ) = \int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta\,|\, \tilde y, m)} \: d\vartheta}
\end{align*}

The KL divergence is a measure of the difference between two probability distributions. It is never negative, and it is 0 if and only if the two distributions are the same. Thus adapting $q(\vartheta)$ to minimise this KL divergence will result in $q(\vartheta)$ being a close approximation of $p(\vartheta\,|\, \tilde y, m)$.
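To make these properties concrete, here is a small numerical sketch for two univariate Gaussians. The integration grid, the function names, and the example means and variances are illustrative assumptions; the closed-form expression is the standard result for Gaussians.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (x - mean) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)

def kl_numeric(mean_q, var_q, mean_p, var_p):
    """KL(q || p) approximated by a Riemann sum on a dense grid."""
    x = np.linspace(-20.0, 20.0, 200001)
    dx = x[1] - x[0]
    log_q = log_gaussian(x, mean_q, var_q)
    log_p = log_gaussian(x, mean_p, var_p)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

def kl_closed_form(mean_q, var_q, mean_p, var_p):
    """Standard closed form of KL(q || p) for two univariate Gaussians."""
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p
                  - 1.0)

kl_same = kl_numeric(0.0, 1.0, 0.0, 1.0)   # identical densities
kl_diff = kl_numeric(1.0, 0.5, 0.0, 1.0)   # different densities
print(kl_same, kl_diff, kl_closed_form(1.0, 0.5, 0.0, 1.0))
```

The divergence vanishes for identical densities and is strictly positive otherwise. It is also not symmetric in its two arguments.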

Obviously, to evaluate this KL divergence directly, we would still need to be able to calculate $p(\vartheta\,|\, \tilde y, m)$ and we seem to have made no progress. However, the Free Energy Principle (FEP) uses the fact that $p(\vartheta, \tilde y, m) = p(\vartheta\,|\, \tilde y, m) p(\tilde y, m)$, to write this as:

\begin{align*}
D_{KL}(\: q(\vartheta) \; || \; p(\vartheta\,|\, \tilde y, m) \: ) &= \int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)/p(\tilde y, m)} \: d\vartheta} \\
&= \int{q(\vartheta) \: \{ \ln\:q(\vartheta) - \ln\:p(\vartheta, \tilde y, m) + \ln\:p(\tilde y, m) \} \: d\vartheta} \\
&= \int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} + \int{q(\vartheta) \: \ln p(\tilde y, m) \: d\vartheta} \\
&= \int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} + \ln p(\tilde y, m) \int{q(\vartheta) \: d\vartheta} \\
&= \int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} + \ln p(\tilde y, m) \\
\end{align*}

since $\int{q(\vartheta) \: d\vartheta} = 1$ by definition of a probability density. We continue by writing:

\begin{align*}
D_{KL}(\: q(\vartheta) \; || \; p(\vartheta\,|\, \tilde y, m) \: ) &= \ln p(\tilde y, m) - F\\
F &= -\int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} \\
&= -D_{KL}(\: q(\vartheta) \; || \; p(\vartheta, \tilde y, m) \: )\\
\end{align*}

The joint density $p(\vartheta,\tilde y, m)$ is called the generative density, and represents the agent's belief in how the world works. It can be factorised into $p(\vartheta,\tilde y, m) = p(\tilde y, \vartheta, m) = p(\tilde y \,|\, \vartheta, m)\:p(\vartheta, m)$ where a prior $p(\vartheta, m)$ encodes the agent's beliefs for the world states prior to new sensory input, and a likelihood $p(\tilde y|\vartheta, m)$ encodes how the agent's sensory signals relate to the world states. Thus, if we have a model for how the world states generate sensory perception (or if we can learn one), we can calculate $F$, which is called the *Variational Free Energy*, and is the negative of the KL divergence between the Recognition density, $q(\vartheta)$, and the Generative density, $p(\vartheta, \tilde y, m)$. We probably don't know $p(\tilde y, m)$, but, since this doesn't depend on $\vartheta$, it plays no role in optimising $q(\vartheta)$. We can simply maximise $F$, which minimises the KL divergence to the posterior and so makes $q(\vartheta)$ the best possible approximation of $p(\vartheta\,|\, \tilde y, m)$.

Given a model, we can determine the probability of the observations under the model by:

\begin{align*}
\ln p(\tilde y, m) &= \ln p(\tilde y \,|\, m) + \ln p(m) \\
\ln p(\tilde y \,|\, m) &= \ln p(\tilde y, m) - \ln p(m) \\
&= F + D_{KL}(\: q(\vartheta) \; || \; p(\vartheta\,|\, \tilde y, m) \: ) - \ln p(m)
\end{align*}

Thus for a given model ($p(m)=1$), we can write the log likelihood as:

\begin{align*}
\ln p(\tilde y \,|\, m) &= F + D_{KL}(\: q(\vartheta) \; || \; p(\vartheta\,|\, \tilde y, m) \: )
\end{align*}

This indicates that $F$ can be used as a lower bound on the log-evidence, since the KL divergence term is never negative and is $0$ if and only if $q(\vartheta) =  p(\vartheta\,|\, \tilde y, m)$.

$F$ can also be expressed as:

\begin{align*}
F &= -\int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} \\
&= \left< \ln p(\vartheta, \tilde y, m) \right>_q - \left< \ln q(\vartheta) \right>_q
\end{align*}

which comprises the internal energy $U(\vartheta, \tilde y) = \ln p(\vartheta, \tilde y)$ of a given model $m$ expected under $q(\vartheta)$ and the entropy of $q(\vartheta)$, which is a measure of its uncertainty.
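The bound can be checked on a toy model where everything is available in closed form. The sketch below assumes a single parameter with prior $\theta \sim \mathcal{N}(0, 1)$ and likelihood $y \,|\, \theta \sim \mathcal{N}(\theta, 1)$, so that the evidence is $\mathcal{N}(y : 0, 2)$ and the exact posterior is $\mathcal{N}(\theta : y/2, 1/2)$. This model, the function names, and the value of `y` are assumptions for illustration and are not part of the DEM scheme.

```python
import numpy as np

def free_energy(y, m, s2):
    """F = <ln p(theta, y)>_q + entropy of q, for q = N(m, s2)."""
    expected_U = (-np.log(2.0 * np.pi)
                  - 0.5 * ((y - m) ** 2 + s2)   # <ln p(y | theta)>_q
                  - 0.5 * (m ** 2 + s2))        # <ln p(theta)>_q
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * s2)
    return expected_U + entropy

def log_evidence(y):
    """ln p(y) for the toy model: y ~ N(0, 2)."""
    return -0.5 * np.log(2.0 * np.pi * 2.0) - y ** 2 / 4.0

y = 1.3
F_posterior = free_energy(y, m=y / 2.0, s2=0.5)  # q equals the true posterior
F_prior = free_energy(y, m=0.0, s2=1.0)          # q equals the prior
print(F_posterior, F_prior, log_evidence(y))
```

With $q$ equal to the posterior the bound is tight, whereas any other choice of $q$ (here the prior) gives a strictly smaller $F$.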



### Optimisation ###

The introduction of $q(\vartheta)$ converts the difficult integration problem inherent in Bayesian Inference into a much simpler optimisation problem of adapting $q(\vartheta)$ to maximise $F$. To further simplify calculation, we usually assume that the model parameters can be partitioned over the states $u = [\tilde v, \tilde x]^T$, the parameters $\theta$, and the hyperparameters $\lambda$, as:

\begin{align*}
q(\vartheta) &= q(u(t)) \, q(\theta) \, q(\lambda) \\
&= \prod_i q(\vartheta^i) \\
\vartheta^i &= \{u(t), \theta, \lambda\}
\end{align*}

This partition is called the *mean field* approximation in statistical physics. We further assume that over the timescale of inference, only the states $u$ change with time $t$, while the (hyper)parameters are assumed constant.

Under this partition, optimisation is still achieved by maximising the Free Energy, but we can now do this separately for each partition, by averaging over the other partitions. To show this, we define $F$ as an integral over the parameter partitions:

\begin{align*}
F &= \int f^i \, d\vartheta^i \\
&= \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} - \int{q(\vartheta) \: \ln q(\vartheta) \: d\vartheta} \\
&= \iint{q(\vartheta^i) \: q(\vartheta^{\backslash i}) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta^{\backslash i} \: d\vartheta^i} - \iint{q(\vartheta^i) \: q(\vartheta^{\backslash i}) \: \ln q(\vartheta) \: d\vartheta^{\backslash i} \: d\vartheta^i} \\
f^i &= \int{q(\vartheta^i) \: q(\vartheta^{\backslash i}) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta^{\backslash i} } - \int{q(\vartheta^i) \: q(\vartheta^{\backslash i}) \: \ln q(\vartheta) \: d\vartheta^{\backslash i}} \\
&= q(\vartheta^i) \: \int{ q(\vartheta^{\backslash i}) \: U(\vartheta, \tilde y) \: d\vartheta^{\backslash i} } - q(\vartheta^i) \: \int{q(\vartheta^{\backslash i}) \: (\ln q(\vartheta^i) + \ln q(\vartheta^{\backslash i})) \: d\vartheta^{\backslash i}} \\
&= q(\vartheta^i) \: V(\vartheta^i) - q(\vartheta^i) \: \ln q(\vartheta^i) - q(\vartheta^i) \int{q(\vartheta^{\backslash i}) \: \ln q(\vartheta^{\backslash i}) \: d\vartheta^{\backslash i}} \\
\partial_{q(\vartheta^i)} \: f^i &= V(\vartheta^i) - \ln q(\vartheta^i) - \ln Z^i \\
\end{align*}

Here, $\vartheta^{\backslash i}$ denotes all parameters not in set $i$ (i.e., its Markov blanket), $\ln Z^i$ collects all the terms of the derivative that do not depend on $\vartheta^i$, and

\begin{align*}
V(\vartheta^i) &= \int{q(\vartheta^{\backslash i}) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta^{\backslash i} } = \int{ q(\vartheta^{\backslash i}) \: U(\vartheta, \tilde y) \: d\vartheta^{\backslash i} } = \left< U(\vartheta) \right>_{q(\vartheta^{\backslash i})}\\
\end{align*}

The Fundamental Lemma of variational calculus states that the free energy is maximised when:

\begin{align*}
\delta_{q(\vartheta^i)} F &= 0 \Leftrightarrow \partial_{q(\vartheta^i)} \: f^i = 0 \\
\ln q(\vartheta^i) &= V(\vartheta^i) - \ln Z^i\\
q(\vartheta^i) &= \frac{1}{Z^i} \exp \left(V(\vartheta^i)\right) = \frac{1}{Z^i} \exp\left(\left< U(\vartheta) \right>_{q(\vartheta^{\backslash i})}\right) 
\end{align*}

Thus, $Z^i$ is a normalisation constant, and is also called a partition function in physics. The final equation indicates that the variational density over one parameter set is an exponential function of the internal energy averaged over all other parameters.
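A minimal sketch of this fixed point, assuming a toy internal energy that is quadratic in two scalar parameters (i.e., a bivariate Gaussian with mean `mu` and precision matrix `Lam`, both made up for this example). For such a $U$, averaging over the other factor and exponentiating gives a Gaussian factor whose precision is the corresponding diagonal entry of `Lam` and whose mean depends on the current mean of the other factor, so the updates can be iterated in turn.

```python
import numpy as np

# Toy internal energy: U = -0.5 (x - mu)^T Lam (x - mu) + const  (assumed)
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

# Means of the two factors q(x1), q(x2), updated in turn: each update is
# q(x_i) proportional to exp(<U>) averaged over the other factor.
m = np.zeros(2)
for _ in range(100):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

factor_var = 1.0 / np.diag(Lam)            # variance of each factor
true_var = np.diag(np.linalg.inv(Lam))     # exact marginal variances
print(m, factor_var, true_var)
```

The factor means converge to the true mean, while the factor variances come out smaller than the exact marginal variances; this is a well-known cost of the mean field partition when the parameters are correlated.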

Given our partitions above, we can then write:

\begin{align*}
q(u(t)) &\propto \exp \left(V(t)\right) \\
V(t) &= \left< U(t) \right>_{q(\theta)q(\lambda)} \\
q(\theta) &\propto \exp \left(\bar{V}^\theta \right) \\
\bar{V}^\theta &= \int \left< U(t) \right>_{q(u)q(\lambda)} dt + U^\theta \\
q(\lambda) &\propto \exp \left(\bar{V}^\lambda \right) \\
\bar{V}^\lambda &= \int \left< U(t) \right>_{q(u)q(\theta)} dt + U^\lambda
\end{align*}

In a dynamical system, the instantaneous internal energy $U(t)$ is a function of time. Because the parameters and hyperparameters are considered constant over a period of observation, their variational densities are functions of the path integral of this internal energy. $U^\theta = \ln p(\theta)$ and $U^\lambda = \ln p(\lambda)$ are the prior energies of the parameters and hyperparameters, respectively.

From these equations we see that the variational density over states can be determined from the instantaneous internal energy averaged over parameters and hyperparameters, whereas the density over parameters and hyperparameters can only be determined when data has been observed over a certain amount of time. In the absence of data, the integrals will be zero, and the conditional density simply reduces to the prior density.

*Variational Bayes* assumes the above equations are analytically tractable, which requires the choice of appropriate (conjugate) priors. The conditional distributions $q(\vartheta^i)$ above can then be updated through iteration as new data becomes available:

\begin{align*}
\ln q(u(t)) &\propto \left< U(t) \right>_{q(\theta)q(\lambda)} \\
\ln q(\theta) &\propto \int \left< U(t) \right>_{q(u)q(\lambda)} dt + \ln p(\theta) \\
\ln q(\lambda) &\propto \int \left< U(t) \right>_{q(u)q(\theta)} dt + \ln p(\lambda)
\end{align*}



### The Laplace approximation ###

The Laplace approximation assumes that the marginals of the conditional density assume a Gaussian form, i.e., $q(\vartheta^i) = \mathcal{N}(\vartheta^i : \mu^i, C^i)$, where $\mu^i$ and $C^i$ are the sufficient statistics. For notational clarity, we will use $\mu^i$,  $C^i$, and $P^i$ for the conditional mean, covariance, and precision of the $i^\text{th}$ marginal, respectively, and $\eta^i$,  $\Sigma^i$, and $\Pi^i$ for their priors. This approximation simplifies the updates to the marginals of the conditional densities.

For each partition $\vartheta^i$, we can then write:

\begin{align*}
q(\vartheta^i) &= \frac{1}{\sqrt{2\pi C^i}} \exp \left( \frac{-(\vartheta^i - \mu^i)^2}{2C^i} \right) \\
&= \frac{1}{Z^i} \exp \left( -\varepsilon(\vartheta^i) \right) \\ 
Z^i &= \sqrt{2\pi C^i} \\
\varepsilon(\vartheta^i) &= \frac{(\vartheta^i - \mu^i)^2}{2C^i}
\end{align*}

Recall that the Free Energy was defined as:

\begin{align*}
F &= -\int{q(\vartheta) \: \ln \frac{q(\vartheta)}{p(\vartheta, \tilde y, m)} \: d\vartheta} \\
&= - \int{q(\vartheta) \: \ln q(\vartheta) \: d\vartheta} + \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} \\
&= - \int{q(\vartheta) \: \ln \prod_i q(\vartheta^i) \: d\vartheta} + \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} \\
&= \int{q(\vartheta) \: \sum_i(\ln Z^i + \varepsilon(\vartheta^i)) \: d\vartheta} + \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} \\
&= \sum_i(\ln Z^i)\int{q(\vartheta) \: d\vartheta} + \int{q(\vartheta) \: \sum_i(\varepsilon(\vartheta^i)) \: d\vartheta} + \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} \\
&= \sum_i(\ln Z^i) + \sum_i \frac{1}{2C^i}\int{q(\vartheta) \: (\vartheta^i - \mu^i)^2 \: d\vartheta} + \int{q(\vartheta) \: \ln p(\vartheta, \tilde y, m) \: d\vartheta} \\
&= \sum_i \left( \ln Z^i + \frac{1}{2} \right) + \left< U \right>_q \\
\end{align*}

The signs follow from $\ln q(\vartheta^i) = -\ln Z^i - \varepsilon(\vartheta^i)$, and each variance integral evaluates to $C^i$, so every partition contributes $\frac{1}{2}$.

Now we still need to find an expression we can calculate for $\left< U \right>_q$. To do this, a further approximation assumes that $q$ is sharply peaked at its mean value $\mu$, so that the integration is only non-zero close to $\vartheta = \mu$. We can then use a Taylor expansion around the mean to obtain: 

\begin{align*}
\left< U \right>_q &= \int{q(\vartheta) \: U(\vartheta, \tilde y) \: d\vartheta} \\
&= \int{q(\vartheta) \: \left\{ U(\mu, \tilde y) + \left[ \frac{dU}{d\vartheta} \right]_\mu \delta\vartheta + \frac{1}{2} \left[ \frac{d^2U}{d\vartheta^2} \right]_\mu \delta\vartheta^2 \right\} \: d\vartheta} \\
&= U(\mu, \tilde y) + \left[ \frac{dU}{d\vartheta} \right]_\mu \int{q(\vartheta) \: (\vartheta - \mu) \: d\vartheta} + \frac{1}{2} \left[ \frac{d^2U}{d\vartheta^2} \right]_\mu \int{ q(\vartheta) \: (\vartheta - \mu)^2 \: d\vartheta} \\
&= U(\mu, \tilde y) + \left[ \frac{dU}{d\vartheta} \right]_\mu \left\{ \int{\vartheta q(\vartheta) \: d\vartheta} - \mu \right\} + \frac{1}{2} \left[ \frac{d^2U}{d\vartheta^2} \right]_\mu \int{ q(\vartheta) \: (\vartheta - \mu)^2 \: d\vartheta} \\
&= U(\mu, \tilde y) + \left[ \frac{dU}{d\vartheta} \right]_\mu \left\{ \mu - \mu \right\} + \frac{1}{2} \left[ \frac{d^2U}{d\vartheta^2} \right]_\mu \int{ q(\vartheta) \: (\vartheta - \mu)^2 \: d\vartheta} \\
&= U(\mu, \tilde y) + \frac{1}{2} \left[ \frac{d^2U}{d\vartheta^2} \right]_\mu C \\
\end{align*}

This now allows us to write for the free energy:

\begin{align*}
F &= U(\mu, \tilde y) + \frac{1}{2} \sum_i \left( \left[ \frac{d^2U}{d{\vartheta^i}^2} \right]_{\mu^i} C^i + \ln 2\pi C^i + 1 \right) \\
\end{align*}

To find the optimal variances, we maximise the free energy with respect to the variances, so that the partial derivatives are zero:

\begin{align*}
\frac{\partial F}{\partial C^i} &= \frac{1}{2} \left\{ \left[ \frac{d^2U}{d{\vartheta^i}^2} \right]_{\mu^i} + \frac{1}{C^i} \right\} = 0 \\
C^{i*} &=  -\left[ \frac{d^2U}{d{\vartheta^i}^2} \right]_{\mu^i}^{-1} \\
F &= U(\mu, \tilde y) + \frac{1}{2} \sum_i \ln 2\pi C^{i*}
\end{align*}

where we use the notation $C^{i*}$ to indicate this is the optimal variance which maximises the free energy.
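The second-order expansion of $\left< U \right>_q$ used above can be checked numerically. The sketch below compares it with a Gauss-Hermite quadrature of the expectation for a hypothetical, single-peaked internal energy `U`; the function, the mean, and the (small) variance are assumptions chosen only to illustrate that the approximation is good when $q$ is sharply peaked.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def expect_under_gaussian(func, mean, var, degree=40):
    """<func>_q for q = N(mean, var), by Gauss-Hermite quadrature."""
    nodes, weights = hermgauss(degree)
    samples = mean + np.sqrt(2.0 * var) * nodes
    return np.sum(weights * func(samples)) / np.sqrt(np.pi)

def U(theta):
    """Hypothetical internal energy (not from the paper)."""
    return -0.5 * (theta - 1.0) ** 2 - 0.1 * theta ** 4

def U_curvature(theta):
    """Analytic second derivative of the U above."""
    return -1.0 - 1.2 * theta ** 2

mean, var = 0.7, 0.01
exact = expect_under_gaussian(U, mean, var)
second_order = U(mean) + 0.5 * U_curvature(mean) * var
print(exact, second_order)
```

For this polynomial the only discrepancy comes from the fourth-order term, which scales with the square of the variance and is therefore negligible for a narrow $q$.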


<span style="color:red">Now, I don't quite know how to get here, but Friston just states that the updates under the Laplace approximation become:</span>

\begin{align*}
\bar{U} &= \int U(t)dt + U^\theta + U^\lambda \\
\bar{V}^u &= \int U(u, t|\mu^\theta, \mu^\lambda) + W(t)^\theta + W(t)^\lambda dt \\
\bar{V}^\theta &= \int U(\mu^u, t|\theta, \mu^\lambda) + W(t)^u + W(t)^\lambda dt + U^\theta \\
\bar{V}^\lambda &= \int U(\mu^u, t|\mu^\theta, \lambda) + W(t)^u + W(t)^\theta dt + U^\lambda \\
W(t)^u &= \frac{1}{2} \text{tr}(C^u U(t)_{uu}) \\
W(t)^\theta &= \frac{1}{2} \text{tr}(C^\theta U(t)_{\theta\theta}) \\
W(t)^\lambda &= \frac{1}{2} \text{tr}(C^\lambda U(t)_{\lambda\lambda}) \\
\end{align*}

where $U_{xx} = d^2U/dx^2$.

Also, the conditional precisions are equal to the negative curvatures of the internal action:

\begin{align*}
P^u &= -\bar{U}_{uu} = -U(t)_{uu} \\
P^\theta &= -\bar{U}_{\theta\theta} = - \int U(t)_{\theta\theta} \: dt - U^\theta_{\theta\theta} \\
P^\lambda &= -\bar{U}_{\lambda\lambda} = - \int U(t)_{\lambda\lambda} \: dt- U^\lambda_{\lambda\lambda} \\
\end{align*}
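As a sanity check of the relation between precision and curvature, the sketch below uses a toy scalar parameter with a Gaussian prior and repeated Gaussian observations, where the exact posterior precision is known. The path integral over time is replaced by a sum over samples. All names and numbers (`pi_y`, `pi_theta`, `eta`, the simulated data) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_y, pi_theta, eta = 4.0, 1.0, 0.0   # noise precision, prior precision, prior mean
y = 0.5 + rng.normal(scale=pi_y ** -0.5, size=20)  # simulated observations

def U_bar(theta):
    """Internal action: summed log likelihood plus log prior (up to constants)."""
    likelihood = -0.5 * pi_y * np.sum((y - theta) ** 2)
    prior = -0.5 * pi_theta * (theta - eta) ** 2
    return likelihood + prior

def curvature(func, x, h=1e-3):
    """Second derivative by central finite differences."""
    return (func(x + h) - 2.0 * func(x) + func(x - h)) / h ** 2

# Conditional mean: the maximum of the internal action.
mu_theta = (pi_y * np.sum(y) + pi_theta * eta) / (len(y) * pi_y + pi_theta)

# Conditional precision: the negative curvature at the mean.
P_theta = -curvature(U_bar, mu_theta)
print(mu_theta, P_theta)
```

The result matches the exact posterior precision for this conjugate model, which is the number of samples times `pi_y` plus the prior precision `pi_theta`.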

For our HDM the gradients and curvature of the internal energy are:

\begin{align*}
U(t)_u &= -\tilde \varepsilon_u^T \tilde \Pi \tilde \varepsilon \\
U(t)_\theta &= -\tilde \varepsilon_\theta^T \tilde \Pi \tilde \varepsilon \\
U(t)_{\lambda i} &= - \frac{1}{2} \text{tr} (Q_i(\tilde \varepsilon \tilde \varepsilon^T - \tilde \Sigma)) \\
U(t)_{uu} &= -\tilde \varepsilon_u^T \tilde \Pi \tilde \varepsilon_u \\
U(t)_{\theta\theta} &= -\tilde \varepsilon_\theta^T \tilde \Pi \tilde \varepsilon_\theta \\
U(t)_{\lambda\lambda ij} &= - \frac{1}{2} \text{tr} (Q_i \tilde \Sigma Q_j \tilde \Sigma) \\
\end{align*}

with:

\begin{align*}
U(t)_{\lambda i} &= \frac{dU(t)}{d\lambda_i} \\
Q_{i} &= \frac{d\tilde\Pi}{d\lambda_i} \\
\end{align*}

and where we assume that

\begin{align*}
\frac{d^2\tilde\Pi}{d\lambda_i^2} &= 0\\
\end{align*}
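The gradient and curvature expressions for the parameters can be verified with finite differences on a toy problem. The sketch below assumes a prediction error that is linear in the parameters, $\varepsilon = y - G\theta$, so that $\varepsilon_\theta = -G$ exactly; the matrix `G`, the precision `Pi`, and the data are randomly generated stand-ins and not part of the HDM.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 2))            # assumed linear mapping from theta to prediction
A = rng.normal(size=(5, 5))
Pi = A @ A.T + 5.0 * np.eye(5)         # a symmetric positive-definite precision
y = rng.normal(size=5)
theta = np.array([0.3, -0.7])

def error(theta):
    return y - G @ theta

def U(theta):
    """Internal energy up to constants: -0.5 eps^T Pi eps."""
    e = error(theta)
    return -0.5 * e @ Pi @ e

eps_theta = -G                                    # derivative of the error
U_theta = -eps_theta.T @ Pi @ error(theta)        # gradient formula
U_theta_theta = -eps_theta.T @ Pi @ eps_theta     # curvature formula

# Central finite differences of U for comparison.
h = 1e-6
U_theta_fd = np.array([(U(theta + h * e_k) - U(theta - h * e_k)) / (2.0 * h)
                       for e_k in np.eye(2)])
print(U_theta, U_theta_fd)
```

For a linear error the curvature formula is exact; for nonlinear models it drops the terms involving second derivatives of the error, which is a Gauss-Newton style approximation.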

The derivatives with respect to each parameter, $\tilde \varepsilon_{\theta}(t) = \tilde \varepsilon_{u\theta} \mu^u(t)$, rest on the second derivatives of the model's functions that mediate interactions between each parameter and the states:

\begin{align*}
\tilde \varepsilon_{\theta u}^T &= \tilde \varepsilon_{u\theta} = -
\begin{bmatrix}
I \otimes g_{v\theta} & I \otimes g_{x\theta} \\
I \otimes f_{v\theta} & I \otimes f_{x\theta}
\end{bmatrix}
\end{align*}


### Temporal smoothness ###

Since the different levels of motion are linked, the covariance matrix will have off-diagonal elements with non-zero values. The covariance is given by the Kronecker product $\tilde \Sigma(\lambda)^z = S(\gamma)^{-1} \otimes \Sigma(\lambda)^z$, where $\Sigma(\lambda)^z$ is a diagonal matrix specifying the variance of the (often assumed independent) Gaussian noise at each level, and $S(\gamma)$ is the temporal precision matrix. $S(\gamma)$ encodes the temporal dependencies between levels and is a function of their autocorrelations:

\begin{align*}
S^{-1} =
    \begin{bmatrix}
    1 & 0 & \ddot{\rho}(0) & 0 & \ddot{\ddot{\rho}}(0) & 0 \\
    0 & -\ddot{\rho}(0) & 0 & -\ddot{\ddot{\rho}}(0) & 0 & -\dddot{\dddot{\rho}}(0)\\
    \ddot{\rho}(0) & 0 & \ddot{\ddot{\rho}}(0) & 0 & \dddot{\dddot{\rho}}(0) & 0 \\
    0 & -\ddot{\ddot{\rho}}(0) & 0 & -\dddot{\dddot{\rho}}(0) & 0 & -\ddot{\dddot{\dddot{\rho}}}(0) \\
    \ddot{\ddot{\rho}}(0) & 0 & \dddot{\dddot{\rho}}(0) & 0 & \ddot{\dddot{\dddot{\rho}}}(0) & 0 \\
    0 & -\dddot{\dddot{\rho}}(0) & 0 & -\ddot{\dddot{\dddot{\rho}}}(0) & 0 & -\ddot{\ddot{\dddot{\dddot{\rho}}}}(0)\\
    \end{bmatrix}
\end{align*}

(See [here](Generalised%20precision%20matrix.ipynb) for a derivation.) Here $\ddot{\rho}(0)$ is the second derivative of the autocorrelation function evaluated at zero. Note that because the autocorrelation function is even (symmetrical for positive and negative delays), its odd derivatives are all odd functions, and thus are zero when evaluated at zero.

While $S(\gamma)$ can be evaluated for any analytical autocorrelation function, we assume here that the temporal correlations all have the same Gaussian form, which gives:

\begin{align*}
S^{-1} &=
    \begin{bmatrix}
    1 & 0 & -\gamma & 0 & 3 \gamma^2 & 0 \\
    0 & \gamma & 0 & -3 \gamma^2 & 0 & 15 \gamma^3 \\
    -\gamma & 0 & 3 \gamma^2 & 0 & -15 \gamma^3 & 0 \\
    0 & -3 \gamma^2 & 0 & 15 \gamma^3 & 0 & -105 \gamma^4 \\
    3 \gamma^2 & 0 & -15 \gamma^3 & 0 & 105 \gamma^4 & 0 \\
    0 & 15 \gamma^3 & 0 & -105 \gamma^4 & 0 & 945 \gamma^5 \\
    \end{bmatrix}
\end{align*}

Here, $\gamma$ is the precision parameter of a Gaussian autocorrelation function. Typically, $\gamma > 1$, which ensures the precisions of high-order motion converge quickly to zero. This is important because it enables us to truncate the representation of an infinite number of generalised coordinates to a relatively small number, since high-order prediction errors have a vanishingly small precision. Friston states that an order of n=6 is sufficient in most cases.
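The Gaussian form of $S^{-1}$ can be generated for any number of generalised coordinates. The sketch below assumes the autocorrelation $\rho(t) = \exp(-\gamma t^2 / 2)$, which is the form whose even derivatives at zero, $(-\gamma)^k (2k-1)!!$, reproduce the entries of the matrix above; the function name `temporal_covariance` is made up for this example.

```python
import numpy as np

def temporal_covariance(n_orders, gamma):
    """S^{-1} for an assumed autocorrelation rho(t) = exp(-gamma t^2 / 2)."""
    S_inv = np.zeros((n_orders, n_orders))
    for i in range(n_orders):
        for j in range(n_orders):
            if (i + j) % 2 == 0:  # odd derivatives of rho vanish at zero
                k = (i + j) // 2
                # (2k - 1)!! = 1 * 3 * ... * (2k - 1); the empty product is 1
                double_factorial = np.prod(np.arange(1, 2 * k, 2))
                rho_derivative = (-gamma) ** k * double_factorial  # rho^(2k)(0)
                S_inv[i, j] = (-1) ** i * rho_derivative
    return S_inv

S_inv = temporal_covariance(n_orders=6, gamma=2.0)
print(S_inv)
```

Evaluating this with a symbolic or unit value of `gamma` is a quick way to confirm the pattern of signs and coefficients in the matrix.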

Instead of using the covariance matrix, we can use its inverse, the precision matrix, which is defined by $\tilde \Pi(\lambda)^z = S(\gamma) \otimes \Pi(\lambda)^z$, where $\Pi(\lambda)^z$ is a diagonal matrix of the precisions of the Gaussian noise at each level of the generalised coordinates.
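The two Kronecker products can be compared directly in NumPy. This sketch truncates to three orders of motion, copies the top-left block of the Gaussian $S^{-1}$ above, and uses made-up variances for two sensor channels; it then confirms that the generalised precision is the inverse of the generalised covariance.

```python
import numpy as np

gamma = 2.0
# Top-left 3x3 block of the Gaussian-form S^{-1} given above.
S_inv = np.array([[1.0,    0.0,   -gamma],
                  [0.0,    gamma,  0.0],
                  [-gamma, 0.0,    3.0 * gamma ** 2]])
S = np.linalg.inv(S_inv)

Sigma_z = np.diag([0.5, 2.0])   # hypothetical noise variances, two channels
Pi_z = np.linalg.inv(Sigma_z)

Sigma_tilde = np.kron(S_inv, Sigma_z)   # generalised covariance
Pi_tilde = np.kron(S, Pi_z)             # generalised precision
print(np.round(Pi_tilde @ Sigma_tilde, 10))
```

This works because $(A \otimes B)(C \otimes D) = AC \otimes BD$, so the precision can be assembled from the two small inverses without ever inverting the full generalised covariance.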

\begin{align*}
p(\tilde y, \tilde x, \tilde v \,|\, \theta, \lambda) &= p(\tilde y \,|\, \tilde x, \tilde v, \theta, \lambda) \; p(\tilde x \,|\, \tilde v, \theta, \lambda) \; p(\tilde v) \\
 &= (2\pi)^{-N_y/2} {|\tilde\Pi^z|}^{1/2} e^{-\frac{1}{2}{\tilde\varepsilon^v}^T \tilde\Pi^z \tilde\varepsilon^v} (2\pi)^{-N_x/2} {|\tilde\Pi^w|}^{1/2} e^{-\frac{1}{2}{\tilde\varepsilon^x}^T \tilde\Pi^w \tilde\varepsilon^x} \; p(\tilde v) \\
 &= (2\pi)^{-(N_y+N_x)/2} (|\tilde\Pi^z| \, |\tilde\Pi^w|)^{1/2} e^{-\frac{1}{2}{\tilde\varepsilon^v}^T \tilde\Pi^z \tilde\varepsilon^v}  e^{-\frac{1}{2}{\tilde\varepsilon^x}^T \tilde\Pi^w \tilde\varepsilon^x} \; p(\tilde v)\\
 &= (2\pi)^{-N/2} |\tilde\Pi|^{1/2} e^{-\frac{1}{2}{\tilde\varepsilon}^T \tilde\Pi \tilde\varepsilon} \; p(\tilde v)\\ 
\tilde\Pi &= 
    \begin{bmatrix}
    \tilde\Pi^z & \\
    & \tilde\Pi^w
    \end{bmatrix}\\
\tilde\varepsilon &= 
    \begin{bmatrix}
    \tilde\varepsilon^v = \ \ \ \tilde y - \tilde g   \\
    \tilde\varepsilon^x = D\tilde x - \tilde f
    \end{bmatrix}\\
N &= \text{Rank}(\tilde\Pi)
\end{align*} 

Here we introduce auxiliary variables $\tilde\varepsilon(t)$, which are the prediction errors for the generalised responses and motion of the hidden states, with respective predictions $\tilde g(t)$ and $\tilde f(t)$ and their precisions encoded by $\tilde\Pi$.

The log probability can thus be written:
\begin{align*}
\ln p(\tilde y, \tilde x, \tilde v \,|\, \theta, \lambda) &=  \frac{1}{2} \ln |\tilde\Pi| -  \frac{1}{2}{{\tilde\varepsilon}^T \tilde\Pi \tilde\varepsilon} - \frac{N}{2} \ln 2\pi + \ln p(\tilde v)
\end{align*}

where the third term is constant, and the fourth term is defined by the input causes and considered known.
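A minimal sketch of this log probability, leaving out the $\ln p(\tilde v)$ term, is given below. The precisions and prediction errors are made-up numbers, and `log_joint` is a hypothetical helper name. The example also checks that stacking the errors and using the block-diagonal $\tilde\Pi$ gives the same value as adding the two Gaussian log densities separately.

```python
import numpy as np

def log_joint(eps, Pi):
    """0.5 ln|Pi| - 0.5 eps^T Pi eps - 0.5 N ln(2 pi), without ln p(v)."""
    _, logdet = np.linalg.slogdet(Pi)
    N = Pi.shape[0]  # equals the rank for a full-rank precision
    return 0.5 * logdet - 0.5 * eps @ Pi @ eps - 0.5 * N * np.log(2.0 * np.pi)

# Hypothetical precisions and prediction errors.
Pi_z = np.array([[2.0, 0.3],
                 [0.3, 1.0]])
Pi_w = np.array([[4.0]])
eps_v = np.array([0.1, -0.4])   # y - g
eps_x = np.array([0.25])        # Dx - f

# Stack the errors and build the block-diagonal precision.
Pi = np.block([[Pi_z, np.zeros((2, 1))],
               [np.zeros((1, 2)), Pi_w]])
eps = np.concatenate([eps_v, eps_x])

total = log_joint(eps, Pi)
separate = log_joint(eps_v, Pi_z) + log_joint(eps_x, Pi_w)
print(total, separate)
```

Using `slogdet` rather than taking the logarithm of a determinant avoids overflow or underflow when the generalised precision spans many orders of magnitude, as it does for high orders of motion.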