Journal of Mathematical Psychology 76 (2017) 198–211. http://dx.doi.org/10.1016/j.jmp.2015.11.003 
 

# A tutorial on the free-energy framework for modelling perception and learning #

Rafal Bogacz

MRC Unit for Brain Network Dynamics, University of Oxford, Mansfield Road, Oxford, OX1 3TH, UK; Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DU, UK 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">
<p><i>Converted to Jupyter notebook by André van Schaik</i>
</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

## Abstract ##

This paper provides an easy to follow tutorial on the free-energy framework for modelling perception developed by Friston, which extends the predictive coding model of Rao and Ballard. These models assume that the sensory cortex infers the most likely values of attributes or features of sensory stimuli from the noisy inputs encoding the stimuli. Remarkably, these models describe how this inference could be implemented in a network of very simple computational elements, suggesting that this inference could be performed by biological networks of neurons. Furthermore, learning about the parameters describing the features and their uncertainty is implemented in these models by simple rules of synaptic plasticity based on Hebbian learning. This tutorial introduces the free-energy framework using very simple examples, and provides step-by-step derivations of the model. It also discusses in more detail how the model could be implemented in biological neural circuits. In particular, it presents an extended version of the model in which the neurons only sum their inputs, and synaptic plasticity only depends on activity of pre-synaptic and post-synaptic neurons. 

## 1. Introduction ##

The model of Friston (2005) and the predictive coding model of Rao and Ballard (1999) provide a powerful mathematical framework to describe how the sensory cortex extracts information from noisy stimuli. The predictive coding model (Rao & Ballard, 1999) suggests that visual cortex infers the most likely properties of stimuli from noisy sensory input. The inference in this model is implemented by a surprisingly simple network of neuron-like nodes. The model is called "predictive coding", because some of the nodes in the network encode the differences between inputs and predictions of the network. Remarkably, learning about features present in sensory stimuli is implemented by simple Hebbian synaptic plasticity, and Rao and Ballard (1999) demonstrated that the model presented with natural images learns features resembling receptive fields of neurons in the primary visual cortex. 

Friston (2005) has extended the model to also represent uncertainty associated with different features. He showed that learning about the variance and co-variance of features can also be implemented by simple synaptic plasticity rules based on Hebbian learning. As the extended model (Friston, 2005) learns the variance and co-variance of features, it offers several new insights. First, it describes how the perceptual systems may differentially weight sources of sensory information depending on their level of noise. Second, it shows how the sensory networks can learn to recognize features that are encoded in the patterns of covariance between inputs, such as textures. Third, it provides a natural way to implement attentional modulation as the reduction in variance of the attended features (we come back to these insights in Discussion). Furthermore, Friston (2005) pointed out that this model can be viewed as an approximate Bayesian inference based on minimization of a function referred to in statistics as free-energy. The free-energy framework (Friston, 2003, 2005) has been recently extended by Karl Friston and his colleagues to describe how the brain performs different cognitive functions including action selection (FitzGerald, Schwartenbeck, Moutoussis, Dolan, & Friston, 2015; Friston et al., 2013). Furthermore, Friston (2010) proposed that the free-energy theory unifies several theories of perception and action which are closely related to the free-energy framework. 

There are many articles which provide an intuition for the free-energy framework and discuss how it relates to other theories and experimental data (Friston, 2003, 2005, 2010; Friston et al., 2013). However, the description of mathematical details of the theory in these papers requires a very deep mathematical background. The main goal of this paper is to provide an easy to follow tutorial on the free-energy framework. To make the tutorial accessible to a wide audience, it only assumes basic knowledge of probability theory, calculus and linear algebra. This tutorial is planned to be complementary to existing literature, so it does not focus on the relationship to other theories and experimental data, or on applications to more complex tasks, which are described elsewhere (Friston, 2010; Friston et al., 2013). 

In this tutorial we also consider in more detail the neural implementation of the free-energy framework. Any computational model would need to satisfy the following constraints to be considered biologically plausible: 

1. Local computation: A neuron performs computations only on the basis of the activity of its input neurons and synaptic weights associated with these inputs (rather than information encoded in other parts of the circuit). 

2. Local plasticity: Synaptic plasticity is only based on the activity of pre-synaptic and post-synaptic neurons. 

The model of Rao and Ballard (1999) fully satisfied these constraints. The model of Friston (2005) did not satisfy them fully, but we show that after small modifications and extensions it can satisfy them. So the descriptions of the model in this tutorial slightly differ in a few places or extend the original model to better explain how the proposed computation could be implemented in the neural circuits. All such differences or extensions are indicated by footnotes or in text, and the original model is presented in Appendix A. 

It is commonly assumed in theoretical neuroscience (O’Reilly & Munakata, 2000) that the basic computations a neuron performs are the summation of its input weighted by the strengths of synaptic connections, and the transformation of this sum through a (monotonic) function describing the relationship between neurons’ total input and output (also termed the firing-input or f-I curve). Whenever possible, we will assume that the computation of the neurons in the described model is limited to these computations (or even just to linear summation of inputs). 

We feel that the neural implementation of the model is worth considering, because if the free-energy principle indeed describes the computations in the brain, it can provide an explanation for why the cortex is organized in a particular way. However, to gain such insight it is necessary to start comparing the neural networks implementing the model with those in the real brain. Consequently, we consider in this paper possible neural circuits that could perform the computations required by the theory. Although the neural implementations proposed here are not the only possible ones, it is worth considering them as a starting point for comparison of the model with details of neural architectures in the brain. We hope that such comparison could iteratively lead to refined neural implementations that are more and more similar to real neural circuits. 

To make this tutorial as easy to follow as possible we introduce the free-energy framework using a simple example, and then illustrate how the model can scale up to more complex neural architectures. The tutorial provides step-by-step derivation of the model. Some of these derivations are straightforward, and we feel that it would be helpful for the reader to do them on their own to gain a better understanding of the model and to “keep in mind” the notation used in the paper. Such straightforward derivations are indicated by “(TRY IT YOURSELF)”, so after encountering such a label we recommend trying to do the calculation described in the sentence with this label and then comparing the obtained results with those in the paper. To illustrate the model we include simple simulations, but again we feel it would be helpful for a reader to perform them on their own, to get an intuition for the model. Therefore we describe these simulations as exercises. 

The paper is organized as follows. Section 2 introduces the model using a very simple example and the most basic mathematical concepts possible, so it is accessible to a particularly wide audience. Section 3 provides mathematical foundations for the model, and shows how the inference in the model is related to minimization of free-energy. Section 4 then shows how the model scales up to describe the neural circuits in sensory cortex. In these three sections we use notation similar to that used by Friston (2005). Section 5 describes an extended version of the model which satisfies the constraint of local plasticity described above. Finally, Section 6 discusses insights provided by the model. 

## 2. Simplest example of perception ##

We start by considering in this section a simple perceptual problem in which a value of a single variable has to be inferred from a single observation. To make it more concrete, **consider a simple organism that tries to infer the size or diameter of a food item, which we denote by $v$, on the basis of light intensity it observes. Let us assume that our simple animal has only one light sensitive receptor which provides it with a noisy estimate of light intensity, which we denote by $u$. Let $g$ denote a non-linear function relating the average light intensity with the size. Since the amount of light reflected is related to the area of an object, in this example we will consider a simple function of $g(v) = v^2$. Let us further assume that the sensory input is noisy. In particular, when the size of food item is $v$, the perceived light intensity is normally distributed with mean $g(v)$, and variance $\varSigma_u$** (although a normal distribution is not the best choice for a distribution of light intensity, as it includes negative numbers, we will still use it for simplicity): 

\begin{equation*}
p(u|v) = f (u; g(v), \varSigma_u) \tag{1}
\end{equation*}

In Eq. (1) $f(x; \mu, \varSigma)$ denotes the density of a normal distribution with mean $\mu$  and variance $\varSigma$: 

\begin{equation*} 
f (x; \mu, \varSigma) = \frac{1}{\sqrt{2 \pi \varSigma}} \exp \left( -\frac{(x - \mu)^2}{2\varSigma}\right) \tag{2}
\end{equation*}

Due to the noise present in the observed light intensity, the animal can refine its guess for the size $v$ by combining the sensory stimulus with the prior knowledge on how large the food items usually are, that it had learnt from experience. For simplicity, **let us assume that our animal expects this size to be normally distributed with mean $v_p$ and variance $\varSigma_p$** (subscript $p$ stands for "prior"), which we can write as: 

\begin{equation*}
p(v) = f (v; v_p, \varSigma_p). \tag{3}
\end{equation*}
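
The generative model defined by Eqs. (1) and (3) can be made concrete by sampling from it. The following is a minimal sketch, not part of the original paper: the parameter values are illustrative assumptions (they match those used later in Exercise 1), and the names `rng`, `v_samples` and `u_samples` are hypothetical. Note that $\varSigma$ denotes a variance, so NumPy's `scale` argument (a standard deviation) receives its square root.

```python
import numpy as np

# Illustrative parameter values (assumed; the same as in Exercise 1)
v_p = 3.0        # mean of the prior over size
sigma_p = 1.0    # VARIANCE of the prior over size
sigma_u = 1.0    # VARIANCE of the sensory noise

def g(v):
    # Non-linear mapping from size to mean light intensity
    return v ** 2

rng = np.random.default_rng(seed=0)
n_samples = 100_000

# Eq. (3): draw sizes from the prior p(v)
v_samples = rng.normal(loc=v_p, scale=np.sqrt(sigma_p), size=n_samples)

# Eq. (1): draw a noisy light intensity for every sampled size, p(u|v)
u_samples = rng.normal(loc=g(v_samples), scale=np.sqrt(sigma_u))

print("mean size:           ", v_samples.mean())
print("mean light intensity:", u_samples.mean())
```

Under these assumptions the average light intensity should be close to $v_p^2 + \varSigma_p = 10$ rather than $g(v_p) = 9$, which is a first hint of how the non-linearity of $g$ shapes the statistics of the input.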

Let us now assume that our animal observed a particular value of light intensity, and attempts to estimate the size of the food item on the basis of this observation. We will first consider an exact solution to this problem, and illustrate why it would be difficult to compute it in a simple neural circuit. Then we will present an approximate solution that can be easily implemented in a simple network of neurons. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

*I (AvS) occasionally take the liberty to add my own comments or simulations to the document. These will be indicated by blue text and two horizontal lines, like this section.*

Note, in the set-up above, variables can have measurement units. If $x$ is measured in [m], then $f$ will have units [1/m]. However, measurements about the external world are represented by sensory neurons internally, and their values are measured in membrane voltage, or spike rate. Typically, in these types of Bayesian inference approaches, all observations and internal variables are considered dimensionless. This means that there is an implicit conversion of the measurement to a dimensionless variable by division with the measurement unit, which determines the scale on which values are represented. This will become important later. 

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

### 2.1. Exact solution ###

To compute how likely different sizes $v$ are given the observed sensory input $u$, we could use Bayes’ theorem: 

\begin{equation*}
p(v|u) = \frac{p(v)p(u|v)}{p(u)} \tag{4}
\end{equation*}

Term $p(u)$ in the denominator of Eq. (4) is a normalization term, which ensures that the posterior probabilities of all sizes $p(v|u)$ integrate to 1: 

\begin{equation*}
p(u) = \int p(v)p(u|v)dv \tag{5}
\end{equation*}

The integral in the above equation sums over the whole range of possible values of $v$, so it is a definite integral, but for brevity of notation we do not state the limits of integration in this and all other integrals in the paper. 

Now combining Eqs. (1)–(5) we can compute numerically how likely different sizes are given the sensory observation. For readers who are not familiar with such Bayesian inference we recommend doing the following exercise now. 

**Exercise 1.** *Assume that our animal observed the light intensity $u = 2$, the level of noise in its receptor is $\varSigma_u = 1$, and the mean and variance of its prior expectation of size are $v_p = 3$ and $\varSigma_p = 1$. Write a computer program that computes the posterior probabilities of sizes from 0.01 to 5, and plots them.*

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

The Python code below solves this exercise, and also provides sliders that allow you to change the parameters. You will need to download the notebook and run it locally to be able to use these sliders.

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as ipw
import matplotlib

def Normal(x, mu, sigma):
    # Density of a normal distribution, Eq. (2); note that sigma is the variance, not the standard deviation
    return 1 / (2*np.pi*sigma)**(1/2) * np.exp(- (x - mu)**2 / (2 * sigma)) 

def g(v):
    return v**2

In [2]:
def fig_1(CU):

    def draw_fig1(u=2.0, sigma_u=1.0, v_p=3.0, sigma_p=1.0):
        [h.remove() for h in ax.get_children() if isinstance(h, matplotlib.lines.Line2D)]
        v = np.arange(0.01, 5, dv)
        prior = Normal(v, v_p, sigma_p)
        posterior  = Normal(u, g(v), sigma_u) * prior
        posterior /= np.sum(posterior * dv)
        ax.plot(v, posterior, 'C0', label='posterior')
        ax.plot(v, prior, 'C1', label='prior')

    dv      = 0.01
    u       = ipw.FloatSlider(value=2.0, min=0.1, max=20, step=0.1, continuous_update=CU) 
    sigma_u = ipw.FloatSlider(value=1.0, min=0.1, max=10, step=0.1, continuous_update=CU) 
    v_p     = ipw.FloatSlider(value=3.0, min=0.1, max=5,  step=0.1, continuous_update=CU) 
    sigma_p = ipw.FloatSlider(value=1.0, min=0.1, max=10, step=0.1, continuous_update=CU)

    fig = plt.figure(figsize=(6,6), num='Fig 1')
    ax  = fig.add_subplot(1, 1, 1)
    ipw.interact(draw_fig1, u=u, sigma_u=sigma_u, v_p=v_p, sigma_p=sigma_p)
    plt.title('Exercise 1')
    plt.axis([0, 5, 0, 3])
    plt.xlabel('v')
    plt.ylabel('p(v) or p(v|u)');
    plt.legend()


fig_1(False) # For some unknown reason continuous updating of this widget doesn't work very well


The resulting plot is shown in Fig. 1. It is worth observing that such a Bayesian approach integrates the information brought by the stimulus with prior knowledge: please note that the most likely value of $v$ lies between that suggested by the stimulus (i.e. $\sqrt{2}$) and the most likely value based on prior knowledge (i.e. $3$). It may seem surprising that the posterior probability is so low for $v = 3$, i.e. the mean prior expectation. It comes from the fact that $g(3) = 9$, which is really far from the observed value $u = 2$, so $p(u = 2|v = 3)$ is very close to zero. This illustrates how non-intuitive Bayesian inference can be once the relationship between variables is non-linear. 
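
The observations about the posterior can also be checked numerically without any plotting. The sketch below is an illustrative addition under stated assumptions: it evaluates Eqs. (1)–(5) on the grid of Exercise 1, approximates the integral of Eq. (5) by a Riemann sum, and reports the grid point with the highest posterior density. The helper `normal_pdf` and the variable names are hypothetical and independent of the notebook cells.

```python
import numpy as np

def normal_pdf(x, mean, variance):
    # Density of a normal distribution, Eq. (2); the third argument is a variance
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

# Parameters of Exercise 1
u, sigma_u, v_p, sigma_p = 2.0, 1.0, 3.0, 1.0

dv = 0.01
v_grid = np.arange(0.01, 5.0 + dv / 2, dv)   # sizes from 0.01 to 5 inclusive

# Numerator of Bayes' theorem, Eq. (4): p(v) p(u|v) with g(v) = v^2
unnormalised = normal_pdf(v_grid, v_p, sigma_p) * normal_pdf(u, v_grid ** 2, sigma_u)

# Eq. (5): normalisation term approximated by a Riemann sum over the grid
p_u = np.sum(unnormalised) * dv
posterior = unnormalised / p_u

v_map = v_grid[np.argmax(posterior)]
posterior_at_prior_mean = posterior[np.argmin(np.abs(v_grid - v_p))]

print("most likely size on the grid:", v_map)
print("posterior density at v = v_p:", posterior_at_prior_mean)
```

The most likely size found this way should lie between $\sqrt{2}$ and $3$, and the posterior density at $v = 3$ should be vanishingly small, in line with the discussion above.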

Let us now discuss why performing such an exact calculation is challenging for a simple biological system. First, as soon as the function $g$ relating the variable we wish to infer with observations is non-linear, the posterior distribution $p(v|u)$ may not take a standard shape; for example, the distribution in Fig. 1 is not normal. Thus representing the distribution $p(v|u)$ requires representing infinitely many values $p(v|u)$ for different possible $v$ rather than a few summary statistics like mean and variance. Second, the computation of the posterior distribution involves computation of the normalization term. Although it has been proposed that circuits within the basal ganglia can compute the normalization term in case of the discrete probability distributions (Bogacz & Gurney, 2007), computation of the normalization for continuous distributions involves evaluating the integral of Eq. (5). Calculating such an integral would be challenging for a simple biological system. This is especially true when the dimensionality of the integrals (i.e., the number of unknown variables) increases beyond a trivial number. Even mathematicians resort to (computationally very expensive) numerical or sampling techniques in this case. 

We will now present an approximate solution to the above inference problem, that could be easily implemented in a simple biological system. 

### 2.2. Finding the most likely feature value ###

**Instead of finding the whole posterior distribution $p(v|u)$, let us try to find the most likely size of the food item $v$ which maximizes $p(v|u)$.** We will denote this most likely size by $\phi$, and its posterior probability density by $p(\phi|u)$. It is reasonable to assume that in many cases the brain represents at a given moment of time only the most likely values of features. For example, in the case of binocular rivalry, only one of the two possible interpretations of sensory inputs is represented. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

I'm repeating the equation here with the estimate $\phi$ replacing the full range $v$ for ease of reference.

\begin{align*}
p(\phi|u) &= \frac{p(\phi)p(u|\phi)}{p(u)} \tag{4} \\
\end{align*}

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

**We will look for the value $\phi$ which maximizes $p(\phi|u)$.** According to Eq. (4), the posterior probability $p(\phi|u)$ depends on a ratio of two quantities, but the denominator $p(u)$ does not depend on $\phi$. Thus the value of $\phi$ which maximizes $p(\phi|u)$ is the same one which maximizes the numerator of Eq. (4). We will denote the logarithm of the numerator by $F$, as it is related to the negative of free energy (as we will describe in Section 3): 

\begin{equation*}
F = \ln p(\phi) + \ln p(u|\phi). \tag{6}
\end{equation*}

In the above equation we used the property of logarithm $\ln(ab) = \ln a + \ln b$. **We will maximize the logarithm of the numerator of Eq. (4), because it has the same maximum as the numerator itself as $\ln$ is a monotonic function, and is easier to compute as the expressions for $p(u|\phi)$ and $p(\phi)$ involve exponentiation.** 

To find the parameter $\phi$ that describes the most likely size of the food item, we will use a simple gradient ascent: i.e. **we will modify $\phi$ proportionally to the gradient of $F$**, which will turn out to be a very simple operation. It is relatively straightforward to compute $F$ by substituting Eqs. (1)–(3) into Eq. (6) and then to compute the derivative of $F$ (TRY IT YOURSELF). 

\begin{align*}
F &= \ln f(\phi; v_p,\varSigma_p) + \ln f(u; g(\phi), \varSigma_u) \\
&= \ln \left[ \frac{1}{\sqrt{2 \pi \varSigma_p}} \exp \left(-\frac{(\phi - v_p)^2}{2 \varSigma_p}\right)\right] 
+ \ln \left[ \frac{1}{\sqrt{2 \pi \varSigma_u}} \exp \left(-\frac{(u - g(\phi))^2}{2 \varSigma_u}\right)\right] \\
&= \ln \frac{1}{\sqrt{2 \pi}} - \frac{1}{2} \ln \varSigma_p - \frac{(\phi - v_p)^2}{2 \varSigma_p} 
+ \ln \frac{1}{\sqrt{2 \pi}} - \frac{1}{2} \ln \varSigma_u - \frac{(u - g(\phi))^2}{2 \varSigma_u} \\
&= \frac{1}{2} \left( -\ln \varSigma_p - \frac{(\phi - v_p)^2}{\varSigma_p} - \ln \varSigma_u - \frac{(u - g(\phi))^2}{\varSigma_u} \right) + C. \tag{7}
\end{align*}

We incorporated the constant terms in the third line above into a constant $C$. Now we can compute the derivative of $F$ over $\phi$: 

\begin{equation*}
\frac{\partial{F}}{\partial{\phi}} = \frac{v_p - \phi}{\varSigma_p} + \frac{u - g(\phi)}{\varSigma_u} g'(\phi).
\tag{8}
\end{equation*}

In the above equation we used the chain rule to compute the second term, and $g'(\phi)$ is a derivative of function $g$ evaluated at $\phi$, so in our example $g'(\phi) = 2\phi$. We can find our best guess $\phi$ for $v$ simply by changing $\phi$ in proportion to the gradient: 

\begin{equation*}
\dot{\phi} = \frac{\partial{F}}{\partial{\phi}}
\tag{9}
\end{equation*}

In the above equation $\dot{\phi}$ is the rate of change of $\phi$ with time. **Let us note that the update of $\phi$ is very intuitive. It is driven by two terms in Eq. (8): the first moves it towards the mean of the prior, the second moves it according to the sensory stimulus, and both terms are weighted by the reliabilities of prior and sensory input respectively.** 
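
A quick way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below is an illustrative addition: it evaluates $F$ directly from its definition in Eq. (6) as a sum of two log-densities, and compares a central-difference derivative with the analytic expression $(v_p - \phi)/\varSigma_p + (u - g(\phi))\,g'(\phi)/\varSigma_u$. The function names, the step size `h` and the test points are assumptions chosen for the example.

```python
import numpy as np

def g(v):
    return v ** 2

def g_prime(v):
    return 2 * v

def log_normal_pdf(x, mean, variance):
    # Logarithm of the normal density of Eq. (2)
    return -0.5 * np.log(2 * np.pi * variance) - (x - mean) ** 2 / (2 * variance)

def F(phi, u, sigma_u, v_p, sigma_p):
    # Eq. (6): F = ln p(phi) + ln p(u|phi)
    return log_normal_pdf(phi, v_p, sigma_p) + log_normal_pdf(u, g(phi), sigma_u)

def dF_dphi_analytic(phi, u, sigma_u, v_p, sigma_p):
    # Prior term pulls phi towards v_p; sensory term pulls g(phi) towards u
    return (v_p - phi) / sigma_p + (u - g(phi)) / sigma_u * g_prime(phi)

params = dict(u=2.0, sigma_u=1.0, v_p=3.0, sigma_p=1.0)   # Exercise 1 values
h = 1e-5                                                  # finite-difference step (assumed)
test_points = np.array([0.5, 1.0, 1.6, 2.5, 3.0])

numeric = (F(test_points + h, **params) - F(test_points - h, **params)) / (2 * h)
analytic = dF_dphi_analytic(test_points, **params)
max_abs_diff = np.max(np.abs(numeric - analytic))

print("analytic gradient:", analytic)
print("largest discrepancy:", max_abs_diff)
```

At $\phi = v_p$ the prior term vanishes, so the whole gradient comes from the sensory term; this is why the estimate initially moves quickly away from the prior mean when the observation disagrees with it.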

Now please note that the above procedure for finding the most likely size of the food item is computationally much simpler than the exact method presented in Section 2.1. To gain more appreciation for the simplicity of this computation we recommend doing the following exercise. 

**Exercise 2.** *Write a computer program finding the most likely size of the food item $\phi$ for the situation described in Exercise 1. Initialize $\phi = v_p$, and then find its values in the next 5 time units (you can use Euler’s method, i.e. update $\phi(t + \Delta t) = \phi(t) + \Delta t \partial{F}/\partial{\phi}$ with $\Delta t = 0.01$).*

In [3]:
def dg_dphi(v):
    return 2*v

def dF_dphi(u, sigma_u, v_p, sigma_p, phi):
    grad = (v_p - phi) / sigma_p + (u - g(phi)) / sigma_u * dg_dphi(phi)
    return grad

def fig_2a():
    u       = 2
    sigma_u = 1
    v_p     = 3
    sigma_p = 1
    phi     = v_p
    dt      = 0.01
    dur     = 5
    steps   = int(dur/dt)

    trace    = np.zeros(steps)
    trace[0] = phi

    for t in range(steps-1):
        phi += dt * dF_dphi(u, sigma_u, v_p, sigma_p, phi)
        trace[t+1] = phi

    fig = plt.figure(figsize=(4,4), num='Fig 2a')
    ax  = fig.add_subplot(1, 1, 1)

    ax.plot(np.arange(steps) * dt, trace)
    plt.xlabel('time')
    plt.ylabel(r'$\phi$')
    plt.axis([0, 5, -2, 3])

fig_2a()

Fig. 2(a) shows a solution to Exercise 2. Please notice that it rapidly converges to the value of $\phi \approx 1.6$, which is also the value that maximizes the exact posterior probability $p(v|u)$ shown in Fig. 1. 
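
As a cross-check on the value the gradient ascent converges to, one can solve for the stationary point of $F$ directly. This is an illustrative sketch that relies on the specific choice $g(\phi) = \phi^2$: setting $(v_p - \phi)/\varSigma_p + 2\phi\,(u - \phi^2)/\varSigma_u = 0$ and multiplying by $\varSigma_p \varSigma_u$ gives the cubic $-2\varSigma_p\phi^3 + (2\varSigma_p u - \varSigma_u)\phi + \varSigma_u v_p = 0$, whose roots can be found with `numpy.roots`. The variable names are hypothetical.

```python
import numpy as np

# Parameters of Exercise 1
u, sigma_u, v_p, sigma_p = 2.0, 1.0, 3.0, 1.0

# Coefficients of the cubic in phi, highest power first
coefficients = [-2.0 * sigma_p, 0.0, 2.0 * sigma_p * u - sigma_u, sigma_u * v_p]
roots = np.roots(coefficients)

# Keep only the real roots; for these parameter values there is exactly one
real_roots = roots[np.isclose(roots.imag, 0.0)].real
phi_star = real_roots[0]

# The gradient of F should vanish at the stationary point
gradient_at_phi_star = (v_p - phi_star) / sigma_p + (u - phi_star ** 2) / sigma_u * 2 * phi_star

print("stationary point of F:", phi_star)
print("gradient there:       ", gradient_at_phi_star)
```

For the parameters of Exercise 1 the cubic has a single real root, so the gradient ascent has only one point to converge to, and it agrees with the value of about 1.6 read off the figure.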

### 2.3. A possible neural implementation ###

One can envisage many possible ways in which the computation described in previous subsection could be implemented in neural circuits. In this paper we will present a possible implementation which satisfies the constraints of local computation and plasticity described in the Introduction. It slightly differs from the original implementation which is contained in Appendix A. 

While thinking about the neural implementation of the above computation, it is helpful to note that there are two similar terms in Eq. (8), so let us denote them by new variables. 

\begin{align*}
\varepsilon_p &= \frac{\phi - v_p}{\varSigma_p} \tag{10} \\
\varepsilon_u &= \frac{u - g(\phi)}{\varSigma_u}. \tag{11} \\
\end{align*}

**The above terms are the prediction errors$^1$: $\varepsilon_u$ expresses how much the light intensity differs from that expected if the size of the food item was $\phi$, while $\varepsilon_p$ denotes how the inferred size differs from prior expectations. With these new variables the equation for updating $\phi$ simplifies to:**

\begin{equation*}
\dot{\phi} = \varepsilon_u g'(\phi) - \varepsilon_p.
\tag{12}
\end{equation*}

The neural implementation of the model assumes that the model parameters $v_p$, $\varSigma_p$, and $\varSigma_u$ are encoded in the strengths of synaptic connections (as they need to be maintained over the animal’s lifetime), while variables $\phi$, $\varepsilon_u$, and $\varepsilon_p$ and the sensory input $u$ are maintained in the activity of neurons or neuronal populations (as they change rapidly when the sensory input is modified). In particular, we will consider very simple neural "nodes" which simply change their activity proportionally to the input they receive, so for example, Eq. (12) is implemented in the model by a node receiving input equal to the right hand side of this equation. **The prediction errors could be computed by the nodes with the following dynamics**$^2$: 

\begin{align*}
\dot{\varepsilon_p} &= \phi - v_p - \varSigma_p \varepsilon_p \tag{13} \\
\dot{\varepsilon_u} &= u - g(\phi) - \varSigma_u \varepsilon_u \tag{14} \\
\end{align*}

It is easy to show that the nodes with dynamics described by Eqs. (13)–(14) converge to the values defined in Eqs. (10)–(11). Once Eqs. (13)–(14) converge, then $\dot{\varepsilon} = 0$, so setting $\dot{\varepsilon} = 0$ and solving Eqs. (13)–(14) for $\varepsilon$, one obtains Eqs. (10)–(11). 
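
The convergence argument can also be illustrated numerically. The sketch below is an illustrative addition under stated assumptions: it holds $\phi$ fixed at an arbitrary value, integrates Eqs. (13)–(14) with Euler's method, and compares the result with the values given by Eqs. (10)–(11). The variances are deliberately chosen to differ from 1 so that the division by $\varSigma$ is visible; all numerical values and names are hypothetical.

```python
def g(v):
    return v ** 2

# Illustrative values (assumed for this sketch)
u, v_p = 2.0, 3.0
sigma_p, sigma_u = 2.0, 0.5
phi = 1.5                      # held fixed while the error nodes settle

dt, duration = 0.01, 30.0
eps_p, eps_u = 0.0, 0.0        # error nodes start from zero activity

for _ in range(int(round(duration / dt))):
    d_eps_p = phi - v_p - sigma_p * eps_p      # Eq. (13)
    d_eps_u = u - g(phi) - sigma_u * eps_u     # Eq. (14)
    eps_p += dt * d_eps_p
    eps_u += dt * d_eps_u

eps_p_target = (phi - v_p) / sigma_p           # Eq. (10)
eps_u_target = (u - g(phi)) / sigma_u          # Eq. (11)

print("eps_p:", eps_p, "target:", eps_p_target)
print("eps_u:", eps_u, "target:", eps_u_target)
```

In this sketch each node relaxes exponentially at a rate set by its own variance parameter, so the node with the smaller variance takes longer to settle.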


<html>
    <div class="image">
        <img src="Bogacz_fig3.png" width=400>
        <div style="margin: 20px 100px 20px 100px">
            Fig. 3. The architecture of the model performing simple perceptual inference. Circles denote neural "nodes", arrows denote excitatory connections, while lines ended with circles denote inhibitory connections. Labels above the connections encode their strength, and lack of label indicates the strength of 1. Rectangles indicate the values that need to be transmitted via the connections they label. 
        </div>
    </div>
</html>

The architecture of the network described by Eqs. (12)–(14) is shown in Fig. 3. Let us consider the computations in its nodes. The node $\varepsilon_p$ receives excitatory input from node $\phi$, inhibitory input from a tonically active neuron via a connection with strength $v_p$, and inhibitory input from itself via a connection with strength $\varSigma_p$, so it implements Eq. (13). The nodes $\phi$ and $\varepsilon_u$ analogously implement Eqs. (12) and (14), but here the information exchange between them is additionally affected by function $g$, and we will discuss this issue in more detail in Section 2.5. We have now described all the details necessary to simulate the model. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

Note that several of the neural "nodes" above ($\varepsilon_u$ and $\phi$) send both excitatory and inhibitory connections, while in biology, a neuron is either excitatory or inhibitory. Assuming all nodes, except the "$1$" node, are excitatory, an inhibitory projection would come from an additional inhibitory neuron that has the same activity as its corresponding excitatory neuron. This can be achieved by giving the inhibitory neuron the exact same input as the excitatory neuron. Doing this has no impact on how the network behaves, and so is left out here for simplicity.

While on biological realism, neurons generally do not send connections to themselves, but they do send them to other nearby neurons. Thus it is better to think of the neural nodes above as representing groups of neurons, rather than single neurons. That also helps with the fact that neurons communicate with action potentials (spikes), and not with continuous valued signals. We can then think of the average number of output spikes as representing the continuous valued signals. Finally, all connections are assumed to be instantaneous, i.e., have zero delay. We know this is not true in biology, with connection delays ranging over two orders of magnitude in cortex: from $0.1$[ms] to more than $10$[ms].

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

**Exercise 3.** *Simulate the model from Fig. 3 for the problem from Exercise 1. In particular, initialize $\phi = v_p, \varepsilon_p = \varepsilon_u = 0$, and find their values for the next 5 units of time.* 

In [4]:
def fig_2b(CU):
    
    def dphi_dt(phi, e_p, e_u):
        return e_u * dg_dphi(phi) - e_p

    def dep_dt(phi, e_p, e_u):
        return phi - v_p - sigma_p * e_p

    def deu_dt(phi, e_p, e_u):
        return u - g(phi) - sigma_u * e_u

    def draw_fig2b(tau=1, w_phi=1, w_ep=1, w_eu=1):
        [h.remove() for h in ax.get_children() if isinstance(h, matplotlib.lines.Line2D)]
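        trace    = np.zeros((steps, 3))
        # Use a float array so that the in-place Euler update below is well defined
        state    = np.array([phi, e_p, e_u], dtype=float)
        trace[0] = state
        for t in range(steps-1):
            # All three derivatives are evaluated at the current state before it is updated
            state += dt / tau * np.array([w_phi*dphi_dt(*state), w_ep*dep_dt(*state), w_eu*deu_dt(*state)])
            trace[t+1] = state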
        ax.plot(np.arange(steps) * dt, trace[:,0], color='C0')
        ax.plot(np.arange(steps) * dt, trace[:,1], color='C1')
        ax.plot(np.arange(steps) * dt, trace[:,2], color='C2')


    u       = 2
    sigma_u = 1
    v_p     = 3
    sigma_p = 1
    phi     = v_p
    e_p     = 0
    e_u     = 0
    dt      = 0.001
    dur     = 5
    steps   = int(dur/dt)

    tau     = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_phi   = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_ep    = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_eu    = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 

    fig = plt.figure(figsize=(4,4), num='Fig 2b')
    ax  = fig.add_subplot(1, 1, 1)
    ipw.interact(draw_fig2b, tau=tau, w_phi=w_phi, w_ep=w_ep, w_eu=w_eu)    
    plt.xlabel('time')
    plt.ylabel('Activity')
    plt.legend([r'$\phi$', r'$\varepsilon_p$', r'$\varepsilon_u$'])
    plt.axis([0, dur, -2, 3])

    
fig_2b(True)


The solution to Exercise 3 is shown in Fig. 2(b). The model converges to the same value as in Fig. 2(a), but the convergence is slower, as the model now includes multiple nodes connected by excitatory and inhibitory connections and such networks have oscillatory tendencies, so these oscillations need to settle for the network to converge. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

In my first comment, I pointed out that all measurements and internal values are assumed dimensionless. I struggle with treating time as dimensionless though, and it confuses me in many free-energy-related papers. If we look at the equations above:

\begin{align*}
\dot{\phi} &= \varepsilon_u g'(\phi) - \varepsilon_p \tag{12} \\
\dot{\varepsilon_p} &= \phi - v_p - \varSigma_p \varepsilon_p \tag{13} \\
\dot{\varepsilon_u} &= u - g(\phi) - \varSigma_u \varepsilon_u \tag{14} \\
\end{align*}

clearly, the left hand sides, which are derivatives with respect to time have a unit of [1/s], but the right hand sides are dimensionless. To correct for this, we need to explicitly give the neurons time constants and change these equations to:

\begin{align*}
\tau_{\phi} \dot{\phi} &= \varepsilon_u g'(\phi) - \varepsilon_p \tag{12a} \\
\tau_p \dot{\varepsilon_p} &= \phi - v_p - \varSigma_p \varepsilon_p \tag{13a} \\
\tau_u \dot{\varepsilon_u} &= u - g(\phi) - \varSigma_u \varepsilon_u \tag{14a} \\
\end{align*}

This makes it clear that the original equations assume a time constant of $1$ for each of the neurons, but we still haven't made explicit what the time scale is that time is measured on. If we assume the simulation above is $5 [s]$ long, then that means that the neural time constants are $1 [s]$ each, which is very large, but if the simulation above only represents $50 [ms]$ then the time constants are $10 [ms]$ which is more realistic. Above, I provide a slider to explore the effect of changing $\tau$. Changing $\tau$ is equivalent to changing the simulation time scale, so all it does is stretch or compress the curves in time.

We should also point out that there is nothing in the equations above that forces us to update each variable with the same gain. We could choose, for instance, to update $\varepsilon_u$ much more slowly than the other variables, by multiplying the right hand side of (14a) with a gain less than $1$. Changing these gains does not change the equilibrium state, which happens when the temporal derivatives are all $0$, but it does change how you get there. The above sliders allow you to explore these gains, and I have implemented the following equations, which assume the neural nodes all have identical time constants:

\begin{align*}
\tau \dot{\phi} &= w_{\phi} \left( \varepsilon_u g'(\phi) - \varepsilon_p \right) \tag{12b} \\
\tau \dot{\varepsilon_p} &= w_p \left( \phi - v_p - \varSigma_p \varepsilon_p \right) \tag{13b} \\
\tau \dot{\varepsilon_u} &= w_u \left( u - g(\phi) - \varSigma_u \varepsilon_u \right) \tag{14b} \\
\end{align*}

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>
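As a minimal sketch of the point about gains, the following standalone code integrates Eqs. (12b)–(14b) with the Euler method for two settings of $\tau$ and the gains, and compares the states they settle to. It assumes $g(\phi) = \phi^2$ and the values $u = 2$, $v_p = 3$, $\varSigma_p = \varSigma_u = 1$ used in the earlier exercises; the function name `settle` and the particular gain values are illustrative choices and do not come from the paper.

```python
import numpy as np

def g(phi):
    # Assumed nonlinearity from the earlier exercises: g(phi) = phi^2
    return phi ** 2

def dg_dphi(phi):
    return 2.0 * phi

def settle(tau=1.0, w_phi=1.0, w_ep=1.0, w_eu=1.0,
           u=2.0, v_p=3.0, sigma_p=1.0, sigma_u=1.0,
           dt=0.01, dur=40.0):
    """Euler-integrate Eqs. (12b)-(14b); return the final (phi, e_p, e_u)."""
    phi, e_p, e_u = v_p, 0.0, 0.0
    for _ in range(int(round(dur / dt))):
        # Right-hand sides are evaluated from the current state before updating
        d_phi = w_phi * (e_u * dg_dphi(phi) - e_p)
        d_ep = w_ep * (phi - v_p - sigma_p * e_p)
        d_eu = w_eu * (u - g(phi) - sigma_u * e_u)
        phi += dt / tau * d_phi
        e_p += dt / tau * d_ep
        e_u += dt / tau * d_eu
    return np.array([phi, e_p, e_u])

baseline = settle()                                         # all gains 1, tau = 1
regained = settle(tau=2.0, w_phi=0.5, w_ep=2.0, w_eu=2.0)   # illustrative gains

print("baseline (phi, e_p, e_u):", baseline)
print("regained (phi, e_p, e_u):", regained)
```

Both runs should end in the same state, because the gains and $\tau$ only scale the derivatives and therefore leave the point where all derivatives vanish unchanged.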

### 2.4. Learning model parameters ###

As our imaginary animal perceives food items through its lifetime, it may wish to refine its expectation about typical sizes of food items described by parameters $v_p$ and $\varSigma_p$, and about the amount of error it makes observing light intensity, described by parameter $\varSigma_u$. Thus it may wish to update the parameters $v_p$, $\varSigma_p$, and $\varSigma_u$ after each stimulus to gradually refine them. 

We wish to choose the model parameters for which the perceived light intensities $u$ are least surprising, or in other words most expected. Thus we wish to choose parameters that maximize $p(u)$. However, please recall that $p(u)$ is described by a complicated integral of Eq. (5), so it would be difficult to maximize $p(u)$ directly. Nevertheless, it is simple to maximize a related quantity $p(u, \phi)$, which is the joint probability of sensory input $u$ and our inferred food size $\phi$. Note that $p(u, \phi) = p(\phi)p(u|\phi)$, so $F = \ln p(u, \phi)$, thus maximization of $p(u, \phi)$ can be achieved by maximizing $F$. A more formal explanation for why the parameters can be optimized by maximizing $F$ will be provided in Section 3. 

The model parameters can be hence optimized by modifying them proportionally to the gradient of $F$. Starting with the expression in Eq. (7) it is straightforward to find the derivatives of $F$ over $v_p$, $\varSigma_p$, and $\varSigma_u$ (TRY IT YOURSELF): 

\begin{align*}
\frac{\partial{F}}{\partial{v_p}} &= \frac{\phi - v_p}{\varSigma_p} \tag{15} \\
\frac{\partial{F}}{\partial{\varSigma_p}} &= \frac{1}{2} \left( \frac{(\phi - v_p)^2}{\varSigma_p^2} - \frac{1}{\varSigma_p} \right) \tag{16} \\
\frac{\partial{F}}{\partial{\varSigma_u}} &= \frac{1}{2} \left( \frac{(u - g(\phi))^2}{\varSigma_u^2} - \frac{1}{\varSigma_u} \right) \tag{17} \\ 
\end{align*}
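For readers who want to check their derivation, the sketch below compares Eqs. (15)–(17) with central finite differences of $F$. It assumes that Eq. (7) has the form $F = \frac{1}{2}\left(-\ln \varSigma_p - (\phi - v_p)^2/\varSigma_p - \ln \varSigma_u - (u - g(\phi))^2/\varSigma_u\right) + C$ with $g(\phi) = \phi^2$; the numerical values and function names are arbitrary illustrations.

```python
import numpy as np

def g(phi):
    return phi ** 2  # assumed form of g, as in the earlier exercises

def F(phi, u, v_p, sigma_p, sigma_u):
    # Assumed form of Eq. (7), with the additive constant dropped
    return 0.5 * (-np.log(sigma_p) - (phi - v_p) ** 2 / sigma_p
                  - np.log(sigma_u) - (u - g(phi)) ** 2 / sigma_u)

def analytic_grads(phi, u, v_p, sigma_p, sigma_u):
    dF_dvp = (phi - v_p) / sigma_p                                     # Eq. (15)
    dF_dsp = 0.5 * ((phi - v_p) ** 2 / sigma_p ** 2 - 1.0 / sigma_p)   # Eq. (16)
    dF_dsu = 0.5 * ((u - g(phi)) ** 2 / sigma_u ** 2 - 1.0 / sigma_u)  # Eq. (17)
    return np.array([dF_dvp, dF_dsp, dF_dsu])

def numeric_grads(phi, u, v_p, sigma_p, sigma_u, h=1e-6):
    # Central differences with respect to (v_p, sigma_p, sigma_u)
    base = np.array([v_p, sigma_p, sigma_u], dtype=float)
    grads = np.zeros(3)
    for i in range(3):
        up, down = base.copy(), base.copy()
        up[i] += h
        down[i] -= h
        grads[i] = (F(phi, u, *up) - F(phi, u, *down)) / (2.0 * h)
    return grads

# Arbitrary illustrative values
args = dict(phi=1.6, u=2.0, v_p=3.0, sigma_p=2.0, sigma_u=1.5)
analytic = analytic_grads(**args)
numeric = numeric_grads(**args)
print("analytic:", analytic)
print("numeric: ", numeric)
```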

Let us now provide an intuition for why the parameter update rules have their particular form. We note that since parameters are updated after observing each food item, and different food items observed during the animal’s lifetime have different sizes, the parameters never converge. Nevertheless it is useful to consider the values of parameters for which the expected value of change is $0$, as these are the values in the vicinity of which the parameters are likely to be. For example, according to Eq. (15), the expected value of change in $v_p$ is $0$ when $\left<(\phi - v_p) / \varSigma_p\right> = 0$, where $\left<\right>$ denotes the expected value over trials. This will happen if $v_p = \left<\phi\right>$, i.e. when $v_p$ is indeed equal to the expected value of $\phi$. Analogously, the expected value of change in $\varSigma_p$ is $0$ when:

\begin{equation*}
\left< \frac{(\phi - v_p)^2}{\varSigma_p^2} - \frac{1}{\varSigma_p} \right> = 0.
\tag{18}
\end{equation*}

Rearranging the above condition one obtains $\varSigma_p = \left<(\phi - v_p)^2\right>$, thus the expected value of change in $\varSigma_p$ is $0$ when $\varSigma_p$ is equal to the variance of $\phi$. An analogous analysis can be made for $\varSigma_u$.

Eqs. (15)–(17) for update of model parameters simplify significantly when they are written in terms of prediction errors (TRY IT YOURSELF): 

\begin{align*}
\frac{\partial{F}}{\partial{v_p}} &= \varepsilon_p \tag{19} \\
\frac{\partial{F}}{\partial{\varSigma_p}} &= \frac{1}{2} \left( \varepsilon_p^2 - \varSigma_p^{-1} \right) \tag{20} \\
\frac{\partial{F}}{\partial{\varSigma_u}} &= \frac{1}{2} \left( \varepsilon_u^2 - \varSigma_u^{-1} \right). \tag{21}
\end{align*}

The above rules for update of parameters correspond to very simple synaptic plasticity mechanisms. All rules include only values that can be "known" by the synapse, i.e. the activities of pre-synaptic and post-synaptic neurons, and the strengths of the synapse itself. Furthermore, the rules are Hebbian, in the sense that they depend on the products of activity of pre-synaptic and post-synaptic neurons. For example, the change in $v_p$ in Eq. (19) is equal to the product of pre-synaptic activity (i.e. 1) and the post-synaptic activity $\varepsilon_p$. Similarly, the changes in $\varSigma$ in Eqs. (20)–(21) depend on the products of pre-synaptic and post-synaptic activities, both equal to $\varepsilon$.

The plasticity rules of Eqs. (20)–(21) also depend on the value of synaptic weights themselves, as they include terms $\varSigma^{-1}$. For the simple case considered in this section, the synapse "has access" to the information on its weight. Moreover, the dependence of synaptic plasticity on initial weights has been seen experimentally (Chen et al., 2013), so we feel it is plausible for the dependence predicted by the model to be present in real synapses. However, when the model is scaled up to include multiple features and sensory inputs in Section 4.1, terms $\varSigma^{-1}$ will turn into a matrix inverse (in Eqs. (48)–(49)), so the required changes in each weight will depend on the weights of other synapses in the network. Nevertheless, we will show in Section 5 how this problem can be overcome. 
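The fixed-point argument for $v_p$ and $\varSigma_p$ can be illustrated with a small stochastic simulation. The sketch below is not the network of Fig. 3: it assumes that the inferred value $\phi$ on each trial is simply drawn from a normal distribution with a hypothetical mean and variance, computes the prediction error at the equilibrium of Eq. (13), $\varepsilon_p = (\phi - v_p)/\varSigma_p$, and applies Eqs. (19)–(20) with a small learning rate. The learning rate, the floor on $\varSigma_p$ and all numerical values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_var = 3.0, 2.0      # hypothetical statistics of phi across trials
samples = rng.normal(true_mean, np.sqrt(true_var), size=40000)

alpha = 0.02                        # learning rate (assumed)
v_p, sigma_p = 2.0, 1.0             # initial parameter estimates (assumed)
history = np.zeros((samples.size, 2))

for i, phi in enumerate(samples):
    e_p = (phi - v_p) / sigma_p                            # equilibrium of Eq. (13)
    v_p += alpha * e_p                                     # Eq. (19)
    sigma_p += alpha * 0.5 * (e_p ** 2 - 1.0 / sigma_p)    # Eq. (20)
    sigma_p = max(sigma_p, 0.1)                            # small floor keeps sigma_p positive
    history[i] = (v_p, sigma_p)

# The parameters keep fluctuating, so average over the second half of the trials
v_p_avg, sigma_p_avg = history[samples.size // 2:].mean(axis=0)
print("average v_p:    ", v_p_avg, "(mean of phi is", true_mean, ")")
print("average sigma_p:", sigma_p_avg, "(variance of phi is", true_var, ")")
```

The estimates do not settle to fixed values but hover around the mean and variance of $\phi$, which is why the sketch reports averages over the later trials.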

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

We can extend the simulation from Exercise 3 to implement these updates too. However, **unlike the updates in (12)-(14), which happen continuously, the updates to the model parameters should really only be done once per food item and averaged over many items, so the updates should be small.** Since the parameters are represented by synaptic weights in Fig. 3, there now needs to be a control signal that indicates the state estimate has finished for a new food item, and the parameters can be updated. The paper doesn't discuss how this control signal is generated. In the simulation below, I simply force the update to happen at $t = 4.0$. 

I've reduced $\tau$ to $0.5$ to reach convergence of the state more quickly, made both variances different from $1$ and exaggerated the synaptic weight updates to show them more clearly. Note that the synaptic weight update at $t = 4.0$ also causes the state variables to change again. In particular, the synaptic weight updates aim to reduce the absolute value of the errors $\varepsilon$.

Below I also provide a slider to explore different input values (observed light) $u$. Note that when $u = 9$, $\phi = v_p = 3$, so that none of the state variables change, and only the variance is reduced after observing this food item.

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

In [5]:
def fig_2c(CU):
    
    def draw_fig2c(u, tau, w_phi, w_ep, w_eu, w_params):

        def dphi_dt(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            return e_u * dg_dphi(phi) - e_p

        def dep_dt(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            return phi - v_p - sigma_p * e_p

        def deu_dt(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            return u - g(phi) - sigma_u * e_u

        def dvp(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            return e_p

        def dsigmap(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            # Eq. (20), including the factor 1/2
            return (e_p**2 - 1/sigma_p)/2.0

        def dsigmau(phi, e_p, e_u, v_p, sigma_p, sigma_u):
            # Eq. (21), including the factor 1/2
            return (e_u**2 - 1/sigma_u)/2.0

        [h.remove() for h in ax.get_children() if isinstance(h, matplotlib.lines.Line2D)]
        trace  = np.zeros((steps, 6))
        state  = (phi, e_p, e_u)
        params = (v_p, sigma_p, sigma_u)
        trace[0, :3] = np.asarray(state)
        trace[0, 3:] = np.asarray(params)
        for t in range(steps-1):
            state  += dt / tau * np.array([w_phi*dphi_dt(*trace[t]), w_ep*dep_dt(*trace[t]), w_eu*deu_dt(*trace[t])])
            trace[t+1, :3] = np.asarray(state)
            trace[t+1, 3:] = np.asarray(params)
            if t == int(round(4.0/dt)):  # update the parameters once, at t = 4.0 (integer step comparison)
                params += w_params * np.array([dvp(*trace[t]), dsigmap(*trace[t]), dsigmau(*trace[t])])
                if params[1]<1: params[1] = 1
                if params[2]<1: params[2] = 1
                trace[t+1, 3:] = np.asarray(params)
        ax.plot(np.arange(steps) * dt, trace[:,0], color='C0')
        ax.plot(np.arange(steps) * dt, trace[:,1], color='C1')
        ax.plot(np.arange(steps) * dt, trace[:,2], color='C2')
        ax.plot(np.arange(steps) * dt, trace[:,3], color='C3')
        ax.plot(np.arange(steps) * dt, trace[:,4], color='C4')
        ax.plot(np.arange(steps) * dt, trace[:,5], color='C5')

    sigma_u = 1.5
    v_p     = 3
    sigma_p = 2
    phi     = v_p
    e_p     = 0
    e_u     = 0
    dt      = 0.001
    dur     = 5
    steps   = int(dur/dt)

    u         = ipw.FloatSlider(value=2,   min=1,  max=20, step=1,   continuous_update=CU)
    tau       = ipw.FloatSlider(value=0.5, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_phi     = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_ep      = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_eu      = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_params  = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 

    fig = plt.figure(figsize=(4,4), num='Fig 2c')
    ax  = fig.add_subplot(1, 1, 1)
    ipw.interact(draw_fig2c, u=u, tau=tau, w_phi=w_phi, w_ep=w_ep, w_eu=w_eu, w_params=w_params)    
    plt.xlabel('time')
    plt.ylabel('Activity')
    plt.legend([r'$\phi$', r'$\varepsilon_p$', r'$\varepsilon_u$', r'$v_p$', r'$\Sigma_p$', r'$\Sigma_u$'], ncol=2)
    plt.axis([0, dur, -2, 5])

    
fig_2c(True)



Finally, we would like to discuss the limits on parameters $\varSigma$. Although in principle the variance of a random variable can be equal to $0$, if $\varSigma_p = 0$ or $\varSigma_u = 0$, then Eq. (13) or (14) would not converge but instead $\varepsilon_p$ or $\varepsilon_u$ would diverge to positive or negative infinity. Similarly, if $\varSigma$ were close to $0$, the convergence would be very slow. To prevent this from happening, the minimum value of $1$ is imposed by Friston (2005) on the estimated variance.$^3$ 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

So here, the implicit conversion of a measurement to a dimensionless variable by division with the measurement unit matters. If we are using [m], or [mm] as the measurement unit, then a minimum variance of $1$ means something different. In this paper, it is not discussed how this minimum value is implemented. In the simulation above, I simply explicitly clip $\varSigma$ at a minimum value of $1$.

</span>
    <hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>


### 2.5. Learning the relationship between variables ###

So far we have assumed for simplicity that the relationship $g$ between the variable being inferred and the stimulus is known. However, in general it may not be known, or may need to be tuned. So we will now consider function $g(v, \theta)$ that also depends on a parameter which we denote by $\theta$. 

We will consider two special cases of function $g(v, \theta)$, where the parameter $\theta$ has a clear biological interpretation. First, let us consider a simple case of a linear function: $g(v, \theta) = \theta v$, as then the model has a straightforward neural implementation. In this case, Eqs. (12)–(14) describing the model simplify to: 

\begin{align*}
\dot{\phi} &= \theta \varepsilon_u - \varepsilon_p \tag{22}\\
\dot{\varepsilon_p} &= \phi - v_p - \varSigma_p \varepsilon_p \tag{23} \\
\dot{\varepsilon_u} &= u - \theta \phi - \varSigma_u \varepsilon_u \tag{24} \\
\end{align*}

In this model, nodes $\phi$ and $\varepsilon$ simply communicate through connections with weight $\theta$ as shown in Fig. 4(a). Furthermore, we can also derive the rule for updating the parameter $\theta$ by finding the gradient of $F$ over $\theta$, as now function $g$ in Eq. (7) depends on $\theta$ (TRY IT YOURSELF): 

\begin{equation*}
\frac{\partial{F}}{\partial{\theta}} = \varepsilon_u \phi.
\tag{25}
\end{equation*}

Please note that this rule is again Hebbian, as the synaptic weights encoding $\theta$ are modified proportionally to the activities of pre-synaptic and post-synaptic neurons (see Fig. 4(a)). 
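For the linear case the equilibrium can be written in closed form, which gives a simple check of Eqs. (22)–(24). Setting the derivatives to zero gives $\varepsilon_p = (\phi - v_p)/\varSigma_p$ and $\varepsilon_u = (u - \theta\phi)/\varSigma_u$, and substituting these into Eq. (22) yields $\phi = (v_p/\varSigma_p + \theta u/\varSigma_u)/(1/\varSigma_p + \theta^2/\varSigma_u)$. The sketch below integrates the three equations with the Euler method and compares the result with this expression; the parameter values and function names are illustrative assumptions.

```python
def settle_linear(u, v_p, sigma_p, sigma_u, theta, dt=0.01, dur=40.0):
    """Euler-integrate Eqs. (22)-(24); return the final (phi, e_p, e_u)."""
    phi, e_p, e_u = v_p, 0.0, 0.0
    for _ in range(int(round(dur / dt))):
        d_phi = theta * e_u - e_p                 # Eq. (22)
        d_ep = phi - v_p - sigma_p * e_p          # Eq. (23)
        d_eu = u - theta * phi - sigma_u * e_u    # Eq. (24)
        phi, e_p, e_u = phi + dt * d_phi, e_p + dt * d_ep, e_u + dt * d_eu
    return phi, e_p, e_u

def phi_closed_form(u, v_p, sigma_p, sigma_u, theta):
    # Equilibrium of Eqs. (22)-(24), derived in the paragraph above
    return (v_p / sigma_p + theta * u / sigma_u) / (1.0 / sigma_p + theta ** 2 / sigma_u)

# Illustrative values (assumed)
params = dict(u=2.0, v_p=3.0, sigma_p=1.0, sigma_u=1.0, theta=2.0)
phi, e_p, e_u = settle_linear(**params)
print("simulated phi:  ", phi)
print("closed-form phi:", phi_closed_form(**params))
print("Hebbian gradient for theta, Eq. (25):", e_u * phi)
```

With these values the settled $\varepsilon_u$ is negative, so Eq. (25) would decrease $\theta$: the current $\theta$ predicts more light than was observed.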

<html>
    <div class="image">
        <img src="Bogacz_fig4.png" width="1000">
        <div style="margin: 20px 100px 20px 100px">
            Fig. 4. Architectures of models with linear and nonlinear function $g$. Circles and hexagons denote linear and nonlinear nodes respectively. Filled arrows and lines ended with circles denote excitatory and inhibitory connections respectively, and an open arrow denotes a modulatory influence. 
        </div>
    </div>
</html>

Second, let us consider a case of a nonlinear function$^4$ $g(v, \theta) = \theta h(v)$, where $h(v)$ is a nonlinear function that just depends on $v$, as it results in only slightly more complex neural implementation. Furthermore, this situation is relevant to the example of the simple animal considered at the start of this section, as the light is proportional to the area, but the proportionality constant may not be known (this case is also relevant to the network that we will discuss in Section 4.1). In this case, Eqs. (12)–(14) describing the model become: 

\begin{align*}
\dot{\phi} &= \theta \varepsilon_u h'(\phi) - \varepsilon_p \tag{26}\\
\dot{\varepsilon_p} &= \phi - v_p - \varSigma_p \varepsilon_p \tag{27} \\
\dot{\varepsilon_u} &= u - \theta h(\phi) - \varSigma_u \varepsilon_u \tag{28} \\
\end{align*}

A possible network implementing this model is illustrated in Fig. 4(b), which now includes non-linear elements. In particular, the node $\phi$ sends to node $\varepsilon_u$ its activity transformed by a non-linear function, i.e. $\theta h(\phi)$. One could imagine that this could be implemented by an additional node receiving input from node $\phi$, transforming it via a non-linear transformation $h$ and sending its output to node $\varepsilon_u$ via a connection with the weight $\theta$. Analogously, the input from node $\varepsilon_u$ to node $\phi$ needs to be scaled by $\theta h'(\phi)$. Again one could imagine that this could be implemented by an additional node receiving input from node $\phi$, transforming it via a non-linear transformation $h'$ and modulating input received from node $\varepsilon_u$ via a connection with weight $\theta$ (alternatively, this could be implemented within the node $\phi$ by making it react to its input differentially depending on its level of activity). The details of the neural implementation of these non-linear transformations depend on the form of function $h$, and would be an interesting direction for future work. 

We also note that the update of the parameter $\theta$, i.e. gradient of $F$ over $\theta$ becomes: 

\begin{equation*}
\frac{\partial{F}}{\partial{\theta}} = \varepsilon_u h(\phi).
\tag{29}
\end{equation*}

This rule is Hebbian for the top connection labelled by $\theta$ in Fig. 4(b), as it is a product of activity of the pre-synaptic and post-synaptic nodes. It would be interesting to investigate how such a plasticity rule could be realized for the other connection with the weight of $\theta$ (from node $\varepsilon_u$ to $\phi$). We just note that for this connection the rule also satisfies the constraint of local plasticity (stated in the Introduction), as $\phi$ fully determines $h(\phi)$, so the change in weight is fully determined by the activity of pre-synaptic and post-synaptic neurons. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

We can again extend the simulation above to update $\theta$ too. This is also an update to a synaptic weight, and only once we have a good estimate of $\phi$ for the current food item, should we update $\theta$ according to (29). 

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

In [6]:
def fig_2d(CU):
    
    def draw_fig2d(u, tau, w_phi, w_ep, w_eu, w_params):

        def dphi_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return theta * e_u * dg_dphi(phi) - e_p

        def dep_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return phi - v_p - sigma_p * e_p

        def deu_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return u - theta * g(phi) - sigma_u * e_u

        def dvp(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return e_p

        def dsigmap(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return (e_p**2 - 1/sigma_p)/2.0

        def dsigmau(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return (e_u**2 - 1/sigma_u)/2.0
        
        def dtheta(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u):
            return e_u * g(phi)

        [h.remove() for h in ax.get_children() if isinstance(h, matplotlib.lines.Line2D)]
        trace  = np.zeros((steps, 7))
        state  = (phi, e_p, e_u)
        params = (v_p, theta, sigma_p, sigma_u)
        trace[0, :3] = np.asarray(state)
        trace[0, 3:] = np.asarray(params)
        for t in range(steps-1):
            state  += dt / tau * np.array([w_phi*dphi_dt(*trace[t]), w_ep*dep_dt(*trace[t]), w_eu*deu_dt(*trace[t])])
            trace[t+1, :3] = np.asarray(state)
            trace[t+1, 3:] = np.asarray(params)
            if t == int(round(4.0/dt)):  # update the parameters once, at t = 4.0 (integer step comparison)
                params += w_params * np.array([dvp(*trace[t]), dtheta(*trace[t]), dsigmap(*trace[t]), dsigmau(*trace[t])])
                if params[2]<1: params[2] = 1
                if params[3]<1: params[3] = 1
                trace[t+1, 3:] = np.asarray(params)
        ax.plot(np.arange(steps) * dt, trace[:,0], color='C0')
        ax.plot(np.arange(steps) * dt, trace[:,1], color='C1')
        ax.plot(np.arange(steps) * dt, trace[:,2], color='C2')
        ax.plot(np.arange(steps) * dt, trace[:,3], color='C3')
        ax.plot(np.arange(steps) * dt, trace[:,4], color='C4')
        ax.plot(np.arange(steps) * dt, trace[:,5], color='C5')
        ax.plot(np.arange(steps) * dt, trace[:,6], color='C6')


    sigma_u = 1.5
    v_p     = 3
    sigma_p = 2
    phi     = v_p
    e_p     = 0
    e_u     = 0
    dt      = 0.001
    dur     = 5
    theta   = 2
    steps   = int(dur/dt)

    u         = ipw.FloatSlider(value=2,   min=1,  max=20, step=1,   continuous_update=CU)
    tau       = ipw.FloatSlider(value=0.5, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_phi     = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_ep      = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_eu      = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 
    w_params  = ipw.FloatSlider(value=1.0, min=0.1, max=2, step=0.1, continuous_update=CU) 

    fig = plt.figure(figsize=(4,4), num='Fig 2d')
    ax  = fig.add_subplot(1, 1, 1)
    ipw.interact(draw_fig2d, u=u, tau=tau, w_phi=w_phi, w_ep=w_ep, w_eu=w_eu, w_params=w_params)    
    plt.xlabel('time')
    plt.ylabel('Activity')
    plt.legend([r'$\phi$', r'$\varepsilon_p$', r'$\varepsilon_u$', r'$v_p$', r'$\theta$', r'$\Sigma_p$', r'$\Sigma_u$'], ncol=3)
    plt.axis([0, dur, -2, 5])

    
fig_2d(True)



<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

As pointed out in the paper, the model parameters will never converge as the size of each food item will be different. To illustrate this, we need to observe many food items. Below you can run the simulation continuing on from the previous one. It is set up so that the average size of the food item is $3$, $\theta = 1$, and the real variances of the food item size and their luminosity are small, much smaller than the minimum model parameter of $1$, so that the system can learn more quickly what the average value is. Therefore, $v_p$ should converge to $3$ and $\varSigma_p$ and $\varSigma_u$ should converge to their minimum value of $1$.

Note how unstable this is, particularly for $\phi$ and $\varepsilon_u$ once $\varSigma$ converges to $1$. 

There is also an unintended consequence to setting the variance of $v$ and $u$ so small: the system does not explore the nonlinearity of $h$ very much. All sensor values of $u$ are close to $9$. This is modelled by $\theta h(\phi)$, but there isn't enough information about which part is linear, due to $\theta$, and which part is nonlinear, due to $h(\phi)$. This seemed to result in both $\phi$ and $v_p$ trending towards $0$, which makes the first term in equation (8) small, while $\theta$ gets large so that $\theta h(\phi)$ is still close to $u$. To avoid this, in the code below, I have set the update to $\theta$ to $0$ so that it stays at the correct initial value of $1$. I tried fixing it by setting $\varSigma_p=3$, but I couldn't get that to remain stable.

You can also reset the initial internal values at the start of each run by clicking the 'reset' button. Note that this is a ToggleButton, so if you click it once, it will reset all internal variables every time you click 'continue'. Only if you click 'reset' again will it stop resetting and continue on from the end of the previous run.

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

In [7]:
from IPython.display import display
def fig_2e(CU):
    global state, params
    
    def reset_all():
        global state, params
        phi     = 2
        e_p     = 0
        e_u     = 0
        v_p     = 2
        theta   = 1
        sigma_p = 3
        sigma_u = 3
        state  = (phi, e_p, e_u)
        params = (v_p, theta, sigma_p, sigma_u)


    def draw_fig2e(tau, w_phi, w_ep, w_eu, w_params, reset, cont):
        global state, params

        def dphi_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return theta * e_u * dg_dphi(phi) - e_p

        def dep_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return phi - v_p - sigma_p * e_p

        def deu_dt(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return u - theta * g(phi) - sigma_u * e_u

        def dvp(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return e_p

        def dsigmap(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return (e_p**2 - 1/sigma_p)/2.0

        def dsigmau(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return (e_u**2 - 1/sigma_u)/2.0
        
        def dtheta(phi, e_p, e_u, v_p, theta, sigma_p, sigma_u, u, v):
            return e_u * g(phi)

        if reset == True:
            reset_all()
        (phi, e_p, e_u) = state
        (v_p, theta, sigma_p, sigma_u) = params

        [h.remove() for h in ax.get_children() if isinstance(h, matplotlib.lines.Line2D)]
        # draw a random food item from this distribution
        v      = max(np.sqrt(Sigma_p) * np.random.randn() + mu_p, 0)
        u      = max(np.sqrt(Sigma_u) * np.random.randn() + Theta*g(v), 0)
        trace  = np.zeros((steps, 9))
        trace[0, 0:3] = np.asarray(state)
        trace[0, 3:7] = np.asarray(params)
        trace[0, 7]   = u
        trace[0, 8]   = v
        
        for t in range(steps-1):
            state += dt / tau * np.array([w_phi*dphi_dt(*trace[t]), w_ep*dep_dt(*trace[t]), w_eu*deu_dt(*trace[t])])
            trace[t+1, 0:3] = np.asarray(state)
            trace[t+1, 3:7] = np.asarray(params)
            trace[t+1, 7] = u
            trace[t+1, 8] = v

            if t % int(round(5/dt)) == 0:  # every 5 time units: update parameters, then draw a new food item
                params += w_params * np.array([dvp(*trace[t]), 0*dtheta(*trace[t]), dsigmap(*trace[t]), dsigmau(*trace[t])])
                if params[2]<1: params[2] = 1
                if params[3]<1: params[3] = 1
                trace[t+1, 3:7] = np.asarray(params)
                v = max(np.sqrt(Sigma_p) * np.random.randn() + mu_p, 0.1)
                u = max(np.sqrt(Sigma_u) * np.random.randn() + Theta*g(v), 0.01)
                trace[t+1, 7] = u
                trace[t+1, 8] = v

        ax.plot(np.arange(steps) * dt, trace[:,0], color='C0')
        ax.plot(np.arange(steps) * dt, np.sqrt(trace[:,7]), color='C7')
        ax.plot(np.arange(steps) * dt, trace[:,8], color='C8')
        ax.plot(np.arange(steps) * dt, trace[:,1], color='C1')
        ax.plot(np.arange(steps) * dt, trace[:,2], color='C2')
        ax.plot(np.arange(steps) * dt, trace[:,3], color='C3')
        ax.plot(np.arange(steps) * dt, trace[:,4], color='C4')
        ax.plot(np.arange(steps) * dt, trace[:,5], color='C5')
        ax.plot(np.arange(steps) * dt, trace[:,6], color='C6')
        # labels follow the order of the plot calls above: phi, sqrt(u), v, ...
        plt.legend([r'$\phi$', r'$\sqrt{u}$', r'$v$', 
                    r'$\varepsilon_p$', r'$\varepsilon_u$', r'$v_p$',   
                    r'$\theta$', r'$\Sigma_p$', r'$\Sigma_u$'], ncol=3, loc=1)



    # set real world mean and variance of food size and light intensity
    mu_p    = 3      
    Sigma_p = 2.5
    Sigma_u = 0.1
    Theta   = 1.0

    # simulation parameters
    dt      = 0.01
    dur     = 100
    steps   = int(dur/dt)

    # initialise state and parameter estimates
    reset_all()
    
    tau       = ipw.FloatSlider(value=1, min=0.1, max=20, step=0.1, continuous_update=CU) 
    w_phi     = ipw.FloatSlider(value=1, min=0.1, max=20, step=0.1, continuous_update=CU) 
    w_ep      = ipw.FloatSlider(value=1, min=0.1, max=20, step=0.1, continuous_update=CU) 
    w_eu      = ipw.FloatSlider(value=1, min=0.1, max=20, step=0.1, continuous_update=CU) 
    w_params  = ipw.FloatSlider(value=0.1, min=0.1, max=2, step=0.1, continuous_update=CU) 
    sliders   = ipw.VBox([tau, w_phi, w_ep, w_eu, w_params]) 
    reset     = ipw.ToggleButton(value=False, description='Reset', button_style='info')
    cont      = ipw.ToggleButton(value=False, description='Continue', button_style='info')
    buttons   = ipw.VBox([reset, cont])
    controls  = ipw.HBox([sliders, buttons])

    fig = plt.figure(figsize=(8,4), num='Fig 2e')
    ax  = fig.add_subplot(1, 1, 1)
    myplot = ipw.interactive(draw_fig2e, tau=tau, w_phi=w_phi, w_ep=w_ep, w_eu=w_eu, w_params=w_params, reset=reset, cont=cont)
    cont.value = True
    display(controls)
    plt.xlabel('time')
    plt.ylabel('Activity')
    plt.axis([0, dur, -2, 5])

    
fig_2e(False)



## 3. Free-energy ##

In this section we discuss how the computations in the model relate to a technique of statistical inference involving minimization of free-energy. There are three reasons for describing this relationship. First, it will provide more insight for why the parameters can be optimized by maximization of $F$. Second, the concept of free-energy is critical for understanding of more complex models (Friston et al., 2013), which not only estimate the most likely values of variables, but also their distribution. Third, the free-energy is a very interesting concept on its own, and has applications in mathematical psychology (Ostwald, Kirilina, Starke, & Blankenburg, 2014). 

We now come back to the example of an inference by a simple organism, and discuss how the exact inference described in Section 2.1 can be approximated. As we noted in Section 2.1, the posterior distribution $p(v|u)$ may have a complicated shape, so we will approximate it with another distribution, which we denote $q(v)$. Importantly, we will assume that $q(v)$ has a standard shape, so we will be able to characterize it by parameters of this typical distribution. For example, if we assume that $q(v)$ is normal, then to fully describe it, we can infer just two numbers: its mean and variance, instead of infinitely many numbers potentially required to characterize a distribution of an arbitrary shape. 

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">

I'm repeating the equations from Section 2.1 here so you don't have to scroll back.

\begin{align*}
p(v|u) &= \frac{p(v)p(u|v)}{p(u)} \tag{4} \\
p(u) &= \int p(v)p(u|v)dv \tag{5}
\end{align*}

</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

For simplicity, here we will use an even simpler shape of the approximate distribution, namely the delta distribution, which has all its mass cumulated in one point which we denote by $\phi$ (i.e. the delta distribution is equal to $0$ for all values different from $\phi$, but its integral is equal to $1$). Thus we will try to infer from observation just one parameter $\phi$ which will characterize the most likely value of $v$. 

We now describe what criterion we wish our approximate distribution to satisfy. We will seek the approximate distribution $q(v)$ which is as close as possible to the actual posterior distribution $p(v|u)$. Mathematically, the dissimilarity between two distributions is measured by the Kullback–Leibler divergence defined as: 

\begin{equation*}
KL(q(v), p(v|u)) = \int{q(v) \ln \frac{q(v)}{p(v|u)} dv}.
\tag{30}
\end{equation*}

For readers not familiar with the Kullback–Leibler divergence we would like to clarify why it is a measure of dissimilarity between the distributions. Please note that if the two distributions $q(v)$ and $p(v|u)$ were identical, the ratio $q(v)/p(v|u)$ would be equal to $1$, so its logarithm would be equal to $0$, and so the whole expression in Eq. (30) would be $0$. The Kullback–Leibler divergence also has a property that the more different the two distributions are, the higher its value is (see Ostwald et al. (2014) for more details). 
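These two properties can be seen in a small numerical example. The sketch below approximates the integral of Eq. (30) by a sum on a grid for two normal distributions with unit variance whose means differ by a shift $s$, a case for which the divergence has the known closed form $s^2/2$. The grid, the choice of distributions and the function names are assumptions made purely for illustration.

```python
import numpy as np

def normal_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_on_grid(q, p, dv):
    # Riemann-sum approximation of the integral in Eq. (30)
    return np.sum(q * np.log(q / p)) * dv

v = np.linspace(-10.0, 10.0, 20001)
dv = v[1] - v[0]
q = normal_pdf(v, 0.0, 1.0)

shifts = [0.0, 0.5, 1.0, 2.0]
kl_values = [kl_on_grid(q, normal_pdf(v, s, 1.0), dv) for s in shifts]
closed_form = [0.5 * s ** 2 for s in shifts]   # KL between unit-variance normals

for s, kl, exact in zip(shifts, kl_values, closed_form):
    print("shift", s, "numerical KL", kl, "closed form", exact)
```

The divergence is $0$ when the two distributions coincide and grows as their means move apart.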

Since we assumed above that our simplified distribution is a delta function, we will simply seek the value of its centre parameter $\phi$ which minimizes the Kullback–Leibler divergence defined in Eq. (30). 

It may seem that the minimization of Eq. (30) is still difficult, because to compute term $p(v|u)$ present in Eq. (30) from Bayes’ theorem (Eq. (4)) one needs to compute the difficult normalization integral (Eq. (5)). However, we will now show that there exists another way of finding the approximate distribution $q(v)$ that does not involve the complicated computation of the normalization integral. 

Substituting the definition of conditional probability $p(v|u) = p(u, v)/p(u)$ into Eq. (30) we obtain: 

\begin{align*}
KL(q(v), p(v|u)) &= \int{q(v) \ln \frac{q(v)p(u)}{p(u,v)} dv} \\
&= \int{q(v) \ln \frac{q(v)}{p(u,v)} dv} + \int{q(v) \ln p(u) dv} \\
&= \int{q(v) \ln \frac{q(v)}{p(u,v)} dv} + \ln p(u) \\
\tag{31}
\end{align*}

In the transition from the second to the third line we used the fact that $q(v)$ is a probability distribution so its integral is $1$. The integral in the last line of the above equation is called free-energy, and we will denote its negative by $F$, because we will show below that under certain assumptions the negative free-energy is equal (modulo a constant) to the function $F$ we defined and used in the previous section: 

\begin{equation*}
F = \int{q(v) \ln \frac{p(u,v)}{q(v)} dv}
\tag{32}
\end{equation*}

In the above equation we used the property of logarithms that $-\ln(a/b) = \ln(b/a)$. So, the negative free-energy is related to the Kullback–Leibler divergence in the following way: 

\begin{equation*}
KL (q(v), p(v|u)) = -F + \ln p(u). 
\tag{33}
\end{equation*}

Now please note that $\ln p(u)$ does not depend on $\phi$ (which is a parameter describing $q(v)$), so the value of $\phi$ that minimizes the distance between $q(v)$ and $p(v|u)$ is the same value as that which maximizes $F$. Therefore instead of minimizing the Kullback–Leibler divergence we can maximize $F$, and this will have two benefits: first, as we already mentioned above, $F$ is easier to compute as it does not involve the complicated computation of the normalization term. Second, as we will see later, it will allow us to naturally introduce learning about the parameters of the model. 
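The relation in Eq. (33) can be checked numerically. The sketch below does so on a grid for a one-dimensional model with normal prior and likelihood; the specific values of $u$, $v_p$, the variances, the form $g(v) = v^2$, and the use of a normal (rather than delta) $q(v)$ are all assumptions chosen only to make the check concrete.

```python
import numpy as np

# Illustrative values (assumptions): observed input, prior mean, variances.
u, v_p, Sigma_p, Sigma_u = 2.0, 3.0, 1.0, 1.0

def g(v):
    # Assumed nonlinear relation between the feature and the expected input.
    return v ** 2

def log_normal(x, mean, var):
    """Logarithm of a one-dimensional normal density."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

v = np.linspace(-6.0, 8.0, 14001)   # integration grid (assumed range)
dv = v[1] - v[0]

log_joint = log_normal(v, v_p, Sigma_p) + log_normal(u, g(v), Sigma_u)  # ln p(u, v)
log_p_u = np.log(np.sum(np.exp(log_joint)) * dv)                        # Eq. (5)
log_posterior = log_joint - log_p_u                                     # Eq. (4)

def F_and_KL(q_mean, q_var):
    """Negative free energy (Eq. 32) and KL divergence (Eq. 30) for a normal q."""
    log_q = log_normal(v, q_mean, q_var)
    q = np.exp(log_q)
    F = np.sum(q * (log_joint - log_q)) * dv
    KL = np.sum(q * (log_q - log_posterior)) * dv
    return F, KL

F, KL = F_and_KL(q_mean=1.6, q_var=0.1)
print(KL, -F + log_p_u)   # the two numbers should agree, as in Eq. (33)
```

Working with log-densities avoids numerical underflow in the tails. Note that only the computation of `KL` needs the normalization term `log_p_u`; `F` is computed from the joint distribution alone.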

Let us first note that by assuming that $q(v)$ is a delta distribution, the negative free energy simplifies to:

\begin{align*}
F &= \int{q(v) \ln \frac{p(u,v)}{q(v)} dv} \\
&= \int{q(v) \ln p(u,v) dv} - \int{q(v) \ln q(v) dv} \\
&= \ln p(u, \phi) + C_1.
\tag{34}
\end{align*}

In the transition from the first to the second line above we used the property of logarithms $\ln(a/b) = \ln a - \ln b$. In the transition from the second line to the third line we used the property of a delta function $\delta(x)$ with centre $\phi$ that for any function $h(x)$, the integral of $\delta(x)h(x)$ is equal to $h(\phi)$. Furthermore, since the value of the second integral in the second line of the above equation does not depend on $\phi$ (so it will cancel when we compute the derivative over $\phi$) we denote it by a constant $C_1$.

Now using $p(u, \phi) = p(\phi)p(u|\phi)$, and ignoring constant $C_1$, we obtain the expression for $F$ we introduced previously in Eq. (6). Thus finding approximate delta distribution $q(v)$ through minimization of free-energy is equivalent to the inference of features in the model described in the previous section. It is worth noting that Eq. (34) states that the best centre for our approximate distribution (i.e. our best guess for the size of the food item) is the value $v = \phi$ which maximizes the joint probability $p(u, \phi)$. 

We now discuss how the concept of free-energy will help us to understand why the parameters of the model can be learnt by maximization of $F$. Recall from Section 2.4 that we wish to find parameters for which the sensory observations are least surprising, i.e. those which maximize $p(u)$. To see the relationship between maximizing $p(u)$ and maximizing $F$, we note that according to Eq. (33), $p(u)$ is related to the negative free-energy in the following way: 

\begin{equation*}
\ln p(u) = F + KL (q(v), p(v|u)). 
\tag{35}
\end{equation*}

Since Kullback–Leibler divergence is non-negative, $F$ is a lower bound on $\ln p(u)$, thus by maximizing $F$ we maximize the lower bound on $\ln p(u)$. So in summary, by maximizing $F$ we can both find an approximate distribution $q(v)$ (as discussed earlier), and optimize model parameters. However, there is a twist here: we wish to maximize the average of $p(u)$ across trials (or here observations of different food items). Thus on each trial we need to modify the model parameters just a little bit (rather than until minimum of free energy is reached as was the case for $\phi$). 

## 4. Scaling up the model of perception ##
In this section we will show how the model scales up to the networks inferring multiple features and involving hierarchy. 

<html>
    <div class="image">
        <img src="Bogacz_tab1.png", width=800>
    </div>
</html>

### 4.1. Increasing the dimension of sensory input ###

The model naturally scales up to the case of multiple sensory inputs from which we estimate multiple variables. Such scaled model could be used to describe information processing within a cortical area (e.g. primary visual cortex) which infers multiple features (e.g. edges at different position and orientation) on the basis of multiple inputs (e.g. information from multiple retinal receptors preprocessed by the thalamus). This section shows that when the dimensionality of inputs and features is increased, the dynamics of nodes in the networks and synaptic plasticity are described by the same rules as in Section 2, just generalized to multiple dimensions. 

The only complication in explaining this case lies in the necessity to use matrix notation, so let us make this notation very explicit: we will denote single numbers in italic (e.g. $x$), column vectors by bar (e.g. $\bar{x}$), and matrices in bold (e.g. $\mathbf{x}$). So we assume the animal has observed sensory input $\bar{u}$ and estimates the most likely values $\bar{\phi}$ of variables $\bar{v}$. We further assume that the animal has prior expectation that the variables $\bar{v}$ come from multivariate normal distribution with mean $\bar{v}_p$ and covariance matrix $\mathbf{\Sigma_p}$, i.e. $p(\bar{v}) = f (\bar{v}; \bar{v}_p, \mathbf{\Sigma_p})$ where: 

\begin{equation*}
f (\bar{x}; \bar{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^N |\mathbf{\Sigma}|}} \exp \left( -\frac{1}{2}(\bar{x}-\bar{\mu})^T \mathbf{\Sigma}^{-1} (\bar{x}-\bar{\mu}) \right).
\tag{36}
\end{equation*}

In the above equation $N$ denotes the length of vector $\bar{x}$, and $|\mathbf{\Sigma}|$ denotes the determinant of matrix $\mathbf{\Sigma}$. Analogously, the probability of observing sensory input given the values of variables is given by $p(\bar{u}|\bar{v}) = f (\bar{u}; g(\bar{v}, \mathbf{\Theta}), \mathbf{\Sigma_u})$, where $\mathbf{\Theta}$ are parameters of function $g$. We denote these parameters by a matrix $\mathbf{\Theta}$, as we will consider a generalization of the function $g$ discussed in Section 2.5, i.e. $g(\bar{v}, \mathbf{\Theta}) = \mathbf{\Theta}h(\bar{v})$, where each element $i$ of vector $h(\bar{v})$ depends only on $v_i$. This function corresponds to an assumption often made by models of feature extraction (Bell & Sejnowski, 1995; Olshausen & Field, 1995), that stimuli are formed by a linear combination of features.$^5$ Moreover, such a function $g$ can be easily computed as it is equal to an input to a layer of neurons from another layer with activity $h(\bar{v})$ via connections with strength $\mathbf{\Theta}$. 
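The density in Eq. (36) can be written down directly with NumPy. The following is a minimal sketch; the function name `mvn_pdf` and the example values are assumptions for illustration only.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density f(x; mu, Sigma) as in Eq. (36)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    n = x.size                                    # N, the length of the vector
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Example: two correlated features (values chosen arbitrarily).
p_val = mvn_pdf([1.0, -1.0], [0.0, 0.0], [[2.0, 0.3], [0.3, 0.5]])
print(p_val)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; for a diagonal covariance matrix the result reduces to a product of one-dimensional normal densities.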

We can state the negative free energy, analogously as for the simple model considered in Eq. (7) (TRY IT YOURSELF): 

\begin{align*}
F &= \ln p(\bar{\phi}) + \ln p(\bar{u}|\bar{\phi}) \\
&= \frac{1}{2} \left( - \ln |\mathbf{\Sigma_p}| - (\bar{\phi} - \bar{v}_p)^T \mathbf{\Sigma_p}^{-1} (\bar{\phi} - \bar{v}_p) - \ln |\mathbf{\Sigma_u}| - (\bar{u} - g(\bar{\phi}, \mathbf{\Theta}))^T \mathbf{\Sigma_u}^{-1} (\bar{u} - g(\bar{\phi}, \mathbf{\Theta})) \right) + C.
\tag{37} 
\end{align*}

Analogously as before, to find the vector of most likely values of features $\bar{\phi}$, we will calculate the gradient (vector of derivatives $\partial{F}/\partial{\phi_i}$) which we will denote by $\partial{F}/\partial{\bar{\phi}}$. We will use the elegant property that rules for computation of derivatives generalize to vectors and matrices. To get an intuition for these rules we recommend the following exercise that shows how the rule $\partial{x^2}/\partial{x} = 2x$ generalizes to vectors.

**Exercise 4.** *Show that for any vector $\bar{x}$ the gradient of function $y = \bar{x}^T \bar{x}$ is equal to: $\partial{y}/\partial{\bar{x}} = 2\bar{x}$.* 

*It is easiest to consider a vector of two numbers (the analogous result can be shown for longer vectors):*

\begin{equation*}
\bar{x} =
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}.
\end{equation*}

*Then $y = \bar{x}^T \bar{x} = x_1^2 + x_2^2$, so the gradient is equal to:*

\begin{equation*}
\frac{\partial{y}}{\partial{\bar{x}}} =
\begin{bmatrix}
\frac{\partial{y}}{\partial{x_1}} \\
\frac{\partial{y}}{\partial{x_2}} 
\end{bmatrix} =
\begin{bmatrix}
2 x_1 \\
2 x_2
\end{bmatrix} =
2 \bar{x}.
\end{equation*}
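The result of Exercise 4 can also be checked numerically with finite differences. In the sketch below the helper `numerical_gradient` and the example vector are assumptions introduced for illustration.

```python
import numpy as np

def y(x):
    return x @ x          # y = x^T x

def numerical_gradient(f, x, step=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = step
        grad[i] = (f(x + dx) - f(x - dx)) / (2 * step)
    return grad

x = np.array([1.5, -0.5, 2.0])    # arbitrary example vector
print(numerical_gradient(y, x))   # numerical estimate of the gradient
print(2 * x)                      # analytic gradient 2x
```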

Using an analogous method as that in the solution to Exercise 4 (at the end of the paper) one can see that several other rules generalize as summarized in Table 1. These rules can be applied for symmetric matrices, but since $\mathbf{\Sigma}$ are covariance matrices, they are symmetric, so we can use the top two rules in Table 1 to compute the gradient of the negative free energy (TRY IT YOURSELF): 

\begin{equation*}
\frac{\partial{F}}{\partial{\bar{\phi}}} = -\mathbf{\Sigma_p}^{-1} (\bar{\phi} - \bar{v}_p) + \frac{\partial{g(\bar{\phi}, \mathbf{\Theta})}^T}{\partial{\bar{\phi}}} \mathbf{\Sigma_u}^{-1} (\bar{u} - g(\bar{\phi}, \mathbf{\Theta}))
\tag{38} 
\end{equation*} 

In the above equation, terms appear which are generalizations of the prediction errors we defined for the simple models: 

\begin{align*}
\bar{\varepsilon}_p &= \mathbf{\Sigma_p}^{-1} (\bar{\phi} - \bar{v}_p) \tag{39} \\ 
\bar{\varepsilon}_u &= \mathbf{\Sigma_u}^{-1} (\bar{u} - g(\bar{\phi}, \mathbf{\Theta})) \tag{40} 
\end{align*}
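A finite-difference check can confirm how the two prediction errors enter the gradient of $F$ from Eq. (37): the prior error $\bar{\varepsilon}_p$ with a negative sign and the sensory error $\bar{\varepsilon}_u$ with a positive sign. The sketch below assumes $g(\bar{\phi}, \mathbf{\Theta}) = \mathbf{\Theta} h(\bar{\phi})$ with $h = \tanh$ applied element-wise; this choice of $h$ and all numerical values are assumptions made for illustration.

```python
import numpy as np

# Illustrative parameters (all values are assumptions made for this example).
u = np.array([1.0, 0.5])
v_p = np.array([0.5, -0.2])
Sigma_p = np.array([[1.0, 0.2], [0.2, 1.5]])
Sigma_u = np.array([[0.8, 0.1], [0.1, 0.5]])
Theta = np.array([[1.0, 0.5], [-0.3, 2.0]])

def h(phi):
    return np.tanh(phi)              # assumed element-wise nonlinearity

def h_prime(phi):
    return 1.0 - np.tanh(phi) ** 2

def F(phi):
    """Negative free energy of Eq. (37), without the constant C."""
    d_p = phi - v_p
    d_u = u - Theta @ h(phi)
    return 0.5 * (-np.log(np.linalg.det(Sigma_p)) - d_p @ np.linalg.solve(Sigma_p, d_p)
                  - np.log(np.linalg.det(Sigma_u)) - d_u @ np.linalg.solve(Sigma_u, d_u))

def analytic_gradient(phi):
    eps_p = np.linalg.solve(Sigma_p, phi - v_p)             # Eq. (39)
    eps_u = np.linalg.solve(Sigma_u, u - Theta @ h(phi))    # Eq. (40)
    # (dg/dphi)^T eps_u equals h'(phi) * (Theta^T eps_u) for g = Theta h(phi)
    return -eps_p + h_prime(phi) * (Theta.T @ eps_u)

def numerical_gradient(f, x, step=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = step
        grad[i] = (f(x + dx) - f(x - dx)) / (2 * step)
    return grad

phi = np.array([0.3, -0.8])
print(analytic_gradient(phi))
print(numerical_gradient(F, phi))
```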

With the error terms defined, the equation describing the update of $\bar{\phi}$ becomes: 

\begin{equation*}
\dot{\bar{\phi}} = -\bar{\varepsilon}_p + \frac{\partial{g(\bar{\phi}, \mathbf{\Theta})}^T}{\partial{\bar{\phi}}} \bar{\varepsilon}_u.
\tag{41}
\end{equation*} 

The partial derivative term in the above equation is a matrix that contains in each entry with co-ordinates $(i, j)$ the derivative of element $i$ of vector $g(\bar{\phi}, \mathbf{\Theta})$ over $\phi_j$. To see how the above equation simplifies for our choice of function $g$, it is helpful without loss of generality to consider a case of $2$ features being estimated from $2$ stimuli. Then: 

\begin{equation*}
g(\bar{\phi}, \mathbf{\Theta}) = \mathbf{\Theta} h(\bar{\phi}) = 
\begin{bmatrix}
\theta_{1,1} h(\phi_1) + \theta_{1,2} h(\phi_2) \\
\theta_{2,1} h(\phi_1) + \theta_{2,2} h(\phi_2) \\
\end{bmatrix}
\tag{42}
\end{equation*}

Hence we can find the derivatives of elements of the above vector over the elements of vector $\bar{\phi}$: 

\begin{equation*}
\frac{\partial{g(\bar{\phi}, \mathbf{\Theta})}}{\partial{\bar{\phi}}} = 
\begin{bmatrix}
\theta_{1,1} h'(\phi_1) & \theta_{1,2} h'(\phi_2) \\
\theta_{2,1} h'(\phi_1) & \theta_{2,2} h'(\phi_2) \\
\end{bmatrix}
\tag{43}
\end{equation*}

Now we can see that Eq. (41) can be written as: 

\begin{equation*}
\dot{\bar{\phi}} = -\bar{\varepsilon}_p + h'(\bar{\phi}) \times \mathbf{\Theta}^T \bar{\varepsilon}_u.
\tag{44}
\end{equation*} 

In the above equation $\times$ denotes element by element multiplication, so term $h'(\bar{\phi}) \times \mathbf{\Theta}^T \bar{\varepsilon}_u$ is a vector where its element $i$ is equal to a product of $h'(\phi_i)$ and element $i$ of vector $\mathbf{\Theta}^T \bar{\varepsilon}_u$. Analogously, as for the simple model, prediction errors could be computed by nodes with the following dynamics: 

\begin{align*}
\dot{\bar{\varepsilon}}_p &= \bar{\phi} - \bar{v}_p - \mathbf{\Sigma_p}\bar{\varepsilon}_p \tag{45} \\
\dot{\bar{\varepsilon}}_u &= \bar{u} - \mathbf{\Theta}h(\bar{\phi}) - \mathbf{\Sigma_u}\bar{\varepsilon}_u. \tag{46}
\end{align*}

It is easy to see that Eqs. (45)–(46) have fixed points at values given by Eqs. (39)–(40) by setting the left hand sides of Eqs. (45)–(46) to 0. The architecture of the network with the dynamics described by Eqs. (44)–(46) is shown in Fig. 5, and it is analogous to that in Fig. 4(b). 
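To illustrate these dynamics, the sketch below integrates the node equations with the Euler method for 2 features and 2 inputs, with the feature nodes driven by $-\bar{\varepsilon}_p + h'(\bar{\phi}) \times \mathbf{\Theta}^T \bar{\varepsilon}_u$ and the error nodes following Eqs. (45)–(46). It assumes a linear $h(\bar{\phi}) = \bar{\phi}$, so that the maximum of $F$ is available in closed form for comparison; this choice, the parameter values, the step size and the duration are all assumptions.

```python
import numpy as np

# Illustrative parameters (assumptions).
u = np.array([2.0, 1.0])
v_p = np.array([1.0, 1.0])
Sigma_p = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma_u = np.array([[1.0, 0.3], [0.3, 1.0]])
Theta = np.array([[1.0, 0.5], [0.2, 1.0]])

def h(phi):
    return phi                      # assumed linear for this check

def h_prime(phi):
    return np.ones_like(phi)

phi = v_p.copy()                    # start the feature nodes at the prior mean
eps_p = np.zeros(2)
eps_u = np.zeros(2)
dt, n_steps = 0.01, 20000

for _ in range(n_steps):
    # All derivatives are computed from the current state before updating.
    d_phi = -eps_p + h_prime(phi) * (Theta.T @ eps_u)
    d_eps_p = phi - v_p - Sigma_p @ eps_p               # Eq. (45)
    d_eps_u = u - Theta @ h(phi) - Sigma_u @ eps_u      # Eq. (46)
    phi = phi + dt * d_phi
    eps_p = eps_p + dt * d_eps_p
    eps_u = eps_u + dt * d_eps_u

# Closed-form maximum of F for linear h (setting the gradient to zero).
P_p = np.linalg.inv(Sigma_p)
P_u = np.linalg.inv(Sigma_u)
phi_star = np.linalg.solve(P_p + Theta.T @ P_u @ Theta, P_p @ v_p + Theta.T @ P_u @ u)
print(phi, phi_star)
```

At convergence the error nodes should also satisfy Eqs. (39)–(40), i.e. `eps_p` should equal `P_p @ (phi - v_p)` and `eps_u` should equal `P_u @ (u - Theta @ phi)`.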

<html>
    <div class="image">
        <img src="Bogacz_fig5.png", width=800>
        <div style="margin: 20px 100px 20px 100px">
            Fig. 5. The architecture of the model inferring 2 features from 2 sensory stimuli. Notation as in Fig. 4(b). To help identify which connections are intrinsic and extrinsic to each level of hierarchy, the nodes and their projections in each level of hierarchy are shown in green, blue and purple respectively. 
        </div>
    </div>
</html>

Analogously as for the simple model, one can also find the rules for updating parameters encoded in synaptic connections, which generalize the rules presented previously. In particular, using the top formula in Table 1 it is easy to see that: 

\begin{equation*}
\frac{\partial{F}}{\partial{\bar{v}}_p} = \bar{\varepsilon}_p
\tag{47}
\end{equation*}

Using the two bottom formulas in Table 1 one can find the rules for update of covariance matrices (TRY IT YOURSELF): 

\begin{align*}
\frac{\partial{F}}{\partial{\mathbf{\Sigma_p}}} &= \frac{1}{2} \left( \bar{\varepsilon}_p \bar{\varepsilon}_p^T - \mathbf{\Sigma_p}^{-1} \right) \tag{48} \\
\frac{\partial{F}}{\partial{\mathbf{\Sigma_u}}} &= \frac{1}{2} \left( \bar{\varepsilon}_u \bar{\varepsilon}_u^T - \mathbf{\Sigma_u}^{-1} \right) \tag{49}
\end{align*}

The derivation of update of parameters $\mathbf{\Theta}$ is a bit more tedious, but we show in Appendix B that: 

\begin{equation*}
\frac{\partial{F}}{\partial{\mathbf{\Theta}}} = \bar{\varepsilon}_u h(\bar{\phi})^T.
\tag{50}
\end{equation*}

The above plasticity rules of Eqs. (47)–(50) are Hebbian in the same sense they were for the simple model; for example, Eq. (48) implies that $\varSigma_{p,i,j}$ should be updated proportionally to $\varepsilon_{p,i} \varepsilon_{p,j}$, i.e. to the product of activity of pre-synaptic and post-synaptic neurons. However, the rules of update of covariance matrices of Eqs. (48)–(49) contain matrix inverses $\mathbf{\Sigma}^{-1}$. The value of each entry in matrix inverse depends on all matrix elements, so it is difficult to see how it can be "known" by a synapse that encodes just a single element. Nevertheless, we will show in Section 5 how the model can be extended to satisfy the constraint of local plasticity. 
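The gradients in Eqs. (49) and (50) can be checked against finite differences. The sketch below perturbs each entry of $\mathbf{\Theta}$ and $\mathbf{\Sigma_u}$ independently and only evaluates the terms of $F$ that depend on them; $h = \tanh$, the helper names and all numerical values are assumptions for illustration.

```python
import numpy as np

# Illustrative values (assumptions).
u = np.array([1.0, 0.5])
phi = np.array([0.3, -0.8])
Sigma_u = np.array([[0.8, 0.1], [0.1, 0.5]])
Theta = np.array([[1.0, 0.5], [-0.3, 2.0]])

def h(phi):
    return np.tanh(phi)              # assumed element-wise nonlinearity

def F_u(Theta, Sigma_u):
    """Terms of F in Eq. (37) that depend on Theta and Sigma_u."""
    d_u = u - Theta @ h(phi)
    return 0.5 * (-np.log(np.linalg.det(Sigma_u)) - d_u @ np.linalg.solve(Sigma_u, d_u))

def numerical_gradient(f, M, step=1e-6):
    """Central differences with respect to every entry of matrix M."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        dM = np.zeros_like(M)
        dM[idx] = step
        grad[idx] = (f(M + dM) - f(M - dM)) / (2 * step)
    return grad

eps_u = np.linalg.solve(Sigma_u, u - Theta @ h(phi))                     # Eq. (40)
grad_Theta = np.outer(eps_u, h(phi))                                     # Eq. (50)
grad_Sigma_u = 0.5 * (np.outer(eps_u, eps_u) - np.linalg.inv(Sigma_u))   # Eq. (49)

num_Theta = numerical_gradient(lambda T: F_u(T, Sigma_u), Theta)
num_Sigma_u = numerical_gradient(lambda S: F_u(Theta, S), Sigma_u)
print(np.abs(grad_Theta - num_Theta).max(), np.abs(grad_Sigma_u - num_Sigma_u).max())
```

The call to `np.linalg.inv` in the analytic expression for `grad_Sigma_u` is exactly the non-local term discussed above.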

<html>
    <div class="image">
        <img src="Bogacz_fig6.png", width=1000>
        <div style="margin: 20px 100px 20px 100px">
            Fig. 6. (a) The architecture of the model including multiple layers. For simplicity only the first two layers are shown. Notation as in Fig. 5. (b) Extrinsic connectivity of cortical layers. 
        </div>
    </div>
</html>

### 4.2. Introducing hierarchy ###
Sensory cortical areas are organized hierarchically, such that areas in lower levels of hierarchy (e.g. primary visual cortex) infer presence of simple features of stimuli (e.g. edges), on the basis of which the sensory areas in higher levels of hierarchy infer presence of more and more complex features. It is straightforward to generalize the model from 2 layers to multiple layers. In such a generalized model the rules describing dynamics of neurons and plasticity of synapses remain exactly the same, and only the notation has to be modified to describe the presence of multiple layers of hierarchy. 

We assume that the expected value of activity in one layer $v_i$ depends on the activity in the next layer $v_{i+1}$:

\begin{align*}
E(\bar{u}) &= g_1(\bar{v}_2, \mathbf{\Theta_1}) \tag{51} \\ 
E(\bar{v}_2) &= g_2(\bar{v}_3, \mathbf{\Theta_2}) \\
E(\bar{v}_3) &= \;.... 
\end{align*}

To simplify the notation we could denote $u$ by $v_1$, and then the likelihood of activity in layer $i$ becomes: 

\begin{equation*}
p(\bar{v}_i|\bar{v}_{i+1}) = f (\bar{v}_i; g_i(\bar{v}_{i+1}, \mathbf{\Theta_i}), \mathbf{\Sigma_i}). 
\tag{52}
\end{equation*}

In this model, $\mathbf{\Sigma_i}$ parametrize the covariance between features in each level, and $\mathbf{\Theta_i}$ parametrize how the mean value of features in one level depends on the next. Let us assume the same form of function $g$ as before, i.e. $g_i(\bar{v}_{i+1}, \mathbf{\Theta_i}) = \mathbf{\Theta_i} h(\bar{v}_{i+1})$. By analogy to the model described in the previous subsection, one can see that inference of the features in all layers on the basis of sensory input can be achieved in the network shown in Fig. 6(a). In this network the dynamics of the nodes are described by: 

\begin{align*}
\dot{\bar{\phi}}_i &= -\bar{\varepsilon}_i + h'(\bar{\phi}_i) \times \mathbf{\Theta}_{i-1}^T \bar{\varepsilon}_{i-1}
\tag{53} \\
\dot{\bar{\varepsilon}}_i &= \bar{\phi}_i - \mathbf{\Theta}_i h(\bar{\phi}_{i+1}) - \mathbf{\Sigma_i} \bar{\varepsilon}_i. \tag{54}
\end{align*}

Furthermore, by analogy to the previous section, the rules for modifying synaptic connections in the model become: 

\begin{align*}
\frac{\partial{F}}{\partial{\mathbf{\Sigma_i}}} &= \frac{1}{2} \left( \bar{\varepsilon}_i \bar{\varepsilon}_i^T - \mathbf{\Sigma_i}^{-1} \right) \tag{55} \\
\frac{\partial{F}}{\partial{\mathbf{\Theta_i}}} &= \bar{\varepsilon}_i h(\bar{\phi}_{i+1})^T. \tag{56}
\end{align*}
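The node dynamics of Eqs. (53)–(54) can be sketched for a small chain with one scalar feature per level. The example below makes several simplifying assumptions that are not part of the model itself: $h$ is linear, all variances are set to $1$, the lowest level is clamped to the sensory input $u$, and the top level receives a fixed prior value `v_top`.

```python
import numpy as np

u = 2.0                                # sensory input; phi_1 is clamped to u
v_top = 1.0                            # fixed value above the top level (assumed)
theta = np.array([1.0, 0.5, 2.0])      # theta_1, theta_2, theta_3 (assumed)
Sigma = np.array([1.0, 1.0, 1.0])      # Sigma_1, Sigma_2, Sigma_3 (assumed)

# phi[0] = phi_1 = u (clamped), phi[1] = phi_2, phi[2] = phi_3, phi[3] = v_top (fixed)
phi = np.array([u, 0.0, 0.0, v_top])
eps = np.zeros(3)                      # eps[0] = eps_1, eps[1] = eps_2, eps[2] = eps_3
dt, n_steps = 0.01, 6000

for _ in range(n_steps):
    # Eq. (54) with h(x) = x: each error node compares a level with the
    # prediction from the level above.
    d_eps = phi[:-1] - theta * phi[1:] - Sigma * eps
    # Eq. (53) for the free nodes phi_2 and phi_3 (h'(x) = 1).
    d_phi = -eps[1:] + theta[:-1] * eps[:-1]
    eps = eps + dt * d_eps
    phi[1:3] = phi[1:3] + dt * d_phi

# Closed-form maximum of F for this linear chain (gradient set to zero).
M = np.array([[theta[0] ** 2 / Sigma[0] + 1 / Sigma[1], -theta[1] / Sigma[1]],
              [-theta[1] / Sigma[1], theta[1] ** 2 / Sigma[1] + 1 / Sigma[2]]])
rhs = np.array([theta[0] * u / Sigma[0], theta[2] * v_top / Sigma[2]])
phi_star = np.linalg.solve(M, rhs)
print(phi[1:3], phi_star)
```

Each node only uses the error node of its own level and of the level below, which is the locality that the network in Fig. 6(a) is meant to capture.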

The hierarchical structure of the model in Fig. 6(a) parallels the hierarchical structure of the cortex. Furthermore, it is worth noting that different layers within the cortex communicate with higher and lower sensory areas (as illustrated schematically in Fig. 6(b)), which parallels the fact that different nodes in the model communicate with other levels of hierarchy (Fig. 6(a)). 

<html>
    <div class="image">
        <img src="Bogacz_fig7.png", width=600>
        <div style="margin: 20px 100px 20px 100px">
            Fig. 7. Prediction error networks that can learn the uncertainty parameter with local plasticity. Notation as in Fig. 4(b). (a) Single node. (b) Multiple nodes for multidimensional features.
        </div>
    </div>
</html>

## 5. Local plasticity ##

The plasticity rules for synapses encoding matrix $\mathbf{\Sigma}$ (describing the variance and co-variance of features or sensory inputs) introduced in the previous section (Eqs. (48), (49) and (55)) include terms equal to the matrix inverse $\mathbf{\Sigma}^{-1}$. Computing each element of the inverse $\mathbf{\Sigma}^{-1}$ requires not only the knowledge of the corresponding element of $\mathbf{\Sigma}$, but also of other elements. For example, in a case of 2-dimensional vector $\bar{u}$, the update rule for the synaptic connection encoding $\varSigma_{u,1,1}$ (Eq. (49)) requires the computation of $\varSigma_{u,1,1}^{-1} = \varSigma_{u,2,2}/|\mathbf{\Sigma_u}|$. Hence the change of synaptic weight $\varSigma_{u,1,1}$ depends on the value of the weight $\varSigma_{u,2,2}$, but these are the weights of connections between different neurons (see Fig. 5), thus the update rule violates the principle of the local plasticity stated in the Introduction. Nevertheless, in this section we show that by slightly modifying the architecture of the network computing prediction errors, the need for computing matrix inverses in the plasticity rules disappears. In other words, we present an extension of the model from the previous section in which learning the values of parameters $\varSigma$ satisfies the constraint of local plasticity. To make the description as easy to follow as possible, we start with considering the case of single sensory input and single feature on each level, and then generalize it to increased dimension of inputs and features. 
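The non-locality of the matrix inverse can be seen in a two-line numerical example. The matrix values below are arbitrary assumptions; the point is only that the $(1,1)$ entry of the inverse changes when the $(2,2)$ entry of the matrix is changed.

```python
import numpy as np

Sigma_u = np.array([[2.0, 0.5],
                    [0.5, 1.0]])           # example covariance matrix (assumed)
inv_11 = np.linalg.inv(Sigma_u)[0, 0]
formula_11 = Sigma_u[1, 1] / np.linalg.det(Sigma_u)   # Sigma_22 / |Sigma_u|

# Change only the (2, 2) entry, i.e. a weight between different neurons.
Sigma_changed = Sigma_u.copy()
Sigma_changed[1, 1] = 3.0
inv_11_changed = np.linalg.inv(Sigma_changed)[0, 0]

print(inv_11, formula_11, inv_11_changed)
```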

### 5.1. Learning variance of a single prediction error node ###

Instead of considering the whole model we now focus on computations in a single node computing prediction error. In the model we wish the prediction error on each level to converge to: 

\begin{equation*}
\varepsilon_i = \frac{\phi_i - g_i(\phi_{i+1})}{\varSigma_i} 
\tag{57}
\end{equation*}

In the above equation $\varSigma_i$ is the variance of feature $\phi_i$ (around the mean predicted by the level above): 

\begin{equation*}
\varSigma_i = \left<(\phi_i - g_i(\phi_{i+1}))^2\right>
\tag{58}
\end{equation*}

A sample architecture of the model that can achieve this computation with local plasticity is shown in Fig. 7(a). It includes an additional inhibitory inter-neuron $e_i$ which is connected to the prediction error node, and receives input from it via the connection with weight encoding $\varSigma_i$. The dynamics of this model is described by the following set of equations: 

\begin{align*}
\dot{\varepsilon}_i &= \phi_i - g_i(\phi_{i+1}) - e_i \tag{59} \\
\dot{e}_i &= \varSigma_i \varepsilon_i - e_i \tag{60}
\end{align*}

The levels of activity at the fixed point can be found by setting the left hand sides of Eqs. (59)–(60) to $0$ and solving the resulting set of simultaneous equations (TRY IT YOURSELF): 

\begin{align*}
\varepsilon_i &= \frac{\phi_i - g_i(\phi_{i+1})}{\varSigma_i} \tag{61} \\
e_i &= \phi_i - g_i(\phi_{i+1}) \tag{62}
\end{align*}

Thus we see that the prediction error node has a fixed point at the desired value (cf. Eq. (57)). Let us now consider the following rule for plasticity of the connection encoding $\varSigma_i$: 

\begin{equation*}
\Delta \varSigma_i = \alpha(\varepsilon_i e_i - 1). 
\tag{63}
\end{equation*}

According to this rule the weight is modified proportionally to the product of activities of pre-synaptic and post-synaptic neurons decreased by a constant, with a learning rate $\alpha$. To analyse to what values this rule converges, we note that the expected change is equal to $0$ when: 

\begin{equation*}
\left<\varepsilon_i e_i - 1 \right> = 0. 
\tag{64} 
\end{equation*}

Substituting Eqs. (61)–(62) into the above equation and rearranging terms we obtain: 

\begin{equation*}
\frac{\left< (\phi_i - g_i(\phi_{i+1}))^2 \right>}{\varSigma_i} = 1.
\tag{65}
\end{equation*}

Solving the above equation for $\varSigma_i$ we obtain Eq. (58). Thus in summary the network in Fig. 7(a) computes the prediction  error and learns the variance of the corresponding feature with a local Hebbian plasticity rule. To gain more intuition for how this model works we suggest the following exercise. 

**Exercise 5.** *Simulate learning of variance $\varSigma_i$ over trials. For simplicity, only simulate the network described by Eqs. (59)–(60), and assume that variables $\phi$ are constant. On each trial generate input $\phi_i$ from a normal distribution with mean $5$ and variance $2$, and set $g_i(\phi_{i+1}) = 5$ (so that the upper level correctly predicts the mean of $\phi_i$). Simulate the network for $20$ time units, and then update weight $\varSigma_i$ with learning rate $\alpha = 0.01$. Simulate $1000$ trials and plot how $\varSigma_i$ changes across trials.*

<html>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
<span style="color:#228">
<p><i>I've made the simulation 3000 trials long to show the longer term stability of the weight.</i>
</span>
<hr style="height:2px;border:none;color:#228;background-color:#228;" />
</html>

In [8]:
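import numpy as np
import matplotlib.pyplot as plt

def fig_8():

    def g(v):
        # Assumed form g(v) = v^2, as in the earlier exercises; defined locally
        # so that this cell does not rely on a definition from another cell.
        return v ** 2

    def deps_dt(phi_i, eps_i, e_i, Sigma_i, phi_i_plus_1):
        # Eq. (59): dynamics of the prediction error node
        return phi_i - g(phi_i_plus_1) - e_i

    def de_dt(phi_i, eps_i, e_i, Sigma_i):
        # Eq. (60): dynamics of the inhibitory inter-neuron
        return Sigma_i * eps_i - e_i

    def Delta_Sigma(phi_i, eps_i, e_i, Sigma_i):
        # Eq. (63): local Hebbian update of the weight encoding the variance
        return alpha * (eps_i * e_i - 1)

    mu_phi       = 5                  # mean of phi_i
    Sigma_phi    = 2                  # variance of phi_i
    phi_i_plus_1 = np.sqrt(mu_phi)    # chosen so that g(phi_i_plus_1) = mu_phi
    Sigma_i      = 1                  # initial value of the learnt weight
    dt           = 0.01               # Euler integration step
    dur          = 20                 # duration of each trial (time units)
    steps        = int(dur/dt)
    trials       = 3000
    alpha        = 0.01               # learning rate
    trace        = np.zeros((trials, 4))
    
    for trial in range(trials):
        # Draw a new input and reset the node activities at the start of a trial
        phi_i    = np.sqrt(Sigma_phi) * np.random.randn() + mu_phi
        eps_i    = 0
        e_i      = 0
        state_i  = (phi_i, eps_i, e_i, Sigma_i)
        
        for t in range(steps):
            # Both derivatives use the state from the previous time step
            eps_i  += dt * deps_dt(*state_i, phi_i_plus_1 = phi_i_plus_1)
            e_i    += dt * de_dt(*state_i)
            state_i = (phi_i, eps_i, e_i, Sigma_i)

        # Update the weight once per trial, after the nodes have settled
        Sigma_i += Delta_Sigma(*state_i)
        state_i  = (phi_i, eps_i, e_i, Sigma_i)
        trace[trial] = state_i
        
    fig = plt.figure(figsize=(10, 5), num='Fig 8')
    ax  = fig.add_subplot(1, 1, 1)

    ax.plot(np.arange(trials), trace[:, 3])
    plt.xlabel('Trial')
    plt.ylabel(r'$\Sigma$')
    plt.axis([0, trials, 0.5, 2.5])

fig_8()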


The results of simulations are shown in Fig. 8, and they illustrate that the synaptic weight $\varSigma_i$ approaches the vicinity of the variance of $\phi_i$. 

It is also worth adding that $\varepsilon_i$ in the model described by Eqs. (59)–(60) converges to the prediction error (Eq. (61)),  when one assumes that $\phi$ are constant or change on much slower time-scale than $\varepsilon_i$ and $e_i$. This convergence takes place because the fixed point of the model is stable, which can be shown using the standard dynamical systems theory (Strogatz, 1994). In particular, since Eqs. (59)–(60) only contain linear functions of variables $\varepsilon_i$ and $e_i$, their solution has a form of exponential functions of time $t$, e.g. $\varepsilon_i(t) = c \exp(\lambda t) + \varepsilon_i^*$, where $c$ and $\lambda$ are constants, and $\varepsilon_i^*$ is the value at the fixed point. The sign of $\lambda$ determines the stability of the fixed point: when $\lambda < 0$, the exponential term decreases with time, and $\varepsilon_i$ converges to the fixed point, while if $\lambda > 0$, the fixed point is unstable. 

The values of $\lambda$ are equal to the eigenvalues of the matrix in the equation below (Strogatz, 1994), which rewrites Eqs. (59)–(60) in a vector form: 

\begin{equation*}
\begin{bmatrix}
\dot{\varepsilon}_i \\ \dot{e}_i
\end{bmatrix} =
\begin{bmatrix}
0 & -1 \\
\varSigma_i & -1
\end{bmatrix}
\begin{bmatrix}
\varepsilon_i \\ e_i
\end{bmatrix}
+
\begin{bmatrix}
\phi_i - g_i(\phi_{i+1}) \\ 0
\end{bmatrix}.
\tag{66}
\end{equation*}

To show that the fixed point is stable we use the property that the sum of eigenvalues is equal to the trace and the product to the determinant. The trace and determinant of this matrix are $-1$ and $\varSigma_i$, respectively. Since the variance $\varSigma_i$ is positive, the sum of eigenvalues is negative and their product is positive, so both eigenvalues have negative real parts (they are either both real and negative, or they form a complex conjugate pair with real part $-1/2$), and hence the system is stable. 
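The stability argument can be checked numerically by computing the eigenvalues of the matrix in Eq. (66) for a few positive values of $\varSigma_i$. In the sketch below the helper name and the tested values are assumptions; for $\varSigma_i > 1/4$ the eigenvalues form a complex conjugate pair, so the quantity to inspect is their real part.

```python
import numpy as np

def eigenvalues(Sigma_i):
    """Eigenvalues of the matrix in Eq. (66) for a given variance Sigma_i."""
    A = np.array([[0.0, -1.0],
                  [Sigma_i, -1.0]])
    return np.linalg.eigvals(A)

for Sigma_i in [0.1, 1.0, 2.0, 10.0]:
    lam = eigenvalues(Sigma_i)
    # Sum equals the trace (-1), product equals the determinant (Sigma_i).
    print(Sigma_i, lam, lam.sum().real, np.prod(lam).real)
```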

### 5.2. Learning the covariance matrix ###

The model described in the previous subsection scales up to larger dimension of features and sensory inputs. The architecture of the scaled up network is shown in Fig. 7(b), and its dynamics is described by the following equations:

\begin{align*}
\dot{\bar{\varepsilon}}_i &= \bar{\phi}_i - g_i(\bar{\phi}_{i+1}) - \bar{e}_i \tag{67} \\
\dot{\bar{e}}_i &= \mathbf{\Sigma_i} \bar{\varepsilon}_i - \bar{e}_i \tag{68}
\end{align*}

Analogously as before, we can find the fixed point by setting the left hand sides of the equations to 0: 

\begin{align*}
\bar{\varepsilon}_i &= \mathbf{\Sigma_i}^{-1} (\bar{\phi}_i - g_i(\bar{\phi}_{i+1})) \tag{69} \\
\bar{e}_i &= \bar{\phi}_i - g_i(\bar{\phi}_{i+1}) \tag{70}
\end{align*}

Thus we can see that nodes $\varepsilon$ have fixed points at the values equal to the prediction errors. We can now consider a learning rule analogous to that in the previous subsection: 

\begin{equation*}
\Delta \mathbf{\Sigma}_i = \alpha(\bar{\varepsilon}_i \bar{e}_i^T - 1). 
\tag{71}
\end{equation*}

To find the values in the vicinity of which the above rule may converge, we can find the value of $\mathbf{\Sigma}_i$ for which the expected value of the right hand side of the above equation is equal to $0$ (for matrices, the constant $1$ is to be read as the identity matrix): 

\begin{equation*}
\left<\bar{\varepsilon}_i \bar{e}_i^T - 1 \right> = 0.
\tag{72}
\end{equation*}

Substituting Eqs. (69)–(70) into the above equation, and solving for $\mathbf{\Sigma_i}$ we obtain (TRY IT YOURSELF): 

\begin{equation*}
\mathbf{\Sigma_i} = \left< (\bar{\phi}_i - g_i(\bar{\phi}_{i+1})) (\bar{\phi}_i - g_i(\bar{\phi}_{i+1}))^T \right>.
\tag{73}
\end{equation*}

We can see that the learning rule has a stochastic fixed point at the values corresponding to the covariance matrix. In summary, the nodes in the network described in this section have fixed points at prediction errors and can learn the covariance of the corresponding features, thus the proposed network may substitute the prediction error nodes in the model shown in Fig. 6, and the computation will remain the same. But importantly, in the proposed network the covariance is learnt with local plasticity involving simple Hebbian learning. 
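The convergence of the learning rule in Eq. (71) can be illustrated with a short simulation. To keep the sketch fast it skips the node dynamics and uses the fixed-point values of Eqs. (69)–(70) directly on every trial, and it reads the constant $1$ in Eq. (71) as the identity matrix; both are assumptions of this example, as are the target covariance, the learning rate, the number of trials and the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
C_true = np.array([[2.0, 0.6],
                   [0.6, 1.0]])          # covariance to be learnt (assumed)
L_chol = np.linalg.cholesky(C_true)      # used to draw correlated samples

Sigma = np.eye(2)                        # initial weights
alpha = 0.005                            # learning rate
n_trials = 20000
history = np.zeros((n_trials, 2, 2))

for t in range(n_trials):
    # delta plays the role of phi_i - g_i(phi_{i+1}) on this trial
    delta = L_chol @ rng.standard_normal(2)
    e = delta                                 # fixed point, Eq. (70)
    eps = np.linalg.solve(Sigma, delta)       # fixed point, Eq. (69)
    Sigma = Sigma + alpha * (np.outer(eps, e) - np.eye(2))   # Eq. (71)
    history[t] = Sigma

# Average over the second half of the trials to smooth out fluctuations.
Sigma_avg = history[n_trials // 2:].mean(axis=0)
print(Sigma_avg)
```

Because the fixed point is stochastic, individual values of `Sigma` keep fluctuating around the covariance matrix; the average over many trials is what is expected to lie close to `C_true`. The `np.linalg.solve` call here only stands in for the settled activity of the prediction error nodes; in the network itself no synapse has to compute a matrix inverse.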

<html>
    <div class="image">
        <img src="Bogacz_fig9.png", width=400>
        <div style="margin: 20px 100px 20px 100px">
            Fig. 9. An example of a texture. 
        </div>
    </div>
</html>

## 6. Discussion ##

In this paper we presented the model of perception and learning in neural circuits based on the free-energy framework. This model extends the predictive coding model (Rao & Ballard, 1999) in that it represents and learns not only mean values of stimuli or features, but also their variances, which gives the model several new computational capabilities, as we now discuss. 

First, the model can weight incoming sensory information by its reliability. This property arises in the model because the prediction errors are normalized by dividing them by the variance of noise. Thus the noisier a particular dimension of the stimulus is, the smaller the corresponding prediction error, and thus the lower its influence on the activity of other neurons in the network. 

Second, the model can learn properties of features encoded in covariance of sensory input. An example of such feature is texture, which can be efficiently recognized on the basis of covariance, irrespectively of translation (Harwood, Ojala, Pietikäinen, Kelman, & Davis, 1995). To get an intuition for this property, let us consider an example of checker-board texture (Fig. 9). Please note that adjacent nodes have always opposite colour – corresponding to negative covariance, while the diagonal nodes have the same colour – corresponding to positive covariance. 

Third, the attentional modulation can be easily implemented in the model by changing the variance associated with the attended features (Feldman & Friston, 2010). Thus for example, attending to feature $i$ at level $j$ of the hierarchy can be implemented by decreasing synaptic weight $\varSigma_{j,i,i}$, or inhibiting node $e_{j,i}$ in case of the model described in Section 5, which will result in a larger effect of the node encoding this feature on the activity in the rest of the network. 

In this paper we included description of the modified or extended version of the model with local computation and plasticity to better illustrate how computation proposed by the free-energy framework can be implemented in neural circuits. However, it will be necessary in the future to numerically evaluate the efficiency of learning in the proposed model and the free-energy framework in general. Existing models of feature extraction (Bell & Sejnowski, 1997; Bogacz, Brown, & Giraud-Carrier, 2001; Olshausen & Field, 1995) and predictive coding (Rao & Ballard, 1999) have been shown to be able to find features efficiently and reproduce the receptive fields of neurons in the primary visual cortex when trained with natural images. It would be interesting to explicitly test in simulations if the model based on the free-energy framework can equally efficiently extract features from natural stimuli and additionally learn the variance and covariance of features. 

We have also demonstrated that if the dynamics within the nodes computing prediction errors take place on a time-scale much faster than in the whole network, these nodes converge to stable fixed points. It is also worth noting that under the assumption of separation of time scales, the nodes computing $\phi$ also converge to a stable fixed point, because variables $\phi$ converge to the values that maximize function $F$. It would be interesting to investigate how to ensure that the model converges to desired values (rather than engaging in oscillatory behaviour) also when one considers the more realistic case of time-scales not being fully separated. 

In summary, in this paper we presented the free-energy theory, which offers a powerful framework for describing computations performed by the brain during perception and learning. The appeal of the similarity in the organization of networks suggested by this theory and observed in the brain invites attempts to map the currently relatively abstract models onto the details of cortical micro-circuitry, i.e. to map different elements of the model onto different neural populations within the cortex. For example, Bastos et al. (2012) compared a more recent version of the model (Friston, 2008) with the details of the cortical organization. Such comparisons of the models with biological circuits are likely to lead to iterative refinement of the models. 

Even if the free-energy framework does describe cortical computation, the mapping between the variables in the model and the elements of the neural circuit may not be "clean" but rather "messy", i.e. each model variable or parameter may be represented by multiple neurons or synapses. The particular implementation of the framework in the cortical circuit may be influenced by other constraints that evolutionary pressure optimizes, such as robustness to damage, energy efficiency, and speed of processing. In any case, comparing the predictions of theoretical frameworks such as the free-energy framework with experimental data offers hope for understanding the cortical micro-circuits. 

## Acknowledgments ##
This work was supported by Medical Research Council grant MC UU 12024/5. The author thanks Karl Friston, John-Stuart Brittain, Daniela Massiceti, Linus Schumacher and Rui Costa for reading the previous version of the manuscript and for very useful suggestions, and Chris Mathys, Peter Dayan and Diego Vidaurre for discussion. 

## Footnotes ##

$^1$ In the original model (Friston, 2005) the prediction errors were normalized slightly differently as explained in Appendix A. 

$^2$ The original model does not provide details on the dynamics of the nodes computing prediction error, but we consider a sample description of their dynamics to illustrate how these nodes can perform their computation. 

$^3$ In the original model, variables $\lambda = \sqrt{\varSigma} - 1$ were defined, and these variables were encoded in the synaptic connections. Formally, this constraint is known as a hyperprior. This is because the variance or precision parameters are often referred to mathematically as hyperparameters. Whenever we place constraints on hyperparameters we necessarily invoke hyperpriors. 

$^4$ Although this case has not been discussed by Friston (2005), it was discussed by Rao and Ballard (1999). 

$^5$ In the model of Rao and Ballard (1999) sparse coding was achieved through the introduction of an additional prior expectation that most $\phi_i$ are close to $0$, but sparse coding can also be achieved by choosing a shape of function $h$ such that $h(v_i)$ are mostly close to $0$ and only occasionally significantly different from zero (Friston, 2008). 

## Appendix A. The original neural implementation ##

In the original model (Friston, 2005), the prediction errors were defined in a slightly different way: 

\begin{align*}
\xi_p &= \frac{\phi - v_p}{\sigma_p} \tag{74} \\
\xi_u &= \frac{u - g(\phi)}{\sigma_u}. \tag{75}
\end{align*}

In the above equations, $\sigma_p = \sqrt{\varSigma_p}$ and $\sigma_u = \sqrt{\varSigma_u}$, i.e. $\sigma_p$ and $\sigma_u$ denote the standard deviations of distributions $p(v)$ and $p(u|v)$ respectively. With the prediction error terms defined in this way, the negative free energy computed in Eq. (7) can be written as: 

\begin{equation*}
F =  \left( -\ln \sigma_p - \frac{1}{2} \xi_p^2 - \ln \sigma_u - \frac{1}{2} \xi_u^2 \right) + C_2.
\tag{76}
\end{equation*}

The dynamics of variable $\phi$ is proportional to the derivative of the above equation over $\phi$: 

\begin{equation*}
\dot{\phi} = \frac{\xi_u g'(\phi)}{\sigma_u} - \frac{\xi_p}{\sigma_p}.
\tag{77}
\end{equation*}

As in Section 2.3, the prediction errors defined in Eqs. (74)–(75) could be computed in nodes with the following dynamics: 

\begin{align*}
\dot{\xi_p} &= \phi - v_p - \sigma_p \xi_p \tag{78} \\
\dot{\xi_u} &= u - g(\phi) - \sigma_u \xi_u. \tag{79}
\end{align*}

The architecture of the model described by Eqs. (77)–(79) is shown in Fig. 10. It is similar to that in Fig. 3, but differs in the information received by node $\phi$ (we will discuss this difference in more detail at the end of this Appendix). 
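The dynamics in Eqs. (77)–(79) can be explored numerically. The following is a minimal sketch rather than part of the original model description: it assumes forward Euler integration, the nonlinearity $g(\phi) = \phi^2$, and illustrative parameter values; the function name `simulate_original_model` is hypothetical.

```python
def g(phi):
    """Assumed nonlinearity relating the feature to the expected input."""
    return phi ** 2


def g_prime(phi):
    """Derivative of the assumed nonlinearity."""
    return 2.0 * phi


def simulate_original_model(u, v_p, sigma_u, sigma_p, dt=0.01, t_max=20.0):
    """Integrate Eqs. (77)-(79) with the forward Euler method."""
    phi, xi_p, xi_u = v_p, 0.0, 0.0  # start at the prior mean with zero errors
    for _ in range(int(round(t_max / dt))):
        d_phi = xi_u * g_prime(phi) / sigma_u - xi_p / sigma_p  # Eq. (77)
        d_xi_p = phi - v_p - sigma_p * xi_p                     # Eq. (78)
        d_xi_u = u - g(phi) - sigma_u * xi_u                    # Eq. (79)
        phi += dt * d_phi
        xi_p += dt * d_xi_p
        xi_u += dt * d_xi_u
    return phi, xi_p, xi_u


# Illustrative values (assumptions): input u = 2, prior mean v_p = 3,
# unit standard deviations.
phi, xi_p, xi_u = simulate_original_model(u=2.0, v_p=3.0, sigma_u=1.0, sigma_p=1.0)
print("phi  =", round(phi, 4))
print("xi_p =", round(xi_p, 4), " expected from Eq. (74):", round((phi - 3.0) / 1.0, 4))
print("xi_u =", round(xi_u, 4), " expected from Eq. (75):", round((2.0 - g(phi)) / 1.0, 4))
```

At the end of the simulation the error nodes should agree with the definitions in Eqs. (74)–(75), and $\phi$ should sit where the right hand side of Eq. (77) vanishes, i.e. at a stationary point of $F$.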

As before, we can find the rules describing synaptic plasticity in the model by calculating the derivatives of $F$ (Eq. (76)) over $v_p$, $\sigma_p$ and $\sigma_u$ (TRY IT YOURSELF): 

\begin{align*}
\frac{\partial{F}}{\partial{v_p}} &= \xi_p \sigma_p^{-1} \tag{80} \\
\frac{\partial{F}}{\partial{\sigma_p}} &= \left( \xi_p^2 - 1 \right) \sigma_p^{-1} \tag{81} \\
\frac{\partial{F}}{\partial{\sigma_u}} &= \left( \xi_u^2 - 1 \right) \sigma_u^{-1}. \tag{82} 
\end{align*}
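One way to check a hand derivation of Eqs. (80)–(82) is to compare it with finite differences of $F$ from Eq. (76). The sketch below is an illustration rather than part of the original model: it assumes $g(\phi) = \phi^2$, drops the constant $C_2$, and uses arbitrary parameter values; all function names are hypothetical.

```python
import math


def g(phi):
    """Assumed nonlinearity; any differentiable function could be used."""
    return phi ** 2


def neg_free_energy(phi, u, v_p, sigma_p, sigma_u):
    """F from Eq. (76), without the constant C_2."""
    xi_p = (phi - v_p) / sigma_p  # Eq. (74)
    xi_u = (u - g(phi)) / sigma_u  # Eq. (75)
    return -math.log(sigma_p) - 0.5 * xi_p ** 2 - math.log(sigma_u) - 0.5 * xi_u ** 2


def analytic_gradients(phi, u, v_p, sigma_p, sigma_u):
    """Derivatives of F over v_p, sigma_p and sigma_u, Eqs. (80)-(82)."""
    xi_p = (phi - v_p) / sigma_p
    xi_u = (u - g(phi)) / sigma_u
    return (xi_p / sigma_p,
            (xi_p ** 2 - 1.0) / sigma_p,
            (xi_u ** 2 - 1.0) / sigma_u)


def central_difference(f, x, step=1e-6):
    """Numerical derivative of a scalar function f at x."""
    return (f(x + step) - f(x - step)) / (2.0 * step)


# Arbitrary illustrative values (assumptions).
phi, u, v_p, sigma_p, sigma_u = 1.6, 2.0, 3.0, 1.5, 0.8

numeric = (
    central_difference(lambda x: neg_free_energy(phi, u, x, sigma_p, sigma_u), v_p),
    central_difference(lambda x: neg_free_energy(phi, u, v_p, x, sigma_u), sigma_p),
    central_difference(lambda x: neg_free_energy(phi, u, v_p, sigma_p, x), sigma_u),
)
analytic = analytic_gradients(phi, u, v_p, sigma_p, sigma_u)

for name, a, n in zip(("v_p", "sigma_p", "sigma_u"), analytic, numeric):
    print(f"dF/d{name}: analytic = {a:.6f}, numeric = {n:.6f}")
```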

The original model does not satisfy the constraint of local computation stated in the Introduction: the node computing $\phi$ receives the input from prediction error nodes scaled by parameters $\sigma$ (see Fig. 10), yet the parameters $\sigma$ are encoded not in the connections between node $\phi$ and the prediction error nodes, but in the connections among the prediction error neurons. Nevertheless, we have shown in Section 2.3 that simply changing the way in which prediction errors are normalized makes the computation in the model local. 

## Appendix B. Derivation of plasticity rule for connections between layers ##

This Appendix derives the rule for the update of $\mathbf{\Theta}$ given in Eq. (50). In order to use the two top formulas in Table 1 we have to reshape the matrix $\mathbf{\Theta}$ into a vector. To avoid death by notation, without loss of generality, let us consider the case of two-dimensional stimuli and features. So let us define the vector of parameters: $\bar{\theta} = \left[ \theta_{1,1}, \theta_{1,2}, \theta_{2,1}, \theta_{2,2} \right]$. Now, using the two top formulas in Table 1 one can find that: 

\begin{equation*}
\frac{\partial{F}}{\partial{\bar{\theta}}} = \frac{\partial{g(\bar{\phi}, \mathbf{\Theta})^T}}{\partial{\bar{\theta}}} \bar{\epsilon}_u.
\tag{83}
\end{equation*}

From Eq. (42) we find: 

\begin{equation*}
\frac{\partial{g(\bar{\phi}, \mathbf{\Theta})}}{\partial{\bar{\theta}}} = 
\begin{bmatrix}
h(\phi_1) & h(\phi_2) & 0 & 0 \\
0 & 0 & h(\phi_1) & h(\phi_2)
\end{bmatrix}.
\tag{84}
\end{equation*}

We can now evaluate the right hand side of Eq. (83): 

\begin{equation*}
\frac{\partial{g(\bar{\phi}, \mathbf{\Theta})^T}}{\partial{\bar{\theta}}}\bar{\epsilon}_u = 
\begin{bmatrix}
h(\phi_1) & 0 \\
h(\phi_2) & 0 \\
0 & h(\phi_1) \\
0 & h(\phi_2)
\end{bmatrix}
\begin{bmatrix}
\varepsilon_{u,1}\\
\varepsilon_{u,2}
\end{bmatrix} =
\begin{bmatrix}
h(\phi_1) \varepsilon_{u,1} \\
h(\phi_2) \varepsilon_{u,1} \\
h(\phi_1) \varepsilon_{u,2} \\
h(\phi_2) \varepsilon_{u,2} 
\end{bmatrix}.
\tag{85}
\end{equation*}

Reshaping the right hand side of the above equation into a matrix, we can see how it can be decomposed into the product of vectors in Eq. (50): 

\begin{equation*}
\begin{bmatrix}
h(\phi_1) \varepsilon_{u,1} & h(\phi_2) \varepsilon_{u,1} \\
h(\phi_1) \varepsilon_{u,2} & h(\phi_2) \varepsilon_{u,2} 
\end{bmatrix} =
\begin{bmatrix}
\varepsilon_{u,1} \\
\varepsilon_{u,2} 
\end{bmatrix}
\begin{bmatrix}
h(\phi_1) & h(\phi_2) \\
\end{bmatrix}.
\tag{86}
\end{equation*}
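The outer-product form of the gradient can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: it takes $g(\bar{\phi}, \mathbf{\Theta}) = \mathbf{\Theta} h(\bar{\phi})$ with $h = \tanh$ chosen arbitrarily, keeps only the term of $F$ that depends on $\mathbf{\Theta}$, defines the prediction error as $\bar{\varepsilon}_u = \mathbf{\Sigma}_u^{-1} (\bar{u} - \mathbf{\Theta} h(\bar{\phi}))$, and uses arbitrary parameter values; the function names are hypothetical.

```python
import numpy as np


def h(phi):
    """Assumed element-wise nonlinearity."""
    return np.tanh(phi)


def theta_dependent_part_of_F(theta, phi, u, sigma_u):
    """Term of F that depends on Theta: -1/2 r^T Sigma_u^{-1} r, r = u - Theta h(phi)."""
    residual = u - theta @ h(phi)
    return -0.5 * residual @ np.linalg.solve(sigma_u, residual)


# Arbitrary two-dimensional example (assumed values).
theta = np.array([[0.5, -0.2], [0.1, 0.8]])
phi = np.array([0.3, -1.1])
u = np.array([1.0, -0.5])
sigma_u = np.array([[1.0, 0.3], [0.3, 2.0]])  # symmetric positive definite

# Analytic gradient: outer product of the prediction error and h(phi), as in Eq. (86).
eps_u = np.linalg.solve(sigma_u, u - theta @ h(phi))
analytic = np.outer(eps_u, h(phi))

# Numerical gradient: central differences over each entry of Theta.
step = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        delta = np.zeros_like(theta)
        delta[i, j] = step
        f_plus = theta_dependent_part_of_F(theta + delta, phi, u, sigma_u)
        f_minus = theta_dependent_part_of_F(theta - delta, phi, u, sigma_u)
        numeric[i, j] = (f_plus - f_minus) / (2.0 * step)

print("analytic gradient:\n", analytic)
print("numeric gradient:\n", numeric)
```

Each entry of the gradient is the product of one prediction error and one transformed feature, $\varepsilon_{u,i} \, h(\phi_j)$, which is what makes the resulting plasticity rule Hebbian.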

## References ##

Bastos, Andre M., Usrey, W. Martin, Adams, Rick A., Mangun, George R., Fries, Pascal, & Friston, Karl J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.

Bell, Anthony J., & Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. 

Bell, Anthony J., & Sejnowski, Terrence J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338. 

Bogacz, Rafal, Brown, Malcolm W., & Giraud-Carrier, Christophe (2001). Emergence of movement sensitive neurons’ properties by learning a sparse code for natural moving images. Advances in Neural Information Processing Systems, 13, 838–844. 

Bogacz, Rafal, & Gurney, Kevin (2007). The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Computation, 19, 442–477. 

Chen, J.-Y., Lonjers, P., Lee, C., Chistiakova, M., Volgushev, M., & Bazhenov, M. (2013). Heterosynaptic plasticity prevents runaway synaptic dynamics. Journal of Neuroscience, 33, 15915–15929. 

Feldman, Harriet, & Friston, Karl (2010). Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4, 215. 

FitzGerald, Thomas H. B., Schwartenbeck, Philipp, Moutoussis, Michael, Dolan, Raymond J., & Friston, Karl (2015). Active inference, evidence accumulation and the urn task. Neural Computation, 27, 306–328. 

Friston, Karl (2003). Learning and inference in the brain. Neural Networks, 16, 1325–1352.

Friston, Karl (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B, 360, 815–836.

Friston, Karl (2008). Hierarchical models in the brain. PLoS Computational Biology, 4, e1000211.

Friston, Karl (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127–138.

Friston, Karl, Schwartenbeck, Philipp, FitzGerald, Thomas, Moutoussis, Michael, Behrens, Timothy, & Dolan, Raymond J. (2013). The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience, 7, 598.

Harwood, David, Ojala, Timo, Pietikäinen, Matti, Kelman, Shalom, & Davis, Larry (1995). Texture classification by center-symmetric auto-correlation, using Kullback discrimination of distributions. Pattern Recognition Letters, 16, 1–10. 

Olshausen, Bruno A., & Field, David J. (1995). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. 

O’Reilly, Randall C., & Munakata, Yuko (2000). Computational explorations in cognitive neuroscience. MIT Press.

Ostwald, Dirk, Kirilina, Evgeniya, Starke, Ludger, & Blankenburg, Felix (2014). A tutorial on variational Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60, 1–19. 

Rao, Rajesh P. N., & Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87. 

Strogatz, Steven (1994). Nonlinear dynamics and chaos. Westview Press. 

 