# Optimal Bayesian Experimental Design

R. D. McMichael 
rmcmichael@nist.gov  
National Institute of Standards and Technology  
Gaithersburg, MD  USA
March 29, 2019

## Introduction

This manual describes an implementation of optimal Bayesian experimental design methods.  These methods address routine measurements where data are fit to experimental models in order to obtain model parameters.  The twin benefits of these methods are reduced uncertainty and fewer required measurements.  They are therefore most beneficial where measurements are expensive in terms of money, time, risk, labor and/or discomfort.  The price for these benefits lies in the complexity of automating such measurements and in the computational load required.  It is the goal of this package to assist potential users in overcoming at least the programming hurdles.

Optimal Bayesian experimental design is not new, at least not in the statistics community.  A review paper from 1995 by [Kathryn Chaloner and Isabella Verdinelli](https://projecteuclid.org/euclid.ss/1177009939) reveals that the basic methods had been worked out in preceding decades.  The methods implemented here closely follow [Xun Huan and Youssef M. Marzouk](http://dx.doi.org/10.1016/j.jcp.2012.08.013), who emphasize simulation-based experimental design.  Optimal Bayesian experimental design is also an active area of research.

There are at least three important factors that encourage application of these methods today.  First, flexible, modular computer languages such as Python are widely available.  Second, computational power is cheap.  Most of all, though, awareness of the benefits of code sharing and reuse is growing in scientific communities, and the sharing is facilitated by websites such as SourceForge, Bitbucket and GitHub.


## A sampling

### Locating a Lorentzian peak

![Comparison of measure-then-fit and optimal Bayesian methods](./demoLorentzfig1.png "lorentz demo")

Figure 1:  A comparison of measure-then-fit (left) and optimal Bayesian experimental design (right).  Both methods measure the same "true" peak with Gaussian noise added ($\sigma = 1$) independently.  The peak parameters are selected randomly: center between 2 and 4, height between 2 and 5, background between -1 and 1 and peak width between 0.1 and 0.3.  On the left, 30 evenly-spaced "measurements" are made and fit using `scipy.optimize.curve_fit()`.  A curve using the best-fit parameters is plotted for comparison with the true curve.  Each diagonal element of the covariance matrix is taken as the square of the corresponding parameter uncertainty.  On the right, the optimal Bayes experimental design method is used sequentially.  Iterations stop when the standard deviation of the `x0` peak center distribution is less than the uncertainty of the fit on the left.  Green curves correspond to random draws from the parameter distribution at the stopping point.  Typical runs of this example require something like 1/4 to 1/2 as many measurements as the measure-then-fit method.

### Tuning a $\pi$ pulse
![2D settings space](./pipulsefig2.png "pi pulse demo")

Figure 2: A $\pi$ pulse is a method of inverting spins that is frequently used in nuclear magnetic resonance (NMR and MRI) and pulsed electron paramagnetic resonance (EPR).  In order to be accurate, the duration and frequency of the pulse must be tuned.  On the left, the background image displays the model photon counts for optically detected spin manipulation for different detunings and pulse lengths.  White indicates the expected result for spin up and black, spin down.  Points indicate simulated measurement settings, with sequence in order from white to dark red.  Simulated measurements have 1$\sigma$ uncertainties of 100.  The right panel shows the probability distribution function after 50 measurements, with the red dot at the simulated "true" value.  The grey line shows the path of the probability maximum.

## Philosophy and attitude

> If it sounds good, it is good
>> Duke Ellington

The goals of this package are quite modest: to adapt some of the developments in optimal Bayesian experimental design research for practical use in laboratory settings.

- If it's a struggle to use, it can't run good.
- If it's too full of technical jargon to understand, it can't run good.
- If the user finds it useful, it runs good.
- If it runs good, it is good.

## Requirements for users

It takes a little effort to get this software up and running.  Here's what a user will need to supply to get started.

1. An experiment that yields measurement results with uncertainty estimates. 
2. A model for the experiment - typically a function with parameters to be determined. 
3. A working knowledge of Python programming - enough to follow examples and program your own model.



## Theory of operation

The optimal Bayes experimental design method incorporates two main jobs, which we can describe as "learning fast" and "making good decisions".

### Learning fast

The learning process is a straightforward application of the well-known Bayesian inference method.  If that last sentence made perfect sense to you, feel free to skip ahead.  For the rest of us we'll start in the logical place, ~~at the beginning~~ in the middle.

We do measurements in order to learn things.  At the very beginning, before we start measuring, we may have some knowledge or experience, but we expect to have better information after we measure.  There are phenomena like drift that can spoil things, but in general, we expect to learn something from each measurement.  Next, we're going to look at how the knowledge gets better as a new measurement result is digested.  For that we need to use some technical language.

We'll express our knowledge of the set of model parameters $\theta$ as a probability distribution function $p(\theta)$.  If $p(\theta)$ is a broad distribution, then we really don't know the values very well, and if $p(\theta)$ is narrow, the uncertainty is small.  When we get a measurement result $m$ at settings $x$, the results should have some influence on $p(\theta)$.  

So what's $m$, exactly? It's a package that includes experimental settings and measured values including uncertainty estimates.  This software assumes that your experiment yields mean values and standard deviations, which is a shorthand way of saying that the noise in your experiment follows a Gaussian distribution.

>If your measurements don't have uncertainty, you might be a redneck.  

When we make a new measurement $m$ we want to know the new probability distribution $p(\theta|m)$ after we have taken $m$ into account.  The vertical bar in the notation $p(\theta|m)$ indicates a conditional probability, the distribution of $\theta$ values given $m$.  Bayes' theorem gives us
    $$ p(\theta|m) = \frac{p(m|\theta) p(\theta)}{p(m)}. $$
   
It's easy to write down Bayes theorem.  At least for me, the challenge is understanding what the symbols mean.  All of the terms here have special names.  The left side is the _posterior_ distribution, i.e. the distribution after we include $m$. Distribution $p(\theta)$ is the _prior_, representing what we knew about the parameters $\theta$ before the measurement. In the denominator, $p(m)$ is the _evidence_, but $m$ is a permanent record of numbers (for now), and its probability is a constant.  The term that we need to focus on, $p(m|\theta)$, is called the _likelihood_.  It's the probability of getting the particular combination of settings and results $m$ given different parameter values.  

It's worthwhile to take a little time to get acquainted with the _likelihood_, so let's look at the likelihood's arguments first.  Note that the measurement $m$ consists of numbers that came from the instruments, so $m$ is fixed.  But the parameters $\theta$ are variable, so $p(m|\theta)$ behaves like a function of $\theta$.  A particular measurement result $m$ is more likely for some parameters than for others.  So the likelihood answers this question: for different parameters $\theta$, what's the probability that our measurement will yield the value $m$?

In constructing $p(m|\theta)$, we're answering a question about how we expect our system to behave, expressed as a probability.  We take the case where our system has an explicit model
    $$ P(y) = f(x, \theta), $$
which says that given experimental settings $x$ and sample parameters $\theta$, we know our system well enough to predict the distribution of measurement results $y$ as $P(y)$.  If our experiment demonstrates Gaussian noise, we can write
    $$ P(y) \propto \exp\left[-\frac{(y-\bar{y}(x, \theta))^2}{2\sigma_y^2}\right]. $$
    
Now we know how to update our "knowledge" of parameters $\theta$ expressed as a probability distribution $p(\theta)$.

1. Collect measurement data including settings $x$, measurement values $y$ and measurement uncertainties $\sigma_y$.
2. For all values of $\theta$, calculate the model's prediction of the mean measurement result, $\bar{y}(x, \theta)$.
3. For all values of $\theta$, either
   - multiply $p(\theta)$ by the likelihood $\exp[-(y-\bar{y}(x, \theta))^2/(2\sigma_y^2)]$, or
   - add $-(y-\bar{y}(x, \theta))^2/(2\sigma_y^2)$ to $\ln p(\theta)$.
4. Normalize.
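The update steps above can be sketched on a discrete grid of parameter values.  This is only a minimal illustration, not the package's implementation: the one-parameter straight-line model, the grid, the noise level and the names `model` and `bayes_update` are all assumptions made for the example.

```python
import numpy as np

def model(x, theta):
    """Hypothetical one-parameter model: a line through the origin with slope theta."""
    return theta * x

def bayes_update(prior, theta_grid, x, y, sigma_y):
    """Return the normalized posterior on theta_grid after one measurement (x, y, sigma_y)."""
    # Step 2: predicted mean result for every candidate parameter value
    y_bar = model(x, theta_grid)
    # Step 3: add the Gaussian log-likelihood to the log-prior
    log_post = np.log(prior) - (y - y_bar) ** 2 / (2 * sigma_y ** 2)
    # Step 4: normalize (subtract the maximum first so exp() cannot underflow everywhere)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Assumed setup: flat prior on a grid, simulated noisy measurements
theta_grid = np.linspace(0.0, 2.0, 201)
prior = np.full(theta_grid.size, 1.0 / theta_grid.size)
rng = np.random.default_rng(0)
true_theta, sigma_y = 1.3, 0.5

posterior = prior
for x in (1.0, 2.0, 3.0):
    # Step 1: "collect" a measurement (simulated here)
    y = model(x, true_theta) + rng.normal(0.0, sigma_y)
    posterior = bayes_update(posterior, theta_grid, x, y, sigma_y)

mean = np.sum(theta_grid * posterior)
std = np.sqrt(np.sum((theta_grid - mean) ** 2 * posterior))
print(f"posterior mean = {mean:.3f}, standard deviation = {std:.3f}")
```

Working with logarithms, the second option in step 3, keeps the arithmetic stable when many measurements have been accumulated and the raw probabilities become very small.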

We just made several important assumptions:
 - that a known function of settings $x$ and unknown parameters $\theta$ describes our experimental results, and
 - that the noise in our measurement is Gaussian with standard deviation $\sigma_y$.

On one hand we have to admit that these assumptions don't allow us to address all important cases.  On the other hand, these are the same assumptions we make in doing least-squares curve fitting.


### Making good decisions

The next important job in the process is figuring out the settings to use for the next measurement.  It helps to define what we're trying to accomplish in mathematical terms.  At least part of our goal is to make the parameter probability distribution $p(\theta)$ as narrow as possible while minimizing cost or time spent.  We might have more than one purpose for the measurements, for example to fit a model, and also to demonstrate the fidelity of the model.

The challenge, then, is to develop a _utility function_ $U(x)$ that helps us to predict and compare the relative benefit/cost ratio of different possible experimental settings $x$.

First, an appeal to intuition.  Our system model describes a connection between parameter values $\theta$ and measurement results $y$.  Different measurements, e.g. from different samples, will yield different parameter values.  Similarly, different parameter values predict different measurements.  So, intuitively, if we want to constrain the parameter values, it would do the most good to "pin down" the measurement at the settings $x$ where the predicted variations in $y$ are the largest.  Measure where the predicted results are most strongly coupled to the parameters $\theta$.  Our approach to making good decisions about measurement settings goes like this:

1. For random draws $\theta_i$ of parameters from the distribution $p(\theta)$, use the model to predict $y_i(x)$ for every possible setting $x$.
2. Calculate a measure of the spread in $y_i$ values.
3. Pick a measurement setting with a large spread.
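As a rough sketch, the three steps above might look like the following.  The Lorentzian model, the flat parameter ranges (borrowed from the Figure 1 example), the reading of `width` as a half width at half maximum, and all names are assumptions for illustration, not the package's API.

```python
import numpy as np

def lorentzian(x, x0, height, background, width):
    """Hypothetical peak model; width is taken to be the half width at half maximum."""
    return background + height / (1.0 + ((x - x0) / width) ** 2)

rng = np.random.default_rng(1)
n_draws = 500

# Step 1a: random parameter draws, here from a broad, flat p(theta)
x0 = rng.uniform(2.0, 4.0, n_draws)
height = rng.uniform(2.0, 5.0, n_draws)
background = rng.uniform(-1.0, 1.0, n_draws)
width = rng.uniform(0.1, 0.3, n_draws)

# Step 1b: model predictions y_i(x) for every candidate setting x.
# Broadcasting gives an array of shape (n_draws, n_settings).
x_settings = np.linspace(1.0, 5.0, 201)
y_pred = lorentzian(x_settings[np.newaxis, :],
                    x0[:, np.newaxis], height[:, np.newaxis],
                    background[:, np.newaxis], width[:, np.newaxis])

# Step 2: spread of the predictions at each setting
spread = y_pred.std(axis=0)

# Step 3: pick a setting where the spread is large
x_next = x_settings[np.argmax(spread)]
print(f"next measurement at x = {x_next:.2f}")
```

With a prior this broad, the spread is largest somewhere in the range where the peak center might lie; as the distribution narrows, the preferred settings concentrate where the remaining parameter uncertainty changes the prediction most.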

To translate such a qualitative argument into code, a good place to start is to clarify what we mean by "doing the most good" in refining our parameter distribution $p(\theta)$.  When we determine model parameters, usually the goal is to get results with small uncertainty.  But here we're thinking in terms of a distribution $p(\theta)$.  Information theory gives us information entropy as a way to quantify the sharpness of a probability distribution.  The information entropy of a probability distribution $p(a)$ is defined as  
 $$ E = -\int da\; p(a)\; \ln[p(a)] $$  
Note that the integrand is zero for both $p(a) = 1$ and $p(a)=0$.  It's the intermediate values encountered in a spread-out distribution where the information entropy accumulates.  For common distributions, like rectangular or Gaussian, that have characteristic widths $w$ the entropy goes like $\ln(w) + C$.
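The $\ln(w) + C$ behavior is easy to check numerically.  The sketch below (function names and grid are assumptions chosen for the example) integrates $-p \ln p$ on a fine grid for Gaussian and rectangular distributions of several widths.  Analytically, a Gaussian with standard deviation $w$ has $E = \ln(w) + \frac{1}{2}\ln(2\pi e)$, with $\frac{1}{2}\ln(2\pi e) \approx 1.4189$, and a rectangle of full width $w$ has $E = \ln(w)$, so the printed differences should be nearly independent of $w$.

```python
import numpy as np

def entropy(p, a):
    """Riemann-sum estimate of -integral p ln(p) da on a uniform grid a."""
    da = a[1] - a[0]
    integrand = np.zeros_like(p)
    nonzero = p > 0  # p ln(p) -> 0 as p -> 0, so skip exact zeros
    integrand[nonzero] = -p[nonzero] * np.log(p[nonzero])
    return integrand.sum() * da

def gaussian(a, sigma):
    """Normalized Gaussian density with standard deviation sigma."""
    return np.exp(-a ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def rectangle(a, w):
    """Normalized rectangular density of full width w centered at zero."""
    return np.where(np.abs(a) < w / 2, 1.0 / w, 0.0)

a = np.linspace(-50.0, 50.0, 200001)

for w in (0.5, 1.0, 2.0, 4.0):
    e_gauss = entropy(gaussian(a, w), a)
    e_rect = entropy(rectangle(a, w), a)
    print(f"w = {w}: Gaussian E - ln(w) = {e_gauss - np.log(w):.4f}, "
          f"rectangle E - ln(w) = {e_rect - np.log(w):.4f}")
```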

We adopt the information entropy as our measure of $p(\theta)$ sharpness, and that makes it possible to estimate how much entropy change, $E$(posterior) - $E$(prior), we might get for predicted measurement values $y$ at different settings $x$.  Actually, the statisticians use something slightly different called the Kullback-Leibler divergence, averaged here over the predicted measurement values $y$:
$$ D^{KL} = \int dy\; p(y|x) \int d\theta\; p(\theta |y,x)\ln \left[ \frac{p(\theta | y,x)}{p(\theta)}\right] $$  
Here, we're using $y,x$ to denote predicted measurements at potential settings $x$ instead of the $m$ we used for a completed measurement in the learning section above.

The result for each potential setting $x$ is the difference between two information entropy values, the first minus the second:

1. The entropy of the $y$ distribution due to both measurement uncertainty and uncertainty in $\theta$,
   $$ -\int dy\; p(y|x) \ln[p(y|x)] $$
   with
   $$ p(y|x) = \int d\theta'\; p(\theta') p(y|\theta',x) $$
2. The entropy of the $y$ distribution due to measurement uncertainty alone, averaged over $\theta$ values,
   $$ -\int d\theta\; p(\theta) \int dy\; p(y|\theta,x) \ln [ p(y|\theta, x) ] $$

This information entropy difference is the estimated improvement in the $\theta$ distribution for different setting values, and the best setting choice for the next measurement is the one that maximizes the information change.
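For a feel of what a direct estimate involves, here is a rough Monte Carlo sketch of the entropy difference at a single setting.  It assumes Gaussian measurement noise, so the noise-only entropy is the same for every $\theta$ and equals $\frac{1}{2}\ln(2\pi e \sigma_y^2)$.  The function `entropy_difference`, the linear `model` and the Gaussian $p(\theta)$ are hypothetical choices for the example and are not taken from the package.

```python
import numpy as np

def entropy_difference(y_bar, sigma_y, rng):
    """Monte Carlo estimate of the entropy difference at one setting x.

    y_bar   : model predictions ybar(x, theta_i) for draws theta_i from p(theta)
    sigma_y : standard deviation of the (assumed Gaussian) measurement noise
    """
    n = y_bar.size
    # Simulate one noisy measurement per parameter draw: samples from p(y|x)
    y_sim = y_bar + rng.normal(0.0, sigma_y, n)
    # Estimate p(y_i|x) as the likelihood of y_i averaged over all parameter draws
    diff = y_sim[:, np.newaxis] - y_bar[np.newaxis, :]
    like = np.exp(-diff ** 2 / (2 * sigma_y ** 2)) / (sigma_y * np.sqrt(2 * np.pi))
    p_y = like.mean(axis=1)
    # Entropy of y from parameter and measurement uncertainty combined
    entropy_total = -np.mean(np.log(p_y))
    # Entropy of y from Gaussian measurement noise alone (independent of theta)
    entropy_noise = 0.5 * np.log(2 * np.pi * np.e * sigma_y ** 2)
    return entropy_total - entropy_noise

def model(x, theta):
    """Hypothetical model: a line through the origin with slope theta."""
    return theta * x

rng = np.random.default_rng(2)
theta = rng.normal(1.0, 0.2, 1000)  # draws from an assumed p(theta)
sigma_y = 0.5

u_flat = entropy_difference(model(0.0, theta), sigma_y, rng)   # prediction ignores theta
u_steep = entropy_difference(model(5.0, theta), sigma_y, rng)  # prediction depends on theta
print(f"U(x=0) = {u_flat:.3f}, U(x=5) = {u_steep:.3f}")
```

At $x = 0$ the prediction does not depend on $\theta$, so the estimate should scatter around zero.  For this linear model with Gaussian $p(\theta)$ the exact value at $x = 5$ is $\frac{1}{2}\ln 5 \approx 0.80$.  Note that the likelihood array has one entry for every pair of simulated measurement and parameter draw, and the calculation has to be repeated for every candidate setting.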

If we were to implement this result directly, the next step would be to estimate the entropy integrals using random draws from $p(\theta)$.  To do this, we would need to simulate measurement results for enough samples from $p(\theta)$ to make a good estimate of $p(y|x)$ at every setting, and that task is computationally intensive.  In keeping with our "runs good" philosophy, let's examine the risks in taking a mathematical shortcut:

| Risk | Consequence |
| :---- | :----------- |
| Bad estimate of $\Delta E$ | May affect stopping criterion |
| Frequent good, but not optimum choice of $x$  | Reduced performance |
| Occasional stupid choice of $x$       | Reduced performance |
| Slow calculations to choose $x$       | Doesn't run good  |

We don't need precise values, we just need to compare the utility of different settings.  Even if we don't choose the absolute best setting, a "pretty good" choice will do more good than an uninformed choice.  The only really bad possibility is the risk that the software will run too slowly to be useful.   

Having given ourselves permission to guess, let's start guessing.  Information entropy of simple distributions (maybe not bimodal etc.) goes like $\ln$(width), and at least in normal distributions, the widths add in quadrature.  Dropping a constant factor of 1/2, which doesn't change the ranking of settings, we get
$$U(x) \approx \ln\left[\sigma_\theta(x)^2 + \sigma_y(x)^2\right] - \ln\left[\sigma_y(x)^2\right] = 
\ln\left[\frac{\sigma_\theta(x)^2}{\sigma_y(x)^2}+1\right]$$
In addition to making unproven leaps, this expression also assumes that it's possible to think of model output distributions in terms of widths due to parameter variations, $\sigma_\theta$, and widths due to measurement uncertainty, $\sigma_y$.  The dependence on $\sigma_\theta$ matches our initial intuitive argument.  The measurement noise $\sigma_y$ appears via the information entropy derivation, but it also has an intuitive interpretation: it's less useful to make measurements at settings $x$ where the instrumental noise is larger.
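A minimal sketch of this approximate utility follows.  Here $\sigma_\theta(x)^2$ is estimated as the variance of model predictions over random parameter draws.  The function name `approx_utility`, the linear model and the two noise profiles are assumptions for illustration, not the package's API.

```python
import numpy as np

def approx_utility(y_pred, sigma_y):
    """Approximate utility ln(sigma_theta(x)^2 / sigma_y(x)^2 + 1) for each setting.

    y_pred  : array of shape (n_draws, n_settings), model predictions for
              parameter draws from p(theta)
    sigma_y : measurement noise, a scalar or an array of shape (n_settings,)
    """
    var_theta = np.var(y_pred, axis=0)  # prediction variance from parameter spread
    return np.log(var_theta / np.asarray(sigma_y, dtype=float) ** 2 + 1.0)

rng = np.random.default_rng(3)
theta = rng.normal(1.0, 0.2, 500)               # draws from an assumed p(theta)
x = np.linspace(0.0, 5.0, 51)                   # candidate settings
y_pred = theta[:, np.newaxis] * x[np.newaxis, :]  # hypothetical model y = theta * x

# Constant noise: the best setting is where the predictions vary the most
x_best_const = x[np.argmax(approx_utility(y_pred, 0.5))]

# Noise that grows quickly with x pulls the best setting back toward small x
sigma_grow = 0.1 + x ** 2
x_best_grow = x[np.argmax(approx_utility(y_pred, sigma_grow))]

print(f"constant noise: x = {x_best_const:.1f}, growing noise: x = {x_best_grow:.1f}")
```

The whole calculation is a variance over draws and a logarithm per setting, which is the kind of cost the shortcut is meant to buy.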





## Missing pieces

A. In some situations, the model predicts a distribution of measured values, separate from the measurement uncertainty.  Quantum mechanics for example. How do we handle those?

B. So far, measurement noise enters with measurement data, consistent with the notion that all measurements should be provided with uncertainty values.  

C. How do we handle situations with multiple measurements at once, like voltage and current, each with its own uncertainty?
