# Interactive notebook for using `helmpy` for parameter inference from data

The mean field (or 'deterministic') models run with the `helmpy.run_meanfield` method provide an efficient way to perform approximate posterior parameter inference with respect to some dataset(s) using the `helmpy.fit_data` method. Before we illustrate how to do this, it is informative to describe the theoretical background needed to perform this inference. Note that this notebook assumes prior familiarity with the interface of `helmpy`, so it is suggested that one reads through helmpy_examples.ipynb before continuing here.

## 1. Theoretical background

Here we outline the basic theoretical background for performing Bayesian parameter inference using mean field helminth models under the assumption that the system is close to a state of endemic equilibrium. Bear in mind that inference with the mean field model will typically underestimate the variance of the posterior over parameters in comparison to the fully individual-based stochastic model, so it should be used when population sizes are large enough that this additional variance is small.

The formalism we will outline here assumes that the diagnostic data are either Kato-Katz counts (for _Ascaris lumbricoides, Trichuris trichiura, Schistosoma mansoni_ and hookworm diagnostic testing) or urine filtration counts (for _Schistosoma haematobium_ diagnostic testing). As ever in any canonical Bayesian problem, specification of the likelihood function ${\cal L}$ is not the end of the story. To infer a full joint posterior distribution ${\cal P}$ given a dataset ${\cal D}$ over the collection of transmission and diagnostic parameters $\{ R_0(a), M (a,t_0),k,z,\lambda_{\rm d},k_{\rm d} \}$, Bayes' rule here reads 

$${\cal P}[ R_0(a), M (a,t_0),k,z,\lambda_{\rm d},k_{\rm d}  \vert {\cal D}] = \frac{1}{{\cal E}} \pi [R_0(a)]\, \pi [M (a,t_0)] \, \pi (k) \, \pi (z) \, \pi (\lambda_{\rm d})\, \pi (k_{\rm d}) \,{\cal L}[{\cal D}\vert R_0(a), M (a,t_0),k,z,\lambda_{\rm d},k_{\rm d} ] \,,$$

where ${\cal E}$ is a normalisation constant and $\pi [R_0(a)]$, $\pi [M (a,t_0)]$, $\pi (k)$, $\pi (z)$, $\pi (\lambda_{\rm d})$ and $\pi (k_{\rm d})$ are the prior distributions over $R_0(a)$, $M (a,t_0)$, $k$, $z$, $\lambda_{\rm d}$ and $k_{\rm d}$. These are, respectively: the age-dependent contributions to the basic reproduction number; the age-dependent initial condition for the mean total worm burden per individual (which, when combined with $R_0(a)$ and the other parameters, specifies the dynamics $M(a,t)$); the worm aggregation parameter; the density-dependent fecundity factor; the number of diagnostically-detected eggs per female worm; and the measured diagnostic aggregation parameter (all assumed to be independent of each other _a priori_). Kato-Katz and urine filtration counts typically follow a distribution which appears to be negative binomial in shape. By computing the mean diagnostically-detected egg count $\hat{{\sf e}}_{\rm d}=\frac{\lambda_{\rm d}}{2}\hat{{\sf e}}$ from the parameters $\{ R_0(a), M (a,t_0),k,z,\lambda_{\rm d},k_{\rm d}\}$, the likelihood which will be used for the inference of these parameters is therefore likely to be well-approximated by

$${\cal L}[{\cal D}\vert  R_0(a), M (a,t_0),k,z,\lambda_{\rm d},k_{\rm d} ] = \prod_{\forall {\sf e}_i\in {\cal D}}{\rm NB}\bigg\{ {\sf e}_i; \frac{\lambda_{\rm d}}{2}\hat{{\sf e}}[M(a,t),k,z],k_{\rm d}\bigg\} \,,$$

where $M(a,t)$ is the age and time-dependent total mean worm burden which can be fully specified from $R_0(a)$, the other parameters, and the initial conditions $M (a,t_0)$. 
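
As a minimal sketch of how this likelihood could be evaluated (this is not taken from the `helmpy` source), the negative binomial can be written in terms of its mean and aggregation parameter and summed in log space. The function names `nb_logpmf` and `log_likelihood` are hypothetical, and the model value $\hat{{\sf e}}$ is assumed to have been computed already and to be strictly positive.

```python
import math

def nb_logpmf(count, mean, k):
    """Log-probability of `count` for a negative binomial with the given
    mean (> 0) and aggregation parameter k (> 0)."""
    return (math.lgamma(count + k) - math.lgamma(k) - math.lgamma(count + 1)
            + k * math.log(k / (k + mean))
            + count * math.log(mean / (k + mean)))

def log_likelihood(egg_counts, ehat, lambda_d, k_d):
    """Sum of log NB{e_i; (lambda_d / 2) * ehat, k_d} over all counts e_i.

    `ehat` is the model first moment of the egg count (assumed precomputed)."""
    mean_detected = 0.5 * lambda_d * ehat
    return sum(nb_logpmf(e, mean_detected, k_d) for e in egg_counts)

# Illustrative call with made-up counts and parameter values
example_value = log_likelihood([0, 0, 12, 3, 0, 48], ehat=4.0, lambda_d=3.05, k_d=0.3)
```

Working with the log-likelihood rather than the product itself avoids numerical underflow when the dataset contains many counts.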

For the motivation behind the calculation of the proportionality factor for the first moment of the egg count distribution $\hat{{\sf e}}$, one should refer to, e.g., Anderson & May, 1991 or [https://www.sciencedirect.com/science/article/pii/S002251931930445X]. In the case of STH (fully polygamous male worms), for example, this first moment is analytic (it is not in the case of monogamous schistosomes):

$$\hat{{\sf e}}_{\rm STH}[M(a,t),k,z] = \phi [M(a,t);k,z]\, f [M(a,t);k,z] M(a,t)$$

$$f[M(a,t);k,z] \equiv \left[ 1+(1-z)\frac{M(a,t)}{k}\right]^{-(k+1)} $$

$$\phi [M(a,t);k,z] \equiv 1-\left[ \frac{1+(1-z)M(a,t)/k}{1+(2-z)M(a,t)/(2k)}\right]^{k+1} \,.$$
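
The three expressions above translate directly into code. The following is a small sketch with hypothetical function names (`f_density`, `phi_mating` and `ehat_sth` are not part of the `helmpy` interface) that evaluates $\hat{{\sf e}}_{\rm STH}$ for a given mean worm burden.

```python
def f_density(M, k, z):
    """Density-dependent fecundity factor f[M; k, z]."""
    return (1.0 + (1.0 - z) * M / k) ** (-(k + 1.0))

def phi_mating(M, k, z):
    """Mating probability factor phi[M; k, z] for fully polygamous worms."""
    ratio = (1.0 + (1.0 - z) * M / k) / (1.0 + (2.0 - z) * M / (2.0 * k))
    return 1.0 - ratio ** (k + 1.0)

def ehat_sth(M, k, z):
    """First moment of the egg count distribution for STH."""
    return phi_mating(M, k, z) * f_density(M, k, z) * M

# Illustrative evaluation using z = exp(-0.005) ~ 0.995 and a made-up k
example_ehat = ehat_sth(M=5.0, k=0.3, z=0.995)
```

Two limits are easy to read off: $\hat{{\sf e}}_{\rm STH}$ vanishes at $M=0$, and when $z=1$ the density dependence drops out ($f=1$) so that only the mating factor reduces the egg output.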

From Anderson & May, 1991, the mean field (or 'deterministic') transmission dynamics of the helminth infections considered here (STH and schistosomes) with age structure can be described by the following system 

$$\frac{\partial M}{\partial t} + \frac{\partial M}{\partial a} = \Lambda (a,t) - \mu_1 M(a,t) \,.$$

This equation may be converted to a differential equation with respect to time only, while discretising the mean worm burden into age bins $\{ a_i\}$, by integrating over $a$ using a survival rate kernel $S(a)$ like so

$$M_i(t) \equiv M(a_i,t) = \frac{\int^{a_{i+1/2}}_{a_{i-1/2}}{\rm d}a \, M(a,t) S(a)}{\int^{\infty}_{0}{\rm d}a S(a)} \,.$$

Choosing $S(a) = e^{-\mu a}$ (where $\mu$ is the human death rate) and assuming an age-constant $\Lambda (a_i,t)$ within the bin (as we have assumed before in the fitting procedure), one may obtain the following first-order differential equation corresponding to the dynamics in the $i$-th age bin

$$\frac{{\rm d} M_i}{{\rm d} t} = \Lambda (a_i,t) - (\mu + \mu_1)M_i(t) \,. \qquad \qquad (1)$$

To obtain the equation above, we have assumed that the boundary flux between age bins must vanish

$$\left.\frac{\partial M_i}{\partial a} \right\vert_{a_{i+1/2}}=0 \,,$$

due to an approximately instantaneous change in the worm burden of the individual (as a consequence of individuals changing force of infection $\Lambda (a_i,t) \rightarrow \Lambda (a_{i+1},t)$). Note that this also sets the other boundary flux

$$\left.\frac{\partial M_i}{\partial a} \right\vert_{a_{i-1/2}}=\left.\frac{\partial M_{i-1}}{\partial a} \right\vert_{a_{i-1/2}}=\left.\frac{\partial M_{(i-1)}}{\partial a} \right\vert_{a_{(i-1)+1/2}}=0 \,.$$

As a side note: in a fully stochastic individual-based model, the change in the expected worm burden will occur over a timescale of $1/\mu_1$, so the age bin widths must be wider than this timescale for this approximation to remain accurate. Note also that the birth rate into the first age bin should be set to $\mu$ to match the simulation.

Similarly, one may obtain an equation for the age-binned dynamics (see [https://www.medrxiv.org/content/10.1101/2019.12.17.19013490v1 ]) of the force of infection $\Lambda_i \equiv \Lambda (a_i,t)$ 

$$\frac{{\rm d}\Lambda_i}{{\rm d}t} = \mu_2(\mu + \mu_1)R_{0,i}\bigg\{ \sum_{j=1}^{N_a}\frac{N_j}{N_{\rm tot}}\hat{{\sf e}}[M_j(t),k,z] \bigg\} - \mu_2 \Lambda_i\,. \qquad \qquad (2)$$

The solution to the system $(1)$ and $(2)$ can hence be inserted into the negative binomial likelihood ${\rm NB}\{ {\sf e}_i; \frac{\lambda_{\rm d}}{2}\hat{{\sf e}}[M(a,t),k,z],k_{\rm d}\}$ to perform the inference. Note also that given the rapid equilibration of the infectious reservoir ${\rm d}\Lambda_i /{\rm d}t\rightarrow 0$, we need not specify $\Lambda_i(t_0)$ independently in the inference, but instead may identify 

$$\Lambda_i(t) = (\mu + \mu_1)R_{0,i}\sum_{j=1}^{N_a}\frac{N_j}{N_{\rm tot}} \, \hat{{\sf e}}[M_j(t),k,z] \,,$$

where $N_i$ is the number of people within an age group (and $N_{\rm tot}$ in total) and $R_{0,i}=R_0(a_i)$ is an age-dependent coefficient which contributes to the basic reproduction number in this bin. By inserting $\Lambda (a_i,t)$ into the equation for ${\rm d} M_i/{\rm d} t$ above, the nonlinear dynamical system of equations that this generates is 

$$\frac{{\rm d} M_{i}}{{\rm d} t} = (\mu +\mu_1) R_{0,i}\sum_{j=1}^{N_a}\left\{ \frac{N_j}{N_{\rm tot}}\, \hat{{\sf e}}[M_j(t),k,z] \right\} - (\mu +\mu_1)M_i(t) \,,$$

where the value of the overall $R_0$ may be obtained through the relation

$$R_{0}=\frac{1}{N_{\rm tot}}\sum^{N_a}_{i=1} N_iR_{0,i} \,.$$

Note also that, at equilibrium $M(a_i,t)\rightarrow M(a_i)\,\, \forall i$, the value of $R_0$ is constrained to

$$R_{0} = \frac{\sum_{i=1}^{N_a} N_iM_i}{\sum_{j=1}^{N_a} N_j \,\hat{{\sf e}}(M_j,k,z)} \,.$$
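
To make the reduced system concrete, here is a rough sketch that integrates the ${\rm d}M_i/{\rm d}t$ equation to equilibrium and then checks the constraint on $R_0$ above. The explicit Euler scheme, the function names and the values of $R_{0,i}$ and $k$ are all assumptions chosen for illustration; this is not a description of how `helmpy.run_meanfield` is implemented.

```python
def ehat_sth(M, k, z):
    """First moment of the STH egg count distribution, phi * f * M."""
    f = (1.0 + (1.0 - z) * M / k) ** (-(k + 1.0))
    ratio = (1.0 + (1.0 - z) * M / k) / (1.0 + (2.0 - z) * M / (2.0 * k))
    return (1.0 - ratio ** (k + 1.0)) * f * M

def integrate_mean_field(M0, R0_groups, N_groups, k, z, mu, mu1, runtime, timestep):
    """Explicit Euler integration of dM_i/dt with the reservoir at quasi-equilibrium."""
    N_tot = float(sum(N_groups))
    M = list(M0)
    for _ in range(int(round(runtime / timestep))):
        # Population-weighted egg output shared by all age groups
        reservoir = sum(N_j / N_tot * ehat_sth(M_j, k, z)
                        for N_j, M_j in zip(N_groups, M))
        M = [M_i + timestep * (mu + mu1) * (R0_i * reservoir - M_i)
             for M_i, R0_i in zip(M, R0_groups)]
    return M

def r0_at_equilibrium(M_eq, N_groups, k, z):
    """Overall R0 implied by the equilibrium worm burdens."""
    numerator = sum(N_i * M_i for N_i, M_i in zip(N_groups, M_eq))
    denominator = sum(N_j * ehat_sth(M_j, k, z) for N_j, M_j in zip(N_groups, M_eq))
    return numerator / denominator

# Illustrative (made-up) two-group configuration
R0_groups = [2.0, 3.0]
N_groups = [250, 250]
M_eq = integrate_mean_field([1.0, 1.0], R0_groups, N_groups, k=0.3, z=0.995,
                            mu=0.014, mu1=0.5, runtime=200.0, timestep=0.02)
R0_direct = sum(N * R for N, R in zip(N_groups, R0_groups)) / float(sum(N_groups))
R0_constraint = r0_at_equilibrium(M_eq, N_groups, k=0.3, z=0.995)
```

Once the integration has converged, `R0_constraint` should agree with `R0_direct`, and the equilibrium burdens should be in the same ratio as the $R_{0,i}$ because every group shares the same reservoir term.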

To include migration between clusters in the inference, the ${\rm d}\Lambda_i/{\rm d}t$ equation above would be modified by the expectations of compound Poisson processes modelling the net ingoing and outgoing eggs/larvae (see: [https://www.sciencedirect.com/science/article/pii/S002251931930445X]). We will not handle this case here though.

## 2. Setup with mock data

Before illustrating how to use `helmpy` for inference, first we must import it... 

In [None]:
import sys
path_to_helmpy = '/Users/Rob/work/helmpy' # Give your path to helmpy here
sys.path.append(path_to_helmpy + '/source/') 
from helmpy import helmpy
import time

# These modules are not necessary to run helmpy alone but will be useful for our demonstrations

# LEAVE THESE IMPORTS COMMENTED AS THEY ARE FOR PRODUCING LaTeX-STYLE FIGURES ONLY
#import matplotlib as mpl
#mpl.use('Agg')
#mpl.rc('font',family='CMU Serif')
#mpl.rcParams['xtick.labelsize'] = 15
#mpl.rcParams['ytick.labelsize'] = 15
#mpl.rcParams['axes.labelsize'] = 20
#from matplotlib import rc
#rc('text',usetex=True)
#rc('text.latex',preamble=r'\usepackage{mathrsfs}')
#rc('text.latex',preamble=r'\usepackage{sansmath}')
# LEAVE THESE IMPORTS COMMENTED AS THEY ARE FOR PRODUCING LaTeX-STYLE FIGURES ONLY

import numpy as np
import matplotlib.pyplot as plt

...and make up some mock data in 2 age categories to use (here we shall assume full Kato-Katz intensity counts)...

In [None]:
# Mean egg counts from data
meanegg_age1 = 10.0
meanegg_age2 = 20.0
# Variance of egg count data
varegg = 3000.0
# Kato-Katz samples drawn for each age group
kksamps_age1 = np.random.negative_binomial(meanegg_age1**2.0/np.abs(varegg-meanegg_age1),meanegg_age1/varegg,size=150)
kksamps_age2 = np.random.negative_binomial(meanegg_age2**2.0/np.abs(varegg-meanegg_age2),meanegg_age2/varegg,size=150)
# Combine Kato-Katz samples into list
kksamps = [kksamps_age1,kksamps_age2]
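
The arguments passed to `np.random.negative_binomial(n, p)` above follow from matching moments: in NumPy's parametrisation the mean is $n(1-p)/p$ and the variance is $n(1-p)/p^2$, so a target mean $m$ and variance $v>m$ give $n=m^2/(v-m)$ and $p=m/v$. The short sketch below (with hypothetical helper names) checks this round trip for the first age group.

```python
def nb_shape_from_moments(mean, variance):
    """Return (n, p) for NumPy's negative binomial given a mean and variance."""
    if variance <= mean:
        raise ValueError("a negative binomial requires variance > mean")
    n = mean ** 2 / (variance - mean)
    p = mean / variance
    return n, p

def nb_moments_from_shape(n, p):
    """Return (mean, variance) implied by NumPy's (n, p) parametrisation."""
    mean = n * (1.0 - p) / p
    variance = n * (1.0 - p) / p ** 2
    return mean, variance

# First age group of the mock data: mean 10, variance 3000
n_age1, p_age1 = nb_shape_from_moments(10.0, 3000.0)
mean_check, variance_check = nb_moments_from_shape(n_age1, p_age1)
```

Here `n_age1` is $100/2990 \approx 0.033$, which corresponds to a strongly aggregated (overdispersed) set of mock counts.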

## 3. Parameter inference and visualisation

We will now illustrate how to use `helmpy.fit_data` with an STH instance. This will generate the parameter samples from the posterior distribution (see ${\cal P}[ R_{0,i}, M_i (t_0),k,z,\lambda_{\rm d},k_{\rm d}  \vert {\cal D}]$ above) using the dataset we have made up above. Of course, it will not be possible to infer all of these parameters from the Kato-Katz data alone, so in some cases we shall fix their values with prior assumptions (i.e., Dirac delta priors). In particular, for STH we will assume here $\lambda_{\rm d} = 3.05$ and $z=e^{-\gamma}=e^{-0.005}\simeq 0.995$ (see [https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-019-3686-2]). We will also keep things simple by running the simulation for 10 years so that it has long enough to get close to the dynamical endemic equilibrium.

Assuming that one has already used the `helmpy` interface, the following code should make sense now...

In [None]:
hp = helmpy('STH',path_to_helmpy,suppress_terminal_output=True)     # New helmpy instance
hp.parameter_dictionary['mu'] = [0.014,0.014]                       # Human death rate (per year)
hp.parameter_dictionary['mu1'] = [0.5,0.5]                          # Adult worm death rate (per year)
hp.parameter_dictionary['mu2'] = [26.0,26.0]                        # Reservoir (eggs and larvae) death rate 
hp.parameter_dictionary['gam'] = [0.005,0.005]                      # Density dependent fecundity: z = exp(-gam)
hp.parameter_dictionary['Np'] = [250,250]                           # Number of people within grouping   
hp.parameter_dictionary['spi'] = [1,1]                              # Spatial index number of grouping

# Setting the same arbitrary initial conditions for the force of infection since in the cases of either STH
# or SCH, the equilibration of the force of infection is very rapid anyway
hp.initial_conditions['FOI'] = [5.0,5.0]                            

# Set the parameter lambda_d
hp.data_specific_parameters['KatoKatz'] = [3.05]

# Initialise number of Monte Carlo walkers and iterations - these are less than you need but it 
# is quicker to see how the code works for just this example
nwalkers = 20
niterations = 100

# Set the deterministic model runtime and timestep in years
runtime = 10.0
timestep = 0.02

# Initialise ensemble of walkers and their initial conditions
walker_initconds = []

# Add lnM0s to ensemble (normal distribution initial ensemble with [mean,std])
walker_initconds.append([0.5,0.5]) 
walker_initconds.append([0.5,0.5]) 
# Add lnR0s to ensemble (normal distribution initial ensemble with [mean,std])
walker_initconds.append([0.5,0.5]) 
walker_initconds.append([0.5,0.5]) 
# Add lnk to ensemble (normal distribution initial ensemble with [mean,std])
walker_initconds.append([-2.5,0.5])
# Add lnkd to ensemble (normal distribution initial ensemble with [mean,std])
walker_initconds.append([-2.0,0.5])

# Set filename
output_filename = 'default_example'

# Parameter labels for the corner plot
names = ['lnM01','lnM02','lnR01','lnR02','lnk','lnkd']

# Run with mock data
hp.fit_data(kksamps,
            walker_initconds,
            'fit_' + output_filename,
            num_walkers=nwalkers,
            num_iterations=niterations,
            runtime=runtime,
            timestep=timestep,
            plot_labels=names)

Note that this procedure may need to be performed many times and the chain diagnostics should be examined to make sure the sampling is converging! We can see the posterior samples stored in the `helmpy.data_samples` variable...

In [None]:
hp.data_samples
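
Since the walkers above sample the logarithms of the parameters, a quick summary on the natural scale can be useful. The sketch below assumes, purely for illustration, that the samples can be arranged as rows of six values ordered as in `names`; the actual layout of `hp.data_samples` should be checked before adapting it. The stand-in samples are random numbers, not output from `helmpy`, and the helper names are hypothetical.

```python
import math
import random

def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1.0 - weight) + sorted_values[upper] * weight

def summarise_log_samples(samples, labels):
    """Return {parameter: (16th, 50th, 84th percentile)} after exponentiating.

    Assumes each row of `samples` holds log-parameters ordered as in `labels`,
    and that each label starts with 'ln'."""
    summary = {}
    for column, label in enumerate(labels):
        values = sorted(math.exp(row[column]) for row in samples)
        summary[label[2:]] = tuple(percentile(values, q) for q in (16, 50, 84))
    return summary

# Stand-in samples only (NOT real helmpy output): 2000 rows of 6 log-parameters
random.seed(0)
names = ['lnM01', 'lnM02', 'lnR01', 'lnR02', 'lnk', 'lnkd']
stand_in_samples = [[random.gauss(0.5, 0.5) for _ in names] for _ in range(2000)]
posterior_summary = summarise_log_samples(stand_in_samples, names)
```

The 16th and 84th percentiles bracket a central 68% credible interval for each parameter, which matches the interval used in the forecast plot later in this notebook.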

## 4. Posterior predictive forecasting with the samples

Once the samples have been obtained, forecasting with a full stochastic individual-based simulation can be achieved immediately by using the `helmpy.run_full_stoch` method. This will use a uniform random subset of the posterior samples and run a fully individual-based stochastic simulation with them, initialising one realisation for each sample. Rerunning this method multiple times therefore generates good forecasting statistics for whatever purpose. We do this like so...

In [None]:
# Set the number of realisations
realisations = 100                            

# Run the stochastic simulation for the results
runtime = 20.0 
do_nothing_timescale = 0.01 
hp.run_full_stoch(runtime,realisations,do_nothing_timescale,'fit_sim_' + output_filename)

The output can also be found in the usual place...

In [None]:
# Load plot data
forecast_runs = np.loadtxt(path_to_helmpy + '/data/fit_sim_' + output_filename + '.txt')

# Generate plot for mean and 68% credible intervals
plt.plot(forecast_runs.T[0],forecast_runs.T[1],color='r')
plt.plot(forecast_runs.T[0],forecast_runs.T[3],color='r')
plt.plot(forecast_runs.T[0],forecast_runs.T[4],color='r')
plt.fill_between(forecast_runs.T[0],forecast_runs.T[4],forecast_runs.T[3],color='r',alpha=0.2)
plt.show()

As an alternative to running with the same `helmpy` instance, we can simply set the `helmpy.data_samples` variable of a new instance to the samples obtained above (as long as its parameter configurations are the same) and generate the forecasts with this new instance. This is useful when running on HPC clusters since the data analysis can be performed separately from the forecasting runs.

Using the deterministic fitting code together with stochastic simulations is a convenient way to generate predictions for a whole range of scenarios, especially when combined with the other features within `helmpy`. Note, however, that there are important limitations to this method, which were discussed at the beginning of this notebook.