## Fitting the simulation to data for parameter inference and forecasting 

With `helmpy`, a brute-force method of fitting simulations to data is available. Multiple candidate realisations of the simulation are generated, and each is then either accepted or rejected as a posterior sample over the initial conditions and transmission parameters according to its likelihood. The resulting samples can be used either for parameter inference or as inputs to a subsequent forecasting simulation.

For STH or schistosomiasis, the Kato-Katz (_Schistosoma mansoni_) or urine filtration (_Schistosoma haematobium_) egg count data can be summarised by a mean ${\rm E}({\rm egg})$ and a variance ${\rm Var}({\rm egg})$, which are treated as independent summary statistics. An inference of the posterior distribution $p({\bf s}\vert {\cal D})$ over the summary statistics in each of the age/gender/other bins (indexed by $i$), ${\bf s} =[{\rm E}({\rm egg})_i,{\rm Var}({\rm egg})_i]$, given the data ${\cal D}$ can be used to obtain the full posterior $p({\bf x}\vert {\cal D})$ over the transmission parameters and initial conditions ${\bf x}$ in the following way

$p({\bf x}\vert {\cal D})=\int p({\bf x},{\bf s} \vert {\cal D}) \, {\rm d}{\bf s}$

$\qquad \quad =\int p({\bf x} \vert {\bf s}) p({\bf s} \vert {\cal D}) \, {\rm d}{\bf s}$

$\qquad \quad =\int p({\bf x} \vert {\bf s}) \frac{p({\cal D}\vert {\bf s})p({\bf s})}{p({\cal D})} {\rm d}{\bf s}\,.$

Having marginalised over the inferred variance ${\rm Var}({\rm egg})$, which is assumed to be due to Kato-Katz diagnostic error, the subset of egg count means $\tilde{{\bf s}} =[ {\rm E}({\rm egg})_i]$ can be simulated as a function of ${\bf x}$, which we denote by ${\bf f}({\bf x})$. For modelling reasons, the functional form of $p({\bf x}\vert \tilde{{\bf s}})$ is assumed to be Gaussian

$$p({\bf x}\vert \tilde{{\bf s}}) = \frac{1}{\sqrt{\prod_i2\pi \epsilon^2}}\exp \left\{ -\sum_i\frac{[f_i({\bf x})-\tilde{s}_i]^2}{2\epsilon^2}\right\}\,,$$

with prior $p({\bf x})$. For a flat prior $p(\tilde{{\bf s}})\propto 1$, the value of $\frac{p({\cal D}\vert \tilde{{\bf s}})p(\tilde{{\bf s}})}{p({\cal D})}$ is proportional to the likelihood of the data given the summary statistics. Furthermore, the simulator likelihood has a tolerance scale parameter $\epsilon$ which is unknown. We therefore need to input a range of its possible values (which we do below) and either select the one for which there is the greatest evidence or simply marginalise over it.
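
As a minimal sketch of how the Gaussian form above behaves across a grid of tolerances, the following evaluates its logarithm for a single realisation. The function name `gaussian_log_likelihood` and the example values are hypothetical and are not part of `helmpy`; this is only an illustration of the formula, not the library's internal implementation.

```python
import numpy as np

def gaussian_log_likelihood(simulated_means, target_means, tolerances):
    """Log of the Gaussian form above, evaluated for each tolerance epsilon.

    simulated_means : f_i(x) for one realisation, one entry per bin
    target_means    : summary statistics s_i, one entry per bin
    tolerances      : candidate epsilon values
    (hypothetical helper for illustration, not a helmpy function)
    """
    sim = np.atleast_1d(np.asarray(simulated_means, dtype=float))
    tgt = np.atleast_1d(np.asarray(target_means, dtype=float))
    eps = np.asarray(tolerances, dtype=float)
    n_bins = sim.size
    sq_dist = np.sum((sim - tgt) ** 2)
    # -(n/2) ln(2 pi eps^2) - sum_i (f_i - s_i)^2 / (2 eps^2)
    return -0.5 * n_bins * np.log(2.0 * np.pi * eps**2) - sq_dist / (2.0 * eps**2)

# Same tolerance grid as used later in this section
tolerances = 10.0 ** (-10.0 + 0.5 * np.arange(40))

# One bin, simulated mean 28 against an inferred mean of 30 (illustrative numbers)
log_like = gaussian_log_likelihood([28.0], [30.0], tolerances)
best_eps = tolerances[np.argmax(log_like)]
print(best_eps)
```

Very small tolerances are penalised heavily by the squared-distance term, while very large ones are penalised by the normalisation, which is why a range of $\epsilon$ values is worth scanning.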

The likelihood is automatically computed by `helmpy` once the summary statistics have been inferred and a subsequent simulation with parameter/initial condition samples has been run. To begin our example, we must first create a batch of mock Kato-Katz egg counts...

In [None]:
meanegg = 30.0
varegg = 8000.0
kksamps = np.random.negative_binomial(meanegg**2.0/np.abs(varegg-meanegg),meanegg/varegg,size=1000)
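
The mapping from a mean and variance to the `(n, p)` arguments of NumPy's negative binomial sampler can be checked with a short sketch. The helper name `negbin_params` is hypothetical; it reproduces the expressions used in the cell above, except that it raises an error when the variance does not exceed the mean rather than taking an absolute value.

```python
import numpy as np

def negbin_params(mean, var):
    """Map a mean and variance (var > mean) to NumPy's (n, p) arguments.

    Hypothetical helper for illustration, not a helmpy function.
    """
    if var <= mean:
        raise ValueError("negative binomial requires var > mean")
    n = mean**2 / (var - mean)
    p = mean / var
    return n, p

n, p = negbin_params(30.0, 8000.0)

# Analytic moments of NumPy's parameterisation
analytic_mean = n * (1.0 - p) / p
analytic_var = n * (1.0 - p) / p**2

# Empirical check with a seeded generator
rng = np.random.default_rng(0)
samples = rng.negative_binomial(n, p, size=200_000)
print(analytic_mean, analytic_var)
print(samples.mean(), samples.var())
```

The analytic moments recover the requested mean and variance exactly, and the sample moments should approach them as the sample size grows.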

Having done this, `helmpy` must then be initialised and given the necessary information for the fit, including the Kato-Katz $\lambda_{\rm epg}/24$ parameter and a list of tolerance-to-fitting parameters $\epsilon$ (discussed above)...

In [None]:
hpfit = helmpy('STH',path_to_helmpy,suppress_terminal_output=False)  

hpfit.parameter_dictionary['Np'] = [1000]             # Number of people within grouping   
hpfit.parameter_dictionary['spi'] = [1]               # Spatial index number of grouping

hpfit.data_specific_parameters['KatoKatz'] = [3.0]                             # Kato-Katz egg count lambda_epg/24
#hpfit.data_specific_parameters['UrineFil'] = [0.5]                         # Urine filtered egg count lambda_epml/24 
hpfit.data_specific_parameters['tolerances'] = [10.0**(-10.0+(float(i)*0.5))   # Range of possible tolerances input
                                                for i in range(0,40)]

Having specified to this `helmpy` instance that it is reading Kato-Katz data through setting a value for $\lambda_{\rm epg}/24$ (or urine filtration data through setting a value for $\lambda_{\rm epml}/24$), all we now need to do is run...

In [None]:
data_from_file = [kksamps]                 # Input the data in a list structure equivalent to the input parameters

output_filename = 'default_example'        # Set a filename for the data to be output

walker_initconds = [[30.0,1.0],[9.0,0.1]]  # Parameter initial conditions [centre,width] for the ensemble MC walkers

plot_labels = ['Egg Mean','ln-Variance']   # Option to list a set of variable names (strings) in the same order 

hpfit.fit_data(data_from_file,           
               walker_initconds,         
               output_filename,          
               output_corner_plot=True,  
               plot_labels=plot_labels,  
               num_walkers=100,          
               num_iterations=500)       

The inferred summary parameters have been stored in a text file, and the current `helmpy` instance has also automatically stored these samples for comparison with simulation runs using the marginalised likelihood method described at the start of this subsection. To generate samples of the parameters/initial conditions and their corresponding likelihoods with respect to the data, and hence obtain posterior samples, all we need to do is run this same instance of `helmpy` (using the parameter samples feature); an output with likelihoods for each realisation will then be generated. First we set the other parameters and include ageing...

In [None]:
hpfit.parameter_dictionary['mu'] = [0.014]         # Human death rate (per year)
hpfit.parameter_dictionary['mu1'] = [0.5]          # Adult worm death rate (per year)
hpfit.parameter_dictionary['mu2'] = [26.0]         # Reservoir (eggs and larvae) death rate
hpfit.parameter_dictionary['ari'] = [0]            # Group age-ordering index - 0,1,2,3,... 
hpfit.parameter_dictionary['brat'] = [8.4]         # Birth rate per year into grouping 0
hpfit.parameter_dictionary['Na'] = [10]            # Number of people ageing per event

We can now also make use of the posterior samples feature which allows for a set of samples of parameters and initial conditions to be input...

In [None]:
realisations = 250                            

gams = np.random.uniform(0.001,0.01,size=realisations)
ks = 10.0**(np.random.uniform(-2.5,0.0,size=realisations))

hpfit.posterior_samples['ksamps'] = [ks]                                             # Initialisation with k samples
hpfit.posterior_samples['R0samps'] = [np.random.uniform(1.0,5.0,size=realisations)]  # Initialisation with R0 samples
hpfit.posterior_samples['gamsamps'] = [gams]                                         # Initialisation with gam samples
hpfit.posterior_samples['Msamps'] = [np.random.uniform(0.0,100.0,size=realisations)] # Initialisation with M samples
hpfit.posterior_samples['FOIsamps'] = [hpfit.posterior_samples['Msamps'][0]*         # Initialisation with FOI samples
                                       hpfit.parameter_dictionary['mu1'][0]] 

Lastly, these samples may be run and their likelihoods with respect to the data will be calculated and output...

In [None]:
runtime = 20.0 
do_nothing_timescale = 0.01 

hpfit.run_full_stoch(runtime,realisations,do_nothing_timescale,'fit_' + output_filename)

Let us now read in the output and visualise the parameter samples and their likelihoods with respect to the data in some scatter plots...

In [None]:
fits = np.loadtxt(path_to_helmpy + '/data/' + 'fit_' + output_filename + '_likelihood_cluster_1.txt')

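evidences = spec.logsumexp(fits,axis=0)   # Log-evidence for each tolerance (summed over realisations)
best_fits = spec.logsumexp(fits,axis=1)   # Log-likelihood for each realisation (summed over tolerances)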

plt.plot(np.log10(hpfit.data_specific_parameters['tolerances']),np.exp(evidences-np.max(evidences)))
plt.show()
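
The reason for using `logsumexp` rather than exponentiating and summing directly can be illustrated with a small sketch. The array below is a synthetic stand-in for the likelihood file, and its layout (rows as realisations, columns as tolerances) is an assumption inferred from how the plotting cells index the output. The flat prior over the tolerance grid is also an assumption made for illustration.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Synthetic stand-in for the log-likelihood output:
# rows are realisations, columns are tolerances (assumed layout)
log_like = rng.normal(loc=-1000.0, scale=20.0, size=(250, 40))

# Stable log of the summed likelihoods over realisations, one value per tolerance
log_evidence = logsumexp(log_like, axis=0)

# The naive calculation underflows: exp(-1000) is zero in double precision
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_like), axis=0))

# Normalised weights over the tolerance grid, assuming a flat prior on the grid
tol_weights = np.exp(log_evidence - logsumexp(log_evidence))
print(np.isfinite(log_evidence).all(), np.isneginf(naive).all())
```

Subtracting the maximum (or the log of the total) before exponentiating, as the plotting cell above does with `np.max(evidences)`, keeps the values in a representable range.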

In [None]:
for j in range(0,realisations):
    plt.scatter(hpfit.posterior_samples['Msamps'][0][j],\
                np.log10(hpfit.posterior_samples['ksamps'][0][j]),\
                color='Red',alpha=np.exp(best_fits[j]-np.max(best_fits)))
plt.show()

In [None]:
for j in range(0,realisations):
    plt.scatter(hpfit.posterior_samples['R0samps'][0][j],\
                np.log10(hpfit.posterior_samples['ksamps'][0][j]),\
                color='Red',alpha=np.exp(best_fits[j]-np.max(best_fits)))
plt.show()

In [None]:
for j in range(0,realisations):
    plt.scatter(hpfit.posterior_samples['gamsamps'][0][j],\
                np.log10(hpfit.posterior_samples['ksamps'][0][j]),\
                color='Red',alpha=np.exp(best_fits[j]-np.max(best_fits)))
plt.show()

Note that many more samples will be needed to estimate the parameters properly. 
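
One rough way to judge whether enough samples have been drawn is to normalise the per-realisation likelihoods into weights and compute an effective sample size. The sketch below uses synthetic log-likelihoods in place of `best_fits`, and the helper names are hypothetical; it also shows how weighted resampling could select realisations to pass on to a forecasting run. It is an illustration under these assumptions, not a feature of `helmpy`.

```python
import numpy as np
from scipy.special import logsumexp

def normalised_weights(log_like):
    """Convert log-likelihoods into weights that sum to one."""
    log_like = np.asarray(log_like, dtype=float)
    return np.exp(log_like - logsumexp(log_like))

def effective_sample_size(weights):
    """Kish effective sample size, between 1 and len(weights)."""
    return 1.0 / np.sum(weights**2)

rng = np.random.default_rng(2)

# Synthetic stand-in for best_fits (one log-likelihood per realisation)
log_like = rng.normal(0.0, 3.0, size=250)

w = normalised_weights(log_like)
ess = effective_sample_size(w)

# Resample realisation indices in proportion to their weights
idx = rng.choice(w.size, size=1000, replace=True, p=w)
print(ess)
```

An effective sample size far below the number of realisations indicates that a handful of samples dominate the weights, in which case more realisations (or a narrower prior range) would be needed.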