netabc is a computer program for fitting contact network models to transmission trees by kernel approximate Bayesian computation.
You need the following shared libraries.
And recent versions the following programs.
On Ubuntu, these are all available in the repos.
sudo apt-get install libgsl0ldbl libgsl0-dev
sudo apt-get install libjudydebian1 libjudy-dev
sudo apt-get install libyaml-0-2 libyaml-dev
sudo apt-get install check
sudo apt-get install flex bison
Optionally, to build the documentation, you need Doxygen and a LaTeX distribution like TeX Live.
If you just want to use the program, download one of the releases. Check the INSTALL file inside the release for instructions.
If you want to build straight from the git repo, for example if you are
interested in adding new network models, first clone the repository using
git clone --recursive
. You must use --recursive
because a forked version of
the igraph library has been packaged as a submodule. Then run ./autogen.sh
to
create the configure script, and build with the --enable-tls
option (this is
necessary for igraph).
./configure --enable-tls
make
make install
This requires recent versions of the autotools.
Netabc requires two inputs: an estimated transmission tree, in Newick format, and a YAML file specifying your priors on the network parameters. The typical usage of netabc is as follows.
netabc -d [output file] [other options] [tree] [yaml_file]
You can type netabc -h
to get a summary of all the command line options. They
are described in detail in the next section. The structure of the YAML file
with priors is described in the following section.
Three additional binaries are included with netabc
. The first is
treekernel
, which computes the phylogenetic kernel of a pair of trees. The
second is nettree
, which simulates a phylogeny over a transmission tree in
GML format. The third
is treestat
, which computes several summary statistics on trees. Each of
these has a --help
option which displays their usage.
-t, --num-threads
Number of threads to use. Ideally you would set this to the number of cores you have available. The default is one thread, but we strongly encourage you to use multiple threads if possible.
-s, --seed
An integer to use to set the state of the random number generator. Repeated runs with the same tree and seed should produce the same results. The default is to use the current time.
-l, --decay-factor
The decay factor parameter to the tree kernel (see [@poon2013mapping]). The default is 0.3, and typical values are between 0.2 and 0.5. We encourage you to perform a preliminary analysis to find optimal kernel meta-parameters for your particular model, such as the cross-validation analysis performed in [@poon2015phylodynamic].
-g, --rbf-variance
The variance for the radial basis function used in the tree kernel (see [@poon2013mapping]). The default is 4, and typical values are between 0.5 and 8. As with the decay factor parameter, we encourage you to perform a separate analysis to find an optimal value for your problem.
-c, --nltt
If this option is passed, multiply the tree kernel by the normalized lineages-through-time (nLTT) statistic [@janzen2015approximate]. This is not done by default.
-n, --num-particles
The number of particles to use for SMC (see [@del2012adaptive]). More particles lead to a better SMC approximation, but at the expense of a linear increase in computational cost. The default is 1000, which was used by [@del2012adaptive] but leads to a fairly course approximation.
-p, --num-samples
The number of simulated trees to keep track of per particle. Again, more is better in terms of accuracy, but the complexity is linearly related. The default is 5; typical values are between 1 and 100.
-q, --quality
The coefficient of the equation which is solved to determine the next tolerance value ε. It is called α in [@del2012adaptive]. Essentially it represents a tradeoff between speed and accuracy - lower values will finish with fewer iterations but produce worse approximations. Typical values are between 0.9 and 0.99. The default is 0.95.
-d, --trace
Dump the values of each particle, their weights, and the distances of their associated simulated trees to this file after each iteration. You must specify this parameter if you want to record the estimated parameter values.
-m, --net-type
The name of the network model you want to fit. Currently available values are listed in the next section. The default is "pa", which indicates Barabasi-Albert preferential attachment.
-e, --final-epsilon
The first of two possible stopping conditions for SMC. The algorithm will be
stopped when the current distance tolerance falls to less than this value. The
default is 0.01. If you do not want to use this stopping criterion, pass -e 0
and specify a value for -a
.
-a, --final-accept-rate
The second of two possible stopping conditions for SMC. The algorithm will be
stopped when the acceptance proportion of MCMC moves of the particles falls below
this value. The default is 0.015, or 1.5%. If you do not wish to use this
stopping criterion, pass -a 0
and specify a value for -e
.
Currently, three models are supported.
model | description |
---|---|
pa | Barabasi-Albert preferential attachment (uses igraph_barabasi_game ) |
gnp | Erdos-Renyi random network (uses igraph_erdos_renyi_game ) |
smallworld | Watts-Strogatz small world network (uses igraph_watts_strogatz_game ) |
Priors are specified in a YAML file. It's also possible to specify exact values on the parameters. Consider the following example YAML file.
N: ["uniform", 287, 10000]
I: ["uniform", 287, 10000]
time: 0
transmit_rate: 1
remove_rate: 0
m: ["uniform", 2, 6]
alpha: ["uniform", 0, 2]
Here, we are specifying uniform priors with various endpoints for the parameters "N", "I", "m", and "alpha". The parameters "time", "transmit_rate", and "remove_rate" have been specified exactly at the values 0, 1, and 0, respectively.
You must specify either priors for all parameters of the model you are fitting.
model | parameter | meaning |
---|---|---|
all | N | number of nodes in the network |
all | I | number of infected nodes at time of sampling |
all | time | amount of simulated time passed at time of sampling |
all | transmit_rate | transmission rate per discordant edge |
all | remove_rate | removal rate per infected node |
pa | m | number of edges added per vertex |
pa | alpha | preferential attachment power |
gnp | p | probability per edge |
smallworld | nei | number of neighbours for each vertex |
smallworld | p | rewiring probability |
The required parameters for all models have some dependency on each other. In the case of N and I (the total number of nodes, and the number of infected nodes), the priors you specify will be truncated to the region I ≤ N by rejection sampling. If there is not enough prior density in the region I ≤ N, the initialization of the particles will fail.
The time parameter indicates the simulation time at which sampling occurs. If set to 0, it is ignored, and the simulation proceeds as long as necessary I nodes to become infected. Likewise, if I is set to 0, the simulation proceeds until the specified time has been reached regardless of how large the epidemic gets. If I and time are both 0, then the simulation will always continue until there are no more discordant edges in the network. You cannot specify exact values for time or I unless the other is zero.
These are the distributions currently supported for specifying priors. Currently, joint priors are not supported, so you must list priors for each parameter independently. The "delta" distribution is the Dirac delta, equivalent to specifying an exact value. All distributions are continuous except "poisson" and "discrete_uniform", which are for discrete-valued parameters.
Some distributions, such as the exponential distribution, are lower-bounded by
zero. The parameter called "shift" allows you to specify a different lower
bound, shifting the entire distribution. In other words, if the pdf of a
distribution is given by f(x), the shift parameter will modify the pdf to be
f(x) + shift. Similarly, the beta distribution is defined on the region [0, 1].
We introduce a parameter "scale" which scales the whole distribution, so that
it is defined on
Some distributions can be parameterized multiple ways. I tried to keep the below table consistent Wikipedia.
distribution | parameters |
---|---|
uniform | a (lower limit), b (upper limit) |
gaussian | μ (mean), σ (variance) |
delta | shift |
exponential | shift, λ (rate) |
laplace | μ (location), b (scale) |
exponential_power | μ (location), α (scale), β (shape) |
cauchy | x0 (location), γ (scale) |
rayleigh | shift, σ (scale) |
gamma | k (shape), θ (scale) |
lognormal | μ (location), σ (scale) |
chi_squared | shift, k (degrees of freedom) |
f | shift, d1 (first degrees of freedom), d2 (second degrees of freedom) |
student_t | shift, ν (degrees of freedom) |
beta | shift, α (first shape), β (second shape), scale |
logistic | μ (location), s (scale) |
pareto | xm (scale), α (shape) |
weibull | shift, λ (scale), k (shape) |
poisson | shift, λ (variance) |
discrete_uniform | a (lower limit, inclusive), b (upper limit, inclusive) |
To maintain the speed of the program, it is necessary to directly modify the C
code if you want to add a new network model. All the model-specific code is in
src/netabc.c
. At this stage, the procedure to add a new model is somewhat
involved.
- Add a new element to the
net_type
enum for your new model. - Append the number of model parameters it has to the global
NUM_PARAMS
array. - Add elements to the
net_parameter
enum for all the model's parameters. - Add the parameters' names to the
PARAM_NAMES
array. - Indicate whether the parameters are continuous or discrete in the
PARAM_DISCRETE
array. - Modify the
get_options()
function to accept an abbreviation for your network type on the command line. Look for the lineopts.net = NET_TYPE_PA
and add something similar. Add the abbreviation to theusage()
function. - Add code to simulate a network under your model in the
sample_dataset()
function. Look for the linecase NET_TYPE_PA:
and add something similar.