-
Notifications
You must be signed in to change notification settings - Fork 4
/
coala-intro.Rmd
257 lines (194 loc) · 7.74 KB
/
coala-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
title: "Introduction to Coala"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to Coala}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
Coalescent simulation refers to the idea of simulating the evolution of
biological sequences like DNA by tracing their ancestry back in time. The
`coala` package is an interface for calling a number of commonly used
coalescent simulators from R. It should be able to use the simulator
`scrm` out of the box. Other simulators need to be installed separately
and must be activated explicitly. This is described in the `installation`
vignette:
```{r installation, eval=FALSE}
vignette("coala-install", package = "coala")
```
In this introduction we will stick to using `scrm`.
# Creating a model
In order to conduct simulations, we first need to specify the components of
the simulation model. The function `coal_model` creates a basic coalescent
model:
```{r create_model}
library(coala)
model <- coal_model(sample_size = 3, loci_number = 1)
```
This creates a basic model with one population of constant size.
One genetic locus for three haplotypes is sampled from
the population.
Printing the model gives a short summary of this content:
```{r print_model}
model
```
Models consist of
* _features_, which represent evolutionary forces and events present in the
model,
* _parameters_, which are the model parameters,
* _loci_, which describe the genetic regions that are simulated and
* _summary statistics_, which describe the format of the simulated data.
In order to simulate the model we need to add a few more features and at least
one summary statistic.
# Adding Features
For simulating sequences, we need to add mutations to the model. To do so,
we create a corresponding feature using `feat_mutation` and add it to the
existing model using the plus operator:
```{r}
model <- model + feat_mutation(rate = 1, model = "IFS")
model
```
Now, mutations occur with rate 1 and according to an infinite-sites mutation
model (IFS). Details of rate and mutation model are given in the help
page of the feature (`?feat_mutation`).
Coala currently support the features
```{r echo=FALSE}
funcs <- ls("package:coala")
funcs[grep("^feat_", funcs)]
```
However, not all combination of all features might be possible. Please refer
to the features help pages for detailed information.
## Multiple Populations
If we build a model consisting of multiple populations, we need to state the sample
sizes as a vector of sample sizes for the different populations. The lines,
```{r}
model <- coal_model(sample_size = c(5, 2, 0), loci_number = 1) +
feat_migration(rate = 0.5, symmetric = TRUE) +
feat_pop_merge(0.5, 3, 2) +
feat_pop_merge(0.8, 2, 1)
```
create a model of three populations, with a symmetric migration rate of `0.5`
between them. When viewed backwards in time, population 3 merges into population 2
`0.5` coalescent time units in the past and population 2 into population 1
`0.3` time units further into the past. Looking forwards in time, this
represents two speciation events with migration going on afterwards. At time `0`
five haploids are sampled from population 1 and two from population 2. Please
note that sample sizes for all populations must be given, even if no haploid
is sampled from a population, as it is the case for population 3 here.
# Adding Summary Statistics
Adding summary statistics works in a similar fashion as adding features:
```{r}
model <- coal_model(3, 1) +
feat_mutation(rate = 1) +
sumstat_seg_sites()
model
```
This adds the _segregating sites_ summary statistic to the model, which is a
basic summary statistic in population genetics. Again, refer to
`?sumstat_seg_sites` for details.
Available summary statistics are:
```{r echo=FALSE}
funcs[grep("^sumstat_", funcs)]
```
# Simulating the model
Now we can simulate the model. The printed output of a model contains information
which program will be used for the simulation and which arguments will be used.
As coala is in an early stage, please make sure to always check both.
The
function `simulate` will call the program with the printed options, parse its
output and calculated the added summary statistics:
```{r}
sumstats <- simulate(model, seed = 123)
```
The returned object `sumstats` is a list, in which each entry corresponds to one
summary statistic. As there is only one summary statistic in our model,
the list has only one entry:
```{r}
names(sumstats)
```
The structure in `sumstats$seg_sites` is given by the segregating sites
statistic. It is again a list, where each entry represents one locus. For each
locus, it contains a matrix as specified in `?sumstat_seg_sites`:
```{r}
sumstats$seg_sites[[1]]
```
# Adding Loci
If we want to have more loci in a model, we can add them using the `locus_`
functions. The most basic option is to add an additional locus with a different
length:
```{r}
model <- model + locus_single(500)
model
```
Now the model consists of two loci, the first with length 1000, the second with
500. Simulation now produces a segregating sites list with two entries corresponding
to the loci:
```{r}
sumstats <- simulate(model)
sumstats$seg_sites[[1]]
sumstats$seg_sites[[2]]
```
Another possibility is to add multiple loci with the same length using
`locus_averaged`, which gives better performance than adding the loci
one by one. For example
```{r}
model <- model + locus_averaged(2, 750)
sumstats <- simulate(model)
length(sumstats$seg_sites)
```
adds two more loci with length of 750bp to the model.
# Adding Parameters
So far, we have used a model without parameters that can vary between
simulations. In particular for fitting a model to data via ABC or Jaatha, it is
useful to add parameters to a previous model instead of creating a new model for
each simulation.
### Named Parameters
Named parameters values can be specified in the simulation command. If we want, for example,
to launch simulations for a model with different values of the mutation rate, we can use
a named parameter:
```{r}
model <- coal_model(5, 1) +
feat_mutation(rate = par_named("theta")) +
sumstat_seg_sites()
sumstats1 <- simulate(model, pars = c(theta = 2.5))
sumstats2 <- simulate(model, pars = c(theta = 4.3))
```
### Parameters with Priors
A parameter distributed according to a prior can be specified using the
`par_prior` function. The function's first argument is a name for the parameter,
the second an expression that, when evaluated, produces a sample from the
prior distribution.
So if we want the mutation to follow a uniform distribution between
`0` and `10`, we can use:
```{r priors}
model <- coal_model(5, 1) +
feat_mutation(rate = par_prior("theta", runif(1, 0, 10))) +
sumstat_seg_sites()
sumstats <- simulate(model)
sumstats$pars
sumstats2 <- simulate(model)
sumstats2$pars
```
### Parameter Ranges
For simulations that will we used for parameter inference with the R
package `jaatha`, you need to give a range of possible values for each
parameter. This is done using `par_range`. For instance
```{r}
model <- coal_model(5, 1) +
feat_mutation(rate = par_range("theta", 0.1, 5)) +
sumstat_seg_sites()
```
sets a possible range from _0.1_ to _5_ for the mutation rate. The actual
rate is given in the `simulate` function, just as with named parameters.
### Expressions
Finally, there is a very powerful type of parameters generated with `par_expr`.
Similar to parameters with priors, the value of the parameter is given as an
R expression, which is evaluated before simulation. Unlinke `par_prior`,
this expression can contain other named parameters. For example
```{r}
model <- coal_model(4, 2) +
feat_mutation(rate = par_named("theta")) +
feat_recombination(rate = par_expr(theta * 2))
```
creates a model with a a recombination rate that always is twice as high as the
mutation rate.