/
getting-started.Rmd
166 lines (119 loc) · 6.26 KB
/
getting-started.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Getting started with `InfectionTrees`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{getting-started}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "",
message = FALSE,
warning = FALSE
)
```
`InfectionTrees` is a software package used to sample and analyze branching processes that model the spread of infectious diseases.
## Downloading the package
Currently, `InfectionTrees` is available at [https://github.com/skgallagher/InfectionTrees](https://github.com/skgallagher/InfectionTrees), and can be installed in `R` via the following command,
```{r eval = FALSE}
devtools::install_github("skgallagher/InfectionTrees")
```
```{r echo = FALSE, warning = FALSE, message = FALSE}
devtools::load_all()
library(knitr)
library(kableExtra)
library(tidyr)
library(dplyr)
library(data.table)
```
```{r setup, eval = FALSE}
library(InfectionTrees)
library(knitr)
library(kableExtra)
library(tidyr)
library(dplyr)
library(data.table)
```
## About
Below is a true quick start to simulate data, sample trees based on that data, and finally compute the log likelihood.
The other vignettes cover:
**The Model**
1. [General overview](not-built-vignettes/model-overview.html)
2. [Accounting for under-reporting (Multiple outside transmissions model)](not-built-vignettes/multiple-outside-transmissions-model.html)
3. [Sampling from $\mathcal{T}_C$](not-built-vignettes/sampling-mc-trees.html)
3. [Sampling enough MC trees](not-built-vignettes/sampling-enough-mc-trees.html)
**Simulations**
1. [Simulations for the base model: binary covariates](not-built-vignettes/base-binary-simulations.html)
2. [Simulations for the base model: general covariates](not-built-vignettes/base-general-simulations.html)
2. [Simulations for the multiple outside transmission model](not-built-vignettes/multiple-outside-binary-simulations.html)
3. [Simulations to assess whether we sampled enough MC trees](not-built-vignettes/simulations-enough-mc-trees.html)
**Tuberculosis in Maryland**
1. [The data](not-built-vignettes/tb-eda.html)
2. [Running our models on the data](not-built-vignettes/tb-data-base.html)
3. [Visualizing the results](not-built-vignettes/tb-data-vis.html)
## Simulating branching processes
We can simulate from a branching process to form a data set of transmission trees or clusters (if we remove infector information from the transmission trees). We simulate $K$ branching processes where the primary infector in each tree is the root of each transmission tree and the probability of infection of another individual is where $\bf{X}$ is a covariate matrix and $\beta$ is a vector of coefficients,
$$
p_i = logit^{-1} (\bf{X}_i\bf{\beta}).
$$
Then the number of *new* infectious individuals infected by individual $i$ is
$$
N_i \sim geometric(p_i)
$$
It is possible to produce transmission trees that do not naturally terminate, so a maximal number of infections is used as an argument. The output is a data frame with $M$ transmission trees which contains the cluster ID, the person ID within a cluster, the generation of the person (i.e. the length of the path from the primary infector to the person + 1), the infector ID (i.e. who infected this person), the number of infections caused directly by this person if any, the total cluster size, the covariates, and an indicator telling us whether the cluster was artificially terminated or not (i.e. `censored = TRUE`).
```{r}
set.seed(2020)
## Generate possible covariates
sample_covariates_df <- data.frame(x = c(1,0))
beta0 <- -2
beta1 <- 1.5
M <- 100
data <- simulate_bp(K = M,
inf_params = c(beta0, beta1),
sample_covariates_df = sample_covariates_df,
covariate_names = "x",
max_size = 50)
data %>% head(., 10) %>%
kable(type = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
## Estimate the maximum likelihood
Although our data above are true transmission trees (i.e. we can trace a direct path of infections), generally the infector ID is unknown and hence the data is a group of individuals where the transmission tree is not known.
We estimate the parameters $\beta_0$ and $\beta_1$ by summing over all possible transmission trees for each cluster, which includes the number of individuals in each cluster and their covariates.
The likelihood for a given transmission tree $T$ with $n$ individuals is
$$
L(\beta; T) = \prod_{i=1}^n (1-p_i)p_i^{N_i}
$$
and the total likelihood of a single cluster $C$ over all trees of size $|C|=n$ is
$$
\mathcal{L}( \beta; C) = \sum_{T \in \mathcal{T}_C} L(\beta;T).
$$
We instead maximize the **approximate** likelihood, which is obtained from sampling transmission trees from $\mathcal{T}_n$. That is, we approximate $\frac{\mathcal{L}(C, \beta)}{|\mathcal{T}_C|}$ averaging the likelihood from Monte Carlo sampled trees $T_1, \dots, T_K$,
$$
\hat{\mathcal{L}}_K(\beta; C) = \frac{1}{K} \sum_{i=1}^K L(\beta, T_i).
$$
The estimates for the parameters are then found by maximizing the likelihood over the approximate likelihood over all clusters $C_1, \dots, C_M$
$$
\left (\hat{\beta} \right)^{(MC)} = \arg \max_{\beta} \prod_{m=1}^M \hat{\mathcal{L}}_K(\beta; C_m).
$$
### How to get sampled transmission trees
If we want to sample $B$ transmission trees for each cluster in our data, we use the function `sample_mc_trees`. Here, `B` is the number of MC samples. Typically this number is $\ge5000$, but we use 10 below for demonstrative purposes.
```{r}
mc_trees <- sample_mc_trees(observed_data = data, B = 10,
covariate_names = "x")
mc_trees %>% head(., 4) %>%
kable(type = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
The samples are linked to the original cluster by the variable `orig_id`.
## Compute the log likelihood
Once the MC trees are sampled, it is possible to compute the log likelihood using the function `general_loglike()`
```{r}
loglike <- general_loglike(inf_params = c(-2, 1),
mc_trees = mc_trees,
return_neg = FALSE,
cov_names = "x")
print(loglike)
```