-
-
Notifications
You must be signed in to change notification settings - Fork 478
/
boarding_school_case_study.Rmd
1188 lines (982 loc) · 59.7 KB
/
boarding_school_case_study.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Bayesian workflow for disease transmission modeling in Stan"
author: "Léo Grinsztajn^[École polytechnique, Palaiseau, France, leo.grinsztajn@polytechnique.edu], Elizaveta Semenova^[Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK], Charles C. Margossian^[Department of Statistics, Columbia University, New York, NY, USA], Julien Riou^[Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland]"
link-citations: true
output:
html_document:
toc: true
toc_depth: 5
toc_float: true
bibliography: biblio.bib
biblio-style: imsmart-nameyear
abstract: This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the COVID-19 outbreak and doing Bayesian inference. Bayesian modeling provides a principled way to quantify uncertainty and incorporate prior knowledge into the model. What is more, Stan's main inference engine, Hamiltonian Monte Carlo sampling, is amiable to diagnostics, which means we can verify whether our inference is reliable. Stan is an expressive probabilistic programing language that abstracts the inference and allows users to focus on the modeling. The resulting code is readable and easily extensible, which makes the modeler's work more transparent and flexible. In this tutorial, we demonstrate with a simple Susceptible-Infected-Recovered (SIR) model how to formulate, fit, and diagnose a compartmental model in Stan. We also introduce more advanced topics which can help practitioners fit sophisticated models; notably, how to use simulations to probe our model and our priors, and computational techniques to scale ODE-based models.
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(1954)
```
# Introduction
The ongoing pandemic of the Coronavirus COVID-19 has led to an increased interest
in statistical disease modeling
and, amongst other approaches, Bayesian modeling
(e.g @Flaxman_covid_2020; @Riou_covid_2020;
see this [post on the Stan forum](https://discourse.mc-stan.org/t/stan-being-used-to-study-and-fight-coronavirus/14296) for an updated list of examples).
Transmission models of infectious diseases can help answer questions about past and future transmission,
including the effects of interventions.
<!-- Such models can be expressed in several ways. -->
<!-- The two broad categories are population-based and agent-based models. -->
_Population-based models_ subdivide the total population into homogeneous groups,
called _compartments_.
The flows between compartments can be described by a system of differential equations.
Individuals within a compartment are considered to be in the same state
(e.g. susceptible, infected, etc.).
Moreover, the time-dependent volume of each compartment
solves a system of ordinary differential equations (ODEs).
Compartment models are relatively easy to formulate
and computationally tractable.
<!-- Agent-based models simulate inidivuals (called _agents_) -->
<!-- and their interaction with an environment over time. -->
<!-- This approach explicitly accounts for heterogeneity amongst individuals -->
<!-- but yields models which are difficult to handle, -->
<!-- both mathematically and computationally. -->
<!-- By contrast, compartment models are relatively easy to formulate -->
<!-- and computationally tractable. -->
<!-- and model the size of each compartment via a set of either continuous or discrete-time differential equations. -->
<!-- Individuals within each compartment are considered to have the same state with respect to the disease in question. -->
<!-- Time-dependent number of individuals in each group is a solution of an ordinary differential equations system relating all the quantities. Stochastic processes, such as Markov chains, are also used to describe transitions between compartments. Agent-based models are used to simulate individuals (agents) and their interactions with an environment over time. Such models allow to account for heterogeneity in an explicit way but are very computationally demanding, especially when the objective is statistical inference.]. -->
<!-- Compartmental models described by ordinary differential equations (ODEs) have gained popularity due to several factors: -->
<!-- they are relatively easy to formulate, are computationally more tractable than agent-based models, -->
<!-- and in the simplest cases can be solved and/or analysed analytically. -->
<!-- In practice, the volumes of some compartments and most parameters driving the dynamic are latent, -->
<!-- and thus need to be inferred. -->
<!-- Statistical inference for ODEs is a hard task and should be treated separately from the model formulation. -->
<!-- The ongoing pandemic of the Coronavirus COVID-19 has led to an increased interest in statistical disease modeling. Indeed, transmission models of infectious diseases can help answer questions about forecasting and effects of interventions. Such models can be expressed in several ways^[The three broad categories are population-based, agent-based or stochastic models. Population-based models subdivide the total population into homogeneous groups (compartments) and model the volume of each group via a set of continuous-time differential equations or discrete-time difference equations. Individuals within each compartment are considered to have the same state with respect to the disease in question. Time-dependent number of individuals in each group is a solution of an ordinary differential equations (ODEs) system relating all the quantities. Agent-based models are used to simulate individuals (agents) and their interactions with an environment over time. Such models allow to account for heterogeneity in an explicit way but are very computationally demanding. Stochastic processes, such as Markov chains, are also used to describe transitions between compartments.]. Compartmental models described by ODEs have gained popularity due to several factors: they are relatively easy to formulate, require less computational time than agent-based models, and in the simplest case can be solved and/or analysed analytically. However in practice, the volumes of some compartments and most parameters driving the dynamic are latent, and thus need to be inferred. Statistical inference for ODEs is a hard task and should be treated separately from the model formulation. -->
In this tutorial, we demonstrate with a simple example how
to formulate and fit a typical compartmental model in Stan.
Stan is a probabilistic programming framework designed to let the user focus on modeling, while inference happens under the hood [@Carpenter_stan_2017]. This allows for faster implementation and extension of epidemiological models through a principled modeling workflow, and is thus particularly useful when handling rapidly evolving data and knowledge.
Stan is an expressive language that supports many probability densities, matrix operations, and numerical ODE solvers.
We can combine all these elements to specify a data generating process.
We can also compute useful epidemiological parameters,
such as the _basic reproduction number_, $R_0$, and make predictions.
<!-- $R_0$ is defined as the expected number of secondary infections produced from one infected individual in a fully susceptible population through the entire duration of the infectious period. -->
<!-- $R_0 > 1$ indicates a sustainable infection, -->
<!-- which can lead to a major outbreak, -->
<!-- while $R_0 < 1$ suggests, under stable conditions, that the infection will die out. -->
<!-- ^[Two crucial numbers quantifying a disease are the *basic reproduction number* $R_0$ and the *effective reproduction number* $R_t.$ The basic reproduction number, $R_0$, is the expected number of secondary infections produced from one infected individual in a fully susceptible population through the entire duration of the infectious period. Values of $R_0$ greater than one indicate a sustainable infection, when a major outbreak to be foreseen, while values below one indicate, under stable conditions, a potentially small outbreak and that the infection will die out.]. -->
Generative models formulated in Stan can be used both for simulation and inference.
Stan bolsters several inference methods: full Bayesian inference using Markov chains Monte Carlo (MCMC), approximate Bayesian inference with variational inference, and penalized maximum likelihood estimation. We focus on Bayesian inference with MCMC.
Bayesian inference gives us a principled quantification of uncertainty and the ability to incorporate domain knowledge in the form of priors, while MCMC is a reliable and flexible algorithm.
In addition, Stan provides diagnostic tools to evaluate both the inference (e.g. accuracy of the MCMC, convergence of chains) and the model (e.g. posterior predictive checks).
Other tutorials on the subject include the work by @Chatzilena_tutorial_2019
and @Mihaljevic_tutorial_2016 on transmission models,
and the case studies by @Carpenter_predatory-prey_2018, @Weber_ode_2018,
and @margossian_differential_2017 on ODE-based models,
all of which can serve as complementary reading.
## Outline
Through the lens of the Susceptible-Infected-Recovered (SIR) model, we show how to put together a principled Bayesian workflow in Stan, allowing a faster development of reliable epidemiological models.
In Section 1, we introduce how to build, fit, and diagnose compartment models in Stan.
The next sections discuss topics that can help practitioners fit more complex models.
Section 2 reviews how we can use simulated data to examine our model and our priors,
and provides an introduction to inference calibration.
Section 3 offers a pragmatic discussion on how to efficiently implement
and scale up ODEs in Stan.
Throughout the tutorial, we use R as a scripting language^[Note that Stan can also be used with other langages such as Python or Julia, see [here](https://mc-stan.org/users/interfaces/) for the list of Stan interfaces],
and, while we review some elementary concepts,
assume the reader has basic familiarity with Bayesian inference
and Stan. The source code of this case study can be found on Github [here](https://github.com/stan-dev/example-models/tree/master/knitr/disease_transmission).
# 1 Simple SIR
## Data
In this example, we examine an outbreak of influenza A (H1N1) in 1978 at a British boarding school.
The data consists of the daily number of students in bed, spanning over a time interval of 14 days.
There were 763 male students who were mostly full boarders and 512 of them became ill.
The outbreak lasted from the 22nd of January to the 4th of February.
It is reported that one infected boy started the epidemic, which spread rapidly in the relatively closed community of the boarding school.
<!-- The original data were available only in a figure with some additional data in the [main text](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1603269/pdf/brmedj00115-0064.pdf). -->
The data are freely available in the R package outbreaks, maintained as part of the [R Epidemics Consortium](http://www.repidemicsconsortium.org).
```{r, message=FALSE}
library(outbreaks)
library(tidyverse)
```
```{r}
head(influenza_england_1978_school)
```
```{r}
ggplot(data = influenza_england_1978_school) +
geom_point(mapping = aes(x = date, y = in_bed)) +
labs(y = "Number of students in bed")
```
## Mathematical transmission model
The Susceptible-Infected-Recovered (SIR) model splits the population in three time-dependent compartments: the susceptible, the infected (and infectious), and the recovered (and not infectious) compartments. When a susceptible individual comes into contact with an infectious individual, the former can become infected for some time, and then recover and become immune.
The dynamics can be summarized graphically:
<img src="pictures/SIR_tikz.png" alt="SIR graph" width="500" align="center"/>
The temporal dynamics of the sizes of each of the compartments are governed by the following system of ODEs:
$$
\begin{aligned}
\frac{dS}{dt} &= -\beta S \frac{I}{N}\\
\frac{dI}{dt} &= \beta S \frac{I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
$$
where
* $S(t)$ is the number of people susceptible to becoming infected (no immunity),
* $I(t)$ is the number of people currently infected (and infectious),
* $R(t)$ is the number of recovered people (we assume they remain immune indefinitely),
* $\beta$ is the constant rate of infectious contact between people,
* $\gamma$ the constant recovery rate of infected individuals.
Let's give some intuition behind these ODEs. The proportion of infected people among the population is $\frac{I}{N}$. At each time step, given uniform contacts, the probability for a susceptible person to become infected is thus $\beta\frac{I}{N}$, with $\beta$ the average number of contacts per person per time, multiplied by the probability of disease transmission when a susceptible and an infected subject come in contact. Hence, at each time step, $\beta S \frac{I}{N}$ susceptible individuals become infected, meaning $\beta S \frac{I}{N}$ people leave the $S$ compartment and $\beta S \frac{I}{N}$ people enter the $I$ compartment. Similarly, the recovery of an infected individual is taking place at rate $\gamma$, and thus the number of infected individuals decreases with speed $\gamma I$ while the number of recovered grows at the same speed.
The above model holds under several assumptions:
* births and deaths are not contributing to the dynamics and the total population $N=S+I+R$ remains constant,
* recovered individuals do not become susceptible again over time,
* the infection rate $\beta$ and recovery rate $\gamma$ are constant,
* the population is homogeneous,
* individuals meet any other individual uniformly at random (homogeneous mixing)
and recovery time follows an exponential distribution with mean $\frac{1}{\gamma}$.
* replacing the integer number of people in each compartement by a continuous approximation is legitimate (the population is big enough)
In case of boarding school data, the spread of the disease has started with one infected individual which leads to the initial conditions $I(0) = 1, S(0) = N-1, R(0) = 0.$
## Statistical model
We now introduce a _sampling distribution_ (also termed _likelihood_)
$$
p(\mathcal Y \mid \theta)
$$
which tells us, given model parameters $\theta$, how to generate data $\mathcal Y$.
Inference reverse-engineers the data generating process
and asks: "given a model and observations, $\mathcal Y$,
what are plausible parameter values?"
In a Bayesian framework, the set of plausible values is characterized by
the _posterior distribution_,
$$
p(\theta \mid \mathcal Y).
$$
Bayes' rule teaches us that
$$
p(\theta \mid \mathcal Y) \propto p(\mathcal Y \mid \theta) p(\theta)
$$
where $p(\mathcal Y \mid \theta)$ is the sampling distribution,
$p(\theta)$ the _prior_ distribution,
and $\propto$ stands for "proportional to".
The prior encodes information about the parameters we have
before observing the data.
To summarize, a Bayesian model couples a mathematical model of what we know about the parameters in the form of a prior and a sampling distribution, i.e. a model of the data generating process.
<!-- ^[Given the a priori probability of the parameters $p(\theta)$ and the probability of the data given the parameters $p(y|\theta)$, Bayes rules tells us that the posterior distribution of the parameters $\theta$ is $p(\theta|y) \propto p(\theta)p(y|\theta)$]. Specifically, a Bayesian model requires a mathematical model of what we know about the parameters (i.e., a prior) and a model of what we know about the data generating process given the parameters (i.e., a sampling distribution). -->
### Sampling distribution
<!-- Here we build a compatemental model. -->
Given transmission parameters and initial conditions,
a compartmental model defines a unique solution for each of the compartments,
including the number of infected students, $I_\text{ODE}(t)$.
We want to link this solution to the observed data, i.e the number of students in bed, $I_\text{obs}(t)$, at each time point.
<!-- (other states and parameters are latent). -->
The latter is a noisy estimate of the true number of infected students.
<!-- Likelihood allows us to match observed and predicted values to update the latent parameters. -->
We choose to model the number of students in bed with a count distribution --
the Negative Binomial.
This distribution allows us to use $I_\text{ODE}(t)$ as the expected value
and account for over-dispersion, through the parameter $\phi$:
$$
I_\text{obs}(t) \sim \text{NegBin}(I_\text{ODE}(t), \phi)
$$
This gives us $p(\mathcal Y \mid \theta)$, with $\theta = (\beta, \gamma, \phi)$, the parameters of the model.
### Prior distribution
We still need $p(\theta)$. With $\theta = (\beta, \gamma, \phi)$, we specify a prior distribution for each of the three parameters. One advantage of the Bayesian approach is that we can formally
incorporate prior knowledge about the parameters into the model.
<!-- prior knowledge about the parameters can be formally incoperated into the model. -->
<!-- Natural constraints on both $\beta$ and $\gamma$ are that they are positive. -->
<!-- Knowledge about the contact rate, infectiousness and recovery rate can be expressed in the form of priors. -->
For instance, we specify $\gamma \sim N(0.4, 0.5)$ (truncated at 0), which expresses our belief that $\gamma$ has to be positive, and that $P(\gamma \leq 1) = 0.9$
(i.e the probability that the recovery time is > 1 day is 0.9 *a priori*).
We can change this prior if more information becomes available, constraining our parameter estimation more tightly or, on the contrary,
increasing its variance.
Section 2 discusses how to check if our priors are consistent with available domain knowledge using _prior predictive checks_.
### Predictions and derived quantities
Once we fit the model and obtain a posterior distribution for $\theta$,
we can derive additional quantities of interests.
We can simulate predictions, $\mathcal Y_\mathrm{pred}$,
and work out the plausible range of new observations.
Because these predictions are based on the posterior, $p(\theta \mid \mathcal Y)$,
our simulations account for uncertainty both in the data generating process
and in our estimates of $\theta$.
Moreover, we compute a posterior distribution of predictions
$$
p(\mathcal Y_\mathrm{pred} \mid \mathcal Y) = \int p(\mathcal Y_{pred} | \theta) p(\theta | \mathcal Y) d\theta
$$
Similarly we can compute quantities that depend on the model parameters.
An important example is the basic reproduction number, $R_0$.
$R_0$ is defined as the expected number of secondary infections produced from one infected individual in a fully susceptible population through the entire duration of the infectious period.
$R_0 > 1$ indicates a sustainable infection,
which can lead to a major outbreak, while $R_0 < 1$ suggests, under stable conditions, that the infection will die out.
Bayesian inference allows us to construct a posterior distribution for this quantity too
$$
p(R_0 \mid \mathcal Y).
$$
## Coding the Model: Stan Program
We will need the following libraries
```{r, message=FALSE}
library(rstan)
library(gridExtra)
rstan_options (auto_write = TRUE)
options (mc.cores = parallel::detectCores ())
set.seed(3) # for reproductibility
```
### Coding the ODE in Stan
An ODE takes the form
$$
\frac{\mathrm d y}{\mathrm d t} = f(t, y)
$$
where $y$ are the _states_, in our example $y = (S, I, R)$,
and $t$ is time.
We also need an initial condition $y_0$ at $t_0$
and the times, $\tau$, at which we evaluate the solution.
To specify an ODE in Stan, we first code $f$ in the `functions` block.
This function must observe a strict signature:
```{stan, eval=F, output.var="md"}
real[] f(real time, real[] state, real[] theta,
real[] x_r, int[] x_i)
```
with
* `time`, $t$;
* `state`, the volumes in each compartment, $y$;
* `theta`, variables used to compute $f$, which depend on the model parameters;
* `x_r`, real variables used to evaluate $f$, which only depend on fixed data;
* `x_i`, integer values used to evaluate $f$, which only depend on fixed data.
We motivate this signature in our discussion on scaling ODEs (section 3).
In our example, the ODEs for the SIR model is defined as follows:
```{stan, eval=F, output.var="md"}
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real S = y[1];
real I = y[2];
real R = y[3];
real N = x_i[1];
real beta = theta[1];
real gamma = theta[2];
real dS_dt = -beta * I * S / N;
real dI_dt = beta * I * S / N - gamma * I;
real dR_dt = gamma * I;
return {dS_dt, dI_dt, dR_dt};
}
}
```
We evaluate the solution numerically by using one of Stan's numerical integrators.
We opt for the Runge-Kutta 4$^\mathrm{th}$ / 5$^\mathrm{th}$ order
but, as we discuss in section 3, the user can consider other options.
A call to the integrator looks as follows
```{stan, eval=F, output.var="md"}
y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
```
with
* `sir`, the name of the function that returns the derivatives, $f$;
* `y0`, the initial condition;
* `t0`, the time of the initial condition;
* `ts`, the times at which we require the solution to be evaluated;
* `theta`, `x_r`, `x_i`, arguments to be passed to $f$.
We now have all the ingredients to solve our ODE.
Note that in the given example, when we assume that the total population remains constant, the three derivatives $\frac{dS}{dt}$, $\frac{dI}{dt}$, $\frac{dR}{dt}$ sum up to $0$: We can use this fact to improve computational efficiency of the `sir` function by deriving the value of $\frac{dI}{dt}$ from $\frac{dS}{dt}$ and $\frac{dR}{dt}$:
```{stan, eval=F, output.var="md"}
real dS_dt = -beta * I * S / N;
real dR_dt = gamma * I;
real dI_dt = -(dS_dt + dR_dt);
```
### Building the model in Stan
We next code the model in Stan, working through the various coding blocks.
The `functions` block specifies `sir`
```{stan, eval=F, output.var="md"}
functions {
real[] sir (...) {...} //copy code from above
}
```
Fixed data is declared in the `data` block:
```{stan, eval=F, output.var="md"}
data {
int<lower=1> n_days;
real y0[3];
real t0;
real ts[n_days];
int N;
int cases[n_days];
}
```
We code transforms of the data in the `transformed data` block. In this example, we transform our data to match the signature of `sir` (with `x_r` being of length 0 because we have nothing to put in it).
```{stan, eval=F, output.var="md"}
transformed data {
real x_r[0];
int x_i[1] = { N };
}
```
We next declare the model parameters. If you want some parameter to be bounded, and it is not already guaranteed by his prior, you need to specify `<lower=a, upper=b>` when declaring this parameter. Note that this is how you put a truncated prior distribution on a parameter.
```{stan, eval=F, output.var="md"}
parameters {
real<lower=0> gamma;
real<lower=0> beta;
real<lower=0> phi_inv;
}
```
And then transforms of the parameters
```{stan, eval=F, output.var="md"}
transformed parameters{
real y[n_days, 3];
real phi = 1. / phi_inv;
{
real theta[2];
theta[1] = beta;
theta[2] = gamma;
y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
}
}
```
With the ODE solution, $y$, in hand, the only thing left to do is to code the prior and sampling distribution.
<!-- As with the other parts of the model, these directly follow the notation in the mathematical specification. -->
```{stan, eval=F, output.var="md"}
model {
//priors
beta ~ normal(2, 1); //truncated at 0
gamma ~ normal(0.4, 0.5); //truncated at 0
phi_inv ~ exponential(5);
//sampling distribution
//col(matrix x, int n) - The n-th column of matrix x. Here the number of infected people
cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);
}
```
Untangled from the inference, we can calculate the basic reproduction number, $R_0$, and predictions for the number of cases in the `generated quantities` block:
```{stan, eval=F, output.var="md"}
generated quantities {
real R0 = beta / gamma;
real recovery_time = 1 / gamma;
real pred_cases[n_days];
pred_cases = neg_binomial_2_rng(col(to_matrix(y), 2) + 1e-5, phi);
}
```
<!-- ### Pressing the inference button -->
### Fitting the model in R
<!-- To provide the data to Stan for inference -->
<!-- ^[What do we mean by inference here? We want to infer the value of the parameters $\theta$ given the data $y$ . We have already specified the priors $p(\theta)$ and the likelihood $p(y|\theta)$. Bayes rules tells us that the posterior distribution of the parameters $\theta$ is $p(\theta|y) \propto p(\theta)p(y|\theta)$. We thus know this distribution $p(\theta|y)$ up to a multiplicative constant. Stan provides a version of Markov Chain Monte Carlo (MCMC), which provides samples which approximativelly follow the $p(\theta|y)$ distribution. By approximating the true posterior distribution by the empirical distribution on these samples, we can compute the expectations of functions of the paramaters, or e.g the probability that the number of infected student is more than 10 on the third day.], -->
<!-- we collect it into a list -->
We now go to R and collect the data into a list.
```{r}
# time series of cases
cases <- influenza_england_1978_school$in_bed # Number of students in bed
# total count
N <- 763;
# times
n_days <- length(cases)
t <- seq(0, n_days, by = 1)
t0 = 0
t <- t[-1]
#initial conditions
i0 <- 1
s0 <- N - i0
r0 <- 0
y0 = c(S = s0, I = i0, R = r0)
# data for Stan
data_sir <- list(n_days = n_days, y0 = y0, t0 = t0, ts = t, N = N, cases = cases)
# number of MCMC steps
niter <- 2000
```
Next we compile the model, saved in the file `sir_negbin.stan`,
```{r}
model <- stan_model("sir_negbin.stan")
```
and run MCMC.
For this problem, it suffices to use Stan's defaults.
Note that, as is standard practice, we run 4 Markov chains.
```{r}
fit_sir_negbin <- sampling(model,
data = data_sir,
iter = niter,
chains = 4)
```
### Checking the inference
Let's specify the parameters of interest.
```{r}
pars=c('beta', 'gamma', "R0", "recovery_time")
```
<!-- We here display the median (0.5 quantile) and 90% interval (0.05 and 0.95 quantiles), restricted to the parameters of interest. -->
We start with a summary table of the results,
which displays the posterior mean, standard error, quantiles,
and some useful diagnostics.
```{r}
print(fit_sir_negbin, pars = pars)
```
Stan gives us a host of information to evaluate whether the inference is reliable. During sampling, warnings can tell us if something is wrong (here we have no warnings^[Note that getting a few warnings is not always concerning, especially if they only occur during
the warmup phase. See [here](https://mc-stan.org/misc/warnings.html) for an overview of Stan's warnings]). In the summary table, several quantities are available to check inference. Here we note that $\widehat R$ is close to 1 (< 1.01),
indicating the 4 Markov chains are in close agreement with one another.
Furthermore the effective samples size, $n_\mathrm{eff}$, is large (> 100^[$n_\mathrm{eff} \geq 100$ is considered sufficient to estimate the posterior mean correctly. Note that we may need a bigger $n_\mathrm{eff}$ to estimate extreme quantiles.]),
which means the Markov chains were able to cohesively explore the parameter space.
Conversely, large $\widehat R$ and low $n_\mathrm{eff}$ would indicate that the Markov chains
are poorly mixing. Apart from fixing coding errors, improving the mixing of the Markov chains almost always requires tweaking the model specification, for example with a [reparameterization](https://mc-stan.org/docs/2_23/stan-users-guide/reparameterization-section.html) or stronger priors.
<!-- There were no divergent transitions reported^[Divergences occur when Stan’s Hamiltonian solver diverges from the true Hamiltonian, which must be conserved, because of numerical problems in the stepwise gradient-based approximation of the curvature of the log density. This indicates that the Markov chains did not completely explore the posterior and that our Markov chain Monte Carlo estimators will be biased. This is often the sign that some region of the posterior is hard to explore and that the step-size strongly depends on the location. Divergences, however, can sometimes be false positives. To verify that we have real fitting issues we can rerun with a larger target acceptance probability, adapt_delta, which will force more accurate simulations of Hamiltonian trajectories and reduce the false positives.] and more generally no warning messages. -->
<!-- $\hat{R}$ values^[$\hat{R}$ indicates how well different chains of the MCMC have mixed : if $\hat{R}$ is close to 1, it means that all the chains are exploring the same region, and thus that no chain should be stuck in one subregion. In practice, $\hat{R} \leq 1.01$ is satisfactory.] are close to 1 (<1.01), and the effective sample size^[The effective sample size corresponds to the number of independant posterior draws which would yield the same standard error of estimation for the **mean** of the parameters that the draws from Stan's MCMC. Note that $n_{eff}$ can be much lower for quantiles. In practice, $n_{eff} \geq 100$ for estimating the parameter mean is considered sufficient. With effective sample sizes of roughly one hundred, standard errors are roughly one tenth the size of posterior standard deviations, being in an inverse square root relation.] estimates for each parameter are sufficient (>100) for inference^[Both large $\hat{R}$ and low effective samples per transition would be consequences of poorly mixing Markov chains. Improving the mixing of the Markov chains almost always requires tweaking the model specification, for example with a reparameterization or stronger priors.]. -->
We can furthermore plot the marginal posterior densities
and confirm the Markov chains are in agreement with one another.
```{r}
stan_dens(fit_sir_negbin, pars = pars, separate_chains = TRUE)
```
### Checking the model
Now that we trust our inference, let us check the utility of our model.
Utility is problem specific and can include the
precise estimation of a quantity or predicting future behaviors.
In general, it is good to check if our model, once fitted, produces simulations
that are consistent with the observed data.
This is the idea behind _posterior predictive checks_.
We sample predictions, $\mathcal Y_\mathrm{pred}$,
from $p(\mathcal Y_\mathrm{pred} \mid \mathcal Y)$
and use these samples to construct a fitted curve for students in bed,
together with the uncertainty (90% interval,
meaning observed data is expected to fall outside of this interval one in ten times).
This _posterior predictive check_ allows us to verify if the model captures the structure of the data.
Here we see that the model gives a satisfying fit to the data,
and that the model uncertainty is able to capture the variation of the data.
```{r}
smr_pred <- cbind(as.data.frame(summary(
fit_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
ggplot(smr_pred, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
geom_line(mapping = aes(x = t, y = X50.)) +
geom_point(mapping = aes(y = cases)) +
labs(x = "Day", y = "Number of students in bed")
```
Maybe we also want to access the true number of infected people at each time, and not just the number of students in bed. This is a latent variable for which we have an estimation.
```{r}
params <- lapply(t, function(i){sprintf("y[%s,2]", i)}) #number of infected for each day
smr_y <- as.data.frame(summary(fit_sir_negbin,
pars = params, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
ggplot(smr_y, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
geom_line(mapping = aes(x = t, y = X50.)) +
labs(x = "Day", y = "Number of infected students")
```
## Complete Stan program
```{stan, eval=F, output.var="md"}
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real S = y[1];
real I = y[2];
real R = y[3];
real N = x_i[1];
real beta = theta[1];
real gamma = theta[2];
real dS_dt = -beta * I * S / N;
real dI_dt = beta * I * S / N - gamma * I;
real dR_dt = gamma * I;
return {dS_dt, dI_dt, dR_dt};
}
}
data {
int<lower=1> n_days;
real y0[3];
real t0;
real ts[n_days];
int N;
int cases[n_days];
}
transformed data {
real x_r[0];
int x_i[1] = { N };
}
parameters {
real<lower=0> gamma;
real<lower=0> beta;
real<lower=0> phi_inv;
}
transformed parameters{
real y[n_days, 3];
real phi = 1. / phi_inv;
{
real theta[2];
theta[1] = beta;
theta[2] = gamma;
y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
}
}
model {
//priors
beta ~ normal(2, 1);
gamma ~ normal(0.4, 0.5);
phi_inv ~ exponential(5);
//sampling distribution
//col(matrix x, int n) - The n-th column of matrix x. Here the number of infected people
cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);
}
generated quantities {
real R0 = beta / gamma;
real recovery_time = 1 / gamma;
real pred_cases[n_days];
pred_cases = neg_binomial_2_rng(col(to_matrix(y), 2), phi);
}
```
# 2 Using simulated data to understand our model
We fitted a simple model to a well-understood data set.
In practice, we must proceed more cautiously
and probe the behavior of our model and our inference algorithm.
To that end, working with fake data can be a very productive endeavour.
## Checking our priors
We can check if our priors are sound
by computing the *a priori* probability of various epidemiological parameters of interest.
For instance for influenza, we know from domain knowledge that $R_0$ is typically between 1 and 2,
and that the recovery time is approximately 1 week.
We want priors that allow for every reasonable configurations of the data
but exclude pattently absurd scenarios, per our domain expertise.
To check if our priors fulfill this role, we can do a *prior predictive check*.
<!-- i.e sample parameters of interest from the prior distribution. -->
To conduct a prior predictive check, we take the same model as before,
put the parameters of interest in the `generated_quantities` code block,
and remove the sampling distribution term from the model.
Without the sampling distribution, the parameters are not fitted to the data
and are thus sampled from their prior distribution.
The Stan code is thus the same as the final Stan code, without the `cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);` line. A useful trick to make prior predictive check easy is to add a switch `compute_likelihood` to the data. Then in the `model` code block :
```{stan, eval=F, output.var="md"}
if (compute_likelihood == 1)
cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);
```
This allows to do prior predictive check and inference with the same Stan file.
We compile the model without the likelihood term
```{r}
model <- stan_model("sir_prior.stan")
```
and sample from it.
```{r}
fit_sir_prior <- sampling(model,
data = data_sir, seed = 0, chains = 4)
```
This gives us samples from the *a priori* distribution of parameters, which we can visualize.
Here we show the distribution of the log of the recovery time,
with the red bars showing loose bounds on the recovery time (1/2 day and 30 days).
We observe that most of the probality mass is between the red bars
but we still allow more extreme values,
meaning our posterior can concentrate outside the bars, if the data warrants it.
```{r}
s_prior <- rstan::extract(fit_sir_prior)
df_test <- tibble(r = s_prior$recovery_time)
ggplot(data = df_test) +
geom_histogram(mapping = aes(x = r), bins = 30) +
labs(x = "log(recovery time)") +
geom_vline(xintercept = 0.5, color = "red") +
geom_vline(xintercept = 30, color = "red") +
scale_x_log10()
```
We can do the same thing for $R_0$ (again, on the log-scale), the loose bounds being 0.3 and 30.
```{r}
df_test <- tibble(r = s_prior$R0)
ggplot(data = df_test) +
geom_histogram(mapping = aes(x = r), bins = 30) +
labs(x = "log(R0)") +
geom_vline(xintercept = 0.3, color = "red") +
geom_vline(xintercept = 30, color = "red") +
scale_x_log10()
```
We thus see that these distributions are coherent with domain knowledge. See [here](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) for more recommendations on prior choice.
^[Previoulsy, we fitted the data with these priors and found a posteriori $R_0$ ~ 3 and a recovery time of ~ 2 days. This is quite unexpected from our basic domain knowledge, but can probably be explained ($R_0$ bigger among students? isolated students counts for recovered in the model? etc). This shows that the prior should not be too constraining in order to incorporate both prior knowledge and unexpected knowledge from the data.]
We can also plot trajectories of infection according to the prior,
that is the number of infected people at each time accoring to prior distributions of parameters.
```{r, warning=FALSE}
n_draws <- 1000
draws <- as_tibble(t(s_prior$y[,,2][1:n_draws,])) %>% add_column(t=t)
draws <- pivot_longer(draws, c(1:1000) , names_to = "draw")
draws %>%
ggplot() +
geom_line(mapping = aes(x = t, y=value, group = draw), alpha = 0.6, size=0.1) +
geom_hline(yintercept=763, color="red") +
geom_text(x=1.8, y=747, label="Population size", color="red") +
labs(x = "Day", y="Number of infected students")
```
And the median (black line) and 90% interval of the *a priori* number of student in bed (i.e the observed number of infected students).
```{r}
smr_pred <- cbind(as.data.frame(summary(fit_sir_prior, pars="pred_cases",
probs=c(0.05, 0.5, 0.95))$summary), t)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
ggplot(smr_pred, mapping=aes(x=t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
geom_line(mapping=aes(x=t, y=X50.)) +
geom_hline(yintercept=763, color="red" ) +
geom_text(x=1.8, y=747, label="Population size", color="red") +
labs(x = "Day", y="Number of students in bed")
```
It seems that most trajectories are reasonable and quite diverse. Still, some of the curves look a little bit funky and suggest we could refine our priors and make them more informative, although it may not be needed here.
Typically, we can get away with priors that do not capture all our *a priori* knowledge,
provided the data is informative enough.
However when dealing with complicated models and relatively sparse data,
we usually need well constructed priors to regularize our estimates and avoid non-identifiability.
## Can our inference algorithm recover the right parameters?
While there exist many theoretical guarantees for MCMC algorithms,
modelers should realize that these rely on a set of assumptions
which are not always easy to verify
and that many of these guarantees are _asymptotic_.
This means they are proven in the limit where we have an infinite number of samples from the posterior distribution.
A very nice, if advanced, review on the subject can be found in @Roberts_mcmc_2004.
As practitioners, we must contend with finite computational resources
and assumptions which may or may not hold.
The diagnostics we reviewed earlier, e.g. $\widehat R$, effective sample sizes,
provide _necessary_ conditions for
the MCMC sampler to work but not _sufficient_ ones.
Nevertheless they are potent tools for diagnosing shortcomings in our inference.
This section provides further such tools,
from both a rigorous and a pragmatic perspective.
Fitting the model to simulated data is, if done properly, an effective way
to test whether our inference algorithm is reliable.
If we cannot successfully fit the model in a controlled setting,
it is unlikely we can do so with real data.
This of course raises the question of what is meant by "successfully fitting"
the model.
In a Bayesian setting, this means our inference procedure recovers
the correct posterior distribution.
Unfortunately, even in a controlled setting,
the posterior distribution is, but in the simplest cases, not tractable.
A powerful method to check the accuracy of our Bayesian inference
is _simulation-based calibration_ (SBC) [@Talts_sbc_2018].
SBC exploits a very nice consistency result.
The intuition is the following:
if we draw our parameters from our prior distribution
$$
\theta_1, ..., \theta_2 \sim p(\theta)
$$
and for each $\theta_i$ simulate a data set $\mathcal Y^i$,
we can by fitting the model multiple times recover the prior distribution
from the estimated posteriors.
This technique is a bit beyond the scope of this tutorial,
though we vividly encourage the reader to consult the original paper, or to see [here](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html) how this method fits in a principled Bayesian workflow.
For the time being, we focus on a simpler heuristic:
fit the model to one simulated data set and
check if we recover the correct parameter value.
There are serious limitations with this approach:
when do we consider that the estimated posterior distribution covers
the correct value? How do we know if the variance of the posterior
is properly estimated? etc.
But the test is useful: in this controlled setting,
do the chains converge?
Is the computation of the log density numerically stable
(e.g. are we using the right ODE integrator)?
Do my priors prevent the chains from wondering into absurd regions
of the parameter space?
These are all questions this simple test can help us tackle.
<!-- A quick check is to take one trajectory sampled from the prior, and check if we can retrieve the true parameters with our model. Still, altough this test can be useful to shed light on some issues with the model, it is far from enough to falsify or validate a model. See [@talts2018validating] for a discussion of this method and a correct way to check calibration by fitting draws from the prior distribution. -->
We take one arbitrary draw from the prior distribution
```{r}
# one arbitrary draw from the prior distribution
draw <- 12
# the number of predicted cases sampled from the prior distribution, which we will use as data
cases_simu <- s_prior$pred_cases[draw,]
```
And use it as data which we fit to our model.
```{r}
data_simu <- list (n_days = n_days, y0 = y0, t0 = t0, ts = t, N=N, cases=cases_simu)
fit_simu <- sampling(model, data=data_simu, chains=4)
```
We can then examine the estimated posterior distribution.
<!-- Here we see it is the case. -->
```{r}
params = c("beta", "gamma", "phi")
paste("true beta :", toString(s_prior$beta[draw]),
", true gamma :", toString(s_prior$gamma[draw]), ", true phi :", toString(s_prior$phi[draw]))
print(fit_simu, pars = params)
```
We plot the posterior density (in red) to check if it matches the true value of the parameter (black line).
The density is compatible with the true parameters, although not always centered on it.
The latter is not alarming, especially if the model parameter $\theta$ we sampled,
lies on the tail of the prior distribution. We could repeat this process a few times to get a better sense of the performance of our inference algorithm.
```{r}
plot_beta <- stan_dens(fit_simu, pars="beta") + geom_vline(xintercept =s_prior$beta[draw])
plot_gamma <- stan_dens(fit_simu, pars="gamma") + geom_vline(xintercept = s_prior$gamma[draw])
plot_phi <- stan_dens(fit_simu, pars="phi_inv") + geom_vline(xintercept = s_prior$phi_inv[draw])
grid.arrange(plot_beta, plot_gamma, plot_phi, nrow=1)
```
# 3 Scaling up ODE-based models
Doing MCMC on ODE-based models can be computationally intensive,
especially as we scale up the number of observations, parameters,
and start using more sophisticated ODEs.
If we want to ripe the benefits of a full Bayesian inference,
we have to pay the computational cost.
But while we cannot get away with a free lunch,
we can avoid an overpriced one.
Stan is a flexible language, which means there are worse and better ways of coding things.
This section develops a few principles to make ODE models in Stan
more scalable,
drawing on our experience with an SEIR model of Covid-19 [@Riou_covid_2020].
## The Computational cost of Stan
Stan abstracts the inference away from the modeling but it's worth
taking a peek inside the black box.
Markov chains are constructed using a dynamic Hamiltonian Monte Carlo (HMC)
sampler [e.g @Betancourt_hmc_2018; @Hoffman_nuts_2014].
To run the sampler, we need to specify the log joint density over
the observations, $\mathcal Y$, and the latent variables, $\theta$,
$$
\log p(\mathcal Y, \theta) = \log p(\mathcal Y \mid \theta) + \log p(\theta)
$$
which conveniently splits between a log likelihood and a log prior.
This is ultimately what we specify in the `model` block.
In addition, Hamiltonian Monte Carlo uses the gradient
with respect to $\theta$,
$$
\nabla_\theta \log p(\mathcal Y, \theta) = \nabla_\theta \log p(\theta \mid \mathcal Y).
$$
The gradient contains information about the posterior distribution which,
when properly harnessed, allows us to efficiently sample from the posterior.
Fortunately, the user does not need to specify the gradient.
Instead, _automatic differentiation_ generates the requisite derivatives
under the hood, based on computer code to evaluate the log joint density^[
Though heavily used in computational statistics and machine learning,
automatic differentiation remains an arcane subject for many practitioners;
for a review on the subject, we recommend [@Margossian_ad_2019].
A more comprehensive treatment can be found
in [@Griewank_ad_2008].
].
But, just like evaluating $\log p(\mathcal Y, \theta)$,
differentiation has a computational cost.
HMC simulates trajectories across the parameter space
by numerically solving Hamilton's equations of motion.
We solve these equations using a numerical leapfrog integrator.
At each step the integrator takes,
we need to _evaluate and differentiate_ $\log p(\mathcal Y, \theta)$.
## Partitioning the code into blocks
This perspective informs how each model block scales.
The `data` and parameters blocks are used to declare variables.
The `transformed data` block is evaluated once.
The `transformed parameters` and `model` blocks
are evaluated and differentiated at each leapfrog step,
which is multiple times per iteration.
The `generated quantities` block is evaluated once per iteration.
Hence operations in the `parameters` and `transformed parameters`
blocks dominate the computational cost and should only entail
operations that depend on $\theta$
and are required to compute $\log p(\mathcal Y, \theta)$.
In our boarding school model, the computation of $R_0$, for example,
is relegated to `generated quantities`.
Even though we want samples from $p(R_0 \mid \mathcal Y)$
and $R_0$ depends on the model parameters,
$R_0$ does not contribute to $\log p(\mathcal Y, \theta)$.
In the model by @Riou_covid_2020, we integrate an ODE --
denoting the solution $y$ -- in
`transformed parameters` from $t_0$ to $\tau$,
the time of the last observation
$$
\int_{t_0}^\tau f(y, t) \mathrm dt
$$
where
$$
\frac{\mathrm d y}{\mathrm d t} = f(y, t).
$$
We then make additional predictions, from time $\tau$ to $\tau'$,
$$
\int_\tau^{\tau'} f(y, t) \mathrm{d}t.
$$
Because these predicted solutions do not impact our likelihood,
we do these calculations in `generated quantities`.
Compared to the first integral, this second integration is computed
a small number of times
and it is not differentiated, making its computational cost marginal.
## Propagating derivatives through ODEs
Our ODE is defined by
$$
\frac{\mathrm d y}{\mathrm d t} = f(y, t, \vartheta, x).
$$
Here, $\vartheta$ contains inputs to $f$ that depend on the model parameters, $\theta$,
while $x$ contains inputs which do _not_ depend on $\theta$
and therefore remain fixed as the Markov chain moves through the parameter space.
Note that in general, $\vartheta \neq \theta$.
In the SIR model for example, $\vartheta$ is defined as
```
theta[1] = beta;
theta[2] = gamma;
```
and contains two model parameters.
Specifically $\vartheta = (\beta, \gamma)$, while $\theta = (\beta, \gamma, \phi^{-1})$.
To define the integral, we additionally specify an initial time, $t_0$,
times of integration, $\tau$, and an initial condition, $y_0$,
all of which can vary with $\theta$.
Hence when propagating derivatives to compute the gradient of the log joint density,
we need to worry about how the solution varies
with respect to $\vartheta$ and potentially $y_0$^[
While $t_0$ and $\tau$ may depend on model parameters,
we here assume they do not to avoid some minute technicalities
and simplify our discussion.].
We say that these elements are _varying parameters_
and denote $K$ the number of such elements.
Furthermore, let $N$ be the number of _states_, that is the length of $y$
or in a SIR model, the number of compartments.
In Stan, we propagate derivatives by solving a coupled system of ODEs.
The intuition is the following.
Suppose we want to compute
$$
\frac{\mathrm d y}{\mathrm d \vartheta}.
$$
We do not have an analytical expression for $y$, so a direct
application of automatic differentiation is not feasible.
But we can, assuming the requisite derivatives exist, compute
$$
\frac{\mathrm d f}{\mathrm d \vartheta} = \frac{\mathrm d}{\mathrm dt}
\frac{\mathrm dy}{\mathrm d \vartheta}
$$
and then integrate this quantity numerically.
The end result is that, instead of only solving for the $N$ original states,
we solve an $N + NK$ system to both evaluate and differentiate $y$.
Strictly speaking, we do not need to explicitly compute
$\mathrm d y / \mathrm d \vartheta$ to propagate derivatives;
this is an important, if somewhat counter-intuitive, result of automatic differentiation,
and motivates a so-called _adjoint method_^[For a discussion on adjoint methods, we
recommend @Hindmarsh_ode_2020.],
which only requires solving $2N + K$ ODE states.
While this method is very much on our todo list,
it is not yet implemented in Stan.
But the takeaway is the same for both methods: we should minimize $K$ as much as possible.
### Splitting fixed and varying parameters
This sheds light on the function signature of Stan's numerical integrator.
For every element in `theta`, we add an additional $N$ states to solve for.
Hence, components which do not depend on $\theta$ should be relegated
to `x_r` and `x_i`.
### Fixing the initial conditions (where we can)
Suppose our initial condition, $y_0$, are varying parameters,
i.e. depend on the model parameters.
It is not uncommon for some of the elements in $y_0$ to not depend on $\theta$.
For example, in a compartment model,
the initial condition for the first compartment may depend on model parameters,