Semiparametric Bayesian Inference for the Transmission Dynamics of COVID-19 with a State-Space Model
-
The manuscript has been published in Contemporary Clinical Trials (Special Issue on COVID-19)
-
Link to the manuscript: https://doi.org/10.1016/j.cct.2020.106146
-
arXiv version: http://arxiv.org/abs/2006.05581
-
For analysis results of U.S. data, see this webpage: http://covid19.laiyaconsulting.com/baysir
-
I am actively updating the package. Please visit again for the latest version!
[6/29] The default prior mean for the infectious period is changed to 9.3 days (based on He et al., Nature Medicine and this webpage). Previously it was 7 days. Also, the user is now allowed to specify the link function for the diagnosis rate (logit, probit, cloglog, where logit is the default). Previously cloglog was used as default. See the updated manuscript for details. Please install again for the updated version.
[6/23] There has been a mistake so that the earlier version of this package does not implement the parallel tempering procedure (now fixed of course). Please install again for the updated version.
First of all, you need R. For Windows users, Rtools is needed. For Mac users, Xcode may be needed.
The BaySIR
package has two dependencies: Rcpp
and RcppArmadillo
.
BLAS/LAPACK libraries are needed by RcppArmadillo
. Windows 7 users may have difficulties installing these.
The package was tested on MacOS (High Sierra 10.13.6) and Windows 10.
The BaySIR
package can be easily installed with the devtools
package in R. After devtools
has been installed, run the following commands in R to install the BaySIR
package.
library(devtools)
install_github("tianjianzhou/BaySIR")
Please refer to Documentation for details about the BaySIR functions.
library(BaySIR)
# read data
data(data_sim_1)
B = data_sim_1$B
I_D_0 = data_sim_1$I_D[1]
N = data_sim_1$N
# run MCMC (may take a few minutes, depending on computer. ~ 2 mins on Macbook Pro)
result_list = BaySIR_MCMC(B = B, I_D_0 = I_D_0, N = N)
# for testing purpose, use smaller number of MCMC burn-in/iterations
# result_list = BaySIR_MCMC(B = B, I_D_0 = I_D_0, N = N, burnin = 1000, thin = 2, niter = 500)
# posterior summary for the effective reproduction number
result_list$MCMC_summary$R_eff
# sample from posterior predictive distribution (for future 30 days)
predict_list = BaySIR_predict(T_pred = 30, MCMC_spls = result_list$MCMC_spls, B = B, I_D_0 = I_D_0, N = N)
# posterior median of future B's
predict_list$pred_summary$B[ , 1]
# posterior summary for the future effective reproduction numbers
predict_list$pred_summary$R_eff
Data from JHU CSSE
library(BaySIR)
# read data
data = read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", header = TRUE)
confirmed_cases_cum = colSums(as.matrix(data[data$Province_State == "Illinois", -(1:11)]))
confirmed_cases_cum = unname(confirmed_cases_cum[confirmed_cases_cum > 100])
# population in IL
N = 12671821
# run MCMC
result_list = BaySIR_MCMC(confirmed_cases_cum = confirmed_cases_cum, N = N)
# for testing purpose, use smaller number of MCMC burn-in/iterations
# result_list = BaySIR_MCMC(confirmed_cases_cum = confirmed_cases_cum, N = N, burnin = 1000, thin = 2, niter = 500)
# posterior summary for the effective reproduction number
result_list$MCMC_summary$R_eff
# sample from posterior predictive distribution (for future 30 days)
predict_list = BaySIR_predict(T_pred = 30, MCMC_spls = result_list$MCMC_spls, confirmed_cases_cum = confirmed_cases_cum, N = N)
# posterior median of future B's
predict_list$pred_summary$B[ , 1]
# posterior summary for the future effective reproduction numbers
predict_list$pred_summary$R_eff
We did not provide a plot function in the package. To plot the estimates, please use the R function plot()
or the ggplot2
package. An example is given below, plotting the estimated effective reproduction numbers.
library(ggplot2)
R_eff_median = result_list$MCMC_summary$R_eff[ , 1]
R_eff_lower = result_list$MCMC_summary$R_eff[ , 2]
R_eff_upper = result_list$MCMC_summary$R_eff[ , 3]
R_eff_df = data.frame(t = 0:(length(R_eff_median)-1),
m = R_eff_median,
l = R_eff_lower,
u = R_eff_upper)
plot_R_eff = ggplot() + geom_line(data = R_eff_df, aes(x = t, y = m), size = 0.6) +
geom_ribbon(data = R_eff_df, aes(x = t, ymin = l, ymax = u), alpha = 0.5, fill = "grey60") +
labs(x = "Day", y = "Eff. Reprod. No.")
plot_R_eff
-
Usage:
BaySIR_MCMC(B, I_D_0, N, ...)
-
Arguments
- Required
B
: a lengthT + 1
vector of daily new confirmed casesB[0], ..., B[T]
.I_D_0
: the total number of confirmed cases on day 0.N
: the population size.
- Optional
confirmed_cases_cum
: an optional lengthT + 2
vector of cumulative confirmed case counts. IfB
andI_D_0
have already been specified, thenconfirmed_cases_cum
will be ignored. If not bothB
andI_D_0
are specified andconfirmed_cases_cum
is supplied, thenconfirmed_cases_cum
will be used to calculateB
andI_D_0
.X
: a(T + 1) * Q
matrix, covariates related to the disease transmission rate. Default is an intercept term plus a time trend,X[t, ] = (1, t)
. It is possible to include other covariates. For example,X[t, ] = (1, t, 1(stay-at-home order on day t), t * 1(stay-at-home order on day t))
and thusQ = 4
.Y
: a(T + 1) * K
matrix, covariates related to the diagnosis rate. Default contains only an intercept term,Y[t, ] = 1
. It is possible to include other covariates. For example,Y[t, ] = (log number of test on day t)
orY[t, ] = (1, t)
.niter
: number of MCMC samples to return. Default is1000
. The total number of MCMC iterations to be run isniter * thin + burnin
.burnin
: number of MCMC iterations that will be discarded as initial burn-in. Default is10000
.thin
: meaning keep 1 draw everythin
MCMC iterations. Default is20
.link
: the link function to be used for the diagnosis rate (the rate is between 0 and 1, and the link function transforms it to the real line).1
,2
and3
represent the logit, probit and complementary log-log link, respectively. Default is1
.Delta
: a monotonically increasing vector (each element is larger than the previous) defining the temperatures of the parallel Markov chains (parallel tempering). The first element must be 1, corresponding to the original posterior. Default is1.5^(0:9)
.nu_alpha_1
andnu_alpha_2
: prior shape and rate parameters for the length of the infectious period.1 / alpha ~ Gamma(nu_alpha_1, nu_alpha_2)
. Default values are325.5
and35
, resulting in a prior mean of 9.3 (days) for1 / alpha
.nu_alpha_1
andnu_alpha_2
may be changed to reflect different modeling assumptions. For example, if one defines the infectious period as the time from infection to recovery/death, then a larger prior mean (~ 25 days) for1 / alpha
should be chosen (see, e.g., Verity et al., The Lancet Infectious Diseases).
- Required
-
Output
- A list of the following:
MCMC_spls
: a list of the MCMC samples for the parameters.MCMC_summary
: a list of the posterior summaries for the parameters. For each parameter, its posterior median, 2.5% quantile and 97.5% quantile are reported.
- A list of the following:
-
Usage:
BaySIR_predict(T_pred = 10, MCMC_spls, B, I_D_0, N, ...)
-
Arguments
- Required
MCMC_spls
: a list of the MCMC samples for the parameters obtained fromBaySIR_MCMC(...)
.B
: a lengthT + 1
vector of daily new confirmed casesB[0], ..., B[T]
.I_D_0
: the total number of confirmed cases on day 0.N
: the population size.
- Optional
T_pred
: the number of future days that you would like to predict. Will predictB[T + 1], ..., B[T + T_pred]
. Default is 10 days.confirmed_cases_cum
: an optional lengthT + 2
vector of cumulative confirmed case counts. IfB
andI_D_0
have already been specified, thenconfirmed_cases_cum
will be ignored. If not bothB
andI_D_0
are specified andconfirmed_cases_cum
is supplied, thenconfirmed_cases_cum
will be used to calculateB
andI_D_0
.X_pred
: aT_pred * Q
matrix, covariates related to the disease transmission rate for future daysT + 1, ..., T + T_pred
. Default is an intercept term plus a time trend,X_pred[t, ] = (1, T + t)
. Note that if policy indicator is used as a covariate, we may not know what the policy will look like in the future and have to impute its future value.Y_pred
: aT_pred * K
matrix, covariates related to the diagnosis rate for future daysT + 1, ..., T + T_pred
. Default contains only an intercept term,Y_pred[t, ] = 1
. Note that if number of tests is used as a covariate, we do not know what the number of tests will be in the future and have to impute its future value.X
: a(T + 1) * Q
matrix, covariates related to the disease transmission rate. Default is an intercept term plus a time trend,X[t, ] = (1, t)
.Y
: a(T + 1) * K
matrix, covariates related to the diagnosis rate. Default contains only an intercept term,Y[t, ] = 1
.link
: the link function to be used for the diagnosis rate (the rate is between 0 and 1, and the link function transforms it to the real line).1
,2
and3
represent the logit, probit and complementary log-log link, respectively. Default is1
.
- Required
-
Output
- A list of the following:
pred_spls
: a list of the MC samples for the parameters from their posterior predictive distributions.pred_summary
: a list of the summaries for the posterior predictive distributions of the parameters. For each parameter, its posterior (predictive) median, 2.5% quantile and 97.5% quantile are reported.
- A list of the following: