Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
161 lines (122 sloc) 5.34 KB
---
title: "Survival Analysis Example"
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Survival Analysis Example}
output:
knitr:::html_vignette:
toc: yes
---
```{r setup, include = FALSE}
options(digits = 3)
library(survival)
library(purrr)
library(rsample)
library(dplyr)
library(tidyposterior)
library(ggplot2)
library(tidyr)
```
In this article, a parametric analysis of censored data is conducted and `rsample` is used to measure the importance of predictors in the model. The data that will be used is the NCCTG lung cancer data contained in the `survival` package:
```{r lung}
library(survival)
str(lung)
```
`status` is an indicator for which patients are censored (`status = 1`) or an actual event (`status = 2`). The help file `?survreg` has the following model fit:
```{r example-model}
lung_mod <- survreg(Surv(time, status) ~ ph.ecog + age + strata(sex), data = lung)
summary(lung_mod)
```
Note that the stratification on gender only affects the scale parameter; the estimates above are from a log-linear model for the scale parameter even though they are listed with the regression variables for the other parameter. `coef` gives results that are more clear:
```{r coef}
coef(lung_mod)
```
To resample these data, it would be a good idea to try to maintain the same censoring rate across the splits. To do this, stratified resampling can be used where each analysis/assessment split is conducted within each value of the status indicator. To demonstrate, Monte Carlo resampling is used where 75% of the data are in the analysis set. A total of 100 splits are created.
```{r splits}
library(rsample)
set.seed(9666)
mc_samp <- mc_cv(lung, strata = "status", times = 100)
library(purrr)
cens_rate <- function(x) mean(analysis(x)$status == 1)
summary(map_dbl(mc_samp$splits, cens_rate))
```
To demonstrate the use of resampling with censored data, the parametric model shown above will be fit with different variable sets to characterize how important each predictor is to the outcome.
To do this, a set of formulas are created for the different variable sets:
```{r forms}
three_fact <- as.formula(Surv(time, status) ~ ph.ecog + age + strata(sex))
rm_ph.ecog <- as.formula(Surv(time, status) ~ age + strata(sex))
rm_age <- as.formula(Surv(time, status) ~ ph.ecog + strata(sex))
rm_sex <- as.formula(Surv(time, status) ~ ph.ecog + age )
```
The model fitting function will take the formula as an argument:
```{r fit-func}
mod_fit <- function(x, form, ...)
survreg(form, data = analysis(x), ...)
```
To calculate the efficacy of the model, the concordance statistic is used (see `?survConcordance`):
```{r concord}
get_concord <- function(split, mod, ...) {
pred_dat <- assessment(split)
pred_dat$pred <- predict(mod, newdata = pred_dat)
concordance(Surv(time, status) ~ pred, pred_dat, ...)$concordance
}
```
With these functions, a series of models are created for each variable set.
```{r models}
mc_samp$mod_full <- map(mc_samp$splits, mod_fit, form = three_fact)
mc_samp$mod_ph.ecog <- map(mc_samp$splits, mod_fit, form = rm_ph.ecog)
mc_samp$mod_age <- map(mc_samp$splits, mod_fit, form = rm_age)
mc_samp$mod_sex <- map(mc_samp$splits, mod_fit, form = rm_sex)
```
Similarly, the concordance values are computed for each model:
```{r concord-est}
mc_samp$full <- map2_dbl(mc_samp$splits, mc_samp$mod_full, get_concord)
mc_samp$ph.ecog <- map2_dbl(mc_samp$splits, mc_samp$mod_ph.ecog, get_concord)
mc_samp$age <- map2_dbl(mc_samp$splits, mc_samp$mod_age, get_concord)
mc_samp$sex <- map2_dbl(mc_samp$splits, mc_samp$mod_sex, get_concord)
```
The distributions of the resampling estimates
```{r concord-df}
library(dplyr)
concord_est <- mc_samp %>%
dplyr::select(-matches("^mod"))
library(tidyr)
library(ggplot2)
concord_est %>%
gather() %>%
ggplot(aes(x = statistic, col = model)) +
geom_line(stat = "density") +
theme_bw() +
theme(legend.position = "top")
```
It looks as though the model missing `ph.ecog` has larger concordance values than the other models. As one might expect, the full model and the model absent `sex` are very similar; the difference in these models should only be the scale parameters estimates.
To more formally test this, the `tidyposterior` package is used to create a Bayesian model for the concordance statistics.
```{r perf-mod, warning = FALSE, message = FALSE}
library(tidyposterior)
concord_est <- perf_mod(concord_est, seed = 6507, iter = 5000)
concord_est$stan
```
To summarize the posteriors for each model:
```{r post}
ggplot(tidy(concord_est)) +
theme_bw()
```
While this seems clear-cut, let's assume that a difference in the concordance statistic of 0.1 is a real effect. To compute the posteriors for the difference in models, the full model will be contrasted with the others:
```{r diffs}
comparisons <- contrast_models(
concord_est,
list_1 = rep("full", 3),
list_2 = c("ph.ecog", "age", "sex"),
seed = 4654
)
```
The posterior distributions show that, statistically, `ph.ecog` has real importance ot the model. However, since these distributions are mostly with +/- 0.05, they are unlikely to be real differences.
```{r diff-post}
ggplot(comparisons, size = 0.05) +
theme_bw()
```
The ROPE statistics quantify the practical effects:
```{r diff-sum}
summary(comparisons, size = 0.05) %>%
dplyr::select(contrast, starts_with("pract"))
```
You can’t perform that action at this time.