weichiyao/TimeVaryingData_LTRCforests

LTRCforests for Time-Varying Covariate Data

In the paper "Ensemble Methods for Survival Function Estimation with Time-Varying Covariates", we generalize the conditional inference and relative risk forests to allow time-varying covariates and propose two forest algorithms, CIF-TV and RRF-TV. By design, the proposed methods can handle survival data with all combinations of left-truncation and right-censoring in the survival outcome, and with both time-invariant and time-varying covariates. We name the methods LTRC CIF and LTRC RRF when referring to the conditional inference forest and relative risk forest, respectively.

The pkg folder contains the R package LTRCforests, available on CRAN.

The analysis folder provides the analysis code for the paper, as well as the analysis code for time-invariant covariate data. Supplemental Material for the paper can be found in the subfolder doc.

Analysis Codes

The analysis folder provides the analysis code for "Ensemble Methods for Survival Data with Time-Varying Covariates", as well as the analysis code for time-invariant covariate data:

  • The subfolder data contains

    • functions to create simulated datasets with time-varying covariates
    • functions to create simulated LTRC datasets with time-invariant covariates
    • a function to obtain the real dataset.
  • The subfolder codes contains the functions to reproduce the analysis:

    • simulations_tvary.R -- code to reproduce results for simulated datasets with time-varying covariates
    • simulations_tindepLTRC.R -- code to reproduce results for simulated LTRC datasets with time-invariant covariates
    • plot_and_tables_tvary.R -- code to reproduce plots and tables for simulated datasets with time-varying covariates
    • plot_and_tables_tindepLTRC.R -- code to reproduce plots and tables for simulated LTRC datasets with time-invariant covariates
    • realsetPBC.R -- analysis of the real dataset (including functions to reproduce plots)

    In particular, simulations_tvary.R and simulations_tindepLTRC.R provide results

    • to compare the performance of LTRC forests under the default and the proposed parameter settings;
    • to compare the performance of four methods: the Cox model, CIF, RRF and TSF (all forests trained with the proposed parameter settings);
    • to select a method by IBS-based 10-fold CV, and to compare the method chosen by this selection rule with the best method.
  • The subfolder utils contains the source functions used to perform the analysis in the folder codes, including the functions to compute the integrated L2 difference.
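
As a rough illustration of the integrated L2 difference mentioned above, the following Python sketch averages, over subjects, the squared gap between an estimated and a true survival curve, integrated over a time grid and divided by the length of that grid. It is not the R code in utils: the function name, the trapezoidal rule and the normalization are assumptions made for illustration, and the paper and the utils sources give the exact definition.

```python
def integrated_l2(times, est_curves, true_curves):
    """Average over subjects of (1/T) * integral of (S_hat - S)^2 dt.

    times       : increasing time grid shared by all subjects
    est_curves  : one list of estimated survival probabilities per subject
    true_curves : matching lists of true survival probabilities
    """
    span = times[-1] - times[0]
    total = 0.0
    for est, true in zip(est_curves, true_curves):
        sq = [(e - t) ** 2 for e, t in zip(est, true)]
        # Trapezoidal rule over the grid (an assumption of this sketch).
        area = sum(
            0.5 * (sq[i] + sq[i + 1]) * (times[i + 1] - times[i])
            for i in range(len(times) - 1)
        )
        total += area / span
    return total / len(est_curves)


# Toy data: one subject whose estimate is off by 0.2 at the middle time point.
times = [0.0, 1.0, 2.0]
true_curves = [[1.0, 0.5, 0.25]]
est_curves = [[1.0, 0.7, 0.25]]
print(integrated_l2(times, est_curves, true_curves))  # approximately 0.02
```

Smaller values mean the estimated curves track the true curves more closely, which is why this quantity is only available in simulations where the true survival function is known.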

LTRCforests for Time-Varying Covariate Data

The main analysis for applying the methodology to time-varying covariate data is provided in the paper "Ensemble Methods for Survival Data with Time-Varying Covariates". Here we provide some more detailed information as supplemental material.

"Out-of-bag" observation-based mtry tuning algorithm

The value of mtry can be fine-tuned using the "out-of-bag" observations. The simulation results show that tuning can greatly improve the forest performance over the default setting. See the following figure for the performance comparisons using CIF-TV for different values of mtry vs. the optimal one (Opt) vs. the one chosen by the tuning algorithm (Tuned).

See Figure for similar results using RRF-TV and Figure for TSF-TV.
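
The idea of choosing mtry by out-of-bag error can be sketched as a plain grid search. This is a simplification and not necessarily the search strategy used in the package; `tune_mtry` and `toy_oob_error` are hypothetical names, and the toy error function merely stands in for "fit a forest with this mtry and score it on its out-of-bag observations".

```python
def tune_mtry(candidates, oob_error):
    """Return (best_mtry, errors), where errors maps each candidate to its OOB error."""
    errors = {m: oob_error(m) for m in candidates}
    best = min(errors, key=errors.get)
    return best, errors


def toy_oob_error(mtry):
    # Hypothetical stand-in: a real run would train a forest with this mtry
    # and evaluate it on the observations left out of each bootstrap sample.
    return (mtry - 3) ** 2 / 100 + 0.10


best, errors = tune_mtry(range(1, 7), toy_oob_error)
print(best)  # 3
```

Because the out-of-bag observations are not used to grow the trees they score, no separate validation set is needed for this comparison.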

ntree in the ensembles

Throughout the experiments, we use ntree=100L for all forest ensembles. It has been recommended that a random forest should have between 64 and 128 trees (see Lecture Notes in Computer Science). More trees generally give better accuracy, but they also mean higher computational cost, and beyond a certain number of trees the improvement is negligible. See the following figure for performance comparisons for different numbers of trees built in the forest methods.

Bootstrap pseudo-subject observations vs. subject observations

In forest-like algorithms, bootstrapped samples are typically used to construct each individual tree to increase independence between these base learners. For time-varying covariate data, we have considered two different ways to bootstrap the observations:

  • Bootstrapping pseudo-subjects. The pseudo-subject observations are resampled as if they were "independent" observations, as the first step of any forest algorithm; this is because all pseudo-subjects are treated as independent observations in the recursive partitioning process;
  • Bootstrapping subjects. The subjects are resampled, and all of the pseudo-subjects of each selected subject are kept in the bootstrap sample.
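
The two schemes above can be sketched as follows. This is a minimal Python illustration on made-up rows, not the package's R implementation; the row layout (subject id, start, stop, event, covariate), the function names and the choice of sampling with replacement are assumptions of the sketch.

```python
import random

# Pseudo-subject rows: (subject id, start, stop, event, covariate value).
# Subject 3 enters at time 1.0, i.e. it is left-truncated.
rows = [
    (1, 0.0, 2.0, 0, 0.3),
    (1, 2.0, 5.0, 1, 0.9),
    (2, 0.0, 4.0, 0, 0.1),
    (3, 1.0, 3.0, 0, 0.5),
    (3, 3.0, 6.0, 0, 0.7),
    (3, 6.0, 7.0, 1, 0.2),
]


def bootstrap_pseudo_subjects(rows, rng):
    """Resample rows with replacement, treating each row as independent."""
    return [rng.choice(rows) for _ in rows]


def bootstrap_subjects(rows, rng):
    """Resample subject ids with replacement and keep all rows of each draw."""
    by_id = {}
    for row in rows:
        by_id.setdefault(row[0], []).append(row)
    ids = list(by_id)
    drawn = [rng.choice(ids) for _ in ids]
    return [row for sid in drawn for row in by_id[sid]]


rng = random.Random(0)
sample_rows = bootstrap_pseudo_subjects(rows, rng)
sample_subjects = bootstrap_subjects(rows, rng)
```

In the first scheme a subject's history can be split across the in-bag and out-of-bag sets, while in the second scheme each drawn subject always contributes its complete sequence of intervals.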

Simulations have shown that the two different bootstrapping mechanisms do not result in fundamentally different levels of performance:

LTRCforests for Time-Invariant Covariate Data

"Ensemble Methods for Survival Data with Time-Varying Covariates" mainly focuses on the analysis of the methodology applied on time-varying covariate data. There are certainly many situations in which only time-invariant (baseline) covariate information is available, and understanding the properties of different methods in that situation is important. In fact, our developed methodology and algorithms allow for estimation using the proposed forests for (left-truncated) right-censored data with time-invariant covariates.

The same data-driven guidance for tuning the parameters or selecting a modeling method also applies to the time-invariant covariates case (for both left-truncated right-censored survival data and right-censored survival data).

How mtry affects the performance and how the mtry-tuning algorithm performs

The following figures show how LTRC CIF performs with different values of mtry under the PH setting and the non-PH setting, respectively. The datasets are generated with survival times following a Weibull-Increasing distribution and a light (right-)censoring rate. The results suggest that the mtry-tuning algorithm is broadly effective, regardless of additional left-truncation and regardless of the presence of time-varying effects.

Figure 2.1. Integrated L2 difference of LTRC CIF with different mtry values under the PH setting.

Figure 2.2. Integrated L2 difference of LTRC CIF with different mtry values under the non-PH setting.

Using IBS-CV to choose among different methods

See below for the boxplots of integrated L2 difference for performance comparison. Datasets are generated with time-invariant covariates and left-truncated right-censored survival times following a Weibull-Increasing distribution. The rows show results for the number of subjects N=100 (first row), N=300 (second row), N=500 (third row) and N=1000 (bottom row); the columns show results for a linear survival relationship (first column), a nonlinear one (second column) and an interaction one (third column). In each plot:

  • LTRC CIF(P) -- LTRC CIF with proposed parameter settings
  • LTRC RRF(P) -- LTRC RRF with proposed parameter settings
  • LTRC TSF(P) -- LTRC TSF with proposed parameter settings
  • Opt -- Best method
  • IBSCV -- Method chosen by IBS-based 10-fold CV

Figure 3.3. Boxplots of integrated L2 difference for performance comparison on time-invariant covariate data.
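
The selection rule behind IBSCV can be sketched as a generic k-fold loop that keeps the method with the lowest average held-out score. This Python sketch is only an illustration under stated assumptions: `kfold_select`, the `methods` dictionary and the `score` callable are hypothetical, folds are formed over subject ids, fold scores are averaged with equal weight, and the toy score merely stands in for an integrated Brier score computed on held-out subjects.

```python
import random


def kfold_select(subject_ids, methods, score, k=10, seed=0):
    """Pick the method with the lowest average held-out score.

    methods : dict mapping a name to fit(train_ids) -> fitted model
    score   : score(model, test_ids) -> float, lower is better (IBS stand-in)
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k disjoint folds of subjects
    avg = {}
    for name, fit in methods.items():
        total = 0.0
        for fold in folds:
            held_out = set(fold)
            train = [s for s in ids if s not in held_out]
            total += score(fit(train), fold)
        avg[name] = total / k
    return min(avg, key=avg.get), avg


# Toy stand-ins: each "model" is just a fixed error level that the score returns.
methods = {
    "cox": lambda train: 0.20,
    "cif": lambda train: 0.15,
    "rrf": lambda train: 0.18,
}
chosen, avg = kfold_select(range(30), methods, lambda model, test: model)
print(chosen)  # cif
```

Unlike the integrated L2 difference, a Brier-type score does not require the true survival function, so this rule can be applied to real data.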
