<a href="https://colab.research.google.com/github/zia207/Survival_Analysis_R/blob/main/Colab_Notebook/02_07_07_07_survival_analysis_super_learner_r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![All-test](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)


# 2.7.7 Ensemble-based (Super Learner) Survival Prediction

This notebook demonstrates **ensemble-based survival prediction** using the **`survivalSL`** R package, which implements the **Super Learner** framework for combining multiple survival models to optimize prognostic performance in the presence of right-censored time-to-event data.


## Overview

**Ensemble-based survival prediction** is a machine learning approach for modeling **time-to-event outcomes** (e.g., death, disease recurrence, machine failure) in the presence of **censoring**, by **combining multiple individual (“base”) survival models** into a single, more accurate and robust predictor.

Instead of relying on a single model (e.g., Cox regression or Random Survival Forest), ensemble-based prediction:
- Trains **many diverse models** (parametric, semi-parametric, non-parametric, regularized, etc.)
- **Combines their predictions** using data-driven weights that optimize a pre-specified performance metric (e.g., time-dependent AUC, Brier score)
- Produces a **final prediction** that often **outperforms any individual model**

This is formalized in the **Super Learner** framework, a theoretically grounded ensemble method that is **asymptotically optimal**: as the sample size grows, it performs at least as well as the best single model in the library with respect to the chosen cross-validated loss.
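As a minimal sketch of the weighting idea described above, the snippet below combines predicted survival probabilities from three base learners into one prediction. The numbers are made up for illustration, and the softmax parameterization of the weights is an assumption chosen here so that the weights stay non-negative and sum to one; it is not taken from the package documentation.

```r
# Hypothetical predicted survival probabilities at one time point
# (rows = 4 subjects, columns = 3 base learners); values are illustrative only
S <- cbind(cox     = c(0.90, 0.75, 0.60, 0.40),
           weibull = c(0.88, 0.70, 0.65, 0.35),
           rsf     = c(0.95, 0.80, 0.55, 0.45))

# Unconstrained parameters mapped to weights through a softmax,
# so every weight is positive and the weights sum to 1
b <- c(cox = 0, weibull = -1, rsf = 0.5)
w <- exp(b) / sum(exp(b))

# Ensemble prediction: weighted average of the base learners, per subject
S_sl <- as.vector(S %*% w)

round(w, 3)
round(S_sl, 3)
```

Because the weights form a convex combination, each ensemble prediction lies between the smallest and largest base-learner prediction for that subject.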


### Advantages

- **Adaptive**: Automatically selects the best-performing models (or combinations)
- **Robust**: Reduces reliance on correct model specification
- **Flexible**: Can include both traditional stats models and modern ML methods
- **Valid**: Uses cross-validation to ensure generalizability

## Setup R in Python Runtime - Install {rpy2}
{rpy2} is a Python package that provides an interface to the R programming language, allowing Python users to run R code, call R functions, and manipulate R objects directly from Python. It enables seamless integration between Python and R, leveraging R's statistical and graphical capabilities while using Python's flexibility. The package supports passing data between the two languages and is widely used for statistical analysis, data visualization, and machine learning tasks that benefit from R's specialized libraries.

In [None]:
!pip uninstall rpy2 -y
!pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.5.17
Uninstalling rpy2-3.5.17:
  Successfully uninstalled rpy2-3.5.17
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp312-cp312-linux_x86_64.whl size=316569 sha256=89ae16fad1fffae78000c08d45854029920c240c8c9745051d438ffc816d52e2
  Stored in directory: /root/.cache/pip/wheels/00/26/d5/d5e8c0b039915e785be870270e4a9263e5058168a03513d8cc
Successfully built rpy2
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


## Mount Google Drive

In [None]:
## Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Ensemble-based survival prediction in R

The **`survivalSL`** package in R provides a comprehensive framework for **ensemble-based survival prediction** in the presence of **right-censored time-to-event data**. Built on the **Super Learner** principle—a data-adaptive ensemble method—it combines multiple diverse survival models (e.g., parametric, semi-parametric, and machine learning algorithms) into a single optimized predictor that maximizes prognostic performance.

The package includes a library of 14+ base learners—ranging from **Cox models** (penalized and unpenalized) and **parametric AFT/PH models** (Weibull, Gamma, Gompertz, etc.) to advanced methods like **Random Survival Forests**, **Royston–Parmar splines**, and **survival neural networks (PLANN)**. It supports **cross-validated tuning**, **performance evaluation** (via AUC, Brier score, concordance index, etc.), and **calibration assessment** through intuitive plots.

Designed for reproducibility and ease of use, `survivalSL` is ideal for clinical, epidemiological, and public health research where robust, high-performance survival prediction is essential.


### Accelerated Failure Time (AFT) Models

These assume that covariates act multiplicatively on **survival time** (i.e., they accelerate or decelerate the time to event).

1. AFT – Gamma Distribution

   Parametric AFT model assuming the survival times follow a **Gamma distribution**. Flexible for skewed survival data.

2. AFT – Generalized Gamma Distribution

   Highly flexible AFT model with **three parameters** (shape, scale, location). Includes Weibull, log-normal, and gamma as special cases.

3. AFT – Log-Logistic Distribution

   AFT model where survival times follow a **log-logistic distribution**. Allows non-monotonic hazard functions (e.g., hazard peaks then declines).

4. AFT – Weibull Distribution

   Most common AFT model. Assumes **Weibull-distributed survival times**. Encompasses exponential AFT as a special case and allows monotonic increasing/decreasing hazards.
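To make the "multiplicative effect on survival time" concrete, here is a minimal sketch that fits a Weibull AFT model with `survreg()` from the `survival` package on its bundled `lung` data. This is a stand-in for illustration, not the `survivalSL` learner itself. Exponentiated coefficients are time ratios, and for the Weibull case they can be converted to hazard ratios through the scale parameter.

```r
library(survival)

# Weibull AFT model on the `lung` data shipped with the survival package
# (status is coded 1 = censored, 2 = dead, which Surv() understands)
fit_aft <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

# Time ratios: a value above 1 means the covariate stretches survival time
time_ratio <- exp(coef(fit_aft))[-1]

# For the Weibull model only, AFT coefficients map to hazard ratios
# via HR = exp(-beta / scale)
hazard_ratio <- exp(-coef(fit_aft)[-1] / fit_aft$scale)

round(cbind(time_ratio, hazard_ratio), 3)
```

A time ratio above 1 always pairs with a hazard ratio below 1, since a longer expected survival time corresponds to a lower hazard.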

### Cox Proportional Hazards (PH) Models

These model the **log-hazard** as a linear function of covariates (hazards are proportional over time).

5. Cox Model with Selected Covariates

   Uses **stepwise AIC-based selection** to choose a subset of covariates from the full model.

6. Cox Regression (Full Model)

   Standard **unpenalized Cox model** using **all specified covariates**.

7. Elastic Net Cox Regression  

   **Penalized Cox model** combining **L1 (lasso)** and **L2 (ridge)** penalties. Useful for correlated or high-dimensional predictors.

8. Lasso Cox Regression

   **L1-penalized Cox model** that performs **variable selection** by shrinking some coefficients to zero.

9. Ridge Cox Regression

   **L2-penalized Cox model** that **shrinks coefficients** but retains all variables, which makes it a good choice under multicollinearity.
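The difference between the full Cox model and the AIC-selected one can be sketched with `coxph()` and base R's `step()` on the `lung` data from the `survival` package. Using `step()` here is an assumption made for illustration; the selection routine inside `survivalSL` may differ in its details.

```r
library(survival)

# Complete cases only, so every candidate model is fitted on the same rows
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog", "wt.loss")])

# Full (unpenalized) Cox model with all covariates
cox_full <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss, data = d)

# Backward stepwise selection driven by AIC
cox_aic <- step(cox_full, direction = "backward", trace = 0)

# Hazard ratios of the retained covariates
round(exp(coef(cox_aic)), 3)

# AIC of both models (second element of extractAIC)
c(full = extractAIC(cox_full)[2], selected = extractAIC(cox_aic)[2])
```

Backward selection only drops a covariate when doing so lowers the AIC, so the selected model never has a higher AIC than the full one.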



### Parametric Proportional Hazards (PH) Models

Fully parametric models assuming a specific baseline hazard shape.

10. PH – Exponential Distribution

   Assumes **constant hazard over time** (memoryless property). Simplest parametric PH model.

11. PH – Gompertz Distribution  

   Assumes **exponentially increasing (or decreasing) hazard** over time—common in aging or reliability studies.

12. Royston–Parmar Spline Model  

   **Flexible parametric PH model** in which the log cumulative baseline hazard is modeled using **natural cubic splines**. Adapts well to complex hazard shapes.
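The constant-hazard assumption of the exponential model can be checked with a short sketch. With no covariates, the fitted rate has a closed form: the number of events divided by the total follow-up time. The code uses `survreg()` from the `survival` package on the `lung` data as an illustrative stand-in.

```r
library(survival)

# Intercept-only exponential model: survreg() parameterizes log(mean time),
# so the hazard rate is exp(-intercept)
fit_exp <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
rate_model <- unname(exp(-coef(fit_exp)))

# Closed-form maximum likelihood estimate: events / total person-time
# (status == 2 marks a death in the lung data)
rate_closed <- sum(lung$status == 2) / sum(lung$time)

c(model = rate_model, closed_form = rate_closed)
```

Both numbers agree up to numerical tolerance, which shows that the exponential model summarizes the whole follow-up with a single rate.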


### Machine Learning–Based Survival Models

13. Survival Neural Network (PLANN)

 **Partial Logistic Artificial Neural Network**: discretizes time into intervals and uses logistic regression in a neural net architecture to model survival. Handles non-linear effects.

14. Random Survival Forest (RSF)

  **Non-parametric ensemble method** based on bootstrap aggregation of survival trees. Estimates survival via **ensemble cumulative hazard functions**. Robust to non-linearity and interactions.


These 14 learners provide a **diverse library**—spanning parametric, semi-parametric, regularized, and machine learning approaches—enabling the **Super Learner** to adaptively combine them for optimal predictive performance in censored time-to-event settings.
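The PLANN entry above describes discretizing time into intervals and modeling the event within each interval. A minimal sketch of that data layout follows, using `survSplit()` from the `survival` package and a plain logistic regression in place of the neural network. The cut points and the choice of `glm()` are assumptions made for illustration only.

```r
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex")])
d$status <- d$status - 1                 # recode 1/2 to 0/1

# Hypothetical interval boundaries in days (roughly six-month steps)
cuts <- c(180, 360, 540, 720)

# Person-period format: one row per subject per interval at risk;
# the event indicator is 1 only in the interval where the event occurs
long <- survSplit(Surv(time, status) ~ age + sex, data = d,
                  cut = cuts, episode = "interval")

# Discrete-time hazard model: logistic regression with interval as a factor.
# PLANN replaces this linear predictor with a hidden layer.
fit_dt <- glm(status ~ factor(interval) + age + sex,
              family = binomial, data = long)

head(long)
round(coef(fit_dt), 3)
```

The expansion keeps the total number of events unchanged while increasing the number of rows, because each subject contributes one row for every interval entered.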


### Install Required R Packages


The following R packages are required to run this notebook. If any of them are not installed, you can install them using the code below:


In [None]:
%%R
packages <- c(
  'tidyverse',
  'tidyr',
  'Hmisc',
  'survival',
  'survMisc',
  'survminer',
  'MASS',
  'survivalSL'
)

### Install missing packages

In [None]:
%%R
# Install missing packages
new.packages <- packages[!(packages %in% installed.packages(lib='drive/My Drive/R/')[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib='drive/My Drive/R/')
devtools::install_github("ItziarI/WeDiBaDis", lib='drive/My Drive/R/')
#devtools::install_github("ItziarI/WeDiBaDis")
BiocManager::install("survcomp")


### Verify Installation

In [None]:
%%R
.libPaths('drive/My Drive/R')
# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Installed packages:
  tidyverse survivalsvm    survival    survcomp    survMisc   survminer 
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
       MASS 
       TRUE 


### Load Packages

In [None]:
%%R
.libPaths('drive/My Drive/R')
# Load all required packages
invisible(lapply(packages, library, character.only = TRUE))


In [None]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

### Data

The `dataDIVAT2` dataset shipped with `survivalSL` is a sample of French kidney transplant recipients from the **DIVAT (Données Informatisées et Validées en Transplantation)** cohort. It is designed for time-to-event (survival) analysis in transplantation research.

### Key Variables:
- `age`: Recipient age at transplantation (years)
- `hla`: HLA incompatibility indicator (1 = high, 0 = otherwise)
- `retransplant`: Re-transplantation indicator (1 = re-transplantation, 0 = first transplantation)
- `ecd`: Expanded criteria donor indicator (1 = yes, 0 = no)
- `times`: Follow-up time in **years** (until the event or censoring)
- `failures`: Event indicator (1 = event, 0 = censored)

See `?dataDIVAT2` for the exact variable definitions. The related `dataDIVAT3` sample (4,267 recipients, with variables such as `ageR`, `death.time` in days, and `death`) is not used in this notebook.



In [None]:
%%R
# Load the built-in data (already 0/1 event coded)
data(dataDIVAT2)

# Optional: reduce size for faster computation (e.g., first 500 rows)
dat <- dataDIVAT2[1:500, ]  # n=500 is reasonable for demo

# Inspect
head(dat)
# Variables: age, hla, retransplant, ecd, times, failures (0=censored, 1=event)


### Define Survival Formula

In [None]:
%%R
form <- Surv(times, failures) ~ age + hla + retransplant + ecd
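The left-hand side of the formula is a `Surv` object, which pairs each follow-up time with its event indicator. A minimal sketch with three made-up observations shows how censoring is represented:

```r
library(survival)

# Three illustrative subjects: times in years, event = 1 (failure) or 0 (censored)
y <- Surv(time = c(2.5, 4.0, 1.2), event = c(1, 0, 1))

# Censored observations are printed with a trailing "+"
print(y)

# The object stores time and status as two columns
unclass(y)[, c("time", "status")]
```

The second subject is censored, so the printed value carries a `+` to indicate that the true event time is at least 4 years.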

### Fit all 14 Models

First, we’ll safely fit each model. Some may fail due to convergence (especially parametric models on small samples), so we wrap in `try()`.


In [None]:
%%R
models <- list()

model_funs <- c(
  "LIB_AFTgamma",
  "LIB_AFTggamma",
  "LIB_AFTllogis",
  "LIB_AFTweibull",
  "LIB_COXaic",
  "LIB_COXall",
  "LIB_COXen",
  "LIB_COXlasso",
  "LIB_COXridge",
  "LIB_PHexponential",
  "LIB_PHgompertz",
  "LIB_PHspline",
  "LIB_PLANN",
  "LIB_RSF"
)

# Fit each with appropriate args
for (m in model_funs) {
  cat("Fitting", m, "...\n")
  res <- try({
    if (m == "LIB_COXen") {
      LIB_COXen(formula = form, data = dat, alpha = 0.5, lambda = 0.01)
    } else if (m == "LIB_COXlasso") {
      LIB_COXlasso(formula = form, data = dat, lambda = 0.01)
    } else if (m == "LIB_COXridge") {
      LIB_COXridge(formula = form, data = dat, lambda = 0.01)
    } else if (m == "LIB_PHspline") {
      LIB_PHspline(formula = form, data = dat, k = 2)
    } else if (m == "LIB_PLANN") {
      LIB_PLANN(formula = form, data = dat,
                inter = 1, size = 10, decay = 0.01,
                maxit = 100, MaxNWts = 1000)
    } else if (m == "LIB_RSF") {
      LIB_RSF(formula = form, data = dat,
              nodesize = 15, mtry = 2, ntree = 200, seed = 123)
    } else {
      do.call(m, list(formula = form, data = dat))
    }
  }, silent = TRUE)

  if (!inherits(res, "try-error")) {
    models[[m]] <- res
  } else {
    cat(" ❌", m, "failed\n")
  }
}

### Evaluate Models Using Metrics

Choose a prognostic time (e.g., **5 years**):

In [None]:
%%R
pro_time <- 5  # years (since times are in years)

# Compute AUC and RIBS for each model
results <- tibble::tibble(
  model = names(models),
  auc   = sapply(models, function(x) tryCatch(metrics("auc", object = x, pro.time = pro_time), error = function(e) NA)),
  ribs  = sapply(models, function(x) tryCatch(metrics("ribs", object = x, pro.time = pro_time), error = function(e) NA))
) %>%
  filter(!is.na(auc)) %>%
  arrange(desc(auc), ribs)

print(results, digits = 4)
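To illustrate what a Brier-type score at a prognostic time measures, here is a deliberately naive sketch on the `lung` data with a Cox model from the `survival` package. It simply drops subjects censored before the prognostic time instead of reweighting them, so it is an assumption-laden simplification and not the estimator used by `metrics()`.

```r
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex")])
d$status <- d$status - 1                      # recode 1/2 to 0/1
t0 <- 365                                     # prognostic time in days

fit <- coxph(Surv(time, status) ~ age + sex, data = d)

# Predicted probability of surviving beyond t0 for every subject
s_hat <- as.vector(summary(survfit(fit, newdata = d), times = t0)$surv)

# Status at t0 is known if the subject died or was followed beyond t0
known <- d$status == 1 | d$time > t0
alive <- as.numeric(d$time > t0)

# Naive Brier score: mean squared gap between outcome and prediction
brier_naive <- mean((alive[known] - s_hat[known])^2)
brier_naive
```

Lower values indicate better predictions. A proper estimator would keep the early-censored subjects' information through inverse probability of censoring weights.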

### Select Top 5 Models & Fit Super Learner

In [None]:
%%R
top5 <- head(results$model, 5)
cat("Top 5 models:\n")
print(top5)

# Fit Super Learner using 5-fold CV (data is modest)
set.seed(42)
sl_fit <- survivalSL(
  formula = form,
  data = dat,
  methods = top5,
  metric = "auc",
  pro.time = pro_time,
  cv = 5,
  show_progress = TRUE
)

# View weights
print(sl_fit, digits = 4)
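The principle behind the weight estimation can be sketched without `survivalSL`: obtain out-of-fold predictions from each base learner, then choose the combination that scores best on those held-out predictions. The sketch below is a simplification under stated assumptions. It uses two hypothetical Cox learners on the `lung` data, combines linear predictors rather than survival probabilities, and searches a grid of weights using the concordance index.

```r
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
d$status <- d$status - 1                      # recode 1/2 to 0/1

set.seed(1)
K <- 5
fold <- sample(rep(seq_len(K), length.out = nrow(d)))

# Out-of-fold risk scores for two hypothetical base learners
lp1 <- lp2 <- numeric(nrow(d))
for (k in seq_len(K)) {
  tr <- d[fold != k, ]
  te <- d[fold == k, ]
  f1 <- coxph(Surv(time, status) ~ age + sex, data = tr)
  f2 <- coxph(Surv(time, status) ~ ph.ecog, data = tr)
  lp1[fold == k] <- as.vector(as.matrix(te[, c("age", "sex")]) %*% coef(f1))
  lp2[fold == k] <- te$ph.ecog * coef(f2)
}

# Concordance of a risk score (reverse = TRUE: higher score, shorter survival)
cindex <- function(score) {
  concordance(Surv(time, status) ~ score, data = d, reverse = TRUE)$concordance
}

# Grid search over the weight given to learner 1
w_grid <- seq(0, 1, by = 0.05)
cv_c <- sapply(w_grid, function(w) cindex(w * lp1 + (1 - w) * lp2))
best_w <- w_grid[which.max(cv_c)]

c(learner1 = cindex(lp1), learner2 = cindex(lp2),
  ensemble = max(cv_c), weight_learner1 = best_w)
```

Because the grid contains the weights 0 and 1, the selected combination scores at least as well on the held-out predictions as either learner alone.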

### Calibration Plot for Super Learner

We’ll use **internal validation** via splitting:

In [None]:
%%R
# Split 70/30
set.seed(123)
train_idx <- sample(nrow(dat), size = 0.7 * nrow(dat))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Refit Super Learner on training set
sl_train <- survivalSL(
  formula = form,
  data = train,
  methods = top5,
  metric = "auc",
  pro.time = pro_time,
  cv = 3,  # smaller CV due to n ≈ 350
  show_progress = FALSE
)

# Calibration plot at pro_time = 5 years
plot(sl_train,
     method = "sl",
     n.groups = 5,
     pro.time = pro_time,
     newdata = test,
     xlab = "Predicted 5-Year Survival",
     ylab = "Observed 5-Year Survival",
     main = "Calibration: Super Learner (dataDIVAT2)",
     col = "darkblue")
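What a grouped calibration plot compares can be sketched by hand: split the test subjects into groups by predicted survival, then set the mean prediction in each group against the Kaplan-Meier estimate at the same time. The example below uses a Cox model on the `lung` data with three groups. Both the model and the grouping are assumptions made for illustration, not a reproduction of the plot method used above.

```r
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
d$status <- d$status - 1                      # recode 1/2 to 0/1

set.seed(123)
idx <- sample(nrow(d), size = floor(0.7 * nrow(d)))
train <- d[idx, ]
test  <- d[-idx, ]

fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = train)
t0 <- 365                                     # evaluation time in days

# Predicted survival at t0 for each test subject
pred <- as.vector(summary(survfit(fit, newdata = test), times = t0)$surv)

# Three groups of similar size, ordered by predicted survival
grp <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, length.out = 4)),
           include.lowest = TRUE, labels = FALSE)

# Observed survival at t0 per group (Kaplan-Meier)
obs <- sapply(split(test, grp), function(g) {
  summary(survfit(Surv(time, status) ~ 1, data = g), times = t0, extend = TRUE)$surv
})

calib <- data.frame(group = sort(unique(grp)),
                    predicted = as.vector(tapply(pred, grp, mean)),
                    observed = as.vector(obs))
calib
```

A well-calibrated model gives `predicted` and `observed` values that are close within each group, which corresponds to points near the diagonal of a calibration plot.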

#### Kaplan–Meier Plot

In [None]:
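%%R
# --- KM plot: risk groups from Super Learner predictions on the test set ---
# Assumption: predict() on the fitted object returns $times and
# $predictions$sl (subjects in rows, times in columns); see ?predict.sltime.
pred_test <- predict(sl_train, newdata = test)
t_idx <- which.min(abs(pred_test$times - pro_time))   # column closest to pro_time
test_plot <- test
test_plot$risk <- 1 - pred_test$predictions$sl[, t_idx]  # predicted failure probability
test_plot$risk_group <- ifelse(test_plot$risk >= median(test_plot$risk), "High risk", "Low risk")
fit_km <- survfit(Surv(times, failures) ~ risk_group, data = test_plot)

p_km <- ggsurvplot(fit_km, data = test_plot,
                   risk.table = TRUE, pval = TRUE,
                   palette = c("#E41A1C", "#377EB8"),
                   legend.labs = c("High risk", "Low risk"),
                   title = "Super Learner Risk Stratification")$plot

# Display plot
p_km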

## Summary and Conclusion

The **`survivalSL`** R package offers a comprehensive and flexible framework for **ensemble-based survival prediction** in the presence of **right-censored time-to-event data**. Built on the **Super Learner** algorithm—a theoretically optimal ensemble method—it combines diverse base learners, including **parametric models** (e.g., AFT and PH models with Gamma, Weibull, Gompertz, or generalized Gamma distributions), **semi-parametric Cox models** (standard, stepwise-selected, and penalized variants such as Lasso, Ridge, and Elastic Net), **flexible spline-based PH models** (Royston–Parmar), and **machine learning approaches** (Random Survival Forests and survival neural networks via the PLANN method). The package supports model evaluation using a wide range of prognostic metrics—including time-dependent AUC, concordance index, Brier score, and integrated likelihood-based measures—and provides tools for **cross-validated tuning**, **prediction**, and **calibration assessment** via intuitive plots. Designed for clinical and epidemiological research, `survivalSL` enables robust, data-adaptive survival modeling while maintaining interpretability and reproducibility.


## Resources

### Official Package Resources

1. CRAN Page  
   - [survivalSL on CRAN](https://cran.r-project.org/web/packages/survivalSL/index.html)  
   - Includes full documentation, reference manual, and installation instructions.

2. Reference Manual (PDF)  
   - Direct link: [survivalSL Reference Manual](https://cran.r-project.org/web/packages/survivalSL/survivalSL.pdf)  
   - Comprehensive descriptions of all functions, arguments, and output structures.

3. Source Code & Issue Tracker
   - GitHub repository: [https://github.com/foucher-y/survivalSL](https://github.com/foucher-y/survivalSL)  
   - Report bugs, request features, or explore implementation details.

### Key Academic References

1. Super Learner Theory
   - van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007).  
     `Super Learner`.  
     `Statistical Applications in Genetics and Molecular Biology`, 6(1).  
     DOI: [10.2202/1544-6115.1309](https://doi.org/10.2202/1544-6115.1309)

2. Parametric & Flexible Survival Modeling  
   - Jackson, C. (2016).  
     `flexsurv: A Platform for Parametric Survival Modeling in R`.  
     `Journal of Statistical Software`, 70(8), 1–33.  
     DOI: [10.18637/jss.v070.i08](https://doi.org/10.18637/jss.v070.i08)

3. Royston–Parmar Spline Models  
   - Royston, P., & Parmar, M. K. B. (2002).  
     `Flexible parametric proportional-hazards and proportional-odds models`.  
     `Statistics in Medicine`, 21(15), 2175–2197.  
     DOI: [10.1002/sim.1203](https://doi.org/10.1002/sim.1203)

4. Penalized Cox Models (glmnet)  
   - Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011).  
     `Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent`.  
     `Journal of Statistical Software`, 39(5), 1–13.  
     [https://www.jstatsoft.org/v39/i05/](https://www.jstatsoft.org/v39/i05/)

5. Random Survival Forests
   - Ishwaran, H., & Kogalur, U. B. (2007).  
     `Random Survival Forests for R`.  
     `R News`, 7(2), 25–31.  
     [https://cran.r-project.org/doc/Rnews/Rnews_2007-2.pdf](https://cran.r-project.org/doc/Rnews/Rnews_2007-2.pdf)

6. PLANN (Survival Neural Networks)
   - Biganzoli, E., et al. (1998).  
     `Feed forward neural networks for the analysis of censored survival data`.  
     `Statistics in Medicine`, 17(10), 1169–1186.  
     DOI: [10.1002/(SICI)1097-0258(19980530)17:10<1169::AID-SIM818>3.0.CO;2-V](https://doi.org/10.1002/(SICI)1097-0258(19980530)17:10<1169::AID-SIM818>3.0.CO;2-V)

### Related R Packages Used by `survivalSL`
- [`survival`](https://cran.r-project.org/package=survival) – Core survival analysis
- [`flexsurv`](https://cran.r-project.org/package=flexsurv) – Parametric & spline models
- [`glmnet`](https://cran.r-project.org/package=glmnet) – Penalized Cox regression
- [`randomForestSRC`](https://cran.r-project.org/package=randomForestSRC) – Random Survival Forests
- [`survivalPLANN`](https://cran.r-project.org/package=survivalPLANN) – Survival neural networks

### Tutorials & Examples
- The examples in the `survivalSL` documentation (e.g., using `dataDIVAT2`) provide ready-to-run code for:
  - Fitting individual learners
  - Building a Super Learner
  - Evaluating performance (AUC, Brier score)
  - Calibration plotting
- See the "Examples" section in each function’s help page (e.g., `?survivalSL`, `?LIB_RSF`).