<a href="https://colab.research.google.com/github/zia207/Survival_Analysis_R/blob/main/Colab_Notebook/02_07_04_03_survival_analysis_frailty_models_r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![All-test](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

# 4.3 Frailty Models {.unnumbered}


Frailty models are extensions of standard survival analysis techniques, such as the Cox proportional hazards (PH) model, designed to handle unobserved heterogeneity or clustering in time-to-event data. In survival analysis, we often study the time until an event occurs (e.g., death, failure, or recurrence), but not all factors influencing the risk (hazard) may be observed or measurable. Frailty models introduce a random effect, called "frailty," to account for this unobserved variation.


### Key Concepts


- **Basic Idea**: In a standard Cox PH model, the hazard rate for an individual is $h(t | X) = h_0(t) \exp(\beta^T X)$, where $h_0(t)$ is the baseline hazard, $X$ are covariates, and $\beta$ are coefficients. Frailty models modify this to $h(t | X, u) = u \cdot h_0(t) \exp(\beta^T X)$, where $u$ is the frailty term—a non-negative random variable with mean 1 (for identifiability) and variance $\theta$ (which measures the degree of heterogeneity). Higher frailty ($u > 1$) means higher risk, and vice versa.

- **Unobserved Heterogeneity**: If important covariates are omitted, the population appears more homogeneous over time because "frailer" individuals experience the event earlier, leaving "robust" survivors. This can lead to biased estimates, attenuated hazard ratios, or apparent time-dependence in hazards.

- **Dependence and Clustering**: Frailty induces positive correlation between event times within clusters (e.g., families, hospitals) or for recurrent events in the same individual.

- **Distributions for Frailty**: Common choices include gamma (constant dependence), inverse Gaussian (intermediate dependence), positive stable (early dependence), log-normal, or compound Poisson (allows a non-susceptible subpopulation).

- **Effects**:

  - **Selection**: Over time, the average frailty among survivors decreases.
  
  - **Marginal vs. Conditional**: The population-averaged (marginal) hazard differs from the individual (conditional) hazard due to averaging over frailties.
  
  - **Cross-Ratio**: Measures how one event affects the hazard of another; constant for gamma frailty.
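
The selection effect can be checked with a small simulation. The sketch below assumes a gamma frailty with mean 1 and variance $\theta$ and a constant baseline hazard $h_0$; both values are illustrative and not taken from any dataset. Under these assumptions the mean frailty among survivors at time $t$ is $1 / (1 + \theta h_0 t)$, which the simulated values should track closely.

```r
set.seed(42)
theta <- 1      # assumed frailty variance (illustrative)
h0 <- 0.1       # assumed constant baseline hazard (illustrative)
n <- 200000     # number of simulated individuals

# Gamma frailty with mean 1 and variance theta
u <- rgamma(n, shape = 1 / theta, rate = 1 / theta)

# Conditional on the frailty u, event times are exponential with rate u * h0
time <- rexp(n, rate = u * h0)

# Mean frailty among individuals still event-free at each time point
t_grid <- c(0, 5, 10, 20)
sim_mean <- sapply(t_grid, function(t) mean(u[time > t]))

# Closed-form value for gamma frailty with a constant baseline hazard
theory_mean <- 1 / (1 + theta * h0 * t_grid)

selection <- data.frame(t = t_grid,
                        simulated = round(sim_mean, 3),
                        theoretical = round(theory_mean, 3))
print(selection)
```

The simulated column should fall steadily with time, which is the "frailer individuals leave first" pattern described in the Selection bullet.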


### Types of Frailty Models


- **Individual (Univariate) Frailty Models**: These apply to independent survival data where each individual has their own frailty, capturing unobserved individual-specific effects. They explain deviations from proportional hazards caused by omitted covariates but are hard to identify without strong assumptions (e.g., without covariates, the frailty distribution and the baseline hazard cannot be separated). They are less common in practice for non-clustered data but useful for modeling heterogeneity in large populations.

- **Shared Frailty Models**: These are for clustered data (e.g., siblings, patients in the same center) or recurrent events (e.g., multiple infections in one patient). The frailty is shared within the cluster or across events for the same individual, inducing dependence. For recurrent events, the "cluster" is often the individual, so the frailty is shared across their multiple events but individual-specific relative to the population. This is the most common type for datasets like recurrent disease episodes.

Frailty models can be estimated semi-parametrically (e.g., with a non-parametric baseline hazard fitted via the EM algorithm or penalized likelihood) or parametrically (e.g., with a Weibull baseline). A test for frailty (i.e., of $\theta = 0$) uses a mixture of chi-squared distributions because the null value lies on the boundary of the parameter space. R packages such as `survival`, `frailtypack`, `frailtyEM`, and `coxme` support fitting these models.



## Setup R in Python Runtime - Install {rpy2}
{rpy2} is a Python package that provides an interface to the R programming language, allowing Python users to run R code, call R functions, and manipulate R objects directly from Python. It enables seamless integration between Python and R, leveraging R's statistical and graphical capabilities while using Python's flexibility. The package supports passing data between the two languages and is widely used for statistical analysis, data visualization, and machine learning tasks that benefit from R's specialized libraries.

In [1]:
!pip uninstall rpy2 -y
!pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.5.17
Uninstalling rpy2-3.5.17:
  Successfully uninstalled rpy2-3.5.17
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp312-cp312-linux_x86_64.whl size=316570 sha256=f01a91e43b8095800fd990ceaafb0bb2aa9cda70c87fd6ca632ec7e9e1ac86f0
  Stored in directory: /root/.cache/pip/wheels/00/26/d5/d5e8c0b039915e785be870270e4a9263e5058168a03513d8cc
Successfully built rpy2
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


## Mount Google Drive

In [2]:
## Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Implement Frailty Models in R


We'll fit:

- A standard Cox PH model (no frailty, assuming independence).
- An individual frailty model (univariate; this is not standard for recurrent data, so we approximate it by giving each observation its own frailty and treating events as independent, although identifiability is limited).
- A shared frailty model (standard for recurrent events, with frailty shared across events per patient).

For individual frailty in recurrent data, it's conceptually tricky since events are correlated; we'll approximate it using a Gaussian random effect on a per-event basis (unique ID per row), but this is more illustrative than practical. In practice, shared frailty is preferred for this dataset.


### Install Required R Packages


The following R packages are required to run this notebook. If any of them are not installed, you can install them using the code below:


In [3]:
%%R
packages <-c(
		 'tidyverse',
		 'survival',
		 'survminer',
		 'ggsurvfit',
		 'tidycmprsk',
		 'ggfortify',
		 'timereg',
		 'cmprsk',
		 'riskRegression',
		 'reda',
		 'frailtypack',
		 'coxme'
		 )


### Install missing packages

In [None]:
%%R
# Install missing packages
new.packages <- packages[!(packages %in% installed.packages(lib='drive/My Drive/R/')[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib='drive/My Drive/R/')
devtools::install_github("ItziarI/WeDiBaDis", lib='drive/My Drive/R/')


In [None]:
%%R
install.packages(c("frailtypack", "coxme"), lib='drive/My Drive/R/')

In [7]:
%%R
.libPaths('drive/My Drive/R')
# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Installed packages:
     tidyverse       survival      survminer      ggsurvfit     tidycmprsk 
          TRUE           TRUE           TRUE           TRUE           TRUE 
     ggfortify        timereg         cmprsk riskRegression           reda 
          TRUE           TRUE           TRUE           TRUE           TRUE 
   frailtypack          coxme 
          TRUE           TRUE 


### Load Packages

In [8]:
%%R
.libPaths('drive/My Drive/R')
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

In [9]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

Successfully loaded packages:
 [1] "package:coxme"          "package:bdsmatrix"      "package:frailtypack"   
 [4] "package:survC1"         "package:MASS"           "package:doBy"          
 [7] "package:boot"           "package:reda"           "package:riskRegression"
[10] "package:cmprsk"         "package:timereg"        "package:ggfortify"     
[13] "package:tidycmprsk"     "package:ggsurvfit"      "package:survminer"     
[16] "package:ggpubr"         "package:survival"       "package:lubridate"     
[19] "package:forcats"        "package:stringr"        "package:dplyr"         
[22] "package:purrr"          "package:readr"          "package:tidyr"         
[25] "package:tibble"         "package:ggplot2"        "package:tidyverse"     
[28] "package:tools"          "package:stats"          "package:graphics"      
[31] "package:grDevices"      "package:utils"          "package:datasets"      
[34] "package:methods"        "package:base"          


### Data


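This tutorial uses the `bladder1` dataset from the `survival` package in R, which contains data on recurrent bladder cancer tumors from 118 patients in three treatment arms. It is in counting-process format for recurrent events, with one row per at-risk interval. Columns include:
- `id`: Patient ID (cluster for shared frailty).
- `treatment`: Treatment arm (placebo, pyridoxine, or thiotepa).
- `number`: Initial number of tumors.
- `size`: Size of the largest initial tumor (cm).
- `recur`: Number of recurrences for the patient.
- `start`: Start time of interval.
- `stop`: End time of interval (event or censoring time).
- `status`: End-of-interval code (0 = censored, 1 = recurrence, 2 = death from bladder disease, 3 = death from other or unknown cause).
- `rtumor`, `rsize`: Number and size of tumors found at a recurrence.
- `enum`: Event (interval) number within patient.

The models below pass `status` directly to `Surv()`. Intervals ending in death (codes 2 and 3) are not valid 0/1 event codes and are treated as missing, which is consistent with the "30 observations deleted due to missingness" reported in the outputs.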



In [10]:
%%R
# Load bladder1 dataset
data(bladder1)
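
Before modelling, it can help to confirm the structure of `bladder1`. The sketch below uses only base R and `survival`. The names `bl` and `recurrence` are hypothetical and not part of the dataset or the analysis that follows, and the status coding in the comments follows the dataset's help page (`?bladder1`).

```r
library(survival)

# Work on a local copy (hypothetical name `bl`) so the packaged data stay untouched
bl <- survival::bladder1

n_rows <- nrow(bl)                    # number of interval records
n_patients <- length(unique(bl$id))   # number of distinct patients
status_counts <- table(bl$status)     # 0 = censored, 1 = recurrence, 2/3 = death

# Hypothetical binary indicator that treats deaths as censoring
# for the recurrence process (an alternative to dropping those rows)
bl$recurrence <- as.integer(bl$status == 1)

print(c(rows = n_rows, patients = n_patients))
print(status_counts)
```

Whether deaths should be censored, dropped, or modelled as a competing or terminal event is a design choice; the tutorial's models effectively drop those intervals.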

### Fit a Standard Cox PH Model (No Frailty)


This assumes independence across all observations (ignores clustering by patient).


In [11]:
%%R
# Fit model (using counting-process format for recurrent events)
cox_no_frailty <- coxph(Surv(start, stop, status) ~ treatment + number + size,
                        data = bladder1)

# Summary
summary(cox_no_frailty)
# To account for clustering (robust SEs, but no frailty)
cox_robust <- coxph(Surv(start, stop, status) ~ treatment + number + size + cluster(id),
                    data = bladder1)
summary(cox_robust)  # Similar coefficients, but adjusted SEs

Call:
coxph(formula = Surv(start, stop, status) ~ treatment + number + 
    size, data = bladder1, cluster = id)

  n= 264, number of events= 189 
   (30 observations deleted due to missingness)

                        coef exp(coef) se(coef) robust se      z Pr(>|z|)   
treatmentpyridoxine  0.02070   1.02091  0.17101   0.32214  0.064  0.94877   
treatmentthiotepa   -0.44295   0.64214  0.18659   0.26926 -1.645  0.09995 . 
number               0.18081   1.19819  0.03523   0.06009  3.009  0.00262 **
size                -0.01689   0.98325  0.04342   0.06818 -0.248  0.80438   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                    exp(coef) exp(-coef) lower .95 upper .95
treatmentpyridoxine    1.0209     0.9795    0.5430     1.920
treatmentthiotepa      0.6421     1.5573    0.3788     1.088
number                 1.1982     0.8346    1.0651     1.348
size                   0.9833     1.0170    0.8603     1.124

Concordance= 0.633  (se = 0.028 )
Likelihood 

### Fit an Individual Frailty Model


For illustration, we'll create a unique ID per observation (row) and fit a Gaussian frailty (normal random effects on log-scale). This treats each event as having its own independent frailty, ignoring clustering—useful for overdispersion but not ideal for recurrent data (may lead to convergence issues or poor identifiability).


In [12]:
%%R
# Create unique ID per event/observation
bladder1$unique_id <- 1:nrow(bladder1)
# Fit with Gaussian frailty (per-observation random effect)
indiv_frailty <- coxph(Surv(start, stop, status) ~ treatment + number + size + frailty(unique_id, dist = "gauss"),
                       data = bladder1)
# Summary
summary(indiv_frailty)

Call:
coxph(formula = Surv(start, stop, status) ~ treatment + number + 
    size + frailty(unique_id, dist = "gauss"), data = bladder1)

  n= 264, number of events= 189 
   (30 observations deleted due to missingness)

                          coef     se(coef) se2     Chisq  DF    p      
treatmentpyridoxine        0.13029 0.23585  0.17336   0.31   1.0 5.8e-01
treatmentthiotepa         -0.45674 0.26240  0.19591   3.03   1.0 8.2e-02
number                     0.21536 0.05751  0.04113  14.02   1.0 1.8e-04
size                      -0.00463 0.06302  0.04620   0.01   1.0 9.4e-01
frailty(unique_id, dist =                           177.03 100.6 3.9e-06

                    exp(coef) exp(-coef) lower .95 upper .95
treatmentpyridoxine    1.1392     0.8778    0.7175     1.809
treatmentthiotepa      0.6333     1.5789    0.3787     1.059
number                 1.2403     0.8063    1.1081     1.388
size                   0.9954     1.0046    0.8797     1.126

Iterations: 5 outer, 41 Newton-Raphs

In [13]:
%%R
# Test for frailty significance (LRT vs. no-frailty model)
anova(cox_no_frailty, indiv_frailty)
# If p < 0.05, evidence of individual-level heterogeneity.

Analysis of Deviance Table
 Cox model: response is  Surv(start, stop, status)
 Model 1: ~ treatment + number + size
 Model 2: ~ treatment + number + size + frailty(unique_id, dist = "gauss")
   loglik  Chisq     Df Pr(>|Chi|)    
1 -782.91                             
2 -646.95 271.91 98.735  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


### Fit a Shared Frailty Model


This is the appropriate model for recurrent events: frailty shared across events for each patient (cluster = id), using gamma distribution (default).


In [14]:
%%R
# Fit shared frailty (gamma distribution)
shared_frailty_gamma <- coxph(Surv(start, stop, status) ~ treatment + number + size + frailty(id),
                              data = bladder1)
# Summary
summary(shared_frailty_gamma)

Call:
coxph(formula = Surv(start, stop, status) ~ treatment + number + 
    size + frailty(id), data = bladder1)

  n= 264, number of events= 189 
   (30 observations deleted due to missingness)

                    coef     se(coef) se2     Chisq  DF    p      
treatmentpyridoxine  0.17381 0.35199  0.17163   0.24  1.00 6.2e-01
treatmentthiotepa   -0.43348 0.34980  0.21075   1.54  1.00 2.2e-01
number               0.26540 0.08817  0.04647   9.06  1.00 2.6e-03
size                 0.01773 0.09156  0.04946   0.04  1.00 8.5e-01
frailty(id)                                   129.66 60.93 7.1e-07

                    exp(coef) exp(-coef) lower .95 upper .95
treatmentpyridoxine    1.1898     0.8405    0.5969     2.372
treatmentthiotepa      0.6483     1.5426    0.3266     1.287
number                 1.3040     0.7669    1.0970     1.550
size                   1.0179     0.9824    0.8507     1.218

Iterations: 6 outer, 29 Newton-Raphson
     Variance of random effect= 1.166814   I-likelihood 
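
For a shared gamma frailty, the variance $\theta$ maps to Kendall's $\tau$ between two event times in the same cluster through $\tau = \theta / (\theta + 2)$. The sketch below applies that relationship to the variance printed in the summary. `gamma_frailty_tau` is a hypothetical helper name, and the formula holds for the gamma distribution only, so it is best read as a rough summary of within-patient dependence.

```r
# Kendall's tau implied by a shared gamma frailty with variance theta
gamma_frailty_tau <- function(theta) {
  stopifnot(is.numeric(theta), all(theta >= 0))
  theta / (theta + 2)
}

theta_hat <- 1.166814   # "Variance of random effect" from the summary output
tau_hat <- gamma_frailty_tau(theta_hat)
print(round(tau_hat, 3))
```

A value of $\tau$ near 0 would indicate almost no within-patient dependence, while values approaching 1 indicate very strong dependence.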


Alternative: Gaussian frailty (log-normal):



In [None]:
%%R
# Alternative: Gaussian frailty (log-normal)
shared_frailty_gauss <- coxph(Surv(start, stop, status) ~ treatment + number + size + frailty(id, dist = "gauss"),
                              data = bladder1)
summary(shared_frailty_gauss)  # Similar, but variance on log-scale.

In [None]:
%%R
# Plot the survival curve estimated from the shared gamma frailty model
plot(survfit(shared_frailty_gamma), xlab = "Time", ylab = "Survival")

### Advanced Options with frailtypack


For more flexibility (e.g., parametric baseline, joint models), use `frailtypack`. Install: `install.packages("frailtypack")`.



In [15]:
%%R
# Identify problematic rows
invalid_intervals <- bladder1[bladder1$stop <= bladder1$start, ]
invalid_status <- bladder1[!bladder1$status %in% c(0, 1), ]

# Remove invalid rows
bladder1_clean <- bladder1 %>%
  filter(stop > start, status %in% c(0, 1))

# Create gap time (length of each at-risk interval), as used by gap-time models such as PWP-GT
bladder1_clean <- bladder1_clean %>%
  group_by(id) %>%
  mutate(gaptime = stop - start) %>%
  ungroup()

# Keep at most the first 4 intervals per patient (a common truncation for PWP models)
bladder_trunc <- bladder1_clean[bladder1_clean$enum <= 4, ]

# Verify
head(bladder_trunc)

# A tibble: 6 × 13
     id treatment number  size recur start  stop status rtumor rsize  enum
  <int> <fct>      <int> <int> <int> <int> <int>  <dbl> <chr>  <chr> <dbl>
1     3 placebo        2     1     0     0     4      0 .      .         1
2     4 placebo        1     1     0     0     7      0 .      .         1
3     6 placebo        4     1     1     0     6      1 1      1         1
4     7 placebo        1     1     0     0    14      0 .      .         1
5     8 placebo        1     1     0     0    18      0 .      .         1
6     9 placebo        1     3     1     0     5      1 2      4         1
# ℹ 2 more variables: unique_id <int>, gaptime <int>


In [16]:
%%R
library(frailtypack)

# Shared frailty with splines baseline
shared_pack <- frailtyPenal(
  Surv(start, stop, status) ~ cluster(id) + treatment + number + size,
  data = bladder1_clean,
  n.knots = 8,
  kappa = 1e5,
  hazard = "Splines"
)

# Summary
print(shared_pack)


Be patient. The program is computing ... 
The program took 0.09 seconds 
Call:
frailtyPenal(formula = Surv(start, stop, status) ~ cluster(id) + 
    treatment + number + size, data = bladder1_clean, n.knots = 8, 
    kappa = 1e+05, hazard = "Splines")

      left truncated structure used

  Shared Gamma Frailty model parameter estimates  
  using a Penalized Likelihood on the hazard function 

                          coef exp(coef) SE coef (H) SE coef (HIH)         z
treatmentpyridoxine  0.0229803  1.023246   0.1714367     0.1714367  0.134045
treatmentthiotepa   -0.4313980  0.649600   0.1875060     0.1875060 -2.300715
number               0.1840139  1.202033   0.0355844     0.0355844  5.171195
size                -0.0141934  0.985907   0.0433354     0.0433354 -0.327523
                             p
treatmentpyridoxine 8.9337e-01
treatmentthiotepa   2.1408e-02
number              2.3260e-07
size                7.4327e-01

            chisq df global p
treatment 6.42637  2   0.0402




- **No Frailty**: Assumes independence; may underestimate SEs.

- **Individual Frailty**: Captures per-event variation but ignores correlation; variance $\theta$ small if little heterogeneity.

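- **Shared Frailty**: Accounts for patient-level correlation; $\theta > 0$ suggests that unobserved patient factors affect recurrence risk. In the gamma frailty fit, thiotepa (`treatment`) is associated with a roughly 35% lower hazard (exp(-0.43) ≈ 0.65), although this is not statistically significant (p = 0.22), and each additional initial tumor raises the hazard by about 30% (exp(0.27) ≈ 1.30).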
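
To make the effect sizes concrete, coefficients can be converted to hazard ratios and percentage changes in the hazard. The values below are copied from the shared gamma frailty summary; the object names are illustrative.

```r
# Coefficients copied from the shared gamma frailty fit
coefs <- c(thiotepa = -0.43348, number = 0.26540)

hazard_ratio <- exp(coefs)                   # multiplicative effect on the hazard
percent_change <- 100 * (hazard_ratio - 1)   # negative values mean a lower hazard

print(round(cbind(hazard_ratio, percent_change), 2))
```

These are conditional hazard ratios: they compare two patients with the same frailty, not two randomly chosen patients from the population.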


### Using `coxme` for Mixed-Effects Cox Models

In [17]:
%%R
library(coxme)
# Fit mixed-effects Cox with frailty (random intercept by id)
coxme_frailty <- coxme(Surv(start, stop, status) ~ treatment + number + size + (1 | id),
                       data = bladder1)
summary(coxme_frailty)

Mixed effects coxme model
 Formula: Surv(start, stop, status) ~ treatment + number + size + (1 |      id) 
    Data: bladder1 

  events, n = 189, 264 (30 observations deleted due to missingness)

Random effects:
  group  variable       sd variance
1    id Intercept 1.058302 1.120003
                   Chisq    df p    AIC    BIC
Integrated loglik  93.62  5.00 0  83.62  67.41
 Penalized loglik 238.64 57.24 0 124.17 -61.39

Fixed effects:
                        coef exp(coef) se(coef)     z       p
treatmentpyridoxine -0.04029   0.96051  0.34640 -0.12 0.90740
treatmentthiotepa   -0.44220   0.64262  0.34810 -1.27 0.20397
number               0.24832   1.28187  0.07923  3.13 0.00172
size                 0.01842   1.01860  0.08563  0.22 0.82963


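**Interpretation**: The fixed effects are similar to those from the standard Cox model, but the standard errors now account for frailty. The random-effect variance (here 1.12, i.e., a standard deviation of 1.06 on the log-hazard scale) measures patient-level heterogeneity; a larger variance means stronger clustering.

**Convergence**: `coxme` may take longer to fit than `coxph`; if convergence issues arise, adjust the fitting settings through `coxme.control()` (for example, a larger `iter.max`).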
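
Because `coxme` places the random intercept on the log-hazard scale, its standard deviation can be translated into hazard multipliers. The sketch below uses the standard deviation reported in the `coxme` output and assumes a normally distributed random effect; the object names are illustrative.

```r
sd_frailty <- 1.058302        # random-intercept SD from the coxme output
var_frailty <- sd_frailty^2   # should agree with the reported variance

# Hazard relative to an average patient at -1, 0, and +1 SD of the random effect
multipliers <- exp(c(minus_1sd = -1, average = 0, plus_1sd = 1) * sd_frailty)
print(round(multipliers, 2))
```

Read this way, a patient one standard deviation above the average has close to three times the recurrence hazard of an average patient with the same covariates.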



#### Test for Frailty Significance

In [18]:
%%R
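# Manual likelihood ratio test against the no-frailty Cox model.
# Note: logLik() on a coxme fit returns the penalized log-likelihood (hence the
# "Penalized" label in the output); the integrated value stored in
# coxme_frailty$loglik is generally the more appropriate basis for this test.
loglik_null <- logLik(cox_no_frailty)[1]
loglik_frail <- logLik(coxme_frailty)[1]
lrt_stat <- 2 * (loglik_frail - loglik_null)
# Chi-squared with 1 df; halve the p-value for the 50:50 mixture of chi-squared(0) and chi-squared(1)
p_value <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
p_value
# Example: if lrt_stat is about 10, p is about 0.002, indicating significant frailty.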

   Penalized 
5.510871e-48 
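
When testing $\theta = 0$, the null value lies on the boundary of the parameter space, so the likelihood ratio statistic is usually referred to a 50:50 mixture of $\chi^2_0$ and $\chi^2_1$, which halves the naive $\chi^2_1$ p-value. A minimal sketch follows, with `lrt_mixture_p` as a hypothetical helper name. The statistic passed in should come from comparable (marginal or integrated) log-likelihoods.

```r
# P-value for a variance-component LRT under a 50:50 chi-squared(0)/chi-squared(1) mixture
lrt_mixture_p <- function(lrt_stat) {
  stopifnot(is.numeric(lrt_stat))
  lrt_stat <- pmax(lrt_stat, 0)   # small negative values can arise numerically; treat as 0
  ifelse(lrt_stat == 0, 1, 0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE))
}

# Illustrative statistic of 10 (not taken from the fitted models)
naive_p <- pchisq(10, df = 1, lower.tail = FALSE)
mixture_p <- lrt_mixture_p(10)
print(c(naive = naive_p, mixture = mixture_p))
```

The correction matters most for borderline statistics; for very large statistics both p-values lead to the same conclusion.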


### Summary


Frailty models extend traditional survival analysis methods, such as the Cox proportional hazards model, by incorporating a random effect (frailty) to account for unobserved heterogeneity and clustering in time-to-event data. This heterogeneity arises from unmeasured factors that influence the hazard rate, leading to biased estimates if ignored. Key types include individual frailty models, which assign unique frailties to each subject or observation to capture personal-level variation (though less common and harder to identify in non-clustered data), and shared frailty models, which apply a common frailty within clusters (e.g., families) or across recurrent events for the same individual, inducing dependence and better handling correlated outcomes like repeated tumor recurrences.

In the R tutorial using the `bladder1` dataset from the `survival` package, we analyzed recurrent bladder cancer events in 118 patients. The dataset, in counting-process format, includes covariates such as treatment arm (`treatment`), initial tumor number, and size. We fitted a standard Cox model (ignoring clustering), an individual frailty model (a Gaussian frailty per observation, for illustration), and a shared frailty model (gamma or Gaussian, clustered by patient ID). The shared frailty fits showed a substantial frailty variance, indicating unobserved patient-specific factors. A larger initial number of tumors was associated with a higher recurrence hazard, while the estimated hazard reduction for thiotepa was not statistically significant once patient-level frailty was included. Advanced fitting was demonstrated with `frailtypack` using penalized likelihood and a spline baseline hazard.


Frailty models are essential for robust survival analysis in the presence of unobserved heterogeneity or dependent events, preventing underestimation of variability and improving model fit, as seen in the bladder cancer example where shared frailty revealed clustering effects. They enable more accurate inference in fields like medicine, epidemiology, and reliability engineering, though assumptions about frailty distribution (e.g., gamma for constant dependence) must be carefully chosen and tested. In practice, shared frailty is particularly valuable for recurrent or clustered data, while individual frailty suits broader heterogeneity exploration. Overall, integrating frailty enhances interpretability, such as through selection effects where frailer individuals exit early, leaving robust survivors, and supports better decision-making in risk assessment.


## Resources


- **Book: Frailty Models in Survival Analysis** by Andreas Wienke (2010). This comprehensive text covers univariate and multivariate frailty models, with emphasis on real-data applications and statistical techniques.
  
- **Tutorial Paper: A Tutorial on Frailty Models** by Theodor A. Balan and Hein Putter (2020). An accessible guide illustrating frailty concepts, selection effects, and implementation for survival outcomes.

- **Book: Applied Survival Analysis Using R** by Dirk F. Moore (2016). Focuses on practical survival analysis in R, including frailty models, with code examples and integration of packages like `survival`.

- **R Package Documentation**:
  - `survival` package help page for frailty terms (available via `?frailty` after loading `survival` in R).
  - `frailtypack` package on CRAN: Provides advanced tools for frailty models, including penalized and joint models (https://cran.r-project.org/web/packages/frailtypack/index.html).
  - `coxme` package for mixed-effects Cox models with frailty (https://cran.r-project.org/web/packages/coxme/index.html).



