Idea: Attempting to Recreate a Dataset using only the Results of a Cox-PH Model #17

swaheera · 2023-02-16T04:12:38Z

Suppose I fit a Survival Cox-PH Regression Model in R and get the following results:

Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)

             coef exp(coef)  se(coef)      z        p
age      0.011067  1.011128  0.009267  1.194 0.232416
sex     -0.552612  0.575445  0.167739 -3.294 0.000986
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05

Likelihood ratio test=30.5  on 3 df, p=1.083e-06
n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Based on these results, I can infer information such as:

The number of observations
The number of events
The number of variables
The estimate for the effect of each variable
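
As a quick check of what the summary exposes, here is a minimal sketch that refits the model above on the `lung` data from the `survival` package and pulls the same quantities out of the fitted object. The variable names (`n_obs`, `n_events`, `betas`, `ses`) are illustrative only.

```r
library(survival)  # recommended package that ships with standard R installs

# Refit the model shown above on the built-in lung data
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

n_obs    <- fit$n                  # observations used in the fit
n_events <- fit$nevent             # number of events
betas    <- coef(fit)              # estimated log hazard ratios
ses      <- sqrt(diag(vcov(fit)))  # standard errors of the coefficients

print(c(n = n_obs, events = n_events, censored = n_obs - n_events))
print(round(cbind(coef = betas, se = ses), 4))
```

Note that the summary gives the event count directly, so the number of censored observations is the difference between n and the number of events.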

My Question: Given this information, is it possible to simulate the covariate and response information for n = 227 such observations - such that if a similar Cox-PH model were fit to these newly simulated 227 observations, the resulting regression coefficients would be approximately equal to the original regression coefficients? Can I try to "guess" (and recreate) a plausible set of observations that might have been observed, based on the regression model coefficients?

For example, I know that if I were to "fix" the covariate information for a group of n = 227 arbitrarily created patients, I could then simulate their survival times (e.g. https://cran.r-project.org/web/packages/simsurv/index.html) - however, if I were to then fit a Cox-PH model to these observations, the estimated coefficients would not necessarily be close to the original model coefficients.
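
A minimal sketch of that experiment in base R plus `survival` (rather than `simsurv`), under loudly stated assumptions: the covariate distributions, the exponential baseline rate `lambda0`, and the censoring rule are all made up for illustration and are not taken from the real data. Survival times are drawn by inverse-transform sampling from a proportional hazards model that uses the reported coefficients.

```r
library(survival)

set.seed(1)
n      <- 227
target <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)

# Hypothetical covariate distributions (assumptions, not the real data)
X <- data.frame(
  age     = round(rnorm(n, mean = 60, sd = 10)),
  sex     = sample(1:2, n, replace = TRUE),
  ph.ecog = sample(0:3, n, replace = TRUE, prob = c(0.3, 0.5, 0.15, 0.05))
)

# Inverse-transform sampling with an exponential baseline hazard:
# T = -log(U) / (lambda0 * exp(x' beta)); lambda0 is an arbitrary assumption
lambda0 <- 0.001
lp      <- as.vector(as.matrix(X) %*% target)
time    <- -log(runif(n)) / (lambda0 * exp(lp))

# Assumed censoring rule: stop follow-up at the 164th smallest time,
# so exactly 164 events remain and the other 63 subjects are censored
cutoff <- sort(time)[164]
sim <- cbind(X, time = pmin(time, cutoff),
             status = as.integer(time <= cutoff))

fit_sim <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = sim)
print(round(cbind(target = target, estimate = coef(fit_sim)), 3))
```

With only 227 subjects the estimates scatter around the target values by roughly their standard errors, so a single simulated dataset will generally not reproduce the original coefficients exactly.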

In general, is this possible to do? Only given the above model summary, could I try and somehow generate the original dataset that this model was trained on?

Thanks!

Note: I realize there are probably an infinite number of n = 227 samples that can be randomly simulated such that a Cox-PH Model produces the same regression coefficient estimates as above.


swaheera · 2023-02-16T04:26:03Z

Possible Pseudocode:

Use an algorithm like the Genetic Algorithm (https://cran.r-project.org/web/packages/GA/vignettes/GA.html)
Step 0 : Set some plausible ranges for the covariates (e.g. age between 20 - 80, sex either male or female, ph.ecog between 0 and 5 https://ecog-acrin.org/resources/ecog-performance-status/) . Specify distributions for each of these variables (e.g. age ~ lognormal, sex ~ binom, ph.ecog ~ multinomial) and specify parameters for each of these distributions (e.g. lognormal(mu1, sigma1), binom(p1), etc)
Step 1: Set some plausible range and distribution for the Survival Times (e.g. Exponential Distribution with Lambda = 5)
Step 2: Simulate 227 random observations and randomly censor 63 of them, leaving 164 events as in the original model
Step 3: Fit a Cox PH model to this simulated data and record performance measurements such as the coefficients, standard errors, p-values, likelihood ratio test, etc
Step 4: Calculate the squared difference between these performance measurements from the model with the simulated dataset vs the performance measurements from the model with the actual dataset - and then sum all these squared differences.
Step 5: Repeat Step 1 - Step 4 many times (e.g. 1000 times)
Step 6: Repeat Step 0 - 5 many times
Step 7: Use the Genetic Algorithm to find out which covariate ranges and covariate distributions bring you closest to the performance measurements of the original model. In other words, the "summed squared difference" is the Objective Function you will be attempting to optimize using the Genetic Algorithm.
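
The objective function from Steps 2 - 4 could be sketched roughly as follows. This is only an illustration under assumptions: the parameter vector `par`, the helper names `simulate_once` and `loss`, and the choice of distributions are hypothetical. It also departs from Step 1 in one respect, which is that survival times are generated from the target coefficients instead of independently of the covariates (independent times would push every fitted coefficient toward zero).

```r
library(survival)

target_coef <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)
target_se   <- c(age = 0.009267, sex = 0.167739, ph.ecog = 0.113577)

# Hypothetical parameters: par = c(mean_age, sd_age, p_sex2, log_baseline_rate)
simulate_once <- function(par, n = 227, n_events = 164) {
  age     <- rnorm(n, par[1], par[2])
  sex     <- 1 + rbinom(n, 1, par[3])
  ph.ecog <- sample(0:3, n, replace = TRUE)
  lp      <- as.vector(cbind(age, sex, ph.ecog) %*% target_coef)
  time    <- rexp(n, rate = exp(par[4] + lp))  # exponential PH model
  cutoff  <- sort(time)[n_events]              # keep exactly n_events events
  data.frame(age, sex, ph.ecog,
             time = pmin(time, cutoff),
             status = as.integer(time <= cutoff))
}

# Summed squared differences in coefficients and standard errors,
# averaged over several simulated datasets (Steps 3 - 5)
loss <- function(par, reps = 20) {
  one <- function() {
    fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog,
                 data = simulate_once(par))
    sum((coef(fit) - target_coef)^2) +
      sum((sqrt(diag(vcov(fit))) - target_se)^2)
  }
  mean(replicate(reps, one()))
}

set.seed(42)
loss_a <- loss(c(62, 9, 0.4, -7))    # ages spread around 62
loss_b <- loss(c(22, 1.5, 0.4, -7))  # ages concentrated around 20-25
print(c(loss_a = loss_a, loss_b = loss_b))
```

A function like `loss` could then be handed to an optimizer. The GA package maximizes its fitness function, so the negated loss would be passed there.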

I know this is a very abstract and roundabout way that might not have any mathematical validity - but I was curious to know if such an application might be logical?

Note: I realize that it's entirely possible that I simulate a dataset in which all the ages are concentrated around 20-25 years old when in fact the participants in the real dataset were senior citizens - but, combined with the other simulated variables, the Cox-PH model fit to this simulated data might coincidentally reproduce the results of the original model - thus rendering this approach useless.
