Idea: Attempting to Recreate a Dataset using only the Results of a Cox-PH Model #17

Open
swaheera opened this issue Feb 16, 2023 · 1 comment


swaheera commented Feb 16, 2023

Suppose I fit a Survival Cox-PH Regression Model in R and get the following results:

Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)

             coef exp(coef)  se(coef)      z        p
age      0.011067  1.011128  0.009267  1.194 0.232416
sex     -0.552612  0.575445  0.167739 -3.294 0.000986
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05

Likelihood ratio test=30.5  on 3 df, p=1.083e-06
n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Based on these results, I can infer information such as:

  • The number of observations
  • The number of events
  • The number of variables
  • The estimate for the effect of each variable
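
As a rough sketch, the items listed above can also be read off the fitted object directly. This assumes the model was fit to the `lung` data that ships with the `survival` package, as the `Call:` line suggests; the variable names `n_obs`, `n_events`, `beta_hat` and `se_hat` are just illustrative.

```r
library(survival)  # recommended package, normally installed with R

# Refit the model from the question on the built-in lung data
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Quantities that the printed summary exposes
n_obs    <- fit$n                  # observations used in the fit
n_events <- fit$nevent             # observed events
beta_hat <- coef(fit)              # estimated log hazard ratios
se_hat   <- sqrt(diag(vcov(fit)))  # standard errors of the coefficients

print(c(n = n_obs, events = n_events))
print(round(cbind(coef = beta_hat, se = se_hat), 4))
```

Note that the summary says nothing directly about the covariate distributions or the baseline hazard, which is what makes the reconstruction question hard.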

My Question: Given this information, is it possible to simulate covariate and response data for n = 227 observations such that, if a similar Cox-PH model were fit to the simulated observations, the resulting regression coefficients would be approximately equal to the original ones? In other words, can I "guess" (and recreate) a plausible set of observations that might have produced these regression coefficients?

For example, I know that if I were to "fix" the covariate information for a group of n = 227 arbitrarily created patients, I could then simulate their survival times (e.g. with https://cran.r-project.org/web/packages/simsurv/index.html). However, if I were to then fit a Cox-PH model to these observations, the model coefficients would not necessarily be close to the original model coefficients.
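
A minimal sketch of that idea, using plain inverse-transform sampling under an exponential baseline hazard rather than the `simsurv` API. The covariate distributions, the baseline rate `lambda0` and the censoring rate are all made-up assumptions; only `n` and `beta` come from the summary above.

```r
library(survival)
set.seed(1)

n    <- 227
beta <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)  # from the summary

# Arbitrary, made-up covariates (assumptions, not the real data)
X <- cbind(
  age     = round(rnorm(n, mean = 62, sd = 9)),
  sex     = rbinom(n, 1, 0.4) + 1,   # coded 1/2
  ph.ecog = sample(0:3, n, replace = TRUE, prob = c(0.3, 0.5, 0.15, 0.05))
)

# Inverse-transform sampling with an exponential baseline hazard:
# T = -log(U) / (lambda0 * exp(X %*% beta))
lambda0 <- 0.001                     # assumed baseline rate
lp      <- drop(X %*% beta)
t_event <- -log(runif(n)) / (lambda0 * exp(lp))

# Independent exponential censoring (the rate is an assumption)
t_cens <- rexp(n, rate = 0.0008)
sim <- data.frame(X,
                  time   = pmin(t_event, t_cens),
                  status = as.integer(t_event <= t_cens))

# Refit and compare with the target coefficients
fit_sim <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = sim)
print(round(cbind(target = beta, simulated = coef(fit_sim)), 3))
```

With only 227 observations the refitted coefficients differ from the targets by sampling noise, and the number of events is not forced to equal 164.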

In general, is this possible to do? Given only the above model summary, could I try to somehow regenerate the original dataset that this model was fit to?

Thanks!

Note: I realize there are probably an infinite number of n = 227 samples that can be randomly simulated such that a Cox-PH Model produces the same regression coefficient estimates as above.


swaheera commented Feb 16, 2023

Possible Pseudocode:

  • Use an algorithm like the Genetic Algorithm (https://cran.r-project.org/web/packages/GA/vignettes/GA.html)
  • Step 0: Set some plausible ranges for the covariates (e.g. age between 20 and 80, sex either male or female, ph.ecog between 0 and 5; see https://ecog-acrin.org/resources/ecog-performance-status/). Specify a distribution for each of these variables (e.g. age ~ lognormal, sex ~ binomial, ph.ecog ~ multinomial) and parameters for each of these distributions (e.g. lognormal(mu1, sigma1), binom(p1), etc.)
  • Step 1: Set some plausible range and distribution for the Survival Times (e.g. Exponential Distribution with Lambda = 5)
  • Step 2: Simulate 227 random observations and randomly censor 63 of them, so that 164 events remain (matching the original summary)
  • Step 3: Fit a Cox PH model to this simulated data and record performance measurements such as the coefficients, standard errors, p-values, likelihood ratio test, etc
  • Step 4: Calculate the squared difference between these performance measurements from the model with the simulated dataset vs the performance measurements from the model with the actual dataset - and then sum all these squared differences.
  • Step 5: Repeat Step 1 - Step 4 many times (e.g. 1000 times)
  • Step 6: Repeat Step 0 - 5 many times
  • Step 7: Use the Genetic Algorithm to find out which covariate ranges and covariate distributions bring you closest to the performance measurements of the original model. In other words, the "summed squared difference" is the Objective Function you will be attempting to optimize using the Genetic Algorithm.
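
A rough sketch of the objective function described in Steps 0 to 4. The helper `simulate_data()` and the parameter vector `theta` (mean age, sd of age, P(sex = 2), exponential rate) are hypothetical choices for illustration, age is drawn from a normal rather than a lognormal for brevity, and only the coefficients are compared, not the standard errors or the likelihood ratio test.

```r
library(survival)

# Targets taken from the printed summary in the question
target <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)

# Hypothetical helper: theta = (mean age, sd age, P(sex = 2), exponential rate)
simulate_data <- function(theta, n = 227, n_events = 164) {
  age     <- rnorm(n, theta[1], theta[2])
  sex     <- rbinom(n, 1, theta[3]) + 1
  ph.ecog <- sample(0:3, n, replace = TRUE)
  time    <- rexp(n, rate = theta[4])   # Step 1: times drawn without covariates
  status  <- rep(0L, n)
  status[sample(n, n_events)] <- 1L     # Step 2: 164 events, 63 censored
  data.frame(age, sex, ph.ecog, time, status)
}

# Steps 3-4: fit the model and sum squared differences from the targets
objective <- function(theta, seed = 1) {
  set.seed(seed)                        # same random draws for every theta
  fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog,
               data = simulate_data(theta))
  sum((coef(fit) - target)^2)
}

theta0 <- c(60, 10, 0.4, 0.003)         # an arbitrary starting guess
print(objective(theta0))
```

The negated objective could then be handed to an optimiser such as `GA::ga`, which maximises its fitness function. One caveat with the steps as written: the survival times here are drawn independently of the covariates, so the simulated coefficients scatter around zero and a close match would only occur by chance.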

I know this is a very abstract and roundabout way that might not have any mathematical validity - but I was curious to know if such an application might be logical?

Note: I realize that it's entirely possible to simulate a dataset in which all the ages are concentrated around 20-25 years when in fact the real participants were senior citizens. Depending on the other simulated variables, a Cox-PH model fit to this simulated data might coincidentally reproduce the results of the original model, which would render this approach useless.
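
This concern can be checked directly. The Cox partial likelihood depends only on covariate differences within each risk set and on the ordering of the event times, so shifting every age by a constant or applying a monotone transformation to time leaves the coefficients unchanged. A small sketch on the `lung` data (the shift of 40 years and the squaring of time are arbitrary choices):

```r
library(survival)

fit_orig <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Shift every age down by 40 years and stretch time monotonically
shifted      <- lung
shifted$age  <- shifted$age - 40
shifted$time <- shifted$time^2   # ordering of event times is unchanged

fit_shift <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = shifted)

# Compare the two sets of coefficients up to numerical tolerance
print(all.equal(coef(fit_orig), coef(fit_shift), tolerance = 1e-6))
```

So the coefficients alone cannot distinguish a cohort of 20-year-olds from a cohort of 60-year-olds with the same spread in age, nor can they recover the scale of the survival times.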
