# Tutorial 3: Generating static survival-analysis data

__Survival analysis__ is a family of statistical methods for analyzing the time it takes for an event of interest to occur. The event of interest is often called the "failure" event. Examples of failure events include death, disease onset, or equipment failure. When the event is not observed before follow-up ends, the observation is "censored": only a lower bound on the event time is known.

Static survival analysis, as used in this tutorial, refers to survival data in which each subject is described by a single row of fixed covariates, together with a time-to-event value and an event indicator. The data is "static" because the covariates are recorded once rather than measured repeatedly during follow-up, as they would be in time-series survival data. Follow-up durations can still differ between subjects, as they do in the dataset used below.

In static survival analysis, the main goal is to estimate the probability of the failure event occurring at different time points.
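
A common nonparametric estimate of that probability is the Kaplan-Meier (KM) survival curve. The following is a minimal pure-Python sketch on toy data. The function name `kaplan_meier` and the sample values are illustrative and are not part of `synthcity`.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct observed event time, in increasing order."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the fraction of at-risk subjects that survive past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # events and censored subjects both leave the risk set
    return curve


# Toy data: 1 = event observed, 0 = right-censored
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, events):
    print(f"S({t}) = {s:.3f}")
```

Censored subjects never lower the curve directly. They only shrink the risk set, which is why the drop at `t = 5` is larger than the drop at `t = 2`.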

`synthcity` includes models targeting specific tabular modalities, one of which is survival-analysis data. The general-purpose models can also be used for this task.

One significant difference from regular tabular data is in the benchmarks: downstream model performance is evaluated using the __C-Index__ and __Brier score__ metrics.
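
To make the two metrics concrete, here is a simplified sketch of each. It is not the implementation `synthcity` uses: `concordance_index` is a plain Harrell-style C-index, and `naive_brier_score` skips subjects censored before the horizon instead of applying inverse-probability-of-censoring weights. All names and toy values are assumptions for illustration.

```python
def concordance_index(durations, events, risk_scores):
    """Fraction of comparable pairs in which the earlier failure has the higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable


def naive_brier_score(durations, events, surv_prob_at_t, t):
    """Mean squared error of predicted survival at horizon t (no censoring weights)."""
    errors = []
    for d, e, s in zip(durations, events, surv_prob_at_t):
        if d > t:
            errors.append((1.0 - s) ** 2)  # subject is still event-free at t
        elif e == 1:
            errors.append((0.0 - s) ** 2)  # event happened at or before t
        # censored before t: status at t is unknown, so it is skipped here
    return sum(errors) / len(errors)


durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.7, 0.4, 0.5, 0.1]           # higher = expected to fail sooner
surv_at_4 = [0.1, 0.2, 0.6, 0.7, 0.9]      # predicted P(T > 4) per subject

print(concordance_index(durations, events, risk))
print(naive_brier_score(durations, events, surv_at_4, t=4))
```

A C-index of 1.0 means perfect ranking and 0.5 is no better than chance. For the Brier score, lower is better.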

The main requirement for survival-analysis generators is that the data is wrapped in a `SurvivalAnalysisDataLoader`.

In [None]:
# stdlib
import sys
import warnings

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import SurvivalAnalysisDataLoader

log.add(sink=sys.stderr, level="INFO")
warnings.filterwarnings("ignore")

In [None]:
# third party
from pycox import datasets

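# Load the gbsg survival dataset shipped with pycox
df = datasets.gbsg.read_df()
# Keep only the rows with a strictly positive time-to-event
df = df[df["duration"] > 0]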

df

In [None]:
loader = SurvivalAnalysisDataLoader(
    df,
    target_column="event",
    time_to_event_column="duration",
)
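
Before wrapping a new dataset in the loader, it can help to check that the frame has the expected shape: a binary event column and a positive duration column, mirroring the `duration > 0` filter applied above. The helper below is a hypothetical sketch written with plain pandas. `check_survival_frame` is not a `synthcity` function, and the exact checks the loader performs may differ.

```python
import pandas as pd


def check_survival_frame(df, target_column="event", time_to_event_column="duration"):
    """Return a list of problems found in a survival-analysis frame (empty if none)."""
    problems = []
    for col in (target_column, time_to_event_column):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems  # the remaining checks need both columns
    if df[[target_column, time_to_event_column]].isna().any().any():
        problems.append("event or duration contains missing values")
    if not set(df[target_column].dropna().unique()) <= {0, 1}:
        problems.append(f"{target_column} must contain only 0 (censored) and 1 (event)")
    if (df[time_to_event_column] <= 0).any():
        problems.append(f"{time_to_event_column} must be strictly positive")
    return problems


# Toy frame with one invalid duration (0.0); the values are illustrative only
toy = pd.DataFrame(
    {"x0": [0.1, 0.5, 0.9], "duration": [12.0, 0.0, 30.5], "event": [1, 0, 1]}
)
print(check_survival_frame(toy))
```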

## List the available generative models

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

Plugins(categories=["generic", "survival_analysis"]).list()

## Load and train a generative model

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

# Instantiate the survival-specific GAN with its default hyperparameters
syn_model = Plugins().get("survival_gan")

syn_model.fit(loader)

## Generate new data using the model

In [None]:
syn_model.generate(count=10).dataframe()

## Generate new data using a conditional

We will use the `event` outcome to condition the data generation.

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

syn_model = Plugins().get("survival_gan")

# Use the observed event indicator (1 = event, 0 = censored) as the condition
cond = df["event"]

syn_model.fit(loader, cond=cond)

In [None]:
import numpy as np

count = 10
syn_model.generate(count=count, cond=np.ones(count)).dataframe()
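
The call above requests only rows conditioned on `event = 1`. The condition vector does not have to be constant. The sketch below builds a mixed 0/1 vector, assuming the generator accepts any vector of length `count` whose values were seen during training (this assumption is not verified here). The helper name `make_condition` and the 0.7 ratio are illustrative.

```python
import numpy as np


def make_condition(count, event_fraction):
    """Build a 0/1 condition vector with the requested share of observed events."""
    n_events = int(round(event_fraction * count))
    return np.concatenate([np.ones(n_events), np.zeros(count - n_events)])


count = 10
cond = make_condition(count, event_fraction=0.7)  # 0.7 is an illustrative value
print(cond)

# With the model fitted above, this vector would take the place of np.ones(count):
# syn_model.generate(count=count, cond=cond).dataframe()
```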

## Serialization

In [None]:
# synthcity absolute
from synthcity.utils.serialization import load, load_from_file, save, save_to_file

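# save() serializes the trained model to an in-memory bytes object;
# save_to_file() and load_from_file() work the same way against a path on disk
buff = save(syn_model)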

type(buff)

In [None]:
# Restore the model from the serialized bytes
reloaded = load(buff)

reloaded.name()

## Plot real-synthetic distributions

In addition to the plots produced in the general case, the survival-analysis plots include Kaplan-Meier (KM) curves for the real and synthetic data.

In [None]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader)

plt.show()

## Benchmarking metrics

| **Metric**                                         | **Description**                                                                                                            |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| sanity.data\_mismatch.score                        | Data type mismatches between the real/synthetic features                                                                   |
| sanity.common\_rows\_proportion.score              | Real data copy-paste in the synthetic data                                                                                 |
| sanity.nearest\_syn\_neighbor\_distance.mean       | Computes the `<reduction>(distance)` from the real data to the closest neighbor in the synthetic data                      |
| sanity.close\_values\_probability.score            | The probability of close values between the real and synthetic data.                                                       |
| sanity.distant\_values\_probability.score          | The probability of distant values between the real and synthetic data.                                                     |
| stats.jensenshannon\_dist.marginal                 | The average Jensen-Shannon distance                                                                                        |
| stats.chi\_squared\_test.marginal                  | The one-way chi-square test.                                                                                               |
| stats.feature\_corr.joint                          | The correlation/strength-of-association of features in data-set with both categorical and continuous features              |
| stats.inv\_kl\_divergence.marginal                 | The average inverse of the Kullback–Leibler Divergence metric.                                                             |
| stats.ks\_test.marginal                            | The Kolmogorov-Smirnov test for goodness of fit.                                                                           |
| stats.max\_mean\_discrepancy.joint                 | Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.                |
| stats.prdc.precision                               | Precision between the two manifolds                                                                                        |
| stats.prdc.recall                                  | Recall between the two manifolds                                                                                           |
| stats.prdc.density                                 | Density between the two manifolds                                                                                          |
| stats.prdc.coverage                                | Coverage between the two manifolds                                                                                         |
| stats.alpha\_precision.delta\_precision\_alpha\_OC | Delta precision                                                                                                            |
| stats.alpha\_precision.delta\_coverage\_beta\_OC   | Delta coverage                                                                                                             |
| stats.alpha\_precision.authenticity\_OC            | Authenticity                                                                                                               |
| stats.survival\_km\_distance.optimism              | Kaplan-Meier distance between real-synthetic data                                                                          |
| stats.survival\_km\_distance.abs\_optimism         | Kaplan-Meier metrics absolute distance between real-synthetic data                                                         |
| stats.survival\_km\_distance.sightedness           | Kaplan-Meier metrics distance on the temporal axis                                                                         |
| performance.linear\_model.gt.c\_index              | Train on real, test on the test real data using CoxPH: C-Index                                                             |
| performance.linear\_model.gt.brier\_score          | Train on real, test on the test real data using CoxPH: Brier score                                                         |
| performance.linear\_model.syn\_id.c\_index         | Train on synthetic, test on the train real data using CoxPH: C-Index                                                       |
| performance.linear\_model.syn\_id.brier\_score     | Train on synthetic, test on the train real data using CoxPH: Brier score                                                   |
| performance.linear\_model.syn\_ood.c\_index        | Train on synthetic, test on the test real data using CoxPH: C-Index                                                        |
| performance.linear\_model.syn\_ood.brier\_score    | Train on synthetic, test on the test real data using CoxPH: Brier score                                                    |
| performance.mlp.gt.c\_index                        | Train on real, test on the test real data using NN: C-Index                                                                |
| performance.mlp.gt.brier\_score                    | Train on real, test on the test real data using NN: Brier score                                                            |
| performance.mlp.syn\_id.c\_index                   | Train on synthetic, test on the train real data using NN: C-Index                                                          |
| performance.mlp.syn\_id.brier\_score               | Train on synthetic, test on the train real data using NN: Brier score                                                      |
| performance.mlp.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using NN: C-Index                                                           |
| performance.mlp.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using NN: Brier score                                                       |
| performance.xgb.gt.c\_index                        | Train on real, test on the test real data using XGB: C-Index                                                               |
| performance.xgb.gt.brier\_score                    | Train on real, test on the test real data using XGB: Brier score                                                           |
| performance.xgb.syn\_id.c\_index                   | Train on synthetic, test on the train real data using XGB: C-Index                                                         |
| performance.xgb.syn\_id.brier\_score               | Train on synthetic, test on the train real data using XGB: Brier score                                                     |
| performance.xgb.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using XGB: C-Index                                                          |
| performance.xgb.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using XGB: Brier score                                                      |
| performance.feat\_rank\_distance.corr              | Correlation for the rank distances between the feature importance on real and synthetic data                               |
| performance.feat\_rank\_distance.pvalue            | p-value for the rank distances between the feature importance on real and synthetic data                                   |
| detection.detection\_xgb.mean                      | The average AUROC score for detecting synthetic data using XGBoost.                                                        |
| detection.detection\_mlp.mean                      | The average AUROC score for detecting synthetic data using a NN.                                                           |
| detection.detection\_gmm.mean                      | The average AUROC score for detecting synthetic data using a GMM.                                                          |
| privacy.delta-presence.score                       | The maximum re-identification probability on the real dataset from the synthetic dataset.                                  |
| privacy.k-anonymization.gt                         | The k-anonymity for the real data                                                                                          |
| privacy.k-anonymization.syn                        | The k-anonymity for the synthetic data                                                                                     |
| privacy.k-map.score                                | The minimum value k that satisfies the k-map rule.                                                                         |
| privacy.distinct l-diversity.gt                    | The l-diversity for the real data                                                                                          |
| privacy.distinct l-diversity.syn                   | The l-diversity for the synthetic data                                                                                     |
| privacy.identifiability\_score.score               | The re-identification score on the real dataset from the synthetic dataset.                                                |

## Benchmark the quality of plugins

General-purpose generators can be benchmarked on survival-analysis data alongside the survival-specific ones.

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [(f"test_{model}", model, {}) for model in ["adsgan", "survival_gan", "survae"]],
    loader,
    synthetic_size=1000,
    repeats=2,
    task_type="survival_analysis",
)

In [None]:
Benchmarks.print(score)
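
To compare a single metric across plugins instead of reading the full report, the result can be sliced directly. The sketch below assumes that `score` is a dictionary mapping each test name to a pandas DataFrame indexed by metric name with a `mean` column; check this against the installed `synthcity` version. `compare_metric` is a hypothetical helper, and the numbers in `toy_score` are placeholders, not benchmark results.

```python
import pandas as pd


def compare_metric(score, metric, column="mean"):
    """Collect one metric from every benchmarked plugin into a single Series."""
    return pd.Series(
        {name: frame.loc[metric, column] for name, frame in score.items()},
        name=metric,
    )


metric = "performance.linear_model.syn_ood.c_index"

# Stand-in for the object returned by Benchmarks.evaluate; placeholder values only
toy_score = {
    "test_adsgan": pd.DataFrame({"mean": [0.60]}, index=[metric]),
    "test_survival_gan": pd.DataFrame({"mean": [0.62]}, index=[metric]),
    "test_survae": pd.DataFrame({"mean": [0.58]}, index=[metric]),
}

result = compare_metric(toy_score, metric)
print(result.sort_values(ascending=False))
```

With a real benchmark run, `toy_score` would be replaced by the `score` object computed above.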

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways.

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is to star the repositories! This helps raise awareness of the tools we're building.

### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
