# Introduction to synthetic data workflow

![Standard workflow of generating and evaluating synthetic data with synthcity.](creatives/workflow.png)

The synthcity library captures the entire workflow of synthetic data generation and evaluation. The typical workflow contains the following steps, as illustrated above.

1. **Loading the dataset using a DataLoader**. The DataLoader class provides a consistent interface for loading and storing different types of input data (e.g. tabular, time series, and survival data). The user can also provide metadata to inform downstream algorithms (e.g. specifying the sensitive columns for privacy-preserving algorithms).
2. **Training the generator using a Plugin**. In synthcity, the users instantiate, train, and apply different data generators via the Plugin class. Each Plugin represents a specific data generation algorithm. The generator can be trained using the fit() method of a Plugin.
3. **Generating synthetic data**. After the Plugin is trained, the user can use the generate() method to generate synthetic data. Some plugins also allow for conditional generation.
4. **Evaluating synthetic data**. Synthcity provides a large set of metrics for evaluating the fidelity, utility, and privacy of synthetic data. The Metrics class allows users to perform evaluation.

Synthcity also provides a Benchmark class that wraps all four steps, which is helpful for comparing and evaluating different generators.
After the synthetic data is evaluated, it can then be used in various downstream tasks.

# Case Study 1 - Data Modality
These notebooks are also available on Google Colab. This enables you to run the notebooks without having to set up an environment locally and gives you access to GPUs to run the notebooks on.

[![Run in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JOstMJmhI2wcufyBqZ1iV3YqOdThJ-_U?usp=sharing)

## 1. Introduction
![Categorization of data modalities](creatives/data_modality.png)


"Tabular data" is a general category that encompasses many different data modalities. In this section, we introduce how to categorize these diverse modalities and how synthcity can be used to handle them.

### Single dataset

We start by introducing the most fundamental case where there is a single training dataset (e.g. a single DataFrame in Pandas). We characterize the data modalities by two axes: the observation pattern and the feature type.

The observation pattern describes whether and how the data are collected over time. The three most prominent patterns, all supported by synthcity, are:

1. Static. All features are observed in a snapshot. There is no temporal ordering.
2. Regular time series. Observations are made at regular intervals, t = 1, 2, 3, ... Note that different series may have different numbers of observations.
3. Irregular time series. Observations are made at irregular intervals, t = t1, t2, t3, ... Note that, for different series, the observation times may vary.
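
To make the difference between the regular and irregular patterns concrete, here is a minimal sketch in plain Python. The helper names (`regular_times`, `irregular_times`, `is_regular`) are hypothetical and not part of synthcity; the sketch only illustrates what the observation times of each pattern look like.

```python
import math
import random


def regular_times(num_steps):
    """Regular pattern: observations at t = 1, 2, 3, ..."""
    return list(range(1, num_steps + 1))


def irregular_times(num_steps, seed=0):
    """Irregular pattern: cumulative sum of random positive gaps."""
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(num_steps):
        t += rng.uniform(0.1, 2.0)  # gap to the next observation
        times.append(t)
    return times


def is_regular(times, tol=1e-9):
    """True when all gaps between consecutive observations are equal."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return all(math.isclose(gap, gaps[0], abs_tol=tol) for gap in gaps)


# Different series may have different numbers of observations.
regular_series = [regular_times(n) for n in (5, 3)]
irregular_series = [irregular_times(n, seed=i) for i, n in enumerate((4, 6))]

print([is_regular(times) for times in regular_series])
print([is_regular(times) for times in irregular_series])
```

In both patterns each series carries its own list of observation times; only the spacing of the values differs.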

The feature type describes the domain of individual features. Synthcity supports the following four types. It also supports multivariate cases with a mixture of different feature types.

1. Continuous feature
2. Categorical feature
3. Integer feature
4. Censored feature: survival time and censoring indicator

The combination of observation patterns and feature types gives rise to an array of data modalities. Synthcity supports all combinations.

### Composite dataset

A composite dataset involves multiple sub-datasets. For instance, it may contain datasets collected from different sources or domains (e.g. from different countries). It may also contain both static and time series data. Such composite data are often seen in practice. For example, a patient's medical record may contain both static demographic information and longitudinal follow-up data.

Synthcity can handle the generation of different classes of composite datasets. Currently, it supports (1) multiple static datasets, (2) a static and a regular time series dataset, and (3) a static and an irregular time series dataset.

### Metadata

Very often we have access to metadata that describes the properties of the underlying data. Synthcity can make use of this information to guide the generation and evaluation process. It supports the following types of metadata:

1. sensitive features: indicator of sensitive features that should be protected for privacy.
2. outcome features: indicator of the outcome features that will be used as the target in downstream prediction tasks.
3. domain: information about the data type and allowed value range.



### 1.1 The Task
In this first exercise, we will get used to loading datasets with the library and generating synthetic data from them, whatever the modality of the real data.

## 2. Imports
Let's get the imports out of the way. We import the required standard and third-party libraries and the relevant synthcity modules. We can also set the logging level here, using synthcity's bespoke logger.

In [1]:
# Standard
import sys
import warnings
from pathlib import Path

# 3rd party
import numpy as np
import pandas as pd

# synthcity
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import (GenericDataLoader, SurvivalAnalysisDataLoader, TimeSeriesDataLoader, TimeSeriesSurvivalDataLoader)

# Configure warnings and logging
warnings.filterwarnings("ignore")

# Set the level for the logging
# log.add(sink=sys.stderr, level="DEBUG")
log.remove()



## 3. Loading data of different modalities

In this notebook we will load different datasets into synthcity and show that data of many different modalities can be used to generate synthetic data using this module.

### 3.1 Static Data
Now we will start with the simplest example, static tabular data. For this, we will use the diabetes dataset from sklearn. First, we need to load the dataset.

In [2]:
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y
display(X)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930,220.0


Then we pass it to the `GenericDataLoader` object from `synthcity`.

In [3]:
loader = GenericDataLoader(
    X,
    target_column="target",
    # Keyword matches the `sensitive_features` key reported by `loader.info()`
    sensitive_features=["sex"],
)

We can print the methods that are compatible with our data by passing the relevant categories to the `categories` parameter of `Plugins` and then calling `list()`.

In [4]:
print(Plugins(categories=["generic"]).list())

['bayesian_network', 'tvae', 'ctgan', 'nflow', 'rtvae']


There is no need to worry about the code in the next block; we will go into detail on how to generate synthetic data in the case studies to come. It is here purely to demonstrate that our dataset can be used to generate synthetic data with synthcity. We use the `marginal_distributions` method, one of the available debugging methods, to generate the synthetic data.

In [5]:
syn_model = Plugins().get("marginal_distributions")
syn_model.fit(loader)
syn_model.generate(count=10).dataframe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.01239,-0.044642,0.052872,0.021754,0.027268,0.056934,0.053274,0.067191,0.016427,0.012267,201.169135
1,0.048652,0.05068,0.096268,0.062424,0.073969,0.109243,0.100439,0.11072,0.059634,0.057751,254.575787
2,0.024148,0.05068,0.066944,0.034942,0.042412,0.073896,0.068568,0.081306,0.030437,0.027016,218.487044
3,0.011533,-0.044642,0.051847,0.020794,0.026165,0.055699,0.05216,0.066163,0.015406,0.011192,199.907502
4,-0.014889,0.05068,0.020227,-0.00884,-0.007863,0.017584,0.017793,0.034446,-0.016076,-0.021949,160.993191
5,0.033548,0.05068,0.078194,0.045485,0.054518,0.087457,0.080795,0.09259,0.041638,0.038807,232.33201
6,-0.011852,0.05068,0.023861,-0.005434,-0.003952,0.021965,0.021743,0.038091,-0.012458,-0.01814,165.465495
7,0.087138,0.05068,0.142326,0.105588,0.123535,0.164761,0.150498,0.156919,0.105491,0.106025,311.259133
8,0.102807,0.05068,0.161077,0.123161,0.143714,0.187364,0.170878,0.175728,0.124161,0.125678,334.335746
9,-0.023654,0.05068,0.009738,-0.018669,-0.019151,0.004941,0.006393,0.023925,-0.02652,-0.032942,148.084728


### 3.2 Static survival
Next, let's look at censored data. Censoring is a form of missing-data problem in which the time to event is not observed, for example because the study ended before all recruited subjects had shown the event of interest, or because a subject left the study before experiencing the event. Censoring is common in survival analysis. For our next example we will load a static survival dataset: the veterans' lung cancer dataset provided by scikit-survival.
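
As a small worked example of how censoring arises, the sketch below applies right-censoring to a handful of made-up subjects. The function `censor` and all the numbers are hypothetical. The point is only to show how a true event time turns into the pair of values a survival dataset stores: an observed time and an event indicator.

```python
def censor(event_times, dropout_times, study_end):
    """Apply right-censoring.

    Returns a list of (observed_time, event_observed) pairs: each subject is
    followed until the earliest of the event, drop-out, or end of the study.
    """
    records = []
    for event_t, dropout_t in zip(event_times, dropout_times):
        observed_t = min(event_t, dropout_t, study_end)
        # The event is only seen if it happens before follow-up stops.
        event_observed = event_t <= min(dropout_t, study_end)
        records.append((observed_t, event_observed))
    return records


# Hypothetical subjects (times in days); float("inf") means "never dropped out".
true_event_times = [72.0, 411.0, 900.0, 150.0]
dropout_times = [float("inf"), 300.0, float("inf"), float("inf")]

for observed_t, event_observed in censor(
    true_event_times, dropout_times, study_end=500.0
):
    print(observed_t, event_observed)
```

The second subject left the study at day 300 and the third outlived the 500-day study, so both are censored: we only know that their event time is later than the recorded time.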

First, load the dataset.

In [6]:
from sksurv.datasets import load_veterans_lung_cancer

data_x, data_y = load_veterans_lung_cancer()
data_x["status"], data_x["survival_in_days"] = [record[0] for record in data_y], [record[1] for record in data_y]
display(data_x)

Unnamed: 0,Age_in_years,Celltype,Karnofsky_score,Months_from_Diagnosis,Prior_therapy,Treatment,status,survival_in_days
0,69.0,squamous,60.0,7.0,no,standard,True,72.0
1,64.0,squamous,70.0,5.0,yes,standard,True,411.0
2,38.0,squamous,60.0,3.0,no,standard,True,228.0
3,63.0,squamous,60.0,9.0,yes,standard,True,126.0
4,65.0,squamous,70.0,11.0,yes,standard,True,118.0
...,...,...,...,...,...,...,...,...
132,65.0,large,75.0,1.0,no,test,True,133.0
133,64.0,large,60.0,5.0,no,test,True,111.0
134,67.0,large,70.0,18.0,yes,test,True,231.0
135,65.0,large,80.0,4.0,no,test,True,378.0


Pass it to the DataLoader. This time we will use the `SurvivalAnalysisDataLoader`. We need to pass it the data, the name of the column that contains our labels or targets as `target_column`, and the name of the column containing the time elapsed when the event occurred (the event defined by the target column) as `time_to_event_column`. Calling `info()` on the loader object lets us see the information about the dataset we have just prepared.

In [7]:

loader = SurvivalAnalysisDataLoader(
    data_x,
    target_column="status",
    time_to_event_column="survival_in_days",
)
print(loader.info())


{'data_type': 'survival_analysis', 'len': 137, 'static_features': ['Age_in_years', 'Celltype', 'Karnofsky_score', 'Months_from_Diagnosis', 'Prior_therapy', 'Treatment', 'status', 'survival_in_days'], 'sensitive_features': [], 'important_features': [], 'outcome_features': ['status'], 'target_column': 'status', 'time_to_event_column': 'survival_in_days', 'time_horizons': [250.5, 500.0, 749.5], 'train_size': 0.8}


If we get the `marginal_distributions` plugin again and fit it to the `loader` object, we can then call `generate` to produce the synthetic data.

In [8]:
syn_model = Plugins().get("marginal_distributions")
syn_model.fit(loader)
syn_model.generate(count=10)

Unnamed: 0,Age_in_years,Celltype,Karnofsky_score,Months_from_Diagnosis,Prior_therapy,Treatment,status,survival_in_days
0,59.794235,smallcell,60.0,48.197961,no,standard,True,548.715877
1,67.6139,large,10.0,62.506286,yes,test,False,714.758988
2,62.329879,squamous,40.0,52.83765,yes,test,False,602.557849
3,59.60951,smallcell,99.0,47.859954,no,standard,True,544.793417
4,53.911776,large,40.0,37.434313,yes,test,False,423.80749
5,64.357023,large,75.0,56.546894,yes,test,False,645.602325
6,54.566599,large,85.0,38.6325,yes,test,False,437.712037
7,75.913331,large,40.0,77.692478,yes,test,False,890.989455
8,79.29215,squamous,60.0,83.874997,yes,test,False,962.735435
9,52.021751,large,30.0,33.975971,yes,test,False,383.674636


### 3.3 Regular Time Series

In this next example we will load a simple regular time series dataset and show that it is compatible with synthcity. The temporal data must be passed to the loader as a list of dataframes, where each dataframe in the list refers to a different record and contains all time points for that record. So, a small amount of pre-processing is needed to get our data into the right shape. As it is a regular time series, we can simply pass a sequential list of integers as the observation times for each record.

The dataset we will use here is the basic motions dataset provided by SKTime. So, we need to import the library.

In [9]:
from sktime.datasets import load_basic_motions

Load the data and re-format it into a list of dataframes, where each dataframe in the list refers to a different record and contains all time points for that record. We also need the outcomes as a dataframe and the observation times as a list of time steps for each record. As this is a regular time series, our time steps can simply be a sequential list of integers. We will also print some of the data once it is in the correct shape.
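
Before doing this on the real dataset, here is the same reshaping on a tiny hand-made example. It is a sketch under stated assumptions: the `toy` dataframe, its values, and the `toy_*` variable names are made up for illustration. Only the target layout (a list of per-record dataframes, a list of per-record time steps, and one outcome row per record) follows the description above.

```python
import pandas as pd

# Hypothetical toy data: 2 instances x 3 time points x 2 features.
index = pd.MultiIndex.from_product(
    [range(2), range(3)], names=["instance", "timepoints"]
)
toy = pd.DataFrame(
    {
        "dim_0": [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
        "dim_1": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    },
    index=index,
)

instance_ids = toy.index.get_level_values(0).unique()
num_time_steps = toy.index.get_level_values(1).nunique()

# One dataframe per instance, indexed by time point.
toy_temporal_data = [toy.loc[i] for i in instance_ids]
# Regular series: the observation times are just 0, 1, 2, ...
toy_observation_times = [list(range(num_time_steps)) for _ in instance_ids]
# One outcome row per instance.
toy_outcome = pd.DataFrame(["standing", "running"], columns=["label"])

print(toy_temporal_data[0])
print(toy_observation_times)
```

Selecting with `toy.loc[i]` drops the instance level of the index, so each per-record dataframe is indexed by time point only.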

In [21]:
X, y = load_basic_motions(
    split="TRAIN", return_X_y=True, return_type="pd-multiindex"
)
num_instances = len(set((x[0] for x in X.index)))
num_time_steps = len(set((x[1] for x in X.index)))

# Convert multi-index dataframe into list of dataframes
temporal_data = [X.loc[i] for i in range(num_instances)]
y = pd.DataFrame(y, columns=["label"])
observation_times = [list(range(num_time_steps)) for i in range(num_instances)]

print("The first 3 dataframes in the list, `temporal_data`. They refer to the first 3 instances in the dataset. Each instance contains all time steps for all features.")
for i in range(3):
    display(temporal_data[i])
print("The first 3 label values, `y`.")
display(y[0:3])

The first 3 dataframes in the list, `temporal_data`. They refer to the first 3 instances in the dataset. Each instance contains all time steps for all features.


timepoints,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
0,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883
1,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883
2,-0.903497,-3.666397,-0.282844,-0.095881,-0.319605,0.972131
3,1.116125,-0.656101,0.333118,1.624657,-0.569962,1.209171
4,1.638200,1.405135,0.393875,1.187864,-0.271664,1.739182
...,...,...,...,...,...,...
95,-0.167918,0.224085,0.039889,0.039951,-0.010653,-0.021307
96,-0.227670,0.118392,-0.088594,-0.029297,0.005327,-0.042614
97,-0.193271,0.055227,-0.041530,0.000000,-0.013317,-0.063921
98,-0.193271,0.055227,-0.041530,0.000000,-0.013317,-0.063921


timepoints,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
0,0.377751,-0.610850,-0.147376,-0.103872,-0.109198,-0.037287
1,0.377751,-0.610850,-0.147376,-0.103872,-0.109198,-0.037287
2,2.952965,0.970717,-5.962515,-7.593275,-0.697804,-2.865789
3,4.310925,-1.625661,-1.898794,-5.345389,0.402169,-4.176168
4,3.256906,-6.969257,-2.730436,-2.743274,0.615239,-3.417107
...,...,...,...,...,...,...
95,-0.287704,0.344518,0.055352,-0.015980,-0.007990,-0.039951
96,-0.323383,-0.098593,0.063051,-0.018644,-0.010653,-0.135832
97,-0.323383,-0.098593,0.063051,-0.018644,-0.010653,-0.135832
98,-0.333625,-0.215987,-0.000848,-0.007990,0.015980,0.034624


timepoints,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
0,-0.813905,0.825666,0.032712,0.021307,0.122515,0.775041
1,-0.813905,0.825666,0.032712,0.021307,0.122515,0.775041
2,-0.424628,-1.305033,0.826170,-0.372872,-0.045277,0.383526
3,0.316895,-0.507693,0.218569,0.023970,-0.130505,0.588605
4,0.228580,0.028821,0.586313,0.066584,-0.263674,0.817655
...,...,...,...,...,...,...
95,-0.255364,0.314641,-0.091294,-0.292971,0.077238,-0.077238
96,-0.066292,0.610305,0.286864,0.095881,-0.034624,-0.167792
97,-0.206440,0.209063,0.679102,0.279654,-0.031960,-0.372872
98,-0.544255,-0.558847,0.379032,0.159802,0.114525,-0.423476


The first 3 label values, `y`.


Unnamed: 0,label
0,standing
1,standing
2,standing


Pass the data we just prepared to the DataLoader. Here we will use the `TimeSeriesDataLoader`. Then we will print out the loader info to check everything looks correct.

In [11]:
loader = TimeSeriesDataLoader(
    temporal_data=temporal_data,
    observation_times=observation_times,
    outcome=y,
)
display(loader.dataframe())
print(loader.info())

Unnamed: 0,seq_id,seq_time_id,seq_temporal_dim_0,seq_temporal_dim_1,seq_temporal_dim_2,seq_temporal_dim_3,seq_temporal_dim_4,seq_temporal_dim_5,seq_out_label
0,0,0,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883,standing
1,0,1,0.079106,0.394032,0.551444,0.351565,0.023970,0.633883,standing
2,0,2,-0.903497,-3.666397,-0.282844,-0.095881,-0.319605,0.972131,standing
3,0,3,1.116125,-0.656101,0.333118,1.624657,-0.569962,1.209171,standing
4,0,4,1.638200,1.405135,0.393875,1.187864,-0.271664,1.739182,standing
...,...,...,...,...,...,...,...,...,...
3995,39,95,1.239144,-6.142442,0.028264,-2.309144,1.472845,-0.998765,badminton
3996,39,96,0.261434,0.205915,-0.224944,-0.524684,0.769715,0.157139,badminton
3997,39,97,2.490353,-0.878765,-0.597296,0.111862,-0.117188,-0.050604,badminton
3998,39,98,4.122120,0.911620,-0.465409,0.535338,0.197090,0.442120,badminton


{'data_type': 'time_series', 'len': 4000, 'static_features': [], 'temporal_features': ['dim_0', 'dim_1', 'dim_2', 'dim_3', 'dim_4', 'dim_5'], 'outcome_features': ['label'], 'outcome_len': 1.0, 'window_len': 100, 'sensitive_features': [], 'important_features': [], 'random_state': 0, 'train_size': 0.8, 'fill': nan, 'seq_static_features': [], 'seq_temporal_features': ['seq_temporal_dim_0', 'seq_temporal_dim_1', 'seq_temporal_dim_2', 'seq_temporal_dim_3', 'seq_temporal_dim_4', 'seq_temporal_dim_5'], 'seq_outcome_features': ['seq_out_label'], 'seq_offset': 0, 'seq_id_feature': 'seq_id', 'seq_time_id_feature': 'seq_time_id', 'seq_features': ['seq_id', 'seq_time_id', 'seq_temporal_dim_0', 'seq_temporal_dim_1', 'seq_temporal_dim_2', 'seq_temporal_dim_3', 'seq_temporal_dim_4', 'seq_temporal_dim_5', 'seq_out_label']}


Now we are ready to produce the synthetic data. We will use the `timegan` plugin to handle the time series data. We don't care about the quality of the synthetic data here; we just want to check that the dataset is compatible and to practice loading datasets. So we pass `n_iter=1` to limit the number of training iterations of the generator.

In [None]:
syn_model = Plugins().get("timegan", n_iter=1)
syn_model.fit(loader)
syn_model.generate(count=10)

### 3.4 Irregular Time Series

Now let's load an irregular time series dataset and show that it is also compatible with synthcity. The dataset we will use here is a Google stocks dataset provided by the synthcity module itself.

In [None]:
import numpy as np
from synthcity.utils.datasets.time_series.google_stocks import GoogleStocksDataloader

static_data, temporal_data, observation_times, outcome = GoogleStocksDataloader().load()

As the dataset is wrapped by synthcity, it is already provided to us in the correct format, but the requirements are the same as before. The temporal data is a list of dataframes, where each dataframe in the list refers to a different record and contains all time points for that record. The outcomes are all in one dataframe, and the observation times are a list of time steps for each record. The main difference here is that each record's observation times are a list of floats that represent the time between data points.
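
When preparing your own time series data, it can help to check that the two lists line up before building a loader. The helper below is a hypothetical sanity check, not part of synthcity, and the toy records are made up. It relies only on `len()`, so it should work equally for lists of rows and for lists of dataframes.

```python
def find_alignment_problems(temporal_data, observation_times):
    """Return human-readable problems; an empty list means the inputs line up."""
    problems = []
    if len(temporal_data) != len(observation_times):
        problems.append(
            f"{len(temporal_data)} records but "
            f"{len(observation_times)} lists of times"
        )
    for i, (record, times) in enumerate(zip(temporal_data, observation_times)):
        # Every row of a record needs exactly one observation time.
        if len(record) != len(times):
            problems.append(
                f"record {i}: {len(record)} rows but {len(times)} times"
            )
    return problems


# Hypothetical toy input: two records with different numbers of observations.
toy_temporal = [
    [{"price": 1.0}, {"price": 1.2}],
    [{"price": 3.0}, {"price": 2.9}, {"price": 3.1}],
]
toy_times = [[0.5, 1.7], [0.2, 0.9, 4.0]]

print(find_alignment_problems(toy_temporal, toy_times))
print(find_alignment_problems(toy_temporal, [[0.5], [0.2, 0.9, 4.0]]))
```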

In [None]:
loader = TimeSeriesDataLoader(
    temporal_data=temporal_data,
    observation_times=observation_times,
    static_data=static_data,
    outcome=outcome,
)
print(loader.info())
display(loader.dataframe())

Exactly as for the regular time series, we can now generate synthetic data by selecting our time-series-compatible plugin, then calling `fit()` and `generate()`.

In [None]:
syn_model = Plugins().get("timegan", n_iter=1)
syn_model.fit(loader)
syn_model.generate(count=5)

### 3.5 Composite Irregular Time Series Survival Analysis

In this final example we will look at composite data while adding all the other more complex elements we have looked at so far. This next dataset is a composite irregular time series survival analysis dataset. 

Again this dataset is provided by synthcity, so there is little to do in terms of pre-processing as everything is in the right format to begin with.

In [42]:
from synthcity.utils.datasets.time_series.pbc import PBCDataloader
(
    static_surv,
    temporal_surv,
    temporal_surv_horizons,
    outcome_surv,
) = PBCDataloader().load()
T, E = outcome_surv

print("The static survival features (`static_surv`) for the first 3 instances:")
display(static_surv[0:3])
print("The temporal survival features (`temporal_surv`) for the first 3 instances:")
for i in range(3):
    display(temporal_surv[i])
print("The observation times (`temporal_surv_horizons`) for the first 3 instances:")
display(temporal_surv_horizons[0:3])
print("The first 3 time to event values, `T`.")
display(T[0:3])
print("The first 3 event values, `E`.")
display(E[0:3])

The static survival features (`static_surv`) for the first 3 instances:


Unnamed: 0,sex
0,1.0
1,0.0
2,1.0


The temporal survival features (`temporal_surv`) for the first 3 instances:


Unnamed: 0,drug,ascites,hepatomegaly,spiders,edema,histologic,serBilir,serChol,albumin,alkaline,SGOT,platelets,prothrombin,age
0.569489,0.0,2.0,2.0,2.0,2.0,1.0,2.015877,-0.469461,-1.570646,0.285613,0.195488,-0.456022,0.813132,0.248058
1.09517,0.0,0.0,1.0,1.0,0.0,2.0,3.28189,0.0,-0.894575,0.195532,-1.485263,-0.529101,0.136768,0.248058


Unnamed: 0,drug,ascites,hepatomegaly,spiders,edema,histologic,serBilir,serChol,albumin,alkaline,SGOT,platelets,prothrombin,age
5.31979,0.0,0.0,1.0,1.0,0.0,3.0,-0.478914,-0.145812,1.491559,5.110015,-0.116943,-0.132388,-0.26905,1.292856
6.261636,0.0,2.0,2.0,2.0,2.0,3.0,-0.534768,0.0,0.417798,0.616191,0.214616,-0.476902,0.001495,1.292856
7.266455,1.0,2.0,2.0,2.0,0.0,2.0,-0.497532,0.0,0.318376,0.279664,0.274552,-0.758777,0.407314,1.292856
8.26306,0.0,2.0,2.0,2.0,1.0,3.0,-0.329971,0.0,1.054101,-0.014372,0.274552,-1.165929,-0.26905,1.292856
9.251451,0.0,2.0,2.0,2.0,1.0,3.0,-0.199646,-0.714171,-0.138966,-0.231075,0.116423,-1.030212,0.204405,1.292856
12.049611,1.0,2.0,2.0,2.0,2.0,3.0,-0.013468,0.0,-0.934344,-0.327954,0.116423,-1.395605,0.339677,1.292856
13.152995,1.0,1.0,1.0,0.0,0.0,3.0,0.098239,0.0,-1.312149,-0.443529,0.29368,-1.364286,0.339677,1.292856
13.654036,0.0,2.0,2.0,2.0,0.0,3.0,-0.013468,-0.603657,-1.172958,-0.512364,-0.046806,-1.259888,0.339677,1.292856
14.152338,1.0,2.0,2.0,2.0,2.0,3.0,0.17271,-0.658914,-1.431455,-0.605844,-0.442126,-1.395605,0.339677,1.292856


Unnamed: 0,drug,ascites,hepatomegaly,spiders,edema,histologic,serBilir,serChol,albumin,alkaline,SGOT,platelets,prothrombin,age
0.736502,0.0,2.0,2.0,2.0,2.0,2.0,-0.423061,-1.14044,0.179185,-0.735865,-0.338833,-0.863175,0.677859,1.511299
1.774176,1.0,0.0,1.0,1.0,2.0,3.0,-0.478914,0.0,-0.19862,-0.874385,-0.674218,-0.769217,0.677859,1.511299
2.288906,1.0,0.0,0.0,0.0,0.0,3.0,-0.404443,-0.690489,0.358145,-0.98911,-0.832346,-1.322527,0.677859,1.511299
2.770781,0.0,2.0,2.0,2.0,0.0,3.0,-0.348589,-1.069395,-0.278157,-0.794503,-0.437026,-1.301647,1.557133,1.511299


The observation times (`temporal_surv_horizons`) for the first 3 instances:


[array([0.56948856, 1.0951703 ]),
 array([ 5.31978973,  6.26163618,  7.26645493,  8.26305991,  9.2514511 ,
        12.04961121, 13.15299529, 13.6540357 , 14.15233819]),
 array([0.73650203, 1.77417588, 2.28890592, 2.77078086])]

The first 3 time to event values, `T`.


0     0.569489
1    14.152338
2     0.736502
Name: time_to_event, dtype: float64

The first 3 event values, `E`.


0    1
1    0
2    1
Name: event, dtype: int64

Even complex datasets such as this are compatible with synthcity. We can load this data using the `TimeSeriesSurvivalDataLoader`. Then, by calling `loader.info()`, we can check the information about the dataset. It contains one static feature (`sex`) and 14 temporal features, making it a composite dataset. The `seq_time_id` field shows the irregular time sampling, which we create by passing the values to the `observation_times` parameter of the `TimeSeriesSurvivalDataLoader` object. Finally, we are formulating this data as a survival analysis problem, which is indicated by the presence of a `time_to_event` field.

In [38]:
loader = TimeSeriesSurvivalDataLoader(
    temporal_data=temporal_surv,
    observation_times=temporal_surv_horizons,
    static_data=static_surv,
    T=T,
    E=E,
)

print(loader.info())

{'data_type': 'time_series_survival', 'len': 1945, 'static_features': ['sex'], 'temporal_features': ['SGOT', 'age', 'albumin', 'alkaline', 'ascites', 'drug', 'edema', 'hepatomegaly', 'histologic', 'platelets', 'prothrombin', 'serBilir', 'serChol', 'spiders'], 'outcome_features': ['time_to_event', 'event'], 'outcome_len': 2.0, 'window_len': 16, 'sensitive_features': [], 'important_features': [], 'random_state': 0, 'train_size': 0.8, 'fill': nan, 'seq_static_features': ['seq_static_sex'], 'seq_temporal_features': ['seq_temporal_SGOT', 'seq_temporal_age', 'seq_temporal_albumin', 'seq_temporal_alkaline', 'seq_temporal_ascites', 'seq_temporal_drug', 'seq_temporal_edema', 'seq_temporal_hepatomegaly', 'seq_temporal_histologic', 'seq_temporal_platelets', 'seq_temporal_prothrombin', 'seq_temporal_serBilir', 'seq_temporal_serChol', 'seq_temporal_spiders'], 'seq_outcome_features': ['seq_out_time_to_event', 'seq_out_event'], 'seq_offset': 0, 'seq_id_feature': 'seq_id', 'seq_time_id_feature': 'seq_

We can now generate synthetic data in the now-familiar way: we select our time-series-compatible plugin, then call `fit()` and `generate()`.

In [39]:
syn_model = Plugins().get("timegan", n_iter=1)
syn_model.fit(loader)
syn_model.generate(count=5)

100%|██████████| 1/1 [00:00<00:00,  1.93it/s]


Unnamed: 0,seq_id,seq_time_id,seq_static_sex,seq_temporal_SGOT,seq_temporal_age,seq_temporal_albumin,seq_temporal_alkaline,seq_temporal_ascites,seq_temporal_drug,seq_temporal_edema,seq_temporal_hepatomegaly,seq_temporal_histologic,seq_temporal_platelets,seq_temporal_prothrombin,seq_temporal_serBilir,seq_temporal_serChol,seq_temporal_spiders,seq_out_time_to_event,seq_out_event
0,0,4.462156,0.0,-0.204392,-0.558758,-0.512971,2.684311,0.0,0.0,0.0,0.0,3.0,1.681971,-0.270595,1.53037,-0.128398,0.0,5.946796,0.0
1,1,3.00378,0.0,0.435623,-0.318297,-0.791745,0.360462,2.0,0.0,2.0,2.0,2.0,0.421941,0.002799,1.363272,-0.015049,1.0,5.92429,0.0


## 4. Extension

Use the code block below as a space to complete the extension exercises.

### 4.1 Create synthetic datasets
 1) Above we have generated data with the debugging method `marginal_distributions` for tabular data and `timegan` for time series data. Now, using `Plugins().list()` or the [documentation](https://synthcity.readthedocs.io/en/latest/), find another method that is compatible with some of the datasets and see if you can generate your own synthetic data. What makes the method you have chosen better than the defaults we used before?
 
 2) Generate synthetic data for another dataset of your choice using the methods described above. You can use any of the other datasets from the sources we have used above: [SKLearn](https://scikit-learn.org/stable/datasets/toy_dataset.html), [SKTime](https://www.sktime.org/en/stable/api_reference/datasets.html), [SKSurv](https://scikit-survival.readthedocs.io/en/stable/api/datasets.html), or [synthcity](https://github.com/vanderschaarlab/synthcity/tree/main/src/synthcity/utils/datasets) itself.