# Case Study 1 - Data Modality

## The Task
Get used to loading datasets into the library and generating synthetic data from them, whatever the modality of the real data.

### Imports
Let's get the imports out of the way. We import the required standard and third-party libraries and the relevant Synthcity modules. We can also set the logging level here, using Synthcity's bespoke logger.

In [8]:
# Standard
import sys
import warnings
from pathlib import Path

# 3rd party
import numpy as np
import pandas as pd

# synthcity
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import (GenericDataLoader, SurvivalAnalysisDataLoader)

# Configure warnings and logging
warnings.filterwarnings("ignore")

# Silence Synthcity's logger by removing its sinks.
# To see log output instead, uncomment the log.add() line below.
log.remove()
# log.add(sink=sys.stderr, level="DEBUG")

## Synthetic generators

We can list the available synthetic generators by calling `list()` on a `Plugins` object.

In [9]:
print(Plugins().list())

['survival_nflow', 'bayesian_network', 'timegan', 'rtvae', 'survival_ctgan', 'radialgan', 'timevae', 'adsgan', 'fflows', 'ctgan', 'nflow', 'dpgan', 'survae', 'decaf', 'privbayes', 'pategan', 'probabilistic_ar', 'survival_gan', 'tvae']
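The list is a plain Python list of strings, so it can be filtered like any other. As a minimal sketch, the snippet below picks out the generators whose names start with `survival_`. The helper `names_with_prefix` is hypothetical (it is not part of Synthcity), and this is only a string match on the names: it makes no claim about which modalities the other generators support.

```python
# Names copied verbatim from the Plugins().list() output above.
available = [
    "survival_nflow", "bayesian_network", "timegan", "rtvae", "survival_ctgan",
    "radialgan", "timevae", "adsgan", "fflows", "ctgan", "nflow", "dpgan",
    "survae", "decaf", "privbayes", "pategan", "probabilistic_ar",
    "survival_gan", "tvae",
]


def names_with_prefix(names, prefix):
    """Return the names that start with `prefix`, sorted alphabetically.

    Hypothetical helper for illustration; not a Synthcity function.
    """
    return sorted(name for name in names if name.startswith(prefix))


survival_named = names_with_prefix(available, "survival_")
print(survival_named)  # ['survival_ctgan', 'survival_gan', 'survival_nflow']
```

In a live session the hard-coded `available` list would presumably be replaced by the return value of `Plugins().list()`.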


### Static
### Regular Time Series
### Irregular Time Series

### Load the data - SurvivalAnalysisDataLoader

Load the data from file into a `SurvivalAnalysisDataLoader` object. For this we need to pass the names of our `target_column` and our `time_to_event_column` to the data loader. We can then view the data by calling `loader.dataframe()` and get information about the data loader object with `loader.info()`.

In [5]:
X = pd.read_csv("../data/Brazil_COVID/covid_normalised_numericalised.csv")
loader = SurvivalAnalysisDataLoader(
    X,
    target_column="is_dead",
    time_to_event_column="Days_hospital_to_outcome",
    sensitive_features=["Age", "Sex", "Ethnicity", "Region"],
    random_state=42,
)

display(loader.dataframe())
# display(loader.info())

Unnamed: 0,is_dead,Days_hospital_to_outcome,Age,Sex,Ethnicity,Region,Fever,Cough,Sore_throat,Shortness_of_breath,...,Vomitting,Cardiovascular,Asthma,Diabetis,Pulmonary,Immunosuppresion,Obesity,Liver,Neurologic,Renal
0,0,3,1,0,0,2,1,1,0,1,...,0,0,1,0,0,0,0,0,0,0
1,0,7,75,0,0,2,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,6,81,1,0,2,1,1,0,1,...,0,1,0,1,0,0,0,0,0,0
5,1,7,64,1,1,4,0,0,0,1,...,0,1,0,1,0,0,0,0,0,0
6,1,5,62,1,1,4,1,1,0,1,...,0,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6877,0,5,52,0,0,4,1,1,0,1,...,0,0,0,1,0,0,0,0,0,0
6878,0,2,34,1,1,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
6879,0,18,44,1,1,4,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
6880,0,17,23,1,1,4,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
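Because the loader is configured purely by column name, a misspelt name is an easy mistake to make. The sketch below shows one way to check the names before building the loader. The helper `missing_columns` is hypothetical (not a Synthcity or pandas function), and `visible_columns` contains only the column names visible in the truncated display above; the elided columns are deliberately left out.

```python
def missing_columns(columns, required):
    """Return the entries of `required` that are absent from `columns`, in order.

    Hypothetical helper for illustration; not a Synthcity or pandas function.
    """
    present = set(columns)
    return [name for name in required if name not in present]


# Column names visible in the truncated display above (elided columns not listed).
visible_columns = [
    "is_dead", "Days_hospital_to_outcome", "Age", "Sex", "Ethnicity", "Region",
    "Fever", "Cough", "Sore_throat", "Shortness_of_breath",
    "Vomitting", "Cardiovascular", "Asthma", "Diabetis", "Pulmonary",
    "Immunosuppresion", "Obesity", "Liver", "Neurologic", "Renal",
]

# Names passed to SurvivalAnalysisDataLoader in the cell above.
required = ["is_dead", "Days_hospital_to_outcome", "Age", "Sex", "Ethnicity", "Region"]

print(missing_columns(visible_columns, required))            # []
print(missing_columns(visible_columns, ["is_dead", "age"]))  # ['age']
```

With the real data, the same check could be run as `missing_columns(X.columns, required)` before constructing the loader. Note that the match is case-sensitive, so `"age"` is reported as missing even though `"Age"` is present.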


 - Don't train just list
### Create synthetic datasets

From the list above, the generators aimed at privacy are "privbayes", "dpgan", "adsgan", and "pategan". Here we pick one of them, "adsgan": we create the model, fit it to the data loader, and then use it to generate a synthetic dataset.

In [7]:
outdir = Path("saved_models")
prefix = "data_modality"
n_iter = 10  # training iterations; kept small so the example runs quickly
model = "adsgan"

syn_model = Plugins().get(model, n_iter=n_iter)
syn_model.fit(loader)
syn_data = syn_model.generate(count=10).dataframe()

display(syn_data)

100%|██████████| 10/10 [00:10<00:00,  1.03s/it]


Unnamed: 0,is_dead,Days_hospital_to_outcome,Age,Sex,Ethnicity,Region,Fever,Cough,Sore_throat,Shortness_of_breath,...,Vomitting,Cardiovascular,Asthma,Diabetis,Pulmonary,Immunosuppresion,Obesity,Liver,Neurologic,Renal
0,0,17,14,0,0,2,0,1,1,1,...,0,0,0,0,0,0,0,0,0,1
1,0,49,7,1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,11,51,1,2,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,30,20,1,1,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,12,26,1,1,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
5,1,13,7,1,1,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
6,0,7,34,0,1,4,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,0,13,37,1,1,2,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
8,0,24,52,1,1,4,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,5,26,1,1,4,1,1,0,0,...,1,0,0,0,0,0,0,0,0,1
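The cell above defines `outdir` and `prefix` but never uses them. One plausible use, sketched below as an assumption rather than the author's method, is to build a file path under which a fitted generator could be cached. The helper `model_cache_path`, the file-name pattern, and the `.bkp` extension are all hypothetical choices made for illustration; the snippet only constructs the path and does not save anything.

```python
from pathlib import Path


def model_cache_path(outdir: Path, prefix: str, model: str, n_iter: int) -> Path:
    """Build a file path for a fitted generator.

    Hypothetical helper and naming scheme, for illustration only.
    """
    return outdir / f"{prefix}.{model}_n_iter={n_iter}.bkp"


# Same values as in the cell above.
outdir = Path("saved_models")
prefix = "data_modality"
n_iter = 10
model = "adsgan"

save_file = model_cache_path(outdir, prefix, model, n_iter)
print(save_file.name)      # data_modality.adsgan_n_iter=10.bkp
print(save_file.exists())  # whether a cached file is already present
```

Encoding the model name and `n_iter` in the file name keeps generators trained with different settings from overwriting one another.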
