# From Expectations to Synthetic Data Generation

## 2. The synthetic data generation


In [1]:
from IPython.display import JSON, display_json
import json

import pandas as pd

In [2]:
dataset_name = "Cardiovascular"
data = pd.read_csv('cardio.csv')

#json.load returns a string for this file, so a second decode is needed to get the dict
with open(f'.profile_{dataset_name}.json') as f:
    json_profile = json.load(f)
json_profile = json.loads(json_profile)
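
The two decoding steps above suggest that the profile file holds a JSON document that was itself serialised as a JSON string. That is an assumption about how the file was written, not something this notebook shows. A minimal sketch of the round trip, using a toy dictionary in place of the real profile:

```python
import json

# Toy stand-in for the profile; the real file holds the full profiling report.
profile = {"variables": {"age": {"type": "Numeric"}, "gender": {"type": "Categorical"}}}

# Serialising an already-serialised string wraps the JSON document in a JSON string.
json_string = json.dumps(profile)        # a JSON document as text
file_content = json.dumps(json_string)   # the same text, encoded once more

first_pass = json.loads(file_content)    # plays the role of json.load(f): yields a str
second_pass = json.loads(first_pass)     # the second decode yields the dict

print(type(first_pass).__name__, type(second_pass).__name__)
```

If the file had been written with a single encoding step, the second `json.loads` call would fail because its input would already be a dictionary.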

Let's leverage the automated data type detection of `pandas-profiling` to separate the columns by type for the synthesis process.

In [3]:
#The target column is left out of both lists
num_cols = [col for col, val in json_profile['variables'].items() if val['type']=='Numeric' and col!='cardio']
cat_cols = [col for col, val in json_profile['variables'].items() if val['type']=='Categorical' and col!='cardio']

print(f'Number of numerical: {len(num_cols)}, Number of categorical: {len(cat_cols)}')

Number of numerical: 5, Number of categorical: 6


### Prepare the data for synthesis

After checking the warnings generated by `pandas-profiling`, we can see that the cardio dataset is generally well behaved, so we can rely on the standard data preparation performed by the synthesizer architectures:
- **Numerical columns** - Standard Scaler, which helps the models converge faster and makes results easier to reproduce.
- **Categorical columns** - Label Encoder
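
The synthesizer applies these transformations internally, so nothing needs to be coded here. Purely as an illustration of what the two steps do, the sketch below reimplements them in plain Python on toy values. The helper names `standard_scale` and `label_encode` are hypothetical and are not part of any library used in this notebook.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Centre values on zero and scale them to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread to scale.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

heights = [150.0, 160.0, 170.0, 180.0]               # toy numerical column
cholesterol = ["normal", "high", "normal", "above"]  # toy categorical column

scaled_heights = standard_scale(heights)
encoded_cholesterol, cholesterol_mapping = label_encode(cholesterol)

print(scaled_heights)
print(encoded_cholesterol, cholesterol_mapping)
```

Keeping the mapping returned by `label_encode` matters in practice, because generated integer codes have to be translated back to the original categories.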

Because we aim to generate synthetic data as close as possible to the original, it is recommended not to perform any outlier treatment or feature engineering.

<div class="alert alert-block alert-warning">
<b>Note:</b> The selection and use of data transformations for synthesis will vary based on the dataset and synthetic data generation approach.
</div>

Because we have selected a Conditional GAN architecture, we need to choose a conditional column. Typically, to optimize the utility and fidelity of the generated data, it is recommended to select the *target* variable, if the dataset has one.

For the `Cardiovascular disease` dataset, we are going to use the variable *cardio* as our conditional column.

<p style="text-align:center;"><img src="img/cgan.jpeg" alt="Conditional GAN architecture" width="500" height="200"></p>

[Image source](https://arxiv.org/abs/1411.1784)
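
To make the role of the conditional column concrete, here is a minimal sketch of the general conditional GAN idea: the generator receives a noise vector joined with an encoding of the label. The function names are hypothetical, the one-hot encoding is an assumption, and the library used below may condition its generator differently. The value 32 mirrors the `noise_dim` set in the training cell.

```python
import random

def one_hot(label, labels):
    """Encode a label as a vector with a single 1.0 at the label's position."""
    return [1.0 if label == candidate else 0.0 for candidate in labels]

def generator_input(label, labels, noise_dim, rng):
    """Concatenate Gaussian noise with the encoded condition."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, labels)

rng = random.Random(0)   # seeded so the sketch is repeatable
labels = (0, 1)          # the two values of the cardio column

z_no_disease = generator_input(0, labels, noise_dim=32, rng=rng)
z_disease = generator_input(1, labels, noise_dim=32, rng=rng)

print(len(z_no_disease), z_no_disease[-2:], z_disease[-2:])
```

Because the label is part of the input, the same trained generator can be asked for records of either class at sampling time.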

In [4]:
#The cardio dataset is pretty balanced with respect to the target variable.
#Other variables, such as gender, are also quite balanced within each target class
data['cardio'].value_counts()

0    34701
1    33970
Name: cardio, dtype: int64
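
As a quick check of how balanced the target is, the counts above can be turned into class shares. This is a small arithmetic sketch that hardcodes the two counts from the output; on the real frame, `data['cardio'].value_counts(normalize=True)` would give the shares directly.

```python
# Counts copied from the value_counts() output above.
counts = {0: 34701, 1: 33970}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"cardio={label}: {share:.2%}")
print(f"majority/minority ratio: {imbalance_ratio:.3f}")
```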

### Training a synthesizer


In [5]:
from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

In [6]:
## Setting the architecture hyperparameters
noise_dim = 32
dim = 128
batch_size = 64

#Defined as per the literature on CWGAN
beta_1 = 0.5
beta_2 = 0.9

log_step = 100
epochs = 5 + 1
learning_rate = 0.0001
models_dir = '../cache'

model_parameters = ModelParameters(batch_size=batch_size,
                                   lr=learning_rate,
                                   betas=(beta_1, beta_2),
                                   noise_dim=noise_dim,
                                   layers_dim=dim)

train_args = TrainParameters(epochs=epochs,
                             cache_prefix='',
                             sample_interval=log_step,
                             label_dim=-1,
                             labels=(0,1))

In [7]:
#Init the synthesizer model
#n_critic sets the number of critic network updates per generator update
synth = RegularSynthesizer(modelname='cwgangp', model_parameters=model_parameters, n_critic=5)

#Model training
synth.fit(data=data, 
          label_cols=["cardio"], 
          train_arguments=train_args,
          num_cols=num_cols, cat_cols=cat_cols)


Number of iterations per epoch: 1073


 17%|███████▌                                     | 1/6 [00:21<01:45, 21.06s/it]

Epoch: 0 | critic_loss: -0.2805415987968445 | gen_loss: 0.20160722732543945


 33%|███████████████                              | 2/6 [00:41<01:23, 20.78s/it]

Epoch: 1 | critic_loss: -0.33020472526550293 | gen_loss: 0.31511062383651733


 50%|██████████████████████▌                      | 3/6 [01:00<01:00, 20.03s/it]

Epoch: 2 | critic_loss: -0.22822757065296173 | gen_loss: 0.22219200432300568


 67%|██████████████████████████████               | 4/6 [01:19<00:39, 19.56s/it]

Epoch: 3 | critic_loss: -0.2783055007457733 | gen_loss: 0.28723031282424927


 83%|█████████████████████████████████████▌       | 5/6 [01:38<00:19, 19.25s/it]

Epoch: 4 | critic_loss: -0.22748863697052002 | gen_loss: 0.14494484663009644


100%|█████████████████████████████████████████████| 6/6 [01:57<00:00, 19.54s/it]

Epoch: 5 | critic_loss: -0.3456514775753021 | gen_loss: 0.4503900706768036
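
The figure of 1073 iterations per epoch reported in the log is consistent with dividing the number of rows by the batch size and rounding up. The sketch below reproduces that arithmetic. The update counts assume the usual WGAN-GP schedule of `n_critic` critic updates followed by one generator update per iteration. That schedule is an assumption about the library's internals, not something the log confirms.

```python
import math

n_rows = 34701 + 33970   # rows in the dataset, from the target value counts
batch_size = 64
epochs = 5 + 1
n_critic = 5

iterations_per_epoch = math.ceil(n_rows / batch_size)

# Assumption: one generator update and n_critic critic updates per iteration.
generator_updates = iterations_per_epoch * epochs
critic_updates = generator_updates * n_critic

print(iterations_per_epoch, generator_updates, critic_updates)
```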





In [8]:
#Saving the trained synthesizer
synth.save(f'{dataset_name}_synth.pkl')

In [9]:
cond_array = data[["cardio"]]

#Generating a sample with the same size and conditional configuration as the original dataset
synth_sample = synth.sample(cond_array)

#Saving the synthetic sample as CSV
synth_sample.to_csv(f'synth_{dataset_name}.csv')
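
The cell above reuses the original target column as the condition, which reproduces the original size and class mix. A different mix could be requested by building the condition labels by hand. The sketch below only builds the labels. The helper name is hypothetical, and the commented lines assume that `sample` accepts a condition frame of any length, which this notebook does not verify.

```python
def build_condition_labels(counts):
    """Return a flat list of labels, repeating each label the requested number of times."""
    labels = []
    for label, n in counts.items():
        labels.extend([label] * n)
    return labels

# Hypothetical request: 1,000 synthetic patients, evenly split across the target.
condition_labels = build_condition_labels({0: 500, 1: 500})
print(len(condition_labels), condition_labels.count(1))

# Assumption: wrapped as a one-column frame, mirroring the shape of data[["cardio"]]:
#   cond_array = pd.DataFrame({"cardio": condition_labels})
#   synth_sample = synth.sample(cond_array)
```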



## Summary & Next Steps


Now that we have successfully generated our synthetic data sample, we need to assess whether the output of our synthesizer has enough quality. The quality of synthetic data can be assessed in terms of *Fidelity* and *Utility*.

In the next notebook we will explore the *Fidelity* of our dataset through:
- Synthetic data profiling vs Real data profiling
- Running the real data's suite of expectations on the synthetic data
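
As a rough preview of what a fidelity check looks at, the sketch below compares the mean and standard deviation of one numerical column across a real and a synthetic sample. It is only an illustration in plain Python. The values are toy numbers, not taken from `cardio.csv` or from the trained model, and the next notebook relies on profiling and expectations rather than this hypothetical helper.

```python
from statistics import fmean, pstdev

def summary_gap(real, synthetic):
    """Absolute differences in mean and standard deviation between two numeric columns."""
    return {
        "mean_gap": abs(fmean(real) - fmean(synthetic)),
        "std_gap": abs(pstdev(real) - pstdev(synthetic)),
    }

real_weight = [62.0, 85.0, 64.0, 82.0, 56.0]        # toy values, not from cardio.csv
synthetic_weight = [60.0, 88.0, 66.0, 79.0, 57.0]   # toy values, not model output

gap = summary_gap(real_weight, synthetic_weight)
print(gap)
```

Small gaps on summary statistics are necessary but not sufficient: two columns can share a mean and a spread and still differ in shape, which is why full profiles are compared.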