# Great Expectations and YData Synthetic: A data expectation framework for a synthetic data engine

This notebook is a step-by-step tutorial on how to integrate the Great Expectations framework into a ydata-synthetic project.
Specifically, we will:
1. Set up the project structure by initializing a Data Context
2. Download/Extract the real data set which we use to create synthetic data
3. Configure a Data Source to connect our data
4. Create an Expectation Suite using the built-in Great Expectations profiler
5. Transform the real data for modeling
6. Train the synthesizers and create the model file
7. Sample synthetic data from synthesizer
8. Inverse transform the data to obtain the original format
9. Create a new checkpoint to validate the synthetic data against the real data
10. Evaluate the synthetic data using Data Docs 

But before we get started, create a virtual environment and make sure that both ydata-synthetic and great_expectations are installed in it.

!pip install ydata-synthetic great_expectations
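
As a quick sanity check after installing, the sketch below reports which of the two packages cannot be found in the active environment. The helper name `missing_packages` is made up for this example; it relies only on the standard library and assumes the import names `ydata_synthetic` and `great_expectations`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names differ from pip names: ydata-synthetic is imported as ydata_synthetic.
required = ["ydata_synthetic", "great_expectations"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, rerun the install command inside the virtual environment before continuing.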

## Step 1: Set up the project structure through a Data Context

In Great Expectations, your Data Context manages the project configuration. There are multiple ways to create the Data Context, but the simplest is the CLI that ships with the great_expectations package.

Note: For this tutorial we'll use the newer V3 (Batch Request) API, so you'll see the ```--v3-api``` flag in most of the commands.

Open your terminal, navigate to the project directory, and type the following:

```great_expectations --v3-api init```

Press Enter to confirm the creation of the Data Context, and that's about it.

If you're curious about the modified project structure, here's an excerpt from the GE documentation:
- great_expectations.yml contains the main configuration of your deployment.
- The expectations/ directory stores all your Expectations as JSON files. If you want to store them somewhere else, you can change that later.
- The plugins/ directory holds code for any custom plugins you develop as part of your deployment.
- The uncommitted/ directory contains files that shouldn’t live in version control. It has a .gitignore configured to exclude all its contents from version control. The main contents of the directory are:
    - uncommitted/config_variables.yml, which holds sensitive information, such as database credentials and other secrets.
    - uncommitted/documentation, which contains Data Docs generated from Expectations, Validation Results, and other metadata.
    - uncommitted/validations, which holds Validation Results generated by Great Expectations.
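
To confirm that the initialization produced the layout described above, a small check like the following can help. It is only a sketch: the helper `missing_entries` is hypothetical, and it assumes the Data Context lives in a `great_expectations/` folder inside the project directory.

```python
from pathlib import Path

# Top-level entries from the documentation excerpt above.
EXPECTED_ENTRIES = [
    "great_expectations.yml",
    "expectations",
    "plugins",
    "uncommitted",
]


def missing_entries(context_root):
    """Return the expected entries that do not exist under context_root."""
    root = Path(context_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# Assumed location of the Data Context, relative to the project directory.
print(missing_entries("great_expectations"))
```

An empty list means every expected file and directory is present.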

## Step 2: Download/Extract the real data set which we use to create synthetic data

To move forward, we pick the use case "The Credit Card Fraud Dataset - Synthesizing the Minority Class", where we aim to synthesize the minority class of a highly imbalanced credit card fraud dataset. It is a practical exercise that showcases how the YData Synthetic library can be used with GANs to synthesize tabular data. The exercise uses the credit card fraud dataset from Kaggle, which can be found here:
https://www.kaggle.com/mlg-ulb/creditcardfraud

Since we're interested only in the fraud class data points of the dataset, let's filter them and write them to the data directory.

In [1]:
# importing all required libraries

import os
import numpy as np  # used further down for np.sum and np.unique
import pandas as pd
import sklearn.cluster as cluster

from ydata_synthetic.synthesizers.regular import VanilllaGAN
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.preprocessing.regular.credit_fraud import *
from ydata_synthetic.postprocessing.regular.inverse_preprocesser import inverse_transform

model = VanilllaGAN

In [2]:
# Read the original data
data = pd.read_csv('./data/creditcard.csv')

#Filter the minority class
train_data = data.loc[ data['Class']==1 ].copy()

# Inspect the shape of the data
train_data.shape

(492, 31)

In [3]:
# Write to the data folder
train_data.to_csv('./data/creditcard_fraud.csv', index=False)
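
To make the filtering step concrete, here is the same boolean-mask pattern on a tiny made-up frame. The values are invented for illustration and have nothing to do with the Kaggle data; only the `Class` column convention (1 for fraud, 0 otherwise) is taken from the dataset.

```python
import pandas as pd

# Toy stand-in for creditcard.csv: 'Class' is 1 for fraud and 0 otherwise.
toy = pd.DataFrame({
    "Amount": [12.5, 3.0, 250.0, 9.99, 41.0, 7.2],
    "Class": [0, 0, 1, 0, 0, 1],
})

# Share of each class; the fraud class is the minority.
class_share = toy["Class"].value_counts(normalize=True)

# Same filter as in the cell above; .copy() gives an independent frame.
toy_fraud = toy.loc[toy["Class"] == 1].copy()

print(class_share)
print(toy_fraud.shape)
```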

## Step 3: Configure a Data Source to connect our data

In Great Expectations, Datasources simplify connections by managing configuration and providing a consistent, cross-platform API for referencing data.

Let’s configure our first Datasource: a connection to the data directory we’ve provided in the repo. A Datasource could just as well be a database connection, among other options.

```great_expectations --v3-api datasource new```

You will be presented with different options. Select:
- 1. Files on a filesystem (for processing with Pandas or Spark)

and for processing the data select:

- 1. Pandas

and finally enter the directory as: data (where we have our real data)

Once you've entered the details, a Jupyter notebook will open. This is how Great Expectations hands us templated code, so that we only need to change a few lines.

Let's change the Datasource name to something more specific.

Edit the second code cell as follows:
```datasource_name = "data__dir"```

Then execute all cells in the notebook in order to save the new Datasource. If successful, the last cell will print a list of all Datasources, including the one you just created.
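
For orientation, the configuration saved by the notebook for a Pandas filesystem Datasource looks roughly like the dictionary below. Treat it as a sketch of the V3 API template, not a copy of the generated file: the connector names and the relative `base_directory` are assumptions and may differ in your Great Expectations version.

```python
# Approximate shape of a V3 filesystem Datasource configuration (assumed values).
datasource_config = {
    "name": "data__dir",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        # Infers one data asset per file found in base_directory.
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": "../data",  # assumed path, relative to great_expectations/
            "default_regex": {"group_names": ["data_asset_name"], "pattern": "(.*)"},
        },
        # Lets you pass in-memory data or explicit paths at run time.
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

connector = datasource_config["data_connectors"]["default_inferred_data_connector_name"]
print(datasource_config["name"], "->", connector["base_directory"])
```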

## Step 4: Create an Expectation Suite using the built-in Great Expectations profiler

An expectation is nothing but a falsifiable, verifiable statement about data. Expectations provide a language to talk about data characteristics and data quality - humans to humans, humans to machines and machines to machines.

The idea here is that we assume the real data has the quality we want in the synthesized data, so we use the real data to create a set of expectations against which we can later evaluate our synthetic data.

The CLI will help create our first Expectation Suite. Suites are simply collections of Expectations. We can use the built-in profiler to automatically create an Expectation Suite called `creditcard.quality`.
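
The idea can be illustrated without Great Expectations at all. In the sketch below, `expect_values_between` is a hypothetical helper written in plain pandas (it is not part of the Great Expectations API), and the numbers are invented: the bounds are learned from the "real" series and then used to check a "synthetic" one.

```python
import pandas as pd


def expect_values_between(series, min_value, max_value):
    """Hypothetical helper: check that every value lies in [min_value, max_value]."""
    unexpected = series[(series < min_value) | (series > max_value)]
    return {
        "success": bool(unexpected.empty),
        "unexpected_count": int(unexpected.size),
    }


real = pd.Series([10.0, 25.0, 40.0], name="Amount")
synthetic = pd.Series([12.0, 38.0, 95.0], name="Amount")

# Derive the expectation from the real data ...
low, high = real.min(), real.max()

# ... then evaluate both datasets against it.
real_result = expect_values_between(real, low, high)
synthetic_result = expect_values_between(synthetic, low, high)
print(real_result)
print(synthetic_result)
```

The real data passes by construction, while the synthetic series fails because one value falls outside the observed range. This is the falsifiable, verifiable kind of statement an Expectation encodes.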

Type the following into your terminal:
```great_expectations --v3-api suite new```

You will be presented with a few options in the terminal. Choose 3 to create the Expectation Suite automatically, using a profiler, and 1 to profile the real fraud dataset creditcard_fraud.csv which we saved in Step 2.

This step might be a bit confusing at first, but stay with us. Another Jupyter notebook will open with boilerplate code for creating a new Expectation Suite. The code is pretty standard, but please note that in the second cell all columns are added to the list of ignored columns. In our example we want to validate every single column, so remove (or comment out) the columns in the ignored_columns list.
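
After the edit, the relevant part of the second cell might look like the sketch below. The exact contents of the generated cell depend on your Great Expectations version, so take this as an assumption about its shape; the column names come from the credit card dataset.

```python
# Sketch of the edited cell: every column name is commented out,
# so no column is excluded from profiling.
ignored_columns = [
    # "Time",
    # "V1",
    # "V2",
    # ... one commented-out entry per remaining column ...
    # "Amount",
    # "Class",
]

# With nothing ignored, the profiler considers every column.
print(ignored_columns)
```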

Other than that, go ahead and execute the entire notebook. This will create an expectation suite against the real creditcard fraud dataset.

## Step 5: Transform the real data for modeling

Now that we have created the expectation suite, we shift our focus back to creating the synthetic data. We follow the standard process of transforming the data first.

In [5]:
# Extract list of columns
data_cols = list(data.columns[ data.columns != 'Class' ])

# Before training the GAN, do not forget to apply the required data transformations.
# For simplicity, we apply a PowerTransformation here, which makes the data distribution more Gaussian-like.
_, data, preprocessor = transformations(data)

# For the purpose of this example we will only synthesize the minority class
# train_data contains 492 rows which had 'Class' value as 1 (which were very few)
train_data = data.loc[ data['Class']==1 ].copy()

print("Dataset info: Number of records - {} Number of variables - {}".format(train_data.shape[0], train_data.shape[1]))

# We define a K-means clustering method using sklearn, and declare that
# we want 2 clusters. We then apply this algorithm (fit_predict) to our train_data
# We essentially get an array of 492 rows ('labels') having values either 0 or 1 for the 2 clustered classes.
algorithm = cluster.KMeans
args, kwds = (), {'n_clusters':2, 'random_state':0}
labels = algorithm(*args, **kwds).fit_predict(train_data[ data_cols ])

# Get the count of both classes
print( pd.DataFrame( [ [np.sum(labels==i)] for i in np.unique(labels) ], columns=['count'], index=np.unique(labels) ) )

# Assign the k-means cluster labels to a separate copy of the train data, 'fraud_w_classes'
fraud_w_classes = train_data.copy()
fraud_w_classes['Class'] = labels

Dataset info: Number of records - 492 Number of variables - 31
   count
0    375
1    117
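
The count table above can also be produced more directly with `np.unique(..., return_counts=True)`. The sketch below uses a short made-up label array in place of the KMeans output.

```python
import numpy as np

# Toy cluster assignments standing in for the KMeans 'labels' array.
labels = np.array([0, 0, 1, 0, 1, 0, 0])

# return_counts=True gives the distinct labels and how often each occurs.
values, counts = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(values.tolist(), counts.tolist()))
print(cluster_sizes)  # {0: 5, 1: 2}
```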


In [6]:
train_data.shape

(492, 31)

## Step 6: Train the synthesizers and create the model

Below you can try to train your own generators using the available GAN architectures. You can train them either with labels (created using KMeans) or with no labels at all.

Remember that for this exercise in particular we've decided to synthesize only the minority class from the Credit Fraud dataset.

In [7]:
# Define the GAN and training parameters
noise_dim = 32
dim = 128
batch_size = 128

log_step = 100
epochs = 200+1
learning_rate = 5e-4
beta_1 = 0.5
beta_2 = 0.9
models_dir = './cache'

train_sample = fraud_w_classes.copy().reset_index(drop=True)
print("train_sample.columns:")
print(train_sample.columns)

# KMeans produced two clusters (0 and 1). One-hot encoding 'Class' with drop_first=True
# leaves a single indicator column, 'Class_1', which tells whether a sample belongs to cluster 1 or not.
train_sample = pd.get_dummies(train_sample, columns=['Class'], prefix='Class', drop_first=True)

# 'Class_1' label
label_cols = [ i for i in train_sample.columns if 'Class' in i ]

# All columns except 'Class_1'
data_cols = [ i for i in train_sample.columns if i not in label_cols ]

# Scale down the data, and rename it to 'train_no_label'
train_sample[ data_cols ] = train_sample[ data_cols ] / 10 # scale to random noise size, one less thing to learn
train_no_label = train_sample[ data_cols ]

train_sample.columns:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
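
The effect of `pd.get_dummies(..., drop_first=True)` on the label column is easier to see on a tiny made-up frame. The values below are invented; the column handling mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"V1": [0.1, -0.4, 0.7], "Class": [0, 1, 0]})

# drop_first=True drops the 'Class_0' indicator and keeps only 'Class_1'.
encoded = pd.get_dummies(toy, columns=["Class"], prefix="Class", drop_first=True)

# Same column split as in the notebook cell.
label_cols = [c for c in encoded.columns if "Class" in c]
data_cols = [c for c in encoded.columns if c not in label_cols]

print(list(encoded.columns))
print(encoded["Class_1"].astype(int).tolist())
```

Note that the column list printed in the notebook output above was captured before the encoding, which is why it still shows `Class` instead of `Class_1`.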


In [8]:
#Setting the GAN model parameters and the training step parameters
gan_args = ModelParameters(batch_size=batch_size,
                           lr=learning_rate,
                           betas=(beta_1, beta_2),
                           noise_dim=noise_dim,
                           n_cols=train_sample.shape[1],
                           layers_dim=dim)

train_args = TrainParameters(epochs=epochs,
                             sample_interval=log_step)

In [9]:
# Training the GAN model chosen: Vanilla GAN, CGAN, DCGAN, etc.
synthesizer = model(gan_args)
synthesizer.train(train_sample, train_args)

0 [D loss: 0.636076, acc.: 50.39%] [G loss: 0.666174]
generated_data
1 [D loss: 0.740944, acc.: 49.22%] [G loss: 0.543567]
2 [D loss: 0.721365, acc.: 49.22%] [G loss: 0.680754]
3 [D loss: 0.632399, acc.: 80.47%] [G loss: 0.947135]
4 [D loss: 0.618473, acc.: 80.47%] [G loss: 0.952196]
5 [D loss: 0.571622, acc.: 89.06%] [G loss: 0.970230]
6 [D loss: 0.569562, acc.: 86.72%] [G loss: 0.880279]
7 [D loss: 0.613530, acc.: 51.17%] [G loss: 0.731060]
8 [D loss: 0.573943, acc.: 74.61%] [G loss: 0.806645]
9 [D loss: 0.676134, acc.: 62.11%] [G loss: 0.721362]
10 [D loss: 0.636380, acc.: 52.73%] [G loss: 0.966309]
11 [D loss: 0.395751, acc.: 85.55%] [G loss: 1.512076]
12 [D loss: 0.330358, acc.: 88.67%] [G loss: 1.560551]
13 [D loss: 0.458118, acc.: 74.22%] [G loss: 1.024349]
14 [D loss: 0.722985, acc.: 41.41%] [G loss: 0.628383]
15 [D loss: 0.760535, acc.: 35.55%] [G loss: 0.659330]
16 [D loss: 0.818688, acc.: 25.78%] [G loss: 0.656721]
17 [D loss: 0.757173, acc.: 29.69%] [G loss: 0.808466]
18 [D loss: 0.684740, acc.: 53.52%] [G loss: 0.972274]
19 [D loss: 0.642705, acc.: 67.19%] [G loss: 0.983413]
20 [D loss: 0.584872, acc.: 77.34%] [G loss: 1.054361]
21 [D loss: 0.558246, acc.: 79.69%] [G loss: 1.005306]
22 [D loss: 0.587358, acc.: 68.75%] [G loss: 0.896084]
23 [D loss: 0.592873, acc.: 69.14%] [G loss: 0.882824]
24 [D loss: 0.551339, acc.: 75.39%] [G loss: 1.027531]
25 [D loss: 0.465764, acc.: 80.08%] [G loss: 1.203178]
26 [D loss: 0.579346, acc.: 65.62%] [G loss: 0.985317]
27 [D loss: 0.606635, acc.: 65.62%] [G loss: 0.977058]
28 [D loss: 0.664233, acc.: 54.30%] [G loss: 0.813210]
29 [D loss: 0.717089, acc.: 42.58%] [G loss: 0.855176]
30 [D loss: 0.602405, acc.: 73.44%] [G loss: 1.057513]
31 [D loss: 0.651200, acc.: 64.84%] [G loss: 0.937345]
32 [D loss: 0.737108, acc.: 55.86%] [G loss: 0.862930]
33 [D loss: 0.648462, acc.: 64.06%] [G loss: 1.108585]
34 [D loss: 0.678109, acc.: 56.25%] [G loss: 1.014116]
35 [D loss: 0.719165, acc.: 52.34%] [G loss: 0.933629]
36 [D loss: 0.727754, acc.: 51.56%] [G loss: 0.913853]
37 [D loss: 0.744828, acc.: 47.27%] [G loss: 0.863923]
38 [D loss: 0.686736, acc.: 60.16%] [G loss: 0.843922]
39 [D loss: 0.593250, acc.: 76.17%] [G loss: 0.963020]
40 [D loss: 0.578069, acc.: 79.30%] [G loss: 1.025159]
41 [D loss: 0.653506, acc.: 55.86%] [G loss: 0.899542]
42 [D loss: 0.808087, acc.: 33.20%] [G loss: 0.800426]
43 [D loss: 0.787609, acc.: 38.67%] [G loss: 0.882919]
44 [D loss: 0.650217, acc.: 62.89%] [G loss: 1.097704]
45 [D loss: 0.574599, acc.: 73.44%] [G loss: 1.063861]
46 [D loss: 0.594158, acc.: 71.88%] [G loss: 1.011992]
47 [D loss: 0.617718, acc.: 71.48%] [G loss: 1.007216]
48 [D loss: 0.668223, acc.: 55.47%] [G loss: 0.849152]
49 [D loss: 0.653734, acc.: 57.42%] [G loss: 0.816721]
50 [D loss: 0.630357, acc.: 64.45%] [G loss: 0.837308]
51 [D loss: 0.599089, acc.: 68.36%] [G loss: 0.906127]
52 [D loss: 0.541291, acc.: 80.08%] [G loss: 1.025464]
53 [D loss: 0.685293, acc.: 49.61%] [G loss: 0.771237]
54 [D loss: 0.724142, acc.: 36.72%] [G loss: 0.796941]
55 [D loss: 0.646947, acc.: 58.98%] [G loss: 1.043048]
56 [D loss: 0.621978, acc.: 67.19%] [G loss: 1.024598]
57 [D loss: 0.598206, acc.: 70.31%] [G loss: 1.168633]
58 [D loss: 0.570087, acc.: 73.05%] [G loss: 1.007023]
59 [D loss: 0.574943, acc.: 73.44%] [G loss: 0.933127]
60 [D loss: 0.651551, acc.: 60.55%] [G loss: 0.857460]
61 [D loss: 0.675130, acc.: 53.12%] [G loss: 0.853580]
62 [D loss: 0.620275, acc.: 69.92%] [G loss: 1.061840]
63 [D loss: 0.563554, acc.: 77.73%] [G loss: 1.173257]
64 [D loss: 0.600068, acc.: 70.31%] [G loss: 1.084824]
65 [D loss: 0.639127, acc.: 65.62%] [G loss: 0.957905]
66 [D loss: 0.714617, acc.: 49.22%] [G loss: 0.830868]
67 [D loss: 0.626827, acc.: 65.62%] [G loss: 1.014262]
68 [D loss: 0.547576, acc.: 79.30%] [G loss: 1.196927]
69 [D loss: 0.583984, acc.: 71.09%] [G loss: 1.011835]
70 [D loss: 0.713640, acc.: 51.56%] [G loss: 0.812505]
71 [D loss: 0.685693, acc.: 50.39%] [G loss: 0.859980]
72 [D loss: 0.575943, acc.: 73.83%] [G loss: 1.010275]
73 [D loss: 0.564789, acc.: 70.31%] [G loss: 1.013620]
74 [D loss: 0.672123, acc.: 56.64%] [G loss: 0.907849]
75 [D loss: 0.695018, acc.: 52.73%] [G loss: 1.000841]
76 [D loss: 0.626263, acc.: 64.84%] [G loss: 1.165671]
77 [D loss: 0.636230, acc.: 61.72%] [G loss: 1.238726]
78 [D loss: 0.580019, acc.: 70.70%] [G loss: 1.188218]
79 [D loss: 0.495347, acc.: 82.42%] [G loss: 1.279363]
80 [D loss: 0.487096, acc.: 82.42%] [G loss: 1.154888]
81 [D loss: 0.588233, acc.: 68.36%] [G loss: 0.993040]
82 [D loss: 0.625974, acc.: 64.45%] [G loss: 1.047866]
83 [D loss: 0.687841, acc.: 57.81%] [G loss: 1.131772]
84 [D loss: 0.613222, acc.: 63.67%] [G loss: 1.097470]
85 [D loss: 0.650196, acc.: 58.98%] [G loss: 1.133914]
86 [D loss: 0.688790, acc.: 60.94%] [G loss: 0.973795]
87 [D loss: 0.702085, acc.: 52.34%] [G loss: 0.966511]
88 [D loss: 0.690311, acc.: 55.47%] [G loss: 0.973021]
89 [D loss: 0.631358, acc.: 62.11%] [G loss: 1.061102]
90 [D loss: 0.578601, acc.: 71.88%] [G loss: 1.200727]
91 [D loss: 0.604068, acc.: 68.75%] [G loss: 1.088345]
92 [D loss: 0.630844, acc.: 66.80%] [G loss: 1.014538]
93 [D loss: 0.595021, acc.: 73.05%] [G loss: 1.086882]
94 [D loss: 0.592853, acc.: 68.36%] [G loss: 1.020879]
95 [D loss: 0.608857, acc.: 66.80%] [G loss: 1.033632]
96 [D loss: 0.648420, acc.: 63.67%] [G loss: 1.085856]
97 [D loss: 0.556466, acc.: 76.56%] [G loss: 1.128967]
98 [D loss: 0.591422, acc.: 71.09%] [G loss: 1.140468]
99 [D loss: 0.681237, acc.: 55.08%] [G loss: 0.889258]
100 [D loss: 0.661264, acc.: 58.98%] [G loss: 0.928145]
generated_data
101 [D loss: 0.660793, acc.: 55.86%] [G loss: 0.934039]
102 [D loss: 0.628608, acc.: 63.67%] [G loss: 1.118889]
103 [D loss: 0.606895, acc.: 66.80%] [G loss: 1.224909]
104 [D loss: 0.604994, acc.: 66.41%] [G loss: 1.128781]
105 [D loss: 0.641859, acc.: 65.23%] [G loss: 1.088796]
106 [D loss: 0.623813, acc.: 67.58%] [G loss: 1.074989]
107 [D loss: 0.629782, acc.: 67.58%] [G loss: 1.038528]
108 [D loss: 0.635973, acc.: 65.23%] [G loss: 1.032716]
109 [D loss: 0.655529, acc.: 62.89%] [G loss: 0.931407]
110 [D loss: 0.577298, acc.: 73.83%] [G loss: 1.137161]
111 [D loss: 0.590042, acc.: 68.75%] [G loss: 1.088559]
112 [D loss: 0.706538, acc.: 52.73%] [G loss: 0.929068]
113 [D loss: 0.658264, acc.: 58.98%] [G loss: 1.020436]
114 [D loss: 0.630122, acc.: 66.41%] [G loss: 1.119588]
115 [D loss: 0.624879, acc.: 67.19%] [G loss: 1.029573]
116 [D loss: 0.655716, acc.: 62.89%] [G loss: 1.044257]
117 [D loss: 0.603690, acc.: 71.88%] [G loss: 0.997266]
118 [D loss: 0.584655, acc.: 70.31%] [G loss: 1.053270]
119 [D loss: 0.576636, acc.: 73.05%] [G loss: 0.993130]
120 [D loss: 0.619042, acc.: 64.06%] [G loss: 0.988995]
121 [D loss: 0.638178, acc.: 61.33%] [G loss: 0.984475]
122 [D loss: 0.636288, acc.: 63.67%] [G loss: 1.099901]
123 [D loss: 0.575220, acc.: 69.92%] [G loss: 1.197113]
124 [D loss: 0.554152, acc.: 73.44%] [G loss: 1.145599]
125 [D loss: 0.610211, acc.: 67.58%] [G loss: 1.085784]
126 [D loss: 0.550637, acc.: 75.00%] [G loss: 1.245064]
127 [D loss: 0.577519, acc.: 72.27%] [G loss: 1.181745]
128 [D loss: 0.609230, acc.: 67.58%] [G loss: 1.056980]
129 [D loss: 0.556674, acc.: 72.66%] [G loss: 1.181112]
130 [D loss: 0.530695, acc.: 75.39%] [G loss: 1.234228]
131 [D loss: 0.554094, acc.: 76.17%] [G loss: 1.121365]
132 [D loss: 0.574552, acc.: 69.53%] [G loss: 1.084868]
133 [D loss: 0.559596, acc.: 72.66%] [G loss: 1.180134]
134 [D loss: 0.640179, acc.: 64.84%] [G loss: 1.082688]
135 [D loss: 0.603372, acc.: 73.05%] [G loss: 1.118760]
136 [D loss: 0.518803, acc.: 75.39%] [G loss: 1.316881]
137 [D loss: 0.535368, acc.: 75.78%] [G loss: 1.415515]
138 [D loss: 0.519845, acc.: 79.69%] [G loss: 1.325203]
139 [D loss: 0.512873, acc.: 77.34%] [G loss: 1.306228]
140 [D loss: 0.602440, acc.: 63.67%] [G loss: 1.061125]
141 [D loss: 0.571597, acc.: 74.61%] [G loss: 1.076455]
142 [D loss: 0.505982, acc.: 76.17%] [G loss: 1.386549]
143 [D loss: 0.518357, acc.: 76.95%] [G loss: 1.203856]
144 [D loss: 0.562238, acc.: 72.66%] [G loss: 1.184477]
145 [D loss: 0.558663, acc.: 73.05%] [G loss: 1.267951]
146 [D loss: 0.538481, acc.: 73.05%] [G loss: 1.305417]
147 [D loss: 0.576785, acc.: 69.53%] [G loss: 1.222218]
148 [D loss: 0.559799, acc.: 71.48%] [G loss: 1.212079]
149 [D loss: 0.546238, acc.: 70.70%] [G loss: 1.311506]
150 [D loss: 0.524328, acc.: 73.05%] [G loss: 1.326872]
151 [D loss: 0.567025, acc.: 68.36%] [G loss: 1.229630]
152 [D loss: 0.540963, acc.: 73.44%] [G loss: 1.198694]
153 [D loss: 0.595928, acc.: 67.58%] [G loss: 1.201284]
154 [D loss: 0.561886, acc.: 72.66%] [G loss: 1.332510]
155 [D loss: 0.500545, acc.: 75.00%] [G loss: 1.446318]
156 [D loss: 0.516893, acc.: 77.73%] [G loss: 1.301750]
157 [D loss: 0.550242, acc.: 75.39%] [G loss: 1.176880]
158 [D loss: 0.568529, acc.: 67.58%] [G loss: 1.216639]
159 [D loss: 0.541624, acc.: 71.88%] [G loss: 1.280236]
160 [D loss: 0.517430, acc.: 74.61%] [G loss: 1.348273]
161 [D loss: 0.529361, acc.: 75.00%] [G loss: 1.313916]
162 [D loss: 0.555137, acc.: 73.05%] [G loss: 1.331454]
163 [D loss: 0.541644, acc.: 74.61%] [G loss: 1.288050]
164 [D loss: 0.556094, acc.: 73.05%] [G loss: 1.179869]
165 [D loss: 0.556641, acc.: 73.83%] [G loss: 1.265802]
166 [D loss: 0.548724, acc.: 75.39%] [G loss: 1.389965]
167 [D loss: 0.527993, acc.: 73.83%] [G loss: 1.265041]
168 [D loss: 0.577306, acc.: 74.61%] [G loss: 1.136818]
169 [D loss: 0.605786, acc.: 69.53%] [G loss: 1.083936]
170 [D loss: 0.525909, acc.: 73.83%] [G loss: 1.378801]
171 [D loss: 0.482942, acc.: 77.73%] [G loss: 1.402089]
172 [D loss: 0.524541, acc.: 74.22%] [G loss: 1.371125]
173 [D loss: 0.589603, acc.: 66.80%] [G loss: 1.206686]
174 [D loss: 0.586794, acc.: 69.14%] [G loss: 1.226057]
175 [D loss: 0.503211, acc.: 78.52%] [G loss: 1.298704]
176 [D loss: 0.481943, acc.: 81.64%] [G loss: 1.418530]
177 [D loss: 0.492204, acc.: 79.69%] [G loss: 1.492362]
178 [D loss: 0.527489, acc.: 76.17%] [G loss: 1.443672]
179 [D loss: 0.509164, acc.: 77.34%] [G loss: 1.518598]
180 [D loss: 0.502608, acc.: 77.73%] [G loss: 1.337065]
181 [D loss: 0.526453, acc.: 73.05%] [G loss: 1.381773]
182 [D loss: 0.500595, acc.: 76.95%] [G loss: 1.340854]
183 [D loss: 0.461662, acc.: 82.42%] [G loss: 1.493798]
184 [D loss: 0.534295, acc.: 73.83%] [G loss: 1.289783]
185 [D loss: 0.574632, acc.: 72.27%] [G loss: 1.300255]
186 [D loss: 0.481188, acc.: 78.52%] [G loss: 1.465832]
187 [D loss: 0.471949, acc.: 78.52%] [G loss: 1.515759]
188 [D loss: 0.519134, acc.: 75.00%] [G loss: 1.515703]
189 [D loss: 0.549161, acc.: 71.09%] [G loss: 1.343797]
190 [D loss: 0.559190, acc.: 72.27%] [G loss: 1.250807]
191 [D loss: 0.530266, acc.: 77.34%] [G loss: 1.282847]
192 [D loss: 0.462458, acc.: 78.91%] [G loss: 1.571291]
193 [D loss: 0.469717, acc.: 79.30%] [G loss: 1.663337]
194 [D loss: 0.553423, acc.: 71.09%] [G loss: 1.307566]
195 [D loss: 0.519310, acc.: 75.39%] [G loss: 1.375313]
196 [D loss: 0.528041, acc.: 75.00%] [G loss: 1.426011]
197 [D loss: 0.498662, acc.: 77.73%] [G loss: 1.432637]

198 [D loss: 0.496180, acc.: 76.95%] [G loss: 1.309457]
199 [D loss: 0.548137, acc.: 74.61%] [G loss: 1.389564]


100%|████████████████████████████████████████████████████████████████████████████████| 201/201 [00:28<00:00,  7.09it/s]

200 [D loss: 0.498236, acc.: 77.34%] [G loss: 1.381582]
generated_data





In [10]:
# Generator description
synthesizer.generator.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(128, 32)]               0         
_________________________________________________________________
dense (Dense)                (128, 128)                4224      
_________________________________________________________________
dense_1 (Dense)              (128, 256)                33024     
_________________________________________________________________
dense_2 (Dense)              (128, 512)                131584    
_________________________________________________________________
dense_3 (Dense)              (128, 31)                 15903     
Total params: 184,735
Trainable params: 184,735
Non-trainable params: 0
_________________________________________________________________


In [11]:
# Discriminator description
synthesizer.discriminator.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(128, 31)]               0         
_________________________________________________________________
dense_4 (Dense)              (128, 512)                16384     
_________________________________________________________________
dropout (Dropout)            (128, 512)                0         
_________________________________________________________________
dense_5 (Dense)              (128, 256)                131328    
_________________________________________________________________
dropout_1 (Dropout)          (128, 256)                0         
_________________________________________________________________
dense_6 (Dense)              (128, 128)                32896     
_________________________________________________________________
dense_7 (Dense)              (128, 1)                  129 
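
The parameter counts in the two summaries can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and Dropout layers add no parameters. Here is a minimal sketch using the layer widths read off the summaries above. The helper name `dense_params` is made up for this illustration and is not part of ydata-synthetic or Keras.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected (Dense) layer."""
    return n_in * n_out + n_out


def layer_counts(widths):
    """Parameter count of each Dense layer in a stack with the given widths."""
    return [dense_params(a, b) for a, b in zip(widths, widths[1:])]


# Widths taken from the generator summary: 32 -> 128 -> 256 -> 512 -> 31
generator_counts = layer_counts([32, 128, 256, 512, 31])

# Widths taken from the discriminator summary: 31 -> 512 -> 256 -> 128 -> 1
discriminator_counts = layer_counts([31, 512, 256, 128, 1])

print(generator_counts, sum(generator_counts))
print(discriminator_counts)
```

The generator counts add up to the 184,735 total parameters reported by `summary()`.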

In [12]:
# You can easily save the trained generator and load it again afterwards
os.makedirs("./saved/gan", exist_ok=True)
synthesizer.save(path="./saved/gan/generator_fraud.pkl")

In [13]:
models = {'GAN': ['GAN', False, synthesizer.generator]}

## Step 7: Sample synthetic data from the Synthesizer

In [14]:
# use the same shape as the real data

synthetic_fraud = synthesizer.sample(492)

Synthetic data generation: 100%|████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 190.49it/s]
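
The progress bar above reports 4 steps for 492 requested rows. That is consistent with the batch size of 128 shown in the model summaries, since 492 rows need four batches. This reading is an inference from the output, not documented behavior. Whether the returned frame is trimmed to exactly 492 rows may depend on the library version, so it is worth checking `synthetic_fraud.shape` before relying on the row count.

```python
import math

requested_rows = 492  # rows asked for in synthesizer.sample(492)
batch_size = 128      # batch dimension shown in the model summaries

# Number of generator batches needed to cover the requested rows
steps = math.ceil(requested_rows / batch_size)

# Rows produced if the last batch is NOT trimmed (an assumption to verify)
untrimmed_rows = steps * batch_size

print(steps, untrimmed_rows)
```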


In [15]:
synthetic_fraud.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.028322 | -0.056973 | 0.075122 | -0.077015 | 0.176086 | 0.000401 | -0.006622 | -0.054043 | 0.023125 | -0.138701 | ... | 0.033368 | -0.05168 | 0.035562 | -0.060543 | 0.052678 | 0.00306 | 0.028167 | 0.003305 | -0.097905 | -0.024802 |
| 1 | 0.003791 | -0.060411 | 0.11622 | -0.123338 | 0.207262 | -0.004066 | -0.038257 | -0.122285 | 0.059356 | -0.15998 | ... | 0.055218 | -0.054451 | 0.027974 | -0.054083 | 0.038001 | 0.035491 | 0.046303 | 0.00858 | -0.114265 | -0.016638 |
| 2 | -0.084681 | 0.027821 | -0.008397 | -0.090557 | 0.092042 | -0.02693 | 0.04258 | 0.026009 | -0.0035 | -0.001856 | ... | 0.042665 | -0.100598 | 0.006049 | -0.049122 | 0.036208 | -0.045092 | 0.004077 | -0.017562 | -0.063882 | -0.04682 |
| 3 | 0.024048 | -0.055298 | -0.004408 | -0.118931 | 0.139287 | -0.070903 | -0.009979 | -0.057477 | 0.043256 | -0.128425 | ... | 0.045004 | -0.013676 | 0.112154 | -0.037354 | 0.024974 | 0.044594 | 0.063829 | -0.083527 | 0.011412 | -0.039752 |
| 4 | 0.090478 | -0.014989 | 0.02756 | -0.088822 | 0.147422 | 0.037545 | 0.017765 | -0.013605 | -0.002967 | -0.097704 | ... | 0.051536 | -0.086814 | 0.045123 | -0.031647 | -0.003462 | -0.010022 | -0.021095 | -0.008208 | -0.047879 | -0.011579 |


However, notice that the generated synthetic data is still in its transformed form and needs to be inverse-transformed to the original format.

## Step 8: Inverse transform the data to obtain the original format

In [16]:
synthetic_data = inverse_transform(synthetic_fraud, preprocessor)

  "X does not have valid feature names, but"
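
To make the idea behind the inverse transform concrete, here is a minimal, library-free sketch. It assumes plain standard scaling purely for illustration. The actual `preprocessor` fitted earlier in this notebook may apply different transformations, and the `StandardScaling` class and the sample values below are hypothetical. The point is only that the preprocessor remembers what it learned from the real data, and the inverse step uses those stored parameters to map model outputs back to the original units.

```python
from statistics import mean, pstdev


class StandardScaling:
    """Toy stand-in for a fitted preprocessor (illustration only)."""

    def fit(self, values):
        # Parameters learned from the real data and reused by the inverse step
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for constant columns
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, values):
        return [v * self.std_ + self.mean_ for v in values]


real_amounts = [1.0, 99.99, 250.0, 529.0]  # made-up values in original units
scaler = StandardScaling().fit(real_amounts)

scaled = scaler.transform(real_amounts)      # the form the model is trained on
restored = scaler.inverse_transform(scaled)  # back to the original units

print([round(v, 2) for v in restored])
```

Synthetic rows produced by the generator live in the same scaled space as `scaled`, which is why they go through the same inverse step before being compared with the real data.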


In [17]:
synthetic_data.head()

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.038542 | -0.032114 | 0.064474 | -0.079785 | 0.179731 | 0.000191 | 0.007486 | -0.054481 | 0.01159 | -0.134371 | ... | -0.008922 | 0.035842 | -0.052928 | 0.03323 | -0.06104 | 0.049955 | 0.003665 | 0.027846 | 0.00815 | -0.095918 |
| 1 | 0.070174 | -0.035555 | 0.105647 | -0.126073 | 0.210845 | -0.004276 | -0.024071 | -0.122712 | 0.047731 | -0.155687 | ... | -0.007819 | 0.057671 | -0.055698 | 0.025641 | -0.054581 | 0.035273 | 0.036095 | 0.045982 | 0.013416 | -0.112287 |
| 2 | -0.017405 | 0.052484 | -0.019017 | -0.093318 | 0.095807 | -0.02714 | 0.056479 | 0.025569 | -0.014931 | 0.002582 | ... | 0.005489 | 0.045131 | -0.101832 | 0.003719 | -0.049621 | 0.033479 | -0.044488 | 0.003756 | -0.01269 | -0.06188 |
| 3 | 0.090039 | -0.030438 | -0.015035 | -0.12167 | 0.142993 | -0.071112 | 0.004139 | -0.057914 | 0.031664 | -0.124079 | ... | -0.026704 | 0.047467 | -0.014928 | 0.109859 | -0.037854 | 0.022243 | 0.045197 | 0.063509 | -0.078634 | 0.013425 |
| 4 | 0.154732 | 0.009834 | 0.016898 | -0.091585 | 0.151116 | 0.037336 | 0.031782 | -0.014045 | -0.0144 | -0.093318 | ... | -0.008395 | 0.053993 | -0.088053 | 0.042792 | -0.032147 | -0.006194 | -0.009417 | -0.021416 | -0.003347 | -0.045873 |


In [18]:
# Add back the Class column (the synthesizer was trained on fraud records only)
synthetic_data['Class'] = 1

# Write the data to disk so it can be validated in Step 9
synthetic_data.to_csv('./data/creditcard_fraud_synthetic.csv', index=False)

## Step 9: Create a new checkpoint to validate the synthetic data against the real data

For the normal usage of Great Expectations, the best way to validate data is with a Checkpoint. Checkpoints bundle Batches of data with corresponding Expectation Suites for validation.

From the terminal, run the following command:
```great_expectations --v3-api checkpoint new my_new_checkpoint```

This again opens a Jupyter Notebook in which you complete the configuration of the Checkpoint. Set the data_asset_name to the data we want to validate, "creditcard_fraud_synthetic.csv" (the file we wrote in Step 8), and make sure that the expectation_suite_name is identical to the suite we created in Step 4.

Once done, go ahead and execute all the cells in the notebook.
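
For orientation, the configuration cell in the generated notebook is a YAML string along the lines of the sketch below. This is an approximation of the V3 template, not a copy of it: the datasource name `my_datasource` and the suite name `creditcard.quality` are placeholders, and the keys in your generated notebook should be treated as authoritative. Only the two values called out above normally need editing.

```python
# Placeholders: replace with the names used in your own project
datasource_name = "my_datasource"              # hypothetical datasource name
expectation_suite_name = "creditcard.quality"  # hypothetical suite name from Step 4
data_asset_name = "creditcard_fraud_synthetic.csv"  # file written in Step 8

# Approximate shape of the Checkpoint configuration edited in the notebook
yaml_config = f"""
name: my_new_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: {datasource_name}
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: {data_asset_name}
    expectation_suite_name: {expectation_suite_name}
"""

print(yaml_config)
```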

## Step 10: Evaluate the synthetic data using Data Docs

If you've been following along, you have now created a new Checkpoint to validate the synthetic data. The final step is to uncomment the last cell of the Checkpoint notebook and execute it.

This opens an HTML page titled Data Docs. Inspecting the Data Docs for the most recent Checkpoint run, we can see that the validation has failed. Clicking on the Checkpoint run gives a detailed report of which expectations failed on which columns.

Based on this report, we can take either of these actions:

- Go back to the synthesizer, tweak its parameters, and optimize it to get better synthetic data.
- Go back to the Expectation Suite and edit the expectations that are not important (for certain columns, for example). Yes, the expectations are customizable, and here's how you can do it: https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/how_to_create_custom_expectations
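
To see what a failed expectation on a column means in practice, here is a plain-Python imitation of a range check in the spirit of `expect_column_values_to_be_between` with a `mostly` threshold. It is not Great Expectations' implementation. The synthetic values are the five Amount entries shown in Step 8, while the real-data bounds are made up for the example.

```python
def values_between(values, min_value, max_value, mostly=1.0):
    """Return (success, fraction_in_range) for a range check on one column."""
    in_range = sum(min_value <= v <= max_value for v in values)
    fraction = in_range / len(values)
    return fraction >= mostly, fraction


# Bounds a profiler might have learned from the real Amount column (made up)
real_min, real_max = 0.0, 1500.0

# Amount values of the first five synthetic rows shown in Step 8
synthetic_amount = [-0.095918, -0.112287, -0.06188, 0.013425, -0.045873]

success, fraction = values_between(
    synthetic_amount, real_min, real_max, mostly=0.95
)
print(success, fraction)
```

Under these assumed bounds, four of the five synthetic amounts are negative, so the check fails. Either the synthesizer needs tuning or the expectation needs relaxing, which mirrors the two options listed above.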

# Conclusion

In this tutorial, we have demonstrated the use of ydata-synthetic alongside Great Expectations. We presented a 10-step guide, starting from configuring a Data Context and ending with evaluating the synthesized data using Data Docs. We believe that integrating these two libraries can help data scientists unlock the power of synthetic data while keeping data quality in check.