# Using Gretel's Python SDK to generate synthetic data for a sample dataset

This notebook walks you through generating synthetic data from a CSV file or a DataFrame of your choosing with Gretel's Python SDK.

Before starting, create an account in the Gretel Console at https://console.gretel.cloud and generate an API key. The key is required to use the Gretel API for synthetic data generation.


Next, install the gretel-client package with pip.

In [1]:
%%capture
!pip install -U gretel-client

Now enter the API key generated in the Gretel Console.

In [2]:
# Specify your Gretel API key

from getpass import getpass
import pandas as pd
from gretel_client import configure_session, ClientConfig

pd.set_option('max_colwidth', None)

configure_session(ClientConfig(api_key=getpass(prompt="Enter Gretel API key"), 
                               endpoint="https://api.gretel.cloud"))

                            

Enter Gretel API key··········
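
An interactive prompt is not always available, for example when the notebook is executed by a scheduler. The following is a minimal sketch of one way to handle that, assuming the key is stored in an environment variable. The variable name `GRETEL_API_KEY` and the helper `resolve_api_key` are choices made for this illustration, not something this notebook defines.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, env_var="GRETEL_API_KEY", prompt=getpass):
    """Return the API key from the environment, or ask for it interactively.

    `env_var` is an illustrative variable name (an assumption), and `prompt`
    is injectable so the function can be exercised without a terminal.
    """
    env = os.environ if env is None else env
    key = env.get(env_var, "").strip()
    if key:
        return key
    # Fall back to a hidden prompt, as the cell above does.
    key = prompt("Enter Gretel API key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

The result could then be passed as the `api_key` argument of `ClientConfig` in place of the direct `getpass` call.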


Next, create a project with the Gretel client.

In [3]:
# Create a project

from gretel_client import create_project

project = create_project(display_name="synthetic-data")

## Set up the configuration file
Gretel configuration templates are available at https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics . In this notebook, we load the default configuration template, which works well for most datasets, and override the number of training epochs.

In [4]:
import json
from smart_open import open
import yaml

with open("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml", 'r') as stream:
    config = yaml.safe_load(stream)

# Set the model epochs to 50
config['models'][0]['synthetics']['params']['epochs'] = 50

print(json.dumps(config, indent=2))

{
  "schema_version": "1.0",
  "models": [
    {
      "synthetics": {
        "data_source": "__tmp__",
        "params": {
          "epochs": 50,
          "batch_size": 64,
          "vocab_size": 20000,
          "reset_states": false,
          "learning_rate": 0.01,
          "rnn_units": 256,
          "dropout_rate": 0.2,
          "overwrite": true,
          "early_stopping": true,
          "gen_temp": 1.0,
          "predict_batch_size": 64,
          "validation_split": false,
          "dp": false,
          "dp_noise_multiplier": 0.001,
          "dp_l2_norm_clip": 5.0,
          "dp_microbatches": 1,
          "data_upsample_limit": 10000
        },
        "validators": {
          "in_set_count": 10,
          "pattern_count": 10
        },
        "generate": {
          "num_records": 5000,
          "max_invalid": null
        },
        "privacy_filters": {
          "outliers": "medium",
          "similarity": "medium"
        }
      }
    }
  ]
}
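
The cell above changes a single parameter by indexing into the nested dictionary. When several parameters are tuned, a small helper can catch misspelled keys before a job is submitted. This is a sketch only: `with_param_overrides` is a hypothetical helper, and the `template` dictionary is a trimmed stand-in that copies a few values from the output printed above.

```python
import copy


def with_param_overrides(config, overrides, model_type="synthetics"):
    """Return a copy of `config` with training params replaced.

    Only keys already present in the template are accepted, so a typo such
    as "epoch" raises an error instead of being silently added.
    """
    updated = copy.deepcopy(config)  # leave the loaded template untouched
    params = updated["models"][0][model_type]["params"]
    unknown = sorted(set(overrides) - set(params))
    if unknown:
        raise KeyError(f"Unknown params: {unknown}")
    params.update(overrides)
    return updated


# Trimmed stand-in for the loaded template (values copied from the output above).
template = {
    "schema_version": "1.0",
    "models": [
        {
            "synthetics": {
                "data_source": "__tmp__",
                "params": {"epochs": 50, "batch_size": 64, "learning_rate": 0.01},
            }
        }
    ],
}

tuned = with_param_overrides(template, {"epochs": 25, "learning_rate": 0.001})
print(tuned["models"][0]["synthetics"]["params"])
```

Working on a deep copy keeps the original template available for comparison or for building a second configuration.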


## Load and preview the source dataset

Specify a data source to train the model on. This can be a local file, a web location, or an HDFS file. In this notebook, we use one of the publicly available datasets on the Gretel website.


In [5]:
# Load and preview the DataFrame to train the synthetic model on.
import pandas as pd

dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv'
df = pd.read_csv(dataset_path)
df.to_csv('training_data.csv', index=False)
df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
1,34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
2,34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
3,69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
4,26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,42,Self-emp-inc,287037,12th,8,Divorced,Craft-repair,Not-in-family,White,Male,0,0,10,United-States,<=50K
4996,48,Private,236858,11th,7,Divorced,Other-service,Not-in-family,White,Female,0,0,31,United-States,<=50K
4997,53,Private,317313,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,60,United-States,>50K
4998,23,Private,113601,Some-college,10,Never-married,Handlers-cleaners,Own-child,White,Male,0,0,30,United-States,<=50K


## Train the Gretel Synthetic model

In this step, a worker running in Gretel Cloud (or locally) trains a synthetic model on the source dataset loaded in the previous step.
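
The training cell submits the job and then calls `poll`, which blocks while the job moves through statuses such as created, pending, and active. The sketch below illustrates the general polling pattern with a stand-in status source. It is not the gretel_client implementation: `wait_for_job` is a hypothetical helper, and the terminal status names "completed" and "error" are assumptions.

```python
import time


def wait_for_job(get_status, interval=5.0, max_checks=100):
    """Call `get_status()` repeatedly until it reports a terminal status.

    Returns the distinct statuses observed, in order.
    """
    terminal = {"completed", "error"}  # assumed names for finished states
    seen = []
    for _ in range(max_checks):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record only status changes
        if status in terminal:
            return seen
        time.sleep(interval)
    raise TimeoutError(f"Job still '{seen[-1]}' after {max_checks} checks")


# Stand-in for a remote job: each call returns the next status in the list.
fake_statuses = iter(
    ["created", "pending", "pending", "active", "active", "completed"]
)
history = wait_for_job(lambda: next(fake_statuses), interval=0)
print(history)  # ['created', 'pending', 'active', 'completed']
```

Capping the number of checks keeps a stalled job from blocking the notebook indefinitely.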

In [6]:
from gretel_client.helpers import poll

model = project.create_model_obj(model_config=config)
model.data_source = 'training_data.csv'
model.submit(upload_data_source=True)

poll(model)

INFO: Starting poller


{
    "uid": "62263e1a44cf11dc1f9e359e",
    "guid": "model_264CNwh8o038MKpqP7xpXcjGUdq",
    "model_name": "macho-delicate-kolean",
    "runner_mode": "cloud",
    "user_id": "62263d95bff6212fbe3b65ab",
    "user_guid": "user_264C7DAq76ExWdcu0FeuNPtAout",
    "billing_domain": null,
    "billing_domain_guid": null,
    "project_id": "62263ddc9b12a3db339793b8",
    "project_guid": "proj_264CG6rSqOGTyg2ADssYvQwued6",
    "status_history": {
        "created": "2022-03-07T17:17:14.184035Z"
    },
    "last_modified": "2022-03-07T17:17:14.385143Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/gretelai/synthetics@sha256:717a68c0e4ef3000c8b650bbed308162ef10c1b2cb4bfc3026b773bc908ee577",
    "model_type": "synthetics",
    "config": {
        "schema_version": "1.0",
        "name": null,
        "models": [
           

INFO: Status is created. Model creation has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2022-03-07T17:17:30.934359Z  Starting synthetic model training
2022-03-07T17:17:30.936418Z  Loading training data
2022-03-07T17:17:31.124617Z  Training data loaded, detected format: 'csv'
2022-03-07T17:17:31.128116Z  Training data loaded
{
    "record_count": 5000,
    "field_count": 15,
    "upsample_count": 5000
}
2022-03-07T17:17:34.518439Z  Creating semantic validators and preparing training data
2022-03-07T17:17:45.426371Z  Beginning ML model training
2022-03-07T17:17:57.712655Z  Training epoch completed
{
    "epoch": 0,
    "accuracy": 0.2661,
    "loss": 3.7419,
    "val_accuracy": 0,
    "val_loss": 0,
    "batch": 0
}
2022-03-07T17:18:02.281164Z  Training epoch completed
{
    "epoch": 1,
    "accuracy": 0.8304,
    "loss": 0.792,
    "val

## Visualize the generated synthetic data

The cell below loads a preview of the synthetic data produced by the trained model.

In [7]:
# View the synthetic data

synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression='gzip')

synthetic_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,25,Private,181828.0,9th,5,Never-married,Other-service,Own-child,White,Male,0,0,40,?,<=50K
1,33,Private,37402.0,12th,8,Married-civ-spouse,Sales,Husband,White,Male,0,0,60,United-States,<=50K
2,51,Private,307392.0,Masters,14,Married-civ-spouse,Sales,Husband,White,Male,4064,0,50,United-States,<=50K
3,30,?,362685.0,Bachelors,13,Divorced,?,Not-in-family,White,Male,0,0,40,El-Salvador,<=50K
4,21,Private,178309.0,12th,8,Never-married,Other-service,Other-relative,Black,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,49,Local-gov,130554.0,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,United-States,<=50K
4996,35,Private,54878.0,Some-college,10,Never-married,Transport-moving,Own-child,Black,Male,0,0,40,United-States,<=50K
4997,29,State-gov,200835.0,HS-grad,9,Never-married,Handlers-cleaners,Own-child,White,Male,0,0,15,United-States,<=50K
4998,36,Private,44797.0,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,13550,0,40,United-States,>50K
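
Beyond eyeballing the preview, a quick check is to compare how often each category appears in the training data and in the synthetic data. The following is a minimal sketch using only the standard library. `frequency_gap` is a hypothetical helper and the two short lists are toy values, not results from this notebook.

```python
from collections import Counter


def frequency_gap(train_values, synth_values):
    """Per-category relative frequency: synthetic minus training."""
    train_values, synth_values = list(train_values), list(synth_values)
    if not train_values or not synth_values:
        raise ValueError("Both inputs must contain at least one value")
    train_counts = Counter(train_values)
    synth_counts = Counter(synth_values)
    categories = sorted(set(train_counts) | set(synth_counts), key=str)
    return {
        category: round(
            synth_counts[category] / len(synth_values)
            - train_counts[category] / len(train_values),
            4,
        )
        for category in categories
    }


# Toy values for illustration only.
train = ["Private", "Private", "Private", "Self-emp-inc"]
synth = ["Private", "Private", "Local-gov", "State-gov"]
print(frequency_gap(train, synth))
# {'Local-gov': 0.25, 'Private': -0.25, 'Self-emp-inc': -0.25, 'State-gov': 0.25}
```

The function accepts any iterable, so with the DataFrames in this notebook it could be called as `frequency_gap(df["workclass"], synthetic_df["workclass"])`. Values close to zero indicate that a category is about equally common in both datasets.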


## View the synthetic data quality report

We can also render the data quality report, which compares the synthetic data against the original training data, inline in the notebook using IPython.

In [8]:
# Display the report comparing the statistical properties of the training and synthetic data

import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("report")).read())

Synthetic Data Use Cases,Excellent,Good,Moderate,Poor,Very Poor
Significant tuning required to improve model,,,,,
Improve your model using our tips and advice,,,,,
Demo environments or mock data,,,,,
Pre-production testing environments,,,,,
Balance or augment machine learning data sources,,,,,
Machine learning or statistical analysis,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal
"Internally, within the same team",,,,
"Internally, across different teams",,,,
"Externally, with trusted partners",,,,
"Externally, public availability",,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
education,16,0,8.43,Categorical,Good
education_num,16,0,1.55,Categorical,Good
hours_per_week,82,0,1.98,Categorical,Excellent
age,70,0,2.0,Categorical,Excellent
occupation,15,0,12.18,Categorical,Excellent
fnlwgt,4557,0,5.83,Numeric,Excellent
capital_gain,79,0,1.28,Categorical,Excellent
capital_loss,53,0,1.14,Categorical,Excellent
native_country,40,0,12.3,Categorical,Excellent
marital_status,7,0,14.52,Categorical,Excellent


## Generate more synthetic data

We can now use the trained model to generate as many additional synthetic records as we like. The following cells request 100 more records and load the result.
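
The generation request in the next cell passes two parameters, `num_records` and `max_invalid`. The toy model below shows how such a pair could interact, assuming `max_invalid` is a budget of rejected records after which the job gives up. That reading is an assumption, since this notebook does not spell out the exact semantics, and `generate_valid` is a hypothetical helper rather than part of the Gretel SDK.

```python
def generate_valid(candidates, is_valid, num_records, max_invalid):
    """Collect `num_records` valid items, failing once too many are rejected."""
    if num_records <= 0:
        return []
    valid, invalid = [], 0
    for record in candidates:
        if is_valid(record):
            valid.append(record)
            if len(valid) == num_records:
                return valid
        else:
            invalid += 1
            if invalid > max_invalid:
                raise RuntimeError(
                    f"Gave up after {invalid} invalid records "
                    f"({len(valid)} of {num_records} valid)"
                )
    raise RuntimeError("Candidate stream ended before enough valid records")


# Toy candidate stream: ages, where values outside 0-120 count as invalid.
ages = [25, -3, 40, 200, 33]
print(generate_valid(ages, lambda age: 0 < age < 120, num_records=3, max_invalid=2))
# [25, 40, 33]
```

Under this reading, a generous `max_invalid` relative to `num_records` lets a job tolerate a noisy model, while a small value makes it fail fast.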

In [9]:
# Generate more records from the model

record_handler = model.create_record_handler_obj()

record_handler.submit(
    action="generate",
    params={"num_records": 100, "max_invalid": 500}
)

poll(record_handler)

INFO: Starting poller


{
    "uid": "62263f76de63b40cded80fa2",
    "guid": "model_run_264D5cICnG76YJ8YuBsqQ63A6Zn",
    "model_name": null,
    "runner_mode": "cloud",
    "user_id": "62263d95bff6212fbe3b65ab",
    "user_guid": "user_264C7DAq76ExWdcu0FeuNPtAout",
    "billing_domain": null,
    "billing_domain_guid": null,
    "project_id": "62263ddc9b12a3db339793b8",
    "project_guid": "proj_264CG6rSqOGTyg2ADssYvQwued6",
    "status_history": {
        "created": "2022-03-07T17:23:02.698000Z"
    },
    "last_modified": "2022-03-07T17:23:02.782000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/gretelai/synthetics@sha256:717a68c0e4ef3000c8b650bbed308162ef10c1b2cb4bfc3026b773bc908ee577",
    "model_id": "62263e1a44cf11dc1f9e359e",
    "model_guid": "model_264CNwh8o038MKpqP7xpXcjGUdq",
    "action": "generate",
    "config": {
        

INFO: Status is created. A Record generation job has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated to begin generating synthetic records.
INFO: Status is active. A worker has started!
2022-03-07T17:23:20.010136Z  Loading model to worker
2022-03-07T17:23:20.553465Z  Checking for synthetic smart seeds
2022-03-07T17:23:20.553844Z  No smart seeds provided, will attempt generation without them
2022-03-07T17:23:20.554699Z  Loading model
2022-03-07T17:23:22.916162Z  Generating records
{
    "num_records": 100
}
2022-03-07T17:23:27.923090Z  Generation in progress
{
    "current_valid_count": 0,
    "current_invalid_count": 0,
    "new_valid_count": 0,
    "new_invalid_count": 0,
    "completion_percent": 0.0
}
2022-03-07T17:23:32.929829Z  Generation in progress
{
    "current_valid_count": 0,
    "current_invalid_count": 0,
    "new_valid_count": 0,
    "new_invalid_count": 0,
    "completion_percent": 0.0
}
2022-03-07T17:23:34.932979

In [10]:
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')

synthetic_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,17,Private,237824,9th,5,Never-married,Other-service,Other-relative,Black,Male,0,0,40,Jamaica,<=50K
1,44,Private,182074,Some-college,10,Divorced,Sales,Unmarried,White,Female,0,0,40,United-States,<=50K
2,40,Private,265148,Assoc-acdm,12,Married-civ-spouse,Sales,Husband,Black,Male,0,0,60,Jamaica,<=50K
3,51,Private,249741,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,4386,0,40,United-States,>50K
4,29,Private,162667,9th,5,Separated,Other-service,Unmarried,White,Female,0,0,35,Columbia,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,28,Private,376728,Masters,14,Married-civ-spouse,Exec-managerial,Wife,Black,Female,0,0,40,United-States,<=50K
96,49,Private,264244,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,70,United-States,<=50K
97,30,Private,97986,Assoc-voc,11,Married-civ-spouse,Transport-moving,Wife,White,Female,4064,0,40,United-States,<=50K
98,51,Private,165972,5th-6th,3,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
