[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)

<br>

<center><a href="https://gretel.ai/"><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></a></center>

<br>

## Welcome to the Gretel 101 Blueprint!

In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).

Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.

## Create your Gretel account

To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).

<br>

#### Ready? Let's go 🚀

## 💾 Install `gretel-client` and its dependencies

In [None]:
%%capture
!pip install gretel-client

## 🛜 Configure your Gretel session

- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.

- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).

- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).

- With `validate=True`, your login credentials will be validated immediately at instantiation.

In [None]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", validate=True)

Gretel Api Key··········
Using endpoint https://api.gretel.cloud
Logged in as sandeepk39@gmail.com ✅
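
For non-interactive runs (a scheduled job, for example), typing the key at a prompt is not an option. The sketch below is one possible approach and not part of the original Blueprint: it reads the key from an environment variable and falls back to the interactive prompt used above. The helper names `resolve_api_key` and `make_session` are hypothetical, and both the variable name `GRETEL_API_KEY` and the idea that `api_key` accepts the key string itself are assumptions to verify against the SDK documentation.

```python
import os


def resolve_api_key(env=None, var_name="GRETEL_API_KEY"):
    """Return the key stored in `var_name`, or "prompt" if it is unset or blank.

    Hypothetical helper; the variable name is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    return key or "prompt"


def make_session(env=None):
    """Create a Gretel session with the same arguments as the cell above."""
    # Imported lazily so resolve_api_key can be used without the SDK installed.
    from gretel_client import Gretel

    return Gretel(api_key=resolve_api_key(env), validate=True)
```

With this in place, `gretel = make_session()` would replace the instantiation above; when the variable is unset, the behavior is unchanged.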


In [None]:
# @title 🗂️ Pick a tabular dataset 👇 { display-mode: "form" }
dataset_path_dict = {
    "adult income in the USA (14000 records, 15 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv",
    "hospital length of stay (9999 records, 18 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv",
    "customer churn (7032 records, 21 fields)": "https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv"
}

dataset = "adult income in the USA (14000 records, 15 fields)" # @param ["adult income in the USA (14000 records, 15 fields)", "hospital length of stay (9999 records, 18 fields)", "customer churn (7032 records, 21 fields)"]
dataset = dataset_path_dict[dataset]


In [None]:
import pandas as pd

# explore the data using pandas
df = pd.read_csv(dataset)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,33,Private,229051,Some-college,10,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,52,United-States,<=50K
1,38,Local-gov,91711,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K
2,56,Private,282023,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K
3,32,Private,209538,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,55,United-States,>50K
4,34,Self-emp-inc,215382,Masters,14,Separated,Prof-specialty,Not-in-family,White,Female,4787,0,40,United-States,>50K
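
Each dataset label in the picker advertises a record and field count, so a quick sanity check is to compare those numbers with the shape of the loaded `DataFrame`. The following is a minimal sketch of that idea; `expected_shape` is a hypothetical helper, not part of pandas or the Gretel SDK.

```python
import re

LABEL_PATTERN = re.compile(r"\((\d+) records, (\d+) fields\)")


def expected_shape(label):
    """Extract (records, fields) from a label like 'name (14000 records, 15 fields)'."""
    match = LABEL_PATTERN.search(label)
    if match is None:
        raise ValueError(f"no record/field counts found in label: {label!r}")
    return int(match.group(1)), int(match.group(2))


label = "adult income in the USA (14000 records, 15 fields)"
print(expected_shape(label))  # (14000, 15)
```

In the notebook, the result should match `df.shape` for the selected dataset.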


## 🏋️‍♂️ Train a generative model

- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.

- You can replace `tabular-actgan` with the path to a custom config file, or you can select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).

- The training data is passed in using the `data_source` argument, which accepts a file path (or a URL, as in the cell below) or a pandas `DataFrame`.

- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console.

In [None]:
trained = gretel.submit_train("tabular-actgan", data_source=dataset)

No project set -> creating a new one...
Project URL: https://console.gretel.ai/proj_2Z5sKiHp85OnSkdXLRu79xQ6B2V
Submitting ACTGAN training job...
Model Docs:https://docs.gretel.ai/reference/synthetics/models/gretel-actgan

Console URL: https://console.gretel.ai/proj_2Z5sKiHp85OnSkdXLRu79xQ6B2V/models/656e3c5682e12248be05d59c/activity
Analyzing input data and checking for auto-params... 
Found 3 auto-params that were set based on input data. epochs 600, batch_size 600, force_conditioning False
Starting ACTGAN model training... num_epochs 600
Training data loaded. record_count 14000, field_count 15, upsample_count 0
Training: [██████████████████████████████████████████████████] 600/600 epochs.
ACTGAN model training complete. 
Sampling records for data preview... num_records 5000
Preparing privacy filters 
Loaded 0 privacy filters 
Starting privacy filtering 
Privacy filtering complete. 
Sampled 5000 records. 
Creating synthetic quality report (SQS)... 
Finished creating SQS 
Uploading ar

## 🧐 Evaluate the synthetic data quality

- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.

- The training results object returned by `submit_train` has a `report` attribute, a `GretelReport` object, for viewing the quality report.


In [None]:
# view the quality scores
print(trained.report)

GretelReport(
    synthetic_data_quality_score: 91
    field_correlation_stability: 89
    principal_component_stability: 100
    field_distribution_stability: 86
    privacy_protection_level: 0
)
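
To act on these scores in code (for example, to flag the weakest stability metric), one option that does not depend on the report object's internals is to parse the printed summary. This is a minimal sketch that assumes the text has the `name: value` layout shown above; `parse_report_summary` is a hypothetical helper.

```python
import re


def parse_report_summary(text):
    """Parse lines of the form 'name: 91' into a {name: int} dictionary."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", text)}


# The summary printed above, copied verbatim.
summary = """GretelReport(
    synthetic_data_quality_score: 91
    field_correlation_stability: 89
    principal_component_stability: 100
    field_distribution_stability: 86
    privacy_protection_level: 0
)"""

scores = parse_report_summary(summary)

# Find the lowest of the three stability metrics.
weakest = min(
    (name for name in scores if name.endswith("_stability")),
    key=scores.get,
)
print(weakest, scores[weakest])  # field_distribution_stability 86
```

In the notebook, `str(trained.report)` would take the place of the hard-coded `summary` string.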



In [None]:
# display the full report within this notebook
trained.report.display_in_notebook()

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
hours_per_week,78,0,1.98,Numeric,Good
capital_loss,53,0,1.13,Numeric,Good
native_country,41,0,12.23,Categorical,Good
education_num,16,0,1.55,Numeric,Excellent
fnlwgt,4556,0,5.83,Numeric,Excellent
occupation,14,0,12.2,Categorical,Excellent
age,69,0,2.0,Numeric,Excellent
workclass,8,0,7.87,Categorical,Excellent
marital_status,7,0,14.42,Categorical,Excellent
relationship,6,0,9.12,Categorical,Excellent


In [None]:
# inspect the synthetic data used to create the report
df_synth_report = trained.fetch_report_synthetic_data()
df_synth_report.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,57,Local-gov,115041,11th,7,Never-married,Other-service,Unmarried,White,Female,0,1,8,United-States,<=50K
1,49,Private,189410,Some-college,10,Divorced,Other-service,Not-in-family,White,Female,17,0,40,United-States,<=50K
2,44,Private,203745,Some-college,10,Divorced,Exec-managerial,Not-in-family,White,Male,0,1,40,United-States,<=50K
3,27,State-gov,186637,Assoc-voc,11,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,60,United-States,<=50K
4,61,Private,197541,7th-8th,4,Married-civ-spouse,Protective-serv,Husband,White,Male,7,1,40,United-States,>50K
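
For a rough, independent look at how closely a categorical field in the synthetic data tracks the training data, the category shares of the two can be compared directly. The sketch below uses total variation distance. This is a technique chosen here for illustration, not the method Gretel uses to compute its Field Distribution Stability score, and the helper names are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return each distinct value's share of the total, as {value: fraction}."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute share differences: 0 = identical, 1 = disjoint."""
    real = category_shares(real_values)
    synth = category_shares(synthetic_values)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in categories)


# Toy values for illustration only; they are not taken from the dataset.
real = ["Private", "Private", "Local-gov", "Self-emp-inc"]
synth = ["Private", "Private", "Private", "State-gov"]
print(total_variation_distance(real, synth))  # 0.5
```

Because the functions accept any iterable, a call such as `total_variation_distance(df["workclass"], df_synth_report["workclass"])` should work on the notebook's DataFrames.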


## 🤖 Generate synthetic data

- The `model_id` argument can be the ID of any trained model within the current project.


In [None]:
generated = gretel.submit_generate(trained.model_id, num_records=1000)

Submitting ACTGAN generate job...
Model Docs:https://docs.gretel.ai/reference/synthetics/models/gretel-actgan
Console URL: https://console.gretel.ai/proj_2Z5sKiHp85OnSkdXLRu79xQ6B2V/models/656e3c5682e12248be05d59c/data
Loading model to worker 
Loading ACTGAN model... 
Sampling 1000 records... 
Preparing privacy filters 
Loaded 0 privacy filters 
Starting privacy filtering 
Privacy filtering complete. 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 


In [None]:
# inspect the generated synthetic data
generated.synthetic_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,21,Private,166900,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,2751,0,40,United-States,<=50K
1,37,Private,186709,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,9,0,40,United-States,<=50K
2,29,Private,40746,HS-grad,9,Never-married,Priv-house-serv,Unmarried,White,Female,0,0,25,United-States,<=50K
3,54,?,109637,HS-grad,9,Married-civ-spouse,?,Wife,White,Female,3110,1,60,Nicaragua,>50K
4,48,Private,113131,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Female,0,0,32,United-States,<=50K
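
Row 3 above contains `?` in the `workclass` and `occupation` fields, presumably because the training data uses the same marker for unknown values and the model reproduces it. Before using generated records downstream, it can help to count how often such a placeholder appears in each field. A minimal sketch, with `count_placeholders` as a hypothetical helper:

```python
from collections import Counter


def count_placeholders(records, placeholder="?"):
    """Count, per field, how many records hold the placeholder value."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value == placeholder:
                counts[field] += 1
    return dict(counts)


# Two of the rows shown above, reduced to a few fields.
rows = [
    {"age": 29, "workclass": "Private", "occupation": "Priv-house-serv"},
    {"age": 54, "workclass": "?", "occupation": "?"},
]
print(count_placeholders(rows))  # {'workclass': 1, 'occupation': 1}
```

In the notebook, the records could come from `generated.synthetic_data.to_dict("records")`.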
