# Gretel Synthetics Walkthrough

Welcome to the Gretel Synthetics walkthrough! This tutorial takes you through extracting data from Gretel, building a training dataset, creating synthetic data, and validating the new data.

This tutorial assumes you have already created a [Gretel project](https://console.gretel.cloud) and uploaded data to it.

Let's get started!

## Configuration

- If using Google Colab, we recommend you change to a GPU runtime. From the menu, choose "Runtime" and then choose "Change runtime type".

- Input your Gretel URI string. Just run the cell below (no need to change its contents) and then enter your Gretel URI in the pop-up box when it appears.

- Create your Gretel Synthetic Configuration Template
  - See [our documentation](https://gretel-synthetics.readthedocs.io/en/stable/api/config.html) for additional config options

In [None]:
from pathlib import Path
import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")
checkpoint_dir = str(Path.cwd() / "checkpoints")

config_template = {
    "checkpoint_dir": checkpoint_dir,
    "dp": True, # enable differential privacy in training
    "epochs": 15,
    "gen_lines": 100,
    "max_lines": 0,
    "max_line_len": 2048,
    "overwrite": True,
    "save_all_checkpoints": False,
    "vocab_size": 20000
}
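
When experimenting, it can help to derive a faster variant of the template instead of editing it in place. The following is a minimal sketch that uses only keys from the template above. The trimmed `base_config` and the name `quick_config` are illustrative, and the specific values are arbitrary choices rather than recommendations.

```python
# Trimmed copy of the template above so this snippet runs on its own.
base_config = {"epochs": 15, "gen_lines": 100, "dp": True}

# Dict unpacking builds a new dict, so the original template is left untouched.
quick_config = {**base_config, "epochs": 5, "gen_lines": 20}

print(quick_config)
```

Keeping the original template unchanged makes it easy to return to the full settings once a short trial run looks reasonable.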


## Steps to create a synthetic dataset

In the code below, we will:
* Install Gretel packages and dependencies
* Connect to the Gretel API and download source data from the project stream
* Automatically build a record validator from the source data
* Train a synthetic model (neural network) on the source data
* Generate `gen_lines` synthetic data records that pass validation
* Create a synthetic data performance report to compare the source and synthetic datasets

In [None]:
%%capture

!pip install -U gretel-client

# NOTE: if TensorFlow is already installed (as in Colab), install gretel-synthetics on its own
!pip install gretel-synthetics

# NOTE: if you also need TensorFlow, install the "tf" extra instead
# !pip install gretel-synthetics[tf]

In [None]:
from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)
project.client.install_packages()

## Select fields from source dataset

By default, we suggest filtering fields based on percent unique and percent missing. We recommend using fields whose values are no more than 80% unique and are missing no more than 20% of the time. Feel free to adjust these parameters.

If you wish to use all fields, you can omit the returned `include_fields` list from the synthetic bundle creation below.


In [None]:
from gretel_helpers.synthetics import create_bundle_from_project, filter_fields

include_fields, drop_fields = filter_fields(project, max_unique_percent=80, max_missing_percent=20)
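
To make the two thresholds concrete, here is a minimal, self-contained sketch of this kind of filter in plain Python. It is not the implementation of `filter_fields`: the helper names `field_stats` and `split_fields`, the sample records, and the choice to measure uniqueness over non-missing values only are all assumptions made for illustration.

```python
def field_stats(records):
    """Return {field: (percent_unique, percent_missing)} for a list of dict records."""
    fields = sorted({key for record in records for key in record})
    total = len(records)
    stats = {}
    for field in fields:
        values = [record.get(field) for record in records]
        # Treat None and empty strings as missing (an assumption of this sketch).
        present = [v for v in values if v not in (None, "")]
        percent_missing = 100.0 * (total - len(present)) / total
        # Uniqueness is measured over the values that are actually present.
        percent_unique = 100.0 * len(set(present)) / len(present) if present else 0.0
        stats[field] = (percent_unique, percent_missing)
    return stats


def split_fields(records, max_unique_percent=80, max_missing_percent=20):
    """Split field names into (include, drop) lists using the two thresholds."""
    include, drop = [], []
    for field, (unique, missing) in field_stats(records).items():
        if unique <= max_unique_percent and missing <= max_missing_percent:
            include.append(field)
        else:
            drop.append(field)
    return include, drop


# Hypothetical sample data: "id" is fully unique, "notes" is mostly missing.
records = [
    {"id": "a1", "state": "CA", "notes": None},
    {"id": "b2", "state": "CA", "notes": None},
    {"id": "c3", "state": "NY", "notes": "call back"},
    {"id": "d4", "state": "NY", "notes": None},
    {"id": "e5", "state": "CA", "notes": None},
]
include, drop = split_fields(records)
print("include:", include)
print("drop:", drop)
```

In this sample, `id` is dropped because every value is unique, and `notes` is dropped because it is missing in four of five records, leaving only `state`. Identifier-like and mostly empty fields of this kind are what the two thresholds are meant to screen out.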

## Create a Gretel Synthetic Bundle

Next, we run our bundle automation process. This automates the following actions:

- Download records from your Gretel Project and convert them to a DataFrame
- Adjust the fields to be used for synthesis
- Automatically detect a field delimiter to be used for the Gretel Synthetics library
- Automatically detect correlations between columns and create batches of column headers for synthesis
- Build data validators that ensure generated records are within a range of boundaries learned from your training data
- Build neural network models
- Utilize AI models to create synthetic data
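
The validator step above can be pictured with a small sketch: learn a numeric range per field from the training rows, then reject generated rows that fall outside it. This is only an illustration of the idea. The bundle's actual validators are not shown in this notebook, and the names `learn_ranges` and `is_valid`, along with the sample rows, are hypothetical.

```python
def learn_ranges(training_rows, numeric_fields):
    """Record the min and max seen for each numeric field in the training rows."""
    ranges = {}
    for field in numeric_fields:
        values = [row[field] for row in training_rows]
        ranges[field] = (min(values), max(values))
    return ranges


def is_valid(row, ranges):
    """Accept a generated row only if every numeric field is inside its learned range."""
    for field, (low, high) in ranges.items():
        try:
            value = float(row[field])
        except (KeyError, TypeError, ValueError):
            return False  # missing or non-numeric value
        if not low <= value <= high:
            return False
    return True


# Hypothetical training data and generated candidates (generated values arrive as text).
training_rows = [
    {"age": 23, "income": 41000.0},
    {"age": 57, "income": 98000.0},
    {"age": 34, "income": 62500.0},
]
ranges = learn_ranges(training_rows, ["age", "income"])

candidates = [
    {"age": "41", "income": "70000"},     # inside both ranges
    {"age": "130", "income": "70000"},    # age outside the learned range
    {"age": "forty", "income": "70000"},  # not parseable as a number
]
accepted = [row for row in candidates if is_valid(row, ranges)]
print(accepted)
```

Only the first candidate is accepted. Filtering like this is why a generation step may need to produce more raw lines than the number of valid records it finally returns.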

In [None]:
bundle = create_bundle_from_project(
    project=project,
    max_size=5000,
    include_fields=include_fields,  # NOTE: you may omit this param to utilize all fields from your training data
    synthetic_config=config_template
)

In [None]:
bundle.training_df.head()

In [None]:
bundle.build()

In [None]:
bundle.train()

In [None]:
bundle.generate()

In [None]:
bundle.get_synthetic_df()

## Performance Report

The Performance Report compares the training data to the newly created synthetic data and assesses their statistical similarity. It shows you, both quantitatively and graphically, any differences in within-field distributions as well as in cross-field correlations.

In [None]:
bundle.generate_report()
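
For intuition about the two kinds of comparison described above, here is a minimal sketch in plain Python. It is not what `generate_report` computes: the choice of total variation distance for a within-field distribution and of Pearson correlation for a cross-field relationship, the function names, and the sample values are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def total_variation(source_values, synthetic_values):
    """Half the summed absolute difference between two categorical frequency tables.

    0 means identical distributions, 1 means no categories in common.
    """
    source_freq = Counter(source_values)
    synthetic_freq = Counter(synthetic_values)
    categories = set(source_freq) | set(synthetic_freq)
    return 0.5 * sum(
        abs(source_freq[c] / len(source_values) - synthetic_freq[c] / len(synthetic_values))
        for c in categories
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical source and synthetic columns.
source_state = ["CA", "CA", "NY", "NY"]
synthetic_state = ["CA", "CA", "CA", "NY"]

source_age, source_income = [20, 30, 40, 50], [30, 40, 50, 60]
synthetic_age, synthetic_income = [20, 30, 40, 50], [30, 50, 40, 60]

# Within-field comparison: how far apart are the category frequencies?
state_distance = total_variation(source_state, synthetic_state)

# Cross-field comparison: how much did the age/income correlation change?
correlation_gap = abs(
    pearson(source_age, source_income) - pearson(synthetic_age, synthetic_income)
)

print("state distribution distance:", state_distance)
print("age/income correlation gap:", round(correlation_gap, 3))
```

Smaller values on both measures indicate that the synthetic data tracks the source more closely. A real report would cover every field and every field pair, not just the single field and pair shown here.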