![Synthesized Banner](./synthesized_banner.png)

# Train, tune, and deploy a custom synthetic data generator using Synthesized's Tabular Data Synthesizer Algorithm from AWS Marketplace


[Synthesized's Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe) brings the generative AI capabilities of our [SDK](https://docs.synthesized.io/sdk/latest/) to Amazon SageMaker.

## Overview

Synthesized provides a comprehensive framework for generative modelling of structured data. The SDK helps you create compliant data snapshots that preserve the statistical properties of the original data, for use in BI/Analytics and ML/AI applications.

You can find more details in our [documentation](https://docs.synthesized.io/sdk/latest/).

This sample notebook shows you how to train a custom ML model using [Synthesized's Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe) from AWS Marketplace.

> **Note**: This is a reference notebook. It will not run until you make the changes suggested in the notebook.

## Pre-requisites
1. **Note**: This notebook contains elements that only render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/) is assumed.
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Synthesized's Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe). 
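
For the first option, the three Marketplace permissions could be granted through an IAM policy along the following lines. This is a minimal sketch, not a policy supplied by Synthesized: the `"Resource": "*"` scope is an assumption, so narrow it to suit your organisation's requirements.

```python
import json

# Sketch of an identity-based policy granting the three AWS Marketplace
# actions listed above. The "*" resource scope is an assumption.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

# Print the policy as JSON, ready to paste into the IAM console.
print(json.dumps(marketplace_policy, indent=2))
```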

## Contents
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
    1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
    1. [Configure and visualize train and test dataset](#B.-Configure-and-visualize-train-and-test-dataset)
    1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train the Tabular Data Synthesizer Algorithm](#3:-Train-the-Tabular-Data-Synthesizer-Algorithm)
    1. [Set up environment](#3.1-Set-up-environment)
    1. [Train](#3.2-Train)
1. [Deploy synthesizer and verify results](#4:-Deploy-synthesizer-and-verify-results)
    1. [Deploy trained synthesizer](#A.-Deploy-trained-synthesizer)
    1. [View input payload](#B.-View-input-payload)
    1. [Perform real-time data generation](#C.-Perform-real-time-data-generation)
    1. [Visualize output](#D.-Visualize-output)
    1. [Calculate relevant metrics](#E.-Calculate-relevant-metrics)
    1. [Delete the synthesizer endpoint](#F.-Delete-the-synthesizer-endpoint)
1. [Tune your synthesizer! (optional)](#5:-Tune-your-synthesizer!-(optional))
    1. [Tuning Guidelines](#A.-Tuning-Guidelines)
    1. [Define Tuning configuration](#B.-Define-Tuning-configuration)
    1. [Run a synthesizer tuning job](#C.-Run-a-synthesizer-tuning-job)
1. [Clean-up](#6.-Clean-up)
    1. [Delete the synthesizer model and endpoint config](#A.-Delete-the-synthesizer-model-and-endpoint-config)
    1. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))


## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page [Synthesized's Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [2]:
algo_arn = "<Customer to specify algorithm ARN corresponding to their AWS region>"
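
Before starting a training job, it can be worth checking that the placeholder above has actually been replaced. The sketch below is an optional sanity check; the regular expression reflects the general shape of a SageMaker algorithm ARN and is an assumption, not a rule published by the listing. The helper name `looks_like_algorithm_arn` and the example ARN are hypothetical.

```python
import re

# Assumed general shape of a SageMaker algorithm ARN:
# arn:<partition>:sagemaker:<region>:<12-digit account id>:algorithm/<name>
ALGORITHM_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/[A-Za-z0-9-]+$"
)


def looks_like_algorithm_arn(arn: str) -> bool:
    """Return True if `arn` has the general shape of a SageMaker algorithm ARN."""
    return ALGORITHM_ARN_PATTERN.match(arn) is not None


# The unedited placeholder is rejected; a made-up but well-formed ARN passes.
placeholder = "<Customer to specify algorithm ARN corresponding to their AWS region>"
made_up_arn = "arn:aws:sagemaker:eu-west-1:123456789012:algorithm/example-algorithm"
print(looks_like_algorithm_arn(placeholder))   # False
print(looks_like_algorithm_arn(made_up_arn))   # True
```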

## 2. Prepare dataset

In [3]:
!pip install insight==0.9rc6 numpy pandas

In [4]:
import io
import os
import json
import uuid

from insight import plot, metrics
import pandas as pd
import sagemaker as sage

### A. Dataset format expected by the algorithm

The Tabular Synthesizer algorithm accepts data in CSV format. The algorithm accepts a comprehensive range of datatypes, including:
- String
- Integer
- Float
- Boolean
- Datetime
- Timedelta

You can find more information about the dataset format in the **Usage Information** section of [Synthesized's Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe).
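
To make the expected input concrete, the sketch below writes a tiny CSV with one column per data type listed above. The column names and values are hypothetical, and the datetime and timedelta values are written using Python's default string representation; the exact string formats the algorithm parses are not stated here, so check the listing's **Usage Information** before relying on them.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical rows with one column per supported data type.
rows = [
    {"name": "alice", "age": 34, "balance": 1520.75, "is_active": True,
     "joined": datetime(2021, 3, 14, 9, 30),
     "session_length": timedelta(minutes=12)},
    {"name": "bob", "age": 51, "balance": 310.0, "is_active": False,
     "joined": datetime(2019, 11, 2, 17, 5),
     "session_length": timedelta(hours=1, minutes=3)},
]

# Write a header row followed by one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)  # non-string values are written via str()

csv_text = buffer.getvalue()
print(csv_text)
```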

### B. Configure and visualize train and test dataset

In [5]:
training_dataset = "data/algorithm/train/credit_orig.csv"
training_data_dir = os.path.dirname(training_dataset)

In [6]:
df = pd.read_csv(training_dataset)
plot.dataset([df], figure_cols=4, max_categories=20, figsize=(12,8))
df.info()

In [7]:
training_config = "data/algorithm/config/credit_config.json"
training_config_dir = os.path.dirname(training_config)

In [8]:
with open(training_config, 'rb') as file:
    config = json.load(file)
print(json.dumps(config, indent=2))

### C. Upload datasets to Amazon S3

In [9]:
sagemaker_session = sage.Session()
bucket = sagemaker_session.default_bucket()
bucket

In [10]:
prefix = f"synth-{uuid.uuid4()}"

training_dataset_dir = sagemaker_session.upload_data(
    training_data_dir, bucket=bucket, key_prefix=f"{prefix}/train"
)
training_config_dir = sagemaker_session.upload_data(
    training_config_dir, bucket=bucket, key_prefix=f"{prefix}/config"
)

print(training_dataset_dir)
print(training_config_dir)
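
The values printed above are S3 URIs. If you want to inspect the uploaded objects with other tools, a URI can be split into its bucket and key prefix as in the sketch below. The helper name `split_s3_uri` and the example bucket and prefix are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an `s3://bucket/key` URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and prefix, for illustration only.
example_uri = "s3://example-bucket/synth-1234/train"
print(split_s3_uri(example_uri))  # ('example-bucket', 'synth-1234/train')
```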

## 3: Train the Tabular Data Synthesizer Algorithm

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [11]:
role = sage.get_execution_role()

In [12]:
output_location = f"s3://{bucket}/{prefix}/output"
print(output_location)

### 3.2 Train

You can find more information about the supported hyperparameters in the **Hyperparameters** section of the [Tabular Data Synthesizer Algorithm](https://aws.amazon.com/marketplace/pp/prodview-da77no6ehd3pe) listing. Note that the parameter `num_steps` has been commented out, so that training is stopped automatically as the algorithm converges.

In [13]:
hyperparameters = {
  "batch_size": "256",
  "latent_size": "32",
  # "num_steps": "500"
}

For information on creating an `AlgorithmEstimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [14]:
instance_type = 'ml.m5.2xlarge'

# Create an algorithm object for running a training job
algorithm = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="tabular-synthesizer",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
)
# Train the algorithm object
algorithm.fit({"train": training_dataset_dir, "config": training_config_dir})

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

## 4: Deploy synthesizer and verify results

Now you can deploy a `synthesizer` for performing real-time data generation.

In [15]:
model_name = f"{prefix}-credit"
endpoint_name = f"{prefix}-credit-endpoint"
content_type = "application/json"

real_time_inference_instance_type = 'ml.m5.2xlarge'
batch_transform_inference_instance_type = 'ml.m5.2xlarge'

### A. Deploy trained synthesizer

We use the trained algorithm instance to create a `synthesizer` object for data generation.

In [16]:
synthesizer = algorithm.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    model_name=model_name,
    endpoint_name=endpoint_name,
)

Once an endpoint is created, you can perform real-time data generation.

### B. View input payload

In [17]:
file_name = "data/inference/input/real-time/synth.json"
output_file_name = "data/inference/output/synth-credit.csv"

In [18]:
# Load the request payload, closing the file once it has been read
with open(file_name, 'rb') as file:
    input_payload = json.load(file)
input_payload

### C. Perform real-time data generation

Synthetic data can be generated using either the AWS CLI or the Python API. If you would like to save the synthetic DataFrame directly to a CSV file, rather than returning it to the notebook (e.g. in the case where you are generating a large synthetic DataFrame), you can use the AWS CLI as demonstrated below.

- CLI

In [19]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    $output_file_name

In [20]:
pd.read_csv(output_file_name)

- Python API

In [21]:
response_body = synthesizer.predict(json.dumps(input_payload))
df_synth = pd.read_csv(io.BytesIO(response_body))
df_synth
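
The CLI call above can also be reproduced from Python with the low-level `sagemaker-runtime` client in boto3. The sketch below assumes the endpoint returns CSV bytes, as in the cells above. The helper name `generate_synthetic_csv` and its optional `client` argument are hypothetical and exist only to make the function easy to test without AWS access.

```python
import json


def generate_synthetic_csv(endpoint_name, payload, client=None,
                           content_type="application/json"):
    """Invoke a SageMaker endpoint and return the raw response bytes.

    `client` is expected to behave like a boto3 "sagemaker-runtime" client;
    one is created if none is supplied.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested offline
        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully into memory.
    return response["Body"].read()
```

With the variables defined earlier in this notebook, `pd.read_csv(io.BytesIO(generate_synthetic_csv(endpoint_name, input_payload)))` would load the result into a DataFrame.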

### D. Visualize output

As a quick sanity check, we can compare the distributions of columns in the original and synthetic DataFrames. In this example, we drop the column `MonthlyIncome` since its values have been masked in the synthetic data using the `RoundingMask`. See our [documentation](https://docs.synthesized.io/sdk/latest/features/compliance/privacy_masks) for more details on the privacy-preserving transformations offered as part of the SDK.

In [22]:
plot.dataset([df.drop(columns="MonthlyIncome"), df_synth.drop(columns="MonthlyIncome")], figure_cols=4, max_categories=20);

### E. Calculate relevant metrics

The statistical similarity of the original and synthetic DataFrames can be analysed quantitatively using the `TwoColumnMap` and `DiffCorrMatrix` classes in the `metrics` module of the insight package. Both univariate and bivariate metrics can be calculated, as demonstrated below for the earth mover's distance (EMD) and the Kendall tau correlation. Where a metric cannot be calculated due to a column mismatch (e.g. the EMD can only be calculated between columns that are categorical in nature), a `NaN` value is returned.

In [23]:
emd = metrics.TwoColumnMap(metrics.EarthMoversDistance())
emd(df, df_synth)

In [24]:
kt_corr = metrics.DiffCorrMatrix(metrics.KendallTauCorrelation())
df_corr = kt_corr(df.sample(100), df_synth.sample(100))

In [25]:
df_corr.style.set_caption("KT_corr")
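
For intuition about what the Kendall tau correlation measures, the sketch below computes the simplest variant (tau-a) in plain Python by counting concordant and discordant pairs. It is an illustration only and is not the insight package's implementation, which may handle ties and categorical columns differently.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1  # concordant pair: both columns move the same way
        elif product < 0:
            score -= 1  # discordant pair: the columns move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


print(kendall_tau_a([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau_a([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```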

If [Amazon SageMaker Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) supports the type of problem you are trying to solve using this algorithm, you can use the following examples to add Model Monitor support.
For sample code to enable and monitor the model, see the following notebooks:
1. [Enable Amazon SageMaker Model Monitor](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_model_monitor/enable_model_monitor/SageMaker-Enable-Model-Monitor.ipynb)
2. [Amazon SageMaker Model Monitor - visualizing monitoring results](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_model_monitor/visualization/SageMaker-Model-Monitor-Visualize.ipynb)

### F. Delete the synthesizer endpoint

Now that you have successfully generated synthetic data, you no longer need the endpoint. Delete it to avoid incurring further charges.

In [26]:
sagemaker_session.delete_endpoint(endpoint_name)

Since this is an experiment, you do not need to run a hyperparameter tuning job. However, if you would like to see how to tune a model trained using a third-party algorithm with Amazon SageMaker's hyperparameter tuning functionality, you can run the optional tuning step.

## 5: Tune your synthesizer! (optional)

While the deep generative models utilised by the SDK have been pre-tuned across a wide variety of datasets, users can run their own hyperparameter tuning jobs if the statistical quality of the synthetic DataFrame requires further tuning.

    
For information about automatic model tuning, see [Perform Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html).

### A. Tuning Guidelines

The supported objective metrics and hyperparameters are listed below.

#### Supported objective metrics

- total_loss (type=Minimize)
- step_time (type=Minimize)

#### Supported hyperparameters

- batch_size: (min: 16, max: 2048, tunable: True, default: 1024)
- capacity: (min: 8, max: 256, tunable: True, default: 128)
- latent_size: (min: 8, max: 256, tunable: True, default: 32)
- num_steps: (min: 1, max: 50000, tunable: False, default: None)
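
The ranges above can be checked before a training or tuning job is submitted. The sketch below is a small, optional validation helper; the function name `check_hyperparameters` is hypothetical, and it assumes all four hyperparameters are integer-valued and passed as strings, as in the training cell earlier in this notebook.

```python
# Ranges copied from the "Supported hyperparameters" list above.
SUPPORTED_RANGES = {
    "batch_size": (16, 2048),
    "capacity": (8, 256),
    "latent_size": (8, 256),
    "num_steps": (1, 50000),
}


def check_hyperparameters(hyperparameters):
    """Return a list of problems found; an empty list means all values are in range."""
    problems = []
    for name, raw_value in hyperparameters.items():
        if name not in SUPPORTED_RANGES:
            problems.append(f"{name}: not a supported hyperparameter")
            continue
        low, high = SUPPORTED_RANGES[name]
        try:
            value = int(raw_value)
        except (TypeError, ValueError):
            problems.append(f"{name}: {raw_value!r} is not an integer")
            continue
        if not low <= value <= high:
            problems.append(f"{name}: {value} is outside [{low}, {high}]")
    return problems


print(check_hyperparameters({"batch_size": "256", "latent_size": "32"}))  # []
print(check_hyperparameters({"capacity": "4"}))
```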

### B. Define Tuning configuration

In [27]:
# Uncomment desired variables to be tuned
hyperparameter_ranges = {
    "capacity": sage.parameter.IntegerParameter(32, 128),
    # "batch_size": sage.parameter.IntegerParameter(32, 1024),
    # "latent_size": sage.parameter.IntegerParameter(16, 256),
}

In [28]:
objective_metric_name = "total_loss"
tuning_direction = "Minimize"

### C. Run a synthesizer tuning job

In [29]:
tuner = sage.tuner.HyperparameterTuner(
    estimator=algorithm,
    base_tuning_job_name=f"tune-{model_name}",
    objective_metric_name=objective_metric_name,
    objective_type=tuning_direction,
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=5,
    max_parallel_jobs=5,
)

In [30]:
# Uncomment following lines to run Hyperparameter optimization job.
# tuner.fit({"train": training_dataset_dir, "config": training_config_dir})
# tuner.wait()
# best_estimator = tuner.best_estimator()
# best_estimator.hyperparameters()

Once you have completed a tuning job (or even while the job is still running), you can [clone and use this notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) to analyze the results and understand how each hyperparameter affects the quality of the model.

## 6. Clean-up

### A. Delete the synthesizer model and endpoint config

In [31]:
sagemaker_session.delete_model(model_name)
sagemaker_session.delete_endpoint_config(endpoint_name)

### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing that you want to cancel the subscription for, and then choose __Cancel Subscription__.

