<img src="https://uploads-ssl.webflow.com/5f2d65b321549c3a6228ce06/60892a20edbd1da3fd641167_Synthesized%20logo.png" width="350" alt="Synthesized" align="left">

# Introduction

Welcome to the demo notebook, where we showcase some of the core features of the Synthesized SDK.

You can apply the Synthesized SDK to automatically create a **general-purpose generative model** for any dataset, enabling easy solutions to a wide range of classic data problems.

This notebook looks at 4 different examples:
1. Bootstrap data where the density of data is low
2. Automatically reshape data as you like
3. Anonymise data for repurposing
4. Identify biases in structured data **(Latest release!)**
 
**Note:**

If you want to save your progress and come back to your work in a new session, you must copy this notebook to your Google Drive.
 
If you wish to use the SDK outside Colab (in a production environment, on-premise or in a private cloud), connect to databases, integrate it into ETL pipelines, work natively with Spark and big data sources, or simply move beyond a single in-memory dataframe, get in touch with us at letschat@synthesized.io.

**Useful links:**

[Synthesized Docs](https://docs.synthesized.io/latest/)

[SDK for data manipulation](https://www.synthesized.io/sdk-for-data-manipulation)

[Contact us](mailto:letschat@synthesized.io)



# Synthesized License Key

In [None]:
#@title ### Request licence key
#@markdown Please enter your details to receive a licence key. You will need to enter the licence key in order to run the notebook cells below.

first_name = "" #@param {type:"string"}
last_name = "" #@param {type:"string"}
email = "" #@param {type:"string"}

#@markdown Submit the form by running the cell (⌘/ctrl+Enter).
import requests

if "@" not in email.strip():
  print("Please enter a valid email address.")
else:
  url = 'https://us-central1-synthesized-cloud-275014.cloudfunctions.net/process-licence-request'
  # Passing the fields via `params` lets requests URL-encode names and addresses safely
  params = {'firstname': first_name, 'lastname': last_name, 'email': email}
  r = requests.get(url, params=params)
  if r.ok:
    print(f"An email has been sent to {email}")
  else:
    print(f"The request failed with status code {r.status_code}.")

In [None]:
#@title ### Set the licence key
#@markdown Please check your email for the licence key, which can be pasted below:

licence_key = "" #@param {type:"string"}

import os
os.environ["SYNTHESIZED_KEY"] = licence_key
print(f"Set Synthesized licence key to {licence_key}.")

#@markdown The Synthesized SDK will be installed once you have entered the key and run this cell (⌘/ctrl+Enter).
!pip install -q imgaug==0.2.5
!pip install -q --pre "synthesized[colab]>=1.5rc" --extra-index https://colab:AP3DrAqXTX3dSMVAW1SwowpKgsh@synthesizedio.jfrog.io/artifactory/api/pypi/synthesized-colab/simple

import synthesized



# Example 1 - Bootstrapping Data

This workflow is one of the simplest and **it takes up to 4 minutes.**

To create a generative model with the Synthesized SDK, we use the `HighDimSynthesizer` object from the library. First, we need to extract the meta-information from the data frame by calling `MetaExtractor.extract`, which creates a `DataFrameMeta` object, `df_meta`.

Next, we use `df_meta` to construct the `HighDimSynthesizer`. When we call `synthesizer.learn()`, the `HighDimSynthesizer` learns patterns in the data that it can later use for generation.
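
To give a feel for what "meta-information" can mean, here is a minimal sketch in plain Python. It is not the SDK's `MetaExtractor`; the function name `sketch_column_meta`, the toy rows, and the fields it records are all assumptions made for illustration.

```python
from collections import Counter


def sketch_column_meta(rows):
    """Summarise each column of a list-of-dicts table (illustrative only)."""
    meta = {}
    columns = rows[0].keys() if rows else []
    for name in columns:
        # Missing values are represented by None in this toy table
        values = [row[name] for row in rows if row[name] is not None]
        numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        info = {
            "kind": "continuous" if numeric else "categorical",
            "n_missing": len(rows) - len(values),
            "n_unique": len(set(values)),
        }
        if numeric:
            info["min"], info["max"] = min(values), max(values)
        else:
            info["top"] = Counter(values).most_common(1)[0][0] if values else None
        meta[name] = info
    return meta


# Hypothetical rows, loosely shaped like an insurance claims table
toy_rows = [
    {"age": 34, "region": "north", "claim": 120.5},
    {"age": 51, "region": "south", "claim": None},
    {"age": 29, "region": "north", "claim": 80.0},
]
toy_meta = sketch_column_meta(toy_rows)
print(toy_meta)
```

A summary of this kind (column type, range, missing values) is the sort of information a generative model needs before it can decide how to encode each column.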


In [None]:
import pandas as pd
from synthesized import HighDimSynthesizer, MetaExtractor

In [None]:
df1 = pd.read_csv('https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/claim_prediction.csv'); df1

In [None]:
# Extract the meta information from the dataset
df1_meta = MetaExtractor.extract(df=df1)

# Construct and train the generative model
synth1 = HighDimSynthesizer(df1_meta)
synth1.learn(df_train=df1)

In [None]:
# Let's now create an additional 1000 rows
df1_synth = synth1.synthesize(1000); df1_synth

We can use the `Assessor` object to do a quick visual comparison of the newly generated data with the original.

In [None]:
from synthesized.testing import Assessor

In [None]:
Assessor(df1_meta).show_distributions(df1, df1_synth)
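
A visual comparison can be complemented with a number. The sketch below computes the two-sample Kolmogorov-Smirnov statistic for one numeric column in plain Python. This is an illustrative choice of distance, not necessarily what the `Assessor` uses, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical values for one column of the original and synthetic data
original = [1.0, 2.0, 3.0, 4.0, 5.0]
close_copy = [1.1, 2.1, 2.9, 4.2, 4.8]
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]

ks_close = ks_statistic(original, close_copy)
ks_shifted = ks_statistic(original, shifted)
print(ks_close, ks_shifted)
```

A small value suggests the synthetic column follows the original distribution closely, while a value near 1 means the two barely overlap.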

# Example 2 - Reshaping Data

When creating a predictive model for imbalanced classification, one may encounter a number of pitfalls: some models are unsuitable, model explainability may suffer and unwanted biases may be propagated.

To solve these problems, the Synthesized SDK enables fast and accurate rebalancing of datasets through conditional sampling of the generative model. With just two lines of extra code we can create a balanced dataset for model training!

Here, the dataset used is a [public credit scoring dataset from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data).

**Read more:**

- Our [blog post](https://www.synthesized.io/post/solving-data-imbalance-with-synthetic-data) with a more in-depth analysis of a balanced dataset.
- The [SDK documentation](https://docs.synthesized.io/latest/user_guide/conditions.html) for more ways to enhance and reshape your data. 


In [None]:
import pandas as pd
from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.insight.metrics import modelling_metrics as metrics

In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/credit.csv'); df2

In [None]:
pms = metrics.PredictiveModellingScore('Linear', y_label='SeriousDlqin2yrs')
print('Predictive Modelling ROC AUC', pms(df2))

We've trained a model on the dataset to predict `'SeriousDlqin2yrs'` and evaluated its performance. Now let's use the generative model to improve that result.
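
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following plain-Python sketch computes it by counting pairs. It only illustrates the metric and is not how `PredictiveModellingScore` is implemented; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why the metric remains informative on imbalanced data where plain accuracy can be misleading.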

In [None]:
# Extract the meta information from the dataset
df2_meta = MetaExtractor.extract(df2)

# Construct and train the generative model
synth2 = HighDimSynthesizer(df2_meta)
synth2.learn(df2)

We train the generative model in the same manner as before. 

Once the model has been trained, we can wrap it with a `ConditionalSampler` that can be queried to produce a new dataset with a balanced distribution of `'SeriousDlqin2yrs'`. Our desired distribution is specified using the `explicit_marginals` parameter. We can then compare a classifier trained on the balanced data to the original classifier, and also visualize the effect of reshaping the data using the `Assessor`.
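
To make the `explicit_marginals` format concrete, the sketch below reads a list of `(category, share)` pairs and works out how many rows each category would receive, alongside the marginal distribution of an imbalanced column. The helper names and the 93/7 split are assumptions for illustration; this does not reproduce how `ConditionalSampler` samples.

```python
from collections import Counter


def marginal_distribution(values):
    """Share of rows taken by each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def target_row_counts(marginals, num_rows):
    """Rows per category implied by (category, share) pairs that sum to 1."""
    total_share = sum(share for _, share in marginals)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {category: round(share * num_rows) for category, share in marginals}


# Hypothetical imbalanced label column: 93% zeros, 7% ones
imbalanced = [0] * 93 + [1] * 7
before = marginal_distribution(imbalanced)

# Same format as the explicit_marginals value used in this example
target = [(0, 0.5), (1, 0.5)]
after_counts = target_row_counts(target, num_rows=len(imbalanced))
print(before, after_counts)
```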

In [None]:
from synthesized import ConditionalSampler
from synthesized.testing import Assessor

In [None]:
sampler = ConditionalSampler(synth2)
df2_balanced = sampler.synthesize(num_rows=len(df2), explicit_marginals={'SeriousDlqin2yrs': [(0, 0.5), (1, 0.5)]})

In [None]:
pmc = metrics.PredictiveModellingComparison('Linear', y_label='SeriousDlqin2yrs')

# Greater than 1 -> the new dataset produced a better result than the original dataset
# when evaluated on some held-out data.
print('Ratio of ROC AUC using df2_balanced / df2', pmc(df2, df2_balanced))

In [None]:
Assessor(df2_meta).show_distributions(df2, df2_balanced)

# Example 3 - Data Anonymization

Privacy needs differ between users and applications, so the model offers a Differential Privacy training option that gives you the flexibility to control how much information about the original dataset can be extracted from the synthetic data.

A privacy evaluation module is also provided as part of the SDK to check that each user's privacy needs are met. This Colab version of the SDK contains a small subset of the available metrics and evaluations for conducting preliminary privacy assessments.

Here, we compare how robust generative modelling is against an attribute inference attack, with and without Differential Privacy. The dataset is a [German Credit Dataset from Kaggle](https://www.kaggle.com/uciml/german-credit).


**Read more:** 
- Docs about [differential privacy](https://docs.synthesized.io/latest/user_guide/privacy/differential_privacy.html) and [privacy assessment](https://docs.synthesized.io/latest/user_guide/evaluation/privacy.html).


In [None]:
import pandas as pd
from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.config import HighDimConfig

In [None]:
df3 = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/staging/data/german_credit_data.csv"); df3

Below, we train two synthesizers, one with default configuration and the second one with Differential Privacy enabled, and we sample datasets for both.

In [None]:
df3_meta = MetaExtractor.extract(df3)

# Learn and synthesize the dataset with default configuration
synth3 = HighDimSynthesizer(df3_meta)
synth3.learn(df3)
df3_synth = synth3.synthesize(len(df3))

In [None]:
# Learn and synthesize the dataset with Differential Privacy
synth3_dp = HighDimSynthesizer(df3_meta, config=HighDimConfig(differential_privacy=True))
synth3_dp.learn(df3)
df3_synth_dp = synth3_dp.synthesize(len(df3)); df3_synth_dp

In [None]:
from synthesized.insight.metrics import privacy

In [None]:
metric = privacy.AttributeInferenceAttackML(
    model='GradientBoosting', 
    sensitive_col='Credit amount',
    predictors=['Age', 'Sex', 'Housing']
)

In [None]:
print(metric(df3, df3_synth))
print(metric(df3, df3_synth_dp))
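
The idea behind an attribute inference attack can be sketched in a few lines: an attacker who only sees the synthetic data learns to guess a sensitive column from the other columns, and the guess is then scored against the original records. The plain-Python sketch below uses a simple group-mean lookup as the attacker. The SDK's `AttributeInferenceAttackML` uses a machine learning model and its own scoring, so the helper names, the toy rows, and the error measure here are assumptions.

```python
from collections import defaultdict
from statistics import mean


def fit_attacker(synthetic_rows, predictors, sensitive_col):
    """Learn the mean sensitive value for each combination of predictor values."""
    groups = defaultdict(list)
    for row in synthetic_rows:
        groups[tuple(row[p] for p in predictors)].append(row[sensitive_col])
    fallback = mean(row[sensitive_col] for row in synthetic_rows)
    lookup = {key: mean(values) for key, values in groups.items()}
    # Unseen predictor combinations fall back to the overall mean
    return lambda row: lookup.get(tuple(row[p] for p in predictors), fallback)


def attack_error(original_rows, guess, sensitive_col):
    """Mean absolute error of the attacker's guesses on the original rows."""
    return mean(abs(guess(row) - row[sensitive_col]) for row in original_rows)


# Hypothetical records using column names from the German credit dataset
original = [
    {"Sex": "male", "Housing": "own", "Credit amount": 1000},
    {"Sex": "male", "Housing": "rent", "Credit amount": 3000},
    {"Sex": "female", "Housing": "own", "Credit amount": 2000},
]
leaky_synth = [dict(row) for row in original]  # a verbatim copy leaks everything
vague_synth = [dict(row, **{"Credit amount": 2000}) for row in original]

predictors = ["Sex", "Housing"]
leaky_error = attack_error(original, fit_attacker(leaky_synth, predictors, "Credit amount"), "Credit amount")
vague_error = attack_error(original, fit_attacker(vague_synth, predictors, "Credit amount"), "Credit amount")
print(leaky_error, vague_error)
```

In this toy setup a lower error means the synthetic data reveals more about the sensitive column of the original records, so the verbatim copy is the worst case for privacy.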

# Example 4 - Data bias (New!)

With the recent release of [Fairlens](https://github.com/synthesized-io/fairlens), we can now measure biases within datasets.

We can use the SDK to upsample rare groups in the data in order to check for other biases that may be hidden.

The full version of the SDK offers the ability to mitigate the biases that are detected whilst preserving the other properties of the dataset.

For this example we use [the COMPAS dataset](https://github.com/propublica/compas-analysis/).

In [None]:
# install the fairlens library
! pip install -q fairlens

In [None]:
import fairlens as fl
import pandas as pd
from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor

In [None]:
df4 = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/fairlens/main/datasets/compas.csv"); df4

In [None]:
fs = fl.FairnessScorer(df4, 'RawScore')
fs.demographic_report()
fs.plot_distributions()

In [None]:
df4_meta = MetaExtractor.extract(df4)
synth4 = HighDimSynthesizer(df4_meta)

In [None]:
synth4.learn(df4)

In [None]:
sampler = ConditionalSampler(synth4)
df4_balanced = sampler.synthesize(num_rows=len(df4), explicit_marginals={'Sex': [('Male', 0.5), ('Female', 0.5)]})

In [None]:
fs_balanced = fl.FairnessScorer(df4_balanced, 'RawScore')
fs_balanced.demographic_report()
fs_balanced.plot_distributions()
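
As a rough intuition for what a bias measurement looks at, the sketch below compares the mean target value of each sensitive group with the mean of everyone else. It is a deliberately simple stand-in written in plain Python, not Fairlens's scoring method; the helper name and the toy rows are assumptions, while the column names `Sex` and `RawScore` come from the example above.

```python
from statistics import mean


def group_mean_gaps(rows, sensitive_col, target_col):
    """Mean target of each group minus the mean target of all other rows."""
    gaps = {}
    for group in {row[sensitive_col] for row in rows}:
        inside = [r[target_col] for r in rows if r[sensitive_col] == group]
        outside = [r[target_col] for r in rows if r[sensitive_col] != group]
        if inside and outside:
            gaps[group] = mean(inside) - mean(outside)
    return gaps


# Hypothetical rows shaped like the COMPAS columns used in this example
toy = [
    {"Sex": "Male", "RawScore": 6},
    {"Sex": "Male", "RawScore": 8},
    {"Sex": "Female", "RawScore": 3},
    {"Sex": "Female", "RawScore": 5},
]
gaps = group_mean_gaps(toy, "Sex", "RawScore")
print(gaps)
```

Running a check like this on both the original and the rebalanced data shows whether a gap persists once the groups are equally represented.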

# Conclusions

While this notebook is focused on just some of the many benefits of generative models, it gives you a glimpse into how you can quickly start leveraging the SDK in development and testing of machine learning models and beyond.

You can learn about other features of the Synthesized SDK [in the Docs](https://docs.synthesized.io/latest/). 

### Licence Agreement

Please note that your use of this colab environment is subject to the following terms and policies:
* https://www.synthesized.io/privacy-policy
* https://www.synthesized.io/data-processing-addendum
* https://www.synthesized.io/terms-of-service
* https://support.google.com/drive/answer/2450387?hl=en