<img src="https://uploads-ssl.webflow.com/5f2d65b321549c3a6228ce06/60892a20edbd1da3fd641167_Synthesized%20logo.png" width="350" alt="Synthesized" align="left">

# Introduction

Welcome to the demo notebook, where we showcase some of the core features of the Synthesized SDK.

Apply the Synthesized SDK to automatically create a **general-purpose generative model** for any dataset to

* Bootstrap data where the density of data is low
* Automatically reshape data as you like 
* Anonymise data for repurposing

and more, ultimately improving the performance of your own models. You can learn about other features of the Synthesized SDK [in the Docs](https://docs.synthesized.io/v1.4/).

**Note: This is a Google Colab version of the SDK.** To use the SDK outside Colab (for example in a production environment or on-premise/private cloud, to connect to databases, integrate into ETL pipelines, work natively with Spark and big data sources, or simply move beyond a single in-memory dataframe), reach out to letschat@synthesized.io for a commercial version of the SDK. You can read more about it at http://synthesized.io/sdk-for-data-manipulation.

**Note: If you want to save your progress and come back to your work in a new session, you must copy this notebook to your Google Drive.**


# Synthesized License Key

In [None]:
#@title ### Install Synthesized
#@markdown This cell will grab you a licence key and install the Synthesized python SDK.

#@markdown Install with `⌘/ctrl+Enter`
import os
import requests

# url = f'https://us-central1-synthesized-cloud-275014.cloudfunctions.net/free-licence-request'
# licence_key = requests.get(url)
licence_key = "SmRvOG1TbGF6MnRRNU5KMkZneFZQQ09wTWZpNGhQTjlCaUpRSnV5SWU3K011dlJmZ3QrRFZ2eWhYbGszQmM3MWFTeUtVVjJKd3ptRSszcTJaL2x4amc9PXsiZXhwaXJ5IjogIjIwMjMtMDctMjgiLCAiZmVhdHVyZV9pZHMiOiBbIioiXSwgImNvbGFiIjogdHJ1ZX0="

os.environ["SYNTHESIZED_KEY"] = licence_key
print(f"Set Synthesized licence key to {licence_key}.")

!pip install -q imgaug==0.2.5
!pip install -q --pre synthesized==1.4.* --extra-index https://colab:AP3DrAqXTX3dSMVAW1SwowpKgsh@synthesizedio.jfrog.io/artifactory/api/pypi/synthesized-colab/simple

import synthesized



# Use Case 1 - Generating Data

In all of these workflows we use the HighDimSynthesizer object from the library. This is a key object for creating a generative model with the Synthesized SDK. Before using it, we need to extract all meta-information from the dataframe. This is done by calling `MetaExtractor.extract`, which will create a `df_meta: DataFrameMeta` object.

Next we use `df_meta` to create a `synthesizer: HighDimSynthesizer` and call `synthesizer.learn()`, which lets the HighDimSynthesizer learn patterns in the data that it can later use for generation.
**This step takes approximately 4 minutes.**

In [None]:
import pandas as pd

In [None]:
df1 = pd.read_csv('https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/claim_prediction.csv'); df1

In [None]:
from synthesized import HighDimSynthesizer, MetaExtractor

In [None]:
df1_meta = MetaExtractor.extract(df=df1)
synth = HighDimSynthesizer(df1_meta)
synth.learn(df_train=df1)

In [None]:
df1_synth = synth.synthesize(1000); df1_synth

In [None]:
from synthesized.insight import metrics
from synthesized.testing import Assessor

In [None]:
asr = Assessor(df1_meta)
asr.df_model.fit(df1).plot(); df1

# Use Case 2 - Data Rebalancing

Most real-world datasets are highly skewed and show bias towards a particular outcome, category or segment - especially those related to the detection of rare events. Additionally, it is well known that most machine learning models have limited sensitivity to low density regions of a dataset and minority classes. As a result, when creating a predictive model for imbalanced classification, one may encounter a number of pitfalls: some models are unsuitable, model explainability may suffer and unwanted biases may be propagated.

To solve this problem, the Synthesized SDK enables fast and accurate rebalancing of datasets through conditional sampling of the generative model. With just two lines of extra code we can create a balanced dataset for model training!

See our [blog post](https://www.synthesized.io/post/solving-data-imbalance-with-synthetic-data) for a more in-depth analysis of a balanced dataset, and check out the [SDK documentation](https://docs.synthesized.io/v1.4/user_guide/augmentation/index.html) to see more ways to enhance and reshape your data. 

We start with a [public credit scoring dataset from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data), with the aim of predicting whether customers will experience financial distress in the next two years using the 'SeriousDlqin2yrs' indicator. As most customers already have good credit scores, this feature is highly imbalanced; only 7% of the dataset contains examples where a customer experienced financial distress. As a result, a model trained on this data will be biased towards predicting that all customers are creditworthy. Clearly that's not the desired outcome!
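
To make the pitfall concrete, here is a minimal sketch in plain Python. The labels are hypothetical and only mirror the roughly 7% positive rate quoted above; they are not taken from the Kaggle dataset. The sketch shows that a "model" which always predicts the majority class looks accurate while never detecting a single distressed customer.

```python
from collections import Counter

# Hypothetical labels mirroring the ~7% positive rate quoted above:
# 1 = experienced financial distress, 0 = did not.
labels = [1] * 7 + [0] * 93

# A degenerate "model" that always predicts the majority class.
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Accuracy: fraction of predictions that match the label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: fraction of distressed customers detected.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / labels.count(1)

print(f"majority class: {majority_class}")
print(f"accuracy: {accuracy:.2f}")
print(f"recall on the distressed class: {recall:.2f}")
```

High accuracy combined with zero recall is why accuracy alone is a poor guide on imbalanced data, and why rebalancing the training set can help.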



In [None]:
import pandas as pd
from synthesized import ConditionalSampler, HighDimSynthesizer, MetaExtractor
from synthesized.testing.plotting.distributions import categorical_distribution_plot

In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/credit.csv'); df2

We train the generative model in the same manner as before. Once learned, we can then wrap it with a `ConditionalSampler` that can be queried to produce a new dataset with a balanced distribution of 'SeriousDlqin2yrs'. Our desired distribution is specified using the `explicit_marginals` parameter. Here it sets the synthesized dataset to have a uniform distribution for 'SeriousDlqin2yrs'.

In [None]:
df2_meta = MetaExtractor.extract(df2)

synthesizer = HighDimSynthesizer(df2_meta)
synthesizer.learn(df2)

sampler = ConditionalSampler(synthesizer)
df2_balanced = sampler.synthesize(num_rows=len(df2), explicit_marginals={'SeriousDlqin2yrs': {'0': 0.5, '1': 0.5}})

categorical_distribution_plot(df2['SeriousDlqin2yrs'], df2_balanced['SeriousDlqin2yrs'])
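
For intuition about what a target marginal such as `{'0': 0.5, '1': 0.5}` asks for, the sketch below hits the same target by naive resampling with replacement. This is not how the SDK works: `resample_to_marginals` is a hypothetical helper, the rows are invented, and the helper only copies existing rows, whereas the conditional sampler draws new rows from the learned generative model.

```python
import random
from collections import Counter


def resample_to_marginals(rows, column, marginals, num_rows, seed=0):
    """Resample rows with replacement so that `column` follows `marginals`.

    Hypothetical helper for illustration only: it duplicates existing rows
    rather than generating new ones.
    """
    rng = random.Random(seed)

    # Group the rows by the (stringified) value of the target column.
    groups = {}
    for row in rows:
        groups.setdefault(str(row[column]), []).append(row)

    # Allocate each class its share of the output, then draw with replacement.
    resampled = []
    for value, share in marginals.items():
        n_value = round(share * num_rows)
        resampled.extend(dict(r) for r in rng.choices(groups[value], k=n_value))

    rng.shuffle(resampled)
    return resampled


# Invented toy data with a 7% / 93% class split.
rows = [{"SeriousDlqin2yrs": 1, "age": 30 + i} for i in range(7)]
rows += [{"SeriousDlqin2yrs": 0, "age": 30 + i} for i in range(93)]

balanced = resample_to_marginals(
    rows, "SeriousDlqin2yrs", {"0": 0.5, "1": 0.5}, num_rows=len(rows)
)
counts = Counter(row["SeriousDlqin2yrs"] for row in balanced)
distinct_minority_ages = {r["age"] for r in balanced if r["SeriousDlqin2yrs"] == 1}

print(counts)
print(len(distinct_minority_ages))
```

The classes come out balanced, but every minority row is a copy of one of only seven originals. Sampling from a generative model is intended to avoid exactly this kind of duplication.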

# Use Case 3 - Privacy

At Synthesized we understand that the privacy needs of each user and application are different, so we give the user the flexibility to limit the amount of information that can be extracted about the original dataset by adding a Differential Privacy training option to the model. A privacy evaluation module is also provided as part of the SDK, to help verify that each user's privacy needs are met.

Read more about [differential privacy](https://docs.synthesized.io/v1.4/user_guide/augmentation/differential_privacy.html) and [privacy assessment](https://docs.synthesized.io/v1.4/user_guide/evaluation/privacy.html) in our documentation.

Here, we compare how robust Synthesized and Synthesized with Differential Privacy are against a linkage attack, using a split of the original dataset into two halves as the baseline. The dataset is a [German Credit Dataset from Kaggle](https://www.kaggle.com/uciml/german-credit).

In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/staging/data/german_credit_data.csv"); df3

Below, we train two synthesizers, one with default configuration and the second one with Differential Privacy enabled, and we sample datasets for both.

In [None]:
from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.config import HighDimConfig

In [None]:
df3_meta = MetaExtractor.extract(df3)

# Learn and synthesize the dataset with default configuration
synthesizer = HighDimSynthesizer(df3_meta)
synthesizer.learn(df3)
df3_synth = synthesizer.synthesize(len(df3))

# Learn and synthesize the dataset with Differential Privacy
synthesizer_dp = HighDimSynthesizer(df3_meta, config=HighDimConfig(differential_privacy=True))
synthesizer_dp.learn(df3)

In [None]:
df3_synth_dp = synthesizer_dp.synthesize(len(df3)); df3_synth_dp

Now, let's perform the linkage attack. Similar to inference attacks, this family of attacks refers to situations where an adversary might deduce, with significant probability, the value of a hidden sensitive attribute from the values of other attributes.

In this example, the attacker has access to three columns from the original dataset, 'Age', 'Sex' and 'Housing', and will try to disclose information about 'Credit amount'.
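
The basic mechanics of such an attack can be sketched in a few lines of plain Python. The records below are invented for illustration, and `guess_range` is a hypothetical helper; the SDK's `LinkageAttack` is more sophisticated than this. The attacker indexes a released dataset by the columns they already know, then looks up the sensitive values linked to a target's attributes.

```python
from collections import defaultdict

# Hypothetical released records; all values are invented for illustration.
released = [
    {"Age": 25, "Sex": "female", "Housing": "rent", "Credit amount": 1200},
    {"Age": 25, "Sex": "female", "Housing": "rent", "Credit amount": 1300},
    {"Age": 40, "Sex": "male", "Housing": "own", "Credit amount": 900},
    {"Age": 40, "Sex": "male", "Housing": "own", "Credit amount": 7800},
    {"Age": 61, "Sex": "male", "Housing": "free", "Credit amount": 5000},
]
known_columns = ["Age", "Sex", "Housing"]
attacked_column = "Credit amount"

# Index the sensitive values by the combination of attributes the attacker knows.
index = defaultdict(list)
for record in released:
    key = tuple(record[c] for c in known_columns)
    index[key].append(record[attacked_column])


def guess_range(target):
    """Return (min, max) of the sensitive values linked to the target, or None."""
    values = index.get(tuple(target[c] for c in known_columns))
    return (min(values), max(values)) if values else None


narrow = guess_range({"Age": 25, "Sex": "female", "Housing": "rent"})
wide = guess_range({"Age": 40, "Sex": "male", "Housing": "own"})
unknown = guess_range({"Age": 33, "Sex": "female", "Housing": "own"})

print(narrow)   # tight range: the link reveals a lot about the target
print(wide)     # broad range: the link reveals little
print(unknown)  # no matching records: nothing to link
```

A tight range for a small group is what makes a record vulnerable: knowing three innocuous attributes is enough to pin down the sensitive one.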

In [None]:
known_columns = ['Age', 'Sex', 'Housing']
attacked_column = 'Credit amount'

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from synthesized.privacy import LinkageAttack
from synthesized.testing.plotting import set_plotting_style, plot_linkage_attack

In [None]:
df3_train, df3_test = train_test_split(df3, test_size=0.5, random_state=42)

la = LinkageAttack(
    df_orig=df3_train,
    t_closeness=0.0, 
    k_distance=1.0, 
    max_n_vulnerable=25,
    key_columns=known_columns,
    sensitive_columns=[attacked_column],
)

set_plotting_style()
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

for i, (df3_i, name) in enumerate(zip([df3_test, df3_synth_dp], ['Original','Synthesized w/ Diff. Privacy'])):
    df3_la = la.get_attacks(df3_i, n_bins=20)
    plot_linkage_attack(df3_la[attacked_column], ax=axs[i])

    axs[i].set_title(name)
    axs[i].add_patch(plt.Rectangle((0, 0.3), 0.2, 1, fc='#312874', alpha=0.3))
    axs[i].set_xlim([0, 1])
    axs[i].set_ylim([0, 0.6])

The plots above show, for each cluster of the data, how significant the information the attacker can extract is (t-closeness, on the y-axis) and how different that information is from the rest of the dataset (k-distance, on the x-axis). Therefore, the only information that is valuable to the attacker lies in the upper-left corner of the graph, where the information gain is high and the information extracted is different.

As observed above, the default synthesizer is already secure against linkage attacks, but adding DP would further decrease the information gain of any attack.
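
As background on the t-closeness idea, its standard definition compares the distribution of a sensitive attribute inside a group with its distribution over the whole dataset. The sketch below illustrates that comparison on invented, pre-binned values. It uses total variation distance as a simple stand-in for the distance measure; the SDK's exact metric and binning may differ.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Invented, pre-binned credit amounts for the whole dataset and two clusters.
overall = ["low"] * 50 + ["medium"] * 30 + ["high"] * 20
typical_cluster = ["low"] * 5 + ["medium"] * 3 + ["high"] * 2
revealing_cluster = ["high"] * 10

d_typical = total_variation(distribution(typical_cluster), distribution(overall))
d_revealing = total_variation(distribution(revealing_cluster), distribution(overall))

print(f"typical cluster:   {d_typical:.2f}")
print(f"revealing cluster: {d_revealing:.2f}")
```

A cluster that mirrors the overall distribution tells the attacker nothing new (distance 0), while a cluster whose members all fall in one bin deviates strongly and therefore leaks information about anyone linked to it.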

# Conclusions

While this notebook focuses on just some of the many benefits of the generative models we've designed, we hope it showcases how you can quickly start leveraging the SDK in the development and testing of machine learning models and beyond.

You can learn about other features of the Synthesized SDK [in the Docs](https://docs.synthesized.io/v1.4/). 

# How can synthetic data help you?
Have a try below and let us know! Our documentation is available at https://docs.synthesized.io/v1.4/

### Licence Agreement

Please note that your use of this colab environment is subject to the following terms and policies:
* https://www.synthesized.io/privacy-policy
* https://www.synthesized.io/data-processing-addendum
* https://www.synthesized.io/terms-of-service
* https://support.google.com/drive/answer/2450387?hl=en