# From Expectations to Synthetic Data generation
**A practical flow to generate synthetic data**

<p style="text-align:center;"><img src="img/logos.png" alt = "test pic" width="700" height="300"></p>

## 1. Expectations & profiling

Combining `pandas-profiling`, `ydata-synthetic` and `great-expectations`.

Synthetic data is artificially generated data that is not collected from real-world events. It replicates the statistical components of actual data without containing any identifiable information, ensuring individuals’ privacy. 

But how can one be sure that the generated synthetic data meets the quality standards of the original data? And what does synthetic data quality mean?
The truth is that real-world data quality depends on the downstream application, and the same applies to generated data.

Nevertheless, there is one thing we always want to preserve: fidelity, meaning the statistical properties and business rules observed in the original data.

**And that's what this notebook series is about**: how to build a flow, leveraging relevant open-source tools, to profile the original data and validate rules and expectations against the generated synthetic data.
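To make "fidelity" a bit more concrete, here is a minimal sketch of the kind of check it implies: comparing basic statistics of a real and a synthetic table, and testing a simple business rule. The toy values, the column names `age` and `weight`, the rule thresholds and the helper `compare_summary_stats` are all assumptions made up for illustration; they are not taken from the dataset used in this notebook.

```python
import pandas as pd


def compare_summary_stats(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Put per-column mean and standard deviation of both tables side by side."""
    stats = pd.DataFrame({
        "real_mean": real.mean(),
        "synth_mean": synthetic.mean(),
        "real_std": real.std(),
        "synth_std": synthetic.std(),
    })
    # Absolute gap between the means: a crude, univariate fidelity signal
    stats["mean_abs_diff"] = (stats["real_mean"] - stats["synth_mean"]).abs()
    return stats


# Toy stand-ins for the original and the generated data (hypothetical values)
real = pd.DataFrame({"age": [39, 45, 52, 58, 61],
                     "weight": [62.0, 70.5, 81.0, 77.5, 90.0]})
synthetic = pd.DataFrame({"age": [41, 44, 50, 59, 60],
                          "weight": [64.0, 69.0, 83.5, 75.0, 88.0]})

report = compare_summary_stats(real, synthetic)
print(report)

# A hypothetical business rule: ages within [0, 120] and strictly positive weights
rules_hold = bool(synthetic["age"].between(0, 120).all()
                  and (synthetic["weight"] > 0).all())
print("Business rules hold:", rules_hold)
```

The rest of the flow replaces this kind of hand-written check with a full profile and an expectation suite.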


In [3]:
### Installing required packages
#%%capture
#!pip install great-expectations==0.13.4
#!pip install pandas-profiling==2.1.0
#!pip install jinja2==3.0.0
#!pip install matplotlib==3.4.0

<div class="alert alert-block alert-success">
<b>Up to you:</b> 
    <p>To run this notebook we recommend using a virtual environment (pip or conda) and creating a dedicated Jupyter kernel. If you are leveraging your machine's <strong>GPU</strong> to train and generate synthetic data, please check our <a href="https://github.com/ydataai/ydata/blob/dev/setup-utils/conda_tensorflowGPU.sh">environment setup file for TensorFlow</a>.</p>
</div>

### The dataset

In [4]:
#Add here the dataset choice to be explored to deliver this use case
import pandas as pd
import json

dataset_name = "Cardiovascular"
data = pd.read_csv('cardio.csv')

## Generate the data profile
**Exploratory Data Analysis (EDA)**

EDA, or data profiling, is one of the most important core steps of any data science pipeline. This step helps data science teams understand the data they are working with, discover important patterns and anomalies, and validate business assumptions, among other things.

Although the process of EDA seems straightforward, it involves a lot of repetitive work across all the variables involved. It only gets more complicated and time-consuming as datasets grow in dimensionality.
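To illustrate the repetition, here is a rough sketch of what even a tiny hand-written profile involves: the same handful of checks repeated for every column. The helper `manual_profile` and the toy columns are hypothetical and only meant as an illustration.

```python
import pandas as pd


def manual_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect a few basic facts about each column, one column at a time."""
    rows = []
    for column in df.columns:
        series = df[column]
        rows.append({
            "column": column,
            "dtype": str(series.dtype),
            "missing": int(series.isna().sum()),
            "distinct": int(series.nunique()),  # NaN is not counted as a value
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data with one missing value (hypothetical columns)
toy = pd.DataFrame({"age": [39, 45, None, 58],
                    "smoker": ["no", "yes", "no", "no"]})

summary = manual_profile(toy)
print(summary)
```

Histograms, correlations, duplicate detection and warnings would each need more code of the same kind, and that is the part worth automating.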

`pandas-profiling` is the open-source solution that enables the standardization and automation of this initial profile. The package can be described as one-line-of-code EDA.

`pandas-profiling` generates a profile report from a pandas `DataFrame`, which can easily be exported to JSON or HTML, or rendered as ipywidgets.

More details can be found on the [pandas-profiling GitHub](https://github.com/ydataai/pandas-profiling/) or in the [pandas-profiling docs](https://pandas-profiling.ydata.ai/docs/master/index.html).


In [5]:
from pandas_profiling import ProfileReport

#Generating the standard profiling report
title = f"EDA: {dataset_name}"
profile = ProfileReport(data, title=title)

In [7]:
#Exploring as widget
#profile.to_widgets()

#Exploring as HTML inside Jupyter Notebook
profile.to_notebook_iframe()

In [8]:
#Exporting the report as HTML
profile.to_file(f"eda_{dataset_name}.html")

<div class="alert alert-block alert-info">
<b>Tip:</b> Depending on dataset size or downstream applications, pandas-profiling lets you generate a lighter version of your report, in which expensive computations such as correlations and interactions are skipped.
</div>

In [9]:
explr_profile = ProfileReport(data, title=title, explorative=True)

In [10]:
json_profile = explr_profile.to_json()

In [11]:
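# to_json() already returns a JSON string, so write it as-is
# (json.dump would serialize the string a second time)
with open(f'.profile_{dataset_name}.json', 'w') as f:
    f.write(json_profile)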
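One detail worth knowing when saving the profile: `to_json()` returns a string that is already serialized JSON. The sketch below uses only the standard library and a stand-in string (its content is made up) to show what happens when such a string is passed through `json.dumps` a second time, compared with parsing it directly.

```python
import json

# Stand-in for the string returned by `to_json()` (hypothetical content)
json_string = json.dumps({"table": {"n": 3}})

# Serializing the string again wraps it in quotes and escapes the inner ones
double_encoded = json.dumps(json_string)

once = json.loads(double_encoded)   # still a str, not a dict
twice = json.loads(once)            # a dict only after a second parse
direct = json.loads(json_string)    # a dict after a single parse

print(type(once).__name__, type(twice).__name__, type(direct).__name__)
```

In practice, a file produced by calling `json.dump` on the string needs two `json.loads` calls to get the dictionary back, while a file holding the string as-is (for example written with `f.write`) needs only one.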

## Generating and saving the expectations suite

`great-expectations` is a widely used open-source package for validating and documenting your data, so that its quality is maintained along the flow and communication between the teams involved in a data project improves. **Great Expectations** is all about writing unit tests for your data, but with *Expectations*!
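Conceptually, an expectation is an assertion about the data that returns a structured result instead of raising an error. The sketch below is a hand-rolled, pandas-only stand-in named after the library's `expect_column_values_to_be_between` expectation. It is not the library's implementation, its result dictionary is simplified, and the toy column and bounds are assumptions.

```python
import pandas as pd


def expect_column_values_to_be_between(df, column, min_value, max_value):
    """Illustrative stand-in: check that all values fall inside [min_value, max_value]."""
    values = df[column]
    unexpected = values[(values < min_value) | (values > max_value)]
    return {
        "success": bool(unexpected.empty),
        "unexpected_count": int(unexpected.size),
        "unexpected_values": unexpected.tolist(),
    }


# Toy data with one implausible age (hypothetical values)
toy = pd.DataFrame({"age": [39, 45, 52, 130]})

result = expect_column_values_to_be_between(toy, "age", 0, 120)
print(result)
```

A suite is a collection of such checks. Generating it from the profile means the bounds and types come from what was observed in the original data rather than being typed in by hand.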

In [20]:
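import great_expectations as ge

# Load the Great Expectations project stored in the local "great_expectations" folder
data_context = ge.data_context.DataContext(context_root_dir="great_expectations")

# Build the expectation suite from the profile. Saving and validation are
# disabled here because the next cells perform them explicitly.
suite = explr_profile.to_expectation_suite(suite_name=f"{dataset_name}_expectations",
                                           data_context=data_context,
                                           save_suite=False,
                                           run_validation=False,
                                           build_data_docs=False)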

In [22]:
# Persist the suite through the data context
data_context.save_expectation_suite(suite)

In [23]:
# Wrap the DataFrame as a Great Expectations dataset bound to the suite
batch = ge.dataset.PandasDataset(data, expectation_suite=suite)

In [24]:
# Validate the batch against its suite using the "action_list_operator"
results = data_context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)

## Summary & Next steps

Although data profiling may seem commoditized thanks to tools like `numpy`, `scikit-learn` or even `pandas`, our beloved `describe` is not enough to explore the data from both a univariate and a multivariate perspective. The process becomes even more time-consuming with larger volumes of data. Data synthesis is not independent of this initial exploration: the biggest impact on the quality of our synthetic data comes from choices such as data preparation, as we are going to cover next.
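As a small illustration of why `describe` alone falls short, the two toy tables below (made-up values) have identical per-column summaries while their columns are related in opposite ways. Only a multivariate statistic such as the correlation tells them apart.

```python
import numpy as np
import pandas as pd

# Same values in each column, but y moves with x in `a` and against x in `b`
a = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 2, 3, 4, 5]})
b = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})

# Univariate view: the describe() tables match
same_univariate = bool(np.allclose(a.describe(), b.describe()))

# Multivariate view: the Pearson correlations have opposite signs
corr_a = a["x"].corr(a["y"])
corr_b = b["x"].corr(b["y"])

print("describe() identical:", same_univariate)
print("corr in a:", round(corr_a, 3), "| corr in b:", round(corr_b, 3))
```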

The next notebooks include:
- Training and generation of conditional synthetic data
- Synthetic data profiling & Expectations validation