# From Expectations to Synthetic Data Generation

## 3. Synthetic data & expectations

After generating the synthetic data, we need to assess its quality. In this flow we focus only on data fidelity, which we assess with both `pandas-profiling` and `great-expectations`.


### The dataset - Real and Synthetic data

In [6]:
import json
import pandas as pd

dataset_name = "BankChurn"
real = pd.read_csv('BankChurners.csv')
synth = pd.read_csv(f'synth_{dataset_name}', index_col=0)

# Read the JSON profile generated from the real data
with open(f'.profile_{dataset_name}.json') as f:
    json_profile = json.load(f)
# json.load returns a JSON-encoded string here, so decode it a second time
json_profile = json.loads(json_profile)
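
The two decoding steps above (`json.load` followed by `json.loads`) are only needed when the file holds a JSON-encoded string rather than a JSON object. That would happen if, for example, an already serialized profile string was passed to `json.dump` when the file was written. How `.profile_BankChurn.json` was produced is not shown here, so this is an assumption; the sketch below reproduces the situation with a hypothetical toy payload and a temporary file.

```python
import json
import os
import tempfile

# Hypothetical stand-in for a profile report; not the real profile content.
profile = {"table": {"n": 5}, "variables": {"Customer_Age": {"min": 40, "max": 51}}}

# Assumed writer: the profile is serialized to a string first, and that string
# is dumped again, so the file contains a JSON string instead of a JSON object.
profile_as_text = json.dumps(profile)
path = os.path.join(tempfile.mkdtemp(), "toy_profile.json")
with open(path, "w") as f:
    json.dump(profile_as_text, f)

# Reading it back therefore takes two decoding passes.
with open(path) as f:
    first_pass = json.load(f)          # still a str after one decode
second_pass = json.loads(first_pass)   # now the original dict

print(type(first_pass).__name__, type(second_pass).__name__)
```

If the file had been written with a single `json.dump(profile, f)`, one `json.load` would be enough and the second call would raise a `TypeError`.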

In [7]:
real.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,9.3e-05,0.99991
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,5.7e-05,0.99994
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,2.1e-05,0.99998
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,0.000134,0.99987
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,2.2e-05,0.99998


In [8]:
synth.head()

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,1,18,F,0,Unknown,Single,Less than $40K,Blue,0,3,...,-19194.160156,399,-13395.407227,-0.07356,4348,44,-0.15284,-0.025812,-0.09561,0.167355
1,1,7,F,-4,Unknown,Single,Less than $40K,Gold,-22,4,...,-31210.564453,526,-6185.100586,-0.890604,8321,47,-2.072164,0.624603,-0.374166,-0.027725
2,1,14,F,0,Unknown,Single,Less than $40K,Blue,-3,3,...,-17750.167969,-271,-10119.250977,-0.778567,2598,33,-0.475148,-0.171177,0.035958,-0.112372
3,1,0,F,-3,Unknown,Single,Less than $40K,Gold,-34,3,...,-49034.042969,-2627,-26602.591797,-0.920144,4179,38,-2.743991,-0.244227,-0.335357,-0.75354
4,1,15,F,-2,Unknown,Single,Less than $40K,Blue,-16,5,...,-17883.761719,107,-10018.253906,-0.185765,5535,38,-0.727267,0.367464,-0.063643,-0.026639
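
The two previews already hint at fidelity problems: the synthetic rows contain ages, credit limits and ratios that fall outside anything seen in the real rows. A quick way to quantify this before running any profiling is to measure, per numeric column, the share of synthetic values outside the real minimum and maximum. The sketch below is a minimal illustration, not part of the original flow; `out_of_range_share` is a hypothetical helper, and the toy frames reuse only a few values from the five-row previews above, so the ranges are those of the previews and not of the full dataset.

```python
import pandas as pd


def out_of_range_share(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """Share of synthetic values outside the real [min, max], per shared numeric column."""
    numeric_cols = [
        col for col in real.select_dtypes("number").columns
        if col in synth.columns and pd.api.types.is_numeric_dtype(synth[col])
    ]
    shares = {}
    for col in numeric_cols:
        lo, hi = real[col].min(), real[col].max()
        outside = (synth[col] < lo) | (synth[col] > hi)
        shares[col] = outside.mean()  # mean of booleans = fraction that is True
    return pd.Series(shares, dtype=float)


# Toy frames built from values visible in the real.head() and synth.head() previews.
toy_real = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40, 40],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    "Total_Trans_Ct": [42, 33, 20, 20, 28],
})
toy_synth = pd.DataFrame({
    "Customer_Age": [18, 7, 14, 0, 15],
    "Credit_Limit": [-19194.160156, -31210.564453, -17750.167969, -49034.042969, -17883.761719],
    "Total_Trans_Ct": [44, 47, 33, 38, 38],
})

report = out_of_range_share(toy_real, toy_synth)
print(report)
```

On the full data, the same call would be `out_of_range_share(real, synth)`. A high share does not by itself prove the synthesizer is poor, since real minima and maxima are only sample bounds, but values such as negative ages are a clear signal.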


#### Profiling the synthetic data

In [9]:
from pandas_profiling import ProfileReport

In [10]:
title = f"Synth: {dataset_name}"
synth_profile = ProfileReport(synth, title=title)

In [11]:
synth_profile.to_file('synth_profile.html')

#### Running the expectations

In [7]:
import great_expectations as ge

data_context = ge.data_context.DataContext(context_root_dir="great_expectations")

# Load the previously built suite
suite = data_context.get_expectation_suite(f"{dataset_name}_expectations")

In [8]:
batch = ge.dataset.PandasDataset(synth, expectation_suite=suite)

In [9]:
results = data_context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)
validation_result_identifier = results.list_validation_result_identifiers()[0]

In [10]:
# Build and open the Data Docs
data_context.build_data_docs()
data_context.open_data_docs(validation_result_identifier)
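
The contents of the `BankChurn_expectations` suite are not shown in this section. As a rough illustration of the kind of rule such a suite can hold, the sketch below is a plain-pandas stand-in (not the Great Expectations implementation) for a set-membership check in the spirit of `expect_column_values_to_be_in_set`. The helper `values_in_set_check` and the shape of its result dictionary are hypothetical; the toy values are taken from the `Attrition_Flag` and `Gender` columns in the previews above.

```python
import pandas as pd


def values_in_set_check(column: pd.Series, allowed: set) -> dict:
    """Report which values of `column` are not members of `allowed`."""
    unexpected = column[~column.isin(allowed)]
    return {
        "success": unexpected.empty,
        "unexpected_count": int(unexpected.size),
        "unexpected_values": sorted(unexpected.unique().tolist(), key=str),
    }


# Allowed categories are derived from the real preview rows (an assumption:
# the full dataset may contain more categories than these five rows show).
real_flags = pd.Series(["Existing Customer"] * 5)
real_gender = pd.Series(["M", "F", "M", "F", "M"])

# Synthetic preview rows: Attrition_Flag is the number 1, Gender is always "F".
synth_flags = pd.Series([1, 1, 1, 1, 1])
synth_gender = pd.Series(["F"] * 5)

flag_result = values_in_set_check(synth_flags, set(real_flags.unique()))
gender_result = values_in_set_check(synth_gender, set(real_gender.unique()))

print(flag_result)
print(gender_result)
```

In this toy case the `Attrition_Flag` check fails because the synthetic column holds numeric codes while the real one holds labels, which is the sort of mismatch the validation results in the Data Docs are meant to surface.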