Evaluating Synthetic Data
=========================

A very common question when someone starts using **SDV** to generate
synthetic data is: *"How good is the data that I just generated?"*

To answer this question, **SDV** has a collection of metrics and tools
that allow you to compare the *real* data that you provided and the
*synthetic* data that you generated using **SDV** or any other tool.

In this guide we will show you how to perform this evaluation and how to
explore the different metrics that exist.

Using the SDV Evaluation Framework
----------------------------------

To evaluate the quality of synthetic data we essentially need two
things: *real* data and *synthetic* data that is meant to resemble it.

Let us start by loading a demo table and generating a synthetic replica
of it using the `GaussianCopula` model.

In [1]:
from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula

real_data = load_tabular_demo('student_placements')

model = GaussianCopula()
model.fit(real_data)
synthetic_data = model.sample()

After the previous steps we will have two tables:

-   `real_data`: A table containing data about student placements

In [2]:
real_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


-   `synthetic_data`: A synthetically generated table that contains data
    in the same format and with similar statistical properties as the
    `real_data`.

In [3]:
synthetic_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17318,M,66.725348,76.408487,Commerce,83.328878,Sci&Tech,False,0,52.696034,Mkt&Fin,60.457128,22813.6924,True,2020-01-23,2020-08-07,3.0
1,17372,M,76.784298,87.042473,Science,76.970188,Sci&Tech,False,0,88.01507,Mkt&HR,65.517693,31675.910373,True,2020-04-18,2020-06-08,3.0
2,17295,M,59.6179,70.617668,Science,59.230593,Others,False,1,73.284359,Mkt&HR,59.098228,27119.86589,True,2020-01-19,2020-06-11,3.0
3,17385,M,60.103062,57.080626,Science,53.581154,Sci&Tech,False,1,93.783054,Mkt&HR,51.683411,22638.23197,True,2020-02-22,2020-07-27,3.0
4,17352,F,75.981981,70.013769,Commerce,75.51525,Comm&Mgmt,False,0,85.463569,Mkt&Fin,62.936807,28316.094709,True,2020-01-18,2020-07-30,3.0


<div class="alert alert-info">

**Note**

For more details about this process, please visit the
[gaussian_copula](gaussian_copula.ipynb) guide.

</div>

### Computing an overall score

The simplest way to see how similar the two tables are is to import the
`sdv.evaluation.evaluate` function and run it passing both the
`synthetic_data` and the `real_data` tables.

In [4]:
from sdv.evaluation import evaluate

evaluate(synthetic_data, real_data)

0.49143637293841225

The output of this function call is a number between 0 and 1 that
indicates how similar the two tables are, with 0 being the worst and 1
the best possible score.

### How was the score computed?

The `evaluate` function applies a collection of pre-configured metric
functions and returns the average of the scores that the data obtained
on each one of them, after normalizing them to the 0-1 range.
In most scenarios this can be enough to get an idea
about the similarity of the two tables, but you might want to explore
the metrics in more detail.

To see the different metrics that were applied you can pass an
additional argument, `aggregate=False`, which makes the `evaluate`
function return a table with the score that each one of the metric
functions returned, along with its range and goal:

In [5]:
evaluate(synthetic_data, real_data, aggregate=False)

Unnamed: 0,metric,name,score,min_value,max_value,goal
1,LogisticDetection,LogisticRegression Detection,0.395674,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.351723,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-82.6573,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.909491,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.918605,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.920698,0.0,1.0,MAXIMIZE
15,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.5449,0.0,1.0,MAXIMIZE
16,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.832816,0.0,1.0,MAXIMIZE
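
One way to explore these per-metric scores is to rank them. The
following is a minimal sketch in plain Python and is not part of SDV:
the rows are copied by hand from the table above, and `rank_bounded` is
a hypothetical helper introduced only for this illustration. It keeps
the metrics whose range is finite (so `GMLogLikelihood`, whose range is
unbounded, is left out) and sorts them from lowest to highest score.
Since every metric shown has the goal `MAXIMIZE`, the first entries
point at where the synthetic data does worst.

```python
import math

# Rows copied from the table above: (metric, score, min_value, max_value).
rows = [
    ("LogisticDetection", 0.395674, 0.0, 1.0),
    ("SVCDetection", 0.351723, 0.0, 1.0),
    ("GMLogLikelihood", -82.6573, float("-inf"), float("inf")),
    ("CSTest", 0.909491, 0.0, 1.0),
    ("KSTest", 0.918605, 0.0, 1.0),
    ("KSTestExtended", 0.920698, 0.0, 1.0),
    ("ContinuousKLDivergence", 0.5449, 0.0, 1.0),
    ("DiscreteKLDivergence", 0.832816, 0.0, 1.0),
]


def rank_bounded(rows):
    """Return (metric, score) pairs for finite-range metrics, lowest score first.

    Hypothetical helper for this sketch; it is not an SDV function.
    """
    bounded = [
        (metric, score)
        for metric, score, low, high in rows
        if math.isfinite(low) and math.isfinite(high)
    ]
    return sorted(bounded, key=lambda pair: pair[1])


ranked = rank_bounded(rows)
for metric, score in ranked:
    print(f"{metric:<24} {score:.6f}")
```

With the values shown above, the two detection metrics come out lowest,
which suggests that a classifier can still tell the real rows and the
synthetic rows apart fairly well.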


### Can I control which metrics are applied?

By default, the `evaluate` function will apply all the metrics that are
included within the SDV Evaluation framework. However, the list of
metrics that are applied can be controlled by passing a list with the
names of the metrics that you want to apply.

For example, if you were interested in obtaining only the `CSTest` and
`KSTest` metrics, you can call the `evaluate` function as follows:

In [6]:
evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'])

0.9140476536783428

Or, if we want to see the scores separately:

In [7]:
evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False)

Unnamed: 0,metric,name,score,min_value,max_value,goal
0,CSTest,Chi-Squared,0.909491,0.0,1.0,MAXIMIZE
1,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.918605,0.0,1.0,MAXIMIZE
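
These two outputs can be used to check how the aggregated score relates
to the individual ones. The sketch below is plain Python with the values
copied by hand from the outputs above; `mean_score` is a hypothetical
helper, not an SDV function. Both metrics already range from 0 to 1, so
the sketch assumes that no further normalization changes them and simply
averages the two scores.

```python
import math


def mean_score(scores):
    """Average a mapping of metric name -> score (hypothetical helper)."""
    return sum(scores.values()) / len(scores)


# Per-metric scores, as printed (rounded) in the table above.
per_metric = {"CSTest": 0.909491, "KSTest": 0.918605}

# Aggregated score returned by the earlier call with the same two metrics.
reported = 0.9140476536783428

aggregate = mean_score(per_metric)
print(round(aggregate, 6))

# The small remaining difference comes from the rounding of the table values.
print(math.isclose(aggregate, reported, abs_tol=1e-6))
```

The average of the two rounded scores agrees with the aggregated value
to about six decimal places, which is consistent with the aggregated
score being the mean of the selected metrics.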


Some of the metrics that can be applied are:

-   `CSTest`: This metric compares the distributions of all the
    categorical columns of the table by using a Chi-squared test and
    returns the average of the p-values obtained across all the
    columns. If the tables that you are evaluating do not contain any
    categorical columns, the result will be `nan`.
-   `KSTest`: This metric compares the distributions of all the
    numerical columns of the table with a two-sample Kolmogorov-Smirnov
    test using the empirical CDF and returns the average, across all
    the columns, of 1 minus the KS D statistic (the "inverted" statistic
    named in the tables above). If the tables that you are evaluating do
    not contain any numerical columns, the result will be `nan`.
-   `LogisticDetection`: This metric uses a Logistic Regression
    classifier to try to detect whether each row is real or synthetic
    and then evaluates its performance using the area under the ROC
    curve (ROC AUC). The returned score is 1 minus the ROC AUC score
    obtained by the classifier.
-   `SVCDetection`: This metric uses a Support Vector Classifier to try
    to detect whether each row is real or synthetic and then evaluates
    its performance using the area under the ROC curve (ROC AUC). The
    returned score is 1 minus the ROC AUC score obtained by the
    classifier.
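
To make the Kolmogorov-Smirnov based score more concrete, here is a
minimal sketch in plain Python. It is not the SDV implementation:
`ks_d_statistic` and `inverted_ks_score` are hypothetical helpers, the
"inverted" statistic is assumed to mean 1 minus D, missing values are
not handled, and the two small columns are copied by hand from the
`real_data.head()` and `synthetic_data.head()` outputs above.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Two-sample KS D statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def inverted_ks_score(real_columns, synthetic_columns):
    """Average of 1 - D over the numerical columns (assumed formula)."""
    scores = [
        1.0 - ks_d_statistic(real_columns[name], synthetic_columns[name])
        for name in real_columns
    ]
    return sum(scores) / len(scores)


# First five rows of two numerical columns, taken from the tables above.
real = {
    "second_perc": [67.0, 79.33, 65.0, 56.0, 85.8],
    "mba_perc": [58.8, 66.28, 57.8, 59.43, 55.5],
}
synthetic = {
    "second_perc": [66.725348, 76.784298, 59.6179, 60.103062, 75.981981],
    "mba_perc": [60.457128, 65.517693, 59.098228, 51.683411, 62.936807],
}

score = inverted_ks_score(real, synthetic)
print(score)
```

Two identical samples give D = 0 and therefore a score of 1, while two
samples that do not overlap at all give D = 1 and a score of 0. With
only five rows per column the value printed here is just an
illustration and should not be compared with the scores reported above.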