# Demo: Table Comparison Engine

This notebook demonstrates how to use the table comparison engine from the Artifact-ML core library to evaluate a synthetic tabular dataset.

The engine provides scores and plots that quantify how closely the distribution of the synthetic dataset matches that of the real dataset.

We'll walk through:

1. Loading real and synthetic datasets
2. Specifying which features are continuous and which are categorical
3. Setting up the comparison engine
4. Computing comparison metrics
5. Generating visualizations to assess data similarity

## Setup

First, we'll set up our environment and import the necessary libraries.

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path

import pandas as pd
from artifact_core.table_comparison import (
    TableComparisonEngine,
    TableComparisonPlotType,
    TableComparisonScoreCollectionType,
    TableComparisonScoreType,
    TabularDataSpec,
)

## Loading the Data

We'll load the real and synthetic datasets from CSV files. Judging by columns such as `Age`, `RestingBP`, `Cholesterol`, `MaxHR`, and `Oldpeak`, they contain health-related records, which we'll analyze and compare.

In [None]:
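# The parent of the current working directory is taken as the artifact-core root,
# so the notebook is expected to run from a direct subdirectory of that root.
artifact_core_root = Path().absolute().parent

df_real = pd.read_csv(artifact_core_root / "assets/real.csv")
df_synthetic = pd.read_csv(artifact_core_root / "assets/synthetic.csv")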

Let's examine the real dataset to understand its structure and content:

In [None]:
df_real

## Resource Specification Setup

Before we can compare the datasets, we need to specify which features are continuous and which are categorical.

This information helps the dataset comparison engine apply appropriate comparison metrics for each feature type.

In [None]:
# Continuous features are listed explicitly.
ls_cts_features = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# Every remaining column of the real dataset is treated as categorical.
resource_spec = TabularDataSpec.from_df(
    df=df_real,
    ls_cts_features=ls_cts_features,
    ls_cat_features=[feature for feature in df_real.columns if feature not in ls_cts_features],
)
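
A quick sanity check on the feature partition can catch typos before the spec is built. The sketch below uses only pandas; the helper name `split_features` and the toy frame are illustrative assumptions and are not part of `artifact_core`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_features(df: pd.DataFrame, ls_cts_features: list[str]) -> tuple[list[str], list[str]]:
    """Validate the continuous features and return (continuous, categorical) lists."""
    missing = [f for f in ls_cts_features if f not in df.columns]
    if missing:
        raise KeyError(f"Continuous features not found in DataFrame: {missing}")
    non_numeric = [f for f in ls_cts_features if not is_numeric_dtype(df[f])]
    if non_numeric:
        raise TypeError(f"Continuous features with non-numeric dtype: {non_numeric}")
    # Everything not declared continuous is treated as categorical.
    ls_cat_features = [f for f in df.columns if f not in ls_cts_features]
    return list(ls_cts_features), ls_cat_features


# Toy frame standing in for df_real (hypothetical values).
df_toy = pd.DataFrame({"Age": [40, 49, 37], "Sex": ["M", "F", "M"], "MaxHR": [172, 156, 98]})
cts, cat = split_features(df_toy, ["Age", "MaxHR"])
```

The two returned lists could then be passed as `ls_cts_features` and `ls_cat_features` to `TabularDataSpec.from_df`.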

## Initializing the Comparison Engine

Now we'll initialize the `TableComparisonEngine` with our resource specification. This engine handles all comparison operations between the real and synthetic datasets.

In [None]:
engine = TableComparisonEngine(resource_spec=resource_spec)

## Computing Statistical Distance Metrics

### Jensen-Shannon Distance

The Jensen-Shannon (JS) distance measures how dissimilar two probability distributions are. It is the square root of the Jensen-Shannon divergence, a symmetrized and smoothed variant of the Kullback-Leibler divergence that is always finite.

Values closer to 0 indicate more similar distributions, while values closer to 1 (the upper bound when the logarithm is taken in base 2) indicate more dissimilar distributions.

In [None]:
engine.produce_dataset_comparison_score_collection(
    score_collection_type=TableComparisonScoreCollectionType.JS_DISTANCE,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
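
To make the definition concrete, here is a minimal NumPy sketch of the JS distance between two discrete distributions, using base-2 logarithms. It illustrates the formula only; the function name `js_distance` is an assumption, and the engine's own implementation (including how it discretizes continuous features) may differ.

```python
import numpy as np


def js_distance(p, q) -> float:
    """Jensen-Shannon distance (base-2 logarithm) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize counts or weights to probabilities
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing to the sum
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    js_divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Clamp tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(js_divergence, 0.0)))


d_identical = js_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])  # 0.0
d_disjoint = js_distance([1.0, 0.0], [0.0, 1.0])              # 1.0
```

Identical distributions give a distance of 0, and distributions with disjoint support reach the upper bound of 1.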

### Mean Jensen-Shannon Distance

This computes the mean JS distance across all features, providing a single summary metric for how well the synthetic data matches the real data overall.

In [None]:
engine.produce_dataset_comparison_score(
    score_type=TableComparisonScoreType.MEAN_JS_DISTANCE,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
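
The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: it relies on SciPy's `jensenshannon`, aligns categorical frequencies on the union of observed categories, and discretizes continuous features into a hypothetical `n_bins` shared histogram bins. The function name `mean_js_distance` and the toy frames are likewise made up for the example.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon


def mean_js_distance(df_a, df_b, cat_features, cts_features, n_bins=10) -> float:
    """Average the per-feature JS distance (base 2) over all listed features."""
    distances = []
    for col in cat_features:
        # Align category counts on the union of categories seen in either frame.
        counts_a = df_a[col].value_counts()
        counts_b = df_b[col].value_counts()
        categories = counts_a.index.union(counts_b.index)
        p = counts_a.reindex(categories, fill_value=0).to_numpy(dtype=float)
        q = counts_b.reindex(categories, fill_value=0).to_numpy(dtype=float)
        distances.append(jensenshannon(p, q, base=2))
    for col in cts_features:
        # Histogram both samples on shared bin edges (binning choice is an assumption).
        a = df_a[col].to_numpy(dtype=float)
        b = df_b[col].to_numpy(dtype=float)
        edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=n_bins)
        p, _ = np.histogram(a, bins=edges)
        q, _ = np.histogram(b, bins=edges)
        distances.append(jensenshannon(p, q, base=2))
    return float(np.mean(distances))


# Toy frames (hypothetical values): identical ages, completely different categories.
df_a = pd.DataFrame({"Sex": ["M", "M", "M", "M"], "Age": [40.0, 50.0, 60.0, 70.0]})
df_b = pd.DataFrame({"Sex": ["F", "F", "F", "F"], "Age": [40.0, 50.0, 60.0, 70.0]})

score_same = mean_js_distance(df_a, df_a, ["Sex"], ["Age"])
score_diff = mean_js_distance(df_a, df_b, ["Sex"], ["Age"])
```

In the second call one feature matches perfectly and the other not at all, so the mean lands halfway between 0 and 1. This also shows that a single summary value can hide a badly matched feature, which is why the per-feature collection above remains useful.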

## Visualizing Dataset Comparisons

### Descriptive Statistics Comparison

This plot compares basic descriptive statistics (mean, median, standard deviation, etc.) between the real and synthetic datasets for each feature. It helps identify if the synthetic data captures the central tendencies and variability of the real data.

In [None]:
engine.produce_dataset_comparison_plot(
    plot_type=TableComparisonPlotType.DESCRIPTIVE_STATS_ALIGNMENT_PLOT,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
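
The same kind of comparison can be tabulated by hand with pandas. The sketch below uses two small toy frames with made-up values; the choice of statistics is an assumption and need not match what the plot displays.

```python
import pandas as pd

# Toy frames standing in for df_real and df_synthetic (hypothetical values).
df_real_toy = pd.DataFrame({"Age": [40, 49, 37, 54], "MaxHR": [172, 156, 98, 122]})
df_syn_toy = pd.DataFrame({"Age": [42, 47, 39, 50], "MaxHR": [165, 150, 110, 130]})

stats = ["mean", "median", "std", "min", "max"]

# Rows are features, columns are statistics.
summary_real = df_real_toy.agg(stats).T
summary_syn = df_syn_toy.agg(stats).T

# Absolute gap per feature and statistic; small values indicate good alignment.
abs_diff = (summary_real - summary_syn).abs()
```

Large entries in `abs_diff` point to the features and statistics where the synthetic data drifts from the real data.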

### Probability Density Function (PDF) Comparison

The PDF plots show the distribution shapes for each feature in both datasets. This helps visualize how well the synthetic data captures the distribution characteristics of the real data, including skewness, modality, and outliers.

In [None]:
engine.produce_dataset_comparison_plot(
    plot_type=TableComparisonPlotType.PDF_PLOT,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
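
For a single continuous feature, the density comparison behind such a plot can be approximated with histograms on shared bin edges. This is a rough NumPy sketch on randomly generated stand-in data; the bin count and the summary quantity `density_gap` are assumptions, and the engine may estimate densities differently.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=130.0, scale=15.0, size=1000)       # hypothetical "real" feature
synthetic = rng.normal(loc=135.0, scale=20.0, size=1000)  # hypothetical "synthetic" feature

# Shared bin edges so both densities are evaluated on the same grid.
edges = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=30)
pdf_real, _ = np.histogram(real, bins=edges, density=True)
pdf_syn, _ = np.histogram(synthetic, bins=edges, density=True)

widths = np.diff(edges)
# Half the integrated absolute difference between the two histogram densities:
# 0 means identical histograms, 1 means no overlap at all.
density_gap = 0.5 * float(np.sum(np.abs(pdf_real - pdf_syn) * widths))
```

Plotting `pdf_real` and `pdf_syn` against the bin centres gives an overlay of the two estimated densities for this feature.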

### Cumulative Distribution Function (CDF) Comparison

CDF plots show the cumulative probability distributions for each feature. These are particularly useful for identifying differences in percentiles and the overall range of values between the real and synthetic datasets.

In [None]:
engine.produce_dataset_comparison_plot(
    plot_type=TableComparisonPlotType.CDF_PLOT,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
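
As a numerical companion to the CDF plots, the empirical CDFs of one feature can be computed directly and their largest vertical gap reported (the two-sample Kolmogorov-Smirnov statistic). The helper `ecdf` and the toy samples below are illustrative assumptions, not part of the engine.

```python
import numpy as np


def ecdf(sample, grid):
    """Fraction of sample values less than or equal to each grid point."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size


real = np.array([1.0, 2.0, 3.0, 4.0])       # hypothetical "real" feature values
synthetic = np.array([2.0, 3.0, 4.0, 5.0])  # hypothetical "synthetic" feature values

# Evaluate both step functions at every observed value.
grid = np.union1d(real, synthetic)
cdf_real = ecdf(real, grid)
cdf_syn = ecdf(synthetic, grid)

# Largest vertical distance between the two empirical CDFs.
max_gap = float(np.max(np.abs(cdf_real - cdf_syn)))  # 0.25
```

Here the synthetic sample is the real one shifted by one unit, so its CDF lags by one observation out of four at every point where they differ.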

### Principal Component Analysis (PCA) Projection

PCA reduces the dimensionality of the data while preserving as much variance as possible. This plot projects both datasets into a lower-dimensional space (typically 2D), allowing us to visualize how well the synthetic data captures the overall structure and relationships in the real data.

If the synthetic data points overlap significantly with the real data points in this projection, it suggests that the synthetic data reproduces the dominant correlation structure of the real data. Because only the leading components are shown, a good overlap is necessary but not sufficient evidence that the full joint distribution matches.

In [None]:
engine.produce_dataset_comparison_plot(
    plot_type=TableComparisonPlotType.PCA_PROJECTION_PLOT,
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)
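
The projection idea can be sketched with plain NumPy. The example fits PCA on the real data only and projects both datasets onto the same components; that choice, the helper names `fit_pca` and `project`, and the random stand-in data are all assumptions, and the engine may preprocess or fit differently (for instance, by scaling features or encoding categoricals first).

```python
import numpy as np


def fit_pca(x, n_components=2):
    """Return (mean, components) from a PCA fitted via SVD on centred data."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    """Centre x with the fitted mean and project it onto the fitted components."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))       # hypothetical numeric "real" data
synthetic = rng.normal(size=(200, 5))  # hypothetical numeric "synthetic" data

mean, components = fit_pca(real)
real_2d = project(real, mean, components)
synthetic_2d = project(synthetic, mean, components)
```

Scatter-plotting `real_2d` and `synthetic_2d` on the same axes then shows how much the two point clouds overlap in the space spanned by the real data's leading components.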

In [None]:
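# NOTE: "BANI" is passed as a plain string rather than a
# TableComparisonScoreCollectionType member and is not defined anywhere in this
# notebook. Presumably it names a custom score collection registered elsewhere.
engine.produce_dataset_comparison_score_collection(
    score_collection_type="BANI",
    dataset_real=df_real,
    dataset_synthetic=df_synthetic,
)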