Description
Problem Description
The ContingencySimilarity metric computes an entire contingency table for both the real and the synthetic data in order to compare their discrete, 2D distributions. Our experiments have shown that for large datasets with high cardinality (i.e., a large number of category values), this metric is not very performant.
Our experiments have also shown that a simple approach of subsampling the data (both the real dataset and the synthetic dataset) can yield much faster performance while keeping the final score within roughly 5% of the full computation. Based on this, we should add a parameter to this metric to allow for subsampling.
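For reference, here is a minimal sketch of the kind of manual subsampling described above, done outside the metric with pandas. The table contents, column names, and the subsample size of 1000 rows are placeholders for illustration, not values from the experiments:

```python
import pandas as pd

from sdmetrics.column_pairs import ContingencySimilarity

# Placeholder data; in practice these are the real and synthetic tables.
real_table = pd.DataFrame({
    'column_1': ['a', 'b', 'c'] * 2000,
    'column_2': ['x', 'y', 'z'] * 2000,
})
synthetic_table = pd.DataFrame({
    'column_1': ['a', 'b', 'b'] * 2000,
    'column_2': ['x', 'x', 'z'] * 2000,
})

# Randomly subsample both datasets before computing the metric.
num_rows = 1000
real_sample = real_table.sample(n=min(num_rows, len(real_table)))
synthetic_sample = synthetic_table.sample(n=min(num_rows, len(synthetic_table)))

score = ContingencySimilarity.compute(
    real_data=real_sample[['column_1', 'column_2']],
    synthetic_data=synthetic_sample[['column_1', 'column_2']],
)
print(score)
```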
Expected behavior
Add an optional parameter to `ContingencySimilarity` called `num_rows_subsample`:
- `None` (default): Do not subsample the rows.
- `<integer>`: Randomly subsample the provided number of rows for both the real and the synthetic datasets before computing the metric.
```python
from sdmetrics.column_pairs import ContingencySimilarity

ContingencySimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']],
    num_rows_subsample=1000
)
```
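For context, here is a rough sketch of how the new parameter could behave internally. This is a simplified stand-in for the metric (one minus the total variation distance between the two normalized contingency tables), not the actual sdmetrics implementation; the function name and structure are assumptions for illustration only:

```python
import pandas as pd


def compute_contingency_similarity(real_data, synthetic_data, num_rows_subsample=None):
    """Simplified stand-in for ContingencySimilarity.compute with the proposed parameter."""
    if num_rows_subsample is not None:
        # Randomly subsample both datasets before building the contingency tables.
        real_data = real_data.sample(n=min(num_rows_subsample, len(real_data)))
        synthetic_data = synthetic_data.sample(n=min(num_rows_subsample, len(synthetic_data)))

    columns = list(real_data.columns[:2])

    # Normalized 2D contingency tables (joint category frequencies) for each dataset.
    real_freq = real_data.groupby(columns).size() / len(real_data)
    synth_freq = synthetic_data.groupby(columns).size() / len(synthetic_data)

    # Total variation distance between the two discrete joint distributions.
    combined = pd.concat([real_freq, synth_freq], axis=1).fillna(0)
    tvd = 0.5 * (combined[0] - combined[1]).abs().sum()
    return 1 - tvd
```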
Additional context
Our experiments have shown that multiple iterations are not needed when subsampling, as the overall score does not change by much, so we are not adding a parameter for the number of iterations.