Allow subsampling when computing the ContingencySimilarity metric #716

@npatki

Description

Problem Description

The ContingencySimilarity metric computes a full contingency table for both the real and the synthetic data in order to compare the two discrete, 2D distributions. Our experiments have shown that this metric is not very performant for large datasets with high cardinality (i.e., a large number of category values).
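For context, the cost comes from building a table with one cell for every observed pair of category values, so two high-cardinality columns can produce a very large table. Below is a rough sketch of the kind of computation involved, assuming the score is 1 minus the total variation distance between the two normalized contingency tables (this function is illustrative, not the library's actual code):

import pandas as pd

def contingency_similarity(real, synthetic):
    # Illustrative sketch, not the SDMetrics implementation.
    cols = list(real.columns)
    # Normalized frequency of every observed (column_1, column_2) pair
    real_freq = real.groupby(cols).size() / len(real)
    synth_freq = synthetic.groupby(cols).size() / len(synthetic)
    # Align the tables so pairs missing from one side count as 0
    real_freq, synth_freq = real_freq.align(synth_freq, fill_value=0)
    # 1 minus the total variation distance between the two distributions
    return 1 - 0.5 * (real_freq - synth_freq).abs().sum()

# Tiny demo with made-up data
real = pd.DataFrame({'column_1': ['a', 'a', 'b'], 'column_2': ['x', 'y', 'x']})
synthetic = pd.DataFrame({'column_1': ['a', 'b', 'b'], 'column_2': ['x', 'x', 'y']})
print(contingency_similarity(real, synthetic))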

Our experiments have also shown that a simple approach of subsampling both the real and the synthetic datasets yields much faster performance while changing the final score only slightly (within 5%). Based on this, we should add a parameter to this metric to allow for subsampling.

Expected behavior

Add an optional parameter to ContingencySimilarity called num_rows_subsample:

  • (default) None: Do not subsample the rows
  • <integer>: Randomly subsample the provided number of rows for both the real and the synthetic datasets before computing the metric
For example:

from sdmetrics.column_pairs import ContingencySimilarity

ContingencySimilarity.compute(
    real_data=real_table[['column_1', 'column_2']],
    synthetic_data=synthetic_table[['column_1', 'column_2']],
    num_rows_subsample=1000
)
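One plausible way to wire the parameter in is a single pandas .sample call on each table before the contingency tables are built. The helper below is a hypothetical sketch (the name _maybe_subsample and the random_state argument are assumptions, not part of this request):

def _maybe_subsample(data, num_rows_subsample=None, random_state=None):
    # Hypothetical helper: subsample the rows of a pandas DataFrame only
    # when requested and when it has more rows than the target size.
    if num_rows_subsample is None or num_rows_subsample >= len(data):
        return data
    return data.sample(n=num_rows_subsample, random_state=random_state)

# Both tables would be reduced before the metric runs, e.g.:
# real_sample = _maybe_subsample(real_table[['column_1', 'column_2']], 1000)
# synthetic_sample = _maybe_subsample(synthetic_table[['column_1', 'column_2']], 1000)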

Additional context

Our experiments have shown that multiple iterations are not needed when subsampling, as the overall score barely changes between runs. So we are not adding an iterations parameter.

Labels

feature request (Request for a new feature), feature:metrics (Related to any of the individual metrics)
