The DCRBaselineProtection metric is not creating the correct size of random data #743

@npatki

Description

Environment Details

  • SDMetrics version: 0.19.1 (DCR Branch)
  • Python version: Python 3.11
  • Operating System: Linux Colab

Error Description

The new DCRBaselineProtection metric measures the privacy of the synthetic data. It asks the question: if I were to use random data instead of synthetic data, how much more private would it be?

For an accurate comparison point, the size of the random data should be the same as the size of the synthetic data. Otherwise, the distance to closest record calculation may give an unfair advantage to either the synthetic or random dataset.
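To make the comparison concrete, here is a minimal sketch of the distance-to-closest-record (DCR) idea with equally sized candidate sets. All names and the exact distance formula here are illustrative assumptions, not SDMetrics' implementation:

```python
import numpy as np

def mean_dcr(candidate: np.ndarray, real: np.ndarray) -> float:
    # Illustrative DCR: for each candidate row, find the Euclidean
    # distance to its closest real record, then average over the
    # candidate set (hypothetical helper, not SDMetrics' code).
    dists = np.linalg.norm(candidate[:, None, :] - real[None, :, :], axis=2)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(42)
real = rng.uniform(0, 1, size=(100, 2))
synthetic_subsample = rng.uniform(0, 1, size=(1_000, 2))
random_baseline = rng.uniform(0, 1, size=(1_000, 2))  # same size as the subsample

score_synth = mean_dcr(synthetic_subsample, real)
score_random = mean_dcr(random_baseline, real)
```

Because both candidate sets have the same number of rows, the two mean-DCR values are directly comparable.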

By default, the size of the random dataset is correct. However, if I use the num_rows_subsample option, it is not correct.

For example, say my synthetic data has 50K rows but, when calling the metric, I ask for a subsample of only 1000 rows. In this case, the metric should create only 1000 random data rows (to match the synthetic data subsample). Instead, it currently creates the full 50K rows.

>>> DCRBaselineProtection.compute(
...     real_data=real_df,
...     synthetic_data=synthetic_df,
...     metadata=my_metadata,
...     num_rows_subsample=1000,
...     num_iterations=3)
0.58808
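The expected sizing behavior can be sketched as follows. The `make_random_baseline` helper is hypothetical (the real metric derives random data from the real data's column ranges); the point is only that the random baseline should have `num_rows_subsample` rows, not the full synthetic row count:

```python
import numpy as np
import pandas as pd

def make_random_baseline(real_data: pd.DataFrame, num_rows: int, seed: int = 0) -> pd.DataFrame:
    # Hypothetical helper: draw each column uniformly between the real
    # column's min and max, producing exactly num_rows rows.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.uniform(real_data[col].min(), real_data[col].max(), size=num_rows)
        for col in real_data.columns
    })

real_df = pd.DataFrame({'a': [0.0, 1.0, 2.0], 'b': [10.0, 20.0, 30.0]})

synthetic_rows = 50_000
num_rows_subsample = 1_000

# Expected: the random baseline matches the subsample size (1000 rows),
# not the full synthetic size (50,000 rows).
n_random = min(num_rows_subsample, synthetic_rows)
random_df = make_random_baseline(real_df, n_random)
```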

Labels

bug (Something isn't working)
