Description
Environment Details
- SDMetrics version: 0.19.1 (DCR Branch)
- Python version: Python 3.11
- Operating System: Linux Colab
Error Description
The new DCRBaselineProtection metric measures the privacy of the synthetic data. It asks the question: if I were to use random data instead of synthetic data, how much more private would it be?
For an accurate comparison point, the random data should have the same number of rows as the synthetic data. Otherwise, the distance-to-closest-record (DCR) calculation may give an unfair advantage to either the synthetic or the random dataset.
By default, the size of the random dataset is correct. However, if I use the num_rows_subsample option, it is not.
For example, say my synthetic data has 50K rows, but when calling the metric I ask for a subsample of only 1000 rows. In this case, the metric should create only 1000 rows of random data (to match the synthetic data subsample). Instead, it currently creates the full 50K rows.
>>> DCRBaselineProtection.compute(
...     real_data=real_df,
...     synthetic_data=synthetic_df,
...     metadata=my_metadata,
...     num_rows_subsample=1000,
...     num_iterations=3)
0.58808
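
To make the expected behavior concrete, here is a minimal sketch of how the random baseline could be sized from the subsample rather than from the full synthetic table. It is not SDMetrics code: `make_random_baseline` and `baseline_for_iteration` are hypothetical helpers, the uniform-within-observed-range sampling is an assumption made for illustration, and the toy `real_df` and `synthetic_df` only mimic the 50K-row scenario described above.

```python
import numpy as np
import pandas as pd


def make_random_baseline(real_df, num_rows, seed=0):
    """Hypothetical helper: draw `num_rows` random rows, sampling each
    numeric column uniformly within the range observed in the real data."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name in real_df.columns:
        low, high = real_df[name].min(), real_df[name].max()
        columns[name] = rng.uniform(low, high, size=num_rows)
    return pd.DataFrame(columns)


def baseline_for_iteration(real_df, synthetic_df, num_rows_subsample=None, seed=0):
    """Hypothetical helper: subsample the synthetic data (if requested) and
    build a random baseline with the same number of rows as that subsample."""
    if num_rows_subsample is None:
        synthetic_sample = synthetic_df
    else:
        synthetic_sample = synthetic_df.sample(n=num_rows_subsample, random_state=seed)
    # Size the random data from the subsample, not from the full synthetic table.
    random_df = make_random_baseline(real_df, num_rows=len(synthetic_sample), seed=seed)
    return synthetic_sample, random_df


# Toy data standing in for the tables in this report (assumed, not real data).
real_df = pd.DataFrame({
    'age': np.arange(200) % 80,
    'income': np.linspace(1e4, 9e4, 200),
})
synthetic_df = pd.DataFrame({
    'age': np.arange(50_000) % 80,
    'income': np.linspace(1e4, 9e4, 50_000),
})

synthetic_sample, random_df = baseline_for_iteration(
    real_df, synthetic_df, num_rows_subsample=1000
)
print(len(synthetic_sample), len(random_df))  # 1000 1000
```

With this sizing, each iteration compares 1000 synthetic rows against 1000 random rows, so neither side gets more candidates in the distance calculation.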