Description
Environment Details
- SDMetrics version: 0.19.1 (DCR Branch)
- Python version: Python 3.11
- Operating System: Linux Colab
Error Description
The new `DCRBaselineProtection` metric is a measure of the privacy of the synthetic data. It asks the question: If I were to use random data instead of synthetic data, how much more private would it be?
The metric is based on the distance to closest record (DCR). It measures:
- `random_data_median`: the typical distance between the random data and the real data
- `synthetic_data_median`: the typical distance between the synthetic data and the real data

The final score is: `synthetic_data_median / random_data_median`
However, in some cases `random_data_median = 0`. This happens when the dataset is capable of very little diversity. For example, the dataset below has only 2 columns, each of which can contain only 2 possible discrete values (= 4 possibilities).
| is_active | response |
|---|---|
| True | "YES" |
| True | "NO" |
| False | "YES" |
| False | "NO" |
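To illustrate why the baseline collapses to 0, here is a small sketch (hypothetical helper names; it assumes a simple Hamming-style distance over discrete columns). When the real data already covers every possible row in the domain, any randomly drawn row matches some real row exactly, so its distance to the closest record is always 0:

```python
import itertools
import random

# The full domain: 2 columns x 2 values each = 4 possible rows.
domain = list(itertools.product([True, False], ["YES", "NO"]))

# A "real" dataset that already covers every possibility.
real_rows = domain

def row_distance(a, b):
    # Hamming-style distance: count of columns that differ.
    return sum(x != y for x, y in zip(a, b))

# Any random row drawn from the domain matches a real row exactly,
# so its distance to the closest record (DCR) is 0.
random_row = random.choice(domain)
dcr = min(row_distance(random_row, r) for r in real_rows)
print(dcr)  # 0
```

With every random draw at distance 0, the median over all draws is also 0.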
If `random_data_median = 0`, this metric currently crashes with a `ZeroDivisionError`.
Expected Behavior
Rather than crashing, the final metric score should be `NaN`, indicating that it is not recommended to compute privacy on such a dataset anyway.

The `compute_breakdown` method should still return the individual median scores so the user can understand more about what's happening.
```python
>>> DCRBaselineProtection.compute_breakdown(
...     real_data=real_df,
...     synthetic_data=synthetic_df,
...     metadata=my_metadata)
{
    'score': NaN,
    'median_DCR_to_real_data': {
        'synthetic_data': 0.25,
        'random_data_baseline': 0.0
    }
}
```