
The DCRBaselineProtection metric crashes when the distance between random data and real data is 0 #738

@npatki

Description

Environment Details

  • SDMetrics version: 0.19.1 (DCR Branch)
  • Python version: Python 3.11
  • Operating System: Linux Colab

Error Description

The new DCRBaselineProtection metric measures the privacy of synthetic data. It asks the question: If I were to use random data instead of synthetic data, how much more private would it be?

The metric is based on the distance to closest record. It measures:

  • random_data_median: The median distance between random data and the real data
  • synthetic_data_median: The median distance between synthetic data and the real data

The final score is: synthetic_data_median / random_data_median

However, in some cases, random_data_median=0. This happens when a dataset is capable of very little diversity. For example, a dataset with only 2 columns, each of which can only contain 2 possible discrete values (= 4 possible rows):

| is_active | response |
|-----------|----------|
| True      | "YES"    |
| True      | "NO"     |
| False     | "YES"    |
| False     | "NO"     |

If random_data_median=0, this metric currently crashes with a ZeroDivisionError.
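A minimal sketch of how this failure mode arises. This is illustrative only, not the SDMetrics implementation: the `dcr` helper and the Hamming distance over discrete columns are assumptions made for the example.

```python
from statistics import median

def dcr(row, real_rows):
    # Distance to closest record: the smallest distance from one row to
    # any real record. Hamming distance is used here since the example
    # columns are purely discrete.
    return min(sum(a != b for a, b in zip(row, r)) for r in real_rows)

# A low-diversity dataset: random sampling easily reproduces real rows.
real = [(True, "YES"), (False, "NO")]
synthetic = [(True, "NO"), (False, "YES")]
random_rows = [(True, "YES"), (False, "NO")]

synthetic_data_median = median(dcr(s, real) for s in synthetic)
random_data_median = median(dcr(r, real) for r in random_rows)

try:
    # This division is where the metric currently crashes.
    score = synthetic_data_median / random_data_median
except ZeroDivisionError:
    score = None
```

Because every randomly sampled row can exactly match a real row, `random_data_median` comes out as 0 and the final division raises `ZeroDivisionError`.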

Expected Behavior

Rather than crashing, the final metric score should be NaN, indicating that computing privacy on such a dataset is not recommended anyway.

The compute_breakdown should still return the individual median scores so the user can understand more about what's happening.

```python
>>> DCRBaselineProtection.compute_breakdown(
...     real_data=real_df,
...     synthetic_data=synthetic_df,
...     metadata=my_metadata)
{
    'score': nan,
    'median_DCR_to_real_data': {
        'synthetic_data': 0.25,
        'random_data_baseline': 0.0
    }
}
```
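One way the guard could look. This is a hedged sketch, not the actual fix: `safe_score` is a hypothetical helper, and the real change would live inside the metric's scoring code.

```python
import math

def safe_score(synthetic_median, random_median):
    # Hypothetical guard: when the random-data baseline median is 0,
    # the ratio is undefined, so return NaN instead of letting the
    # division raise ZeroDivisionError.
    if random_median == 0:
        return math.nan
    return synthetic_median / random_median
```

With this guard, the breakdown can still report both medians while the top-level score is NaN, matching the expected behavior above.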

Labels

bug · Something isn't working