
The DCRBaselineProtection metric is not producing the expected score #742

@npatki

Environment Details

  • SDMetrics version: 0.19.1 (DCR Branch)
  • Python version: Python 3.11
  • Operating System: Linux Colab

Error Description

The DCRBaselineProtection metric is not producing the correct score. There seems to be something wrong in the computation of the median DCR between a dataset and the real data.

We discussed two potential root causes of this issue:

  1. The DCR computation may not be comparing the right pairs of datasets. It should be done for synthetic vs. real and for random vs. real. More specifically, we want to compare:
    • The median DCR for a synthetic data row and the real dataset and
    • The median DCR for a random data row and the real dataset
  2. For numerical data, the distance computation should be based on the range of the real data column (not the synthetic data or random data)
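To make the two points above concrete, here is a minimal sketch of the calculation I have in mind for a single numerical column without nulls. `median_dcr` is a hypothetical helper written for this issue, not the SDMetrics implementation, and the `random_data` values are made up for illustration. It also assumes the real column has a non-zero range.

```python
import numpy as np
import pandas as pd


def median_dcr(candidate: pd.Series, real: pd.Series) -> float:
    """Median distance to closest record (DCR) for one numerical column.

    Hypothetical helper for illustration only; not part of SDMetrics.
    Assumes no null values and a non-zero range in the real column.
    """
    # Point 2: normalize by the range of the REAL column only.
    real_range = real.max() - real.min()
    # Pairwise absolute differences: one row per candidate value,
    # one column per real value.
    diffs = np.abs(candidate.to_numpy()[:, None] - real.to_numpy()[None, :])
    # DCR of each candidate row is its distance to the nearest real row.
    dcr = (diffs / real_range).min(axis=1)
    return float(np.median(dcr))


real = pd.Series([0, 10, 3, 4, 1], dtype=float)
synthetic = pd.Series([5.0])
random_data = pd.Series([8.0, 2.0, 7.0])  # made-up values for illustration

# Point 1: the two medians to compare, both measured against the real data.
synthetic_median = median_dcr(synthetic, real)
random_median = median_dcr(random_data, real)
print(synthetic_median, random_median)
```

Both medians use the same denominator (the real column's range of 10), so they are directly comparable.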

Steps to reproduce

In this case, I expect the median distance between synthetic and real to be 0.1:

from sdmetrics.single_table.privacy import DCRBaselineProtection
import pandas as pd
import numpy as np

real_data = pd.DataFrame(data={
    'A': [0, 10, 3, 4, 1]}) # the range of this column is 10

synthetic_data = pd.DataFrame(data={
    'A': [5],}) # the DCR between this row and the real dataset is 1/10 = 0.1 

metadata = {
    'columns': {
        'A': { 'sdtype': 'numerical' },}}

# I expect that the median distance between real data and synthetic is 0.1
DCRBaselineProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata = metadata)

In this case, I expect the median distance between synthetic and real to be 0 because there is a null value in both the real and synthetic data.

from sdmetrics.single_table.privacy import DCRBaselineProtection
import pandas as pd
import numpy as np

real_data = pd.DataFrame(data={
    'A': [0, 10, 3, 4, 1, np.nan], })

# the DCR between this row and the real data is 0 (np.nan exactly matches np.nan)
synthetic_data = pd.DataFrame(data={
    'A': [np.nan]})

metadata = {
    'columns': {
        'A': { 'sdtype': 'numerical' }}}

# I expect that the median distance between real data and synthetic is 0
DCRBaselineProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata = metadata)
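For reference, a hand check of both expected values under a null-aware distance could look like the sketch below. `numeric_distance` and `dcr` are hypothetical helpers, not SDMetrics functions. The rule that two nulls have distance 0 comes from the expectation above; treating a null paired with a non-null value as the maximum distance of 1 is my assumption.

```python
import numpy as np


def numeric_distance(synthetic_value, real_value, real_range):
    """Distance between two numerical values (hypothetical helper)."""
    s_null = np.isnan(synthetic_value)
    r_null = np.isnan(real_value)
    if s_null and r_null:
        return 0.0  # two nulls count as an exact match
    if s_null or r_null:
        return 1.0  # assumption: null vs. non-null is the maximum distance
    # Normalize by the range of the real column
    return abs(synthetic_value - real_value) / real_range


def dcr(synthetic_value, real_values):
    """Distance from one synthetic value to its closest real value."""
    # Range of the real column, ignoring nulls
    real_range = np.nanmax(real_values) - np.nanmin(real_values)
    return min(
        numeric_distance(synthetic_value, r, real_range) for r in real_values
    )


real_values = [0, 10, 3, 4, 1, np.nan]

dcr_null = dcr(np.nan, real_values)  # closest real value is the null
dcr_five = dcr(5, real_values)       # closest real value is 4
print(dcr_null, dcr_five)
```

Since each reproduction has a single synthetic row, the median DCR equals that row's DCR.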


Labels

bug: Something isn't working
