Closed
Description
Environment Details
- SDMetrics version: 0.19.1 (DCR Branch)
- Python version: Python 3.11
- Operating System: Linux Colab
Error Description
The DCRBaselineProtection metric is not producing the correct score. There seems to be something wrong in the computation of the median DCR between a dataset and the real data.
We discussed 2 potential root causes for this issue:
- Making sure that the DCR computation is done for the synthetic vs. real and random vs. real datasets. More particularly, we want to compare:
  - The median DCR between a synthetic data row and the real dataset, and
  - The median DCR between a random data row and the real dataset
- For numerical data, the distance computation should be based on the range of the real data column (not the synthetic data or random data)
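To make the two points above concrete, here is a minimal sketch of the comparison for a single numerical column. It is not the SDMetrics implementation: the helper name `median_dcr` and the uniform sampling of the random baseline are assumptions made only for illustration. The key detail is that the distance is normalized by the range of the real column for both the synthetic and the random data.

```python
import numpy as np


def median_dcr(candidate_values, real_values):
    """Median distance to closest record for one numerical column.

    Hypothetical helper for illustration only, not part of SDMetrics.
    """
    real = np.asarray(real_values, dtype=float)
    candidates = np.asarray(candidate_values, dtype=float)

    # Normalize by the range of the REAL column, never the candidate data.
    real_range = real.max() - real.min()

    # Pairwise distances: one row per candidate, one column per real value.
    distances = np.abs(candidates[:, None] - real[None, :]) / real_range

    # DCR of each candidate row is its distance to the closest real row.
    dcr_per_row = distances.min(axis=1)
    return float(np.median(dcr_per_row))


real = [0, 10, 3, 4, 1]
synthetic = [5]

# Assumption: the random baseline is drawn uniformly within the real range.
rng = np.random.default_rng(0)
random_data = rng.uniform(min(real), max(real), size=len(synthetic))

synthetic_median = median_dcr(synthetic, real)   # |5 - 4| / 10
random_median = median_dcr(random_data, real)
```

Both medians come from the same function and the same real range, so they are directly comparable.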
Steps to reproduce
In this case, I expect the median distance between synthetic and real to be 0.1
from sdmetrics.single_table.privacy import DCRBaselineProtection
import pandas as pd
import numpy as np

real_data = pd.DataFrame(data={
    'A': [0, 10, 3, 4, 1]  # the range of this column is 10
})
synthetic_data = pd.DataFrame(data={
    'A': [5],  # the DCR between this row and the real dataset is 1/10 = 0.1
})
metadata = {
    'columns': {
        'A': {'sdtype': 'numerical'},
    }
}

# I expect that the median distance between real data and synthetic is 0.1
DCRBaselineProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
In this case, I expect the median distance between synthetic and real to be 0 because there is a null value in both the real and synthetic data.
from sdmetrics.single_table.privacy import DCRBaselineProtection
import pandas as pd
import numpy as np

real_data = pd.DataFrame(data={
    'A': [0, 10, 3, 4, 1, np.nan],
})
synthetic_data = pd.DataFrame(data={
    'A': [np.nan],  # the DCR between this and the real data is 0 (np.nan exactly matches)
})
metadata = {
    'columns': {
        'A': {'sdtype': 'numerical'},
    }
}

# I expect that the median distance between real data and synthetic is 0
DCRBaselineProtection.compute_breakdown(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)
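The null-handling expectation can be sketched as a small distance function. This is only an illustration under stated assumptions, not the SDMetrics code. The helpers `numeric_distance` and `dcr` are hypothetical. Two missing values are treated as an exact match (distance 0), as the example above expects. Treating a missing value against a present one as the maximum distance of 1, and capping distances at 1, are assumptions.

```python
import numpy as np


def numeric_distance(synthetic_value, real_value, real_range):
    """Distance between two numerical values, aware of missing values.

    Hypothetical helper for illustration only, not part of SDMetrics.
    """
    synthetic_missing = np.isnan(synthetic_value)
    real_missing = np.isnan(real_value)

    if synthetic_missing and real_missing:
        return 0.0  # two nulls count as an exact match
    if synthetic_missing or real_missing:
        return 1.0  # assumption: null vs. value is the maximum distance

    # Assumption: distances are capped at 1 for values outside the real range.
    return min(abs(synthetic_value - real_value) / real_range, 1.0)


def dcr(synthetic_value, real_values):
    """Distance from one synthetic value to its closest real value."""
    real = np.asarray(real_values, dtype=float)
    # The range ignores nulls and is taken from the real column only.
    real_range = np.nanmax(real) - np.nanmin(real)
    return min(numeric_distance(synthetic_value, r, real_range) for r in real)


real = [0, 10, 3, 4, 1, np.nan]
null_dcr = dcr(np.nan, real)   # closest real row is the null one
value_dcr = dcr(5.0, real)     # closest real row is 4, range is 10
```

With a single synthetic row, the median DCR is just that row's DCR, so the second reproduction case should come out to 0 under these rules.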