Closed
Description
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDMetrics version: 0.19.1 (DCR Branch)
- Python version: Python 3.11
- Operating System: Jupyter, MacOS
Error Description
A UserWarning appears multiple times:
'The columns ('billing_address', 'credit_card_number', 'guest_email') are in the metadata but they are not present in the data.'
This is caused by the call chain _process_data_with_metadata -> _remove_missing_columns_metadata running multiple times when the input is sanitized.
[Edit from Neha] In our meeting on Mar 13, 2025 we discussed the following as the root cause: Internally, the metrics drop any columns that are not used for the computation (i.e. id and PII columns). However, these columns are not removed from the metadata, which causes this inconsistency between the data and the metadata. The warnings do not have any impact on the final score.
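The root cause described above can be illustrated with a minimal sketch. It does not use SDMetrics internals: `check_columns` and `count_warnings` are hypothetical stand-ins, the metadata is a plain dict shaped like single-table metadata, and the sdtypes shown are illustrative. It only shows that a stale metadata dict triggers one warning per sanitized table, while pruning the metadata alongside the data triggers none.

```python
import warnings


def check_columns(data_columns, metadata):
    """Hypothetical stand-in for the sanitizing step: warn about metadata
    columns that are absent from the data."""
    missing = tuple(sorted(set(metadata['columns']) - set(data_columns)))
    if missing:
        warnings.warn(
            f'The columns {missing} are in the metadata but they are not '
            'present in the data.',
            UserWarning,
        )


def count_warnings(tables, metadata):
    """Run the check once per table and count the warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # do not deduplicate repeated warnings
        for columns in tables:
            check_columns(columns, metadata)
    return len(caught)


# Illustrative metadata; the sdtypes are assumptions, not copied from the demo dataset.
metadata = {
    'columns': {
        'guest_email': {'sdtype': 'email'},
        'billing_address': {'sdtype': 'address'},
        'credit_card_number': {'sdtype': 'credit_card_number'},
        'room_rate': {'sdtype': 'numerical'},
    }
}

# The id and PII columns were already dropped from the data before the check.
used_columns = ['room_rate']
tables = [used_columns] * 3  # training, synthetic, and holdout tables

# Stale metadata: the warning fires once per table.
stale_count = count_warnings(tables, metadata)

# Pruning the metadata together with the data keeps the two consistent.
pruned_metadata = {
    'columns': {
        name: spec
        for name, spec in metadata['columns'].items()
        if name in used_columns
    }
}
pruned_count = count_warnings(tables, pruned_metadata)
```

Under these assumptions, one possible fix is to drop the unused columns from the metadata at the same point where they are dropped from the data.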
Steps to reproduce
From BugHunt link:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from sklearn.model_selection import train_test_split
# Import missing from the original snippet; this path is an assumption for the DCR branch
from sdmetrics.single_table import DCROverfittingProtection
real_data, metadata = download_demo('single_table', 'fake_hotel_guests')
# Use train_test_split to have a training data set and a holdout set
train_df, holdout_df = train_test_split(real_data, test_size=0.2)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(train_df)
synthetic_data = synthesizer.sample(1000)
num_rows_subsample = 100
num_iterations = 1
# SDMetrics uses a dictionary of single table metadata and not the Metadata class
metadata_dict = metadata._convert_to_single_table().to_dict()
compute_breakdown_result = DCROverfittingProtection.compute_breakdown(
    train_df, synthetic_data, holdout_df, metadata_dict, num_rows_subsample, num_iterations
)
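Since the warnings do not affect the final score, a user-side workaround until a fix lands could be to filter this specific message around the metric call. The sketch below uses only the standard `warnings` module. `noisy_metric` is a hypothetical stand-in for the `compute_breakdown` call, and `record=True` is used here only so the result can be inspected.

```python
import warnings


def noisy_metric():
    """Hypothetical stand-in for the metric call: emits the warning
    several times and returns a score."""
    for _ in range(3):
        warnings.warn(
            "The columns ('billing_address', 'credit_card_number', "
            "'guest_email') are in the metadata but they are not present "
            'in the data.',
            UserWarning,
        )
    return 0.5


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Ignore only this message; the regex is matched against the start
    # of the warning text. Other warnings are still reported.
    warnings.filterwarnings(
        'ignore',
        message=r'The columns .* are in the metadata but they are not present',
        category=UserWarning,
    )
    score = noisy_metric()
```

In a notebook, a plain `warnings.catch_warnings()` block with the same `filterwarnings` call should be enough; the filter is restored when the block exits.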
Labels: bug (Something isn't working)