DCROverfitting and DCRBaseline metrics produce too many warnings about missing columns. #737

@lajohn4747

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.19.1 (DCR Branch)
  • Python version: Python 3.11
  • Operating System: Jupyter, MacOS

Error Description

A UserWarning appears multiple times:

'The columns ('billing_address', 'credit_card_number', 'guest_email') are in the metadata but they are not present in the data.' 

This is caused by _process_data_with_metadata -> _remove_missing_columns_metadata being called multiple times when the input is sanitized.

[Edit from Neha] In our meeting on Mar 13, 2025 we discussed the following as the root cause: Internally, the metrics drop any columns that are not used for the computation (i.e. id and PII columns). However, these columns are not removed from the metadata, which causes the inconsistency between data and metadata that triggers the warning. The warnings do not have any impact on the final score.
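To make the root cause above concrete, here is a minimal sketch that uses only the standard library. The function `remove_missing_columns_metadata`, the helper `count_warnings`, and the toy metadata are hypothetical stand-ins written for illustration; they are not the SDMetrics internals. The sketch assumes the sanitizing step warns on every call in which the metadata lists a column that the data no longer has.

```python
import warnings


def remove_missing_columns_metadata(columns, metadata):
    """Hypothetical stand-in for the sanitizing step.

    Warns about metadata columns that are absent from the data and
    returns a copy of the metadata without them.
    """
    missing = sorted(set(metadata['columns']) - set(columns))
    if missing:
        warnings.warn(
            f'The columns {tuple(missing)} are in the metadata but they are '
            'not present in the data.',
            UserWarning,
        )
    kept = {name: spec for name, spec in metadata['columns'].items() if name in columns}
    return {**metadata, 'columns': kept}


# Toy metadata (illustrative only): two PII columns and one numerical column.
metadata = {
    'columns': {
        'guest_email': {'sdtype': 'email', 'pii': True},
        'billing_address': {'sdtype': 'address', 'pii': True},
        'room_rate': {'sdtype': 'numerical'},
    }
}

# The metric has already dropped the PII columns from the data.
data_columns = ['room_rate']


def count_warnings(meta, calls=3):
    """Run the sanitizing step `calls` times and count the emitted warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        for _ in range(calls):
            remove_missing_columns_metadata(data_columns, meta)
    return len(caught)


# Stale metadata: every sanitizing call warns again.
stale_count = count_warnings(metadata)

# Metadata pruned together with the data: no warnings at all.
pruned_metadata = {
    **metadata,
    'columns': {k: v for k, v in metadata['columns'].items() if k in data_columns},
}
pruned_count = count_warnings(pruned_metadata)

print(stale_count, pruned_count)
```

In this sketch, the warnings disappear once the metadata is pruned at the same point where the columns are dropped, which is one possible direction for a fix and not necessarily the one the maintainers will choose.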

Steps to reproduce

From BugHunt link:

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
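from sklearn.model_selection import train_test_split
# Assumed import path for the metric; it may differ on the DCR branch
from sdmetrics.single_table import DCROverfittingProtection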

real_data, metadata = download_demo('single_table', 'fake_hotel_guests')

# Use train_test_split to have a training data set and a holdout set
train_df, holdout_df = train_test_split(real_data, test_size=0.2)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(train_df)
synthetic_data = synthesizer.sample(1000)

num_rows_subsample = 100
num_iterations = 1
# SDMetrics uses a dictionary of single-table metadata and not the Metadata class
metadata_dict = metadata._convert_to_single_table().to_dict()
compute_breakdown_result = DCROverfittingProtection.compute_breakdown(
    train_df, synthetic_data, holdout_df, metadata_dict, num_rows_subsample, num_iterations
)
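
Since the warnings are reported not to affect the score, one possible interim workaround is to filter this specific message with Python's `warnings` module. The sketch below is self-contained: `noisy_metric` is a hypothetical stand-in that emits the same message twice, and `MESSAGE_PATTERN` is a name introduced here for illustration. In practice the `with` block would wrap the `compute_breakdown` call from the reproduction steps.

```python
import warnings

# Regex matched against the start of the warning message (illustrative name).
MESSAGE_PATTERN = r'The columns .* are in the metadata but they are not present in the data'


def noisy_metric():
    """Hypothetical stand-in for a metric call that emits the warning twice."""
    for _ in range(2):
        warnings.warn(
            "The columns ('billing_address', 'credit_card_number', 'guest_email') "
            'are in the metadata but they are not present in the data.',
            UserWarning,
        )
    return 'done'  # placeholder return value, not a real metric result


with warnings.catch_warnings(record=True) as caught:
    # Record everything, except the one message we choose to ignore.
    warnings.simplefilter('always')
    warnings.filterwarnings('ignore', message=MESSAGE_PATTERN, category=UserWarning)
    result = noisy_metric()

print(result, len(caught))
```

The filter only applies inside the `with` block, so other warnings raised elsewhere in the notebook remain visible.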

Labels: bug