HMASynthesizer diagnostic score is not 1.0 when using `'truncnorm'` distribution #1831

npatki · 2024-03-04T21:15:29Z

Environment Details

SDV version: 1.10.0
Python version: (any)
Operating System: (any)

Error Description

If I update the default distribution to 'truncnorm', then the HMASynthesizer creates synthetic data that is not completely valid. When running the diagnostic report, the Data Validity score is not 100% -- because there are extra NaN/NaT values that appear in the synthetic data.

Steps to reproduce

Replicate this using the attached metadata and data.

from sdv.datasets.local import load_csvs
from sdv.evaluation.multi_table import run_diagnostic
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

data = load_csvs(folder_name='test_data/')
metadata = MultiTableMetadata.load_from_json('test_metadata.json')

synthesizer = HMASynthesizer(metadata)
for table_name in data.keys():
  synthesizer.set_table_parameters(
    table_name=table_name,
    table_parameters={'default_distribution': 'truncnorm'})

synthesizer.fit(data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
  real_data=data, synthetic_data=synthetic_data, metadata=metadata)

test_data.zip
test_metadata.json

OUTPUT:
At first, you'll see many warnings originating from the truncated Gaussian distribution during modeling:

site-packages/copulas/univariate/truncated_gaussian.py:45: RuntimeWarning: invalid value encountered in scalar divide
site-packages/copulas/univariate/truncated_gaussian.py:46: RuntimeWarning: divide by zero encountered in scalar divide

Then, during sampling, more warnings report that the transformed data (coming directly from the ML models) contains null values, so the final synthetic data (after the reverse transform) will also have null values.

site-packages/rdt/transformers/utils.py:217: UserWarning: There are null values in the transformed data. The reversed transformed data will contain null values.

Finally, the diagnostic is not 100%:

Overall Score: 94.67%

Properties:
- Data Validity: 84.01%
- Data Structure: 100.0%
- Relationship Validity: 100.0%
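
To narrow down which columns are pulling the Data Validity score below 100%, it can help to compare null rates column by column. The following is only a sketch: `null_rate_increase` is a hypothetical helper (not part of SDV), and the toy tables stand in for the real `data` and `synthetic_data` dictionaries from the reproduction steps.

```python
import numpy as np
import pandas as pd


def null_rate_increase(real_data, synthetic_data):
    """List columns whose null rate is higher in the synthetic data.

    Both arguments are dicts mapping table names to DataFrames, the same
    shape that load_csvs() and HMASynthesizer.sample() return.
    """
    rows = []
    for table_name, real_table in real_data.items():
        synthetic_table = synthetic_data[table_name]
        for column in real_table.columns:
            if column not in synthetic_table.columns:
                continue
            real_rate = real_table[column].isna().mean()
            synthetic_rate = synthetic_table[column].isna().mean()
            if synthetic_rate > real_rate:
                rows.append({
                    'table': table_name,
                    'column': column,
                    'real_null_rate': real_rate,
                    'synthetic_null_rate': synthetic_rate,
                })
    return pd.DataFrame(
        rows,
        columns=['table', 'column', 'real_null_rate', 'synthetic_null_rate'],
    )


# Toy stand-ins for the real and synthetic tables (illustrative values only).
toy_real = {'users': pd.DataFrame({'age': [20, 30, 40, 50], 'score': [1.0, 2.0, 3.0, 4.0]})}
toy_synthetic = {'users': pd.DataFrame({'age': [21, np.nan, 39, np.nan], 'score': [1.1, 2.2, 2.9, 4.3]})}

report = null_rate_increase(toy_real, toy_synthetic)
print(report)
```

With the actual data, calling `null_rate_increase(data, synthetic_data)` should list the columns where extra NaN/NaT values were introduced.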

Additional Context

This was first observed in #1755


frances-h · 2024-03-05T22:18:51Z

@npatki FYI this appears to happen whenever the sampled `a` value is greater than the sampled `b` value.
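
The effect of inverted bounds can be illustrated with SciPy directly. This is a minimal sketch that assumes the nulls come from evaluating a truncated normal whose lower bound `a` exceeds its upper bound `b`. It does not reproduce the internals of copulas or HMASynthesizer, and the bound values are made up for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

# Valid case: lower bound a is below upper bound b.
valid = truncnorm.ppf(0.5, a=-1.0, b=1.0)

# Inverted case: a > b. SciPy treats these as invalid shape parameters,
# so the inverse CDF does not raise; it returns NaN instead.
inverted = truncnorm.ppf(0.5, a=1.0, b=-1.0)

print('valid bounds   ->', valid)
print('inverted bounds ->', inverted)
print('inverted result is NaN:', bool(np.isnan(inverted)))
```

If the synthesizer ends up with `a > b` for a child table's column, every value drawn through the inverse CDF for that column would be null in the same way, which is consistent with the rdt warning about null values in the transformed data.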

npatki added the bug and data:multi-table labels Mar 4, 2024

frances-h mentioned this issue Mar 25, 2024

HMASynthesizer diagnostic score is not 1.0 when using 'truncnorm' distribution #1867

Merged

frances-h closed this as completed in #1867 Apr 1, 2024

amontanez24 assigned frances-h Apr 11, 2024

amontanez24 added this to the 1.12.0 milestone Apr 11, 2024
