HMASynthesizer diagnostic score is not 1.0 when using 'truncnorm' distribution #1831

Closed
npatki opened this issue Mar 4, 2024 · 1 comment · Fixed by #1867
Labels: bug, data:multi-table

npatki commented Mar 4, 2024

Environment Details

  • SDV version: 1.10.0
  • Python version: (any)
  • Operating System: (any)

Error Description

If I update the default distribution to 'truncnorm', then the HMASynthesizer creates synthetic data that is not completely valid. When running the diagnostic report, the Data Validity score is not 100% -- because there are extra NaN/NaT values that appear in the synthetic data.

Steps to reproduce

Replicate this using the attached metadata and data.

from sdv.datasets.local import load_csvs
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer
from sdv.evaluation.multi_table import run_diagnostic

data = load_csvs(folder_name='test_data/')
metadata = MultiTableMetadata.load_from_json('test_metadata.json')

synthesizer = HMASynthesizer(metadata)
for table_name in data.keys():
  synthesizer.set_table_parameters(
    table_name=table_name,
    table_parameters={'default_distribution': 'truncnorm'})

synthesizer.fit(data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
  real_data=data, synthetic_data=synthetic_data, metadata=metadata)

test_data.zip
test_metadata.json

OUTPUT:
At first, you'll see many warnings originating from the truncated Gaussian distribution during modeling:

site-packages/copulas/univariate/truncated_gaussian.py:45: RuntimeWarning: invalid value encountered in scalar divide
site-packages/copulas/univariate/truncated_gaussian.py:46: RuntimeWarning: divide by zero encountered in scalar divide

Then, during sampling, there are more warnings that the transformed data (coming directly from the ML models) contains null values, so the final synthetic data (after reverse transforming) will also contain null values.

site-packages/rdt/transformers/utils.py:217: UserWarning: There are null values in the transformed data. The reversed transformed data will contain null values.

Finally, the diagnostic is not 100%:

Overall Score: 94.67%

Properties:
- Data Validity: 84.01%
- Data Structure: 100.0%
- Relationship Validity: 100.0%
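
To locate where the extra NaN/NaT values end up, one option is to compare null counts per column between the real and synthetic tables. The sketch below assumes both datasets are dictionaries mapping table names to pandas DataFrames (as returned by load_csvs and sample above). The helper name extra_null_counts and the toy tables are hypothetical and only for illustration.

```python
import pandas as pd


def extra_null_counts(real_data, synthetic_data):
    """Map (table, column) to the synthetic null count for every column
    that has no nulls in the real data but some in the synthetic data.

    Hypothetical helper; not part of SDV.
    """
    extra = {}
    for table_name, real_table in real_data.items():
        synthetic_table = synthetic_data[table_name]
        for column in real_table.columns:
            if column not in synthetic_table.columns:
                continue
            real_nulls = int(real_table[column].isna().sum())
            synthetic_nulls = int(synthetic_table[column].isna().sum())
            if real_nulls == 0 and synthetic_nulls > 0:
                extra[(table_name, column)] = synthetic_nulls
    return extra


# Toy stand-ins for the real and synthetic datasets (made-up values).
toy_real = {'t': pd.DataFrame({'x': [1.0, 2.0], 'y': [1.0, None]})}
toy_synthetic = {'t': pd.DataFrame({'x': [1.0, float('nan')], 'y': [None, None]})}

toy_result = extra_null_counts(toy_real, toy_synthetic)
print(toy_result)
```

With the real inputs, the call would be extra_null_counts(data, synthetic_data). Columns that already contain nulls in the real data are skipped, since nulls there are expected.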

Additional Context

This was first observed in #1755

frances-h commented

@npatki FYI this appears to happen whenever the sampled 'a' value is greater than the sampled 'b' value.
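
For context, a truncated normal is only defined when its lower bound a is below its upper bound b. SciPy's truncnorm enforces a < b, and methods such as ppf return nan for parameters that fail this check. The following is a stdlib-only sketch that mimics that rule. The function truncnorm_ppf is a hypothetical helper for illustration, not the copulas or SciPy implementation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sigma 1


def truncnorm_ppf(q, a, b):
    """Inverse CDF of a standard normal truncated to [a, b].

    Hypothetical helper for illustration only. Returns nan when the
    bounds are not ordered (a >= b), mimicking how SciPy reports
    invalid shape parameters.
    """
    if not a < b:
        return math.nan
    lower = _STD_NORMAL.cdf(a)
    upper = _STD_NORMAL.cdf(b)
    # Map q in (0, 1) onto the probability mass between the two bounds.
    return _STD_NORMAL.inv_cdf(lower + q * (upper - lower))


print(truncnorm_ppf(0.5, -1.0, 1.0))  # valid bounds: a finite value
print(truncnorm_ppf(0.5, 2.0, 1.0))   # a > b: nan
```

If the child-table parameters sampled by the synthesizer end up with a > b, every value drawn from that distribution would be nan under this rule, which would be consistent with the null-value warnings above.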
