HMASynthesizer sometimes creates null values (out-of-bounds parameters synthesized) #1691

Closed
npatki opened this issue Nov 27, 2023 · 1 comment · Fixed by #1764
Labels: bug (Something isn't working), data:multi-table (Related to multi-table, relational datasets)


npatki commented Nov 27, 2023

Environment Details

  • SDV version: Any

Error Description

The SDV is only supposed to synthesize null values if the real data also has null values. However, in some cases, the HMA Synthesizer creates erroneous null values (in columns that are not supposed to have these). These incorrect nulls only appear in child tables (i.e. tables with a parent). The root tables are unaffected/do not contain any of these nulls.
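One way to confirm whether a dataset is affected is to compare null presence per column between the real and synthetic versions of each child table. The sketch below is only an illustration: it uses plain dictionaries of column values rather than the SDV or pandas APIs, and the helper names and the sample `sessions` data are hypothetical.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unexpected_null_columns(real, synthetic):
    """Return columns that contain nulls in `synthetic` but none in `real`.

    Both arguments map a column name to a list of values.
    """
    flagged = []
    for column, synthetic_values in synthetic.items():
        real_has_nulls = any(is_null(v) for v in real.get(column, []))
        synthetic_has_nulls = any(is_null(v) for v in synthetic_values)
        if synthetic_has_nulls and not real_has_nulls:
            flagged.append(column)
    return flagged


# Hypothetical child table: 'referrer' legitimately has nulls, 'duration' does not.
real_sessions = {
    'duration': [12.5, 3.0, 7.25],
    'referrer': ['ad', None, 'search'],
}
synthetic_sessions = {
    'duration': [10.1, float('nan'), 6.4],
    'referrer': ['ad', 'search', None],
}

flagged = unexpected_null_columns(real_sessions, synthetic_sessions)
print(flagged)
```

Only `duration` is flagged here, because nulls in `referrer` also exist in the real data and are therefore expected.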

Root Cause

The HMA algorithm works by summarizing the distribution of each parent's children. For example, using a Beta distribution, it summarizes the children using the parameters alpha, beta, loc and scale. It then models these parameters and creates new ones from scratch during sampling.

Unfortunately, the new parameters that are sampled are not guaranteed to be in-bounds. So there is a chance that the sampled alpha or beta parameter will be <= 0. This is invalid for a Beta distribution, which is only defined when alpha and beta are > 0.
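The failure mode can be sketched with the Python standard library alone. This is a simplified illustration, not the HMA implementation: it assumes the alpha column is modeled with a single normal distribution (HMA models the extended columns jointly), and the observed alpha values and the helper `try_sample_beta` are hypothetical.

```python
import random
from statistics import NormalDist

# Hypothetical alpha parameters summarizing the children of five parent rows.
observed_alphas = [0.4, 0.6, 1.2, 0.3, 2.0]

# Simplifying assumption: model the parameter column with one normal distribution.
parameter_model = NormalDist.from_samples(observed_alphas)

# Probability that a newly sampled alpha is <= 0, i.e. invalid for a Beta distribution.
p_invalid = parameter_model.cdf(0.0)
print(f'Chance of an out-of-bounds alpha: {p_invalid:.1%}')


def try_sample_beta(alpha, beta):
    """Return one Beta sample, or None when the parameters are out of bounds."""
    try:
        return random.betavariate(alpha, beta)
    except ValueError:
        # random.betavariate rejects alpha <= 0 or beta <= 0.
        return None


valid_sample = try_sample_beta(0.9, 1.5)
invalid_sample = try_sample_beta(-0.2, 1.5)
```

Every observed alpha is positive, yet the fitted model still assigns a non-trivial probability to values at or below zero. The `None` returned for the invalid case only stands in for the missing values described in this issue.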

Fixes

HMA should apply a FloatFormatter to each of the extended columns (for marginal distributions as well as the covariance columns). The FloatFormatter should be set up to clip the synthesized min/max values.

FloatFormatter(enforce_min_max_values=True)

Note that these transformers should be accessible after fitting in an easy-to-understand way. For HSA, we are using the parameter extended_columns. We should do the same here.

>>> synthesizer.extended_columns['my_table_name']
{
   '<extended_column_name>': FloatFormatter(enforce_min_max_values=True),
   '<extended_column_name>': FloatFormatter(enforce_min_max_values=True),
   ...
}

A more robust option for HMA would be to apply some kind of transformer to the extended columns (for each parameter: alpha, beta, etc.). This transformer could be responsible for clipping the min/max values in case they are synthesized to be out-of-bounds.
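The clipping idea can be sketched as follows. This is a minimal stand-in written for illustration, not the RDT `FloatFormatter` implementation: the class `MinMaxClipper` and the sample values are hypothetical, and the only behavior shown is remembering the min/max seen during fitting and clipping sampled values back into that range.

```python
class MinMaxClipper:
    """Hypothetical transformer that clips values to the range seen during fit."""

    def __init__(self):
        self.min_value = None
        self.max_value = None

    def fit(self, values):
        # Remember the bounds of the parameter values learned from the real data.
        self.min_value = min(values)
        self.max_value = max(values)
        return self

    def reverse_transform(self, values):
        # Pull any out-of-range sampled value back to the nearest fitted bound.
        return [min(max(v, self.min_value), self.max_value) for v in values]


# Hypothetical alpha values learned during fitting (all valid, i.e. > 0).
fitted_alphas = [0.4, 0.6, 1.2, 0.3, 2.0]
clipper = MinMaxClipper().fit(fitted_alphas)

# Newly sampled alphas, one below and one above the fitted range.
sampled_alphas = [-0.2, 0.8, 3.1]
clipped_alphas = clipper.reverse_transform(sampled_alphas)
print(clipped_alphas)
```

Because the fitted parameters come from valid distributions, clipping to their observed range keeps every sampled alpha and beta strictly positive.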

There is an issue for this in Copulas (see issue 367). However, Copulas is not really expected to work with invalid parameter values.


npatki commented Jan 24, 2024

Workarounds

Option 1: Users encountering this issue may have better luck with using the 'truncnorm' (or 'norm') distribution rather than the default 'beta' distribution. This is not a guaranteed fix, but it makes it much less likely for the synthesizer to run into this issue.

Use the code below to adjust the distribution.

from sdv.multi_table import HMASynthesizer

# TODO replace with your table names (remove the `...` placeholder)
TABLE_NAMES = ['users', 'sessions', 'transactions', ...]

# `metadata` is the multi-table metadata object describing your dataset
synthesizer = HMASynthesizer(metadata)

# Apply the same settings to every table before fitting
for table_name in TABLE_NAMES:
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm',
        },
    )

Option 2: Use the HSASynthesizer instead. It uses a different algorithm, so it does not have the same bug.
