The SDV is only supposed to synthesize null values if the real data also has null values. However, in some cases, the HMA Synthesizer creates erroneous null values in columns that are not supposed to have them. These incorrect nulls only appear in child tables (i.e. tables with a parent). The root tables are unaffected.
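To check whether a synthetic dataset is affected, one option is to compare which columns contain nulls in the real and synthetic versions of each child table. The sketch below is a minimal, library-free illustration: the helper `columns_with_unexpected_nulls` and the sample `sessions` rows are hypothetical, and tables are represented as lists of dicts with `None` standing in for null.

```python
def columns_with_unexpected_nulls(real_rows, synthetic_rows):
    """Return columns that contain nulls in the synthetic rows but not in the real rows."""
    def null_columns(rows):
        return {col for row in rows for col, value in row.items() if value is None}

    return sorted(null_columns(synthetic_rows) - null_columns(real_rows))


# Hypothetical child table: 'amount' has no nulls in the real data.
real_sessions = [
    {'session_id': 1, 'user_id': 10, 'amount': 4.5},
    {'session_id': 2, 'user_id': 11, 'amount': 7.0},
]
synthetic_sessions = [
    {'session_id': 1, 'user_id': 10, 'amount': None},
    {'session_id': 2, 'user_id': 12, 'amount': 6.1},
]

unexpected = columns_with_unexpected_nulls(real_sessions, synthetic_sessions)
print(unexpected)  # ['amount']
```

With pandas DataFrames, the same comparison could be made on the result of `isna().any()` for each table.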
Root Cause
The HMA algorithm works by summarizing the distribution of children. For example, with a Beta distribution, it summarizes the children using the parameters alpha, beta, loc and scale. It then models these parameters and creates new ones from scratch during sampling.
Unfortunately, the newly sampled parameters are not guaranteed to be in bounds, so there is a chance that a sampled alpha or beta parameter will be <0. This is invalid for a Beta distribution, which is only defined when alpha and beta are >0.
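The effect can be reproduced in miniature. The sketch below is a simplified illustration, not HMA's actual modeling code: it assumes the learned alpha values are summarized with a plain normal distribution (an assumption made for brevity), and the `fitted_alphas` values are made up. Every input is a valid alpha, yet some of the newly sampled values fall at or below 0.

```python
import random
import statistics

# Hypothetical alpha values fitted on the children of each parent row (all valid, > 0).
fitted_alphas = [0.4, 0.9, 1.3, 0.2, 2.5, 0.7, 1.8, 0.3]

# Assumption: summarize the learned parameters with an unbounded normal distribution.
mean = statistics.mean(fitted_alphas)
std = statistics.stdev(fitted_alphas)

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_alphas = [rng.gauss(mean, std) for _ in range(1000)]

# A Beta distribution is only defined for alpha > 0, so these samples are unusable.
invalid = [alpha for alpha in sampled_alphas if alpha <= 0]
print(f'{len(invalid)} of {len(sampled_alphas)} sampled alphas are invalid')
```

Any model with unbounded support has this property: nothing in the sampling step knows that the parameter it is generating must stay positive.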
Fixes
HMA should apply a FloatFormatter to each of the extended columns (for marginal distributions as well as the covariance columns). The FloatFormatter should be set up to clip the synthesized min/max values.
FloatFormatter(enforce_min_max_values=True)
Note that these transformers should be accessible after fitting in an easy-to-understand way. For HSA, we are using the parameter extended_columns. We should do the same here.
A more robust option for HMA would be to apply some kind of transformer to the extended columns (for each parameter: alpha, beta, etc.). This transformer could be responsible for clipping the min/max values in case they are synthesized to be out-of-bounds.
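The clipping idea can be sketched without any SDV or RDT code. The `MinMaxClipper` class below is a hypothetical stand-in, not the FloatFormatter implementation: it learns the minimum and maximum of the fitted parameter values and pulls any synthesized value back into that range.

```python
class MinMaxClipper:
    """Hypothetical stand-in for a transformer that enforces learned min/max values."""

    def fit(self, values):
        # Remember the range observed in the real (fitted) parameter values.
        self.min_value = min(values)
        self.max_value = max(values)
        return self

    def clip(self, values):
        # Pull any synthesized value back inside the learned range.
        return [min(max(v, self.min_value), self.max_value) for v in values]


# Made-up example values: fitted alphas are valid, sampled ones may not be.
fitted_alphas = [0.4, 0.9, 1.3, 0.2, 2.5]
sampled_alphas = [-0.3, 0.8, 3.1, 1.1]

clipper = MinMaxClipper().fit(fitted_alphas)
clipped_alphas = clipper.clip(sampled_alphas)
print(clipped_alphas)  # [0.2, 0.8, 2.5, 1.1]
```

Because every fitted alpha and beta is valid (>0), clipping to the observed range keeps the synthesized parameters valid as well.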
There is an issue for this in Copulas (see issue 367). However, Copulas is not really expected to work with invalid parameter values.
npatki changed the title from "HMASynthesizer sometimes reports null values (out-of-bounds parameters synthesized)" to "HMASynthesizer sometimes creates null values (out-of-bounds parameters synthesized)" on Nov 27, 2023
Option 1: Users encountering this issue may have better luck using the 'truncnorm' (or 'norm') distribution rather than the default 'beta' distribution. This is not a guaranteed fix, but it makes the synthesizer much less likely to run into this issue.
Use the code below to adjust the distribution.
from sdv.multi_table import HMASynthesizer

# TODO replace with your table names
TABLE_NAMES = ['users', 'sessions', 'transactions', ...]

synthesizer = HMASynthesizer(metadata)

for table_name in TABLE_NAMES:
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'})
Option 2: Use the HSASynthesizer, which uses a different algorithm and so does not have the same bug.