
Numerical Instability in Constrained GaussianCopula #806

Open
tlranda opened this issue May 19, 2022 · 9 comments · Fixed by #872
Labels
bug Something isn't working

Comments

tlranda commented May 19, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.14.2.dev0
  • Python version: Python 3.7.13
  • Operating System: Linux 5.4.0-88-generic x86_64 GNU/Linux

Error Description

When using integer values in a constrained GaussianCopula, numerical instability can cause valid inputs to fail to produce any rows from the sample_conditions() call. This doesn't happen for all inputs, but the bug can be reliably reproduced for some value(s) within the valid range of virtually any Between constraint instance. I do not believe the problem is strictly isolated to Between, but I have not attempted to reproduce the error with other constraint classes.

What happens

Given a particular Between constraint bounded by low and high, conditional sampling of some values such that low < x < high fails to produce any values, but other valid inputs work as expected. The following exception traceback will be generated, erroneously claiming that the input value violated the constraint:

Traceback (most recent call last):
  File "SDV/sdv/tabular/copulas.py", line 285, in sample_conditions
    conditions, 100, batch_size, randomize_samples, output_file_path)
  File "SDV/sdv/tabular/base.py", line 715, in _sample_conditions
    handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
  File "SDV/sdv/tabular/utils.py", line 175, in handle_sampling_error
    raise sampling_error
  File "SDV/sdv/tabular/base.py", line 711, in _sample_conditions
    batch_size_per_try,
  File "SDV/sdv/tabular/utils.py", line 224, in check_num_rows
    raise ValueError(user_msg)
ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model.
Please try again with a different set of values.

What should happen

The function call is expected to work without issue for all valid integer inputs when constraints are specified as integer types.

Steps to reproduce

import sdv 
from sdv.tabular import GaussianCopula
from sdv.constraints import Between
from sdv.sampling.tabular import Condition
import pandas as pd
import numpy as np

# The high and low values alter which exact integers are unstable
# Another script can be used to search for unstable conversions for arbitrary low/high values
constraint_input = Between(column='input', low=49, high=100)
model = GaussianCopula(
            field_names=['input', 'output'],
            field_transformers = {'input': 'integer', # Problematic conversions may occur
                                  'output': 'float',},
            constraints=[constraint_input],
            min_value = None,
            max_value = None)
# The particular data (and amount used) do not matter, but should be present for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': [_ for _ in range(i,j)],
                                    'output': [np.random.rand() for _ in range(j-i)]})
model.fit(arbitrary_data)
# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])

Running the above as mwe.py:

$ python mwe.py 
Sampling conditions:   0%|         | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mwe.py", line 25, in <module>
    output = model.sample_conditions([conditions])
  File "SDV/sdv/tabular/copulas.py", line 285, in sample_conditions
    conditions, 100, batch_size, randomize_samples, output_file_path)
  File "SDV/sdv/tabular/base.py", line 715, in _sample_conditions
    handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
  File "SDV/sdv/tabular/utils.py", line 175, in handle_sampling_error
    raise sampling_error
  File "SDV/sdv/tabular/base.py", line 711, in _sample_conditions
    batch_size_per_try,
  File "SDV/sdv/tabular/utils.py", line 224, in check_num_rows
    raise ValueError(user_msg)
ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model. 
Please try again with a different set of values.

In the particular example above, debugging shows that during the reverse transformation the internal floating point value is 87.9999999 rather than 88.0, so casting to an integer truncates it to 87 and makes all sampled rows fail the constraint.
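The truncation can be reproduced in isolation: casting a float to int truncates toward zero rather than rounding to nearest, so a value a hair short of the target integer drops by a full unit. A minimal sketch:

```python
# int() truncates toward zero; round() goes to the nearest integer.
almost_88 = 87.9999999
print(int(almost_88))    # 87 -- truncated, fails an input == 88 check
print(round(almost_88))  # 88 -- recovers the intended value
```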

Known Workarounds

I believe the issue should be fixed within SDV itself, but in the meantime it is possible to work around it by avoiding SDV's integer conversions. Ideally, though, numerical stability on integer values would be reliable, as expected.

  • Using the same values in floating point format for each of the constraint, the sampling input, and the training data avoids the integer conversion; the residual numerical error is then smaller than rtol's default value, so validation passes.

  • Instead of using integers, using floating point bounds low=0, high=1 and doing the appropriate data conversions (i.e. x = x*(high-low)+low and x = (x-low)/(high-low), where high and low are the desired integer constraint bounds) before modeling/requesting samples and after sampling also gets around the issue.
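The second workaround can be sketched as a pair of helpers (to_unit/from_unit are hypothetical names for illustration, not SDV API; the rounding on the way back is what makes the round trip exact):

```python
def to_unit(x, low, high):
    # Map an integer in [low, high] onto [0, 1] before modeling/sampling.
    return (x - low) / (high - low)

def from_unit(u, low, high):
    # Map a sampled unit-interval value back to the integer range;
    # round() absorbs any floating point error from the rescaling.
    return round(u * (high - low) + low)

u = to_unit(88, low=49, high=100)
assert from_unit(u, low=49, high=100) == 88
```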

@tlranda tlranda added bug Something isn't working pending review labels May 19, 2022
tlranda commented May 31, 2022

I can also replicate this using floating point values when strict=False on the constraint, if the lower bound is exactly equal to the constrained value. Adding np.finfo(float).eps will push a low value back within the constraint bounds for SDV's validation check; this may also be at play in the integer case above, but it cannot be naively applied to all constrained inputs without adverse effects on upper-bound values.

As far as I've been able to determine, SDV always accepts/rejects constrained input that is equal to the upper constraint value as expected according to the truthiness of strict.
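Worth noting when weighing that eps trick: np.finfo(float).eps is the spacing between 1.0 and the next representable float, so the nudge only registers for values near unit scale; at larger magnitudes it is absorbed by rounding entirely, which is part of why it cannot be applied blindly:

```python
import numpy as np

eps = np.finfo(float).eps  # ~2.22e-16, the gap just above 1.0

# Near unit scale the nudge is representable...
assert 0.5 + eps > 0.5
# ...but at larger magnitudes the spacing between floats exceeds eps,
# so the addition is a no-op.
assert 49.0 + eps == 49.0
```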

@jaehoonkoo

Is there any reason for the data = data * 0.95 + 0.025 and the logit data = np.log(data / (1.0 - data)) in transformation and reverse transform for Between constraints?
https://github.com/sdv-dev/SDV/blob/v0.14.1/sdv/constraints/tabular.py#L876-L877
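For context on those two lines: the affine squeeze maps [0, 1] into (0.025, 0.975) so the logit never sees 0 or 1 (where it diverges), and the reverse transform would apply the sigmoid and unsqueeze. A standalone sketch, where the reverse is simply the algebraic inverse of the quoted lines (SDV's actual implementation may differ); the round trip is only accurate to floating point precision, which is the error source in this issue:

```python
import numpy as np

def forward(data):
    # Squeeze into (0.025, 0.975) so the logit is finite at the boundaries.
    data = data * 0.95 + 0.025
    return np.log(data / (1.0 - data))

def reverse(y):
    # Sigmoid undoes the logit, then unsqueeze back to [0, 1].
    data = 1.0 / (1.0 + np.exp(-y))
    return (data - 0.025) / 0.95

x = (88 - 49) / (100 - 49)    # the condition value, normalized into [0, 1]
x_rt = reverse(forward(x))
assert abs(x_rt - x) < 1e-12  # close, but not guaranteed bit-exact
```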

@amontanez24

@tlranda It seems that during the reverse_transform for the Between constraint, the condition value (i.e. 88) is reconstructed and sometimes comes out slightly off; in this case we get 87.999999999 instead of 88. Then, when the columns are converted back to their original dtypes, the cast to int drops it to 87.

This is definitely a bug that we can fix for the next release. Another workaround for now would be to use reject_sampling as the handling_strategy for the constraint.

@jaehoonkoo

@amontanez24, I made a workaround in sdv/metadata/table.py applying round() before astype() for integer type.
https://github.com/jaehoonkoo/SDV/blob/v0.14.1-fix_btw/sdv/metadata/table.py#L716-L721
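The difference that workaround targets is easy to see on a bare pandas Series: astype truncates toward zero, while rounding first recovers the intended integer:

```python
import pandas as pd

reconstructed = pd.Series([87.9999999, 88.0000001])
print(reconstructed.astype('int64').tolist())          # [87, 88] -- truncation
print(reconstructed.round().astype('int64').tolist())  # [88, 88] -- round first
```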

amontanez24 commented Jun 2, 2022

@amontanez24, I made a workaround in sdv/metadata/table.py applying round() before astype() for integer type. https://github.com/jaehoonkoo/SDV/blob/v0.14.1-fix_btw/sdv/metadata/table.py#L716-L721

Yeah this is pretty much the solution I was considering.

tlranda commented Jun 2, 2022

Another work around for now would be to use reject_sampling as the handling_strategy for the constraint.

From what I've seen in testing, reject_sampling will fail to generate any rows as they all suffer from the bug. I tried @jaehoonkoo's change and that does appear to work as a fix for SDV itself.

@amontanez24

I was able to generate rows using reject_sampling as the handling strategy. It shouldn't fail from this bug, since the condition value never gets altered. It might occasionally fail with the same error message, but that would be because it had a hard time generating the desired number of rows. I was able to do it with the following code snippet:

constraint_input = Between(column='input', low=49, high=100, handling_strategy='reject_sampling')
model = GaussianCopula(
            field_names=['input', 'output'],
            field_transformers={'input': 'integer',  # Problematic conversions may occur
                                'output': 'float'},
            constraints=[constraint_input],
            min_value=None,
            max_value=None)
# The particular data (and amount used) do not matter, but should be present for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': [_ for _ in range(i, j)],
                                    'output': [np.random.rand() for _ in range(j - i)]})
model.fit(arbitrary_data)

# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])

@npatki npatki added under discussion Issue is currently being discussed and removed pending review labels Jun 4, 2022
@npatki npatki removed the under discussion Issue is currently being discussed label Jun 14, 2022
@pvk-developer pvk-developer added this to the 0.16.0 milestone Jul 18, 2022
tlranda commented Jul 7, 2023

This issue has been re-introduced with the ScalarRange constraint, and perhaps other constraint types as well.
I have replicated it in versions 1.1.0 and 1.2.0, but did not track down the exact commit that re-introduced it.

Simple MWE for the bug's return in SDV 1.2.0 (commit 750332ca2dfc58517e88680d37d1e1cbb9e5b819):

import sdv
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition
from sdv.metadata import SingleTableMetadata
import pandas as pd, numpy as np

arbitrary_data = pd.DataFrame({'x': [1,3,6], 'y': [3.,6.,9.]})
meta = SingleTableMetadata()
meta.detect_from_dataframe(arbitrary_data)

lo = 0
hi = 100
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
       'column_name': 'x',
       'low_value': lo,
       'high_value': hi,
       'strict_boundaries': False
    }
}

model = GaussianCopulaSynthesizer(meta, enforce_min_max_values=False)
model.add_constraints(constraints=[my_constraint])
model.fit(arbitrary_data)

n_sampled = {}
good_keys = []
bad_keys = []
tests = np.arange(int(lo), int(hi+1))
# These values will be numerically unstable for the given arbitrary data/constraints -- changing the setup will change which indices are unstable
# tests = [5, 7, 9, 12, 15, 19, 21, 22, 29, 32, 36, 49, 57, 58, 65, 83, 89, 96, 100]
for value in tests:
    condition = [Condition(num_rows=10, column_values={'x': value})]
    try:
        n_sampled[value] = len(model.sample_from_conditions(condition))
    except Exception:
        n_sampled[value] = 0
    if n_sampled[value] == 0:
        bad_keys.append(value)
    else:
        good_keys.append(value)
if len(bad_keys) > len(good_keys):
    print("Bad outcome. Good keys?", good_keys)
else:
    print("Good outcome. Bad keys?", bad_keys)

The fix remains the same as before: integer datatypes MUST be rounded before casting to prevent numerical instability from truncating condition values, thus rejecting all conditionally sampled data.

Example fix:

# Line 1186
else:
+   if self._dtype in [int, np.int64, np.int32]:
+       data = data.round(0)
    table_data[self._column_name] = data.astype(self._dtype)


@amontanez24

@tlranda Thank you for bringing this up. I was able to replicate it as well. Reopening for now.
