Numerical Instability in Constrained GaussianCopula #806
Comments
I can also replicate this using floating point values when … As far as I've been able to determine, SDV always accepts/rejects constrained input that is equal to the upper constraint value as expected, according to the truthiness of …
Is there any reason for the …
@tlranda It seems that during the … This is definitely a bug that we can fix for the next release. Another workaround for now would be to use …
@amontanez24, I made a workaround in `sdv/metadata/table.py` by applying `round()` before `astype()` for integer types.
Yeah, this is pretty much the solution I was considering.
From what I've seen in testing, …
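For illustration, here is a minimal sketch of that round-before-cast workaround; the helper name `safe_int_cast` is hypothetical and not part of SDV's API:

```python
import numpy as np

def safe_int_cast(values, dtype):
    """Round before casting so near-integer floats are not truncated."""
    arr = np.asarray(values, dtype=np.float64)
    if np.issubdtype(np.dtype(dtype), np.integer):
        # Without this, a value like 87.99999999 truncates to 87
        arr = np.round(arr)
    return arr.astype(dtype)

print(safe_int_cast([87.99999999], np.int64)[0])  # 88, not 87
```

A plain `astype(np.int64)` on the same input truncates toward zero instead, which is the behavior at the root of this bug.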
I was able to generate rows using:

```python
# Imports assume the pre-1.0 SDV API used at the time of this report
import numpy as np
import pandas as pd
from sdv.constraints import Between
from sdv.sampling import Condition
from sdv.tabular import GaussianCopula

constraint_input = Between(column='input', low=49, high=100,
                           handling_strategy='reject_sampling')
model = GaussianCopula(
    field_names=['input', 'output'],
    field_transformers={'input': 'integer',  # Problematic conversions may occur
                        'output': 'float'},
    constraints=[constraint_input],
    min_value=None,
    max_value=None)

# The particular data (and amount used) do not matter, but some data must be
# present for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': [_ for _ in range(i, j)],
                                    'output': [np.random.rand() for _ in range(j - i)]})
model.fit(arbitrary_data)

# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])
```
This issue has been re-introduced with ScalarRange, and perhaps other constraint types as well. Simple MWE for the bug's return in SDV 1.2.0 commit:

```python
import sdv
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition
from sdv.metadata import SingleTableMetadata
import pandas as pd, numpy as np

arbitrary_data = pd.DataFrame({'x': [1, 3, 6], 'y': [3., 6., 9.]})
meta = SingleTableMetadata()
meta.detect_from_dataframe(arbitrary_data)
lo = 0
hi = 100
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'x',
        'low_value': lo,
        'high_value': hi,
        'strict_boundaries': False
    }
}
model = GaussianCopulaSynthesizer(meta, enforce_min_max_values=False)
model.add_constraints(constraints=[my_constraint])
model.fit(arbitrary_data)

n_sampled = {}
good_keys = []
bad_keys = []
tests = np.arange(int(lo), int(hi + 1))
# These values will be numerically unstable for the given arbitrary
# data/constraints -- changing the setup will change which indices are unstable
# tests = [5, 7, 9, 12, 15, 19, 21, 22, 29, 32, 36, 49, 57, 58, 65, 83, 89, 96, 100]
for _ in tests:
    condition = [Condition(num_rows=10, column_values={'x': _})]
    try:
        n_sampled[_] = len(model.sample_from_conditions(condition))
    except Exception:
        n_sampled[_] = 0
    if n_sampled[_] == 0:
        bad_keys.append(_)
    else:
        good_keys.append(_)
if len(bad_keys) > len(good_keys):
    print("Bad outcome. Good keys?", good_keys)
else:
    print("Good outcome. Bad keys?", bad_keys)
```

The fix remains the same as before: integer datatypes MUST be rounded before casting, to prevent numerical instability from truncating condition values and thereby rejecting all conditionally sampled data. Example fix:

```diff
 # Line 1186
 else:
+    if self._dtype in [int, np.int64, np.int32]:
+        data = data.round(0)
     table_data[self._column_name] = data.astype(self._dtype)
```

Permalink to the affected code for the above: SDV/sdv/constraints/tabular.py, line 1186 in 750332c
@tlranda Thank you for bringing this up. I was able to replicate it as well. Reopening for now.
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
When using integer values in a constrained GaussianCopula, numerical instability can cause valid inputs to fail to produce any rows with the `sample_conditions()` call. This doesn't happen for all inputs, but the bug can be reliably produced for some value(s) in the range of valid inputs for virtually any `Between` constraint instance. I do not believe the problem is strictly isolated to `Between`, but I have not attempted to reproduce the error for other constraint classes.

What happens

Given a particular `Between` constraint bounded by `low` and `high`, conditional sampling of some values such that `low < x < high` fails to produce any values, while other valid inputs work as expected. The following exception traceback is generated, erroneously claiming that the input value violated the constraint:

What should happen
The function call is expected to work without issue for all valid integer inputs when constraints are specified as integer types.
Steps to reproduce
Running the above as `mwe.py`:

In the particular example above, debugging shows that during the reverse transformation, the internal floating point value is 87.9999999 rather than 88.0, so casting to an integer demotes it to 87 and makes all sampled rows fail the constraint.
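The truncation at the heart of the bug can be demonstrated without SDV at all:

```python
x = 87.99999999     # the kind of value the reverse transform produces
print(int(x))       # truncates to 87, so every sampled row violates the constraint
print(round(x))     # rounds to 88, the intended condition value
```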
Known Workarounds
I believe the issue should be fixed for SDV's benefit, but in the meantime it is possible to get around it by avoiding SDV's integer conversions. Ideally, though, numerical stability would be reliable for integer values as expected.

Using the same values in floating point format for each of the constraint, the sampling input, and the training data avoids the integer conversion; the numerical instability is then smaller than `rtol`'s default value.

Instead of using integers, you can also use floating point values `low=0`, `high=1` and apply the appropriate data conversions (i.e. `x = x*(high-low)+low` and `x = (x-low)/(high-low)`, where `high` and `low` are the desired integer constraint bounds) before modeling/requesting samples and after sampling.