Updates to `GaussianCopula` conditional sampling methods #729

npatki · 2022-03-07T22:33:27Z

Problem Description

When conditional sampling, the GaussianCopula model synthesizes the requested values directly, without reject sampling. We should update the API so that, for this model only, it makes no reference to reject sampling. Note: All other tabular models use reject sampling, so they will have a slightly different API.
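
To illustrate why no rejection step is needed, here is a minimal sketch of conditional sampling from a multivariate normal using the standard closed-form conditional mean and covariance. It is not SDV's implementation: the function name `sample_conditional_gaussian` and its parameters are hypothetical, and it assumes the known values are already expressed in the Gaussian space of the copula.

```python
import numpy as np


def sample_conditional_gaussian(mean, cov, known_idx, known_values, num_rows, seed=None):
    """Hypothetical helper: draw rows from a multivariate normal with some coordinates fixed."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    known_idx = np.asarray(known_idx, dtype=int)
    known_values = np.asarray(known_values, dtype=float)
    free_idx = np.setdiff1d(np.arange(mean.size), known_idx)

    cov_ff = cov[np.ix_(free_idx, free_idx)]    # free x free block
    cov_fk = cov[np.ix_(free_idx, known_idx)]   # free x known block
    cov_kk = cov[np.ix_(known_idx, known_idx)]  # known x known block

    # gain = cov_fk @ inv(cov_kk), computed with a linear solve instead of an explicit inverse
    gain = np.linalg.solve(cov_kk, cov_fk.T).T
    cond_mean = mean[free_idx] + gain @ (known_values - mean[known_idx])
    cond_cov = cov_ff - gain @ cov_fk.T

    rng = np.random.default_rng(seed)
    rows = np.empty((num_rows, mean.size))
    rows[:, known_idx] = known_values  # every row satisfies the condition by construction
    rows[:, free_idx] = rng.multivariate_normal(cond_mean, cond_cov, size=num_rows)
    return rows


rows = sample_conditional_gaussian(
    mean=[0.0, 0.0],
    cov=[[1.0, 0.8], [0.8, 1.0]],
    known_idx=[1],
    known_values=[2.0],
    num_rows=5000,
    seed=0,
)
```

Because each row is built around the fixed values, every draw is usable, which is why arguments such as max_tries have no meaning for this model.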

Expected behavior

The following applies to both sample_conditions and sample_remaining_columns.
(See #691 and #692.)

Remove the batch_size_per_try and max_tries arguments
Add a batch_size argument. It will work exactly as it does in sample
Update the error and warning messages

from sdv.tabular import GaussianCopula
model = GaussianCopula()
model.fit(data)

# see Issue #689 for Condition object details
from sdv.tabular.sampling import Condition
female_users = Condition(column_values={'sex': 'F', 'active_user': True}, num_rows=50)
inactive_users = Condition(column_values={'active_user': False}, num_rows=100)
conditions = [female_users, inactive_users]

# pass in list of conditions
model.sample_conditions(conditions, batch_size=50, randomize_sample=False)

# or pass in a partial dataframe
model.sample_remaining_columns(known_columns=partial_dataframe, batch_size=100)
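
The snippet above does not define partial_dataframe. As a sketch, it could be an ordinary pandas DataFrame holding only the known columns; the values below are made up for illustration, and the one-output-row-per-input-row behavior noted in the comment is an assumption about sample_remaining_columns rather than something stated in this issue.

```python
import pandas as pd

# Illustrative values only: each row fixes the columns that are already known.
partial_dataframe = pd.DataFrame({
    'sex': ['F', 'F', 'M'],
    'active_user': [True, False, False],
})

# Assumption: the model fills in the remaining columns for each of these rows,
# so the synthetic output would have one row per row of partial_dataframe.
expected_num_rows = len(partial_dataframe)
```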

Edge Cases

Scenario 1: Unable to create any rows, either because the input is out-of-bounds or for mathematical reasons (e.g. matrix inversion). Reject sampling is not a reason why this method can fail.

>>> synthetic_data = model.sample_remaining_columns(known_columns=difficult_dataframe)
Error: Unable to sample any rows. This may be because the provided values are out-of-bounds in the current model.
Please try again with a different set of values.
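
A rough sketch of how both failure modes could map to this single message, assuming a conditional-Gaussian computation like the one the model relies on. The helper `conditional_mean_or_error` is hypothetical and not part of SDV; it raises on a singular covariance block and on a non-finite result.

```python
import numpy as np

ERROR_MESSAGE = (
    'Unable to sample any rows. This may be because the provided values are '
    'out-of-bounds in the current model. Please try again with a different set of values.'
)


def conditional_mean_or_error(mean, cov, known_idx, known_values):
    """Hypothetical helper: return the conditional mean or raise one clear error."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    known_idx = np.asarray(known_idx, dtype=int)
    known_values = np.asarray(known_values, dtype=float)
    free_idx = np.setdiff1d(np.arange(mean.size), known_idx)

    cov_fk = cov[np.ix_(free_idx, known_idx)]
    cov_kk = cov[np.ix_(known_idx, known_idx)]

    try:
        # Fails with LinAlgError when the known-column block is singular.
        gain = np.linalg.solve(cov_kk, cov_fk.T).T
    except np.linalg.LinAlgError as error:
        raise ValueError(ERROR_MESSAGE) from error

    cond_mean = mean[free_idx] + gain @ (known_values - mean[known_idx])
    if not np.all(np.isfinite(cond_mean)):
        # Non-finite results are treated here as out-of-bounds input.
        raise ValueError(ERROR_MESSAGE)
    return cond_mean
```

Note that neither branch involves a retry loop: the call either succeeds in one pass or fails for a reason the user can act on.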

Scenario 2: Invalid input -- Same as #691 and #692

# Unexpected column name
>>> c = Condition(column_values={'New_Column': 42})
>>> synthetic_data = model.sample_conditions([c])
Error: Unexpected column name 'New_Column'. Use a column name that was present in the original data.
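
The column-name check could look roughly like the following. This is a sketch only: `validate_condition_columns` is a hypothetical helper, not an SDV function, and the error text is copied from the message proposed above.

```python
def validate_condition_columns(column_values, known_columns):
    """Hypothetical check: reject condition columns the model never saw during fit."""
    for name in column_values:
        if name not in known_columns:
            raise ValueError(
                f"Unexpected column name '{name}'. "
                'Use a column name that was present in the original data.'
            )


# Column names taken from the examples in this issue.
fitted_columns = ['sex', 'active_user']
validate_condition_columns({'sex': 'F', 'active_user': True}, fitted_columns)  # passes silently
```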


npatki added the "feature request" and "data:single-table" labels Mar 7, 2022

katxiao mentioned this issue Mar 8, 2022

Update Gaussian copula conditional sampling API #731

Merged

katxiao self-assigned this Mar 8, 2022

katxiao closed this as completed in #731 Mar 9, 2022

katxiao added this to the 0.14.0 milestone Mar 15, 2022
