Updates to GaussianCopula conditional sampling methods #729

Closed
npatki opened this issue Mar 7, 2022 · 0 comments · Fixed by #731

npatki commented Mar 7, 2022

Problem Description

When sampling conditionally, the GaussianCopula model synthesizes the values directly, without reject sampling. We should update the API, for this model only, so that it makes no reference to reject sampling. Note: All other tabular models use reject sampling, so they will have a slightly different API.
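
To illustrate why direct synthesis needs no retries, here is a minimal sketch of sampling one variable of a bivariate normal conditioned on the other. It is not SDV's implementation; the function names and the numbers are made up for illustration, and a real copula model would also map values through each column's marginal distribution.

```python
import math
import random


def conditional_normal_params(mu1, mu2, sigma1, sigma2, rho, x1):
    """Mean and standard deviation of X2 given X1 = x1 (bivariate normal)."""
    cond_mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    cond_std = sigma2 * math.sqrt(1.0 - rho ** 2)
    return cond_mean, cond_std


def sample_direct(n, mu1, mu2, sigma1, sigma2, rho, x1, rng):
    """Draw n values of X2 | X1 = x1; every draw satisfies the condition."""
    mean, std = conditional_normal_params(mu1, mu2, sigma1, sigma2, rho, x1)
    return [rng.gauss(mean, std) for _ in range(n)]


rng = random.Random(0)
draws = sample_direct(
    20000, mu1=0.0, mu2=1.0, sigma1=1.0, sigma2=2.0, rho=0.5, x1=2.0, rng=rng
)
sample_mean = sum(draws) / len(draws)
```

Because each draw already respects the condition, exactly as many rows are generated as were requested, which is why arguments such as max_tries have no meaning here.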

Expected behavior

The following applies to both sample_conditions and sample_remaining_columns.
(See #691 and #692.)

  • Remove the batch_size_per_try and max_tries arguments
  • Add a batch_size argument, which works exactly as it does in sample
  • Update the error and warning messages
from sdv.tabular import GaussianCopula
model = GaussianCopula()
model.fit(data)

# see Issue #689 for Condition object details
from sdv.tabular.sampling import Condition
female_users = Condition(column_values={'sex': 'F', 'active_user': True}, num_rows=50)
inactive_users = Condition(column_values={'active_user': False}, num_rows=100)
conditions = [female_users, inactive_users]

# pass in list of conditions
model.sample_conditions(conditions, batch_size=50, randomize_sample=False)

# or pass in a partial dataframe
model.sample_remaining_columns(known_columns=partial_dataframe, batch_size=100)
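
One way to picture batch_size without reject sampling is as a plain split of the requested rows into chunks. The sketch below assumes that reading; batch_sizes is a hypothetical helper, not part of SDV, and how sample actually handles batch_size is not specified in this issue.

```python
def batch_sizes(num_rows, batch_size=None):
    """Split num_rows into consecutive chunk sizes (hypothetical helper).

    With direct sampling no rows are discarded, so the chunks add up to
    exactly num_rows and no retry loop is needed.
    """
    if num_rows <= 0:
        return []
    if batch_size is None:
        batch_size = num_rows  # assumption: default to a single batch
    if batch_size <= 0:
        raise ValueError('batch_size must be a positive integer')
    full_batches, remainder = divmod(num_rows, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


plan = batch_sizes(120, batch_size=50)
```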

Edge Cases

Scenario 1: Unable to create any rows, either because the input is out-of-bounds or for mathematical reasons (e.g. matrix inversion). Reject sampling is not a reason why this method can fail.

>>> synthetic_data = model.sample_remaining_columns(known_columns=difficult_dataframe)
Error: Unable to sample any rows. This may be because the provided values are out-of-bounds in the current model.
Please try again with a different set of values.
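
A rough sketch of how such an error could be raised is shown below. The check_known_values helper and the per-column bounds mapping are assumptions made for illustration; the issue does not say how the model detects out-of-bounds input, and this sketch does not cover the matrix inversion case.

```python
def check_known_values(known_values, bounds):
    """Raise if a known value lies outside its assumed fitted range.

    `bounds` maps column name -> (low, high). Both this helper and the
    idea that the model stores such ranges are hypothetical.
    """
    for column, value in known_values.items():
        low, high = bounds[column]
        if not low <= value <= high:
            raise ValueError(
                'Unable to sample any rows. This may be because the provided '
                'values are out-of-bounds in the current model.\n'
                'Please try again with a different set of values.'
            )


bounds = {'age': (18, 90)}
check_known_values({'age': 42}, bounds)  # within range, so no error

try:
    check_known_values({'age': 250}, bounds)
except ValueError as error:
    message = str(error)
```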

Scenario 2: Invalid input -- Same as #691 and #692

# Unexpected column name
>>> c = Condition(column_values={'New_Column': 42})
>>> synthetic_data = model.sample_conditions([c])
Error: Unexpected column name 'New_Column'. Use a column name that was present in the original data.
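
The column check above could be sketched as follows. validate_columns and fitted_columns are hypothetical names chosen for this example; only the error text comes from this issue.

```python
def validate_columns(column_values, fitted_columns):
    """Raise if a condition refers to a column the model was not fit on."""
    for column in column_values:
        if column not in fitted_columns:
            raise ValueError(
                f"Unexpected column name '{column}'. "
                'Use a column name that was present in the original data.'
            )


fitted_columns = ['sex', 'active_user']
validate_columns({'sex': 'F', 'active_user': True}, fitted_columns)  # passes

try:
    validate_columns({'New_Column': 42}, fitted_columns)
except ValueError as error:
    message = str(error)
```
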
@npatki npatki added feature request Request for a new feature data:single-table Related to tabular datasets labels Mar 7, 2022
@katxiao katxiao self-assigned this Mar 8, 2022
@katxiao katxiao added this to the 0.14.0 milestone Mar 15, 2022