Conditional sampling using GaussianCopula inefficient when categories are noised #910

Closed
npatki opened this issue Jul 18, 2022 · 1 comment · Fixed by #912
Labels
bug Something isn't working data:single-table Related to tabular datasets

npatki commented Jul 18, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.16.0.dev2
  • Python version: 3.7
  • Operating System: Colab Notebook

Error Description

The GaussianCopula is listed as the most efficient way to perform conditional sampling. Yet some configurations of this model are far less efficient than others.

  • Inefficient configuration: categorical_transformer='categorical_fuzzy'
  • Efficient configurations (up to 100x faster): categorical_transformer='categorical', 'label_encoding', or 'one_hot_encoding'

Steps to reproduce

The following is inefficient:

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula
from sdv.sampling import Condition

data = load_tabular_demo('student_placements')
data.head()

model = GaussianCopula() # no constraints
model.fit(data)

condition1 = Condition({'gender': 'F', 'high_spec': 'Science'}, num_rows=100)
condition2 = Condition({'gender': 'M', 'high_spec': 'Science'}, num_rows=100)

model.sample_conditions(conditions=[condition1, condition2])

Sampling conditions: 93%|█████████▎| 186/200 [02:05<00:09, 1.49it/s]
/usr/local/lib/python3.7/dist-packages/sdv/tabular/utils.py:211: UserWarning: Only able to sample 186 rows for the given conditions. To sample more rows, try increasing `max_tries_per_batch` (currently: 100). Note that increasing this value will also increase the sampling time.
 warnings.warn(user_msg)

Meanwhile, switching to categorical_transformer='label_encoding' is about 100x faster:

model = GaussianCopula(categorical_transformer='label_encoding')
...
Sampling conditions: 100%|██████████| 200/200 [00:00<00:00, 327.30it/s]
@npatki npatki changed the title Conditional sampling on a GaussianCopula model is inefficient Conditional sampling using GaussianCopula inefficient when categories are noised Jul 18, 2022

npatki commented Jul 18, 2022

Note that there are two related issues:

  1. General performance degradation. Let's focus this issue on that.
  2. Occasionally needing to reject rows, even though this model is not supposed to require reject sampling. This is likely due to RDT #528.
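Both points share a plausible mechanism: when the categorical transform is noised, reverse-transforming a sampled value can land in a different category's interval than the one being conditioned on, so the row fails the condition check and must be re-sampled. The sketch below illustrates that mechanism in isolation; it is a toy model, not the actual RDT implementation, and the interval layout and noise scale are assumptions chosen for illustration.

```python
import random

# Hypothetical illustration: each category owns an equal-width interval
# of [0, 1); the "fuzzy" transform returns the interval's center plus
# Gaussian noise, and the reverse transform maps a value back to a
# category by interval membership.
CATEGORIES = ['F', 'M']
WIDTH = 1.0 / len(CATEGORIES)

def interval(category):
    i = CATEGORIES.index(category)
    return i * WIDTH, (i + 1) * WIDTH

def fuzzy_transform(category, rng, noise_scale=0.2):
    # noise_scale is relative to the interval width; 0.0 means deterministic.
    lo, hi = interval(category)
    center = (lo + hi) / 2
    return rng.gauss(center, noise_scale * WIDTH)

def reverse_transform(value):
    # Clamp into [0, 1) and look up the owning interval.
    index = min(int(max(value, 0.0) / WIDTH), len(CATEGORIES) - 1)
    return CATEGORIES[index]

def rejection_rate(category, rng, n=10_000, noise_scale=0.2):
    # A conditioned row is "rejected" when noise pushes the transformed
    # value into another category's interval, so the round trip no longer
    # matches the condition.
    rejected = sum(
        reverse_transform(fuzzy_transform(category, rng, noise_scale)) != category
        for _ in range(n)
    )
    return rejected / n

rng = random.Random(0)
print(rejection_rate('F', rng))                    # > 0: noise crosses the boundary
print(rejection_rate('F', rng, noise_scale=0.0))   # 0.0: deterministic round trip
```

With zero noise the mapping is deterministic and exactly invertible, which is consistent with 'label_encoding' never needing reject sampling here, while a noised transform forces retries and drives the slowdown seen above.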
