Conditional sampling using GaussianCopula inefficient when categories are noised #910

Closed
npatki opened this issue Jul 18, 2022 · 1 comment · Fixed by #912
Labels
bug Something isn't working data:single-table Related to tabular datasets

npatki commented Jul 18, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.16.0.dev2
  • Python version: 3.7
  • Operating System: Colab Notebook

Error Description

The GaussianCopula is listed as the most efficient way to perform conditional sampling. Yet some configurations of this model are far less efficient than others.

  • Inefficient configuration: categorical_transformer='categorical_fuzzy'
  • Efficient configurations (up to 100x faster): categorical_transformer='categorical', 'label_encoding', or 'one_hot_encoding'

Steps to reproduce

The following is inefficient:

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula
from sdv.sampling import Condition

data = load_tabular_demo('student_placements')
data.head()

model = GaussianCopula() # no constraints
model.fit(data)

condition1 = Condition({'gender': 'F', 'high_spec': 'Science'}, num_rows=100)
condition2 = Condition({'gender': 'M', 'high_spec': 'Science'}, num_rows=100)

model.sample_conditions(conditions=[condition1, condition2])

Sampling conditions: 93%|█████████▎| 186/200 [02:05<00:09, 1.49it/s]
/usr/local/lib/python3.7/dist-packages/sdv/tabular/utils.py:211: UserWarning: Only able to sample 186 rows for the given conditions. To sample more rows, try increasing `max_tries_per_batch` (currently: 100). Note that increasing this value will also increase the sampling time.
 warnings.warn(user_msg)

Meanwhile, switching to categorical_transformer='label_encoding' is about 100x faster:

model = GaussianCopula(categorical_transformer='label_encoding')
...
Sampling conditions: 100%|██████████| 200/200 [00:00<00:00, 327.30it/s]
@npatki npatki changed the title Conditional sampling on a GaussianCopula model is inefficient Conditional sampling using GaussianCopula inefficient when categories are noised Jul 18, 2022

npatki commented Jul 18, 2022

Note that there are two related issues:

  1. General performance degradation. Let's focus this issue on that.
  2. Occasionally needing to reject rows, even though this model is not supposed to require reject sampling. This is likely due to RDT #528.
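Both points share a plausible mechanism: when the categorical transform is noised, reverse-transforming a sampled value can land in a different category's interval than the one being conditioned on, so the row fails the condition check and must be re-sampled. The sketch below illustrates that mechanism in isolation; it is a toy model, not the actual RDT implementation, and the interval layout and noise scale are assumptions chosen for illustration.

```python
import random

# Hypothetical illustration: each category owns an equal-width interval
# of [0, 1); the "fuzzy" transform returns the interval's center plus
# Gaussian noise, and the reverse transform maps a value back to a
# category by interval membership.
CATEGORIES = ['F', 'M']
WIDTH = 1.0 / len(CATEGORIES)

def interval(category):
    i = CATEGORIES.index(category)
    return i * WIDTH, (i + 1) * WIDTH

def fuzzy_transform(category, rng, noise_scale=0.2):
    # noise_scale is relative to the interval width; 0.0 means deterministic.
    lo, hi = interval(category)
    center = (lo + hi) / 2
    return rng.gauss(center, noise_scale * WIDTH)

def reverse_transform(value):
    # Clamp into [0, 1) and look up the owning interval.
    index = min(int(max(value, 0.0) / WIDTH), len(CATEGORIES) - 1)
    return CATEGORIES[index]

def rejection_rate(category, rng, n=10_000, noise_scale=0.2):
    # A conditioned row is "rejected" when noise pushes the transformed
    # value into another category's interval, so the round trip no longer
    # matches the condition.
    rejected = sum(
        reverse_transform(fuzzy_transform(category, rng, noise_scale)) != category
        for _ in range(n)
    )
    return rejected / n

rng = random.Random(0)
print(rejection_rate('F', rng))                    # > 0: noise crosses the boundary
print(rejection_rate('F', rng, noise_scale=0.0))   # 0.0: deterministic round trip
```

With zero noise the mapping is deterministic and exactly invertible, which is consistent with 'label_encoding' never needing reject sampling here, while a noised transform forces retries and drives the slowdown seen above.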
