Conditional sampling + batch size: ValueError: Length of values (1) does not match length of index (5) #886

Closed
npatki opened this issue Jul 8, 2022 · 0 comments · Fixed by #904
Labels
  • bug: Something isn't working
  • data:single-table: Related to tabular datasets
  • feature:sampling: Related to generating synthetic data after a model is built

npatki commented Jul 8, 2022

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.16.0.dev1

Error Description

  1. Create a tabular model
  2. Try conditional sampling (using either sample_conditions or sample_remaining_columns) while also passing the batch_size parameter

Sampling conditions:  25%|██▌       | 5/20 [00:00<00:01, 13.25it/s]
Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-55096f039bf9> in <module>()
      4 condition2 = Condition({'gender': 'M', 'high_spec': 'Science'}, num_rows=10)
      5 
----> 6 model.sample_conditions(conditions=[condition1, condition2], batch_size=5)

9 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/common.py in require_length_match(data, index)
    530     if len(data) != len(index):
    531         raise ValueError(
--> 532             "Length of values "
    533             f"({len(data)}) "
    534             "does not match length of index "

ValueError: Length of values (1) does not match length of index (5)

Expectation

The batch_size should be applied to each condition separately. For example:

from sdv.sampling import Condition

a = Condition({'gender': 'M', 'high_spec': 'Science'}, num_rows=100)
b = Condition({'gender': 'M', 'high_spec': 'Science'}, num_rows=500)

# `model` is an already fitted tabular model
model.sample_conditions(conditions=[a, b], batch_size=200)

Each condition should then be batched at the minimum of its num_rows and batch_size:

  • Condition a should be batched at 100, since min(100, 200) = 100
  • Condition b should be batched at 200, since min(500, 200) = 200
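
The expected behavior can be sketched in plain Python. This is only an illustration of the min(num_rows, batch_size) rule, not SDV's implementation: ConditionSpec and plan_batches are hypothetical names, and the assumption that the last batch shrinks to the rows still needed is mine, not something the report states.

```python
from dataclasses import dataclass


@dataclass
class ConditionSpec:
    """Hypothetical stand-in for a sampling condition (not the SDV class)."""
    column_values: dict
    num_rows: int


def plan_batches(conditions, batch_size):
    """Return, for each condition, the batch sizes that cover its num_rows."""
    plan = []
    for condition in conditions:
        # The effective batch size is capped by the rows this condition requests.
        effective = min(condition.num_rows, batch_size)
        remaining = condition.num_rows
        sizes = []
        while remaining > 0:
            # Assumption: the final batch only asks for the rows still missing.
            step = min(effective, remaining)
            sizes.append(step)
            remaining -= step
        plan.append(sizes)
    return plan


a = ConditionSpec({'gender': 'M', 'high_spec': 'Science'}, num_rows=100)
b = ConditionSpec({'gender': 'M', 'high_spec': 'Science'}, num_rows=500)
print(plan_batches([a, b], batch_size=200))  # [[100], [200, 200, 100]]
```

With the values from the example, condition a is covered by a single batch of 100 and condition b starts with batches of 200.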
@npatki added the bug, data:single-table, feature:sampling, and new labels and removed the new label on Jul 8, 2022
@pvk-developer added this to the 0.16.0 milestone on Jul 18, 2022