Create `sample_remaining_columns()` method #692

npatki · 2022-01-27T00:18:22Z

Problem Description

Let's make sampling more user-friendly. We can create multiple methods for different user needs.

A new sample_remaining_columns() method can handle conditional sampling when the known values are given as a pandas DataFrame.

Expected behavior

Parameters:

(required) known_columns: A pandas.DataFrame with the columns that are already known. Its number of rows specifies how many rows to sample.
max_tries: Renamed from the existing max_retries param (default: 100)
batch_size_per_try: Number of rows to sample per try (default: 10x the requested number of rows)
randomize_samples: Determines whether or not there should be a fixed seed (default: True)

>>> model.sample_remaining_columns(known_columns=my_dataframe)
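
As a rough sketch of how the defaults listed above could be resolved (this is not the actual SDV implementation): `resolve_sampling_options` is a hypothetical helper, and mapping `randomize_samples=False` to a fixed seed of 0 is an assumption, since the proposal only says the flag controls whether a seed is fixed.

```python
import random


def resolve_sampling_options(num_rows, max_tries=100, batch_size_per_try=None,
                             randomize_samples=True):
    """Hypothetical helper: fill in the defaults proposed in this issue.

    num_rows would be len(known_columns) in the proposed method.
    """
    if batch_size_per_try is None:
        # Proposed default: 10x the number of requested rows.
        batch_size_per_try = 10 * num_rows

    # Assumption: randomize_samples=False means "use a fixed seed".
    seed = None if randomize_samples else 0
    rng = random.Random(seed)

    return max_tries, batch_size_per_try, rng
```

With 1000 known rows and no overrides, this would give `max_tries=100` and `batch_size_per_try=10000`.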

Error Handling

Running out of tries

# Always reject-sample gracefully (i.e. return any rows that were sampled)
>>> synthetic_data = model.sample_remaining_columns(known_columns=my_dataframe)
Warning: Only able to sample 75 of the requested rows. To sample more rows, try increasing max_tries
(currently: 100) or increasing batch_size_per_try (currently: 10000). Note that increasing these values will also
increase the sampling time.

# Error if we weren't able to sample any rows
>>> synthetic_data = model.sample_remaining_columns(known_columns=my_dataframe)
Error: Unable to sample any rows for the given conditions. Try increasing max_tries
(currently: 100) or increasing batch_size_per_try (currently: 10000). Note that increasing these values will also
increase the sampling time.
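
A minimal sketch of the two outcomes above, assuming the method draws rows in batches and keeps only those that match the known columns. `sample_matching_rows`, `draw_row` and `is_match` are hypothetical names used for illustration; the real method would operate on DataFrames, and the choice of `ValueError` and `warnings.warn` is also an assumption.

```python
import warnings


def sample_matching_rows(draw_row, is_match, num_rows, max_tries=100,
                         batch_size_per_try=None):
    """Reject-sample until num_rows rows match or max_tries batches are used."""
    if batch_size_per_try is None:
        batch_size_per_try = 10 * num_rows  # proposed default: 10x requested rows

    accepted = []
    for _ in range(max_tries):
        batch = [draw_row() for _ in range(batch_size_per_try)]
        accepted.extend(row for row in batch if is_match(row))
        if len(accepted) >= num_rows:
            return accepted[:num_rows]

    settings = (
        f'increasing max_tries (currently: {max_tries}) or increasing '
        f'batch_size_per_try (currently: {batch_size_per_try}). Note that '
        'increasing these values will also increase the sampling time.'
    )
    if not accepted:
        # Nothing matched at all: fail loudly.
        raise ValueError(
            f'Unable to sample any rows for the given conditions. Try {settings}'
        )

    # Some rows matched: warn, then return the partial result.
    warnings.warn(
        f'Only able to sample {len(accepted)} of the requested rows. '
        f'To sample more rows, try {settings}'
    )
    return accepted
```

Returning the partial result with a warning keeps whatever work was done, while the zero-row case raises because there is nothing useful to return.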

Checking for invalid input

# Unexpected column name
>>> synthetic_data = model.sample_remaining_columns(known_columns=invalid_dataframe)
Error: Unexpected column name 'New_Column'. Use a column name that was present in the original data.
 
# Unexpected categorical column value
>>> synthetic_data = model.sample_remaining_columns(known_columns=invalid_dataframe_2)
Error: Unexpected value 'NEW_VALUE' in column 'New_Column'. Use a value that was present in the original data.
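
The two input checks above could look roughly like the sketch below. `validate_known_columns` and `allowed_values` are hypothetical names, and how the model actually tracks the original columns and categories is not specified in this issue. The function only relies on `.items()`, so it accepts a plain dict of lists and should also accept a pandas.DataFrame, whose `.items()` yields (column name, Series) pairs.

```python
def validate_known_columns(known_columns, allowed_values):
    """Check column names and categorical values before sampling.

    allowed_values maps each original column name to the set of categorical
    values seen in the original data, or to None for non-categorical columns.
    """
    for column, values in known_columns.items():
        if column not in allowed_values:
            raise ValueError(
                f"Unexpected column name '{column}'. "
                'Use a column name that was present in the original data.'
            )

        allowed = allowed_values[column]
        if allowed is None:
            continue  # non-categorical column: no per-value check in this sketch

        for value in values:
            if value not in allowed:
                raise ValueError(
                    f"Unexpected value '{value}' in column '{column}'. "
                    'Use a value that was present in the original data.'
                )
```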


npatki added the feature request and data:single-table labels Jan 27, 2022

This was referenced Jan 27, 2022:

Enable Batch Sampling + Progress Bar #693 (Closed)
Detailed error message when conditional sampling on a model with CustomConstraint #694 (Closed)
Condition on primary keys #697 (Open)
Condition on missing values #695 (Open)

katxiao mentioned this issue Feb 10, 2022:

Add method to sample remaining columns (3/3) #708 (Merged)

katxiao self-assigned this Mar 3, 2022

katxiao closed this as completed Mar 4, 2022

npatki mentioned this issue Mar 7, 2022:

Updates to GaussianCopula conditional sampling methods #729 (Closed)

katxiao added this to the 0.14.0 milestone Mar 15, 2022
