Enable Batch Sampling + Progress Bar #693

Closed
npatki opened this issue Jan 27, 2022 · 0 comments
Labels: data:single-table (related to tabular datasets), feature request (request for a new feature)

npatki (Contributor) commented Jan 27, 2022

Problem Description

Sampling all of the rows in a single call is the most efficient approach, but batch sampling is useful for tracking progress and limiting memory consumption. Let's enable batch sampling for each of our sampling methods (#690, #691, #692).

Expected behavior

In sample():

  • Add batch_size param (default: same as num_rows to yield only 1 batch)
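
As a rough illustration of the default described above, the following sketch splits a request into per-batch row counts. The helper name `batch_sizes` is hypothetical and not part of SDV; it only assumes that a missing `batch_size` falls back to `num_rows`, so a single batch is produced.

```python
def batch_sizes(num_rows, batch_size=None):
    """Yield the number of rows to sample in each batch.

    Hypothetical helper, not part of SDV. When batch_size is None it
    defaults to num_rows, so exactly one batch is produced.
    """
    if num_rows <= 0:
        return
    if batch_size is None:
        batch_size = num_rows
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    remaining = num_rows
    while remaining > 0:
        current = min(batch_size, remaining)
        yield current
        remaining -= current


print(list(batch_sizes(1000)))       # [1000]
print(list(batch_sizes(1000, 300)))  # [300, 300, 300, 100]
```

The last batch is simply whatever remains, so the total always equals `num_rows`.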

In sample_conditions() and sample_remaining_columns():

  • Batch size determined by existing batch_size_per_try param

For all methods:

  • Add output_file_path: Name of file to write to (default: None)
  • Show progress bar when sampling
  • Periodically write to output_file_path. If None, then periodically write to a temp file.

# works for all methods: sample, sample_conditions, sample_remaining_columns

# show progress bar while sampling
# write to a temp file that we can later delete
>>> synthetic_data = model.sample(num_rows=1000)
76%|████████████████████████████         | 756/1000 [00:33<00:10, 229.00it/s]

# write to file path while also returning the samples; show progress
>>> synthetic_data = model.sample(num_rows=1000, output_file_path="./results/sample.csv")
76%|████████████████████████████         | 756/1000 [00:33<00:10, 229.00it/s]
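
One way the behavior above could fit together is sketched below, using only the standard library. Everything here is an assumption for illustration: `sample_in_batches`, `sample_batch`, and `fake_sample_batch` are hypothetical names, rows are plain dicts rather than a DataFrame, and progress is printed as a simple counter instead of the bar shown in the proposal.

```python
import csv
import os
import sys
import tempfile

# Temp file name taken from the proposal's error message; it is written to
# the current working directory when no output_file_path is given.
TEMP_FILE_PATH = ".sample.csv.temp"


def sample_in_batches(sample_batch, num_rows, batch_size=None,
                      output_file_path=None, show_progress=True):
    """Sample num_rows rows in batches, writing each batch to a CSV file.

    sample_batch is a hypothetical callable: sample_batch(n) returns a list
    of n dicts sharing the same keys. Returns (rows, path_written).
    """
    batch_size = batch_size or num_rows  # default: a single batch
    path = output_file_path or TEMP_FILE_PATH
    rows = []
    with open(path, "w", newline="") as handle:
        writer = None
        while len(rows) < num_rows:
            batch = sample_batch(min(batch_size, num_rows - len(rows)))
            if not batch:
                break  # the sampler produced nothing; avoid looping forever
            if writer is None:
                writer = csv.DictWriter(handle, fieldnames=list(batch[0]))
                writer.writeheader()
            writer.writerows(batch)
            handle.flush()  # push the partial results out to the file
            rows.extend(batch)
            if show_progress:
                print(f"\r{len(rows)}/{num_rows} rows", end="", file=sys.stderr)
    if show_progress:
        print(file=sys.stderr)
    return rows, path


def fake_sample_batch(n):
    """Stand-in for a real model: returns n rows of dummy data."""
    return [{"id": i, "value": i * 2} for i in range(n)]


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "sample.csv")
    rows, path = sample_in_batches(fake_sample_batch, num_rows=1000,
                                   batch_size=250, output_file_path=target)
    print(len(rows))  # 1000
```

Flushing after every batch is what keeps the file useful if sampling stops early, at the cost of more frequent disk writes.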

Error States

If the system crashes or the user exits in the middle of sampling, the partial results should still be available in a file.

# works for all methods: sample, sample_conditions, sample_remaining_columns

# Partial results available in requested file path
>>> synthetic_data = model.sample(output_file_path='./results/synthetic.csv')

76%|████████████████████████████         | 756/1000 [00:33<00:10, 229.00it/s]
^C
Error: Sampling terminated. Partial results are stored in './results/synthetic.csv'

# If no file path, partial results are in a temp file
# Temp file will be overwritten on next sample, so tell the user to save it
>>> synthetic_data = model.sample()
^C
Error: Sampling terminated. Partial results are stored in a temporary file: '.sample.csv.temp'.
This file will be overwritten the next time you sample. Rename the file if you wish to save
these results.
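
The user-exit case above could be handled roughly as follows. This is a sketch under stated assumptions, not the SDV implementation: `interruption_message`, `run_sampling`, and `sampling_step` are hypothetical names, and only Ctrl+C (`KeyboardInterrupt`) is caught. A hard crash cannot be intercepted this way, so recovery in that case depends on each batch already having been written to the file.

```python
# Temp file name taken from the proposal's error message.
TEMP_FILE_PATH = ".sample.csv.temp"


def interruption_message(path, is_temp_file):
    """Build the message shown when sampling stops early."""
    if is_temp_file:
        return (
            "Sampling terminated. Partial results are stored in a temporary "
            f"file: '{path}'. This file will be overwritten the next time "
            "you sample. Rename the file if you wish to save these results."
        )
    return f"Sampling terminated. Partial results are stored in '{path}'"


def run_sampling(sampling_step, num_batches, output_file_path=None):
    """Call sampling_step(path) once per batch and report early exits.

    sampling_step is a hypothetical callable that samples one batch and
    appends it to the file at path. Returns True if every batch finished,
    False if the user interrupted the run.
    """
    is_temp_file = output_file_path is None
    path = TEMP_FILE_PATH if is_temp_file else output_file_path
    try:
        for _ in range(num_batches):
            sampling_step(path)
    except KeyboardInterrupt:
        print("Error: " + interruption_message(path, is_temp_file))
        return False
    return True
```

Returning a flag instead of re-raising is a design choice made for this sketch; a real implementation might prefer to raise an error after printing the message.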

npatki added the feature request and data:single-table labels on Jan 27, 2022
katxiao self-assigned this on Mar 3, 2022
katxiao closed this as completed on Mar 4, 2022
katxiao added this to the 0.14.0 milestone on Mar 15, 2022