Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If no filepath is provided, do not create a file during sample #2042

Closed
npatki opened this issue May 31, 2024 · 0 comments · Fixed by #2053
Closed

If no filepath is provided, do not create a file during sample #2042

npatki opened this issue May 31, 2024 · 0 comments · Fixed by #2053
Assignees
Labels
feature request Request for a new feature feature:sampling Related to generating synthetic data after a model is built
Milestone

Comments

@npatki
Copy link
Contributor

npatki commented May 31, 2024

Problem Description

As described in issue #1310, the single table sample method has the following functionality:

  • If you specify an output_filepath, the sampling function will write to the filepath after every batch, ensuring that you do not lose previously sampled data if there is a crash
  • If you do not specify an output_filepath, then the sampling function will still save batched results to .sample.csv.temp (still want to ensure that you have access to some data in the event of a crash). Once the sampling completes, we delete this file.

Problems:

  1. When you specify that output_filepath=None, SDV creates a temporary file anyway. This is unintuitive for users who assume that None means that no output filepath will be provided. It is also unintuitive that SDV deletes the file after-the-fact.
  2. In the event of a crash, there is a message asking users to check .sample.csv.temp, which is unexpected. This was meant to give the user some data rather than nothing. But practice there are always 0 rows in this file, because the default batch_size is kept equal to the sample size.
  3. There is no way to turn off the functionality for saving to a file. Users in both Crashes on FileNotFoundError for .sample.csv.temp #1310 and Sampling should not create a file called ‘.sample.csv.temp’ by default #2029 mention that they'd like to turn it off because it does not work well within their filesystem (they are not worried about crashes during sampling).

Expected behavior

  • If the output_filepath=None, we should no longer create or write to a temporary file. This will align with the user expectations and it will enable more forms of usage. In the event of a crash, update the error message. There will no longer be a file to reference.
  • If output_filepath is provided, we should create and periodically write to the file as we do today. In the event of a crash, the error message should continue to ask the user to check the file.

Default: We can make the default None. In the future, we should figure out a better default for batch_size for very large samples. In such a case, it would make sense to have a output filepath supplied as a default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for a new feature feature:sampling Related to generating synthetic data after a model is built
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants