Is it possible to generate data with new set of primary keys? #686

Closed
abimarticio opened this issue Jan 21, 2022 · 3 comments
Labels: question (General question about the software)

Comments

@abimarticio

I want to generate synthetic data that I can append to my original data. Is there a way to exclude the primary keys that are already in the original data when generating synthetic data?

I want to generate a large amount of data, but I have limited resources, so I am planning to generate small sets of data and append them to my original data.

npatki added the "question" label (General question about the software) on Jan 27, 2022
@npatki
Contributor

npatki commented Jan 27, 2022

Hi @abimarticio,

> Is there a way to exclude the primary keys that are already in the original data when generating synthetic data?

This isn't explicitly supported, so I've filed #697 for tracking progress.

In the meantime, we have found that the defaults work for most cases: if your primary key is a string, the SDV will generate 'a', 'b', 'c', etc. by default (if it's numerical, it'll generate 0, 1, 2, ...). Is this causing conflicts for you?

Other options:

  • Write metadata and specify a different regex for the primary key (see guide)
  • Manually overwrite the column after sampling with whatever you want
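
As a rough illustration of the second option, here is a minimal sketch of picking replacement keys that cannot collide with the original ones. It assumes integer primary keys; `fresh_integer_keys` is a hypothetical helper written for this example, not part of the SDV API.

```python
def fresh_integer_keys(existing_keys, n):
    """Return n integer keys that do not appear in existing_keys.

    Assumption: keys are integers, so starting just past the current
    maximum guarantees there is no overlap with the original data.
    """
    taken = set(existing_keys)
    start = max(taken, default=-1) + 1
    return list(range(start, start + n))


# Original table already uses keys 0, 1, 2 and 7; ask for three new ones.
new_keys = fresh_integer_keys([0, 1, 2, 7], 3)
print(new_keys)  # [8, 9, 10]
```

If the sampled rows are in a pandas DataFrame, the returned list can be assigned to the primary key column before appending the rows to the original data.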

> I want to generate a large amount of data, but I have limited resources, so I am planning to generate small sets of data and append them to my original data.

I have filed #693 for handling batch sampling internally -- so you would not have to do this manually. You can follow that issue for updates.
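
Until that lands, the manual approach could be planned along these lines. This is only a sketch under assumptions: integer primary keys, and a hypothetical helper `plan_batches` that is not an SDV function. It only works out how many rows each batch should contain and which key each batch should start from, so that batches never reuse a key.

```python
def plan_batches(total_rows, batch_size, first_key):
    """Yield (start_key, size) pairs that cover total_rows in chunks.

    Each batch gets its own contiguous key range, so keys assigned to
    one batch never overlap with keys assigned to another.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    produced = 0
    while produced < total_rows:
        size = min(batch_size, total_rows - produced)
        yield first_key + produced, size
        produced += size


# Ten rows in batches of four, with keys starting at 100.
plan = list(plan_batches(10, 4, first_key=100))
print(plan)  # [(100, 4), (104, 4), (108, 2)]
```

For each pair you would sample `size` rows, overwrite the primary key column with `range(start_key, start_key + size)`, and append the batch to your output before sampling the next one, which keeps only one batch in memory at a time.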

Let me know if that helps or if you have any follow-ups!

npatki added the "under discussion" label (Issue is currently being discussed) on Jan 27, 2022
@abimarticio
Author

Hi @npatki! Thank you very much for the reply. This is really helpful.

In the meantime, what I did was overwrite the column after sampling and create a mapping for the foreign keys. I will also try the first option you suggested, and I will follow the issue you filed.

Thank you very much!
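
For readers who want to try the same workaround, the foreign key mapping mentioned above might look roughly like this. It is a sketch, not the exact code used in this thread: `remap_foreign_keys` is a hypothetical helper, and it assumes the old parent keys are unique and that every child row references an existing parent key.

```python
def remap_foreign_keys(old_parent_keys, new_parent_keys, child_foreign_keys):
    """Translate child foreign keys after the parent keys were overwritten.

    old_parent_keys and new_parent_keys must line up row by row, i.e. the
    i-th old key was replaced by the i-th new key.
    """
    if len(old_parent_keys) != len(new_parent_keys):
        raise ValueError("old and new key lists must have the same length")
    mapping = dict(zip(old_parent_keys, new_parent_keys))
    # A KeyError here means a child row points at a parent that does not exist.
    return [mapping[key] for key in child_foreign_keys]


# Sampled parent keys 'a', 'b', 'c' were overwritten with 100, 101, 102.
child_keys = remap_foreign_keys(["a", "b", "c"], [100, 101, 102], ["b", "b", "a", "c"])
print(child_keys)  # [101, 101, 100, 102]
```

With pandas, the same dictionary can be passed to `Series.map` on the child table's foreign key column.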

npatki removed the "under discussion" label (Issue is currently being discussed) on Feb 1, 2022
@npatki
Contributor

npatki commented Feb 1, 2022

Glad it helped! I'll close this issue in favor of the new feature request. Feel free to re-open if you have any follow-ups.

npatki closed this as completed on Feb 1, 2022
katxiao added this to the 0.14.0 milestone on Mar 21, 2022