When generating ids without a regex, create them randomly #1922

Closed
npatki opened this issue Apr 17, 2024 · 0 comments · Fixed by #1931
Labels
feature request Request for a new feature

Comments

Contributor

npatki commented Apr 17, 2024

Problem Description

For any column that has sdtype id without a user-provided regex, the SDV currently generates index values sequentially (e.g., 0, 1, 2, ...). The resulting data doesn't look realistic.

Expected behavior

For any columns of sdtype id that do not have a user-provided regex, ensure that the synthetic data is created randomly.

In technical terms: We currently generate sequential values by assigning RDT's IDGenerator transformer to these columns. Instead, we should assign the AnonymizedFaker transformer to those columns and use the bothify function to make random strings. The exact parameters passed to bothify depend on the dtype (storage type) of the data.

  • If the column is numeric (int, float, etc), we need to ensure that the resulting synthetic values can be cast back to this dtype. In this case, assign the following transformer: AnonymizedFaker(provider_name=None, function_name='bothify', function_kwargs={'text': '##########'})
    • Each # is replaced by a random digit, so this allows for at least 1 billion possible values
    • When cast back to numbers, they will be completely randomized
    • If it's a primary key, then also set cardinality_rule='unique'
  • Otherwise, the synthetic values can remain as a string (object) type. In this case, assign the following transformer: AnonymizedFaker(provider_name=None, function_name='bothify', function_kwargs={'text': 'sdv-id-??????'})
    • This allows for well over 1 billion possible values
    • If it's a primary key, then also set cardinality_rule='unique'
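To make the two patterns concrete, here is a minimal stdlib-only sketch of the substitution rules bothify applies (each # becomes a random digit, each ? becomes a random ASCII letter). It is a stand-in for illustration, not Faker's or RDT's implementation, and the helper names fake_bothify, pattern_space and unique_ids are hypothetical.

```python
import random
import string

NUMERIC_PATTERN = '##########'    # pattern proposed above for numeric id columns
STRING_PATTERN = 'sdv-id-??????'  # pattern proposed above for all other id columns


def fake_bothify(text, rng):
    """Stand-in for bothify: '#' -> random digit, '?' -> random ASCII letter."""
    out = []
    for char in text:
        if char == '#':
            out.append(rng.choice(string.digits))
        elif char == '?':
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(char)  # literal characters such as 'sdv-id-' pass through
    return ''.join(out)


def pattern_space(text):
    """Count the distinct strings a pattern can produce under the rules above."""
    digits = 10 ** text.count('#')
    letters = len(string.ascii_letters) ** text.count('?')
    return digits * letters


def unique_ids(text, count, rng):
    """Emulate cardinality_rule='unique' by rejecting repeats (illustration only)."""
    seen = set()
    ids = []
    while len(ids) < count:
        candidate = fake_bothify(text, rng)
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


if __name__ == '__main__':
    rng = random.Random(0)
    # Numeric pattern: digit-only strings that can be cast back to int.
    print([int(value) for value in unique_ids(NUMERIC_PATTERN, 3, rng)])
    # String pattern: values stay as strings with a fixed 'sdv-id-' prefix.
    print(unique_ids(STRING_PATTERN, 3, rng))
    print(pattern_space(NUMERIC_PATTERN), pattern_space(STRING_PATTERN))
```

Because the numeric pattern has a fixed width, distinct generated strings stay distinct after the cast to a number, even when they start with zeros.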

Additional context

  • This change only applies if a column is sdtype 'id' AND there is no 'regex_format' available.
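The selection rule described in this issue could be sketched roughly as follows. This is an assumption-laden illustration, not the code merged in the fix: pick_id_transformer_spec is a hypothetical helper, and it returns a plain dict describing the AnonymizedFaker arguments instead of instantiating the RDT transformer, so it runs without SDV installed.

```python
def pick_id_transformer_spec(column_metadata, is_numeric, is_primary_key=False):
    """Return the proposed AnonymizedFaker arguments, or None if the rule does not apply.

    Hypothetical helper: 'sdtype' and 'regex_format' are the metadata keys named
    in the issue; the returned dict layout is made up for illustration.
    """
    # The change only applies to id columns without a user-provided regex.
    if column_metadata.get('sdtype') != 'id':
        return None
    if column_metadata.get('regex_format'):
        return None

    # Numeric columns need digit-only values so they can be cast back to the dtype.
    text = '##########' if is_numeric else 'sdv-id-??????'
    spec = {
        'transformer': 'AnonymizedFaker',
        'provider_name': None,
        'function_name': 'bothify',
        'function_kwargs': {'text': text},
    }
    # Primary keys must not repeat.
    if is_primary_key:
        spec['cardinality_rule'] = 'unique'
    return spec


if __name__ == '__main__':
    print(pick_id_transformer_spec({'sdtype': 'id'}, is_numeric=True, is_primary_key=True))
    print(pick_id_transformer_spec({'sdtype': 'id', 'regex_format': '[A-Z]{3}'}, is_numeric=False))
```

Columns that do carry a regex_format, or that have a different sdtype, fall through with None and keep whatever handling they already have.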