The UniformSynthesizer should follow the sdtypes in metadata (not the data's dtypes) #248

Closed
npatki opened this issue Jun 5, 2023 · 0 comments · Fixed by #263
Labels
bug Something isn't working

npatki commented Jun 5, 2023

Environment Details

  • SDGym version: 0.6.0 (latest)

What is expected

The UniformSynthesizer is expected to uniformly (randomly) create data within the observed ranges or categories.

  • For numerical or datetime data, it should learn the min and max values during fit. Then, during sample, it can create random, uniformly distributed data within that range.
  • For categorical or boolean data, it should learn the possible categories during fit. Then, during sample, it can randomly select categories with equal probability (i.e. make it uniform).
  • For any other sdtype (such as id, pii, etc.), it can simply use the RegexGenerator or AnonymizedFaker to generate values from scratch (no learning or uniform sampling expected)
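To make the expected fit/sample behavior concrete, here is a minimal sketch using only the Python standard library. It is not SDGym's implementation: `fit_column` and `sample_column` are hypothetical helper names, columns are plain lists rather than dataframe columns, and the "any other sdtype" case is left out because it depends on RDT's generators.

```python
import random
from datetime import datetime

# Hypothetical helpers for illustration only; not part of SDGym or RDT.
RANGE_SDTYPES = ("numerical", "datetime")
CHOICE_SDTYPES = ("categorical", "boolean")


def fit_column(values, sdtype):
    """Learn what is needed to sample one column uniformly."""
    if sdtype in RANGE_SDTYPES:
        return {"sdtype": sdtype, "min": min(values), "max": max(values)}
    if sdtype in CHOICE_SDTYPES:
        # Sorting the distinct values only makes the result deterministic.
        return {"sdtype": sdtype, "categories": sorted(set(values), key=repr)}
    raise ValueError(f"sdtype {sdtype!r} is not covered by this sketch")


def sample_column(params, num_rows, rng):
    """Draw num_rows values uniformly from what fit_column learned."""
    if params["sdtype"] == "numerical":
        return [rng.uniform(params["min"], params["max"]) for _ in range(num_rows)]
    if params["sdtype"] == "datetime":
        span = params["max"] - params["min"]  # a timedelta
        return [params["min"] + span * rng.random() for _ in range(num_rows)]
    # categorical / boolean: every observed category is equally likely
    return [rng.choice(params["categories"]) for _ in range(num_rows)]


rng = random.Random(0)
ages = fit_column([23, 41, 35], "numerical")
plans = fit_column(["free", "pro", "free"], "categorical")
dates = fit_column([datetime(2023, 1, 1), datetime(2023, 6, 5)], "datetime")
synthetic_ages = sample_column(ages, 5, rng)
synthetic_plans = sample_column(plans, 5, rng)
synthetic_dates = sample_column(dates, 5, rng)
```

Note that the category frequencies seen during fit are deliberately discarded: "free" appears twice in the input but is sampled with the same probability as "pro".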

How does this synthesizer know which type is which? It should use the provided metadata as the source of truth.

What is actually observed

Rather than using the metadata to determine the sdtypes, the code lets RDT guess them from the dataframe. See this line.

The automatically-detected RDT config is not guaranteed to be correct. For example:

  • RDT will detect any integer column as numerical, but it may actually be a categorical sdtype or an ID
  • RDT will detect any string column as categorical, but it may actually be a datetime, PII or ID type
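The mismatch can be illustrated with a small sketch. The `guess_sdtype` function below is a deliberately naive stand-in for value-based detection, not RDT's actual logic, and the column names and declared sdtypes are made up for the example.

```python
def guess_sdtype(values):
    """Naive value-based guess (an assumption, not RDT's real detection)."""
    # bool is a subclass of int in Python, so it has to be checked first.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Hypothetical columns and the sdtypes their metadata would declare.
columns = {
    "zip_code": [2139, 94105, 10001],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-03-20"],
    "user_id": ["ID_001", "ID_002", "ID_003"],
}
declared = {"zip_code": "categorical", "signup_date": "datetime", "user_id": "id"}

# Map each column whose guess disagrees with the metadata to (guessed, declared).
mismatches = {}
for name, values in columns.items():
    guessed = guess_sdtype(values)
    if guessed != declared[name]:
        mismatches[name] = (guessed, declared[name])
```

In this example all three columns are mismatched: the integer zip codes are guessed as numerical, and both string columns are guessed as categorical.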

Instead of calling detect_initial_config, the synthesizer should parse the metadata and use each column's sdtype to decide what to do.
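A rough sketch of what metadata-driven dispatch could look like is below. It assumes the metadata is available as a dict with a `columns` mapping whose entries each carry an `sdtype` key; the function name `plan_from_metadata` and the strategy labels are hypothetical and only illustrate the decision, not how the fix should be wired into the synthesizer.

```python
# Hypothetical strategy labels; the real synthesizer would map these to
# whatever RDT transformers or generators it chooses to use.
RANGE_SDTYPES = {"numerical", "datetime"}
CHOICE_SDTYPES = {"categorical", "boolean"}


def plan_from_metadata(metadata):
    """Pick a per-column strategy from the declared sdtype, never from the data."""
    plan = {}
    for name, spec in metadata["columns"].items():
        sdtype = spec["sdtype"]
        if sdtype in RANGE_SDTYPES:
            plan[name] = "uniform_range"          # learn min/max, sample in between
        elif sdtype in CHOICE_SDTYPES:
            plan[name] = "uniform_choice"         # learn categories, sample equally
        else:
            plan[name] = "generate_from_scratch"  # id, pii, etc.: nothing to learn
    return plan


# Assumed metadata shape, for illustration.
metadata = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "zip_code": {"sdtype": "categorical"},
        "signup_date": {"sdtype": "datetime"},
        "user_id": {"sdtype": "id"},
    }
}
plan = plan_from_metadata(metadata)
```

The point of the design is that `zip_code` ends up with a categorical strategy even if the underlying column holds integers, because the dataframe's dtypes are never consulted.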
