The `UniformSynthesizer` should follow the sdtypes in metadata (not the data's dtypes) #248

npatki · 2023-06-05T15:43:13Z

The UniformSynthesizer is expected to uniformly (randomly) create data within the observed ranges or categories.

For numerical or datetime data, it should learn the min and max values during fit. Then during sample, it can create uniformly distributed random values within that range.
For categorical or boolean data, it should learn the possible categories during fit. Then during sample, it can randomly select categories with equal probability (i.e. make it uniform).
For any other sdtype (such as id, pii, etc.), it can simply use the RegexGenerator or AnonymizedFaker to generate values from scratch (no learning or uniform sampling expected).
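The first two behaviors could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the synthesizer's actual code: `fit_column` and `sample_column` are hypothetical helper names, and sdtypes that are generated from scratch (id, pii, etc.) are deliberately left out.

```python
import random
from datetime import datetime


def fit_column(values, sdtype):
    """Learn only what uniform sampling needs for one column (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if sdtype in ("numerical", "datetime"):
        # Ranges: remember the observed min and max.
        return {"sdtype": sdtype, "min": min(present), "max": max(present)}
    if sdtype in ("categorical", "boolean"):
        # Categories: dict.fromkeys drops duplicates and keeps first-seen order.
        return {"sdtype": sdtype, "categories": list(dict.fromkeys(present))}
    # id, pii, etc. need no learning, so they are out of scope for this sketch.
    raise NotImplementedError(f"sdtype {sdtype!r} is generated from scratch")


def sample_column(learned, num_rows, rng):
    """Draw num_rows values uniformly from what fit_column learned."""
    if learned["sdtype"] == "numerical":
        return [rng.uniform(learned["min"], learned["max"]) for _ in range(num_rows)]
    if learned["sdtype"] == "datetime":
        span = learned["max"] - learned["min"]  # a timedelta
        return [learned["min"] + span * rng.random() for _ in range(num_rows)]
    # categorical / boolean: every observed category is equally likely.
    return [rng.choice(learned["categories"]) for _ in range(num_rows)]


rng = random.Random(0)
ages = sample_column(fit_column([18, 42, 65], "numerical"), 5, rng)
plans = sample_column(fit_column(["free", "pro", "free"], "categorical"), 5, rng)
```

Note that the sdtype is passed in explicitly here; nothing is inferred from the Python types of the values.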

How does this synthesizer know which type is which? It should use the provided metadata as the source of truth.

Currently, rather than using the metadata to understand the sdtypes, the code just lets the RDT guess them from the dataframe. See this line.

The automatically-detected RDT config is not guaranteed to be correct. For example:

The RDT will detect any integers as being numerical, but they may actually be categorical sdtypes or IDs.
The RDT will detect any strings as being categorical, but they may actually be datetimes, PII, or ID types.
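To make the mismatch concrete, here is a small illustration. The `guess_sdtype_from_values` function is a hypothetical stand-in for type-based detection, not RDT's actual logic, and the column names and declared sdtypes are made up for the example.

```python
def guess_sdtype_from_values(values):
    """Naive, type-based guess (hypothetical; not RDT's real detection code)."""
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Example data: zip codes stored as integers, dates stored as strings.
columns = {
    "zip_code": [94105, 10001, 60601],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-03-20"],
}

# What the user declared in the metadata (assumed for this example).
declared = {"zip_code": "categorical", "signup_date": "datetime"}

# Map each disagreeing column to (guessed sdtype, declared sdtype).
mismatches = {
    name: (guess_sdtype_from_values(values), declared[name])
    for name, values in columns.items()
    if guess_sdtype_from_values(values) != declared[name]
}
```

Both columns end up in `mismatches`: the integers are guessed as numerical and the date strings as categorical, even though the metadata says otherwise.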

Instead of calling `detect_initial_config`, the synthesizer should parse the metadata and use each column's sdtype to decide what to do.
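One way this could look is a simple dispatch on the declared sdtype. The sketch below assumes the metadata is available as a dictionary shaped like `{"columns": {name: {"sdtype": ...}}}`; `plan_from_metadata` and the strategy labels are hypothetical names used only for illustration.

```python
# sdtypes sampled uniformly between the learned min and max
RANGE_SDTYPES = {"numerical", "datetime"}
# sdtypes sampled uniformly from the learned categories
CHOICE_SDTYPES = {"categorical", "boolean"}


def plan_from_metadata(metadata):
    """Pick a strategy per column from its declared sdtype (hypothetical helper)."""
    plan = {}
    for name, info in metadata["columns"].items():
        sdtype = info["sdtype"]
        if sdtype in RANGE_SDTYPES:
            plan[name] = "uniform_range"
        elif sdtype in CHOICE_SDTYPES:
            plan[name] = "uniform_choice"
        else:
            # id, pii, etc.: generate values from scratch, nothing to learn.
            plan[name] = "generate_from_scratch"
    return plan


metadata = {
    "columns": {
        "user_id": {"sdtype": "id"},
        "zip_code": {"sdtype": "categorical"},
        "age": {"sdtype": "numerical"},
        "signup_date": {"sdtype": "datetime"},
    }
}
plan = plan_from_metadata(metadata)
```

Because the plan is built only from the metadata, an integer `zip_code` column is treated as categorical regardless of how it is stored in the dataframe.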

npatki added the bug label Jun 5, 2023

npatki mentioned this issue Jun 5, 2023

The IndependentSynthesizer should follow the sdtypes in the metadata (not the data's dtypes) #249

Open

lajohn4747 mentioned this issue Feb 23, 2024

Metadata sdtypes should be used instead of inferred sdtypes #263

Merged

lajohn4747 closed this as completed in #263 Mar 1, 2024

amontanez24 assigned lajohn4747 May 17, 2024

amontanez24 added this to the 0.7.1 milestone May 17, 2024
