What is expected
The UniformSynthesizer is expected to create data uniformly at random within the observed ranges or categories.
For numerical or datetime data, it should learn the min and max values during fit. Then, during sample, it can create random, uniformly distributed data within that range.
For categorical or boolean data, it should learn the possible categories during fit. Then, during sample, it can randomly select categories with equal probability (i.e. uniformly).
For any other sdtype (such as id, pii, etc.), it can simply use the RegexGenerator or AnonymizedFaker to generate values from scratch (no learning or uniform sampling is expected).
How does the synthesizer know which type is which? It should use the provided metadata as the source of truth.
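To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not the SDV implementation: the metadata layout (a dict of column name to sdtype), the `fit` and `sample` helpers, and the `id_<n>` placeholder that stands in for RegexGenerator / AnonymizedFaker are all assumptions made for illustration.

```python
import random
from datetime import datetime

# Hypothetical metadata layout for illustration: column name -> sdtype.
METADATA = {
    "age": "numerical",
    "signup": "datetime",
    "plan": "categorical",
    "active": "boolean",
    "user_id": "id",
}


def fit(rows, metadata):
    """Learn min/max or the category set per column, driven by the declared sdtype."""
    learned = {}
    for column, sdtype in metadata.items():
        values = [row[column] for row in rows]
        if sdtype in ("numerical", "datetime"):
            learned[column] = (min(values), max(values))
        elif sdtype in ("categorical", "boolean"):
            # dict.fromkeys keeps the unique values in first-seen order.
            learned[column] = list(dict.fromkeys(values))
        else:
            learned[column] = None  # id, pii, etc.: nothing is learned
    return learned


def sample(learned, metadata, num_rows, seed=None):
    """Draw rows uniformly within the learned ranges or categories."""
    rng = random.Random(seed)
    rows = []
    for index in range(num_rows):
        row = {}
        for column, sdtype in metadata.items():
            params = learned[column]
            if sdtype == "numerical":
                low, high = params
                row[column] = rng.uniform(low, high)
            elif sdtype == "datetime":
                low, high = params
                # A timedelta scaled by a uniform fraction in [0, 1).
                row[column] = low + (high - low) * rng.random()
            elif sdtype in ("categorical", "boolean"):
                row[column] = rng.choice(params)  # equal probability per category
            else:
                # Placeholder for RegexGenerator / AnonymizedFaker output.
                row[column] = f"id_{index}"
        rows.append(row)
    return rows


real_rows = [
    {"age": 18, "signup": datetime(2023, 1, 1), "plan": "free", "active": True, "user_id": "a1"},
    {"age": 40, "signup": datetime(2024, 6, 30), "plan": "pro", "active": False, "user_id": "b2"},
]
synthetic_rows = sample(fit(real_rows, METADATA), METADATA, num_rows=50, seed=0)
```

The key point is that every branch is selected by the sdtype declared in the metadata, never by the Python or dataframe type of the values.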
What is actually observed
Rather than using the metadata to determine the sdtypes, the code lets RDT guess them from the dataframe. See this line.
The automatically-detected RDT config is not guaranteed to be correct. For example:
RDT detects all integer columns as numerical, but they may actually be categorical sdtypes or IDs.
RDT detects all string columns as categorical, but they may actually be datetime, PII, or ID types.
Instead of calling detect_initial_config, the synthesizer should parse the metadata and use each column's sdtype to decide what to do.
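The kind of mismatch described above can be illustrated with a small, self-contained sketch. The `guess_sdtype` function below is a hypothetical stand-in for value-based detection, not RDT's actual logic, and the column names and declared sdtypes are made up for the example.

```python
def guess_sdtype(values):
    """Naive value-based guess (hypothetical stand-in for automatic detection)."""
    # bool is a subclass of int in Python, so it must be checked first.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


columns = {
    "zip_code": [10001, 94105, 60601],                      # integers that are categories
    "order_id": [1001, 1002, 1003],                         # integers that are IDs
    "created": ["2024-01-05", "2024-02-11", "2024-03-20"],  # strings that are datetimes
    "email": ["a@example.com", "b@example.com", "c@example.com"],  # strings that are PII
}

# What the user declared in the metadata (assumed layout: column -> sdtype).
declared = {
    "zip_code": "categorical",
    "order_id": "id",
    "created": "datetime",
    "email": "pii",
}

# Collect every column where the guess disagrees with the metadata.
mismatches = {}
for name, values in columns.items():
    guessed = guess_sdtype(values)
    if guessed != declared[name]:
        mismatches[name] = (guessed, declared[name])

for name, (guessed, expected) in mismatches.items():
    print(f"{name}: guessed {guessed!r}, metadata says {expected!r}")
```

In this toy data every column is guessed incorrectly, which is why the sdtype should be read from the metadata rather than inferred from the values.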