One of the major use cases for synthetic tabular data is creating anonymous data for others to work with.
I was wondering whether a sampled DF could, by chance, contain rows similar enough to rows of the original DF that someone or something could be identified.
E.g. for a given DF containing the columns age, education, race, test-score, etc.
Perhaps age, education, race and test-score are the only fields required to identify someone. It is therefore important that the sampled data does not contain a combination of age, education, race and test-score that is also present in the original DF.
I understand that this check can easily be performed outside of CTGAN, but in terms of how CTGAN works, is it possible for CTGAN to generate personally identifiable information by chance during sampling?
We're running similar techniques in live systems now, and there we check the output rows to verify that no rows identical to the original data are present in the synthesized data. The probability of a sample being exactly the same as a row from the original dataset is very small, especially when continuous columns are present. However, for the dataset you are describing (limited number of columns, limited options per column), the probability of these values occurring rises.
Assuming CTGAN fits the original distribution well, the probability that a real data point appears in the synthetic data is very close to the probability of that point being sampled from the joint probability distribution of the original data.
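To put a rough number on this, here is a minimal sketch. It assumes the samples are drawn independently and that `p` is the probability of one specific combination under the fitted joint distribution. The function name and the example value of `p` are hypothetical, not something measured on CTGAN.

```python
def prob_at_least_one_match(p: float, n_samples: int) -> float:
    """Chance that a combination with per-sample probability p
    shows up at least once in n_samples independent draws."""
    # P(no match in n draws) = (1 - p) ** n, so take the complement.
    return 1.0 - (1.0 - p) ** n_samples


# Hypothetical numbers: a combination with probability 0.001,
# sampled 1000 times, appears at least once with probability ~0.63.
chance = prob_at_least_one_match(0.001, 1000)
print(round(chance, 2))
```

With few columns and few options per column, `p` for each real row is relatively large, so even a modest sample size makes exact matches likely.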
If there is a specific subset of columns whose values must not occur together identically, make sure to check those columns to see whether the synthetic rows would release private information.
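A check like this can be sketched with pandas, outside of CTGAN. The helper name `find_leaked_rows`, the column names and the toy data below are assumptions for illustration only. It flags exact matches on the chosen columns and does not catch rows that are merely close.

```python
import pandas as pd


def find_leaked_rows(real: pd.DataFrame, synthetic: pd.DataFrame, key_columns: list) -> pd.DataFrame:
    """Return the synthetic rows whose key_columns combination also occurs in real."""
    # Deduplicate the real key combinations so the merge cannot multiply rows.
    real_keys = real[key_columns].drop_duplicates()
    merged = synthetic.merge(real_keys, on=key_columns, how="left", indicator=True)
    # "_merge" == "both" marks combinations present in both frames.
    return merged[merged["_merge"] == "both"].drop(columns="_merge")


# Toy data (hypothetical), using the columns from the question.
real = pd.DataFrame({
    "age": [34, 51, 27],
    "education": ["BSc", "PhD", "MSc"],
    "race": ["A", "B", "C"],
    "test_score": [88, 92, 75],
})
synthetic = pd.DataFrame({
    "age": [34, 40, 27],
    "education": ["BSc", "PhD", "BSc"],
    "race": ["A", "B", "C"],
    "test_score": [88, 92, 75],
})

keys = ["age", "education", "race", "test_score"]
leaked = find_leaked_rows(real, synthetic, keys)
print(leaked)
```

Rows returned by the check can then be dropped or resampled before the synthetic data is shared. Continuous columns would likely need rounding or a distance threshold instead of exact equality.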