Is synthetic data always anonymous? #34

Closed
oregonpillow opened this issue Mar 11, 2020 · 1 comment
Labels
question General question about the software

Comments

@oregonpillow
Contributor

One of the major use cases for synthetic tabular data is creating anonymous data for others to work with.

I was wondering whether, during sampling, a sampled DF could by chance contain rows similar enough to rows in the original DF that someone or something could be identified.

E.g., for a given DF containing the columns age, education, race, test-score, etc.

Perhaps age, education, race, and test-score are the only fields required to identify someone. It is therefore important that your sampled data does not contain a combination of age, education, race, and test-score that is also present in the original DF.

I understand that this check can easily be performed outside of CTGAN, but in terms of how CTGAN works, is it possible that CTGAN generates personally identifiable information by chance during sampling?

@Baukebrenninkmeijer
Contributor

As far as I understand, yes, this is possible.

We're trying to run similar techniques in live systems now, and there we check the output rows to verify that the synthesized data contains no rows identical to original ones. The probability of a sample being exactly the same as a row from the original dataset is very small, especially when continuous columns are present. However, for the dataset you are describing (a limited number of columns with limited options per column), the probability of such matches occurring rises.
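
A check for identical rows like the one described above could look roughly like the following. This is only a minimal sketch using pandas, not something CTGAN provides: the function name `find_exact_copies` and the toy data are made up for illustration, and it assumes both DataFrames have the same columns with compatible dtypes.

```python
import pandas as pd


def find_exact_copies(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return the synthetic rows that equal some real row in every column.

    Hypothetical helper for illustration; assumes identical column names.
    """
    # Deduplicate the real rows so the left merge keeps exactly one output
    # row per synthetic row, in the original synthetic order.
    real_unique = real.drop_duplicates()
    merged = synthetic.merge(
        real_unique, how="left", on=list(real.columns), indicator=True
    )
    # "_merge" is "both" where a synthetic row also exists in the real data.
    is_copy = (merged["_merge"] == "both").to_numpy()
    return synthetic.loc[is_copy]


# Toy data, invented for this example.
real = pd.DataFrame({
    "age": [25, 40, 31],
    "education": ["BSc", "MSc", "PhD"],
    "race": ["A", "B", "A"],
    "test_score": [88, 92, 75],
})
synthetic = pd.DataFrame({
    "age": [25, 40, 52],
    "education": ["BSc", "PhD", "MSc"],
    "race": ["A", "B", "B"],
    "test_score": [88, 92, 60],
})

copies = find_exact_copies(real, synthetic)
print(copies)
```

Rows returned by such a check can then be dropped or resampled before the synthetic data is released. With continuous columns an exact-equality check will rarely fire, which matches the point above that exact copies are unlikely in that case.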

The probability of a datapoint from the real data appearing in the synthetic data is (assuming that CTGAN can fit the original distribution quite well) very close to the probability of that point being sampled from the joint probability distribution.
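
To make this concrete for discrete columns: if a combination of values has joint probability p and the n synthetic rows are drawn independently, the chance that the combination shows up at least once is 1 - (1 - p)^n. The sketch below uses the empirical frequency in the real data as a crude stand-in for p. That stand-in, the independence assumption, and the function name `reappearance_probability` are all assumptions for illustration, not a description of how CTGAN behaves internally.

```python
import pandas as pd


def reappearance_probability(
    real: pd.DataFrame, columns: list, n_samples: int
) -> pd.Series:
    """Estimate, per observed value combination, the chance it is sampled
    at least once in n_samples independent draws.

    Hypothetical helper; assumes the generator reproduces the empirical
    joint distribution of the given columns exactly.
    """
    # Empirical joint frequency of each observed combination (estimate of p).
    p = real.groupby(columns).size() / len(real)
    # P(at least one hit in n draws) = 1 - P(no hit in any draw).
    return 1.0 - (1.0 - p) ** n_samples


# Toy data, invented for this example.
real = pd.DataFrame({
    "education": ["BSc", "BSc", "MSc", "PhD"],
    "race": ["A", "A", "B", "A"],
})

probs = reappearance_probability(real, ["education", "race"], n_samples=2)
print(probs)
```

With few columns and few options per column, p is large for every combination, so this probability approaches 1 quickly as the number of sampled rows grows.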

If there is a specific subset of columns whose values must not reappear together as they do in the original data, make sure to check those columns for matches, since such matches could release private information.
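
A check restricted to such a subset of columns might be sketched as follows. The helper name `flag_subset_matches` and the example data are hypothetical, and the sketch assumes the chosen columns contain no missing values (NaN entries would not compare equal here).

```python
import pandas as pd


def flag_subset_matches(
    real: pd.DataFrame, synthetic: pd.DataFrame, columns: list
) -> pd.Series:
    """Return a boolean Series marking synthetic rows whose values in
    `columns` also occur together in some real row.

    Hypothetical helper for illustration; assumes no missing values.
    """
    # Collect every value combination seen in the real data as plain tuples.
    real_keys = set(real[columns].itertuples(index=False, name=None))
    synth_keys = synthetic[columns].itertuples(index=False, name=None)
    return pd.Series(
        [key in real_keys for key in synth_keys], index=synthetic.index
    )


# Toy data, invented for this example.
real = pd.DataFrame({
    "age": [25, 40, 31],
    "education": ["BSc", "MSc", "PhD"],
    "race": ["A", "B", "A"],
    "test_score": [88, 92, 75],
})
synthetic = pd.DataFrame({
    "age": [25, 40, 52],
    "education": ["BSc", "PhD", "MSc"],
    "race": ["A", "B", "B"],
    "test_score": [70, 92, 60],
})

# Only the columns considered identifying are compared.
mask = flag_subset_matches(real, synthetic, ["age", "education"])
safe_synthetic = synthetic.loc[~mask]
print(safe_synthetic)
```

Note that a row can pass a full-row comparison and still be flagged here: the first synthetic row differs from every real row in test_score, but its age and education combination exists in the real data.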

@csala csala added the question General question about the software label Jun 22, 2020
@leix28 leix28 closed this as completed Jun 28, 2020