Is synthetic data always anonymous? #34

oregonpillow · 2020-03-11T16:36:20Z

One of the major use cases for synthetic tabular data is creating anonymous data for others to work with.

I was wondering whether a sampled DF could, by chance, contain rows that are similar enough to rows in the original DF to allow someone or something to be identified.

E.g., take a DF containing the columns age, education, race, test-score, etc.

Perhaps age, education, race and test-score are the only fields required to identify someone. It is therefore important that the sampled data does not contain a combination of age, education, race and test-score that is also present in the original DF.

I understand that this check can easily be performed outside of CTGAN, but in terms of how CTGAN works, is it possible for CTGAN to generate personally identifiable information by chance during sampling?


Baukebrenninkmeijer · 2020-03-25T15:04:04Z

As far as I understand, yes, this is possible.

We're currently trying to run similar techniques in live systems, and there we check the output rows to verify that the synthesized data contains no rows identical to rows in the original data. The probability of a sample being exactly the same as a row from the original dataset is very small, especially when continuous columns are present. However, for the kind of dataset you describe (a limited number of columns, with limited options per column), the probability of such matches rises.
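A check for identical rows like the one described above could be sketched with pandas as follows. This is only a minimal illustration, not CTGAN functionality: the helper name `find_exact_copies` and the example frames are hypothetical, and it assumes the real and synthetic DataFrames share the same columns and dtypes.

```python
import pandas as pd


def find_exact_copies(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Return the synthetic rows that are identical to at least one real row."""
    columns = list(real.columns)
    # De-duplicate the real rows so each synthetic row is matched at most once.
    real_unique = real[columns].drop_duplicates()
    # An inner merge on every column keeps only rows present in both frames.
    return synthetic[columns].merge(real_unique, on=columns, how="inner")


# Hypothetical toy data, for illustration only.
real = pd.DataFrame({"age": [34, 51, 27], "education": ["BSc", "PhD", "MSc"]})
synthetic = pd.DataFrame({"age": [34, 40, 27], "education": ["BSc", "PhD", "BSc"]})

copies = find_exact_copies(real, synthetic)
print(copies)
```

Rows returned by the helper can then be dropped or resampled before the synthetic data is released. Exact equality on continuous columns will rarely trigger, so this mainly matters for datasets made up of discrete columns.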

Assuming that CTGAN fits the original distribution quite well, the probability of a real data point appearing in the synthetic data is very close to the probability of that point being sampled from the joint probability distribution of the original data.
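To make this concrete, here is a rough back-of-the-envelope formula. It assumes discrete columns, that the n synthetic rows are drawn independently, and that the generator's distribution q is close to the real joint distribution p; the symbols n, p, q and x are introduced here only for illustration.

```latex
P(\text{real row } x \text{ appears at least once in } n \text{ synthetic rows})
  = 1 - \bigl(1 - q(x)\bigr)^{n}
  \approx 1 - \bigl(1 - p(x)\bigr)^{n}
  \approx n \, p(x) \quad \text{when } n \, p(x) \ll 1
```

With few columns and few options per column, p(x) is comparatively large for common combinations, so matches become likely as n grows. With continuous columns, the probability of an exact match on any single row is effectively zero.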

If there is a specific subset of columns whose values must not occur together exactly as they do in the original data, make sure to check those columns explicitly to see whether the synthetic data would release private information.
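A check restricted to such a subset of columns might look like the sketch below. The helper name `flag_shared_combinations`, the column names and the toy data are all hypothetical; the idea is simply to count, for each synthetic row, how many real rows share its combination of the sensitive columns.

```python
import pandas as pd


def flag_shared_combinations(real, synthetic, quasi_identifiers):
    """Add a real_count column: how many real rows share each synthetic row's combination."""
    # Count how many real rows carry each combination of the chosen columns.
    # A count of 1 means the combination points at a single real record.
    real_counts = (
        real.groupby(quasi_identifiers).size().reset_index(name="real_count")
    )
    # A left merge keeps every synthetic row, in its original order.
    flagged = synthetic.merge(real_counts, on=quasi_identifiers, how="left")
    # Combinations that never occur in the real data get a count of 0.
    flagged["real_count"] = flagged["real_count"].fillna(0).astype(int)
    return flagged


# Hypothetical toy data, for illustration only.
real = pd.DataFrame({
    "age": [34, 51, 27, 27],
    "education": ["BSc", "PhD", "MSc", "MSc"],
    "test_score": [71.5, 88.0, 64.2, 90.1],
})
synthetic = pd.DataFrame({
    "age": [34, 27, 40],
    "education": ["BSc", "MSc", "PhD"],
    "test_score": [70.9, 77.3, 85.0],
})

flagged = flag_shared_combinations(real, synthetic, ["age", "education"])
print(flagged)
```

Synthetic rows with a non-zero `real_count` reproduce a combination found in the original data. Whether that actually releases private information depends on the dataset, so the threshold for dropping or resampling such rows is a judgment call.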

csala added the "question" label Jun 22, 2020

leix28 closed this as completed Jun 28, 2020
