Do not include the original real data in the trained model .pkl file #1156

npatki · 2022-12-23T19:18:17Z

A further consideration would be to not include the original real data in the trained model .pkl file. If a user only needs to share the final synthetic data, for example as a .csv, this is not a problem. But if they want to share the trained model .pkl file so that another user can generate as much synthetic data as they need, the original real PII data is accessible from the .pkl, which is a potential problem.

Here is an example that reproduces the issue:

import cloudpickle
import pandas as pd

from sdv.tabular import CTGAN

# create dummy data
real_data = pd.DataFrame(
    data={"real_name": ["Peter", "John", "Mary", "Susan"]}
)

anonymize_fields = {
    "real_name": "name",
}

print(f"Raw data: {real_data.shape[0]} rows, {real_data.shape[1]} cols")

model = CTGAN(
    epochs=1,
    verbose=True,
    anonymize_fields=anonymize_fields,
)
model.fit(real_data)

with open("file.pkl", "wb") as output:
    cloudpickle.dump(model, output)

# delete the model to be sure it is not accessed
del model

# load back the model and inspect ANONYMIZATION_MAPPINGS
with open("file.pkl", "rb") as saved_file:
    model_saved = cloudpickle.load(saved_file)

print(model_saved._metadata._ANONYMIZATION_MAPPINGS)

which outputs:

Raw data: 4 rows, 1 cols
Epoch 1, Loss G:  1.5588,Loss D: -0.0033
{1636409547408: {'real_name': {'Peter': 'Jessica Reynolds', 'John': 'Jill Graham', 'Mary': 'Eric Williamson', 'Susan': 'Jordan Davis'}}}

The saved mapping still contains the original names ["Peter", "John", "Mary", "Susan"].
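A quick way to check for this kind of leak before sharing a file is to search the serialized bytes for values known to be sensitive. The following is only a minimal sketch using the standard-library `pickle`: `find_leaked_values` is a hypothetical helper, and the two dictionaries are stand-ins for a fitted model's state, not SDV objects.

```python
import pickle


def find_leaked_values(obj, sensitive_values):
    """Return the sensitive strings whose UTF-8 bytes appear in the pickled object."""
    payload = pickle.dumps(obj)
    return [value for value in sensitive_values if value.encode("utf-8") in payload]


real_names = ["Peter", "John", "Mary", "Susan"]

# Hypothetical stand-in for a model state that kept a real-to-fake mapping.
leaky_state = {
    "mappings": {"real_name": {"Peter": "Jessica Reynolds", "John": "Jill Graham"}}
}

# Hypothetical stand-in for a model state that kept only fake values.
clean_state = {"fake_names": ["Jessica Reynolds", "Jill Graham"]}

leaked = find_leaked_values(leaky_state, real_names)
clean = find_leaked_values(clean_state, real_names)

print(leaked)  # names found in the leaky state
print(clean)   # names found in the clean state
```

This is a plain substring search, so it can report false positives (a real value that happens to occur inside an unrelated string) and it will miss values that are compressed or otherwise encoded before being written. A hit should be treated as a reason to inspect the file, and an empty result is not proof that the file is clean.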

Originally posted by @PJPRoche in #439 (comment)


npatki · 2023-05-31T23:58:24Z

With the new SDV 1.0, this issue should be resolved.

For PII columns, we use the AnonymizedFaker transformer from RDT by default. This transformer removes the column and does not store it in memory. See line of code.

Note that the PseudoAnonymizedFaker will save the real values in memory because it needs them to perform a mapping. Pseudo-anonymization is optional and is not applied by default.
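The difference between the two approaches can be illustrated without SDV or RDT. The sketch below is not RDT's implementation: `fit_anonymized` and `fit_pseudo_anonymized` are hypothetical functions that return only the state each strategy would need to keep. It shows why a consistent real-to-fake mapping has to retain the real values, while plain anonymization does not.

```python
import pickle


def fit_anonymized(values, fake_pool):
    """Anonymization: keep only what is needed to emit fake values later."""
    # The real values are counted and then discarded.
    return {"num_rows": len(values), "fake_pool": list(fake_pool)}


def fit_pseudo_anonymized(values, fake_pool):
    """Pseudo-anonymization: each real value always maps to the same fake value."""
    unique_values = list(dict.fromkeys(values))  # de-duplicate, keep order
    if len(fake_pool) < len(unique_values):
        raise ValueError("need at least one fake value per unique real value")
    # The lookup table is keyed by the real values, so they stay in the state.
    return {"mapping": dict(zip(unique_values, fake_pool))}


real_names = ["Peter", "John", "Mary", "Susan"]
fake_names = ["Jessica Reynolds", "Jill Graham", "Eric Williamson", "Jordan Davis"]

anonymized_state = fit_anonymized(real_names, fake_names)
pseudo_state = fit_pseudo_anonymized(real_names, fake_names)

# Check whether a real name survives serialization of each state.
anonymized_has_real = b"Peter" in pickle.dumps(anonymized_state)
pseudo_has_real = b"Peter" in pickle.dumps(pseudo_state)

print(anonymized_has_real, pseudo_has_real)
```

In this toy version, only the pseudo-anonymized state carries the real names into the pickle, which matches the behavior described above for the two transformers.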

npatki added the "feature request" label on Dec 23, 2022

npatki closed this as completed May 31, 2023

npatki added the "resolution:resolved" label on May 31, 2023
