Do not include the original real data in the trained model .pkl file #1156

Closed
npatki opened this issue Dec 23, 2022 · 1 comment
Labels
feature request Request for a new feature resolution:resolved The issue was fixed, the question was answered, etc.

npatki commented Dec 23, 2022

A further consideration would be to exclude the original real data from the trained model .pkl file. If a user only needs to share the final synthetic data, for example as a .csv, this is not a problem. But if they want to give a trained model .pkl file to another user, so that user can generate as much synthetic data as they like, then it is a potential problem that the original real PII data is accessible from the .pkl.

Here is an example that reproduces the problem:

import cloudpickle
import pandas as pd

from sdv.tabular import CTGAN

# create dummy data
real_data = pd.DataFrame(
    data={"real_name": ["Peter", "John", "Mary", "Susan"]}
)

anonymize_fields = {
    "real_name": "name",
}

print(f"Raw data: {real_data.shape[0]} rows, {real_data.shape[1]} cols")

model = CTGAN(
    epochs=1,
    verbose=True,
    anonymize_fields=anonymize_fields,
)
model.fit(real_data)

with open("file.pkl", "wb") as output:
    cloudpickle.dump(model, output)

# delete the model to be sure it is not accessed
del model

# load the model back and inspect _ANONYMIZATION_MAPPINGS
# (the handle is named saved_file to avoid shadowing the built-in input())
with open("file.pkl", "rb") as saved_file:
    model_saved = cloudpickle.load(saved_file)

print(model_saved._metadata._ANONYMIZATION_MAPPINGS)

which outputs:

Raw data: 4 rows, 1 cols
Epoch 1, Loss G:  1.5588,Loss D: -0.0033
{1636409547408: {'real_name': {'Peter': 'Jessica Reynolds', 'John': 'Jill Graham', 'Mary': 'Eric Williamson', 'Susan': 'Jordan Davis'}}}

containing the original names ["Peter", "John", "Mary", "Susan"]
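
One way to check a saved model for this kind of leak without knowing its internals is to scan the raw pickle bytes for the known real values. The following is a minimal sketch using only the standard library; `find_leaked_values` and `toy_state` are hypothetical names introduced for illustration and are not part of SDV. It only catches strings that are stored verbatim as UTF-8, so an empty result does not prove the file is clean, and short values can match as substrings of unrelated text.

```python
import pickle


def find_leaked_values(pickled_bytes, sensitive_values):
    """Return the sensitive strings whose UTF-8 bytes appear verbatim in the pickle."""
    return [v for v in sensitive_values if v.encode("utf-8") in pickled_bytes]


real_names = ["Peter", "John", "Mary", "Susan"]

# Hypothetical stand-in for a fitted model's state: a real-to-fake mapping
# shaped like the one printed above (only two of the four names kept here).
toy_state = {"real_name": {"Peter": "Jessica Reynolds", "John": "Jill Graham"}}

leaked = find_leaked_values(pickle.dumps(toy_state), real_names)
print(leaked)  # ['Peter', 'John']
```

For a file on disk, the same helper can be given the bytes read from the file opened in "rb" mode.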

Originally posted by @PJPRoche in #439 (comment)

npatki commented May 31, 2023

With the new SDV 1.0, this issue should be resolved.

For PII columns, we now use the AnonymizedFaker transformer from RDT by default. This transformer removes the column and does not store it in memory. See line of code.

Note that the PseudoAnonymizedFaker does save the real values in memory, because it needs them to perform a mapping. Pseudo-anonymization is optional and is not applied by default.
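
The difference between the two approaches can be illustrated with a simplified sketch. This is not RDT's actual code: `anonymize`, `pseudo_anonymize`, `FAKE_NAMES`, and the state dictionaries are hypothetical stand-ins, assuming that an anonymizing transformer only needs to remember how many rows it saw, while a pseudo-anonymizing one must keep a real-to-fake mapping. Pickling each state and loading it back shows what a recipient of the file could read.

```python
import pickle
import random

# Hypothetical pool of fake values; a real transformer would generate these.
FAKE_NAMES = ["Jessica Reynolds", "Jill Graham", "Eric Williamson", "Jordan Davis"]


def anonymize(values, rng):
    """Replace every value with a random fake and keep no record of the input."""
    fakes = [rng.choice(FAKE_NAMES) for _ in values]
    state = {"n_rows": len(values)}  # only the row count is retained
    return fakes, state


def pseudo_anonymize(values, rng):
    """Map each distinct real value to one fixed fake, so the mapping must be kept."""
    distinct = sorted(set(values))
    # Assumes there are at least as many fakes as distinct real values.
    mapping = dict(zip(distinct, rng.sample(FAKE_NAMES, k=len(distinct))))
    return [mapping[v] for v in values], {"mapping": mapping}


real_names = ["Peter", "John", "Mary", "Susan"]
rng = random.Random(0)
_, anon_state = anonymize(real_names, rng)
_, pseudo_state = pseudo_anonymize(real_names, rng)

# Round-trip each state through pickle, as if it were shared in a .pkl file.
anon_restored = pickle.loads(pickle.dumps(anon_state))
pseudo_restored = pickle.loads(pickle.dumps(pseudo_state))

print(anon_restored)                       # contains only the row count
print(sorted(pseudo_restored["mapping"]))  # the real names come back as keys
```

In this sketch the anonymized state carries nothing that identifies the original rows, while the pseudo-anonymized state necessarily exposes every real value as a key of its mapping.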

npatki closed this as completed May 31, 2023