Unable to sample when using a PseudoAnonymizedFaker #1207

Closed
npatki opened this issue Jan 27, 2023 · 0 comments

npatki commented Jan 27, 2023

Environment Details

  • SDV version: 1.0.0 (in progress)
  • Python version: 3.8
  • Operating System: Linux (Colab Notebook)

Error Description

The new version of SDV allows me to change and update RDT transformers to my liking. For PII columns, I'd like to use pseudo-anonymization instead of full anonymization.

Whenever I use the PseudoAnonymizedFaker, the synthesizer crashes with a KeyError when I try to sample.

Steps to reproduce

(Note that due to #1206, we have to update the metadata for the address column.)

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.pii import PseudoAnonymizedFaker

data, metadata = download_demo(
    modality='single_table',
    dataset_name='student_placements_pii')

# due to issue #1206, we need to update the metadata
metadata.update_column(
    column_name='address',
    sdtype='address',
    pii=True
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)

# update address to use pseudo-anonymization
address_transformer = PseudoAnonymizedFaker(provider_name='address', function_name='address')
synthesizer.update_transformers(column_name_to_transformer={
    'address': address_transformer
})

synthesizer.fit(data)
synthesizer.sample(1)

Stack Trace

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py](https://localhost:8080/#) in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

KeyError: 'address'
The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
[/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py](https://localhost:8080/#) in _sample_with_progress_bar(self, num_rows, max_tries_per_batch, batch_size, output_file_path, conditions, show_progress_bar)
    713                 progress_bar.set_description('Sampling rows')
--> 714                 sampled = self._sample_in_batches(
    715                     num_rows=num_rows,

[/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py](https://localhost:8080/#) in _sample_in_batches(self, num_rows, batch_size, max_tries_per_batch, conditions, transformed_conditions, float_rtol, progress_bar, output_file_path)
    638         for step in range(math.ceil(num_rows / batch_size)):
--> 639             sampled_rows = self._sample_batch(
    640                 batch_size=batch_size,

[/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py](https://localhost:8080/#) in _sample_batch(self, batch_size, max_tries, conditions, transformed_conditions, float_rtol, progress_bar, output_file_path)
    571             prev_num_valid = num_valid
--> 572             sampled, num_valid = self._sample_rows(
    573                 num_rows_to_sample,

[/usr/local/lib/python3.8/dist-packages/sdv/single_table/base.py](https://localhost:8080/#) in _sample_rows(self, num_rows, conditions, transformed_conditions, float_rtol, previous_rows)
    496 
--> 497             sampled = self._data_processor.reverse_transform(sampled)
    498 

[/usr/local/lib/python3.8/dist-packages/sdv/data_processing/data_processor.py](https://localhost:8080/#) in reverse_transform(self, data, reset_keys)
    672             elif column_name in self._keys:
--> 673                 column_data = generated_keys[column_name]
    674             else:

[/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py](https://localhost:8080/#) in __getitem__(self, key)
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):

[/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py](https://localhost:8080/#) in get_loc(self, key, method, tolerance)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 

KeyError: 'address'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
[<ipython-input-36-3c01bb2cc760>](https://localhost:8080/#) in <module>
----> 1 synthesizer.sample(1)
 

[/usr/local/lib/python3.8/dist-packages/sdv/single_table/utils.py](https://localhost:8080/#) in handle_sampling_error(is_tmp_file, output_file_path, sampling_error)
     78 
     79     if error_msg:
---> 80         raise type(sampling_error)(error_msg + '\n' + str(sampling_error))
     81 
     82     raise sampling_error

KeyError: "Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.\n'address'"
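
For anyone reading the final message above: a minimal sketch of how the last two frames in the trace combine, assuming a plain dict in place of the `generated_keys` DataFrame. The function names `handle_sampling_error_sketch` and `reproduce` are hypothetical; only the `raise type(sampling_error)(...)` line is taken from the trace. It illustrates why the message ends in a literal `\n'address'`: `str()` of a `KeyError` is the repr of its argument, so the quotes and the escaped newline are kept.

```python
def handle_sampling_error_sketch(error_msg, sampling_error):
    """Re-raise the error as the same type with a prefixed message.

    Mirrors lines 79-82 of sdv/single_table/utils.py as shown in the trace;
    everything else in this sketch is an assumption for illustration.
    """
    if error_msg:
        raise type(sampling_error)(error_msg + '\n' + str(sampling_error))

    raise sampling_error


def reproduce():
    """Trigger a KeyError for 'address' and wrap it like the trace does."""
    # Assumption: a plain dict stands in for the generated-keys DataFrame,
    # and it has no 'address' entry, as in the failing lookup.
    generated_keys = {}
    try:
        generated_keys['address']
    except KeyError as original:
        try:
            handle_sampling_error_sketch('Error: Sampling terminated.', original)
        except KeyError as wrapped:
            return original, wrapped


original, wrapped = reproduce()
print(str(original))  # the column name, wrapped in quotes
print(str(wrapped))   # the prefixed message, shown as a repr with an escaped newline
```

The underlying problem is the first `KeyError` (the `address` column missing from the generated keys), not the temporary-file message that wraps it.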
@npatki npatki added the bug Something isn't working label Jan 27, 2023
@npatki npatki added this to the 1.0.0 milestone Jan 27, 2023
@fealho fealho changed the title Unable to sample when using a PsuedoAnonymizedFaker Unable to sample when using a PseudoAnonymizedFaker Feb 6, 2023