
Add DisclosureProtectionEstimate metric #676

@npatki

Problem Description

The CategoricalCAP metric is currently available in SDMetrics. It is based on a well-researched and cited methodology for measuring the risk of data disclosure. It acts as a great measure of privacy for particular sensitive columns.

However, a number of issues make this metric hard for users to apply and interpret. In #675, we are addressing many of the quality-related issues via a new metric called DisclosureProtection. However, performance issues still remain.

In this issue, we will address performance issues in DisclosureProtection by creating a new metric called DisclosureProtectionEstimate.

Expected behavior

The DisclosureProtectionEstimate metric should wrap around the DisclosureProtection metric. It should estimate the original metric's score by subsampling the data, iterating over many subsets, and returning the average score (see the sketch after the parameter list below).

Parameters: All the same parameters as DisclosureProtection, plus:

  • num_rows_subsample: An integer describing the number of rows to subsample in each of the real and synthetic datasets. This subsampling occurs with replacement during every iteration (that is to say, every iteration should start subsampling from the same original dataset)
    • (default) 1000: Subsample 1000 rows in both the real and synthetic data
    • <int>: Subsample the number of rows provided
    • None: Do not subsample
  • num_iterations: The number of iterations to run over different subsamples. The final score will be the average across iterations
    • (default) 10: Do 10 iterations of the subsamples.
    • <int>: Perform the number of iterations provided
  • verbose: A boolean describing whether to show the progress
    • (default) True: Print the steps being run and the progress bar
    • False: Do not print anything out
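
To make the subsampling behavior concrete, here is a minimal sketch; the _subsample helper is hypothetical and not part of the proposed API:

import pandas as pd

def _subsample(data: pd.DataFrame, num_rows_subsample):
    # Hypothetical helper: None means the full dataset is used as-is.
    if num_rows_subsample is None:
        return data

    # Each call draws from the same original DataFrame, with replacement,
    # so every iteration is an independent subsample.
    return data.sample(n=num_rows_subsample, replace=True)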

Computation:

  • The baseline_protection score computation is exactly the same as in DisclosureProtection, and it only needs to be computed once.
  • The cap_protection score will instead become a cap_protection_estimate, computed by running through the desired number of iterations and averaging the results (as sketched below).
  • The final score will then be score = min(cap_protection_estimate / baseline_protection, 1)
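
For illustration, the overall computation could be assembled along these lines. This is a minimal sketch, assuming the hypothetical _subsample helper above and that DisclosureProtection.compute_breakdown reports cap_protection and baseline_protection keys as described in #675:

import numpy as np

from sdmetrics.single_table import DisclosureProtection

def _estimate_breakdown(real_data, synthetic_data, num_rows_subsample=1000,
                        num_iterations=10, **metric_kwargs):
    # baseline_protection only needs to be computed once. (In practice it
    # could be derived directly, without running the full CAP attack here.)
    baseline_protection = DisclosureProtection.compute_breakdown(
        real_data=real_data,
        synthetic_data=synthetic_data,
        **metric_kwargs,
    )['baseline_protection']

    # Average cap_protection across independent subsample iterations.
    cap_scores = []
    for _ in range(num_iterations):
        breakdown = DisclosureProtection.compute_breakdown(
            real_data=_subsample(real_data, num_rows_subsample),
            synthetic_data=_subsample(synthetic_data, num_rows_subsample),
            **metric_kwargs,
        )
        cap_scores.append(breakdown['cap_protection'])

    cap_protection_estimate = float(np.mean(cap_scores))
    score = min(cap_protection_estimate / baseline_protection, 1.0)
    return {
        'score': score,
        'cap_protection_estimate': cap_protection_estimate,
        'baseline_protection': baseline_protection,
    }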

Compute breakdown:

from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    columns_to_discretize=['age'],
)

{
    'score': 0.912731436159061,
    'cap_protection_estimate': 0.782341231,
    'baseline_protection': 0.85714285715
}

Verbosity/Progress Bar:
If verbose is enabled, the metric should show a progress bar that increments per iteration. The progress bar should be updated with the overall score so far (the running average), rounded to 4 decimal places.

>>> DisclosureProtectionEstimate.compute(real_data, synthetic_data, verbose=True)
Estimating Disclosure Protection (Score=0.8744): 100%|██████████| 10/10 [00:34<00:00,  3.42s/it]
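
One way the verbose loop could be wired up with tqdm is sketched below; _run_iteration is a hypothetical stand-in for one subsample-and-score pass, and baseline_protection is assumed to be precomputed as described above:

from tqdm import tqdm

iterator = range(num_iterations)
if verbose:
    iterator = tqdm(iterator)

cap_scores = []
for _ in iterator:
    cap_scores.append(_run_iteration())  # hypothetical per-iteration score
    running_average = sum(cap_scores) / len(cap_scores)
    running_score = min(running_average / baseline_protection, 1.0)
    if verbose:
        # Refresh the bar's label with the running score, rounded to
        # 4 decimal places.
        iterator.set_description(
            f'Estimating Disclosure Protection (Score={round(running_score, 4)})')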
